
Incremental Processing

Incremental Processing is a data pipeline pattern that processes only new or changed data since the last execution, rather than reprocessing the entire dataset.

Incremental processing dramatically reduces the cost and runtime of data pipelines: instead of reloading millions of historical customer records daily, a pipeline processes only new customers and updates to existing records. The pattern relies on tracked state: the last run time, the last ID processed, or an explicit change marker such as an updated_at column. Tools record these watermarks and pull only data beyond the previous watermark. Incremental processing is essential for large-scale analytics: reprocessing terabytes daily is economically infeasible, but processing gigabytes of daily changes is practical.
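The watermark mechanism can be illustrated with a minimal sketch. This example assumes a SQLite source table with an updated_at column and a one-row watermark table; all table and column names are illustrative, not taken from any specific tool:

```python
import sqlite3

# Illustrative schema: a source table with updated_at, plus a one-row
# watermark table that records how far the last run got.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT);
    CREATE TABLE watermark (last_updated_at TEXT);
    INSERT INTO watermark VALUES ('2024-01-01T00:00:00');
    INSERT INTO customers VALUES
        (1, 'old record',   '2023-12-15T09:00:00'),
        (2, 'new record',   '2024-01-02T10:30:00'),
        (3, 'newer record', '2024-01-03T08:45:00');
""")

def incremental_batch(conn):
    """Pull only rows beyond the stored watermark, then advance it."""
    (wm,) = conn.execute("SELECT last_updated_at FROM watermark").fetchone()
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (wm,),
    ).fetchall()
    if rows:
        # Advance the watermark to the max updated_at just processed.
        conn.execute("UPDATE watermark SET last_updated_at = ?", (rows[-1][2],))
    return rows

batch = incremental_batch(conn)   # picks up ids 2 and 3 only
rerun = incremental_batch(conn)   # nothing new, so an empty batch
```

The second call returns nothing because the watermark has advanced past all existing rows, which is exactly the property that makes daily runs cheap.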

Incremental processing introduces complexity: ensuring no changes are missed (handling late-arriving data), handling deletes correctly (marking records as deleted rather than removing them), and recovering from failures (resuming from the last watermark). Technologies like Change Data Capture simplify incremental processing by reliably identifying changes, and dbt and other frameworks now include incremental materialization patterns that abstract much of this complexity.
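Two of these complications can be sketched together, under assumed schemas: a lookback window re-reads a margin of rows just below the watermark so slightly late-arriving updates are not missed, and deletes arrive as soft-delete flags rather than physical removals. The names and window size here are illustrative:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source (id INTEGER, status TEXT, is_deleted INTEGER, updated_at TEXT);
    CREATE TABLE target (id INTEGER PRIMARY KEY, status TEXT, is_deleted INTEGER);
    INSERT INTO source VALUES
        (1, 'active',    0, '2024-01-05T23:59:00'),  -- arrived late, just under the watermark
        (2, 'cancelled', 1, '2024-01-06T08:00:00');  -- a delete shipped as a flag
""")

WATERMARK = "2024-01-06T00:00:00"
LOOKBACK = timedelta(hours=1)  # re-scan margin for late-arriving rows

# Read from (watermark - lookback) instead of the watermark itself.
cutoff = (datetime.fromisoformat(WATERMARK) - LOOKBACK).isoformat()
changes = conn.execute(
    "SELECT id, status, is_deleted FROM source WHERE updated_at > ?",
    (cutoff,),
).fetchall()

# An upsert keeps the lookback re-reads idempotent; deleted records stay
# visible in the target as flags instead of disappearing.
conn.executemany(
    "INSERT INTO target (id, status, is_deleted) VALUES (?, ?, ?) "
    "ON CONFLICT(id) DO UPDATE SET status = excluded.status, "
    "is_deleted = excluded.is_deleted",
    changes,
)
```

Because re-read rows overwrite themselves, the lookback can be generous without corrupting the target, which is what makes failure recovery from a watermark safe.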

Trade-offs exist: incremental processing is more complex to implement and debug than a full refresh, but the cost savings at scale are substantial. Organizations typically start with full refresh for simplicity, then switch to incremental processing as data volumes grow.

Key Characteristics

  • Processes only new or changed data since last execution
  • Tracks watermarks (timestamps, IDs, sequences)
  • Handles late-arriving changes and deletes
  • Enables fast, cost-efficient processing of large datasets
  • Requires mechanisms to recover from failures
  • May maintain state across multiple runs

Why It Matters

  • Reduces processing cost by eliminating unnecessary recomputation
  • Improves freshness by enabling frequent updates
  • Reduces storage costs by not materializing unnecessary intermediate data
  • Enables scaling to terabyte-scale datasets with practical budgets
  • Improves query performance by processing smaller batches frequently
  • Enables real-time or near-real-time data availability

Example

A subscription SaaS company processes revenue incrementally. Daily incremental jobs process only new subscriptions, cancellations, and plan changes using updated_at timestamps, while a weekly full refresh recalculates all metrics from scratch as a historical accuracy check. The daily job runs in five minutes on a small cluster; the weekly full refresh runs in 30 minutes on a larger cluster. This hybrid approach combines the efficiency of incremental runs with the reliability of periodic full validation.
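The hybrid schedule described above amounts to choosing a run mode by calendar. A toy sketch, with an illustrative function name and a Sunday full refresh assumed:

```python
from datetime import date

def choose_mode(run_date: date, full_refresh_weekday: int = 6) -> str:
    """Return 'full_refresh' on the designated weekday (6 = Sunday),
    otherwise 'incremental' for the cheap daily run."""
    if run_date.weekday() == full_refresh_weekday:
        return "full_refresh"
    return "incremental"

choose_mode(date(2024, 1, 7))  # a Sunday -> 'full_refresh'
choose_mode(date(2024, 1, 8))  # a Monday -> 'incremental'
```

In practice the same transformation code runs in both modes; only the filter on the source data (everything vs. changes since the watermark) differs.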

Coginiti Perspective

CoginitiScript has first-class support for incremental processing through its publication system. Blocks declare incremental strategies (append, merge, or merge_conditionally) in metadata, and the publication.Incremental() function lets the same code handle both incremental and full-refresh modes. The publication.Target() function references the existing target table for watermark queries, eliminating the need for external state tracking. A fullRefresh parameter on publication.Run() overrides incremental behavior when a complete rebuild is needed.
