Glossary/Data Integration & Transformation

Change Data Capture (CDC)

Change Data Capture is a technique that identifies and captures new, updated, and deleted records from source systems, enabling efficient incremental data movement instead of full refreshes.

Change Data Capture tracks what changed in a source system since the last pipeline run and extracts only those changes, rather than reloading the entire dataset. CDC uses database logs (write-ahead logs), timestamps (updated_at columns), or query-based approaches (SELECT WHERE updated_at > last_run). CDC is especially valuable for large tables where a full refresh is expensive and impractical. In a table with 100 million customer records, perhaps only 100,000 records change daily; CDC captures only those changes, reducing extraction time and bandwidth.
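The query-based approach can be sketched in a few lines. This is a minimal illustration using an in-memory SQLite table; the table name, columns, and watermark value are all hypothetical, not part of any specific tool.

```python
import sqlite3

# In-memory source table standing in for an operational database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "2024-01-01T00:00:00"),
     (2, "Bo",  "2024-01-02T09:30:00"),
     (3, "Cy",  "2024-01-03T12:00:00")],
)

def extract_changes(conn, last_run):
    """Query-based CDC: pull only rows modified since the last watermark."""
    cur = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_run,),
    )
    return cur.fetchall()

# Only rows 2 and 3 changed since the previous pipeline run.
changes = extract_changes(conn, "2024-01-01T12:00:00")
```

The watermark (`last_run`) would normally be persisted by the pipeline between runs; the simplicity of this pattern is also its weakness, since rows updated without touching `updated_at`, or hard-deleted rows, are invisible to it.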

CDC technology matured with tools and managed services (Debezium, Fivetran) that handle the complexity of reading database logs. Organizations initially used timestamp-based CDC (simple but imperfect) and evolved toward log-based CDC (more reliable but more complex). Cloud platforms now provide native CDC through services like AWS Database Migration Service.

CDC enables real-time and near-real-time data movement by triggering downstream pipelines immediately when data changes occur. This supports operational analytics (dashboards updated seconds after transactions) and reduces freshness latency. The trade-off is complexity: CDC systems must handle out-of-order changes and deletions, and must guarantee that no changes are missed.
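The out-of-order and deletion concerns can be sketched concretely. In the minimal example below, each change event carries a log sequence number (LSN) so stale events can be discarded, and delete events remove the row from the target. The field names (`key`, `lsn`, `op`, `row`) are illustrative, not any specific tool's schema.

```python
target = {}          # primary key -> current row in the target store
applied_lsn = {}     # primary key -> highest log sequence number applied

def apply_event(event):
    """Apply one CDC event, tolerating out-of-order delivery and deletes."""
    key, lsn = event["key"], event["lsn"]
    if lsn <= applied_lsn.get(key, -1):
        return  # stale or duplicate: a newer change for this key already landed
    applied_lsn[key] = lsn
    if event["op"] == "delete":
        target.pop(key, None)       # honor deletes (e.g. erasure requests)
    else:
        target[key] = event["row"]  # insert or update

events = [
    {"key": 1, "lsn": 2, "op": "upsert", "row": {"name": "Ada (new)"}},
    {"key": 1, "lsn": 1, "op": "upsert", "row": {"name": "Ada (old)"}},  # late arrival
    {"key": 2, "lsn": 3, "op": "upsert", "row": {"name": "Bo"}},
    {"key": 2, "lsn": 4, "op": "delete"},
]
for e in events:
    apply_event(e)
```

After processing, key 1 holds the newer row despite the late-arriving older event, and key 2 has been deleted. Real CDC systems apply the same idea using the source database's own log positions as the ordering key.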

Key Characteristics

  • Identifies changes (inserts, updates, deletes) in source systems
  • Uses database logs, timestamps, or query comparison methods
  • Tracks changes since last extraction point
  • Enables incremental movement of only changed data
  • Reduces extraction time and bandwidth for large datasets
  • Supports real-time or near-real-time latency

Why It Matters

  • Reduces cost and latency of data movement by eliminating full refreshes
  • Enables real-time data availability for operational analytics
  • Reduces source system load by not continuously scanning for changes
  • Improves freshness of downstream data through frequent incremental updates
  • Enables compliance with data deletion by capturing deletes
  • Scales efficiently to handle large data volumes

Example

A payment processor uses CDC to stream transactions: database logs capture every new transaction and every subsequent update (refund, reversal); Debezium reads the logs within milliseconds of commit and streams the changes to Kafka, where multiple consumers act on them. A settlement_system updates account balances, an analytics_warehouse increments transaction counts, and a fraud_detector scores each event for suspicious patterns. Customer transaction data is current within seconds, eliminating batch latency.
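The fan-out in this example can be sketched as follows. The event envelope loosely follows Debezium's general shape (an `op` code plus `before`/`after` row images), but the consumer functions and payload fields are illustrative assumptions, not the processor's actual code.

```python
balances = {}   # settlement state: account -> balance
txn_count = 0   # analytics state: total transactions seen

def settlement_system(event):
    """Apply the new row image to the account balance."""
    row = event["after"]
    balances[row["account"]] = balances.get(row["account"], 0) + row["amount"]

def analytics_warehouse(event):
    """Increment the running transaction count."""
    global txn_count
    txn_count += 1

# Each consumer receives every change event independently,
# as it would from its own Kafka consumer group.
consumers = [settlement_system, analytics_warehouse]

event = {
    "op": "c",          # "c" = create (insert), per Debezium's op codes
    "before": None,
    "after": {"account": "A-1", "amount": 250},
}
for consume in consumers:
    consume(event)
```

In production, each consumer would subscribe to the Kafka topic with its own consumer group so it processes the full change stream independently of the others.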

Coginiti Perspective

CDC feeds naturally into Coginiti's ELT workflow. Once change data lands in a warehouse or lake, CoginitiScript's incremental publication strategies (append, merge, and merge_conditionally) handle the downstream transformation logic. The publication.Incremental() function lets blocks detect whether they are running in incremental or full-refresh mode, so the same CoginitiScript code handles both CDC-driven updates and initial loads without separate pipeline definitions.
