Directed Acyclic Graph (DAG)
A Directed Acyclic Graph (DAG) is a mathematical structure used in data systems to represent dependencies between tasks, ensuring they execute in the correct order without circular dependencies.
A DAG is a graph whose edges have direction (pointing from one node to another) and which contains no cycles (no way to follow edges and return to the starting node). In data systems, DAGs represent task dependencies: nodes are tasks (extract_orders, transform_customers, load_warehouse) and edges express precedence (extract must complete before transform). Orchestration systems topologically sort the DAG to determine which tasks can run in parallel and which must wait, guaranteeing a correct execution order.
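As a minimal sketch of the scheduling idea (plain Python, not any particular orchestrator's API), Kahn's algorithm can topologically sort a dependency map into "waves": every task in a wave has all of its dependencies satisfied, so tasks within a wave can run in parallel. The task names below are illustrative.

```python
from collections import defaultdict

def execution_waves(deps):
    """Topologically sort a task DAG into waves of parallel-runnable tasks.
    `deps` maps each task to the set of tasks it depends on."""
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)

    waves = []
    ready = [task for task, degree in indegree.items() if degree == 0]
    while ready:
        waves.append(sorted(ready))          # one wave = tasks runnable now
        next_ready = []
        for task in ready:
            for child in children[task]:
                indegree[child] -= 1         # dependency satisfied
                if indegree[child] == 0:
                    next_ready.append(child)
        ready = next_ready

    if sum(len(wave) for wave in waves) != len(deps):
        raise ValueError("cycle detected: not a DAG")
    return waves

pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_customers": {"extract_customers"},
    "load_warehouse": {"extract_orders", "transform_customers"},
}
print(execution_waves(pipeline))
# → [['extract_customers', 'extract_orders'], ['transform_customers'], ['load_warehouse']]
```

The cycle check at the end is what makes the "acyclic" property enforceable: if any tasks never reach indegree zero, the graph contains a cycle and cannot be scheduled.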
DAGs became standard in data orchestration because they naturally model both the sequential and parallel aspects of data pipelines. Apache Airflow popularized the pattern by representing every workflow as a DAG, dbt builds a DAG from model dependencies, and workflow engines for Kubernetes (such as Argo Workflows) schedule jobs the same way. The acyclic property is critical: a cycle would create a deadlock (task A waits for B while B waits for A).
In practice, DAGs provide powerful capabilities: visualization (seeing the entire pipeline at a glance), automatic dependency resolution (the scheduler determines the correct execution order), parallelization (tasks with no mutual dependencies run simultaneously), and failure handling (if a task fails, downstream tasks can fail or skip intelligently).
Key Characteristics
- Directed edges represent dependencies between tasks
- Acyclic structure prevents circular dependencies
- Enables topological sorting for execution ordering
- Supports parallel execution of independent tasks
- Visualizes entire pipeline structure
- Provides foundation for orchestration and monitoring
Why It Matters
- Prevents deadlocks by ensuring no circular dependencies
- Enables automatic parallel execution through topological sorting
- Provides visibility into pipeline structure for debugging
- Enables recovery planning by showing task dependencies
- Supports incremental re-execution (rerun only the failed task and its dependents)
- Simplifies reasoning about pipeline behavior
Example
Consider an Airflow DAG for an analytics pipeline: extract_transactions has no dependencies and runs first; transform_transactions depends on extract_transactions and waits for it to complete; load_warehouse depends on transform_transactions; and generate_reports depends on load_warehouse. Airflow visualizes this DAG and topologically sorts it to execute. If extract_transactions fails, the downstream tasks wait or skip. If an operator manually reruns transform_transactions, Airflow knows that load_warehouse and generate_reports must rerun as well. The DAG structure enables all of these behaviors automatically.
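The rerun behavior falls out of simple graph traversal. A hedged sketch in plain Python (not Airflow's API) models the four-task pipeline above and computes the set of tasks that transitively depend on a rerun or failed task:

```python
def downstream(deps, start):
    """Return every task that transitively depends on `start`.
    `deps` maps each task to the set of tasks it depends on."""
    result, frontier = set(), {start}
    while frontier:
        # advance to tasks whose parents include anything in the frontier
        frontier = {task for task, parents in deps.items()
                    if parents & frontier and task not in result}
        result |= frontier
    return result

analytics = {
    "extract_transactions": set(),
    "transform_transactions": {"extract_transactions"},
    "load_warehouse": {"transform_transactions"},
    "generate_reports": {"load_warehouse"},
}

# Rerunning transform_transactions forces these tasks to rerun too:
print(sorted(downstream(analytics, "transform_transactions")))
# → ['generate_reports', 'load_warehouse']
```

The same traversal answers the failure case: everything in `downstream(analytics, "extract_transactions")` would wait or skip if the extract task failed.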
Coginiti Perspective
Coginiti uses DAG structures in two places. CoginitiScript's publication system analyzes block dependencies to build an execution DAG, grouping independent blocks into steps that run in parallel. Coginiti Actions define DAGs explicitly through job dependencies in TOML configuration, where jobs without depends_on entries execute concurrently at the start. In both cases, circular dependencies are detected and rejected at validation time rather than at runtime.
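To make the explicit form concrete, a hypothetical Coginiti Actions configuration might look like the TOML fragment below. The job names are illustrative and the exact schema may differ; only the `depends_on` mechanism is described above.

```toml
# Hypothetical job definitions; jobs without depends_on start concurrently.
[jobs.extract_orders]

[jobs.extract_customers]

[jobs.transform]
depends_on = ["extract_orders", "extract_customers"]

[jobs.load_warehouse]
depends_on = ["transform"]
```

Here extract_orders and extract_customers run in parallel at the start, transform waits for both, and a cycle (e.g., extract_orders depending on load_warehouse) would be rejected at validation time.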