
Directed Acyclic Graph (DAG)

A Directed Acyclic Graph (DAG) is a mathematical structure used in data systems to represent dependencies between tasks, ensuring they execute in the correct order and never depend on one another in a circle.

A DAG is a graph whose edges have direction (each points from one node to another) and which contains no cycles (there is no way to follow edges and return to the starting node). In data systems, DAGs represent task dependencies: nodes are tasks (extract_orders, transform_customers, load_warehouse) and edges show precedence (extract must complete before transform). DAGs ensure correct execution order: orchestration systems topologically sort the DAG to determine which tasks can run in parallel and which must wait.
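The ordering step described above can be sketched with Python's standard-library graphlib module. This is a minimal illustration, not how any particular orchestrator implements scheduling. The extract_customers task and the exact dependency edges are assumptions added for the example.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (its predecessors).
# extract_customers and these specific edges are hypothetical.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_customers": {"extract_customers"},
    "load_warehouse": {"extract_orders", "transform_customers"},
}


def parallel_batches(deps):
    """Group tasks into batches whose members could run at the same time."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        # get_ready() returns every task whose dependencies are all done.
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unblock successors
    return batches


batches = parallel_batches(dependencies)
for number, batch in enumerate(batches, start=1):
    print(f"step {number}: {', '.join(batch)}")
```

Under these assumed dependencies, the two extract tasks land in the first batch because neither waits on anything, while load_warehouse comes last because it needs both upstream branches.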

DAGs became standard in data orchestration because they naturally model the sequential and parallel aspects of data pipelines. Apache Airflow popularized the term by making the DAG its core workflow abstraction; dbt builds a DAG from model dependencies, and Kubernetes-native workflow engines such as Argo Workflows let users define workflows as DAGs. The acyclic property is critical: a cycle would create a deadlock (task A waits for B while B waits for A), so no valid execution order would exist.
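To make the deadlock case concrete, the sketch below checks a dependency map for cycles before anything runs. It relies on the standard-library graphlib module, which raises CycleError when no valid ordering exists. The task names task_a and task_b and the helper find_cycle are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def find_cycle(deps):
    """Return the nodes of one detected cycle, or None if the graph is a DAG."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError as err:
        # The second element of args lists the nodes on the cycle;
        # its first and last entries are the same node.
        return err.args[1]
    return None


# task_a waits for task_b and task_b waits for task_a: not a DAG.
cyclic = {"task_a": {"task_b"}, "task_b": {"task_a"}}
# task_b waits for task_a only: a valid DAG.
acyclic = {"task_a": set(), "task_b": {"task_a"}}

print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

Running a check like this at definition time means a circular dependency is reported as a configuration error instead of surfacing as a pipeline that never makes progress.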

In practice, DAGs provide powerful capabilities: visualization (seeing the entire pipeline at a glance), automatic dependency resolution (the scheduler determines the correct execution order), parallelization (tasks that do not depend on each other run simultaneously), and failure handling (if a task fails, its downstream tasks can be failed or skipped instead of running on missing or bad inputs).

Key Characteristics

  • Directed edges represent dependencies between tasks
  • Acyclic structure prevents circular dependencies
  • Enables topological sorting for execution ordering
  • Supports parallel execution of independent tasks
  • Visualizes entire pipeline structure
  • Provides foundation for orchestration and monitoring

Why It Matters

  • Prevents deadlocks by ensuring no circular dependencies
  • Enables automatic parallel execution through topological sorting
  • Provides visibility into pipeline structure for debugging
  • Enables recovery planning by showing task dependencies
  • Supports incremental re-execution (rerun only the failed task and its dependents)
  • Simplifies reasoning about pipeline behavior
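The incremental re-execution point can be illustrated with a short graph traversal: starting from a failed task, collect everything reachable along downstream edges. This is a sketch under assumed names; the tasks (extract, transform, audit, load) and the helper tasks_to_rerun are hypothetical and not taken from any specific tool.

```python
from collections import deque

# Edges point from a task to the tasks that depend on it (its downstream tasks).
# This small pipeline is hypothetical.
downstream = {
    "extract": ["transform", "audit"],
    "transform": ["load"],
    "audit": [],
    "load": [],
}


def tasks_to_rerun(graph, failed_task):
    """Return the failed task followed by every task reachable from it."""
    seen = {failed_task}
    queue = deque([failed_task])
    order = [failed_task]
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child not in seen:  # visit each task once, even with shared parents
                seen.add(child)
                queue.append(child)
                order.append(child)
    return order


print(tasks_to_rerun(downstream, "transform"))
print(tasks_to_rerun(downstream, "extract"))
```

When transform fails, only transform and load need to run again; extract and audit keep their earlier results, which is where the time savings come from.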

Example

Consider an Airflow DAG for an analytics pipeline. The extract_transactions task has no dependencies, so it runs first. transform_transactions depends on extract_transactions and waits for it to complete, load_warehouse depends on transform_transactions, and generate_reports depends on load_warehouse. Airflow visualizes this DAG and executes the tasks in topological order. If extract_transactions fails, the downstream tasks do not run (under Airflow's default trigger rule they are marked upstream_failed). If an operator clears transform_transactions with the downstream option, Airflow reruns load_warehouse and generate_reports as well. The DAG structure is what makes these behaviors possible without hand-written ordering logic.
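The failure behavior in this example can be approximated in plain Python. The sketch below is a simulation under stated assumptions, not Airflow code: run_pipeline and its failing parameter are hypothetical, and the upstream_failed label is borrowed from the name of an Airflow task state purely for readability.

```python
from graphlib import TopologicalSorter

# Each task maps to the tasks it depends on, following the example pipeline.
pipeline = {
    "extract_transactions": set(),
    "transform_transactions": {"extract_transactions"},
    "load_warehouse": {"transform_transactions"},
    "generate_reports": {"load_warehouse"},
}


def run_pipeline(deps, failing=frozenset()):
    """Walk tasks in dependency order and record a final state for each.

    `failing` names the tasks that should be treated as failing when run.
    """
    states = {}
    for task in TopologicalSorter(deps).static_order():
        if any(states[upstream] != "success" for upstream in deps[task]):
            states[task] = "upstream_failed"  # never started
        elif task in failing:
            states[task] = "failed"
        else:
            states[task] = "success"
    return states


for task, state in run_pipeline(pipeline, {"extract_transactions"}).items():
    print(f"{task}: {state}")
```

Because tasks are visited in topological order, every upstream state is already known when a task is reached, so a single pass is enough to propagate a failure down the chain.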

Coginiti Perspective

Coginiti uses DAG structures in two places. CoginitiScript's publication system analyzes block dependencies to build an execution DAG, grouping independent blocks into steps that run in parallel. Coginiti Actions define DAGs explicitly through job dependencies in TOML configuration, where jobs without depends_on entries execute concurrently at the start. In both cases, circular dependencies are detected and rejected at validation time rather than at runtime.

Related Concepts

  • Data Pipeline
  • Data Orchestration
  • Data Workflow
  • Task Scheduling
  • Dependency Management
  • Topological Sort
  • Pipeline Visualization
  • Graph Theory
