Data Lineage

Data lineage is the complete path that data takes from source systems, through transformations, to consumption points. It makes data dependencies visible and supports impact analysis.

Data lineage answers the question: where did this data come from and where does it go? Lineage includes: source systems (which databases or APIs feed the data), transformations (SQL, Python, dbt models that modify it), intermediate tables (staging, marts), and consumption points (dashboards, reports, machine learning models). Complete lineage spans the entire pipeline: from operational system to analytics presentation.

Data lineage became necessary as data increasingly flows through complex, multi-stage pipelines whose dependencies teams need to understand. When a source system is down, which analytics tables are affected? When a data quality check fails, where did the bad data originate? Lineage answers both questions. It also powers impact analysis: knowing that a column change affects 50 downstream dashboards lets a team plan the change carefully instead of discovering the breakage afterwards.

Data lineage is typically inferred from code and system metadata: SQL queries reveal table references, dbt models explicitly document lineage, data pipeline tools log dependencies, and data catalogs display relationships. Modern lineage systems are graph-based: nodes are data entities, edges are transformations. They support drill-down: clicking on a transformation shows the code that performs it. Lineage is often combined with data observability: when anomalies occur, lineage helps identify which upstream changes caused them.
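
As a rough illustration of inferring lineage from SQL, the sketch below pulls table references out of two statements and stores them as graph edges. The table names and statements are hypothetical, and the regular expressions are a deliberate simplification: production lineage tools use a real SQL parser, since regexes miss CTEs, subqueries, quoted identifiers, and comments.

```python
import re
from collections import defaultdict

# Hypothetical statements; table names are illustrative only.
STATEMENTS = [
    "CREATE TABLE staging_orders AS SELECT * FROM raw_orders",
    "CREATE TABLE analytics_revenue AS "
    "SELECT o.order_id, p.price FROM staging_orders o "
    "JOIN products p ON o.product_id = p.product_id",
]

# Naive patterns: the table being written, and the tables being read.
TARGET_RE = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?(?:TABLE|VIEW)\s+([\w.]+)", re.IGNORECASE
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(sql):
    """Return (source_table, target_table) pairs for one statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []  # not a statement that materializes a table or view
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return [(src, target.group(1)) for src in sources]


def build_graph(statements):
    """Map each table to the set of tables built directly from it."""
    graph = defaultdict(set)
    for sql in statements:
        for src, dst in extract_edges(sql):
            graph[src].add(dst)
    return graph


lineage = build_graph(STATEMENTS)
for source in sorted(lineage):
    print(source, "->", sorted(lineage[source]))
```

Each key in the resulting mapping is a node, and each entry in its set is an edge to a table derived from it, which matches the graph model described above.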

Key Characteristics

  • Documents source systems and dependencies
  • Maps transformation steps and data flows
  • Shows consumption points and downstream impact
  • Typically represented as directed acyclic graphs
  • Inferred from code and system metadata
  • Enables impact analysis and root cause diagnosis
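
Impact analysis and root cause diagnosis both reduce to traversing the lineage graph: downstream from a changed entity, or upstream from a broken one. The following is a minimal sketch under assumed, illustrative node names; it is not tied to any particular lineage tool.

```python
from collections import deque

# Hypothetical lineage edges: (upstream entity, downstream entity).
EDGES = [
    ("orders", "staging_orders"),
    ("staging_orders", "analytics_revenue"),
    ("products", "analytics_revenue"),
    ("customers", "analytics_revenue"),
    ("analytics_revenue", "revenue_dashboard"),
]


def reachable(start, edges, reverse=False):
    """Breadth-first search over the lineage graph.

    With reverse=False this follows edges downstream (impact analysis);
    with reverse=True it follows them upstream (root cause diagnosis).
    """
    adjacency = {}
    for src, dst in edges:
        if reverse:
            src, dst = dst, src
        adjacency.setdefault(src, []).append(dst)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def downstream(entity):
    """Everything that could be affected if `entity` changes."""
    return reachable(entity, EDGES)


def upstream(entity):
    """Everything that could have caused a problem seen in `entity`."""
    return reachable(entity, EDGES, reverse=True)


print(sorted(downstream("products")))
print(sorted(upstream("analytics_revenue")))
```

Because the graph is acyclic, the starting entity never appears in its own result, so the returned set contains only the affected or contributing entities.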

Why It Matters

  • Understanding: Users can see where data originates and how it is transformed
  • Trust: Users can verify how a figure was produced before relying on it
  • Impact Analysis: Know which tables are affected by changes
  • Diagnosis: Root cause analysis uses lineage to trace problems
  • Compliance: Demonstrates data provenance for regulations

Example

Lineage for a revenue metric: production database (orders table) → ETL extraction → staging_orders → dbt transformation (join with products and customers) → analytics_revenue → BI tool dashboards. Following this lineage reveals, for example, that a change in the cardinality of the products table (such as more than one row per product) can multiply rows in the join and distort revenue metrics.
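
To make the cardinality remark concrete, here is a small sketch of how a duplicate row in an upstream products table inflates a revenue total computed through a join. The rows, column names, and prices are invented for illustration, and the nested loop stands in for an SQL inner join.

```python
# Hypothetical order rows: each order references one product.
orders = [
    {"order_id": 1, "product_id": "A", "quantity": 2},
    {"order_id": 2, "product_id": "B", "quantity": 1},
]

# One row per product: the cardinality the revenue model expects.
products_unique = [
    {"product_id": "A", "price": 10.0},
    {"product_id": "B", "price": 5.0},
]

# Same table after an upstream change introduces a duplicate row for "A".
products_duplicated = products_unique + [{"product_id": "A", "price": 10.0}]


def revenue(order_rows, product_rows):
    """Inner-join orders to products on product_id and sum quantity * price."""
    total = 0.0
    for order in order_rows:
        for product in product_rows:
            if product["product_id"] == order["product_id"]:
                total += order["quantity"] * product["price"]
    return total


print(revenue(orders, products_unique))      # 25.0
print(revenue(orders, products_duplicated))  # 45.0: order 1 is counted twice
```

Nothing in the orders data changed between the two calls. Lineage is what connects the inflated dashboard number back to the products table.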

Coginiti Perspective

CoginitiScript provides structural lineage through its block reference system. When a block invokes another via {{ block-name(args) }}, the dependency is explicit and traceable. Publication pipelines build on these references, creating a lineage path from source queries through transformations to materialized outputs (tables, views, Parquet, Iceberg). The SMDL layer adds a semantic lineage dimension: entity definitions trace back to their source tables or queries, and measure definitions document how business metrics derive from physical data.
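
Because block references follow a fixed textual pattern, dependencies can in principle be read straight from the source. The sketch below scans for the {{ block-name(args) }} form quoted above and builds a dependency map. It is written in Python purely for illustration: the block names and bodies are hypothetical, and the regular expression is an assumption about the surface syntax, not Coginiti's own parser or API.

```python
import re

# Assumed surface form of a block reference: {{ block-name(args) }}.
# Only the name before the opening parenthesis is captured.
BLOCK_CALL_RE = re.compile(r"\{\{\s*([\w.-]+)\s*\(")

# Hypothetical block bodies keyed by block name.
blocks = {
    "raw-orders": "SELECT * FROM orders",
    "clean-orders": "SELECT * FROM {{ raw-orders() }} WHERE status = 'complete'",
    "revenue": "SELECT SUM(amount) AS total FROM {{ clean-orders() }}",
}


def block_dependencies(block_bodies):
    """Map each block name to the sorted list of blocks it references."""
    return {
        name: sorted(set(BLOCK_CALL_RE.findall(body)))
        for name, body in block_bodies.items()
    }


dependencies = block_dependencies(blocks)
for name in sorted(dependencies):
    print(name, "depends on", dependencies[name])
```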
