Data Lineage
Data lineage is the complete path a piece of data takes from source systems through transformations to consumption points, enabling understanding of data dependencies and impact analysis.
Data lineage answers the question: where did this data come from and where does it go? Lineage includes: source systems (which databases or APIs feed the data), transformations (SQL, Python, dbt models that modify it), intermediate tables (staging, marts), and consumption points (dashboards, reports, machine learning models). Complete lineage spans the entire pipeline: from operational system to analytics presentation.
Data lineage emerged because data increasingly flows through complex pipelines and teams need to understand the resulting dependencies. When a source system goes down, which analytics tables are affected? When a data quality issue surfaces, where did the bad data originate? Lineage answers both questions. It also powers impact analysis: knowing that changing a column affects 50 downstream dashboards helps teams make careful decisions.
Data lineage is typically inferred from code and system metadata: SQL queries reveal table references, dbt models explicitly document lineage, data pipeline tools log dependencies, and data catalogs display relationships. Modern lineage systems are graph-based: nodes are data entities, edges are transformations. They support drill-down: clicking on a transformation shows the code that performs it. Lineage is often combined with data observability: when anomalies occur, lineage helps identify which upstream changes caused them.
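As a rough illustration of how lineage can be inferred from SQL, the sketch below extracts table references from a query with a regular expression. A production lineage tool uses a full SQL parser; this naive version ignores subqueries, CTE aliases, and quoted identifiers, and the table names are invented for the example.

```python
import re

def referenced_tables(sql: str) -> set[str]:
    """Naively collect identifiers that follow FROM or JOIN.

    Sketch only: a real lineage parser needs a proper SQL grammar to
    handle subqueries, CTEs, aliases, and quoted names correctly.
    """
    pattern = r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))

sql = """
SELECT o.id, p.name
FROM staging_orders o
JOIN dim_products p ON o.product_id = p.id
"""
print(referenced_tables(sql))  # {'staging_orders', 'dim_products'}
```

Each extracted table becomes an upstream node, and the query's output table becomes the downstream node it feeds, yielding one edge per reference in the lineage graph.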
Key Characteristics
- Documents source systems and dependencies
- Maps transformation steps and data flows
- Shows consumption points and downstream impact
- Typically represented as directed acyclic graphs
- Inferred from code and system metadata
- Enables impact analysis and root cause diagnosis
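The graph representation above makes impact analysis a simple traversal. The sketch below stores lineage as an adjacency list (asset names are hypothetical) and walks downstream to find everything affected by a change:

```python
from collections import deque

# Lineage as a directed graph: an edge (a, b) means b is derived from a.
# Asset names here are illustrative, not from any real system.
edges = {
    "orders_db":         ["staging_orders"],
    "staging_orders":    ["analytics_revenue"],
    "products_db":       ["analytics_revenue"],
    "analytics_revenue": ["revenue_dashboard", "forecast_model"],
}

def downstream(node: str) -> set[str]:
    """Every asset transitively affected by a change to `node` (BFS)."""
    seen, queue = set(), deque(edges.get(node, []))
    while queue:
        n = queue.popleft()
        if n not in seen:
            seen.add(n)
            queue.extend(edges.get(n, []))
    return seen

print(downstream("staging_orders"))
# {'analytics_revenue', 'revenue_dashboard', 'forecast_model'}
```

Because lineage graphs are acyclic, the traversal terminates; the `seen` set also guards against cycles introduced by bad metadata.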
Why It Matters
- Understanding: Users understand data origins and transformations
- Trust: Visible lineage enables trust in data
- Impact Analysis: Know which tables are affected by changes
- Diagnosis: Root cause analysis uses lineage to trace problems
- Compliance: Demonstrates data provenance for regulations
Example
Lineage for a revenue metric: production database (orders table) → ETL extraction → staging_orders → dbt transformation (join with products and customers) → analytics_revenue → BI tool dashboards. Following this lineage reveals that a change to the products table, such as a shift in its cardinality, could affect the revenue metric.
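The same example read in the other direction is root-cause analysis: given a suspect dashboard, walk the lineage backward to its sources. A minimal sketch, encoding the example's path as a child-to-parents map (the exact asset names are illustrative):

```python
# The revenue-metric lineage from the example, stored as
# child -> parents (each asset mapped to the assets it is built from).
parents = {
    "staging_orders":    ["orders_table"],
    "analytics_revenue": ["staging_orders", "products", "customers"],
    "revenue_dashboard": ["analytics_revenue"],
}

def upstream(node: str) -> set[str]:
    """Walk lineage backward to every source an asset depends on."""
    found = set()
    stack = list(parents.get(node, []))
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(parents.get(n, []))
    return found

print(sorted(upstream("revenue_dashboard")))
# ['analytics_revenue', 'customers', 'orders_table', 'products', 'staging_orders']
```

When bad numbers appear on the dashboard, this upstream set is the complete list of places the defect could have originated, which is exactly what a lineage-aware observability tool surfaces.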
Coginiti Perspective
CoginitiScript provides structural lineage through its block reference system. When a block invokes another via {{ block-name(args) }}, the dependency is explicit and traceable. Publication pipelines build on these references, creating a lineage path from source queries through transformations to materialized outputs (tables, views, Parquet, Iceberg). The SMDL layer adds a semantic lineage dimension: entity definitions trace back to their source tables or queries, and measure definitions document how business metrics derive from physical data.
Related Concepts
More in Data Governance & Quality
Analytics Catalog
An analytics catalog is a specialized data catalog focused on analytics assets such as metrics, dimensions, dashboards, and saved queries, enabling discovery and governance of analytics-specific objects.
Business Metadata
Business metadata is contextual information that gives data meaning to business users, including definitions, descriptions, ownership, and guidance on appropriate use.
Data Catalog
A data catalog is a searchable repository of metadata about data assets that helps users discover available datasets, understand their content, and assess their quality and suitability for use.
Data Certification
Data certification is a formal process of validating and approving data quality, documenting that data meets governance standards and is safe for use in critical business decisions.
Data Contracts
A data contract is a formal agreement specifying the expectations between data producers and consumers, including schema, quality guarantees, freshness SLAs, and remediation obligations.
Data Governance
Data governance is a framework of policies, processes, and controls that define how data is managed, who is responsible for it, and how it should be used to ensure quality, security, and compliance.