
Data Lakehouse

A data lakehouse is an architecture that combines the storage advantages of a data lake (cheap, flexible, scalable) with the query capabilities of a data warehouse (schema, performance, governance).

Data lakehouses address a key tension: data lakes are cheap and flexible, but querying raw data is complex; data warehouses enable efficient queries but are expensive and rigid. Lakehouses resolve this by keeping data in cheap cloud object storage and layering query engines and metadata systems on top to supply schema and optimization. Technically, lakehouses rely on open table formats (Apache Iceberg, Delta Lake) that add structured metadata over object storage, enabling ACID transactions, schema enforcement, and warehouse-style query optimization while retaining the lake's flexibility and cost profile.
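To make the metadata-layering idea concrete, here is a toy sketch in Python. It is not Delta Lake or Iceberg; it only illustrates the core mechanism both share: immutable data files plus a versioned metadata log, where each commit writes a new snapshot and atomically swaps a single pointer, so readers always see a complete, consistent version of the table. All class and file names here are invented for illustration.

```python
import json
import os
import tempfile

class MiniTable:
    """Toy table format: immutable data files plus a metadata log.

    Each snapshot file lists exactly which data files belong to that
    version of the table. Commits are made atomic by writing the new
    snapshot first, then atomically replacing a single 'current'
    pointer file -- a drastically simplified version of what Delta
    Lake's transaction log and Iceberg's metadata files do.
    """

    def __init__(self, root):
        self.root = root
        os.makedirs(os.path.join(root, "data"), exist_ok=True)
        os.makedirs(os.path.join(root, "meta"), exist_ok=True)
        self._commit(0, [])  # empty initial snapshot

    def _commit(self, version, files):
        snap = os.path.join(self.root, "meta", f"v{version}.json")
        with open(snap, "w") as f:
            json.dump({"version": version, "files": files}, f)
        # Atomic pointer swap: os.replace is atomic, so a concurrent
        # reader sees either the old snapshot or the new one, never
        # a half-written state.
        tmp = snap + ".ptr"
        with open(tmp, "w") as f:
            f.write(snap)
        os.replace(tmp, os.path.join(self.root, "meta", "current"))

    def _current(self):
        with open(os.path.join(self.root, "meta", "current")) as f:
            with open(f.read()) as snap:
                return json.load(snap)

    def append(self, rows):
        snap = self._current()
        path = os.path.join(
            self.root, "data", f"part-{len(snap['files'])}.json")
        with open(path, "w") as f:
            json.dump(rows, f)  # stand-in for writing a Parquet file
        self._commit(snap["version"] + 1, snap["files"] + [path])

    def scan(self):
        # A scan reads only the files listed in the current snapshot,
        # which is how table formats give consistent reads over a
        # directory that other writers may be appending to.
        rows = []
        for path in self._current()["files"]:
            with open(path) as f:
                rows.extend(json.load(f))
        return rows

root = tempfile.mkdtemp()
t = MiniTable(root)
t.append([{"id": 1, "amount": 100}])
t.append([{"id": 2, "amount": 250}])
print(t.scan())
```

Real table formats add far more on top of this skeleton (column statistics for file pruning, schema evolution, concurrent-writer conflict detection), but the snapshot-plus-atomic-pointer pattern is what turns a pile of files into a table with ACID semantics.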

Lakehouses emerged from organizations recognizing that raw data and curated data both have value: raw data enables exploration and model training, while curated data supports governed analytics. Rather than choosing one, lakehouses provide both: raw data lives in the lake, curated tables are organized in warehouse-like schemas, and the same query engine accesses both. Platforms like Databricks were built around lakehouse architecture, treating the system as unified storage with multiple access patterns.

In practice, lakehouses enable organizations to consolidate systems: instead of separate lake (for raw data, ML) and warehouse (for analytics), a single lakehouse serves both. This reduces duplication, simplifies data movement, and lowers costs. The tradeoff is complexity: lakehouses are newer technology with less operational maturity than established warehouses.

Key Characteristics

  • Combines lake flexibility with warehouse structure
  • Uses object storage for cost efficiency
  • Implements ACID transactions and schema enforcement
  • Supports both raw and curated data access
  • Provides query optimization like warehouses
  • Enables multiple access patterns on same data

Why It Matters

  • Reduces total cost versus separate lake and warehouse
  • Enables unified analytics and ML on same platform
  • Supports governance on raw data without separate systems
  • Eliminates data movement between lake and warehouse
  • Reduces complexity by consolidating storage systems
  • Enables new use cases by providing both raw and curated access

Example

A financial services firm uses a Databricks lakehouse: raw transaction data lands in object storage as Parquet files, and the Delta Lake format adds ACID transactions and schema enforcement. The finance team uses SQL to query curated revenue tables; the ML team uses Python/Spark to train risk models on raw transaction logs; data scientists explore the raw data to discover new features. Same underlying storage, same infrastructure, different access patterns. Previously, this required a separate S3 lake and Snowflake warehouse with complex data movement between them.
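The "same data, different access patterns" point can be sketched in a few lines of plain Python. This is not Spark or Databricks code; it uses the stdlib `sqlite3` module as a stand-in SQL engine, and the transaction rows and field names are invented for illustration. The same raw records serve both a SQL aggregation (the finance-style curated view) and a programmatic feature computation (the ML-style raw access), with no copy between two systems.

```python
import sqlite3

# Hypothetical raw transaction records, standing in for Parquet files
# in object storage.
raw_transactions = [
    {"txn_id": 1, "account": "A", "amount": 120.0, "merchant": "grocer"},
    {"txn_id": 2, "account": "B", "amount": 75.5,  "merchant": "fuel"},
    {"txn_id": 3, "account": "A", "amount": 40.0,  "merchant": "grocer"},
]

# Access pattern 1: SQL over a schema-enforced table (finance team).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txns (txn_id INT, account TEXT, "
    "amount REAL, merchant TEXT)")
conn.executemany(
    "INSERT INTO txns VALUES (?, ?, ?, ?)",
    [(r["txn_id"], r["account"], r["amount"], r["merchant"])
     for r in raw_transactions])
revenue_by_account = dict(conn.execute(
    "SELECT account, SUM(amount) FROM txns GROUP BY account"))

# Access pattern 2: programmatic feature extraction over the same raw
# records (ML team), with no SQL schema required.
txn_counts = {}
for r in raw_transactions:
    txn_counts[r["account"]] = txn_counts.get(r["account"], 0) + 1

print(revenue_by_account)  # {'A': 160.0, 'B': 75.5}
print(txn_counts)          # {'A': 2, 'B': 1}
```

In a real lakehouse, both patterns would run against the same Delta or Iceberg tables through their respective engines; the sketch only shows why serving both from one copy of the data removes the lake-to-warehouse movement step.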

Coginiti Perspective

Coginiti supports lakehouse architectures directly through CoginitiScript's ability to publish Iceberg tables on Snowflake, Databricks, BigQuery, Trino, and Athena. This means governed transformations can produce open table format outputs that any lakehouse-compatible engine can read. The semantic layer provides consistent metric definitions whether the underlying data is accessed through a warehouse SQL interface or a lakehouse query engine, preventing definitional drift across access patterns.

Related Concepts

  • Data Lake
  • Data Warehouse
  • Delta Lake
  • Apache Iceberg
  • Object Storage
  • Cloud Storage
  • Analytics Database
  • Unified Storage
