Apache Hudi

Apache Hudi is an open-source data lake framework providing incremental processing, ACID transactions, and fast ingestion for analytical and operational workloads.

Apache Hudi (Hadoop Upserts Deletes and Incrementals) enables efficient data ingestion and updates in data lakes through an incremental computing framework. Unlike traditional append-only systems, Hudi supports upserts and deletes natively, making it suitable for synchronizing operational databases with analytical platforms.

Hudi stores data in two table types: Copy-on-Write tables minimize read latency by rewriting Parquet base files at write time, while Merge-on-Read tables write updates to row-based delta logs for faster ingestion and merge them with base files at query time or during background compaction. This flexibility lets organizations trade ingestion speed against query latency based on their requirements.
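As a sketch, the table type is selected per table through Hudi's Spark datasource write options. The option keys below follow Hudi's documented configuration; the table and field names are illustrative:

```python
# Illustrative Hudi write options selecting a table type.
# Option keys follow Hudi's Spark datasource config; the table
# and field names here are hypothetical.
cow_options = {
    "hoodie.table.name": "customer_events",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # rewrite Parquet on update
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.precombine.field": "event_ts",   # keeps the latest record on upsert
}

# Switching to Merge-on-Read only changes the table type; updates
# then land in delta logs and are merged at read time or by compaction.
mor_options = {**cow_options,
               "hoodie.datasource.write.table.type": "MERGE_ON_READ"}

# Usage (PySpark):
# df.write.format("hudi").options(**cow_options).mode("append").save(path)
```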

The framework excels at incremental processing, tracking which data has changed and processing only deltas rather than full datasets. This significantly reduces compute costs and latency for regular data synchronization tasks. Hudi integrates with Apache Spark for both ingestion and querying, and supports multiple storage backends including HDFS and cloud object storage.

Key Characteristics

  • Supports upsert and delete operations for operational data synchronization
  • Offers Copy-on-Write and Merge-on-Read table types for different latency-throughput tradeoffs
  • Tracks data lineage and change history through commit metadata
  • Enables incremental querying to process only changed data
  • Provides indexing mechanisms to optimize upsert performance
  • Integrates seamlessly with Apache Spark ecosystems
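The commit metadata mentioned above surfaces in queries as per-record meta columns that Hudi prepends to every table. The column names below are part of Hudi's table format; the filter shown is an illustrative sketch:

```python
# Metadata columns Hudi adds to every record; these names are
# defined by Hudi's table format.
HUDI_META_COLUMNS = [
    "_hoodie_commit_time",     # commit that wrote this record version
    "_hoodie_commit_seqno",    # ordering of records within a commit
    "_hoodie_record_key",      # the record key used for upserts
    "_hoodie_partition_path",  # partition the record lives in
    "_hoodie_file_name",       # data file containing the record
]

# Illustrative PySpark filter over change history (path is hypothetical):
# spark.read.format("hudi").load(path) \
#      .where("_hoodie_commit_time > '20240401120000'")
```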

Why It Matters

  • Reduces network bandwidth and compute time by processing only incremental changes
  • Enables real-time data synchronization from transactional systems to analytics platforms
  • Supports data compliance by enabling efficient delete operations at scale
  • Lowers infrastructure costs through automatic file sizing, compaction, and clustering
  • Improves data freshness for operational dashboards without full daily reloads
  • Simplifies CDC (Change Data Capture) implementations with native upsert semantics
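Compliance deletes, for instance, use the same datasource API with the write operation switched to delete. A minimal sketch, assuming Hudi's documented option keys; the table and field names are hypothetical:

```python
# Illustrative options for a hard delete; the incoming DataFrame
# only needs to carry the record keys to remove. Option keys follow
# Hudi's Spark datasource config; names are hypothetical.
delete_options = {
    "hoodie.table.name": "customer_events",
    "hoodie.datasource.write.recordkey.field": "customer_id",
    "hoodie.datasource.write.operation": "delete",
}

# Usage (PySpark): write a DataFrame of keys to delete:
# keys_df.write.format("hudi").options(**delete_options).mode("append").save(path)
```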

Example

`
# Ingest with upserts into a Copy-on-Write table (PySpark)
(df.write
  .format("hudi")
  .option("hoodie.table.name", "customer_events")
  .option("hoodie.datasource.write.recordkey.field", "customer_id")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode("append")
  .save(s3_path))

# Query only the changes committed after a given instant
(spark.read
  .format("hudi")
  .option("hoodie.datasource.query.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20240401120000")
  .load(s3_path))
`

Coginiti Perspective

Coginiti can query Hudi tables through connected platforms that support the format, such as Athena, Trino, and Spark. While CoginitiScript's Iceberg publication targets are more directly integrated, teams using Hudi-based lakes can still develop and govern their analytics logic in Coginiti's analytics catalog and semantic layer, applying consistent business definitions regardless of the underlying table format.
