Apache Hudi
Apache Hudi is an open-source data lake framework providing incremental processing, ACID transactions, and fast ingestion for analytical and operational workloads.
Apache Hudi (Hadoop Upserts Deletes and Incrementals) enables efficient data ingestion and updates in data lakes through an incremental computing framework. Unlike traditional append-only systems, Hudi supports upserts and deletes natively, making it suitable for synchronizing operational databases with analytical platforms.
Hudi offers two table types: Copy-on-Write tables minimize read latency by rewriting Parquet files with each batch of updates, while Merge-on-Read tables append updates to delta log files for faster ingestion and merge them at query time (or asynchronously during compaction). This flexibility lets organizations trade write latency against read performance based on their requirements.
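As a concrete sketch, the table type is chosen at write time through a single option. The option names below follow Hudi's Spark datasource configs; the table and field names are illustrative:

```python
# Hudi Spark datasource write options selecting the table type.
# "COPY_ON_WRITE" rewrites Parquet files on update (fast reads);
# "MERGE_ON_READ" appends to delta logs (fast writes, merged on read).
def hudi_write_options(table_name: str, table_type: str) -> dict:
    assert table_type in ("COPY_ON_WRITE", "MERGE_ON_READ")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.recordkey.field": "customer_id",  # illustrative key
        "hoodie.datasource.write.precombine.field": "updated_at",  # keeps latest version
        "hoodie.datasource.write.operation": "upsert",
    }

opts = hudi_write_options("customer_events", "MERGE_ON_READ")
# Applied to a Spark DataFrame writer as:
#   df.write.format("hudi").options(**opts).mode("append").save(path)
```

Switching `table_type` is the only change needed to move a pipeline from read-optimized to ingest-optimized behavior.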
The framework excels at incremental processing, tracking which data has changed and processing only deltas rather than full datasets. This significantly reduces compute costs and latency for regular data synchronization tasks. Hudi integrates with Apache Spark for both ingestion and querying, and supports multiple storage backends including HDFS and cloud object storage.
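An incremental pull can be sketched with Hudi's Spark read options: the query returns only rows committed after a given instant rather than scanning the full table. The begin instant below is illustrative:

```python
# Read options for a Hudi incremental query: return only rows changed
# after the given commit instant (instants use yyyyMMddHHmmss format).
def incremental_read_options(begin_instant: str) -> dict:
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }

opts = incremental_read_options("20240401000000")
# Applied as: spark.read.format("hudi").options(**opts).load(path)
```

A downstream job typically records the last instant it processed and passes it as `begin_instant` on the next run, so each run touches only the delta.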
Key Characteristics
- Supports upsert and delete operations for operational data synchronization
- Offers Copy-on-Write and Merge-on-Read table types for different latency-throughput tradeoffs
- Tracks data lineage and change history through commit metadata
- Enables incremental querying to process only changed data
- Provides indexing mechanisms to optimize upsert performance
- Integrates seamlessly with Apache Spark ecosystems
Why It Matters
- Reduces network bandwidth and compute time by processing only incremental changes
- Enables real-time data synchronization from transactional systems to analytics platforms
- Supports data compliance by enabling efficient delete operations at scale
- Lowers infrastructure costs through intelligent file organization and caching strategies
- Improves data freshness for operational dashboards without full daily reloads
- Simplifies CDC (Change Data Capture) implementations with native upsert semantics
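For example, a compliance-driven hard delete can reuse the same write path with the operation switched to delete. This is a sketch; the table and field names are illustrative:

```python
# Hudi write options for a hard delete: the incoming DataFrame needs
# only the record keys of the rows to remove.
def delete_options(table_name: str) -> dict:
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": "customer_id",  # illustrative key
        "hoodie.datasource.write.operation": "delete",
    }

# Applied as:
#   keys_df.write.format("hudi") \
#       .options(**delete_options("customer_events")) \
#       .mode("append").save(path)
```

Because deletes go through the normal commit path, they are transactional and visible in the commit timeline like any other write.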
Example
```python
# Ingest with upserts into a Copy-on-Write table
df.write \
    .format("hudi") \
    .option("hoodie.table.name", "customer_events") \
    .option("hoodie.datasource.write.recordkey.field", "customer_id") \
    .option("hoodie.datasource.write.precombine.field", "updated_at") \
    .option("hoodie.datasource.write.operation", "upsert") \
    .mode("append") \
    .save(s3_path)

# Time travel: read the table as of a given commit instant
spark.read \
    .format("hudi") \
    .option("as.of.instant", "20240401120000") \
    .load(s3_path)
```
Coginiti Perspective
Coginiti can query Hudi tables through connected platforms that support the format, such as Athena, Trino, and Spark. While CoginitiScript's Iceberg publication targets are more directly integrated, teams using Hudi-based lakes can still develop and govern their analytics logic in Coginiti's analytics catalog and semantic layer, applying consistent business definitions regardless of the underlying table format.
Related Concepts
More in Open Table Formats
Apache Iceberg
Apache Iceberg is an open-source table format that organizes data files with a metadata layer enabling ACID transactions, schema evolution, and time travel capabilities for data lakes.
Data Compaction
Data compaction is a maintenance process that combines small data files into larger ones, improving query performance and reducing storage overhead without changing data or schema.
Delta Lake
Delta Lake is an open-source storage layer providing ACID transactions, schema governance, and data versioning to data lakes built on cloud object storage.
Hidden Partitioning
Hidden partitioning is a table format feature that partitions data logically for query optimization without encoding partition values in file paths or requiring file reorganization during partition scheme changes.
Open Table Format
An open table format is a vendor-neutral specification for organizing and managing data files and metadata in data lakes, enabling ACID transactions and multi-engine interoperability.
Partitioning
Partitioning is a data organization technique that divides tables into logical or physical segments based on column values, enabling query engines to scan only relevant data.