
Open Table Format

An open table format is a vendor-neutral specification for organizing and managing data files and metadata in data lakes, enabling ACID transactions and multi-engine interoperability.

Open table formats emerged in response to fragmentation in data lake technology, where different platforms used incompatible metadata systems. They define standardized ways to lay out data files, track changes, and maintain consistency across distributed reads and writes, independent of any single storage or compute provider.
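As a rough illustration, an Iceberg-style table is just a directory of immutable data files plus versioned metadata files that describe them. The exact names below are simplified and illustrative, not any format's precise layout:

```
sales/                           # one table = one directory tree
  data/
    part-00000.parquet           # immutable data files
    part-00001.parquet
  metadata/
    v1.metadata.json             # schema, partition spec, snapshot list
    v2.metadata.json             # each commit writes a new version
    manifest-list.avro           # which data files belong to a snapshot
```

Because every engine reads the same metadata files to discover the same data files, no engine needs a proprietary index of the table's contents.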

Key open table formats include Apache Iceberg, Delta Lake, and Apache Hudi, each with different architectural approaches but shared goals: eliminating data corruption from concurrent operations, supporting schema evolution, and providing audit trails. By adhering to open standards, organizations avoid vendor lock-in and gain flexibility to choose compute engines based on performance and cost requirements.

The critical innovation of open table formats is separating metadata management from compute. This allows multiple query engines (Spark, Trino, Flink, DuckDB) to operate on the same physical data while maintaining transactional consistency. The industry-wide investment in standardization reflects the growing maturity of data lake technology and a recognition that format interoperability is essential for enterprise analytics.
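The transactional consistency described above typically rests on one simple mechanism: every commit produces a new immutable metadata snapshot, and writers advance a single "current snapshot" pointer with a compare-and-swap. The toy sketch below illustrates that idea in plain Python; it is not any format's real API, and `TableCatalog` is a hypothetical name:

```python
import threading

class TableCatalog:
    """Toy model of a table-format catalog: the single source of truth
    is a pointer to the latest metadata snapshot. Writers commit via
    compare-and-swap, so concurrent writers cannot corrupt the table."""

    def __init__(self):
        self._lock = threading.Lock()
        # Snapshots are immutable; each commit appends a new one.
        self.snapshots = [{"version": 0, "files": []}]

    def current(self):
        return self.snapshots[-1]

    def commit(self, expected_version, new_files):
        with self._lock:  # stands in for the catalog's atomic swap
            head = self.snapshots[-1]
            if head["version"] != expected_version:
                return False  # another writer won; caller must retry
            self.snapshots.append({
                "version": expected_version + 1,
                "files": head["files"] + new_files,
            })
            return True

catalog = TableCatalog()
base = catalog.current()["version"]
assert catalog.commit(base, ["data/part-001.parquet"])
# A writer holding a stale version is rejected, not silently merged:
assert not catalog.commit(base, ["data/part-002.parquet"])
```

Readers never take locks at all: they resolve the current pointer once and then read only immutable files, which is why many engines can safely query the table while it is being written.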

Key Characteristics

  • Define standardized file layouts and metadata organization schemes
  • Enable multiple compute engines to query the same data consistently
  • Provide ACID transaction guarantees across distributed systems
  • Support schema versioning and evolution without data rewriting
  • Maintain complete transaction history for audit and compliance
  • Operate on cloud object storage without proprietary file systems
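The "schema evolution without data rewriting" point works because these formats track columns by stable field IDs rather than by name or position: a rename or an added column is a metadata-only change, and old files are projected through the current schema at read time. A minimal sketch of that idea, with illustrative schemas and data:

```python
# Columns are keyed by stable field IDs, not names, so a rename or an
# added column never forces old data files to be rewritten.
schema_v1 = {1: "customer", 2: "amount"}
schema_v2 = {1: "customer_name", 2: "amount", 3: "region"}  # rename + add

# A data file written under schema v1 stores values by field ID.
old_file = [{1: "acme", 2: 100}, {1: "globex", 2: 250}]

def read_with_schema(rows, schema):
    # Project each stored row through the current schema; fields added
    # after the file was written surface as None (NULL).
    return [{name: row.get(fid) for fid, name in schema.items()}
            for row in rows]

rows = read_with_schema(old_file, schema_v2)
```

Here `rows[0]` comes back as `{"customer_name": "acme", "amount": 100, "region": None}`: the rename is honored and the new column is NULL for pre-existing data, all without touching the original file.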

Why It Matters

  • Eliminates vendor lock-in by supporting multiple compute engines
  • Reduces costs through competitive purchasing and by matching engines to workload type
  • Ensures data correctness in complex analytical environments with concurrent operations
  • Provides governance through standardized schema management and audit trails
  • Simplifies disaster recovery and data migration between platforms
  • Enables organizations to adopt best-fit tools without rearchitecting data infrastructure

Example

```
# Same data, queried from different engines

# Spark (Python)
spark.read.format("iceberg").load("s3://data-lake/sales").show()

-- Trino (SQL)
SELECT * FROM iceberg.data_lake.sales;

-- Both read identical, consistent metadata and data files
```

Coginiti Perspective

Coginiti embraces open table formats as a materialization target. CoginitiScript publishes Iceberg tables across Snowflake, Databricks, BigQuery, Trino, and Athena, and writes Parquet files directly to object storage. This commitment to open formats means data produced through Coginiti's governed workflows is not locked into a proprietary format or a single query engine. Any tool that reads Iceberg or Parquet can consume the output independently.
