
Data Compaction

Data compaction is a maintenance process that combines small data files into larger ones, improving query performance and reducing storage overhead without changing data or schema.

Data lakes accumulate small files from frequent incremental writes, ingestions, and streaming updates. Query engines incur overhead for each file accessed (network round trips, metadata operations), making many small files problematic. Compaction consolidates these scattered files into larger, more efficient units that reduce operation overhead and improve I/O throughput.
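The per-file overhead described above can be made concrete with a back-of-the-envelope cost model. The numbers below (50 ms per file open, 2 s per GB scanned) are illustrative assumptions, not measurements from any specific engine:

```python
# Illustrative model of per-file query overhead: each file touched costs
# a fixed latency (metadata lookup + network round trip) on top of the
# throughput-bound read of the data itself. Constants are assumptions.
PER_FILE_OVERHEAD_MS = 50   # assumed fixed cost per file open
SCAN_MS_PER_GB = 2000       # assumed sequential scan rate

def scan_cost_ms(total_gb: float, file_count: int) -> float:
    """Total scan time: fixed per-file cost plus throughput-bound read."""
    return file_count * PER_FILE_OVERHEAD_MS + total_gb * SCAN_MS_PER_GB

# Same 50 GB of data, fragmented vs. compacted:
fragmented = scan_cost_ms(50, 5000)  # 5,000 x 10 MB files
compacted = scan_cost_ms(50, 100)    # 100 x 500 MB files
print(f"fragmented: {fragmented/1000:.0f}s, compacted: {compacted/1000:.0f}s")
# prints: fragmented: 350s, compacted: 105s
```

With these assumed constants, fragmentation more than triples total scan time even though the data volume is identical, which is why file count, not just data size, drives query cost.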

Compaction is non-destructive and transparent to queries. The process reads multiple files, writes consolidated output, and updates table metadata to reference new files while retiring old ones. The underlying data and schema remain unchanged. Scheduling compaction involves tradeoffs: frequent compaction reduces query overhead but consumes compute resources; infrequent compaction defers costs but allows query degradation.
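The read-consolidate-commit flow can be sketched as a simple planning step: greedily pack small files into groups near a target output size. This is a minimal sketch with hypothetical file names and sizes; real table formats perform the equivalent through versioned metadata commits rather than in-memory lists:

```python
# Minimal sketch of compaction planning: group small files (name -> size
# in MB) into batches close to a target output size. The metadata swap
# that retires old files is represented here only by the returned plan.
TARGET_SIZE_MB = 512  # assumed target output file size

def plan_compaction(files: dict[str, int]) -> list[list[str]]:
    """Greedily pack files into groups of at most TARGET_SIZE_MB."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(files.items()):
        if current and current_size + size > TARGET_SIZE_MB:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

files = {f"part-{i:04d}.parquet": 10 for i in range(500)}  # 500 x 10 MB
groups = plan_compaction(files)
print(len(groups))  # 10 groups of roughly 51 files (~510 MB) each
```

Every input file lands in exactly one group, which is what makes the subsequent metadata swap non-destructive: no row is dropped or duplicated, only repackaged.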

Many open table formats implement intelligent compaction strategies. Rather than compacting entire tables, they target partitions with excessive small files or use size-based heuristics. Some tools can run compaction incrementally, processing subsets at a time. Integration with cloud object storage lifecycle policies can automate cleanup of old file versions after compaction commits.
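A size-based targeting heuristic like those described above might look as follows. The thresholds and partition statistics are hypothetical; real tools read file-level statistics from table metadata:

```python
# Hedged sketch of size-based compaction targeting: only flag partitions
# whose small-file count exceeds a threshold. Thresholds are assumptions.
SMALL_FILE_MB = 64    # files below this size count as "small"
MAX_SMALL_FILES = 32  # tolerated small files per partition

def partitions_needing_compaction(stats: dict[str, list[int]]) -> list[str]:
    """stats maps partition name -> list of file sizes in MB."""
    return [
        part for part, sizes in stats.items()
        if sum(1 for s in sizes if s < SMALL_FILE_MB) > MAX_SMALL_FILES
    ]

stats = {
    "year=2024/month=03": [512, 498, 505],  # already compacted
    "year=2024/month=04": [10] * 500,       # fragmented by daily loads
}
print(partitions_needing_compaction(stats))  # ['year=2024/month=04']
```

Targeting only the fragmented partition lets the maintenance job skip data that is already well organized, which is the core of incremental compaction.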

Key Characteristics

  • Combine multiple small files into fewer, larger files for efficiency
  • Non-destructive operation that doesn't alter data or structure
  • Transparent to running queries; many systems can run it concurrently with reads and writes
  • Reduce metadata overhead and I/O costs associated with many small files
  • Support incremental compaction of partitions or time ranges
  • Enable cleanup of intermediate files from updates and deletes

Why It Matters

  • Can substantially reduce query latency in heavily fragmented tables
  • Lowers cloud storage costs through efficient file organization
  • Reduces metadata operation overhead from managing thousands of small files
  • Simplifies operations by automating cleanup after frequent updates or deletes
  • Improves resource utilization for concurrent queries on the same table
  • Counters the performance degradation caused by incremental data ingestion patterns

Example

```sql
-- Table has grown fragmented from daily incremental loads:
-- 5,000 small files averaging 10 MB each (~50 GB of data)

-- Trigger compaction on recent partitions
-- (illustrative syntax; the actual command varies by table format)
ALTER TABLE transactions COMPACT PARTITION year=2024, month=4;

-- Compaction process internally:
-- 1. Read 500 small files (~10 MB each) from the partition
-- 2. Write 10 larger files (~500 MB each)
-- 3. Update metadata to reference the new files
-- 4. Mark old small files for deletion after the retention period

-- Query performance improves:
-- Before: 500 file opens + network round trips
-- After: 10 file opens + network round trips
```

Coginiti Perspective

CoginitiScript's incremental publication can produce small files over time as append and merge operations accumulate. Compaction is handled by the target platform (Snowflake's automatic clustering, Databricks' OPTIMIZE, Iceberg's rewrite_data_files). Coginiti's role is ensuring that the transformation logic producing these files is governed and that publication metadata clearly defines materialization targets, so platform-level compaction processes know which tables to maintain.
