
Parquet

Parquet is an open-source columnar data file format that stores data in a compressed, efficient manner, enabling fast analytical queries while reducing storage requirements.

Parquet stores data in columns rather than rows, organizing values from the same column together. This columnar organization enables compression algorithms to work more effectively: a column of product IDs repeated millions of times compresses to a fraction of row-based storage. Parquet files are self-describing, containing metadata about column names, types, and structure, allowing tools to read files without external schema information. Parquet supports complex nested data types and handles missing values efficiently, making it suitable for diverse analytical data.

Parquet has become the standard for analytics data lakes, used extensively in data warehouses, cloud storage systems, and analytics platforms. Tools like Spark, Hive, and modern SQL engines natively support Parquet. The columnar format provides specific performance advantages: analytical queries typically access a subset of columns, and Parquet allows reading only necessary columns instead of entire rows. Compression ratios of 10:1 or higher are common, significantly reducing storage costs. Parquet competes with ORC in the Hadoop ecosystem and Arrow for in-memory analytics, with each optimized for different use cases.

Key Characteristics

  • Columnar format optimizing for analytical query patterns
  • Self-describing with embedded schema and metadata
  • Supports compression, achieving 10:1 or higher ratios
  • Enables reading only necessary columns
  • Supports complex and nested data types
  • Widely supported across analytics platforms and tools

Why It Matters

  • Reduces storage costs by 10x or more through compression
  • Dramatically accelerates analytical queries accessing subsets of columns
  • Enables efficient data lake storage for long-term analytics
  • Supports schema evolution and missing values
  • Reduces network bandwidth when querying cloud storage
  • Standard format for modern data architectures and tools

Example

Consider a table with billions of rows and six columns (id, timestamp, user_id, action, product_id, and amount) occupying 200GB as CSV. Stored as Parquet with Snappy compression, it shrinks to roughly 20GB. A query selecting only user_id and amount reads just those two columns' data, a fraction of the 20GB file, rather than the full dataset. The same query against CSV must scan all 200GB and discard the unwanted columns in memory; against Parquet, it fetches far less data directly from storage, improving performance by 10x or more.

Coginiti Perspective

Coginiti's materialization engine publishes Parquet files to object storage (S3, Azure Blob, GCS) with configurable row group sizes and compression algorithms, enabling cost-efficient incremental updates through append and merge strategies. CoginitiScript materializations leverage Parquet's columnar efficiency to store analytical results, allowing consumers to query outputs directly from object storage or import into any SQL engine, reducing warehouse compute costs and supporting ELT patterns where transformations occur at query time.
