Data Aggregation
Data aggregation is the process of combining multiple rows of data using aggregate functions to compute summary statistics, totals, averages, and other derived metrics.
Aggregation transforms raw transactional data into business metrics. Rather than examining individual transactions, aggregation computes totals (sum of revenue), counts (number of customers), averages (mean purchase value), or other summaries. These metrics drive business decisions: How much revenue did we generate? How many customers are active? What is average order value by region?
Common aggregate functions include SUM (total), COUNT (row count), AVG (average), MIN/MAX (extremes), and statistical functions (STDDEV, PERCENTILE). Queries often group aggregations by dimensions (region, product, time period) to slice metrics across business domains. Aggregation is computationally expensive for large datasets, requiring scanning all rows and combining results through hash tables or sorting.
Aggregation is so fundamental to analytics that entire optimization strategies focus on making it efficient. Materialized views pre-aggregate, dynamic tables incrementally update aggregates, and indexes enable aggregate computation without full table scans. Modern systems often separate OLTP (efficient single-row operations) from OLAP (efficient aggregation) using different data structures.
Key Characteristics
- ▶Combine rows using aggregate functions (SUM, COUNT, AVG, MIN, MAX)
- ▶Group results by dimensions for sliced metrics
- ▶Reduce data volume from millions to thousands of rows
- ▶Require scanning all relevant data before computing results
- ▶Support hierarchy (total company, by division, by region)
- ▶Enable efficient metrics computation through materialization
Why It Matters
- ▶Enable business metrics and KPIs from raw transaction data
- ▶Reduce data volume from billions to millions for dashboards
- ▶Support fast metric queries through pre-aggregation
- ▶Enable trend analysis and comparative metrics across dimensions
- ▶Drive decision-making through summarized, actionable insights
- ▶Reduce query costs by computing aggregates once instead of repeatedly
Example
` -- Row-level transaction data (millions of rows) SELECT * FROM orders LIMIT 5; -- order_id | customer_id | order_date | product_id | order_value -- 1 | 100 | 2024-01-01 | 50 | 99.99 -- 2 | 101 | 2024-01-01 | 51 | 149.99 -- 3 | 100 | 2024-01-02 | 52 | 59.99 -- Aggregation: total revenue by day SELECT DATE(order_date) as sale_date, SUM(order_value) as daily_revenue, COUNT(*) as order_count, COUNT(DISTINCT customer_id) as unique_customers, AVG(order_value) as avg_order_value FROM orders WHERE order_date >= '2024-01-01' GROUP BY DATE(order_date) ORDER BY sale_date; -- Result (thousands of rows instead of millions): -- sale_date | daily_revenue | order_count | unique_customers | avg_order_value -- 2024-01-01 | 249.98 | 2 | 2 | 124.99 -- 2024-01-02 | 59.99 | 1 | 1 | 59.99 -- Hierarchical aggregation SELECT product_category, region, SUM(order_value) as category_region_revenue, SUM(SUM(order_value)) OVER (PARTITION BY product_category) as category_total FROM orders o JOIN products p ON o.product_id = p.id JOIN customers c ON o.customer_id = c.id GROUP BY product_category, region; `
Coginiti Perspective
Coginiti's semantic layer centralizes aggregation definitions. SMDL measures declare their aggregation type (sum, count, count_distinct, avg, min, max, median, stdev, stdevp, var, varp, or custom expressions), and the MEASURE() function in Semantic SQL applies the correct aggregation automatically. This eliminates the common problem of different teams using different aggregation logic for the same metric, because the aggregation is defined once in the semantic model and enforced for every query.
Related Concepts
More in Analytics & Querying
Ad Hoc Query
An ad hoc query is an unplanned SQL query executed on demand to answer a specific, immediate question about data without prior optimization or scheduling.
Analytical Query
An analytical query is a SQL operation that aggregates, transforms, or examines data across multiple rows to produce summary results, statistics, or insights for decision-making.
BI (Business Intelligence)
Business Intelligence is the process of collecting, integrating, analyzing, and presenting data to support strategic and operational decision-making across an organization.
Cost-Based Optimization
Cost-based optimization is a query execution strategy where the optimizer estimates the computational cost of alternative execution plans and selects the plan with the lowest projected cost.
Data Exploration
Data exploration is the systematic investigation of datasets to understand structure, quality, distributions, relationships, and characteristics before formal analysis or modeling.
Dynamic Tables
Dynamic tables are incrementally updated materialized views that automatically compute and refresh only changed data, reducing compute costs while maintaining freshness.
See Semantic Intelligence in Action
Coginiti operationalizes business meaning across your entire data estate.