Glossary/Analytics & Querying

Cost-Based Optimization

Cost-based optimization is a query execution strategy where the optimizer estimates the computational cost of alternative execution plans and selects the plan with the lowest projected cost.

Early database optimizers used heuristic rules (e.g., always use the most selective filter first) but often made poor decisions without understanding data characteristics. Cost-based optimization uses statistics about table sizes, value distributions, and column cardinalities to estimate the cost of each alternative plan, then selects the cheapest option.

The process requires accurate statistics. If the optimizer believes a filter matches 1000 rows but it actually matches 1 million, the cost estimates become meaningless, leading to poor plan selection. Modern systems gather statistics automatically, but they can become stale as data evolves. Database administrators must periodically refresh statistics, especially after large data changes.

Cost estimates incorporate multiple factors: I/O cost (reading files), CPU cost (processing rows), and network cost (distributing data in distributed systems). Different execution engines weight these differently. Columnar systems emphasize I/O efficiency, while distributed systems focus on minimizing network shuffles. The optimizer's cost model must match the actual hardware and software characteristics to make good decisions.

Key Characteristics

  • Estimate computational cost of multiple execution plans
  • Select plan with lowest projected cost
  • Rely on table statistics for accurate cost estimation
  • Account for CPU, I/O, and network costs
  • Automatically explore alternative plans without user intervention
  • Update statistics periodically to maintain accuracy

Why It Matters

  • Automatically selects efficient execution plans without manual tuning
  • Adapts to data changes as statistics update
  • Enables queries to run efficiently on data of varying sizes
  • Reduces need for specialized optimization expertise
  • Scales database usage by optimizing without per-query tuning
  • Provides consistent performance across different workload patterns

Example

`
-- Optimizer evaluates multiple plans for this query:
SELECT o.customer_id, SUM(o.order_value)
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.region = 'US' AND o.order_date >= '2024-01-01'
GROUP BY o.customer_id;

-- Plan A: Join orders to customers, then filter
-- Estimated cost: scan all 100M orders, hash join, then filter
-- Cost: 100M  10 (scan) + 100M  5 (join) + processing = 1500M

-- Plan B: Filter customers first, then join
-- Estimated cost: scan customers, filter to 10M US customers, join
-- Cost: 1M  10 (customer scan) + 10M  5 (join) + processing = 60M

-- Optimizer selects Plan B (lower estimated cost)

-- Statistics used:
-- customers table: 1M rows, region='US' selectivity = 10%
-- orders table: 100M rows, average 100 per customer

SELECT * FROM table_statistics
WHERE table_name = 'customers';
-- Returns: rows=1000000, avg_row_size=500
-- Returns: column stats for region distribution
`

Coginiti Perspective

Coginiti delegates cost-based optimization to the target platform's query engine, where it is most effective. CoginitiScript generates SQL that preserves the optimizer's ability to evaluate alternative plans, and the semantic layer translates Semantic SQL into platform-native queries that each engine's CBO can reason about. CoginitiScript's query tags also enable organizational cost-based thinking by attributing query costs to specific departments and projects, supporting decisions about which queries are worth optimizing.

Related Concepts

Query OptimizationQuery PlanStatistics (Table)Cardinality EstimationExecution PlanPredicate PushdownIndex Selection

See Semantic Intelligence in Action

Coginiti operationalizes business meaning across your entire data estate.