SQL Query => {Plan Enumeration, Query Optimizer, Cost Prediction} => Query Plan

Cost Prediction

1. Data Statistics in Postgres (PG)

Number of distinct values in column
Ratio of SQL null values in column
Most frequent values with associated frequency
Histograms approximating column data distribution

How to use statistics

Force DBMS to create statistics via VACUUM ANALYZE
- collects and updates distribution statistics of data in tables
Generated statistics are exploited by query optimizer
- Statistics causes good/poor query performance
Can access column statistics via pg**stats view
- E.g., tablename, attname, n**distinct, null**frac, …
- Find what query optimizer relies on and check for updates

2. Selectivity Model (Estimating Selectivity - pass % of rows)

Selectivity: estimate how much data will be filtered out (probability the rows satisfy predicate)
- More selective - filters more rows
- A selectivity of 1: all rows pass the condition (the predicate is true for all rows).
- A selectivity of 0.1 : 10% of the rows pass the condition, filtering out 90% of the data.
Estimate by constraints (uniqueness, primary keys) and data statistics
- E.g., Selectivity(Column=Value) = 1/NrValues
  - If you have a column with NrValues distinct values, the selectivity of a query where Column = Value is estimated as 1/NrValues.
  - Assumes Uniform Data - each value is equally likely
- E.g., NrRows(Column=Value) ≤ 1 if Key Column
  - If the column is a key column (i.e., it has a unique constraint like a primary key), the selectivity of a query Column = Value is 1 or less, since a key column can have at most one matching row.
- Selectivity(A⋀B) = Selectivity(A) * Selectivity(B)
  - Assumes Independent Predicates (not affect each other)

3. Cardinality Model (produce how many rows)

Cardinality: the number of rows produced by a query
Helps to predict the cost of different query execution strategies

Estimate Cardinality

cardinality of single tables
- A table with 100 rows - cardinality of 100
cardinality product for tables in FROM clause (more than one table)
- Result rows number if predicates = true
Now multiply with selectivity estimates for all predicates
- Cardinality = Cardinality table A ** Cardinality table B ** …
- Estimate how many rows pass predicate filter

Estimate Page Size

Find page size for cost functions

Find Average byte size per Record
Find records number per page
- Data in pages (fixed-size blocks), a page has records.
- Pages cannot store fractional records → round down
Find total number of Pages:
- row number / record num per page

4. Cost Model (estimate the cost: I/O + operators)

Count & sum cost of page reads and writes by operators in the plan
Each operator
- input: how often is it read from disk?
- output: is it written to disk? Is it final result?
- Does the operator read/write intermediate results?

Plan Enumeration

Generate every possible plan
Estimate cost for each plan
Select plan with minimal cost

Query Plan Space

Decide order of operations and implementation
Apply heuristic restrictions

H1: Early Predicates and Projections

Processing more data => more expensive
Reduce data size ASAP
option: Add predicates (discarding rows)
option: Add projections (discard columns)

H2: Avoid Joins without Predicates

Join result size: product of input cardinality * selectivity
Selectivity = 1 if joining tables without predicates
Sub-optimal for very large join results
(Heuristic may discard optimal order in special cases)

H3: Left Deep Plans

Allows pipelining: joins pass on result parts in-memory
Allows to use indices on join columns (of base tables)

Asymptotic Analysis

num of join orders = num of possible ways (orders) to join the tables = O(nrTables!).
- For 3 tables: 3! = 6
- For 5 tables: 5!=120
Number of join orders <= total number of query execution plans.
- Join order many implementations (e.g., nested loops, hash joins)
Enumerating plans is considered impractical from the factorial growth
- 5 tables has 120 join orders and MANY plans

Principle of Optimality

Definition
- Find the most efficient & cheapest plan to join tables
- an optimal plan for the full join must include optimal plans for joining smaller subsets of the tables along the way.
- If any subset is joined sub-optimally, the overall plan can be improved by replacing it with the optimal one.
This plan joins table subsets “on the way”
1. looks for the optimal way to join smaller subsets of tables.
2. Once found the best way to join the subsets, it combines all results to build the full join plan.
Assume we use sub-optimal plan for joining table subset
Replacing by a better plan can only improve overall cost

Efficient Optimization

Find optimal plans for (smaller) sub-queries first
- Sub-query: joins subsets of tables
- if joins 5 tables, first find the best way to join subsets of 2 or 3 tables.
Compose optimal plans from optimal sub-query plans

Are you ready?

8. Query Optimizer

Cost Prediction

1. Data Statistics in Postgres (PG)

How to use statistics

2. Selectivity Model (Estimating Selectivity - pass % of rows)

3. Cardinality Model (produce how many rows)

Estimate Cardinality

Estimate Page Size

4. Cost Model (estimate the cost: I/O + operators)

Plan Enumeration

Query Plan Space

H1: Early Predicates and Projections

H2: Avoid Joins without Predicates

H3: Left Deep Plans

Asymptotic Analysis

Principle of Optimality

Efficient Optimization