17. Window Functions & Analytics Queries
17. Window Functions & Analytics Queries¶
Learning Objectives¶
- Understand window function concepts and differences from regular aggregate functions
- Master OVER clause, partitions, and frame concepts
- Utilize ranking functions (ROW_NUMBER, RANK, DENSE_RANK)
- Utilize analytical functions (LEAD, LAG, FIRST_VALUE)
- Improve practical analytical query writing skills
Table of Contents¶
- Window Function Basics
- Ranking Functions
- Analytical Functions
- Aggregate Window Functions
- Frame Details
- Practical Use Patterns
- Practice Problems
1. Window Function Basics¶
1.1 What are Window Functions?¶
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Regular Aggregate vs Window Functions β
β β
β Regular Aggregate (GROUP BY) Window Function (OVER) β
β βββββββββββββββββ βββββββββββββββββ β
β β A | B | SUM β β A | B | val | SUM β
β βββββββββββββββββ€ βββββββββββββββββββββ€ β
β β X | | 150 β β X | 1 | 50 | 150 β β
β β Y | | 120 β β X | 2 | 100 | 150 β β
β βββββββββββββββββ β Y | 3 | 70 | 120 β β
β (rows collapsed to groups) β Y | 4 | 50 | 120 β β
β βββββββββββββββββββββ β
β (all rows kept + aggregate) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
1.2 Basic Syntax¶
-- Window function basic structure
function_name() OVER (
[PARTITION BY column] -- Group division (optional)
[ORDER BY column] -- Ordering (optional)
[frame_clause] -- Range specification (optional)
)
-- Example
SELECT
department,
employee_name,
salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
1.3 Sample Data¶
-- Create test table
CREATE TABLE sales (
id SERIAL PRIMARY KEY,
salesperson VARCHAR(50),
region VARCHAR(20),
sale_date DATE,
amount NUMERIC(10,2)
);
INSERT INTO sales (salesperson, region, sale_date, amount) VALUES
('Alice', 'East', '2024-01-15', 1000),
('Alice', 'East', '2024-01-20', 1500),
('Alice', 'East', '2024-02-10', 2000),
('Bob', 'East', '2024-01-18', 800),
('Bob', 'East', '2024-02-15', 1200),
('Charlie', 'West', '2024-01-10', 900),
('Charlie', 'West', '2024-01-25', 1100),
('Charlie', 'West', '2024-02-20', 1300),
('Diana', 'West', '2024-01-30', 700),
('Diana', 'West', '2024-02-05', 1600);
2. Ranking Functions¶
2.1 ROW_NUMBER, RANK, DENSE_RANK Comparison¶
-- Compare ranking functions
SELECT
salesperson,
amount,
ROW_NUMBER() OVER (ORDER BY amount DESC) AS row_num,
RANK() OVER (ORDER BY amount DESC) AS rank,
DENSE_RANK() OVER (ORDER BY amount DESC) AS dense_rank
FROM sales;
Example result (tie handling difference):
ββββββββββββββ¬βββββββββ¬ββββββββββ¬βββββββ¬βββββββββββββ
β salespersonβ amount β row_num β rank β dense_rank β
ββββββββββββββΌβββββββββΌββββββββββΌβββββββΌβββββββββββββ€
β Alice β 2000 β 1 β 1 β 1 β
β Diana β 1600 β 2 β 2 β 2 β
β Alice β 1500 β 3 β 3 β 3 β
β Charlie β 1300 β 4 β 4 β 4 β
β Bob β 1200 β 5 β 5 β 5 β
β Charlie β 1100 β 6 β 6 β 6 β
β Alice β 1000 β 7 β 7 β 7 β -- no ties
β Charlie β 900 β 8 β 8 β 8 β
β Bob β 800 β 9 β 9 β 9 β
β Diana β 700 β 10 β 10 β 10 β
ββββββββββββββ΄βββββββββ΄ββββββββββ΄βββββββ΄βββββββββββββ
When there are ties:
β A β 1000 β 1 β 1 β 1 β
β B β 1000 β 2 β 1 β 1 β -- tie!
β C β 900 β 3 β 3 β 2 β
(sequential) (skip) (sequential)
2.2 Ranking with PARTITION BY¶
-- Regional sales ranking
SELECT
region,
salesperson,
amount,
RANK() OVER (
PARTITION BY region
ORDER BY amount DESC
) AS region_rank
FROM sales;
Result:
ββββββββββ¬βββββββββββββ¬βββββββββ¬ββββββββββββββ
β region β salespersonβ amount β region_rank β
ββββββββββΌβββββββββββββΌβββββββββΌββββββββββββββ€
β East β Alice β 2000 β 1 β
β East β Alice β 1500 β 2 β
β East β Bob β 1200 β 3 β
β East β Alice β 1000 β 4 β
β East β Bob β 800 β 5 β
β West β Diana β 1600 β 1 β β partition reset
β West β Charlie β 1300 β 2 β
β West β Charlie β 1100 β 3 β
β West β Charlie β 900 β 4 β
β West β Diana β 700 β 5 β
ββββββββββ΄βββββββββββββ΄βββββββββ΄ββββββββββββββ
2.3 NTILE - Assign Quantiles¶
-- Divide into 4 quartiles
SELECT
salesperson,
amount,
NTILE(4) OVER (ORDER BY amount DESC) AS quartile
FROM sales;
-- Use case: identify top 25% customers
SELECT *
FROM (
SELECT
customer_id,
total_purchase,
NTILE(4) OVER (ORDER BY total_purchase DESC) AS quartile
FROM customer_summary
) sub
WHERE quartile = 1; -- top 25%
2.4 Top-N Queries¶
-- Extract top 2 sales per region
SELECT *
FROM (
SELECT
region,
salesperson,
sale_date,
amount,
ROW_NUMBER() OVER (
PARTITION BY region
ORDER BY amount DESC
) AS rn
FROM sales
) ranked
WHERE rn <= 2;
3. Analytical Functions¶
3.1 LAG and LEAD¶
-- LAG: reference previous row value
-- LEAD: reference next row value
SELECT
salesperson,
sale_date,
amount,
LAG(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
) AS prev_amount,
LEAD(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
) AS next_amount
FROM sales
ORDER BY salesperson, sale_date;
Result:
ββββββββββββββ¬βββββββββββββ¬βββββββββ¬ββββββββββββββ¬ββββββββββββββ
β salespersonβ sale_date β amount β prev_amount β next_amount β
ββββββββββββββΌβββββββββββββΌβββββββββΌββββββββββββββΌββββββββββββββ€
β Alice β 2024-01-15 β 1000 β NULL β 1500 β
β Alice β 2024-01-20 β 1500 β 1000 β 2000 β
β Alice β 2024-02-10 β 2000 β 1500 β NULL β
β Bob β 2024-01-18 β 800 β NULL β 1200 β
β Bob β 2024-02-15 β 1200 β 800 β NULL β
ββββββββββββββ΄βββββββββββββ΄βββββββββ΄ββββββββββββββ΄ββββββββββββββ
3.2 Calculate Growth Rate¶
-- Month-over-month growth rate
SELECT
salesperson,
sale_date,
amount,
LAG(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
) AS prev_amount,
ROUND(
(amount - LAG(amount) OVER (
PARTITION BY salesperson ORDER BY sale_date
)) * 100.0 /
NULLIF(LAG(amount) OVER (
PARTITION BY salesperson ORDER BY sale_date
), 0),
2
) AS growth_pct
FROM sales
ORDER BY salesperson, sale_date;
3.3 FIRST_VALUE, LAST_VALUE, NTH_VALUE¶
-- First/last value within partition
SELECT
salesperson,
sale_date,
amount,
FIRST_VALUE(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_sale,
LAST_VALUE(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_sale,
NTH_VALUE(amount, 2) OVER (
PARTITION BY salesperson
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS second_sale
FROM sales;
4. Aggregate Window Functions¶
4.1 SUM, AVG, COUNT¶
-- Window aggregate functions
SELECT
salesperson,
sale_date,
amount,
-- Total by salesperson
SUM(amount) OVER (PARTITION BY salesperson) AS person_total,
-- Average by salesperson
AVG(amount) OVER (PARTITION BY salesperson) AS person_avg,
-- Percentage of total
ROUND(amount * 100.0 / SUM(amount) OVER (), 2) AS pct_of_total
FROM sales
ORDER BY salesperson, sale_date;
4.2 Running Total¶
-- Cumulative sum
SELECT
sale_date,
amount,
SUM(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM sales
ORDER BY sale_date;
-- Cumulative sum by salesperson
SELECT
salesperson,
sale_date,
amount,
SUM(amount) OVER (
PARTITION BY salesperson
ORDER BY sale_date
) AS cumulative_sales
FROM sales
ORDER BY salesperson, sale_date;
4.3 Moving Average¶
-- Moving average of last 3 records
SELECT
sale_date,
amount,
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3
FROM sales
ORDER BY sale_date;
-- Centered moving average (including 1 before and after)
SELECT
sale_date,
amount,
AVG(amount) OVER (
ORDER BY sale_date
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS centered_avg
FROM sales
ORDER BY sale_date;
5. Frame Details¶
5.1 Frame Syntax¶
ROWS | RANGE | GROUPS BETWEEN start_point AND end_point
start_point/end_point:
- UNBOUNDED PRECEDING -- partition start
- n PRECEDING -- n rows before
- CURRENT ROW -- current row
- n FOLLOWING -- n rows after
- UNBOUNDED FOLLOWING -- partition end
5.2 ROWS vs RANGE¶
-- Test with tie data to see difference
CREATE TABLE test_frame (
id INT,
val INT
);
INSERT INTO test_frame VALUES (1, 100), (2, 100), (3, 200), (4, 200), (5, 300);
-- ROWS: physical row-based
SELECT
id, val,
SUM(val) OVER (ORDER BY val ROWS UNBOUNDED PRECEDING) AS rows_sum
FROM test_frame;
-- RANGE: logical value-based (same values grouped)
SELECT
id, val,
SUM(val) OVER (ORDER BY val RANGE UNBOUNDED PRECEDING) AS range_sum
FROM test_frame;
Result comparison:
ROWS: RANGE:
ββββββ¬ββββββ¬βββββββββββ ββββββ¬ββββββ¬ββββββββββββ
β id β val β rows_sum β β id β val β range_sum β
ββββββΌββββββΌβββββββββββ€ ββββββΌββββββΌββββββββββββ€
β 1 β 100 β 100 β β 1 β 100 β 200 β β two 100s
β 2 β 100 β 200 β β 2 β 100 β 200 β β same
β 3 β 200 β 400 β β 3 β 200 β 600 β β two 200s
β 4 β 200 β 600 β β 4 β 200 β 600 β β same
β 5 β 300 β 900 β β 5 β 300 β 900 β
ββββββ΄ββββββ΄βββββββββββ ββββββ΄ββββββ΄ββββββββββββ
5.3 GROUPS (PostgreSQL 11+)¶
-- GROUPS: same ORDER BY values as one group
SELECT
id, val,
SUM(val) OVER (
ORDER BY val
GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW
) AS groups_sum
FROM test_frame;
5.4 EXCLUDE Clause (PostgreSQL 11+)¶
-- Exclude specific rows from frame
SELECT
id, val,
SUM(val) OVER (
ORDER BY val
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
EXCLUDE CURRENT ROW -- exclude current row
) AS sum_excluding_current
FROM test_frame;
-- EXCLUDE options:
-- EXCLUDE NO OTHERS (default)
-- EXCLUDE CURRENT ROW
-- EXCLUDE GROUP (current row and same values)
-- EXCLUDE TIES (same values except current row)
6. Practical Use Patterns¶
6.1 Daily Cumulative Sales and Target Achievement¶
SELECT
sale_date,
amount,
SUM(amount) OVER (
ORDER BY sale_date
) AS cumulative_sales,
ROUND(
SUM(amount) OVER (ORDER BY sale_date) * 100.0 / 10000,
2
) AS target_pct -- target: 10,000
FROM sales
ORDER BY sale_date;
6.2 Outlier Detection¶
-- Data beyond average Β± 2 standard deviations
WITH stats AS (
SELECT
salesperson,
amount,
AVG(amount) OVER (PARTITION BY salesperson) AS avg_amount,
STDDEV(amount) OVER (PARTITION BY salesperson) AS stddev_amount
FROM sales
)
SELECT *
FROM stats
WHERE amount > avg_amount + 2 * stddev_amount
OR amount < avg_amount - 2 * stddev_amount;
6.3 Consecutive Record Analysis¶
-- Calculate consecutive sales days
WITH daily_sales AS (
SELECT
salesperson,
sale_date,
sale_date - (ROW_NUMBER() OVER (
PARTITION BY salesperson
ORDER BY sale_date
))::int AS grp
FROM sales
)
SELECT
salesperson,
MIN(sale_date) AS streak_start,
MAX(sale_date) AS streak_end,
COUNT(*) AS streak_length
FROM daily_sales
GROUP BY salesperson, grp
ORDER BY salesperson, streak_start;
6.4 Row/Column Comparison Without Pivot¶
-- Current vs previous month vs same month last year
SELECT
salesperson,
DATE_TRUNC('month', sale_date) AS month,
SUM(amount) AS monthly_total,
LAG(SUM(amount)) OVER (
PARTITION BY salesperson
ORDER BY DATE_TRUNC('month', sale_date)
) AS prev_month,
LAG(SUM(amount), 12) OVER (
PARTITION BY salesperson
ORDER BY DATE_TRUNC('month', sale_date)
) AS same_month_last_year
FROM sales
GROUP BY salesperson, DATE_TRUNC('month', sale_date)
ORDER BY salesperson, month;
6.5 Sessionization¶
-- New session if gap exceeds 30 minutes
WITH events AS (
SELECT
user_id,
event_time,
LAG(event_time) OVER (
PARTITION BY user_id ORDER BY event_time
) AS prev_event_time
FROM user_events
),
session_flags AS (
SELECT
user_id,
event_time,
CASE
WHEN prev_event_time IS NULL THEN 1
WHEN event_time - prev_event_time > INTERVAL '30 minutes' THEN 1
ELSE 0
END AS is_new_session
FROM events
)
SELECT
user_id,
event_time,
SUM(is_new_session) OVER (
PARTITION BY user_id
ORDER BY event_time
) AS session_id
FROM session_flags;
6.6 Gap Filling¶
-- Generate date series and LEFT JOIN
WITH date_series AS (
SELECT generate_series(
'2024-01-01'::date,
'2024-01-31'::date,
'1 day'::interval
)::date AS date
),
daily_totals AS (
SELECT sale_date, SUM(amount) AS total
FROM sales
GROUP BY sale_date
)
SELECT
ds.date,
COALESCE(dt.total, 0) AS daily_total,
SUM(COALESCE(dt.total, 0)) OVER (ORDER BY ds.date) AS running_total
FROM date_series ds
LEFT JOIN daily_totals dt ON ds.date = dt.sale_date
ORDER BY ds.date;
6.7 Percentile Calculation¶
-- Calculate percentiles
SELECT
salesperson,
amount,
PERCENT_RANK() OVER (ORDER BY amount) AS pct_rank,
CUME_DIST() OVER (ORDER BY amount) AS cume_dist,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) OVER () AS median
FROM sales;
-- Median by group
SELECT DISTINCT
region,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount)
OVER (PARTITION BY region) AS median_by_region
FROM sales;
7. Practice Problems¶
Exercise 1: Sales Performance Analysis¶
Analyze each salesperson's sales amount and calculate: - Rank by salesperson - Percentage of total - Change from previous sale
-- Example answer
SELECT
salesperson,
sale_date,
amount,
RANK() OVER (ORDER BY amount DESC) AS overall_rank,
RANK() OVER (
PARTITION BY salesperson
ORDER BY amount DESC
) AS personal_rank,
ROUND(amount * 100.0 / SUM(amount) OVER (), 2) AS pct_of_total,
amount - LAG(amount) OVER (
PARTITION BY salesperson ORDER BY sale_date
) AS change_from_prev
FROM sales
ORDER BY salesperson, sale_date;
Exercise 2: Moving Sum¶
Calculate moving sum for the last 7 days.
-- Example answer
SELECT
sale_date,
amount,
SUM(amount) OVER (
ORDER BY sale_date
RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
) AS rolling_7day_sum
FROM sales
ORDER BY sale_date;
Exercise 3: Find Target Achievement Date¶
Find the first date when cumulative sales reached 5000.
-- Example answer
SELECT sale_date, cumulative
FROM (
SELECT
sale_date,
SUM(amount) OVER (ORDER BY sale_date) AS cumulative,
LAG(SUM(amount) OVER (ORDER BY sale_date)) OVER (ORDER BY sale_date) AS prev_cumulative
FROM sales
) sub
WHERE cumulative >= 5000
AND (prev_cumulative IS NULL OR prev_cumulative < 5000)
LIMIT 1;