SQL Learning
SQL GROUP BY Clause
SQL Learning
SQL GROUP BY Clause
Learn the essential SQL clause for grouping rows with shared values and applying aggregate functions to create powerful data summaries.
Understanding the SQL GROUP BY Clause
The GROUP BY
clause is a fundamental part of SQL used to group rows that have the same values in one or more specified columns into a summary row. Think of it as creating categories or buckets based on shared characteristics.
GROUP BY
is almost always used in conjunction with aggregate functions like COUNT
, SUM
, AVG
, MIN
, or MAX
. These functions perform calculations on each group, allowing you to summarize large datasets into meaningful insights (e.g., total sales per region, average views per user, count of orders per customer).
Basic Syntax
SELECT column_to_group_by_1, [column_to_group_by_2, ...],
AGGREGATE_FUNCTION(column_to_aggregate)
FROM table_name
[WHERE condition] -- Filters rows BEFORE grouping
GROUP BY column_to_group_by_1, [column_to_group_by_2, ...]
[HAVING condition] -- Filters groups AFTER grouping
[ORDER BY column];
Key GROUP BY Concepts & Interview Questions
- The Golden Rule: Which columns from the
SELECT
list *must* appear in theGROUP BY
clause? (Answer: Any non-aggregated column). Why? - What is the processing order of
WHERE
,GROUP BY
, andHAVING
? - Explain the difference between filtering with
WHERE
and filtering withHAVING
in a query withGROUP BY
. - How does
GROUP BY
treat NULL values in the grouping column(s)? (Answer: Usually forms a separate group for NULLs). - What happens when you group by multiple columns? What defines a unique group?
GROUP BY Fundamentals
Simple Grouping and Aggregation
Group rows based on a single column and apply an aggregate function.
-- Count the number of videos created by each user
SELECT
user_id, -- Column we are grouping by
COUNT(*) AS number_of_videos -- Aggregate function applied to each group
FROM Videos
GROUP BY user_id -- Group rows with the same user_id together
ORDER BY number_of_videos DESC;
This query collapses all rows for each distinct user_id
into a single summary row, showing the user ID and the count of videos associated with that ID.
The "Golden Rule" of GROUP BY
Any column listed in the SELECT
clause that is not enclosed in an aggregate function (COUNT
, SUM
, AVG
, MIN
, MAX
, etc.) must be included in the GROUP BY
clause.
-- This query is INVALID and will cause an error
SELECT
user_id,
title, -- ERROR: 'title' is not aggregated and not in GROUP BY
COUNT(*) AS number_of_videos
FROM Videos
GROUP BY user_id;
The database doesn't know *which* title
to display for a user_id
group if that user has multiple videos. To fix this, you either need to remove title
from the SELECT
list or apply an aggregate function to it (e.g., MAX(title)
, though that might not be meaningful).
Including Descriptive Columns
To show descriptive information alongside aggregated values, join tables and include the necessary columns in both SELECT
and GROUP BY
.
-- Count videos per user, showing username
SELECT
u.user_id, -- Grouping column (often PK)
u.username, -- Non-aggregated column, must be in GROUP BY
COUNT(v.video_id) AS number_of_videos -- Aggregate function
FROM Users u
LEFT JOIN Videos v ON u.user_id = v.user_id
GROUP BY u.user_id, u.username -- Include all non-aggregated selected columns
ORDER BY number_of_videos DESC;
Because u.username
is selected and not aggregated, it must be listed in the GROUP BY
clause along with u.user_id
(even if user_id
functionally determines username
, standard SQL requires listing all non-aggregated selected columns).
Grouping by Expressions
You can group by the result of an expression or function applied to a column.
-- Count users by the year they joined
SELECT
STRFTIME('%Y', join_date) AS join_year, -- Extract year (syntax varies by DB)
COUNT(*) AS users_joined
FROM Users
WHERE join_date IS NOT NULL
GROUP BY join_year -- Group by the derived year value
ORDER BY join_year;
Here, we group users based on the year extracted from their join_date
.
Grouping by Multiple Columns
Grouping by more than one column creates groups based on the unique combination of values across all specified grouping columns.
Creating Finer-Grained Groups
Each unique combination of the grouping columns forms a distinct group.
-- Count interactions of each type ('like', 'comment') per video
SELECT
video_id, -- Grouping column 1
interaction_type, -- Grouping column 2
COUNT(*) AS count_per_type
FROM Interactions
GROUP BY video_id, interaction_type -- Group by the combination
ORDER BY video_id, interaction_type;
This query creates summary rows like (Video 1, 'like', 50), (Video 1, 'comment', 10), (Video 2, 'like', 120), etc., showing the breakdown of interaction types for each video.
Multi-Column Grouping with Joins
Combine joins with multi-column grouping for detailed categorical summaries.
-- Calculate average views per user per month of upload
SELECT
v.user_id,
u.username,
STRFTIME('%Y-%m', v.upload_date) AS upload_month, -- Grouping expression
COUNT(v.video_id) AS videos_in_month,
AVG(v.views) AS avg_views_in_month
FROM Videos v
JOIN Users u ON v.user_id = u.user_id
WHERE v.upload_date IS NOT NULL
GROUP BY v.user_id, u.username, upload_month -- Group by user AND month
ORDER BY v.user_id, upload_month;
This query groups by both the user and the month they uploaded videos, allowing analysis of user performance trends over time.
Filtering Groups with HAVING
The HAVING
clause is specifically designed to filter results *after* the GROUP BY
clause has been applied and aggregate functions have been calculated. It filters based on the aggregated values.
Filtering Based on Aggregate Results
-- Find users who have uploaded more than 3 videos
SELECT
user_id,
COUNT(*) AS number_of_videos
FROM Videos
GROUP BY user_id
HAVING COUNT(*) > 3 -- Filter based on the result of COUNT()
ORDER BY number_of_videos DESC;
HAVING
allows us to keep only those user groups where the video count (calculated by COUNT(*)
) is greater than 3.
Crucial Distinction: WHERE vs. HAVING
- WHERE Clause: Filters individual rows BEFORE they are grouped. Conditions in
WHERE
cannot refer to aggregate functions.
Example:WHERE country = 'USA'
(filters rows before grouping by user). - HAVING Clause: Filters entire groups AFTER grouping and aggregation occur. Conditions in
HAVING
typically involve aggregate functions.
Example:HAVING SUM(views) > 1000
(filters user groups based on their total views).
Think of the SQL query execution order (simplified): FROM ➡️ WHERE ➡️ GROUP BY ➡️ Aggregate Functions ➡️ HAVING ➡️ SELECT ➡️ ORDER BY.
Learn more about HAVINGCombining WHERE and HAVING
Use both clauses for multi-stage filtering.
-- Find users from 'USA' who have uploaded at least 2 videos in 2023
SELECT
v.user_id,
COUNT(*) AS count_2023_videos
FROM Videos v
JOIN Users u ON v.user_id = u.user_id
WHERE u.country = 'USA' -- Filter users BEFORE grouping
AND STRFTIME('%Y', v.upload_date) = '2023' -- Filter videos BEFORE grouping
GROUP BY v.user_id
HAVING COUNT(*) >= 2 -- Filter groups AFTER counting
ORDER BY count_2023_videos DESC;
This efficiently filters down the data before grouping and then applies a final filter on the aggregated counts.
Summary
The GROUP BY
clause is essential for data summarization and analysis in SQL:
- It groups rows with identical values in the specified grouping column(s).
- It's used with aggregate functions (
COUNT
,SUM
,AVG
,MIN
,MAX
) to perform calculations on each group. - Golden Rule: Any column in the
SELECT
list not used in an aggregate function must be included in theGROUP BY
clause. - You can group by multiple columns to create more specific subgroups based on unique combinations.
- You can group by expressions derived from columns (e.g., grouping by month or year).
- Use the
WHERE
clause to filter rows *before* grouping. - Use the
HAVING
clause to filter groups *after* aggregation, typically based on aggregate results.
Mastering GROUP BY
along with aggregate functions and the HAVING
clause is fundamental for anyone performing data analysis or reporting with SQL.