LogoInterview Master

SQL Learning

SQL GROUP BY Clause

Understanding the SQL GROUP BY Clause

The GROUP BY clause is a fundamental part of SQL used to group rows that have the same values in one or more specified columns into a summary row. Think of it as creating categories or buckets based on shared characteristics.

GROUP BY is almost always used in conjunction with aggregate functions like COUNT, SUM, AVG, MIN, or MAX. These functions perform calculations on each group, allowing you to summarize large datasets into meaningful insights (e.g., total sales per region, average views per user, count of orders per customer).

Basic Syntax

SELECT column_to_group_by_1, [column_to_group_by_2, ...],
       AGGREGATE_FUNCTION(column_to_aggregate)
FROM table_name
[WHERE condition] -- Filters rows BEFORE grouping
GROUP BY column_to_group_by_1, [column_to_group_by_2, ...]
[HAVING condition] -- Filters groups AFTER grouping
[ORDER BY column];

Key GROUP BY Concepts & Interview Questions

  • The Golden Rule: Which columns from the SELECT list *must* appear in the GROUP BY clause? (Answer: Any non-aggregated column). Why?
  • What is the processing order of WHERE, GROUP BY, and HAVING?
  • Explain the difference between filtering with WHERE and filtering with HAVING in a query with GROUP BY.
  • How does GROUP BY treat NULL values in the grouping column(s)? (Answer: Usually forms a separate group for NULLs).
  • What happens when you group by multiple columns? What defines a unique group?

GROUP BY Fundamentals

Simple Grouping and Aggregation

Group rows based on a single column and apply an aggregate function.

-- Count the number of videos created by each user
SELECT
  user_id, -- Column we are grouping by
  COUNT(*) AS number_of_videos -- Aggregate function applied to each group
FROM Videos
GROUP BY user_id -- Group rows with the same user_id together
ORDER BY number_of_videos DESC;

This query collapses all rows for each distinct user_id into a single summary row, showing the user ID and the count of videos associated with that ID.

The "Golden Rule" of GROUP BY

Any column listed in the SELECT clause that is not enclosed in an aggregate function (COUNT, SUM, AVG, MIN, MAX, etc.) must be included in the GROUP BY clause.

-- This query is INVALID and will cause an error
SELECT
  user_id,
  title, -- ERROR: 'title' is not aggregated and not in GROUP BY
  COUNT(*) AS number_of_videos
FROM Videos
GROUP BY user_id;

The database doesn't know *which* title to display for a user_id group if that user has multiple videos. To fix this, you either need to remove title from the SELECT list or apply an aggregate function to it (e.g., MAX(title), though that might not be meaningful).

Including Descriptive Columns

To show descriptive information alongside aggregated values, join tables and include the necessary columns in both SELECT and GROUP BY.

-- Count videos per user, showing username
SELECT
  u.user_id, -- Grouping column (often PK)
  u.username, -- Non-aggregated column, must be in GROUP BY
  COUNT(v.video_id) AS number_of_videos -- Aggregate function
FROM Users u
LEFT JOIN Videos v ON u.user_id = v.user_id
GROUP BY u.user_id, u.username -- Include all non-aggregated selected columns
ORDER BY number_of_videos DESC;

Because u.username is selected and not aggregated, it must be listed in the GROUP BY clause along with u.user_id (even if user_id functionally determines username, standard SQL requires listing all non-aggregated selected columns).

Grouping by Expressions

You can group by the result of an expression or function applied to a column.

-- Count users by the year they joined
SELECT
  STRFTIME('%Y', join_date) AS join_year, -- Extract year (syntax varies by DB)
  COUNT(*) AS users_joined
FROM Users
WHERE join_date IS NOT NULL
GROUP BY join_year -- Group by the derived year value
ORDER BY join_year;

Here, we group users based on the year extracted from their join_date.

Grouping by Multiple Columns

Grouping by more than one column creates groups based on the unique combination of values across all specified grouping columns.

Creating Finer-Grained Groups

Each unique combination of the grouping columns forms a distinct group.

-- Count interactions of each type ('like', 'comment') per video
SELECT
  video_id,         -- Grouping column 1
  interaction_type, -- Grouping column 2
  COUNT(*) AS count_per_type
FROM Interactions
GROUP BY video_id, interaction_type -- Group by the combination
ORDER BY video_id, interaction_type;

This query creates summary rows like (Video 1, 'like', 50), (Video 1, 'comment', 10), (Video 2, 'like', 120), etc., showing the breakdown of interaction types for each video.

Multi-Column Grouping with Joins

Combine joins with multi-column grouping for detailed categorical summaries.

-- Calculate average views per user per month of upload
SELECT
  v.user_id,
  u.username,
  STRFTIME('%Y-%m', v.upload_date) AS upload_month, -- Grouping expression
  COUNT(v.video_id) AS videos_in_month,
  AVG(v.views) AS avg_views_in_month
FROM Videos v
JOIN Users u ON v.user_id = u.user_id
WHERE v.upload_date IS NOT NULL
GROUP BY v.user_id, u.username, upload_month -- Group by user AND month
ORDER BY v.user_id, upload_month;

This query groups by both the user and the month they uploaded videos, allowing analysis of user performance trends over time.

Filtering Groups with HAVING

The HAVING clause is specifically designed to filter results *after* the GROUP BY clause has been applied and aggregate functions have been calculated. It filters based on the aggregated values.

Filtering Based on Aggregate Results

-- Find users who have uploaded more than 3 videos
SELECT
  user_id,
  COUNT(*) AS number_of_videos
FROM Videos
GROUP BY user_id
HAVING COUNT(*) > 3 -- Filter based on the result of COUNT()
ORDER BY number_of_videos DESC;

HAVING allows us to keep only those user groups where the video count (calculated by COUNT(*)) is greater than 3.

Crucial Distinction: WHERE vs. HAVING

  • WHERE Clause: Filters individual rows BEFORE they are grouped. Conditions in WHERE cannot refer to aggregate functions.
    Example: WHERE country = 'USA' (filters rows before grouping by user).
  • HAVING Clause: Filters entire groups AFTER grouping and aggregation occur. Conditions in HAVING typically involve aggregate functions.
    Example: HAVING SUM(views) > 1000 (filters user groups based on their total views).

Think of the SQL query execution order (simplified): FROM ➡️ WHERE ➡️ GROUP BY ➡️ Aggregate Functions ➡️ HAVING ➡️ SELECT ➡️ ORDER BY.

Learn more about HAVING

Combining WHERE and HAVING

Use both clauses for multi-stage filtering.

-- Find users from 'USA' who have uploaded at least 2 videos in 2023
SELECT
  v.user_id,
  COUNT(*) AS count_2023_videos
FROM Videos v
JOIN Users u ON v.user_id = u.user_id
WHERE u.country = 'USA' -- Filter users BEFORE grouping
  AND STRFTIME('%Y', v.upload_date) = '2023' -- Filter videos BEFORE grouping
GROUP BY v.user_id
HAVING COUNT(*) >= 2 -- Filter groups AFTER counting
ORDER BY count_2023_videos DESC;

This efficiently filters down the data before grouping and then applies a final filter on the aggregated counts.

Summary

The GROUP BY clause is essential for data summarization and analysis in SQL:

  • It groups rows with identical values in the specified grouping column(s).
  • It's used with aggregate functions (COUNT, SUM, AVG, MIN, MAX) to perform calculations on each group.
  • Golden Rule: Any column in the SELECT list not used in an aggregate function must be included in the GROUP BY clause.
  • You can group by multiple columns to create more specific subgroups based on unique combinations.
  • You can group by expressions derived from columns (e.g., grouping by month or year).
  • Use the WHERE clause to filter rows *before* grouping.
  • Use the HAVING clause to filter groups *after* aggregation, typically based on aggregate results.

Mastering GROUP BY along with aggregate functions and the HAVING clause is fundamental for anyone performing data analysis or reporting with SQL.

Loading SQL editor...