A table in a database has columns of information in it. Each column in a table represents an attribute that has values and sometimes the same value. The group by clause is used to identify rows that have the same value for the specified attribute (i.e. duplicate value) and return a single row of information instead of all the rows where the attribute has the same value.
The GROUP BY clause can be used on one or more columns.
The generic syntax of SQL GROUP by clause is as below.
SELECT expressions
FROM tables
[WHERE conditions]
GROUP BY column_name1, column_name2 ,…
[ORDER BY column_name1, column_name2 ,… ASC | DESC]
In this Syntax,
Note – The SQL GROUP BY clause must be specified after WHERE conditions if specified and must be preceded by the ORDER BY clause if one is used.
Let us see how it works starting from the simple (i.e. single column scenario) to the complex (i.e. multiple column scenario).
NOTE: Columns are also referred to as fields and the terms are used interchangeably.
We will use the below sample table for reference and example.
The below query will find the number of employees working from different locations.
SELECT
emp_location,
count(emp_id) 'no of employees'
FROM
employee
GROUP BY
emp_location;
Output
In the above example, we have applied a GROUP BY on the column emp_location and count aggregate function to find out the number of employees belonging to different locations.
Now let’s take a look at the usefulness of the GROUP BY on multiple columns.
Suppose we want to find the number of employees in each department in different locations. To achieve this we have to perform GROUP BY operation on both dept_code and emp_location columns.
SELECT
dept_code,
emp_location,
count(emp_id) 'no of employees'
FROM
employee
GROUP BY
dept_code,
emp_location
ORDER BY
emp_location,
dept_code;
Output
In real-world scenarios, most of the time group by are used along with aggregate functions to prepare the report.
For example, management wants to know the total number of sales that happened in a particular month for a specific product. In this case, count() aggregate function is along with the select list.
An aggregate function is used to perform calculations on a group and returns a unique value per group. There are different kinds of aggregate functions is available in SQL. For example, SUM() is used to get the sum of the specified column. Other commonly used aggregate functions are COUNT(), MIN() (minimum), MAX() (maximum), AVG() (average).
The GROUP BY clause in SQL groups the rows on the specified columns and an aggregate function returns the summary (count, sum, min, max, average, etc.) for each group.
For example, the following query returns the number of employees in each department.
SELECT
dept_code,
count(*)
FROM
employee
GROUP BY
dept_code;
If you are specifying more columns in the select expression but not specified in the aggregate function then those column names should be mentioned in the GROUP BY clause. Otherwise, the query will produce an error because there is no guarantee that those columns will return unique values to the group. For example, the below query will fail.
SELECT
emp_gender,
dept_code,
count(*)
FROM
employee
GROUP BY
dept_code;
Msg 8120, Level 16, State 1, Line 2 Column 'employee.emp_gender' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
A GROUPING SET is the extension of the GROUP BY clause in SQL. A GROUP BY clause allows us to group the resultset of a query on the basis of a column or a combination of columns. The GROUP is like a category and the resultset reflects the properties of the category. For example, we can create a group combining department and location in an employee table to find out what is the total number of employees working in a particular department in a location. For example, what is the total number of IT personnel working in Bangalore. We can go further and add employee type (say developers) to department and location to find out what is the number of developers working in the IT department in Bangalore location. But we cannot do both of these in a single query using GROUP BY. We will have to write 2 separate queries for the two separate combinations or groups (i.e. (department, location) and (department, location, employee type)). The GROUP BY clause does not allow us to specify more than one group in a query.
This is where GROUPING SET comes into the picture. It extends the capability of the GROUP BY clause by allowing us to specify multiple groups in a single query. Using GROUPING SET with a single query we can generate multiple groups and analyze and compare their properties. GROUPING SET basically means a set of GROUPS.
The basic syntax of GROUPING SET is given below.
SELECT
column1,
column2,
aggregate(column3)
FROM table
GROUP BY
GROUPING SETS (
(column1, column2),
(column1),
(column2),
()
);
In this syntax,
Instead of beginning with GROUPING SET we will begin with GROUP BY so that we can progressively extend the logic and get a proper and clear understanding of the significance and usage of the GROUPING SET clause. Let us imagine that we are running an electronics retail chain (primarily dealing in laptops and tablets) having outlets across the country. These outlets are supplied by warehouses in the major cities and state capitals. For the North, we have warehouses in Kolkata and Delhi. In the company database, we have a table called warehouse_stocks which stores the stock information in the two warehouses. The table is represented below. We will use this sample table as the reference in the following examples.
warehouse_city | product_category | product_brand | stock_quantity |
Delhi | Laptop | Toshiba | 2000 |
Delhi | Laptop | Dell | 7000 |
Delhi | Laptop | HP | 2500 |
Delhi | Tablet | Sony | 3500 |
Delhi | Tablet | Samsung | 4200 |
Kolkata | Laptop | HP | 3000 |
Kolkata | Laptop | Dell | 4000 |
Kolkata | Laptop | Toshiba | 1000 |
Kolkata | Tablet | Sony | 3000 |
Kolkata | Tablet | Samsung | 4000 |
Table: warehouse_stocks
Given the above table, we might want to analyze the stock situation of the different materials in the inventory in the warehouses. We can do so with the help of the following queries using the GROUP BY clause which creates and reports the stock figure for the different groups.
The below query creates a GROUP using the product_category column to determine the total number of laptops and tablets in the 2 warehouses. It uses the SUM aggregate function on the stock_quantity column for the purpose and the result is aliased as stock.
SELECT
product_category,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_category;
It produces the following desired output.
product_category | stock |
Laptop | 19500 |
Tablet | 14700 |
The below query creates a GROUP using the product_brand column to determine the total number of items available from each brand in the 2 warehouses. It uses the SUM aggregate function on the stock_quantity column for the purpose and the result is aliased as stock.
SELECT
product_brand,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_brand;
It produces the following desired output.
product_brand | stock |
Dell | 11000 |
HP | 5500 |
Samsung | 8200 |
Sony | 6500 |
Toshiba | 3000 |
The below query creates a GROUP using the warehouse_city and product_category columns and to determine the total number of laptops and items available in each warehouse. It uses the SUM aggregate function on the stock_quantity column for the purpose and the result is aliased as stock.
SELECT
warehouse_city,
product_category,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY warehouse_city, product_category
ORDER BY 1;
It will produce the following desired output.
warehouse_city | product_category | stock |
Delhi | Laptop | 11500 |
Delhi | Tablet | 7700 |
Kolkata | Laptop | 8000 |
Kolkata | Tablet | 7000 |
From the above, we see that we have to issue a fresh query every time to create a new group or category to analyze the data from a different point of view. One way to get around this is to combine all the query result sets using the UNION ALL operator to get a holistic view of the stock status. The following UNION ALL query does the same. Since the UNION ALL operator requires that the resultset of all the participating queries should have the same number of columns therefore we have added a dummy column NULL in the second and third queries to make the number of columns equal in all.
SELECT
warehouse_city,
product_category,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY warehouse_city, product_category
UNION ALL
SELECT NULL,
product_category,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_category
UNION ALL
SELECT
NULL,
product_brand,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_brand;
It will produce the following output which is an integrated resultset of all the above 3 queries.
warehouse_city | product_category | stock |
Delhi | Laptop | 11500 |
Kolkata | Laptop | 8000 |
Delhi | Tablet | 7700 |
Kolkata | Tablet | 7000 |
NULL | Laptop | 19500 |
NULL | Tablet | 14700 |
NULL | Dell | 11000 |
NULL | HP | 5500 |
NULL | Samsung | 8200 |
NULL | Sony | 6500 |
NULL | Toshiba | 3000 |
But the problem with this approach is that it is ad-hoc and cumbersome and more importantly inefficient. It puts pressure on the database server since the server has to run 3 separate queries and then combine the resultset of the first 2 queries and then combine that resultset with the third query. It requires multiple reads of the table and temporary storage and multiple IO’s. To overcome these shortfalls SQL server 2008 introduced the GROUPING SET feature which allows us to specify multiple GROUPS as a set in a single query. The following SQL query does the same by specifying all the GROUPS within the GROUPING SET clause so that the result consists of all the groups and their relevant details.
SELECT
warehouse_city,
product_category,
product_brand,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY
GROUPING SETS (
(warehouse_city, product_category),
(product_category),
(product_brand)
()
);
It will produce the same integrated resultset of all the above 3 queries or the UNION ALL query with much less hassle and more technically efficient. There is one additional row though i.e. row number 6 which is the output of the optional empty grouping set () and gives the sum of the entire stock i.e. 34200. Since it is not a group it just aggregates the total stock_quantity.
warehouse_city | product_category | product_brand | stock |
NULL | NULL | Dell | 11000 |
NULL | NULL | HP | 5500 |
NULL | NULL | Samsung | 8200 |
NULL | NULL | Sony | 6500 |
NULL | NULL | Toshiba | 3000 |
NULL | NULL | NULL | 34200 |
Delhi | Laptop | NULL | 11500 |
Kolkata | Laptop | NULL | 8000 |
NULL | Laptop | NULL | 19500 |
Delhi | Tablet | NULL | 7700 |
Kolkata | Tablet | NULL | 7000 |
NULL | Tablet | NULL | 14700 |
The ROLLUP operator is an extension of the GROUPING SET operator just like the GROUPING SET operator is an extension of the GROUP BY operator. The GROUP BY operator aggregates a single group. GROUPING SET aggregates multiple groups by allowing us to specify a set of groups in one query. ROLLUP also aggregates multiple groups with a single query but without the need to specify groups explicitly and additionally provides subtotals and grand total. It creates the groups automatically using the hierarchical relation between the specified correlated data columns.
The basic syntax of the ROLLUP operator is given below.
SELECT
column1, column2, aggregate_function (column 3)
FROM
table
GROUP BY ROLLUP (column1, column2);
In this syntax,
Let us try to understand when and how to use the ROLLUP operator in a SQL query. Suppose we are an electronics retail chain (primarily dealing in laptops and tablets) having outlets across the country. These outlets are supplied by warehouses in the major cities and state capitals. For the North, we have warehouses in Kolkata and Delhi. In the company database, we have a table called warehouse_stocks which stores the stock information in the two warehouses. The table is represented below. We will use this sample table for reference in our examples.
warehouse_city | product_category | product_brand | stock_quantity |
Delhi | Laptop | Toshiba | 2000 |
Delhi | Laptop | Dell | 7000 |
Delhi | Laptop | HP | 2500 |
Delhi | Tablet | Sony | 3500 |
Delhi | Tablet | Samsung | 4200 |
Kolkata | Laptop | HP | 3000 |
Kolkata | Laptop | Dell | 4000 |
Kolkata | Laptop | Toshiba | 1000 |
Kolkata | Tablet | Sony | 3000 |
Kolkata | Tablet | Samsung | 4000 |
Table: warehouse_stocks
Using the above table, we will analyze the inventory using GROUP BY, GROUPING SET, and ROLLUP. It will help us understand the differences between the different operators and the appropriate use case of the ROLLUP operator.
First, we will use the GROUP BY operator. The following query does the same on a single-column group (warehouse). The sum of stock quantity is aliased as stock in the query.
SELECT
warehouse_city,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY (warehouse_city);
The resultset of the above query is the following. It shows the total stock of both laptops and tablets in each warehouse.
warehouse_city | stock |
Delhi | 19200 |
Kolkata | 15000 |
Now let us execute the same above query using GROUPING SETS. The following query does the same.
SELECT
warehouse_city,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
GROUPING SETS (warehouse_city);
It produces the same resultset as the previous query. that is because it is a one-column group (warehouse_city) and not really a set of groups.
warehouse_city | stock |
Delhi | 19200 |
Kolkata | 15000 |
Now we will apply ROLLUP to the query and see the difference. The following query uses ROLLUP on the same column (warehouse_city).
SELECT
warehouse_city,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
ROLLUP (warehouse_city);
We can see that the output is different this time. It has 3 rows with an additional row showing the total combined stock from both warehouses. This is basically the output of the empty set () and the grand total of the group.
warehouse_city | stock |
Delhi | 19200 |
Kolkata | 15000 |
NULL | 34200 |
We will now add one more single-column group to the GROUPING SETS query. The following query uses GROUPING SETS to aggregate two single-column groups (warehouse_city) and (product_category). (Please note that we cannot use GROUP BY anymore as using GROUP BY we can aggregate only one group.)
SELECT
warehouse_city,
product_category,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
GROUPING SETS ((warehouse_city), (product_category));
It will produce the following resultset. The first 2 columns are aggregates for product_category and the second 2 columns for warehouse_city. Columns that are not part of the group show NULL in their column values.
warehouse_city | product_category | stock |
NULL | Laptop | 19500 |
NULL | Tablet | 14700 |
Delhi | NULL | 19200 |
Kolkata | NULL | 15000 |
We will now subject the above query to ROLLUP. The following query does the same.
SELECT
warehouse_city,
product_category,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
ROLLUP ((warehouse_city), (product_category));
It will produce the following resultset. We can see that it has 7 rows. This is because rollup creates a set of 3 groups out of the 2 columns specified separately. They are –
They are highlighted in green, yellow, and blue respectively in the table.
This is the way ROLLUP operates. It takes a number of columns as its argument and then constructs the groups hierarchically. We do not need to and cannot specify groups explicitly as we did in the GROUPING SETS query.
warehouse_city | product_category | stock |
Delhi | Laptop | 11500 |
Delhi | Tablet | 7700 |
Delhi | NULL | 19200 |
Kolkata | Laptop | 8000 |
Kolkata | Tablet | 7000 |
Kolkata | NULL | 15000 |
NULL | NULL | 34200 |
In this final example, we will add all the 3 columns to the ROLLUP query. The following query does the same.
SELECT warehouse_city, product_category, product_brand, SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY ROLLUP (warehouse_city, product_category, product_brand);
The resultset of the above query is below. It has 17 rows. This is because rollup has created a set of 4 groups out of the columns specified separately. They are –
They are highlighted in green, yellow, gray, and blue respectively in the table. It means that if we specify n columns in the query ROLLUP creates and reports the aggregates for n+1 sets of groups and it does so by removing one column at a time for each new group it aggregates as listed above.
So, we see that ROLLUP is an advanced option for automatic comprehensive hierarchical data analysis and reporting unlike GROUP BY and GROUPING SETS which are more for specific and custom data analysis on selected dimensions.
warehouse_city | product_category | product_brand | stock |
Delhi | Laptop | Dell | 7000 |
Delhi | Laptop | HP | 2500 |
Delhi | Laptop | Toshiba | 2000 |
Delhi | Laptop | NULL | 11500 |
Delhi | Tablet | Samsung | 4200 |
Delhi | Tablet | Sony | 3500 |
Delhi | Tablet | NULL | 7700 |
Delhi | NULL | NULL | 19200 |
Kolkata | Laptop | Dell | 4000 |
Kolkata | Laptop | HP | 3000 |
Kolkata | Laptop | Toshiba | 1000 |
Kolkata | Laptop | NULL | 8000 |
Kolkata | Tablet | Samsung | 4000 |
Kolkata | Tablet | Sony | 3000 |
Kolkata | Tablet | NULL | 7000 |
Kolkata | NULL | NULL | 15000 |
NULL | NULL | NULL | 34200 |
CUBE is an extension of GROUPING SETS. It is almost the same as ROLLUP with one subtle difference. The difference is that it does an exhaustive grouping of the specified columns using all possible permutations.
For e.g. if we specify 3 table columns or fields (X, Y, Z) in rollup it will group as below. The total number of groups created and reported will be n+1 where n is the number of columns specified in the ROLLUP clause.
(X, Y, Z), (X, Y), (X), ()
But CUBE will group as below. The total number of groups created and reported in the resultset will be 2n where n is the number of columns specified in the CUBE clause.
(X, Y, Z), (X, Z), (X, Y), (X), (Y), ()
So, we can see it produces a larger resultset containing more rows of information.
The basic syntax of the CUBE is given below.
SELECT
column1, column2, aggregate_function (column 3)
FROM
table
GROUP BY CUBE (column1, column2);
In this syntax,
Let us try to understand when and how to use the CUBE operator in a SQL query. Suppose we have an employee table containing information about employees in the company. The table is represented below. We will use this sample table for reference in our examples.
id | name | gender | salary | department |
1 | David Jackson | Male | 5000 | IT |
2 | Jim Jameson | Female | 6000 | HR |
3 | Kate Johnson | Female | 7500 | IT |
4 | Will Ray | Male | 6500 | Marketing |
5 | Shane Mathews | Female | 5500 | Finance |
6 | Shed Price | Male | 8000 | Marketing |
7 | Viktor Smith | Male | 7200 | HR |
8 | Vincent Smithson | Female | 6600 | IT |
9 | Janice Streep | Female | 5400 | Marketing |
10 | Laura Wells | Female | 6300 | Finance |
11 | Mac Bull | Male | 5700 | Marketing |
12 | Patrick Patterson | Male | 7000 | HR |
13 | Julie Orbison | Female | 7100 | IT |
14 | Elice Hemingway | Female | 6800 | Marketing |
15 | Wayne Johnson | Male | 5000 | Finance |
Table: employee
Let us start by doing a ROLLUP on department and gender for employee salary. The following query does the same. The sum of employee salaries is aliased as ‘employee cost’.
SELECT
department, gender, sum(salary) AS 'employee cost'
FROM employee
GROUP BY
ROLLUP (department, gender);
The resultset of the above query is the following. It has 13 rows of information or records. They include the output of the following set of groups – (department, gender) with 8 records, (department) with 4 records, and () 1 record which is the total cost of the company of all employees of all genders and all departments i.e. 2+1=3 as mentioned in the introduction above.
department | gender | employee cost |
Finance | Female | 11800 |
Finance | Male | 5000 |
Finance | NULL | 16800 |
HR | Female | 6000 |
HR | Male | 14200 |
HR | NULL | 20200 |
IT | Female | 21200 |
IT | Male | 5000 |
IT | NULL | 26200 |
Marketing | Female | 12200 |
Marketing | Male | 20200 |
Marketing | NULL | 32400 |
NULL | NULL | 95600 |
Now we will apply the CUBE operator on the same query and see the difference. The following query does the same.
SELECT
department, gender, sum(salary) AS 'employee cost'
FROM employee
GROUP BY
CUBE (department, gender);
The resultset of the above query is the following. It has 15 records i.e. 2 more than the output from ROLLUP. The additional records are highlighted in pink and it is the output of the (gender) group which was missing in the ROLLUP resultset i.e. the total cost to the company of the male employees and female employees separately. The total number of groups created and reported is 2n as mentioned above in the introduction i.e. (department, gender), (gender), (department), and ().
department | gender | employee cost |
Finance | Female | 11800 |
HR | Female | 6000 |
IT | Female | 21200 |
Marketing | Female | 12200 |
NULL | Female | 51200 |
Finance | Male | 5000 |
HR | Male | 14200 |
IT | Male | 5000 |
Marketing | Male | 20200 |
NULL | Male | 44400 |
NULL | NULL | 95600 |
Finance | NULL | 16800 |
HR | NULL | 20200 |
IT | NULL | 26200 |
Marketing | NULL | 32400 |
The output can be formatted with the help of coalesce to address gaps in values i.e. NULL values. The following query does the same making the resultset more readable and comprehensible.
SELECT
coalesce (department, 'All Department') AS department,
coalesce (gender, 'Both Genders') AS gender,
sum(salary) AS 'employee cost'
FROM employee
GROUP BY
CUBE (department, gender);
The formatted output is the following. The NULL values have been replaced with meaningful values.
department | gender | employee cost |
Finance | Female | 11800 |
HR | Female | 6000 |
IT | Female | 21200 |
Marketing | Female | 12200 |
All Department | Female | 51200 |
Finance | Male | 5000 |
HR | Male | 14200 |
IT | Male | 5000 |
Marketing | Male | 20200 |
All Department | Male | 44400 |
All Department | Both Genders | 95600 |
Finance | Both Genders | 16800 |
HR | Both Genders | 20200 |
IT | Both Genders | 26200 |
Marketing | Both Genders | 32400 |
ROLLUP (YEAR, MONTH, DAY)
With a ROLLUP
, it will have the following outputs:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()
With CUBE
, it will have the following:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()
Reference: