Handling skewness in features by applying transformation in Python
2022-02-13
Different approaches for finding feature importance using Random Forests
2022-02-15
Show all

Understanding GROUP BY, GROUPING SET, ROLL UP, and CUBE in SQL

18 mins read

GROUP BY

A table in a database has columns of information in it. Each column in a table represents an attribute that has values and sometimes the same value. The group by clause is used to identify rows that have the same value for the specified attribute (i.e. duplicate value) and return a single row of information instead of all the rows where the attribute has the same value.
The GROUP BY clause can be used on one or more columns.

GROUP BY Syntax

The generic syntax of SQL GROUP by clause is as below.

SELECT expressions
FROM tables
[WHERE conditions]
GROUP BY column_name1, column_name2 ,…
[ORDER BY column_name1, column_name2 ,… ASC | DESC]

In this Syntax,

  • expressions – expressions defined here the column(s) or calculation you want to retrieve. If you want to retrieve all the columns simply use * in the place of expressions.
  • tables – one or more than one table from where you want to retrieve data.
  • WHERE conditions – Optional. This is used to specify some conditions while selecting data. In case you are not using WHERE clause all the rows available will be selected.
  • GROUP BY – GROUP BY along with the arguments defines the group by the columns that you specify in the GROUP BY clause.
  • ORDER BY – Optional. This argument is used to sort the resultset. If you want to sort on more than one column, you need to provide them in comma-separated.
  • ASC – Optional. ASC sorts the resultset in ascending order. This is the default behavior if no modifier is mentioned.
  • DESC – Optional. DESC sorts resultsets in descending order by expression.

Note – The SQL GROUP BY clause must be specified after WHERE conditions if specified and must be preceded by the ORDER BY clause if one is used.

GROUP BY Examples

Let us see how it works starting from the simple (i.e. single column scenario) to the complex (i.e. multiple column scenario).

NOTE: Columns are also referred to as fields and the terms are used interchangeably.

We will use the below sample table for reference and example.

table employees

1) GROUP BY Example – Grouping on a single column

The below query will find the number of employees working from different locations.

SELECT 
  emp_location, 
  count(emp_id) 'no of employees' 
FROM 
  employee 
GROUP BY 
  emp_location;

Output

sql server group by on single column

In the above example, we have applied a GROUP BY on the column emp_location and count aggregate function to find out the number of employees belonging to different locations.

2) GROUP BY Example – Grouping on multiple columns

Now let’s take a look at the usefulness of the GROUP BY on multiple columns.

Suppose we want to find the number of employees in each department in different locations. To achieve this we have to perform GROUP BY operation on both dept_code and emp_location columns.

SELECT 
  dept_code, 
  emp_location, 
  count(emp_id) 'no of employees' 
FROM 
  employee
GROUP BY 
  dept_code, 
  emp_location
ORDER BY 
  emp_location, 
  dept_code;

Output

sql server group by on multiple column

3) GROUP BY clause and aggregate functions

In real-world scenarios, most of the time group by are used along with aggregate functions to prepare the report.

For example, management wants to know the total number of sales that happened in a particular month for a specific product. In this case, count() aggregate function is along with the select list.

An aggregate function is used to perform calculations on a group and returns a unique value per group. There are different kinds of aggregate functions is available in SQL. For example, SUM() is used to get the sum of the specified column. Other commonly used aggregate functions are COUNT(), MIN() (minimum), MAX() (maximum), AVG() (average).

The GROUP BY clause in SQL groups the rows on the specified columns and an aggregate function returns the summary (count, sum, min, max, average, etc.) for each group.

For example, the following query returns the number of employees in each department.

SELECT 
  dept_code, 
  count(*)
FROM 
  employee
GROUP BY 
  dept_code;

If you are specifying more columns in the select expression but not specified in the aggregate function then those column names should be mentioned in the GROUP BY clause. Otherwise, the query will produce an error because there is no guarantee that those columns will return unique values to the group. For example, the below query will fail.

SELECT 
  emp_gender, 
  dept_code, 
  count(*)
FROM 
  employee
GROUP BY 
  dept_code;
Msg 8120, Level 16, State 1, Line 2
Column 'employee.emp_gender' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

GROUPING SET

A GROUPING SET is the extension of the GROUP BY clause in SQL. A GROUP BY clause allows us to group the resultset of a query on the basis of a column or a combination of columns. The GROUP is like a category and the resultset reflects the properties of the category. For example, we can create a group combining department and location in an employee table to find out what is the total number of employees working in a particular department in a location. For example, what is the total number of IT personnel working in Bangalore. We can go further and add employee type (say developers) to department and location to find out what is the number of developers working in the IT department in Bangalore location. But we cannot do both of these in a single query using GROUP BY. We will have to write 2 separate queries for the two separate combinations or groups (i.e. (department, location) and (department, location, employee type)). The GROUP BY clause does not allow us to specify more than one group in a query.

This is where GROUPING SET comes into the picture. It extends the capability of the GROUP BY clause by allowing us to specify multiple groups in a single query. Using GROUPING SET with a single query we can generate multiple groups and analyze and compare their properties. GROUPING SET basically means a set of GROUPS.

GROUPING SET Syntax

The basic syntax of GROUPING SET is given below.

SELECT
column1,
column2,
aggregate(column3)
FROM table
GROUP BY
GROUPING SETS (
(column1, column2),
(column1),
(column2),
()
);

In this syntax,

  • column –a column from the table in the SELECT list.
  • aggregate(column)– column on which aggregate function (i.e. SUM, COUNT, AVG, etc.) is used.  
  • GROUP BY – SQL keyword combination to specify a column or multiple columns as a single group.
  • GROUPING SET – SQL keyword combination followed by a set of groups.
  • () – optional. Specifies an empty grouping set. It aggregates on all columns individually.

SQL GROUPING SET with Examples

Instead of beginning with GROUPING SET we will begin with GROUP BY so that we can progressively extend the logic and get a proper and clear understanding of the significance and usage of the GROUPING SET clause. Let us imagine that we are running an electronics retail chain (primarily dealing in laptops and tablets) having outlets across the country. These outlets are supplied by warehouses in the major cities and state capitals. For the North, we have warehouses in Kolkata and Delhi. In the company database, we have a table called warehouse_stocks which stores the stock information in the two warehouses. The table is represented below. We will use this sample table as the reference in the following examples.

warehouse_cityproduct_categoryproduct_brandstock_quantity
DelhiLaptopToshiba2000
DelhiLaptopDell7000
DelhiLaptopHP2500
DelhiTabletSony3500
DelhiTabletSamsung4200
KolkataLaptopHP3000
KolkataLaptopDell4000
KolkataLaptopToshiba1000
KolkataTabletSony3000
KolkataTabletSamsung4000

Table: warehouse_stocks

Given the above table, we might want to analyze the stock situation of the different materials in the inventory in the warehouses. We can do so with the help of the following queries using the GROUP BY clause which creates and reports the stock figure for the different groups.

1)  Query to determine the total number of laptops and tablets from both warehouses

The below query creates a GROUP using the product_category column to determine the total number of laptops and tablets in the 2 warehouses. It uses the SUM aggregate function on the stock_quantity column for the purpose and the result is aliased as stock.

SELECT
product_category,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_category;

It produces the following desired output.

product_categorystock
Laptop19500
Tablet14700

2)  Query to determine the total stock of each brand from both warehouses

The below query creates a GROUP using the product_brand column to determine the total number of items available from each brand in the 2 warehouses. It uses the SUM aggregate function on the stock_quantity column for the purpose and the result is aliased as stock.

SELECT
product_brand,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_brand;

It produces the following desired output.

product_brandstock
Dell11000
HP5500
Samsung8200
Sony6500
Toshiba3000

3)  Query to determine the total number of laptops and tablets available in each warehouse

The below query creates a GROUP using the warehouse_city and product_category columns and to determine the total number of laptops and items available in each warehouse. It uses the SUM aggregate function on the stock_quantity column for the purpose and the result is aliased as stock.

SELECT
warehouse_city,
product_category,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY warehouse_city, product_category
ORDER BY 1;

It will produce the following desired output.

warehouse_cityproduct_categorystock
DelhiLaptop11500
DelhiTablet7700
KolkataLaptop8000
KolkataTablet7000

From the above, we see that we have to issue a fresh query every time to create a new group or category to analyze the data from a different point of view. One way to get around this is to combine all the query result sets using the UNION ALL operator to get a holistic view of the stock status. The following UNION ALL query does the same. Since the UNION ALL operator requires that the resultset of all the participating queries should have the same number of columns therefore we have added a dummy column NULL in the second and third queries to make the number of columns equal in all.

SELECT
warehouse_city,
product_category,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY warehouse_city, product_category
UNION ALL
SELECT NULL,
product_category,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_category
UNION ALL
SELECT
NULL,
product_brand,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY product_brand;

It will produce the following output which is an integrated resultset of all the above 3 queries.

warehouse_cityproduct_categorystock
DelhiLaptop11500
KolkataLaptop8000
DelhiTablet7700
KolkataTablet7000
NULLLaptop19500
NULLTablet14700
NULLDell11000
NULLHP5500
NULLSamsung8200
NULLSony6500
NULLToshiba3000

But the problem with this approach is that it is ad-hoc and cumbersome and more importantly inefficient. It puts pressure on the database server since the server has to run 3 separate queries and then combine the resultset of the first 2 queries and then combine that resultset with the third query. It requires multiple reads of the table and temporary storage and multiple IO’s. To overcome these shortfalls SQL server 2008 introduced the GROUPING SET feature which allows us to specify multiple GROUPS as a set in a single query. The following SQL query does the same by specifying all the GROUPS within the GROUPING SET clause so that the result consists of all the groups and their relevant details.

SELECT
warehouse_city,
product_category,
product_brand,
SUM(stock_quantity) stock
FROM warehouse_stocks
GROUP BY
GROUPING SETS (
(warehouse_city, product_category),
(product_category),
(product_brand)
()
);

It will produce the same integrated resultset of all the above 3 queries or the UNION ALL query with much less hassle and more technically efficient. There is one additional row though i.e. row number 6 which is the output of the optional empty grouping set () and gives the sum of the entire stock i.e. 34200. Since it is not a group it just aggregates the total stock_quantity.

warehouse_cityproduct_categoryproduct_brandstock
NULLNULLDell11000
NULLNULLHP5500
NULLNULLSamsung8200
NULLNULLSony6500
NULLNULLToshiba3000
NULLNULLNULL34200
DelhiLaptopNULL11500
KolkataLaptopNULL8000
NULLLaptopNULL19500
DelhiTabletNULL7700
KolkataTabletNULL7000
NULLTabletNULL14700

What is ROLLUP in SQL?

The ROLLUP operator is an extension of the GROUPING SET operator just like the GROUPING SET operator is an extension of the GROUP BY operator. The GROUP BY operator aggregates a single group. GROUPING SET aggregates multiple groups by allowing us to specify a set of groups in one query. ROLLUP also aggregates multiple groups with a single query but without the need to specify groups explicitly and additionally provides subtotals and grand total. It creates the groups automatically using the hierarchical relation between the specified correlated data columns.

SQL ROLLUP Syntax

The basic syntax of the ROLLUP operator is given below.

SELECT
column1, column2, aggregate_function (column 3)
FROM
table
GROUP BY ROLLUP (column1, column2);

In this syntax,

  • column – a column from the table in the SELECT list.
  • aggregate_function (column) – column on which aggregate function (i.e. SUM, COUNT, AVG etc.) is used.  
  • GROUP BY – SQL keyword combination to specify a column or multiple columns to create a group on which the to apply the aggregate.
  • ROLLUP – SQL keyword combination which creates multiple groups (i.e. grouping set) and applies the aggregate on them.

SQL ROLLUP with Examples

Let us try to understand when and how to use the ROLLUP operator in a SQL query. Suppose we are an electronics retail chain (primarily dealing in laptops and tablets) having outlets across the country. These outlets are supplied by warehouses in the major cities and state capitals. For the North, we have warehouses in Kolkata and Delhi. In the company database, we have a table called warehouse_stocks which stores the stock information in the two warehouses. The table is represented below. We will use this sample table for reference in our examples.

warehouse_cityproduct_categoryproduct_brandstock_quantity
DelhiLaptopToshiba2000
DelhiLaptopDell7000
DelhiLaptopHP2500
DelhiTabletSony3500
DelhiTabletSamsung4200
KolkataLaptopHP3000
KolkataLaptopDell4000
KolkataLaptopToshiba1000
KolkataTabletSony3000
KolkataTabletSamsung4000

Table: warehouse_stocks

Using the above table, we will analyze the inventory using GROUP BY, GROUPING SET, and ROLLUP. It will help us understand the differences between the different operators and the appropriate use case of the ROLLUP operator.

1) Single-column example

First, we will use the GROUP BY operator. The following query does the same on a single-column group (warehouse). The sum of stock quantity is aliased as stock in the query.

SELECT
warehouse_city,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY (warehouse_city);

The resultset of the above query is the following. It shows the total stock of both laptops and tablets in each warehouse.

warehouse_citystock
Delhi19200
Kolkata15000

Now let us execute the same above query using GROUPING SETS. The following query does the same.

SELECT
warehouse_city,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
GROUPING SETS (warehouse_city);

It produces the same resultset as the previous query. that is because it is a one-column group (warehouse_city) and not really a set of groups.

warehouse_citystock
Delhi19200
Kolkata15000

Now we will apply ROLLUP to the query and see the difference. The following query uses ROLLUP on the same column (warehouse_city).

SELECT
warehouse_city,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
ROLLUP (warehouse_city);

We can see that the output is different this time. It has 3 rows with an additional row showing the total combined stock from both warehouses. This is basically the output of the empty set () and the grand total of the group.

warehouse_citystock
Delhi19200
Kolkata15000
NULL34200

2) Double-column example

We will now add one more single-column group to the GROUPING SETS query. The following query uses GROUPING SETS to aggregate two single-column groups (warehouse_city) and (product_category). (Please note that we cannot use GROUP BY anymore as using GROUP BY we can aggregate only one group.)

SELECT
warehouse_city,
product_category,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
GROUPING SETS ((warehouse_city), (product_category));

It will produce the following resultset. The first 2 columns are aggregates for product_category and the second 2 columns for warehouse_city. Columns that are not part of the group show NULL in their column values.

warehouse_cityproduct_categorystock
NULLLaptop19500
NULLTablet14700
DelhiNULL19200
KolkataNULL15000

We will now subject the above query to ROLLUP. The following query does the same.

SELECT
warehouse_city,
product_category,
SUM (stock_quantity) stock
FROM warehouse_stocks
GROUP BY
ROLLUP ((warehouse_city), (product_category));

It will produce the following resultset. We can see that it has 7 rows. This is because rollup creates a set of 3 groups out of the 2 columns specified separately. They are –

  1. (warehouse_city, product_category),
  2. (warehouse_city) and
  3. () (i.e. empty set which returns the grand total of all stock from both warehouses, 19200+15000=34200).

They are highlighted in green, yellow, and blue respectively in the table.

This is the way ROLLUP operates. It takes a number of columns as its argument and then constructs the groups hierarchically. We do not need to and cannot specify groups explicitly as we did in the GROUPING SETS query.

warehouse_cityproduct_categorystock
DelhiLaptop11500
DelhiTablet7700
DelhiNULL19200
KolkataLaptop8000
KolkataTablet7000
KolkataNULL15000
NULLNULL34200

3) Adding all the columns to ROLLUP

In this final example, we will add all the 3 columns to the ROLLUP query. The following query does the same.

SELECT warehouse_city, product_category, product_brand, SUM (stock_quantity) stock 
FROM warehouse_stocks 
GROUP BY ROLLUP (warehouse_city, product_category, product_brand);

The resultset of the above query is below. It has 17 rows. This is because rollup has created a set of 4 groups out of the columns specified separately. They are –

  1. (warehouse_city, product_category, product_brand) which has 10 rows,
  2. (warehouse_city, product_category) which has 4 rows,
  3. (warehouse_city) which has 2 rows and
  4. () (i.e. empty set which returns the grand total of all stock from both warehouses, 19200+15000=34200).

They are highlighted in green, yellow, gray, and blue respectively in the table. It means that if we specify n columns in the query ROLLUP creates and reports the aggregates for n+1 sets of groups and it does so by removing one column at a time for each new group it aggregates as listed above.

So, we see that ROLLUP is an advanced option for automatic comprehensive hierarchical data analysis and reporting unlike GROUP BY and GROUPING SETS which are more for specific and custom data analysis on selected dimensions.

warehouse_cityproduct_categoryproduct_brandstock
DelhiLaptopDell7000
DelhiLaptopHP2500
DelhiLaptopToshiba2000
DelhiLaptopNULL11500
DelhiTabletSamsung4200
DelhiTabletSony3500
DelhiTabletNULL7700
DelhiNULLNULL19200
KolkataLaptopDell4000
KolkataLaptopHP3000
KolkataLaptopToshiba1000
KolkataLaptopNULL8000
KolkataTabletSamsung4000
KolkataTabletSony3000
KolkataTabletNULL7000
KolkataNULLNULL15000
NULLNULLNULL34200

What is CUBE in SQL?

CUBE is an extension of GROUPING SETS. It is almost the same as ROLLUP with one subtle difference. The difference is that it does an exhaustive grouping of the specified columns using all possible permutations.

For e.g. if we specify 3 table columns or fields (X, Y, Z) in rollup it will group as below. The total number of groups created and reported will be n+1 where n is the number of columns specified in the ROLLUP clause.

(X, Y, Z), (X, Y), (X), ()

But CUBE will group as below. The total number of groups created and reported in the resultset will be 2n where n is the number of columns specified in the CUBE clause.

(X, Y, Z), (X, Z), (X, Y), (X), (Y), ()

So, we can see it produces a larger resultset containing more rows of information.

SQL CUBE Syntax

The basic syntax of the CUBE is given below.

SELECT
column1, column2, aggregate_function (column 3)
FROM
table
GROUP BY CUBE (column1, column2);

In this syntax,

  • column – a column from the table in the SELECT list.
  • aggregate_function (column) – column on which aggregate function (i.e. SUM, COUNT, AVG, etc.) is used.  
  • GROUP BY – SQL keyword combination to specify a column or multiple columns to create a group on which to apply the aggregate.
  • CUBE – SQL keyword combination which creates an exhaustive grouping of the specified columns and aggregates them.

SQL CUBE with Example

Let us try to understand when and how to use the CUBE operator in a SQL query. Suppose we have an employee table containing information about employees in the company. The table is represented below. We will use this sample table for reference in our examples.

idnamegendersalarydepartment
1David JacksonMale5000IT
2Jim JamesonFemale6000HR
3Kate JohnsonFemale7500IT
4Will RayMale6500Marketing
5Shane MathewsFemale5500Finance
6Shed PriceMale8000Marketing
7Viktor SmithMale7200HR
8Vincent SmithsonFemale6600IT
9Janice StreepFemale5400Marketing
10Laura WellsFemale6300Finance
11Mac BullMale5700Marketing
12Patrick PattersonMale7000HR
13Julie OrbisonFemale7100IT
14Elice HemingwayFemale6800Marketing
15Wayne JohnsonMale5000Finance

Table: employee

Let us start by doing a ROLLUP on department and gender for employee salary. The following query does the same. The sum of employee salaries is aliased as ‘employee cost’.

SELECT
department, gender, sum(salary) AS 'employee cost'
FROM employee
GROUP BY
ROLLUP (department, gender);

The resultset of the above query is the following. It has 13 rows of information or records. They include the output of the following set of groups – (department, gender) with 8 records, (department) with 4 records, and () 1 record which is the total cost of the company of all employees of all genders and all departments i.e. 2+1=3 as mentioned in the introduction above.

departmentgenderemployee cost
FinanceFemale11800
FinanceMale5000
FinanceNULL16800
HRFemale6000
HRMale14200
HRNULL20200
ITFemale21200
ITMale5000
ITNULL26200
MarketingFemale12200
MarketingMale20200
MarketingNULL32400
NULLNULL95600

Now we will apply the CUBE operator on the same query and see the difference. The following query does the same.

SELECT
department, gender, sum(salary) AS 'employee cost'
FROM employee
GROUP BY
CUBE (department, gender);

The resultset of the above query is the following. It has 15 records i.e. 2 more than the output from ROLLUP. The additional records are highlighted in pink and it is the output of the (gender) group which was missing in the ROLLUP resultset i.e. the total cost to the company of the male employees and female employees separately. The total number of groups created and reported is 2n as mentioned above in the introduction i.e. (department, gender), (gender), (department), and ().

departmentgenderemployee cost
FinanceFemale11800
HRFemale6000
ITFemale21200
MarketingFemale12200
NULLFemale51200
FinanceMale5000
HRMale14200
ITMale5000
MarketingMale20200
NULLMale44400
NULLNULL95600
FinanceNULL16800
HRNULL20200
ITNULL26200
MarketingNULL32400

The output can be formatted with the help of coalesce to address gaps in values i.e. NULL values. The following query does the same making the resultset more readable and comprehensible.

SELECT
coalesce (department, 'All Department') AS department,
coalesce (gender, 'Both Genders') AS gender,
sum(salary) AS 'employee cost'
FROM employee
GROUP BY
CUBE (department, gender);

The formatted output is the following. The NULL values have been replaced with meaningful values.

departmentgenderemployee cost
FinanceFemale11800
HRFemale6000
ITFemale21200
MarketingFemale12200
All DepartmentFemale51200
FinanceMale5000
HRMale14200
ITMale5000
MarketingMale20200
All DepartmentMale44400
All DepartmentBoth Genders95600
FinanceBoth Genders16800
HRBoth Genders20200
ITBoth Genders26200
MarketingBoth Genders32400

ROLLUP (YEAR, MONTH, DAY)

With a ROLLUP, it will have the following outputs:

YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()

With CUBE, it will have the following:

YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()

Reference:

https://www.sqlservertutorial.org/

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.