Setup Apache Spark on a multi-node cluster
2022-07-14
Understanding the basics of audio data with Python code
2022-07-19
Show all

SQL Window Functions explained with example

32 mins read

All database users know about regular aggregate functions which operate on an entire table and are used with a GROUP BY clause. But very few people use Window functions in SQL. These operate on a set of rows and return a single aggregated value for each row. The main advantage of using Window functions over regular aggregate functions is that Window functions do not cause rows to become grouped into a single output row, the rows retain their separate identities and an aggregated value will be added to each row.

Let’s take a look at how Window functions work and then see a few examples of using it in practice to be sure that things are clear and also how the SQL and output compare to that for SUM() functions. As always be sure that you are fully backed up, especially if you are trying out new things with your database.

Introduction to SQL Window Functions

Window functions were first introduced to standard SQL in 2003. Per the PostgresSQL documentation:

“A window function performs a calculation across a set of table rows that are somehow related to the current row…Behind the scenes, the window function is able to access more than just the current row of the query result.”

Window functions are similar to the aggregation done in the GROUP BY clause. However, rows are not grouped into a single row, each row retains its separate identity. That is, a window function may return a single value for each row. Here’s a good visualization of what I mean by that.

Notice how the GROUP BY aggregation on the left-hand side of the picture groups the three rows into one single row. The window function on the right-hand side of the picture is able to output each row with an aggregation value. This may save you from having to do a join after the GROUP BY.

Example: GROUP BY versus Window Function

Here’s a quick example to give you a taste of what a window function does.

Let’s say we have some salary data and we want to find to create a column that gives us the average salary for each job title.

Example Salary Data
Group By versus Window Function

On the left is what a GROUP BY aggregation would return and on the right is what a window function would return. As you can see, the GROUP BY consolidates our data into just three rows. With a window function, we retain the original 11 rows and have a new column called AVG_SALARY. We could then choose to compare each individual’s salary to the average salary if desired.

Why use Window Functions?

One major advantage of window functions is that it allows you to work with both aggregate and non-aggregate values all at once because the rows are not collapsed together. Window functions are also simple to use and read. That is, they can reduce the complexity of your queries, which makes it easier to maintain down the road. In addition, they can help with performance issues. For example, you can use a window function instead of having to do a self-join or cross-join.

Important Note:

Before we start, it is important to note that in terms of the order of operations in SQL, window functions come in sixth on the list.

This is important because based on this logical order, window functions are allowed in SELECT and ORDER BY, but they are not allowed in FROMWHEREGROUP BY, or HAVING clauses.

Note: If you really need to have it inside a WHERE clause or GROUP BY clause, you may get around this limitation by using a subquery or a WITH query.

Window functions operate on a set of rows and return a single aggregated value for each row. The term Window describes the set of rows in the database on which the function will operate. We define the Window (set of rows on which functions operates) using an OVER() clause. We will discuss more about the OVER() clause.

The aggregate functions perform calculations across a set of rows and return a single output row. The following query uses the SUM() aggregate function to calculate the total salary of all employees in the company:

SELECT 
    SUM(salary) sum_salary
FROM
    employees;

Here is the output:

As shown clearly in the output, all rows from the  employees table are grouped into a single row. Similar to an aggregate function, a window function calculates on a set of rows. However, a window function does not cause rows to become grouped into a single output row.

The following query uses the SUM() as a window function. It returns the sum salary of all employees along with the salary of each individual employee:

SELECT
    first_name,
    last_name,
    salary,
    SUM(salary) OVER() sum_salary
FROM
    employees;

Here is the partial output:

sql window functions - SUM window function example

In this example, the OVER() clause signals that the SUM() function is used as a window function.

The following picture illustrates the main difference between aggregate functions and window functions:

sql window functions

SQL window function syntax

The syntax of the window functions is as follows:

window_function_name ( expression ) OVER (
    partition_clause
    order_clause
    frame_clause
)

window_function_name: The name of the supported window function such as ROW_NUMBER()RANK(), and SUM().

expression: The target expression or column on which the window function operates.

OVER clause: The OVER clause defines window partitions to form the groups of rows and specifies the orders of rows in a partition. The OVER clause consists of three clauses: partition, order, and frame clauses.

The partition clause divides the rows into partitions to which the window function applies. It has the following syntax:

PARTITION BY expr1, expr2, ...

If the PARTITION BY clause is not specified, then the whole result set is treated as a single partition.

The order clause specifies the orders of rows in a partition on which the window function operates:

ORDER BY 
    expression [ASC | DESC]  [NULL {FIRST| LAST}]
    ,...

A frame is the subset of the current partition. To define the frame, you use one of the following syntaxes:

{ RANGE | ROWS } frame_start
{ RANGE | ROWS } BETWEEN frame_start AND frame_end  

where frame_start is one of the following options:

N PRECEDING
UNBOUNDED PRECEDING
CURRENT ROW

and frame_end is one of the following options:

CURRENT ROW
UNBOUNDED FOLLOWING
N FOLLOWING

The following picture illustrates a frame and its options:

SQL window function frame
  • UNBOUNDED PRECEDING: the frame starts at the first row of the partition.
  • N PRECEDING: the frame starts at the Nth row before the current row.
  • CURRENT ROW: means the current row that is being evaluated.
  • UNBOUNDED FOLLOWING: the frame ends at the final row in the partition.
  • N FOLLOWING: the frame ends at the Nh row after the current row.

The ROWS or RANGE specifies the type of relationship between the current row and frame rows.

  •  ROWS: the offsets of the current row and frame rows are row numbers.
  •  RANGE: the offset of the current row and frame rows are row values.

SQL window function types

The window functions are divided into three types value window functions, aggregation window functions, and ranking window functions:

Value window functions:

FIRST_VALUE()
LAG()
LAST_VALUE()
LEAD()

Ranking window functions:

CUME_DIST()
DENSE_RANK()
NTILE()
PERCENT_RANK()
RANK()
ROW_NUMBER()

Aggregate window functions:

AVG()
COUNT()
MAX()
MIN()
SUM()

Types of Window functions

  • Aggregate Window Functions
    SUM(), MAX(), MIN(), AVG(). COUNT()
  • Ranking Window Functions
    RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()
  • Value Window Functions
    LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE()

Syntax

window_function ( [ ALL ] expression ) 
OVER ( [ PARTITION BY partition_list ] [ ORDER BY order_list] )

Arguments

window_function
Specify the name of the window function

ALL
ALL is an optional keyword. When you will include ALL it will count all values including duplicate ones. DISTINCT is not supported in window functions

expression
The target column or expression that the functions operate on. In other words, the name of the column for which we need an aggregated value. For example, a column containing the order amount so that we can see the total orders received.

OVER
Specifies the window clauses for aggregate functions.

PARTITION BY partition_list
Defines the window (set of rows on which window function operates) for window functions. We need to provide a field or list of fields for the partition after the PARTITION BY clause. Multiple fields need to be separated by a comma as usual. If PARTITION BY is not specified, the grouping will be done on the entire table and values will be aggregated accordingly.

ORDER BY order_list
Sorts the rows within each partition. If ORDER BY is not specified, ORDER BY uses the entire table.

Examples

Let’s create a table and insert dummy records to write further queries. Run the below code.

CREATE TABLE [dbo].[Orders]
(
	order_id INT,
	order_date DATE,
	customer_name VARCHAR(250),
	city VARCHAR(100),	
	order_amount MONEY
)
 
INSERT INTO [dbo].[Orders]
SELECT '1001','04/01/2017','David Smith','GuildFord',10000
UNION ALL	  
SELECT '1002','04/02/2017','David Jones','Arlington',20000
UNION ALL	  
SELECT '1003','04/03/2017','John Smith','Shalford',5000
UNION ALL	  
SELECT '1004','04/04/2017','Michael Smith','GuildFord',15000
UNION ALL	  
SELECT '1005','04/05/2017','David Williams','Shalford',7000
UNION ALL	  
SELECT '1006','04/06/2017','Paum Smith','GuildFord',25000
UNION ALL	 
SELECT '1007','04/10/2017','Andrew Smith','Arlington',15000
UNION ALL	  
SELECT '1008','04/11/2017','David Brown','Arlington',2000
UNION ALL	  
SELECT '1009','04/20/2017','Robert Smith','Shalford',1000
UNION ALL	  
SELECT '1010','04/25/2017','Peter Smith','GuildFord',500

Aggregate Window Functions

SUM()

We all know the SUM() aggregate function. It does the sum of a specified field for the specified group (like city, state, country, etc.) or for the entire table if the group is not specified. We will see what will be the output of the regular SUM() aggregate function and window SUM() aggregate function.

The following is an example of a regular SUM() aggregate function. It sums the order amount for each city. You can see from the result set that a regular aggregate function groups multiple rows into a single output row, which causes individual rows to lose their identity.

SELECT city, SUM(order_amount) total_order_amount
FROM [dbo].[Orders] GROUP BY city

This does not happen with window aggregate functions. Rows retain their identity and also show an aggregated value for each row. In the example below the query does the same thing, namely, it aggregates the data for each city and shows the sum of the total order amount for each of them. However, the query now inserts another column for the total order amount so that each row retains its identity. The column marked grand_total is the new column in the example below.

SELECT order_id, order_date, customer_name, city, order_amount
 ,SUM(order_amount) OVER(PARTITION BY city) as grand_total 
FROM [dbo].[Orders]

AVG()

AVG or Average works in exactly the same way with a Window function. The following query will give you the average order amount for each city and for each month (although for simplicity we’ve only used data in one month).

We specify more than one average by specifying multiple fields in the partition list. It is also worth noting that you can use expressions in the lists like MONTH(order_date) as shown in the query below. As ever you can make these expressions as complex as you want so long as the syntax is correct!

SELECT order_id, order_date, customer_name, city, order_amount
 ,AVG(order_amount) OVER(PARTITION BY city, MONTH(order_date)) as  average_order_amount 
FROM [dbo].[Orders]

From the above image, we can clearly see that on average we have received orders of 12,333 for Arlington city for April 2017.

Average Order Amount = Total Order Amount / Total Orders
                                       = (20,000 + 15,000 + 2,000) / 3
                                       = 12,333

You can also use the combination of SUM() & COUNT() function to calculate an average.

MIN()

The MIN() aggregate function will find the minimum value for a specified group or for the entire table if the group is not specified. For example, if we are looking for the smallest order (minimum order) for each city we would use the following query.

SELECT order_id, order_date, customer_name, city, order_amount
 ,MIN(order_amount) OVER(PARTITION BY city) as minimum_order_amount 
FROM [dbo].[Orders]

MAX()

Just as the MIN() function gives you the minimum value, the MAX() function will identify the largest value of a specified field for a specified group of rows or for the entire table if a group is not specified. Let’s find the biggest order (maximum order amount) for each city.

SELECT order_id, order_date, customer_name, city, order_amount
 ,MAX(order_amount) OVER(PARTITION BY city) as maximum_order_amount 
FROM [dbo].[Orders]

COUNT()

The COUNT() function will count the records/rows. Note that DISTINCT is not supported with the window COUNT() function whereas it is supported for the regular COUNT() function. DISTINCT helps you to find the distinct values of a specified field. For example, if we want to see how many customers have placed an order in April 2017, we cannot directly count all customers. It is possible that the same customer has placed multiple orders in the same month.

COUNT(customer_name) will give you an incorrect result as it will count duplicates. Whereas COUNT(DISTINCT customer_name) will give you the correct result as it counts each unique customer only once.

Valid for regular COUNT() function:

SELECT city,COUNT(DISTINCT customer_name) number_of_customers
FROM [dbo].[Orders] 
GROUP BY city

Invalid for window COUNT() function:

SELECT order_id, order_date, customer_name, city, order_amount
 ,COUNT(DISTINCT customer_name) OVER(PARTITION BY city) as number_of_customers
FROM [dbo].[Orders] 

The above query with the Window function will give you the below error.

Now, let’s find the total order received for each city using the window COUNT() function.

SELECT order_id, order_date, customer_name, city, order_amount
 ,COUNT(order_id) OVER(PARTITION BY city) as total_orders
FROM [dbo].[Orders]

Ranking Window Functions

Just as Window aggregate functions aggregate the value of a specified field, RANKING functions will rank the values of a specified field and categorize them according to their rank. The most common use of RANKING functions is to find the top (N) records based on a certain value. For example, Top 10 highest-paid employees, Top 10 ranked students, Top 50 largest orders, etc.

The following are supported RANKING functions: RANK(), DENSE_RANK(), ROW_NUMBER(), NTILE()

Let’s discuss them one by one.

RANK()

The RANK() function is used to give a unique rank to each record based on a specified value, for example, salary, order amount, etc. If two records have the same value then the RANK() function will assign the same rank to both records by skipping the next rank. This means – if there are two identical values at rank 2, it will assign the same rank 2 to both records and then skip rank 3 and assign rank 4 to the next record.

Let’s rank each order by its order amount.

SELECT order_id,order_date,customer_name,city, 
RANK() OVER(ORDER BY order_amount DESC) [Rank]
FROM [dbo].[Orders]

From the above image, you can see that the same rank (3) is assigned to two identical records (each having an order amount of 15,000) and it then skips the next rank (4) and assigns rank 5 to the next record.

DENSE_RANK()

The DENSE_RANK() function is identical to the RANK() function except that it does not skip any rank. This means that if two identical records are found then DENSE_RANK() will assign the same rank to both records but not skip and then skip the next rank.

Let’s see how this works in practice.

SELECT order_id, order_date, customer_name, city, order_amount,
DENSE_RANK() OVER(ORDER BY order_amount DESC) [Rank]
FROM [dbo].[Orders]

As you can clearly see above, the same rank is given to two identical records (each having the same order amount) and then the next rank number is given to the next record without skipping a rank value.

ROW_NUMBER()

The name is self-explanatory. These functions assign a unique row number to each record. The row number will be reset for each partition if PARTITION BY is specified. Let’s see how ROW_NUMBER() works without PARTITION BY and then with PARTITION BY.

ROW_ NUMBER() without PARTITION BY

SELECT order_id, order_date, customer_name, city, order_amount,
ROW_NUMBER() OVER(ORDER BY order_id) [row_number]
FROM [dbo].[Orders]

ROW_NUMBER() with PARTITION BY

SELECT order_id, order_date, customer_name, city, order_amount,
ROW_NUMBER() OVER(PARTITION BY city ORDER BY order_amount DESC) [row_number]
FROM [dbo].[Orders]

Note that we have done the partition of the city. This means that the row number is reset for each city and so restarts at 1 again. However, the order of the rows is determined by order amount so that for any given city the largest order amount will be the first row and so assigned row number 1.

NTILE()

NTILE() is a very helpful window function. It helps you to identify what percentile (or quartile, or any other subdivision) a given row falls into. This means that if you have 100 rows and you want to create 4 quartiles based on a specified value field you can do so easily and see how many rows fall into each quartile. Let’s see an example. In the query below, we have specified that we want to create four quartiles based on order amount. We then want to see how many orders fall into each quartile.

SELECT order_id, order_date, customer_name, city, order_amount,
NTILE(4) OVER(ORDER BY order_amount) [row_number]
FROM [dbo].[Orders]

NTILE creates tiles based on the following formula:

# of rows in each tile = # of rows in result set/# of tiles specified

Here is our example, we have a total of 10 rows and 4 tiles are specified in the query so the number of rows in each tile will be 2.5 (10/4). As the number of rows should be a whole number, not a decimal, the SQL engine will assign 3 rows for the first two groups and 2 rows for the remaining two groups.

Value Window Functions

Value window functions are used to find first, last, previous and next values. The functions that can be used are LAG(), LEAD(), FIRST_VALUE(), and LAST_VALUE().

LAG() and LEAD()

LEAD() and LAG() functions are very powerful but can be complex to explain. As this is an introductory article below we are looking at a very simple example to illustrate how to use them. The LAG function allows to access data from the previous row in the same result set without the use of any SQL joins. You can see in the below example, that using the LAG function we found the previous order date.

Script to find previous order date using LAG() function:

SELECT order_id, customer_name, city, order_amount, order_date,
 --in below line, 1 indicates check for previous row of the current row
 LAG(order_date,1) OVER(ORDER BY order_date) prev_order_date
FROM [dbo].[Orders]

The LEAD function allows to access data from the next row in the same result set without the use of any SQL joins. You can see in the below example, that using the LEAD function we found the next order date.

Script to find next order date using LEAD() function:

SELECT order_id,customer_name,city, order_amount,order_date,
 --in below line, 1 indicates check for next row of the current row
 LEAD(order_date,1) OVER(ORDER BY order_date) next_order_date
FROM [dbo].[Orders]

FIRST_VALUE() and LAST_VALUE()

These functions help you to identify the first and last record within a partition or entire table if PARTITION BY is not specified. Let’s find the first and last order of each city from our existing dataset. Note ORDER BY clause is mandatory for FIRST_VALUE() and LAST_VALUE() functions:

SELECT order_id,order_date,customer_name,city, order_amount,
FIRST_VALUE(order_date) OVER(PARTITION BY city ORDER BY city) first_order_date,
LAST_VALUE(order_date) OVER(PARTITION BY city ORDER BY city) last_order_date
FROM [dbo].[Orders]

From the above image, we can clearly see that the first order was received on 2017-04-02 and the last order was received on 2017-04-11 for Arlington city and it works the same for other cities.

Window Function Syntax

Here’s what the generic syntax looks like for a window function in the SELECT clause.

There are a lot of words here, so let’s look at some definitions:

  • window_function is the name of the window function we want to use; for example, sum, avg, or row_number (we’ll learn more about these later)
  • expression is the name of the column that we want the window function operated on. This may not be necessary depending on what window_function is used
  • OVER is just to signify that this is a window function
  • PARTITION BY divides the rows into partitions so we can specify which rows to use to compute the window function
  • partition_list is the name of the column(s) we want to partition by
  • ORDER BY is used so that we can order the rows within each partition. This is optional and does not have to be specified
  • order_list is the name of the column(s) we want to order by
  • ROWS can be used if we want to further limit the rows within our partition. This is optional and usually not used
  • frame_clause defines how much to offset from our current row

Don’t worry about memorizing the definitions and syntax or even fully understanding what it means exactly right now. Everything will make a lot more sense once you look at the examples in the article and get an intuitive understanding of how to go about writing a window function.

Quick Example

To help you get a better idea of how the syntax really works below is an example of what a window function would look like in practice.

This is the query that would have generated the output we saw earlier regarding salary by job title.

Here, AVG() is the name of the window function, SALARY is the expression and JOB_TITLE is our partition list. We did not use an ORDER BY as it is not needed and we do not want to use ROWS because we do not want to further limit our partition.

Again, no need for memorizing syntax for now. At this stage, the one concept I want you to understand is that the window function computes a value for each row in the “window” or “partition”. A window can be one of more rows and it is specified by the clause PARTITION BY. In our example, we partitioned by job title. As you can see in the snippet above, I’ve highlighted each job title in a different color. Each color represents a different “window” or a different “partition”. The window function computes one average salary value for each partition.

List of Window Functions

Now that you know the syntax, let’s take look at the different kinds of window functions that can be substituted in place of the red font below.

There are three main types of window functions available to use: aggregate, ranking, and value functions. In the image below, you can see some of the names of the functions that fall within each group.

Here’s a quick overview of what each type of window function is useful for.

Aggregate functions: We can use these functions to calculate various aggregations such as average, total # of rows, maximum or minimum values, or total sum within each window or partition.

Ranking functions: These functions are useful for ranking rows within their partition.

Value functions: These functions allow you to compare values from previous or following rows within the partition or the first or last value within the partition.

Window Function Examples

Now let’s start doing some fun exercises to help you really grasp how window functions work. We’ll go through various exercises on aggregate, ranking, and value functions.

Data

For the example problems below, I am using data from the Northwind database located on this website. See Northwind_small.sqlite. The Northwind sample database was provided with Microsoft Access as a tutorial schema for managing small business customers, orders, inventory, purchasing, suppliers, shipping, and employees. Northwind is an excellent tutorial schema for a small-business ERP, with customers, orders, inventory, purchasing, suppliers, shipping, employees, and single-entry accounting.

The full schema of the database is displayed on the website linked above. For the examples in this article, I will only be using the [Order] and [OrderDetail] tables.

Source: https://github.com/jpwhite3/northwind-SQLite3

Aggregate Functions

Exercise 1: Create a new column that calculates the average Unit Price for each CustomerId

From the list of aggregate window functions listed on the left, we can see that AVG() will be the window function that we want to use. Our expression will be the Unit Price column because we want to calculate the average of Unit Price. Next, we need to figure out how we want to partition. That is, how should the rows be grouped together in order to create our partitions? The exercise statement tells us to find the avg price for each CustomerId. That tells us that we want to group rows that have the same CustomerId, so that will be part of our partition list. For the purpose of this exercise, we have no use of an ORDER BY. Below is what our query would look like.

SELECT CustomerId, 
       UnitPrice, 
       AVG(UnitPrice) OVER (PARTITION BY CustomerId) AS “AvgUnitPrice”
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId

As you can see in the image above, an average unit price is computed for each of our partitions of CustomerId.

Here are some exercises to try for yourself:

  1. Create a new column that calculates the max Discount for each CustomerId
  2. Create a new column that calculates min Unit Price for each ProductId
Exercise 2: Create a new column that calculates the average Unit Price for each group of CustomerId AND EmployeeId.

You can choose to partition by more than 1 column. Earlier, we calculated the average unit price for each CustomerId group. This time, we’ll add in EmployeeId.

SELECT CustomerId, 
       EmployeeId, 
       AVG(UnitPrice) OVER (PARTITION BY CustomerId, EmployeeId) AS “AvgUnitPrice”
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId 

Notice how the partition changes from earlier. The calculations are computed for every unique group of CustomerId and EmployeeId as visually shown by the different colors in the table.

Ranking Functions

Exercise 3: Create a new column that ranks Unit Price in descending order for each CustomerId.

We can complete this exercise in three different ways. We’ll use the first three ranking functions on the list on the left: ROW_NUMBER, RANK, and DENSE_RANK. Each one has a slightly different way of ranking the data.

ROW_NUMBER

We can use this function to show the row number of a given row within its partition. Note that for the ranking functions, we do not have to specify an expression within the parentheses as we did previously for the aggregate functions.

In addition, since are doing a ranking, the order is important here. We have to make sure that the Unit Price column is ordered correctly so that the ranking is applied correctly. To do so, we can add ORDER BY UnitPrice DESCas part of the windows function, right after PARTITION BY.

SELECT CustomerId, 
       OrderDate, 
       UnitPrice, 
       ROW_NUMBER() OVER (PARTITION BY CustomerId ORDER BY UnitPrice DESC) AS “UnitRank”
FROM [Order] 
INNER JOIN OrderDetail 
ON [Order].Id = OrderDetail.OrderId

As you can see in the output above, our UnitPrice column is in descending order and the unit’s rank is shown for each customer id in the last column. There are 12 rows for customer ALFK, so the rank goes from 1 to 12. You may be wondering, what happens if I use the ORDER BY at the end of the SQL statement and not inside the windows function, will I get the same results?

Take a minute think about it and come back. Does it matter if I have the ORDER BY inside the windows function versus outside?

Let’s try it out! Remove the ORDER BY from the windows function and let’s add it to the end.

SELECT CustomerId, 
       OrderDate, 
       UnitPrice,    
       ROW_NUMBER() OVER (PARTITION BY CustomerId) AS “UnitRank”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId
ORDER BY CustomerId, UnitPrice DESC 

It looks like we aren’t getting the same results as earlier. Unit price is ordered correctly in descending order, but the unit’s rank doesn’t look right. Why not? Remember back to the SQL order of operations. Window functions are processed sixth whereas the ORDER BY is processed tenth.

So the row numbers were created BEFORE the UnitPrice was ordered. That’s why we don’t get the same results! Makes sense.

RANK()

Now, let’s try RANK() in place of ROW_NUMBER().

SELECT CustomerId, 
       OrderDate, 
       UnitPrice, 
       RANK() OVER (PARTITION BY CustomerId ORDER BY UnitPrice DESC) AS “UnitRank”
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId

What’s the difference now? With row number, there were no repeated numbers. But with RANK(), if you have multiple values with the exact same value, the rank function will give them the same rank. Notice in rows 2 and 3, the unit price is 45.60, so both rows are given the same rank of 2. Rows 7 and 8 also have the same unit price and are given the same rank of 7. Also, note that the ranking skips a number. For example, in row 3, the rank skips to 4 since there are two rows of rank 2. If there were three rows with rank 2, then it would skip to rank 5 and so on. Well, what if you don’t want it to skip numbers? Well, we can use DENSE_RANK() instead.

DENSE_RANK()

Again, replace our window function to be DENSE_RANK() and keep all else the same.

SELECT CustomerId, 
       OrderDate, 
       UnitPrice, 
       DENSE_RANK() OVER (PARTITION BY CustomerId ORDER BY UnitPrice DESC) AS “UnitRank”
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId

It follows the same behavior as RANK() in that if the values are the same, the same rank will be given to those rows. See rows 2 and 3. Notice that in row 4, the rank does not skip a number now. It is ranked 3 instead of 4. Your homework now is to learn how PERCENT_RANK() and NTILE() work and to try those functions for yourself.

Value Functions

To me, value functions are probably the top reason why window functions are so amazing. These functions are great for extracting values from other rows that might be useful for a report.

We can use the LAG or LEAD functions to help us create a column that is able to pull values from other rows. LAG can return values from the previous rows whereas LEAD returns values from the following rows. Comparing previous or following rows can be super useful when working with time-series data and calculating differences across time.

Exercise 4: Create a new column that provides the previous order date’s Quantity for each ProductId.
SELECT ProductId, 
       OrderDate, 
       Quantity, 
       LAG(Quantity) OVER (PARTITION BY ProductId ORDER BY OrderDate) AS "LAG"
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId 

We use LAG on the Quantity column to return the value from the previous row. Just like before, we need to make sure our data is sorted in order inside our windows function. We’ll sort by the OrderDate here.

As you can see in the image above, we get a column with the previous OrderDate’s Quantity. This is really useful because we could compare the current order date to the previous order date and calculate the differences across the two time periods. In the first row, there is no previous order date, so it is NaN or null.

Exercise 5: Create a new column that provides the following order date’s Quantity for each ProductId.

This is going to look pretty similar to our earlier example. However, this time, since we want the following row, we will use LEAD().

SELECT ProductId, 
       OrderDate, 
       Quantity, 
       LEAD(Quantity) OVER (PARTITION BY ProductId ORDER BY OrderDate) AS "LEAD"
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId

As you can see, the new column LEAD contains the values from the next row down.

Exercise 6: Create a new column that provides the very first Quantity ever ordered for each ProductId.

To get the first quantity, we can use the FIRST_VALUE function, which will give us the first value within a partition.

SELECT ProductId, 
       OrderDate, 
       Quantity, 
       FIRST_VALUE(Quantity) OVER (PARTITION BY ProductId ORDER BY OrderDate) AS "FirstValue"
FROM [Order] 
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId

As you can see in the image, the first order for product ID 1 was on 8/20/2012 with a quantity of 45, so we get a value of 45 for all rows related to product 1. Here’s an exercise for you to try yourself.

  1. Create a new column that provides the second Quantity ordered for each ProductId. (HINT: use the NTH_VALUE function)

Using the frame_clause

Let’s take a quick break from the exercises to learn a new concept that hasn’t been discussed yet. You may remember from the definitions at the beginning that we can specify ROWS with a frame_clause to further limit our window size. I’m saving this towards the end here because people tend to get a bit confused about this. I’m going to quickly go over the syntax and how it’s used, then let’s look at an example to really understand what’s going on.

Here’s what the generic syntax looks like

ROWS BETWEEN <starting_row> AND <ending_row>

In the <starting_row> and <ending row>, we have the following options at our disposal:

  • UNBOUNDED PRECEDING — all rows before the current row in the partition, i.e. the first row of the partition
  • [some #] PRECEDING — # of rows before the current row
  • CURRENT ROW — the current row
  • [some #] FOLLOWING — # of rows after the current row
  • UNBOUNDED FOLLOWING — all rows after the current row in the partition, i.e. the last row of the partition

Here are some examples of how it could be written:

  • ROWS BETWEEN 3 PRECEDING AND CURRENT ROW — This means looking back at the previous 3 rows up to the current row.
  • ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING — this means looking from the first row of the partition to 1 row after the current row
  • ROWS BETWEEN 5 PRECEDING AND 1 PRECEDING — this means to look back at the previous 5 rows up to 1 row before the current row
  • ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING — this means looking from the first row of the partition to the last row of the partition

One worthy note is that anytime that you add an ORDER BY clause, SQL sets the default window as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

Exercise 7: Calculate a cumulative moving average UnitPrice for each CustomerId.

In order to calculate a cumulative moving average, we will take advantage of the frame_clause.

SELECT CustomerId, 
       UnitPrice, 
       AVG(UnitPrice) OVER (PARTITION BY CustomerId 
       ORDER BY CustomerId 
       ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS “CumAvg”
FROM [Order]
INNER JOIN OrderDetail ON [Order].Id = OrderDetail.OrderId

In the output above, you can see that an average is recalculated at every row. In row 1, there is only 1 number, so the average is 45.60. In row 2, the CumAvg is the average of 45.60 and 18. In row 3, the CumAvg is the average of 45.60, 18, and 12. And so on…

Here are some exercises to try for yourself:

  1. Calculate the average Quantity of the previous 5 rows up to the previous 3 rows for each CustomerId.
  2. Calculated the minimum UnitPrice for the previous 2 rows up to the current row for each CustomerId.

Challenge Exercise: Create a new column that provides the last Quantity ordered for each ProductId. (Hint: use the LAST_VALUE() window function and think about the partitioning).

If you are looking for a cheat sheet, check out this link here.

References:

https://towardsdatascience.com/a-guide-to-advanced-sql-window-functions-f63f2642cbf9

https://www.sqlshack.com/use-window-functions-sql-server/
Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.