
Resampling time series in Pandas: resample and asfreq methods


This article is an introductory dive into the technical aspects of resampling methods in pandas.

1. Resampling 

Resampling is necessary when you’re given a data set recorded in some time interval and you want to change the time interval to something else. For example, you could aggregate monthly data into yearly data, or you could upsample hourly data into minute-by-minute data. Resampling time series generally refers to:

  • Enforcing a frequency on data that was measured without one (e.g. data collected with varying time deltas between measurements).
  • Enforcing a frequency different from the one already present in the measured data.

We need methods that can help us enforce some kind of frequency on the data to make analysis easier. The Python library pandas is commonly used to hold time series data and provides a set of tools for resampling it. We'll be exploring ways to resample time series data using pandas.

Resampling is generally performed in two ways:

  • Up Sampling: converting a time series from a lower frequency to a higher one, e.g. from month-based to day-based or from hour-based to minute-based. The number of observations increases, so we need a method to fill the newly created slots. We'll cover the available methods in the examples below.
  • Down Sampling: converting a time series from a higher frequency to a lower one, e.g. from week-based to month-based or from hour-based to day-based. The number of samples decreases, and some values are lost in the process. We'll explain it in the examples below.

Let’s start by importing the required libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore")

%matplotlib inline

1.1 asfreq() 

The first method we'd like to introduce for resampling is asfreq(). Both pandas Series and DataFrame objects provide this method.

The asfreq() method accepts a few important parameters: freq, method, and fill_value.

  • freq parameter lets us specify a new frequency for the time series object.
  • method parameter lets us choose how newly created indexes are filled when we upsample the time series: ffill (alias pad) fills a new index with the value of the previous index, whereas bfill (alias backfill) fills it with the value of the next index. The default is None, which puts NaN in newly created indexes when upsampling.
  • fill_value lets us fill newly created NaNs with the specified value. It does not fill NaNs already present in the data, only the NaNs generated by asfreq() when upsampling/downsampling.
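To make that last point concrete, here is a minimal sketch (not part of the original examples) with a NaN already present in the data; fill_value fills only the slots that asfreq() creates:

idx = pd.date_range("2020-01-01", periods=3, freq="H")
s = pd.Series([1.0, np.nan, 3.0], index=idx)     # a NaN already present at 01:00
s.asfreq("30min", fill_value=0.0)
# 2020-01-01 00:00:00    1.0
# 2020-01-01 00:30:00    0.0   <- slot created by asfreq(), filled with fill_value
# 2020-01-01 01:00:00    NaN   <- pre-existing NaN is left untouched
# 2020-01-01 01:30:00    0.0
# 2020-01-01 02:00:00    3.0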

We’ll explore the usage of asfreq() below with a few examples.

rng = pd.date_range(start = "1-1-2020", periods=5, freq="H")
ts = pd.Series(data=range(5), index=rng)
ts

2020-01-01 00:00:00    0
2020-01-01 01:00:00    1
2020-01-01 02:00:00    2
2020-01-01 03:00:00    3
2020-01-01 04:00:00    4
Freq: H, dtype: int64

Below we are trying a few examples to demonstrate upsampling. We’ll explore various methods to fill in newly created indexes.

ts.asfreq(freq="30min")

2020-01-01 00:00:00    0.0
2020-01-01 00:30:00    NaN
2020-01-01 01:00:00    1.0
2020-01-01 01:30:00    NaN
2020-01-01 02:00:00    2.0
2020-01-01 02:30:00    NaN
2020-01-01 03:00:00    3.0
2020-01-01 03:30:00    NaN
2020-01-01 04:00:00    4.0
Freq: 30T, dtype: float64

We can see from the above example that asfreq() by default puts NaN in all newly created indexes. We can either supply a value for these new indexes through the fill_value parameter or use one of the fill methods, as the following examples show.

ts.asfreq(freq="30min", fill_value=0.0)

2020-01-01 00:00:00    0.0
2020-01-01 00:30:00    0.0
2020-01-01 01:00:00    1.0
2020-01-01 01:30:00    0.0
2020-01-01 02:00:00    2.0
2020-01-01 02:30:00    0.0
2020-01-01 03:00:00    3.0
2020-01-01 03:30:00    0.0
2020-01-01 04:00:00    4.0
Freq: 30T, dtype: float64

We can see that the above example filled in all NaNs with 0.0.

ts.asfreq(freq="30min", method="ffill")

2020-01-01 00:00:00    0
2020-01-01 00:30:00    0
2020-01-01 01:00:00    1
2020-01-01 01:30:00    1
2020-01-01 02:00:00    2
2020-01-01 02:30:00    2
2020-01-01 03:00:00    3
2020-01-01 03:30:00    3
2020-01-01 04:00:00    4
Freq: 30T, dtype: int64

We can see from the above example that the ffill method fills newly created indexes with the value of the previous index.

ts.asfreq(freq="45min", method="ffill")

2020-01-01 00:00:00    0
2020-01-01 00:45:00    0
2020-01-01 01:30:00    1
2020-01-01 02:15:00    2
2020-01-01 03:00:00    3
2020-01-01 03:45:00    3
Freq: 45T, dtype: int64

ts.asfreq(freq="45min", method="bfill")

2020-01-01 00:00:00    0
2020-01-01 00:45:00    1
2020-01-01 01:30:00    2
2020-01-01 02:15:00    3
2020-01-01 03:00:00    3
2020-01-01 03:45:00    4
Freq: 45T, dtype: int64

ts.asfreq(freq="45min", method="pad")

2020-01-01 00:00:00    0
2020-01-01 00:45:00    0
2020-01-01 01:30:00    1
2020-01-01 02:15:00    2
2020-01-01 03:00:00    3
2020-01-01 03:45:00    3
Freq: 45T, dtype: int64

df = pd.DataFrame({"TimeSeries":ts})
df

                     TimeSeries
2020-01-01 00:00:00           0
2020-01-01 01:00:00           1
2020-01-01 02:00:00           2
2020-01-01 03:00:00           3
2020-01-01 04:00:00           4

df.asfreq(freq="45min")

                     TimeSeries
2020-01-01 00:00:00         0.0
2020-01-01 00:45:00         NaN
2020-01-01 01:30:00         NaN
2020-01-01 02:15:00         NaN
2020-01-01 03:00:00         3.0
2020-01-01 03:45:00         NaN

df.asfreq(freq="45min", fill_value=0.0)

                     TimeSeries
2020-01-01 00:00:00         0.0
2020-01-01 00:45:00         0.0
2020-01-01 01:30:00         0.0
2020-01-01 02:15:00         0.0
2020-01-01 03:00:00         3.0
2020-01-01 03:45:00         0.0

df.asfreq("30min", method="ffill")

                     TimeSeries
2020-01-01 00:00:00           0
2020-01-01 00:30:00           0
2020-01-01 01:00:00           1
2020-01-01 01:30:00           1
2020-01-01 02:00:00           2
2020-01-01 02:30:00           2
2020-01-01 03:00:00           3
2020-01-01 03:30:00           3
2020-01-01 04:00:00           4

df.asfreq("30min", method="bfill")

                     TimeSeries
2020-01-01 00:00:00           0
2020-01-01 00:30:00           1
2020-01-01 01:00:00           1
2020-01-01 01:30:00           2
2020-01-01 02:00:00           2
2020-01-01 02:30:00           3
2020-01-01 03:00:00           3
2020-01-01 03:30:00           4
2020-01-01 04:00:00           4

df.asfreq("30min", method="pad")

                     TimeSeries
2020-01-01 00:00:00           0
2020-01-01 00:30:00           0
2020-01-01 01:00:00           1
2020-01-01 01:30:00           1
2020-01-01 02:00:00           2
2020-01-01 02:30:00           2
2020-01-01 03:00:00           3
2020-01-01 03:30:00           3
2020-01-01 04:00:00           4

We’ll now explain a few examples of downsampling.

ts.asfreq(freq="1H30min")

2020-01-01 00:00:00    0.0
2020-01-01 01:30:00    NaN
2020-01-01 03:00:00    3.0
Freq: 90T, dtype: float64

ts.asfreq(freq="1H30min", fill_value=0.0)

2020-01-01 00:00:00    0.0
2020-01-01 01:30:00    0.0
2020-01-01 03:00:00    3.0
Freq: 90T, dtype: float64

ts.asfreq(freq="1H30min", method="ffill")

2020-01-01 00:00:00    0
2020-01-01 01:30:00    1
2020-01-01 03:00:00    3
Freq: 90T, dtype: int64

ts.asfreq(freq="1H30min", method="bfill")

2020-01-01 00:00:00    0
2020-01-01 01:30:00    2
2020-01-01 03:00:00    3
Freq: 90T, dtype: int64

ts.asfreq(freq="1H30min", method="pad")

2020-01-01 00:00:00    0
2020-01-01 01:30:00    1
2020-01-01 03:00:00    3
Freq: 90T, dtype: int64

df.asfreq(freq="1H30min")
TimeSeries
2020-01-01 00:00:000.0
2020-01-01 01:30:00NaN
2020-01-01 03:00:003.0
df.asfreq(freq="1H30min", fill_value=0.0)
TimeSeries
2020-01-01 00:00:000.0
2020-01-01 01:30:000.0
2020-01-01 03:00:003.0
df.asfreq(freq="1H30min", method="ffill")
TimeSeries
2020-01-01 00:00:000
2020-01-01 01:30:001
2020-01-01 03:00:003
df.asfreq(freq="1H30min", method="bfill")
TimeSeries
2020-01-01 00:00:000
2020-01-01 01:30:002
2020-01-01 03:00:003
df.asfreq(freq="1H30min", method="pad")
TimeSeries
2020-01-01 00:00:000
2020-01-01 01:30:001
2020-01-01 03:00:003

We can lose data when downsampling, and asfreq() takes only a simple approach to it: the method parameter offers just bfill, ffill, and pad for filling in data when upsampling or downsampling. What if we need to apply some other function? We need a more flexible approach to handle downsampling, and pandas provides another method, resample(), which can help us with that.

1.2 resample() 

The syntax of resample is fairly straightforward:

<DataFrame or Series>.resample(arguments).<aggregate function>

I’ll dive into what the arguments are and how to use them, but first here’s a basic, out-of-the-box demonstration. You will need a datetime type index or column to do the following:

# Given a Series object called data with some number value per date
╔═══════════════════════╦══════╗
║         date          ║ val  ║
╠═══════════════════════╬══════╣
║ 2000-01-01 00:00:00   ║    0 ║
║ 2000-01-01 00:01:00   ║    2 ║
║ 2000-01-01 00:02:00   ║    4 ║
║ 2000-01-01 00:03:00   ║    6 ║
║ 2000-01-01 00:04:00   ║    8 ║
║ 2000-01-01 00:05:00   ║   10 ║
║ 2000-01-01 00:06:00   ║   12 ║
║ 2000-01-01 00:07:00   ║   14 ║
║ 2000-01-01 00:08:00   ║   16 ║
╚═══════════════════════╩══════╝
Freq: T, dtype: int64
# We can resample this to every other minute instead and aggregate by summing the intermediate rows:
data.resample('2min').sum()

╔═════════════════════╦══════╗
║        date         ║ val  ║
╠═════════════════════╬══════╣
║ 2000-01-01 00:00:00 ║    2 ║
║ 2000-01-01 00:02:00 ║   10 ║
║ 2000-01-01 00:04:00 ║   18 ║
║ 2000-01-01 00:06:00 ║   26 ║
║ 2000-01-01 00:08:00 ║   16 ║
╚═════════════════════╩══════╝
Freq: 2T, dtype: int64

Now that we have a basic understanding of what resampling is, let’s go into the code!

The Arguments

In the order of the source code:

pd.DataFrame.resample(
    rule,
    how=None,
    axis=0,
    fill_method=None,
    closed=None,
    label=None,
    convention="start",
    kind=None,
    loffset=None,
    limit=None,
    base=0,
    on=None,
    level=None,
)

The arguments I will cover are rule, closed, label, loffset, base, axis, on, and level. The rest are either deprecated or used for period rather than datetime analysis, which I will not be going over in this article.

‘Rule’ Argument

string that contains rule aliases and/or numerics

This is the core of resampling. The string you input here determines the interval by which the data will be resampled, as in the '2min' in the following line:

data.resample('2min').sum()

As you can see, you can throw in floats or integers before the string to change the frequency. You can even throw multiple float/string pairs together for a very specific timeframe! For example:

'3min' or '3T' = 3 minutes
'SMS' = Two times a month
'1D3H.5min20S' = One Day, 3 hours, .5min(30sec) + 20sec
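For instance, using the hourly ts built in the asfreq() section above, a compound rule and its single-alias equivalent produce the same bins (a small sketch, not from the original walkthrough):

ts.resample('1H30min').sum()
ts.resample('90T').sum()      # same 90-minute bins, written with the minute alias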

To save you the pain of trying to look up the resample strings, I’ve posted the table below:

From Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

Once you put in your rule, you need to decide how you will either reduce the old data points or fill in the new ones. This function goes right after the resample function call:

data.resample('2min').sum()

There are two kinds of resampling:

Downsampling — Resample to a wider time frame (from months to years). This is fairly straightforward in that it can use all the groupby aggregate functions including mean(), min(), max(), sum(), and so forth. In downsampling, your total number of rows goes down.

Upsampling — Resample to a shorter time frame (from hours to minutes). This will result in additional empty rows, so you have the following options to fill those with numeric values:

1. ffill() or pad()
2. bfill() or backfill()
  • ‘Forward filling’ or ‘padding’ — Use the last known value to fill the new one.
  • ‘Backfilling’ — Use the next known value to fill the new one.
  • You can also fill with NaNs using the asfreq() function with no arguments. This will result in new data points having NaNs in them, which you can use a fillna() function on later.
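A quick sketch of that last option, assuming the per-minute data series from the demonstration above: resample().asfreq() leaves the newly created rows as NaN, which fillna() can then handle however you like.

data.resample('30S').asfreq().fillna(0)   # new 30-second rows appear as NaN, then become 0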

Here are some demonstrations of the forward and back fills:

Starting with months table:
╔════════════╦═════╗
║    date    ║ val ║
╠════════════╬═════╣
║ 2000-01-31 ║   0 ║
║ 2000-02-29 ║   2 ║
║ 2000-03-31 ║   4 ║
║ 2000-04-30 ║   6 ║
║ 2000-05-31 ║   8 ║
╚════════════╩═════╝
print('Forward Fill')
print(months.resample('SMS').ffill())
╔════════════╦═════╗
║    date    ║ val ║
╠════════════╬═════╣
║ 2000-01-15 ║ NaN ║
║ 2000-02-01 ║ 0.0 ║
║ 2000-02-15 ║ 0.0 ║
║ 2000-03-01 ║ 2.0 ║
║ 2000-03-15 ║ 2.0 ║
║ 2000-04-01 ║ 4.0 ║
║ 2000-04-15 ║ 4.0 ║
║ 2000-05-01 ║ 6.0 ║
║ 2000-05-15 ║ 6.0 ║
╚════════════╩═════╝
# Alternative to ffill is bfill (backward fill) that takes value of next existing months point
print('Backward Fill')
print(months.resample('SMS').bfill())
╔════════════╦═════╗
║    date    ║ val ║
╠════════════╬═════╣
║ 2000-01-15 ║   0 ║
║ 2000-02-01 ║   2 ║
║ 2000-02-15 ║   2 ║
║ 2000-03-01 ║   4 ║
║ 2000-03-15 ║   4 ║
║ 2000-04-01 ║   6 ║
║ 2000-04-15 ║   6 ║
║ 2000-05-01 ║   8 ║
║ 2000-05-15 ║   8 ║
╚════════════╩═════╝

‘Closed’ Argument

'left', 'right', or None

I’m going to include their documentation comment here since it describes the basics fairly succinctly:

Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

The closed argument tells which edge of each bin is included in the calculation: the 'closed' side is included, which implies the other side is not. You can see how it behaves here:

# Original Table 'minutes'

╔═════════════════════╦═════╗
║        date         ║ val ║
╠═════════════════════╬═════╣
║ 2000-01-01 00:00:00 ║   0 ║
║ 2000-01-01 00:01:00 ║   2 ║
║ 2000-01-01 00:02:00 ║   4 ║
║ 2000-01-01 00:03:00 ║   6 ║
║ 2000-01-01 00:04:00 ║   8 ║
║ 2000-01-01 00:05:00 ║  10 ║
║ 2000-01-01 00:06:00 ║  12 ║
║ 2000-01-01 00:07:00 ║  14 ║
║ 2000-01-01 00:08:00 ║  16 ║
╚═════════════════════╩═════╝
# The default is closed='left'
df=pd.DataFrame()
df['left'] = minutes.resample('2min').sum()
df['right'] = minutes.resample('2min',closed='right').sum()
df

╔═════════════════════╦══════╦═══════╗
║ index               ║ left ║ right ║
╠═════════════════════╬══════╬═══════╣
║ 2000-01-01 00:00:00 ║ 2    ║ 6.0   ║
║ 2000-01-01 00:02:00 ║ 10   ║ 14.0  ║
║ 2000-01-01 00:04:00 ║ 18   ║ 22.0  ║
║ 2000-01-01 00:06:00 ║ 26   ║ 30.0  ║
║ 2000-01-01 00:08:00 ║ 16   ║ NaN   ║
╚═════════════════════╩══════╩═══════╝

‘Label’ Argument

'left', 'right', or None

Once again, the documentation is pretty useful.

Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

This argument does not change the underlying calculation, it just relabels the output based on the desired edge once the aggregation is performed.

df=pd.DataFrame()
# Label default is left
df['left'] = minutes.resample('2min').sum()
df['right'] = minutes.resample('2min',label='right').sum()
df
╔═════════════════════╦══════╦═══════╗
║                     ║ left ║ right ║
╠═════════════════════╬══════╬═══════╣
║ 2000-01-01 00:00:00 ║    2 ║ NaN   ║
║ 2000-01-01 00:02:00 ║   10 ║ 2.0   ║
║ 2000-01-01 00:04:00 ║   18 ║ 10.0  ║
║ 2000-01-01 00:06:00 ║   26 ║ 18.0  ║
║ 2000-01-01 00:08:00 ║   16 ║ 26.0  ║
╚═════════════════════╩══════╩═══════╝

‘Loffset’ Argument

string that matches the rule notation.

This argument is also pretty self-explanatory. Instead of changing any of the calculations, it just bumps the labels over by the specified amount of time.

df=pd.DataFrame()
df['no_offset'] = minutes.resample('2min').sum()
df['2min_offset'] = minutes.resample('2min',loffset='2T').sum()
df['4min_offset'] = minutes.resample('2min',loffset='4T').sum()
df

╔═════════════════════╦═══════════╦═════════════╦═════════════╗
║ index               ║ no_offset ║ 2min_offset ║ 4min_offset ║
╠═════════════════════╬═══════════╬═════════════╬═════════════╣
║ 2000-01-01 00:00:00 ║         2 ║ NaN         ║ NaN         ║
║ 2000-01-01 00:02:00 ║        10 ║ 2.0         ║ NaN         ║
║ 2000-01-01 00:04:00 ║        18 ║ 10.0        ║ 2.0         ║
║ 2000-01-01 00:06:00 ║        26 ║ 18.0        ║ 10.0        ║
║ 2000-01-01 00:08:00 ║        16 ║ 26.0        ║ 18.0        ║
╚═════════════════════╩═══════════╩═════════════╩═════════════╝

‘Base’ Argument

numeric input in the same unit as the resampling rule. It shifts the starting point of the bins by the specified amount; as the documentation describes it, this argument moves the 'origin'.

minutes.head().resample('30S').sum()

╔═════════════════════╦═════╗
║        date         ║ val ║
╠═════════════════════╬═════╣
║ 2000-01-01 00:00:00 ║   0 ║
║ 2000-01-01 00:00:30 ║   0 ║
║ 2000-01-01 00:01:00 ║   2 ║
║ 2000-01-01 00:01:30 ║   0 ║
║ 2000-01-01 00:02:00 ║   4 ║
║ 2000-01-01 00:02:30 ║   0 ║
║ 2000-01-01 00:03:00 ║   6 ║
║ 2000-01-01 00:03:30 ║   0 ║
║ 2000-01-01 00:04:00 ║   8 ║
╚═════════════════════╩═════╝
minutes.head().resample('30S',base=15).sum()

╔═════════════════════╦═════╗
║        date         ║ val ║
╠═════════════════════╬═════╣
║ 1999-12-31 23:59:45 ║   0 ║
║ 2000-01-01 00:00:15 ║   0 ║
║ 2000-01-01 00:00:45 ║   2 ║
║ 2000-01-01 00:01:15 ║   0 ║
║ 2000-01-01 00:01:45 ║   4 ║
║ 2000-01-01 00:02:15 ║   0 ║
║ 2000-01-01 00:02:45 ║   6 ║
║ 2000-01-01 00:03:15 ║   0 ║
║ 2000-01-01 00:03:45 ║   8 ║
╚═════════════════════╩═════╝
The table shifted by 15 seconds.
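As a side note, in newer pandas versions (1.1 and later) the base argument is deprecated and the offset (or origin) parameter of resample plays the same role. A rough equivalent of the call above, assuming a recent pandas, would be:

minutes.head().resample('30S', offset='15S').sum()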

‘Axis’, ‘On’, and ‘Level’

These arguments specify what column name or index to base your resampling on. If your data has the date along with the columns instead of down the rows, specify axis = 1. If your date column is not the index, specify that column name using:

on = 'date_column_name'

If you have a multi-level indexed dataframe, use level to specify what level the correct datetime index to resample is.
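Here is a small sketch of both options, with made-up column names: the datetime either lives in a regular column called timestamp, or in one level of a MultiIndex.

sales = pd.DataFrame({
    "timestamp": pd.date_range("2020-01-01", periods=6, freq="H"),
    "value": range(6),
})

# Datetime lives in a column rather than the index
sales.resample("2H", on="timestamp")["value"].sum()

# Datetime lives in one level of a MultiIndex
sales["store"] = ["a", "b", "a", "b", "a", "b"]
stacked = sales.set_index(["store", "timestamp"])
stacked.resample("2H", level="timestamp")["value"].sum()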

Other arguments not covered

The rest of the arguments are deprecated or redundant because their functionality is captured by other methods. For example, how and fill_method once removed the need for calling an aggregate function after resample (how for downsampling, fill_method for upsampling), but both are deprecated in favor of that syntax. You can read more about these arguments in the source documentation if you're interested.

The resample() method accepts the new frequency to be applied to the time series and returns a Resampler object. Beyond bfill, ffill, and pad, we can apply many other methods for filling in data when upsampling/downsampling. The Resampler object supports a list of aggregation functions like mean, std, var, count, etc., which are applied to the time-series data when upsampling or downsampling. We'll explain the usage of resample() below with a few examples.

Below, we try various ways to downsample the data.

ts.resample("1H30min").mean()

2020-01-01 00:00:00    0.5
2020-01-01 01:30:00    2.0
2020-01-01 03:00:00    3.5
Freq: 90T, dtype: float64

The above example takes the mean of the values falling into each 1-hour-30-minute window. Our time series is sampled hourly, so generally two values fall into each 90-minute window, and their mean becomes the value at the new index. We can call functions other than mean(), such as std(), var(), sum(), count(), interpolate(), etc.

ts.resample("1H15min").mean()

2020-01-01 00:00:00    0.5
2020-01-01 01:15:00    2.0
2020-01-01 02:30:00    3.0
2020-01-01 03:45:00    4.0
Freq: 75T, dtype: float64

ts.resample("1H15min").std()

2020-01-01 00:00:00    0.707107
2020-01-01 01:15:00         NaN
2020-01-01 02:30:00         NaN
2020-01-01 03:45:00         NaN
Freq: 75T, dtype: float64

ts.resample("1H15min").var()

2020-01-01 00:00:00    0.5
2020-01-01 01:15:00    NaN
2020-01-01 02:30:00    NaN
2020-01-01 03:45:00    NaN
Freq: 75T, dtype: float64

ts.resample("1H15min").sum()

2020-01-01 00:00:00    1
2020-01-01 01:15:00    2
2020-01-01 02:30:00    3
2020-01-01 03:45:00    4
Freq: 75T, dtype: int64

ts.resample("1H15min").count()

2020-01-01 00:00:00    2
2020-01-01 01:15:00    1
2020-01-01 02:30:00    1
2020-01-01 03:45:00    1
Freq: 75T, dtype: int64

ts.resample("1H15min").bfill()

2020-01-01 00:00:00    0
2020-01-01 01:15:00    2
2020-01-01 02:30:00    3
2020-01-01 03:45:00    4
Freq: 75T, dtype: int64

ts.resample("1H15min").ffill()

2020-01-01 00:00:00    0
2020-01-01 01:15:00    1
2020-01-01 02:30:00    2
2020-01-01 03:45:00    3
Freq: 75T, dtype: int64

We'll now try a few examples of upsampling the time series.

Note: We can even apply our own function to the Resampler object by passing it to its apply() method.

ts.resample("45min").bfill()

2020-01-01 00:00:00    0
2020-01-01 00:45:00    1
2020-01-01 01:30:00    2
2020-01-01 02:15:00    3
2020-01-01 03:00:00    3
2020-01-01 03:45:00    4
Freq: 45T, dtype: int64

ts.resample("45min").apply(lambda x: x**2 if x.values.tolist() else np.nan)

2020-01-01 00:00:00     0.0
2020-01-01 00:45:00     1.0
2020-01-01 01:30:00     4.0
2020-01-01 02:15:00     NaN
2020-01-01 03:00:00     9.0
2020-01-01 03:45:00    16.0
Freq: 45T, dtype: float64

ts.resample("45min").interpolate()

2020-01-01 00:00:00    0.00
2020-01-01 00:45:00    0.75
2020-01-01 01:30:00    1.50
2020-01-01 02:15:00    2.25
2020-01-01 03:00:00    3.00
2020-01-01 03:45:00    3.00
Freq: 45T, dtype: float64

df.resample("45min").mean().fillna(0.0)
                     TimeSeries
2020-01-01 00:00:00         0.0
2020-01-01 00:45:00         1.0
2020-01-01 01:30:00         2.0
2020-01-01 02:15:00         0.0
2020-01-01 03:00:00         3.0
2020-01-01 03:45:00         4.0

The above examples clearly state that resample() is a very flexible function and lets us resample time series by applying a variety of functions.

Note: For asfreq() and resample() to work correctly, the time series must be sorted by time. It is also generally preferable to use resample() over asfreq() because of its flexibility.
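A minimal illustration of the sorting requirement, using the hourly ts from above; sort_index() restores chronological order before resampling:

shuffled = ts.sample(frac=1.0)                 # same five hourly points, shuffled order
shuffled.sort_index().resample("45min").mean()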

1.3 Difference between asfreq() and resample() 

resample is more general than asfreq. For example, using resample I can pass an arbitrary function to perform binning over a Series or DataFrame object in bins of arbitrary size. asfreq is a concise way of changing the frequency of a DatetimeIndex object. It also provides padding functionality.

As the pandas documentation says, asfreq is a thin wrapper around a call to date_range plus a call to reindex.

An example of resample that I use in my daily work is computing the number of spikes of a neuron in 1-second bins by resampling a large boolean array where True means "spike" and False means "no spike". I can do that as easily as large_bool.resample('S').sum(). Kind of neat!

asfreq can be used when you want to change a DatetimeIndex to have a different frequency while retaining the same values at the current index.

Here’s an example where they are equivalent:

dr = pd.date_range('1/1/2010', periods=3, freq='3B')
raw = np.random.randn(3)
ts = pd.Series(raw, index=dr)
ts
2010-01-01   -1.948
2010-01-06    0.112
2010-01-11   -0.117
Freq: 3B, dtype: float64

ts.asfreq('B')

2010-01-01   -1.948
2010-01-04      NaN
2010-01-05      NaN
2010-01-06    0.112
2010-01-07      NaN
2010-01-08      NaN
2010-01-11   -0.117
Freq: B, dtype: float64

ts.resample('B').asfreq()   # current pandas needs an explicit .asfreq() to materialize the values

2010-01-01   -1.948
2010-01-04      NaN
2010-01-05      NaN
2010-01-06    0.112
2010-01-07      NaN
2010-01-08      NaN
2010-01-11   -0.117
Freq: B, dtype: float64

As for when to use either: it depends on the problem you have in mind.

Let me use an example to illustrate:

# generate a series of 365 days
# index = 20190101, 20190102, ... 20191231
# values = [0,1,...364]
ts = pd.Series(range(365), index = pd.date_range(start='20190101', 
                                                end='20191231',
                                                freq = 'D'))
ts.head()

2019-01-01    0
2019-01-02    1
2019-01-03    2
2019-01-04    3
2019-01-05    4
Freq: D, dtype: int64

Now, resample the data by quarter:

ts.asfreq(freq='Q')

2019-03-31     89
2019-06-30    180
2019-09-30    272
2019-12-31    364
Freq: Q-DEC, dtype: int64

The asfreq() returns a Series object with the last day of each quarter in it.

ts.resample('Q')

DatetimeIndexResampler [freq=<QuarterEnd: startingMonth=12>, axis=0, closed=right, label=right, convention=start, base=0]

Resample returns a DatetimeIndexResampler and you cannot see what’s actually inside. Think of it as the groupby method. It creates a list of bins (groups):

bins = ts.resample('Q')
bins.groups

 {Timestamp('2019-03-31 00:00:00', freq='Q-DEC'): 90,
 Timestamp('2019-06-30 00:00:00', freq='Q-DEC'): 181,
 Timestamp('2019-09-30 00:00:00', freq='Q-DEC'): 273,
 Timestamp('2019-12-31 00:00:00', freq='Q-DEC'): 365}

Nothing seems different so far except for the return type. Let’s calculate the average for each quarter:

# (89+180+272+364)/4 = 226.25
ts.asfreq(freq='Q').mean()

226.25

When mean() is applied, it outputs the average of all the values. Note that this is not the average for each quarter, but the average of the last day of each quarter. To calculate the average of each quarter:

ts.resample('Q').mean()

2019-03-31     44.5
2019-06-30    135.0
2019-09-30    226.5
2019-12-31    318.5

You can perform more powerful operations with resample() than with asfreq(). Think of resample as groupby plus every method you can call after groupby (e.g. mean, sum, apply, you name it). Think of asfreq as a filter mechanism with limited fillna() capabilities (fillna() lets you specify a limit, but asfreq() does not support one).
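To make the groupby analogy concrete, these two lines give the same quarterly means for the daily ts defined above (a small sketch using pd.Grouper, which the original write-up does not mention):

ts.resample('Q').mean()
ts.groupby(pd.Grouper(freq='Q')).mean()     # identical result via an explicit groupby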

2. Moving Window Functions 

Moving window functions refer to functions applied to time-series data by moving a fixed- or variable-size window over the data and computing descriptive statistics over the window each time. Here, a window is a number of consecutive samples taken from the series and represents a particular period of time.

There are 2 kinds of window functions:

  • Rolling Window Functions: perform aggregate operations on a window containing the same number of samples each time.
  • Expanding Window Functions: perform aggregate operations on a window that expands over time.
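Before diving into the API, here is a tiny side-by-side sketch of the two ideas on a five-element series:

s = pd.Series([1, 2, 3, 4, 5])
s.rolling(3).sum()       # NaN, NaN, 6, 9, 12  -> always the last three samples
s.expanding().sum()      # 1, 3, 6, 10, 15     -> everything seen so far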

Pandas provides a list of functions for performing window functions. We’ll start with rolling() function.

2.1 rolling() 

The rolling() function lets us perform rolling window computations on time series data. It can be called on both Series and DataFrame objects in pandas. It accepts a window size as a parameter, groups values by that window size, and returns a Rolling object on which we can apply various aggregate functions as per our needs. We'll create a simple dataframe of random data to explain this further.

df = pd.DataFrame(np.random.randn(100, 4),
                  index = pd.date_range('1/1/2020', periods = 100),
                  columns = ['A', 'B', 'C', 'D'])

df.head()
                   A         B         C         D
2020-01-01  0.792758  0.262306 -1.033230 -1.913741
2020-01-02  2.279012  0.704082  1.021807  0.995765
2020-01-03  2.715893  0.262504 -0.156704 -0.255339
2020-01-04 -0.858527  1.132931 -0.173379  0.052590
2020-01-05 -0.675983  1.259856  0.581401 -0.336817
df.plot(figsize=(8,4));
[Plot: line chart of columns A, B, C, and D]
r = df.rolling(3)
r
Rolling [window=3,center=False,axis=0]

Above, we have created a rolling object with a window size of 3. We can now apply various aggregate functions on this object to get a modified time series. We'll start by applying a mean function to the rolling object and then visualize column B from the original dataframe alongside the rolled output.

df["B"].plot(color="grey", figsize=(8,4));
r.mean()["B"].plot(color="red");
[Plot: column B (grey) with its rolling mean (red)]

There are many other descriptive statistics functions that can be applied to rolling objects, like count(), median(), std(), var(), quantile(), skew(), etc. We'll try a few below.

df["B"].plot(color="grey", figsize=(8,4));
r.quantile(0.25)["B"].plot(color="red");
[Plot: column B (grey) with its rolling 25% quantile (red)]
df["B"].plot(color="grey", figsize=(8,4));
r.skew()["B"].plot(color="red");
[Plot: column B (grey) with its rolling skew (red)]
df["B"].plot(color="grey", figsize=(8,4));
r.var()["B"].plot(color="red");
[Plot: column B (grey) with its rolling variance (red)]

We can even apply our own function by passing it to apply() function. We are explaining its usage below with an example.

Note: The input to the function passed to apply() is the window of samples (a Series by default, or a NumPy array if raw=True), with length equal to the window size.
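For example, with the rolling object r created above, passing raw=True hands each window to the function as a plain NumPy array, which is usually faster (a small sketch, not part of the original examples):

r.apply(lambda w: w.max() - w.min(), raw=True)["B"].head()   # per-window range of column B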

df["B"].plot(color="grey", figsize=(8,4));
r.apply(lambda x: x.sum())["B"].plot(color="red");
[Plot: column B (grey) with its rolling sum via apply() (red)]

We can apply more than one aggregate function by passing them to the agg() function. We can also apply aggregate functions to a single column, ignoring the others, as shown below.

r.agg(["mean", "std"]).head()
r["A"].agg(["mean", "std"]).head()
                mean       std
2020-01-01       NaN       NaN
2020-01-02       NaN       NaN
2020-01-03  1.929221  1.008155
2020-01-04  1.378792  1.949850
2020-01-05  0.393794  2.013066

We can also perform a rolling window computation on data sampled at a frequency different from the original one. Below, we create hourly data and then apply a rolling window function after resampling it to daily frequency.

df = pd.DataFrame(np.random.randn(100, 4),
                  index = pd.date_range('1/1/2020', freq="H", periods = 100),
                  columns = ['A', 'B', 'C', 'D'])
df.head()
                            A         B         C         D
2020-01-01 00:00:00 -0.661877 -1.309971 -0.222158 -0.839181
2020-01-01 01:00:00  1.670444  0.305705 -0.479218  1.202464
2020-01-01 02:00:00  0.010780  1.395900 -0.997947  2.104720
2020-01-01 03:00:00  0.250527 -0.556719 -0.309415  0.242392
2020-01-01 04:00:00 -0.800937 -0.915483 -1.090798  0.273126
df.resample("1D").mean().rolling(3).mean().head()
                   A         B         C         D
2020-01-01       NaN       NaN       NaN       NaN
2020-01-02       NaN       NaN       NaN       NaN
2020-01-03 -0.089716  0.142506  0.009205 -0.024009
2020-01-04 -0.112366  0.079827  0.033722 -0.122137
2020-01-05 -0.193246  0.120248  0.346680  0.024083
df.resample("1D").mean().rolling(3).mean().plot();
[Plot: 3-day rolling mean of the daily-resampled data]

Notice that the output has a daily frequency rather than the hourly frequency of the original data.

2.2 expanding() 

Pandas provides a function named expanding() to perform expanding window computations on time series data. It can be called on both Series and DataFrame objects. As discussed above, expanding window functions take all previous values into consideration, unlike the rolling window, which considers a fixed-size sample. We'll explain its usage below with a few examples.

df.expanding(min_periods=1).mean().head()
                            A         B         C         D
2020-01-01 00:00:00 -0.661877 -1.309971 -0.222158 -0.839181
2020-01-01 01:00:00  0.504284 -0.502133 -0.350688  0.181642
2020-01-01 02:00:00  0.339782  0.130545 -0.566441  0.822668
2020-01-01 03:00:00  0.317469 -0.041271 -0.502184  0.677599
2020-01-01 04:00:00  0.093788 -0.216114 -0.619907  0.596704
df.expanding(min_periods=1).mean().plot();
[Plot: expanding mean of all columns]

We can notice from the above plot that the output of the expanding window fluctuates at the beginning and then settles as more samples enter the computation. The early fluctuation is due to the small number of samples considered at the start; the sample count keeps growing until the whole time series has been covered.

We can apply various aggregation functions to expanding windows like count(), median(), std(), var(), quantile(), skew(), etc. We’ll explain them below with a few examples.

df.expanding(min_periods=1).std().plot();
[Plot: expanding standard deviation of all columns]
df.expanding(min_periods=1).var().plot();
[Plot: expanding variance of all columns]

We can apply more than one aggregation function by passing their names as a list to agg(), and we can apply our own function by passing it to apply(). Both usages are shown below.

df.expanding(min_periods=1).agg(["mean", "var"]).head()
df.expanding(min_periods=1).apply(lambda x: x.sum()).plot();
[Plot: expanding sum of all columns]
df["A"].expanding(min_periods=1).apply(lambda x: x.sum()).plot();
[Plot: expanding sum of column A]

We generally use expanding() window functions when all past samples matter, even as new samples keep arriving, and rolling() window functions when only the most recent samples are important and everything before them can be ignored.

2.3 ewm() 

An exponentially weighted moving average is a weighted moving average of the most recent samples of time-series data, where the weights assigned to older samples decrease exponentially. The ewm() function can be called on both Series and DataFrame objects in pandas. We'll explain its usage by comparing it with the rolling() window function.
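One detail worth knowing before the comparison: for ewm(span=N), pandas derives the smoothing factor as alpha = 2 / (N + 1), so span=10 corresponds to alpha of roughly 0.18 and the two spellings below apply effectively the same smoothing (a small sketch, not part of the original examples):

alpha = 2 / (10 + 1)                        # span=10  ->  alpha ~ 0.18
df["A"].ewm(span=10).mean().head()
df["A"].ewm(alpha=alpha).mean().head()      # effectively the same smoothing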

df["A"].ewm(span=10).mean().plot(color="tab:red");
df["A"].rolling(window=10).mean().plot(color="tab:green");
[Plot: EWM mean (red) vs. rolling mean (green) of column A]
df["A"].ewm(span=10, min_periods=5).mean().plot(color="tab:red");
df["A"].rolling(window=10).mean().plot(color="tab:green");
[Plot: EWM mean with min_periods=5 (red) vs. rolling mean (green) of column A]

We can apply the same kinds of aggregation functions as we did above with the rolling() and expanding() functions. A couple of examples follow.

df.ewm(span=10).std().plot();
[Plot: EWM standard deviation of all columns]
df.ewm(span=10).agg(["mean", "var"]).head()

This concludes our tutorial on resampling and moving window functions with time-series data using pandas.

Resources:

https://towardsdatascience.com/using-the-pandas-resample-function-a231144194c4

https://coderzcolumn.com/tutorials/data-science/time-series-resampling-and-moving-window-functions

https://newbedev.com/difference-between-asfreq-and-resample
