
Finding and removing seasonality in Time-Series Data with Python

Seasonality in Time Series

Time series data may contain seasonal variation. Seasonal variation, or seasonality, refers to cycles that repeat regularly over time.

A repeating pattern within each year is known as seasonal variation, although the term is applied more generally to repeating patterns within any fixed period.

A cyclic structure in a time series may or may not be seasonal. If it consistently repeats at the same frequency, it is seasonal; otherwise, it is not seasonal and is simply called a cycle.

Why Explore Seasonality?

Seasonality in time-series data refers to a pattern that occurs at a regular interval. This is different from regular cyclic trends, such as the rise and fall of stock prices, that re-occur regularly but don’t have a fixed period. There’s a lot of insight to be gained from understanding seasonality patterns in your data and you can even use it as a baseline to compare your time-series machine learning models.
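To make the baseline idea concrete, here is a minimal sketch of a seasonal naive forecast, which simply repeats the last observed cycle. The function name seasonal_naive_forecast and the sample numbers are made up for illustration; they do not come from the real estate data.

```python
import numpy as np

def seasonal_naive_forecast(history, period, horizon):
    """Forecast by repeating the last full seasonal cycle of `history`."""
    history = np.asarray(history, dtype=float)
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_cycle = history[-period:]
    # Tile the last cycle until it covers the horizon, then trim.
    repeats = int(np.ceil(horizon / period))
    return np.tile(last_cycle, repeats)[:horizon]

# Hypothetical example: two years of quarterly values (period = 4).
history = [10, 20, 30, 40, 12, 22, 32, 42]
forecast = seasonal_naive_forecast(history, period=4, horizon=6)
print(forecast)
```

A machine learning model that cannot beat this kind of forecast on held-out data is probably not capturing much beyond the seasonal pattern itself.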

Time series datasets can contain a seasonal component. This is a cycle that repeats over time, such as monthly or yearly. This repeating cycle may obscure the signal that we wish to model when forecasting, and in turn, may provide a strong signal to our predictive models. In this tutorial, you will discover how to identify and correct seasonality in time series data with Python.

Understanding the seasonal component in time series can improve the performance of modeling with machine learning.

This can happen in two main ways:

  • Clearer Signal: Identifying and removing the seasonal component from the time series can result in a clearer relationship between input and output variables.
  • More Information: Additional information about the seasonal component of the time series can provide new information to improve model performance.

Both approaches may be useful in a project. Modeling seasonality and removing it from the time series may occur during data cleaning and preparation.

Extracting seasonal information and providing it as input features, either directly or in summary form, may occur during feature extraction and feature engineering activities.
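As a rough sketch of the feature-engineering route, the snippet below derives a few calendar features from a date index. The helper name add_seasonal_features, the column names, and the sample data are assumptions made for this example.

```python
import numpy as np
import pandas as pd

def add_seasonal_features(df):
    """Return a copy of `df` (with a DatetimeIndex) plus simple calendar features."""
    out = df.copy()
    out["month"] = out.index.month
    out["day_of_week"] = out.index.dayofweek   # Monday = 0
    # Cyclical encoding so that December (12) sits next to January (1).
    angle = 2 * np.pi * (out["month"] - 1) / 12
    out["month_sin"] = np.sin(angle)
    out["month_cos"] = np.cos(angle)
    return out

# Hypothetical data: one value per day for a year.
idx = pd.date_range("2020-01-01", "2020-12-31", freq="D")
features = add_seasonal_features(pd.DataFrame({"value": range(len(idx))}, index=idx))
print(features.head())
```

The sine and cosine pair is one common way to tell a model that the end of the year is close to the start of the next one; a plain integer month would not convey that.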

Types of Seasonality

There are many types of seasonality; for example:

  • Time of Day.
  • Daily.
  • Weekly.
  • Monthly.
  • Yearly.

As such, identifying whether there is a seasonality component in your time series problem is subjective. The simplest approach to determining if there is an aspect of seasonality is to plot and review your data, perhaps at different scales and with the addition of trend lines.
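One possible way to look at a series at different scales is to resample it and overlay a rolling mean as a trend line. The sketch below uses a synthetic daily series, so the numbers are assumptions rather than real observations; the plotting calls are left as comments.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: a yearly cycle plus a slow upward drift.
idx = pd.date_range("2015-01-01", "2018-12-31", freq="D")
t = np.arange(len(idx))
daily = pd.Series(10 + 0.002 * t + 5 * np.sin(2 * np.pi * t / 365.25), index=idx)

# View the same data at a coarser scale (one mean value per month).
monthly = daily.resample("MS").mean()

# A 365-day centered rolling mean averages out the yearly cycle,
# leaving an approximate trend line.
trend_line = daily.rolling(window=365, center=True).mean()

# To plot (requires matplotlib):
# ax = daily.plot(alpha=0.4, label="daily")
# monthly.plot(ax=ax, label="monthly mean")
# trend_line.plot(ax=ax, label="365-day rolling mean")
# ax.legend()
```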

Getting Started

Quick note: For this article, we’ll be using data published by the Quebec Professional Association of Real Estate Brokers. The association publishes monthly real estate stats. For convenience, I’ve put the monthly median condo prices for the Province of Quebec and the Montreal Metropolitan Area into a CSV file, available here: https://drive.google.com/file/d/1SMrkZPAa0aAl-ZhnHLLFbmdgYmtXgpAb/view?usp=sharing

The quickest way to get an idea of whether or not your data has a seasonal pattern is by plotting it. Let’s see what we get when we plot the median condo price in Montreal by month.

import pandas as pd

data_orig = pd.read_csv('quebec_real_estate.csv')

data_orig['Date'] = pd.to_datetime(data_orig['Date']) # convert date column to DateTime
ax = data_orig.plot(x='Date', y='Montreal_Median_Price', figsize=(12,6))

A keen eye might already see from this plot that the prices seem to dip around the new year and peak a few months before, around late summer. Let’s dive a little further into this by plotting a vertical line for January of every year.

import matplotlib.pyplot as plt

ax = data_orig.plot(x='Date', y='Montreal_Median_Price', figsize=(12,6))
# mark 1 January of each year with a dashed vertical line
xcoords = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01', '2020-01-01',
           '2021-01-01']
for xc in xcoords:
    ax.axvline(x=pd.Timestamp(xc), color='black', linestyle='--')
plt.show()

It seems like there’s definitely a repeating pattern here. In this case, it appears the seasonality has a period of one year. Next, we’ll look into a tool we can use to further examine the seasonality and break down our time series into its trend, seasonal, and residual components. Before we can do that though, you’ll have to understand the difference between additive and multiplicative seasonality.

Additive vs Multiplicative Seasonality

There are two types of seasonality that you may come across when analyzing time-series data. To understand the difference between them let’s look at a standard time series with perfect seasonality, a cosine wave:

Cosine Wave Plot

We can clearly see that the period of the wave is 20 and the amplitude (distance from the center line to the top of a crest or to the bottom of a trough) is 1 and remains constant.

Additive Seasonality

It’s pretty rare for an actual time series to have constant crest and trough values; instead, we typically see some kind of general trend, such as an increase or a decrease over time. In our sales price plot, for example, the median price tends to go up over time.

If the amplitude of our seasonality tends to remain the same, then we have what’s called an additive seasonality. Below is an example of an additive seasonality.

Additive seasonality

A great way to think about it is by imagining we took our standard cosine wave and simply added a trend to it:

Image by author

We can even think of our basic cosine model from earlier as an additive model with a constant trend! We can model additive time series using the following simple equation:

Y[t] = T[t] + S[t] + e[t]

Y[t]: Our time-series function
T[t]: Trend (general tendency to move up or down)
S[t]: Seasonality (cyclic pattern occurring at regular intervals)
e[t]: Residual (random noise in the data that isn’t accounted for in the trend or seasonality)
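The additive equation can be sketched with synthetic data. The trend slope and noise level below are arbitrary assumptions; the period of 20 and amplitude of 1 match the cosine wave described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed, for repeatability
t = np.arange(200)

trend = 0.05 * t                             # T[t]: hypothetical linear trend
seasonal = np.cos(2 * np.pi * t / 20)        # S[t]: period 20, amplitude 1
noise = rng.normal(scale=0.1, size=t.size)   # e[t]: small random residual

y_additive = trend + seasonal + noise        # Y[t] = T[t] + S[t] + e[t]

# Subtracting the (known) trend leaves a wave of roughly constant amplitude.
detrended = y_additive - trend
print(detrended.min(), detrended.max())
```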

Multiplicative Seasonality

The other type of seasonality that you may encounter in your time-series data is multiplicative. In this type, the amplitude of our seasonality becomes larger or smaller based on the trend. An example of multiplicative seasonality is given below.

Multiplicative seasonality

We can apply a similar train of thought as we used with our additive model and imagine that we took our cosine wave but instead of adding the trend, we multiplied it (hence the name multiplicative seasonality):

Multiplicative seasonality

We can model this with a similar equation as our additive model by just swapping the additions for multiplications.

Y[t] = T[t] * S[t] * e[t]
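Here is a small sketch of the multiplicative equation on synthetic data, with made-up values for the trend and the seasonal factor. It also shows a handy property: taking the logarithm of a multiplicative series turns it into an additive one.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)

trend = 10 + 0.1 * t                                  # T[t], always positive
seasonal = 1 + 0.2 * np.cos(2 * np.pi * t / 20)       # S[t], oscillates around 1
noise = np.exp(rng.normal(scale=0.02, size=t.size))   # e[t], close to 1

y_mult = trend * seasonal * noise                     # Y[t] = T[t] * S[t] * e[t]

# The swing within one cycle grows as the trend grows...
early_swing = np.ptp(y_mult[:20])
late_swing = np.ptp(y_mult[-20:])

# ...and taking logs turns the product into a sum:
# log Y[t] = log T[t] + log S[t] + log e[t]
log_y = np.log(y_mult)
```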

Decomposing the dataset

Now that we have a clear picture of the different models, let’s look at how we can break down our real estate time series into its trend, seasonality, and residual components. We’ll be using the seasonal_decompose function from the statsmodels library.

The seasonal_decompose function requires you to select a model type for the seasonality (additive or multiplicative). We’ll select a multiplicative model since the amplitude of the cycles appears to increase with time. This would make sense, since lending rates are a large factor in housing prices and they are applied as a percentage of the price.

from statsmodels.tsa.seasonal import seasonal_decompose

data_orig.set_index('Date', inplace=True)

analysis = data_orig[['Montreal_Median_Price']].copy()
# period=12 states the yearly cycle of the monthly data explicitly
decompose_result_mult = seasonal_decompose(analysis, model="multiplicative", period=12)

# each component is aligned with the dates of the input series
trend = decompose_result_mult.trend
seasonal = decompose_result_mult.seasonal
residual = decompose_result_mult.resid

decompose_result_mult.plot();

Ta-da! The trend, seasonal, and residual components are returned as Pandas series so you can plot them by calling their plot() methods or perform further analysis on them. One thing that may be useful is measuring their correlation with outside factors. For example, you could measure the correlation between the trend and mortgage rates or you could see if there’s a strong correlation between the residual and the number of new babies born in the city.

From our decomposition, we can see the model picked up on a 5% difference between the seasons. If you’re looking to sell your house, you should probably list it in mid to late spring instead of mid-winter if you want to get top dollar!
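To get a feel for what a multiplicative decomposition computes, here is a simplified, hand-rolled sketch of the classical approach: a centered moving average for the trend, then averaged ratios for the seasonal factors. This is an illustration only and is not the statsmodels implementation; the function name classical_multiplicative and the synthetic series are assumptions.

```python
import numpy as np

def classical_multiplicative(y, period):
    """Simplified classical decomposition: y = trend * seasonal * resid."""
    y = np.asarray(y, dtype=float)
    # Centered moving average; an even period needs half weights at the ends.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(len(y), np.nan)          # undefined near the edges
    trend[half:len(y) - half] = np.convolve(y, weights, mode="valid")
    # Average the detrended ratios for each position in the cycle.
    ratios = y / trend
    indices = np.array([np.nanmean(ratios[i::period]) for i in range(period)])
    indices /= indices.mean()                # seasonal factors average to 1
    seasonal = np.resize(indices, len(y))    # repeat the factors over the series
    resid = y / (trend * seasonal)
    return trend, seasonal, resid

# Hypothetical monthly series: flat level of 100 with a repeating 5% pattern.
pattern = 1 + 0.05 * np.cos(2 * np.pi * np.arange(12) / 12)
y = 100 * np.tile(pattern, 6)
trend, seasonal, resid = classical_multiplicative(y, period=12)
deseasonalized = y / seasonal                # seasonally adjusted series
```

Dividing the series by its seasonal factors is the multiplicative counterpart of subtracting the seasonal component in an additive model.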

Removing Seasonality

Once seasonality is identified, it can be modeled. The model of seasonality can be removed from the time series. This process is called Seasonal Adjustment, or Deseasonalizing. A time series where the seasonal component has been removed is called seasonal stationary. A time series with a clear seasonal component is referred to as non-stationary.

There are sophisticated methods to study and extract seasonality from time series in the field of Time Series Analysis. As we are primarily interested in predictive modeling and time series forecasting, we are limited to methods that can be developed on historical data and available when making predictions on new data.

In this section, we will look at two methods for making seasonal adjustments on a classical meteorological-type problem of daily temperatures with a strong additive seasonal component. Next, let’s take a look at the dataset we will use in this tutorial.

Minimum Daily Temperatures Dataset

This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

The units are in degrees Celsius and there are 3,650 observations. The source of the data is credited as the Australian Bureau of Meteorology.

Below is a sample of the first 5 rows of data, including the header row.

"Date","Temperature"
"1981-01-01",20.7
"1981-01-02",17.9
"1981-01-03",18.8
"1981-01-04",14.6
"1981-01-05",15.8

Below is a plot of the entire dataset.

Minimum Daily Temperatures

The dataset shows a strong seasonality component and has a nice, fine-grained detail to work with.

Load the Minimum Daily Temperatures Dataset

Download the Minimum Daily Temperatures dataset and place it in the current working directory with the filename “daily-minimum-temperatures.csv”.

The code below will load and plot the dataset.

from pandas import read_csv
from matplotlib import pyplot
# parse the Date column so the x-axis is treated as dates, not strings
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True)
series.plot()
pyplot.show()

Running the example creates the following plot of the dataset.

Minimum Daily Temperature Dataset

Seasonal Adjustment with Differencing

A simple way to correct for a seasonal component is to use differencing.

If there is a seasonal component at the level of one week, then we can remove it on an observation today by subtracting the value from last week.
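As a quick sketch of this idea, pandas can subtract last week's value with diff(periods=7). The series below is synthetic: a made-up weekly pattern on top of a steady rise of 0.5 per day.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: a weekly pattern on top of a slow linear rise.
idx = pd.date_range("2021-01-04", periods=56, freq="D")   # starts on a Monday
weekly_pattern = np.array([0, 1, 2, 3, 4, 8, 9])          # Mon..Sun offsets
values = 0.5 * np.arange(56) + np.tile(weekly_pattern, 8)
series = pd.Series(values, index=idx)

# Subtract the value from 7 days earlier; the first week has no reference
# value and is dropped.
weekly_diff = series.diff(periods=7).dropna()
print(weekly_diff.unique())   # only the 7-day trend increase (3.5) remains
```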

In the case of the Minimum Daily Temperatures dataset, it looks like we have a seasonal component each year showing a swing from summer to winter.

We can subtract the daily minimum temperature from the same day last year to correct for seasonality. This would require special handling of February 29th in leap years and would mean that the first year of data would not be available for modeling.

Below is an example of using the difference method on the daily data in Python.

from pandas import read_csv
from matplotlib import pyplot
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0)
X = series.values
diff = list()
days_in_year = 365
for i in range(days_in_year, len(X)):
	value = X[i] - X[i - days_in_year]
	diff.append(value)
pyplot.plot(diff)
pyplot.show()

Running this example creates a new seasonally adjusted dataset and plots the result.

Differencing Seasonally Adjusted Minimum Daily Temperature

There are two leap years in our dataset (1984 and 1988). They are not explicitly handled, which means that the offsets for observations from March 1984 onwards are wrong by one day, and from March 1988 onwards they are wrong by two days. One option is to update the code example to be leap-day aware. Another option is to consider that the temperature within any given period of the year, perhaps a few weeks, is probably stable. We can shortcut this idea and consider all temperatures within a calendar month to be stable. An improved model may be to subtract the average temperature from the same calendar month in the previous year, rather than the same day.
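For the leap-day-aware option, one possible sketch looks up the same calendar date one year earlier instead of counting back 365 rows. The helper name same_date_last_year_diff and the test data are assumptions; February 29 is compared against February 28 of the previous year, which is a design choice rather than the only reasonable one.

```python
import pandas as pd

def same_date_last_year_diff(series):
    """Subtract the value observed on the same calendar date one year earlier."""
    # DateOffset(years=1) maps 29 February back to 28 February.
    previous_dates = series.index - pd.DateOffset(years=1)
    previous = series.reindex(previous_dates)   # NaN where no earlier date exists
    diff = pd.Series(series.to_numpy() - previous.to_numpy(), index=series.index)
    return diff.dropna()                        # the first year has no reference value

# Hypothetical data spanning a leap year (1984); each value encodes its date.
idx = pd.date_range("1983-01-01", "1985-12-31", freq="D")
values = 1000 * (idx.year - 1983) + 100 * idx.month + idx.day
series = pd.Series(values.astype(float), index=idx)

adjusted = same_date_last_year_diff(series)
```

Because each value in the test data encodes its own date, a correct alignment gives a difference of exactly 1000 for every date after the first year, apart from February 29 itself.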

We can start off by resampling the dataset to a monthly average minimum temperature.

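from pandas import read_csv
from matplotlib import pyplot
# parse_dates=True gives a DatetimeIndex, which resample() requires
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True)
# 'M' groups by calendar month (newer pandas versions prefer the alias 'ME')
resample = series.resample('M')
monthly_mean = resample.mean()
print(monthly_mean.head(13))
monthly_mean.plot()
pyplot.show()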

Running this example prints the first 13 months of average monthly minimum temperatures.


Date
1981-01-31 17.712903
1981-02-28 17.678571
1981-03-31 13.500000
1981-04-30 12.356667
1981-05-31 9.490323
1981-06-30 7.306667
1981-07-31 7.577419
1981-08-31 7.238710
1981-09-30 10.143333
1981-10-31 10.087097
1981-11-30 11.890000
1981-12-31 13.680645
1982-01-31 16.567742

It also plots the monthly data, clearly showing the seasonality of the dataset.

Minimum Monthly Temperature Dataset

We can test the same differencing method on the monthly data and confirm that the seasonally adjusted dataset does indeed remove the yearly cycles.

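from pandas import read_csv
from matplotlib import pyplot
# parse_dates=True gives a DatetimeIndex, which resample() requires
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True)
# reduce the single-column DataFrame to a Series of monthly means
# ('M' groups by calendar month; newer pandas versions prefer 'ME')
monthly_mean = series.squeeze('columns').resample('M').mean()
diff = list()
months_in_year = 12
for i in range(months_in_year, len(monthly_mean)):
	# subtract the mean of the same month one year earlier
	value = monthly_mean.iloc[i] - monthly_mean.iloc[i - months_in_year]
	diff.append(value)
pyplot.plot(diff)
pyplot.show()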

Running the example creates a new seasonally adjusted monthly minimum temperature dataset, skipping the first year of data in order to create the adjustment. The adjusted dataset is then plotted.

Seasonally Adjusted Minimum Monthly Temperature Dataset

Next, we can use the monthly average minimum temperatures from the same month in the previous year to adjust the daily minimum temperature dataset.

Again, we just skip the first year of data, but the correction using the monthly rather than the daily data may be a more stable approach.

from pandas import read_csv
from matplotlib import pyplot
# parse dates so that index entries expose .year and .month
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0, parse_dates=True)
series = series.squeeze('columns')
X = series.values
diff = list()
days_in_year = 365
for i in range(days_in_year, len(X)):
	# label of the same calendar month in the previous year, e.g. '1981-01'
	month_str = '%d-%02d' % (series.index[i].year - 1, series.index[i].month)
	month_mean_last_year = series.loc[month_str].mean()
	value = X[i] - month_mean_last_year
	diff.append(value)
pyplot.plot(diff)
pyplot.show()

Running the example again creates the seasonally adjusted dataset and plots the results. This example is robust to daily fluctuations in the previous year and to offset errors creeping in due to February 29 days in leap years.

More Stable Seasonally Adjusted Minimum Monthly Temperature Dataset

The edge of calendar months provides a hard boundary that may not make sense for temperature data. More flexible approaches that take the average from one week on either side of the same date in the previous year may again be a better approach. Additionally, there is likely to be seasonality in temperature data at multiple scales that may be corrected for directly or indirectly, such as:
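The more flexible window idea could be sketched as follows: smooth the series with a centered rolling mean, then subtract the smoothed value from the same date in the previous year. The function name window_adjusted, the one-week half width, and the synthetic temperatures are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def window_adjusted(series, half_width_days=7):
    """Subtract last year's mean over a window centered on the same date."""
    window = 2 * half_width_days + 1
    # Centered rolling mean: each date averages itself and its neighbours.
    smoothed = series.rolling(window=window, center=True, min_periods=1).mean()
    reference = smoothed.reindex(series.index - pd.DateOffset(years=1))
    adjusted = pd.Series(series.to_numpy() - reference.to_numpy(), index=series.index)
    return adjusted.dropna()   # the first year has no reference window

# Hypothetical daily temperatures: a yearly cycle plus noise.
rng = np.random.default_rng(1)
idx = pd.date_range("1981-01-01", "1983-12-31", freq="D")
cycle = 11 + 4 * np.cos(2 * np.pi * idx.dayofyear.to_numpy() / 365.25)
temps = pd.Series(cycle + rng.normal(scale=2.0, size=len(idx)), index=idx)

adjusted = window_adjusted(temps)
```

Unlike the calendar-month version, the reference window here slides with the date, so there is no jump in the correction at month boundaries.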

  • Day level.
  • Multiple day level, such as a week or weeks.
  • Multiple week level, such as a month.
  • Multiple month level, such as a quarter or season.

Seasonal Adjustment with Modeling

We can model the seasonal component directly, then subtract it from the observations. The seasonal component in a given time series is likely a sine wave over a generally fixed period and amplitude. This can be approximated easily using a curve-fitting method.

A dataset can be constructed with the time index of the sine wave as an input, or x-axis, and the observation as the output, or y-axis.

For example:

Time Index, Observation
1, obs1
2, obs2
3, obs3
4, obs4
5, obs5
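Given pairs like these, one possible curve-fitting method is ordinary least squares on a sine and a cosine term. The sketch below is an illustrative alternative, not the method used in the rest of this tutorial; the function name fit_sinusoid, the assumed period of 365 steps, and the synthetic observations are all assumptions.

```python
import numpy as np

def fit_sinusoid(t, y, period):
    """Least-squares fit of y ~ a + b*sin(w*t) + c*cos(w*t), with w = 2*pi/period."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 2 * np.pi / period
    # One column per term: constant level, sine, cosine.
    design = np.column_stack([np.ones_like(t), np.sin(w * t), np.cos(w * t)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

# Hypothetical observations: time index 0..729 with a 365-step cycle.
rng = np.random.default_rng(2)
t = np.arange(730)
obs = 11 + 4 * np.cos(2 * np.pi * t / 365) + rng.normal(scale=1.0, size=t.size)

coef, seasonal_curve = fit_sinusoid(t, obs, period=365)
adjusted = obs - seasonal_curve   # seasonally adjusted observations
```

This only works when the period is known in advance, which is a reasonable assumption for yearly temperature cycles but not for every dataset.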

Once fit, the model can then be used to calculate a seasonal component for any time index. In the case of the temperature data, the time index would be the day of the year. We can then estimate the seasonal component for the day of the year for any historical observations or any new observations in the future. The curve can then be used as a new input for modeling with supervised learning algorithms or subtracted from observations to create a seasonally adjusted series. Let’s start off by fitting a curve to the Minimum Daily Temperatures dataset.

The NumPy library provides the polyfit() function that can be used to fit a polynomial of a chosen order to a dataset. First, we can create a dataset that maps the time index (day in this case) to each observation. We could take a single year of data or all the years. Ideally, we would try both and see which model resulted in a better fit. We could also smooth the observations using a moving average centered on each value. This too may result in a model with a better fit.

Once the dataset is prepared, we can create the fit by calling the polyfit() function passing the x-axis values (integer day of year), y-axis values (temperature observations), and the order of the polynomial. The order controls the number of terms, and in turn, the complexity of the curve used to fit the data.

Ideally, we want the simplest curve that describes the seasonality of the dataset. For consistent sine wave-like seasonality, a 4th-order or 5th-order polynomial will be sufficient.

In this case, I chose an order of 4 by trial and error. The resulting model takes the form:

y = x^4*b1 + x^3*b2 + x^2*b3 + x^1*b4 + b5

Where y is the fit value, x is the time index (day of the year), and b1 to b5 are the coefficients found by the curve-fitting optimization algorithm.

Once fit, we will have a set of coefficients that represent our model. We can then use this model to calculate the curve for one observation, one year of observations, or the entire dataset.
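As a small sketch of how the coefficients are used, the snippet below fits a 4th-order polynomial to one synthetic year of data and evaluates it in two ways: term by term, following the equation above, and with NumPy's polyval() function. The synthetic temperatures are an assumption for illustration.

```python
import numpy as np

# Hypothetical single year of smooth "temperatures" indexed by day of year.
x = np.arange(365)
y = 11 + 4 * np.cos(2 * np.pi * x / 365)

degree = 4
coef = np.polyfit(x, y, degree)   # highest power first: b1 ... b5

# Manual evaluation: x^4*b1 + x^3*b2 + x^2*b3 + x^1*b4 + b5
manual = sum(coef[d] * x ** (degree - d) for d in range(degree + 1))

# numpy.polyval evaluates the same polynomial in one call.
curve = np.polyval(coef, x)
```

Both give the same curve, so polyval() can stand in for the explicit loop when you only need the fitted values.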

The complete example is listed below.

from pandas import read_csv
from matplotlib import pyplot
from numpy import polyfit
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0)
# fit polynomial: x^4*b1 + x^3*b2 + x^2*b3 + x*b4 + b5
X = [i%365 for i in range(0, len(series))]
y = series.values
degree = 4
coef = polyfit(X, y, degree)
print('Coefficients: %s' % coef)
# create curve
curve = list()
for i in range(len(X)):
	value = coef[-1]
	for d in range(degree):
		value += X[i]**(degree-d) * coef[d]
	curve.append(value)
# plot curve over original data
pyplot.plot(series.values)
pyplot.plot(curve, color='red', linewidth=3)
pyplot.show()

Running the example creates the dataset, fits the curve, predicts the value for each day in the dataset, and then plots the resulting seasonal model (red) over the top of the original dataset (blue). One limitation of this model is that it does not take into account leap days, adding small offset noise that could easily be corrected with an update to the approach.

For example, we could just remove the two February 29 observations from the dataset when creating the seasonal model.
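A minimal sketch of that update is shown below, on a made-up series that covers one leap year. It assumes the data starts on 1 January, so that the position within the year can be computed with a simple modulo.

```python
import pandas as pd

# Hypothetical daily series covering a leap year (1984) and a normal year.
idx = pd.date_range("1984-01-01", "1985-12-31", freq="D")
series = pd.Series(range(len(idx)), index=idx, dtype=float)

# Drop 29 February so that every year has exactly 365 observations.
is_leap_day = (series.index.month == 2) & (series.index.day == 29)
no_leap = series[~is_leap_day]

# Position within the year (0..364), now aligned across years
# (assumes the series starts on 1 January).
day_index = [i % 365 for i in range(len(no_leap))]
```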

Curve Fit Seasonal Model of Daily Minimum Temperature

The curve appears to be a good fit for the seasonal structure in the dataset. We can now use this model to create a seasonally adjusted version of the dataset. The complete example is listed below.

from pandas import read_csv
from matplotlib import pyplot
from numpy import polyfit
series = read_csv('daily-minimum-temperatures.csv', header=0, index_col=0)
# fit polynomial: x^4*b1 + x^3*b2 + x^2*b3 + x*b4 + b5
X = [i%365 for i in range(0, len(series))]
y = series.values
degree = 4
coef = polyfit(X, y, degree)
print('Coefficients: %s' % coef)
# create curve
curve = list()
for i in range(len(X)):
	value = coef[-1]
	for d in range(degree):
		value += X[i]**(degree-d) * coef[d]
	curve.append(value)
# create seasonally adjusted
values = series.values
diff = list()
for i in range(len(values)):
	value = values[i] - curve[i]
	diff.append(value)
pyplot.plot(diff)
pyplot.show()

Running the example subtracts the values predicted by the seasonal model from the original observations. The seasonally adjusted dataset is then plotted.

Curve Fit Seasonally Adjusted Daily Minimum Temperature

Source:

https://towardsdatascience.com/finding-seasonal-trends-in-time-series-data-with-python-ce10c37aa861

Amir Masoud Sefidian
Machine Learning Engineer
