In statistics, correlation or dependence refers to any statistical association between two random variables or bivariate data, whether causal or not. Although the term can cover any statistical association in the broadest sense, in practice it usually refers to the degree to which two variables are linearly related.
Correlations are helpful because they can reveal a predictive relationship that can be exploited in practice. For example, based on the relationship between electricity demand and weather, an electrical utility might produce less power on a mild day. Extreme weather causes people to use more power for heating and cooling, so in this case the relationship is also causal.
To summarize the correlation between two variables, Pearson’s correlation coefficient is frequently used. It takes a value between -1 and 1: the sign indicates whether the relationship is negative or positive, and the magnitude indicates how strong the linear relationship is. A value of zero means there is no linear association. We can see how differently correlated data looks in the image below.
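To make the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand and compares it with NumPy's built-in np.corrcoef. The temperature and demand numbers are made up for illustration and are not taken from any real utility; pearson_r is a hypothetical helper name.

```python
import numpy as np

# Hypothetical toy data: how far the temperature is from "mild" (°C)
# and the corresponding power demand (MW). Values are invented.
temp_dev = np.array([1.0, 3.0, 5.0, 8.0, 12.0])
demand = np.array([102.0, 108.0, 113.0, 125.0, 140.0])

def pearson_r(x, y):
    """Pearson's r: covariance scaled by the spread of both variables."""
    xc = x - x.mean()
    yc = y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

r = pearson_r(temp_dev, demand)
r_numpy = np.corrcoef(temp_dev, demand)[0, 1]  # same quantity via NumPy
print(round(r, 3), round(r_numpy, 3))
```

Because the toy data lies almost on a straight line, r comes out close to 1.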
Autocorrelation is a mathematical representation of the degree of similarity between a given time series and a lagged version of itself over successive time intervals. It is conceptually the same as the correlation between two different time series, except that the same time series is used twice: once in its original form and once shifted by one or more time periods.
For example, if it is raining today, positive autocorrelation implies that rain tomorrow is more likely than it would be if today were dry. When it comes to investing, a stock’s returns may show strong positive autocorrelation, which implies that if it’s up today, it’s more likely to be up tomorrow.
A partial autocorrelation, on the other hand, summarizes the relationship between an observation in a time series and an observation at an earlier time step after the effects of the intervening observations have been removed. The ordinary autocorrelation between an observation and an earlier one is made up of both a direct correlation and indirect correlations that pass through the observations in between. The partial autocorrelation function removes these indirect correlations.
Autocorrelation is the same, but with a twist — you’ll calculate a correlation between a sequence with itself lagged by some number of time units. Don’t worry if you don’t fully get it, as we’ll explore it next.
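Here is a minimal sketch of that idea, assuming a small made-up series that repeats every four steps. The helper name lagged_correlation is hypothetical; it simply correlates the series with a copy of itself shifted by the given number of steps.

```python
import numpy as np

def lagged_correlation(series, lag):
    """Pearson correlation between series[t] and series[t - lag]."""
    series = np.asarray(series, dtype=float)
    if lag == 0:
        return 1.0  # a series is perfectly correlated with itself
    # series[lag:] holds the "current" values, series[:-lag] the lagged ones
    return np.corrcoef(series[lag:], series[:-lag])[0, 1]

# Toy series (invented): the pattern 1, 3, 2, 5 repeated six times
toy = np.tile([1.0, 3.0, 2.0, 5.0], 6)

print(lagged_correlation(toy, 1))  # neighbouring values
print(lagged_correlation(toy, 4))  # values one full cycle apart
```

At lag 4 the shifted copy lines up exactly with the original, so the correlation is 1 (up to floating-point error), while at lag 1 it is much lower.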
You’ll use the Airline passengers dataset throughout the article. Here’s how to import the libraries, load and plot the dataset:
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
from matplotlib import rcParams
from cycler import cycler

rcParams['figure.figsize'] = 18, 5
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
rcParams['axes.prop_cycle'] = cycler(color=['#365977'])
rcParams['lines.linewidth'] = 2.5

# Dataset
df = pd.read_csv('data/airline-passengers.csv', index_col='Month', parse_dates=True)

# Visualize
plt.title('Airline Passengers dataset', size=20)
plt.plot(df);
And here’s what the dataset looks like:
Let’s now briefly discuss autocorrelation and partial autocorrelation, whose functions are often abbreviated as ACF and PACF.
As said before, autocorrelation shows the correlation of a sequence with itself lagged by some number of time units. Once plotted, X-axis shows the lag number, and Y-axis shows the correlation of the sequence with a sequence at that lag. Y-axis ranges from -1 to 1.
Here’s an example.
The airline passenger dataset shows the number of passengers per month from 1949 to 1960. Autocorrelation answers the following question: “How correlated is the number of passengers this month with the number of passengers in the previous month?”. Here, the previous month indicates the lag value of 1.
You can rephrase the question and ask how correlated the number of passengers this month is with the number of passengers a year ago. Then, the lag value would be 12. And this is a great question, since yearly seasonality is visible from the chart.
One thing to remember: correlation generally weakens as the lag grows, because more recent periods tend to have more impact.
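The effect of yearly seasonality on the lag-12 value can be sketched with synthetic data. The series below is a made-up sine wave with a 12-month period plus noise, not the airline dataset, and lag_corr is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
months = np.arange(144)  # 12 years of monthly observations

# Invented seasonal series: a 12-month cycle plus a little random noise
seasonal = 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 1, size=months.size)

def lag_corr(x, lag):
    """Correlation between the series and itself shifted by `lag` steps."""
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

lag_6 = lag_corr(seasonal, 6)    # half a year apart
lag_12 = lag_corr(seasonal, 12)  # a full year apart
print(round(lag_6, 2), round(lag_12, 2))
```

Observations 12 months apart sit at the same point of the cycle, so their correlation is strongly positive, while observations 6 months apart sit at opposite points and are strongly negatively correlated.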
Before calculating autocorrelation, you should make the time series stationary. We haven’t covered the concept of stationarity yet, but we will in the following article. In a nutshell — the mean, variance, and covariance shouldn’t change over time.
The easiest way to make time series stationary is by calculating the first-order difference. It’s not a way to statistically prove stationarity, but don’t worry about it for now.
Here’s how to calculate the first-order difference:
# First-order difference
df['Passengers_Diff'] = df['Passengers'].diff(periods=1)
df = df.dropna()

# Plot
plt.title('Airline Passengers dataset with First-order difference', size=20)
plt.plot(df['Passengers'], label='Passengers')
plt.plot(df['Passengers_Diff'], label='First-order difference', color='orange')
plt.legend();
Here’s what both series look like:
The differenced series doesn’t look completely stationary, but it will do for now.
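A rough way to see what differencing does is to apply it to a series with an obvious trend. The sketch below uses a made-up linear trend plus noise rather than the airline data, and compares the mean of the first and second half of each series. This is only a crude sanity check on the mean, not a stationarity test, and half_means is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200)

# Invented series: a steady upward trend plus random noise
trending = 0.5 * t + rng.normal(0, 1, size=t.size)

# First-order difference: x[t] - x[t-1]
# (the same values pandas .diff(periods=1) gives after dropping the first NaN)
diffed = np.diff(trending)

def half_means(x):
    """Mean of the first half and of the second half of a series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

trend_first, trend_second = half_means(trending)
diff_first, diff_second = half_means(diffed)

trend_gap = abs(trend_second - trend_first)  # large: the mean drifts over time
diff_gap = abs(diff_second - diff_first)     # small: the mean is roughly constant
print(round(trend_gap, 2), round(diff_gap, 2))
```

The original series has very different means in its two halves, while the differenced series hovers around the same value (the slope of the trend) throughout.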
You can now use the acf() function from statsmodels to calculate autocorrelation:
# Calculate autocorrelation
acf_values = acf(df['Passengers_Diff'])
Here’s how the values look, rounded to two decimal places:
The first value is 1 because it is the correlation of the series with itself at lag 0. But take a look at the 12th period: the autocorrelation value is 0.83. This tells you that the value 12 periods ago has a strong impact on the value today.
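For intuition about where these numbers come from, here is a minimal sketch of the usual sample autocorrelation estimator, written with NumPy only. As far as I know it matches what acf() computes with its default settings, but treat that as an assumption and check the statsmodels documentation. The function name sample_acf and the five-point toy series are hypothetical.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags.

    Each value is the sum of products of mean-centered observations
    `lag` steps apart, divided by the total sum of squares.
    """
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = (centered ** 2).sum()
    return np.array([
        (centered[lag:] * centered[:len(x) - lag]).sum() / denom
        for lag in range(nlags + 1)
    ])

toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # invented values
acf_toy = sample_acf(toy, nlags=2)
print(acf_toy)  # values: 1.0, 0.4, -0.1
```

Lag 0 is always 1 because the numerator and denominator are the same sum of squares.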
Further, you can use the plot_acf() function to inspect the autocorrelation visually:
# Plot autocorrelation
plot_acf(df['Passengers_Diff'], lags=30);
Here’s what it looks like:
The plot confirms our assumption about the correlation at lag 12. The same is visible at lag 24, but the correlation declines over time. The value 12 periods ago has more impact on the value today than the value 24 periods ago does.
Another thing to note is the shaded area. It represents a confidence interval (95% by default), and anything inside it isn’t statistically significant.
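To get a feel for what "not significant" means, the sketch below uses the common rule of thumb that, for pure white noise of length N, sample autocorrelations fall within roughly ±1.96 / sqrt(N) about 95% of the time. This is an approximation chosen for illustration: the band drawn by plot_acf() may be computed somewhat differently (for example, it can widen with the lag), and sample_acf is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)
noise = rng.normal(size=500)  # white noise: no true autocorrelation at any lag

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    centered = x - x.mean()
    denom = (centered ** 2).sum()
    return np.array([
        (centered[lag:] * centered[:len(x) - lag]).sum() / denom
        for lag in range(nlags + 1)
    ])

bound = 1.96 / np.sqrt(len(noise))          # approximate 95% band for white noise
acf_noise = sample_acf(noise, nlags=30)
outside = np.abs(acf_noise[1:]) > bound     # skip lag 0, which is always 1
print(f"bound = ±{bound:.3f}, lags outside the band: {outside.sum()} of 30")
```

Even for pure noise, a lag or two will occasionally poke outside the band by chance, which is worth keeping in mind before reading too much into a single marginal spike.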
Partial autocorrelation is a bit tougher to understand. It does the same as regular autocorrelation: it shows the correlation of a sequence with itself lagged by some number of time units. But there’s a twist. Only the direct effect is shown, and all intermediary effects are removed.
For example, say you want to know the direct relationship between the number of passengers this month and 12 months ago. You don’t care about anything in between.
The number of passengers 12 months ago affects the number of passengers 11 months ago, which in turn affects the number 10 months ago, and the whole chain repeats until the most recent period. These indirect effects are removed in partial autocorrelation calculations.
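One way to see how the indirect effects get removed is to estimate the partial autocorrelation with a regression: regress the current value on all lags up to k and keep only the coefficient of the k-th lag. The sketch below does this with NumPy on a simulated series in which each value depends directly only on the previous one. The series, the coefficient 0.8, and the helper pacf_ols are all assumptions made for the example, and this is one of several possible estimation methods.

```python
import numpy as np

def pacf_ols(x, lag):
    """Partial autocorrelation at `lag`, estimated by least squares.

    Regress x[t] on an intercept and x[t-1], ..., x[t-lag];
    the coefficient of x[t-lag] is the partial autocorrelation.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    columns = [np.ones(n - lag)] + [x[lag - j:n - j] for j in range(1, lag + 1)]
    X = np.column_stack(columns)
    y = x[lag:]
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[-1]

# Simulated series: each value depends directly only on the previous one
rng = np.random.default_rng(7)
n = 2000
noise = rng.normal(size=n)
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + noise[t]

acf_lag2 = np.corrcoef(series[2:], series[:-2])[0, 1]  # direct + indirect effects
pacf_lag1 = pacf_ols(series, 1)
pacf_lag2 = pacf_ols(series, 2)                        # direct effect only
print(round(acf_lag2, 2), round(pacf_lag1, 2), round(pacf_lag2, 2))
```

The ordinary autocorrelation at lag 2 is clearly positive because the influence travels through lag 1, but once lag 1 is accounted for, the partial autocorrelation at lag 2 is close to zero.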
You should also make the time series stationary before calculations.
You can use the pacf() function from statsmodels for the calculation:
# Calculate partial autocorrelation
pacf_values = pacf(df['Passengers_Diff'])
Here’s what the values look like:
The correlation value at lag 12 has dropped to 0.61, indicating the direct relationship is a bit weaker. Let’s take a look at the results graphically to confirm these are still significant:
# Plot partial autocorrelation
plot_pacf(df['Passengers_Diff'], lags=30);
Here’s what it looks like:
To conclude — the lag at 12 is still significant, but the lag at 24 isn’t. A couple of lags before 12 are negatively correlated to the original time series. Take some time to think about why.
There’s still one important question remaining — how do you interpret ACF and PACF plots for forecasting? Let’s answer that next.
Time series models you’ll soon learn about, such as Auto Regression (AR), Moving Averages (MA), or their combinations (ARMA), require you to specify one or more parameters. These can be obtained by looking at ACF and PACF plots.
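In a nutshell: if the PACF cuts off after lag p while the ACF tails off gradually, an AR(p) model is a reasonable candidate; if the ACF cuts off after lag q while the PACF tails off gradually, an MA(q) model is a reasonable candidate; and if both tail off gradually, consider an ARMA combination.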
Still, reading ACF and PACF plots is challenging, and you’re far better off using grid search to find optimal parameter values. An optimal parameter combination has the lowest error (such as MAPE) or the lowest value of a general quality estimator (such as AIC). We’ll cover time series evaluation metrics soon, so stay tuned.
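As a rough illustration of the grid-search idea, the sketch below tries several AR orders on a simulated series and keeps the one with the lowest AIC. Everything here is an assumption made for the example: the series is a simulated AR(2) process, the model is fit by plain least squares with NumPy rather than with a forecasting library, and ar_aic is a hypothetical helper that uses the Gaussian AIC up to an additive constant.

```python
import numpy as np

# Simulated AR(2) series (invented coefficients 0.6 and -0.5)
rng = np.random.default_rng(3)
n = 1000
noise = rng.normal(size=n)
series = np.zeros(n)
for t in range(2, n):
    series[t] = 0.6 * series[t - 1] - 0.5 * series[t - 2] + noise[t]

def ar_aic(x, p, max_p):
    """Fit AR(p) by least squares and return its AIC.

    All orders are fit on the same observations (the first max_p are
    skipped) so that their AIC values are comparable.
    """
    n_obs = len(x) - max_p
    y = x[max_p:]
    columns = [np.ones(n_obs)] + [x[max_p - j:len(x) - j] for j in range(1, p + 1)]
    X = np.column_stack(columns)
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ coefs) ** 2).sum()
    # Gaussian AIC up to a constant: fit term plus a penalty per parameter
    return n_obs * np.log(rss / n_obs) + 2 * (p + 1)

max_p = 6
aic_by_order = {p: ar_aic(series, p, max_p) for p in range(1, max_p + 1)}
best_p = min(aic_by_order, key=aic_by_order.get)
print(best_p, {p: round(v, 1) for p, v in aic_by_order.items()})
```

AIC can sometimes pick a slightly larger order than the true one, so treat the result as a candidate rather than a certainty. The same loop works with an out-of-sample error such as MAPE in place of AIC.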
Plots of the autocorrelation function and the partial autocorrelation function for a time series tell a very different story.
We can use the intuition for ACF and PACF above to explore some thought experiments.
Consider a time series that was generated by an autoregression (AR) process with a lag of k. We know that the ACF describes the autocorrelation between an observation and another observation at a prior time step that includes direct and indirect dependence information. This means we would expect the ACF for the AR(k) time series to be strong to a lag of k and the inertia of that relationship would carry on to subsequent lag values, trailing off at some point as the effect was weakened. We know that the PACF only describes the direct relationship between an observation and its lag. This would suggest that there would be no correlation for lag values beyond k. This is exactly the expectation of the ACF and PACF plots for an AR(k) process.
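This thought experiment can be checked with a small simulation. The sketch below generates an AR(2) series (so k = 2) and estimates the ACF and PACF for the first few lags. The coefficients 0.5 and 0.3 are invented, and lag_corr and pacf_ols are hypothetical NumPy helpers, with the PACF estimated by regression as one possible method.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3000
noise = rng.normal(size=n)
ar2 = np.zeros(n)
for t in range(2, n):
    # AR(2): the current value depends directly on the two previous values
    ar2[t] = 0.5 * ar2[t - 1] + 0.3 * ar2[t - 2] + noise[t]

def lag_corr(x, lag):
    """Autocorrelation estimate: correlation with the series shifted by `lag`."""
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

def pacf_ols(x, lag):
    """Partial autocorrelation: coefficient of x[t-lag] in a regression on lags 1..lag."""
    m = len(x)
    X = np.column_stack([np.ones(m - lag)] + [x[lag - j:m - j] for j in range(1, lag + 1)])
    coefs, *_ = np.linalg.lstsq(X, x[lag:], rcond=None)
    return coefs[-1]

acf_ar = [lag_corr(ar2, k) for k in range(1, 5)]   # lags 1..4
pacf_ar = [pacf_ols(ar2, k) for k in range(1, 5)]  # lags 1..4
print(np.round(acf_ar, 2))
print(np.round(pacf_ar, 2))
```

The ACF stays well above zero at lags 3 and 4 and only trails off slowly, while the PACF is clearly nonzero at lags 1 and 2 and then drops to roughly zero.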
Consider a time series that was generated by a moving average (MA) process with a lag of k. Remember that the moving average process is an autoregression model of the time series of residual errors from prior predictions. Another way to think about the moving average model is that it corrects future forecasts based on errors made in recent forecasts. We would expect the ACF for the MA(k) process to show a strong correlation with recent values up to the lag of k, then a sharp decline to low or no correlation. By definition, this is how the process was generated. For the PACF, we would expect the plot to show a strong relationship to the lag and a trailing off of correlation from the lag onwards. Again, this is exactly the expectation of the ACF and PACF plots for an MA(k) process.
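The same kind of simulation works for the moving average case. The sketch below generates an MA(1) series (so k = 1) with an invented coefficient of 0.8 and compares the estimated lag-1 autocorrelation with the known theoretical value theta / (1 + theta^2). As before, lag_corr and pacf_ols are hypothetical NumPy helpers.

```python
import numpy as np

rng = np.random.default_rng(11)
n, theta = 3000, 0.8
noise = rng.normal(size=n + 1)

# MA(1): the current value is the current shock plus theta times the previous shock
ma1 = noise[1:] + theta * noise[:-1]

def lag_corr(x, lag):
    """Autocorrelation estimate: correlation with the series shifted by `lag`."""
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

def pacf_ols(x, lag):
    """Partial autocorrelation: coefficient of x[t-lag] in a regression on lags 1..lag."""
    m = len(x)
    X = np.column_stack([np.ones(m - lag)] + [x[lag - j:m - j] for j in range(1, lag + 1)])
    coefs, *_ = np.linalg.lstsq(X, x[lag:], rcond=None)
    return coefs[-1]

acf_ma = [lag_corr(ma1, k) for k in range(1, 5)]   # lags 1..4
pacf_ma = [pacf_ols(ma1, k) for k in range(1, 5)]  # lags 1..4
theoretical_lag1 = theta / (1 + theta ** 2)        # about 0.49 for theta = 0.8
print(np.round(acf_ma, 2), round(theoretical_lag1, 2))
print(np.round(pacf_ma, 2))
```

Here the roles are reversed: the ACF is clearly nonzero at lag 1 and then cuts off to roughly zero, while the PACF keeps showing nonzero values that shrink gradually over the following lags.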
And there you have it — autocorrelation and partial autocorrelation in a nutshell. Both functions and plots help analyze time series data, but we’ll mostly rely on brute-force parameter finding methods for forecasting. It’s much easier to do a grid search than to look at charts.
Both ACF and PACF require stationary time series. We’ve only covered stationarity briefly for now, but that will change in the following article. Stay tuned to learn everything about stationarity, stationarity tests, and testing automation.