Time series analysis and forecasting are some of the key problems in statistics and data science. Data becomes a time series when it's sampled on a time-bound attribute like days, months, or years, which gives it an implicit order. Forecasting is when we take that data and predict future values.
ARIMA and SARIMA are both forecasting algorithms. ARIMA models a series using its own past values (autoregressive and moving-average terms) and predicts future values from them. SARIMA uses past values in the same way but also accounts for seasonality patterns. Since SARIMA brings in seasonality as a parameter, it is significantly more powerful than ARIMA for forecasting data that contains cycles.
Further in the blog, we're going to explore how each model works, how to implement both on a real dataset, and when to use them. Before we move on to the algorithms, there's an important note about data preprocessing that you should be aware of before embarking on your forecasting journey.
Time series data is messy. Forecasting models, from simple rolling averages to LSTMs, require clean data, so there are some techniques worth applying before moving on to forecasting: handling duplicate or missing timestamps, aligning the series to a regular frequency, and filling gaps, as sketched below.
Note: this preprocessing advice is deliberately general and intended to emphasize that real-world projects involve a lot of cleaning and preparation.
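As a minimal sketch of what that cleanup might look like, assuming a pandas Series ts indexed by datetime and a monthly target frequency (both assumptions for illustration):
import pandas as pd

def basic_cleanup(ts):
    ts = ts[~ts.index.duplicated(keep='first')]  # drop duplicate timestamps
    ts = ts.sort_index()                         # enforce chronological order
    ts = ts.asfreq('MS')                         # align to month-start frequency (assumed)
    ts = ts.interpolate(method='linear')         # fill gaps created by the re-alignment
    return ts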
Now, let’s move on to the models.
The ARIMA model is a class of linear models that uses historical values to forecast future values. ARIMA stands for Autoregressive Integrated Moving Average; each of these components contributes to the final forecast. Let's understand them one by one.
In an autoregression model, we forecast the variable of interest using a linear combination of its own past values. The term autoregression indicates that it is a regression of the variable against itself: we use lagged values of the target variable as input variables to forecast future values. An autoregression model of order p looks like this:

m_t = β_0 + β_1·m_(t-1) + β_2·m_(t-2) + … + β_p·m_(t-p) + ε_t

In the above equation, the currently observed value m_t is a linear function of its past p values, plus an error term ε_t. β_0, …, β_p are the regression coefficients determined during training. There are some standard methods to determine the optimal value of p, one of which is analyzing Autocorrelation and Partial Autocorrelation function plots.
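As a quick illustration of fitting an AR(p) model in statsmodels (the series ts and the lag order 3 here are placeholders, not values from this post):
from statsmodels.tsa.ar_model import AutoReg

ar_result = AutoReg(ts, lags=3).fit()   # regress ts on its own 3 most recent lags
print(ar_result.params)                 # the fitted beta coefficients
print(ar_result.predict(start=len(ts), end=len(ts) + 11))  # forecast 12 steps ahead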
The autocorrelation function (ACF) measures the correlation between the current and past values of the same variable. Apart from the direct effect, it also captures the indirect effect that values carry forward through intermediate lags. For example, oil prices 2 days ago affect prices 1 day ago and, eventually, today; but prices 2 days ago might also affect today directly, and ACF measures the total of both.
Partial autocorrelation (PACF), on the other hand, measures only the direct correlation between past and current values. For example, PACF measures only the effect of oil prices 2 days ago on today, with the carry-over effect through the intervening day removed.
ACF and PACF plots help us determine how current values depend on past ones, which in turn helps us deduce p in AR. Head over here to understand how to deduce values for p (AR) and q (MA) in depth.
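For reference, generating these plots in statsmodels is a one-liner each (ts and the lag count are placeholders for illustration):
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

plot_acf(ts, lags=24)    # where the ACF tails off or cuts off hints at q
plot_pacf(ts, lags=24)   # where the PACF cuts off hints at p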
Integrated represents any differencing that has to be applied to make the data stationary. A Dickey-Fuller test (code below) can be run on the data to check for stationarity, and you can then experiment with different differencing factors. A differencing factor of d=1 means a lag-1 difference, i.e. m_t − m_(t-1). Let's look at a plot of the original vs. differenced data.

The difference between them is evident: after differencing, the series is significantly more stationary than the original, and its mean and variance are approximately constant over the years. We can use the code below to conduct a Dickey-Fuller test.
from statsmodels.tsa.stattools import adfuller

def check_stationarity(ts):
    # Run the Augmented Dickey-Fuller test
    dftest = adfuller(ts)
    adf = dftest[0]                    # test statistic
    pvalue = dftest[1]                 # p-value
    critical_value = dftest[4]['5%']   # critical value at the 5% significance level
    if (pvalue < 0.05) and (adf < critical_value):
        print('The series is stationary')
    else:
        print('The series is NOT stationary')
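For example, assuming ts is a pandas Series, you could test it before and after differencing:
check_stationarity(ts)                    # raw series, often non-stationary
check_stationarity(ts.diff().dropna())    # re-test after first differencing (d=1)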
Moving average models use past forecast errors, rather than past values, in a regression-like model to forecast future values. A moving average model of order q can be denoted by the following equation:

m_t = μ + ε_t + θ_1·ε_(t-1) + θ_2·ε_(t-2) + … + θ_q·ε_(t-q)

This is referred to as an MA(q) model. In the above equation, ε is the error term: the random residual deviation between the model and the target variable. Since ε can only be determined after fitting the model, yet is a parameter of the model at the same time, it is an unobservable parameter. Hence, to solve the MA equation, iterative techniques like Maximum Likelihood Estimation are used instead of OLS.
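As a sketch, a pure MA(q) model can be fitted in statsmodels by setting the AR and differencing orders to zero in the ARIMA class; under the hood the coefficients are estimated by MLE (q=2 is just an illustrative choice):
from statsmodels.tsa.arima.model import ARIMA

ma_result = ARIMA(ts, order=(0, 0, 2)).fit()  # AR order 0, no differencing, MA order 2
print(ma_result.params)                        # constant plus the estimated theta coefficients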
Now that we've looked at how ARIMA works, let's dive into an example and see how ARIMA is applied to time series data.
For the implementation, I've chosen catfish sales data from 1996 to 2008. We're going to apply the techniques we learned above to this dataset and see them in action. Although this data doesn't need much cleaning and arrives in a ready-to-be-analyzed state, you might have to apply cleaning techniques to your own dataset.
Unfortunately, we cannot replicate each and every scenario, as cleaning methods are highly subjective and also depend on the team's requirements. But the techniques learned here can be applied directly to your dataset once it's clean.
Let’s start with importing essential modules.
from IPython.display import display
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 15)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import matplotlib.pyplot as plt
from datetime import datetime
from datetime import timedelta
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model module has been removed
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from time import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings('ignore')
RANDOM_SEED = 0
np.random.seed(RANDOM_SEED)  # np.random.seed returns None, so store the seed value itself
These are pretty self-explanatory modules that every data scientist will be familiar with. It's always good practice to set a random seed so the code reproduces the same results.
Next, we're going to import and plot the time series data.
def parser(s):
    return datetime.strptime(s, '%Y-%m-%d')
#read data
catfish_sales = pd.read_csv('catfish.csv', parse_dates=[0], index_col=0, date_parser=parser)
#infer the frequency of the data
catfish_sales = catfish_sales.asfreq(pd.infer_freq(catfish_sales.index))
#transform
start_date = datetime(1996,1,1)
end_date = datetime(2008,1,1)
lim_catfish_sales = catfish_sales[start_date:end_date]
#plot
plt.figure(figsize=(14,4))
plt.plot(lim_catfish_sales)
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)
For the sake of simplicity, I've limited the data to 1996-2008. The plot generated by the above code looks like this:
First impressions say there is a definite trend and seasonality present in the data. Let's do a seasonal decomposition to get a better understanding.
plt.rc('figure',figsize=(14,8))
plt.rc('font',size=15)
result = seasonal_decompose(lim_catfish_sales,model='additive')
fig = result.plot()
The resulting plot looks like this:
Points to ponder: the decomposition separates a clear trend and a strong yearly seasonal component from the residual noise.
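If you'd rather inspect the components numerically than eyeball the plot, the result object from seasonal_decompose exposes each one as a pandas Series:
print(result.trend.dropna().head())     # smoothed long-run trend (NaN at the edges)
print(result.seasonal.head())           # the repeating yearly pattern
print(result.resid.dropna().head())     # what's left after removing trend + seasonality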
Let's look at the ACF and PACF plots to get an idea of the p and q values.
plot_acf(lim_catfish_sales['Total'], lags=48);
plot_pacf(lim_catfish_sales['Total'], lags=30);
The output of the above code plots ACF and PACF:
Points to ponder: the differencing factor d should be kept at 1, since there is a clear trend and the data is non-stationary, and p can be tested with values 6 and 12, where the ACF shows strong lagged correlations.
We're going to use the statsmodels module to implement ARIMA, which is why we imported the ARIMA class from statsmodels above. Now, let's fit the model with the parameters we discussed in the previous section.
arima = ARIMA(lim_catfish_sales['Total'], order=(12,1,1))
predictions = arima.fit().predict()
As you can see above, I started with (12,1,1) for (p,d,q), straight from what we saw in the ACF and PACF plots.
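It's also worth sanity-checking the fit before plotting; the results object exposes a summary table and the AIC, which is handy for comparing candidate orders (a small sketch):
arima_result = arima.fit()    # refit; in practice you'd reuse a single fitted results object
print(arima_result.summary()) # coefficient estimates, standard errors, diagnostics
print(arima_result.aic)       # lower AIC generally indicates a better trade-off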
Note: It's quite handy to use well-maintained libraries for these algorithms (as with scikit-learn for classical ML), and statsmodels is one of the most widely used libraries for time series models.
Let’s see how our predictions stack up with the original data.
plt.figure(figsize=(16,4))
plt.plot(lim_catfish_sales, label="Actual")  # the current ARIMA API predicts on the original scale, so compare against levels
plt.plot(predictions, label="Predicted")
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)
plt.legend()
The output of the above code will give you a comparative plot of predictions and the actual data.
You can see here that the model didn't quite catch some of the peaks but captured the essence of the data well. We can experiment with more (p,d,q) values to generalize the model better and make sure it doesn't overfit; a brute-force sketch of that search follows below.
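A minimal sketch of that search, comparing a handful of candidate orders by AIC (the candidate grid here is an assumption, not a recommendation):
import itertools

best_aic, best_order = float('inf'), None
for p, d, q in itertools.product([1, 6, 12], [1], [0, 1]):
    try:
        aic = ARIMA(lim_catfish_sales['Total'], order=(p, d, q)).fit().aic
    except Exception:
        continue  # some orders may fail to converge
    if aic < best_aic:
        best_aic, best_order = aic, (p, d, q)
print(best_order, best_aic)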
Trial and optimization is one way, but you can also use Auto-ARIMA, which essentially does the heavy lifting and tunes the hyperparameters for you. This blog is a good starting point for Auto-ARIMA.
Keep in mind that you still have to be able to explain the parameters when working with Auto-ARIMA; make sure the model doesn't turn into a black box, as forecasting models often have to pass governance review before deployment. It's good practice to be able to explain the parameter values and their contribution.
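If you want to try it, one common option is the pmdarima package (an assumption here; it's not used elsewhere in this post). A minimal sketch:
import pmdarima as pm

auto_model = pm.auto_arima(lim_catfish_sales['Total'],
                           seasonal=True, m=12,        # yearly seasonality for monthly data
                           stepwise=True, trace=True,  # stepwise search, logging each candidate
                           suppress_warnings=True)
print(auto_model.summary())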
SARIMA stands for Seasonal ARIMA, and it adds a seasonality contribution to the forecast. The importance of seasonality is quite evident, and ARIMA fails to capture that information implicitly.
The Autoregressive (AR), Integrated (I), and Moving Average (MA) parts of the model remain the same as in ARIMA. The addition of seasonality adds robustness to the SARIMA model. It's represented as:

ARIMA(p, d, q)(P, D, Q)m

where m is the number of observations per year (e.g., m = 12 for monthly data). We use uppercase notation for the seasonal parts of the model, and lowercase notation for the non-seasonal parts of the model.
Similar to ARIMA, the P, D, Q values for the seasonal part of the model can be deduced from the ACF and PACF plots of the data. Let's implement SARIMA for the same catfish sales data.
The ETL and dependencies will remain the same as in ARIMA so we’ll jump straight to the modeling part.
sarima = SARIMAX(lim_catfish_sales['Total'],
order=(1,1,1),
seasonal_order=(1,1,0,12))
predictions = sarima.fit().predict()
I experimented with (1,1,1) for the non-seasonal part and took (1,1,0,12) for the seasonal one, since the ACF showed 6-month and 12-month lagged correlations. Let's see how it turned out.
plt.figure(figsize=(16,4))
plt.plot(lim_catfish_sales, label="Actual")
plt.plot(predictions, label="Predicted")
plt.title('Catfish Sales in 1000s of Pounds', fontsize=20)
plt.ylabel('Sales', fontsize=16)
plt.legend()
As you can see, at the start the model struggled to fit, probably because of initialization, but it quickly found the right path. The fit is quite good compared to the ARIMA one, suggesting that SARIMA learns seasonality better; if seasonality is present in your data, it makes sense to try SARIMA.
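In-sample predictions only tell part of the story; for actual forecasting you'll want out-of-sample forecasts, ideally with confidence intervals. A short sketch using the fitted SARIMAX results object:
sarima_result = sarima.fit()
forecast = sarima_result.get_forecast(steps=12)  # 12 months beyond the sample
print(forecast.predicted_mean)                   # point forecasts
print(forecast.conf_int())                       # confidence intervals (95% by default)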
Owing to their linear nature, both algorithms are quite handy in industry for experimentation, understanding the data, and creating baseline forecasting scores. If tuned right with the lagged-value parameters (p,d,q), they can perform significantly better. The simple and explainable nature of both algorithms makes them top picks for analysts and data scientists. There are, however, some pros and cons to keep in mind when working with ARIMA and SARIMA at scale.
ARIMA/SARIMA are among the most popular econometric models, used for forecasting stock prices, demand, and even the spread of infectious diseases. When the underlying mechanisms are not known or are too complicated (e.g., the stock market), or not fully known (e.g., retail sales), it is usually better to apply ARIMA or a similar statistical model than complex deep learning algorithms like RNNs.
However, there are cases where applying ARIMA will give you only par results.
When it comes to the industry, here’s a nice article about forecasting in Uber.
More often than not, you'll find ARIMA/SARIMA used when the problem is limited to past values of the target, whether that's predicting hospital beds, COVID cases, or demand. The shortcoming, however, arises when there are other factors to consider in forecasting, such as static attributes. Look out for the problem statement you're working on; if these circumstances apply, try other methods like Theta, QRF (quantile regression forests), Prophet, or RNNs.
You've reached the end! In this blog, we discussed ARIMA and SARIMA at length, covering how they work and why they matter in research and industry. Their simplicity and robustness make them top contenders for modeling and forecasting, but there are some things to keep in mind when applying them to your real-world use case.
Source:
https://medium.com/swlh/seasonality-analysis-and-forecast-in-time-series-b8fbba820327