The Box-Jenkins methodology is a systematic approach to identifying, estimating, and evaluating models for time series data, particularly those represented by the AutoRegressive Integrated Moving Average (ARIMA) processes. The method consists of four main stages: model identification, parameter estimation, diagnostic checking, and forecasting. In this post, we’ll work through each stage of the Box-Jenkins process in RStudio, from model identification to forecasting.
Introduction to ARIMA Models
ARIMA models are designed to capture autocorrelation patterns in time series data. A general \( ARIMA(p,d,q) \) model is defined as:
\( \Phi_p(B)(1 – B)^d Y_t = \Theta_q(B) \varepsilon_t \), where:
- \(p\): order of the autoregressive (AR) term
- \(d\): degree of differencing to achieve stationarity
- \(q\): order of the moving average (MA) term
- \(B\): backshift operator (\(BY_t = Y_{t-1}\))
- \(\varepsilon_t\): white noise error term
To illustrate the Box-Jenkins methodology in R, we’ll be using the following simulated time series dataset consisting of 100 observations:
time <- 1:100
ts_data <- c(51.4, 51.7, 52.3, 51.1, 51.4, 52.7, 51.7, 50.3, 51.6, 52.7, 54.8, 55.0, 55.5, 54.3, 52.9,
53.5, 54.0, 51.9, 53.9, 54.5, 54.8, 55.1, 56.0, 56.5, 55.5, 56.9, 58.5, 59.8, 61.3, 62.1,
62.6, 64.8, 66.7, 67.8, 70.6, 71.1, 72.5, 73.0, 74.2, 75.9, 76.3, 77.3, 79.7, 82.0, 83.2,
83.5, 85.8, 86.8, 88.3, 90.6, 92.1, 93.2, 94.6, 94.3, 96.0, 95.6, 94.5, 94.3, 94.6, 94.9,
94.9, 96.3, 98.2, 98.4, 100.1, 99.9, 100.3, 100.8, 101.5, 100.8, 100.8, 100.9, 101.9,
104.1, 107.8, 110.0, 113.0, 115.3, 116.1, 118.2, 117.7, 118.5, 118.0, 118.2, 115.3,
113.3, 111.9, 111.1, 109.4, 110.6, 110.2, 111.6, 111.8, 110.1, 108.6, 107.9, 107.4,
106.9, 108.0, 108.8)
First, let’s create a time series plot to visualize the series and observe its general trend:
# Installing the ggplot2 package:
install.packages("ggplot2")
library(ggplot2)
#Visualizing the data using a line plot:
ts_plot = ggplot()+
geom_line(aes(x = time,y = ts_data))+
xlab("Time")+
ylab("Value")+
ggtitle("Simulated Time Series")+
scale_x_continuous(breaks = seq(0, 100, by=20)) +
scale_y_continuous(breaks = seq(50, 150, by=20))+
theme(
plot.title = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold")
)
ts_plot
Here’s the resulting graph:
The plot shows a significant upward trend, reaching a peak near the end of the series and then declining slightly. With this trend in mind, we will now follow the four stages of the Box-Jenkins methodology to formally identify and fit the most appropriate model.
Step 1: Model Identification
In the identification stage, we determine whether the time series is stationary and identify the potential AR and MA terms using autocorrelation plots. To formally test for stationarity, we use the the Augmented Dickey-Fuller (ADF) test. The null and alternative hypotheses of this test are:
\(H_o\): The series has a unit root (The series is non-stationary)
\(H_a\): The series does not have a unit root (The series is stationary)
Using the observed values, the ADF test calculates a test statistic and a p-value. If the test statistic is less than the critical value (or equivalently, if the p-value is less than the significance level), we reject the null hypothesis and conclude that the series is stationary. Otherwise, we fail to reject the null hypothesis; that is, there is not enough evidence to conclude that the series is stationary.
To conduct the ADF test in R, we can use the adf.test function from the tseries package:
# Installing the tseries package:
install.packages("tseries")
library(tseries)
# Testing for stationarity using the ADF test:
adf.test(ts_data)
The results of the ADF test are:
Augmented Dickey-Fuller Test data: ts_data Dickey-Fuller = -1.6681, Lag order = 4, p-value = 0.7141 alternative hypothesis: stationary
The ADF test returns a p-value of 0.7141, which is greater than the significance level of 0.05. Therefore, there is currently no evidence to conclude that the series is stationary. To proceed, we’ll need to difference the series in order to remove the trend and achieve stationarity. If the series remains non-stationary after the first differencing, additional differencing will be required.
Differencing the data and re-checking for stationarity:
# Differencing the data: diff_data <- diff(ts_data) # Repeating the ADF test: adf.test(diff_data)
The results of the ADF test are:
Augmented Dickey-Fuller Test data: diff_data Dickey-Fuller = -3.6488, Lag order = 4, p-value = 0.03248 alternative hypothesis: stationary
For the differenced series, the ADF test returns a p-value of 0.03248, which is now less than the significance level of 0.05. Therefore, we reject the null hypothesis of non-stationarity and conclude that the differenced series is stationary.
Let’s have a look at the differenced series:
differenced_plot = ggplot()+
geom_line(aes(x = time[1:99],y = diff_data))+
xlab("Time")+
ylab("Value")+
ggtitle("Differenced Time Series")+
theme(
plot.title = element_text(face = "bold"),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold")
)
differenced_plot
The plot of the differenced series is:
The plot confirms that applying first-order differencing (\(d=1\)) has successfully transformed the series into a stationary state by removing the underlying upward trend. This ensures that the statistical properties of the series are consistent over time. Next, we’ll examine the autocorrelation (ACF) and partial autocorrelation (PACF) plots of the differenced series. Generating the plots:
# Creating the ACF plot: acf(diff_data, main = "ACF of Differenced Series", lag.max = 12) # Creating the PACF plot: pacf(diff_data, main = "PACF of Differenced Series", lag.max = 12)
The plots are:
The ACF plot shows that the autocorrelations decay gradually toward zero, while the PACF plot shows that partial autocorrelations become insignificant after lag 2. This pattern suggests that an AR(2) and MA(0) components are appropriate for the differenced series. Since the series was differenced once to achieve stationarity, the resulting model is ARIMA(2,1,0).
Step 2: Model Estimation
To estimate this ARIMA(2,1,0) model, we can use the Arima function from the forecast package:
# Fitting the ARIMA(2,1,0) model: my_arima = Arima(ts_data, order=c(2,1,0), include.drift=TRUE) summary(my_arima)
Here’s the estimated model:
Series: ts_data
ARIMA(2,1,0) with drift
Coefficients:
ar1 ar2 drift
0.3116 0.2363 0.5818
s.e. 0.0967 0.0966 0.2388
sigma^2 = 1.231: log likelihood = -149.4
AIC=306.81 AICc=307.23 BIC=317.19
Training set error measures:
ME RMSE MAE MPE MAPE
Training set 0.002564409 1.087263 0.8566579 0.005516468 1.1269
MASE ACF1
Training set 0.7572244 -0.02175091
For the differenced series, the estimated ARIMA(2,1,0) model can be written as:
\(\Delta Y_t = 0.5818 + 0.3116\,\Delta Y_{t-1} + 0.2363\,\Delta Y_{t-2} + \varepsilon_t\)
In terms of the original (non-differenced) series \(Y_t\), the model can be written as:
\( Y_t \,-\, Y_{t-1} = 0.58 + 0.31\,(Y_{t-1} \,-\, Y_{t-2}) + 0.24\,(Y_{t-2} \,-\, Y_{t-3}) + \varepsilon_t \)
\( Y_t = Y_{t-1} + 0.58 + 0.31 Y_{t-1} \,-\, 0.31 Y_{t-2} + 0.24 Y_{t-2} \,-\, 0.24 Y_{t-3} + \varepsilon_t = \)
\( = 0.58 + 1.31 Y_{t-1} \,-\, 0.07 Y_{t-2} \,-\, 0.24 Y_{t-3} + \varepsilon_t \)
That is, the final model is:
\( Y_t = 0.58 + 1.31 Y_{t-1} \,-\, 0.07 Y_{t-2} \,-\, 0.24 Y_{t-3} + \varepsilon_t \)
Step 3: Model Diagnostics
To assess model adequacy, we’ll examine the residuals to verify that they resemble white noise. We’ll start by creating a time plot of residuals:
# Extracting the residuals:
residuals_arima = residuals(my_arima)
# Creating the residual plot:
ggplot(mapping=aes(x=time,y=residuals_arima))+
geom_point()+
xlab("Time")+
ylab("Residuals")+
ggtitle("Residuals vs. Time")
The plot of residuals against time is:
The residuals appear to fluctuate randomly around zero with no visible trend, seasonal, or cyclical patterns. The variability looks roughly constant across time, and there are no long runs of positive or negative residuals. This suggests that the ARIMA(2,1,0) model has successfully captured the main dynamics of the series and that the remaining errors behave approximately like white noise. To complement this visual assessment, we can also apply the Box–Ljung test which checks whether there are statistically significant autocorrelations in the residuals. The null and alternative hypotheses of the Box–Ljung test are:
\(H_0:\) All autocorrelations up to lag \(k\) are jointly zero.
\(H_a:\) At least one autocorrelation up to lag \(k\) is not zero.
If the p-value of the the Box–Ljung test is greater than the significance level (typically 0.05), we fail to reject the null hypothesis and conclude that there is no evidence of statistically significant residual autocorrelations. This result would confirm that the model is well-specified and suitable for generating forecasts.
Conducting the Box–Ljung test in R:
# Performing the Box–Ljung test on model residuals: Box.test(residuals_arima, lag = 20, type = "Ljung-Box")
The results of the test are:
Box-Ljung test data: residuals_arima X-squared = 15.923, df = 20, p-value = 0.7214
The test statistic is 15.923 with 20 degrees of freedom, and the p-value is 0.7214. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. Therefore, there is no evidence of statistically significant autocorrelations in the residuals.
Taken together with the earlier residual plot, these findings indicate that the ARIMA(2,1,0) model fits the data well and adequately captures the time dependence in the series.
Step 4: Forecasting
Once the ARIMA model has passed the diagnostic checks, we can use it to generate forecasts. In this step, we’ll produce a 40-step-ahead forecast from the fitted ARIMA(2,1,0) model and visualize the predicted values along with their associated confidence intervals.
To generate and plot the forecasts in R, we use the forecast function from the forecast package and the autoplot function from the ggplot2 package:
# Generating a 40-step-ahead forecast:
arima_forecast = forecast(my_arima, h = 40)
# Plotting the forecast:
forecast_plot = autoplot(arima_forecast) +
xlab("Time") +
ylab("Value") +
ggtitle("40-Step-Ahead Forecast from ARIMA(2,1,0)")
forecast_plot
The forecast plot is shown below:
The solid line represents the model’s predicted values for the next 40 time periods, while the shaded regions show the 80% and 95% confidence intervals. As expected for an ARIMA(2,1,0) model with a drift, the forecast continues the upward movement observed in the historical data. The widening confidence bands reflect increasing uncertainty as the forecast horizon extends further into the future.
In summary, the Box-Jenkins methodology provides a systematic approach for identifying the optimal combination of autoregressive, differencing, and moving average terms, enabling us to fit ARIMA models and generate multi-step forecasts.
