Box-Jenkins Methodology in R

The Box-Jenkins methodology is a systematic approach to identifying, estimating, and evaluating models for time series data, particularly those represented by the AutoRegressive Integrated Moving Average (ARIMA) processes. The method consists of four main stages: model identification, parameter estimation, diagnostic checking, and forecasting. In this post, we’ll work through each stage of the Box-Jenkins process in RStudio, from model identification to forecasting.

Introduction to ARIMA Models

ARIMA models are designed to capture autocorrelation patterns in time series data. A general \( ARIMA(p,d,q) \) model is defined as:

\( \Phi_p(B)(1 – B)^d Y_t = \Theta_q(B) \varepsilon_t \), where:

  • \(p\): order of the autoregressive (AR) term
  • \(d\): degree of differencing to achieve stationarity
  • \(q\): order of the moving average (MA) term
  • \(B\): backshift operator (\(BY_t = Y_{t-1}\))
  • \(\varepsilon_t\): white noise error term

To illustrate the Box-Jenkins methodology in R, we’ll be using the following simulated time series dataset consisting of 100 observations:

time <- 1:100
ts_data <- c(51.4, 51.7, 52.3, 51.1, 51.4, 52.7, 51.7, 50.3, 51.6, 52.7, 54.8, 55.0, 55.5, 54.3, 52.9, 
             53.5, 54.0, 51.9, 53.9, 54.5, 54.8, 55.1, 56.0, 56.5, 55.5, 56.9, 58.5, 59.8, 61.3, 62.1, 
             62.6, 64.8, 66.7, 67.8, 70.6, 71.1, 72.5, 73.0, 74.2, 75.9, 76.3, 77.3, 79.7, 82.0, 83.2, 
             83.5, 85.8, 86.8, 88.3, 90.6, 92.1, 93.2, 94.6, 94.3, 96.0, 95.6, 94.5, 94.3, 94.6, 94.9, 
             94.9, 96.3, 98.2, 98.4, 100.1, 99.9, 100.3, 100.8, 101.5, 100.8, 100.8, 100.9, 101.9, 
             104.1, 107.8, 110.0, 113.0, 115.3, 116.1, 118.2, 117.7, 118.5, 118.0, 118.2, 115.3, 
             113.3, 111.9, 111.1, 109.4, 110.6, 110.2, 111.6, 111.8, 110.1, 108.6, 107.9, 107.4, 
             106.9, 108.0, 108.8)

First, let’s create a time series plot to visualize the series and observe its general trend:

# Installing the ggplot2 package:
install.packages("ggplot2")
library(ggplot2)
#Visualizing the data using a line plot:
ts_plot = ggplot()+
  geom_line(aes(x = time,y = ts_data))+
  xlab("Time")+
  ylab("Value")+
  ggtitle("Simulated Time Series")+
  scale_x_continuous(breaks = seq(0, 100, by=20)) +     
  scale_y_continuous(breaks = seq(50, 150, by=20))+
  theme(
    plot.title = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  )
ts_plot

Here’s the resulting graph:

time series graph

The plot shows a significant upward trend, reaching a peak near the end of the series and then declining slightly. With this trend in mind, we will now follow the four stages of the Box-Jenkins methodology to formally identify and fit the most appropriate model.

Step 1: Model Identification

In the identification stage, we determine whether the time series is stationary and identify the potential AR and MA terms using autocorrelation plots. To formally test for stationarity, we use the the Augmented Dickey-Fuller (ADF) test. The null and alternative hypotheses of this test are:
\(H_o\): The series has a unit root (The series is non-stationary)
\(H_a\): The series does not have a unit root (The series is stationary)
Using the observed values, the ADF test calculates a test statistic and a p-value. If the test statistic is less than the critical value (or equivalently, if the p-value is less than the significance level), we reject the null hypothesis and conclude that the series is stationary. Otherwise, we fail to reject the null hypothesis; that is, there is not enough evidence to conclude that the series is stationary.

To conduct the ADF test in R, we can use the adf.test function from the tseries package:

# Installing the tseries package: 
install.packages("tseries") 
library(tseries) 
# Testing for stationarity using the ADF test: 
adf.test(ts_data) 

The results of the ADF test are:

               Augmented Dickey-Fuller Test 
data: ts_data 
Dickey-Fuller = -1.6681, Lag order = 4, p-value = 0.7141
alternative hypothesis: stationary 

The ADF test returns a p-value of 0.7141, which is greater than the significance level of 0.05. Therefore, there is currently no evidence to conclude that the series is stationary. To proceed, we’ll need to difference the series in order to remove the trend and achieve stationarity. If the series remains non-stationary after the first differencing, additional differencing will be required.

Differencing the data and re-checking for stationarity:

# Differencing the data:
diff_data <- diff(ts_data)
# Repeating the ADF test:
adf.test(diff_data)

The results of the ADF test are:

               Augmented Dickey-Fuller Test 
data: diff_data 
Dickey-Fuller = -3.6488, Lag order = 4, p-value = 0.03248
alternative hypothesis: stationary 

For the differenced series, the ADF test returns a p-value of 0.03248, which is now less than the significance level of 0.05. Therefore, we reject the null hypothesis of non-stationarity and conclude that the differenced series is stationary.

Let’s have a look at the differenced series:

differenced_plot = ggplot()+
  geom_line(aes(x = time[1:99],y = diff_data))+
  xlab("Time")+
  ylab("Value")+
  ggtitle("Differenced Time Series")+
  theme(
    plot.title = element_text(face = "bold"),
    axis.title.x = element_text(face = "bold"),
    axis.title.y = element_text(face = "bold")
  )
differenced_plot

The plot of the differenced series is:

differenced time series graph

The plot confirms that applying first-order differencing (\(d=1\)) has successfully transformed the series into a stationary state by removing the underlying upward trend. This ensures that the statistical properties of the series are consistent over time. Next, we’ll examine the autocorrelation (ACF) and partial autocorrelation (PACF) plots of the differenced series. Generating the plots:

# Creating the ACF plot:
acf(diff_data, main = "ACF of Differenced Series", lag.max = 12)
# Creating the PACF plot:
pacf(diff_data, main = "PACF of Differenced Series", lag.max = 12)

The plots are:

Autocorrelation plot (ACF) Partial Autocorrelation plot (PACF)

The ACF plot shows that the autocorrelations decay gradually toward zero, while the PACF plot shows that partial autocorrelations become insignificant after lag 2. This pattern suggests that an AR(2) and MA(0) components are appropriate for the differenced series. Since the series was differenced once to achieve stationarity, the resulting model is ARIMA(2,1,0).

Step 2: Model Estimation

To estimate this ARIMA(2,1,0) model, we can use the Arima function from the forecast package:

# Fitting the ARIMA(2,1,0) model:
my_arima = Arima(ts_data, order=c(2,1,0), include.drift=TRUE)
summary(my_arima)

Here’s the estimated model:

Series: ts_data 
ARIMA(2,1,0) with drift 

Coefficients:
         ar1     ar2   drift
      0.3116  0.2363  0.5818
s.e.  0.0967  0.0966  0.2388

sigma^2 = 1.231:  log likelihood = -149.4
AIC=306.81   AICc=307.23   BIC=317.19

Training set error measures:
                      ME     RMSE       MAE         MPE   MAPE
Training set 0.002564409 1.087263 0.8566579 0.005516468 1.1269
                  MASE        ACF1
Training set 0.7572244 -0.02175091

For the differenced series, the estimated ARIMA(2,1,0) model can be written as:
\(\Delta Y_t = 0.5818 + 0.3116\,\Delta Y_{t-1} + 0.2363\,\Delta Y_{t-2} + \varepsilon_t\)
In terms of the original (non-differenced) series \(Y_t\), the model can be written as:
\( Y_t \,-\, Y_{t-1} = 0.58 + 0.31\,(Y_{t-1} \,-\, Y_{t-2}) + 0.24\,(Y_{t-2} \,-\, Y_{t-3}) + \varepsilon_t \)
\( Y_t = Y_{t-1} + 0.58 + 0.31 Y_{t-1} \,-\, 0.31 Y_{t-2} + 0.24 Y_{t-2} \,-\, 0.24 Y_{t-3} + \varepsilon_t = \)
\( = 0.58 + 1.31 Y_{t-1} \,-\, 0.07 Y_{t-2} \,-\, 0.24 Y_{t-3} + \varepsilon_t \)
That is, the final model is:
\( Y_t = 0.58 + 1.31 Y_{t-1} \,-\, 0.07 Y_{t-2} \,-\, 0.24 Y_{t-3} + \varepsilon_t \)

Step 3: Model Diagnostics

To assess model adequacy, we’ll examine the residuals to verify that they resemble white noise. We’ll start by creating a time plot of residuals:

# Extracting the residuals:
residuals_arima = residuals(my_arima)
# Creating the residual plot:
ggplot(mapping=aes(x=time,y=residuals_arima))+
  geom_point()+
  xlab("Time")+
  ylab("Residuals")+
  ggtitle("Residuals vs. Time")

The plot of residuals against time is:

The residuals appear to fluctuate randomly around zero with no visible trend, seasonal, or cyclical patterns. The variability looks roughly constant across time, and there are no long runs of positive or negative residuals. This suggests that the ARIMA(2,1,0) model has successfully captured the main dynamics of the series and that the remaining errors behave approximately like white noise. To complement this visual assessment, we can also apply the Box–Ljung test which checks whether there are statistically significant autocorrelations in the residuals. The null and alternative hypotheses of the Box–Ljung test are:

\(H_0:\) All autocorrelations up to lag \(k\) are jointly zero.
\(H_a:\) At least one autocorrelation up to lag \(k\) is not zero.

If the p-value of the the Box–Ljung test is greater than the significance level (typically 0.05), we fail to reject the null hypothesis and conclude that there is no evidence of statistically significant residual autocorrelations. This result would confirm that the model is well-specified and suitable for generating forecasts.

Conducting the Box–Ljung test in R:

# Performing the Box–Ljung test on model residuals: 
Box.test(residuals_arima, lag = 20, type = "Ljung-Box") 

The results of the test are:

Box-Ljung test 
data:  residuals_arima
X-squared = 15.923, df = 20,
p-value = 0.7214 

The test statistic is 15.923 with 20 degrees of freedom, and the p-value is 0.7214. Since the p-value is greater than 0.05, we fail to reject the null hypothesis. Therefore, there is no evidence of statistically significant autocorrelations in the residuals.

Taken together with the earlier residual plot, these findings indicate that the ARIMA(2,1,0) model fits the data well and adequately captures the time dependence in the series.

Step 4: Forecasting

Once the ARIMA model has passed the diagnostic checks, we can use it to generate forecasts. In this step, we’ll produce a 40-step-ahead forecast from the fitted ARIMA(2,1,0) model and visualize the predicted values along with their associated confidence intervals.

To generate and plot the forecasts in R, we use the forecast function from the forecast package and the autoplot function from the ggplot2 package:

# Generating a 40-step-ahead forecast:
arima_forecast = forecast(my_arima, h = 40)

# Plotting the forecast:
forecast_plot = autoplot(arima_forecast) +
  xlab("Time") +
  ylab("Value") +
  ggtitle("40-Step-Ahead Forecast from ARIMA(2,1,0)")

forecast_plot

The forecast plot is shown below:

ARIMA forecast plot

The solid line represents the model’s predicted values for the next 40 time periods, while the shaded regions show the 80% and 95% confidence intervals. As expected for an ARIMA(2,1,0) model with a drift, the forecast continues the upward movement observed in the historical data. The widening confidence bands reflect increasing uncertainty as the forecast horizon extends further into the future.

In summary, the Box-Jenkins methodology provides a systematic approach for identifying the optimal combination of autoregressive, differencing, and moving average terms, enabling us to fit ARIMA models and generate multi-step forecasts.

About the author: Premier Statistics Tutoring is a statistics and programming tutoring service for university students, covering everything from introductory undergraduate coursework to advanced postgraduate research.