Multiple linear regression is a statistical model used to predict a response variable from two or more explanatory variables. Unlike simple regression, which includes only one explanatory variable, multiple regression accounts for the effects of several variables simultaneously, which often yields more accurate predictions. The multiple regression model is typically written as follows: \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \), where:
- \( Y \) is the response (i.e., dependent) variable,
- \( X_1, X_2, \dots, X_k \) are the explanatory (i.e., independent) variables,
- \( \beta_0, \beta_1, \dots, \beta_k \) are the regression coefficients to be estimated, and
- \( \varepsilon \) is the random error.
When you estimate a linear regression model in R, the output will generally include:
- The estimated coefficients for the intercept and each predictor, \( \hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k \), along with their standard errors, t-values, and p-values;
- The residual standard error, \(s\), which measures the typical size of the prediction errors;
- The \( R^2 \) and adjusted \( R^2 \), which indicate how much of the variation in the dependent variable is explained by the explanatory variables;
- The F-statistic and its associated p-value for testing the overall significance of the model.
In this guide, we’ll explore each of these statistics using a simple example: a multiple regression model that predicts students’ final exam scores based on their study hours and class attendance rate. Let’s start by looking at the R output for this model:
    Call:
    lm(formula = final_score ~ hours_studied + attendance_rate, data = student_data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -9.7128 -2.0934 -0.1846  2.7811  8.3915

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)       30.528      4.132    7.39  1.2e-07 ***
    hours_studied      3.782      0.479    7.90  4.1e-08 ***
    attendance_rate    0.215      0.089    2.42    0.022 *
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 3.294 on 27 degrees of freedom
    Multiple R-squared: 0.824, Adjusted R-squared: 0.811
    F-statistic: 63.19 on 2 and 27 DF, p-value: 1.76e-10
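For readers who want to produce output of this kind themselves, here is a minimal sketch. The original student_data is not available, so the data frame below is simulated purely for illustration (the seed, sample size, and generating coefficients are assumptions), and the printed numbers will not match the output above.

```r
# Simulated stand-in for student_data (hypothetical; not the article's data)
set.seed(42)
n <- 30
student_data <- data.frame(
  hours_studied   = runif(n, min = 2, max = 12),
  attendance_rate = runif(n, min = 50, max = 100)
)
student_data$final_score <- 30 + 3.8 * student_data$hours_studied +
  0.2 * student_data$attendance_rate + rnorm(n, sd = 3)

# Fit the model with lm() and print the standard regression summary
model <- lm(final_score ~ hours_studied + attendance_rate, data = student_data)
print(summary(model))
```

With 30 observations and two predictors, the summary reports 27 residual degrees of freedom, the same as in the example output.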
Part 1, Model Setup
This part of the regression output displays the formula and the dataset used to fit the model:
lm(formula = final_score ~ hours_studied + attendance_rate, data = student_data)
In the expression above, we have:
- The term lm() is R’s linear regression model function.
- The formula final_score ~ hours_studied + attendance_rate indicates that the variable final_score is the response variable and the variables hours_studied and attendance_rate are the explanatory variables.
- The expression data = student_data specifies that all of the variables used in the model formula are taken from the student_data dataset.
This section describes the model setup, but does not yet provide any statistical conclusions.
Part 2, Residuals
The residuals represent the differences between the observed values and the values predicted by the model:
\( r_i = y_i - \hat{y}_i \), where \( y_i \) is the observed value and \( \hat{y}_i \) is the predicted value for observation \( i \).
Generally, smaller residuals indicate that the model’s predictions are close to the observed values, while larger residuals suggest greater discrepancies between the predicted and observed values. The residuals should also be roughly symmetric around zero, with no extreme outliers; a very large minimum or maximum suggests that the model fits certain cases poorly.
In our example, we have:
    Residuals:
        Min      1Q  Median      3Q     Max
    -9.7128 -2.0934 -0.1846  2.7811  8.3915

In this case, the residuals range from approximately -9.71 to 8.39, with a median close to zero (-0.18). Assuming that the average final score is around 75, the largest residuals correspond to errors of roughly 11-13% relative to the mean. The first quartile is -2.09 and the third quartile is 2.78, so for the middle 50% of observations the predicted score is within about 2 to 3 points of the observed score.
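The five-number summary in this part of the output can be reproduced directly from a fitted model. The sketch below uses simulated data as a stand-in for student_data (an assumption, since the original data are not available), so its values will differ from those discussed above.

```r
# Simulated stand-in for student_data (hypothetical; not the article's data)
set.seed(42)
n <- 30
student_data <- data.frame(
  hours_studied   = runif(n, min = 2, max = 12),
  attendance_rate = runif(n, min = 50, max = 100)
)
student_data$final_score <- 30 + 3.8 * student_data$hours_studied +
  0.2 * student_data$attendance_rate + rnorm(n, sd = 3)
model <- lm(final_score ~ hours_studied + attendance_rate, data = student_data)

# Residuals are observed minus fitted values
res <- student_data$final_score - fitted(model)

# Min, 1Q, Median, 3Q, Max: the five numbers printed in the Residuals block
res_summary <- quantile(res)
print(res_summary)
```

Because the model includes an intercept, the residuals average to zero, which is why the median is the more informative check for symmetry.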
Part 3, Coefficients Table
The coefficients table provides the estimated values for the model’s parameters, along with tests of their statistical significance:
    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)       30.528      4.132    7.39  1.2e-07 ***
    hours_studied      3.782      0.479    7.90  4.1e-08 ***
    attendance_rate    0.215      0.089    2.42    0.022 *
Based on the table above, the estimated model equation can be written as:
\( Final\ Exam\ Score = 30.528 + 3.782 \times Hours\ Studied + 0.215 \times Attendance\ Rate \)
Interpretation of Coefficients:
- The intercept of 30.528 indicates that the predicted final exam score of a student who studied for zero hours and had an attendance rate of zero is 30.528.
- The coefficient of 3.782 for the study hours indicates that each additional hour studied is associated with an expected increase of 3.782 points in the final exam score, holding attendance constant.
- The coefficient of 0.215 for the attendance rate indicates that each one-unit increase in the attendance rate (for example, one percentage point, if the rate is recorded in percent) is associated with an expected increase of 0.215 points in the final exam score, holding study hours constant.
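To make the interpretation concrete, the estimated equation can be evaluated by hand. The student below is hypothetical, and the attendance value of 90 assumes the rate is recorded in percent, which the output itself does not confirm.

```r
# Coefficient estimates copied from the table above
coefs <- c(intercept = 30.528, hours_studied = 3.782, attendance_rate = 0.215)

# Hypothetical student: 8 hours studied, attendance rate of 90 (assumed percent scale)
hours <- 8
attendance <- 90

# Plug the values into the estimated equation
predicted_score <- unname(
  coefs["intercept"] +
    coefs["hours_studied"] * hours +
    coefs["attendance_rate"] * attendance
)
print(predicted_score)  # 30.528 + 30.256 + 19.35 = 80.134
```

With a fitted model object, predict() called with a newdata data frame performs the same calculation without copying coefficients by hand.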
The table also provides additional information about the model parameters:
- Std. Error: The standard error of each estimate, which measures how much the estimate would vary if the data collection or sampling process were repeated many times. Smaller standard errors indicate more precise estimates.
- t value: The test statistic used to assess whether each regression coefficient is significantly different from 0. It is computed as the estimate divided by its standard error (for example, 3.782 / 0.479 ≈ 7.90 for hours_studied).
- Pr(>|t|): The p-value of each coefficient, which gives the probability of observing a t value at least as extreme as the one obtained if the true coefficient were 0. In this case, all p-values are below 0.05, so the intercept and the coefficients of both hours_studied and attendance_rate are significantly different from 0 at the 5% level. The evidence is strongest for the intercept and hours_studied (p < 0.001) and weaker for attendance_rate (p = 0.022).
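The relationship between the four columns can be checked on any fitted model. The following sketch again relies on simulated stand-in data (an assumption), recomputes the t values and two-sided p-values from the estimates and standard errors, and compares them with what summary() reports.

```r
# Simulated stand-in for student_data (hypothetical; not the article's data)
set.seed(42)
n <- 30
student_data <- data.frame(
  hours_studied   = runif(n, min = 2, max = 12),
  attendance_rate = runif(n, min = 50, max = 100)
)
student_data$final_score <- 30 + 3.8 * student_data$hours_studied +
  0.2 * student_data$attendance_rate + rnorm(n, sd = 3)
model <- lm(final_score ~ hours_studied + attendance_rate, data = student_data)

# The coefficients table as a numeric matrix
coef_table <- coef(summary(model))

# t value = estimate / standard error
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

# Two-sided p-value from the t distribution with n - k - 1 degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(model), lower.tail = FALSE)

print(cbind(t_manual, p_manual))

# The same arithmetic on the rounded values from the example table
t_hours_example <- 3.782 / 0.479  # about 7.90
```

The p-values in the example output are printed to only two significant digits, so values recomputed from the rounded table entries may not match them exactly.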
Part 4, Significance Codes
The significance codes offer a quick reference for interpreting p-values and the strength of statistical evidence:
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
These codes can be interpreted as follows:
- ‘***’: \( p < 0.001 \), the p-value is less than 0.001
- ‘**’: \( p < 0.01 \), the p-value is between 0.001 and 0.01
- ‘*’: \( p < 0.05 \), the p-value is between 0.01 and 0.05
- ‘.’: \( p < 0.1 \), the p-value is between 0.05 and 0.1
- ‘ ’ (blank): the p-value is 0.1 or greater
These visual indicators make it simpler to see which predictors are statistically significant without having to study the p-values too closely.
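As an illustration of how the codes map onto p-values, the sketch below bins the three p-values from the example table using the cut points shown in the legend. It uses cut() for simplicity and is not meant to reproduce R's internal printing routine.

```r
# p-values copied from the example coefficients table
p_values <- c("(Intercept)" = 1.2e-07, hours_studied = 4.1e-08, attendance_rate = 0.022)

# Cut points and symbols taken from the "Signif. codes" legend
stars <- cut(
  p_values,
  breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
  labels = c("***", "**", "*", ".", " "),
  include.lowest = TRUE
)

print(data.frame(p_value = p_values, code = stars))
```

The intercept and hours_studied fall in the first bin and receive three stars, while attendance_rate falls between 0.01 and 0.05 and receives one.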
Part 5, Residual Standard Error
This value estimates the standard deviation of the errors, that is, the typical distance between the model’s predicted values and the actual observed values. It’s calculated using the following formula: \( s = \sqrt{ \frac{1}{n - k - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 } \), where:
- \( y_i \) is the actual observed value for observation \( i \),
- \( \hat{y}_i \) is the predicted value for observation \( i \),
- \( n \) is the total number of observations, and
- \( k \) is the number of predictor variables in the model.
The term \( n - k - 1 \) represents the degrees of freedom, accounting for the intercept and the predictor coefficients.
In the example above, we have:
Residual standard error: 3.294 on 27 degrees of freedom
The residual standard error of 3.294 means that the predicted final exam scores typically deviate from the observed scores by about 3.3 points. The 27 degrees of freedom follow from \( n - k - 1 = 30 - 2 - 1 \).
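The formula for \( s \) can be verified numerically. This sketch uses simulated stand-in data (an assumption, as the original data are not available) and compares the hand-computed value with the one stored in the model summary.

```r
# Simulated stand-in for student_data (hypothetical; not the article's data)
set.seed(42)
n <- 30
student_data <- data.frame(
  hours_studied   = runif(n, min = 2, max = 12),
  attendance_rate = runif(n, min = 50, max = 100)
)
student_data$final_score <- 30 + 3.8 * student_data$hours_studied +
  0.2 * student_data$attendance_rate + rnorm(n, sd = 3)
model <- lm(final_score ~ hours_studied + attendance_rate, data = student_data)

k <- 2                  # number of predictors
res <- residuals(model)

# s = sqrt( sum of squared residuals / (n - k - 1) )
rse_manual <- sqrt(sum(res^2) / (n - k - 1))

# Value reported as "Residual standard error" in the summary
rse_reported <- summary(model)$sigma

print(c(manual = rse_manual, reported = rse_reported))
```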
Part 6, R Squared and Adjusted R Squared
These two metrics measure the explanatory power of the model: \( R^2 \) measures the proportion of variation in the response variable explained by the predictors, while adjusted \( R^2 \) measures the same proportion, but adjusts for the number of predictors, providing a more reliable estimate when multiple predictors are included.
The formula for \( R^2 \) is:
\( R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \), where:
- \( y_i \) is the actual observed value,
- \( \hat{y}_i \) is the predicted value from the model,
- \( \bar{y} \) is the mean of the observed values, and
- \( n \) is the number of observations.
The adjusted \( R^2 \) is calculated as:
\( \bar{R}^2 = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - k - 1} \right) \)
where \( k \) is the number of predictor variables and \( n \) is the number of observations. The denominator \( n - k - 1 \) accounts for the degrees of freedom.
In the example above, we have:
Multiple R-squared: 0.824, Adjusted R-squared: 0.811
The \( R^2 \) of 0.824 indicates that 82.4% of the variation in the final exam score is explained by the hours studied and the attendance rate. The adjusted \( R^2 \) of 0.811 is slightly lower because it penalizes the model for the number of predictors; the small gap between the two values suggests that the fit is not being inflated by the number of predictors.
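Both formulas are easy to check by hand. The sketch below computes \( R^2 \) and adjusted \( R^2 \) from the residuals of a model fitted to simulated stand-in data (an assumption), and also applies the adjustment formula to the rounded values from the example output.

```r
# Simulated stand-in for student_data (hypothetical; not the article's data)
set.seed(42)
n <- 30
student_data <- data.frame(
  hours_studied   = runif(n, min = 2, max = 12),
  attendance_rate = runif(n, min = 50, max = 100)
)
student_data$final_score <- 30 + 3.8 * student_data$hours_studied +
  0.2 * student_data$attendance_rate + rnorm(n, sd = 3)
model <- lm(final_score ~ hours_studied + attendance_rate, data = student_data)

k <- 2  # number of predictors
y <- student_data$final_score

ss_res <- sum(residuals(model)^2)  # unexplained variation
ss_tot <- sum((y - mean(y))^2)     # total variation around the mean

r2     <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(c(r_squared = r2, adjusted_r_squared = adj_r2))

# Same adjustment applied to the example output: R^2 = 0.824, n = 30, k = 2
adj_r2_example <- 1 - (1 - 0.824) * (30 - 1) / (30 - 2 - 1)  # about 0.811
```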
Part 7, F-statistic and Overall Model Significance
The F-test evaluates the overall significance of the model by comparing it to a baseline model with no predictors. Specifically, it tests the following hypotheses:
- Null hypothesis: All slope coefficients are equal to zero, that is, the model is not significant overall: \( H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0 \)
- Alternative hypothesis: At least one of the slope coefficients is not equal to zero, that is, the model is significant overall: \( H_1: \text{at least one } \beta_j \ne 0 \)
In the example above, we have:
F-statistic: 63.19 on 2 and 27 DF, p-value: 1.76e-10
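Because the p-value of 1.76e-10 is far below the significance level of 0.05, we reject the null hypothesis and conclude that the model is significant overall, that is, at least one of the predictors helps explain the final exam score.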
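The F-statistic is closely tied to \( R^2 \): it can be written as \( F = \frac{R^2 / k}{(1 - R^2)/(n - k - 1)} \). The sketch below checks this identity on a model fitted to simulated stand-in data (an assumption) and shows one way to obtain the overall p-value, which summary() prints but does not store as a separate element.

```r
# Simulated stand-in for student_data (hypothetical; not the article's data)
set.seed(42)
n <- 30
student_data <- data.frame(
  hours_studied   = runif(n, min = 2, max = 12),
  attendance_rate = runif(n, min = 50, max = 100)
)
student_data$final_score <- 30 + 3.8 * student_data$hours_studied +
  0.2 * student_data$attendance_rate + rnorm(n, sd = 3)
model <- lm(final_score ~ hours_studied + attendance_rate, data = student_data)

k <- 2
fit_summary <- summary(model)

# Named vector with elements "value", "numdf", and "dendf"
f_info <- fit_summary$fstatistic

# F recomputed from R^2
r2 <- fit_summary$r.squared
f_from_r2 <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability of the F distribution gives the overall p-value
p_value <- unname(pf(f_info["value"], f_info["numdf"], f_info["dendf"],
                     lower.tail = FALSE))
print(c(F = f_from_r2, p_value = p_value))

# Same identity applied to the rounded example output (R^2 = 0.824, k = 2, 27 DF)
f_example <- (0.824 / 2) / ((1 - 0.824) / 27)  # about 63.2
```

The value of about 63.2 obtained from the rounded \( R^2 \) agrees with the reported F-statistic of 63.19 up to rounding.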
In summary, the regression output in RStudio focuses on three main aspects: the coefficients that describe the effects of the predictors, the p-values that indicate whether these effects are statistically significant, and the model fit statistics that measure how well the model explains the data overall.
