4 Multiple Linear Regression

4.1 An Example: Wire Bond Strength Dataset

4.1.1 Loading Data and Visualization

Note: You must change the file paths in the read.csv() functions below to match the location of the files on your computer (for example C:\\Users\\<YourUsername>\\Documents on Windows).

Code

## Read data. Change the path as necessary.
## Example: bond.data <- read.csv("wire-bond.csv")
bond.data <- read.csv("wire-bond.csv")

## This will now be automatically rendered as a paged table
bond.data

2D Visualization

Code

par(mfrow = c(1, 3), mar = c(5, 4, 2, 1))

## 1) length vs strength
i1 <- which(!is.na(bond.data$length) & !is.na(bond.data$strength))
plot(bond.data$length[i1], bond.data$strength[i1],
     xlab = "Wire Length", ylab = "Pull strength", pch = 19)
text(bond.data$length[i1], bond.data$strength[i1],
     labels = i1, pos = 1, offset = 0.4, cex = 0.75)

## 2) height vs strength
i2 <- which(!is.na(bond.data$height) & !is.na(bond.data$strength))
plot(bond.data$height[i2], bond.data$strength[i2],
     xlab = "Die height", ylab = "Pull strength", pch = 19)
text(bond.data$height[i2], bond.data$strength[i2],
     labels = i2, pos = 1, offset = 0.4, cex = 0.75)

## 3) height vs length
i3 <- which(!is.na(bond.data$height) & !is.na(bond.data$length))
plot(bond.data$height[i3], bond.data$length[i3],
     xlab = "Die height", ylab = "Length", pch = 19)
text(bond.data$height[i3], bond.data$length[i3],
     labels = i3, pos = 1, offset = 0.4, cex = 0.75)

3D Visualize

Code

library(scatterplot3d)

par(mfrow = c(1,1))
s3d <- with(bond.data, scatterplot3d(
  x = length,
  y = height,
  z = strength,
  pch = 19,
  color = "steelblue",
  main = "3D Scatterplot: Strength vs. Length and Height",
  xlab = "Length",
  ylab = "Height",
  zlab = "Strength",
  angle = 60
))

fit <- lm(strength ~ length + height, data = bond.data)
s3d$plane3d(fit, lty.box = "solid")

4.1.2 Model Fitting and Summary

We fit a multiple linear regression model with strength as the response variable and length and height as predictors.

Code

fit <- lm(strength ~ length + height, data = bond.data)
summary(fit)


Call:
lm(formula = strength ~ length + height, data = bond.data)

Residuals:
   Min     1Q Median     3Q    Max 
-3.865 -1.542 -0.362  1.196  5.841 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.263791   1.060066   2.136 0.044099 *  
length      2.744270   0.093524  29.343  < 2e-16 ***
height      0.012528   0.002798   4.477 0.000188 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.288 on 22 degrees of freedom
Multiple R-squared:  0.9811,    Adjusted R-squared:  0.9794 
F-statistic: 572.2 on 2 and 22 DF,  p-value: < 2.2e-16

The summary provides the ANOVA F-test for overall significance, \(R^2\), adjusted \(R^2\), and t-tests for individual coefficients.

4.1.3 Confidence Intervals and Model Components

Code

## Confidence intervals
confint(fit)

                  2.5 %     97.5 %
(Intercept) 0.065348613 4.46223426
length      2.550313061 2.93822623
height      0.006724246 0.01833138

Code

## Fitted values and residuals
pred <- fitted.values(fit)
e <- resid(fit)
data.frame(y = bond.data$strength, y.hat = pred, e = e)

Code

## Covariance matrix and standard errors
cov.mat <- vcov(fit)
cov.mat

             (Intercept)        length        height
(Intercept)  1.123740429 -3.921612e-02 -1.781991e-03
length      -0.039216122  8.746709e-03 -9.903775e-05
height      -0.001781991 -9.903775e-05  7.831149e-06

Code

data.frame(std.error = sqrt(diag(cov.mat)))

4.2 RSS-based Inference: F-test, and adjusted \(R^2\)

The General Linear Model

The general linear model is:

\[y = X\beta + \epsilon\]

\(y\): \(n \times 1\) vector of responses
\(X\): \(n \times p\) design matrix (first column often ones)
\(\beta\): \(p \times 1\) parameter vector, where \(p=k+1\)
\(\epsilon\): \(n \times 1\) error vector

4.2.1 RSS-Based Quantities

4.2.1.1 RSS-Based Quantities

Source	Sum of Squares	\(R^2\)	df	Mean Squares	\(F\)	SS\(_\mathrm{adj}\)	\(\hat{\sigma}^2\)	\(R^2_{\mathrm{adj}}\)
\(x^\top\beta\)	\(\mathrm{SSR} = \displaystyle \sum_{i=1}^n (\hat y_i - \bar y)^2\)	\(\displaystyle \frac{\mathrm{SSR}}{\mathrm{SST}}\)	\(k\)	\(\displaystyle \mathrm{MSR} = \frac{\mathrm{SSR}}{k}\)	\(\displaystyle \frac{\mathrm{MSR}}{\mathrm{MSE}}\)	\(\mathrm{SSR}_{\mathrm{adj}}\)	\(\displaystyle \hat{\sigma}^2_{x^\top\beta} = \frac{\mathrm{SSR}_{\mathrm{adj}}}{n-1}\)	\(\displaystyle \frac{\mathrm{SSR}_{\mathrm{adj}}}{\mathrm{SST}} = 1 - \frac{\mathrm{MSE}}{\mathrm{MST}}\)
\(\epsilon\)	\(\mathrm{SSE} = \displaystyle \sum_{i=1}^n (y_i - \hat y_i)^2\)	—	\(n-p\)	\(\displaystyle \mathrm{MSE} = \frac{\mathrm{SSE}}{n-p}\)	—	\(\mathrm{SSE}\)	\(\displaystyle \hat{\sigma}^2_{\epsilon} = \mathrm{MSE}\)	—
\(y\)	\(\mathrm{SST} = \displaystyle \sum_{i=1}^n (y_i - \bar y)^2\)	—	\(n-1\)	\(\displaystyle \mathrm{MST} = \frac{\mathrm{SST}}{n-1}\)	—	\(\mathrm{SST}\)	\(\displaystyle \hat{\sigma}^2_{y} = \mathrm{MST}\)	—

Interpretation of the \(\hat{\sigma}^2\) Column

The \(\hat{\sigma}^2\) column highlights how each sum of squares corresponds to an estimated variance.
This view makes the adjusted coefficient of determination clear:

\[ R^2_{\mathrm{adj}} = 1 - \frac{\hat{\sigma}^2_\epsilon}{\hat{\sigma}^2_y} = \frac{\hat{\sigma}^2_{x^\top\beta}}{\hat{\sigma}^2_y}. \]

Hence, the adjusted \(R^2\) simply expresses the proportion of total estimated variance attributable to the fitted model \(X\beta\) rather than the residual noise \(\epsilon\).

4.2.2 Remarks

4.2.2.1 Fundamental Identities

\[ \begin{aligned} \mathrm{SST} &= \mathrm{SSR} + \mathrm{SSE}, \\ \mathrm{MST} &= \mathrm{MSE} + \frac{\mathrm{SSR}_{\mathrm{adj}}}{n-1}. \end{aligned} \]

where

\[ \mathrm{SSR}_{\mathrm{adj}} = (n-1)MST-(n-p+k)\mathrm{MSE} = \mathrm{SST}-\mathrm{SSE} - k\,\mathrm{MSE} = \mathrm{SSR} - k\,\mathrm{MSE}. \]

4.2.2.2 Difference of \(\hat{\sigma}^2\) and Mean Squares

The quantity \(\hat{\sigma}^2\) represents the estimated variance associated with each component of the model. MSE and MST are the estimated variances of the \(\epsilon\) and \(y\) itself. However, the MSR, although called Mean Square for Regression (MSR) is NOT an estimate of the variance or sample variance of \(x^\top \beta\). The name of “mean” here is used to indicate a different thing. Its name “Mean Square” reflects that it is also an estimate estimate of noise variance \(\sigma^2\) under \(H_0\!:\,\beta = 0\):

\[ E[\mathrm{MSR} \mid H_0] = \sigma^2, \qquad E[\mathrm{MSR} \mid H_1] > \sigma^2. \]

Hence the F-statistic

\[ F = \frac{\mathrm{MSR}}{\mathrm{MSE}} \] is approximately equal to 1 subject to the variability as characterized with F-distribution with degree freedoms of \(k\) and \(n-p\). This test is to test whether any regression coefficients are not equal to 0.

4.2.2.3 \(\hat \sigma^2_{x^\top\beta}=\frac{\mathrm{SSR}_{\text{adj}}}{n-1}\)

\(\hat \sigma^2_{x^\top\beta}\) is an unbiased estimator of the variance of linear signal when \(x\) is a regarded as a random variable. This can be seen from the following equations: \[ E[\mathrm{SSR}] = k\,\sigma^2 + \beta^\top X^\top (I - J/n)\,X\,\beta, \qquad E[\mathrm{MSE}] = \sigma^2. \] Hence, \[ \begin{aligned} E[\mathrm{SSR}_{\mathrm{adj}}] &= E[\mathrm{SSR}] - k\,E[\mathrm{MSE}] \\ &= \beta^\top X^\top (I - J/n)\,X\,\beta \\ &= \sum_{i=1}^n (\mu_i - \bar\mu)^2, \end{aligned} \]

where \[ \begin{aligned} \mu_i &= x_i^\top \beta \\ \bar\mu &= \tfrac{1}{n}\sum_{i=1}^n \mu_i \end{aligned} \] For fixed \(X\), \(\mathrm{SSR}_{\text{adj}}/(n-1)\) equals the sample variance of the true means \(\{\mu_i\}\) over the observed design points. If the rows of \(X\) are independently sampled with covariance matrix \(\Sigma_X\) (the random-\(X\) model), then

\[ \mathbb{E}_X\!\left[\frac{\mathrm{SSR}_{\text{adj}}}{n-1}\right] = \beta^\top \Sigma_X \beta = \mathrm{Var}(x^\top \beta), \]

4.2.2.4 Connection to Rao-Blackwell Formula

The decomposition of \(\hat{\sigma}^2\) is consistent with the Rao–Blackwell formula for total variance:

\[ \mathrm{Var}(y) = \mathrm{Var}\!\big(E[y \mid x]\big) + E\!\big(\mathrm{Var}[y \mid x]\big). \]

Here,

\(\mathrm{Var}\!\big(E[y \mid x]\big)\) corresponds to the explained variation due to the regression component \(x^\top\beta\), and
\(E\!\big(\mathrm{Var}[y \mid x]\big)\) corresponds to the residual variation due to \(\epsilon\).

4.2.3 An Animation for Illustrating \(R^2\) under \(H_0\) and \(H_1\)

We simulate a dataset with \(n=30\) observations and consider a sequence of nested models adding groups of predictors.

Predictor Groups:

Step 1 (\(k=1\)): Add \(x_1\). (Signal under \(H_1\)).
Step 2 (\(k=6\)): Add \(x_2, \dots, x_6\) (Noise).
Step 3 (\(k=11\)): Add \(x_7, \dots, x_{11}\) (Noise).
Step 4 (\(k=20\)): Add \(x_{12}, \dots, x_{20}\) (Noise).

Null Hypothesis (\(H_0\))
Alternative Hypothesis (\(H_1\))

Under \(H_0\), the true coefficient for \(x_1\) is \(\beta_1 = 0\). All predictors are noise.

Under \(H_1\), \(x_1\) is a true predictor (\(\beta_1 = 2\)). The subsequent groups (\(x_2 \dots x_{20}\)) remain noise.

4.2.4 Example: Modelling Children Weight with Height and Age

Code

## Data: Weight, height and age of children
wgt <- c(64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68)
hgt <- c(57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57)
age <- c(8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9)
child.data <- data.frame(wgt, hgt, age)

4.2.4.1 Problem 1: Height then Age

Code

fit_hgt_age <- lm(wgt ~ hgt + age, data = child.data)
summary(fit_hgt_age)


Call:
lm(formula = wgt ~ hgt + age, data = child.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8708 -1.7004  0.3454  1.4642 10.2336 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   6.5530    10.9448   0.599   0.5641  
hgt           0.7220     0.2608   2.768   0.0218 *
age           2.0501     0.9372   2.187   0.0565 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared:   0.78, Adjusted R-squared:  0.7311 
F-statistic: 15.95 on 2 and 9 DF,  p-value: 0.001099

Code

fit_hgt <- lm(wgt ~ hgt, data = child.data)
summary(fit_hgt)


Call:
lm(formula = wgt ~ hgt, data = child.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.8736 -3.8973 -0.4402  2.2624 11.8375 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   6.1898    12.8487   0.482  0.64035   
hgt           1.0722     0.2417   4.436  0.00126 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.471 on 10 degrees of freedom
Multiple R-squared:  0.663, Adjusted R-squared:  0.6293 
F-statistic: 19.67 on 1 and 10 DF,  p-value: 0.001263

Code

anova(fit_hgt, fit_hgt_age)

Code

anova(fit_hgt_age)

4.2.4.2 Problem 2: Age then Height

Code

fit_age <- lm(wgt ~ age, data = child.data)
summary(fit_age)


Call:
lm(formula = wgt ~ age, data = child.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.000  -3.911   1.143   4.071  10.000 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  30.5714     8.6137   3.549  0.00528 **
age           3.6429     0.9551   3.814  0.00341 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.015 on 10 degrees of freedom
Multiple R-squared:  0.5926,    Adjusted R-squared:  0.5519 
F-statistic: 14.55 on 1 and 10 DF,  p-value: 0.003407

Code

fit_age_hgt <- lm(wgt ~ age + hgt, data = child.data)
summary(fit_age_hgt)


Call:
lm(formula = wgt ~ age + hgt, data = child.data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8708 -1.7004  0.3454  1.4642 10.2336 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   6.5530    10.9448   0.599   0.5641  
age           2.0501     0.9372   2.187   0.0565 .
hgt           0.7220     0.2608   2.768   0.0218 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.66 on 9 degrees of freedom
Multiple R-squared:   0.78, Adjusted R-squared:  0.7311 
F-statistic: 15.95 on 2 and 9 DF,  p-value: 0.001099

Code

anova(fit_age, fit_age_hgt)

Code

anova(fit_age_hgt)

4.2.5 Example: Wire bond strength

Code

fit_len_hgt <-  lm(strength ~ length + height, data = bond.data)
fit_hgt_len <-  lm(strength ~ height+length, data = bond.data)
anova(fit_len_hgt)

Code

anova(fit_hgt_len)

Code

summary(fit_hgt_len)


Call:
lm(formula = strength ~ height + length, data = bond.data)

Residuals:
   Min     1Q Median     3Q    Max 
-3.865 -1.542 -0.362  1.196  5.841 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.263791   1.060066   2.136 0.044099 *  
height      0.012528   0.002798   4.477 0.000188 ***
length      2.744270   0.093524  29.343  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.288 on 22 degrees of freedom
Multiple R-squared:  0.9811,    Adjusted R-squared:  0.9794 
F-statistic: 572.2 on 2 and 22 DF,  p-value: < 2.2e-16

Code

summary(fit_len_hgt)


Call:
lm(formula = strength ~ length + height, data = bond.data)

Residuals:
   Min     1Q Median     3Q    Max 
-3.865 -1.542 -0.362  1.196  5.841 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 2.263791   1.060066   2.136 0.044099 *  
length      2.744270   0.093524  29.343  < 2e-16 ***
height      0.012528   0.002798   4.477 0.000188 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.288 on 22 degrees of freedom
Multiple R-squared:  0.9811,    Adjusted R-squared:  0.9794 
F-statistic: 572.2 on 2 and 22 DF,  p-value: < 2.2e-16

4.2.6 Relationship between t-test and partial F-test

A t-test for a single coefficient is a special case of the partial F-test; the relationship is \(F = t^2\) for 1 df in the numerator.
The p-value from t-test (output of summary(lm())) is the same as anova test for: \(H_0: \beta_j = 0\) vs \(H_1\): all covaraites have non-zero effects.

4.3 Predictions for Mean Response and a Future Observation

4.3.1 Confidence Interval for Mean Response

Code

predict(fit, newdata = data.frame(length = 8, height = 275),
        interval = "confidence", level = 0.95)

      fit      lwr      upr
1 27.6631 26.66324 28.66296

4.3.2 Prediction Interval for a New Observation

Code

predict(fit, newdata = data.frame(length = 8, height = 275),
        interval = "prediction", level = 0.95)

      fit      lwr      upr
1 27.6631 22.81378 32.51241

4.4 Model Diagnostics

4.4.1 Residual Calculations

Code

residuals_df <- data.frame(
  hat_values = hatvalues(fit),
  ordinary_resid = resid(fit),
  standardized_resid = resid(fit) / sigma(fit),
  studentized_internal = rstandard(fit),
  studentized_external = rstudent(fit)
)
residuals_df

4.4.2 Residual Plots

Code

n <- nrow(bond.data)
r <- rstudent(fit) 
y.hat <- fitted.values(fit)

par(mfrow = c(2, 3), mar = c(4, 4, 2, 1))
qqnorm(r, main = "Normal Q-Q Plot"); qqline(r)
plot(y.hat, r, xlab = "Fitted values", ylab = "Studentized Residuals"); abline(h = 0)
plot(1:n, r, xlab = "Observation Number", ylab = "Studentized Residuals"); abline(h = 0)
plot(bond.data$length, r, xlab = "Wire Length", ylab = "Studentized Residuals"); abline(h = 0)
plot(bond.data$height, r, xlab = "Die Height", ylab = "Studentized Residuals"); abline(h = 0)

4.5 Influential Observations

Code

influence_df <- data.frame(dffits = dffits(fit),
                           cook.D = cooks.distance(fit),
                           dfbetas(fit))
influence_df

4.5.1 Plotting with the `olsrr` Package

Code

## install.packages("olsrr") # Run once if needed
library(olsrr)

ols_plot_cooksd_chart(fit)

Code

ols_plot_dffits(fit)

Code

ols_plot_dfbetas(fit)

4.6 Polynomial Regression

Code

y <- c(1.81, 1.70, 1.65, 1.55, 1.48, 1.40, 1.30, 1.26, 1.24, 1.21, 1.20, 1.18)
x <- c(20, 25, 30, 35, 40, 50, 60, 65, 70, 75, 80, 90)
fit_poly <- lm(y ~ x + I(x^2))
summary(fit_poly)


Call:
lm(formula = y ~ x + I(x^2))

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0174763 -0.0065087  0.0001297  0.0071482  0.0151887 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.198e+00  2.255e-02   97.48 6.38e-15 ***
x           -2.252e-02  9.424e-04  -23.90 1.88e-09 ***
I(x^2)       1.251e-04  8.658e-06   14.45 1.56e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01219 on 9 degrees of freedom
Multiple R-squared:  0.9975,    Adjusted R-squared:  0.9969 
F-statistic:  1767 on 2 and 9 DF,  p-value: 2.096e-12

Code

plot(x, y, xlab = "Lot size, x", ylab = "Average cost per unit, y")
lines(x, predict(fit_poly, newdata = data.frame(x = x)), type = "l")

Code

fit1 <- lm(y ~ x)
anova(fit1, fit_poly)

4.7 Handling Categorical Variables with Dummy Variables

Investigate the common observation that males tend to have higher blood pressure than females of similar age.

Code

## Note: Update this path to your local file location
sbpdata <- read.csv("sbpdata.csv")
sbpdata

4.7.1 Four Models Involving “sex”

4.7.1.1 Coincidence Model (Age Only)

Code

## Ensure sex is a factor (labels will appear in the legend)
sbpdata$sex <- as.factor(sbpdata$sex)

## Fit (you already have this)
fit.age <- lm(sbp ~ age, data = sbpdata)

## Generate predictions over the observed age range
new_age <- seq(min(sbpdata$age, na.rm = TRUE),
               max(sbpdata$age, na.rm = TRUE),
               length.out = 200)
pred <- predict(fit.age, newdata = data.frame(age = new_age))

## Simple palette for the sex levels (works for 1–3 levels; expand if needed)
lev  <- levels(sbpdata$sex)
cols <- setNames(c("steelblue3", "tomato3", "darkorchid3")[seq_along(lev)], lev)

## Scatter plot with colored points by sex
plot(sbp ~ age, data = sbpdata,
     col = cols[sbpdata$sex], pch = 16,
     xlab = "Age", ylab = "Systolic BP")

## Add predicted line
lines(new_age, pred, lwd = 2)

## Legend
legend("topleft", legend = lev, col = cols[lev], pch = 16, bty = "n", title = "Sex")

Code

data.frame(model.matrix(fit.age))

Code

print(anova(fit.age))

Analysis of Variance Table

Response: sbp
          Df  Sum Sq Mean Sq F value    Pr(>F)    
age        1 14951.3 14951.3  121.27 < 2.2e-16 ***
Residuals 67  8260.5   123.3                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.7.1.2 Additive Effect Model (Age + Sex)

Code

## Parallelism: H0: beta3=0 (Sex has additive effect)
fit.agePLUSsex <- lm(sbp ~ age + sex, data = sbpdata)

## Ensure sex is a factor for labeling/colors
sbpdata$sex <- factor(sbpdata$sex)

## Fit (additive: parallelism)
fit.agePLUSsex <- lm(sbp ~ age + sex, data = sbpdata)

## X-range and palette
ages <- seq(min(sbpdata$age, na.rm = TRUE),
            max(sbpdata$age, na.rm = TRUE),
            length.out = 200)
lev  <- levels(sbpdata$sex)
cols <- setNames(c("steelblue3", "tomato3", "darkorchid3")[seq_along(lev)], lev)

## Scatter with colored points by sex
plot(sbp ~ age, data = sbpdata,
     col = cols[sbpdata$sex], pch = 16,
     xlab = "Age", ylab = "Systolic BP")

## Parallel fitted lines: one per sex (same slope, different intercepts)
for (sx in lev) {
  nd <- data.frame(age = ages, sex = factor(sx, levels = lev))
  yhat <- predict(fit.agePLUSsex, newdata = nd)
  lines(ages, yhat, col = cols[sx], lwd = 2)
}

## Legend
legend("topleft", legend = lev, col = cols[lev], pch = 16, lwd = 2, bty = "n", title = "Sex")

Code

data.frame(model.matrix(fit.agePLUSsex))

Code

print(anova(fit.age, fit.agePLUSsex))

Analysis of Variance Table

Model 1: sbp ~ age
Model 2: sbp ~ age + sex
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1     67 8260.5                                  
2     66 5202.0  1    3058.5 38.805 3.701e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.7.1.3 Varying Intercept and Varying Slope Model (Age + Sex + Age:Sex)

Code

## Make sure sex is a factor (for colors/legend)
sbpdata$sex <- factor(sbpdata$sex)

## Fit (interaction: different slopes by sex)
fit.age.TIMES.sex <- lm(sbp ~ age + sex + age:sex, data = sbpdata)

## Age grid and palette
ages <- seq(min(sbpdata$age, na.rm = TRUE),
            max(sbpdata$age, na.rm = TRUE),
            length.out = 200)
lev  <- levels(sbpdata$sex)
cols <- setNames(c("steelblue3", "tomato3", "darkorchid3")[seq_along(lev)], lev)

## Scatter: color points by sex
plot(sbp ~ age, data = sbpdata,
     col = cols[sbpdata$sex], pch = 16,
     xlab = "Age", ylab = "Systolic BP")

## Fitted lines: one per sex (different slopes allowed)
for (sx in lev) {
  nd <- data.frame(age = ages, sex = factor(sx, levels = lev))
  yhat <- predict(fit.age.TIMES.sex, newdata = nd)
  lines(ages, yhat, col = cols[sx], lwd = 2)
}

## Legend
legend("topleft", legend = lev, col = cols[lev], pch = 16, lwd = 2, bty = "n", title = "Sex")

Model Matrix and ANOVA

Code

data.frame(model.matrix(fit.age.TIMES.sex))

Code

summary(fit.age.TIMES.sex)


Call:
lm(formula = sbp ~ age + sex + age:sex, data = sbpdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.647  -3.410   1.254   4.314  21.153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.07708    5.17046  18.775  < 2e-16 ***
age          0.94932    0.10864   8.738 1.43e-12 ***
sex1        12.96144    7.01172   1.849   0.0691 .  
age:sex1     0.01203    0.14519   0.083   0.9342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.946 on 65 degrees of freedom
Multiple R-squared:  0.7759,    Adjusted R-squared:  0.7656 
F-statistic: 75.02 on 3 and 65 DF,  p-value: < 2.2e-16

Code

print(anova(fit.age,fit.agePLUSsex,fit.age.TIMES.sex))

Analysis of Variance Table

Model 1: sbp ~ age
Model 2: sbp ~ age + sex
Model 3: sbp ~ age + sex + age:sex
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
1     67 8260.5                                   
2     66 5202.0  1   3058.52 38.2210 4.692e-08 ***
3     65 5201.4  1      0.55  0.0069    0.9342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4.7.1.4 Varying Slope, Equal Intercept Model (Age + Age:Sex)

Code

## Make sure sex is a factor (for colors/legend)
sbpdata$sex <- factor(sbpdata$sex)

## Fit (interaction: different slopes by sex)
fit.equal.intercept <- lm(sbp ~ age + age:sex, data = sbpdata)


## Age grid and palette
ages <- seq(min(sbpdata$age, na.rm = TRUE),
            max(sbpdata$age, na.rm = TRUE),
            length.out = 200)
lev  <- levels(sbpdata$sex)
cols <- setNames(c("steelblue3", "tomato3", "darkorchid3")[seq_along(lev)], lev)

## Scatter: color points by sex
plot(sbp ~ age, data = sbpdata,
     col = cols[sbpdata$sex], pch = 16,
     xlab = "Age", ylab = "Systolic BP")

## Fitted lines: one per sex (different slopes allowed)
for (sx in lev) {
  nd <- data.frame(age = ages, sex = factor(sx, levels = lev))
  yhat <- predict(fit.equal.intercept, newdata = nd)
  lines(ages, yhat, col = cols[sx], lwd = 2)
}

## Legend
legend("topleft", legend = lev, col = cols[lev], pch = 16, lwd = 2, bty = "n", title = "Sex")

4.7.2 Orders of Terms Matters in ANOVA and Warnings in Interpreting t-test Tables

Code

fit.int <- lm(sbp ~ 1, data = sbpdata)
fit.sex <- lm(sbp ~ sex, data = sbpdata)

print(anova(fit.int,fit.age,fit.agePLUSsex, fit.age.TIMES.sex))

Analysis of Variance Table

Model 1: sbp ~ 1
Model 2: sbp ~ age
Model 3: sbp ~ age + sex
Model 4: sbp ~ age + sex + age:sex
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1     68 23211.8                                    
2     67  8260.5  1   14951.3 186.8390 < 2.2e-16 ***
3     66  5202.0  1    3058.5  38.2210 4.692e-08 ***
4     65  5201.4  1       0.5   0.0069    0.9342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

print(anova(fit.int,fit.age,fit.equal.intercept, fit.age.TIMES.sex))

Analysis of Variance Table

Model 1: sbp ~ 1
Model 2: sbp ~ age
Model 3: sbp ~ age + age:sex
Model 4: sbp ~ age + sex + age:sex
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1     68 23211.8                                    
2     67  8260.5  1   14951.3 186.8390 < 2.2e-16 ***
3     66  5474.9  1    2785.6  34.8107 1.437e-07 ***
4     65  5201.4  1     273.4   3.4171   0.06907 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

print(anova(fit.int,fit.sex,fit.agePLUSsex, fit.age.TIMES.sex))

Analysis of Variance Table

Model 1: sbp ~ 1
Model 2: sbp ~ sex
Model 3: sbp ~ age + sex
Model 4: sbp ~ age + sex + age:sex
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1     68 23211.8                                    
2     67 19282.5  1    3929.2  49.1017 1.684e-09 ***
3     66  5202.0  1   14080.6 175.9583 < 2.2e-16 ***
4     65  5201.4  1       0.5   0.0069    0.9342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

summary(fit.age)


Call:
lm(formula = sbp ~ age, data = sbpdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.782  -7.632   1.968   8.201  22.651 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 103.34905    4.33190   23.86   <2e-16 ***
age           0.98333    0.08929   11.01   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.1 on 67 degrees of freedom
Multiple R-squared:  0.6441,    Adjusted R-squared:  0.6388 
F-statistic: 121.3 on 1 and 67 DF,  p-value: < 2.2e-16

Code

summary(fit.equal.intercept)


Call:
lm(formula = sbp ~ age + age:sex, data = sbpdata)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.6338  -4.3067   0.9922   4.9819  20.2753 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 104.12501    3.55578  29.283  < 2e-16 ***
age           0.80908    0.07918  10.219 3.14e-15 ***
age:sex1      0.26705    0.04608   5.795 2.09e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.108 on 66 degrees of freedom
Multiple R-squared:  0.7641,    Adjusted R-squared:  0.757 
F-statistic: 106.9 on 2 and 66 DF,  p-value: < 2.2e-16

Code

summary(fit.agePLUSsex)


Call:
lm(formula = sbp ~ age + sex, data = sbpdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.705  -3.299   1.248   4.325  21.160 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 96.77353    3.62085  26.727  < 2e-16 ***
age          0.95606    0.07153  13.366  < 2e-16 ***
sex1        13.51345    2.16932   6.229  3.7e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.878 on 66 degrees of freedom
Multiple R-squared:  0.7759,    Adjusted R-squared:  0.7691 
F-statistic: 114.2 on 2 and 66 DF,  p-value: < 2.2e-16

Code

summary(fit.age.TIMES.sex)


Call:
lm(formula = sbp ~ age + sex + age:sex, data = sbpdata)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.647  -3.410   1.254   4.314  21.153 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 97.07708    5.17046  18.775  < 2e-16 ***
age          0.94932    0.10864   8.738 1.43e-12 ***
sex1        12.96144    7.01172   1.849   0.0691 .  
age:sex1     0.01203    0.14519   0.083   0.9342    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.946 on 65 degrees of freedom
Multiple R-squared:  0.7759,    Adjusted R-squared:  0.7656 
F-statistic: 75.02 on 3 and 65 DF,  p-value: < 2.2e-16

4.8 Model Building

Code

library(olsrr)
## Note: Update this path to your local file location
wine <- read.csv("wine.csv")

model.wine <- lm(quality ~ ., data = wine)

4.8.1 All Possible Regression

Code

ols_step_best_subset(model.wine)

             Best Subsets Regression             
-------------------------------------------------
Model Index    Predictors
-------------------------------------------------
     1         flavor                             
     2         flavor oakiness                    
     3         aroma flavor oakiness              
     4         clarity aroma flavor oakiness      
     5         clarity aroma body flavor oakiness 
-------------------------------------------------

                                                  Subsets Regression Summary                                                   
-------------------------------------------------------------------------------------------------------------------------------
                       Adj.        Pred                                                                                         
Model    R-Square    R-Square    R-Square     C(p)       AIC        SBIC        SBC        MSEP       FPE       HSP       APC  
-------------------------------------------------------------------------------------------------------------------------------
  1        0.6242      0.6137      0.5868    9.0436    130.0214    21.6859    134.9341    61.4102    1.7010    0.0462    0.4176 
  2        0.6611      0.6417      0.6058    6.8132    128.0901    20.1242    134.6404    57.0033    1.6171    0.0441    0.3970 
  3        0.7038      0.6776      0.6379    3.9278    124.9781    18.0702    133.1661    51.3383    1.4906    0.0409    0.3659 
  4        0.7147      0.6801      0.6102    4.6747    125.5480    19.2854    135.3736    50.9872    1.5143    0.0418    0.3717 
  5        0.7206      0.6769       0.587    6.0000    126.7552    21.0956    138.2183    51.5452    1.5649    0.0436    0.3842 
-------------------------------------------------------------------------------------------------------------------------------
AIC: Akaike Information Criteria 
 SBIC: Sawa's Bayesian Information Criteria 
 SBC: Schwarz Bayesian Criteria 
 MSEP: Estimated error of prediction, assuming multivariate normality 
 FPE: Final Prediction Error 
 HSP: Hocking's Sp 
 APC: Amemiya Prediction Criteria

4.8.2 Automated Stepwise Procedures

Code

## Backward Elimination (alpha_out = 0.1)
ols_step_backward_p(model.wine, p_val = 0.1)


                             Stepwise Summary                             
------------------------------------------------------------------------
Step    Variable        AIC        SBC       SBIC       R2       Adj. R2 
------------------------------------------------------------------------
 0      Full Model    126.755    138.218    21.096    0.72060    0.67694 
 1      body          125.548    135.374    19.285    0.71471    0.68013 
 2      clarity       124.978    133.166    18.070    0.70377    0.67763 
------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.839       RMSE                 1.098 
R-Squared               0.704       MSE                  1.207 
Adj. R-Squared          0.678       Coef. Var            9.338 
Pred R-Squared          0.638       AIC                124.978 
MAE                     0.868       SBC                133.166 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
               Sum of                                              
              Squares        DF    Mean Square      F         Sig. 
-------------------------------------------------------------------
Regression    108.935         3         36.312    26.925    0.0000 
Residual       45.853        34          1.349                     
Total         154.788        37                                    
-------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)     6.467         1.333                  4.852    0.000     3.759     9.176 
      aroma     0.580         0.262        0.307     2.213    0.034     0.047     1.113 
     flavor     1.200         0.275        0.603     4.364    0.000     0.641     1.758 
   oakiness    -0.602         0.264       -0.217    -2.278    0.029    -1.140    -0.065 
----------------------------------------------------------------------------------------

Code

## Forward Selection (alpha_in = 0.1)
ols_step_forward_p(model.wine, p_val = 0.1)


                             Stepwise Summary                             
------------------------------------------------------------------------
Step    Variable        AIC        SBC       SBIC       R2       Adj. R2 
------------------------------------------------------------------------
 0      Base Model    165.209    168.484    55.141    0.00000    0.00000 
 1      flavor        130.021    134.934    21.686    0.62417    0.61373 
 2      oakiness      128.090    134.640    20.124    0.66111    0.64175 
 3      aroma         124.978    133.166    18.070    0.70377    0.67763 
------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.839       RMSE                 1.098 
R-Squared               0.704       MSE                  1.207 
Adj. R-Squared          0.678       Coef. Var            9.338 
Pred R-Squared          0.638       AIC                124.978 
MAE                     0.868       SBC                133.166 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
               Sum of                                              
              Squares        DF    Mean Square      F         Sig. 
-------------------------------------------------------------------
Regression    108.935         3         36.312    26.925    0.0000 
Residual       45.853        34          1.349                     
Total         154.788        37                                    
-------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)     6.467         1.333                  4.852    0.000     3.759     9.176 
     flavor     1.200         0.275        0.603     4.364    0.000     0.641     1.758 
   oakiness    -0.602         0.264       -0.217    -2.278    0.029    -1.140    -0.065 
      aroma     0.580         0.262        0.307     2.213    0.034     0.047     1.113 
----------------------------------------------------------------------------------------

Code

## Stepwise Regression (alpha_in = 0.1, alpha_out = 0.1)
ols_step_both_p(model.wine, p_enter = 0.1, p_remove = 0.1)


                              Stepwise Summary                              
--------------------------------------------------------------------------
Step    Variable          AIC        SBC       SBIC       R2       Adj. R2 
--------------------------------------------------------------------------
 0      Base Model      165.209    168.484    55.141    0.00000    0.00000 
 1      flavor (+)      130.021    134.934    21.686    0.62417    0.61373 
 2      oakiness (+)    128.090    134.640    20.124    0.66111    0.64175 
 3      aroma (+)       124.978    133.166    18.070    0.70377    0.67763 
--------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.839       RMSE                 1.098 
R-Squared               0.704       MSE                  1.207 
Adj. R-Squared          0.678       Coef. Var            9.338 
Pred R-Squared          0.638       AIC                124.978 
MAE                     0.868       SBC                133.166 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                               ANOVA                                
-------------------------------------------------------------------
               Sum of                                              
              Squares        DF    Mean Square      F         Sig. 
-------------------------------------------------------------------
Regression    108.935         3         36.312    26.925    0.0000 
Residual       45.853        34          1.349                     
Total         154.788        37                                    
-------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)     6.467         1.333                  4.852    0.000     3.759     9.176 
     flavor     1.200         0.275        0.603     4.364    0.000     0.641     1.758 
   oakiness    -0.602         0.264       -0.217    -2.278    0.029    -1.140    -0.065 
      aroma     0.580         0.262        0.307     2.213    0.034     0.047     1.113 
----------------------------------------------------------------------------------------

4.9 Multicollinearity

4.9.1 A Simple Example

Code

y <- c(19, 20, 37, 39, 36, 38)
x1 <- c(4, 4, 7, 7, 7.1, 7.1)
x2 <- c(16, 16, 49, 49, 50.4, 50.4)
cor(data.frame(x1, x2))

          x1        x2
x1 1.0000000 0.9999713
x2 0.9999713 1.0000000

Code

fit_multi <- lm(y ~ x1 + x2)
summary(fit_multi)


Call:
lm(formula = y ~ x1 + x2)

Residuals:
   1    2    3    4    5    6 
-0.5  0.5 -1.0  1.0 -1.0  1.0 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -156.056    117.158  -1.332    0.275
x1            65.444     45.890   1.426    0.249
x2            -5.389      4.152  -1.298    0.285

Residual standard error: 1.225 on 3 degrees of freedom
Multiple R-squared:  0.9897,    Adjusted R-squared:  0.9829 
F-statistic: 144.3 on 2 and 3 DF,  p-value: 0.001043

Code

fit1_multi <- lm(y ~ x1)
summary(fit1_multi)


Call:
lm(formula = y ~ x1)

Residuals:
      1       2       3       4       5       6 
-0.5260  0.4740 -0.1925  1.8075 -1.7814  0.2186 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -4.0293     2.3332  -1.727    0.159    
x1            5.8888     0.3762  15.654 9.73e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.325 on 4 degrees of freedom
Multiple R-squared:  0.9839,    Adjusted R-squared:  0.9799 
F-statistic: 245.1 on 1 and 4 DF,  p-value: 9.725e-05

Code

ols_vif_tol(fit_multi)

4.9.2 VIFs in the Wine Quality Data

Code

wine.x <- wine[, -ncol(wine)] # Assuming quality is the last column
cor(wine.x)

             clarity     aroma       body      flavor  oakiness
clarity   1.00000000 0.0619021 -0.3083783 -0.08515993 0.1832147
aroma     0.06190210 1.0000000  0.5489102  0.73656121 0.2016444
body     -0.30837826 0.5489102  1.0000000  0.64665917 0.1521059
flavor   -0.08515993 0.7365612  0.6466592  1.00000000 0.1797605
oakiness  0.18321471 0.2016444  0.1521059  0.17976051 1.0000000

Code

## VIF using olsrr (data frame output)
ols_vif_tol(model.wine)

4.9.3 VIFs in the Children Height Data

Code

## Data: Weight, height and age of children
wgt <- c(64, 71, 53, 67, 55, 58, 77, 57, 56, 51, 76, 68)
hgt <- c(57, 59, 49, 62, 51, 50, 55, 48, 42, 42, 61, 57)
age <- c(8, 10, 6, 11, 8, 7, 10, 9, 10, 6, 12, 9)

fit_age_hgt <- lm(wgt ~ hgt + age, data = child.data)
ols_vif_tol(fit_age_hgt)

4.1 An Example: Wire Bond Strength Dataset

4.1.1 Loading Data and Visualization

4.1.2 Model Fitting and Summary

4.1.3 Confidence Intervals and Model Components

4.2 RSS-based Inference: F-test, and adjusted \(R^2\)

4.2.1 RSS-Based Quantities

4.2.1.1 RSS-Based Quantities

4.2.2 Remarks

4.2.2.1 Fundamental Identities

4.2.2.2 Difference of \(\hat{\sigma}^2\) and Mean Squares

4.2.2.3 \(\hat \sigma^2_{x^\top\beta}=\frac{\mathrm{SSR}_{\text{adj}}}{n-1}\)

4.2.2.4 Connection to Rao-Blackwell Formula

4.2.3 An Animation for Illustrating \(R^2\) under \(H_0\) and \(H_1\)

4.2.4 Example: Modelling Children Weight with Height and Age

4.2.4.1 Problem 1: Height then Age

4.2.4.2 Problem 2: Age then Height

4.2.5 Example: Wire bond strength

4.2.6 Relationship between t-test and partial F-test

4.3 Predictions for Mean Response and a Future Observation

4.3.1 Confidence Interval for Mean Response

4.3.2 Prediction Interval for a New Observation

4.4 Model Diagnostics

4.4.1 Residual Calculations

4.4.2 Residual Plots

4.5 Influential Observations

4.5.1 Plotting with the olsrr Package

4.6 Polynomial Regression

4.7 Handling Categorical Variables with Dummy Variables

4.7.1 Four Models Involving “sex”

4.7.1.1 Coincidence Model (Age Only)

4.7.1.2 Additive Effect Model (Age + Sex)

4.7.1.3 Varying Intercept and Varying Slope Model (Age + Sex + Age:Sex)

4.7.1.4 Varying Slope, Equal Intercept Model (Age + Age:Sex)

4.7.2 Orders of Terms Matters in ANOVA and Warnings in Interpreting t-test Tables

4.8 Model Building

4.8.1 All Possible Regression

4.8.2 Automated Stepwise Procedures

4.9 Multicollinearity

4.9.1 A Simple Example

4.9.2 VIFs in the Wine Quality Data

4.9.3 VIFs in the Children Height Data

4.5.1 Plotting with the `olsrr` Package