1 Introduction

1.1 Multiple Linear Regression

Suppose we have \(n\) observations on a response \(Y\) and on predictors \(X_j\). The data can be represented in matrix form:

\[ \underset{n \times 1}{y} = \underset{n \times p}{X} \beta + \underset{n \times 1}{\epsilon} \]

where the error terms are distributed as: \[ \epsilon \sim N_n(0, \sigma^2 I_n), \]

in which \(I_n\) is the \(n \times n\) identity matrix: \[ I_n = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{pmatrix} \]

The scalar equation for a single observation, with an intercept and \(p - 1\) predictors so that \(\beta\) has \(p\) entries, is: \[ Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \epsilon_i \]

1.2 Examples

1.2.1 Polynomial Regression

Polynomial regression fits a curve to the data but remains linear in the parameters (\(\beta\)), so it is still a linear model.

The model equation is: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \dots + \beta_{p-1} x_i^{p-1} + \epsilon_i \]

1.2.2 Design Matrix Construction

The design matrix \(X\) is constructed by taking powers of the input variable.

\[ y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 & x_1^2 & \dots & x_1^{p-1} \\ 1 & x_2 & x_2^2 & \dots & x_2^{p-1} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & x_n^2 & \dots & x_n^{p-1} \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} \]
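The construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not a prescribed implementation: the helper name `polynomial_design_matrix` and the input values are made up for the example.

```python
import numpy as np

def polynomial_design_matrix(x, p):
    """Return the n x p matrix whose columns are 1, x, x^2, ..., x^(p-1)."""
    x = np.asarray(x, dtype=float)
    # increasing=True orders the columns from power 0 up to power p-1,
    # matching the layout of X in the matrix equation above.
    return np.vander(x, N=p, increasing=True)

# Hypothetical inputs, chosen only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
X = polynomial_design_matrix(x, p=3)
print(X)  # each row is (1, x_i, x_i^2)
```

Each row depends on \(x_i\) nonlinearly, but the model is still linear in \(\beta\), which is why ordinary least squares applies unchanged.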

1.2.3 One-Way ANOVA

ANOVA can be expressed as a linear model using categorical predictors (dummy variables).

Suppose we have 3 groups (\(G_1, G_2, G_3\)) with observations: \[ Y_{ij} = \mu_i + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2) \]

\[ \overset{G_1}{ \boxed{ \begin{matrix} Y_{11} \\ Y_{12} \end{matrix} } } \quad \overset{G_2}{ \boxed{ \begin{matrix} Y_{21} \\ Y_{22} \end{matrix} } } \quad \overset{G_3}{ \boxed{ \begin{matrix} Y_{31} \\ Y_{32} \end{matrix} } } \]

We construct the matrix \(X\) to select the group mean (\(\mu\)) corresponding to the observation:

\[ \underset{6 \times 1}{y} = \underset{6 \times 3}{X} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix} + \epsilon \]

\[ \begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \\ Y_{31} \\ Y_{32} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{bmatrix} + \epsilon \]
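As a rough numerical check of this layout, the sketch below builds the same \(6 \times 3\) indicator matrix and fits it by least squares (the estimator is introduced in Section 1.3). The response values are invented for illustration. With this coding the fitted coefficients should coincide with the sample mean of each group.

```python
import numpy as np

# Hypothetical responses: two observations in each of three groups
y = np.array([4.0, 6.0, 9.0, 11.0, 1.0, 3.0])
groups = np.array([0, 0, 1, 1, 2, 2])  # group index of each observation

# Indicator (dummy) columns: X[k, i] = 1 if observation k belongs to group i
X = (groups[:, None] == np.arange(3)[None, :]).astype(float)

# Least squares fit of y = X mu + eps
mu_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample mean of each group, computed directly for comparison
group_means = np.array([y[groups == i].mean() for i in range(3)])

print(mu_hat)       # approximately [5, 10, 2]
print(group_means)  # [5, 10, 2]
```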

1.2.4 Analysis of Covariance (ANCOVA)

ANCOVA combines continuous variables and categorical (dummy) variables in the same design matrix.

\[ \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} X_{1,\text{cont}} & 1 & 0 \\ X_{2,\text{cont}} & 1 & 0 \\ \vdots & \vdots & \vdots \\ X_{n,\text{cont}} & 0 & 1 \end{bmatrix} \beta + \epsilon \]
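A small sketch of how such a matrix might be assembled, assuming one continuous covariate and a two-level factor. The variable names and values are hypothetical.

```python
import numpy as np

# Hypothetical data: a continuous covariate and a two-level group label
x_cont = np.array([0.5, 1.5, 2.5, 3.5])
group = np.array([0, 0, 1, 1])

# One indicator column per group level
dummies = (group[:, None] == np.arange(2)[None, :]).astype(float)

# Continuous column first, then the dummy columns, as in the display above
X = np.column_stack([x_cont, dummies])
print(X)
```

Under this coding, \(\beta\) holds a common slope for the covariate followed by one intercept per group. No separate column of ones is included, since the two dummy columns already sum to one in every row.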

1.3 Least Squares Estimation

For the general linear model \(y = X\beta + \epsilon\), the Least Squares estimator is:

\[ \hat{\beta} = (X'X)^{-1}X'y \]

The predicted values (\(\hat{y}\)) are obtained via the Projection Matrix (Hat Matrix) \(P_X\):

\[ \hat{y} = X\hat{\beta} = X(X'X)^{-1}X'y = P_X y \]

The residuals and Sum of Squared Errors are:

\[ \hat{e} = y - \hat{y} \] \[ \text{SSE} = ||\hat{e}||^2 \]

The coefficient of determination is: \[ R^2 = \frac{\text{SST} - \text{SSE}}{\text{SST}} \] where \(\text{SST} = \sum (y_i - \bar{y})^2\).
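The formulas in this section can be traced numerically. The sketch below simulates data under assumed values of \(n\), \(\beta\), and \(\sigma\) (all made up for the example) and evaluates each quantity literally as written. It mirrors the algebra rather than recommending this as the way to compute a fit in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: intercept plus two predictors (assumed values throughout)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.3, size=n)

# Least squares estimator: solve (X'X) b = X'y instead of inverting X'X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Projection (hat) matrix P_X = X (X'X)^{-1} X'
P = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = P @ y                        # fitted values
e_hat = y - y_hat                    # residuals
sse = e_hat @ e_hat                  # ||e_hat||^2
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
r2 = (sst - sse) / sst

print(beta_hat, sse, r2)
```

Two properties worth checking on any such example: \(P_X\) is symmetric and idempotent (\(P_X P_X = P_X\)), and the residuals are orthogonal to every column of \(X\).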

1.4 Geometric Perspective of Least Squares Estimation

We align the coordinate system to the models for clarity:

Reduced Model (\(M_0\)): Represented by the X-axis (labeled \(j_3\)).
- \(\hat{y}_0\) is the projection of \(y\) onto this axis.
Full Model (\(M_1\)): Represented by the XY-plane (the floor).
- \(\hat{y}_1\) is the projection of \(y\) onto this plane (\(z=0\)).
Observed Data (\(y\)): A point in 3D space.

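The “improvement” due to adding predictors is the distance between \(\hat{y}_0\) and \(\hat{y}_1\). Because \(M_0\) is contained in \(M_1\), the Pythagorean theorem gives \(||y - \hat{y}_0||^2 = ||y - \hat{y}_1||^2 + ||\hat{y}_1 - \hat{y}_0||^2\), so the squared distance equals the reduction in SSE from the reduced model to the full model.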

Figure 1.1: Geometric Interpretation: Projection onto Axis (M0) vs Plane (M1)
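The picture can be reproduced with a three-dimensional toy example. The point \(y\) below is an arbitrary choice, and `project` is a hypothetical helper written for this sketch. The reduced model is spanned by the first coordinate axis and the full model by the first two.

```python
import numpy as np

def project(X, y):
    """Orthogonal projection of y onto the column space of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef

y = np.array([3.0, 2.0, 1.0])             # hypothetical observed point

X0 = np.array([[1.0], [0.0], [0.0]])      # M0: the x-axis
X1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 0.0]])               # M1: the xy-plane (z = 0)

y0 = project(X0, y)   # (3, 0, 0)
y1 = project(X1, y)   # (3, 2, 0)

sse0 = np.sum((y - y0) ** 2)          # 2^2 + 1^2 = 5
sse1 = np.sum((y - y1) ** 2)          # 1^2 = 1
improvement = np.sum((y1 - y0) ** 2)  # 2^2 = 4

print(sse0 - sse1, improvement)       # both approximately 4
```

The squared distance between the two projections equals the drop in SSE, which is the Pythagorean theorem applied to the right triangle with vertices \(y\), \(\hat{y}_1\), and \(\hat{y}_0\).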

The geometric perspective is not merely an aid to intuition; it is the most robust framework for mastering linear models. This approach offers three distinct advantages:

Statistical Clarity: Geometry provides the most natural path to understanding the properties of estimators. Viewing least squares estimation as an orthogonal projection makes the decomposition of sums of squares into independent components visually obvious, and it shows that degrees of freedom are subspace dimensions rather than abstract algebraic constants. The sampling distributions of the sums of squares then become straightforward to derive.
Computational Stability: A geometric understanding is essential for implementing efficient and numerically stable algorithms. While the closed-form solution of the normal equations, \(\hat{\beta} = (X'X)^{-1}X'y\), is theoretically valid, evaluating it directly can be numerically hazardous, because forming \(X'X\) squares the condition number of \(X\). The geometric approach leads directly to more stable methods, such as the QR and singular value decompositions, that are the backbone of modern statistical software.
Generalizability: The principles of projection and orthogonality extend far beyond the Gaussian linear model. These geometric insights provide the foundational intuition needed for tackling non-Gaussian optimization problems, including Generalized Linear Models (GLMs) and convex optimization, where solutions can often be viewed as projections onto convex sets.
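To make the computational point concrete, the sketch below fits the same polynomial model two ways: through the normal equations and through a QR decomposition. The data are simulated and every setting (sample size, degree, noise level) is an assumption chosen for illustration. On a mildly ill-conditioned problem like this both routes agree closely; the gap in condition numbers shows why the normal equations degrade first as conditioning worsens.

```python
import numpy as np

rng = np.random.default_rng(1)

# Polynomial design matrix on [0, 1]: its columns are nearly collinear
n = 100
x = np.linspace(0.0, 1.0, n)
X = np.vander(x, N=6, increasing=True)
beta = np.ones(6)
y = X @ beta + rng.normal(scale=0.01, size=n)

# Route 1: normal equations, which require forming X'X
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: reduced QR, X = QR with orthonormal Q (n x p) and
# upper-triangular R (p x p); then solve R b = Q'y.
# np.linalg.solve is a general solver and does not exploit the
# triangular structure of R; it is used here only to keep the
# example free of extra dependencies.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)
print(cond_X, cond_XtX)  # the second is roughly the square of the first
```

Here \(Q Q'\) is the projection matrix \(P_X\), so the QR route computes the projection onto the column space of \(X\) without ever forming \(X'X\).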