9  Estimation and Inference with Non-full-rank Models

9.1 Non-full-rank Models and Parameter Non-identifiability

9.1.1 One-way ANOVA Model

Consider the balanced one-way layout model for \(y_{ij}\) a response on the \(j^{th}\) unit in the \(i^{th}\) treatment group. Suppose that there are \(a\) treatments and \(n\) units in the \(i^{th}\) treatment group, with total sample size \(N=an\). Let \(x_i\) be an indicator vector for treatment \(i\), and \(j_N\) be a vector of ones.

1. The Cell-Means Model

The cell-means model represents the response strictly in terms of the mean of its respective treatment group.

Equation and Vector Form: \[ y_{ij} = \mu_i + e_{ij}, \quad i=1,\dots,a, \quad j=1,\dots,n \] where the \(e_{ij}\) are i.i.d. \(N(0,\sigma^2)\). In vector notation, this is: \[ y = \mu_1 x_1 + \mu_2 x_2 + \dots + \mu_a x_a + e, \quad e \sim N(0, \sigma^2 I) \]

Matrix Formulation (\(n=2\) observations per level): Let the parameter vector be \(\boldsymbol{\mu} = (\mu_1, \mu_2, \dots, \mu_a)^T\). Arranging the response and error vectors by treatment group, the model \(\mathbf{y} = X_1\boldsymbol{\mu} + \mathbf{e}\) is written as: \[ \begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{a1} \\ y_{a2} \end{pmatrix} = \begin{pmatrix} 1 & 0 & \dots & 0 \\ 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \\ 0 & 0 & \dots & 1 \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_a \end{pmatrix} + \begin{pmatrix} e_{11} \\ e_{12} \\ e_{21} \\ e_{22} \\ \vdots \\ e_{a1} \\ e_{a2} \end{pmatrix} \]

Identifiability: The \(2a \times a\) model matrix \(X_1\) is of full column rank (rank \(a\)). Therefore, the parameters \(\mu_i\) are uniquely identifiable.


2. The Effects Model

An alternative, but equivalent, linear model is the effects model, which decomposes the cell mean (\(\mu_i\)) into a baseline value (\(\mu\)) and a treatment-specific deviation (\(\alpha_i\)). The explicit relationship between the parameters of the two models is: \[ \mu_i = \mu + \alpha_i \]

Equation and Vector Form: Substituting this relationship into the cell-means model yields: \[ y_{ij} = \mu + \alpha_i + e_{ij}, \quad i=1,\dots,a, \quad j=1,\dots,n \] with the same assumptions on the errors. In vector notation, this is: \[ y = \mu j_N + \alpha_1 x_1 + \alpha_2 x_2 + \dots + \alpha_a x_a + e, \quad e \sim N(0, \sigma^2 I) \]

Matrix Formulation (\(n=2\) observations per level): The parameter vector is expanded to include the baseline/grand mean \(\mu\), so \(\boldsymbol{\beta} = (\mu, \alpha_1, \alpha_2, \dots, \alpha_a)^T\). The model \(\mathbf{y} = X_2\boldsymbol{\beta} + \mathbf{e}\) is written as: \[ \begin{pmatrix} y_{11} \\ y_{12} \\ y_{21} \\ y_{22} \\ \vdots \\ y_{a1} \\ y_{a2} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 0 & \dots & 0 \\ 1 & 1 & 0 & \dots & 0 \\ 1 & 0 & 1 & \dots & 0 \\ 1 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & 0 & \dots & 1 \\ 1 & 0 & 0 & \dots & 1 \end{pmatrix} \begin{pmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_a \end{pmatrix} + \begin{pmatrix} e_{11} \\ e_{12} \\ e_{21} \\ e_{22} \\ \vdots \\ e_{a1} \\ e_{a2} \end{pmatrix} \]

Non-identifiability and Constraints: The effects model has the same model matrix as the cell-means model, but with one extra column of ones in the first position (\(j_N\)). Notice that \(\sum_i x_i = j_N\) and \(\text{Col}(X_1) = \text{Col}(X_2)\). Therefore, the columns of the model matrix \(X_2\) are linearly dependent. In this \(2a \times (a+1)\) matrix, the first column is exactly equal to the sum of the remaining \(a\) columns.

Because \(X_2\) is not full rank (it has rank \(a\), but \(a+1\) columns), the parameters are not uniquely identified. \(\mu\) is in no sense the grand mean; it is just an arbitrary baseline value. However, subject to the constraint \(\sum_i \alpha_i = 0\), the parameters gain specific interpretations:

  1. Grand Mean (\(\mu\)): The grand mean response across all treatments.
  2. Treatment Effect (\(\alpha_i\)): The deviation from the grand mean.

Adding the constraint \(\sum_i \alpha_i = 0\) essentially reparameterizes the overparameterized (non-full rank) effects model to a just-parameterized (full rank) model equivalent to the cell means model.

Definition 9.1 (Equivalent Linear Models) In general, two linear models \(y = X_1 \beta_1 + e_1\) and \(y = X_2 \beta_2 + e_2\) with the same assumptions on \(e_1\) and \(e_2\) are equivalent linear models if \(\text{Col}(X_1) = \text{Col}(X_2)\).

Different Strategies for Non-Full Rank Models

When faced with a non-full rank model, we have three ways to proceed:

  1. Reparameterize to a full rank model (e.g., cell-means model).
  2. Add constraints (side-conditions) to the model parameters (e.g., \(\sum \alpha_i = 0\)).
  3. Analyze the model as a non-full rank model, limiting inference to estimable functions.

9.2 Least Square Estimation of \(\beta\) and \(\mu=X\beta\)

Even if \(X\) is not of full rank, the least-squares criterion is still reasonable and leads to the normal equation:

\[ X^T X \hat{\beta} = X^T y \]

Theorem 9.1 (Consistency of Normal Equations) For \(X\) an \(n \times p\) matrix of rank \(k < p \le n\), the normal equations form a consistent system of equations because \(\text{Col}(X^T X) = \text{Col}(X^T)\).

Since the system is consistent, it has a (non-unique) solution given by:

\[ \hat{\beta} = (X^T X)^- X^T y \]

where \((X^T X)^-\) is some (any) generalized inverse of \(X^T X\).

Theorem 9.2 (Uniqueness of Projection) \(\hat{\beta} = (X^T X)^- X^T y\) is a solution to \((X^T X)\beta = X^T y\). Then \(\hat{y} = X\hat{\beta} = X(X^T X)^- X^T y\) is the projection of \(y\) onto \(\text{Col}(X)\).

Since the projection onto a subspace is unique, \(\hat{y}\) is unique regardless of the choice of generalized inverse.

9.2.1 Distribution of \(\hat{\beta}\) and \(s^2\)

In the model \(y = X\beta + e\), where \(E(e)=0\), \(\text{Var}(e)=\sigma^2 I\), and \(X\) is \(n \times p\) of rank \(k \le p \le n\):

  1. Unbiasedness: \(E(s^2) = \sigma^2\).
  2. Uniqueness: \(s^2\) is invariant to the choice of \(\hat{\beta}\) (i.e., to the choice of \((X^T X)^-\)).

\[ s^2 = \frac{SSE}{n-k} = \frac{||y - X\hat{\beta}||^2}{n-k} \]

where \(k = \text{rank}(X)\).

Theorem 9.3 (Distributions in Non-Full Rank Models) In the normal-errors model \(y \sim N(X\beta, \sigma^2 I)\):

  1. For any given choice of \((X^T X)^-\):

    \[ \hat{\beta} \sim N_p [(X^T X)^- X^T X \beta, \sigma^2 (X^T X)^- X^T X \{(X^T X)^-\}^T] \]

  2. \((n-k)s^2 / \sigma^2 \sim \chi^2(n-k)\) and \(SSE/\sigma^2 \sim \chi^2(n-k)\).

  3. For any given choice of \((X^T X)^-\), \(\hat{\beta}\) and \(s^2\) are independent.

9.3 Sum Squares and F-test

Suppose \(y \sim N_n(X\beta, \sigma^2 I)\) where \(X\) is a matrix with rank \(k+1\). Let \(X = [X_1, X_2]\) where \(\text{rank}(X) = k+1\) and assume we are testing a reduced model involving \(X_1\).

Let:

  • \(\hat{y} = P_{\text{Col}(X)} y\) (Full Model Projection)
  • \(\hat{y}_0 = P_{\text{Col}(X_1)} y\) (Reduced Model Projection)

Theorem 9.4 (Partitioning of Sum of Squares)  

  1. Full Model Residual Sum of Squares:

    \[ \frac{1}{\sigma^2} ||y - \hat{y}||^2 = \frac{1}{\sigma^2} y^T (I - P_{\text{Col}(X)}) y \sim \chi^2(n - k - 1) \]

  2. Difference in Sum of Squares (Hypothesis):

    \[ \frac{1}{\sigma^2} ||\hat{y} - \hat{y}_0||^2 = \frac{1}{\sigma^2} y^T (P_{\text{Col}(X)} - P_{\text{Col}(X_1)}) y \sim \chi^2(h, \lambda_1) \]

    where \(h\) is the difference in rank (degrees of freedom). The non-centrality parameter is:

    \[ \lambda_1 = \frac{1}{2\sigma^2} ||(P_{\text{Col}(X)} - P_{\text{Col}(X_1)}) \mu||^2 = \frac{1}{2\sigma^2} ||\mu - \mu_0||^2 \]

  3. Independence: The two quadratic forms above are independent.

Theorem 9.5 (F-Statistic for Nested Models) Under the conditions of the previous theorem:

\[ F = \frac{||\hat{y} - \hat{y}_0||^2 / h}{s^2} = \frac{y^T (P_{\text{Col}(X)} - P_{\text{Col}(X_1)}) y / h}{y^T (I - P_{\text{Col}(X)}) y / (n - k - 1)} \]

\[ \sim \begin{cases} F(h, n - k - 1), & \text{under } H_0 \\ F(h, n - k - 1, \lambda_1), & \text{under } H_1 \end{cases} \]

9.3.1 Examples

9.3.1.1 Numerical Example: One-way ANOVA with Non-Full Rank Matrix

Example 9.1 We generate a balanced dataset with \(a=3\) groups and \(n=4\) observations per group. We then explicitly construct the non-full-rank design matrix \(X = [j_N, x_1, x_2, x_3]\) and compute projections. We add a column for \(\hat{y} - \hat{y}_0\) to explicitly show the vector whose squared length equals the Between Group Sum of Squares.

Code
library(MASS)  # For ginv()
library(knitr) # For kable()

# 1. Data Generation (3 Groups, 4 obs per group)
group_means_true <- c(10, 20, 15)
n_per_group <- 4
groups <- factor(rep(1:3, each = n_per_group))

# Adding some random noise
set.seed(123)
y <- c(rnorm(n_per_group, group_means_true[1], 2),
       rnorm(n_per_group, group_means_true[2], 2),
       rnorm(n_per_group, group_means_true[3], 2))
n <- length(y)

# 2. Construct Non-full-rank Design Matrix X
j_N <- rep(1, n)
X_ind <- model.matrix(~ groups - 1) 
colnames(X_ind) <- c("x_1", "x_2", "x_3")
X <- cbind(Intercept = j_N, X_ind)

# 3. Compute Projections
# A. Projection onto j_N (Grand Mean)
P0 <- j_N %*% t(j_N) / as.numeric(t(j_N) %*% j_N)
y_grand_mean <- P0 %*% y

# B. Projection onto [j_N, X] (Full Column Space)
XtX <- t(X) %*% X
X_inv_gen <- ginv(XtX)
P_X <- X %*% X_inv_gen %*% t(X)
y_proj_X <- P_X %*% y

# C. Difference Vector and Residuals
y_diff <- y_proj_X - y_grand_mean
e <- y - y_proj_X

# --- Compute Arithmetic Group Means for Verification ---
group_means_vec <- ave(y, groups)

# Combine into a data frame
results_df <- data.frame(
  y = y,
  j_N = j_N,
  x1 = X_ind[,1],
  x2 = X_ind[,2],
  x3 = X_ind[,3],
  Proj_Grand = as.vector(y_grand_mean),
  Proj_FullX = as.vector(y_proj_X),
  Group_Means = group_means_vec,
  Diff = as.vector(y_diff),          
  Residuals = as.vector(e)
)

# Calculate Sum of Squares for each column
SS_row <- colSums(results_df^2)
results_table <- rbind(results_df, SS_row)
rownames(results_table)[nrow(results_table)] <- "Sum of Squares"

# Display Detailed Data Table
kable(results_table, digits = 2, 
      col.names = c("$y$", "$j_N$", "$x_1$", "$x_2$", "$x_3$", 
                    "$\\hat{y}_0$", 
                    "$\\hat{y}$", 
                    "$\\bar{y}_{i\\cdot}$", # <--- Verification Header
                    "$\\hat{y} - \\hat{y}_0$", 
                    "$e$"),
      caption = "Data, Projections, and Verification of Group Means")
Data, Projections, and Verification of Group Means
\(y\) \(j_N\) \(x_1\) \(x_2\) \(x_3\) \(\hat{y}_0\) \(\hat{y}\) \(\bar{y}_{i\cdot}\) \(\hat{y} - \hat{y}_0\) \(e\)
1 8.88 1 1 0 0 15.39 10.42 10.42 -4.97 -1.54
2 9.54 1 1 0 0 15.39 10.42 10.42 -4.97 -0.88
3 13.12 1 1 0 0 15.39 10.42 10.42 -4.97 2.70
4 10.14 1 1 0 0 15.39 10.42 10.42 -4.97 -0.28
5 20.26 1 0 1 0 15.39 20.52 20.52 5.13 -0.26
6 23.43 1 0 1 0 15.39 20.52 20.52 5.13 2.91
7 20.92 1 0 1 0 15.39 20.52 20.52 5.13 0.40
8 17.47 1 0 1 0 15.39 20.52 20.52 5.13 -3.05
9 13.63 1 0 0 1 15.39 15.23 15.23 -0.16 -1.60
10 14.11 1 0 0 1 15.39 15.23 15.23 -0.16 -1.12
11 17.45 1 0 0 1 15.39 15.23 15.23 -0.16 2.22
12 15.72 1 0 0 1 15.39 15.23 15.23 -0.16 0.49
Sum of Squares 3083.33 12 4 4 4 2841.62 3045.83 3045.83 204.21 37.49
Code
# 4. Compute F and p-value
SSH <- sum(y_diff^2)
SSE <- sum(e^2)
SST <- SSH + SSE  # Total Sum of Squares

rank_X <- 3 
df_hyp <- rank_X - 1
df_err <- n - rank_X
df_tot <- n - 1

MSH <- SSH / df_hyp
MSE <- SSE / df_err
MST <- SST / df_tot # Mean Square Total

F_stat <- MSH / MSE
p_val <- pf(F_stat, df_hyp, df_err, lower.tail = FALSE)

# 5. Adjusted R-squared
R2_adj <- 1 - (MSE / MST)

# 6. Output Extended ANOVA Table
anova_tab <- data.frame(
  Source = c("Groups (Model)", "Residuals (Error)", "Total"),
  DF = c(df_hyp, df_err, df_tot),
  SS = c(SSH, SSE, SST),
  MS = c(MSH, MSE, MST),
  F = c(F_stat, NA, NA),
  P_value = c(p_val, NA, NA),
  
  # Custom Variance Estimation Column
  Sigma_Hat_Sq = c(MST - MSE,  # Group Row: MST - MSE
                   MSE,        # Error Row: MSE
                   MST),       # Total Row: MST
  
  R_sq_adj = c(R2_adj, NA, NA)
)

kable(anova_tab, digits = 4, 
      col.names = c("Source", "DF", "SS", "MS", "F", "P-value", 
                    "$\\hat{\\sigma}^2$", "$R_a^2$"),
      caption = "ANOVA Table with Variance Estimates and Adjusted R-squared",
      escape = FALSE)
ANOVA Table with Variance Estimates and Adjusted R-squared
Source DF SS MS F P-value \(\hat{\sigma}^2\) \(R_a^2\)
Groups (Model) 2 204.2120 102.1060 24.509 2e-04 17.8073 0.8104
Residuals (Error) 9 37.4945 4.1661 NA NA 4.1661 NA
Total 11 241.7065 21.9733 NA NA 21.9733 NA

Key Observation from the Table

  • Note that the projection onto the group means (\(\hat{y}_{groups}\)) and the projection onto the overparameterized space (\(\hat{y} = P_{[j_N, X]}y\)) are identical (Columns 7 and 8). This confirms numerically that adding the linearly dependent intercept column does not change the column space of the model; \(\text{Col}(X) = \text{Col}(\text{Indicators})\).

  • The Sum of Squares for the new column \(\hat{y} - \hat{y}_0\) (Column 8) is exactly equal to the Between Group Sum of Squares (SSH) in the ANOVA table. This numerically demonstrates that \(SSH = ||\hat{y} - \hat{y}_0||^2\).

9.3.1.2 Numerical Example: Regression with Linear Dependency

Example 9.2 We generate a dataset with \(n=12\) observations and 4 predictors where \(x_4 = 0.1x_1 + 0.5x_2 + x_3\). This makes the design matrix rank-deficient. We use the generalized inverse to compute projections onto the full model space and compare them with projections onto the subspace of just \(x_1, x_2, x_3\).

  1. Fitting Models and F-test

    Code
    library(MASS)  # For ginv()
    library(knitr) # For kable()
    library(gt)    # For HTML tables
    
    # 1. Data Generation
    set.seed(42)
    n <- 12
    x1 <- rnorm(n, mean=5, sd=2)
    x2 <- rnorm(n, mean=10, sd=2)
    x3 <- rnorm(n, mean=0, sd=1)
    x4 <- 0.1*x1 + 0.5*x2 + 0.1*x3 
    beta_true <- c(10, 1, -1, 1) 
    y <- beta_true[1] + beta_true[2]*x1 + beta_true[3]*x2 + beta_true[4]*x3 + rnorm(n, sd=3)
    
    # 2. Projections (Condensed for brevity, same as before)
    j_N <- rep(1, n)
    X_full <- cbind(Intercept = j_N, x1, x2, x3, x4)
    X_sub  <- cbind(Intercept = j_N, x1, x2, x3)
    P0 <- j_N %*% t(j_N) / as.numeric(t(j_N) %*% j_N)
    y_0 <- P0 %*% y
    
    # B. Projection onto Full Model Space (Rank Deficient)
    XtX_full <- t(X_full) %*% X_full
    X_inv_full <- ginv(XtX_full)
    beta_hat_full <- X_inv_full %*% t(X_full) %*% y
    y_hat_full <- X_full %*% beta_hat_full
    
    # C. Projection onto Subspace (Full Rank)
    XtX_sub <- t(X_sub) %*% X_sub
    X_inv_sub <- solve(XtX_sub) 
    beta_hat_sub <- X_inv_sub %*% t(X_sub) %*% y
    y_hat_sub <- X_sub %*% beta_hat_sub
    
    # D. Difference Vector and Residuals
    y_diff <- y_hat_full - y_0
    e <- y - y_hat_full
    
    # 4. Create Detailed Results Table
    results_df <- data.frame(
    y = y,
    x1 = x1,
    x2 = x2,
    x3 = x3,
    x4 = x4,
    y_hat_0 = as.vector(y_0),
    y_hat_sub = as.vector(y_hat_sub),   # New Column
    y_hat_full = as.vector(y_hat_full),
    y_diff = as.vector(y_diff),
    e = as.vector(e)
    )
    
    # Calculate Sum of Squares for each column
    SS_row <- colSums(results_df^2)
    results_table <- rbind(results_df, SS_row)
    rownames(results_table)[nrow(results_table)] <- "Sum of Squares"
    
    # Display Table using kable (for standard rendering)
    kable(results_table, digits = 2,
          col.names = c("$y$", "$x_1$", "$x_2$", "$x_3$", "$x_4$",
                      "$\\hat{y}_0$", 
                      "$\\hat{y}_{sub}$",      # Projection onto x1-x3
                      "$\\hat{y}_{full}$",     # Projection onto x1-x4
                      "$\\hat{y}_{full} - \\hat{y}_0$", 
                      "$e$"),
          caption = "Comparison of Projections: Full (Rank-Deficient) vs Subspace")
    Comparison of Projections: Full (Rank-Deficient) vs Subspace
    \(y\) \(x_1\) \(x_2\) \(x_3\) \(x_4\) \(\hat{y}_0\) \(\hat{y}_{sub}\) \(\hat{y}_{full}\) \(\hat{y}_{full} - \hat{y}_0\) \(e\)
    1 10.06 7.74 7.22 1.90 4.57 6.42 9.65 9.65 3.24 0.41
    2 1.44 3.87 9.44 -0.43 5.07 6.42 1.72 1.72 -4.70 -0.27
    3 -1.51 5.73 9.73 -0.26 5.41 6.42 4.26 4.26 -2.16 -5.76
    4 3.34 6.27 11.27 -1.76 6.09 6.42 3.92 3.92 -2.50 -0.58
    5 7.46 5.81 9.43 0.46 5.34 6.42 4.41 4.41 -2.01 3.05
    6 8.38 4.79 4.69 -0.64 2.76 6.42 9.46 9.46 3.04 -1.08
    7 15.63 8.02 5.12 0.46 3.41 6.42 13.57 13.57 7.16 2.06
    8 0.70 4.81 12.64 0.70 6.87 6.42 -1.48 -1.48 -7.90 2.18
    9 6.58 9.04 9.39 1.04 5.70 6.42 9.40 9.40 2.98 -2.82
    10 9.13 4.87 6.44 -0.61 3.65 6.42 7.32 7.32 0.90 1.81
    11 6.02 7.61 9.66 0.50 5.64 6.42 7.01 7.01 0.60 -0.99
    12 9.76 9.57 12.43 -1.72 7.00 6.42 7.76 7.76 1.34 2.00
    Sum of Squares 745.54 546.12 1037.30 12.92 334.51 493.97 676.12 676.12 182.15 69.42
    Code
    # 5. ANOVA Statistics
    SSH <- sum(y_diff^2)
    SSE <- sum(e^2)
    rank_X <- qr(X_full)$rank 
    df_hyp <- rank_X - 1 
    df_err <- n - rank_X
    df_tol <- df_hyp + df_err
    MSH <- SSH / df_hyp
    MSE <- SSE / df_err
    F_stat <- MSH / MSE
    p_val <- pf(F_stat, df_hyp, df_err, lower.tail = FALSE)
    SST <- SSH + SSE 
    MST <- SST/df_tol
    R2_adj <- 1-MSE/MST
    
    # 4. Create ANOVA Data Frame
    anova_df <- data.frame(
    Source = c("Regression (Model)", "Residuals (Error)", "Total"),
    DF = c(df_hyp, df_err, df_tot),
    SS = c(SSH, SSE, SST),
    MS = c(MSH, MSE, MST),
    F = c(F_stat, NA, NA),
    P_value = c(p_val, NA, NA),
    Sigma_Hat_Sq = c(MST - MSE, MSE, MST),
    R_sq_adj = c(R2_adj, NA, NA)
    )
    
    # 5. Output with LaTeX Headers in gt
    if (knitr::is_html_output()) {
    anova_df |>
       gt() |>
       # Use md() to enable LaTeX parsing in headers
       cols_label(
          Source = "Source",
          DF = "DF",
          SS = "SS",
          MS = "MS",
          F = "F",
          P_value = "P-value",
          Sigma_Hat_Sq = md("$\\hat{\\sigma}^2$"),
          R_sq_adj = md("$R_a^2$")
       ) |>
       fmt_number(
          columns = c(SS, MS, F, P_value, Sigma_Hat_Sq, R_sq_adj), 
          decimals = 4
       ) |>
       sub_missing(columns = everything(), missing_text = "") |>
       tab_style(
          style = cell_text(color = "red", weight = "bold"),
          locations = cells_body(columns = DF, rows = Source == "Regression (Model)")
       ) |>
       tab_header(title = "ANOVA Table")
    } else {
    # Standard LaTeX Output for PDF
    anova_df$DF <- as.character(anova_df$DF)
    anova_df$DF[1] <- paste0("\\textcolor{red}{", anova_df$DF[1], "}")
    
    kable(anova_df, digits = 4, escape = FALSE, 
          col.names = c("Source", "DF", "SS", "MS", "F", "P-value", 
                         "$\\hat{\\sigma}^2$", "$R_a^2$"),
          caption = "ANOVA Table")
    }
    ANOVA Table
    Source DF SS MS F P-value \(\hat{\sigma}^2\) \(R_a^2\)
    Regression (Model) 3 182.1536 60.7179 6.9973 0.0126 14.1929 0.6206
    Residuals (Error) 8 69.4190 8.6774

    8.6774
    Total 11 251.5726 22.8702

    22.8702

    Key Observation

    Notice that the columns \(\hat{y}_{sub}\) (projection onto \(x_1, x_2, x_3\)) and \(\hat{y}_{full}\) (projection onto \(x_1, \dots, x_4\)) are identical. This confirms that adding the linearly dependent predictor \(x_4\) does not change the column space of the design matrix or the fitted values, even though it makes individual coefficients non-unique.

  2. Model Performance and Inference for \(\rho^2\)

    In a rank-deficient model where \(r = \text{rank}(\mathbf{X})\), we define the following metrics based on the projection onto the model space \(\mathcal{C}(\mathbf{X})\).

    Mathematical Formulas

    The Coefficient of Determination (\(R^2\)) and its adjusted version (\(R^2_a\)) are defined as:

    \[R^2 = \frac{SS_{Reg}}{SS_{Total}} = \frac{\| \hat{\mathbf{y}}_{full} - \bar{\mathbf{y}} \|^2}{\| \mathbf{y} - \bar{\mathbf{y}} \|^2}\]

    \[R^2_a = 1 - (1 - R^2) \frac{n - 1}{n - r}\]

    For the Confidence Interval of \(\rho^2\), we utilize the non-central \(F\)-distribution. The non-centrality parameter \(\lambda\) is related to the population squared multiple correlation \(\rho^2\) by:

    \[\lambda = n \frac{\rho^2}{1 - \rho^2} \iff \rho^2 = \frac{\lambda}{\lambda + n}\]

    We find the confidence limits \([\lambda_{L}, \lambda_{U}]\) by inverting the non-central \(F\) cumulative distribution function such that: \[P(F \le F_{obs} \mid \lambda_{U}) = \alpha/2 \quad \text{and} \quad P(F \le F_{obs} \mid \lambda_{L}) = 1 - \alpha/2\]

    R Implementation

    Code
    # 7. R-Squared and Adjusted R-Squared
    SS_Total <- sum((y - mean(y))^2)
    r_sq <- SSH / SS_Total
    
    # Using the rank computed earlier (rank_X = 4)
    adj_r_sq <- 1 - ((1 - r_sq) * (n - 1) / (n - rank_X))
    
    # 8. Confidence Interval for rho^2 (95%)
    ci_rho2_F <- function(F_stat, df1, df2, n, conf_level = 0.95) {
    
       # Helper function to find the non-centrality parameter (lambda)
       get_ncp <- function(f_val, df1, df2, prob) {
          # If observed F is less than the critical value, the lower bound is 0
          if (f_val <= qf(prob, df1, df2)) return(0)
    
          # Objective function: find lambda such that P(F < f_val | lambda) = prob
          obj <- function(lambda) pf(f_val, df1, df2, ncp = lambda) - prob
    
          # Attempt to solve using uniroot
          result <- try(uniroot(obj, interval = c(0, 10000))$root, silent = TRUE)
    
          if (inherits(result, "try-error")) return(NA) else return(result)
       }
    
       # Calculate alpha tail probabilities
       alpha <- 1 - conf_level
       prob_lower <- 1 - (alpha / 2) # Upper quantile for lower limit of lambda
       prob_upper <- alpha / 2       # Lower quantile for upper limit of lambda
    
       # Calculate Non-centrality Parameters (Lambdas)
       # Note: The "Upper" probability yields the "Lower" lambda bound and vice versa
       lambda_low  <- get_ncp(F_stat, df1, df2, prob_lower)
       lambda_high <- get_ncp(F_stat, df1, df2, prob_upper)
    
       # Convert Lambda to Rho-squared
       # Formula: rho^2 = lambda / (lambda + n)
       rho2_low  <- lambda_low / (lambda_low + n)
       rho2_high <- lambda_high / (lambda_high + n)
    
       return(c(lower = rho2_low, upper = rho2_high))
    }
    
    ci_rho2 <- ci_rho2_F(F_stat, df_hyp, df_err, n)
    rho2_low  <- ci_rho2[1]
    rho2_high <- ci_rho2[2]
    
    # 9. Summary Table
    metrics_df <- data.frame(
    Metric = c("R-Squared", "Adjusted R-Squared", "95% CI for $\\rho^2$"),
    Value = c(
       round(r_sq, 4), 
       round(adj_r_sq, 4), 
       paste0("[", round(rho2_low, 4), ", ", round(rho2_high, 4), "]")
    )
    )
    
    kable(metrics_df, 
          caption = "Model Fit Metrics and Population Correlation Interval")
    Model Fit Metrics and Population Correlation Interval
    Metric Value
    R-Squared 0.9809
    Adjusted R-Squared 0.9738
    95% CI for \(\rho^2\) [0.8979, 0.987]

Example 9.3 (Rank-Deficient Model Fitting and ANOVA Analysis) This example demonstrates the end-to-end process of fitting a model with linear dependencies using R’s lm() function and generating a formatted ANOVA table. We explicitly handle the potential for fitting errors and highlight the reduced degrees of freedom resulting from the singularity.

  1. Data Generation and Model Fitting

    We first generate a dataset where \(x_4\) is a perfect linear combination of \(x_1, x_2\), and \(x_3\). We then attempt to fit the full model using lm() within a tryCatch block to ensure robustness.

    Code
    library(MASS)  # For ginv()
    library(knitr) # For kable()
    library(gt)    # For HTML tables
    
    # 1. Generate Data with linear dependency: x4 = 0.1*x1 + 0.5*x2 + x3
    set.seed(42)
    n <- 12
    x1 <- rnorm(n, mean=5, sd=2)
    x2 <- rnorm(n, mean=10, sd=2)
    x3 <- rnorm(n, mean=0, sd=1)
    x4 <- 0.1*x1 + 0.5*x2 + x3 
    y <- 10 + 2*x1 - 1*x2 + 3*x3 + rnorm(n, sd=1)
    df <- data.frame(y=y, x1=x1, x2=x2, x3=x3, x4=x4)
    
    # 2. Fit model using lm() with tryCatch
    lm_fit <- tryCatch({
      model <- lm(y ~ x1 + x2 + x3 + x4, data = df)
      model
    }, error = function(e) {
      message("Error in fitting: ", e$message)
      return(NULL)
    })
    
    # 3. Analyze Coefficients and ANOVA
    # Note: lm() sets the coefficient of x4 to NA due to the singularity
    print("Estimated Coefficients:")
    [1] "Estimated Coefficients:"
    Code
    print(coef(lm_fit))
    (Intercept)          x1          x2          x3          x4 
       9.143931    2.206439   -1.097512    2.494560          NA 
    Code
    anova_results <- anova(lm_fit)
    anova_df <- as.data.frame(anova_results)
    anova_df$Term <- rownames(anova_results)
    anova_df <- anova_df[, c("Term", "Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")]
    
    # 4. Combined Formatting and Output
    if (knitr::is_html_output()) {
      # HTML Output with gt highlighting
      anova_df |>
        gt() |>
        fmt_number(columns = c("Sum Sq", "Mean Sq", "F value", "Pr(>F)"), decimals = 4) |>
        sub_missing(columns = everything(), missing_text = "") |>
        tab_style(
          style = cell_text(color = "red", weight = "bold"),
          locations = cells_body(columns = Df, rows = Term %in% c("x1", "x2", "x3")) 
        ) |>
        tab_header(title = "ANOVA Table")
    
    } else {
      # LaTeX Output with color highlighting
      anova_df$Df <- as.character(anova_df$Df)
      # Identify rows for model components and color them
      model_rows <- which(anova_df$Term %in% c("x1", "x2", "x3"))
      anova_df$Df[model_rows] <- paste0("\\textcolor{red}{", anova_df$Df[model_rows], "}")
    
      kable(anova_df, digits = 4, escape = FALSE, 
            caption = "Sequential ANOVA Table")
    }
    ANOVA Table
    Term Df Sum Sq Mean Sq F value Pr(>F)
    x1 1 184.6483 184.6483 191.5136 0.0000
    x2 1 139.5381 139.5381 144.7261 0.0000
    x3 1 72.8565 72.8565 75.5654 0.0000
    Residuals 8 7.7132 0.9642

  2. Interpretation of Results

    The lm() function automatically identifies that \(x_4\) is aliased with the other predictors and assigns it NA. Consequently, in the ANOVA table, the variation is partitioned among the first three predictors, and \(x_4\) contributes no additional degrees of freedom or sum of squares. This demonstrates that for non-full rank models, inference is limited to the estimable part of the parameter space.

9.4 Inference for Estimable Parameters

9.4.1 Estimability

Definition 9.2 (Estimability) Let \(\lambda = (\lambda_1, ..., \lambda_p)^T\) be a vector of constants. The parameter \(\lambda^T \beta = \sum_j \lambda_j \beta_j\) is said to be estimable if there exists a vector \(a\) in \(R^n\) such that:

\[ E(a^T y) = \lambda^T \beta \]

for all \(\beta \in \mathcal{R}^p\).

This condition is equivalent to \(\lambda^T \beta\) being estimable if and only if there exists \(a\) such that \(X^T a = \lambda\) (i.e., iff \(\lambda\) lies in the row space of \(X\)).

Theorem 9.6 (Equivalent Conditions for Estimability) In the model \(y = X\beta + e\), the linear function \(\lambda^T \beta\) is estimable if and only if any one of the following equivalent conditions holds:

  1. \(\lambda^T\) is a linear combination of the rows of \(X\); that is, there exists a vector \(a\) such that \(a^T X = \lambda^T\). Equivalently, this means \(\lambda \in \text{Col}(X^T) = \text{Row}(X)\).
  2. \(\lambda\) is a linear combination of the columns of \(X^T X\). Because \(\text{Col}(X^T) = \text{Col}(X^T X)\), there exists a vector \(b\) such that \(X^T X b = \lambda\).
  3. \(\lambda\) satisfies the consistency condition using a generalized inverse of \(X^T X\), such that \(X^T X (X^T X)^- \lambda = \lambda\).

Proof. We prove the equivalence of the conditions for estimability by following this logical progression:

  1. Definition of Estimability \(\iff\) Condition (1) By definition, \(\lambda^T \beta\) is estimable if there exists a vector \(a\) such that \(E(a^T y) = \lambda^T \beta\) for all \(\beta\). Since \(E(a^T y) = a^T E(y) = a^T X \beta\), the requirement \(a^T X \beta = \lambda^T \beta\) must hold for all \(\beta \in \mathbb{R}^p\). This is true if and only if \(a^T X = \lambda^T\). Taking the transpose, \(X^T a = \lambda\), which explicitly means \(\lambda \in \text{Col}(X^T) = \text{Row}(X)\).

  2. Condition (1) \(\iff\) Condition (2) From Condition 1, we established that \(\lambda \in \text{Col}(X^T)\). A fundamental result in matrix algebra states that the column space of \(X^T\) is identical to the column space of \(X^T X\), meaning \(\text{Col}(X^T) = \text{Col}(X^T X)\). Therefore, \(\lambda \in \text{Col}(X^T X)\). By the definition of a column space, this implies there must exist a vector \(b\) such that \(X^T X b = \lambda\).

  3. Condition (2) \(\iff\) Condition (3) The existence of a solution \(b\) to the linear system \(X^T X b = \lambda\) means the system is consistent. A standard property of generalized inverses is that a linear system \(Ax = y\) is consistent if and only if \(A A^- y = y\). Applying this to our system, it is consistent if and only if \(X^T X (X^T X)^- \lambda = \lambda\). We can verify this directly; if such a \(b\) exists, then:

    \[ X^T X (X^T X)^- (X^T X b) = X^T X b = \lambda \]

    using the defining property of generalized inverses, \(A A^- A = A\).

9.4.2 Example: Checking Estimability in One-Way ANOVA

Example 9.4 (Checking Estimability in One-Way ANOVA) We use the balanced one-way ANOVA effects model matrix \(X\) (\(a=3\) groups, \(n=4\) observations per group) to verify different linear functions \(\lambda^T \beta\). In this model, the parameter vector is \(\beta = (\mu, \alpha_1, \alpha_2, \alpha_3)^T\).

  1. Defining the Vectors and Explicit Matrix Form

    First, we observe the explicit form of the \(X^T X\) matrix. For a balanced one-way layout with \(n=4\) per group, the matrix is as follows :

    \[ X^T X = \begin{pmatrix} 12 & 4 & 4 & 4 \\ 4 & 4 & 0 & 0 \\ 4 & 0 & 4 & 0 \\ 4 & 0 & 0 & 4 \end{pmatrix} \]

    We examine the following functions of the parameters:

    • \(\lambda_1 = (1, 1, 0, 0)^T\): The cell mean for group 1 (\(\mu + \alpha_1\)).
    • \(\lambda_2 = (0, 1, 0, 0)^T\): The individual treatment effect \(\alpha_1\).
    • \(\lambda_3 = (0, 1, -1, 0)^T\): The contrast between group 1 and group 2 (\(\alpha_1 - \alpha_2\)).
    • \(\lambda_4 = (1, 0, 0, 1)^T\): The cell mean for group 3 (\(\mu + \alpha_3\)).
    • \(\lambda_5 = (12, 4, 4, 4)^T\): This corresponds to \(12\mu + 4\sum \alpha_i\), which is the first column of \(X^TX\).
  2. Numerical Consistency Check (Condition 3)

    A linear function \(\lambda^T\beta\) is estimable if and only if \(X^T X (X^T X)^- \lambda = \lambda\). We perform this check for all vectors using the Moore-Penrose generalized inverse.

    Code
    library(MASS)
    library(knitr)
    
    # Setup X'X for balanced one-way layout (n=4 per group) 
    XtX <- matrix(c(
      12,  4,  4,  4,
       4,  4,  0,  0,
       4,  0,  4,  0,
       4,  0,  0,  4
    ), nrow = 4, byrow = TRUE)
    
    XtX_inv <- ginv(XtX)
    
    # Define lambda vectors
    L1 <- c(1, 1, 0, 0)      # mu + alpha1 
    L2 <- c(0, 1, 0, 0)      # alpha1      
    L3 <- c(0, 1, -1, 0)     # alpha1-alpha2 
    L4 <- c(1, 0, 0, 1)      # mu + alpha3 
    L5 <- c(12, 4, 4, 4)     # 12mu + 4(alpha1 + alpha2 + alpha3)
    
    # Compute (X'X)(X'X)^- * lambda for each
    check_L1 <- XtX %*% XtX_inv %*% L1
    check_L2 <- XtX %*% XtX_inv %*% L2
    check_L3 <- XtX %*% XtX_inv %*% L3
    check_L4 <- XtX %*% XtX_inv %*% L4
    check_L5 <- XtX %*% XtX_inv %*% L5
    
    # Combine results for display
    final_comp <- data.frame(
      Param = c("mu", "alpha1", "alpha2", "alpha3"),
      L1_Target = L1, L1_Res = as.vector(check_L1),
      L2_Target = L2, L2_Res = as.vector(check_L2),
      L3_Target = L3, L3_Res = as.vector(check_L3),
      L4_Target = L4, L4_Res = as.vector(check_L4),
      L5_Target = L5, L5_Res = as.vector(check_L5)
    )
    
    kable(final_comp, digits = 3,
          col.names = c("Param", "$\\lambda_1$", "$L_1$", 
                        "$\\lambda_2$", "$L_2$", 
                        "$\\lambda_3$", "$L_3$", 
                        "$\\lambda_4$", "$L_4$", 
                        "$\\lambda_5$", "$L_5$"),
          caption = "Estimability Check for Multiple Parameter Functions")
    Estimability Check for Multiple Parameter Functions
    Param \(\lambda_1\) \(L_1\) \(\lambda_2\) \(L_2\) \(\lambda_3\) \(L_3\) \(\lambda_4\) \(L_4\) \(\lambda_5\) \(L_5\)
    mu 1 1 0 0.25 0 0 1 1 12 12
    alpha1 1 1 1 0.75 1 1 0 0 4 4
    alpha2 0 0 0 -0.25 -1 -1 0 0 4 4
    alpha3 0 0 0 -0.25 0 0 1 1 4 4
  3. Summary of Results

    • Estimable Functions: \(\lambda_1, \lambda_3, \lambda_4\), and \(\lambda_5\) all satisfy the condition because their result columns (\(L_i\)) match their target columns (\(\lambda_i\)). These represent cell means, contrasts, and linear combinations of the rows of \(X^TX\).
    • Non-Estimable Function: \(\lambda_2\) (individual \(\alpha_1\)) fails the condition as the target and result do not match.
    • The First Column (\(\lambda_5\)): This vector is estimable because it is explicitly the first row/column of \(X^TX\). Since any linear combination of rows of \(X^TX\) is estimable, \(\lambda_5^T \beta = 12\mu + 4\alpha_1 + 4\alpha_2 + 4\alpha_3\) is estimable. Note that this is equal to \(\sum_{i,j} E(y_{ij})\), the sum of all expected cell means.
  4. Theoretical Conclusion

    In non-full-rank models, we cannot estimate individual parameters that are not uniquely identified by the data. However, linear functions that lie in the row space of \(X\) (or the column space of \(X^TX\))—such as cell means or treatment contrasts—are invariant to the choice of solution \(\hat{\beta}\) and remain uniquely estimable.

Theorem 9.7 (Number of Estimable Functions) In the non-full-rank model \(y = X\beta + e\), the number of linearly independent estimable functions of \(\beta\) is the rank of \(X\).

9.4.3 Properties of Estimators for Estimabile Parameters

Let \(\lambda^T \beta\) be an estimable function. Let \(\hat{\beta}\) be any solution to the normal equations. Then the estimator \(\lambda^T \hat{\beta}\) has the following properties:

  1. Unbiasedness: \(E(\lambda^T \hat{\beta}) = \lambda^T \beta\).
  2. Uniqueness: \(\lambda^T \hat{\beta}\) is invariant to the choice of \(\hat{\beta}\) (to the choice of generalized inverse).

Theorem 9.8 (Variance and Covariance)  

  1. Variance: The variance of \(\lambda^T \hat{\beta}\) is unique and is given by:

    \[ \text{Var}(\lambda^T \hat{\beta}) = \sigma^2 \lambda^T (X^T X)^- \lambda \]

  2. Covariance: For two estimable functions \(\lambda_1^T \beta\) and \(\lambda_2^T \beta\):

    \[ \text{Cov}(\lambda_1^T \hat{\beta}, \lambda_2^T \hat{\beta}) = \sigma^2 \lambda_1^T (X^T X)^- \lambda_2 \]

9.4.4 Properties of Estimators for Estimabile Parameters

Let \(\lambda^T \beta\) be an estimable function. Let \(\hat{\beta}\) be any solution to the normal equations. Then the estimator \(\lambda^T \hat{\beta}\) has the following properties:

  1. Unbiasedness: \(E(\lambda^T \hat{\beta}) = \lambda^T \beta\).
  2. Uniqueness: \(\lambda^T \hat{\beta}\) is invariant to the choice of \(\hat{\beta}\) (to the choice of generalized inverse).

Theorem 9.9 (Variance and Covariance)  

  1. Variance: The variance of \(\lambda^T \hat{\beta}\) is unique and is given by:

    \[ \text{Var}(\lambda^T \hat{\beta}) = \sigma^2 \lambda^T (X^T X)^- \lambda \]

  2. Covariance: For two estimable functions \(\lambda_1^T \beta\) and \(\lambda_2^T \beta\):

    \[ \text{Cov}(\lambda_1^T \hat{\beta}, \lambda_2^T \hat{\beta}) = \sigma^2 \lambda_1^T (X^T X)^- \lambda_2 \]

Theorem 9.10 (Gauss-Markov in Non-Full Rank Case) If \(\lambda^T \beta\) is estimable in the spherical errors non-full rank linear model, then \(\lambda^T \hat{\beta}\) is its Best Linear Unbiased Estimator (BLUE).

9.4.5 Testable Hypotheses in Non-Full Rank Models

In non-full rank models, we cannot test hypotheses about individual parameters that are not uniquely identified by the data. Instead, we must restrict our inference to testable hypotheses, which are formulated using estimable functions.

Definition 9.3 (Formal Description of a Testable Hypothesis) A linear hypothesis \(H_0: C\beta = 0\) is said to be testable if and only if every row of the \(m \times p\) matrix \(C\) is an estimable function of \(\beta\).

This implies that there exists a matrix \(A\) such that \(C = AX\). Physically, this means \(H_0\) can be expressed entirely in terms of the expected values of the observations.

9.4.6 The General Linear Hypothesis F-Test

If \(y \sim N_n(X\beta, \sigma^2 I)\), where \(X\) is \(n \times p\) with \(\text{rank}(X) = k < p\), and \(C\beta\) is a set of \(m\) linearly independent estimable functions (where \(m \le k\)), the following results hold:

  1. Distribution of the Estimator

    The least-squares estimator \(C\hat{\beta}\) follows a multivariate normal distribution:

    \[C\hat{\beta} \sim N_m\left[C\beta, \sigma^2 C(X^T X)^- C^T\right]\]

  2. The F-Statistic

    The test statistic for \(H_0: C\beta = 0\) is defined as:

    \[F = \frac{SSH / m}{SSE / (n - k)}\]

    where:

    • \(SSH = (C\hat{\beta})^T [C(X^T X)^- C^T]^{-1} C\hat{\beta}\) is the Hypothesis Sum of Squares.
    • \(SSE = y^T [I - X(X^T X)^- X^T] y\) is the Error Sum of Squares.
    • \(k\) is the rank of the matrix \(X\), not the number of columns \(p\).
  3. Sampling Distribution

    The statistic \(F\) follows an F-distribution:

    \[F \sim F(m, n - k)\]

    Under the alternative hypothesis \(H_1: C\beta eq 0\), the statistic follows a non-central F-distribution \(F(m, n - k, \lambda)\), with non-centrality parameter:

    \[\lambda = \frac{(C\beta)^T [C(X^T X)^- C^T]^{-1} C\beta}{2\sigma^2}\]

Summary of Degrees of Freedom

Component Degrees of Freedom
Numerator (Model/Hypothesis) \(m\) (Number of linearly independent estimable constraints)
Denominator (Error) \(n - k\) (Total observations minus rank of \(X\))