4 Multivariate Normal Distribution

4.1 Motivation

Consider the linear model: \[ y = X\beta + \epsilon, \quad \epsilon_i \sim N(0, \sigma^2) \]

We are often interested in the distributional properties of the response vector \(y\) and the residuals. Specifically, if \(y = (y_1, \dots, y_n)'\), we need to understand its multivariate distribution. \[ \hat{y} = Py, \quad e = y - \hat{y} = (I_n - P)y \]

4.2 Random Vectors and Matrices

Definition 4.1 (Random Vector and Matrix) A Random Vector is a vector whose elements are random variables. E.g., \[ x_{k \times 1} = (x_1, x_2, \dots, x_k)^T \] where \(x_1, \dots, x_k\) are each random variables.

A Random Matrix is a matrix whose elements are random variables. E.g., \(X_{n \times k} = (x_{ij})\), where \(x_{11}, \dots, x_{nk}\) are each random variables.

Definition 4.2 (Expected Value) The expected value (population mean) of a random matrix (or vector) is the matrix (or vector) of expected values of its elements.

For \(X_{n \times k}\): \[ E(X) = \begin{pmatrix} E(x_{11}) & \dots & E(x_{1k}) \\ \vdots & \ddots & \vdots \\ E(x_{n1}) & \dots & E(x_{nk}) \end{pmatrix} \]

\[ E\left(\begin{pmatrix} x_1 \\ \vdots \\ x_k \end{pmatrix}\right) = \begin{pmatrix} E(x_1) \\ \vdots \\ E(x_k) \end{pmatrix} \]

Definition 4.3 (Variance-Covariance Matrix) For a random vector \(x_{k \times 1} = (x_1, \dots, x_k)^T\), the matrix is:

\[ \text{Var}(x) = \Sigma_x = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \dots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \dots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \dots & \sigma_{kk} \end{pmatrix} \]

Where:

\(\sigma_{ij} = \text{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]\)
\(\sigma_{ii} = \text{Var}(x_i) = E[(x_i - \mu_i)^2]\)

In matrix notation: \[ \text{Var}(x) = E[(x - \mu_x)(x - \mu_x)^T] \] Note: \(\text{Var}(x)\) is symmetric.

4.2.1 Derivation of Covariance Matrix Structure

Expanding the vector multiplication for variance: \[ (x - \mu_x)(x - \mu_x)' \quad \text{where } \mu_x = (\mu_1, \dots, \mu_n)' \] \[ = \begin{pmatrix} x_1 - \mu_1 \\ \vdots \\ x_n - \mu_n \end{pmatrix} (x_1 - \mu_1, \dots, x_n - \mu_n) \] This results in the matrix \(A = (a_{ij})\) where \(a_{ij} = (x_i - \mu_i)(x_j - \mu_j)\). Taking expectations yields the covariance matrix elements \(\sigma_{ij}\).

Definition 4.4 (Covariance Matrix (Two Vectors)) For random vectors \(x_{k \times 1}\) and \(y_{n \times 1}\), the covariance matrix is: \[ \text{Cov}(x, y) = E[(x - \mu_x)(y - \mu_y)^T] = \begin{pmatrix} \text{Cov}(x_1, y_1) & \dots & \text{Cov}(x_1, y_n) \\ \vdots & \ddots & \vdots \\ \text{Cov}(x_k, y_1) & \dots & \text{Cov}(x_k, y_n) \end{pmatrix} \] Note that \(\text{Cov}(x, x) = \text{Var}(x)\).

Definition 4.5 (Correlation Matrix) The correlation matrix of a random vector \(x\) is: \[ \text{corr}(x) = \begin{pmatrix} 1 & \rho_{12} & \dots & \rho_{1k} \\ \vdots & \ddots & \vdots \\ \rho_{k1} & \rho_{k2} & \dots & 1 \end{pmatrix} \] where \(\rho_{ij} = \text{corr}(x_i, x_j)\).

Relationships: Let \(V_x = \text{diag}(\text{Var}(x_1), \dots, \text{Var}(x_k))\). \[ \Sigma_x = V_x^{1/2} \rho_x V_x^{1/2} \quad \text{and} \quad \rho_x = (V_x^{1/2})^{-1} \Sigma_x (V_x^{1/2})^{-1} \] Similarly for two vectors: \[ \Sigma_{xy} = V_x^{1/2} \rho_{xy} V_y^{1/2} \]

4.3 Properties of Mean and Variance

We can derive several key algebraic properties for operations on random vectors.

\(E(X + Y) = E(X) + E(Y)\)
\(E(AXB) = A E(X) B\) (In particular, \(E(AX) = A\mu_x\))
\(\text{Cov}(x, y) = \text{Cov}(y, x)^T\)
\(\text{Cov}(x + c, y + d) = \text{Cov}(x, y)\)
\(\text{Cov}(Ax, By) = A \text{Cov}(x, y) B^T\)
- Special case for scalars: \(\text{Cov}(ax, by) = ab \cdot \text{Cov}(x, y)\)
\(\text{Cov}(x_1 + x_2, y_1) = \text{Cov}(x_1, y_1) + \text{Cov}(x_2, y_1)\)
\(\text{Var}(x + c) = \text{Var}(x)\)
\(\text{Var}(Ax) = A \text{Var}(x) A^T\)
\(\text{Var}(x_1 + x_2) = \text{Var}(x_1) + \text{Cov}(x_1, x_2) + \text{Cov}(x_2, x_1) + \text{Var}(x_2)\)
\(\text{Var}(\sum x_i) = \sum \text{Var}(x_i)\) if independent.

Proof. Property 5 (Covariance of Linear Transformation): \[ \begin{aligned} \text{Cov}(Ax, By) &= E[(Ax - A\mu_x)(By - B\mu_y)^T] \\ &= A E[(x - \mu_x)(y - \mu_y)^T] B^T \\ &= A \text{Cov}(x, y) B^T \end{aligned} \] Property 2 (Expectation of Linear Transformation):

To prove \(E(AXB) = A E(X) B\): First consider \(E(Ax_j)\) where \(x_j\) is a column of \(X\). \[ E(Ax_j) = E\begin{pmatrix} a_1' x_j \\ \vdots \\ a_n' x_j \end{pmatrix} = \begin{pmatrix} E(a_1' x_j) \\ \vdots \\ E(a_n' x_j) \end{pmatrix} \] Since \(a_i\) are constants: \[ E(a_i' x_j) = E\left(\sum_{k=1}^p a_{ik} x_{kj}\right) = \sum_{k=1}^p a_{ik} E(x_{kj}) = a_i' E(x_j) \] Thus \(E(Ax_j) = A E(x_j)\). Applying this to all columns of \(X\): \[ E(AX) = [E(Ax_1), \dots, E(Ax_m)] = [AE(x_1), \dots, AE(x_m)] = A E(X) \] Similarly, \(E(XB) = E(X)B\).

Proof of Property 9 (Variance of Sum):

\[ \text{Var}(x_1 + x_2) = E[(x_1 + x_2 - \mu_1 - \mu_2)(x_1 + x_2 - \mu_1 - \mu_2)^T] \] Let centered variables be denoted by differences. \[ = E[((x_1 - \mu_1) + (x_2 - \mu_2))((x_1 - \mu_1) + (x_2 - \mu_2))^T] \] Expanding terms: \[ = E[(x_1 - \mu_1)(x_1 - \mu_1)^T + (x_1 - \mu_1)(x_2 - \mu_2)^T + (x_2 - \mu_2)(x_1 - \mu_1)^T + (x_2 - \mu_2)(x_2 - \mu_2)^T] \] \[ = \text{Var}(x_1) + \text{Cov}(x_1, x_2) + \text{Cov}(x_2, x_1) + \text{Var}(x_2) \]

4.4 The Multivariate Normal Distribution

4.4.1 Definition and Density

Definition 4.6 (Independent Standard Normal) Let \(z = (z_1, \dots, z_n)'\) where \(z_i \sim N(0, 1)\) are independent. We say \(z \sim N_n(0, I_n)\). The joint PDF is the product of marginals: \[ f(z) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} e^{-\frac{z_i^2}{2}} = \frac{1}{(2\pi)^{n/2}} e^{-\frac{1}{2} z^T z} \] Properties: \(E(z) = 0\) and \(\text{Var}(z) = I_n\) (Covariance is 0 for \(i \ne j\), Variance is 1).

Definition 4.7 (Multivariate Normal Distribution) A random vector \(x\) (\(n \times 1\)) has a multivariate normal distribution if it has the same distribution as: \[ x = A_{n \times p} z_{p \times 1} + \mu_{n \times 1} \] where \(z \sim N_p(0, I_p)\), \(A\) is a matrix of constants, and \(\mu\) is a vector of constants. The moments are:

\(E(x) = \mu\)
\(\text{Var}(x) = AA^T = \Sigma\)

4.4.2 Geometric Interpretation

Using Spectral Decomposition, \(\Sigma = Q \Lambda Q'\). We can view the transformation \(x = Az + \mu\) as:

Scaling by eigenvalues (\(\Lambda^{1/2}\)).
Rotation by eigenvectors (\(Q\)).
Shift by mean (\(\mu\)).

An Shinely App for Visualizing Bivariate Normal

Use the controls to construct the covariance matrix \(\boldsymbol{\Sigma}\) geometrically.

We define the transformation matrix \(\mathbf{A} = \mathbf{Q}\mathbf{\Lambda}^{1/2}\), where \(\mathbf{Q}\) is a rotation matrix and \(\mathbf{\Lambda}^{1/2}\) is a diagonal scaling matrix. The resulting covariance is \(\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}'\).

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| viewerHeight: 700
#| echo: false


library(shiny)
library(bslib)
library(shinyWidgets)
library(munsell) 
library(scales)
library(tibble)
library(rlang)
library(ggplot2)
library(mvtnorm)

# --- 1. PRE-GENERATE FIXED Z POINTS ---
set.seed(123)
z_fixed <- matrix(rnorm(50 * 2), ncol = 2)

ui <- page_fillable(
  theme = bs_theme(version = 5),
  withMathJax(), 
  
  # --- ROW 1: CONTROLS (Compact Strip) ---
  card(
    class = "p-2", 
    layout_columns(
      col_widths = c(3, 2, 2, 2, 2),
      
      div(class = "text-center", tags$label(HTML("$$\\theta$$")), 
          noUiSliderInput("theta", label = NULL, min = 0, max = 360, value = 0, step = 5, 
                          orientation = "horizontal", width = "100%", height = "10px", color = "#0d6efd")),
      
      div(class = "text-center", tags$label(HTML("$$\\sqrt{\\lambda_1}$$")), 
          noUiSliderInput("L1", label = NULL, min = 0.5, max = 3, value = 2, step = 0.1, 
                          orientation = "horizontal", width = "100%", height = "10px", color = "#ffc107")),
      
      div(class = "text-center", tags$label(HTML("$$\\sqrt{\\lambda_2}$$")), 
          noUiSliderInput("L2", label = NULL, min = 0.5, max = 3, value = 1, step = 0.1, 
                          orientation = "horizontal", width = "100%", height = "10px", color = "#adb5bd")),
      
      div(class = "text-center", tags$label(HTML("$$\\mu_1$$")), 
          noUiSliderInput("mu1", label = NULL, min = -3, max = 3, value = 0, step = 0.5, 
                          orientation = "horizontal", width = "100%", height = "10px", color = "#6c757d")),
      
      div(class = "text-center", tags$label(HTML("$$\\mu_2$$")), 
          noUiSliderInput("mu2", label = NULL, min = -3, max = 3, value = 0, step = 0.5, 
                          orientation = "horizontal", width = "100%", height = "10px", color = "#6c757d"))
    )
  ),

  # --- ROW 2: SIDE-BY-SIDE (Plot & Math) ---
  layout_columns(
    col_widths = c(8, 4), # 2/3 for Plot, 1/3 for Matrix
    
    # Left: Visualization
    card(
      full_screen = TRUE,
      plotOutput("contourPlot", height = "500px")
    ),
    
    # Right: The Math (Larger Font)
    card(
      class = "p-3 d-flex justify-content-center", # Center content vertically
      h5("Algebraic Representation", class = "mb-3 text-center"),
      
      # Use CSS to make the font larger and monospaced
      div(
        style = "font-family: 'Courier New', monospace; font-size: 1.1rem; line-height: 1.4;",
        verbatimTextOutput("matrixSide", placeholder = TRUE)
      )
    )
  )
)

server <- function(input, output) {

  data <- reactive({
    theta_rad <- input$theta * pi / 180
    Q <- matrix(c(cos(theta_rad), sin(theta_rad), -sin(theta_rad), cos(theta_rad)), 2, 2)
    Lam_sqrt <- diag(c(input$L1, input$L2))
    
    A <- Q %*% Lam_sqrt
    Sigma <- A %*% t(A)
    mu_vec <- c(input$mu1, input$mu2)
    
    x_points <- z_fixed %*% t(A)
    x_points[,1] <- x_points[,1] + mu_vec[1]
    x_points[,2] <- x_points[,2] + mu_vec[2]
    
    list(Q=Q, L=c(input$L1, input$L2), mu=mu_vec, Sigma=Sigma, A=A, points=as.data.frame(x_points))
  })

  output$matrixSide <- renderText({
    M <- data()
    A <- round(M$A, 2)
    S <- round(M$Sigma, 2)
    rho <- cov2cor(M$Sigma)[1,2]
    
    # Formatted to fill vertical space comfortably
    paste0(
      "Linear Transform:\n",
      "x = A z + μ\n\n",
      
      "Matrix A:\n",
      sprintf("[%4.1f   %4.1f]\n", A[1,1], A[1,2]),
      sprintf("[%4.1f   %4.1f]\n", A[2,1], A[2,2]),
      "\n",
      
      "Covariance Σ:\n",
      "(Σ = AA')\n",
      sprintf("[%4.1f   %4.1f]\n", S[1,1], S[1,2]),
      sprintf("[%4.1f   %4.1f]\n", S[2,1], S[2,2]),
      "\n",
      
      "Correlation:\n",
      sprintf("ρ = %.3f", rho)
    )
  })

  output$contourPlot <- renderPlot({
    req(data())
    M <- data()
    
    grid_r <- seq(-6, 6, length.out = 60)
    df_grid <- expand.grid(x = grid_r, y = grid_r)
    df_grid$z <- dmvnorm(as.matrix(df_grid), mean = M$mu, sigma = M$Sigma)
    
    v1 <- M$Q[,1] * M$L[1]; v2 <- M$Q[,2] * M$L[2]
    axes <- tibble(x = M$mu[1], y = M$mu[2],
                   xend1 = M$mu[1] + v1[1], yend1 = M$mu[2] + v1[2],
                   xend2 = M$mu[1] + v2[1], yend2 = M$mu[2] + v2[2])
    
    ggplot() +
      geom_contour_filled(data = df_grid, aes(x, y, z = z), bins = 9, show.legend = FALSE) +
      geom_point(data = M$points, aes(V1, V2), color = "black", size = 2, alpha = 0.7) +
      geom_segment(data = axes, aes(x=x, y=y, xend=xend1, yend=yend1), 
                   color = "#ffc107", linewidth = 1.5, arrow = arrow(length = unit(0.3,"cm"))) +
      geom_segment(data = axes, aes(x=x, y=y, xend=xend2, yend=yend2), 
                   color = "white", linewidth = 1.5, arrow = arrow(length = unit(0.3,"cm"))) +
      coord_fixed(xlim = c(-6, 6), ylim = c(-6, 6)) +
      theme_minimal() +
      labs(x = "X", y = "Y")
  })
}

shinyApp(ui, server)

4.4.3 Probability Density Function

If \(\Sigma\) is positive definite, the PDF exists. We use the change of variable formula for \(x = Az + \mu\): \[ f_x(x) = f_z(g^{-1}(x)) \cdot |J| \] where \(z = A^{-1}(x - \mu)\) and \(J = \det(A^{-1}) = |A|^{-1}\).

\[ f_x(x) = (2\pi)^{-p/2} |A|^{-1} \exp \left\{ -\frac{1}{2} (A^{-1}(x-\mu))^T (A^{-1}(x-\mu)) \right\} \]

Using \(|\Sigma| = |AA^T| = |A|^2\) and \(\Sigma^{-1} = (AA^T)^{-1}\), we get: \[ f_x(x) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp \left\{ -\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right\} \]

4.4.4 Moment Generating Function

Definition 4.8 (Moment Generating Function (MGF)) The MGF of a random vector \(x\) is \(M_x(t) = E(e^{t^T x})\). For \(x = Az + \mu\): \[ M_x(t) = E[e^{t^T(Az + \mu)}] = e^{t^T\mu} E[e^{(A^T t)^T z}] = e^{t^T\mu} M_z(A^T t) \] Since \(M_z(u) = e^{u^T u / 2}\): \[ M_x(t) = e^{t^T\mu} \exp\left( \frac{1}{2} t^T (AA^T) t \right) = \exp \left( t^T\mu + \frac{1}{2} t^T \Sigma t \right) \]

Key Properties:

Uniqueness: Two random vectors with the same MGF have the same distribution.
Independence: \(y_1\) and \(y_2\) are independent iff \(M_y(t) = M_{y_1}(t_1) M_{y_2}(t_2)\).

4.5 Construction and Linear Transformations

Theorem 4.1 (Constructing MVN Random Vector) Let \(\mu \in \mathbb{R}^n\) and \(\Sigma\) be an \(n \times n\) symmetric non-negative definitive (n.n.d) matrix. Then there exists a multivariate normal distribution with mean \(\mu\) and covariance \(\Sigma\).

Proof. Since \(\Sigma\) is n.n.d., there exists \(B\) such that \(\Sigma = BB^T\) (e.g., via Cholesky or Spetral Decomposition). Let \(z \sim N_n(0, I)\) and define \(x = Bz + \mu\).

Theorem 4.2 (Linear Transformation Theorem) Let \(x \sim N_n(\mu, \Sigma)\). Let \(y = Cx + d\) where \(C\) is \(r \times n\) and \(d\) is \(r \times 1\). Then: \[ y \sim N_r(C\mu + d, C \Sigma C^T) \]

Proof. \(x = Az + \mu\) where \(AA^T = \Sigma\). \[ y = C(Az + \mu) + d = (CA)z + (C\mu + d) \] This fits the definition of MVN with mean \(C\mu + d\) and variance \(C \Sigma C^T\).

4.5.1 Important Corollaries of Theorem 4.2

Corollary 4.1 (Marginals) Any subvector of a multivariate normal vector is also multivariate normal.

Proof. If we partition \(x = (x_1', x_2')'\), we can use \(C = (I_r, 0)\) to show \(x_1 \sim N(\mu_1, \Sigma_{11})\).

Corollary 4.2 (Univariate Combinations) Any linear combination \(a^T x\) is univariate normal: \[ a^T x \sim N(a^T \mu, a^T \Sigma a) \]

Corollary 4.3 (Orthogonal Transformations) If \(x \sim N(0, I_n)\) and \(Q\) is orthogonal (\(Q'Q = I\)), then \(y = Q'x \sim N(0, I_n)\).

Corollary 4.4 (Standardization) If \(y \sim N_n(\mu, \Sigma)\) and \(\Sigma\) is positive definite: \[ \Sigma^{-1/2}(y - \mu) \sim N_n(0, I_n) \]

Proof. Let \(z = \Sigma^{-1/2}(y - \mu)\). Then \(\text{Var}(z) = \Sigma^{-1/2} \Sigma \Sigma^{-1/2} = I_n\).

4.6 Independence

Theorem 4.3 (Independence in MVN) Let \(y \sim N(\mu, \Sigma)\) be partitioned into \(y_1\) and \(y_2\). \[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \] Then \(y_1\) and \(y_2\) are independent if and only if \(\Sigma_{12} = 0\) (zero covariance).

Proof.

Independence \(\implies\) Covariance is 0: This holds generally for any distribution. \[ \text{Cov}(y_1, y_2) = E[(y_1 - \mu_1)(y_2 - \mu_2)'] = 0 \]
Covariance is 0 \(\implies\) Independence: This is specific to MVN. We use MGFs. If \(\Sigma_{12} = 0\), the quadratic form in the MGF splits: \[ t^T \Sigma t = t_1^T \Sigma_{11} t_1 + t_2^T \Sigma_{22} t_2 \] The MGF becomes: \[ M_y(t) = \exp(t_1^T \mu_1 + \frac{1}{2} t_1^T \Sigma_{11} t_1) \times \exp(t_2^T \mu_2 + \frac{1}{2} t_2^T \Sigma_{22} t_2) \] \[ M_y(t) = M_{y_1}(t_1) M_{y_2}(t_2) \] Thus, they are independent.

4.7 Signal-Noise Decomposition for Multivariate Normal Distribution

We can formalize the relationship between two random vectors \(y\) and \(x\) through a decomposition theorem that separates the systematic signal from the stochastic noise.

Theorem 4.4 (Regression Decomposition Theorem) Let the random vector \(V\) of dimension \(p \times 1\) be partitioned into two subvectors \(y\) (\(p_1 \times 1\)) and \(x\) (\(p_2 \times 1\)). Assume \(V\) follows a multivariate normal distribution:

\[ \begin{pmatrix} y \\ x \end{pmatrix} \sim N_p\left( \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix}, \begin{pmatrix} \Sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{pmatrix} \right) \]

The response vector \(y\) can be uniquely decomposed into a systematic component and a stochastic error: \[ y = m(x) + e \] where we define the Regression Coefficient Matrix \(B\) and the components as:

\[ B = \Sigma_{yx}\Sigma_{xx}^{-1} \]

\[ m(x) = \mu_y + B(x - \mu_x) \]

\[ e = y - m(x) \]

Properties:

Independence: The noise vector \(e\) is statistically independent of the predictor \(x\) (and consequently independent of \(m(x)\)).
Marginal Distributions:
- \(m(x) \sim N_{p_1}(\mu_y, \; B \Sigma_{xx} B^T)\)
- \(e \sim N_{p_1}(0, \; \Sigma_{yy} - B \Sigma_{xx} B^T)\)
Conditional Distribution: Since \(y = m(x) + e\), and \(e\) is independent of \(x\), the conditional distribution is: \[ y | x \sim N_{p_1}(m(x), \Sigma_{y|x}) \] where: \[ m(x) = \mu_y + B(x - \mu_x) = \mu_y + \Sigma_{yx}\Sigma_{xx}^{-1}(x - \mu_x) \] \[ \Sigma_{y|x} = \Sigma_{yy} - B \Sigma_{xx} B^T = \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy} \]

Proof. We define a transformation from the input vector \(V = \begin{pmatrix} y \\ x \end{pmatrix}\) to the target vector \(W = \begin{pmatrix} m(x) \\ e \end{pmatrix}\).

Using the linear transformation \(W = CV + d\):

\[ \underbrace{\begin{pmatrix} m(x) \\ e \end{pmatrix}}_{W} = \underbrace{\begin{pmatrix} 0 & B \\ I & -B \end{pmatrix}}_{C} \underbrace{\begin{pmatrix} y \\ x \end{pmatrix}}_{V} + \underbrace{\begin{pmatrix} \mu_y - B \mu_x \\ -(\mu_y - B \mu_x) \end{pmatrix}}_{d} \]

Mean Vector

\[ E[W] = C E[V] + d = \begin{pmatrix} 0 & B \\ I & -B \end{pmatrix} \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix} + \begin{pmatrix} \mu_y - B \mu_x \\ -\mu_y + B \mu_x \end{pmatrix} = \begin{pmatrix} B \mu_x \\ \mu_y - B \mu_x \end{pmatrix} + \begin{pmatrix} \mu_y - B \mu_x \\ -\mu_y + B \mu_x \end{pmatrix} = \begin{pmatrix} \mu_y \\ 0 \end{pmatrix} \]

Covariance Matrix

We compute \(\text{Var}(W) = C \Sigma C^T\) directly:

\[ \begin{aligned} C \Sigma C^T &= \begin{pmatrix} 0 & B \\ I & -B \end{pmatrix} \begin{pmatrix} \Sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{pmatrix} \begin{pmatrix} 0 & I \\ B^T & -B^T \end{pmatrix} \\ &= \begin{pmatrix} B \Sigma_{xy} & B \Sigma_{xx} \\ \Sigma_{yy} - B \Sigma_{xy} & \Sigma_{yx} - B \Sigma_{xx} \end{pmatrix} \begin{pmatrix} 0 & I \\ B^T & -B^T \end{pmatrix} \\ &= \begin{pmatrix} B \Sigma_{xx} B^T & B \Sigma_{xy} - B \Sigma_{xx} B^T \\ \Sigma_{yx}B^T - B \Sigma_{xx} B^T & (\Sigma_{yy} - B \Sigma_{xy}) - (\Sigma_{yx} - B \Sigma_{xx})B^T \end{pmatrix} \\ &= \begin{pmatrix} B \Sigma_{xx} B^T & 0 \\ 0 & \Sigma_{yy} - B \Sigma_{xx} B^T \end{pmatrix} \end{aligned} \]

Conditional Distribution

We have established that \(y = m(x) + e\) where \(e\) is independent of \(x\). To find the distribution of \(y\) conditional on \(x\), we observe that \(m(x)\) becomes a constant vector when \(x\) is fixed, and the randomness comes solely from \(e\):

\[ E[y|x] = m(x) + E[e|x] = m(x) + 0 = m(x) \] \[ \text{Var}(y|x) = \text{Var}(m(x)|x) + \text{Var}(e|x) = 0 + \text{Var}(e) = \Sigma_{y|x} \]

Thus, \(y | x \sim N(m(x), \Sigma_{y|x})\).

4.7.1 Connections with Other Formulas

4.7.1.1 Rao-Blackwell Decomposition of Variance

The Law of Total Variance (Rao-Blackwell theorem) allows us to decompose the total variance of \(y\) into two orthogonal components based on the predictor \(x\):

\[ \text{Var}(y) = \underbrace{E[\text{Var}(y | x)]}_{\text{Unexplained (Noise)}} + \underbrace{\text{Var}[E(y | x)]}_{\text{Explained (Signal)}} \]

In the Multivariate Normal case, this decomposition perfectly aligns with our regression model \(y = m(x) + e\).

Variance of Noise

This term represents the average variance remaining in \(y\) after accounting for \(x\). It corresponds to the variance of the error term \(e\):

\[ E[\text{Var}(y | x)] = \text{Var}(e) = \Sigma_{yy} - B \Sigma_{xx} B^T \]

Variance of Signal

This term represents the variability of the conditional mean \(m(x)\) itself. Using the matrix \(B\), this takes the quadratic form:

\[ \text{Var}[E(y | x)] = \text{Var}[m(x)] = B \Sigma_{xx} B^T \]

Total Variance

Summing the Signal and Noise components recovers the total marginal variance of \(y\):

\[ \Sigma_{yy} = \underbrace{\Sigma_{yy} - B \Sigma_{xx} B^T}_{\text{Unexplained (Noise)}} + \underbrace{B \Sigma_{xx} B^T}_{\text{Explained (Signal)}} \]

4.7.1.2 Connection to OLS Regression Estimators

In OLS regression, centering the data allows us to separate the intercept from the slopes. Let \(\mathbf{y}_c\) and \(\mathbf{X}_c\) be the centered response and design matrices (where \(\mathbf{X}_c\) excludes the column of 1s). Using this centered form, the total sum of squares decomposes exactly like the population variance:

\[ \text{SST} = \text{SSR} + \text{SSE} \]

Comparing the sample quantities to their population counterparts:

Regression Coefficients: \[ \hat{\beta}^T = (\mathbf{X}_c^T \mathbf{X}_c)^{-1} \mathbf{X}_c^T \mathbf{y}_c \approx B \] Note: \(\hat{\beta}\) here represents only the slope coefficients, matching the dimensions of the covariance matrix \(\Sigma_{xx}\).
Explained Variation (Signal): \[ \text{SSR} = \hat{\beta}^T (\mathbf{X}_c^T \mathbf{X}_c) \hat{\beta} \quad \approx \quad (n-1) B \Sigma_{xx} B^T \]
Unexplained Variation (Noise): \[ \text{SSE} = \mathbf{y}_c^T \mathbf{y}_c - \hat{\beta}^T (\mathbf{X}_c^T \mathbf{X}_c) \hat{\beta} \quad \approx \quad (n-1)(\Sigma_{yy} - B \Sigma_{xx} B^T) \]

4.8 Partial and Multiple Correlation

Definition 4.9 (Partial Correlation) The partial correlation between elements \(y_i\) and \(y_j\) given a set of variables \(x\) is derived from the conditional covariance matrix \(\Sigma_{y|x}\): \[ \rho_{ij|x} = \frac{\sigma_{ij|x}}{\sqrt{\sigma_{ii|x} \sigma_{jj|x}}} \] where \(\sigma_{ij|x}\) are elements of \(\Sigma_{y|x} = \Sigma_{yy} - \Sigma_{yx}\Sigma_{xx}^{-1}\Sigma_{xy}\).

Definition 4.10 (Multiple Correlation (\(R^2\))) For a scalar \(y\) and vector \(x\), the squared multiple correlation is the proportion of variance of \(y\) explained by the conditional mean: \[ R^2_{y|x} = \frac{\text{Var}(E(y|x))}{\text{Var}(y)} = \frac{\Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}}{\sigma^2_{y}} \]

Note: this definition is the population or theretical \(R^2\), which is estimated by adjusted \(R^2\) using sample in linear regression.

4.9 Examples

Example 4.1 (Bivariate Normal) Let the random vector \(\begin{pmatrix} y \\ x \end{pmatrix}\) follow a bivariate normal distribution: \[ \begin{pmatrix} y \\ x \end{pmatrix} \sim N \left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 2 & 2 \\ 2 & 4 \end{pmatrix} \right) \] Here, \(\mu_y = 1, \mu_x = 2, \Sigma_{yy} = 2, \Sigma_{xx} = 4\), and \(\Sigma_{yx} = 2\).

Finding the Regression Coefficient Matrix \(B\) Using the population formula: \[ B = \Sigma_{yx}\Sigma_{xx}^{-1} = 2(4)^{-1} = 0.5 \]
Finding the Conditional Mean \(m(x)\) (The Signal) The systematic component represents the projection of \(y\) onto \(x\): \[ \begin{aligned} m(x) &= \mu_y + B(x - \mu_x) \\ &= 1 + 0.5(x - 2) = 0.5x \end{aligned} \]
Variance of the Signal \(\text{Var}(m(x))\) Using the quadratic form established in the theorem: \[ \text{Var}(m(x)) = B \Sigma_{xx} B^T = 0.5(4)(0.5) = 1 \]
Variance of the Noise \(\text{Var}(y|x)\) (The Residual) By the Signal-Noise Decomposition: \[ \begin{aligned} \text{Var}(y|x) &= \Sigma_{yy} - \text{Var}(m(x)) \\ &= 2 - 1 = 1 \end{aligned} \] Thus, \(y | x \sim N(m(x), 1)\). The total variance (2) is split equally between signal (1) and noise (1).
Multiple Correlation Coefficient (\(R^2\)) \[ R^2 = \frac{\text{Var}(m(x))}{\Sigma_{yy}} = \frac{1}{2} = 0.5 \]

Figure 4.1: Illustration of Rao-Blackwell Variance Decomposition in Bivariate Normal

Example 4.2 (Trivariate Normal with 2 Predictors) Let \(V = (y, x_1, x_2)' \sim N_3(\mu, \Sigma)\) with: \[ \mu = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \quad \Sigma = \begin{pmatrix} 10 & 3 & 4 \\ 3 & 2 & 1 \\ 4 & 1 & 4 \end{pmatrix} \] We partition these into \(\Sigma_{yy} = 10\), \(\Sigma_{yx} = \begin{pmatrix} 3 & 4 \end{pmatrix}\), and \(\Sigma_{xx} = \begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}\).

Finding the Regression Coefficient Matrix \(B\) \[ \Sigma_{xx}^{-1} = \frac{1}{7} \begin{pmatrix} 4 & -1 \\ -1 & 2 \end{pmatrix} \implies B = \Sigma_{yx} \Sigma_{xx}^{-1} = \begin{pmatrix} \frac{8}{7} & \frac{5}{7} \end{pmatrix} \]
Finding the Conditional Mean \(m(x)\) (The Signal) \[ m(x) = 1 + \frac{8}{7}(x_1 - 2) + \frac{5}{7}(x_2 - 3) \]
Variance of the Signal \(\text{Var}(m(x))\) \[ \text{Var}(m(x)) = B \Sigma_{xx} B^T = \begin{pmatrix} \frac{8}{7} & \frac{5}{7} \end{pmatrix} \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \frac{44}{7} \approx 6.29 \]
Variance of the Noise \(\text{Var}(y|x)\) (The Residual) Using the Signal-Noise Decomposition: \[ \Sigma_{y|x} = \Sigma_{yy} - \text{Var}(m(x)) = 10 - 6.29 = 3.71 \]
Multiple Correlation Coefficient (\(R^2\)) \[ R^2 = \frac{6.29}{10} = 0.629 \]

Code

library(ggplot2)
library(mvtnorm)

mu <- c(1, 2, 3)
sigma <- matrix(c(10, 3, 4, 3, 2, 1, 4, 1, 4), nrow=3, byrow=TRUE)

var_total <- sigma[1,1]
S_yx <- matrix(sigma[1, 2:3], nrow=1)
S_xx <- sigma[2:3, 2:3]
B_mat <- S_yx %*% solve(S_xx)
var_signal <- as.numeric(B_mat %*% S_xx %*% t(B_mat))
var_noise <- var_total - var_signal

set.seed(2024)
dat <- rmvnorm(1000, mean=mu, sigma=sigma)
df <- data.frame(y=dat[,1], x1=dat[,2], x2=dat[,3])
df$m_x <- 1 + (8/7)*(df$x1 - 2) + (5/7)*(df$x2 - 3)

limit_min <- -12
limit_max <- 12
seq_vals <- seq(limit_min, limit_max, length.out=300)
scale_factor <- 20 

df_total <- data.frame(y = seq_vals, x = 9 + dnorm(seq_vals, 1, sqrt(var_total)) * scale_factor)
df_signal <- data.frame(x = seq_vals, y = -8 - dnorm(seq_vals, 1, sqrt(var_signal)) * scale_factor)
df_noise <- data.frame(y = seq_vals, x = 5 + dnorm(seq_vals, 5, sqrt(var_noise)) * scale_factor)

ggplot(df, aes(x=m_x, y=y)) +
  geom_abline(intercept=0, slope=1, color="red", linewidth=0.5, alpha=0.3) +
  geom_point(alpha=0.15, size=1.5, color="black") +
  geom_polygon(data=df_signal, aes(x=x, y=y), fill="red", alpha=0.3) +
  geom_path(data=df_signal, aes(x=x, y=y), color="red", linewidth=1) +
  annotate("text", x=1, y=-11, label="Signal Var\n(m(x))", color="red", size=3, fontface="bold") +
  geom_polygon(data=df_total, aes(x=x, y=y), fill="gray40", alpha=0.3) +
  geom_path(data=df_total, aes(x=x, y=y), color="gray40", linewidth=1) +
  annotate("text", x=11, y=6, label="Total Var\n(y)", color="gray40", size=3, fontface="bold") +
  geom_polygon(data=df_noise, aes(x=x, y=y), fill="blue", alpha=0.3) +
  geom_path(data=df_noise, aes(x=x, y=y), color="blue", linewidth=1) +
  annotate("text", x=6, y=9, label="Noise Var\n(y|x)", color="blue", size=3, fontface="bold") +
  scale_x_continuous(limits=c(-12, 14)) + scale_y_continuous(limits=c(-14, 12)) +
  coord_fixed(ratio=1) + labs(x = "Signal m(x)", y = "Observed y") + theme_minimal()

Figure 4.2: Signal-Noise Variance Decomposition in Multivariate Normal

Code

library(plotly)

x1_seq <- seq(min(df$x1), max(df$x1), length.out=20)
x2_seq <- seq(min(df$x2), max(df$x2), length.out=20)
grid <- expand.grid(x1=x1_seq, x2=x2_seq)
grid$y_pred <- 1 + (8/7)*(grid$x1 - 2) + (5/7)*(grid$x2 - 3)
z_matrix <- matrix(grid$y_pred, nrow=20, ncol=20)

plot_ly() %>%
  add_markers(data = df, x = ~x1, y = ~x2, z = ~y,
              marker = list(size = 3, color = '#444', opacity = 0.5),
              name = "Observed (Total)") %>%
  add_surface(x = x1_seq, y = x2_seq, z = z_matrix,
              opacity = 0.6, colorscale = list(c(0, 1), c("red", "red")),
              showscale = FALSE, name = "Signal (m(x))") %>%
  layout(scene = list(xaxis = list(title = "x1"), yaxis = list(title = "x2"), zaxis = list(title = "y")))

Figure 4.3: Interactive 3D Plot: Signal Plane vs Noise