Theory of Statistical Inference

Author: Longhai Li

Affiliation: University of Saskatchewan

Published: April 29, 2026

Preface

This is a concise course on statistical inference, developed for STAT 442/851 at the University of Saskatchewan.

Key Features

  • Traditional Mathematical Rigor: Emphasis is placed on the rigorous mathematical derivation and proof of the fundamental theorems underpinning statistical inference. For example, students will engage deeply with the analytic proofs of the Neyman-Pearson Lemma, the asymptotic normality of Maximum Likelihood Estimators (MLE), and the formulation of the Cramér-Rao Lower Bound.
  • Integration of Computational Tools: Utilization of computational simulations and graphical representations to elucidate and validate complex statistical concepts. For instance, the course employs simulation studies to empirically demonstrate the advantages of shrinkage estimators and to visualize the asymptotic distributions governed by Maximum Likelihood Estimation (MLE) theory and Wilks’ Theorem. Furthermore, discussions will highlight the operational relevance of classical mathematical theorems within modern computational frameworks, such as the implications of Bartlett Identities in deep learning algorithms.
  • Priority of Topics in Light of Modern Practice in Statistics and Machine Learning: The curriculum is strategically curated to emphasize methodologies that offer broad generalizability and utility in contemporary applications. Prominence is given to versatile frameworks such as Bayesian inference, regularization techniques, Maximum Likelihood Estimation (MLE), likelihood-based hypothesis testing, information criteria, and rigorous model assessment. Concurrently, essential classical foundations, including the Neyman-Pearson Lemma and the theory of Uniformly Minimum Variance Unbiased Estimators (UMVUE), are deliberately retained to ensure a comprehensive theoretical grounding.
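
The kind of simulation study mentioned in the second bullet can be sketched briefly. The example below is not taken from the course materials. It assumes the James-Stein estimator as the shrinkage estimator, a \(p\)-dimensional normal model \(\mathbf{X} \sim N_p(\boldsymbol{\theta}, \mathbf{I}_p)\) with a single observation per replication, and squared-error loss. The function name `simulate_risks` and the chosen \(\boldsymbol{\theta}\) are hypothetical.

```python
import numpy as np


def simulate_risks(theta, n_rep=20000, seed=0):
    """Monte Carlo estimate of squared-error risk for the MLE and for the
    James-Stein estimator when X ~ N_p(theta, I_p) (assumed model)."""
    rng = np.random.default_rng(seed)
    p = theta.size
    # Each row is one replication: a single draw of the p-dimensional X.
    x = theta + rng.standard_normal((n_rep, p))

    # The MLE of theta is X itself.
    mle = x

    # James-Stein: shrink X toward the origin by a data-dependent factor.
    shrink = 1.0 - (p - 2) / np.sum(x**2, axis=1, keepdims=True)
    js = shrink * x

    # Average squared-error loss over replications approximates the risk.
    mle_risk = np.mean(np.sum((mle - theta) ** 2, axis=1))
    js_risk = np.mean(np.sum((js - theta) ** 2, axis=1))
    return mle_risk, js_risk


theta = np.ones(10)  # hypothetical true parameter with p = 10
mle_risk, js_risk = simulate_risks(theta)
print(f"MLE risk: {mle_risk:.3f}, James-Stein risk: {js_risk:.3f}")
```

Under these assumptions the risk of the MLE equals \(p\), so the first number should be close to 10, and for \(p \ge 3\) the James-Stein risk should come out smaller.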

Audience

This course requires a strong command of multivariate calculus and a rigorous foundation in intermediate probability theory, including asymptotic theory. Students should also have prior exposure to applied statistical methods and be familiar with basic statistical concepts such as p-values and confidence intervals.

About the author

Longhai Li is a professor at the University of Saskatchewan in Canada. He received his Ph.D. degree in statistics from the University of Toronto. His research focuses on developing and applying statistical machine-learning methods for bioinformatics and epidemiology applications, with particular interest in statistical learning, cross-validation, hierarchical modelling, survival modelling, model checking, residual diagnostics, model comparison, zero-inflated models, high-throughput data, and microbiome data. His research has been funded by NSERC, CANSSI, CFI, CFREF, and MITACS. His research papers have appeared in highly reputed journals, such as Journal of the American Statistical Association, Bayesian Analysis, Statistics in Medicine, Statistics and Computing, The American Statistician, Journal of Applied Statistics, Scientific Reports, and BMC Bioinformatics.

Notation and Symbols

Throughout this document, we use boldface lowercase letters (e.g., \(\mathbf{x}\)) to denote vectors and boldface uppercase letters (e.g., \(\mathbf{X}\), \(\mathbf{I}\)) to denote matrices. A boldface uppercase letter such as \(\mathbf{X}\) is also used for a random vector, as listed in the table below. Scalar variables and parameters are written in standard italics (e.g., \(x\), \(\theta\)).

| Symbol | Description |
| --- | --- |
| Variables & Parameters | |
| \(x\), \(X\) | Scalar random variable or observation |
| \(\mathbf{x}\), \(\mathbf{X}\) | Vector random variable or observation (column vector) |
| \(\theta\) | Scalar unknown parameter |
| \(\boldsymbol{\theta}\) | Vector of unknown parameters, \(\boldsymbol{\theta} \in \mathbb{R}^p\) |
| \(\Theta\) | Parameter space |
| \(\mathbf{0}\) | Zero vector or zero matrix (dimension implied by context) |
| \(\mathbf{I}_p\) | Identity matrix of size \(p \times p\) |
| Functions & Operators | |
| \(f(x \mid \theta)\) | Probability density function (pdf) or probability mass function (pmf) |
| \(\ell(\boldsymbol{\theta})\) | Log-likelihood function, \(\ell(\boldsymbol{\theta}) = \log f(\mathbf{x} \mid \boldsymbol{\theta})\) |
| \(E^{\mathbf{X} \mid \boldsymbol{\theta}}[\cdot]\) or \(E_{\boldsymbol{\theta}}[\cdot]\) | Expectation w.r.t. \(\mathbf{X}\) conditional on \(\boldsymbol{\theta}\) |
| \(\text{Var}^{\mathbf{X} \mid \boldsymbol{\theta}}(\cdot)\) or \(\text{Var}_{\boldsymbol{\theta}}(\cdot)\) | Variance (or covariance matrix) w.r.t. \(\mathbf{X}\) conditional on \(\boldsymbol{\theta}\) |
| \(\text{Cov}_{\boldsymbol{\theta}}(\cdot, \cdot)\) | Covariance between two random variables or vectors |
| \(\mathbf{A}^\top\) or \(\mathbf{A}^T\) | Transpose of vector or matrix \(\mathbf{A}\) |
| \(\text{tr}(\mathbf{A})\) | Trace of matrix \(\mathbf{A}\) (sum of diagonal elements) |
| \(\mathbf{A} \succeq \mathbf{B}\) | Matrix inequality; \(\mathbf{A} - \mathbf{B}\) is positive semi-definite |
| Calculus & Gradients | |
| \(\nabla_{\boldsymbol{\theta}} f\) | Gradient vector with respect to \(\boldsymbol{\theta}\) |
| \(\nabla^2_{\boldsymbol{\theta}} f\) | Hessian matrix (second derivatives) with respect to \(\boldsymbol{\theta}\) |
| \(\frac{\partial}{\partial \boldsymbol{\theta}}\) | Partial derivative operator |
| \(\mathbf{J}_f(\mathbf{x})\) | Jacobian matrix of a vector-valued function \(f\) |
| \(\nabla \cdot \mathbf{g}\) | Divergence of a vector field \(\mathbf{g}\), \(\sum_i \frac{\partial g_i}{\partial x_i}\) |
| Statistical Quantities | |
| \(\mathbf{U}(\boldsymbol{\theta})\) | Score vector, \(\nabla_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})\) |
| \(\mathbf{J}(\boldsymbol{\theta})\) | Observed Information matrix, \(-\nabla^2_{\boldsymbol{\theta}} \ell(\boldsymbol{\theta})\) |
| \(\mathcal{I}(\boldsymbol{\theta})\) | Fisher Information matrix, \(E[\mathbf{U}\mathbf{U}^\top] = E[\mathbf{J}]\) |
| \(T(\mathbf{X})\) | Estimator or statistic |
| \(m(\boldsymbol{\theta})\) | Expectation of an estimator, \(E[T(\mathbf{X})]\) |
| \(\mathbf{D}(\boldsymbol{\theta})\) | Jacobian of the expectation vector \(m(\boldsymbol{\theta})\) |
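
To make the score, observed information, and Fisher information entries concrete, the following sketch checks the identity \(E[\mathbf{U}\mathbf{U}^\top] = E[\mathbf{J}]\) by Monte Carlo. It is an illustration only, under assumptions that do not come from this document: a single observation from \(N(\mu, \sigma^2)\) with \(\boldsymbol{\theta} = (\mu, \sigma^2)\), and hypothetical function names `score` and `observed_info`. For this model the Fisher information of one observation is \(\text{diag}(1/\sigma^2,\ 1/(2\sigma^4))\).

```python
import numpy as np


def score(x, mu, s2):
    """Per-observation score U(theta) for N(mu, s2), theta = (mu, s2).
    Returns an array of shape (n, 2)."""
    r = x - mu
    return np.stack([r / s2, -0.5 / s2 + 0.5 * r**2 / s2**2], axis=1)


def observed_info(x, mu, s2):
    """Per-observation observed information J(theta) = -Hessian of the
    log-density. Returns an array of shape (n, 2, 2)."""
    r = x - mu
    J = np.empty((x.size, 2, 2))
    J[:, 0, 0] = 1.0 / s2
    J[:, 0, 1] = J[:, 1, 0] = r / s2**2
    J[:, 1, 1] = -0.5 / s2**2 + r**2 / s2**3
    return J


# Hypothetical true parameter values and simulation size.
mu, s2, n = 1.0, 2.0, 200_000
rng = np.random.default_rng(0)
x = rng.normal(mu, np.sqrt(s2), size=n)

U = score(x, mu, s2)
E_UUt = np.einsum("ni,nj->ij", U, U) / n       # Monte Carlo E[U U^T]
E_J = observed_info(x, mu, s2).mean(axis=0)    # Monte Carlo E[J]
fisher = np.diag([1.0 / s2, 0.5 / s2**2])      # closed form for this model

print("E[U U^T] estimate:\n", E_UUt)
print("E[J] estimate:\n", E_J)
print("Closed-form Fisher information:\n", fisher)
```

Both Monte Carlo averages should agree with the closed-form matrix up to simulation error. The identity relies on the usual regularity conditions, which the normal model satisfies.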