My note on Ordinary Least Squares
In statistics, ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. OLS chooses the parameters of a linear function of a set of explanatory variables by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being predicted) in the given dataset and those predicted by the linear function.
The True Model
Suppose the data consists of N observations $\{x_i, y_i\}_{i=1}^{N}$ . Each observation i includes a scalar response $y_i$ and a column vector $x_i$ of values of K predictors (regressors) $x_{ij}$ for j = 1, ..., K. In a linear regression model, the response variable $y_i$ is a linear function of the regressors:
$$ y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_K x_{iK} + \epsilon_i, $$or in vector form,
$$ y_i = x_i^T \beta + \epsilon_i, $$where $\beta$ is a K×1 vector of unknown parameters; the $\epsilon_i$'s are unobserved scalar random variables (errors) which account for influences upon the responses $y_i$ from sources other than the explanators $x_i$; and $x_{i}$ is a column vector of the ith observations of all the explanatory variables. This model can also be written in matrix notation as
$$ y = X \beta + \epsilon \qquad (1)$$where y and $\epsilon$ are N×1 vectors of the values of the response variable and the errors for the various observations, and X is an N×K matrix of regressors, also sometimes called the design matrix, whose row i is $x_i^T$ and contains the ith observations on all the explanatory variables.
This is assumed to be an accurate reflection of the real world. The model has a systematic component $(X \beta)$ and a stochastic component $(\epsilon)$. Our goal is to obtain estimates of the population parameters in the $\beta$ vector.
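To make the notation concrete, here is a minimal NumPy sketch that simulates data from the model in Eq. 1; the particular N, K, $\beta$ and $\sigma$ are arbitrary choices for illustration, not anything implied by the theory.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K = 100, 3                      # number of observations and regressors
beta = np.array([2.0, -1.0, 0.5])  # "true" population parameters (arbitrary)
sigma = 1.0                        # standard deviation of the disturbances

# Design matrix X: first column of ones (intercept), remaining columns random.
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])

# Systematic component X @ beta plus stochastic component epsilon.
epsilon = rng.normal(scale=sigma, size=N)
y = X @ beta + epsilon
```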
Criteria for Estimates
Our estimates of the population parameters are referred to as $\hat{\beta}$. Recall that the criterion we use for obtaining our estimates is to find the estimator $\hat{\beta}$ that minimizes the sum of squared residuals ($\sum e_i^2$ in scalar notation). To see where this criterion comes from, please check my other blog post
The residual vector e is given by
$$ e = y - X \hat{\beta} \qquad (2)$$The residual sum of squares (RSS) is then $e^Te$
$$ [e_1 \ e_2 \ ... \ e_N]_{1×N} \begin{bmatrix} e_1\\ e_2\\ .\\ .\\ e_N\\ \end{bmatrix}_{N×1} = [e_1^2 + e_2^2 + \ ... \ + e_N^2]_{1×1} \qquad (3)$$Clearly, we can re-write the RSS as
$$ RSS = (y - X \hat{\beta})^T(y - X \hat{\beta}) = y^Ty - y^T X \hat{\beta} - (X \hat{\beta})^T y + (X \hat{\beta})^T X \hat{\beta} $$$$ = y^Ty - y^T X \hat{\beta} - \hat{\beta}^T X^T y + \hat{\beta}^T X^T X \hat{\beta} $$Note that $y^T X \hat{\beta}$ is a 1×1 matrix, i.e. a scalar, so its transpose is also a scalar and equal to itself
$$ y^T X \hat{\beta} = (y^T X \hat{\beta})^T = \hat{\beta}^T X^T y \qquad (4)$$Then
$$ RSS = y^Ty - 2 \hat{\beta}^T X^T y + \hat{\beta}^T X^T X \hat{\beta} \qquad (5)$$To find the $\hat{\beta}$ that minimizes the RSS, we take the derivative of Eq. 5 with respect to $\hat{\beta}$ and set it to zero
$$ \frac{\partial RSS}{\partial \hat{\beta}} = - 2 X^T y + 2 X^T X \hat{\beta} = 0 \qquad (6)$$$$ \Rightarrow 2 X^T X \hat{\beta} = 2 X^T y $$$$ \Rightarrow \hat{\beta} = (X^T X)^{-1} X^T y $$How do we know this is the minimum? We can check the second derivative of RSS with respect to $\hat{\beta}$
$$ \frac{\partial^2 RSS}{\partial \hat{\beta}^2} = 2 X^T X$$If X has full rank, this is a positive definite matrix (analogous to a positive real number). So Eq. 6 gives us the minimum.
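Continuing with simulated data, the closed-form solution can be computed directly. The sketch below uses a small synthetic dataset of my own choosing; note that in practice it is numerically preferable to solve the normal equations (or call a least-squares routine such as np.linalg.lstsq) rather than explicitly inverting $X^TX$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3
beta = np.array([2.0, -1.0, 0.5])
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ beta + rng.normal(size=N)

# Closed form: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable alternatives: solve the normal equations, or use lstsq.
beta_hat_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_hat_solve), np.allclose(beta_hat, beta_hat_lstsq))
```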
A brief overview of matrix differentiation.
$$ \frac{\partial a^T b}{\partial b} = \frac{\partial b^T a}{\partial b} = a \qquad (7) $$when a and b are K×1 vector
$$ \frac{\partial b^T A b}{\partial b} = 2Ab \qquad (8) $$when A is any symmetric K×K matrix. Depending on the layout convention, the result can also be written as the row vector $2b^TA$; the two forms are transposes of each other.
Thus,
$$ \frac{\partial \hat{\beta}^T X^T y}{\partial \hat{\beta}} = X^T y $$and
$$ \frac{\partial \hat{\beta}^T X^T X \hat{\beta}}{\partial \hat{\beta}} = 2 X^T X \hat{\beta} $$where $X^T X$ is a symmetric K×K matrix.
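These rules can be sanity-checked numerically by comparing the analytic gradient from Eq. 6 with a finite-difference approximation of the RSS; the simulated data and the point $\hat{\beta}$ below are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
b = rng.normal(size=K)   # an arbitrary point at which to check the gradient

def rss(b):
    """Residual sum of squares at coefficient vector b."""
    r = y - X @ b
    return r @ r

# Analytic gradient from Eq. 6: -2 X^T y + 2 X^T X b.
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ b

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_numeric = np.array([
    (rss(b + h * np.eye(K)[j]) - rss(b - h * np.eye(K)[j])) / (2 * h)
    for j in range(K)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))
```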
Properties of the OLS Estimators
Since the OLS estimators in the $\hat{\beta}$ vector are a linear combination of existing random variables (X and y), they themselves are random variables with certain straightforward properties.
Recall
$$ X^TX \hat{\beta} = X^Ty \qquad (9) $$From Eq. 2 we have $y = X \hat{\beta} + e$; applying this to Eq. 9 gives
$$ X^TX \hat{\beta} = X^T(X \hat{\beta} + e) $$$$ X^TX \hat{\beta} = X^TX \hat{\beta} + X^T e$$$$ 0 = X^T e \qquad (10)$$Eq. 10 looks like this
$$ \begin{bmatrix} X_{11} & X_{21} & ... & X_{N1}\\ X_{12} & X_{22} & ... & X_{N2}\\ .\\ .\\ X_{1K} & X_{2K} & ... & X_{NK}\\ \end{bmatrix} \begin{bmatrix} e_{1}\\ e_{2}\\ .\\ .\\ e_{N}\\ \end{bmatrix} = \begin{bmatrix} X_{11} e_1 + X_{21} e_2 + ... + X_{N1} e_N\\ X_{12} e_1 + X_{22} e_2 + ... + X_{N2} e_N\\ .\\ .\\ X_{1K} e_1 + X_{2K} e_2 + ... + X_{NK} e_N\\ \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ .\\ .\\ 0\\ \end{bmatrix} \qquad (11)$$From Eq. 11 we can derive some nice properties of OLS
1. The observed values X are uncorrelated with residuals
This means each regressor has zero sample correlation with the residuals. But this doesn't mean X has zero correlation with $\epsilon$; that is an assumption we make, not something the algebra guarantees
The following properties hold true only when our regression includes a constant
2. Sum of the residuals is zero
If there is a constant, then a column in X (let's say X1) is a column of ones (or any constant c). This means the first element of $X^Te$ is $e_1 + e_2 + ... + e_N = 0$
3. The sample mean of the residuals $\bar{e}$ is zero
4. The regression hyperplane passes through the means of the observed values ($\bar{X}$ and $\bar{y}$)
Recall $y = X \hat{\beta} + e$. Dividing by the number of observations, we get $\bar{y} = \bar{x} \hat{\beta} + \bar{e} = \bar{x} \hat{\beta} $, where $\bar{x}$ is the row vector of the means of the regressors. This shows that the regression hyperplane goes through the point of means of the data.
5. The predicted values of y are uncorrelated with the residuals
$$ \hat{y}^Te = (X \hat{\beta})^Te = \hat{\beta}^T X^T e = \hat{\beta}^T 0 = 0 \qquad (12) $$6. The mean of the predicted Y’s for the sample will equal the mean of the observed Y’s, i.e. $\bar{\hat{y}}=\bar{y}$
Recall $y = X \hat{\beta} + e = \hat{y} + e$. Dividing by the number of observations we get $\bar{y} = \bar{\hat{y}} + \bar{e} = \bar{\hat{y}}$
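All of these properties are easy to confirm numerically. The following sketch (with arbitrary simulated data and a constant column in X) checks properties 1 through 6 up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # includes a constant
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat
e = y - y_hat

print(np.allclose(X.T @ e, 0))                            # 1. X^T e = 0
print(np.isclose(e.sum(), 0))                             # 2. residuals sum to zero
print(np.isclose(e.mean(), 0))                            # 3. mean residual is zero
print(np.isclose(y.mean(), X.mean(axis=0) @ beta_hat))    # 4. hyperplane through the means
print(np.isclose(y_hat @ e, 0))                           # 5. predictions uncorrelated with residuals
print(np.isclose(y_hat.mean(), y.mean()))                 # 6. mean(y_hat) = mean(y)
```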
The Gauss-Markov Assumptions
Assumption 1: $ y = X \beta + \epsilon $. This means there is a linear relationship between y and X
Assumption 2 (Identification Condition): X is an N×K matrix of full rank.
This means there is no perfect multicollinearity, i.e. the columns of X are linearly independent.
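A quick way to check this condition on a given design matrix is to compare its rank with its number of columns; the sketch below builds a deliberately collinear X of my own making to show the failure case.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)

X_ok  = np.column_stack([np.ones(N), x1, x2])       # full column rank
X_bad = np.column_stack([np.ones(N), x1, 2 * x1])   # third column = 2 * second column

for X in (X_ok, X_bad):
    full_rank = np.linalg.matrix_rank(X) == X.shape[1]
    print(full_rank)   # True for X_ok, False for X_bad: X^T X is singular for X_bad
```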
Assumption 3 (Zero mean conditional assumption): $E[\epsilon | X] = 0$ or
$$ E \begin{bmatrix} \epsilon_1 | X \\ \epsilon_2 | X \\ .\\ .\\ \epsilon_N | X \\ \end{bmatrix} = \begin{bmatrix} E[\epsilon_1 | X] \\ E[\epsilon_2 | X] \\ .\\ .\\ E[\epsilon_N | X] \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ .\\ .\\ 0 \\ \end{bmatrix} $$The disturbances average out to zero for any value of X, which means that no observation of the independent variables conveys any information about the expected value of the disturbance. This implies that $E[y | X] = X \beta$
Assumption 4 $E[\epsilon \epsilon^T|X] = \sigma^2 I$
$$ E [\epsilon \epsilon^T|X] = E \left [ \begin{bmatrix} \epsilon_1 | X \\ \epsilon_2 | X \\ .\\ .\\ \epsilon_N | X \\ \end{bmatrix} [\epsilon_1 \ \epsilon_2 \ ... \ \epsilon_N ] \right ] \qquad (13) $$$$ = E \begin{bmatrix} \epsilon_1 \epsilon_1 | X & \epsilon_1 \epsilon_2 |X & ... & \epsilon_1 \epsilon_N | X\\ \epsilon_2 \epsilon_1 | X & \epsilon_2 \epsilon_2 | X& ... & \epsilon_2 \epsilon_N | X\\ . & . & ... & .\\ . & . & ... & .\\ \epsilon_N \epsilon_1 | X & \epsilon_N \epsilon_2 | X & ... & \epsilon_N \epsilon_N | X\\ \end{bmatrix} $$$$ = \begin{bmatrix} E[\epsilon_1^2 | X] & E[\epsilon_1 \epsilon_2 | X] & ... & E[\epsilon_1 \epsilon_N | X]\\ E[\epsilon_2 \epsilon_1 | X] & E[\epsilon_2^2 | X] & ... & E[\epsilon_2 \epsilon_N | X]\\ . & . & ... & .\\ . & . & ... & .\\ E[\epsilon_N \epsilon_1 | X] & E[\epsilon_N \epsilon_2 |X] & ... & E[\epsilon_N^2 | X]\\ \end{bmatrix} \qquad (14)$$$$ = \begin{bmatrix} \sigma^2 & 0 & ... & 0\\ 0 & \sigma^2 & ... & 0\\ . & . & ... & .\\ . & . & ... & .\\ 0 & 0 & ... & \sigma^2\\ \end{bmatrix} = \sigma^2 I \qquad (15)$$This assumption means:
- The variance of $\epsilon_i$ is the same ($\sigma^2$) for all i (the assumption of homoskedasticity)
- The assumption of no autocorrelation: $Cov(\epsilon_i, \epsilon_j) = 0 \ \forall i \neq j$
Disturbances that meet these two assumptions are referred to as spherical disturbances. We can write the Gauss-Markov assumptions about the disturbances compactly as:
$$ \Omega = \sigma^2 I $$where $\Omega$ is the variance-covariance matrix of the disturbances, i.e. $\Omega = E[\epsilon \epsilon^T | X]$
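A small simulation makes this concrete: averaging the outer products $\epsilon \epsilon^T$ over many independent draws of a homoskedastic, uncorrelated disturbance vector recovers approximately $\sigma^2 I$. The dimensions and $\sigma$ below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma, reps = 5, 1.5, 200_000

# Each row of eps is one independent draw of the N-vector epsilon.
eps = rng.normal(scale=sigma, size=(reps, N))

# Average of epsilon epsilon^T over the draws approximates sigma^2 * I.
omega_hat = eps.T @ eps / reps

print(np.round(omega_hat, 2))   # roughly 2.25 on the diagonal, near 0 elsewhere
print(np.allclose(omega_hat, sigma**2 * np.eye(N), atol=0.05))
```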
Assumption 5: X may be random or fixed, but must not be related to $\epsilon$
Assumption 6 $\epsilon | X \sim N(0, \sigma^2I)$
This assumption is not required for the Gauss-Markov Theorem. We make this assumption to simplify the hypothesis testing.
The Gauss-Markov Theorem
There is no other linear, unbiased estimator of the $\beta$ coefficients that has a smaller sampling variance. In other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
How do we know this?
Proof that $\hat{\beta}$ is an unbiased estimator of $\beta$.
Recall that $\hat{\beta} = (X^TX)^{-1}X^Ty$ and $y = X \beta + \epsilon$. This means
$$ \hat{\beta} = (X^TX)^{-1}X^T(X \beta + \epsilon) $$$$ \hat{\beta} = \beta + (X^TX)^{-1} X^T\epsilon \qquad (16)$$This means that $\hat{\beta}$ is an unbiased estimator of $\beta$ so long as $E[(X^TX)^{-1} X^T\epsilon] = 0$. There are two cases:
- X is fixed (non-stochastic; The same every time we draw N samples). $E[(X^TX)^{-1} X^T\epsilon] = (X^TX)^{-1} X^T E[\epsilon] = 0$ where $E[\epsilon] = 0$ using the Gauss-Markov assumption.
- X is stochastic (each time we draw N samples, we get a different X) but independent of $\epsilon$, so that $E[\epsilon | X] = 0$ still holds. By the law of iterated expectations, $$ E[(X^TX)^{-1} X^T \epsilon] = E\big[(X^TX)^{-1} X^T E[\epsilon | X]\big] = 0$$
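This argument can be illustrated by simulation: hold X fixed, redraw $\epsilon$ many times, re-estimate $\hat{\beta}$ each time, and check that the average estimate is close to the true $\beta$. The setup below is an arbitrary sketch, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, reps = 100, 3, 5_000
beta = np.array([2.0, -1.0, 0.5])
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # X held fixed

estimates = np.empty((reps, K))
for r in range(reps):
    y = X @ beta + rng.normal(size=N)               # new disturbances each replication
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(estimates.mean(axis=0))   # Monte Carlo average of beta_hat, close to beta
print(beta)
```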
Proof that $\hat{\beta}$ is a linear estimator of $\beta$.
We can re-write Eq. 16 as $\hat{\beta} = \beta + A\epsilon$, where $A = (X^TX)^{-1}X^T$, which shows that $\hat{\beta}$ is a linear function of the disturbances (and, since $\hat{\beta} = (X^TX)^{-1}X^Ty$, a linear function of y). This makes it a linear estimator.
Proof that $\hat{\beta}$ has minimal variance among all linear and unbiased estimators.
$$ Var(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] $$ $$ Var(\hat{\beta}) = E[((X^TX)^{-1} X^T \epsilon)((X^TX)^{-1} X^T \epsilon)^T] $$where $\hat{\beta} = \beta + (X^TX)^{-1} X^T\epsilon $ from Eq. 16
$$ Var(\hat{\beta}) = E[(X^TX)^{-1} X^T \epsilon \epsilon^T X (X^TX)^{-1}] \qquad (17)$$Conditioning on X (which also covers the case where X is non-stochastic), the factors involving X can be treated as constants and pulled outside the expectation
$$ Var(\hat{\beta}|X) = E[(X^TX)^{-1} X^T \epsilon \epsilon^T X (X^TX)^{-1} \mid X] \qquad (18)$$so that
$$ Var(\hat{\beta}|X) = (X^TX)^{-1} X^T E[ \epsilon \epsilon^T | X] X (X^TX)^{-1} $$ $$ Var(\hat{\beta}|X) = (X^TX)^{-1} X^T \sigma^2I X (X^TX)^{-1} $$ $$ Var(\hat{\beta}|X) = \sigma^2 (X^TX)^{-1} X^TX (X^TX)^{-1} $$ $$ Var(\hat{\beta}|X) = \sigma^2 (X^TX)^{-1} \qquad (19)$$Recall that $\hat{\beta} = (X^TX)^{-1} X^Ty$ is an unbiased, linear estimator of $\beta$. Let $\hat{\beta}_0 = Cy$ be another linear unbiased estimator of $\beta$, where C is a K×N matrix. $\hat{\beta}_0$ being unbiased means
$$ E[\hat{\beta}_0 | X] = E[Cy | X] = E[C(X\beta + \epsilon) | X] = E[CX\beta | X] + E[C \epsilon | X] = \beta $$$$ CX\beta + C\,E[\epsilon | X] = \beta $$$$ CX\beta = \beta $$Since this must hold for any $\beta$, it follows that $$CX = I$$Let $D = C - (X^TX)^{-1} X^T$ so that $Dy = \hat{\beta}_0 - \hat{\beta}$
Since $\hat{\beta}_0 - \beta = C\epsilon$ (using $CX = I$), the variance of $\hat{\beta}_0$ is $\sigma^2 C C^T$ with $C = D + (X^TX)^{-1} X^T$, i.e.
$$ Var(\hat{\beta}_0 |X) = \sigma^2 (D + (X^TX)^{-1} X^T)(D + (X^TX)^{-1} X^T)^T $$Expanding this product, and noting that $CX = I = DX + (X^TX)^{-1} X^TX$ implies $DX = 0$, the cross terms vanish. Therefore
$$ Var(\hat{\beta}_0 |X) = \sigma^2 D D^T + \sigma^2 (X^TX)^{-1} $$$$ Var(\hat{\beta}_0 |X) = \sigma^2 D D^T + Var(\hat{\beta} | X) $$Since any quadratic form in $D D^T$ is non-negative ($w^T D D^T w = z^Tz \geq 0$ with $z = D^T w$), every quadratic form in $ Var(\hat{\beta}_0 |X)$ is larger than or equal to the corresponding quadratic form in $ Var(\hat{\beta} |X)$. In particular, each diagonal element (the sampling variance of each coefficient) is smallest for the OLS estimator.
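To see the theorem at work numerically, one can compare OLS against some other linear unbiased estimator. In the sketch below the competitor is a weighted estimator $(X^TWX)^{-1}X^TWy$ with an arbitrary fixed positive-definite diagonal W (my own choice for illustration); under spherical disturbances both are unbiased, but the OLS coefficient variances come out no larger.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K, reps = 100, 3, 20_000
beta = np.array([2.0, -1.0, 0.5])
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])

# An arbitrary fixed positive-definite diagonal weight matrix (not the identity).
W = np.diag(rng.uniform(0.5, 2.0, size=N))

ols = np.empty((reps, K))
alt = np.empty((reps, K))
for r in range(reps):
    y = X @ beta + rng.normal(size=N)                    # spherical disturbances
    ols[r] = np.linalg.solve(X.T @ X, X.T @ y)           # OLS estimator
    alt[r] = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)   # another linear unbiased estimator

print(ols.mean(axis=0), alt.mean(axis=0))         # both close to beta (unbiased)
print(ols.var(axis=0) <= alt.var(axis=0) + 1e-6)  # OLS variances are no larger
```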
The Variance-Covariance Matrix of the OLS Estimates
Recall Eq. 19: if X is non-stochastic (or conditioning on X),
$$ Var(\hat{\beta}|X) = \sigma^2 (X^TX)^{-1}$$To estimate $\sigma^2$ recall that $$ y = X\hat{\beta} + e , \hat{\beta} = (X^T X)^{-1}X^Ty $$ $$ e = y - X\hat{\beta} = y - X(X^T X)^{-1}X^Ty = (I - X(X^T X)^{-1}X^T) y = My$$
$$ e = M(X \beta + \epsilon) = MX \beta + M \epsilon = M \epsilon \qquad (20)$$where $ M = I - X(X^T X)^{-1}X^T$ is an N×N symmetric and idempotent matrix (when multiplied by itself, it yields itself); both properties are easy to prove. Note that $MX = 0$
$$ e^Te = (M \epsilon)^T M \epsilon = \epsilon^T M^T M \epsilon = \epsilon^T MM \epsilon = \epsilon^T M \epsilon \qquad (21)$$$$ E[e^Te | X] = E[ \epsilon^T M \epsilon | X ] \qquad (22)$$Since $\epsilon^T M \epsilon$ is a 1×1 matrix, $\epsilon^T M \epsilon = tr(\epsilon^T M \epsilon)$ (the trace). Also note that $tr(AB) = tr(BA)$ (cyclic permutation). Eq. 22 can be re-written as
$$ E[e^Te | X] = E[tr(\epsilon^T M \epsilon) | X ] = E[tr(M \epsilon \epsilon^T ) | X ] \qquad (23)$$Because M is a function of X only, and we are conditioning on X (which is non-stochastic in this case), M can be pulled outside the expectation
$$ E[e^Te | X] = tr(E[M \epsilon \epsilon^T | X ]) = tr(M E[\epsilon \epsilon^T | X]) \qquad (24)$$Applying assumption 4, $E[\epsilon \epsilon^T |X] = \sigma^2 I$, we have
$$ E[e^Te | X] = tr(M E[\epsilon \epsilon^T | X]) = tr(M \sigma^2 I) = \sigma^2 tr(M) = (N - K) \sigma^2 \qquad (25)$$where
$$ tr(M) = tr(I_N) - tr(X(X^TX)^{-1}X^T) = tr(I_N) - tr((X^TX)^{-1}X^TX) = tr(I_N) - tr(I_K) = N - K \qquad (26)$$So an unbiased estimate of $\sigma^2$ would be $$ \hat{\sigma}^2 = \frac{e^Te}{N-K} \qquad (27) $$
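The residual-maker matrix M, its trace, and the unbiased estimate $\hat{\sigma}^2$ are straightforward to verify numerically; the sketch below also forms the estimated standard errors from Eq. 19 and Eq. 27 using arbitrary simulated data.

```python
import numpy as np

rng = np.random.default_rng(7)
N, K = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=N)

M = np.eye(N) - X @ np.linalg.inv(X.T @ X) @ X.T   # residual-maker matrix
e = M @ y                                          # residuals, e = M y

print(np.allclose(M, M.T))             # M is symmetric
print(np.allclose(M @ M, M))           # M is idempotent
print(np.allclose(M @ X, 0))           # M X = 0
print(np.isclose(np.trace(M), N - K))  # tr(M) = N - K

sigma2_hat = (e @ e) / (N - K)                      # Eq. 27
cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)  # plug-in estimate of Eq. 19
std_errors = np.sqrt(np.diag(cov_beta_hat))
print(sigma2_hat, std_errors)
```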
So, with non-stochastic X (or conditioning on X), $\hat{\beta} = \beta + (X^TX)^{-1}X^T \epsilon$ is a linear combination of the disturbances. With the assumption that $\epsilon \sim N[0, \sigma^2 I]$, we can say the OLS estimator has a Gaussian distribution
$$ \hat{\beta} \sim N[\beta, \sigma^2(X^TX)^{-1}] \qquad (28) $$
References
Prof. Michael J. Rosenfeld, *OLS in Matrix Form*, lecture notes.