Maximum likelihood estimators and least squares

Why do we choose to minimize the mean squared error (least squares) when we do linear regression? Is it because it is smooth and its derivative is easy to work with? Is it because that's our intuition about how to fit a curve to a set of points? As it turns out, there is a mathematical reason behind this, and it has something to do with maximum likelihood.

Maximum likelihood estimators (MLE)

A maximum likelihood estimate for some hidden parameter $\lambda$ (or parameters, plural) of some probability distribution is a number $\hat{\lambda}$ computed from an independent and identically distributed (i.i.d.) sample $X_{1}, ..., X_{n}$ from the given distribution that maximizes something called the “likelihood function”. Suppose that the distribution in question is governed by a pdf $f(x; \lambda_{1}, ..., \lambda_{k})$, where the $\lambda_{i}$’s are all hidden parameters. The likelihood function associated to the sample is just

$$ L(X_{1}, X_{2}, ..., X_{n}; \lambda_{1}, ..., \lambda_{k}) = \prod_{i=1}^{n}f(X_{i}; \lambda_{1}, ..., \lambda_{k}) $$

For example, if the distribution is $N(\mu , \sigma^{2})$, then

$$ L(X_{1}, X_{2}, ..., X_{n}; \hat{\mu}, \hat{\sigma}^{2}) = \frac{1}{(2\pi)^{\frac{n}{2}} \hat{\sigma}^{n}} e^{-\frac{1}{2\hat{\sigma}^{2}}((X_{1} - \hat{\mu})^{2} + (X_{2} - \hat{\mu})^{2} + ... + (X_{n} - \hat{\mu})^{2})} \qquad (1)$$

The hats in $\hat{\mu}$ and $\hat{\sigma}$ indicate that these are variables over which we maximize, not to be confused with $\mu$ and $\sigma$, the unknown parameters of the real distribution that we want to estimate.
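To make (1) concrete, here is a minimal numerical sketch (using NumPy and SciPy; the sample and the candidate values of $\hat{\mu}$ and $\hat{\sigma}$ are arbitrary choices of mine) checking that the closed form agrees with the product of the individual pdf values:

```python
# Quick numerical check of (1): the likelihood of an i.i.d. normal sample
# equals the product of the individual pdf values.
import numpy as np
from scipy.stats import norm

X = np.array([1.2, 0.7, 2.1, 1.5, 0.9])   # an arbitrary i.i.d. sample
mu_hat, sigma_hat = 1.0, 0.8               # arbitrary candidate parameter values

n = len(X)
# Closed form (1)
L_closed = (2 * np.pi) ** (-n / 2) * sigma_hat ** (-n) * np.exp(
    -np.sum((X - mu_hat) ** 2) / (2 * sigma_hat ** 2)
)
# Product of pdf values
L_product = np.prod(norm.pdf(X, loc=mu_hat, scale=sigma_hat))

print(L_closed, L_product)   # the two agree up to floating-point error
```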

Why should one expect a maximum likelihood estimate for some parameter to be a “good estimate”? Well, what the likelihood function measures is how likely $(X_{1}, ..., X_{n})$ is to have come from the distribution assuming particular values for the hidden parameters; the more likely this is, the closer one would think those particular choices of hidden parameters are to the true values. Let’s see an example:

Example. Suppose that $X_{1}, ..., X_{n}$ are generated from a normal distribution having hidden mean $\mu$ and variance $\sigma^{2}$. Compute an MLE for $\mu$ from the sample.

Solution. As we said above, the likelihood function in this case is given by (1). To maximize $L$ as a function of $\hat{\mu}$ (for any fixed $\hat{\sigma} \neq 0$), we must minimize

$$ \sum_{i=1}^{n} (X_{i} - \hat{\mu})^{2} $$

as a function of $\hat{\mu}$. Taking the derivative with respect to $\hat{\mu}$ and setting it to 0 gives $-2\sum_{i=1}^{n}(X_{i} - \hat{\mu}) = 0$; solving for $\hat{\mu}$, we find that

$$ \hat{\mu} = \frac{X_{1} + X_{2} + ... + X_{n}}{n} = \bar{X},$$

the sample mean. So, the sample mean is the MLE for $\mu$ in this case.
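As a quick sanity check, the sketch below (my own illustrative code; the generated data and the fixed value of $\hat{\sigma}$ are arbitrary) minimizes the negative log-likelihood numerically and confirms that the maximizer coincides with the sample mean:

```python
# Sketch: numerically confirm that the sample mean maximizes the likelihood
# of a normal sample (sigma held fixed).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=1000)   # sample with true mu = 3

def neg_log_likelihood(mu_hat, sigma_hat=2.0):
    # Up to an additive constant, -log L is sum_i (X_i - mu_hat)^2 / (2 sigma^2)
    return np.sum((X - mu_hat) ** 2) / (2 * sigma_hat ** 2)

result = minimize_scalar(neg_log_likelihood)
print(result.x, X.mean())   # the maximizer coincides with the sample mean
```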

Least squares

Suppose that you are presented with a sequence of data points $(X_{1},Y_{1}), ..., (X_{n}, Y_{n})$, and you are asked to find the “best fit” line passing through those points. Of course, in order to answer this you need to know precisely how to tell whether one line is “fitter” than another. A common measure of fitness is the square-error, defined as follows: suppose $y = \lambda_{1}x + \lambda_{2}$ is your candidate line. Then the error associated with this line is

$$ E := \sum_{i=1}^{n}(Y_{i} - \lambda_{1}X_{i} - \lambda_{2})^{2} $$

In other words, it is the sum of the squared distances between the $y$-values of the data points at $X = X_{1}, X_{2}, ..., X_{n}$ and the $y$-values of the line at those same points.
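In code, the square-error of a candidate line is a one-liner. The sketch below (the data points and the candidate line are made up) just evaluates the sum above:

```python
# Sketch: computing the square-error E for a candidate line y = l1*x + l2.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

def square_error(l1, l2):
    residuals = Y - (l1 * X + l2)       # vertical distances to the line
    return np.sum(residuals ** 2)       # E as defined above

print(square_error(2.0, 1.0))           # error of the candidate line y = 2x + 1
```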

Why use the sum of squared errors? First of all, squaring makes every term in the sum non-negative, and the error at a given point $X = X_{i}$ is the same whether the point $(X_{i},Y_{i})$ lies $t$ units above the line $y = \lambda_{1}x + \lambda_{2}$ or $t$ units below it. Secondly, squaring is a “smooth” operation, so we can easily compute derivatives of $E$; in other words, using the sum of squared errors allows us to use calculus. And finally, at the end of this blog we will relate the sum of squared errors to MLE’s.

Minimizing $E$ over all choices of $(\lambda_{1},\lambda_{2})$ results in what is called the “least squares approximation”. Let us see how to compute it: we take the partial derivatives of $E$ with respect to $\lambda_{1}$ and $\lambda_{2}$ and set them equal to 0,

$$ \frac{\partial E}{\partial \lambda_{1}} = -2\sum_{i=1}^{n} X_{i}(Y_{i} - \lambda_{1}X_{i} - \lambda_{2}) = 0, \qquad \frac{\partial E}{\partial \lambda_{2}} = -2\sum_{i=1}^{n} (Y_{i} - \lambda_{1}X_{i} - \lambda_{2}) = 0, $$

which is a pair of linear equations in $\lambda_{1}$ and $\lambda_{2}$ (the “normal equations”). Solving them gives

$$ \lambda_{1} = \frac{\sum_{i=1}^{n} X_{i}Y_{i} - n\bar{X}\bar{Y}}{\sum_{i=1}^{n} X_{i}^{2} - n\bar{X}^{2}}, \qquad \lambda_{2} = \bar{Y} - \lambda_{1}\bar{X}. $$
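Here is a small sketch (made-up data again) that evaluates this closed-form solution of the normal equations and checks the result against NumPy's `polyfit`:

```python
# Sketch: closed-form least-squares slope and intercept from the normal
# equations, checked against NumPy's polyfit.
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

# Solve the normal equations for the slope and intercept
l1 = ((X * Y).mean() - X.mean() * Y.mean()) / ((X ** 2).mean() - X.mean() ** 2)
l2 = Y.mean() - l1 * X.mean()

print(l1, l2)
print(np.polyfit(X, Y, deg=1))   # same slope and intercept
```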

MLE’s again, and least squares

In this section we consider a different sort of problem related to “best fit lines”. Suppose that we know a priori that the data points $(X_{i},Y_{i})$ fit a straight line, except that there is a little error involved. That is to say, suppose that $X_{1}, ..., X_{n}$ are fixed and that we think of $Y_{1}, ..., Y_{n}$ as random variables satisfying $$ Y_{i} = \lambda_{1}X_{i} + \lambda_{2} + \epsilon_{i}, \quad \text{where} \ \epsilon_{i} \sim N(0, \sigma^{2}),$$

where all the $\epsilon_{i}$'s are assumed to be independent; $\epsilon_{i}$ is the measurement error. (Taking the noise to have mean 0 loses no generality, since a nonzero mean could simply be absorbed into $\lambda_{2}$.)
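For concreteness, here is how one might simulate data from this model (the choices $\lambda_{1} = 2$, $\lambda_{2} = 1$, $\sigma = 0.5$ and the grid of $X_{i}$'s are arbitrary):

```python
# Sketch: generating data from Y_i = lambda_1 * X_i + lambda_2 + eps_i,
# with eps_i ~ N(0, sigma^2). All numeric values here are arbitrary.
import numpy as np

rng = np.random.default_rng(42)
X = np.linspace(0.0, 10.0, 50)                    # the X_i are fixed, not random
eps = rng.normal(loc=0.0, scale=0.5, size=X.size)  # independent Gaussian noise
Y = 2.0 * X + 1.0 + eps                            # lambda_1 = 2, lambda_2 = 1
```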

Now we find an MLE for $\lambda_{1}$ and $\lambda_{2}$. The likelihood of $Y_{i} = \lambda_{1}X_{i} + \lambda_{2} + \epsilon_{i}$ is the likelihood of $\epsilon_{i} = Y_{i} - \lambda_{1}X_{i} - \lambda_{2}$. Since $\epsilon_{i} \sim N(0, \sigma^{2})$, our likelihood function is given by (recall that $X_{1}, ..., X_{n}$ are fixed)

$$ L(Y_{1}, Y_{2}, ..., Y_{n}; \hat{\lambda}_{1}, \hat{\lambda}_{2}, \hat{\sigma}^{2}) = \frac{1}{(2\pi)^{\frac{n}{2}} \hat{\sigma}^{n}} e^{-\frac{1}{2\hat{\sigma}^{2}} \sum_{i=1}^{n}(Y_{i} - \hat{\lambda}_{1} X_{i} - \hat{\lambda}_{2})^{2}} $$

Clearly, for any fixed $\hat{\sigma} \neq 0$, maximizing $L$ is equivalent to minimizing

$$ E := \sum_{i=1}^{n}(Y_{i} - \hat{\lambda}_{1} X_{i} - \hat{\lambda}_{2})^{2} $$

So we see that the least squares estimate we saw before is equivalent to producing a maximum likelihood estimate for $\lambda_{1}$ and $\lambda_{2}$ when the variables $X$ and $Y$ are linearly related up to Gaussian noise $N(0, \sigma^{2})$. The significance of this is that it makes the least-squares method of linear curve fitting a little more natural; it’s not as artificial as it might have seemed at first. What made it seem artificial was the fact that there are many, many different error functions we could have written down to measure how well the line $y = \lambda_{1}x + \lambda_{2}$ fits the given data. What we have shown is that the “sum of squared errors” error function occupies a privileged position among them.
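To close the loop, the sketch below (regenerating the simulated data from the model above so the block is self-contained, and using SciPy's generic optimizer; all specific values and starting points are arbitrary choices) maximizes the log-likelihood numerically and checks that the resulting $\hat{\lambda}_{1}, \hat{\lambda}_{2}$ agree with the least-squares fit:

```python
# Sketch: maximizing the log-likelihood of the linear model numerically and
# checking that it recovers the least-squares fit.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
X = np.linspace(0.0, 10.0, 50)
Y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=X.size)

def neg_log_likelihood(params):
    l1, l2, log_sigma = params
    sigma = np.exp(log_sigma)                     # keep sigma positive
    resid = Y - (l1 * X + l2)
    # -log L up to an additive constant: n*log(sigma) + SSE / (2*sigma^2)
    return X.size * np.log(sigma) + np.sum(resid ** 2) / (2 * sigma ** 2)

mle = minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0]).x
print(mle[:2])                  # MLE for (lambda_1, lambda_2)
print(np.polyfit(X, Y, deg=1))  # least-squares slope and intercept: ~same values
```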
