Regularization as robust regression

Ridge regression

In this post I want to write about the connection between Ridge regression and robust regression. Ridge regression (also known as Tikhonov regularization) is a form of regularization or shrinkage, in which the parameters of a linear regression are shrunk towards 0.

There are several reasons why one might want to use methods like this. A very simple motivation is the case of multicollinearity. If the regression covariates suffer from multicollinearity, the moment matrix X^TX is (close to) singular and computing the least squares solution \beta_{\mathrm{ols}} = (X^TX)^{-1}X^Ty becomes difficult or impossible. An easy way to make X^TX invertible is to add a (\lambda-scaled) identity matrix and use the estimator \beta_{\mathrm{ridge}} = (X^TX+\lambda I)^{-1}X^Ty. It turns out that this is the solution to the optimization problem

\min_\beta \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,

which corresponds to the usual least squares objective plus a penalty term on the size of the parameters. Another interpretation can be found by considering a Bayesian linear model in which the parameters \beta are endowed with a Gaussian prior distribution. These interpretations are well known. However, a less well known fact is the connection to robust regression.
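
As a quick sanity check, here is a minimal numerical sketch (assuming NumPy and SciPy are available; the data are synthetic and the variable names are my own) showing that the closed-form estimator above indeed minimizes the penalized objective.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p, lam = 50, 5, 2.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Closed-form ridge estimator: (X^T X + lambda I)^{-1} X^T y.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Direct minimization of ||y - X beta||_2^2 + lambda ||beta||_2^2.
objective = lambda b: np.sum((y - X @ b) ** 2) + lam * np.sum(b ** 2)
beta_opt = minimize(objective, np.zeros(p)).x

# The two solutions should agree up to numerical tolerance.
print(np.max(np.abs(beta_ridge - beta_opt)))
```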

Robust regression

Now let us consider the case where the observations are random. Denote the random design matrix by X, write \bar{X} for its mean, and let Z be a centered random matrix such that X = \bar{X} + Z and E[Z] = 0. One could still perform ordinary least squares, i.e. minimize \|y - X\beta\|_2^2, but a more robust choice is to take the randomness into account and minimize the expected squared norm

\min_\beta E\|y - X\beta\|_2^2.

For the Euclidean norm as above the expectation can be rewritten as

\begin{aligned}
E\|y - X\beta\|_2^2 &= E\|y - \bar{X}\beta - Z\beta\|_2^2 \\
&= E\left[\left(y - \bar{X}\beta - Z\beta\right)^T\left(y - \bar{X}\beta - Z\beta\right)\right] \\
&= \left(y - \bar{X}\beta\right)^T\left(y - \bar{X}\beta\right) - 2\left(y - \bar{X}\beta\right)^T E[Z]\beta + \beta^T E\left[Z^TZ\right]\beta \\
&= \left(y - \bar{X}\beta\right)^T\left(y - \bar{X}\beta\right) + \beta^T\Sigma\beta \\
&= \|y - \bar{X}\beta\|_2^2 + \|\Sigma^{1/2}\beta\|_2^2,
\end{aligned}

where \Sigma := E[Z^TZ] and the cross term vanishes because E[Z] = 0.
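
The identity can also be checked numerically. The following is a small Monte Carlo sketch (synthetic data, assuming NumPy; the isotropic Gaussian model for Z is my own illustrative choice) comparing the simulated expectation on the left with the penalized objective on the right.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
Xbar = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = Xbar @ beta + rng.normal(size=n)
sigma = 0.5  # entries of Z are i.i.d. N(0, sigma^2), so Sigma = E[Z^T Z] = n * sigma^2 * I

# Monte Carlo estimate of E||y - (Xbar + Z) beta||_2^2.
draws = [np.sum((y - (Xbar + sigma * rng.normal(size=(n, p))) @ beta) ** 2)
         for _ in range(200_000)]
lhs = np.mean(draws)

# Right-hand side of the identity: ||y - Xbar beta||_2^2 + beta^T Sigma beta.
Sigma = n * sigma**2 * np.eye(p)
rhs = np.sum((y - Xbar @ beta) ** 2) + beta @ Sigma @ beta

print(lhs, rhs)  # the two values should agree up to Monte Carlo error
```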

This is the same optimization objective as Ridge regression! For example, if the noise in the design matrix is isotropic, i.e. E\left[Z^TZ\right] = \sigma^2 I for some \sigma^2 > 0, we get \|y - \bar{X}\beta\|_2^2 + \sigma^2\|\beta\|_2^2, which is exactly the Ridge objective with \lambda = \sigma^2. In general, the solution is \beta_{\mathrm{ridge}} = (\bar{X}^T\bar{X} + \Sigma)^{-1}\bar{X}^Ty.
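
For the general, non-isotropic case, here is a short sketch (again assuming NumPy and SciPy, with an arbitrary positive definite \Sigma chosen purely for illustration) checking that this closed-form solution matches direct minimization of the robust objective.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, p = 40, 4
Xbar = rng.normal(size=(n, p))
y = Xbar @ rng.normal(size=p) + rng.normal(size=n)

A = rng.normal(size=(p, p))
Sigma = A @ A.T + np.eye(p)  # an arbitrary positive definite Sigma

# Generalized ridge estimator: (Xbar^T Xbar + Sigma)^{-1} Xbar^T y.
beta_closed = np.linalg.solve(Xbar.T @ Xbar + Sigma, Xbar.T @ y)

# Direct minimization of ||y - Xbar beta||_2^2 + beta^T Sigma beta.
objective = lambda b: np.sum((y - Xbar @ b) ** 2) + b @ Sigma @ b
beta_opt = minimize(objective, np.zeros(p)).x

print(np.max(np.abs(beta_closed - beta_opt)))  # should be ~0
```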

Hence, Ridge regression has another interesting interpretation as a robust optimization problem. I think this fact is useful for understanding why Ridge regression can be advantageous, especially when the number of observations is low. There is another interesting detail. It is a well known fact that Ridge regression corresponds to a Bayesian model where the parameters are random and assumed to have an a priori Gaussian distribution, whereas the matrix of covariates X is assumed fixed. Here, the roles are reversed: the parameters are fixed and the covariates are random.

References

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization, Section 6.4.1. Cambridge University Press.
