In this post I want to write about the connection between Ridge regression and robust regression. Ridge regression (also known as Tikhonov regularization) is a form of regularization or shrinkage, where the parameters of linear regression are shrunk towards 0.
There are several reasons why one might want to use a method like this. A very simple motivation is the case of multicollinearity. If the regression covariates suffer from multicollinearity, the moment matrix $X^\top X$ is (close to) singular and computing the least squares solution becomes difficult or impossible. An easy way to make $X^\top X$ invertible is to add a ($\lambda$-scaled) identity matrix and use the estimator $\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y$. It turns out this is the solution to the optimization problem

$$\min_{\beta} \; \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2,$$
which corresponds to the usual least squares minimization objective plus a penalty term for the size of the parameters. Another interpretation can be found by considering a Bayesian linear model where the parameters are endowed with a Gaussian prior distribution. These models are well known. However, a less well-known fact is the connection to robust regression.
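To make the multicollinearity motivation concrete, here is a minimal sketch (all names and the specific data are illustrative) of the closed-form Ridge estimator on a design matrix with two nearly identical columns, where the unregularized moment matrix is close to singular:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)  # almost an exact copy of x1
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

lam = 1.0
# Ridge estimator: (X^T X + lam * I)^{-1} X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

# Adding lam * I drastically improves the conditioning of the moment matrix
print(np.linalg.cond(X.T @ X))                    # nearly singular
print(np.linalg.cond(X.T @ X + lam * np.eye(2)))  # well conditioned
print(beta_ridge)
```

Note that solving the regularized normal equations via `np.linalg.solve` is numerically stable here, whereas inverting $X^\top X$ directly would be hopeless with columns this collinear.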
Now let us consider the case where the observations are random. Let us denote the random design matrix by $\tilde{X} = X + \Delta$, where $X$ is the mean and $\Delta$ is a centered random matrix, i.e. $\mathbb{E}[\tilde{X}] = X$ and $\mathbb{E}[\Delta] = 0$. One could still perform ordinary least squares, i.e. minimize $\lVert y - X\beta \rVert_2^2$; however, a more robust choice would be to take the randomness into account and minimize the expected squared norm

$$\mathbb{E}\,\lVert y - (X + \Delta)\beta \rVert_2^2.$$
For the Euclidean norm as above the expectation can be rewritten as

$$\mathbb{E}\,\lVert y - (X + \Delta)\beta \rVert_2^2 = \lVert y - X\beta \rVert_2^2 + \beta^\top \mathbb{E}[\Delta^\top \Delta]\,\beta,$$

since the cross term vanishes: $\mathbb{E}[\Delta] = 0$ implies $\mathbb{E}\big[(\Delta\beta)^\top (y - X\beta)\big] = 0$.
This is the same optimization objective as Ridge regression! For example, if the variance of the error matrix is isotropic, i.e. $\mathbb{E}[\Delta^\top \Delta] = \lambda I$ for some $\lambda > 0$, we recover the objective $\lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$. As before, the solution is

$$\hat{\beta} = (X^\top X + \lambda I)^{-1} X^\top y.$$
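The equivalence can be checked numerically. The following Monte Carlo sketch (data and parameter values are illustrative) estimates the expected loss under a random design with i.i.d. $N(0, \sigma^2)$ perturbation entries, for which $\mathbb{E}[\Delta^\top \Delta] = n\sigma^2 I$, and compares it to the corresponding Ridge objective:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

sigma = 0.3
lam = n * sigma**2  # for iid N(0, sigma^2) entries, E[Delta^T Delta] = n * sigma^2 * I

def expected_loss_mc(b, trials=5000):
    """Monte Carlo estimate of E || y - (X + Delta) b ||^2."""
    total = 0.0
    for _ in range(trials):
        Delta = sigma * rng.normal(size=(n, p))
        r = y - (X + Delta) @ b
        total += r @ r
    return total / trials

b = np.array([0.4, -0.9, 1.8])  # an arbitrary candidate parameter vector
ridge_obj = np.sum((y - X @ b) ** 2) + lam * (b @ b)
mc = expected_loss_mc(b)
print(mc, ridge_obj)  # the two values should agree up to Monte Carlo error
```

The agreement holds at any point $b$, not just the minimizer, since the two objectives are identical as functions of $\beta$.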
Hence, Ridge regression has another interesting interpretation as a robust optimization problem. I think this fact is useful for understanding why Ridge regression can be advantageous, especially when the number of observations is small. There is another interesting detail. It is a well-known fact that Ridge regression corresponds to a Bayesian model where the parameters are random with an a priori Gaussian distribution, whereas the matrix of covariates is assumed fixed. Here, in contrast, the parameters are fixed and the covariates are random.
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization, Section 6.4.1. Cambridge University Press.