Stepsize for gradient descent in linear regression

Consider the setting of least squares regression and suppose we are given a data matrix X \in \mathbb{R}^{n\times p} together with a vector of labels y \in \mathbb{R}^n. The solution to the problem \min_\beta \ell(\beta) = \min_\beta \|y - X\beta\|_2^2 is given by the \beta^* that solves the equation

X^TX\beta ^*= X^Ty.

Using matrix inversion, the solution can be calculated explicitly as the well-known least squares estimator \beta^* = (X^TX)^{-1}X^Ty. However, for large datasets matrix inversion can become very costly. A common alternative that avoids this inversion is gradient descent: to find \beta^*, we iteratively take small steps in the direction that results in the greatest decrease of the loss. For the least squares problem the recursion can be readily computed as \beta_{t+1} = \beta_t - \eta \nabla_\beta \ell(\beta_t) = \beta_{t} - \eta X^T(X\beta_t - y), where \eta is a tuning parameter (the learning rate) and the constant factor 2 of the gradient has been absorbed into \eta. As \eta is the only free parameter, this leaves us with the following question: how can we choose the learning rate \eta in an optimal way?
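To make the recursion concrete, here is a minimal NumPy sketch of gradient descent for the least squares loss; the function name, the zero initialization, and the fixed iteration count are illustrative choices rather than part of the derivation.

```python
import numpy as np

def gradient_descent_ls(X, y, eta, n_iters=1000):
    """Gradient descent for the loss ||y - X beta||_2^2, with the
    constant factor 2 of the gradient absorbed into the stepsize eta."""
    beta = np.zeros(X.shape[1])                     # start at the origin
    for _ in range(n_iters):
        beta = beta - eta * X.T @ (X @ beta - y)    # one gradient step
    return beta
```

In practice one would typically stop once the gradient norm falls below a tolerance; a fixed iteration count just keeps the sketch short.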

What does it mean to have an optimal learning rate \eta in gradient descent? Surely it must have to do with the speed of convergence of the iterates \beta_t to the solution \beta^*. Some short calculations yield

\displaystyle \begin{array}{ll} \beta_{t+1} & = \beta_{t} - \eta X^T(X\beta_t - y) \\   & = \beta_{t} - \eta X^TX\beta_t + \eta X^Ty\\   & = \beta_{t} - \eta X^TX\beta_t + \eta X^TX\beta^*, \end{array}

where the last step uses the normal equations X^Ty = X^TX\beta^*.

Subtracting \beta^* from both sides gives \beta_{t+1} - \beta^* = (I - \eta X^T X) (\beta_t - \beta^*). Thus, measuring convergence in the Euclidean norm, we obtain

\displaystyle \|\beta_{t+1} - \beta^*\|_2 \leq \|I - \eta X^T X\|_2 \|\beta_t - \beta^*\|_2,

where \|I - \eta X^T X\|_2 is the induced matrix norm (the spectral norm) of I - \eta X^T X. Hence, all that is left is finding the value \eta that minimizes this norm.
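As a quick numerical sanity check, this contraction factor can be computed directly; a small sketch in the same spirit as above (X and eta are assumed to be given):

```python
import numpy as np

def contraction_factor(X, eta):
    """Spectral norm of I - eta * X^T X, i.e. the factor by which the
    distance to beta* shrinks (at most) in one gradient descent step."""
    G = X.T @ X
    return np.linalg.norm(np.eye(G.shape[0]) - eta * G, ord=2)  # largest singular value
```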

Since X^TX is symmetric, the eigenvalues of I - \eta X^T X are 1 - \eta \lambda_i(X^TX), so the spectral norm is the largest of these in absolute value. We can therefore minimize \|I - \eta X^T X\|_2 = \max\{|1 - \eta \lambda_{\max}(X^TX)|, |1 - \eta \lambda_{\min}(X^TX)|\} by balancing the two terms, i.e. by taking

\displaystyle - 1 + \eta \lambda_{\max}(X^TX) = 1 - \eta \lambda_{\min}(X^TX) \Leftrightarrow \eta = \frac{2}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}.

Thus, the fastest convergence is achieved by choosing \displaystyle \eta = \frac{2}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}.
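In code, this choice only requires the extreme eigenvalues of X^TX. The sketch below is one way to compute it and to check, on arbitrary synthetic data, that gradient descent with this stepsize (using the gradient_descent_ls sketch from above) recovers the least squares solution.

```python
import numpy as np

def optimal_eta(X):
    """Stepsize 2 / (lambda_max + lambda_min) of the Gram matrix X^T X."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigvalsh: X^T X is symmetric
    return 2.0 / (eigvals.max() + eigvals.min())

# Synthetic example: compare against the closed-form solution of the
# normal equations X^T X beta = X^T y.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)

eta = optimal_eta(X)
beta_gd = gradient_descent_ls(X, y, eta, n_iters=500)
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_gd, beta_exact))   # should print True for this data
```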
