# Stepsize for gradient descent in linear regression

Consider the setting of least squares regression and suppose we are given a data matrix $X \in \mathbb{R}^{n\times p}$ together with a vector of labels $y \in \mathbb{R}^n$. The solution to the problem $\min_\beta \ell(\beta) = \min_\beta \|y - X\beta\|_2^2$ is given by the $\beta^*$ that solves the equation

$X^TX\beta ^*= X^Ty$.

Using matrix inversion, the solution can be computed explicitly as the well-known least squares estimator $\beta^* = (X^TX)^{-1}X^Ty$, provided $X^TX$ is invertible. However, for large datasets matrix inversion can become very costly. To avoid this inversion, a common alternative is to use gradient descent to find $\beta^*$, meaning we iteratively take small steps in the direction of steepest decrease of the loss. For the least squares problem the gradient is $\nabla_\beta \ell(\beta) = 2X^T(X\beta - y)$, so, absorbing the constant factor $2$ into the step size, the recursion reads $\beta_{t+1} = \beta_{t} - \eta X^T(X\beta_t - y)$, where $\eta > 0$ is a tuning parameter, the learning rate. As this is the only free parameter, we are left with the following question: how can we choose the learning rate $\eta$ in an optimal way?
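As a concrete illustration, here is a minimal NumPy sketch of this recursion; the data, the fixed step size $\eta$, and the iteration count are arbitrary choices made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Reference solution via lstsq (avoids forming an explicit inverse)
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain gradient descent on the least squares loss
eta = 1e-3  # small fixed step size, picked ad hoc for this example
beta = np.zeros(p)
for _ in range(5000):
    beta = beta - eta * X.T @ (X @ beta - y)

print(np.linalg.norm(beta - beta_star))  # distance to the exact solution
```

With this data the iterates come very close to $\beta^*$, whereas a step size that is too large would make them diverge instead.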

What does it mean for a learning rate $\eta$ to be optimal in gradient descent? Surely it must have to do with the speed at which the iterates $\beta_t$ converge to the solution $\beta^*$. A short calculation yields

$\displaystyle \begin{array}{ll} \beta_{t+1} & = \beta_{t} - \eta X^T(X\beta_t - y) \\ & = \beta_{t} - \eta X^TX\beta_t + \eta X^Ty\\ & = \beta_{t} - \eta X^TX\beta_t + \eta X^TX\beta^*, \end{array}$

where the last equality uses the normal equations $X^Ty = X^TX\beta^*$.

Subtracting $\beta^*$ from both sides gives $\beta_{t+1} - \beta^* = (I - \eta X^T X) (\beta_t - \beta^*).$ Thus, measuring convergence in the Euclidean norm, we obtain

$\displaystyle \|\beta_{t+1} - \beta^*\|_2 \leq \|I - \eta X^T X\|_2 \|\beta_t - \beta^*\|_2,$

where $\|I - \eta X^T X\|_2$ denotes the induced matrix norm (the spectral norm) of $I - \eta X^T X$. Hence, all that is left is to find the value of $\eta$ that minimizes this norm.
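The contraction bound can be sanity-checked numerically; below is a small sketch with synthetic data (dimensions, seed, and step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Least squares solution via lstsq
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

eta = 1e-3  # arbitrary small step size
rho = np.linalg.norm(np.eye(p) - eta * X.T @ X, 2)  # spectral norm

# One gradient descent step from a random starting point
beta_t = rng.standard_normal(p)
beta_next = beta_t - eta * X.T @ (X @ beta_t - y)

lhs = np.linalg.norm(beta_next - beta_star)
rhs = rho * np.linalg.norm(beta_t - beta_star)
print(lhs <= rhs)  # the one-step error shrinks by at most a factor rho
```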

Since $X^TX$ is symmetric, so is $I - \eta X^T X$, and its eigenvalues are $1 - \eta \lambda_i(X^TX)$. The spectral norm is therefore the largest of these in absolute value, $\|I - \eta X^T X\|_2 = \max\{|1 - \eta \lambda_{\max}(X^TX)|, |1 - \eta \lambda_{\min}(X^TX)|\}$, which we can minimize by balancing the two terms:

$\displaystyle - 1 + \eta \lambda_{\max}(X^TX) = 1 - \eta \lambda_{\min}(X^TX) \Leftrightarrow \eta = \frac{2}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}.$
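As a quick check of this balancing argument, one can scan the spectral norm over a grid of candidate step sizes and compare the grid minimizer against the closed-form value (again a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 3))
G = X.T @ X

lam = np.linalg.eigvalsh(G)  # eigenvalues in ascending order
eta_opt = 2.0 / (lam[-1] + lam[0])

# Spectral norm of I - eta * G over a grid around eta_opt
etas = np.linspace(0.2, 1.8, 801) * eta_opt
norms = [np.linalg.norm(np.eye(3) - eta * G, 2) for eta in etas]
eta_best = etas[int(np.argmin(norms))]

print(eta_best / eta_opt)  # the grid minimizer sits at the closed-form value
```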

Thus, the fastest convergence is achieved by choosing $\displaystyle \eta = \frac{2}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}$, which yields the contraction factor $\displaystyle \|I - \eta X^T X\|_2 = \frac{\lambda_{\max}(X^TX) - \lambda_{\min}(X^TX)}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}$.
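To close, a short experiment (synthetic data; dimensions and step counts are arbitrary choices) comparing this step size against two more conservative alternatives:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

lam = np.linalg.eigvalsh(X.T @ X)
eta_opt = 2.0 / (lam[-1] + lam[0])

def gd_error(eta, steps=20):
    """Distance to beta_star after a fixed number of gradient steps."""
    beta = np.zeros(p)
    for _ in range(steps):
        beta = beta - eta * X.T @ (X @ beta - y)
    return np.linalg.norm(beta - beta_star)

# After the same number of steps, the derived eta should come closest
for eta in (eta_opt, 1.0 / lam[-1], 0.5 * eta_opt):
    print(f"eta = {eta:.3e}  error = {gd_error(eta):.3e}")
```

On this data the derived step size leaves a markedly smaller error than the two smaller choices after the same number of iterations.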