Consider the setting of least squares regression and suppose we are given a data matrix $X \in \mathbb{R}^{n \times d}$ together with a vector of labels $y \in \mathbb{R}^n$. The solution to the problem $\min_{\beta} \|X\beta - y\|^2$ is given by the $\beta_*$ that solves the normal equation $$X^\top X \beta_* = X^\top y.$$
Using matrix inversion, the solution can be easily calculated as the well-known least squares estimator $\beta_* = (X^\top X)^{-1} X^\top y$. However, for large datasets matrix inversion can become very costly. A common alternative that avoids this inversion is to use gradient descent to find $\beta_*$, meaning we iteratively take small steps in the direction that results in the greatest decrease of the loss. For the least squares problem the recursion can be readily computed as $$\beta_{t+1} = \beta_t - \eta\, X^\top (X\beta_t - y),$$ where $\eta > 0$ is a tuning parameter, the learning rate. As this is the only free parameter, this leaves us with the following question: how can we choose the learning rate $\eta$ in an optimal way?
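The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation; the function name, the synthetic data, and the particular values of the learning rate and step count are all illustrative choices.

```python
import numpy as np

def least_squares_gd(X, y, eta, n_steps=5000):
    """Gradient descent on the least squares loss 1/2 * ||X beta - y||^2."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # Gradient of the loss at the current iterate.
        grad = X.T @ (X @ beta - y)
        beta = beta - eta * grad
    return beta

# Small synthetic problem (sizes and seed are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

beta_gd = least_squares_gd(X, y, eta=1e-3)
# Solving the normal equation directly gives the exact estimator.
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_gd, beta_exact, atol=1e-4))  # True
```

Note that `np.linalg.solve` is used for the reference solution rather than forming the inverse explicitly, which is both cheaper and numerically safer.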
What does it mean to have an optimal learning rate in gradient descent? Surely it must have to do with the speed of convergence of the iterates $\beta_t$ to the solution $\beta_*$. Some short calculations, using $X^\top y = X^\top X \beta_*$, yield $$\beta_{t+1} = \beta_t - \eta\, X^\top X \beta_t + \eta\, X^\top X \beta_* = (I - \eta\, X^\top X)\,\beta_t + \eta\, X^\top X \beta_*.$$
Subtracting $\beta_*$ from both sides gives $$\beta_{t+1} - \beta_* = (I - \eta\, X^\top X)(\beta_t - \beta_*).$$ Thus, by measuring convergence in the sense of norms we obtain $$\|\beta_{t+1} - \beta_*\| \le \|I - \eta\, X^\top X\|\;\|\beta_t - \beta_*\|,$$
where $\|A\| = \max_{\|v\| = 1} \|Av\|$ is the induced matrix norm (the spectral norm) of $A = I - \eta\, X^\top X$. Hence, all that is left is finding the value of $\eta$ that minimizes this norm.
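Before minimizing, we can sanity-check the bound numerically: one gradient step should contract the error by at most the spectral norm of $I - \eta\, X^\top X$. The problem sizes, seed, and learning rate below are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
eta = 1e-3

# Exact solution and the iteration matrix from the error recursion.
beta_star = np.linalg.solve(X.T @ X, X.T @ y)
M = np.eye(4) - eta * (X.T @ X)
rho = np.linalg.norm(M, 2)  # spectral norm (largest singular value)

# One gradient step from an arbitrary starting point.
beta = rng.normal(size=4)
err_before = np.linalg.norm(beta - beta_star)
beta_next = beta - eta * X.T @ (X @ beta - y)
err_after = np.linalg.norm(beta_next - beta_star)

print(err_after <= rho * err_before + 1e-12)  # True
```

Since the error recursion is exactly linear, the bound holds with equality only when the error is aligned with the top singular direction of $M$; generically the contraction is strictly better.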
Let $\lambda_{\max}$ and $\lambda_{\min}$ denote the largest and smallest eigenvalues of $X^\top X$. Since $X^\top X$ is symmetric, the spectral norm is $$\|I - \eta\, X^\top X\| = \max\{\,|1 - \eta\,\lambda_{\max}|,\; |1 - \eta\,\lambda_{\min}|\,\},$$ and we can minimize this expression by taking $\eta$ such that the two terms are balanced, i.e. $1 - \eta\,\lambda_{\min} = \eta\,\lambda_{\max} - 1$.
Thus, the fastest convergence is achieved by choosing $$\eta_* = \frac{2}{\lambda_{\max} + \lambda_{\min}},$$ which yields the contraction factor $\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}$ per iteration.
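The optimal step size is easy to compute once the extreme eigenvalues of $X^\top X$ are known, and we can verify that at $\eta_*$ the spectral norm of $I - \eta_*\, X^\top X$ equals the balanced contraction factor $(\lambda_{\max} - \lambda_{\min}) / (\lambda_{\max} + \lambda_{\min})$. The matrix sizes and seed are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))

# Eigenvalues of the (symmetric) Gram matrix, in ascending order.
lams = np.linalg.eigvalsh(X.T @ X)
lam_min, lam_max = lams[0], lams[-1]

# Optimal step size and the contraction factor it achieves.
eta_opt = 2.0 / (lam_max + lam_min)
rho_opt = (lam_max - lam_min) / (lam_max + lam_min)

# The spectral norm of I - eta * X^T X at eta_opt matches rho_opt.
M = np.eye(6) - eta_opt * (X.T @ X)
print(np.isclose(np.linalg.norm(M, 2), rho_opt))  # True
```

Writing $\kappa = \lambda_{\max} / \lambda_{\min}$ for the condition number, this factor equals $(\kappa - 1)/(\kappa + 1)$, so well-conditioned problems ($\kappa$ near 1) converge fast while ill-conditioned ones converge slowly even at the optimal step size.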