# Stepsize for gradient descent in linear regression

Consider the setting of least squares regression and suppose we are given a data matrix $X \in \mathbb{R}^{n\times p}$ together with a vector of labels $y \in \mathbb{R}^n$. The solution to the problem $\min_\beta \ell(\beta) = \min_\beta \|y - X\beta\|_2^2$ is given by the $\beta^*$ that solves the equation

$X^TX\beta ^*= X^Ty$.

Using matrix inversion, the solution can be computed explicitly as the well-known least squares estimator $\beta^* = (X^TX)^{-1}X^Ty$, provided $X^TX$ is invertible. However, for large datasets matrix inversion can become very costly. To avoid this inversion, a common alternative is to use gradient descent to find $\beta^*$, meaning we iteratively take small steps in the direction of steepest decrease of the loss. For the least squares problem the gradient is $\nabla_\beta \ell(\beta) = 2X^T(X\beta - y)$, so, absorbing the constant factor $2$ into the step size, the recursion reads $\beta_{t+1} = \beta_{t} - \eta X^T(X\beta_t - y)$, where $\eta > 0$ is a tuning parameter, the learning rate. As this is the only free parameter, we are left with the following question: how can we choose the learning rate $\eta$ in an optimal way?
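As a concrete illustration, here is a minimal NumPy sketch of this recursion; the data, the fixed step size $\eta$, and the iteration count are arbitrary choices made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Reference solution via lstsq (avoids forming an explicit inverse)
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain gradient descent on the least squares loss
eta = 1e-3  # small fixed step size, picked ad hoc for this example
beta = np.zeros(p)
for _ in range(5000):
    beta = beta - eta * X.T @ (X @ beta - y)

print(np.linalg.norm(beta - beta_star))  # distance to the exact solution
```

With this data the iterates come very close to $\beta^*$, whereas a step size that is too large would make them diverge instead.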

What does it mean for a learning rate $\eta$ to be optimal in gradient descent? Surely it must have to do with the speed at which the iterates $\beta_t$ converge to the solution $\beta^*$. A short calculation yields

$\displaystyle \begin{array}{ll} \beta_{t+1} & = \beta_{t} - \eta X^T(X\beta_t - y) \\ & = \beta_{t} - \eta X^TX\beta_t + \eta X^Ty\\ & = \beta_{t} - \eta X^TX\beta_t + \eta X^TX\beta^*, \end{array}$

where the last equality uses the normal equations $X^Ty = X^TX\beta^*$.

Subtracting $\beta^*$ from both sides gives $\beta_{t+1} - \beta^* = (I - \eta X^T X) (\beta_t - \beta^*).$ Thus, measuring convergence in the Euclidean norm, we obtain

$\displaystyle \|\beta_{t+1} - \beta^*\|_2 \leq \|I - \eta X^T X\|_2 \|\beta_t - \beta^*\|_2,$

where $\|I - \eta X^T X\|_2$ denotes the induced matrix norm (the spectral norm) of $I - \eta X^T X$. Hence, all that is left is to find the value of $\eta$ that minimizes this norm.
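The contraction bound can be sanity-checked numerically; below is a small sketch with synthetic data (dimensions, seed, and step size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Least squares solution via lstsq
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

eta = 1e-3  # arbitrary small step size
rho = np.linalg.norm(np.eye(p) - eta * X.T @ X, 2)  # spectral norm

# One gradient descent step from a random starting point
beta_t = rng.standard_normal(p)
beta_next = beta_t - eta * X.T @ (X @ beta_t - y)

lhs = np.linalg.norm(beta_next - beta_star)
rhs = rho * np.linalg.norm(beta_t - beta_star)
print(lhs <= rhs)  # the one-step error shrinks by at most a factor rho
```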

Since $X^TX$ is symmetric, so is $I - \eta X^T X$, and its eigenvalues are $1 - \eta \lambda_i(X^TX)$. The spectral norm is therefore the largest of these in absolute value, $\|I - \eta X^T X\|_2 = \max\{|1 - \eta \lambda_{\max}(X^TX)|, |1 - \eta \lambda_{\min}(X^TX)|\}$, which we can minimize by balancing the two terms:

$\displaystyle - 1 + \eta \lambda_{\max}(X^TX) = 1 - \eta \lambda_{\min}(X^TX) \Leftrightarrow \eta = \frac{2}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}.$
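As a quick check of this balancing argument, one can scan the spectral norm over a grid of candidate step sizes and compare the grid minimizer against the closed-form value (again a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 3))
G = X.T @ X

lam = np.linalg.eigvalsh(G)  # eigenvalues in ascending order
eta_opt = 2.0 / (lam[-1] + lam[0])

# Spectral norm of I - eta * G over a grid around eta_opt
etas = np.linspace(0.2, 1.8, 801) * eta_opt
norms = [np.linalg.norm(np.eye(3) - eta * G, 2) for eta in etas]
eta_best = etas[int(np.argmin(norms))]

print(eta_best / eta_opt)  # the grid minimizer sits at the closed-form value
```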

Thus, the fastest convergence is achieved by choosing $\displaystyle \eta = \frac{2}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}$, which yields the contraction factor $\displaystyle \|I - \eta X^T X\|_2 = \frac{\lambda_{\max}(X^TX) - \lambda_{\min}(X^TX)}{\lambda_{\max}(X^TX) + \lambda_{\min}(X^TX)}$.
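To close, a short experiment (synthetic data; dimensions and step counts are arbitrary choices) comparing this step size against two more conservative alternatives:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)

lam = np.linalg.eigvalsh(X.T @ X)
eta_opt = 2.0 / (lam[-1] + lam[0])

def gd_error(eta, steps=20):
    """Distance to beta_star after a fixed number of gradient steps."""
    beta = np.zeros(p)
    for _ in range(steps):
        beta = beta - eta * X.T @ (X @ beta - y)
    return np.linalg.norm(beta - beta_star)

# After the same number of steps, the derived eta should come closest
for eta in (eta_opt, 1.0 / lam[-1], 0.5 * eta_opt):
    print(f"eta = {eta:.3e}  error = {gd_error(eta):.3e}")
```

On this data the derived step size leaves a markedly smaller error than the two smaller choices after the same number of iterations.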