A way to express Nesterov Accelerated Gradient in terms of a regular momentum update was noted by Sutskever and co-workers, and perhaps more importantly, when it came to training neural networks, it seemed to work better than classical momentum schemes. This was further confirmed by Bengio and co-workers, who provided an alternative formulation that might be easier to integrate into existing software.
I implemented NAG as part of mize, but found it was a bit more difficult than I’d anticipated to test that I’d got the correct results.
It seems I am not alone in being confused by the exact details of implementing NAG as a momentum scheme (e.g. comments and issues in the deep learning projects keras and mocha.jl, which I found from random googling), so mainly for the benefit of future me, I am going to derive the Sutskever and Bengio formulations from NAG in tedious detail. I am also going to derive an alternative expression that I ended up using in mize. Finally, I will write some R code to demonstrate their equivalence. This will turn out to be trickier than I anticipated.
Wherever possible I will try and stick with the notation used by Sutskever. Apart from the Sutskever and Bengio papers linked to above, it’s worth checking out the appendix to the Sutskever paper (PDF) or the relevant part (chapter 7) of Sutskever’s thesis. For an introduction to NAG itself, try the first part of this paper by O’Donoghue and Candès (the rest of it is also good but not germane to Nesterov momentum expressions).
The goal is to optimize some parameters, \(\theta\). This is going to involve gradient descent, so we will be evaluating the gradient of an objective function of those parameters, \(\nabla f\left(\theta\right)\), and moving a certain distance in the direction of the negative of the gradient, the distance being related to the learning rate, \(\varepsilon\). There will also be a momentum term involved, with momentum coefficient \(\mu\). The value of the parameters, learning rate and momentum at iteration \(t\) will be indicated by a subscript, e.g. \(\theta_t\).
A definition which holds across all the methods discussed here is that the parameters at iteration \(t+1\) are related to the parameters at iteration \(t\) by an update that involves the addition of a velocity vector, \(v\):
\[\theta_{t+1} = \theta_t + v_{t+1} \implies v_{t+1} = \theta_{t+1} - \theta_t\]
This is pretty obvious, but different methods have different velocity defintions, and it’s easy to get confused.
Finally, I’ll try to be strict about names. When I’m referring to the original Nesterov Accelerated Gradient, I’ll call that NAG. When I’m referring to the versions that recast it as a momentum scheme, I’ll call that Nesterov momentum, and refer to either the Sutskever or Bengio forumulation where necessary. The traditional momentum scheme is referred to as “standard” or “regular” momentum by Bengio, and “classical” by Sutskever. I’ll refer to it as “classical” momentum.
The classical momentum velocity vector is defined by Bengio as:
\[v_t = \mu_{t-1}v_{t-1} - \varepsilon_{t-1}\nabla f\left(\theta_{t-1}\right)\]
Sutskever gives the same definition but without any \(t\) subscript on the velocity vector or the momentum coefficient.
We can write out the full update for classical momentum as:
\[\theta_{t+1} = \theta_t + \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t\right)\] with velocity:
\[v_{t+1} = \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t \right)\]
The Nesterov Accelerated Gradient method consists of a gradient descent step, followed by something that looks a lot like a momentum term, but isn’t exactly the same as that found in classical momentum. I’ll call it a “momentum stage” here. It’s important to note that the parameters being minimized by NAG are given the symbol \(y\) by Sutskever, not \(\theta\). You’ll see that \(\theta\) is the symbol for the parameters after they’ve been updated by the gradient descent stage, but before the momentum stage.
Here’s the gradient descent stage: \[\theta_t = y_t - \varepsilon_t \nabla f\left(y_t\right)\] And here’s the momentum stage: \[y_{t+1} = \theta_t + \mu_t\left(\theta_t - \theta_{t-1} \right)\] That concludes one iteration of NAG. The hard part is actually finding the correct learning rate and momentum value in order to get the convergence guarantees that make the method attractive, but that needn’t concern us. Sutskever in his thesis suggests manual tuning to get an optimal result (at least for deep learning applications) so we are in heuristic territory here anyway.
The key idea behind the Sutskever momentum derivation is to shift the perspective about which of the parameters we want as the result of the iteration, from \(y\) to \(\theta\). Rather than having the optimization iterations proceed as “gradient descent, momentum (end of iteration 1), gradient descent, momentum (end of iteration 2), gradient descent etc.” move the boundary of where the iterations end by a half-iteration to get “momentum, gradient descent (end of iteration 1), momentum, gradient descent (end of iteration 2) etc.”. This leaves a phantom gradient descent step that used to be first stage of the first iteration, now floating in the nether regions of iteration zero, but you can just pretend that the starting position is the result of gradient descent from some other arbitrary starting position.
The Sutskever derivation then proceeds as follows. First, rewrite the momentum stage in terms of \(y_t\):
\[y_t = \theta_{t-1} + \mu_{t-1}\left(\theta_{t-1} - \theta_{t-2} \right)\]
Then note that the term in parentheses is just the definition of the velocity vector, so we can write:
\[y_t = \theta_{t-1} + \mu_{t-1}v_{t-1}\]
We now substitute this expression for \(y_t\) into the gradient descent stage:
\[\theta_t = \theta_{t-1} + \mu_{t-1}v_{t-1} - \varepsilon_t \nabla f\left(y_t\right)\] and replace \(y_t\) with an expression in terms of \(\theta_t\):
\[\theta_t = \theta_{t-1} + \mu_{t-1}v_{t-1} - \varepsilon_t \nabla f\left(\theta_{t-1} + \mu_{t-1}v_{t-1}\right)\]
At this point, the expression differs from that given by Sutskever in one detail: the learning rate associated with the gradient descent is currently written as \(\varepsilon_{t}\). But remember that we have now moved the boundaries of the iteration: the gradient descent that would have been the first stage of iteration \(t+1\) is now the second stage of iteration \(t\). So the above expression is more correctly written as:
\[\theta_t = \theta_{t-1} + \mu_{t-1}v_{t-1} - \varepsilon_{t-1} \nabla f\left(\theta_{t-1} + \mu_{t-1}v_{t-1}\right)\]
just as given by Sutskever. And therefore the expression for the parameter update is trivially:
\[\theta_{t+1} = \theta_t + \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t + \mu_t v_t\right)\]
with the velocity vector defined as:
\[v_{t+1} = \mu_t v_t - \varepsilon_t \nabla f\left(\theta_t + \mu_t v_t\right)\]
This looks just like the classical momentum update, except that the gradient is calculated after the momentum update. Hence, one can do NAG by simply reversing the order in which the update is usually carried out: do the momentum stage first, update the parameters, and then do the gradient descent part.
The Bengio formulation of Nesterov momentum starts from the Sutskever definition and then defines a new variable, \(\Theta\), which represents \(\theta\) after the momentum update:
\[\Theta_{t-1} \equiv \theta_{t-1} + \mu_{t-1} v_{t-1}\]
A re-arrangement that will come in handy is: \[\theta_{t-1} = \Theta_{t-1} - \mu_{t-1} v_{t-1}\]
Also, we are going to do something a bit tricky, and that is to keep on using the velocity vector definition from the Sutskever formulation, but with \(\Theta\) substituted in:
\[v_t = \mu_{t-1} v_{t-1} - \varepsilon_{t-1} \nabla f\left(\Theta_{t-1}\right)\] The reason I call this tricky is that the velocity vector still refers to the \(\theta\) update, but we are going to be writing the update in terms of \(\Theta\). You’ll see what I mean.
Anyway, let’s take the Sutskever Nesterov momentum update and substitute in the expression above for \(\theta\) in terms of \(\Theta\):
\[\Theta_{t+1} - \mu_{t+1} v_{t+1} = \Theta_t - \mu_t v_t + \mu_t v_t - \varepsilon_t \nabla f\left(\Theta_t\right)\]
Noting that those two \(\mu_t v_t\) terms cancel out and then rearranging to leave \(\Theta_{t+1}\) on the LHS, we get:
\[\Theta_{t+1} = \Theta_t + \mu_{t+1} v_{t+1} - \varepsilon_t \nabla f\left(\Theta_t\right)\]
We can now substitute in the Sutskever velocity expression, \(v_{t+1}\):
\[ \Theta_{t+1} = \Theta_t + \mu_{t+1}\left[\mu_t v_t - \varepsilon_t \nabla f\left(\Theta_t\right)\right] - \varepsilon_t \nabla f\left(\Theta_t\right) \]
Expanding out the parentheses gives:
\[ \Theta_{t+1} = \Theta_t + \mu_{t+1} \mu_t v_t - \mu_{t+1} \varepsilon_t \nabla f\left(\Theta_t\right) - \varepsilon_t \nabla f\left(\Theta_t\right) \]
and, after grouping the gradient descent parts, finally gives:
\[ \Theta_{t+1} = \Theta_t + \mu_{t+1} \mu_t v_t - \left(1 + \mu_{t+1} \right) \varepsilon_t \nabla f\left(\Theta_t\right) \]
At this point in the other derivations, I isolated the bits on the RHS that wasn’t the old parameter and defined them as the velocity vector. Well, we can’t do that here, as we’re already using the Sutskever definition of the velocity vector which is \(\theta_t - \theta_{t-1}\) and not \(\Theta_t - \Theta_{t-1}\). I said it was a bit tricky.
The advantage of this expression for the Nesterov momentum is that it doesn’t require calculating a gradient at a non-standard position, and only requires a modification to the coefficients used to calculate the velocity, which is probably an easier change to make to an existing codebase which already uses classical momentum.
Before going any further, it’s worth going back to the original expression for the gradient descent part of NAG that Sutskever gives in his thesis (and appendix to the paper):
\[ y_{t+1} = \theta_t + \mu_t\left(\theta_t - \theta_{t-1} \right) = \theta_t + \mu_t v_t \] Let’s just compare that to the Bengio definition of \(\Theta\):
\[ \Theta_{t} \equiv \theta_{t} + \mu_{t} v_{t} \] Yep, that means:
\[ \Theta_{t} \equiv y_{t+1} \]
So the Bengio formulation is really just another way to write the standard NAG update.
Let’s go back to the original formulation of NAG and now instead of making \(\theta\) the variables after gradient descent, we’ll put them where \(y\) was originally used. The variables after gradient descent I’ll refer to as \(\phi\).
\[\phi_t = \theta_t - \varepsilon_t \nabla f\left(\theta_t \right)\] \[\theta_{t+1} = \phi_t + \mu_t\left(\phi_t - \phi_{t-1} \right)\]
Now, let’s just write out the momentum stage in terms of \(\theta\), substituting \(\phi\) wherever we find it:
\[ \theta_{t+1} = \theta_t - \varepsilon_t \nabla f\left(\theta_t \right) + \mu_t \left[\theta_t - \varepsilon_t \nabla f\left(\theta_t \right) - \theta_{t-1} + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) \right] \]
Rearranging:
\[ \theta_{t+1} = \theta_t + \mu_t \left[\theta_t - \theta_{t-1} + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) - \varepsilon_t \nabla f\left(\theta_t \right) \right] - \varepsilon_t \nabla f\left(\theta_t \right) \]
Finally, we can subtitute in \(v_t\) for the first two terms in the square brackets, to give:
\[ \theta_{t+1} = \theta_t + \mu_t \left[v_t + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) - \varepsilon_t \nabla f\left(\theta_t \right) \right] - \varepsilon_t \nabla f\left(\theta_t \right) \]
with velocity:
\[ v_{t+1} = \mu_t \left[v_t + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) - \varepsilon_t \nabla f\left(\theta_t \right) \right] - \varepsilon_t \nabla f\left(\theta_t \right) \]
This looks a lot like the classical momentum expression, but with the velocity vector modified to first remove the contribution of the gradient descent from the previous iteration, and replace it with the gradient descent contribution from the current iteration. Gives an interesting insight into the idea of the Nesterov momentum using a form of “lookahead” with the gradient descent.
You could also choose to expand the velocity expression to make it look a bit like the Bengio formulation:
\[ v_{t+1} = \mu_t \left[v_t + \varepsilon_{t-1} \nabla f\left(\theta_{t-1} \right) \right] - \left(1 + \mu_t\right) \varepsilon_t \nabla f\left(\theta_t \right) \]
but as this version can’t be expressed as the classical momentum form with different coefficients, the way the Bengio formulation can be, it probably doesn’t gain you anything in terms of implementation, except you can expand it and rearrange it further to give:
\[ v_{t+1} = \mu_t v_t - \varepsilon_{t} \nabla f\left(\theta_{t} \right) + \mu_t \left[ \varepsilon_{t-1} \nabla f \left(\theta_{t-1}\right) - \varepsilon_{t} \nabla f \left(\theta_{t}\right) \right] \] which now resembles the classical momentum expression with an extra momentum term. I haven’t found a definitive reference for this expression, or what, if any, extra insight it provides, but user ‘denis’ uses this expression in an answer on the Cross Validated Stack Exchange, and refers to the third form as “gradient momentum”. Another way to look at this is to say that the third term reduces the contribution of the classical momentum component of the update and increases the contribution of the gradient descent, with the degree of weighting controlled by \(\mu_{t}\).
Quasi-hyperbolic momentum uses a similar formulation, but introduces an extra parameter that allows QHM to generalize over classical momentum and NAG, subject to the restriction of a fixed learning rate. The Stack Exchange answer given above also suggests that the “gradient momentum” term can be set independently of \(\mu\) and Denis Yarats is one of the authors of the QHM paper, so perhaps we see the origins of QHM in this answer.
The setup and presentation of the update equations for QHM in that paper make it a bit hard to see the connection between it and the form of NAG given here. The supplemental material in Understanding the Role of Momentum in Stochastic Gradient Methods might be of interest, or some extra notes on QHM I wrote for my own benefit.
Similarly to QHM, we can write a related “generalized” momentum update by introducing a parameter \(\beta\) to give:
\[ v_{t+1} = \mu_t v_t - \varepsilon_{t} \nabla f\left(\theta_{t} \right) + \beta_t \mu_t \left[ \varepsilon_{t-1} \nabla f \left(\theta_{t-1}\right) - \varepsilon_{t} \nabla f \left(\theta_{t}\right) \right] \] where setting \(\beta_t = 0\) will give classical momentum and \(\beta_t = 1\) gives Nesterov.
From the look of the expression, it seems that a major downside of this expression is that calculating the parameters for iteration \(t+1\) requires storage of information from not only iteration \(t\) but from iteration \(t-1\) too. But you don’t necessarily have to do any extra storage with an implementation that used this version of NAG. At the end of an iteration, when saving the velocity vector for the next iteration, you change:
\[v_{t-1} \leftarrow v_t\] to: \[v_{t-1} \leftarrow v_t + \varepsilon_{t} \nabla f\left(\theta_t \right)\]
and then when calculating the momentum term, change:
\[\mu_t v_t\] to: \[\mu_t \left[v_t - \varepsilon_{t} \nabla f\left(\theta_t \right)\right]\]
As a way to understand the difference between classical momentum and NAG, let’s write out the first few steps of the optimization, substituting in the recursive definitions with the previous step, and see what emerges.
Because I have finally got tired of writing out \(\varepsilon_t \nabla f\left(\theta_t \right)\) all over the place (and it’s visually distracting), let’s define:
\[ -\varepsilon_t \nabla f\left(\theta_t \right) \equiv s_t \]
to represent the steepest descent direction. Note that this includes the negative sign. Also, I am going to assume that the momentum is constant, to save on some more notational clutter.
Classical momentum is therefore represented as:
\[ \theta_{t+1} = \theta_t + s_t + \mu v_t \]
And we’ll use this straight-forward expression for NAG:
\[ \theta_{t+1} = \theta_t + s_t + \mu \left(\theta_t + s_t - \theta_{t-1} - s_{t-1} \right) \]
with just this tiny bit of re-arranging so it most closely resembles the classical momentum expression:
\[ \theta_{t+1} = \theta_t + s_t + \mu v_t + \mu \left(s_t - s_{t-1} \right) \]
A consequence of my nomenclature choice is that the initial coordinates are at \(\theta_1\), not \(\theta_0\). Sorry if you find that annoying. In both cases, the first step is always just gradient descent:
\[ \theta_2 = \theta_1 + s_1 \]
The next three steps then proceed as follows:
\[ \theta_3 = \theta_2 + s_2 + \mu v_2 \\ \theta_3 = \theta_2 + s_2 + \mu s_1 \]
\[ \theta_4 = \theta_3 + s_3 + \mu v_3 \\ \theta_4 = \theta_3 + s_3 + \mu \left(s_2 + \mu s_1 \right) \\ \theta_4 = \theta_3 + s_3 + \mu s_2 + \mu^2 s_1 \]
\[ \theta_5 = \theta_4 + s_4 + \mu v_4 \\ \theta_5 = \theta_4 + s_4 + \mu \left( s_3 + \mu s_2 + \mu^2 s_1 \right)\\ \theta_5 = \theta_4 + s_4 + \mu s_3 + \mu^2 s_2 + \mu^3 s_1 \]
\[ \theta_3 = \theta_2 + s_2 + \mu v_2 + \mu \left(s_2 - s_1 \right) \\ \theta_3 = \theta_2 + s_2 + \mu s_1 + \mu s_2 - \mu s_1 \\ \theta_3 = \theta_2 + \left(1 + \mu \right) s_2 \]
\[ \theta_4 = \theta_3 + s_3 + \mu v_3 + \mu \left(s_3 - s_2 \right) \\ \theta_4 = \theta_3 + s_3 + \mu \left[ \left(1 + \mu \right) s_2 \right] + \mu s_3 - \mu s_2 \\ \theta_4 = \theta_3 + \left(1 + \mu \right) s_3 + \mu^2 s_2 \] \[ \theta_5 = \theta_4 + s_4 + \mu v_4 + \mu \left(s_4 - s_3 \right) \\ \theta_5 = \theta_4 + s_4 + \mu \left[ \left(1 + \mu \right) s_3 + \mu^2 s_2 \right] + \mu s_4 - \mu s_3 \\ \theta_5 = \theta_4 + \left(1 + \mu \right) s_4 + \mu^2 s_3 + \mu^3 s_2 \]
You can see a pattern emerging here at this point, I hope. These are quite similar forms, the difference being in how the momentum affects the weighting coefficients.
Here’s a table showing how on the 4th step, the two methods weight the current and previous gradients in deciding on the next direction:
Momentum Type | \(s_4\) | \(s_3\) | \(s_2\) | \(s_1\) |
---|---|---|---|---|
Classical | 1 | \(\mu\) | \(\mu^2\) | \(\mu^3\) |
NAG | 1 + \(\mu\) | \(\mu^2\) | \(\mu^3\) | 0 |
From this it’s clear that the difference between NAG and classical momentum is that NAG puts more weight on recent gradients: in fact, it gives zero weight to the first gradient descent direction after the first iteration, so the second step also be a pure gradient descent step, but with an extra big step size of \(\left(1 + \mu \right) \varepsilon\). Another way of looking at it is to say that NAG forgets old gradients more quickly.
Let’s write some R code to generate a table to compare how the relative weights turn out between CM and NAG at different values of \(\mu\). We’ll normalize the contributions so they sum to 1.
cm_weights <- function(mu) { c(1, mu, mu * mu, mu * mu * mu) }
nag_weights <- function(mu) { c(1 + mu, mu * mu, mu * mu * mu, 0) }
mumat <- function(wfun, mus = c(0.1, 0.25, 0.5, 0.75, 0.9, 0.99)) {
rel_weights <- matrix(nrow = length(mus), ncol = 5)
for (i in 1:length(mus)) {
mu <- mus[i]
weights <- wfun(mu)
rel_weights[i, ] <- c(mu, weights / sum(weights))
}
colnames(rel_weights) <- c("mu", "s4", "s3", "s2", "s1")
rel_weights
}
knitr::kable(mumat(cm_weights), digits = 4)
mu | s4 | s3 | s2 | s1 |
---|---|---|---|---|
0.10 | 0.9001 | 0.0900 | 0.0090 | 0.0009 |
0.25 | 0.7529 | 0.1882 | 0.0471 | 0.0118 |
0.50 | 0.5333 | 0.2667 | 0.1333 | 0.0667 |
0.75 | 0.3657 | 0.2743 | 0.2057 | 0.1543 |
0.90 | 0.2908 | 0.2617 | 0.2355 | 0.2120 |
0.99 | 0.2538 | 0.2512 | 0.2487 | 0.2462 |
knitr::kable(mumat(nag_weights), digits = 4)
mu | s4 | s3 | s2 | s1 |
---|---|---|---|---|
0.10 | 0.9901 | 0.0090 | 0.0009 | 0 |
0.25 | 0.9412 | 0.0471 | 0.0118 | 0 |
0.50 | 0.8000 | 0.1333 | 0.0667 | 0 |
0.75 | 0.6400 | 0.2057 | 0.1543 | 0 |
0.90 | 0.5525 | 0.2355 | 0.2120 | 0 |
0.99 | 0.5050 | 0.2487 | 0.2462 | 0 |
At high momentum, classical momentum is only placing around 25-30% of the weight on the current gradient, whereas for NAG, it’s still at 50-55%.
So that’s how it’s all supposed to work in principle. Here I’ll demonstrate the equivalence of these methods, by implementing them all in simple R code.
The biggest simplification I’ll make is that I’ll assume a constant learning rate and a constant momentum coefficient.
#' Optimization by Classical Momentum
#'
#' @param par Starting point of vector of parameters to optimize.
#' @param fn Objective function to optimize. Takes vector with length of
#' \code{par} and returns a scalar.
#' @param gr Gradient of the objective function \code{fn}. Takes vector with
#' length of \code{par} and returns a vector of the same length.
#' @param lr Learning rate.
#' @param mu Momentum coefficient.
#' @param max_iter Maximum number of iterations to optimize for. First iteration
#' is always steepest descent.
#' @return list with components: \code{par} final set of parameters; \code{f}
#' value of \code{fn} evaluated at the returned set of parameters; \code{fs}
#' vector of function evaluated after each iteration.
cm <- function(par, fn, gr, lr, mu, max_iter = 10) {
fs <- rep(0, max_iter)
v <- rep(0, length(par))
for (i in 1:max_iter) {
g <- gr(par)
v <- mu * v - lr * g
par <- par + v
# store results
f <- fn(par)
fs[i] <- f
}
list(par = par, f = f, fs = fs)
}
This is a reference implementation of classical momentum. Nearly all the parameters and the return values are the same for the other functions (except where noted), so they’re documented here once.
Onto the various NAG implementations. They all have an extra bit of code in them to ensure that on the first iteration only the gradient descent stage is carried out. This is needed to keep the outputs consistent between the different implementations. The cm
routine doesn’t require any extra checks for steepest descent on the first iteration, because initializing the velocity vector to zero results in steepest descent anyway even if the momentum coefficient is non-zero. As we’ll see below, some of the other implementations either don’t explicitly use a velocity vector or need the momentum coefficient to be set to zero on the first iteration to get the same result.
# Optimization by Nesterov Accelerated Gradient
#
# Return list also contains gd_fs: function evaluated after gradient descent
# stage of each iteration; all: function evaluated after gradient descent
# stage and momentum stage, in order.
nag <- function(par, fn, gr, lr, mu, max_iter = 10) {
fs <- rep(0, max_iter)
gd_fs <- rep(0, max_iter)
all <- rep(0, max_iter * 2)
x_old <- rep(0, length(par))
for (i in 1:max_iter) {
# gradient descent stage
g <- gr(par)
x <- par - (lr * g)
# store gradient descent values
f <- fn(x)
gd_fs[i] <- f
all[i * 2 - 1] <- f
# momentum stage and update
par <- x + ifelse(i == 1, 0, mu) * (x - x_old)
x_old <- x
# store momentum values
f <- fn(par)
fs[i] <- f
all[i * 2] <- f
}
list(par = par, f = f, fs = fs, gd_fs = gd_fs, all = all)
}
In this routine, rather than store the velocity vector, we store the previous gradient descent result, x_old
. In order to ensure the first iteration is gradient descent only, we also need to manually set the momentum coefficient mu
to zero on the first iteration, which is what that ifelse
expression does.
Also, there’s some extra code to calculate and store the function values after the gradient descent stage. These aren’t needed for the optimization to work, I just want to keep track of the values to demonstrate the different methods are in fact equivalent.
# Optimization by Sutskever Nesterov Momentum
#
# Return list also contains mu_fs: function evaluated after momentum
# stage of each iteration; all: function evaluated after gradient descent
# stage and momentum stage, in order.
snag <- function(par, fn, gr, lr, mu, max_iter = 10) {
v <- rep(0, length(par))
fs <- rep(0, max_iter)
mu_fs <- rep(0, max_iter)
all <- rep(0, max_iter * 2)
for (i in 1:max_iter) {
# momentum stage and update parameters
mu_step <- mu * v
par <- par + mu_step
# store momentum results
f <- fn(par)
mu_fs[i] <- f
all[i * 2 - 1] <- f
# gradient descent stage
g <- gr(par)
gd_step <- -lr * g
# update and store velocity for next step
par <- par + gd_step
v <- mu_step + gd_step
# store gradient descent results
f <- fn(par)
fs[i] <- f
all[i * 2] <- f
}
list(par = par, f = f, fs = fs, mu_fs = mu_fs, all = all)
}
# Optimization by Bengio Nesterov Momentum
bnag <- function(par, fn, gr, lr, mu, max_iter = 10) {
fs <- rep(0, max_iter)
v <- rep(0, length(par))
for (i in 1:max_iter) {
g <- gr(par)
par <- par + mu * mu * v - (1 + mu) * lr * g
v <- mu * v - lr * g
# store results
f <- fn(par)
fs[i] <- f
}
list(par = par, f = f, fs = fs)
}
Because this implementation only uses a constant momentum coefficient, we replace the \(\mu_{t+1}\mu_{t}\) term with \(\mu_{t}^2\). You would probably have to using a pretty wacky non-constant momentum schedule where this approximation had any major effect anyway… except when the momentum is changing from zero to non-zero, that is, in which case that’s the difference between some momentum and no momentum. We shall return to this point later.
# Optimization by Nesterov Accelerated Gradient, using a momentum-style
# update expression.
mnag <- function(par, fn, gr, lr, mu, max_iter = 10) {
fs <- rep(0, max_iter)
v <- rep(0, length(par))
for (i in 1:max_iter) {
g <- gr(par)
v <- mu * (v - lr * g) - lr * g
par <- par + v
# setup v for the next iteration by removing the old gradient contribution
v <- v + lr * g
# store results
f <- fn(par)
fs[i] <- f
}
list(par = par, f = f, fs = fs)
}
Finally, here is your humble author’s expression for NAG, written as a momentum-style update. To differentiate from the Sutskever and Bengio versions of NAG, I’ll refer to it as momentum-NAG or mNAG.
Tradition dictates that I must demonstrate the use of these optimizers using the 2D Rosenbrock function, with a specific starting point:
par <- c(-1.2, 1)
fn <- function(x) {
x1 <- x[1]
x2 <- x[2]
100 * (x2 - x1 * x1) ^ 2 + (1 - x1) ^ 2
}
gr <- function(x) {
x1 <- x[1]
x2 <- x[2]
c(
-400 * x1 * (x2 - x1 * x1) - 2 * (1 - x1),
200 * (x2 - x1 * x1))
}
Let’s run the optimizers for 100 iterations. The point here is not whether we get really amazing optimization (we don’t), but whether the outputs of the Sutskever and Bengio algorithms are equivalent. It would be a bonus if my mNAG result also worked. For comparison we’ll throw in the classical momentum and as a sanity check, the vanilla NAG routine.
lr <- 0.001
mu <- 0.95
max_iter <- 100
snag_opt <- snag(par, fn, gr, lr, mu, max_iter)
bnag_opt <- bnag(par, fn, gr, lr, mu, max_iter)
nag_opt <- nag(par, fn, gr, lr, mu, max_iter)
mnag_opt <- mnag(par, fn, gr, lr, mu, max_iter)
cm_opt <- cm(par, fn, gr, lr, mu, max_iter)
sbnag_df <- data.frame(Bengio = bnag_opt$fs,
Sutskever = snag_opt$fs,
"Suts Mom" = snag_opt$mu_fs,
"NAG" = nag_opt$fs,
"mNAG" = mnag_opt$fs,
"CM" = cm_opt$fs)
Let’s have a look at the first few iteration results:
knitr::kable(head(sbnag_df), caption = paste(
"First few evaluations of NAG implementations, with lr = ",
formatC(lr), "mu = ", formatC(mu),
collapse = " "))
Bengio | Sutskever | Suts.Mom | NAG | mNAG | CM |
---|---|---|---|---|---|
34.960154 | 5.352912 | 24.200000 | 5.352912 | 34.960154 | 5.352912 |
7.039334 | 6.144886 | 34.960154 | 5.257042 | 7.039334 | 25.541437 |
5.020041 | 4.002116 | 7.039334 | 4.133512 | 5.020041 | 22.488091 |
3.845406 | 3.895340 | 5.020041 | 4.096729 | 3.845406 | 4.321887 |
3.746363 | 3.818670 | 3.845406 | 4.069433 | 3.746363 | 18.826923 |
3.668701 | 3.741945 | 3.746363 | 4.050225 | 3.668701 | 17.960659 |
The first two columns pit the Sutskever vs Bengio formulations directly. And, as we would expect, they’re not the same: the Sutskever iteration result is from the gradient descent stage, and the Bengio result comes from the momentum stage. But if we put the momentum results from the Sutskever formulation up, we can see that they are the same as the Bengio result, but behind by one iteration, i.e. the Bengio result at iteration \(t\) matches the Sutskever momentum stage result at \(t+1\). The final column, indicates the results using the alternative derivation I provided. It matches the Bengio results. Hurrah.
However, none of the results match the vanilla NAG implementation. A clue to what’s going on is from looking at the classical momentum result in the “CM” column. The first iteration of classical momentum is always plain gradient descent, and that’s also how NAG works by construction. We can see that both NAG and CM have the same result on the first iteration, which suggests they’re working correctly. The Sutskever version is also doing gradient descent on its first step. But the other implementations clearly aren’t doing gradient descent on their first iteration, and on subsequent iterations even the Sutskever version diverges from the NAG result.
We’ll get back to this, but let’s just make sure these observations holds up at the end of the table too.
knitr::kable(tail(sbnag_df), caption = paste(
"Last few evaluations of NAG implementations, with lr = ",
formatC(lr), "mu = ", formatC(mu),
collapse = " "))
Bengio | Sutskever | Suts.Mom | NAG | mNAG | CM | |
---|---|---|---|---|---|---|
95 | 0.0142850 | 0.0151685 | 0.0151835 | 0.0590959 | 0.0142850 | 0.4448498 |
96 | 0.0134463 | 0.0142710 | 0.0142850 | 0.0552234 | 0.0134463 | 0.5019262 |
97 | 0.0126630 | 0.0134332 | 0.0134463 | 0.0516306 | 0.0126630 | 0.4306990 |
98 | 0.0119314 | 0.0126508 | 0.0126630 | 0.0482955 | 0.0119314 | 0.3112414 |
99 | 0.0112476 | 0.0119199 | 0.0119314 | 0.0451983 | 0.0112476 | 0.2869773 |
100 | 0.0106082 | 0.0112368 | 0.0112476 | 0.0423208 | 0.0106082 | 0.3721692 |
The Sutskever, Bengio and momentum NAG implementations all still match up in the same way they did. CM and NAG are off doing their own thing. I suppose it’s at least reassuring that for this arbitrarily chosen set of learning rate and momentum coefficient values, the NAG results all do better than classical momentum. And in fact it seems like not doing plain gradient descent on the first iteration gives better results, at least for this set of parameters.
The reason for the differing behaviors of the NAG implementations is down to two things:
For the first point, let’s look at the Bengio formulation again:
\[\Theta_{t+1} = \Theta_t + \mu_{t+1} \mu_t v_t - \left(1 + \mu_{t+1} \right) \varepsilon_t \nabla f\left(\Theta_t\right) \] Let’s see the result for the first iteration with our current strategy of setting the initial velocity vector to zero:
\[\Theta_1 = \Theta_0 - \left(1 + \mu_1 \right) \varepsilon_t \nabla f\left(\Theta_0\right) \] Without also specifying that the momentum coefficient should be zero on the first iteration, we get an extra long gradient descent step. The same applies to the momentum NAG expression.
That explains the Bengio formulation’s difference from the vanilla NAG result. What about the Sutskever formulation? Let’s think about the chain of parameter updates that actually take place over the first few iterations:
For standard NAG, the chain is: gradient descent stage, momentum stage, gradient descent stage, momentum stage. Except the momentum stage results in zero change on iteration one because of the zero velocity vector, so what actually happens is: gradient descent stage, gradient descent stage, momentum stage. Let’s call that g|gm
for short with the bar indicating where the end of an iteration occurs. To make the pattern more obvious, the first three iterations look like g|gm|gm
. This is the same pattern as classical momentum, althought the m
stages are obviously different in content.
What about the Sutskever formulation? Its pattern for the first two iterations is: momentum stage, gradient descent stage, momentum stage, gradient descent stage. Except, once again, the momentum stage doesn’t happen on the first iteration, so you actually get: gradient descent stage, momentum stage, gradient descent stage, and so on. So for the Sutskever forumulation, the first three iterations look like g|mg|mg
.
Therefore, as you can see, the initial set of gradient and momentum stages is different for the Sutskever formulation, which is able to strictly interleave gradient descent and momentum stages, whereas NAG begins, like classical momentum, with two gradient descent stages. This difference shouldn’t be too surprising, because as we saw when deriving the Sutskever formulation, we conveniently “forget” about an initial gradient descent stage compared to NAG.
Anyway, all this means that the parameters end up in different locations after the first iteration.
It’s not clear to me that there is any particular theoretical or practical benefit to trying to bend these different formulations to give either gradient descent on the first iteration or to having them give the same pattern as vanilla NAG. But given the effort that’s been made to show that NAG is like classical momentum in some respects, we may as well see what it would take.
# Sutskever Nesterov Momentum consistent with NAG
#
# Extra parameter wait: wait this number of extra iterations before applying
# momentum. Needed only to sync up with other implementations of nesterov
# momentum: set wait to 1 to make mu_f at iter i match the output of the
# other implementations at iter i-1.
snagc <- function(par, fn, gr, lr, mu, max_iter = 10, wait = 0) {
v <- rep(0, length(par))
fs <- rep(0, max_iter)
mu_fs <- rep(0, max_iter)
all <- rep(0, max_iter * 2)
for (i in 1:max_iter) {
# momentum stage and update parameters
mu_step <- ifelse(i > wait + 1, mu, 0) * v
par <- par + mu_step
# store momentum results
f <- fn(par)
mu_fs[i] <- f
all[i * 2 - 1] <- f
# gradient descent stage
g <- gr(par)
gd_step <- -lr * g
# update and store velocity for next step
par <- par + gd_step
v <- mu_step + gd_step
# store gradient descent results
f <- fn(par)
fs[i] <- f
all[i * 2] <- f
}
list(par = par, f = f, fs = fs, mu_fs = mu_fs, all = all)
}
The Sutskever formulation already does gradient descent on its first iteration, so there’s not a huge change required for the algorithm. Instead, I’ve introduced a new parameter wait
, that indicates the number of iterations to wait before applying momentum. If this is set to 0
, then you get the current behavior. Set it to 1
and the momentum step will be zero on the second iteration, too. That gives us the two gradient descent stages in a row that should sync us up with NAG.
# Bengio Nesterov Momentum consistent with NAG
bnagc <- function(par, fn, gr, lr, mu, max_iter = 10) {
fs <- rep(0, max_iter)
v <- rep(0, length(par))
for (i in 1:max_iter) {
g <- gr(par)
if (i == 1) {
par <- par - lr * g
}
else {
par <- par + (mu * mu * v) - ((1 + mu) * lr * g)
v <- (mu * v) - (lr * g)
}
# store gradient descent results
f <- fn(par)
fs[i] <- f
}
list(par = par, f = f, fs = fs)
}
“Fixing” the Bengio formulation turns out to be a bit more work than you might think. As we noticed earlier, to get gradient decent on the first iteration, \(\mu_1\), needs to be set to zero. However, as a result this means the assumption we made that \(\mu_{t}\mu_{t-1} = \mu_{t-1}^2\) isn’t correct for iteration 2: \(\mu_2\mu_1 = 0 \neq \mu_2^2\). However, we can’t just set mu
to 0 for the second iteration, because the non-zero \(\mu_2\) is used on its own as part of the gradient descent expression.
There are two ways to deal with this: start storing the value of mu
from the previous iteration, even though I purposely chose a constant value of mu
to keep things simple, or just force \(v_1\) to be zero by not updating it on the first iteration. I chose the latter.
# momentum NAG consistent with NAG
mnagc <- function(par, fn, gr, lr, mu, max_iter = 10) {
fs <- rep(0, max_iter)
v <- rep(0, length(par))
for (i in 1:max_iter) {
g <- gr(par)
v <- (ifelse(i == 1, 0, mu) * (v - lr * g)) - lr * g
par <- par + v
# setup v for the next iteration by removing the old gradient contribution
v <- v + lr * g
# store results
f <- fn(par)
fs[i] <- f
}
list(par = par, f = f, fs = fs)
}
And last, the rewritten momentum NAG update. This required the least modification from the original routine: simply set the momentum coefficient to zero on the first iteration.
Time to look at some numbers again. We’ll compare the new versions of the Sutskever, Bengio and Momentum formulations of NAG with vanilla NAG. We’ll also once again pull out the momentum stage results for Sutskever so we can compare directly to the Bengio result.
mnagc_opt <- mnagc(par, fn, gr, lr, mu, max_iter)
snagc_opt <- snagc(par, fn, gr, lr, mu, max_iter, wait = 1)
bnagc_opt <- bnagc(par, fn, gr, lr, mu, max_iter)
ncdf <- data.frame(cBengio = bnagc_opt$fs,
cSutskever = snagc_opt$fs,
"Suts Mom" = snagc_opt$mu_fs,
NAG = nag_opt$fs,
cmNAG = mnagc_opt$fs,
"NAG gd" = nag_opt$gd_fs)
Let’s have a look at the first few iteration results:
knitr::kable(head(ncdf), caption = paste(
"First few evaluations of NAG consistent implementations, with lr = ",
formatC(lr), "mu = ", formatC(mu),
collapse = " "))
cBengio | cSutskever | Suts.Mom | NAG | cmNAG | NAG.gd |
---|---|---|---|---|---|
5.352912 | 5.352912 | 24.200000 | 5.352912 | 5.352912 | 5.352912 |
5.257042 | 4.117790 | 5.352912 | 5.257042 | 5.257042 | 4.117790 |
4.133512 | 4.118466 | 5.257042 | 4.133512 | 4.133512 | 4.118466 |
4.096729 | 4.097143 | 4.133512 | 4.096729 | 4.096729 | 4.097143 |
4.069433 | 4.082870 | 4.096729 | 4.069433 | 4.069433 | 4.082870 |
4.050225 | 4.066122 | 4.069433 | 4.050225 | 4.050225 | 4.066122 |
These are in the same order as the previous table. The Bengio and Sutskever formulations don’t match, as expected. But the Sutskever momentum result in the third column does match the Bengio results from the previous iteration. So relative performance between Sutskever and Bengio has been maintained. But in addition, the NAG result in the fourth column matches the Bengio and Sutskever momentum results. The momentum NAG result has also been correctly updated. Finally, as an extra check, the NAG gradient desent result is shown in the final column, which also matches up with the Sutskever results.
And let’s take a look at the final few iterations, just to make sure everything still holds up:
knitr::kable(tail(ncdf), caption = paste(
"Last few evaluations of NAG consistent implementations, with lr = ",
formatC(lr), "mu = ", formatC(mu),
collapse = " "))
cBengio | cSutskever | Suts.Mom | NAG | cmNAG | NAG.gd | |
---|---|---|---|---|---|---|
95 | 0.0590959 | 0.0631888 | 0.0632715 | 0.0590959 | 0.0590959 | 0.0631888 |
96 | 0.0552234 | 0.0590202 | 0.0590959 | 0.0552234 | 0.0552234 | 0.0590202 |
97 | 0.0516306 | 0.0551542 | 0.0552234 | 0.0516306 | 0.0516306 | 0.0551542 |
98 | 0.0482955 | 0.0515670 | 0.0516306 | 0.0482955 | 0.0482955 | 0.0515670 |
99 | 0.0451983 | 0.0482371 | 0.0482955 | 0.0451983 | 0.0451983 | 0.0482371 |
100 | 0.0423208 | 0.0451446 | 0.0451983 | 0.0423208 | 0.0423208 | 0.0451446 |
That’s a relief.
Yes, NAG can be thought of as being like classical momentum but where you do the momentum step first and then the gradient step. But there’s more to it than that.
Both classical momentum and NAG use a weighted combination of the current and past gradients to determine the next direction of descent. The difference is in the distribution of weights: NAG places more emphasis on recent gradients. So it might succeed better than classical momentum under conditions where the old gradients quickly become uniformative (regions of high curvature?). Personally, I find this view of NAG through the lens of a weighted history of gradients more appealing than the Sutskever formulation because it leads to interesting ideas like Quasi-Hyperbolic Momentum.
In terms of implementation, I pity anyone tasked with implementing Nesterov momentum and demonstrating that they actually got it right. Imagine naively implementing vanilla NAG, the Sutskever formulation and the Bengio formulation. Using identical test functions, starting points and parameters, their output are all different from each other but they’re all correct!