A Glance at Optimization Algorithms for Deep Learning


I am glad you made it here. Since you are reading this, I expect that you are already familiar with terms like ‘Neural Networks’, ‘Backpropagation’, ‘Overfitting’, ‘Underfitting’, ‘Learning’, ‘Gradient Checking’, ‘Loss functions’, etc.

If you aren't, no worries, I will save them for future stories. Here, we can start with optimization.

Optimization is the process of finding good parameters for our neural network (remember, by parameters we mean the weights and biases).

The goal of optimization is to find W (the weights vector) that minimizes the loss function. We will now motivate and slowly develop an approach to optimizing the loss function.

A naive, now obsolete strategy is random search: try many random weights and keep whichever gives the lowest loss (a tiny sketch is below). Phew! That is ridiculous and time consuming.
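
To make the contrast concrete, here is a minimal sketch of that random-search idea (my own illustration; loss_fun, data and the weight shape are assumed placeholders, not from the original post):

import numpy as np

# Random search: draw candidate weight vectors and keep the best one seen so far
best_loss = float('inf')
best_W = None
for _ in range(1000):
  W = np.random.randn(10, 3) * 0.001 # a random candidate (shape is just an example)
  loss = loss_fun(data, W) # evaluate the loss over the whole dataset
  if loss < best_loss:
    best_loss, best_W = loss, W # remember the best weights found so far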

The real story of this topic starts with what is called gradient descent.


Gradient Descent

We can compute the best direction along which to change our weight vector, which is mathematically guaranteed to be the direction of steepest descent (step by step).

This direction will be related to the gradient of the loss function.

In this way, we follow the gradient to reach a minimum, updating the parameters all the way down.

Generally, the loss function is a surface over an n-dimensional space of weights, but we can visualize gradient-based optimization by considering the loss as a curve over a 1-D space of weights.

Mathematically, we can state the update as W ← W − α∇L(W), where α is the learning rate.

Assuming a vector of parameters x and the gradient dx, the simplest update has the form

# Vanilla update
x += - learning_rate * dx

where learning_rate is a hyperparameter (a fixed constant). When evaluated on the full dataset, and when the learning rate is low enough, this is guaranteed to make non-negative progress on the loss function L.

We also have some variants of GD, viz. SGD and mini-batch GD.

Batch Gradient Descent, Mini-batch Gradient Descent and Stochastic Gradient Descent are gradient-optimization techniques that differ in the batch size they use to compute the gradient in each iteration.

Batch Gradient Descent uses all the data to compute gradients and update the weights in each iteration.

# Vanilla Gradient Descent

while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update

Although it reaches the minimum in fewer iterations, it is a much slower process than its counterparts, simply because batch_size = entire_dataset…

It is also difficult for GD to escape saddle points; it could remain stuck there.

Minibatch Gradient Descent takes a subset of the dataset to update its weights in each iteration. It takes more iterations to converge to a minimum, but each iteration is faster compared to (batch) Gradient Descent because of the smaller batch of data used.

# Vanilla Minibatch Gradient Descent

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

Stochastic Gradient Descent (SGD) (also sometimes called on-line gradient descent) is the extreme case of this: a setting where the mini-batch contains only a single example (batch_size = 1).
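
In code, it mirrors the mini-batch snippet above (a minimal sketch; sample_training_data and evaluate_gradient are the same placeholder helpers used earlier):

# Vanilla Stochastic Gradient Descent

while True:
  example = sample_training_data(data, 1) # sample a single example
  weights_grad = evaluate_gradient(loss_fun, example, weights)
  weights += - step_size * weights_grad # perform parameter update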


Even though SGD technically refers to using a single example at a time to evaluate the gradient, you will hear people use the term SGD even when referring to mini-batch gradient descent (mentions of MGD for “Minibatch Gradient Descent” or BGD for “Batch Gradient Descent” are rare), where it is usually assumed that mini-batches are used.

Mathematically, the update for a single example (xᵢ, yᵢ) is W ← W − α∇L(W; xᵢ, yᵢ).

If we capture the last three variants in one picture, it could look like the one below.

Visualization is good food for memory, “Our brain has a better CNN”. XD

Each arrow in these paths is an iteration. It's clear how smooth and balanced the path of Batch Gradient Descent (Vanilla Gradient Descent, really) is compared to the other two. It also takes fewer epochs, but unfortunately more time :(

However, ML's toolkit also has something for the unstable SGD.

Let’s have a look…

SGD with Momentum

  1. The momentum update is a way of smoothing out (denoising) the gradients of SGD so that we enjoy faster convergence.
  2. Algorithm used: “exponential weighting” (motivated by a physical perspective on the optimization problem).
  3. Intuitively, the loss (error function) can be interpreted as the height of a hilly terrain (and therefore also as the potential energy, since U = mgh and therefore U ∝ h).
  4. We know the force on a particle is related to the gradient of the potential energy (F = −∇U), i.e., the force felt by the particle is precisely the (negative) gradient of the loss function.
  5. Since F = ma and velocity is the integral of acceleration, the physics view suggests an update in which the gradient only directly influences the velocity, which in turn changes the position.
  6. Notice this is different from the SGD update shown above, where the gradient directly integrates the position.
# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position

Also, initializing the parameters with random numbers is equivalent to setting a particle with zero initial velocity at some location.

Mathematically, v ← μ·v − α·∇L(W), followed by W ← W + v.

  1. Here we see the introduction of a variable v that is initialized at zero, and an additional hyperparameter (mu).
  2. In optimization this variable is referred to as momentum (its typical value is about 0.9), but its physical meaning is more consistent with a coefficient of friction.
  3. Effectively, this variable damps the velocity and reduces the kinetic energy of the system; otherwise the particle would never come to a stop at the bottom of a hill.

Nesterov Accelerated Gradient (NAG)

This is a slightly different version of the momentum update (Momentum + SGD). It has stronger theoretical convergence guarantees for convex functions, and in practice it also consistently works slightly better than standard momentum.

Core idea

  1. Momentarily ignore the second term (the learning_rate * dx term); momentum alone is about to nudge the parameter vector by mu * v.
  2. We therefore treat the future approximate position x + mu * v as a “lookahead” and compute the gradient there. It is the tip of the green arrow above.
  3. This is a point in the vicinity of where we are soon going to end up, hence it makes more sense to compute the gradient here instead of at the old position x (the big red dot).
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
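
In practice, one often prefers an equivalent form that stores the “ahead” version of the parameters, so the update looks as similar as possible to the standard momentum update. A sketch of that reformulation (here dx is assumed to be the gradient evaluated at the currently stored, ahead parameter vector):

v_prev = v # back up the previous velocity
v = mu * v - learning_rate * dx # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v # position update changes form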

You can read more about it here. Beware! It's a research paper.

Annealing the learning rate

Why do we need this?

In the methods we have seen so far, the learning rate was kept constant, which is actually not good practice. In training deep networks, it is usually helpful to anneal the learning rate over time.

  1. A good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically.
  2. And hence it is unable to settle down into deeper, but narrower parts of the loss function.

For instance, consider the loss function y = x², as in the sketch below.
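
A quick hand-rolled illustration of that bouncing behaviour (my own sketch, assuming plain gradient descent on y = x², whose gradient is 2x):

# Gradient descent on y = x**2; the gradient is dy/dx = 2*x
for r in [1.1, 0.1]: # a too-high and a reasonable learning rate
  xi = 5.0 # starting point
  for _ in range(10):
    xi -= r * 2 * xi # vanilla update: x += -r * dx
  print(r, xi) # r = 1.1 oscillates and blows up; r = 0.1 shrinks towards the minimum at 0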

The obvious remedy for these oscillations is to decay r (the learning rate) with iterations. But knowing when to decay the learning rate can be tricky:

  1. Decay it too slowly and you'll be wasting computation, bouncing around chaotically with little improvement for a long time.
  2. Decay it too aggressively and the system will cool too quickly, unable to reach the best position it can.

Three common implementations of learning rate decay (a code sketch follows the list):

  1. Step decay: reduce the learning rate by some factor every few epochs. A typical heuristic is to watch the validation error while training and cut the learning rate by a fixed factor whenever it stops improving.
  2. Exponential decay has the mathematical form α = α₀ exp(−kt), where α₀ and k are hyperparameters and t is the iteration number.
  3. 1/t decay has the mathematical form α = α₀/(1 + kt), where α₀ and k are hyperparameters and t is the iteration number.
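
A minimal sketch of the three schedules (my own illustration; alpha0, k, decay_factor and step_every are assumed hyperparameters):

import numpy as np

alpha0 = 0.1 # initial learning rate (assumed)

def step_decay(epoch, decay_factor=0.5, step_every=10):
  return alpha0 * (decay_factor ** (epoch // step_every)) # cut the rate every few epochs

def exponential_decay(t, k=0.01):
  return alpha0 * np.exp(-k * t) # alpha = alpha0 * exp(-k*t)

def one_over_t_decay(t, k=0.01):
  return alpha0 / (1 + k * t) # alpha = alpha0 / (1 + k*t)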

In practice, we find that the step decay is slightly preferable because the hyperparameters it involves are more interpretable than the hyperparameter ‘k’.

But not everyone can afford the computational budget to err on the side of slower decay and train for a longer time. So a second popular group of methods, based on Newton's Method, was also devised.

Aren't these too many problems? But hey, hold on! This is exactly why optimization is still a topic of research.

Second Order Methods

In this method,

  1. We compute the Hessian matrix, which is a square matrix of second-order partial derivatives of the function.
  2. The Newton update is x ← x − [Hf(x)]⁻¹ ∇f(x), where ∇f(x) is the gradient vector and Hf(x) is the Hessian matrix.
  3. Intuitively, multiplying by the inverse Hessian leads the optimization to take more aggressive steps in directions of shallow curvature and shorter steps in directions of steep curvature.
  4. Also notice the absence of a learning rate parameter, which is a big advantage over first-order methods (see the sketch after this list).
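
A tiny numpy sketch of one Newton step, just to make the update concrete (grad_fun and hessian_fun are assumed helpers returning the gradient vector and the Hessian matrix of the loss at x):

import numpy as np

# One Newton update: x <- x - inverse(Hessian) * gradient
grad = grad_fun(x) # gradient vector, shape (n,)
H = hessian_fun(x) # Hessian matrix, shape (n, n)
x += -np.linalg.solve(H, grad) # solving H * step = grad avoids forming the inverse explicitly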

But there are always cons…

  1. This update is impractical for most deep learning applications :(
  2. Computing (and inverting) the Hessian in its explicit form is a very costly process in both space and time.
  3. A Neural Network with one million parameters would have a Hessian matrix of size [1,000,000 x 1,000,000], occupying approximately 3725 gigabytes of RAM.

However, it also has a more practical variant, L-BFGS, that you might like to read about.

Why do we need to go Adaptive?

As a matter of fact, the learning rate need not simply be inversely proportional to time (as in the decay schedules we just saw), and it also need not be ‘global’.

Fact: each weight/parameter can have a different learning rate. Why?

The reason is that in a dataset some features are sparse while others are dense, so keeping the same learning rate for all parameters would be improper.

Much work has gone into devising methods that can adaptively tune the learning rates, and even do so per parameter. Many of them still require a hyperparameter, but let's go easy for now and focus on the methods used in practice.

Adagrad (Adaptive Gradients)

This can be mathematically stated as cache ← cache + (∇L)² and W ← W − α·∇L / (√cache + ε), applied element-wise.

And in code..

# Assume the gradient dx and parameter vector x
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
  1. The variable cache has size equal to the size of the gradient and keeps track of the per-parameter sum of squared gradients.
  2. This is then used to normalize the parameter update step, element-wise.
  3. Notice that weights that receive high gradients will have their effective learning rate reduced, while weights that receive small or infrequent updates will have their effective learning rate increased.

Pros: no need to manually tune the learning rate hyperparameter; Adagrad automatically takes care of it when we have a mix of sparse and dense features.

Cons: the cache variable grows monotonically with iterations, correspondingly decreasing the effective learning rate and eventually freezing any further movement (the monotonic decrease is usually too aggressive and stops learning too early).

Adadelta and RMSProp

  1. Both are almost identical.
  2. This update adjusts the Adagrad method in a very simple way. It is an attempt to reduce its aggressive, monotonically decreasing learning rate.
  3. In particular, it uses a moving average of squared gradients.

Did you know? RMSProp is still an unpublished adaptive learning rate method. Amusingly, everyone who uses it in their work currently cites slide 29 of Lecture 6 of Geoff Hinton's Coursera class.

Mathematically, cache ← decay_rate·cache + (1 − decay_rate)·(∇L)² and W ← W − α·∇L / (√cache + ε).

In code,

cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + eps)
  1. decay_rate is a hyperparameter, and typical values are [0.9, 0.99, 0.999].
  2. Notice that the x += update is identical to Adagrad, but the cache variable is “leaky”: it keeps a decaying average of past squared gradients, giving weight (1 − decay_rate) only to the newest squared gradient.

Adam (Adaptive Moment Estimation)

It is among the fastest-converging algorithms known in practice…

Why ‘moment’? In statistics, the mean is considered the first-order moment while the variance is the second-order moment.

There are also some variants of Adam available now (it was published in 2015).

It maintains exponentially decaying averages of both the mean (first moment) and the uncentered variance (second moment) of the gradients to speed up the learning process, as you can see below:

  1. beta1 = 0.9 and beta2 = 0.999 are the recommended values in the Adam paper.
  2. Adam is currently recommended as the default algorithm to use.
  3. However, it is often also worth trying SGD + Nesterov Momentum (NAG) as an alternative (provided, of course, that you can tune the learning rate).
# t is your iteration counter going from 1 to infinity
m = beta1*m + (1-beta1)*dx
mt = m / (1-beta1**t)
v = beta2*v + (1-beta2)*(dx**2)
vt = v / (1-beta2**t)
x += - learning_rate * mt / (np.sqrt(vt) + eps)

Why bias correction? It compensates for the fact that in the first few time steps the vectors m and v are both just initialized (at zero) and are therefore biased towards zero; the correction removes this bias before they fully “warm up”.
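
A quick worked check of the correction at t = 1 (a minimal sketch with a scalar gradient; the numbers are purely illustrative):

beta1, t = 0.9, 1
m, dx = 0.0, 1.0 # m starts at zero, so the raw average is biased towards zero
m = beta1 * m + (1 - beta1) * dx # m = 0.1, far below the true gradient of 1.0
mt = m / (1 - beta1 ** t) # mt = 0.1 / 0.1 = 1.0, the bias is removed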

Summary

I saved the visualization for this part..

Check it out! If the GIFs below don't tempt you to scroll back up, you are surely a genius… let's see.

Notice the “overshooting” behavior of momentum-based methods, which make the optimization look like a ball rolling down the hill.

Who can escape saddle points?

Notice that SGD has a very hard time breaking symmetry and gets stuck on the top. Conversely, algorithms such as RMSprop will see very low gradients in the saddle direction. Due to the denominator term in the RMSprop update, this will increase the effective learning rate along this direction, helping RMSProp proceed.

Images credit: Alec Radford.

We will discuss this again soon!

