Mixed precision training using fastai

Dipam Vasani
Towards Data Science
4 min read · Apr 13, 2019


Introduction

Image classification is the Hello World of deep learning. For me, that project was Pneumonia Detection using Chest X-rays. Since this was a relatively small dataset, I could train my model in about 50 minutes. But what if I told you that, with just one additional line of code, we could reduce the training time (by 50% in theory) without any significant decrease in accuracy? But first…

Why is this important?

The dataset I worked with involved around 4,500 images, and the only reason it took 50 minutes to train was that the images were high resolution.

However, if we scale the same project to a real-world application, there are probably going to be a lot more images. Take a look at Stanford’s CheXpert dataset.

Also, I trained a model to predict just one disease, but in reality we will have to predict a lot more than just two classes. In such scenarios, a reduction in training time would really be a plus: it would reduce the cost of resources and help us experiment faster. So how do we do it?

We become less precise.

It turns out that sometimes, making things less precise in deep learning causes it to generalize a bit better.

- Jeremy Howard

Precision in neural networks

In neural nets, all the floats, i.e. our inputs, weights, and activations, are stored using 32 bits. Using 32 bits gives us a high amount of precision, but higher precision also means more computation time and more memory to store these variables. What if we used only 16 bits instead?

Half precision

One way to reduce memory usage is to perform all operations in half precision (16 bits).

By definition, this would take half the space in RAM, and in theory it could allow you to double your batch size. The increased batch size would mean more operations performed in parallel, thus reducing the training time. However, there are some problems associated with this.
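To see the storage difference, here is a quick PyTorch sketch (illustrative only; the tensor size is an arbitrary assumption):

```python
import torch

w32 = torch.randn(1000, 1000)   # float32: PyTorch's default dtype
w16 = w32.half()                # the same values cast to float16

print(w32.element_size(), w16.element_size())  # 4 bytes vs. 2 bytes per element
print(w32.numel() * w32.element_size(),        # ~4 MB for the FP32 tensor
      w16.numel() * w16.element_size())        # ~2 MB for the FP16 copy
```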

Problems with half precision

1. Imprecise weight updates:

We update the weights of our model as follows:

w = w - learning_rate * w.grad

The problem with performing this operation in half precision is that w.grad is usually really small, and so is our learning_rate, which can make the second term of the equation so small that no update happens at all (see the short demonstration after this list).

2. Gradient underflow:

Similar to imprecise weight updates, if our gradients are too small (below the smallest value that can be represented using 16 bits), they get rounded down to 0.

3. Exploding activations:

A series of matrix multiplications (the forward pass) can easily cause the activations (outputs) of the network to grow so large that they overflow to infinity (and from there turn into NaN).
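To make these three failure modes concrete, here is a small PyTorch sketch; the numbers are illustrative assumptions chosen to sit just outside FP16’s usable range:

```python
import torch

# 1. Imprecise weight update: the step is smaller than FP16 can resolve near 1.0,
#    so the weight does not move at all.
w = torch.tensor(1.0, dtype=torch.float16)
step = torch.tensor(1e-4, dtype=torch.float16)   # stands in for learning_rate * w.grad
print(w - step)                                  # tensor(1., dtype=torch.float16) -- no change

# 2. Gradient underflow: values below FP16's smallest subnormal (~6e-8) become 0.
g = torch.tensor(1e-8, dtype=torch.float16)
print(g)                                         # tensor(0., dtype=torch.float16)

# 3. Exploding activations: anything above FP16's maximum (~65,504) overflows to inf.
a = torch.tensor(70000.0, dtype=torch.float16)
print(a)                                         # tensor(inf, dtype=torch.float16)
```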

The way to solve these problems is to use mixed precision training.

(Figure from the mixed precision training paper, source: https://arxiv.org/pdf/1710.03740.pdf)

Mixed Precision Training

Full Jupyter notebook.

As the name suggests, we don’t do everything in half precision. We perform some operations in FP16 and others in FP32. More specifically, we keep a master copy of the weights in FP32 and perform the weight updates in 32-bit precision. This takes care of Problem #1.

To overcome gradient underflow, we use something called gradient (or loss) scaling: we multiply our loss by a scaling factor before the backward pass, which keeps the gradients from falling below the range FP16 can represent and hence from getting replaced by 0, and we divide them by the same factor before the weight update. We also make sure that the scaling factor is not so large that it causes our activations or gradients to overflow.
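A toy sketch of these two ideas in plain PyTorch, assuming an NVIDIA GPU and a static loss scale of 512 (both assumptions; fastai wraps this bookkeeping in a callback so you never write it yourself):

```python
import torch
import torch.nn.functional as F

# Placeholder model and data, just to show the mechanics.
model = torch.nn.Linear(10, 1).cuda().half()    # FP16 copy used for forward/backward
master = [p.detach().float().requires_grad_()   # FP32 master copy of the weights
          for p in model.parameters()]
opt = torch.optim.SGD(master, lr=1e-3)
scale = 512.0                                   # static loss scale (assumed value)

x = torch.randn(32, 10, device='cuda', dtype=torch.float16)
y = torch.randn(32, 1, device='cuda', dtype=torch.float16)

loss = F.mse_loss(model(x), y)
(loss * scale).backward()                       # scaled loss keeps FP16 grads above underflow

for p16, p32 in zip(model.parameters(), master):
    p32.grad = p16.grad.float() / scale         # unscale the gradients in FP32
opt.step()                                      # weight update happens in FP32 (fixes Problem #1)

with torch.no_grad():
    for p16, p32 in zip(model.parameters(), master):
        p16.copy_(p32.half())                   # copy updated weights back to the FP16 model
model.zero_grad()
```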

Using these ideas we avoid the problems of half precision and train our network efficiently.
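In fastai v1, enabling all of this is the single extra line promised at the start: calling to_fp16() on the Learner. A minimal sketch, where the dataset path, transforms, architecture, and schedule are assumptions rather than the exact settings from my notebook:

```python
from fastai.vision import *            # standard fastai v1 import style
from fastai.metrics import accuracy

# Hypothetical folder layout; the .to_fp16() call at the end is the whole point.
data = ImageDataBunch.from_folder('data/chest_xray', ds_tfms=get_transforms(), size=224)
learn = cnn_learner(data, models.resnet34, metrics=accuracy).to_fp16()
learn.fit_one_cycle(4)
```

Everything else about training stays the same; fastai converts the model and handles the loss scaling behind the scenes.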

Final thoughts

The idea of mixed precision training has only been around for a couple of years, and not all GPUs support it. But it’s an idea worth knowing, and it will likely be used a lot more in the future.

Results show that it reduces training time without any significant effect on accuracy.

That will be it for this article.

If you want to learn more about deep learning, check out my series of articles on the same topic.

~happy learning.
