Regularization in neural networks
I want to start this article with a funny little analogy that compares training a model to buying pants. We can buy a pair that is too small (underfitting), get one that fits just right, or end up with one that is too big (overfitting). It also tells us that we’d rather overfit than underfit, because in that case we can use a belt or something and still wear the pants. And that’s what this article is about.
Regularization
Regularization refers to techniques that help our model generalize to data it hasn’t seen before, rather than just fit the training data well. We’ve already seen how to regularize our models using data augmentation and weight decay. This time we will learn about another regularization method known as dropout.
Dropout
A fully connected neural network is nothing but a bunch of parameters (or weights) and activations. Our inputs go through this network and come out as predictions. We then compare those predictions with the targets, back propagate the error through the layers of our network and update our weights accordingly.
Now what if the weights in some part of our network are so big that they have a far greater impact on our predictions than other parts of our network? The network can end up leaning on those few parts to memorize our inputs, which leads to overfitting. And we absolutely want to avoid that.
Hence what we do is randomly turn off certain parts of our network while training. Each time we pass a mini-batch through the network, a different random bunch of neurons is turned off. This method is known as dropout.
But how do we decide which neurons to throw away? We assign each neuron a probability ps of being dropped. Based on this probability, the neuron either computes its output as usual or has its output set to zero.
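As a rough illustration of that per-neuron coin flip, here is a minimal sketch in plain Python. The helper name apply_dropout and the example activations are made up for this article, and the sketch leaves out the rescaling that real libraries add on top; it only shows the random on/off decision.

```python
import random

def apply_dropout(activations, p, rng):
    """Zero each activation independently with probability p (hypothetical helper)."""
    out = []
    for a in activations:
        # rng.random() is uniform in [0, 1); the neuron is dropped when the draw falls below p
        keep = rng.random() >= p
        out.append(a if keep else 0.0)
    return out

rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [0.5, 1.2, 0.3, 2.0, 0.7, 1.1]

# With p=0.5, roughly half of the activations are zeroed on each call
print(apply_dropout(activations, p=0.5, rng=rng))
print(apply_dropout(activations, p=0.5, rng=rng))  # a different random subset
```

Calling the function twice on the same activations generally zeroes a different subset each time, which is the behaviour we want from one mini-batch to the next.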
In fastai learners, the probabilities are passed layer-wise as a list, so that every layer can have a different dropout probability.
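# fastai v1; data, y_range and exp_rmspe are assumed to be defined earlier
from fastai.tabular import *
learn = tabular_learner(data, layers=[1000, 500],
                        ps=[0.001, 0.01], emb_drop=0.04,
                        y_range=y_range, metrics=exp_rmspe)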
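To get a feel for what these numbers mean, we can work out how many units we expect to lose on each pass. This is only back-of-the-envelope arithmetic, assuming each entry of ps applies to the activations of the matching entry of layers; the variable names are just for illustration.

```python
layers = [1000, 500]   # hidden layer sizes from the call above
ps = [0.001, 0.01]     # dropout probability for each of those layers

# Expected number of dropped units = number of units * drop probability
expected_dropped = [n_units * p for n_units, p in zip(layers, ps)]

for n_units, p, dropped in zip(layers, ps, expected_dropped):
    print(f"{n_units} units, p={p}: about {dropped:.0f} dropped per pass")
```

So with these settings we drop about 1 unit out of 1000 in the first layer and about 5 out of 500 in the second, which is a very light amount of dropout.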
Note: Some nodes may be turned off more often than others, but since we are doing it over and over again, on average every node gets the same treatment. Also note that dropout is only used at training time. At test time, all the nodes of our network are always present.
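The train versus test behaviour can be sketched with a toy layer. This is not the fastai or PyTorch code, just an illustrative class written for this article. It follows the "inverted dropout" convention, which PyTorch's nn.Dropout also documents: while training, the surviving activations are scaled by 1/(1-p) so that their average stays the same, and at test time the layer passes everything through untouched.

```python
import random

class ToyDropout:
    """Toy inverted-dropout layer (illustrative, not a library class)."""

    def __init__(self, p, seed=0):
        assert 0.0 <= p < 1.0
        self.p = p
        self.training = True
        self.rng = random.Random(seed)

    def __call__(self, activations):
        if not self.training or self.p == 0.0:
            # Test time: every node is present and nothing is changed
            return list(activations)
        scale = 1.0 / (1.0 - self.p)
        # Training: drop with probability p, scale the survivors by 1/(1-p)
        return [a * scale if self.rng.random() >= self.p else 0.0
                for a in activations]

layer = ToyDropout(p=0.5)
x = [1.0] * 10000

layer.training = True
train_out = layer(x)                          # a mix of 0.0 and 2.0
train_mean = sum(train_out) / len(train_out)  # stays close to 1.0 on average

layer.training = False
test_out = layer(x)                           # identical to x

print(round(train_mean, 2), test_out == x)
```

Because of the scaling, the next layer sees activations of roughly the same size in both modes, so nothing has to be adjusted when we switch from training to prediction.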
Values of ps
In general, what should the values of ps be? If we have too much dropout, that will reduce the capacity of our model and it’s going to underfit. So just like weight decay or learning rate, we really need to play around with this until we get a sense of what works best.
Final notes
We’ve now seen three methods for regularizing our deep learning models: data augmentation, weight decay and dropout. There’s no absolute best practice for which ones to use and which ones to avoid. We generally want a bit of both weight decay and dropout. Data augmentation is also pretty useful since it doesn’t require collecting any new data. It’s basically free data. How we use it really depends on our application.
Let’s take brightness for example. For most applications, we transform our image only as far as it stays clear what the picture is of. We don’t want to go too dark or too bright. We also take our validation set into consideration to get a sense of what kind of transforms we might want.
One of the big opportunities for research is to figure out ways to do data augmentation in other domains. So how can you do data augmentation with text data, or genomic data, or histopathology data, or whatever. Almost nobody’s looking at that, and to me, it’s one of the biggest opportunities that could let you decrease data requirements by like five to ten X.
-Jeremy Howard
That will be it for this article.
~happy learning.