How to Overfit Your Model

Lots of articles out there talk about how not to overfit. This time, let’s discuss it the other way around.

--

What an overfitted model looks like. Source: https://id.wikipedia.org/wiki/Overfitting

According to Andrew Ng’s Machine Learning Yearning book, there are several things that affect the variance of a model (a.k.a. its overfitting rate, or the gap between training and testing accuracy), two of which are listed below.

  1. The amount of training data
  2. The number of features

In this article, I would like to demonstrate both of these points experimentally.

Let’s talk about the first point first: the amount of training data. In theory, more training samples can reduce overfitting quite significantly. In my experience building a CNN model for pneumonia detection (here’s the dataset), using the ImageDataGenerator() object from Keras to augment the image data did help reduce the overall variance of the model. The augmentation itself works by applying random rotations, shifts, zooms, etc. to the existing training images. Essentially, the CNN never knows that those pictures are generated from the exact same source images, so the technique makes it seem as though the model is being trained on a much larger dataset. Therefore (back to the main topic), if you want your model to overfit, just use a small amount of training data and never apply data augmentation.
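For context, here is roughly what that augmentation setup looks like in Keras. This is just a minimal sketch: the parameter values, directory path, and image size below are my own placeholders, not necessarily the exact ones used in the experiment.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Generator that produces randomly transformed copies of the training images.
train_gen = ImageDataGenerator(
    rescale=1./255,          # normalize pixel values to [0, 1]
    rotation_range=15,       # random rotations (in degrees)
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    zoom_range=0.1,          # random zooms
)

# Stream augmented batches straight from a folder of images.
train_flow = train_gen.flow_from_directory(
    'chest_xray/train',          # placeholder path
    target_size=(200, 200),
    color_mode='grayscale',
    class_mode='categorical',
    batch_size=32,
)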

Time for the proof. The neural network architecture displayed below is the one I used to classify whether a person is healthy, suffers from bacterial pneumonia, or suffers from viral pneumonia based on their chest X-ray image.

Model: "functional_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 200, 200, 1)] 0
_________________________________________________________________
conv2d (Conv2D) (None, 200, 200, 16) 160
_________________________________________________________________
conv2d_1 (Conv2D) (None, 200, 200, 32) 4640
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 100, 100, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 100, 100, 16) 2064
_________________________________________________________________
conv2d_3 (Conv2D) (None, 100, 100, 32) 2080
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 50, 50, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 80000) 0
_________________________________________________________________
dense (Dense) (None, 100) 8000100
_________________________________________________________________
dense_1 (Dense) (None, 50) 5050
_________________________________________________________________
dense_2 (Dense) (None, 3) 153
=================================================================
Total params: 8,014,247
Trainable params: 8,014,247
Non-trainable params: 0
_________________________________________________________________
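For reference, here is a sketch that reproduces this architecture with the Keras functional API. The kernel sizes are inferred from the parameter counts in the summary (3x3 for the first two convolutions, 2x2 for the next two), and the activations and optimizer are my assumptions rather than details taken from the original code.

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(200, 200, 1))
x = layers.Conv2D(16, (3, 3), padding='same', activation='relu')(inputs)
x = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)                    # 200x200 -> 100x100
x = layers.Conv2D(16, (2, 2), padding='same', activation='relu')(x)
x = layers.Conv2D(32, (2, 2), padding='same', activation='relu')(x)
x = layers.MaxPooling2D((2, 2))(x)                    # 100x100 -> 50x50
x = layers.Flatten()(x)                               # 50*50*32 = 80,000
x = layers.Dense(100, activation='relu')(x)
x = layers.Dense(50, activation='relu')(x)
outputs = layers.Dense(3, activation='softmax')(x)    # 3 classes

model = models.Model(inputs, outputs)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])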

The idea behind this experiment is that I trained the exact same model twice: first with image data augmentation, then without it. A rough sketch of the two runs is shown below, followed by what the training process looks like after 45 epochs with augmentation enabled.
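Assuming train_flow is the augmented generator from earlier and that the raw arrays and validation data are already loaded, the two runs would look something like this (all variable names here are placeholders):

# Run 1: train on augmented batches coming from the generator.
history_aug = model.fit(train_flow,
                        validation_data=val_flow,
                        epochs=45)

# Run 2: train directly on the raw image arrays with no augmentation
# (in practice the model would be rebuilt first so the weights start fresh).
history_raw = model.fit(x_train, y_train,
                        validation_data=(x_val, y_val),
                        epochs=45,
                        batch_size=32)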

Epoch 43/45
163/163 [==============================] - 17s 104ms/step - loss: 0.4973 - acc: 0.7910 - val_loss: 0.7446 - val_acc: 0.7869
Epoch 44/45
163/163 [==============================] - 17s 104ms/step - loss: 0.5017 - acc: 0.7899 - val_loss: 0.8871 - val_acc: 0.7340
Epoch 45/45
163/163 [==============================] - 17s 102ms/step - loss: 0.5007 - acc: 0.7841 - val_loss: 0.6466 - val_acc: 0.8141

As you can see from the progress bars above, the accuracy on the training data and on the validation data stays pretty close together. We can therefore conclude that this model does not suffer from overfitting.

But now let’s do the second run. This time I do not use data augmentation, and below are the last 3 training epochs.

Epoch 43/45
163/163 [==============================] - 4s 26ms/step - loss: 0.0053 - acc: 0.9977 - val_loss: 5.4777 - val_acc: 0.6362
Epoch 44/45
163/163 [==============================] - 4s 26ms/step - loss: 0.0015 - acc: 0.9996 - val_loss: 5.2033 - val_acc: 0.6442
Epoch 45/45
163/163 [==============================] - 4s 27ms/step - loss: 2.3021e-04 - acc: 1.0000 - val_loss: 5.8940 - val_acc: 0.6410

Alright, the result above shows that the model is so overfitted that the training accuracy hits exactly 100% while the validation accuracy does not even reach 65%. So yeah, back to the topic: IF YOU WANNA MAKE YOUR MODEL OVERFIT, JUST USE A SMALL AMOUNT OF DATA. Keep that in mind.

Next up, let’s talk about how the number of features affects classification performance. Unlike the previous experiment, here I am going to use a potato leaf disease dataset (it’s here). What I basically want to do is train two CNN models (again), where the first one uses max-pooling and the second one doesn’t. Below is what the first model looks like.

Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_10 (Conv2D) (None, 254, 254, 16) 448
_________________________________________________________________
conv2d_11 (Conv2D) (None, 252, 252, 32) 4640
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 63, 63, 32) 0
_________________________________________________________________
flatten_6 (Flatten) (None, 127008) 0
_________________________________________________________________
dense_18 (Dense) (None, 100) 12700900
_________________________________________________________________
dense_19 (Dense) (None, 50) 5050
_________________________________________________________________
dense_20 (Dense) (None, 3) 153
=================================================================
Total params: 12,711,191
Trainable params: 12,711,191
Non-trainable params: 0
_________________________________________________________________

As you can see in the model summary above, the max-pooling layer is placed right after the last convolution layer. The size of this pooling layer is 4x4, although that isn’t shown in the summary. Also, note that the number of neurons in the flatten layer is only 127,008.
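To make the structure concrete, here is a sketch of this first model as a Keras Sequential network, reconstructed from the summary above; the 256x256 RGB input size is inferred from the output shapes, and the activations are my assumptions.

from tensorflow.keras import layers, models

model_pooled = models.Sequential([
    layers.Conv2D(16, (3, 3), activation='relu', input_shape=(256, 256, 3)),
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D(pool_size=(4, 4)),   # 252x252x32 -> 63x63x32
    layers.Flatten(),                        # 63*63*32 = 127,008 features
    layers.Dense(100, activation='relu'),
    layers.Dense(50, activation='relu'),
    layers.Dense(3, activation='softmax'),
])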

The training progress output looks like below. Here you can see that the model performs very well, with the accuracy on the validation data reaching around 93%.

Epoch 18/20
23/23 [==============================] - 1s 31ms/step - loss: 1.0573e-05 - acc: 1.0000 - val_loss: 0.3191 - val_acc: 0.9389
Epoch 19/20
23/23 [==============================] - 1s 31ms/step - loss: 9.6892e-06 - acc: 1.0000 - val_loss: 0.3215 - val_acc: 0.9389
Epoch 20/20
23/23 [==============================] - 1s 31ms/step - loss: 8.9155e-06 - acc: 1.0000 - val_loss: 0.3233 - val_acc: 0.9389

But remember that we want the model to be AS OVERFIT AS POSSIBLE. So my approach here is to modify the CNN so that it no longer contains a max-pooling layer. Below are the details of the modified CNN model.

Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_6 (Conv2D) (None, 254, 254, 16) 448
_________________________________________________________________
conv2d_7 (Conv2D) (None, 252, 252, 32) 4640
_________________________________________________________________
flatten_4 (Flatten) (None, 2032128) 0
_________________________________________________________________
dense_12 (Dense) (None, 100) 203212900
_________________________________________________________________
dense_13 (Dense) (None, 50) 5050
_________________________________________________________________
dense_14 (Dense) (None, 3) 153
=================================================================
Total params: 203,223,191
Trainable params: 203,223,191
Non-trainable params: 0
_________________________________________________________________

Now, as you can see in the summary above, the number of neurons in the flatten layer is 2,032,128, far more than in the CNN that contains a pooling layer. Remember, in a CNN the actual classification is done by the dense layers, which basically means that the number of neurons in this flatten layer acts kind of like the number of input features.
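The arithmetic behind those two flatten sizes is simple enough to check directly:

# Feature count going into the dense layers, with and without pooling.
with_pooling    = 63 * 63 * 32     # after the 4x4 max-pooling layer
without_pooling = 252 * 252 * 32   # no pooling at all

print(with_pooling)     # 127008
print(without_pooling)  # 2032128, sixteen times as many "features"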

Theoretically speaking, the absence of the pooling layer should make the model overfit more, since the number of features is far higher than in the previous CNN model. To prove it, let’s just fit the model and look at the result below.

Epoch 15/20
23/23 [==============================] - 2s 65ms/step - loss: 0.0016 - acc: 1.0000 - val_loss: 2.4898 - val_acc: 0.6667
Epoch 16/20
23/23 [==============================] - 2s 68ms/step - loss: 0.0014 - acc: 1.0000 - val_loss: 2.5055 - val_acc: 0.6667
Epoch 17/20
23/23 [==============================] - 2s 66ms/step - loss: 0.0013 - acc: 1.0000 - val_loss: 2.5193 - val_acc: 0.6667
Epoch 18/20
23/23 [==============================] - 1s 65ms/step - loss: 0.0011 - acc: 1.0000 - val_loss: 2.5339 - val_acc: 0.6611
Epoch 19/20
23/23 [==============================] - 2s 65ms/step - loss: 0.0010 - acc: 1.0000 - val_loss: 2.5476 - val_acc: 0.6611
Epoch 20/20
23/23 [==============================] - 2s 65ms/step - loss: 9.2123e-04 - acc: 1.0000 - val_loss: 2.5602 - val_acc: 0.6611

And yes, we’ve reached our goal. We can clearly see here that the validation accuracy is only around 66% while the training accuracy is exactly 100%.

Long story short, here are two things you can do if you wanna overfit your model:

  1. Use less training data
  2. Use more features

I hope that by knowing how to make a model overfit, you now also know how not to do so. Thanks for reading :)

