Understanding Few-shot learning through an experiment

Machine learning and deep learning have always been data-hungry: the more data you feed a neural network, the better it tends to generalize to the test set. But the field has been growing so rapidly that datasets are scaling exponentially, and training on such huge datasets is feasible only for the big tech companies. The advantage every other researcher gets from these models (trained on huge datasets) is the ability to fine-tune them on their own data and use them for their own tasks, without worrying about collecting such huge datasets themselves.

Recently, I came across the notion of few-shot learning, in which pre-trained models are fine-tuned with a very limited amount of data. Few-shot learning means training the model with only a few examples per class and then evaluating it on the test set. For example, 8-shot learning means providing only 8 samples per class for training. I was eager to try this out on my personal laptop with a 4GB GPU, so I carried out a small experiment.


Experiment

The goal of the experiment is to test how an ImageNet pre-trained model performs on a flower classification problem when provided with a limited number of data points per class, N, where N takes the values 128, 64, 32, 16, 8, 4, 2, and 1. The whole experiment is done using PyTorch. Let’s first have a look at the dataset used.

The Flowers Photos Dataset

This dataset contains photos belonging to 5 classes: tulips, roses, sunflowers, daisies, and dandelions. The test set was created by sampling 20% of the images from each class, giving a constant test size of 731. The training set size was simply number_of_shots x 5 (max = 640, min = 5); a sketch of this sampling appears at the end of this subsection.

Flowers Photos Dataset Categories

The objective here is to demonstrate few-shot learning, so if the dataset looks simple to any reader, that is intentional: it is chosen for demonstration purposes rather than as a research problem.
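
For concreteness, the per-class sampling described above could look roughly like the sketch below. The folder layout (one sub-folder per class, as in the original flower_photos archive), the function name make_few_shot_split, and the seeding are my assumptions, not the exact code used in the experiment.

import os
import random

def make_few_shot_split(root, num_shots, test_frac=0.2, seed=0):
    # Returns (train_files, test_files) as lists of (image_path, class_index) pairs
    rng = random.Random(seed)
    classes = sorted(os.listdir(root))  # assumes one sub-folder per class
    train_files, test_files = [], []
    for idx, cls in enumerate(classes):
        files = sorted(os.listdir(os.path.join(root, cls)))
        rng.shuffle(files)
        n_test = int(len(files) * test_frac)
        # 20% of each class goes to the fixed test set
        test_files += [(os.path.join(root, cls, f), idx) for f in files[:n_test]]
        # Only num_shots images per class are kept for training
        train_files += [(os.path.join(root, cls, f), idx) for f in files[n_test:n_test + num_shots]]
    return train_files, test_files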

Models

The models for this experiment were selected with a focus on small and efficient architectures. The experiment is performed with 5 different models:

  1. ResNet-18 — (11,179,077 Parameters)
  2. ResNet-34 — (21,287,237 Parameters)
  3. MobileNetV2 — (3,507,437 Parameters)
  4. MobileNetV3 (small) — (2,545,421 Parameters)
  5. MobileNetV3 (large) — (5,485,897 Parameters)

These are torchvision models pre-trained on ImageNet. The ResNet models are larger than the MobileNet models, so this selection could offer some interesting observations.
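
The parameter counts listed above can be checked with a short snippet like the following (a sketch only; the printed counts correspond to the original 1000-class ImageNet heads, so they differ slightly from the 5-class figures above):

import torchvision

# Load the ImageNet-pretrained backbones used in the experiment
backbones = {
    "ResNet-18": torchvision.models.resnet18(pretrained=True),
    "ResNet-34": torchvision.models.resnet34(pretrained=True),
    "MobileNetV2": torchvision.models.mobilenet_v2(pretrained=True),
    "MobileNetV3 (small)": torchvision.models.mobilenet_v3_small(pretrained=True),
    "MobileNetV3 (large)": torchvision.models.mobilenet_v3_large(pretrained=True),
}
for name, m in backbones.items():
    # Count all parameters of each pretrained model
    print(name, sum(p.numel() for p in m.parameters()))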

Procedure

  1. Create a dataset: In this step, the task is to create a test set and a training set from the complete dataset. The test set is kept constant for all shots. The training set is created from the remaining data (original dataset minus the test set) by randomly sampling N (number of shots) images per class. Once created, the training set is fixed across multiple runs (the reason for multiple runs is explained later).
  2. Initialize the model: Initialize the selected model with pretrained=True. Then replace the classifier/fully connected layer with a linear layer mapping to the number of classes (5 in this case).
import torch
import torch.nn as nn
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Select the model
model = torchvision.models.mobilenet_v3_small(pretrained=True)
# Change the classifier to map the 576 backbone features to our dataset classes
model.classifier = nn.Linear(576, len(FewShotsDataset.classes))
model = model.to(device)

3. Train the model: The model is trained for 32 epochs with a learning rate of 0.0005 and an exponential learning rate scheduler (gamma = 0.95). These hyperparameters are kept constant across all models and runs to ensure similar testing conditions (a sketch of the training loop follows the note below).

Note: I briefly experimented with different learning rates for different models, but the performance remained approximately the same, so I kept it this way.
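
A minimal sketch of such a training loop is shown below. The optimizer choice (Adam) and the train_loader built over the N-shot split are my assumptions; the article fixes only the learning rate, the scheduler, and the number of epochs.

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)  # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

best_train_acc, best_state = 0.0, None
for epoch in range(32):
    model.train()
    correct, total = 0, 0
    for images, labels in train_loader:  # train_loader: DataLoader over the N-shot training set (assumed)
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        correct += (outputs.argmax(1) == labels).sum().item()
        total += labels.size(0)
    scheduler.step()
    train_acc = correct / total
    # Keep the weights of the epoch with the best training accuracy (used later for testing)
    if train_acc > best_train_acc:
        best_train_acc = train_acc
        best_state = {k: v.clone() for k, v in model.state_dict().items()}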

4. Test the model: The final phase involves testing the model (the checkpoint with the best training accuracy) on the complete test set and recording the accuracy.
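
A matching evaluation sketch, assuming a test_loader built over the fixed 731-image test set and the best_state checkpoint saved in the training loop above:

model.load_state_dict(best_state)
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:  # test_loader: DataLoader over the fixed test split (assumed)
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"Test accuracy: {correct / total:.4f}")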

Observations

The table below summarizes the experiment for different numbers of training examples.

  1. One interesting point is that the smallest model (parameter-wise), MobileNetV3 (small), performed the worst, while the second smallest model, MobileNetV2, performed the best. This suggests that not all compressed models will necessarily work well for few-shot learning, but around some size threshold a compact model can still give the best performance.
  2. Each successive row of the table halves the number of training samples. Reducing the training samples from 128 to 64 causes no significant performance drop, but a significant drop is observed when the training samples are reduced to an absolute minimum like 1, 2, or 4.
  3. This can also be observed from the following graph (the x-axis of the graph is on a logarithmic scale).
Few-Shot on Flowers Photos Dataset

Another observation was a fluctuation of test accuracies at the lower shot counts. For example, the 1-shot training of ResNet-18 over 10 runs is shown below:

1-shot training of ResNet-18 over 10 runs

Therefore, to address this issue, every result (for all shots and all models) is averaged over 10 runs, and the averaged test accuracies are reported.
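
In code, this averaging is simply a loop over independent runs; build_and_train and evaluate below are hypothetical helpers wrapping steps 2–3 and step 4 of the procedure:

run_accuracies = []
for run in range(10):
    # build_and_train / evaluate are hypothetical helpers wrapping the procedure above
    model = build_and_train(num_shots=1)
    run_accuracies.append(evaluate(model))
mean_acc = sum(run_accuracies) / len(run_accuracies)
print(f"Average test accuracy over 10 runs: {mean_acc:.4f}")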

Conclusion

The objective of this experiment was to test whether a relatively easy task like flower classification really needs a large dataset and a heavy model (like the ResNets). The results indicate that lighter models, even with minimal data points, can give satisfactory performance.

The experiment also demonstrated few-shot learning and the effect of reducing the number of shots on different models. It is a reminder that, before loading an entire dataset, we can pause and think about how much data we actually need for the model to generalize.
