CIFAR-10 Image Classification

How to teach a machine to tell images apart using a CNN.

Published in

Becoming Human: Artificial Intelligence Magazine

13 min read · Aug 21, 2020

CIFAR-10 dataset. Source: https://www.cs.toronto.edu/~kriz/cifar.html

In this story I want to show you another project that I have just finished: classifying images from the CIFAR-10 dataset using a CNN. Image classification is one of the core computer vision tasks, so if you are interested in this field, this article might help you get started.

CIFAR-10 is an image dataset which can be downloaded from the dataset webpage linked above. It contains 60000 tiny color images of 32 by 32 pixels, split into 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck) with 6000 images per class. A CNN is used in this project because convolutional networks are well suited to image classification. That's it for the intro, now let's get our hands dirty with the code!

Note: I put the full code at the very end of this article.

Loading modules and dataset

The very first thing to do when writing the code is to import all required modules. We will discuss each of these imported modules as we go.

import cv2
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from keras.datasets import cifar10
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import confusion_matrix
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
from keras.models import Sequential, load_model
from keras.callbacks import EarlyStopping

The CIFAR-10 dataset itself can either be downloaded manually from this link or loaded directly in code through an API. To make things simpler, I decided to load it using the Keras API. Here is how to do it:

(X_train, y_train), (X_test, y_test) = cifar10.load_data()

If this is your first time using Keras to download the dataset, the code above may take a while to run. FYI, the dataset is around 160 MB. After the code finishes running, the data is stored in the X_train, y_train, X_test and y_test variables, where the training and testing sets consist of 50000 and 10000 samples respectively.

Now if we try to print out the shape of training data (X_train.shape), we will get the following output.

(50000, 32, 32, 3)

Here is how to read the shape: (number of samples, height, width, color channels). Keep in mind that in this case we have 3 color channels, which represent the RGB values. If you have ever worked with the MNIST handwritten digit dataset, you will recall that it has only a single color channel since all of its images are grayscale.
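To make the shape reading concrete, here is a minimal sketch that uses a small random array as a stand-in for X_train. The name X_demo and the 5-sample size are assumptions for illustration only; the real array has 50000 samples.

```python
import numpy as np

# Synthetic stand-in for X_train: 5 random "images" instead of 50000.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(5, 32, 32, 3), dtype=np.uint8)

# Unpack the four axes in the order described above.
n_samples, height, width, channels = X_demo.shape
print(n_samples, height, width, channels)

# One image is a (32, 32, 3) array; one pixel is a length-3 RGB vector.
first_image = X_demo[0]
top_left_pixel = first_image[0, 0]
print(first_image.shape, top_left_pixel.shape)
```

Indexing the first axis selects an image, and indexing the next two selects a pixel inside it.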

On the other hand, if we try to print out the value of y_train, it will output labels which are all already encoded into numbers:

array([[6],
       [9],
       [9],
       ...,
       [9],
       [1],
       [1]], dtype=uint8)

Since it's kinda difficult to interpret those encoded labels, I would like to create a list of the actual label names. The order of this list follows the CIFAR-10 dataset webpage.

labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
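As a quick illustration of how the list is used, the sketch below looks up names for a few encoded labels. The array y_demo is a hypothetical stand-in with the same (n, 1) layout as y_train.

```python
import numpy as np

labels = ['airplane', 'automobile', 'bird', 'cat', 'deer',
          'dog', 'frog', 'horse', 'ship', 'truck']

# Same layout as y_train: one encoded label per row, shape (n, 1).
y_demo = np.array([[6], [9], [9], [1]], dtype=np.uint8)

# Each row holds a single number, which is the index into the labels list.
names = [labels[int(row[0])] for row in y_demo]
print(names)  # ['frog', 'truck', 'truck', 'automobile']
```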

Image preprocessing

Before doing anything with the images stored in both X variables, I wanna show you several images in the dataset along with their labels. Here's how I did it:

fig, axes = plt.subplots(ncols=7, nrows=3, figsize=(17, 8))
index = 0
for i in range(3):
    for j in range(7):
        axes[i,j].set_title(labels[y_train[index][0]])
        axes[i,j].imshow(X_train[index])
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        index += 1
plt.show()

The code above displays the first 21 images in the dataset, arranged in 7 columns and 3 rows. The figsize argument just defines the size of the figure. The title of each image is set using set_title() and the image itself is drawn using imshow(). Below is what the output of the code above looks like.

The first 21 images in CIFAR-10 dataset.

It's good to know that a larger input array may require more time to train the model. So, as a way to reduce the dimensionality of the data, I would like to convert all those images (both train and test data) into grayscale. Luckily this can be done easily using the cv2 module.

# Keras loads CIFAR-10 in RGB channel order, so COLOR_RGB2GRAY is the matching conversion code
X_train = np.array([cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) for image in X_train])
X_test = np.array([cv2.cvtColor(image, cv2.COLOR_RGB2GRAY) for image in X_test])
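The grayscale conversion is a weighted sum of the three color channels, so the result depends on which channel is treated as red and which as blue. Below is a minimal NumPy sketch of that weighted sum, assuming RGB channel order and the BT.601 weights that OpenCV documents for its RGB-to-gray conversion. The helper rgb_to_gray is hypothetical, and its rounding may differ slightly from cv2.

```python
import numpy as np

# ITU-R BT.601 luma weights, in R, G, B order.
RGB_WEIGHTS = np.array([0.299, 0.587, 0.114])

def rgb_to_gray(images):
    """Convert (n, h, w, 3) RGB uint8 images to (n, h, w) uint8 grayscale."""
    gray = images.astype(np.float64) @ RGB_WEIGHTS
    return np.rint(gray).astype(np.uint8)

# A single pure-red pixel and a single pure-blue pixel (hypothetical data).
red = np.array([[[[255, 0, 0]]]], dtype=np.uint8)
blue = np.array([[[[0, 0, 255]]]], dtype=np.uint8)

gray_red = rgb_to_gray(red)
gray_blue = rgb_to_gray(blue)
print(gray_red[0, 0, 0], gray_blue[0, 0, 0])  # 76 29
```

Red and blue map to different gray levels, which is why swapping the channel order changes the brightness of strongly colored regions.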

Now we can display the pictures again just to check whether the conversion worked. Notice that the code below is almost exactly the same as the previous one. The only change is passing 'gray' as the cmap (colormap) argument so that the images are drawn in shades of gray.

fig, axes = plt.subplots(ncols=7, nrows=3, figsize=(17, 8))
index = 0
for i in range(3):
    for j in range(7):
        axes[i,j].set_title(labels[y_train[index][0]])
        axes[i,j].imshow(X_train[index], cmap='gray')
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        index += 1
plt.show()

The output should look like this:

The first 21 images in CIFAR-10 dataset converted to grayscale.

Afterwards, we also need to normalize the array values. The brightness of each pixel in these images is represented by a value between 0 and 255. Neural networks generally train better on small input values, so we rescale the pixels to the range between 0 and 1. It's actually pretty simple to do so:

X_train = X_train / 255
X_test = X_test / 255

And well, that's all we need to do to preprocess the images.
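A small sketch of what the division does to the data type and value range, using a hypothetical 2x2 patch of pixels. The float32 variant is an optional assumption on my side, not something the project code does.

```python
import numpy as np

pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)

scaled = pixels / 255                        # NumPy promotes uint8 to float64
scaled32 = pixels.astype(np.float32) / 255   # float32 uses half the memory

print(scaled.dtype, scaled.min(), scaled.max())
print(scaled32.dtype, scaled32.nbytes, scaled.nbytes)
```

Either way the values end up between 0 and 1; the float32 version only matters if memory is tight.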

Label preprocessing

Remember our labels y_train and y_test? They are still stored as a single number ranging from 0 to 9. With the categorical cross entropy loss that we will use later, the network expects labels in one-hot representation instead. To convert them, we can use the OneHotEncoder object from the Sklearn module, which I store in the one_hot_encoder variable.

# Note: scikit-learn 1.2 renamed this argument to sparse_output; newer versions need OneHotEncoder(sparse_output=False)
one_hot_encoder = OneHotEncoder(sparse=False)

Now we will use this one_hot_encoder to generate one-hot label representation based on data in y_train.

one_hot_encoder.fit(y_train)

The code above hasn't actually transformed y_train into one-hot form. It only learns the set of categories that appear in y_train. Therefore we still need to convert both y_train and y_test. Here is how to do it:

y_train = one_hot_encoder.transform(y_train)
y_test = one_hot_encoder.transform(y_test)

Now if we did it correctly, the output of printing y_train or y_test will look something like this, where label 0 is denoted as [1, 0, 0, 0, …], label 1 as [0, 1, 0, 0, …], label 2 as [0, 0, 1, 0, …] and so on.

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 1.],
       ...,
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]])
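For intuition, the same one-hot mapping can be sketched in plain NumPy by picking rows of an identity matrix. This is an illustrative alternative, not what the project code uses, and y_demo is a hypothetical stand-in for y_train.

```python
import numpy as np

y_demo = np.array([[6], [9], [1]], dtype=np.uint8)
num_classes = 10

# Row k of the identity matrix is the one-hot vector for label k.
one_hot = np.eye(num_classes)[y_demo.ravel()]
print(one_hot.shape)
print(one_hot[0])

# argmax undoes the encoding and gives back the original labels.
recovered = one_hot.argmax(axis=1).reshape(-1, 1)
print(recovered.ravel())
```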

Constructing CNN

Before going any further, lemme review our 4 important variables first: X_train, X_test, y_train and y_test. Up to this step, our X data holds all the grayscale images, while the y data holds the ground truth (a.k.a. labels), already converted into one-hot representation.

Notice that if we check the shapes of X_train and X_test, they are (50000, 32, 32) and (10000, 32, 32) respectively. This shape is not accepted by the Conv2D layer that we are going to use, because it expects an explicit channel axis. So, we need to reshape those two arrays using the following code:

X_train = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)

Now our X_train and X_test shapes are (50000, 32, 32, 1) and (10000, 32, 32, 1), where the number 1 in the last position indicates that we are now using only 1 color channel (gray). Next, we are going to use the per-image shape as our neural net's input shape. To keep things straightforward, I store it in the input_shape variable.

input_shape = (X_train.shape[1], X_train.shape[2], 1)
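Adding the channel axis can be written in a few equivalent ways. The sketch below compares them on a small zero-filled array (a hypothetical stand-in for the grayscale data) and also shows how to drop the axis again for plotting.

```python
import numpy as np

gray = np.zeros((4, 32, 32), dtype=np.float32)

# Three equivalent ways to append a channel axis of length 1.
a = gray.reshape(gray.shape[0], gray.shape[1], gray.shape[2], 1)
b = np.expand_dims(gray, axis=-1)
c = gray[..., np.newaxis]
print(a.shape, b.shape, c.shape)

# Dropping the channel axis again, e.g. before passing images to imshow().
restored = b.squeeze(axis=-1)
print(restored.shape)
```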

Next, we can construct the CNN architecture. In this project I decided to use the Sequential() model. Below is how I create the neural network.

model = Sequential()

model.add(Conv2D(16, (3, 3), activation='relu', strides=(1, 1), 
    padding='same', input_shape=input_shape))
model.add(Conv2D(32, (3, 3), activation='relu', strides=(1, 1), 
    padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', strides=(1, 1), 
    padding='same'))
model.add(MaxPool2D((2, 2)))

model.add(Conv2D(16, (3, 3), activation='relu', strides=(1, 1), 
    padding='same'))
model.add(Conv2D(32, (3, 3), activation='relu', strides=(1, 1), 
    padding='same'))
model.add(Conv2D(64, (3, 3), activation='relu', strides=(1, 1), 
    padding='same'))
model.add(MaxPool2D((2, 2)))

model.add(Flatten())

model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

There are several things I wanna highlight in the code above. First, the filters in all convolution layers have a size of 3 by 3 and a stride of 1, and within each block the number of filters doubles from one convolution layer to the next (16, 32, 64) before reaching a max-pooling layer. This convolution-pooling block is repeated twice to extract more features from the image data. Secondly, all layers in the network (except the very last one) use the ReLU activation function, which typically lets a network train faster than the sigmoid activation function. Next, a dropout layer with a rate of 0.5 is used to reduce overfitting. Lastly, notice that the output layer consists of 10 neurons with the softmax activation function. That is because this classification task has 10 different classes, each represented by one neuron in that layer. Softmax turns the outputs of that layer into a probability score for each class.

If my explanation of the CNN is not clear enough, I suggest reading this article to learn more about the neural net architecture.
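To show what the softmax output layer computes, here is a minimal NumPy sketch. The logits are made-up numbers for one sample, not outputs of the model in this article.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw scores of the 10 output neurons for one image.
logits = np.array([[2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -0.5, 0.2, 1.5]])
probs = softmax(logits)

predicted = int(probs.argmax(axis=1)[0])
print(probs.round(3), probs.sum(), predicted)
```

The neuron with the largest raw score (index 6 here) also gets the largest probability, so it becomes the predicted class.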

Now if we run model.summary(), we will get an output which looks something like this. The None values in the output shape column stand for the batch dimension, which means we can feed the neural network with any number of samples.

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 32, 32, 16)        160       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 32, 32, 32)        4640      
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 32, 32, 64)        18496     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 16, 16, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 16, 16, 16)        9232      
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 16, 16, 32)        4640      
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 16, 16, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 8, 8, 64)          0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 4096)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 256)               1048832   
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               32896     
_________________________________________________________________
dense_3 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_4 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_5 (Dense)              (None, 10)                650       
=================================================================
Total params: 1,150,458
Trainable params: 1,150,458
Non-trainable params: 0
_________________________________________________________________
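The parameter counts in the summary can be reproduced by hand: a convolution layer has (kernel height x kernel width x input channels + 1 bias) weights per filter, and a dense layer has (inputs + 1 bias) weights per unit. The helper names below are my own, used only for this check.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return (in_units + 1) * out_units

# (input channels, filters) for the six 3x3 convolution layers.
conv_channels = [(1, 16), (16, 32), (32, 64), (64, 16), (16, 32), (32, 64)]
conv_counts = [conv2d_params(3, 3, c_in, c_out) for c_in, c_out in conv_channels]

# Two 2x2 poolings shrink 32x32 to 8x8; 64 channels are then flattened.
flattened = 8 * 8 * 64
dense_sizes = [(flattened, 256), (256, 128), (128, 64), (64, 64), (64, 10)]
dense_counts = [dense_params(i, o) for i, o in dense_sizes]

total = sum(conv_counts) + sum(dense_counts)
print(conv_counts)
print(dense_counts)
print(total)  # 1150458
```

Most of the parameters sit in the first dense layer, since it connects all 4096 flattened features to 256 units.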

The next step is compiling the model. We use the categorical cross entropy loss function because this is a multiclass classification problem with one-hot labels. By the way, for a binary classification task such as cat-dog detection, we should use the binary cross entropy loss function instead. For the optimizer, I decided to use Adam, a common default that tends to work well without much tuning. Lastly, I use acc (accuracy) to keep track of the model performance as training goes on.

model.compile(loss='categorical_crossentropy', 
     optimizer='adam',
     metrics=['acc'])
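As a rough illustration of the loss chosen above, here is categorical cross entropy written out in NumPy for one-hot labels. This is a simplified sketch (the function name, the clipping constant and the sample values are my assumptions), not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over samples of -sum(one_hot * log(predicted probability))."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

# Two hypothetical samples with 3 classes, one-hot encoded.
y_true = np.array([[0, 0, 1], [1, 0, 0]], dtype=float)
confident = np.array([[0.05, 0.05, 0.90], [0.80, 0.10, 0.10]])
unsure = np.array([[0.30, 0.30, 0.40], [0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)

# A 10-class model that always outputs 0.1 everywhere scores ln(10), about 2.30.
loss_uniform = categorical_crossentropy(np.eye(10)[[3]], np.full((1, 10), 0.1))
print(loss_uniform)
```

Only the probability assigned to the true class enters the loss, so confident correct predictions are rewarded with a lower value.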

Before actually training the model, I wanna declare an early stopping object, which is useful to keep our model from overfitting. The code below monitors the validation loss (monitor='val_loss'), treats lower values as better (mode='min'), and stops the training process once that loss has failed to improve for 3 consecutive epochs (patience=3).

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=3)
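The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the idea (it ignores options such as min_delta), and the validation losses are hypothetical numbers, not the ones from this project's run.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # new best value: reset the counter
            wait = 0
        else:
            wait += 1     # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value occurs at epoch 4.
val_losses = [1.30, 1.10, 0.95, 0.90, 0.92, 0.91, 0.93, 0.89]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 7
```

Epochs 5, 6 and 7 all fail to beat the best loss from epoch 4, so training stops at epoch 7 and never reaches epoch 8.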

That's all of the preparation, now we can start to train the model. I keep the training progress in the history variable, which I will use later. Notice that our EarlyStopping() object is passed in the callbacks argument of the fit() function.

history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), callbacks=[es])

Here's how the training process goes. I have omitted some of the epochs to keep things short on this page.

Train on 50000 samples, validate on 10000 samples
Epoch 1/20
50000/50000 [==============================] - 65s 1ms/step - loss: 1.7465 - acc: 0.3483 - val_loss: 1.3424 - val_acc: 0.5145
.
.
.
Epoch 4/20
50000/50000 [==============================] - 64s 1ms/step - loss: 0.9497 - acc: 0.6707 - val_loss: 0.9306 - val_acc: 0.6778
.
.
.
Epoch 8/20
50000/50000 [==============================] - 64s 1ms/step - loss: 0.6935 - acc: 0.7625 - val_loss: 0.8046 - val_acc: 0.7252
.
.
.
Epoch 11/20
50000/50000 [==============================] - 62s 1ms/step - loss: 0.5842 - acc: 0.7976 - val_loss: 0.8116 - val_acc: 0.7294
Epoch 00011: early stopping

Notice the training process above: it stops at epoch 11, even though I set 20 epochs to run in the first place. This is the early stopping object at work. If we look at the last epoch, the gap between train and test accuracy is already fairly wide (79.8% vs 72.9%), so training for more than 11 epochs would likely just make the model overfit the train data even more.

By the way, if we wanna save this model for future use, we can just run the following code:

model.save('CNN_CIFAR.h5')

Next time we want to use the model, we can simply call the load_model() function from the Keras module like this:

model = load_model('CNN_CIFAR.h5')

Model evaluation

After the training completes we can display our training progress more clearly using Matplotlib module. Let’s show the accuracy first:

plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.show()

Accuracy improvement (orange represents accuracy towards test data).

And below is the loss value decrease:

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.show()

Loss value decrease (orange represents loss towards test data).

According to the two figures above, our model is slightly overfitting: the loss on test data stopped improving at around 0.8 while the loss on train data kept decreasing. Likewise, the final accuracy on test data stays at around 73% even though the accuracy on train data almost reaches 80%. A model that generalizes well should score high on both train and test data. Hence, there's still room for improvement.

Now we are going to display a confusion matrix in order to see how the misclassifications in our test data are distributed. To do so, we first need to run predictions on X_test like this:

predictions = model.predict(X_test)

Remember that these predictions are still probability distributions over the classes, so we need to convert each of them to its predicted label encoded as a single number. This can be done using the np.argmax() function or directly using the encoder's inverse_transform() method. For this case, I prefer to use the second one:

predictions = one_hot_encoder.inverse_transform(predictions)
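For comparison, the np.argmax() route mentioned above could look like the sketch below. The probabilities are hypothetical, and the reshape is only there to produce the same (n, 1) column layout as the encoder output.

```python
import numpy as np

# Hypothetical softmax outputs for two test samples (each row sums to 1).
probs = np.array([
    [0.05, 0.05, 0.10, 0.60, 0.05, 0.05, 0.02, 0.03, 0.03, 0.02],
    [0.01, 0.02, 0.02, 0.05, 0.05, 0.05, 0.05, 0.05, 0.60, 0.10],
])

# Index of the largest probability in each row is the predicted label.
predicted = np.argmax(probs, axis=1).reshape(-1, 1)
print(predicted)
```

One practical difference is that argmax returns integers directly, so no later cast to int is needed on this route.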

Now if I try to print out the value of predictions, the output will look something like the following. Keep in mind that those numbers represent predicted labels for each sample.

array([[3.],
       [8.],
       [8.],
       ...,
       [5.],
       [1.],
       [7.]])

Also, remember that our y_test variable was already encoded to one-hot representation in the earlier part of this project. So, we need to inverse-transform its values as well to make it comparable with the predicted data.

y_test = one_hot_encoder.inverse_transform(y_test)

Now, up to this stage, our predictions and y_test are already in the exact same form. Thus, we can start to create its confusion matrix using confusion_matrix() function from Sklearn module. We will store the result in cm variable.

cm = confusion_matrix(y_test, predictions)
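To show what the confusion matrix counts, here is a small NumPy sketch that builds one by hand for a toy 3-class problem. The function name and the labels are made up for illustration; rows are actual classes and columns are predicted classes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, num_classes):
    """cm[i, j] = number of samples with actual class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # handles repeated (i, j) pairs correctly
    return cm

# Toy labels for 7 samples and 3 classes.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm_demo = confusion_counts(y_true, y_pred, num_classes=3)
print(cm_demo)

# Correct predictions sit on the diagonal.
accuracy = cm_demo.trace() / cm_demo.sum()
print(accuracy)
```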

Now to make things look clearer, we will plot the confusion matrix using heatmap() function.

plt.figure(figsize=(9,9))
sns.heatmap(cm, cbar=False, xticklabels=labels, yticklabels=labels, fmt='d', annot=True, cmap=plt.cm.Blues)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

And here is what the confusion matrix for the test data looks like. Even though our overall model accuracy is not very high (about 73%), most of our test samples are still predicted correctly. Here's how to read the numbers in case it's unclear: rows are the actual classes and columns are the predicted classes, so 155 bird images are predicted as deer, 101 airplane images are predicted as ship, and so on.

Lastly, I also wanna show the first several images in our X_test. To do that, we need to reshape the images from (10000, 32, 32, 1) to (10000, 32, 32) like this:
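A confusion matrix also gives per-class accuracy and the most frequent mix-up almost for free. The sketch below uses a hypothetical 3-class matrix, not the 10-class result from this project.

```python
import numpy as np

# Hypothetical confusion matrix: rows are actual classes, columns are predicted.
cm_demo = np.array([[80,  5, 15],
                    [10, 70, 20],
                    [ 5,  5, 90]])

# Share of each actual class that was predicted correctly.
per_class = cm_demo.diagonal() / cm_demo.sum(axis=1)
overall = cm_demo.trace() / cm_demo.sum()
print(per_class, overall)

# Largest off-diagonal cell = most common misclassification.
off_diag = cm_demo.copy()
np.fill_diagonal(off_diag, 0)
worst = np.unravel_index(off_diag.argmax(), off_diag.shape)
print(worst, off_diag[worst])
```

Here class 1 is most often mistaken for class 2, which is the same kind of reading as "bird predicted as deer" above.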

X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2])

Well, the reshape above is done only so that Matplotlib's imshow() function can display the image data properly. Since we will also display both the actual and predicted labels, we need to convert the values of y_test and predictions to integers (the inverse_transform() method returned floats).

y_test = y_test.astype(int)
predictions = predictions.astype(int)

Finally we can display what we want. Notice that in the figure below most of the predictions are correct. Only some of those are classified incorrectly.

fig, axes = plt.subplots(ncols=7, nrows=3, sharex=False,
    sharey=True, figsize=(17, 8))
index = 0
for i in range(3):
    for j in range(7):
        axes[i,j].set_title('actual:' + labels[y_test[index][0]] + '\n' 
                            + 'predicted:' + labels[predictions[index][0]])
        axes[i,j].imshow(X_test[index], cmap='gray')
        axes[i,j].get_xaxis().set_visible(False)
        axes[i,j].get_yaxis().set_visible(False)
        index += 1
plt.show()

By the way, I found a page on the internet which lists CIFAR-10 image classification papers ranked by their accuracy. The papers are available on this page, and luckily they are free to download. Just click on that link if you're curious how the researchers behind those papers reached their model accuracy.

Research attempts at image classification on the CIFAR-10 dataset. Source: https://paperswithcode.com/sota/image-classification-on-cifar-10

That’s all of this image classification project. Please lemme know if you can obtain higher accuracy on test data! See you in the next article :)

Note: here's the code for this project. If you find that the accuracy score remains at 10% after several epochs, try rerunning the code. It's probably because the initial random weights were just not good.