Deep Autoencoder in Action: Reconstructing Handwritten Digits


Image reconstruction using an autoencoder.

Hello world, welcome back to my page! Here I want to show you another project that I just finished: a deep autoencoder. An autoencoder is essentially just a kind of neural network architecture, yet this one is a bit special thanks to its ability to generate new data from a given sample represented in a lower dimension. I am going to use the MNIST Handwritten Digit dataset, in which each image sample has a size of 28 by 28 pixels. Each image is then flattened, hence we will have 784 values to represent each of those images.

As usual, I also include all the code required for this project at the end of this article.


Before we jump into the code, let me first explain the structure of a deep autoencoder. Look at the figure below.

Structure of the deep autoencoder used in this project.

What you are seeing in the picture above is the structure of the deep autoencoder that we are going to construct in this project. An autoencoder has two main parts, namely the encoder and the decoder. The encoder part, which covers the first half of the entire network, maps a sample into its lower-dimensional representation. In this case, the encoder consists of an input layer which takes 784 features. Next, it is connected to a hidden layer of 32 neurons, followed by a 2-neuron layer. The encoder part ends at this 2-neuron layer, which is usually called the latent space. Since this latent space has exactly two dimensions, we are able to represent all the data in a simple Cartesian coordinate system and find out where those digits are encoded.

The second half of an autoencoder is called the decoder. The architecture of the decoder is nearly the same as the encoder part. However, instead of lowering the dimensionality of the data, it maps a value in the latent space back to the original image shape. In this project, the decoder takes two input values, namely the two coordinates that represent a location in the latent space. It is then connected to a hidden layer of 32 neurons and an output (original-shape) layer of 784 neurons.

I think that’s enough explanation about the autoencoder, so now let’s start implementing it!

As usual, the first thing to do is to import all required modules, namely NumPy, Matplotlib, and Keras. The MNIST Handwritten Digit dataset that we will use is available from Keras datasets, so we can load it directly through the code.

import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from keras.models import Model
from keras.layers import Dense, Input
# Loading MNIST Digit dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

Now we have 4 variables: X_train and y_train consist of 60000 data-label pairs, while the test variables consist of 10000 pairs. Spoiler alert: we do not use y_train and y_test for training. I will explain the reason later.
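If you want to double-check those numbers, a quick sanity check like the one below should print (60000, 28, 28) for the training images and (10000, 28, 28) for the test images:

# Quick sanity check on the dataset shapes
print(X_train.shape, y_train.shape)   # (60000, 28, 28) (60000,)
print(X_test.shape, y_test.shape)     # (10000, 28, 28) (10000,)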


Preprocessing

After loading the dataset, the next thing to do is to preprocess the data. Fortunately, the preprocessing steps are very simple in this case because the shape of the images is already uniform (28 by 28). So now, what we need to do is flatten out both X_train and X_test, then store the flattened arrays in X_train_flat and X_test_flat.

# Convert 2D arrays into 1D (flattening)
X_train_flat = X_train.reshape(60000, X_train.shape[1]*X_train.shape[2])
X_test_flat = X_test.reshape(10000, X_test.shape[1]*X_test.shape[2])

If we check the shape of both flattened variables, we will get (60000, 784) and (10000, 784) for the train and test data respectively.
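A one-line check like this will confirm it:

# Confirm the flattened shapes
print(X_train_flat.shape, X_test_flat.shape)   # (60000, 784) (10000, 784)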

The next preprocessing step is normalizing the array values. We know that pixel brightness in images is represented with values ranging between 0 and 255. For the neural network to work best, we need those numbers to lie between 0 and 1, even though in some other cases this step might not matter much. The normalization process can be done like this:

# Normalize values
X_train_flat = X_train_flat/255
X_test_flat = X_test_flat/255

In addition, we do not convert the labels (y_train and y_test) into a one-hot encoded representation because, as I said earlier, they are simply not used to train the neural network model.

Constructing the autoencoder

With all preprocessing steps done, we are now able to construct the autoencoder. The structure of this deep autoencoder is already shown in the figure at the beginning of this article. Below is the code implementation of the architecture.

input_1 = Input(shape=(X_train_flat.shape[1],))
hidden_1 = Dense(32, activation='relu')(input_1)
latent_space = Dense(2, activation='relu')(hidden_1)
hidden_2 = Dense(32, activation='relu')(latent_space)
output_1 = Dense(X_train_flat.shape[1], activation='sigmoid')(hidden_2)

Technically speaking, this deep autoencoder takes an array of size 784 as the input value (the flattened image array). Next, those values are passed through the next layers, namely hidden_1, latent_space, and hidden_2, before eventually reaching the last layer called output_1. Note that we call this a deep autoencoder due to the existence of the hidden_1 and hidden_2 layers. If those two layers did not exist, we would simply call it an autoencoder.
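Just to make that distinction concrete, a plain (shallow) autoencoder with the same 784-dimensional input and 2-dimensional latent space would look something like the sketch below. This is only for comparison, the shallow_* names are made up for this illustration and are not used anywhere else in the project:

# Illustrative sketch of a shallow autoencoder: no hidden_1/hidden_2 layers,
# the input is mapped straight to the latent space and straight back to 784 values.
shallow_input = Input(shape=(784,))
shallow_latent = Dense(2, activation='relu')(shallow_input)
shallow_output = Dense(784, activation='sigmoid')(shallow_latent)
shallow_autoencoder = Model(inputs=shallow_input, outputs=shallow_output)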

Next, we need to define the segments of the network (encoder, decoder, and the entire model). Here I will use a variable called autoencoder to store the entire neural network model and an encoder variable to store the first half of the network.

autoencoder = Model(inputs=input_1, outputs=output_1)
encoder = Model(inputs=input_1, outputs=latent_space)

Notice the way I define the model variables. The autoencoder takes the very first layer (input_1) as the input and the very last layer (output_1) as the output, so that it covers all 5 layers of the network. The encoder part, however, stops at latent_space because we want to take the value from this layer to get the lower-dimensional representation of an image.

The decoder part is kinda tricky though. Below is the code creating the decoder part:

decoder_input = Input(shape=(2,))
decoder_layer_1 = autoencoder.layers[-2](decoder_input)
decoder_output = autoencoder.layers[-1](decoder_layer_1)
decoder = Model(inputs=decoder_input, outputs=decoder_output)

First we need to create a placeholder called decoder_input. This is done because we essentially want to feed a particular value in where the latent_space sits, while latent_space itself is not an input layer. So we can say that decoder_input and latent_space represent the same layer, but decoder_input takes a value from the user while latent_space takes a value from the previous layer of the network.

Next, I define more decoder layers, which are also taken from the layers of the model stored in the autoencoder variable. decoder_layer_1 is exactly the same as the second-to-last layer of the entire network, while decoder_output is the same as the output of autoencoder. Lastly, we need to define the decoder variable itself, which is the second half of the entire network.

We may check the entire structure of this deep autoencoder using autoencoder.summary(), just to verify that we have constructed the model exactly as displayed in the picture shown earlier. Below is the model summary.

Model: "model_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 784) 0
_________________________________________________________________
dense_1 (Dense) (None, 32) 25120
_________________________________________________________________
dense_2 (Dense) (None, 2) 66
_________________________________________________________________
dense_3 (Dense) (None, 32) 96
_________________________________________________________________
dense_4 (Dense) (None, 784) 25872
=================================================================
Total params: 51,154
Trainable params: 51,154
Non-trainable params: 0
_________________________________________________________________

You may also run encoder.summary() or decoder.summary() if you want.

Compiling and fitting the model

A neural network cannot be trained before we define the loss function and the optimizer. In this case, I decided to go with the binary cross-entropy loss function and the adam optimizer. You may change the loss function to something like mse (Mean Squared Error), while other optimizers like adagrad or adadelta are also applicable. Below is how I compile the model:

autoencoder.compile(loss='binary_crossentropy', optimizer='adam')

Now our deep autoencoder is ready to train. Training such a generative model is quite different from training a model for classification.

autoencoder.fit(X_train_flat, X_train_flat, epochs=10, validation_data=(X_test_flat, X_test_flat))

Notice that when fitting (a.k.a. training) the neural network model, the first and second arguments are the same variable (both are X_train_flat). If you are familiar with classification tasks, usually we set the first argument as the samples (X) while the second one is used to pass the ground truth (y). The reason we pass X_train_flat twice in the autoencoder is that we want the output of the model to be as similar as possible to the input data. Therefore, as I mentioned earlier, loading y_train and y_test is actually not necessary for the training process.

Anyway, below is the output of the model fitting after 10 epochs. We can see here that the loss value decreases as the epochs go by. Theoretically, this loss value can still go lower as we increase the number of epochs. Note that I removed the results of epochs 2 to 9 for simplicity.

Train on 60000 samples, validate on 10000 samples
Epoch 1/10
60000/60000 [==============================] - 11s 186us/step - loss: 0.2133 - val_loss: 0.2098
...
Epoch 10/10
60000/60000 [==============================] - 10s 171us/step - loss: 0.1983 - val_loss: 0.1985

Up to this point, our deep autoencoder has been trained. Now we are able to find the lower-dimensional representation of all images and draw their distribution in a simple scatter plot using the encoder. Then we can also use the decoder to perform digit image reconstruction.

What’s done by the encoder?

After training the entire deep autoencoder model, we can map a 784-dimensional flattened image to the 2-dimensional latent space. Now we are going to map all of the training data into the latent space using only the encoder part of the model, which can be achieved with the following code:

encoded_values = encoder.predict(X_train_flat)

Here the shape of the encoded_values variable is (60000, 2), representing the number of samples and the latent dimension respectively. We can think of these two values as an x-y coordinate for each sample. Hence, we are able to put all those data points into a scatter plot. Notice that y_train is used to color-code the samples. Below is the code to do so:

plt.figure(figsize=(13,10))
plt.scatter(encoded_values[:,0], encoded_values[:,1], s=4, c=y_train, cmap='hsv')
plt.colorbar()
plt.show()

Which displays the following image:

Digit samples represented in two-dimensional latent space.

And there it is! The figure above shows the handwritten digit distribution in the two-dimensional latent space. Previously, each image in the dataset was represented in 784 dimensions, which makes it absolutely impossible to visualize the label distribution. Now, however, it is a lot easier to see the distribution of all images because we have encoded that high-dimensional data into only 2 dimensions.

Furthermore, the scatter plot above tells us some interesting facts. First, we can see that the points with label 1 (orange) lie very far from most of the other images. Here we can say that images with label 1 are very different from most other digits. Next, when we pay more attention to data points with labels 4 and 9 (those between pink and purple), we can say that these two handwritten digits are quite similar to each other, since their data points are spread over the same cluster.
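To double-check that overlap (just an extra illustration, not part of the original steps), we can re-plot only the points labeled 4 and 9:

# Extra illustration: plot only the samples labeled 4 and 9 to see how much they overlap
mask = (y_train == 4) | (y_train == 9)
plt.figure(figsize=(8, 6))
plt.scatter(encoded_values[mask, 0], encoded_values[mask, 1],
            s=4, c=y_train[mask], cmap='coolwarm')
plt.colorbar()
plt.show()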

That’s all for the encoder, now let’s jump into the decoder.

What’s done by the decoder?

Now, what if we are given a pair of x-y coordinates representing a point in the latent space? Can we reconstruct the handwritten image from that point? Yes we can! It can simply be achieved by performing a prediction using the decoder model that we defined earlier.

decoded_values = decoder.predict(encoded_values)

Remember, the shape of the encoded_values variable is (60000, 2), meaning that it contains 60000 data points in our latent space, each represented by two values. Now we use this variable as the argument of the decoder’s predict() method, whose return value is 60000 flattened images, each containing 784 values representing the brightness of each pixel. Since an MNIST image should have the size of 28 by 28, we still need to reshape this output. Below is the code for it.

decoded_values = decoded_values.reshape(60000, 28, 28)
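By the way, the decoder does not have to take points that came from the encoder. As a small side experiment (the coordinate below is just an arbitrary example I made up, so what it decodes to will depend entirely on how your latent space turned out), we can pick any x-y point and see what the decoder draws for it:

# Decode a single, hand-picked point in the latent space (arbitrary example coordinate)
sample_point = np.array([[2.0, 5.0]])        # shape (1, 2): one point, two coordinates
generated = decoder.predict(sample_point)    # shape (1, 784)
plt.imshow(generated.reshape(28, 28), cmap='gray')
plt.show()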

Up to this point, we have the reconstructed images stored in the decoded_values variable. Now we can compare each sample stored in this variable with the actual handwritten digit image stored in the X_train variable. Here I decided to print out the 10 images at indices 110 to 119 (out of 60000).

Below is the code to display the actual images taken from X_train along with their labels:

# Display some images
fig, axes = plt.subplots(ncols=10, sharex=False,
                         sharey=True, figsize=(20, 7))
counter = 0
for i in range(110, 120):
    axes[counter].set_title(y_train[i])
    axes[counter].imshow(X_train[i], cmap='gray')
    axes[counter].get_xaxis().set_visible(False)
    axes[counter].get_yaxis().set_visible(False)
    counter += 1
plt.show()
Actual images of index 110 to 119.

And the code below is used to display the reconstructed images, also with their ground-truth labels.

# Display some images
fig, axes = plt.subplots(ncols=10, sharex=False,
                         sharey=True, figsize=(20, 7))
counter = 0
for i in range(110, 120):
    axes[counter].set_title(y_train[i])
    axes[counter].imshow(decoded_values[i], cmap='gray')
    axes[counter].get_xaxis().set_visible(False)
    axes[counter].get_yaxis().set_visible(False)
    counter += 1
plt.show()
Reconstructed images of index 110 to 119.

Now we can compare some of the actual and reconstructed images pretty clearly. In fact, these reconstructed images are exactly what I expected. Remember the latent space I displayed earlier: it shows that data points with label 1 are clearly separated from nearly all other samples. This makes reconstructing the handwritten digit 1 pretty easy, and indeed there does not seem to be much noise in its reconstructed image. Next, for the case of numbers 4 and 9, as I explained earlier, those two digits are quite similar to each other because in the latent space they are spread over the same cluster. We can also see in the reconstructed images above that the 4s and 9s are kinda indistinguishable.
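If you want to quantify this impression instead of just eyeballing the images, one option (a rough sketch I am adding here, not part of the original experiment) is to compute the average per-pixel reconstruction error for each digit class; if your latent space looks like the one above, the error for the 1s will likely be among the lowest:

# Mean reconstruction error (per-pixel MSE) for each digit class
reconstructed_flat = decoded_values.reshape(60000, 784)
errors = np.mean((X_train_flat - reconstructed_flat) ** 2, axis=1)
for digit in range(10):
    print(digit, errors[y_train == digit].mean())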

Don’t worry if you get a different latent space image when you run the code yourself; it also produces different results on my computer when I run it multiple times.
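If you want your runs to be more repeatable, you can try fixing the random seeds before building and training the model. The exact calls depend on your Keras/TensorFlow version, so treat the snippet below as a rough sketch (assuming a TensorFlow backend):

# Rough sketch: fix the random seeds before building and training the model
# (assumes a TensorFlow backend; the exact call depends on your TF version)
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)   # on TensorFlow 1.x this would be tf.set_random_seed(42)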

Thank you very much for reading! I hope you learned something new from this post!

