Using Variational Autoencoder (VAE) to Generate New Images

An advancement of the traditional autoencoder.

Hey there! It’s been pretty long since my last post. In this article I want to share another project that I just finished. Once again, it is related to the computer vision field, but this time it’s going to be different: instead of doing classification, what I want to do here is generate new images using a VAE (Variational Autoencoder). I have already written an article about the traditional deep autoencoder. Here’s the link if you want to read that one.

Wait! Before getting into the code, you can treat me to a coffee by clicking this link to help me stay up at night, so that I can write more posts like this. Thanks in advance!

VAE neural net architecture

The two algorithms (VAE and AE) are built on essentially the same idea: mapping the original image to a latent space (done by the encoder) and reconstructing the values in the latent space back into their original dimension (done by the decoder). However, there is a small difference between the two architectures, which I show in the figures below.

Traditional autoencoder.
Variational Autoencoder (VAE).

I bet it doesn’t even take you a second to spot the difference! Let me explain a bit. The encoder and decoder halves of a traditional autoencoder simply look symmetrical. In a VAE, on the other hand, the encoder part is slightly longer than the decoder thanks to the presence of the mu and sigma layers, which represent the mean and standard deviation vectors respectively. I do recommend reading this article if you want to know in detail why it is necessary to employ the two layers; here I will focus more on the code implementation!

Note: I’ll share the entire code used in this project at the end of this article.

Let’s get into the code

Now, let’s get our hands dirty with some code. We’ll start with some imports. Notice that it’s important to run tf.compat.v1.disable_eager_execution(), especially for those who use a TensorFlow 2.0+ version. What it essentially does is disable eager mode, since for some reason this neural network just cannot be trained while eager mode is turned on. Additionally, I run tf.executing_eagerly() at the end to check whether eager mode has been completely disabled; it should return False.

import os
import cv2
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf; tf.compat.v1.disable_eager_execution()
from keras import backend as K
from keras.layers import Input, Dense, Conv2D, Conv2DTranspose, Flatten, Lambda, Reshape
from keras.models import Model
from keras.losses import binary_crossentropy
from keras.datasets import mnist
np.random.seed(25)
tf.executing_eagerly()

Image dataset preprocessing

Before we go any further, I want you to know that this project is going to be done on the MNIST Handwritten Digit dataset. I like to use a simple dataset when I am working with a relatively complicated model, so I don’t complicate things any further, lol. By the way, here are several images in the dataset along with their labels.

Several images loaded from MNIST dataset (index 120 to 129 of X_train).

The digit images themselves can be downloaded through the Keras API; you might have noticed this when we imported the libraries. After running the code below, all images are going to be stored in X_train and X_test, while the ground truths (labels) are stored in the two y arrays.

(X_train, y_train), (X_test, y_test) = mnist.load_data()

Now, if you want to see what the images look like, we can just run the following code. Here I decided to show the images at index 120 to 129 taken from the X_train array. The output should look something like the image I showed earlier.

fig, axes = plt.subplots(ncols=10, sharex=False,
                         sharey=True, figsize=(20, 7))
counter = 0
for i in range(120, 130):
    axes[counter].set_title(y_train[i])
    axes[counter].imshow(X_train[i], cmap='gray')
    axes[counter].get_xaxis().set_visible(False)
    axes[counter].get_yaxis().set_visible(False)
    counter += 1
plt.show()

Well, so far we haven’t actually done any preprocessing. The first thing to do now is to normalize the values representing the brightness of each pixel, so that those numbers lie within the range of 0 to 1 instead of 0 to 255. This can simply be achieved by dividing all elements in the array by 255, like this:

X_train = X_train/255
X_test = X_test/255

Next, we need to reshape both X_train and X_test. Initially, if we check the shape of the two arrays, it is (60000, 28, 28) and (10000, 28, 28) respectively.

The shape represents (no_of_samples, height, width).

We need to reshape them so that there is a new axis representing a single color channel, as we are going to employ convolution layers (Conv2D, which we’ll get into later) in our VAE network. Therefore, we apply the reshape() method. Notice the number 1 added at the end of each line.

# Convert from (no_of_data, 28, 28) to (no_of_data, 28, 28, 1)
X_train_new = X_train.reshape(X_train.shape[0], X_train.shape[1], X_train.shape[2], 1)
X_test_new = X_test.reshape(X_test.shape[0], X_test.shape[1], X_test.shape[2], 1)

After running the code above, the shape of our X data should now look like this:

The new shape of our X data (no_of_samples, height, width, no_of_color_channels).
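As a side note, the same channel axis can be added with NumPy’s np.expand_dims() (or by indexing with np.newaxis). This is just an equivalent alternative to reshape(), not something the rest of the code requires:

# Equivalent alternative: append the channel axis with np.expand_dims()
X_train_new = np.expand_dims(X_train, axis=-1)    # (60000, 28, 28, 1)
X_test_new = np.expand_dims(X_test, axis=-1)      # (10000, 28, 28, 1)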

That’s pretty much all of the preprocessing stage. In the next step we are going to construct the VAE architecture.

Constructing encoder

Before constructing the encoder part of the VAE, I want to define some variables first so that we can reuse this architecture for other tasks without needing to change many things in the neural net. Here we also define the input shape for the first layer, where the values are taken directly from the shape of our image data. Furthermore, our latent space is going to have 2 dimensions so that we are able to display the digit image distribution in a standard scatter plot (we’ll see this plot later).

img_height   = X_train_new.shape[1]    # 28
img_width    = X_train_new.shape[2]    # 28
num_channels = X_train_new.shape[3]    # 1
input_shape  = (img_height, img_width, num_channels)    # (28, 28, 1)
latent_dim   = 2    # Dimension of the latent space

Now let’s actually build the network. Here, instead of using the Sequential() model, we are going to use the Functional style. If you are not familiar with it yet, I suggest you read this article.

encoder_input = Input(shape=input_shape)

encoder_conv = Conv2D(filters=8, kernel_size=3, strides=2,
                      padding='same', activation='relu')(encoder_input)
encoder_conv = Conv2D(filters=16, kernel_size=3, strides=2,
                      padding='same', activation='relu')(encoder_conv)
encoder = Flatten()(encoder_conv)

mu = Dense(latent_dim)(encoder)
sigma = Dense(latent_dim)(encoder)

We can see here that the input layer is followed by 2 convolution layers. This stack of two Conv2D layers is expected to extract more features from the image data; note that the second Conv2D takes the output of the first one (encoder_conv), so the two layers are actually stacked. Next, the convolution layers are connected to a Flatten layer in order to reshape the data into a single one-dimensional array. This Flatten layer is then connected to the mu and sigma layers, each of which has 2 neurons (latent_dim). Up to this point, the neural net architecture is still pretty easy, but the next part is going to be a bit more tricky, so get ready for that :)

Now what we need to do is to define a function called compute_latent() which is going to be used to determine the values in the latent space layer.

def compute_latent(x):
    mu, sigma = x
    batch = K.shape(mu)[0]
    dim = K.int_shape(mu)[1]
    eps = K.random_normal(shape=(batch, dim))
    return mu + K.exp(sigma/2)*eps

This function is then applied in a Lambda layer, which computes the latent values from mu and sigma using this sampling operation. This step is called the reparameterization trick. Check this page out if you want to learn more about this process.

latent_space = Lambda(compute_latent, output_shape=(latent_dim,))([mu, sigma])
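In plain math, compute_latent() samples the latent vector from the Gaussian predicted by the encoder. Note that the code treats the output of the sigma layer as the log-variance rather than the standard deviation itself, which is why it applies exp(sigma/2). Written out, the sampling is roughly

z = \mu + \exp\!\left(\tfrac{\sigma}{2}\right)\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

where \sigma here denotes the log-variance \log\sigma^2 output by the sigma layer.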

That’s essentially all for the encoder. Additionally, I will also keep the output shape of the last convolution layer in conv_shape. This is done because we will need this exact shape to size the Dense and Reshape layers in the decoder.

conv_shape = K.int_shape(encoder_conv)
The output shape of Conv2D layer in encoder.
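If you prefer to check this in code rather than from the figure above, printing the variable should give the following, assuming the two stride-2 convolutions above applied to 28x28 inputs:

print(conv_shape)    # expected: (None, 7, 7, 16)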

Constructing decoder

The decoder part is more or less the inverse of the encoder. Instead of starting with a (28, 28, 1) input and outputting a value of shape (2,), we use (2,) as the input shape and output an image of shape (28, 28, 1). Here’s how I construct the entire decoder:

decoder_input = Input(shape=(latent_dim,))

decoder = Dense(conv_shape[1]*conv_shape[2]*conv_shape[3], activation='relu')(decoder_input)
decoder = Reshape((conv_shape[1], conv_shape[2], conv_shape[3]))(decoder)
decoder_conv = Conv2DTranspose(filters=16, kernel_size=3, strides=2,
                               padding='same', activation='relu')(decoder)
decoder_conv = Conv2DTranspose(filters=8, kernel_size=3, strides=2,
                               padding='same', activation='relu')(decoder_conv)
decoder_conv = Conv2DTranspose(filters=num_channels, kernel_size=3,
                               padding='same', activation='sigmoid')(decoder_conv)

It’s important to notice that the convolution layer used in the decoder is Conv2DTranspose, which performs the inverse transformation of a standard Conv2D layer. Here’s an article if you want to know how the transpose layer works. Also, notice that the last Conv2DTranspose layer acts as the output of the decoder, where the filters argument has to be 1 (already stored in the num_channels variable) since we want to get back to the original image dimension of (28, 28, 1).
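To get an intuition for how the stride-2 transpose layers upsample the feature maps, here is a tiny standalone sanity check, separate from the VAE itself. With padding='same' and strides=2, each Conv2DTranspose roughly doubles the spatial dimensions, taking us from 7x7 back to 28x28:

from keras import backend as K
from keras.layers import Input, Conv2DTranspose

x = Input(shape=(7, 7, 16))                               # same shape as after Reshape in the decoder
y = Conv2DTranspose(16, 3, strides=2, padding='same')(x)  # upsamples to (None, 14, 14, 16)
z = Conv2DTranspose(8, 3, strides=2, padding='same')(y)   # upsamples to (None, 28, 28, 8)
print(K.int_shape(y), K.int_shape(z))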

Connecting the encoder and decoder

Remember that so far we haven’t actually built the encoder and decoder models. To do so, we can simply pass the corresponding input and output layers to Model().

encoder = Model(encoder_input, latent_space)
decoder = Model(decoder_input, decoder_conv)

But notice that the encoder and decoder are still not connected yet. So we need to link the two in order to construct the entire VAE. The line below is a bit tricky; here’s how to read it: “the output of the vae model is the output of the decoder, whose input is taken from the output of the encoder.”

vae = Model(encoder_input, decoder(encoder(encoder_input)))

At this point we have 3 models, which can be trained simply by applying the fit() method to the connecting model (vae). The details of each model can be seen by calling the summary() method.
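For reference, these are the calls that produce the three summaries shown below:

encoder.summary()
decoder.summary()
vae.summary()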

Summary of encoder.
Summary of decoder.
Summary of the entire model (vae).

Now let’s pay attention to the last summary (vae). Here we need to ensure that the shapes of the input and output layers are exactly the same, since in the case of autoencoders the target (ground truth) is the original image itself.

Defining loss function and compiling model

Another tricky part! Variational autoencoders do not use a single standard loss function like categorical cross entropy or RMSE (Root Mean Square Error). Instead, they use a combination of binary cross entropy loss and Kullback-Leibler divergence loss (KL loss). Since there is no such ready-made function in the Keras library, we need to define it manually. The math is somewhat involved, but long story short, the overall loss is the mean (over the batch) of the sum of the reconstruction error and the KL divergence.

def kl_reconstruction_loss(true, pred):
    # Reconstruction loss (binary cross entropy summed over all pixels)
    reconstruction_loss = binary_crossentropy(K.flatten(true), K.flatten(pred)) * img_width * img_height
    # KL divergence loss (sigma holds the log-variance predicted by the encoder)
    kl_loss = 1 + sigma - K.square(mu) - K.exp(sigma)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    # Total loss = reconstruction loss + KL divergence loss, averaged over the batch
    return K.mean(reconstruction_loss + kl_loss)
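For completeness, the KL part of the code above is the standard closed-form divergence between the encoder’s Gaussian and a standard normal prior, with sigma interpreted as the log-variance:

D_{KL}\big(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,I)\big) = -\tfrac{1}{2}\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)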

Now that the loss function has been defined, we can compile the vae model with it. Here I decided to use the Adam optimizer since it usually performs well compared to the others.

vae.compile(optimizer='adam', loss=kl_reconstruction_loss)

Training the model

Remember the tf.compat.v1.disable_eager_execution() that I ran at the very beginning of the code? If you are using TensorFlow 2.0+, the training cannot be executed without it, most likely because the custom loss function refers to the mu and sigma tensors directly rather than only to its true and pred arguments, which eager-mode training does not tolerate.

Now it’s time to train! Here I pass the same X samples as both the x and y arguments. We do this because we want our reconstructed images to be as similar as possible to the original ones. In other words, the loss value decreases as the produced images become more similar to the actual ones.

history = vae.fit(x=X_train_new, y=X_train_new, epochs=20, batch_size=32, validation_data=(X_test_new,X_test_new))

Here is how the training process goes. I omitted several epochs since displaying the entire log would just be a waste of space.

Train on 60000 samples, validate on 10000 samples
Epoch 1/20
60000/60000 [==============================] - 24s 397us/sample - loss: 188.4155 - val_loss: 173.5554

Epoch 5/20
60000/60000 [==============================] - 24s 406us/sample - loss: 165.6812 - val_loss: 165.1816
Epoch 10/20
60000/60000 [==============================] - 24s 396us/sample - loss: 163.1520 - val_loss: 163.0548
Epoch 15/20
60000/60000 [==============================] - 24s 407us/sample - loss: 162.1884 - val_loss: 162.5337
Epoch 20/20
60000/60000 [==============================] - 24s 404us/sample - loss: 161.5313 - val_loss: 161.7869

To make the training progress even easier to read, I plot it using the plt.plot() function from the Matplotlib module. We can see that the loss values of both the train and test data keep getting smaller until they settle at around 161. This error might go even lower if we increased the number of epochs, but I decided not to continue the training since I think it’s already pretty good.

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train', 'test'])
plt.show()
Loss values decrease during training process.

Displaying latent space

We have trained the model and found that the loss value is already small enough (it starts to decrease only slowly after several epochs). Now what? The answer: we can now encode images into the latent space and show their distribution using a simple scatter plot, as I promised earlier. To do so, we use our encoder model to find the location of each sample in the latent space by applying the predict() method, just like when we predict the class of a sample in a classification problem.

encoded = encoder.predict(X_train_new)

Now the encoded variable should contain an array which holds the data points in the latent space. If we think of these values as data points in a Cartesian coordinate system, then the first column represents the x-axis while the second one represents the y-axis.

What the encoded array looks like. It stores all data points in the latent space.
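If you want to inspect the array yourself instead of relying on the screenshot, a quick check would look something like this (the exact numbers will differ from run to run, and from mine):

print(encoded.shape)    # (60000, 2): one 2-dimensional latent point per training image
print(encoded[:5])      # first five latent points (values will vary)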

Now that we have these values, we can show them in a scatter plot like this:

plt.figure(figsize=(14,12))
plt.scatter(encoded[:,0], encoded[:,1], s=2, c=y_train, cmap='hsv')
plt.colorbar()
plt.grid()
plt.show()
Digit images shown in latent space.

The figure above essentially shows that digit 1 is distributed at the upper side of the graph (orange), the cluster of digit 7 (purple) is located right next to digit 1, digit 6 is distributed at the right side of the figure (dark blue), and so on. The points in the latent space are distributed according to their similarity, which is essentially why samples of the same digit tend to be clustered together by the VAE. As another example, digits 4 and 5 end up looking extremely similar to each other, since the clusters of the two digits are nearly indistinguishable in the latent space. On the other hand, the distributions of digits 0 and 1 (red at the bottom and orange) are separated quite far apart, since our VAE thinks these two digits look very different.

Another thing I want to discuss about this latent space is that the distribution is centered at (0, 0). This is quite different from what we commonly obtain with a traditional autoencoder. You can see my article about it here and scroll to the latent space figure to see how it differs from the one obtained using a VAE. The reason the encoded samples of a VAE are distributed this way is the Kullback-Leibler divergence loss (KL loss), which essentially penalizes encodings whose distribution strays far from a standard normal centered at the latent space origin.

Decoding data points in latent space

Now it’s time for the decoder to show off its ability. I’ll start with a function that creates an image sequence based on points sampled along a line in the latent space. All we need to pass to the function below is the starting point, the end point, and the number of images to decode. The core of the decoding process itself is done in the line that calls decoder.predict().

def display_image_sequence(x_start, y_start, x_end, y_end, no_of_imgs):
    x_axis = np.linspace(x_start, x_end, no_of_imgs)
    y_axis = np.linspace(y_start, y_end, no_of_imgs)

    x_axis = x_axis[:, np.newaxis]
    y_axis = y_axis[:, np.newaxis]

    new_points = np.hstack((x_axis, y_axis))
    new_images = decoder.predict(new_points)
    new_images = new_images.reshape(new_images.shape[0], new_images.shape[1], new_images.shape[2])

    # Display some images
    fig, axes = plt.subplots(ncols=no_of_imgs, sharex=False,
                             sharey=True, figsize=(20, 7))
    counter = 0
    for i in range(no_of_imgs):
        axes[counter].imshow(new_images[i], cmap='gray')
        axes[counter].get_xaxis().set_visible(False)
        axes[counter].get_yaxis().set_visible(False)
        counter += 1
    plt.show()
Digit images shown in latent space.

After defining the function above, we can now try to use it to display some image sequences. Here I show the same sample distribution in the latent space again, in case you want to see how the display_image_sequence() function works without having to scroll back to the plot displayed earlier.

Anyway, now I want to display an image sequence that starts at the cluster of digit 1 and ends at the cluster of digit 6. The idea is to take some points between the two clusters to see the gradual changes between them. The initial point I take is (0, 2) while the terminal point is (2, 0), and there will be 7 other points between them (so we get 9 images in total).

# Starting point=(0,2), end point=(2,0)
display_image_sequence(0,2,2,0,9)

After running the code, we should get the following output:

Generated digit images between (0, 2) and (2, 0), inclusive.

The figure above shows that the leftmost image is generated from the point (0, 2) in the latent space, while the rightmost image is generated from the point (2, 0). All other images in the middle are reconstructed from values between the starting and end points. In fact, such gradual change cannot be generated with a traditional autoencoder, since its latent space is neither continuous nor complete. Here’s a good article that explains the two properties in depth.

Let me display another image sequence. This time I want to see what the images between the cluster of digit 7 (purple) and digit 1 (orange) look like.

# Starting point=(-2,1), end point=(0,2)
display_image_sequence(-2,1,0,2,9)
Generated digit images between (-2, 1) and (0, 2), inclusive.

Just for fun, here’s another one with a longer sequence. This time I take points from (0, -2) up to (0, 2). We can see that the sequence gradually changes from a zero, to something like a three(?), an eight(?), and finally ends at a one. The reconstructed images between the initial and terminal points are sometimes hard to read!

# Starting point=(0,-2), end point=(0,2)
display_image_sequence(0,-2,0,2,19)
Generated digit images between (0, -2) and (0, 2), inclusive.

Where to next?

By this point we have been able to create new images simply by picking a new point in the latent space and employing the decoder to do the reconstruction. Actually, before deciding to use the MNIST dataset, I wanted to do this project on the CelebA dataset, which you can download from here. That dataset contains around 200000 faces along with attributes like pale_skin, oval_face, smiling, etc. What’s interesting is that we could pick a point in its latent space such that the reconstructed image is, for example, smiling and angry at the same time. However, the dataset itself is pretty huge, which would take a lot longer to train, so I decided to go with MNIST instead.

One more thing to keep in mind: the point distribution in the latent space that you produce might be different from the one I obtained.

That’s all for this project! I put the full code down below. See ya in the next one!

References

“Reparameterization” trick in Variational Autoencoders by Sayak Paul. https://towardsdatascience.com/reparameterization-trick-126062cfd3c3

How to create a variational autoencoder with Keras? by Chris. https://www.machinecurve.com/index.php/2019/12/30/how-to-create-a-variational-autoencoder-with-keras/#comment-8504

Intuitively Understanding Variational Autoencoders by Irhum Shafkat. https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf

What is a Variational Autoencoder (VAE)? by Chris. https://www.machinecurve.com/index.php/2019/12/24/what-is-a-variational-autoencoder-vae/#continuity-and-completeness
