Understanding CapsNet (Part 1)

--

A Convolutional Neural Network (CNN or ConvNet) is a deep neural network that has been very effective at classifying images. Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton, from the Google Brain team, proposed an approach to improve image classification, object detection, and object segmentation by introducing CapsNet (paper: ‘Dynamic Routing Between Capsules’, October 2017). This article aims to explain CapsNet by first describing the limitations of CNNs.

Summary

If a CNN is trained to identify whether an image contains a panda using data sets of images with an orientation similar to Image_TrainingDataSetType, and it is not trained with images oriented like Image_RotatedPanda, then the CNN classifier misclassifies both Image_RotatedPanda and Image_Deformed, whereas CapsNet classifies them correctly.

Image_TrainingDataSetType

Actual Result: Panda; CNN Result: Panda; CapsNet Result: Panda

Image_RotatedPanda

Actual Result: Panda; CNN Result: Not Panda; CapsNet Result: Panda

Image_Deformed

Actual Result: Not Panda; CNN Result: Panda; CapsNet Result: Not Panda

-

“No Animals Were Harmed in the Making of This Image” :)

How does CNN classifier work?

Upon training a CNN with a data set of images having an orientation similar to Image_TrainingDataSetType, it learns features such as ‘left eye’, ‘right eye’, ‘nose’, and so on.

When Image_TrainingDataSetType is fed to this CNN for classification, it detects the learned features in the given input. Hence, it will correctly classify it as ‘Panda’.

When Image_Deformed is fed to this CNN for classification, it detects the learned features in the given input (the eyes and nose are present, even though they are misplaced). Hence, it will incorrectly classify it as ‘Panda’.

When Image_RotatedPanda is fed to this CNN for classification, it will fail to detect the learned features in the given input. Hence, it will incorrectly classify it as ‘Not Panda’.

A workaround that lets this CNN classifier handle Image_RotatedPanda correctly is to add similar images (images of similar orientation, size, and so on) to the training data set and label them as ‘Panda’. The CNN then learns the features of the nose and eyes in more orientations, but this requires much more data covering the various poses. One common way to generate such data is augmentation, sketched below.
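The snippet below is a minimal sketch of this workaround; torchvision and the specific transform values are my assumptions for illustration, not something prescribed by the article or the paper.

```python
from torchvision import transforms

# Random augmentation applied to each training image at load time, so the CNN
# sees the panda in many more orientations, sizes, and positions.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=180),               # arbitrary rotations
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),  # size/position changes
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Applying `augment` to every training image effectively multiplies the number
# of poses the CNN is exposed to, at the cost of more data and longer training.
```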

Root Cause of limitations in CNN

The root limitation of a CNN is that its neurons are activated based on the chance of detecting a specific feature; they do not consider the properties of that feature, such as its orientation, size, velocity, or color. As a result, the network is not trained on the relationships between features. Had it been trained on the spatial relationships between features, the CNN would have correctly classified Image_Deformed as ‘Not Panda’. Had it been trained considering the orientation of the features, the CNN would have correctly classified Image_RotatedPanda as ‘Panda’, without needing bulk data sets covering different orientations.

Determining the spatial relationship between the nose and the eyes requires the precise locations of those features in the input image, but that location information (for the nose, left eye, and right eye) is lost at the (Max)Pooling layer of the CNN. MaxPooling is performed to achieve translation invariance: the CNN classifies the input image in the same way regardless of how the information within the image is shifted. For example, the three images below will be classified by the CNN as ‘Panda’ even though the CNN was not trained with images having the panda at exactly the same pixel positions.

Translation Invariance
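As a small, hedged illustration of this invariance (PyTorch is an assumed choice of library; the toy feature map simply stands in for “some feature was detected here”):

```python
import torch
import torch.nn.functional as F

# A toy 6x6 feature map with one strong activation ("nose detected" at row 1, col 1).
fm = torch.zeros(1, 1, 6, 6)
fm[0, 0, 1, 1] = 9.0

# The same feature shifted by one pixel to the right.
fm_shifted = torch.zeros(1, 1, 6, 6)
fm_shifted[0, 0, 1, 2] = 9.0

# 3x3 max pooling with stride 3 produces a 2x2 output.
out = F.max_pool2d(fm, kernel_size=3, stride=3)
out_shifted = F.max_pool2d(fm_shifted, kernel_size=3, stride=3)

# True: the pooled outputs are identical, so later layers cannot tell
# where inside the 3x3 region the feature actually was.
print(torch.equal(out, out_shifted))
```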

How does pooling lose location information?

In addition to convolutional layers, a CNN often uses pooling layers to reduce the size of the representation, speed up computation, and make feature detection more robust.

CNN

Since the final label needs to be viewpoint-invariant, a CNN makes the neural activities invariant to small changes in viewpoint by combining the activities within a pool. MaxPooling is a very popular type of pooling in neural networks: a MaxPooling layer looks at the activity levels of the neurons in a layer and reports only the activity level of the most active one.

Let us go through an example to understand MaxPooling. Suppose we have a 6*6 input, where each cell represents the activity level of a neuron. When MaxPooling is applied to this input with filter=3 and stride=3, it creates regions of the size given by the filter (3*3): the first region starts at the corner, and the next region starts 3 cells further along (based on the stride value). Each region is shown in a different color in the image below. Padding is usually 0; if it is set to some value, the input matrix is padded with 0s around its border, which increases the contribution of the corner cells. Without padding, for some combinations of filter and stride that produce overlapping regions, the inner cells of the matrix belong to more regions than the corner cells. ‘Single Depth Slice’ indicates that the input images are grayscale (6*6*1); color input images would instead have dimensions 6*6*3 (3 for R, G, and B).

MaxPooling identifies the maximum value in each region and reports it in the corresponding output cell (colored like its input region). Each region represents some set of features (activations in some layer of the neural network), so the largest number indicates that a particular feature has been detected (perhaps a vertical edge). This gives a small amount of translational invariance and fewer active neurons, but it loses the precise location and pose of the object.

Max Pooling (3*3 filter; Stride=3; Padding=0; 2*2 output)
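The same mechanics in a minimal NumPy sketch; the activation values here are made up for illustration:

```python
import numpy as np

# Hypothetical 6x6 single depth slice of neuron activations (grayscale, 6*6*1).
acts = np.array([
    [1, 3, 2, 1, 0, 4],
    [5, 6, 1, 2, 3, 1],
    [0, 2, 4, 7, 1, 2],
    [3, 1, 0, 2, 8, 1],
    [2, 2, 1, 0, 1, 3],
    [1, 0, 5, 2, 2, 9],
])

filter_size, stride = 3, 3                              # padding = 0
out_dim = (acts.shape[0] - filter_size) // stride + 1   # (6 - 3) // 3 + 1 = 2

pooled = np.zeros((out_dim, out_dim))
for i in range(out_dim):
    for j in range(out_dim):
        region = acts[i*stride:i*stride+filter_size, j*stride:j*stride+filter_size]
        pooled[i, j] = region.max()   # only the strongest activation survives

print(pooled)
# [[6. 7.]
#  [5. 9.]]  -- where inside each 3x3 region the max occurred is discarded
```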

CapsNet

CapsNet (Capsule Network) is a neural network that performs inverse graphics. CapsNet is composed of capsules. A capsule is a group of neurons in a layer that performs internal computations to predict both the presence and the instantiation parameters (values for feature properties such as orientation, size, velocity, color, and so on) of a particular feature at a given location.

The implementation of CapsNet involves:

  • Sending the training data (panda images) through a couple of convolutional layers, which output feature maps (say, an array of size 15 at each location).
  • Reshaping the feature maps into 3 vectors of dimension 5 for each location, where vector1 might represent the feature ‘nose’, vector2 the ‘left eye’, and vector3 the ‘right eye’.
  • Squashing each vector so that its length lies between 0 and 1, since the length is meant to represent a probability. Squashing is performed without affecting the other parameters such as orientation and size. And it is not just squashing: information about a feature’s location and pose is preserved throughout CapsNet, so if the image is transformed in any way, the activation vectors change accordingly (equivariance). The activation vectors in this layer are the primary capsules (see the sketch after this list).
  • All capsules in the first layer predict the outputs of the capsules in the next layer. Once a capsule in the primary layer figures out which capsule in the second layer it belongs to, it should be routed only to that capsule in the second layer. This is routing by agreement (a sketch follows the Capsule Network figure below). Paths of activations represent the hierarchy of parts, and routing by agreement also handles crowded scenes.
  • When Image_Deformed is fed to this CapsNet for classification, the primary capsules detect the learned features (left eye, right eye, nose) in the given input. When each primary capsule applies the transformed location of its feature in the given image to the second-layer capsules, the 3 resulting transformed pandas will not be the same, because these features (parts) are not positioned properly in the original image for it to qualify as a panda. The learned features will not agree strongly that they are part of a panda. Hence, CapsNet correctly classifies it as ‘Not Panda’.
  • When Image_RotatedPanda is fed to this CapsNet for classification, the primary capsules detect the learned features and their orientation in the given input. When each primary capsule applies the orientation (rotated by 270 degrees) of its feature in the given image to the second-layer capsules, the 3 resulting transformed pandas (each rotated by 270 degrees) will be the same, because these features (parts) are positioned properly relative to one another in the original image. The learned features will agree strongly that they are part of a panda. Hence, CapsNet correctly classifies it as ‘Panda’.
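Putting the first three steps together, here is a minimal sketch of how primary capsules could be formed and squashed. PyTorch and the toy sizes (15 feature-map values per location, 3 capsules of dimension 5, as in the text) are assumptions for illustration only:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    # Shrinks a vector's length into (0, 1) while keeping its direction,
    # so the length can be read as the probability that the feature exists.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

# Toy numbers matching the text: 15 feature-map values at one location,
# reshaped into 3 capsule vectors of dimension 5 (nose, left eye, right eye).
feature_maps = torch.randn(15)                        # output of the conv layers
primary_capsules = squash(feature_maps.view(3, 5))    # 3 capsules, 5-D each

print(primary_capsules.norm(dim=-1))                  # each length is now in (0, 1)
```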
Capsule Network
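And a hedged sketch of the routing-by-agreement step described above, following the iterative scheme from the paper; the capsule counts and dimensions are made up for illustration:

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def routing_by_agreement(u_hat, num_iterations=3):
    """u_hat: predictions from each lower capsule for each higher capsule,
    shape (num_lower, num_higher, dim_higher)."""
    num_lower, num_higher, _ = u_hat.shape
    b = torch.zeros(num_lower, num_higher)           # routing logits, start neutral
    for _ in range(num_iterations):
        c = torch.softmax(b, dim=1)                  # coupling coefficients per lower capsule
        s = (c.unsqueeze(-1) * u_hat).sum(dim=0)     # weighted sum of predictions
        v = squash(s)                                # current output of the higher capsules
        # Increase the coupling where a prediction agrees (large dot product)
        # with the higher capsule's current output.
        b = b + (u_hat * v.unsqueeze(0)).sum(dim=-1)
    return v

# Example: 3 primary capsules (nose, left eye, right eye), each predicting
# the 8-D output of 2 higher capsules ('Panda', 'Not Panda').
u_hat = torch.randn(3, 2, 8)
print(routing_by_agreement(u_hat).shape)             # torch.Size([2, 8])
```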

References

Sara Sabour, Nicholas Frosst, Geoffrey E. Hinton, ‘Dynamic Routing Between Capsules’, arXiv:1710.09829, 2017.