CNN Simplified for Beginners, Part I


Image processing is one of the predominant applications of Deep Learning, and the Convolutional Neural Network (CNN) is, metaphorically speaking, the king of this segment. While there is a plethora of documents and videos on this topic, this blog series approaches CNN from the standpoint of a beginner.

A thorough understanding of the nuts and bolts of this algorithm will certainly bolster your confidence, but pragmatically it isn't fully required. To use a simple metaphor, it's like learning to drive a car: an in-depth understanding of how the combustion engine works doesn't hurt, but it isn't essential to drive or even to get a license! In the Deep Learning world, a data scientist is expected to know (a) when to use an algorithm and (b) how to use it.

Carrying on our car-driving metaphor, a driver needs to know the available controls, such as the brake and the accelerator, and what each one does: the brake slows down or stops the vehicle, whereas the accelerator does the reverse (a misunderstanding could be fatal!). In Deep Learning such controls are technically coined "hyperparameters", and it is imperative for us to know which controls are available and what each one does (an incorrect or hazy understanding will lead to inaccurate results, along with significant cash and effort burn).

With that context set, let's get on to CNN. Identifying the animals in the picture below wouldn't be a tough ask for any of us.

Within a fraction of a second, we easily identify the zebra, giraffe, and elephant (left to right). How about the pictures below?

We can still identify it as an elephant irrespective of the angle (side or front pose), distortion, or how much the image is zoomed in or out; we don't really have any difficulty recognizing it. Let's try to comprehend the logic behind this simple task. The human brain doesn't store a few fixed images of an object and do a pixel-to-pixel comparison between the stored image and the image captured by the eyes; rather, it tries to classify an object based on its features.

For example, in the case of an elephant, we understand it has wide ears, a long trunk, columnar legs, and so on. So whenever we see a picture containing an elephant, our brain recognizes these features and classifies it as an elephant. Since all this computation happens so fast, we don't get to appreciate the complexity behind the whole process. Just imagine a picture of an object with wide ears and columnar legs, but with black-and-white stripes: our subconscious recognition will error out, prompting us to take a closer look, because black-and-white stripes are not among the features we have recorded for an elephant.

CNN is an earnest attempt to replicate the human way of recognizing an object: it doesn't do a pixel-to-pixel mapping, but rather tries to learn from features. Covering CNN in a single blog would be an invitation for a deep slumber, so to keep it engaging we will break it into three parts.

Parts 1 and 2 will deal with the logic behind CNN, the bare-bones algorithm; the final part will cover its implementation.

Let's get back to our familiar example of identifying an elephant. I have converted the image into monochrome so that we can keep the calculations as simple as possible (in the real world it could be a colored, high-definition image).

The very first thing a CNN does in image classification is "feature mapping". In this example, we could take any number of features, such as the wide ears, the trunk, the legs, etc.

In our example, we will consider the elephant's wide ear as a sample feature. To do feature mapping, let's translate the sample image and the sample feature into grids, as shown below.

Image fitted into an 8*8 grid

The monochrome image of the elephant is now fitted into an 8*8 grid. The figure below captures the feature (the wide ear) in a 3*3 grid.

Feature: Ear (3*3 grid)

The next step is to digitize the image: whichever cell contains part of the elephant is assigned +1, and every other cell is assigned -1. The resultant matrix is given below.

Digitizing our 8*8 Grid
Feature mapped as +/- 1 grid
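As a concrete illustration, here is a minimal sketch of this digitization step in Python with NumPy. The array names and the threshold value are assumptions for illustration, not part of the blog's original example:

```python
import numpy as np

# Hypothetical 8*8 monochrome image with pixel values in [0, 255]
image = np.random.randint(0, 256, size=(8, 8))

# Digitize: cells dark enough to be part of the elephant become +1,
# everything else becomes -1 (THRESHOLD is an assumed cut-off).
THRESHOLD = 128
digitized = np.where(image < THRESHOLD, 1, -1)

print(digitized)  # an 8*8 grid of +1/-1 values
```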

With the image translated into numbers, feature mapping becomes easier: we traverse every cell of the base image with the 3*3 grid to identify similarities. Consider the figures below. In Figure 1 it's a perfect fit, so we assign 1 to the cell; in Figure 2 it's only a partial fit, as we are mapping the ear feature onto the head of the elephant.

Figure 1: Perfect fit


Figure 2: Partial fit

We need to repeat the above exercise for each cell of the grid, and for every feature we selected for this exercise. Technically, these two elements (the number of features and the grid size of each feature) are the hyperparameters for feature mapping, so when we get into implementation mode you will not actually code the logic behind feature mapping; rather, you will specify the hyperparameters (remember the car metaphor!).
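To make the mechanics concrete, here is a minimal sketch of this feature-mapping step, assuming the +/-1 grids from above. A common way to formalize the "perfect fit vs. partial fit" idea is to score each position with the average of the cell-by-cell products, so a perfect fit scores exactly 1 and a partial fit scores less:

```python
import numpy as np

def feature_map(image, feature):
    """Slide a small feature grid over the image and score each position."""
    fh, fw = feature.shape
    ih, iw = image.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + fh, c:c + fw]
            # Average of element-wise products: 1.0 means a perfect fit,
            # lower values mean a partial fit.
            out[r, c] = np.mean(patch * feature)
    return out

# Hypothetical +/-1 grids standing in for the 8*8 image and 3*3 ear feature
image = np.random.choice([-1, 1], size=(8, 8))
feature = np.random.choice([-1, 1], size=(3, 3))
print(feature_map(image, feature))  # a 6*6 grid of match scores
```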


So at the end of feature mapping, we get a layer of grids filtered by the features; this layer is our first level of convolution. Refer to the sample grid in Figure 3 (by increasing the number of features and the grid size, you can get a richer layer in each iteration).

Figure 3

The second phase of a CNN is called pooling. This is done to shrink the resultant grids/parameters so that we have a manageable chunk to process, without significant information loss. A few popular pooling algorithms are max pooling, min pooling, average pooling, etc. In the figure below, we see an example of max pooling.

Max Pooling

Applying max pooling to our grid, we get a compressed matrix, as shown below.

Applying Pooling

The hyperparameters for pooling are the size of the pool (in our case 2*2) and the stride it takes (in our example, 2 cells in each direction).
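Here is a minimal sketch of max pooling with a 2*2 pool and a stride of 2; the grid name and sizes are illustrative assumptions:

```python
import numpy as np

def max_pool(grid, pool=2, stride=2):
    """Downsample a 2-D grid by taking the max of each pool*pool window."""
    rows = (grid.shape[0] - pool) // stride + 1
    cols = (grid.shape[1] - pool) // stride + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = grid[r * stride:r * stride + pool,
                          c * stride:c * stride + pool]
            out[r, c] = window.max()  # keep the strongest match in the window
    return out

# A hypothetical 6*6 grid of match scores from the feature-mapping step
scores = np.random.rand(6, 6)
print(max_pool(scores))  # compressed to 3*3
```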

The last phase covered in this blog is "normalization". This is similar to what we may have experienced in a conventional work setup, where appraisal ratings are normalized at a portfolio or org level to remove bias.

If you are wondering how this applies to our example, consider a model trained to classify elephants. If the model was built entirely on Asian elephants and we then apply it to an African elephant, it will try to relearn, as there are minor differences between the two categories. Technically these differences don't really matter, as we are still solving the problem of "classifying elephants", so to prevent these minor biases from impacting our model we apply normalization.


In our example, we will apply ReLU (Rectified Linear Unit). For beginners, let's keep it simple: ReLU retains all positive values, and any negative value is converted to zero (more on ReLU in a different blog).

Applying ReLU to our example grid
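In practice ReLU is essentially a one-liner; a minimal sketch, applied to a hypothetical grid of match scores:

```python
import numpy as np

def relu(grid):
    """Keep positive values as-is; replace negative values with zero."""
    return np.maximum(grid, 0)

# Hypothetical grid containing both positive and negative match scores
scores = np.array([[0.55, -0.11],
                   [-0.33, 1.00]])
print(relu(scores))
# [[0.55 0.  ]
#  [0.   1.  ]]
```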

Are we done? Very close… we can iteratively apply the concepts we learned, Filtering (F), Normalization (N), and Pooling (P), until we get a really compressed version; metaphorically, this will be like a grand sandwich with multiple layers, as sketched below.
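Putting the pieces together, here is a minimal, self-contained sketch of one such F-N-P pass, condensing the same hypothetical helpers as in the earlier sketches:

```python
import numpy as np

def feature_map(image, feature):
    # F: average of element-wise products at each position (match score)
    fh, fw = feature.shape
    out = np.zeros((image.shape[0] - fh + 1, image.shape[1] - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.mean(image[r:r + fh, c:c + fw] * feature)
    return out

def relu(grid):
    # N: negative scores become zero
    return np.maximum(grid, 0)

def max_pool(grid, pool=2, stride=2):
    # P: keep the strongest match in each pool*pool window
    rows = (grid.shape[0] - pool) // stride + 1
    cols = (grid.shape[1] - pool) // stride + 1
    return np.array([[grid[r * stride:r * stride + pool,
                           c * stride:c * stride + pool].max()
                      for c in range(cols)] for r in range(rows)])

# One F-N-P pass over a digitized 8*8 image with a single 3*3 feature
image = np.random.choice([-1, 1], size=(8, 8))
feature = np.random.choice([-1, 1], size=(3, 3))

layer = feature_map(image, feature)  # F: 8*8 image -> 6*6 match scores
layer = relu(layer)                  # N: zero out negative scores
layer = max_pool(layer)              # P: 6*6 -> 3*3 compressed grid
print(layer)  # in a real CNN, more such layers would be stacked
```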

This concludes Part I of this blog. In Part 2 we will get into the more interesting part: using this grid to determine the object, and how neurons fire to accomplish our objective. For ease of following along, you can also refer to my video on the same topic.


