Basics of Image Recognition: A beginner’s approach

“Just as electricity transformed almost everything 100 years ago, today I actually have a hard time thinking of an industry that I don’t think AI will transform in the next several years.” ~ Andrew Ng

What is Image Recognition?

A digital image is a representation of visual data in a grid-like fashion. It consists of a series of pixel values that denote how bright each pixel should be and what colour it should have. Image recognition is the process of taking an image as input, passing it through a neural network, and finally obtaining a class label as the output. The class label given by the neural network belongs to a set of pre-defined classes.
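
As a rough sketch of that pipeline (assuming a recent TensorFlow/Keras install and a pretrained MobileNetV2 classifier; the file name is just a placeholder), it might look like this:

```python
# Minimal sketch of the image-recognition pipeline described above,
# assuming TensorFlow 2.x and a pretrained MobileNetV2 (ImageNet classes).
# "cat.jpg" is a placeholder path for any input image.
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)

model = MobileNetV2(weights="imagenet")            # network with pre-defined classes

img = tf.keras.utils.load_img("cat.jpg", target_size=(224, 224))   # image as input
x = preprocess_input(np.expand_dims(tf.keras.utils.img_to_array(img), axis=0))

preds = model.predict(x)                           # pass through the neural network
print(decode_predictions(preds, top=1)[0])         # class label as output
```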

How is it done?

A Convolutional Neural Network, also known as a CNN, is a class of neural networks that specializes in processing data with a grid-like topology, such as an image. The moment we perceive an image, our brain analyses a massive amount of data. Each neuron has its own receptive field and is coupled to other neurons in such a way that the full visual field is covered. Similarly, a CNN has a number of layers, designed so that simpler patterns (lines, curves, etc.) are detected first, followed by more complicated patterns (faces, objects, etc.).

Figure 1: Representation of an image as a grid of pixels

The input given to a CNN is an image. If the image is 32 pixels wide and 32 pixels high with three colour channels (R, G, B), the network holds the raw pixel values of the image as a [32x32x3] array.

The image, in its matrix form of values (as represented in Figure 1), is given as input to the CNN and has to pass through multiple layers.
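
To make the "grid of pixels" idea concrete, here is a minimal NumPy sketch (the values are random rather than taken from a real photo):

```python
import numpy as np

# A 32x32 RGB image is just a 32x32x3 grid of pixel intensities (0-255).
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

print(image.shape)   # (32, 32, 3) -> height x width x channels
print(image[0, 0])   # the R, G, B values of the top-left pixel
```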

Now let us learn about the different layers in a CNN.

Different Layers of CNN

Convolution Layer

This is the core building block of our neural network. A dot product is computed between two matrices: the kernel, a small matrix holding the set of learnable parameters or weights, and the image matrix given as input. The kernel is dimensionally smaller than the image. The kernel slides across the image matrix, and at each position the element-wise products are summed to produce one element of the resultant output matrix (as shown in the figure).

Working of Convolution layer
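
The sliding-window behaviour can be written out directly. Below is a minimal NumPy sketch of a single-channel convolution with stride 1 and no padding; the 5x5 image and 2x2 kernel are toy values, not learned weights:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take the dot product at each position
    (stride 1, no padding) -- a minimal sketch of the convolution layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)   # element-wise multiply and sum
    return output

image = np.arange(25, dtype=float).reshape(5, 5)    # toy 5x5 single-channel "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])        # toy 2x2 kernel of weights
print(convolve2d(image, kernel).shape)              # (4, 4) feature map
```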

Advantages of Convolution layer –

  1. Sparse interaction: This is achieved by making the kernel smaller than the input. For example, an image can have thousands or millions of pixels, but with a kernel we can detect meaningful features that span only tens or hundreds of pixels. This means fewer parameters, and therefore less memory and greater efficiency.
  2. Parameter sharing: The same kernel is used across every position of the input, so the shared parameters are learned once and applied everywhere, treating all regions of the image in the same way.
  3. Equivariant representation: Since the kernel slides over the input in a fixed manner, if the input is changed (for example, shifted), the output changes in the same way.

Pooling Layer

This layer helps in reducing the spatial size of the representation by deriving a summary statistic of the nearby outputs, which decreases the computation. There are several pooling functions, such as the average of the rectangular frame, the L2 norm of the rectangular frame, and a weighted average based on the distance from the central pixel. The most popular one is max pooling, which takes the maximum value of the elements in the frame.

Pooling
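
Here is a minimal NumPy sketch of 2x2 max pooling with stride 2 (the feature map values below are made up for illustration):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """2x2 max pooling with stride 2: keep the largest value in each frame."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]        # drop odd edges
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 1],
               [4, 6, 5, 2],
               [7, 2, 9, 1],
               [3, 1, 4, 8]], dtype=float)
print(max_pool(fm))   # [[6. 5.]
                      #  [7. 9.]]
```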

Fully Connected Layer

The Fully Connected (FC) layer consists of neurons, weights and biases. Fully connected means that each neuron of an FC layer is connected to every neuron of the next layer. The FC layers generally sit in the later stages of a CNN architecture, just before the output. The feature maps from the previous layers are flattened into a vector and fed to the FC layer. This flattened vector then passes through a few more layers, where weighted sums and activations are applied. This is where classification begins.

Image recognition and Classification
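
A minimal NumPy sketch of the flatten-then-fully-connected step follows; the shapes and random weights are purely illustrative:

```python
import numpy as np

# Sketch of a fully connected (dense) layer on top of pooled feature maps.
feature_maps = np.random.rand(8, 14, 14)        # 8 feature maps of size 14x14
flattened = feature_maps.reshape(-1)            # flatten to a 1568-element vector

weights = np.random.randn(10, flattened.size)   # every output neuron sees every input
bias = np.zeros(10)
scores = weights @ flattened + bias             # one raw score per class
print(scores.shape)                             # (10,)
```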

Dropout

Usually, when all the features are connected to the FC layer, the model can overfit the training dataset. Overfitting occurs when a model performs too well on the training data, which negatively impacts its performance on new data. To overcome this problem, a dropout layer is used, in which some neurons are randomly dropped from the neural network during training, effectively shrinking the model while it learns. By passing a rate of 0.35, 35% of the neurons are randomly dropped from the neural network.
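
For example, in a Keras-style model this is just a Dropout layer placed between fully connected layers (a minimal sketch; the layer sizes are assumptions, only the 0.35 rate comes from the text above):

```python
import tensorflow as tf

# A Dropout layer with rate 0.35 randomly zeroes 35% of the incoming
# activations during training only; at inference time it does nothing.
fc_block = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.35),
    tf.keras.layers.Dense(10, activation="softmax"),
])
```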

Activation Function

Finally, one of the most important parameters of the CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between the variables of the network. In simple words, they decide which information in the model should fire in the forward direction and which should not at the end of the network.

It adds non-linearity to the network. There are several commonly used activation functions, such as ReLU, Softmax, tanh and Sigmoid. Each of these functions has a specific usage. For a binary classification CNN model, the sigmoid and softmax functions are preferred, and for multi-class classification, softmax is generally used.

Some of these functions are explained below; a short code sketch follows the list.

  1. ReLU — Rectified Linear Unit. ReLU is an element-wise operation, which means that it is applied per pixel, and it replaces all negative values in the feature map with zero. Essentially, all-black values (i.e. negative values) are set to zero.
  2. Sigmoid — The input to the function is transformed into a value between 0.0 and 1.0. Inputs much larger than 1.0 are pushed towards 1.0; similarly, values much smaller than 0.0 are pushed towards 0.0.
  3. Hyperbolic Tan (tanh) — The hyperbolic tangent function, or tanh for short, is a similarly shaped nonlinear activation function that outputs values between -1.0 and 1.0.
  4. Softmax — The Softmax activation function calculates the relative probabilities of the classes. It can be seen as a generalization of the sigmoid function to multiple classes.
Different layers together predict class
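
For reference, these four functions can be written in a few lines of NumPy (a minimal sketch, not tied to any particular framework):

```python
import numpy as np

def relu(x):                    # zero out negative values, element-wise
    return np.maximum(0, x)

def sigmoid(x):                 # squash values into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                    # squash values into (-1, 1)
    return np.tanh(x)

def softmax(x):                 # relative probabilities that sum to 1
    e = np.exp(x - np.max(x))   # shift for numerical stability
    return e / e.sum()

scores = np.array([2.0, -1.0, 0.5])
print(relu(scores))             # [2.  0.  0.5]
print(softmax(scores))          # probabilities over the three classes
```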

What is the difference between a neural network and a CNN?

A simple (fully connected) neural network flattens the original image into a one-dimensional list of values and accepts that as input, so the relationship between neighboring pixels may not be retained. In contrast, CNNs use convolution layers that preserve the spatial relationship between neighboring pixels.
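
Putting the pieces together, a small CNN for 32x32x3 images might look like the following Keras-style sketch. The number of filters, units and classes are assumptions for illustration, not a recommended architecture:

```python
import tensorflow as tf

# Minimal sketch combining the layers discussed above: convolution, pooling,
# flatten, fully connected, dropout (rate 0.35) and a softmax output.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                       # 32x32 RGB input
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # detect simple patterns
    tf.keras.layers.MaxPooling2D((2, 2)),                    # reduce spatial size
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),   # detect more complex patterns
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),                                # flatten feature maps
    tf.keras.layers.Dense(128, activation="relu"),            # fully connected layer
    tf.keras.layers.Dropout(0.35),                            # drop 35% of neurons in training
    tf.keras.layers.Dense(10, activation="softmax"),          # class probabilities
])
model.summary()
```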

Conclusion

Image recognition using deep learning has a wide variety of applications, such as improving augmented reality games and apps, assisting in education, optimizing medical imagery, predicting consumer behaviour and giving machines vision. This was just about the basics, but one can dive deeper into the world of using images as a mode of improving technology and lifestyle.
