Explain Object Detection to me like I’m five

Object Detection is one of the fundamental problems in computer vision. It refers to the task of finding bounding boxes for objects in an image. An example of the type of boxes we want to draw is shown above. But how do we actually achieve this? In this post, I try to explain how researchers approach this problem in an intuitive way.

Take for example this picture from the popular game ‘Where’s Waldo’. If I asked you to find where Waldo is, how would you go about doing that? You would look around the image and try to find someone who is wearing a red-white striped t-shirt and blue pants.


How did we know to look for someone wearing a red-white striped t-shirt and blue pants? Because we’ve seen Waldo before and we know what he looks like. We know what “features” Waldo has. How did we know where to look for Waldo? Well, we didn’t. We fixed our gaze on a small portion of the image and moved our gaze around until we found him. Researchers call this technique the sliding window approach.
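The "moving our gaze around" idea can be sketched in a few lines. Here's a minimal sliding window generator, assuming a toy image described only by its height and width; the window size and stride are illustrative choices, not values from any real detector:

```python
def sliding_windows(height, width, win_h, win_w, stride):
    """Yield the top-left (y, x) corner of every window position,
    scanning left-to-right, top-to-bottom like a moving gaze."""
    for y in range(0, height - win_h + 1, stride):
        for x in range(0, width - win_w + 1, stride):
            yield (y, x)

# For a 6x8 image with 3x3 windows and stride 1, we visit 4 * 6 = 24 positions.
positions = list(sliding_windows(6, 8, 3, 3, 1))
```

At each position, a detector would ask "does this window look like Waldo?" before sliding on.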

This is essentially how researchers used to solve object detection. They extract features from an image and then use the sliding window approach, trying to match the features they know the object has with the features in each window. The features we extract could be anything from edges to Histograms of Oriented Gradients (HOG) to deep learning activations. Any of these features can be paired with the sliding window approach to find objects.
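To make the "match features in each window" step concrete, here's a toy sketch. The "feature" is just the window's mean intensity, which is a stand-in assumption; real detectors use much richer features like edges, HOG, or CNN activations, but the compare-against-a-known-target logic is the same:

```python
def mean_feature(image, y, x, h, w):
    """A trivial stand-in feature: the mean pixel value of a window."""
    vals = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return sum(vals) / len(vals)

def best_window(image, h, w, target_feature):
    """Slide a window over the image and return the position whose
    feature is closest to the target's known feature."""
    rows, cols = len(image), len(image[0])
    best, best_dist = None, float("inf")
    for y in range(rows - h + 1):
        for x in range(cols - w + 1):
            dist = abs(mean_feature(image, y, x, h, w) - target_feature)
            if dist < best_dist:
                best, best_dist = (y, x), dist
    return best

# A 4x4 toy image with a bright 2x2 patch in the bottom-right corner.
img = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# We "know" the target has mean intensity 9, just as we know Waldo's stripes.
print(best_window(img, 2, 2, 9))  # → (2, 2)
```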

Now consider this image. We could still use the sliding window approach, but it seems like we would waste a lot of time that way. We already know that Waldo is not in the yellow regions, so we don't even have to look there. Plus, when we look in a window, we assume a certain size for that window. If Waldo doesn't fit in that window, we might only see his pants or only his shirt, so we can't be sure whether Waldo is there or not. These are the drawbacks of the sliding window approach.
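One common workaround for the fixed-window-size problem is to repeat the scan at several window sizes. A minimal sketch, where the list of sizes is an illustrative assumption (real detectors often scale the image instead, via an image pyramid):

```python
def multi_scale_windows(height, width, sizes, stride):
    """Yield (y, x, size) for square windows at each scale, so an
    object too big for one window size may fit in a larger one."""
    for s in sizes:
        for y in range(0, height - s + 1, stride):
            for x in range(0, width - s + 1, stride):
                yield (y, x, s)

# Scanning an 8x8 image at sizes 2 and 4 with stride 2
# gives 16 small windows plus 9 large ones, 25 in total.
windows = list(multi_scale_windows(8, 8, sizes=[2, 4], stride=2))
```

Of course, this multiplies the number of windows to check, which makes the wasted effort in empty regions even more painful.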

What we could do is only look at portions of the image where we think there is a high chance of Waldo being present. This is an important trick that helped make object detection faster and more accurate. We generate proposals of regions where we think the objects might be and only compare the features in those proposed regions. This technique (region proposal + feature extraction) is basically what the state-of-the-art in object detection does.
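The region-proposal idea can be sketched as: instead of scanning every window, score only a short list of candidate boxes. The proposals and scoring function below are made up for illustration; real systems generate proposals with algorithms like selective search or a learned proposal network, and score them with a trained classifier:

```python
def detect(proposals, score_fn, threshold):
    """Keep only the candidate boxes whose score clears the threshold."""
    return [box for box in proposals if score_fn(box) >= threshold]

# Hypothetical candidate boxes as (y, x, h, w) and a fake scorer
# that happens to prefer boxes near the top-left of the image.
proposals = [(0, 0, 4, 4), (10, 10, 4, 4), (2, 1, 4, 4)]
score = lambda box: 1.0 / (1 + box[0] + box[1])
print(detect(proposals, score, threshold=0.2))  # keeps the two boxes near (0, 0)
```

The speedup comes from the first step: the expensive feature comparison runs on a handful of proposals instead of every possible window.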