RetinaFace: Face Detection Model


RetinaFace is a state-of-the-art model for face detection developed as part of the InsightFace project. Jiankang Deng et al. published the paper "RetinaFace: Single-stage Dense Face Localisation in the Wild" in 2019. The paper reports an outstanding 91.4% average precision (AP) on the hard subset of the WIDER FACE test set. This article provides a brief explanation of the paper.

Introduction

RetinaFace is a single-stage framework that performs three subtasks to achieve pixel-wise face localisation: face detection, 3D face reconstruction with a mesh decoder, and 2D face alignment. According to the paper, the key contributions are:

  1. Developing a robust single-stage architecture
  2. Providing an annotated version of the WIDER FACE dataset with five facial landmarks (two eyes, the nose tip and two corners of the mouth)
  3. Outperforming the state of the art on the WIDER FACE and IJB-C datasets
  4. Real-time performance on a CPU with an efficient CNN backbone

Previous methods contributed ideas that RetinaFace builds on: MTCNN introduced the novel concept of a cascaded structure for predicting facial landmarks, and Mask R-CNN showed that adding a parallel branch for pixel-wise mask prediction improves localisation. RetinaFace incorporates these ideas to produce a robust model that beats the SOTA.

Dataset

The paper works primarily with the WIDER FACE dataset, which consists of 32,203 images containing faces with wide variation in scale, pose, expression, occlusion and illumination, making the data corpus diverse. The authors then annotated five facial landmarks on this dataset, categorising the images into five levels of quality according to the difficulty of landmark annotation.

Key concepts

The method builds on three key concepts: feature pyramids, single-stage detection and context modelling. Before we dive into the model architecture, it's crucial to understand these concepts and the significance they have in making this model a success.

Feature pyramid: A feature pyramid network is a feature extractor that takes a single image as input and outputs feature maps at multiple scales. It has been a crucial tool for object detection tasks in recent years and was proposed in the paper "Feature Pyramid Networks for Object Detection".
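To make the idea concrete, here is a minimal sketch of a feature pyramid in PyTorch. It is illustrative only: the two-level setup and channel counts are assumptions for brevity, not the configuration RetinaFace actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Minimal two-level feature pyramid: lateral 1x1 convs plus a top-down pathway."""
    def __init__(self, c_low=256, c_high=512, c_out=256):
        super().__init__()
        self.lat_low = nn.Conv2d(c_low, c_out, kernel_size=1)    # lateral connection
        self.lat_high = nn.Conv2d(c_high, c_out, kernel_size=1)
        self.smooth = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1)

    def forward(self, feat_low, feat_high):
        # Top-down pathway: upsample the coarse map and add the lateral projection.
        p_high = self.lat_high(feat_high)
        p_low = self.lat_low(feat_low) + F.interpolate(p_high, scale_factor=2, mode="nearest")
        return self.smooth(p_low), p_high

# Example: backbone features at two scales for the same image.
low, high = torch.randn(1, 256, 8, 8), torch.randn(1, 512, 4, 4)
p_low, p_high = TinyFPN()(low, high)
print(p_low.shape, p_high.shape)  # torch.Size([1, 256, 8, 8]) torch.Size([1, 256, 4, 4])
```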

Single stage: Single-stage methods, unlike two-stage methods (e.g. Faster R-CNN), require only a single pass over the full network to generate the bounding boxes of the objects to be detected. This makes them much more efficient, and they are more widely used in recent papers.

Context modelling: The idea is to learn contextual information from the images to aid the localisation task using deformable convolutional networks (DCN). A deformable convolution works much like an ordinary convolutional block in a CNN except that it does not have a rigid kernel grid; the sampling points of the grid are shifted by learned offsets, which allows the receptive field to adapt to objects at multiple scales and deformations.
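Here is a minimal sketch of a deformable convolution using torchvision's DeformConv2d. Predicting the offsets with a plain convolution is the usual pattern; the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted per pixel."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        # 2 offset values (dx, dy) per kernel position: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(in_ch, 18, kernel_size=3, padding=1)
        self.deform = DeformConv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        # The learned offsets bend the 3x3 grid so it can follow the object's shape.
        return self.deform(x, self.offset(x))

x = torch.randn(1, 64, 32, 32)
print(DeformBlock()(x).shape)  # torch.Size([1, 64, 32, 32])
```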

Methodology

Multi-task loss: The loss the network minimises combines four terms, one per level of supervision, and the final objective is:

L = L_cls(p_i, p_i*) + λ1 p_i* L_box(t_i, t_i*) + λ2 p_i* L_pts(ℓ_i, ℓ_i*) + λ3 p_i* L_pixel

where the loss-balancing weights λ1, λ2 and λ3 are set to 0.25, 0.1 and 0.01 in the paper.

i. Face classification loss (L_cls): A softmax loss over two classes (face / not face), where p_i is the predicted probability that anchor i is a face, and p_i* is 1 for a positive anchor and 0 for a negative anchor.

ii. Face box regression loss (L_box): The regression loss for the bounding box, where t_i and t_i* are the coordinates of the predicted and ground-truth boxes associated with a positive anchor. The box targets (centre, width and height) are normalised, and a smooth-L1 (robust) loss is applied.

iii. Facial landmark regression loss (L_pts): The regression loss for the five facial landmarks (two eyes, the nose tip and two corners of the mouth), where ℓ_i and ℓ_i* are the predicted and ground-truth landmarks respectively. Like the box targets, the landmark targets are normalised with respect to the anchor.

iv. Dense regression loss (L_pixel): After the 3D mesh face is constructed (explained below) and rendered back onto the image plane, the pixel-wise difference between the rendered and the original 2D face is minimised. This dense regression term provides an extra supervision signal beyond the bounding box and the landmarks, which in turn improves localisation.
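As a rough illustration of how the four terms combine, here is a hedged PyTorch sketch. The weights match those reported in the paper, but the individual loss functions are simplified stand-ins (cross-entropy, smooth-L1, and a plain pixel-wise L1 for the dense term), not the paper's exact formulations.

```python
import torch
import torch.nn.functional as F

# Loss-balancing weights as reported in the paper.
LAMBDA_BOX, LAMBDA_PTS, LAMBDA_PIXEL = 0.25, 0.1, 0.01

def multitask_loss(cls_logits, labels, box_pred, box_gt, pts_pred, pts_gt,
                   rendered, original):
    """Sketch of the RetinaFace multi-task objective for a batch of anchors.

    cls_logits: (N, 2) face / not-face scores; labels: (N,) long, 1 = positive anchor.
    box_pred/box_gt: (N, 4) normalised box targets; pts_pred/pts_gt: (N, 10) landmarks.
    rendered/original: pixel tensors of the rendered mesh and the input face crop.
    """
    pos = labels == 1  # the regression terms are only active for positive anchors
    l_cls = F.cross_entropy(cls_logits, labels)
    l_box = F.smooth_l1_loss(box_pred[pos], box_gt[pos]) if pos.any() else 0.0
    l_pts = F.smooth_l1_loss(pts_pred[pos], pts_gt[pos]) if pos.any() else 0.0
    l_pixel = F.l1_loss(rendered, original)  # simplified stand-in for the dense term
    return l_cls + LAMBDA_BOX * l_box + LAMBDA_PTS * l_pts + LAMBDA_PIXEL * l_pixel
```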

3D mesh decoder: The paper builds on methods from "Generating 3D Faces Using Convolutional Mesh Autoencoders" and "Joint Texture & Shape Convolutional Mesh Decoders" to reconstruct the 3D face from an image. The idea is to deform a predefined generic face template with N vertices to fit the face, so that each pixel of the face can be indexed by the coordinates of the 3D mesh.
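Such mesh decoders operate on the fixed topology of the template using graph convolutions. Below is a heavily simplified sketch of a single graph-convolution layer over the template's vertices; the real decoders use Chebyshev spectral filters and mesh up-sampling, so the plain neighbour-averaging and single linear map here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph-convolution step: average each vertex with its mesh neighbours,
    then mix feature channels with a learned linear map."""
    def __init__(self, in_feats, out_feats, adjacency):
        super().__init__()
        # Row-normalise the (N, N) template adjacency so neighbour features average.
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        self.register_buffer("a_norm", adjacency / deg)
        self.linear = nn.Linear(in_feats, out_feats)

    def forward(self, x):                 # x: (N, in_feats) per-vertex features
        return torch.relu(self.linear(self.a_norm @ x))

# Toy face template: 4 vertices connected in a ring.
adj = torch.tensor([[0., 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
layer = GraphConv(in_feats=16, out_feats=3, adjacency=adj)  # 3 = (x, y, z) per vertex
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 3])
```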

Architecture

Let's look at the figure below, which shows the full RetinaFace pipeline: first the feature pyramid network (FPN), then the context modules, and finally the cascaded multi-task loss.

The FPN extracts features at five different levels from the 2D image. The first four feature maps are computed from a pretrained ResNet backbone, while the smallest feature map on top is obtained by a 3x3 convolution with stride 2. A context module is attached to each of the five pyramid levels to extract further contextual information from these features. Finally, the resulting feature maps are passed to the multi-task loss; a sketch of the context module follows the figure.

Figure: the RetinaFace pipeline (from the original paper).
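The context module is SSH-inspired: it enlarges the receptive field by running parallel stacks of 3x3 convolutions (deformable in the paper) and concatenating their outputs. A minimal, non-deformable sketch, with channel splits chosen for illustration:

```python
import torch
import torch.nn as nn

class ContextModule(nn.Module):
    """SSH-style context head: three branches with growing receptive fields
    (one, two and three stacked 3x3 convs), concatenated channel-wise."""
    def __init__(self, channels=256):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        conv = lambda ci, co: nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU())
        self.branch1 = conv(channels, half)    # 3x3 receptive field
        self.conv_a = conv(channels, quarter)  # shared stem for the deeper branches
        self.branch2 = conv(quarter, quarter)  # effectively 5x5
        self.branch3 = nn.Sequential(conv(quarter, quarter), conv(quarter, quarter))  # 7x7

    def forward(self, x):
        a = self.conv_a(x)
        # Concatenation restores the input channel count: half + quarter + quarter.
        return torch.cat([self.branch1(x), self.branch2(a), self.branch3(a)], dim=1)

p = torch.randn(1, 256, 20, 20)      # one pyramid level
print(ContextModule()(p).shape)      # torch.Size([1, 256, 20, 20])
```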

Evaluation

The precision-recall plots below compare RetinaFace with 24 other face detection algorithms in terms of AP (average precision). RetinaFace produces the best AP on all subsets of both the validation and test sets: 96.9% (Easy), 96.1% (Medium) and 91.8% (Hard) on the validation set, and 96.3% (Easy), 95.6% (Medium) and 91.4% (Hard) on the test set.

The most striking performance assessment can be seen in the selfie below, where the model detects 900 faces, with facial landmarks, out of 1,151 densely packed faces!

Conclusion

This article condenses the key concepts of RetinaFace and explains their significance. RetinaFace is the current state of the art and holds promising potential in biometrics, security and entertainment. The code for inference with a pretrained model can be found in this repo.

Hope this article was helpful. Happy Coding!
