Understanding and Implementing Faster R-CNN: A Step-By-Step Guide

Computer Vision and Object Detection

Demystifying Object Detection


Published in Towards Data Science · 16 min read · Nov 2, 2022

I was first introduced to object detection through the TensorFlow Object Detection API. It was simple to use: I passed in an image of a beach and, in return, the API painted boxes over the objects it recognized. It seemed magical. I became curious and wanted to dissect the API to understand how it really works under the hood. It was hard, and I failed. The TensorFlow Object Detection API supports state-of-the-art models that are the result of decades of research, and they’re intricately woven into code the way a watchmaker assembles tiny gears that move in concert.

However, most current state-of-the-art models build on the groundwork laid by the Faster-RCNN model, whose paper remains one of the most cited in computer vision even today. Hence it’s crucial to understand it.

In this article, we’ll break down the Faster-RCNN paper, understand its working, and build it part by part in PyTorch to understand the nuances.


For object detection we need to build a model and teach it to both recognize and localize objects in the image. The Faster R-CNN model takes the following approach: the image first passes through the backbone network to produce an output feature map, and the ground truth bounding boxes of the image are projected onto that feature map. The backbone network is usually a dense convolutional network like ResNet or VGG16. The output feature map is a spatially dense tensor that represents the learned features of the image. Next, we treat each point on this feature map as an anchor. For each anchor, we generate multiple boxes of different sizes and shapes. The purpose of these anchor boxes is to capture objects in the image.

We use a 1x1 convolutional network to predict the category and the offsets of all the anchor boxes. During training, we sample the anchor boxes that overlap the most with the projected ground truth boxes. These are called positive, or activated, anchor boxes. We also sample negative anchor boxes, which have little to no overlap with the ground truth boxes. The positive anchor boxes are assigned the category object, while the negative boxes are assigned background. The network learns to classify anchor boxes using binary cross-entropy loss. Now, the positive anchor boxes may not exactly align with the projected ground truth boxes, so we train a similar 1x1 convolutional network to predict their offsets from the ground truth boxes. These offsets, when applied to the anchor boxes, bring them closer to the ground truth boxes. We use L2 regression loss to learn the offsets. The anchor boxes transformed by the predicted offsets are called region proposals, and the network described above is called the region proposal network. This is stage 1 of the detector; Faster-RCNN is a two-stage detector, so there is one more stage.

The input to stage 2 is the region proposals generated in stage 1. In stage 2, we learn to predict the category of the object in each region proposal using a simple convolutional network. Now, the raw region proposals are of different sizes, so we use a technique called ROI pooling to resize them before passing them through the network. This network learns to predict multiple categories using cross-entropy loss. We use another network to predict the offsets of the region proposals from the ground truth boxes, which further aligns the proposals with the ground truth boxes; this uses L2 regression loss. Finally, we take a weighted combination of both losses to compute the final loss. In stage 2, we learn to predict both categories and offsets. This is called multi-task learning.

All of this happens during training. During inference, we pass the image through the backbone network and generate anchor boxes — same as before. However, this time we select only the top ~300 boxes that get a high classification score in stage 1 and qualify them for stage 2. In stage 2, we predict the final categories and offsets. In addition, we perform an extra post-processing step to remove duplicate bounding boxes using a technique called non-max suppression. If everything functions as expected, the detector recognizes and paints boxes over objects in the image as shown below:

[Figure: detected objects with predicted bounding boxes painted on the image]

This is a brief overview of the two-stage Faster-RCNN network. In the following sections we’ll deep dive into each of the parts.

All the code used can be found in this GitHub repository. We don’t need many dependencies as we will be building from scratch. Just the PyTorch library installed in a standard Anaconda environment is enough.

Here’s the main notebook we’ll be working with. Just glance through it. We’ll cover it step-by-step in the following sections.

First, we need some sample images to work with. I’ve downloaded two high-resolution images from here.

Next we need to label these images. CVAT is one of the popular open-source labelling tools out there. You can download it for free from here.

You can simply load the images into the tool, draw boxes around the relevant objects, and mark their category as shown below:

[Figure: labelling objects with bounding boxes in CVAT]

Once done, you can export the annotations in your preferred format. Here I’ve exported them in the CVAT for images 1.1 XML format.

The annotation files contain all the information about the image, the labelled classes, and the bounding box coordinates.

PyTorch Dataset and DataLoader

In PyTorch, it’s considered a best practice to create a class that inherits from PyTorch’s Dataset class to load the data. This gives us more control over the data and helps keep the code modular. Moreover, we can create a PyTorch DataLoader from the dataset instance that automatically takes care of batching, shuffling, and sampling the data.
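
The notebook’s dataset class isn’t reproduced in this excerpt, so here is a minimal sketch of what such a class might look like. The class name, the fixed image size, and the stubbed-out annotation parsing are illustrative assumptions; the full implementation lives in the repository.

import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.io import read_image

class ObjectDetectionDataset(Dataset):
    """Loads images with their ground truth boxes and class labels."""

    def __init__(self, annotation_path, img_size=(480, 640)):
        self.img_size = img_size
        # get_data parses the annotation file (here, CVAT XML) and returns image paths
        # plus box coordinates and class indices, already padded with -1 per image
        self.img_paths, self.gt_boxes, self.gt_classes = self.get_data(annotation_path)

    def get_data(self, annotation_path):
        # parsing omitted for brevity; see the repository for the full implementation
        raise NotImplementedError

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img = read_image(self.img_paths[idx]).float() / 255.0
        # resize to a fixed size so that images can be stacked into a batch
        img = transforms.Resize(self.img_size)(img)
        return img, self.gt_boxes[idx], self.gt_classes[idx]

# a DataLoader then handles batching and shuffling:
# loader = DataLoader(ObjectDetectionDataset("annotations.xml"), batch_size=2, shuffle=True)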

In the above class, we’ve defined a function called get_data that loads the annotation file and parses it to extract image paths, labelled classes, and bounding box coordinates which are then converted to PyTorch’s Tensor object. The images are reshaped to a fixed size.

Notice we’re padding the bounding boxes. This, combined with resize, allows us to batch the images together.

[Figure: padding bounding boxes and class labels with -1 up to the maximum object count in the batch]

When a batch contains multiple images, each with a variable number of objects, we take the maximum number of objects across the images and pad the rest with -1 to match that length, as shown in the figure above. We pad the bounding box coordinates as well as the categories.
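
A rough sketch of that padding step, using PyTorch’s pad_sequence (the names here are illustrative):

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_targets(boxes_per_image, classes_per_image):
    """Pad per-image boxes and classes with -1 so they stack into fixed-size batch tensors."""
    gt_boxes = pad_sequence(boxes_per_image, batch_first=True, padding_value=-1)      # (B, max_objects, 4)
    gt_classes = pad_sequence(classes_per_image, batch_first=True, padding_value=-1)  # (B, max_objects)
    return gt_boxes, gt_classes

# example: a batch of two images with 2 and 3 labelled objects respectively
boxes = [torch.rand(2, 4), torch.rand(3, 4)]
classes = [torch.zeros(2), torch.ones(3)]
gt_boxes, gt_classes = pad_targets(boxes, classes)   # shapes: (2, 3, 4) and (2, 3)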

We can grab some images from the DataLoader and visualize them as shown below:

[Figure: a batch of images from the DataLoader with their ground truth boxes]

Here we’ll use ResNet-50 as the backbone network. Remember, a single block in ResNet-50 is composed of a stack of bottleneck layers. The image is halved along the spatial dimensions after each block, while the number of channels is doubled. A bottleneck layer is composed of three convolutional layers along with a skip connection, as shown below:

[Figure: a ResNet bottleneck layer with three convolutional layers and a skip connection]

We’ll use the first four blocks of ResNet-50 as the backbone network.


Once an image passes through the backbone network, it gets downsampled along the spatial dimension. The output is a feature-rich representation of the image.


If we pass an image of size (640, 480) through the backbone network, we get an output feature map of size (15, 20). So the image has been down-scaled by a factor of (32, 32).
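
As a quick sanity check, the following sketch builds such a backbone from torchvision’s ResNet-50 by keeping everything up to and including the fourth residual block (i.e. dropping the average-pool and fully-connected layers) and verifies the shape arithmetic:

import torch
import torch.nn as nn
import torchvision

resnet = torchvision.models.resnet50(weights="DEFAULT")   # pretrained ImageNet weights (recent torchvision API)
backbone = nn.Sequential(*list(resnet.children())[:-2])   # stem + the four residual blocks

img = torch.randn(1, 3, 480, 640)   # a dummy 640x480 image as (B, C, H, W)
with torch.no_grad():
    feat = backbone(img)
print(feat.shape)                    # torch.Size([1, 2048, 15, 20]) -> downscaled by 32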

We consider each point in the feature map as an anchor point. The anchor points are then just arrays of coordinates along the width and height dimensions.


To visualize these anchor points, we can simply project them onto the image space by multiplying with the width and height scale factors.

[Figure: anchor points projected onto the image]

For each anchor point, we generate nine bounding boxes of different shapes and sizes. We choose the size and shape of these boxes such that they enclose all the objects in the image. The selection of anchor boxes usually depends on the dataset.

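
A rough sketch of how the anchor points and the nine boxes per point might be generated (the scales and aspect ratios here are illustrative; in practice they are chosen to cover the objects in the dataset):

import torch

def gen_anchor_points(w_amap, h_amap):
    """One anchor point per feature-map cell, placed at the cell centre."""
    xs = torch.arange(w_amap) + 0.5
    ys = torch.arange(h_amap) + 0.5
    return xs, ys

def gen_anchor_boxes(xs, ys, scales=(2, 4, 6), ratios=(0.5, 1.0, 1.5)):
    """Nine (x1, y1, x2, y2) boxes per anchor point, in feature-map coordinates."""
    n_boxes = len(scales) * len(ratios)
    boxes = torch.zeros(len(xs), len(ys), n_boxes, 4)
    for ix, xc in enumerate(xs):
        for iy, yc in enumerate(ys):
            for k, (s, r) in enumerate((s, r) for s in scales for r in ratios):
                w, h = s * r, s
                boxes[ix, iy, k] = torch.tensor([xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2])
    return boxes   # (w_amap, h_amap, n_anc_boxes, 4)

# for a (15, 20) feature map; multiplying by the scale factor (32) projects the boxes to image space
xs, ys = gen_anchor_points(w_amap=20, h_amap=15)
anc_boxes = gen_anchor_boxes(xs, ys)   # (20, 15, 9, 4)
anc_boxes_img = anc_boxes * 32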

Another advantage of resizing the images is that the anchor boxes can be duplicated across all the images.


Again, to visualize the anchor boxes we project them onto the image space by multiplying with the width and height scale factors.

[Figure: anchor boxes projected onto the image]

Here’s what it looks like if we visualize all anchor boxes for all anchor points:

[Figure: all anchor boxes for all anchor points]

In this section we’ll discuss data preparation for training.

Positive and Negative Anchor Boxes

We only need to sample a few anchor boxes for training, both positive and negative. Positive anchor boxes contain an object; negative anchor boxes do not. To sample positive anchor boxes, we select the anchor boxes that have an IoU greater than 0.7 with any of the ground truth boxes (condition 1), or those that have the highest IoU for a given ground truth box (condition 2). When the anchor boxes are poorly generated, condition 1 fails, so condition 2 comes to the rescue, as it selects at least one positive box for every ground truth box. To sample negative anchor boxes, we select the anchor boxes that have an IoU of less than 0.3 with all of the ground truth boxes. Usually, the number of negative samples is far higher than the number of positive samples, so we randomly sample a few of them to match the positive count. IoU (intersection over union) is a metric that measures the overlap between two bounding boxes.

[Figure: IoU as the overlap between two bounding boxes]
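
The listing itself isn’t reproduced in this excerpt; a sketch of such a function using torchvision’s box_iou, with tensor names matching the notation described next, might look like this:

import torch
from torchvision.ops import box_iou

def compute_iou_matrix(anc_boxes_all, gt_boxes_all):
    """
    anc_boxes_all: (B, w_amap, h_amap, n_anc_boxes, 4) anchor boxes in (x1, y1, x2, y2) format
    gt_boxes_all:  (B, max_objects, 4) ground truth boxes, with padded rows set to -1
    returns:       (B, anc_boxes_tot, max_objects) IoU of every anchor box with every GT box
    """
    B = anc_boxes_all.size(0)
    anc_flat = anc_boxes_all.reshape(B, -1, 4)                 # flatten the spatial and per-point dims
    iou_mat = torch.zeros(B, anc_flat.size(1), gt_boxes_all.size(1))
    for b in range(B):
        iou_mat[b] = box_iou(anc_flat[b], gt_boxes_all[b])     # padded GT rows simply yield IoU 0
    return iou_mat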

The function above computes the IoU matrix which contains IoU of every anchor box with all the ground truth boxes in the image. It takes anchor boxes of shape (B, w_amap, h_amap, n_anc_boxes, 4) and ground truth boxes of shape (B, max_objects, 4) as the input and returns a matrix of shape (B, anc_boxes_tot, max_objects) where the notations are as follows:

B - batch size
w_amap - width of the output activation map
h_amap - height of the output activation map
n_anc_boxes - number of anchor boxes per anchor point
max_objects - max number of objects in a batch of images
anc_boxes_tot - total number of anchor boxes in the image, i.e., w_amap * h_amap * n_anc_boxes

The function essentially flattens all the anchor boxes and computes IoU with every ground truth box as illustrated below:

[Figure: flattening the anchor boxes and computing IoU with every ground truth box]

Projecting Ground Truth Boxes

It’s important to remember that the IoU is computed in the feature space between the generated anchor boxes and the projected ground truth boxes. To project a ground truth box onto the feature space, we simply divide its coordinates by the scale factor as shown in the below function:
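
The function from the notebook isn’t shown here, so the following is a minimal sketch of that projection (names are illustrative, and the rounding step is discussed next):

import torch

def project_gt_boxes(gt_boxes, scale_x, scale_y):
    """Map ground truth boxes from image space to feature space by dividing by the scale factors."""
    proj = gt_boxes.clone().float()
    proj[..., [0, 2]] = torch.round(proj[..., [0, 2]] / scale_x)   # x1, x2
    proj[..., [1, 3]] = torch.round(proj[..., [1, 3]] / scale_y)   # y1, y2
    proj[gt_boxes == -1] = -1                                      # keep the -1 padding untouched
    return proj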

Now, when we divide the coordinates by the scale factor, we round off the values to the nearest integer. This essentially means we’re “snapping” the ground truth box to the nearest grid cell in the feature space. So if the difference in scale between the image space and the feature space is large, the projections will not be accurate. Hence it’s important to work with high-resolution images in object detection.


Computing Offsets

The positive anchor boxes do not exactly align with the ground truth boxes. So we compute offsets between the positive anchor boxes and the ground truth boxes and train a neural network to learn these offsets. The offsets can be computed as follows:

tx_ = (gt_cx - anc_cx) / anc_w
ty_ = (gt_cy - anc_cy) / anc_h
tw_ = log(gt_w / anc_w)
th_ = log(gt_h / anc_h)

where:
gt_cx, gt_cy - centers of ground truth boxes
anc_cx, anc_cy - centers of anchor boxes
gt_w, gt_h - width and height of ground truth boxes
anc_w, anc_h - width and height of anchor boxes

The following function can be used to compute the same:
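
(The original listing isn’t included in this excerpt; the sketch below follows the formulas above, assuming boxes in (x1, y1, x2, y2) format.)

import torch

def compute_offsets(anc_boxes, gt_boxes):
    """Offsets (tx, ty, tw, th) of ground truth boxes relative to anchor boxes.
    Both inputs are (N, 4) tensors in (x1, y1, x2, y2) format."""
    anc_w, anc_h = anc_boxes[:, 2] - anc_boxes[:, 0], anc_boxes[:, 3] - anc_boxes[:, 1]
    gt_w, gt_h = gt_boxes[:, 2] - gt_boxes[:, 0], gt_boxes[:, 3] - gt_boxes[:, 1]
    anc_cx, anc_cy = anc_boxes[:, 0] + anc_w / 2, anc_boxes[:, 1] + anc_h / 2
    gt_cx, gt_cy = gt_boxes[:, 0] + gt_w / 2, gt_boxes[:, 1] + gt_h / 2

    tx = (gt_cx - anc_cx) / anc_w
    ty = (gt_cy - anc_cy) / anc_h
    tw = torch.log(gt_w / anc_w)
    th = torch.log(gt_h / anc_h)
    return torch.stack([tx, ty, tw, th], dim=-1)   # (N, 4)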

If you notice, we’re teaching the network to learn how much the anchor box is off from the ground truth box; we’re not forcing it to predict the exact location and scale of the box. Hence the offsets and transformations learnt by the network are location and scale invariant.

Code Walkthrough

Let’s walk through the data preparation code. This is probably the most important function in the whole repository.

The main inputs to this function are the generated anchor boxes and the projected ground truth boxes for a batch of images.

First, we compute the IoU matrix using the function described above. From this matrix, we select the anchor boxes whose IoU with any ground truth box in the image exceeds the threshold (condition 1), and, for every ground truth box, the anchor box that overlaps it the most (condition 2). We combine conditions 1 and 2 and sample the positive anchor boxes for all the images.


Each image will have a different number of positive samples. To avoid this disparity during training, we flatten the batch and combine the positive samples from all the images. Moreover, we can keep track of which image each positive sample comes from using torch.where.

Next we need to compute offsets of positive samples from ground truth. To do so, we need to map every positive sample to its corresponding ground truth box. It’s important to note that a positive anchor box maps to only one ground truth box, while multiple positive anchor boxes can map to the same ground truth box.

To do this mapping, we first expand the ground truth boxes to match the total number of anchor boxes using Tensor.expand. Then, for each anchor box, we select the ground truth box it overlaps with the most. To do this, we take the max IoU indices for all the anchor boxes from the IoU matrix and then “gather” at these indices using torch.gather. Finally, we flatten the batch and filter the positive samples. The process is illustrated below:

[Figure: expanding the ground truth boxes and gathering the best match for each anchor box]
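
In code, that mapping might look roughly like the sketch below; iou_mat, the projected gt_boxes, and the flattened pos_mask are assumed to come from the steps above.

import torch

def map_gt_to_positives(iou_mat, gt_boxes, pos_mask):
    """For every anchor box, pick the ground truth box it overlaps with the most,
    then keep only the positive samples in the flattened batch."""
    B, anc_boxes_tot, max_objects = iou_mat.shape
    max_iou_idx = iou_mat.argmax(dim=-1)                                    # (B, anc_boxes_tot)
    gt_exp = gt_boxes.unsqueeze(1).expand(B, anc_boxes_tot, max_objects, 4)
    idx = max_iou_idx[..., None, None].expand(-1, -1, 1, 4)
    gt_per_anchor = torch.gather(gt_exp, 2, idx).squeeze(2)                 # (B, anc_boxes_tot, 4)
    return gt_per_anchor.reshape(-1, 4)[pos_mask]                           # (n_pos, 4)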

We perform the same process with categories to assign one to each positive sample.

Now that we have a ground truth box mapped for every positive sample, we can compute the offsets using the function described above.

Finally we select negative samples by sampling the anchor boxes which have an IoU less than the given threshold with all of the ground truth boxes. Since negative samples far outweigh positive samples in number, we randomly select a few of them to match the count.
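
A short sketch of that negative sampling (the threshold and names are illustrative):

import torch

def sample_negatives(iou_mat, n_pos, neg_thresh=0.3):
    """Flattened indices of anchor boxes whose IoU with every ground truth box is below the
    threshold, randomly subsampled to match the number of positive samples."""
    max_iou_per_anchor = iou_mat.max(dim=-1).values.flatten()   # (B * anc_boxes_tot,)
    neg_idx = torch.where(max_iou_per_anchor < neg_thresh)[0]
    return neg_idx[torch.randperm(neg_idx.numel())[:n_pos]]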

Here’s what the positive and negative anchor boxes look like:

[Figure: sampled positive and negative anchor boxes visualized on the image]

We can now use the sampled positive and negative anchor boxes for training.

Proposal Module

Let’s start with the proposal module first. As we discussed, every point in the feature map is considered an anchor, and every anchor generates boxes of different sizes and shapes. We want to classify each of these boxes as object or background, and we also want to predict their offsets from the corresponding ground truth boxes. How can we do that? The solution is to use 1x1 convolutional layers. A 1x1 convolutional layer does not increase the receptive field, so its purpose is not to learn image-level features; rather, it is used to change the number of filters or to serve as a regression or classification head.

So we take two 1x1 convolutional layers and use one of them to classify each anchor box as object or background. Let’s call this the confidence head. So, given a feature map of size (B, C, w_amap, h_amap), we convolve a kernel of size 1x1 to get an output of size (B, n_anc_boxes, w_amap, h_amap). Essentially, each output filter represents the classification score of an anchor box.


In a similar way, the other 1x1 convolutional layer takes the feature map and produces an output of size (B, n_anc_boxes * 4, w_amap, h_amap) where the output filters represent the predicted offsets of the anchor boxes. This is called the regression head.
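
Put together, the two heads might look like the following sketch (the input channel count matches the ResNet-50 feature map; everything else is illustrative):

import torch
import torch.nn as nn

class ProposalModule(nn.Module):
    """Two 1x1 convolutional heads on top of the backbone feature map."""
    def __init__(self, in_channels=2048, n_anc_boxes=9):
        super().__init__()
        self.conf_head = nn.Conv2d(in_channels, n_anc_boxes, kernel_size=1)      # objectness score per anchor box
        self.reg_head = nn.Conv2d(in_channels, n_anc_boxes * 4, kernel_size=1)   # 4 offsets per anchor box

    def forward(self, feature_map):
        conf_scores = self.conf_head(feature_map)   # (B, n_anc_boxes, h_amap, w_amap)
        offsets = self.reg_head(feature_map)        # (B, n_anc_boxes * 4, h_amap, w_amap)
        return conf_scores, offsets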

During training, we select the positive anchor boxes and apply predicted offsets to generate region proposals. The region proposals can be computed as follows:

cx^p = cx^a + tx * w^a
cy^p = cy^a + ty * h^a
w^p = w^a * exp(tw)
h^p = h^a * exp(th)

where the superscript p denotes region proposal, the superscript a denotes anchor boxes, and t denotes the predicted offsets.

The following function implements the above transformations and generates region proposals:
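
(The original listing isn’t included in this excerpt; the sketch below applies the transformation to anchor boxes in (x1, y1, x2, y2) format.)

import torch

def generate_proposals(anc_boxes, offsets):
    """Apply predicted offsets (tx, ty, tw, th) to anchor boxes given in (x1, y1, x2, y2) format."""
    anc_w, anc_h = anc_boxes[:, 2] - anc_boxes[:, 0], anc_boxes[:, 3] - anc_boxes[:, 1]
    anc_cx, anc_cy = anc_boxes[:, 0] + anc_w / 2, anc_boxes[:, 1] + anc_h / 2

    prop_cx = anc_cx + offsets[:, 0] * anc_w
    prop_cy = anc_cy + offsets[:, 1] * anc_h
    prop_w = anc_w * torch.exp(offsets[:, 2])
    prop_h = anc_h * torch.exp(offsets[:, 3])

    return torch.stack([prop_cx - prop_w / 2, prop_cy - prop_h / 2,
                        prop_cx + prop_w / 2, prop_cy + prop_h / 2], dim=-1)   # (N, 4) proposals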

Region Proposal Network

The region proposal network is stage 1 of the detector: it takes the feature map and produces region proposals. Here we combine the backbone network, the sampling module, and the proposal module into the region proposal network.

During both training and inference, the RPN produces scores and offsets for all the anchor boxes. However, during training we select only the positive and negative anchor boxes to compute the classification loss. To compute the L2 regression loss, we consider only the offsets of the positive samples. The final loss is a weighted combination of these two losses.
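
As a sketch, that loss computation might look like this; the gathered tensors and the weight w_reg are assumptions following the description above.

import torch
import torch.nn.functional as F

# conf_pos, conf_neg: objectness logits for the sampled positive / negative anchor boxes
# offsets_pos, gt_offsets_pos: predicted and target offsets for the positive samples, (n_pos, 4)
cls_loss = F.binary_cross_entropy_with_logits(
    torch.cat([conf_pos, conf_neg]),
    torch.cat([torch.ones_like(conf_pos), torch.zeros_like(conf_neg)]),
)
reg_loss = F.mse_loss(offsets_pos, gt_offsets_pos)   # L2 loss, positive samples only
rpn_loss = cls_loss + w_reg * reg_loss               # weighted combination; w_reg is a hyperparameter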

During inference, we select the anchor boxes with scores above a given threshold and generate proposals using the predicted offsets. We use a sigmoid function to convert the raw model logits to probability scores.

The proposals generated in both cases are passed to the second stage of the detector.

The Classification Module

In the second stage, we receive the region proposals and predict the category of the object in each proposal. This could be done by a simple convolutional network, but there’s a catch: the proposals do not all have the same size. Now, you may think of resizing the proposals before feeding them into the model, the way we usually resize images in image classification tasks, but the problem is that resizing is not a differentiable operation, so backpropagation cannot flow through it.

Here’s a smarter way to resize: we divide the proposals into roughly equal subregions and apply a max pooling operation on each of them to produce outputs of the same size. This is called ROI pooling and is illustrated below:

[Figure: ROI pooling divides a proposal into subregions and max-pools each of them]

Max pooling is a differentiable operation; we use it in convolutional neural networks all the time.

We don’t need to implement ROI pooling from scratch; the torchvision.ops library provides it for us.

Once the proposals have been resized using ROI pooling, we pass them through a convolutional neural network consisting of a convolutional layer followed by an average pooling layer followed by a linear layer that produces category scores.
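
A sketch of such a classification module, using torchvision’s roi_pool (the ROI size, hidden width, and class count are illustrative):

import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class ClassificationModule(nn.Module):
    """Stage-2 head: ROI-pool each proposal, then predict its object category."""
    def __init__(self, in_channels=2048, n_classes=2, roi_size=(2, 2), hidden=512):
        super().__init__()
        self.roi_size = roi_size
        self.conv = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, feature_map, proposals_per_image):
        # proposals_per_image: a list with one (n_i, 4) tensor of boxes per image,
        # given in feature-map coordinates
        rois = roi_pool(feature_map, proposals_per_image, output_size=self.roi_size)
        x = torch.relu(self.conv(rois))
        x = self.avg_pool(x).flatten(start_dim=1)
        return self.fc(x)   # (total_proposals, n_classes) category logits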

During inference, we predict the object category by applying a softmax function over the raw model logits and selecting the category with the highest probability score. During training, we compute the classification loss using cross-entropy.

In a full-scale implementation, we would also include a background category in the second stage, but we’ll leave that out in this tutorial.

In the second stage, we also add a regression network that further produces offsets for the region proposals. However, since this requires additional bookkeeping, I’ve not included it in this tutorial.

Non-maximum suppression

In the final step of inference, we remove duplicate bounding boxes using a technique called non-max suppression. We first consider the bounding box with the highest classification score. Then we compute the IoU of all the other boxes with this box and remove the ones with a high IoU score; these are the duplicate bounding boxes that overlap with the “original” one. We repeat this process with the remaining boxes until all the duplicates are removed.

Again, we don’t have to implement it from scratch; the torchvision.ops library provides it for us. The NMS processing step is implemented in the stage 1 regression network described above.
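
For illustration, torchvision’s nms takes the boxes and their scores and returns the indices of the boxes to keep:

import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],     # near-duplicate of the first box
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.7)      # tensor([0, 2]); the duplicate is suppressed
boxes, scores = boxes[keep], scores[keep]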

We put together the region proposal network and the classification module to build the final end-to-end Faster-RCNN model.

First, let’s overfit the network on a small sample of data to ensure everything is working as expected. We use a standard training loop with the Adam optimizer and a learning rate of 1e-3.
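
A sketch of such a loop, assuming the model returns its combined loss in training mode and the DataLoader built earlier:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(1000):   # enough iterations to overfit the small sample
    for images, gt_boxes, gt_classes in loader:
        loss = model(images, gt_boxes, gt_classes)   # combined RPN + classification loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()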


Here are the results:

[Figure: model predictions after overfitting on the small sample]

Since we’ve trained on a small subset of data, the model hasn’t learned the image-level features, so the results are not accurate. This can be improved by training on a large dataset.

In a full-scale implementation, we would train the network on a standard dataset like MS-COCO or PASCAL VOC and evaluate the results using metrics like mean average precision or area under the ROC curve. However, the aim of this tutorial is to understand the Faster-RCNN model, so we’ll skip the evaluation.

Over the years, significant advancements have been made in the field and many new networks have been developed. Examples include YOLO, EfficientDet, DETR, and Mask-RCNN. However, most of them build on the groundwork laid by the Faster-RCNN model, which we’ve discussed in this tutorial.

I hope you’ve enjoyed the article. The code is available on GitHub. Let’s connect. You can also reach out to me on LinkedIn or Twitter.

Dataset Acknowledgement

The two images used in this article are from the DIV2K dataset. The dataset is licensed under CC0: Public Domain.

@InProceedings{Agustsson_2017_CVPR_Workshops,
author = {Agustsson, Eirikur and Timofte, Radu},
title = {NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {July},
year = {2017}
}

Image Credits

Unless the source has been cited explicitly in the caption, all the images in this tutorial are by the author.
