14.8. Region-based CNNs (R-CNNs)

Besides single shot multibox detection described in Section 14.7, region-based CNNs or regions with CNN features (R-CNNs) are also among many pioneering approaches of applying deep learning to object detection (Girshick et al., 2014). In this section, we will introduce the R-CNN and its series of improvements: the fast R-CNN (Girshick, 2015), the faster R-CNN (Ren et al., 2015), and the mask R-CNN (He et al., 2017). Due to limited space, we will only focus on the design of these models.

14.8.1. R-CNNs

The R-CNN first extracts many (e.g., 2000) region proposals from the input image (e.g., anchor boxes can also be considered as region proposals), labeling their classes and bounding boxes (e.g., offsets) (Girshick et al., 2014). Then a CNN is used to perform forward propagation on each region proposal to extract its features. Next, features of each region proposal are used for predicting the class and bounding box of this region proposal.


Fig. 14.8.1 The R-CNN model.

Fig. 14.8.1 shows the R-CNN model. More concretely, the R-CNN consists of the following four steps:

  1. Perform selective search to extract multiple high-quality region proposals on the input image (Uijlings et al., 2013). These proposed regions are usually selected at multiple scales with different shapes and sizes. Each region proposal will be labeled with a class and a ground-truth bounding box.

  2. Choose a pretrained CNN and truncate it before the output layer. Resize each region proposal to the input size required by the network, and output the extracted features for the region proposal through forward propagation.

  3. Take the extracted features and labeled class of each region proposal as an example. Train multiple support vector machines to classify objects, where each support vector machine individually determines whether the example contains a specific class.

  4. Take the extracted features and labeled bounding box of each region proposal as an example. Train a linear regression model to predict the ground-truth bounding box.

Although the R-CNN model uses pretrained CNNs to effectively extract image features, it is slow. Imagine that we select thousands of region proposals from a single input image: this requires thousands of CNN forward propagations to perform object detection. This massive computing load makes it infeasible to widely use R-CNNs in real-world applications.
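
To make this cost concrete, below is a minimal sketch of the per-proposal feature extraction in steps 1 and 2. The backbone, image size, and proposal coordinates are illustrative assumptions (the original work used an AlexNet-style network), not the paper's exact pipeline; the key point is the loop that runs one CNN forward pass per proposal.

import torch
import torchvision

# A pretrained CNN, truncated before the output layer (step 2). resnet18 is
# an illustrative stand-in; loading its weights requires a one-time download.
backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()  # drop the classification output layer
backbone.eval()

image = torch.rand(3, 400, 400)                     # toy input image
proposals = [(0, 0, 224, 224), (50, 80, 180, 260)]  # assumed (x1, y1, x2, y2) boxes

features = []
with torch.no_grad():
    for x1, y1, x2, y2 in proposals:
        # Crop each proposal, resize it to the network's input size, and run
        # one forward pass per proposal: the bottleneck described above.
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)
        crop = torch.nn.functional.interpolate(crop, size=(224, 224), mode="bilinear")
        features.append(backbone(crop))             # each has shape (1, 512)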

14.8.2. Fast R-CNN

The main performance bottleneck of an R-CNN lies in the independent CNN forward propagation for each region proposal, without sharing computation. Since these regions usually have overlaps, independent feature extractions lead to much repeated computation. One of the major improvements of the fast R-CNN over the R-CNN is that the CNN forward propagation is only performed on the entire image (Girshick, 2015).


Fig. 14.8.2 The fast R-CNN model.

Fig. 14.8.2 describes the fast R-CNN model. Its major computations are as follows, with a minimal code sketch after the list:

  1. Compared with the R-CNN, in the fast R-CNN the input of the CNN for feature extraction is the entire image, rather than individual region proposals. Moreover, this CNN is trainable. Given an input image, let the shape of the CNN output be \(1 \times c \times h_1 \times w_1\).

  2. Suppose that selective search generates \(n\) region proposals. These region proposals (of different shapes) mark regions of interest (of different shapes) on the CNN output. Then these regions of interest further extract features of the same shape (say height \(h_2\) and width \(w_2\) are specified) in order to be easily concatenated. To achieve this, the fast R-CNN introduces the region of interest (RoI) pooling layer: the CNN output and region proposals are input into this layer, outputting concatenated features of shape \(n \times c \times h_2 \times w_2\) that are further extracted for all the region proposals.

  3. Using a fully connected layer, transform the concatenated features into an output of shape \(n \times d\), where \(d\) depends on the model design.

  4. Predict the class and bounding box for each of the \(n\) region proposals. More concretely, in class and bounding box prediction, transform the fully connected layer output into an output of shape \(n \times q\) (\(q\) is the number of classes) and an output of shape \(n \times 4\), respectively. The class prediction uses softmax regression.
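
The following minimal sketch wires these four steps together in PyTorch. The tiny backbone, the \(7 \times 7\) pooled size, \(d = 128\), and \(q = 21\) are illustrative assumptions rather than the paper's settings; the point is that the backbone runs once on the whole image, while only the light per-region heads run \(n\) times.

import torch
from torch import nn
import torchvision

# Step 1: one CNN forward pass over the entire image (toy backbone, c = 16).
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
image = torch.rand(1, 3, 64, 64)
fmap = backbone(image)                                   # (1, 16, 32, 32)

# Step 2: RoI pooling for n = 2 proposals given in image coordinates as
# (batch_index, x1, y1, x2, y2); spatial_scale maps them onto the feature map.
rois = torch.tensor([[0., 0., 0., 32., 32.],
                     [0., 16., 16., 60., 60.]])
pooled = torchvision.ops.roi_pool(fmap, rois, output_size=(7, 7), spatial_scale=0.5)

# Steps 3 and 4: a fully connected layer, then class and box predictions.
n, d, q = pooled.shape[0], 128, 21
hidden = nn.Linear(16 * 7 * 7, d)(pooled.flatten(start_dim=1))  # (n, d)
class_logits = nn.Linear(d, q)(hidden)                          # (n, q)
bbox_preds = nn.Linear(d, 4)(hidden)                            # (n, 4)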

The region of interest pooling layer proposed in the fast R-CNN is different from the pooling layer introduced in Section 7.5. In the pooling layer, we indirectly control the output shape by specifying sizes of the pooling window, padding, and stride. In contrast, we can directly specify the output shape in the region of interest pooling layer.

For example, let’s specify the output height and width for each region as \(h_2\) and \(w_2\), respectively. For any region of interest window of shape \(h \times w\), this window is divided into a \(h_2 \times w_2\) grid of subwindows, where the shape of each subwindow is approximately \((h/h_2) \times (w/w_2)\). In practice, the height and width of any subwindow shall be rounded up, and the largest element shall be used as the output of the subwindow. Therefore, the region of interest pooling layer can extract features of the same shape even when regions of interest have different shapes.

As an illustrative example, in Fig. 14.8.3, the upper-left \(3\times 3\) region of interest is selected on a \(4 \times 4\) input. For this region of interest, we use a \(2\times 2\) region of interest pooling layer to obtain a \(2\times 2\) output. Note that each of the four divided subwindows contains elements 0, 1, 4, and 5 (5 is the maximum); 2 and 6 (6 is the maximum); 8 and 9 (9 is the maximum); and 10.


Fig. 14.8.3 A \(2\times 2\) region of interest pooling layer.

Below we demonstrate the computation of the region of interest pooling layer. Suppose that the height and width of the CNN-extracted features X are both 4, and there is only a single channel.

pytorch:

import torch
import torchvision

X = torch.arange(16.).reshape(1, 1, 4, 4)
X

tensor([[[[ 0.,  1.,  2.,  3.],
          [ 4.,  5.,  6.,  7.],
          [ 8.,  9., 10., 11.],
          [12., 13., 14., 15.]]]])

mxnet:

from mxnet import np, npx

npx.set_np()
X = np.arange(16).reshape(1, 1, 4, 4)
X

array([[[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.],
         [12., 13., 14., 15.]]]])

Let’s further suppose that the height and width of the input image are both 40 pixels and that selective search generates two region proposals on this image. Each region proposal is expressed as five elements: the index of its image in the batch (0 here, since there is a single image) followed by the \((x, y)\)-coordinates of its upper-left and lower-right corners.

pytorch:

rois = torch.Tensor([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])

mxnet:

rois = np.array([[0, 0, 0, 20, 20], [0, 0, 10, 30, 30]])

Because the height and width of X are \(1/10\) of the height and width of the input image, the coordinates of the two region proposals are multiplied by 0.1 according to the specified spatial_scale argument. Then the two regions of interest are marked on X as X[:, :, 0:3, 0:3] and X[:, :, 1:4, 0:4], respectively. Finally, in the \(2\times 2\) region of interest pooling, each region of interest is divided into a grid of sub-windows to further extract features of the same shape \(2\times 2\).

pytorch:

torchvision.ops.roi_pool(X, rois, output_size=(2, 2), spatial_scale=0.1)

tensor([[[[ 5.,  6.],
          [ 9., 10.]]],

        [[[ 9., 11.],
          [13., 15.]]]])

mxnet:

npx.roi_pooling(X, rois, pooled_size=(2, 2), spatial_scale=0.1)

array([[[[ 5.,  6.],
         [ 9., 10.]]],

       [[[ 9., 11.],
         [13., 15.]]]])
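
To connect this result back to Fig. 14.8.3, we can reproduce the first region’s pooled values by hand, taking the maximum over the four subwindows of the upper-left \(3\times 3\) window with boundaries rounded up as described earlier (using the PyTorch X from the example above):

# Max over the four subwindows of the first region of interest.
roi = X[0, 0, 0:3, 0:3]
torch.stack([
    torch.stack([roi[0:2, 0:2].max(), roi[0:2, 2:3].max()]),
    torch.stack([roi[2:3, 0:2].max(), roi[2:3, 2:3].max()]),
])

tensor([[ 5.,  6.],
        [ 9., 10.]])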

14.8.3. Faster R-CNN

To be more accurate in object detection, the fast R-CNN model usually has to generate a lot of region proposals in selective search. To reduce region proposals without loss of accuracy, the faster R-CNN proposes to replace selective search with a region proposal network (Ren et al., 2015).


Fig. 14.8.4 The faster R-CNN model.

Fig. 14.8.4 shows the faster R-CNN model. Compared with the fast R-CNN, the faster R-CNN only changes the region proposal method from selective search to a region proposal network. The rest of the model remains unchanged. The region proposal network works in the following steps, with a sketch of its prediction head after the list:

  1. Use a \(3\times 3\) convolutional layer with padding of 1 to transform the CNN output to a new output with \(c\) channels. In this way, each unit along the spatial dimensions of the CNN-extracted feature maps gets a new feature vector of length \(c\).

  2. Centered on each pixel of the feature maps, generate multiple anchor boxes of different scales and aspect ratios and label them.

  3. Using the length-\(c\) feature vector at the center of each anchor box, predict the binary class (background or objects) and bounding box for this anchor box.

  4. Consider those predicted bounding boxes whose predicted classes are objects. Remove overlapped results using non-maximum suppression. The remaining predicted bounding boxes for objects are the region proposals required by the region of interest pooling layer.
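
A minimal sketch of the region proposal network’s prediction head (steps 1 and 3) might look as follows; the channel counts and the number of anchors per location are illustrative assumptions, and anchor generation, labeling, and non-maximum suppression (steps 2 and 4) are omitted.

import torch
from torch import nn

class RPNHead(nn.Module):
    """Illustrative sketch of a region proposal network's prediction head."""
    def __init__(self, in_channels, c, num_anchors):
        super().__init__()
        # Step 1: a 3x3 convolution with padding 1 gives every spatial unit
        # a feature vector of length c.
        self.conv = nn.Conv2d(in_channels, c, kernel_size=3, padding=1)
        # Step 3: per anchor, 2 scores (object vs. background) and 4 box offsets.
        self.cls = nn.Conv2d(c, num_anchors * 2, kernel_size=1)
        self.bbox = nn.Conv2d(c, num_anchors * 4, kernel_size=1)

    def forward(self, features):
        x = torch.relu(self.conv(features))
        return self.cls(x), self.bbox(x)

head = RPNHead(in_channels=256, c=256, num_anchors=9)
cls_logits, bbox_deltas = head(torch.rand(1, 256, 25, 25))
cls_logits.shape, bbox_deltas.shape  # (1, 18, 25, 25) and (1, 36, 25, 25)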

It is worth noting that, as part of the faster R-CNN model, the region proposal network is jointly trained with the rest of the model. In other words, the objective function of the faster R-CNN includes not only the class and bounding box prediction in object detection, but also the binary class and bounding box prediction of anchor boxes in the region proposal network. As a result of the end-to-end training, the region proposal network learns how to generate high-quality region proposals, so as to stay accurate in object detection with a reduced number of region proposals that are learned from data.

14.8.4. Mask R-CNN

In the training dataset, if pixel-level positions of objects are also labeled on images, the mask R-CNN can effectively leverage such detailed labels to further improve the accuracy of object detection (He et al., 2017).


Fig. 14.8.5 The mask R-CNN model.

As shown in Fig. 14.8.5, the mask R-CNN is modified based on the faster R-CNN. Specifically, the mask R-CNN replaces the region of interest pooling layer with the region of interest (RoI) alignment layer. This region of interest alignment layer uses bilinear interpolation to preserve the spatial information on the feature maps, which is more suitable for pixel-level prediction. The output of this layer contains feature maps of the same shape for all the regions of interest. They are used to predict not only the class and bounding box for each region of interest, but also the pixel-level position of the object through an additional fully convolutional network. More details on using a fully convolutional network to predict pixel-level semantics of an image will be provided in subsequent sections of this chapter.
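
torchvision exposes RoI alignment alongside RoI pooling, so one way to get a feel for the difference is to apply it to the toy X and rois from Section 14.8.2. The call below is only illustrative; its exact output depends on arguments such as sampling_ratio, which we leave at their defaults.

# Bilinear interpolation at sampled points instead of max pooling over hard
# subwindow boundaries (X and rois as defined in the example above).
torchvision.ops.roi_align(X, rois, output_size=(2, 2), spatial_scale=0.1)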

14.8.5. Summary

  • The R-CNN extracts many region proposals from the input image, uses a CNN to perform forward propagation on each region proposal to extract its features, then uses these features to predict the class and bounding box of this region proposal.

  • One of the major improvements of the fast R-CNN over the R-CNN is that the CNN forward propagation is only performed on the entire image. It also introduces the region of interest pooling layer, so that features of the same shape can be further extracted for regions of interest that have different shapes.

  • The faster R-CNN replaces the selective search used in the fast R-CNN with a jointly trained region proposal network, so that the former can stay accurate in object detection with a reduced number of region proposals.

  • Based on the faster R-CNN, the mask R-CNN additionally introduces a fully convolutional network, so as to leverage pixel-level labels to further improve the accuracy of object detection.

14.8.6. Exercises

  1. Can we frame object detection as a single regression problem, such as predicting bounding boxes and class probabilities? You may refer to the design of the YOLO model (Redmon et al., 2016).

  2. Compare single shot multibox detection with the methods introduced in this section. What are their major differences? You may refer to Figure 2 of Zhao et al. (2019).
