CornerNet: Detecting Objects as Paired Keypoints Hello, everyone. My name is Hei Law and I am going to present our work, CornerNet: Detecting Objects as Paired Keypoints. Hei Law1,2 Jia Deng1,2 Princeton University1, *University of Michigan2 *Work done at the University of Michigan
Object Detection Person Person Object detection is a fundamental task in computer vision. The task is to localize each object as a bounding box with a category label.
Two-Stage Detector Region of Interest Person 2nd Network 1st Network [Girshick et al. CVPR’14] [He et al. ECCV’14] [He et al. ICCV’17] [Cai & Vasconcelos, CVPR’18] [Singh & Davis, CVPR’18] Region of Interest [Ren et al. NIPS’15] Many current object detectors share a two-stage design that consists of two convolutional networks. The first network generates a set of region proposals. The second network classifies each region proposal and outputs a final bounding box. 2nd Network Person 1st Network Person Region Pooling [Girshick, ICCV’15]
One-stage Detector ConvNet [Redmon & Farhadi, CVPR’17] [Shen et al. ICCV’17] [Liu et al. ECCV’16] [Fu et al. arXiv’17] [Lin et al. ICCV’17] [Zhang et al. CVPR’18] Anchors Class Person Background An alternative to the two-stage design is to use a single convolutional network that directly outputs a set of bounding boxes. Specifically, at each pixel location, the network chooses from a set of anchor boxes of various sizes and aspect ratios, in addition to assigning a category label. Because only a single network is evaluated, one-stage detectors can be made much faster than two-stage detectors. ConvNet
Drawbacks of Anchor Boxes Need a large number of anchors A tiny fraction of anchors are positive examples Slow down training [Lin et al. ICCV’17] Extra hyperparameters – sizes and aspect ratios At least one anchor sufficiently overlaps with ground-truth However, the use of anchor boxes has a couple drawbacks. First, we typically need a large number of anchor boxes to ensure that at least one of them overlaps sufficiently with the ground truth. This means that during training, only a tiny fraction of anchor boxes are positive examples. As shown by Lin et al., this imbalance can significantly slow down training. Second, the use of anchor boxes introduces many extra hyperparameters, including what sizes and aspect ratios to include.
CornerNet: Detecting Objects as Paired Keypoints In this work, we propose CornetNet, a new one-stage detector that does away with anchor boxes. We reformulate object detection as detecting and grouping keypoints. In particular, we detect the top-left corners and bottom-right corners of bounding boxes, and pair them to form individual object instances.
CornerNet: Detecting Objects as Paired Keypoints Top-Left Corner? Whose Top-Left? Bottom-Right Corner? Whose Bottom-Right? Class Class Yes Person No CornerNet consists of a single convolutional network. At each pixel, we predict if there exists a top-left corner, and if so, its category label. In addition, we predict an embedding vector, which is just a bunch of real numbers that serve to identify the object the corner belongs to. Similarly, at each pixel, we predict if there exists a bottom-right corner, together with its category label and an embedding vector. ConvNet Yes Person No No Yes Person No Yes Person
CornerNet: Detecting Objects as Paired Keypoints Top-Left Corner? Whose Top-Left? Bottom-Right Corner? Whose Bottom-Right? Class Class Yes Person No To form bounding boxes, we need to pair each detected top-left corner with a bottom-right corner. To do that we compare the embedding vectors. The network has been trained to predict similar embeddings for the pair of corners that belong to the same object, and different embeddings otherwise. For example, this pair of corners have similar embeddings, so they are grouped to form a bounding box for the person on the right. Similarly, this pair of corners are grouped to form a bounding box for the person on the left. Yes Person No No Yes Person No Yes Person
CornerNet: Detecting Objects as Paired Keypoints Top-Left Corner? Whose Top-Left? Bottom-Right Corner? Whose Bottom-Right? Class Class Yes Person No Loss: distance In pairing the corners, What matters is only the similarities between embeddings, not the absolute values of each embedding. To train the network to predict such embeddings, we apply a loss to minimize the distances between embeddings from the same object and apply another loss to minimize the similarities between the embeddings from different objects. Yes Person No Loss: similarity No Yes Person No Yes Person
Associative Embedding [Newell et al. NIPS’17] Our approach draws inspiration from work by Newell et al. that introduced the idea of associative embedding in the context of multi-person pose estimation. In their approach, they detect the keypoints of all persons and predict an embedding vector for each detected keypoint. The keypoints are then grouped to form individual persons based on the similarities between embeddings.
Advantages of Detecting Corners Corner: 2 sides We hypothesize that detecting corners has some unique advantages over alternatives. First, detecting corners is likely easier than detecting bounding box centers, because a center depends on all 4 sides of an object, whereas a corner depends on only 2 sides. Center: 4 sides Detecting corner is easier than detecting center
Advantages of Detecting Corner w 𝑂( 𝑤 2 ℎ 2 ) possible proposals Another advantage is that corners provides a compact output space to represent all candidate boxes. Given an image with width w and height h, we can represent O(w2h2) possible candidate boxes, using only O(wh) corners. 𝑂(𝑤ℎ) corners h Represent 𝑂( 𝑤 2 ℎ 2 ) possible proposals using only 𝑂 𝑤ℎ corners
Supervising Corner Detection Boxes sufficiently overlap with ground-truths It is worth mentioning an important detail on training the network to detect corners. Although there is only a single ground truth location of a corner, false detections within a radius are penalized less because they can still generate bounding boxes that overlap sufficiently with the ground truth box. Reducing penalties
Corner Pooling To further improve corner detection, we also introduce a new type of pooling layer called corner pooling. This is based on the observation that we OFTEN cannot detect a corner based on local visual evidence. For example, to detect a top-left corner, we need to look to the right to see if there is a topmost boundary of the object and look down to see if there is a leftmost boundary of the object.
Top-Left Corner Pooling max This observation motivates our corner pooling design. It takes two feature maps as input. It max pools all feature vectors to the right in the first feature map, and max pools all feature vectors below in the second feature map. It then adds the two pooled results together. max feature maps
CornerNet Top-Left Corner Top-Left Corner Pooling Hourglass Network Class Embedding Top-Left Corner Pooling Yes Person Yes Person We are now ready to describe the complete CornerNet. It consists of a stacked hourglass network, followed by two prediction branches, one for the top-left corner and the other for the bottom-right corner. In each branch, we apply corner pooling to the features before predicting the corner heatmaps and embedding vectors. Hourglass Network [Newell et al. ECCV’16] Yes Person Bottom-Right Corner Pooling Yes Person Bottom-Right Corner
One of the most challenging object detection benchmarks We evaluate on MS COCO, one of the most challenging object detection benchmarks. [Lin et al. ECCV’14] One of the most challenging object detection benchmarks
Experiment: CornerNet versus Others On MSCOCO, CornerNet achieves a 42.1% AP, outperforming all existing one-stage detectors. It also achieves results highly competitive to the two-stage detectors.
Experiment: Corner Pooling We also evaluate the effectiveness of corner pooling through ablation. We see that adding corner pooling significantly improves the object detection performance. It especially benefits medium and large objects, which is expected because the corners can be further away.
This is an input image with its two corner heatmaps, one for the top-left corner, and the other for the bottom-right corner. We can see that CornerNet is able to accurately localize the corners even the boundaries of the objects are far away from the corner locations.
Here are some additional examples Here are some additional examples. We see that CornerNet is able to successfully pair corners in the presence of multiple instances of the same object category.
Related Work Main difference: how grouping is done Before concluding my talk, I’d like to mention DeNet and Point Linking Network, two related works that also perform corner-based object detection. The main difference between CornerNet and them is how the grouping is done. DeNet is a two-stage detector which does the grouping in stage 2 by a classifier, while CornerNetis a one-stage detector and does the grouping by associative embedding. Point Linking Network performs grouping by predicting pixel locations, while CornerNet uses associative embedding. DeNet: [Tychsen-Smith & Petersson, CVPR’17] Two-stage; grouping in stage 2 by classifier Ours: One-stage; grouping by associative embedding Point Linking Network: [Wang et al. arXiv’17] Grouping by predicting pixel locations Ours: Grouping by associative embedding
Conclusions CornerNet: Detecting objects as pairs of top-left and bottom- right corners Corner pooling to help better localize corners State-of-the-art performance among single-stage detectors To conclude, we have introduced a new approach that detects objects as pairs of corners, with corner pooling as one key component. Our approach has achieved state of the art performance among single-stage detectors.
CornerNet: Detecting Objects as Paired Keypoints Poster P-4B-03 on the 2nd floor of the Philharmonie foyer Code Available At: https://github.com/umich-vl/CornerNet For more details, please visit our poster or check out our code on Github. Thank you.