Equalization Loss for Long-Tailed Object Recognition

Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, Junjie Yan

Introduction

Recently, the computer vision community has witnessed the great success of object recognition because of the emergence of deep learning and convolutional neural networks (CNNs). Object recognition, a fundamental task in computer vision, plays a central role in many related tasks, such as re-identification, human pose estimation and object tracking.

Today, most datasets for general object recognition, e.g. Pascal VOC and COCO, mainly collect frequently seen categories, with a large number of annotations for each class. However, in more practical scenarios, a large vocabulary dataset with a long-tailed distribution of category frequency (e.g. LVIS) is inevitable. The long-tailed distribution of categories poses a great challenge to the learning of object detection models, especially for the rare categories (categories with very few samples). Note that for one category, all the samples of other categories, including the background, are regarded as negative samples. The rare categories can therefore be easily overwhelmed by the majority categories (categories with a large number of samples) during training and are inclined to be predicted as negatives. As a result, conventional object detectors trained on such an extremely unbalanced dataset suffer a great decline in performance.

Most previous works consider the influence of the long-tailed category distribution as an imbalance of batch sampling during training, and they handle the problem mainly by designing specialized sampling strategies. Other works introduce specialized loss formulations to cope with the problem of positive-negative sample imbalance. However, they focus on the imbalance between foreground and background samples, so the severe imbalance among different foreground categories remains a challenging problem.

In this work, we focus on the problem of extremely imbalanced frequencies among different foreground categories and propose a novel perspective to analyze its effect. As illustrated in Figure 1, the green and orange curves represent the average norms of gradients contributed by positive and negative samples respectively. We can see that for the frequent categories, the positive gradient has a larger impact than the negative gradient on average, but for the rare categories, the situation is just the opposite. More specifically, the commonly used loss functions in classification tasks, e.g. softmax cross-entropy and sigmoid cross-entropy, have a suppression effect on the classes that are not the ground-truth one. When a sample of a certain class is used for training, the parameters that predict the other classes receive discouraging gradients, which lead them to predict low probabilities. Since objects of the rare categories hardly occur, the predictors for these classes are overwhelmed by the discouraging gradients during network parameter updates.

To address this problem, we propose a novel loss function, equalization loss (EQL). In general, we introduce a weight term for each class of each sample, which mainly reduces the influence of negative samples for the rare categories. The complete formulation of equalization loss is presented in Section 3. With the equalization loss, the average gradient norm of negative samples decreases, as shown in Figure 1 (the blue curve). A simple visualization of the effect of EQL is shown in Figure 2, which illustrates the average predicted probabilities for the positive proposals of each category with (the red curve) and without (the blue curve) equalization loss. It can be seen that EQL significantly improves the performance on rare categories without harming the accuracy of frequent categories. With the proposed EQL, categories of different frequencies are brought to a more equal status during network parameter updates, and the trained model is able to distinguish objects of the rare categories more accurately.

Extensive experiments on several unbalanced datasets, e.g. Open Images and LVIS, demonstrate the effectiveness of our method. We also verify our method on other tasks, such as image classification.

Our key contributions can be summarized as follows: (1) We propose a novel perspective to analyze the long-tailed problem: the suppression of rare categories during learning caused by inter-class competition, which explains the poor performance of rare categories on long-tailed datasets. Based on this perspective, we propose a novel loss function, equalization loss, which alleviates the effect of the overwhelming discouraging gradients during learning by introducing an ignoring strategy. (2) We present extensive experiments over different datasets and tasks, such as object detection, instance segmentation and image classification. All experiments demonstrate the strength of our method, which brings a large performance boost over common classification loss functions. Equipped with our equalization loss, we achieved 1st place in the LVIS Challenge 2019.

Related Works

We first revisit common object detection and instance segmentation. Then we introduce re-sampling, cost-sensitive re-weighting, and feature manipulation methods that are widely used to alleviate the class-unbalanced problem in long-tailed datasets.

Object Detection and Instance Segmentation. There are two mainstream frameworks for object detection: single-stage detectors and two-stage detectors. While single-stage detectors achieve higher speed, most state-of-the-art detectors follow the two-stage regime for better performance. The popular Mask R-CNN, which adds a mask head to the typical two-stage detector, provided promising results on many instance segmentation benchmarks. Mask Scoring R-CNN introduced an extra mask score head to align the mask's score and quality. Cascade Mask R-CNN and HTC further improved the performance by predicting the mask in a cascade manner.

Re-sampling Methods. One of the commonly used re-sampling methods is oversampling, which randomly samples more training data from the minority classes to tackle the unbalanced class distribution. Class-aware sampling, also called class-balanced sampling, is a typical oversampling technique, which first samples a category and then uniformly samples an image that contains the sampled category. While oversampling methods achieve significant improvement for under-represented classes, they come with a high potential risk of overfitting. In contrast to oversampling, the main idea of under-sampling is to remove some available data from frequent classes to make the data distribution more balanced. However, under-sampling is infeasible in extremely long-tailed datasets, since the imbalance ratio between the head class and the tail class is extremely large. Recently, a decoupling training schema was proposed, which first learns the representations and classifier jointly, then obtains a balanced classifier by re-training the classifier with class-balanced sampling. Our method helps the model learn better representations for tail classes, so it could be complementary to the decoupling training schema.
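As a rough illustration of class-aware sampling as described above (first a category, then an image containing it), the following Python sketch may help. The toy annotations and the helper names `build_category_index` and `class_aware_sample` are hypothetical and are not taken from any particular implementation.

```python
import random
from collections import defaultdict


def build_category_index(image_labels):
    """Map each category to the sorted list of image ids that contain it."""
    index = defaultdict(list)
    for image_id in sorted(image_labels):
        for category in sorted(set(image_labels[image_id])):
            index[category].append(image_id)
    return dict(index)


def class_aware_sample(category_index, rng):
    """Pick a category uniformly, then pick uniformly among its images."""
    category = rng.choice(sorted(category_index))
    image_id = rng.choice(category_index[category])
    return category, image_id


# Hypothetical toy annotations: image id -> categories present in that image.
image_labels = {0: ["person"], 1: ["person"], 2: ["person", "kite"], 3: ["person"]}
category_index = build_category_index(image_labels)

rng = random.Random(0)
draws = [class_aware_sample(category_index, rng) for _ in range(1000)]
kite_share = sum(category == "kite" for category, _ in draws) / len(draws)
print(f"share of draws targeting the rare category: {kite_share:.2f}")
```

In this toy setting the rare category is targeted in roughly half of the draws even though it appears in one image out of four, and every such draw returns the same image, which is one way to see the overfitting risk mentioned in the text.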

Re-weighting Methods. The basic idea of re-weighting methods is to assign weights to different training samples. In an unbalanced dataset, an intuitive strategy is to weight samples based on the inverse of class frequency or to use a smoothed version, the inverse square root of class frequency. Besides the methods mentioned above, which adjust the weight at the class level, other studies focus on re-weighting at the sample level. Some methods make the neural network cost-sensitive by increasing the weight for hard samples and decreasing the weight for easy samples, which can be seen as online versions of the hard example mining technique. Recently, Meta-Weight-Net learns an explicit mapping for sample re-weighting. Different from the works above, we focus on the imbalance problem among different foreground categories. We propose a new perspective: the large number of negative gradients from frequent categories severely suppresses the learning of rare categories during training. We then propose a new loss function to tackle this problem, which is applied at the sample level and the class level simultaneously.
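The two class-level weighting schemes mentioned above can be sketched in a few lines of Python. The normalization to a mean weight of 1 and the function name `class_weights` are assumptions made for illustration; the text does not specify how the weights are scaled.

```python
import numpy as np


def class_weights(class_counts, power=1.0):
    """Weights proportional to count ** (-power), rescaled to have mean 1.

    power=1.0 gives inverse class frequency, power=0.5 the smoothed
    inverse-square-root variant. The mean-1 rescaling is an assumption.
    """
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = counts ** (-power)
    return weights / weights.mean()


# Hypothetical class counts spanning four orders of magnitude.
class_counts = [10000, 100, 1]
inverse_freq = class_weights(class_counts, power=1.0)
inverse_sqrt = class_weights(class_counts, power=0.5)
print(inverse_freq, inverse_sqrt)
```

The smoothed variant narrows the ratio between the rarest and the most frequent class from the ratio of their counts to its square root.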

Feature Manipulation. There are also some works operating on the feature representations directly. Range Loss enlarges inter-class distances and reduces intra-class variations simultaneously. Another approach augments the feature space of tail classes by transferring the feature variance of regular classes that have sufficient training samples. A further approach transfers the semantic feature representation from head to tail categories by adopting a memory module. However, designing those modules or methods is not a trivial task and makes the model harder to train. In contrast, our method is simpler and does not access the representation directly.

Equalization Loss

The central goal of our equalization loss is to alleviate the imbalance of the category quantity distribution in a long-tailed class distribution. We start by revisiting conventional loss functions for classification, namely softmax cross-entropy and sigmoid cross-entropy.

Softmax Cross-Entropy derives a multinomial distribution $\bm{p}$ over the categories from the network outputs $\bm{z}$, and then computes the cross-entropy between the estimated distribution $\bm{p}$ and the ground-truth distribution $\bm{y}$. The softmax cross-entropy loss $L_{SCE}$ can be formulated as:

$$L_{SCE}=-\sum_{j=1}^{C}y_{j}\log(p_{j}) \tag{1}$$

where $C$ is the number of categories and $\bm{p}$ is calculated by $Softmax(\bm{z})$. Note that the $C$ categories include an extra class for background. In practice, $\bm{y}$ uses a one-hot representation, and we have $\sum_{j=1}^{C}y_{j}=1$. Formally, for the ground truth category $c$ of a sample,

$$y_{j}=\begin{cases}1 & \text{if } j=c\\ 0 & \text{otherwise}\end{cases} \tag{2}$$

Sigmoid Cross-Entropy estimates the probability of each category independently using $C$ sigmoid loss functions. The ground truth label $y_{j}$ only represents a binary distribution for category $j$. Usually, an extra category for background is not included. Instead, $y_{j}=0$ is set for all the categories when a proposal belongs to the background. So the sigmoid cross-entropy loss can be formulated as:

$$L_{BCE}=-\sum_{j=1}^{C}\log(\hat{p}_{j}) \tag{3}$$

with

$$\hat{p}_{j}=\begin{cases}p_{j} & \text{if } y_{j}=1\\ 1-p_{j} & \text{otherwise}\end{cases} \tag{4}$$

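Here, $p_{j}$ is calculated by $\sigma(z_{j})$. The derivatives of $L_{BCE}$ and $L_{SCE}$ with respect to the network's output $\bm{z}$ share the same formulation. Writing $L_{cls}$ for either loss:

$$\frac{\partial L_{cls}}{\partial z_{j}}=\begin{cases}p_{j}-1 & \text{if } y_{j}=1\\ p_{j} & \text{otherwise}\end{cases} \tag{5}$$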

In softmax cross-entropy and sigmoid cross-entropy, we notice that a foreground sample of category $c$ can be regarded as a negative sample for any other category $j$. So the category $j$ will receive a discouraging gradient $p_{j}$ for model updating, which will lead the network to predict a low probability for category $j$. If $j$ is a rare category, the discouraging gradients will occur much more frequently than encouraging gradients during the iterations of optimization. The accumulated gradients will have a non-negligible impact on that category. Finally, even positive samples for category $j$ might get a relatively low probability from the network.
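The shared gradient form, $p_{j}-y_{j}$ (that is, $p_{j}-1$ for the ground-truth category and the discouraging gradient $p_{j}$ for every other category), can be checked numerically. The sketch below compares it with central finite differences for both losses; the logits and the helper names are arbitrary choices for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(z, y):
    """Sigmoid cross-entropy summed over categories."""
    p = sigmoid(z)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def sce_loss(z, y):
    """Softmax cross-entropy with a one-hot target."""
    shifted = z - z.max()
    log_p = shifted - np.log(np.sum(np.exp(shifted)))
    return -np.sum(y * log_p)


def numerical_grad(loss_fn, z, y, eps=1e-6):
    """Central finite-difference gradient of loss_fn with respect to z."""
    grad = np.zeros_like(z)
    for j in range(z.size):
        step = np.zeros_like(z)
        step[j] = eps
        grad[j] = (loss_fn(z + step, y) - loss_fn(z - step, y)) / (2.0 * eps)
    return grad


z = np.array([2.0, -1.0, 0.5, -3.0])   # arbitrary logits
y = np.array([1.0, 0.0, 0.0, 0.0])     # ground-truth category is index 0

softmax_p = np.exp(z - z.max()) / np.sum(np.exp(z - z.max()))
bce_analytic = sigmoid(z) - y
sce_analytic = softmax_p - y

print(np.allclose(numerical_grad(bce_loss, z, y), bce_analytic, atol=1e-5))
print(np.allclose(numerical_grad(sce_loss, z, y), sce_analytic, atol=1e-5))
```

For every non-ground-truth category the gradient is positive, so a gradient descent step lowers that category's logit; this is the suppression effect discussed above.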

3.2 Equalization Loss Formulation

When the quantity distribution of categories is fairly imbalanced, e.g. in a long-tailed dataset, the discouraging gradients from frequent categories have a remarkable impact on categories with scarce annotations. With commonly used cross-entropy losses, the learning of rare categories is easily suppressed. To solve this problem, we propose a loss function that ignores the gradient from samples of frequent categories for the rare categories. This loss function aims to make the network training fairer for each class, and we refer to it as equalization loss.

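Formally, we introduce a weight term $w$ to the original sigmoid cross-entropy loss function, and the equalization loss can be formulated as:

$$L_{EQL}=-\sum_{j=1}^{C}w_{j}\log(\hat{p}_{j}) \tag{6}$$

where $\hat{p}_{j}$ is $p_{j}$ when $y_{j}=1$ and $1-p_{j}$ otherwise.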

For a region proposal $r$, we set the weight $w_{j}$ of category $j$ according to the following rule:

$$w_{j}=1-E(r)\,T_{\lambda}(f_{j})\,(1-y_{j}) \tag{7}$$

In this equation, $E(r)$ outputs 1 when $r$ is a foreground region proposal and 0 when it belongs to the background. $f_{j}$ is the frequency of category $j$ in the dataset, computed as the number of images containing category $j$ divided by the number of images in the entire dataset. $T_{\lambda}(x)$ is a threshold function which outputs 1 when $x<\lambda$ and 0 otherwise. $\lambda$ is used to distinguish tail categories from all other categories, and the Tail Ratio ($TR$) is used as the criterion to set its value. Formally, we define $TR$ by the following formula:

$$TR(\lambda)=\frac{\sum_{j=1}^{C}T_{\lambda}(f_{j})N_{j}}{\sum_{j=1}^{C}N_{j}} \tag{8}$$

where $N_{j}$ is the number of images of category $j$. The settings of the hyper-parameters of each part in Equation 7 are studied in Section 4.4.
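To make the role of the Tail Ratio concrete, the sketch below assumes that $TR(\lambda)$ is the share of the per-category image counts $N_{j}$ that belongs to categories with $f_{j}<\lambda$. The counts, the dataset size and the function name `tail_ratio` are hypothetical.

```python
import numpy as np


def tail_ratio(image_counts, num_images, lam):
    """Share of per-category image counts N_j held by categories with f_j < lam."""
    counts = np.asarray(image_counts, dtype=np.float64)
    freq = counts / num_images          # f_j: images of category j / all images
    is_tail = freq < lam                # T_lambda(f_j)
    return counts[is_tail].sum() / counts.sum()


# Hypothetical per-category image counts in a dataset of 10000 images.
image_counts = [5000, 1200, 300, 40, 8, 2]
num_images = 10000

for lam in (0.0005, 0.005, 0.05):
    print(lam, round(tail_ratio(image_counts, num_images, lam), 4))
```

Sweeping $\lambda$ in this way and reading off $TR(\lambda)$ is one possible way to pick a threshold that targets a desired share of tail annotations.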

In summary, there are two particular designs in equalization loss function: 1) We ignore the discouraging gradients of negative samples for rare categories whose quantity frequency is under a threshold. 2) We do not ignore the gradients of background samples. If all the negative samples for the rare categories are ignored, there will be no negative samples for them during training, and the learned model will predict a large number of false positives.
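The two design choices summarized above can be sketched in NumPy. This is a minimal illustration that assumes the weight takes the form $w_{j}=1-E(r)T_{\lambda}(f_{j})(1-y_{j})$ and multiplies the per-category sigmoid cross-entropy terms; averaging over proposals, the toy inputs and the function names are assumptions, not the authors' implementation.

```python
import numpy as np


def eql_weights(labels, is_foreground, freq, lam):
    """w[i, j] = 1 - E(r_i) * T_lambda(f_j) * (1 - y[i, j])."""
    tail = (np.asarray(freq) < lam).astype(np.float64)          # T_lambda(f_j)
    fg = np.asarray(is_foreground, dtype=np.float64)[:, None]   # E(r_i)
    return 1.0 - fg * tail[None, :] * (1.0 - labels)


def equalization_loss(logits, labels, is_foreground, freq, lam):
    """Weighted sigmoid cross-entropy, averaged over proposals (an assumption)."""
    sign = 2.0 * labels - 1.0
    # -log(p_hat): log(1 + exp(-z)) for positives, log(1 + exp(z)) for negatives.
    nll = np.logaddexp(0.0, -sign * logits)
    weights = eql_weights(labels, is_foreground, freq, lam)
    return (weights * nll).sum(axis=1).mean()


# Toy setup: three categories, the last one is rare.
freq = np.array([0.3, 0.05, 0.0001])
lam = 0.001
labels = np.array([[1.0, 0.0, 0.0],    # foreground proposal of category 0
                   [0.0, 0.0, 0.0],    # background proposal
                   [0.0, 0.0, 1.0]])   # foreground proposal of the rare category
is_foreground = np.array([True, False, True])
logits = np.array([[2.0, -1.0, 0.5],
                   [-2.0, -1.5, 0.3],
                   [0.1, -0.7, -0.2]])

weights = eql_weights(labels, is_foreground, freq, lam)
eql = equalization_loss(logits, labels, is_foreground, freq, lam)
bce = equalization_loss(logits, labels, is_foreground, freq, lam=0.0)
print(weights)
print(eql, bce)
```

Only one entry of the weight matrix is zero here: the rare category's term for the foreground proposal of category 0. The background proposal still contributes a negative term for the rare category, and setting `lam=0.0` recovers plain sigmoid cross-entropy.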

3.3 Extend to Image Classification

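Since the softmax loss function is widely adopted in image classification, we also design a form of Softmax Equalization Loss following our main idea. Softmax equalization loss (SEQL) can be formulated as:

$$L_{SEQL}=-\sum_{j=1}^{C}y_{j}\log(\tilde{p}_{j}) \tag{9}$$

where

$$\tilde{p}_{j}=\frac{e^{z_{j}}}{\sum_{k=1}^{C}w_{k}e^{z_{k}}} \tag{10}$$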

and the weight term $w_{k}$ is computed by:

$$w_{k}=1-\beta\,T_{\lambda}(f_{k})\,(1-y_{k}) \tag{11}$$

where $\beta$ is a random variable with a probability of $\gamma$ to be 1 and $1-\gamma$ to be 0.
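A possible NumPy sketch of the softmax variant follows. It assumes that the weight has the form $w_{k}=1-\beta T_{\lambda}(f_{k})(1-y_{k})$ and rescales each category's term in the softmax denominator, and that $\beta$ is drawn independently for every sample and category; both points, along with the function names and toy inputs, are assumptions made for illustration.

```python
import numpy as np


def seql_weights(labels, freq, lam, gamma, rng):
    """w[i, k] = 1 - beta[i, k] * T_lambda(f_k) * (1 - y[i, k])."""
    tail = (np.asarray(freq) < lam).astype(np.float64)
    # beta is 1 with probability gamma (drawn per sample and category: assumption).
    beta = (rng.random(labels.shape) < gamma).astype(np.float64)
    return 1.0 - beta * tail[None, :] * (1.0 - labels)


def softmax_equalization_loss(logits, labels, weights):
    """Softmax cross-entropy whose denominator terms are scaled by the weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    # The ground-truth category always keeps weight 1, so the sum is positive.
    denom = (weights * np.exp(shifted)).sum(axis=1)
    target_logit = (labels * shifted).sum(axis=1)
    return (np.log(denom) - target_logit).mean()


freq = np.array([0.3, 0.05, 0.0001, 0.0002])
lam = 0.001
labels = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 0.0]])
logits = np.array([[1.5, 0.2, 0.9, -0.4],
                   [0.3, -0.1, 0.0, 1.1]])

rng = np.random.default_rng(0)
plain_ce = softmax_equalization_loss(logits, labels, np.ones_like(labels))
weights_g0 = seql_weights(labels, freq, lam, gamma=0.0, rng=rng)
weights_g1 = seql_weights(labels, freq, lam, gamma=1.0, rng=rng)
loss_g1 = softmax_equalization_loss(logits, labels, weights_g1)
print(plain_ce, loss_g1)
```

With $\gamma=0$ nothing is ignored and the loss reduces to ordinary softmax cross-entropy; with $\gamma=1$ every tail category other than the ground truth is dropped from the denominator.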

Experiments on LVIS

We conduct extensive experiments for equalization loss. In this section, we first present the implementation details and the main results on the LVIS dataset in Section 4.2 and Section 4.3. Then we perform ablation studies to analyze different components of equalization loss in Section 4.4. In Section 4.5, we compare equalization loss with other methods. Details of LVIS Challenge 2019 will be introduced in Section 4.6.

LVIS is a large vocabulary dataset for instance segmentation, which contains 1230 categories in the current version v0.5. In LVIS, categories are divided into three groups according to the number of images that contain them: rare (1-10 images), common (11-100), and frequent ( $>$ 100). We train our model on the 57k train images and evaluate it on the 5k val set. We also report our results on the 20k test images. The evaluation metric is AP across IoU thresholds from 0.5 to 0.95 over all categories. Unlike the COCO evaluation process, since LVIS is a sparsely annotated dataset, detection results of categories that are not listed in the image-level labels are not evaluated.
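The grouping rule stated above is simple enough to write down directly. The function name `lvis_frequency_group` is hypothetical; the boundaries are the ones given in the text.

```python
def lvis_frequency_group(num_images):
    """Assign a category to rare / common / frequent by its image count."""
    if num_images < 1:
        raise ValueError("a category must appear in at least one image")
    if num_images <= 10:
        return "rare"
    if num_images <= 100:
        return "common"
    return "frequent"


groups = [lvis_frequency_group(n) for n in (1, 10, 11, 100, 101, 5000)]
print(groups)
```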

4.2 Implementation Details

We implement standard Mask R-CNN equipped with FPN as our baseline model. Training images are resized such that their shorter edge is 800 pixels while the longer edge is no more than 1333. No other augmentation is used except horizontal flipping. In the first stage, RPN samples 256 anchors with a 1:1 ratio between the foreground and background, and then 512 proposals are sampled per image with a 1:3 foreground-background ratio for the second stage. We use 16 GPUs with a total batch size of 32 for training. Our model is optimized by stochastic gradient descent (SGD) with momentum 0.9 and weight decay 0.0001 for 25 epochs, with an initial learning rate of 0.04, which is decayed to 0.004 and 0.0004 at epoch 16 and epoch 22 respectively. Though class-specific mask prediction achieves better performance, we adopt a class-agnostic regime in our method due to the huge memory and computation cost for the large number of categories. Following prior work, the threshold of the prediction score is reduced from 0.05 to 0.0, and we keep the top 300 bounding boxes as prediction results. We make a small modification when EQL is applied on LVIS. Since, for each image, LVIS provides additional image-level annotations of which categories are in that image (positive category set) and which categories are not in it (negative category set), categories in EQL will not be ignored if they are in the positive category set or negative category set of that image, i.e. the weight term of Equation 7 will be 1 for those categories, even if they are rare ones.
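The LVIS-specific modification described above can be sketched as an extra mask on the ignoring rule. In this illustration, `verified[i, j]` is a hypothetical boolean array that is true when category $j$ belongs to the positive or negative category set of the image that proposal $i$ comes from; the function name and toy inputs are likewise assumptions.

```python
import numpy as np


def eql_weights_lvis(labels, is_foreground, freq, lam, verified):
    """EQL weights that stay 1 for categories verified at the image level."""
    tail = np.asarray(freq) < lam
    ignore = (np.asarray(is_foreground)[:, None]
              & tail[None, :]
              & (labels == 0)
              & ~verified)
    return 1.0 - ignore.astype(np.float64)


freq = np.array([0.3, 0.0001, 0.00005])   # categories 1 and 2 are rare
lam = 0.001
labels = np.array([[1, 0, 0]])            # one foreground proposal of category 0
is_foreground = np.array([True])
# Category 1 is in the image's negative category set, category 2 is unverified.
verified = np.array([[True, True, False]])

weights = eql_weights_lvis(labels, is_foreground, freq, lam, verified)
print(weights)
```

Here the rare category 1 keeps its negative gradient because the image-level labels confirm its absence, while the unverified rare category 2 is ignored.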

4.3 Effectiveness of Equalization Loss

Table 1 demonstrates the effectiveness of the equalization loss function over different backbones and frameworks. Besides Mask R-CNN, we also apply equalization loss to Cascade Mask R-CNN. Our method achieves consistent improvement on all those models. As we can see from the table, the improvement mainly comes from the rare and common categories, indicating the effectiveness of our method on categories of the long-tailed distribution.

4.4 Ablation Studies

To better analyze equalization loss, we conduct several ablation studies. For all experiments we use ResNet-50 Mask R-CNN.

Frequency Threshold $\lambda$: The influence of different $\lambda$ is shown in Table 2. We perform experiments varying $\lambda$ over a broad range, starting from $1.76\times 10^{-4}$, which exactly splits rare categories from all other categories. We empirically find that the proper $\lambda$ lies in the range where $TR(\lambda)$ is between 2% and 10%. Results in Table 2 show a significant improvement in overall AP as $\lambda$ increases to include more tail categories. Meanwhile, the performance tends to degenerate when $\lambda$ increases to include frequent categories. One advantage of equalization loss is that it has a negligible effect on categories whose frequency is larger than a given $\lambda$. When $\lambda=\lambda_{r}$, $AP_{r}$ improves significantly with marginal influence on $AP_{c}$ and $AP_{f}$. And when $\lambda=\lambda_{c}$, $AP_{r}$ and $AP_{c}$ improve a lot while $AP_{f}$ only degenerates slightly. We set $\lambda$ to $\lambda_{c}$ in all our experiments.

Threshold Function $T_{\lambda}(f)$: In Equation 7, we use $T_{\lambda}(f_{j})$ to compute the weight of category $j$ for a given proposal. Besides the proposed threshold function, $T_{\lambda}(f)$ can take other forms to calculate the weight for the categories with frequency under the threshold. As illustrated in Figure 3, we present and compare two other designs: (1) Exponential decay function $y=1-(af)^{n}$, which computes the weight according to the power of category frequency. (2) Gompertz decay function $y=1-ae^{-be^{-cf}}$, which decays smoothly at the beginning and then decreases more steeply. We run multiple experiments for the Exponential decay function and the Gompertz decay function with different hyper-parameters and report the best results. The best hyper-parameter settings are $a=400$ and $n=2$ for the Exponential decay function, and $a=1,b=80,c=3000$ for the Gompertz decay function. Table 3 shows that all three designs achieve fairly similar results, while both the exponential decay and Gompertz decay functions introduce more hyper-parameters to fit the design. Therefore, we use the threshold function in our method for its simpler format with fewer hyper-parameters and better performance.
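For a feel of how the three designs behave, the sketch below evaluates them with the hyper-parameters reported above. Clipping the exponential form to $[0,1]$ for frequencies above $1/a$ is an assumption, since the text does not say how values outside that range are handled; the function names are hypothetical.

```python
import numpy as np


def threshold_weight(f, lam):
    """T_lambda(f): 1 for frequencies below lam, 0 otherwise."""
    return (np.asarray(f, dtype=np.float64) < lam).astype(np.float64)


def exponential_decay_weight(f, a=400.0, n=2.0):
    """y = 1 - (a f)^n, clipped to [0, 1] (the clipping is an assumption)."""
    f = np.asarray(f, dtype=np.float64)
    return np.clip(1.0 - (a * f) ** n, 0.0, 1.0)


def gompertz_decay_weight(f, a=1.0, b=80.0, c=3000.0):
    """y = 1 - a * exp(-b * exp(-c f))."""
    f = np.asarray(f, dtype=np.float64)
    return 1.0 - a * np.exp(-b * np.exp(-c * f))


frequencies = np.array([0.0, 1e-4, 1.25e-3, 2.5e-3, 1e-2])
print(threshold_weight(frequencies, lam=1.76e-4))
print(exponential_decay_weight(frequencies))
print(gompertz_decay_weight(frequencies))
```

All three map very rare categories to a value near 1 (negative gradients ignored) and frequent categories to a value near 0; they differ only in how sharply they move between the two.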

Excluding Function $E(r)$: Table 4 shows the experiment results for EQL with or without $E(r)$. EQL without $E(r)$ means removing $E(r)$ from Equation 7, which treats the foreground and background the same way. EQL with $E(r)$ means equalization loss only affects foreground proposals, as defined in Equation 7. The results demonstrate the importance of $E(r)$. As we can see from the table, with $E(r)$, EQL achieves a 0.6 point AP gain compared with EQL without $E(r)$. If $E(r)$ is discarded, although $AP_{r}$ increases, $AP_{f}$ drops dramatically, which causes the overall AP to decline.

It is worth noting that if we do not use $E(r)$, a large number of background proposals will also be ignored for rare and common categories, and the insufficient supervision from background proposals will cause extensive false positives. We visualize the detection results of an example image in Figure 4. Without $E(r)$, more false positives are introduced, which are shown in red. Both the analysis and the illustration above suggest that $AP_{r}$ should decrease without $E(r)$, which contradicts the experiment results in Table 4. The reason is that, according to the LVIS evaluation protocol, if it is not certain whether category $j$ is present in image $I$, all the false positives of category $j$ will be ignored in image $I$. If category $j$ is rare, the increased false positives are mostly ignored, which alleviates their influence. But the simultaneously increased true positives bring a direct increase in $AP_{r}$.
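The part of the evaluation protocol that matters here can be sketched as a simple filter: detections of a category count only in images whose image-level labels list that category as present or absent. The data layout and the function name `detections_to_evaluate` are hypothetical, and this is not the official LVIS evaluation code.

```python
def detections_to_evaluate(detections, positive_set, negative_set):
    """Keep detections whose category is verified for this image."""
    verified = set(positive_set) | set(negative_set)
    return [det for det in detections if det["category"] in verified]


# Hypothetical detections for one image.
detections = [
    {"category": "person", "score": 0.9},
    {"category": "kite", "score": 0.4},      # rare category, not verified here
    {"category": "umbrella", "score": 0.3},  # verified as absent
]
kept = detections_to_evaluate(detections,
                              positive_set={"person"},
                              negative_set={"umbrella"})
print([det["category"] for det in kept])
```

In this toy case the "kite" detection is dropped before scoring, so a false positive of an unverified rare category costs nothing, whereas the "umbrella" detection is kept and counts as a false positive.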

4.5 Comparison with Other Methods

Table 5 presents the comparison with other methods that are widely adopted to tackle the class imbalance problem. According to the table, re-sampling methods improve $AP_{r}$ and $AP_{c}$ at the expense of $AP_{f}$, while re-weighting methods bring consistent gains on all categories but the overall improvement is marginal. The equalization loss improves $AP_{r}$ and $AP_{c}$ significantly with only a slight effect on $AP_{f}$, surpassing all other approaches.

4.6 LVIS Challenge 2019

With the help of the equalization loss, we won 1st place in the LVIS challenge held at the COCO and Mapillary Joint Recognition Challenge 2019. Combined with other enhancements, like a larger backbone, deformable convolution, synchronized batch normalization, and extra data, our method achieves a 28.9 mask AP on LVIS v0.5 test set, outperforming the ResNeXt-101 Mask R-CNN baseline (20.1%) by 8.4%. More details about our challenge solution are described in Appendix A.

Experiments on Open Images Detection

Open Images dataset V5 is a large dataset of 9M images annotated with image-level labels and bounding boxes. In our experiments, we use the data split and the subset of categories of the 2019 competition for the object detection track (OID19). The train set of OID19 contains 12.2M bounding boxes over 500 categories on 1.7M images, and the val set contains about 10k images.

According to Table 6, our method achieves a great improvement compared with standard sigmoid cross-entropy, and outperforms the class-aware sampling method by a significant margin. To better understand the improvement of our method, we group all the categories by their image number and report the performance of each group. We can see that our method has larger improvements on categories with fewer samples. Significant AP gains on the group of the 100 categories with the fewest images are achieved compared with sigmoid cross-entropy and class-aware sampling (2.6 and 10.88 points respectively).

Experiments on Image Classification

To demonstrate the generalization ability of the equalization loss when transferring to other tasks, we also evaluate our method on two long-tailed image classification datasets, CIFAR-100-LT and ImageNet-LT.

Datasets. We follow exactly the same setting as prior work to generate CIFAR-100-LT with an imbalance factor of 200 (https://github.com/richardaecn/class-balanced-loss). CIFAR-100-LT contains 9502 images in the train set, with 500 images for the most frequent category and 2 images for the rarest category. CIFAR-100-LT shares the same test set of 10k images with the original CIFAR-100. We report the top-1 and top-5 accuracy. ImageNet-LT is generated from ImageNet-2012 and contains 1000 categories, with the number of images per category ranging from 1280 to 5 (https://github.com/zhmiao/OpenLongTailRecognition-OLTR). There are 116k images for training and 50k images for testing. Different from CIFAR-100-LT, we additionally present the accuracies of many-shot, medium-shot and few-shot classes to measure the improvement on tail classes.
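The per-class image counts of a long-tailed CIFAR-100 split can be sketched as follows. The exponential profile used here is an assumption about how the referenced setting builds the split, not something stated in the text; it does reproduce the two endpoints given above (500 images for the most frequent class and 2 for the rarest). The function name is hypothetical.

```python
def long_tailed_class_counts(num_classes=100, max_count=500, imbalance_factor=200):
    """Per-class counts decaying exponentially from max_count (assumed profile)."""
    counts = []
    for class_index in range(num_classes):
        fraction = class_index / (num_classes - 1)
        counts.append(int(max_count * (1.0 / imbalance_factor) ** fraction))
    return counts


counts = long_tailed_class_counts()
print(counts[0], counts[-1], sum(counts))
```

If the assumed profile matches the original setting, the printed total should agree with the train set size quoted above; the sketch itself does not guarantee that.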

Implementation Details. For CIFAR-100-LT, we use Nesterov SGD with momentum 0.9 and weight decay 0.0001 for training. We use a total mini-batch size of 256 with 128 images per GPU. The ResNet-32 model is trained for 12.8K iterations with learning rate 0.2, which is then decayed by a factor of 0.1 at 6.4K and 9.6K iterations. The learning rate is increased gradually from 0.1 to 0.2 during the first 400 iterations. For data augmentation, we first follow the same setting as prior work, then use AutoAugment and Cutout. In testing, we simply use the original $32\times 32$ images. For ImageNet-LT, we use a total mini-batch size of 1024 with 16 GPUs. We use ResNet-10 as our backbone, following prior work. The model is trained for 12K iterations with learning rate 0.4, which is divided by 10 at 3.4K, 6.8K and 10.2K iterations. A gradual warmup strategy is also adopted to increase the learning rate from 0.1 to 0.4 during the first 500 iterations. We use random-resize-crop, color jitter and horizontal flipping as data augmentation. The training input size is $224\times 224$. In testing, we resize the images to $256\times 256$ and then crop a single view of $224\times 224$ at the center.
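The two learning-rate schedules described above can be written as one small function. A linear warmup is assumed, since the text only says that the rate is increased gradually; the function name and argument names are hypothetical.

```python
def learning_rate(iteration, base_lr, warmup_start, warmup_iters, milestones, decay=0.1):
    """Warmup (assumed linear) followed by step decay at the given milestones."""
    if iteration < warmup_iters:
        progress = iteration / warmup_iters
        return warmup_start + (base_lr - warmup_start) * progress
    drops = sum(iteration >= milestone for milestone in milestones)
    return base_lr * decay ** drops


# Settings taken from the text for the two datasets.
cifar = dict(base_lr=0.2, warmup_start=0.1, warmup_iters=400, milestones=(6400, 9600))
imagenet = dict(base_lr=0.4, warmup_start=0.1, warmup_iters=500,
                milestones=(3400, 6800, 10200))

for it in (0, 200, 400, 6400, 9600):
    print("CIFAR-100-LT", it, learning_rate(it, **cifar))
for it in (0, 500, 3400, 6800, 10200):
    print("ImageNet-LT", it, learning_rate(it, **imagenet))
```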

Results on CIFAR-100-LT and ImageNet-LT. We build a much stronger baseline on CIFAR-100-LT thanks to those augmentation techniques. As shown in Table 7, our EQL still improves the strong baseline by a large margin of 2%, and those improvements come from classes with fewer training samples. As for ImageNet-LT, we also present ablation studies in Table 9. A wide range of values of $\gamma$ gives consistent improvements over the softmax loss baseline. As shown in Table 8 and Table 10, our equalization loss surpasses prior state-of-the-art approaches significantly, which demonstrates that our method can be generalized to different tasks and datasets effectively.

Conclusion

In this work, we analyze the severe inter-class competition problem in long-tailed datasets. We propose a novel equalization loss function to alleviate the effect of the overwhelming discouraging gradients on tail categories. Our method is simple but effective, bringing a significant improvement over different frameworks and network architectures on challenging long-tailed object detection and image classification datasets.

Acknowledgment. Wanli Ouyang is supported by the Australian Research Council Grant DP200103223.

Appendix A Details of LVIS Challenge 2019

With equalization loss, we ranked 1st in the LVIS Challenge 2019. In this section, we introduce the details of the solution we used in the challenge.

External Data Exploiting. Since LVIS is not exhaustively annotated with all categories and the annotations for long-tailed categories are quite scarce, we utilize additional public datasets to enrich our training set. First, we train a Mask R-CNN on COCO train2017 with 115k images and then fine-tune our model with equalization loss on LVIS. During fine-tuning, we leverage COCO bounding box annotations as ignored regions to exclude background proposals during sampling. Moreover, we borrow $\sim$ 20k images from Open Images V5, which shares 110 categories with LVIS, and use the bounding box annotations to train the model.

Model Enhancements. We achieve our challenge baseline by training ResNeXt-101-64x4d enhanced by deformable convolution and synchronized batch normalization, along with equalization loss, repeat factor sampling, multi-scale training and COCO exploiting, which leads to 30.1% AP on LVIS v0.5 val set. We apply multi-scale testing on both bounding box and segmentation results, and the testing scale ranges from 600 to 1400 with a step size of 200. We train two expert models on the train sets of COCO 2017 and Open Images V5 respectively and then evaluate them on LVIS val set to collect the detection results of shared categories. Though our method improves the performance of long-tailed categories a lot, the prediction scores for these categories tend to be smaller than those of frequent ones due to the lack of positive training samples, which leads to degeneration of $AP_{r}$ in the ensemble. To keep more results for rare and common categories, we employ a re-score ensemble approach that raises the scores of these categories.

Our road map is shown in Table 11. With those enhancements, we achieve 36.4 and 28.9 Mask AP on the val and test sets respectively, as shown in Table 12.