1st Place Solution of LVIS Challenge 2020: A Good Box is not a Guarantee of a Good Mask

Jingru Tan, Gang Zhang, Hanming Deng, Changbao Wang, Lewei Lu, Quanquan Li, Jifeng Dai

Introduction

LVIS is a new dataset for large-vocabulary instance segmentation. First, because modern object detectors perform poorly in the few-sample regime, it opens new research opportunities for long-tailed object detection. Second, unlike the COCO dataset, it provides over 2 million high-quality mask annotations, making it possible to train and evaluate against high-quality ground truth.

Our solution focuses on these two aspects: (1) handling the extreme inter-class imbalance caused by the long-tailed distribution, and (2) predicting higher-quality instance masks. Overall, we adopt a two-stage training strategy consisting of a representation learning stage and a fine-tuning stage. In the representation learning stage, we use EQL, repeat factor re-sampling, data augmentation, and self-training to learn a generalized representation. In the fine-tuning stage, we first freeze the backbone and follow Balanced Group Softmax to rebalance the classifier and address the inter-class imbalance. We also put more emphasis on the mask head at this stage, because we found that a well-aligned bounding box does not guarantee a precise mask. For example, instances of some categories usually have large bounding boxes but thin masks, i.e., the area ratio of the mask to the bounding box is small. Given a proposal, however, the traditional strategy extracts features from a specific feature map chosen by the scale of the bounding box. As a consequence, if a proposal has a large bounding box, the detailed information required for predicting a thin mask may already be discarded in the coarse feature map. To alleviate this problem, we assign mask proposals by considering both the scale of the bounding box and the area ratio of the mask to the bounding box. Another issue caused by an extremely small area ratio is the imbalance between foreground and background pixels when training the mask head. We therefore propose a novel balanced mask loss, which combines dice loss with a weighted binary cross-entropy loss. Specifically, the new mask loss dynamically adjusts the weight of foreground pixels according to the area ratio.

Our Approach

EQL. We apply Equalization Loss to alleviate the suppression of rare and common categories.

Data Augmentation. Mosaic, rotation, and scale jitter are used unless otherwise stated.

Self-training. We run inference on the LVIS v1.0 training data and collect as pseudo labels the predicted bounding boxes that do not overlap with any ground truth. We treat these pseudo labels as missing annotations (caused by the sparse annotation of a federated dataset) and ignore proposals that have a large IoU with them. We also run inference on Open Images data and use the resulting pseudo labels for joint training with the standard LVIS v1.0 training data. In each training epoch we sub-sample only 10k images from the pseudo-labeled data and use a loss weight $\lambda$ to control their effect.
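The ignore step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the threshold `ignore_iou=0.5` is an assumption, since the text only says the IoU must be "large".

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_pseudo_labels(detections, gt_boxes):
    """Keep detections that have no overlap with any ground-truth box."""
    return [d for d in detections
            if all(box_iou(d, g) == 0.0 for g in gt_boxes)]


def ignore_mask(proposals, pseudo_boxes, ignore_iou=0.5):
    """Return True for each proposal that should be excluded from the loss.

    ignore_iou is a hypothetical threshold, not a value from the text.
    """
    return [any(box_iou(p, q) >= ignore_iou for q in pseudo_boxes)
            for p in proposals]


gt = [(0, 0, 10, 10)]
detections = [(5, 5, 15, 15), (20, 20, 30, 30)]
pseudo = select_pseudo_labels(detections, gt)
proposals = [(21, 21, 30, 30), (0, 0, 10, 10), (40, 40, 50, 50)]
flags = ignore_mask(proposals, pseudo)
```

Proposals flagged `True` would simply contribute neither a positive nor a negative term to the classification loss, so a likely unlabeled object is not pushed toward background.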

Fine-tuning Stage

Classifier Balance. After the representation learning stage, we freeze the backbone, neck, and RPN. Balanced Group Softmax is then used to rebalance the classifier.

Mask Proposal Assignment. Unlike COCO, which includes 80 well-defined categories, the LVIS dataset has 1203 categories found by data-driven object discovery, and instances of some categories have irregular shapes. This raises new challenges. We find that instances of some categories have large bounding boxes but thin masks; in other words, the area ratio of the mask to the bounding box is small. In the proposal assignment stage, however, proposals are usually assigned to a specific feature map (e.g., P2, P3, P4, or P5 when using FPN) according to the scale of the bounding box alone. As a result, proposals with large bounding boxes and thin masks are assigned to a coarse feature map, in which the detailed information needed for predicting thin masks may already be discarded. To alleviate this problem, we propose a new assignment strategy for mask proposals that considers both the scale of the bounding box and the area ratio of the mask to the bounding box. Specifically, we assign proposals according to Eq.1, where $S_{bbox}$ and $S_{mask}$ denote the areas of the bounding box and the mask respectively, and 3 is the level index of the feature map with the coarsest resolution.
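To make the idea concrete, the sketch below contrasts the usual scale-only level assignment with one hypothetical ratio-aware variant. It is not Eq.1: the shrinking rule, the exponent `alpha`, and the base size `finest_scale=56` are assumptions chosen only for illustration. The only details taken from the text are the inputs $S_{bbox}$ and $S_{mask}$ and the coarsest level index 3.

```python
import math


def standard_level(s_bbox, finest_scale=56.0, coarsest=3):
    """Scale-only assignment: larger boxes go to coarser levels (0..coarsest)."""
    scale = math.sqrt(s_bbox)
    level = math.floor(math.log2(scale / finest_scale + 1e-6))
    return min(max(level, 0), coarsest)


def ratio_aware_level(s_bbox, s_mask, alpha=0.5, finest_scale=56.0, coarsest=3):
    """Hypothetical variant: shrink the effective scale by the area ratio.

    alpha = 0 recovers the scale-only rule; alpha = 0.5 uses sqrt(s_mask).
    Both the rule and alpha are illustrative assumptions, not Eq.1.
    """
    ratio = s_mask / s_bbox
    scale = math.sqrt(s_bbox) * ratio ** alpha
    level = math.floor(math.log2(scale / finest_scale + 1e-6))
    return min(max(level, 0), coarsest)


# A 500x500 box whose mask covers only 4% of the box area.
box_area, mask_area = 500 * 500, 10000
coarse = standard_level(box_area)
fine = ratio_aware_level(box_area, mask_area)
```

Under these assumptions the thin instance moves from the coarsest level to the finest one, while a box fully covered by its mask keeps its scale-only level.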

Balanced Mask Loss. As mentioned above, instances of some categories have large bounding boxes but thin masks, which also causes an imbalance between foreground and background pixels during training. We therefore propose a new balanced mask loss, Eq.2, to handle this problem. It consists of a dice loss and a weighted binary cross-entropy loss:

$$L_{mask} = L_{dice}(p_{m}, y_{m}) + \lambda L_{wbce}(p_{m}, y_{m}) \qquad (2)$$

where $p_{m}\in \mathbb{R}^{H\times W}$ denotes the predicted mask for a particular category, $y_{m}\in \mathbb{R}^{H\times W}$ denotes the corresponding ground-truth mask, and $H$ and $W$ are the height and width of the predicted mask map respectively. $\lambda$ is a hyper-parameter that adjusts the weight of the weighted binary cross-entropy loss. We set $\lambda$ to 1 in all experiments.

In the dice loss, $i$ denotes the $i$-th pixel and $\epsilon$ is a smoothing term that avoids division by zero. We set $\epsilon$ to 1 in all experiments. Note that using the dice loss alone as mask supervision performs worse than the standard binary cross-entropy loss.

Weighted binary cross-entropy loss is given as follows.

Weight for each pixel is given as follows.

As Eq.5 shows, when pixel $i$ is a foreground pixel, we adjust its weight according to the area ratio of the mask to the bounding box of the proposal to which pixel $i$ belongs.
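A minimal sketch of such a loss on a flattened mask is given below. The overall structure (dice loss plus $\lambda$ times a weighted binary cross-entropy, with $\lambda = 1$ and $\epsilon = 1$) follows the text. The exact dice formulation, the normalization by the number of pixels, and in particular the foreground weight `min(1 / area_ratio, max_weight)` are assumptions made for illustration, not the paper's equations.

```python
import math


def dice_loss(p, y, eps=1.0):
    """One common dice formulation (assumed here) with smoothing term eps."""
    inter = sum(pi * yi for pi, yi in zip(p, y))
    return 1.0 - (2.0 * inter + eps) / (sum(p) + sum(y) + eps)


def pixel_weights(y, area_ratio, max_weight=10.0):
    """Hypothetical weighting: foreground pixels get a larger weight when the
    mask-to-box area ratio is small; background pixels keep weight 1."""
    fg_weight = min(1.0 / max(area_ratio, 1e-6), max_weight)
    return [fg_weight if yi == 1 else 1.0 for yi in y]


def weighted_bce(p, y, w, tiny=1e-7):
    """Per-pixel weighted binary cross-entropy, averaged over all pixels."""
    total = 0.0
    for pi, yi, wi in zip(p, y, w):
        pi = min(max(pi, tiny), 1.0 - tiny)  # clamp to keep log finite
        total += -wi * (yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi))
    return total / len(p)


def balanced_mask_loss(p, y, area_ratio, lam=1.0, eps=1.0):
    """Dice loss plus lam times the weighted binary cross-entropy."""
    w = pixel_weights(y, area_ratio)
    return dice_loss(p, y, eps) + lam * weighted_bce(p, y, w)


target = [1, 0, 0, 0]
uncertain = [0.5, 0.5, 0.5, 0.5]
loss_thin = balanced_mask_loss(uncertain, target, area_ratio=0.1)
loss_wide = balanced_mask_loss(uncertain, target, area_ratio=0.5)
```

With the same prediction and target, the proposal with the smaller area ratio receives the larger loss, which is the behavior the weighting is meant to produce.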

Boundary Supervision. In addition, we add intermediate boundary supervision to improve mask localization accuracy, following prior work.

More Computation on Head. We also add three more convolution layers to the mask head and use deformable RoI pooling to extract proposal features in the second stage.

Experiments

LVIS. We perform experiments on the LVIS v1.0 dataset, which contains 1203 categories. It consists of 100k training images and 19.8k validation images. Note that LVIS is the only source of annotated training data. We also report our results on the 19.8k-image test-dev set.

Open Images. We use only the images of Open Images, without their annotations, to generate pseudo labels.

Implementation Details

We re-implement Mask R-CNN and HTC following the original papers. All hyper-parameters are kept unchanged, except that we set the weight decay to 0.00005 instead of 0.0001 for large models. The learning rate is set to 0.02 and the batch size is 16 (one image per GPU). For the HTC model, we do NOT include the semantic segmentation branch because COCO-Stuff annotations are not permitted. We train models with a small backbone, e.g. ResNet-50, for 24 epochs, with the learning rate divided by 10 at the 16th and 22nd epochs. Large models are trained for 15 epochs, with the learning rate divided by 10 at the 11th and 14th epochs. All models are initialized from ImageNet pre-trained weights.
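The step schedule above can be written as a small function. This is a sketch under one assumption: epochs are counted from 1 and the drop takes effect at the start of each milestone epoch (the text does not say whether the drop happens at the start or the end of that epoch). The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.02, milestones=(16, 22), gamma=0.1):
    """Step schedule: multiply the rate by gamma at every milestone reached.

    epoch is 1-indexed; the drop is assumed to apply from the milestone epoch on.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Small-backbone schedule: 24 epochs, drops at epochs 16 and 22.
small_schedule = [learning_rate(e) for e in range(1, 25)]

# Large-model schedule: 15 epochs, drops at epochs 11 and 14.
large_schedule = [learning_rate(e, milestones=(11, 14)) for e in range(1, 16)]
```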

Ablation Studies

We choose R50-FPN-MaskRCNN as our baseline model, with a class-specific mask head and scale jitter. Several useful enhancement techniques are shown in Table 1. With these methods, we improve the AP from 19.2 to 33.2. Building on this strong baseline, we then apply the other methods; the results are shown in Table 2.

The results of the fine-tuning stage are presented in Table 2. First, fine-tuning with Balanced Group Softmax improves the AP from 33.2 to 34.7, with a gap of 3.0 between bounding boxes and masks. With our proposed high-quality mask methods, we shrink the gap to 2.3 and further improve the AP to 36.1.

We apply multi-scale testing as test-time augmentation, with several modifications to the standard procedure. (1) We limit the valid range of bounding box sizes for each resolution, i.e., we only accept small bounding boxes from high-resolution images and large bounding boxes from low-resolution images. (2) We slightly increase the scores of rare categories when merging the detected boxes from multiple scales. (3) We use standard NMS with a threshold of 0.7, followed by Soft-NMS. (4) We extract mask predictions from images of different resolutions according to the scale of the bounding box and the area ratio of the mask to the bounding box. With these changes, we achieve a single-model result of 41.5 AP on the LVIS v1.0 val set.
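Modifications (1) and (2) can be sketched as a merging step that runs before NMS. Everything numeric here is an assumption for illustration: the test resolutions, the valid size ranges in `VALID_RANGES`, and the bonus `rare_bonus` are hypothetical, and boxes are assumed to be already mapped back to original-image coordinates.

```python
import math

# Hypothetical valid ranges of sqrt(box area), keyed by the test short side.
VALID_RANGES = {
    1600: (0.0, 128.0),      # high resolution: accept small boxes only
    1000: (64.0, 512.0),
    600: (256.0, math.inf),  # low resolution: accept large boxes only
}


def filter_by_scale(dets, short_side):
    """Drop detections whose size falls outside the range for this resolution."""
    lo, hi = VALID_RANGES[short_side]
    kept = []
    for d in dets:
        x1, y1, x2, y2 = d["box"]
        size = math.sqrt(max(0.0, x2 - x1) * max(0.0, y2 - y1))
        if lo <= size < hi:
            kept.append(d)
    return kept


def merge_scales(dets_per_scale, rare_ids, rare_bonus=0.01):
    """Merge filtered detections from all scales, nudging rare-category scores."""
    merged = []
    for short_side, dets in dets_per_scale.items():
        for d in filter_by_scale(dets, short_side):
            bonus = rare_bonus if d["label"] in rare_ids else 0.0
            merged.append({**d, "score": min(d["score"] + bonus, 1.0)})
    return sorted(merged, key=lambda d: d["score"], reverse=True)


dets_per_scale = {
    1600: [{"box": (0, 0, 50, 50), "score": 0.6, "label": 1},
           {"box": (0, 0, 400, 400), "score": 0.9, "label": 2}],
    600: [{"box": (0, 0, 400, 400), "score": 0.8, "label": 2},
          {"box": (0, 0, 50, 50), "score": 0.7, "label": 1}],
}
merged = merge_scales(dets_per_scale, rare_ids={1})
```

The merged list would then be passed to NMS with a threshold of 0.7 followed by Soft-NMS, as described in (3).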

Final Results

We submit our test-dev results to the LVIS v1.0 evaluation server. The results are shown in Table 3.