ACFNet: Attentional Class Feature Network for Semantic Segmentation

Fan Zhang, Yanqin Chen, Zhihang Li, Zhibin Hong, Jingtuo Liu, Feifei Ma, Junyu Han, Errui Ding

Introduction

Semantic segmentation, which aims to assign a class label to every pixel of a given image, is one of the fundamental tasks in computer vision. It has been widely used in challenging fields such as autonomous driving, scene understanding and human parsing. Recent state-of-the-art semantic segmentation approaches are typically based on convolutional neural networks (CNNs), especially Fully Convolutional Network (FCN) frameworks.

One of the most effective ways to improve performance is to exploit richer context. For example, Chen et al. proposed atrous spatial pyramid pooling (ASPP), which aggregates regularly sampled pixels at different dilation rates around a pixel as its context. In PSPNet, the pyramid pooling module divides the feature map into multiple regions of different sizes, and the pooled representation of each region is treated as the context of the pixels within that region. Global average pooling (GAP) is also widely used to obtain a global context. Generally, these methods focus on different spatial strategies for capturing richer contextual information. They do not explicitly distinguish pixels from different classes when calculating the context. Surrounding activated objects from different categories contribute equally to the context regardless of which category the pixel comes from, which can make it harder for the pixel to determine the category it belongs to.

Different from the methods above, we argue that exploiting class-level context, a previously ignored factor, is also critical for semantic segmentation. In this work, we propose a new approach that exploits contextual information from a categorical perspective. We first present the class center, which describes the overall representation of each category in an image. Specifically, the class center of a class is the aggregation of the features of all pixels belonging to that class. A comparison between the class center and traditional context modules such as ASPP and the pyramid pooling module (PPM) is shown in Figure 1. ASPP and PPM exploit context through spatial strategies, while the class center captures context from a categorical perspective, using all pixels of the same category to calculate a class-level representation.

However, groundtruth labels are not available at test time. Hence, we propose a simple yet effective coarse-to-fine segmentation framework to approximate the class center: the class center of each class is calculated from the coarse segmentation result and the high-level feature map of the backbone.

Moreover, inspired by the successful applications of the attention mechanism in computer vision tasks, we argue that different pixels need to attend adaptively to the class centers of different categories. For example, if an image contains no ‘road’ class, pixels in this image do not need to focus on the feature of ‘road’. Or if a pixel oscillates between the classes ‘person’ and ‘rider’, it should pay more attention to how ‘person’ and ‘rider’ behave in the whole image than to other categories. Therefore, we propose an attentional class feature (ACF) module that uses the attention mechanism to make pixels selectively aware of the different class centers of the whole scene. Unlike previous works, which design an independent module to learn the attention map, we directly use the coarse segmentation result as our attention map.

The overall structure of our proposed coarse-to-fine segmentation network, named Attentional Class Feature Network, is shown in Figure 2. The network consists of two parts. The first part is a complete semantic segmentation network, called the base network, which generates the coarse segmentation result; it can be any state-of-the-art semantic segmentation network. The second part is our ACF module. The ACF module first uses the coarse segmentation result and the feature map of the base network to calculate the class center of each category. The attentional class feature is then computed from the coarse segmentation result and the class centers. Finally, the attentional class feature and the original feature of the base network are fused to generate the final segmentation.

We evaluate our Attentional Class Feature Network (ACFNet) on the popular scene parsing dataset Cityscapes, where it achieves a new state-of-the-art performance of 81.85% mean IoU with only fine-annotated data for training.

Our contributions can be summarized as follows:

We first present the concept of class center, which represents the class-level context, to help pixels be aware of the performance of different categories in the whole scene.

The Attentional Class Feature (ACF) module is proposed to make different pixels adaptively focus on different class centers.

We propose a coarse-to-fine segmentation structure, named Attentional Class Feature Network (ACFNet), to exploit class-level context to improve the semantic segmentation.

ACFNet achieves a new state-of-the-art mean IoU of 81.85% on the popular Cityscapes benchmark with only fine-annotated data for training.

Related Work

Semantic Segmentation. Benefiting from the advances of deep neural networks, semantic segmentation has achieved great success. FCN first replaced the fully connected layers of traditional classification networks with convolutional layers to produce a segmentation result. SegNet, RefineNet, Deeplabv3+ and UNet adopt an encoder-decoder structure to carefully recover the reduced spatial information through step-by-step upsampling. Conditional random fields (CRFs), Markov random fields (MRFs) and recurrent neural networks (RNNs) are also widely used to exploit long-range dependencies. Dilated convolution is used to maintain a sufficiently large receptive field while increasing the feature resolution. In our work, we also use the dilation strategy of previous work to preserve the resolution.

Context. Context plays a critical role in various vision tasks, including semantic segmentation. Many works focus on how to exploit more discriminative context to help segmentation. Some use global average pooling (GAP) to exploit image-level context. Atrous spatial pyramid pooling (ASPP) was proposed to capture nearby context based on different dilation rates. In PSPNet, average pooling is employed over four different pyramid scales, and the pixels in one sub-region are treated as the context of pixels within the same sub-region. Other works focus on how to fuse different context information more selectively. In contrast to the conventional context described above, in this paper we harvest contextual information from a categorical perspective.

More recently, a few works have also investigated the influence of class-specific context. In EncNet, the channel-wise class-level features are enhanced or weakened according to the whole scene. Unlike EncNet, our work mainly focuses on selectively utilizing the class-specific context at the pixel level.

Attention. Attention is widely used in various fields, including natural language processing and computer vision. Vaswani et al. proposed the transformer, which uses self-attention for machine translation. Hu et al. proposed an object relation module to extend a learnable NMS operation. The non-local module was proposed by Wang et al. to calculate spatial-temporal dependencies. OCNet and DANet use the self-attention mechanism to explore the context. PSANet also uses an attention map to aggregate long-range contextual information. Our work is inspired by the attention mechanism, which we apply to the calculation of the attentional class feature. Instead of designing an independent module to learn the attention map as in previous works, we simply use the coarse segmentation result as the attention map.

Coarse-to-fine Methods. Coarse-to-fine approaches have been applied successfully to many tasks, such as face detection, shape detection, face alignment and optical flow. Some existing segmentation networks also adopt a coarse-to-fine strategy. Islam et al. combined high-resolution features with the coarse segmentation result of low-resolution features to get a finer segmentation result. In another work, rough locations of the pancreas are obtained in the coarse stage, and the fine stage is in charge of smoothing the segmentation. In our work, we propose a coarse-to-fine structure and focus on improving the final result through feature-level aggregation.

Methodology

In this section, we first introduce our proposed attentional class feature (ACF) module and explain how it captures and adaptively combines the class centers. We then introduce a coarse-to-fine segmentation structure built on the ACF module, named Attentional Class Feature Network (ACFNet).

The overall structure of the ACF module is shown in Figure 2 (d). It consists of two blocks, the Class Center Block (CCB) and the Class Attention Block (CAB), which calculate the class center and the attentional class feature respectively. The ACF module is based on a coarse-to-fine segmentation structure. Its input is the coarse segmentation result and the feature map of the base network, and its output is the attentional class feature.
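With groundtruth labels, the class center described in the introduction (an aggregation of the features of all pixels of a class) can be sketched as a per-class average. The form below is a reconstruction under assumed notation, not a quoted formula: $F_{j}$ denotes the feature of pixel $j$ and $F^{i}_{class}$ the center of class $i$, and the normalization by the pixel count is an assumption.

$$F^{i}_{class} = \frac{\sum_{j} \mathds{1}[y_{j}=i] \cdot F_{j}}{\sum_{j} \mathds{1}[y_{j}=i]}$$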

where $y_{j}$ is the label of pixel $j$ and $\mathds{1}[y_{j}=i]$ is the binary indicator that denotes whether the corresponding pixel comes from the $i$ -th class.

Since the groundtruth label is not available during the test phase, we use the coarse segmentation result to estimate how likely a pixel is to belong to a specific class. For a certain class $A$, pixels with a higher probability of belonging to $A$ in the coarse segmentation usually do belong to $A$, and these pixels should contribute more when computing the class center of $A$. In this way, we can approximate a robust class center.
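The approximation above can be illustrated with a minimal pure-Python sketch. It assumes that the pixels are flattened into a list, that each class center is the probability-weighted average of the pixel features, and that the weights are normalized by their sum; the names `class_centers`, `features` and `coarse_probs` are hypothetical and chosen only for this example.

```python
def class_centers(features, coarse_probs):
    """Approximate class centers as probability-weighted feature averages.

    features:     list of N pixel feature vectors, each of length C
    coarse_probs: list of N coarse class-probability vectors, each of length K
    returns:      list of K class centers, each of length C
    """
    num_classes = len(coarse_probs[0])
    feat_dim = len(features[0])
    centers = []
    for k in range(num_classes):
        weighted_sum = [0.0] * feat_dim
        weight_total = 0.0
        for feat, probs in zip(features, coarse_probs):
            weight_total += probs[k]
            for d in range(feat_dim):
                weighted_sum[d] += probs[k] * feat[d]
        # Assumption: normalize by the total weight, guarding against a
        # class whose probability is (near) zero at every pixel.
        denom = max(weight_total, 1e-12)
        centers.append([v / denom for v in weighted_sum])
    return centers


# Three pixels with 2-D features and two classes (toy values).
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
coarse_probs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
centers = class_centers(features, coarse_probs)
```

With one-hot probabilities, as in this toy input, the sketch reduces to the plain average of the features of each class; softer probabilities let uncertain pixels contribute proportionally less.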

The benefits of the class center are two-fold. Firstly, it allows the pixels to understand the overall representation of each class from a global view. Since the class center is a combination of all pixels in an image, it provides strong supervision during training and can help the model learn more discriminative features for each class. Secondly, the class center can be used to check the consistency between a pixel and each class center in the image, so that the distribution of each class can be further refined. A model learns the distribution of each category across the entire dataset, whereas for a specific image the distribution of a particular category often occupies only a small portion of that dataset-level distribution. The class center of this portion is therefore more representative and more helpful for classifying the pixels in this image. By introducing the class center, the model can correct many cases that were previously misclassified. An example is shown in Figure 4: when only the feature of pixel $p$ is used, the model mislabels it as class $B$, but the misclassification can be fixed by also considering the class centers.

1.2 Attentional Class Feature

Inspired by the attention mechanism, we present the attentional class feature. Different pixels need to selectively attend to different classes. For a pixel $p$ , we use the coarse segmentation result as its attention map to calculate its attentional class feature. The reason why we use the coarse segmentation result is straightforward. If the coarse segmentation mislabels a pixel to a wrong class, it needs to pay more attention to that wrong class to check for the feature consistency. Or if some classes do not even exist in the image, the pixel does not need to know about these classes. As in Figure 4, the pixel $p$ only needs to be aware of the class centers of $A$ and $B$ rather than other class centers.

After the attentional class feature is calculated, we apply a $1\times 1$ conv to refine the calculated feature.
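As a rough illustration of this step, the sketch below computes the attentional class feature of each pixel as the sum of the class centers weighted by that pixel's coarse class probabilities, which matches the weighted-summation description given later in the ablation study. The function and variable names are hypothetical, pixels are assumed to be flattened into a list, and the $1\times 1$ conv refinement is left out.

```python
def attentional_class_feature(coarse_probs, centers):
    """Weighted sum of class centers for every pixel.

    coarse_probs: list of N coarse class-probability vectors, each of length K
    centers:      list of K class centers, each of length C
    returns:      list of N attentional class features, each of length C
    """
    feat_dim = len(centers[0])
    result = []
    for probs in coarse_probs:
        feature = [0.0] * feat_dim
        # The coarse probabilities act as the attention map over classes.
        for prob, center in zip(probs, centers):
            for d in range(feat_dim):
                feature[d] += prob * center[d]
        result.append(feature)
    return result


# Two class centers and three pixels (toy values).
centers = [[2.0, 0.0], [0.0, 2.0]]
coarse_probs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
acf = attentional_class_feature(coarse_probs, centers)
```

A pixel that the coarse segmentation is unsure about, such as the second one here, receives a blend of both class centers, while a class with zero probability contributes nothing.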

2 Attentional Class Feature Network

Based on the Attentional Class Feature (ACF) module, we propose the Attentional Class Feature Network for semantic segmentation, as illustrated in Figure 2. ACFNet consists of two separate parts, the base network and the ACF module. The base network is a complete segmentation network. In our experiments, we use ResNet and ResNet with atrous spatial pyramid pooling (ASPP) as our base networks to verify the effectiveness of the ACF module. The ACF module leverages the segmentation result and the feature map of the base network to calculate the attentional class feature. Finally, we concatenate the attentional class feature with the feature map of the base network and refine the result through a $1\times 1$ conv to get the final segmentation result.
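The final fusion step can be sketched as a channel-wise concatenation followed by a $1\times 1$ convolution, which applies the same linear map at every pixel. This is a toy pure-Python illustration, not the network's implementation: the names `conv1x1` and `fuse`, the concatenation order and the weight values are all assumptions made for the example.

```python
def conv1x1(pixels, weight, bias):
    """Apply a 1x1 convolution: one shared linear map per pixel.

    pixels: list of N vectors of length C_in
    weight: C_out rows, each of length C_in
    bias:   list of length C_out
    """
    return [
        [sum(w * x for w, x in zip(row, px)) + b for row, b in zip(weight, bias)]
        for px in pixels
    ]


def fuse(base_features, acf_features, weight, bias):
    """Concatenate base and attentional class features, then refine them."""
    # Assumption: base features come first in the concatenated channel axis.
    concatenated = [f + a for f, a in zip(base_features, acf_features)]
    return conv1x1(concatenated, weight, bias)


# One pixel with 2 base channels and 2 ACF channels, mapped to 2 class scores.
base = [[1.0, 2.0]]
acf = [[3.0, 4.0]]
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.5]
scores = fuse(base, acf, weight, bias)
```

In a real network the weights would be learned and the output channels would correspond to the segmentation classes.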

Loss Function. For explicit feature refinement, we follow PSPNet and use auxiliary supervision to improve the performance and make the network easier to optimize. The class-balanced cross entropy loss is employed for auxiliary supervision, coarse segmentation and fine segmentation. Finally, we use three parameters $\lambda_{a}$, $\lambda_{c}$ and $\lambda_{f}$ to balance the auxiliary loss $l_{a}$, the coarse segmentation loss $l_{c}$ and the fine segmentation loss $l_{f}$, as shown in Equation 4.
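Balancing three loss terms with three parameters suggests a weighted sum. A form consistent with this description, reconstructed here rather than quoted, with $l$ as an assumed symbol for the total loss:

$$l = \lambda_{a} \cdot l_{a} + \lambda_{c} \cdot l_{c} + \lambda_{f} \cdot l_{f}$$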

Experiments

To evaluate the proposed module, we conduct several experiments on the Cityscapes dataset. The Cityscapes dataset is collected for urban scene understanding and contains 19 classes for scene parsing or semantic segmentation evaluation. It has 5,000 high-resolution ( $2048\times 1024$ ) images, of which 2,975 are used for training, 500 for validation and 1,525 for testing. In our experiments, we use the mean of class-wise Intersection over Union (mIoU) as the evaluation metric.
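For reference, the mIoU metric can be sketched as follows on flattened label lists. This is an illustrative implementation rather than the official evaluation script: the function name `mean_iou` is hypothetical, the ignore label 255 and the choice to skip classes absent from both prediction and groundtruth are assumptions, and counts are accumulated over a single toy image.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean of class-wise Intersection over Union on flattened labels."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # unlabeled pixels do not count (assumed convention)
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong prediction enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Assumption: classes that never occur are left out of the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1, 1, 0]
target = [0, 0, 1, 1, 0, 255]
miou = mean_iou(pred, target, num_classes=2)
```

In the toy input each class has an intersection of 2 and a union of 3, so both class-wise IoUs and their mean equal 2/3.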

We use two base networks to verify the effectiveness and generality of ACF module. One is ResNet-101 which is our baseline network and the other one is ResNet-101 with ASPP. The experiments on the latter network show that our module can also significantly improve the performance when combined with other state-of-the-art modules.

Baseline Network. As the baseline network, we use ResNet-101 pre-trained on ImageNet. Following PSPNet, the classification layer and the last pooling layer are removed, and the dilation rates of the convolution layers within the last two blocks are set to 2 and 4 respectively. The output stride of the network is set to 8.

Baseline Network with ASPP. It is known that the atrous spatial pyramid pooling (ASPP) has achieved great success in segmentation tasks. To verify the generalization ability of the ACF module, we also conduct several experiments based on the ResNet-101 (baseline network) followed by ASPP module. The ASPP consists of four parallel parts: a $1\times 1$ convolution branch and three $3\times 3$ convolution branches with dilation rate being 12, 24 and 36 respectively. In our re-implementation of ASPP module, we follow the original paper but change the output channel from 256 to 512 in all of four branches.

Attentional Class Feature Module. To reduce the computation and the memory usage, we first reduce the channel of input feature of ACF module to 512. The channel number of final output of the ACF module is also set to 512.

2 Implementation Details

For training, we use the stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01, weight decay of 0.0005 and momentum of 0.9 for the Cityscapes dataset. Following previous works, we also employ the ‘poly’ learning rate policy, where the learning rate of the current iteration is multiplied by the factor $(1-\frac{iter}{max\_iter})^{0.9}$. The loss weights $\lambda_{a}$, $\lambda_{c}$ and $\lambda_{f}$ in Equation 4 are set to 0.4, 0.6 and 0.7 respectively. All experiments are trained on 4 $\times$ Nvidia P40 GPUs for 40k iterations with batch size 8.
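The ‘poly’ policy is easy to express in code. The sketch below assumes that the factor scales the initial learning rate (the usual reading of this policy); the function name `poly_lr` is hypothetical, and the values 0.01, 0.9 and 40k iterations are taken from the settings above.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Learning rate under the 'poly' policy at a given iteration."""
    return base_lr * (1.0 - iteration / max_iter) ** power


# Learning rate at the start, halfway point and end of a 40k-iteration run.
lr_start = poly_lr(0.01, 0, 40000)
lr_mid = poly_lr(0.01, 20000, 40000)
lr_end = poly_lr(0.01, 40000, 40000)
```

The rate starts at the initial value, decays slightly faster than linearly in the early iterations, and reaches zero at the final iteration.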

All BatchNorm layers in our network are replaced by InPlaceABN-Sync. To avoid overfitting, we also employ common data augmentation strategies, including random horizontal flipping, random scaling in the range of [0.5, 2.0] and random cropping of $769\times 769$ image patches, following previous work.

3 Ablation Study

In this subsection, we conduct a series of experiments based on the baseline network to reveal the effect of each component in our proposed module.

We first use the atrous ResNet-101 as the baseline network, and the final results are obtained by directly upsampling the output. As a starting point, we evaluate the performance of the baseline network, as shown in Table 1. Note that all our experiments use auxiliary supervision.

Ablation for Attentional Class Feature. We further evaluate the role of the attentional class feature. Essentially, the calculation described in Equation 3 is a weighted summation of class centers in which the weights are the coarse segmentation probabilities of each pixel, so we call this way of calculating the attentional class feature ACF(sum). Besides ACF(sum), we also try another way, named ACF(concat), of leveraging the coarse segmentation probabilities and class centers to get another type of attentional class feature. For a given pixel $j$, ACF(concat) can be formulated as follows,
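One reading consistent with the name and with the inputs listed above is a concatenation of the probability-scaled class centers. The notation is assumed for illustration only: $P^{i}_{j}$ is the coarse probability of class $i$ at pixel $j$, $F^{i}_{class}$ is the center of class $i$, $K$ is the number of classes and $F^{j}_{a}$ is the resulting attentional class feature.

$$F^{j}_{a} = \mathrm{concat}\left(P^{1}_{j} \cdot F^{1}_{class},\; P^{2}_{j} \cdot F^{2}_{class},\; \ldots,\; P^{K}_{j} \cdot F^{K}_{class}\right)$$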

3.2 Feature Similarity

Improvement Compared with Baseline. To better understand how the ACF module improves the final result, we visualize the cosine similarity map between a given pixel and the other pixels in the feature map. As shown in Figure 5, we select two pixels, from ‘terrain’ and ‘car’ respectively. The feature similarity maps of the baseline and ACFNet are shown in columns (c) and (d) respectively. For ACFNet, we use the feature map before fine segmentation to calculate the feature similarity. After adding the class-level context, ACFNet learns a more discriminative feature for each class: the intra-class features are more consistent and the inter-class features are more distinguishable.
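A cosine similarity map of this kind can be computed as in the following sketch, which compares the feature of one selected pixel with the feature at every location. It is a toy pure-Python version under assumed conventions: the feature map is a nested list indexed as `feature_map[row][col]`, and the function name `cosine_similarity_map` is hypothetical.

```python
import math


def cosine_similarity_map(feature_map, row, col):
    """Cosine similarity between the pixel at (row, col) and every pixel.

    feature_map: H x W grid (nested lists) of feature vectors
    returns:     H x W grid of similarities in [-1, 1]
    """
    def norm(vec):
        return math.sqrt(sum(x * x for x in vec))

    query = feature_map[row][col]
    query_norm = norm(query)
    sim_map = []
    for feature_row in feature_map:
        sim_row = []
        for feat in feature_row:
            dot = sum(a * b for a, b in zip(query, feat))
            # Guard against all-zero feature vectors.
            denom = max(query_norm * norm(feat), 1e-12)
            sim_row.append(dot / denom)
        sim_map.append(sim_row)
    return sim_map


# 2 x 2 feature map with 2-D features (toy values).
feature_map = [
    [[1.0, 0.0], [2.0, 0.0]],
    [[0.0, 3.0], [-1.0, 0.0]],
]
sim = cosine_similarity_map(feature_map, 0, 0)
```

Pixels whose features point in the same direction as the selected pixel score close to 1 regardless of magnitude, which is what makes the map useful for judging intra-class consistency.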

Improvement Compared with Coarse Segmentation. As discussed in section 4, the class-level context may also help a pixel check its consistency with each class in the image and thus further refine the segmentation result. To verify this idea, we also visualize, for a specific pixel, the feature similarity of the feature maps before coarse segmentation and before fine segmentation. As shown in Figure 6, the area that shows the improvement is marked by a yellow square in both (e) coarse segmentation and (f) fine segmentation. From (b) and (e), we can see that the model does not learn a good enough distribution of the class ‘building’ and thus mislabels many pixels. The features of those mislabeled pixels are inconsistent with those of the correctly labeled pixels. After adding the attentional class feature for those pixels, however, the refined features of mislabeled and correctly labeled pixels become consistent. Thus, the final result improves significantly.

3.3 Result Visualization

We provide qualitative comparisons between ACFNet and the baseline network in Figure 7, using yellow squares to mark the challenging regions. The baseline easily mislabels such areas, but ACFNet is able to correct them. For example, the baseline model cannot classify ‘truck’ or ‘car’ correctly in the first example and mislabels ‘building’ and ‘wall’ in the fifth example. After adding the ACF module, most of these areas are corrected.

4 Experiments on Baseline Network with ASPP

To verify the generality of ACF module, we also combine it with ResNet-101 and ASPP. We first conduct the baseline (ResNet-101 with ASPP) experiment and the result is shown in Table 2. Our re-implemented version of ASPP achieves similar performance compared with the original paper (78.42% vs. 77.82%).

Performance with ACF Module. We append the ACF module to the end of ASPP, and the experiment result is shown in Table 2. After adding the ACF module, the performance improves by 1.66 points (78.42% to 80.08%), which verifies that our ACF module can work together with other state-of-the-art modules to further boost the performance.

Moreover, we apply online bootstrapping, multi-scale (MS) testing and left-right flipping (Flip) to further improve the performance of ResNet-101 + ASPP + ACF. The results on Cityscapes val are shown in Table 2.

Online Bootstrapping: Following previous works, we adopt online bootstrapping for hard training pixels. The hard training pixels are those whose probabilities on the correct classes are less than a certain threshold $\theta$. When training with online bootstrapping, we keep at least $K$ pixels within each batch. In our experiments, we set $\theta$ to 0.7 and $K$ to 100,000. With online bootstrapping, the performance on the Cityscapes val set is improved by 0.91%.
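One common way to realize this selection rule is sketched below: pixels are ranked by the probability assigned to their correct class, all pixels below $\theta$ are kept, and the set is topped up with the next-hardest pixels until it contains at least $K$ of them. This is an assumed implementation for illustration only. The function name `bootstrapped_loss` is hypothetical, and plain cross entropy stands in for the class-balanced loss used in the paper.

```python
import math


def bootstrapped_loss(true_class_probs, theta=0.7, min_kept=100000):
    """Mean cross entropy over the hard pixels of a batch.

    true_class_probs: predicted probability of the correct class, per pixel
    theta:            pixels with probability below theta count as hard
    min_kept:         minimum number of pixels to keep per batch (K)
    """
    if not true_class_probs:
        return 0.0
    ranked = sorted(true_class_probs)  # hardest (lowest probability) first
    num_hard = sum(1 for p in ranked if p < theta)
    # Keep every hard pixel, and at least min_kept pixels when available.
    num_kept = min(len(ranked), max(num_hard, min_kept))
    kept = ranked[:num_kept]
    # Assumption: plain cross entropy, clamped to avoid log(0).
    return sum(-math.log(max(p, 1e-12)) for p in kept) / num_kept


probs = [0.9, 0.2, 0.95, 0.6, 0.8]
loss_hard = bootstrapped_loss(probs, theta=0.7, min_kept=1)
loss_topk = bootstrapped_loss(probs, theta=0.1, min_kept=3)
```

In the first call only the two pixels below the threshold contribute; in the second call no pixel is below the threshold, so the three hardest pixels are kept to satisfy the minimum.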

MS/Flip: As in many previous works, we also adopt left-right flipping and multi-scale $[0.75,1.0,1.25,1.5,1.75,2.0]$ strategies while testing. From Table 2, we can see that MS/Flip improves the performance by 1.38% on the val set.

5 Comparing with the State-of-the-Art

We further compare ACFNet with existing methods on the Cityscapes test set by submitting our result to the official evaluation server. Specifically, we train ResNet-101 with ASPP and ACF using the online bootstrapping strategy and use the multi-scale and flipping strategies while testing. The results and comparison are shown in Table 3. ACFNet trained with only train-fine data outperforms the previous work PSANet by about 2.2% and is even better than most methods that also employ the validation set for training. When using both train-fine and val-fine data for training, ACFNet outperforms previous methods by a large margin and achieves a new state-of-the-art of 81.85% mIoU.

Conclusion

In this paper, we propose the concept of the class center to represent class-level context and improve segmentation performance. We further propose a coarse-to-fine segmentation structure based on our attentional class feature module, called ACFNet, to calculate and selectively combine the class-level context according to the feature of each pixel. The ablation studies and the visualization of intermediate results show the effectiveness of class-level context. ACFNet achieves a new state-of-the-art on the Cityscapes dataset with an mIoU of 81.85%.

Acknowledgment

Feifei Ma is supported by the Youth Innovation Promotion Association, Chinese Academy of Sciences. Besides, our special thanks go to Yuchen Sun, Xueyu Song, Ru Zhang, Yuhui Yuan and the anonymous reviewers for the discussion and their helpful advice.