CCNet: Criss-Cross Attention for Semantic Segmentation

Zilong Huang, Xinggang Wang, Yunchao Wei, Lichao Huang, Humphrey Shi, Wenyu Liu, Thomas S. Huang

Introduction

Semantic segmentation, a fundamental problem in computer vision, aims to assign a semantic class label to each pixel in a given image. It has been extensively studied in many recent works and is critical for applications such as autonomous driving, augmented reality, image editing, civil engineering, remote sensing imagery and agricultural pattern analysis. Specifically, current state-of-the-art semantic segmentation approaches based on the fully convolutional network (FCN) have made remarkable progress. However, due to its fixed geometric structures, the conventional FCN is inherently limited to local receptive fields that only provide short-range contextual information. This lack of contextual information has a substantial adverse effect on segmentation accuracy.

To make up for this deficiency of FCN, several works introduce useful contextual information to benefit the semantic segmentation task. Specifically, Chen et al. proposed the atrous spatial pyramid pooling module, which uses multi-scale dilated convolutions for contextual information aggregation. Zhao et al. further introduced PSPNet with a pyramid pooling module to capture contextual information. However, dilated convolution based methods collect information from only a few surrounding pixels and cannot actually generate dense contextual information. Meanwhile, pooling based methods aggregate contextual information in a non-adaptive manner: the same homogeneous context extraction procedure is applied to all image pixels, which does not satisfy the requirement that different pixels need different contextual dependencies.

To incorporate dense and pixel-wise contextual information, some fully-connected graph neural network (GNN) methods were proposed to augment traditional convolutional features with an estimated full-image context representation. PSANet learns to aggregate contextual information for each position via a predicted attention map. Non-local Networks utilize a self-attention mechanism, which enables a single feature from any position to perceive features of all the other positions, thus harvesting full-image contextual information, see Fig. 1 (a). These non-local operations can be viewed as a densely-connected GNN module based on the attention mechanism. This feature augmentation method allows a flexible way to represent non-local relations between features and has led to significant improvements in several vision recognition tasks. However, these GNN-based non-local neural networks need to generate huge attention maps to measure the relationship of each pixel pair, leading to a very high complexity of $\mathcal{O}(N^{2})$ in both time and space, where $N$ is the number of input features. Since dense prediction tasks such as semantic segmentation inherently require high-resolution feature maps, non-local based methods often incur high computational complexity and occupy a large amount of GPU memory. Thus, is there an alternative solution that achieves the same goal more efficiently?

To address the above-mentioned issue, our motivation is to replace the common single densely-connected graph with several consecutive sparsely-connected graphs, which usually require much lower computational resources. Without loss of generality, we use two consecutive criss-cross attention modules, each of which has only sparse connections (about $\sqrt{N}$) for each position in the feature map. For each pixel/position, the criss-cross attention module aggregates contextual information in its horizontal and vertical directions. By serially stacking two criss-cross attention modules, each position can collect contextual information from all pixels in the given image. This decomposition strategy greatly reduces the time and space complexity from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N\sqrt{N})$.
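As a rough illustration of the complexity argument above, the following sketch counts attention weights for a dense (non-local) attention map and for stacked criss-cross attention modules. The function name `attention_cost` and the $97\times 97$ example size (the RCCA input size reported later for Cityscapes) are choices made for this illustration; constant factors such as the channel width are ignored.

```python
def attention_cost(height: int, width: int, loops: int = 2) -> dict:
    """Count attention weights for dense vs. criss-cross attention."""
    n = height * width                 # number of positions N
    per_pos_dense = n                  # each position attends to all N positions
    per_pos_cc = height + width - 1    # same row + same column, centre counted once
    return {
        "N": n,
        "per_position_dense": per_pos_dense,
        "per_position_criss_cross": per_pos_cc,
        "non_local_total": n * per_pos_dense,         # grows as O(N^2)
        "criss_cross_total": loops * n * per_pos_cc,  # grows as O(N * sqrt(N)) when H ~ W
    }

cost = attention_cost(97, 97, loops=2)
ratio = cost["non_local_total"] / cost["criss_cross_total"]
print(cost, round(ratio, 1))
```

For a square map, $H+W-1\approx 2\sqrt{N}$, so the per-position count grows with $\sqrt{N}$ instead of $N$.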

We compare the non-local module and our criss-cross attention module in Fig. 1. Concretely, both the non-local module and the criss-cross attention module take the input feature map, generate an attention map for each position, and transform the input feature map into an adapted feature map. Then, a weighted sum is used to collect contextual information from other positions in the adapted feature map based on the attention maps. Unlike the dense connections adopted by the non-local module, in our criss-cross attention module each position (e.g., blue) in the feature map is sparsely connected only with the positions in the same row and the same column, so the predicted attention map has only about $2\sqrt{N}$ weights rather than $N$ as in the non-local module.

To capture full-image dependencies, we apply a simple recurrent operation to the criss-cross attention module. In particular, the local features are first passed through one criss-cross attention module to collect contextual information in the horizontal and vertical directions. Then, by feeding the feature map produced by the first criss-cross attention module into the second one, the additional contextual information obtained from the criss-cross path finally enables full-image dependencies for all positions. As demonstrated in Fig. 1 (b), each position (e.g., red) in the second feature map can collect information from all others to augment the position-wise representations. We share the parameters of the criss-cross modules to keep our model slim. Since the input and output are both convolutional feature maps, the criss-cross attention module can be easily plugged into any fully convolutional neural network, which we name CCNet, for learning full-image contextual information in an end-to-end manner. Thanks to the good usability of the criss-cross attention module, CCNet is straightforward to extend to 3D networks for capturing long-range temporal context information.

In addition, to drive the proposed recurrent criss-cross attention method to learn more discriminative features, we introduce a category consistent loss to augment CCNet. In particular, the category consistent loss drives the network to map each pixel in the image to an n-dimensional vector in the feature space, such that feature vectors of pixels that belong to the same category lie close together while feature vectors of pixels that belong to different categories lie far apart.

We have carried out extensive experiments on multiple large-scale datasets. Our proposed CCNet achieves top performance on four of the most competitive semantic segmentation datasets, i.e., Cityscapes, ADE20K, LIP and CamVid. In addition, the proposed criss-cross attention even improves the state-of-the-art instance segmentation method, i.e., Mask R-CNN with ResNet-101. These results demonstrate that our criss-cross attention module is generally beneficial to dense prediction tasks. In summary, our main contributions are three-fold:

We propose a novel criss-cross attention module in this work, which can be leveraged to capture contextual information from full-image dependencies in a more efficient and effective way.

We propose a category consistent loss that drives the criss-cross attention module to produce more discriminative features.

We propose CCNet by taking advantage of the recurrent criss-cross attention module, achieving leading performance on segmentation-based benchmarks, including Cityscapes, ADE20K, LIP, CamVid and COCO.

Compared with our original conference version, the following improvements are made: 1) we further enhance the segmentation ability of CCNet by adding a simple yet effective category consistent loss; 2) we propose a more generic CCNet by extending the criss-cross attention module from 2D to 3D; 3) we include more extensive experiments on the LIP, CamVid and COCO datasets to verify the effectiveness and generalization ability of our CCNet.

The rest of this paper is organized as follows. We first review related work in Section 2 and describe the architecture of our network in Section 3. In Section 4, ablation studies are given and experimental results are analyzed. Section 5 presents our conclusion and future work.

Related work

Recent years have seen a renewal of interest in semantic segmentation. FCN is the first approach to adopt a fully convolutional network for semantic segmentation. Since then, FCN-based methods have made remarkable progress in image semantic segmentation. Chen et al. and Yu et al. removed the last two downsampling layers to obtain dense predictions and utilized dilated convolutions to enlarge the receptive field. Unet, DeepLabv3+, MSCI, SPGNet, RefineNet and DFN adopted encoder-decoder structures that fuse the information in low-level and high-level layers to make dense predictions. The scale-adaptive convolutions (SAC) and deformable convolutional networks (DCN) methods improved the standard convolutional operator to handle deformation and varying scales of objects. CRF-RNN and DPN used graph models, i.e., CRF and MRF, for semantic segmentation. AAF used adversarial learning to capture and match the semantic relations between neighboring pixels in the label space. BiSeNet was designed for real-time semantic segmentation. DenseDecoder built feature-level long-range skip connections on a cascaded architecture. VideoGCRF used a densely-connected spatio-temporal graph for video semantic segmentation. RTA proposed region-based temporal aggregation for leveraging the temporal information in videos. In addition, some works focus on the human parsing task. JPPNet embedded pose estimation into the human parsing task. CE2P proposed a simple yet effective framework for computing context embedding while preserving edges. SANet used parallel branches with scale attention to handle large scale variance in human parsing. Semantic segmentation is also actively studied in the context of domain adaptation, distillation and the weakly supervised setting, etc.

2 Contextual information aggregation

It is common practice to aggregate contextual information to augment the feature representation in semantic segmentation networks. Deeplabv2 proposed atrous spatial pyramid pooling (ASPP), which uses convolutions with different dilation rates to capture contextual information. DenseASPP brought dense connections into ASPP to generate features at various scales. DPC utilized architecture search techniques to build multi-scale architectures for semantic segmentation. Chen et al. made use of several attention masks to fuse feature maps or prediction maps from different branches. PSPNet utilized pyramid spatial pooling to aggregate contextual information. Recently, Zhao et al. proposed the point-wise spatial attention network, which uses a predicted attention map to guide contextual information collection. Auto-Deeplab utilized neural architecture search to search for effective context modeling. He et al. proposed an adaptive pyramid context module for semantic segmentation. Liu et al. utilized recurrent neural networks (RNNs) to capture long-range dependencies.

Some works use graph models to model contextual information. The conditional random field (CRF) and the Markov random field (MRF) were also utilized to capture long-range dependencies for semantic segmentation. Vaswani et al. applied a self-attention model to machine translation. Wang et al. proposed the non-local module, which generates a huge attention map by calculating the correlation matrix between all pairs of spatial points on the feature maps; the attention map then guides dense contextual information aggregation. OCNet and DANet utilized the non-local module to harvest contextual information. PSA learned an attention map to aggregate contextual information for each individual point adaptively and specifically. Chen et al. proposed graph-based global reasoning networks, which implement relation reasoning via graph convolution on a small graph.

CCNet vs. Non-Local vs. GCN. Here, we specifically discuss the differences among GCN, Non-local Network and CCNet. In terms of contextual information aggregation, in GCN only the center point can perceive the contextual information from all pixels through the global convolution filters. In contrast, Non-local Network and CCNet guarantee that a pixel at any position perceives contextual information from all pixels. Although GCN decomposes the square-shaped convolutional operation into horizontal and vertical linear convolutional operations, which is related to CCNet, CCNet harvests contextual information in a criss-cross way, which is more effective than the separate horizontal-vertical way. Moreover, CCNet is proposed to mimic Non-local Network for obtaining dense contextual information through a more effective and efficient recurrent criss-cross attention module, in which dissimilar features get low attention weights and features with high attention weights are similar ones. GCN is a conventional convolutional neural network, while CCNet is a graph neural network in which each pixel in the convolutional feature map is considered as a node and the relation/context among nodes can be utilized to generate better node features.

3 Graph neural networks

Our work is related to deep graph neural networks (GNNs). Prior to graph neural networks, graphical models, such as the conditional random field (CRF) and the Markov random field (MRF), were widely used to model long-range dependencies for image understanding. GNNs were studied in early works. Inspired by the success of CNNs, a large number of methods adapt graph structure into CNNs. These methods can be divided into two main streams, the spectral-based approaches and the spatial-based approaches. The proposed CCNet belongs to the latter.

Approach

In this section, we give the details of the proposed Criss-Cross Network (CCNet) for semantic segmentation. We first present the general framework of CCNet. Then, we introduce the 2D criss-cross attention module, which captures contextual information in horizontal and vertical directions. To capture dense and global contextual information, we propose to adopt a recurrent operation for the criss-cross attention module, yielding recurrent criss-cross attention (RCCA). To further improve RCCA, we introduce a discriminative loss function that drives RCCA to learn category consistent features. Finally, we propose the 3D criss-cross attention module for leveraging temporal and spatial contextual information simultaneously.

The network architecture is given in Fig. 2. An input image is passed through a deep convolutional neural network (DCNN), which is designed in a fully convolutional fashion, to produce a feature map $\mathbf{X}$ with spatial size $H\times W$. In order to retain more details and efficiently produce dense feature maps, we remove the last two down-sampling operations and employ dilated convolutions in the subsequent convolutional layers, which enlarges the width/height of the output feature map $\mathbf{X}$ to 1/8 of the input image.

Given $\mathbf{X}$, we first apply a convolutional layer to obtain the feature map $\mathbf{H}$ with reduced dimension. Then, $\mathbf{H}$ is fed into the criss-cross attention module to generate a new feature map $\mathbf{H^{\prime}}$, which aggregates contextual information for each pixel along its criss-cross path. The feature map $\mathbf{H^{\prime}}$ only contains contextual information in the horizontal and vertical directions, which is not powerful enough for accurate semantic segmentation. To obtain richer and denser context information, we feed the feature map $\mathbf{H^{\prime}}$ into the criss-cross attention module again and output the feature map $\mathbf{H^{\prime\prime}}$. Thus, each position in $\mathbf{H^{\prime\prime}}$ actually gathers information from all pixels. The two criss-cross attention modules share the same parameters to avoid adding too many extra parameters. We name this recurrent structure the recurrent criss-cross attention (RCCA) module.
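The recurrent structure described above can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions that are not taken from the paper: queries, keys and values are the raw features (the learned $1\times 1$ projections are omitted), affinities are dot products normalized by a softmax over the criss-cross path, and the aggregated context is added back to the input as a residual. The names `criss_cross_attention` and `rcca` are illustrative.

```python
import math

def criss_cross_attention(feat):
    """One criss-cross attention step on feat[y][x] (a list of C floats)."""
    h, w = len(feat), len(feat[0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Criss-cross path of (y, x): its whole row plus the rest of its
            # column, i.e. H + W - 1 positions in total.
            path = [(y, j) for j in range(w)] + [(i, x) for i in range(h) if i != y]
            q = feat[y][x]
            scores = [sum(a * b for a, b in zip(q, feat[i][j])) for i, j in path]
            m = max(scores)  # subtract the max for a numerically stable softmax
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            context = [0.0] * len(q)
            for (i, j), e in zip(path, exps):
                for c, v in enumerate(feat[i][j]):
                    context[c] += (e / z) * v
            # Residual connection (an assumption of this sketch).
            out[y][x] = [a + b for a, b in zip(q, context)]
    return out

def rcca(feat, loops=2):
    """Apply the same module `loops` times, so no parameters are added per loop."""
    for _ in range(loops):
        feat = criss_cross_attention(feat)
    return feat

# Toy 3 x 4 feature map with 2 channels per position.
H = [[[0.1 * (x + 1), 0.2 * (y + 1)] for x in range(4)] for y in range(3)]
H2 = rcca(H, loops=2)  # same spatial shape and channel count as the input
```

Because input and output have the same shape, the step can be applied repeatedly without any change to the surrounding network.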

Then, we concatenate the dense contextual feature $\mathbf{H^{\prime\prime}}$ with the local representation feature $\mathbf{X}$ . It is followed by one or several convolutional layers with batch normalization and activation for feature fusion. Finally, the fused features are fed into the segmentation layer to predict the final segmentation result.

2 Criss-Cross Attention

3 Recurrent Criss-Cross Attention (RCCA)

Although the criss-cross attention module can capture contextual information in horizontal and vertical directions, the connections between a pixel and the surrounding pixels that are not in its criss-cross path are still absent. To tackle this problem, we introduce a simple RCCA operation based on the criss-cross attention. The RCCA module can be unrolled into $R$ loops. In the first loop, the criss-cross attention takes the feature map $\mathbf{H}$ extracted from a CNN model as the input and outputs the feature map $\mathbf{H^{\prime}}$, where $\mathbf{H}$ and $\mathbf{H^{\prime}}$ have the same shape. In the second loop, the criss-cross attention takes the feature map $\mathbf{H^{\prime}}$ as the input and outputs the feature map $\mathbf{H^{\prime\prime}}$. As shown in Fig. 2, the RCCA module is equipped with two loops ($R=2$), which is able to harvest full-image contextual information from all pixels to generate new features with dense and rich contextual information.

We denote $\mathbf{A}$ and $\mathbf{A^{\prime}}$ as the attention maps in loop 1 and loop 2, respectively. Since we are interested only in how contextual information spreads in the spatial dimension rather than in the channel dimension, the convolutional layer with $1\times 1$ filters can be viewed as the identity connection. In the case of $R=2$, the connections between any two spatial positions in the feature map built up by the RCCA module can be clearly and quantitatively described by introducing function $f$ defined as follows.

With the help of function $f$ , we can easily describe the information propagation between any position $\mathbf{u}$ in $\mathbf{H}^{\prime\prime}$ and any position $\boldsymbol{\theta}$ in $\mathbf{H}$ . It is obvious that information could flow from $\boldsymbol{\theta}$ to $\mathbf{u}$ when $\boldsymbol{\theta}$ is in the criss-cross path of $\mathbf{u}$ .

Then, we focus on the other situation, in which $\boldsymbol{\theta}(\theta_{x},\theta_{y})$ is NOT in the criss-cross path of $\mathbf{u}(u_{x},u_{y})$. To make it easier to understand, we visualize the information propagation in Fig. 4. The position $(\theta_{x},\theta_{y})$, shown in blue, first passes its information to $(u_{x},\theta_{y})$ and $(\theta_{x},u_{y})$ (light green) in loop 1. The propagation can be quantified by function $f$. Note that these two points, $(u_{x},\theta_{y})$ and $(\theta_{x},u_{y})$, are in the criss-cross path of $\mathbf{u}(u_{x},u_{y})$. Then, the positions $(u_{x},\theta_{y})$ and $(\theta_{x},u_{y})$ pass the information to $(u_{x},u_{y})$ (dark green) in loop 2. Thus, the information in $\boldsymbol{\theta}(\theta_{x},\theta_{y})$ can eventually flow into $\mathbf{u}(u_{x},u_{y})$ even if $\boldsymbol{\theta}(\theta_{x},\theta_{y})$ is NOT in the criss-cross path of $\mathbf{u}(u_{x},u_{y})$.
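The two-hop argument above can be checked on a small grid with a purely combinatorial sketch. It tracks only which positions are connected, not the attention weights, and the helper names `criss_cross_path`, `relay_positions` and `reachable` are illustrative.

```python
def criss_cross_path(pos, height, width):
    """Positions sharing a row or a column with pos = (x, y), including pos."""
    x, y = pos
    return {(i, y) for i in range(width)} | {(x, j) for j in range(height)}

def relay_positions(theta, u):
    """The two intermediate positions that carry information from theta to u."""
    (tx, ty), (ux, uy) = theta, u
    return {(ux, ty), (tx, uy)}

def reachable(u, height, width, loops):
    """All source positions whose information can reach u after `loops` loops."""
    sources = {u}
    for _ in range(loops):
        sources = set().union(*(criss_cross_path(p, height, width) for p in sources))
    return sources

height, width = 5, 7
theta, u = (1, 4), (5, 2)            # (x, y); theta is off the criss-cross path of u
relays = relay_positions(theta, u)   # {(5, 4), (1, 2)}
one_loop = reachable(u, height, width, loops=1)
two_loops = reachable(u, height, width, loops=2)
```

With one loop, `u` sees only its own row and column; with two loops, every position of the grid becomes a source for `u`.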

In general, our RCCA module makes up for the deficiency of criss-cross attention, which cannot obtain dense contextual information from all pixels. Compared with criss-cross attention, the RCCA module ($R=2$) does not bring extra parameters and achieves better performance at the cost of a minor increase in computation.

4 Learning Category Consistent Features

Motivated by prior work on discriminative losses, we first adapt a discriminative loss to semantic segmentation rather than instance segmentation, then replace its first term with a more robust one: instead of using a quadratic function as the distance function to penalize mismatch throughout, we design a piece-wise distance function to make the optimization more robust.

Let $C$ be the set of classes that are present in the mini-batch images. $N_{c}$ is the number of valid elements belonging to category $c\in C$. $h_{i}\in\mathbf{H}$ is the feature vector at spatial position $i$. $\mu_{c}$ is the mean feature of category $c\in C$ (the cluster center). $\varphi$ is a piece-wise distance function. $\delta_{v}$ and $\delta_{d}$ are the margins. In particular, Eq. 6 is a piece-wise distance function: $\varphi_{var}$ is zero, quadratic, or linear when the distance from the center $\mu_{c}$ is within $\delta_{v}$, in the range $(\delta_{v},\delta_{d}]$, or exceeds $\delta_{d}$, respectively.
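To make the three regimes of $\varphi_{var}$ concrete, here is a hedged sketch of one piece-wise function that matches the description above (zero, then quadratic, then linear). Eq. 6 itself is not reproduced in this text, so the exact constants are an assumption: the linear branch is offset so that the function is continuous at $\delta_{d}$. The default margin values and the averaging in `variance_term` (over pixels of each class, then over classes) are illustrative placeholders as well.

```python
import math

def phi_var(h, mu, delta_v=0.5, delta_d=1.5):
    """Piece-wise penalty on the distance between feature h and class centre mu."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(h, mu)))  # Euclidean distance
    if d <= delta_v:
        return 0.0                       # close enough to the centre: no penalty
    if d <= delta_d:
        return (d - delta_v) ** 2        # quadratic inside (delta_v, delta_d]
    # Linear beyond delta_d, offset so the two branches meet at d = delta_d.
    return (d - delta_d) + (delta_d - delta_v) ** 2

def variance_term(features, labels, delta_v=0.5, delta_d=1.5):
    """Average phi_var over the pixels of each class, then over the classes."""
    classes = sorted(set(labels))
    total = 0.0
    for c in classes:
        members = [f for f, l in zip(features, labels) if l == c]
        mu = [sum(col) / len(members) for col in zip(*members)]  # class centre
        total += sum(phi_var(f, mu, delta_v, delta_d) for f in members) / len(members)
    return total / len(classes)

near = phi_var([0.0, 0.0], [0.3, 0.0])   # inside the inner margin
mid = phi_var([0.0, 0.0], [1.0, 0.0])    # quadratic regime
far = phi_var([0.0, 0.0], [2.5, 0.0])    # linear regime
```

In the linear regime the gradient magnitude stays constant instead of growing with the distance, which is the relaxation the text credits for more stable optimization.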

5 3D Criss-Cross Attention

Experiments

To evaluate the effectiveness of CCNet, we carry out comprehensive experiments on the Cityscapes dataset, the ADE20K dataset, the COCO dataset, the LIP dataset and the CamVid dataset. Experimental results demonstrate that CCNet achieves state-of-the-art performance on Cityscapes, ADE20K and LIP. Meanwhile, CCNet brings a consistent performance gain on COCO for instance segmentation. In the following subsections, we first introduce the datasets and implementation details, then we perform a series of ablation experiments on the Cityscapes dataset. Finally, we report our results on the ADE20K, LIP, COCO and CamVid datasets.

We adopt mean IoU (mIoU, the mean of class-wise intersection over union) for Cityscapes, ADE20K, LIP and CamVid, and the standard COCO metric, Average Precision (AP), for COCO.
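For reference, the mean of class-wise intersection over union can be computed as in the following sketch. It operates on flat label sequences; the `ignore_index` value of 255 and the choice to skip classes that appear in neither prediction nor ground truth are assumptions of this illustration, not details stated in the text.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean of class-wise IoU over two flat label sequences of equal length."""
    inter = [0] * num_classes   # true positives per class
    union = [0] * num_classes   # true positives + false positives + false negatives
    for p, t in zip(pred, target):
        if t == ignore_index:   # unlabeled pixels are not evaluated
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1       # false positive for the predicted class
            union[t] += 1       # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # class IoUs: 1/3, 2/3, 1/2
```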

Cityscapes targets urban scene segmentation. Only the 5,000 finely annotated images are used in our experiments; they are divided into 2,975/500/1,525 images for training, validation, and testing, respectively.

ADE20K is a recent scene parsing benchmark containing dense labels of 150 stuff/object categories. The dataset includes 20k/2k/3k images for training, validation and testing, respectively.

LIP is a large-scale single human parsing dataset. There are 50,462 images with fine-grained annotations at pixel-level with 19 semantic human part labels and one background label. Those images are further divided into 30k/10k/10k for training, validation and testing, respectively.

COCO is a very challenging dataset for instance segmentation that contains 115k images over 80 categories for training, 5k images for validation and 20k images for testing.

CamVid is one of the datasets focusing on semantic segmentation for autonomous driving scenarios. It is composed of 701 densely annotated images with size $720\times 960$ from five video sequences.

2 Implementation Details

Network Structure. For semantic segmentation, we choose the ImageNet pre-trained ResNet-101 as our backbone network, remove its last two down-sampling operations, and employ dilated convolutions in the subsequent convolutional layers following previous work, resulting in an output stride of 8. For human parsing, we choose CE2P as our baseline and replace the Context Embedding module with RCCA. For instance segmentation, we choose Mask-RCNN as our baseline. For video semantic segmentation, we also choose the Cityscapes pre-trained ResNet-101 as our backbone network with 3D RCCA.

Training settings. SGD with mini-batches is used for training. For semantic segmentation, the initial learning rate is 1e-2 for Cityscapes and ADE20K. Following prior works, we employ a poly learning rate policy where the initial learning rate is multiplied by $(1-\frac{iter}{max\_iter})^{power}$ with $power$ = 0.9. We use a momentum of 0.9 and a weight decay of 0.0001. For Cityscapes, the training images are augmented by random scaling (from 0.75 to 2.0), followed by randomly cropping high-resolution patches ($769\times 769$) from the resulting images. Since the images in ADE20K have various sizes, we adopt an augmentation strategy of resizing the short side of the input image to a length randomly chosen from the set {300, 375, 450, 525, 600}. For human parsing, the models are trained and tested with an input size of $473\times 473$. For instance segmentation, we use the same training settings as Mask-RCNN. For video semantic segmentation, we sample 5 temporally ordered frames from a training video as training data, and the input size is $504\times 504$.
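The poly learning rate policy can be sketched as follows, assuming its common form in which the initial rate is scaled by $(1-iter/max\_iter)^{power}$. The function name `poly_lr` and the `max_iter` value of 1000 are placeholders for illustration; the text does not state the number of training iterations.

```python
def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Learning rate at `iteration` under the poly decay policy (assumed form)."""
    return base_lr * (1.0 - iteration / max_iter) ** power

# Initial rate 1e-2 and power 0.9 as in the text; max_iter is a placeholder.
schedule = [poly_lr(1e-2, it, max_iter=1000) for it in (0, 500, 1000)]
```

The rate starts at the initial value and decays smoothly to zero at the final iteration.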

3 Experiments on Cityscapes

Results of other state-of-the-art semantic segmentation solutions on Cityscapes are summarized in Tab. I. For the val set, we provide these results for reference and emphasize that they should not be directly compared with our method, since these methods are trained on different (even larger) training sets or with different backbone networks. Among these approaches, Deeplabv3 adopts a multi-scale testing strategy. Deeplabv3+ and DPC both use a stronger backbone (i.e., Xception-65 & 71 vs. ResNet-101). In addition, DPC makes use of an additional dataset, i.e., COCO, for pre-training beyond the training set of Cityscapes. The results show that the proposed CCNet with single-scale testing still achieves comparable performance without bells and whistles.

Additionally, we train the best CCNet configuration with ResNet-101 as the backbone using both the training and validation sets, and evaluate on the test set by submitting our test results to the official evaluation server. Most methods adopt the same backbone as ours, and the others utilize stronger backbones. From Tab. I, it can be observed that our CCNet substantially outperforms all the previous state-of-the-art methods on the test set. Among these approaches, PSANet, which generates a sub attention map for each pixel, is the most related to our method. One of the differences is that the sub attention map has $2\times H\times W$ weights in PSANet and $H+W-1$ weights in CCNet. Even with lower computation cost and memory usage, our method still achieves better performance.

3.2 Ablation studies

To verify the rationality of the CCNet, we conduct extensive ablation experiments on the validation set of Cityscapes with different settings for CCNet.

The effect of the RCCA module. Tab. II shows the performance on the Cityscapes validation set with different numbers of loops in RCCA. All experiments are conducted using ResNet-101 as the backbone. The input size of training images is $769\times 769$ and the size of the input feature map $\mathbf{H}$ of RCCA is $97\times 97$. Our baseline network is the ResNet-based FCN with dilated convolutions incorporated at stages 4 and 5, i.e., dilation rates are set to 2 and 4 for these two stages, respectively. The increments in FLOPs and memory usage are estimated for $R=1,2,3$, respectively.
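The $97\times 97$ feature map size follows from the $769\times 769$ crop and the output stride of 8. The sketch below reproduces that arithmetic under an assumed convention, namely that each stride-2 stage maps a length $n$ to $\lfloor (n-1)/2\rfloor+1$ (as a padded strided layer typically does); the helper names are illustrative.

```python
def strided_size(n, stride=2):
    """Output length of one padded strided layer (assumed convention)."""
    return (n - 1) // stride + 1

def feature_size(input_size, output_stride=8):
    """Closed form for the length after all down-sampling stages."""
    return (input_size - 1) // output_stride + 1

size = 769
for _ in range(3):            # three stride-2 stages give an output stride of 8
    size = strided_size(size)

closed_form = feature_size(769, output_stride=8)
```

Both routes give the same side length, and crop sizes of the form $8k+1$ such as 769 divide evenly under this convention.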

We observe that adding a criss-cross attention module to the baseline, denoted as $R=1$, improves the performance by 2.9%, which demonstrates the significance of criss-cross attention. Furthermore, increasing the number of loops from 1 to 2 further improves the performance by 1.8%, demonstrating the effectiveness of dense contextual information. Finally, increasing the number of loops from 2 to 3 slightly improves the performance by 0.4%. Meanwhile, as the number of loops increases, the FLOPs and GPU memory usage keep increasing. These results show that the proposed criss-cross attention can significantly improve the performance by capturing contextual information in the horizontal and vertical directions. In addition, the proposed RCCA is effective in capturing dense and global contextual information, which finally benefits the performance of semantic segmentation. To balance performance and resource usage, we choose $R=2$ as the default setting in all the following experiments.

To further validate the effectiveness of the criss-cross module, we provide qualitative comparisons in Fig. 6. We use white circles to indicate challenging regions that are easily misclassified. It can be seen that these challenging regions are progressively corrected as the number of loops increases, which demonstrates the effectiveness of dense contextual information aggregation for semantic segmentation.

The effect of the category consistent loss. Tab. IV also shows the performance on the Cityscapes validation set when adopting the proposed category consistent loss, denoted as “CCL” in the table. As we can see, adopting the category consistent loss stably brings a gain of about 0.7% mIoU with both ResNet-101 and ResNet-50, which demonstrates the effectiveness of the proposed category consistent loss for semantic segmentation. To show that the proposed piece-wise function is more robust than the original one, we run the training process 10 times using ResNet-50 for each kind of loss function. A training run is deemed to have failed when the loss value is NaN, so we can calculate the success rate (number of successful training runs / total number of training runs). The experimental results in Tab. III demonstrate that using the piece-wise function yields a higher training success rate than using the original one. Besides, using the piece-wise function achieves slightly better performance than a single quadratic function. This is because we relax the penalty in Eq. 6 to reduce the numerical values and gradients, especially when the distance from the center exceeds $\delta_{d}$. This relaxation makes the optimization much more stable.

Comparison with other context aggregation approaches. We compare the performance of several different context aggregation approaches on the Cityscapes validation set with ResNet-50 and ResNet-101 as backbone networks.

Specifically, the baselines of context aggregation mainly include: 1) Peng et al. utilized global convolution filters for contextual information aggregation, denoted as “+GCN”; 2) Zhao et al. proposed pyramid pooling, a simple and effective way to capture global contextual information, denoted as “+PSP”; 3) Chen et al. used convolutions with different dilation rates to harvest pixel-wise contextual information at different ranges, denoted as “+ASPP”; 4) Wang et al. introduced the non-local network for context aggregation, denoted as “+NL”.

In Tab. IV, both “+NL” and “+RCCA” achieve better performance than the other context aggregation approaches, which demonstrates the importance of capturing full-image contextual information. More interestingly, our method achieves better performance than “+NL”. This may be attributed to the sequentially recurrent operation of criss-cross attention. Concretely, “+NL” generates an attention map directly from a feature that has a limited receptive field and short-range dependencies. In contrast, our “+RCCA” takes two steps to form dense contextual information, so the latter step can learn a better attention map by benefiting from the feature map produced by the first step, in which some long-range dependencies have already been embedded.

To prove the effectiveness of attention with criss-cross shape, we compare criss-cross shape with other shapes in Tab. IV. “+HV” means stacking horizontal attention and vertical attention. “+HV&VH” means summing up features of two parallel branches, i.e. “HV” and “VH”.

We further explore the amount of computation and the memory footprint of RCCA. As shown in Tab. V, compared with the “+NL” method, the proposed “+RCCA” requires $11\times$ less GPU memory and reduces FLOPs by about 85% relative to the non-local block in computing full-image dependencies, which shows that CCNet is an efficient way to capture full-image contextual information with a small computation and memory footprint. To further demonstrate the effectiveness of the recurrent operation, we also run the non-local module in the recurrent way, denoted as “+NL(R=2)”. As we can see, the recurrent operation brings a gain of more than 1 point, because it allows the latter step to learn a better attention map by benefiting from the feature map produced by the first step, in which some long-range dependencies have already been embedded. However, compared with “+RCCA”, “+NL(R=2)” requires a huge amount of GPU memory, which limits the use of self-attention.

Visualization of Attention Map. To get a deeper understanding of our RCCA, we visualize the learned attention maps as shown in Fig. 7. For each input image, we select one point (cross in green) and show its corresponding attention maps when $R=1$ and $R=2$ in columns 2 and 3, respectively. It can be observed that only contextual information from the criss-cross path of the target point is captured when $R=1$. By adopting one more criss-cross module, i.e., $R=2$, RCCA can finally aggregate denser and richer contextual information compared with that of $R=1$. Besides, we observe that the attention module could capture semantic similarity and full-image dependencies.
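The difference between $R=1$ and $R=2$ can be checked with a small reachability sketch. It does not compute any attention weights. It only tracks which source positions can contribute to a target position after each loop, assuming every position gathers from all positions on its row and column. The grid size, the target position and the function names are hypothetical.

```python
def criss_cross_step(reach, h, w):
    """One loop: each position gathers what its row and column already hold.

    `reach[(i, j)]` is the set of source positions whose information
    has reached position (i, j) so far.
    """
    new_reach = {}
    for (i, j) in reach:
        gathered = set()
        for jj in range(w):          # same row
            gathered |= reach[(i, jj)]
        for ii in range(h):          # same column
            gathered |= reach[(ii, j)]
        new_reach[(i, j)] = gathered
    return new_reach


def coverage(h, w, target, loops):
    """Number of positions that can influence `target` after `loops` loops."""
    reach = {(i, j): {(i, j)} for i in range(h) for j in range(w)}
    for _ in range(loops):
        reach = criss_cross_step(reach, h, w)
    return len(reach[target])


if __name__ == "__main__":
    h, w, target = 5, 7, (2, 3)      # hypothetical grid and target point
    for loops in (1, 2):
        print("R=%d covers %d of %d positions"
              % (loops, coverage(h, w, target, loops), h * w))
```

On this $5\times 7$ grid the target is reached by 11 positions (its row and column) after one loop and by all 35 positions after two loops, mirroring the sparse and dense patterns described for Fig. 7.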

4 Experiments on ADE20K

In this subsection, we conduct experiments on the ADE20K dataset, which is a very challenging scene parsing dataset. As shown in Tab. VI, CCNet with CCL achieves state-of-the-art performance of 45.76%, outperforming the previous state-of-the-art methods by more than 1.1% and the conference version of CCNet by 0.5%. Some successful segmentation results are given in Fig. 8. Among the compared approaches, most adopt ResNet-101 as the backbone, while RefineNet adopts a more powerful network, i.e., ResNet-152. EncNet, which achieves the previous best performance among these methods, utilizes global pooling with image-level supervision to collect image-level context information. In contrast, our CCNet integrates contextual information in an alternative way, by capturing full-image dependencies, and achieves better performance.

5 Experiments on LIP

In this subsection, we conduct experiments on the LIP dataset, which is a very challenging human parsing dataset. The framework of CE2P is utilized, with an ImageNet pre-trained ResNet-101 as the backbone and RCCA (R=2) rather than PSP as the context embedding module. The category consistent loss is used to boost the performance. The hyper-parameter setting strictly follows that of CE2P. Among the approaches, Deeplab (VGG-16), Attention and SAN adopt VGG-16 as the backbone, while Deeplab (ResNet-101), JPPNet, CE2P and CCNet adopt ResNet-101 as the backbone. As shown in Tab. VII, CCNet achieves state-of-the-art performance of 55.47%, outperforming the previous state-of-the-art methods by more than 2.3%. This significant improvement demonstrates the effectiveness of the proposed method on the human parsing task. Fig. 9 shows some visualized segmentation results. The top two rows show successful segmentation results, indicating that our method can produce accurate segmentation even for complicated poses. The third row shows a failure case in which the “skirt” is misclassified as “pants”, although this example is difficult to recognize even for humans.

6 Experiments on COCO

To further demonstrate the generality of CCNet, we conduct the instance segmentation task on COCO using the competitive Mask R-CNN model as the baseline. Following , we modify the Mask R-CNN backbone by adding the RCCA module right before the last convolutional residual block of res4. We evaluate a standard baseline of ResNet-50/101. All models are fine-tuned from ImageNet pre-training. We use the official implementation (https://github.com/facebookresearch/maskrcnn-benchmark) with end-to-end joint training, whose performance is almost the same as the baseline reported in . For fair comparison, we do not use the category consistent loss in our method. We report the results in terms of box AP and mask AP on COCO in Tab. VIII. The results demonstrate that our method substantially outperforms the baseline in all metrics. Some segmentation results comparing the baseline with “+RCCA” are given in Fig. 10. Meanwhile, the network with “+RCCA” also achieves better performance than the network with one non-local block, “+NL”.

7 Experiments on CamVid

To further demonstrate the effectiveness of 3D-RCCA, we carry out experiments on CamVid, which is one of the first datasets focusing on video semantic segmentation for driving scenarios. We follow the standard protocol proposed in to split the dataset into 367 training, 101 validation and 233 test images. For fair comparison, we only report single-scale evaluation scores. As can be seen in Tab. IX, we achieve an mIoU of 79.1%, outperforming all other methods by a large margin.

To demonstrate the effectiveness of our proposed techniques, we perform training under the same settings with different lengths of input frames. We apply the CNN to each frame to extract features, and then concatenate and reshape them to satisfy the input shape required by the 3D Criss-Cross Attention module. We use $R=3$ to collect dense spatial and temporal contextual information. To make a training sample, we try two lengths ($T$) of input frames. For $T=1$, we randomly sample 1 frame from a training video, denoted as “CCNet3D ($T=1$)”. For $T=5$, we sample 5 temporally ordered frames from a training video, denoted as “CCNet3D ($T=5$)”. As can be seen in Tab. IX, “CCNet3D ($T=5$)” outperforms “CCNet3D ($T=1$)” by 1.2%.
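The choice of $R=3$ for spatio-temporal features can be motivated with a reachability sketch. It assumes that one 3D criss-cross step lets a position $(t, i, j)$ gather from all positions that differ from it along exactly one of the three axes. The clip shape $T\times H\times W = 5\times 4\times 6$, the target position and the function names are hypothetical, and no attention weights are computed.

```python
from itertools import product


def axis_neighbours(pos, shape):
    """Positions that differ from `pos` along at most one axis (pos included)."""
    out = {pos}
    for axis, size in enumerate(shape):
        for v in range(size):
            out.add(pos[:axis] + (v,) + pos[axis + 1:])
    return out


def coverage_3d(shape, target, loops):
    """Number of positions that can influence `target` after `loops` loops."""
    positions = list(product(*(range(s) for s in shape)))
    reach = {p: {p} for p in positions}
    for _ in range(loops):
        reach = {
            p: set().union(*(reach[q] for q in axis_neighbours(p, shape)))
            for p in positions
        }
    return len(reach[target])


if __name__ == "__main__":
    shape, target = (5, 4, 6), (2, 1, 3)   # hypothetical (T, H, W) and target
    total = shape[0] * shape[1] * shape[2]
    for loops in (1, 2, 3):
        print("R=%d covers %d of %d positions"
              % (loops, coverage_3d(shape, target, loops), total))
```

Under these assumptions the target is reached by 13 positions after one loop, 60 after two and all 120 after three, so two loops are no longer enough once a temporal axis with $T>1$ is added.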

Conclusion and future work

In this paper, we have presented a Criss-Cross Network (CCNet) for deep learning based dense prediction tasks, which adaptively captures contextual information on the criss-cross path. To obtain dense contextual information, we introduce RCCA, which aggregates contextual information from all pixels. The experiments demonstrate that RCCA captures full-image contextual information with lower computation and memory cost. Besides, to learn discriminative features, we introduce the category consistent loss. Our CCNet consistently achieves outstanding performance on several semantic segmentation datasets, namely Cityscapes, ADE20K, LIP and CamVid, as well as on the instance segmentation dataset COCO. The source code of CCNet is released to facilitate related research and applications.

Acknowledgements

This work was in part supported by NSFC (No. 61733007 and No. 61876212), ARC DECRA DE190101315, ARC DP200100938, HUST-Horizon Computer Vision Research Center, and IBM-ILLINOIS Center for Cognitive Computing Systems Research (C3SR) - a research collaboration as part of the IBM AI Horizons Network.