OCNet: Object Context Network for Scene Parsing

Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, Jingdong Wang

Introduction

Semantic segmentation is a fundamental topic in computer vision and is critical for various scene understanding problems. It is typically formulated as a task of predicting the category of each pixel, i.e., the category of the object that the pixel belongs to. We are mainly interested in improving the pixel classification accuracy through explicitly identifying the object region that the pixel belongs to.

Extensive efforts based on deep convolutional neural networks have been made to address semantic segmentation since the pioneering fully convolutional network (FCN) (Long et al., 2015). The original FCN approach suffers from two main drawbacks: the reduced feature resolution, which loses detailed spatial information, and the small effective receptive field, which fails to capture long-range dependencies. There exist two main paths to tackle these drawbacks: (i) raising the resolution of feature maps to improve the spatial precision, or maintaining a high-resolution response map through all stages, e.g., through dilated convolutions (Chen et al., 2018; Yu and Koltun, 2016), decoder networks (Badrinarayanan et al., 2017; Ronneberger et al., 2015) or high-resolution networks (Sun et al., 2019a, b); (ii) exploiting the global context to capture long-range dependencies, e.g., ParseNet (Liu et al., 2015), DeepLabv3 (Chen et al., 2018), and PSPNet (Zhao et al., 2017). In this work, we focus on the second path and propose a more efficient context scheme. Unless otherwise specified, we define the context of a pixel as a set of selected pixels and its context representation as an aggregation of the representations of all selected pixels.

Most previous representative studies mainly exploit the multi-scale context formed from spatially nearby or sampled pixels. For instance, the pyramid pooling module (PPM) in PSPNet (Zhao et al., 2017) divides all pixels into multiple regions and selects all pixels lying in the same region as a pixel as its context. The atrous spatial pyramid pooling module (ASPP) in DeepLabv3 (Chen et al., 2017) selects the surrounding pixels of a pixel with different dilation rates as its context. Therefore, the selected pixels of both the PPM context and the ASPP context tend to be a mixture of object pixels, relevant background pixels and irrelevant background pixels. Motivated by the fact that the category of each pixel is essentially the category of the object that it belongs to, we should enhance the object pixels that constitute the object.

To explicitly emphasize the contribution of the object pixels, we present an object context that aims at only gathering the pixels that belong to the same category as a given pixel as its context. Compared to the conventional multi-scale context schemes, our object context pays more attention to the necessary object information. Although estimating the accurate object context is not an easy task, we empirically find that a coarse estimation of the object context already outperforms both PPM and ASPP schemes on various benchmarks.

For a given pixel, we can use a binary vector to record the pixels that belong to the same category as it with $1$ and all other pixels with $0$. Thus, a binary relation matrix of size $N\times N$ can be used to record the pair-wise relations between any two of $N$ pixels. Since computing the binary relation matrix is intractable, we use a dense relation matrix as a surrogate for it, in which each relation value is computed based on the inner-product similarities of the high-level features. Therefore, the relation values of semantically similar pixels tend to be larger. In our implementation, we use the conventional self-attention scheme (Vaswani et al., 2017) to predict the dense relation matrix, which requires $\mathcal{O}(N^{2})$ computation complexity. To address the efficiency problem, we propose a new interlaced sparse self-attention scheme that significantly improves the efficiency while maintaining the performance by using two sparse relation matrices to approximate the dense relation matrix. To illustrate that our approach is capable of enhancing the object pixels, we show some examples of the predicted dense relation matrices in Fig. 1, where the relation values on the object pixels are larger than the relation values on the background pixels.

We further illustrate two extensions that capture richer context information: (i) pyramid object context, which estimates the object context within each sub-region generated by the spatial pyramid partitions following the PPM (Zhao et al., 2017); (ii) atrous spatial pyramid object context, which combines ASPP (Chen et al., 2017) with the object context. We summarize our main contributions as follows:

We present a new object context scheme that explicitly enhances the object information.

We propose to instantiate the object context scheme with an efficient interlaced sparse self-attention that significantly decreases the complexity compared to the conventional self-attention scheme.

We construct the OCNet based on three kinds of object context modules and achieve competitive performance on five challenging semantic segmentation benchmarks including Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff.

Related Work

Earlier studies based on the conventional FCN (Long et al., 2015) apply consecutive strided convolution and pooling operations to extract low-resolution feature maps with high-level semantic information. For example, the output feature map size of ResNet-$101$ is $\frac{1}{32}$ of the input image, and such significant loss of spatial information is one of the main challenges towards accurate semantic segmentation. To generate high-resolution feature maps without much loss of semantic information, many efforts (Ronneberger et al., 2015; Badrinarayanan et al., 2017; Yu and Koltun, 2016; Chen et al., 2017; Sun et al., 2019a) have proposed various efficient mechanisms. In this paper, we adopt the dilated convolution (Yu and Koltun, 2016; Chen et al., 2017) on ResNet-$101$ to reduce the output stride from $32$ to $8$, following the same settings as PSPNet (Zhao et al., 2017). Besides, we also conduct experiments based on the recent HRNet (Sun et al., 2019a) with output stride $4$. We empirically verify that our approach is more efficient than the conventional multi-scale context mechanisms, PPM and ASPP, on high-resolution output feature maps. More detailed comparisons are summarized in Table 4.

Context.

Context plays an important role in various computer vision tasks and it is of various forms such as global scene context, geometric context, relative location, 3D layout and so on. Context has been investigated for both object detection (Divvala et al., 2009) and part detection (Gonzalez-Garcia et al., 2018).

The importance of context for semantic segmentation is also verified in recent works (Liu et al., 2015; Zhao et al., 2017; Chen et al., 2017; Shetty et al., 2019). It is common to define the context as a set of pixels in the literature of semantic segmentation. In particular, we can divide most of the existing context mechanisms into two kinds: (i) nearby spatial context: ParseNet (Liu et al., 2015) treats all pixels over the whole image as the context, and PSPNet (Zhao et al., 2017) performs pyramid pooling over sub-regions of four pyramid scales, where all pixels within the same sub-region are treated as the context for the pixels belonging to that sub-region; (ii) sampled spatial context: DeepLabv3 (Chen et al., 2017) applies multiple atrous convolutions with different atrous rates to capture spatial pyramid context information and regards these regularly sampled pixels as the context.

These two kinds of context are defined over regular rectangular regions and might carry pixels belonging to the background categories. Different from them, our object context is defined as the set of pixels belonging to the same object category, emphasizing the object pixels that are essential for labeling the pixel. There also exist some concurrent efforts (Fu et al., 2019a; Zhang et al., 2019b; Huang et al., 2019; Li et al., 2019; Zhao et al., 2018) that exploit the semantic relations between pixels to construct the context. Our approach differs from most of them in that we propose a simple yet effective interlaced sparse mechanism to model the relational context at a smaller computation cost.

Attention.

Self-attention (Vaswani et al., 2017) and the non-local neural network (Wang et al., 2018b) have achieved great success on various tasks owing to their ability to model long-range contextual information. The self-attention scheme (Vaswani et al., 2017) calculates the context at one position as an aggregation of all positions in a sentence (at the encoder stage). Wang et al. further proposed the non-local neural network (Wang et al., 2018b), based on the self-attention scheme, for vision tasks such as video classification, object detection and instance segmentation.

Our implementation is inspired by the self-attention scheme. We first apply the self-attention scheme to predict the dense relation matrix and verify its capability to approximate the object context. There also exist some concurrent studies (Fu et al., 2019a; Zhang et al., 2019b) that apply the self-attention scheme for semantic segmentation. Some recent efforts (Huang et al., 2019; Yue et al., 2018; Zhu et al., 2019) propose different mechanisms to decrease the computation complexity and memory consumption of the self-attention scheme. For example, CGNL (Compact Generalized Non-local) (Yue et al., 2018) applies the Taylor series of the RBF kernel function to approximate the pair-wise similarities, and RCCA (Recurrent Criss-Cross Attention) (Huang et al., 2019) applies two consecutive criss-cross attention operations to approximate the original self-attention scheme.

Our interlaced sparse self-attention scheme differs from both CGNL and RCCA in that it factorizes the dense relation matrix into two sparse relation matrices. We find that similar mechanisms have been applied in previous studies on network architecture design, including ShuffleNet (Ma et al., 2018) and Interleaved Group Convolution (Xie et al., 2018; Zhang et al., 2017b). The concurrent sparse transformer (Child et al., 2019) also applies a similar mechanism to one-dimensional text/audio related tasks that require sequential masked inputs.

Approach

We introduce our approach with four subsections. First, we introduce the general mathematical formulation of the context representation and the definition of object context (Sec. 3.1). Second, we instantiate the object context with the conventional self-attention (SA) (Vaswani et al., 2017) and our interlaced sparse self-attention (ISA) (Sec. 3.2). Third, we present the pyramid extensions of object context (Sec. 3.3). Last, we illustrate the overall pipeline and the implementation details of OCNet (Sec. 3.4).

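We define the general mathematical formulation of the context representation as:

$$\mathbf{z}_{i}=\rho\Big(\frac{1}{|\mathcal{I}_{i}|}\sum_{j\in\mathcal{I}_{i}}\delta(\mathbf{x}_{j})\Big)$$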

We use $\mathbf{X}$ and $\mathbf{Z}$ to represent the input representation and the context representation respectively. $\mathbf{x}_{j}$ is the $j$ -th element of $\mathbf{X}$ and $\mathbf{z}_{i}$ is the $i$ -th element of $\mathbf{Z}$ . $\delta(\cdot)$ and $\rho(\cdot)$ are two different transform functions. $\mathcal{I}=\{1,\cdots,N\}$ represents a set of $N$ pixels. We use $\mathcal{I}_{i}$ to represent a subset of $\mathcal{I}$ , in other words, $\mathcal{I}_{i}$ is the set of context pixels for pixel $i$ . We show how $\mathcal{I}_{i}$ selects pixels in the following discussions. Intuitively, the above formula of context representation is to describe a pixel with the weighted average representations of a set of relevant pixels.

The above mathematical formulations are based on the one-dimensional case for convenience and can be generalized to the higher dimensional cases easily. One of the main differences between the existing representative context methods is the formulation of the context $\mathcal{I}_{i}$ .

Multi-scale context.

Zhao et al. (2017) propose the pyramid pooling (PPM) context that constructs $\mathcal{I}_{i}$ as a set of spatially close pixels around pixel $i$ within regular regions of different scales. Chen et al. (2017) introduce the atrous spatial pyramid pooling (ASPP) context that estimates $\mathcal{I}_{i}$ as a set of sparsely sampled pixels with different dilation rates around pixel $i$.

We take the representative multi-scale context scheme ASPP as an example to illustrate the formulation of $\mathcal{I}_{i}$ :

where $r\in\{12,24,36\}$ is the dilation rate, $|i-j|$ represents the spatial distance between pixels $i$ and $j$, and $\mathcal{I}_{i}$ is defined over a one-dimensional input. Therefore, the ASPP context of pixel $i$ is a set of sampled pixels that lie at predefined spatial distances from $i$. Besides, we illustrate the formulation of $\mathcal{I}_{i}$ based on the PPM scheme in Appendix A.

Both kinds of context tend to be a mixture of object pixels and background pixels. Therefore, they have not explicitly enhanced the contribution from the object pixels. Motivated by the fact that the category of each pixel is essentially inherited from the category of the object that it lies in, we propose a new scheme named object context to explicitly enhance the information of object pixels.

Object context.

We define the object context for pixel $i$ as:

$$\mathcal{I}_{i}=\{j\in\mathcal{I}\mid l_{j}=l_{i}\}$$

where $l_{i}$ and $l_{j}$ are the labels of pixels $i$ and $j$ respectively. We can see that the object context for pixel $i$ is essentially the set of pixels that belong to the same object category as $i$.

We can represent the pairwise relations between any two of $N$ pixels (encoded in the ground-truth object context) with a binary relation matrix of size $N\times N$, where the $i$-th row records all pixels belonging to the same category as pixel $i$ with $1$ and all other pixels with $0$. Notably, the binary relation matrix only encodes partial information of the ground-truth object context, i.e., the (same category) label co-occurring relations. In other words, all classes could be permuted and the binary relation matrix would be unchanged.
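The binary relation matrix and its invariance to class permutation can be illustrated with a small sketch. The helper name `binary_relation_matrix` and the toy labels are hypothetical and chosen only for illustration; positions are 0-indexed here, whereas the text indexes pixels from $1$.

```python
def binary_relation_matrix(labels):
    """Return the N x N matrix whose (i, j) entry is 1 when pixels i and j share a label."""
    return [[1 if li == lj else 0 for lj in labels] for li in labels]


# Toy one-dimensional "image" with N = 5 pixels and two categories (hypothetical labels).
labels = ["road", "road", "car", "car", "road"]
relation = binary_relation_matrix(labels)

# Object context of pixel i: the indices j whose entry in row i equals 1.
object_context = [[j for j, r in enumerate(row) if r] for row in relation]

# Renaming (permuting) the classes leaves the binary relation matrix unchanged.
swapped = ["car" if label == "road" else "road" for label in labels]
assert binary_relation_matrix(swapped) == relation
```

Building this matrix needs the ground-truth labels, which is why a predicted dense relation matrix is used as a surrogate at inference time.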

Considering it is intractable to estimate the binary relation matrix, we propose to use a dense relation matrix to serve as a surrogate of the binary relation matrix. We expect that the relation values between the pixels belonging to the same object category are larger than the ones belonging to different categories, thus, the contributions of the object pixels are enhanced.

In the following discussions, we first illustrate the formulation of the dense relation scheme that directly estimates the dense relation matrix $\mathbf{W}$ of size $N\times N$ . Second, to improve efficiency, we propose a sparse relation scheme that factorizes the dense relation matrix as the combination of two sparse relation matrices including $\mathbf{W}^{l}$ and $\mathbf{W}^{g}$ , where both sparse relation matrices are of size $N\times N$ . More details are illustrated as follows.

Dense relation.

The dense relation scheme estimates the relations between each pixel $i$ and all pixels in $\mathcal{I}$ . We illustrate the context representation based on dense relation:

$$\mathbf{z}_{i}=\rho\Big(\sum_{j\in\mathcal{I}}w_{ij}\,\delta(\mathbf{x}_{j})\Big)$$

where $w_{ij}$ is the relation value between pixels $i$ and $j$, i.e., the element of $\mathbf{W}$ at coordinates ($i$, $j$). As we need to estimate the relations between $i$ and all pixels in $\mathcal{I}$ directly, the computational complexity of estimating $\mathbf{W}$ is quadratic in the input size: $\mathcal{O}(N^{2})$.
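A minimal NumPy sketch of a dense relation matrix is given below. It assumes that relation values are softmax-normalized scaled inner products, as in the self-attention of Vaswani et al. (2017), and it takes $\delta(\cdot)$ and $\rho(\cdot)$ as identity maps; both choices are simplifications made for illustration, and the function names are hypothetical.

```python
import numpy as np


def dense_relation(x):
    """Row-normalized relation matrix W (N x N) from scaled inner products of x (N x C)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # subtract row maximum for stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)


def dense_context(x):
    """Context z_i = sum_j w_ij * x_j, with delta and rho assumed to be identity maps."""
    return dense_relation(x) @ x


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))  # N = 6 pixels, C = 4 channels (toy sizes)
w = dense_relation(x)            # N x N entries, hence the quadratic cost in N
z = dense_context(x)
```

Every row of `w` sums to one, so each `z[i]` is a weighted average over all $N$ pixels.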

Sparse relation.

The sparse relation scheme only estimates the relations between the pixel $i$ and two subsets of selected pixels following the “interlacing (a.k.a. interleaving) method” (Greenspun, 1999; Roelofs and Koman, 1999). We illustrate the context representation based on sparse relation:

where $\mathcal{I}_{i}^{g}$ / $\mathcal{I}_{i}^{l}$ is the subset of pixels with the same remainder / quotient as the pixel $i$ when divided by $P$ respectively. Both $i$ and $j$ in the above illustrations represent the spatial positions of pixels $i$ and $j$ in the one-dimensional case. $P$ represents the group number in the global relation stage (Sec. 3.2) and it determines the selection of context pixels. The main advantage of the sparse relation scheme is that we only need to estimate the relation values between pixel $i$ and $\mathcal{I}_{i}^{g}\cup\mathcal{I}_{i}^{l}$ instead of $\mathcal{I}$, which saves a lot of computation cost.
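The two index sets can be enumerated directly from the remainder and quotient rule stated above. The sketch below uses 0-indexed positions (the text indexes pixels from $1$), and the helper names and toy sizes are hypothetical.

```python
def global_context(i, n, p):
    """Pixels sharing pixel i's remainder modulo p (spread across the whole input)."""
    return [j for j in range(n) if j % p == i % p]


def local_context(i, n, p):
    """Pixels sharing pixel i's quotient when divided by p (one contiguous block)."""
    return [j for j in range(n) if j // p == i // p]


N, P = 12, 4  # toy sizes, so Q = N / P = 3
i = 5
print(global_context(i, N, P))  # [1, 5, 9]
print(local_context(i, N, P))   # [4, 5, 6, 7]
```

For this pixel the union of the two sets contains $6$ of the $12$ positions, so only those relation values need to be estimated.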

where we use $\mathbf{P}$ to represent a permutation matrix of size $N\times N$ that ensures the pixel orderings of the two sparse relation matrices are matched and $\mathbf{P}^{\top}$ is the transpose of $\mathbf{P}$ . We illustrate the definition of each value $p_{i,j}$ in $\mathbf{P}$ in Appendix B and why the sparse relation scheme is more efficient than the dense relation scheme in Appendix C.

3.2 Instantiations

We explain the specific instantiations of dense relation and sparse relation based on the self-attention and the interlaced sparse self-attention respectively.

The implementation of the dense relation scheme based on self-attention is illustrated as follows:

Interlaced sparse self-attention.

The implementation of the sparse relation scheme, i.e., the interlaced sparse self-attention, first divides all pixels into multiple subgroups and then applies the self-attention on each subgroup to compute the sparse relation matrices, i.e., $\mathbf{W}^{g}$ and $\mathbf{W}^{l}$ , and the context representations.

We illustrate the overall pipeline of the interlaced sparse self-attention scheme with a two dimensional example in Fig. 2, where we estimate $\mathbf{W}^{g}$ with the global relation module and $\mathbf{W}^{l}$ with the local relation module. With the combination of these two sparse relation matrices, we can approximate the dense relations between any two of all pixels, which is explained with an example in Appendix D.

Global relation. We divide all positions into multiple groups, each consisting of a subset of sampled positions according to the definition of $\mathcal{I}^{g}$. Because the pixels within each group are sampled based on the remainder when divided by the number of groups $P$ and are distributed across the whole image, we call it the global relation.

We first permute the input feature map $\mathbf{X}$ :

Second, we divide $\mathbf{X}^{g}$ into ${P}$ groups with each group containing ${Q}$ neighboring positions ( $N={P}\times{Q}$ ):

where only the relation values in the diagonal blocks are non-zero. Therefore, we only need to estimate the relation values between pixel pairs belonging to the same group and ignore the relations between pixel pairs from different groups.

We concatenate all $\mathbf{Z}^{g}_{p}$ from different groups and get the output representation $\mathbf{Z}^{g}=[{\mathbf{Z}^{g}_{1}}^{\top},{\mathbf{Z}^{g}_{2}}^{\top},\cdots,{\mathbf{Z}^{g}_{P}}^{\top}]^{\top}$ after the global relation stage:

Local relation. In the local relation stage, we divide the positions into multiple groups according to the definition of $\mathcal{I}^{l}$. Because the pixels of each group are sampled based on the quotient and lie within a local neighboring range, we call it the local relation.

We apply another permutation on the output feature map from the global relation module:

Then, we divide $\mathbf{X}^{l}$ into ${Q}$ groups with each group containing ${P}$ neighboring positions:

where the above relation matrix $\mathbf{W}^{l}$ based on the local relation is also very sparse and most of the relation values are zero.

We concatenate all $\mathbf{Z}^{l}_{q}$ from different groups and get the output representation $\mathbf{Z}^{l}=[{\mathbf{Z}^{l}_{1}}^{\top},{\mathbf{Z}^{l}_{2}}^{\top},\cdots,{\mathbf{Z}^{l}_{Q}}^{\top}]^{\top}$ after the local relation stage:

where $\mathbf{Z}^{l}$ is also the final output representation of interlaced sparse self-attention scheme.
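The global and local stages can be sketched in one dimension with NumPy as below. This is an illustrative simplification rather than the implementation of Algorithm 1: it assumes softmax-normalized scaled inner products without learned query, key or value transforms, and the names `attend` and `interlaced_sparse_attention` are hypothetical.

```python
import numpy as np


def attend(x):
    """Self-attention within each group; x has shape (..., n, C)."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def interlaced_sparse_attention(x, p):
    """One-dimensional global-then-local attention on x of shape (N, C), N = p * q."""
    n, c = x.shape
    q = n // p
    assert p * q == n, "N must be divisible by the group number P"
    # Global relation: group g holds positions g, g + p, g + 2p, ... (same remainder).
    xg = x.reshape(q, p, c).transpose(1, 0, 2)  # (p groups, q positions, C)
    zg = attend(xg)
    # Undo the permutation so that positions return to their original order.
    xl = zg.transpose(1, 0, 2).reshape(n, c)
    # Local relation: group k holds positions k*p, ..., k*p + p - 1 (same quotient).
    zl = attend(xl.reshape(q, p, c))            # (q groups, p positions, C)
    return zl.reshape(n, c)


rng = np.random.default_rng(0)
x = rng.standard_normal((12, 8))                # N = 12 pixels, C = 8 channels
z = interlaced_sparse_attention(x, p=4)         # P = 4 groups of Q = 3 positions
```

Under these assumptions, the two degenerate settings `p=1` and `p=N` both reduce to ordinary dense self-attention, because one of the two stages then attends over singleton groups and leaves its input unchanged.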

Complexity. Given an input feature map of size $H\times W\times C$ , we analyze the computation/memory cost of both the self-attention mechanism and our interlaced sparse self-attention scheme as follows.

The computation complexity of self-attention mechanism is $\mathcal{O}(HWC^{2}+(HW)^{2}{C})$ , and the complexity of our interlaced sparse self-attention mechanism is $\mathcal{O}(HWC^{2}+(HW)^{2}{C}(\frac{1}{{P}_{h}{P}_{w}}+\frac{1}{{Q}_{h}{Q}_{w}}))$ , where we divide the height dimension into ${P}_{h}$ groups and the width dimension to ${P}_{w}$ groups in the global relation stage and ${Q}_{h}$ and ${Q}_{w}$ groups during the local relation stage. We have $H={P}_{h}{Q}_{h}$ , $W={P}_{w}{Q}_{w}$ . The complexity of our approach can be reduced to $\mathcal{O}(HWC^{2}+(HW)^{\frac{3}{2}}{C})$ when ${P}_{h}{P}_{w}=\sqrt{HW}$ . Detailed formulations and proof of the computation complexity are provided in Appendix E. We compare the theoretical GFLOPs of the interlaced sparse self-attention scheme and the conventional self-attention scheme in Fig. 3, where we can see that our interlaced sparse self-attention is much more efficient than the conventional self-attention when processing inputs of higher resolution. We further report the actual GPU memory cost (measured by MB), computation cost (measured by GFLOPs), and inference time (measured by ms) of both mechanisms in Fig. 4 to illustrate the advantage of our method.
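The leading terms of the two complexity expressions can be compared numerically. The sketch below evaluates them on a toy feature map chosen so that $\sqrt{HW}$ is an integer; the sizes and function names are hypothetical, and constant factors are ignored, as in the $\mathcal{O}(\cdot)$ expressions.

```python
import math


def dense_cost(h, w, c):
    """Leading terms of the self-attention cost: HW * C^2 + (HW)^2 * C."""
    n = h * w
    return n * c ** 2 + n ** 2 * c


def interlaced_cost(h, w, c, p_h, p_w):
    """Leading terms of the interlaced cost for P_h x P_w groups in the global stage."""
    assert h % p_h == 0 and w % p_w == 0
    q_h, q_w = h // p_h, w // p_w
    n = h * w
    return n * c ** 2 + n ** 2 * c * (1 / (p_h * p_w) + 1 / (q_h * q_w))


H, W, C = 64, 64, 512  # toy feature map; sqrt(HW) = 64
P_h = P_w = 8          # P_h * P_w = sqrt(HW), hence Q_h * Q_w = sqrt(HW) as well
attention_dense = dense_cost(H, W, C) - H * W * C ** 2
attention_sparse = interlaced_cost(H, W, C, P_h, P_w) - H * W * C ** 2

# With P_h * P_w = sqrt(HW) the attention term equals 2 * (HW)^(3/2) * C.
assert math.isclose(attention_sparse, 2 * (H * W) ** 1.5 * C)
print(attention_dense / attention_sparse)  # 32.0
```

The ratio equals $\sqrt{HW}/2$, so the saving in the attention term grows with the input resolution.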

We present the PyTorch code of the proposed interlaced sparse self-attention in Algorithm 1 and explain the rough correspondence between Fig. 2 and Algorithm 1. For example, in the global relation stage, the combination of $\operatorname{reshape}$ in line-9 and $\operatorname{permute}$ in line-10 of Algorithm 1 corresponds to the $\operatorname{Permute}$ in Fig. 2, and the $\operatorname{reshape}$ in line-11 of Algorithm 1 corresponds to the $\operatorname{Divide}$ in Fig. 2. Our implementation is optimized for efficiency, as we apply the interlacing operations on high-dimensional tensors. Note that the permute function (in both line-10 and line-16 of Algorithm 1) therefore does not correspond exactly to the Permute (for the one-dimensional situation) illustrated in Fig. 2.

Object context pooling. We use self-attention or interlaced sparse self-attention to implement the object context pooling module (OCP). The object context pooling estimates the context representation of each pixel $i$ by aggregating the representations of the selected subset of pixels based on the estimated dense relation matrix or the two sparse relation matrices. We first apply the object context pooling module, based on either the self-attention scheme or the proposed interlaced sparse self-attention scheme, to compute the context representation. We then concatenate the input representation with the context representation as the output representation. We name the resulting baseline method Base-OC and illustrate the details in Fig. 5 (b).

3.3 Pyramid Extensions

To handle objects of multiple scales, we further combine our approach with the conventional multi-scale context schemes including PPM and ASPP. There exist two kinds of multi-scale problems: (i) objects of different categories have different scales even when their distances to the camera are the same, e.g., the “car” is larger than the “person”; (ii) objects of the same category have different scales when their distances to the camera differ, e.g., the closer “person” is larger than the distant “person”.

Combination with PPM. Inspired by the previous pyramid pooling module (Zhao et al., 2017), we divide the input image into regions of four pyramid scales: $1\times 1$ region, $2\times 2$ regions, $3\times 3$ regions and $6\times 6$ regions. We update the feature maps for each scale by feeding the feature map of each region into the object context pooling module separately, and then combine the four updated feature maps. Finally, we concatenate the multiple pyramid object context representations with the input feature map. We call the resulting method Pyramid-OC. More details are illustrated in Fig. 5 (c).
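The four pyramid partitions can be sketched as lists of rectangular boxes over the feature map. The rounding rule used for the region boundaries is an assumption made for illustration (the text does not specify it), and the helper names are hypothetical.

```python
def region_bounds(size, k):
    """Split range(size) into k contiguous intervals of near-equal length."""
    edges = [round(i * size / k) for i in range(k + 1)]
    return list(zip(edges[:-1], edges[1:]))


def pyramid_regions(height, width, scales=(1, 2, 3, 6)):
    """For each scale k, list the k * k regions as (top, bottom, left, right) boxes."""
    return {
        k: [(t, b, l, r)
            for t, b in region_bounds(height, k)
            for l, r in region_bounds(width, k)]
        for k in scales
    }


# Example: a 512 x 1024 crop at output stride 8 gives a 64 x 128 feature map.
regions = pyramid_regions(64, 128)
num_regions = sum(len(boxes) for boxes in regions.values())  # 1 + 4 + 9 + 36
```

In such a scheme, the features inside each box would be passed through the object context pooling module independently, so relations are estimated only between pixels of the same region.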

Combination with ASPP. The conventional atrous spatial pyramid pooling (Chen et al., 2017) consists of $5$ branches: an image-level pooling branch, a $1\times 1$ convolution branch and three $3\times 3$ dilated convolution branches with dilation rates of $12$, $24$ and $36$, respectively. We replace the image-level pooling branch with the object context pooling to exploit the relation-based object context information, resulting in a method which we name ASP-OC. More details are illustrated in Fig. 5 (d).

3.4 Network Architecture

We illustrate the overall pipeline of our OCNet in Fig. 5 (a). More details are illustrated as follows.

Backbone. We use the ResNet- $101$ (He et al., 2016) or HRNetV2-W $48$ (Sun et al., 2019b) pretrained over the ImageNet dataset as the backbone. For the ResNet- $101$ , we make some modifications by following PSPNet (Zhao et al., 2017): replace the convolutions within the last two blocks by dilated convolutions with dilation rates being $2$ and $4$ , respectively, so that the output stride becomes $8$ . For the HRNetV2-W $48$ , we directly apply our approach on the final concatenated feature map with output stride $4$ .

Base-OC. Before feeding the feature map into the OCP, we apply a dimension reduction module (a $3\times 3$ convolution) to reduce the channels of the feature maps output from the backbone to $512$ for both ResNet- $101$ and HRNetV2-W $48$ . Then we feed the updated feature map into the OCP and concatenate the output feature map of the OCP with the input feature map to the OCP. We further perform a $1\times 1$ convolution to decrease the channels of the concatenated feature map from $1024$ to $512$ , which is not included in Fig. 5 (b).

Pyramid-OC. We first apply a $3\times 3$ convolution to reduce the channels to $512$ in advance, then we feed the dimension reduced feature map to the Pyramid-OC and perform four different pyramid partitions ( $1\times 1$ region, $2\times 2$ regions, $3\times 3$ regions, and $6\times 6$ regions) on the input feature map, and we concatenate the four different output object context feature maps output by the four parallel OCPs. Each one of the four object context feature maps has $512$ channels. We apply a $1\times 1$ convolution to increase the channel of the input feature map from $512$ to $2048$ and concatenate it with all four object context feature maps. Lastly, we use a $1\times 1$ convolution on the concatenated feature map with $4096$ channels and produce the final feature map with $512$ channels , which is not included in Fig. 5 (c).

ASP-OC. We only perform the dimension reduction within the object context pooling branch, where we use a $3\times 3$ convolution to reduce the channel to $256$ . The output feature map from object context pooling module has $256$ channels. For the other four branches, we exactly follow the original ASPP module and apply a $1\times 1$ convolution within the second above branch and $3\times 3$ dilated convolution with different dilation rates ( $12$ , $24$ , $36$ ) in the three remaining parallel branches. We set the output channel as $256$ in all these four branches following the original settings (Chen et al., 2017). Lastly, we concatenate these five parallel output feature maps and use a $1\times 1$ convolution to decrease the channel of the concatenated feature map from $1280$ to $256$ , which is not included in Fig. 5 (d).
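The channel counts stated for the three heads can be checked with simple arithmetic. The dictionary keys below are hypothetical labels; the numbers are the ones given in the text.

```python
# Channel counts of the feature maps that are concatenated in each head.
base_oc = {"input": 512, "context": 512}
pyramid_oc = {"input_expanded": 2048, "contexts": 4 * 512}
asp_oc = {"object_context": 256, "conv1x1": 256, "dilated": 3 * 256}

concat_channels = {
    "Base-OC": sum(base_oc.values()),        # then reduced to 512 by a 1x1 convolution
    "Pyramid-OC": sum(pyramid_oc.values()),  # then reduced to 512 by a 1x1 convolution
    "ASP-OC": sum(asp_oc.values()),          # then reduced to 256 by a 1x1 convolution
}
print(concat_channels)  # {'Base-OC': 1024, 'Pyramid-OC': 4096, 'ASP-OC': 1280}
```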

Discussion. The concept of object context is also discussed in another work of ours: object contextual representations (OCR) (Yuan et al., 2019). The main difference is that this work focuses on efficiently modeling the dense relations between pixels, while OCR mainly exploits coarse segmentation maps to construct a set of object region representations and models the dense relations between pixels and object regions.

Experimental Results

We evaluate our approach on five challenging semantic segmentation benchmarks. First, we study various components within our approach and compare our approach to some closely related mechanisms (Sec 4.3). Second, we compare our approach to the recent state-of-the-art methods to verify that we achieve competitive performance (Sec 4.4). Last, we apply our approach on the conventional Mask-RCNN to verify that our method generalizes well (Sec 4.5). Besides, we also illustrate the quantitative improvements along the boundary (Table 6) and the qualitative improvements on various benchmarks (Fig. 9) based on our approach.

Cityscapes (https://www.cityscapes-dataset.com/). Cityscapes (Cordts et al., 2016) contains $5,000$ finely annotated images with $19$ semantic classes. The images are in $2048\times 1024$ resolution and captured from 50 different cities. The training, validation, and test sets consist of $2,975$, $500$, $1,525$ images, respectively.

ADE20K (https://groups.csail.mit.edu/vision/datasets/ADE20K/). ADE20K (Zhou et al., 2017) is very challenging and it contains $22$K densely annotated images with $150$ fine-grained semantic concepts. The training and validation sets consist of $20$K, $2$K images, respectively.

LIP (http://sysu-hcp.net/lip/). LIP (Gong et al., 2017) is a large-scale dataset that focuses on semantic understanding of human bodies. It contains $50$K images with $19$ semantic human part labels and $1$ background label for human parsing. The training, validation, and test sets consist of $30$K, $10$K, $10$K images, respectively.

PASCAL-Context (https://cs.stanford.edu/~roozbeh/pascal-context/). PASCAL-Context (Mottaghi et al., 2014) is a challenging scene parsing dataset that contains $59$ semantic classes and $1$ background class. The training set and test set consist of $4,998$ and $5,105$ images, respectively.

COCO-Stuff (https://github.com/nightrome/cocostuff). COCO-Stuff (Caesar et al., 2018) is a challenging scene parsing dataset that contains $171$ semantic classes. The training set and test set consist of $9$K and $1$K images, respectively.

4.2 Implementation Details

Training setting. We initialize the parameters within the object context pooling module and the classification head randomly. We adopt the polynomial learning rate policy with factor $(1-(\frac{iter}{iter_{max}})^{0.9})$ . We set the weight on the final loss as $1$ and the weight on the auxiliary loss as $0.4$ following PSPNet (Zhao et al., 2017). The auxiliary loss is applied on the representation output from stage-$3$ of ResNet-101 or the final representation of HRNetV2-W48. In all experiments, we use InPlace-ABNsync (Rota Bulò et al., 2018) to synchronize the mean and standard deviation of batch normalization across multiple GPUs. For data augmentation, we perform random horizontal flipping, random scaling in the range of $[0.5,2]$ and random brightness jittering within the range of $$. More details are given as follows.

For the experiments on Cityscapes: we set the initial learning rate as $0.01$ , weight decay as $0.0005$ , crop size as $512\times 1024$ and batch size as $8$ . For the experiments evaluated on val/test, we set training iterations as $60$ K/ $100$ K on train/train+val respectively.

For the experiments on ADE20K: we set the initial learning rate as $0.02$ , weight decay as $0.0001$ , crop size as $520\times 520$ , batch size as $16$ and training iterations as $150$ K if not specified.

For the experiments on LIP: we set the initial learning rate as $0.007$ , weight decay as $0.0005$ , crop size as $473\times 473$ , batch size as $32$ and training iterations as $100$ K if not specified.

For the experiments on PASCAL-Context: we set the initial learning rate as $0.001$ , weight decay as $0.0001$ , crop size as $520\times 520$ , batch size as $16$ and training iterations as $30$ K if not specified.

For the experiments on COCO-Stuff: we set the initial learning rate as $0.001$ , weight decay as $0.0001$ , crop size as $520\times 520$ , batch size as $16$ and training iterations as $60$ K if not specified.
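For reference, the per-dataset settings listed above can be collected in one configuration table. The sketch below only restates the numbers from the text; the class name, field names and dictionary keys are hypothetical, and crop sizes are stored in the order in which they are written above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    base_lr: float
    weight_decay: float
    crop_size: tuple  # in the order given in the text
    batch_size: int
    iterations: int


TRAIN_CONFIGS = {
    # 60K iterations on train (evaluated on val); 100K on train+val (evaluated on test).
    "cityscapes": TrainConfig(0.01, 0.0005, (512, 1024), 8, 60_000),
    "ade20k": TrainConfig(0.02, 0.0001, (520, 520), 16, 150_000),
    "lip": TrainConfig(0.007, 0.0005, (473, 473), 32, 100_000),
    "pascal_context": TrainConfig(0.001, 0.0001, (520, 520), 16, 30_000),
    "coco_stuff": TrainConfig(0.001, 0.0001, (520, 520), 16, 60_000),
}

for name, cfg in TRAIN_CONFIGS.items():
    print(name, cfg)
```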

3 Ablation Study

We choose the dilated ResNet- $101$ as our backbone to conduct all ablation experiments, and we refer to it simply as ResNet- $101$ for convenience. We choose the OCNet with Base-OC (ISA) as our default setting if not specified.

Group numbers. In order to study the influence of group numbers within the Base-OC (ISA) scheme, we train the Base-OC (ISA) method by varying ${P}_{h}$ and ${P}_{w}$ , where we can determine the value of ${Q}_{h}$ and ${Q}_{w}$ according to $H={P}_{h}\times{Q}_{h}$ and $W={P}_{w}\times{Q}_{w}$ . In Table 1, we illustrate the results on Cityscapes val. We can see that our approach with different group numbers consistently improves over the baseline and that we get the best result with ${P}_{h}={P}_{w}=8$ . Therefore, we set ${P}_{h}={P}_{w}=8$ by default in all experiments if not specified.
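The relation between the group numbers and the feature map size can be made concrete with a small worked example. The helper name below is hypothetical, and the example assumes a $1024\times 2048$ Cityscapes image whose feature map is $\frac{1}{8}$ of the input size.

```python
def group_shapes(H, W, Ph, Pw):
    """Return (Qh, Qw) such that H = Ph * Qh and W = Pw * Qw."""
    if H % Ph or W % Pw:
        raise ValueError("H and W must be divisible by Ph and Pw")
    return H // Ph, W // Pw


# Assumed example: a 1024 x 2048 input at 1/8 resolution gives a 128 x 256 feature map.
H, W = 1024 // 8, 2048 // 8
Qh, Qw = group_shapes(H, W, Ph=8, Pw=8)
print(Qh, Qw)  # 16 32
```

With ${P}_{h}={P}_{w}=8$ fixed, ${Q}_{h}$ and ${Q}_{w}$ scale with the input resolution, so their values differ from dataset to dataset.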

Global+Local vs. Local+Global. We study the influence of the order of the global relation and the local relation within the Base-OC (ISA) module. We report the results in the $6^{\text{th}}$ row and $10^{\text{th}}$ row of Table 1. We can see that both mechanisms improve over the baseline by a large margin. Applying the global relation first seems to be favorable, so we apply the global relation first in all our experiments unless otherwise specified. Besides, we also compare the results with only the global relation or only the local relation. Measured by mIoU, the global relation alone achieves $78.9\%$ and the local relation alone achieves $77.2\%$ , which verifies that the global relation is more important and that the local relation is complementary to the global relation.

Comparison to multi-scale context. We compare the proposed relational context scheme to two conventional multi-scale context schemes: PPM (Zhao et al., 2017) and ASPP (Chen et al., 2017).

We conduct the comparison experiments under the same training/testing settings, e.g., the same training iterations and batch size. We report the related results in Table 2. Our reproduced PPM outperforms the originally reported performance (Ours: $78.5\%$ vs. Paper (Zhao et al., 2017): $77.6\%$ ). Our approach consistently outperforms both PPM and ASPP on the evaluated benchmarks including Cityscapes and ADE20K. For example, Base-OC (ISA) outperforms the PPM by $0.99\%$ / $0.61\%$ mIoU on Cityscapes/ADE20K, respectively. Compared to the ASPP, our approach is more efficient according to the complexity comparison reported in Table 4.

Besides, we also compare the performance based on the conventional self-attention (SA) and the proposed interlaced sparse self-attention (ISA) in Table 2, and we can see that our ISA achieves comparable performance while being much more efficient.

Comparison to RCCA/CGNL/Efficient-Attention. We compare our approach with several existing mechanisms that focus on addressing the efficiency problem of self-attention/non-local, such as SA- $2\times$ (Wang et al., 2018b), RCCA (Huang et al., 2019), CGNL (Yue et al., 2018) and Efficient Attention (Shen et al., 2018). For SA- $2\times$ , we directly down-sample the feature map by $2\times$ before computing the dense relation matrix. We evaluate all these mechanisms on the Cityscapes val and report the results in Table 3. We can see that our approach consistently outperforms all of these mechanisms, which verifies that our approach is more reliable. For fairness, we report the average performance of all methods, considering that the mIoU variance of RCCA is large.

Complexity. We compare the complexity of our approach with PPM (Zhao et al., 2017), DANet (Fu et al., 2019a), RCCA (Huang et al., 2019), CGNL (Yue et al., 2018) and Efficient Attention (Shen et al., 2018) in this section. We report the GPU memory, GFLOPs and inference time when processing an input feature map of size $2048\times 128\times 128$ under the same setting in Table 4. We can see that our approach based on ISA is much more efficient than all of the other approaches except the Efficient Attention scheme. For example, the proposed Base-OC (ISA) is nearly $3\times$ faster and saves more than $88\%$ of the GPU memory when compared with the DANet. Besides, our approach also requires less GPU memory and inference time than both RCCA and CGNL.

Pyramid extensions. We study the performance with the two pyramid extensions including the Pyramid-OC and ASP-OC. We choose the dilated ResNet- $101$ as our baseline and summarize all related results in Table 5. We can find that the ASP-OC consistently improves the performance for both the SA scheme and ISA scheme while the Pyramid-OC slightly degrades the performance compared to the Base-OC mechanism. Accordingly, we only report the performance with Base-OC (ISA) module and ASP-OC (ISA) module for the following experiments if not specified.

4 Comparison to State-of-the-arts

We choose the object context pooling module based on ISA, i.e., Base-OC (ISA) and ASP-OC (ISA), by default. We evaluate the performance of OCNet (w/ Base-OC) and OCNet (w/ ASP-OC) on $5$ benchmarks and illustrate the related results as follows.

Cityscapes. We report the comparison to the state-of-the-art methods on Cityscapes test in Table 7, and we apply OHEM, the multi-scale testing and flip testing following the previous work. We apply our approach on both ResNet- $101$ and HRNetV2-W $48$ , and our approach achieves competitive performance with both backbones. For example, we improve the performance of HRNetV2-W $48$ from $81.6\%$ to $82.5\%$ with the ASP-OC module, which also outperforms the recent ACNet (Fu et al., 2019b).

ADE20K. In Table 8, we compare our approach to the state-of-the-art methods on the ADE20K val. Our approach also achieves competitive performance, e.g., OCNet (w/ ASP-OC) based on ResNet- $101$ and HRNetV2-W $48$ achieves $45.40\%$ and $45.50\%$ respectively, both of which are slightly worse than the recent ACNet (Fu et al., 2019b) that exploits rich global context and local context.

PASCAL-Context. As illustrated in Table 9, we compare our approach with the previous state-of-the-arts on the PASCAL-Context test. We can find that our approach significantly improves the performance of HRNet (Sun et al., 2019a) and achieves $56.2\%$ with OCNet (w/ Base-OC), which also outperforms most of the other previous approaches.

LIP. We compare our approach to the previous state-of-the-art methods on LIP val and illustrate the results in Table 10. The OCNet (w/ ASP-OC) based on HRNetV2-W $48$ achieves a competitive performance of $56.35\%$ , which is slightly worse than the recent CNIF (Wang et al., 2019).

COCO-Stuff. From Table 11, we can see that our method also achieves a competitive performance of $40.0\%$ on COCO-Stuff test, which is comparable with the very recent state-of-the-art method.

Visualization. We visualize some examples of the global relation, local relation and dense relation predicted with our approach, e.g., OCNet based on dilated ResNet- $101$ + Base-OC (ISA), on different benchmarks in Fig. 6 and Fig. 7.

For all examples, we down-sample the ground-truth label map to match the size of the dense relation map, which is $\frac{1}{8}$ of the input size. We choose the same group numbers ${P}_{h}={P}_{w}=8$ for all datasets; thus, the global relation matrix and the local relation matrix are of various shapes. For example, the dense relation matrix / global relation matrix / local relation matrix is of size $256\times 128$ / $32\times 16$ / $8\times 8$ respectively for Cityscapes images. We generate the dense relation matrix by multiplying the local relation with the global relation, and we can see that the estimated dense relation matrix puts most of its relation weights on the pixels belonging to the same category as the chosen pixel, which well approximates the ground-truth object context.

We compare the segmentation maps predicted with our approach and the baseline (dilated ResNet- $101$ ) to illustrate the qualitative improvements, and we visualize the results in Fig. 9. We can find that our method produces better segmentation maps compared with the baseline. We mark all of the improved regions with white dashed boxes.

Boundary Analysis. We report the boundary improvements within $3$ , $5$ , $9$ and $12$ pixels width based on our approach on Cityscapes val in Table 6, and we can find that our approach significantly improves the boundary quality for several object categories including wall, truck, bus, train and so on.

5 Application to Mask-RCNN

Dataset. We use the COCO (Lin et al., 2014) dataset to evaluate our approach. The dataset is one of the most challenging datasets for object detection and instance segmentation, and contains $140$ K images annotated with object bounding boxes and masks of $80$ categories. We follow the COCO2017 split as in (He et al., 2017), where the training, validation and test sets contain $115$ K, $5$ K, $20$ K images, respectively. We report the standard COCO metrics including Average Precision (AP), AP50 and AP75 for both bounding boxes and masks.

Training settings. We use Mask-RCNN (He et al., 2017) as the baseline to conduct our experiments. Similar to (Wang et al., 2018b), we insert $1$ non-local block or object context pooling module based on interlaced sparse self-attention before the last block of the res- $4$ stage of the ResNet- $50$ FPN (Lin et al., 2017b) backbone. All models are initialized with ImageNet pretrained weights and built upon the open source toolbox (Massa and Girshick, 2018). We train the models using SGD with a batch size of $16$ and weight decay of $0.0001$ . We conduct experiments using two training schedules, the “ $1\times$ schedule” and the “ $2\times$ schedule” (Massa and Girshick, 2018). The $1\times$ schedule starts at a learning rate of $0.02$ , which is decreased by a factor of $10$ after $60$ K and $80$ K iterations, and terminates at $90$ K iterations. For the $2\times$ schedule, we train for $180$ K iterations and decrease the learning rate proportionally. The other training and inference strategies follow the default settings in (Massa and Girshick, 2018).
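The two schedules can be written down as a simple step function. This is a sketch under stated assumptions rather than the toolbox's scheduler: the function names are hypothetical, any warm-up phase is ignored, and the $2\times$ schedule is assumed to stretch both the decay points and the total length by the same factor.

```python
def step_lr(it, base_lr=0.02, milestones=(60_000, 80_000), gamma=0.1):
    """Learning rate after dividing by 10 at every milestone already reached."""
    passed = sum(1 for m in milestones if it >= m)
    return base_lr * gamma ** passed


def scaled_schedule(factor, milestones=(60_000, 80_000), total=90_000):
    """Stretch the 1x milestones and the total length proportionally (assumption)."""
    return tuple(m * factor for m in milestones), total * factor


milestones_2x, total_2x = scaled_schedule(2)
print(milestones_2x, total_2x)  # (120000, 160000) 180000
for it in (0, 60_000, 80_000):
    print(it, step_lr(it))
```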

Results. We report the results on the COCO dataset in Table 12. We can see that adding one non-local block (Wang et al., 2018b) or interlaced sparse self-attention module consistently improves the Mask-RCNN baseline by $\sim 1$ % on all metrics involving both object detection and instance segmentation. Similar gains are observed for both the $1\times$ schedule and the $2\times$ schedule. For example, our approach improves the box AP/mask AP of Mask-RCNN from $38.7$ / $34.9$ to $39.7$ / $35.7$ with the $2\times$ schedule. Notably, the performance of our approach is comparable with the non-local block on all metrics while decreasing the computational complexity significantly.

Last, we visualize the object detection and instance segmentation results of our approach and the Mask-RCNN on the validation set of COCO in Fig. 8. We can find that our approach improves the Mask-RCNN consistently on all the examples. For example, the Mask-RCNN fails to detect multiple cars in the last example while our approach achieves better detection performance.

Conclusion

In this paper, we present the object context that is capable of enhancing the object information via exploiting the semantic relations between pixels. Our object context is more in line with the definition of semantic segmentation, which defines the category of each pixel as the category of the object that it belongs to. We propose two different kinds of implementations: (i) dense relation based on the conventional self-attention scheme and (ii) sparse relation based on the proposed interlaced sparse self-attention scheme. We demonstrate the effectiveness of our method on five challenging semantic segmentation benchmarks, i.e., Cityscapes, ADE20K, LIP, PASCAL-Context and COCO-Stuff. We also extend our approach to Mask-RCNN to verify its advantage, and we believe our approach might benefit various vision tasks through replacing the original self-attention or non-local scheme with our interlaced sparse self-attention mechanism.

Future work

Although our object context scheme achieves competitive results on various benchmarks, there still exist many other important paths to construct richer context information. We illustrate three potential candidates:

(i) use the co-occurring relations between different object categories to refine the coarse segmentation map, e.g., the “rider” tends to co-occur with the “bicycle”; thus, we can refine the “rider” pixels that are misclassified as “person”.

(ii) use the shape structure information to regularize the segmentation, e.g., the shape of a “bus” tends to be a quadrilateral, pentagon, or hexagon under various views; thus, we might use a set of prior shape masks to refine the predictions like the recent ShapeMask (Kuo et al., 2019).

(iii) use the spatial location relation information, e.g., the “keyboard” typically lies below the “monitor”; thus, we can use the prior knowledge on the spatial relations between different objects to refine the predictions. Besides, there also exist some efforts (Krishna et al., 2017) that focus on predicting the “relationship” between different objects from the input image directly.

References

Appendix

We give a further explanation of why we believe OC is superior to the two previous representation methods, PPM (Zhao et al., 2017) and ASPP (Chen et al., 2017), as follows:

In theory, enhancing the object information in the context can decrease the variance of the context information; in other words, the context of PPM and ASPP suffers from larger variance than the OC context. This is because the OC context only contains the variance of the {object information} while the context of PPM/ASPP further contains the variance of {object information, useful background information, irrelevant background information}. Recent studies (Hoyer et al., 2019; Shetty et al., 2019) have verified that the overuse of the noisy context information based on PPM leads to poor generalization ability. For example, the “cow” pixels might be misclassified as “horse” pixels when the “cow” appears on the road. We directly use Fig. 10 from Hoyer et al. (2019) to support our point. In summary, explicitly enhancing the object information might decrease the variance of the context information and thus increase the generalization ability of the model.

In experiments, according to the results in the Table 2 (in the paper), we have verified that the OCNet outperforms both PSPNet and DeepLabv3 under the fair comparison settings.

B. Formulation of PPM context.

We illustrate the definition of the context $\mathcal{I}_{i}$ based on the PPM (Zhao et al., 2017) scheme:

where $k\in\{2,3,6\}$ represents different pyramid region partitions. Such a context is an aggregation of the pixels with the same quotient.

C. Formulation of Permutation Matrix.

We illustrate the definition of each value $p_{i,j}$ in the permutation matrix $\mathbf{P}$ :

where, according to $\mathbf{W}=\mathbf{W}^{l}\mathbf{P}^{\top}\mathbf{W}^{g}\mathbf{P}$ , we permute the $i$ -th column of $\mathbf{W}^{g}$ / $\mathbf{W}^{l}$ to the $j$ -th column if $p_{i,j}=1$ / $p_{j,i}=1$ when multiplying permutation matrix $\mathbf{P}$ / $\mathbf{P}^{\top}$ on the right side of $\mathbf{W}^{g}$ / $\mathbf{W}^{l}$ respectively.

D. Why is the sparse relation more efficient?

For the convenience of analysis, we rewrite the mathematical formulation of computing the context representations (w/o considering the transform functions $\delta(\cdot)$ and $\rho(\cdot)$ ) based on the dense relation scheme and the sparse relation scheme as follows. The formulation of the dense relation scheme is $\mathbf{Z}=\mathbf{W}\mathbf{X}$ and the formulation of the sparse relation scheme is $\mathbf{Z}=(\mathbf{W}^{l}\mathbf{P}^{\top}\mathbf{W}^{g}\mathbf{P})\mathbf{X}$ . We can see that this formulation of the sparse relation scheme still requires $\mathcal{O}(N^{2})$ GPU memory to store the reconstructed dense relation matrix $\mathbf{W}^{l}\mathbf{P}^{\top}\mathbf{W}^{g}\mathbf{P}$ . To avoid such expensive GPU memory consumption, we rewrite the formulation of the sparse relation scheme as $\mathbf{Z}=\mathbf{W}^{l}(\mathbf{P}^{\top}(\mathbf{W}^{g}(\mathbf{P}\mathbf{X})))$ according to the associative law. Because both $\mathbf{W}^{l}$ and $\mathbf{W}^{g}$ are sparse block matrices and each block is independent of the other blocks, we compute the multiple block matrices concurrently by aligning these block matrices on the batch dimension. Besides, we implement the permutation matrix via the combination of the permute and reshape operations provided in PyTorch. More details are illustrated in the discussion in Sec. 3.2 and Algorithm 1.
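The equivalence of the two evaluation orders can be checked on a toy one-dimensional example. The sketch below is written in plain Python rather than PyTorch and is not the authors' Algorithm 1: the six positions, the block values and the helper names are all illustrative assumptions. It first materializes the full $N\times N$ matrix $\mathbf{W}^{l}\mathbf{P}^{\top}\mathbf{W}^{g}\mathbf{P}$ and then computes $\mathbf{W}^{l}(\mathbf{P}^{\top}(\mathbf{W}^{g}(\mathbf{P}\mathbf{X})))$ using only the small diagonal blocks and an index permutation.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def block_diag(blocks):
    n = sum(len(b) for b in blocks)
    M, off = [[0.0] * n for _ in range(n)], 0
    for b in blocks:
        for r, row in enumerate(b):
            M[off + r][off:off + len(row)] = row
        off += len(b)
    return M

def apply_blocks(blocks, rows):
    """Multiply each diagonal block with its own slice of rows (a batched matmul)."""
    out, off = [], 0
    for b in blocks:
        out += matmul(b, rows[off:off + len(b)])
        off += len(b)
    return out

n_groups, group_len = 2, 3                 # N = 6 positions
N = n_groups * group_len
# Index order that makes the strided groups contiguous: [0, 3, 1, 4, 2, 5].
order = [g * group_len + q for q in range(group_len) for g in range(n_groups)]
P = [[1.0 if c == src else 0.0 for c in range(N)] for src in order]

Wg_blocks = [[[0.7, 0.3], [0.4, 0.6]]] * group_len                              # three 2 x 2 blocks
Wl_blocks = [[[0.5, 0.25, 0.25], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]] * n_groups  # two 3 x 3 blocks
X = [[float(i), float(i * i)] for i in range(N)]                                # N x C features, C = 2

# Dense route: build the N x N relation matrix, then multiply with X.
W = matmul(matmul(block_diag(Wl_blocks), transpose(P)), matmul(block_diag(Wg_blocks), P))
Z_dense = matmul(W, X)

# Sparse route: never build an N x N matrix.
Xp = [X[i] for i in order]                 # P X
Yp = apply_blocks(Wg_blocks, Xp)           # W^g (P X)
Y = [None] * N
for j, i in enumerate(order):              # P^T (...)
    Y[i] = Yp[j]
Z_sparse = apply_blocks(Wl_blocks, Y)      # W^l (...)

assert all(abs(a - b) < 1e-9 for ra, rb in zip(Z_dense, Z_sparse) for a, b in zip(ra, rb))
```

In a tensor library, the loop inside `apply_blocks` would correspond to a single batched matrix multiplication, and the index list `order` to a reshape followed by a permute.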

E. Intuitive example of the sparse relation scheme.

We use a one-dimensional example in Fig. 11 to explain why the combination of two sparse relation matrices is capable of approximating the dense relation matrix. In other words, both the dense relation and the sparse relation ensure that each output position is connected with all input positions. Specifically, in Fig. 11 (b), the output position $\rm{A}_{1}$ has direct relations with $\{\rm{A}_{2},\rm{A}_{3},\rm{B}_{1}\}$ and indirect relations with $\{\rm{B}_{2},\rm{B}_{3}\}$ via $\rm{B}_{1}$ .
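The connectivity argument can be verified with boolean masks. The sketch assumes six positions with the grouping implied by the description above (neighbouring groups $\{\rm{A}_{1},\rm{A}_{2},\rm{A}_{3}\}$ and $\{\rm{B}_{1},\rm{B}_{2},\rm{B}_{3}\}$, strided groups pairing $\rm{A}_{q}$ with $\rm{B}_{q}$); the remaining groups are filled in by symmetry and all names are hypothetical.

```python
labels = ["A1", "A2", "A3", "B1", "B2", "B3"]
local_groups = [{0, 1, 2}, {3, 4, 5}]      # neighbouring positions
global_groups = [{0, 3}, {1, 4}, {2, 5}]   # positions sampled with a stride


def mask(groups, n):
    """mask[i][j] is True when positions i and j fall in the same group."""
    return [[any(i in g and j in g for g in groups) for j in range(n)] for i in range(n)]


def compose(second, first):
    """Connections obtained by applying `first` and then `second`."""
    n = len(first)
    return [[any(second[i][m] and first[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


n = len(labels)
L, G = mask(local_groups, n), mask(global_groups, n)
dense = compose(L, G)

# Non-zero entries: each sparse mask alone, and their composition.
print(sum(map(sum, L)), sum(map(sum, G)), sum(map(sum, dense)))  # 18 12 36

a1 = 0
print([labels[j] for j in range(n) if L[a1][j] or G[a1][j]])     # ['A1', 'A2', 'A3', 'B1']
```

Each sparse mask alone covers only part of the $6\times 6$ grid, while their composition covers all $36$ entries, so every output position is connected to every input position after the two stages.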

F. Complexity Analysis.

We illustrate the proof of the complexity of interlaced sparse self-attention scheme:

Similarly, we can get the complexity of the local relation stage in ISA:

In summary, we can compute the final complexity of ISA by adding $T({\rm{ISA/global}})$ and $T({\rm{ISA/local}})$ :

where we can achieve the minimized computation complexity of $\mathcal{O}(HWC^{2}+(HW)^{\frac{3}{2}}{C})$ when ${P_{h}P_{w}}={Q_{h}Q_{w}}$ is satisfied (according to arithmetic mean $\geq$ geometric mean).
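The arithmetic-geometric mean argument can be checked numerically. The sketch below is an order-of-magnitude model under stated assumptions, not the exact expressions of the proof: attention within a group of $n$ positions is assumed to cost on the order of $n^{2}C$, the $HWC^{2}$ term of the linear transforms and all constant factors are dropped, and the function names are hypothetical. With $P={P_{h}P_{w}}$ and $Q={Q_{h}Q_{w}}$, one stage handles $P$ groups of $Q$ positions and the other handles $Q$ groups of $P$ positions, so the total is $HWC(P+Q)\geq 2(HW)^{\frac{3}{2}}C$ with equality when $P=Q$.

```python
def isa_attention_cost(H, W, C, Ph, Pw):
    """Order-of-magnitude attention cost of the two ISA stages (constants dropped)."""
    Qh, Qw = H // Ph, W // Pw
    P, Q = Ph * Pw, Qh * Qw
    stage_a = P * Q * Q * C   # P groups, each attending over Q positions
    stage_b = Q * P * P * C   # Q groups, each attending over P positions
    return stage_a + stage_b


def dense_attention_cost(H, W, C):
    """Order-of-magnitude cost of dense self-attention over all H * W positions."""
    return (H * W) ** 2 * C


H = W = 64   # toy feature map size, chosen so that P = Q is reachable with integers
C = 1        # the channel number only scales every cost by the same factor
costs = {p: isa_attention_cost(H, W, C, p, p) for p in (2, 4, 8, 16, 32)}
best = min(costs, key=costs.get)
print(best, costs[best], dense_attention_cost(H, W, C))  # 8 524288 16777216
```

For this toy size the minimum is reached at ${P}_{h}={P}_{w}=8$, where $P=Q=64$ and the cost equals $2(HW)^{\frac{3}{2}}C$.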

G. Illustrating the Permutation Scheme of ISA.

To help the readers to understand how we select and permute the indices within Interlaced Sparse Self-Attention, we use an example in Fig. 12 to explain the details.

H. More details of Pyramid-OC.

We explain the details of Pyramid-OC as follows. Given an input feature map $\mathbf{X}$ of shape ${H\times W\times C}$ , we first divide it into $k\times k$ groups ( $k\in\{1,2,3,6\}$ ) following the pyramid partitions of PPM (Zhao et al., 2017):

We compute four different context feature maps $\{\mathbf{Z}^{1},\mathbf{Z}^{2},\mathbf{Z}^{3},\mathbf{Z}^{6}\}$ based on four different pyramid partitions. Last, we concatenate these context feature maps:
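The grouping step can be illustrated with a small sketch. It assumes that the $k\times k$ groups are equal-sized rectangles and that $H$ and $W$ are divisible by $k$; the helper name is hypothetical, and the sketch covers only the partition, not the computation of each $\mathbf{Z}^{k}$ or the concatenation.

```python
def pyramid_groups(H, W, k):
    """Split an H x W grid into k x k rectangular groups of pixel coordinates."""
    if H % k or W % k:
        raise ValueError("this sketch assumes H and W are divisible by k")
    gh, gw = H // k, W // k
    groups = {}
    for y in range(H):
        for x in range(W):
            # Pixels sharing the same pair of integer quotients land in one group.
            groups.setdefault((y // gh, x // gw), []).append((y, x))
    return groups


H = W = 12  # toy size divisible by every k below
for k in (1, 2, 3, 6):
    g = pyramid_groups(H, W, k)
    print(k, len(g), len(g[(0, 0)]))  # number of groups, pixels per group
```

Under these assumptions, each partition covers every pixel exactly once, and a larger $k$ yields more groups with fewer pixels each.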

I. Checkered artefact with ISA.

We can observe a checkered artefact in Fig. 6 and Fig. 7, which is caused by our implementation of the visualization of the global relation, local relation and dense relation. We illustrate the related pseudo-code in Algorithm 2. Specifically, for a selected pixel, we multiply a set of global relation matrices (associated with the pixels that belong to the same group as the selected pixel in the local relation stage) with its local relation matrix. Therefore, the shape of the checkerboard is exactly the same as the shape of the global relation map. For example, we zoom in on the dense relation and global relation of some examples in Fig. 13.