Expectation-Maximization Attention Networks for Semantic Segmentation

Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, Hong Liu

Introduction

Semantic segmentation is a fundamental and challenging problem of computer vision, whose goal is to assign a semantic category to each pixel of the image. It is critical for various tasks such as autonomous driving, image editing and robot sensing. In order to accomplish the semantic segmentation task effectively, we need to distinguish some confusing categories and take the appearance of different objects into account. For example, ‘grass’ and ‘ground’ have similar color in some cases and ‘person’ may have various scales, figures and clothes in different locations of the image. Meanwhile, the label space of the output is quite compact and the amount of the categories for a specific dataset is limited. Therefore, this task can be treated as projecting data points in a high-dimensional noisy space into a compact sub-space. The essence lies in de-noising these variation and capturing the most important semantic concepts.

Recently, many state-of-the-art methods based on fully convolutional networks (FCNs) have been proposed to address the above issues. Due to the fixed geometric structures, they are inherently limited by local receptive fields and short-range contextual information. To capture long-range dependencies, several works employ the multi-scale context fusion , such as astrous convolution , spatial pyramid , large kernel convolution and so on. Moreover, to keep more detailed information, the encoder-decoder structures are proposed to fuse mid-level and high-level semantic features. To aggregate information from all spatial locations, attention mechanism is used, which enables the feature of a single pixel to fuse information from all other positions. However, the original attention-based methods need to generate a large attention map, which has high computation complexity and occupies a huge number of GPU memory. The bottleneck lies in that both the generation of attention map and its usage are computed w.r.t all positions.

Towards the above issues, in this paper, we rethink the attention mechanism from the view of expectation-maximization (EM) algorithm and propose a novel attention-based method, namely Expectation-Maximization Attention (EMA). Instead of treating all pixels themselves as the reconstruction bases , we use the EM algorithm to find a more compact basis set, which can largely reduce the computational complexity. In detail, we regard the bases for construction as the parameters to learn in the EM algorithm and attention maps as latent variables. In this setting, the EM algorithm aims to find a maximum likelihood estimate of parameters (bases). Given the current parameters, the expectation (E) step works as estimating the expectation of attention map and maximization (M) step functions as updating the parameters (bases) by maximizing the complete data likelihood. The E step and the M step execute alternately. After convergence, the output can be computed as the weighted sum of bases, where the weights are the normalized final attention maps. The pipeline of EMA is shown in Fig. 1.

We further embed the proposed EMA method into a module for neural network, which is named EMA Unit. EMA Unit can be simply implemented by common operators. It is also light-weighted and can be easily embedded into existing neural networks. Moreover, to make full use of its capacity, we also propose two more methods to stabilize the training process of EMA Unit. We also evaluate its performance on three challenging datasets.

The main contributions of this paper are listed as follows:

We reformulate the self-attention mechanism into an expectation-maximization iteration manner, which can learn a more compact basis set and largely reduce the computational complexity. To the best of our knowledge, this paper is the first to introduce EM iterations into attention mechanism.

We build the proposed expectation-maximization attention as a light-weighted module for neural network and set up specific manners for bases’ maintenance and normalization.

Extensive experiments on three challenging semantic segmentation datasets, including PASCAL VOC, PASCAL Context and COCO Stuff, demonstrate the superiority of our approach over other state-of-the-art methods.

Related Works

Semantic segmentation. Fully convolutional network (FCN) based methods have made great progress in image semantic segmentation by leveraging the powerful convolutional features of classification networks pre-trained on large-scale data . Several model variants are proposed to enhance the multi-scale contextual aggregation. For example, DeeplabV2 makes use of the astrous spatial pyramid pooling (ASPP) to embed contextual information, which consists of parallel dilated convolutions with different dilated rates. DeeplabV3 extends ASPP with image-level feature to further capture global contexts. Meanwhile, PSPNet proposes a pyramid pooling module to collect contextual information of different scales. GCN adopts decoupling of large kernel convolution to gain a large receptive field for the feature map and capture long-range information.

For the other type of variants, they mainly focus on predicting more detailed output. These methods are based on U-Net , which combines the advantages of high-level features with mid-level features. RefineNet makes use of the Laplacian image pyramid to explicitly capture the information available along the down-sampling process and output predictions from coarse to fine. DeeplabV3+ adds a decoder upon DeeplabV3 to refine the segmentation results especially along object boundaries. Exfuse proposes a new framework to bridge the gap between low-level and high-level features and thus improves the segmentation quality.

Attention model. Attention is widely used for various tasks such as machine translation, visual question answering and video classification. The self-attention methods calculate the context coding at one position by a weighted summation of embeddings at all positions in sentences. Non-local first adopts self-attention mechanism as a module for computer vision tasks, such as video classification, object detection and instance segmentation. PSANet learns to aggregate contextual information for each position via a predicted attention map. $A^{2}$ Net proposes the double attention block to distribute and gather informative global features from the entire spatio-temporal space of the images. DANet applies both spatial and channel attention to gather information around the feature maps, which costs even more computation and memory than the Non-local method.

Our approach is motivated by the success of attention in the above works. We rethink the attention mechanism from the view of the EM algorithm and compute the attention map in an iterative manner as the EM algorithm.

Preliminaries

Before introducing our proposed method, we first review three highly correlated methods, that is the EM algorithm, the Gaussian mixture model and the Non-local module.

The expectation-maximization (EM) algorithm aims to find the maximum likelihood solution for latent variables models. Denote $\mathbf{X}=\left\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\right\}$ as the data set which consists of $N$ observed samples and each data point $\mathbf{x}_{i}$ has its corresponding latent variable $\mathbf{z}_{i}$ . We call $\left\{\mathbf{X},\mathbf{Z}\right\}$ the complete data and its likelihood function takes the form $\ln p\left(\mathbf{X},\mathbf{Z}|\bm{\theta}\right)$ , where $\bm{\theta}$ is the set of all parameters of the model. In practice, the only knowledge of latent variables in $\mathbf{Z}$ is given by the posterior distribution $p\left(\mathbf{Z}|\mathbf{X},\bm{\theta}\right)$ . The EM algorithm is designed to maximize the likelihood $\ln p\left(\mathbf{X},\mathbf{Z}|\bm{\theta}\right)$ by two steps, i.e., the E step and the M step.

The EM algorithm executes the E step and the M step alternately until the convergence criterion is satisfied.

2 Gaussian Mixture Model

Gaussian mixture model (GMM) is a special case of the EM algorithm. It takes the distribution of data $\mathbf{x}_{n}$ as a linear superposition of Gaussians:

where the mean $\bm{\mu}_{k}$ and the covariance $\mathbf{\Sigma}_{k}$ are parameters for the $k$ -th Gaussian basis. Here we leave out the prior $\pi_{k}$ . The likelihood of the complete data is formulated as:

where $\sum_{k}z_{nk}=1$ . $z_{nk}$ can be viewed as the responsibility that the $k$ -th basis takes for the observation $\mathbf{x}_{n}$ . For GMM, in the E step, the expected value of $z_{nk}$ is given by:

In the M step, the parameters are re-estimated as follows:

In real applications, we can simply replace $\mathbf{\Sigma}_{k}$ as the identity matrix $\mathbf{I}$ and leave out the $\mathbf{\Sigma}_{k}$ in the above equations.

3 Non-local

The Non-local module functions the same as the self-attention mechanism. It can be formulated as:

where $f\left(\cdot,\cdot\right)$ represents a general kernel function, $\mathcal{C}\left(\mathbf{x}\right)$ is a normalization factor and $\mathbf{x}_{i}$ denotes the feature vector for the location $i$ . As this module is applied upon the feature map of convolutional neural networks (CNN).

Considering that $\mathcal{N}\left(\mathbf{x}_{n}|\bm{\mu}_{k},\mathbf{\Sigma}_{k}\right)$ in Eq. (5) is a specific kernel function between $\mathbf{x}_{n}$ and $\bm{\mu}_{k}$ , Eq. (8) is just a specific design of Eq. (9). Then, from the viewpoint of GMM, the Non-local module is just a re-estimation of $\mathbf{X}$ , without E steps and M steps. Specifically, $\bm{\mu}$ is just selected as the $\mathbf{X}$ in Non-local.

In GMM, the number of Gaussian bases is selected manually and usually satisfies $K\ll N$ . But in the Non-local module, the bases are selected as the data themselves, so it has $K=N$ . There are two obvious disadvantages of the Non-local module. First, the data are lying in a low dimensional manifold, so the bases are over-complete. Second, the computation overhead is heavy and the memory cost is also large.

Expectation-Maximization Attention

In view of the high computational complexity of the attention mechanism and limitations of the Non-local module, we first propose the expectation-maximization attention (EMA) method, which is an augmented version of self-attention. Unlike the Non-local module that selects all data points as bases, we use the EM iterations to find a compact basis set.

where $\mathcal{K}$ represents the general kernel function. And now, Eq. (5) can be reformulated into a more general form:

where $\lambda$ is a hyper-parameter to control the distribution of $\mathbf{Z}$ .

2 Likelihood Maximization

3 Data Re-estimation

EMA Unit

In order to better incorporate the proposed EMA with deep neural networks, we further propose the Expectation-maximization Attention Unit (EMAU) and apply it to semantic segmentation task. In this section, we will describe EMAU in detail. We first introduce the overall structure of EMAU and then discuss bases’ maintenance and normalization mechanisms.

2 Bases Maintenance

Another issue for the EM algorithm is the initialization of the bases. The EM algorithm is guaranteed to converge, because the likelihood of complete data is limited, and at each iteration both E and M steps lift its current lower bound. However, converging to global maximum is not guaranteed. Thus, the initial values of bases before iterations are of great importance.

We only describe how EMA is used to process one image above. However, for a computer vision task, there are thousand of images in a dataset. As each image $\mathbf{X}$ has different pixel feature distributions from others, it is not suitable to use the $\bm{\mu}$ computed upon an image to reconstruct feature maps of other images. So we run EMA on each image.

For the first mini-batch, we initialize $\bm{\mu}^{(0)}$ using Kaiming’s initialization , where we treat matrix multiplication as a $1\times 1$ convolution. For the following mini-batches, one simple choice is to update $\bm{\mu}^{(0)}$ using standard back propagation. However, as iterations of $A_{E}$ and $A_{M}$ can be unrolled as a recurrent neural network (RNN), the gradients propagating though them will encounter the vanishing or explosion problem. Therefore, the updating of $\bm{\mu}^{(0)}$ is unstable, and the training procedure of EMA Unit may collapse.

In this paper, we use the moving averaging to update $\bm{\mu}^{(0)}$ in the training process. After iterating over an image, the generated $\bm{\mu}^{(T)}$ can be regarded as a biased update of $\bm{\mu}^{(0)}$ , where the bias comes from the image sampling process. To make it less biased, we first average $\bm{\mu}^{(T)}$ over a mini-batch and get the $\bm{\bar{\mu}}^{(T)}$ . Then we update $\bm{\mu}^{(0)}$ as:

where $\alpha\in\left[0,1\right]$ is the momentum. For inference, the $\bm{\mu}^{(0)}$ keeps fixed. This moving averaging mechanism is also adopted in batch normalization (BN) .

3 Bases Normalization

To this end, we need to apply normalization upon $\bm{\mu}^{(t)}$ . At the first glance, BN or layer normalization (LN) sound to be good choices. However, these aforementioned normalization methods will change the direction of each basis $\bm{\mu}_{k}^{(t)}$ , which changes their properties and semantic meanings. To keep the direction of each basis untouched, we choose Euclidean normalization (L2Norm), which divides each $\bm{\mu}_{k}^{(t)}$ by its length. By applying it, $\bm{\mu}^{(t)}$ then lies in a $K$ -dimensional united hyper-sphere, and sequence of $\left\{\bm{\mu}_{k}^{(0)},\bm{\mu}_{k}^{(1)},\cdots,\bm{\mu}_{k}^{(T)}\right\}$ forms a trajectory on it.

4 Compare with the Double Attention Block

$A^{2}$ Net proposes the double attention block ( $A^{2}$ block), in which the output $\mathbf{Y}$ is computed as:

Experiments

To evaluate the proposed EMAU, we conduct extensive experiments on the PASCAL VOC dataset , the PASCAL Context dataset , and the COCO Stuff dataset . In this section, we first introduce implementation details. Then we perform ablation study to verify the superiority of proposed method on the PASCAL VOC dataset. Finally, we report our results on the PASCAL Context dataset and the COCO Stuff dataset.

The output stride of the backbone is set to $16$ for training on PASCAL VOC and PASCAL Context, and $8$ for training on COCO Stuff and evaluating on all datasets. To speed up the training procedure, we carry out all ablation studies on ResNet-50 , with batch size $12$ . For all models to be compared with state-of-the-art, we train them on ResNet-101, with batch size $16$ . We train 30K iterations on PASCAL VOC and COCO Stuff, and 15K on PASCAL Context. We use a $3\times 3$ convolution to reduce the channel number from $2,048$ to $512$ , and then stack EMAU upon it. We call the whole network as EMANet. We set the basis number $K=64$ , $\lambda=1$ and the number of iterations $T=3$ for training as default.

2 Results on the PASCAL VOC Dataset

We then compare the performances with no normalization, LN and L2Norm as described above. From the right part of Fig. 3, it is clear to see that LN is even worse than no normalization. Since it can partially relieve the gradient chores of RNN-like structure. The performance of LN and no normalization has little correlation with the number of iteration $T$ . By contrast, L2Norm’s performance increases as the iterations become larger and it outperforms LN and no normalization when $T\geq 3$ .

2.2 Ablation Study for Iteration Number

2.3 Comparisons with State-of-the-arts

We first thoroughly compare EMANet with three baselines, namely DeeplabV3, DeeplabV3+ and PSANet on the validation set. We report mIoU, FLOPs, memory cost and parameter numbers in Tab. 1. We can see that EMANet outperforms these three baselines by a large margin. Moreover, EMANet is much lighter in computation and memory.

We further compare our method with existing methods on the PASCAL VOC test set. Following previous methods , we train EMANet successively over COCO, the VOC trainaug and the VOC trainval set. We set the base learning rate as 0.009, 0.001 and 0.0001, respectively. We train 150K iterations on COCO, and 30K for the last two rounds. When inferring over the test set, we make use of multi-scale testing and left-right flipping.

As shown in Tab. 2, our EMANet sets the new record on PASCAL VOC, and improves DeeplabV3 with the same backbone by 2.0% in mIoU. Our EMANet achieves the best performance among networks with backbone ResNet-101, and outperforms the previous best one by 0.9%, which is significant due to the fact that this benchmark is very competitive. Moreover, it achieves the performance that is comparable with methods based on some larger backbones.

3 Results on the PASCAL Context Dataset

To verify the generalization of our proposed EMANet, we conduct experiments on the PASCAL Context dataset. Quantitative results of PASCAL Context are shown in Tab. 3. To the best of our knowledge, EMANet based on ResNet-101 achieves the highest performance on the PASCAL Context dataset. Even pretrained on additional data (COCO Stuff), SGR+ is still inferior to EMANet.

4 Results on the COCO Stuff Dataset

To further evaluate the effectiveness of our method, we also carry out experiments on the COCO Stuff dataset. Comparisons with previous state-of-the-art methods are shown in Tab. 4. Remarkably, EMANet achieves 39.9% in mIoU and outperforms previous methods by a large margin.

5 Visualization of Bases Responsibilities

Conclusion

In this paper, we propose a new type of attention mechanism, namely the expectation-maximization attention (EMA), which computes a more compact basis set by iteratively executing as the EM algorithm. The reconstructed output of EMA is low-rank and robust to the variance of input. We well formulate the proposed method as a light-weighted module that can be easily inserted to existing CNNs with little overhead. Extensive experiments on a number of benchmark datasets demonstrate the effectiveness and efficiency of the proposed EMAU.

Acknowledgment

Zhouchen Lin is supported by National Basic Research Program of China (973 Program) (Grant no. 2015CB352502), National Natural Science Foundation (NSF) of China (Grant nos. 61625301 and 61731018), Qualcomm and Microsoft Research Asia. Hong Liu is supported by National Natural Science Foundation of China (Grant nos. U1613209 and 61673030) and funds from Shenzhen Key Laboratory for Intelligent Multimedia and Virtual Reality (ZDSYS201703031405467).