Semi-convolutional Operators for Instance Segmentation

David Novotny, Samuel Albanie, Diane Larlus, Andrea Vedaldi

Introduction

State-of-the-art methods for detecting objects in images, such as R-CNN , YOLO , and SSD , can be seen as variants of the same paradigm: a certain number of candidate image regions are proposed, either dynamically or from a fixed pool, and then a convolutional neural network (CNN) is used to decide which of these regions tightly enclose an instance of the object of interest. An important advantage of this strategy, which we call propose & verify (P&V), is that it works particularly well with standard CNNs. However, P&V also has several significant shortcomings, starting from the fact that rectangular proposals can only approximate the actual shape of objects; segmenting objects, in particular, requires a two-step approach where, as in Mask R-CNN , one first detects object instances using simple shapes such as rectangles, and only then refines the detections to pixel-accurate segmentations.

An alternative to P&V that can overcome such limitations is to label directly individual pixels with an identifier of the corresponding object occurrence. This approach, which we call instance coloring (IC), can efficiently represent any number of objects of arbitrary shape by predicting a single label map. Thus IC is in principle much more efficient than P&V. Another appeal of IC is that it can be formulated as an image-to-image regression problem, similar to other image understanding tasks such as denoising, depth and normal estimation, and semantic segmentation. Thus this strategy may allow to more easily build unified architectures such as that can solve instance segmentations together with other problems.

Despite the theoretical benefits of IC, however, P&V methods currently dominate in terms of overall accuracy. The goal of this paper is to explore some of the reasons for this gap and to suggest workarounds. Part of the problem may be in the nature of the dense labels. The most obvious way of coloring objects is to number them and “paint” them with their corresponding number. However, the latter is a global operation as it requires to be aware of all the objects in the image. CNNs, which are local and translation invariant, may therefore be ill-suited for direct enumeration. Several authors have thus explored alternative coloring schemes more suitable for convolutional networks. A popular approach is to assign an arbitrary color (often in the guise of a real vector) to each object occurrence, with the only requirement that different colors should be used for different objects . The resulting color affinities can then be used to easily enumerate object a posteriori via a non-convolutional algorithm.

In this paper, we argue that even the latter technique is insufficient to make IC amenable to computation by CNNs. The reason is that, since CNNs are translation invariant, they must still assign the same color to identical copies of an object, making replicas indistinguishable by convolutional coloring. This argument, which is developed rigorously in sec. 3.6, holds in the limit since in practice the receptive field size of most CNNs is nearly as large as the whole image; however, it suggests that the convolutional structure of the network is at least an unnatural fit for IC.

In order to overcome this issue, we suggest that an architecture used for IC should not be translation invariant; while this may appear to be a significant departure from convolutional networks, we also show that a small modification of standard CNNs can overcome the problem. We do so by defining semi-convolutional operators which mix information extracted from a convolutional network with information about the global location of a pixel (sec. 3.1 and 1). We train the latter (sec. 3.2) so that the response of the operator is similar for all pixels that belong to the same object instance, making this embedding naturally suited for IC. We show that, if the mixing function is additive, then the resulting operator bears some resemblance to Hough voting and related detection approaches. After extending the embedding to incorporate standard convolutional responses that capture appearance cues (sec. 3.3), we use it to induce pixel affinities and show how the latter can be interpreted as a steered version of a bilateral kernel (sec. 3.4). Finally, we show how such affinities can also be integrated in methods such as Mask RCNN (sec. 3.5).

We assess our method with several experiments. We start by investigating the limit properties of our approach on simple synthetic data. Then, we show that our semi-convolutional feature extractor can be successfully combined with state-of-the-art approaches to tackle parsing of biological images containing overlapping and articulated organisms (sec. 4.2). Finally, we apply the latter to a standard instance segmentation benchmark PASCAL VOC (sec. 4.3). We show in all such cases that the use of semi-convolutional features can improve the performance of state-of-the-art instance segmentation methods such as Mask RCNN.

Related work

The past years have seen large improvements in object detection, thanks to powerful baselines such as Faster-RCNN , SSD or other similar approaches , all from the propose & verify strategy.

Following the success of object detection and semantic segmentation, the challenging task of instance-level segmentation has received increasing attention. Several very different families of approaches have been proposed.

Proposal-based instance segmentation. While earlier methods relied on bottom-up segmentations , the vast majority of recent instance-level approaches combine segment proposals together with powerful object classifiers. In general, they implement a multi-stage pipeline that first generates region proposals or class agnostic boxes, and then classifies them . For instance DeepMask and follow-up approaches learn to propose segment candidates that are then classified. The MNC approach , based on Faster-RCNN , repeats this process twice while does it multiple times. extends to model the shape of objects. The fully convolutional instance segmentation method of also combines segmentation proposal and object detection using a position sensitive score map.

Some methods start with semantic segmentation first, and then cut the regions obtained for each category into multiple instances , possibly involving higher-order CRFs .

Among the most successful methods to date, Mask-RCNN extends Faster R-CNN with a small fully convolutional network branch producing segmentation masks for each region of interest predicted by the detection branch. Despite its outstanding results, Mask-RCNN does not come without shortcomings: it relies on a small and predefined set of region proposals and non-maximum suppression, making it less robust to strong occlusions, crowded scenes, or objects with fundamentally non-rectangular shapes (see detailed discussion in sec. 3.6).

Instance-sensitive embeddings. Some works have explored the use of pixel-level embeddings in the context of clustering tasks, employing them as a soft, differentiable proxy for cluster assignments . This is reminiscent of unsupervised image segmentation approaches . It has been used for body joints , semantic segmentation and optical flow , and, more relevant to our work, to instance segmentation .

The goal of this type of approaches is to bring points that belong to the same instance close to each other in an embedding space, so that the decision for two pixels to belong to the same instance can be directly measured by a simple distance function. Such an embedding requires a high degree of invariance to the interior appearance of objects.

Among the most recent methods, combines the embedding with a greedy mechanism to select seed pixels, that are used as starting points to construct instance segments. connects embeddings, low rank matrices and densely connected random fields. embeds the pixels and then groups them into instances with a variant of mean-shift that is implemented as a recurrent neural network. All these approaches are based on convolutions, that are local and translation invariant by construction, and consequently are inherently ill-suited to distinguish several identical instances of the same object (see more details about the convolutional coloring dilemma in sec. 3.6). A recent work employs position sensitive convolutional embeddings that regress the location of the centroid of each pixel’s instance. We mainly differ by allowing embeddings to regress an unconstrained representative point of each instance.

Among other approaches using a clustering component, leverages a coverage loss and make use of depth information. In particular, trains a network to predict each pixel direction towards its instance center along with monocular depth and semantic labeling. Then template matching and proposal fusion techniques are applied.

Other instance segmentation approaches. Several methods move away from box proposals and use Faster-RCNN to produce “centerness” scores on each pixel instead. They directly predict the mask of each object in a second stage. An issue with such approaches is that objects do not necessarily fit in the receptive fields.

Recurrent approaches sequentially generate a list of individual segments. For instance, uses an LSTM for detection with a permutation invariant loss while uses an LSTM to produce binary segmentation masks for each instance. extends by refining segmentations in each window using a box network. These approaches are slow and do not scale to large and crowded images.

Some approaches use watershed algorithms. predicts pixel-level energy values and then partition the image with a watershed algorithm. combines a watershed algorithm with an instance aware boundary map. Such methods create disconnected regions, especially in the presence of occlusion.

Method

In this paper, we are interested in methods that reduce instance segmentation to a pixel-labeling problem. Namely, we seek to learn a function $\Phi:\mathcal{X}\rightarrow\mathcal{L}^{\Omega}$ that associates to each pixel $u$ a certain label $\Phi_{u}(\mathbf{x})\in\mathcal{L}$ so that, as a whole, labels encode the segmentation $\mathcal{S}_{\mathbf{x}}$ . Intuitively, this can be done by painting different regions with different “colors” (aka pixel labels) making objects easy to recover in post-processing. We call this process instance coloring (IC).

If this is the case, clustering colors trivially reconstructs the regions.

Unfortunately, it is difficult for a convolutional operator $\Phi$ to satisfy constraint (1) or analogous ones. While this is demonstrated formally in sec. 3.6, for now an intuition suffices: if the image contains replicas of the same object, then a convolutional network, which is translation invariant, must assign the same color to each copy.

If convolutional operators are inappropriate, then, we must abandon them in favor of non-convolutional ones. While this sounds complex, we suggest that very simple modifications of convolutional operators, which we call semi-convolutional, may suffice. In particular, if $\Phi_{u}(\mathbf{x})$ is the output of a convolutional operator at pixel $u$ , then we can construct a non-convolutional response by mixing it with information about the pixel location. Mathematically, we can define a semi-convolutional operator as:

where $f:\mathcal{L}\times\Omega\rightarrow\mathcal{L}^{\prime}$ is a suitable mixing function. As our main example of such an operator, we consider a particularly simple type of mixing function, namely addition. With it, eq. 2 specializes to:

While this choice is restrictive, it has the benefit of having a very simple interpretation. Suppose in fact that the resulting embedding can perfectly separate instances, in the sense that $\Psi_{u}(\mathbf{x})=\Psi_{v}(\mathbf{x})\Leftrightarrow\exists k:(u,v)\in S_{k}$ . Then for all the pixels of the region $S_{k}$ we can write in particular:

This establishes, a clear link between voting-based methods for object detection and coloring methods for instance segmentation. At the same time, there are significant differences. First, the goal here is to group pixels, not to reconstruct the parameters of an object instance (such as its centroid and scale). Eq. 3 may have this interpretation, but the more general version eq. 2 does not. Second, in methods such as Hough or ISM the centroid is defined a-priori as the actual center of the object; here the centroid $c_{k}$ has no explicit meaning, but is automatically inferred as a useful but arbitrary reference point. Third, in traditional voting schemes voting integrates local information extracted from individual patches; here the receptive field size of $\Phi_{u}(\mathbf{x})$ may be enough to comprise the whole object, or more. The goal of eq. 2 and 3 is not to pool local information, but to solve a representational issue.

2 Learning additive semi-convolutional features

Learning the semi-convolutional features of eq. 2 can be formulated in many different ways. Here we adopt a simple direct formulation inspired by and build a loss by considering, for each image $\mathbf{x}$ and instance $S\in\mathcal{S}$ in its segmentation, the distance between the embedding of each pixel $u\in S$ and the segment-wise mean of these embeddings:

Note that while this quantity resembles the variance of the embedding values for each segment, it is not as the distance is not squared; this was found to be more robust.

Note also that this loss is simpler than the margin condition (1) and than the losses proposed in , which resemble (1) more closely. In particular, this loss only includes an “attractive” force which encourages embeddings for each segment to be all equal to a certain mean value, but does not explicitly encourage different segments to be assigned different embedding values. While this can be done too, empirically we found that minimizing eq. 5 is sufficient to learn good additive semi-convolutional embeddings.

3 Coloring instances using individuals’ traits

In practice, very rarely an image contains exact replicas of a certain object. Instead, it is more typical for different occurrences to have some distinctive individual traits. For example, different people are generally dressed in different ways, including wearing different colors. In instance segmentation, one can use such cues to tell right away an instance from another. Furthermore, these cues can be extracted by conventional convolutional operators.

In this manner, the last $d-2$ dimensions of the embedding work as conventional convolutional features and can extract instance-specific traits normally.

4 Steered bilateral kernels

The pixel embedding vectors $\Psi_{u}(\mathbf{x})$ must ultimately be decoded as a set of image regions. Again, there are several possible strategies, starting from simple $K$ -means clustering, that can be used to do so. In this section, we consider transforming embeddings in an affinity matrix between two pixels, as the latter can be used in numerous algorithms.

In order to define the affinity between pixels $u,v\in\Omega$ , consider first the Gaussian kernel

The bilateral kernel is very popular in many applications, including image filtering and mean shift clustering. The idea of the bilateral kernel is to consider pixels to be similar if they are close in both space and appearance. Here we have shown that kernel (8) and hence kernel (7) can be interpreted as a generalization of this kernel where spatial locations are steered (distorted) by the network to move pixels that belong to the same underlying object instance closer together.

In a practical implementation of these kernels, vectors should be rescaled before being compared, for example in order to balance spatial and appearance components. In our case, since embeddings are trained end-to-end, the network can learn to perform this balancing automatically, but for the fact that (4) implicitly defines the scaling of the spatial component of the kernel. Hence, we modify eq. 7 in two ways: by introducing a learnable scalar parameter $\sigma$ and by considering a Laplacian rather than a Gaussian kernel:

This kernel is more robust to outliers (as it uses the Euclidean distance rather than its square) and is still positive definite . In the next section we show an example of how this kernel can be used to perform instance coloring.

5 Semi-convolutional Mask-RCNN

The semi-convolutional framework we proposed in sec. 3.1 is very generic and can be combined with many existing approaches. Here, we describe how it can be combined with the Mask-RCNN (MRCNN) framework , the current state-of-the-art in instance segmentation.

Extending MRCNN. Our approach is based on two intuitions: first, some points are easier to be recognized as foreground than others, and, second, once one such seed point has been determined, its affinity with other pixels can be used to cut out the foreground region.

In practice, we first identify a seed pixel $u_{s}$ in each region $R$ using the MRCNN foreground confidence score map $\mathbf{s}=[s(u_{1}),\dots,s(u_{|R|})]$ . We select the most confident seed point as $u_{s}=\operatorname{argmax}_{1\leq i\leq|R|}s(u_{i})$ , evaluate the steered bilateral kernel $K_{\sigma}(u_{s},u)$ after extracting the embeddings $\Psi_{u_{s}}$ for the seed and $\Psi_{u_{i}}$ of each pixel $u_{i}$ in the region, and then defining updated scores $\hat{s}(u_{i})$ as $\hat{s}(u_{i})=s(u_{i})+\log{K_{\sigma}(u_{s},u_{i})}$ . The combination of the scores and the kernel is performed in the log-space due to improved numerical stability. The final per-pixel foreground probabilities are obtained as in with $\text{sigmoid}(\hat{s}(u_{i}))$ .

The entire architecture —the region selection mechanism, the foreground prediction, and the pixel-level embedding —is trained end-to-end. For differentiability, this requires the following modifications: we replace the maximum operator with a soft maximum over the scores $\mathbf{p}_{s}=\text{softmax}(\mathbf{s})$ and we obtain the seed embedding $\Psi_{u_{s}}$ as the expectation over the embeddings $\Psi_{u}$ under the probability density $\mathbf{p}_{s}$ . The network optimizer minimizes, together with the MRCNN losses, the image-level embedding loss $\mathcal{L}(\Psi|\mathbf{x},\mathcal{S})$ and further attaches a secondary binary cross entropy loss that, similar to the MRCNN mask predictor, minimizes binary cross entropy between the kernel output $K_{\sigma}(u_{s},u_{i})$ and the ground truth instance masks.

The predictors of our semi-convolutional features $\Psi_{u}$ were implemented as an output of a shallow subnetwork, shared between all the FPN layers. This subnet consists of a 256-channel 1 $\times$ 1 convolutional filter followed by ReLU and a final 3 $\times$ 3 convolutional filter producing $D=8$ dimensional embedding $\Psi_{u}$ . Due to an excessive sensitivity of the RPN component to perturbations of the underlying FPN representation, we downscale the gradients that are generated by the shallow subnetwork and received by the shared FPN tensors by a factor of 10.

6 The convolutional coloring dilemma

In this section, we prove some properties of convolutional operators in relation to solving instance segmentation problems. In order to do this, we need to start by formalizing the problem.

There are two families of algorithms that can be used to segment signals in this manner, discussed next.

Propose & verify. The first family of algorithms submits all possible regions $S_{r}\subset\Omega$ , indexed for convenience by a variable $r$ , to a labeling function $\Phi_{r}(\mathbf{x})\in\{0,1\}$ that verifies which ones belong to the segmentation $\mathcal{S}_{\mathbf{x}}$ (i.e. $\Phi_{r}(\mathbf{x})=1\Leftrightarrow S_{r}\in\mathcal{S}_{\mathbf{x}}$ ). Since in practice it is not possible to test all possible subsets of $\Omega$ , such an algorithm must focus on a smaller set of proposal regions. A typical choice is to consider all translated squares (or rectangles) $S_{u}=[-H,H]^{m}+u$ . Since the index variable $u\in\Omega$ is now a translation, the operator $\Phi_{u}(\mathbf{x})$ has the form discussed above, although it is not necessarily local or translation invariant.

Instance coloring. The second family of approaches directly colors (labels) pixels with the index of the corresponding region, i.e. $\Phi_{u}(\mathbf{x})=k\Leftrightarrow u\in S_{k}.$ Differently from P&V, this can efficiently represent arbitrary shapes. However, the map $\Phi$ needs implicitly to decide which number to assign to each region, which is a global operation. Several authors have sought to make this more amenable to convolutional networks. A popular approach is to color pixels arbitrarily (for example using vector embeddings) so that similar colors are assigned to pixels in the same region and different colors are used between regions, as already detailed in eq. 1.

Convolutional coloring dilemma. Here we show that, even with the variants discussed above, IC cannot be approached with convolutional operators even for cases where these would work with P&V.

On the other hand, this problem can be solved by P&V using the proposal set $\{+u,u\in\Omega\}$ and the local and translation invariant verification function $\Phi_{u}(\mathbf{x})=[x_{u}=1]$ , which detects the center of each region.

The latter is an extreme example of a convolutional coloring dilemma: namely, a local and translation invariant operator will naturally assign the same color to identical copies of an object even if when they are distinct occurrences (c.f. interesting concurrent work that explores related convolutional dilemmas ).

Solving the dilemma. Solving the coloring dilemma can be achieved by using operators that are not translation invariant. In the counterexample above, this can be done by using the semi-convolutional function $\Phi_{u}(x)=u+(1-x_{u})\dot{x}_{u}$ . It is easy to show that $\Phi_{u}(x)=2k$ colors each pixel $u\in S_{k}=+2k$ with twice the index of the corresponding region by moving each point $u$ to the center of the closest region. This works because such displacements can be computed by looking only locally, based on the shape of the signal.

Experiments

We first conduct experiments on synthetic data in order to clearly demonstrate inherent limitations of convolutional operators for the task of instance segmentation. In the ensuing parts we demonstrate benefits of the semi-convolutional operators on a challenging scenario with a high number of overlapping articulated instances and finally we compare to the competition on a standard instance segmentation benchmark.

In sec. 3.6 and 3.1 we suggested that convolution operators are unsuitable for instance segmentation via coloring, but that semi-convolutional ones can do. These experiments illustrate this point by learning a deep neural network to segment a synthetic image $x_{S}$ where object instances correspond to identical dots arranged in a regular grid (fig. 3 (a)).

We use a network consisting of a pretrained ResNet50 model truncated after the Res2c layer, followed by a set of 1 $\times$ 1 filters that, for each pixel $u$ , produce 8-dimensional pixel embeddings $\Phi_{u}(x_{S})$ or $\Psi_{u}(x_{S})$ . We optimize the network by minimizing the loss from eq. 5 with stochastic gradient descent. Then, the embeddings corresponding to the foreground regions are extracted and clustered with the $k$ -means algorithm into $K$ clusters, where $K$ is the true number of dots present in the synthetic image.

Fig. 3 visualizes the results. Clustering the features consisting of the position invariant convolutional embedding $\Phi_{u}(x_{S})$ results in nearly random clusters (fig. 3 (c)). On the contrary, the semi-convolutional embedding $\Psi_{u}(x_{S})=\Phi_{u}(x_{S})+u$ allows to separate the different instances almost perfectly when compared to the ground truth segmentation masks (fig. 3 (d)).

2 Parsing biological images

The second set of experiments considers the parsing of biological images. Organisms to be segmented present non-rigid pose variations, and frequently form clusters of overlapping instances, making the parsing of such images challenging. Yet, this scenario is of crucial importance for many biological studies.

Dataset and evaluation. We evaluate our approach on the C. Elegans dataset (illustrated fig. 4), a subset of the Broad Biomedical Benchmark collection . The dataset consists of $100$ bright-field microscopy images. Following standard practice , we operate on the binary segmentation of the microscopy images. However, since there is no publicly defined evaluation protocol for this dataset, a fair numerical comparison with previously published experiments is infeasible. We therefore compare our method against a very strong baseline (MRCNN) and adopt the methodology introduced by in which the dataset is divided into $50$ training and $50$ test images. We evaluate the segmentation using average precision (AP) computed using the standard COCO evaluation criteria . We compare our method against the MRCNN FPN-101 model from which attains results on par with state of the art on the challenging COCO instance segmentation task.

Results. The results are given in table 1. We observe that the semi-convolutional embedding $\Psi_{u}$ brings improvements in all considered instance segmentation metrics. The improvement is more significant at higher IoU thresholds which underlines the importance of utilizing position sensitive embedding in order to precisely delineate an instance within an MRCNN crop.

3 Instance segmentation

The final experiment compares our method to competition on the instance segmentation task on a standard large scale dataset, PASCAL VOC 2012 .

As in the previous section, we base our method on the MRCNN FPN-101 model. Because we observed that the RPN component is extremely sensitive to changes in the base architecture, we employed a multistage training strategy. First, MRCNN FPN-101 model is trained until convergence and then our embeddings are attached and fine-tuned with the rest of the network . We follow and learn using 24 SGD epochs, lowering the initial learning rate of 0.0025 tenfold after the first 12 epochs. Following other approaches, we train on the training set of VOC 2012 and test on the validation set.

Results. The results are given in table 2. Our method attains state of the art on PASCAL VOC 2012 which validates our approach. We further compare in detail against MRCNN in table 3 using the standard COCO instance segmentation metrics from . Our method outperforms MRCNN on the considered metrics, confirming the contribution of the proposed semi-convolutional embedding.

Conclusions

In this paper, we have considered dense pixel embeddings for the task of instance-level segmentation. Departing from standard approaches that rely on translation invariant convolutional neural networks, we have proposed semi-convolutional operators which can be easily obtained with simple modifications of the convolutional ones. On top of their theoretical advantages, we have shown empirically that they are much more suited to distinguish several identical instances of the same object, and are complementary to the standard Mask-RCNN approach.

Acknowledgments. We gratefully acknowledge the support of Naver, EPSRC AIMS CDT, AWS ML Research Award, and ERC 677195-IDIU.