Learning deep structured active contours end-to-end

Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, Raquel Urtasun

Introduction

Accurate footprints of individual buildings are of paramount importance for a wide range of applications, such as census studies , disaster response after earthquakes and developmental assistances like malaria control . Automating large-scale building footprint segmentation has thus been an active research field, and the emergence of high-capacity models like fully convolutional networks (FCNs) , together with vast training data , has led to promising improvements in this field.

Most studies address semantic segmentation of buildings, which consists of inferring a class label (e.g. “building”) densely for each pixel over the overhead image of interest . While this approach may provide global statistics such as building area coverage estimation, it comes short at yielding estimations at the instance level. In computer vision, this problem is known as instance segmentation, where models provide a segmentation mask on a per-object instance basis. Solving this task is far more challenging than semantic segmentation, since the model has to understand whether any two building pixels belong to the same building or not. Precise delineation of object borders, with sharp corners and straight walls in the case of buildings, is a task that CNNs generally perform poorly at : as a result, building segmentations from CNNs commonly have a high detection rate, but fail in terms of spatial coverage and geometric correctness.

Active Contour Models (ACM ), also called snakes, may be considered to address this issue. ACMs augment bottom-up boundary detectors with high-level geometric constraints and priors. They work by constraining the possible outputs to a family of curves (e.g. closed polygons with a fixed number of vertices), and optimizing them by means of energy minimization based on both the image features and a set of shape priors such as boundary continuity and smoothness. Additional terms have been proposed, among which the balloon term is of particular interest: it mimics the inflation of a balloon by continuously pushing the snakes’ vertices outwards, thus preventing it to collapse to a single point. By expressing object detection as a polygon fitting problem with prior knowledge, ACMs have the potential of approaching object edges precisely and without the need for additional post-processing. However, the original formulation lacked flexibility, since it relied on low-level image features and a global parameterization of priors, when a more useful approach would be to penalize strongly the curvature in the regions of the boundary known to be straight or smooth and reduce the penalization in the regions that are more likely to form a corner. Moreover, the balloon term has so far only been included as a post-energy global minimization force and does not take part in the energy minimization defining the snake.

In this paper, we propose to combine the expressiveness of deep CNNs with the versatility of ACMs in a unified framework, which we term Deep Structured Active Contours (DSAC). In essence, we employ a CNN to learn the energy function that would allow an ACM to generate polygons close to a set of ground truth instances. To do so, DSAC leverages the original ACM formulation by learning high-level features and prior parameterizations, including the balloon term, in one model and on a local basis, i.e. penalizing each term differently at each image location. We cast the optimization of the ACM as a structured prediction problem and find optimal features and parameters using a Structured Support Vector Machine (SSVM ) loss. As a consequence, DSAC is trainable end-to-end and able to learn and adapt to a particular family of object instances. We test DSAC in three building instance segmentation datasets, where it outperforms state-of-the-art models.

This work’s contributions are as follows:

We formulate the learning of the energy function of an ACM as a structured prediction problem;

We include the balloon term of the ACM into the energy formulation;

We propose an end-to-end framework to learn the guiding features and local priors with a CNN.

Related work

Most current automated approaches make use of 3D information extracted from ground or aerial LIDAR , or employ humans in the loop . The use of a polygonal shape prior has been shown to substantially improve the results of systems based on color imagery and low level features. Recent efforts employ deep CNNs for semantic segmentation and allowed a great leap towards full automation of building segmentation . Works considering building instance segmentation are scarcer and the task has been recently defined as far-from-being solved , despite the interest shown by the participation to numerous contests aiming at automatic vectorization of building footprints from overhead imagery: SpaceNethttps://wwwtc.wpengine.com/spacenet, DSTLhttps://www.kaggle.com/c/dstl-satellite-imagery-feature-detection or OpenAI Challengehttps://werobotics.org/blog/2018/01/10/open-ai-challenge/. Our proposed DSAC aims at making high-level geometric information available to CNN based methods as a step towards bridging this gap.

Since instance segmentation combines object detection and dense segmentation, many proposed pipelines attempt at fusing both tasks in either separate or end-to-end trainable models. For example, employ a multi-task CNN to detect candidate objects and infer segmentation masks and class labels per detection. train a CNN on pairs of locations and predicts the likelihood for the pair to belong to the same object. apply an attention-based RNN sequentially on deep image features to trace object instances in propagation order. refine an existing semantic segmentation map by predicting a distance transform to the nearest boundary. High level relationships are accounted for in by means of an instance MRF applied to the CNN’s output.

All these methods employ pixel-wise CNNs and are thus not apt to integrating output shape priors directly, as polygonal output models would be. Only a few works deal with CNNs that explicitly produce a polygonal output. In , a recursive neural network is used to generate a segmentation polygon node by node, while in a CNN predicts the direction of the nearest object boundary for each node in a polygon and uses it as a data term in an ACM. However, the first model is tailored towards a different problem (interactive segmentation and correction) and does not allow the inclusion of strong priors, and the second decouples the CNN training from ACM inference, thus lacking the end-to-end training capabilities of the proposed DSAC.

The first ACMs were introduced by Kass et al. in 1988 under the name of snakes . Variants of this original try to overcome some of its limitations, such as the need for precise initializations, or the dependence on user interaction. In the authors propose to use two coupled snakes that better capture the information in the image. The above mentioned balloon force was introduced by .

Although some modifications have been proposed to improve the data term of the original paper, they rely on simple assumptions about the appearance of the objects and on global parameters for weighting the different terms in the energy function. The proposed DSAC leverages the original formulation by including local prior information, i.e. values weighting the snakes’ energy function terms on a per-pixel basis, and learns them using a CNN. Although this work focuses on curvature priors useful for segmenting objects of polygonal shape, other priors can be enforced with ACMs, such as convexity for biomedical imaging .

Structured prediction allows to model dependencies between multiple output variables and hence offers an elegant way to incorporate prior rule sets on output configurations. End-to-end trainable structured models exceed traditional two-step solutions by enriching the learning signal with relations at the output level. Although these models have been applied to a variety of problems , we are not aware of any work dealing with instance level segmentation.

We use a structured loss as a learning signal to a CNN such that it learns to coordinate the different ACM energy terms, which are heavily interdependent.

Method

We present the details of a modified ACM inference algorithm with image-dependent and local penalization terms as well as the structured loss that is used to train a CNN to generate these penalization maps. A diagram of the proposed method is shown in Fig. 2. The proposed training algorithm proceeds as exposed in Algorithm 1.

Note that i) DSAC does not depend on any particular ACM inference algorithm, and ii) the chosen ACM algorithm does not need to be differentiable.

Due to their local nature, $D$ , $\beta$ and $\kappa$ are $U\times V$ maps in our experiments while $\alpha$ is treated as a single scalar.

This term identifies areas of the image where the nodes of the polygon should lie. In the literature, $D\big{(}\mathbf{x}\big{)}$ is usually some predefined function on the image, typically related to the image gradients. $D(\mathbf{x})$ should learn to provide relatively low values along the boundary of the object of interest and high values elsewhere. During ACM inference, the direction of steepest descent $-\nabla D(\mathbf{x})=-\big{[}\frac{\partial D(\mathbf{x})}{\partial u},\frac{\partial D(\mathbf{x})}{\partial v}]$ is used as the data force term, moving the contour towards regions where $D$ is low.

1.2 Internal terms

In the literature, the values of $\alpha$ and $\beta$ are generally a single scalar, meaning that the penalization has the same strength in all parts of the object. This leads to a trade-off between over-smoothing corner regions and under-smoothing others. We avoid this trade-off by assigning different $\beta$ penalizations to each pixel, depending on which part of the object lies underneath.

The internal energy $E_{int}=\alpha\big{(}\mathbf{x},(\mathbf{y}_{s})\big{)}|\mathbf{y}^{\prime}|^{2}+\beta\big{(}\mathbf{x},(\mathbf{y}_{s})\big{)}|\mathbf{y}^{\prime\prime}|^{2}$ penalizes the length (membrane term) and curvature (thin plate term) of the polygon. In order to obtain the direction of steepest descent, we can express the internal energy as a function of finite differences:

and compute the derivative of $E_{int}$ w.r.t. the coordinates of node $s$ , $\mathbf{y}_{s}$ , expressed as a sum of scalar products:

The Jacobian matrix (in this case with two column vectors) can then be expressed as a matrix multiplication:

where $A(\alpha)$ is a tri-diagonal matrix and $B(\beta)$ is a penta-diagonal matrix.

1.3 Balloon term

The original balloon term consists of adding an outwards force of constant magnitude in the normal direction of each node, thus inflating the contour. As with the $\beta$ term, we propose to increase its flexibility by allowing it to take a different value at each image location.

In , the balloon term is only considered as a force added after the direction of steepest descent for the other energy terms has been computed. In DSAC, the SSVM formulation requires to express it in the form an energy.

The normal direction to the contour at $\mathbf{y}_{s}$ follows the vector:

This can be rewritten such that the whole set of $L$ normal vectors is expressed as:

where $C$ is a tri-diagonal matrix with in the main diagonal, $1$ in the upper diagonal and $-1$ in the lower diagonal.

Integrating this expression with respect to $\mathbf{u}$ and $\mathbf{v}$ , we obtain the scalar $E_{b}$ , corresponding to the polygon’s area (by the shoelace formula to compute the area of a polygon):

After this modification we need to recompute the force form of this term by finding the $L\times 2$ Jacobian matrix $[\frac{\partial E_{k}}{\partial u_{s}},\frac{\partial E_{k}}{\partial v_{s}}],$ $s\in[1,L]$ .

This corresponds to how a perturbation in $u_{s}$ and $v_{s}$ would affect $E_{k}$ . Since the perturbations are considered to be very small, we assume that the distribution of the $\kappa(u,v)$ values along the segments $[\mathbf{y}_{s},\mathbf{y}_{s+1}]$ and $[\mathbf{y}_{s-1},\mathbf{y}_{s}]$ will be identical to the one in $[\mathbf{y}_{s}+\Delta\mathbf{y},\mathbf{y}_{s+1}]$ and $[\mathbf{y}_{s-1},\mathbf{y}_{s}+\Delta\mathbf{y}]$ , respectively. As shown in Fig. 3, this boils down to summing a series of trapezoid areas, forming the two depicted triangles, each one weighted by its assigned $\kappa$ value.

In Fig. 3a, both triangles have bases of length $\Delta u_{s}$ and heights $v_{s-1}-v_{s}$ and $v_{s+1}-v_{s}$ , while in Fig. 3b the bases are $\Delta v_{s}$ and the heights $u_{s-1}-u_{s}$ and $u_{s+1}-u_{s}$ .

To obtain the $\kappa$ weighted areas in Fig. 3a, we compute:

and therefore the force term we need for inference is:

The same for Fig. 3b can be obtained by swapping $u$ and $v$ .

These derivatives point in the normal direction when the values of $\kappa$ are equal in all locations.

2 Active contour inference and implementation

When solving the active contour inference, Eq. (1), the four energy terms can be split into external terms $E_{ext}$ : the data ( $D$ ) and balloon energies ( $E_{k}$ ); and internal terms $E_{int}$ : the energies penalizing length ( $\alpha$ ) and curvature ( $\beta$ ). Since $E_{int}$ depends only on the contour $\mathbf{y}$ , we can find an update rule that minimizes it on the new time time step:

If we solve this expression for $\mathbf{y}^{t+1}$ , we obtain:

With $I$ being the identity matrix. An efficient implementation of the ACM inference is critical for the usability of the method, since thousands of iterations are typically required by CNNs to be trained, and the ACM inference has to be performed at each iteration. We have implemented the described locally penalized ACM using a Tensorflow graph. The typical inference time is under 50 ms on a single CPU for the settings used in this paper.

3 Structured SVM loss

Since no ground truth is available for the penalization terms, we frame the problem as structured prediction, in which loss augmented inference is used to generate negative examples to complement the positive examples of the ground truth polygons. The weights of the energy terms can then be modified such that the energy corresponding to the ground truth is lowered, while the one of the loss augmented results, which are presumed to be wrong, is increased.

Given a collection of ground truth pairs $(\mathbf{y^{i}},\mathbf{x^{i}})\in\mathcal{Y}\times\mathcal{X},i=1\dots N$ , and a task loss function $\Delta(\mathbf{y},\mathbf{\hat{y}})$ , we would like to find the CNN parameters $\omega$ such that, by optimizing Eq. (1) and thus obtaining the inference result:

one could expect a small $\Delta(\mathbf{y}^{i},\mathbf{\hat{y}}^{i})$ . The problem becomes:

Since $\mathcal{L}(\mathcal{Y},\mathcal{X},\omega)$ is convex but not differentiable, we compute the subgradient, which requires to find the most penalized constraint with the current $\omega$ :

This means to first run the ACM using the current $\omega$ and an extra term corresponding to the loss $\Delta(\mathbf{y},\mathbf{y}^{i})$ . Once we obtain $\hat{\mathbf{y}}^{i}$ , we can then compute the subgradient as:

We compute the subgradients of the loss with respect to each of the four outputs as

In the above equations, $[\cdot]$ represents the Iverson bracket. Finally, we can get $\frac{\partial\mathcal{L}(\mathcal{Y},\mathcal{X},\omega)}{\partial\omega}$ using the chain rule and modifying each CNN parameter $\omega$ applying:

which will simultaneously decrease $E(\mathbf{y}^{i},\mathbf{x}^{i};\omega)$ and increase $E(\hat{\mathbf{y}}^{i},\mathbf{x}^{i};\omega)$ , thus making a better solution more likely when performing inference anew.

The task loss $\Delta(\mathbf{y},\mathbf{y}^{i})$ defines the actual objective we want to solve with the SSVM loss. Since it’s the most common metric in instance segmentation, we employ the Intersection-over-Union (IoU) between the prediction $\mathbf{y}$ and the ground truth $\mathbf{y}^{i}$ . Note that optimizing for IoU can be split into maximizing the intersection while minimizing the union. During training, this allows us to simply add a negative value during training to the $\kappa$ map at the locations within the ground truth and a positive outside to obtain a loss-augmented inference (see Fig. 4).

Experiments

We test the proposed DSAC method for building footprint extraction from overhead images. We consider two settings: manual initialization, where the user provides a single click near the center of the building and automatic initialization, where an instance segmentation algorithm is used to generate the initial polygons. The first setting is tested in two datasets, Vaihingen and Bing Huts, while the second is tested in the TorontoCity dataset . The three datasets are detailed in the respective sections.

To learn the ACM energy terms, we use a CNN architecture similar to the Hypercolumn model in . The input consists of a patch cropped around each initialization polygon and resized an image of fixed size for each dataset. The first layer consists of $7\times 7$ convolutions, the second of $5\times 5$ and all subsequent layers are of size $3\times 3$ . All the convolutional layers are followed by ReLu, batch normalization and $2\times 2$ max-pooling. The number of filters is increased with the depth: $32$ , $64$ , $128$ , $128$ , $256$ and $256$ for the six blocks. The output tensors of all the layers are then upsampled to the output size and concatenated. After this, a two-layer MLP with 256 and 64 hidden units is used to predict the four output maps: $D(\mathbf{x})$ , $\alpha(\mathbf{x})$ , $\beta(\mathbf{x})$ and $\kappa(\mathbf{x})$ . We use this architecture for all datasets, with the exception of the Bing huts dataset, for which we skip the last two convolutional layers. In all cases, we use the Adam optimizer with a learning rate of $10^{-4}$ . We augment the data with random rotations. The number of ACM iterations is set to 50 in all the experiments, and the number of nodes is set to $L=60$ in Vaihingen and TorontoCity and $L=20$ in Bing huts.

2 Manual initialization

In this setting, the detection step is done manually by visual inspection. The only input required from the user is a single click to indicate the approximate center of the building. Two datasets are considered:

The dataset consists of 168 buildings extracted from the training set of the ISPRS “2D semantic labeling contest”http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html. The images have three bands, corresponding to near infrared, red and green wavelengths, and a resolution of 9 cm. We used $100$ buildings to train the models and the remaining $68$ as a test set.

The dataset consists of $605$ individual huts visible on Bing maps aerial imagery at a resolution of $30$ cm, over a rural area in Tanzania. See Fig. 5 for an overview of the study area and Fig. 7 for a full resolution subset. The ground truth building footprints have been obtained from OpenStreetMaphttp://www.openstreetmap.org. A total of $335$ images of size $80\times 80$ pixels are used to train the models and the remaining $270$ to test. The lower spatial resolution, low contrast between the buildings and the surrounding soil, as well as the high level of label noise make Bing huts a very challenging dataset.

We compare DSAC against a baseline where we train a CNN with the same architecture used by DSAC, but with a 3-class cross entropy loss with classes: building, building boundary, background. The boundary class is added to help the model focus on learning the shapes of the buildings. In this case, the click from the user is used to select the nearest connected region that has been labeled as building and treat it as the instance prediction.

3 Automatic initialization

Although the manual initialization only requires a single click from the user, it can still be a tedious task for large scale datasets. Existing instance segmentation algorithms, such as the recently proposed Deep Watershed Transform (DWT) , can be used instead to initialize the active contours. These methods have a good recall, but tend to undersegment the objects and to lose detail near to the boundaries. To compensate for this effect, the authors of apply a morphology-based post-processing step. We test the possibility of initializing the ACM within DSAC with the results obtained by on the TorontoCity building instance segmentation dataset , with around $28000$ instances for training and $12000$ for testing. The ACM contours are initialized with the output of the Deep Watershed Transform (DWT) , the current state-of-the-art in terms of IoU. Two initialization polygon types are considered: the raw DWT output and the post-processed versions used in . We also consider a third variant, where the raw DWT is used at train time and the post-processed one for inference at test time: this variant is based on the intuition that making the problem harder at train time, in addition to using the loss augmentation, helps learning a better energy function.

Results and discussion

Table 1 reports the average Intersection over Union (IoU) for the two datasets. Since the ground truth shift noise in the Bing huts dataset makes the IoU assessment untrustworthy, the root mean square error (RMSE in $m^{2}$ ) committed when estimating the area of the building footprints is also reported. DSAC significantly improves the baseline in terms of IoU for both datasets. This ablation study confirms the need to allow $\kappa$ and $\beta$ to vary locally (as opposed to having a single value for the whole image), while $\alpha$ can be treated as a single value without loss of performance. It also highlights the importance of the balloon term for the convergence of the contour.

Examples of segmentation results for the Vaihingen dataset (Fig. 7, top row) show that the learned priors do indeed promote smooth, straight edges while often allowing for sharp corners. By looking at the predicted energy terms in Fig. 6 we observe that the model focuses on the corners by producing very low $D$ values close to them, while predicting high $\kappa$ inside the building next to the corners and a sharp drop to on the outside. Moreover, the smoothness term $\beta$ is close to at the corners and high along the edges.

In the Bing huts dataset results (Fig. 7, bottom row), the biggest jump in performance can be seen in the area estimation metric. DSAC still tends to oversmooth the shapes, probably since it is unable to learn the location of corners due to the ground truth shift noise inherent to OpenStreetMap data, but manages to converge to polygons of the correct size, most probably because it learns to balance the balloon ( $\kappa$ , promoting large areas) and the membrane ( $\alpha$ , promoting short contours) terms.

Table 2 reports the results obtained on the TorontoCity dataset using two metrics: the IoU-based weighted coverage (“WeighCov”) and the shape similarity PolySim . Besides DWT, we also compare DSAC against the results of building footprint segmentation with FCN and ResNet, as reported in . We observe an improvement with respect to DWT of both metrics. DSAC obtains the best weighted coverage scores irrespectively of the initialization strategy. Interestingly, the best results are obtained by the hybrid initialization using raw DWT at training time and post-processed DWT polygons at test time. This suggests that our intuition about making the model work harder at train time is correct and seems to complement the use of a task loss in the SSVM loss. Finally, segmentation examples are shown in the last row of Fig. 7: DSAC (in yellow) consistently returns a more desirable segmentation with respect to DWT (in blue), closer to the ground truth polygon (in green). Although we can still see oversmoothing in our results, note how an important amount of shift noise is also present in some instances, making the DSAC result more plausible than the ground truth in a few cases (red arrows).

Conclusion

We have shown the potential of embedding high-level geometric processes into a deep learning framework for the segmentation of object instances with strong shape priors, such as buildings in overhead images. The proposed Deep Structured Active Contours (DSAC) uses a CNN to predict the energy function parameters for an Active Contour Model (ACM) such as to make its output close to a ground truth set of polygonal footprints. The model is trained end-to-end by bringing the ACM inference into the CNN training schedule and using the ACM’s output and the ground truth polygon to assess a structured loss that can be used to update the CNN’s parameters using back-propagation. DSAC opens up the possibility of using a large collection of energy terms encoding for different priors, since an adequate balance between them is learned automatically. The main limitation of our model is that the initialization is assumed to be given by some external method and is therefore not included in the learning process.

Results in three different datasets, which include a $10\%$ relative improvement over the state-of-the-art on the TorontoCity dataset, show that combining the bottom-up feature extraction capabilities of CNNs with the high-level constraints provided by ACMs is a promising path for instance segmentation when strong geometric priors exist.