DeepCut: Object Segmentation from Bounding Box Annotations using Convolutional Neural Networks

Martin Rajchl, Matthew C. H. Lee, Ozan Oktay, Konstantinos Kamnitsas, Jonathan Passerat-Palmbach, Wenjia Bai, Mellisa Damodaram, Mary A. Rutherford, Joseph V. Hajnal, Bernhard Kainz, Daniel Rueckert

cs.CV

I Introduction

Many modern medical image analysis methods that are based on machine learning rely on large amounts of annotations to properly cover the variability in the data (e.g. due to pose, presence of a pathology, etc). However, the effort for a single rater to annotate a large training set is often not feasible. To address this problem, recent studies employ forms of weak annotations (e.g. image-level tags, bounding boxes or scribbles) to reduce the annotation effort and aim to obtain comparably accurate results as to under full supervision (i.e. using pixelwise annotations) .

User-provided bounding boxes are a simple and popular form of annotation and have been extensively used in the field of computer vision to initialise object segmentation methods . Bounding boxes have advantages over other forms of annotations (e.g. scribbles or brush strokes ), as they allow to spatially constrain the problem (i.e. ideally, the object is unique to the bounding box region and fully contained in it). In a practical sense, bounding boxes can be defined via two corner coordinates, allowing fast placement (approximately 15 times faster than pixelwise segmentations ) and lightweight storage of the information. Considering the required interaction effort and the amount of provided information, these properties qualify bounding boxes as preferred weak annotation for image analysis methods.

In segmentation studies , bounding boxes are employed as both initialisation and spatial constraints for the segmentation problem. The above approaches model the image appearance (i.e., colours or greyscale intensity) and impose smoothness constraints upon the segmentation results for each image. However, given an image database and corresponding annotations, we can assume that objects share common shape and appearance information, which can be learned (i.e., instead of direct image-by-image object segmentation, a common model can be learned for the all images in the database). This is particularly interesting for segmentation problems on medical images, where typically an entire cohort is to be analysed for a specific organ or region, exhibiting large class similarity in terms of shape and appearance.

In this paper, we propose combining a neural network model with an iterative graphical optimisation approach to recover pixelwise object segmentations from an image database with corresponding bounding box annotations. The idea builds on top of the popular GrabCut method, where an intensity appearance model is iteratively fitted to a region and subsequently regularised to obtain a segmentation. Similarly to this, the proposed DeepCut method iteratively updates the training targets (i.e. the class associated with a voxel location, described by an image patch) learned by a convolutional neural network (CNN) model and employs a fully connected conditional random field (CRF) to regularise the segmentation. The approach is formulated in a generic form and thus can be readily applied to any object or image modality. We briefly review recent advancements in the following section to put this approach into the context of the current state-of-the-art and highlight our contributions.

Graphical energy minimisation techniques are popular methods for regularised image segmentation due to inherent optimality guarantees and computational efficiency . They have been extensively used in the optimisation of interactive and fully automated segmentation methods .

An iterative optimisation of such energy functionals allows to address more complex problems, such as the pixelwise segmentation from bounding box annotations. The popular GrabCut method iteratively updates parameters of a Gaussian mixture model (GMM) and optimises the resulting energy with a graph cut. Lempitsky et al. extended this method by adding a topological prior to prevent excessive shrinking of the region and Cheng et al. improved upon its performance by employing a fast fully connected Conditional-Random Field (CRF) . Similar approaches include the time-implicit propagation of levelsets via continuous max-flow , iterative graph cuts and the use of the expectation-maximisation (EM) algorithm .

Similarly to the above mentioned segmentation approaches, learning-based methods have been recently investigated to exploit the advantages of weak annotations, primarily to reduce the effort of establishing training data. In contrast to learning under full supervision (i.e. using pixelwise annotations), weakly supervised methods aim to learn from image-level tags, partial labels, bounding boxes, etc. and infer pixelwise segmentations.

Recently, several multiple instance learning (MIL) techniques were investigated, particularly when images could potentially contain multiple objects. Cinbis et al. proposed a multi-fold MIL method to obtain segmentations from image level tags. Vezhnevets and Buhmann addressed the problem with a Texton Forest framework, extending it to MIL. With the re-emerging of convolutional neural networks , MIL methods have been proposed to exploit such methods . However MIL-based methods, even when including additional modules under weak supervision , have not been able to achieve comparable accuracy to fully supervised methods . However, latest developments using CNN learning with weakly supervised data have shown remarkable improvements in accuracy. Schlegl et al. parse clinical reports to associate findings and their locations with optical coherence tomography images and further obtain segmentations of the reported pathology. Dai et al. iterative between updating region proposals in bounding boxes and model training. Papandreou et al. formulate an Expectation-Maximization (EM) algorithm to iteratively update the training targets. Both of the latter methods were able to achieve comparable accuracy to those under full supervision.

I-B Contributions

In this paper, we build upon the ideas of GrabCut , a well-known object segmentation method employed on single images. We extend the basic idea with recent advances in CNN modelling and propose DeepCut, a method to recover semantic segmentations given a database of images with corresponding bounding boxes. For this purpose, we formulate an iterative energy minimisation problem defined over a densely connected conditional random field (CRF) and use it to update the parameters of a CNN model. We compare the proposed method against a fully supervised ( $\text{CNN}_{\text{FS}}$ ) and a naïve approach to weakly supervised segmentation ( $\text{CNN}_{\text{na{\"{i}}ve}}$ ), to obtain upper and lower accuracy bounds for a given segmentation problem. Further, we examine the effect of region initialisation on the proposed method by providing a pre-segmentation within the bounding box, ( $\text{DC}_{\text{PS}}$ )). Finally, we compare all methods in their segmentation accuracy using a highly heterogeneous and challenging dataset of fetal magnetic resonance images (MRI) and further evaluate the performance to GrabCut , as an external method to this framework.

II Methods

Let us consider segmentation problems using energy functionals over graphs as described in . We seek a labelling $f$ for each pixel $i$ , minimising

where $\psi_{u}(f_{i})$ serves as unary data consistency term, measuring the fit of the label $f$ at each pixel $i$ , given the data. Additionally, the pairwise regularisation term $\psi_{p}(f_{i},f_{j})$ penalises label differences for any two pixel locations $i$ and $j$ . Typically, pairwise regularisation terms have the form of

and enforce contrast-sensitive smoothness penalties between the intensity vectors $I_{i}$ and $I_{j}$ . We can minimise the energy in Eq. (1) using a densely-connected CRF , where we replace the pairwise term with

consisting of two penalty terms modelling appearance (4a) and smoothness (4b) between the locations $p_{i}$ and $p_{j}$ :

The relative contributions of the two penalties is weighted by the regularisation parameters $\omega_{1}$ and $\omega_{2}$ and the degrees of spatial proximity and similarity are controlled by $\theta_{\alpha}$ and $\theta_{\beta}$ , respectively .

The unary potential is computed independently for each pixel $i$ by a data model with the parameters $\Theta$ that produces a distribution $y_{i}$ over the label assignment given an input image or patch $x$ and is defined as the negative log-likelihood of this probability:

In contrast to , where the unary term is computed from a GMM of the observed colour or intensity vector, we employ a CNN with the parameters $\Theta$ . We describe the network architecture in Section II-B in detail.

The proposed DeepCut method can be seen as an iterative energy minimisation method similar to GrabCut . There are two key stages to both algorithms, model estimation and label update. GrabCut uses a GMM to parametrise the colour distributions of the foreground and background. In the model estimation stage the parameters $\Theta$ for the GMM are computed based on the current label assignment $f$ for each pixel $i$ . At the label update stage the pixels are relabelled based on the new model. DeepCut replaces the GMM with a Neural Network model and the graph cut solver from with on a densely-connected graph. In contrast to , and rather than recomputing our model, we make use of transfer learning and reinitialise the CNN with the parameters of the last iteration.

This two-step iterative method is similar to an EM algorithm, consisting of an E-step (label update) and M-step (model update). Papandreou et al. describe such a method to iteratively update $f$ , however only employ regularisation as in Eq. (1) as a post-processing step during testing.

II-B Convolutional Neural Network Model

The CNN is a hierarchically structured feed-forward neural network comprising of least one convolutional layer . Additionally, max-pooling layers downsample the input information to learn object representations at different scales. Downstream of several convolutional and max-pool layers is typically a layer of densely-connected neurons, reducing the output to a desired number of classes . Our CNN is a typical feed-forward neural network model which consist of convolutional layers (feature extraction), max-pooling layers (shift/scale invariance) and dense layers (classification stage). The network architecture used in this study is depicted in Fig. 1.

Given a database of size $N$ containing images $I=\{I_{1},...,I_{N}\}$ and corresponding bounding boxes $B=\{B_{1},...,B_{N}\}$ , we attempt to learn pixelwise segmentations of the objects depicted in $I$ and constrained to $B$ . For this purpose, we employ a CNN with parameters $\Theta$ to classify image patches centred around a voxel location $i$ into foreground and background.

We describe each voxel location $i\in I_{n}$ as a 3D patch of size $p_{x}\times p_{y}\times p_{z}$ , centred around $X$ . Each patch $X$ is associated with an integer value $Y=\{0,1\}$ , representing background and foreground class of the centre voxel at $i$ , respectively. The patch $X$ serves as input to the network, which aims to predict $Y$ . Because of anticipated motion artefacts between fetal MR slices, we decided to emphasise the in-plane context of our training patches, using a patch size of 33x33x3, rather than cubic patch dimensions.

Network Configuration

For the purpose of this study, we designed a simple CNN inspired by the well-known LeNet architecture . While the proposed approach can employ other networks, we chose this configuration, as it is easily understandable and simple to reproduce. The CNN consists of two sets of convolutional (conv) and subsequent max-pooling (pool) layers. Since a convolution reduces the layer output by half the kernel size, we pad its output with zeros to preserve the size of the input tensor. The two serial conv/pool layers are attached to a layer with densely-connected (dense) neurons and an output layer with neurons associated with foreground and background. For regularisation purposes, we add dropout to both inputs of the dense and output layer, randomly sparsifying the signal and reducing the potential of over-fitting . Figure 1 depicts the specific configuration, which has been fixed for all experiments in this study.

Training & Optimisation

In each epoch, we extract $K=10^{5}$ patches $X_{k}$ from the training database, which are equally distributed between the classes $Y_{k}$ . All CNN weights are initialised by sampling from a Gaussian distribution. We employ an adaptive gradient descent optimisation (ADAGRAD) of mini-batches using a constant learning rate $\eta=0.015$ for a fixed number of epochs. This has the advantage of adapting the learning rate locally for each feature, thus exhibiting more robust behaviour and faster convergence. The loss function is defined as the categorical cross-entropy between the true and coding distributions, respectively.

Data Augmentation

The training set undergoes data augmentation for better generalisation of the learned features and to prevent over-fitting to the training data. For this purpose, we add a Gaussian-distributed intensity offset to each patch with the standard deviation $\sigma$ and randomly flip the patch across the spatial dimensions to increase the variation in the training data.

II-C Naïve Learning Approach

If we assume that the patches describing the object constrained to the bounding box $B$ are unique, we can attempt to classify patches $X_{i}\in B$ into object and background, by using patches sampled from the bounding boxes as foreground targets. To obtain background targets, we can extend $B$ to a halo region $H$ , solely containing background voxels. A naïve approach to segmentation would be to assume that all $X_{i}\in B$ belong to the object and all $X_{i}\in H$ to the background. However, since the region $B$ contains false positive locations (i.e., the object does not fully extend to the entire bounding box region, but is merely a subset of it), we will introduce errors into the model, which will impact the accuracy of the final segmentation. To ensure a fair comparison of the segmentation results, post-processing includes regularisation with a densely-connected CRF . This naïve approach ( $\text{CNN}_{\text{na{\"{i}}ve}}$ ) is depicted in Figure 2.

II-D DeepCut

In order to develop a better approach compared to Section II-C, we start training the CNN model $\Theta$ with patches $X$ sampled from $B$ and $H$ for foreground and background, respectively. In contrast to the naïve approach, we interrupt the training of $\Theta$ after a fixed number of epochs and update the classes $Y$ for all voxels in $B=R_{FG}\cup R_{BG}$ , via inference and subsequent CRF regularisation, according to Section II-A. We continue training with the updated targets and reinitialise the CNN with $\Theta$ .

II-E Region Initialisation

While methods such as DeepCut, GrabCut or others rely on (approximately) globally optimal solvers (i.e., and , respectively), the iterative nature of the algorithm will be limited to finding local optima. Thus the resulting segmentation is dependent on the initial regions $R_{FG}$ and $R_{BG}$ . Papandreou et al. propose performing a pre-segmentation within $B$ to initialise $R_{FG}$ and $R_{BG}$ closer to the object and observed large improvements in accuracy of the final segmentation. Similar initialisation can be done with the proposed DeepCut method and will be examined in Section III. In our experiments, we distinguish the variants of DeepCut initialised with bounding boxes and pre-segmentations with $\text{DC}_{\text{BB}}$ and $\text{DC}_{\text{PS}}$ , respectively.

III Experiments

For all experiments, we used the database in , consisting of MR images of 55 fetal subjects. The images were acquired on 1.5T MR scanner using a T2-weighted ssFSE sequence (scanning parameters: TR 1000ms, TE 98ms, 4mm slice thickness, 0.4mm slice gap). Most of the images contain motion artefacts that are typical for such acquisitions. The imaged population consists of 30 healthy subjects and 25 subjects with intrauterine growth restriction (IUGR) and their gestational age ranged from 20 to 30 weeks. For all images, the brain and the lung regions have been manually segmented by an expert rater. We want to emphasise that the brain segmentations are in fact not tissue segmentations, but whole brains similar to the data used in .

III-B Preprocessing & Generation of Bounding Boxes

All images underwent bias field correction and normalisation of the image intensities to zero mean and unit standard deviation computed from the bounding box. Bounding boxes $B$ were generated from manual segmentations by computing the maximum extent of the segmentation and enlarging it by 5 voxels for each slice. The background halo regions $H$ were created by extending $B$ by 20 voxels.

III-C Comparative Methods

To compare the performance of the proposed DeepCut approach, we fix the CNN architecture (see Section II-B), preprocessing and CRF parameters for all learning-based methods. We can consider the $\text{CNN}_{\text{na{\"{i}}ve}}$ (c.f. Section II-C) a lower bound in terms of accuracy performance. Alternatively, we train the CNN directly under full supervision (i.e. from pixelwise segmentations, $\text{CNN}_{\text{FS}}$ ) and predict into $B$ , which can be considered an upper accuracy bound given model complexity and data. We assess both the performance of the proposed DeepCut initialised by bounding boxes ( $\text{DC}_{\text{BB}}$ ) or via a GrabCut pre-segmentation ( $\text{DC}_{\text{PS}}$ ) as suggested in Section II-E. Lastly, we state results from the GrabCut (GC) method for a performance comparison external to the proposed framework.

III-D Experimental Setup, Evaluation & Parameter Selection

We performed 5-fold cross-validation of randomly selected healthy and IGUR subjects and fixed the training and testing databases for all compared methods. The resulting segmentations are evaluated in their overlap with expert manual segmentations using the Dice Similarity Coefficient, measuring the mean overlap between two regions $A$ and $B$ :

Three datasets were left out of the evaluation experiments and used to tune the GrabCut MRF regularisation weight and the CRF regularisation parameters in (4b) and (4a) via random permutations of the parameter combinations using manual segmentations.

III-E Implementation Details & Hardware

We implemented the CNN as shown in Fig. 1 with Lasagnehttp://lasagne.readthedocs.org/ and Theanohttp://deeplearning.net/software/theano/ . All experiments were run on Ubuntu 14.04 machines with 256 GB memory and a single Tesla K80 (NVIDIA Corp., Santa Clara, CA) with 12GB of memory. We used the CRF implementation in to solve Eq. (1), according to .

IV Results

Example segmentation results of all compared methods and initialisations can be found in Figures 3 and 4. We observe comparable agreement of the proposed $\text{DC}_{\text{PS}}$ and a fully supervised $\text{CNN}_{\text{FS}}$ for the brain region. Generally, an increase of segmentation accuracy can be seen in brain and lungs with increasing sophistication of the learning-based methods.

Comparison of methods directly learning from bounding boxes (i.e. $\text{CNN}_{\text{na{\"{i}}ve}}$ and $\text{DC}_{\text{BB}}$ , demonstrate that the iterative target update of the proposed DeepCut method results in large improvements in accuracy for both the brain and the lungs (see Fig. 3 and 4), Tab. II and III). Numerically, we obtain an increase of 12.6% and 8.9% in terms of average DSC for the brain and lungs, respectively.

IV-B Initialisation with Pre-segmentations

Further, when a pre-segmentation instead of bounding boxes is used to initialise the DeepCut method, the mean DSC is improved by another 3.7% for the brain and 4.9% for the lungs. This can be seen in the example segmentation in Fig. 3 and 4, where the $\text{DC}_{\text{PS}}$ segmentation for both organs is visually closer to those of the fully supervised $\text{CNN}_{\text{FS}}$ .

IV-C Comparison with GrabCut

While the comparative GrabCut method performs well for the brain (DSC $80.7\pm 4.9\%$ ), we observe less robust behaviour in the lungs (DSC $58.6\pm 19.0\%$ ). GrabCut outperforms the $\text{CNN}_{\text{na{\"{i}}ve}}$ for brain segmentation, however the presence of large outliers results in a lower mean accuracy in the lung regions. Several segmentations present with DSC $<20\%$ , indicating that the GrabCut was not able to detect an object in some cases or the segmenting false positive voxels in others.

IV-D DeepCut versus Fully Supervised Learning

We halted training and evaluated intermediate accuracy results of the proposed DeepCut variants over iterations. For both $\text{DC}_{\text{BB}}$ and $\text{DC}_{\text{PS}}$ , accuracy increases after each iteration (see Fig. 6) and approaches the upper bound of fully supervised training ( $\text{CNN}_{\text{FS}}$ ). Most importantly, both proposed DeepCut methods present with higher average accuracy than the naïve approach, however the accuracy remains lower than $\text{CNN}_{\text{FS}}$ . For reference, in Fig. 6 the mean (black) and standard deviation (gray) of $\text{CNN}_{\text{na{\"{i}}ve}}$ and $\text{CNN}_{\text{FS}}$ are also shown.

V Discussion

The proposed DeepCut allows for obtaining pixelwise segmentations from an image database with bounding box annotations. Its general formulation allows for readily porting it to other segmentation problems and its use of CNN models avoids feature engineering as required by other learning-based methods.

We deliberately chose a database exhibiting large variation in the imaged anatomy (e.g. the arbitrary position of the fetal body in the uterus, the extended gestational age range of 20-38 weeks or the presence of growth restriction (IGUR)) to test if a simple network configuration suffices for object segmentation problems constrained to bounding boxes. By restricting learning background patches from the halo $H$ , we avoid learning features for the entire image domain, allowing for faster training.

V-B Comparison with Related Studies

Comparing fully supervised learning ( $\text{CNN}_{\text{FS}}$ ) qualitatively, we obtain an increased mean accuracy (94.1% DSC) over Keraudren et al. (93.0% DSC) and large improvements over Taleb et al. (84.2% DSC) for fetal brain segmentations. However, these methods are highly problem-specific solutions, which are applied to the entire image domain, making comparisons difficult. The only conclusion drawn from this comparison is roughly what accuracy range can be expected for automated fetal brain segmentation methods. In this sense, the generally applicable $\text{DC}_{\text{PS}}$ method yields similar accuracy (90.3% DSC) by employing weak annotations, potentially placed 15x faster than pixel-wise annotations .

V-C Differences in Brain and Lung Segmentation Performance

For all internally compared methods, we observe a higher accuracy for brain segmentation results than for those of the lungs. There are several contributing factors to these differences, affecting all compared methods similarly. The regular shape of the brain can be better approximated with a bounding box than the lungs, which is underlined by the higher mean overlap of BB (refer to Tab. II and III, respectively). This introduces a lower amount of false positive initial targets, facilitating training the CNN. Secondly, the contrast of the background is higher in the brain, as it is often surrounded by hyper-intense amniotic fluid or hypo-intense muscular tissue. We can observe this in the tuned CRF parameters $\theta_{\beta}$ , penalising intensity differences (see Tab. I), to automatically tune to a lower value than the brain. Additionally, this can experimentally be observed with the GrabCut (GC) method, which heavily relies on intensity differences between the object and the adjacent background, performing worse than $\text{CNN}_{\text{na{\"{i}}ve}}$ (c.f., Fig. 5 (a) and (b), and Tab. II and III).

V-D Effect of Initialisation on DeepCut performance

As shown in Fig. 6, we observe an increase in segmentation accuracy, when initialising the DeepCut with a pre-segmentation, rather than a bounding box. Papandreou et al. reported a similar increase in accuracy when initialising their EM-based algorithm. Although in both approaches, the accuracy steadily increases with the number of epochs, the methods converge to different optima. This is due to the locally optimal nature of this iterative method and other iterative optimisation schemes, such as levelsets or iterated graphical methods even when employing an (approximately) globally optimal solver such as . Potential improvements might include to update the targets with a higher frequency than in this study (c.f., Tab I, $N_{Epochs}$ per DeepCut iteration) or entertaining the notion, that an optimal set of CRF regularisation parameters exists at each iteration. However, tuning for the latter might be computationally expensive and thus of little practical value. Recent advances of expressing the employed CRF as a recurrent neural network , might be a solution for back-to-back training of the $\theta$ parameters involved in (4a) and (4b) at each DeepCut iteration.

V-E Internal Comparative Experiments

For both lungs and brain segmentation, we report a large improvement in accuracy with DeepCut variants over a naïve learning approach (c.f., Fig. 5 and Tab. II and III). As in Section II-E, we suggest to initialise DeepCut with a pre-segmentation, reducing the amount of false positive targets for the initial training. A closer initialisation leads to a performance improvement, even if the pre-segmentation is not accurate (c.f., Fig. 5 (b), where there is a remarkable improvement from $\text{DC}_{\text{BB}}$ to $\text{DC}_{\text{PS}}$ ). The generally low standard deviations of all learning-based methods underline the robust performance compared to image segmentation methods, such as GrabCut. It can be explained, that learning a model of the object from a collection of images is favourable to fitting a model (e.g. a GMM) to a single image, as done in many object segmentation methods. If desired, the model can be adjusted in depth to cover a wider variation of appearance and scales, as those employed in . However, when the objects exhibit a large class similarity in terms of shape and appearance, simple CNN architectures could suffice for most medical image segmentation problems.

V-F Conclusions

We proposed DeepCut, a new method to obtain pixelwise segmentations, given a database of bounding box annotations and studied variants employing an iterative dense CRF formulation and convolutional neural network models. DeepCut is able to segment both the fetal brain and lungs from an image database of large variation in the anatomy and is readily applicable to similar problems on medical images. The proposed method performs well in terms of accuracy compared to a model trained under full supervision and simultaneously greatly reduces the annotation effort required for analysis.

Acknowledgements

We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research. This research was also supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. Furthermore, this work was supported by Wellcome Trust and EPSRC IEH award for the iFIND project and the Developing Human Connectome Project, which is funded through a Synergy Grant by the European Research Council (ERC) under the European Union’s Seventh Framework Programme (FP/2007-2013) / ERC Grant Agreement number 319456.