Object Skeleton Extraction in Natural Images by Fusing Scale-associated Deep Side Outputs

Wei Shen, Kai Zhao, Yuan Jiang, Yan Wang, Zhijiang Zhang, Xiang Bai

Introduction

In this paper, we investigate an interesting and nontrivial problem in computer vision, object skeleton extraction from natural images (Fig. 1). Here, the concept of “object” means a standalone thing with a well-defined boundary and center , such as an animal, a human, and a plane, as opposed to amorphous background stuff, such as sky, grass, and mountain. Skeleton, also called symmetry axis, is a useful structure-based object descriptor. Extracting object skeletons directly from natural images is of broad interests to many real applications including object recognition/detection , text recognition , road detection and blood vessel detection .

Skeleton extraction from pre-segmented images used to be a hot topic, which has been well studied and successfully applied to shape-based object matching and recognition . However, such methods have severe limitations when being applied to natural images, as segmentation from natural images is still an unsolved problem.

Skeleton extraction from natural images is a much more challenging problem. The main difficulties stem from three aspects: (1) Complexity of natural scenes: Natural scenes can be very cluttered. Amorphous background elements, such as fences, bricks and even the shadows of objects, exhibit somewhat self-symmetry, and thus are prone to cause distractions. (2) Diversity of objects: Objects in natural images may exhibit entirely different colors, textures, shapes and sizes. (3) Specificity of skeletons: local skeleton segments have a variety of patterns, such as straight lines, T-junctions and Y-junctions. In addition, a local skeleton segment naturally associates with a certain scale, determined by the thickness of its corresponding object part. However, it is unknown in natural images. We term this problem as unknown-scale problem in skeleton extraction.

A number of works have been proposed to study this problem in the past decade. Broadly speaking, they can be categorized into two groups: (1) Traditional image processing methods , which compute skeletons from a gradient intensity map according to some geometric constraints between edges and skeletons. Due to the lack of object prior, these methods can not handle the images with complex scenes; (2) Recent learning based methods , which learn a per-pixel classification or segment-linking model based on elaborately hand-designed features computed at multi-scales for skeleton extraction. Limited by the ability of traditional learning models and hand-designed features, these methods fail to extract the skeletons of objects with complex structures and cluttered interior textures. In addition, such per-pixel/segment models are usually quite time consuming for prediction. Consequently, there still remains obvious gap between these skeleton extraction methods and human perception, in both performance and speed. Skeleton extraction has its unique aspect by looking into both local and global image context, which requires much more powerful models in both multi-scale feature learning and classifier learning, since the visual complexity increases exponentially with the size of the context field.

To tackle the obstacles mentioned above, we develop a holistically-nested network with multiple scale-associated side outputs for skeleton extraction. The holistically-nested network is a deep fully convolutional network (FCN) , which enables holistic image training and prediction for per-pixel tasks. Here, we connect a scale-associated side output to each convolutional layer in the holistically-nested network to address the unknown-scale problem in skeleton extraction.

Referring to Fig. 2, imagine that we are using multiple filters with different sizes (such as the convolutional kernels in convolutional networks) to detect a skeleton pixel with a certain scale; then only the filters with the sizes larger than the scale will have responses on it, and others will not. Note that the sequential convolutional layers in a holistically-nested network can be treated as the filters with increasing sizes (the receptive field sizes on the original image of each convolutional layer are increasing from shallow to deep). So each convolutional layer is only able to capture the features of the skeleton pixels with scales less than its receptive field size. The sequential increasing receptive field sizes provide a principle to quantize the skeleton scale space. With these observations, we propose to impose supervision to each side output, optimizing it towards a scale-associated groundtruth skeleton map. More specifically, each skeleton pixel in it is labeled by a quantized scale value and only the skeleton pixels whose scales are smaller than the receptive filed size of the side output are reserved. Thus, each side output is associated with some certain scales and able to give a certain number of scale-specific skeleton score maps (the score map for one specified quantized scale value) when predicting.

The final predicted skeleton map can be obtained by fusing these scale-associated side outputs. A straightforward fusion method is to average them. However, a skeleton pixel with larger scale probably has a stronger response on a deeper side output, and a weaker response on a shallower side output; a skeleton pixel with smaller scale may have strong responses on both of the two side outputs. By considering this phenomenon, for each quantized scale value, we propose to use a scale-specific weight layer to fuse the corresponding scale-specific skeleton score map provided by each side output.

In summary, the core contribution of this paper is the proposal of the scale-associated side output layer, which enables both target learning and fusion in a scale-associated way. Therefore, our holistically-nested network is able to localize skeleton pixels with multiple scales.

To the best of our knowledge, there are only two datasets related to our task. One is the SYMMAX300 dataset , which is converted from the well-known Berkeley Segmentation Benchmark (BSDS300) . However, this dataset is used for local reflection symmetry detection. Local reflection symmetry is a kind of low-level feature of image, regardless of the concept of “object”. Some samples in this dataset are shown in Fig. 3(a). Note that, a large number of symmetries occur in non-object parts. Generally, object skeleton is a subset of local reflection symmetry. The other one is the WH-SYMMAX dataset , which is converted from the Weizmann Horse dataset . This dataset is suitable to verify object skeleton extraction methods; however, as shown in Fig. 3(b) the limitation is that only one object category, the horse, is contained in it. To evaluate skeleton extraction methods, we construct a new dataset, named SK506http://wei-shen.weebly.com/uploads/2/3/8/2/23825939/sk506.zip. There are 506 natural images in this dataset, which are selected from the recent published MS COCO dataset . The objects in these 506 images belong to a variety of categories, including humans, animals, such as birds, dogs and giraffes, and artificialities, such as planes and hydrants. We apply a skeletonization method to the provided human-annotated foreground segmentation maps of the selected images to generate the groundtruth skeleton maps. Some samples of the SK506 dataset are shown in Fig. 3(c). We evaluate several skeleton extraction methods as well as symmetry detection methods on both SK506 and WH-SYMMAX. The experimental results demonstrate that the proposed method significantly outperforms others.

Related Works

Object skeleton extraction has been paid much attention in previous decades. However, most works in the early stage only focus on skeleton extraction from pre-segmented images. As these works have a strict assumption that object silhouettes are required to be available, they cannot be applied in our task.

Some pioneers try to extract skeletons from the gradient intensity maps computed on natural images. The gradient intensity map is generally obtained by applying directional derivative operators to a gray-scale image smoothed by a Gaussian kernel. For instance, in , the author provides an automatic mechanism to determine the best size of the Gaussian kernel used for gradient computation, and he propose to detect skeletons as the pixels for which the gradient intensity assumes a local maximum (minimum) in the direction of the main principal curvature. Jang and Hong extract the skeleton from the pseudo-distance map which is obtained by iteratively minimizing an object function defined on the gradient intensity map. Yu and Bajaj propose to trace the ridges of the skeleton intensity map calculated from the diffused vector field of the gradient intensity map, which can remove the undesirable biased skeletons. Due to the lack of object prior, these methods are only able to handle the images with simple scenes.

Recent learning based skeleton extraction methods are more suitable to deal with the scene complexity problem in natural images. One type of them formulates skeleton extraction to be a per-pixel classification problem. Tsogkas and Kokkinos compute the hand-designed features of multi-scale and multi-orientation at each pixel, and employ the multiple instance learning framework to determine whether it is symmetryAlthough symmetry detection is not the same problem as skeleton extraction, we also compare the methods for it with ours, as skeleton can be considered a subset of symmetry. or not. Shen et al. then improve their method by training MIL models on automatically learned scale- and orientation-related subspaces. Sironi et al. transform the per-pixel classification problem to a regression one to achieve accurate skeleton localization, which learns the distance to the closest skeleton segment in scale-space. Alternatively, another type of learning based methods aim to learn the similarity between local skeleton segments (represented by superpixel or spine model ), and link them by hierarchical clustering , dynamic programming or particle filter . Due to the limited power of the hand-designed features and traditional learning models, these methods are intractable to detect the skeleton pixels with large scales, as much more context information is needed to be handled.

Our method is inspired by , which develops a holistically-nested network for edge detection (HED). Edge detection does not face the unknown-scale problem. Using a local filter to detect an edge pixel, no matter what the size of the filter is, will have responses, either stronger or weaker. So summing up the multi-scale detection responses, which is adopted in the fusion layer in HED, is able to improve the performance of edge detection , while bringing noises across the scales for skeleton extraction. There are two main differences between HED and our method. 1. We supervise the side outputs of the network with different scale-associated groundtruths, while the groundtruths in HED are the same. 2. We use different scale-specific weight layers to fuse the corresponding scale-specific skeleton score maps provided by the side outputs, while the side outputs are fused by a single weight layer in HED. Such two changes utilize multi stages in a network to explicitly detect the unknown scale, which HED is unable to handle with. With the extra supervision added to each layer, our method is able to provide a more informative result, i.e., the predicted scale for each skeleton pixel, which is useful for other potential applications, such as part segmentation and object proposal detection (we will show this in Sec. 4.2.6 and Sec. 4.2.7). While the result of HED cannot be applied to such applications.

Methodology

In this section, we describe our methods for object skeleton extraction. First, we introduce the architecture of our holistically-nested network. Then, we discuss how to optimize and fuse the multiple scale-associated side outputs in the network for skeleton extraction.

The recent work has demonstrated that fine-tuning well pre-trained deep neural networks is an efficient way to obtain a good performance on a new task. Therefore, we basically adopt the network architecture used in , which is converted from VGG 16-layer net by adding additional side output layers and replacing fully-connected layers by fully-convolutional layers with $1\times 1$ kernel size. Each fully-convolutional layer is then connected to an upsampling layer to ensure that the outputs of all the stages are with the same size. Here, we make several modifications for our task skeleton extraction: (a) we connect the proposed scale-associated side output layer to the last convolutional layer in each stage except for the first one, respectively conv2_2, conv3_3, conv4_3, conv5_3. The receptive field sizes of the scale-associated side output layers are 14, 40, 92, 196, respectively. The reason why we omit the first stage is that the receptive field size of the last convolutional layer in it is too small (only 5) to capture any skeleton features. There are few skeleton pixels with scales less than such a small receptive field size. (b) Each scale-associated side output layer provides a certain number of scale-specific skeleton score maps. Each scale-associated side output layer is connected to a slice layer to obtain the skeleton score map for each scale. Then from all the scale-associated side output layers, we use a scale-specific weight layer to fuse the skeleton score maps for this scale. Such a scale-specific weight layer can be achieved by a fully-convolutional layer with $1\times 1$ kernel size. In this way, the skeleton score maps for different scales are fused by different weight layers. The fused skeleton score maps for each scale are concatenated together to form the final predicted skeleton map. To sum up, our holistically-nested network architecture has 4 stages with additional scale-associated side output layers, with strides 2, 4, 8 and 16, respectively, and with different receptive field sizes; it also has 5 additional weight layers to fuse the side outputs. An illustration for the network architecture is shown in Fig. 4.

2 Skeleton Extraction by Fusing Scale-associated Side Outputs

Skeleton extraction can be formulated as a per-pixel classification problem. Given a raw input image $X=\{x_{j},j=1,\ldots,|X|\}$ , our goal is to predict its skeleton map $\hat{Y}=\{\hat{y}_{j},j=1,\ldots,|X|\}$ , where $\hat{y}_{j}\in\{0,1\}$ denotes the predicted label for each pixel $x_{j}$ , i.e., if $x_{j}$ is predicted as a skeleton pixel, $\hat{y}_{j}=1$ ; otherwise, $\hat{y}_{j}=0$ . Next, we describe how to learn and fuse the scale-associated side outputs in the training phase as well as how to utilize the learned network in the testing phase, respectively.

We are given a training dataset denoted by $\mathcal{S}=\{(X^{(n)},Y^{(n)}),n=1,\ldots,N\}$ , where $X^{(n)}=\{x^{(n)}_{j},j=1,\ldots,|X^{(n)}|\}$ is a raw input image and $Y^{(n)}=\{y^{(n)}_{j},j=1,\ldots,|X^{(n)}|\}$ ( $y^{(n)}_{j}\in\{0,1\}$ ) is its corresponding groundtruth skeleton map. First, we describe how to compute a quantized skeleton scale map for each training image, which will be used for guiding the network training.

where $\lambda>1$ is a factor to ensure that the receptive field sizes are sufficiently large for feature computation. (We set $\lambda=1.2$ in our experiments.) Now, for an image $X$ , we can build a quantized scale value map $Z=\{z_{j},j=1,\ldots,|X|\}\}$ ( $z_{j}\in\{0,1,\ldots,M\}$ ).

where $\beta_{k}^{(i)}$ is the loss weight for the $k$ -th class and $\text{Pr}(z^{(i)}_{j}=k|X;\mathbf{W},\mathbf{\Phi}^{(i)})\in$ is the predicted score given by the classifier for how likely the quantized scale of $x_{j}$ is $k$ . $\mathcal{N}(\cdot)$ denotes the number of non-zero elements in a set, then $\beta_{k}$ can be computed by

Let $a^{(i)}_{jk}$ be the activation of the $i$ -th side output associated with the quantized scale $k$ for the input $x_{j}$ , then we use the softmax function $\sigma(\cdot)$ to compute

$\mathbf{\Phi}=(\mathbf{\Phi}^{(i)};i=1,\ldots,M)$ denotes the parameters of the classifiers in all the stages, then the loss function for all the side outputs is simply obtained by

For an input pixel $x_{j}$ , each scale-associated side output provides a predicted score $\text{Pr}(z^{(i)}_{j}=k|X;\mathbf{W},\mathbf{\Phi}^{(i)})$ (if $k{\leq}K^{(i)}$ ) for representing how likely its quantized scale is $k$ . We can obtain a fused score $f_{jk}$ by simply summing them with weights $\mathbf{a}_{k}=({a}^{(i)}_{k};i=\max(k,1),\ldots,M)$ :

We can understand the above fusion by this way: each scale-associated side output provides a certain number of scale-specific predicted skeleton score maps, and we utilize $M+1$ scale-specific weight layers: $\mathbf{A}=(\mathbf{a}_{k};k=0,\ldots,M)$ to fuse them. Similarly, we can define a fusion loss function by

where $\beta_{k}$ is defined by the same way in Eqn. 3 and $\text{Pr}(z_{j}=k|X;\mathbf{W},\mathbf{\Phi},\mathbf{w}_{k})=\sigma(f_{jk})$ .

Finally, we can obtain the optimal parameters by

2.2 Testing Phase

Given a testing image $X=\{x_{j},j=1,\ldots,|X|\}$ , with the learned network $(\mathbf{W},\mathbf{\Phi},\mathbf{A})\ast$ , its predicted skeleton map $\hat{Y}=\{\hat{y}_{j},j=1,\ldots,|X|\}$ is obtained by

Recall that $z_{j}=0$ and $z_{j}>0$ mean that $x_{j}$ is a non-skeleton/skeleton pixel, respectively. We refer to our method as FSDS, for fusing scale-associated deep side outputs.

3 Understanding of the Proposed Method

To understand our method more deeply, we illustrate the intermediate results of our method and compare with those of HED in Fig. 5. The response of each scale-associated side output can be obtained by the similar way of Eqn. 10. We compare the response of each scale-associated side output to the corresponding one in HED (The side output 1 in HED is connected to conv1_2, while ours start from conv2_2.). With the extra scale-associated supervision, the responses of our side outputs are indeed related to scale. For example, the first one fires on the structure with small scales, such as the legs, the interior textures and the object boundaries; while in the second one, the skeleton parts of the head and neck are clear and meanwhile the noises on small scale structure are suppressed. In addition, we perform scale-specific fusion, by which each fused scale-specific skeleton score map indeed corresponds to one scale (See the first three response maps corresponding to legs, neck and torso respectively). The side outputs in HED are not able to differentiate skeleton pixels with different scales. Consequently, the first two respond on the whole body, which bring noises to the final fusion one.

Experimental Results

In this section we discuss the implementation details and compare the performance of our skeleton extraction methods with competitors.

Our architecture is built on the public available implementation of FCN and HED . The whole network is fine-tuned from an initialization with the pre-trained VGG 16-layer net .

The hyper parameters of our network include: mini-batch size(10), base learning rate ( $1\times 10^{-6}$ ), loss weight for each side-output (1), momentum (0.9), initialization of the nested filters(0), initialization of of the scale-specific weighted fusion layer ( $1/n$ , where $n$ is the number of sliced scale-specific map), the learning rate of the scale-specific weighted fusion layer ( $5\times 10^{-6}$ ), weight decay ( $2\times 10^{-4}$ ), maximum number of training iterations ( $20,000$ ).

Data augmentation is a principal way to generate sufficient training data for learning a “good” deep network. We rotate the images to 4 different angles ( $0^{\circ}$ , $90^{\circ}$ , $180^{\circ}$ , $270^{\circ}$ ) and flip with different axis(up-down,left-right,no flip), then resize images to 3 different scales ( $0.8$ , $1.0$ , $1.2$ ), totally leading to an augmentation factor of 36. Note that when resizing a groundtruth skeleton map, the scales of the skeleton pixels in it should be multiplied by a resize factor accordingly.

2 Performance Comparison

We conduct our experiments by comparing our method FSDS with many others, including a traditional image processing method (Lindeberg’s method ), three learning based segment linking methods ( Levinshtein’s method , Lee’s method and Particle Filter ), three per-pixel classification/regression methods (Distance Regression , MIL and MISL ) and a deep learning based method (HED ). For all theses methods, we use the source code provided by authors under default setting. For HED and FSDS, we perform iterations of sufficient numbers to obtain optimal models, $15,000$ and $18,000$ iterations for FSDS and HED, respectively. We apply a standard non-maximal suppression algorithm to the response maps of HED and ours to obtain the thinned skeletons for performance evaluation.

We follow the evaluation protocol used in , under which the performances of skeleton extraction methods are measured by their maximum F-meansure ( $\frac{2\cdot\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$ ) as well as precision-recall curves with respect to the groundtruth skeleton map. To obtain the precision-recall curves, the detected symmetry response is first thresholded into a binary map, which is then matched with the groundtruth skeleton map. The matching allows small localization errors between detected positives and groundtruths. If a detected positive is matched with at least one groundtruth skeleton pixel, it is classified as true positive. In contrast, pixels that do not correspond to any groundtruth skeleton pixel are false positives. By assigning different thresholds to the detected skeleton response, we can obtain a sequence of precision and recall pair, which is used to plot the precision-recall curve.

2.2 SK506

We first conduct our experiments on our newly built SK506 Dataset. Object skeletons in this dataset have large variances in both structures and scales. We split this dataset into 300 training and 206 testing images. We report the F-measure as well as the average runtime per image of each method on this dataset in Table. 1. Observed that, both traditional image processing and per-pixel/segment learning methods perform not well, indicating the difficulty of this task. In addition, the segment linking methods are extremely time consuming. Our method FSDS outperforms others significantly, even compared with the deep learning based method HED. Besides, thanks to the powerful convolution computation ability of GPU, our method can process images in real time, about 20 images per second. The precision/recall curves shown in Fig. 6 evidence again that FSDS is better than the competitors, as ours shows both improved recall and precision at most of the precision-recall regimes. We illustrate the skeleton extraction results obtained by several methods in Fig. 7 for qualitative comparison. These qualitative examples show that our method hits on more groundtruth skeleton points and meanwhile suppresses the false positives. The false positives in the results of HED are probably introduced across response of different scales. Benefited from scale-associated learning and scale-specific fusion, our method is able to suppress such false positives.

2.3 WH-SYMMAX

The WH-SYMMAX dataset totally contains 328 images, among which the first 228 are used for training and the rest are used for testing. The precision/recall curves of skeleton extraction methods are shown in Fig. 9 and summary statistics are in Table 2. Qualitative comparisons are illustrated in Fig. 10. Both quantitative and qualitative results demonstrate that our method is clearly better than others.

2.4 SYMMAX300

As we discussed in Sec. 1, a large number of groundtruths are labeled on non-object parts in SYMMAX300 , which do not have organized structures as object skeletons. Our aim is to suppress those on non-object parts, so that the obtained skeletons can be used for other potential applications. In addition, the groundtruths for scale are not provided by SYMMAX300. Therefore, we do not evaluate our method quantitatively on SYMMAX300. Even so, Fig. 8 shows that our method can obtain good skeletons of some objects in SYMMAX300. We also observe that the results obtained by our method have significantly less noises on background.

2.5 Cross Dataset Generalization

One may concern the scale-associated side outputs learned from one dataset might lead to higher generalization error when applying them to another dataset. To explore whether this is the case, we test the model learned from one dataset on another one. For comparison, we list the cross dataset generalization results of MIL , HED and our method in Table 3. Our method achieves better cross dataset generalization results than both the “non-deep” method (MIL) and the “deep” method (HED).

2.6 Symmetric Part Segmentation

Part-based models are widely used for object detection and recognition in natural images . To verify the usefulness of the extracted skeletons, we follow the criteria in for symmetric part segmentation. We evaluate the ability of our skeleton to find segmentation masks corresponding to object parts in a cluttered scene. Our network provides a predicted scale for each skeleton pixel (the fused skeleton score map for which scale has maximal response). With it we can recover object parts from skeletons. For each skeleton pixel $x_{j}$ , we can predict its scale by $\hat{s}_{j}=\sum_{i=1}^{M}\text{Pr}(z_{j}=i|X;\mathbf{\Theta}\ast,\mathbf{\Phi}\ast,\mathbf{a}_{0}\ast)r_{i}$ . Then for a skeleton segment $\{x_{j},j=1,\ldots,N\}$ , where $N$ is the number of the skeleton pixels in this segment, we can obtain a segmented object part mask by $\mathcal{M}=\bigcup_{j=1}^{N}D_{j}$ , where $D_{j}$ is the disk of center $x_{j}$ and diameter $\hat{s}_{j}$ . A confidence score is also assigned to each object part mask for quantitative evaluation: $P_{\mathcal{M}}=\frac{1}{N}\sum_{j=1}^{N}(1-\text{Pr}(z_{j}=0|X;\mathbf{\Theta}\ast,\mathbf{\Phi}\ast,\mathbf{a}_{0}\ast))$ . We compare our segmented part masks with Lee’s method and Levinshtein’s method on their BSDS-Parts dataset , which contains 36 images annotated with ground-truth masks corresponding to the symmetric parts of prominent objects. The segmentation results are evaluated by the protocol used in : A segmentation mask $\mathcal{M}_{seg}$ is counted as a hit if its overlap with the ground-truth mask $\mathcal{M}_{gt}$ is greater than $0.4$ , where overlap is measured by intersection-over-union (IoU). A precision/recall curve is obtained by varying a threshold over the confidence scores of segmented masks. The quantitative evaluation results are summarized in Fig. 11, which indicate a significant improvement over the other two methods. Some qualitative results on the BSDS-Parts dataset are shown in Fig. 12.

2.7 Object Proposal Detection

To demonstrate the potential of the extracted skeletons in object detection, we do an experiment on object proposal detection. Let $h_{B}^{E}$ be the objectness score of a bounding box $B$ obtained by Edge Boxes , we define our objectness score by $h_{B}=\frac{\sum_{\forall\mathcal{M}{\cap}B\neq\emptyset}\mathcal{M}{\cap}B}{\sum_{\forall\mathcal{M}{\cap}B\neq\emptyset}\mathcal{M}+\epsilon}{\cdot}h_{B}^{E}$ , where $\epsilon$ is a very small number to ensure the denominator to be non-zero, and $\mathcal{M}$ is a part mask reconstructed by a detected skeleton segment. As shown in Fig. 13, the new scoring method achieves a better object proposal detection result than Edge Boxes.

Conclusion

We have presented a fully convolutional network with multiple scale-associated side outputs to extract skeletons from natural images. By pointing out the relationship between the receptive field sizes of the sequential stages in the network and the skeleton scales they can capture, we emphasized the important roles of the proposed scale-associated side outputs in (1) guiding multi-scale feature learning and (2) fusing scale-specific responses from different stages. The encouraging experimental results demonstrate the effectiveness of the proposed method for skeleton extraction from natural images. It achieves significant improvements to all other competitors.

Acknowledgement. This work was supported in part by the National Natural Science Foundation of China under Grant 61303095 and 61573160, in part by Research Fund for the Doctoral Program of Higher Education of China under Grant 20133108120017, and in part by Innovation Program of Shanghai Municipal Education Commission under Grant 14YZ018. We thank NVIDIA Corporation for providing their GPU device for our academic research.