Learning Compositional Neural Information Fusion for Human Parsing

Wenguan Wang, Zhijie Zhang, Siyuan Qi, Jianbing Shen, Yanwei Pang, Ling Shao

Introduction

Human parsing, which aims to decompose humans into semantic parts (e.g., arms, legs, etc.), is a crucial yet challenging task for detailed human body configuration analysis in 2D monocular images. It has gained increasing attention owing to its essential role in many areas of application, such as surveillance analysis and fashion synthesis, to name a couple.

Recent human parsing approaches have made remarkable progress. Some representative ones are built upon well-designed deep learning architectures for semantic segmentation (e.g., fully convolutional networks (FCNs) , DeepLab , etc.). Though these achieve promising results, they fail to make full use of the rich structures in this task. Some others use extra human joints to better constrain body configurations , requiring additional training data of human keypoints and ignoring the compositional relations within human bodies.

In this paper, we segment body parts at multiple levels (see Figure 1), in contrast to most previous human parsers which only focus on atomic parts (represented as leaf nodes in the human hierarchy). The insight is that estimating the whole graph provides us cross-level information that can assist learning and inference for each body part. This is also evidenced by human perception studies ; a global shape can either precede or follow the recognition of its local parts, and both contribute to the final recognition.

We further specify this as a multi-source information fusion procedure, which integrates information from the following three processes (see Figure 1 (d)). 1) Direct inference (or unconscious inference) from the input image. For example, sometimes humans directly recognize objects by relying on intuitive understanding. 2) Top-down inference, which recognizes fine-grained components from a whole entity. For example, when recognizing small fine-grained parts, exploring contextual information of the entire object is essential (see the regions in the white circles in Figure 1). 3) Bottom-up inference, which associates constituent parts to predict upper-level nodes. When objects are partially occluded or contain complex topologies, humans can assemble sub-parts to assist in recognizing the entities (see the regions in the blue circles in Figure 1).

Employing the strong learning power of deep neural networks , we build a compositional neural information fusion for these three inference processes in an end-to-end manner. This yields a hierarchical human parsing framework to better capture the compositional constraints and human part semantics. In addition, we design our model as a conditional fusion, i.e., the assembly of different information is dependent on the confidence estimations for the sources, instead of simply assuming all the sources are reliable. This is achieved by a learnable gate mechanism, leading to more accurate parsing results.

This paper makes three contributions. 1) We formulate the human parsing problem as a neural information fusion process over a compositionally structured network. 2) We analyze three important sources of information, leading to a novel network architecture that conditionally incorporates direct, top-down, and bottom-up inferences. 3) Our model achieves state-of-the-art performances for comprehensive evaluations on four public datasets (LIP , PASCAL-Person-Part , ATR and Fashion Clothing ). Testing with more than 20K images demonstrates the superiority over existing methods of exploiting compositional structural information for human parsing.

Related Work

Hierarchical/Graphical Models in Computer Vision: Hierarchical/graphical models are powerful for building structured representations, which can reflect task-specific relations and constraints. From early distributional semantic models, part-based models , MRF/CRF , And-Or grammar model , to deep structural networks , graph neural networks , trainable CRF , etc., hierarchical/graphical models have found applications in a wide variety of core computer vision tasks, such as object recognition , human parsing , pose estimation , visual dialog etc., to the extent that they are now ubiquitous in the field. Inspired by their general success, we leverage structural information to design our approach. In addition to directly inferring segments from the image features, we further derive two additional inference processes, i.e., bottom-up and top-down inference, to better capture human structures. This encourages more reasonable results that are consistent with the human body configuration.

Information Fusion: Our method is also inspired by the idea of fusing information from different sources to obtain a better prediction of the target. One typical application of this is sensor fusion, which is a broad field in its own right. Many machine learning models can be regarded as information fusion methods: e.g., product of experts, Bayesian fusion, ensemble methods, and graphical models. Motivated by this general idea, we learn to adaptively fuse the direct inference along with top-down and bottom-up predictions in the compositional human structure for our final prediction.

Human Parsing Models: Traditional human parsing models are typically built upon hand-crafted visual features (e.g., color, HoG), low-level image decompositions (e.g., super-pixel), and heuristic hypotheses (e.g., grammars for human body configuration). Though impressive results have been achieved, these pioneering works require many carefully hand-designed pipelines, and suffer from the limited representational power of hand-crafted features.

With the renaissance of connectionism in the computer vision community, recent research efforts take deep neural networks as their main building blocks . More specifically, some efforts address the task as an active template regression problem , propagate semantic information from a retrieved, annotated image corpus , merge multi-level image context in a unified convolutional neural network , or use Graph LSTMs to model human configurations . Some others leverage extra pose information to assist the task . In contrast to the above approaches addressing category-level understanding of human semantics, a few methods operate at an instance level .

The aforementioned deep human parsers generally achieve promising results, due to the strong learning power of neural networks and the plentiful availability of annotated data. However, they typically need to pre-segment images into superpixels, which breaks the end-to-end pipeline and is time-consuming, or rely on extra human landmarks, requiring additional annotations or pre-trained pose estimators. Although one prior approach also performs multi-level, fine-grained parsing, it neither explores different information flows within human hierarchies nor models the problem from the view of multi-source information fusion.

In contrast, we elaborately design a compositional neural information fusion framework, which explicitly captures human compositional structures and dynamically combines direct, bottom-up and top-down inference modes over the hierarchy. The overall model inherits the complementary advantages of FCNs and hierarchical models, yielding a unified, end-to-end trainable human parsing framework with a strong learning ability, improved representational power, as well as high processing speed.

Our Approach

Formally, we represent the hierarchical human body structure as a graph $\mathcal{G}\!=\!(\mathcal{V},\mathcal{E},\mathcal{Y})$ , where nodes $v\!\in\!\mathcal{V}$ represent human parts in different levels, and edges $e\!\in\!\mathcal{E}$ are two-tuples $e\!=\!(u,v)$ representing the compositional relation that node $v$ is a part of node $u$ . As shown in Figure 2 (c), the nodes are further grouped into $L(=\!3)$ levels: $\mathcal{V}\!=\!\mathcal{V}^{1}\!\cup\!\dots\!\cup\!\mathcal{V}^{L}$ , where $\mathcal{V}^{1}$ are the leaf nodes (the most fine-grained semantic parts typically considered in common human parsers), $\mathcal{V}^{2\!}\!=$ {upper-body, lower-body}, and $\mathcal{V}^{3\!}\!=$ {full-body}. For each node $v$ , we want to infer a segmentation map $y_{v}\!\in\!\mathcal{Y}$ that is a probability map of its label. Please note that such a problem setting does not introduce any additional annotation requirement, since higher-level annotations can be obtained by simply combining the lower-level labels.
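To illustrate how higher-level annotations follow from leaf labels without extra labeling effort, a minimal sketch is given below. The label ids and the grouping of leaf parts into upper-body and lower-body are assumptions chosen for a six-part setting; they are not taken from any dataset's official label map.

```python
# Hypothetical leaf-label ids for a six-part dataset (0 is background).
LEAF_NAMES = {0: "background", 1: "head", 2: "torso", 3: "upper-arm",
              4: "lower-arm", 5: "upper-leg", 6: "lower-leg"}

# Assumed grouping of leaf parts into the second level:
# 1 = upper-body, 2 = lower-body, 0 = background.
LEAF_TO_LEVEL2 = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2}


def merge_labels(leaf_map, mapping):
    """Relabel a 2D leaf-level map (list of rows) with coarser labels."""
    return [[mapping[label] for label in row] for row in leaf_map]


def to_full_body(level2_map):
    """Collapse every non-background label into a single full-body label."""
    return [[1 if label != 0 else 0 for label in row] for row in level2_map]


leaf = [[0, 1, 1, 0],
        [3, 2, 2, 4],
        [0, 5, 5, 0],
        [0, 6, 6, 0]]
level2 = merge_labels(leaf, LEAF_TO_LEVEL2)   # annotations for V^2
level3 = to_full_body(level2)                 # annotations for V^3
```

Each coarser map is a pure relabeling of the finer one, which is why no additional annotation is needed.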

There are three different sources of information when inferring $y_{v}$ for $v$ : 1) the raw input image, 2) $y_{u}$ for the parent node $u$ , and 3) $\bm{y_{w}}$ for all the child nodes $\bm{w}$ . We treat the final prediction of $y_{v}$ as a fusion of the information from these three sources. Next, we briefly review different approaches to modeling this information fusion problem that motivate our solution and network design for human parsing.

Information fusion refers to the process of combining information from several sources $Z\!=\!\{z_{1},z_{2},\cdots,z_{n}\}$ in order to form a unified picture of the measured/predicted target $y$ . Each source provides an estimation of the target. These sources can be the raw data $x$ or some other quantities that can be inferred from $x$ . Several approaches have been proposed to tackle this problem.

$\bullet$ Product of experts (PoE) treats each source as an “expert”. It multiplies the probabilities and then renormalizes:

$\bullet$ Bayesian fusion. Denoting $Z_{s\!}\!=\!\{z_{1},z_{2},\!\cdots\!,z_{s}\}$ as the set of the first $s$ sources, it factorizes the posterior probability:

However, it is too difficult to learn all the conditional distributions. By assuming the independence of different information sources, we have the Naive Bayes:

which serves as an approximation of the true distribution.

$\bullet$ Ensemble methods. In this approach, each $z_{i}$ is a classifier that predicts $y$ . A typical ensemble method is Bayesian voting , which weights the prediction of each classifier to get the final prediction:

The AdaBoost algorithm also falls into this category.

$\bullet$ Graphical models (e.g., conditional random fields). In such models, each $z_{i}$ can be viewed as a node that contributes to the conditional probability:

where $A(\theta)$ is the log-partition function that normalizes the distribution. Computing $A(\theta)$ is often intractable, hence the solution is usually given by approximation methods, such as Monte Carlo methods or (loopy) belief propagation .
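The difference between multiplicative and additive fusion can be made concrete with a toy example. The sketch below assumes three sources that each output a distribution over two classes; the numbers and the uniform voting weights are illustrative only.

```python
def normalize(p):
    """Rescale non-negative scores so that they sum to one."""
    total = sum(p)
    return [v / total for v in p]


def product_of_experts(experts):
    """Multiply per-class probabilities across experts, then renormalize."""
    fused = [1.0] * len(experts[0])
    for p in experts:
        fused = [f * v for f, v in zip(fused, p)]
    return normalize(fused)


def weighted_vote(experts, weights):
    """Weighted sum of expert predictions (weights assumed to sum to 1)."""
    n = len(experts[0])
    return [sum(w * p[k] for w, p in zip(weights, experts)) for k in range(n)]


# Three sources: two favour class 0, one favours class 1.
experts = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
poe = product_of_experts(experts)
vote = weighted_vote(experts, [1 / 3, 1 / 3, 1 / 3])
```

On this input the product rule is more decisive than the uniform vote, because a single confident expert dominates the product, while the weighted sum stays closer to the average opinion.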

Compositional Neural Information Fusion

The above methods can all be viewed as ways to approximate the true underlying distribution $p(y|Z)$ , which can be written as a function of predictions from different information sources $Z$ :

There are potential drawbacks to following the exact solution of one of the above methods. First, they are not entirely consistent with each other. For example, the PoE multiplies all $p(y|z_{i})$ together, whereas ensemble methods compute their weighted sum. Each method approximates the true distribution in a different way and has its own tradeoff. Second, exact inference is difficult and solutions are often approximate (e.g., contrastive divergence is used for PoE and Monte Carlo methods for graphical models).

Therefore, instead of exactly following the computation of one of the above methods, we leverage neural networks to directly model this fusion function, due to their strong ability for flexible feature learning and function approximation . The hope is that we can directly learn to fuse multi-source information for a specific task.

However, the fusion network should not be learned arbitrarily without inductive biases, i.e., the preferences for structural explanations exhibited in human reasoning processes. Here, we exploit the compositional nature of the problem and design the network with the following observations:

$\bullet$ In the compositional structure $\mathcal{G}$ , the final prediction $p(y_{v}|Z)$ for each node $v$ combines information from three different sources: 1) the direct inference $p(y_{v}|x)$ from the raw image input, 2) the top-down inference $p(y_{v}|y_{u})$ from the parent node $u$ , which utilizes the decompositional relation, and 3) the bottom-up inference $p(y_{v}|\bm{y_{w}})$ , which assembles predictions $\bm{y_{w}}$ for all the child nodes $\bm{w}$ to leverage the compositional relation.

$\bullet$ In many cases, simply fusing different estimations could be problematic. The final decision should be conditioned on the confidence of each information source.

Based on the above observations, we design our parser network to learn a compositional neural information fusion:

where the confidence $\delta$ is a learnable continuous function with outputs from 0 to 1. The symbols $\Rsh$ , $\downarrow$ , and $\uparrow$ denote direct, top-down, and bottom-up inference, respectively. As shown in Figure 2 (d), this function fuses information from the three sources in the compositional structure, taking into account the confidence of each source. For neural network realizations of this function, the probability terms can be relaxed to logits, which are essentially unnormalized log-probabilities.
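A minimal sketch of confidence-gated fusion for a single pixel of a single node follows. The sigmoid gate and the plain sum used as the fusion function are assumptions for illustration; in the model both the gates and the fusion function are learned networks.

```python
import math


def sigmoid(x):
    """Squash a raw gate activation into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(logits, gate_scores):
    """Fuse per-source logits for one pixel of one node.

    logits:      dict source -> logit (direct / top_down / bottom_up)
    gate_scores: dict source -> raw gate activation
    The plain sum below is a stand-in for the learned fusion function.
    """
    confidences = {s: sigmoid(gate_scores[s]) for s in logits}
    fused = sum(confidences[s] * logits[s] for s in logits)
    return fused, confidences


logits = {"direct": 2.0, "top_down": 1.5, "bottom_up": -3.0}
# Hypothetical gate activations: the bottom-up source is judged unreliable.
gates = {"direct": 4.0, "top_down": 2.0, "bottom_up": -6.0}
fused, conf = gated_fusion(logits, gates)
```

Because the bottom-up confidence is close to zero, its strongly negative logit barely affects the fused score, whereas an ungated sum would be pulled down by it.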

When carrying out such a prediction, there is one computational issue. Notice that the top-down/bottom-up inferences rely on an estimation of the parent/child node(s). This forms a circular dependency between a parent and its children. To solve this, we treat the direct inference result from the raw data as an initial estimation, and the top-down/bottom-up inferences rely on this initial estimation. (For some nodes, bottom-up or top-down inference might not exist: the terminal leaf nodes $\mathcal{V}^{1}$ do not have bottom-up inference, while the root node $\mathcal{V}^{3}$ only has direct and bottom-up inference. For clarity of the method description, we discuss the general case with all three sources.) Therefore, we decompose the algorithm into three consecutive steps: 1) direct inference, which assigns an initial estimation to each node from the raw input; 2) top-down/bottom-up inference, which is computed from the initial estimations of the parent/child node(s); and 3) conditional information fusion, which combines the three results into the final prediction for each node.

This procedure motivates the overall network architecture, where each step above can be learned as a module by a neural network. Next, we discuss our network design.
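The three-step procedure can be sketched on a toy hierarchy with one scalar score per node. The node names, the stand-in inference functions and the averaging fusion are all illustrative assumptions; the point is only the order of computation and the handling of nodes that lack a parent or children.

```python
# Toy hierarchy: parent -> children (node names are illustrative).
CHILDREN = {"full-body": ["upper-body", "lower-body"],
            "upper-body": ["head", "torso"],
            "lower-body": ["upper-leg", "lower-leg"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}


def parse(direct_scores, top_down, bottom_up, fuse):
    """Run the three steps on per-node scalar scores.

    top_down(parent_estimate) and bottom_up(list_of_child_estimates) are
    stand-ins for the inference networks; fuse combines available sources.
    """
    # Step 1: direct inference provides the initial estimate for every node.
    initial = dict(direct_scores)
    final = {}
    for v in initial:
        sources = {"direct": initial[v]}
        # Step 2: top-down / bottom-up inference uses the *initial* estimates,
        # which breaks the circular parent-child dependency.
        if v in PARENT:        # the root has no parent
            sources["top_down"] = top_down(initial[PARENT[v]])
        if v in CHILDREN:      # leaves have no children
            sources["bottom_up"] = bottom_up([initial[w] for w in CHILDREN[v]])
        # Step 3: fuse whatever sources exist for this node.
        final[v] = fuse(sources)
    return final


direct = {"full-body": 0.9, "upper-body": 0.8, "lower-body": 0.6,
          "head": 0.7, "torso": 0.9, "upper-leg": 0.5, "lower-leg": 0.4}
result = parse(direct,
               top_down=lambda parent: parent,
               bottom_up=lambda kids: max(kids),
               fuse=lambda s: sum(s.values()) / len(s))
```

Leaf nodes end up with two sources (direct and top-down), the root with two (direct and bottom-up), and intermediate nodes with all three.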

Network Architecture

Our model stacks the following parts to form an end-to-end system for hierarchical human parsing. The system does not require any preprocessing and the modules are FCNs, so it achieves high efficiency.

As the nodes $\mathcal{V}$ capture explicit semantics, a specific feature ${h}_{v}$ for each node $v$ is desired for more efficient representation. However, using several different, node-specific embedding networks would lead to a high computational cost. To remedy this, for the $l$ -th level, we first apply a level-specific FCN (LSF) to describe the level-wise semantics and contextual relations:

where $l\!\in\!\{1,2,3\}$ . More specifically, three LSFs ( ${F}^{1}_{\text{LSF}}$ , ${F}^{2}_{\text{LSF}}$ , and ${F}^{3}_{\text{LSF}}$ ) are learned to extract three level-specific embeddings ( ${h}^{1}_{\text{LSF}},{h}^{2}_{\text{LSF}}$ , and ${h}^{3}_{\text{LSF}}$ ). Further, for each node $v$ , an independent channel-attention block, Squeeze-and-Excitation (SE) , is applied to obtain its specific feature:

where $[\cdot]$ is a concatenation operation. Then, the bottom-up network $F_{\uparrow}$ gives a prediction according to compositional relations (see Figure 3 (d)):

Conditional Fusion Network. Before making the final prediction, we estimate the confidence $\delta$ of each information source using a neural gate function. For the direct inference of a node $v$ , we estimate the confidence by:

The confidence scores for the top-down and bottom-up processes follow a similar computational framework:

Finally, for each node $v$ , the fusion network $F_{\cup}$ combines the results from the three inference networks above for final prediction (see Figure 3 (e)):

Loss Function. To obtain the final segmentation map from $\text{logit}(y_{v}|Z)$ , we apply a softmax function over the logits of nodes in the same level. Thus, for each level, all the inference networks ( $F_{\Rsh},F_{\downarrow},F_{\uparrow}$ ) and the fusion network $F_{\cup}$ are trained by the standard cross-entropy loss:
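As a rough illustration of a softmax over the nodes of one level followed by cross-entropy, consider the sketch below. It works on plain Python lists with one logit vector per pixel; the inclusion of a background class and the averaging over pixels are assumptions made for the example.

```python
import math


def softmax(logits):
    """Softmax over the nodes of one level at a single pixel."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target node at one pixel."""
    return -math.log(softmax(logits)[target])


def level_loss(pixel_logits, pixel_targets):
    """Mean cross-entropy over pixels for the nodes of one level."""
    losses = [cross_entropy(l, t) for l, t in zip(pixel_logits, pixel_targets)]
    return sum(losses) / len(losses)


# Two pixels, three classes in the level
# (assumed here: background, upper-body, lower-body).
logits = [[0.2, 2.0, -1.0], [1.0, 1.0, 1.0]]
targets = [1, 2]
loss = level_loss(logits, targets)
```

The same loss form would be applied to the output of each inference network and of the fusion network; how the individual terms are weighted is not specified here.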

Implementation Details

Top-down/Bottom-up Inference Network. The architectures of the top-down $F_{\downarrow}$ (Equation 12) and bottom-up $F_{\uparrow\!}$ (Equation 14) inference networks are very similar, and only differ in their strategies of processing the input features (see Equation 13). Both are achieved by three cascaded convolutional layers, with convolution sizes of $3\!\times\!3$ , $3\!\times\!3$ and $1\!\times\!1$ , respectively.

Information Fusion Network. $F_{\cup}$ in Equation 17 consists of three $1\!\times\!1$ convolutional layers with ReLU activations for non-linear mapping.

Experiments

Datasets: We perform extensive experiments on the following four widely-tested datasets:

LIP has 50,462 single-person images with elaborate pixel-wise annotations of 19 part categories (e.g., hair, face, left-/right-arms, left-/right-legs, left-/right-shoes, etc.). The images are divided into 30,462 samples for training, 10,000 for validation and 10,000 for testing.

PASCAL-Person-Part contains multiple humans per image in unconstrained poses and occlusions (1,716 for training and 1,817 for testing). It provides careful pixel-wise annotations for six body parts (i.e., head, torso, upper-/lower-arms, and upper-/lower-legs).

ATR includes 7,700 images (6,000 for training, 700 for validation and 1,000 for testing), annotated at pixel-level with 17 categories, e.g., hat, sunglass, face, upper-clothes, pants, left-/right-arms, left-/right-legs, etc.

Fashion Clothing consists of Colorful Fashion Parsing , Fashionista , and Clothing Co-Parsing . It is more concerned with human clothing details, including 17 categories (e.g., glass, hair, pants, shoes, shirt, upper-clothes, skirt, scarf, socks, etc.). It has 4,371 images in total (3,934 for training, and 437 for testing).

Evaluation Metrics: For LIP, following its standard protocol, we report pixel accuracy, mean accuracy and mean Intersection-over-Union (mIoU). For PASCAL-Person-Part, following conventions, the performance is evaluated in terms of mIoU. For ATR and Fashion Clothing, following prior work, we report five metrics: pixel accuracy, foreground accuracy, average precision, average recall, and average F1-score.
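For reference, pixel accuracy, mean accuracy and mIoU can all be derived from a confusion matrix. The sketch below uses the usual definitions on flattened label lists; skipping classes that never occur is a common convention and an assumption here, not something stated by the benchmark protocols.

```python
def confusion_matrix(pred, gt, num_classes):
    """Count pixels; rows index ground truth, columns index prediction."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        cm[g][p] += 1
    return cm


def pixel_accuracy(cm):
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / sum(map(sum, cm))


def mean_accuracy(cm):
    """Average per-class recall over classes present in the ground truth."""
    accs = [cm[k][k] / sum(cm[k]) for k in range(len(cm)) if sum(cm[k]) > 0]
    return sum(accs) / len(accs)


def mean_iou(cm):
    """Average of TP / (TP + FP + FN) over classes that occur at all."""
    ious = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(row[k] for row in cm) - tp
        if tp + fp + fn > 0:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


gt = [0, 0, 1, 1, 1, 2]
pred = [0, 1, 1, 1, 2, 2]
cm = confusion_matrix(pred, gt, 3)
```

On this toy input, four of six pixels are correct, while every class has an IoU of one half, which shows how mIoU penalizes both missed and spurious pixels.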

Training Settings: During training, the weights of the backbone network are loaded from ResNet101 pre-trained on ImageNet, and the remaining layers are randomly initialized. For data preparation, we apply data augmentation to all the training data, including random scaling, cropping and left-right flipping. The random scale is set from 0.5 to 2.0, while the crop size is set to $473\!\times\!473$ . For optimization, we adopt SGD with a momentum of 0.9 and a weight_decay of 0.0005. For the learning rate, we use the ‘poly’ learning rate schedule, $lr\!=\!base\_lr\!\times\!(1\!-\!\frac{iters}{total\_iters})^{power}$ , in which power $=$ $0.9$ and base_lr $=$ $0.007$ . The total_iters is $epochs\!\times\!batch\_size$ , where batch_size $=$ $40$ and epochs $=$ $150$ . We use multiple GPUs to accommodate the large batch_size, and implement Synchronized Cross-GPU BN.
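The ‘poly’ schedule is simple enough to state directly in code. The sketch below plugs in the hyperparameters quoted above; the function name and the sampled iteration counts are illustrative.

```python
def poly_lr(base_lr, iters, total_iters, power=0.9):
    """'Poly' schedule: base_lr * (1 - iters / total_iters) ** power."""
    return base_lr * (1.0 - iters / total_iters) ** power


base_lr = 0.007
total_iters = 150 * 40          # epochs * batch_size, as given in the text
# Learning rate at the start, the midpoint and the end of training.
schedule = [poly_lr(base_lr, it, total_iters) for it in (0, 3000, 6000)]
```

The rate starts at base_lr and decays smoothly to zero at the final iteration; with power below 1 the decay is slightly slower than linear early on and steeper near the end.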

Testing Phase: Following the general protocol, we average the per-pixel classification scores at multiple scales with flipping, i.e., the scale is 0.5 to 1.5 (in increments of 0.25) times the original size. Our model does not require any other pre-/post-processing steps (e.g., over-segmentation, human pose, CRF), and thus achieves a processing speed of 23.0fps, averaged on PASCAL-Person-Part, which is faster than previous deep human parsers, such as Joint (0.1fps), Attention+SSL (2.0fps), MMAN (3.5fps) and MuLA (15fps).
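Multi-scale testing with flipping can be sketched as follows. This is a simplified, dependency-free illustration on 2D lists with nearest-neighbour resizing; `predict` is a hypothetical callable standing in for the parser, and a real implementation would operate on tensors with one score map per class.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D list."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]


def hflip(img):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in img]


def multi_scale_flip_average(image, predict, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Average per-pixel scores over rescaled and mirrored copies of the input.

    predict maps a 2D image to a 2D score map of the same size (hypothetical).
    """
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    count = 0
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = resize_nearest(image, sh, sw)
        for flipped in (False, True):
            inp = hflip(scaled) if flipped else scaled
            score = predict(inp)
            if flipped:
                score = hflip(score)      # undo the mirror before averaging
            score = resize_nearest(score, h, w)
            for r in range(h):
                for c in range(w):
                    total[r][c] += score[r][c]
            count += 1
    return [[v / count for v in row] for row in total]


image = [[0.5] * 4 for _ in range(4)]
avg = multi_scale_flip_average(image, predict=lambda x: x)
```

A parser with distinct left/right part labels would typically also need to swap those channels after un-flipping; that step is omitted in this sketch.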

Reproducibility: Our method is implemented in PyTorch and trained on four NVIDIA Tesla V100 GPUs with 32GB of memory per card. All the testing procedures are carried out on a single NVIDIA TITAN Xp GPU with 12GB memory for a fair speed comparison. To provide full details of our training and testing processes, we release our code at https://github.com/ZzzjzzZ/CompositionalHumanParsing.

Quantitative Results

We compare the proposed method with several strong baselines on the four aforementioned challenging datasets.

LIP: We compare our method with 11 state-of-the-art methods on the LIP val set in Table 1. Our method achieves a large improvement in average IoU (4.64% better than the second best method, CE2P, and 8.4% better than the third best, MuLA). To verify its effectiveness in detail, we report per-class IoU in Table 2. Our model improves the performance over almost all classes, especially for the ones typically associated with small regions (e.g., gloves, sunglasses, socks, shoes), due to our top-down inference strategy. The results are also impressive for arms, legs, and shoes, demonstrating our model’s ability to distinguish between “left” and “right” with the help of compositional relations.

PASCAL-Person-Part: On its test set, we compare our method with 15 state-of-the-art methods using the IoU score. As shown in Table 3, our model outperforms previous methods across the vast majority of classes and on average.

ATR: Table 4 reports the evaluation on the ATR test set. Our model again outperforms other competitors across most metrics. In particular, it achieves an average F-1 score of 85.51%, which is 3.45% better than TGPNet and 5.37% better than Co-CNN.

Fashion Clothing: We compare our method with five well-known models on the Fashion Clothing test set, using the pre-computed evaluations reported in prior work. From Table 5, we observe that our model surpasses other competitors across all metrics by a large margin. Notably, it yields an F-1 score of 58.12%, significantly outperforming TGPNet and Attention by +6.20% and +9.44%, respectively.

Overall, our model consistently obtains promising results over different datasets, which clearly demonstrates its superior performance and strong generalizability. This also distinguishes our model from several previous state-of-the-art deep human parsers, since it does not use extra pose annotations during training.

Qualitative Results

In Figure 5, we show some visual results on the PASCAL-Person-Part test set. Our method yields more precise predictions compared to SS-NAN, DeepLabV2 and PGN. For example, in the last row, our method correctly labels the lower-legs of the rider, while other methods face difficulties in this case. Our model also provides clearer details for small parts. As seen in the second row, the small lower-arm regions can be successfully segmented out with the constraint of top-down inference. In general, by effectively exploiting the human semantic hierarchy, our approach outputs reasonable results for confusing labels on the human parsing task.

Ablation Study

Table 6 shows an evaluation of our full model compared to ablated versions without certain key components. All the variants are retrained independently with their specific network architectures. Here, 1st-Level denotes the atomic parts (e.g., head, leg, etc.) in $\mathcal{V}^{1}$ , 2nd-Level denotes $\mathcal{V}^{2}$ (lower-/upper-body), and 3rd-Level denotes $\mathcal{V}^{3}$ (full body). The experiments are performed on the PASCAL-Person-Part test set using the mIoU metric. Three essential conclusions can be drawn from our results. First, instead of only modeling the fine-grained parts in $\mathcal{V}^{1}$ (i.e., backbone), even directly learning to parse the whole human hierarchy (i.e., direct) can bring a performance gain (64.14 $\rightarrow$ 65.27). This suggests that modeling the human hierarchy leads to a comprehensive understanding of human semantics. Second, further considering bottom-up and top-down inference provides a substantial performance gain, demonstrating the benefit of exploiting human structures and efficient information fusion strategies in this problem. Note that in (direct vs. direct+bottom-up) and (direct+top-down vs. direct+bottom-up+top-down), even for the 1st-level nodes that do not have bottom-up inference, the training itself brings a performance gain. The reason is that the bottom-up inference explicitly captures compositional relations and thus improves the quality of the learnt features. Similar observations can also be found in (direct vs. direct+top-down) and (direct+bottom-up vs. direct+bottom-up+top-down) for the 3rd-level node. These observations suggest that the compositional information fusion not only improves the predictions during inference but also boosts the learning ability of our human parser model. Third, conditionally fusing information boosts performance, as the information from low-quality sources can be suppressed. This also provides a new glimpse into the information fusion mechanism over hierarchical models.
A visual comparison between the results from our backbone network, our model only using compositional fusion and our full model can be found in Figure 6 (b-d), which intuitively shows the improvements from our conditional and compositional information fusion.

Conclusion

In this work, we parse human parts in a hierarchical form, enabling us to capture human semantics from a more comprehensive view. We tackle this hierarchical human parsing problem through a neural information fusion framework that explores the compositional relations within human structures. It efficiently combines the information from the direct, top-down, and bottom-up inference processes while considering the reliability of each process. Extensive quantitative and qualitative comparisons performed on four datasets demonstrate that our method outperforms the current alternatives by a large margin.

Acknowledgements: The authors thank Prof. Song-Chun Zhu and Prof. Ying Nian Wu from UCLA Statistics Department for helpful comments on this work. The work reported herein was supported in part by DARPA XAI grant N66001-17-2-4029, ARO grant W911NF-18-1-0296, CCF-Tencent Open Fund, and the National Natural Science Foundation of China (No. 61632018).