Learning to Assemble Neural Module Tree Networks for Visual Grounding

Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha

Introduction

Visual grounding (a.k.a. referring expression comprehension) aims to localize a natural language description in an image. It is one of the core AI tasks for testing machine comprehension of visual scenes and language. Perhaps the most fundamental and related grounding system for words is object detection (or segmentation): the image regions (or pixels) are classified to the corresponding word of the object class. Despite their diverse model architectures, their sole objective is to calculate a grounding score for a visual region and a word, measuring the semantic association between the two modalities. Thanks to the development of deep visual features and language models, we can extend grounding systems from a fixed-size inventory of words to open vocabularies or even descriptive and relational phrases.

However, grounding complex language sentences, e.g., “a pink umbrella carried by a girl in pink boots”, is far different from the above word or phrase cases. For example, given the image in Figure 1, how would we humans localize the “umbrella”? One may have the following reasoning process: 1) Identify the referent “umbrella”, but there are two of them. 2) Use the contextual evidence “carried by a girl”, but there are two girls. 3) By using the more specific evidence “in pink boots”, localize the “girl” from the last step. 4) Finally, by accumulating the above evidence, localize the target “umbrella”.

Unfortunately, existing visual grounding methods generally rely on 1) a single monolithic score for the whole sentence (Figure 1(a)), or 2) a composite score for subject, predicate, and object phrases (Figure 1(b)). Though some of them adopt a word-level attention mechanism to focus on the informative language parts, their reasoning is still coarse compared to the above human-level reasoning. More seriously, such coarse grounding scores are easily biased toward learning certain vision-language patterns rather than visual reasoning, e.g., if most of the “umbrellas” are “carried by people” in the dataset, the score may not be responsive to other cases such as “people under umbrella stall”. Not surprisingly, this problem has been repeatedly discovered in many end-to-end vision-language embedding frameworks used in other tasks such as VQA and image captioning.

In this paper, we propose to exploit Dependency Parsing Trees (DPTs), which already offer an off-the-shelf schema for the composite reasoning in visual grounding. Specifically, to empower the visual grounding ability by DPT, we propose a novel neural module network: Neural Module Tree (NMTree), which provides explainable grounding scores in great detail. As illustrated in Figure 1(c), we transform a DPT into an NMTree by assembling three primitive module networks: Single for leaves and root, Sum and Comp for internal nodes (detailed in Section 3.3). Each module calculates a grounding score, which is accumulated in a bottom-up fashion, simulating the visual evidence gained so far. For example in Figure 1(c), $\texttt{Comp}[carried]$ receives the scores gained by $\texttt{Sum}[by]$ and then calculates a new score for the region composition, meaning “something is carried by the thing that is already grounded by the ‘by’ node”. Thanks to the fixed reasoning schema, NMTree disentangles the visual perception from the composite reasoning to alleviate the unnecessary vision-language bias, as the primitive modules receive consistent training signals with relatively simpler visual patterns and shorter language constituents.

One may be concerned about the potential brittleness caused by DPT parsing errors that impact the robustness of the module assembly, as discovered in most neural module networks applied in practice. We address this issue in three ways: 1) the assembly is simple: except for Single, which is fixed for leaves and the root, only Sum and Comp are to be determined at run-time; 2) Sum is merely an Add operation that requires no visual grounding; 3) we adopt the recently proposed Gumbel-Softmax (GS) approximation for the discrete assembly decision. During training, the forward pass selects between the two modules with the GS sampler in a “hard” discrete fashion, while the backward pass updates all possible decisions through the straight-through gradient estimator in a “soft” robust way. With the GS strategy, the entire NMTree can be trained end-to-end without any additional module layout annotations.

We validate the effectiveness of NMTree on three challenging visual grounding benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. NMTree achieves new state-of-the-art performance on most of the test splits and grounding tasks. Qualitative results and a human evaluation indicate that NMTree is transparent and explainable.

Related Work

Visual grounding is a task that requires a system to localize a region in an image given a natural language expression. Different from object detection, the key for visual grounding is to utilize the linguistic information to distinguish the target from other objects, especially objects of the same category.

To solve this problem, pioneering methods use a CNN-LSTM structure to localize the region that can generate the expression with maximum a posteriori probability. Recently, joint embedding models have become widely used; they model the conditional probability and then localize the region with maximum probability conditioned on the expression. Our model belongs to the second category. However, compared with previous works that neglect the rich linguistic structure, we take a step forward by taking structural information into account. Compared with prior work that relies on a constituency parsing tree, our model applies a dependency parsing tree with great parsing detail, and the module assembly is learned end-to-end from scratch, whereas theirs is hand-crafted.

There are some works on using module networks in the visual grounding task. However, they over-simplify the language structure and their modules are too coarse compared to ours. Fine-grained module networks are widely used in VQA. However, they rely on additional annotations to learn a sentence-to-module layout parser, which are not available in general domains. Our module layout is trained from scratch by using the Gumbel-Softmax training strategy, which has been shown to be empirically effective in recent works.

NMTree Model

In this section, we first formulate the problem of visual grounding in Section 3.1. Then, by using the walk-through example illustrated in Figure 2, we introduce how to build NMTree in Section 3.2 and how to calculate the grounding score using NMTree in Section 3.3. Finally, we detail the Gumbel-Softmax training strategy in Section 3.4.

Therefore, the key is to define a proper $S(\cdot)$ that distinguishes the target region from others by comprehending the language composition.

To this end, we propose to use the Dependency Parsing Tree (DPT) as a fine-grained language decomposition, which empowers the grounding model to perform visual reasoning in great detail (Figure 1(c)):

$$S(\bm{x}_{i},\mathcal{L}) = \sum_{t} S_{t}(\bm{x}_{i},\mathcal{L}_{t}), \tag{2}$$

where $t$ is a node in the tree, $S_{t}(\cdot)$ is a node-specific score function that calculates the similarity between a region and a node-specific language part $\mathcal{L}_{t}$ . Intuitively, Eq. (2) is more human-like: accumulating the evidence (e.g., grounding score) while comprehending the language. Next, we will introduce how to implement Eq. (2).
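To make the accumulation concrete, the following minimal sketch sums per-node scores for each candidate region and then retrieves the highest-scoring region. The node scores are made-up numbers for the umbrella example, and `ground` is a hypothetical helper name introduced only for illustration.

```python
def ground(node_scores):
    """Sum node-specific scores per region and return (best_index, totals).

    node_scores maps a tree node (a word) to a list of K scores, one per
    candidate region. All values here are illustrative, not model outputs.
    """
    num_regions = len(next(iter(node_scores.values())))
    totals = [0.0] * num_regions
    for scores in node_scores.values():
        for i, s in enumerate(scores):
            totals[i] += s
    best = max(range(num_regions), key=lambda i: totals[i])
    return best, totals


# Three candidate regions; every node contributes evidence for each region.
toy_scores = {
    "umbrella": [2.0, 2.0, 0.0],   # two umbrellas look equally plausible
    "carried":  [1.0, 1.0, 0.0],   # both are carried by a girl
    "boots":    [1.5, 0.0, 0.5],   # only one girl wears pink boots
}
best, totals = ground(toy_scores)
print(best, totals)
```

No single node separates the first two regions in this toy case; only the accumulated evidence does, which is the behavior the tree score is meant to capture.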

3.2 Sentence to NMTree

There are three steps to transform a sentence into the proposed NMTree, as shown in the bottom three blocks of Figure 2. First, we parse the sentence into a DPT, where each word is a tree node. Then, we encode each word and its linguistic information into a hidden vector by a Bidirectional Tree LSTM. Finally, we assemble the neural modules to the tree according to node hidden vectors.

Dependency Parsing Tree. We adopt the dependency parser from the Spacy toolbox (https://spacy.io/). As shown in Figure 2, it structures the language into a tree, where each node is a word with its part-of-speech (POS) tag and the dependency relation label of the directed edge from it to another word, e.g., “riding” is VB (verb) and its nsubj (nominal subject) is “man” as NN (noun). The DPT offers an in-depth comprehension of a sentence, and its tree structure offers a reasoning path for visual grounding. Note that a free-form sentence usually yields syntax elements that are unnecessary for grounding, such as determiners, symbols, and punctuation. We remove these nodes and edges to reduce the computational overhead without hurting the performance.
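The pruning step can be sketched on a hand-written parse. The token tuples, the coarse tag names (`DET`, `SYM`, `PUNCT`), and the `prune` helper below are assumptions made for illustration; in practice the parse would come from the dependency parser, and the exact set of removed tags is not specified in the text.

```python
# Each token: (index, word, POS tag, dependency label, head index).
# A root token points to itself as its head.
REMOVED_POS = {"DET", "SYM", "PUNCT"}  # assumed tag names for removed nodes


def prune(tokens):
    """Drop determiner, symbol, and punctuation leaves; keep everything else."""
    has_children = {head for idx, _, _, _, head in tokens if head != idx}
    return [tok for tok in tokens
            if not (tok[2] in REMOVED_POS and tok[0] not in has_children)]


# Hand-written parse in which "riding" is the root and "man" is its nsubj.
parse = [
    (0, "the", "DET", "det", 1),
    (1, "man", "NOUN", "nsubj", 2),
    (2, "riding", "VERB", "ROOT", 2),
    (3, "a", "DET", "det", 4),
    (4, "horse", "NOUN", "dobj", 2),
    (5, ".", "PUNCT", "punct", 2),
]
kept = prune(parse)
print([word for _, word, _, _, _ in kept])
```

Only leaves are removed in this sketch, so the remaining head indices still describe a valid tree.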

Bidirectional Tree LSTM. Once the DPT is obtained, we encode each node into a hidden vector by a bidirectional tree-structured LSTM. This bidirectional (i.e., bottom-up and top-down) propagation makes each node aware of the information from both its children and its parent. This is particularly crucial for capturing the context in a sentence. For each node $t$, we embed the word $w_{t}$, POS tag $p_{t}$, and dependency relation label $d_{t}$ into a concatenated embedding vector as:

$$\bm{e}_{t}=[\mathbf{E}_{w}\Pi_{w_{t}};\,\mathbf{E}_{p}\Pi_{p_{t}};\,\mathbf{E}_{d}\Pi_{d_{t}}], \tag{3}$$

where $\mathbf{E}_{w}$ , $\mathbf{E}_{p}$ , and $\mathbf{E}_{d}$ are trainable embedding matrices, $\Pi_{w_{t}}$ , $\Pi_{p_{t}}$ , and $\Pi_{d_{t}}$ are one-hot encodings, for word, POS tag, and dependency relation label, respectively.

Our tree LSTM implementation is based on the Child-Sum Tree LSTM (for space reasons, we leave the details to the supplementary material). Taking the bottom-up direction as an example, a node $t$ receives the LSTM states from its children node set $\mathcal{C}_{t}$ and its embedding vector $\bm{e}_{t}$ as input to update its state:

where $\bm{c}_{tj}^{\uparrow},\,\bm{h}_{tj}^{\uparrow}$ denote the cell and hidden vectors of the $j$ -th child of node $t$ . By applying the TreeLSTM in two directions, we can obtain the final node hidden vector $\bm{h}_{t}$ as:
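The two propagation directions can be illustrated without any learned parameters. In the sketch below, plain word lists stand in for LSTM states, the tree is a toy example, and combining the two directions by pairing them is an assumption made for illustration; `bottom_up` and `top_down` are hypothetical helper names.

```python
def bottom_up(tree, node, up):
    """Post-order pass: a node's state summarizes itself and its subtree."""
    state = [node]
    for child in tree.get(node, []):
        bottom_up(tree, child, up)
        state += up[child]
    up[node] = state


def top_down(tree, node, down, parent_state=()):
    """Pre-order pass: a node's state summarizes its ancestors and itself."""
    down[node] = list(parent_state) + [node]
    for child in tree.get(node, []):
        top_down(tree, child, down, down[node])


# Children lists of a toy tree rooted at "riding".
tree = {"riding": ["man", "horse"], "horse": ["white"]}
up, down = {}, {}
bottom_up(tree, "riding", up)
top_down(tree, "riding", down)

# Final per-node context: both directions side by side.
context = {node: (up[node], down[node]) for node in up}
print(context["horse"])
```

After both passes, the node "horse" has seen its child "white" through the bottom-up state and its parent "riding" through the top-down state, which is the kind of context the bidirectional encoding is meant to provide.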

Module Assembler. Given the node embedding $\bm{e}_{t}$ and the node hidden vector $\bm{h}_{t}$ obtained above, we feed them into a module assembler to determine which module should be assembled at node $t$. As we will detail in Section 3.3, we have three modules, i.e., Single, Sum, and Comp. Since Single is always assembled at the leaves and the root, the assembler only needs to choose between Sum and Comp as:

It is worth noting that the assembler is not purely linguistic even though Eq. (6) is based on DPT node features. In fact, thanks to back-propagation training, visual cues are eventually incorporated into the parameters of Eq. (6). Figure 3 illustrates which types of words are likely to be assigned to each module. We can find that the Sum module receives more visible words (e.g., adjectives and nouns), and the Comp module receives more words describing relations (e.g., verbs and prepositions). This reveals the explainable potential of NMTree. Finally, after the above three steps, we obtain an NMTree in which every node has an assembled module. Next, we will elaborate on the three types of modules.
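The assembly rule can be sketched as follows: Single is fixed for the root and the leaves, and each internal node takes whichever of Sum and Comp has the larger relative score. The `assemble` helper, the tree encoding, and the numeric logits are illustrative assumptions; in the model the two scores would come from the learned assembler rather than a hand-written table.

```python
def assemble(tree, root, logits):
    """Assign a module name to every node of the tree.

    tree:   dict mapping a node to its list of children
    logits: dict mapping an internal node to (sum_logit, comp_logit);
            hypothetical stand-ins for the assembler outputs
    """
    modules = {}
    stack = [root]
    while stack:
        node = stack.pop()
        children = tree.get(node, [])
        if node == root or not children:
            modules[node] = "Single"          # fixed for the root and leaves
        else:
            sum_logit, comp_logit = logits[node]
            modules[node] = "Sum" if sum_logit >= comp_logit else "Comp"
        stack.extend(children)
    return modules


# Toy tree for "a pink umbrella carried by a girl in pink boots"
# ("pink2" marks the second occurrence of "pink").
tree = {
    "umbrella": ["pink", "carried"], "carried": ["by"], "by": ["girl"],
    "girl": ["in"], "in": ["boots"], "boots": ["pink2"],
}
logits = {
    "carried": (-1.0, 2.0), "by": (1.5, 0.0), "girl": (2.0, -0.5),
    "in": (0.0, 1.0), "boots": (1.0, 0.2),
}
modules = assemble(tree, "umbrella", logits)
print(modules)
```

With these made-up scores the layout reproduces the example from the introduction, where the Comp module at "carried" sits above the Sum module at "by".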

3.3 NMTree Modules

Given the above assembled NMTree, we can implement the tree grounding score proposed in Eq. (2) by accumulating the scores in a bottom-up fashion. There are three types of modules used in NMTree, i.e., Single, Sum and Comp. Each module at node $t$ updates the grounding score $\bm{s}_{t}=[s_{t}^{1},\cdots,s_{t}^{K}]$ for all the $K$ regions in the image $\mathcal{I}$ and outputs to its parent. In the following, we will first introduce language representation and common functions used in the modules, and then detail each module.

Language Representation. For node $t$ , we have two language representations: $\bm{y}^{s}_{t}$ is used to associate with a single visual feature and $\bm{y}^{p}_{t}$ is used to associate with a pairwise visual feature. We denote the node set of node $t$ as $\mathcal{N}_{t}$ , which contains itself and all nodes rooted from $t$ . Therefore, the language representation can be calculated by the weighted sum of node embedding vectors from $\mathcal{N}_{t}$ :
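A minimal sketch of such a weighted sum is given below. It assumes that the weights are obtained by a softmax over one scalar attention logit per node, with separate logits for the single and the pairwise representation; the logits, the embeddings, and the `language_repr` helper are illustrative assumptions rather than the paper's exact parameterization.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def language_repr(embeddings, att_logits):
    """Attention-weighted sum of the embeddings of the nodes in N_t."""
    weights = softmax(att_logits)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]


# Node set of a toy subtree: two nodes with 2-d embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
y_single = language_repr(embeddings, [0.0, 0.0])   # equal attention
y_pair = language_repr(embeddings, [4.0, 0.0])     # attends mostly to node 0
print(y_single, y_pair)
```

Using two sets of attention logits lets the same subtree yield one representation that emphasizes visible words and another that emphasizes relation words.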

Score Functions. There are two types of score functions used in our modules, denoted by the single score function $S_{s}$ and pairwise score function $S_{p}$ , where $S_{s}$ measures the similarity between a single region $\bm{x}$ and a language representation $\bm{y}$ , and $S_{p}$ indicates how likely pair-wise regions match with relationships. Formally we define them as:

where $[;]$ is a concatenation operation, $\odot$ is element-wise multiplication, and L2norm is used to normalize vectors.
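One plausible composition of these operations is sketched below: project the visual feature, gate it with the language feature by element-wise multiplication, L2-normalize, and map the result to a scalar. The order of operations, the single projection matrix `W_proj`, the output weights `w_out`, and the helper names are assumptions for illustration, not the paper's exact definition.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def l2norm(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0   # guard against zero vectors
    return [x / n for x in v]


def single_score(x, y, W_proj, w_out):
    """S_s sketch: project region x, gate with language y, normalize, score."""
    fused = [a * b for a, b in zip(matvec(W_proj, x), y)]
    return sum(w * f for w, f in zip(w_out, l2norm(fused)))


def pair_score(x_i, x_j, y, W_proj, w_out):
    """S_p sketch: the same pipeline applied to the concatenation [x_i; x_j]."""
    return single_score(x_i + x_j, y, W_proj, w_out)


identity = [[1.0, 0.0], [0.0, 1.0]]
s = single_score([3.0, 4.0], [1.0, 1.0], identity, [1.0, 0.0])
p = pair_score([1.0], [0.0], [1.0, 1.0], identity, [1.0, 1.0])
print(s, p)
```

The normalization keeps the score insensitive to the overall magnitude of the fused feature, so only its direction relative to the output weights matters.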

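Single Module. It is assembled at leaves and the root. Its job is to 1) calculate a single score for each region and the current language feature by Eq. (8), 2) add this new score to the scores collected from children, and then 3) pass the sum to its parent:

$$s_{t}^{i}=S_{s}(\bm{x}_{i},\bm{y}^{s}_{t})+\sum_{j\in\mathcal{C}_{t}}s_{tj}^{i},\quad i=1,\dots,K,$$

where $\bm{x}_{i}$ is the feature of the $i$-th region and $s_{tj}^{i}$ is the score of that region passed from the $j$-th child of node $t$.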

Note that for leaves, $\mathcal{C}_{t}=\emptyset$ as they have no children. As illustrated in Figure 2, its design motivation is to initiate the bottom-up grounding process with the most elementary words and to finalize the grounding by passing the accumulated scores to ROOT.

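Sum Module. It plays a transitional role during the reasoning process. It simply sums up the scores passed from its children and then passes the sum to its parent:

$$\bm{s}_{t}=\sum_{j\in\mathcal{C}_{t}}\bm{s}_{tj},$$

where $\bm{s}_{tj}$ is the score vector passed from the $j$-th child of node $t$.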

Note that this module has no parameters and hence reduces the complexity of our model. As illustrated in Figure 2, intuitively, it relays the scores of easy-to-localize words (cf. Figure 3(a)) such as “horse” and “man” to help the subsequent composite grounding.

Comp Module. This is the core module for composite visual reasoning. As shown in Figure 3(b), it is likely to be the relationship that connects two language constitutions. It first computes an “average region” visual feature that is grounded by the single scores:

In particular, $\bar{\bm{x}}$ can be considered as the contextual region that supports the target region score, e.g., “what is riding the horse” in Figure 2. Therefore, this module outputs the target region score to its parent:

Recall that $\bm{y}^{p}_{t}$ is the pairwise language feature that represents the relationship words.
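The two steps of the Comp module can be written compactly as follows. This is a sketch consistent with the description above rather than a verbatim restatement: the weights $\bm{\beta}$ and their softmax normalization over the summed child scores are assumptions introduced here for illustration.

$$\bm{\beta}=\mathrm{softmax}\Big(\sum_{j\in\mathcal{C}_{t}}\bm{s}_{tj}\Big),\qquad \bar{\bm{x}}=\sum_{i=1}^{K}\beta_{i}\,\bm{x}_{i},\qquad s_{t}^{i}=S_{p}(\bm{x}_{i},\bar{\bm{x}},\bm{y}^{p}_{t}).$$

Under this reading, a child that concentrates its score on the horse region yields an $\bar{\bm{x}}$ close to the horse feature, so $S_{p}$ effectively asks which region stands in the stated relation to the horse.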

By reasoning along the assembled NMTree in a bottom-up fashion, we obtain the overall accumulated grounding score in Eq. (2) at the tree root. Moreover, thanks to the score output at each node, NMTree is transparent, as the scores can be visualized as attention maps to investigate the grounding process. Figure 4 illustrates an extreme example with a very long expression of 22 tokens. Even in this case, the neural modules in NMTree still work well and reason through an explainable intermediate process. Next, we will discuss how to train NMTree.
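The bottom-up execution of the three modules can be sketched as a short recursion. The scoring callbacks `single_fn` and `pair_fn`, the 2-d region features, and the softmax weighting inside Comp are toy assumptions standing in for the learned score functions; `run_node` is a hypothetical helper name.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def run_node(node, tree, modules, regions, single_fn, pair_fn):
    """Return the K grounding scores that `node` passes to its parent."""
    child_scores = [run_node(c, tree, modules, regions, single_fn, pair_fn)
                    for c in tree.get(node, [])]
    K = len(regions)
    summed = [sum(s[i] for s in child_scores) for i in range(K)]
    kind = modules[node]
    if kind == "Sum":                      # parameter-free relay
        return summed
    if kind == "Single":                   # own evidence plus children
        return [single_fn(node, regions[i]) + summed[i] for i in range(K)]
    # Comp: build the context region, then score every region against it.
    beta = softmax(summed)
    dim = len(regions[0])
    context = [sum(b * r[d] for b, r in zip(beta, regions)) for d in range(dim)]
    return [pair_fn(node, regions[i], context) for i in range(K)]


regions = [[1.0, 0.0], [0.0, 1.0]]         # region 0: man-like, region 1: horse-like
word_vecs = {"man": [5.0, 0.0], "horse": [0.0, 5.0]}
tree = {"man": ["riding"], "riding": ["horse"]}
modules = {"man": "Single", "riding": "Comp", "horse": "Single"}


def single_fn(word, region):
    return sum(w * r for w, r in zip(word_vecs[word], region))


def pair_fn(word, region, context):
    # Toy "riding" relation: a man-like region paired with a horse-like context.
    return 4.0 * region[0] * context[1]


scores = run_node("man", tree, modules, regions, single_fn, pair_fn)
print(scores)
```

Each call returns one score per region, so the intermediate outputs could be inspected node by node in the same way the attention maps are visualized.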

3.4 NMTree Training

In contrast to previous neural module networks, NMTree does not require any additional annotations and is end-to-end trainable. Let $\bm{x}_{gt}$ be the ground-truth region; the objective is to minimize the cross-entropy loss:

where $\Theta$ is the trainable parameter set and softmax is across all $K$ regions in an image.
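For a single expression, this loss reduces to the negative log-softmax of the ground-truth region's score. The sketch below computes it in a numerically stable way; `grounding_loss` is a hypothetical helper name and the score values are made up.

```python
import math


def grounding_loss(scores, gt_index):
    """Cross-entropy over K region scores: -log softmax(scores)[gt_index]."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - scores[gt_index]


uniform = grounding_loss([0.0, 0.0], 0)          # two equally scored regions
confident = grounding_loss([10.0, 0.0, 0.0], 0)  # ground truth scored highest
wrong = grounding_loss([10.0, 0.0, 0.0], 1)      # ground truth scored low
print(uniform, confident, wrong)
```

The loss is small when the accumulated score at the root singles out the ground-truth region and grows roughly linearly with the margin by which another region is preferred.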

Recall that the assembling process in Eq. (6) is discrete and blocks end-to-end training. Therefore, we utilize the Gumbel-Softmax strategy, which has been shown to be effective in recent works on architecture search. For more details, please refer to their papers. Here, we only introduce how to apply Gumbel-Softmax to NMTree training.

Forward. We add noise drawn from a Gumbel distribution to the relative scores (i.e., $\textrm{fc}([\bm{e}_{t},\bm{h}_{t}])$) of each module. This introduces stochasticity for exploring the module assembly. Specifically, we parameterize the assembler decision as a 2-d one-hot vector $\bm{z}$, where the index of the non-zero entry indicates the decision:

Backward. We take a continuous approximation that relaxes $\bm{z}$ to $\widetilde{\bm{z}}$ by replacing argmax with softmax, formally:

where $G$ is the same sample drawn in the forward pass (i.e., we reuse the noise samples), and $\tau$ is a temperature parameter: the softmax approaches argmax as $\tau\rightarrow 0$ and approaches the uniform distribution as $\tau\rightarrow\infty$. Although there are discrepancies between the forward and backward passes, we empirically observe that the Gumbel-Softmax strategy performs well in our experiments.
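The relation between the hard forward decision and its soft relaxation can be sketched numerically. The code below only computes values; in an autodiff framework the gradient would be taken through the soft vector while the hard vector is used in the forward pass. Adding the noise to log-softmax scores, the helper names, and the example logits are assumptions for illustration.

```python
import math
import random


def log_softmax(logits):
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """One Gumbel(0, 1) sample via the inverse-CDF transform."""
    u = rng.random()
    while u == 0.0:                       # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_decision(logits, tau, rng):
    """Return (hard one-hot z, soft relaxation z_tilde) sharing one noise draw."""
    noisy = [lp + sample_gumbel(rng) for lp in log_softmax(logits)]
    k = max(range(len(noisy)), key=lambda i: noisy[i])
    hard = [1.0 if i == k else 0.0 for i in range(len(noisy))]
    soft = softmax([v / tau for v in noisy])
    return hard, soft


rng = random.Random(0)
hard, soft = gumbel_decision([0.2, -0.1], tau=1.0, rng=rng)   # [Sum, Comp] scores
print(hard, soft)
```

Because both vectors are built from the same noisy scores, the largest entry of the soft vector always sits at the index chosen by the hard vector, which is what keeps the two passes aligned despite their discrepancy.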

Experiments

We conducted our experiments on three datasets that are collected from MS-COCO images.

RefCOCO contains 142,210 referring expressions for 19,994 images. An interactive game is used during the expression collection. All expression-referent pairs are split into train, validation, testA, and testB. TestA contains the images with multiple people and testB contains the images with multiple objects.

RefCOCO+ contains 141,564 referring expressions for 49,856 objects in 19,992 images. It is collected with the same interactive game as RefCOCO and is likewise split into train, validation, testA, and testB. The difference from RefCOCO is that RefCOCO+ only allows expressions that describe appearance, not location.

RefCOCOg contains 95,010 referring expressions for 49,822 objects in 25,799 images. It is collected in a non-interactive way and contains longer expressions that describe both appearance and location. It has two types of data partitions. The first partition divides the dataset into train and validation (val∗) sets. The second partition divides the images into train, validation (val), and test sets.

4.2 Implementation Details and Metrics

Language Settings. We built specific vocabularies for the three datasets from the words, POS tags, and dependency labels that appeared more than once in each dataset. Note that, to obtain accurate parsing results, we did not trim the length of the expressions. We used pre-trained GloVe to initialize the word vectors. The dependency label vectors and POS tag vectors were trained from scratch with random initialization. We set the embedding sizes to 300, 50, and 50 for words, POS tags, and dependency labels, respectively.

Visual Representations. To represent the RoI features of an image, we concatenated object features and location features extracted from MAttNet, which is based on Faster RCNN with ResNet-101 as the backbone and trained with attribute heads. We employed Mask RCNN for object segmentation. The visual feature dimension $d_{x}$ was set to 3,072. For fair comparison, we also used VGG-16 as the backbone, in which case $d_{x}$ was set to 5,120.

Parameter Settings. We optimized our model with the Adam optimizer for up to 40 epochs. The learning rate was initialized to 1e-3 and shrunk by 0.9 every 10 epochs. We set the mini-batch size to 128 images. The LSTM hidden size $d_{h}$ and the hidden size of the attention in the language representation were both set to 1,024. The temperature $\tau$ of Gumbel-Softmax was set to 1.0.
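The learning-rate schedule can be written as a one-line step decay. The sketch reads "shrunk by 0.9" as multiplying the rate by 0.9, counts epochs from zero, and applies the decay at epoch boundaries; all three points are assumptions, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.9, step=10):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Rates at the start of each 10-epoch stage of a 40-epoch run.
schedule = [learning_rate(e) for e in (0, 10, 20, 30)]
print(schedule)
```

Under these assumptions the rate only drops to about 73% of its initial value by the final stage, so the schedule is a mild one.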

Evaluation Metrics. For the detection task, we calculated the Intersection-over-Union (IoU) between the detected bounding box and the ground-truth one, and treated a detection with an IoU of at least 0.5 as correct. We used the Top-1 accuracy as the metric, which is the fraction of correctly grounded test expressions. For the segmentation task, we used Pr@0.5 (the percentage of expressions whose IoU is at least 0.5) and overall IoU as metrics.
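The detection metric can be sketched in a few lines. The corner-based box format `(x1, y1, x2, y2)` and the helper names `box_iou` and `top1_accuracy` are assumptions for illustration; the boxes in the usage example are made up.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def top1_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of expressions whose top-1 box reaches the IoU threshold."""
    hits = sum(box_iou(p, g) >= threshold
               for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)


preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
gts = [(0, 0, 2, 2), (1, 0, 3, 2)]
print(box_iou(preds[1], gts[1]), top1_accuracy(preds, gts))
```

In the usage example the second prediction overlaps its ground truth by only one third, so it counts as a miss and the accuracy is one half.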

4.3 Ablation Studies

Settings. We conducted extensive ablation studies to reveal the internal mechanism of NMTree. The ablations and their motivations are detailed as follows.

Chain: it ignores the structure information of the language. Specifically, we represent a natural language expression as the weighted average of its word embeddings, where the weights are calculated by soft attention on the bi-LSTM hidden vectors of each word. The final grounding score is calculated by the single score function between each region and the language representation.

NMTree w/o Comp: it is the NMTree without the Comp module, forcing all internal nodes to be Sum modules.

NMTree w/o Sum: it is the NMTree without the Sum module, forcing all internal nodes to be Comp modules.

NMTree w/ Rule: it assembles modules by a hand-crafted rule. Instead of deciding which module should be assembled at each node by computing the relative score, we designed a fixed linguistic rule that makes discrete and non-trainable decisions. The rule is: set the internal nodes whose dependency relation label is ‘acl’ (i.e., adjectival clause) or ‘prep’ (i.e., prepositional modifier) as Comp modules, and the others as Sum modules.
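The hand-crafted rule of NMTree w/ Rule is simple enough to state as code. The function name `rule_module` and its boolean arguments are hypothetical; the label set follows the rule given above.

```python
COMP_LABELS = {"acl", "prep"}   # adjectival clause, prepositional modifier


def rule_module(dep_label, is_leaf, is_root):
    """Fixed, non-trainable module choice based on the dependency label."""
    if is_leaf or is_root:
        return "Single"
    return "Comp" if dep_label in COMP_LABELS else "Sum"


print(rule_module("prep", is_leaf=False, is_root=False))
print(rule_module("pobj", is_leaf=False, is_root=False))
```

Unlike the learned assembler, this rule never sees the image, so it cannot adapt the layout to visual cues.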

Results. Table 1 shows the grounding accuracies of the ablation methods on the three benchmarks. We make the following observations: 1) On all datasets, NMTree outperforms Chain even if we remove one module or use the hand-crafted rule. This is because the tree structure contains more linguistic information and is more suitable for reasoning. It also demonstrates that our proposed fine-grained composition is better than the holistic Chain. 2) When we remove one module, i.e., NMTree w/o Comp and NMTree w/o Sum, the results are worse than those of the full NMTree. This demonstrates the necessity of both Sum and Comp. Note that removing any module also hurts the explainability of the model. 3) NMTree w/o Comp and NMTree w/o Sum are comparable, but NMTree w/o Sum is slightly better. This is because the Comp module is more complex and thus results in overfitting. 4) NMTree outperforms NMTree w/ Rule. This demonstrates that NMTree can automatically find which nodes need composite reasoning (as Comp) and which do not (as Sum). Further, it implies that NMTree is more suitable for the visual grounding task, as our assembler becomes aware of visual cues through the Gumbel-Softmax training strategy.

4.4 Comparison with State-of-the-Arts

Settings. We compared NMTree with other state-of-the-art visual grounding models published in recent years. According to whether the model requires language composition, we group those methods into: 1) Generation-based methods, which select the region with the maximum generation probability: MMI, Attribute, and Listener. 2) Holistic language-based methods: NegBag. 3) Language composition-based methods: CMN, VC, AccumAttn, and MAttN. 4) Composition methods with an external parser: GroundNet, parser+CMN, and parser+MAttN. NMTree belongs to the fourth category, but its language composition is more fine-grained than that of the others. We compared with them in three different settings: ground-truth regions, detected regions, and segmentation masks.

Results. From Table 2 and Table 3, we can find that: 1) the triplet composition models mostly outperform holistic models. This is because taking advantage of linguistic information by decomposing sentences, even coarsely, is helpful in visual grounding. 2) Our model outperforms most triplet models with the help of fine-grained composite reasoning. 3) The parser-based methods are fragile to parser errors, leading to a performance decline. In contrast, our model is more robust because of the dynamic assembly and the end-to-end training strategy. Although some of the performance gains are marginal, note that NMTree appears to balance the well-known trade-off between performance and explainability. As we will discuss in the following, we achieve explainability without hurting accuracy.

4.5 Qualitative Analysis

In this section, we investigate the internal reasoning steps of our model through qualitative results. Since our work focuses on complex language cases, we mainly conducted qualitative experiments on RefCOCOg; more qualitative results are given in the supplementary material. In Figure 5, we visualize the tree structures, the module assembly, the attention map at each intermediate step, and the final results. In Figure 6, we visualize the reasoning process inside Comp modules. From these qualitative visualizations, we make the following observations: 1) Visual concept words are usually assembled as Sum modules, while relationship concept words are usually assembled as Comp modules. 2) The attention maps of non-visual leaf nodes, e.g., ‘directly’ in Figure 5(d), are usually scattered, while those of visual ones, e.g., ‘girl’ in Figure 5(d), are usually concentrated. 3) Comp modules are aware of relationships, i.e., they can move the attention from the supporting objects to the target objects, as shown in Figure 6. 4) Along the tree, the attention maps become sharper, indicating that the confidence of our model becomes stronger.

All the above observations suggest that NMTree can reason along the tree and provide rich cues to support the final results. These reasoning patterns and supporting cues imply that our model is explainable. Therefore, to further investigate the explainability of our model, we conducted a human evaluation to measure whether the internal reasoning process is reasonable. Since the state-of-the-art model MAttNet does not contain an internal reasoning process but only sums up three pre-defined module scores that directly point to the desired object, we compared with AccumAttn, which performs multi-step sequential reasoning and has image and textual attention at each time step. We first presented 60 examples with the internal steps of each model to 6 human evaluators, and asked them to judge how clear it was what the model was doing at each step. Each evaluator then rated each example on a 4-point Likert scale (unclear, slightly clear, mostly clear, clear), corresponding to scores of 1, 2, 3, and 4. The percentage of each choice and the average scores are shown in Figure 7. We can find that our model outperforms AccumAttn and is often rated as “clear”. This indicates that the internal reasoning process of our model can be more clearly understood by humans.

Conclusion

In this paper, we proposed Neural Module Tree Networks (NMTree), a novel end-to-end model that localizes the target region by accumulating the grounding confidence score along the dependency parsing tree of a natural language sentence. NMTree consists of three simple neural modules, whose assembly is trained without additional annotations. Compared with previous visual grounding methods, our model performs a more fine-grained and explainable language composite reasoning with superior performance, demonstrated by extensive experiments on three benchmarks.

Acknowledgements. This work was supported by the National Key R&D Program of China under Grant 2017YFB1300201, the National Natural Science Foundation of China (NSFC) under Grants 61622211 and 61620106009, the Fundamental Research Funds for the Central Universities under Grant WK2100100030, and partially by NTU Data Science and Artificial Intelligence Research Center (DSAIR) and Alibaba-NTU JRI.

Appendix A Implementation of Tree LSTM

We simplified the implementation of tree LSTM (Eq. 4) in the main paper as:

where $\bm{c}_{tj}^{\uparrow},\,\bm{h}_{tj}^{\uparrow}$ denote the cell and hidden vectors of the $j$ -th child of node $t$ . Specifically, our tree LSTM transition equations are:

where $\odot$ is the element-wise multiplication, $\sigma(\cdot)$ is the sigmoid function, $W$ , $U$ , $b$ are trainable parameters.
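For reference, the standard Child-Sum Tree LSTM transitions, on which the implementation is said to be based, take the following form in the bottom-up direction. This is the textbook formulation written in the notation of this paper; whether the simplified variant used here matches it term by term is an assumption, and the gate symbols $\bm{i}_{t}$, $\bm{f}_{tj}$, $\bm{o}_{t}$, $\bm{u}_{t}$ and the summed child state $\tilde{\bm{h}}_{t}$ are introduced only for this sketch.

$$\begin{aligned}
\tilde{\bm{h}}_{t} &= \sum_{j\in\mathcal{C}_{t}}\bm{h}_{tj}^{\uparrow},\\
\bm{i}_{t} &= \sigma\big(W^{(i)}\bm{e}_{t}+U^{(i)}\tilde{\bm{h}}_{t}+b^{(i)}\big),\\
\bm{f}_{tj} &= \sigma\big(W^{(f)}\bm{e}_{t}+U^{(f)}\bm{h}_{tj}^{\uparrow}+b^{(f)}\big),\\
\bm{o}_{t} &= \sigma\big(W^{(o)}\bm{e}_{t}+U^{(o)}\tilde{\bm{h}}_{t}+b^{(o)}\big),\\
\bm{u}_{t} &= \tanh\big(W^{(u)}\bm{e}_{t}+U^{(u)}\tilde{\bm{h}}_{t}+b^{(u)}\big),\\
\bm{c}_{t}^{\uparrow} &= \bm{i}_{t}\odot\bm{u}_{t}+\sum_{j\in\mathcal{C}_{t}}\bm{f}_{tj}\odot\bm{c}_{tj}^{\uparrow},\\
\bm{h}_{t}^{\uparrow} &= \bm{o}_{t}\odot\tanh\big(\bm{c}_{t}^{\uparrow}\big).
\end{aligned}$$

The per-child forget gate $\bm{f}_{tj}$ is what lets a node keep the memory of some children while discarding that of others, independent of how many children it has.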

Appendix B More Qualitative Results

In this section, we provide more qualitative results to demonstrate the internal reasoning steps of NMTree. In Figure 8, we visualize the reasoning process inside Comp modules. In Figure 9, Figure 10, and Figure 11, we visualize the tree structures, the module assembly, the attention map at each intermediate step, and the final results. Specifically, Figure 9 shows qualitative results with ground-truth bounding boxes. For comparison, we also show some failure cases. Figure 10 shows qualitative results with detected bounding boxes. Figure 11 shows qualitative results with detected masks.