Learning by Abstraction: The Neural State Machine

Drew A. Hudson, Christopher D. Manning

Introduction

Language is one of the most marvelous feats of the human mind. The emergence of a compositional system of symbols that can distill and convey from rich sensory experiences to creative new ideas has been a major turning point in the evolution of intelligence, and made a profound impact on the nature of human cognition (Chomsky, 2017; Vygotsky, 1964; Boroditsky, 2011). According to Jerry Fodor’s Language of Thought hypothesis (Fodor, 1975; Schneider, 2011), thinking itself posses a language-like compositional structure, where elementary concepts combine in systematic ways to create compound new ideas or thoughts, allowing us to make “infinite use of finite means” (Chomsky, 2014) and fostering human’s remarkable capacities of abstraction and generalization (Lake et al., 2017).

Indeed, humans are particularly adept at making abstractions of various kinds: We make analogies and form concepts to generalize from given instances to unseen examples (Rogers & McClelland, 2004); we see things in context, and build compositional world models to represent objects and understand their interactions and subtle relations, turning raw sensory signals into high-level semantic knowledge (Navon, 1977); and we deductively draw inferences via conceptual rules and statements to proceed from known facts to novel conclusions (Harman, 1986; Hudson & Manning, 2018). Not only are humans capable of learning, but we are also talented at reasoning.

Ideas about compositionality, abstraction and reasoning greatly inspired the classical views of artificial intelligence (Smolensky, 1987; Newell, 1980), but have lately been overshadowed by the astounding success of deep learning over a wide spectrum of real-world tasks (He et al., 2016; Mnih et al., 2015; Wu et al., 2016b). Yet, even though neural networks are undoubtedly powerful, flexible and robust, recent work has repeatedly demonstrated their flaws, showing how they struggle to generalize in a systematic manner (Lake & Baroni, 2017), overly adhere to superficial and potentially misleading statistical associations instead of learning true causal relations Agrawal et al. (2016); Jia & Liang (2017), strongly depend on large amounts of data and supervision (Garnelo et al., 2016; Lake et al., 2017), and sometimes behave in surprising and worrisome ways Goodfellow et al. (2014); Das et al. (2017). The sheer size and statistical nature of these models that support robustness and versatility are also what hinder their interpretability, modularity, and soundness.

Motivated to alleviate these deficiencies and bring the neural and symbolic approaches more closely together, we propose the Neural State Machine, a differentiable graph-based model that simulates the operation of an automaton, and explore it in the domain of visual reasoning and compositional question answering. Essentially, we proceed through two stages: modeling and inference. Starting from an image, we first generate a probabilistic scene graph (Johnson et al., 2015; Krishna et al., 2017) that captures its underlying semantic knowledge in a compact form. Nodes correspond to objects and consist of structured representations of their properties, and edges depict both their spatial and semantic relations. Once we have the graph, we then treat it as a state machine and simulate an iterative computation over it, aiming to answer questions or draw inferences. We translate a given natural language question into a series of soft instructions, and feed them one-at-a-time into the machine to perform sequential reasoning, using attention to traverse its states and compute the answer.

Drawing inspiration from Bengio’s consciousness prior (Bengio, 2017), we further define a set of semantic embedded concepts that describe different entities and aspects of the domain, such as various kinds of objects, attributes and relations. These concepts are used as the vocabulary that underlies both the scene graphs derived from the image as well as the reasoning instructions obtained from the question, effectively allowing both modalities to “speak the same language”. Whereas neural networks typically interact directly with raw observations and dense features, our approach encourages the model to reason instead in a semantic and factorized abstract space, which enables the disentanglement of structure from content and improves its modularity.

We demonstrate the value and performance of the Neural State Machine on two recent Visual Question Answering (VQA) datasets: GQA (Hudson & Manning, 2019) which focuses on real-world visual reasoning and multi-step question answering, as well as VQA-CP (Agrawal et al., 2018), a recent split of the popular VQA dataset (Agrawal et al., 2017; Goyal et al., 2017) that has been designed particularly to evaluate generalization. We achieve state-of-the-art results on both tasks under single-model settings, substantiating the robustness and efficiency of our approach in answering challenging compositional questions. We then construct new splits leveraging the associated structured representations provided by GQA and conduct further experiments that provide significant evidence for the model’s strong generalization skills across multiple dimensions, such as novel compositions of concepts and unseen linguistic structures, validating its versatility under changing conditions.

Our model ties together two important qualities: abstraction and compositionality, with the respective key innovations of representing meaning as a structured attention distribution over an internal vocabulary of disentangled concepts, and capturing sequential reasoning as the iterative computation of a differentiable state machine over a semantic graph. We hope that creating such neural form of a classical model of computation will encourage and support the integration of the connectionist and symbolic methodologies in AI, opening the door to enhanced modularity, versatility, and generalization.

Related work

Our model connects to multiple lines of research, including works about compositionality Bottou (2014); Hu et al. (2017), concept acquisition (Higgins et al., 2017; Wu et al., 2016a), and neural computation (Graves et al., 2014; Miller et al., 2016; Andreas et al., 2016). Several works have explored the discovery and use of visual concepts in the contexts of reinforcement or unsupervised learning (Higgins et al., 2016; Choi et al., 2018) as well as in classical computer vision (Farhadi et al., 2009; Vogel & Schiele, 2007). Others have argued for the importance of incorporating strong inductive biases into neural architectures (Bottou, 2014; Battaglia et al., 2018; Andreas, 2019; Andreas et al., 2018), and indeed, there is a growing body of research that seeks to introduce different forms of structural priors inspired by computer architectures (Graves et al., 2016; Weston et al., 2014; Hudson & Manning, 2018) or theory of computation (Aleksander, 1994; Forcada & Carrasco, 2001), aiming to bridge the gap between the symbolic and neural paradigms.

We explore our model in the context of VQA, a challenging multimodal task that has gained substantial attention in recent years (Goyal et al., 2017; Wang et al., 2017; Hudson & Manning, 2018). Prior work commonly relied on dense visual features produced by either CNNs (Xiong et al., 2016; Yang et al., 2016) or object detectors (Anderson et al., 2017), with a few recent models that use the relationships among objects to augment those features with contextual information from each object’s surroundings (Li et al., 2019a; Teney et al., 2017; Norcliffe-Brown et al., 2018). We move further in this direction, performing iterative reasoning over inferred scene graphs, and in contrast to prior models, incorporate higher-level semantic concepts to represent both the visual and linguistic modalities in a shared and sparser manner that facilitates their interaction.

Closest to the NSM is a model called MAC (Hudson & Manning, 2018) we have developed in prior work, a recurrent network that applies attention-based operations to perform sequential reasoning. In fact, our new model follows the high-level structure proposed by MAC, where the new decoder (section 3.3) is analogous to the control unit, and the model simulation (section 3.4) is parallel to the read-write operation. At the same time, the Neural State Machine differs from MAC in two crucial respects: First, we reason over graph structures rather than directly over spatial maps of visual features, traversing the graph by successively shifting attention across its nodes and edges. Second, key to our model is the notion of semantic concepts that are used to express knowledge about both the visual and linguistic modalities, instead of working directly with the raw observational features. As our findings suggest, these ideas contribute significantly to the model’s performance compared to MAC and enhance its compositionality, transparency and generalization skills.

The Neural State Machine

The Neural State Machine is a graph-based network that simulates the computation of a finite automaton (Hopcroft, 2008), and is explored here in the context of VQA, where we are given an image and a question and asked to provide an answer. We go through two stages – modeling and inference, the first to construct the state machine, and the second to simulate its operation.

In the modeling stage, we transform both the visual and linguistic modalities into abstract representations. The image is decomposed into a probabilistic graph that represents its semantics – the objects, attributes and relations in the depicted visual scene (section 3.2), while the question is converted into a sequence of reasoning instructions (section 3.3) that have to be performed in order to answer it.

In the inference stage (section 3.4), we treat the graph as a state machine, where the nodes, the objects within the image, correspond to states, and the edges, the relations between the objects, correspond to transitions. We then simulate a serial computation by iteratively feeding the machine with the instructions derived from the question and traversing its states, which allows us to perform sequential reasoning over the semantic visual scene, as guided by the question, to arrive at the answer.

We begin with a formal definition of the machine. In simple terms, a state machine is a computational model that consists of a collection of states, which it iteratively traverses while reading a sequence of inputs, as determined by a transition function. In contrast to the classical deterministic versions, the neural state machine defines an initial distribution over the states, and then performs a fixed number of computation steps $N$ , recurrently updating the state distribution until completion. Formally, we define the neural state machine as a tuple $(C,S,E,\{r_{i}\}_{i=0}^{N},p_{0},\delta)$ :

$C$ the model’s alphabet, consisting of a set of concepts, embedded as learned vectors.

$E$ a collection of directed edges that specify valid transitions between the states.

$r_{i}$ a sequence of instructions, each of dimension $d$ , that are passed in turn as an input to the transition function $\delta$ .

$p_{0}:S\rightarrow$ a probability distribution of the initial state.

$\delta_{S,E}:p_{i}\times r_{i}\rightarrow p_{i+1}$ a state transition function: a neural module that at each step $i$ considers the distribution $p_{i}$ over the states as well as an input instruction $r_{i}$ , and uses it to redistribute the probability along the edges, yielding an updated state distribution $p_{i+1}$ .

In contrast to many common networks, the neural state machine operates over a discrete set of concepts. We create an embedded concept vocabulary $C$ for the machine (initialized with GloVe (Pennington et al., 2014)), that will be used to capture and represent the semantic content of input images. The vocabulary is grouped into $L+2$ properties such as object identity $C_{O}=C_{0}$ (e.g. cat, shirt), different types of attributes $C_{A}=\bigcup_{i=1}^{L}C_{i}$ (e.g. colors, materials) and relations $C_{R}=C_{L+1}$ (e.g. holding, behind), all derived from the Visual Genome dataset (Krishna et al., 2017) (see section 7.3 for details). We similarly define a set of embeddings $D$ for each of the property types (such as “color” or “shape”).

In using the notion of concepts, we draw a lot of inspiration from humans, who are known for their ability to learn concepts and use them for tasks that involve abstract thinking and reasoning (Bealer, 1998; Barsalou et al., 2003; Greeno, 1989; Pecher & Zwaan, 2005). In the following sections, rather than using raw and dense sensory input features directly, we represent both the visual and linguistic inputs in terms of our vocabulary, finding the most relevant concepts that they relate to. By associating such semantic concepts with raw sensory information from both the image and the question, we are able to derive higher-level representations that abstract away from irrelevant raw fine-grained statistics tied to each modaility, and instead capture only the semantic knowledge necessary for the task. That way we can effectively cast both modalities onto the same space to facilitate their interaction, and, as discussed in section 4, improve the model’s compositionality, robustness and generalization skills.

2 States and edge transitions

In order to create the state machine, we construct a probabilistic scene graph that specifies the objects and relations in a given image, and serves us as the machine’s state graph, where objects correspond to states and relations to valid transitions. Multiple models have been proposed for the task of scene graph generation (Xu et al., 2017; Yang et al., 2018; Chen et al., 2019; Zellers et al., 2018). Here, we largely follow the approaches of Yang et al. (2018) and Chen et al. (2019) in conjunction with a variant of the Mask R-CNN object detector (He et al., 2017) proposed by Hu et al. (2018). Further details regarding the graph generation can be found in section 7.4.

By using such a graph generation model, we can infer a scene graph that consists of: (1) A set of object nodes $S$ from the image, each accompanied by a bounding box, a mask, dense visual features, and a collection of discrete probability distributions $\{P_{i}\}_{i=0}^{L}$ for each of the object’s $L+1$ semantic properties (such as its color, material, shape, etc.), defined over the concept vocabulary $\{C_{i}\}_{i=0}^{L}$ presented above; (2) A set of relation edges between the objects, each associated with a probability distribution $P_{L+1}$ of its semantic type (e.g. on top of, eating) among the concepts in $C_{L+1}$ , and corresponding to a valid transition between the machine’s states.

Once we obtain the sets of state nodes and transition edges, we proceed to computing structured embedded representations for each of them. For each state $s\in S$ that corresponds to an object in the scene, we define a set of $L+1$ property variables $\{s^{j}\}_{j=0}^{L}$ and assign each of them with

Where $c_{k}\in C_{j}$ denotes each embedded concept of the $j^{th}$ property type and $P_{j}$ refers to the corresponding property distribution over these concepts, resulting in a soft-binding of concepts to each variable. To give an example, if an object is recognized by the object detector as likely to be e.g. red, then its color variable will be assigned to an averaged vector close to the embedding of the “red” concept. Edge representations are computed in a similar manner, resulting in matching embeddings of their relation type: $e^{\prime}=\sum_{c_{k}\in C_{L+1}}P_{L+1}(k)c_{k}$ for each edge $e\in E$ .

Consequently, we obtain a set of structured representations for both the nodes and the edges that underlie the state machine. Note that by associating each object and relation in the scene with not one, but a collection of vectors that capture each of their semantic properties, we are able to create disentangled representations that encapsulate the statistical particularities of the raw image and express it instead through a factorized discrete distribution over a vocabulary of embedded semantic concepts, aiming to encourage and promote higher compositionality.

3 Reasoning instructions

In the next step, we translate the question into a sequence of reasoning instructions (each expressed in terms of the concept vocabulary $C$ ), which will later be read by the state machine to guide its computation. The translation process consists of two steps: tagging and decoding.

We begin by embedding all the question words using GloVe (dimension $d=300$ ). We process each word with a tagger function that either translates it into the most relevant concept in our vocabulary or alternatively keeps it intact, if it does not match any of them closely enough. Formally, for each embedded word $w_{i}$ we compute a similarity-based distribution

Where $\mathbf{W}$ is initialized to the identity matrix and $C$ denotes the matrix of all embedded concepts along with an additional learned default embedding $c^{\prime}$ to account for structural or other non-content words.

Next, we translate each word into a concept-based representation:

Intuitively, a content word such as apples will be considered mostly similar to the concept apple (by comparing their GloVe embeddings), and thus will be replaced by the embedding of that term, whereas function words such as who, are, how will be deemed less similar to the semantic concepts and hence will stay close to their original embedding. Overall, this process allows us to “normalize" the question, by transforming content words to their matching concepts, while keeping function words mostly unaffected.

Finally, we process the normalized question words with an attention-based encoder-decoder, drawing inspiration from (Hudson & Manning, 2018): Given a question of $P$ normalized words $V^{P\times d}=\{v_{i}\}_{i=1}^{P}$ , we first pass it through an LSTM encoder, obtaining the final state $q$ to represent the question. Then, we roll-out a recurrent decoder for a fixed number of steps $N+1$ , yielding $N+1$ hidden states $\{h_{i=0}^{N}\}$ , and transform each of them into a corresponding reasoning instruction:

Here, we compute attention over the normalized question words at each decoding step. By repeating this process for all $N+1$ steps, we decompose the question into a series of reasoning instructions that selectively focus on its various parts, accomplishing the goal of this stage.

4 Model simulation

Having all the building blocks of the state machine ready, the graph of states $S$ and edges $E$ , the instruction series $\{r_{i}\}_{i=0}^{N}$ , and the concept vocabulary $C=\bigcup_{i=0}^{L+1}C_{i}$ , we can now simulate the machine’s sequential computation. Basically, we will begin with a uniform initial distribution $p_{0}$ over the states (the objects in the image’s scene), and at each reasoning step $i$ , read an instruction $r_{i}$ as derived from the question, and use it to redistribute our attention over the states (the objects) by shifting probability along the edges (their relations).

Formally, we perform this process by implementing a neural module for the state transition function $\delta_{S,E}:p_{i}\times r_{i}\rightarrow p_{i+1}$ . At each step $i$ , the module takes a distribution $p_{i}$ over the states as an input and computes an updated distribution $p_{i+1}$ , guided by the instruction $r_{i}$ . Our goal is to determine what next states to traverse to ( $p_{i+1}$ ) based on the states we are currently attending to ( $p_{i}$ ). To achieve that, we perform a couple of steps.

Recall that in section 3.2 we define for each object a set of $L+1$ variables, representing its different properties (e.g. identity, color, shape). We further assigned each edge with a variable that similarly represents its relation type. Our first goal is thus to find the instruction type: the property type that is most relevant to the instruction $r_{i}$ – basically, to figure out what the instruction is about. We compute the distribution $R_{i}=\textrm{softmax}({r_{i}^{T}}\circ{D})$ over the $L+2$ embedded properties $D$ , defined in section 3.1. We further denote $R_{i}(L+1)\in$ that corresponds to the relation property as $r_{i}^{\prime}$ , measuring the degree to which that reasoning instruction is concerned with semantic relations (in contrast to other possibilities such as e.g. objects or attributes).

Once we know what the instruction $r_{i}$ is looking for, we can use it as a guiding signal while traversing the graph from the current states we are focusing on to their most relevant neighbors. We compare the instruction to all the states $s\in S$ and edges $e\in E$ , computing for each of them a relevance score:

Where $\sigma$ is a non-linearity, $\{s^{j}\}_{j=0}^{L}$ are the state variables corresponding to each of its properties, and $e^{\prime}$ is the edge variable representing its type. We then get relevance scores between the instruction $r_{i}$ and each of the variables, which are finally averaged for each state and edge using $R_{i}$ .

Having a relevance score for both the nodes and the edges, we can use them to achieve the key goal of this section: shifting the model’s attention $p_{i}$ from the current nodes (states) $s\in S$ to their most relevant neighbors – the next states:

Here, we compute the distribution over the next states $p_{i+1}$ by averaging two probabilities $p_{i+1}^{s}$ and $p_{i+1}^{r}$ : the former is based on each potential next state’s own internal properties, while the latter considers the next states contextual relevance, relative to the current states the model attends to. Overall, by repeating this process over $N$ steps, we can simulate the iterative computation of the neural state machine.

After completing the final computation step, and in order to predict an answer, we use a standard 2-layer fully-connected softmax classifier that receives the concatenation of the question vector $q$ as well as an additional vector $m$ that aggregates information from the machine’s final states:

Where $m$ reflects the information extracted from the final states as guided by the final reasoning instruction $r_{N}$ : averaged first by the reasoning instruction type, and then by the attention over the final states, as specified by $p_{N}$ .

Overall, the above process allows us to perform a differentiable traversal over the scene graph, guided by the sequence of instructions that were derived from the question: Given an image and a question, we have first inferred a graph to represent the objects and relations in the image’s scene, and analogously decomposed the question into a sequence of reasoning instructions. Notably, we have expressed both the graph and the instructions in terms of the shared vocabulary of semantic concepts, translating them both into the same “internal language". Then, we simulate the state machine’s iterative operation, and over its course of computation, are successively shifting our attention across the nodes and edges as we ground each instruction in the graph to guide our traversal. Essentially, this allows us to locate each part of the question in the image, and perform sequential reasoning over the objects and relations in the image’s scene graph until we finally arrive at the answer.

Experiments

We evaluate our model (NSM) on two recent VQA datasets: (1) The GQA dataset (Hudson & Manning, 2019) which focuses on real-world visual reasoning and compositional question answering, and (2) VQA-CP (version 2) (Agrawal et al., 2018), a split of the VQA dataset (Goyal et al., 2017) that has been particularly designed to test generalization skills across changes in the answer distribution between the training and the test sets. We achieve state-of-the-art performance both for VQA-CP, and, under single-model settings, for GQA. To further explore the generalization capacity of the NSM model, we construct two new splits for GQA that test generalization over both the questions’ content and structure, and perform experiments based on them that provide substantial evidence for the strong generalization skills of our model across multiple dimensions. Finally, performance diagnosis, ablation studies and visualizations are presented in section 7.2 to shed more light on the inner workings of the model and its qualitative behavior.

Both our model and implemented baselines are trained to minimize the cross-entropy loss of the predicted candidate answer (out of the top 2000 possibilities), using a hidden state size of $d=300$ and, unless otherwise stated, length of $N=8$ computation steps for the MAC and NSM models. Please refer to section 7.5 for further information about the training procedure, implementation details, hyperparameter configuration and data preprocessing, along with complexity analysis of the NSM model. The model has been implemented in Tensorflow, and will be released along with the features and instructions for reproducing the described experiments.

We begin by testing the model on the GQA task (Hudson & Manning, 2019), a recent dataset that features challenging compositional questions that involve diverse reasoning skills in real-world settings, including spatial reasoning, relational reasoning, logic and comparisons. We compare our performance both with baselines, as appear in (Hudson & Manning, 2019), as well as with the top-5 single and top-10 ensemble submissions to the GQA challenge.The official leaderboard mixes up single-model and ensemble results – we present here separated scores for each track. For single-model settings, to have a fair comparison, we consider all models that, similarly to ours, did not use the strong program supervision as an additional signal for training, but rather learn directly from the questions and answers.

As table 1 shows, we achieve state-of-the-art performance for a single-model across the dataset’s various metrics (defined in (Hudson & Manning, 2019)) such as accuracy and consistency. For the ensemble setting, we compute a majority vote of 10 instances of our model, achieving the 3rd highest score compared to the 52 submissions that have participated in the challenge1 (table 4), getting significantly stronger scores compared to the 4th or lower submissions.

Note that while several submissions (marked with *) use the associated functional programs that GQA provides with each question as a strong supervision during train time, we intentionally did not use them in training our model, but rather aimed to learn the task directly using the question-answer pairs only. These results serve as an indicator for the ability of the model to successfully address questions that involve different forms of reasoning (see section 7 for examples), and especially multi-step inference, which is particularly common in GQA.

2 Generalization experiments

Motivated to measure the generalization capacity of our model, we perform experiments over three different dimensions: (1) changes in the answer distribution between the training and the test sets, (2) contextual generalization for concepts learned in isolation, and (3) unseen grammatical structures.

First, we measure the performance on VQA-CP (Agrawal et al., 2018), which provides a new split of the VQA2 dataset (Goyal et al., 2017), where the answer distribution is kept different between the training and the test sets (e.g. in the training set, the most common color answer is white, whereas in the test set, it is black). Such settings reduce the extent to which models can circumvent the need for genuine scene understanding skills by exploiting dataset biases and superficial statistics (Agrawal et al., 2016; Johnson et al., 2017; Goyal et al., 2017), and are known to be particularly difficult for neural networks (Lake et al., 2017). Here, we follow the standard VQA1/2 (Goyal et al., 2017) accuracy metric for this task (defined in (Agrawal et al., 2018)). Table 4 presents our performance compared to existing approaches. We can see that NSM surpasses alternative models by a large margin.

We perform further generalization studies on GQA, leveraging the fact that the dataset provides grounding annotations of the question words. For instance, a question such as “What color is the book on the table?" is accompanied by the annotation $\{4:(``book",n_{0}),7:(``table",n_{1})\}$ expressing the fact that e.g. the $4^{th}$ word refers to the book object node. These annotations allow us to split the training set in two interesting ways to test generalization over both content and structure (see figure 6 for an illustration of each split):

Content: Since the annotations specify which objects each question refers to, and by using the GQA ontology, we can identify all the questions that are concerned with particular object types, e.g. foods, or animals. We use this observation to split the training set by excluding all question-answer pairs that refer to these categories, and measure the model’s generalization over them. Note however, that the object detector module described in section 3.2 is still trained over all the scene graphs including those objects – rather, the goal of this split is to test whether the model can leverage the fact that it was trained to identify a particular object in isolation, in order to answer unseen questions about that type of object without any further question training.

Structure: We can use the annotations described above as masks over the objects (see figure 6 for examples), allowing us to divide the questions in the training set into linguistic pattern groups. Then, by splitting these groups into two separated sets, we can test whether a model is able to generalize from some linguistic structures to unseen ones.

Table 4 summarizes the results for both settings, comparing our model to the baselines released for GQA (Hudson & Manning, 2019), all using the same training scheme and input features. We can see that here as well, NSM performs significantly better than the alternative approaches, testifying to its strong generalization capacity both over concepts it has not seen any questions about (but only learned in isolation), as well as over questions that involve novel linguistic structures. In our view, these results point to the strongest quality of our approach. several prior works have argued for the great potential of abstractions and compositionality in enhancing models of deep learning (Andreas et al., 2018; Battaglia et al., 2018). Our results suggest that incorporating these notions may indeed be highly beneficial to creating models that are more capable in coping with changing conditions and can better generalize to novel situations.

Conclusion

In this paper, we have introduced the Neural State Machine, a graph-based network that simulates the operation of an automaton, and demonstrated its versatility, robustness and high generalization skills on the tasks of real-world visual reasoning and compositional question answering. By incorporating the concept of a state machine into neural networks, we are able to introduce a strong structural prior that enhances compositinality both in terms of the representation, by having a structured graph to serve as our world model, as well as in terms of the computation, by performing sequential reasoning over such graphs. We hope that our model will help in the effort to integrate symbolic and connectionist approaches more closely together, in order to elevate neural models from sensory and perceptual tasks, where they currently shine, into the domains of higher-level abstraction, knowledge representation, compositionality and reasoning.

Acknowledgments

We thank Samsung Electronics Co., Ltd. for partially funding this project. We also would like to thank the reviewers for their detailed and constructive comments, suggestions and feedback. Finally, we would like to thank the readers who have contacted us with useful comments and questions.

References

Supplementary material

Our model connects to multiple lines of research, including works about concept acquisition (Higgins et al., 2017), visual abstractions (Gan et al., 2017; Wu et al., 2016a; Lu et al., 2018, 2017; Yi et al., 2018), compositionality (Bottou, 2014; Hu et al., 2017), and neural computation (Graves et al., 2014; Miller et al., 2016; Andreas et al., 2016). Several works have explored the discovery and use of visual concepts in the contexts of reinforcement or unsupervised learning (Higgins et al., 2016; Choi et al., 2018) as well as in classical computer vision (Farhadi et al., 2009; Vogel & Schiele, 2007). Others have argued for the importance of incorporating strong inductive biases into neural architectures (Bottou, 2014; Battaglia et al., 2018; Andreas, 2019; Andreas et al., 2018), and indeed, there is a growing body of research that seeks to introduce different forms of structural priors inspired by computer architectures (Graves et al., 2016; Weston et al., 2014; Hudson & Manning, 2018) or theory of computation (Aleksander, 1994; Forcada & Carrasco, 2001), aiming to bridge the gap between the symbolic and neural paradigms.

One such structural prior that underlies our model is that of the probabilistic scene graph (Johnson et al., 2015; Krishna et al., 2017) which we construct and reason over to answer questions about presented images. Scene graphs provide a succinct representation of the image’s semantics, and have been effectively used for variety of applications such as image retrieval, captioning or generation (Johnson et al., 2015; Li et al., 2017; Johnson et al., 2018). Recent years have witnessed an increasing interest both in scene graphs in particular (Xu et al., 2017; Yang et al., 2018; Chen et al., 2019; Zellers et al., 2018; Li et al., 2018) as well as in graph networks in general (Battaglia et al., 2018; Kipf & Welling, 2016; Veličković et al., 2017) – a family of graph-structured models in which information is iteratively propagated across the nodes to turn their representations increasingly more contextual with information from their neighborhoods. In our work, we also use the general framework of graph networks, but in contrast to common approaches, we avoid the computationally-heavy state updates per each node, keeping the graph states static once predicted, and instead perform a series of refinements of one global attention distribution over the nodes, resulting in a soft graph traversal which better suits our need for supporting sequential reasoning.

We explore our model in the context of visual question answering (Gupta, 2017), a challenging multi-modal task that has gained substantial attention over the last years. Plenty of models have been proposed (Anderson et al., 2017; Wang et al., 2017; Hudson & Manning, 2018), focusing on visual reasoning or question answering in both abstract and real-world settings (Goyal et al., 2017; Johnson et al., 2017; Santoro et al., 2018; Hudson & Manning, 2019). Typically, most approaches use either CNNs (Xiong et al., 2016; Yang et al., 2016) or object detectors (Anderson et al., 2017) to derive visual features which are then compared to a fixed-size question embedding obtained by an LSTM. A few newer models use the relationships among objects to augment features with contextual information from each object’s surroundings (Li et al., 2019b; Teney et al., 2017; Norcliffe-Brown et al., 2018). We move further in this direction, performing iterative reasoning over the inferred scene graphs, and in contrast to prior models (Li et al., 2019a; Cadene et al., 2019), incorporate higher-level semantic concepts to represent both the visual and linguistic modalities in a shared and sparser manner to facilitate their interaction.

Other methods for visual reasoning such as (Yi et al., 2018; Mao et al., 2019; Mascharka et al., 2018) have similar motivation to ours of integrating neural and symbolic perspectives by learning and reasoning over structured representations. However, in contrast to our approach, those models heavily rely on symbolic execution, either through strong supervision of program annotations (Mascharka et al., 2018), non-differentiable python-based functions (Yi et al., 2018), or a collection of hand-crafted modules specifically designed for each given task (Mao et al., 2019), and consequently, they have mostly been explored in artificial environments such as CLEVR (Johnson et al., 2017). Instead, our model offers a fully-neural and more general graph-based design that, as demonstrated by our experiments, successfully scales to real-world settings.

Closest to our work is a model called MAC (Hudson & Manning, 2018), a recurrent network that applies attention-based operations to perform sequential reasoning. However, the Neural State Machine differs from MAC in two crucial respects: First, we reason over graph structures rather than directly over spatial maps of visual features, traversing the graph by successively shifting attention across its nodes and edges. Second, key to our model is the notion of semantic concepts that are used to express knowledge about both the visual and linguistic modalities, instead of working directly with the raw observational features. As our findings suggest, these ideas contribute significantly to the model’s performance compared to MAC and enhance its compositionality, transparency and generalization skills.

2 Ablation studies

To gain further insight into the relative contributions of different aspects of our model to its overall performance, we have conducted multiple ablation experiments, as summarized in table 5 and figure 4. First, we measure the impact of using our new visual features (section 3.2) compared to the default features used by the official baselines (Hudson & Manning, 2019). Such settings result in an improvement of 1.27%, confirming that most of the gain achieved by the NSM model (with 62.95% overall) stems from its inherent architecture rather than from the input features.

We then explore several ablated models: one where we do not perform any iterative simulation of the state machine, but rather directly predict the answer from its underlying graph structure (55.41%); and a second enhanced version where we perform a traversal across the states but without considering the typed relations among them, adopting instead a uniform fully-connected graph (58.83%). These experiments suggest that using the graph structure to represent the visual scene as well performing sequential reasoning over it by traversing its nodes, are both crucial to the model’s overall accuracy and have positive impact on its performance.

Next, we evaluate a variant of the model where instead of using the concept-based representations defined in section 3.1, we fallback to the more standard dense features to represent the various graph elements, which results in an accuracy of 58.48%. Comparing this score to that of the default model’s settings (62.95%) proves the significance of using the higher-level semantic representations to the overall accuracy of the NSM.

Finally, we measure the impact of varying the number of computation steps on the obtained results (figure 4), revealing a steady increase in accuracy as we perform a higher number of reasoning steps (until saturation for $N=8$ steps). These experiments further validate the effectiveness of sequential computation in addressing challenging questions, and especially compositional questions which are at the core of GQA.

3 Concept vocabulary

We define a vocabulary $C$ of 1335 embedded concepts about object types $C_{0}$ (e.g. cat, shirt), attributes, grouped into $L$ types ${\{C_{i}\}}_{i=1}^{L}$ (e.g. colors, materials), and relations $C_{L+1}$ (e.g. holding, on top of), all initialized with GloVe (Pennington et al., 2014). Our concepts consist of 785 objects, 170 relations, and 303 attributes that are divided into 77 types, with most types being binary (e.g. short/tall, light/dark). All concepts are derived from the Visual Genome dataset (Krishna et al., 2017), which provides human-annotated scene graphs for real-world images. We use the refined dataset supported by GQA (Hudson & Manning, 2019), where the graphs are further cleaned-up and consolidated, resulting in a closed-vocabulary version of the dataset.

4 Scene graph generation

In order to generate scene graphs from given images, we re-implement a model that largely follows Yang et al. (2018); Chen et al. (2019) in conjunction with an object detector proposed by Hu et al. (2018). Specifically, we use a variant of Mask R-CNN (He et al., 2017; Hu et al., 2018) to obtain object detections from each image, using Hu et al. (2018)’s official implementation along with ResNet-101 (He et al., 2016) and FPN (Lin et al., 2017) for features and region proposals respectively, and keep up to $50$ detections, with a confidence threshold of $0.2$ . We train the object detector (and the following graph generation model) over a cleaner version of Visual Genome scene graphs (Krishna et al., 2017), offered as part of the GQA dataset (Hudson & Manning, 2019). In particular, the detector heads are trained to classify both the object class and category (akin to YOLO’s hierarchical softmax (Redmon & Farhadi, 2017)), as well as the object’s attributes per each property type, resulting in a set of probability distributions $\{P_{i}\}_{i=0}^{L}$ over the object’s various properties (e.g. its identity, color or shape).

Once we obtain the graph nodes – the set of detected objects, we proceed to detecting their relations, mainly following the approach of Yang et al. (2018): First, we create a directed edge for each pair of objects that are in close enough proximity – as long as the relative distance between the objects in both axes is smaller than 15% of the image dimensions (which covers over 94% of the ground truth edges, and allows us to sparsify the graph, reducing the computation load). Once we set the graph structure, we process it through a graph attention network (Veličković et al., 2017) to predict the identities of each relation, resulting in a probability distribution $P_{L+1}$ over the relation type of each edge.

5 Implementation and training details

We train the model using the Adam optimization method (Kingma & Ba, 2014), with a learning rate of $10^{-4}$ and a batch size of 64. We use gradient clipping, and employ early stopping based on the validation accuracy, leading to a training process of approximately 15 epochs, equivalent to roughly 30 hours on a single Maxwell Titan X GPU. Both hidden states and word vectors have a dimension size of 300, the latter being initialized using GloVe (Pennington et al., 2014). During the training, we maintain exponential moving averages of the model weights, with a decay rate of 0.999, and use them at test time instead of the raw weights. Finally, we use ELU as non-linearity and dropout of 0.15 across the network: in the initial processing of images and questions, for the state and edge representations, and in the answer classifier. All hyperparameters were tuned manually (from the following ranges: learning rate $[5{\cdot}10^{-5},10^{-4},5{\cdot}10^{-4},10^{-3}]$ , batch size $ $, dropout$ [0.08,0.1,0.12,0.15,0.18,0.2] $and hidden and word dimensions$ $ to comply with GloVe provided sizes).

We have preprocessed all the questions by removing punctuation and keeping the top 5000 most common words, and use the standard training/test splits provided by the original datasets we have explored (Hudson & Manning, 2019; Agrawal et al., 2018): For GQA, we use the more common “balanced" version that has been designed to reduce biases within the answer distribution (similar in motivation to the VQA2 dataset (Goyal et al., 2017)), and includes 1.7M questions split into 70%/10%/10% for training, validation and test sets respectively. For VQA-CP, we use version v2 which consists of 438k/220k questions for training/test respectively. Finally, for the new generalization split, we have downsampled the data to have a ratio of 80%/20% for training/test over approximately 800k questions overall. All results reported are for a single-model settings, except the ensemble scores for GQA that compute majority vote over 10 models, and the ablation studies that are performed over 5 runs for each ablated version. To measure the confidence of the results, we have performed additional 5 runs of our best-performing model over both the GQA Test-Dev and VQA-CP, getting standard deviations of 0.22 and 0.31 respectively.

In terms of time complexity, the NSM is linear both in the number of question words $P$ , and the size of the graph ( $S\leq 50$ states and $E\approx O(S)$ due to the proximity-based pruning we perform while constructing the graph), multiplied by a constant $N=8$ of reasoning steps $O(N(V+E+P))$ . Similarly, the space complexity of our model is $O(V+E+P)$ since we do not have to store separated values for each computation step, but rather keep only the most recent ones.