Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Long Chen, Hanwang Zhang, Jun Xiao, Xiangnan He, Shiliang Pu, Shih-Fu Chang

Introduction

Visual scene understanding, e.g., what and where the things and stuff are, and how they relate to each other, is one of the core tasks in computer vision. With the maturity of object detection and segmentation, computers can recognize object categories, locations, and visual attributes well. However, scene understanding goes beyond the whereabouts of objects. A more crucial step is to infer their visual relationships: together with the objects, they offer comprehensive and coherent visually-grounded knowledge, called scene graphs. As shown in Figure 1 (a), the nodes and edges in scene graphs are objects and visual relationships, respectively. Moreover, the scene graph is an indispensable knowledge representation for many high-level vision tasks such as image captioning, visual reasoning, and VQA.

A straightforward solution for Scene Graph Generation (SGG) works in an independent fashion: it detects object bounding boxes with an existing object detector, and then predicts the object classes and their pairwise relationships separately. However, these methods overlook the fruitful visual context, which offers a powerful inductive bias that helps object and relationship detection. For example, in Figure 1, window and building usually co-occur within an image, and near is the most common relationship between tree and building; it is easy to infer that ? is building from window-on-? or tree-near-?. Exploiting this intuition has been shown empirically to boost SGG. More specifically, these context-based methods use a conditional random field to model the joint distribution of nodes and edges, where the context is incorporated by message passing among the nodes through edges via a multi-step mean-field approximation; then, the model is optimized by the sum of cross-entropy (XE) losses of nodes (i.e., objects) and edges (i.e., relationships).

Nevertheless, existing SGG methods do not capture the coherency of the visual context effectively, for one main reason: the XE based training objective is not graph-coherent. By “graph-coherent”, we mean that the quality of the scene graph should be measured at the graph level: the detected objects and relationships should be contextually consistent; however, the sum of XE losses of objects and relationships treats each prediction essentially independently. To see the negative impact of this inconsistency, suppose that the red and the blue nodes are both misclassified in Figure 1 (b). Based on the XE loss, the two errors are penalized equally. However, misclassifying the red node should be more severe than misclassifying the blue one, as the red error will influence more nodes and edges. Therefore, we need a graph-level metric such as Recall@K or SPICE to match the graph-coherent objective, which penalizes misclassifying important hub nodes more than others. Meanwhile, the training objective of SGG should be local-sensitive. By “local-sensitive”, we mean that the training objective is sensitive to the change of a single node. However, since the graph-coherent objective is a global pooling quantity, the individual contribution of the prediction of a node is lost. Thus, we need to design a disentangling mechanism to identify the individual contribution and provide an effective training signal for each local prediction.

In this paper, we propose a novel training paradigm: Counterfactual critic Multi-Agent Training (CMAT), to simultaneously meet the graph-coherent and local-sensitive requirements. Specifically, we design a novel communicative multi-agent model, where the objects are viewed as cooperative agents to maximize the quality of the generated scene graph. The action of each agent is to predict its object class labels, and each agent can communicate with others using pairwise visual features. The communication retains the rich visual context in SGG. After several rounds of agent communication, a visual relationship model triggers the overall graph-level reward by comparing the generated scene graph with the ground-truth.

For the graph-coherent objective, we directly define the objective as a graph-level reward (e.g., Recall@K or SPICE), and use policy gradient to optimize the non-differentiable objective. In the view of Multi-Agent Reinforcement Learning (MARL), especially the actor-critic methods, the relationship model can be framed as a critic and the object classification model serves as a policy network. For the local-sensitive objective, we subtract a counterfactual baseline from the graph-level reward by varying the target agent and fixing the others before feeding into the critic. As shown in Figure 1 (c), to approximate the true influence of the red node acting as bike, we fix the predictions of the other nodes and replace the bike by non-bike (e.g., person, boy, and car), and see how such counterfactual replacement affects the reward (e.g., the edges connecting their neighborhood are all wrong).

To better encode the visual context for more effective CMAT training, we design an efficient agent communication model, which discards the widely-used relationship nodes in existing message passing works. Thanks to this design, we disentangle the agent communication (i.e., message passing) from the visual relationship detection, allowing the former to focus on modeling the visual context, and the latter, which is a communication consequence, to serve as the critic that guides the graph-coherent objective.

We demonstrate the effectiveness of CMAT on the challenging Visual Genome benchmark. We observe consistent improvements across extensive ablations and achieve state-of-the-art performances on three standard tasks.

In summary, we make three contributions in this paper:

We propose a novel training paradigm: Counterfactual critic Multi-Agent Training (CMAT) for SGG. To the best of our knowledge, we are the first to formulate SGG as a cooperative multi-agent problem, which conforms to the graph-coherent nature of scene graphs.

We design a counterfactual critic that is effective for training because it makes the graph-level reward local-sensitive by identifying individual agent contributions.

We design an efficient agent communication method that disentangles the relationship prediction from the visual context modeling, where the former is essentially a consequence of the latter.

Related Work

Scene Graph Generation. Detecting visual relationships regained the community's attention after the pioneering work by Lu et al. and the advent of the first large-scale scene graph dataset by Krishna et al. In the early stage, many SGG works focused on detecting objects and visual relations independently, but these independent inference models overlook the fruitful visual context. To let both object and relationship detection benefit from visual context, recent SGG methods resort to the message passing mechanism. However, these methods fail to learn the visual context fully because of the conventional XE loss, which is not contextually consistent at the graph level. Unlike previous methods, in this paper, we propose a CMAT model to simultaneously meet the graph-coherent and local-sensitive requirements.

Multi-Agent Policy Gradient. Policy gradient is a family of methods that can optimize non-differentiable objectives. It has been well studied in many scene understanding tasks such as image captioning, VQA, visual grounding, visual dialog, and object detection. Liang et al. used a DQN to formulate SGG as a single-agent decision-making process. Different from these single-agent policy gradient settings, we formulate SGG as a cooperative multi-agent decision problem, where the training objective is contextually consistent at the graph level and conforms to the graph-coherent nature of scene graphs. Meanwhile, compared with many well-studied multi-agent game tasks, the number of agents (64 objects) and the action sample space (151 object categories) in CMAT are much larger.

Approach

In this section, we first introduce the components of the CMAT (Section 3.1). Then, we demonstrate the details about the training objective of the CMAT (Section 3.2).

We sequentially introduce the components of CMAT following the inference path shown in Figure 2, including object proposals detection, agent communication, and visual relationship detection.

3.1.2 Agent Communication

Given the $n$ detected objects from the previous step, we regard each object as an agent and each agent will communicate with the others for $T$ rounds to encode the visual context. In each round of communication, as illustrated in Figure 3, there are three modules: extract, message, and update. These modules share parameters among all agents and time steps to reduce the model complexity. In the following, we introduce the details of these three modules.

Extract Module. The extract module is implemented as an LSTM, which encodes the agent interaction history and extracts the internal state of each agent. Specifically, for agent $i$ (the $i^{th}$ object) at the $t$-th round ($0<t\leqslant T$) of communication:

where $\bm{h}^{t}_{i}$ is the hidden state of the LSTM (i.e., the internal state of the agent), $\bm{x}^{t}_{i}$ is the time-step input feature, and $\bm{s}^{t}_{i}$ is the object class confidence. The initializations of $\bm{x}^{t}_{i}$ (i.e., $\bm{x}^{0}_{i}$) and $\bm{s}^{t}_{i}$ (i.e., $\bm{s}^{0}_{i}$) come from the proposal detection step. $\bm{e}^{t}_{i}$ is the soft-weighted embedding of the class label, and the inputs are combined by a concatenation operation. $F_{s}$ and $F_{e}$ are learnable functions (for conciseness, we leave the details to the supplementary material). All internal states $\{\bm{h}^{t}_{i}\}$ are fed into the following message module to compose communication messages among agents.

Message Module. Considering the communication between agents $i$ and $j$, the message module composes the messages $M^{t}_{ij}$ and $M^{t}_{ji}$ for the two agents. Specifically, the message $M^{t}_{ij}$ for agent $i$ is a tuple $M^{t}_{ij}=(\bm{m}^{t}_{j},\bm{m}^{t}_{ij})$ including:

where $\bm{m}^{t}_{j}$ is a unary message which captures the identity of agent $j$ (e.g., the local object content), and $\bm{m}^{t}_{ij}$ is a pairwise message which models the interaction between the two agents (e.g., the relative spatial layout). $\bm{h}^{t}_{ij}$ is the pairwise feature between agents $i$ and $j$, and its initialization is the union box feature extracted by the object detector. $F_{m*}$ are message composition functions (details are in the supplementary material). All communication messages between agent $i$ and the others (i.e., $\{M^{t}_{i*}\}$) and its internal state $\bm{h}^{t}_{i}$ are fed into the following update module to update the time-step feature for the next round of agent communication. We call this step agent communication instead of message passing for two reasons: 1) to be consistent with the multi-agent framework, where agent communication means passing messages among agents; 2) to highlight the difference from existing message passing methods, namely that our communication model disentangles the relationship prediction from the visual context modeling.

Update Module. At each round of communication, we use soft attention to fuse the messages from other agents:

where $\alpha^{t}_{j}$ and $\alpha^{t}_{ij}$ are the attention weights used to fuse the different messages, and $F_{att*}$ and $F_{u*}$ are the attention and update functions (details are in the supplementary material).

3.1.3 Visual Relationship Detection

After $T$ rounds of agent communication, all agents have finished their state updates. In the inference stage, we greedily select the object labels $v^{T}_{i}$ based on the confidence $\bm{s}^{T}_{i}$. Then the relation model predicts the relationship class for every object pair:

where $F_{r}$ is the relationship function (details are in the supplementary material). After predicting the relationships for all object pairs, we finally obtain the generated scene graph: ($\{v^{T}_{i}\},\{r_{ij}\}$).

3.2 Counterfactual Critic Multi-Agent Training

We describe the details of the training objective of CMAT, including: 1) the multi-agent policy gradient for the graph-coherent objective, and 2) the counterfactual critic for the local-sensitive objective. The dataflow of CMAT in the training stage is shown in Figure 2.

Almost all prior SGG works minimize the XE loss as the training objective. Given a generated scene graph $(\hat{\mathcal{V}},\hat{\mathcal{E}})$ and its ground-truth $(\mathcal{V}^{gt},\mathcal{E}^{gt})$ , the objective is:

As can be seen in Eq. (5), the XE based objective is essentially independent and penalizes errors at all nodes equally.
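
To make the independence of the XE objective concrete, the following is a minimal sketch rather than the training code used in this paper. It assumes the objective is a plain sum of per-node and per-edge cross-entropy terms; the function names `cross_entropy` and `summed_xe` and the toy probabilities are hypothetical. It shows that the loss is the same whether the misclassified node is a hub or a leaf.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under one prediction."""
    return -math.log(probs[target])

def summed_xe(node_probs, node_targets, edge_probs, edge_targets):
    """Sum of independent XE terms over all nodes and all edges."""
    node_loss = sum(cross_entropy(p, t) for p, t in zip(node_probs, node_targets))
    edge_loss = sum(cross_entropy(p, t) for p, t in zip(edge_probs, edge_targets))
    return node_loss + edge_loss

# Toy setting (hypothetical): four nodes, three edges, target class 0 everywhere.
confident, mistaken = [0.9, 0.1], [0.1, 0.9]
edges = [[0.8, 0.2]] * 3

# Node 0 plays the role of a hub, node 3 the role of a leaf.
hub_wrong = summed_xe([mistaken, confident, confident, confident], [0] * 4, edges, [0] * 3)
leaf_wrong = summed_xe([confident, confident, confident, mistaken], [0] * 4, edges, [0] * 3)

# The summed XE cannot tell the two errors apart: both losses coincide
# (up to floating-point rounding), regardless of the graph structure.
```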

To address this problem, we propose to replace XE with the following two graph-level metrics as the graph-coherent training objective of SGG: 1) Recall@K: it computes the fraction of correctly predicted triplets among the top $K$ confident predictions. 2) SPICE: it is the F-score of the precision and recall of the predicted triplets. Unlike the XE loss, both Recall@K and SPICE are non-differentiable. Thus, CMAT resorts to the multi-agent policy gradient to optimize these objectives.
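
The two graph-level metrics can be sketched as follows. This is an illustrative simplification, not the evaluation code used in the experiments: triplets are matched by exact label equality (box localization is ignored), Recall@K is normalized by the number of ground-truth triplets as in the usual convention, and the SPICE-style score is reduced to a plain F-score over triplets. The function names and the toy triplets are hypothetical.

```python
def recall_at_k(scored_triplets, gt_triplets, k):
    """Fraction of ground-truth triplets recovered by the top-k predictions.

    scored_triplets: list of ((subject, predicate, object), confidence).
    """
    ranked = sorted(scored_triplets, key=lambda item: item[1], reverse=True)
    top_k = {triplet for triplet, _ in ranked[:k]}
    gt = set(gt_triplets)
    return len(top_k & gt) / len(gt) if gt else 0.0

def triplet_f_score(pred_triplets, gt_triplets):
    """F-score of triplet precision and triplet recall."""
    pred, gt = set(pred_triplets), set(gt_triplets)
    hits = len(pred & gt)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gt)
    return 2 * precision * recall / (precision + recall)

# Toy example (hypothetical labels and confidences).
scored = [
    (("man", "riding", "bike"), 0.9),
    (("man", "wearing", "hat"), 0.8),
    (("tree", "near", "bike"), 0.3),
]
ground_truth = [("man", "riding", "bike"), ("tree", "near", "building")]

r_at_2 = recall_at_k(scored, ground_truth, k=2)
f_score = triplet_f_score([t for t, _ in scored], ground_truth)
```

Both functions return a single scalar per graph and involve a top-$K$ selection or set intersection, which is why they cannot be optimized by back-propagation directly.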

3.2.2 Multi-Agent Policy Gradient

We first describe formally the action, policy and state in CMAT, then derive the expression of parameter gradients.

Action. The action space for each agent is the set of all possible object classes, i.e., $v^{t}_{i}$ is the action of agent $i$ . We denote $V^{t}=\{v^{t}_{i}\}$ as the set of actions of all agents.

State. Following previous work, we use an LSTM (the extract module) to encode the history of each agent. The hidden state $\bm{h}^{t}_{i}$ can be regarded as an approximation of the partially-observable environment state for agent $i$. We denote $H^{t}=\{\bm{h}^{t}_{i}\}$ as the set of states of all agents.

Policy. The stochastic policy for each agent is the object classifier. In the training stage, the action is sampled based on the object class distribution, i.e., $\bm{p}^{T}_{i}=\text{softmax}(\bm{s}^{T}_{i})$ .

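Because CMAT samples an action for each agent only after $T$ rounds of agent communication, based on the policy gradient theorem, the (stochastic) gradient for the cooperative multi-agent in CMAT is:

$$\nabla_{\theta}J\approx\sum^{n}_{i=1}\nabla_{\theta}\log\bm{p}^{T}_{i}(v^{T}_{i}|h^{T}_{i};\theta)\,Q(H^{T},V^{T}) \qquad (6)$$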

where $Q(H^{T},V^{T})$ is the state-action value function. Instead of learning an independent network to fit $Q$ and approximate the reward, as actor-critic works do, we follow prior work and directly use the real global reward in place of $Q$. The reasons are as follows: 1) The number of agents and the number of possible actions for each agent in SGG are much larger than in previous multi-agent policy gradient settings, so the number of training samples is insufficient to train an accurate value function. 2) This reduces the model complexity and speeds up the training procedure. Thus, the gradient for CMAT becomes:

$$\nabla_{\theta}J\approx\sum^{n}_{i=1}\nabla_{\theta}\log\bm{p}^{T}_{i}(v^{T}_{i}|h^{T}_{i};\theta)\,R(H^{T},V^{T}) \qquad (7)$$

where $R(H^{T},V^{T})$ is the real graph-level reward (i.e., Recall@K or SPICE). It is worth noting that the reward $R(H^{T},V^{T})$ is a learnable reward function which includes a relation detection model.
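
The policy and the reward-weighted gradient can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the CMAT implementation: the gradient is taken with respect to the class scores $\bm{s}^{T}_{i}$ only (using the identity that the gradient of $\log\text{softmax}(\bm{s})_{v}$ with respect to $\bm{s}$ is the one-hot vector of $v$ minus the softmax probabilities), and the reward is passed in as a plain number. The function names are hypothetical.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_actions(all_scores, rng):
    """Training-time policy: each agent samples a class from softmax(s_i)."""
    actions = []
    for scores in all_scores:
        probs = softmax(scores)
        actions.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return actions

def reward_weighted_grad(all_scores, actions, reward):
    """Gradient of sum_i log p_i(v_i) * reward with respect to the scores.

    Every agent is scaled by the same global reward.
    """
    grads = []
    for scores, action in zip(all_scores, actions):
        probs = softmax(scores)
        grads.append([
            ((1.0 if c == action else 0.0) - p) * reward
            for c, p in enumerate(probs)
        ])
    return grads

scores = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]   # two agents, three classes
sampled = sample_actions(scores, random.Random(0))
grads = reward_weighted_grad(scores, actions=[0, 2], reward=2.0)
```

With a positive reward, the score of each sampled class is pushed up and all other scores are pushed down, by an amount that is identical in scale for every agent.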

3.2.3 Local-Sensitive Training Objective

As can be seen in Eq. (7), the graph-level reward can be considered as a global pooling contribution from all the local predictions, i.e., the reward is identical for all $n$ agents. We demonstrate the negative impact of this situation with a toy example, shown in Figure 4.

Suppose all predictions of two generated scene graphs are identical, except that the prediction of node “a” is different. Based on Eq. (7), all nodes in the first graph get a positive reward (i.e., 3 (right) - 1 (wrong) = +2) and all nodes in the second graph get a negative reward (i.e., 1 (right) - 3 (wrong) = -2). The predictions for the nodes “b”, “c”, and “d” are identical in the two graphs, but their gradient directions for optimization are opposite, which results in many inefficient optimization steps. Thus, the training objective of SGG should be local-sensitive, i.e., it should identify the contribution of each local prediction to provide an efficient training signal for each agent.
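
The toy example can be reproduced numerically. The graph below is a hypothetical stand-in for Figure 4 (the edge list and the always-mislabeled edge are assumptions chosen so that the counts match the +2 and -2 above): a triplet counts as right only if its relationship label and both of its end nodes are correct, and the reward is the number of right triplets minus the number of wrong ones.

```python
# Hypothetical 4-node graph: edge -> whether its relationship label is correct.
EDGE_RELATION_OK = {
    ("a", "b"): True,
    ("a", "c"): True,
    ("b", "d"): True,
    ("c", "d"): False,   # assumed to be mislabeled in both graphs
}

def global_reward(node_ok):
    """Number of right triplets minus number of wrong triplets."""
    right = sum(
        1 for (subj, obj), rel_ok in EDGE_RELATION_OK.items()
        if rel_ok and node_ok[subj] and node_ok[obj]
    )
    wrong = len(EDGE_RELATION_OK) - right
    return right - wrong

def per_agent_rewards(node_ok):
    """Without disentangling, every agent receives the same global reward."""
    shared = global_reward(node_ok)
    return {node: shared for node in node_ok}

graph_1 = {"a": True, "b": True, "c": True, "d": True}    # node "a" correct
graph_2 = {"a": False, "b": True, "c": True, "d": True}   # only node "a" differs

rewards_1 = per_agent_rewards(graph_1)
rewards_2 = per_agent_rewards(graph_2)
```

Nodes “b”, “c”, and “d” make the same predictions in both graphs, yet their reward flips sign purely because of node “a”.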

3.2.4 Counterfactual Critic

Given the global reward $R(H^{T},V^{T})$ and the counterfactual baseline $\text{CB}^{i}(H^{T},V^{T})$ for the action $v^{T}_{i}$ of agent $i$, the disentangled contribution of the action of agent $i$ is:

$$A^{i}(H^{T},V^{T})=R(H^{T},V^{T})-\text{CB}^{i}(H^{T},V^{T})$$

Note that $A^{i}(H^{T},V^{T})$ can be considered as the advantage in actor-critic methods, and $\text{CB}^{i}(H^{T},V^{T})$ can be regarded as a baseline in policy gradient methods, which reduces the variance of gradient estimation. The whole network that calculates $A^{i}(H^{T},V^{T})$ is dubbed the counterfactual critic (Figure 2). Although the critic in CMAT is not a value function that estimates the reward as in actor-critic methods, we call it a critic for two reasons: 1) The essence of a critic is calculating advantages for the actions of the policy network; as in previous policy gradient work, the critic can be an inference algorithm without a value function. 2) The critic in CMAT includes a learnable relation model, which also updates its parameters during training. Then the gradient becomes:

$$\nabla_{\theta}J\approx\sum^{n}_{i=1}\nabla_{\theta}\log\bm{p}^{T}_{i}(v^{T}_{i}|h^{T}_{i};\theta)\,A^{i}(H^{T},V^{T})$$

Finally, we incorporate the auxiliary XE supervised loss (weighted by a trade-off $\alpha$ ) for an end-to-end training, and the overall gradient is:

where CMAT encourages visual context exploration and XE stabilizes the training. Following prior work, we also add an entropy term to regularize $\{\bm{p}^{T}_{i}\}_{i}$.
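
The disentangled contribution described in this section can be sketched as follows. This is a hedged illustration, not the CMAT code: it assumes the counterfactual baseline of agent $i$ is the expectation, under that agent's own class distribution, of the reward obtained when only its action is replaced and all other actions stay fixed. The graph-level reward is abstracted as a hypothetical callable `reward_fn`, and the toy reward simply weights a hub agent more than a leaf agent.

```python
def counterfactual_advantages(actions, probs, reward_fn):
    """A^i = R(V) - sum_c p_i(c) * R(V with agent i's action replaced by c)."""
    reward = reward_fn(actions)
    advantages = []
    for i, p_i in enumerate(probs):
        baseline = 0.0
        for c, p in enumerate(p_i):
            counterfactual = list(actions)
            counterfactual[i] = c          # vary agent i, keep the others fixed
            baseline += p * reward_fn(counterfactual)
        advantages.append(reward - baseline)
    return advantages

# Toy reward (hypothetical): agent 0 is a hub worth 3, agent 1 a leaf worth 1.
TARGET = [2, 0]
WEIGHT = [3.0, 1.0]

def toy_reward(actions):
    return sum(w for a, t, w in zip(actions, TARGET, WEIGHT) if a == t)

probs = [[0.2, 0.2, 0.6], [0.5, 0.25, 0.25]]
advantages = counterfactual_advantages([2, 0], probs, toy_reward)
```

Both agents act correctly here and share the same global reward of 4, but the hub agent receives the larger advantage (about 1.2 versus 0.5), because replacing its action would cost more reward.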

Experiments

Dataset. We evaluated our method for SGG on the challenging Visual Genome (VG) benchmark. For fair comparisons, we used the released data preprocessing and splits, which have been widely used in prior work. The release selects the most frequent 150 object categories and 50 predicate classes. After preprocessing, each image has 11.5 objects and 6.2 relationships on average. The released split uses 70% of the images for training (including 5K images as a validation set) and 30% of the images for testing.

Settings. Following convention, we evaluate SGG on three tasks. Predicate Classification (PredCls): given the ground-truth object bounding boxes and class labels, we need to predict the visual relationship classes among all the object pairs. Scene Graph Classification (SGCls): given the ground-truth object bounding boxes, we need to predict both the object and pairwise relationship classes. Scene Graph Detection (SGDet): given an image, we need to detect the objects and predict their pairwise relationship classes. In particular, object detection needs to localize both the subject and the object with at least 0.5 IoU with the ground-truth. Following convention, we used Recall@20 (R@20), Recall@50 (R@50), and Recall@100 (R@100) as the evaluation metrics.
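
The 0.5 IoU localization requirement for SGDet can be illustrated with a short sketch. Boxes are assumed to be `(x1, y1, x2, y2)` tuples with `x2 > x1` and `y2 > y1`; the function names are hypothetical and the class-label check that a full evaluation also performs is omitted.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def triplet_is_localized(pred_subj, pred_obj, gt_subj, gt_obj, threshold=0.5):
    """Both the subject box and the object box must reach the IoU threshold."""
    return (box_iou(pred_subj, gt_subj) >= threshold
            and box_iou(pred_obj, gt_obj) >= threshold)

iou_example = box_iou((0, 0, 2, 2), (1, 0, 3, 2))   # overlap 2, union 6
matched = triplet_is_localized((0, 0, 10, 10), (20, 20, 30, 30),
                               (0, 0, 10, 8), (20, 20, 30, 30))
missed = triplet_is_localized((0, 0, 10, 10), (20, 20, 30, 30),
                              (0, 0, 10, 8), (25, 25, 40, 40))
```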

Object Detector. For fair comparisons with previous works, we adopted the same object detector as prior work. Specifically, the object detector is a Faster-RCNN with a VGG backbone. Moreover, the anchor box sizes and aspect ratios are adjusted similarly to YOLO-9000, and the RoIPooling layer is replaced with the RoIAlign layer.

Training Details. Following previous policy gradient works that use a supervised pre-training step as model initialization (also known as teacher forcing), CMAT also uses this two-stage training strategy. In the supervised training stage, we froze the layers before the RoIAlign layer and optimized the whole framework with the sum of the object and relationship XE losses. The batch size and initial learning rate were set to 6 and $10^{-3}$, respectively. In the policy gradient training stage, the initial learning rate was set to $3\times 10^{-5}$. For SGDet, since the number of all possible relationship pairs is huge (e.g., 64 objects lead to $\approx$ 4,000 pairs), we followed prior work and only considered the relationships between two objects with overlapping bounding boxes, which reduced the number of object pairs to around 1,000.
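
The pair counts above follow from treating subject and object as ordered: 64 objects give $64\times 63=4{,}032$ candidate pairs. A minimal sketch of the overlap filter is given below; the strict-overlap test, the `(x1, y1, x2, y2)` box format, and the function names are assumptions for illustration.

```python
from itertools import permutations

def boxes_overlap(box_a, box_b):
    """True if two (x1, y1, x2, y2) boxes share a region of positive area."""
    return (min(box_a[2], box_b[2]) > max(box_a[0], box_b[0])
            and min(box_a[3], box_b[3]) > max(box_a[1], box_b[1]))

def candidate_pairs(boxes, require_overlap=True):
    """Ordered (subject, object) index pairs, optionally overlap-filtered."""
    return [
        (i, j)
        for i, j in permutations(range(len(boxes)), 2)
        if not require_overlap or boxes_overlap(boxes[i], boxes[j])
    ]

# 64 side-by-side unit boxes: all ordered pairs versus overlapping pairs only.
row_of_boxes = [(i, 0, i + 1, 1) for i in range(64)]
all_pairs = candidate_pairs(row_of_boxes, require_overlap=False)

small_scene = [(0, 0, 2, 2), (1, 1, 3, 3), (5, 5, 6, 6)]
overlapping = candidate_pairs(small_scene)
```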

Speed vs. Accuracy Trade-off. In the policy gradient training stage, the complete counterfactual critic calculation needs to sum over all possible object classes, which is very time-consuming (over 9,600 ($\approx 151\times 64$) graph-level evaluations at each iteration). Fortunately, we noticed that only a few classes for each agent have large prediction confidence. To trade off training speed against accuracy, we only sum over the two highest-probability positive classes and the background class to estimate the counterfactual baseline. In our experiments, this approximation results in only a slight performance drop while making training 70x faster.
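
The truncated estimate can be sketched for a single agent as follows. Two details are assumptions made only for this illustration: the background class is taken to be index 0, and the kept probabilities are renormalized to sum to one. The callable `reward_if_class` is a hypothetical stand-in for a graph-level evaluation with the agent's action replaced.

```python
def truncated_candidates(class_probs, background=0, top_positive=2):
    """The top positive (non-background) classes plus the background class."""
    positives = [c for c in range(len(class_probs)) if c != background]
    positives.sort(key=lambda c: class_probs[c], reverse=True)
    return positives[:top_positive] + [background]

def truncated_baseline(class_probs, reward_if_class, background=0, top_positive=2):
    """Counterfactual baseline estimated from the kept classes only."""
    kept = truncated_candidates(class_probs, background, top_positive)
    mass = sum(class_probs[c] for c in kept)
    return sum(class_probs[c] / mass * reward_if_class(c) for c in kept)

probs = [0.10, 0.05, 0.60, 0.25]          # index 0 is background (assumed)
kept = truncated_candidates(probs)
baseline = truncated_baseline(probs, lambda c: 1.0 if c == 2 else 0.0)
```

Under this scheme each agent needs three graph-level evaluations for its baseline instead of one per object class.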

Post-processing for SGDet. For SGDet, we followed the post-processing step of prior work for a fair comparison. After predicting the object class probabilities for each RoI, we used a per-class NMS to select the RoI class and its corresponding class-specific offsets from Faster-RCNN. The IoU threshold in NMS was set to 0.5 in our experiments.

4.2 Ablative Studies

We run a number of ablations to analyze CMAT, including the graph-level reward choice (for graph-coherent characteristic), the effectiveness of counterfactual baseline (for local-sensitive characteristic), and the early saturation problem in agent communication model. Results are shown in Table 2 and discussed in detail next.

Graph-level Reward Choices. To investigate the influence of choosing different graph-level metrics as the training reward, we compared two metrics: Recall@K and SPICE. In particular, we used the top-20 confident triplets as the predictions to calculate Recall and SPICE. The results are shown in Table 2 (a). We can observe that using either Recall or SPICE as the training reward consistently improves over the XE pre-trained model, because both graph-level metrics are graph-coherent objectives. Meanwhile, using Recall@K as the training reward always gives slightly better performance than SPICE, because SPICE is not a suitable evaluation metric given the incomplete annotations of VG. Therefore, we used Recall@K as our training reward in the rest of the experiments.

Policy Gradient Baselines. To evaluate the effectiveness of our counterfactual baseline (CF), we compared it with two other widely-used baselines in policy gradient: Moving Average (MA) and Self-Critical (SC). MA is a moving average over the recent rewards. SC is the reward received when the model directly takes greedy actions, as at test time. From Table 2 (b), we can observe that our CF baseline consistently improves over the supervised initialization and outperforms the others. Meanwhile, MA and SC improve the performance only slightly or even worsen it. This is because the CF baseline is a local-sensitive objective and provides a more effective training signal for each agent, while the MA and SC baselines are only globally pooled rewards, which are still not local-sensitive.
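
For comparison, the two global baselines can be sketched as below. The exponential form of the moving average and its momentum value are assumptions (only a moving average over recent rewards is specified above), and `reward_fn` is a hypothetical stand-in for the graph-level reward. Both baselines return one scalar that is shared by all agents.

```python
class MovingAverageBaseline:
    """Exponential moving average of recent rewards (assumed form of MA)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.value = None

    def update(self, reward):
        if self.value is None:
            self.value = reward
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * reward
        return self.value

def self_critical_baseline(all_probs, reward_fn):
    """Reward obtained when every agent takes its greedy (argmax) action."""
    greedy = [max(range(len(p)), key=p.__getitem__) for p in all_probs]
    return reward_fn(greedy)

ma = MovingAverageBaseline(momentum=0.9)
first = ma.update(1.0)
second = ma.update(0.0)

sc = self_critical_baseline([[0.1, 0.9], [0.7, 0.3]], lambda acts: float(sum(acts)))
```

Because neither quantity depends on which agent is being updated, subtracting it shifts every agent's reward by the same amount and cannot separate the contribution of a hub node from that of a leaf node.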

# of Communication Steps. To investigate the early saturation issue in message passing models, we compared the performance of CMAT with different numbers of communication steps, from 2 to 5. From Table 2 (c), we can observe that performance appears to improve continually as the number of communication steps increases. Due to the GPU memory limit, we conducted experiments with up to 5 steps. Compared to existing message passing methods, CMAT can avoid the early saturation issue because our agent communication model discards the widely-used relationship nodes.

4.3 Comparisons with the State of the Art

Settings. We compared CMAT with the state-of-the-art models. According to whether the model encodes context, we group these methods into: 1) VRD, AsscEmbed, and FREQ, which are independent inference models that predict object and relation classes independently. 2) MSDN, IMP, TFR, MOTIFS, Graph-RCNN, GPI, and KER, which are joint inference models that adopt message passing to encode the context. All these models are optimized with an XE based training objective.

Quantitative Results. The quantitative results are reported in Table 1. We can observe that our CMAT model achieves state-of-the-art performance under all evaluation metrics. It is worth noting that CMAT improves the performance of SGCls especially significantly (i.e., 3.4% and 4.3% absolute improvement with and without the graph constraint, respectively), which means that CMAT substantially improves the object label predictions compared to other models. This improvement in object label predictions matches the design of CMAT, where the action of each agent is to predict an object label. It also demonstrates the effectiveness of counterfactual critic multi-agent training for message passing models (agent communication) compared with XE based training. For the PredCls task, even though we use the simplest visual relationship model, CMAT achieves the best performance, which means that the input to the relationship model (i.e., the state of each agent) better captures the internal state of each agent. Meanwhile, it is worth noting that any stronger relationship model can be seamlessly incorporated into CMAT. For the SGDet task, the improvements are not as significant as for SGCls; the reason may be the imperfect and noisy detected bounding boxes.

Qualitative Results. Figure 6 shows the qualitative results compared with MOTIFS. From the results in the top two rows, we can observe that CMAT rarely makes mistakes at the important hub nodes such as the man or girl, because CMAT directly optimizes the graph-coherent objective. From the results in the bottom two rows, the mistakes of CMAT always come from the incomplete annotation of VG: CMAT can detect more false positive (the blue color) objects and relationships than MOTIFS. Since the evaluation metric (i.e., Recall@K) is based on the ranking of labeled triplet confidence, detecting more reasonable false positive results with high confidence will worsen the results.

Conclusions

We proposed a novel method, CMAT, to address the inherent problem of the XE based objective in SGG: it is not a graph-coherent objective. CMAT solves this problem by 1) formulating SGG as a multi-agent cooperative task and using graph-level metrics as the training reward, and 2) disentangling the individual contribution of each object to allow a more focused training signal. We validated the effectiveness of CMAT through extensive comparative and ablative experiments. Moving forward, we are going to 1) design a more effective graph-level metric to guide the CMAT training and 2) apply CMAT in downstream tasks such as VQA, dialog, and captioning.

Acknowledgement This work is supported by Zhejiang Natural Science Foundation (LR19F020002, LZ17F020001), National Natural Science Foundation of China (6197020369, 61572431), National Key Research and Development Program of China (SQ2018AAA010010), the Fundamental Research Funds for the Central Universities and Chinese Knowledge Center for Engineering Sciences and Technology. L. Chen is supported by 2018 Zhejiang University Academic Award for Outstanding Doctoral Candidates. H. Zhang is supported by NTU Data Science and Artificial Intelligence Research Center (DSAIR) and NTU-Alibaba Lab.

Appendix

This supplementary document is organized as follows:

Section A provides the details of some simplified functions in Agent Communication and Visual Relationship Detection.

Section B provides the detailed proof of the CMAT convergence, which guarantees that the proposed CMAT method can converge to a locally optimal policy.

Section C provides the detailed derivation of Eq. (6), i.e., $\nabla_{\theta}J\approx\sum^{n}_{i=1}\nabla_{\theta}\log\bm{p}^{T}_{i}(v^{T}_{i}|h^{T}_{i};\theta)Q(H^{T},V^{T})$ .

Section D shows more qualitative results of CMAT compared with the strong baseline MOTIFS in SGDet setting.

Appendix A Details of Some Simplified Functions

We demonstrate the details of some omitted functions in Eq. (1), Eq. (2), Eq. (3) and Eq. (4).

Predicate Visual Features $z_{ij}$. For the predicate visual features, we used RoIAlign to pool the union box of the subject and object, and resized the union box feature to $7\times 7\times 512$. Following prior work, we used a $14\times 14\times 2$ binary feature map to model the geometric spatial position of the subject and object, with one channel per box. We applied two convolutional layers on this binary feature map and obtained a new $7\times 7\times 512$ spatial position feature map. We added this position feature map to the resized union box feature, and applied two fully-connected layers to obtain the final predicate visual feature.
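
The two-channel binary map can be sketched as follows. The sketch assumes that each box is rasterized relative to the union box of the pair (which must have positive width and height) and stores the result channel-first as two $14\times 14$ grids; both choices, and the function names, are assumptions for illustration.

```python
import math

def box_mask(box, frame, size=14):
    """Binary size x size grid marking where `box` lies inside `frame`."""
    fx1, fy1, fx2, fy2 = frame
    scale_x = size / (fx2 - fx1)
    scale_y = size / (fy2 - fy1)
    col_lo = max(0, math.floor((box[0] - fx1) * scale_x))
    col_hi = min(size, math.ceil((box[2] - fx1) * scale_x))
    row_lo = max(0, math.floor((box[1] - fy1) * scale_y))
    row_hi = min(size, math.ceil((box[3] - fy1) * scale_y))
    return [
        [1 if row_lo <= r < row_hi and col_lo <= c < col_hi else 0
         for c in range(size)]
        for r in range(size)
    ]

def pair_spatial_map(subj_box, obj_box, size=14):
    """Two channels: channel 0 for the subject box, channel 1 for the object box."""
    union = (min(subj_box[0], obj_box[0]), min(subj_box[1], obj_box[1]),
             max(subj_box[2], obj_box[2]), max(subj_box[3], obj_box[3]))
    return [box_mask(subj_box, union, size), box_mask(obj_box, union, size)]

# Subject fills the left half of the union box, object the right half.
spatial = pair_spatial_map((0, 0, 7, 14), (7, 0, 14, 14))
```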

Appendix B Proof of the Convergence of CMAT

We denote $\pi_{i}$ as the policy of agent $i$ , i.e., $\pi_{i}=\bm{p}^{T}_{i}$ and $\bm{\pi}$ as the joint policy of all agents, i.e., $\bm{\pi}=\{\bm{p}^{T}_{1},...,\bm{p}^{T}_{n}\}$ . Then, the expected gradient of CMAT is given by (cf. Eq. (11)):

First, consider the expected contribution of this counterfactual baseline $b(H^{T},V^{T}_{-i})$ ,

Let $d^{\bm{\pi}}(s)$ be the discounted ergodic state distribution as defined in prior work:

Thus, this counterfactual baseline does not change the expected gradient. The remainder of the expected policy gradient is given by:

Writing the joint policy into a product of the independent policies:

we have the standard single-agent policy gradient:

Konda et al. proved that this gradient converges to a local maximum of the expected return $J$ , given that: 1) the policy $\bm{\pi}$ is differentiable, 2) the update timescales for $\bm{\pi}$ are sufficiently slow. Meanwhile, the parameterization of the policy (i.e., the single-agent joint-action learner is decomposed into independent policies) is immaterial to convergence, as long as it remains differentiable. ∎
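
As a sketch of why the baseline term in the proof above vanishes, consider a fixed state $H^{T}$ and assume, as in the proof, that the joint policy factorizes into independent per-agent policies and that the baseline $b(H^{T},V^{T}_{-i})$ does not depend on the action $v^{T}_{i}$ of agent $i$. The intermediate steps below are written out for illustration in the notation of this appendix; the sums run over the actions of the other agents ($V^{T}_{-i}$) and over the actions of agent $i$ ($v^{T}_{i}$):

$$
\begin{aligned}
\mathbb{E}_{\bm{\pi}}\left[\nabla_{\theta}\log\pi_{i}(v^{T}_{i}|h^{T}_{i})\,b(H^{T},V^{T}_{-i})\right]
&=\sum_{V^{T}_{-i}}\bm{\pi}(V^{T}_{-i}|H^{T})\,b(H^{T},V^{T}_{-i})\sum_{v^{T}_{i}}\pi_{i}(v^{T}_{i}|h^{T}_{i})\,\nabla_{\theta}\log\pi_{i}(v^{T}_{i}|h^{T}_{i})\\
&=\sum_{V^{T}_{-i}}\bm{\pi}(V^{T}_{-i}|H^{T})\,b(H^{T},V^{T}_{-i})\,\nabla_{\theta}\sum_{v^{T}_{i}}\pi_{i}(v^{T}_{i}|h^{T}_{i})\\
&=\sum_{V^{T}_{-i}}\bm{\pi}(V^{T}_{-i}|H^{T})\,b(H^{T},V^{T}_{-i})\,\nabla_{\theta}1=0.
\end{aligned}
$$

The second line uses $\pi_{i}\nabla_{\theta}\log\pi_{i}=\nabla_{\theta}\pi_{i}$, and the last line uses the fact that the probabilities of agent $i$ sum to one, so their gradient is zero.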

Appendix C Derivation of Eq. (6)

Based on the policy gradient theorem, we provide the detailed derivation of Eq. (6) as follows. We denote the action sequence for agent $i$ as $\hat{A}_{i}=\{\hat{a}^{1}_{i},\hat{a}^{2}_{i},...,\hat{a}^{T}_{i}\}$, and the value function $V_{\theta}(\hat{A}_{i})$ as the expected future reward of sequence $\hat{A}_{i}$. Then the gradient of agent $i$ is:

Further, the gradient for agent $i$ can be simplified as:

Therefore, for time step $t$, the gradient for agent $i$ is $\nabla_{\theta}\log\pi^{t}_{i}(a^{t}_{i})Q(s^{t}_{i},a^{t}_{i})$. For multiple agents in a cooperative environment, the state-action function $Q$ should estimate the reward based on the set of all agent states and actions, i.e., $Q(S^{t},A^{t})$. Then, the gradient for all agents is:

In CMAT, actions are sampled only after $T$ rounds of agent communication; the action of agent $i$ is $v^{T}_{i}$, the policy function is $\bm{p}^{T}_{i}$, and the state of the agent is $\bm{h}^{t}_{i}$, i.e., $S^{t}=H^{t}$ and $A^{t}=V^{t}$. Therefore, the gradient for the cooperative multi-agent in CMAT is:

$$\nabla_{\theta}J\approx\sum^{n}_{i=1}\nabla_{\theta}\log\bm{p}^{T}_{i}(v^{T}_{i}|h^{T}_{i};\theta)\,Q(H^{T},V^{T})$$

Appendix D More Qualitative Results

Figures 7 and 8 show more qualitative results of CMAT and MOTIFS in the SGDet setting. From the rows where CMAT is better than MOTIFS, we can see that CMAT rarely makes mistakes at the important hub nodes such as the “surfboard” or “laptop”. This is because CMAT directly optimizes the graph-coherent objective. However, the other rows show that the mistakes made by CMAT always come from the incomplete annotation of VG: CMAT can detect more false positive (the blue color) objects and relationships than MOTIFS. Since the evaluation metric (i.e., Recall@K) is based on the ranking of labeled triplet confidence, detecting more reasonable false positive results with high confidence can worsen the performance.