Comprehending and Ordering Semantics for Image Captioning

Yehao Li, Yingwei Pan, Ting Yao, Tao Mei

Introduction

Describing visual content in natural language is a fundamental human capability that children acquire from an early age. To formalize this capability, the task of image captioning was developed to simulate human-like interaction between vision and language. The ultimate target of this task is to produce a visually-grounded and linguistically coherent sentence that covers most of the semantics in an image that are worthy of mention and describes them in linguistic order. Modern image captioning techniques generally focus on the former aspect, enhancing vision-language alignment by first capturing fine-grained semantics (e.g., attributes, objects, or scene graphs) via a pre-trained image encoder (object detector/classifier). A series of innovations that employ visual attention over these fine-grained semantics have then been presented to strengthen vision-language interaction. However, the semantic comprehension capability of a pre-trained detector/classifier is severely limited by its pre-defined semantic/class labels. In addition, the pre-trained detector/classifier is not optimized along with the sentence decoding process, and can therefore hardly be tuned to emphasize visually salient semantics in the output sentence. As shown in Figure 1 (a), the pre-trained object detector (Faster R-CNN) captures only one major semantic word (“man”), while the other mined semantic words are either irrelevant (e.g., “horse”) or trivial (e.g., “sky” and “bushes”).

To enhance the scalability and generalization of the image encoder, a recent pioneering practice is to leverage the CLIP model (i.e., an image encoder and a text encoder), which is trained on diverse and large-scale data. In this work, we regard CLIP as a powerful cross-modal retrieval model that retrieves relevant sentences from the human-annotated sentence pool. This naturally accumulates more salient semantic words that tend to be mentioned in visually similar images, but also introduces more irrelevant semantic words (see Figure 1 (b)). To alleviate this issue, we design a semantic comprehender that further refines the primary semantic cues in the retrieved sentences based on visual content. By doing so, the semantic comprehender (see Figure 1 (c)) not only filters out the irrelevant semantic words (e.g., “horse”), but also learns to infer the missing relevant semantic words (e.g., “cow” and “rides”), pursuing an enriched and accurate semantic understanding.

In pursuit of the linguistic coherence of the output sentence, recent advances directly capitalize on an RNN/Transformer-based sentence decoder for language modeling. Unfortunately, such a paradigm relies too heavily on language priors and sometimes tends to hallucinate semantic words that are not actually in the image, a phenomenon known as “object hallucination”. Here we propose to mitigate the issue by exploiting the inherent linguistic ordering of semantics as additional supervisory signals to guide the sentence decoding process. Technically, a semantic ranker (see Figure 1 (c)) is leveraged to rank all the refined semantic words derived from the semantic comprehender in linguistic order, yielding a sequence of ordered semantic words. This semantic word sequence emphasizes the relative linguistic position of each semantic word in the sequence. As such, the sequence acts as the inherent skeleton of the descriptive sentence, and can thus be exploited to encourage the generation of relevant words at each decoding timestep.

In this work, we design a novel Transformer-style encoder-decoder structure for image captioning, namely Comprehending and Ordering Semantics Networks (COS-Net). Our starting point is to unify the above-mentioned two processes of semantic comprehending and ordering into a single scheme, so that both the semantic comprehender and the semantic ranker can be jointly optimized to better suit the sentence decoding procedure. Specifically, we first take the off-the-shelf CLIP as a cross-modal retrieval model to retrieve semantically similar sentences for the input image. All semantic words in the retrieved sentences are initially regarded as the primary semantic cues. Next, based on the output grid features of the image encoder in CLIP, a visual encoder contextually encodes each grid feature into a visual token via self-attention. Taking the primary semantic cues and visual tokens as inputs, the semantic comprehender filters out irrelevant semantic words in the primary semantic cues and meanwhile reconstructs the missing relevant semantic words through a cross-attention mechanism. After that, the semantic ranker learns to arrange all the refined semantic words in linguistic order by upgrading each semantic word with the encoding of its estimated linguistic position. Finally, both the visual tokens and the ordered semantic words are dynamically integrated via attention to auto-regressively decode the output sentence word by word.

The main contribution of this work is the proposal of jointly comprehending and ordering the semantics in an image to boost image captioning. This also offers elegant views of how to capture the richer relevant semantics that are worthy of mention from visual content, and how to explore their inherent linguistic ordering to further facilitate sentence generation. Extensive experiments on COCO demonstrate the effectiveness of our COS-Net.

Related Work

RNN-based Encoder-decoder Scheme. In the deep learning era, researchers demonstrated that the use of an RNN-based encoder-decoder significantly improves machine translation. Subsequently, this RNN-based encoder-decoder scheme became the de facto recipe of modern image captioning techniques. In analogy to the RNN-based sequence modeling in machine translation, the earlier attempts directly employ the basic RNN-based encoder-decoder scheme for image captioning, encoding visual content with a CNN and decoding the output description with an RNN. Next, the basic RNN-based scheme is upgraded with a visual attention mechanism that learns to dynamically pinpoint the most relevant local patches to boost word prediction at each decoding timestep. Meanwhile, a semantic attention mechanism is incorporated into the RNN-based encoder-decoder to selectively emphasize the most relevant semantic attributes for sentence generation. After that, bottom-up and top-down attention enables attention measurement at the object level, rather than the conventional visual attention over equally-sized local patches. Scene graph structure that depicts the fine-grained semantics in an image is further integrated into the RNN-based encoder-decoder, aiming to exploit the language inductive bias.

Transformer-based Encoder-decoder Scheme. Sparked by the breakthroughs of the Transformer in the NLP field, numerous modern image captioning approaches that capitalize on a Transformer-based encoder-decoder scheme have started to emerge. The central spirit of this scheme is to strengthen the visual encodings and vision-language interaction with the self-attention or cross-attention mechanism in the Transformer. For instance, the primary Transformer structure in NLP has been directly employed for the image captioning task, and the spatial relations among objects have been additionally incorporated into the Transformer-based encoder-decoder. Recently, a series of innovations have been proposed to upgrade the attention mechanism in the Transformer-style structure with an attention gate, mesh-like connections across multiple layers, high-order feature interaction, and relative geometry relationships of objects. Most recently, an auto-parsing network is designed to softly segment the inputs into a hierarchical tree, which is further imposed into the Transformer-based encoder-decoder for image captioning.

Summary. The proposed COS-Net can also be considered a Transformer-based encoder-decoder scheme that constructs most modules (e.g., visual encoder, sentence decoder, and semantic comprehender) with a Transformer-style structure. CLIP-ViL is perhaps the most related work, which directly takes the pre-trained image encoder in CLIP as the visual encoder in a Transformer-based encoder-decoder. Our COS-Net goes beyond CLIP-ViL by utilizing CLIP to seek richer semantic cues that are worthy of mention from the human-annotated sentence pool via cross-modal retrieval. Moreover, the semantic comprehender refines the primary semantic cues by filtering out irrelevant semantic words and inferring missing relevant semantic words. A subsequent semantic ranker further arranges all refined semantic words in linguistic order, and the ordered words serve as additional supervisory signals to boost image captioning.

Our Approach: COS-Net

Now we proceed to present our core proposal, i.e., Comprehending and Ordering Semantics Networks (COS-Net), that integrates both semantic comprehending and ordering processes into a unified architecture for image captioning. Figure 2 depicts the detailed architecture of COS-Net.

Inspired by Transformer-based encoders in image captioning and image recognition, we capitalize on multiple stacked Transformer blocks to encode the visual content into intermediate visual tokens. Formally, given an input image $I$, we first employ the image encoder of CLIP (backbone: ResNet-101) to extract the grid feature map $\mathcal{V}_{I}=\{{\bf{v}}_{i}\}|_{i=1}^{N_{I}}$ ($N_{I}$ grids), coupled with the global feature ${\bf{v}}_{c}$. Then, we transform both the global and grid features into a new embedding space, and further concatenate them as $\mathcal{V}^{(0)}_{I}=[{\bf{v}}^{(0)}_{c},\{{\bf{v}}^{(0)}_{i}\}|_{i=1}^{N_{I}}]$. After that, a visual encoder is employed to contextually encode all the transformed global and grid features $\mathcal{V}^{(0)}_{I}$ via self-attention, yielding the enriched visual tokens $\mathcal{V}^{(N_{v})}_{I}=[{\bf{v}}^{(N_{v})}_{c},\{{\bf{v}}^{(N_{v})}_{i}\}|_{i=1}^{N_{I}}]$. Specifically, we implement this visual encoder by stacking $N_{v}$ Transformer blocks with multi-head attention. Hence, the $i$-th Transformer block in the visual encoder operates as:
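The symbols defined next match the standard multi-head self-attention block, so the operation can be sketched in that form. This is a reconstruction under assumptions: the operator names $\mathbf{MultiHead}$, $\mathbf{Concat}$, and $\mathbf{Attention}$ and the head count $H$ are illustrative, and the exact placement of the residual connection and normalization may differ from the authors' formulation. In the head definitions, the subscript on $head_{i}$ and on the projection matrices indexes attention heads, following the weight-matrix notation used in the text.

```latex
\begin{aligned}
\mathcal{V}^{(i+1)}_{I} &= \mathcal{F}\Big(\mathbf{norm}\big(\mathcal{V}^{(i)}_{I} + \mathbf{MultiHead}(\mathcal{V}^{(i)}_{I}, \mathcal{V}^{(i)}_{I}, \mathcal{V}^{(i)}_{I})\big)\Big), \\
\mathbf{MultiHead}(Q, K, V) &= \mathbf{Concat}(head_{1}, \ldots, head_{H})\, W^{O}, \\
head_{i} &= \mathbf{Attention}(Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}), \\
\mathbf{Attention}(Q, K, V) &= \mathbf{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d}}\right) V,
\end{aligned}
```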

where $\mathcal{F}$ denotes the feed-forward layer, $\bf{norm}$ is layer normalization, $W_{i}^{Q}$ , $W_{i}^{K}$ , $W_{i}^{V}$ , $W^{O}$ are weight matrices, and $d$ is a scaling factor. Note that in order to enable the inter-layer global feature interaction, we additionally concatenate the output global features from all Transformer blocks, which are further transformed into a holistic global feature:
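One way to write this aggregation is sketched here. The symbols $\tilde{\mathbf{v}}_{c}$ and $W_{c}$ are assumed names for the holistic global feature and its projection matrix, and whether the pre-encoder feature ${\bf{v}}^{(0)}_{c}$ also enters the concatenation is not specified in the text, so only the block outputs are concatenated in this sketch.

```latex
\tilde{\mathbf{v}}_{c} = W_{c}\,\big[\mathbf{v}^{(1)}_{c}, \mathbf{v}^{(2)}_{c}, \ldots, \mathbf{v}^{(N_{v})}_{c}\big]
```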

2 Semantic Comprehending

Most existing image captioning techniques leverage a pre-trained object detector/classifier to capture the semantics in an image, which are directly fed into the sentence decoder to produce the caption. Nevertheless, the semantic perception capability of such a pre-trained detector/classifier is severely limited by its pre-defined semantic/class labels. Moreover, the separate optimization of the pre-trained detector/classifier and the sentence decoder hinders the interaction between them. This makes it difficult to adaptively tune the object detector/classifier to better emphasize the salient semantics that are worthy of mention in the output sentence. To alleviate these limitations, we propose to utilize the off-the-shelf CLIP trained on diverse and large-scale data as a powerful cross-modal retrieval model that directly accumulates more candidate semantic words that tend to be mentioned in visually similar images. Based on such primary semantic cues mined through cross-modal retrieval, a new semantic comprehender is designed to screen out irrelevant semantic words and meanwhile infer the missing relevant semantic words, pursuing a comprehensive and accurate semantic understanding.

Cross-modal Retrieval. In an effort to exploit the richer contextual semantics implied in the existing human-annotated image-sentence pairs in the training set, we capitalize on a cross-modal retrieval model (i.e., CLIP) to search for semantically relevant sentences in the training sentence pool for each input image. Technically, let ${\bf{v}}_{c}$ and ${\bf{s}}^{c}$ denote the visual and textual features extracted by the image encoder and text encoder in CLIP for the input image $I$ and each sentence $\mathcal{S}$, respectively. By taking the input image $I$ as the search query, we retrieve the top-$K$ captions $\mathcal{S}_{r}=\{\mathcal{S}_{r_{1}},\mathcal{S}_{r_{2}},...,\mathcal{S}_{r_{K}}\}$ from the training sentence pool according to the cosine similarity between $I$ and each caption $\mathcal{S}_{r_{k}}$:
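Cosine similarity between the two CLIP features takes its usual form, sketched here with the left-hand notation $\mathrm{sim}(\cdot,\cdot)$ as an assumed name:

```latex
\mathrm{sim}(I, \mathcal{S}_{r_{k}}) = \frac{\mathbf{v}_{c} \cdot \mathbf{s}^{c}_{r_{k}}}{\lVert \mathbf{v}_{c} \rVert \, \lVert \mathbf{s}^{c}_{r_{k}} \rVert},
```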

where ${\bf{s}}_{r_{k}}^{c}$ is the textual feature of caption $\mathcal{S}_{r_{k}}$. After obtaining all the $K$ retrieved captions that are semantically relevant to the input image, we decompose them into a set of $N_{r}$ semantic words $\mathcal{V}_{s}=\{{\bf{s}}_{i}\}|_{i=1}^{N_{r}}$ by removing the stop words, and these words are taken as the primary semantic cues.
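The retrieval step can be illustrated with a minimal sketch. It assumes the text features of the sentence pool are already computed; the toy 3-dimensional vectors stand in for real CLIP embeddings, and the stop-word list and the function name `retrieve_semantic_cues` are hypothetical.

```python
import math

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "is", "with", "and"}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve_semantic_cues(image_feat, pool, k):
    """Return (top-k captions, de-duplicated non-stop words in first-seen order).

    `pool` is a list of (caption, text_feature) pairs standing in for the
    training sentence pool with pre-computed text features.
    """
    ranked = sorted(pool, key=lambda item: cosine(image_feat, item[1]), reverse=True)
    top_captions = [caption for caption, _ in ranked[:k]]
    cues = []
    for caption in top_captions:
        for word in caption.lower().split():
            if word not in STOP_WORDS and word not in cues:
                cues.append(word)
    return top_captions, cues


# Toy 3-d features in place of real CLIP embeddings (assumption).
pool = [
    ("a man rides a horse", [0.9, 0.1, 0.0]),
    ("a plate of food on a table", [0.0, 0.1, 0.9]),
    ("a man on a field with a cow", [0.8, 0.3, 0.1]),
]
captions, cues = retrieve_semantic_cues([1.0, 0.2, 0.0], pool, k=2)
```

In this toy pool the two captions about a man are retrieved, so the primary cues contain both a relevant word ("cow") and an irrelevant one ("horse"), which is the situation the semantic comprehender is meant to resolve.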

where $\mathcal{V}^{(i+1)}_{s}$ denotes the output enhanced semantic tokens of the $i$-th Transformer block. Accordingly, the final output semantic tokens of the semantic comprehender, $\mathcal{V}^{(N_{s})}_{s}=[\{{\bf{o}}^{(N_{s})}_{i}\}|_{i=1}^{N_{o}},\{{\bf{s}}^{(N_{s})}_{i}\}|_{i=1}^{N_{r}}]$, are leveraged for predicting the refined and reconstructed semantic words.

where $\bf{asym}$ denotes the asymmetric loss and ${\bf{y}}_{m}$ is the ground-truth label of all missing relevant semantic words. Finally, the whole objective of semantic comprehender integrates both objectives of filtering out irrelevant semantic words and reconstructing missing relevant semantic words:
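The asymmetric loss and the combined objective can be sketched as follows. The loss follows the commonly used multi-label asymmetric form (focal-style weighting with a probability margin on negatives). The hyperparameter values, the unweighted sum of the two objectives, and the function names are assumptions for illustration, since they are not given in this text.

```python
import math


def asymmetric_loss(probs, targets, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Multi-label asymmetric loss averaged over labels.

    Positives use a focal-style term; negatives are first shifted down by
    `margin` (clipped at zero) so that easy negatives contribute nothing.
    Hyperparameter defaults are assumed values, not taken from the paper.
    """
    total = 0.0
    for p, y in zip(probs, targets):
        if y == 1:
            total += -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
        else:
            p_shift = max(p - margin, 0.0)
            total += -(p_shift ** gamma_neg) * math.log(max(1.0 - p_shift, eps))
    return total / len(probs)


def comprehender_objective(filter_loss, missing_loss):
    """Combine the filtering and reconstruction losses (unweighted sum assumed)."""
    return filter_loss + missing_loss


# Predicted probabilities for three semantic words and their labels y_m.
missing_loss = asymmetric_loss([0.9, 0.03, 0.6], [1, 0, 0])
total_loss = comprehender_objective(0.25, missing_loss)
```

The margin makes the confidently rejected word (probability 0.03) contribute zero loss, so training concentrates on the few relevant words and on hard negatives.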

3 Semantic Ordering

After obtaining the screened and enriched semantics derived from the semantic comprehender, the most typical way to generate a description is to directly feed them into an RNN/Transformer-based sentence decoder for sentence modeling. However, this approach relies too heavily on language priors, possibly producing non-existent semantic words due to the phenomenon of object hallucination. To address the issue, we additionally involve a new semantic ranker module that learns to estimate the linguistic position of each semantic word, thereby arranging all the semantic words in linguistic order as humans do. In this way, the output sequence of ordered semantic words serves as additional visually-grounded language priors to encourage the generation of both relevant and coherent descriptions.
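A minimal sketch of the ordering idea follows. It assumes the ranker produces, for each semantic word, a score over candidate positions. The sinusoidal position encoding, the soft mixing of encodings, and the names `rank_semantic_words` and `position_scores` are illustrative choices, not the authors' exact design.

```python
import math


def sinusoidal_encoding(position, dim):
    """Standard sinusoidal position encoding for one position."""
    enc = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def rank_semantic_words(words, tokens, position_scores):
    """Attach an estimated position to each word and return the words in order.

    position_scores[n][p] is a hypothetical score of word n for position p.
    Each token is upgraded by adding the score-weighted mix of position
    encodings; the argmax position is used only to display an ordering.
    """
    dim = len(tokens[0])
    num_positions = len(position_scores[0])
    encodings = [sinusoidal_encoding(p, dim) for p in range(num_positions)]
    upgraded, hard_positions = [], []
    for token, scores in zip(tokens, position_scores):
        weights = softmax(scores)
        attended = [sum(w * enc[d] for w, enc in zip(weights, encodings)) for d in range(dim)]
        upgraded.append([t + a for t, a in zip(token, attended)])
        hard_positions.append(max(range(num_positions), key=lambda p: weights[p]))
    order = sorted(range(len(words)), key=lambda n: hard_positions[n])
    return [words[n] for n in order], upgraded


words = ["cow", "man", "rides"]
tokens = [[0.0] * 4 for _ in words]
scores = [[0.1, 0.2, 3.0], [3.0, 0.2, 0.1], [0.1, 3.0, 0.2]]
ordered, upgraded = rank_semantic_words(words, tokens, scores)
```

With these toy scores the unordered set {cow, man, rides} is arranged as "man", "rides", "cow", which is the kind of sentence skeleton the decoder can then follow.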

4 Sentence Decoding

Next, we fuse the holistic textual context $h^{\prime(i)}_{t}$ and the visual context $h^{v(i)}_{t}$ with a $\bf{sigmoid}$ gate function, and the learnt hidden state $h^{(i+1)}_{t}$ is taken as the output of the $i$-th block:

Finally, the output hidden state of the last block $h^{(N_{d})}_{t}$ is utilized for predicting the next word ${w_{t+1}}$ via softmax.
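The gated fusion and the final word prediction can be sketched as follows. The specific fusion form (a gate computed from the concatenated contexts and applied to the visual context before a residual sum) is an assumption, as are the weight names `w_gate` and `w_vocab`; normalization and feed-forward layers are omitted for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def gated_fusion(h_text, h_vis, w_gate):
    """Fuse textual and visual context with an element-wise sigmoid gate.

    The gate is computed from the concatenation [h_text, h_vis]; the fused
    state here is h_text + gate * h_vis (one plausible form, assumed).
    """
    gate = [sigmoid(g) for g in matvec(w_gate, h_text + h_vis)]
    return [t + g * v for t, g, v in zip(h_text, gate, h_vis)]


def next_word_distribution(hidden, w_vocab):
    """Project the final hidden state to vocabulary logits and normalize."""
    return softmax(matvec(w_vocab, hidden))


h_text, h_vis = [1.0, -1.0], [0.5, 0.5]
w_gate = [[0.0] * 4 for _ in range(2)]  # zero weights give a gate of 0.5
fused = gated_fusion(h_text, h_vis, w_gate)
probs = next_word_distribution(fused, [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
```

The gate lets each hidden dimension decide how much visual context to admit at the current timestep, rather than always adding it in full.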

5 Overall Objective

At the training stage, the overall objective of our COS-Net integrates the proxy objective of the semantic comprehender $\mathcal{L}_{s}$ and the typical cross-entropy loss $\mathcal{L}_{XE}$ for sentence generation: $\mathcal{L}=\mathcal{L}_{s}+\mathcal{L}_{XE}$. Next, COS-Net can be further optimized with a sentence-level reward (e.g., CIDEr score).
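The two training objectives can be sketched as plain functions. The first is the sum stated above. The second is a generic self-critical policy-gradient surrogate, assuming a greedy-decoding baseline; the baseline choice and the function names are assumptions, and the reward values would come from a CIDEr scorer that is not shown here.

```python
def stage_one_loss(loss_s, loss_xe):
    """Overall objective L = L_s + L_XE used in the first training stage."""
    return loss_s + loss_xe


def self_critical_loss(sample_log_probs, sample_reward, baseline_reward):
    """Policy-gradient surrogate for one sampled caption.

    sample_log_probs: per-word log-probabilities of the sampled caption.
    sample_reward / baseline_reward: sentence-level scores (e.g., CIDEr) of
    the sampled caption and of a baseline caption (greedy decoding assumed).
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)


xe_stage = stage_one_loss(0.3, 2.1)
rl_stage = self_critical_loss([-0.5, -1.0, -0.5], sample_reward=1.2, baseline_reward=1.0)
```

Minimizing the surrogate raises the likelihood of sampled captions that score above the baseline and lowers it for those that score below.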

Experiments

Dataset. We empirically verify and analyze the effectiveness of our COS-Net on the widely adopted COCO benchmark for image captioning. The COCO dataset consists of more than 120,000 images, and each image is equipped with five human-annotated sentences. For fair comparison with most existing techniques, we strictly follow the standard dataset split (known as the Karpathy split), which leverages 5,000 images for validation, 5,000 images for testing, and the rest for training. Besides the standard Karpathy split, we adopt the robust split to conduct object hallucination analysis, which ensures that the object pairs mentioned in training, validation, and testing captions do not overlap. In the experiments, we perform minimal sentence pre-processing by converting each sentence into lower case and filtering out rare words that occur fewer than six times. The overall word vocabulary is thus built with 10,199 unique words. Moreover, to enable the learning of our semantic comprehender, we construct an additional semantic vocabulary ($N_{c}=906$) by removing all the stop words in the word vocabulary and selecting high-frequency semantic words.
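The vocabulary construction described above can be sketched as follows, assuming whitespace tokenization and a small hypothetical stop-word list; the function name `build_vocabularies` is illustrative.

```python
from collections import Counter

# Hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "is", "with", "and"}


def build_vocabularies(captions, min_count=6, semantic_size=906):
    """Build the word vocabulary and the semantic vocabulary.

    Words occurring fewer than `min_count` times are dropped from the word
    vocabulary; the semantic vocabulary keeps the most frequent non-stop words.
    """
    counts = Counter(word for caption in captions for word in caption.lower().split())
    word_vocab = {w for w, c in counts.items() if c >= min_count}
    semantic_ranked = [
        w for w, _ in counts.most_common() if w in word_vocab and w not in STOP_WORDS
    ]
    return word_vocab, semantic_ranked[:semantic_size]


toy_captions = ["A man rides a horse"] * 6 + ["a cow"] * 2
word_vocab, semantic_vocab = build_vocabularies(toy_captions, min_count=6, semantic_size=2)
```

In the toy corpus "cow" occurs only twice and is filtered out as rare, while "a" survives in the word vocabulary but is excluded from the semantic vocabulary as a stop word.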

Implementation Details. In COS-Net, the visual encoder, semantic comprehender, and sentence decoder are constructed with $N_{v}=6$, $N_{s}=3$, and $N_{d}=6$ Transformer blocks (hidden state size: 512). The image encoder in CLIP is directly employed over the input image, and each image is thus represented as a 512-dimensional global feature vector plus the 2,048-dimensional grid feature map. The whole architecture is implemented based on the X-modaler codebase. The typical two-stage training paradigm is adopted to train COS-Net. Specifically, we first optimize the whole architecture of COS-Net by integrating the cross-entropy loss with the proxy objective of the semantic comprehender for 30 epochs (batch size: 32). In this stage, we leverage the Adam optimizer with a warmup-based learning rate scheduling strategy (warmup: 20,000 iterations). For the second stage, we further optimize COS-Net with the CIDEr score via the self-critical sequence training strategy for another 50 epochs. The learning rate is set as 0.00001. At inference, the beam size in the beam search strategy is set as 3. Following the standard evaluation setup, we report the performances of COS-Net over five evaluation metrics: BLEU@N (B@1-4), METEOR (M), ROUGE (R), CIDEr (C), and SPICE (S). In addition, we use the CHAIR metric to assess the rate of object hallucination on the robust split. The CHAIR metric includes two variants: CHAIRi (CHi), which measures what fraction of mentioned objects are hallucinated, and CHAIRs (CHs), which calculates what fraction of sentences include a hallucinated object.
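The two CHAIR variants follow directly from their definitions and can be sketched as below. The sketch assumes the objects mentioned in each caption and the objects present in each image are already available as collections; how mentions are extracted and mapped to object categories is outside its scope, and the function name `chair_scores` is illustrative.

```python
def chair_scores(mentioned_objects, present_objects):
    """Compute (CHAIRi, CHAIRs) over a set of captions.

    mentioned_objects[n]: objects mentioned in the n-th generated caption.
    present_objects[n]: objects actually present in the n-th image.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_sentences = 0
    for mentioned, present in zip(mentioned_objects, present_objects):
        wrong = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(wrong)
        hallucinated_sentences += 1 if wrong else 0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = hallucinated_sentences / len(mentioned_objects)
    return chair_i, chair_s


mentioned = [["man", "horse"], ["plane"], ["dog", "cat"]]
present = [{"man", "cow"}, {"plane", "sky"}, {"dog", "cat"}]
chair_i, chair_s = chair_scores(mentioned, present)
```

In this toy example one of five mentioned objects ("horse") is hallucinated, and one of three captions contains a hallucinated object. Lower values are better for both variants.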

2 Ablation Study

In this section, we conduct an ablation study to investigate how each design in our COS-Net influences the overall performance on the COCO dataset. Table 1 details the performance comparisons among different ablated runs of our COS-Net. Note that all results here are reported without the self-critical sequence training strategy. We start from a base Transformer-based encoder-decoder structure (Base), which is a degraded version of COS-Net that solely uses the CLIP grid features as visual inputs, without exploring primary semantic cues via cross-modal retrieval, semantic comprehending, or ordering. After that, we extend the Base model by additionally exploring CLIP as a cross-modal retrieval model to mine the primary semantic cues for boosting sentence generation (Base+CR). In this way, Base+CR exhibits better performances, which verifies the merit of accumulating richer semantic words that tend to be mentioned in visually similar images through cross-modal retrieval. Next, Base+CR+FIS learns to filter out the irrelevant semantic words in the primary semantic cues, and thus leads to performance gains. Base+CR+FIS+IMS further benefits from the additional process of inferring the missing relevant semantic words. The results of these two ablated runs highlight the advantage of semantic screening and enriching in our semantic comprehender for image captioning. Finally, after integrating Base+CR+FIS+IMS with our semantic ranker that estimates the linguistic position of each semantic word derived from the semantic comprehender, Base+CR+FIS+IMS+SR (i.e., our COS-Net) achieves the best performances across most evaluation metrics. The results validate the use of the sequence of ordered semantic words as additional visually-grounded language priors to enhance sentence generation.

3 Comparisons with State-of-the-Art

Here we compare our COS-Net with a series of state-of-the-art image captioning approaches on three different splits, i.e., the standard Karpathy test split, the official test split via online evaluation, and the robust split for object hallucination analysis. Specifically, for Karpathy test split, we follow modern techniques and utilize two different training setups for evaluation. One is the default single model setup that produces sentence via a single model, and the other is ensemble model setup that ensembles multiple models with different initialized parameters.

Single Model on Karpathy Test Split. Table 2 summarizes the performance comparisons in the default single model setup. All runs are grouped into two categories: (1) the standard methods (e.g., SGAE, Up-Down, Transformer, $M^{2}$ Transformer) that utilize the pre-trained Faster R-CNN (backbone: ResNet-101) to extract visual inputs; (2) the approaches (e.g., CLIP-Res101) that take the strong CLIP grid features as visual inputs. Note that for fair comparisons with our COS-Net, we re-implement several upgraded variants of existing standard methods (e.g., Up-Down †, Transformer †, X-Transformer †) by using the same CLIP grid features as visual inputs. As shown in this table, our COS-Net consistently outperforms the state-of-the-art methods across all the evaluation metrics. In particular, under the setting of CIDEr score optimization, the CIDEr score of COS-Net reaches 141.1%, an absolute improvement of 3.9% over the best competitor X-Transformer † (CIDEr: 137.2%). This generally demonstrates the key advantage of jointly comprehending and ordering the semantics in an image to facilitate sentence generation. Compared to the methods that leverage an RNN-based structure (e.g., Up-Down and GCN-LSTM), Transformer and $M^{2}$ Transformer improve the performances by utilizing a Transformer-based scheme that strengthens vision-language interaction via cross-attention. Instead of using the pre-trained Faster R-CNN to encode visual content as in the primary Up-Down, Up-Down † utilizes the CLIP grid features to trigger bottom-up and top-down attention, leading to clear performance boosts. The results indicate the stronger semantic comprehension capability of CLIP, which is trained on diverse and large-scale data. When further upgrading the conventional Transformer with CLIP grid features, Transformer † also manages to achieve better performances. However, these upgraded runs of existing approaches solely hinge on the visual content encoding via pre-trained CLIP without any interaction between CLIP and the sentence decoder, and meanwhile ignore the inherent linguistic ordering of semantics. As an alternative, our COS-Net encourages a more comprehensive and accurate semantic understanding, and further learns to arrange the semantic words in linguistic order as humans do, thereby achieving the best performances in terms of all evaluation metrics.

Ensemble Model on Karpathy Test Split. Next, we evaluate our COS-Net with ensembles of four models, which are trained with different random seeds. As shown in Table 3, the performance trends in the ensemble model setup are similar to those in single model setup. Concretely, the ensemble version of COS-Net surpasses the current state-of-the-art standard technique (ensemble X-Transformer) by an absolute improvement of 7.7% in CIDEr score. The results again demonstrate the effectiveness of jointly screening & enriching the primary semantic cues and further ordering semantics for image captioning.
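How the four models are combined at decoding time is not detailed here. One common choice, assumed in this sketch, is to average the next-word distributions of all models at each decoding step before beam search selects candidates. The function name `ensemble_step` is hypothetical.

```python
def ensemble_step(per_model_probs):
    """Average next-word distributions from several models at one decoding step.

    Averaging probabilities is one common ensembling choice and is assumed here.
    """
    num_models = len(per_model_probs)
    vocab_size = len(per_model_probs[0])
    return [sum(p[w] for p in per_model_probs) / num_models for w in range(vocab_size)]


# Toy distributions over a 3-word vocabulary from four models.
step_probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
]
averaged = ensemble_step(step_probs)
```

Here the fourth model disagrees with the others, but the averaged distribution still favors the first word, illustrating how ensembling smooths out the errors of an individual model.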

Online Evaluation on Official Test Split. We further include more evaluations on the official test split by submitting COS-Net to the online test server. Table 4 shows the performances with regard to 5 reference captions (c5) and 40 reference captions (c40). Since most top-performing methods on the online leaderboard adopt the ensemble model setup, here we report the performances of the ensemble COS-Net for fair comparison. Similarly, COS-Net surpasses all state-of-the-art approaches across all metrics.

Hallucination Analysis on Robust Split. To better understand the impact of semantic comprehending and ordering in our COS-Net, we conduct hallucination analysis to assess the rate of object hallucination (i.e., the image relevance of the generated captions) on the robust split. Table 5 lists the performances over both typical sentence metrics and the image relevance metrics (CHs and CHi). Following the evaluation in the single model setup, we include two groups of baselines (i.e., the standard methods and their upgraded versions with CLIP grid features). Similar trends are also observed in this hallucination analysis. Specifically, by equipping the standard approaches (e.g., Att2In and Up-Down) with CLIP grid features, Att2In † and Up-Down † achieve lower CHs and CHi scores, which shows the stronger semantic understanding capability of CLIP. Moreover, our COS-Net goes beyond Transformer † by additionally mining primary semantic cues via cross-modal retrieval and further refining and ordering the semantics, leading to lower CHs and CHi scores. The results confirm that COS-Net is more robust and alleviates object hallucination.

4 Qualitative Results

To qualitatively show the effectiveness of COS-Net, we showcase several results of our COS-Net and two upgraded baselines (i.e., Transformer † and Up-Down †), coupled with the human-annotated ground-truth sentences (GT) in Figure 3. In general, all three approaches are able to produce linguistically coherent descriptions. Nevertheless, when examining the semantic relevance between visual content and generated sentence, our COS-Net outperforms the other two baselines by capturing more relevant semantic words that are worthy of mention. For instance, in the first example, both Transformer † and Up-Down † only partially mine the major semantic words (red, plane, flying, and sky), while ignoring the salient semantic word smoke. Instead, COS-Net manages to comprehend all major semantics in this image (red, plane, flying, sky, and smoke) and further arranges them in linguistic order as humans do, yielding a description that is both visually grounded and linguistically coherent.

Conclusion and Discussion

In this work, we delve into the idea of comprehending and ordering the rich semantics in an image for image captioning. To verify our claim, we present a new Transformer-style encoder-decoder structure, i.e., COS-Net, that unifies the two processes of enriched semantic comprehending and learnable semantic ordering into a single architecture. Particularly, a CLIP-based cross-modal retrieval model is initially utilized to accumulate the primary semantic cues implied in the retrieved semantically similar sentences. After that, a semantic comprehender filters out the irrelevant semantic words in the primary semantic cues and meanwhile infers the missing relevant semantic words. Subsequently, a semantic ranker learns to estimate the linguistic position of each semantic word, leading to a sequence of ordered semantic words. The ordered semantic words serve as additional supervisory signals to guide sentence generation. We validate our proposals through extensive experiments conducted on the COCO benchmark.

Broader Impact. Our COS-Net is trained to produce image descriptions based on the learnt statistics of the training dataset, and as such will reflect biases rooted in those data, which may result in negative societal impacts. More research is therefore needed to address this issue.

Acknowledgments. This work was supported by the National Key R&D Program of China under Grant No. 2020AAA0108600.