TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, Jiebo Luo

Introduction

Vision-language tasks incorporating scene text, e.g., Text-VQA and Text-Caption, pose a new challenge to vision-language models: reading and understanding scene text in image context. Extended from Visual Question Answering (VQA), Text-VQA aims to answer questions by understanding the scene text in the image-question context. Text-Caption seeks to generate an image caption that describes both the visual and scene text information in the image, as shown in Figure 1 (a). These tasks have many potential applications, including robotics, document understanding, and assisting visually-impaired people.

A typical Text-VQA/Text-Caption framework consists of 1) a feature encoder for each single modality (text word, visual object, and scene text), 2) a multi-modal fusion module, and 3) a decoding module for prediction generation. Previous studies improve the model’s performance by designing stronger network architectures. Among them, LoRRA added an OCR attention branch for scene text encoding to a VQA model. M4C proposed a transformer-based multi-modal fusion module and a multi-step multi-choice decoding module. Despite the effective network design, most previous models are optimized with a sole objective directly towards the correct answer/caption. Such a single answer/caption loss tries to predict each word in the ground-truth but is less effective in learning a joint representation among text word, visual object, and scene text. Without a good joint representation, directly optimizing for question-answering/image-captioning could be challenging. Inspired by the success of Vision-Language Pre-training (VLP) in image-text joint representation learning, we build on the effective Text-VQA/Text-Caption network designs and explore whether pre-training can further improve Text-VQA/Text-Caption.

Vision-Language Pre-training (VLP) shows its effectiveness in learning task-agnostic joint representations of image and text. The main idea is to first pre-train the model with pre-training tasks on image-caption datasets , and then fine-tune the model for a specific vision-language task . However, conventional VLP methods are designed intuitively for vision-language tasks and do not include scene text in pre-training. Therefore, previous methods fail to capture the scene text modality and its relationship with the visual and text modalities, and are thus less effective in Text-VQA/Text-Caption.

In this study, we propose Text-Aware Pre-training (TAP), which incorporates the scene text modality in pre-training to learn a joint representation of text word, visual object, and scene text. In TAP, we design text-aware pre-training tasks to better fuse scene text (including both scene text words and their visual regions detected by OCR) with the text words and visual objects. To fuse scene text with the text words, we refine the pre-training tasks in VLP to support the extra scene text input. We find it particularly important to include the detected scene text words as extra language inputs. The extra inputs anchor the scene text and language modalities and make the aligned representation learning easier. To fuse scene text with the visual objects, we build on previous studies showing that the spatial relationships between scene text and object regions are important, e.g., the relationship “left” in Figure 1 (a). Therefore, we propose a “relative (spatial) position prediction” task that learns regions’ spatial relationships by predicting their relative spatial positions in pre-training.

The extra scene text modality, together with the specially designed pre-training tasks, effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. This aligned representation learning, even pre-trained and fine-tuned on the same downstream task dataset, leads to significant improvement over the non-TAP baseline and helps the TAP model achieve the new state of the art.

To further unleash the power of TAP, we clean and generate a large-scale scene text-related image-caption dataset for pre-training. In general image-caption datasets , many image-text pairs contain either no scene text-related visual regions or no scene text-related language referring, and are thus less helpful to Text-VQA/Text-Caption. On the visual side, we run an OCR detector to filter out images with no scene text. On the language side, we include the detected OCR text tokens as the additional caption input to obtain scene text-related language descriptions. In the end, we build a large-scale dataset named OCR-CC with around $1.4$ million scene text-related image-text pairs based on the Conceptual Captioning dataset . By using this large-scale dataset for pre-training, we observe further improvement on the Text-VQA and Text-Caption tasks.

We experiment with the TAP approach on the M4C network architecture and benchmark it on the TextVQA, ST-VQA, and TextCaps datasets. With the identical network architecture and training data, TAP improves the accuracy on the TextVQA dataset from $44.50\%$ to $49.91\%$, compared with a non-TAP baseline. Our final model ranks No. 1 on multiple Text-VQA/Text-Caption challenges (according to the official leaderboards, Nov. 2020), and outperforms previous methods by large margins: TextVQA ( $+8.3\%$ in absolute accuracy), ST-VQA ( $+8.6\%$ in absolute accuracy), and TextCaps ( $+10.2$ in CIDEr score).

To the best of our knowledge, we are the first to explore pre-training for Text-VQA and Text-Caption.

By explicitly incorporating scene text with three specially designed pre-training tasks, Text-Aware Pre-training (TAP) effectively learns a better aligned representation that leads to significant performance improvement on Text-VQA/Text-Caption.

We build a large-scale dataset named OCR-CC with around $1.4$ million scene text-related image-text pairs. TAP with OCR-CC leads to the new state of the art on multiple tasks: TextVQA ( $+8.3\%$ in absolute accuracy), ST-VQA ( $+8.6\%$ in absolute accuracy), and TextCaps ( $+10.2$ in CIDEr score). We will release the dataset and the models.

Related Work

Vision-language tasks incorporating scene text. Text-VQA and Text-Caption aim at reading and understanding scene text in images for question answering and image caption generation. Various datasets are built for the Text-VQA task, e.g., the TextVQA dataset , the ST-VQA dataset , etc. TextCaps is a dataset recently proposed for the Text-Caption task.

Recent studies proposed various network architectures to improve the Text-VQA/Text-Caption performance. Among them, LoRRA approached Text-VQA by extending a VQA model Pythia with an OCR attention branch. The answer vocabulary is a combination of a static vocabulary and detected OCR tokens. Multi-modal Multi-Copy Mesh (M4C) boosted the Text-VQA performance by proposing a transformer-based multi-modal fusion module and a multi-step multi-choice decoding module that supports multi-step answer decoding. M4C’s variant M4C-Captioner set a strong baseline on TextCaps with the question text inputs removed. SA-M4C further improved M4C by encoding the spatial relationships among visual regions as the attention masks in the multi-modal transformer. Similar explorations of the spatial relationships have been made in the Text-Caption task.

Despite the effective network design, all previous studies directly optimize towards the sole objective for the Text-VQA/Text-Caption task. We contend that such a single answer/caption loss could be ineffective in aligned representation learning and thus limits the Text-VQA/Text-Caption performance. In this study, we build on the effective network designs and explore whether pre-training can further improve Text-VQA/Text-Caption.

Vision-Language Pre-training (VLP). VLP shows its effectiveness in learning task-agnostic vision-language joint representations. Most studies focused on vision-language understanding tasks, e.g., image-text retrieval , visual question answering , visual grounding , etc. Recent studies unified the pre-training framework to cover generation tasks, e.g., image captioning .

However, conventional VLP methods do not capture scene text during pre-training and are therefore less effective for Text-VQA/Text-Caption. The proposed Text-aware Pre-training (TAP) explicitly incorporates scene text to learn a better aligned representation among the three modalities: text word, visual object, and scene text.

Text-Aware Pre-training (TAP)

TAP explicitly incorporates scene text in pre-training to improve Text-VQA/Text-Caption. We first pre-train the model with the scene text-aware pre-training tasks and then fine-tune it for a specific downstream task.

In this section, we first introduce the design of scene text-aware pre-training tasks. We then present the data corpus used for TAP and our proposed OCR-CC dataset. We postpone the model details to Section 4.2.

Figure 2 overviews TAP in pre-training and fine-tuning. In pre-training, the inputs to the fusion module are embeddings of ${K}$ text words $\mathbf{w}$ , ${M}$ object regions $\mathbf{v^{obj}}$ , ${N}$ scene text regions $\mathbf{v^{ocr}}$ , and a special begin token $\mathbf{p_{0}}$ . In the text word embedding, each word in the extended text input $\mathbf{w}=\left[\mathbf{w^{q}},\mathbf{w^{obj}},\mathbf{w^{ocr}}\right]$ is encoded as a feature vector, where $\mathbf{w^{q}},\mathbf{w^{obj}},\mathbf{w^{ocr}}$ are the question text, detected object labels, and detected scene text words. In the object and scene text embedding, object and scene text regions are detected and encoded by object detectors and OCR engines.

Taking the fused feature $\mathbf{f}=\left[\mathbf{f^{w}},\mathbf{f^{obj}},\mathbf{f^{ocr}},\mathbf{f^{p}}\right]$ as inputs, TAP improves multi-modal fusion by performing text-aware pre-training tasks. The proposed pre-training tasks consist of two parts, focusing on fusing scene text $\mathbf{v^{ocr}}$ with text words $\mathbf{w}$ and visual objects $\mathbf{v^{obj}}$ , respectively.

Scene-text language pre-training tasks. To better fuse the scene text $\mathbf{v^{ocr}}$ with the text words $\mathbf{w}$ , we design two scene-text language pre-training tasks based on the masked language modeling (MLM) and image-text (contrastive) matching (ITM) tasks in VLP. For MLM on the extended text input $\mathbf{w}=\left[\mathbf{w^{q}},\mathbf{w^{obj}},\mathbf{w^{ocr}}\right]$ , we randomly mask each text token in $\mathbf{w}$ with a probability of $15\%$ . The masked words $\mathbf{w_{mask}}$ are replaced with a special MASK token $80\%$ of the time, replaced with a random word $10\%$ of the time, and left unchanged the remaining $10\%$ of the time. The MLM task takes the fused feature at the masked position $\mathbf{f_{mask}^{w}}$ as the input, and aims to recover the masked word $\mathbf{w_{mask}}$ with two fully-connected layers. For ITM, $\mathbf{w}$ is polluted $50\%$ of the time by replacing text sub-sequence $\mathbf{w^{q}}$ , $\mathbf{w^{obj}}$ , or $\mathbf{w^{ocr}}$ with a randomly-selected one from another image. The polluted text words $\mathbf{w}$ are thus not paired with the visual regions $\mathbf{v^{obj}}$ and $\mathbf{v^{ocr}}$ . The ITM task takes the sequence feature $\mathbf{f_{0}^{p}}$ as the input and aims to predict if the sequence has been polluted or not.
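The corruption steps for MLM and ITM described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys `w_q`, `w_obj`, `w_ocr`, and the `[MASK]` string are hypothetical, and only the probabilities ($15\%$, $80/10/10$, $50\%$) come from the text.

```python
import random

MASK = "[MASK]"  # placeholder string for the special MASK token (assumed spelling)


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """MLM corruption: return (corrupted tokens, labels).

    labels[i] is the original word at masked positions and None elsewhere.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this word
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: MASK token
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random word
            else:
                corrupted.append(tok)                # 10%: unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


def pollute(sample, other, rng, pollute_prob=0.5):
    """ITM corruption: return (text inputs, polluted flag).

    With probability pollute_prob, one of the sub-sequences w_q, w_obj, w_ocr
    is replaced by the corresponding sub-sequence of another image.
    """
    if rng.random() >= pollute_prob:
        return dict(sample), 0
    key = rng.choice(["w_q", "w_obj", "w_ocr"])
    polluted = dict(sample)
    polluted[key] = list(other[key])
    return polluted, 1
```

In such a setup, the ITM flag would be the binary target predicted from $\mathbf{f_{0}^{p}}$ , and the MLM labels would supervise the fused features at the masked positions.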

We find that the extra scene text word input $\mathbf{w^{ocr}}$ is critical for learning the scene-text language aligned representation. As a comparison to the extended text input $\mathbf{w}$ , pre-training with the original MLM and ITM on question text $\mathbf{w^{q}}$ leads to limited improvement over the non-pre-training baseline. The failure is due to the limited number of scene text-related words in the language input $\mathbf{w^{q}}$ . In this case, since many randomly masked words $\mathbf{w^{q}_{mask}}$ and polluted sequences are not relevant to scene text, scene text regions $\mathbf{v^{ocr}}$ are less important for solving the pre-training tasks (MLM, ITM) and are thus often overlooked. $\mathbf{w^{ocr}}$ in the extended text input $\mathbf{w}$ generates extra scene text referring in the language modality and thus makes TAP effective.

Scene-text visual pre-training tasks. Understanding the spatial relationships between the visual object $\mathbf{v^{obj}}$ and scene text $\mathbf{v^{ocr}}$ benefits Text-VQA/Text-Caption . The extra feature input of bounding box coordinates helps the spatial relationship learning , but hasn’t fully solved the problem. Recent studies hard code the coordinate features as the regions’ relationships in feature fusion and obtain further improvement. In this study, we explore spatial relationship learning by pre-training.

Specifically, we design a scene-text visual pre-training task in TAP. The main idea is to predict the relative spatial position between two randomly sampled visual regions. Therefore, we refer to the task as “relative (spatial) position prediction” (RPP). The input to the pre-training task is a randomly sampled visual object feature $\mathbf{f_{i}^{obj}}$ and scene text feature $\mathbf{f_{j}^{ocr}}$ , where $i\in\{1,\cdots,M\}$ and $j\in\{1,\cdots,N\}$ . The objective is to predict the relative spatial position between the two sampled regions $\mathbf{v_{i}^{obj}}$ and $\mathbf{v_{j}^{ocr}}$ . We start with a single relationship of whether “scene text region $\mathbf{v_{j}^{ocr}}$ is on object $\mathbf{v_{i}^{obj}}$ ,” and thus model RPP as a binary classification problem. We then extend the task to a 12-class relative position prediction problem with the classes defined by Yao et al. , including on, cover, overlap, eight-way relative orientation, and unrelated.
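As an illustration of the binary form of RPP, the label for a sampled (object, scene text) pair could be computed from bounding boxes as sketched below. The geometric rule used here, that “on” holds when the scene text box lies fully inside the object box, is an assumption made for the example; the text does not spell out the exact criterion, and the 12-class definition of Yao et al. is not reproduced.

```python
import random


def is_on(ocr_box, obj_box):
    """Assumed rule: the scene text box (x1, y1, x2, y2) lies inside the object box."""
    ox1, oy1, ox2, oy2 = ocr_box
    bx1, by1, bx2, by2 = obj_box
    return bx1 <= ox1 and by1 <= oy1 and ox2 <= bx2 and oy2 <= by2


def sample_rpp_pair(obj_boxes, ocr_boxes, rng):
    """Sample one object index i and one scene text index j, with a binary label."""
    i = rng.randrange(len(obj_boxes))
    j = rng.randrange(len(ocr_boxes))
    return i, j, int(is_on(ocr_boxes[j], obj_boxes[i]))
```

The returned label would serve as the classification target for the pair of fused features $\mathbf{f_{i}^{obj}}$ and $\mathbf{f_{j}^{ocr}}$ .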

2 Pre-training corpus

TAP works well even without extra pre-training data. We first experiment with “TAP without extra data,” where we only use the downstream Text-VQA/Text-Caption dataset for pre-training, i.e., the training set of the TextVQA , ST-VQA , or TextCaps datasets. These datasets all contain less than $30$ K images and $150$ K image-text pairs. We detail the pre-training and fine-tuning pipeline for each downstream task in Section 4.2.

We then experiment with “TAP with large-scale data.” We build a large-scale scene text-related image-caption dataset named OCR-CC based on the Conceptual Captioning (CC) dataset, and use the dataset for pre-training. Among the image-caption datasets, only the CC dataset contains a reasonable portion of images with meaningful scene text regions. Therefore, we run the Microsoft Azure OCR system (public Microsoft OCR API: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text) on all images in the CC dataset and filter out the images with no scene text, watermarks only, and tiny scene text regions only. In the end, we obtain $1.367$ million image-caption pairs with a mean and median of $11.4$ and $6$ detected scene text regions per image. As a reference, the mean and median are $23.1$ and $12$ in the TextVQA dataset, and $8.03$ and $6$ in the ST-VQA dataset. We adopt the same region feature extraction method used in the TextVQA dataset to provide object and scene text region embedding. By including scene text words $\mathbf{w^{ocr}}$ as additional text inputs, OCR-CC provides scene text-related image-caption pairs for TAP. We keep the caption text from CC in OCR-CC and use it as the question text $\mathbf{w^{q}}$ in pre-training. We show the details of dataset collection, scene text number distribution, and additional qualitative examples of OCR-CC in the supplementary material.

Experiments

We benchmark TAP for both the Text-VQA task on the TextVQA and ST-VQA datasets, and the Text-Caption task on the TextCaps dataset . We use our proposed OCR-CC dataset for large-scale pre-training.

TextVQA. The TextVQA dataset contains 28,408 images from the Open Images dataset . We follow the same training/validation/test split used in the previous work in our experiments. The methods are evaluated by the soft-voting accuracy of 10 answers.
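For reference, the soft-voting accuracy is commonly written in the simplified form $\min(\#\text{matching human answers}/3,\,1)$ . The sketch below implements only this simplified form; the official evaluation additionally normalizes answer strings and averages over annotator subsets, which is not reproduced here, and the function name is hypothetical.

```python
def soft_accuracy(prediction, human_answers):
    """Simplified soft-voting accuracy for one question with 10 human answers."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)
```

Under this form, a prediction matching two of the ten annotators receives partial credit, and a prediction matching three or more receives full credit.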

ST-VQA. The ST-VQA dataset contains 21,892 images from multiple sources including ICDAR 2013 , ICDAR 2015 , ImageNet , VizWiz , IIIT STR , Visual Genome , and COCO-Text . The methods are evaluated by both accuracy and Average Normalized Levenshtein Similarity (ANLS) .
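A minimal sketch of ANLS is given below, assuming the usual definition: for each question, the score is $1-\mathrm{NL}$ for the best-matching ground-truth answer if the normalized Levenshtein distance $\mathrm{NL}$ is below $0.5$, and $0$ otherwise, averaged over questions. Lower-casing and whitespace stripping are assumptions of this example and may differ from the official evaluation.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def anls(predictions, references, threshold=0.5):
    """predictions: list of strings; references: list of lists of strings."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)
```

Unlike exact-match accuracy, this metric gives partial credit to answers that differ from the ground truth by small OCR or spelling errors.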

TextCaps. The TextCaps dataset augments the 28,408 images in TextVQA with 145,329 captions. The captions are evaluated by the caption metrics (BLEU , METEOR , ROUGE_L , SPICE , and CIDEr ).

OCR-CC. Our OCR-CC dataset contains $1.367$ million scene text-related image-caption pairs from the Conceptual Captioning (CC) dataset . More details of OCR-CC are in the supplementary material.

2 Experiment settings

Network architecture. We conduct experiments based on the M4C network architecture. We extend the text input $\mathbf{w^{q}}$ with the object labels $\mathbf{w^{obj}}$ and scene text words $\mathbf{w^{ocr}}$ . We keep all remaining settings the same as in the original M4C, including the feature embedding, network architecture, training parameters, and layer initialization.

M4C’s text encoder is a three-layer trainable transformer initialized from the first three layers of BERT ${}_{\text{BASE}}$ . A pre-trained Faster R-CNN detects objects and represents the detected region with its visual and coordinate features. The final layer (fc7) of the detector is fine-tuned. An offline OCR detector detects scene text regions and represents the region with its visual, coordinates, FastText , and Pyramidal Histogram of Characters (PHOC) features. The fusion module in M4C is a four-layer multi-modal transformer that has the same hyper-parameters as BERT ${}_{\text{BASE}}$ . The fusion module is initialized from scratch. A multi-step decoding module then takes fused features $\mathbf{f^{ocr}},\mathbf{f^{p}}$ as inputs, and word-by-word predicts the final answer. The predicted answer word at each decoding step $T$ is selected either from a fixed frequent word vocabulary or from the dynamic OCR tokens. The word classification loss is applied to each decoding step.
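The choice between the fixed vocabulary and the dynamic OCR tokens at a single decoding step can be illustrated as below. The sketch assumes that both score sets are concatenated and the highest-scoring entry is taken, which is one plausible reading of the description; the function and argument names are hypothetical.

```python
def decode_step(vocab_scores, ocr_scores, vocab, ocr_tokens):
    """Pick one answer word from the fixed vocabulary or the image's OCR tokens.

    vocab_scores[k] scores vocab[k]; ocr_scores[n] scores ocr_tokens[n].
    Returns (source, word) where source is "vocab" or "ocr".
    """
    scores = list(vocab_scores) + list(ocr_scores)
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < len(vocab):
        return "vocab", vocab[best]
    return "ocr", ocr_tokens[best - len(vocab)]
```

Repeating such a step until an end token or the maximum decoding length is reached would yield the multi-word answer.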

Adapting to Text-VQA. By taking the fused feature $\mathbf{f}$ as input, we pre-train the feature encoder and fusion module with the pre-training tasks (MLM, ITM, RPP). MLM is only computed on the sequences that have not been polluted by ITM. The pre-trained model with the highest pre-training task accuracy is used to initialize the feature encoder and fusion module. In fine-tuning, the model step-by-step predicts the answer with an extra decoding module, and is trained with the answer classification loss in each step.

Adapting to Text-Caption. We keep the framework architecture the same for Text-Caption as for Text-VQA, except increasing the maximum answer decoding length from $12$ words to $30$ words . $\mathbf{w^{q}}$ is left blank in both pre-training and fine-tuning. The input text sequence $\mathbf{w}$ consists of $\mathbf{w^{ocr}}$ , $\mathbf{w^{obj}}$ , and the blank $\mathbf{w^{q}}$ . During fine-tuning, the framework is trained with the same multi-step word classification loss as used in Text-VQA.

Compared methods. We compare TAP with other state-of-the-art methods and systematically study the following baselines and variants of our method.

TAP (Ours). We first experiment with “TAP without extra pre-training data.” We use the same downstream task dataset for both pre-training and fine-tuning, and follow the same training parameters as used in M4C. For the Text-VQA task, we pre-train the model for $24$ K iterations with the pre-training tasks (MLM, ITM, RPP) and then fine-tune it with the answer loss for another $24$ K iterations. The numbers of pre-training and fine-tuning iterations are both $12$ K for the Text-Caption task following M4C-Captioner .

M4C†. “M4C†” is the non-TAP baseline. Based on M4C, we include the detected object labels $\mathbf{w^{obj}}$ and scene text tokens $\mathbf{w^{ocr}}$ as the additional text input following “TAP.” We train the model for $48$ K iterations with the answer loss to match TAP’s total iteration number. Compared with “TAP,” the only difference is that “M4C†” trains the first $24$ K iterations with the answer loss, instead of the pre-training tasks.

TAP†† (Ours). “TAP††” reports our best performance achieved with extra pre-training data (TextVQA, ST-VQA, TextCaps, OCR-CC) and other minor modifications. We pre-train “TAP††” for $480$ K iterations. Section 4.4 details the benefits of each extra data source.

3 Text-VQA/Text-Caption results

TextVQA. Table 1 reports the accuracy on the TextVQA dataset . The top part of the table shows the results in the constrained setting that only uses TextVQA for training and Rosetta for OCR detection. The bottom compares our best performance with the state of the art in the unconstrained setting.

We list the adopted OCR detector in the “OCR system” column. LoRRA and M4C adopted the Rosetta OCR system. SA-M4C and SMA experiment with both Rosetta and other OCR systems (Google-OCR, SBD-Trans OCR). In this study, we experiment with Rosetta and the Microsoft Azure OCR system (Microsoft-OCR). We use Microsoft-OCR to detect the single OCR words appearing in the image, i.e., each detected scene text region contains only a single word. The “Extra data” column shows the training data used in addition to the TextVQA dataset. Previous methods adopt the ST-VQA dataset for joint training. Beyond ST-VQA, TAP enables the use of weak data with no ground-truth answer in pre-training, e.g., TextCaps and OCR-CC. “TAP††” reports the final performance with all extra datasets.

Three major observations can be made from Table 1: 1) “TAP” significantly outperforms the non-TAP baseline “M4C†” with the identical training data and network architecture, in both the constrained setting (top part of Table 1) and the unconstrained setting (bottom part). In the constrained setting, TAP improves the non-TAP baseline accuracy from $39.55\%$ to $44.06\%$ . In the unconstrained setting, “TAP” with Microsoft-OCR obtains $5.4\%$ and $5.3\%$ absolute accuracy improvement over the corresponding non-TAP baselines “M4C†” and “M4C† +STVQA,” respectively. The improvement achieved with the same network and training data validates the effectiveness of our pre-training approach for Text-VQA/Text-Caption. 2) “TAP” outperforms the previous state of the art by large margins, even without large-scale pre-training. 3) Large-scale pre-training with the OCR-CC dataset further improves the accuracy. “TAP††” adopts OCR-CC in pre-training and improves the accuracy from $49.91\%$ to $54.71\%$ . The improvement shows that TAP benefits from extra training data, and indicates the effectiveness of our proposed OCR-CC.

ST-VQA. Table 2 shows the Text-VQA accuracy on the ST-VQA dataset in the unconstrained setting. “TAP” uses Microsoft-OCR and is pre-trained and fine-tuned on the training set of ST-VQA. “TAP††” uses TextVQA, ST-VQA, TextCaps, and OCR-CC in pre-training. Similar conclusions as in Table 1 can be drawn from Table 2. First, “TAP” outperforms the state of the art by large margins, and significantly improves the non-TAP baseline “M4C†.” Second, large-scale pre-training further improves the accuracy by $+5.5\%$ , as shown in the bottom two rows.

TextCaps. Table 3 shows the CIDEr score on the TextCaps dataset . We report only the CIDEr score in the table and present the full table with other metrics in the supplementary material. We draw similar observations that with the same training data, “TAP” improves the CIDEr score of “M4C†” from $99.89$ to $105.05$ . Large-scale pre-training “TAP††” further improves the CIDEr score to $109.16$ .

4 Ablation studies

Pre-training tasks. We experiment with different pre-training tasks (MLM, ITM, RPP) as well as their variants. We conduct ablation studies on TextVQA with Microsoft-OCR and no extra data. We examine the effectiveness of scene-text language pre-training (MLM, ITM) and scene-text visual pre-training (RPP). We verify the importance of the extra scene-text token input $\mathbf{w^{ocr}}$ in MLM and ITM.

As shown in Table 4, the scene-text language pre-training in row $(d)$ and scene-text visual pre-training in row $(e)$ improve the non-TAP baseline (row $(b)$ ) from $44.50\%$ to $49.01\%$ and $46.42\%$ , respectively. “TAP” performs all pre-training tasks and further improves the accuracy to $49.91\%$ .

The extra scene text token input $\mathbf{w^{ocr}}$ is essential for TAP. Rows $(a$ - $d)$ in Table 4 show that neither extra $\mathbf{w^{ocr}}$ inputs (cf. rows $(a,b)$ ) nor pre-training (cf. rows $(b,c)$ ) alone leads to an improvement over the non-TAP baseline (row $(b)$ ). In contrast, TAP with the extra $\mathbf{w^{ocr}}$ input (row $(d)$ ) boosts the accuracy to $49.01\%$ . The bottom rows $(e,f)$ show the effectiveness of RPP. RPP with a single spatial relationship “on” improves the accuracy from $44.50\%$ to $46.42\%$ (cf. rows $(b,e)$ ). Combining RPP with MLM and ITM improves the accuracy from $49.01\%$ to $49.91\%$ (cf. rows $(d,f)$ ). Extending spatial relationship classes to $12$ leads to an improvement from $49.91\%$ to $50.17\%$ .

Pre-training with extra data. Table 5 breaks down the benefits of adopting different sources of extra data. We conduct experiments on the TextVQA dataset with Microsoft-OCR. In addition to the Text-VQA datasets, TAP enables the use of weak data with no answer annotations in the pre-training stage, such as TextCaps and OCR-CC. Compared with “TAP” with no extra data, pre-training with ST-VQA and TextCaps improves the accuracy from $49.91\%$ to $50.57\%$ and $51.86\%$ (cf. rows $(a,b)$ , rows $(b,c)$ ). The large-scale pre-training with OCR-CC (row $(d)$ ) achieves the accuracy of $52.10\%$ . Including all data during pre-training (row $(e)$ ) further improves the accuracy to $52.90\%$ .

Furthermore, we find that the extra data benefits the use of large models. The original architecture consists of a $3$ -layer text-only transformer and a $4$ -layer multi-modal transformer. We experiment with a $12$ -layer multi-modal transformer with the same structure as BERT ${}_{\text{BASE}}$ . We initialize the model from BERT ${}_{\text{BASE}}$ and remove the separate text transformer. We represent the two architectures as $(3,4)$ and $(0,12)$ in Table 5, where the numbers indicate the text and multi-modal transformer layer numbers. With extra transformer layers, the accuracy without extra data drops from $49.91\%$ to $48.78\%$ (row $(a)$ ), while the accuracy with extra data increases from $52.90\%$ to $54.71\%$ (row $(e)$ ).

5 How does TAP help?

In this section, we analyze how TAP helps Text-VQA/Text-Caption. We empirically show that with TAP, certain attention heads in the multi-modal transformer ground the scene text $\mathbf{v^{ocr}}$ to the semantically corresponded text word $\mathbf{w}$ or visual object $\mathbf{v^{obj}}$ . By learning such latent alignments, TAP improves the aligned representation learning and thus helps Text-VQA/Text-Caption.

Recent VLP analyses show that VLP learns the latent alignments between the semantically corresponded region-word or region-region pairs. Specifically, certain attention heads in the transformer generate higher attention scores between such corresponded pairs. The attention scores between corresponded pairs are also referred to as coreference scores . Similarly, we analyze the change in the coreference score of scene text-related pairs to better understand TAP.

There exist $(4$ layers $\times 12$ heads $)=48$ attention scores between any two positions in our multi-modal transformer. Following VALUE , we define the coreference score as the maximum attention score among all $48$ heads between two semantically corresponded positions. A text word and a scene text region are corresponded if they refer to the same scene text token, e.g., the text word and scene text region “coors” in Figure 3. We collect all corresponded pairs between the extended text input $\mathbf{w}$ and scene text regions $\mathbf{v^{ocr}}$ in the TextVQA dataset, and report the averaged score over all pairs. A scene text $\mathbf{v^{ocr}}$ and a visual object $\mathbf{v^{obj}}$ are corresponded if they share the spatial relationship “on.”
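The coreference score defined above can be sketched as follows, assuming the attention weights are available as a nested list indexed as `attn[layer][head][source][target]`. This layout and the function names are assumptions of the example; in the setting above there would be $4$ layers with $12$ heads each.

```python
def coreference_score(attn, src, dst):
    """Maximum attention score from position src to position dst over all heads.

    attn[layer][head] is a square matrix of attention weights.
    """
    return max(head[src][dst] for layer in attn for head in layer)


def mean_coreference(items):
    """Average the coreference score over (attn, src, dst) corresponded pairs."""
    scores = [coreference_score(attn, src, dst) for attn, src, dst in items]
    return sum(scores) / len(scores)
```

Here `src` and `dst` would index, for example, a text word in $\mathbf{w}$ and the scene text region in $\mathbf{v^{ocr}}$ that refer to the same token.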

As shown in Table 6, we analyze TAP by comparing the change in the coreference score before and after TAP, i.e., “M4C†” and “TAP.” The first two rows show that TAP improves the scene-text language coreference scores by seven times. The bottom two rows show that TAP increases the scene-text visual coreference scores by two times. These increases validate that TAP successfully learns the latent alignment and thus improves joint representation learning.

Furthermore, Figure 3 visualizes the attention score between a text word and all visual regions. Qualitatively, we observe a higher coreference score with TAP (bottom row) than the non-TAP baseline (top row). For example, in Figure 3 (a), TAP grounds the text word “must” and “survive” to the corresponded scene text regions.

6 Qualitative results

Figure 4 shows representative failure cases of the non-TAP baseline “M4C†” that can be corrected by “TAP.” These cases show that TAP improves Text-VQA/Text-Caption by learning better aligned representations.

TAP shows a good performance on challenging questions that require paraphrasing the scene text sentences. For example, in Figure 4 (a), the model answers “who must survive” by the scene text “yaam must survive” in the image. The attention in Figure 3 further visualizes the latent region-word alignments.

TAP also performs better on questions that refer to a scene text via an intermediate object. For example, in Figure 4 (b), the model grounds the object region “the jacket on the man pointing” and generates the correct answer “ryman” with the scene text “ryman football league” on the man’s jacket.

Figure 4 (c) shows an example that TAP correctly understands the relative spatial relationship in question.

Furthermore, TAP helps the model read a large piece of text. For example, in Figure 4 (d), the model correctly answers the question “who edited the book” by finding the editors’ names “jeff vandermeer & mark roberts.” We note that each word is detected as a separate scene text region, e.g., “jeff,” “&,” etc., which makes the answer sequence prediction non-trivial.

The bottom row of Figure 4 shows examples of multiple questions on the same image. For example, (e,f) and (g,h) show that the model selects the correct scene text regions as the answer based on the input questions. More qualitative results are included in the supplementary material.

Conclusion

We have presented Text-Aware Pre-training (TAP) that explicitly incorporates scene text in pre-training and effectively learns a better aligned multi-modality representation for Text-VQA/Text-Caption. With the identical framework and training data, TAP boosts the non-TAP baselines by $+5.4\%$ in absolute accuracy on the TextVQA challenge. Furthermore, we build a large-scale dataset named OCR-CC and further improve the TAP performance. TAP outperforms the state-of-the-art methods by large margins. Analyses show that TAP helps the aligned representation learning among text word, visual object, and scene text.

Acknowledgment

Zhengyuan Yang and Jiebo Luo were supported in part by NSF awards IIS-1704337, IIS-1722847, and IIS-1813709.

Appendix A The OCR-CC Dataset

In this section, we introduce the details of building the OCR-CC dataset based on the Conceptual Captioning (CC) dataset. First, we run the Microsoft Azure OCR system on all CC images (around $3.1$ million). Then, we discard the images that don’t have scene text (around half of the CC images) or have watermark “text” only (around $5\%$ of the CC images). This watermark “text” records the source image website/provider and is thus not related to the image content. Figure 5 (c) shows examples of the discarded images, which either have no detected scene text or have watermark “text” only. In the end, we select $1,367,170$ images from CC as the images in our OCR-CC dataset. We pair each selected image with a caption $\mathbf{w}$ for pre-training. The caption text $\mathbf{w}$ is the concatenation of the original image caption $\mathbf{w^{q}}$ in CC, the detected object labels $\mathbf{w^{obj}}$ , and the detected scene text words $\mathbf{w^{ocr}}$ . Figures 5 (a,b) visualize the distribution of the scene text number in CC and our OCR-CC, respectively. Similar to the distribution on TextVQA and ST-VQA, the majority of images contain $3$ - $10$ detected scene text regions, while a small portion of images have a large number of scene text regions. Figure 5 (d) shows some representative selected images.
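The filtering and pairing steps can be sketched as below. This is only an illustration: the record layout, the `is_watermark` predicate, and the `min_area` threshold for tiny regions are all hypothetical, since the text does not specify how watermarks or tiny regions are identified.

```python
def build_ocr_cc(records, is_watermark, min_area):
    """Keep images with meaningful scene text and build their text inputs.

    records: list of dicts with keys
      "caption"    - original CC caption (string)
      "obj_labels" - detected object labels (list of strings)
      "ocr"        - list of (word, (x1, y1, x2, y2)) OCR detections
    is_watermark: hypothetical predicate on a word
    min_area: hypothetical minimum box area for a non-tiny region
    """
    kept = []
    for rec in records:
        meaningful = [
            word for word, (x1, y1, x2, y2) in rec["ocr"]
            if not is_watermark(word) and (x2 - x1) * (y2 - y1) >= min_area
        ]
        if not meaningful:
            continue  # no scene text, watermark only, or tiny regions only
        kept.append({
            "w_q": rec["caption"].split(),           # caption used as question text
            "w_obj": list(rec["obj_labels"]),        # detected object labels
            "w_ocr": [word for word, _ in rec["ocr"]],  # detected scene text words
        })
    return kept
```

Each kept entry corresponds to one caption $\mathbf{w}=\left[\mathbf{w^{q}},\mathbf{w^{obj}},\mathbf{w^{ocr}}\right]$ paired with its image.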

Appendix B TextCaps Results

Tables 7 and 8 present the full results on TextCaps, supplementing the abridged results in the main paper’s Table 3. We draw conclusions from Tables 7 and 8 similar to those in the main paper. Specifically, “TAP” significantly improves the non-TAP baseline “M4C†” in all metrics with the identical network architecture and training data. Our TAP approach also outperforms the previous state of the art by large margins.

Furthermore, we compare TAP with the oracle numbers, shown in gray text at the bottom of Tables 7 and 8. “TAP” outperforms “M4C (GT OCR),” which uses ground-truth scene text detection in training and inference. Meanwhile, there still exists a gap between “TAP” and human performance. We expect future studies focusing on captioning to further reduce the gap, e.g., with better decoding step pre-training designed especially for captioning.

Appendix C Hyper-parameters

We summarize the hyper-parameters used in the “TAP” and “TAP††” experiments. We conduct experiments based on M4C and follow most of its hyper-parameter selections, as shown in Table 9. The changed parameters are highlighted in bold in the table.

First, the max length of the extended text input $\mathbf{w}=\left[\mathbf{w^{q}},\mathbf{w^{obj}},\mathbf{w^{ocr}}\right]$ is set to $20+100+100=220$ .
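As a small sketch of this length budget, the snippet below builds a fixed-length extended text input from the three segments. Only the per-segment maximum lengths ($20$, $100$, $100$) come from the text; truncating and padding each segment separately, the pad token, and the function names are assumptions made for illustration.

```python
from typing import List

# Per-segment maximum lengths stated in the text.
MAX_LEN_Q, MAX_LEN_OBJ, MAX_LEN_OCR = 20, 100, 100
PAD_TOKEN = "<pad>"  # hypothetical padding token


def pad_or_truncate(tokens: List[str], max_len: int) -> List[str]:
    """Cut a segment to max_len tokens, then pad it up to exactly max_len."""
    clipped = tokens[:max_len]
    return clipped + [PAD_TOKEN] * (max_len - len(clipped))


def build_extended_text(w_q: List[str], w_obj: List[str], w_ocr: List[str]) -> List[str]:
    """Return w = [w_q, w_obj, w_ocr] with a fixed total length of 220."""
    return (
        pad_or_truncate(w_q, MAX_LEN_Q)
        + pad_or_truncate(w_obj, MAX_LEN_OBJ)
        + pad_or_truncate(w_ocr, MAX_LEN_OCR)
    )


# Toy input: a short question, few object labels, and too many OCR words.
w = build_extended_text(
    ["what", "brand", "is", "this"],
    ["bag", "person"],
    ["word%d" % i for i in range(150)],
)
```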

Second, we increase the max length of scene text $\mathbf{v^{ocr}}$ from $50$ to $100$ when experimenting with Microsoft-OCR, in order to cover more detected scene text. Compared with Rosetta, Microsoft-OCR detects more scene text regions in each image. For example, in the TextVQA dataset, the mean and median numbers of scene text regions are $12.8$ and $8$ with Rosetta, and $23.1$ and $12$ with Microsoft-OCR. With Rosetta, $3.5\%$ of images contain more than $50$ detected scene text regions, while the percentage is $14.3\%$ with Microsoft-OCR.
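The kind of statistic that motivates this choice can be computed as sketched below. The per-image region counts are made-up toy numbers rather than values from TextVQA, and the function name is hypothetical; the sketch only shows how the mean, the median, and the share of images exceeding a given max length relate to the truncation limit.

```python
import statistics
from typing import List


def fraction_exceeding(region_counts: List[int], max_len: int) -> float:
    """Share of images whose detected scene text regions exceed max_len."""
    return sum(1 for c in region_counts if c > max_len) / len(region_counts)


# Toy per-image counts of detected scene text regions (illustrative only).
toy_counts = [10, 60, 120, 5]

mean_count = statistics.mean(toy_counts)
median_count = statistics.median(toy_counts)
truncated_at_50 = fraction_exceeding(toy_counts, 50)
truncated_at_100 = fraction_exceeding(toy_counts, 100)
```

Raising the limit from $50$ to $100$ lowers the share of images whose scene text would be truncated, at the cost of a longer input sequence.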

In the “pre-training without extra data” experiment (“TAP”), we follow the same learning rate step and maximum iteration settings as used in fine-tuning. In pre-training with OCR-CC (“TAP††”), we pre-train the model for a maximum of $480$K iterations and scale the learning rate steps linearly.
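Linear scaling of the learning rate steps can be illustrated with a short sketch. The base schedule used here (decay steps at 14K and 19K iterations out of 24K) is an assumption chosen for illustration and is not stated in this section; only the $480$K maximum iteration count comes from the text.

```python
from typing import List


def scale_lr_steps(base_steps: List[int], base_max_iter: int, new_max_iter: int) -> List[int]:
    """Scale each learning rate decay step by new_max_iter / base_max_iter."""
    return [step * new_max_iter // base_max_iter for step in base_steps]


# Assumed base schedule (hypothetical values, not taken from this section).
BASE_LR_STEPS = [14000, 19000]
BASE_MAX_ITER = 24000

# Maximum iteration count for pre-training with OCR-CC, as stated in the text.
PRETRAIN_MAX_ITER = 480000

scaled_steps = scale_lr_steps(BASE_LR_STEPS, BASE_MAX_ITER, PRETRAIN_MAX_ITER)
```

Under these assumed base values, each decay step is multiplied by $480\text{K}/24\text{K}=20$, so the decay points keep the same relative position in the longer schedule.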

Appendix D Pre-train + Fine-tune vs. Joint-train

Results in the main paper’s Section 4.3 show that TAP works well even without extra data. We hypothesize that we can view TAP as a multi-task learning framework, and obtain similar improvement by using the pre-training tasks (MLM, ITM, RPP) as the auxiliary training loss. Therefore, we explore an alternative training pipeline named “joint train,” where the pre-training tasks are used as the auxiliary losses together with the main answer/caption loss. Because MLM and ITM tasks require “polluting” the input sequence, we randomly select $50\%$ of the samples in a batch to compute the pre-training loss and keep the remaining $50\%$ unchanged for the answer/caption loss.
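The batch split used in “joint train” can be sketched as below. The function name, the use of a seeded random generator, and the rounding for odd batch sizes are assumptions; the text only specifies that a random $50\%$ of the samples in a batch receive the pre-training losses while the rest are kept unchanged for the answer/caption loss.

```python
import random
from typing import List, Tuple


def split_joint_train_batch(
    batch_size: int, rng: random.Random, pretrain_fraction: float = 0.5
) -> Tuple[List[int], List[int]]:
    """Randomly assign sample indices to the pre-training or the main loss."""
    indices = list(range(batch_size))
    rng.shuffle(indices)
    n_pretrain = int(batch_size * pretrain_fraction)
    # These samples get masked/"polluted" inputs for MLM, ITM, and RPP.
    pretrain_idx = sorted(indices[:n_pretrain])
    # These samples keep their original inputs for the answer/caption loss.
    main_idx = sorted(indices[n_pretrain:])
    return pretrain_idx, main_idx


pretrain_idx, main_idx = split_joint_train_batch(8, random.Random(0))
```

Each sample contributes to exactly one of the two loss groups, so the answer/caption loss never sees an input that was altered for MLM or ITM.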

Experiments show that these two training pipelines achieve similar performance, i.e., $49.91\%$ for “pre-train + fine-tune” and $49.46\%$ for “joint train” on TextVQA. Both methods significantly outperform the non-TAP baseline ($44.50\%$). For “joint train,” we train the framework for $120$K iterations. Compared with “joint train,” one advantage of the “pre-train + fine-tune” pipeline in the main paper is that extra weak data with no answer/caption annotations can be used more easily.

The effectiveness of different TAP pipelines implies the potential of improving other multi-modal tasks by incorporating pre-training tasks. Specifically, the pre-training tasks can be used either in the “joint train” approach to best preserve the main task’s training pipeline, or in the “pre-train + fine-tune” approach to benefit from large-scale weak pre-training data.

Appendix E Qualitative Results

In this section, we present additional qualitative examples. Figure 6 shows the failure cases that can be corrected by OCR detection. Figure 7 presents the failure cases of our method. “TAP” occasionally fails on samples that require complex reasoning (Figures 7 (a,b)) or have incorrect scene text detection (Figures 7 (c,d)). For example, in Figure 7 (a), TAP selects the scene text “cutfittep” on the black bag as the answer, instead of the correct scene text “aldo” on the referred white bag.