PreSTU: Pre-Training for Scene-Text Understanding

Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

Introduction

Understanding the role of text as it appears in the context of a visual scene is important in various real-world applications, e.g., from automatically organizing images of receipts, to assisting visually-impaired users in overcoming challenges related to comprehension of non-Braille writing in their surroundings, to enabling autonomous robots to make safe decisions in environments designed for humans. As a result, scene-text understanding (STU) has received increased attention in vision-and-language (V&L) understanding tasks, such as visual question answering (VQA) or image captioning . Please see Figure 1 for an illustration.

We identify two distinct capabilities that models targeting STU must address: (i) recognizing text in a visual scene and (ii) connecting the text to its context in the scene. Previous solutions that target STU tasks often delegate scene-text recognition to off-the-shelf OCR (Optical Character Recognition) systems and model the visual context using pre-computed object-detection features. These two streams of information (noisy OCR strings and visual features on detected objects) are used as input into a V&L model. While achieving decent results, these methods heavily rely on the quality of the upstream OCR system and lack a direct connection between the text being recognized and a high-fidelity representation of its context.

More concretely, previous methods have not fully explored pre-training objectives that specifically target STU. In general, V&L pre-training objectives (e.g., masked language modeling, image-text matching, etc.) have proven effective and have become the go-to approach in V&L research. However, these objectives typically do not require a model to understand the role of text embedded in a visual context. For instance, LaTr ignores the visual context during pre-training and instead focuses on modeling the co-occurrence statistics of layout-aware text-only OCR tokens. Even in systems that do perform STU pre-training, such as TAP, the models are built upon the aforementioned pipeline. Specifically, TAP represents the visual input by a set of object features detected and extracted by FRCNN. As a result, it may lose some visual context that cannot be captured by objectness (e.g., activities) but is relevant to understanding the role of recognized text.

In this paper, we address such a challenge by incorporating an OCR-aware learning objective in the context of a high-fidelity representation of the image context. We adopt a Transformer-based encoder-decoder V&L architecture, using a T5 backbone. The model takes both image and text inputs. For the former, we extract fine-tunable visual features directly from image pixels using a ViT encoder, rather than adopting frozen visual features from pre-detected objects . For the latter, we concatenate task-specific text tokens (e.g., task prompts) with tokens extracted from an off-the-shelf OCR system, in a manner that allows the model to interpret (via the prompt) the OCR tokens in the context of the image.

Building upon this model, we propose PreSTU, a novel recipe for Pre-training for Scene-Text Understanding (Figure 2). PreSTU consists of two main steps. First, it teaches the model to recognize scene text from image pixels (this makes our model more robust to the quality of OCR systems) and at the same time connect scene text to the visual context. Specifically, given an image and the “part” of the scene texts in the image, the model is pre-trained to predict the “rest” of the scene texts. We call this step splitocr. Second, it teaches the model to further strengthen the connection between scene text and visual context by pre-training with OCR-aware downstream tasks (e.g., vqa and cap). For pre-training, we leverage large-scale image-text resources, with the (noisy) scene text extracted by the off-the-shelf OCR system (Google Cloud OCR, https://cloud.google.com/vision/docs/ocr).

We validate PreSTU on eight VQA (ST-VQA , TextVQA , VizWiz-VQA , VQAv2 , OCR-VQA , DocVQA , ChartQA , AI2D ) and four image captioning (TextCaps , VizWiz-Captions , WidgetCap , Screen2Words ) benchmarks. Our OCR-aware objectives splitocr, vqa, and cap are significantly beneficial. For instance, compared with strong baselines which take OCR signals as input, we observe more than 10% absolute gain on TextVQA and 42 CIDEr point gains on TextCaps (Figure 1). Finally, we conduct comprehensive experiments to understand which factors contribute to effective STU pre-training. In summary, our contributions are as follows:

We propose PreSTU, a simple and effective pre-training recipe with OCR-aware objectives designed for scene-text understanding (§2).

We show that our objectives consistently lead to improved scene-text understanding on twelve diverse downstream VQA / image captioning tasks (§3.1) and even on cases when OCR signals are absent during downstream tasks (§3.2).

We perform detailed analyses to understand the effect of our design choices on STU performance (§3.2).

PreSTU: Pre-Training for Scene-Text Understanding

Figure 2 provides an overview of PreSTU OCR-aware objectives and their input-output format. In what follows, we first describe our starting point: model architecture and OCR signals (§2.1). Then, we describe our recipe for pre-training (§2.2), including the objectives, splitocr, vqa, and cap (§2.2.1), and data sources (§2.2.2). Finally, we describe the fine-tuning stage and target benchmarks (§2.3).

V&L model architecture. Our main architecture is illustrated in Figure 3. We start from an encoder-decoder V&L architecture which unifies image-to-text (e.g., image captioning) and image+text-to-text (e.g., VQA) tasks. The pre-trained vision encoder is ViT-B/16 , and the pre-trained language encoder-decoder is mT5-Base . Specifically, ViT is a transformer-based encoder that takes a sequence of image patches as input, pre-trained on an image classification task. mT5 is a multilingual variant of text-to-text transformers T5 , pre-trained on a massive multilingual text corpus with the span corruption objective. See more details in the supplementary material.

As mentioned in LaTr, this starting point leads to modeling advantages over existing model architectures for STU tasks. First, we believe that understanding the role of OCR text in the visual context is much easier from image pixels, making ViT a natural choice. Second, mT5 uses a wordpiece vocabulary to encode and decode text tokens; thus a certain level of robustness to the noise in the input OCR texts comes with it by default. On the other hand, M4C and TAP resort to a more complicated solution of using fastText and Pyramidal Histogram of Characters features. Third, mT5 is an encoder-decoder model, which enables it to generate open-ended text. This is suitable for general image captioning and scene-text VQA, where the answers tend to be out-of-vocabulary. In contrast, most prior works treat VQA as classification over a fixed answer vocabulary. Lastly, our model is built upon well-developed vanilla unimodal building blocks in vision and NLP. We deliberately choose this general encoder-decoder architecture to push for the applicability of our objectives. Such a design choice allows us to develop less model-dependent pre-training objectives.

Image resolution. Unless stated otherwise, we use the image resolution of 640x640 in all of our experiments.

OCR signals. We obtain OCR signals from Google Cloud OCR for all pre-training and downstream datasets in our experiments. They come in the form of a set of texts and their corresponding box coordinates in the image (i.e., object detection-like). We order OCR texts based on their locations, top-left to bottom-right, and concatenate them with the T5 separator. This allows models to implicitly learn the scene text’s spatial information and standardizes the target output sequence during training. Unless stated otherwise, we use these sorted silver OCR texts in all of our experiments.
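The ordering step can be illustrated with a minimal sketch. The `OcrBox` type, the row bucketing via `row_tolerance`, and the `SEPARATOR` string are assumptions made for illustration only; the text above specifies just that OCR texts are sorted top-left to bottom-right and joined with the T5 separator.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrBox:
    """One OCR detection: recognized text plus the top-left corner of its box."""
    text: str
    x_min: float  # left edge, in pixels
    y_min: float  # top edge, in pixels


# Hypothetical placeholder; the actual T5 separator token is not reproduced here.
SEPARATOR = " | "


def order_ocr(boxes: List[OcrBox], row_tolerance: float = 10.0) -> List[str]:
    """Return OCR texts in reading order (top-left to bottom-right).

    Boxes are bucketed into coarse rows of height `row_tolerance` (an
    assumption) so that small vertical jitter does not reorder a text line.
    """
    ordered = sorted(boxes, key=lambda b: (int(b.y_min // row_tolerance), b.x_min))
    return [b.text for b in ordered]


def serialize_ocr(boxes: List[OcrBox]) -> str:
    """Concatenate the ordered OCR texts into a single string."""
    return SEPARATOR.join(order_ocr(boxes))


boxes = [OcrBox("world", 120, 12), OcrBox("hello", 10, 10), OcrBox("below", 5, 200)]
serialized = serialize_ocr(boxes)
```

A fixed ordering of this kind gives every image a single canonical OCR sequence, which is what lets the same sequence serve as both input and prediction target.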

2.2 Pre-Training Stage

We consider two sets of OCR-aware pre-training objectives for scene-text understanding.

Task-agnostic objective: SplitOCR. Inspired by the impressive performance of the visual language modeling pre-training objective for image+text-to-text downstream tasks , we propose an OCR-aware pre-training objective called splitocr. This objective is designed to be downstream task-agnostic, focusing on teaching the two core capabilities for STU: recognizing scene text and connecting it to the visual context.

We randomly split the OCR texts into two parts and use the first part as additional input and the second part as a target. Recall that we have ordered the OCR texts based on their locations such that the model can recognize them in a consistent manner. Note that if the splitting point is right at the beginning of the OCR sequence, the model performs a simplified version of the traditional Optical Character Recognition task (i.e., predicting the whole OCR tokens). We denote this by ocr in Table 6 and also compare it with splitocr in our ablation studies.
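As a rough sketch of the splitting step, the function below cuts an already-ordered OCR sequence at a random point. Sampling the split point uniformly and always leaving at least one token in the target are assumptions; the text only states that the split is random and that a split at the very beginning recovers the ocr variant.

```python
import random
from typing import List, Optional, Tuple


def split_ocr(
    ocr_tokens: List[str], rng: Optional[random.Random] = None
) -> Tuple[List[str], List[str]]:
    """Split ordered OCR tokens into (input part, target part).

    The split index k is drawn uniformly from [0, len - 1] (an assumption),
    so the target is never empty. k == 0 gives an empty input part, which
    corresponds to predicting the whole OCR sequence (the `ocr` variant).
    """
    if not ocr_tokens:
        return [], []
    rng = rng or random.Random()
    k = rng.randrange(len(ocr_tokens))
    return ocr_tokens[:k], ocr_tokens[k:]


def ocr_only(ocr_tokens: List[str]) -> Tuple[List[str], List[str]]:
    """The `ocr` variant: split point fixed at the beginning of the sequence."""
    return [], list(ocr_tokens)


tokens = ["STOP", "Main", "St", "25", "mph"]
first_part, second_part = split_ocr(tokens, random.Random(0))
```

Because the tokens are in reading order, the input part is always a prefix of the sequence and the target is the remaining suffix.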

Why splitocr? splitocr equips the model with the abilities to recognize scene text and connect it to the visual context in a unified, seamless manner. Specifically, operating splitocr upon the “first part” of OCR tokens and the image pixels (not pre-extracted global or object detection features) and predicting the “second part” of OCR tokens requires the model to (i) identify which scene text in the image still needs to be recognized, inherently connecting the input scene text to its visual context; (ii) perform the OCR task, inherently acquiring the scene-text recognition skill.

Task-specific objectives: VQA and CAP. We propose OCR-aware downstream-task-specific pre-training objectives on top of splitocr. We consider two objectives based on our downstream tasks: (i) VQA which predicts the target answer from the question prompt, the visual question, and OCR texts and (ii) CAP which predicts the target caption from the caption prompt and OCR texts. This is similar to previous approaches to STU, except that we encode the image pixels, not features from pre-detected regions.

Why VQA or CAP? Task-specific objectives aim to achieve two goals. First, they further encourage the learning of the relationship between scene text and its visual context through direct interaction between input image pixels and input OCR texts. Second, they ease the knowledge transfer from pre-training to fine-tuning since task-specific objectives share the same input format as that of the downstream tasks (§2.3). See Figure 2 for more details.
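To make the text-side input-output format concrete, here is a minimal sketch of how VQA and CAP examples could be assembled. The prompt strings, the separator, and the dictionary keys are hypothetical; the image pixels are fed to the vision encoder separately and are not represented here.

```python
from typing import Dict, List

# Hypothetical prompt strings and separator, chosen for illustration only.
VQA_PROMPT = "answer the question:"
CAP_PROMPT = "describe the image:"
OCR_SEPARATOR = " | "


def _join(*parts: str) -> str:
    """Join the non-empty parts with single spaces."""
    return " ".join(p for p in parts if p)


def vqa_example(question: str, ocr_texts: List[str], answer: str) -> Dict[str, str]:
    """VQA: question prompt + visual question + OCR texts -> answer."""
    ocr = OCR_SEPARATOR.join(ocr_texts)
    return {"inputs": _join(VQA_PROMPT, question, ocr), "targets": answer}


def cap_example(ocr_texts: List[str], caption: str) -> Dict[str, str]:
    """CAP: caption prompt + OCR texts -> caption."""
    ocr = OCR_SEPARATOR.join(ocr_texts)
    return {"inputs": _join(CAP_PROMPT, ocr), "targets": caption}


vqa = vqa_example("what brand is this?", ["coca", "cola"], "coca cola")
cap = cap_example([], "a red can on a table")
```

Passing an empty OCR list, as in the `cap` example, mimics the setting in which OCR signals are withheld from the text input.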

2.2.2 Pre-Training Data

Our main pre-training data is CC15M, the union of two popular image-text datasets: Conceptual Captions (CC3M) and Conceptual 12M (CC12M). Due to expired URLs, only 13M $\langle image,caption\rangle$ pairs are used in our experiments. CC3M consists of 3.3M $\langle image,caption\rangle$ pairs, obtained by processing raw alt-text descriptions from the Web. CC12M extends CC3M by relaxing its over-restrictive filtering pipeline. We use CC15M for splitocr and cap pre-training. Note that the captions of CC15M are not used for splitocr and their images are not necessarily scene text-related. See more details in the supplementary material.

Since CC15M does not have data in the form of visual questions and their answers for us to leverage, we resort to ST-VQA . It is a scene-text VQA dataset whose images are collected from 6 diverse data sources (COCO-Text , Visual Genome , VizWiz , ICDAR , ImageNet , IIIT-STR ). We use its training set for pre-training. We use ST-VQA as pre-training data for other VQA benchmarks as well as a downstream benchmark for testing splitocr (§2.3).

2.3 Fine-tuning Stage

In all of our downstream scene-text V&L tasks, the input-output pairs follow the same format as either VQA or CAP (with OCR text tokens as input). The only difference from the task-specific pre-training is the training data.

We validate PreSTU on twelve datasets related to VQA and image captioning tasks. ST-VQA, TextVQA, and TextCaps are the main benchmarks for STU. We also consider other scene-text domains, including book (OCR-VQA), document (DocVQA), illustration (ChartQA), diagram (AI2D), and screenshot domains (WidgetCap and Screen2Words). VizWiz-VQA and VizWiz-Captions are for the blind and heavily involve STU. VQAv2 is a general VQA dataset. See complete details in the supplementary material.

2.4 Discussion

We compare PreSTU with two well-known prior STU works TAP and LaTr . In terms of modeling, TAP leverages two conventional V&L objectives: visual-region masked language modeling and image-text matching, as well as the objective of learning the relative spatial position of two OCR text detections. TAP models the image using object-based features , which we believe is a suboptimal visual context. Besides, TAP adopts vocab-based classification, less suitable for some STU tasks which are full of out-of-vocab words. LaTr overcomes those weaknesses by adopting a similar V&L architecture to ours (ViT-B/16 / T5large). However, its pre-training objective does not involve the visual component (ViT). Instead, it only pre-trains its language component to learn the co-occurrence statistics of layout-aware OCR tokens. As the visual component is distorted or absent during pre-training, these models do not inherently learn the two essential STU capabilities, and would likely suffer in a case when OCR signals are absent during downstream tasks. In contrast, PreSTU fully embraces the visual component. As shown in §3.2, this brings a huge benefit especially when OCR signals are not available. See a more detailed comparison in §3.1.4.

In terms of pre-training data, TAP aggregates scene-text dedicated downstream data, including ST-VQA, TextVQA, TextCaps, and OCR-CC. Thus, while it aligns well with the corresponding downstream tasks, it is less generalizable to other V&L tasks. In contrast, PreSTU adopts general pre-training data (i.e., CC15M), providing a more flexible interface for V&L tasks. Besides, LaTr argues that pre-training on document images is a better choice since acquiring large quantities of natural images with scene text for pre-training is challenging and hard to scale, and the amount of text is often sparse. Our work challenges this assumption and shows that one can pre-train effectively for STU on natural images with minimal preprocessing (i.e., nothing beyond extracting OCR signals).

Finally, in terms of evaluation as we will show next, our experiments are done on a much wider range of benchmarks than before. This is in stark contrast to existing works which often focus on three benchmarks at most.

Experimental Results

Baselines. We denote by NoPreSTU our main baseline. It is the same pre-trained V&L model as PreSTU (i.e., ViT-B/16 / mT5) but not pre-trained with any of our pre-training objectives.

Metrics. For VQA tasks, we use standard VQA accuracy following . It is the average score over all subsets of nine of the ten ground-truth answers, where each score is $\min\left(\frac{\#\,\text{answer occurrences}}{3}, 1\right)$. For ST-VQA/DocVQA, we use Average Normalized Levenshtein Similarity (ANLS), softly penalizing the model’s mistakes on scene-text recognition. For ChartQA, we report its official metric, a relaxed accuracy that allows a minor inaccuracy for numeric answers. For image captioning tasks, we use their standard evaluation metrics, including BLEU, METEOR, ROUGE-L, SPICE, and CIDEr.
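The two VQA metrics can be sketched as follows. This is a simplified illustration, not the official evaluation code: answer normalization is reduced to lowercasing and stripping whitespace, and the ANLS threshold of 0.5 is taken from the usual ST-VQA definition rather than from the text above. `anls_score` gives the per-question score; ANLS is its mean over all questions.

```python
from itertools import combinations
from typing import List


def vqa_accuracy(prediction: str, gt_answers: List[str]) -> float:
    """Mean of min(#matches / 3, 1) over all leave-one-out subsets of the answers."""
    if not gt_answers:
        raise ValueError("need at least one ground-truth answer")
    pred = prediction.strip().lower()
    answers = [a.strip().lower() for a in gt_answers]
    subsets = list(combinations(answers, len(answers) - 1))
    scores = [min(sum(a == pred for a in s) / 3.0, 1.0) for s in subsets]
    return sum(scores) / len(scores)


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def anls_score(prediction: str, gt_answers: List[str], threshold: float = 0.5) -> float:
    """Best 1 - NL over the answers, where NL is the normalized edit distance.

    Scores with NL >= threshold are set to 0 (threshold value is an assumption).
    """
    pred = prediction.strip().lower()
    best = 0.0
    for gt in gt_answers:
        ref = gt.strip().lower()
        longest = max(len(pred), len(ref))
        nl = levenshtein(pred, ref) / longest if longest else 0.0
        if nl < threshold:
            best = max(best, 1.0 - nl)
    return best


answers = ["coke"] * 2 + ["pepsi"] * 8
acc_two_matches = vqa_accuracy("coke", answers)       # 2 of 10 annotators agree
soft = anls_score("coca cola", ["coca-cola"])         # one character off
```

With ten annotators, a prediction matched by two of them scores 0.6 and one matched by four or more scores 1.0, while ANLS gives partial credit (here 8/9) for a near-miss spelling that exact-match accuracy would score as 0.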

The main goal of our experiments is to assess the utility of our pre-training objectives splitocr and vqa/cap in VQA (§3.1.1) and image captioning (§3.1.2) tasks.

Table 1 summarizes our main results on VQA tasks, including ST-VQA, TextVQA, VizWiz-VQA, and VQAv2. splitocr outperforms the baseline (i.e., without our STU pre-training) by a large margin on scene-text-heavy VQA tasks, more than +8.8 ANLS on ST-VQA, +10.4% on TextVQA, and +4.1% on VizWiz-VQA. With splitocr→vqa, we slightly but significantly improve the performance further on TextVQA and VizWiz-VQA, +1.1% and +0.7%, respectively. These results show the utility and applicability of our pre-training objectives for improving scene-text understanding.

splitocr and vqa are complementary on scene-text-heavy VQA tasks (TextVQA/VizWiz-VQA), where each of them alone underperforms splitocr→vqa. Additionally, we observe the first-stage pre-training via splitocr is more beneficial than the second-stage task-specific pre-training vqa. This could be due to the superiority of splitocr or the lack of large-scale scene-text VQA pre-training data, or both. We identify data development for scene-text VQA as an open research question.

Our results also highlight the importance of STU in general real-world VQA (i.e., not specially designed for STU). We observe a slight but significant improvement over the baseline on VQAv2 and a more significant improvement on VizWiz-VQA for blind people. We attribute this to a subset of questions that require text recognition and reasoning skills . We believe this is an important step since these questions are considered “hard to learn” or even “outliers” that work against VQA algorithms .

3.1.2 Image Captioning

Table 2 summarizes our main results on image captioning tasks, TextCaps and VizWiz-Captions. Aligned with the VQA results, splitocr significantly improves over the baseline across all evaluation metrics, with splitocr→cap performing best. The gain is notably 42.2 CIDEr points on TextCaps, and 18.4 on VizWiz-Captions. Overall, we highlight the usefulness of splitocr across V&L tasks with different input-output formats.

Similar to the VQA results, splitocr and cap are complementary. However, cap alone is more beneficial than splitocr alone. We attribute this to our large-scale web-based image-text data that is already suitable for cap pre-training. Despite such a strong cap model, splitocr still provides an additional benefit.

3.1.3 Applicability to Other Scene-Text Domains

Unlike prior STU literature , we further explore other scene-text domains (Table 3). We show that PreSTU is also effective on book (OCR-VQA), document (DocVQA), illustration (ChartQA), diagram (AI2D), and screenshot domains (WidgetCap & Screen2Words). This demonstrates the applicability of PreSTU to many different real-world STU problems.

3.1.4 Comparison to Prior Works

So far our results provide strong evidence for the benefit of our proposed objectives. In this section, we provide a comparison to prior works as further context. While apples-to-apples comparison has become increasingly difficult, we make our best attempt to analyze our results in the context of these works. For example, TAP’s objective has coupled the use of object detection signals, which we do not resort to. More importantly, many prior works do not release code, rely on private data, and/or require too large-scale pre-training that is prohibitively costly to reproduce.

We first compare PreSTU to recent works focusing on STU tasks (Rows Non-TAP to LaTr in Table 4). Overall, PreSTU establishes strong results on all tasks. Concretely, PreSTU achieves better results than all prior smaller-scale works (i.e., TAP, TAG, LOGOS). More interestingly, with much less data, we even outperform two larger models ConCap/UniTNT (139.1 vs. 105.6/109.4 in CIDEr) on TextCaps and (56.3% vs. 55.4%) on TextVQA.

PreSTU, however, performs worse than another larger model LaTr on TextVQA/ST-VQA. We attribute this to the superiority of LaTr’s V&L backbones. As shown in Table 5, LaTrbase with no pre-training significantly outperforms our baseline (NoPreSTU) on TextVQA (52.3% vs. 45.2%). LaTr and PreSTU use different scene-text pre-training data: LaTr uses five times larger data than PreSTU (64M vs. 13M in Table 4), which covers more diverse scene text. This is particularly beneficial to TextVQA/ST-VQA, which contain scene text from multiple domains (e.g., brand, vehicle, etc.) and may explain why LaTr outperforms PreSTU.

In contrast, OCR-VQA only covers book-related scene text. Thus, pre-training data becomes less important than pre-training approaches, and PreSTU outperforms LaTr (72.2% vs. 67.5% in Table 5). Moreover, while LaTr only shows its effectiveness on VQA tasks, PreSTU shows on both VQA and image captioning tasks.

We further compare PreSTU to extremely large-scale V&L models pre-trained on more than 2B $\langle image,text\rangle$ pairs. Interestingly, our best model even outperforms two much larger models Flamingo and GIT2 on some tasks; using much less data, we achieve better results than Flamingo (56.3% vs. 54.1%, Table 4) on TextVQA and than GIT2 (72.2% vs. 69.9%, Table 5) on OCR-VQA.

Recently, PaLI , a large-scale V&L model (ViT-e/mT5-XXL) pre-trained on 10B $\langle image,text\rangle$ pairs, reports SOTA results on all major V&L tasks, except for VizWiz-Captions (Table 4). It is worth noting that PreSTU (specifically, our ocr) was an ingredient in the pre-training objective of PaLI to tackle OCR and STU tasks, demonstrating ocr’s utility in large-scale SOTA models.

The closest to PreSTU in terms of model/data sizes is GITL, a smaller-scale version of GIT2 (347M parameters and 20M $\langle image,text\rangle$ pairs). As shown in Table 5, PreSTU outperforms (or is on par with) GITL on all tasks, demonstrating efficiency with respect to model/data sizes. See more comparisons in the supplementary material.

3.2 Analysis

We aim to understand PreSTU in detail. We show (a) the importance of different components of our design choice, (b) its zero-shot transferability, (c) the effect of pre-training image resolution, (d) the effect of pre-training data size, and (e) the effect of downstream OCR quality.

Detailed ablation. As shown in Figure 2, our PreSTU consists of two (optional) pre-training stages, followed by fine-tuning on downstream tasks. Here, we aim to understand the gain brought by each component. We consider different combinations of the design choices at each stage and organize the results stage-by-stage into Table 6. We have the following three major observations.

First, splitocr is significantly and consistently better than ocr (Rows with splitocr vs. Rows with ocr in their Stage-1). ocr is a “pure” OCR prediction task, a variant of our main splitocr (OCR-conditioned OCR prediction) in which the splitting point is always at the beginning. At first glance, such a result may seem counterintuitive: predicting the entire scene text is strictly harder than predicting part of the OCR text given the other part. When thought of carefully, this result indicates that ocr may put too much emphasis on recognizing scene text, at the expense of connecting scene text to its visual context. In other words, this highlights how splitocr is able to balance the two capabilities that we identify as important for STU (§1).

Second, splitocr (or ocr) makes the visual component (ViT) inherently better at recognizing text (gap between “Yes” and “No” Rows with Stage-1 pre-training vs. gap between “Yes” and “No” Rows without Stage-1 pre-training). Without Stage-1 (e.g., vqa/cap), removing OCR signals during fine-tuning leads to more than a 33% drop on TextVQA and a 49 CIDEr point drop on TextCaps. With Stage-1, these drops become less than 17% and 26 CIDEr points, respectively. For TextCaps, splitocr with “No” OCR input tokens during fine-tuning even outperforms the baseline with OCR input (116.6 vs. 100.0 in CIDEr). In summary, recognizing scene text via Stage-1 pre-training is important (i.e., cannot be achieved via vqa or cap alone).

Third, having two sources of OCR signals is beneficial. OCR signals by pre-trained ViT (Row splitocr→vqa/cap with “No”) and OCR signals by the off-the-shelf system (Row NoPreSTU with “Yes”) are complementary; we achieve the best result when leveraging both OCR signal sources (Row splitocr→vqa/cap with “Yes”). See more ablation studies in the supplementary material.

Zero-shot transferability on scene-text VQA. Table 7 shows zero-shot transferability of splitocr on TextVQA. We observe that performing splitocr and then fine-tuning on ST-VQA (splitocr→vqa) already leads to a strong model; splitocr→vqa without fine-tuning (44.3%) is competitive with NoPreSTU fine-tuned on the TextVQA training set (45.2%), while ST-VQA alone (vqa) only achieves 35.7%. This suggests that splitocr enables generalization for STU and may remove the need to collect TextVQA data entirely!

Effect of image resolutions during pre-training. We hypothesize that pre-training with high-resolution images is important for scene-text recognition; Table 8 supports this argument. Further, pre-training with the 224x224 image resolution (standard resolution for many vision tasks) almost does not help; it achieves the accuracy of 47.1%, close to 45.2% of NoPreSTU baseline (Table 6 Row 2), suggesting non-standard resolution must be considered to reap the benefit of STU pre-training.
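One way to see why resolution matters is to count how many patches the ViT-B/16 encoder receives at each setting. The sketch below assumes non-overlapping 16x16 patches (the "/16" in ViT-B/16) and ignores any extra tokens such as a class token; the helper name is hypothetical.

```python
def num_patches(image_size: int, patch_size: int = 16) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2


low_res = num_patches(224)   # 14 x 14 grid of patches
high_res = num_patches(640)  # 40 x 40 grid of patches
ratio = high_res / low_res
```

Under these assumptions a 640x640 image yields 1600 patches versus 196 at 224x224, roughly eight times as many visual tokens covering the same scene, so small scene text occupies far more pixels per patch-level feature.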

Effect of pre-training data scale. How much data do we need to learn to recognize text? Table 9 shows the performance of TextVQA given checkpoints pre-trained on 1%, 3%, 10%, and 30% subsets of CC15M. We find that the TextVQA performance goes up as more pre-training data is included. This highlights the importance of data scale in acquiring transferable scene-text recognition skills.

Effect of downstream OCR systems. We study the effect of different OCR systems during fine-tuning (Table 10). We observe that the splitocr-pre-trained model is more robust to the change in downstream OCR systems than NoPreSTU. Indeed, splitocr + Rosetta can even perform better than NoPreSTU + gOCR. This result is consistent with Table 6, where we experiment with removing OCR texts entirely during fine-tuning. We also find that gOCR is the most effective. Interestingly, it is even better than human-annotated TextOCR; we hypothesize this is because TextOCR only provides word-level annotation whereas gOCR provides some grouping.

Related Work

Scene-Text Understanding. Most early STU works have merely focused on Optical Character Recognition (OCR). We instead focus on scene-text understanding (STU) in the context of V&L tasks: VQA and image captioning . The most common approach for these STU tasks is to fuse pre-extracted object detection features with off-the-shelf OCR signals as additional input . These works often focus on specific challenges in downstream STU tasks, including dealing with noisy OCR signals, enabling the generation of rare words, or incorporating geometric information of OCR texts. In contrast, our work focuses on pre-training general-purpose STU models and shows the effectiveness of our objectives on multiple downstream STU tasks (§3.1).

V&L Pre-Training for STU. One line of works incorporates OCR signals explicitly for pre-training . TAP proposes an objective to learn the relative spatial position of two OCR texts. LOGOS localizes a region that is most related to a given task and relies on its OCR text to complete the task. LaTr models the co-occurrence statistics of layout-aware OCR tokens. Our pre-training objectives, on the other hand, focus on learning both scene-text recognition and the role of scene-text in its visual context.

The other line of works is OCR-free. Recently, extremely large image-text models have shown promising results on STU tasks, despite having no explicit STU objectives (e.g., GIT2 , Flamingo ). However, it would require an analysis of their private data and a prohibitive amount of resources to pinpoint what contributes to such strong results. Our study offers a complementary perspective to this OCR-free approach by pushing the limit of the OCR-heavy approach further than before and conducting more thorough experiments at a smaller scale.

Conclusion

We introduce a simple recipe for scene-text understanding, consisting of OCR-aware pre-training objectives operating from image pixels. Our task-agnostic objective splitocr teaches the model to recognize scene text and to connect scene text to its visual context. Our task-specific objectives vqa and cap further strengthen that connection. We conduct comprehensive experiments to demonstrate the utility of this recipe.

Acknowledgments. We would like to thank Bo Pang, Xiao Wang, Kenton Lee, and Tania Bedrax-Weiss for their thoughtful feedback and discussions. J. Kil and W. Chao are supported in part by grants from the National Science Foundation (IIS-2107077, OAC-2118240, and OAC-2112606) and Cisco Systems, Inc.

References

Appendices

In this supplementary material, we provide details omitted in the main text.

Appendix A: V&L model implementation details (cf. §2.1 of the main text).

Appendix B: Pre-training & Scene-text V&L datasets (cf. §2.2.2 & §2.3 of the main text).

Appendix C: More comparisons to prior works (cf. §3.1.4 of the main text).

Appendix D: More ablation studies (cf. §3.2 of the main text).

Appendix A V&L model implementation details

Our model is an encoder-decoder V&L architecture consisting of ViT-B/16 as a visual module and mT5-Base as a language module. For the vision module, we adopt a transformer-based vision model ViT pre-trained on JFT-3B dataset , the extension of JFT-300M , with 3 billion images collected from the web. Our language module is initialized from mT5-Base , a multilingual variant of T5 , pre-trained on a new Common Crawl-based dataset with 101 different languages.

During training, all parameters in vision and language blocks are updated simultaneously. We choose Adafactor as an optimizer with $\beta_{1}$ = 0 and second-moment exponential decay = 0.8. For a learning rate, we schedule a linear warmup for 1K steps with inverse square-root decay. Our V&L architecture is implemented in Jax/Flax based on the open-source T5X framework.

We have done extensive hyperparameter tuning for our experiments. For instance, we find that the best hyperparameter configuration for splitocr pre-training is: initial (peak) learning rate 1e-3, batch size 256, image resolution 640x640, length of input/target text tokens 40/26, and dropout 0.1. For TextVQA, we achieve the best result with initial learning rate 2e-4 and length of input/target text tokens 72/8 (see Table 11 for more details).
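The learning-rate schedule described above (linear warmup for 1K steps followed by inverse square-root decay) can be sketched in a few lines. The exact functional form is an assumption: here the rate rises linearly to the peak at the end of warmup and then decays proportionally to 1/sqrt(step), which is one common way to realize such a schedule and not necessarily the one used in the T5X configuration.

```python
import math


def learning_rate(step: int, peak_lr: float = 1e-3, warmup_steps: int = 1000) -> float:
    """Linear warmup to `peak_lr`, then inverse square-root decay (assumed form)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)


# Defaults follow the splitocr pre-training setting (peak 1e-3, 1K warmup steps).
schedule = [learning_rate(s) for s in (0, 500, 1000, 4000)]
```

With these defaults the rate is 0 at step 0, 5e-4 halfway through warmup, 1e-3 at step 1000, and back to 5e-4 at step 4000, since quadrupling the step count halves the rate.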

Appendix B Pre-training & Scene-text V&L datasets

We provide more details about pre-training and scene-text V&L datasets used in our experiments.

Scene-Text on CC15M. We estimate the portion of scene text on CC15M with a study on 300 randomly sampled images. We manually checked each image and found that 59% (177/300) have scene text, while only 13% (38/300) are watermark-only images. This aligns with TAP’s report on CC3M (scene-text: 42%, watermark-only: 5%). Note that TAP mentioned “only the CC dataset contains a reasonable portion of images with meaningful scene text regions”, suggesting CC15M is suitable for STU pre-training.

ST-VQA is a scene-text VQA dataset. Its images are collected from various resources: COCO-Text, Visual Genome, VizWiz, ICDAR, ImageNet, and IIIT-STR. Since there is no official validation set, we follow the split provided by M4C, resulting in 23K/26K training/validation VQA examples.

TextVQA is a scene-text VQA dataset. It is built on a subset of Open Images, with scene-text-related QA pairs from human annotators and ten ground-truth answers per question. It has 34K/5K training/validation VQA examples from 21K/3K images.

VizWiz-VQA . The dataset contains 20K/3K training/validation VQA examples collected from blind users. Due to the nature of the questions asked by blind people, we identify this benchmark as a candidate to benefit from scene-text understanding, even though it was not directly designed for scene-text VQA.

VQAv2 . We further evaluate PreSTU on a standard VQA benchmark to check whether scene-text recognition can also help on general VQA tasks. Following , we use the VQAv2 train/dev splits of *train2014/minival2014, which are 592K/65K VQA examples in total.

TextCaps is a scene-text image captioning dataset. It uses the same subset of Open Images as TextVQA. Each image has five ground-truth captions, totaling 100K/15K training/validation captions.

VizWiz-Captions . Like VizWiz-VQA, this benchmark was generated by blind users to solve their daily visual challenges. It contains 23.4K/7.7K training/validation images, where each image is paired with five captions. In total, there are 117K/38K training/validation image captions.

OCR-VQA is an OCR-based VQA dataset about images of book covers. Concretely, it requires models to answer visual questions by reading/interpreting the text on the book covers (e.g., author, title). In summary, OCR-VQA provides 207K images of book covers and more than 1 million VQA examples.

DocVQA asks about the textual (handwritten, typewritten, printed) content of document images. In contrast with general VQA, models must also understand additional visual cues, including layout (e.g., tables), style (e.g., font, color), and non-textual elements (e.g., tick boxes). In total, DocVQA contains 50K VQA examples with more than 12K document images.

ChartQA is a VQA benchmark based on charts. Specifically, it covers more than 23K VQA examples from 17K charts. In ChartQA, models are required to perform complex reasoning (e.g., logical and arithmetic operations) to understand charts and the corresponding questions.

AI2D is a VQA dataset of illustrative diagrams. The task of AI2D is to answer diagram-related questions by analyzing the diagram structure and identifying its visual entities and their semantic relationships. AI2D provides 5K diagrams with 15K VQA examples in total.

WidgetCap aims to generate language descriptions for UI elements (widgets) in mobile interfaces. Mobile apps often lack widget captions in their interfaces, which has recently become a primary issue for mobile accessibility. WidgetCap attempts to address this challenge by providing an evaluation benchmark containing more than 162K language phrases (i.e., captions) for 61K UI elements.

Screen2Words is an image captioning task to generate a short summary of the mobile screen. To complete the task, models should have the capability of understanding the screen and conveying its content and functionalities in a concise language phrase. Screen2Words consists of 112K captions for 22K mobile screens in total.

Appendix C More comparisons to prior works

Comparison to TAP. While PreSTU adopts a general pre-training dataset (i.e., CC15M), TAP’s pre-training data aggregates scene-text dedicated downstream data, including ST-VQA, TextVQA, TextCaps, and OCR-CC. Thus, even if the size of TAP’s pre-training data (1.5M) is smaller, it may align better with the downstream tasks. However, since TAP’s approach focuses on the specific downstream tasks, it is less applicable to other V&L tasks, whereas PreSTU provides a more flexible interface.

Moreover, TAP adopts closed-set prediction by training an answer classifier based on the dataset-specific vocabulary. This may benefit the accuracy of the corresponding downstream task. In contrast, PreSTU chooses open-ended prediction as it is more generalizable in practice and is adopted by many recent works (e.g., PaLI, GIT).
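The difference between the two prediction styles can be illustrated with a toy sketch. Everything below (function names, vocabularies, and scores) is made up for illustration and does not reproduce TAP's classifier or PreSTU's decoder: a closed-set predictor can only return an entry from a fixed answer list, whereas an open-ended predictor composes the answer from tokens and can therefore produce strings outside that list.

```python
def argmax_index(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def closed_set_predict(scores, answer_vocab):
    """Pick the highest-scoring entry from a fixed, dataset-specific answer list."""
    return answer_vocab[argmax_index(scores)]


def open_ended_predict(step_scores, token_vocab, eos="</s>"):
    """Greedy decoding: emit the best token at each step until the end marker."""
    pieces = []
    for scores in step_scores:
        token = token_vocab[argmax_index(scores)]
        if token == eos:
            break
        pieces.append(token)
    return "".join(pieces)


# Toy inputs (hypothetical): the closed answer list does not contain "lexus".
ANSWER_VOCAB = ["cooper", "5"]
TOKEN_VOCAB = ["le", "xus", "co", "oper", "</s>"]

closed_answer = closed_set_predict([0.6, 0.4], ANSWER_VOCAB)
open_answer = open_ended_predict(
    [
        [0.9, 0.0, 0.1, 0.0, 0.0],  # step 1 favors "le"
        [0.0, 0.8, 0.0, 0.2, 0.0],  # step 2 favors "xus"
        [0.0, 0.0, 0.0, 0.0, 1.0],  # step 3 favors the end marker
    ],
    TOKEN_VOCAB,
)

print(closed_answer, open_answer)
```

In this toy case the closed-set predictor is restricted to its answer list, while the open-ended one assembles an answer that the list does not contain.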

Full Comparison. Table 12 shows full comparisons to prior works on all splits of benchmarks. Concretely, we report results on the test (validation) set for ST-VQA, the test-std (validation) for TextVQA/TextCaps, and the test-std (test-dev) set for VizWiz-VQA, VQAv2, and VizWiz-Captions. Aligned with the results in the main text, splitocr outperforms NoPreSTU on all evaluation metrics. In addition, splitocr→vqa/cap further boosts the performance, highlighting the importance of task-specific objectives (vqa and cap) during pre-training.

Appendix D More ablation studies

splitocr vs. cap. Table 1 of the main text shows the effectiveness of splitocr against vqa on VQA tasks. We further check its benefit over cap on VQA tasks. As shown in Table 13, splitocr consistently improves over cap (e.g., 53.2% vs. 49.3%) on TextVQA, further supporting that splitocr is important for higher accuracy.

We also investigate the effect of the order of pre-training stages. Concretely, we switch the order between splitocr and cap and demonstrate that applying splitocr first (i.e., default setting) is better (Table 14).

Order of OCR. PreSTU uses a fixed OCR order to standardize the target output sequence during pre-training. Compared to a random order, the fixed order yields consistent improvements (e.g., random vs. fixed: 132.4 vs. 134.6 CIDEr on TextCaps and 55.3% vs. 55.6% on TextVQA).

OCR System. We note that different prior works often use different commercial OCR engines to obtain their best results. Thus, it is hard to perform a fair comparison without extra costs. That said, we did evaluate PreSTU with different OCR engines (including Rosetta-en) at the downstream stage (Table 10 of the main text). A similar setup is used in LaTr: Rosetta-en for downstream TextVQA and Amazon-OCR for pre-training. In this setup, PreSTU outperforms LaTr on TextVQA Val (50.7% vs. 48.4%).
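The two orderings compared above can be sketched as follows. The sorting rule (top-to-bottom, then left-to-right, by a reference point of each box) and the `(text, x, y)` item format are assumptions made for illustration; this appendix only states that the order is fixed.

```python
import random


def fixed_order(ocr_items):
    """Order (text, x, y) items top-to-bottom, then left-to-right (assumed rule)."""
    ordered = sorted(ocr_items, key=lambda item: (item[2], item[1]))
    return [text for text, _, _ in ordered]


def random_order(ocr_items, seed=None):
    """Order the same items by a random shuffle (the ablation baseline)."""
    rng = random.Random(seed)
    items = list(ocr_items)
    rng.shuffle(items)
    return [text for text, _, _ in items]


# Hypothetical OCR output: (text, x, y), with y growing downwards.
ITEMS = [("exit", 40, 10), ("lexus", 5, 50), ("13", 5, 10)]

print(fixed_order(ITEMS))
print(random_order(ITEMS, seed=0))
```

With a fixed rule, the same image always yields the same target sequence regardless of the order in which the OCR engine returns its detections, whereas a shuffle makes the target vary between passes.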

Appendix E Qualitative results

Figure 4 shows some examples of OCR tokens generated by splitocr. Our splitocr detects all (or almost all) OCR tokens in the images correctly and is competitive with the gOCR system.

In §3.2 of the main text, we demonstrate that having two sources of OCR signals is beneficial (OCR signals by pre-trained ViT with splitocr and OCR signals by gOCR system). Figure 5 further supports this finding qualitatively. For instance, gOCR alone does not detect some OCR tokens in the image (e.g., “13”) or detects them incorrectly (e.g., “lexue”). This leads NoPreSTU to predict wrong answers (e.g., “5” or “cooper”). On the other hand, splitocr with gOCR tokens as input predicts the answers correctly with correct OCR tokens (e.g., “13” or “lexus”), demonstrating that two sources of OCR signals (i.e., ViT and gOCR) are complementary.

Figure 6 provides qualitative results for VizWiz-VQA and VizWiz-Captions, demonstrating the applicability of PreSTU to different VQA and image captioning tasks.

Appendix F Contributions

While our splitocr is inspired by SimVLM, the motivation is fundamentally different, and it is not trivial to apply the prefix idea to OCR-aware pre-training in the first place. Concretely, SimVLM aims to serve downstream tasks that generate text such as captions or answers (with optional text input). Thus, it is understandable why SimVLM could help. In contrast, for downstream STU tasks, OCR strings often serve only as the text input (Figures 2 & 3 of the main text). Therefore, while it makes sense to apply our second-stage pre-training (cap & vqa) with OCR strings as the input, it is not intuitive to develop a separate OCR-only pre-training stage (splitocr) that leverages the idea of SimVLM. We arrived at splitocr purely from the two essential STU capabilities: (i) recognizing text in an image and (ii) connecting the text to its visual context. Our contribution thus lies in how to fulfill the two requirements in a unified manner, which turns out to be a SimVLM-like objective.
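To make the prefix analogy concrete, a minimal sketch of a SimVLM-like split over OCR tokens is given below. It assumes that the ordered OCR tokens are cut at a random position, with the first part fed as text input alongside the image and the second part used as the generation target; the function name, the choice of cut distribution, and the guarantee of a non-empty target are illustrative assumptions, not the exact implementation.

```python
import random


def split_ocr(ocr_tokens, rng):
    """Cut an ordered OCR token list at a random position into (prefix, target).

    The cut index is drawn from [0, len - 1], so the target is never empty
    for a non-empty token list (an assumption of this sketch).
    """
    if not ocr_tokens:
        return [], []
    cut = rng.randrange(len(ocr_tokens))
    return ocr_tokens[:cut], ocr_tokens[cut:]


# Hypothetical ordered OCR tokens for one image.
tokens = ["13", "exit", "lexus"]
prefix, target = split_ocr(tokens, random.Random(0))

print("input prefix:", prefix)
print("target:", target)
```

When the cut falls at position 0, the prefix is empty and the model must generate all OCR tokens from the image alone; later cuts let it condition on part of the text, which relates to the two capabilities listed above.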

Besides splitocr, another key contribution of our work is the comprehensive investigation of pre-training STU capabilities using a combination of easily reproducible objectives and a standard network architecture, on domains much more diverse than in previous works. Thus, we believe that our extensive analysis is valuable to the community.

Finally, we demonstrate the effectiveness of our OCR-aware method in large-scale settings. We choose CC15M as the pre-training dataset, which is often considered large-scale, and PaLI, an extremely large-scale model (with 10B data), utilizes our objective to achieve SOTA results on nearly all STU tasks (cf. §3.1.4 of the main text). This shows the utility of our pre-training objectives even in SOTA large-scale models.