Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

Introduction

Large language models (LLMs) such as GPT and LLaMA have taken the world by storm. What would it take to build a Large Vision Model (LVM)? From the animal world, we know that visual competences are not dependent on language. In particular, many experiments have shown that the visual world of non-human primates is remarkably similar to that of humans. So while the space of vision-language models such as LLaVA is interesting and worthwhile to pursue, in this paper we seek an answer to a different question – how far can we go from pixels alone?

The key features of contemporary LLMs that we seek to emulate in LVMs are: 1) scaling in the presence of big data, and 2) flexible specification of tasks through prompting (in-context learning). How do we achieve this? As usual, there are three main components that must be specified:

Data: We want to exploit all the remarkable diversity in visual data. First of all, just raw unannotated images and videos. Next, we want to exploit the variety of annotated visual data sources that have been produced over the last couple of decades – semantic segmentations, depth reconstructions, keypoints, multiple views of 3D objects, among others. We define a common format, “visual sentences”, in which to represent these different annotations without needing any meta-knowledge beyond the pixels. The total size of our training dataset is $1.64$ billion images/frames.

Architecture: We use a large transformer architecture ($3$ billion parameters) trained on visual data represented as a sequence of tokens, using a learned tokenizer that maps each image to a string of $256$ vector-quantized tokens.

Loss function: We draw inspiration from the natural language community, where masked token modeling has given way to sequential autoregressive prediction. Once images/videos/annotated images can all be represented as sequences, we can train the model to minimize the cross-entropy loss for predicting the next token.

With this extremely simple design, we demonstrate some noteworthy behaviors:

Appropriate scaling behavior as one increases model size and data size.

Many different vision tasks can now be “solved” by designing suitable prompts at test time. While the results don’t show as high performance as bespoke, specifically-trained models, the fact that so many tasks are all addressed by a single vision model is quite encouraging.

We see a clear benefit of the amount of unsupervised data on the performance on various standard vision tasks.

We see a hint of an ability for general visual reasoning – handling out-of-distribution data, and performing novel tasks. But further investigation is needed.

Related Work

Pretrained Vision Models. The value of using pretrained models (such as ImageNet-pretrained AlexNet) has been demonstrated as far back as 2015 in R-CNN, and it has since become standard practice in computer vision. Self-supervised pretraining was proposed as a way to vastly increase the amount of data available for pretraining. Unfortunately, this was not very successful, likely because the CNN-based architectures of that time did not have enough capacity to absorb the data. With the introduction of Transformers, which have much higher capacity, researchers revisited self-supervised pretraining, and showed that transformer-based masked image reconstruction approaches, such as BEiT, MAE, and SimMIM, perform vastly better than their CNN-based counterparts. Yet, despite their recent successes, current pretrained vision-only models have had trouble scaling up to really large datasets, such as LAION.

Multi-task Learning and In-context Learning. From the classic one-model-per-task setups, computer vision is slowly moving toward a single model performing multiple different tasks. Various multi-task learning approaches exist, but they are typically limited to a fixed, pre-defined number of tasks. More recently, methods inspired by in-context learning in LLMs forgo any notion of tasks and instead let the model infer the task directly from the input prompt. For example, Visual Prompting takes in a task input/output example pair and a query image at test time, concatenates them into a single 2-by-2 image, and uses inpainting to generate the desired output. But, since the inpainting is performed using a variant of MAE, the same problems with scaling are inherited by these approaches.

Auto-regressive Visual Models. The idea of using auto-regressive models for synthesizing visual data goes back at least 70 years. Inspired by Shannon’s use of $N$-grams to synthesize language, a number of works, starting with Attneave’s seminal 1954 paper, applied this idea to sequentially synthesizing pixels, image patches, video frames, and motion capture data. As deep models became popular, newer works replaced $N$-grams with RNNs or CNNs for pixel synthesis. Most recently, transformer-based autoregressive visual generation methods have been proposed, and, combined with language, have demonstrated impressive image synthesis results, e.g. Parti.

Data

“Data! Data! Data! I can’t make bricks without clay!”

The key requirement of any Large Pre-trained Model is that it must be trained on vast amounts of data. For language models, very large and very diverse datasets are fairly easy to obtain. For instance, the popular Common Crawl repository contains 250 billion web pages spanning the entire Web, is extremely diverse, and includes “natural demonstrations” like language translations, question answering, etc. In computer vision, we are still very far from having a data source of comparable size and diversity. One of the central contributions of our work is the first step toward curating such a dataset that we call Unified Vision Dataset v1 (UVDv1). To assemble it, we leverage many different sources of visual data: (1) unlabelled images, (2) images with visual annotations, (3) unlabelled videos, (4) videos with visual annotations, and (5) 3D synthetic objects. The unlabeled images, which represent over 80% of our data, capture a huge cross-section of our visual world, and provide the required diversity, at the cost of lower quality. Images with annotations have a much more constrained distribution, but are usually of higher quality. Video data is even more constrained (typically, to human-centric activities), but is an invaluable source of temporal data. Renderings of 3D synthetic objects are the lowest in diversity but can provide valuable hints about the behavior of 3D structures. Importantly, UVDv1 is a purely visual dataset, with no non-visual meta-data (e.g. text) included. All together, UVDv1 contains $1.64$ billion images.

Another important difference from Large Language Models is that language data has a natural, unified one-dimensional structure for all the data – a stream of text. Unfortunately, this is not the case for visual data, with different sources all having different structures. In this work we propose the visual sentence as a unified unit of visual data, which enables us to train scalable models from a diverse set of sources. A visual sentence is simply a sequence containing one or more images followed by an end-of-sentence (EOS) token. Figure 1 shows how the various data sources are partitioned into visual sentences. In particular:

Single images. A single image itself represents the simplest form of a visual sentence – { image, EOS}. We use the filtered subset of 1.49 billion images from the LAION 5B dataset. This is by far the largest part of our data, comprising 88.5%.

Image sequences. A sequence of images is a natural form of visual sentence. We create such sequences by sourcing video data from a wide range of existing datasets. Visual sentences of 16 frames are formed by randomly sampling the videos at three different strides (10, 20, and 30). In addition, we utilize synthetic 3D objects from the Objaverse Dataset to generate object-centric multiview sequences for a variety of objects. For each object, we sample one radius length between the object center and the camera from 1.5 to 2.2, and sample one constant elevation from -45 degrees to 45 degrees, then traverse different views of the object by changing the azimuth with a step length of 15 degrees and render 24 views. We rendered 42000 such sequences in total for training and 8000 for testing. Finally, we can also represent images belonging to the same semantic category as being (part of) a sequence. We use categories from ImageNet, concatenating together groups of images (2, 4, 8, or 16) from the same category into 16-image-long visual sentences.
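The two sampling procedures above can be illustrated with a minimal sketch. It assumes that a stride of s means taking every s-th frame from a random starting point, and that radius and elevation are drawn uniformly from the stated ranges; the function names are hypothetical and are not taken from the released code.

```python
import random

def sample_frame_indices(num_frames, stride, length=16, rng=random):
    """Pick `length` frame indices spaced `stride` apart from a random start.

    Returns None when the video is too short for this stride (an assumption
    about how short clips are handled).
    """
    span = (length - 1) * stride + 1  # frames covered from first to last index
    if num_frames < span:
        return None
    start = rng.randrange(num_frames - span + 1)
    return [start + i * stride for i in range(length)]

def multiview_camera_poses(rng=random, num_views=24, step_deg=15.0):
    """One radius and one elevation per object, then a full azimuth sweep."""
    radius = rng.uniform(1.5, 2.2)        # camera distance from object center
    elevation = rng.uniform(-45.0, 45.0)  # constant elevation in degrees
    return [(radius, elevation, i * step_deg) for i in range(num_views)]
```

With 24 views at 15-degree steps, the azimuths cover a full 360-degree turn around the object.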

Images with annotations. To handle different types of image annotations in a uniform way, we choose to represent all annotations as images. Some data types, e.g. semantic segmentation maps, edge maps, depth and normal images, are already represented this way. For others we apply tailored methods for each specific annotation type: 1) Object Detection: We create annotations by overlaying a color-coded bounding box around each object, following the methodology in ; 2) Human Pose: Human skeletons are rendered in pixel space, adhering to the OpenPose format, utilizing MMPose; 3) Depth Estimation, Surface Normal, and Edge Detection: given ImageNet and COCO images, we generate annotations in line with the protocols from . 4) Style Transfer, De-rain, De-noise, Low Light Enhancement, and Stereo Datasets: These are all represented as image pairs (e.g. input/output). 5) Colorization: We convert ImageNet images to greyscale, producing image pairs. 6) Inpainting: The process involves randomly adding black-colored boxes in images to simulate corruption, resulting in image pairs. For all the above annotation types, we can create visual sentences by concatenating 8 image pairs of the same annotation type into a 16-image visual sentence. For datasets containing $k$ different annotations for the same image we use a different approach: for each set of $1+k$ images (input plus $k$ annotations), we randomly select $m$ elements, where $m\leq k+1\leq 16$. These $m$-tuples are then concatenated to form visual sequences.
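As a rough illustration of the two constructions above, the sketch below groups image pairs into 16-image sentences and builds a sentence from multi-annotation sets. The function names are hypothetical. Reusing the same $m$ selected slots for every image in a sentence, and dropping leftover pairs that do not fill a sentence, are assumptions rather than details stated in the text.

```python
import random

def pair_sentences(pairs, pairs_per_sentence=8):
    """Group (input, output) pairs of one annotation type into 16-image sentences."""
    sentences = []
    for i in range(0, len(pairs) - pairs_per_sentence + 1, pairs_per_sentence):
        chunk = pairs[i:i + pairs_per_sentence]
        # Flatten [(in, out), ...] into [in, out, in, out, ...]
        sentences.append([img for pair in chunk for img in pair])
    return sentences

def multi_annotation_sentence(image_sets, m, max_images=16, rng=random):
    """image_sets: list of [input, annot_1, ..., annot_k] lists, all of length 1 + k."""
    k_plus_1 = len(image_sets[0])
    assert 1 <= m <= k_plus_1 <= max_images
    # Assumption: the same m slots are selected for every image in the sentence.
    slots = sorted(rng.sample(range(k_plus_1), m))
    sentence = []
    for image_set in image_sets[: max_images // m]:
        sentence.extend(image_set[j] for j in slots)
    return sentence
```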

Image sequences with annotations. When converting annotated video data (VIPSeg, Hand14K, AVA, JHMDB) to visual sentences, we apply two complementary strategies. The first is similar to how we treat image data with paired annotations: each visual sentence is constructed by concatenating frames with their annotations – {frame1,annot1,frame2,annot2,…}. The second method involves grouping multiple frames followed by their corresponding annotations – {frame1,frame2,annot1,annot2,…}.
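The two orderings can be written down in a few lines. This is only a sketch: the function names and the `group_size` parameter are hypothetical, and a group of two frames is assumed from the pattern shown in the example above.

```python
def interleaved_sentence(frames, annots):
    """{frame1, annot1, frame2, annot2, ...}"""
    sentence = []
    for frame, annot in zip(frames, annots):
        sentence += [frame, annot]
    return sentence

def grouped_sentence(frames, annots, group_size=2):
    """{frame1, frame2, annot1, annot2, ...}: a group of frames, then their annotations."""
    sentence = []
    for i in range(0, len(frames), group_size):
        sentence += list(frames[i:i + group_size])
        sentence += list(annots[i:i + group_size])
    return sentence
```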

We present a detailed summary of all the data sources, annotation type and data statistics of UVDv1 in the Appendix.

Approach

In this section, we describe the design of our autoregressive Large Vision Model. Unlike text data, which naturally exhibits discrete sequential structure, it is not straightforward to model image pixels in visual sentences. In this work, we take a two-stage approach: 1) train a large visual tokenizer (which operates on individual images) to convert each image into a sequence of visual tokens; 2) train an autoregressive transformer model on visual sentences, each represented as a sequence of tokens. We summarize our approach in Figure 2.

While the visual sentences exhibit a sequence structure between consecutive images, we don’t have such natural sequence structure within an image. Therefore, in order to apply a transformer model to images, prior works typically do one of the following: either divide the image into patches in scan-line order, and treat that as a sequence, or use a pre-trained image tokenizer, such as VQVAE or VQGAN, to cluster image features into a grid of discrete tokens, which, again, are turned into a sequence in scan-line order. We adopt the latter approach since the discrete categorical output from a model naturally forms a probabilistic distribution that one can easily sample from, enabling flexible conditional generation of new images within a visual sentence.

Specifically, we employ semantic tokens generated by a VQGAN model, a concept introduced by Esser et al. This framework consists of an encoding and a decoding mechanism, featuring a quantization layer that assigns input images to a sequence of discrete tokens from an established codebook. Our encoders and decoders are constructed purely with convolutional layers. The encoder is equipped with several downsampling modules to contract the spatial dimension of the input, whereas the decoder is fitted with an equivalent series of upsampling modules to restore the image to its initial size. For a given image, our VQGAN tokenizer produces 256 discrete tokens.
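To make the role of the quantization layer concrete, the following toy sketch maps a small grid of encoder feature vectors to token indices by nearest-codebook lookup and flattens the grid in scan-line order. It is a pure-Python illustration under the usual vector-quantization assumption (squared Euclidean distance); it is not the VQGAN implementation, and the function names are hypothetical.

```python
def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]

def tokenize_grid(feature_grid, codebook):
    """Flatten an H x W grid of feature vectors row by row, then quantize."""
    flat = [vec for row in feature_grid for vec in row]
    return quantize(flat, codebook)

# Toy usage: a 2 x 2 feature grid and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
grid = [[[0.1, 0.0], [0.9, 1.2]],
        [[0.1, 0.8], [0.0, 0.1]]]
tokens = tokenize_grid(grid, codebook)
```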

It is important to note that our tokenizer operates on individual images independently, rather than on the entire visual sentence at once. This independence allows us to decouple the tokenizer training from the downstream Transformer model so that the tokenizer can be trained on a dataset of single images without having to consider the distribution of visual sentences.

Implementation Details: We adopt an off-the-shelf VQGAN architecture from Chang et al. We follow the exact configuration in Chang et al., which uses a downsampling factor of $f=16$ and codebook size 8192. This means that for an image of size $256\times 256$, our VQGAN tokenizer produces $16\times 16=256$ tokens where each can take 8192 different values. We found that using the results of an ImageNet pre-trained tokenizer did not generalize well beyond ImageNet images. Therefore, we train our own tokenizer on a 1.5B subset of the LAION 5B dataset.
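The token arithmetic above can be checked directly. The helper names below are hypothetical; the numbers are the ones stated in the text.

```python
import math

def tokens_per_image(image_size=256, downsample=16):
    """Side length of the token grid is image_size / f; the grid is square."""
    side = image_size // downsample
    return side * side

def bits_per_token(codebook_size=8192):
    """Information carried by one token if all codebook entries were equally likely."""
    return math.log2(codebook_size)

n_tokens = tokens_per_image()  # 16 * 16
n_bits = bits_per_token()      # log2(8192)
```

Under these settings each image is described by 256 tokens of 13 bits each, i.e. at most 3328 bits per image.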

Sequence Modeling of Visual Sentences

After converting images into discrete tokens with VQGAN, we treat our visual sentence as a unified sequence by concatenating the discrete tokens from multiple images into a 1D sequence. Importantly, all visual sentences are treated equally – we do not make use of any special tokens to indicate particular tasks or formats. We train a causal Transformer model with the next token prediction objective using a cross-entropy loss, similar to the standard approach for language models. Training the model the same way on all visual sentences enables the model to infer the relation between images from context instead of from task- or format-specific tokens. This gives the model an opportunity to generalize to other, unseen visual sentence structures.
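The next-token objective can be sketched without any deep learning framework. The code below computes the mean cross-entropy (in nats) of a token sequence given per-position logits, where the logits at position $t$ score the token at position $t+1$. It is an illustration of the loss only, with hypothetical function names, and not the training code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def next_token_loss(logits, tokens):
    """logits[t] scores tokens[t + 1]; returns the mean negative log-likelihood."""
    assert len(logits) == len(tokens) - 1
    nll = 0.0
    for t, row in enumerate(logits):
        nll -= log_softmax(row)[tokens[t + 1]]
    return nll / len(logits)
```

A model that is uniform over a vocabulary of size $V$ has loss $\log V$; for the 8192-entry codebook this is about 9.01 nats per token, which gives a reference point for reading loss curves.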

Implementation Details: After tokenizing each image in a visual sentence into 256 tokens, we concatenate them to form a 1D sequence of tokens. On top of the sequences of visual tokens, our Transformer model is virtually the same as an autoregressive language model, so we adopt the Transformer architecture of LLaMA, a popular open-source language model with widely available implementations. We use a context length of 4096 tokens, which can fit 16 images under our VQGAN tokenizer. Similar to language models, we add a [BOS] (begin of sentence) token to the beginning of each visual sentence and an [EOS] (end of sentence) token to the end, and use sequence concatenation during training time to improve efficiency. We train our model on our entire UVDv1 dataset ($420$ billion tokens) using one epoch (single-epoch training is standard in language models to avoid potential overfitting). We train 4 models with different numbers of parameters: 300 million, 600 million, 1 billion and 3 billion, following the same training configurations. We provide the detailed training hyperparameters in the Appendix.
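One common way to realize sequence concatenation is to wrap each sentence in [BOS]/[EOS], join all sentences into one token stream, and cut the stream into fixed-length training windows. The sketch below follows that reading. The special-token ids (placed just past the 8192-entry codebook), the function names, and the choice to drop the final partial window are all assumptions, not details given in the text.

```python
BOS, EOS = 8192, 8193  # hypothetical ids, chosen to sit just past the codebook

def wrap_sentence(image_token_lists):
    """[BOS] + tokens of every image in order + [EOS]."""
    sentence = [BOS]
    for tokens in image_token_lists:
        sentence.extend(tokens)
    sentence.append(EOS)
    return sentence

def pack(sentences, context_length=4096):
    """Concatenate sentences into one stream and split it into full windows."""
    stream = [t for s in sentences for t in s]
    n_full = len(stream) // context_length
    return [stream[i * context_length:(i + 1) * context_length] for i in range(n_full)]
```

With this kind of packing a sentence may straddle two windows, which is the usual trade-off for avoiding padding.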

Inference by Visual Prompting

Since the autoregressive Transformer in our model outputs a probability distribution of the next token conditioned on previous tokens, we can easily sample from this distribution to generate new visual tokens that complete a visual sentence. To use the model for downstream tasks, one can construct a partial visual sentence that defines a task at test time, and apply the model to generate the output. This is similar to in-context learning in language models or visual prompting in computer vision.
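The generation loop can be sketched as follows. Here `next_token_probs` stands in for the trained Transformer: it is a hypothetical callable that returns a probability vector over the vocabulary given the tokens so far. The toy model at the bottom exists only to make the example runnable.

```python
import random

def generate_image_tokens(next_token_probs, prompt_tokens, tokens_per_image=256, rng=random):
    """Sample one image worth of tokens, one token at a time, after the prompt."""
    seq = list(prompt_tokens)
    for _ in range(tokens_per_image):
        probs = next_token_probs(seq)
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        seq.append(token)
    return seq[len(prompt_tokens):]

def toy_model(seq, vocab=8):
    """Stand-in model: puts all probability on (last token + 1) mod vocab."""
    nxt = (seq[-1] + 1) % vocab
    return [1.0 if i == nxt else 0.0 for i in range(vocab)]

generated = generate_image_tokens(toy_model, [3], rng=random.Random(0))
```

The 256 sampled tokens would then be passed through the tokenizer's decoder to obtain the output image.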

Experimental Results and Analysis

In this section, we evaluate the scaling abilities of our trained model, as well as its ability to understand and answer a range of diverse prompted tasks.

We investigate the scaling behavior of our model in terms of the training loss and downstream task performance as we increase the model size as well as the number of tokens seen during training.

Training loss. We first inspect the training loss of LVM with different parameter sizes, which we present in Figure 3. Since all our models are trained for only one epoch on the dataset, the model sees a given data sample just once, and therefore the training loss at any point during training is very similar to the validation loss. One can observe that as training progresses: 1) the training loss (perplexity) of the models, regardless of their size, continues to decrease; 2) as we increase the size of the model (parameter count), the loss decreases faster. These observations indicate that LVM shows strong scalability behavior with both larger models and more data.

Scalability on downstream benchmarks. While the LVM overall loss scales well during training, there is no guarantee that the better overall model would also perform better on a given specific downstream task. Therefore, we evaluate different sizes of models on 4 downstream tasks: semantic segmentation, depth estimation, surface normal estimation, and edge detection. We evaluate these tasks on the ImageNet validation set and generate all the annotations using the corresponding method described in Sec. 3. For each task, we give 5 pairs consisting of the inputs and corresponding ground-truth annotations as well as the query image as input prompt and evaluate the perplexity of the ground-truth annotation under our model’s prediction of the next $256$ tokens (one image). We report the results in Figure 4. We see that larger models indeed attain lower perplexity across all tasks, showcasing that our scalable overall performance does transfer to a range of downstream tasks.
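For reference, the evaluation prompt and the reported quantity can be sketched as below, assuming the usual definition of perplexity as the exponential of the mean negative log-likelihood over the 256 ground-truth tokens. The function names are hypothetical.

```python
import math

def prompt_token_count(num_pairs=5, tokens_per_image=256):
    """Example pairs (input + annotation each) followed by one query image."""
    return (2 * num_pairs + 1) * tokens_per_image

def perplexity(token_probs):
    """token_probs[i] is the probability the model gave to the i-th ground-truth token."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)
```

Under this definition, a model that assigns probability 1/4 to every ground-truth token has perplexity 4, and a uniform guess over the 8192-entry codebook has perplexity 8192.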

Dataset ablation. While LVM attains better performance with larger models and more data, it is natural to ask whether each data component we collect in UVDv1 helps. To answer this question, we conduct an ablation study on our dataset by training several 3B models on subsets of our dataset, and compare their performances on downstream tasks. We use the same 4 downstream tasks and settings as before and present the results in Figure 5. We observe that each data component contributes positively to the downstream tasks. LVM not only benefits from larger data, but also improves with more diversity in the dataset, which includes both annotated and unsupervised image and video data.

Sequential Prompting

We begin with the most intuitive and straightforward approach to visually prompt the LVM: sequential reasoning. Here the prompt construction is very simple: we present the model with a sequence of 7 images and ask it to predict the next image ( $256$ tokens).

Video frame prediction. The most direct task for sequential prompting is video prediction. Figure 6 presents several next frame prediction examples, prompted by sequences from the Kinetics-700 validation set. At the top, 7 frame prompts (blue border) are followed by the predicted frame (red border). We observe a certain degree of inferential ability regarding spatial positioning, viewpoint and object understanding. Perplexity of prediction on Kinetics val set is 49.8. The last 4 rows show predictions with longer context (15 frames) and a longer prediction (4 frames). See Figures 17, 18, 19, 20, 21 and 22 in the Appendix for many more examples.

Rotation and Category prediction. The same type of simple sequential prompting can be used in other ways as well. For example, Figure 16 shows how prompting the model with a sequence of 3D rotations of a synthetic object around an arbitrary axis allows it to predict further rotation. Or we can think of a list of items of a given category as a sequence and predict other ideas in that same category, as shown in Figure 15. Note that, while the system was trained on groups of images from the same ImageNet category, here the prompt consists of sketches, which have not been seen in any annotated data.

Context length analysis. Next we ask how much temporal context is required to accurately predict the subsequent frame. We assessed the model’s frame generation perplexity when prompted with a context of varying lengths (1 to 15 frames). As Figure 7 shows, on the Kinetics-700 val set, we see a clear improvement in perplexity from 1 to 11 frames (from $62.1\rightarrow 48.4$), after which it stabilizes.

Analogy Prompting

Our study progresses by evaluating a more complex prompting structure, which we call ‘Analogy Prompting’. This method challenges the model to comprehend analogies of arbitrary length and complexity, thereby testing its advanced interpretative abilities.

Qualitative Results. Figure 8 shows a sampling of qualitative results with analogy prompting on a number of tasks. The prompts consist of a sequence of 14 images giving examples of various tasks, followed by a 15th query image. Given each prompt, the next image predicted is the result. The top part of the figure shows several example prompts defining tasks that were part of the training set (but these actual images were never seen at training). The bottom part of the figure demonstrates generalization to tasks never shown at training. See the Appendix for many more qualitative examples.
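A prompt of this shape can be assembled as in the sketch below. It assumes the 14 example images are organized as 7 input/output pairs, which is one natural reading of the description; the function name is hypothetical.

```python
def analogy_prompt(example_pairs, query_tokens):
    """Concatenate (input, output) token lists for each example, then the query."""
    prompt = []
    for input_tokens, output_tokens in example_pairs:
        prompt.extend(input_tokens)
        prompt.extend(output_tokens)
    prompt.extend(query_tokens)
    return prompt

# 7 pairs plus 1 query = 15 images, leaving room for one predicted image
# within a 16-image context.
pairs = [([1] * 256, [2] * 256) for _ in range(7)]
prompt = analogy_prompt(pairs, [3] * 256)
```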

Unseen Tasks and Dataset. We present the results for keypoint detection on Pascal 3D+, evaluated using the standard Percentage of Correct Keypoints (PCK) metric with a threshold of 0.1. Remarkably, LVM achieves a PCK of 81.2 without training on this dataset, demonstrating impressive generalization capabilities. In comparison, we show some existing task-specific models: StackedHourglass scores 68.0 PCK, MSS-Net achieves 68.9 PCK, and StarMap registers 78.6 PCK.
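For readers unfamiliar with the metric, a minimal PCK sketch follows. It assumes the common convention that a predicted keypoint is correct when it lies within $\alpha\cdot\max(w,h)$ of the ground truth, with $w$ and $h$ the object bounding-box size. The text does not spell out the normalization used, so this is an assumption, and the function name is hypothetical.

```python
import math

def pck(pred, gt, box_w, box_h, alpha=0.1):
    """Percentage of keypoints whose error is within alpha * max(box_w, box_h)."""
    threshold = alpha * max(box_w, box_h)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return 100.0 * correct / len(gt)

# Toy example: threshold is 10 pixels for a 100 x 80 box.
pred = [(10, 10), (50, 50), (90, 20)]
gt = [(12, 11), (70, 50), (90, 25)]
score = pck(pred, gt, box_w=100, box_h=80)
```

In the toy example two of the three keypoints fall within the threshold.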

Comparison with Visual Prompting. The closest approach to ours that also allows for defining arbitrary tasks is Visual Prompting. In Table 1, we compare various visual prompting models on few-shot segmentation, object detection, and colorization tasks. Note that our sequential LVM beats previous approaches on almost all tasks.

Task Compositing. Figure 9 demonstrates compositing several tasks together within a single prompt. Here, we demonstrate the rotation task together with the novel keypoint correspondence task and ask the model to continue the pattern. The model is able to successfully combine these two at test time, demonstrating some degree of compositionality.

Miscellaneous Prompts

Here we try to see how far we can push our model by offering it various prompts it has not seen before. Figure 10 shows a few such prompts that happened to work reasonably well. Figure 11 shows some prompts which are not easily describable by words – these are the type of tasks where LVMs might eventually outshine LLMs.

In Figure 13, we show initial qualitative results on a typical visual reasoning question as found on non-verbal human IQ tests (Raven’s Progressive Matrices). With considerable squinting, one could imagine the LVM having a latent ability for grasping abstract visual patterns and applying the grasped pattern to extrapolate the shown visual sequence. This exciting result warrants further study.

Limitations

Figure 12 shows some typical failure cases of the current model. One common element is that the use of a visual prompt to define a task is often under-constrained (more so than in language, since images are very high-dimensional), or the requested task might be beyond the capabilities of the current system. Other, more mundane failures involve issues with the tokenizer and lack of high-quality video training data.

Limited computing resources placed severe constraints that prevented us from exploring a range of intriguing problems, including the impact of different data sets and detailed ablation studies. It is important to note that, despite this being one of the biggest vision models to date, it is still rather small in comparison with modern Large Language Models. Therefore, the question of emergence and true generalization in Large Vision Models remains wide open and ripe for further study.

We are grateful to many friends and colleagues for the discussions and comments on this work, including Yossi Gandelsman, Aleksander Holynski, Angjoo Kanazawa, Qianqian Wang, Sophia Koepke, Quoc Le, Chen Liang, Ekin D. Cubuk, Assaf Shocher, Amir Zamir, Carl Vondrick, Ludwig Schmidt, Aviral Kumar, Xiaolong Wang, Yonglong Tian, Miki Rubinstein and Dilip Krishnan. The work has been supported, in part, by ONR MURI N00014-22-1-2773, N00014-21-1-2801, and N00014-21-1-2812, ERC HOLI, Apple Graduate Fellowship to Yutong Bai, and compute donation via the Google TPU Research Cloud.

References

Appendix Overview

This supplementary document complements the main manuscript by providing detailed insights and additional support. It is structured as follows:

Appendix A: Approach: Large Vision Models (LVMs)

– Explores the specifics of LVMs used in our study, including model sizes, architectural details, and optimization hyperparameters.

Appendix B: Unified Vision Dataset (UVD) In-Depth Analysis

– Provides a comprehensive examination of UVD, discussing its composition, data distribution and more details.

Appendix C: Additional Results

– Offers extended results and visual evidence for our study, including supplementary figures and quantitative assessments.

Appendix A Approach: Large Vision Models (LVMs)

As stated before, we use the Transformer variant of LLaMA as our model architecture. To form different model sizes, we vary the hidden dimension, MLP intermediate dimension, number of heads and number of layers. We present the details in Table 2. For the rest of the hyperparameters, we keep them the same as in the standard LLaMA model.

A.2 Training and optimizer details.

Following the LLaMA model, we use the AdamW optimizer to train our models. We use the same optimizer hyperparameters for all our models, and we present them in Table 3. All our models are trained on TPU-v3 pods on Google Cloud. Our largest model, LVM-3B, takes around 14 days to train on a v3-512 TPU pod.

Appendix B Unified Vision Dataset (UVD) Details

The Unified Vision Dataset (UVD) represents an extensive compilation of visual data spanning a wide array of domains and annotation types. It integrates a diverse set of datasets, each contributing unique characteristics and annotations, thereby creating a rich resource for various vision-related tasks. The following Table 4 provides a detailed overview of UVD, categorizing the datasets into specific groups based on their content type and annotation features. This categorization includes unpaired image data, images with annotations, videos, videos with annotations, and synthetic 3D views. Each dataset within these categories is listed with its corresponding token count, annotation type, and annotation source, offering a comprehensive perspective of the UVD’s structure and composition.

B.2 Summary of Dataset Distribution in UVD

The Unified Vision Dataset (UVD) encompasses a diverse array of visual data, aggregating over 430 billion tokens. The distribution of these tokens across various categories underscores the dataset’s extensive coverage (see Figure 14):

Unlabeled Images (88.5%): This category, featuring datasets like LAION, is the largest, providing a vast collection of unannotated images suitable for a wide range of applications, particularly in unsupervised learning.

Images with Annotations (7.15%; 30.78 billion tokens): Including prominent datasets such as ImageNet 1K and COCO, this segment offers annotated images for image classification, object detection, semantic segmentation, etc.

Videos (4.24%; 18.26 billion tokens): Comprising datasets like UCF101 and Moments in Time, this category provides unannotated video content, ideal for general video analysis and unsupervised learning in dynamic scenes.

Videos with Annotations (0.06%; 0.25 billion tokens): Though smaller in token count, this category is significant, with datasets like VIPSeg and Hand14K offering annotated videos for specific tasks like video segmentation and human pose estimation.

Synthetic 3D Views (0.05%; 0.22 billion tokens): Datasets like Objaverse in this category cater to advanced 3D vision tasks, providing synthetic 3D views for cutting-edge research.

Overall, UVDv1’s rich composition, with its extensive token array, positions it as a comprehensive resource for various tasks in computer vision, from basic image processing to complex analyses in video and 3D data.

B.3 Details of Constructing Video Visual Sentences

We implemented specific tokenization strategies for each video dataset, taking into account their unique characteristics and contents. These tailored tokenization processes, inclusive of epoch details, ensure a comprehensive and diverse representation of each dataset’s unique video content.

Tokenized with strides of 4 and 7, capturing sequences of 16 frames. Random starting points were used for each of the 10 epochs to ensure diversity in human-object interactions.

CO3D [68]:

Focused on 3D objects, tokenized with strides of 4 or 8 frames. Each sequence used 1 or 2 shots, with random starts in each epoch to capture object depth and detail.

Ego4D [37]:

Strides of 12, 24, and 36 were employed, each sequence consisting of 16 frames. Randomization of starting points was implemented over 10 epochs to capture a range of egocentric activities.

Charades v1 [76]:

Tokenized using strides of 10, 20, and 30 for 16-frame sequences. Random starting points across 2 epochs captured diverse narrative scenes.

Kinetics 700 [13]:

Employed strides of 8 and 24, with each sequence capturing 16 frames. Random starts in each epoch over 10 epochs were used to represent a broad spectrum of human activities.

Diving48 [51]:

Strides of 2 and 4 for tokenization, capturing 32-frame sequences to detail diving techniques. Random starting points were utilized across all epochs for comprehensive motion analysis.

AVA [60]:

This dataset was tokenized with strides of 10 and 20, each sequence consisting of 16 frames. Random starts for sequences were used in each of the 50 epochs to capture varied human actions.

Jester [56]:

Tokenized to capture the subtlety of hand gestures with 16-frame sequences. Randomization in the starting points was employed to enhance gesture diversity.

YouCook [22]:

Tokenized with strides of 10, 20, and 30, each sequence comprising 16 frames. Random starting points over 4 epochs were used to capture a variety of cooking procedures.

CharadesEgo [77]:

Focused on first-person narratives, tokenized using strides of 10, 20, and 30 for 16-frame sequences over 2 epochs.

YouTube VOS [93]:

Tokenized using strides of 2, 4, and 8, focusing on detailed object movements within 16-frame sequences over 2 epochs.

MultiSports [52]:

Captured sports actions with strides of 4, 8, and 12 for 16-frame sequences across 3 epochs.

ActivityNet [12]:

Tokenized with strides of 5, 10, and 15, capturing 16 frames per sequence over 4 epochs to represent a wide range of activities.

Hand14K [31]:

Focused on hand gesture recognition, tokenized with sequences of 16 frames, capturing detailed hand movements over multiple epochs.

Moments in Time [58]:

Captured a wide array of activities and phenomena with a stride of 0, considering the short length of the videos, over multiple epochs.

Multi-Moments in Time [59]:

An extension of Moments in Time, tokenized with strides of 0, 2, and 4 for different runs, each sequence comprising 16 frames to capture simultaneous actions over multiple epochs.

Appendix C Additional Results

Additional results for sequential prompting are presented, including:

Figure 15 illustrates the model’s capability in interpreting hand-drawn sketches from ImageNet-Sketch. We construct a visual sentence from a sequence of 15 images from ImageNet-Sketch and then ask the model to predict the subsequent image. This method evaluates LVM’s proficiency in interpreting and understanding hand-drawn sketches.

3D Rotation about arbitrary axes:

In our evaluation set for Objaverse, we adopt a range of unseen objects to test LVM’s ability to handle arbitrary axis rotation. The model predicts the next 4 images based on a visual sentence of 16 images. As illustrated in Figure 16, LVM demonstrates its capacity to reason about the direction of spatial rotation based on the context provided by the prompt, leading to reasonable predictions. For this task, LVM attains a perplexity of 11.8.

Frames Prediction:

Figures 17 to 22 demonstrate frame prediction using the evaluation set from Kinetics 700 dataset. The model predicts the next 4 frames based on a visual sentence of 16 frames. The Fréchet Inception Distance (FID) score for single-frame prediction conditioned on 15 frames is 21.018, indicating the LVM’s proficiency in understanding spatial and temporal dynamics.

C.2 Analogy Prompting

Further results for analogy prompting in various contexts are provided, highlighting the model’s adaptability and understanding in different scenarios.

In Figure 23, the pose estimation analogy is constructed using the visual sentence of “image-to-joint”, where the model predicts poses from given images. This assesses the model’s ability to interpret analogy pairs and understand human poses and joint relationships.

Depth Estimation Analogy:

Figure 24 presents the “image-to-depth” analogy for depth estimation. The visualizations utilize the validation set from , whose annotations are generated by DPT , and re-normalised to following .

Surface Normal Estimation Analogy:

The “image-to-surface normal image” analogy is depicted in Figure 25. This analogy tests the model’s depth of understanding of 3D structures from 2D data. Despite inaccuracies in some surface normal images from the prompts, our model shows notable robustness and generalization.

Semantic Segmentation Analogy:

Results for the “image-to-segmentation” analogy are shown in Figure 26, emphasizing semantic segmentation. The visualizations are based on the validation set from ADE20K.

Edge Detection Analogy:

Results for the “image-to-edge” analogy are shown in Figure 27, emphasizing edge detection. The visualizations are based on the validation set from , annotated using DexiNed .

Image Inpainting Analogy:

In Figure 28, the “partially masked image-to-image” analogy is explored, demonstrating the model’s capabilities in image inpainting. The model is challenged with different mask ratios, showing significant semantic understanding, as evidenced by a Mean Squared Error (MSE) of 0.106.

Image Colorization Analogy:

Figure 29 shows the “gray-scale image-to-image” analogy for image colorization. This test showcases the model’s ability to handle complex image scenarios, with an MSE of 0.51.

Derain Analogy:

Figure 30 shows the “rainy image-to-image” analogy for image deraining.