LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan

cs.CV cs.AI

Introduction

With the development of the Internet and smartphones, there has been a proliferation of video websites and apps (e.g., YouTube and TikTok), leading to a substantial increase in the number of videos (Xue et al., 2022). Consequently, a set of video tasks has emerged, such as video search (Smith & Chang, 1997), video recommendation (Deldjoo et al., 2016), and video editing (Casares et al., 2002; Bonneel et al., 2014). To solve video understanding tasks, video-language (VL) pretraining trains foundation models that combine computer vision (He et al., 2016; Dosovitskiy et al., 2020) and natural language processing (Vaswani et al., 2017). These models can capture video semantics and solve downstream tasks (Karpathy et al., 2014; Mithun et al., 2018).

However, current VL pretraining frameworks are often limited to the vision and language modalities. ImageBind (Girdhar et al., 2023) introduces an indirect alignment method for multi-modal pretraining. It aligns other modalities to images, facilitating a comprehensive understanding of various modalities such as infrared (Jia et al., 2021), depth (Kim et al., 2022), audio (Piczak, 2015), and IMU (Grauman et al., 2022). In practical tasks such as zero-shot retrieval and classification, as shown in Figure 1, what is predominantly required is the alignment of each modality with language. The indirect alignment of ImageBind may therefore result in performance degradation, whereas LanguageBind does not need images as intermediaries and extends straightforwardly to additional modalities in downstream tasks.

In this paper, we propose LanguageBind, a language-based multi-modal pretraining framework that can extend video-language pretraining to multiple (N) modalities. As the language modality contains rich semantic information and is well-explored (Kenton & Toutanova, 2019; Dai et al., 2019), we use it as the binding modality across different modalities. This process maps all modalities to a unified embedding space, enabling effective semantic alignment. To improve training efficiency, we employ Low-Rank Adaptation (LoRA) (Hu et al., 2021) for fine-tuning, achieving impressive training results with minimal training iterations.

To further improve the modal integrity in pretraining and validate LanguageBind, we introduce VIDAL-10M, a dataset with five modalities that includes VL, IL (infrared-language), DL (depth-language), and AL (audio-language) data pairs. The videos in previous datasets are always truncated segments from long videos (Miech et al., 2019; Xue et al., 2022), resulting in fragmented semantics. To avoid this problem, we construct our video-text pairs from short videos with complete stories. To ensure the quality of the central language modality, we perform multi-view text generation and enhancement on VIDAL-10M.

The proposed LanguageBind ensures that we can extend vision-language pretraining to multiple (N) modalities, and our dataset VIDAL-10M benefits more downstream tasks beyond VL tasks, including video retrieval (Luo et al., 2022), depth classification (Cao et al., 2017), infrared classification (Baffa & Lattari, 2018) and audio classification (Palanisamy et al., 2020). As shown in Figure 2, LanguageBind achieves superior performance on a broad range of 15 tasks. In zero-shot text-to-video retrieval, LanguageBind achieves superior performance on four datasets, surpassing InternVideo (Wang et al., 2022c) by 1.9% on MSR-VTT (Xu et al., 2016), 8.8% on MSVD (Chen & Dolan, 2011), 6.3% on DiDeMo (Anne Hendricks et al., 2017), and 4.4% on ActivityNet (Caba Heilbron et al., 2015). For zero-shot classification on depth and infrared data, LanguageBind achieves a substantial performance advantage over ImageBind. LanguageBind attains top-1 accuracy of 87.2% and 65.1% on LLVIP and NYU-D, respectively, outperforming ImageBind by 23.8% and 11.1%. For zero-shot audio classification tasks, LanguageBind outperforms ImageBind with a 23.8% higher top-1 accuracy on the ESC-50 dataset.

We summarize our primary contributions as follows:

We propose LanguageBind, a language-based multi-modal pretraining approach. During the pretraining process, all modalities gradually align with the language modality through contrastive learning, and these modalities are unified within a shared embedding space.

We introduce VIDAL-10M, a large-scale five-modal video dataset, containing 10 million data pairs with aligned VL, IL, DL, and AL. To the best of our knowledge, VIDAL-10M is the first large-scale video dataset with depth and infrared modalities.

Extensive experiments validate the effectiveness of our dataset and approach, achieving remarkable performance in video and other modality understanding tasks.

Related work

Multi-modal pretraining begins with pretraining in vision and language. CLIP (Radford et al., 2021) pioneered the alignment of images and texts on a large-scale dataset comprising 400 million samples, effectively establishing a bridge between the image and text domains. This alignment benefits a variety of downstream tasks, including zero-shot classification and image-text retrieval. CLIP can also be used as a foundation for alignment in other modalities. For instance, CLIP4Clip (Luo et al., 2022) aligns video with text, CLAP (Wu* et al., 2023) aligns audio with text, and PointCLIP (Zhang et al., 2022) aligns point clouds with text. Recent efforts have undertaken a comprehensive exploration of multi-modal alignment through pretraining. Augmenting the alignment process with additional modalities can enhance the model’s robustness while maintaining its performance, as observed in VALOR (Chen et al., 2023a) and VAST (Chen et al., 2023b). However, as the number of modalities increases, the training paradigm required to align them effectively undergoes significant changes. Meta-transformer (Zhang et al., 2023) accommodates 12 modalities and utilizes distinct tokenizers to harmonize the embedding space across modalities. ImageBind (Girdhar et al., 2023) expands multi-modal alignment pretraining to encompass six modalities but may not perform as well in language-related tasks due to indirect alignment. In our work, we propose LanguageBind, a direct alignment mechanism designed to align alternative modalities directly with the language modality, which has the highest information density. This direct alignment mechanism yields discernible improvements in downstream task performance.

Multi-modal Datasets

Multi-modal datasets serve as the foundation for multi-modal pretraining. Initially, these datasets only consisted of videos and their corresponding categories, as shown in Table 1. HMDB-51 (Kuehne et al., 2011) and UCF-101 (Soomro et al., 2012) are examples of such datasets, which contain truncated segments from long videos with manual annotation. However, creating these datasets required significant human effort, which limited their scalability and diversity. To address this issue, researchers turned their attention to the abundance of video-text resources available on the internet. Inspired by the success of image-text datasets (Sharma et al., 2018; Changpinyo et al., 2021), they used script-based programming (Schuldt et al., 2004; Kong et al., 2019; Sigurdsson et al., 2018) to extract millions of video-text data pairs. However, acquiring data from modalities like infrared (Teledyne FLIR, 2015a; b) and depth (Silberman et al., 2012), which require specialized equipment and manual annotation, has been challenging. This has severely limited the scale of the data and its alignment with other modalities. Although existing work like ImageBind (Girdhar et al., 2023) has attempted to bind various image-paired datasets and achieve indirect semantic alignment between different modalities, this approach still faces issues of incomplete and indirect data alignment. Thus, there is an urgent need for multi-modal datasets with directly semantically aligned data pairs, especially datasets that cover five or more modalities.

Method

In this section, we present LanguageBind, a multi-modal pretraining approach designed to align the semantics of different modalities and enhance cross-modal retrieval and zero-shot classification. As shown in Figure 3, LanguageBind consists of three parts: (a) multi-modal encoders, (b) language encoder, and (c) multi-modal joint learning.

For other modalities besides language, we employ the 24-layer, 1024-dimensional vision transformer with a patch size of 14. The encoders are initialized from OpenCLIP-large (Ilharco et al., 2021). Depth and infrared are treated as RGB images, which are replicated 3 times in the channel dimension to align with RGB images. Following ImageBind, audio data is transformed into spectrograms with a duration of 10 seconds (128 mel-bins) and we repeat and pad the spectrograms. For example, a 4-second spectrogram would be repeated twice and then padded with zero for an additional 2 seconds. Similarly, we also replicate it 3 times in the channel dimension. If the duration exceeds 10 seconds, we randomly sample three 10-second audio segments, each from the front 1/3, middle 1/3, and back 1/3 of the original audio, and finally stack them together.
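The repeat-and-pad rule and the three-segment sampling for long clips can be sketched as follows. This is a minimal illustration in plain Python, assuming a spectrogram is stored as a list of time frames (each a list of mel-bin values). The function names are hypothetical, and clamping a sampled window so that it stays inside the clip is an assumption, since the text does not say how windows near the end of a clip are handled.

```python
import random

def repeat_and_pad(frames, target_len):
    """Tile a short clip as many whole times as fit, then zero-pad to target_len."""
    n = len(frames)
    if n == 0:
        raise ValueError("clip has no frames")
    if n >= target_len:
        return [list(f) for f in frames[:target_len]]
    repeats = target_len // n
    tiled = [list(f) for _ in range(repeats) for f in frames]
    zero_frame = [0.0] * len(frames[0])
    padding = [list(zero_frame) for _ in range(target_len - len(tiled))]
    return tiled + padding

def sample_three_segments(frames, target_len, rng):
    """Draw one target_len window starting in each third of a long clip."""
    n = len(frames)
    third = n // 3
    segments = []
    for k in range(3):
        lo = k * third
        hi = n if k == 2 else lo + third
        start = rng.randrange(lo, max(hi, lo + 1))
        # Assumption: clamp the start so the window never runs past the clip end.
        start = min(start, n - target_len)
        segments.append([list(f) for f in frames[start:start + target_len]])
    return segments

def prepare_audio(frames, target_len, rng):
    """Return a list of fixed-length segments (one for short clips, three for long)."""
    if len(frames) <= target_len:
        return [repeat_and_pad(frames, target_len)]
    return sample_three_segments(frames, target_len, rng)
```

With one frame per second, a 4-frame clip and `target_len=10` yields the clip twice followed by two zero frames, which matches the 4-second example in the text.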

where $\bm{P}$ is a sequence of learnable position tokens, and $i$ represents the position index of the patches.

LoRA fine-tuning

Modality extending

To extend the LanguageBind method to multiple (N) modalities, the first step involves processing the data into a sequence of tokens. Subsequently, the parameters are initialized from OpenCLIP. The encoder for different modalities is then trained through token masking and LoRA fine-tuning while keeping the language encoder frozen. Finally, this modality is aligned with the language feature space.
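As a rough illustration of the LoRA step, the sketch below follows the general formulation of Hu et al. (2021): a frozen weight matrix plus a scaled low-rank update, with the up-projection initialized to zero so that training starts from the pretrained behavior. The class name, the tiny dimensions, and the initialization scale are assumptions for illustration and are not taken from the LanguageBind implementation.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen linear layer with a trainable low-rank update: W x + (alpha / r) * B (A x)."""

    def __init__(self, weight, rank, alpha, rng):
        self.weight = weight                      # frozen pretrained weight, d_out x d_in
        d_out, d_in = len(weight), len(weight[0])
        self.scale = alpha / rank
        # A gets a small random init, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def num_trainable(self):
        """Only A and B are trained: rank * (d_in + d_out) parameters."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Because only `A` and `B` receive gradients, the number of trainable parameters grows with the rank rather than with the product of the layer dimensions.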

3.2 Language Encoder and Multi-modal Joint Learning

where $x_{i}$ is the $i$-th modality data and $y_{j}$ is the $j$-th text, and their features are normalized. $K$ and $\tau$ are the batch size and the temperature, respectively. The direct alignment of each modality $\mathbf{M}$ with language $\mathbf{T}$ enables us to significantly enhance zero-shot classification and retrieval tasks.
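For reference, a standard symmetric contrastive (InfoNCE) objective that matches the symbols defined above has the modality-to-text term $-\frac{1}{K}\sum_{i=1}^{K}\log\frac{\exp(x_i^{\top}y_i/\tau)}{\sum_{j=1}^{K}\exp(x_i^{\top}y_j/\tau)}$ and a text-to-modality term with the roles of $x$ and $y$ swapped. This is an assumed form, not necessarily the exact equation used by the authors. The sketch below computes both terms in plain Python. The function names are hypothetical, and the default temperature of 0.07 is borrowed from the initial value reported in the ablations.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    return [c / norm for c in vec]

def _mean_cross_entropy(logits):
    """Average of -log softmax(row)[i], where the matching pair sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_denominator - row[i]
    return total / len(logits)

def contrastive_losses(modality_feats, text_feats, tau=0.07):
    """Return (modality-to-text loss, text-to-modality loss) for a batch of K pairs."""
    x = [l2_normalize(v) for v in modality_feats]
    y = [l2_normalize(v) for v in text_feats]
    logits = [[sum(a * b for a, b in zip(xi, yj)) / tau for yj in y] for xi in x]
    m2t = _mean_cross_entropy(logits)
    t2m = _mean_cross_entropy([list(col) for col in zip(*logits)])
    return m2t, t2m
```

When every modality feature matches its own text and is orthogonal to the others, both terms approach zero. When all features are identical, each term equals $\log K$.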

The VIDAL-10M dataset

In this section, we will describe how to construct our VIDAL-10M dataset, including 3 million pairs of video-language data, 3 million pairs of infrared-language data, 3 million pairs of depth-language data, and 1 million pairs of audio-language data. As shown in Figure 4, the collection process consists of three main steps: visual search term database construction (Section 4.1), video and audio collection and filtering (Section 4.2), and modality generation and enhancement (Section 4.3).

To build a video dataset with rich visual concepts and diversity, we design a unique search term acquisition strategy. We leverage text data including labels and captions from various visual task datasets (YouTube-8M (Abu-El-Haija et al., 2016), MSR-VTT (Xu et al., 2016), COCO (Lin et al., 2014), AVA (Gu et al., 2018), HMDB-51 (Kuehne et al., 2011), ImageNet (Deng et al., 2009)) to create a large-scale search term database with diversity and broad applicability. Then we filter these search terms based on their frequency and employ the NLTK toolkit for part-of-speech tagging, followed by tallying the occurrences of keywords (nouns and verbs). A balanced subset of 100,000 search terms corresponding to these keywords is then extracted as the final search term database.

4.2 Video and Audio collection and filtering

During the data collection process, we use the aforementioned search terms to retrieve video-text pairs and audio-text pairs from relevant platforms, e.g., YouTube Shorts and Freesound. Regarding video collection, in order to obtain short videos with high-quality textual descriptions, we implement a filtering mechanism for the title and hashtags. Video samples with titles containing fewer than 2 words and without video hashtag labels are excluded from our dataset. Moreover, we remove irrelevant words and hashtags, such as “youtube”, “fyp”, “shorts”, etc. Furthermore, to ensure a complete, consistent, and precise depiction of the event within a single full video, we impose a duration limit of 20 seconds. Shorter videos tend to exhibit better scene coherence and event integrity and are more closely aligned with corresponding hashtags and title descriptions. Ultimately, we obtain a short video dataset that encompasses more specific rather than abstract content. Concerning audio collection, we rank the audio list on different audio platforms based on its similarity to the search terms. Additionally, we conduct filtering operations similar to those for videos, taking into account factors such as audio ratings, download counts, user comments, tags, and duration. This comprehensive approach allows us to curate and refine the audio and video content more effectively.
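The video filtering rules described above can be sketched in a few lines of Python. The record layout (`title`, `hashtags`, `duration_s`) and the function names are hypothetical. The sketch also makes two assumptions that the text leaves open: irrelevant tokens are stripped before the title is checked, and a sample is dropped when either the title is too short or no hashtags remain.

```python
IRRELEVANT_TOKENS = {"youtube", "fyp", "shorts"}  # examples named in the text
MIN_TITLE_WORDS = 2
MAX_DURATION_S = 20.0

def strip_irrelevant(tokens):
    """Drop tokens such as 'shorts' or '#fyp' regardless of case or a leading '#'."""
    return [t for t in tokens if t.lower().lstrip("#") not in IRRELEVANT_TOKENS]

def filter_video(sample):
    """Return a cleaned copy of the sample, or None if it should be excluded."""
    title_words = strip_irrelevant(sample["title"].split())
    hashtags = strip_irrelevant(sample["hashtags"])
    # Assumption: either a too-short title or missing hashtags excludes the sample.
    if len(title_words) < MIN_TITLE_WORDS or not hashtags:
        return None
    if sample["duration_s"] > MAX_DURATION_S:
        return None
    return {
        "title": " ".join(title_words),
        "hashtags": hashtags,
        "duration_s": sample["duration_s"],
    }
```

Returning `None` for rejected samples keeps the function easy to use in a list comprehension that discards empty results.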

4.3 Modality generation and enhancement

The language modality of VIDAL-10M consists of multi-view texts, including title, hashtags, keyframe captions, video captions, and enhanced captions. The detailed text generation and enhancement pipeline is illustrated in Figure 7. Hashtags in VIDAL-10M are specifically designed to highlight the main subjects and actions depicted in the video. These hashtags serve as key indicators, emphasizing the focal points and dynamic elements of the video. However, hashtags alone may not fully capture the spatial information conveyed by the video frames. To address this limitation, we leverage the image captioning model OFA (Wang et al., 2022b) to generate supplementary keyframe captions that enrich the spatial information at the keyframe level. These captions also contain local temporal information related to the video content, which is beneficial for visual-text pretraining. Besides spatial information, temporal information concealed within the video is equally significant, providing crucial insights into the progression and sequencing of events within the video. To further supplement the overall thematic and temporal information of the video, we employ the mPLUG-owl model (Ye et al., 2023) to generate video captions based on the combination of video, title, and hashtags. By leveraging the title and hashtags as accurate video labels, we guide the mPLUG-owl model to generate captions that align with the video theme, reducing potential model bias to a certain extent. Furthermore, to extract valuable information from the generated video captions, we utilize the ChatGPT model to refine and enhance the textual description, thereby greatly improving the quality of the text. By incorporating the above text components, multi-view textual descriptions provide a comprehensive and detailed representation of the video content.

Infrared and depth modality generation

In the field of depth and infrared, creating modal datasets typically requires specialized equipment and human effort, resulting in limited data. Despite the success of large-scale pretraining models (Radford et al., 2021; Wu* et al., 2023; Luo et al., 2022; Chen et al., 2023b) in NLP and CV, there remains a lack of large-scale data in this field. To address this challenge, we propose using advanced generative models specifically to construct a large-scale dataset of depth and infrared. The sRGB-TIR model (Lee et al., 2023) is used for infrared modality generation and the GLPN model (Kim et al., 2022) for depth modality generation, generating depth and infrared from keyframes in our videos. While some limitations may exist, our collection of millions of video frames and corresponding texts with highly diverse semantics can significantly reduce the presence of model biases.

Experiments and Results

In this section, we evaluate the effectiveness of LanguageBind in various downstream tasks through different experiments. First, LanguageBind’s capability to align video and text is assessed using zero-shot video-text retrieval. Additionally, we use LanguageBind to enhance the performance of downstream tasks that involve depth, infrared images, and audio. Finally, we conduct ablation experiments to analyze the impact of different parameter configurations and text descriptions on LanguageBind’s performance.

In the zero-shot video-text retrieval benchmark, we utilize ViT-L/14 as the video encoder and add temporal attention layers for fair comparison; details can be found in Appendix B. According to the results presented in Table 2, the performance of LanguageBind exceeds that of VideoCoCa (Yan et al., 2022) and OmniVL (Wang et al., 2022a) by 8.3% and 8.0%, respectively, on MSR-VTT. Compared with the ImageBind model that uses the ViT-Huge architecture, the LanguageBind model, which uses ViT-Large, achieves superior results. Furthermore, compared to models based on CLIP-Large but using more training data, LanguageBind achieves state-of-the-art (SOTA) performance on four datasets, outperforming InternVideo (Wang et al., 2022c) by 1.9% on MSR-VTT, 8.8% on MSVD, 6.3% on DiDeMo, and 4.4% on ActivityNet. We also exceed TVTSv2 (Zeng et al., 2023) by 4.4% and 3.2% on MSR-VTT and DiDeMo, respectively. Moreover, we outperform UMT-L (Li et al., 2023a) on all datasets. For a fair comparison of dataset validity, we use the ViT-B/32 model of CLIP4Clip to conduct validation experiments using the 100k subset of VIDAL-10M and the 380k subset of HowTo100M. As shown in Table 3, VIDAL-100k outperforms HT100M-380k on both the MSR-VTT and MSVD datasets, validating the effectiveness of our dataset.
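The retrieval numbers above are recall-at-K scores. A minimal sketch of how R@K can be computed from a text-to-video similarity matrix is given below. It assumes one ground-truth video per text query, located at the same index, and counts only strictly higher scores when ranking, so ties are resolved in favor of the correct item. The function name is hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks within the top k.

    similarity[i][j] is the score between text query i and video j;
    the ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of candidates that score strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)
```

For example, with three queries where two of the correct videos rank first and one ranks third, R@1 and R@2 are both 2/3 and R@3 is 1.0.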

Zero-shot X-Language classification

We compare our model with the recent state-of-the-art multi-modal pretraining models, OpenCLIP (Ilharco et al., 2021) and ImageBind (Girdhar et al., 2023), on multi-modal understanding tasks in Table 5.1. For video zero-shot classification, we outperform ImageBind by 14.0% with a smaller model on Kinetics-400 (Kay et al., 2017), and we also report the results of multi-view/crop (Simonyan & Zisserman, 2014) on OpenCLIP for further comparison. For infrared, LanguageBind exhibits a noteworthy 23.8% performance advantage over ImageBind on LLVIP and outperforms OpenCLIP on all three datasets (LLVIP, FLIR V1, and V2). For depth images, our zero-shot results on NYU-D surpass ImageBind by a substantial margin of 11.1% and outperform OpenCLIP by 19.7%. For audio, we outperform ImageBind by 10.1% on the AudioSet dataset and 1.1% on the VGGSound dataset. We outperform ImageBind by a large margin of 23.9% on the ESC-50 dataset.
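Zero-shot classification in a shared embedding space reduces to a nearest-neighbor search over class text embeddings. The sketch below illustrates that procedure with plain Python lists. The embeddings are placeholders standing in for encoder outputs, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(sample_emb, class_text_embs):
    """Pick the class whose text embedding is most similar to the sample embedding."""
    return max(class_text_embs, key=lambda name: cosine(sample_emb, class_text_embs[name]))

def top1_accuracy(sample_embs, labels, class_text_embs):
    """Share of samples whose predicted class equals the label."""
    correct = sum(
        1 for emb, label in zip(sample_embs, labels)
        if zero_shot_classify(emb, class_text_embs) == label
    )
    return correct / len(labels)
```

No classifier is trained for the target dataset: the class names, wrapped in text prompts and passed through the language encoder, act as the classifier weights.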

5.2 Zero-shot in multiple modalities

We compare zero-shot text-to-audio retrieval performance on the Clotho (Font et al., 2013) and Audiocaps (Kim et al., 2019) datasets in Table 4. On the Clotho dataset, LanguageBind surpasses AVFIC (Nagrani et al., 2022) and ImageBind by significant margins of 9.1% and 6.1%, respectively. Moreover, LanguageBind also surpasses the powerful VALOR baseline (Chen et al., 2023a) by 3.7%. The same trend is observed on the Audiocaps dataset, where LanguageBind outperforms AVFIC and ImageBind by 2.9% and 5.5%, respectively. Overall, LanguageBind significantly outperforms prior work on both benchmarks, validating that it is an efficient way to align the audio and language modalities.

Zero-shot language-based multi-modal joint retrieval

In Table 5.2, we conduct multi-modal joint retrieval to explore the complementarity of joint space. We report the R@1 scores on MSR-VTT and Place datasets, while reporting accuracy on other datasets. For MSR-VTT, we only evaluate using videos that include audio. Integrating audio embeddings for video-language retrieval further improves performance, increasing it from 41.4 to 42.0. Similar trends have been observed in other modalities, where each modality has the potential to enhance the performance when combined with other modalities. These results demonstrate that LanguageBind is capable of learning a more consistent feature space.

Emergent zero-shot retrieval

As shown in Table 5.2, we explore emergent zero-shot retrieval performance on four datasets covering RGB images, audio, infrared, and depth. Due to the novelty of our approach, there are no “fair” baseline models for comparison. Nonetheless, we compare our results with ImageBind, which aligns with images directly. For example, we achieve R@1 scores of 10.6 and 10.0 on AVE (Tian et al., 2018) and VGGS, respectively. On each benchmark, emergent zero-shot retrieval achieves significant gains, even approaching the results obtained by incorporating textual features. These results suggest that LanguageBind aligns various modalities and implicitly transfers text supervision associated with specific modalities and tasks.

5.3 Training Loss and Architecture

Following ImageBind, we mainly focus on depth and infrared, which are visual and spatial modalities. We report the R@1 score for the Clotho and Audiocaps datasets and top-1 accuracy for the others. We provide more ablations in Appendix E.

Training epochs. We conduct an experiment in Table 7a to study the effect of training epochs, which shows that LoRA fine-tuning is highly effective. Although a 3-epoch training regimen yields superior accuracy, we chose to train for a single epoch, balancing performance against training cost.

Training batch size. In Table 7b, we evaluate the effect of batch size on representation learning. The experiments show that a larger batch size is not necessarily better; a batch size of 1,024 performs best.

Training strategy. As indicated by Table 7c, we compare three different strategies. Training from scratch exhibits the poorest performance, likely due to the lack of prior knowledge from CLIP pretraining. Full tuning shows significant improvement compared to training from scratch, which highlights the positive impact of leveraging prior knowledge in the form of pre-trained weights. Meanwhile, the LoRA method stands out for its advantages in time and memory cost: it requires less time and memory than full tuning. Furthermore, LoRA outperforms full tuning on multiple datasets such as LLVIP, FLIRv1, and Clotho. This indicates that LoRA is not only efficient but also effective in learning new knowledge specific to different domains while better retaining the previously acquired knowledge from the pre-trained OpenCLIP models.

Rank of LoRA. We examine prevalent rank configurations for LoRA, as detailed in Table 7d. We observe that smaller rank values lead to more significant performance improvements, whereas larger ranks tend to decrease performance. This trend may be attributed to potential overfitting of the model.

Temperature for loss. We examine the impact of different temperature settings in Table 7e. We find that a learnable temperature initialized at 0.07 performs best, outperforming the fixed temperature strategy proposed by ImageBind.

Masked ratio. We explore the impact of different mask ratios in Table 7f. The results show that a mask ratio of 0.5 achieves the highest performance while requiring only a quarter of the computational resources, in line with the findings of FLIP (Li et al., 2023b).
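One way to see where a figure of roughly one quarter could come from is that the self-attention score computation scales with the square of the number of tokens, so keeping half of the tokens leaves a quarter of the token pairs. The sketch below shows random token masking and this ratio. It covers only the attention-score term (other layers scale linearly with token count), the function names are hypothetical, and the 256-token example assumes a 224×224 input with a patch size of 14.

```python
import random

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; return the kept tokens and their indices."""
    n_keep = max(1, int(len(tokens) * (1.0 - mask_ratio)))
    keep_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in keep_idx], keep_idx

def attention_pair_fraction(n_tokens, mask_ratio):
    """Fraction of token pairs left in self-attention after masking."""
    n_keep = max(1, int(n_tokens * (1.0 - mask_ratio)))
    return (n_keep * n_keep) / (n_tokens * n_tokens)
```

For 256 patch tokens and a mask ratio of 0.5, the encoder sees 128 tokens, and the attention score matrix shrinks from 256×256 to 128×128 entries.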

Conclusion

In this work, we propose LanguageBind, a language-based semantic alignment method for multi-modal pretraining. We employ contrastive learning to establish modality semantic alignment between the language modality and all other modalities. To improve modal integrity, we also construct the first large-scale multi-modal dataset directly aligned to the language modality, VIDAL-10M, comprising 10 million aligned VL, IL, DL, and AL pairs. Extensive experimental results, including zero-shot X-language comprehension and indirect alignment between different modalities, demonstrate the effectiveness of LanguageBind’s multimodal alignment and complementary capabilities, as well as the effectiveness of VIDAL-10M.

Reproducibility Statement

We provide a comprehensive overview of the multi-modal encoder, detailing its architecture and functionality in Section 3.1.

We outline the language encoder in Section 3.2.

We expound on the methodologies employed for multi-modal joint learning in Section 3.2.

For VIDAL-10M dataset construction details:

We describe the procedures employed to construct the search term database in Section 4.1.

We provide insights into the strategies used for collecting and filtering video and audio data within VIDAL-10M in Section 4.2.

We elaborate on the generation of infrared and depth data, as well as the processes involved in multi-view text generation and enhancement, in Section 4.3.

We promise to release the VIDAL-10M dataset upon publication.

We describe in detail the training hyperparameters in Appendix B.

We describe the setup of the downstream task datasets in Appendix C.

References

Appendix

Appendix A Statistics of VIDAL-10M Dataset

In order to build a video dataset with rich visual concepts and diversity, we develop a unique but simple search term acquisition strategy. This strategy involves obtaining search terms from various visual datasets (as shown in Table 8). Subsequently, we use these search terms to gather videos from the YouTube Shorts platform, which has become a popular source for video data due to its abundance and diverse content. We collect videos in various categories, including sports, animals, nature, etc., resulting in a large and diverse dataset. Examples of video-audio-text-depth-infrared pairs in the VIDAL-10M dataset are shown in Figure 6. Moreover, to ensure data quality, we manually design a list of stop words that are filtered from our datasets. These words include terms such as “bts”, “bmw”, and “nfl”, among others, that are not relevant to our research.

Furthermore, we analyze the distribution of video categories with varying durations in our datasets, as illustrated in Figure 8. The normal distribution pattern observed in this analysis indicates that our dataset covers a wide range of concepts. Besides, we show the proportions of each category across different duration grades in the VIDAL-10M dataset in Figure 9.

FPS, Aspect ratio and Resolution

The first aspect examined in the dataset is the Frames Per Second (FPS) of the videos. FPS refers to the number of frames or images displayed per second in a video. The aspect ratio of a video represents the proportional relationship between its width and height dimensions. It is a critical factor in determining the visual presentation and viewing experience of the videos. The distribution of FPS and aspect ratios in Figure 10 provides insights into the smoothness and fluidity of the recorded content and sheds light on the various formats and orientations used. Video resolution refers to the number of pixels in each dimension that a video contains. It directly affects the clarity, sharpness, and level of detail in the visual content. Examining the distribution of resolutions (Figure 11) in the dataset provides an understanding of the available video quality and the technological capabilities of the recorded material.

Appendix B Pretraining details

In this section, we introduce our training configuration.

For the CLIP4Clip-based video-text retrieval experiments, which verify that the VIDAL-10M dataset is highly aligned, we adopt the training framework of CLIP4Clip. The model is initialized from ViT-B/32, and the remaining parameters follow the default settings, except that we train for 1 epoch with a batch size of 512. For the LanguageBind-based video-text retrieval, we add a temporal attention layer before each spatial attention layer, following AIM (Yang et al., 2023). The temporal attention is initialized from the spatial attention, and LoRA is applied only to the temporal attention. We add a temporal position embedding before each temporal attention. Detailed results are shown in Table 11, Table 12, and Table 13. For zero-shot video classification, the text templates are sourced from OpenCLIP, with “photo” replaced by “video” across all templates.

Depth-Language.

The model is initialized from OpenCLIP with a frozen language encoder. For each sample, we randomly select a depth image from the video sequence. We then resize the frame so that its short edge has a length of 256 and apply a center crop to obtain dimensions of 224×224. Additionally, we triple the number of channels in the depth image. The text templates employed for zero-shot classification are sourced from OpenCLIP, with “photo” replaced by “depth photo” across all templates. This alteration yields an approximate performance gain of 1%.
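The depth preprocessing steps (short-edge resize, center crop, channel tripling) can be sketched in plain Python on a single-channel image stored as a list of rows. The interpolation method is not stated in the text, so nearest-neighbor sampling is used here purely as an assumption, and the function names are hypothetical.

```python
def resize_short_edge(img, short=256):
    """Resize so the shorter side equals `short` (nearest-neighbor, an assumption)."""
    h, w = len(img), len(img[0])
    scale = short / min(h, w)
    new_h = short if h <= w else round(h * scale)
    new_w = short if w < h else round(w * scale)
    rows = [min(h - 1, int(r / scale)) for r in range(new_h)]
    cols = [min(w - 1, int(c / scale)) for c in range(new_w)]
    return [[img[r][c] for c in cols] for r in rows]

def center_crop(img, size=224):
    """Cut a size x size window from the middle of the image."""
    h, w = len(img), len(img[0])
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

def to_three_channels(img):
    """Replicate a single-channel image 3 times along a leading channel axis."""
    return [[list(row) for row in img] for _ in range(3)]

def preprocess_depth(img, short=256, crop=224):
    """Apply resize, center crop, and channel replication in sequence."""
    return to_three_channels(center_crop(resize_short_edge(img, short), crop))
```

The output has shape 3×224×224 with the default arguments, which matches the input layout expected by an RGB vision transformer.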

Infrared-Language.

The training setup follows that of depth-language. Note that the text templates for infrared images retain the “photo” designation, as no discernible performance improvement is observed from modifying it.

Audio-Language.

The data are preprocessed as in Section 3.1. Unlike depth and infrared, spectrograms differ considerably from conventional visual images. The model is therefore less prone to overfitting during training, so we increase the number of training epochs and the rank of LoRA. Additionally, we replace “the/a photo of” with “the/a sound of” across all templates for audio zero-shot classification.
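The per-modality template changes described in this appendix (video, depth, infrared, and audio) amount to simple string substitutions. The sketch below illustrates them. The two base templates are illustrative stand-ins in the style of CLIP prompts, not the actual OpenCLIP template list, and the function names are hypothetical.

```python
# Illustrative stand-ins; the real OpenCLIP template list is much longer.
BASE_TEMPLATES = ["a photo of a {}.", "the photo of the {}."]

def adapt_templates(templates, modality):
    """Rewrite prompt templates for a given modality as described in the text."""
    if modality == "video":
        return [t.replace("photo", "video") for t in templates]
    if modality == "depth":
        return [t.replace("photo", "depth photo") for t in templates]
    if modality == "infrared":
        return list(templates)  # templates keep the word "photo"
    if modality == "audio":
        return [
            t.replace("a photo of", "a sound of").replace("the photo of", "the sound of")
            for t in templates
        ]
    raise ValueError(f"unknown modality: {modality}")

def build_prompts(templates, class_name):
    """Fill every template with a class name."""
    return [t.format(class_name) for t in templates]
```

For example, `build_prompts(adapt_templates(BASE_TEMPLATES, "audio"), "dog")` produces prompts that begin with "a sound of a dog." for the first template.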

Appendix C Downstream datasets

We perform video-text retrieval experiments on 2 datasets. (a) MSR-VTT (Xu et al., 2016) comprises 10K YouTube videos paired with 200K captions. We report results on the 1K-A test subset. (b) MSVD (Chen & Dolan, 2011) consists of about 120K sentences; we report results on its test data (670 samples).

Infrared-language.

(a) LLVIP (Jia et al., 2021) is a dataset for pedestrian object detection within the infrared spectrum. Following ImageBind, we extracted all people from the images, designating all other objects as background elements. This process resulted in a dataset comprising 7,622 ‘background’ samples and 7,954 ‘person’ samples, which was subsequently employed for binary classification testing. (b) FLIR v1 (Teledyne FLIR, 2015a) offers comprehensive annotations for both thermal and visible spectrum frames. From the test data, we derived a dataset containing 11,696 images by extracting bounding boxes. This dataset encompasses 4 categories: [‘bicycle’, ‘car’, ‘dog’, ‘person’]. (c) FLIR v2 (Teledyne FLIR, 2015b) includes 16,696 images after similar processing, which were categorized into 12 classes: [‘bike’, ‘bus’, ‘car’, ‘hydrant’, ‘light’, ‘motor’, ‘other vehicle’, ‘person’, ‘sign’, ‘skateboard’, ‘stroller’, ‘truck’].

Depth-language.

We use NYU-v2 Depth-only (NYU-D) (Silberman et al., 2012) and validate on its 654 test samples. During preprocessing, we constrain the depth images to a maximum depth of 10 meters. Following ImageBind, we reorganize the categories, resulting in a total of 10 scene categories.

Audio-language.

We validate the zero-shot classification capability with the ESC-50 (Piczak, 2015) dataset, which has 2,000 test audio clips, each uniquely labelled. For zero-shot retrieval, we use the Clotho (Font et al., 2013) dataset. Each audio clip has 5 corresponding captions, so we use text-to-audio retrieval to validate the model performance. We prepare test data following ImageBind.

Appendix D License

Unless explicitly noted otherwise, our released datasets are provided to users under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (”CC BY-NC-SA 4.0”), in conjunction with the additional terms outlined herein. The CC BY-NC-SA 4.0 license can be accessed at https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode. By downloading or utilizing our datasets from our website or other sources, you agree to adhere to the terms of CC BY-NC-SA 4.0, as well as the terms outlined in our dataset Terms. In the event of any conflict between the terms of CC BY-NC-SA 4.0 and our dataset Terms, the latter shall prevail. We once again emphasize that this dataset is exclusively intended for non-commercial purposes, such as academic research, teaching, or scientific publications. We strictly prohibit any commercial use of the dataset or any derived works, including the sale of data or utilization of data for commercial gain.

Appendix E Additional Ablation Study

In this section, we conduct extensive experiments to investigate the impact of several factors. At first, we examine the effects of different enhanced textual inputs on downstream tasks. Furthermore, we assess the impact of data volumes on pretraining. In addition, we explore various training strategies to enhance zero-shot classification. Finally, we conduct a meticulous analysis of model training configurations to ensure robust transferability.

In Table E.1, we conduct various experiments to explore how different text sources impact the language modality. We verify the effectiveness of LanguageBind, trained with text from multiple sources, across various modalities. While some text sources yield good results, we find that a single text source may not be universally suitable for all downstream tasks and datasets. For the video and depth modalities, the ChatGPT-enhanced caption proves to be advantageous. For infrared images, the OFA caption performs best on the LLVIP dataset, while the raw caption achieves the highest accuracy on FLIR v1 and v2. This is why VIDAL-10M provides multi-view textual descriptions, allowing for flexibility in selecting an appropriate text source for diverse task requirements.

E.2 Scaling the size of dataset

We analyze the impact of different data amounts on MSR-VTT and report the R@1 score for zero-shot retrieval as shown in Figure 12. Our findings indicate that an increase in data amount leads to significant improvement in recognition performance. Specifically, the performance of 3M ChatGPT-enhanced text surpasses that of 500k and 100k data by 0.9% and 1.6%, respectively.

Furthermore, the trends observed in both video-to-text retrieval and text-to-video retrieval consistently demonstrate that the interaction between modalities plays a pivotal role in enhancing the learning process. Consequently, with the expansion of data size, the textual descriptions within the VIDAL-10M dataset align more closely with the video content and demonstrate increased scalability.