Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration

Chenyang Lyu, Minghao Wu, Longyue Wang, Xinting Huang, Bingshuai Liu, Zefeng Du, Shuming Shi, Zhaopeng Tu

Introduction

Instruction-tuned large language models (LLMs) have demonstrated impressive capabilities across various domains, exhibiting zero-shot generalization without the need for task-specific fine-tuning (Ouyang et al., 2022; Wei et al., 2022; Sanh et al., 2022; Chung et al., 2022; OpenAI, 2023). However, these models are primarily limited to processing text-based data. Previous research on multi-modal pre-training has shown promise in aligning knowledge from different modalities within a shared latent space (Wang et al., 2022a; Alayrac et al., 2022; Bao et al., 2022; Wang et al., 2022b). Furthermore, there is a recent line of research papers focusing on enabling multi-modal pre-trained models to understand and follow instructions (Xu et al., 2022; Zhu et al., 2023; Liu et al., 2023; Li et al., 2023a; Gong et al., 2023; Dai et al., 2023; Su et al., 2023; Huang et al., 2023).

In this work, we propose Macaw-LLM, a multi-modal instruction-tuned LLM that integrates four modalities (image, video, audio, and text) into a single model. We propose a novel alignment approach that aligns multi-modal features to the embeddings of LLMs, producing aligned features that are closer to the textual features of language models and can be naturally injected into the input sequence of LLMs. A key motivation for our approach is to streamline the adaptation process for LLMs. In particular, Macaw-LLM employs a one-stage instruction fine-tuning process, which simplifies training. Previous multi-modal systems typically require two-stage training (Li et al., 2023c; Zhu et al., 2023; Liu et al., 2023; Dai et al., 2023), where the first stage usually trains the projection layer to align multi-modal features with text features, and the second stage is general instruction fine-tuning of the LLM. In contrast, our approach aligns the multi-modal features directly to the embedding layer of the LLM, so that alignment and instruction tuning are learned jointly in a single stage.

To address the limitations of current multi-modal datasets that predominantly emphasize specific task types, we create our Macaw-LLM instruction dataset, which is described in Section 4. This dataset covers a wide range of instructional tasks and combines various data modalities, making it more diverse and better-suited for multi-modal instruction-tuned LLMs. We utilize the remarkable generative capability of current LLMs, such as GPT-3.5-Turbo, to curate this dataset, ensuring the target text properly aligns with human instructions.

Our contributions in this work can be summarized as follows:

We propose a novel architecture for multi-modal language modeling, which jointly learns to align multi-modal features with textual features and to generate output sequences.

We release the Macaw-LLM instruction dataset, a large-scale multi-modal instruction dataset that covers diverse instructional tasks leveraging image and video modalities, which facilitates future work on multi-modal LLMs.

Related Work

Large language models (LLMs) have showcased exceptional generative capabilities in a wide range of natural language processing (NLP) tasks (Brown et al., 2020; Thoppilan et al., 2022; Hoffmann et al., 2022; Chowdhery et al., 2022). By leveraging techniques such as supervised instruction tuning and reinforcement learning from human feedback (RLHF), LLMs exhibit remarkable few- and zero-shot generalization capabilities (Ouyang et al., 2022; Wei et al., 2022; Sanh et al., 2022; Chung et al., 2022; Muennighoff et al., 2022; OpenAI, 2023; Anil et al., 2023). Recently, Wang et al. (2022c) highlight the lack of diversity in human-written instructions and demonstrate that machine-generated instructions can be used for instruction tuning. Since then, several instruction-tuned LLMs have been fine-tuned using various machine-generated instruction datasets (Taori et al., 2023; Chiang et al., 2023; Li et al., 2023b). More surprisingly, Wu et al. (2023b) reveal that instruction-following is not solely a property of LLMs, as even relatively small language models can follow instructions when fine-tuned on large-scale instruction datasets.

Multi-Modality

Drawing inspiration from the human learning process, artificial intelligence (AI) researchers are actively exploring the combination of different modalities to train deep learning models. With the success of LLMs, feature alignment among multiple modalities has attracted great interest for its applications. One line of research learns a joint embedding space for multiple modalities (Radford et al., 2021; Baevski et al., 2022; Girdhar et al., 2023). Other studies combine pre-trained vision-only and language-only models, showcasing impressive zero-shot capabilities (Alayrac et al., 2022; Li et al., 2023c; Su et al., 2022). More recently, a number of works explore enabling multi-modal LLMs to follow instructions (Zhu et al., 2023; Ye et al., 2023; Li et al., 2023a; Chen et al., 2023; Gong et al., 2023; Dai et al., 2023). Xu et al. (2022) introduce MultiInstruct, the first multi-modal instruction tuning benchmark dataset covering a wide range of multi-modal tasks and categories. Liu et al. (2023) explore multi-modal instruction tuning using machine-generated data. Su et al. (2023) enable textual LLMs to support six modalities using the parameter-efficient fine-tuning technique LoRA.

Our Work

In this work, we propose Macaw-LLM, a multi-modal LLM that effectively integrates information from visual, audio, and textual modalities, enabling it to comprehend and execute instructions accurately.

Methodology

In this section, we provide a comprehensive description of Macaw-LLM. We begin by presenting an outline of the model architecture, followed by a detailed description of each individual module within Macaw-LLM, namely the modality module, alignment module, and cognitive module. Lastly, we provide an in-depth explanation of the training process of Macaw-LLM.

We present an overview of Macaw-LLM in this section. As shown in Figure 1, there are three major modules in Macaw-LLM as follows:

Modality Module: Existing LLMs primarily focus on processing textual information. To incorporate additional modalities such as visual and audio data, we integrate extra modality encoders into Macaw-LLM. This enhancement enables our Macaw-LLM to handle multiple modalities effectively.

Alignment Module: Since each modality encoder is trained independently, the learned representations of different modalities may not be directly compatible. To address this, we propose the alignment module, which unifies the representations from different modalities, enabling effective integration of multi-modal information.

Cognitive Module: LLMs have demonstrated remarkable capability in understanding and following human instructions. In Macaw-LLM, we leverage pretrained LLMs as our cognitive module, which forms the foundation of Macaw-LLM. It is worth noting that the cognitive module also serves as the textual modality encoder in our approach.

Figure 1 provides a visual representation of the Macaw-LLM architecture, while Section 3.2 and Section 3.3 offer detailed explanations of the modality module and alignment module, respectively. For the cognitive module, the effectiveness of instruction-tuned LLMs has been demonstrated by several previous works (Ouyang et al., 2022; Wei et al., 2022; OpenAI, 2023; Taori et al., 2023; Chiang et al., 2023; Anil et al., 2023), and we follow their practices in Macaw-LLM.

Modality Module

Existing LLMs are highly powerful but typically limited to processing only textual information. In this section, we describe how we encode information from different modalities.

Visual Modality Encoder

Radford et al. (2021) propose CLIP, a framework that exploits a much broader source of supervision by learning directly from raw text paired with images. In this work, we use CLIP-ViT-B/16 to encode visual information, including images and video frames.

Audio Modality Encoder

Radford et al. (2022) introduce Whisper, a multilingual speech recognition model trained on a vast audio dataset with weak supervision. In Macaw-LLM, we use Whisper-base to encode the audio signals, thereby extracting meaningful representations from the audio data.

Textual Modality Encoder

LLMs are commonly pre-trained on massive text corpora, so instruction-tuned LLMs can naturally process text information. In this work, we consider LLaMA-7B (Touvron et al., 2023) as the foundation of Macaw-LLM.

We acknowledge the existence of numerous publicly available pre-trained models that can serve as modality encoders. However, we leave the investigation of their utility to future work.

Alignment Module

Modality encoders are typically trained separately, leading to potential discrepancies in the representations generated by different encoders. As a result, it becomes crucial to align these independent representations within a joint space. In this section, we outline the approach we employ to align these representations.
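
The alignment step described below relies on an attention mechanism, referred to later as Equation 1. Assuming this is standard scaled dot-product attention over queries $\bm{Q}$, keys $\bm{K}$, and values $\bm{V}$ (these symbols and the value dimension $d_{v}$ are introduced here for illustration), it can be written as:

```latex
\begin{equation}
\mathrm{Attn}(\bm{Q}, \bm{K}, \bm{V})
  = \mathrm{softmax}\!\left(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d_{k}}}\right)\bm{V},
\qquad
\bm{Q} \in \mathbb{R}^{n_{q} \times d_{k}},\;
\bm{K} \in \mathbb{R}^{n_{k} \times d_{k}},\;
\bm{V} \in \mathbb{R}^{n_{k} \times d_{v}}
\tag{1}
\end{equation}
```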

where $d_{k}$ is the dimensionality of the key and query vectors, and $n_{q}$ and $n_{k}$ are the number of queries and keys, respectively.

Modality Alignment

Encoding: We first leverage the pre-trained models, CLIP and Whisper, to encode multi-modal features:
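
One compact way to write this step is sketched here. It assumes ${\bm{x}}_{i}$, ${\bm{x}}_{v}$, and ${\bm{x}}_{a}$ denote the raw image, video, and audio inputs (these input symbols are an assumption; the subscripts follow those used for the later representations):

```latex
\begin{equation}
{\bm{h}}_{i} = \mathrm{CLIP}({\bm{x}}_{i}), \qquad
{\bm{h}}_{v} = \mathrm{CLIP}({\bm{x}}_{v}), \qquad
{\bm{h}}_{a} = \mathrm{Whisper}({\bm{x}}_{a})
\tag{2}
\end{equation}
```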

Transformation: To reduce computational costs and minimize the number of tokens in the prefix, we employ a 1-D convolutional layer to compress the length of the multi-modal features to a smaller, fixed value. Subsequently, a linear layer adjusts the hidden size of the features to match the size of the LLM embeddings as follows:
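
Following the order given in the text (convolution first, then the linear layer), this step can be sketched as below. The operator names $\mathrm{Conv1D}$ and $\mathrm{Linear}$ and the index $m$ are notational assumptions:

```latex
\begin{equation}
{\bm{h}}_{m}^{\prime} = \mathrm{Linear}\big(\mathrm{Conv1D}({\bm{h}}_{m})\big),
\qquad m \in \{i, v, a\}
\tag{3}
\end{equation}
```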

Alignment: Each modality encoder is trained separately, resulting in distinct representations for different modalities. To establish a common representation space, these representations must be aligned across modalities. In this work, we treat the transformed visual and audio modality representations obtained in Equation 3 as the soft tokens of the LLM (the cognitive module), so we propose to align the visual and audio representations with the textual embedding space using the attention mechanism in Equation 1 as follows:
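
A plausible form of this step uses the transformed features as queries and the LLM's token-embedding matrix as both keys and values. The symbol $\bm{E}$ for that matrix, and its use as keys and values, are assumptions made for this sketch:

```latex
\begin{equation}
{\bm{h}}^{a} = \mathrm{Attn}({\bm{h}}^{\prime}, \bm{E}, \bm{E})
\end{equation}
```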

where ${\bm{h}}^{\prime}$ is the modality representation obtained in Equation 3 (i.e., ${\bm{h}}_{i}^{\prime}$, ${\bm{h}}_{v}^{\prime}$, and ${\bm{h}}_{a}^{\prime}$) and ${\bm{h}}^{a}$ is the corresponding aligned representation, specifically ${\bm{h}}^{a}_{i}$, ${\bm{h}}^{a}_{v}$, and ${\bm{h}}^{a}_{a}$. After this attention-based alignment, the LLM (cognitive module) can seamlessly process the representations from various modalities.
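
To make the alignment step concrete, here is a minimal pure-Python sketch of attention in which each transformed modality vector attends over a toy embedding matrix. It assumes the keys and values are both the token-embedding matrix; the function names and toy numbers are hypothetical and are not taken from the Macaw-LLM code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_to_embeddings(h_prime, embed_matrix):
    """Scaled dot-product attention with queries h_prime and
    keys = values = embed_matrix (an assumption of this sketch)."""
    dim = len(embed_matrix[0])
    aligned, weights = [], []
    for query in h_prime:
        if len(query) != dim:
            raise ValueError("query and embedding dimensions must match")
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embed_matrix
        ]
        w = softmax(scores)
        # Each aligned vector is a weighted average of embedding rows.
        aligned.append([
            sum(w[t] * embed_matrix[t][c] for t in range(len(embed_matrix)))
            for c in range(dim)
        ])
        weights.append(w)
    return aligned, weights


# Toy example: 3 "token embeddings" of size 2, and 2 modality vectors.
embed = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_prime = [[2.0, 0.0], [0.0, 3.0]]
aligned, weights = align_to_embeddings(h_prime, embed)
```

Because every aligned vector is a convex combination of embedding rows, it lies inside the region spanned by the textual embeddings, which is one way to read the claim that the aligned features are closer to the LLM's textual features.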

Integration: The integration of aligned modality representations into the instruction can be achieved effortlessly through the concatenation operation. Given the aligned modality representations, the integration can be defined as follows:
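
Using the concatenation notation defined below, the integration can be sketched as follows. The ordering of the modality blocks before the text embeddings is an assumption of this sketch:

```latex
\begin{equation}
{\bm{x}} = \big[\, {\bm{h}}^{a}_{i} : {\bm{h}}^{a}_{v} : {\bm{h}}^{a}_{a} : \mathrm{Embed}({\bm{x}}_{\textrm{t}}) \,\big]
\end{equation}
```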

where $[:]$ represents the concatenation operation, ${\bm{x}}$ represents the multi-modal instruction, ${\bm{x}}_{\textrm{t}}$ represents the sequence of tokens in the textual instruction, and $\textrm{Embed}({\bm{x}}_{\textrm{t}})$ represents the sequence of embeddings of ${\bm{x}}_{\textrm{t}}$ .
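
The concatenation can be illustrated with a short sketch that places the aligned modality vectors before the text embeddings. The function name, the dictionary layout, and the modality order are hypothetical choices made for illustration:

```python
def build_input_sequence(text_embeddings, aligned=None,
                         order=("image", "video", "audio")):
    """Concatenate aligned modality vectors with text embeddings.

    `aligned` maps a modality name to its list of aligned vectors;
    modalities that are absent are simply skipped.
    """
    if not text_embeddings:
        raise ValueError("a textual instruction is always required")
    aligned = aligned or {}
    sequence = []
    for modality in order:
        sequence.extend(aligned.get(modality, []))
    sequence.extend(text_embeddings)
    return sequence


# Toy example: one image vector, two audio vectors, two text embeddings.
text = [[0.1, 0.2], [0.3, 0.4]]
modalities = {"image": [[1.0, 1.0]], "audio": [[2.0, 2.0], [3.0, 3.0]]}
sequence = build_input_sequence(text, modalities)
```

Since the modality vectors already live in the embedding dimension, no change to the LLM's input interface is needed beyond feeding embeddings instead of token ids.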

In this section, we describe how we align the multi-modality representation into a shared representation space using the attention mechanism. It is important to note that our model, Macaw-LLM, has the capability to process multiple modalities concurrently, while the textual instruction ${\bm{x}}_{\textrm{t}}$ is always necessary as part of the instruction ${\bm{x}}$ . We intend to investigate the direct utilization of visual or audio instructions in our future work.

One-Step Instruction Fine-Tuning

The common multi-modal practice in previous works involves two-step training (Li et al., 2023c; Liu et al., 2023; Dai et al., 2023). The first step focuses on training the projection layer to align multi-modal features with textual features, while the second step involves general instruction fine-tuning of the LLM. In contrast, our approach, Macaw-LLM, simplifies the adaptation process by employing a one-step instruction fine-tuning approach. This approach ensures coherent alignment across the modalities and eliminates the potential risk of error propagation that can occur in multi-step fine-tuning procedures.

In this work, we fine-tune all the parameters $\bm{\theta}$ in Macaw-LLM, and the objective is to minimize the negative log-likelihood over the response ${\bm{y}}$ with respect to $\bm{\theta}$ as follows:
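
Assuming the standard autoregressive factorization, with ${\bm{y}}_{<j}$ denoting the tokens before position $j$ (this symbol and the name $\mathcal{L}$ are notational assumptions), the objective can be written as:

```latex
\begin{equation}
\mathcal{L}({\bm{y}}; \bm{\theta})
  = -\sum_{j=1}^{N} \log P\big(y_{j} \mid {\bm{y}}_{<j}, {\bm{x}}; \bm{\theta}\big)
\end{equation}
```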

where $N$ denotes the number of tokens in ${\bm{y}}$ and $y_{j}$ is the $j$ -th token in ${\bm{y}}$ . By employing such a one-step fine-tuning strategy, Macaw-LLM can effectively harmonize the different modules.
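
As a small numerical illustration of this objective, the sketch below sums the negative log-probabilities assigned to the reference tokens. The function name and the toy probabilities are hypothetical; in practice the probabilities come from the model's output distribution at each position.

```python
import math


def negative_log_likelihood(token_probs):
    """Sum of -log p over the response tokens.

    token_probs[j] is the probability the model assigns to the
    reference token y_j given the instruction and earlier tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)


# A response of two tokens predicted with probabilities 0.5 and 0.25.
loss = negative_log_likelihood([0.5, 0.25])
```

Here the loss equals $\log 8$, since $-(\log 0.5 + \log 0.25) = \log 2 + \log 4$; a response predicted with probability 1 at every position would give a loss of zero.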

Macaw-LLM Instruction Dataset

Current multi-modal datasets, such as visual question answering (Antol et al., 2015; Goyal et al., 2017), summarization (Li et al., 2017; Jangra et al., 2023), and dialogue (Shuster et al., 2021; Sun et al., 2022), predominantly emphasize specific task types, resulting in a limited diversity of tasks. Additionally, the target text in these datasets often lacks proper alignment with the style of human-written text, making it difficult for models fine-tuned on such data to effectively follow human instructions. To address these limitations, we utilize the remarkable generative capability of current LLMs (i.e. GPT-3.5-Turbo) to curate our Macaw-LLM instruction dataset.

To generate the dataset, we utilize the power of GPT-3.5-Turbo. We provide it with a prompt in the form of an image or video caption (see Figure 3). To optimize the generation process and improve efficiency, we generate 10 instruction-response pairs within a single query. For image caption data, we rely on the MS COCO dataset (Lin et al., 2014). It consists of 328,000 images accompanied by captions. From this dataset, we randomly select a subset of 10,000 images with their respective captions to create our dataset. In addition to image data, we incorporate video caption data from two datasets: Charades (Sigurdsson et al., 2016) and AVSD (AlAmri et al., 2019). These datasets collectively contain 9,848 videos with captions, which we utilize to create our own dataset.

We repeat this process and obtain approximately 69K examples based on COCO image captions and about 50K examples based on Charades and AVSD video captions. The dataset creation process is illustrated in Figure 2. Table 1 provides statistics about the dataset, including the number of items, the word count of instructions and responses, and examples of each type.

Our current dataset is focused on single-turn dialogues, but we acknowledge the significance of including multi-turn dialogues and expanding the dataset to encompass a wider range of multi-modal content. We are actively working on incorporating multi-turn dialogues and diversifying the dataset, which should benefit the fine-tuning of large language models (LLMs).

Experimental Setup

In this study, we utilize instruction data from three different sources:

Text instruction dataset: For textual instruction-tuning, we make use of the Alpaca instruction dataset (Taori et al., 2023), which comprises approximately 52,000 instruction-response examples distilled from the Text-Davinci-003 model.

Image instruction dataset: To create an image instruction dataset, we curate around 69K instruction-response pairs by generating them from COCO image captions (Lin et al., 2014) using GPT-3.5-Turbo as described in Section 4.

Video instruction data: We generate approximately 50K video instruction-response examples by utilizing the video captions from the Charades (Sigurdsson et al., 2016) and AVSD (AlAmri et al., 2019) datasets using GPT-3.5-Turbo as described in Section 4.

In practice, we randomly sample 50K examples from each type of instruction data and combine them to form a final training dataset consisting of 150K examples. Note that the audio inputs are currently associated with the video instruction data and we are actively in the process of creating the audio instruction dataset.
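
The mixing procedure can be sketched as equal-sized random sampling from each source followed by a shuffle. The function name, the seed, and the toy data below are hypothetical, and the final shuffle is an assumption rather than a stated detail; with the real data, `per_source` would be 50,000.

```python
import random


def build_training_mix(sources, per_source, seed=0):
    """Sample `per_source` examples from each source and shuffle them."""
    rng = random.Random(seed)
    mix = []
    for name, examples in sources.items():
        if len(examples) < per_source:
            raise ValueError(f"source '{name}' has fewer than {per_source} examples")
        # Sampling without replacement keeps examples unique per source.
        mix.extend((name, ex) for ex in rng.sample(examples, per_source))
    rng.shuffle(mix)
    return mix


# Toy stand-ins for the text, image, and video instruction sets.
sources = {
    "text": [f"text-{i}" for i in range(8)],
    "image": [f"image-{i}" for i in range(10)],
    "video": [f"video-{i}" for i in range(6)],
}
mix = build_training_mix(sources, per_source=5)
```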

Hyperparameters

We utilize DeepSpeed (Rasley et al., 2020) for optimization during the training process. The training is conducted on 8 Nvidia A100 GPUs. For each device, the training batch size is set to 4. We employ a gradient accumulation step of 3. The model is trained for 5 epochs, with a learning rate of $3\times 10^{-5}$ . The warmup ratio is 0.03, along with a cosine learning rate scheduler. The maximum sequence length is fixed at 512. We use FP16 precision for both training and inference.
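
The stated settings imply an effective batch size and a step count that can be worked out directly. The sketch below does this arithmetic and evaluates a cosine schedule with warmup. It assumes exactly 150,000 training examples, a linear warmup shape, and the rounding conventions shown; these details are not specified in the text.

```python
import math

NUM_GPUS = 8
PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 3
NUM_EXAMPLES = 150_000   # assumed exact size of the training mix
EPOCHS = 5
PEAK_LR = 3e-5
WARMUP_RATIO = 0.03

# Examples consumed per optimizer step: 8 * 4 * 3 = 96.
effective_batch = NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
# Optimizer steps per epoch, rounding the last partial batch up: 1563.
steps_per_epoch = math.ceil(NUM_EXAMPLES / effective_batch)
# Total optimizer steps over 5 epochs: 7815.
total_steps = steps_per_epoch * EPOCHS
# Warmup steps, rounded down: 234.
warmup_steps = int(WARMUP_RATIO * total_steps)


def lr_at(step):
    """Learning rate at an optimizer step (linear warmup, cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the learning rate rises from zero to $3\times 10^{-5}$ over the first 234 steps and then decays to zero by step 7815.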

Examples

To illustrate the potential of Macaw-LLM as a basis for conversational agents, this section provides examples of the system understanding and generating responses related to visual content. These examples illustrate how Macaw-LLM processes and integrates multiple modalities of information, such as visuals and audio, together with natural language. The responses are informative, relevant, and coherent across a range of questions, which suggests potential for building human-machine communication interfaces.

We present several examples that highlight the proficiency of our Macaw-LLM in understanding and following multi-modal instructions. In Figure 4, Figure 5, and Figure 6, we showcase our system’s multi-modal ability to understand and generate responses based on an image. These examples demonstrate how our system comprehends visual content and produces high-quality, fluent responses in natural language conversations. Our system generates contextually relevant and informative answers to various questions about the image, demonstrating its capability to communicate about visual content naturally and fluently. Figure 7 and Figure 8 present two examples that demonstrate Macaw-LLM’s excellent understanding of videos. We showcase its responses to various questions related to the video content, highlighting its ability to comprehend video information effectively. Furthermore, Figure 9 demonstrates our system’s capacity to process and integrate multiple modalities of information simultaneously. In this example, in addition to answering various video-grounded questions, Macaw-LLM effectively identifies whether the dog in the video is barking or not.

In summary, the examples show that our system can generate contextually appropriate and logically consistent responses to diverse questions about visual content within a natural language conversation. Its ability to incorporate multiple modalities of information suggests potential for designing efficient interfaces for human-machine communication.

Limitations

In this section, we summarize the limitations of Macaw-LLM as follows:

Evaluation: We show some examples showcasing the multi-modal ability of Macaw-LLM. However, we acknowledge that these examples may not be adequate for accurately and comprehensively demonstrating model capabilities. Gudibande et al. (2023) highlight that instruction-tuned LLMs might not perform as well as the reported evaluation results suggest. Hence, we have concerns regarding the ability of our evaluation to provide an accurate reflection of the true capabilities of Macaw-LLM.

Single-Turn Dialogue: While our training data mainly consists of “dialog-like” instructions, these instructions are currently limited to single-turn interactions. Macaw-LLM is therefore not optimized for handling multi-turn dialogues and may not effectively leverage long-range context.

Hallucination, Toxicity and Fairness: According to empirical evidence presented by Wu et al. (2023b), instruction-tuned LLMs may encounter issues related to hallucination, toxicity, and fairness. However, we do not evaluate Macaw-LLM on these aspects due to the unavailability of suitable evaluation suites.

We acknowledge these limitations and recognize the need for addressing them in future work.

Conclusion and Future Work

In this paper, we present Macaw-LLM, a multi-modal instruction-tuned LLM that accommodates four distinct modalities: image, video, audio, and text. In addition to the standard modality module and cognitive module, we propose a novel approach to align representations from different modality encoders into a shared space. Unlike previous methods, our approach combines representation alignment and instruction tuning into a single step, mitigating potential error propagation during multi-step tuning. Furthermore, we curate the Macaw-LLM instruction dataset, a large-scale dataset of multi-modal instructions generated using GPT-3.5-Turbo. We present examples showcasing the multi-modal understanding ability of Macaw-LLM.

We discuss the limitations of our work in Section 7 and point out that current multi-modal instruction-tuned LLMs may suffer from various issues. We leave the investigation of these issues to future work. Furthermore, we intend to broaden our corpus to encompass multi-turn and multilingual dialogues. This endeavor will take advantage of the capabilities of LLMs to effectively generate and translate long-document texts (Wang et al., 2017; Lyu et al., 2023; Wang et al., 2023; Wu et al., 2023a).