MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang

cs.CV

Introduction

Recent advances in Large Language Models (LLMs) have achieved great success in Natural Language Processing (NLP). It is a natural progression to introduce multi-modality into LLMs and turn them into Multi-modal Large Language Models (MLLMs), which are able to conduct multimodal rationalization and understanding. MLLMs have shown incredible emergent capabilities in various multimodal tasks such as perception (e.g., existence, count, position, OCR), commonsense reasoning, and code reasoning, resulting in a potential path to Artificial General Intelligence (AGI). Compared to LLMs and other task-specific models, MLLMs provide a more human-like interpretation of the scenarios, a user-friendly interface for interaction, and a broader range of capabilities.

Existing vision-centric MLLMs follow the paradigm of utilizing pre-trained LLMs and a visual encoder with additional learnable modules (a Q-former or a simple projection layer). In the video field, some previous works follow this paradigm to build video MLLMs, while works in the other paradigm combine existing visual perception tools (e.g., tracking and classification) and LLMs through an Application Programming Interface (API) to build a system without training. Yet, there has previously been no exploration of a model or system based on long videos (over one minute), and there is also a lack of a standardized benchmark to evaluate the capabilities of these systems.

In this paper, we present MovieChat, a novel framework that integrates vision models and LLMs to conduct long video understanding tasks. We claim that the computation complexity, memory cost, and long-term temporal connection are the main challenges for long video understanding. The Atkinson-Shiffrin memory model proposes that short-term memory functions as a buffer of long-term memory, serving as a processor for the encoding of information into long-term memory. Inspired by this, we propose a memory mechanism to deal with long video understanding tasks, which includes a rapidly updated short-term memory and a compact, and thus sustained, long-term memory. We use a sliding window approach to extract video features and represent them in token form, which are then sequentially fed into the short-term memory frame by frame. The short-term memory has a fixed length, and when it reaches its set limit, the earliest tokens are popped and consolidated into the long-term memory. After passing through a projection layer, the video representation is fed into a large language model for interaction with the user. As shown in Fig. 1, our proposed MovieChat mechanism outperforms other existing methods in terms of Video Random Access Memory (VRAM) cost. We also release a new benchmark, MovieChat-1K, with 1K long videos and 13K manual question-answering pairs to validate the effectiveness of our proposed MovieChat.

The contributions of this work are summarized as:

We present MovieChat, a novel framework that integrates vision models and LLMs, which is the first to support long video ($>10$K frames) understanding tasks.

We propose an effective memory management mechanism to reduce the computation complexity and memory cost, while enhancing the long-term connection.

We release the first long video understanding benchmark, MovieChat-1K, with manual annotations, and conduct extensive quantitative evaluations and case studies to assess both understanding capability and inference cost.

Related Works

LLMs have achieved great success in natural language processing (NLP) tasks recently. Many works try to build MLLMs by combining models of other modalities. Flamingo bridges powerful pre-trained vision-only and language-only models and achieves state-of-the-art performance with few-shot learning. BLIP-2 proposes a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and a frozen large language model. MiniGPT-4 also aligns a frozen visual encoder with a frozen LLM, Vicuna, using just one projection layer to realize the system. Otter showcases improved instruction-following ability and in-context learning. In the video field, ChatVideo treats tracklets as the basic video unit and allows users to interact with the LLMs. VideoChat integrates video foundation models and LLMs via a learnable neural interface, excelling in spatiotemporal reasoning, event localization, and causal relationship inference. Video-LLaMA further leverages the pre-trained models ImageBind and LLaMA, bootstrapping cross-modal training in videos following BLIP-2. Yet, these methods fail to handle long video understanding because of high computation complexity, large memory cost, and weak long-term temporal connection. Therefore, our main effort is to introduce an effective memory mechanism to overcome these challenges.

2 Long Video Understanding

Understanding long videos is a challenging task in computer vision. Prior works use 3D CNNs for long-term feature banks, object/human-centric motion, or other forms as video representations. MIST decomposes dense self-attention into a cascade segment and region selection module to increase the computation efficiency for understanding minutes-long videos. Building long-form video understanding datasets is challenging and rarely explored. One prior work captures large-scale data from Kinetics-400, but only for generic event boundary detection tasks. Another creates a language grounding benchmark from audio descriptions of movies, but it lacks long-term understanding evaluation. A third successfully builds a benchmark that contains multiple sources of information (e.g., video clips, plots, and DVS) for question-answering tasks in the movie field. There are also several datasets of video-caption/description pairs among various domains, such as cooking (e.g., MPII Cooking and TACoS), instruction (e.g., HowTo100M and HiREST), Ego, and movie (e.g., MovieQA and MovieNet) from different sources such as YouTube, Twitter, and the Internet. Yet, those datasets lack diverse and fine-grained dense captioning for long videos.

3 Memory Models in Vision Tasks

There are some prior works exploring memory models in various vision tasks in videos, such as video object segmentation (VOS), multi-object tracking (MOT), visual object tracking (VOT), and action understanding. MeMOT builds a large spatiotemporal memory that stores the past observations of the tracked objects. XMem develops an architecture that incorporates multiple independent yet deeply-connected feature memory stores to handle long videos with thousands of frames. We learn from the experience of those prior works and further adopt an effective memory mechanism in combination with LLMs.

Our method focuses on reducing the redundancy of visual tokens in the video and building a memory mechanism to pass the information among a large temporal range.

MovieChat

Our proposed method, MovieChat, comprises several key components, including the frame-wise visual feature extractor, the short-term and long-term memory modules, the video projection layer, and the Large Language Model (LLM), as illustrated in Fig. 2. MovieChat is designed for understanding ultra-long videos ($>10$K frames) through interactive dialogue with the user. To address the impractical storage demands of concurrently storing a vast number of frames in both GPU memory and RAM, we employ a sliding window approach to efficiently process the video. The short-term memory module embeds dense tokens with the sliding window, and the long-term memory module is updated periodically. MovieChat supports two inference modes: Breakpoint mode is used to understand a specific moment in the video, providing insights and answers based on that particular frame or scene; Global mode, on the other hand, is employed to comprehend the entire video as a whole, enabling a comprehensive understanding of the overall content and context.

2 Visual Feature Extraction

3 Short-term Memory

Short-term memory stores the frame tokens in a temporary fixed-length buffer. The visual features extracted over $G$ sliding-window steps are used, without further processing, to construct short-term memory, which can be formulated by:

where $\mathcal{S}$ is short-term memory, and $K$ is equal to $C\times G$ . Note that we set short-term memory to contain a fixed length of $K$ frames since the role of short-term memory is to assist in video understanding based on previous short-term contextual information.

The update strategy for short-term memory is based on a First-in-First-out (FIFO) queue. As a new batch of visual tokens enters and the short-term memory reaches its capacity, we pop the currently stored frames to the memory consolidation module and clear the short-term memory. The output video feature obtained from the consolidation module augments the long-term memory and is also used to reinitialize the short-term memory. This initialization aims at communicating information between different sliding windows, thereby achieving more efficient compression.
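The update described above can be sketched as follows. This is a minimal illustration rather than the actual implementation: the function name `update_short_term`, the list-based buffers, and the toy `keep_first_and_last` consolidation rule are all hypothetical stand-ins, and the real consolidation module merges similar adjacent frames instead.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def keep_first_and_last(frames: List[Frame]) -> List[Frame]:
    """Toy stand-in for the consolidation module (hypothetical rule)."""
    return [frames[0], frames[-1]]


def update_short_term(
    short_term: List[Frame],
    long_term: List[Frame],
    new_frames: Sequence[Frame],
    capacity: int,
    consolidate: Callable[[List[Frame]], List[Frame]],
) -> None:
    """Push frames one by one; consolidate whenever the buffer is full."""
    for frame in new_frames:
        short_term.append(frame)
        if len(short_term) >= capacity:
            # Pop everything currently stored and compress it.
            merged = consolidate(list(short_term))
            # The compressed frames augment the long-term memory ...
            long_term.extend(merged)
            # ... and also reinitialize the short-term memory.
            short_term.clear()
            short_term.extend(merged)


short_memory: List[int] = []
long_memory: List[int] = []
update_short_term(short_memory, long_memory, range(40), 18, keep_first_and_last)
```

Because the short-term memory is reinitialized with the consolidated frames, each later consolidation sees both the new frames and a compressed summary of the previous window.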

4 Long-term Memory

Long-term memory can effectively avoid the problem of catastrophic knowledge forgetting, which is crucial for handling long video understanding tasks. The features stored in short-term memory are dense tokens, but due to the limitations of GPU memory and computation cost, storing all the tokens dropped from short-term memory into long-term memory buffer in sequence is infeasible. Besides, we observe significant temporal redundancy in videos, where activities span multiple frames with minimal visual changes. To this end, we propose a method to merge adjacent similar frames to simplify video feature representation and accelerate video encoding. This method transforms the dense tokens to the sparse memories, which are stored in long-term memory.

To be specific, as shown in Algorithm 1, we periodically conduct memory consolidation by merging the most similar tokens in adjacent frames following ToMe. We find that the token embeddings in the Transformer already summarize the information of each frame, so we use them to calculate the average cosine similarity $s$ of $N$ embedded tokens:
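As a hedged illustration of this similarity, the sketch below computes the average cosine similarity between each pair of adjacent frames. It assumes that frames are stored as an array of shape `(K, N, D)` and that token $j$ of one frame is compared with token $j$ of the next frame; the function name and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
import numpy as np


def adjacent_frame_similarity(frames: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Average cosine similarity between consecutive frames.

    frames: array of shape (K, N, D), K frames with N tokens of dimension D.
    Returns an array of shape (K - 1,), one value per adjacent pair.
    """
    # Normalize every token embedding to (approximately) unit length.
    norms = np.linalg.norm(frames, axis=-1, keepdims=True)
    normed = frames / (norms + eps)
    # Cosine similarity of token j in frame i with token j in frame i + 1.
    token_cos = np.sum(normed[:-1] * normed[1:], axis=-1)  # (K - 1, N)
    # Average over the N tokens of each pair of frames.
    return token_cos.mean(axis=-1)
```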

For long-term memory, the number of tokens exceeds the maximum length of the positional encoding from the pre-trained model. Our model utilizes the positional encoding mechanism following BERT, so the portion exceeding the length threshold $n$ has no available positional encoding. To handle a sufficiently long long-term memory, we adopt the hierarchically decomposed positional encoding method proposed by Su et al., which allows the absolute positional encoding to be extended from length $n$ to $n^{2}$.
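The text does not spell out the decomposition, so the following is only a sketch under stated assumptions. It uses one commonly described form of the hierarchical scheme, in which position $i \cdot n + j$ is encoded as $\alpha u_i + (1-\alpha) u_j$ with $u_i = (p_i - \alpha p_0)/(1-\alpha)$; this form, the function name, and the value $\alpha = 0.4$ are assumptions and should be checked against the original method before use.

```python
import numpy as np


def extend_positions(pos: np.ndarray, alpha: float = 0.4) -> np.ndarray:
    """Extend n trained absolute position embeddings to n * n positions.

    pos: array of shape (n, d) holding the trained embeddings p_0 .. p_{n-1}.
    alpha: mixing weight (assumed value; must not equal 1).
    Returns an array of shape (n * n, d).
    """
    # Base vectors u_i, chosen so that the first n positions are unchanged.
    base = (pos - alpha * pos[:1]) / (1.0 - alpha)
    # Position i * n + j  ->  alpha * u_i + (1 - alpha) * u_j.
    extended = alpha * base[:, None, :] + (1.0 - alpha) * base[None, :, :]
    return extended.reshape(-1, pos.shape[1])
```

With this construction the first $n$ extended encodings coincide with the trained ones, so short inputs behave as before while longer inputs receive new but structured encodings.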

5 Inference

Previous methods always use the representation of the whole video to conduct understanding and question-answering, which may fail to localize a specific moment, especially in long videos. To this end, we propose two inference modes, global and breakpoint, for long video understanding tasks as follows.

Global mode is defined as the understanding and question-answering for the whole video. In this case, we only use long-term memory $\mathcal{L}$ as the video representation $\mathbf{V}$ .

Breakpoint mode.

Breakpoint mode is distinctly defined as understanding specific moments in a video. Since events inherently possess continuity, we need to consider not only the information directly related to the moments stored in short-term memory $\mathcal{S}$ but also the information indirectly related stored in long-term memory $\mathcal{L}$ . Based on this, we hypothesize that when querying the movie at a specific moment $t$ , the video representation $\mathbf{V}$ should be the aggregation of $\mathcal{L}$ , $\mathcal{S}$ , and the current video frame feature $\mathbf{x}_{t}$ . We find that simply concatenating these items yields excellent performance and leave further exploration of additional aggregation choices for future work.
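The two modes can be summarized in a short sketch. The array shapes, the function name, and the concatenation order (long-term, then short-term, then the current frame) are assumptions made for illustration; the text only states that the three items are concatenated.

```python
from typing import Optional

import numpy as np


def build_video_representation(
    long_term: np.ndarray,
    short_term: Optional[np.ndarray] = None,
    current_frame: Optional[np.ndarray] = None,
) -> np.ndarray:
    """Return V for global mode (L only) or breakpoint mode (L, S, x_t).

    long_term: (num_long, N, D), short_term: (num_short, N, D),
    current_frame: (N, D).
    """
    if short_term is None and current_frame is None:
        # Global mode: only the long-term memory represents the video.
        return long_term
    if short_term is None or current_frame is None:
        raise ValueError("breakpoint mode needs both short_term and current_frame")
    # Breakpoint mode: concatenate along the frame axis.
    return np.concatenate([long_term, short_term, current_frame[None, ...]], axis=0)
```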

Subsequently, the video representation $\mathbf{V}$ goes through a Q-former and a linear projection layer before being fed into the LLM $\mathcal{O}$, which can be formulated as:

$$\mathbf{A}=\mathcal{O}(\mathbf{Q},\mathcal{P}(\mathbf{V}))$$

where $\mathcal{P}$ is the projection from visual space to text space, $\mathbf{A}$ represents the answer or instruction, and $\mathbf{Q}$ denotes the question.

A New Benchmark: MovieChat-1K

Previous works on building long video understanding benchmarks either focus on non-question-answering tasks (e.g., language grounding, generic event boundary detection, user engagement and movie metadata prediction, etc.) or lack long-form understanding evaluation. To better evaluate the performance of MovieChat, we collect a new benchmark for long video understanding tasks, MovieChat-1K, which contains 1K high quality video clips sourced from various movies and TV series with 14K manual annotations.

As shown in Fig. 3(a), we collect videos from 15 popular categories with varying distribution, including documentary film, detective film, animation film, and so on. Each video comprises multiple alternating scenes, contributing to a diverse and dynamic visual narrative within the collection. Fig. 3(b) shows the clip duration distribution of MovieChat-1K. Over 90% of the videos exhibit a duration ranging from 10K to 12K frames, while 14.6% of videos extend beyond 12K frames. Only 8.6% of videos have a duration of less than 10K frames.

For each video, we manually provide 1 dense caption for the whole video, 3 question-answering pairs for global mode, and 10 question-answering pairs with timestamps for breakpoint mode. Fig. 3(c) illustrates the distribution of question types in MovieChat-1K. Since MovieChat-1K is specifically designed for long video comprehension tasks, the majority of questions are open-ended, with only a quarter classified as multiple-choice questions, marked by initiators such as ‘Do,’ ‘Does,’ ‘Is,’ or ‘Are.’ We also compute the word distributions of our provided question-answer pairs. As illustrated in Fig. 4, they include common objects (people, clothes, etc.), time (day, night, etc.), scenes (indoor, outdoor, etc.), and so on. More statistics can be found in the appendix.

Experiments

We conduct quantitative and qualitative evaluations between MovieChat and previous methods. Additionally, we perform ablation studies to investigate MovieChat. Experimental settings and analyses can be found in the appendix.

We use several widely used open-ended datasets, MSVD-QA, MSRVTT-QA, and ActivityNet-QA, for short video question-answering tasks. The evaluation is assisted by an LLM with the default hyper-parameter settings. The accuracy and relative scores on a scale of $0$ to $5$ are reported. Compared to previous methods, MovieChat achieves comparable performance even though it is not specifically designed for short video question-answering tasks, as shown in Tab. 1.

Short video generative performance.

Following prior work, we employ GPT-assisted evaluation to conduct a more comprehensive comparison of the text generation performance between MovieChat and previous methods on processed ActivityNet-QA. The evaluation pipeline covers crucial metrics (including Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding and Consistency) and assigns relative scores to the generated predictions on a scale of 1-5. We present the results of the generation performance evaluation in Tab. 2. The results reveal that MovieChat performs competitively across all key aspects compared to previous methods.

Long video question-answering.

We evaluate the long video question-answering performance of MovieChat with our proposed MovieChat-1K. We split the 1,000 videos into a training set (800), a test set (100), and a validation set (100), and only use the test set for final performance evaluation. We select three recent LLM-based video understanding models (i.e., VideoChat, Video-LLaMA, and Video-ChatGPT) as the baselines. Yet, none of those methods can support such long videos ($>10$K frames). Therefore, to accommodate their length limitations in global questions, we uniformly sample from the original video up to the maximum frame count officially supported by each individual model. For breakpoint questions, we extend half of the maximum frame count before and after the breakpoint (i.e., placing the breakpoint at the center frame).
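The two sampling schemes for the baselines can be sketched as below. The function names are hypothetical, `max_frames` stands for the model-specific maximum frame count, and the clamping of the breakpoint window at the start and end of the video is an assumption, since the text does not say how boundary cases are handled.

```python
from typing import List


def uniform_sample_indices(num_frames: int, max_frames: int) -> List[int]:
    """Uniformly spaced frame indices for global questions."""
    if num_frames <= max_frames:
        return list(range(num_frames))
    step = num_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


def breakpoint_window_indices(
    num_frames: int, breakpoint: int, max_frames: int
) -> List[int]:
    """Contiguous window of frames centered on the breakpoint."""
    half = max_frames // 2
    start = max(0, breakpoint - half)
    end = min(num_frames, start + max_frames)
    # Shift the window back if it was cut off at the end of the video.
    start = max(0, end - max_frames)
    return list(range(start, end))
```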

To enhance the robustness of the results, we simultaneously employ GPT-3.5 and Claude as LLM assistants, with the additional support of human blind rating. We observe a discrepancy between the accuracy and relative score generated by the previous LLM-assisted evaluation method for video question-answering tasks. However, merely adjusting the prompt for the LLM cannot effectively address this issue. Therefore, after obtaining the accuracy and score from the LLM-assisted evaluation method, we implement manual filtering to remove results with inconsistent values, thus improving the reliability of our outcomes.

As shown in Tab. 3, compared to previous methods, MovieChat reads more video frames. In both global mode and breakpoint mode, our method maintains a performance gain in terms of the average accuracy and score provided by LLM assistants and human blind rating. We comprehensively evaluate MovieChat’s question-answering performance across different question types compared to baselines. The results indicate that our approach outperforms the baselines in both open-ended and true-false questions.

Long video generative performance.

We compare the quality of answers generated by MovieChat and previous methods in long video question-answering on MovieChat-1K. As shown in Tab. 4, with the average score provided by GPT-3.5, Claude and human blind rating, our approach continues to generate higher-quality answers even as the video contents become more extensive.

2 Ablation Study

As MovieChat incorporates a memory mechanism including short-term memory and long-term memory, it is imperative to evaluate how the proposed memory mechanism influences the performance. Tab. 5 and Tab. 6 provide the memory-dependent performance of MovieChat for long video question-answering and generative tasks with the average results of GPT-3.5 , Claude , and human blind rating. MovieChat with the memory mechanism significantly outperforms the memory-independent variant, which signifies the importance of memory mechanisms.

Hyper-parameter ablations.

We perform a series of hyperparameter ablations based on the MovieChat-1K dataset to better understand MovieChat. Fig. 5 shows the performance when ablating the length of memory buffers, consolidation length and short-term initialization with the average results of GPT-3.5, Claude, and human blind rating. The performance of MovieChat degrades when all four are significantly changed, showing the validity of our empirically chosen hyperparameters. Fig. 5 demonstrates that information obtained from the video expands with the growing length of memory buffers, while the loss of finer details intensifies with the fixed length of consolidation. Furthermore, using merged tokens for short-term initialization outperforms using the last few tokens and uniform sampling. Additionally, the length of merged tokens and the memory buffer size have a combined effect on MovieChat’s performance.

3 Case Study

We perform an extensive case study of MovieChat on a variety of open-ended long videos (such as cartoon movies and TV series) for long video question-answering, including the breakpoint mode (Q#1) and the global mode (Q#2). The evaluation is conducted between MovieChat and previous methods, as shown in Fig. 6. For Q#1 in breakpoint mode, we mark the timestamp when the question is asked. For long videos over $10$K frames, MovieChat is still capable of providing excellent responses to questions regarding both the current moment and the entire video content with less hallucination. More examples showing the long video scene understanding and temporal understanding abilities of MovieChat are available in the appendix.

Limitation

Although MovieChat has demonstrated impressive abilities in long video understanding, it is still an early-stage prototype and has some limitations, including: 1) Limited perception capacities. MovieChat’s performance is hindered by the pretrained short video understanding model. 2) Inadequate Time Processing. MovieChat provides only rough estimates of the duration proportions of events within long videos, lacking precision in temporal details.

Conclusion

In conclusion, we present an innovative video understanding system integrating video foundation models and large language models. By incorporating a memory mechanism represented by tokens in Transformers, our proposed system, MovieChat, overcomes challenges associated with analyzing long videos. MovieChat achieves state-of-the-art performance in long video understanding, surpassing existing systems limited to handling videos with few frames.

Appendix A Memory consolidation algorithm of MovieChat.

As shown in Fig. A1, for each sampled frame $x_{i}$ , we calculate its similarity with adjacent frames. After that, we select the pair with the greatest similarity, merge and replace these two frames, resulting in a new sequence. We conduct the merge operation repeatedly until the count of existing frames in short-term memory reaches the predefined value.
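A minimal sketch of this greedy procedure is given below. It assumes frames of shape `(K, N, D)`, measures similarity as the average token-wise cosine similarity between adjacent frames, and merges a pair by a plain unweighted average; the function names and the averaging rule are illustrative assumptions rather than a definitive account of Algorithm 1.

```python
import numpy as np


def _avg_cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Average cosine similarity between matching tokens of two frames."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float(np.mean(num / den))


def consolidate_frames(frames: np.ndarray, target: int) -> np.ndarray:
    """Merge the most similar adjacent frames until `target` frames remain.

    frames: array of shape (K, N, D). Returns an array of shape
    (min(K, target), N, D).
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    remaining = [frame for frame in frames]
    while len(remaining) > target:
        # Similarity of every adjacent pair in the current sequence.
        sims = [
            _avg_cosine(remaining[i], remaining[i + 1])
            for i in range(len(remaining) - 1)
        ]
        i = int(np.argmax(sims))
        # Replace the most similar pair by its average (assumed merge rule).
        merged = (remaining[i] + remaining[i + 1]) / 2.0
        remaining[i : i + 2] = [merged]
    return np.stack(remaining)
```

Because only adjacent frames are merged, the temporal order of the remaining frames is preserved.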

Appendix B MovieChat-1K Statistics Information

MovieChat-1K contains videos from 15 popular categories with varying distribution. As shown in Tab. B1, every video comprises multiple alternating scenes.

Video information and visual question-answer data format.

To the best of our knowledge, a long video understanding dataset has not yet been established. Our work represents the initial step in creating one and making it publicly available. We create MovieChat-1K, containing 1K long videos with 1K corresponding dense captions and 13K visual question-answer pairs. One visual example of these arrangements is provided in Fig. B2.

Sentence length distribution of question-answer pairs.

MovieChat-1K exhibits diverse lengths of question-answer pairs at the segmented clip level. Fig. B3 and Fig. B4 demonstrate the length distribution of question-answer pairs in different modes. Although the distribution of question-answer pairs varies between the global mode and breakpoint mode, the majority of questions are between 5 and 15 words long, while answers generally have fewer than 10 words.

Statistics of dense captions.

To facilitate a more detailed understanding of long videos, we provide a dense caption for each video. As shown in Fig. B5, MovieChat-1K exhibits diverse caption lengths in the segmented clip level. Approximately two-thirds of the clips have captions with 100-149 words, while one-fifth of the clip captions have fewer than 100 words. About 11% of clips have long captions with more than 150 words.

We also compute the word distribution of our generated captions. The resulting word distribution is presented in Fig. B6, which includes common objects (man, woman, people, girl, etc.), attributes (detective, various, small, white, etc.), locations (inside, behind, south, next, etc.), scenes (room, house, building, office, etc.), actions/events (talk, enter, leave, take, etc.), and more.

In terms of actionness, MovieChat-1K captions contain nearly the same number of verbs as the WebVid10M dataset. To evaluate this, we use the NLTK toolkit to analyze the number of verbs in captions, focusing on extracting and tagging all unique verbs. We find a total of 109,485 verbs in the WebVid10M caption dataset, while the MovieChat-1K captions contain 102,988 unique instances of verbs. While these counts may not be entirely accurate due to our simple counting method, we believe they provide a rough indication of the actionness of the two datasets.

Comparison between MovieChat-1K and other benchmarks.

MovieChat-1K provides a large-scale benchmark for long video understanding, which contains 1K movies, 1K dense captions and 13K question-answer pairs. The comparison between different datasets is shown in Tab. B2. It is evident that MovieChat-1K provides the longest average duration for movie clips. MovieQA exclusively offers question-answer pairs related to movies, while MovieGraphs supplies captions associated with movies. Unlike other datasets, MovieNet encompasses three main types of texts: subtitle, synopsis, and script, excluding question-answer pairs. Additionally, the synopsis category is designed for the entire movie rather than video clips. Consequently, MovieChat-1K is more suitable for studying long video comprehension compared to other datasets.

Appendix C LLM-Assisted Evaluation for the short video question-answering task.

Following prior work, we use LLM-Assisted Evaluation for the short video question-answering task in Section 5.1. Given the question, the correct answer, and the answer predicted by the model, the LLM assistants should return a True or False judgement and a relative score ($0$ to $5$). The whole prompt is shown in Fig. C1. It takes about $250$ tokens per question. We report the baseline results of short video question-answering from https://github.com/mbzuai-oryx/Video-ChatGPT.
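Once the LLM assistant has returned a judgement and a score for every question, accuracy and average score follow from a simple aggregation. The sketch below assumes that each judgement has already been parsed into a dictionary with a `"pred"` field (`"yes"` or `"no"`) and an integer `"score"` field; this record format and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple, Union

Judgement = Dict[str, Union[str, int]]


def summarize_judgements(judgements: List[Judgement]) -> Tuple[float, float]:
    """Return (accuracy, mean score) over a list of parsed LLM judgements."""
    if not judgements:
        raise ValueError("no judgements to summarize")
    # A prediction counts as correct when the assistant answered "yes".
    correct = sum(1 for j in judgements if str(j["pred"]).strip().lower() == "yes")
    accuracy = correct / len(judgements)
    # Relative scores are integers from 0 to 5.
    mean_score = sum(int(j["score"]) for j in judgements) / len(judgements)
    return accuracy, mean_score
```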

Appendix D Hyperparameter Setting

We report the detailed hyperparameter settings of MovieChat in Tab. D3. The sliding window size of MovieChat is set to 16, which means that every slide involves the extraction of 16 frames. We configure the short-term memory to consist of 18 frames, with each frame containing 32 tokens. When the short-term memory reaches its capacity, it is directed to the memory consolidation module to be merged into 2 representative frames. The 2 frames are simultaneously input into the long-term memory with a total length of 256 and used to reinitialize the short-term memory.

Appendix E LLM-Assisted Evaluation for short video generative performance.

We use the LLM-Assisted Evaluation proposed in prior work for short video generative performance in Section 5.1. The evaluation pipeline assesses various capabilities of the model and assigns a relative score ($1$ to $5$) to the generated predictions in the following five aspects: Correctness of Information, Detail Orientation, Contextual Understanding, Temporal Understanding and Consistency. We follow the corresponding prompts provided in https://github.com/mbzuai-oryx/Video-ChatGPT and report the baseline results of short video generative performance from it.

Appendix F Ablation study on large language models.

Most previous video understanding methods primarily employ LLaMA and its variants as text decoders. With the average results of GPT-3.5, Claude and human blind rating, Tab. F4 and Tab. F5 illustrate how the performance of MovieChat changes when using LLaMA and LLaMA2 as the large language model, respectively.

Contrary to our hypothesis, under every evaluation condition, the performance metrics of MovieChat with LLaMA2 hardly surpass those of MovieChat with LLaMA. We further investigate a specific example to analyze this phenomenon. As shown in Fig. F2, the bold segments represent direct responses to the questions from the two versions of MovieChat. MovieChat with LLaMA provides answers that are more aligned with the video content. Surprisingly, MovieChat with LLaMA2 offers an approximation of the time required for each step (indicated by underlines in Fig. F2). While its time estimates do not precisely match the actual durations, the proportion of time provided is realistic. Even though LLaMA2 cannot obtain specific time information when processing feature-rich video frames, MovieChat’s memory buffer design allows for dense sampling of video frames, enabling LLaMA2 to estimate the proportion of time for each scene based on adjacent similar frames. Therefore, we propose that the lower evaluation metric results of MovieChat with LLaMA2 compared to MovieChat with LLaMA may be attributed to the question-answer pairs provided by the dataset.

Appendix G Manual filtering strategy for LLM-Assisted Evaluation.

For each test sample, we utilize GPT-3.5 to provide an evaluation result in terms of a ‘yes/no’ response and a corresponding score, as demonstrated in Fig. C1. The score is an integer value ranging from 0 to 5, where a score of 5 indicates the highest degree of meaningful correspondence. However, we observe instances where GPT-3.5 offers judgments and scores that do not align, such as providing a ‘yes’ response with a score of 0 or a ‘no’ response with a score of 5. This discrepancy has the potential to impact the accuracy of results and introduce fluctuations. We adapted the prompts used for GPT-3.5 with the aim of addressing this concern, but this did not yield the desired mitigation. Hence, we introduce a manual filtering strategy. For each evaluation result generated by GPT-3.5, we conduct manual screening. We retain only those outcomes that exhibit consistency between the ‘yes/no’ judgments and the associated scores, thus enhancing the reliability of the evaluations. Similarly, we apply the same filtering strategy to the evaluation results generated by Claude.
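The screening described here is manual, so the sketch below is only an automated proxy for the consistency criterion. It assumes a hypothetical rule in which a ‘yes’ judgement should carry a score at or above a threshold and a ‘no’ judgement a score below it; the threshold of 3, the record format, and the function names are all assumptions, since the text only gives the extreme cases (‘yes’ with 0, ‘no’ with 5) as examples.

```python
from typing import Dict, List, Union

Judgement = Dict[str, Union[str, int]]


def is_consistent(pred: str, score: int, threshold: int = 3) -> bool:
    """Check whether a yes/no judgement agrees with its 0-5 score."""
    answer = pred.strip().lower()
    if answer == "yes":
        return score >= threshold
    if answer == "no":
        return score < threshold
    # Anything other than yes/no is treated as inconsistent.
    return False


def filter_consistent(judgements: List[Judgement], threshold: int = 3) -> List[Judgement]:
    """Keep only judgements whose answer and score agree."""
    return [
        j for j in judgements
        if is_consistent(str(j["pred"]), int(j["score"]), threshold)
    ]
```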

Appendix H Quantitative evaluation for different question types in long video question answering.

As shown in Fig. 3, MovieChat-1K contains question-answer pairs of various types. To better assess the performance of MovieChat, we conduct evaluations on the long video question answering task using various types of questions. We roughly categorize the question types into multiple-choice questions and open-ended questions. With the average results of GPT-3.5, Claude and human blind rating, Tab. H6 and Tab. H7 respectively present the accuracy and scores of MovieChat and the baseline across different question categories in both global mode and breakpoint mode. In various research conditions, our approach consistently outperforms the baseline, thus substantiating the robustness of MovieChat.

Appendix I Quantitative evaluation for long video generative performance in breakpoint mode

Similar to Tab. 4, with the average results of GPT-3.5 , Claude and human blind rating, Tab. I8 demonstrates that our method outperforms the baseline in long video generative performance in breakpoint mode.

Appendix J Pearson correlation coefficient of different score methods.

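The Pearson correlation coefficient is represented by the formula:

$$r_{xy}=\frac{\sum_{i=1}^{n}(x_{i}-\overline{x})(y_{i}-\overline{y})}{\sqrt{\sum_{i=1}^{n}(x_{i}-\overline{x})^{2}}\sqrt{\sum_{i=1}^{n}(y_{i}-\overline{y})^{2}}}$$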

where $r_{xy}$ is the Pearson correlation coefficient between two variables $x$ and $y$ , $x_{i}$ and $y_{i}$ are the individual sample points for variables $x$ and $y$ , $\overline{x}$ and $\overline{y}$ are the averages of the $x$ and $y$ samples respectively, and $n$ is the number of sample points. The formula essentially assesses the extent of linear correlation between two variables by evaluating the product of their deviations from their respective means. The numerator represents the covariance between the two variables, and the denominator normalizes this value, ensuring that the coefficient remains between -1 and +1. The Pearson correlation coefficient quantifies the extent to which two variables co-vary in comparison to their individual variations.
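For concreteness, the coefficient described above can be computed as in the following sketch. The function name is illustrative, and raising an error for constant inputs (where the denominator is zero) is a design choice of this example rather than part of the definition.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: product of the root sums of squared deviations.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

Applied to the scores given by two raters on the same set of answers, a value close to +1 indicates that the raters rank the answers in nearly the same way.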

As shown in Tab. J9 and Fig. J3, we conduct Pearson correlation analysis between GPT-3.5, Claude, and human blind rating. The result indicates a substantial agreement among these evaluation methods. The alignment of scores across different scoring methods strengthens the reliability of our assessment. Crucially, our proposed method, MovieChat, outperforms previous methods in long video understanding tasks. The superior performance of MovieChat is evident across a broad spectrum of categories, suggesting that our model not only has a deeper understanding of long videos and respective questions but also exhibits a more accurate and consistent ability to generate relevant responses.

Appendix K Evaluation results with GPT, Claude and human blind rating.

As shown in Tab. K10–K18, we provide detailed scoring results for GPT-3.5, Claude, and human blind rating across various experiments.

Appendix L Analysis on hyperparameter ablations.

As the lengths of the short-term and long-term memory buffers increase, the information acquired by MovieChat from the video expands, as illustrated in Fig. 5. However, more video compression leads to the loss of more detailed information, while the length of the merged tokens remains constant. Therefore, as the lengths of two memory buffers increase, the performance of MovieChat exhibits a trend of initially rising and then declining.

Fig. 5 demonstrates how memory consolidation influences the performance. Since the LLM-based evaluation shows a positive correlation between accuracy and score, we use accuracy to gauge performance. When memory buffer parameters remain constant, shorter merged tokens indicate increased frame information compression, potentially resulting in information loss when excessive. Conversely, longer merged tokens, despite retaining a greater extent of short-term memory in the face of compression, correspondingly result in less overall information acquisition. Moreover, when the length of the memory buffer changes, as exemplified by long-term memory, the corresponding peak performance of MovieChat shifts in response to the altered length of merged tokens. This demonstrates the need to strike a balance between dense information extraction and information compression in long video understanding tasks.

We also conduct experiments to compare various methods for initializing the short-term memory, including selecting the last few tokens, uniform sampling, and using merged tokens. The results indicate that the use of merged tokens produces the best performance. When initializing the next short-term memory with the last few tokens from the previous short-term memory, it is unable to adequately represent the information from the previous time step. Consequently, this leads to the final merged tokens being either repetitive or lacking coherence with the previous time step. Uniform sampling faces similar issues, but it manages to capture information with representative frames from the previous time step. Consequently, its performance surpasses that of initializing with the last few tokens, yet it remains inferior to using merged tokens for initialization.

Appendix M Examples for scene understanding and temporal understanding of MovieChat.

We perform an extensive case study of MovieChat on a variety of open-ended long videos (such as cartoon movies and TV series) for long video question-answering and captioning tasks, including the global mode and the breakpoint mode. The evaluation tasks include scene understanding and temporal understanding as shown in Fig. M5, Fig. M5, Fig. M7 and Fig. M7. For Q#1 and Q#2, we mark timestamps in frames. For long videos over $10$K frames, MovieChat is still capable of providing excellent responses to questions regarding both the current moment and the entire video content.