Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams

Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin

Introduction

Online video streaming is a prevalent media format with a broad spectrum of applications. In robotics, for instance, robots operating in the wild can leverage stream understanding models to interpret and react to their environment in real time. Similarly, in surveillance systems, stream understanding models can continuously process and analyze video streams from specific locations, thereby improving overall security. However, even the best existing large video-language models fail to perform real-time question answering over long videos upon user queries. The main reason is that visual tokens from consecutive frames are heavy and redundant without effective compression, which makes it impossible to keep all visual features in limited GPU memory (VRAM) and significantly increases the decoding latency of the language model.

Considering how humans process live video streams in real time can inspire the design of video stream understanding models. This procedure can be divided into four steps: 1) Perceiving: human eyes continuously encode an endless stream of visual information into the brain. 2) Memorizing: the human brain compresses the visual information and updates its memory with it. With limited memory capacity, humans tend to have clearer, more detailed memories of recent events, while remembering only the most important parts of events from the distant past. 3) Recalling: whenever a person is asked about what happened before, their brain retrieves the memory. 4) Answering: the human brain integrates the memory information with the context provided by the question and generates an answer.

It is worth noting that the four human processing steps above are not strictly sequential. As shown in Figure 1 (b) (focus on the brown part and ignore the blue part), the first two steps can be performed by one process (on the left), while the last two steps are performed simultaneously by another process (on the right). In other words, humans can perceive and memorize new information while recalling and answering questions about the past. While the “process” for perceiving and memorizing is always running, the “process” for recalling and answering is only activated upon user questions. This is the key to online video stream understanding. In contrast, most existing video-QA methods are based on offline video understanding, where the user query and a finite-length video are given to the model at the same time. As shown in Figure 1 (a), these methods consist of only two strictly sequential steps: perceiving and answering. The lack of a compressed memory mechanism in these offline methods results in a dilemma: 1) If the model keeps the redundant visual tokens of all frames, the high VRAM consumption limits the number of input frames. 2) If the model performs question-aware encoding and keeps only those visual tokens that are relevant to the question, it has to re-encode all the visual information from scratch every time a new query is given, leading to unacceptable inference latency for online video streams.

To address this challenge, we introduce Flash-VStream, a video-language model that is able to process extremely long video streams in real time and respond to user queries simultaneously. As shown in Figure 1 (c), Flash-VStream (blue) closely resembles the human processing pipeline (brown) in its “4-step, 2-process” design philosophy. The frame encoder resembles human eyes and the LLM resembles the human brain. The learnable memory mechanism in Flash-VStream, named Spatial-Temporal-Abstract-Retrieved (STAR) memory, is carefully designed to compress necessary visual information and update memory in an online, real-time manner, as shown in Figure 3.

In addition, recognizing that existing video QA benchmarks are offline and limited to short videos, we propose VStream-QA, a novel question answering benchmark specifically designed for online video stream understanding. The main features of VStream-QA are: i) Each question-answer pair is marked with a specific timestamp in the video and relates only to the visual information before that timestamp, which is consistent with the online video stream understanding setting. ii) The video length ranges from 30 minutes to 60 minutes, which is significantly longer than in existing benchmarks, making it suitable for evaluating a model’s performance on extremely long videos. iii) The videos cover a variety of content, including first-person (ego-centric) videos and third-person movies.

On these challenging online benchmarks, Flash-VStream achieves state-of-the-art performance while significantly reducing inference latency and VRAM consumption, as shown in Figure 2 and Table 1. Zero-shot video question answering experiments on four conventional offline video QA benchmarks further demonstrate the generalization ability of Flash-VStream, as shown in Table 3. Comprehensive ablation studies confirm the effectiveness of the memory mechanism we adopt. We summarize our contributions as follows:

We introduce Flash-VStream, a novel large video-language model that is able to process extremely long video streams in real-time and respond to user queries simultaneously. A cleverly designed memory mechanism named STAR is introduced to compress necessary visual information while leaving out the redundancy between consecutive frames.

While maintaining state-of-the-art performance on both online and offline benchmarks, Flash-VStream achieves significant reductions in inference latency and GPU Memory (VRAM) consumption, enabling online video stream QA in real-time.

We also propose VStream-QA, a new QA benchmark specifically designed for video understanding in online settings. Its question-answer-timestamp triplet design is consistent with the online scenario, and its videos are significantly longer than those in existing benchmarks, making it suitable for evaluating a model’s performance on extremely long video streams.

Related work

Multi-modal large language models. With recent advances in Large Language Models (LLMs), many works try to build Multimodal Large Language Models (MLLMs) that integrate text with visual data or other modalities. For instance, the BLIP series proposed an efficient strategy for bootstrapping multimodal understanding with pretrained LLMs and image encoders, and the LLaVA series leverages GPT-generated visual instruction data to tune open language models. With the development of image-text models, researchers have begun extending them from images to videos. The biggest challenge for video LLMs is how to compress redundant frame features. LLaMA-VID represents single-frame features with a few tokens, Chat-UniVi employs dynamic tokens to model image and video features at different scales, and Vista-LLaMA uses a sequential visual projector to represent an entire video with fewer tokens. These methods either require a multi-step visual encoding process with high latency, or have a VRAM cost that grows linearly with the number of frames, making them unsuitable for real-time understanding of long video streams. MovieChat proposed to combine all frame features through a simple averaging strategy. Though it is able to process long videos with limited VRAM cost, its performance is suboptimal due to its training-free framework and non-learnable memory mechanism. In our proposed Flash-VStream, we introduce a learnable memory mechanism that encodes frames in an online, real-time manner and decouples the visual encoding process from the answer decoding process, thus enabling real-time video stream understanding.

Real-time video stream understanding. Real-time video stream understanding is a challenging task that requires a model to process video streams in real time and complete specific tasks based on the video. Most existing real-time methods are designed to perform a single, specific vision task, such as real-time object tracking or real-time action recognition. Considering that natural language is becoming a general interface for various tasks and modalities, our work focuses on real-time video stream question answering upon user queries, which is a more challenging and comprehensive task.

Memory mechanism for long sequence processing. Memory mechanisms are widely used to store and retrieve information in all forms of sequence processing tasks, such as time series forecasting, recommendation systems, machine translation, and video object segmentation. Inspired by the Neural Turing Machine (NTM), a learnable mechanism that resembles the working memory system of human cognition, we propose a learnable visual memory that is able to compress visual information and update memory in an online, real-time manner.

Flash-VStream

As shown in Figure 3, our Flash-VStream framework consists of three main components: (1) a streaming visual encoder that continuously processes video frames, (2) a Spatial-Temporal-Abstract-Retrieved memory mechanism (STAR memory), including memory writing and reading with the help of a feature buffer, and (3) an LLM decoder capable of providing real-time responses to questions raised by users. To perform real-time inference, Flash-VStream is deployed as two asynchronous processes. The frame handler process manages the streaming visual encoder and STAR memory consolidation. The question handler process manages the real-time LLM decoder, STAR memory reading and interactions with users. The only connection between these two processes is the shared memory, which is written by the first process and can be read by both.
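The two-handler layout can be illustrated with a minimal Python sketch. It is not the Flash-VStream implementation: threads stand in for the two processes, and `SharedMemory`, `frame_handler`, `question_handler` and the toy "keep the newest slots" rule are hypothetical placeholders for the encoder, the STAR memory and the LLM decoder.

```python
import threading


class SharedMemory:
    """Hypothetical stand-in for the shared STAR memory (one writer, many readers)."""

    def __init__(self, capacity=4):
        self._lock = threading.Lock()
        self._slots = []
        self._capacity = capacity
        self.frames_seen = 0

    def write(self, feature):
        # Called only by the frame handler.
        with self._lock:
            self._slots.append(feature)
            # Toy consolidation: keep only the newest `capacity` entries.
            self._slots = self._slots[-self._capacity:]
            self.frames_seen += 1

    def read(self):
        # Returns a consistent snapshot; safe to call from either handler.
        with self._lock:
            return list(self._slots), self.frames_seen


def frame_handler(memory, frames):
    """Always-running loop: 'encode' each frame and write it to memory."""
    for frame in frames:
        memory.write(("feature", frame))  # placeholder for encoder output


def question_handler(memory, question):
    """Activated only by a query: answer from a snapshot of the memory."""
    slots, seen = memory.read()
    return {"question": question, "slots_used": len(slots), "frames_seen": seen}


def run_demo(num_frames=20):
    memory = SharedMemory(capacity=4)
    worker = threading.Thread(target=frame_handler, args=(memory, range(num_frames)))
    worker.start()
    early = question_handler(memory, "What is happening now?")  # may arrive mid-stream
    worker.join()
    late = question_handler(memory, "What happened overall?")
    return early, late
```

The point of the design is that the cost of answering depends on the fixed-size snapshot, not on how many frames have already been consumed.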

3.2 Spatial-Temporal-Abstract-Retrieved memory

Spatial memory. Spatial memory houses the most recent and detailed spatial information for short-term use, implemented as a FIFO (First-In-First-Out) queue, as illustrated in Figure 4 and Equation 2. This architecture enables continuous updating with the newest frames, facilitating immediate access to fine-grained spatial data.
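A FIFO queue of this kind can be sketched with Python's `collections.deque`. The function names are hypothetical, and the strings below stand in for pooled frame feature maps.

```python
from collections import deque


def make_spatial_memory(n_spa):
    """Create a FIFO queue that holds at most n_spa frame features."""
    return deque(maxlen=n_spa)


def write_spatial(memory, frame_feature):
    """Append the newest feature; the oldest one is dropped automatically when full."""
    memory.append(frame_feature)
    return memory


spatial = make_spatial_memory(2)
for t in range(5):
    write_spatial(spatial, f"frame_{t}")
# Only the two most recent frames remain in the queue.
```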

Temporal memory. Temporal memory integrates dynamic information over time, which is crucial for long-term retention. When its size surpasses $N_{\text{tem}}$, the $g_{\text{wkmeans}}$ (Weighted K-means Clustering) algorithm is applied, as shown in Equation 3 and Algorithm 1. This strategy condenses the memory content into $N_{\text{tem}}$ clusters, which can be seen as representations of key events in the video. The centroids of these clusters are then used as the new memory, efficiently storing temporal context.
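The following is a minimal pure-Python sketch of weighted k-means, given only to make the consolidation step concrete; it is not a transcription of Algorithm 1. It assumes that each memory entry carries a weight (for example, the number of frames already merged into it), that the first `k` points serve as initial centroids, and that a fixed number of iterations is run.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, iters=10):
    """Cluster `points` (lists of floats) into k groups using per-point weights.

    Returns (centroids, cluster_weights, assignment). The cluster weights are
    the summed weights of the members, so the result can be clustered again later.
    """
    centroids = [list(p) for p in points[:k]]  # assumption: deterministic init
    assignment = [0] * len(points)
    dim = len(points[0])
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        # Update step: each centroid becomes the weighted mean of its members.
        for j in range(k):
            members = [i for i in range(len(points)) if assignment[i] == j]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # empty cluster: keep the previous centroid
            centroids[j] = [
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(dim)
            ]
    cluster_weights = [
        sum(weights[i] for i in range(len(points)) if assignment[i] == j)
        for j in range(k)
    ]
    return centroids, cluster_weights, assignment


centroids, cluster_weights, assignment = weighted_kmeans(
    [[0.0], [1.0], [10.0]], [3, 1, 1], k=2
)
```

In this toy run the heavily weighted point at 0.0 pulls its centroid toward itself, which mirrors how an entry that already summarizes many frames should dominate a merge.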

Abstract memory. Abstract memory supports high-level semantic concept interpretation through $f_{SA}$, the Semantic Attention model. It follows Equation 4 to synthesize the insights gained from both spatial and temporal memories into abstracted, actionable knowledge. $f_{SA}$ continually updates $M_{\text{abs}}$, the synopsis of the whole video, using the newest features. Refer to Figure 4 and Algorithm 2 for details.

Retrieved memory. Retrieved memory focuses on recalling precise spatial details by identifying and retrieving the most substantial frame features. As shown in Figure 4, it first selects the top-K (where K equals $N_{\text{ret}}$) largest clusters from the $N_{\text{tem}}$ clusters obtained in temporal memory $M_{\text{tem}}$. Then, the frame features in the feature buffer that are nearest to the centroids of these K clusters are retrieved to supplement the temporal memory with more detailed spatial information. This process is illustrated in Equation 5 and Algorithm 3.
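The retrieval step can be sketched as below. This is an illustration rather than Algorithm 3: `retrieve_key_frames` is a hypothetical name, cluster size is represented by a plain number per cluster, and buffer features are assumed to be comparable to the centroids with squared Euclidean distance.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def retrieve_key_frames(buffer_features, centroids, cluster_sizes, n_ret):
    """Return indices of buffered frames nearest to the n_ret largest clusters."""
    # Rank clusters from largest to smallest and keep the top n_ret.
    largest = sorted(
        range(len(centroids)), key=lambda j: cluster_sizes[j], reverse=True
    )[:n_ret]
    retrieved = []
    for j in largest:
        # Pick the buffered frame feature closest to this cluster's centroid.
        nearest = min(
            range(len(buffer_features)),
            key=lambda i: squared_distance(buffer_features[i], centroids[j]),
        )
        retrieved.append(nearest)
    return retrieved


frame_indices = retrieve_key_frames(
    buffer_features=[[0.0], [1.0], [9.0], [10.0]],
    centroids=[[0.25], [10.0], [6.0]],
    cluster_sizes=[4, 1, 2],
    n_ret=2,
)
```

The returned indices point into the feature buffer, so the full-resolution features of those frames can be copied into the retrieved memory.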

In brief, a new feature $e^{t}$ is written to STAR memory as follows:
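One plausible form of the write rule, assembled only from the component descriptions above, is sketched below. The operator $\text{FIFO}_{N}$ (keep the newest $N$ entries), the function $g_{\text{retrieve}}$ and the buffer symbol $B^{t}$ are placeholder names introduced for this sketch, and the exact argument lists are assumptions rather than the authors' notation.

```latex
\begin{aligned}
M_{\text{spa}}^{t} &= \text{FIFO}_{N_{\text{spa}}}\big(\texttt{concat}(M_{\text{spa}}^{t-1},\; g_{\text{pooling}}(e^{t}, P_{\text{spa}}))\big) \\
M_{\text{tem}}^{t} &= g_{\text{wkmeans}}\big(\texttt{concat}(M_{\text{tem}}^{t-1},\; g_{\text{pooling}}(e^{t}, P_{\text{tem}})),\; N_{\text{tem}}\big) \\
M_{\text{abs}}^{t} &= f_{SA}\big(M_{\text{abs}}^{t-1},\; g_{\text{pooling}}(e^{t}, P_{\text{abs}}),\; N_{\text{abs}}\big) \\
M_{\text{ret}}^{t} &= g_{\text{retrieve}}\big(B^{t},\; M_{\text{tem}}^{t},\; N_{\text{ret}}\big)
\end{aligned}
```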

Here $g_{\text{pooling}}(e,P^{\prime})$ applies Average Pooling to compress feature map $e$ from $P^{2}$ to $P^{\prime 2}$ size along width and height dimensions. $\texttt{concat}(a,b)$ means concatenating tensors $a$ and $b$ along time axis.
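As a small illustration of the pooling step, the sketch below average-pools a $P \times P$ grid down to $P^{\prime} \times P^{\prime}$. It assumes that $P$ is divisible by $P^{\prime}$ and uses one scalar per grid cell for readability, whereas real feature maps hold a vector per cell.

```python
def average_pool(feature_map, p_out):
    """Average-pool a square P x P grid (list of rows) down to p_out x p_out."""
    p_in = len(feature_map)
    if p_in % p_out != 0:
        raise ValueError("input size must be divisible by output size")
    s = p_in // p_out  # side length of each pooling window
    return [
        [
            sum(
                feature_map[r * s + dr][c * s + dc]
                for dr in range(s)
                for dc in range(s)
            ) / (s * s)
            for c in range(p_out)
        ]
        for r in range(p_out)
    ]


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy feature map
pooled = average_pool(grid, 2)  # 16 cells are reduced to 4
```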

3.3 Real-time LLM decoder

The LLM decoder works as part of a real-time question answering server. When triggered by a question $Q^{t}$ at time $t$, the LLM decoder first calculates the text embedding $I_{\text{text}}^{t}=f_{\text{embed}}(Q^{t})$ and maps the STAR memory $M^{t}=M_{\text{spa}}^{t}+M_{\text{tem}}^{t}+M_{\text{abs}}^{t}+M_{\text{ret}}^{t}$ to the embedding space with the projector, $I_{\text{vision}}^{t}=f_{\text{proj}}(M^{t})$. It then starts to generate the answer $A^{t}=f_{\text{LLM}}(I_{\text{text}}^{t},I_{\text{vision}}^{t}).\text{decode}()$ in real time.

3.4 Implementation details

In this study, we utilize pre-trained CLIP ViT-L/14-224px as the streaming visual encoder. Following LLaVA, we choose a 2-layer MLP as the visual projector and pre-trained Vicuna-7B as the LLM decoder. Considering the balance between performance and resource consumption, we set $P_{\text{spa}}=8$, $P_{\text{tem}}=4$, $P_{\text{abs}}=1$, $N_{\text{buff}}=300$, $N_{\text{spa}}=1$, $N_{\text{tem}}=N_{\text{abs}}=25$ and $N_{\text{ret}}=3$. The MAXSIZE of STAR memory is set to 681 tokens to maintain computational efficiency.
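The 681-token figure can be reproduced from these hyper-parameters with a short calculation. The sketch assumes that each memory entry contributes $P^{2}$ tokens and that retrieved frames are stored at the spatial resolution $P_{\text{spa}}$; the latter is an assumption, chosen because it matches the stated total.

```python
def star_token_budget(p_spa, p_tem, p_abs, n_spa, n_tem, n_abs, n_ret):
    """Tokens per memory component, assuming P*P tokens per stored entry."""
    return {
        "spatial": n_spa * p_spa ** 2,
        "temporal": n_tem * p_tem ** 2,
        "abstract": n_abs * p_abs ** 2,
        "retrieved": n_ret * p_spa ** 2,  # assumption: retrieved frames use P_spa
    }


budget = star_token_budget(p_spa=8, p_tem=4, p_abs=1, n_spa=1, n_tem=25, n_abs=25, n_ret=3)
total_tokens = sum(budget.values())  # 64 + 400 + 25 + 192
```

Under these assumptions, most of the budget goes to temporal memory (400 tokens), and the total is independent of video length.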

We train Flash-VStream in two stages: modality alignment and instruction tuning. The training data are the same as those of LLaMA-VID, including LLaVA-filtered-558K image-caption pairs and LLaMA-VID-filtered-232K video-caption pairs for stage 1, and LLaVA-filtered-665K image QA pairs and Video-ChatGPT-filtered-98K video QA pairs for stage 2. For each stage, the model is trained for 1 epoch on 8 A100 80G GPUs. During training, the parameters of the visual encoder are always frozen, while the parameters of the LLM are frozen only in the first stage. All training and inference experiments were conducted in BF16 precision to save time and resources. Other hyper-parameters can be found in Table 7.

VStream-QA: A new benchmark for online video stream QA

Previous video QA benchmarks mostly focus on offline video understanding, where the user query and a finite-length video are given to the model at the same time. To the best of our knowledge, there is no existing benchmark specifically designed for online video stream understanding. Moreover, most existing benchmarks are limited to short videos of under 1 minute or medium-length videos of under 10 minutes, which are unsuitable for simulating online video streams.

To address this problem, we propose VStream-QA, a novel question answering benchmark specifically designed for online video stream understanding. VStream-QA consists of two parts, VStream-QA-Ego and VStream-QA-Movie, which are designed for evaluating first-person ego-centric understanding and third-person plot understanding, respectively. The prominent features of VStream-QA are that i) each question-answer pair is marked with a specific timestamp in the video and relates only to the visual information before that timestamp, ii) it contains extremely long videos (30 to 60 minutes) that are significantly longer than those in existing benchmarks, and iii) it covers a variety of video sources and question types.

Specifically, VStream-QA-Ego consists of 10 one-hour ego-centric video clips from the Ego4D dataset together with 1.5K question-answer-timestamp triplets, while VStream-QA-Movie consists of 22 half-hour movie clips from the MovieNet dataset together with 2K question-answer-timestamp triplets. As shown in Figure 5, the two parts together comprise 21 hours of video and 3.5K question-answer pairs. VStream-QA fills the gap in existing benchmarks for online video stream understanding, and provides an extremely long video test set that can be used for evaluation in both online settings and conventional offline settings.
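A question-answer-timestamp triplet and the online visibility constraint can be sketched as follows. The class and function names are hypothetical and do not reflect the released data format; treating a frame at exactly the question timestamp as visible is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QATriplet:
    question: str
    answer: str
    timestamp_s: float  # time at which the question is asked, in seconds


def visible_frame_times(frame_times_s, triplet):
    """Keep only frames a model may use: those up to the question timestamp."""
    return [t for t in frame_times_s if t <= triplet.timestamp_s]


def total_video_hours(parts):
    """parts: list of (number_of_videos, minutes_per_video) pairs."""
    return sum(n * minutes for n, minutes in parts) / 60


triplet = QATriplet("What is the person cooking?", "Pasta.", timestamp_s=20.0)
usable = visible_frame_times([0, 10, 20, 30], triplet)
hours = total_video_hours([(10, 60), (22, 30)])  # Ego part plus Movie part
```

The last line reproduces the stated total: 10 one-hour clips plus 22 half-hour clips give 21 hours.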

We carefully design five types of questions to evaluate the model’s ability to understand both scene content and temporal information. As shown in Figure 5, the question types are well balanced. Specifically, [Scene Summary] and [Action Description] are open-ended questions designed to evaluate the model’s ability to understand static and dynamic scene content. [Event Occurrence] questions are yes/no questions designed to evaluate the model’s ability to detect whether a specific event or scene occurs in the video. [Ordered Event Narrative] and [Sequence Validation] are both designed to evaluate the model’s ability to understand the temporal order of events in the video, with the former being open-ended and the latter being yes/no questions. For yes/no questions, the answer ratio is well balanced, with 46.3% yes and 53.7% no.

To balance annotation quality, data scale and total annotation cost, we design a five-step data generation pipeline: 1) Video Selection; 2) Dense Captioning; 3) Summary Generation; 4) Question-Answer Generation; and 5) Human Filtering. For details of each step, please refer to Section C.1.

Experiment

Datasets. For real-time video stream understanding, it is crucial for models to be both accurate and efficient. To evaluate the real-time understanding ability and computational efficiency of models, we test them on the Realtime-VStream-QA-Ego/Movie datasets (RVS-Ego/Movie for short). The real-time version of VStream-QA differs from the normal version in that each question is grounded before a predefined timestamp. To evaluate the basic question answering capability of Flash-VStream, we conduct zero-shot open-ended video question answering experiments on ActivityNet-QA, NExT-QA, MSVD-QA, MSRVTT-QA and the proposed VStream-QA-Ego/Movie datasets (VS-Ego/Movie for short).

Evaluation Metrics. For open-ended video question answering tasks, we adopt the GPT-3.5-based metric, following common practice. Given the question, the ground-truth answer and the prediction generated by the model, GPT-3.5 judges whether the prediction is correct and provides a score between 0 and 5. We report the GPT-3.5 accuracy and score of each model on the VQA datasets. For the computational efficiency test, we report the average response latency (from questioning to answering) and the maximum VRAM usage of each model.
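Once the judge has produced a verdict and a score per question, the two reported numbers can be aggregated as in the sketch below. The input format, a list of `(is_correct, score)` pairs, is an assumption, and the call to the judging model itself is deliberately left out.

```python
def aggregate_judgments(judgments):
    """Turn per-question (is_correct, score) pairs into accuracy (%) and mean score."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    n = len(judgments)
    accuracy = 100.0 * sum(1 for ok, _ in judgments if ok) / n
    mean_score = sum(score for _, score in judgments) / n
    return accuracy, mean_score


accuracy, mean_score = aggregate_judgments(
    [(True, 5), (False, 2), (True, 4), (False, 1)]
)
```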

5.2 Zero-shot video question answering

As our model is only trained on , we compare Flash-VStream with other competitive methods, namely Video-ChatGPT, MovieChat, Chat-UniVi, Vista-LLaMA and LLaMA-VID, on zero-shot real-time VideoQA datasets in Table 1 and on normal zero-shot VideoQA datasets in Table 3. Video-ChatGPT uses temporal pooling and spatial pooling for video understanding. This simple method performs well in real-time movie understanding. MovieChat implements a merge-based memory consolidation and uses a Q-Former as its feature aggregator. Although it is competitive in understanding some short-video scenes, it falls behind on extremely long videos, such as those in RVS-Ego and RVS-Movie, as shown in Table 1. The more recent Chat-UniVi and LLaMA-VID achieve relatively high performance on the real-time video understanding benchmarks. However, their high computational burden and high latency make them difficult to deploy in real-time understanding scenarios. Flash-VStream achieves SoTA on these benchmarks, demonstrating the strong capabilities of the proposed STAR memory in information compression and long video comprehension.

5.3 Computational efficiency

We measure the inference latency of each model as the wall-clock response time of the question handler process, as presented in Figure 2. For many models, the inference latency scales up with the number of frames because their architectures require processing all frames at once. In contrast, Flash-VStream leverages an efficient multiprocessing STAR memory mechanism (see Section 3.2) to process frames in a streaming fashion, which allows relatively low inference latency and VRAM cost (detailed in Table 1). These attributes enable real-time inference.
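A wall-clock latency measurement of this kind can be sketched with `time.perf_counter`. Here `answer_fn` is a hypothetical callable standing in for the question handler; the sketch only times the call and does not measure VRAM.

```python
import time


def measure_latency(answer_fn, questions):
    """Return the answers and the mean wall-clock latency (seconds) per question."""
    answers, latencies = [], []
    for question in questions:
        start = time.perf_counter()
        answers.append(answer_fn(question))  # from questioning to answering
        latencies.append(time.perf_counter() - start)
    return answers, sum(latencies) / len(latencies)


# Usage with a trivial stand-in for the question handler.
answers, mean_latency_s = measure_latency(lambda q: q.upper(), ["what?", "when?"])
```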

5.4 Ablation study

We conduct an ablation study to evaluate the effects of the key components of the STAR memory mechanism, i.e., spatial, temporal, abstract and retrieved memory. Removing temporal memory causes a severe performance drop (as shown in the second row of Table 4), indicating that temporal memory is vital for long video stream understanding, as it enables the integration of contextual information across frames for coherent comprehension. The other types of memory also contribute substantially, as they capture different aspects of visual information, such as spatial layout, high-level concepts and pivotal experiences.

Semantic Attention.

We compare the proposed Semantic Attention with other memory updating strategies, as shown in Table 5. Q-Former is widely used by many models and Sequential Q-Former is used by . These updating methods are all transformer-based. Despite its lightweight nature, the Semantic Attention model outperforms the other methods by a large margin. We conjecture that the training dataset is too small for Q-Former-based models to learn adequately. The architecture of Semantic Attention facilitates the extraction of key information and the selective forgetting of irrelevant details, enhancing the model’s ability to comprehend abstract concepts in long videos.

Design of spatial size and temporal length of memory.

In Table 6, we evaluate how the spatial size and temporal length of memory influence long video understanding tasks. For the spatial size of memory, although a smaller feature map hurts performance, an excessively large feature map is not an optimal choice either (see the first row of Table 6(a)). A similar pattern can be observed when varying the temporal length of memory in Table 6(b), in line with findings from . Considering the expensive computational cost of larger and longer memory, we adopt a balanced design.

5.5 Memory token visualization

We investigate the memory consolidation procedure in deep feature space. Specifically, in the left part of Figure 6, when the input video stream contains three significantly different scenes (talking, playing the drums and end credits), the memory focuses on the scene with the longest duration, much as humans do. Both relatively static scenes and relatively dynamic scenes receive substantial attention, as shown in the right part of Figure 6. The visualization suggests that the memory tokens effectively reflect the distribution of the vision tokens.

5.6 Case study

To better demonstrate the features of VStream-QA as well as the effectiveness of the Flash-VStream model, we provide a case study on the VStream-QA-Movie dataset. As shown in Figure 7, each question-answer pair is equipped with a question timestamp indicating the time at which the question is asked. Models are provided only with the visual content before the question timestamp. Thanks to the carefully designed STAR memory mechanism, Flash-VStream grasps the key visual information and turns out to be the only model that successfully understands the theme of this long movie clip, while the other compared models, including LLaMA-VID and Video-ChatGPT, fail to do so for various reasons. This illustrates the effectiveness of our proposed Flash-VStream model in long video understanding tasks. Refer to the model-generated answers and the figure caption for details.

Conclusion

In conclusion, we have introduced Flash-VStream, a video-language model for real-time processing of online video streams and answering user questions. It incorporates a smartly designed memory called STAR, and significantly reduces inference latency and VRAM consumption. In addition, we have proposed a new benchmark for online video understanding called VStream-QA. Our model outperforms existing methods on this new online benchmark and maintains SoTA performance on offline video understanding benchmarks. We hope our work could inspire further research and advancements in the field of online video stream understanding.

References

Appendix

Appendix A Memory implementation details

This section describes the details of the Spatial-Temporal-Abstract-Retrieved memory mechanism proposed in Section 3.2. The STAR memory has both parametric and non-parametric updating strategies. Spatial memory uses a simple replacement method.

Appendix B Training details

The training procedure of Flash-VStream is similar to that of . In the modality alignment stage (stage 1), we train the Semantic Attention model and the projector for one epoch. In the instruction tuning stage (stage 2), we fine-tune the Semantic Attention model, the projector and the LLM for another epoch. The overall training can be finished in 15 hours on 8 A100 80G GPUs (BFloat16) with pre-extracted visual features. Detailed training settings are shown in Table 7.

Appendix C VStream-QA benchmark design details

Here we provide more details of the VStream-QA online video understanding benchmark.

Video Selection. We first select 10 videos from the Ego4D dataset, each 1 hour long, and 22 videos from the MovieNet dataset, each 30 minutes long. Both ego-centric videos and movie clips are chosen to cover a wide range of content types. Refer to the next subsection for details.

Dense Captioning. We use GPT-4V to generate dense captions for each video clip. Long videos are divided into pieces of 30 seconds, and 8 frames are sparsely sampled from each piece as input to GPT-4V. Each output caption describes the content of a 30-second video piece and is marked with a specific timestamp.
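The segmentation and sampling step can be sketched as below. The exact sampling positions are not specified above; taking the midpoints of 8 equal sub-intervals of each piece is an assumption, and `sample_times` is a hypothetical helper name.

```python
def sample_times(video_seconds, piece_seconds=30, frames_per_piece=8):
    """Split a video into fixed-length pieces and pick evenly spaced frame times.

    Returns a list of (start, end, times) tuples, one per piece.
    """
    pieces = []
    start = 0
    while start < video_seconds:
        end = min(start + piece_seconds, video_seconds)
        step = (end - start) / frames_per_piece
        # Assumption: sample the midpoint of each of the equal sub-intervals.
        times = [start + (i + 0.5) * step for i in range(frames_per_piece)]
        pieces.append((start, end, times))
        start = end
    return pieces


one_hour = sample_times(3600)  # a 1-hour video yields 120 pieces
total_frames = sum(len(times) for _, _, times in one_hour)  # 120 * 8 frames
```

Under these settings a one-hour video requires 120 captioning calls covering 960 sampled frames.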

Summary Generation. We use GPT-4 to deduplicate and summarize the dense captions generated by GPT-4V. Each summary is designed to be a concise description of a scene-level clip, typically derived from multiple dense captions that correspond to several minutes of video content. Timestamps are carefully preserved throughout the summarization process.

Question-Answer Generation. We use GPT-4 to generate 5 types of QA pairs based on the scene summaries. Each QA pair is generated from a single scene summary or several consecutive scene summaries, to ensure that it relates only to the visual information before the timestamp.

Human Filtering. Volunteers are invited to judge the relevance of the generated QA pairs to the video content. The following types of QA pairs are carefully filtered out: i) questions that are irrelevant to the video or ambiguous, ii) questions that require additional knowledge beyond the video, iii) questions that can be answered without the video, and iv) answers that are wrong, ambiguous or repetitive.

C.2 Variety of video content

Besides the variety of question types, the VStream-QA benchmark also involves various types of video content.

VStream-QA-Ego video topics: ["cooking", "playing-card", "writing", "home-maintenance", "sightseeing", "reading"].

VStream-QA-Movie movie genres: ["Action", "Adventure", "Sci-Fi", "Crime", "Drama", "Thriller", "War", "Mystery", "Comedy", "Fantasy", "History", "Biography", "Horror"].

Appendix D Limitations

Although the proposed VStream-QA is the first benchmark that aims to simulate real-world video streaming scenarios, it still falls short of fully representing the comprehension of infinitely long video streams in the real world. Besides, the proposed approach only involves a coarse-grained understanding task, i.e., QA. In the real world, video streams encompass more complex comprehension tasks. It is our hope that Flash-VStream will inspire related research in this field.

D.2 GPT-3.5-based evaluation metric

In the proposed VStream-QA benchmark and many other video question answering benchmarks, GPT-3.5-based evaluation is adopted as the preferred metric. However, we notice that there is always a discrepancy between the distribution of GPT accuracy and that of GPT score. Specifically, many answers classified as “no” are assigned a high score such as “4” or “5”, also discussed by . This abnormal phenomenon reduces the credibility of the “ $0\sim 5$ score” metric in GPT-3.5-based MLLM evaluation.
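One simple way to quantify this discrepancy is to measure how often answers judged incorrect still receive a high score. The sketch below assumes judgments are available as `(is_correct, score)` pairs and uses a threshold of 4; both the input format and the threshold are illustrative choices.

```python
def high_score_rate_among_incorrect(judgments, threshold=4):
    """Fraction of answers judged incorrect whose score is still >= threshold."""
    wrong_scores = [score for ok, score in judgments if not ok]
    if not wrong_scores:
        return 0.0
    return sum(1 for s in wrong_scores if s >= threshold) / len(wrong_scores)


rate = high_score_rate_among_incorrect(
    [(True, 5), (False, 4), (False, 1), (False, 5)]
)
```

A rate far above zero indicates that the score and the yes/no verdict disagree for a sizable share of the answers.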

Appendix E Broader Impacts

Real-time understanding models for long video streams may lead to potential negative societal impacts, including but not limited to unauthorized surveillance or privacy-infringing tracking. However, we firmly believe that the task itself is neutral with positive applications, such as health monitoring and emergency response.