UniVTG: Towards Unified Video-Language Temporal Grounding

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, Mike Zheng Shou

cs.CV

Introduction

With the increasing interest in sharing daily lives, video has emerged as the most informative yet diverse visual form on social media. These videos are collected in a variety of settings, including untrimmed instructional videos and well-edited vlogs. Given the massive scale and diversity of video forms, automatically identifying relevant moments based on user queries has become a critical capability in the industry for efficient video browsing.

This significant demand has given rise to a number of video understanding tasks, including moment retrieval, highlight detection, and video summarization. As depicted in Fig. 1, moment retrieval localizes consecutive temporal windows (interval-level) given natural-language sentences; highlight detection picks out the key segment with the highest worthiness (curve-level) that best reflects the video gist; video summarization collects a set of disjoint shots (point-level) to summarize the video, with general or user-specific queries. Although task-specific datasets and models have been developed, these tasks are typically studied separately. In general, they share the common objective of grounding clips of various scales based on customized user queries, which we refer to as Video Temporal Grounding (VTG).

Though these tasks are closely related, their relationship has not been explicitly studied until recently. QVHighlights was introduced as the first unified benchmark for moment retrieval and highlight detection, together with Moment-DETR, the first model for learning the two jointly. On this basis, UMT adds audio inputs, and QD-DETR develops negative pairs and saliency tokens. Nevertheless, these studies focus solely on designing models that cover two subtasks, and the grounding capabilities they learn rely on specific labels. As a result, they cannot generalize VTG across diverse temporal labels, such as the unique point-level narrations in Ego4D. Furthermore, we have witnessed promising progress in Vision-Language Pretraining (VLP). One notable work is GLIP, which develops a unified model by jointly utilizing large-scale, diverse image annotations such as image captions and bounding boxes for spatial grounding. However, we do not observe similar progress in video-language pretraining. Most works in this area are designed for video-level tasks such as video-text retrieval rather than temporal grounding. This is largely because fine-grained temporal annotations are expensive to produce manually, making it challenging to obtain open-source, scalable yet diverse annotations to support grounding pretraining along the temporal axis in videos.

Therefore, we see a clear motivation to pursue a Unified VTG framework and propose our UniVTG, which aims to unify diversity in VTG along three directions: (i) From the label and task aspect, we first define a formulation for VTG in which each video is decomposed into a clip sequence and each clip is assigned three basic query-conditional elements. Such a formulation enables us to unify various VTG labels and tasks under the same framework. Moreover, to address the scarcity of temporal labels, we propose a data annotation scheme based on CLIP to produce scalable fine-grained pseudo labels. (ii) From the model aspect, we develop a flexible yet effective grounding model that inherits the principles of our formulation. Our model devises single-stream and dual-stream pathways for modality fusion and modality alignment, respectively, and is equipped with three heads to decode the three key elements. This design is capable of addressing each task and utilizing each label. (iii) Lastly, thanks to the unified framework and the availability of pseudo labels, we can perform large-scale temporal grounding pretraining across various labels to enhance our grounding abilities. This empowers us to address various VTG downstream tasks across multiple domains, including zero-shot inference.

To validate the effectiveness of our proposed framework, we conduct experiments not only on the joint moment retrieval and highlight detection benchmark (QVHighlights), but also on three individual tasks: moment retrieval (Ego4D, Charades-STA, TACoS), highlight detection (YouTube Highlights, TVSum), and video summarization (QFVS). Our UniVTG, one unified model with $4.2$M samples for temporal grounding pretraining, has achieved remarkable results, outperforming state-of-the-art methods that are specifically tailored for each task. Overall, our contributions are fourfold:

To the best of our knowledge, our UniVTG is the first video temporal grounding pretraining that spans varied domains and tasks, including moment retrieval, highlight detection, and video summarization.

We introduce a unified VTG framework that can fully leverage rich supervision from open-source, scalable yet diverse temporal annotations, such as point-level, interval-level, and curve-level labels.

To address the limited size of the pretraining corpus, we develop an efficient annotation method that uses CLIP as a teacher to produce scalable pseudo temporal labels.

We demonstrate the effectiveness and flexibility of the proposed framework across four settings and seven datasets. Detailed ablation studies validate the superiority of the proposed components.

Related Work

We review three VTG tasks: moment retrieval, highlight detection, and video summarization, and compare them as different variations of a common problem.

Moment Retrieval aims to localize target moments, i.e., one or many continuous intervals within a video, given a language query, as shown in Fig. 2 (b). Previous methods fall into two categories: proposal-based and proposal-free. The proposal-based methods employ a two-stage process of scanning the entire video to generate candidate proposals, which are then ranked based on how well they match the text query. In contrast, the proposal-free methods learn to regress the start and end boundaries directly without requiring proposal candidates. Our UniVTG borrows from proposal-free approaches but extends them by incorporating diverse temporal labels and tasks with a concise design.

Highlight Detection aims to assign a worthiness score to each video segment, e.g., Fig. 2 (c), and then return the top highest-scoring segments as the highlight. Previous highlight detection datasets tend to be domain-specific and query-agnostic, and many efforts treat this task as a visual or visual-audio scoring problem. Nevertheless, video highlights typically have a theme, which is often reflected in the video titles or topics, e.g., “surfing”. Recently, QVHighlights was proposed as a joint moment retrieval and highlight detection benchmark that enables users to produce various highlights for one video conditioned on different text queries.

Video Summarization aims to summarize the whole video by a set of shots to provide a quick overview, e.g., Fig. 2 (a). It takes two forms: generic video summarization, which captures the important scenes using visual clues only, and query-focused video summarization, which allows users to customize the summary by specifying text keywords (e.g., tree and cars). The latter is closer to practical usage, hence we focus on it. Recently, IntentVizor proposes an interactive approach allowing users to adjust their intents to obtain a superior summary.

In general, each of the three tasks represents a specific form of VTG that grounds different scales of clips from videos (e.g., a consecutive clip set, a single clip, or a disjoint clip set) by offering customized text queries (e.g., sentences, titles, or keywords). However, previous methods address only some of these subtasks. Based on this insight, our goal is to develop a unified framework to handle all of them.

Vision-Language Pretraining

The emergence of large-scale vision-language datasets has paved the way for the development of VLP to enhance video-text representation for various vision-language tasks. The representative CLIP has shown that image-level visual representations can be effectively learned using large-scale noisy image-text pairs. Furthermore, GLIP makes an effort along the spatial axis, leveraging various image annotations, such as image labels, captions, and bounding boxes, to develop strong region-level understanding capacity for spatial grounding tasks. However, due to the expensive manual cost of fine-grained temporal-level annotations, i.e., temporal bounding boxes, this grounding pretraining has not been extended to the temporal axis in videos, so its progress lags behind that of its spatial counterpart. To address this limitation, we explore alternative approaches that leverage accessible timestamp narrations and derive pseudo supervision as the pretraining corpus.

On the other hand, several efforts have been made to perform temporal-friendly video pretraining in pursuit of a better video representation for grounding tasks. However, the resulting pretrained model still requires an additional grounding model such as 2D-TAN to perform video grounding. In contrast, powered by our unified framework and scalable pseudo annotations, we can directly conduct VLP with grounding as a pretraining task. This eliminates the need for additional grounding models and enables zero-shot grounding capacity.

Towards Unified VTG: Tasks and Labels

The UniVTG pipeline is displayed in Fig. 3. In this section, we start by introducing the unified formulation.

Given a video $V$ and a language query $Q$ , we first divide $V$ into a sequence of $L_{v}$ fixed-length clips $\{v_{1},\cdots,v_{L_{v}}\}$ , where each clip $v_{i}$ is of length $l$ and has a centered timestamp $t_{i}$ . The free-form text query $Q$ has $L_{q}$ tokens, denoted as $Q=\{q_{1},\cdots,q_{L_{q}}\}$ . We then define three elements for each clip $v_{i}=\left(f_{i},d_{i},s_{i}\right)$ , described as follows:

Foreground indicator $f_{i}\in\{0,1\}$ : a binary value indicating whether the $i$ -th clip $v_{i}$ belongs to the foreground or not. If clip $v_{i}$ is the foreground of $Q$ , then $f_{i}=1$ , otherwise $f_{i}=0$ .

Boundary offsets $d_{i}=[d_{i}^{s},d_{i}^{e}]$ : the temporal distances from the clip timestamp $t_{i}$ to the start and end of the interval it belongs to, valid only when $f_{i}=1$ . The interval of clip $v_{i}$ can then be written as $b_{i}=[t_{i}-d_{i}^{s},t_{i}+d_{i}^{e}]$ .

Saliency score $s_{i}\in[0,1]$ : a continuous score determining the relevance between the visual content of clip $v_{i}$ and the query $Q$ . If the clip and query are highly correlated, $s_{i}=1$ ; if they are totally irrelevant, then $s_{i}=0$ . Notably, it is reasonable to assume that $s_{i}>0$ if a clip is in the foreground of $Q$ , otherwise $s_{i}=0$ .

In Fig.3 (a), we draw a schematic diagram to represent these three elements of clip $v_{i}$ in our definition.
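
As a minimal illustration of this formulation, the sketch below splits a video into fixed-length clips and stores the three per-clip elements. The names `Clip` and `split_into_clips` are hypothetical and do not come from the paper, and storing $d_{i}$ as a (start, end) pair of distances is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clip:
    """One clip v_i with its query-conditional elements (f_i, d_i, s_i)."""
    t: float                              # centered timestamp t_i (seconds)
    f: int = 0                            # foreground indicator f_i in {0, 1}
    d: Tuple[float, float] = (0.0, 0.0)   # assumed (start, end) offsets from t_i
    s: float = 0.0                        # saliency score s_i in [0, 1]


def split_into_clips(duration: float, clip_len: float) -> List[Clip]:
    """Divide a video of `duration` seconds into fixed-length clips."""
    n_clips = int(duration // clip_len)
    return [Clip(t=(i + 0.5) * clip_len) for i in range(n_clips)]


clips = split_into_clips(duration=10.0, clip_len=2.0)
print([c.t for c in clips])  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

Every clip starts as background; the label-specific conversions described next would fill in `f`, `d`, and `s`.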

Revisiting Various VTG Tasks and Labels

Treating clips as the atomic components of a video, we define the VTG problem as collecting a target clip set $M=\{v_{i}\in V|Q\}$ from $V$ , conditional on language query $Q$ . We next illustrate how to extend this definition to various tasks and labels. In particular, for each label, we answer:

How to collect scalable label corpus for pretraining?

When using the unified formulation, how can we obtain unknown elements based on the available one?

Moment retrieval aims to localize one or many intervals in a video corresponding to a sentence $Q$ . As shown in Fig. 3 (Right blue), moment retrieval aims to select $m$ consecutive clip sets $M=M_{1}\cup\dots\cup M_{m}$ , where $m\geq 1$ , and $M_{j}$ is the $j$ -th target moment. $M$ can be simplified as the boundary set of foreground clips $\{b_{i}|f_{i}=1\}$ .

The temporal interval with specific target boundaries is a common label for moment retrieval. However, annotating intervals requires manually reviewing the full video, which is expensive. One solution is to use ASR, which provides start and end timestamps, but ASR is often too noisy and poorly aligned with the visual content, making it suboptimal. Here, we seek an alternative solution. We find that visual captions tend to be descriptive, making them well-suited as grounding queries; thus, if we know how these videos are cut from the raw source, we can use this information to create pseudo intervals. We find that VideoCC is a viable option for this purpose. It is worth noting that VideoCC was initially developed for video-level pretraining (e.g., to power video-text retrieval), and we are the first to investigate its potential in temporal grounding pretraining.

Once we obtain intervals, we convert interval labels into the proposed formulation by defining $f_{i}=0$ and $s_{i}=0$ for clips that are not in the target interval, and we assign $f_{i}=1$ and assume $s_{i}>0$ for clips that belong to the target interval.
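
The conversion from an interval label to per-clip elements can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `interval_to_elements` is hypothetical, offsets are assumed to be the distances from the clip timestamp to the interval start and end, and saliency is left out because the text only assumes $s_{i}>0$ for foreground clips without fixing a value.

```python
from typing import List, Tuple


def interval_to_elements(
    timestamps: List[float], start: float, end: float
) -> List[Tuple[int, Tuple[float, float]]]:
    """Return (f_i, (d_start, d_end)) for each clip timestamp t_i."""
    elements = []
    for t in timestamps:
        if start <= t <= end:
            # Foreground clip: offsets are distances from t_i to the boundaries.
            elements.append((1, (t - start, end - t)))
        else:
            # Background clip: f_i = 0; offsets are unused.
            elements.append((0, (0.0, 0.0)))
    return elements


labels = interval_to_elements([1.0, 3.0, 5.0, 7.0, 9.0], start=2.0, end=6.0)
print([f for f, _ in labels])  # [0, 1, 1, 0, 0]
print(labels[1][1])            # (1.0, 3.0)
```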

Highlight Detection and Curve-wise Label.

Highlight detection aims to assign an importance score to each video clip (making its annotations resemble a curve), then return the few highest-scoring clips as the highlight, where queries may or may not be provided as input. For video highlighting datasets without language queries, we can use video titles or video domain names as $Q$ because they are highly related to the topic of the video. Then, this task is equivalent to picking the clips with the top-$K$ saliency scores, i.e., $M=\{v_{i}|s_{i}\in\text{top-}K\}$ .

Because interestingness is subjective, the same video usually needs to be labeled by several people to eliminate bias. This makes curve labels the most expensive yet informative temporal annotations. Therefore, we are motivated to find an efficient way of producing scalable curve labels. Intuitively, interestingness reflects how relevant each clip is to the video gist. As depicted in Fig. 4 (a), we first define a concept bank using an open-world detection class list. Next, we use CLIP as a teacher to get the cosine similarities between each clip and each concept. Then, we select the top-$5$ concepts as the video gist, and save their CLIP similarities as pseudo curve labels, i.e., Fig. 4 (b).

As shown in Fig. 4 (c), after obtaining curve labels, we assign $f_{i}=1$ for clips with $s_{i}$ greater than a threshold $\tau$ , otherwise $f_{i}=0$ . The threshold $\tau$ is estimated based on the similarity of each video; refer to Supp. for details. The offsets $d_{i}$ are defined as the distance between the foreground clip and its nearest neighboring clips where $f_{i}=0$ .
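
A rough sketch of this curve-to-elements conversion is given below. It assumes the threshold is already known, measures offsets in clip indices rather than seconds, and treats the video boundary as background; these choices and the name `curve_to_elements` are assumptions made for illustration.

```python
from typing import List, Tuple


def curve_to_elements(
    saliency: List[float], tau: float
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Threshold a saliency curve and derive offsets (in clips) per clip."""
    fg = [1 if s > tau else 0 for s in saliency]
    n = len(fg)
    offsets = []
    for i in range(n):
        if fg[i] == 0:
            offsets.append((0, 0))  # offsets are only meaningful when f_i = 1
            continue
        left = i
        while left >= 0 and fg[left] == 1:
            left -= 1   # walk left to the nearest background clip (or -1)
        right = i
        while right < n and fg[right] == 1:
            right += 1  # walk right to the nearest background clip (or n)
        offsets.append((i - left, right - i))
    return fg, offsets


fg, offsets = curve_to_elements([0.1, 0.6, 0.8, 0.7, 0.2], tau=0.5)
print(fg)       # [0, 1, 1, 1, 0]
print(offsets)  # [(0, 0), (1, 3), (2, 2), (3, 1), (0, 0)]
```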

Video Summarization and Point-wise Label.

Query-focused video summarization aims to summarize the entire video with a set of shots to provide a quick overview, with user-specific concepts (for example, trees and cars). The generated summary should be succinct while remaining representative of the entire video around the given query. We define this task by regarding keywords as $Q$ and selecting a set of clips $M=\{v_{i}|f_{i}=1\}$ , where the size of $M$ is required not to exceed $\alpha\%$ of the original video length, i.e., $|M|\leq\alpha\%|V|$ , e.g., $\alpha=2\%$ .
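
The budget constraint $|M|\leq\alpha\%|V|$ can be illustrated with a small sketch. Ranking clips by a predicted score and keeping the best ones is an assumption of this example; the text above only requires that the selected set respects the budget. The name `select_summary` is hypothetical.

```python
import math
from typing import List


def select_summary(scores: List[float], alpha_percent: float) -> List[int]:
    """Return sorted indices of the selected clips, with |M| <= alpha% * |V|."""
    budget = math.floor(len(scores) * alpha_percent / 100.0)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


scores = [0.0] * 100
scores[10], scores[40], scores[70] = 0.9, 0.8, 0.7
print(select_summary(scores, alpha_percent=2.0))  # [10, 40]
```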

The annotations in QFVS are point labels that indicate whether each shot belongs to the concept or not. Point labels are much cheaper than interval and curve labels, since annotators only need to glance at a specific time. The recent Ego4D dataset uses this point labeling to annotate massive-scale data by assigning a narration to an exact timestamp, such as “I am opening the washing-machine” at ${t}_{i}=2.30$ sec. Given the favorable scale, it is natural to adapt these labels for large-scale pretraining. Recently, there have been attempts to use point-wise annotations to improve the video-text representation and augment NLQ baselines. Despite this, these methods mainly focus on transferring within the same domain.

For point labels, we derive $s_{i}>0$ if $f_{i}=1$ , otherwise $s_{i}=0$ . During pretraining, we estimate the temporal label $b_{i}$ of each narration based on the average distance between consecutive narrations within the video.
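
One way to picture this estimate is the sketch below, which expands each narration timestamp into a window whose width equals the average gap between consecutive narrations in the same video. The exact scaling used by the authors is not stated in this text, so the half-gap rule and the name `narration_windows` are assumptions.

```python
from typing import List, Tuple


def narration_windows(timestamps: List[float]) -> List[Tuple[float, float]]:
    """Expand each narration timestamp into a pseudo window [t - g/2, t + g/2],
    where g is the average gap between consecutive narrations (assumed rule)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two narrations to estimate a gap")
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    avg_gap = sum(gaps) / len(gaps)
    # Clamp the window start at 0 so it never precedes the video.
    return [(max(0.0, t - avg_gap / 2), t + avg_gap / 2) for t in ts]


windows = narration_windows([2.0, 4.0, 8.0])
print(windows)  # [(0.5, 3.5), (2.5, 5.5), (6.5, 9.5)]
```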

Towards Unified VTG: Model

We here introduce our unified model which seamlessly inherits our proposed unified formulation.

As shown in Fig. 5, our model mainly comprises a frozen video encoder, a frozen text encoder, and a multi-modal encoder. The video and text encoders are kept consistent with Moment-DETR, which employs the concatenation of CLIP (ViT-B/32) and SlowFast (R-50) features as the video representation, and uses the CLIP text encoder to extract token-level features. Our multi-modal encoder contains $k$ self-attention blocks followed by three specific heads to decode the prediction.

(ii) For cross-modal interaction, learnable position embeddings $\mathbf{E}^{pos}$ and modality-type embeddings $\mathbf{E}^{type}$ are added to each modality to retain both positional and modality information:

Pretraining Objectives

To match the previous unified formulation, i.e., $\left({f}_{i},{d}_{i},{s}_{i}\right)$ , we devise three different heads to decode each element respectively, each corresponding to one capability.

Notably, this regression objective is only devised for foreground clips i.e., $f_{i}=1$ .

where $\|\cdot\|_{2}$ represents the $L2$ -norm of a vector.

For each video $\mathbf{V}$ , we randomly sample a foreground clip $\mathbf{v}_{p}$ with $f_{p}=1$ and $s_{p}>0$ as a positive sample; we treat other clips in the same video $\mathbf{v}_{j}$ with saliency $s_{j}$ less than $s_{p}$ as negative samples, i.e., $\Omega=\{j|s_{j}<s_{p},1\leq j\leq{L}_{v}\}$ , and perform intra-video contrastive learning:

where $\tau$ is a temperature parameter and set as $0.07$ .
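
The intra-video contrastive objective can be sketched as a standard InfoNCE-style loss over one positive clip and the negative set $\Omega$. Since the exact equation is not reproduced in this text, the form below is an assumption, and `intra_video_loss` is a hypothetical name; `pred` stands for the predicted clip-query similarities and `gt` for the ground-truth saliency scores.

```python
import math
from typing import List


def intra_video_loss(pred: List[float], gt: List[float], p: int,
                     tau: float = 0.07) -> float:
    """InfoNCE-style loss for positive clip p against lower-saliency clips.

    pred: predicted clip-query similarities; gt: ground-truth saliency s_i.
    """
    negatives = [j for j in range(len(gt)) if gt[j] < gt[p]]  # the set Omega
    logits = [pred[p] / tau] + [pred[j] / tau for j in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pred[p] / tau


gt = [0.0, 0.9, 0.4, 0.0]
good = intra_video_loss([0.1, 0.8, 0.3, 0.0], gt, p=1)
bad = intra_video_loss([0.8, 0.1, 0.3, 0.0], gt, p=1)
print(good < bad)  # True
```

The loss is small when the positive clip receives the highest predicted similarity and grows when a lower-saliency clip is ranked above it.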

Besides, we regard sentences from other samples within batches $k\in B$ as negative samples, and develop the inter-video contrastive learning for cross-sample supervision:

Our saliency score head training loss is the combination of inter- and intra-video contrastive learning:

Overall, our total training objective combines the losses of all heads over all clips in the training set.

where $N$ is the clip number of the training set.

Inference

Experiments

In this section, we conduct experiments on various benchmarks to evaluate our approach. Mainly, we design the experiments to study the following questions:

$\mathbf{Q1}$ : How much improvement could be made by UniVTG grounding pretraining?

$\mathbf{Q2}$ : What are the effects of using different pretraining corpora from various labels?

$\mathbf{Q3}$ : Is it necessary to use the proposed unified formulation and unified model?

More ablation studies can be found in Supplementary.

We summarize the dataset information in Tab. 1. For pretraining, we gather $1.8$M point labels from Ego4D and $0.9$M interval labels from VideoCC. For curve labels, we apply the CLIP teacher method (Fig. 4) to the Ego4D and VideoCC datasets to get $1.5$M pseudo labels. Therefore, a total of $4.2$M temporal annotations are used for grounding pretraining. For downstream tasks, we assess our method on four VTG tasks across seven datasets, spanning (i) joint moment retrieval and highlight detection; (ii) moment retrieval; (iii) highlight detection; (iv) video summarization. Additional details are listed in Supp.

Evaluation Metrics. For QVHighlights, we follow the official protocol: Recall@$1$ with IoU thresholds $0.5$ and $0.7$, mean average precision (mAP) with IoU thresholds $0.5$ and $0.75$, and the average mAP over a series of IoU thresholds $[0.5$ : $0.05$ : $0.95]$ are used for moment retrieval. For highlight detection, mAP and HIT@$1$ are used, where a clip is treated as a true positive if it has the saliency score of Very Good. For Charades-STA, NLQ, and TACoS, Recall@$1$ with IoU thresholds $0.3$, $0.5$, and $0.7$, and mIoU are used. For YouTube Highlights and TVSum, we follow prior work and use mAP and Top-$5$ mAP, respectively. For QFVS, we follow prior work and report the F1-score per video as well as the average.
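
For reference, temporal IoU and Recall@$1$ at a given IoU threshold can be computed as in the sketch below. It assumes a single ground-truth window per query and is not the official evaluation code of any benchmark; the function names are illustrative.

```python
from typing import List, Tuple

Window = Tuple[float, float]


def temporal_iou(a: Window, b: Window) -> float:
    """Intersection over union of two [start, end] windows in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Window], gts: List[Window], thr: float) -> float:
    """Fraction of queries whose top-1 prediction reaches IoU >= thr."""
    hits = sum(temporal_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)


preds = [(2.0, 6.0), (0.0, 4.0)]
gts = [(3.0, 7.0), (10.0, 12.0)]
print(temporal_iou(preds[0], gts[0]))    # 0.6
print(recall_at_1(preds, gts, thr=0.5))  # 0.5
```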

Implementation Details. We set $k=4$ multi-modal transformer encoder layers, with $d=1024$ hidden size and $8$ attention heads. The drop path rates are $0.1$ for transformer layers and $0.5$ for input FFN projectors. During the pretraining stage, our experiments are carried out on $8$ A100 GPUs. For downstream tasks, we use one GPU. For moment retrieval, all baselines and UniVTG use the same video and text features. For highlight detection and video summarization, we report results following prior work. See Supp. for more details.

Comparison with State-of-the-arts ($\mathbf{Q1}$)

As illustrated in Tab. 2, we first evaluate our UniVTG on the QVHighlights test split: (i) Without pretraining, UniVTG shows comparable performance to two joint optimization counterparts, Moment-DETR and UMT, demonstrating the strength of its model design for joint task optimization. (ii) With large-scale pretraining, UniVTG exhibits a significant improvement on all metrics, such as ${+8.16}$ Avg. mAP and ${+5.32}$ HIT@$1$. As a result, UniVTG surpasses all baselines by a large margin. Notably, UMT introduces the audio modality and ASR pretraining, but it still trails UniVTG by ${5.55}$ Avg. mAP and ${3.89}$ HIT@$1$. (iii) Thanks to the large-scale pretraining, UniVTG can perform zero-shot grounding and outperforms several supervised baselines without any training samples.

Moment Retrieval

In Tab. 3, we compare the results of our method and the mainstream moment retrieval methods on three widely used benchmarks. (i) Similar to the observation made on QVHighlights, without pretraining, UniVTG is still superior to the other compared methods. This demonstrates once more the effectiveness of our concise architecture. (ii) Large-scale grounding pretraining results in significant improvements, leading to a considerable increase in mIoU, i.e., $+2.97$ on NLQ, $+2.07$ on Charades-STA, and $+5.03$ on TACoS. (iii) Notably, on NLQ, our zero-shot result outperforms all the baseline methods due to the close pretraining domain. However, it is worth mentioning that the zero-shot performance on TACoS is inferior. This could be because the videos have scenes that are very similar to each other, with only small spatial variations, making it difficult to effectively apply zero-shot methods.

Highlight Detection

In Tab. 4 and Tab. 5, we conduct highlight detection experiments on YouTube Highlights and TVSum, respectively, where the baselines marked with $\dagger$ (rows 6-9) incorporate audio features. We observe that (i) grounding pretraining brings improvement on UniVTG and surpasses all baselines in Avg. mAP. (ii) On TVSum, the gain discrepancy among domains may stem from its small scale (50 samples) and scoring subjectivity. In contrast, the larger YouTube dataset (600 videos) yields more consistent pretraining gains. (iii) Moreover, in the zero-shot setting, UniVTG beats several video-only baselines.

Video Summarization

In Tab. 6, we present the QFVS benchmark results. Our pretrained UniVTG achieves a $0.8\%$ higher Avg. F1-score than IntentVizor, an interactive method tailored for the video summarization task. This result demonstrates that our method generalizes to the video summarization task.

Ablation Studies

Effect of different labels for pretraining ( $\mathbf{Q2}$ ). In the top half of Tab. 7, we investigate the effect of different label corpora for pretraining. The results here are obtained before applying the unified formulation, i.e., with the original labels provided by the pretraining set. Our findings (rows 1-4) indicate that (i) incorporating any type of label for pretraining yields considerable performance gains on most benchmarks. (ii) Combining all three types of data (row 5) for pretraining further boosts the outcomes, such as $+5.2$ MR’s mAP and $+1.1$ HL’s mAP over the baseline (row 1) on QVHighlights.

Effect of unified formulation ( $\mathbf{Q3}$ ). In the bottom half of Tab. 7, we further study the impact of the unified formulation, i.e., the benefits of deriving unknown elements for pretraining. Comparing rows 2-4 with rows 6-8, we find that (i) the converted training corpora yield performance gains in most settings, which shows that the label conversion methods are crucial for better utilizing temporal labels. (ii) Among all settings, curve labels appear to be the most effective, and beat the manual point labels except in a few domains, e.g., TACoS. (iii) We get the optimal result (row 9) by using all three converted corpora for pretraining, with a $4.62$ MR’s mAP and $1.28$ HL’s mAP increase over the counterpart (row 5) on QVHighlights.

Effect of pretraining scale. In Fig. 6, we explore the effect of utilizing various scales of labels for pretraining. We observe a steady performance improvement on both moment retrieval and highlight detection tasks as the training sample size increases. It also shows that unifying labels to construct a large training corpus can greatly benefit VTG.

Conclusion

This paper introduces UniVTG, a framework that unifies diverse VTG tasks and labels by addressing three key challenges: (i) We define a unified formulation for VTG to convert various labels and tasks under a single framework, and propose a label scaling scheme. (ii) We develop an effective yet flexible model that can handle various VTG tasks and training labels. (iii) Due to the unified framework and availability of scalable labels, it becomes feasible to perform large-scale temporal grounding pretraining over diverse labels. We demonstrate the effectiveness and flexibility of our UniVTG on four settings across seven datasets, spanning joint optimization as well as individual tasks.

Acknowledgements

This project is supported by the National Research Foundation, Singapore under its NRFF Award NRF-NRFF13-2021-0008, the DSO National Laboratories, and Mike Zheng Shou’s Start-Up Grant from NUS. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore.

References

Appendix of UniVTG

The concept bank is a class list for open-world detection, sourced from https://storage.googleapis.com/openimages/v6/oidv6-class-descriptions.csv. This list comprises $19,995$ class names, such as “Sandwich Cookies,” “Air conditioning,” and “Advertising.” After conducting a manual check, we determined that the class list can effectively encompass the majority of common concepts.

In our approach, we begin by extracting frame-level CLIP image features from the video at a rate of 2 fps. Following this, we calculate their similarity scores with respect to the given class list. We then determine the top-5 classes with the highest average scores, representing the most significant concepts within the video.
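
The concept selection step can be sketched as follows. The similarity values here are toy numbers standing in for CLIP frame-to-concept cosine similarities, and `top_k_concepts` is a hypothetical helper rather than part of any released code.

```python
from typing import Dict, List


def top_k_concepts(sim: Dict[str, List[float]], k: int = 5) -> List[str]:
    """Rank concepts by their average frame-level similarity to the video."""
    avg = {name: sum(scores) / len(scores) for name, scores in sim.items()}
    return sorted(avg, key=avg.get, reverse=True)[:k]


# Toy similarities for three frames; real scores would come from CLIP.
sim = {
    "Sandwich Cookies": [0.10, 0.12, 0.11],
    "Air conditioning": [0.30, 0.28, 0.35],
    "Advertising": [0.20, 0.22, 0.18],
}
print(top_k_concepts(sim, k=2))  # ['Air conditioning', 'Advertising']
```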

When deriving intervals from curves with diverse distributions, a fixed threshold is hard to determine and lacks flexibility. Thus, we discretize the continuous curve with a small step of $0.05$ and pick the maximum discrete value as our threshold. Then, adjacent clips that share the maximum discrete value form an interval. In this way, we may produce multiple temporal windows from one video. This process is shown in Fig. 9.
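
A possible reading of this procedure is sketched below: each score is mapped to a bin of width $0.05$, and runs of adjacent clips in the highest bin become windows. Using `floor` for the discretization and returning windows as inclusive clip-index pairs are assumptions of this example, as is the name `curve_to_intervals`.

```python
import math
from typing import List, Tuple


def curve_to_intervals(curve: List[float], step: float = 0.05) -> List[Tuple[int, int]]:
    """Group adjacent clips that fall in the highest discretized bin."""
    bins = [math.floor(s / step) for s in curve]
    top = max(bins)
    intervals, start = [], None
    for i, b in enumerate(bins):
        if b == top and start is None:
            start = i                         # open a new window
        elif b != top and start is not None:
            intervals.append((start, i - 1))  # close the current window
            start = None
    if start is not None:
        intervals.append((start, len(bins) - 1))
    return intervals


print(curve_to_intervals([0.12, 0.31, 0.33, 0.18, 0.32, 0.11]))
# [(1, 2), (4, 4)]
```

In the toy curve, clips 1, 2, and 4 share the top bin, so the video yields two windows.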

B. Datasets

Pretraining corpus. To establish our pretraining corpus, we collect data in three ways: For point labels, we extract the timestamped narrations from Ego4D, excluding the NLQ val / test splits. For interval labels, we select a subset of videos (less than 300K) sourced from VideoCC (https://github.com/google-research-datasets/videoCC-data), and treat their start and end timestamps as windows and their captions as queries. For curve labels, we derive them from the above VideoCC subset videos. Below, we describe the benchmarks used for the four settings separately.

(i) Joint Moment Retrieval and Highlight Detection. QVHighlights is the only dataset with available annotations for both moment retrieval and highlight detection, making it an ideal choice for benchmarking multi-task joint optimization. This dataset contains $10,148$ videos with an average length of $150$ sec that covers daily vlogs, travel vlogs, and news events scenarios. There are a total of $10,310$ queries associated with $18,367$ moments (on average, $1.8$ disjoint moments per query in the video).

(ii) Moment Retrieval. We utilize three benchmarks to further evaluate moment retrieval: Charades-STA, Ego4D Natural Language Queries (NLQ), and TACoS. (a) Charades-STA contains $16,128$ indoor videos with an average length of $30.6$ sec, which are made up of $12,408$ query-interval pairs for training and $3,720$ query-interval pairs for testing. (b) NLQ focuses on daily egocentric scenarios, where videos are $8$ to $20$ minutes long and queries are questions, e.g., “What did I pour in the bowl?”, making this benchmark challenging. The training set contains $11.3$K annotated queries from $1$K videos, whereas the validation set contains $3.9$K queries from $0.3$K videos. (c) TACoS contains $127$ videos with an average duration of $4.78$ minutes, where $75$ videos are used for training, and $27$ and $25$ videos for validation and testing, respectively.

(iii) Highlight Detection. We utilize two benchmarks to further evaluate highlight detection: YouTube Highlights and TVSum. (a) YouTube Highlights has $6$ domains with $433$ videos; since video titles are not provided, we use the domain name of each video as the text query. (b) TVSum includes $10$ domains, each with $5$ videos, and we use the video titles as text queries. We follow the data splits in which the ratio of training to testing is $0.8$ : $0.2$ .

(iv) Video Summarization. We utilize the QFVS benchmark to evaluate video summarization. This dataset includes the four videos in the UT Egocentric dataset. Each video is recorded in daily life and lasts between $3$ and $5$ hours. Each query in this dataset is represented by two words from a total of $48$ pre-defined concepts.

C. Experimental settings

(i) In Tab. 8, we detail the parameters for each setting. Notably, for the highlight detection benchmarks YouTube Highlights and TVSum, which contain multiple domains treated as separate splits, we perform parameter tuning for $\lambda_{\text{intra}}$ within each domain. Then we aggregate the results obtained using the optimal settings. The optimal settings are listed in Tab. 9-10.

(ii) During training, to maintain the balance between positive and negative samples, we allocate a weight of $0.1$ to the negatives ( $f_{i}=0$ ) in binary cross-entropy loss Eq. 4.
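
The effect of this weighting can be seen in a small sketch of a weighted binary cross-entropy. Averaging over clips and clamping the probabilities are assumptions of this example; only the $0.1$ weight on negatives comes from the text, and `weighted_bce` is a hypothetical name.

```python
import math
from typing import List


def weighted_bce(probs: List[float], labels: List[int],
                 neg_weight: float = 0.1, eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over clips, down-weighting f_i = 0."""
    total = 0.0
    for p, f in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        if f == 1:
            total += -math.log(p)
        else:
            total += -neg_weight * math.log(1.0 - p)
    return total / len(labels)


loss = weighted_bce([0.9, 0.2, 0.1], [1, 0, 0])
print(round(loss, 4))  # 0.0461
```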

(iv) For video summarization, we adhere to the same preprocessing settings as prior work, which extracts video frame features at $1$ FPS, takes $5$ seconds as a clip, and computes the average frame feature within a clip to generate its clip-level feature. By applying the KTS algorithm, we split a long video into small segments under the conditions that the number of segments in a video is no more than $20$ and each segment contains no more than $200$ clips.
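
The clip-level feature computation described here amounts to averaging consecutive frame features, as in the sketch below. Keeping a shorter final chunk when the frame count is not a multiple of the clip size is an assumption, and `frames_to_clips` is a hypothetical name; the KTS segmentation step is not shown.

```python
from typing import List


def frames_to_clips(frames: List[List[float]], fps: int = 1,
                    clip_sec: int = 5) -> List[List[float]]:
    """Average consecutive frame features into clip-level features."""
    per_clip = fps * clip_sec
    clips = []
    for start in range(0, len(frames), per_clip):
        chunk = frames[start:start + per_clip]  # the last chunk may be shorter
        dim = len(chunk[0])
        clips.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return clips


frames = [[float(i), 1.0] for i in range(12)]  # 12 frames, 2-dim toy features
print(frames_to_clips(frames))
# [[2.0, 1.0], [7.0, 1.0], [10.5, 1.0]]
```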

D. Ablation studies of training objective

Since we use identical training objectives during pretraining and downstream transfer, we construct ablation studies in Tab. 11 to gain a more thorough understanding of the impact of each component. In the top half, we study the effect of downstream training objectives (without any pretraining), while in the bottom half, we investigate the effect of pretraining objectives (the downstream tuning uses the same optimal parameter settings).

E. Parameters sensitivity

Transformer layers. In Tab. 12, we ablate the number of transformer layers $L$ of the multi-modal encoder in our unified model (without pretraining).

Projector dimension. In Fig. 10, we study the effect of projector dimension from $256$ to $1024$ (without pretraining).

F. Loss weights

In Tab. 13, we study the effect of foreground loss on three moment retrieval benchmarks (with pretraining).

G. Visualizations

In Fig. 7 and 8, we show quantitative visualizations of UniVTG predictions across different settings and domains.