Less Is More: Picking Informative Frames for Video Captioning

Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang

Introduction

Humans are born with the ability to identify useful information and filter out redundant information. In biology, this mechanism is called sensory gating, which describes the neurological processes that filter unnecessary stimuli out of all possible environmental stimuli in the brain, thus preventing an overload of redundant information in the higher cortical centers. This cognitive mechanism is consistent with a large body of research in computer vision.

As strong evidence of visual sensory gating in practice, attention has been introduced to identify the salient visual regions of an image with high objectness and meaningful visual patterns. Attention has also been established on videos, which contain consecutive image frames. Existing studies follow a common procedure that includes frame-level appearance modeling and motion modeling on frames sampled at equal intervals, say, every 3 or 5 frames. Visual features and motion features are extracted from the selected frame subset one by one, and they are all fed into the learning stage. As with images, video attention is recognized as a spatial-temporal saliency that identifies both salient objects and their motion trajectories. Video attention is also recognized as the word-frame association learned by sparse coding or gaze-guided attention learning, which is a de facto frame weighting mechanism. The visual attention mechanism also benefits many downstream tasks, such as visual captioning and visual question answering for images and videos.

Despite the success of existing attention-based methods in bridging vision and language, there still exist critical issues to be addressed, as follows.

Frame selection perspective. As shown in Figure 1(a), equal-interval frame sampling selects many frames with duplicated and redundant visual appearance information. This incurs considerable computational cost with little performance gain, as the information from the input is not appropriately sampled. For example, it takes millions of floating-point operations to extract a frame-level visual feature with a moderate-sized CNN model. Moreover, there is no guarantee that all the frames selected by equal-interval sampling contain meaningful information, so this sampling tends to be more sensitive to content noise such as motion blur, occlusion and object zoom-out.

Downstream video captioning task perspective. Previous attention-based models mostly identify the spatial layout of visual saliency, but the temporal redundancy existing in neighboring frames remains unsolved as all the frames are taken into consideration. This may lead to an unexpected information overload on the visual-linguistic correlation analysis model. For example, the dense-captioning-based strategy can potentially describe images/videos in finer levels of detail by captioning many visual regions within an image/video-clip. With an increasing number of frames, many highly similar visual regions will be generated and the problem will become prohibitive as the search space of sequence-to-sequence association becomes extremely large.

We address the following question: Is there a way to use as few frames as possible while closely approximating the performance obtained using all the frames for video captioning? We propose PickNet to perform informative frame picking for video captioning. Specifically, the base model for visual-linguistic association in video captioning is a standard Encoder-Decoder framework. We develop a reinforcement-learning-based procedure to train the network sequentially, where the reward of each frame picking action is designed by considering both visual and textual cues. From the visual perspective, we maximize the diversity between the currently picked frame candidate and the selected frames. From the textual perspective, we minimize the discrepancy between the ground-truth caption and the one generated using the currently picked candidate. If the candidate is rewarded, it will be selected and the corresponding latent representation of the Encoder-Decoder will be updated for future trials. This procedure goes on until the end of the video sequence. Consequently, a compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation.

To the best of our knowledge, this is the first study on frame selection for video captioning. In fact, our framework can go beyond the Encoder-Decoder framework in the video captioning task and serve as a complementary building block for other state-of-the-art solutions. It can also be adapted to other task-specific objectives for video analysis. In summary, the merits of our PickNet include:

Flexibility. We design a plug-and-play reinforcement-learning-based PickNet that picks informative frames for the next learning stage. A compact frame subset can be selected to represent the visual information and perform video captioning without performance degradation.

Efficiency. The architecture can largely cut down the usage of convolution operations. It makes our method more applicable for real-world video processing.

Effectiveness. Experiments show that our model can use a small number of frames to achieve performance comparable to or even better than the state of the art.

Related Works

Visual captioning is the task of translating visual content into natural language. As early as 2002, Kojima et al. proposed the first video captioning system for describing human behavior. Since then, a series of image and video captioning studies have been conducted. Early approaches tackle this problem with a bottom-up paradigm, which first generates descriptive words for an image by attribute learning and object recognition, and then combines them using language models that fit the predicted words to predefined sentence templates. With the development of neural networks and deep learning, modern captioning systems are based on CNNs and RNNs with the Encoder-Decoder architecture.

An active branch of captioning uses the attention mechanism to weight the input features. For image captioning, the mechanism typically takes the form of spatial attention. Xu et al. first introduced an attention-based model that automatically learns to fix its gaze on salient objects while generating the corresponding words in the output sequence. For video captioning, temporal attention is added. Yao et al. took into account both the local and global temporal structure of videos to produce descriptions, and their model learned to automatically select the most relevant temporal segments given the text-generating RNN. However, attention-based methods, especially those using temporal attention, operate under the fully observed condition, which is not suitable for some real-world applications, such as blind navigation. Our method does not require global information about the video, which makes it more effective in these applications.

Frame selection

Research on how to select informative frames from videos has mainly been carried out in the video summarization field. This problem may be formulated as image searching. For example, Song et al. considered that images related to the video title can serve as a proxy for important visual concepts, so they developed a co-archetypal analysis technique that learns canonical visual concepts shared between videos and images, and used it to summarize videos. Other approaches use sparse learning to deal with this problem. Zhao et al. proposed to learn a dictionary from a given video using group sparse coding, and the summary video was then generated by combining segments that cannot be sparsely reconstructed using the learned dictionary.

Some video analysis tasks incorporate a frame selection mechanism. For example, in action detection, Yeung et al. designed a policy network to directly predict the temporal bounds of actions, which decreased the need for processing the whole video and improved the detection performance. However, the prediction made by this method is in the form of a normalized global position, which requires knowledge of the video length and makes the method unable to deal with real video streams. Different from the above methods, our model selects frames based on both semantic and visual information, and does not need to know the global length of videos.

Method

Our method can be viewed as the combination of two parts: the Encoder-Decoder based sentence generator and the PickNet.

Like most video captioning methods, our model is built on the Encoder-Decoder based sentence generator. In this subsection, we briefly introduce this building block.

Encoder. Given an input video, we use a recurrent video encoder which takes a sequence of visual features $(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n})$ as input and outputs a fixed-size vector $\mathbf{v}$ as the representation of this video. The encoder is built on top of a Long Short-Term Memory (LSTM) unit, which has been widely used for video encoding, since it is known to properly deal with long-range temporal dependencies. Unlike the vanilla recurrent neural network unit, the LSTM introduces a memory cell $\mathbf{c}$ which maintains the history of the inputs observed up to a time-step. Specifically, we use the following equations:
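For reference, one common form of the LSTM update that is consistent with the symbols used in this section is sketched below. The gate names ($\mathbf{i}_{t}$, $\mathbf{f}_{t}$, $\mathbf{o}_{t}$, $\mathbf{g}_{t}$) and the subscripts on $W$ and $\mathbf{b}$ are assumptions made for illustration; the exact parameterization used in our model may differ.

```latex
\begin{aligned}
\mathbf{i}_{t} &= \sigma(W_{ix}\mathbf{x}_{t} + W_{ih}\mathbf{h}_{t-1} + \mathbf{b}_{i}) \\
\mathbf{f}_{t} &= \sigma(W_{fx}\mathbf{x}_{t} + W_{fh}\mathbf{h}_{t-1} + \mathbf{b}_{f}) \\
\mathbf{o}_{t} &= \sigma(W_{ox}\mathbf{x}_{t} + W_{oh}\mathbf{h}_{t-1} + \mathbf{b}_{o}) \\
\mathbf{g}_{t} &= \phi(W_{gx}\mathbf{x}_{t} + W_{gh}\mathbf{h}_{t-1} + \mathbf{b}_{g}) \\
\mathbf{c}_{t} &= \mathbf{f}_{t}\odot\mathbf{c}_{t-1} + \mathbf{i}_{t}\odot\mathbf{g}_{t} \\
\mathbf{h}_{t} &= \mathbf{o}_{t}\odot\phi(\mathbf{c}_{t})
\end{aligned}
```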

where $\odot$ denotes the element-wise Hadamard product, $\sigma$ is the sigmoid function, $\phi$ is the hyperbolic tangent tanh, $W_{*}$ are learned weight matrices and $\mathbf{b}_{*}$ are learned bias vectors. The hidden state $\mathbf{h}$ and memory cell $\mathbf{c}$ are initialized to zero. The last hidden state $\mathbf{h}_{T}$ is used as the final encoded video representation $\mathbf{v}$.

Decoder and sentence generation. Once the representation of the video has been generated, the recurrent decoder can employ it to generate the corresponding description. At every time-step of the decoding phase, the decoder unit uses the encoded vector $\mathbf{v}$, the one-hot representation of the previously generated word $\mathbf{w}_{t-1}$ and the previous internal state $\mathbf{p}_{t-1}$ as input, and outputs a new internal state $\mathbf{p}_{t}$. As in previous work, our decoder unit is the Gated Recurrent Unit (GRU), a simplified version of the LSTM, which is good at language decoding. The output of the GRU is modulated via two sigmoid gates: a reset gate $\mathbf{r}_{t}$ and an update gate $\mathbf{z}_{t}$. The gates are computed as follows:

Exploiting the values of the above gates, the output of the decoder at timestep $t$ is computed as:
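As a sketch only, a standard GRU conditioned on $\mathbf{v}$, $\mathbf{w}_{t-1}$ and $\mathbf{p}_{t-1}$ can be written as below. The candidate state $\tilde{\mathbf{p}}_{t}$ and the subscripts on $W$ and $\mathbf{b}$ are assumed names introduced for illustration; the exact form used in our decoder may differ.

```latex
\begin{aligned}
\mathbf{r}_{t} &= \sigma(W_{rv}\mathbf{v} + W_{rw}W_{w}\mathbf{w}_{t-1} + W_{rp}\mathbf{p}_{t-1} + \mathbf{b}_{r}) \\
\mathbf{z}_{t} &= \sigma(W_{zv}\mathbf{v} + W_{zw}W_{w}\mathbf{w}_{t-1} + W_{zp}\mathbf{p}_{t-1} + \mathbf{b}_{z}) \\
\tilde{\mathbf{p}}_{t} &= \phi(W_{hv}\mathbf{v} + W_{hw}W_{w}\mathbf{w}_{t-1} + W_{hp}(\mathbf{r}_{t}\odot\mathbf{p}_{t-1}) + \mathbf{b}_{h}) \\
\mathbf{p}_{t} &= (1-\mathbf{z}_{t})\odot\mathbf{p}_{t-1} + \mathbf{z}_{t}\odot\tilde{\mathbf{p}}_{t}
\end{aligned}
```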

where $W_{*}$ and $\mathbf{b}_{*}$ are learned weights and biases, and $W_{w}$ transforms the one-hot encoding of words to a dense, lower-dimensional embedding. Again, $\odot$ denotes the element-wise product, $\sigma$ is the sigmoid function and $\phi$ is the hyperbolic tangent. A softmax function is applied on $\mathbf{p}_{t}$ to compute the probability of producing a certain word at the current time-step:

where $W_{p}$ is used to project the output of the decoder to the dictionary space and $\omega$ denotes all parameters of the Encoder-Decoder. The internal state $\mathbf{p}$ is also initialized to zero. We use greedy decoding to generate every word: at every time-step, we choose the word with the maximal $p_{\omega}(\mathbf{w}_{t}|\mathbf{w}_{t-1},\mathbf{w}_{t-2},...,\mathbf{w}_{1},\mathbf{v})$ as the current output word. Specifically, we use a special token $<$BOS$>$ as $\mathbf{w}_{0}$ to start the decoding, and when the decoder generates another special token $<$EOS$>$, the decoding procedure is terminated.
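The greedy routine described above can be sketched in a few lines of Python. This is a minimal illustration, not our implementation: `step` is a hypothetical stand-in for one GRU decoder step that returns unnormalized scores over the vocabulary, `toy_step` and `VOCAB` are toy placeholders, and `max_len` is an assumed safety cap that is not part of the method.

```python
import math
from typing import Any, Callable, List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(
    step: Callable[[str, Any], Tuple[List[float], Any]],
    vocab: Sequence[str],
    max_len: int = 20,  # assumed cap, only to guarantee termination
    bos: str = "<BOS>",
    eos: str = "<EOS>",
) -> List[str]:
    """Pick the most probable word at every time-step until <EOS> appears."""
    words: List[str] = []
    prev, state = bos, None
    for _ in range(max_len):
        scores, state = step(prev, state)
        probs = softmax(scores)
        best = max(range(len(vocab)), key=lambda k: probs[k])
        prev = vocab[best]
        if prev == eos:
            break
        words.append(prev)
    return words


# Toy decoder step (hypothetical): favours the next word of a fixed sentence.
VOCAB = ["a", "man", "is", "running", "<EOS>"]


def toy_step(prev_word: str, state: Any) -> Tuple[List[float], int]:
    position = 0 if state is None else state
    scores = [0.0] * len(VOCAB)
    scores[min(position, len(VOCAB) - 1)] = 5.0
    return scores, position + 1


print(greedy_decode(toy_step, VOCAB))  # ['a', 'man', 'is', 'running']
```

Because the argmax of the softmax equals the argmax of the raw scores, the normalization is kept here only to mirror the probability in the text.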

Our approach

The PickNet aims to select informative video content without knowing the global information. This means that the pick decision can only be based on the current observation and the history, which makes the task more difficult than video summarization. More challenging still, we have no supervised information to guide the learning of PickNet in video captioning tasks. Therefore, we formulate the problem as a reinforcement learning task: given an input image sequence sampled from a video, the agent should select a subset of the images under a certain policy so as to retain as much video content as possible. Here, we use PickNet to produce the picking policy. Figure 4 shows the architecture of PickNet.

where $W_{*}$ are learned weight matrices and $\mathbf{b}_{*}$ are learned bias vectors. During training, we use a stochastic policy, i.e., the action is sampled according to Equation (13). At test time, the policy becomes deterministic, i.e., the action with the higher probability is chosen. If the policy decides to pick the current frame, the frame feature is extracted by a pretrained CNN, embedded into a lower dimension, and passed to the encoder unit, and the template is updated:

We force PickNet to pick the first frame, so the encoder always processes at least one frame, which makes the training procedure more robust. Figure 3 shows how PickNet works with the encoder. It is worth noting that the input of PickNet can take other forms, such as the difference between optical flow maps, which may handle motion information more properly.
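The difference between the training-time and test-time policies, together with the forced first pick, can be illustrated with the following sketch. The function name `choose_action`, the 0.5 threshold for a two-action policy, and the tie-breaking toward skipping are assumptions made for illustration.

```python
import random


def choose_action(
    pick_prob: float,
    training: bool,
    rng: random.Random,
    is_first_frame: bool = False,
) -> int:
    """Return 1 to pick the current frame and 0 to skip it."""
    if is_first_frame:
        return 1  # the first frame is always picked
    if training:
        # Stochastic policy: sample the action from the pick probability.
        return 1 if rng.random() < pick_prob else 0
    # Deterministic policy: take the action with the higher probability
    # (assumed tie-break: skip when both actions are equally likely).
    return 1 if pick_prob > 0.5 else 0


rng = random.Random(0)
print(choose_action(0.9, training=False, rng=rng))  # 1
print(choose_action(0.1, training=False, rng=rng))  # 0
```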

Rewards

The design of rewards is essential to reinforcement learning. For the purpose of picking informative video frames, we consider two parts of the reward: the language reward and the visual diversity reward.

Language reward. First of all, the picked frames should contain rich semantic information, which can be used to effectively generate a language description. In the video captioning task, it is natural to use the evaluated language metrics as the language reward. Here, we choose the CIDEr score. Given a video $v_{i}$ and a collection of human-generated reference sentences $S_{i}=\{s_{ij}\}$, the goal of CIDEr is to measure the similarity of the machine-generated sentence $c_{i}$ to the consensus of how most people describe the video. So the language reward $r_{l}$ is defined as:

where $N_{p}$ is the number of picked frames, $\mathbf{x}_{i}^{(j)}$ is the $j$ -th value of the $i$ -th visual feature, and $\mu^{(j)}=\frac{1}{N_{p}}\sum_{i=1}^{N_{p}}\mathbf{x}_{i}^{(j)}$ is the mean of all the $j$ -th value of visual features.

Picks limitation. If the number of picked frames is too large or too small, it may lead to poor performance in either efficiency or effectiveness, so we assign a negative reward to discourage these situations. Empirically, we set the minimum picked number $N_{\text{min}}$ to 3, which stands for beginning, highlight and ending. The maximum picked number $N_{\text{max}}$ is initially set to $\frac{1}{3}$ of the total frame number, and is shrunk during the training process until it decreases to a minimum value $\tau$.

In summary, we merge the two parts of reward, and the final reward can be written as

where $\lambda_{*}$ are the weighting hyper-parameters and $R^{-}$ is the penalty reward.
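How the pieces of the reward fit together can be sketched as follows. This is an illustration under stated assumptions: the visual diversity term is taken to be the per-dimension standard deviation of the picked features around $\mu^{(j)}$, summed over dimensions, which is one plausible reading rather than a confirmed definition; the CIDEr score is passed in as a number; and the default values of `n_max`, `lambda_l` and `lambda_v` are hypothetical.

```python
import math
from typing import Sequence


def visual_diversity(features: Sequence[Sequence[float]]) -> float:
    """Assumed form: sum over dimensions j of the std of the j-th feature value."""
    n_p = len(features)  # number of picked frames
    dim = len(features[0])
    total = 0.0
    for j in range(dim):
        mu_j = sum(x[j] for x in features) / n_p
        var_j = sum((x[j] - mu_j) ** 2 for x in features) / n_p
        total += math.sqrt(var_j)
    return total


def final_reward(
    cider: float,
    features: Sequence[Sequence[float]],
    n_min: int = 3,          # minimum number of picks from the text
    n_max: int = 10,         # hypothetical current value of the shrinking bound
    lambda_l: float = 1.0,   # hypothetical weight of the language reward
    lambda_v: float = 1.0,   # hypothetical weight of the diversity reward
    penalty: float = -1.0,   # penalty reward R^-
) -> float:
    """Weighted sum of both rewards, or the penalty if the pick count is out of range."""
    n_p = len(features)
    if n_p < n_min or n_p > n_max:
        return penalty
    return lambda_l * cider + lambda_v * visual_diversity(features)


picked = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
print(final_reward(0.5, picked))      # language + diversity
print(final_reward(0.9, picked[:2]))  # too few picks, so the penalty applies
```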

Training

The training procedure is split into three stages. The first stage pretrains the Encoder-Decoder; we call it the supervision stage. In the second stage, we fix the Encoder-Decoder and train PickNet by reinforcement learning; we call it the reinforcement stage. The final stage is the joint training of PickNet and the Encoder-Decoder; we call it the adaptation stage. We use standard back-propagation to train the Encoder-Decoder, and REINFORCE to train PickNet.

Supervision stage. When training the Encoder-Decoder, the traditional method maximizes the likelihood of the next ground-truth word given the previous ground-truth words using back-propagation. However, this approach causes exposure bias, which results in error accumulation during generation at test time, since the model has never been exposed to its own predictions. To alleviate this phenomenon, the scheduled sampling procedure is used, which feeds back the model's own predictions and slowly increases the feedback probability during training. We use SGD with cross-entropy loss to train the Encoder-Decoder. Given the ground-truth sentence $\mathbf{y}=(\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{m})$, the loss is defined as:
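The standard cross-entropy form consistent with this description is sketched below; the symbol $L_{X}(\omega)$ is an assumed name for the loss.

```latex
L_{X}(\omega) = -\sum_{t=1}^{m}\log p_{\omega}(\mathbf{y}_{t}\mid\mathbf{y}_{t-1},\mathbf{y}_{t-2},\ldots,\mathbf{y}_{1},\mathbf{v})
```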

where $p_{\omega}(\mathbf{y}_{t}|\mathbf{y}_{t-1},\mathbf{y}_{t-2},\dots\mathbf{y}_{1},\mathbf{v})$ is given by the parametric model in Equation (11).

Reinforcement stage. In this stage, we fix the Encoder-Decoder and treat it as the environment, which can produce language reward to reinforce PickNet. The goal of training is to minimize the negative expected reward:
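Written out, the negative expected reward takes the usual form below; the symbol $L_{R}(\theta)$ is an assumed name for this objective, and $r(\cdot)$ denotes the reward of an action sequence.

```latex
L_{R}(\theta) = -\mathbb{E}_{\mathbf{a}^{s}\sim p_{\theta}}\left[r(\mathbf{a}^{s})\right]
```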

where $\theta$ denotes all parameters of PickNet, $p_{\theta}$ is the learned policy parameterized by Equation (13), and $\mathbf{a}^{s}=(a^{s}_{1},a^{s}_{2},\ldots,a^{s}_{n})$, where $a^{s}_{t}$ is the action sampled from the learned policy at time step $t$.

We train PickNet using the REINFORCE algorithm, which is based on the observation that the gradient of a non-differentiable expected reward can be computed as follows:

Using the chain rule, the gradient can be rewritten as:

where $\mathbf{s}_{t}$ is the input to the softmax function. In practice, the gradient can be approximated using a single Monte-Carlo sample $\mathbf{a}^{s}=(a^{s}_{1},a^{s}_{2},\ldots,a^{s}_{n})$ from $p_{\theta}$ :

When using REINFORCE to train the policy network, we need to estimate a baseline reward $b$ to reduce the variance of the gradients. Here, the self-critical strategy is used to estimate $b$. In brief, the reward obtained by the current model under the inference procedure used at test time, denoted as $r(\hat{\mathbf{a}})$, is treated as the baseline reward. Therefore, the final gradient expression is:
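For a softmax policy, the self-critical estimate is commonly written as $\frac{\partial L}{\partial \mathbf{s}_{t}}\approx\left(r(\mathbf{a}^{s})-r(\hat{\mathbf{a}})\right)\left(p_{\theta}(\cdot\mid t)-1_{a^{s}_{t}}\right)$, where $1_{a^{s}_{t}}$ is the one-hot vector of the sampled action. The sketch below evaluates this expression for a single Monte-Carlo sample; it is an illustration of the general technique under that assumed form, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(s - peak) for s in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_critical_logit_grads(
    logits_per_step: Sequence[Sequence[float]],
    sampled_actions: Sequence[int],
    sampled_reward: float,  # r(a^s): reward of the sampled pick sequence
    greedy_reward: float,   # r(a_hat): reward under test-time inference
) -> List[List[float]]:
    """Gradient of the negative expected reward w.r.t. the softmax inputs s_t."""
    advantage = sampled_reward - greedy_reward
    grads: List[List[float]] = []
    for logits, action in zip(logits_per_step, sampled_actions):
        probs = softmax(logits)
        one_hot = [1.0 if k == action else 0.0 for k in range(len(probs))]
        grads.append([advantage * (p - o) for p, o in zip(probs, one_hot)])
    return grads


# One time-step, two actions (skip = 0, pick = 1), sampled action = pick.
print(self_critical_logit_grads([[0.0, 0.0]], [1], 1.0, 0.5))  # [[0.25, -0.25]]
```

When the sampled sequence beats the greedy baseline, a gradient-descent step raises the logit of the sampled action; when it does worse, the sign flips and the sampled action is discouraged.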

Adaptation stage. After the first two stages, the Encoder-Decoder and PickNet are well pretrained, but there exists a gap between them because the Encoder-Decoder uses all the video frames as input while PickNet selects only a portion of them. So we need a joint training stage to integrate these two parts. However, the pick action is not differentiable, so the gradients introduced by the cross-entropy loss cannot flow into PickNet. Hence, we follow the approximate joint training scheme. In each iteration, the forward pass generates frame picks, which are treated as fixed picks when training the Encoder-Decoder, and the backward propagation and REINFORCE updates are performed as usual. This acts like performing dropout along the time sequence, which can improve the versatility of the Encoder-Decoder.

Experimental Setup

We evaluate our model on two widely used video captioning benchmark datasets: the Microsoft Video Description (MSVD) and the MSR Video-to-Text (MSR-VTT).

Microsoft Video Description (MSVD). The Microsoft Video Description dataset is also known as YoutubeClips. It contains 1,970 YouTube video clips, each labeled with around 40 English descriptions collected from Amazon Mechanical Turk workers. As done in previous works, we split the dataset into three parts: the first 1,200 videos for training, the following 100 videos for validation and the remaining 670 videos for test. This dataset mainly contains short video clips with a single action, and the average duration is about 9 seconds. So it is very suitable to use only a portion of the frames to represent the full video.

MSR Video-to-Text (MSR-VTT). The MSR Video-to-Text dataset is a large-scale benchmark for video captioning. It provides 10,000 video clips, and each video is annotated with 20 English descriptions. Thus, there are 200,000 video-caption pairs in total. This dataset is collected from a commercial video search engine, and so far it covers the most comprehensive categories and the most diverse visual content. Following the original paper, we split the dataset into contiguous groups of videos by index number: 6,513 for training, 497 for validation and 2,990 for test.

Metrics

We employ four popular metrics for evaluation: BLEU, ROUGE-L, METEOR and CIDEr. As done in previous video captioning works, we use METEOR and CIDEr as the main comparison metrics. In addition, the Microsoft COCO evaluation server has implemented these metrics and released evaluation functions (https://github.com/tylin/coco-caption), so we directly call these functions to test the performance of video captioning. The CIDEr reward is also computed by these functions.

Video preprocessing

First, we sample 30 equally spaced frames from every video, and resize them to a resolution of 224 $\times$ 224. Then the images are encoded with the final convolutional layer of ResNet152, which results in a 2,048-dimensional vector. Most video captioning models use motion features to improve performance. However, we only use appearance features in our model, because extracting motion features is very time-consuming, which deviates from our purpose of cutting down the computation cost of video captioning.
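As a minimal sketch of the sampling step, the indices of equally spaced frames can be computed as below. The start-aligned rounding rule and the handling of videos shorter than the sample count are assumptions; other conventions (for example, sampling interval midpoints) are equally plausible.

```python
from typing import List


def equally_spaced_indices(total_frames: int, num_samples: int = 30) -> List[int]:
    """Indices of `num_samples` frames spread evenly over a video."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if total_frames <= num_samples:
        # Assumed fallback: keep every frame of a very short video.
        return list(range(total_frames))
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


indices = equally_spaced_indices(300)
print(len(indices), indices[:3], indices[-1])  # 30 [0, 10, 20] 290
```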

Text preprocessing

We tokenize the labeled sentences by converting all words to lowercase and then using the word_tokenize function from the NLTK toolbox (http://www.nltk.org/) to split sentences into words and remove punctuation. Then, words with a frequency of less than 3 are removed. As a result, we obtain a vocabulary of 5,491 words from MSVD and 13,065 words from MSR-VTT. For each dataset, we use the one-hot vector (1-of-$N$ encoding, where $N$ is the size of the vocabulary) to represent each word.
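The vocabulary construction can be illustrated with the sketch below. To stay self-contained it replaces NLTK's word_tokenize with a simple regular expression, which is an assumption and will not split text identically; the function names are hypothetical, and special tokens such as the begin and end markers are left out.

```python
import re
from collections import Counter
from typing import List, Sequence


def tokenize(sentence: str) -> List[str]:
    """Lowercase and keep word-like tokens (a simplified stand-in for word_tokenize)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def build_vocab(sentences: Sequence[str], min_freq: int = 3) -> List[str]:
    """Keep words that occur at least `min_freq` times across all sentences."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return sorted(word for word, count in counts.items() if count >= min_freq)


def one_hot(word: str, vocab: Sequence[str]) -> List[int]:
    """1-of-N encoding of a word that is known to be in the vocabulary."""
    vec = [0] * len(vocab)
    vec[list(vocab).index(word)] = 1
    return vec


captions = ["A man is running.", "A man is dancing!", "A man sings."]
vocab = build_vocab(captions)
print(vocab)                  # ['a', 'man']
print(one_hot("man", vocab))  # [0, 1]
```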

Implementation details

We use the validation set to tune some hyperparameters of our framework. The learning rates for the three training stages are set to $3\times 10^{-4}$, $3\times 10^{-4}$ and $1\times 10^{-4}$, respectively. The training batch size is 128 for MSVD and 256 for MSR-VTT, while each stage is trained for up to 100 epochs and the best model is used to initialize the next stage. The minimum value of the maximum number of picked frames, $\tau$, is set to 7, and the penalty reward $R^{-}$ is $-1$. To regularize the training and avoid over-fitting, we apply the well-known regularization technique Dropout with retain probability 0.5 on the input and output of the encoding LSTMs and decoding GRUs. Embeddings for video features and words have size 512, while the sizes of all recurrent hidden states are empirically set to 1,024. For PickNet, the size of the glance is 56 $\times$ 56, and the size of the hidden layer is 1,024. The Adam optimizer is used to update all the parameters.

Results and Discussion

Figure 5 gives some example results on the test sets of the two datasets. As can be seen, our PickNet can select informative frames, so the rest of our model can use these selected frames to generate reasonable descriptions. More results are offered in the supplemental materials. To demonstrate the effectiveness of our framework, we compare our approach with some state-of-the-art methods on the two datasets, and analyze the learned picks of PickNet.

We compare our approach on MSVD with six state-of-the-art approaches for video captioning: TA, S2VT, LSTM-E, p-RNN, HRNE and BA. LSTM-E uses a visual-semantic embedding to generate better captions. TA uses temporal attention, while p-RNN uses both temporal and spatial attention. BA uses a hierarchical encoder, while HRNE uses a hierarchical decoder to describe videos. S2VT uses stacked LSTMs for both the encoding and decoding stages. All of these methods use motion features (C3D or optical flow) and extract visual features frame by frame. In addition, we report the performance of our baseline model, which encodes all the sampled frames. To compare our PickNet with other picking policies, we conduct two further trials that pick frames by random selection and by $k$-means clustering, respectively. To analyze the effect of the different rewards, we also conduct ablation studies on them. As shown in Table 3, our method improves on plain techniques and achieves state-of-the-art performance on MSVD. It outperforms the most recent state-of-the-art method by a considerable margin of $\frac{76.0-65.8}{65.8}\approx 15.5\%$ on the CIDEr metric. We also compare the time efficiency of these approaches. However, most state-of-the-art methods do not release executable code, so accurate timings may not be available. Instead, we estimate the running time from the complexity of the visual feature extractors and the number of processed frames. The details of the running time estimation are listed in the supplemental materials. Thanks to PickNet, our captioning model is $4\sim 33$ times faster than other methods.

On MSR-VTT, we compare with four state-of-the-art approaches: ruc-uva, Aalto, DenseCap and MS-RNN. ruc-uva augments the Encoder-Decoder with two new stages: early embedding, which enriches the input with tag embeddings, and late reranking, which re-scores generated sentences in terms of their relevance to a specific video. Aalto first trains two models, which are separately based on attribute and motion features, and then trains an evaluator to choose the best candidate generated by the two captioning models. DenseCap generates multiple sentences with regard to video segments and uses a winner-take-all scheme to produce the final description. MS-RNN uses a multi-modal LSTM to model the uncertainty in videos and generate diverse captions. Compared with these methods, our method can be simply trained in an end-to-end fashion, and does not rely upon any attribute information. The performance of these approaches and that of our solution is reported in Table 4. We observe that our approach achieves competitive results even without utilizing attribute information, while other methods take advantage of attributes and auxiliary information sources. Our model is also the fastest. It is also worth noting that PickNet can be easily integrated with the compared methods, since none of them incorporates a frame selection algorithm. For example, DenseCap generates region-sequence candidates based on equally sampled frames. It could alternatively use PickNet to reduce the time for generating candidates by cutting down the number of selected frames.

Analysis of learned picks

We collect statistics on the properties of our PickNet. Figure 6 shows the distributions of the number and position of picked frames on the test sets of MSVD and MSR-VTT. As observed in Figure 6(a), in the vast majority of the videos, fewer than 10 frames are picked. This implies that in most cases only $\frac{10}{30}\approx 33.3\%$ of the frames need to be encoded for captioning videos, which can largely reduce the computation cost. Specifically, the average number of picks is around $6$ for MSVD and $8$ for MSR-VTT. Looking at the distributions of the position of picks in Figure 6(b), we observe a power-law pattern, i.e., the probability of picking a frame decreases as time goes by. This is reasonable, since most videos are single-shot and the earlier frames are sufficient to represent the whole video.

Captioning for streaming video

One of the advantages of our method is that it can be applied to streaming video. Different from offline video captioning, captioning for streaming video requires the model to handle unbounded video and generate descriptions immediately when the visual information has changed, which meets the demands of practical applications. For this online setting, we first sample frames at 1fps, and then sequentially feed the sampled frames to PickNet. If a certain frame is picked, the pretrained CNN is used to extract the visual features of this frame. After that, the encoder receives this feature and produces a new encoded representation of the video stream up to the current time. Finally, the decoder generates a description based on the encoded representation. Figure 7 demonstrates an example of online video captioning with the picked frames and corresponding descriptions. As shown, the descriptions become more appropriate and more definite as the informative frames are picked. More results can be seen in the supplemental materials.
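The online procedure above can be sketched as a generator that emits a new caption whenever a frame is picked. All four callables (`should_pick`, `extract_feature`, `encode_step`, `decode`) are hypothetical stand-ins for PickNet, the pretrained CNN, the recurrent encoder and the decoder; the toy versions below only illustrate the control flow.

```python
from typing import Any, Callable, Iterable, Iterator, Tuple


def caption_stream(
    frames: Iterable[Any],
    should_pick: Callable[[Any, Any], bool],
    extract_feature: Callable[[Any], Any],
    encode_step: Callable[[Any, Any], Any],
    decode: Callable[[Any], str],
) -> Iterator[Tuple[int, str]]:
    """Yield (frame index, caption) each time a frame of the stream is picked."""
    state = None     # encoder state summarizing the stream so far
    template = None  # last picked frame, compared against the current one
    for index, frame in enumerate(frames):
        # The first frame is always picked; later frames go through the policy.
        if template is not None and not should_pick(template, frame):
            continue
        template = frame
        state = encode_step(state, extract_feature(frame))
        yield index, decode(state)


# Toy stand-ins: frames are numbers, and a frame is picked if it differs enough.
toy_frames = [0, 0, 5, 5, 9]
results = list(
    caption_stream(
        toy_frames,
        should_pick=lambda template, frame: abs(frame - template) >= 3,
        extract_feature=float,
        encode_step=lambda state, feat: (state or []) + [feat],
        decode=lambda state: f"{len(state)} frames seen",
    )
)
print(results)  # [(0, '1 frames seen'), (2, '2 frames seen'), (4, '3 frames seen')]
```

Because the CNN is only called for picked frames, the cost of feature extraction grows with the number of picks rather than with the length of the stream.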

Conclusion

In this work, we design a plug-and-play reinforcement-learning-based PickNet to select informative frames for the task of video captioning, which achieves promising performance in effectiveness, efficiency and flexibility on popular benchmarks. This architecture can largely cut down the usage of convolution operations by picking only $6\sim 8$ frames for a video clip, while other video analysis methods usually require more than 40 frames. This property makes our method more applicable to real-world video processing. The proposed PickNet is flexible and could potentially be employed in other video-related applications, such as video classification and action detection, which will be further addressed in our future work.

Details on Time Estimation

We estimate the running time from the complexity of the visual feature extractors and the number of processed frames, and our PickNet (V+L) is treated as the baseline. For visual features, both appearance and motion are taken into consideration. The relative running time of different CNNs is based on public reports. If a specific model uses motion features, the total computation time is doubled, since extracting motion features is very time-consuming. For each model, the number of processed frames is set to the expected number of sampled frames under its sampling method. However, some models sample every 5 or 10 frames from the input video, so the expected number depends on the video length. To compare these models with the others, we consider an input video with a duration of 10 seconds and a frame rate of 36fps, so the total number of frames of the input video is fixed to 360. Table 3 and Table 4 show the details of the time estimation on MSVD and MSR-VTT.
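The estimation rule can be made concrete with a short sketch. The relative CNN costs used here are placeholders set to 1.0, not the values from the public reports, and the function names are hypothetical; only the 10-second, 36fps assumption, the doubling for motion features and the average of about 6 picks on MSVD come from the text.

```python
import math


def expected_frames(every_n: int, duration_s: float = 10.0, fps: float = 36.0) -> int:
    """Number of frames processed when sampling every `every_n` frames."""
    total_frames = int(duration_s * fps)  # 360 frames for the assumed video
    return math.ceil(total_frames / every_n)


def relative_time(cnn_cost: float, frames: int, uses_motion: bool) -> float:
    """Per-frame extractor cost times frame count, doubled if motion features are used."""
    return cnn_cost * frames * (2.0 if uses_motion else 1.0)


# Baseline: about 6 picked frames on MSVD, appearance features only.
baseline = relative_time(cnn_cost=1.0, frames=6, uses_motion=False)
# Hypothetical competitor: samples every 10 frames and also uses motion features.
other = relative_time(cnn_cost=1.0, frames=expected_frames(every_n=10), uses_motion=True)
print(expected_frames(every_n=5), expected_frames(every_n=10))  # 72 36
print(other / baseline)  # 12.0
```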

It is worth mentioning that the estimated time of the other compared approaches is only rough. In fact, the actual speedup of our model will be higher than estimated, because most of the compared models use complex pipelines. For instance, in p-RNN, temporal- and spatial-attention mechanisms are exploited to selectively focus on visual elements during sentence generation. Also, in DenseCap, a lexical FCN is used to extract region features from every frame, then the cost-effective lazy forward (CELF) method is employed to generate region-sequence candidates, and finally a bi-directional LSTM is used to encode each region-sequence candidate. All the above procedures are far more complex than our Encoder-Decoder pipeline, and therefore consume much more processing time.

More Result Examples

Figure 8 shows more results for offline video captioning on MSVD and MSR-VTT. As mentioned before, our PickNet can select a small portion of frames to represent the whole video and describe the video with regard to the picked frames. Moreover, in the left column, only 3 or 4 frames are picked, and these picked frames are all in the front part of the video. We suppose that this is because these videos contain unambiguous content, and in each video most frames are similar. Under these circumstances, it is enough to pick a few frames at the beginning to describe these videos. Meanwhile, in the right column, more frames are picked, and the picked frames are scattered. We suggest that this is because the content of these videos is diverse. In this situation, PickNet traverses the whole video and selects more frames.

Altogether, two characteristics of picked frames can be found. The first characteristic is that the picked frames are concise and highly related to the generated descriptions. For example, in Figure 8(a), our model only selects four frames, which correspond to holding the gun, porting arms, aiming, and shooting, respectively. All other frames are more or less duplicated visually or semantically, so those redundant frames are ignored. In Figure 8(c), our model selects the 5th frame instead of the 6th frame. Although the 6th frame is more diverse than the 5th frame, it is not related to the description, so our model does not select it but picks the 5th frame to confirm that the clip is about playing a guitar. In Figure 8(f), the picked frames appropriately capture the shot change, so the model can focus on the two women and understand that they are talking to each other. The second characteristic is that adjacent frames may be picked to represent an action. For example, in Figure 8(b), our model selects a pair of adjacent frames, i.e., the 6th and the 7th frames, which can properly represent the seasoning action. In Figure 8(d), the first two frames are picked to represent the chopping action. In Figure 8(e), the 5th to 7th frames are selected to represent the playing action. And in Figure 8(f), the 15th and the 16th frames are chosen to represent the talking action.

With these characteristics, our model may generate more accurate descriptions than the ground truths. For example, in Figure 8(b), our model explicitly indicates that there is a woman, while the ground truth only uses "someone" to refer to her. In Figure 8(f), the generated sentence correctly describes the content of the video, while the ground truth just states that it is a movie clip.

Online video captioning

For online video captioning, we first sample frames at 1fps, and then sequentially feed the sampled frames to PickNet. If a certain frame is picked, the pre-trained CNN will be used to extract visual features of this frame. After that, the encoder will receive this feature, and produce a new encoded representation of the video stream up to current time. Finally, the decoder will generate a description based on the encoded representation. Figure 9 demonstrates some examples of online video captioning with the picked frames and the corresponding descriptions.

As we discussed before, the descriptions become more appropriate and more definite as the informative frames are picked. In example (a), the most salient object in the first picked frame is the cat, so the generated description is just about the cat. After observing enough frames, the model knows that this video is about a woman playing with a kitten, and produces the correct description. In example (b), the model first generates a description that a boy is running, since the man in a blue shirt is more prominent and the motion pattern looks like running. As the following frames are picked, the other two persons are noticed and their actions are recognized as dancing, so the model produces a more accurate description that three persons are dancing. At the beginning of example (c), the model is only aware that there is a man with a sword, and does not know what the man is doing. After the third frame is picked, it is clear that the man is stabbing a target; then the word "target" is replaced by the more precise word "silhouette" when more frames are picked. The video version of the online captioning results can be seen in the uploaded videos.