Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell

Introduction

Recognition and description of images and videos is a fundamental challenge of computer vision. Dramatic progress has been achieved by supervised convolutional neural network (CNN) models on image recognition tasks, and a number of extensions to process video have been recently proposed. Ideally, a video model should allow processing of variable length input sequences, and also provide for variable length outputs, including generation of full-length sentence descriptions that go beyond conventional one-versus-all prediction tasks. In this paper we propose Long-term Recurrent Convolutional Networks (LRCNs), a class of architectures for visual recognition and description which combines convolutional layers and long-range temporal recursion and is end-to-end trainable (Figure 1). We instantiate our architecture for specific video activity recognition, image caption generation, and video description tasks as described below.

Research on CNN models for video processing has considered learning 3D spatio-temporal filters over raw sequence data, and learning of frame-to-frame representations which incorporate instantaneous optic flow or trajectory-based models aggregated over fixed windows or video shot segments. Such models explore two extrema of perceptual time-series representation learning: either learn a fully general time-varying weighting, or apply simple temporal pooling. Following the same inspiration that motivates current deep convolutional models, we advocate for video recognition and description models which are also deep over temporal dimensions; i.e., have temporal recurrence of latent variables. Recurrent Neural Network (RNN) models are “deep in time” – explicitly so when unrolled – and form implicit compositional representations in the time domain. Such “deep” models predated deep spatial convolution models in the literature.

The use of RNNs in perceptual applications has been explored for many decades, with varying results. A significant limitation of simple RNN models which strictly integrate state information over time is known as the “vanishing gradient” effect: the ability to backpropagate an error signal through a long-range temporal interval becomes increasingly difficult in practice. Long Short-Term Memory (LSTM) units, first proposed in , are recurrent modules which enable long-range learning. LSTM units have hidden state augmented with nonlinear mechanisms to allow state to propagate without modification, be updated, or be reset, using simple learned gating functions. LSTMs have recently been demonstrated to be capable of large-scale learning of speech recognition and language translation models .

We show here that convolutional networks with recurrent units are generally applicable to visual time-series modeling, and argue that in visual tasks where static or flat temporal models have previously been employed, LSTM-style RNNs can provide significant improvement when ample training data are available to learn or refine the representation. Specifically, we show that LSTM type models provide for improved recognition on conventional video activity challenges and enable a novel end-to-end optimizable mapping from image pixels to sentence-level natural language descriptions. We also show that these models improve generation of descriptions from intermediate visual representations derived from conventional visual models.

We instantiate our proposed architecture in three experimental settings (Figure 3). First, we show that by directly connecting a visual convolutional model to deep LSTM networks, we are able to train video recognition models that capture temporal state dependencies (Figure 3 left; Section 4). While existing labeled video activity datasets may not have actions or activities with particularly complex temporal dynamics, we nonetheless observe significant improvements on conventional benchmarks.

Second, we explore end-to-end trainable image to sentence mappings. Strong results for machine translation tasks have recently been reported ; such models are encoder-decoder pairs based on LSTM networks. We propose a multimodal analog of this model, and describe an architecture which uses a visual convnet to encode a deep state vector, and an LSTM to decode the vector into a natural language string (Figure 3 middle; Section 5). The resulting model can be trained end-to-end on large-scale image and text datasets, and even with modest training provides competitive generation results compared to existing methods.

Finally, we show that LSTM decoders can be driven directly from conventional computer vision methods which predict higher-level discriminative labels, such as the semantic video role tuple predictors in (Figure 3, right; Section 6). While not end-to-end trainable, such models offer architectural and performance advantages over previous statistical machine translation-based approaches.

We have realized a generic framework for recurrent models in the widely adopted deep learning framework Caffe, including ready-to-use implementations of RNN and LSTM units. (See http://jeffdonahue.com/lrcn/.)

Background: Recurrent Networks

Traditional recurrent neural networks (RNNs, Figure 2, left) model temporal dynamics by mapping input sequences to hidden states, and hidden states to outputs via the following recurrence equations:
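
A standard way to write this recurrence is sketched below. The weight and bias names ($W_{xh}$, $W_{hh}$, $W_{hz}$, $b_{h}$, $b_{z}$) and the element-wise non-linearity $g$ (for example a sigmoid or hyperbolic tangent) are conventional notation assumed here for illustration.

$$
\begin{aligned}
h_{t} &= g\left(W_{xh}x_{t}+W_{hh}h_{t-1}+b_{h}\right)\\
z_{t} &= g\left(W_{hz}h_{t}+b_{z}\right)
\end{aligned}
$$

Here $x_{t}$ is the input, $h_{t}$ the hidden state, and $z_{t}$ the output at time $t$; for a length $T$ input the updates are computed in order $h_{1}$ (with $h_{0}=0$), $z_{1}$, $h_{2}$, $z_{2}$, and so on up to $h_{T}$, $z_{T}$.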

Though RNNs have proven successful on tasks such as speech recognition and text generation, it can be difficult to train them to learn long-term dynamics, likely due in part to the vanishing and exploding gradients problem that can result from propagating the gradients down through the many layers of the recurrent network, each corresponding to a particular time step. LSTMs provide a solution by incorporating memory units that explicitly allow the network to learn when to “forget” previous hidden states and when to update hidden states given new information. As research on LSTMs has progressed, hidden units with varying connections within the memory unit have been proposed. We use the LSTM unit as described in (Figure 2, right), a slight simplification of the one described in , which was derived from the original LSTM unit proposed in . Letting $\sigma(x)=\left(1+e^{-x}\right)^{-1}$ be the sigmoid non-linearity which squashes real-valued inputs to a $[0,1]$ range, and letting $\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}=2\sigma(2x)-1$ be the hyperbolic tangent non-linearity, similarly squashing its inputs to a $[-1,1]$ range, the LSTM updates for time step $t$ given inputs $x_{t}$, $h_{t-1}$, and $c_{t-1}$ are:
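
A common statement of these updates is given below as a sketch. The symbols $i_{t}$, $f_{t}$, $o_{t}$ (input, forget, and output gates), $g_{t}$ (input modulation), and $c_{t}$ (memory cell), together with the weight names $W_{x\cdot}$, $W_{h\cdot}$ and biases $b_{\cdot}$, are notation assumed here.

$$
\begin{aligned}
i_{t} &= \sigma\left(W_{xi}x_{t}+W_{hi}h_{t-1}+b_{i}\right)\\
f_{t} &= \sigma\left(W_{xf}x_{t}+W_{hf}h_{t-1}+b_{f}\right)\\
o_{t} &= \sigma\left(W_{xo}x_{t}+W_{ho}h_{t-1}+b_{o}\right)\\
g_{t} &= \tanh\left(W_{xc}x_{t}+W_{hc}h_{t-1}+b_{c}\right)\\
c_{t} &= f_{t}\odot c_{t-1}+i_{t}\odot g_{t}\\
h_{t} &= o_{t}\odot\tanh\left(c_{t}\right)
\end{aligned}
$$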

$x\odot y$ denotes the element-wise product of vectors $x$ and $y$.
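
To make the gating concrete, the following is a minimal NumPy sketch of a single LSTM step with input, forget, and output gates, an input modulation term, and a memory cell. The function name `lstm_step`, the stacked weight layout, and the gate ordering are assumptions made for illustration; this is not the Caffe implementation.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid non-linearity, sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM update (illustrative layout, not the authors' code).

    W_x: (4N, D), W_h: (4N, N), b: (4N,), with the four row blocks
    holding the input gate, forget gate, output gate and input
    modulation parameters, in that (assumed) order.
    """
    n = h_prev.shape[0]
    pre = W_x @ x_t + W_h @ h_prev + b
    i = sigmoid(pre[0:n])            # input gate
    f = sigmoid(pre[n:2 * n])        # forget gate
    o = sigmoid(pre[2 * n:3 * n])    # output gate
    g = np.tanh(pre[3 * n:4 * n])    # input modulation
    c = f * c_prev + i * g           # element-wise products
    h = o * np.tanh(c)
    return h, c
```

With all-zero weights every gate equals 0.5, so the cell simply halves its previous contents; this is a convenient sanity check of the update order.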

Recently, LSTMs have achieved impressive results on language tasks such as speech recognition and machine translation . Analogous to CNNs, LSTMs are attractive because they allow end-to-end fine-tuning. For example, eliminates the need for complex multi-step pipelines in speech recognition by training a deep bidirectional LSTM which maps spectrogram inputs to text. Even with no language model or pronunciation dictionary, the model produces convincing text translations. and translate sentences from English to French with a multi-layer LSTM encoder and decoder. Sentences in the source language are mapped to a hidden state using an encoding LSTM, and then a decoding LSTM maps the hidden state to a sequence in the target language. Such an encoder-decoder scheme allows an input sequence of arbitrary length to be mapped to an output sequence of different length. The sequence-to-sequence architecture for machine translation circumvents the need for language models.

The advantages of LSTMs for modeling sequential data in vision problems are twofold. First, when integrated with current vision systems, LSTM models are straightforward to fine-tune end-to-end. Second, LSTMs are not confined to fixed length inputs or outputs, allowing simple modeling for sequential data of varying lengths, such as text or video. We next describe a unified framework to combine recurrent models such as LSTMs with deep convolutional networks to form end-to-end trainable networks capable of complex visual and sequence prediction tasks.

Long-term Recurrent Convolutional Network (LRCN) model

This work proposes a Long-term Recurrent Convolutional Network (LRCN) model combining a deep hierarchical visual feature extractor (such as a CNN) with a model that can learn to recognize and synthesize temporal dynamics for tasks involving sequential data (inputs or outputs), visual, linguistic, or otherwise. Figure 1 depicts the core of our approach. LRCN works by passing each visual input $x_{t}$ (an image in isolation, or a frame from a video) through a feature transformation $\phi_{V}(.)$ with parameters $V$, usually a CNN, to produce a fixed-length vector representation $\phi_{V}(x_{t})$. The outputs of $\phi_{V}$ are then passed into a recurrent sequence learning module.

In its most general form, a recurrent model has parameters $W$, and maps an input $x_{t}$ and a previous time step hidden state $h_{t-1}$ to an output $z_{t}$ and updated hidden state $h_{t}$. Therefore, inference must be run sequentially (i.e., from top to bottom, in the Sequence Learning box of Figure 1), by computing in order: $h_{1}=f_{W}(x_{1},h_{0})=f_{W}(x_{1},0)$, then $h_{2}=f_{W}(x_{2},h_{1})$, etc., up to $h_{T}$. Some of our models stack multiple LSTMs atop one another as described in Section 2.

The success of recent deep models for object recognition suggests that strategically composing many “layers” of non-linear functions can result in powerful models for perceptual problems. For large $T$, the above recurrence indicates that the last few predictions from a recurrent network with $T$ time steps are computed by a very “deep” ($T$ layer) non-linear function, suggesting that the resulting recurrent model may have similar representational power to a $T$ layer deep network. Critically, however, the sequence model’s weights $W$ are reused at every time step, forcing the model to learn generic time step-to-time step dynamics (as opposed to dynamics conditioned on $t$, the sequence index) and preventing the parameter size from growing in proportion to the maximum sequence length.

In most of our experiments, the visual feature transformation $\phi$ corresponds to the activations in some layer of a deep CNN. Using a visual transformation $\phi_{V}(.)$ which is time-invariant and independent at each time step has the important advantage of making the expensive convolutional inference and training parallelizable over all time steps of the input, facilitating the use of fast contemporary CNN implementations whose efficiency relies on independent batch processing, and end-to-end optimization of the visual and sequential model parameters $V$ and $W$.
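
The split between a batched, time-independent visual transformation and a strictly sequential recurrence can be sketched as follows. This is only an illustration: `phi_v` is a hypothetical stand-in for the CNN (a single linear map with ReLU), and `run_sequence` uses a plain tanh recurrence in place of an LSTM.

```python
import numpy as np

def phi_v(frames, V):
    """Stand-in for the CNN feature transformation (hypothetical).

    frames: (T, H, W, C); V: (H*W*C, F). Every frame is processed
    independently, so all T frames go through one batched call.
    """
    flat = frames.reshape(frames.shape[0], -1)
    return np.maximum(flat @ V, 0.0)

def run_sequence(features, W_xh, W_hh, W_hz):
    """Sequential recurrence; the same weights are reused at each step."""
    h = np.zeros(W_hh.shape[0])              # h_0 = 0
    outputs = []
    for x_t in features:                     # must run in order t = 1..T
        h = np.tanh(W_xh @ x_t + W_hh @ h)   # h_t = f_W(x_t, h_{t-1})
        outputs.append(W_hz @ h)             # z_t
    return np.stack(outputs), h
```

The parameter count of `run_sequence` does not depend on the number of frames, which mirrors the weight sharing over time described above.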

We consider three vision problems (activity recognition, image description and video description), each of which instantiates one of the following broad classes of sequential learning tasks:

Sequential input, static output (Figure 3, left): $\langle x_{1},x_{2},...,x_{T}\rangle\mapsto y$. The visual activity recognition problem can fall under this umbrella, with videos of arbitrary length $T$ as input, but with the goal of predicting a single label like running or jumping drawn from a fixed vocabulary.

Static input, sequential output (Figure 3, middle): $x\mapsto\langle y_{1},y_{2},...,y_{T}\rangle$. The image captioning problem fits in this category, with a static (non-time-varying) image as input, but a much larger and richer label space consisting of sentences of any length.

Sequential input and output (Figure 3, right): $\langle x_{1},x_{2},...,x_{T}\rangle\mapsto\langle y_{1},y_{2},...,y_{T^{\prime}}\rangle$. In tasks such as video description, both the visual input and output are time-varying, and in general the number of input and output time steps may differ (i.e., we may have $T\neq T^{\prime}$). In video description, for example, the number of frames in the video should not constrain the length of (number of words in) the natural language description.

In the previously described generic formulation of recurrent models, each instance has $T$ inputs $\langle x_{1},x_{2},...,x_{T}\rangle$ and $T$ outputs $\langle y_{1},y_{2},...,y_{T}\rangle$. Note that this formulation does not align cleanly with any of the three problem classes described above – in the first two classes, either the input or output is static, and in the third class, the input length $T$ need not match the output length $T^{\prime}$. Hence, we describe how we adapt this formulation in our hybrid model to each of the above three problem settings.

With sequential inputs and static outputs (class 1), we take a late-fusion approach to merging the per-time step predictions $\langle y_{1},y_{2},...,y_{T}\rangle$ into a single prediction $y$ for the full sequence. With static inputs $x$ and sequential outputs (class 2), we simply duplicate the input $x$ at all $T$ time steps: $\forall t\in\{1,2,...,T\}:x_{t}:=x$. Finally, for a sequence-to-sequence problem with (in general) different input and output lengths (class 3), we take an “encoder-decoder” approach, as proposed for machine translation by . In this approach, one sequence model, the encoder, maps the input sequence to a fixed-length vector, and another sequence model, the decoder, unrolls this vector to a sequential output of arbitrary length. Under this type of model, a run of the full system on one instance occurs over $T+T^{\prime}-1$ time steps. For the first $T$ time steps, the encoder processes the input $x_{1},x_{2},...,x_{T}$, and the decoder is inactive until time step $T$, when the encoder’s output is passed to the decoder, which in turn predicts the first output $y_{1}$. For the latter $T^{\prime}-1$ time steps, the decoder predicts the remainder of the output $y_{2},y_{3},...,y_{T^{\prime}}$ with the encoder inactive. This encoder-decoder approach, as applied to the video description task, is depicted in Section 6, Figure 5 (left).
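
The three adaptations can be sketched in a few lines. The helper names are hypothetical, and averaging is assumed as the late-fusion rule (it is the rule used for activity recognition later in the text); the schedule function only records which module is active at each of the $T+T^{\prime}-1$ steps.

```python
import numpy as np

def late_fusion(per_step_probs):
    """Class 1: merge per-time-step predictions (T, K) into one label.
    Averaging is an assumed fusion rule for this sketch."""
    return int(np.argmax(per_step_probs.mean(axis=0)))

def replicate_input(x, T):
    """Class 2: the static input is presented at every step, x_t := x."""
    return [x] * T

def encoder_decoder_schedule(T, T_out):
    """Class 3: (step, encoder_active, decoder_active) for each of the
    T + T_out - 1 steps; both modules are active at step T."""
    return [(t, t <= T, t >= T) for t in range(1, T + T_out)]
```

For example, with $T=3$ input steps and $T^{\prime}=4$ output words the schedule has six steps, and step 3 is the only one where the encoder hands its state to the decoder.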

Under the proposed system, the parameters $\left(V,W\right)$ of the model’s visual and sequential components can be jointly optimized by maximizing the likelihood of the ground truth outputs $y_{t}$ at each time step $t$, conditioned on the input data and labels up to that point $\left(x_{1:t},y_{1:t-1}\right)$. In particular, for a training set $\mathcal{D}$ of labeled sequences $(x_{t},y_{t})_{t=1}^{T}\in\mathcal{D}$, we optimize parameters $\left(V,W\right)$ to minimize the expected negative log likelihood of a sequence sampled from the training set:

$$\mathcal{L}(V,W,\mathcal{D})=-\frac{1}{|\mathcal{D}|}\sum_{(x_{t},y_{t})_{t=1}^{T}\in\mathcal{D}}\sum_{t=1}^{T}\log P(y_{t}|x_{1:t},y_{1:t-1},V,W)$$
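
As a small numerical illustration of this objective, the sketch below evaluates the negative log likelihood from per-step softmax outputs. The function names and the data layout (a list of `(step_probs, targets)` pairs) are assumptions for the example; the model that produces the probabilities is left out.

```python
import numpy as np

def sequence_nll(step_probs, targets):
    """Negative log likelihood of one sequence.

    step_probs: (T, K) array, row t is the predicted distribution for y_t.
    targets:    length-T integer array of ground truth labels.
    """
    t = np.arange(len(targets))
    return -np.sum(np.log(step_probs[t, targets]))

def dataset_loss(dataset):
    """Average per-sequence NLL over a list of (step_probs, targets)."""
    return sum(sequence_nll(p, y) for p, y in dataset) / len(dataset)
```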

We next demonstrate the power of end-to-end trainable hybrid convolutional and recurrent networks by exploring three applications: activity recognition, image captioning, and video description.

Activity recognition

Activity recognition is an instance of the first class of sequential learning tasks described above: each frame in a length $T$ sequence is the input to a single convolutional network (i.e., the convnet weights are tied across time). We consider both RGB and flow as inputs to our recognition system. Flow is computed with and transformed into a “flow image” by scaling and shifting $x$ and $y$ flow values to a range of $[-128,+128]$. A third channel for the flow image is created by calculating the flow magnitude.
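
One possible way to build such a flow image is sketched below. The text does not give the exact scaling, so the bound `max_flow`, the clipping, and computing the magnitude from the scaled components are all assumptions of this sketch.

```python
import numpy as np

def flow_to_image(flow_x, flow_y, max_flow=20.0):
    """Turn x/y optical flow fields into a 3-channel flow image.

    max_flow is a hypothetical bound on the flow values; components are
    scaled so that +/- max_flow maps to +/- 128 and then clipped.
    """
    scale = 128.0 / max_flow
    fx = np.clip(flow_x * scale, -128.0, 128.0)
    fy = np.clip(flow_y * scale, -128.0, 128.0)
    mag = np.sqrt(fx ** 2 + fy ** 2)     # third channel: flow magnitude
    return np.stack([fx, fy, mag], axis=-1)
```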

During training, videos are resized to $240\times 320$ and we augment our data by using $227\times 227$ crops and mirroring. Additionally, we train the LRCN networks with video clips of 16 frames, even though the UCF101 videos are generally much longer (on the order of 100 frames when extracting frames at 30 FPS). Training on shorter video clips can be seen as analogous to training on image crops and is a useful method of data augmentation. LRCN is trained to predict the video’s activity class at each time step. To produce a single label prediction for an entire video clip, we average the label probabilities – the outputs of the network’s softmax layer – across all frames and choose the most probable label. At test time, we extract 16 frame clips with a stride of 8 frames from each video and average across all clips from a single video.
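
The test-time procedure can be sketched as follows. The helper names are hypothetical, and the handling of videos shorter than one clip (no clip is returned) is an assumption, since the text does not cover that case.

```python
import numpy as np

def clip_starts(num_frames, clip_len=16, stride=8):
    """Start indices of all full clips of clip_len frames."""
    return list(range(0, num_frames - clip_len + 1, stride))

def video_prediction(clip_probs):
    """clip_probs: (num_clips, clip_len, K) per-frame softmax outputs.

    Average over the frames of each clip, then over all clips of the
    video, and return the most probable class index.
    """
    per_clip = clip_probs.mean(axis=1)
    return int(np.argmax(per_clip.mean(axis=0)))
```

For a 100-frame video this yields clips starting at frames 0, 8, ..., 80, so consecutive clips overlap by half.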

The CNN base of LRCN in our activity recognition experiments is a hybrid of the CaffeNet reference model (a minor variant of AlexNet) and the network used by Zeiler & Fergus. The network is pre-trained on the 1.2M image ILSVRC-2012 classification training subset of the ImageNet dataset, giving the network a strong initialization to facilitate faster training and avoid overfitting to the relatively small video activity recognition datasets. When classifying center crops, the top-1 classification accuracy is 60.2% and 57.4% for the hybrid and CaffeNet reference models, respectively.

We compare LRCN to a single frame baseline model. In our baseline model, TT video frames are individually classified by a CNN. As in the LSTM model, whole video classification is done by averaging scores across all video frames.

We evaluate our architecture on the UCF101 dataset which consists of over 12,000 videos categorized into 101 human action classes. The dataset is split into three splits, with just under 8,000 videos in the training set for each split.

We explore various hyperparameters for the LRCN activity recognition architecture. To explore different variants, we divide the first training split of UCF101 into a smaller training set ($\approx$6,000 videos) and a validation set ($\approx$3,000 videos). We find that the most influential hyperparameters include the number of hidden units in the LSTM and whether $fc_{6}$ or $fc_{7}$ features are used as input to the LSTM. We compare networks with $256$, $512$, and $1024$ LSTM hidden units. When using flow as an input, more hidden units lead to better performance, with 1024 hidden units yielding a 1.7% boost in accuracy in comparison to a network with 256 hidden units on our validation set. In contrast, for networks with RGB input, the number of hidden units has little impact on the performance of the model. We thus use 1024 hidden units for flow inputs, and 256 for RGB inputs. We find that using $fc_{6}$ as opposed to $fc_{7}$ features improves accuracy when using flow as input on our validation set by 1%. When using RGB images as input, the difference between using $fc_{6}$ or $fc_{7}$ features is quite small; using $fc_{6}$ features only increases accuracy by 0.2%. Because both models perform better with $fc_{6}$ features, we train our final models using $fc_{6}$ features (denoted by LRCN-$fc_{6}$). We also considered subsampling the frames input to the LSTM, but found that this hurts performance compared with using all frames. Additionally, when training the LRCN network end-to-end, we found that aggressive dropout ($0.9$) was needed to avoid overfitting.

Table I reports the average accuracy across the three standard test splits of UCF101. Columns 2-3 compare video classification of LRCN against the baseline single frame architecture for both RGB and flow inputs. LRCN yields the best results for both RGB and flow and improves upon the baseline network by 0.83% and 2.91% respectively. RGB and flow networks can be combined by computing a weighted average of network scores as proposed in . Like , we report two weighted averages of the predictions from the RGB and flow networks in Table I (right). Since the flow network outperforms the RGB network, weighting the flow network higher unsurprisingly leads to better accuracy. In this case, LRCN outperforms the baseline single-frame model by 3.40%.
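
The score fusion amounts to a convex combination of the two networks' class scores, sketched below. The function name and the weight values used in the example are illustrative assumptions; the weights actually reported are those listed in Table I.

```python
import numpy as np

def fuse_scores(rgb_scores, flow_scores, flow_weight):
    """Weighted average of per-class scores from the RGB and flow
    networks; flow_weight in [0, 1] is an illustrative parameter."""
    return (1.0 - flow_weight) * rgb_scores + flow_weight * flow_scores

rgb = np.array([0.6, 0.4])
flow = np.array([0.3, 0.7])
equal = fuse_scores(rgb, flow, 0.5)            # both networks weighted equally
flow_heavy = fuse_scores(rgb, flow, 2.0 / 3)   # flow network weighted higher
```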

Table II compares LRCN’s accuracy with the single frame baseline model for individual classes on Split 1 of UCF101. For the majority of classes, LRCN improves performance over the single frame model. Though LRCN performs worse on some classes including Knitting and Mixing, in general when LRCN performs worse, the loss in accuracy is not as substantial as the gain in accuracy for classes like BoxingPunchingBag and HighJump. Consequently, accuracy is higher overall.

Table III compares accuracies for the LRCN flow and LRCN RGB models for individual classes on Split 1 of UCF101. Note that for some classes the LRCN flow model outperforms the LRCN RGB model and vice versa. One explanation is that activities which are better classified by the LRCN RGB model are best determined by which objects are present in the scene, while activities which are better classified by the LRCN flow model are best classified by the kind of motion in the scene. For example, activity classes like Typing are highly correlated with the presence of certain objects, such as a keyboard, and are thus best learned by the LRCN RGB model. Other activities such as SoccerJuggling include more generic objects which are frequently seen in other activities (soccer balls, people) and are thus best identified from class-specific motion cues. Because RGB and flow signals are complementary, the best models take both into account.

LRCN shows clear improvement over the baseline single-frame system and is comparable to accuracy achieved by other deep models. report the results on UCF101 by computing a weighted average between flow and RGB networks and achieve 87.6%. reports 65.4% accuracy on UCF101, which is substantially lower than LRCN.

Image captioning

In contrast to activity recognition, the static image captioning task requires only a single invocation of a convolutional network since the input consists of a single image. At each time step, both the image features and the previous word are provided as inputs to the sequence model, in this case a stack of LSTMs (each with 1000 hidden units), which is used to learn the dynamics of the time-varying output sequence, natural language.

The outputs of the final LSTM in the stack are the inputs to a learned linear prediction layer with a softmax producing a distribution $P(y_{t}|y_{1:t-1},\phi_{V}(x))$ over words $y_{t}$ in the model’s vocabulary, including the token denoting the end of the caption, allowing the model to predict captions of varying length. The visual model $\phi_{V}$ used for our image captioning experiments is either the CaffeNet reference model, a variant of AlexNet, or the more modern and computationally expensive VGGNet model pre-trained for ILSVRC-2012 classification.

Without any explicit language modeling or impositions on the structure of the generated captions, the described LRCN system learns mappings from images input as pixel intensity values to natural language descriptions that are often semantically descriptive and grammatically correct.

We evaluate our image description model for retrieval and generation tasks. We first demonstrate the effectiveness of our model by quantitatively evaluating it on the image and caption retrieval tasks proposed by and seen in . We report results on Flickr30k , and COCO 2014 datasets, both with five captions annotated per image.

Retrieval results on the Flickr30k dataset are recorded in Table IV. We report median rank, Medr, of the first retrieved ground truth image or caption and Recall@$K$, the number of images or captions for which a correct caption or image is retrieved within the top $K$ results. Our model consistently outperforms the strong baselines from recent work as can be seen in Table IV. Here, we note that the VGGNet model in (called OxfordNet in their work) outperforms our model on the retrieval task. However, VGGNet is a stronger convolutional network than that used for our results on this task. The strength of our sequence model (and integration of the sequence and visual models) can be more directly measured against the ConvNet result, which uses a very similar base CNN architecture (AlexNet, where we use CaffeNet) pretrained on the same data.
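
Both metrics follow directly from the rank of the first correct item for each query, as in the sketch below. The function name is hypothetical, and Recall@$K$ is returned as a fraction of queries, which is an assumption about the normalization.

```python
import numpy as np

def retrieval_metrics(first_hit_ranks, ks=(1, 5, 10)):
    """first_hit_ranks: 1-based rank of the first ground truth item
    retrieved for each query. Returns (Recall@K dict, median rank)."""
    ranks = np.asarray(first_hit_ranks)
    recall = {k: float(np.mean(ranks <= k)) for k in ks}
    return recall, float(np.median(ranks))

recall, med_r = retrieval_metrics([1, 3, 7, 12])
```

Lower median rank and higher Recall@$K$ both indicate better retrieval.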

We also ablate the model’s retrieval performance on a randomly chosen subset of 1000 images (and 5000 captions) from the COCO 2014 validation set. Results are recorded in Table V. The first group of results for each task examines the effectiveness of an LSTM compared with a “vanilla” RNN as described in Section 2. These results demonstrate that the use of the LSTM unit compared to the simpler RNN architecture is an important element of our model’s performance on this task, justifying the additional complexity and suggesting that the LSTM’s gating mechanisms allowing for “long-term” memory may be quite useful, even for relatively simple sequences.

Within the second and third result groups, we compare performance among the three sequence model architectural variants depicted in Figure 4. For both tasks and under all metrics, the two layer, unfactored variant (LRCN2u) performs worse than the other two. The fact that LRCN1u outperforms LRCN2u indicates that stacking additional LSTM layers alone is not beneficial for this task. The other two variants (LRCN2f and LRCN1u) perform similarly across the board, with LRCN2f appearing to have a slight edge in the image to caption task under most metrics, but the reverse for caption to image retrieval.

Unsurprisingly, finetuning the CNN (indicated by the “FT?” column of Table V) and using a more powerful CNN (VGGNet rather than CaffeNet) each improve results substantially across the board. Finetuning boosts the R@$k$ metrics by 3-5% for CaffeNet, and 5-8% for VGGNet. Switching from CaffeNet to VGGNet improves results by around 8-12% for the caption to image task, and by roughly 11-17% for the image to caption task.

Generation

We evaluate LRCN’s caption generation performance on the COCO2014 dataset using the official metrics on which COCO image captioning submissions are evaluated. The BLEU and METEOR metrics were designed for automatic evaluation of machine translation methods. ROUGE-L was designed for evaluating summarization performance. CIDEr-D was designed specifically to evaluate the image captioning task.

In Table VI we evaluate variants of our model along the same axes as done for the retrieval tasks in Table V. In the last of the three groups of results, we additionally explore and evaluate various caption generation strategies that can be employed for a given network. The simplest strategy, and the one employed for most of our generation results in our prior work, is to generate captions greedily; i.e., by simply choosing the most probable word at each time step. This is equivalent to (and denoted in Table VI by) beam search with beam width 1. In general, beam search with beam width $N$ approximates the most likely caption by retaining and expanding only the $N$ current most likely partial captions, according to the model. We find that of the beam search strategies, a beam width of 3-5 gives the best generation numbers – performance saturates quickly and even degrades for larger beam width (e.g., 10).
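
A minimal beam search over partial captions can be sketched as follows. The interface is hypothetical: `step_log_probs(prefix)` stands in for the model and returns a dictionary from candidate next tokens to log probabilities.

```python
def beam_search(step_log_probs, beam_width, eos, max_len):
    """Keep only the beam_width most likely partial captions per step.

    Returns (tokens, log_prob) of the best caption found. With
    beam_width == 1 this reduces to greedy decoding.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))   # caption is complete
            else:
                beams.append((seq, score))      # expand further next step
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])
```

A wider beam can recover captions whose first word is not the single most probable one, which greedy decoding would never reach.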

An alternative, non-deterministic generation strategy is to randomly sample $N$ captions from the model’s distribution and choose the most probable among these. Under this strategy we also examine the effect of applying various choices of scalar factors (inverse of the “temperature”) $T$ to the real-valued predictions input to the softmax producing the distribution. For larger values of $T$ the samples are greedier and less diverse, with $T=\infty$ being equivalent to beam search with beam width 1. Larger values of $N$ suggest using smaller values of $T$, and vice versa – for example, with large $N$ and large $T$, most of the $\mathcal{O}(N)$ computation is wasted as many of the samples will be redundant. We assess saturation as the number of samples $N$ grows, and find that $N=100$ samples with $T=2$ improves little over $N=25$. We also varied the temperature $T$ among values 1, 1.5, and 2 (all with $N=100$) and found $T=1.5$ to perform the best.
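
The sampling strategy can be sketched as below. The model interface `step_logits(prefix)` is hypothetical, and scoring each sample under the unscaled ($T=1$) distribution when picking the most probable caption is an assumption of this sketch, since the text does not say which distribution is used for the final choice.

```python
import numpy as np

def scaled_softmax(logits, T):
    """Softmax of T * logits, where T is the inverse temperature."""
    z = T * np.asarray(logits, dtype=float)
    z -= z.max()                       # for numerical stability
    p = np.exp(z)
    return p / p.sum()

def best_of_n(step_logits, N, T, eos, max_len, rng):
    """Sample N captions with scale factor T and keep the most probable.

    step_logits(prefix) -> sequence of real-valued scores, one per token
    index (hypothetical interface). rng is a numpy random Generator.
    """
    best_seq, best_lp = None, -np.inf
    for _ in range(N):
        seq, lp = [], 0.0
        for _ in range(max_len):
            logits = step_logits(tuple(seq))
            tok = int(rng.choice(len(logits), p=scaled_softmax(logits, T)))
            lp += np.log(scaled_softmax(logits, 1.0)[tok])  # assumed scoring
            seq.append(tok)
            if tok == eos:
                break
        if lp > best_lp:
            best_seq, best_lp = seq, lp
    return best_seq, best_lp
```

As $T$ grows the scaled distribution concentrates on the highest-scoring token, so every sample approaches the greedy caption.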

We adopt the best-performing generation strategy from the bottom-most set of results in Table VI (sampling with $T=1.5$, $N=100$) as the strategy for the middle set of results in the table, which ablates LRCN architectures. We also record generation performance for all architectures (Table VI, top set of results) with the simpler generation strategy used in our earlier work for ease of comparison with this work and for future researchers. For the remainder of this discussion, we will focus on the middle set of results, and particularly on the CIDEr-D (C) metric, as it was designed specifically for automatic evaluation of image captioning systems. We see again that the LSTM unit outperforms an RNN unit for generation, though not as significantly as for retrieval. Between the sequence model architecture choices (depicted in Figure 4) of the number of layers $L$ and whether to factor, we see that in this case the two-layer models (LRCN2f and LRCN2u) perform similarly, outperforming the single layer model (LRCN1u). Interestingly, of the three variants, LRCN2f is the only one to perform best for both retrieval and generation.

We see again that fine-tuning (FT) the visual representation and using a stronger vision model (VGGNet) improves results significantly. Fine-tuning improves CIDEr-D by roughly 0.04 points for CaffeNet, and by roughly 0.07 points for VGGNet. Switching from finetuned CaffeNet to VGGNet improves CIDEr-D by 0.13 points.

In Table VII we compare generation performance with contemporaneous and recent work submitted to the 2015 COCO caption challenge using our best-performing method (under the CIDEr-D metric) from the results on the validation set described above – generating a caption for a single image by taking the best of $N=100$ samples with a scalar factor of $T=1.5$ applied to the softmax inputs, using an LRCN model which pairs a fine-tuned VGGNet with our LRCN2f (two layer, factored) sequence model architecture. Our results are competitive with the contemporary work, performing 4th best in CIDEr-D (0.934, compared with the best result of 0.946 from ), and 3rd best in METEOR (0.335, compared with 0.346 from ).

In addition to standard quantitative evaluations, we also employ Amazon Mechanical Turk workers (“Turkers”) to evaluate the generated sentences. Given an image and a set of descriptions from different models, we ask Turkers to rank the sentences based on correctness, grammar and relevance. We compared sentences from our model to the ones made publicly available by . As seen in Table VIII, our fine-tuned (FT) LRCN model performs on par with the Nearest Neighbour (NN) on correctness and relevance, and better on grammar.

We show sample captions in Figure 6. We additionally note some properties of the captions our model generates. When using the VGG model to generate sentences in the validation set, we find that 33.7% of our generated sentences exactly match a sentence in the training set. Furthermore, we find that when using a beam size of one, our model generates 42% of the vocabulary words used by human annotators when describing images in the validation set. Some words, such as “lady” and “guy”, are not generated by our model but are commonly used by human annotators, but synonyms such as “woman” and “man” are two of the most common words generated by our model.

Video description

In video description the LSTM framework allows us to model the video as a variable length input stream. However, due to the limitations of available video description datasets, we rely on more “traditional” activity and video recognition processing for the input and use LSTMs for generating a sentence. We first distinguish the following architectures for video description (see Figure 5). For each architecture, we assume we have predictions of activity, tool, object, and locations present in the video from a CRF based on the full video input. In this way, we observe the video as a whole at each time step, not incrementally frame by frame.

(a) LSTM encoder & decoder with CRF max. (Figure 5(a)) This architecture is motivated by the video description approach presented in . They first recognize a semantic representation of the video using the maximum a posteriori (MAP) estimate of a CRF with video features as unaries. This representation, e.g., ⟨knife, cut, carrot, cutting board⟩, is concatenated into an input sequence (knife cut carrot cutting board) which is translated to a natural language sentence (a person cuts a carrot on the board) using statistical machine translation (SMT) . We replace SMT with an encoder-decoder LSTM, which encodes the input sequence as a fixed length vector before decoding to a sentence.

(b) LSTM decoder with CRF max. (Figure 5(b)) In this variant we provide the full visual input representation at each time step to the LSTM, analogous to how an image is provided as an input to the LSTM in image captioning.

(c) LSTM decoder with CRF probabilities. (Figure 5(c)) A benefit of using LSTMs for machine translation compared to phrase-based SMT is that they can naturally incorporate probability vectors at training and test time, which allows the LSTM to learn uncertainties in visual generation rather than relying on MAP estimates. The architecture is the same as in (b), but we replace max predictions with probability distributions.
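The difference between the inputs of variants (b) and (c) can be sketched as follows. This is a simplified illustration under stated assumptions: per-slot argmax stands in for the joint MAP estimate of the CRF, the slot sizes and probabilities are made up, and concatenating the visual vector with a word embedding at each step is one plausible way to provide it to the LSTM rather than a detail confirmed by the text.

```python
SLOTS = ("activity", "tool", "object", "location")

def one_hot_max(probs):
    """One-hot vector marking the most probable label (per-slot stand-in for CRF max)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

def crf_input(crf_probs, use_probabilities):
    """Concatenate one vector per slot: probabilities for (c), one-hot max for (b)."""
    vec = []
    for slot in SLOTS:
        p = crf_probs[slot]
        vec.extend(p if use_probabilities else one_hot_max(p))
    return vec

def decoder_inputs(visual_vec, word_embeddings):
    """Provide the same full visual vector at every time step (assumed: by concatenation)."""
    return [visual_vec + list(e) for e in word_embeddings]

# Hypothetical CRF outputs for one video.
crf = {
    "activity": [0.1, 0.7, 0.2],
    "tool": [0.6, 0.4],
    "object": [0.2, 0.3, 0.5],
    "location": [0.9, 0.1],
}
v_max = crf_input(crf, use_probabilities=False)   # variant (b)
v_prob = crf_input(crf, use_probabilities=True)   # variant (c)
```

Here the "tool" slot is nearly a coin flip (0.6 versus 0.4); the one-hot vector hides that uncertainty, whereas the probability vector passes it on to the decoder.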

We evaluate our approach on the TACoS multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 for training/validation). We compare to who use max prediction as well as a variant presented in which takes CRF probabilities at test time and uses a word lattice to find an optimal sentence prediction. Since we use the max prediction as well as the probability scores provided by , we have an identical visual representation. uses dense trajectories and SIFT features as well as temporal context reasoning modeled in a CRF. In this set of experiments we use the two-layered, unfactored version of LRCN, as described for image description.

Table IX shows the BLEU-4 score. The results show that (1) the LSTM outperforms an SMT-based approach to video description; (2) the simpler decoder architectures (b) and (c) achieve better performance than (a), likely because the input does not need to be memorized; and (3) our approach achieves 28.8%, clearly outperforming the best reported number of 26.9% on TACoS multilevel by .

More broadly, these results show that our architecture is not restricted only to input from deep networks, but can be cleanly integrated with fixed or variable length inputs from other vision systems.

Related Work

We present previous literature pertaining to the three tasks discussed in this work. Additionally, we discuss subsequent extensions which combine convolutional and recurrent networks to achieve improved results on activity recognition, image captioning, and video description as well as related new tasks such as visual question answering.

Activity Recognition. State-of-the-art shallow models combine spatio-temporal features along dense trajectories and encode features as bags of words or Fisher vectors for classification. Such shallow features track how low level features change through time but cannot track higher level features. Furthermore, by encoding features as bags of words or Fisher vectors, temporal relationships are lost.

Many deep architectures proposed for activity recognition stack a fixed number of video frames for input to a deep network. propose a fusion convolutional network which fuses layers which correspond to different input frames at various levels of a deep network. proposes a two stream CNN which combines one CNN trained on RGB frames and one CNN trained on a stack of 10 flow frames. When combining RGB and flow by averaging softmax scores, results are comparable to state-of-the-art shallow models on UCF101 and HMDB51. Results are further improved by using an SVM to fuse RGB and flow as opposed to simply averaging scores. Alternatively, and propose learning deep spatio-temporal features with 3D convolutional neural networks. , propose extracting visual and motion features and modeling temporal dependencies with recurrent networks. This architecture most closely resembles our proposed architecture for activity classification, though it differs in two key ways. First, we integrate 2D CNNs that can be pre-trained on large image datasets. Second, we combine the CNN and LSTM into a single model to enable end-to-end fine-tuning.
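As a concrete illustration of the score-averaging fusion mentioned above, the following sketch averages the softmax outputs of two streams. The logits are invented for the example, and the `rgb_weight` parameter is a hypothetical generalization; a value of 0.5 corresponds to plain averaging.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(rgb_logits, flow_logits, rgb_weight=0.5):
    """Weighted average of per-stream class probabilities; returns (probs, predicted class)."""
    p_rgb = softmax(rgb_logits)
    p_flow = softmax(flow_logits)
    fused = [rgb_weight * a + (1.0 - rgb_weight) * b for a, b in zip(p_rgb, p_flow)]
    return fused, max(range(len(fused)), key=fused.__getitem__)

# Toy scores for three classes: the streams disagree, and the fused result follows
# the more confident flow stream.
fused, predicted = late_fusion([2.0, 1.0, 0.1], [0.1, 3.0, 0.2])
```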

Image Captioning. Several early works on image captioning combine object and scene recognition with template or tree based approaches to generate captions. Such sentences are typically simple and are easily distinguished from more fluent human generated descriptions. address this by composing new sentences from existing caption fragments which, though more human like, are not necessarily accurate or correct.

More recently, a variety of deep and multi-modal models have been proposed for image and caption retrieval, as well as caption generation. Though some of these models rely on deep convolutional nets for image feature extraction , recently researchers have realized the importance of also including temporally deep networks to model text. propose an RNN to map sentences into a multi-modal embedding space. By mapping images and language into the same embedding space, they are able to compare images and descriptions for image and annotation retrieval tasks. propose a model for caption generation that is more similar to the model proposed in this work: predictions for the next word are based on previous words in a sentence and image features. propose an encoder-decoder model for image caption retrieval which relies on both a CNN and LSTM encoder to learn an embedding of image-caption pairs. Their model uses a neural language decoder to enable sentence generation. As evidenced by the rapid growth of image captioning, visual sequence models like LRCN are increasingly important for describing the visual world using natural language.

Video Description. Recent approaches to describing video with natural language have made use of templates, retrieval, or language models . To our knowledge, we present the first application of deep models to the video description task. Most similar to our work is , which use phrase-based SMT to generate a sentence. In Section 6 we show that phrase-based SMT can be replaced with LSTMs for video description as has been shown previously for language translation .

Contemporaneous and Subsequent Work

Similar work in activity recognition and visual description was conducted contemporaneously with our work, and a variety of subsequent work has combined convolutional and recurrent networks to both improve upon our results and achieve exciting results on other sequential visual tasks.

Activity Recognition. Contemporaneous with our work, train a network which combines CNNs and LSTMs for activity recognition. Because activity recognition datasets like UCF101 are relatively small in comparison to image recognition datasets, pretrain their network using the Sports-1M dataset which includes over a million videos mined from YouTube. By training a much larger network (four stacked LSTMs) and pretraining on a large video dataset, achieve 88.6% on the UCF101 dataset.

also combines a convolutional network with an LSTM to predict multiple activities per frame. Unlike LRCN, focuses on frame-level (rather than video-level) predictions, which allows their system to label multiple activities that occur in different temporal locations of a video clip. Like we show for activity recognition, demonstrates that including temporal information improves upon a single frame baseline. Additionally, employ an attention mechanism to further improve results.

Image Captioning. and also propose models which combine a CNN with a recurrent network for image captioning. Though similar to LRCN, the architectures proposed in and differ in how image features are input into the sequence model. In contrast to our system, in which image features are input at each time step, and only input image features at the first time step. Furthermore, they do not explore a “factored” representation (Figure 4). Subsequent work has proposed attention to focus on which portion of the image is observed during sequence generation. By including attention, aim to visually focus on the current word generated by the model. Other works aim to address specific limitations of captioning models based on combining convolutional and recurrent architectures. For example, methods have been proposed to integrate new vocabulary with limited or no examples of images and corresponding captions.

Video Description. In this work, we rely on intermediate features for video description, but end-to-end trainable models for visual captioning have since been proposed. propose creating a video feature by pooling high level CNN features across frames. The video feature is then used to generate descriptions in the same way an image is used to generate a description in LRCN. Though achieving good results, by pooling CNN features, temporal information from the video is lost. Consequently, propose an LSTM to encode video frames into a fixed length vector before sentence generation with an LSTM. Using an end-to-end trainable “sequence-to-sequence” model which can exploit temporal structure in video, improve upon results for video description. propose a similar model, adding a temporal attention mechanism which weights video frames differently when generating each word in a sentence.
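The loss of temporal information under pooling can be seen in a small sketch: mean pooling yields the same video feature for any ordering of the frames, whereas a recurrent encoder does not. The recurrence below is a deliberately simple toy (not an LSTM and not any of the cited models), and the frame features are invented.

```python
import math

def mean_pool(frames):
    """Average per-frame feature vectors into a single video feature."""
    return [sum(col) / len(frames) for col in zip(*frames)]

def toy_recurrent_encode(frames, decay=0.5):
    """Toy order-sensitive encoder: h_t = tanh(decay * h_{t-1} + x_t)."""
    h = [0.0] * len(frames[0])
    for x in frames:
        h = [math.tanh(decay * hi + xi) for hi, xi in zip(h, x)]
    return h

def close(a, b, tol=1e-9):
    """Element-wise comparison of two vectors within a tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
backwards = frames[::-1]

pooled_same = close(mean_pool(frames), mean_pool(backwards))
encoded_same = close(toy_recurrent_encode(frames), toy_recurrent_encode(backwards))
```

Playing the toy video backwards leaves the pooled feature unchanged but alters the recurrent encoding, which is the property that sequence-to-sequence models exploit.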

Visual Grounding. combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase. In order to reconstruct the phrase, the model must learn to visually ground the input phrase to the appropriate location in the image.

Natural Language Object Retrieval. In this work, we present methods for image retrieval based on a natural language description. In contrast, use a model based on LRCN for object retrieval, which returns the bounding box around a given object as opposed to an entire image. In order to adapt LRCN to the task of object retrieval, include local convolutional features which are extracted from object proposals and the spatial configuration of object proposals in addition to a global image feature. By including local features, effectively adapt LRCN for object retrieval.

Conclusion

We’ve presented LRCN, a class of models that is both spatially and temporally deep, and flexible enough to be applied to a variety of vision tasks involving sequential inputs and outputs. Our results consistently demonstrate that by learning sequential dynamics with a deep sequence model, we can improve upon previous methods which learn a deep hierarchy of parameters only in the visual domain, and on methods which take a fixed visual representation of the input and only learn the dynamics of the output sequence.

As the field of computer vision matures beyond tasks with static input and predictions, deep sequence modeling tools like LRCN are increasingly central to vision systems for problems with sequential structure. The ease with which these tools can be incorporated into existing visual recognition pipelines makes them a natural choice for perceptual problems with time-varying visual input or sequential outputs, which these methods are able to handle with little input preprocessing and no hand-designed features.

Acknowledgments

The authors thank Oriol Vinyals for valuable advice and helpful discussion throughout this work. This work was supported in part by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425 and IIS-1212798, and the Berkeley Vision and Learning Center. The GPUs used for this research were donated by NVIDIA. Marcus Rohrbach was supported by a fellowship within the FITweltweit-Program of the German Academic Exchange Service (DAAD). Lisa Anne Hendricks was supported by the NDSEG.

References