Learning where to Attend with Deep Architectures for Image Tracking
Misha Denil, Loris Bazzani, Hugo Larochelle, Nando de Freitas
Introduction
Humans track and recognize objects effortlessly and efficiently, exploiting attentional mechanisms (Rensink, 2000; Colombo, 2001) to cope with the vast stream of data. We use the human visual system as inspiration to build a system for simultaneous object tracking and recognition from gaze data. An attentional strategy is learned online to choose fixation points which lead to low uncertainty in the location of the target object. Our tracking system is composed of two interacting pathways. This separation of responsibility is a common feature in models from the computational neuroscience literature as it is believed to reflect a separation of information processing into ventral and dorsal pathways in the human brain (Olshausen et al., 1993a).
The identity pathway (ventral) is responsible for comparing observations of the scene to an object template using an appearance model, and on a higher level, for classifying the target object. The identity pathway consists of a two hidden layer deep network. The top layer corresponds to a multi-fixation Restricted Boltzmann Machine (RBM) (Larochelle & Hinton, 2010), as shown in Figure 1. It accumulates information from the first hidden layers at consecutive time steps. For the first layers, we use (factored)-RBMs (Hinton & Salakhutdinov, 2006; Ranzato & Hinton, 2010; Welling et al., 2005; Swersky et al., 2011), but autoencoders (Vincent et al., 2008), sparse coding (Olshausen & Field, 1996; Kavukcuoglu et al., 2009), two-layer ICA (Köster & Hyvärinen, 2007) and convolutional architectures (Lee et al., 2009) could also be adopted.
The control pathway (dorsal) is responsible for aligning the object template with the full scene so the remaining modules can operate independently of the object’s position and scale. This pathway is separated into a localization module and a fixation module which work cooperatively to accomplish this goal. The localization module is implemented with a particle filter (Doucet et al., 2001) which estimates the location, velocity and scale of the target object. We make no attempt to implement such states with neural architectures, but it seems clear that they could be encoded with grid cells (McNaughton et al., 2006) and retinotopic maps as in V1 and the superior colliculus (Rosa, 2002; Girard & Berthoz, 2005). The fixation module learns an attentional strategy to select fixation points relative to the object template. These fixation points are the centres of partial template observations, and are compared with observations of the corresponding locations in the scene using the appearance model (see Figure 2). Reward is assigned to each fixation based on the uncertainty of the target location at each time step. The fixation module uses the reward signal to adapt its gaze selection policy to achieve good localization. Our previous work (Bazzani et al., 2010) used Hedge (Auer et al., 1998a; Freund & Schapire, 1997) to learn this policy. In this extended paper we show that a straightforward adaptation of our previous approach to the partial information setting results in poor performance, and we propose an alternative method based on modelling the reward surface as a Gaussian Process. This approach gives good performance in the presence of partial information and allows us to expand the action space from a small, discrete set of fixation points to a continuous domain.
The proposed system can be motivated from different perspectives. First, starting with Isard & Blake (1996), many particle filters have been proposed for image tracking, but these typically use simple observation models such as B-splines (Isard & Blake, 1996) and colour templates (Okuma et al., 2004). RBMs are more expressive models of shape, and hence, we conjecture that they will play a useful role where simple appearance models fail. Second, from a deep learning computational perspective, this work allows us to tackle large images and video, which is typically not possible due to the number of parameters required to represent large images in deep models. The use of fixations synchronized with information about the state (e.g. location and scale) of such fixations eliminates the need to look at the entire image or video. Third, the system is invariant to image transformations encoded in the state, such as location, scale and orientation. Fourth, from a dynamic sensor network perspective, this paper presents a very simple, but efficient and novel way of deciding how to gather measurements dynamically. Lastly, in the context of psychology, the proposed model realizes to some extent the functional architecture for dynamic scene representation of Rensink (2000). The rate at which different attentional mechanisms develop in newborns (including alertness, saccades and smooth pursuit, attention to object features and high-level task driven attention) guided the design of the proposed approach and was a great source of inspiration (Colombo, 2001).
Our attentional model can be seen as building a saliency map (Koch & Ullman, 1985) over the target template. Previous work on saliency modelling has focused on identifying salient points in an image using a bottom up process which looks for outliers under some local feature model (which may include a task dependent prior, global scene features, or various other heuristics). These features can be computed from static images (Torralba et al., 2006), or from local regions of spacetime (Gaborski et al., 2004) for video. Additionally, a wide variety of different feature types have been applied to this problem, including engineered features (Gao et al., 2007) as well as features that are learned from data (Zhang et al., 2009). Core to these methods is the idea that saliency is determined by some type of novelty measure. Our approach is different, in that rather than identifying locally or globally novel features, our process identifies features which are useful for the task at hand. In our system the saliency signal for a location comes from a top down process which evaluates how well the features at that location enable the system to localize the target object. The work of Gao et al. (2007) considers a similar approach to saliency by defining saliency to be the mutual information between the features at a location and the class label of an object being sought; however, in order to make their model tractable the authors are forced to use specifically engineered features. Our system is able to cope with arbitrary feature types, and although we consider only on localization in this paper, our model is sufficiently general to be applied to identifying salient features for other goals.
Recently, a dynamic RBM state-space model was proposed in Taylor et al. (2010). Both the implementation and intention behind that proposal are different from the approach discussed here. To the best of our knowledge, our approach is the first successful attempt to combine dynamic state estimation from gazes with online policy learning for gaze adaptation, using deep network network models of appearance. Many other dual-pathway architectures have been proposed in computational neuroscience, including Olshausen et al. (1993b) and Postma et al. (1997), but we believe ours has the advantage that it is very simple, modular (with each module easily replaceable), suitable for large datasets and easy to extend.
Identity Pathway
The identity pathway in our model mirrors the ventral pathway in neuroscience models. It is responsible for modelling the appearance of the target object and also, at a higher level, for classification.
We use (factored)-RBMs to model the appearance of objects and perform object classification using the gazes chosen by the control module (see Figure 3). These undirected probabilistic graphical models are governed by a Boltzmann distribution over the gaze data and the hidden features . We assume that the receptive fields , also known as RBM weights or filters, have been trained beforehand. We also assume that readers are familiar with these models and, if otherwise, refer them to Ranzato & Hinton (2010) and Swersky et al. (2010).
2 Classification Model
The identity pathway also performs object recognition, classifying a sequence of gaze instances selected with the gaze policy. We implement a multi-fixation RBM very similar to the one proposed in Larochelle & Hinton (2010), where the binary variables (see Figure 4) are introduced to encode the relative gaze location within the multi-fixation RBM (a “1 in ” or “one hot” encoding of the gaze location was used for ).
The multi-fixation RBM uses the relative gaze location information in order to aggregate the first hidden layer representations at consecutive time steps into a single, higher level representation .
More specifically, the energy function of the multi-fixation RBM is:
where the normalization constant ensures that Equation 1 sums to 1. To sample from this distribution, one can use Gibbs sampling by alternating between sampling the top-most hidden layer given all individual processed gazes and vice versa. To train the multi-fixation RBM, we collect a training set consisting in sequences of pairs by randomly selecting gaze positions at which to fixate and computing the associated . These sets are extracted from a collection of images in which the object to detect has been centred. Unsupervised learning using contrastive divergence can then be performed on this training set. See Larochelle & Hinton (2010) for more details.
The main difference between this multi-fixation RBM and the one described in Larochelle & Hinton (2010) is that does not explicitly model the class label . Instead, a multinomial logistic regression classifier is trained separately, to predict from the aggregated representation extracted from . More specifically, we use the vector of activation probabilities of all hidden units in , conditioned on and , as the aggregated representation:
We experimented with a single fixation module, but found the multi-fixation module to increase classification accuracy. To improve the estimate the class variable over time, we accumulate the classification decisions at each time step.
Note that the process of pursuit (tracking) is essential to classification. As the target is tracked, the algorithm fixates at locations near the target’s estimated location. The size and orientation of these fixations also depends on the corresponding state estimates. Note that we don’t fixate exactly at the target location estimate as this would provide only one distinct fixation over several time steps if the tracking policy has converged to a specific gaze. It should also be pointed out that instead of using random fixations, one could again use the control strategy proposed in this paper to decide where to look with respect to the track estimate so as to reduce classification uncertainty. We leave the implementation of this extra attentional mechanism for future work.
Control Pathway
The control pathway mirrors the responsibility of the dorsal pathway in human visual processing. It tracks the state of the target (position, speed, etc) and normalizes the input so that other modules need not account for these variations. At a higher level it is responsible for learning an attentional strategy which maximizes the amount of information learned with each fixation. The structure of the control pathway is shown in Figure 5.
For the transition model, we will adopt a classical autoregressive process.
Our aim is to estimate recursively in time the posterior distributionWe use the notation to represent the past history of a variable over time. and its associated features, including the marginal distribution — known as the filtering distribution or belief state. This distribution satisfies the following recurrence:
Except for standard distributions (e.g. Gaussian or discrete), this recurrence is intractable.
After learning the observation model we will use it for tracking. The observation model is often defined in terms of the distance of the observations from a template ,
where denotes a distance metric and an object template (for example, a color histogram or spline). In this model, the observation is a function of the current state hypothesis and the selected action. The problem with this approach is eliciting a good template. Often color histograms or splines are insufficient. For this reason, we will construct the templates with (factored)-RBMs as follows. First, optical flow is used to detect new object candidates entering the visual scene. Second, we assign a template to the detected object candidate, as shown in Figure 2. The same figure also shows a typical foveated observation (higher resolution in the center and lower in the periphery of the gaze) and the receptive fields for this observation learned beforehand with an RBM. The control algorithm will be used to learn which parts of the template are most informative, either by picking from amoung a predefined set of fixation points, or by using a continuous policy. Finally, we define the likelihood of each observation directly in terms of the distance of the hidden units of the RBM , to the hidden units of the corresponding template region . That is,
The above template is static, but conceivably one could adapt it over time.
2 Reward Function
It is also useful to consider the cumulative reward
which is the sum of the instantaneous rewards which have been received up to time . The gaze control strategies we consider are all “no-regret” which means that the average gap between our cumulative reward and the cumulative reward from always picking the optimal action goes to zero as .
In our current implementation, each action is a different gaze location and the objective is to choose where to look so as to minimize the uncertainty about the belief state.
Gaze control
We compare several different strategies for learning the gaze selection policy. In an earlier version of this work (Bazzani et al., 2010) we learned the gaze selection policy with a portfolio allocation algorithm called Hedge (Freund & Schapire, 1997; Auer et al., 1998b). Hedge requires knowledge of the rewards for all actions at each time step, which is not realistic when gazes must be preformed sequentially, since the target object will move between fixations. We compare this strategy, as well as two baseline methods, to two very different alternatives.
EXP3 is an extension of Hedge to partial information games (Auer et al., 2001). Unlike Hedge, EXP3 requires knowledge of the reward only for the action selected at each time step. EXP3 is more appropriate to the setting at hand, and is also more computationally efficient than Hedge; however, this comes at a cost of substantially lower theoretical performance.
Both Hedge and EXP3 learn gaze selection policies which choose among a discrete set of predetermined fixation points. We can instead learn a continuous policy by estimating the reward surface using a Gaussian Process (Rasmussen & Williams, 2006). By assuming that the reward surface is smooth, we can draw on the tools of Bayesian optimization (Brochu et al., 2010) to search for the optimal gaze location using as few exploratory steps as possible.
The following sections describe each of these approaches in more detail.
We consider two baseline strategies, which we call random and circular. The random strategy samples gaze selections uniformly from a small discrete set of possibilities. The circular strategy also uses a small discrete set of gaze locations and cycles through them in a fixed order.
2 Hedge
To use Hedge (Freund & Schapire, 1997; Auer et al., 1998b) for gaze selection we must first discretize the action space by selecting a fixed finite number of possible fixation points. Hedge maintains an importance weight for each possible fixation point and uses them to form a stochastic policy at each time step. An action is selected according to this policy and the reward for each possible action is observed. These rewards are then used to update the importance weights and the process repeats. Pseudo code for Hedge is shown in Algorithm 1.
3 EXP3
EXP3 (Auer et al., 2001) is a generalization of Hedge to the partial information setting. In order to maintain estimates for the importance weights, Hedge requires reward information for each possible action at each time step. EXP3 works by wrapping Hedge in an outer loop which simulates a fully observed reward vector at each time step. EXP3 selects actions based on a mixture of the policy found by Hedge and a uniform distribution. EXP3 is able to function in the presence of partial information, but this comes at the cost of substantially worse theoretical guarantees. Pseudo code for EXP3 is shown in Algorithm 2.
4 Bayesian Optimization
Both Hedge and EXP3 discretize the space of possible fixation points and learn a distribution over this finite set. In contrast, Bayesian optimization is able to treat the space of possible fixation points as fully continuous by placing a smoothness prior on how reward is expected to vary with location. Intuitively, if we know the reward at one location, then we expect other, nearby locations to produce similar rewards. Gaussian Process priors encode this type of belief (Rasmussen & Williams, 2006), and have been used extensively for optimization of cost functions when it is important to minimize the total number of function evaluations (Brochu et al., 2010).
We assume that the true reward function is not directly measurable, and what we observe are measurements of this function corrupted by Gaussian noise. That is, at each time step the instantaneous reward , is given by
Given a set of observations we can compute the posterior predictive distribution for :
It remains to specify the form of the kernel function, . We experimented with several possibilities, but found that the specific form of the kernel function is not critical to the performance of this method. For the experiments in this paper we used the squared exponential kernel,
Equation 2 is a Gaussian Process estimate of the reward surface and can be used to select a fixation point for the next time step. This estimate gives both a predicted reward value and an associated uncertainty for each possible fixation point. This is the strength of Gaussian Processes for this type of optimization problem, since the predictions can be used to balance exploration (choosing a fixation point where the reward is highly uncertain) and exploitation (choosing a point we are confident will have high reward).
There are many selection methods available in the literature which offer different tradeoffs between these two criteria. In this paper we use GP-UCB (Srinivas et al., 2010) which selects
where is a parameter. The setting (with ) is used throughout this paper.
Treatment of the hyperparameters requires special consideration in this setting. The pure Bayesian approach is to put a prior on each parameter and integrate them out of the predictive distribution. However, since the integrals involved are not tractable analytically, this requires computationally expensive numerical approximations. Speed is an issue here since GP-UCB requires that we optimize a function of the posterior process at each time step so, for instance, computing Monte Carlo averages for each evaluation of Equation 2 is prohibitively slow.
An alternative approach is to choose parameter values via maximum likelihood. This can be done quickly, and allows us to make speedy predictions; however, in this case we suffer from problems of data scarcity, particularly early in the tracking process when few observations have been made. The length scale parameters are particularly prone to receiving very poor estimates when there is little data available.
We have found that using informative priors for the length scale parameters and making MAP, rather than ML, estimates at each time step provides a solution to the problems described above. MAP estimates can be made quickly using gradient optimization methods (Rasmussen & Williams, 2006), and informative priors provide resistance to the problems encountered with ML. The experiments in Section 6 place uniform priors on the magnitude and noise parameters and place independent Student-t priors on each length scale parameter. The experiments also use an initial data collection phase of 10 time steps before any adjustment of the parameters is made.
Algorithm
Since the belief state cannot be computed analytically, we will adopt particle filtering to approximate it. The full algorithm is shown in Algorithm 3.
We refer readers to Doucet et al. (2001) for a more in depth treatment of these sequential Monte Carlo methods. Assume that at time we have particles (samples) distributed according to . We can approximate this belief state with the following empirical distribution
Particle filters combine sequential importance sampling with a selection scheme designed to obtain new particles distributed approximately according to .
The joint distributions and are of different dimension. We first modify and extend the current paths to obtain new paths using a proposal kernel As our goal is to design a sequential procedure, we set
that is . The aim of this kernel is to obtain new paths whose distribution
is as “close” as possible to . Since we cannot choose because this is the quantity we are trying to approximate in the first place, it is necessary to weight the new particles so as to obtain consistent estimates. We perform this “correction” with importance sampling, using the weights
The choice of the transition prior as proposal distribution is by far the most common one. In this case, the importance weights reduce to the expression for the likelihood. However, it is possible to construct better proposal distributions, which make use of more recent observations, using object detectors (Okuma et al., 2004), saliency maps (Itti et al., 1998), optical flow, and approximate filtering methods such as the unscented particle filter. One could also easily incorporate strategies to manage data association and other tracking related issues. After normalizing the weights, , we obtain the following estimate of the filtering distribution:
Experiments
In this section, three experiments are carried out to evaluate quantitatively and qualitatively the proposed approach. The first experiment provides comparisons to other control policies on a synthetic dataset. The second experiment, on a similar synthetic dataset, demonstrates how the approach can handle large variations in scale, occlusion and multiple targets. The final experiment is a demonstration of tracking and classification performance on several real videos. For the synthetic digit videos, we trained the first-layer RBMs on the foveated images, while for the real videos we trained factored-RBMs on foveated natural image patches (Ranzato & Hinton, 2010).
The first experiment uses 10 video sequences (one for each digit) built from the MNIST dataset. Each sequence contains a moving digit and static digits in the background (to create distractions). The objective is to track and recognize the moving digit; see Figure 7. The gaze template had gaze positions, chosen so that gaze G5 was at the center. The location of the template was initialized with optical flow.
We compare the Hedge learning algorithm against algorithms with deterministic and random policies. The deterministic policy chooses each gaze in sequence and in a particular pre-specified order, whereas the random policy selects a gaze uniformly at random. We adopted the Bhattacharyya distance in the specification of the observation model. A multi-fixation RBM was trained to map the first layer hidden units of three time consecutive time steps into a second hidden layer, and trained a logistic regressor to further map to the 10 digit classes. We used the transition prior as proposal for the particle filter.
Tables 1 and 2 report the comparison results. Tracking accuracy was measured in terms of the mean and standard deviation (in brackets) over time of the distance between the target ground truth and the estimate; measured in pixels. The analysis highlights that the error of the learned policy is always below the error of the other policies. In most of the experiments, the tracker fails when an occlusion occurs for the deterministic and the random policies, while the learned policy is successful. This is very clear in the videos at: http://www.youtube.com/user/anonymousTrack
The loss of track for the simple policies is mirrored by the high variance results in Table 1 (experiments , , , and so on). The average mean and standard deviations (last column of Table 1) make it clear that the proposed strategy for learning a gaze policy can be of enormous benefit. The improvements in tracking performance are mirrored by improvements in classification performance (Table 2).
Figure 7 provides further anecdotal evidence for the policy learning algorithm. The top sequence shows the target and the particle filter estimate of its location over time. The middle sequence illustrates how the policy changes over time. In particular, it demonstrates that hedge can effectively learn where to look in order to improve tracking performance (we chose this simple example as in this case it is obvious that the center of the eight (G5) is the most reliable gaze action). The classification results over time are shown in the third row.
The second experiment addresses a similar video sequence, but tracking multiple targets. The image scale of each target changes significantly over time, so the algorithm has to be invariant with respect to these scale transformations. In this case, we used a mixture proposal distribution consisting of motion detectors and the transition prior. We also tested a saliency proposal but found it to be less effective than the motion detectors for this dataset. Figure 8 (top) shows some of the video frames and tracks. The videos allow one to better appreciate the performance of the multi-target tracking algorithm in the presence of occlusions.
Tracking and classification results for the real videos are shown in Figure 8 and the accompanying videos.
2 Partial Information Policies
In this section, two experiments are carried out to evaluate the performance of the different gaze selection policies.
In the first experiment we compare the performance of each gaze selection method on a data set of several videos of digits from the MNIST data set moving on a black background. The target in each video encounters one or more partial occlusions which the tracking algorithm must handle gracefully. Additionally, each video sequence has been corrupted with 30% noise. We measure the error between the estimated track and the ground truth for each gaze selection method, and demonstrate that Bayesian optimization preforms comparably to Hedge, but that EXP3 is not able to reach a satisfactory level of performance. We also demonstrate qualitatively that the Bayesian optimization approach learns good gaze selection policies on this data set.
Our second experiment provides evidence that the Bayesian optimization method can generalize to real world data.
Table 3 reports the results from our first experiment. The table shows the mean tracking error, measured by averaging distance between the estimated and ground truth track over the entire video sequence. Here we see that the Bayesian optimization approach compares favorably to Hedge in terms of tracking performance, and that EXP3 preforms substantially worse than the other two methods. Although Hedge preforms marginally better than Bayesian optimization, it is important to remember that Bayesian optimization solves a significantly more difficult problem. Hedge relies on discretizing the action space, and must have access to the rewards for all possible actions at each time step. In contrast, Bayesian optimization considers a fully continuous action space, and receives reward information only for the chosen actions.
Figure 9 shows the reward surfaces learned for each digit by Bayesian optimization, as well as a visualization of the overall best fixation points using data aggregated across ten runs. The optimal fixation points found by the algorithm are tightly clustered, and the resulting observations are very distinguishable.
In our second experiment we use the Youtube celebrity dataset from Kim et al. (2008). This data set consists of several videos of celebrities taken from Youtube and is challenging for tracking algorithms as the videos exhibit a wide variety of illuminations, expressions and face orientations. We run our tracking model using Bayesian optimization to learn a gaze selection policy on this data set, and present some results in Figure 10. Although we report only qualitative results from this experiment, it provides anecdotal evidence that Bayesian optimization is able to form a good gaze selection policy on real world data.
Conclusions and Future Work
We have proposed a decision-theoretic probabilistic graphical model for joint classification, tracking and planning. The experiments demonstrate the significant potential of this approach. We examined several different strategies for gaze control in both the full and partial information settings. We saw that a straightforward generalization of the full information policy to partial information gave poor performance and we proposed an alternative method which is able not only to perform well in the presence of partial information but also allows us to expand the set of possible fixation points to be a continuous domain.
There are many routes for further exploration. In this work we pre-trained the (factored)-RBMs. However, existing particle filtering and stochastic optimization algorithms could be used to train the RBMs online. Following the same methodology, we should also be able to adapt and improve the target templates and proposal distributions over time. This is essential to extend the results to long video sequences where the object undergoes significant transformations (e.g. as is done in the predator tracking system (Kalal et al., 2010)).
Deployment to more complex video sequences will require more careful and thoughtful design of the proposal distributions, transition distributions, control algorithms, template models, data-association and motion analysis modules. Fortunately, many of the solutions to these problems have already been engineered in the computer vision, tracking and online learning communities. Admittedly, much work remains to be done.
Saliency maps are ubiquitous in visual attention studies. Here, we simply used standard saliency tools and motion flow in the construction of the proposal distributions for particle filtering. There might be better ways to exploit the saliency maps, as neurophysiological experiments seem to suggest (Gottlieb et al., 1998).
One of the most interesting avenues for future work is the construction of more abstract attentional strategies. In this work, we focused on attending to regions of the visual field, but clearly one could attend to subsets of receptive fields or objects in the deep appearance model.
The current model has no ability to recover from a tracking failure. It may be possible to use information from the identity pathway (i.e. the classifier output) to detect and recover from tracking failure.
A closer examination of the exploration/exploitation tradeoff in the tracking setting is in order. For instance, the methods we considered assume that future rewards are independent of past actions. This assumption is clearly not true in our setting, since choosing a long sequence of very poor fixation points can lead to tracking failure. We can potentially solve this problem by incorporating the current tracking confidence into the gaze selection strategy. This would allow the exploration/exploitation trade off to be explicitly modulated by the needs of the tracker, e.g. after choosing a poor fixation point the selection policy could be adjusted temporarily to place extra emphasis on exploiting good fixation points until confidence in the target location has been recovered. Contextual bandits provide a framework for integrating and reasoning about this type of side-information in a principled manner.
Acknowledgments
We thank Ben Marlin, Kenji Okuma, Marc’Aurelio Ranzato and Kevin Swersky. This work was supported by CIFAR’s NCAP program and NSERC.