Deep Networks with Internal Selective Attention through Feedback Connections

Marijn Stollenga, Jonathan Masci, Faustino Gomez, Juergen Schmidhuber

Introduction

Deep convolutional neural networks (CNNs) with max-pooling layers trained by backprop on GPUs have become the state-of-the-art in object recognition , segmentation/detection , and scene parsing (for an extensive review see ). These architectures consist of many stacked feedforward layers, mimicking the bottom-up path of the human visual cortex, where each layer learns progressively more abstract representations of the input data. Low-level stages tend to learn biologically plausible feature detectors, such as Gabor filters . Detectors in higher layers learn to respond to concrete visual objects or their parts, e.g., . Once trained, the CNN never changes its weights or filters during evaluation.

Evolution has discovered efficient feedforward pathways for recognizing certain objects in the blink of an eye. However, an expert ornithologist, asked to classify a bird belonging to one of two very similar species, may have to think for more than a few milliseconds before answering , implying that several feedforward evaluations are performed, where each evaluation tries to elicit different information from the image. Since humans benefit greatly from this strategy, we hypothesize CNNs can too. This requires: (1) the formulation of a non-stationary CNN that can adapt its own behaviour post-training, and (2) a process that decides how to adapt the CNN's behaviour.

This paper introduces Deep Attention Selective Networks (dasNet), which model selective attention in deep CNNs by allowing each layer to influence all other layers on successive passes over an image through special connections (both bottom-up and top-down) that modulate the activity of the convolutional filters. The weights of these special connections implement a control policy that is learned through reinforcement learning after the CNN has been trained in the usual way via supervised learning. Given an input image, the attentional policy can enhance or suppress features over multiple passes to improve the classification of difficult cases not captured by the initial supervised training. Our aim is to let the system check the usefulness of internal CNN filters automatically, omitting manual inspection .

In our current implementation, the attentional policy is evolved using Separable Natural Evolution Strategies (SNES; ), instead of a conventional, single-agent reinforcement learning method (e.g. value iteration, temporal difference, policy gradients, etc.) due to the large number of parameters (over 1 million) required to control CNNs of the size typically used in image classification. Experiments on CIFAR-10 and CIFAR-100 show that on difficult classification instances, the network corrects itself by emphasizing and de-emphasizing certain filters, outperforming a previous state-of-the-art CNN.

Maxout Networks

In this work we use Maxout networks , combined with dropout , as the underlying model for dasNet. Maxout networks represent the state-of-the-art for object recognition in various tasks and have only been outperformed (by a small margin) by averaging committees of several convolutional neural networks. A similar approach, which does not reduce dimensionality in favor of sparsity in the representation, has also been recently presented . Maxout CNNs consist of a stack of alternating convolutional and maxout layers, with a final classification layer on top:

Pooling Layer.

A pooling layer is used to reduce the dimensionality of the output from a convolutional layer. The usual approach is to take the maximum value among non- or partially-overlapping patches in every map, therefore reducing dimensionality along the height and width . Instead, a Maxout pooling layer reduces every $b$ consecutive maps to one map, by keeping only the maximum value for every pixel-position, where $b$ is called the block size. Thus the map reduces $c$ input maps to $c' = c/b$ output maps.
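As an illustration, the reduction over blocks of consecutive maps can be sketched in plain Python. This is a minimal sketch and not the implementation used in the experiments; the function name `maxout_pool` and the nested-list layout (maps, rows, pixels) are assumptions made for the example.

```python
def maxout_pool(maps, b):
    """Reduce every b consecutive feature maps to one by a pixel-wise maximum.

    maps: list of c feature maps, each a list of rows of floats (height x width).
    Returns c // b maps with the same height and width.
    """
    c = len(maps)
    if c % b != 0:
        raise ValueError("number of maps must be divisible by the block size")
    pooled = []
    for start in range(0, c, b):
        block = maps[start:start + b]
        # zip(*block) walks the b maps row by row; zip(*rows) walks pixel by pixel.
        pooled.append([[max(pixels) for pixels in zip(*rows)]
                       for rows in zip(*block)])
    return pooled


# Four 2x2 maps with block size b = 2 are reduced to two 2x2 maps.
maps = [
    [[1.0, 5.0], [0.0, 2.0]],
    [[3.0, 4.0], [1.0, -1.0]],
    [[0.0, 0.0], [7.0, 2.0]],
    [[-2.0, 9.0], [6.0, 3.0]],
]
pooled = maxout_pool(maps, b=2)
```

Note that, unlike spatial max-pooling, the height and width of each map are unchanged; only the number of maps shrinks.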

Classification Layer.

Finally, a classification step is performed. First the output of the last pooling layer is flattened into one large vector $\vec{x}$, to form the input to the following equations:

Reinforcement Learning

Reinforcement learning (RL) is a general framework for learning to make sequential decisions in order to maximize an external reward signal . The learning agent can be anything that has the ability to act and perceive in a given environment.

At time $t$, the agent receives an observation $o_t \in O$ of the current state of the environment $s_t \in S$, and selects an action, $a_t \in A$, chosen by a policy $\pi: O \to A$, where $S$, $O$ and $A$ are the spaces of all possible states, observations, and actions, respectively. In this work $\pi: O \to A$ is a deterministic policy; given an observation it will always output the same action. However, $\pi$ could be extended to stochastic policies. The agent then enters state $s_{t+1}$ and receives a reward $r_t \in \mathbb{R}$. The objective is to find the policy, $\pi$, that maximizes the expected future discounted reward, $E[\sum_t \gamma^t r_t]$, where $\gamma \in [0, 1]$ discounts the future, modeling the “farsightedness” of the agent.
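To make the role of the discount factor concrete, the following minimal sketch computes the discounted sum for a single, finite reward sequence (the expectation would average this quantity over trajectories). The function name and the reward values are hypothetical and chosen only for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0, 0.0, 2.0]
# A small gamma discounts the late reward heavily; gamma = 1 weights all steps equally.
near_sighted = discounted_return(rewards, gamma=0.5)
far_sighted = discounted_return(rewards, gamma=1.0)
```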

In dasNet, both the observation and action spaces are real valued: $O = \mathbb{R}^{\dim(O)}$, $A = \mathbb{R}^{\dim(A)}$. Therefore, the policy $\pi_{\theta}$ must be represented by a function approximator, e.g. a neural network, parameterized by $\theta$. Because the policies used to control the attention of the dasNet have state and action spaces of close to a thousand dimensions, the policy parameter vector, $\theta$, will contain close to a million weights, which is impractical for standard RL methods. Therefore, we instead evolve the policy using a variant of Natural Evolution Strategies (NES; ), called Separable NES (SNES; ). The NES family of black-box optimization algorithms uses parameterized probability distributions over the search space, instead of an explicit population (i.e., a conventional ES ). Typically, the distribution is a multivariate Gaussian parameterized by mean $\mu$ and covariance matrix $\Sigma$. Each epoch a generation is sampled from the distribution, which is then updated in the direction of the natural gradient of the expected fitness of the distribution. SNES differs from standard NES in that instead of maintaining the full covariance matrix of the search distribution, it uses only the diagonal entries. SNES is theoretically less powerful than standard NES, but is substantially more efficient.
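The following sketch illustrates one common formulation of SNES with a diagonal Gaussian search distribution. It is a sketch under stated assumptions and not the code used for dasNet: the rank-based utilities, the learning rates `eta_mu` and `eta_sigma`, and the toy fitness function are conventional choices assumed here, and all function names are hypothetical.

```python
import math
import random


def rank_utilities(p):
    """Rank-based utilities (best sample first) that sum to zero."""
    raw = [max(0.0, math.log(p / 2 + 1) - math.log(k)) for k in range(1, p + 1)]
    total = sum(raw)
    return [r / total - 1.0 / p for r in raw]


def snes_maximize(fitness, mu, sigma, p=20, generations=200, seed=0):
    """Maximize `fitness` with a separable Gaussian N(mu, diag(sigma^2))."""
    rng = random.Random(seed)
    d = len(mu)
    mu, sigma = list(mu), list(sigma)
    eta_mu = 1.0                                        # assumed default
    eta_sigma = (3 + math.log(d)) / (5 * math.sqrt(d))  # assumed default
    utilities = rank_utilities(p)
    for _ in range(generations):
        # Sample p candidates z = mu + sigma * s with s ~ N(0, I).
        scored = []
        for _ in range(p):
            s = [rng.gauss(0.0, 1.0) for _ in range(d)]
            z = [m + sg * si for m, sg, si in zip(mu, sigma, s)]
            scored.append((fitness(z), s))
        # Sort best first so that utilities line up with ranks.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for j in range(d):
            grad_mu = sum(u * s[j] for u, (_, s) in zip(utilities, scored))
            grad_sigma = sum(u * (s[j] ** 2 - 1.0)
                             for u, (_, s) in zip(utilities, scored))
            mu[j] += eta_mu * sigma[j] * grad_mu       # uses the old sigma[j]
            sigma[j] *= math.exp(0.5 * eta_sigma * grad_sigma)
    return mu, sigma


def toy_fitness(z):
    """Toy objective with its maximum at z = (3, ..., 3)."""
    return -sum((zi - 3.0) ** 2 for zi in z)


mu, sigma = snes_maximize(toy_fitness, mu=[0.0] * 5, sigma=[1.0] * 5)
```

Because only one mean and one standard deviation are stored per parameter, the memory and update cost grow linearly with the number of parameters, which is what makes the approach usable for policies with on the order of a million weights.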

Deep Attention Selective Networks (dasNet)

The idea behind dasNet is to harness the power of sequential processing to improve classification performance by allowing the network to iteratively focus the attention of its filters. First, the standard Maxout net (see Section 2) is augmented to allow the filters to be weighted differently on different passes over the same image (compare to equation 1):

where $\theta$ is the parameter vector of the policy, $\pi_{\theta}$, and $\mathbf{v}_t$ is the output of the network on pass $t$.

Algorithm 1 describes the dasNet training algorithm. Given a Maxout net, $\mathbf{M}$, that has already been trained to classify images using training set, $X$, the policy, $\pi$, is evolved using SNES to focus the attention of $\mathbf{M}$. Each pass through the while loop represents one generation of SNES. Each generation starts by selecting a subset of $n$ images from $X$ at random. Then each of the $p$ samples drawn from the SNES search distribution (with mean $\mu$ and covariance $\Sigma$) representing the parameters, $\theta_i$, of a candidate policy, $\pi_{\theta_i}$, undergoes $n$ trials, one for each image in the batch. During a trial, the image is presented to the Maxout net $T$ times. In the first pass, $t=0$, the action, $\mathbf{a}_0$, is set to $a_i = 1, \forall i$, so that the Maxout network functions as it would normally; the action has no effect. Once the image is propagated through the net, an observation vector, $\mathbf{o}_0$, is constructed by concatenating the following values extracted from $\mathbf{M}$, by $h(\cdot)$:

the average activation of every output map $Avg(y_j)$ (Equation 2), of each Maxout layer.

the intermediate activations $\bar{y}_j$ of the classification layer.

the class probability vector, $\mathbf{v}_t$.
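The concatenation of these three groups of values can be sketched as follows. This is an illustrative sketch only; the function names, the nested-list layout of the maps, and the example numbers are assumptions and do not come from the trained network.

```python
def mean_activation(feature_map):
    """Average over all pixel positions of one output map."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def build_observation(maxout_layers, classifier_hidden, class_probs):
    """Concatenate per-map averages, classifier activations and class probabilities.

    maxout_layers: list of layers, each a list of 2D output maps.
    classifier_hidden: intermediate activations of the classification layer.
    class_probs: class probability vector of the current pass.
    """
    obs = []
    for layer_maps in maxout_layers:
        obs.extend(mean_activation(m) for m in layer_maps)
    obs.extend(classifier_hidden)
    obs.extend(class_probs)
    return obs


# One Maxout layer with two 2x2 maps, one hidden unit, two classes.
layers = [[[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 4.0]]]]
obs = build_observation(layers, classifier_hidden=[0.5], class_probs=[0.2, 0.8])
```

Averaging each map to a single number keeps the observation dimension equal to the number of maps plus the classifier outputs, rather than the number of pixels.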

While averaging map activations provides only partial state information, these values should still be meaningful enough to allow for the selection of good actions. The candidate policy then maps the observation to an action:

On the next pass, the same image is processed again, but this time using the filter weighting, $\mathbf{a}_1$. This cycle is repeated until pass $T$ (see figure 1 for an illustration of the process), at which time the performance of the network is scored by:

where $\mathbf{v}$ is the output of $\mathbf{M}$ at the end of pass $T$, $d$ is the correct classification, and $\lambda_{correct}$ and $\lambda_{misclassified}$ are constants. $L_i$ measures the weighted loss, where misclassified samples are weighted higher than correctly classified samples ($\lambda_{misclassified} > \lambda_{correct}$). This simple form of boosting is used to focus on the ‘difficult’ misclassified images. Once all of the input images have been processed, the policy is assigned the fitness:

where $\lambda_{L2}$ is a regularization parameter.

Once all of the candidate policies have been evaluated, SNES updates its distribution parameters ($\mu, \Sigma$) according to the natural gradient calculated from the sampled fitness values, $\mathcal{F}$. As SNES repeatedly updates the distribution over the course of many generations, the expected fitness of the distribution improves, until the stopping criterion is met.
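The evaluation of a single candidate policy over a batch can be sketched as below. This is a sketch under explicit assumptions, not the training code: the policy form (a linear map followed by a softmax scaled so that the gates average to one), the use of a Euclidean distance to a one-hot target as the loss, the L2 penalty on the policy weights, and the stand-in network `toy_forward` are all hypothetical choices made so that the example runs.

```python
import math


def softmax_gates(theta, obs):
    """Hypothetical policy: linear map, softmax, scaled so gates average to 1."""
    scores = [sum(w * o for w, o in zip(row, obs)) for row in theta]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


def weighted_loss(probs, label, lam_correct, lam_misclassified):
    """Distance to the one-hot target, weighted more heavily when misclassified."""
    target = [1.0 if k == label else 0.0 for k in range(len(probs))]
    dist = math.sqrt(sum((t - p) ** 2 for t, p in zip(target, probs)))
    predicted = max(range(len(probs)), key=probs.__getitem__)
    lam = lam_correct if predicted == label else lam_misclassified
    return lam * dist


def evaluate_policy(theta, forward, batch, T, lam_correct=0.005,
                    lam_misclassified=1.0, lam_l2=0.005):
    """Fitness (higher is better) of one candidate policy; requires T >= 1."""
    total = 0.0
    for image, label in batch:
        action = [1.0] * len(theta)       # first pass: gates have no effect
        for _ in range(T):
            probs, obs = forward(image, action)
            action = softmax_gates(theta, obs)
        total += weighted_loss(probs, label, lam_correct, lam_misclassified)
    l2 = math.sqrt(sum(w * w for row in theta for w in row))
    return -(total + lam_l2 * l2)


def toy_forward(image, action):
    """Stand-in for the gated Maxout net: gates scale two input features."""
    logits = [a * x for a, x in zip(action, image)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    probs = [e / sum(exps) for e in exps]
    return probs, probs   # here the observation is just the class probabilities


zero_theta = [[0.0, 0.0], [0.0, 0.0]]
fitness_right = evaluate_policy(zero_theta, toy_forward, [([2.0, 0.0], 0)], T=2)
fitness_wrong = evaluate_policy(zero_theta, toy_forward, [([2.0, 0.0], 1)], T=2)
```

In a full training loop, a fitness function of this kind would be called once per sampled parameter vector in each SNES generation, with a fresh random batch of images per generation.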

Related Work

Human vision is still the most advanced and flexible perceptual system known. Architecturally, visual cortex areas are highly connected, including direct connections over multiple levels and top-down connections. Felleman and Van Essen constructed a (now famous) hierarchy diagram of 32 different visual cortical areas in macaque visual cortex. About 40% of all pairs of areas were considered connected, and most connected areas were connected bidirectionally. The top-down connections are more numerous than bottom-up connections, and generally more diffuse . They are thought to play primarily a modulatory role, while feedforward connections serve as directed information carriers .

Analysis of response latencies to a newly-presented image lends credence to the theory that there are two stages of visual processing: a fast, pre-attentive phase, due to feedforward processing, followed by an attentional phase, due to the influence of recurrent processing . After the feedforward pass, we can recognize and localize simple salient stimuli, which can “pop-out” , and response times do not increase regardless of the number of distractors. However, this effect has only been conclusively shown for basic features such as color or orientation; for categorical stimuli or faces, whether there is a pop-out effect remains controversial . Regarding the attentional phase, feedback connections are known to play important roles, such as in feature grouping , in differentiating a foreground from its background (especially when the foreground is not highly salient ), and in perceptual filling in . Work by Bar et al. supports the idea that top-down projections from prefrontal cortex play an important role in object recognition by quickly extracting low-level spatial frequency information to provide an initial guess about potential categories, forming a top-down expectation that biases recognition. Recurrent connections seem to rely heavily on competitive inhibition and other feedback to make object recognition more robust .

In the context of computer vision, RL has been shown to be able to learn saccades in visual scenes to learn selective attention , learn feedback to lower levels , and improve face recognition . It has been shown to be effective for object recognition , and has also been combined with traditional computer vision primitives . Iterative processing of images using recurrency has been successfully used for image reconstruction and face-localization . All these approaches show that recurrency in processing and an RL perspective can lead to novel algorithms that improve performance. However, this research is often applied to simplified datasets for demonstration purposes due to computational constraints, and is not aimed at improving the state-of-the-art. In contrast, we apply this perspective directly to the known state-of-the-art neural networks to show that this approach is now feasible and actually increases performance.

Experiments on CIFAR-10/100

The experimental evaluation of dasNet focuses on ambiguous classification cases in the CIFAR-10 and CIFAR-100 data sets where, due to a high number of common features, two classes are often mistaken for each other. These are the most interesting cases for our approach. By learning on top of an already trained model, dasNet must aim at fixing these erroneous predictions without disrupting, or forgetting, what has been learned.

The CIFAR-10 dataset is composed of $32 \times 32$ color images split into $5 \times 10^4$ training and $10^4$ testing samples, where each image is assigned to one of $10$ classes. The CIFAR-100 dataset is similarly composed, but contains $100$ classes.

The number of steps, $T$, for the RL was experimentally determined and fixed at $5$; enough steps to allow dasNet to adapt while being small enough to be practical. While it would be possible to iterate until some condition is met, this could be a serious limitation in real-time applications where predictable processing latency is critical. In all experiments we set $\lambda_{\text{correct}} = 0.005$, $\lambda_{\text{misclassified}} = 1$ and $\lambda_{\text{L2}} = 0.005$.

The Maxout network, $\mathbf{M}$, used in the experiments was trained with data augmentation following the suggested global contrast normalization and ZCA normalization protocol. The model consists of three convolutional maxout layers followed by a fully connected maxout layer and softmax outputs. Dropout of $0.5$ was used in all layers except the input layer, and $0.2$ for the input layer. The population size for SNES was set to 50.

Table 1 shows the performance of dasNet vs. other methods, where it achieves a relative improvement of $6\%$ with respect to the vanilla CNN. This establishes a new state-of-the-art result for this challenging dataset.

Figure 3 shows the classification of a cat image from the test set. All output map activations in the final step are shown at the top. The difference in activations compared to the first step, i.e., the (de-)emphasis of each map, is shown on the bottom. On the left are the class probabilities for each time-step. At the first step, the classification is ‘dog’, and the cat could indeed be mistaken for a puppy. Note that in the first step, the network has not yet received any feedback. In the next step, the probability for ‘cat’ goes up dramatically, beating ‘dog’, and subsequently drops a bit in the following steps. The network has successfully disambiguated a cat from a dog. If we investigate the filters, we see that already in the lower layer emphasis changes significantly. Some filters focus more on surroundings whilst others de-emphasize the eyes.

In the second layer, almost all output maps are emphasized. In the third and highest convolutional layer, the most complex changes to the network occur. At this level the positional correspondence is largely lost, and the filters are known to code for ‘higher level’ features. It is in this layer that changes are the most influential because they are closest to the final output layers. It is hard to analyze the effect of the alterations, but we can see that the differences are not simple increases or decreases of the output maps, as we then would expect the final activations and their corresponding increases to be largely similar. Instead we see complex emphasis and pattern suppression.

To investigate the dynamics, a small 2-layer dasNet network was trained for different values of $T$. The resulting networks were then evaluated by allowing them to run for $[0..9]$ steps. Figure 2 shows results of training dasNet on CIFAR-100 for $T=1$ and $T=2$. The performance goes up from the vanilla CNN, peaks at $step = T$ as expected, and reduces but stays stable after that. So even though the dasNet was trained using only a small number of steps, the dynamics stay stable when these are evaluated for as many as 10 steps.

To verify whether the dasNet policy is actually making good use of its gates, their information content is estimated in the following way: The gate values in the last step are taken and used directly for classification. If the gates are used properly then their activation should contain information that is relevant for classification and we would expect a

dasNet that was trained with $T=2$ and are used as features for classification. Then using only the final gate-values (so without e.g. the output of the classification layer), a classification using 15-nearest neighbour and logistic regression was performed. This resulted in a performance of 40.70% and 45.74% correct respectively, similar to the performance of dasNet, confirming that they contain significant information and we can conclude that they are purposefully used.
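The idea of classifying from gate values alone can be illustrated with a small nearest-neighbour sketch. The gate vectors and labels below are made up for the example and are not experimental data; the function name `knn_predict` is hypothetical, and $k=3$ is used instead of 15 only because the toy set is tiny.

```python
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training vectors closest to `query`."""
    ranked = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up final-step gate vectors for two classes (illustration only).
gates = [[1.2, 0.8], [1.3, 0.7], [1.1, 0.9],
         [0.8, 1.2], [0.7, 1.3], [0.9, 1.1]]
labels = [0, 0, 0, 1, 1, 1]
prediction = knn_predict(gates, labels, query=[1.15, 0.85], k=3)
```

If the gate values carried no class information, a classifier of this kind would perform at chance level, which is the reasoning behind using gate-only classification accuracy as an estimate of their information content.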

Conclusion

DasNet is a deep neural network with feedback connections that are learned through reinforcement learning to direct selective internal attention to certain features extracted from images. After a rapid first-shot image classification through a standard stack of feedforward filters, the feedback can actively alter the importance of certain filters “in hindsight”, correcting the initial guess via additional internal “thoughts”.

DasNet successfully learned to correct image misclassifications produced by a fully trained feedforward Maxout network. Its active, selective, internal spotlight of attention enabled state-of-the-art results.

Future research will also consider more complex actions that spatially focus on (or alter) parts of observed images.

Acknowledgments

We acknowledge Matthew Luciw, who provided a short literature review, partially included in the Related Work section.

References