Conducting Credit Assignment by Aligning Local Representations

Alexander G. Ororbia, Ankur Mali, Daniel Kifer, C. Lee Giles

Introduction

In artificial neural networks, credit assignment is the task of computing the contribution to the overall error caused by individual units in the network, and then using this information to update the parameters of the entire network. Credit assignment and updates are most often done with the help of the gradient calculated by the well-known back-propagation of errors (Rumelhart et al., 1988),We will refer to back-propagation of errors also as “backprop” and “back-propagation” throughout. which provides a theoretical basis for training deep networks (i.e. gradient descent) but also presents challenges in practice due to the known vanishing/exploding gradient and shrinking variance problems.

The most common strategies for dealing with such problems include (1) a careful initialization of the network parameters, often following from a network-specific analysis of the learning dynamics caused by back-propagation (Glorot and Bengio, 2010; He et al., 2015; Sussillo, 2014; Mishkin and Matas, 2015), and (2) modifying the network structure, for example by using ReLU instead of sigmoid activations or adding batch normalization layers Ioffe and Szegedy (2015). Challenges such as these are often a barrier for new users to training deep networks to accuracies that approach the state of the art.

In this paper, we present a novel training algorithm, called Local Representation Alignment (LRA), that is robust to poor choices of initialization and can train deep networks from initializations that would cause backprop to fail. This allows network designers to choose units, including non-differentiable ones, based on the type of representation they provide rather than on the ideosyncrasies of backprop-based algorithms.

The idea behind LRA is that every layer, not just the output layer, has a target and each layer’s weights are adjusted so that its output moves closer to its target. While this idea is common to prior work such as TargetProp (Carreira-Perpiñán and Wang, 2012; Bengio, 2014; Lee et al., 2015), one key novelty of LRA is that it chooses targets that are in the possible representation of the associated layers and hence the layer’s parameters can be updated more effectively (i.e. layers are not forced to try to match a target that is impossible to achieve). Thus, unlike innovations such as Difference Target Propagation (Lee et al., 2015), Batch Normalization (Ioffe and Szegedy, 2015), etc, LRA does not need to introduce new layers in the architecture. As a result it can be viewed either as an alternative to such approaches, or as a complementary technique because it is compatible with these other methods (i.e. it can be used with batch normalization layers and residual blocks and any other layers that are helpful for the problem-specific representations a deep network needs to acquire).

Our method for setting the targets treats a deep network as a collection of smaller, related subgraphs. This view allows us to flexibly incorporate ideas like feedback alignment (Lillicrap et al., 2016) to train deep networks with non-differentiable activations – specifically, we use the feedback matrix to update the targets rather than the weights. This variant of LRA can also train differentiable networks as fast as back-propagation but more robustly – it is less sensitive to weight initializations than backprop and even other variants of feedback alignment (Lillicrap et al., 2016; Nøkland, 2016).

Another interesting feature of LRA is that it dynamically chooses which layers need to be trained - in the beginning, all layers, even the bottom-most layers in the network, receive significant updates; towards the end of training, only the top few layers need to be updated.

In the experiments of this paper, we compare two variations of the LRA – one where updates are based on calculus and another where error feedback weights are used (which turns out to be superior in robustness and speed). In our results, we show that LRA is:

robust to initialization when training highly nonlinear, deep networks. This result even holds in the extreme case of zero initialization, which back-propagation and target propagation cannot even handle.

able to avoid the vanishing gradient problem and train deep networks rather independently of the nonlinearity used internally. This means we can train performant models composed of many units such as the classical logistic sigmoid.

able to adapt the amount of computation expended during training. The depth of credit assignment is tied to how well a representation aligns with a target as governed by the local loss function.

able to learn networks that contain discrete-valued activation functions.

able to learn networks that contain stochastic units.

easily able to work with biologically inspired mechanisms, such as hard and soft lateral competition among hidden units. Since derivatives are no longer required for these mechanisms, a pathway to integrating other, even non-differentiable neurocognitively-motivated mechanisms is opened up.

We first explain how back-propagation can be re-cast in the target propagation framework of Lee et al. (2015) in Section 2. This makes it easier to identify weaknesses in the “backprop as targetprop” viewpoint and explain the modifications that give rise to LRA. Then we explain how a further modification allows LRA to work with discrete, non-differentiable units, as well as stochastic units. We discuss related work in Section 4 and present experimental results in Section 3 using MNIST and Fashion MNIST, a new and much more challenging benchmark.

Local Representation Alignment

To simplify notation, we describe LRA in the context of a multilayer perceptron architecture. However, LRA naturally applies to any stacked neural architecture, including those that are recurrent. In the supplementary material we provide the general formulation of LRA such that it may be applied recurrent models as well.

2 Setting the targets.

by the chain rule and inductive hypothesis.

Hence mini-batch gradient descent can be viewed as, for every iteration, selecting a mini-batch, choosing targets for each layer to obtain a per-layer optimization problem, and partially optimizing the layer-wise loss (with one gradient step).

This view naturally leads to three ways to improve training:

The per-layer loss can be customized for each layer. For example, the least squares loss can be replaced by the L1L_{1} norm or the log-penalty (Cauchy).

3 Divide and Conquer: The Computation Subgraph

LRA aims to decompose the larger credit assignment problem in neural architectures into smaller, easier-to-solve sub-problems. With this in mind, we can view any stacked neural architecture, or rather, its full, underlying operation graph, as a composition of “computation subgraphs”. A directed, acyclic computation graph may be decomposed into a set of smaller direct subgraphs, where a subgraph’s boundaries are defined as the set of input node variables and the set of output node variables.

Although LRA can be extended to recurrent networks (an extension is presented in the supplementary materials), in this paper, for the sake of explanation, we specialized it to feedforward neural architectures.

Such short-circuit operations have been shown to be useful empirically (Lillicrap et al., 2016; Nøkland, 2016) although they are not very well understood theoretically. Their use in prior work even allowed networks consisting of tanh nonlinearities (but not ReLU) to be trained from initial weights equal to 0 Nøkland (2016). In contrast, our experiments show that LRA-fdbk is even more robust and can train a much broader set of networks from 0.

In Algorithm 1, \otimes is used to denote the Hadamard product (or elementwise multiplication).

The normalization function Normalize()Normalize(\cdot) depicted in Algorithm 1 is defined formally as:

There are many possible choices for the local losses that measure the discrepancy between a layer’s output and its target. One possibility is the L2-norm, defined as

Another choice is the L1L_{1} norm, which is defined as:

After preliminary experimentation, we actually found a different loss, the log-penalty, to work much better with LRA for a wide variety of networks. The log-penalty function is derived from the log likelihood of the Cauchy distribution. In this paper, we implemented the log-penalty loss, for a single vector, as:

where the loss is computed over all dimensions z|\mathbf{z}| of the vector z\mathbf{z} (where a dimension is indexed/accessed by integer ii).

LRA also performs variable depth credit assignment because of the condition in the while loop that stops the backward pass early when the feedforward activation of a layer is close to its target (i.e. the local loss is at most ϵ\epsilon). This feature is not strictly necessary, but it is a nice addition that allows it to save computation by eventually only modifying the top layers of a network.

Finally, for the inner loop which successively refines the target, we found that with LRA-fdbk, the best performance is achieved when K=1K=1. This setting also makes its computational requirements comparable to backprop.

3.2 Handling Non-Differentiable/Stochastic Activations

This difficulty is easily circumvented with a trick inspired by Lee et al. (2015). We illustrate it with the following two activations (1) sign(h)sign(h) (also known as the Heavyside step function) which returns 1-1, 0, or 1, depending on whether hh is negative, , positive, respectively, and (2) bernoulli(h)bernoulli(h) which outputs 1 with probability σ(h)\sigma(h) and with probability 1σ(h)1-\sigma(h) (where σ\sigma is the sigmoid function).

This means that we can rewrite these activations as a composition of two functions g(f(h))g(f(h)), where ff is a differentiable approximation of the activation function. For instance:

where bernoulli(x){}^{*}(x) returns 1 with probability xx and otherwise.

Splitting activation functions this way allows us to extend our notation so that:

4 Overcoming Poor Initializations

Poor initializations affect networks with various activations differently, as we shall observe in the experiments later in this chapter. LRA, in both forms, can be seen as correcting for poor settings–something back-propagation of errors cannot do. LRA-diff can, if given a large enough local computation budget KK, “walk away” from poor settings quickly. That is, even if the penultimate layer n1n-1 provides bad features for the final classification, the target for that layer will be a set of features that help the final layer make a better prediction. Then, recursively, if layer n2n-2 provides bad features, the target for that layer will be a set of features that will help layer n1n-1 achieve its target which, in turn, will help the final layer.

In the experiments, we investigate how robust LRA is to various settings of the initialization scheme and how other algorithms, especially target propagation and feedback alignment, compare.

5 Overcoming Exploding and Vanishing Gradients

When neural networks are made deeper, backprop error gradients must pass backward through many layers using the global feedback pathway that involves a series of multiplications. As a result of these extra multiplications, these gradients tend to either explode or vanish (Bengio et al., 1994; Pascanu et al., 2013). In order to keep the values of the gradients within reasonable magnitudes and prevent zero gradients, it is common to impose constraints to ensure that layers are sufficiently linear in order to prevent post-activations from reaching their saturation regimes. However, this required linearity can create less than desirable side-effects, e.g., adversarial samples (Szegedy et al., 2013; Ororbia II et al., 2017b).

Experimental Results

The goal of these experiments is to test how easy it is to train deep networks with different algorithms (compared to LRA) and how robust these algorithms are to various settings, such as choice of initialization weights. Our claim is that LRA makes training of algorithms very easy and does not require much tuning to achieve high levels of accuracy – our goal is not necessarily to reach the state-of-the-art on any particular task (which typically requires expensive hyperparameter tuning). One of the use-cases of LRA is for researchers outside of deep learning who want to experiment with many different (and possibly novel) architectures designed for their data.

For all experiments in this paper, we keep the parameter optimization setting the same for all scenarios so that we may tease out the effects of individual learning algorithms instead. Specifically, updates calculated by each algorithm are used in a simple first-order gradient descent with a fixed learning rate of 0.010.01 and mini-batches of 50 samples.

We briefly describe the datasets used to investigate the ability of each learning algorithm in training deep, nonlinear networks.

The popular MNIST dataset (LeCun et al., 1998a) contains 28x2828x28 grey-scale pixel images, each belonging to one of 10 digit categories. There are 60,000 training images, from which we create a validation subset of 10,000 images, and 10,000 test images.

Fashion MNIST (Xiao et al., 2017) is a dataset composed of 28x2828x28 grey-scale images of clothing items, meant to serve as a much more difficult drop-in replacement for MNIST itself. The size and structure of the training and testing splits are the same as in MNIST and each image is associated with one of 10 classes. We create a validation set of 10,000 samples from the training split via random sampling without replacement.

1 Effect of Manifold-Walking

We first investigate how LRA-diff’s computation budget KK, specifically the number of sub-optimization steps allocated per subgraph, affects its ability to train a highly nonlinear network. Furthermore, we contrast this procedure against the vastly simpler error-feedback variant, LRA-fdbk (which uses K=1K=1 so is also much faster). To do so, we construct networks of three hidden layers of 64 hyperbolic tangent units with biases initialized from zero and weights according to the following classical heuristic:

noting that ninn_{in} is the size of previous/incoming layer of post-activities or the number of columns of the weight matrix WW (if working in column-major form). U[a,a]U[-a,a] is the uniform distribution in the interval (a,a)(-a,a).

For LRA-diff, we varied the computation budget K={5,10,30,50}K=\{5,10,30,50\}, and for LRA-fdbk, we fixed K=1K=1. The entries of the feedback matrix for LRA-fdbk were generated independently from a Gaussian with standard deviation σE\sigma_{E}. We tried the following settings of σE={1.0,1.25,1.5,2.0}\sigma_{E}=\{1.0,1.25,1.5,2.0\} (note in the plots this is denoted sd). Both variants of LRA employed the Cauchy loss (see Equation 4) as the metric for measuring discrepancy between representation and target. Networks were trained over 100 epochs but we only show the first 5 epochs, since roughly after this point, the differences in generalization rates were too similar to warrant visualization.

In Figure 2, for LRA-diff, we see that increasing KK leads to ultimately better generalization and sooner on both MNIST and Fashion MNIST. However, there is a diminishing return as one dramatically increases the number of steps from K=30K=30 to K=50K=50. As KK is the number of iterations of the inner loop in Algorithm 1, increasing KK leads to significant slowdown. However, more importantly, LRA-fdbk, which uses K=1K=1, is therefore a far faster variation of LRA and it reaches the same level of generalization. This means LRA-fdbk is able to use the short-circuit feedback connections to create a useful displacement for the model’s current input representation to help lower its local loss. We found that the initialization of the error feedback weights affects LRA-fdbk’s performance, though as one raises the standard deviation, the impact is far less severe than varying the number of steps in LRA-diff.Choosing a value for the standard deviation that is too low, especially below one, however, can slow down the learning process. We found that naively using a standard deviation of one worked quite well for our preliminary experiments and thus did no further tuning after Experiment 3.1.

In Figure 3, we show the filters acquired by the feedforward network on Fashion MNIST after a single epoch. To create these filter visualizations, we employ the feature activation maximization approach as presented by Erhan et al. (2010). Furthermore, while this approach generally only applies to the first hidden layer of units, which sit closest to the input pixel nodes, we can apply the same technique to the upper hidden layers of the network, such as the third layer, by simply ignoring the nonlinearity at each level of the model. Thus, we approximately “linearize” the nonlinear network which allows us to collapse successive weight matrices back into a single matrix (taking advantage of this natural property of deep linear networks). This will incur some minor approximation error, since the network is not truly linear, but we found that this approximation gave us a very fast and reasonably good picture of what knowledge might be captured in the synaptic connections that form the memories of the upper layer nodes, i.e., those closest to the output layer. Observe that both versions of LRA (note that for LRA-diff we use the network trained with K=30K=30) learn reasonably good and clear filters after just one pass through the data. Back-propagation, however, learns far noisier filters. Again, it is surprising to see that LRA-fdbk learn so well with only one pass through the data, given that it is far cheaper computationally than LRA-diff. Encouraged by this positive behavior, its low computation requirements, and the fact that LRA-fdbk can also handle a far greater range of activations, such as discrete-valued and stochastic ones, we focus on LRA-fdbk (with K=1K=1, so it only makes once pass through its inner loop, making its computational requirements comparable to back-prop) for the rest of the paper.

2 Robustness to Initialization

It is very difficult to train deep (and thin) networks from simplistic initialization schemes (Romero et al., 2014). Furthermore, LeCun et al. (1998b) showed that using the logistic sigmoid as an activation function can slow down learning considerably, largely due to its non-zero mean, which was further investigated by Glorot and Bengio (2010). Given the problems that come with unit saturation and vanishing gradients (Bengio et al., 1994), training a very deep and thin network, especially composed of logistic sigmoid units, with only back-propagation, can be very difficult.

To investigate LRA’s robustness to poor initialization, we use it and competing methods to train deep nonlinear networks consisting of either logistic sigmoid or hyperbolic tangent activation functions with model parameters (i.e. network weights) initialized from a parametrized, zero-mean Gaussian distribution. The Gaussian distribution is a very simple, common way to initialize the parameters of a neural model, and controlling its standard deviation, σ\sigma, allows us to probe different cases when back-propagation fails. In this experiment, we investigate the settings σ={0.025,0.05,0.1}\sigma=\{0.025,0.05,0.1\}, and compare LRA and backprop (BP) to algorithms such as Difference Target Propagation (DTP) from Lee et al. (2015), Feedback Alignment (FA) from Lillicrap et al. (2016), and Direct Feedback Alignment (DFA) from Nøkland (2016). Furthermore, we also show the situation where the weights are simply initialized to zero. Whenever it was not possible to learn from zero with a given algorithm, such as BP and DTP, we simply marked the appropriate slot with an XX. To initialize the feedback weights of DFA and FA, we follow the protocol prescribed by Nøkland (2016).

The network architecture each algorithm is responsible for training is the same: a multilayer perceptron containing eight hidden layers of 128 processing elements. We examine the ability of each algorithm to train the same architecture employing logistic sigmoid (a non-zero mean activation function) and the hyperbolic tangent (a zero-mean activation function).

In Table 1, we present the best found generalization error rate for the deep architecture learned with each algorithm under each initialization setting. Observe that LRA-fdbk is rather robust to the initializations scheme, and more importantly, is able to train to good generalization regardless of which unit type is used. Furthermore, even at initializations close to or at zero, LRA-fdbk is able to train deep networks of both logistic sigmoid and hyperbolic tangent networks. In Figure 4, we focus on the setting with σ=0.025\sigma=0.025 for the Gaussian distribution used to initialize the networks (i.e. the lowest non-zero setting). Observe that despite poor initialization, even within 10 epochs, LRA-fdbk is able to consistently reach good classification error (5%\approx 5\% on MNIST and <20%<20\% on Fashion MNIST), while the other methods struggle to reach those numbers even after 100 epochs. DFA is competitive with LRA-fdbk when using hyperbolic tangent units but trains the same network composed of sigmoid units poorly.

According to our results, for DTP, the inverse mapping used to reconstruct the underlying layerwise targets does not work all that well when weights are initialized from a purely random Gaussian, especially with a low standard deviation. As observed in Table 1, DTP struggles to train these deep networks, even when given the advantage and allowed to use an adaptive learning rate unlike the other algorithms. DTP’s struggle might be the result of losing too much of its layerwise target information too soon–the inverse mapping (or decoder of the layerwise auto-associative structure) requires a strong signal at each iteration to learn and if the signal is too weak or lost, the target produced for reconstruction becomes rather useless. However, DTP does far better than FA across the scenarios, although its lacks the ability to train from zero initialization. Our preliminary experimentation with DTP also uncovered that, in addition to requiring a more complex outer optimization procedure (like RMSprop) to achieve decent results, the learning procedure is highly dependent on its conditions and internal hyper-parameter settings (and there exist few heuristics on good starting points). To make DTP work well, significant tweaking of its settings would be required on a per-dataset/per-architecture basis in order to improve targets for the inverse mapping. Since the error of DTP (or target propagation algorithms in general) is represented as the change in activities of the same set of neurons, if any neural activity is unstable, the overall algorithm will fail to train the underlying model effectively. Furthermore, because of the extra computation involved in DTP, it is also far slower than LRA.

With respect to DFA, the layers are no longer related through a sequential backward pathway. This means that the lower-level neurons are disconnected from the forward propagation pathway when errors are calculated using the feedback projection weights. In contrast, we find that in FA the error signal is still created by a backward pass as in BP, but this time with the final per-neuron derivatives approximated by the feedback weights that replace the transpose of the forward weights in the BP global feedback pathway. Hence FA fails in cases where we have instability or few gradients are acting on participating neurons. DFA actually works fairly well compared to the other baselines if the activation function is the hyperbolic tangent, and does outperform FA when the logistic sigmoid is utilized. Furthermore, unlike DTP and BP, DFA and FA can train networks from zero, although LRA does a much better job.

Another interesting and important property of LRA-fdbk is that, unlike all of the other approaches investigated, it can automatically decide its depth of credit assignment. Specifically, in the case of the MLP, LRA-fdbk can decide how many subgraphs it needs to update. This is possible because of the condition in the while loop that stops the backwards pass if the feedforward activations are already similar to the targets (i.e. have a local loss at most ϵ\epsilon). This condition can be used at the mini-batch level or at the per-sample level. In Figure 5, we observe this dynamic behavior when recording the mean depth (or average number of subgraphs updated over the full training set in one pass), seeing that the network starts, within the first several epochs, by updating all of the subgraphs of the network. However, as learning continues, usually past five epochs, we see the number of subgraphs updated decrease, and, in the case of the tanh networks, approach one or zero. While not presented in the depth plots, we also recorded the average number of updates made at each layer. These logs revealed that, around the same time the average depth approaches one, even updates at the very top subgraph become less frequent. This aligns with our intuition that once latent representations of a lower layer are “good enough”, LRA can quit expending computation on that layer, on a per-sample basis. No other algorithm, including BP, has the property to adapt its computation when calculating parameter displacements (which is also why BP is depicted as horizontal line in Figure 5–its cost is the same over each epoch).

3 Training from Null Initialization

Next, we further experiment with LRA’s ability to train networks from null, or pure zero, initialization. While BP and DTP will fail in this setting, DFA and FA will not. However, in order for DFA and FA to operate in this special setting certain restrictions are needed, e.g. the activation function must be specific like the hyperbolic tangent Nøkland (2016) and non-zero initialization must be used for certain activations including the linear rectifier. LRA does not impose these restrictions, and furthermore, can easily handle non-differentiable operations, e.g., the signum function, as we shall observe shortly.

To demonstrate LRA’s ability to handle a wide variety of functions, we train models of 3 layers of 800 hidden units with updates estimated over mini-batches of 20 samples. Parameters were updated using the Adam adaptive learning rate (Kingma and Ba, 2014). The activation functions we experimented with included the softsign (Glorot and Bengio, 2010), the softplus (Glorot et al., 2011), the linear rectifier (Glorot et al., 2011), local-winner-take-all (LWTA) lateral competition (Srivastava et al., 2013), and the signum (or sign).

For the networks that used LWTA and signum units, the architecture for any particular subgraph of the MLP, except the bottommost and topmost subgraphs, is defined as follows (essentially decomposing the activation into two parts, as discussed in Section 2.3.2):

In this experiment, the LWTA blocks we employed grouped four neurons together, with no overlap, yielding 200 blocks of laterally competitive neurons. Index precedence is used to break any ties. This form of structured sparsity through competition has also been observed in a biological neural circuits when modeling brain processes. Specifically, areas of the brain exhibit are structured with neurons providing excitatory feedback to nearby neurons, as evidenced in studies of cortical and sub-cortical regions of the brain (Stefanis, 1969; Andersen et al., 1969; Eccles, 2013). The concept of inter-neuronal competition also plays a key role in Bayesian theories of the brain, specifically those that build on (sparse) predictive coding (Olshausen and Field, 1997; Rao and Ballard, 1999), which argue that lateral competition allow the underlying system to uncover the few causal factors, out of the many possible, that explain a given input stimulus at any time step. From a practical machine learning perspective, sparsity is highly desirable for a wide variety of reasons (Glorot et al., 2011), and has recently been shown to be useful in training directed neural generative models of sequential data (Ororbia II et al., 2017a).

Note that with LRA-fdbk, one can easily integrate what we will call soft lateral competition instead of the hard vector quantization in LWTA. For example, one can use the softmax instead of the argmax operator for each competition block. This would mean the lateral competition block would be defined as:

This soft block could then be treated as the probabilities of a mini categorical distribution and sampled accordingly, if a hard-decision is still required. Since LRA-fdbk does not require the derivative of the lateral competition block function, one does not need to compute the expensive Jacobian associated with the pure softmax function.When the softmax is used at the output layer in purely differentiable systems, one takes advantage of the analytically simplified derivative of the output with respect to pre-activities when using the categorical log likelihood loss function. When using the softmax inside the network, this trick is no longer available to the practitioner. For this experiment, we refer to the proposed soft LWTA as SLWTA.

In Table 2, we see that LRA-fdbk is able to successfully train different activation functions. This includes the deep models that contain discrete-valued units. It is interesting to note, that while all networks trained with LRA-fdbk generalize reasonably well, a network that performs best on one dataset does not necessarily perform the best on the other. For example, the best-performing network on MNIST was the network that used the signum function while the sparse rectifier network performed better on Fashion MNIST. To dig deeper into the discriminative ability of each layer of a network learned with LRA-fdbk versus other algorithms such as BP and DFA, we extracted the hidden representations of the model learned under each when applied to the test-set of Fashion MNIST. Each multidimensional representation vector was then projected to 2D for visualization using t-SNE, Barnes-Hutt approximation (Van Der Maaten, 2013). The results of this visualization can be found Figure 6. We see, first and foremost, that the model learned with LRA-fdbk has acquired distributed representations that contain information useful for properly separating the data-points by class.

4 Stochastic Networks

We next investigate if LRA, or specifically LRA-fdbk, can successfully handle networks with stochastic units, such as Bernoulli-distributed variables, an important class of non-differentiable activation functions. We compare LRA with DTP and back-propagation of errors. For back-propagation of errors, since a discrete sampling function is non-differentiable, we explore a variety of approximations that deal with binary stochastic units, which can be considered to be noisy modifications of the logistic sigmoid. To train networks with these kinds of units, we need an estimator, which can be classified under two categories–unbiased and biased. If the expected value of an estimate equals the expectation of the derivative it is trying to estimate, the estimator is unbiased. Otherwise, it is biased. We examine several such estimators, including the straight-through estimator STE (Bengio et al., 2013; Chung et al., 2016), variations of the slope-annealing trick (Chung et al., 2016), and reinforcement learning approaches (Williams, 1992; Chung et al., 2016) to training discrete-valued variables, i.e., REINFORCE. REINFORCE operates directly on the loss of the network and does not require back-propagated gradients while the class of STEs simply replace the derivative of the Bernoulli sampling operation with the identity function. The sigmoid-adjusted STE replaces the same derivative with that of the logistic sigmoid. In slope-annealing, we multiply the input value by a slope-value mm, which is increased throughout training to make the sigmoidal derivative ultimately approach the step function.

The stochastic models trained for this experiment each contain two layers of 200 hidden units (which is the setting used by Lee et al. (2015)) and parameters are trained over 500 epochs. The specific architecture is as follows:

where sigm(v)sigm(v) is the logistic sigmoid, which parametrizes the probability pp of the layer of Bernoulli variables, and S(p)B(1,p)S(p)\sim\mathbf{B}(1,p) is a stochastic operator that takes in a probability pp and returns either a zero or a one, e.g., a binary variable.

In Table 3, observe that LRA is able to effectively train networks composed of stochastic binary units, competitive with DTP and outperforming the other estimators used in back-propagation. This is encouraging, since it is well-known that actual neurons communicate via spikes, and modeling this discrete signal as a Bernoulli variable brings us one step closer to incorporating neuro-biological ideas into artificial neural architectures. We believe that using spike-like variables in a neural system offers a form of regularization much akin to that of drop-out (Srivastava et al., 2014). The key feature of using spike variables is that at test-time, we do not “shut off” this mechanism as is done in drop-out (where we calculate an expectation over all possible sub-models by multiplying the activities by the drop-out probability used in training). One could easily use a stochastic model such as the one we train to also characterize its uncertainty at the posterior by simply estimating its variance in addition to the mean as we have done in these particular experiments.

5 Feature Extraction versus Optimization

In our final experiment, we investigate if LRA works more like a feature extraction algorithm, which is more useful for pre-training, or as an optimization procedure. We conduct this experiment in two parts across two different researchers. The first part of the experiment entails training two MLPs with 3 hidden layers of 256 units, using hyperbolic tangent activation functions, one with LRA-fdbk and one with back-propagation of errors. The parameters, e.g. synaptic weight matrices and bias vectors, are extracted from each model and communicated to a second researcher. The identifiers that indicate which network was trained by which algorithm are removed before communication. The second researcher is to fine-tune both networks using back-propagation of errors and stochastic gradient descent.

In Table 4, we report model performance on each split (training, validation, and test) at the end of pretraining and the best model (as measured on validation) found during fine-tuning. We see that fine-tuning an LRA-trained network with back-propagation of errors appears to improve performance, especially on the training and validation sets. Due to the small training error (compared to validation and testing) and the similarities in errors between backprop and LRA, it appears that LRA still works like an optimization algorithm but one that is regularized (so it is slower to overfit). The source of this regularization might come from the use of the error feedback weights.

To test this idea further, we conducted another experiment that measured the angle between the parameter updates computed by LRA (collectively denoted by ΔLRA\Delta_{LRA}) after processing a particular mini-batch and the angle of the update as would have been given by backprop, (collectively denoted by ΔBP\Delta_{BP}). The angle is measured by first computing the cosine of the angle between the two different types of updates computed at any iteration ii, or cos(ΔLRA,ΔBP=(ΔLRA)TΔBP/(ΔLRA2ΔBP2)cos(\Delta_{LRA},\Delta_{BP}=(\Delta_{LRA})^{T}\Delta_{BP}/(||\Delta_{LRA}||_{2}||\Delta_{BP}||_{2}), and ultimately converting to degrees. As indicated in Figure 7, we see that over the course of training (100 epochs), the angle between the updates found by the backprop and LRA are within the desired 9090^{\circ} (meaning that the LRA update is a descent direction) and while it increases from an initial 1010^{\circ}, it ultimately stabilizes at about a divergence of 3838^{\circ} on average.

Related Work

As suggested by the algorithm’s name, LRA approaches the credit assignment problem by explicitly formulating and optimizing the related problem of learning good latent representations, or abstractions, of the data. Specifically, LRA decomposes the problem into a series of linked local learning problems. This form of learning is what we will call “coordinated local learning”. Like classical local (neurobiologically plausible) rules, such as Hebbian learning (Hebb, 1949), we make use of information that is within close proximity of particular groups of neurons to compute updates to synaptic weights–guessed initial activation patterns and target activation patterns for any particular layer. However, unlike pure local learning, part of the local information LRA uses, i.e., the targets, are created through a process that is guided more globally by either explicit feedback weights or an iterative-inference pathway (implemented by the chain rule of calculus). The motivation behind this particular style of computation that defines LRA comes from the theory of predictive coding, part of which posits that local computations occur at multiple levels of the biological structure underlying the human brain (Grossberg, 1982; Rao and Ballard, 1999; Huang and Rao, 2011; Clark, 2013; Panichello et al., 2013). This stands in contrast to back-propagation of errors, the workhorse algorithm behind modern neural networks, which crucially conducts credit assignment through the use of a global feedback pathway to carry back the error signals needed for computing updates (Ororbia II et al., 2017a). This particular pathway creates problems such as exploding and vanishing gradients (Bengio et al., 1994; Glorot and Bengio, 2010) and imposes severe restrictions on kinds of operations and modifications we can use–highly nonlinear mechanisms such as lateral neuronal competition and non-differentiable operators, such as discrete-valued stochastic activations, are difficult or even impossible to implement.

As we have shown earlier, LRA was created from the perspective of viewing back-propagation of errors from the perspective of target propagation (Bengio, 2014; Lee et al., 2015), of which recirculation (Hinton and McClelland, 1988; O’Reilly, 1996) is a predecessor algorithm. In recirculation, a single hidden layer autoencoder uses the datum as the target value for reconstruction (affecting the decoder) and the initial encoded representation of the datum as the target for the encoder, which is computed after a second forward pass. Target propagation revolves around the concept of the function inverse–if we had access to the inverse of the network of forward propagations, we could compute which input values at the lower levels of the network would result in better values at the top that would please the global cost. In essence, we would use the inverse to propagate back along the network the target value and then update each layer to move closer to this target value. So long as we have access to the inverse of the functions used for each layer, we can use any non-linear activation, including those that are discrete-valued. Under simple conditions, when all the layer objectives are combined, target propagation could yield updates with the same sign as the updates obtained by back-propagation (Le Cun, 1986).

Like LRA, algorithms like target propagation and recirculation can be viewed as using higher-level objectives that seek better representations of data, governed by the principle of discrepancy reduction (Ororbia II et al., 2017a) which entails a two-step process for learning: 1) seek better representations of data, 2) minimize the mismatch between the model state and this better state. Furthermore, they represent a strong push towards using local, more biologically plausible, rules to learn neural systems.

Local learning first made a small resurgence when training deeper networks first came into mainstream view in the form of layer-wise training of unsupervised models (Bengio et al., 2007), supervised models (Lee et al., 2014), and semi-supervised models/hybrid training (Ororbia II et al., 2015b, a). Although important in stimulating work towards improved learning and initialization of more complex neural models, the key problem with these layer-wise training approaches was that they were greedy–building a model from the bottom-up, freezing lower-level parameters as higher-level feature detectors were learnt. These approaches lacked the global coordination where upper-layer feature detectors direct lower-layer feature detectors as to what basic patterns they should be finding.A lower-level feature detector might be able to find different aspects of structure in its input since multiple patterns might satisfy its layer-wise objective but this might not help the layers above find better higher-level patterns/abstractions. Nonetheless, interesting local update rules could be used in the construction of these “stacked” models–back-propagation on the reconstruction cross entropy for autoencoders (Vincent et al., 2008) and Contrastive Divergence for Boltzmann-based models (Hinton, 2002; Bengio et al., 2007). Another interesting approach, and one related to LRA in that it cares about sparsity of representations, is that of stacked sparse coding (He et al., 2014), which greedily learns a composition of sparse coding sub-models (Olshausen and Field, 1997).

The idea of learning locally is slowly becoming prominent in the training artificial neural networks, with other recent proposals including kickback (Balduzzi et al., 2015), which was derived specifically for regression problems. MAC/QP (Carreira-Perpiñán and Wang, 2012) relaxes the hard constraint that the output of one layer equals the input to the next layer, adding a penalty term to the objective function when they are different. This allows the training of deep networks to be done locally and parallelized. Decoupled neural interfaces (Jaderberg et al., 2016) also operate locally, but take the approach of learning a predictive model of error gradients (and inputs) instead of trying to use local information to estimate an update to weights. As a result, this procedure allows layers of the underlying model to be trained independently. Other related approaches, which take a stochastic/probabilistic view of learning, include expectation propagation (Jylänki et al., 2014), the variational walkback algorithm (Alias Parth Goyal et al., 2017), and equilibrium propagation (Scellier and Bengio, 2017). Contrastive Hebbian learning (CHL) (Movellan, 1991; Xie and Seung, 2003; O’reilly, 2001) works similarly to Contrastive Divergence in that it is ultimately computing parameter updates using a positive phase and a negative phase, trying to make the negative phase (or the “fantasies”) more similar to the positive phase (which is the state of the model clamped at the data).

Another important idea that comes into play in LRA is that learning is possible with asymmetry, and even more interestingly, with random fixed feedback loops. This was shown in a learning algorithm called feedback alignment (Lillicrap et al., 2016), which essentially replaces the backward pass of back-propagation that involves the transpose of the feedforward weights with a random matrix of the same dimensions. Direct feedback alignment (Nøkland, 2016) extends this idea further by directly connecting the output layer’s pre-activation derivative to each layer’s post-activation. These feedback alignment procedures directly resolve one of the main criticisms of back-propagation–the weight-transport problem (Grossberg, 1987; Liao et al., 2016)–by showing that coherent learning is possible with asymmetric forward and backward pathways. LRA, however, uses the idea of random error feedback loops quite differently–use the error feedback connections to generate a better target representation to move towards instead of simply replacing the error gradient computations within back-propagation’s global feedback pathway.

Conclusion & Future Work

We showed how back-propagation can be re-cast in the framework of target propagation and used the insights from this perspective to propose the Local Representation Alignment (LRA) algorithm. LRA is a training procedure that decomposes the credit assignment problem of artificial neural networks into smaller, local learning problems. Specifically, we introduced the notion of the computation subgraph–an object that encompasses two layers of processing elements and the underlying operations and parameter variables that connect them–and how to view a deep network as a linked set of such subgraphs. Motivated by fundamental ideas in representation learning, LRA structures every subgraph within the network to have a target, not just at the output layer, and adjusts the free parameters of the subgraph to move the output closer to the target. LRA, in contrast to previous approaches including target propagation, chooses targets that are in the possible representation of the associated layers and hence the layer’s parameters can be updated more effectively (i.e. layers are not made to match a target that is impossible to achieve). The subgraph view also allowed us to introduce a short-circuit pathway, inspired by the idea of feedback alignment, which allows LRA to handle non-differentiable activation functions.

Unlike previously proposed algorithms, including back-propagation, target propagation, and variants of feedback alignment, LRA is far less sensitive to parameter initialization when training highly nonlinear networks. Furthermore, it can adaptively decide the depth of the credit assignment it needs to conduct, which can lead to savings in computation per step. In addition to being compatible with recent innovations such as batch normalization and drop-out, LRA is architecture-agnostic, so long as the global model can be decomposed into a series of linked subgraphs, where the output each subgraph can be viewed as a hidden representation to which a target can be assigned. For the case of feedforward networks, our experiments on MNIST and Fashion MNIST add strong empirical evidence to support the above claims.

LRA offers a pathway for users of neural networks to design the architecture for the problem at hand rather than for the traits and quirks of back-prop-based algorithms. This means that non-differentiable units and more biologically-motivated ones may be utilized. Since discrepancy reduction (Ororbia II et al., 2017a) can be viewed as a special variant of LRA and was shown to be capable of successfully learning directed models of temporal data without back-propagation through time, future work will include training recurrent network models, on videos and text documents, with LRA. Furthermore, we intend to examine the algorithm’s performance and behavior at a larger scale than that investigated in this paper.

We would like to acknowledge support for this project from the National Science Foundation (NSF grant #1317560 ).

Appendix A.

In Figure 8, we illustrate a computation subgraph that contains a non-differentiable operation, such as a discrete-valued nonlinearity or a sampling function. LRA avoids the need for approximating the derivative of the non-differentiable function by simply short-circuiting the pathway that gradients would naturally flow through. This is where the wiring of the error weights becomes even more useful–we can “pocket” exotic functions right on top of the subgraph’s input post-activation. Another way to view this setup is to consider the input post-activation to be composed of two processing steps, an initial nonlinear transformation (such as through the hyperbolic tangent) followed by a discretization step (such as through a signum).

Appendix B.

In this appendix we describe how Local Representation Alignment (LRA), specifically, LRA-fdbk, can be applied to the training of a recurrent neural network (RNN). This variation of LRA aligns with the extension of back-propagation of errors to sequential neural models, or back-propagation through time (Werbos, 1988).

Essentially, we apply LRA of errors but with one crucial exception–we must unfold the network TT steps back in timeNote that this action is the same as unrolling a mathematical recurrence relation, since the hidden state of an RNN is recursively computed., ideally from the end of the sample sequence back up to its beginning. Note that in practice, we break up our sequences into sub-sequences of length KK, where K<TK<T, and process time-varying data in chunks to maintain computational tractability. This creates a very deep feedforward network, with each input at each time step fed into the unfolded graph and the underlying parameters copied at each time step.

We will focus on applying LRA to the situation where an RNN is fully unfolded over the length of an entire sample sequence. Assume a simple single-hidden layer Elman RNN with a linear output layer, defined as follows:

where the parameters to learn are simply Θ={W,V,U}\Theta=\{W,V,U\} (biases omitted). The task is next-step prediction, so at each time step, yt=yz,t2=xt+1\mathbf{y}_{t}=\mathbf{y}^{2}_{z,t}=\mathbf{x}_{t+1}, where tt indexes a particular point in time.

Next, we define the variable e\mathbf{e} to be the first derivative of a given local loss, where et2\mathbf{e}^{2}_{t} is the first partial derivative of the local loss with respect to the output units zt2\mathbf{z}^{2}_{t} at time tt and et1\mathbf{e}^{1}_{t} is the first partial derivative of the local loss with respect to the hidden unit post-activation zt1\mathbf{z}^{1}_{t}. In short, assuming a Gaussian loss for both output and hidden local losses L2\mathcal{L}_{2} and L1\mathcal{L}_{1} with a fixed variance of 1, this means that:

Finally, a single, extra set of recurrent parameters EE will transmit the error from the output units back to the hidden units.

To train an RNN over TT steps in time, we simply unfold the network as in back-propagation through time but copy the error units and error weights TT times as well. Finding the targets for the hidden layers of this unfolded RNN would then amount to:

noting that et2\mathbf{e}^{2}_{t} can be readily computed since the target for the output units is the data point at the next time step t+1t+1, in other words yz,t1=xt+1\mathbf{y}^{1}_{z,t}=\mathbf{x}_{t+1}. Once targets for each zt2\mathbf{z}^{2}_{t} and zt1\mathbf{z}^{1}_{t} have been found, the updates to the parameters of this model are then calculated as follows:

where z01=0\mathbf{z}^{1}_{0}=0, or the null state. The update to UU is written in simplified form since the output nonlinearity’s first derivative is the identity.

References