Biologically Motivated Algorithms for Propagating Local Target Representations

Alexander G. Ororbia, Ankur Mali

Coordinated Local Learning Algorithms

Algorithms within the Discrepancy Reduction (?) family offer computational mechanisms for two key steps when learning from patterns:

1) Search for latent representations that better explain the input/output, also known as target representations. This creates the need for local (higher-level) objectives that will guide current latent representations towards better ones.

2) Reduce, as much as possible, the mismatch between a model’s currently “guessed” representations and target representations. The sum of the internal, local losses is also defined as the total discrepancy in a system, and can also be thought of as a sort of pseudo-energy function.

This general process forms the basis of what we call coordinated local learning rules. Computing targets with these kinds of rules should not require an actual pathway, as back-propagation does, and should instead make use of top-down and bottom-up signals to generate targets. This idea is particularly motivated by the theory of predictive coding (?) (which has started to impact modern machine learning applications (?)), which claims that the brain is in a continuous process of creating and updating hypotheses (using error information) to predict the sensory input. This paper will explore two ways in which this hypothesis updating (in the form of local target creation) might happen: 1) through error-correction in Local Representation Alignment (LRA-E), and 2) through repeated encoding and decoding as in Difference Target Propagation (DTP). It should also be noted that one is not restricted to only using neural building blocks: LRA-E could be used to train stacked Gradient Boosted Decision Trees (GBDTs), which would be faster than in (?), which employed a form of target propagation to calculate local updates.

The idea of learning locally, in general, is slowly becoming prominent in the training of artificial neural networks, with recent proposals including decoupled neural interfaces (?) and kickback (?) (which was derived specifically for regression problems). Furthermore, (?) demonstrated that neural models using simple local Hebbian updates (within a predictive coding framework) could efficiently conduct supervised learning. Far earlier approaches that employed local learning included the layer-wise training procedures that were once used to build models for unsupervised learning (?), supervised learning (?), and semi-supervised learning (?; ?). The key problem with these older algorithms is that they were greedy–a model was built from the bottom-up, freezing lower-level parameters as higher-level feature detectors were learnt.

Another important idea underlying algorithms such as LRA and DTP is that learning is possible with asymmetry, which directly resolves the weight-transport problem (?; ?), another strong neuro-biological criticism of backprop. This is even possible, surprisingly, if the feedback weights are random and fixed, which is at the core of two algorithms we will also compare to: Random Feedback Alignment (RFA) (?) and Direct Feedback Alignment (DFA) (?). RFA replaces the transpose of the feedforward weights in backprop with a similarly-shaped random matrix, while DFA directly connects the output layer’s pre-activation derivative to each layer’s post-activation. It was shown in (?; ?) that these feedback loops would be better suited to generating target representations.
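To make the difference between the three feedback routes concrete, the following is a minimal sketch (not the authors' code) of how the hidden-layer teaching signals could be formed under backprop, RFA, and DFA for a small tanh network. The layer sizes, the helper names (`matvec`, `hidden_signal`, `rand_matrix`), the random scale, and the example output error are all illustrative assumptions; plain Python lists are used so that the sketch has no dependencies.

```python
import math
import random

def matvec(m, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Fixed random matrix; the scale is an arbitrary illustrative choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def hidden_signal(feedback, upstream, h):
    """(feedback @ upstream) * tanh'(h): the teaching signal for one hidden layer."""
    carried = matvec(feedback, upstream)
    return [c * (1.0 - math.tanh(p) ** 2) for c, p in zip(carried, h)]

rng = random.Random(0)
n_h1, n_h2, n_out = 4, 3, 2
W2 = rand_matrix(n_h2, n_h1, rng)    # forward weights, hidden 1 -> hidden 2
W3 = rand_matrix(n_out, n_h2, rng)   # forward weights, hidden 2 -> output
h1 = [rng.uniform(-1, 1) for _ in range(n_h1)]  # hidden pre-activations
h2 = [rng.uniform(-1, 1) for _ in range(n_h2)]
e_out = [0.3, -0.3]  # output pre-activation derivative, e.g. softmax(h3) - y

# Backprop: feedback travels through the transposed forward weights.
bp2 = hidden_signal(transpose(W3), e_out, h2)
bp1 = hidden_signal(transpose(W2), bp2, h1)

# RFA: same layer-by-layer route, but with fixed random matrices shaped like W^T.
B3 = rand_matrix(n_h2, n_out, rng)
B2 = rand_matrix(n_h1, n_h2, rng)
rfa2 = hidden_signal(B3, e_out, h2)
rfa1 = hidden_signal(B2, rfa2, h1)

# DFA: every hidden layer receives the output error directly.
D2 = rand_matrix(n_h2, n_out, rng)
D1 = rand_matrix(n_h1, n_out, rng)
dfa2 = hidden_signal(D2, e_out, h2)
dfa1 = hidden_signal(D1, e_out, h1)
```

Note that only the backprop variant reads the forward weights when sending feedback, which is exactly the weight-transport dependence that the random-feedback methods avoid.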

To concretely describe how LRA is practically implemented, we will specify how LRA is applied to a 3-layer feedforward network, or multilayer perceptron (MLP). Note that LRA generalizes to models with more layers ( $L\geq 3$ ).

where the loss is computed over all dimensions $|\mathbf{z}|$ of the vector $\mathbf{z}$ (where a dimension is indexed/accessed by integer $i$ ). Note that for this loss function, we assume that $\mathbf{z}$ is a vector of probabilities computed by using the softmax function as the output nonlinearity, $\mathbf{z}^{3}=\frac{\exp(\mathbf{h}^{3})}{\sum_{i}\exp(\mathbf{h}^{3}_{i})}$ . For the hidden layers, we can choose between a wider variety of loss functions, and in this paper, we experimented with assuming either a Gaussian or Cauchy distribution over the hidden units. For the Gaussian distribution (or L2 norm), we have the following:

where $\sigma^{2}$ represents fixed scalar variance (we set $\sigma^{2}=1/2$ ). For the Cauchy distribution (or log-penalty), we obtain:
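As a hedged illustration of the three losses named here, the sketch below assumes their standard textbook forms: a categorical negative log likelihood $-\sum_i y_i \log z_i$ at the output, a Gaussian loss $\frac{1}{2\sigma^{2}}\sum_i (y_i - z_i)^2$, and a Cauchy log-penalty $\sum_i \log(1 + (y_i - z_i)^2)$ for the hidden layers. These forms and the function names are assumptions made for illustration, not expressions copied from the paper.

```python
import math

def softmax(h):
    """Numerically stable softmax over a list of pre-activations."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_nll(z, y):
    """Assumed output loss: -sum_i y_i * log(z_i), z being softmax probabilities."""
    return -sum(yi * math.log(zi) for zi, yi in zip(z, y) if yi > 0)

def gaussian_loss(z, y, sigma_sq=0.5):
    """Assumed Gaussian (L2) local loss with a fixed scalar variance."""
    return sum((yi - zi) ** 2 for zi, yi in zip(z, y)) / (2.0 * sigma_sq)

def cauchy_loss(z, y):
    """Assumed Cauchy (log-penalty) local loss."""
    return sum(math.log(1.0 + (yi - zi) ** 2) for zi, yi in zip(z, y))
```

With $\sigma^{2}=1/2$ the Gaussian form reduces to a plain sum of squared differences, while the log-penalty grows much more slowly for large mismatches.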

where $\otimes$ indicates the Hadamard product and $\gamma$ is a decay factor (a value that we found should be set to less than $1.0$ ) meant to ensure that the error weights change more slowly than the forward weights. An attractive property of LRA is that the derivatives of the pointwise activation functions can be dropped, yielding the second variation of the update rule, as long as the activation function is monotonically non-decreasing in its input (for stochastic activation functions, the output distribution for a larger input should stochastically dominate the output distribution for a smaller input). This is also satisfying from a biological perspective since it is unlikely that neurons would utilize point-wise activation derivatives in computing updates (?). The update for error weights is simply proportional to the negative transpose of the update computed for the matching forward weights, which is a computationally fast and cheap rule we propose inspired by (?).
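The relationship between the two updates can be sketched as follows. This is a minimal illustration under stated assumptions: the forward update uses the derivative-free variant (an outer product of a layer's error signal with the post-activations of the layer below), the error-weight update is the negative transpose of that update scaled by $\gamma$, and the default value `gamma=0.9` is only a placeholder for "less than $1.0$". The function and argument names are hypothetical.

```python
def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def transpose(m):
    return [list(col) for col in zip(*m)]

def lra_updates(error, z_below, gamma=0.9):
    """Return (d_w, d_e) for one layer.

    d_w: forward-weight update, derivative-free variant (error outer z_below).
    d_e: error-weight update, the negative transpose of d_w scaled by gamma.
    """
    d_w = outer(error, z_below)
    d_e = [[-gamma * v for v in row] for row in transpose(d_w)]
    return d_w, d_e
```

Because `d_e` is obtained from `d_w` by a transpose and a scaling, no additional matrix multiplication is needed to update the error weights.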

In Figure 1(a), we compare the updates calculated by LRA-E (as well as DFA and our proposed DTP- $\sigma$ , described later) with those given by back-propagation, after each mini-batch, by plotting the angles over the first 20 epochs of learning for a 3-layer MLP (256 units per layer) trained with stochastic gradient descent (SGD) with mini-batches of 50 image samples using a categorical output loss and Gaussian local losses. As long as the angle of the updates computed from LRA is within 90 degrees of the updates obtained by back-propagation, LRA will move parameters towards the same general direction as back-propagation (which greedily points in the direction of steepest descent) and will still find good local optima. In Figure 1(a), this does indeed appear to be the case for the MLP example. The angle, fortunately, while certainly non-zero, never deviates too far from the direction pointed to by back-propagation and remains relatively stable throughout the learning process. (Observe that DFA and DTP- $\sigma$ have, interestingly enough, update angles that are quite similar to LRA-E.) Alongside Figure 1(a), in Figure 1(b), we plot our neural model’s total internal discrepancy, $D(\mathbf{y},\mathbf{x})$ (or V_DD), which is a simple linear combination of all of the internal local losses for a given data point. Observe that while the (validation) output loss (V_L) continually decreases, V_DD does not always appear to do so. We conjecture that this “bump”, which appears at the start of learning, is the result of the evolution of LRA-E’s error weights, which are used to directly control the target generation process. So even though backprop and LRA-E might start down the same path in error space (or on the loss surface), as indicated by the initially low angle between updates, this trajectory is not ideal for LRA’s units/targets. This means that error weights will change more rapidly at training’s start, resulting in targets that vary quite a bit (raising internal loss values). However, once the error weights start to converge to an approximate transpose of the feedforward weights, the process of correction becomes easier and V_DD desirably declines.
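The angle comparison described above can be computed as in the sketch below, which flattens two update matrices and measures the angle between them in degrees. The function name is an illustrative choice, and the sketch assumes neither update is all zeros.

```python
import math

def update_angle_degrees(update_a, update_b):
    """Angle in degrees between two same-shaped update matrices (lists of rows)."""
    a = [v for row in update_a for v in row]
    b = [v for row in update_b for v in row]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to [-1, 1] to guard against round-off before calling acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))
```

An angle below 90 degrees means the two updates have a positive inner product, so a small step along one also moves along the other.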

Improving Difference Target Propagation

As mentioned earlier, Difference Target Propagation (DTP) (and also, less directly, recirculation (?; ?)), like LRA-E, falls under the same family of algorithms concerned with minimizing internal discrepancy, as shown in (?; ?). However, DTP takes a very different approach to computing alignment targets: instead of transmitting messages through error units and error feedback weights as in LRA-E, DTP employs feedback weights to learn the inverse of the mapping created by the feedforward weights. Unfortunately, (?) showed that DTP struggles to assign good local targets as the network becomes deeper, i.e., more highly nonlinear, facing an initially promising albeit brief phase in learning where generalization error decreases (within the first few epochs) before ultimately collapsing (unless very specific initializations are used). One potential cause of this failure could be the lack of a strong enough mechanism to globally coordinate the local learning problems created by the encoder-decoder pairs that underlie the system. In particular, we hypothesize that this problem might be coming from the noise injection scheme, which is local and fixed, offering no adaptation to each specific layer and making some of the layerwise optimization problems more difficult than necessary. Here, we will aim to remove this potential cause through an adaptive layerwise corruption scheme.
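For reference, the difference-corrected target used by standard DTP can be sketched as below: the target for the layer underneath is its current activity plus the difference between the decoder applied to the upper layer's target and the decoder applied to the upper layer's actual activity. The tanh decoder, the absence of biases, and the names `decoder` and `dtp_target` are assumptions made for illustration; the noise-injection step used to train the decoder is not shown.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def decoder(v_weights, z):
    """Approximate inverse mapping g(z) = tanh(V z); tanh is an assumed choice."""
    return [math.tanh(a) for a in matvec(v_weights, z)]

def dtp_target(z_below, z_here, target_here, v_weights):
    """Target for the layer below: z_below + g(target_here) - g(z_here)."""
    g_target = decoder(v_weights, target_here)
    g_actual = decoder(v_weights, z_here)
    return [zb + gt - ga for zb, gt, ga in zip(z_below, g_target, g_actual)]
```

The difference term means that when a layer already matches its target, the layer below is asked to stay where it is, even if the decoder is an imperfect inverse.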

This process we will refer to as DTP. In our proposed, improved variation of DTP, or DTP- $\sigma$ , we will take an “adaptive” approach to the noise injection process $\epsilon$ . To develop our adaptive noise scheme, we have taken some insights from studies of biological neuron systems, which show there are varying levels of signal corruption in different neuronal layers/groups (?; ?; ?; ?). It has been argued that this noise variability enhances neurons’ overall ability to detect and transmit signals across a system (?; ?; ?) and, furthermore, that the presence of this noise yields more robust representations (?; ?; ?). There is also biological evidence suggesting that an increase in the noise level across successive groups of neurons helps in local neural computation (?; ?; ?).

As we will see in our experimental results, DTP- $\sigma$ is a much more stable learning algorithm than the original DTP, especially when training deeper/wider networks. DTP- $\sigma$ benefits from a stronger form of overall coordination among its internal encoding/decoding sub-problems through the pair-wise comparison of local loss values that drive the hidden layer corruption.

A Comment on the Efficiency of LRA-E and DTP

Note that LRA-E, while a bit slower than backprop per update (given its use of the error weights to generate hidden layer targets), is much faster than DTP and DTP- $\sigma$ . Specifically, if we focus on matrix multiplications used to find targets, which make up the bulk of the computation underlying both processes, LRA-E only requires $2(L-1)$ matrix multiplications while DTP and DTP- $\sigma$ require $4(L-3)+L$ multiplications. In particular, DTP has a very expensive target generation phase, requiring 2 applications of the encoder parameters (1 of these is from the network’s initial feedforward pass) and 3 applications of the decoder parameters to create targets to train the forward weights and inverse-mapping weights.

Experimental Results

In this section, we present experimental results of training MLPs using a variety of learning algorithms.

MNIST: This dataset (available at http://yann.lecun.com/exdb/mnist/) contains $28\times 28$ images with gray-scale pixel feature values in the range of $[0,255]$. The only preprocessing applied to this data is to normalize the pixel values to the range of $[0,1]$ by dividing them by 255.

Fashion MNIST: This database (?) contains $28\times 28$ gray-scale images of clothing items, meant to serve as a much more difficult drop-in replacement for MNIST itself. The training split contains 60000 samples and the test split contains 10000; each image is associated with one of 10 classes. We create a validation set of 2000 samples from the training split. Preprocessing was the same as on MNIST.
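The preprocessing described for both datasets is simple enough to sketch directly. The text does not say how the 2000 validation samples were chosen, so the seeded random hold-out below is an assumption, as are the function names.

```python
import random

def normalize_pixels(image):
    """Scale raw gray-scale pixel values by dividing them by 255."""
    return [p / 255.0 for p in image]

def hold_out_validation(samples, n_valid=2000, seed=0):
    """Split samples into (train, valid); random selection is an assumed choice."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    valid = [samples[i] for i in indices[:n_valid]]
    train = [samples[i] for i in indices[n_valid:]]
    return train, valid
```

Applied to a 60000-sample training split, this leaves 58000 samples for training and 2000 for validation.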

For both datasets and all models, over 100 epochs, we calculate updates over mini-batches of 50 samples. Furthermore, we do not regularize parameters any further, e.g., with drop-out (?) or weight penalties. All feedforward architectures for all experiments had either $3$, $5$, or $8$ hidden layers of $256$ processing elements. The post-activation function used was the hyperbolic tangent and the top layer was chosen to be a maximum-entropy classifier (i.e., a softmax function). The output layer objective for all algorithms was to minimize the categorical negative log likelihood.
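The architecture and batching described above can be sketched as follows: tanh hidden layers, a softmax output layer, and mini-batches of 50 samples. Biases are omitted for brevity, which is a simplification rather than something the text states, and the helper names are illustrative.

```python
import math

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(h):
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    total = sum(exps)
    return [e / total for e in exps]

def forward(weights, x):
    """Forward pass: tanh on every hidden layer, softmax on the top layer."""
    z = x
    for w in weights[:-1]:
        z = [math.tanh(a) for a in matvec(w, z)]
    return softmax(matvec(weights[-1], z))

def minibatches(samples, batch_size=50):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

In this sketch, `weights` is a list of list-of-rows matrices, one per layer, so a network with $3$ hidden layers would pass four matrices.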

Parameters were initialized using a scheme that gave best performance on the validation split of each dataset on a per-algorithm basis. Though we wanted to use very simple initialization schemes for all algorithms, in preliminary experiments, we found that the feedback alignment algorithms as well as DTP (and DTP- $\sigma$ ) worked best when using a uniform fan-in-fan-out scheme (?). (?) confirms this result, originally showing how these algorithms often are unstable or fail to perform well using initializations based on simple uniform or Gaussian distributions. For LRA-E, however, we initialized the parameters using a zero-mean Gaussian distribution (variance of $0.05$ ).
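The two initialization schemes mentioned here can be sketched as below. The fan-in-fan-out scheme is assumed to be the usual uniform variant with limit $\sqrt{6/(\text{fan-in}+\text{fan-out})}$; the Gaussian scheme uses zero mean and a variance of $0.05$ as stated. The function names and the list-of-rows layout are illustrative choices.

```python
import math
import random

def gaussian_init(fan_out, fan_in, rng, variance=0.05):
    """Zero-mean Gaussian initialization; std is the square root of the variance."""
    std = math.sqrt(variance)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def fan_in_fan_out_init(fan_out, fan_in, rng):
    """Uniform fan-in-fan-out initialization (assumed standard uniform form)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]
```

For a $256\times 256$ layer the uniform limit works out to roughly $0.108$, while the Gaussian scheme has a standard deviation of about $0.224$, so the Gaussian weights are noticeably larger on average.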

The choice of parameter update rule was also somewhat dependent on the learning algorithm employed. Again, as shown in (?), it is difficult to get good, stable performance from algorithms, such as the original DTP, when using simple SGD. As done in (?), we used the RMSprop (?) adaptive learning rate with a global step size of $\lambda=0.001$ . For Backprop, RFA, DFA, and LRA-E, we were able to use SGD ( $\lambda=0.01$ ).
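The two update rules can be sketched on flat parameter lists as follows. The step sizes match the values given above; the RMSprop decay rate of `0.9` and the small constant `eps` are common defaults assumed here, since the text does not state them.

```python
import math

def sgd_step(params, grads, step_size=0.01):
    """Plain SGD: move each parameter against its gradient."""
    return [p - step_size * g for p, g in zip(params, grads)]

def rmsprop_step(params, grads, cache, step_size=0.001, decay=0.9, eps=1e-8):
    """RMSprop: scale each step by a running average of squared gradients."""
    new_cache = [decay * c + (1.0 - decay) * g * g for c, g in zip(cache, grads)]
    new_params = [p - step_size * g / (math.sqrt(c) + eps)
                  for p, g, c in zip(params, grads, new_cache)]
    return new_params, new_cache
```

The running average in `rmsprop_step` normalizes the size of each parameter's step, which is one plausible reason it helps algorithms whose raw updates vary widely in scale across layers.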

In this experiment, we compare all of the algorithms discussed earlier. These include back-propagation (Backprop), Random Feedback Alignment (RFA) (?), Direct Feedback Alignment (DFA) (?), Equilibrium Propagation (?) (Equil-Prop) and the original Difference Target Propagation (?) (DTP). Our algorithms include our proposed, improved version of DTP (DTP- $\sigma$ ) and the proposed error-driven Local Representation Alignment (LRA-E).

The results of our experiments are presented in Tables 1 and 2. Test and training scores are reported for the set of model parameters that had lowest validation error. Observe that LRA-E is the most stable and consistently well-performing algorithm compared to the other backprop alternatives, closely followed by our DTP- $\sigma$ . More importantly, algorithms like Equil-Prop and DTP appear to break down when training deeper networks, i.e., the 8-layer MLP. Note that while DTP was used to successfully train a 7-layer network of 240 units (using RMSprop) (?), we followed the same settings reported for networks deeper than $7$ and in our experiments uncovered that the algorithm begins to struggle as the layers are made wider, starting with the width of $256$ . However, this problem is rectified using DTP- $\sigma$ , leading to much more stable performance and even to cases where the algorithm completely overfits the training set (as in the case of 3 and 5 layers for MNIST). Nonetheless, LRA-E still performs best with respect to generalization across both datasets, despite using a naïve initialization scheme. Table 3 shows the results of using update rules other than SGD for LRA-E, e.g., Adam (?) or RMSprop (?), for a 3-layer MLP (global step size $0.001$ for both algorithms). We see that LRA-E is compatible with other learning rate schemes and reaches better generalization performance when using them.

Figure 2 displays a t-SNE (?) visualization of the top-most hidden layer of a learned 5-layer MLP using either DFA, Equil-Prop, DTP- $\sigma$ , or LRA-E on Fashion MNIST samples. Qualitatively, we see that all learning algorithms extract representations that separate out the data points reasonably well, at least in the sense that points are clustered based on clothing type. However, it appears that LRA-E representations yield more strongly separated clusters, as evidenced by somewhat wider gaps between them, especially around the pink, blue, and black colored clusters.

Finally, DTP, as also mentioned in (?), appears to be quite sensitive to its initialization scheme. For both MNIST and Fashion MNIST, we trained DTP and our proposed DTP- $\sigma$ with three different settings, including random orthogonal (Ortho), fan-in-fan-out (Glorot), and simple zero-mean Gaussian (G) initializations. Figure 3 shows the validation accuracy curves of DTP and DTP- $\sigma$ as a function of epoch for 5 and 8 layer networks with various weight initializations. As shown in Figure 3, DTP is highly unstable as the network gets deeper while DTP- $\sigma$ is not. Furthermore, DTP- $\sigma$ ’s performance appears to be less dependent on the weight initialization scheme. Thus, our experiments show promising evidence of DTP- $\sigma$ ’s generalization improvement over the original DTP. Moreover, as indicated by Tables 1 and 2, DTP- $\sigma$ can, overall, perform nearly as well as LRA-E.

Conclusions

In this paper, we proposed two learning algorithms: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. On two classification benchmarks, we show strong positive results when training deep multilayer perceptrons. With respect to other types of neural structures, e.g., locally connected ones, we would expect our proposed algorithms to work well, especially LRA-E, since the target computation/error unit mechanism is agnostic to the underlying building blocks of the feedforward model (permitting extension to models such as residual networks). Future work will include adapting these algorithms to larger-scale tasks requiring more complex, exotic architectures.
