Learning values across many orders of magnitude

Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, David Silver

Introduction

Our main motivation is the work by Mnih et al. (2015), in which Q-learning (Watkins, 1989) is combined with a deep convolutional neural network (cf. LeCun et al., 2015). The resulting deep Q network (DQN) algorithm learned to play a varied set of Atari 2600 games from the Arcade Learning Environment (ALE) (Bellemare et al., 2013), which was proposed as an evaluation framework to test general learning algorithms on solving many different interesting tasks. DQN was proposed as a singular solution, using a single set of hyperparameters, but the magnitudes and frequencies of rewards vary wildly between different games. To overcome this hurdle, the rewards and temporal-difference errors were clipped to $[-1,1]$. For instance, in Pong the rewards are bounded by $-1$ and $+1$, while in Ms. Pac-Man eating a single ghost can yield a reward of up to $+1600$, but DQN clips the latter to $+1$ as well. This is not a satisfying solution for two reasons. First, such clipping introduces domain knowledge. Most games have sparse non-zero rewards outside of $[-1,1]$. Clipping then results in optimizing the frequency of rewards, rather than their sum. This is a good heuristic in Atari, but it does not generalize to other domains. Second, and more importantly, the clipping changes the objective, sometimes resulting in qualitatively different policies of behavior.

We propose a method to adaptively normalize the targets used in the learning updates. If these targets are guaranteed to be normalized it is much easier to find suitable hyperparameters. The proposed technique is not specific to DQN and is more generally applicable in supervised learning and reinforcement learning. There are several reasons such normalization can be desirable. First, sometimes we desire a single system that is able to solve multiple different problems with varying natural magnitudes, as in the Atari domain. Second, for multi-variate functions the normalization can be used to disentangle the natural magnitude of each component from its relative importance in the loss function. This is particularly useful when the components have different units, such as when we predict signals from sensors with different modalities. Finally, adaptive normalization can help deal with non-stationarity. For instance, in reinforcement learning the policy of behavior can change repeatedly during learning, thereby changing the distribution and magnitude of the values.

Many machine-learning algorithms rely on a-priori access to data to properly tune relevant hyper-parameters (Bergstra et al., 2011; Bergstra and Bengio, 2012; Snoek et al., 2012). However, it is much harder to learn efficiently from a stream of data when we do not know the magnitude of the function we seek to approximate beforehand, or if these magnitudes can change over time, as is for instance typically the case in reinforcement learning when the policy of behavior improves over time.

Input normalization has long been recognized as important to efficiently learn non-linear approximations such as neural networks (LeCun et al., 1998), leading to research on how to achieve scale-invariance on the inputs (e.g., Ross et al., 2013; Ioffe and Szegedy, 2015; Desjardins et al., 2015). Output or target normalization has not received as much attention, probably because in supervised learning data is commonly available before learning commences, making it straightforward to determine appropriate normalizations or to tune hyper-parameters. However, this assumes the data is available a priori, which is not true in online (potentially non-stationary) settings.

Natural gradients (Amari, 1998) are invariant to reparameterizations of the function approximation, thereby avoiding many scaling issues, but these are computationally expensive for functions with many parameters such as deep neural networks. This is why approximations are regularly proposed, typically trading off accuracy for computation (Martens and Grosse, 2015), and sometimes focusing on a certain aspect such as input normalization (Desjardins et al., 2015; Ioffe and Szegedy, 2015). Most such algorithms are not fully invariant to rescaling the targets.

In the Atari domain several algorithmic variants and improvements for DQN have been proposed (van Hasselt et al., 2016; Bellemare et al., 2016; Schaul et al., 2016; Wang et al., 2016), as well as alternative solutions (Liang et al., 2016; Mnih et al., 2016). However, none of these address the clipping of the rewards or explicitly discuss the impacts of clipping on performance or behavior.

Preliminaries

An important special case is when $f_{\bm{\theta}}$ is a neural network (McCulloch and Pitts, 1943; Rosenblatt, 1962). Such networks are often trained with a form of SGD (Rumelhart et al., 1986), with hyperparameters that interact with the scale of the loss. Especially for deep neural networks (LeCun et al., 2015; Schmidhuber, 2015) large updates may harm learning, because these networks are highly non-linear and such updates may ‘bump’ the parameters to regions with high error.

Adaptive normalization with Pop-Art

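We propose to normalize the targets $Y_{t}$, where the normalization is learned separately from the approximating function. We consider an affine transformation of the targets

$$\tilde{Y}_{t}=\mathbf{\Sigma}^{-1}(Y_{t}-{\bm{\mu}}),$$

where $\mathbf{\Sigma}$ is a scale and ${\bm{\mu}}$ is a shift, both learned from data. The unnormalized approximation for any input $x$ is then $f(x)=\mathbf{\Sigma}g(x)+{\bm{\mu}}$, where $g$ is the normalized function and $f$ is the unnormalized function.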

At first glance it may seem we have made little progress. If we learn $\mathbf{\Sigma}$ and ${\bm{\mu}}$ using the same algorithm as used for the parameters of the function $g$ , then the problem has not become fundamentally different or easier; we would have merely changed the structure of the parameterized function slightly. Conversely, if we consider tuning the scale and shift as hyperparameters then tuning them is not fundamentally easier than tuning other hyperparameters, such as the step size, directly.

Fortunately, there is an alternative. We propose to update $\mathbf{\Sigma}$ and ${\bm{\mu}}$ according to a separate objective with the aim of normalizing the updates for $g$ . Thereby, we decompose the problem of learning an appropriate normalization from learning the specific shape of the function. The two properties that we want to simultaneously achieve are

(1) to update scale $\mathbf{\Sigma}$ and shift ${\bm{\mu}}$ such that $\mathbf{\Sigma}^{-1}(Y-{\bm{\mu}})$ is appropriately normalized, and

(2) to preserve the outputs of the unnormalized function when we change the scale and shift.

We discuss these properties separately below. We refer to algorithms that combine output-preserving updates and adaptive rescaling as Pop-Art algorithms, an acronym for “Preserving Outputs Precisely, while Adaptively Rescaling Targets”.

Unless care is taken, repeated updates to the normalization might make learning harder rather than easier because the normalized targets become non-stationary. More importantly, whenever we adapt the normalization based on a certain target, this would simultaneously change the output of the unnormalized function of all inputs. If there is little reason to believe that other unnormalized outputs were incorrect, this is undesirable and may hurt performance in practice, as illustrated in Section 3. We now first discuss how to prevent these issues, before we discuss how to update the scale and shift.

The only way to avoid changing all outputs of the unnormalized function whenever we update the scale and shift is by changing the normalized function $g$ itself simultaneously. The goal is to preserve the outputs from before the change of normalization, for all inputs. This prevents the normalization from affecting the approximation, which is appropriate because its objective is solely to make learning easier, and to leave solving the approximation itself to the optimization algorithm.

Without loss of generality the unnormalized function can be written as

$$f_{\bm{\theta},\mathbf{\Sigma},{\bm{\mu}},\mathbf{W},\bm{b}}(x)\equiv\mathbf{\Sigma}\left(\mathbf{W}h_{\bm{\theta}}(x)+\bm{b}\right)+{\bm{\mu}},$$

where $h_{\bm{\theta}}$ is a parametrized (non-linear) function, and $g_{\bm{\theta},\mathbf{W},\bm{b}}(x)=\mathbf{W}h_{\bm{\theta}}(x)+\bm{b}$ is the normalized function. It is not uncommon for deep neural networks to end in a linear layer, and then $h_{\bm{\theta}}$ can be the output of the last (hidden) layer of non-linearities. Alternatively, we can always add a square linear layer to any non-linear function $h_{\bm{\theta}}$ to ensure this constraint, for instance initialized as $\mathbf{W}_{0}=\mathbf{I}$ and $\bm{b}_{0}={\bm{0}}$.

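The following proposition shows that we can update the parameters $\mathbf{W}$ and $\bm{b}$ to fulfill the second desideratum of preserving outputs precisely for any change in normalization.

Proposition 1. Consider a function $f_{\bm{\theta},\mathbf{\Sigma},{\bm{\mu}},\mathbf{W},\bm{b}}(x)=\mathbf{\Sigma}\left(\mathbf{W}h_{\bm{\theta}}(x)+\bm{b}\right)+{\bm{\mu}}$. Suppose we change the normalization from $\mathbf{\Sigma}$ and ${\bm{\mu}}$ to $\mathbf{\Sigma}_{\text{new}}$ and ${\bm{\mu}}_{\text{new}}$, where both $\mathbf{\Sigma}$ and $\mathbf{\Sigma}_{\text{new}}$ are non-singular. If we additionally change the parameters of the last layer to

$$\mathbf{W}_{\text{new}}=\mathbf{\Sigma}_{\text{new}}^{-1}\mathbf{\Sigma}\mathbf{W}\quad\text{and}\quad\bm{b}_{\text{new}}=\mathbf{\Sigma}_{\text{new}}^{-1}\left(\mathbf{\Sigma}\bm{b}+{\bm{\mu}}-{\bm{\mu}}_{\text{new}}\right),$$

then the outputs of the unnormalized function $f$ are preserved precisely in the sense that

$$f_{\bm{\theta},\mathbf{\Sigma},{\bm{\mu}},\mathbf{W},\bm{b}}(x)=f_{\bm{\theta},\mathbf{\Sigma}_{\text{new}},{\bm{\mu}}_{\text{new}},\mathbf{W}_{\text{new}},\bm{b}_{\text{new}}}(x)\quad\text{for all }x.$$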

Algorithm 1 is an example implementation of SGD with Pop-Art for a squared loss. It can be generalized easily to any other loss by changing the definition of $\bm{\delta}$ . Notice that $\mathbf{W}$ and $\bm{b}$ are updated twice: first to adapt to the new scale and shift to preserve the outputs of the function, and then by SGD. The order of these updates is important because it allows us to use the new normalization immediately in the subsequent SGD update.
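The two-stage update can be illustrated with a minimal sketch, assuming a scalar scale and shift and an unnormalized function of the form $\sigma(\mathbf{w}^{\top}h+b)+\mu$. Only the last linear layer is updated here, and the feature vector `h`, the function names, and the new statistics passed in as arguments are hypothetical choices for this example rather than the published Algorithm 1.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def unnormalized_output(h, w, b, sigma, mu):
    """f = sigma * (w . h + b) + mu for a fixed feature vector h."""
    return sigma * (dot(w, h) + b) + mu


def preserve_outputs(w, b, sigma, mu, sigma_new, mu_new):
    """Rescale the last layer so f is unchanged under the new normalization."""
    w_new = [wi * sigma / sigma_new for wi in w]
    b_new = (sigma * b + mu - mu_new) / sigma_new
    return w_new, b_new


def popart_last_layer_step(h, y, w, b, sigma, mu, sigma_new, mu_new, alpha):
    """First preserve the outputs, then take an SGD step on the normalized squared loss."""
    w, b = preserve_outputs(w, b, sigma, mu, sigma_new, mu_new)
    delta = dot(w, h) + b - (y - mu_new) / sigma_new  # normalized error
    w = [wi - alpha * delta * hi for wi, hi in zip(w, h)]
    b = b - alpha * delta
    return w, b
```

Calling `preserve_outputs` alone should leave `unnormalized_output` unchanged up to floating-point error, which is the property the proposition states. The subsequent SGD step then works with the new scale and shift.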

Adaptively rescaling targets

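A natural choice is to normalize the targets to approximately have zero mean and unit variance. For clarity and conciseness, we consider scalar normalizations. It is straightforward to extend to diagonal or dense matrices. If we have data $\{(X_{i},Y_{i})\}_{i=1}^{t}$ up to some time $t$, we may then desire

$$\sum_{i=1}^{t}\frac{Y_{i}-\mu_{t}}{\sigma_{t}}=0\quad\text{and}\quad\frac{1}{t}\sum_{i=1}^{t}\frac{(Y_{i}-\mu_{t})^{2}}{\sigma_{t}^{2}}=1,\quad\text{such that}\quad\mu_{t}=\frac{1}{t}\sum_{i=1}^{t}Y_{i}\quad\text{and}\quad\sigma_{t}^{2}=\frac{1}{t}\sum_{i=1}^{t}Y_{i}^{2}-\mu_{t}^{2}.\tag{3}$$

This can be generalized to incremental updates

$$\mu_{t}=(1-\beta_{t})\mu_{t-1}+\beta_{t}Y_{t}\quad\text{and}\quad\sigma_{t}^{2}=\nu_{t}-\mu_{t}^{2},\quad\text{where}\quad\nu_{t}=(1-\beta_{t})\nu_{t-1}+\beta_{t}Y_{t}^{2}.\tag{4}$$

Here $\nu_{t}$ estimates the second moment of the targets and $\beta_{t}\in[0,1]$ is a step size. If $\nu_{t}-\mu_{t}^{2}$ is positive initially then it will always remain so, although to avoid issues with numerical precision it can be useful to enforce a lower bound explicitly by requiring $\nu_{t}-\mu_{t}^{2}\geq\epsilon$ with $\epsilon>0$. For full equivalence to (3) we can use $\beta_{t}=1/t$. If $\beta_{t}=\beta$ is constant we get exponential moving averages, placing more weight on recent data points, which is appropriate in non-stationary settings.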

A constant $\beta$ has the additional benefit of never becoming negligibly small. Consider the first time a target is observed that is much larger than all previously observed targets. If $\beta_{t}$ is small, our statistics would adapt only slightly, and the resulting update may be large enough to harm the learning. If $\beta_{t}$ is not too small, the normalization can adapt to the large target before updating, potentially making learning more robust. In particular, the following proposition holds.

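Proposition 2. When using updates (4) to adapt the normalization parameters $\sigma$ and $\mu$, the normalized targets are bounded for all $t$ by

$$-\sqrt{\frac{1-\beta_{t}}{\beta_{t}}}\leq\frac{Y_{t}-\mu_{t}}{\sigma_{t}}\leq\sqrt{\frac{1-\beta_{t}}{\beta_{t}}}.$$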

For instance, if $\beta_{t}=\beta=10^{-4}$ for all $t$ , then the normalized target is guaranteed to be in $(-100,100)$ . Note that Proposition 2 does not rely on any assumptions about the distribution of the targets. This is an important result, because it implies we can bound the potential normalized errors before learning, without any prior knowledge about the actual targets we may observe.
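The bound can be checked numerically with a small sketch. It assumes the incremental mean and second-moment updates with a constant step size, the initial values $\mu_{0}=0$ and $\nu_{0}=1$, and a small floor `eps` on the variance; the function names and the example stream are hypothetical.

```python
import math


def update_normalization(mu, nu, y, beta, eps=1e-8):
    """One incremental update of the mean, second moment, and scale."""
    mu = (1.0 - beta) * mu + beta * y
    nu = (1.0 - beta) * nu + beta * y * y
    sigma = math.sqrt(max(nu - mu * mu, eps))  # explicit lower bound on the variance
    return mu, nu, sigma


def normalized_targets(targets, beta, mu0=0.0, nu0=1.0):
    """Normalize each target with statistics that already include that target."""
    mu, nu = mu0, nu0
    out = []
    for y in targets:
        mu, nu, sigma = update_normalization(mu, nu, y, beta)
        out.append((y - mu) / sigma)
    return out


beta = 1e-4
stream = [1.0] * 100 + [1e6]  # small targets followed by one huge outlier
z = normalized_targets(stream, beta)
bound = math.sqrt((1.0 - beta) / beta)  # just below 100 for beta = 1e-4
```

Even though the final target is six orders of magnitude larger than the earlier ones, its normalized value stays within the bound because the statistics are updated before the target is normalized.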

It is an open question whether it is uniformly best to normalize by mean and variance. In the appendix we discuss other normalization updates, based on percentiles and mini-batches, and derive correspondences between all of these.

An equivalence for stochastic gradient descent

We now step back and analyze the effect of the magnitude of the errors on the gradients when using regular SGD. This analysis suggests a different normalization algorithm, which has an interesting correspondence to Pop-Art SGD.

We consider SGD updates for an unnormalized multi-layer function of form $f_{\bm{\theta},\mathbf{W},\bm{b}}(X)=\mathbf{W}h_{\bm{\theta}}(X)+\bm{b}$. The update for the weight matrix $\mathbf{W}$ is

$$\mathbf{W}_{t}=\mathbf{W}_{t-1}-\alpha{\bm{\delta}}_{t}h_{\bm{\theta}}(X_{t})^{\top},$$

where ${\bm{\delta}}_{t}=f_{\bm{\theta},\mathbf{W},\bm{b}}(X_{t})-Y_{t}$ is the gradient of the squared loss, which we here call the unnormalized error. The magnitude of this update depends linearly on the magnitude of the error, which is appropriate when the inputs are normalized, because then the ideal scale of the weights depends linearly on the magnitude of the targets. In general, care should be taken that the inputs are well-behaved; this is exactly the point of recent work on input normalization (Ioffe and Szegedy, 2015; Desjardins et al., 2015).

Now consider the SGD update to the parameters of $h_{\bm{\theta}}$, $\bm{\theta}_{t}=\bm{\theta}_{t-1}-\alpha{\bm{J}}_{t}\mathbf{W}_{t-1}^{\top}{\bm{\delta}}_{t}$, where ${\bm{J}}_{t}=(\nabla h_{\bm{\theta},1}(X),\ldots,\nabla h_{\bm{\theta},m}(X))^{\top}$ is the Jacobian for $h_{\bm{\theta}}$. The magnitudes of both the weights $\mathbf{W}$ and the errors ${\bm{\delta}}$ depend linearly on the magnitude of the targets. This means that the magnitude of the update for $\bm{\theta}$ depends quadratically on the magnitude of the targets. There is no compelling reason for these updates to depend at all on these magnitudes because the weights in the top layer already ensure appropriate scaling. In other words, for each doubling of the magnitudes of the targets, the updates to the lower layers quadruple for no clear reason.

This analysis suggests an algorithmic solution, which seems to be novel in and of itself, in which we track the magnitudes of the targets in a separate parameter $\sigma_{t}$ , and then multiply the updates for all lower layers with a factor $\sigma_{t}^{-2}$ . A more general version of this for matrix scalings is given in Algorithm 2. We prove an interesting, and perhaps surprising, connection to the Pop-Art algorithm.

Proposition 3. Consider two functions defined by

$$f_{\bm{\theta},\mathbf{\Sigma},{\bm{\mu}},\mathbf{W},\bm{b}}(x)=\mathbf{\Sigma}\left(\mathbf{W}h_{\bm{\theta}}(x)+\bm{b}\right)+{\bm{\mu}}\quad\text{and}\quad f_{\bm{\theta},\mathbf{W},\bm{b}}(x)=\mathbf{W}h_{\bm{\theta}}(x)+\bm{b},$$

where $h_{\bm{\theta}}$ is the same differentiable function in both cases, and the functions are initialized identically, using $\mathbf{\Sigma}_{0}=\mathbf{I}$ and ${\bm{\mu}}_{0}={\bm{0}}$, and the same initial $\bm{\theta}_{0}$, $\mathbf{W}_{0}$ and $\bm{b}_{0}$. Consider updating the first function using Algorithm 1 (Pop-Art SGD) and the second using Algorithm 2 (Normalized SGD). Then, for any sequence of non-singular scales $\{\mathbf{\Sigma}_{t}\}_{t=1}^{\infty}$ and shifts $\{{\bm{\mu}}_{t}\}_{t=1}^{\infty}$, the algorithms are equivalent in the sense that 1) the sequences $\{\bm{\theta}_{t}\}_{t=0}^{\infty}$ are identical, and 2) the outputs of the functions are identical, for any input.

The proposition shows a duality between normalizing the targets, as in Algorithm 1, and changing the updates, as in Algorithm 2. This allows us to gain more intuition about the algorithm. In particular, in Algorithm 2 the updates in the top layer are not normalized, thereby allowing the last linear layer to adapt to the scale of the targets. This is in contrast to other algorithms that have some flavor of adaptive normalization, such as RMSprop (Tieleman and Hinton, 2012), AdaGrad (Duchi et al., 2011), and Adam (Kingma and Ba, 2015), which divide each component of the gradient by the square root of an empirical second moment of that component. That said, these methods are complementary, and it is straightforward to combine Pop-Art with optimization algorithms other than SGD.
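The duality can be illustrated numerically with a scalar sketch. The assumptions are: a single hidden unit $h_{\theta}(x)=\tanh(\theta x)$, scalar statistics tracked by running moments, a Pop-Art variant that preserves outputs and then takes an SGD step on the normalized loss, and a normalized-SGD variant that uses the unnormalized error but multiplies the lower-layer update by $\sigma^{-2}$. The function names and the data are hypothetical, and the two routines follow the prose description rather than reproducing the published Algorithms 1 and 2.

```python
import math


def stats(mu, nu, y, beta):
    """Running mean, second moment, and scale; the same rule is used by both variants."""
    mu = (1.0 - beta) * mu + beta * y
    nu = (1.0 - beta) * nu + beta * y * y
    return mu, nu, math.sqrt(max(nu - mu * mu, 1e-8))


def run_popart(data, theta, w, b, alpha, beta):
    mu, nu, sigma = 0.0, 1.0, 1.0
    for x, y in data:
        mu_new, nu, sigma_new = stats(mu, nu, y, beta)
        # Preserve outputs under the new normalization.
        w = w * sigma / sigma_new
        b = (sigma * b + mu - mu_new) / sigma_new
        mu, sigma = mu_new, sigma_new
        # SGD step on the normalized squared loss.
        h = math.tanh(theta * x)
        delta = (w * h + b) - (y - mu) / sigma
        grad_theta = (1.0 - h * h) * x * w * delta
        theta, w, b = theta - alpha * grad_theta, w - alpha * delta * h, b - alpha * delta
    # Return theta and the equivalent unnormalized last-layer parameters.
    return theta, sigma * w, sigma * b + mu


def run_normalized_sgd(data, theta, w, b, alpha, beta):
    mu, nu = 0.0, 1.0
    for x, y in data:
        mu, nu, sigma = stats(mu, nu, y, beta)
        h = math.tanh(theta * x)
        delta = (w * h + b) - y  # unnormalized error
        # Lower-layer update scaled by sigma^-2; top layer left unscaled.
        grad_theta = (1.0 - h * h) * x * (w / sigma) * (delta / sigma)
        theta, w, b = theta - alpha * grad_theta, w - alpha * delta * h, b - alpha * delta
    return theta, w, b


data = [(math.sin(i), 10.0 * math.sin(i) + 5.0) for i in range(1, 41)]
popart_result = run_popart(data, 0.5, 1.0, 0.0, alpha=0.01, beta=0.1)
normalized_result = run_normalized_sgd(data, 0.5, 1.0, 0.0, alpha=0.01, beta=0.1)
```

Under these assumptions the two runs should agree on $\theta$ and on the unnormalized last layer up to floating-point error, which mirrors the two claims of the proposition in the scalar case.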

Binary regression experiments

We first analyze the effect of rare events in online learning, when infrequently a much larger target is observed. Such events can for instance occur when learning from noisy sensors that sometimes capture an actual signal, or when learning from sparse non-zero reinforcements. We empirically compare three variants of SGD: without normalization, with normalization but without preserving outputs precisely (i.e., with ‘Art’, but without ‘Pop’), and with Pop-Art.

The inputs are binary representations of integers drawn uniformly randomly between $0$ and $n=2^{10}-1$. The desired outputs are the corresponding integer values. Every 1000 samples, we present the binary representation of $2^{16}-1$ as input (i.e., all 16 inputs are 1) with target $2^{16}-1=65,535$. The approximating function is a fully connected neural network with 16 inputs, 3 hidden layers with 10 nodes per layer, and tanh internal activation functions. This simple setup allows extensive sweeps over hyper-parameters, to avoid bias towards any algorithm by the way we tune these. The step sizes $\alpha$ for SGD and $\beta$ for the normalization are tuned by a grid search over $\{10^{-5},10^{-4.5},\ldots,10^{-1},10^{-0.5},1\}$.
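A data stream of this kind can be generated with a short sketch. The bit order, the random seed, the lower end of the integer range, and the choice to place the rare sample at every step divisible by 1000 are assumptions made for illustration; the function names are hypothetical.

```python
import random


def binary_features(value, num_bits=16):
    """Bits of `value`, least significant first (the bit order is an assumption)."""
    return [(value >> i) & 1 for i in range(num_bits)]


def regression_stream(num_samples, rare_every=1000, seed=0):
    """Yield (inputs, target) pairs with a rare, much larger target."""
    rng = random.Random(seed)
    for t in range(1, num_samples + 1):
        if t % rare_every == 0:
            value = 2 ** 16 - 1  # all 16 inputs are 1, target 65,535
        else:
            value = rng.randint(0, 2 ** 10 - 1)  # common case: small targets
        yield binary_features(value), float(value)


samples = list(regression_stream(2000))
```

Common targets stay below 1024, while the rare target is roughly 64 times larger than the largest of them, which is what stresses a fixed step size.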

Figure 1a shows the root mean squared error (RMSE, log scale) for each of 5000 samples, before updating the function (so this is a test error, not a train error). The solid line is the median of 50 repetitions, and the shaded region covers the 10th to 90th percentiles. The plotted results correspond to the best hyper-parameters according to the overall RMSE (i.e., area under the curve). The lines are slightly smoothed by averaging over every 10 consecutive samples.

SGD favors a relatively small step size ( $\alpha=10^{-3.5}$ ) to avoid harmful large updates, but this slows learning on the smaller updates; the error curve is almost flat in between spikes. SGD with adaptive normalization (labeled ‘Art’) can use a larger step size ( $\alpha=10^{-2.5}$ ) and therefore learns faster, but has high error after the spikes because the changing normalization also changes the outputs of the smaller inputs, increasing the errors on these. In comparison, Pop-Art performs much better. It prefers the same step size as Art ( $\alpha=10^{-2.5}$ ), but Pop-Art can exploit a much faster rate for the statistics (best performance with $\beta=10^{-0.5}$ for Pop-Art and $\beta=10^{-4}$ for Art). The faster tracking of statistics protects Pop-Art from the large spikes, while the output preservation avoids invalidating the outputs for smaller targets. We ran experiments with RMSprop but left these out of the figure as the results were very similar to SGD.

Atari 2600 experiments

An important motivation for this work is reinforcement learning with non-linear function approximators such as neural networks (sometimes called deep reinforcement learning). The goal is to predict and optimize action values defined as the expected sum of future rewards. These rewards can differ arbitrarily from one domain to the next, and non-zero rewards can be sparse. As a result, the action values can span a varied and wide range which is often unknown before learning commences.

Mnih et al. (2015) combined Q-learning with a deep neural network in an algorithm called DQN, which impressively learned to play many games using a single set of hyper-parameters. However, as discussed above, to handle the different reward magnitudes with a single system all rewards were clipped to the interval $[-1,1]$. This is harmless in some games, such as Pong where no reward is ever higher than $1$ or lower than $-1$, but it is not satisfactory as this heuristic introduces specific domain knowledge that optimizing reward frequencies is approximately as useful as optimizing the total score. Moreover, the clipping makes the DQN algorithm blind to differences between certain actions, such as the difference in reward between eating a ghost (reward $\geq 100$) and eating a pellet (reward $=25$) in Ms. Pac-Man. We hypothesize that 1) overall performance decreases when we turn off clipping, because it is not possible to tune a step size that works on many games, and 2) we can regain much of the lost performance with Pop-Art. The goal is not to improve state-of-the-art performance, but to remove the domain-dependent heuristic that is induced by the clipping of the rewards, thereby uncovering the true rewards.

We ran the Double DQN algorithm (van Hasselt et al., 2016) in three versions: without changes, without clipping both rewards and temporal difference errors, and without clipping but additionally using Pop-Art. The targets are the sum of a reward and the discounted value at the next state:

$$Y_{t}=R_{t+1}+\gamma\,Q\big(S_{t+1},\operatorname*{argmax}_{a}Q(S_{t+1},a;\bm{\theta});\bm{\theta}^{-}\big),$$

where $Q(s,a;\bm{\theta})$ is the estimated action value of action $a$ in state $s$ according to current parameters $\bm{\theta}$ , and where $\bm{\theta}^{-}$ is a more stable periodic copy of these parameters (cf. Mnih et al., 2015; van Hasselt et al., 2016, for more details). This is a form of Double Q-learning (van Hasselt, 2010, 2011). We roughly tuned the main step size and the step size for the normalization to $10^{-4}$ . It is not straightforward to tune the unclipped version, for reasons that will become clear soon.
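The target computation can be sketched in a few lines, assuming the usual Double Q-learning form in which the current parameters select the action and the periodic copy evaluates it. The function name and the list-based interface are hypothetical, and terminal states are not handled in this sketch.

```python
def double_q_target(reward, next_q_online, next_q_target, gamma=0.99):
    """Reward plus discounted next-state value.

    next_q_online: action values at the next state under the current parameters.
    next_q_target: action values at the next state under the periodic copy.
    """
    # Select the action with the current parameters ...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ... and evaluate it with the more stable copy.
    return reward + gamma * next_q_target[best_action]


example = double_q_target(1.0, [0.2, 0.9, 0.1], [5.0, 3.0, 8.0], gamma=0.5)
```

In the example the current parameters prefer the second action, so the target uses the copy's value for that action (3.0) rather than the copy's own maximum (8.0).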

Without clipping the rewards, Pop-Art produces a much narrower band within which the gradients fall. Across games, $95\%$ of median norms range over less than two orders of magnitude (roughly between 1 and 20), compared to almost four orders of magnitude for clipped Double DQN, and more than six orders of magnitude for unclipped Double DQN without Pop-Art. The wide range for the latter shows why it is impossible to find a suitable step size with neither clipping nor Pop-Art: the updates are either far too small on some games or far too large on others.

After 200M frames, we evaluated the actual scores of the best performing agent in each game on 100 episodes of up to 30 minutes of play, and then normalized by human and random scores as described by Mnih et al. (2015). Figure 2 shows the differences in normalized scores between (clipped) Double DQN and Double DQN with Pop-Art.

The main eye-catching result is that the distribution in performance drastically changed. On some games (e.g., Gopher, Centipede) we observe dramatic improvements, while on other games (e.g., Video Pinball, Star Gunner) we see a substantial decrease. For instance, in Ms. Pac-Man the clipped Double DQN agent does not care more about ghosts than pellets, but Double DQN with Pop-Art learns to actively hunt ghosts, resulting in higher scores. Especially remarkable is the improved performance on games like Centipede and Gopher, but also notable is a game like Frostbite which went from below 50% to a near-human performance level. Raw scores can be found in the appendix.

Some games fare worse with unclipped rewards because it changes the nature of the problem. For instance, in Time Pilot the Pop-Art agent learns to quickly shoot a mothership to advance to a next level of the game, obtaining many points in the process. The clipped agent instead shoots at anything that moves, ignoring the mothership. (A video is included in the supplementary material.) However, in the long run in this game more points are scored with the safer and more homogeneous strategy of the clipped agent. One reason for the disconnect between the seemingly qualitatively good behavior combined with lower scores is that the agents are fairly myopic: both use a discount factor of $\gamma=0.99$, and therefore only optimize rewards that happen within a dozen or so seconds into the future.

On the whole, the results show that with Pop-Art we can successfully remove the clipping heuristic that has been present in all prior DQN variants, while retaining overall performance levels. Double DQN with Pop-Art performs slightly better than Double DQN with clipped rewards: on 32 out of 57 games performance is at least as good as clipped Double DQN and the median (+0.4%) and mean (+34%) differences are positive.

Discussion

We have demonstrated that Pop-Art can be used to adapt to different and non-stationary target magnitudes. This problem was perhaps not previously commonly appreciated, potentially because in deep learning it is common to tune or normalize a priori, using an existing data set. This is not as straightforward in reinforcement learning when the policy and the corresponding values may repeatedly change over time. This makes Pop-Art a promising tool for deep reinforcement learning, although it is not specific to this setting.

We saw that Pop-Art can successfully replace the clipping of rewards as done in DQN to handle the various magnitudes of the targets used in the Q-learning update. Now that the true problem is exposed to the learning algorithm we can hope to make further progress, for instance by improving the exploration (Osband et al., 2016), which can now be informed about the true unclipped rewards.

Appendix

In this appendix, we introduce and analyze several extensions and variations, including normalizing based on percentiles or minibatches. Additionally, we prove all propositions in the main text and the appendix.

For the experiments described in Section 4 in the main paper, we closely followed the setup described in Mnih et al. (2015) and van Hasselt et al. (2016). In particular, the Double DQN algorithm is identical to that described by van Hasselt et al. (2016). The shown results were obtained by running the trained agent for 30 minutes of simulated play (or 108,000 frames). This was repeated 100 times, where diversity over different runs was ensured by a small probability of exploration on each step ($\epsilon$-greedy exploration with $\epsilon=0.01$), as well as by performing up to 30 ‘no-op’ actions, as also used and described by Mnih et al. (2015). In summary, the evaluation setup was the same as used by Mnih et al. (2015), except that we allowed more evaluation time per game (30 minutes instead of 5 minutes), as also used by Wang et al. (2016).

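The results in Figure 2 were obtained by normalizing the raw scores by first subtracting the score by a random agent, and then dividing by the absolute difference between human and random agents, such that

$$\text{score}_{\text{normalized}}=\frac{\text{score}_{\text{agent}}-\text{score}_{\text{random}}}{\left|\text{score}_{\text{human}}-\text{score}_{\text{random}}\right|}.$$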

The raw scores are given below, in Table 1.

Generalizing normalization by variance

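We can change the variance of the normalized targets to influence the magnitudes of the updates. For a desired standard deviation of $s>0$, we can use

$$\sigma_{t}=\frac{\sqrt{\nu_{t}-\mu_{t}^{2}}}{s},$$

with the updates for $\nu_{t}$ and $\mu_{t}$ as normal. It is straightforward to show that then a generalization of Proposition 2 holds with a bound of

$$-s\sqrt{\frac{1-\beta_{t}}{\beta_{t}}}\leq\frac{Y_{t}-\mu_{t}}{\sigma_{t}}\leq s\sqrt{\frac{1-\beta_{t}}{\beta_{t}}}.$$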

This additional parameter is for instance useful when we desire fast tracking in non-stationary problems. We then want a large step size $\alpha$ , but without risking overly large updates.

The new parameter $s$ may seem superfluous because increasing the normalization step size $\beta$ also reduces the hard bounds on the normalized targets. However, $\beta$ additionally influences the distribution of the normalized targets. The histograms in the left-most plot in Figure 3 show what happens when we try to limit the magnitudes using only $\beta$. The red histogram shows normalized targets where the unnormalized targets come from a normal distribution, shown in blue. The normalized targets are contained in $[-1,1]$, but the distribution is very non-normal even though the actual targets are normal. Conversely, the red histogram in the middle plot shows that the distribution remains approximately normal if we instead use $s$ to reduce the magnitudes. The right plot shows the effect on the variance of normalized targets for either approach. When we change $\beta$ while keeping $s=1$ fixed, the variance of the normalized targets can drop far below the desired variance of one (magenta curve). When we change $s$ while keeping $\beta=0.01$ fixed, the variance remains predictably at approximately $s$ (black line). The difference in behavior of the resulting normalization demonstrates that $s$ gives us a potentially useful additional degree of freedom.

Sometimes, we can simply roll the additional scaling $s$ into the step size, such that without loss of generality we can use $s=1$ and decrease the step size to avoid overly large updates. However, sometimes it is easier to separate the magnitude of the targets, as influenced by $s$ , from the magnitude of the updates, for instance when using an adaptive step-size algorithm. In addition, the introduction of an explicit scaling $s$ allows us to make some interesting connections to normalization by percentiles, in the next section.

Adaptive normalization by percentiles

Instead of normalizing by mean and variance, we can normalize such that a given ratio $p$ of normalized targets is inside the predetermined interval $[-1,1]$. The per-output objective is then

$$P\left(\frac{Y-\mu}{\sigma}\in[-1,1]\right)=p.$$

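For normally distributed targets, there is a direct correspondence to normalizing by means and variance.

Proposition 4. If scalar targets $\{Y_{t}\}_{t=1}^{\infty}$ are distributed according to a normal distribution with arbitrary finite mean and variance, then the objective $P\left((Y-\mu)/\sigma\in[-1,1]\right)=p$ is equivalent to the joint objective $\mathbb{E}[Y-\mu]=0$ and $\mathbb{E}\left[\sigma^{-2}(Y-\mu)^{2}\right]=s^{2}$, with $p=\operatorname{erf}\left(\frac{1}{\sqrt{2}s}\right)$.

For example, percentiles of $p=0.99$ and $p=0.95$ correspond to $s\approx 0.4$ and $s\approx 0.5$, respectively. Conversely, $s=1$ corresponds to $p\approx 0.68$. This correspondence only holds when the targets are normal. For other distributions the two forms of normalization differ even in terms of their objectives.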
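The quoted values can be reproduced with a short sketch, assuming that the normalized targets are zero-mean normal with standard deviation $s$, so that the ratio inside $[-1,1]$ is $\operatorname{erf}(1/(\sqrt{2}s))$. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def p_from_s(s):
    """Ratio of N(0, s^2) samples that fall inside [-1, 1]."""
    return math.erf(1.0 / (math.sqrt(2.0) * s))


def s_from_p(p):
    """Standard deviation for which a ratio p of N(0, s^2) samples lies in [-1, 1]."""
    return 1.0 / NormalDist().inv_cdf((1.0 + p) / 2.0)


s_for_99 = s_from_p(0.99)  # roughly 0.39
s_for_95 = s_from_p(0.95)  # roughly 0.51
p_for_unit_s = p_from_s(1.0)  # roughly 0.683
```

The inverse direction uses the standard normal quantile at $(1+p)/2$, because the two tails outside $[-1,1]$ are symmetric.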

We now discuss a concrete algorithm to obtain normalization by percentiles. Let $Y^{(n)}_{t}$ denote order statistics of the targets up to time $t$, such that $Y^{(1)}_{t}=\textrm{min}_{i}\{Y_{i}\}_{i=1}^{t}$, $Y^{(t)}_{t}=\textrm{max}_{i}\{Y_{i}\}_{i=1}^{t}$, and $Y^{((t+1)/2)}_{t}=\operatorname*{\textrm{median}}_{i}\{Y_{i}\}_{i=1}^{t}$. (For non-integer $x$ we can define $Y^{(x)}$ by either rounding $x$ to an integer or, perhaps more appropriately, by linear interpolation between the values for the nearest integers.) For notational simplicity, define $n^{+}\equiv\frac{t+1}{2}+p\frac{t-1}{2}$ and $n^{-}\equiv\frac{t+1}{2}-p\frac{t-1}{2}$. Then, for data up to time $t$, the goal is

$$\frac{Y^{(n^{-})}_{t}-\mu_{t}}{\sigma_{t}}=-1\quad\text{and}\quad\frac{Y^{(n^{+})}_{t}-\mu_{t}}{\sigma_{t}}=1.$$

Solving for $\sigma_{t}$ and $\mu_{t}$ gives

$$\mu_{t}=\frac{1}{2}\left(Y^{(n^{+})}_{t}+Y^{(n^{-})}_{t}\right)\quad\text{and}\quad\sigma_{t}=\frac{1}{2}\left(Y^{(n^{+})}_{t}-Y^{(n^{-})}_{t}\right).$$

In the special case where $p=1$ we get $\mu_{t}=\frac{1}{2}(\textrm{max}_{i}Y_{i}+\textrm{min}_{i}Y_{i})$ and $\sigma_{t}=\frac{1}{2}(\textrm{max}_{i}Y_{i}-\textrm{min}_{i}Y_{i})$. We are then guaranteed that all normalized targets fall in $[-1,1]$, but this could result in an overly conservative normalization that is sensitive to outliers and may reduce the overall magnitude of the updates too far. In other words, learning will then be safe in the sense that no updates will be too big, but it may be slow because many updates may be very small. In general it is probably better to use a ratio $p<1$.
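A batch version of this normalization can be sketched as follows, assuming the shift is the midpoint and the scale the half-width of the two order statistics $Y^{(n^{+})}$ and $Y^{(n^{-})}$, with linear interpolation for non-integer indices. The function names are hypothetical.

```python
def order_statistic(sorted_values, n):
    """Y^(n) for a 1-based, possibly fractional index n, by linear interpolation."""
    lower_index = int(n)  # floor, since n >= 1
    frac = n - lower_index
    if frac == 0.0 or lower_index >= len(sorted_values):
        return float(sorted_values[min(lower_index, len(sorted_values)) - 1])
    return (1.0 - frac) * sorted_values[lower_index - 1] + frac * sorted_values[lower_index]


def percentile_normalization(targets, p):
    """Return (mu, sigma) such that a ratio of about p of the targets maps into [-1, 1]."""
    values = sorted(targets)
    t = len(values)
    n_plus = (t + 1) / 2 + p * (t - 1) / 2
    n_minus = (t + 1) / 2 - p * (t - 1) / 2
    upper = order_statistic(values, n_plus)
    lower = order_statistic(values, n_minus)
    return (upper + lower) / 2.0, (upper - lower) / 2.0


mu, sigma = percentile_normalization(range(1, 102), p=0.9)
inside = sum(1 for y in range(1, 102) if abs((y - mu) / sigma) <= 1.0 + 1e-12) / 101
```

With $p=1$ the same function reduces to the mid-range and half-range of the data, matching the special case described above.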

Exact order statistics are hard to compute online, because we would need to store all previous targets. To obtain more memory-efficient online updates for percentiles we can store two values $y^{\textrm{min}}_{t}$ and $y^{\textrm{max}}_{t}$, which should eventually have the property that a proportion of $(1-p)/2$ values is larger than $y^{\textrm{max}}_{t}$ and a proportion of $(1-p)/2$ values is smaller than $y^{\textrm{min}}_{t}$, such that

$$P\left(Y>y^{\textrm{max}}_{t}\right)=P\left(Y<y^{\textrm{min}}_{t}\right)=\frac{1-p}{2}.\tag{6}$$

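This can be achieved asymptotically by updating $y^{\textrm{min}}_{t}$ and $y^{\textrm{max}}_{t}$ according to

$$y^{\textrm{max}}_{t}=y^{\textrm{max}}_{t-1}+\beta_{t}\left(\mathcal{I}(Y_{t}>y^{\textrm{max}}_{t-1})-\frac{1-p}{2}\right)\quad\text{and}\quad y^{\textrm{min}}_{t}=y^{\textrm{min}}_{t-1}-\beta_{t}\left(\mathcal{I}(Y_{t}<y^{\textrm{min}}_{t-1})-\frac{1-p}{2}\right),\tag{7}$$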

where the indicator function $\mathcal{I}(\cdot)$ is equal to one when its argument is true and equal to zero otherwise.

Proposition 5. If $\sum_{t=1}^{\infty}\beta_{t}=\infty$ and $\sum_{t=1}^{\infty}\beta_{t}^{2}<\infty$, and the distribution of targets is stationary, then the updates in (7) converge to values such that (6) holds.

If the step size $\beta_{t}$ is too small it will take long for the updates to converge to appropriate values. In practice, it might be better to let the magnitude of the steps depend on the actual errors, such that the update takes the form of an asymmetrical least-squares update (Newey and Powell, 1987; Efron, 1991).
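An online tracker of this kind can be sketched as below. It assumes indicator-based updates in which $y^{\textrm{max}}$ moves up when a target exceeds it and drifts down otherwise, and symmetrically for $y^{\textrm{min}}$. The step-size schedule $\beta_{t}=t^{-0.75}$, the zero initial values, and the function name are hypothetical choices for this example.

```python
import random


def online_percentile_bounds(targets, p, y_min=0.0, y_max=0.0, exponent=0.75):
    """Track bounds so that a ratio of about (1 - p) / 2 of targets falls outside each."""
    tail = (1.0 - p) / 2.0
    for t, y in enumerate(targets, start=1):
        beta_t = t ** -exponent  # sum is infinite, sum of squares is finite
        y_max += beta_t * ((y > y_max) - tail)
        y_min -= beta_t * ((y < y_min) - tail)
    return y_min, y_max


rng = random.Random(0)
uniform_targets = [rng.random() for _ in range(20000)]
low, high = online_percentile_bounds(uniform_targets, p=0.8)
```

For targets that are uniform on $[0,1]$ and $p=0.8$, the bounds are expected to settle near $0.1$ and $0.9$. Because each step has magnitude $\beta_{t}$ regardless of how far off the bound is, convergence can be slow when the initial values are far from the data.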

Online learning with minibatches

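Online normalization by mean and variance with minibatches $\{Y_{t,1},\ldots,Y_{t,B}\}$ of size $B$ can be achieved by using the updates

$$\mu_{t}=(1-\beta_{t})\mu_{t-1}+\beta_{t}\frac{1}{B}\sum_{b=1}^{B}Y_{t,b}\quad\text{and}\quad\sigma_{t}^{2}=\nu_{t}-\mu_{t}^{2},\quad\text{where}\quad\nu_{t}=(1-\beta_{t})\nu_{t-1}+\beta_{t}\frac{1}{B}\sum_{b=1}^{B}Y_{t,b}^{2}.$$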

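Another interesting possibility is to update $y^{\textrm{min}}_{t}$ and $y^{\textrm{max}}_{t}$ towards the extremes of the minibatch such that

$$y^{\textrm{max}}_{t}=(1-\beta_{t})y^{\textrm{max}}_{t-1}+\beta_{t}\max_{b}Y_{t,b}\quad\text{and}\quad y^{\textrm{min}}_{t}=(1-\beta_{t})y^{\textrm{min}}_{t-1}+\beta_{t}\min_{b}Y_{t,b}.\tag{9}$$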

The statistics of this normalization depend on the size of the minibatches, and there is an interesting correspondence to normalization by percentiles.

Proposition 6. Consider minibatches $\{\{Y_{t,1},\ldots,Y_{t,B}\}\}_{t=1}^{\infty}$ of size $B\geq 2$ whose elements are drawn i.i.d. from a uniform distribution with support on $[a,b]$. If $\sum_{t}\beta_{t}=\infty$ and $\sum_{t}\beta_{t}^{2}<\infty$, then in the limit the updates (9) converge to values such that (6) holds, with $p=(B-1)/(B+1)$.

This fact connects the online minibatch updates (9) to normalization by percentiles. For instance, a minibatch size of $B=20$ would correspond roughly to online percentile updates with $p=19/21\approx 0.9$ and, by Proposition 4, to a normalization by mean and variance with $s\approx 0.6$. These different normalizations are not strictly equivalent, but may behave similarly in practice.
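The correspondence for uniform targets can be checked with a small simulation. It assumes that the two bounds are moved towards the minibatch maximum and minimum with step size $\beta_{t}=1/t$, which makes them running averages of the minibatch extremes. The function name, the seed, and the number of minibatches are hypothetical.

```python
import random


def minibatch_extremes(batches, y_min=0.0, y_max=0.0):
    """Move the bounds towards the extremes of each minibatch, with beta_t = 1 / t."""
    for t, batch in enumerate(batches, start=1):
        beta_t = 1.0 / t
        y_max = (1.0 - beta_t) * y_max + beta_t * max(batch)
        y_min = (1.0 - beta_t) * y_min + beta_t * min(batch)
    return y_min, y_max


B = 20
rng = random.Random(1)
batches = [[rng.random() for _ in range(B)] for _ in range(5000)]
low, high = minibatch_extremes(batches)
covered = high - low  # ratio of uniform [0, 1] targets between the two bounds
expected_p = (B - 1) / (B + 1)  # 19 / 21, roughly 0.905
```

For uniform targets on $[0,1]$ the ratio of targets between the two bounds equals their distance, so `covered` should be close to `expected_p`. Larger minibatches push the bounds towards the ends of the support.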

Proposition 6 quantifies an interesting correspondence between minibatch updates and normalizing by percentiles. Although the fact as stated holds only for uniform targets, the proportion of normalized targets in the interval $[-1,1]$ more generally becomes larger when we increase the minibatch size, just as when we increase $p$ or decrease $s$, potentially resulting in better robustness to outliers at the possible expense of slower learning.

A note on initialization

When using constant step sizes it is useful to account for the start of learning, so that early estimates trust the data rather than arbitrary initial values. This can be done by using a step size as defined in the following fact.

Consider a recency-weighted running average $\bar{z}_{t}$ updated from a stream of data $\{Z_{t}\}_{t=1}^{\infty}$ using $\bar{z}_{t}=(1-\beta_{t})\bar{z}_{t-1}+\beta_{t}Z_{t}$, with $\beta_{t}$ defined by

$$\beta_{t}=\frac{\beta}{1-(1-\beta)^{t}}\,. \tag{10}$$

Then 1) the relative weights of the data in $\bar{z}_{t}$ are the same as when using a constant step size $\beta$, and 2) the estimate $\bar{z}_{t}$ does not depend on the initial value $\bar{z}_{0}$.
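Both claims can be checked with a short numerical sketch. It assumes the step size $\beta_{t}=\beta/(1-(1-\beta)^{t})$ and compares the resulting average against a zero-initialized exponential average that is divided by $1-(1-\beta)^{t}$ afterwards. The function names and the data are illustrative assumptions.

```python
def corrected_running_average(data, beta, z0):
    """Running average with the assumed step size beta_t = beta / (1 - (1 - beta)**t)."""
    z = z0
    for t, value in enumerate(data, start=1):
        beta_t = beta / (1.0 - (1.0 - beta) ** t)
        z = (1.0 - beta_t) * z + beta_t * value
    return z


def debiased_exponential_average(data, beta):
    """Zero-initialized exponential average, divided by 1 - (1 - beta)**t at the end."""
    m = 0.0
    for value in data:
        m = (1.0 - beta) * m + beta * value
    return m / (1.0 - (1.0 - beta) ** len(data))


data = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
from_zero = corrected_running_average(data, beta=0.1, z0=0.0)
from_far = corrected_running_average(data, beta=0.1, z0=100.0)
reference = debiased_exponential_average(data, beta=0.1)
```

Because $\beta_{1}=1$, the first update overwrites the initial value, so `from_zero` and `from_far` agree up to floating-point error, and both match `reference`.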

A similar result was derived to remove the effect of the initialization of certain parameters by Kingma and Ba for a stochastic optimization algorithm called Adam. In that work, the initial values are assumed to be zero and a standard exponentially weighted average is explicitly computed and stored, and then divided by a term analogous to $1-(1-\beta)^{t}$. The step size (10) corrects for any initialization in place, without storing auxiliary variables, but otherwise the method and its motivation are very similar.

Alternatively, it is possible to initialize the normalization safely by choosing a scale that is initially relatively high. This can be beneficial when the targets are at first relatively small and noisy. If we then used the step size in (10), the updates would treat these initial observations as important and would try to fit our approximating function to the noise. A high initialization (e.g., $\nu_{0}=10^{4}$ or $\nu_{0}=10^{6}$) would instead reduce the effect of the first targets on the learning updates and use them only to find an appropriate normalization. The actual learning would commence only after this normalization has been found.
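The effect of a high initial scale can be illustrated with a short sketch. It assumes that the normalization statistics follow updates of the form $\mu_{t}=(1-\beta)\mu_{t-1}+\beta Y_{t}$, $\nu_{t}=(1-\beta)\nu_{t-1}+\beta Y_{t}^{2}$ and $\sigma_{t}=\sqrt{\nu_{t}-\mu_{t}^{2}}$ with a constant step size. The function name and all numbers are illustrative assumptions.

```python
import math


def normalized_targets(targets, beta, mu0=0.0, nu0=1.0):
    """Return (Y_t - mu_t) / sigma_t for each target under the assumed updates."""
    mu, nu = mu0, nu0
    out = []
    for y in targets:
        mu = (1.0 - beta) * mu + beta * y
        nu = (1.0 - beta) * nu + beta * y * y
        sigma = math.sqrt(nu - mu * mu)  # positive here because nu0 > mu0**2
        out.append((y - mu) / sigma)
    return out


targets = [1.0, -0.5, 0.8, -1.2, 0.3]
default_init = normalized_targets(targets, beta=0.01, nu0=1.0)
high_init = normalized_targets(targets, beta=0.01, nu0=1e6)
```

With $\nu_{0}=10^{6}$ the first normalized targets are of order $10^{-3}$, so they barely affect the learning updates while the statistics adapt, whereas with $\nu_{0}=1$ they are of order one.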

Deep Pop-Art

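Sometimes it makes sense to apply the normalization not to the output of the network, but at a lower level. For instance, the $i^{\text{th}}$ output of a neural network with a soft-max on top can be written as

$$\frac{\exp\left([\mathbf{W}h_{\bm{\theta}}(X)+\bm{b}]_{i}\right)}{\sum_{j}\exp\left([\mathbf{W}h_{\bm{\theta}}(X)+\bm{b}]_{j}\right)}\,,$$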

where $\mathbf{W}$ is the weight matrix of the last linear layer before the soft-max. The actual outputs are already normalized by using the soft-max, but the outputs $\mathbf{W}h_{\bm{\theta}}(X)+\bm{b}$ of the layer below the soft-max may still benefit from normalization. To determine the targets to be normalized, we can either back-propagate the gradient of our loss through the soft-max or invert the function.
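Both options can be sketched for a single example. The snippet below assumes a cross-entropy loss, for which the gradient with respect to the pre-soft-max outputs is the soft-max output minus the target distribution. It inverts the soft-max with a logarithm, which recovers the pre-soft-max outputs only up to an additive constant. The function names and the choice of loss are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable soft-max of a list of pre-soft-max outputs."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_logit_gradient(logits, target_probs):
    """Gradient of the cross-entropy loss with respect to the pre-soft-max outputs."""
    return [q - p for q, p in zip(softmax(logits), target_probs)]


def invert_softmax(target_probs):
    """Pre-soft-max targets that reproduce target_probs (defined up to a constant shift).

    Requires strictly positive probabilities.
    """
    return [math.log(p) for p in target_probs]


target = [0.7, 0.2, 0.1]
logit_targets = invert_softmax(target)
recovered = softmax(logit_targets)
gradient = cross_entropy_logit_gradient([0.0, 0.0, 0.0], target)
```

Either `logit_targets` or the back-propagated `gradient` could then serve as the signal to be normalized at the level of the last linear layer.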

More generally, we can consider applying normalization at any level of a hierarchical non-linear function. This seems a promising way to counteract undesirable characteristics of back-propagating gradients, such as vanishing or exploding gradients [Hochreiter, 1998].

In addition, normalizing gradients further down in a network can provide a straightforward way to combine gradients from different sources in more complex network graphs than a standard feedforward multi-layer network. First, the normalization allows us to normalize the gradient from each source separately before merging gradients, thereby preventing one source from fully drowning out the others and allowing us to weight the gradients by actual relative importance, rather than implicitly relying on the current magnitude of each as a proxy for this. Second, the normalization can prevent undesirably large gradients when many gradients come together at one point of the graph, by normalizing again after merging gradients.

Proofs

then the outputs of the unnormalized function $f$ are preserved precisely in the sense that

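When using updates (4) to adapt the normalization parameters $\sigma$ and $\mu$, the normalized target $\sigma_{t}^{-1}(Y_{t}-\mu_{t})$ is bounded for all $t$ by

$$-\sqrt{\frac{1-\beta_{t}}{\beta_{t}}}\leq\frac{Y_{t}-\mu_{t}}{\sigma_{t}}\leq\sqrt{\frac{1-\beta_{t}}{\beta_{t}}}\,.$$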

The inequality follows from the fact that $\nu_{t-1}\geq\mu_{t-1}^{2}$ . ∎
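For completeness, the bound on the normalized target can be derived in a few steps. The sketch assumes that updates (4) have the form $\mu_{t}=(1-\beta_{t})\mu_{t-1}+\beta_{t}Y_{t}$, $\nu_{t}=(1-\beta_{t})\nu_{t-1}+\beta_{t}Y_{t}^{2}$ and $\sigma_{t}^{2}=\nu_{t}-\mu_{t}^{2}$, with $\beta_{t}\in(0,1)$ and $Y_{t}\neq\mu_{t-1}$. Under these assumptions the squared normalized target is at most $(1-\beta_{t})/\beta_{t}$:

$$\begin{aligned}
Y_{t}-\mu_{t} &= (1-\beta_{t})(Y_{t}-\mu_{t-1})\,, \\
\sigma_{t}^{2} &= (1-\beta_{t})\nu_{t-1}+\beta_{t}Y_{t}^{2}-\big((1-\beta_{t})\mu_{t-1}+\beta_{t}Y_{t}\big)^{2} \\
&\geq (1-\beta_{t})\mu_{t-1}^{2}+\beta_{t}Y_{t}^{2}-\big((1-\beta_{t})\mu_{t-1}+\beta_{t}Y_{t}\big)^{2} \\
&= \beta_{t}(1-\beta_{t})(Y_{t}-\mu_{t-1})^{2}\,, \\
\frac{(Y_{t}-\mu_{t})^{2}}{\sigma_{t}^{2}} &\leq \frac{(1-\beta_{t})^{2}(Y_{t}-\mu_{t-1})^{2}}{\beta_{t}(1-\beta_{t})(Y_{t}-\mu_{t-1})^{2}}=\frac{1-\beta_{t}}{\beta_{t}}\,.
\end{aligned}$$

The only inequality used is $\nu_{t-1}\geq\mu_{t-1}^{2}$, in the step from the second to the third line.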

where $h_{\bm{\theta}}$ is the same differentiable function in both cases, and the functions are initialized identically, using $\mathbf{\Sigma}_{0}=\mathbf{I}$ and ${\bm{\mu}}_{0}={\bm{0}}$, and the same initial $\bm{\theta}_{0}$, $\mathbf{W}_{0}$ and $\bm{b}_{0}$. Consider updating the first function using Algorithm 1 and the second using Algorithm 2. Then, for any sequence of non-singular scales $\{\mathbf{\Sigma}_{t}\}_{t=1}^{\infty}$ and shifts $\{{\bm{\mu}}_{t}\}_{t=1}^{\infty}$, the algorithms are equivalent in the sense that 1) the sequences $\{\bm{\theta}_{t}\}_{t=0}^{\infty}$ are identical, and 2) the outputs of the functions are identical, for any input.

Let $\bm{\theta}_{t}^{1}$ and $\bm{\theta}_{t}^{2}$ denote the parameters of $h_{\bm{\theta}}$ for Algorithms 1 and 2, respectively. Similarly, let $\mathbf{W}^{1}$ and $\bm{b}^{1}$ be parameters of the first function, while $\mathbf{W}^{2}$ and $\bm{b}^{2}$ are parameters of the second function. It is enough to show that single updates of both Algorithms 1 and 2 from the same starting points have equivalent results. That is, if

where the quantities $\bm{\theta}^{2}$ , $\mathbf{W}^{2}$ , and $\bm{b}^{2}$ are updated with Algorithm 2 and quantities $\bm{\theta}^{1}$ , $\mathbf{W}^{1}$ , and $\bm{b}^{1}$ are updated with Algorithm 1. We do not require $\mathbf{W}^{2}_{t}=\mathbf{W}^{1}_{t}$ or $\bm{b}^{2}_{t}=\bm{b}^{1}_{t}$ , and indeed these quantities will generally differ.

We use the shorthands $f^{1}_{t}$ and $f^{2}_{t}$ for the first and second function, respectively. First, we show that $\mathbf{W}^{1}_{t}=\mathbf{\Sigma}^{-1}_{t}\mathbf{W}^{2}_{t}$ , for all $t$ . For $t=0$ , this holds trivially because $\mathbf{W}^{1}_{0}=\mathbf{W}^{2}_{0}=\mathbf{W}_{0}$ , and $\mathbf{\Sigma}_{0}={\bm{I}}$ . Now assume that $\mathbf{W}^{1}_{t-1}=\mathbf{\Sigma}^{-1}_{t-1}\mathbf{W}^{2}_{t-1}$ . Let $\delta_{t}=Y_{t}-f^{1}_{t}(X_{t})$ be the unnormalized error at time $t$ . Then, Algorithm 1 results in

Similarly, $\bm{b}^{1}_{0}=\mathbf{\Sigma}_{0}^{-1}(\bm{b}^{2}_{0}-\mu_{0})$ and if $\bm{b}^{1}_{t-1}=\mathbf{\Sigma}_{t-1}^{-1}(\bm{b}^{2}_{t-1}-\mu_{t-1})$ then

Now, assume that $\bm{\theta}^{1}_{t-1}=\bm{\theta}^{2}_{t-1}$ . Then,

As $\bm{\theta}^{1}_{0}=\bm{\theta}^{2}_{0}$ by assumption, $\bm{\theta}_{t}^{1}=\bm{\theta}_{t}^{2}$ for all $t$ .

Finally, we put everything together and note that $f^{1}_{0}=f^{2}_{0}$ and that

For any $\mu$ and $\sigma$, the normalized targets are distributed according to a normal distribution because the targets themselves are normally distributed and the normalization is an affine transformation. For a normal distribution with mean zero and variance $v$, the values $1$ and $-1$ are both exactly $1/\sqrt{v}$ standard deviations from the mean, implying that the ratio of data between these points is $\Phi(1/\sqrt{v})-\Phi(-1/\sqrt{v})$, where

$$\Phi(x)=\frac{1}{2}\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

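is the standard normal cumulative distribution. The normalization by mean and variance is then equivalent to a normalization by percentiles with a ratio $p$ defined by

$$\begin{aligned}
p &= \Phi(1/\sqrt{v})-\Phi(-1/\sqrt{v}) \\
&= \frac{1}{2}\left(1+\operatorname{erf}\left(\frac{1}{\sqrt{2v}}\right)\right)-\frac{1}{2}\left(1+\operatorname{erf}\left(-\frac{1}{\sqrt{2v}}\right)\right) \\
&= \operatorname{erf}\left(\frac{1}{\sqrt{2v}}\right),
\end{aligned}$$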

where we used the fact that $\operatorname*{\textrm{erf}}$ is odd, such that $\operatorname*{\textrm{erf}}(x)=-\operatorname*{\textrm{erf}}(-x)$ . ∎
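As a quick numerical illustration of this correspondence, the snippet below evaluates $\Phi(1/\sqrt{v})-\Phi(-1/\sqrt{v})$ with Python's `math.erf`, using $\Phi(x)=\frac{1}{2}(1+\operatorname{erf}(x/\sqrt{2}))$. Interpreting $s\approx 0.6$ as the standard deviation of the normalized targets, so that $v=s^{2}$, is an assumption made here, as are the function names.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def proportion_within_unit_interval(v):
    """Fraction of a zero-mean normal with variance v that falls in [-1, 1]."""
    z = 1.0 / math.sqrt(v)
    return normal_cdf(z) - normal_cdf(-z)


s = 0.6  # assumed standard deviation of the normalized targets
p = proportion_within_unit_interval(s ** 2)
p_closed_form = math.erf(1.0 / math.sqrt(2.0 * s ** 2))
```

Under this reading, $s=0.6$ gives a proportion close to $19/21$, the value associated with a minibatch size of $B=20$.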

If $\sum_{t=1}^{\infty}\beta_{t}=\infty$ and $\sum_{t=1}^{\infty}\beta_{t}^{2}<\infty$, and the distribution of targets is stationary, then the updates in (7) converge to values such that (6) holds.

so this is a fixed point of the update. Note further that the variance of the stochastic update is finite, and that the expected direction of the updates is towards the fixed point, so that this fixed point is an attractor. The conditions on the step sizes ensure that the fixed point is reachable ($\sum_{t=1}^{\infty}\beta_{t}=\infty$) and that we converge upon it in the limit ($\sum_{t=1}^{\infty}\beta_{t}^{2}<\infty$). For more detail and weaker conditions, we refer the reader to the extensive literature on stochastic approximation [Robbins and Monro, 1951, Kushner and Yin, 2003]. The proof for the update for $y^{\textrm{min}}_{t}$ is exactly analogous. ∎

Because of the conditions on the step size, the quantities $y^{\textrm{min}}_{t}$ and $y^{\textrm{max}}_{t}$ will converge to the expected value for the minimum and maximum of a set of $B$ i.i.d. random variables. The cumulative distribution function (CDF) for the maximum of $B$ i.i.d. random variables with CDF $F(x)$ is $F(x)^{B}$, since

$$P\Big(\max_{1\leq b\leq B}Y_{b}\leq x\Big)=P(Y_{1}\leq x,\ldots,Y_{B}\leq x)=\prod_{b=1}^{B}P(Y_{b}\leq x)=F(x)^{B}\,.$$

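The CDF for a uniform random variable with support on $[a,b]$ is

$$F(x)=\begin{cases}0 & \text{if } x<a\,,\\ \dfrac{x-a}{b-a} & \text{if } a\leq x\leq b\,,\\ 1 & \text{if } x>b\,.\end{cases}$$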

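The associated expected value of the maximum can then be calculated to be

$$\mathbb{E}\Big[\max_{1\leq b\leq B}Y_{b}\Big]=\int_{a}^{b}x\,\frac{\mathrm{d}F(x)^{B}}{\mathrm{d}x}\,\mathrm{d}x=a+\frac{B}{B+1}(b-a)\,,$$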

so that a fraction of $\frac{1}{B+1}$ of samples will be larger than this value. Through a similar reasoning, an additional fraction of $\frac{1}{B+1}$ will be smaller than the minimum, and a ratio of $p=\frac{B-1}{B+1}$ will on average fall between these values. ∎

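Consider a weighted running average $x_{t}$ updated from a stream of data $\{Z_{t}\}_{t=1}^{\infty}$ using

$$x_{t}=(1-\beta_{t})x_{t-1}+\beta_{t}Z_{t}\,,\qquad\text{with}\qquad\beta_{t}=\frac{\beta}{1-(1-\beta)^{t}}\,,$$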

where $\beta$ is a constant. Then 1) the relative weights of the data in $x_{t}$ are the same as when only the constant step size $\beta$ is used, and 2) the average does not depend on the initial value $x_{0}$ .

and where $\mu_{0}=\mu_{0}^{\beta}$. Note that $\mu_{t}$ as defined by (11) exactly removes the contribution of the initial value $\mu_{0}$, which at time $t$ has weight $(1-\beta)^{t}$ in the exponential moving average $\mu_{t}^{\beta}$, and then renormalizes the remaining value by dividing by $1-(1-\beta)^{t}$, such that the relative weights of the observed samples $\{Z_{t}\}_{t=1}^{\infty}$ are conserved.

so that (11) then holds for $\mu_{t}$. Finally, verify that $\mu_{1}=Z_{1}$. Therefore, (11) holds for all $t$ by induction. ∎