Understanding Gradient Clipping in Private SGD: A Geometric Perspective

Xiangyi Chen, Zhiwei Steven Wu, Mingyi Hong

Introduction

However, the clipping operation can create a substantial bias in the update direction. To illustrate this clipping bias, consider the following two optimization problems even without the privacy constraint.

Example 2. Let $f(x)=\frac{1}{2}\sum_{i=1}^{2}\frac{1}{2}(x-a_{i})^{2}$ , where $a_{1}=-3$ and $a_{2}=3$ . The minimum of $f$ is achieved at $x^{*}=0$ , where the expected clipped gradient is also 0. However, SGD with clipped gradients and $c=1$ may never converge to $x^{*}$ since the expected clipped gradients are all 0 for any $x\in$ , which means all these points are ”stationary” for the algorithm.

Both examples above show that clipping bias can prevent convergence in the worst case. Existing analyses on gradient clipping quantify this clipping bias either with 1) the difference between clipped and unclipped gradients (Pichapati et al., 2019), or 2) the fraction of examples with gradient norms exceeding the clip threshold $c$ (Zhang et al., 2019). These approaches suggest that a small clip threshold will lead to large clipping bias and worsen the training performance of DP-SGD. However, in practice, DP-SGD often remains effective even with a small clip threshold, which indicates a gap in the current theoretical understanding of gradient clipping.

We study the effects of gradient clipping on SGD and DP-SGD and provide:

Theoretical and empirical evaluation of DP-SGD. Building on the SGD analysis, we obtain a similar convergence guarantee on DP-SGD with gradient clipping. Importantly, we are able to prove such convergence guarantee even without Lipschitzness of the loss function, which is often required for DP-SGD analyses. We also provide extensive empirical studies to investigate the gradient distributions of DP-SGD across different epoches on two real datasets. To visualize the symmetricity of the gradient distributions, we perform multiple random projections on the gradients and examine the two-dimensional projected distributions. Our results suggest that the gradient distributions in DP-SGD quickly exhibit symmetricity, despite the asymmetricity at initialization.

Gradient correction mechanism. Finally, we provide a simple modification to DP-SGD that can mitigate the clipping bias. We show that perturbing the gradients before clipping can provably reduce the clipping bias for any gradient distribution. The pre-clipping perturbation does not by itself provide privacy guarantees, but can trade-off the clipping bias with higher variance.

2 Related work

Convergence of SGD with clipped gradient

In this section, we analyze convergence of SGD with clipped gradient, but without the Gaussian perturbation. This simplification is useful for isolating the clipping bias. Consider the standard stochastic optimization formulation

where $D$ denotes the underlying distribution over the examples $s$ . In the next section, we will instantiate $D$ as the empirical distribution over the private dataset. We assume that the algorithm is given access to a stochastic gradient oracle: given any iterate $x_{t}$ of SGD, the oracle returns $\nabla f(x_{t})+\xi_{t}$ , where $\xi_{t}$ is independent noise with zero mean. In addition, we assume $f(x)$ is G-smooth, i.e. $\|\nabla f(x)-\nabla f(y)\|\leq G\|x-y\|,\forall x,y$ . At each iteration $t$ , SGD with gradient clipping performs the following update:

where $g_{t}=\text{clip}({\nabla f(x_{t})+\xi_{t}},c)$ denotes the realized clipped gradient.

To carry out the analysis of iteration (3), we first note that the standard convergence analysis for SGD-type method consists of two main steps:

S2) Show that the aforementioned quantity is proportional to $\|\nabla f(x_{t})\|^{2}$ or $c\|\nabla f(x_{t})\|$ , indicating that the size of gradient also decreases to zero.

In our analysis below, we will see that showing the first step is relatively easy, while the main challenge is to show that the second step holds true.Our first result is given below.

Let $G$ be the Lipschitz constant of $\nabla f$ such that $\|\nabla f(x)-\nabla f(y)\|\leq G\|x-y\|,\forall x,y$ . For SGD with gradient clipping of threshold $c$ , if we set $\alpha=\frac{1}{\sqrt{T}}$ , we have

A straightforward ”nice” distribution will be one that can ensure $\langle\nabla f(x_{t}),g_{t}\rangle\geq\Omega(\|\nabla f(x_{t})\|_{2}^{2}),\ \forall g_{t}$ , i.e. all stochastic gradients are positively aligned with the true gradient. This may be satisfied when the gradient is large and the noise $\xi$ is bounded. However, when the gradient is small, it is hard to argue that this can still be true in general. Specifically, in the training of neural nets, the cosine similarities between many stochastic gradients and the true gradient (i.e. $\cos(\nabla f(x_{t}),\nabla f(x_{t})+\xi_{t}$ )) can be negative, which implies that this assumption does not hold (see Figure 3 in Section 4).

Although Figure 3 seems to exclude the ideal distribution, we observe that the distribution of cosine of the gradients appears to be symmetric. Will such a ”symmetricity” property help define the ”nice” distribution for gradient clipping? If so, how to characterizes the performance of gradient clipping in this situation? In the following result, we rigorously answer to these questions.

Theorem 2 states that when the noise distribution is symmetric, gradient clipping will keep the expected clipped gradients positively aligned with the true gradient. This is the desired property that can guarantee convergence. The probability term characterizes the possible slow down caused by gradient clipping, we provide more discussions and experiments on this term in the Appendix.

Combining Theorem 2 with Theorem 1, we have Corollary 1 to fully characterize the convergence behavior of SGD with gradient clipping.

2 Beyond symmetric distributions

Mixture of symmetric or positively skewed. If $p$ is a mixture of multiple symmetric or positively skewed distributions, one can split the distributions into multiple ones and use their individual properties. E.g. one can easily establish convergence guarantee for $p$ being a mixture of $m$ spherical distributions with mean $u_{1},...,u_{m}$ and $\langle f(x_{t}),u_{i}\rangle\geq 0,\forall i\in[m]$ as in the following theorem.

Given $m$ distributions with the pdf of the $i$ th distribution being $p_{i}(\xi)=\phi_{i}(\|\xi-u_{i}\|)$ for some function $\phi_{i}$ . If $\nabla f(x_{t})=\sum_{i=1}^{n}w_{i}u_{i}$ for some $w_{i}\geq 0,\sum_{i=1}^{m}w_{i}=1$ . Let $p^{\prime}(\xi)=\sum_{i=1}^{m}w_{i}p_{i}(\xi-\nabla f(x_{t})),$ be a mixture of these distributions with zero mean. If $\langle u_{i},\nabla f(x_{t})\rangle\geq 0,\forall i\in[m]$ , we have

Besides these examples of favorable biases above, there are also many cases where $b_{t}$ can be negative and lead to a convergence gap, such as negatively skewed distributions or multimodal distributions with highly imbalanced modes. We have illustrated possible distributions in our divergence examples (Examples 1 and 2). In such cases, one should expect that clipping has an adversarial impact on the convergence guarantee. However, as we also show in Section 4, the gradient distributions on real datasets tend to be symmetric, and so their clipping bias to be small.

DP-SGD with Gradient Clipping

where $S_{t}$ is a random subsample of $S$ (with replacement)Alternatively, subsampling with replacement (Wang et al., 2019) and Poisson subsampling (Zhu and Wang, 2019) have also been proposed. and $Z_{t}\sim\mathcal{N}(0,\sigma^{2}I)$ is the noise added for privacy. We first recall the privacy guarantee of the algorithm below:

There exist constants $u$ and $v$ so that given the number of iterations $T$ , for any $\epsilon\leq uq^{2}T$ , where $q=\frac{|S_{t}|}{n}$ , DP-SGD with gradient clipping of threshold $c$ is $(\epsilon,\delta)$ -differentially private for any $\delta>0$ , if $\sigma^{2}\geq v\frac{c^{2}T\ln\left(\frac{1}{\delta}\right)}{n^{2}\epsilon^{2}}$ .

By accounting for the sub-sampling noise and Gaussian perturbation in DP-SGD, we obtain the following convergence guarantee, where we further bound the clipping bias term $b_{t}$ with the Wasserstein distance between the gradient distribution and a coupling symmetric distribution.

where $h_{c}(y)=\min(y^{2},\frac{3}{4}cy)$ and $W_{v,c}(p,p^{\prime})$ is the Wasserstein distance between $p$ and $p^{\prime}$ with metric function $d_{v,c}(a,b)=|\langle v,\text{clip}(v+a,c)\rangle-\langle v,\text{clip}(v+b,c)\rangle|$ and $D_{f}\geq f(x_{1})-\min_{x}f(x)$ .

Remark on convergence rate. DP-SGD achieves convergence rate of $O({\sqrt{d}}/{(n\epsilon)})$ in the existing literature. As shown in Theorem 5, with gradient clipping, the rate becomes $O({\sqrt{d}}/{(n\epsilon)})+\text{clipping bias}$ . When gradient distribution is symmetric, the convergence rate of $O({\sqrt{d}}/{(n\epsilon)})$ can be recovered. This result implies that when gradient distribution is symmetric, the clipping operation will only affect the convergence rate of DP-SGD by a constant factor. In addition, since the clipping bias is the Wasserstein distance between the empirical gradient distribution and an oracle symmetric distribution, it can be small when the gradient distribution is approximately symmetric.

Experiments

In this section, we investigate whether the gradient distributions of DP-SGD are approximate symmetric in practice. However, since the gradient distributions are high-dimensional, certifying symmetricity is in general intractable. We instead consider two simple proxy measures and visualizations.

Setup. We run DP-SGD implemented in Tensorflow https://github.com/tensorflow/privacy/tree/master/tutorials on two popular datasets MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009). For MNIST, we train a CNN with two convolution layers with 16 4 $\times$ 4 kernels followed by a fully connected layer with 32 nodes. We use DP-SGD to train the model with $\alpha=0.15$ , and a batchsize of 128. For CIFAR-10, we train a CNN with two convolutional layers with 2 $\times$ 2 max pooling of stride 2 followed by a fully connected layer, all using ReLU activation, each layer uses a dropout rate of 0.5. The two convolution layer has 32 and 64 3 $\times$ 3 kernels, the fully connected layer has 1500 nodes. We use $\alpha=0.001$ and decrease it by 10 times every 20 epochs. The clip norm of both experiments is set to be $c=1$ and the noise multiplier is 1.1.

Visualization with random projections. We visualize the gradient distribution by projecting the gradient to a two-dimensional space using random Gaussian matrices. Note that given any symmetric distribution, its two-dimensional projection remains symmetric for any projection matrix. On the contrary, if for all projection matrix, the projected gradient distribution is symmetric, the original gradient distribution should also be symmetric. We repeat the projection using different randomly generated matrices and visualize the induced distributions.

From Figure 1, we can see that on both datasets, the gradient distribution is non-symmetric before training (Epoch 0), but over the epochs, the gradient distributions become increasingly symmetric. The distribution of gradients on MNIST at the end of epoch 9 projected to a random two-dimensional space using different random matrices is shown in Figure 2. It can be seen that the approximate symmetric property holds for all 8 realizations. We provide many more visualizations from different realized random projections across different epochs in the Appendix.

Symmetricity of angles. We also measure the cosine similarities between per-sample stochastic gradients and the true gradient. We observe that the cosine similarities between per-sample stochastic gradients and the true gradient (i.e. $\cos(\nabla f(x_{t})+\xi_{t,i},\nabla f(x_{t}))$ ) is approximate symmetric around 0 as shown in the histograms in Figure 3.

Mitigating Clipping Bias with Perturbation

From previous analyses, SGD with gradient clipping and DP-SGD have good convergence performance when the gradient noise distribution is approximately symmetric or when the gradient bias favors convergence (e.g., mixture of symmetric distributions with aligned mean). Although in practice, gradient distributions do exhibit (approximate) symmetry (see Sec. 4), it would be desirable to have tools to handle situations where the clipping bias does not favor convergence. Now we provide an approach to decrease the bias. If one adds some Gaussian noise before clipping, i.e.

we can prove $|b_{t}|=O\left(\frac{\sigma^{2}_{\xi_{t}}}{k^{2}}\right)$ as in Theorem 6.

Let $g_{t}=\text{clip}(\nabla f(x_{t})+\xi_{t}+k\zeta_{t},c)$ and $\zeta_{t}\sim\mathcal{N}(0,I)$ . Then gradient clipping algorithm has following properties:

where $\sigma^{2}_{\xi_{t}}$ is the variance of the gradient noise $\xi_{t}$ .

More discussion can be found in the Appendix. Note that when the perturbation approach is applied to DP-SGD, the update rule (8) becomes

where each per-sample stochastic gradient is be perturbed by the noise. By adding the noise, one trade-offs bias with variance. Larger noise make the algorithm converges possibly slower but better. This trick can be helpful when the gradient distribution is not favorable. To verify its effect in practice, we run DP-SGD with gradient clipping on a few unfavorable problems including examples in Section 1 and a new high dimensional example. We set $\sigma=1$ on all the examples (i.e. $Z_{t}\sim\mathcal{N}(0,I)$ ). For the new example, we minimize the function $f(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\|x-z_{i}\|^{2}$ with $n=10000$ . Each $z_{i}$ is drawn from a mixture of isotropic Gaussian with 3 components of dimension 10. The covariance matrix of all components is $I$ and the means of the 3 components are drawn from $\mathcal{N}(0,36I)$ , $\mathcal{N}(0,4I)$ , $\mathcal{N}(0,I)$ , respectively. We set $\alpha=0.015$ for the new examples and $\alpha=0.001$ for the examples in Section 1. Figure 4 shows $\|x_{t}-\operatorname*{arg\,min}_{x}f(x)\|$ versus $t$ . We can see DP-SGD with gradient clipping converges to non-optimal points as predicted by theory. In contrast, pre-clipping perturbation ensures convergence.

Conclusion and Future Work

In this paper, we provide a theoretical analysis on the effect of gradient clipping in SGD and private SGD. We provide a new way to quantify the clipping bias by coupling the gradient distribution with a geometrically symmetric distribution. Combined with our empirical evaluation showing that gradient distribution of private SGD follows some symmetric structure along the trajectory, these results provide an explanation why gradient clipping works in practice. We also provide a perturbation-based technique to reduce the clipping bias even for adversarial instances.

There are some interesting directions for future work. One main message of this paper is that when gradient distribution is symmetric, gradient clipping will not be detrimental to the performance of DP-SGD. Thus, looking for methods to symmetrify the gradient distribution could be an interesting topic. Another interesting direction is to study gradient distribution of different types of models empirically. We notice the gradient distribution of CNNs on MNIST and CIFAR-10 might be symmetric and a clipping threshold around 1 works well. However, McMahan et al. (2017) found a relatively large clipping threshold around 10 works best for LSTMs. This implies the gradient distribution on some models might be less symmetric and a concrete empirical analysis on it could motivate future research. Finally, it could be interesting to investigate properties of gradient clipping on a broader class of gradient distributions beyond symmetricity.

Acknowledgement

The research is supported in part by a NSF grant CMMI-1727757, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Facebook Research Award.

References

Appendix A Proof of Theorem 1

Then, by update rule and the fact that $\|g_{t}\|\leq c$ , we have

Take expectation, sum over $t$ from 1 to $T$ , divide both sides by $T\alpha$ , rearranging and substituting into $\alpha=\frac{1}{\sqrt{T}}$ , we get

where $D_{f}=f(x_{1})-\min_{x}f(x)$ . $\square$

Appendix B Proof of Theorem 2

Let us first consider the case with $\|\nabla f(x_{t})\|\leq\frac{3}{4}c$ .

where $\bar{g}_{t}\triangleq\nabla f(x_{t})$ and the last equality is because $\langle a,b\rangle=\|a\|\|b\|\cos(a,b)$ for any vector $a,b$ , and that the clipping operation keeps directions.

Now it reduces to analyzing $T_{2}(\xi_{t})$ . Our target now is to prove $T_{2}(\xi_{t})\geq 0$ .

Let us first consider the case where $||\bar{g}_{t}+\xi_{t}||\geq c$ and $||\bar{g}_{t}-\xi_{t}||\geq c$ . In this case, we have

Another case is one of $||\bar{g}_{t}+\xi_{t}||$ and $||\bar{g}_{t}-\xi_{t}||$ is less than $c$ . Assume $\cos(\bar{g}_{t},\bar{g}_{t}-\xi_{t})<0$ . In this case, we must have $\cos(\bar{g}_{t},-\xi_{t})<0$ and $cos(\bar{g}_{t},\bar{g}_{t}+\xi_{t})>0$ from basic properties of trigonometric functions. Then, from Lemma 2, we have

where the last inequality is due to Lemma 1.

Similar argument applies to the case with $cos(\bar{g}_{t},\bar{g}_{t}+\xi_{t})<0$ .

For the case with $cos(\bar{g}_{t},\bar{g}_{t}+\xi_{t})\geq 0$ and $cos(\bar{g}_{t},\bar{g}_{t}-\xi_{t})\geq 0$ , we trivially have $T_{2}(\xi_{t})\geq 0$ . Thus, we have

B.2 When gradient is large

Now let us look at the case where gradient is large, i.e. $\|\nabla f(x_{t})\|\geq\frac{3}{4}c$ .

where the last equality is by definition of cosine and the fact that the clipping operation keeps directions.

In the following, we want to show $T_{7}$ is an non-decreasing function of $\|\nabla f(x_{t})\|$ , then the result can be directly obtained from first part of the theorem.

Notice that $T_{7}$ is invariant to simultaneous rotation of $\nabla f(x_{t})$ and the noise distribution (i.e., changing the coordinate axis of the space). Thus, wlog, we can assume $\nabla f(x_{t})_{1}=y>0$ and $\nabla f(x_{t})_{i}=0$ for $2\leq i\leq d$ . To show $T_{7}$ is a non-decreasing function of $\|\nabla f(x_{t})\|$ , it is enough to show each term in the integration is an non-decreasing function of $y$ . I.e., it is enough to show that, for all $\xi_{t}$ , the following quantity

is an non-decreasing function of $y$ for $y>0$ when $\nabla f(x_{t})=[y,0,...,0]$ .

First consider the case where $\|\nabla f(x_{t})+\xi_{t}\|\leq c$ . In this case, (22) reduces to

which is a monotonically increasing function of $y$ .

Now consider the case with $\|\nabla f(x_{t})+\xi_{t}\|\geq c$ , we have

which is a non-decreasing function of $y$ .

we have $r^{\prime}(z)=c(1-\frac{z^{2}}{z^{2}+q^{2}})\geq 0$ . The term in RHS of (B.2) can be treated as $z=y+\xi_{t,1}$ and $q^{2}=\sum_{i=2}^{d}\xi_{t,i}^{2}$ .

Since the clipping function is continuous, combined with the above results, we know (22) is an non-decreasing function of $\|\nabla f(x_{t})\|$ .

From first part of the theorem, we know when $\|\nabla f(x_{t})\|=\frac{3}{4}c$ , we have

Combine the above result with (B.2) and the non-decreasing property of $T_{7}$ , we see that when $\|\nabla f(x_{t})\|\geq\frac{3}{4}c$ , the following holds:

B.3 Technical lemmas

To prove the desired result, we only need the numerator of RHS of (B.3) to be non-negative.

Denote $h(\xi)=cos(g,g+\xi)+cos(g,g-\xi)$ , since $h$ is rotation invariant, we can assume $\xi_{1}>0$ and $\xi_{t,i}=0$ for $2\leq i\leq d$ wlog. Also, because $h(\xi)=h(-\xi)$ , we can assume $g_{1}\geq 0$ wlog.

Now suppose $g_{1}=a$ , $\sum_{i=2}^{d}g_{i}^{2}=b^{2}$ , Denote the numerator of RHS of (B.3) as $T_{3}$ , it can be written as

and $e=\frac{\langle g,\xi\rangle}{\|g\|\|\xi\|}=\frac{a}{\sqrt{a^{2}+b^{2}}}$ .

Now let us analyze when $T_{3}$ can be possibly less than 0. Recall that by assumption, $\xi_{1}>0$ and $e\geq 0$ . Then we know $T_{5}\geq 0$ . We have $T_{3}\geq 0$ trivially when $T_{4}\geq 0$ , i.e. when $\xi_{1}e\leq\sqrt{a^{2}+b^{2}}$ .

Now assume $\xi_{1}e>\sqrt{a^{2}+b^{2}}$ . To ensure $T_{3}\geq 0$ , we can alternatively ensure $T_{5}^{2}\geq T_{4}^{2}$ in this case.

For $T_{6}$ , we can further simplify it as

where the last inequality is because $\xi^{2}\geq\xi^{2}e^{2}\geq a^{2}+b^{2}$ and $\xi_{1}a>0$ as assumed previously.

Combining all above, we have $T_{6}\geq 0\implies T_{5}^{2}-T_{4}^{2}\geq 0\implies T_{3}\geq 0\implies cos(g,g+\xi)+cos(g,g-\xi)\geq 0$ which proves the desired result. ∎

Proof: Express $\xi$ using a coordinate system with one axis parallel to $g$ . Define the basis of this coordinate system as $v_{1},v_{2},...v_{d}$ with $v_{1}$ = $g/\|g\|$ . Then we have $\xi=\sum_{i=1}^{d}b_{i}v_{i}$ and $cos(g,\xi)>0$ if and only if $b_{1}>0$ .

Then it is clear that $\|g+\xi\|\geq\|g-\xi\|$ when $b_{1}>0$ which means $cos(g,\xi)>0$ .

Similar arguments applies to the case with $cos(g,\xi)<0$ ∎

Appendix C Proof of Theorem 3

Given $m$ distributions with the pdf of the $i$ th distribution being $p_{i}(\xi_{t})=\phi_{i}(\|\xi_{t}-u_{i}\|)$ for some function $\phi_{i}$ . If $\nabla f(x_{t})=\sum_{i=1}^{m}w_{i}u_{i}$ for some $w_{i}\geq 0,\sum_{i=1}^{m}w_{i}=1$ . Let $p^{\prime}(\xi_{t})=\sum_{i=1}^{m}w_{i}p_{i}(\xi_{t}-\nabla f(x_{t})),$ be a mixture of these distributions with zero mean. If $\langle u_{i},\nabla f(x_{t})\rangle\geq 0,\forall i\in[m]$ , we have

Proof: First, we notice that Theorem 2 can be restated into a more general form as follows.

Now we can use the above results to prove the theorem.

Then, because (39) and $E_{\xi_{t}\sim{p_{i}}}[g_{t}]=u_{i}$ and that $p_{i}$ corresponds to a noise with spherical distribution added to $u_{i}$ , we have

Appendix D Proof of Theorem 5

Recall the algorithm has the following update rule

where $g_{t,i}\triangleq\nabla f(x_{t})+\xi_{t,i}$ is the stochastic gradient at iteration $t$ evaluated on sample $i$ , and $S_{t}$ is a subset of whole dataset $D$ ; $Z_{t}\sim\mathcal{N}(0,\sigma^{2}I)$ is the noise added for privacy. We denote $g_{t}=\frac{1}{|S_{t}|}\sum_{i\in S_{t}}\text{clip}(\nabla f(x_{t})+\xi_{t,i},c)$ in the remaining parts of the proof to simplify notation.

Following traditional convergence analysis of SGD using smoothness assumption, we first have

Taking expectation conditioned on $x_{t}$ , we have

Take overall expectation and sum over $t\in[T]$ and rearrange, we have

To achieve $(\epsilon,\delta)$ -privacy, we need $\sigma^{2}=v\frac{Tc^{2}\ln(\frac{1}{\delta})}{n^{2}\epsilon^{2}}$ for some constant $v$ by Theorem 1 in Abadi et al. [2016b]. Substituting the expression of $\sigma^{2}$ into the above inequality, we get

where we define $D_{f}=f(x_{1})-\min_{x}f(x)$ .

Setting $T\alpha=\frac{\sqrt{D_{f}}n\epsilon}{\sqrt{G}c\sqrt{d}\sqrt{\ln(\frac{1}{\delta})}}$ , we have

Setting $\alpha=\frac{\sqrt{D_{f}d\ln(\frac{1}{\delta})}}{n\epsilon c\sqrt{G}}$ , we have

The remaining step is to analyze the term on LHS of (52). We first notice that the gradient sampling scheme yields

with $\xi_{t}$ being a discrete random variable that can takes values $\xi_{t,i},i\in D$ with equal probability and $D$ is the whole dataset.

Now it is time to split the bias as following.

when $\|\nabla f(x_{t})\|\leq\frac{3}{4}c$ and

when $\|\nabla f(x_{t})\|\geq\frac{3}{4}c$ .

Now we bound the bias term using Wasserstein distance as follows.

which we define $W_{v,c}(p,p^{\prime})$ as the Wasserstein distance between $p$ and $p^{\prime}$ using the metric $d_{v,c}$ .

Appendix E Proof of Theorem 6

Let $g_{t}=\text{clip}(\nabla f(x_{t})+\xi_{t}+k\zeta_{t},c)$ and $\zeta_{t}\sim\mathcal{N}(0,I)$ . Then gradient clipping algorithm has following properties:

where $\sigma^{2}_{\xi_{t}}$ is the variance of the gradient noise $\xi_{t}$ .

where the second equality is obtained by applying (61) and using the fact that $\xi_{t}$ is zero mean.

Noticing that $\tau\leq 1$ and define $\hat{W}_{t}=W_{t}^{\prime}-\tau\frac{\xi_{t}}{k}$ , we have

and the integration term only depends on $\|\frac{\xi_{t}}{k}\|$ due to sphere symmetricity of $\psi$ . Thus we can assume $\xi_{t,1}=\|\xi_{t}\|$ and $\xi_{t,i}=0$ for $i\geq 2$ , wlog. Then, we have

where we have define $\hat{W}_{t}=W_{t}^{\prime}-\tau\frac{\xi_{t}}{k}$ and $q=\int_{-\infty}^{\infty}|h^{\prime\prime}(x)|dx$ with $h(x)$ being the pdf of 1-dimensional standard normal distribution. Thus, $q$ is a dimension independent constant.

Combining (E), (68), and (67) finishes the proof. ∎

Appendix F More experiments on random projection

We show the projection of stochastic gradients into 2d space described in Section 4 for different projection matrices in Figure 5-8. It can be seen that as the training progresses, the gradient distribution in 2d space tend to be increasingly more symmetric.

Appendix G Evaluation on the probability term

Despite the discussions above, the distribution of norm of stochastic gradients and noise norm combined with the 2d visualization experiments implies the noise on gradient might follow a mixture of distributions with each component being approximate symmetric. Especially one component may correspond to a approximate 0 mean distribution of stochastic gradients. Intuitively this can be true since each class of data may corresponds to a few variations of stochastic gradients and the gradient noise is observed to be low rank in Li et al. . We have some discussions in Section 2.2 to explain how convergence can be achieved in the cases of symmetric distribution mixtures but it may not be the complete explanation here. Further exploration of gradient distribution in practice is an important question and we leave it for future research.

Appendix H Additional results and discussions on the probability term and the noise adding approach in Section 5

The results below verify our lower bound.