Understanding Gradient Clipping in Private SGD: A Geometric Perspective

Xiangyi Chen, Zhiwei Steven Wu, Mingyi Hong

Introduction

However, the clipping operation can create a substantial bias in the update direction. To illustrate this clipping bias, consider the following two optimization problems even without the privacy constraint.

Example 2. Let f(x)=12i=1212(xai)2f(x)=\frac{1}{2}\sum_{i=1}^{2}\frac{1}{2}(x-a_{i})^{2}, where a1=3a_{1}=-3 and a2=3a_{2}=3. The minimum of ff is achieved at x=0x^{*}=0, where the expected clipped gradient is also 0. However, SGD with clipped gradients and c=1c=1 may never converge to xx^{*} since the expected clipped gradients are all 0 for any xx\in, which means all these points are ”stationary” for the algorithm.

Both examples above show that clipping bias can prevent convergence in the worst case. Existing analyses on gradient clipping quantify this clipping bias either with 1) the difference between clipped and unclipped gradients (Pichapati et al., 2019), or 2) the fraction of examples with gradient norms exceeding the clip threshold cc (Zhang et al., 2019). These approaches suggest that a small clip threshold will lead to large clipping bias and worsen the training performance of DP-SGD. However, in practice, DP-SGD often remains effective even with a small clip threshold, which indicates a gap in the current theoretical understanding of gradient clipping.

We study the effects of gradient clipping on SGD and DP-SGD and provide:

Theoretical and empirical evaluation of DP-SGD. Building on the SGD analysis, we obtain a similar convergence guarantee on DP-SGD with gradient clipping. Importantly, we are able to prove such convergence guarantee even without Lipschitzness of the loss function, which is often required for DP-SGD analyses. We also provide extensive empirical studies to investigate the gradient distributions of DP-SGD across different epoches on two real datasets. To visualize the symmetricity of the gradient distributions, we perform multiple random projections on the gradients and examine the two-dimensional projected distributions. Our results suggest that the gradient distributions in DP-SGD quickly exhibit symmetricity, despite the asymmetricity at initialization.

Gradient correction mechanism. Finally, we provide a simple modification to DP-SGD that can mitigate the clipping bias. We show that perturbing the gradients before clipping can provably reduce the clipping bias for any gradient distribution. The pre-clipping perturbation does not by itself provide privacy guarantees, but can trade-off the clipping bias with higher variance.

2 Related work

Convergence of SGD with clipped gradient

In this section, we analyze convergence of SGD with clipped gradient, but without the Gaussian perturbation. This simplification is useful for isolating the clipping bias. Consider the standard stochastic optimization formulation

where DD denotes the underlying distribution over the examples ss. In the next section, we will instantiate DD as the empirical distribution over the private dataset. We assume that the algorithm is given access to a stochastic gradient oracle: given any iterate xtx_{t} of SGD, the oracle returns f(xt)+ξt\nabla f(x_{t})+\xi_{t}, where ξt\xi_{t} is independent noise with zero mean. In addition, we assume f(x)f(x) is G-smooth, i.e. f(x)f(y)Gxy,x,y\|\nabla f(x)-\nabla f(y)\|\leq G\|x-y\|,\forall x,y. At each iteration tt, SGD with gradient clipping performs the following update:

where gt=clip(f(xt)+ξt,c)g_{t}=\text{clip}({\nabla f(x_{t})+\xi_{t}},c) denotes the realized clipped gradient.

To carry out the analysis of iteration (3), we first note that the standard convergence analysis for SGD-type method consists of two main steps:

S2) Show that the aforementioned quantity is proportional to f(xt)2\|\nabla f(x_{t})\|^{2} or cf(xt)c\|\nabla f(x_{t})\|, indicating that the size of gradient also decreases to zero.

In our analysis below, we will see that showing the first step is relatively easy, while the main challenge is to show that the second step holds true.Our first result is given below.

Let GG be the Lipschitz constant of f\nabla f such that f(x)f(y)Gxy,x,y\|\nabla f(x)-\nabla f(y)\|\leq G\|x-y\|,\forall x,y. For SGD with gradient clipping of threshold cc, if we set α=1T\alpha=\frac{1}{\sqrt{T}}, we have

A straightforward ”nice” distribution will be one that can ensure f(xt),gtΩ(f(xt)22), gt\langle\nabla f(x_{t}),g_{t}\rangle\geq\Omega(\|\nabla f(x_{t})\|_{2}^{2}),\ \forall g_{t}, i.e. all stochastic gradients are positively aligned with the true gradient. This may be satisfied when the gradient is large and the noise ξ\xi is bounded. However, when the gradient is small, it is hard to argue that this can still be true in general. Specifically, in the training of neural nets, the cosine similarities between many stochastic gradients and the true gradient (i.e. cos(f(xt),f(xt)+ξt\cos(\nabla f(x_{t}),\nabla f(x_{t})+\xi_{t})) can be negative, which implies that this assumption does not hold (see Figure 3 in Section 4).

Although Figure 3 seems to exclude the ideal distribution, we observe that the distribution of cosine of the gradients appears to be symmetric. Will such a ”symmetricity” property help define the ”nice” distribution for gradient clipping? If so, how to characterizes the performance of gradient clipping in this situation? In the following result, we rigorously answer to these questions.

Theorem 2 states that when the noise distribution is symmetric, gradient clipping will keep the expected clipped gradients positively aligned with the true gradient. This is the desired property that can guarantee convergence. The probability term characterizes the possible slow down caused by gradient clipping, we provide more discussions and experiments on this term in the Appendix.

Combining Theorem 2 with Theorem 1, we have Corollary 1 to fully characterize the convergence behavior of SGD with gradient clipping.

2 Beyond symmetric distributions

Mixture of symmetric or positively skewed. If pp is a mixture of multiple symmetric or positively skewed distributions, one can split the distributions into multiple ones and use their individual properties. E.g. one can easily establish convergence guarantee for pp being a mixture of mm spherical distributions with mean u1,...,umu_{1},...,u_{m} and f(xt),ui0,i[m]\langle f(x_{t}),u_{i}\rangle\geq 0,\forall i\in[m] as in the following theorem.

Given mm distributions with the pdf of the iith distribution being pi(ξ)=ϕi(ξui)p_{i}(\xi)=\phi_{i}(\|\xi-u_{i}\|) for some function ϕi\phi_{i}. If f(xt)=i=1nwiui\nabla f(x_{t})=\sum_{i=1}^{n}w_{i}u_{i} for some wi0,i=1mwi=1w_{i}\geq 0,\sum_{i=1}^{m}w_{i}=1. Let p(ξ)=i=1mwipi(ξf(xt)),p^{\prime}(\xi)=\sum_{i=1}^{m}w_{i}p_{i}(\xi-\nabla f(x_{t})), be a mixture of these distributions with zero mean. If ui,f(xt)0,i[m]\langle u_{i},\nabla f(x_{t})\rangle\geq 0,\forall i\in[m], we have

Besides these examples of favorable biases above, there are also many cases where btb_{t} can be negative and lead to a convergence gap, such as negatively skewed distributions or multimodal distributions with highly imbalanced modes. We have illustrated possible distributions in our divergence examples (Examples 1 and 2). In such cases, one should expect that clipping has an adversarial impact on the convergence guarantee. However, as we also show in Section 4, the gradient distributions on real datasets tend to be symmetric, and so their clipping bias to be small.

DP-SGD with Gradient Clipping

where StS_{t} is a random subsample of SS (with replacement)Alternatively, subsampling with replacement (Wang et al., 2019) and Poisson subsampling (Zhu and Wang, 2019) have also been proposed. and ZtN(0,σ2I)Z_{t}\sim\mathcal{N}(0,\sigma^{2}I) is the noise added for privacy. We first recall the privacy guarantee of the algorithm below:

There exist constants uu and vv so that given the number of iterations TT, for any ϵuq2T\epsilon\leq uq^{2}T, where q=Stnq=\frac{|S_{t}|}{n}, DP-SGD with gradient clipping of threshold cc is (ϵ,δ)(\epsilon,\delta)-differentially private for any δ>0\delta>0, if σ2vc2Tln(1δ)n2ϵ2\sigma^{2}\geq v\frac{c^{2}T\ln\left(\frac{1}{\delta}\right)}{n^{2}\epsilon^{2}}.

By accounting for the sub-sampling noise and Gaussian perturbation in DP-SGD, we obtain the following convergence guarantee, where we further bound the clipping bias term btb_{t} with the Wasserstein distance between the gradient distribution and a coupling symmetric distribution.

where hc(y)=min(y2,34cy)h_{c}(y)=\min(y^{2},\frac{3}{4}cy) and Wv,c(p,p)W_{v,c}(p,p^{\prime}) is the Wasserstein distance between pp and pp^{\prime} with metric function dv,c(a,b)=v,clip(v+a,c)v,clip(v+b,c)d_{v,c}(a,b)=|\langle v,\text{clip}(v+a,c)\rangle-\langle v,\text{clip}(v+b,c)\rangle| and Dff(x1)minxf(x)D_{f}\geq f(x_{1})-\min_{x}f(x).

Remark on convergence rate. DP-SGD achieves convergence rate of O(d/(nϵ))O({\sqrt{d}}/{(n\epsilon)}) in the existing literature. As shown in Theorem 5, with gradient clipping, the rate becomes O(d/(nϵ))+clipping biasO({\sqrt{d}}/{(n\epsilon)})+\text{clipping bias}. When gradient distribution is symmetric, the convergence rate of O(d/(nϵ))O({\sqrt{d}}/{(n\epsilon)}) can be recovered. This result implies that when gradient distribution is symmetric, the clipping operation will only affect the convergence rate of DP-SGD by a constant factor. In addition, since the clipping bias is the Wasserstein distance between the empirical gradient distribution and an oracle symmetric distribution, it can be small when the gradient distribution is approximately symmetric.

Experiments

In this section, we investigate whether the gradient distributions of DP-SGD are approximate symmetric in practice. However, since the gradient distributions are high-dimensional, certifying symmetricity is in general intractable. We instead consider two simple proxy measures and visualizations.

Setup. We run DP-SGD implemented in Tensorflow https://github.com/tensorflow/privacy/tree/master/tutorials on two popular datasets MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009). For MNIST, we train a CNN with two convolution layers with 16 4×\times4 kernels followed by a fully connected layer with 32 nodes. We use DP-SGD to train the model with α=0.15\alpha=0.15, and a batchsize of 128. For CIFAR-10, we train a CNN with two convolutional layers with 2×\times2 max pooling of stride 2 followed by a fully connected layer, all using ReLU activation, each layer uses a dropout rate of 0.5. The two convolution layer has 32 and 64 3×\times3 kernels, the fully connected layer has 1500 nodes. We use α=0.001\alpha=0.001 and decrease it by 10 times every 20 epochs. The clip norm of both experiments is set to be c=1c=1 and the noise multiplier is 1.1.

Visualization with random projections. We visualize the gradient distribution by projecting the gradient to a two-dimensional space using random Gaussian matrices. Note that given any symmetric distribution, its two-dimensional projection remains symmetric for any projection matrix. On the contrary, if for all projection matrix, the projected gradient distribution is symmetric, the original gradient distribution should also be symmetric. We repeat the projection using different randomly generated matrices and visualize the induced distributions.

From Figure 1, we can see that on both datasets, the gradient distribution is non-symmetric before training (Epoch 0), but over the epochs, the gradient distributions become increasingly symmetric. The distribution of gradients on MNIST at the end of epoch 9 projected to a random two-dimensional space using different random matrices is shown in Figure 2. It can be seen that the approximate symmetric property holds for all 8 realizations. We provide many more visualizations from different realized random projections across different epochs in the Appendix.

Symmetricity of angles. We also measure the cosine similarities between per-sample stochastic gradients and the true gradient. We observe that the cosine similarities between per-sample stochastic gradients and the true gradient (i.e. cos(f(xt)+ξt,i,f(xt))\cos(\nabla f(x_{t})+\xi_{t,i},\nabla f(x_{t}))) is approximate symmetric around 0 as shown in the histograms in Figure 3.

Mitigating Clipping Bias with Perturbation

From previous analyses, SGD with gradient clipping and DP-SGD have good convergence performance when the gradient noise distribution is approximately symmetric or when the gradient bias favors convergence (e.g., mixture of symmetric distributions with aligned mean). Although in practice, gradient distributions do exhibit (approximate) symmetry (see Sec. 4), it would be desirable to have tools to handle situations where the clipping bias does not favor convergence. Now we provide an approach to decrease the bias. If one adds some Gaussian noise before clipping, i.e.

we can prove bt=O(σξt2k2)|b_{t}|=O\left(\frac{\sigma^{2}_{\xi_{t}}}{k^{2}}\right) as in Theorem 6.

Let gt=clip(f(xt)+ξt+kζt,c)g_{t}=\text{clip}(\nabla f(x_{t})+\xi_{t}+k\zeta_{t},c) and ζtN(0,I)\zeta_{t}\sim\mathcal{N}(0,I). Then gradient clipping algorithm has following properties:

where σξt2\sigma^{2}_{\xi_{t}} is the variance of the gradient noise ξt\xi_{t}.

More discussion can be found in the Appendix. Note that when the perturbation approach is applied to DP-SGD, the update rule (8) becomes

where each per-sample stochastic gradient is be perturbed by the noise. By adding the noise, one trade-offs bias with variance. Larger noise make the algorithm converges possibly slower but better. This trick can be helpful when the gradient distribution is not favorable. To verify its effect in practice, we run DP-SGD with gradient clipping on a few unfavorable problems including examples in Section 1 and a new high dimensional example. We set σ=1\sigma=1 on all the examples (i.e. ZtN(0,I)Z_{t}\sim\mathcal{N}(0,I)). For the new example, we minimize the function f(x)=1ni=1n12xzi2f(x)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\|x-z_{i}\|^{2} with n=10000n=10000. Each ziz_{i} is drawn from a mixture of isotropic Gaussian with 3 components of dimension 10. The covariance matrix of all components is II and the means of the 3 components are drawn from N(0,36I)\mathcal{N}(0,36I), N(0,4I)\mathcal{N}(0,4I), N(0,I)\mathcal{N}(0,I), respectively. We set α=0.015\alpha=0.015 for the new examples and α=0.001\alpha=0.001 for the examples in Section 1. Figure 4 shows xtarg minxf(x)\|x_{t}-\operatorname*{arg\,min}_{x}f(x)\| versus tt. We can see DP-SGD with gradient clipping converges to non-optimal points as predicted by theory. In contrast, pre-clipping perturbation ensures convergence.

Conclusion and Future Work

In this paper, we provide a theoretical analysis on the effect of gradient clipping in SGD and private SGD. We provide a new way to quantify the clipping bias by coupling the gradient distribution with a geometrically symmetric distribution. Combined with our empirical evaluation showing that gradient distribution of private SGD follows some symmetric structure along the trajectory, these results provide an explanation why gradient clipping works in practice. We also provide a perturbation-based technique to reduce the clipping bias even for adversarial instances.

There are some interesting directions for future work. One main message of this paper is that when gradient distribution is symmetric, gradient clipping will not be detrimental to the performance of DP-SGD. Thus, looking for methods to symmetrify the gradient distribution could be an interesting topic. Another interesting direction is to study gradient distribution of different types of models empirically. We notice the gradient distribution of CNNs on MNIST and CIFAR-10 might be symmetric and a clipping threshold around 1 works well. However, McMahan et al. (2017) found a relatively large clipping threshold around 10 works best for LSTMs. This implies the gradient distribution on some models might be less symmetric and a concrete empirical analysis on it could motivate future research. Finally, it could be interesting to investigate properties of gradient clipping on a broader class of gradient distributions beyond symmetricity.

Acknowledgement

The research is supported in part by a NSF grant CMMI-1727757, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Facebook Research Award.

References

Appendix A Proof of Theorem 1

Then, by update rule and the fact that gtc\|g_{t}\|\leq c, we have

Take expectation, sum over tt from 1 to TT, divide both sides by TαT\alpha, rearranging and substituting into α=1T\alpha=\frac{1}{\sqrt{T}}, we get

where Df=f(x1)minxf(x)D_{f}=f(x_{1})-\min_{x}f(x). \square

Appendix B Proof of Theorem 2

Let us first consider the case with f(xt)34c\|\nabla f(x_{t})\|\leq\frac{3}{4}c.

where gˉtf(xt)\bar{g}_{t}\triangleq\nabla f(x_{t}) and the last equality is because a,b=abcos(a,b)\langle a,b\rangle=\|a\|\|b\|\cos(a,b) for any vector a,ba,b, and that the clipping operation keeps directions.

Now it reduces to analyzing T2(ξt)T_{2}(\xi_{t}). Our target now is to prove T2(ξt)0T_{2}(\xi_{t})\geq 0.

Let us first consider the case where gˉt+ξtc||\bar{g}_{t}+\xi_{t}||\geq c and gˉtξtc||\bar{g}_{t}-\xi_{t}||\geq c. In this case, we have

Another case is one of gˉt+ξt||\bar{g}_{t}+\xi_{t}|| and gˉtξt||\bar{g}_{t}-\xi_{t}|| is less than cc. Assume cos(gˉt,gˉtξt)<0\cos(\bar{g}_{t},\bar{g}_{t}-\xi_{t})<0. In this case, we must have cos(gˉt,ξt)<0\cos(\bar{g}_{t},-\xi_{t})<0 and cos(gˉt,gˉt+ξt)>0cos(\bar{g}_{t},\bar{g}_{t}+\xi_{t})>0 from basic properties of trigonometric functions. Then, from Lemma 2, we have

where the last inequality is due to Lemma 1.

Similar argument applies to the case with cos(gˉt,gˉt+ξt)<0cos(\bar{g}_{t},\bar{g}_{t}+\xi_{t})<0.

For the case with cos(gˉt,gˉt+ξt)0cos(\bar{g}_{t},\bar{g}_{t}+\xi_{t})\geq 0 and cos(gˉt,gˉtξt)0cos(\bar{g}_{t},\bar{g}_{t}-\xi_{t})\geq 0, we trivially have T2(ξt)0T_{2}(\xi_{t})\geq 0. Thus, we have

B.2 When gradient is large

Now let us look at the case where gradient is large, i.e. f(xt)34c\|\nabla f(x_{t})\|\geq\frac{3}{4}c.

where the last equality is by definition of cosine and the fact that the clipping operation keeps directions.

In the following, we want to show T7T_{7} is an non-decreasing function of f(xt)\|\nabla f(x_{t})\|, then the result can be directly obtained from first part of the theorem.

Notice that T7T_{7} is invariant to simultaneous rotation of f(xt)\nabla f(x_{t}) and the noise distribution (i.e., changing the coordinate axis of the space). Thus, wlog, we can assume f(xt)1=y>0\nabla f(x_{t})_{1}=y>0 and f(xt)i=0\nabla f(x_{t})_{i}=0 for 2id2\leq i\leq d. To show T7T_{7} is a non-decreasing function of f(xt)\|\nabla f(x_{t})\|, it is enough to show each term in the integration is an non-decreasing function of yy. I.e., it is enough to show that, for all ξt\xi_{t}, the following quantity

is an non-decreasing function of yy for y>0y>0 when f(xt)=[y,0,...,0]\nabla f(x_{t})=[y,0,...,0] .

First consider the case where f(xt)+ξtc\|\nabla f(x_{t})+\xi_{t}\|\leq c. In this case, (22) reduces to

which is a monotonically increasing function of yy.

Now consider the case with f(xt)+ξtc\|\nabla f(x_{t})+\xi_{t}\|\geq c, we have

which is a non-decreasing function of yy.

we have r(z)=c(1z2z2+q2)0r^{\prime}(z)=c(1-\frac{z^{2}}{z^{2}+q^{2}})\geq 0. The term in RHS of (B.2) can be treated as z=y+ξt,1z=y+\xi_{t,1} and q2=i=2dξt,i2q^{2}=\sum_{i=2}^{d}\xi_{t,i}^{2}.

Since the clipping function is continuous, combined with the above results, we know (22) is an non-decreasing function of f(xt)\|\nabla f(x_{t})\|.

From first part of the theorem, we know when f(xt)=34c\|\nabla f(x_{t})\|=\frac{3}{4}c, we have

Combine the above result with (B.2) and the non-decreasing property of T7T_{7}, we see that when f(xt)34c\|\nabla f(x_{t})\|\geq\frac{3}{4}c, the following holds:

B.3 Technical lemmas

To prove the desired result, we only need the numerator of RHS of (B.3) to be non-negative.

Denote h(ξ)=cos(g,g+ξ)+cos(g,gξ)h(\xi)=cos(g,g+\xi)+cos(g,g-\xi), since hh is rotation invariant, we can assume ξ1>0\xi_{1}>0 and ξt,i=0\xi_{t,i}=0 for 2id2\leq i\leq d wlog. Also, because h(ξ)=h(ξ)h(\xi)=h(-\xi), we can assume g10g_{1}\geq 0 wlog.

Now suppose g1=ag_{1}=a, i=2dgi2=b2\sum_{i=2}^{d}g_{i}^{2}=b^{2}, Denote the numerator of RHS of (B.3) as T3T_{3}, it can be written as

and e=g,ξgξ=aa2+b2e=\frac{\langle g,\xi\rangle}{\|g\|\|\xi\|}=\frac{a}{\sqrt{a^{2}+b^{2}}}.

Now let us analyze when T3T_{3} can be possibly less than 0. Recall that by assumption, ξ1>0\xi_{1}>0 and e0e\geq 0. Then we know T50T_{5}\geq 0. We have T30T_{3}\geq 0 trivially when T40T_{4}\geq 0, i.e. when ξ1ea2+b2\xi_{1}e\leq\sqrt{a^{2}+b^{2}}.

Now assume ξ1e>a2+b2\xi_{1}e>\sqrt{a^{2}+b^{2}}. To ensure T30T_{3}\geq 0, we can alternatively ensure T52T42T_{5}^{2}\geq T_{4}^{2} in this case.

For T6T_{6}, we can further simplify it as

where the last inequality is because ξ2ξ2e2a2+b2\xi^{2}\geq\xi^{2}e^{2}\geq a^{2}+b^{2} and ξ1a>0\xi_{1}a>0 as assumed previously.

Combining all above, we have T60    T52T420    T30    cos(g,g+ξ)+cos(g,gξ)0T_{6}\geq 0\implies T_{5}^{2}-T_{4}^{2}\geq 0\implies T_{3}\geq 0\implies cos(g,g+\xi)+cos(g,g-\xi)\geq 0 which proves the desired result. ∎

Proof: Express ξ\xi using a coordinate system with one axis parallel to gg. Define the basis of this coordinate system as v1,v2,...vdv_{1},v_{2},...v_{d} with v1v_{1} = g/gg/\|g\|. Then we have ξ=i=1dbivi\xi=\sum_{i=1}^{d}b_{i}v_{i} and cos(g,ξ)>0cos(g,\xi)>0 if and only if b1>0b_{1}>0.

Then it is clear that g+ξgξ\|g+\xi\|\geq\|g-\xi\| when b1>0b_{1}>0 which means cos(g,ξ)>0cos(g,\xi)>0.

Similar arguments applies to the case with cos(g,ξ)<0cos(g,\xi)<0

Appendix C Proof of Theorem 3

Given mm distributions with the pdf of the iith distribution being pi(ξt)=ϕi(ξtui)p_{i}(\xi_{t})=\phi_{i}(\|\xi_{t}-u_{i}\|) for some function ϕi\phi_{i}. If f(xt)=i=1mwiui\nabla f(x_{t})=\sum_{i=1}^{m}w_{i}u_{i} for some wi0,i=1mwi=1w_{i}\geq 0,\sum_{i=1}^{m}w_{i}=1. Let p(ξt)=i=1mwipi(ξtf(xt)),p^{\prime}(\xi_{t})=\sum_{i=1}^{m}w_{i}p_{i}(\xi_{t}-\nabla f(x_{t})), be a mixture of these distributions with zero mean. If ui,f(xt)0,i[m]\langle u_{i},\nabla f(x_{t})\rangle\geq 0,\forall i\in[m], we have

Proof: First, we notice that Theorem 2 can be restated into a more general form as follows.

Now we can use the above results to prove the theorem.

Then, because (39) and Eξtpi[gt]=uiE_{\xi_{t}\sim{p_{i}}}[g_{t}]=u_{i} and that pip_{i} corresponds to a noise with spherical distribution added to uiu_{i}, we have

Appendix D Proof of Theorem 5

Recall the algorithm has the following update rule

where gt,if(xt)+ξt,ig_{t,i}\triangleq\nabla f(x_{t})+\xi_{t,i} is the stochastic gradient at iteration tt evaluated on sample ii, and StS_{t} is a subset of whole dataset DD; ZtN(0,σ2I)Z_{t}\sim\mathcal{N}(0,\sigma^{2}I) is the noise added for privacy. We denote gt=1StiStclip(f(xt)+ξt,i,c)g_{t}=\frac{1}{|S_{t}|}\sum_{i\in S_{t}}\text{clip}(\nabla f(x_{t})+\xi_{t,i},c) in the remaining parts of the proof to simplify notation.

Following traditional convergence analysis of SGD using smoothness assumption, we first have

Taking expectation conditioned on xtx_{t}, we have

Take overall expectation and sum over t[T]t\in[T] and rearrange, we have

To achieve (ϵ,δ)(\epsilon,\delta)-privacy, we need σ2=vTc2ln(1δ)n2ϵ2\sigma^{2}=v\frac{Tc^{2}\ln(\frac{1}{\delta})}{n^{2}\epsilon^{2}} for some constant vv by Theorem 1 in Abadi et al. [2016b]. Substituting the expression of σ2\sigma^{2} into the above inequality, we get

where we define Df=f(x1)minxf(x)D_{f}=f(x_{1})-\min_{x}f(x).

Setting Tα=DfnϵGcdln(1δ)T\alpha=\frac{\sqrt{D_{f}}n\epsilon}{\sqrt{G}c\sqrt{d}\sqrt{\ln(\frac{1}{\delta})}}, we have

Setting α=Dfdln(1δ)nϵcG\alpha=\frac{\sqrt{D_{f}d\ln(\frac{1}{\delta})}}{n\epsilon c\sqrt{G}}, we have

The remaining step is to analyze the term on LHS of (52). We first notice that the gradient sampling scheme yields

with ξt\xi_{t} being a discrete random variable that can takes values ξt,i,iD\xi_{t,i},i\in D with equal probability and DD is the whole dataset.

Now it is time to split the bias as following.

when f(xt)34c\|\nabla f(x_{t})\|\leq\frac{3}{4}c and

when f(xt)34c\|\nabla f(x_{t})\|\geq\frac{3}{4}c.

Now we bound the bias term using Wasserstein distance as follows.

which we define Wv,c(p,p)W_{v,c}(p,p^{\prime}) as the Wasserstein distance between pp and pp^{\prime} using the metric dv,cd_{v,c}.

Appendix E Proof of Theorem 6

Let gt=clip(f(xt)+ξt+kζt,c)g_{t}=\text{clip}(\nabla f(x_{t})+\xi_{t}+k\zeta_{t},c) and ζtN(0,I)\zeta_{t}\sim\mathcal{N}(0,I). Then gradient clipping algorithm has following properties:

where σξt2\sigma^{2}_{\xi_{t}} is the variance of the gradient noise ξt\xi_{t}.

where the second equality is obtained by applying (61) and using the fact that ξt\xi_{t} is zero mean.

Noticing that τ1\tau\leq 1 and define W^t=Wtτξtk\hat{W}_{t}=W_{t}^{\prime}-\tau\frac{\xi_{t}}{k}, we have

and the integration term only depends on ξtk\|\frac{\xi_{t}}{k}\| due to sphere symmetricity of ψ\psi. Thus we can assume ξt,1=ξt\xi_{t,1}=\|\xi_{t}\| and ξt,i=0\xi_{t,i}=0 for i2i\geq 2, wlog. Then, we have

where we have define W^t=Wtτξtk\hat{W}_{t}=W_{t}^{\prime}-\tau\frac{\xi_{t}}{k} and q=h(x)dxq=\int_{-\infty}^{\infty}|h^{\prime\prime}(x)|dx with h(x)h(x) being the pdf of 1-dimensional standard normal distribution. Thus, qq is a dimension independent constant.

Combining (E), (68), and (67) finishes the proof. ∎

Appendix F More experiments on random projection

We show the projection of stochastic gradients into 2d space described in Section 4 for different projection matrices in Figure 5-8. It can be seen that as the training progresses, the gradient distribution in 2d space tend to be increasingly more symmetric.

Appendix G Evaluation on the probability term

Despite the discussions above, the distribution of norm of stochastic gradients and noise norm combined with the 2d visualization experiments implies the noise on gradient might follow a mixture of distributions with each component being approximate symmetric. Especially one component may correspond to a approximate 0 mean distribution of stochastic gradients. Intuitively this can be true since each class of data may corresponds to a few variations of stochastic gradients and the gradient noise is observed to be low rank in Li et al. . We have some discussions in Section 2.2 to explain how convergence can be achieved in the cases of symmetric distribution mixtures but it may not be the complete explanation here. Further exploration of gradient distribution in practice is an important question and we leave it for future research.

Appendix H Additional results and discussions on the probability term and the noise adding approach in Section 5

The results below verify our lower bound.