Understanding Gradient Clipping in Private SGD: A Geometric Perspective
Xiangyi Chen, Zhiwei Steven Wu, Mingyi Hong
Introduction
However, the clipping operation can create a substantial bias in the update direction. To illustrate this clipping bias, consider the following two optimization problems even without the privacy constraint.
Example 2. Let , where and . The minimum of is achieved at , where the expected clipped gradient is also 0. However, SGD with clipped gradients and may never converge to since the expected clipped gradients are all 0 for any , which means all these points are ”stationary” for the algorithm.
Both examples above show that clipping bias can prevent convergence in the worst case. Existing analyses on gradient clipping quantify this clipping bias either with 1) the difference between clipped and unclipped gradients (Pichapati et al., 2019), or 2) the fraction of examples with gradient norms exceeding the clip threshold (Zhang et al., 2019). These approaches suggest that a small clip threshold will lead to large clipping bias and worsen the training performance of DP-SGD. However, in practice, DP-SGD often remains effective even with a small clip threshold, which indicates a gap in the current theoretical understanding of gradient clipping.
We study the effects of gradient clipping on SGD and DP-SGD and provide:
Theoretical and empirical evaluation of DP-SGD. Building on the SGD analysis, we obtain a similar convergence guarantee on DP-SGD with gradient clipping. Importantly, we are able to prove such convergence guarantee even without Lipschitzness of the loss function, which is often required for DP-SGD analyses. We also provide extensive empirical studies to investigate the gradient distributions of DP-SGD across different epoches on two real datasets. To visualize the symmetricity of the gradient distributions, we perform multiple random projections on the gradients and examine the two-dimensional projected distributions. Our results suggest that the gradient distributions in DP-SGD quickly exhibit symmetricity, despite the asymmetricity at initialization.
Gradient correction mechanism. Finally, we provide a simple modification to DP-SGD that can mitigate the clipping bias. We show that perturbing the gradients before clipping can provably reduce the clipping bias for any gradient distribution. The pre-clipping perturbation does not by itself provide privacy guarantees, but can trade-off the clipping bias with higher variance.
2 Related work
Convergence of SGD with clipped gradient
In this section, we analyze convergence of SGD with clipped gradient, but without the Gaussian perturbation. This simplification is useful for isolating the clipping bias. Consider the standard stochastic optimization formulation
where denotes the underlying distribution over the examples . In the next section, we will instantiate as the empirical distribution over the private dataset. We assume that the algorithm is given access to a stochastic gradient oracle: given any iterate of SGD, the oracle returns , where is independent noise with zero mean. In addition, we assume is G-smooth, i.e. . At each iteration , SGD with gradient clipping performs the following update:
where denotes the realized clipped gradient.
To carry out the analysis of iteration (3), we first note that the standard convergence analysis for SGD-type method consists of two main steps:
S2) Show that the aforementioned quantity is proportional to or , indicating that the size of gradient also decreases to zero.
In our analysis below, we will see that showing the first step is relatively easy, while the main challenge is to show that the second step holds true.Our first result is given below.
Let be the Lipschitz constant of such that . For SGD with gradient clipping of threshold , if we set , we have
A straightforward ”nice” distribution will be one that can ensure , i.e. all stochastic gradients are positively aligned with the true gradient. This may be satisfied when the gradient is large and the noise is bounded. However, when the gradient is small, it is hard to argue that this can still be true in general. Specifically, in the training of neural nets, the cosine similarities between many stochastic gradients and the true gradient (i.e. )) can be negative, which implies that this assumption does not hold (see Figure 3 in Section 4).
Although Figure 3 seems to exclude the ideal distribution, we observe that the distribution of cosine of the gradients appears to be symmetric. Will such a ”symmetricity” property help define the ”nice” distribution for gradient clipping? If so, how to characterizes the performance of gradient clipping in this situation? In the following result, we rigorously answer to these questions.
Theorem 2 states that when the noise distribution is symmetric, gradient clipping will keep the expected clipped gradients positively aligned with the true gradient. This is the desired property that can guarantee convergence. The probability term characterizes the possible slow down caused by gradient clipping, we provide more discussions and experiments on this term in the Appendix.
Combining Theorem 2 with Theorem 1, we have Corollary 1 to fully characterize the convergence behavior of SGD with gradient clipping.
2 Beyond symmetric distributions
Mixture of symmetric or positively skewed. If is a mixture of multiple symmetric or positively skewed distributions, one can split the distributions into multiple ones and use their individual properties. E.g. one can easily establish convergence guarantee for being a mixture of spherical distributions with mean and as in the following theorem.
Given distributions with the pdf of the th distribution being for some function . If for some . Let be a mixture of these distributions with zero mean. If , we have
Besides these examples of favorable biases above, there are also many cases where can be negative and lead to a convergence gap, such as negatively skewed distributions or multimodal distributions with highly imbalanced modes. We have illustrated possible distributions in our divergence examples (Examples 1 and 2). In such cases, one should expect that clipping has an adversarial impact on the convergence guarantee. However, as we also show in Section 4, the gradient distributions on real datasets tend to be symmetric, and so their clipping bias to be small.
DP-SGD with Gradient Clipping
where is a random subsample of (with replacement)Alternatively, subsampling with replacement (Wang et al., 2019) and Poisson subsampling (Zhu and Wang, 2019) have also been proposed. and is the noise added for privacy. We first recall the privacy guarantee of the algorithm below:
There exist constants and so that given the number of iterations , for any , where , DP-SGD with gradient clipping of threshold is -differentially private for any , if .
By accounting for the sub-sampling noise and Gaussian perturbation in DP-SGD, we obtain the following convergence guarantee, where we further bound the clipping bias term with the Wasserstein distance between the gradient distribution and a coupling symmetric distribution.
where and is the Wasserstein distance between and with metric function and .
Remark on convergence rate. DP-SGD achieves convergence rate of in the existing literature. As shown in Theorem 5, with gradient clipping, the rate becomes . When gradient distribution is symmetric, the convergence rate of can be recovered. This result implies that when gradient distribution is symmetric, the clipping operation will only affect the convergence rate of DP-SGD by a constant factor. In addition, since the clipping bias is the Wasserstein distance between the empirical gradient distribution and an oracle symmetric distribution, it can be small when the gradient distribution is approximately symmetric.
Experiments
In this section, we investigate whether the gradient distributions of DP-SGD are approximate symmetric in practice. However, since the gradient distributions are high-dimensional, certifying symmetricity is in general intractable. We instead consider two simple proxy measures and visualizations.
Setup. We run DP-SGD implemented in Tensorflow https://github.com/tensorflow/privacy/tree/master/tutorials on two popular datasets MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky et al., 2009). For MNIST, we train a CNN with two convolution layers with 16 44 kernels followed by a fully connected layer with 32 nodes. We use DP-SGD to train the model with , and a batchsize of 128. For CIFAR-10, we train a CNN with two convolutional layers with 22 max pooling of stride 2 followed by a fully connected layer, all using ReLU activation, each layer uses a dropout rate of 0.5. The two convolution layer has 32 and 64 33 kernels, the fully connected layer has 1500 nodes. We use and decrease it by 10 times every 20 epochs. The clip norm of both experiments is set to be and the noise multiplier is 1.1.
Visualization with random projections. We visualize the gradient distribution by projecting the gradient to a two-dimensional space using random Gaussian matrices. Note that given any symmetric distribution, its two-dimensional projection remains symmetric for any projection matrix. On the contrary, if for all projection matrix, the projected gradient distribution is symmetric, the original gradient distribution should also be symmetric. We repeat the projection using different randomly generated matrices and visualize the induced distributions.
From Figure 1, we can see that on both datasets, the gradient distribution is non-symmetric before training (Epoch 0), but over the epochs, the gradient distributions become increasingly symmetric. The distribution of gradients on MNIST at the end of epoch 9 projected to a random two-dimensional space using different random matrices is shown in Figure 2. It can be seen that the approximate symmetric property holds for all 8 realizations. We provide many more visualizations from different realized random projections across different epochs in the Appendix.
Symmetricity of angles. We also measure the cosine similarities between per-sample stochastic gradients and the true gradient. We observe that the cosine similarities between per-sample stochastic gradients and the true gradient (i.e. ) is approximate symmetric around 0 as shown in the histograms in Figure 3.
Mitigating Clipping Bias with Perturbation
From previous analyses, SGD with gradient clipping and DP-SGD have good convergence performance when the gradient noise distribution is approximately symmetric or when the gradient bias favors convergence (e.g., mixture of symmetric distributions with aligned mean). Although in practice, gradient distributions do exhibit (approximate) symmetry (see Sec. 4), it would be desirable to have tools to handle situations where the clipping bias does not favor convergence. Now we provide an approach to decrease the bias. If one adds some Gaussian noise before clipping, i.e.
we can prove as in Theorem 6.
Let and . Then gradient clipping algorithm has following properties:
where is the variance of the gradient noise .
More discussion can be found in the Appendix. Note that when the perturbation approach is applied to DP-SGD, the update rule (8) becomes
where each per-sample stochastic gradient is be perturbed by the noise. By adding the noise, one trade-offs bias with variance. Larger noise make the algorithm converges possibly slower but better. This trick can be helpful when the gradient distribution is not favorable. To verify its effect in practice, we run DP-SGD with gradient clipping on a few unfavorable problems including examples in Section 1 and a new high dimensional example. We set on all the examples (i.e. ). For the new example, we minimize the function with . Each is drawn from a mixture of isotropic Gaussian with 3 components of dimension 10. The covariance matrix of all components is and the means of the 3 components are drawn from , , , respectively. We set for the new examples and for the examples in Section 1. Figure 4 shows versus . We can see DP-SGD with gradient clipping converges to non-optimal points as predicted by theory. In contrast, pre-clipping perturbation ensures convergence.
Conclusion and Future Work
In this paper, we provide a theoretical analysis on the effect of gradient clipping in SGD and private SGD. We provide a new way to quantify the clipping bias by coupling the gradient distribution with a geometrically symmetric distribution. Combined with our empirical evaluation showing that gradient distribution of private SGD follows some symmetric structure along the trajectory, these results provide an explanation why gradient clipping works in practice. We also provide a perturbation-based technique to reduce the clipping bias even for adversarial instances.
There are some interesting directions for future work. One main message of this paper is that when gradient distribution is symmetric, gradient clipping will not be detrimental to the performance of DP-SGD. Thus, looking for methods to symmetrify the gradient distribution could be an interesting topic. Another interesting direction is to study gradient distribution of different types of models empirically. We notice the gradient distribution of CNNs on MNIST and CIFAR-10 might be symmetric and a clipping threshold around 1 works well. However, McMahan et al. (2017) found a relatively large clipping threshold around 10 works best for LSTMs. This implies the gradient distribution on some models might be less symmetric and a concrete empirical analysis on it could motivate future research. Finally, it could be interesting to investigate properties of gradient clipping on a broader class of gradient distributions beyond symmetricity.
Acknowledgement
The research is supported in part by a NSF grant CMMI-1727757, a Google Faculty Research Award, a J.P. Morgan Faculty Award, and a Facebook Research Award.
References
Appendix A Proof of Theorem 1
Then, by update rule and the fact that , we have
Take expectation, sum over from 1 to , divide both sides by , rearranging and substituting into , we get
where .
Appendix B Proof of Theorem 2
Let us first consider the case with .
where and the last equality is because for any vector , and that the clipping operation keeps directions.
Now it reduces to analyzing . Our target now is to prove .
Let us first consider the case where and . In this case, we have
Another case is one of and is less than . Assume . In this case, we must have and from basic properties of trigonometric functions. Then, from Lemma 2, we have
where the last inequality is due to Lemma 1.
Similar argument applies to the case with .
For the case with and , we trivially have . Thus, we have
B.2 When gradient is large
Now let us look at the case where gradient is large, i.e. .
where the last equality is by definition of cosine and the fact that the clipping operation keeps directions.
In the following, we want to show is an non-decreasing function of , then the result can be directly obtained from first part of the theorem.
Notice that is invariant to simultaneous rotation of and the noise distribution (i.e., changing the coordinate axis of the space). Thus, wlog, we can assume and for . To show is a non-decreasing function of , it is enough to show each term in the integration is an non-decreasing function of . I.e., it is enough to show that, for all , the following quantity
is an non-decreasing function of for when .
First consider the case where . In this case, (22) reduces to
which is a monotonically increasing function of .
Now consider the case with , we have
which is a non-decreasing function of .
we have . The term in RHS of (B.2) can be treated as and .
Since the clipping function is continuous, combined with the above results, we know (22) is an non-decreasing function of .
From first part of the theorem, we know when , we have
Combine the above result with (B.2) and the non-decreasing property of , we see that when , the following holds:
B.3 Technical lemmas
To prove the desired result, we only need the numerator of RHS of (B.3) to be non-negative.
Denote , since is rotation invariant, we can assume and for wlog. Also, because , we can assume wlog.
Now suppose , , Denote the numerator of RHS of (B.3) as , it can be written as
and .
Now let us analyze when can be possibly less than 0. Recall that by assumption, and . Then we know . We have trivially when , i.e. when .
Now assume . To ensure , we can alternatively ensure in this case.
For , we can further simplify it as
where the last inequality is because and as assumed previously.
Combining all above, we have which proves the desired result. ∎
Proof: Express using a coordinate system with one axis parallel to . Define the basis of this coordinate system as with = . Then we have and if and only if .
Then it is clear that when which means .
Similar arguments applies to the case with ∎
Appendix C Proof of Theorem 3
Given distributions with the pdf of the th distribution being for some function . If for some . Let be a mixture of these distributions with zero mean. If , we have
Proof: First, we notice that Theorem 2 can be restated into a more general form as follows.
Now we can use the above results to prove the theorem.
Then, because (39) and and that corresponds to a noise with spherical distribution added to , we have
Appendix D Proof of Theorem 5
Recall the algorithm has the following update rule
where is the stochastic gradient at iteration evaluated on sample , and is a subset of whole dataset ; is the noise added for privacy. We denote in the remaining parts of the proof to simplify notation.
Following traditional convergence analysis of SGD using smoothness assumption, we first have
Taking expectation conditioned on , we have
Take overall expectation and sum over and rearrange, we have
To achieve -privacy, we need for some constant by Theorem 1 in Abadi et al. [2016b]. Substituting the expression of into the above inequality, we get
where we define .
Setting , we have
Setting , we have
The remaining step is to analyze the term on LHS of (52). We first notice that the gradient sampling scheme yields
with being a discrete random variable that can takes values with equal probability and is the whole dataset.
Now it is time to split the bias as following.
when and
when .
Now we bound the bias term using Wasserstein distance as follows.
which we define as the Wasserstein distance between and using the metric .
Appendix E Proof of Theorem 6
Let and . Then gradient clipping algorithm has following properties:
where is the variance of the gradient noise .
where the second equality is obtained by applying (61) and using the fact that is zero mean.
Noticing that and define , we have
and the integration term only depends on due to sphere symmetricity of . Thus we can assume and for , wlog. Then, we have
where we have define and with being the pdf of 1-dimensional standard normal distribution. Thus, is a dimension independent constant.
Combining (E), (68), and (67) finishes the proof. ∎
Appendix F More experiments on random projection
We show the projection of stochastic gradients into 2d space described in Section 4 for different projection matrices in Figure 5-8. It can be seen that as the training progresses, the gradient distribution in 2d space tend to be increasingly more symmetric.
Appendix G Evaluation on the probability term
Despite the discussions above, the distribution of norm of stochastic gradients and noise norm combined with the 2d visualization experiments implies the noise on gradient might follow a mixture of distributions with each component being approximate symmetric. Especially one component may correspond to a approximate 0 mean distribution of stochastic gradients. Intuitively this can be true since each class of data may corresponds to a few variations of stochastic gradients and the gradient noise is observed to be low rank in Li et al. . We have some discussions in Section 2.2 to explain how convergence can be achieved in the cases of symmetric distribution mixtures but it may not be the complete explanation here. Further exploration of gradient distribution in practice is an important question and we leave it for future research.
Appendix H Additional results and discussions on the probability term and the noise adding approach in Section 5
The results below verify our lower bound.