Numerical Composition of Differential Privacy
Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz
Introduction
Differential privacy (DP) introduced by [DMNS06] provides a provable and quantifiable guarantee of privacy when the results of an algorithm run on private data are made public. Formally, we can define an -differentially private algorithm as follows.
An algorithm is -DP if for any two neighboring databases differing in exactly one user and any subset of outputs, we have
Intuitively, it says that looking at the outcome of , we cannot tell whether it was run on or . Hence an adversary cannot infer the existence of any particular user in the input database, and therefore cannot learn any personal data of any particular user.
DP algorithms have an important property called composition. Suppose and are DP algorithms and say , i.e., runs both the algorithms on and outputs their results. Then is also a DP algorithm.
If is -DP and is -DP, then is -DP.
This also holds under adaptive composition (denoted by ), where can look at both the database and the output of .Here . It turns out that both compositions enjoy much better DP guarantees than this simple composition rule. Let be an -DP algorithm and let denote the (adaptive) composition of with itself times. The naive composition rule shows that is -DP. This was significantly improved in [DRV10].
If is -DP, then is -DP where
Note that if and , then satisfies -DP. Using simple composition (Proposition 1.2), we can only claim that is -DP. Thus advanced composition often results in -factor savings in privacy which is significant in practice. The optimal DP guarantees for -fold composition of an -DP algorithm were finally obtained by [KOV15]. For composing different algorithms, the situation is more complicated. If are DP algorithms such that is -DP, then it is shown by [MV16] that computing the exact DP guarantees for is #P-complete. They also give an algorithm to approximate the DP guarantees of to desired accuracy which runs in
In most situations, DP algorithms come with a collection of -DP guarantees, i.e., for each value of , there exists such that the algorithm is -DP.
For example the privacy curve of a Gaussian mechanism (with sensitivity and noise scale ) is given by where is the Gaussian CDF [BW18]. Suppose we want to compose several Gaussian mechanisms, which -DP guarantee should we choose for each mechanism? Any choice will lead to suboptimal DP guarantees for the final composition. Instead, we need a way to compose the privacy curves directly. This was suggested through the use of privacy region in [KOV15] and explicitly studied in the -DP framework of [DRS19]. -DP is a dual way (and equivalent) to look at the privacy curve
In an influential paper where they introduce Differentially Private Deep Learning, [ACG+16] proposed a method called the Moments Accountant (MA) for giving an upper bound the privacy curve of a composition of DP algorithms. They applied their method to bound the privacy loss of differentially private Stochastic Gradient Descent (DP-SGD) algorithm which they introduced. Analyzing the privacy loss of DP-SGD involves composing the privacy curve of each iteration of training with itself times, where is the total of number of training iterations. Typical values of range from to (such as when training large models like GPT3). The Moments Accountant was subsumed into the framework of Renyi Differential Privacy (RDP) introduced by [Mir17]. The running time of these accountants are independent of , but they only give an upper bound and cannot approximate the privacy curve to arbitrary accuracy.
DP-SGD is one of the most important DP algorithms in practice, because one can use it to train neural networks to achieve good privacy-vs-utility tradeoffs. Therefore obtaining accurate and tight privacy guarantees for DP-SGD is important. For example reducing from to , can mean that one can train the network for 4 times more epochs while staying within the same privacy budget. Therefore DP-SGD is one of the main motivations for this work.
There are also situations when the PRVs do not have bounded moments and so Moments Accountant or Renyi DP cannot be applied for analyzing privacy. An example of such an algorithm is the DP-SGD-JL algorithm of [BGK+21] which uses numerical composition of PRVs to analyze privacy.
GDP Accountant
[DRS19, BDLS19] introduced the notion of Gaussian Differential Privacy (GDP) and used it to develop an accountant for DP-SGD. The accountant is based on central limit theorem and only gives an approximation to the true privacy curve, where the approximation gets better with . But as we show in Figure 1, GDP accountant can significantly underreport the true epsilon value.
Several different notions of privacy were introduced for obtaining good upper bounds on the privacy curve of composition of DP algorithms such as Concentrated DP (CDP) [DR16, BS16], Truncated CDP [BDRS18] etc. None of these methods can approximate the privacy curve of compositions to arbitrary accuracy. The notion of -DP introduced by [DRS19], allows for a lossless composition theorem, but computing the privacy curve of composition seems computationally hard and they do not give any algorithms for doing it.
1 Our Contributions
The main contribution of this work is a new algorithm with an improved analysis for computing the privacy curve of the composition of a large number of DP algorithms.
Suppose are DP algorithms. Then the privacy curve of adaptive composition can be approximated in time
Suppose is a DP algorithm. Then the privacy curve of (adaptively) composed with itself times can be approximated in time
Thus we improve the state-of-the-art by at least a factor of in running time. We also note that our algorithm improves the memory required by a factor of See Figure 1 for a comparison of our algorithm with that of [KJPH21]. Also note that RDP Accountant (equivalent to the Moments Accountant) significantly overestimates the true , while the GDP Accountant significantly underestimates the true In contrast, the upper and lower bounds provided by our algorithm lie very close to each other.
Our algorithm (also the prior work of [KJH+20]) proceeds by approximating the privacy loss random variables (PRVs) by truncating and discretizing them. We then use Fast Fourier Transform (FFT) to convolve the distributions efficiently. The main difference is in the approximation procedure and the error analysis. In the approximation procedure, we correct the approximation so that the expected value of the discretization matches with the expected value of the PRV.
DP Preliminaries
Therefore an algorithm is -DP iff for all neighboring databases
Let and be any two privacy curves. The composition of the privacy curves, denoted by , is defined as
where are independently sampled and are independently sampled.
Note that there can be many pairs of random variables which have the same privacy curve, but the above operation is well-defined. If and , then it was shown by [DRS19] that
[DRS19] also show that is a commutative and associative operation.
Given two DP algorithms and , the adaptive composition is an algorithm which outputs , i.e., can look at the database and also the output of the previous algorithm . Adaptive composition of more than two algorithms is similarly defined. Suppose has privacy curve and has privacy curve (i.e., is a DP algorithm with privacy curve for any fixed ). The following composition theorem shows how to get the privacy curve of
Let be DP algorithms with privacy curves given by respectively. The privacy curve of the adaptive composition is given by
Privacy Loss Random Variables (PRVs)
The notion of privacy loss random variables (PRVs) is a unique way to assign a pair for any privacy curve such that PRVs allow us to compute composition of two algorithms via summing random variables (Theorem 3.5) (equivalently, convolving their distributions). Thus PRVs can be thought of as a reparametrization of privacy curves where composition becomes convolution. In this paper, we differ from the usual definition of PRVs given in [DR16, KJH+20], which are tied to a specific algorithm. Instead we think of them as a reparametrization of privacy curves and study them directly. This allows us to succinctly prove many useful properties of PRVs.
where are probability density functions of respectively.
The following theorem shows that the PRVs for a privacy curve are given by the log-likelihood random variables of
The following theorem provides a formula for computing the privacy curve in terms of the PRVs and conversely a formula for PRVs in terms of the privacy curve. A similar statement appears in [SMM19, KJH+20].
The privacy curve can be expressed in terms of PRVs as:
PRVs are useful in computing privacy curves because the composition of two privacy curves can be computed by adding the corresponding pairs of PRVs. A similar statement appears in [DR16].
Let be two privacy curves with PRVs and respectively. Then the PRVs for are given by . In particular,
Let be the privacy random variables for . By Theorem 3.2,
In Appendix B, we provide a proof of Theorems 3.2 and 3.3. We also discuss how to compute the PRVs for a subsampled mechanism given the PRVs for the original mechanism and give examples of PRVs for few standard mechanisms. These are used in our experiments to calculate the PRVs for DP-SGD.
Numerical composition of privacy curves
In this section, we present an efficient and numerically accurate method, ComposePRV (Algorithm 1), for composing privacy guarantees by utilizing the notion of PRVs.
The subroutine DiscretizePRV (Algorithm 2) is used to truncate and discretize PRVs. In this subroutine, we shift the discretized random variables such that it has the same mean as the original variables. This is one of main differences between our algorithm and the algorithm in [KJPH21, KH21]. We show that this significantly decreases the discretization error and allow us to use much coarser mesh instead of .
For simplicity, throughout this paper, we will assume that the PRVs do not have any mass at . This is with out loss of generality. Suppose for each Let . Then
Therefore we can use Algorithm 1 to approximate the distribution of , and use it to approximate the distribution of .
Error analysis
To analyze the discretization error, we introduce the notion of coupling approximation, a variant of Wasserstein distance. Intuitively, a good coupling approximation is a coupling where the two random variables are close to each other with high probability.
Given two random variables , we write if there exists a coupling between such that
The following lemma shows that if we have a good coupling approximation to a PRV , then the privacy curves and should be close.
By Theorem 3.2, and hence
Therefore the goal of our analysis is to show that the ComposePRV algorithm finds a good coupling approximation to We first show that the DiscretizePRV algorithm computes a good coupling approximation to the PRVs and crucially, it preserves the expected value after truncation. Lemma C.5 shows that where is the approximation of a PRV output by Algorithm 2 and is the truncation of to
We then use the following key lemma which shows that when we add independent coupling approximations (where expected values match), we get a much better coupling approximation than what the triangle inequality predicts.
Let where are coupled such that w.p. . Then w.p. . Note that are independent of each other. By Hoeffding’s inequality,
if we set . ∎
This lemma shows that the error of times composition is around and hence setting gives small enough error. Next, we bound the domain size . Naively, the domain size should be of the order of because is the sum of independent random variables with each bounded by a constant. In the appendix, we give a tighter tail bound of .
Let be the privacy random variables for a -DP algorithm, then for any , we have
This shows that and hence truncating the domain with only introduces an additive error in the privacy curve. Therefore, if the composition satisfies a good privacy guarantee (namely for small enough ), we can truncate the domain at . Together with the fact that mesh size is , this gives a -time algorithm for computing the privacy curve when we compose the same mechanism with itself times. The following theorem gives a formal statement of the error bounds of our algorithm, it is proved in Appendix 5.
Let be the approximation of produced by ComposePRV algorithm with mesh size
Furthermore, our algorithm takes time where is the number of distinct algorithms among .
A simple way to set such that the condition (7) holds is by choosing an such that:
where is the inverse of . To set the value of , we do not need the exact value of (or ). We only need an upper bound on , which can often by obtained by using the RDP Accountant or any other method to derive upper bounds on privacy.
Experiments
In this section, we demonstrate the utility of our composition method by computing the privacy curves for the DP-SGD algorithm which is one of the most important algorithms in differential privacy.
The DP-SGD algorithm [ACG+16] is a variant of stochastic gradient descent with steps. In each step, the algorithm selects a fraction of training examples uniformly at random. The algorithm adds a Gaussian vector with variance to the clipped gradient of the selected batch. Then it performs a gradient step (or any other iterative methods) using the noisy gradient computed. The privacy loss of DP-SGD involves composing the privacy curve of each iteration with itself times. The PRVs for each iteration have a closed form and depend only (see Appendix). Our algorithms use this closed form of PRVs.
See Figure 1(b) for the comparison between our algorithm and the GDP and RDP Accountant. Our method provides a lower and upper bound of the privacy curve according to (8). In Figure 1(a), we compare our algorithm with [KJPH21] (implemented in [KP21]). Under the same mesh size, our algorithm computes a much closer upper and lower bound.
We validate our program for the case . When , we have an exact formula for
Note that our error analysis in Section 5 ignores floating point errors. This is because they are negligible compared to the discretization and truncation errors we analyzed in Section 5 for the range of we are interested in. Our implementation uses ong } {\verb doube floating point format which is platform dependent, however, it guarantees a precision at least as good as double precision which has a resolution of . Computations involving of these orders of magnitude suffer from floating point inaccuracies. Our implementation therefore only allows values which are greater than which sufficies for practical use cases. See Appendix A for more details.
1 Comparison with [KJPH21]
In this section, we provide more results demonstrating the practical use of our algorithm. We compare runtimes of our algorithm with [KJPH21], which is the state-of-the-art, for typical values of privacy parameters (, , ).
See Figure 3 for the effect of the number of discretisation points on the accuracy of . Our algorithm requires about a few orders of magnitude smaller number of discretization points to converge compared to the algorithm of [KJPH21]. A similar picture can be seen in Figure 4. While for a small number of compositions, the algorithm of [KJPH21] gives reasonable estimates, for a large number of compositions, their error bounds worsen quickly.
We note that runtimes are directly proportional to the memory required by the algorithms and so a separate memory analysis is not required; the runtime and memory are dominated by the number of points in the discretization of PRV. All experiments are performed on a Intel Xeon W-2155 CPU with 3.30GHz with 128GB of memory.
In order to compare runtimes, we align the accuracy of both FFT algorithms. We find sets of numerical parameters (number of discretization bins and domain length) such that both algorithms give similarly accurate bounds and verify it visually (see Figure 9 (b)). Figure 9 illustrates the runtimes for varying numbers of DPSGD steps. We observe a significant reduction in the runtime using our algorithms.
Acknowledgements
We would like to thank Janardhan Kulkarni and Sergey Yekhanin for several useful discussions and encouraging us to work on this problem. L.W. would like to thank Daniel Jones and Victor Rühle for fruitful discussions and helpful guidance.
References
Appendix A Effect of floating point arithmetic
In this section, we demonstrate the effect of floating point inaccuracies on the computed privacy parameters. Figure 6 compares lower and upper bounds of the privacy curve with the analytical solution for small values of . As mentioned in section 6, we use a floating point representation with a resolution of at least . The number of discretization points in this examples are on the order of . Consequently, we expect floating point inaccuracies to become dominant for values on the order of . This can be also seen in the illustration, where the lower and upper bound fail to produce meaningful results for .
Appendix B Privacy Loss Random Variables
In this section, we continue the discussion on privacy random variables in Section 3. First, we give the proof of the formula for PRVs of and the formula for a privacy curve given its PRVs (Theorem 3.2).
We will now prove that We have
Therefore and . To complete the proof, note that
Since the PDFs of PRVs satisfy the relation , we can rewrite the equation 5 in terms of just or just .
To get the other form for , we use the integration by parts formula.
We now prove the converse relation by differentiating the expression for twice. We have:
In this section, we state the PRVs for a few standard mechanisms.
The PRVs for are:
Let and . By Theorem 3.2,
A similar calculation shows that ∎
The PRVs for the privacy curve are:
Let and . By Theorem 3.2,
A similar calculation shows that where ∎
The PRVs for the privacy curve of a -DP algorithm are
Morever , therefore the privacy curve is symmetric by Proposition C.9, i.e., . These conditions together imply that are PRVs for the -DP curve. ∎
Note that in the all the above examples, we have as the privacy curves are symmetric.
B.2 Subsampling
In this section, we calculate the PRVs for a subsampled mechanism given the PRVs for the original mechanism. Given two random variables and a sampling probability , denotes the mixture where we sample w.p. and w.p.
Let be the PRVs for a privacy curve . Let be the PRVs for . Then
The CDFs of and are given by:
Appendix C Missing Proofs in Error Analysis
Here we collect some useful properties of coupling approximations. The following lemma shows that the coupling approximations satisfy a triangle inequality.
Suppose are random variables such that and . Then
There exists couplings and such that
From these two couplings, we can construct a coupling between : sample , sample from (given by coupling ) and finally sample from (given by coupling ). Therefore for this coupling, we have:
The following lemma shows that small total variation distance implies good coupling approximation.
If the total variation distance , then .
It is well known that for any two random variables , there exists a coupling such that . This immediately implies what we want. ∎
C.2 Bounding the error using tail bounds of PRVs
The goal of this section is to bound the error of ComposePRV in terms of the tail bounds of the underlying PRVs.
Let be PRVs and let be the approximation produced by the ComposePRV algorithm (Algorithm 1) for with truncation parameter and mesh size
We can directly bound using moment generating functions as
The following key lemma shows that the DiscretizePRV algorithm (Algorithm 2) produces a good coupling approximation to the PRV and preserves the mean.
We will now construct the coupling between such that . The coupling is as follows: First sample . Suppose for some integer such that , then return . Clearly, the distribution of matches with and .
Since our error bound on is slightly different from the assumption in Lemma 5.3, we need the following generalization using the same proof.
In the algorithm, we only calculate the distribution of instead of . The following simple lemma shows that this is still a good approximation as long as stays within with high probability.
Let be random variables supported on . Then
Combining all the above lemmas, we get the following corollary.
Let be random variables supported on and let be the discretization of produced by DiscretizePRV algorithm with mesh size and truncation parameter . Then
Let Y^{L}\equiv Y_{i}\big{|}_{|Y_{i}|\leq L} be the truncation of By Lemma C.5, for some and where Now applying the triangle inequality for coupling approximations (Lemma C.1), we have
where By Lemma C.6, we have
where Finally applying triangle inequality (Lemma C.1) once again, we get:
where . We can bound as:
C.3 Tail Bound for PRVs
To finish the proof of our main theorem (Theorem 5.5, we need a tail bound on PRVs in terms of their privacy curves. First, we need a lemma relating the PRVs of a privacy curve with the PRVs of .
Let be the PRVs for a privacy curve . Then the PRVs for the privacy curve are
Let be the PRVs for . We know that . So Then by Theorem 3.2,
Now, we show our tail bound, which shows the PRVs for a -DP algorithm satisfies roughly that .
We have and where is the privacy curve of a -DP algorithm. By Theorem 3.2, we have
By Proposition C.9, the PRVs for are . Therefore by a similar argument, we have
C.4 Proof of Theorem 5.5
Now, we can prove our main theorem. See 5.5
For the runtime, we note that the bottleneck of our algorithm is to compute the convolution, which can be done using FFT. In total, we need to compute many FFT for distinct algorithms, one for each for computing the Fourier transform and one of computing the inverse Fourier transform. Since the length of the array for the FFT is bounded by , this costs in total.