Numerical Composition of Differential Privacy

Sivakanth Gopi, Yin Tat Lee, Lukas Wutschitz

Introduction

Differential privacy (DP), introduced by [DMNS06], provides a provable and quantifiable guarantee of privacy when the results of an algorithm run on private data are made public. Formally, we can define an $(\varepsilon,\delta)$ -differentially private algorithm as follows.

An algorithm $\mathcal{M}$ is $(\varepsilon,\delta)$ -DP if for any two neighboring databases $D,D^{\prime}$ differing in exactly one user and any subset $S$ of outputs, we have $\Pr[\mathcal{M}(D)\in S]\leq e^{\varepsilon}\Pr[\mathcal{M}(D^{\prime})\in S]+\delta.$

Intuitively, it says that looking at the outcome of $\mathcal{M}$ , we cannot tell whether it was run on $D$ or $D^{\prime}$ . Hence an adversary cannot infer the existence of any particular user in the input database, and therefore cannot learn any personal data of any particular user.

DP algorithms have an important property called composition. Suppose $M_{1}$ and $M_{2}$ are DP algorithms and say $M(D)=(M_{1}(D),M_{2}(D))$ , i.e., $M$ runs both the algorithms on $D$ and outputs their results. Then $M$ is also a DP algorithm.

If $M_{1}$ is $(\varepsilon_{1},\delta_{1})$ -DP and $M_{2}$ is $(\varepsilon_{2},\delta_{2})$ -DP, then $M(D)=(M_{1}(D),M_{2}(D))$ is $(\varepsilon_{1}+\varepsilon_{2},\delta_{1}+\delta_{2})$ -DP.

This also holds under adaptive composition (denoted by $M=M_{2}\circ M_{1}$ ), where $M_{2}$ can look at both the database and the output of $M_{1}$ . Here $M(D)=(M_{1}(D),M_{2}(D,M_{1}(D)))$ . It turns out that both compositions enjoy much better DP guarantees than this simple composition rule. Let $M$ be an $(\varepsilon,\delta)$ -DP algorithm and let $M^{\circ k}$ denote the (adaptive) composition of $M$ with itself $k$ times. The naive composition rule shows that $M^{\circ k}$ is $(k\varepsilon,k\delta)$ -DP. This was significantly improved in [DRV10].

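If $M$ is $(\varepsilon,\delta)$ -DP, then $M^{\circ k}$ is $(\varepsilon^{\prime},k\delta+\delta^{\prime})$ -DP where $\varepsilon^{\prime}=\varepsilon\sqrt{2k\log(1/\delta^{\prime})}+k\varepsilon(e^{\varepsilon}-1).$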

Note that if $\varepsilon=O\left(\frac{1}{\sqrt{k}}\right)$ and $\delta=o\left(\frac{1}{k}\right)$ , then $M^{\circ k}$ satisfies $(O_{\delta^{\prime}}(1),\delta^{\prime})$ -DP. Using simple composition (Proposition 1.2), we can only claim that $M^{\circ k}$ is $(O(\sqrt{k}),o(1))$ -DP. Thus advanced composition often results in $\sqrt{k}$ -factor savings in privacy which is significant in practice. The optimal DP guarantees for $k$ -fold composition of an $(\varepsilon,\delta)$ -DP algorithm were finally obtained by [KOV15]. For composing different algorithms, the situation is more complicated. If $M_{1},M_{2},\dots,M_{k}$ are DP algorithms such that $M_{i}$ is $(\varepsilon_{i},\delta_{i})$ -DP, then it is shown by [MV16] that computing the exact DP guarantees for $M=M_{1}\circ M_{2}\circ\dots\circ M_{k}$ is #P-complete. They also give an algorithm to approximate the DP guarantees of $M$ to desired accuracy $\eta$ which runs in

In most situations, DP algorithms come with a collection of $(\varepsilon,\delta)$ -DP guarantees, i.e., for each value of $\varepsilon$ , there exists $\delta$ such that the algorithm is $(\varepsilon,\delta)$ -DP.

For example, the privacy curve of a Gaussian mechanism (with sensitivity $1$ and noise scale $\sigma$ ) is given by $\delta(\varepsilon)=\Phi\left(-\varepsilon\sigma+\frac{1}{2\sigma}\right)-e^{\varepsilon}\Phi\left(-\varepsilon\sigma-\frac{1}{2\sigma}\right)$ where $\Phi(\cdot)$ is the Gaussian CDF [BW18]. Suppose we want to compose several Gaussian mechanisms: which $(\varepsilon,\delta)$ -DP guarantee should we choose for each mechanism? Any choice will lead to suboptimal DP guarantees for the final composition. Instead, we need a way to compose the privacy curves directly. This was suggested through the use of privacy regions in [KOV15] and explicitly studied in the $f$ -DP framework of [DRS19]. $f$ -DP is a dual (and equivalent) way to look at the privacy curve $\delta(\varepsilon).$
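The closed form above is easy to evaluate directly. The following is a minimal sketch, assuming sensitivity $1$ as in the text; the function names `gaussian_cdf` and `gaussian_delta` are illustrative and not taken from any library.

```python
import math


def gaussian_cdf(x: float) -> float:
    """Standard normal CDF, written with erfc for accuracy in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def gaussian_delta(eps: float, sigma: float) -> float:
    """delta(eps) of a Gaussian mechanism with sensitivity 1 and noise scale sigma."""
    half_shift = 1.0 / (2.0 * sigma)
    upper = gaussian_cdf(-eps * sigma + half_shift)
    lower = gaussian_cdf(-eps * sigma - half_shift)
    return upper - math.exp(eps) * lower


# Tabulate a few points of the curve for sigma = 1.
curve = [(eps, gaussian_delta(eps, sigma=1.0)) for eps in (0.0, 0.5, 1.0, 2.0)]
for eps, delta in curve:
    print(f"eps={eps:.1f}  delta={delta:.6f}")
```

Every point on this curve is a valid $(\varepsilon,\delta)$ guarantee for the same mechanism, which is why committing to a single point before composing loses information.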

In an influential paper where they introduce Differentially Private Deep Learning, [ACG+16] proposed a method called the Moments Accountant (MA) for giving an upper bound on the privacy curve of a composition of DP algorithms. They applied their method to bound the privacy loss of the differentially private Stochastic Gradient Descent (DP-SGD) algorithm which they introduced. Analyzing the privacy loss of DP-SGD involves composing the privacy curve of each iteration of training with itself $k$ times, where $k$ is the total number of training iterations. Typical values of $k$ range from $1000$ to $300000$ (such as when training large models like GPT3). The Moments Accountant was subsumed into the framework of Rényi Differential Privacy (RDP) introduced by [Mir17]. The running time of these accountants is independent of $k$ , but they only give an upper bound and cannot approximate the privacy curve to arbitrary accuracy.

DP-SGD is one of the most important DP algorithms in practice, because one can use it to train neural networks to achieve good privacy-vs-utility tradeoffs. Therefore obtaining accurate and tight privacy guarantees for DP-SGD is important. For example, reducing $\varepsilon$ from $2$ to $1$ can mean that one can train the network for 4 times more epochs while staying within the same privacy budget. This makes DP-SGD one of the main motivations for this work.

There are also situations when the PRVs do not have bounded moments, so neither the Moments Accountant nor Rényi DP can be applied for analyzing privacy. An example of such an algorithm is the DP-SGD-JL algorithm of [BGK+21], which uses numerical composition of PRVs to analyze privacy.

GDP Accountant

[DRS19, BDLS19] introduced the notion of Gaussian Differential Privacy (GDP) and used it to develop an accountant for DP-SGD. The accountant is based on a central limit theorem and only gives an approximation to the true privacy curve, where the approximation gets better with $k$ . But as we show in Figure 1, the GDP accountant can significantly underreport the true epsilon value.

Several different notions of privacy were introduced for obtaining good upper bounds on the privacy curve of a composition of DP algorithms, such as Concentrated DP (CDP) [DR16, BS16] and Truncated CDP [BDRS18]. None of these methods can approximate the privacy curve of compositions to arbitrary accuracy. The notion of $f$ -DP introduced by [DRS19] allows for a lossless composition theorem, but computing the privacy curve of a composition seems computationally hard and they do not give any algorithms for doing it.

1 Our Contributions

The main contribution of this work is a new algorithm with an improved analysis for computing the privacy curve of the composition of a large number of DP algorithms.

Suppose $M_{1},M_{2},\dots,M_{k}$ are DP algorithms. Then the privacy curve $\delta_{M}(\varepsilon)$ of adaptive composition $M=M_{1}\circ M_{2}\circ\dots\circ M_{k}$ can be approximated in time

Suppose $M$ is a DP algorithm. Then the privacy curve $\delta_{M^{\circ k}}(\varepsilon)$ of $M$ (adaptively) composed with itself $k$ times can be approximated in time

Thus we improve the state-of-the-art by at least a factor of $k$ in running time. We also note that our algorithm improves the memory required by a factor of $k.$ See Figure 1 for a comparison of our algorithm with that of [KJPH21]. Also note that RDP Accountant (equivalent to the Moments Accountant) significantly overestimates the true $\varepsilon$ , while the GDP Accountant significantly underestimates the true $\varepsilon.$ In contrast, the upper and lower bounds provided by our algorithm lie very close to each other.

Our algorithm (also the prior work of [KJH+20]) proceeds by approximating the privacy loss random variables (PRVs) by truncating and discretizing them. We then use Fast Fourier Transform (FFT) to convolve the distributions efficiently. The main difference is in the approximation procedure and the error analysis. In the approximation procedure, we correct the approximation so that the expected value of the discretization matches with the expected value of the PRV.

DP Preliminaries

Therefore an algorithm $\mathcal{M}$ is $(\varepsilon,\delta)$ -DP iff $\delta\left(\mathcal{M}(D)||\mathcal{M}(D^{\prime})\right)(\varepsilon)\leq\delta$ for all neighboring databases $D,D^{\prime}.$

Let $\delta_{1}\equiv\delta(X_{1}||Y_{1})$ and $\delta_{2}\equiv\delta(X_{2}||Y_{2})$ be any two privacy curves. The composition of the privacy curves, denoted by $\delta_{1}\otimes\delta_{2}$ , is defined as $\delta_{1}\otimes\delta_{2}\equiv\delta(X_{1},X_{2}||Y_{1},Y_{2}),$

where $X_{1},X_{2}$ are independently sampled and $Y_{1},Y_{2}$ are independently sampled.

Note that there can be many pairs of random variables which have the same privacy curve, but the above operation is well-defined. If $\delta(X_{1}||Y_{1})\equiv\delta(X_{1}^{\prime}||Y_{1}^{\prime})$ and $\delta(X_{2}||Y_{2})\equiv\delta(X_{2}^{\prime}||Y_{2}^{\prime})$ , then it was shown by [DRS19] that $\delta(X_{1},X_{2}||Y_{1},Y_{2})\equiv\delta(X_{1}^{\prime},X_{2}^{\prime}||Y_{1}^{\prime},Y_{2}^{\prime}).$

[DRS19] also show that $\otimes$ is a commutative and associative operation.

Given two DP algorithms $M_{1}$ and $M_{2}$ , the adaptive composition $(M_{2}\circ M_{1})(D)$ is an algorithm which outputs $(M_{1}(D),M_{2}(D,M_{1}(D)))$ , i.e., $M_{2}$ can look at the database $D$ and also the output of the previous algorithm $M_{1}(D)$ . Adaptive composition of more than two algorithms is similarly defined. Suppose $M_{1}$ has privacy curve $\delta_{1}$ and $M_{2}$ has privacy curve $\delta_{2}$ (i.e., $M_{2}(\cdot,y)$ is a DP algorithm with privacy curve $\delta_{2}$ for any fixed $y$ ). The following composition theorem shows how to get the privacy curve of $M_{2}\circ M_{1}.$

Let $M_{1},M_{2},\dots,M_{k}$ be DP algorithms with privacy curves given by $\delta_{1},\delta_{2},\dots,\delta_{k}$ respectively. The privacy curve of the adaptive composition $M_{k}\circ M_{k-1}\circ\dots\circ M_{1}$ is given by $\delta_{1}\otimes\delta_{2}\otimes\cdots\otimes\delta_{k}.$

Privacy Loss Random Variables (PRVs)

The notion of privacy loss random variables (PRVs) is a unique way to assign a pair $(X,Y)$ for any privacy curve $\delta$ such that $\delta\equiv\delta(X||Y).$ PRVs allow us to compute composition of two algorithms via summing random variables (Theorem 3.5) (equivalently, convolving their distributions). Thus PRVs can be thought of as a reparametrization of privacy curves where composition becomes convolution. In this paper, we differ from the usual definition of PRVs given in [DR16, KJH+20], which are tied to a specific algorithm. Instead we think of them as a reparametrization of privacy curves and study them directly. This allows us to succinctly prove many useful properties of PRVs.

where $X(t),Y(t)$ are probability density functions of $X,Y$ respectively.

The following theorem shows that the PRVs for a privacy curve $\delta=\delta(P||Q)$ are given by the log-likelihood random variables of $P,Q.$

The following theorem provides a formula for computing the privacy curve $\delta$ in terms of the PRVs and conversely a formula for PRVs in terms of the privacy curve. A similar statement appears in [SMM19, KJH+20].

The privacy curve $\delta$ can be expressed in terms of PRVs $(X,Y)$ as:

PRVs are useful in computing privacy curves because the composition of two privacy curves can be computed by adding the corresponding pairs of PRVs. A similar statement appears in [DR16].

Let $\delta_{1},\delta_{2}$ be two privacy curves with PRVs $(X_{1},Y_{1})$ and $(X_{2},Y_{2})$ respectively. Then the PRVs for $\delta_{1}\otimes\delta_{2}=\delta(X_{1},X_{2}||Y_{1},Y_{2})$ are given by $(X_{1}+X_{2},Y_{1}+Y_{2})$ . In particular,

Let $(X,Y)$ be the privacy random variables for $\delta(X_{1},X_{2}||Y_{1},Y_{2})$ . By Theorem 3.2,

In Appendix B, we provide a proof of Theorems 3.2 and 3.3. We also discuss how to compute the PRVs for a subsampled mechanism given the PRVs for the original mechanism and give examples of PRVs for few standard mechanisms. These are used in our experiments to calculate the PRVs for DP-SGD.
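As a sanity check on the statement that composition corresponds to adding PRVs, consider $k$ Gaussian mechanisms with mean shift $\mu$ . The sketch below assumes the standard identity $\delta(\varepsilon)=\mathbb{E}\left[(1-e^{\varepsilon-Y})_{+}\right]$ for a PRV $Y$ and the Gaussian PRV $Y\sim\mathcal{N}(\mu^{2}/2,\mu^{2})$ from Appendix B. The sum of $k$ independent copies is $\mathcal{N}(k\mu^{2}/2,k\mu^{2})$ , which is the PRV of a single Gaussian mechanism with shift $\mu\sqrt{k}$ . All function names are illustrative, and the midpoint rule is only a stand-in for a proper quadrature.

```python
import math


def normal_pdf(t: float, mean: float, var: float) -> float:
    return math.exp(-((t - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def normal_cdf(x: float) -> float:
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def delta_from_gaussian_prv(eps: float, mean: float, var: float, n: int = 20000) -> float:
    """Midpoint-rule estimate of E[(1 - exp(eps - Y))_+] for Y ~ N(mean, var)."""
    upper = mean + 12.0 * math.sqrt(var)  # mass beyond 12 standard deviations is negligible
    if upper <= eps:
        return 0.0
    step = (upper - eps) / n
    total = 0.0
    for j in range(n):
        t = eps + (j + 0.5) * step
        total += (1.0 - math.exp(eps - t)) * normal_pdf(t, mean, var)
    return total * step


def delta_closed_form(eps: float, mu: float) -> float:
    """Privacy curve of N(mu, 1) versus N(0, 1), i.e. noise scale sigma = 1 / mu."""
    return normal_cdf(-eps / mu + mu / 2.0) - math.exp(eps) * normal_cdf(-eps / mu - mu / 2.0)


# Compose k = 16 mechanisms with mu = 0.25: the summed PRV is N(0.5, 1).
k, mu = 16, 0.25
composed = delta_from_gaussian_prv(1.0, k * mu * mu / 2.0, k * mu * mu)
single = delta_closed_form(1.0, mu * math.sqrt(k))
print(composed, single)
```

For Gaussian mechanisms the sum stays in closed form. The numerical method of the next section is aimed at the general case, where no such closed form is available.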

Numerical composition of privacy curves

In this section, we present an efficient and numerically accurate method, ComposePRV (Algorithm 1), for composing privacy guarantees by utilizing the notion of PRVs.

The subroutine DiscretizePRV (Algorithm 2) is used to truncate and discretize PRVs. In this subroutine, we shift the discretized random variable so that it has the same mean as the original variable. This is one of the main differences between our algorithm and the algorithms in [KJPH21, KH21]. We show that this significantly decreases the discretization error and allows us to use a much coarser mesh $h\approx 1/\sqrt{k}$ instead of $h\approx 1/k$ .

For simplicity, throughout this paper, we will assume that the PRVs $Y_{1},Y_{2},\dots,Y_{k}$ do not have any mass at $\infty$ . This is without loss of generality. Suppose $\Pr[Y_{i}=\infty]=\delta_{i}$ for each $i.$ Let $Y_{i}^{\prime}=Y_{i}|_{Y_{i}\neq\infty}$ . Then
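The mean-preserving discretization can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the PRV is taken to be Gaussian, $Y\sim\mathcal{N}(\mu^{2}/2,\mu^{2})$ , the grid has $2n+1$ points $ih$ whose bins $(ih-\frac{h}{2},ih+\frac{h}{2}]$ cover $(-L,L]$ with $L=(n+\frac{1}{2})h$ , and the function name `discretize_gaussian_prv` is hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def discretize_gaussian_prv(mu: float, h: float, n: int):
    """Discretize Y ~ N(mu^2/2, mu^2) truncated to (-L, L], L = (n + 1/2) h.

    Returns (q, shift): q[i + n] is the mass on the point i*h + shift, and
    shift makes the discretized mean equal to the mean of the truncated PRV.
    """
    mean, std = mu * mu / 2.0, mu
    L = (n + 0.5) * h

    def cdf(t: float) -> float:
        return normal_cdf((t - mean) / std)

    # Mass of each bin (i*h - h/2, i*h + h/2], renormalized after truncation.
    kept_mass = cdf(L) - cdf(-L)
    q = [(cdf(i * h + h / 2.0) - cdf(i * h - h / 2.0)) / kept_mass
         for i in range(-n, n + 1)]

    # Mean of the truncated normal (standard closed form).
    a, b = (-L - mean) / std, (L - mean) / std
    truncated_mean = mean + std * (normal_pdf(a) - normal_pdf(b)) / (normal_cdf(b) - normal_cdf(a))

    grid_mean = sum(i * h * q[i + n] for i in range(-n, n + 1))
    return q, truncated_mean - grid_mean


h, n = 0.1, 60
q, shift = discretize_gaussian_prv(mu=1.0, h=h, n=n)
discretized_mean = sum((i * h + shift) * q[i + n] for i in range(-n, n + 1))
print(shift, discretized_mean)
```

Because every bin is replaced by its center, the shift is at most $\frac{h}{2}$ in absolute value. After the shift the per-step discretization errors have zero mean, which is what the error analysis exploits.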

Therefore we can use Algorithm 1 to approximate the distribution of $Y_{1}^{\prime}+Y_{2}^{\prime}+\dots+Y_{k}^{\prime}$ , and use it to approximate the distribution of $Y_{1}+Y_{2}+\dots+Y_{k}$ .

Error analysis

To analyze the discretization error, we introduce the notion of coupling approximation, a variant of Wasserstein distance. Intuitively, a good coupling approximation is a coupling where the two random variables are close to each other with high probability.

Given two random variables $Y_{1},Y_{2}$ , we write $|Y_{1}-Y_{2}|\leq_{\eta}h$ if there exists a coupling between $Y_{1},Y_{2}$ such that $\Pr[|Y_{1}-Y_{2}|>h]\leq\eta.$

The following lemma shows that if we have a good coupling approximation $\widetilde{Y}$ to a PRV $Y$ , then the privacy curves $\delta_{Y}(\varepsilon)$ and $\delta_{\widetilde{Y}}(\varepsilon)$ should be close.

By Theorem 3.2, $\delta_{Y}(\varepsilon)=\Pr[Y\geq\varepsilon+Z]$ and hence

Therefore the goal of our analysis is to show that the ComposePRV algorithm finds a good coupling approximation $\widetilde{Y}$ to $Y=\sum_{i=1}^{k}Y_{i}.$ We first show that the DiscretizePRV algorithm computes a good coupling approximation to the PRVs and crucially, it preserves the expected value after truncation. Lemma C.5 shows that $|\widetilde{Y}-Y^{L}|\leq_{0}h$ where $\widetilde{Y}$ is the approximation of a PRV $Y$ output by Algorithm 2 and $Y^{L}$ is the truncation of $Y$ to $[-L,L].$

We then use the following key lemma which shows that when we add independent coupling approximations (where expected values match), we get a much better coupling approximation than what the triangle inequality predicts.

Let $X_{i}=Y_{i}-\widetilde{Y}_{i}$ where $(Y_{i},\widetilde{Y}_{i})$ are coupled such that $|Y_{i}-\widetilde{Y}_{i}|\leq h$ w.p. $1$ . Then $X_{i}\in[-h,h]$ w.p. $1$ . Note that $X_{1},X_{2},\dots,X_{k}$ are independent of each other. By Hoeffding’s inequality,

if we set $t=h\sqrt{2k\log{\frac{2}{\eta}}}$ . ∎

This lemma shows that the error of $k$ times composition is around $\sqrt{k}\cdot h$ and hence setting $h\approx 1/\sqrt{k}$ gives small enough error. Next, we bound the domain size $L$ . Naively, the domain size $L$ should be of the order of $\sqrt{k}$ because $Y$ is the sum of $k$ independent random variables with each bounded by a constant. In the appendix, we give a tighter tail bound of $Y$ .
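To see the size of the saving, one can compare the Hoeffding radius $t=h\sqrt{2k\log\frac{2}{\eta}}$ from the proof above with the radius $kh$ that the triangle inequality would give. The snippet below is a small numerical illustration; the function names and the chosen values of $k$ and $\eta$ are assumptions made for the example.

```python
import math


def hoeffding_radius(h: float, k: int, eta: float) -> float:
    """Radius t with Pr[|sum of k zero-mean errors in [-h, h]| > t] <= eta."""
    return h * math.sqrt(2.0 * k * math.log(2.0 / eta))


def triangle_radius(h: float, k: int) -> float:
    """Worst-case radius when the k errors are simply added up."""
    return k * h


k, eta = 10_000, 1e-10
h = 1.0 / math.sqrt(k)  # the coarser mesh discussed in the text
print("Hoeffding:", hoeffding_radius(h, k, eta))
print("triangle: ", triangle_radius(h, k))
```

With $h=1/\sqrt{k}$ the Hoeffding radius depends on $k$ only through $\eta$ , whereas the triangle-inequality radius grows like $\sqrt{k}$ .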

Let $(X,Y)$ be the privacy random variables for an $(\varepsilon,\delta)$ -DP algorithm. Then for any $t\geq 0$ , we have

This shows that $\Pr[|Y|\geq\varepsilon+2]\leq\frac{4}{3}\delta$ and hence truncating the domain with $L=2+\varepsilon$ only introduces an additive $\delta$ error in the privacy curve. Therefore, if the composition satisfies a good privacy guarantee (namely $\varepsilon=O(1)$ for small enough $\delta$ ), we can truncate the domain at $L=\Theta(1)$ . Together with the fact that the mesh size is $1/\sqrt{k}$ , this gives an $O(\sqrt{k})$ -time algorithm for computing the privacy curve when we compose the same mechanism with itself $k$ times. The following theorem gives a formal statement of the error bounds of our algorithm; it is proved in Appendix C.

Let $\widetilde{Y}$ be the approximation of $Y=\sum_{i=1}^{k}Y_{i}$ produced by ComposePRV algorithm with mesh size

Furthermore, our algorithm takes $O\left(b\frac{L}{h}\log\left(\frac{L}{h}\right)\right)$ time where $b$ is the number of distinct algorithms among $\mathcal{M}_{1},\mathcal{M}_{2},\dots,\mathcal{M}_{k}$ .

A simple way to set $L$ such that the condition (7) holds is by choosing an $L$ such that:

where $\varepsilon_{\mathcal{A}}(\delta)$ is the inverse of $\delta_{\mathcal{A}}(\varepsilon)$ . To set the value of $L$ , we do not need the exact value of $\varepsilon_{\mathcal{M}}$ (or $\varepsilon_{\mathcal{M}_{i}}$ ). We only need an upper bound on $\varepsilon_{\mathcal{M}}$ , which can often be obtained by using the RDP Accountant or any other method to derive upper bounds on privacy.

Experiments

In this section, we demonstrate the utility of our composition method by computing the privacy curves for the DP-SGD algorithm which is one of the most important algorithms in differential privacy.

The DP-SGD algorithm [ACG+16] is a variant of stochastic gradient descent with $k$ steps. In each step, the algorithm selects a $p$ fraction of training examples uniformly at random. The algorithm adds a Gaussian vector with variance $\propto\sigma^{2}$ to the clipped gradient of the selected batch. Then it performs a gradient step (or a step of any other iterative method) using the noisy gradient computed. The privacy loss of DP-SGD involves composing the privacy curve of each iteration with itself $k$ times. The PRVs for each iteration have a closed form and depend only on $p,\sigma$ (see Appendix). Our algorithms use this closed form of PRVs.

See Figure 1(b) for the comparison between our algorithm and the GDP and RDP Accountant. Our method provides a lower and upper bound of the privacy curve according to (8). In Figure 1(a), we compare our algorithm with [KJPH21] (implemented in [KP21]). Under the same mesh size, our algorithm computes a much closer upper and lower bound.

We validate our program for the case $p=1$ . When $p=1$ , we have an exact formula for

Note that our error analysis in Section 5 ignores floating point errors. This is because they are negligible compared to the discretization and truncation errors analyzed there for the range of $\delta$ we are interested in. Our implementation uses the long double floating point format, which is platform dependent; however, it guarantees a precision at least as good as double precision, which has a resolution of $10^{-15}$ . Computations involving $\delta$ of these orders of magnitude suffer from floating point inaccuracies. Our implementation therefore only allows $\delta$ values which are greater than $10^{-10}$ , which suffices for practical use cases. See Appendix A for more details.

1 Comparison with [KJPH21]

In this section, we provide more results demonstrating the practical use of our algorithm. We compare runtimes of our algorithm with [KJPH21], which is the state-of-the-art, for typical values of privacy parameters ( $\sigma=0.8$ , $p=4\times 10^{-3}$ , $\varepsilon=1.5$ ).

See Figure 3 for the effect of the number of discretization points $n$ on the accuracy of $\delta$ . Our algorithm requires a few orders of magnitude fewer discretization points to converge than the algorithm of [KJPH21]. A similar picture can be seen in Figure 4. While for a small number of compositions the algorithm of [KJPH21] gives reasonable estimates, for a large number of compositions their error bounds worsen quickly.

We note that runtimes are directly proportional to the memory required by the algorithms and so a separate memory analysis is not required; the runtime and memory are dominated by the number of points in the discretization of the PRV. All experiments are performed on an Intel Xeon W-2155 CPU at 3.30GHz with 128GB of memory.

In order to compare runtimes, we align the accuracy of both FFT algorithms. We find sets of numerical parameters (number of discretization bins and domain length) such that both algorithms give similarly accurate bounds and verify it visually (see Figure 9 (b)). Figure 9 illustrates the runtimes for varying numbers of DP-SGD steps. We observe a significant reduction in the runtime using our algorithms.

Acknowledgements

We would like to thank Janardhan Kulkarni and Sergey Yekhanin for several useful discussions and encouraging us to work on this problem. L.W. would like to thank Daniel Jones and Victor Rühle for fruitful discussions and helpful guidance.

References

Appendix A Effect of floating point arithmetic

In this section, we demonstrate the effect of floating point inaccuracies on the computed privacy parameters. Figure 6 compares lower and upper bounds of the privacy curve with the analytical solution for small values of $\delta$ . As mentioned in Section 6, we use a floating point representation with a resolution of at least $10^{-15}$ . The number of discretization points in this example is on the order of $10^{4}$ . Consequently, we expect floating point inaccuracies to become dominant for values on the order of $10^{-11}$ . This can also be seen in the illustration, where the lower and upper bounds fail to produce meaningful results for $\delta<2\times 10^{-11}$ .

Appendix B Privacy Loss Random Variables

In this section, we continue the discussion on privacy random variables in Section 3. First, we give the proof of the formula for PRVs of $\delta(P||Q)$ and the formula for a privacy curve given its PRVs (Theorem 3.2).

We will now prove that $\delta(X||Y)=\delta(P||Q).$ We have

Therefore $\Pr[Q\in S_{\varepsilon}]=\Pr[Y>\varepsilon]$ and $\Pr[P\in S_{\varepsilon}]=\Pr[X>\varepsilon]$ . To complete the proof, note that

Since the PDFs of PRVs $(X,Y)$ satisfy the relation $Y(t)=e^{t}X(t)$ , we can rewrite Equation (5) in terms of just $Y$ or just $X$ .

To get the other form for $\delta(\varepsilon)$ , we use the integration by parts formula.

We now prove the converse relation by differentiating the expression for $\delta(\varepsilon)$ twice. We have:

In this section, we state the PRVs for a few standard mechanisms.

The PRVs for $\delta(\mathcal{N}(\mu,1)||\mathcal{N}(0,1))$ are:

Let $P=\mathcal{N}(\mu,1)$ and $Q=\mathcal{N}(0,1)$ . By Theorem 3.2,

A similar calculation shows that $X=\mathcal{N}\left(-\frac{\mu^{2}}{2},\mu^{2}\right)$ ∎
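The Gaussian PRVs can be checked numerically against the defining relation $Y(t)=e^{t}X(t)$ . The sketch below assumes $Y=\mathcal{N}\left(\frac{\mu^{2}}{2},\mu^{2}\right)$ , the mirror image of $X$ above, and takes the log-likelihood ratio $\log(Q(\omega)/P(\omega))$ as the map that produces the PRVs. The helper names are illustrative.

```python
import math


def normal_pdf(t: float, mean: float, var: float) -> float:
    return math.exp(-((t - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood_ratio(omega: float, mu: float) -> float:
    """log(Q(omega) / P(omega)) for P = N(mu, 1) and Q = N(0, 1)."""
    return -0.5 * omega ** 2 + 0.5 * (omega - mu) ** 2


def prv_densities(t: float, mu: float):
    """Densities of X ~ N(-mu^2/2, mu^2) and Y ~ N(mu^2/2, mu^2) at t."""
    x_density = normal_pdf(t, -mu * mu / 2.0, mu * mu)
    y_density = normal_pdf(t, mu * mu / 2.0, mu * mu)
    return x_density, y_density


mu = 0.7
for t in (-1.0, 0.0, 0.5, 2.0):
    x_density, y_density = prv_densities(t, mu)
    print(t, y_density, math.exp(t) * x_density)
```

The log-likelihood ratio is the affine map $\omega\mapsto-\mu\omega+\frac{\mu^{2}}{2}$ , so pushing $\omega\sim P$ and $\omega\sim Q$ through it yields the two Gaussians directly.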

The PRVs for the privacy curve $\delta\left(\mathsf{Lap}\left(\mu,1\right)||\mathsf{Lap}\left(0,1\right)\right)$ are:

Let $P=\mathsf{Lap}(\mu,1)$ and $Q=\mathsf{Lap}(0,1)$ . By Theorem 3.2,

A similar calculation shows that $X=|Z|-|Z-\mu|$ where $Z\sim\mathsf{Lap}(0,1).$ ∎

The PRVs for the privacy curve of an $(\varepsilon,\delta)$ -DP algorithm are

Moreover $X=-Y$ , therefore the privacy curve $\delta(X||Y)$ is symmetric by Proposition C.9, i.e., $\delta(X||Y)=\delta(Y||X)$ . These conditions together imply that $X,Y$ are PRVs for the $(\varepsilon,\delta)$ -DP curve. ∎

Note that in all the above examples, we have $X=-Y$ as the privacy curves are symmetric.

B.2 Subsampling

In this section, we calculate the PRVs for a subsampled mechanism given the PRVs for the original mechanism. Given two random variables $P,Q$ and a sampling probability $p\in[0,1]$ , $p\cdot P+(1-p)\cdot Q$ denotes the mixture where we sample $P$ w.p. $p$ and $Q$ w.p. $1-p.$

Let $(X,Y)$ be the PRVs for a privacy curve $\delta(P||Q)$ . Let $(X_{p},Y_{p})$ be the PRVs for $\delta_{p}=\delta(P||\ p\cdot P+(1-p)\cdot Q)$ . Then

The CDFs of $X_{p}$ and $Y_{p}$ are given by:

Appendix C Missing Proofs in Error Analysis

Here we collect some useful properties of coupling approximations. The following lemma shows that the coupling approximations satisfy a triangle inequality.

Suppose $X,Y,Z$ are random variables such that $|X-Y|\leq_{\eta_{1}}h_{1}$ and $|Y-Z|\leq_{\eta_{2}}h_{2}$ . Then $|X-Z|\leq_{\eta_{1}+\eta_{2}}h_{1}+h_{2}.$

There exist couplings $(X,Y)$ and $(Y,Z)$ such that

From these two couplings, we can construct a coupling between $(X,Z)$ : sample $X$ , sample $Y$ from $Y|X$ (given by coupling $(X,Y)$ ) and finally sample $Z$ from $Z|Y$ (given by coupling $(Y,Z)$ ). Therefore for this coupling, we have:

The following lemma shows that small total variation distance implies good coupling approximation.

If the total variation distance $d_{TV}(X,Y)\leq\eta$ , then $|X-Y|\leq_{\eta}0$ .

It is well known that for any two random variables $X,Y$ , there exists a coupling such that $d_{TV}(X,Y)=\Pr[X\neq Y]$ . This immediately implies what we want. ∎

C.2 Bounding the error using tail bounds of PRVs

The goal of this section is to bound the error of ComposePRV in terms of the tail bounds of the underlying PRVs.

Let $Y_{1},Y_{2},\dots,Y_{k}$ be PRVs and let $\widetilde{Y}$ be the approximation produced by the ComposePRV algorithm (Algorithm 1) for $Y=\sum_{i=1}^{k}Y_{i}$ with truncation parameter $L$ and mesh size

We can directly bound $\Pr\left[\left|\sum_{i=1}^{k}\widetilde{Y}_{i}\right|\geq L\right]$ using moment generating functions as

The following key lemma shows that the DiscretizePRV algorithm (Algorithm 2) produces a good coupling approximation to the PRV and preserves the mean.

We will now construct the coupling between $(Y^{L},\widetilde{Y})$ such that $|Y^{L}-(\widetilde{Y}-\mu)|\leq\frac{h}{2}$ . The coupling is as follows: First sample $y\sim Y^{L}$ . Suppose $y\in(ih-\frac{h}{2},ih+\frac{h}{2}]$ for some integer $i$ such that $-n\leq i\leq n$ , then return $\widetilde{y}=\mu+ih$ . Clearly, the distribution of $\widetilde{y}$ matches with $\widetilde{Y}$ and $|y-(\widetilde{y}-\mu)|=|y-ih|\leq\frac{h}{2}$ .

Since our error bound on $\widetilde{Y}$ is slightly different from the assumption in Lemma 5.3, we need the following generalization using the same proof.

In the algorithm, we only calculate the distribution of $Y_{1}\oplus Y_{2}\oplus\dots\oplus Y_{k}$ instead of $Y_{1}+Y_{2}+\dots+Y_{k}$ . The following simple lemma shows that this is still a good approximation as long as $\sum_{i}Y_{i}$ stays within $[-L,L]$ with high probability.

Let $Y_{1},Y_{2},\dots,Y_{k}$ be random variables supported on $(-L,L]$ . Then

Combining all the above lemmas, we get the following corollary.

Let $Y_{1},Y_{2},\dots,Y_{k}$ be random variables supported on and let $\widetilde{Y}_{i}$ be the discretization of $Y_{i}$ produced by DiscretizePRV algorithm with mesh size $h=\frac{h_{0}}{\sqrt{\frac{k}{2}\log\frac{2}{\eta_{0}}}}$ and truncation parameter $L$ . Then

Let $Y_{i}^{L}\equiv Y_{i}\big|_{|Y_{i}|\leq L}$ be the truncation of $Y_{i}.$ By Lemma C.5, $|Y_{i}^{L}-(\widetilde{Y}_{i}-\mu_{i})|\leq_{0}\frac{h}{2}$ for some $\mu_{i}$ and $|Y_{i}^{L}-Y_{i}|\leq_{\xi_{i}}0$ where $\xi_{i}=\Pr[|Y_{i}|\geq L].$ Now applying the triangle inequality for coupling approximations (Lemma C.1), we have

where $\eta_{1}=\sum_{i}\xi_{i}=\sum_{i}\Pr[|Y_{i}|\geq L].$ By Lemma C.6, we have

where $\eta_{2}=\Pr\left[\left|\sum_{i=1}^{k}\widetilde{Y}_{i}\right|\geq L\right].$ Finally applying triangle inequality (Lemma C.1) once again, we get:

where $\eta=\eta_{0}+\eta_{1}+\eta_{2}$ . We can bound $\Pr\left[\left|\sum_{i=1}^{k}\widetilde{Y}_{i}\right|\geq L\right]$ as:

C.3 Tail Bound for PRVs

To finish the proof of our main theorem (Theorem 5.5), we need a tail bound on PRVs in terms of their privacy curves. First, we need a lemma relating the PRVs of a privacy curve $\delta(P||Q)$ with the PRVs of $\delta(Q||P)$ .

Let $(X,Y)$ be the PRVs for a privacy curve $\delta(P||Q)$ . Then the PRVs for the privacy curve $\delta(Q||P)$ are $(-Y,-X).$

Let $(\widetilde{X},\widetilde{Y})$ be the PRVs for $\delta(Q||P)$ . We know that $\delta(P||Q)=\delta(X||Y)$ . So $\delta(Q||P)=\delta(Y||X).$ Then by Theorem 3.2,

Now, we show our tail bound, which shows that the PRVs $(X,Y)$ for an $(\varepsilon,\delta)$ -DP algorithm satisfy roughly $\Pr(|Y|\geq\varepsilon+2)\leq 2\delta$ .

We have $\delta(X||Y)\leq f_{\varepsilon,\delta}$ and $\delta(Y||X)\leq f_{\varepsilon,\delta}$ where $f_{\varepsilon,\delta}$ is the privacy curve of an $(\varepsilon,\delta)$ -DP algorithm. By Theorem 3.2, we have

By Proposition C.9, the PRVs for $\delta(Y||X)$ are $(-Y,-X)$ . Therefore by a similar argument, we have

C.4 Proof of Theorem 5.5

Now, we can prove our main theorem (Theorem 5.5).

For the runtime, we note that the bottleneck of our algorithm is computing the convolution, which can be done using FFT. In total, we need to compute $b+1$ FFTs for $b$ distinct algorithms: one forward Fourier transform for each distinct algorithm and one inverse Fourier transform. Since the length of the array for the FFT is bounded by $O(L/h)$ , this costs $O(bL/h\log(L/h))$ in total.
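The FFT count can be illustrated for self-composition ( $b=1$ ): one forward transform, a pointwise $k$ -th power, and one inverse transform. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a textbook recursive radix-2 FFT in place of an optimized library, places the grid point $ih$ at array index $i$ modulo the array length so that sums wrap around the domain, and the name `compose_on_grid` is hypothetical.

```python
import cmath
import math


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(values) must be a power of two."""
    size = len(values)
    if size == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * size
    for j in range(size // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * j / size)
        out[j] = even[j] + twiddle * odd[j]
        out[j + size // 2] = even[j] - twiddle * odd[j]
    return out


def compose_on_grid(q, k, size):
    """k-fold self-convolution of a grid pmf, wrapped modulo `size` grid cells.

    q[i + n] is the mass at grid point i*h for i = -n..n. The result lists the
    mass at i = -size/2 .. size/2 - 1, so entry i + size/2 belongs to point i*h.
    """
    n = (len(q) - 1) // 2
    ring = [0j] * size
    for i in range(-n, n + 1):
        ring[i % size] += q[i + n]
    spectrum = fft(ring)                    # one forward FFT
    powered = [z ** k for z in spectrum]    # pointwise k-th power
    wrapped = fft(powered, inverse=True)    # one inverse FFT
    half = size // 2
    return [max(wrapped[i % size].real / size, 0.0) for i in range(-half, half)]


# Two-point PRV with mass p at +h and 1 - p at -h, composed twice.
p = math.e / (1.0 + math.e)
pmf = compose_on_grid([1.0 - p, 0.0, p], k=2, size=8)
print(pmf)
```

For $b$ distinct algorithms the same pattern applies with one forward transform per distinct pmf, a pointwise product of the appropriately powered spectra, and a single inverse transform.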