Fingerprinting Codes and the Price of Approximate Differential Privacy

Mark Bun, Jonathan Ullman, Salil Vadhan

Introduction

Consider a database $D\in\mathcal{X}^{n}$ , in which each of the $n$ rows corresponds to an individual’s record, and each record is an element of some data universe $\mathcal{X}$ (e.g. $\mathcal{X}=\{0,1\}^{d}$ , corresponding to $d$ binary attributes per record). The goal of privacy-preserving data analysis is to enable rich statistical analyses on such a database while protecting the privacy of the individuals. It is especially desirable to achieve $(\varepsilon,\delta)$ -differential privacy [DMNS06, DKM+06], which (for suitable choices of $\varepsilon$ and $\delta$ ) guarantees that no individual’s data has a significant influence on the information released about the database. A natural way to measure the tradeoff between these two goals is via sample complexity—the minimum number of records $n$ such that there exists a (possibly computationally unbounded) algorithm that achieves both differential privacy and statistical accuracy.

Some of the most basic statistics are counting queries, which are queries of the form “What fraction of individual records in $D$ satisfy some property $q$ ?” In particular, we would like to design an algorithm that takes as input a database $D$ and, for some family of counting queries $\mathcal{Q}$ , outputs an approximate answer to each of the queries in $\mathcal{Q}$ that is accurate to within, say, $\pm.01$ . Suppose we are given a bound on the number of queries $|\mathcal{Q}|$ and the dimensionality of the database records $d$ , but otherwise allow the family $\mathcal{Q}$ to be arbitrary. What is the sample complexity required to achieve $(\varepsilon,\delta)$ -differential privacy and statistical accuracy for $\mathcal{Q}$ ?

Of course, if we drop the requirement of privacy, then we could achieve perfect accuracy when $D$ contains any number of records. However, in many interesting settings the database $D$ consists of random samples from some larger population, and an analyst is actually interested in answering the queries on the population. Thus, even without a privacy constraint, $D$ would need to contain enough records to ensure that (with high probability) for every query $q\in\mathcal{Q}$ , the answer to $q$ on $D$ is close to the answer to $q$ on the whole population, say within $\pm.01$ . To achieve this form of statistical accuracy, it is well-known that it is necessary and sufficient for $D$ to contain $\Theta(\log|\mathcal{Q}|)$ samples. (For a specific family of queries $\mathcal{Q}$ , the necessary and sufficient number of samples is proportional to the VC-dimension of $\mathcal{Q}$ , which can be as large as $\log|\mathcal{Q}|$ .) In this work we consider whether there is an additional “price of differential privacy” if we require both statistical accuracy and $(\varepsilon,\delta)$ -differential privacy (for, say, $\varepsilon=O(1)$ , $\delta=o(1/n)$ ). This benchmark has often been used to evaluate the utility of differentially private algorithms, beginning with the seminal work of Dinur and Nissim [DN03].

These results show that the price of privacy is small for datasets with few attributes, but may be large for high-dimensional datasets. For example, if we simply want to estimate the mean of each of the $d$ attributes without a privacy guarantee, then $\Theta(\log d)$ samples are necessary and sufficient to get statistical accuracy. However, the best known $(\varepsilon,\delta)$ -differentially private algorithm requires $\Omega(\sqrt{d})$ samples—an exponential gap. In the special case of pure $(\varepsilon,0)$ -differential privacy, a lower bound of $\Omega(d)$ is known [HT10]. However, for the general case of approximate $(\varepsilon,\delta)$ -differential privacy the best known lower bound is $\Omega(\log d)$ [DN03]. More generally, there are no known lower bounds that separate the sample complexity of $(\varepsilon,\delta)$ -differential privacy from the sample complexity required for statistical accuracy alone.

In this work we close this gap almost completely, and show that there is indeed a “price of approximate differential privacy” for high-dimensional datasets.

We establish this lower bound using a combinatorial object called a fingerprinting code, which was originally introduced by Boneh and Shaw [BS98] for the problem of watermarking copyrighted content. Specifically, we use Tardos’ construction of optimal fingerprinting codes [Tar08]. The use of “secure content distribution schemes” to prove lower bounds for differential privacy originates with the work of Dwork et al. [DNR+09], who used “traitor-tracing schemes,” which are a cryptographic analogue of information-theoretic fingerprinting codes, to prove computational hardness results for differential privacy. Extending this connection, Ullman [Ull13] used fingerprinting codes to construct a novel traitor-tracing scheme and obtain a strong computational hardness result for differential privacy. (In fact, one way to prove Theorem 1.1 is by replacing the one-way functions in [Ull13] with a random oracle, and thereby obtain an information-theoretically secure traitor-tracing scheme.) Here we show that a direct use of fingerprinting codes yields information-theoretic lower bounds on sample complexity.

Next, we consider the sample complexity of answering an arbitrary set $\mathcal{Q}$ of counting queries to within error $\pm\alpha$ . As above, if we assume the database contains samples from a population, and require only that the answers to queries on the sampled database and the population are close, to within $\pm\alpha$ , then $\Theta(\log|\mathcal{Q}|/\alpha^{2})$ samples are necessary and sufficient for just statistical accuracy. When $|\mathcal{Q}|$ is large (relative to $d$ and $1/\alpha$ ), the best sample complexity for differential privacy is again achieved by the private multiplicative weights algorithm, and is $O(\sqrt{d}\log|\mathcal{Q}|/\alpha^{2})$ . For pure differential privacy, a lower bound of $\Omega(d\log|\mathcal{Q}|/\alpha^{2})$ is known [Har11]. On the other hand, the best known lower bound for approximate differential privacy is $\Omega(\max\{\log|\mathcal{Q}|/\alpha,1/\alpha^{2}\})$ , which follows from the techniques of [DN03]. To resolve this gap, we give a composition theorem that allows us to obtain a nearly optimal lower bound by combining Theorem 1.1 with (variants of) the existing sample complexity lower bounds. The result shows that the private multiplicative weights algorithm achieves nearly-optimal sample-complexity as a function of $|\mathcal{Q}|,d$ , and $\alpha$ .

We remark that the condition that $d\geq 6\log(1/\alpha)$ is both necessary (up to the constant factor) and fairly mild. Necessary because the noisy histogram algorithm (see, e.g. [Vad16]) requires $n=O(2^{d/2}\sqrt{\log|\mathcal{Q}|}/\alpha)$ samples, which is better than the conclusion of the lower bound when $d<2\log(1/\alpha)$ . Mild because differential privacy cannot be satisfied for large query sets unless $\alpha\gtrsim 1/\sqrt{n}$ , so the condition is no stronger than assuming $n\lesssim 2^{d/3}$ , in which case the number of samples is exponential in the dimension. Similarly, the condition $s\geq d/\alpha^{2}$ is also necessary, since adding independent noise to each query requires only $n\gtrsim|\mathcal{Q}|^{1/2}/\alpha$ samples.

Finally, we consider the sample complexity of the natural and well studied class of $k$ -way marginal queries, also known as $k$ -way conjunction queries (see e.g. [BCD+07, KRSU10, GHRU11, TUV12, CTUW14, DNT13]). A $k$ -way marginal query on a database $D\in(\{0,1\}^{d})^{n}$ is specified by a set $S\subseteq[d]$ , $|S|\leq k$ , and a pattern $t\in\{0,1\}^{|S|}$ and asks “What fraction of records in $D$ has each attribute $j$ in $S$ set to $t_{j}$ ?” The number of $k$ -way marginal queries on $\{0,1\}^{d}$ is about $2^{k}\binom{d}{k}$ . For the special case of $k=1$ , the queries simply ask for the mean of each attribute, which was discussed above. We prove that the lower bound of Theorem 1.2, which applies to worst-case queries, also holds for the special case of $k$ -way marginal queries when $\alpha$ is not too small.

We remark that, since the number of $k$ -way marginal queries is about $2^{k}\binom{d}{k}$ , the sample complexity lower bound in Theorem 1.3 essentially matches that of Theorem 1.2. The two theorems are incomparable, since Theorem 1.2 applies even when $\alpha$ is exponentially small in $d$ , but only applies for a worst-case family of queries.

We now describe the main technical ingredients used to prove these results. For concreteness, we will describe the main ideas for the case of $k$ -way marginal queries.

Fingerprinting codes, introduced by Boneh and Shaw [BS98], were originally designed to address the problem of watermarking copyrighted content. Roughly speaking, a (fully-collusion-resilient) fingerprinting code is a way of generating codewords for $n$ users in such a way that any codeword can be uniquely traced back to a user. Each legitimate copy of a piece of digital content has such a codeword hidden in it, and thus any illegal copy can be traced back to the user who copied it. Moreover, even if an arbitrary subset of the users collude to produce a copy of the content, then under a certain marking assumption, the codeword appearing in the copy can still be traced back to one of the users who contributed to it. The standard marking assumption is that if every colluder has the same bit $b$ in the $j$ -th bit of their codeword, then the $j$ -th bit of the “combined” codeword in the copy they produce must also be $b$ . We refer the reader to the original paper of Boneh and Shaw [BS98] for the motivation behind the marking assumption and an explanation of how fingerprinting codes can be used to watermark digital content.

We show that the existence of short fingerprinting codes implies sample complexity lower bounds for $1$ -way marginal queries. Recall that a $1$ -way marginal query $q_{j}$ is specified by an integer $j\in[d]$ and asks simply “What fraction of records in $D$ have a $1$ in the $j$ -th bit?” Suppose a coalition of users $S$ takes their codewords and builds a database $D\in(\{0,1\}^{d})^{n}$ where each record contains one of their codewords, and $d$ is the length of the codewords. Consider the $1$ -way marginal query $q_{j}(D)$ . If every user in $S$ has a bit $b$ in the $j$ -th bit of their codeword, then $q_{j}(D)=b$ . Thus, if an algorithm answers $1$ -way marginal queries on $D$ with non-trivial accuracy, its output can be used to obtain a combined codeword that satisfies the marking assumption. By the tracing property of fingerprinting codes, we can use the combined codeword to identify one of the users in the database. However, if we can identify one of the users from the answers, then the algorithm is not differentially private.

Theorems 1.2 and 1.3 are proven by using this composition technique repeatedly to combine our lower bound for $1$ -way marginals with (variants of) several known lower bounds that capture the optimal dependence on $\log|\mathcal{Q}|$ and $1/\alpha^{2}$ .

The connection between fingerprinting codes and differential privacy lower bounds extends to arbitrary families $\mathcal{Q}$ of counting queries. We introduce the notion of a generalized fingerprinting code with respect to $\mathcal{Q}$ , where each codeword corresponds to a data universe element $x\in\mathcal{X}$ and the bits of the codeword are given by $q(x)$ for each $q\in\mathcal{Q}$ , but is the same as an ordinary fingerprinting code otherwise. The existence of a generalized fingerprinting code with respect to $\mathcal{Q}$ , for $n$ users, implies a sample complexity lower bound of $n$ for privately releasing answers to $\mathcal{Q}$ . We also show a partial converse to the above result, which states that some sort of “fingerprinting-code-like object” is necessary to prove sample complexity lower bounds for answering counting queries under differential privacy. This object has similar semantics to a generalized fingerprinting code, however the marking assumption required for tracing is slightly stronger and the probability that tracing succeeds can be significantly smaller than what is required by the standard definition of fingerprinting codes. Our partial converse parallels the result of Dwork et al. [DNR+09] that shows computational hardness results for differential privacy imply a “traitor-tracing-like object.” We leave it as an open question to pin down precisely the relationship between fingerprinting codes and information-theoretic lower bounds in differential privacy (and also between traitor-tracing schemes and computational hardness results for differential privacy).

2 Other Related Work

There have been attempts to prove optimal sample complexity lower bounds for $k$ -way marginals. In particular, when $k$ is a constant, Kasiviswanathan et al. [KRSU10] and De [De12] prove a lower bound of $\min\{|\mathcal{Q}|^{1/2}/\alpha,1/\alpha^{2}\}$ on the sample complexity. Note that when $\alpha$ is a constant, these lower bounds are $O(1)$ .

In addition to the general computational hardness results referenced above, there are several results that show stronger hardness results for restricted types of efficient algorithms [UV11, GHRU11, DNV12].

2.2 Subsequent Work

Subsequent to our work, Steinke and Ullman [SU15a] refined our use of fingerprinting codes to prove a lower bound of $\Omega(\sqrt{d\log(1/\delta)}/\varepsilon)$ on the number of samples required to release the mean of each of the $d$ attributes under $(\varepsilon,\delta)$ -differential privacy when $\delta\ll 1/n$ . This lower bound is optimal up to constant factors, and improves on Theorem 1.1 by a factor of roughly $\sqrt{\log(1/\delta)}\cdot\log d$ . They also improve and simplify our analysis of robust fingerprinting codes.

Our fingerprinting code technique has also been used to prove lower bounds for other types of differentially private data analyses. Namely, Dwork et al. [DTTZ14] prove lower bounds for differentially private principal component analysis and Bassily, Smith, and Thakurta [BST14] prove lower bounds for differentially private empirical risk minimization. In order to establish lower bounds for privately releasing threshold functions, Bun et al. [BNSV15] construct a fingerprinting-code-like object that yields a lower bound for the problem of releasing a value between the minimum and maximum of a dataset.

Dwork et al. [DSS+15] observe that the privacy attack implicit in our negative results is closely related to the influential attacks that were employed by Homer et al. [HSR+08] (and further studied in [SOJH09]) to violate privacy of public genetic datasets. Using this connection, they show how to make Homer et al.’s attack robust to very general models of noise and how to make the attack work without detailed knowledge of the population the dataset represents.

A pair of works [HU14, SU15b] show that fingerprinting codes and the related traitor-tracing schemes imply both information-theoretic lower bounds and computational hardness results for the “false discovery” problem in adaptive data analysis. Specifically, they show lower bounds for answering an online sequence of adaptively chosen counting queries where the database is a sample from some unknown distribution and the answers must be accurate with respect to that distribution. These works [HU14, SU15b] effectively reverse a connection established in [DFH+15, BSSU15], which used differentially private algorithms to obtain positive results for this problem.

Our technique for composing lower bounds in differential privacy has also found applications outside of privacy. Specifically, Liberty et al. [LMTU14] used this technique to prove nearly optimal lower bounds on the space required to “sketch” a database while approximately preserving answers to $k$ -way marginal queries (called “frequent itemset queries” in their work).

Preliminaries

We define a database $D\in\mathcal{X}^{n}$ to be an ordered tuple of $n$ rows $(x_{1},\ldots,x_{n})\in\mathcal{X}^{n}$ chosen from a data universe $\mathcal{X}$ . We say that two databases $D,D^{\prime}\in\mathcal{X}^{n}$ are adjacent if they differ only by a single row, and we denote this by $D\sim D^{\prime}$ . In particular, we can replace the $i$ th row of a database $D$ with some fixed “junk” element of $\mathcal{X}$ to obtain another database $D_{-i}\sim D$ . We emphasize that if $D$ is a database of size $n$ , then $D_{-i}$ is also a database of size $n$ .
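As a small illustration of these conventions, the sketch below represents a database as a tuple of rows and builds $D_{-i}$ by overwriting one row with a junk element. The names `JUNK`, `remove_row`, and `adjacent` are hypothetical helpers chosen for this example, the universe is assumed to be $\{0,1\}^{3}$ , and rows are indexed from 0 rather than 1.

```python
# Hypothetical fixed "junk" element of the universe {0,1}^3 (an assumption).
JUNK = (0, 0, 0)

def remove_row(D, i, junk=JUNK):
    """Return D_{-i}: the same size as D, with row i (0-indexed) replaced by junk."""
    return D[:i] + (junk,) + D[i + 1:]

def adjacent(D1, D2):
    """Same-size databases are adjacent if they differ in at most one row."""
    return len(D1) == len(D2) and sum(a != b for a, b in zip(D1, D2)) <= 1

D = ((1, 0, 1), (1, 1, 1), (0, 1, 0))
D_minus_1 = remove_row(D, 1)
print(len(D_minus_1) == len(D), adjacent(D, D_minus_1))
```

Note that "removing" a row never shrinks the database, which is what allows $D$ and $D_{-i}$ to be compared as adjacent inputs of the same size.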

Let $\mathcal{A}:\mathcal{X}^{n}\to\mathcal{R}$ be a randomized algorithm (where $n$ is a varying parameter). $\mathcal{A}$ is $(\varepsilon,\delta)$ -differentially private if for every two adjacent databases $D\sim D^{\prime}$ and every subset $S\subseteq\mathcal{R}$ ,

$$\Pr[\mathcal{A}(D)\in S]\leq e^{\varepsilon}\Pr[\mathcal{A}(D^{\prime})\in S]+\delta.$$
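For a mechanism with a small finite output range, the differential privacy condition can be checked exhaustively. The sketch below is only an illustration: it assumes the mechanism is given as an explicit table of output probabilities, and it uses one-bit randomized response (not a mechanism discussed in this paper) as the test case. The function name `is_dp` and the tolerance `tol` are choices made for this example.

```python
from itertools import chain, combinations
from math import exp

def subsets(outputs):
    """All subsets S of a finite output range."""
    outputs = list(outputs)
    return chain.from_iterable(combinations(outputs, r) for r in range(len(outputs) + 1))

def is_dp(dist, pairs, eps, delta, tol=1e-12):
    """Check Pr[A(D) in S] <= e^eps * Pr[A(D') in S] + delta for all listed pairs and all S.

    dist[D] is a dict mapping each output to its probability on input D;
    pairs lists the ordered adjacent pairs (D, D') to check.
    """
    outputs = set().union(*(dist[D].keys() for D in dist))
    for D, D2 in pairs:
        for S in subsets(outputs):
            p = sum(dist[D].get(r, 0.0) for r in S)
            p2 = sum(dist[D2].get(r, 0.0) for r in S)
            if p > exp(eps) * p2 + delta + tol:
                return False
    return True

# Randomized response on a one-row database: report the true bit with
# probability e^eps / (1 + e^eps), and the flipped bit otherwise.
eps = 1.0
keep = exp(eps) / (1 + exp(eps))
rr = {(0,): {0: keep, 1: 1 - keep}, (1,): {0: 1 - keep, 1: keep}}
pairs = [((0,), (1,)), ((1,), (0,))]
print(is_dp(rr, pairs, eps, 0.0))   # the condition holds at eps = 1
print(is_dp(rr, pairs, 0.5, 0.0))   # and fails at eps = 0.5 with delta = 0
```

The same table satisfies the condition at $\varepsilon=0.5$ once $\delta$ is large enough to absorb the gap, which illustrates how $\delta$ relaxes the pure guarantee.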

Let $\mathcal{A}:\mathcal{X}^{n}\to\mathcal{R}$ be a randomized algorithm such that for every $D\in\mathcal{X}^{n}$ , every $i,j\in[n]$ , and every subset $S\subseteq\mathcal{R}$ ,

Let $\bot$ denote the fixed junk element of $\mathcal{X}$ . Then $\mathcal{A}^{\prime}:\mathcal{X}^{n-1}\to\mathcal{R}$ defined by $\mathcal{A}^{\prime}(x_{1},\ldots,x_{n-1})=\mathcal{A}(x_{1},\ldots,x_{n-1},\bot)$ is $(2\varepsilon,(e^{\varepsilon}+1)\delta)$ -differentially private.

Let $D=(x_{1},\ldots,x_{n-1})$ and $D^{\prime}=(x_{1},\ldots,x_{i}^{\prime},\ldots,x_{n-1})$ be adjacent databases. Then for any $S\subseteq\mathcal{R}$ , we have

2 Counting Queries and Accuracy

In this paper we study algorithms that answer counting queries. A counting query on $\mathcal{X}$ is defined by a predicate $q:\mathcal{X}\to\{0,1\}$ . Abusing notation, we define the evaluation of the query $q$ on a database $D=(x_{1},\ldots,x_{n})\in\mathcal{X}^{n}$ to be its average value over the rows,

$$q(D)=\frac{1}{n}\sum_{i=1}^{n}q(x_{i}).$$
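A minimal sketch of evaluating counting queries and checking accuracy follows. It assumes the reading of $(\alpha,\beta)$ -accuracy used later in the text, namely that an answer vector is within $\alpha$ of the true value on all but a $\beta$ fraction of the queries. The helper names and the toy database are illustrative only.

```python
def evaluate(q, D):
    """q(D): the average of the predicate q over the rows of D."""
    return sum(q(x) for x in D) / len(D)

def is_accurate(answers, queries, D, alpha, beta=0.0):
    """True if |a_q - q(D)| <= alpha for all but a beta fraction of the queries."""
    bad = sum(abs(a - evaluate(q, D)) > alpha for a, q in zip(answers, queries))
    return bad <= beta * len(queries)

D = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (1, 0, 0)]
# One predicate per attribute; the default argument pins j for each lambda.
queries = [lambda x, j=j: x[j] for j in range(3)]
print([evaluate(q, D) for q in queries])

answers = [0.7, 0.3, 0.2]   # the third answer is far from the truth
print(is_accurate(answers, queries, D, alpha=0.1))
print(is_accurate(answers, queries, D, alpha=0.1, beta=0.5))
```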

When $\beta=0$ we may simply write that $a$ or $\mathcal{A}$ is $\alpha$ -accurate for $\mathcal{Q}$ .

An important example of a collection of counting queries is the set of $k$ -way marginals. For all of our results it will be sufficient to consider only the set of monotone $k$ -way marginals.

A (monotone) $k$ -way marginal $q_{S}$ over $\{0,1\}^{d}$ is specified by a subset $S\subseteq[d]$ of size $|S|\leq k$ . It takes the value $q_{S}(x)=1$ if and only if $x_{i}=1$ for every index $i\in S$ . The collection of all (monotone) $k$ -way marginals is denoted by $\mathcal{M}_{k,d}$ .
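The definition can be made concrete with a short enumeration. In this sketch the function names are hypothetical, attributes are indexed from 0, and the empty set is included because the definition allows any $|S|\leq k$ (the corresponding query is identically 1).

```python
from itertools import combinations
from math import comb

def monotone_marginals(d, k):
    """Index sets S of the monotone k-way marginals: all subsets of size at most k."""
    return [S for size in range(k + 1) for S in combinations(range(d), size)]

def q_S(S, x):
    """q_S(x) = 1 if and only if x_i = 1 for every index i in S."""
    return int(all(x[i] == 1 for i in S))

M = monotone_marginals(d=4, k=2)
# The number of index sets is the sum of binomial coefficients C(d, j) for j <= k.
print(len(M), sum(comb(4, j) for j in range(3)))
print(q_S((0, 2), (1, 0, 1, 0)), q_S((0, 1), (1, 0, 1, 0)))
```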

3 Sample Complexity

In this work we prove lower bounds on the sample complexity required to simultaneously achieve differential privacy and accuracy.

We will focus on the case where $\varepsilon=O(1)$ and $\delta=o(1/n)$ . This setting of the parameters is essentially the most-permissive for which $(\varepsilon,\delta)$ -differential privacy is still a meaningful privacy definition. However, pinning down the exact dependence on $\varepsilon$ and $\delta$ is still of interest. Regarding $\varepsilon$ , this can be done via the following standard lemma, which allows us to take $\varepsilon=1$ without loss of generality.

For every set of counting queries $\mathcal{Q}$ , universe $\mathcal{X}$ , $\alpha,\beta\in[0,1]$ , and $\varepsilon\leq 1$ , $(\mathcal{Q},\mathcal{X})$ has sample complexity $n^{*}$ for $(\alpha,\beta)$ -accuracy and $(1,o(1/n))$ -differential privacy if and only if it has sample complexity $\Theta(n^{*}/\varepsilon)$ for $(\alpha,\beta)$ -accuracy and $(\varepsilon,o(1/n))$ -differential privacy.

One direction ( $O(n^{*}/\varepsilon)$ samples are sufficient) is the “secrecy-of-the-sample lemma,” which appeared implicitly in [KLN+11]. The other direction ( $\Omega(n^{*}/\varepsilon)$ samples are necessary) appears to be folklore.

The next lemma allows us to generically translate sample complexity lower bounds for constant accuracy into lower bounds that depend on the error parameter $\alpha$ . For some sets of queries, such as $1$ -way marginals, the dependence we get on $\alpha$ is tight. However, as we will see in Section 5, we can obtain lower bounds with an even stronger dependence on $\alpha$ for specific sets of queries.

Let $\mathcal{Q}$ be a set of counting queries on $\mathcal{X}$ and let $\beta,\varepsilon,\delta>0$ . Suppose $(\mathcal{Q},\mathcal{X})$ has sample complexity $n^{*}$ for $(\alpha_{0},\beta)$ -accuracy and $(\varepsilon,\delta)$ -differential privacy, where $\alpha_{0}\in(0,1)$ is a constant. Then $(\mathcal{Q},\mathcal{X})$ has sample complexity $\Omega(n^{*}/\alpha)$ for $(\alpha,\beta)$ -accuracy and $(\varepsilon,\delta)$ -differential privacy.

The mechanism $\mathcal{A}^{\prime}$ inherits $(\varepsilon,\delta)$ -differential privacy from $\mathcal{A}$ , since changing one row of $D^{\prime}$ changes one row of the padded database $D$ . Now we argue accuracy. Suppose $a_{q}$ is an answer such that $|a_{q}-q(D)|\leq\alpha$ . Note that by construction, $q(D)=\frac{1}{m}(nq(D^{\prime})+(m-n)q(x_{0}))$ , and hence $q(D^{\prime})=\frac{1}{n}(mq(D)-(m-n)q(x_{0}))$ . Thus we have

$$\left|\frac{1}{n}\left(ma_{q}-(m-n)q(x_{0})\right)-q(D^{\prime})\right|=\frac{m}{n}\left|a_{q}-q(D)\right|\leq\frac{m\alpha}{n}.$$

Taking $n=\lceil m\alpha/\alpha_{0}\rceil$ makes this quantity at most $\alpha_{0}$ , completing the proof. ∎
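The rescaling step can be checked numerically. The sketch below assumes, as the identity for $q(D)$ suggests, that the padded database $D$ consists of the $n$ rows of $D^{\prime}$ followed by $m-n$ copies of a fixed row $x_{0}$ ; the helper names and the concrete values ( $m=300$ , $\alpha=1/100$ , $\alpha_{0}=1/3$ ) are illustrative assumptions. Exact rational arithmetic is used so that the error comes out as exactly $m\alpha/n$ .

```python
from fractions import Fraction
from math import ceil

def pad(D_small, m, x0):
    """Pad an n-row database with m - n copies of the fixed row x0."""
    return list(D_small) + [x0] * (m - len(D_small))

def evaluate(q, D):
    """q(D) as an exact fraction."""
    return Fraction(sum(q(x) for x in D), len(D))

def unpad_answer(a, q, n, m, x0):
    """Turn an answer for the padded database into one for the n-row database."""
    return (m * a - (m - n) * q(x0)) / n

q = lambda x: x[0]
x0 = (0,)
alpha, alpha0 = Fraction(1, 100), Fraction(1, 3)
m = 300
n = ceil(m * alpha / alpha0)

D_small = [(1,), (0,), (1,), (1,), (0,), (1,), (1,), (1,), (0,)]  # n rows
D = pad(D_small, m, x0)

a = evaluate(q, D) + alpha            # a worst-case alpha-accurate answer on D
a_small = unpad_answer(a, q, n, m, x0)
err = abs(a_small - evaluate(q, D_small))
print(n, err, err <= alpha0)
```

The error on the small database is amplified by exactly the factor $m/n$ , which is why a lower bound at constant accuracy $\alpha_{0}$ translates into a bound that grows like $1/\alpha$ .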

For context, we can restate some prior results on differentially private counting query release in our sample-complexity terminology.

For every set of counting queries $\mathcal{Q}$ on $\mathcal{X}$ and every $\alpha>0$ , $(\mathcal{Q},\mathcal{X})$ has sample complexity at most

for $(\alpha,0)$ -accuracy and $(1,o(1/n))$ -differential privacy.

The next theorem shows that, when the data universe is not too small, the private multiplicative weights algorithm is nearly-optimal as a function of $|\mathcal{Q}|$ and $1/\alpha$ when each parameter is considered individually.

for $(\alpha,0)$ -accuracy and $(1,o(1/n))$ -differential privacy.

4 Re-identifiable Distributions

All of our eventual lower bounds will take the form of a “re-identification” attack, in which we possess data from a large number of individuals, and identify one such individual who was included in the database. In this attack, we choose a distribution on databases and give an adversary 1) a database $D$ drawn from that distribution and 2) either $\mathcal{A}(D)$ or $\mathcal{A}(D_{-i})$ for some row $i$ , where $\mathcal{A}$ is an alleged sanitizer. The adversary’s goal is to identify a row of $D$ that was given to the sanitizer. We say that the distribution is re-identifiable if there is an adversary who can identify such a row with sufficiently high confidence whenever $\mathcal{A}$ outputs accurate answers. If the adversary can do so, it means that there must be a pair of adjacent databases $D\sim D_{-i}$ such that the adversary can distinguish $\mathcal{A}(D)$ from $\mathcal{A}(D_{-i})$ , which means $\mathcal{A}$ cannot be differentially private.

Here the probability is taken over the choice of $D$ and $i$ as well as the coins of $\mathcal{A}$ and $\mathcal{B}$ . We allow $\mathcal{D}$ and $\mathcal{B}$ to share a common state.

Note that when row $i$ is not in the dataset, it would be an error for $\mathcal{B}$ to declare that row $i$ is in the dataset, and condition 2 requires that the probability of this error occurring is at most $\xi$ .

The common state between $\mathcal{D}$ and $\mathcal{B}$ should be thought of as auxiliary information about the realization of $D$ that may help $\mathcal{B}$ identify a user $i$ . Formally, we could model this shared state by having $\mathcal{D}$ output an additional string $aux$ that is given to $\mathcal{B}$ but not to $\mathcal{A}$ . However, we make the shared state implicit to reduce notational clutter. The need for this shared state will become apparent when we use fingerprinting codes to construct re-identifiable distributions; in the context of fingerprinting codes, the shared state represents auxiliary information about a codebook that helps the $\mathit{Trace}$ algorithm accuse a guilty pirate.

Lower Bounds via Fingerprinting Codes

In Section 3.1 we give the relevant background on fingerprinting codes and in Section 3.2 we prove our lower bounds for $1$ -way marginals.

Fingerprinting codes were introduced by Boneh and Shaw [BS98] to address the problem of watermarking digital content. A fingerprinting code is a pair of randomized algorithms $(\mathit{Gen},\mathit{Trace})$ . The code generator $\mathit{Gen}$ outputs a codebook $C\in\{0,1\}^{n\times d}$ . Each row $c_{i}$ of $C$ is the codeword of user $i$ . For a subset of users $S\subseteq[n]$ , we use $C_{S}\in\{0,1\}^{|S|\times d}$ to denote the set of codewords of users in $S$ . The parameter $d$ is called the length of the fingerprinting code.

The security property of fingerprinting codes asserts that any codeword can be “traced” to a user $i\in[n]$ . Moreover, we require that the fingerprinting code is “fully-collusion-resilient”: even if any “coalition” of users $S\subseteq[n]$ gets together and “combines” their codewords in any way that respects certain constraints known as a marking assumption, then the combined codeword $c^{\prime}$ can be traced to a user $i\in S$ . That is, there is a tracing algorithm $\mathit{Trace}$ that takes as inputs the codebook and combined codeword $c^{\prime}$ and outputs either a user $i\in[n]$ or $\bot$ , and we require that if $c^{\prime}$ satisfies the constraints, then $\mathit{Trace}(C,c^{\prime})\in S$ with high probability. Moreover, $\mathit{Trace}$ should accuse an innocent user, i.e. $\mathit{Trace}(C,c^{\prime})\in[n]\setminus S$ , with very low probability. Analogous to the definition of re-identifiable distributions (Definition 2.10), we allow $\mathit{Gen}$ and $\mathit{Trace}$ to share a common state. (As in Definition 2.10, we could model this by having $\mathit{Gen}$ output an additional string $aux$ that is given to $\mathit{Trace}$ . However, we make the shared state implicit to reduce notational clutter.) When designing fingerprinting codes, one tries to make the marking assumption on the combined codeword as weak as possible.

The basic marking assumption is that each bit of the combined word $c^{\prime}$ must match the corresponding bit for some user in $S$ . Formally, for a codebook $C\in\{0,1\}^{n\times d}$ , and a coalition $S\subseteq[n]$ , we define the set of feasible codewords for $C_{S}$ to be

$$F(C_{S})=\left\{c^{\prime}\in\{0,1\}^{d}\mid\forall j\in[d],\ \exists i\in S,\ c^{\prime}_{j}=c_{ij}\right\}.$$

Observe that the combined codeword is only constrained on coordinates $j$ where all users in $S$ agree on the $j$ -th bit.
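The basic marking assumption is easy to state as a membership test. The sketch below assumes a codebook stored as a list of 0/1 tuples and a coalition given as a list of 0-indexed users; `is_feasible` is a hypothetical helper name.

```python
def is_feasible(c_prime, C, S):
    """True if every bit of c_prime matches that bit of some codeword c_i with i in S."""
    return all(any(C[i][j] == c_prime[j] for i in S) for j in range(len(c_prime)))

C = [(1, 0, 1, 0),
     (1, 1, 0, 0),
     (1, 0, 0, 0)]
S = [0, 1]   # users 0 and 1 agree on columns 0 and 3 only

print(is_feasible((1, 1, 1, 0), C, S))   # free on columns 1 and 2, agrees elsewhere
print(is_feasible((0, 1, 1, 0), C, S))   # violates the constraint on column 0
```

Only the unanimous columns constrain the pirates, so the coalition above can produce four different feasible codewords.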

We are now ready to formally define a fingerprinting code.

where the probability is taken over the coins of $\mathit{Gen},\mathit{Trace}$ , and $\mathcal{A}_{\mathit{FP}}$ . The algorithms $\mathit{Gen}$ and $\mathit{Trace}$ may share a common state.

We remark that our proof of Theorem 3.5, showing how to construct re-identifiable distributions from a fingerprinting code, will only require collusion resilience against coalitions $S$ of size $|S|\geq n-1$ . Our choice to state Definition 3.1 using resilience against arbitrary coalitions is more consistent with the literature on fingerprinting codes.

Tardos [Tar08] constructed a family of fingerprinting codes with a nearly optimal number of users $n$ for a given length $d$ .

As we will see in the next subsection, fingerprinting codes satisfying Definition 3.1 will imply lower bounds on the sample complexity for releasing $1$ -way marginals with $(\alpha,0)$ -accuracy (accuracy for every query). In order to prove sample-complexity lower bounds for $(\alpha,\beta)$ -accuracy with $\beta>0$ , we will need fingerprinting codes satisfying a stronger security property. Specifically, we will expand the feasible set $F(C_{S})$ to include all codewords that satisfy most feasibility constraints, and require that even codewords in this expanded set can be traced. Formally, for any $\beta\in[0,1]$ , we define

$$F_{\beta}(C_{S})=\left\{c^{\prime}\in\{0,1\}^{d}\mid\Pr_{j\leftarrow_{\mbox{\tiny R}}[d]}\left[\exists i\in S,\ c^{\prime}_{j}=c_{ij}\right]\geq 1-\beta\right\}.$$

where the probability is taken over the coins of $\mathit{Gen},\mathit{Trace}$ , and $\mathcal{A}_{\mathit{FP}}$ . The algorithms $\mathit{Gen}$ and $\mathit{Trace}$ may share a common state.

In Section 6 we show how to construct error-robust fingerprinting codes with a nearly-optimal number of users that are tolerant to a constant fraction of errors.
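The relaxed feasibility condition behind error-robust codes can be sketched in the same style. This example assumes that "most feasibility constraints" means at least a $1-\beta$ fraction of the $d$ coordinates; the helper names are hypothetical.

```python
def satisfied_fraction(c_prime, C, S):
    """Fraction of coordinates j where c_prime[j] matches some codeword in the coalition."""
    d = len(c_prime)
    ok = sum(any(C[i][j] == c_prime[j] for i in S) for j in range(d))
    return ok / d

def is_beta_feasible(c_prime, C, S, beta):
    """True if c_prime violates the marking constraint on at most a beta fraction of coordinates."""
    return satisfied_fraction(c_prime, C, S) >= 1 - beta

C = [(1, 0, 1, 0),
     (1, 1, 0, 0),
     (1, 0, 0, 0)]
S = [0, 1]
c_prime = (0, 1, 1, 0)   # violates the constraint on column 0 only

print(satisfied_fraction(c_prime, C, S))
print(is_beta_feasible(c_prime, C, S, beta=0.25))
print(is_beta_feasible(c_prime, C, S, beta=0.1))
```

Setting `beta=0` recovers the basic marking assumption, so a robust code must trace strictly more pirate strategies than an ordinary one.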

2 Lower Bounds for 1-Way Marginals

By combining Theorem 3.5 with Theorem 3.2 we obtain a sample complexity lower bound for $1$ -way marginals, and thereby establish Theorem 1.1 in the introduction.

Let $(\mathit{Gen},\mathit{Trace})$ be the promised fingerprinting code. We define the re-identifiable distribution $\mathcal{D}$ to simply be the output distribution of the code generator, $\mathit{Gen}$ . And we define the privacy adversary $\mathcal{B}$ to take the answers $a=\mathcal{A}(D)\in[0,1]^{|\mathcal{M}_{1,d}|}$ , obtain $\overline{a}\in\{0,1\}^{|\mathcal{M}_{1,d}|}$ by rounding each entry of $a$ to $\{0,1\}$ , run the tracing algorithm $\mathit{Trace}$ on the rounded answers $\overline{a}$ , and return its output. The shared state of $\mathcal{D}$ and $\mathcal{B}$ will be the shared state of $\mathit{Gen}$ and $\mathit{Trace}$ .
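The rounding step of the adversary can be illustrated on a toy codebook. The sketch below covers only the rounding; $\mathit{Trace}$ is left abstract, and the codebook, the noisy answers, and the function names are all made up for this example.

```python
def one_way_marginals(D):
    """Exact fraction of ones in each column of D."""
    n, d = len(D), len(D[0])
    return [sum(row[j] for row in D) / n for j in range(d)]

def round_answers(a):
    """Round each answer in [0,1] to the nearest value in {0,1}."""
    return tuple(1 if x >= 0.5 else 0 for x in a)

D = [(1, 0, 1, 0),
     (1, 1, 0, 0),
     (1, 0, 0, 0)]
exact = one_way_marginals(D)
noisy = [0.7, 0.6, 0.1, 0.3]   # hypothetical answers, each within 1/3 of the exact value
assert all(abs(a - e) <= 1 / 3 for a, e in zip(noisy, exact))

rounded = round_answers(noisy)
# Columns on which every row agrees are exactly those constrained by the marking assumption.
unanimous = [j for j in range(len(D[0])) if len({row[j] for row in D}) == 1]
assert all(rounded[j] == D[0][j] for j in unanimous)
print(rounded, unanimous)
```

Error below one half on a unanimous column is enough for rounding to recover the common bit, which is why accuracy $1/3$ suffices in the argument.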

Now we will verify that $\mathcal{D}$ is $(\xi,\xi)$ -re-identifiable. First, suppose that $\mathcal{A}(D)$ outputs answers $a=(a_{q_{j}})_{j\in[d]}$ that are $(1/3,\beta)$ -accurate for $1$ -way marginals. That is, there is a set $G\subseteq[d]$ such that $|G|\geq(1-\beta)d$ and for every $j\in G$ , the answer $a_{q_{j}}$ estimates the fraction of rows having a $1$ in column $j$ to within $1/3$ . Let $\overline{a}_{q_{j}}$ be $a_{q_{j}}$ rounded to the nearest value in $\{0,1\}$ . Let $j$ be a column in $G$ . If column $j$ has all $1$ ’s, then $a_{q_{j}}\geq 2/3$ , and $\overline{a}_{q_{j}}=1$ . Similarly, if column $j$ has all $0$ ’s, then $a_{q_{j}}\leq 1/3$ , and $\overline{a}_{q_{j}}=0$ . Therefore, $\overline{a}$ satisfies the marking constraint on every column in $G$ , that is, on at least a $1-\beta$ fraction of the columns of $D$ .

By security of the fingerprinting code (Definition 3.3), we have

But the event $\mathit{Trace}(D,\overline{a})=\bot$ is exactly the same as $\mathcal{B}(D,\mathcal{A}(D))=\bot$ , and thus we have established the first condition necessary for $\mathcal{D}$ to be $(\xi,\xi)$ -re-identifiable.

The second condition for re-identifiability follows directly from the soundness of the fingerprinting code, which asserts that for every adversary $\mathcal{A}_{\mathit{FP}}$ , in particular for $\mathcal{A}$ , it holds that

Using the additional structure of Tardos’ fingerprinting code, and our robust fingerprinting codes, we can prove minimax lower bounds for an “inference version” of the problem of computing the $1$ -way marginals of a product distribution.

We can now formally define the problem of inferring the marginals $p$ as follows.

Our lower bound can thus be stated as follows,

The proof has the same general structure that we used to prove Theorem 3.5. Here, we describe additional observations about the structure of the fingerprinting codes used in that proof (see Section 6 for a description of Tardos’ fingerprinting code) that allow it to carry over to the inference version of computing $1$ -way marginals.

First, in Tardos’ (non-robust) fingerprinting code, the codebook $D$ is chosen by first sampling marginals $p\in[0,1]^{d}$ from an appropriate distribution and then sampling $D$ from $\mathcal{D}_{p}^{\otimes n}$ . The robust fingerprinting codes we construct in Section 6 also have this property. (To generate a codebook $D^{\prime}$ for our robust fingerprinting code, we sample a codebook $D$ from Tardos’ fingerprinting code and then insert additional columns of all $1$ ’s or all $0$ ’s into $D$ in random locations. Equivalently, we can obtain a codebook $D^{\prime}$ by appending $1$ ’s and $0$ ’s in random locations of $p$ to obtain a vector $p^{\prime}$ and then sampling $D^{\prime}$ from $\mathcal{D}_{p^{\prime}}^{\otimes n}$ .) Thus the instances used to prove Theorem 3.5 indeed consist of independent samples from a product distribution, which is what the inference problem assumes.

Next, recall that the proof of Theorem 3.5 shows that any string that is $(\alpha,\beta)$ -accurate for the $1$ -way marginals of $D$ can be traced successfully. It is moreover the case that any string that is $(\alpha,\beta)$ -accurate for the marginals $p$ can also be traced successfully. This is because the rows of $D$ are sampled independently from $\mathcal{D}_{p}$ , so accuracy for the $1$ -way marginals of $D$ and accuracy for $p$ coincide with high probability, at least when $n=\omega(\log d)$ :

Let $p\in[0,1]^{d}$ and let $D\leftarrow_{\mbox{\tiny R}}\mathcal{D}_{p}^{\otimes n}$ . Let $a\in[0,1]^{d}$ denote the exact $1$ -way marginals of $D$ . Then for every $\alpha,\eta>0$ , and $n=\Omega(\log(d/\eta)/\alpha^{2})$ , we have $\|a-p\|_{\infty}\leq\alpha$ with probability at least $1-\eta$ over the choice of $D$ .
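One standard route to a bound of this form is Hoeffding’s inequality for each coordinate followed by a union bound over the $d$ coordinates, which gives $\Pr[\|a-p\|_{\infty}>\alpha]\leq 2d\exp(-2n\alpha^{2})$ . The sketch below computes the resulting sample size and checks it empirically; the explicit constant comes from this particular derivation, not from the source, and the helper names are illustrative.

```python
import math
import random

def hoeffding_sample_size(d, alpha, eta):
    """Smallest n with 2*d*exp(-2*n*alpha^2) <= eta (Hoeffding + union bound)."""
    return math.ceil(math.log(2 * d / eta) / (2 * alpha ** 2))

def max_marginal_error(p, n, rng):
    """Sample D from the n-fold product of D_p and return ||a - p||_inf,
    where a is the vector of exact 1-way marginals of D."""
    counts = [0] * len(p)
    for _ in range(n):
        for j, pj in enumerate(p):
            if rng.random() < pj:
                counts[j] += 1
    return max(abs(c / n - pj) for c, pj in zip(counts, p))

rng = random.Random(1)
d, alpha, eta = 20, 0.1, 1e-6
p = [rng.random() for _ in range(d)]   # arbitrary marginals for illustration
n = hoeffding_sample_size(d, alpha, eta)
err = max_marginal_error(p, n, rng)
```

With these parameters the bound asks for fewer than a thousand rows, and the observed sup-norm error should fall well inside $\alpha$ .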

We remark that Steinke and Ullman [SU15a] showed that accuracy with respect to the marginals $p$ actually suffices to trace regardless of the value of $n$ .

These two observations suffice to show that, when $n$ is too small, a differentially private algorithm cannot be accurate for $p$ with high probability over the choices of both $p$ and $D$ . Thus, for every differentially private algorithm, there exists some $p$ such that the algorithm is not accurate with high probability over the choice of $D$ , which means that the algorithm does not accurately infer the marginals of an arbitrary product distribution. ∎

3 Fingerprinting Codes for General Query Families

In this section, we generalize the connection between fingerprinting codes and sample complexity lower bounds for arbitrary sets of queries. We show that a generalized fingerprinting code with respect to any family of counting queries $\mathcal{Q}$ yields a sample complexity lower bound for $\mathcal{Q}$ , which is analogous to our lower bound for $1$ -way marginals (Theorem 3.5). We then argue that some type of fingerprinting code is necessary to prove any sample complexity lower bound by exhibiting a tight connection between such lower bounds and a weak variant of our generalized fingerprinting codes.

We begin by defining our generalization of fingerprinting codes. Fix a finite data universe $\mathcal{X}$ and a set of counting queries $\mathcal{Q}$ over $\mathcal{X}$ . A generalized fingerprinting code with respect to the family $\mathcal{Q}$ consists of a pair of randomized algorithms $(\mathit{Gen},\mathit{Trace})$ . The code generation algorithm $\mathit{Gen}$ produces a codebook $C\in\mathcal{X}^{n}$ . Each row $c_{i}$ of $C$ is the codeword corresponding to user $i$ . A coalition $S\subseteq[n]$ of pirates receives the subset $C_{S}=\{c_{i}:i\in S\}$ of codewords, and produces an answer vector $a\in[0,1]^{|\mathcal{Q}|}$ . We replace the traditional marking condition on the pirates with the generalized constraint that they output a feasible answer vector. A natural way to define feasibility for answer vectors is to require a condition similar to $(\alpha,\beta)$ -accuracy, i.e. an answer vector $a$ is feasible if $|a_{q}-q(C_{S})|\leq\alpha$ for all but a $\beta$ fraction of queries $q\in\mathcal{Q}$ . We thus define a generalized set of feasible answer vectors by $F_{\alpha,\beta}(C_{S})=\{a\in[0,1]^{|\mathcal{Q}|}:\Pr_{q\leftarrow_{\mbox{\tiny R}}\mathcal{Q}}[|a_{q}-q(C_{S})|\leq\alpha]\geq 1-\beta\}$ .

When $\alpha=1-1/n$ , the generalized set of feasible answer vectors captures the traditional marking assumption by rounding each entry of a feasible answer vector to $0$ or $1$ . An equivalent way to view a codebook is as a set of $n$ codewords $C\in(\{0,1\}^{|\mathcal{Q}|})^{n}$ , where each user’s codeword is $c_{i}=(q(x))_{q\in\mathcal{Q}}$ for some $x\in\mathcal{X}$ . Notice that the case where $\mathcal{Q}$ is the class of $1$ -way marginals places no constraints on the structure of a codeword, i.e. a codeword can be any binary string. With this viewpoint, the goal of the pirates is to output an answer vector $a\in[0,1]^{|\mathcal{Q}|}$ with $|a_{q}-\frac{1}{|S|}\sum_{i\in S}(c_{i})_{q}|\leq\alpha$ for all but a $\beta$ fraction of the queries $q\in\mathcal{Q}$ .
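The feasibility condition on the pirates’ answer vector is easy to state as a predicate. The following is a minimal sketch in which queries are modelled as Python callables returning 0 or 1; the names `counting_query` and `is_feasible` are illustrative and do not appear in the paper.

```python
from fractions import Fraction

def counting_query(q, rows):
    """q(C_S): fraction of the coalition's codewords that satisfy q."""
    return Fraction(sum(q(x) for x in rows), len(rows))

def is_feasible(a, rows, queries, alpha, beta):
    """True iff |a_q - q(C_S)| <= alpha for all but a beta fraction of queries."""
    bad = sum(1 for a_q, q in zip(a, queries)
              if abs(a_q - counting_query(q, rows)) > alpha)
    return bad <= beta * len(queries)

# Toy instance: 1-way marginals over {0,1}^3, coalition holding two codewords.
queries = [lambda x, j=j: x[j] for j in range(3)]
C_S = [(1, 0, 1), (1, 1, 0)]
exact = [counting_query(q, C_S) for q in queries]
```

For example, the exact answers are feasible even with $\alpha=\beta=0$ , while a vector that is wrong on one of the three queries is feasible only once $\beta\geq 1/3$ .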

A pair of algorithms $(\mathit{Gen},\mathit{Trace})$ is an $(n,\mathcal{Q})$ -fingerprinting code for $(\alpha,\beta)$ -accuracy with security $(\gamma,\xi)$ if $\mathit{Gen}$ outputs a codebook $C\in\mathcal{X}^{n}$ and for every (possibly randomized) adversary $\mathcal{A}_{\mathit{FP}}$ , and every coalition $S\subseteq[n]$ with $|S|\geq n-1$ , if we set $a\leftarrow_{\mbox{\tiny R}}\mathcal{A}_{\mathit{FP}}(C_{S})$ , then

where the probability is taken over the coins of $\mathit{Gen},\mathit{Trace}$ , and $\mathcal{A}_{\mathit{FP}}$ . The algorithms $\mathit{Gen}$ and $\mathit{Trace}$ may share a common state.

The security properties of Definition 3.12 differ from those of an ordinary fingerprinting code in two ways so as to enable a clean statement of a composition theorem for generalized fingerprinting codes (Theorem 4.6). First, we use two separate security parameters $\gamma,\xi$ for the different types of tracing errors, as in the definition of re-identifiable distributions. Second, security only needs to hold for coalitions of size $n-1$ or $n$ . However, this condition implies security for coalitions of arbitrary size with an increased false accusation probability of $n\xi$ .

As in Theorem 3.5, the existence of a generalized $(n,\mathcal{Q})$ -fingerprinting code implies a sample complexity lower bound of $n$ for privately releasing answers to $\mathcal{Q}$ , with essentially the same proof.

In particular, if $\gamma\leq 1/3$ and $\xi=o(1/n)$ , then there is no algorithm $\mathcal{A}:\mathcal{X}^{n}\to[0,1]^{|\mathcal{Q}|}$ that is $(O(1),o(1/n))$ -differentially private and $(\alpha,\beta)$ -accurate for $\mathcal{Q}$ .

We now turn to investigate whether a converse to Theorem 3.13 holds. We show that a sample complexity lower bound for a family of queries $\mathcal{Q}$ is essentially equivalent to the existence of a weak type of fingerprinting code, where the tracing procedure depends on the family $\mathcal{Q}$ and the tracing error probabilities satisfy a certain affine constraint. It remains an interesting open question to determine the precise relationship between privacy lower bounds and our notion of generalized fingerprinting codes.

A pair of algorithms $(\mathit{Gen},\mathit{Trace})$ is an $(n,\mathcal{Q})$ -weak fingerprinting code for $(\alpha,\beta)$ -accuracy with security $(\varepsilon,\delta)$ if $\mathit{Gen}$ outputs a codebook $C\in\mathcal{X}^{n}$ and for every (possibly randomized) adversary $\mathcal{A}_{\mathit{FP}}$ that outputs a feasible answer vector with probability $2/3$ , and every coalition $S\subseteq[n]$ with $|S|\geq n-1$ , if we set $a\leftarrow_{\mbox{\tiny R}}\mathcal{A}_{\mathit{FP}}(C_{S})$ , then

where the probabilities are taken over the coins of $\mathit{Gen}$ , $\mathit{Trace}$ , and $\mathcal{A}_{\mathit{FP}}$ . The algorithms $\mathit{Gen}$ and $\mathit{Trace}$ may share a common state.

That is, we require the false accusation probability $\Pr[\mathit{Trace}(C,a)\in[n]\setminus S]$ to be much smaller than the total probability of accusing any user. Note that a tracing algorithm that accuses a random user with probability $p$ will falsely accuse a user with probability $p/n$ when $|S|=n-1$ ; however, this does not satisfy Definition 3.14 because we require the gap between the two probabilities to be at least a factor of $e^{\varepsilon}n$ .

Observe that taking $\xi<(1-\delta)/2e^{\varepsilon}n$ in Definition 3.12 yields an $(n,\mathcal{Q})$ -weak fingerprinting code with security $(\varepsilon,\delta)$ . However, Definition 3.14 is weaker than Definition 3.12 in a few important ways. First, security only holds against pirates with a failure probability of at most $1/3$ . Second, while Definition 3.12 requires completeness error $\Pr[\mathit{Trace}(C,a)=\bot]<\xi$ , a weak fingerprinting code allows $\Pr[\mathit{Trace}(C,a)=\bot]=1-o(1)$ as long as $\Pr[\mathit{Trace}(C,a)\in[n]\setminus S]$ is sufficiently small.

The following theorem shows that the existence of an $(n,\mathcal{Q})$ -weak fingerprinting code is essentially equivalent to a sample complexity lower bound of $n$ against $\mathcal{Q}$ .

Then there exists an $i^{*}$ such that $\Pr[\mathit{Trace}(C,\mathcal{A}_{\mathit{FP}}(C))=i^{*}]\geq p/n$ . By differential privacy,

On the other hand, by the security of the weak fingerprinting code and differential privacy,

This yields a contradiction whenever $\varepsilon^{\prime}\leq\varepsilon/2$ and $\delta^{\prime}\leq\delta/(1+e^{\varepsilon/2}n)$ .

We now show the converse direction, i.e. that the high sample complexity of $(\mathcal{Q},\mathcal{X})$ implies the existence of a weak fingerprinting code. We begin with a technical lemma which shows that the high sample complexity of $\mathcal{Q}$ also rules out mechanisms that satisfy only a one-sided constraint on the probability of any event under the replacement of one row:

Let $\varepsilon\leq 1/2$ . Let $\mathcal{A}$ be an $(\alpha,\beta)$ -accurate algorithm for $\mathcal{Q}$ on databases $D\in\mathcal{X}^{m}$ . Suppose that for all databases $D\in\mathcal{X}^{m}$ , all $i\in[m]$ , and all measurable $T\subseteq\text{Range}(\mathcal{A})$ , we have

Let $d=\mathit{VC}(\mathcal{Q})$ be the VC-dimension of $\mathcal{Q}$ and let

Then there exists a $(6\varepsilon,(e^{2\varepsilon}+e^{5\varepsilon})\delta)$ -differentially private algorithm $\mathcal{B}$ on databases of size $n=\lceil m/\varepsilon\rceil$ that gives $(\alpha+\alpha^{\prime},\beta)$ -accurate answers to $\mathcal{Q}$ on any database $D^{\prime}\in\mathcal{X}^{n}$ with probability at least $1/2$ .

On input a database $D^{\prime}\in\mathcal{X}^{n}$ , consider the algorithm $\mathcal{B}^{\prime}$ that samples a random subset $D$ consisting of $m$ rows from $D^{\prime}$ (without replacement) and returns $\mathcal{A}(D)$ . Then by our hypothesis on $\mathcal{A}$ , for every $i\in[n]$ and every measurable $T\subseteq\text{Range}(\mathcal{B})=\text{Range}(\mathcal{A})$ we have

On the other hand, a “secrecy-of-the-sample” argument [KLN+11] enables us to obtain the reverse inequality. For a row $k\in[n]$ , consider the following two experiments:

Experiment 1: Sample a random subset $D$ of $m$ rows from $D^{\prime}_{-k}$ .

Experiment 2: Sample $j\leftarrow_{\mbox{\tiny R}}[n]$ , and then sample a random subset $D$ of $m$ rows from $D^{\prime}_{-j}$ .

Any database $D$ sampleable under Experiment 1 appears with probability $1/{n\choose m}$ , but appears with probability at least

Combining the two inequalities shows that for every database $D^{\prime}\in\mathcal{X}^{n}$ and every $i,k\in[n]$ ,

By Lemma 2.2, the algorithm $\mathcal{B}(D^{\prime}_{1},\ldots,D^{\prime}_{n-1})=\mathcal{B}^{\prime}(D^{\prime}_{1},\ldots,D^{\prime}_{n-1},\bot)$ is $(6\varepsilon,(e^{2\varepsilon}+e^{5\varepsilon})\delta)$ -differentially private.

Finally, uniform convergence of the sampling error of $\mathcal{B}^{\prime}$ implies that it remains an accurate algorithm, and hence so is $\mathcal{B}$ . In particular, when $D$ is a random sample of $m$ rows from $D^{\prime}$ and $d$ is the VC-dimension of $\mathcal{Q}$ , we have [AB09]:

Taking $\alpha^{\prime}$ as in the theorem statement makes the total failure probability of $\mathcal{B}$ at most $1/2$ . ∎

Now we proceed to complete the proof of Theorem 3.15. Suppose $(\mathcal{Q},\mathcal{X})$ has sample complexity greater than $n$ for $(\alpha+\alpha^{\prime},\beta)$ -accuracy (with failure probability $1/2$ ) and $(6\varepsilon,(e^{2\varepsilon}+e^{5\varepsilon})\delta)$ -differential privacy. By Lemma 3.16, for every $(\alpha,\beta)$ -accurate mechanism $\mathcal{A}$ for $\mathcal{Q}$ there exists a database $D\in\mathcal{X}^{m}$ with $m=\lfloor n\varepsilon\rfloor$ , a set $T$ , and an index $i$ such that

We now argue that it is without loss of generality to restrict our attention to mechanisms $\mathcal{A}$ whose range is the finite set $I_{m}^{|\mathcal{Q}|}=\{0,\frac{1}{2m},\frac{1}{m},\ldots,1-\frac{1}{2m},1\}^{|\mathcal{Q}|}$ . To see this, note that the exact answer to any counting query $q$ on a database $D\in\mathcal{X}^{m}$ is in the set $\{0,\frac{1}{m},\frac{2}{m},\ldots,1-\frac{1}{m},1\}$ . Therefore, if an answer $a\in[0,1]$ satisfies $|a-q(D)|\leq\alpha$ , then the value

is a point in $I_{m}$ that also satisfies $|\bar{a}-q(D)|\leq\alpha$ . Thus, we will henceforth assume that the mechanism’s output lies in this finite range.

We now apply the min-max theorem from game theory (or equivalently, linear programming duality), to exhibit a fixed distribution on $(D,T,i)$ for which Inequality (3) holds. Specifically, consider a two-player zero-sum game in which Player 1 chooses a triple $(D,T,i)$ , where $D\in\mathcal{X}^{m}$ , $T\subseteq I_{m}^{|\mathcal{Q}|}$ , and $i\in[m]$ , and Player 2 chooses a randomized function $\mathcal{A}:\mathcal{X}^{m}\to I_{m}^{|\mathcal{Q}|}$ that is $(\alpha,\beta)$ -accurate for $\mathcal{Q}$ . Let the payoff to Player 1 be

By inequality (3), the value of this game is greater than $\delta$ . So by the min-max theorem there exists a mixed strategy for Player 1 that achieves a payoff greater than $\delta$ against any mixed strategy for Player 2. (Note that we can apply the min-max theorem because we have assumed that the mechanism’s output lies in a finite range.) That is, there exists a distribution $\mathcal{D}$ over triples $(D,T,i)$ such that for any randomized algorithm $\mathcal{A}:\mathcal{X}^{m}\to I_{m}^{|\mathcal{Q}|}$ that takes any $D$ to a feasible vector in $F_{\alpha,\beta}(D)$ with probability at least $2/3$ ,

Now consider the following code: $\mathit{Gen}$ samples a database $D$ , a set $T$ , and an index $i$ according to the promised distribution $\mathcal{D}$ . The codebook $C$ is $(D_{\pi(1)},\ldots,D_{\pi(m)})$ where $\pi:[m]\to[m]$ is a random permutation. On input an answer vector $a$ , the algorithm $\mathit{Trace}$ checks whether $a\in T$ . If it is, then $\mathit{Trace}$ outputs $\pi(i)$ , and otherwise outputs $\bot$ .

To analyze the security of this code, fix a coalition $S$ of $m-1$ users using a pirate strategy $\mathcal{A}_{\mathit{FP}}$ . Because the codebook is a random permutation of the rows of $D$ , it is equivalent to analyze the original database $D$ and a random coalition of $m-1$ users. Thus the part of the codebook $C_{S}$ given to the pirates is a random set of $m-1$ rows from $D$ , i.e. $D_{-j}$ for a random $j\in[m]$ with the junk row at index $j$ removed. The condition that $\mathcal{A}_{\mathit{FP}}$ outputs a feasible answer vector is equivalent to $a=\mathcal{A}_{\mathit{FP}}(C_{S})$ being an $(\alpha,\beta)$ -accurate answer vector. Therefore, letting $\mathcal{A}:\mathcal{X}^{m}\to I_{m}^{|\mathcal{Q}|}$ be the algorithm that runs $\mathcal{A}_{\mathit{FP}}$ on its input with the junk row removed, we have

On the other hand, the probability that $\mathit{Trace}$ outputs the user $j$ not in the coalition is

because the events $\{j=i\}$ and $\{\mathit{Trace}(C,a)=i\}$ are independent. Thus by (4),

where both probabilities are taken over the coins of $\mathit{Gen},\mathit{Trace}$ , and $\mathcal{A}_{\mathit{FP}}$ . ∎

A Composition Theorem for Sample Complexity

In this section we state and prove a composition theorem for sample complexity lower bounds. At a high level, the composition theorem starts with two pairs, $(\mathcal{Q},\mathcal{X})$ and $(\mathcal{Q}^{\prime},\mathcal{X}^{\prime})$ , for which we know sample-complexity lower bounds of $n$ and $n^{\prime}$ respectively, and attempts to prove a sample-complexity lower bound of $n\cdot n^{\prime}$ for a related family of queries on a related data universe.

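Specifically, our sample-complexity lower bound will apply to the “product” of $\mathcal{Q}$ and $\mathcal{Q}^{\prime}$ , defined on $\mathcal{X}\times\mathcal{X}^{\prime}$ . We define the product $\mathcal{Q}\land\mathcal{Q}^{\prime}$ to be $\mathcal{Q}\land\mathcal{Q}^{\prime}=\{q\land q^{\prime}:(x,x^{\prime})\mapsto q(x)\land q^{\prime}(x^{\prime})\mid q\in\mathcal{Q},q^{\prime}\in\mathcal{Q}^{\prime}\}$ .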

Since $q,q^{\prime}$ are boolean-valued, their conjunction can also be written $q(x)q^{\prime}(x^{\prime})$ .

We now begin to describe how we can prove a sample complexity lower bound for $\mathcal{Q}\land\mathcal{Q}^{\prime}$ . First, we describe a certain product operation on databases. Let $D\in\mathcal{X}^{n}$ , $D=(x_{1},\ldots,x_{n})$ , be a database. Let $D^{\prime}_{1},\ldots,D^{\prime}_{n}\in(\mathcal{X}^{\prime})^{n^{\prime}}$ where $D^{\prime}_{i}=(x^{\prime}_{i1},\ldots,x^{\prime}_{in^{\prime}})$ be $n$ databases. We define the product database $D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n})\in(\mathcal{X}\times\mathcal{X}^{\prime})^{n\cdot n^{\prime}}$ as follows: For every $i=1,\ldots,n,j=1,\ldots,n^{\prime}$ , let the $(i,j)$ -th row of $D^{*}$ be $x^{*}_{(i,j)}=(x_{i},x^{\prime}_{ij})$ . Note that we index the rows of $D^{*}$ by $(i,j)$ . We will sometimes refer to $D^{\prime}_{1},\ldots,D^{\prime}_{n}$ as the “subdatabases” of $D^{*}$ .
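The product operation on databases can be written down directly. This is a minimal sketch; the function name `product_database` and the use of a dictionary keyed by $(i,j)$ are illustrative choices, made only to mirror the row indexing described above.

```python
def product_database(D, subdatabases):
    """Return D* = D x (D'_1, ..., D'_n) as a dict keyed by (i, j), 1-indexed,
    where row (i, j) is the pair (x_i, x'_{ij})."""
    n = len(D)
    assert len(subdatabases) == n, "need one subdatabase per row of D"
    n_prime = len(subdatabases[0])
    assert all(len(sub) == n_prime for sub in subdatabases)
    return {(i + 1, j + 1): (D[i], subdatabases[i][j])
            for i in range(n) for j in range(n_prime)}

D = ["x1", "x2"]                                  # n = 2
subs = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]   # n' = 3
D_star = product_database(D, subs)                # n * n' = 6 rows
```

Every row of $D^{\prime}_{i}$ is paired with the same $x_{i}$ , which is what later lets a query on $\mathcal{X}$ select whole subdatabases.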

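The key property of these databases is that we can use a query $q\land q^{\prime}\in\mathcal{Q}\land\mathcal{Q}^{\prime}$ to compute a “subset-sum” of the vector $s_{q^{\prime}}=(q^{\prime}(D^{\prime}_{1}),\ldots,q^{\prime}(D^{\prime}_{n}))$ consisting of the answers to $q^{\prime}$ on each of the $n$ subdatabases. That is, for every $q\in\mathcal{Q}$ and $q^{\prime}\in\mathcal{Q}^{\prime}$ , $(q\land q^{\prime})(D^{*})=\frac{1}{n\cdot n^{\prime}}\sum_{i=1}^{n}\sum_{j=1}^{n^{\prime}}q(x_{i})q^{\prime}(x^{\prime}_{ij})=\frac{1}{n}\sum_{i=1}^{n}q(x_{i})q^{\prime}(D^{\prime}_{i}).\quad(5)$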

Thus, every approximate answer $a_{q\land q^{\prime}}$ to a query $q\land q^{\prime}$ places a subset-sum constraint on the vector $s_{q^{\prime}}$ . (Namely, $a_{q\land q^{\prime}}\approx\frac{1}{n}\sum_{i=1}^{n}q(x_{i})q^{\prime}(D^{\prime}_{i})$ .) If the database $D$ and family $\mathcal{Q}$ are chosen appropriately, and the answers are sufficiently accurate, then we will be able to reconstruct a good approximation to $s_{q^{\prime}}$ . Indeed, this sort of “reconstruction attack” is the core of many lower bounds for differential privacy, starting with the work of Dinur and Nissim [DN03]. The setting they consider is essentially the special case of what we have just described where $D^{\prime}_{1},\ldots,D^{\prime}_{n}$ are each just a single bit ( $\mathcal{X}^{\prime}=\{0,1\}$ , and $\mathcal{Q}^{\prime}$ contains only the identity query). In Section 5 we will discuss choices of $D$ and $\mathcal{Q}$ that allow for this reconstruction.
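The subset-sum interpretation can be checked numerically on a toy instance. The sketch below uses exact rational arithmetic and compares the answer to $q\land q^{\prime}$ on the product database with the average $\frac{1}{n}\sum_{i}q(x_{i})q^{\prime}(D^{\prime}_{i})$ ; the particular databases and queries are made up for illustration.

```python
from fractions import Fraction

def answer(q, rows):
    """Counting query: fraction of rows on which q evaluates to 1."""
    return Fraction(sum(q(x) for x in rows), len(rows))

D = [(0, 1), (1, 1), (1, 0)]      # n = 3 rows over X = {0,1}^2
subs = [[0, 1], [1, 1], [0, 0]]   # n' = 2 rows each over X' = {0,1}
D_star = [(x, xp) for x, sub in zip(D, subs) for xp in sub]

Q = [lambda x: x[0], lambda x: x[1], lambda x: x[0] & x[1]]
Q_prime = [lambda xp: xp]         # identity query on a single bit

def lhs(q, qp):
    """(q AND q')(D*), computed directly on the product database."""
    return answer(lambda row: q(row[0]) * qp(row[1]), D_star)

def rhs(q, qp):
    """(1/n) * sum_i q(x_i) * q'(D'_i), the subset-sum form."""
    return Fraction(sum(q(x) * answer(qp, sub) for x, sub in zip(D, subs)), len(D))
```

Here `lhs` and `rhs` agree for every pair of queries, so an approximate answer to $q\land q^{\prime}$ is an approximate subset sum of $s_{q^{\prime}}$ over the rows that $q$ selects.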

We now state the formal notion of reconstruction attack that we want $D$ and $\mathcal{Q}$ to satisfy.

for at least a $1-\beta$ fraction of queries $q\in\mathcal{Q}$ , $\mathcal{B}_{D}(a)$ outputs a vector $t\in[0,1]^{n}$ such that

Then we say that $D\in\mathcal{X}^{n}$ enables an $\alpha^{\prime}$ -reconstruction attack from $(\alpha,\beta)$ -accurate answers to $\mathcal{Q}$ .

A reconstruction attack itself implies a sample-complexity lower bound, as in [DN03]. However, we show how to obtain stronger sample complexity lower bounds from the reconstruction attack by applying it to a product database $D^{*}$ to obtain accurate answers to queries on its subdatabases. For each query $q^{\prime}\in\mathcal{Q}^{\prime}$ , we run the adversary promised by the reconstruction attack on the approximate answers given to queries of the form $(q\land q^{\prime})\in\mathcal{Q}\land\{q^{\prime}\}$ . As discussed above, answers to these queries will approximate subset sums of the vector $s_{q^{\prime}}=(q^{\prime}(D^{\prime}_{1}),\ldots,q^{\prime}(D^{\prime}_{n}))$ . When the reconstruction attack is given these approximate answers, it returns a vector $t_{q^{\prime}}=(t_{q^{\prime},1},\ldots,t_{q^{\prime},n})$ such that $t_{q^{\prime},i}\approx s_{q^{\prime},i}=q^{\prime}(D^{\prime}_{i})$ on average over $i$ . Running the reconstruction attack for every query $q^{\prime}$ gives us a collection $t=(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime},i\in[n]}$ where $t_{q^{\prime},i}\approx q^{\prime}(D^{\prime}_{i})$ on average over both $q^{\prime}$ and $i$ . By an application of Markov’s inequality, for most of the subdatabases $D^{\prime}_{i}$ , we have that $t_{q^{\prime},i}\approx q^{\prime}(D^{\prime}_{i})$ on average over the choice of $q^{\prime}\in\mathcal{Q}^{\prime}$ . For each $i$ such that this guarantee holds, another application of Markov’s inequality shows that for most queries $q^{\prime}\in\mathcal{Q}^{\prime}$ we have $t_{q^{\prime},i}\approx q^{\prime}(D^{\prime}_{i})$ , which is our definition of $(\alpha,\beta)$ -accuracy (later enabling us to apply a re-identification adversary for $\mathcal{Q}^{\prime}$ ).
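The two successive applications of Markov’s inequality can be made concrete. In the sketch below, `err[i][k]` stands for the error $|t_{q^{\prime}_{k},i}-q^{\prime}_{k}(D^{\prime}_{i})|$ ; the thresholds `t` and `s` are free parameters chosen for illustration and are not the constants used in the paper’s lemma.

```python
import random

def markov_two_step(err, t, s):
    """err[i][k] >= 0 is the error of query k on subdatabase i.
    Returns (fraction of 'bad' subdatabases, worst fraction of bad queries
    among the remaining 'good' subdatabases)."""
    n, num_q = len(err), len(err[0])
    overall = sum(map(sum, err)) / (n * num_q)
    row_avg = [sum(row) / num_q for row in err]
    # Step 1: a subdatabase is good if its average error is at most t * overall.
    good = [i for i in range(n) if row_avg[i] <= t * overall]
    bad_rows = 1 - len(good) / n
    # Step 2: within a good subdatabase, count queries with error > s * t * overall.
    worst = max((sum(e > s * t * overall for e in err[i]) / num_q for i in good),
                default=0.0)
    return bad_rows, worst

rng = random.Random(2)
err = [[rng.random() ** 4 for _ in range(50)] for _ in range(40)]
bad_rows, worst = markov_two_step(err, t=3, s=4)
```

Markov’s inequality guarantees `bad_rows <= 1/t` and `worst <= 1/s` for any nonnegative error matrix, which is the sense in which average-case reconstruction yields accuracy on most queries for most subdatabases.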

The algorithm we have described for obtaining accurate answers on the subdatabases is formalized in Figure 1.

We are now in a position to state the main lemma that enables our composition technique. The lemma says that if we are given accurate answers to $\mathcal{Q}\land\mathcal{Q}^{\prime}$ on $D^{*}$ and the database $D\in\mathcal{X}^{n}$ enables a reconstruction attack from accurate answers to $\mathcal{Q}$ , then we can obtain accurate answers to $\mathcal{Q}^{\prime}$ on most of the subdatabases $D^{\prime}_{1},\ldots,D^{\prime}_{n}\in(\mathcal{X}^{\prime})^{n^{\prime}}$ .

The additional bookkeeping in the proof is to handle the case where $a$ is only accurate for most queries. In this case the reconstruction attack may fail completely for certain queries $q^{\prime}\in\mathcal{Q}^{\prime}$ and we need to account for this additional source of error.

Assume the answer vector $a=(a_{q\land q^{\prime}})_{q\in\mathcal{Q},q^{\prime}\in\mathcal{Q}^{\prime}}$ is $(\alpha,\beta)$ -accurate for $\mathcal{Q}\land\mathcal{Q}^{\prime}$ on $D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n})$ . By assumption, $D$ enables a reconstruction attack $\mathcal{B}_{D}$ that succeeds in reconstructing an approximation to $s_{q^{\prime}}=(q^{\prime}(D^{\prime}_{1}),\ldots,q^{\prime}(D^{\prime}_{n}))$ when given $(\alpha,c\beta)$ -accurate answers for the family of queries $\mathcal{Q}\land\{q^{\prime}\}$ . Consider the set of $q^{\prime}$ on which the reconstruction attack succeeds, i.e.

Since $a$ is $(\alpha,\beta)$ -accurate, an application of Markov’s inequality shows that

Thus, $|\mathcal{Q}^{\prime}_{\mathit{good}}|\geq(1-1/c)|\mathcal{Q}^{\prime}|$ .

Recall that, by (5), we can interpret answers to $\mathcal{Q}\land\mathcal{Q}^{\prime}$ as subset sums of answers to the subdatabases, so for every $q^{\prime}\in\mathcal{Q}^{\prime}_{\mathit{good}}$ ,

for at least a $1-c\beta$ fraction of queries $q\land q^{\prime}\in\mathcal{Q}\land\{q^{\prime}\}$ . Since $D$ enables a reconstruction attack from $(\alpha,c\beta)$ -accurate answers to $\mathcal{Q}$ , by Definition 4.1, $\mathcal{B}_{D}((a_{q\land q^{\prime}})_{q\in\mathcal{Q}})$ recovers a vector $t_{q^{\prime}}\in[0,1]^{n}$ such that

Since this holds for every $q^{\prime}\in\mathcal{Q}^{\prime}_{\mathit{good}}$ , we have

The statement inside the final probability is precisely that $(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime}}$ is $(6c\alpha^{\prime},2/c)$ -accurate for $\mathcal{Q}^{\prime}$ on $D^{\prime}_{i}$ . This completes the proof of the lemma. ∎

We now explain how the main lemma allows us to prove a composition theorem for sample complexity lower bounds. We start with a query family $\mathcal{Q}$ on a database $D\in\mathcal{X}^{n}$ that enables a reconstruction attack, and a distribution $\mathcal{D}^{\prime}$ over databases in $(\mathcal{X}^{\prime})^{n^{\prime}}$ that is re-identifiable from answers to a family $\mathcal{Q}^{\prime}$ . We show how to combine these objects to form a re-identifiable distribution $\mathcal{D}^{*}$ for queries $\mathcal{Q}\land\mathcal{Q}^{\prime}$ over $(\mathcal{X}\times\mathcal{X}^{\prime})^{n\cdot n^{\prime}}$ , yielding a sample complexity lower bound of $n\cdot n^{\prime}$ .

A sample from $\mathcal{D}^{*}$ consists of $D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n})$ where each subdatabase $D^{\prime}_{i}$ is an independent sample from $\mathcal{D}^{\prime}$ . The main lemma above shows that if there is an algorithm $\mathcal{A}$ that is accurate for $\mathcal{Q}\land\mathcal{Q}^{\prime}$ on $D^{*}$ , then an adversary can reconstruct accurate answers to $\mathcal{Q}^{\prime}$ on most of the subdatabases $D^{\prime}_{1},\ldots,D^{\prime}_{n}$ . Since these subdatabases are drawn from a re-identifiable distribution, the adversary can then re-identify a member of one of the subdatabases $D^{\prime}_{i}$ . Since the identified member of $D^{\prime}_{i}$ is also a member of $D^{*}$ , we will have a re-identification attack against $D^{*}$ as well.

We are now ready to formalize our composition theorem.

Let $\mathcal{Q}$ be a family of counting queries on $\mathcal{X}$ , and let $\mathcal{Q}^{\prime}$ be a family of counting queries on $\mathcal{X}^{\prime}$ . Assume that for some parameters $c>1$ and $\gamma,\xi,\alpha^{\prime},\alpha,\beta\in[0,1]$ , the following both hold:

There exists a database $D\in\mathcal{X}^{n}$ that enables an $\alpha^{\prime}$ -reconstruction attack from $(\alpha,c\beta)$ -accurate answers to $\mathcal{Q}$ .

There is a distribution $\mathcal{D}^{\prime}$ on databases $D\in(\mathcal{X}^{\prime})^{n^{\prime}}$ that is $(\gamma,\xi)$ -re-identifiable from $(6c\alpha^{\prime},2/c)$ -accurate answers to $\mathcal{Q}^{\prime}$ .

Then there is a distribution on databases $D^{*}\in(\mathcal{X}\times\mathcal{X}^{\prime})^{n\cdot n^{\prime}}$ that is $(\gamma+1/6,\xi)$ -re-identifiable from $(\alpha,\beta)$ -accurate answers to $\mathcal{Q}\land\mathcal{Q}^{\prime}$ .

Assume that $\mathcal{A}(D^{*})$ is $(\alpha,\beta)$ -accurate for $\mathcal{Q}\land\mathcal{Q}^{\prime}$ . By Lemma 4.2, we have

where the last inequality is by (6). Thus, it suffices to prove that

We prove this inequality by giving a reduction to the re-identifiability of $\mathcal{D}^{\prime}$ . Consider the following sanitizer $\mathcal{A}^{\prime}$ : On input $D^{\prime}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{\prime}$ , $\mathcal{A}^{\prime}$ first chooses a random index $i^{*}\leftarrow_{\mbox{\tiny R}}[n]$ . Next, it samples $D^{\prime}_{1},\ldots,D^{\prime}_{i^{*}-1},D^{\prime}_{i^{*}+1},\ldots,D^{\prime}_{n}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{\prime}$ independently, and sets $D^{\prime}_{i^{*}}=D^{\prime}$ . Finally, it runs $\mathcal{A}$ on $D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n})$ and then runs the reconstruction attack $\mathcal{R}^{*}$ to recover answers $(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime},i\in[n]}$ and outputs $(t_{q^{\prime},i^{*}})_{q^{\prime}\in\mathcal{Q}^{\prime}}$ .
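The reduction can be sketched as a higher-order function that plants the real input at a random position among freshly sampled subdatabases. All four ingredients passed in below (the sampler for $\mathcal{D}^{\prime}$ , the mechanism, and the reconstruction procedure) are hypothetical stand-ins, and the stub instantiation at the end is deliberately non-private; it exists only to exercise the wiring.

```python
import random

def make_sanitizer(D, sample_subdb, mechanism, reconstruct, rng):
    """Build A' from the fixed database D (n rows), a sampler for D', a
    mechanism A on product databases, and a reconstruction procedure that
    returns one answer vector per subdatabase (all caller-supplied)."""
    n = len(D)

    def sanitizer(D_prime):
        i_star = rng.randrange(n)                      # random planted index
        subs = [sample_subdb(rng) for _ in range(n)]   # independent samples
        subs[i_star] = D_prime                         # plant the real input
        D_star = [(x, xp) for x, sub in zip(D, subs) for xp in sub]
        answers = mechanism(D_star)
        t = reconstruct(answers)                       # t[i]: answers for D'_i
        return t[i_star]                               # release only the planted slot

    return sanitizer

# Stub demo: the "mechanism" leaks D* and the "reconstruction" reads off the
# exact mean of each subdatabase, so A' returns the mean of its own input.
n_prime = 4
D = ["x1", "x2", "x3"]
sample = lambda r: [r.randrange(2) for _ in range(n_prime)]

def exact_means(D_star):
    return [sum(xp for _, xp in D_star[i * n_prime:(i + 1) * n_prime]) / n_prime
            for i in range(len(D))]

A_prime = make_sanitizer(D, sample, lambda D_star: D_star, exact_means,
                         random.Random(3))
```

Because the other subdatabases are drawn from the same distribution as the input, the output does not reveal which slot was planted, which is the property the argument relies on.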

Notice that since $D^{\prime}_{1},\ldots,D^{\prime}_{n}$ are all i.i.d. samples from $\mathcal{D}^{\prime}$ , their joint distribution is independent of the choice of $i^{*}$ . Specifically, in the view of $\mathcal{B}^{*}$ , we could have chosen $i^{*}$ after seeing its output on $D^{*}$ . Therefore, the following random variables are identically distributed:

$(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime}}$ , where $(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime},i\in[n]}$ is the output of $\mathcal{R}_{D}^{*}(\mathcal{A}(D^{*}))$ on $D^{*}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{*}$ , and $i\leftarrow_{\mbox{\tiny R}}[n]$ .

$\mathcal{A}^{\prime}(D^{\prime})$ where $D^{\prime}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{\prime}$ .

where the last inequality follows because $\mathcal{D}^{\prime}$ is $(\gamma,\xi)$ -re-identifiable from $(6c\alpha^{\prime},2/c)$ -accurate answers to $\mathcal{Q}^{\prime}$ . Thus we have established (8). Combining (7) and (8) completes the proof of the claim.

The next claim follows directly from the definition of $\mathcal{B}^{*}$ and the fact that $\mathcal{D}^{\prime}$ is $(\gamma,\xi)$ -re-identifiable.

For every $(i,j)\in[n]\times[n^{\prime}]$ ,

Combining Claims 4.4 and 4.5 suffices to prove that $\mathcal{D}^{*}$ is $(\gamma+1/6,\xi)$ -re-identifiable from $(\alpha,\beta)$ -accurate answers to $\mathcal{Q}\land\mathcal{Q}^{\prime}$ , completing the proof of the theorem. ∎

The proof of Theorem 4.3 also yields a composition theorem for generalized fingerprinting codes. Specifically, Theorem 4.6 below shows how to combine a reconstruction attack for a query family $\mathcal{Q}$ on a database $D\in\mathcal{X}^{n}$ with an $(n^{\prime},\mathcal{Q}^{\prime})$ -generalized fingerprinting code to obtain an $(n\cdot n^{\prime},\mathcal{Q}\land\mathcal{Q}^{\prime})$ -generalized fingerprinting code.

There exists a database $D\in\mathcal{X}^{n}$ that enables an $\alpha^{\prime}$ -reconstruction attack from $(\alpha,c\beta)$ -accurate answers to $\mathcal{Q}$ .

There exists an $(n^{\prime},\mathcal{Q}^{\prime})$ -generalized fingerprinting code for $(6c\alpha^{\prime},2/c)$ -accuracy with security $(\gamma,\xi)$ .

Then there is an $(n\cdot n^{\prime},\mathcal{Q}\land\mathcal{Q}^{\prime})$ -generalized fingerprinting code for $(\alpha,\beta)$ -accuracy with security $(\gamma+1/6,\xi)$ .

Applications of the Composition Theorem

In this section we show how to use our composition theorem (Section 4) to combine our new lower bounds for $1$ -way marginal queries from Section 3 with (variants of) known lower bounds from the literature to obtain our main results. In Section 5.1 we prove a lower bound for $k$ -way marginal queries when $\alpha$ is not too small (at least inverse polynomial in $d$ ), thereby proving Theorem 1.2 in the introduction. Then in Section 5.2 we obtain a similar lower bound for arbitrary counting queries that allows $\alpha$ to take a wider range of parameters.

A known reconstruction-based lower bound of $\Omega(k)$ for $k$ -way marginals.

A known reconstruction-based lower bound of $\Omega(1/\alpha^{2})$ for $k$ -way marginals.

The lower bound of $\Omega(k)$ for $k$ -way marginals is a special case of a lower bound of $\Omega(\mathit{VC}(\mathcal{Q}))$ due to [Rot10] and based on [DN03], where $\mathit{VC}(\mathcal{Q})$ is the Vapnik-Chervonenkis (VC) dimension of $\mathcal{Q}$ . The lower bound of $\Omega(1/\alpha^{2})$ for $k$ -way marginals is due to [KRSU10, De12].

To apply our composition theorem, we need to formulate these reconstruction attacks in the language of Definition 4.1. In particular, we observe that these reconstruction attacks readily generalize to allow us to reconstruct fractional vectors $s\in[0,1]^{n}$ , instead of just boolean vectors as in [DN03, Rot10].

First we state and prove that the linear dependence on $k$ is necessary.

Let $\mathcal{Q}$ be a collection of counting queries over a data universe $\mathcal{X}$ . We say a set $\{x_{1},\ldots,x_{k}\}\subseteq\mathcal{X}$ is shattered by $\mathcal{Q}$ if for every string $v\in\{0,1\}^{k}$ , there exists a query $q\in\mathcal{Q}$ such that $(q(x_{1}),\ldots,q(x_{k}))=(v_{1},\ldots,v_{k})$ . The VC-dimension of $\mathcal{Q}$ , denoted $\mathit{VC}(\mathcal{Q})$ , is the cardinality of the largest subset of $\mathcal{X}$ that is shattered by $\mathcal{Q}$ .

For each $i=1,\ldots,k$ , let $x_{i}=(1,1,\ldots,0,\ldots,1)$ where the zero is at the $i$ -th index. We will show that $\{x_{1},\ldots,x_{k}\}$ is shattered by $\mathcal{M}_{k,d}$ . For a string $v\in\{0,1\}^{k}$ , let the query $q_{v}(x)$ take the conjunction of the bits of $x$ at indices set to $0$ in $v$ . Then $q_{v}(x_{i})=1$ iff $v_{i}=1$ , so $(q_{v}(x_{1}),\ldots,q_{v}(x_{k}))=(v_{1},\ldots,v_{k})$ . ∎
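The shattering argument is small enough to verify exhaustively. The sketch below builds the points $x_{i}$ and the queries $q_{v}$ exactly as described and checks all $2^{k}$ labelings; indices are 0-based in the code, and the helper names are illustrative.

```python
from itertools import product

def shattering_points(k, d):
    """x_i in {0,1}^d (i = 0..k-1) is all ones except a zero at index i."""
    return [tuple(0 if j == i else 1 for j in range(d)) for i in range(k)]

def q_v(v):
    """Monotone conjunction of the bits of x at the indices where v is 0."""
    zero_idx = [i for i, bit in enumerate(v) if bit == 0]
    return lambda x: int(all(x[i] == 1 for i in zero_idx))

def is_shattered(k, d):
    """Check that every labeling v in {0,1}^k is realized by q_v on the x_i."""
    xs = shattering_points(k, d)
    return all(tuple(q_v(v)(x) for x in xs) == v
               for v in product((0, 1), repeat=k))
```

Calling `is_shattered(k, d)` for any small $k\leq d$ confirms that $q_{v}(x_{i})=1$ exactly when $v_{i}=1$ , since the only zero of $x_{i}$ is read by $q_{v}$ precisely when $v_{i}=0$ .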

Let $\mathcal{Q}$ be a collection of counting queries over a data universe $\mathcal{X}$ and let $n=\mathit{VC}(\mathcal{Q})$ . Then there is a database $D\in\mathcal{X}^{n}$ which enables a $4\alpha$ -reconstruction attack from $(\alpha,0)$ -accurate answers to $\mathcal{Q}$ .

Let $\{x_{1},\ldots,x_{n}\}$ be shattered by $\mathcal{Q}$ , and consider the database $D=(x_{1},\ldots,x_{n})$ . Let $s\in[0,1]^{n}$ be an arbitrary string to be reconstructed and let $a=(a_{q})_{q\in\mathcal{Q}}$ be $(\alpha,0)$ -accurate answers. That is, for every $q\in\mathcal{Q}$

Consider the brute-force reconstruction attack $\mathcal{B}$ defined in Figure 4. Notice that, since $a$ is $(\alpha,0)$ -accurate, $\mathcal{B}$ always finds a suitable vector $t$ . Namely, the original database $s$ satisfies the constraints.

We will show that the reconstructed vector $t$ satisfies

Let $T$ be the set of coordinates on which $t_{i}>s_{i}$ and let $S$ be the set of coordinates where $s_{i}>t_{i}$ . Note that

We will show that absolute values of the sums over $T$ and $S$ are each at most $2\alpha$ . Since $\{x_{1},\ldots,x_{n}\}$ is shattered by $\mathcal{Q}$ , there is a query $q\in\mathcal{Q}$ such that $q(x_{i})=1$ iff $i\in T$ . Therefore, by the definitions of $t$ and $(\alpha,0)$ -accuracy,

so by the triangle inequality, $\frac{1}{n}\sum_{i\in T}(t_{i}-s_{i})\leq 2\alpha$ . An identical argument shows that $\frac{1}{n}\sum_{i\in S}(s_{i}-t_{i})\leq 2\alpha$ , proving that $t$ is an accurate reconstruction. ∎
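A brute-force attack of this kind can be run on a tiny instance. The sketch below is an assumption-laden stand-in for the attack in Figure 4: it restricts the search to a finite grid of candidate vectors (so that the search terminates), represents each query of a shattered set by its indicator pattern, and uses exact rational arithmetic. The secret vector and the noise pattern are made up for illustration.

```python
from fractions import Fraction
from itertools import product

def subset_queries(n):
    """All 2^n indicator patterns (q(x_1), ..., q(x_n)) of a shattered set."""
    return list(product((0, 1), repeat=n))

def subset_sum(pattern, t):
    """(1/n) * sum_i q(x_i) * t_i for the query with the given pattern."""
    return Fraction(sum(b * ti for b, ti in zip(pattern, t)), len(t))

def brute_force_reconstruct(a, n, alpha, grid):
    """Return the first grid vector t consistent with every answer in a."""
    queries = subset_queries(n)
    for t in product(grid, repeat=n):
        if all(abs(a_q - subset_sum(q, t)) <= alpha for q, a_q in zip(queries, a)):
            return t
    return None

n, alpha = 3, Fraction(1, 20)
grid = [Fraction(j, 10) for j in range(11)]
s = (Fraction(3, 10), Fraction(1), Fraction(6, 10))   # secret vector, on the grid
# Noisy answers: alternate +alpha / -alpha, then clamp to [0, 1]; clamping
# keeps every answer within alpha of the true subset sum.
a = [min(Fraction(1), max(Fraction(0),
         subset_sum(q, s) + (alpha if idx % 2 else -alpha)))
     for idx, q in enumerate(subset_queries(n))]
t = brute_force_reconstruct(a, n, alpha, grid)
l1_error = sum(abs(ti - si) for ti, si in zip(t, s)) / n
```

Since $s$ itself lies on the grid and satisfies every constraint, the search always returns some vector, and the argument above shows that any consistent vector has normalized $\ell_{1}$ error at most $4\alpha$ .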

We can now state in our terminology the lower bound of De from [De12] (building on [KRSU10]) showing that the inverse-quadratic dependence on $\alpha$ is necessary.

Let $k$ be any constant, $d\geq k$ be any integer, and let $\alpha\geq 1/d^{.499k}$ be a sufficiently small parameter (i.e. bounded by an absolute constant). (The constant $.499$ was chosen for simplicity, and can be replaced with any constant strictly smaller than $.5$ .) There exists a constant $\beta=\beta(k)>0$ such that for every $\alpha^{\prime}>0,$ there exists a database $D\in(\{0,1\}^{d})^{n}$ with $n=\Omega_{\alpha^{\prime},k}(1/\alpha^{2})$ such that $D$ enables an $\alpha^{\prime}$ -reconstruction attack from $(\alpha,\beta)$ -accurate answers to the $k$ -way marginals $\mathcal{M}_{k,d}$ .

Although the above theorem is a simple extension of De’s lower bound, we sketch a proof for completeness, and refer the interested reader to [De12] for a more detailed analysis.

To prove that the reconstruction attack succeeds, we will show that there exists a database $D=(x_{1},\ldots,x_{n})\in\{0,1\}^{n\times d}$ such that for any $s\in[0,1]^{n}$ , if $a$ satisfies

(i.e. $a$ has $(\alpha,\beta)$ -accurate answers) then $\mathcal{B}_{\mathcal{M}_{k,d}}(D,a)$ returns a vector $t$ such that $\|t-s\|_{1}\leq\alpha^{\prime}\cdot n.$ Henceforth we refer to such an $a$ simply as $(\alpha,\beta)$ -accurate for $\mathcal{M}_{k,d}$ on $(D,s),$ as a shorthand. The above guarantee must hold for suitable choices of $n,\beta,$ and $\alpha^{\prime}$ to satisfy the theorem.

We will argue that the reconstruction succeeds in two steps. First, we show that reconstruction succeeds if $D$ is “nice.” Second, we show that there exists “nice” $D$ that has the dimensions promised by the theorem.

To explain what we mean by a “nice” database $D$ , for any $D=(x_{1},\ldots,x_{n})\in\{0,1\}^{n\times d}$ and family of queries $\mathcal{Q}$ on $\{0,1\}^{d}$ , we define the matrix $M=M_{D,\mathcal{Q}}\in\{0,1\}^{n\times|\mathcal{Q}|},$ as $M(i,q)=q(x_{i}).$
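As a concrete illustration of the matrix $M_{D,\mathcal{Q}}$ , the following Python sketch builds it for a small database. Treating $\mathcal{M}_{k,d}$ as the monotone conjunctions of exactly $k$ distinct attributes is an assumption made for this example, and the helper names `marginal_queries` and `query_matrix` are hypothetical.

```python
from itertools import combinations


def marginal_queries(k, d):
    """Each query is a k-subset of attribute indices; q(x) = AND of x[j] for j in the subset."""
    return list(combinations(range(d), k))


def query_matrix(database, queries):
    """M[i][q] = q(x_i): rows are indexed by records, columns by queries."""
    return [[int(all(x[j] == 1 for j in q)) for q in queries] for x in database]


D = [(1, 1, 0), (1, 0, 1)]
Q = marginal_queries(2, 3)
for row in query_matrix(D, Q):
    print(row)
```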

A matrix $M\in\{0,1\}^{n\times m}$ is a $\delta$ -Euclidean section if for every vector $a$ in the rowspace of $M$ we have $\sqrt{m}\cdot\|a\|_{2}\geq\|a\|_{1}\geq\delta\sqrt{m}\cdot\|a\|_{2}.$

Let $D$ be a database and $\mathcal{Q}$ be a set of queries such that $M_{D,\mathcal{Q}}\in\{0,1\}^{n\times|\mathcal{Q}|}$ is a $\delta$ -Euclidean section and the least singular value of $M_{D,\mathcal{Q}}$ is $\sigma$ . Let $s\in[0,1]^{n}$ be arbitrary. There exists $\beta=\beta(\delta)>0$ such that if $a$ are $(\alpha,\beta)$ -accurate answers for $\mathcal{Q}$ on $(D,s)$ , and $t=\mathcal{B}_{\mathcal{Q}}(D,a)$ , then $t$ satisfies

for $\gamma=O(\alpha\sqrt{n|\mathcal{Q}|}/\sigma).$ The constant hidden in the $O(\cdot)$ notation depends only on $\delta.$

Thus, it suffices to find a database $D$ such that the matrix $M_{D,\mathcal{M}_{k,d}}$ is a Euclidean section (for some fixed constant $\delta>0$ ) and has no “small” singular values. A result of Rudelson [Rud12] (strengthening that of Kasiviswanathan et al. [KRSU10]) guarantees that such a database exists.

In particular, there exists a database $D\in\{0,1\}^{n\times d}$ such that the Hadamard product $M$ satisfies the two properties above.

Now fix any $s\in[0,1]^{n}$ and let $a\in[0,1]^{|\mathcal{M}_{k,d}|}$ be $(\alpha,\beta)$ -accurate answers to $\mathcal{M}_{k,d}$ on $(D,s)$ . If we let $t=\mathcal{B}_{\mathcal{M}_{k,d}}(D,a)$ , then by Lemma 5.6, provided that $\beta$ is smaller than some constant that depends only on $\delta$ (which in turn depends only on $k$ ), we will have $\|s-t\|_{1}\leq\gamma\cdot n$ for

5.1.3 Putting Together the Lower Bound

Now we show how to combine the various attacks to prove Theorem 1.2 in the introduction. We obtain our lower bound by applying two rounds of composition. In the first round, we compose the reconstruction attack of Theorem 5.4 described above with the re-identifiable distribution for $1$ -way marginals. We then take the resulting re-identifiable distribution and apply a second round of composition using the reconstruction attack based on the VC-dimension of $k$ -way marginals.

We remark that it is necessary to apply the two rounds of composition in this order. In particular, we cannot prove Theorem 1.3 by composing first with the VC-dimension-based reconstruction attack. Our composition theorem requires a re-identifiable distribution from $(\alpha,\beta)$ -accurate answers for $\beta>0$ , whereas the reconstruction attack described in Lemma 5.3 requires $(\alpha,0)$ -accurate answers, and the reconstruction can fail if some queries have error much larger than $\alpha$ . The resulting re-identifiable distribution obtained from composing with this reconstruction attack will also require $(\alpha,0)$ -accurate answers, and thus cannot be composed further.

We can now formally state and prove our sample-complexity lower bound for $k$ -way marginals, thereby establishing Theorem 1.3 in the introduction.

such that there exists a distribution on $n$ -row databases $D\in(\{0,1\}^{d})^{n}$ that is $(1/2,o(1/n))$ -re-identifiable from $(\alpha,0)$ -accurate answers to the $k$ -way marginals $\mathcal{M}_{k,d}$ .

Applying Theorem 4.3 (with parameter $c=150$ ), we obtain item 1’ below. We then bring in another reconstruction attack for the composition theorem.

Using the composition Theorem 4.6 in place of Theorem 4.3, we obtain a version of Theorem 5.8 in the language of generalized fingerprinting codes.

such that there exists a $(n,\mathcal{M}_{k,d})$ -generalized fingerprinting code with security $(1/2,o(1/n))$ for $(\alpha,0)$ -accuracy.

5.1.4 A Tight Lower Bound for 2-Way Marginals

Theorem 5.8 does not give any non-trivial lower bound for $2$ -way marginals. Intuitively, the problem is that the proof uses two rounds of composition, and thus if we try to instantiate the proof for $2$ -way marginals, one of the three lower bounds being composed will have to be trivial (i.e. will be a lower bound for $0$ -way marginals). However, a simple modification of the proof yields a tight lower bound for $2$ -way marginals that holds even for $(\alpha,\beta)$ -accuracy.

such that there exists a distribution on $n$ -row databases $D\in(\{0,1\}^{d})^{n}$ that is $(1/2,o(1/n))$ -re-identifiable from $(\alpha,\beta)$ -accurate answers to the $2$ -way marginals $\mathcal{M}_{2,d}$ .

Applying Theorem 4.3 (with parameter $c=150$ ), we obtain the following: There exists a distribution on databases in $(\{0,1\}^{d})^{n_{d}n_{\alpha}}$ that is $(1/3,o(1/n_{d}n_{\alpha}))$ -re-identifiable from $(\alpha,4\beta)$ -accurate answers to $\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2}\subset\mathcal{M}_{2,d}$ .

To complete the theorem, note that $\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2}$ contains exactly $1/4$ of all the queries in $\mathcal{M}_{2,d}$ , so $(\alpha,\beta)$ -accurate answers to $\mathcal{M}_{2,d}$ contain $(\alpha,4\beta)$ -accurate answers to the subset $\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2}$ . So our lower bound for the subset $\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2}$ is sufficient to obtain the desired lower bound. Finally, note that

5.2 Lower Bounds for Arbitrary Queries

Using our composition theorem, we can also prove a nearly-optimal sample complexity lower bound as a function of $|\mathcal{Q}|$ , $d$ , and $\alpha$ , and establish Theorem 1.3 in the introduction.

Roughly, the results of [DN03] can be interpreted in our framework as showing that there is an $\Omega(1/\alpha^{2})$ -row database that enables a $1/100$ -reconstruction attack from $(\alpha,0)$ -accurate answers to some family of queries $\mathcal{Q}$ , but only when the vector to be reconstructed is Boolean. That is, the attack reconstructs a bit vector accurately provided that every query in $\mathcal{Q}$ is answered correctly. Dwork et al. [DMT07, DY08] generalized this attack to only require $(\alpha,\beta)$ -accuracy for some constant $\beta>0$ , and we will make use of this extension (although we do not require computational efficiency, which was a focus of those works). Finally, we need an extension to the case of fractional vectors $s\in[0,1]^{n}$ , instead of Boolean vectors $s\in\{0,1\}^{n}$ .

The extension is fairly simple and the proof follows the same outline as the original reconstruction attack from [DN03]. We are given accurate answers to queries in $\mathcal{Q}$ , which we interpret as approximate “subset-sums” of the vector $s\in[0,1]^{n}$ that we wish to reconstruct. The reconstruction attack will output any vector $t$ from a discretization $\left\{0,1/m,\ldots,(m-1)/m,1\right\}^{n}$ of the unit interval that is “consistent” with these subset-sums. The main lemma we need is an “elimination lemma” that says that if $\|t-s\|_{1}$ is sufficiently large, then for a random subset $T\subseteq[n]$ ,

with suitably large constant probability. For $m=1$ this lemma can be established via combinatorial arguments, whereas for the $m>1$ case we establish it via the Berry-Esséen Theorem. The lemma is used to argue that for every $t$ that is sufficiently far from $s$ , a large fraction of the subset-sum queries will witness the fact that $t$ is far from $s$ , and ensure that $t$ is not chosen as the output.

First we state and prove the lemma that we just described, and then we will verify that it indeed leads to a reconstruction attack.

Let $\kappa>0$ be a constant, let $\alpha>0$ be a parameter with $\alpha\leq\kappa^{2}/240$ , and let $n=1/576\kappa^{2}\alpha^{2}$ . Then for every $r\in[-1,1]^{n}$ such that $\frac{1}{n}\sum_{i=1}^{n}|r_{i}|>\kappa$ , and a randomly chosen $q\subseteq[n]$ ,

Let $r$ be as in the statement of the lemma. Define a random variable

The condition on the right-hand side says that $\sum_{i}Q_{i}$ is in some interval of width $6\alpha n$ . Since the random variables $Q_{i}$ are independent, as $q$ is a randomly chosen subset, we will use the Berry-Esséen Theorem (Theorem 5.13) to conclude that this sum does not fall in any interval of this width too often. Establishing the next claim suffices to prove Lemma 5.11.

In order to apply Theorem 5.13 with $X_{i}=Q_{i}$ , we need to analyze the moments of the random variables $Q_{i}$ . The following bounds can be verified from the definition of $Q_{i}$ and the assumption that $\|r\|_{1}\geq\kappa n$ .

where $\sigma I$ is an interval of width $\sigma/2$ . Thus we have obtained that $\sum_{i}Q_{i}$ falls outside of any interval of width $\sigma/2$ with probability at least $3/5$ . In order to establish the claim, we simply observe that

when $n=1/576\kappa^{2}\alpha^{2}$ . Thus, the probability of falling outside an interval of width $6\alpha n$ is only larger than the probability of falling outside an interval of width $\sigma/2$ . ∎

Establishing Claim 5.12 completes the proof of Lemma 5.11. ∎

Let $\alpha^{\prime}\in(0,1]$ be a constant, let $\alpha>0$ be a parameter with $\alpha\leq(\alpha^{\prime})^{2}/960$ , and let $n=1/144(\alpha^{\prime})^{2}\alpha^{2}$ . For any data universe $\mathcal{X}=\{x_{1},\ldots,x_{n}\}$ of size $n$ , there is a set of counting queries $\mathcal{Q}$ over $\mathcal{X}$ of size at most $O(n\log(1/\alpha))$ such that the database $D=(x_{1},\ldots,x_{n})$ enables an $\alpha^{\prime}$ -reconstruction attack from $(\alpha,1/3)$ -accurate answers to $\mathcal{Q}$ .

First we will give a reconstruction algorithm $\mathcal{B}$ for an arbitrary family of queries. We will then show that for a random set of queries $\mathcal{Q}$ of the appropriate size, the reconstruction attack succeeds for every $s\in[0,1]^{n}$ with non-zero probability, which implies that there exists a set of queries satisfying the conclusion of the theorem. We will use the shorthand

In order to show that the reconstruction attack $\mathcal{B}$ from Figure 6 succeeds, we must show that $\frac{1}{n}\sum_{i=1}^{n}|t_{i}-s_{i}|\leq\alpha^{\prime}.$ Let $s\in[0,1]^{n}$ , and let $s^{\prime}\in\left\{0,1/m,\ldots,(m-1)/m,1\right\}^{n}$ be the vector obtained by rounding each entry of $s$ to the nearest $1/m$ . Then

so it is enough to show that the reconstruction attack outputs a vector close to $s^{\prime}$ . Observe that the vector $s^{\prime}$ itself satisfies

for any subset-sum query $q$ , so the reconstruction attack always finds some vector $t$ . To show that the reconstruction is successful, fix any $t\in\left\{0,1/m,\ldots,(m-1)/m,1\right\}^{n}$ such that $\frac{1}{n}\sum_{i=1}^{n}|t_{i}-s^{\prime}_{i}|>\frac{\alpha^{\prime}}{2}.$ If we write $r=s^{\prime}-t\in\{-1,\ldots,-1/m,0,1/m,\ldots,1\}^{n}$ , then $\frac{1}{n}\sum_{i=1}^{n}|r_{i}|>\frac{\alpha^{\prime}}{2}$ and $\langle q,r\rangle=\langle q,s^{\prime}\rangle-\langle q,t\rangle$ . In order to show that no $t$ that is far from $s^{\prime}$ can be output by $\mathcal{B}$ , we will show that for any $r\in\{-1,\ldots,-1/m,0,1/m,\ldots,1\}^{n}$ with $\frac{1}{n}\sum_{i=1}^{n}|r_{i}|>\frac{\alpha^{\prime}}{2}$ ,

To prove this, we first observe by Lemma 5.11 (setting $\kappa=\frac{1}{2}\alpha^{\prime}$ ) that for a randomly chosen query $q$ defined on $\mathcal{X}$ ,

The lemma applies because $\langle q,r\rangle=\frac{1}{n}\sum_{i=1}^{n}q(x_{i})r_{i}$ is a random subset-sum of the entries of $r$ .

Next, we apply a concentration bound to show that if the set $\mathcal{Q}$ of queries is a sufficiently large random set, then for every vector $r$ the number of queries for which $|\langle q,r\rangle|$ is large will be close to its expectation, which we have just established is at least $3|\mathcal{Q}|/5$ . We use the following version of the Chernoff bound.

Consider a set of randomly chosen queries $\mathcal{Q}$ . By the above, we have that for every $r\in\{-1,\ldots,-1/m,0,1/m,\ldots,1\}^{n}$ such that $\frac{1}{n}\sum_{i=1}^{n}|r_{i}|>\frac{\alpha^{\prime}}{2}$ ,

Since the queries are chosen independently, by the Chernoff bound we have

Thus, we can choose $|\mathcal{Q}|=O(n\log m)$ to obtain

Thus, we have established that there exists a family of queries $\mathcal{Q}$ such that for every $s,t$ such that $\frac{1}{n}\sum_{i=1}^{n}|t_{i}-s_{i}|>\alpha^{\prime}$ ,

Moreover, by $(\alpha,1/3)$ -accuracy, we have

Applying a triangle inequality, we can conclude

which implies that $t$ cannot be the output of $\mathcal{B}$ . This completes the proof.

5.2.2 Putting Together the Lower Bound

Now we show how to combine the various attacks to prove Theorem 1.2 in the introduction. We obtain our lower bound by applying two rounds of composition. In the first round, we compose the reconstruction attack described above with the re-identifiable distribution for $1$ -way marginals. We then take the resulting re-identifiable distribution and apply a second round of composition using the reconstruction attack for query families of high VC-dimension.

Just like our lower bound for $k$ -way marginal queries, we remark that it is necessary to apply the two rounds of composition in this order. See Section 5.1.3 for a discussion of this issue.

such that there exists a distribution on $n$ -row databases $D\in(\{0,1\}^{d})^{n}$ that is $(1/2,o(1/n))$ -re-identifiable from $(\alpha,0)$ -accurate answers to $\mathcal{Q}$ .

Applying Theorem 4.3 (with parameter $c=150$ ), we obtain item 1’ below. We then bring in another reconstruction attack for the composition theorem.

By Lemma 5.3, there exists a database $D\in(\{0,1\}^{d/3})^{\log h}$ that enables a $(4\alpha)$ -reconstruction attack from $(\alpha,0)$ -accurate answers to some $\mathcal{Q}_{vc}$ of size $h$ . (In particular, the family of queries can be all $(\log h)$ -way marginals on the first $\log h$ bits of the data universe items.)

Again, the preceding theorem has a corresponding statement in terms of generalized fingerprinting codes.

such that there exists a $(n,\mathcal{Q})$ -generalized fingerprinting code with security $(1/2,o(1/n))$ for $(\alpha,0)$ -accuracy.

Constructing Error-Robust Fingerprinting Codes

In this section, we show how to construct fingerprinting codes that are robust to a constant fraction of errors, which will establish Theorem 3.4. Our codes are based on the fingerprinting code of Tardos [Tar08], which has a nearly optimal number of users, but is not robust to any constant fraction of errors. The number of users in our code is only a constant factor smaller than that of Tardos, and thus our codes also have a nearly optimal number of users.

To motivate our approach, it is useful to see why the Tardos code (like all other fingerprinting codes we are aware of) is not robust to a constant fraction of errors. The reason is that the only way to introduce an error is to put a $0$ in a column containing only $1$ ’s or vice versa (recall that the set of codewords, $C\in\{0,1\}^{n\times d}$ , can be viewed as an $n\times d$ matrix). We call such columns “marked columns.” Thus, if the adversary is allowed to introduce $\geq m$ errors, where $m$ is the number of marked columns, then he can simply ignore the codewords and output either the all- $0$ or all- $1$ codeword, which cannot be traced. Thus, in order to tolerate a $\beta$ fraction of errors, it is necessary that $m\geq\beta d$ where $d$ is the length of the codeword, and this is not satisfied by any construction we know of (when $\beta>0$ is a constant). However, Tardos’ construction can be shown to remain secure if the adversary is allowed to introduce $\beta m$ errors, rather than $\beta d$ errors, for some constant $\beta>0$ . We demonstrate this formally in Section 6.2. In addition, we show how to take a fingerprinting code that tolerates $\beta m$ errors and modify it so that it can tolerate about $\beta d/3$ errors. This reduction is formalized in Section 6.1. Combining these two results will give us a robust fingerprinting code.

We remark that prior work [BN08, BKM10] has shown how to construct fingerprinting codes satisfying a weaker robustness property. Specifically, their codes allow the adversary to introduce a special “?” symbol in a large fraction of coordinates, but still require that any coordinate that is not a “?” satisfies the feasibility constraint.

Before proceeding with the construction and analysis, we restate some terminology and notation from Section 3. Recall that a fingerprinting code is a pair of algorithms $(\mathit{Gen},\mathit{Trace})$ , where $\mathit{Gen}$ specifies a distribution over codebooks $C\in\{0,1\}^{n\times d}$ consisting of $n$ codewords $(c_{1},\ldots,c_{n})$ , and $\mathit{Trace}(C,c^{\prime})$ either outputs the identity $i\in[n]$ of an accused user or outputs $\bot$ . Recall that $\mathit{Gen}$ and $\mathit{Trace}$ share a common state. For a coalition $S\subseteq[n]$ , we write $C_{S}\in\{0,1\}^{|S|\times d}$ to denote the subset of codewords belonging to users in $S$ .

Every codebook $C$ , coalition $S$ , and robustness parameter $\beta\in[0,1]$ defines a feasible set of combined codewords,

We now recall the definition of an error-robust fingerprinting code from Section 3.1.

where the probability is taken over the coins of $\mathit{Gen},\mathit{Trace}$ , and $\mathcal{A}_{\mathit{FP}}$ . The algorithms $\mathit{Gen}$ and $\mathit{Trace}$ may share a common state.

The main result of this section is a construction of fingerprinting codes satisfying Definition 6.1.

We remark that we have made no attempt to optimize the fraction of errors to which our code is robust. We leave it as an interesting open problem to construct a robust fingerprinting code for a nearly-optimal number of users that is robust to a fraction of errors arbitrarily close to $1/2$ .

A key step in our construction is a reduction from constructing error-robust fingerprinting codes to constructing a weaker object, which we call a weakly-robust fingerprinting code. The difference between a weakly-robust fingerprinting code and an error-robust fingerprinting code of the previous section is that we now demand that only a $\beta$ fraction of the marked positions can have errors, rather than a $\beta$ fraction of all positions.

In order to formally define weakly-robust fingerprinting codes, we introduce some terminology. If $C\in\{0,1\}^{n\times d}$ is a codebook, then for $b\in\{0,1\}$ , we say that position $j\in[d]$ is $b$ -marked in $C$ if $c_{ij}=b$ for every $i\in[n]$ . That is, $j$ is $b$ -marked if every user has the symbol $b$ in the $j$ -th position of their codeword. The set $F_{\beta}(C)$ consists of all codewords $c^{\prime}$ such that for a $1-\beta$ fraction of positions $j$ , either $j$ is not marked, or $j$ is $b$ -marked and $c^{\prime}_{j}=b$ . Notice that this constraint is vacuous if fewer than a $\beta$ fraction of positions are marked.
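The notions of marked positions and feasibility can be made concrete with a short Python sketch. It is an illustration under stated assumptions: `in_feasible_set` follows the description of $F_{\beta}(C)$ above, while `in_weakly_feasible_set` encodes the informal weakly-robust condition (errors on at most a $\beta$ fraction of the marked positions) rather than a formal definition. All function names are hypothetical.

```python
def marked_positions(codebook):
    """Map each marked position j to the bit b such that every codeword has b at j."""
    d = len(codebook[0])
    marked = {}
    for j in range(d):
        column = {row[j] for row in codebook}
        if len(column) == 1:
            marked[j] = column.pop()
    return marked


def count_errors(codebook, word):
    """Number of marked positions where `word` disagrees with the common bit."""
    return sum(1 for j, b in marked_positions(codebook).items() if word[j] != b)


def in_feasible_set(codebook, word, beta):
    """F_beta(C): errors on at most a beta fraction of ALL d positions."""
    return count_errors(codebook, word) <= beta * len(word)


def in_weakly_feasible_set(codebook, word, beta):
    """Weak feasibility (informal): errors on at most a beta fraction of MARKED positions."""
    return count_errors(codebook, word) <= beta * len(marked_positions(codebook))


# Two users, d = 8; positions 0 and 5 are 0-marked, positions 1 and 4 are 1-marked.
C = [
    [0, 1, 0, 1, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0, 1, 0],
]
word = [1, 1, 0, 0, 1, 0, 0, 0]  # one error, at the 0-marked position 0
print(in_feasible_set(C, word, 0.2), in_weakly_feasible_set(C, word, 0.2))
```

In this toy codebook only half of the positions are marked, so a single error is within a $0.2$ fraction of all positions but exceeds a $0.2$ fraction of the marked ones, which is the gap between the two notions.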

For a weakly-robust fingerprinting code, we will define a more constrained feasible set. Intuitively, a codeword $c^{\prime}$ is feasible if for a $1-\beta$ fraction of positions that are marked, $c^{\prime}_{j}$ is set appropriately. Note that this condition is meaningful even when the fraction of marked positions is much smaller than $\beta$ . More formally, we define

The next theorem states that if we have an $(n,d)$ -fingerprinting code that is weakly-robust to a $\beta$ fraction of errors and satisfies a mild technical condition, then we obtain an $(n,O(d))$ -fingerprinting code that is robust to an $\Omega(\beta)$ fraction of errors with a similar level of security.

are a $(n,d)$ -fingerprinting code with security $\xi$ weakly-robust to a $\beta$ fraction of errors, and

with probability at least $1-\xi$ over $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$ , produce $C$ that has at least $m$ $0$ -marked columns and $m$ $1$ -marked columns.

Then there is a pair of algorithms $(\mathit{Gen}^{\prime},\mathit{Trace}^{\prime})$ that are a $(n,d^{\prime})$ -fingerprinting code with security $\xi^{\prime}$ robust to a $\beta/3$ fraction of errors, where

The reduction is given in Figure 7. Recall that $\mathit{Gen}^{\prime}$ and $\mathit{Trace}^{\prime}$ may share state, so $\pi$ and the shared state of $\mathit{Gen}$ and $\mathit{Trace}$ are known to $\mathit{Trace}^{\prime}$ .

Fix a coalition $S\subseteq[n]$ . Let $\mathcal{A}_{\mathit{FP}}^{\prime}$ be an adversary. Sample $C^{\prime}\leftarrow_{\mbox{\tiny R}}\mathit{Gen}^{\prime}$ and let $c^{\prime}=\mathcal{A}_{\mathit{FP}}^{\prime}(C^{\prime})$ . We will show that the reduction is successful by proving that if $c^{\prime}\in\mathit{F}_{\beta/3}(C^{\prime})$ , then the modified string $c\in\mathit{WF}_{\beta}(C)$ with probability $1-\exp(-\Omega(\beta m^{2}/d))$ . The reason is that an adversary who is given (a subset of the rows of) $C^{\prime}$ cannot distinguish real columns that are marked from fake columns. Therefore, the fraction of errors in the real marked columns should be close to the fraction of errors that are either real and marked or fake. Since the total fraction of errors in the entire codebook is at most $\beta/3$ , we know that the fraction of errors in real marked columns is not much larger than $\beta/3$ . Thus the fraction of errors in the real marked columns will be at most $\beta$ with high probability. We formalize this argument in the following claim.

for any choice of $k$ . An identical argument bounds the probability that the number of errors in real $1$ -marked columns is more than $\beta m_{1}$ . Therefore, the probability that more than a $\beta$ fraction of marked columns have errors is at most $2\exp(-\Omega(\beta m^{2}/d))$ . ∎

Now define an adversary $\mathcal{A}_{\mathit{FP}}$ that takes $C_{S}$ as input, simulates $\mathit{Gen}^{\prime}$ by appending marked columns to $C_{S}$ and applying a random permutation $\pi$ , and then applies $\mathcal{A}_{\mathit{FP}}^{\prime}$ to the resulting codebook $C^{\prime}_{S}$ . Then it takes $\mathcal{A}_{\mathit{FP}}^{\prime}(C^{\prime}_{S})$ , applies $\pi^{-1}$ , removes the fake columns, and outputs the result. Notice that $\mathit{Trace}^{\prime}$ applies $\mathit{Trace}$ to a codebook and codeword generated by exactly the same procedure. If we assume that $\mathcal{A}_{\mathit{FP}}^{\prime}(C^{\prime}_{S})$ is feasible with parameter $\beta/3$ , then by the analysis above, with probability at least $1-\xi-\exp(-\Omega(\beta m^{2}/d))$ , $\mathcal{A}_{\mathit{FP}}(C_{S})$ is weakly feasible with parameter $\beta$ . Thus,

where the first inequality is by Claim 6.5 and the second inequality is by $\xi$ -security of $\mathit{Trace}$ .

Since $\mathit{Trace}$ does not accuse a user outside of $S$ (except with probability at most $\xi$ ) regardless of whether or not that adversary’s codeword is feasible, it is immediate that $\mathit{Trace}^{\prime}$ also does not accuse a user outside of $S$ (except with probability at most $\xi$ ). ∎
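The column-padding step used in the reduction above can be sketched in Python as follows. This is a minimal sketch, not a transcription of Figure 7: the number of fake columns (`num_fake`, added once as all- $0$ and once as all- $1$ columns) is a hypothetical parameter, and the function names are illustrative.

```python
import random


def pad_and_permute(codebook, num_fake, rng):
    """Append num_fake all-0 and num_fake all-1 columns, then permute all columns.

    Returns the permuted codebook and pi, where pi[j] is the new position of
    padded column j (columns 0..d-1 are the real ones).
    """
    d = len(codebook[0])
    d_new = d + 2 * num_fake
    pi = list(range(d_new))
    rng.shuffle(pi)
    permuted = []
    for row in codebook:
        padded = list(row) + [0] * num_fake + [1] * num_fake
        new_row = [None] * d_new
        for j, bit in enumerate(padded):
            new_row[pi[j]] = bit
        permuted.append(new_row)
    return permuted, pi


def unpermute_and_strip(word, pi, d):
    """Apply the inverse permutation to a length-d' word and drop the fake columns."""
    return [word[pi[j]] for j in range(d)]


rng = random.Random(0)
C = [[0, 1, 1], [1, 1, 0]]
C_prime, pi = pad_and_permute(C, 2, rng)
print(C_prime)
print([unpermute_and_strip(row, pi, 3) for row in C_prime])
```

Since the permutation is kept as shared state, the tracing side can undo it exactly, while a coalition that sees only the permuted rows cannot tell fake marked columns from real ones.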

6.2 Weak Robustness of Tardos’ Fingerprinting Code

In this section we show that Tardos’ fingerprinting code is weakly robust to a $\beta$ fraction of errors for $\beta\leq 1/25$ . Specifically we prove the following:

Tardos’ fingerprinting code is described in Figure 8. Note that the shared state of $\mathit{Gen}$ and $\mathit{Trace}$ will include $p_{1},\ldots,p_{d}$ .

Tardos’ proof that no user is falsely accused (except with probability $\xi$ ) holds for every adversary, regardless of whether or not the adversary’s output is feasible; therefore it holds without modification even when we allow the adversary to introduce errors. So we will state the following lemma from [Tar08, Section 3] without proof.

Let $(\mathit{Gen},\mathit{Trace})$ be the fingerprinting code defined in Algorithm 8. Then for every adversary $\mathcal{A}_{\mathit{FP}}$ , and every $S\subseteq[n]$ ,

where the probability is taken over the choice of $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$ and the coins of $\mathcal{A}_{\mathit{FP}}$ .

Most of the remainder of this section is devoted to proving that any adversary who introduces errors into at most a $1/25$ fraction of the marked columns can be traced successfully.

Let $(\mathit{Gen},\mathit{Trace})$ be the fingerprinting code defined in Algorithm 8. Then for every adversary $\mathcal{A}_{\mathit{FP}}$ , and every $S\subseteq[n]$ ,

where the probability is taken over the choice of $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$ and the coins of $\mathcal{A}_{\mathit{FP}}$ .

Before giving the proof, we briefly give a high-level roadmap. Recall that in the construction there is a “score” function $S_{i}(c^{\prime})$ that is computed for each user, and $\mathit{Trace}$ will output some user whose score is larger than the threshold $Z/2$ , if such a user exists. Tardos shows that the sum of the scores over all users is at least $nZ/2$ , which demonstrates that there exists a user whose score is above the threshold. His argument works by balancing two contributions to the score: 1) the contribution from $1$ -marked columns $j$ , which will always be positive due to the fact that $c^{\prime}_{j}=1$ , and 2) the potentially negative contribution from columns that are not $1$ -marked. Conceptually, he shows that the contribution from the $1$ -marked columns is larger in expectation than the negative contribution from the other columns, so the expected score is significantly above the threshold. He then applies a Chernoff-type bound to show that the score will be above the threshold with high probability. When the adversary is allowed to introduce errors so that there may be some $1$ -marked columns $j$ such that $c^{\prime}_{j}=0$ , these errors will contribute negatively to the score. The new ingredient in our argument is essentially to bound the negative contribution from these errors. We are able to get a sufficiently good bound to tolerate errors in $1/25$ of the coordinates. We expect that a tighter analysis and more careful tuning of the parameters can improve the fraction of errors that can be tolerated.

We will write $S=[n]$ . Doing so is without loss of generality as users outside of $S$ are irrelevant. We will use $\beta=1/25$ to denote the allowable fraction of errors. Fix an adversary $\mathcal{B}$ . Sample $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$ and let $c^{\prime}=\mathcal{B}(C)$ . Assume $c^{\prime}\in\mathit{WF}_{\beta}(C)$ . In order to prove that some user is traced, we will bound the quantity

where $x_{j}=\sum_{i=1}^{n}C_{ij}$ is defined to be the number of codewords $c_{i}$ such that $c_{ij}=1$ . Our goal is to show that this quantity is at least $nZ/2$ with high probability. If we can do so, then there must exist a user $i\in[n]$ such that $S_{i}(c^{\prime})\geq Z/2$ , in which case $\mathit{Trace}(C,c^{\prime})\neq\bot$ .

If $j$ is unmarked, then $\overline{c}_{j}=0$ .

If $j$ is $0$ -marked, then $\overline{c}_{j}\in\{0,1\}$ .

If $j$ is $1$ -marked, then $\overline{c}_{j}\in\{-1,0\}$ .

The number of nonzero coordinates of $\overline{c}$ is at most $\beta m$ , where $m$ is the number of marked columns of $C$ .

We call a $\overline{c}$ satisfying the above constraints valid. By the linearity of $S(\cdot)$ , we can write

We will now establish the following claim.

We start by making an observation about the distribution of $S(\overline{c})=S(\overline{c})|_{C,\overline{c}}$ , which denotes $S(\overline{c})$ when we condition on a fixed choice of a codebook $C$ and a valid choice of $\overline{c}$ . Because the non-zero coordinates of $\overline{c}$ are only in marked columns of $C$ (those in which $x_{j}=0$ or $x_{j}=n$ ), the distribution of

depends only on the number of non-zero coordinates of $\overline{c}$ , and not on their location. To see that this is the case, consider a $0$ -marked coordinate $j$ on which $\overline{c}_{j}=1$ . The contribution of $j$ to $S(\overline{c})$ is exactly $-n/q_{j}$ . Similarly, for a $1$ -marked coordinate $j$ on which $\overline{c}_{j}=-1$ , the contribution of $j$ to $S(\overline{c})$ is exactly $-nq_{j}$ . Thus we can write

Each term in the first sum (resp. second sum) is a random variable that depends only on the distribution of $q_{j}$ conditioned on the $j$ -th column being $0$ -marked (resp. $1$ -marked). Recall that $q_{j}$ is determined by $p_{j}$ . Moreover, conditioned on a fixed $C$ , the $p_{j}$ ’s are independent. To see this, let $C_{j}$ denote the $j$ -th column of the codebook $C$ . Recall that each column $C_{j}$ is generated independently using $p_{j}$ , and the $p_{j}$ ’s themselves are chosen independently. Letting $f_{X}$ denote the density function of a random variable $X$ , this means that the joint density

This shows that the conditional random variables $p_{j}|_{C_{j}}$ are independent. Moreover, since $\overline{c}$ only depends on the codebook $C$ and coins of the adversary $\mathcal{B}$ , the $p_{j}$ ’s are still independent when we also condition on $\overline{c}$ . In fact, the following holds:

Conditioned on any fixed choice of $C$ and $\overline{c}$ , the following random variables are identically distributed, independent, and non-negative: 1) $(n/q_{j}\mid j\text{ is }0\text{-marked})$ and 2) $(nq_{j}\mid j\text{ is }1\text{-marked})$ , for $j\in[d]$ .

By the discussion above, we know that these random variables are independent. To see that they are identically distributed, note that the distribution of $p_{j}$ used to generate the $j$ -th column of $C$ is symmetric about $1/2$ . Therefore, the probability that column $j$ is $0$ -marked when its entries are sampled according to $p_{j}$ is the same as the probability that $j$ is $1$ -marked when its entries are sampled according to $1-p_{j}$ . Applying Bayes’ rule, again using the fact that $p_{j}$ and $1-p_{j}$ have the same distribution, we see that the random variables $(p_{j}\mid j\text{ is }0\text{-marked})$ and $(1-p_{j}\mid j\text{ is }1\text{-marked})$ are identically distributed. The claim follows since $q_{j}=\sqrt{(1-p_{j})/p_{j}}$ . ∎
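The symmetry used in this proof can be checked numerically. The Python sketch below assumes the sampling rule from Tardos’ construction, namely $p=\sin^{2}(r)$ with $r$ uniform on $[t^{\prime},\pi/2-t^{\prime}]$ and $t^{\prime}=\arcsin(\sqrt{1/300n})$ ; this rule is an assumption here, since the figure describing the code is not reproduced in this section. The function names are hypothetical.

```python
import math


def cutoff_angle(n):
    """t' = arcsin(sqrt(t)) with t = 1/(300 n)."""
    return math.asin(math.sqrt(1.0 / (300 * n)))


def bias_from_angle(r):
    """p = sin^2(r); assumed sampling rule with r uniform on [t', pi/2 - t']."""
    return math.sin(r) ** 2


def q_of(p):
    """q = sqrt((1 - p) / p)."""
    return math.sqrt((1 - p) / p)


def prob_marked(p, n, b):
    """Probability that n independent Bernoulli(p) bits all equal b."""
    return p ** n if b == 1 else (1 - p) ** n


# Reflecting r to pi/2 - r maps p to 1 - p, q to 1/q, and swaps the roles of
# 0-marked and 1-marked columns, so n/q (0-marked) and n*q (1-marked) match up.
n = 5
t_prime = cutoff_angle(n)
for frac in (0.0, 0.25, 0.5, 0.9, 1.0):
    r = t_prime + frac * (math.pi / 2 - 2 * t_prime)
    p, p_reflected = bias_from_angle(r), bias_from_angle(math.pi / 2 - r)
    print(round(p, 6), round(p_reflected, 6), round(n / q_of(p), 6), round(n * q_of(p_reflected), 6))
```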

and conditioned on having $\beta m$ errors, those errors occur on a uniformly random set of marked columns. Thus, if we can show that

provided $n,1/\xi$ are sufficiently large.

First, observe that, since the distribution of $p$ is symmetric about $1/2$ , $A_{0}=A_{n}$ . Second, if we let

In order to obtain a strong enough bound, we need to show that $A_{n}-B_{n}=O(\beta\alpha)$ . We can calculate

Now we apply the approximation $e^{u}\leq 1+2u$ , which holds for $0\leq u\leq 1$ . To do so, we choose $\alpha=\sqrt{t}/n$ . Since $q=\sqrt{(1-p)/p}$ and $p\geq t$ , we have $\alpha nq\leq 1$ for this choice of $\alpha$ . Thus we have

The final inequality holds as long as $n$ is larger than some absolute constant. (To see that this is the case, recall that $t^{\prime}=\arcsin(\sqrt{t})=\arcsin(\sqrt{1/(300n)})=\Theta(1/\sqrt{n})$, whereas $(1-1/(300n))^{n}=1-\Omega(1)$.) So we have established

Plugging this fact into the analysis above, we have

Now all that remains is to apply Markov’s inequality to bound this quantity by $\xi^{\sqrt{n}/4}$ .

To get the desired upper bound, it is sufficient to show

where the last inequality holds when $\beta<1/25$ . This is sufficient to complete the proof of Claim 6.10. ∎
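The two elementary facts used in the approximation step of this proof, namely $e^{u}\leq 1+2u$ on $[0,1]$ and $\alpha nq\leq 1$ for $\alpha=\sqrt{t}/n$, can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, $t=1/(300n)$ is taken from the text, and the sweep over $p\in[t,1-t]$ assumes the usual support of Tardos' distribution (only $p\geq t$ is actually needed).

```python
import math

def exp_bound_holds(steps=1000):
    """Check e^u <= 1 + 2u on a grid of u in [0, 1]."""
    return all(math.exp(k / steps) <= 1 + 2 * k / steps
               for k in range(steps + 1))

def max_alpha_n_q(n, steps=1000):
    """Largest value of alpha * n * q over a grid of p in [t, 1 - t].

    alpha = sqrt(t) / n and q = sqrt((1 - p) / p), as in the text.
    The upper endpoint 1 - t is an assumption about the support of p.
    """
    t = 1.0 / (300 * n)
    alpha = math.sqrt(t) / n
    worst = 0.0
    for k in range(steps + 1):
        p = t + (1 - 2 * t) * k / steps
        q = math.sqrt((1 - p) / p)
        worst = max(worst, alpha * n * q)
    return worst

if __name__ == "__main__":
    print(exp_bound_holds())
    print(max_alpha_n_q(10))
```

Since $q$ is decreasing in $p$, the maximum is attained at $p=t$, where $\alpha nq=\sqrt{1-t}<1$.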

Lemmas 6.7 and 6.8 are sufficient to imply Lemma 6.6, that Tardos’ fingerprinting code is weakly robust. In order to apply our reduction from full robustness to weak robustness (Lemma 6.4), we also need to establish that, with high probability, there are many marked columns in the matrix $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$ for Tardos’ fingerprinting code.

With probability at least $1-\xi$ over the choice of $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$, it holds that the number of $0$-marked columns $m_{0}$ and the number of $1$-marked columns $m_{1}$ are both larger than $m=5n^{3/2}\log(n/\xi)$.

To estimate the number of marked columns, define for each $j=1,\ldots,d$ an indicator random variable $D_{j}$ for whether column $j$ is $0$-marked. The $D_{j}$’s are i.i.d., and have expectation at least

A similar argument holds for $1$-marked columns. Thus, letting $m=5n\sqrt{n}\log(n/\xi)$, the codebook $C$ has at least $m$ $0$-marked columns and $m$ $1$-marked columns with probability at least $1-\xi$. Now observe that

for $n$ larger than some absolute constant. ∎
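To get a feel for the counting argument, the probability that a single column is $0$-marked can be computed numerically. The sketch below is hedged in the same way as the lemma's setup is not restated here: it assumes $p=\sin^{2}(r)$ with $r$ uniform on $[t^{\prime},\pi/2-t^{\prime}]$ and $t=1/(300n)$, takes "$0$-marked" to mean that all $n$ entries are $0$, and reads $\log$ as the natural logarithm. The function names are hypothetical, and `columns_needed` is only a back-of-the-envelope expectation, not the code length $d$ used in the paper.

```python
import math

def zero_marked_probability(n, grid=20000):
    """P[a column is all zeros] = E[(1 - p)^n] under the assumed
    distribution p = sin^2(r), r uniform on [t', pi/2 - t']."""
    t = 1.0 / (300 * n)
    t_prime = math.asin(math.sqrt(t))
    lo, hi = t_prime, math.pi / 2 - t_prime
    h = (hi - lo) / grid
    total = 0.0
    for k in range(grid):
        r = lo + (k + 0.5) * h
        total += math.cos(r) ** (2 * n)  # (1 - p)^n = cos(r)^(2n)
    return total / grid

def columns_needed(n, xi):
    """Columns for which the expected number of 0-marked columns
    equals m = 5 n^{3/2} log(n / xi) (natural log assumed)."""
    m = 5 * n ** 1.5 * math.log(n / xi)
    return m / zero_marked_probability(n)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        # scaled probability stays roughly constant, i.e. Theta(1/sqrt(n))
        print(n, zero_marked_probability(n) * math.sqrt(n))
```

The scaled values staying roughly constant across $n$ is consistent with a per-column marking probability of order $1/\sqrt{n}$, which is why $m$ of order $n^{3/2}\log(n/\xi)$ marked columns is attainable with a code length of order $n^{2}\log(n/\xi)$.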

Combining Lemma 6.4 (reduction from robustness to weak robustness), Lemma 6.6 (weak robustness of Tardos’ code), and Lemma 6.12 (Tardos’ code has many marked columns) suffices to prove Theorem 6.2.

Acknowledgements

We thank Kobbi Nissim for drawing our attention to the question of sample complexity and for many helpful discussions. We thank Adam Smith for suggesting that we use the Gaussian mechanism to provide a new proof of the lower bound on the length of fingerprinting codes. Finally, we thank the anonymous reviewers for their helpful comments.

References

Appendix A Lower Bounds on Fingerprinting Codes via Differential Privacy

By the contrapositive of Theorem 3.5, upper bounds on the sample complexity of answering $1$ -way marginals with differential privacy imply a lower bound on the length $d$ of any fingerprinting code with a given number of users $n$ . As pointed out to us by Adam Smith, this yields a particularly simple, self-contained proof of Tardos’ [Tar08] optimal lower bound on the length of fingerprinting codes. Specifically, using the well known Gaussian mechanism for achieving differential privacy, we can design a simple adversary $\mathcal{A}_{\mathit{FP}}$ that violates the security of any traitor tracing scheme with length $d=o(n^{2})$ .

Before diving into the proof, we will state the following elementary fact about Gaussian random variables. The fact simply says that a Gaussian random variable with suitable variance is “close” to a shifted version of itself in a particular sense. This same fact is used to show that adding Gaussian noise of suitable variance provides differential privacy.
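As a rough illustration of what "close to a shifted version of itself" means, the following sketch works out the standard one-dimensional statement: if $X\sim N(0,\sigma^{2})$ and $v>0$, then $\Pr[X\in T]\leq e^{\varepsilon}\Pr[X+v\in T]+\delta$ for every set $T$, where $\delta$ is the probability that the log-density ratio exceeds $\varepsilon$. This is not necessarily the exact form of the fact as stated in the paper; the function names and the restriction to half-line events in the check are choices made for this example.

```python
import math

def normal_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def tail_delta(sigma, shift, eps):
    """P over x ~ N(0, sigma^2) that ln(f_0(x) / f_shift(x)) > eps.

    The log-ratio equals (shift^2 - 2 x shift) / (2 sigma^2), which is
    Gaussian, so the tail is Q(eps*sigma/shift - shift/(2*sigma)).
    Requires shift > 0.
    """
    z = eps * sigma / shift - shift / (2 * sigma)
    return 0.5 * math.erfc(z / math.sqrt(2))

def holds_on_halflines(sigma, shift, eps, steps=2000, span=10.0):
    """Check P[X < a] <= e^eps * P[X + shift < a] + delta on a grid of a."""
    delta = tail_delta(sigma, shift, eps)
    for k in range(steps + 1):
        a = -span * sigma + (2 * span * sigma + shift) * k / steps
        left = normal_cdf(a / sigma)
        right = math.exp(eps) * normal_cdf((a - shift) / sigma) + delta
        if left > right + 1e-12:
            return False
    return True

if __name__ == "__main__":
    print(tail_delta(1.0, 1.0, 1.0))
    print(holds_on_halflines(1.0, 1.0, 1.0))
```

Lower half-lines are the natural test events here because the set where the density ratio exceeds $e^{\varepsilon}$ is itself a lower half-line. Increasing $\sigma$ relative to the shift drives $\delta$ down rapidly, which is the trade-off the Gaussian mechanism exploits.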

Let $\mathcal{A}_{\mathit{FP}}(C_{S})$ be the following adversary. Define the vector $\overline{c}\in^{d}$ as

First we claim that $\mathcal{A}_{\mathit{FP}}$ outputs feasible codewords with at least constant probability.

For every $S$ such that $|S|\geq n-1,$ and every codebook $C=(c_{ij})\in\{0,1\}^{n\times d},$

By a standard tail bound for the Gaussian, we have

Now it remains to show that $\mathcal{A}_{\mathit{FP}}$ cannot be traced successfully. By assumption, $(\mathit{Gen},\mathit{Trace})$ has security $\xi<1/(6en)<1/3$. Then we have in particular

Therefore, there exists $i^{*}\in[n]$ such that

To complete the proof, it now suffices to show that if $S=[n]\setminus\left\{i^{*}\right\}$ , then

which will contradict the security of the fingerprinting code.

By Fact A.2 (with $\delta=1/(6en)>\xi$), for every $r$,

Applying (10), and averaging over $C\leftarrow_{\mbox{\tiny R}}\mathit{Gen}$ and $r$ , we have

which is the desired contradiction. This completes the proof. ∎
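For intuition, a Gaussian-mechanism style adversary of the kind discussed in this appendix can be sketched in code. The displayed definition of $\mathcal{A}_{\mathit{FP}}$ is not reproduced in this text, so everything below is an assumption: the adversary is taken to average the coalition's codewords column by column, add independent Gaussian noise of standard deviation `sigma` to each coordinate, and round to $\{0,1\}$. The name `gaussian_pirate` and the choice of `sigma` as a free parameter are hypothetical.

```python
import random

def gaussian_pirate(codebook, coalition, sigma, rng=None):
    """Hypothetical sketch of a noisy-average adversary.

    codebook:  list of n rows, each a list of d bits
    coalition: indices of the rows the adversary holds (the set S)
    sigma:     noise standard deviation (a free parameter here)
    """
    rng = rng or random.Random()
    d = len(codebook[0])
    size = len(coalition)
    output = []
    for j in range(d):
        # column average over the coalition, a 1-way marginal
        mean_j = sum(codebook[i][j] for i in coalition) / size
        noisy = mean_j + rng.gauss(0.0, sigma)
        # round the noisy marginal to a bit
        output.append(1 if noisy >= 0.5 else 0)
    return output

if __name__ == "__main__":
    C = [[0, 1, 1], [0, 1, 0], [0, 1, 1]]
    print(gaussian_pirate(C, [0, 1], sigma=0.05, rng=random.Random(0)))
```

In a column where every coalition member has the same bit, the output can differ from that bit only if the noise exceeds $1/2$ in magnitude, which is why small noise yields feasible codewords; at the same time, dropping a single user changes each average by at most $1/|S|$, which is the sensitivity the Gaussian noise is meant to mask.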