Fingerprinting Codes and the Price of Approximate Differential Privacy

Mark Bun, Jonathan Ullman, Salil Vadhan

Introduction

Consider a database DXnD\in\mathcal{X}^{n}, in which each of the nn rows corresponds to an individual’s record, and each record is an element of some data universe X\mathcal{X} (e.g. X={0,1}d\mathcal{X}=\{0,1\}^{d}, corresponding to dd binary attributes per record). The goal of privacy-preserving data analysis is to enable rich statistical analyses on such a database while protecting the privacy of the individuals. It is especially desirable to achieve (ε,δ)(\varepsilon,\delta)-differential privacy [DMNS06, DKM+06], which (for suitable choices of ε\varepsilon and δ\delta) guarantees that no individual’s data has a significant influence on the information released about the database. A natural way to measure the tradeoff between these two goals is via sample complexity—the minimum number of records nn such that there exists a (possibly computationally unbounded) algorithm that achieves both differential privacy and statistical accuracy.

Some of the most basic statistics are counting queries, which are queries of the form “What fraction of individual records in DD satisfy some property qq?” In particular, we would like to design an algorithm that takes as input a database DD and, for some family of counting queries Q\mathcal{Q}, outputs an approximate answer to each of the queries in Q\mathcal{Q} that is accurate to within, say, ±.01\pm.01. Suppose we are given a bound on the number of queries Q|\mathcal{Q}| and the dimensionality of the database records dd, but otherwise allow the family Q\mathcal{Q} to be arbitrary. What is the sample complexity required to achieve (ε,δ)(\varepsilon,\delta)-differential privacy and statistical accuracy for Q\mathcal{Q}?

Of course, if we drop the requirement of privacy, then we could achieve perfect accuracy when DD contains any number of records. However, in many interesting settings the database DD consists of random samples from some larger population, and an analyst is actually interested in answering the queries on the population. Thus, even without a privacy constraint, DD would need to contain enough records to ensure that (with high probability) for every query qQq\in\mathcal{Q}, the answer to qq on DD is close to the answer to qq on the whole population, say within ±.01\pm.01. To achieve this form of statistical accuracy, it is well-known that it is necessary and sufficient for DD to contain Θ(logQ)\Theta(\log|\mathcal{Q}|) samples.For a specific family of queries Q\mathcal{Q}, the necessary and sufficient number of samples is proportional to the VC-dimension of Q\mathcal{Q}, which can be as large as logQ\log|\mathcal{Q}|. In this work we consider whether there is an additional “price of differential privacy” if we require both statistical accuracy and (ε,δ)(\varepsilon,\delta)-differential privacy (for, say, ε=O(1)\varepsilon=O(1), δ=o(1/n)\delta=o(1/n)). This benchmark has often been used to evaluate the utility of differentially private algorithms, beginning with the seminal work of Dinur and Nissim [DN03].

These results show that the price of privacy is small for datasets with few attributes, but may be large for high-dimensional datasets. For example, if we simply want to estimate the mean of each of the dd attributes without a privacy guarantee, then Θ(logd)\Theta(\log d) samples are necessary and sufficient to get statistical accuracy. However, the best known (ε,δ)(\varepsilon,\delta)-differentially private algorithm requires Ω(d)\Omega(\sqrt{d}) samples—an exponential gap. In the special case of pure (ε,0)(\varepsilon,0)-differential privacy, a lower bound of Ω(d)\Omega(d) is known [HT10]. However, for the general case of approximate (ε,δ)(\varepsilon,\delta)-differential privacy the best known lower bound is Ω(logd)\Omega(\log d) [DN03]. More generally, there are no known lower bounds that separate the sample complexity of (ε,δ)(\varepsilon,\delta)-differential privacy from the sample complexity required for statistical accuracy alone.

In this work we close this gap almost completely, and show that there is indeed a “price of approximate differential privacy” for high-dimensional datasets.

We establish this lower bound using a combinatorial object called a fingerprinting code, which was originally introduced by Boneh and Shaw [BS98] for the problem of watermarking copyrighted content. Specifically, we use Tardos’ construction of optimal fingerprinting codes [Tar08]. The use of “secure content distribution schemes” to prove lower bounds for differential privacy originates with the work of Dwork et al. [DNR+09], who used “traitor-tracing schemes,” which are a cryptographic analogue of information-theoretic fingerprinting codes, to prove computational hardness results for differential privacy. Extending this connection, Ullman [Ull13] used fingerprinting codes to construct a novel traitor-tracing scheme and obtain a strong computational hardness result for differential privacy.In fact, one way to prove Theorem 1.1 is by replacing the one-way functions in [Ull13] with a random oracle, and thereby obtain an information-theoretically secure traitor-tracing scheme. Here we show that a direct use of fingerprinting codes yields information-theoretic lower bounds on sample complexity.

Next, we consider the sample complexity of answering an arbitrary set Q\mathcal{Q} of counting queries to within error ±α\pm\alpha. As above, if we assume the database contains samples from a population, and require only that the answers to queries on the sampled database and the population are close, to within ±α\pm\alpha, then Θ(logQ/α2)\Theta(\log|\mathcal{Q}|/\alpha^{2}) samples are necessary and sufficient for just statistical accuracy. When Q|\mathcal{Q}| is large (relative to dd and 1/α1/\alpha), the best sample complexity for differential privacy is again achieved by the private multiplicative weights algorithm, and is O(dlogQ/α2)O(\sqrt{d}\log|\mathcal{Q}|/\alpha^{2}). For pure differential privacy, a lower bound of Ω(dlogQ/α2)\Omega(d\log|\mathcal{Q}|/\alpha^{2}) is known [Har11]. On the other hand, the best known lower bound for approximate differential privacy is Ω(max{logQ/α,1/α2})\Omega(\max\{\log|\mathcal{Q}|/\alpha,1/\alpha^{2}\}), which follows from the techniques of [DN03]. To resolve this gap, we give a composition theorem that allows us to obtain a nearly optimal lower bound by combining Theorem 1.1 with (variants of) the existing sample complexity lower bounds. The result shows that the private multiplicative weights algorithm achieves nearly-optimal sample-complexity as a function of Q,d|\mathcal{Q}|,d, and α\alpha.

We remark that the condition that d6log(1/α)d\geq 6\log(1/\alpha) is both necessary (up to the constant factor) and fairly mild. Necessary because the noisy histogram algorithm (see, e.g. [Vad16]) requires n=O(2d/2logQ/α)n=O(2^{d/2}\sqrt{\log|\mathcal{Q}|}/\alpha) samples, which is better than the conclusion of the lower bound when d<2log(1/α)d<2\log(1/\alpha). Mild because differential privacy cannot be satisfied for large query sets unless α1/n\alpha\gtrsim 1/\sqrt{n}, so the condition is no stronger than assuming n2d/3n\lesssim 2^{d/3}, in which case the number of samples is exponential in the dimension. Similarly, the condition sd/α2s\geq d/\alpha^{2} is also necessary, since adding independent noise to each query requires only nQ1/2/αn\gtrsim|\mathcal{Q}|^{1/2}/\alpha samples.

Finally, we consider the sample complexity of the natural and well studied class of kk-way marginal queries, also known as kk-way conjunction queries (see e.g. [BCD+07, KRSU10, GHRU11, TUV12, CTUW14, DNT13]). A kk-way marginal query on a database D({0,1}d)nD\in(\{0,1\}^{d})^{n} is specified by a set S[d]S\subseteq[d], Sk|S|\leq k, and a pattern t{0,1}St\in\{0,1\}^{|S|} and asks “What fraction of records in DD has each attribute jj in SS set to tjt_{j}?” The number of kk-way marginal queries on {0,1}d\{0,1\}^{d} is about 2k(dk)2^{k}\binom{d}{k}. For the special case of k=1k=1, the queries simply ask for the mean of each attribute, which was discussed above. We prove that the lower bound of Theorem 1.2, which applies to worst-case queries, also holds for the special case of kk-way marginal queries when α\alpha is not too small.

We remark that, since the number of kk-way marginal queries is about 2k(dk)2^{k}\binom{d}{k}, the sample complexity lower bound in Theoem 1.3 essentially matches that of Theorem 1.2. The two theorems are incomparable, since Theorem 1.2 applies even when α\alpha is exponentially small in dd, but only applies for a worst-case family of queries.

We now describe the main technical ingredients used to prove these results. For concreteness, we will describe the main ideas for the case of kk-way marginal queries.

Fingerprinting codes, introduced by Boneh and Shaw [BS98], were originally designed to address the problem of watermarking copyrighted content. Roughly speaking, a (fully-collusion-resilient) fingerprinting code is a way of generating codewords for nn users in such a way that any codeword can be uniquely traced back to a user. Each legitimate copy of a piece of digital content has such a codeword hidden in it, and thus any illegal copy can be traced back to the user who copied it. Moreover, even if an arbitrary subset of the users collude to produce a copy of the content, then under a certain marking assumption, the codeword appearing in the copy can still be traced back to one of the users who contributed to it. The standard marking assumption is that if every colluder has the same bit bb in the jj-th bit of their codeword, then the jj-th bit of the “combined” codeword in the copy they produce must be also bb. We refer the reader to the original paper of Boneh and Shaw [BS98] for the motivation behind the marking assumption and an explanation of how fingerprinting codes can be used to watermark digital content.

We show that the existence of short fingerprinting codes implies sample complexity lower bounds for 11-way marginal queries. Recall that a 11-way marginal query qjq_{j} is specified by an integer j[d]j\in[d] and asks simply “What fraction of records in DD have a 11 in the jj-th bit?” Suppose a coalition of users takes their codewords and builds a database D({0,1}d)nD\in(\{0,1\}^{d})^{n} where each record contains one of their codewords, and dd is the length of the codewords. Consider the 11-way marginal query qj(D)q_{j}(D). If every user in SS has a bit bb in the jj-th bit of their codeword, then qj(D)=bq_{j}(D)=b. Thus, if an algorithm answers 11-way marginal queries on DD with non-trivial accuracy, its output can be used to obtain a combined codeword that satisfies the marking assumption. By the tracing property of fingerprinting codes, we can use the combined codeword to identify one of the users in the database. However, if we can identify one of the users from the answers, then the algorithm is not differentially private.

Theorems 1.2 and 1.3 are proven by using this composition technique repeatedly to combine our lower bound for 11-way marginals with (variants of) several known lower bounds that capture the optimal dependence on logQ\log|\mathcal{Q}| and 1/α21/\alpha^{2}.

The connection between fingerprinting codes and differential privacy lower bounds extends to arbitrary families Q\mathcal{Q} of counting queries. We introduce the notion of a generalized fingerprinting code with respect to Q\mathcal{Q}, where each codeword corresponds to a data universe element xXx\in\mathcal{X} and the bits of the codeword are given by q(x)q(x) for each qQq\in\mathcal{Q}, but is the same as an ordinary fingerprinting code otherwise. The existence of a generalized fingerprinting code with respect to Q\mathcal{Q}, for nn users, implies a sample complexity lower bound of nn for privately releasing answers to Q\mathcal{Q}. We also show a partial converse to the above result, which states that some sort of “fingerprinting-code-like object” is necessary to prove sample complexity lower bounds for answering counting queries under differential privacy. This object has similar semantics to a generalized fingerprinting code, however the marking assumption required for tracing is slightly stronger and the probability that tracing succeeds can be significantly smaller than what is required by the standard definition of fingerprinting codes. Our partial converse parallels the result of Dwork et al. [DNR+09] that shows computational hardness results for differential privacy imply a “traitor-tracing-like object.” We leave it as an open question to pin down precisely the relationship between fingerprinting codes and information-theoretic lower bounds in differential privacy (and also between traitor-tracing schemes and computational hardness results for differential privacy).

2 Other Related Work

There have been attempts to prove optimal sample complexity lower bounds for kk-way marginals. In particular, when kk is a constant, Kasiviswanathan et al. [KRSU10] and De [De12] prove a lower bound of min{Q1/2/α,1/α2}\min\{|\mathcal{Q}|^{1/2}/\alpha,1/\alpha^{2}\} on the sample complexity. Note that when α\alpha is a constant, these lower bounds are O(1)O(1).

In addition to the general computational hardness results referenced above, there are several results that show stronger hardness results for restricted types of efficient algorithms [UV11, GHRU11, DNV12].

2.2 Subsequent Work

Subsequent to our work, Steinke and Ullman [SU15a] refined our use of fingerprinting codes to prove a lower bound of Ω(dlog(1/δ)/ε)\Omega(\sqrt{d\log(1/\delta)}/\varepsilon) on the number of samples required to release the mean of each of the dd attributes under (ε,δ)(\varepsilon,\delta)-differential privacy when δ1/n\delta\ll 1/n. This lower bound is optimal up to constant factors, and improves on Theorem 1.1 by a factor of roughly log(1/δ)logd\sqrt{\log(1/\delta)}\cdot\log d. They also improve and simplify our analysis of robust fingerprinting codes.

Our fingerprinting code technique has also been used to prove lower bounds for other types of differentially private data analyses. Namely, Dwork et al. [DTTZ14] prove lower bounds for differentially private principal component analysis and Bassily, Smith, and Thakurta [BST14] prove lower bounds for differentially private empirical risk minimization. In order to establish lower bounds for privately releasing threshold functions, Bun et al. [BNSV15] construct a fingerprinting-code-like object that yields a lower bound for the problem of releasing a value between the minimum and maximum of a dataset.

Dwork et al. [DSS+15] observe that the privacy attack implicit in our negative results is closely related to the influential attacks that were employed by Homer et al. [HSR+08] (and further studied in [SOJH09]) to violate privacy of public genetic datasets. Using this connection, they show how to make Homer et al.’s attack robust to very general models of noise and how to make the attack work without detailed knowledge of the population the dataset represents.

A pair of works [HU14, SU15b] show that fingerprinting codes and the related traitor-tracing schemes imply both information-theoretic lower bounds and computational hardness results for the “false discovery” problem in adaptive data analysis. Specifically, they show lower bounds for answering an online sequence of adaptively chosen counting queries where the database is a sample from some unknown distribution and the answers must be accurate with respect to that distribution. These works [HU14, SU15b] effectively reverse a connection established in [DFH+15, BSSU15], which used differentially private algorithms to obtain positive results for this problem.

Our technique for composing lower bounds in differential privacy has also found applications outside of privacy. Specifically, Liberty et al. [LMTU14] used this technique to prove nearly optimal lower bounds on the space required to “sketch” a database while approximately preserving answers to kk-way marginal queries (called “frequent itemset queries” in their work).

Preliminaries

We define a database DXnD\in\mathcal{X}^{n} to be an ordered tuple of nn rows (x1,,xn)X(x_{1},\ldots,x_{n})\in\mathcal{X} chosen from a data universe X\mathcal{X}. We say that two databases D,DXnD,D^{\prime}\in\mathcal{X}^{n} are adjacent if they differ only by a single row, and we denote this by DDD\sim D^{\prime}. In particular, we can replace the iith row of a database DD with some fixed “junk” element of X\mathcal{X} to obtain another database DiDD_{-i}\sim D. We emphasize that if DD is a database of size nn, then DiD_{-i} is also a database of size nn.

Let A:XnR\mathcal{A}:\mathcal{X}^{n}\to\mathcal{R} be a randomized algorithm (where nn is a varying parameter). A\mathcal{A} is (ε,δ)(\varepsilon,\delta)-differentially private if for every two adjacent databases DDD\sim D^{\prime} and every subset SRS\subseteq\mathcal{R},

Let A:XnR\mathcal{A}:\mathcal{X}^{n}\to\mathcal{R} be a randomized algorithm such that for every DXnD\in\mathcal{X}^{n}, every i,j[n]i,j\in[n], and every subset SRS\subseteq\mathcal{R},

Let \bot denote the fixed junk element of X\mathcal{X}. Then A:Xn1R\mathcal{A}^{\prime}:\mathcal{X}^{n-1}\to\mathcal{R} defined by A(x1,,xn1)=A(x1,,xn1,)\mathcal{A}^{\prime}(x_{1},\ldots,x_{n-1})=\mathcal{A}(x_{1},\ldots,x_{n-1},\bot) is (2ε,(eε+1)δ)(2\varepsilon,(e^{\varepsilon}+1)\delta)-differentially private.

Let D=(x1,,xn1)D=(x_{1},\ldots,x_{n-1}) and D=(x1,,xi,,xn1)D^{\prime}=(x_{1},\ldots,x_{i}^{\prime},\ldots,x_{n-1}) be adjacent databases. Then for any SRS\subseteq\mathcal{R}, we have

2 Counting Queries and Accuracy

In this paper we study algorithms that answer counting queries. A counting query on X\mathcal{X} is defined by a predicate q:X{0,1}q:\mathcal{X}\to\{0,1\}. Abusing notation, we define the evaluation of the query qq on a database D=(x1,,xn)XnD=(x_{1},\ldots,x_{n})\in\mathcal{X}^{n} to be its average value over the rows,

When β=0\beta=0 we may simply write that aa or A\mathcal{A} is α\alpha-accurate for Q\mathcal{Q}.

An important example of a collection of counting queries is the set of kk-way marginals. For all of our results it will be sufficient to consider only the set of monotone kk-way marginals.

A (monotone) kk-way marginal qSq_{S} over {0,1}d\{0,1\}^{d} is specified by a subset S[d]S\subseteq[d] of size Sk|S|\leq k. It takes the value qS(x)=1q_{S}(x)=1 if and only if xi=1x_{i}=1 for every index iSi\in S. The collection of all (monotone) kk-way marginals is denoted by Mk,d\mathcal{M}_{k,d}.

3 Sample Complexity

In this work we prove lower bounds on the sample complexity required to simultaneously achieve differential privacy and accuracy.

We will focus on the case where ε=O(1)\varepsilon=O(1) and δ=o(1/n)\delta=o(1/n). This setting of the parameters is essentially the most-permissive for which (ε,δ)(\varepsilon,\delta)-differential privacy is still a meaningful privacy definition. However, pinning down the exact dependence on ε\varepsilon and δ\delta is still of interest. Regarding ε\varepsilon, this can be done via the following standard lemma, which allows us to take ε=1\varepsilon=1 without loss of generality.

For every set of counting queries Q\mathcal{Q}, universe X\mathcal{X}, α,β,ε1\alpha,\beta\in,\varepsilon\leq 1. (Q,X)(\mathcal{Q},\mathcal{X}) has sample complexity nn^{*} for (α,β)(\alpha,\beta)-accuracy and (1,o(1/n))(1,o(1/n))-differential privacy if and only if it has sample complexity Θ(n/ε)\Theta(n^{*}/\varepsilon) for (α,β)(\alpha,\beta)-accuracy and (ε,o(1/n))(\varepsilon,o(1/n))-differential privacy.

One direction (O(n/ε)O(n^{*}/\varepsilon) samples are sufficient) is the “secrecy-of-the-sample lemma,” which appeared implicitly in [KLN+11]. The other direction (Ω(n/ε)\Omega(n^{*}/\varepsilon) samples are necessary) appears to be folklore.

The next lemma allows us to generically translate sample complexity lower bounds for constant accuracy into lower bounds that depend on the error parameter α\alpha. For some sets of queries, such as 11-way marginals, the dependence we get on α\alpha is tight. However, as we will see in Section 5, we can obtain lower bounds with an even stronger dependence on α\alpha for specific sets of queries.

Let Q\mathcal{Q} be a set of counting queries on X\mathcal{X} and let β,ε,δ>0\beta,\varepsilon,\delta>0. Suppose (Q,X)(\mathcal{Q},\mathcal{X}) has sample complexity nn^{*} for (α0,β)(\alpha_{0},\beta)-accuracy and (ε,δ)(\varepsilon,\delta)-differential privacy, where α0(0,1)\alpha_{0}\in(0,1) is a constant. Then (Q,X)(\mathcal{Q},\mathcal{X}) has sample complexity Ω(n/α)\Omega(n^{*}/\alpha) for (α,β,γ)(\alpha,\beta,\gamma)-accuracy and (ε,δ)(\varepsilon,\delta)-differential privacy.

The mechanism A\mathcal{A}^{\prime} inherits (ε,δ)(\varepsilon,\delta)-differential privacy from A\mathcal{A}, since changing one row of DD^{\prime} changes one row of the padded database DD^{\prime}. Now we argue accuracy. Suppose aqa_{q} is an answer such that aqq(D)α|a_{q}-q(D)|\leq\alpha. Note that by construction, q(D)=1m(nq(D)+(mn)q(x0))q(D)=\frac{1}{m}(nq(D^{\prime})+(m-n)q(x_{0})), and hence q(D)=1n(mq(D)(mn)q(x0))q(D^{\prime})=\frac{1}{n}(mq(D)-(m-n)q(x_{0})). Thus we have

Taking n=mα/α0n=\lceil m\alpha/\alpha_{0}\rceil makes this quantity at most α0\alpha_{0}, completing the proof. ∎

For context, we can restate some prior results on differentially private counting query release in our sample-complexity terminology.

For every set of counting queries Q\mathcal{Q} on X\mathcal{X} and every α>0\alpha>0, (Q,X)(\mathcal{Q},\mathcal{X}) has sample complexity at most

for (α,0)(\alpha,0)-accuracy and (1,o(1/n))(1,o(1/n))-differential privacy.

The next theorem shows that, when the data universe is not too small, the private multiplicative weights algorithm is nearly-optimal as a function of Q|\mathcal{Q}| and 1/α1/\alpha when each parameter is considered individually.

for (α,0)(\alpha,0)-accuracy and (1,o(1/n))(1,o(1/n))-differential privacy.

4 Re-identifiable Distributions

All of our eventual lower bounds will take the form of a “re-identification” attack, in which we possess data from a large number of individuals, and identify one such individual who was included in the database. In this attack, we choose a distribution on databases and give an adversary 1) a database DD drawn from that distribution and 2) either A(D)\mathcal{A}(D) or A(Di)\mathcal{A}(D_{-i}) for some row ii, where A\mathcal{A} is an alleged sanitizer. The adversary’s goal is to identify a row of DD that was given to the sanitizer. We say that the distribution is re-identifiable if there is an adversary who can identify such a row with sufficiently high confidence whenever A\mathcal{A} outputs accurate answers. If the adversary can do so, it means that there must be a pair of adjacent databases DDiD\sim D_{-i} such that the adversary can distinguish A(D)\mathcal{A}(D) from A(Di)\mathcal{A}(D_{-i}), which means A\mathcal{A} cannot be differentially private.

Here the probability is taken over the choice of DD and ii as well as the coins of A\mathcal{A} and B\mathcal{B}. We allow D\mathcal{D} and B\mathcal{B} to share a common state.

Note that, when row ii is not in the dataset, then it would be an error for B\mathcal{B} to declare that row ii is in the dataset, and condition 2 requires that the probability of this error occurring is at most ξ\xi.

The common state between D\mathcal{D} and B\mathcal{B} should be thought of as auxiliary information about the realization of DD that may help B\mathcal{B} identify a user ii. Formally, we could model this shared state by having D\mathcal{D} output an additional string auxaux that is given to B\mathcal{B} but not to A\mathcal{A}. However, we make the shared state implicit to reduce notational clutter. The need for this shared state will become apparent when we use fingerprinting codes to construct re-identifiable distributions; in the context of fingerprinting codes, the shared state represents auxiliary information about a codebook that helps the Trace\mathit{Trace} algorithm accuse a guilty pirate.

Lower Bounds via Fingerprinting Codes

In Section 3.1 we give the relevant background on fingerprinting codes and in Section 3.2 we prove our lower bounds for 11-way marginals.

Fingerprinting codes were introduced by Boneh and Shaw [BS98] to address the problem of watermarking digital content. A fingerprinting code is a pair of randomized algorithms (Gen,Trace)(\mathit{Gen},\mathit{Trace}). The code generator Gen\mathit{Gen} outputs a codebook C{0,1}n×dC\in\{0,1\}^{n\times d}. Each row cic_{i} of CC is the codeword of user ii. For a subset of users S[n]S\subseteq[n], we use CS{0,1}S×dC_{S}\in\{0,1\}^{|S|\times d} to denote the set of codewords of users in SS. The parameter dd is called the length of the fingerprinting code.

The security property of fingerprinting codes asserts that any codeword can be “traced” to a user i[n]i\in[n]. Moreover, we require that the fingerprinting code is “fully-collusion-resilient”—even if any “coalition” of users S[n]S\subseteq[n] gets together and “combines” their codewords in any way that respects certain constraints known as a marking assumption, then the combined codeword cc^{\prime} can be traced to a user iSi\in S. That is, there is a tracing algorithm Trace\mathit{Trace} that takes as inputs the codebook and combined codeword cc^{\prime} and outputs either a user i[n]i\in[n] or \bot, and we require that if cc^{\prime} satisfies the constraints, then Trace(C,c)S\mathit{Trace}(C,c^{\prime})\in S with high probability. Moreover, Trace\mathit{Trace} should accuse an innocent user, i.e. Trace(C,c)[n]S\mathit{Trace}(C,c^{\prime})\in[n]\setminus S, with very low probability. Analogous to the definition of re-identifiable distributions (Definition 2.10), we allow Gen\mathit{Gen} and Trace\mathit{Trace} to share a common state.As in Definition 2.10, we could model this by having Gen\mathit{Gen} output an additional string auxaux that is given to Trace\mathit{Trace}. However, we make the shared state implicit to reduce notational clutter. When designing fingerprinting codes, one tries to make the marking assumption on the combined codeword as weak as possible.

The basic marking assumption is that each bit of the combined word cc^{\prime} must match the corresponding bit for some user in SS. Formally, for a codebook C{0,1}n×dC\in\{0,1\}^{n\times d}, and a coalition S[n]S\subseteq[n], we define the set of feasible codewords for CSC_{S} to be

Observe that the combined codeword is only constrained on coordinates jj where all users in SS agree on the jj-th bit.

We are now ready to formally define a fingerprinting code.

where the probability is taken over the coins of Gen,Trace\mathit{Gen},\mathit{Trace}, and AFP\mathcal{A}_{\mathit{FP}}. The algorithms Gen\mathit{Gen} and Trace\mathit{Trace} may share a common state.

We remark that our proof of Theorem 3.5, showing how to construct re-identifiable distributions from a fingerprinting codes, will only require collusion resilience against coalitions SS of size Sn1|S|\geq n-1. Our choice to state Definition 3.1 using resilience against arbitrary coalitions is more consistent with the literature on fingerprinting codes.

Tardos [Tar08] constructed a family of fingerprinting codes with a nearly optimal number of users nn for a given length dd.

As we will see in the next subsection, fingerprinting codes satisfying Definition 3.1 will imply lower bounds on the sample complexity for releasing 11-way marginals with (α,0)(\alpha,0)-accuracy (accuracy for every query). In order to prove sample-complexity lower bounds for (α,β)(\alpha,\beta)-accuracy with β>0\beta>0, we will need fingerprinting codes satisfying a stronger security property. Specifically, we will expand the feasible set F(CS)F(C_{S}) to include all codewords that satisfy most feasibility constraints, and require that even codewords in this expanded set can be traced. Formally, for any β\beta\in, we define

where the probability is taken over the coins of Gen,Trace\mathit{Gen},\mathit{Trace}, and AFP\mathcal{A}_{\mathit{FP}}. The algorithms Gen\mathit{Gen} and Trace\mathit{Trace} may share a common state.

In Section 6 we show how to construct error-robust fingerprinting codes with a nearly-optimal number of users that are tolerant to a constant fraction of errors.

2 Lower Bounds for 111-Way Marginals

By combining Theorem 3.5 with Theorem 3.2 we obtain a sample complexity lower bound for 11-way marginals, and thereby establish Theorem 1.1 in the introduction.

Let (Gen,Trace)(\mathit{Gen},\mathit{Trace}) be the promised fingerprinting code. We define the re-identifiable distribution D\mathcal{D} to simply be the output distribution of the code generator, Gen\mathit{Gen}. And we define the privacy adversary B\mathcal{B} to take the answers a=A(D)M1,da=\mathcal{A}(D)\in^{|\mathcal{M}_{1,d}|}, obtain a{0,1}M1,d\overline{a}\in\{0,1\}^{|\mathcal{M}_{1,d}|} by rounding each entry of aa to {0,1}\{0,1\}, run the tracing algorithm Trace\mathit{Trace} on the rounded answers a\overline{a}, and return its output. The shared state of D\mathcal{D} and B\mathcal{B} will be the shared state of Gen\mathit{Gen} and Trace\mathit{Trace}.

Now we will verify that D\mathcal{D} is (ξ,ξ)(\xi,\xi)-re-identifiable. First, suppose that A(D)\mathcal{A}(D) outputs answers a=(aqj)j[d]a=(a_{q_{j}})_{j\in[d]} that are (1/3,β)(1/3,\beta)-accurate for 11-way marginals. That is, there is a set G[d]G\subseteq[d] such that G(1β)d|G|\geq(1-\beta)d and for every jGj\in G, the answer aqja_{q_{j}} estimates the fraction of rows having a 11 in column jj to within 1/31/3. Let aqj\overline{a}_{q_{j}} be aqja_{q_{j}} rounded to the nearest value in {0,1}\{0,1\}. Let jj be a column in GG. If column jj has all 11’s, then aqj2/3a_{q_{j}}\geq 2/3, and aqj=1\overline{a}_{q_{j}}=1. Similarly, if column jj has all ’s, then aqj1/3a_{q_{j}}\leq 1/3, and aqj=0\overline{a}_{q_{j}}=0. Therefore, we have

By security of the fingerprinting code (Definition 3.3), we have

But the event Trace(D,a)=\mathit{Trace}(D,\overline{a})=\bot is exactly the same as B(D,A(D))=\mathcal{B}(D,\mathcal{A}(D))=\bot, and thus we have established the first condition necessary for D\mathcal{D} to be (ξ,ξ)(\xi,\xi)-re-identifiable.

The second condition for re-identifiability follows directly from the soundness of the fingerprinting code, which asserts that for every adversary AFP\mathcal{A}_{\mathit{FP}}, in particular for A\mathcal{A}, it holds that

Using the additional structure of Tardos’ fingerprinting code, and our robust fingerprinting codes, we can prove minimax lower bounds for an “inference version” of the problem computing the 11-way marginals of a product distribution.

We can now formally define the problem of inferring the marginals pp as follows.

Our lower bound can thus be stated as follows,

The proof has the same general structure that we used to prove Theorem 3.5. Here, we describe additional observations about the structure of the fingerprinting codes used in that proof (see Section 6 for a description of Tardos’ fingerprinting code) that allow it to carry over to the inference version of computing 11-way marginals.

First, in Tardos’ (non-robust) fingerprinting code, the codebook DD is chosen by first sampling marginals pdp\in^{d} from an appropriate distribution and then sampling DD from Dpn.\mathcal{D}_{p}^{\otimes n}. The robust fingerprinting codes we construct in Section 6 also have this property.To generate a codebook DD^{\prime} for our robust fingerprinting code, we sample a codebook DD from Tardos’ fingerprinting code and then insert additional columns of all 11’s or all ’s to DD in random locations. Equivalently, we can obtain a codebook DD^{\prime} by appending 11’s and ’s in random locations of pp to obtain a vector pp^{\prime} and then sampling DD^{\prime} from Dpn.\mathcal{D}_{p^{\prime}}^{\otimes n}. Thus the instances used to prove Theorem 3.5 indeed consist of independent samples from a product distribution, which is what the inference problem assumes.

Next, recall that the proof of Theorem 3.5 shows that any string that is (α,β)(\alpha,\beta)-accurate for the 11-way marginals of DD can be traced successfully. It is moreover the case that any string that is (α,β)(\alpha,\beta)-accurate for the marginals pp can also be traced successfully. This is because the rows of DD are sampled independently from Dp\mathcal{D}_{p}, so accuracy for the 11-way marginals of DD and accuracy for pp coincide with high probability, at least when n=ω(logd)n=\omega(\log d):

Let pdp\in^{d} and let D\mboxRDpnD\leftarrow_{\mbox{\tiny R}}\mathcal{D}_{p}^{\otimes n}. Let ada\in^{d} denote the exact 11-way marginals of DD. Then for every α,η>0\alpha,\eta>0, and n=Ω(log(d/η)/α2)n=\Omega(\log(d/\eta)/\alpha^{2}), we have apα\|a-p\|_{\infty}\leq\alpha with probability at least 1η1-\eta over the choice of DD.

We remark that Steinke and Ullman [SU15a] showed that accuracy with respect to the marginals pp actually suffices to trace regardless of the value of nn.

These two observations suffice to show that, when nn is too small, a differentially private algorithm cannot be accurate for pp with high probability over the choices of both pp and DD. Thus, for every differentially private algorithm, there exists some pp such that the algorithm is not accurate with high probability over the choice of DD, which means that the algorithm does not accurately infer the marginals of an arbitrary product distribution. ∎

3 Fingerprinting Codes for General Query Families

In this section, we generalize the connection between fingerprinting codes and sample complexity lower bounds for arbitrary sets of queries. We show that a generalized fingerprinting code with respect to any family of counting queries Q\mathcal{Q} yields a sample complexity lower bound for Q\mathcal{Q}, which is analogous to our lower bound for 11-way marginals (Theorem 3.5). We then argue that some type of fingerprinting code is necessary to prove any sample complexity lower bound by exhibiting a tight connection between such lower bounds and a weak variant of our generalized fingerprinting codes.

We begin by defining our generalization of fingerprinting codes. Fix a finite data universe X\mathcal{X} and a set of counting queries Q\mathcal{Q} over X\mathcal{X}. A generalized fingerprinting code with respect to the family Q\mathcal{Q} consists of a pair of randomized algorithms (Gen,Trace)(\mathit{Gen},\mathit{Trace}). The code generation algorithm Gen\mathit{Gen} produces a codebook CXnC\in\mathcal{X}^{n}. Each row cic_{i} of CC is the codeword corresponding to user ii. A coalition S[n]S\subseteq[n] of pirates receives the subset CS={ci:iS}C_{S}=\{c_{i}:i\in S\} of codewords, and produces an answer vector aQa\in^{|\mathcal{Q}|}. We replace the traditional marking condition on the pirates with the generalized constraint that they output a feasible answer vector. A natural way to define feasibility for answer vectors is to require a condition similar to (α,β)(\alpha,\beta)-accuracy, i.e. an answer vector aa is feasible if aqq(CS)α|a_{q}-q(C_{S})|\leq\alpha for all but a β\beta fraction of queries qQq\in\mathcal{Q}. We thus define a generalized set of feasible answer vectors by

When α=11/n\alpha=1-1/n, the generalized set of feasible answer vectors captures the traditional marking assumption by rounding each entry of a feasible answer vector to or 11.An equivalent way to view a codebook is as a set of nn codewords C({0,1}Q)nC\in(\{0,1\}^{|\mathcal{Q}|})^{n}, where each user’s codeword is ci=(q(x))qQc_{i}=(q(x))_{q\in\mathcal{Q}} for some xXx\in\mathcal{X}. Notice that the case where Q\mathcal{Q} is the class of 11-way marginals places no constraints on the structure of a codeword, i.e. a codeword can be any binary string. With this viewpoint, the goal of the pirates is to output an answer vector aQa\in^{|\mathcal{Q}|} with aq1SiS(ci)qα|a_{q}-\frac{1}{|S|}\sum_{i\in S}(c_{i})_{q}|\leq\alpha for all but a β\beta fraction of the queries qQq\in\mathcal{Q}.

A pair of algorithms (Gen,Trace)(\mathit{Gen},\mathit{Trace}) is an (n,Q)(n,\mathcal{Q})-fingerprinting code for (α,β)(\alpha,\beta)-accuracy with security (γ,ξ)(\gamma,\xi) if Gen\mathit{Gen} outputs a codebook CXnC\in\mathcal{X}^{n} and for every (possibly randomized) adversary AFP\mathcal{A}_{\mathit{FP}}, and every coalition S[n]S\subseteq[n] with Sn1|S|\geq n-1, if we set a\mboxRAFP(CS)a\leftarrow_{\mbox{\tiny R}}\mathcal{A}_{\mathit{FP}}(C_{S}), then

where the probability is taken over the coins of Gen,Trace\mathit{Gen},\mathit{Trace}, and AFP\mathcal{A}_{\mathit{FP}}. The algorithms Gen\mathit{Gen} and Trace\mathit{Trace} may share a common state.

The security properties of Definition 3.12 differ from those of an ordinary fingerprinting code in two ways so as to enable a clean statement of a composition theorem for generalized fingerprinting codes (Theorem 4.6). First, we use two separate security parameters γ,ξ\gamma,\xi for the different types of tracing errors, as in the definition of re-identifiable distributions. Second, security only needs to hold for coalitions of size n1n-1 or nn. However, this condition implies security for coalitions of arbitrary size with an increased false accusation probability of nξn\xi.

As in Theorem 3.5, the existence of a generalized (n,Q)(n,\mathcal{Q})-fingerprinting code implies a sample complexity lower bound of nn for privately releasing answers to Q\mathcal{Q}, with essentially the same proof.

In particular, if γ1/3\gamma\leq 1/3 and ξ=o(1/n)\xi=o(1/n), then there is no algorithm A:XnQ\mathcal{A}:\mathcal{X}^{n}\to^{|\mathcal{Q}|} that is (O(1),o(1/n))(O(1),o(1/n))-differentially private and (α,β)(\alpha,\beta)-accurate for Q\mathcal{Q}.

We now turn to investigate whether a converse to Theorem 3.13 holds. We show that a sample complexity lower bound for a family of queries Q\mathcal{Q} is essentially equivalent to the existence of a weak type of fingerprinting code, where the tracing procedure depends on the family Q\mathcal{Q} and the tracing error probabilities satisfy certain affine constraint. It remains an interesting open question to determine the precise relationship between privacy lower bounds and our notion of generalized fingerprinting codes.

A pair of algorithms (Gen,Trace)(\mathit{Gen},\mathit{Trace}) is an (n,Q)(n,\mathcal{Q})-weak fingerprinting code for (α,β)(\alpha,\beta)-accuracy with security (ε,δ)(\varepsilon,\delta) if Gen\mathit{Gen} outputs a codebook CXnC\in\mathcal{X}^{n} and for every (possibly randomized) adversary AFP\mathcal{A}_{\mathit{FP}} that outputs a feasible answer vector with probability 2/32/3, and every coalition S[n]S\subseteq[n] with Sn1|S|\geq n-1, if we set a\mboxRAFP(CS)a\leftarrow_{\mbox{\tiny R}}\mathcal{A}_{\mathit{FP}}(C_{S}), then

where the probabilities are taken over the coins of Gen\mathit{Gen}, Trace\mathit{Trace}, and AFP\mathcal{A}_{\mathit{FP}}. The algorithms Gen\mathit{Gen} and Trace\mathit{Trace} may share a common state.

That is, we require the false accusation probability Pr[Trace(C,a)[n]S]\Pr[\mathit{Trace}(C,a)\in[n]\setminus S] to be much smaller than the total probability of accusing any user. Note that a tracing algorithm that accuses a random user with probability pp will falsely accuse a user with probability p/np/n when S=n1|S|=n-1; however, this does not satisfy Definition 3.14 because we require the gap between the two probabilities to be at least a factor of eεne^{\varepsilon}n.

Observe that taking ξ<(1δ)/2eεn\xi<(1-\delta)/2e^{\varepsilon}n in Definition 3.12 yields an (n,Q)(n,\mathcal{Q})-weak fingerprinting code with security (ε,δ)(\varepsilon,\delta). However, Definition 3.14 is weaker than Definition 3.12 in a few important ways. First, security only holds against pirates with a failure probability of at most 1/31/3. Second, while Definition 3.12 requires completeness error Pr[Trace(C,a)=]<ξ\Pr[\mathit{Trace}(C,a)=\bot]<\xi, a weak fingerprinting code allows Pr[Trace(C,a)=]=1o(1)\Pr[\mathit{Trace}(C,a)=\bot]=1-o(1) as long as Pr[Trace(C,a)[n]S]\Pr[\mathit{Trace}(C,a)\in[n]\setminus S] is sufficiently small.

The following theorem shows that the existence of an (n,Q)(n,\mathcal{Q})-weak fingerprinting code is essentially equivalent to a sample complexity lower bound of nn against Q\mathcal{Q}.

Then there exists an ii^{*} such that Pr[Trace(C,AFP(C))=i]p/n\Pr[\mathit{Trace}(C,\mathcal{A}_{\mathit{FP}}(C))=i^{*}]\geq p/n. By differential privacy,

On the other hand, by the security of the weak fingerprinting code and differential privacy,

This yields a contradiction whenever εε/2\varepsilon^{\prime}\leq\varepsilon/2 and δδ/(1+eε/2n)\delta^{\prime}\leq\delta/(1+e^{\varepsilon/2}n).

We now show the converse direction, i.e. that the high sample complexity of (Q,X)(\mathcal{Q},\mathcal{X}) implies the existence of a weak fingerprinting code. We begin with a technical lemma which shows that the high sample complexity of Q\mathcal{Q} also rules out mechanisms that satisfy only a one-sided constraint on the probability of any event under the replacement of one row:

Let ε1/2\varepsilon\leq 1/2. Let A\mathcal{A} be an (α,β)(\alpha,\beta)-accurate algorithm for Q\mathcal{Q} on databases DXmD\in\mathcal{X}^{m}. Suppose we have that for all databases DXmD\in\mathcal{X}^{m}, all i[m]i\in[m], and all measurable TRange(A)T\subseteq\text{Range}(\mathcal{A}) that

Let d=VC(Q)d=\mathit{VC}(\mathcal{Q}) be the VC-dimension of Q\mathcal{Q} and let

Then there exists a (6ε,(e2ε+e5ε)δ)(6\varepsilon,(e^{2\varepsilon}+e^{5\varepsilon})\delta)-differentially private algorithm B\mathcal{B} on databases of size n=m/εn=\lceil m/\varepsilon\rceil that gives (α+α,β)(\alpha+\alpha^{\prime},\beta)-accurate answers to Q\mathcal{Q} on any database DXnD^{\prime}\in\mathcal{X}^{n} with probability at least 1/21/2 .

On input a database DXnD^{\prime}\in\mathcal{X}^{n}, consider the algorithm B\mathcal{B}^{\prime} that samples a random subset DD consisting of mm rows from DD^{\prime} (without replacement) and returns A(D)\mathcal{A}(D). Then by our hypothesis on A\mathcal{A}, for every i[n]i\in[n] and every measurable TRange(B)=Range(A)T\subseteq\text{Range}(\mathcal{B})=\text{Range}(\mathcal{A}) we have

On the other hand, a “secrecy-of-the-sample” argument [KLN+11] enables us to obtain the reverse inequality. For a row k[n]k\in[n], consider the following two experiments:

Experiment 1: Sample a random subset DD of mm rows from DkD^{\prime}_{-k}.

Experiment 2: Sample j\mboxR[n]j\leftarrow_{\mbox{\tiny R}}[n], and then sample a random subset DD of mm rows from DjD^{\prime}_{-j}.

Any database DD sampleable under Experiment 1 appears with probability 1/(nm)1/{n\choose m}, but appears with probability at least

Combining the two inequalities shows that for every database DXnD^{\prime}\in\mathcal{X}^{n} and every i,k[n]i,k\in[n],

By Lemma 2.2, the algorithm B(D1,,Dn1)=B(D1,,Dn1,)\mathcal{B}(D^{\prime}_{1},\ldots,D^{\prime}_{n-1})=\mathcal{B}^{\prime}(D^{\prime}_{1},\ldots,D^{\prime}_{n-1},\bot) is (6ε,(e2ε+e5ε)δ)(6\varepsilon,(e^{2\varepsilon}+e^{5\varepsilon})\delta)-differentially private.

Finally, uniform convergence of the sampling error of B\mathcal{B}^{\prime} implies that it remains an accurate algorithm, and hence so is B\mathcal{B}. In particular, when DD is a random sample of mm rows from DD^{\prime} and dd is the VC-dimension of Q\mathcal{Q}, we have [AB09]:

Taking α\alpha^{\prime} as in the theorem statement makes the total failure probability of B\mathcal{B} at most 1/21/2. ∎

Now we proceed to complete the proof of Theorem 3.15. Suppose (Q,X)(\mathcal{Q},\mathcal{X}) has sample complexity greater than nn for (α+α,β)(\alpha+\alpha^{\prime},\beta)-accuracy (with failure probability 1/21/2) and (6ε,(e2ε+e5ε)δ)(6\varepsilon,(e^{2\varepsilon}+e^{5\varepsilon})\delta)-differential privacy. By Lemma 3.16, for every (α,β)(\alpha,\beta)-accurate mechanism A\mathcal{A} for Q\mathcal{Q} there exists a database DXmD\in\mathcal{X}^{m} with m=nεm=\lfloor n\varepsilon\rfloor, a set TT, and an index ii such that

We now argue that it is without loss of generality to restrict our attention to mechanisms A\mathcal{A} whose range is the finite set ImQ={0,12m,1m,,112m,1}QI_{m}^{|\mathcal{Q}|}=\{0,\frac{1}{2m},\frac{1}{m},\ldots,1-\frac{1}{2m},1\}^{|\mathcal{Q}|}. To see this, note that the exact answer to any counting query qq on a database DXmD\in\mathcal{X}^{m} is in the set {0,1m,2m,,11m,1}\{0,\frac{1}{m},\frac{2}{m},\ldots,1-\frac{1}{m},1\}. Therefore, if an answer aa\in satisfies aq(D)α|a-q(D)|\leq\alpha, then the value

is a point in ImI_{m} that also satisfies aˉq(D)α|\bar{a}-q(D)|\leq\alpha. Thus, we will henceforth assume that the mechanism’s output lies in this finite range.

We now apply the min-max theorem from game theory (or equivalently, linear programming duality), to exhibit a fixed distribution on (D,T,i)(D,T,i) for which Inequality (3) holds. Specifically, consider a two-player zero-sum game in which Player 1 chooses a triple (D,T,i)(D,T,i), where DXmD\in\mathcal{X}^{m}, TImQT\subseteq I_{m}^{|\mathcal{Q}|}, and i[m]i\in[m], and Player 2 chooses a randomized function A:XmImQ\mathcal{A}:\mathcal{X}^{m}\to I_{m}^{|\mathcal{Q}|} that is (α,β)(\alpha,\beta)-accurate for Q\mathcal{Q}. Let the payoff to Player 1 be

By inequality (3), the value of this game is greater than δ\delta. So by the min-max theorem there exists a mixed strategy for Player 1 that achieves a payoff greater than δ\delta against any mixed strategy for Player 2. (Note that we can apply the min-max theorem because we have assumed that the mechanism’s output lies in a finite range.) That is, there exists a distribution D\mathcal{D} over triples (D,T,i)(D,T,i) such that for any randomized algorithm A:XmImQ\mathcal{A}:\mathcal{X}^{m}\to I_{m}^{|\mathcal{Q}|} that takes any DD to a feasible vector in Fα,β(D)F_{\alpha,\beta}(D) with probability at least 2/32/3,

Now consider the following code: Gen\mathit{Gen} samples a database DD, a set TT, and an index ii according to the promised distribution D\mathcal{D}. The codebook CC is (Dπ(1),,Dπ(m))(D_{\pi(1)},\ldots,D_{\pi(m)}) where π:[m][m]\pi:[m]\to[m] is a random permutation. On input an answer vector aa, the algorithm Trace\mathit{Trace} checks whether aTa\in T. If it is, then Trace\mathit{Trace} outputs π(i)\pi(i), and otherwise outputs \bot.

To analyze the security of this code, fix a coalition SS of m1m-1 users using a pirate strategy AFP\mathcal{A}_{\mathit{FP}}. Because the codebook is a random permutation of the rows of DD, it is equivalent to analyze the original database DD and a random coalition of m1m-1 users. Thus the part of the codebook CSC_{S} given to the pirates is a random set of m1m-1 rows from DD, i.e. DjD_{-j} for a random j[m]j\in[m] with the junk row at index jj removed. The condition that AFP\mathcal{A}_{\mathit{FP}} outputs a feasible answer vector is equivalent to a=AFP(CS)a=\mathcal{A}_{\mathit{FP}}(C_{S}) being an (α,β)(\alpha,\beta)-accurate answer vector. Therefore, letting A:XmImQ\mathcal{A}:\mathcal{X}^{m}\to I_{m}^{|\mathcal{Q}|} be the algorithm that runs AFP\mathcal{A}_{\mathit{FP}} on its input with the junk row removed, we have

On the other hand, the probability that Trace\mathit{Trace} outputs the user jj not in the coalition is

because the events {j=i}\{j=i\} and {Trace(C,a)=i}\{\mathit{Trace}(C,a)=i\} are independent. Thus by (4),

where both probabilities are taken over the coins of Gen,Trace\mathit{Gen},\mathit{Trace}, and AFP\mathcal{A}_{\mathit{FP}}. ∎

A Composition Theorem for Sample Complexity

In this section we state and prove a composition theorem for sample complexity lower bounds. At a high-level the composition theorem starts with two pairs, (Q,X)(\mathcal{Q},\mathcal{X}) and (Q,X)(\mathcal{Q}^{\prime},\mathcal{X}^{\prime}), for which we know sample-complexity lower bounds of nn and nn^{\prime} respectively, and attempts to prove a sample-complexity lower bound of nnn\cdot n^{\prime} for a related family of queries on a related data universe.

Specifically, our sample-complexity lower bound will apply to the “product” of Q\mathcal{Q} and Q\mathcal{Q}^{\prime}, defined on X×X\mathcal{X}\times\mathcal{X}^{\prime}. We define the product QQ\mathcal{Q}\land\mathcal{Q}^{\prime} to be

Since q,qq,q^{\prime} are boolean-valued, their conjunction can also be written q(x)q(x)q(x)q^{\prime}(x^{\prime}).

We now begin to describe how we can prove a sample complexity lower bound for QQ\mathcal{Q}\land\mathcal{Q}^{\prime}. First, we describe a certain product operation on databases. Let DXnD\in\mathcal{X}^{n}, D=(x1,,xn)D=(x_{1},\ldots,x_{n}), be a database. Let D1,,Dn(X)nD^{\prime}_{1},\ldots,D^{\prime}_{n}\in(\mathcal{X}^{\prime})^{n^{\prime}} where Di=(xi1,,xin)D^{\prime}_{i}=(x^{\prime}_{i1},\ldots,x^{\prime}_{in^{\prime}}) be nn databases. We define the product database D=D×(D1,,Dn)(X×X)nnD^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n})\in(\mathcal{X}\times\mathcal{X}^{\prime})^{n\cdot n^{\prime}} as follows: For every i=1,,n,j=1,,ni=1,\ldots,n,j=1,\ldots,n^{\prime}, let the (i,j)(i,j)-th row of DD^{*} be x(i,j)=(xi,xij)x^{*}_{(i,j)}=(x_{i},x^{\prime}_{ij}). Note that we index the rows of DD^{*} by (i,j)(i,j). We will sometimes refer to D1,,DnD^{\prime}_{1},\ldots,D^{\prime}_{n} as the “subdatabases” of DD^{*}.

The key property of these databases is that we can use a query qqQQq\land q^{\prime}\in\mathcal{Q}\land\mathcal{Q}^{\prime} to compute a “subset-sum” of the vector sq=(q(D1),,q(Dn))s_{q^{\prime}}=(q^{\prime}(D^{\prime}_{1}),\ldots,q^{\prime}(D^{\prime}_{n})) consisting of the answers to qq^{\prime} on each of the nn subdatabases. That is, for every qQq\in\mathcal{Q} and qQq^{\prime}\in\mathcal{Q}^{\prime},

Thus, every approximate answer aqqa_{q\land q^{\prime}} to a query qqq\land q^{\prime} places a subset-sum constraint on the vector sqs_{q^{\prime}}. (Namely, aqq1ni=1nq(xi)q(Di)a_{q\land q^{\prime}}\approx\frac{1}{n}\sum_{i=1}^{n}q(x_{i})q^{\prime}(D^{\prime}_{i})) If the database DD and family Q\mathcal{Q} are chosen appropriately, and the answers are sufficiently accurate, then we will be able to reconstruct a good approximation to sqs_{q^{\prime}}. Indeed, this sort of “reconstruction attack” is the core of many lower bounds for differential privacy, starting with the work of Dinur and Nissim [DN03]. The setting they consider is essentially the special case of what we have just described where D1,,DnD^{\prime}_{1},\ldots,D^{\prime}_{n} are each just a single bit (X={0,1}\mathcal{X}^{\prime}=\{0,1\}, and Q\mathcal{Q}^{\prime} contains only the identity query). In Section 5 we will discuss choices of DD and Q\mathcal{Q} that allow for this reconstruction.

We now state the formal notion of reconstruction attack that we want DD and Q\mathcal{Q} to satisfy.

for at least a 1β1-\beta fraction of queries qQq\in\mathcal{Q}, BD(a)\mathcal{B}_{D}(a) outputs a vector tnt\in^{n} such that

Then we say that DXnD\in\mathcal{X}^{n} enables an α\alpha^{\prime}-reconstruction attack from (α,β)(\alpha,\beta)-accurate answers to Q\mathcal{Q}.

A reconstruction attack itself implies a sample-complexity lower bound, as in [DN03]. However, we show how to obtain stronger sample complexity lower bounds from the reconstruction attack by applying it to a product database DD^{*} to obtain accurate answers to queries on its subdatabases. For each query qQq^{\prime}\in\mathcal{Q}^{\prime}, we run the adversary promised by the reconstruction attack on the approximate answers given to queries of the form (qq)Q{q}(q\land q^{\prime})\in\mathcal{Q}\land\{q^{\prime}\}. As discussed above, answers to these queries will approximate subset sums of the vector sq=(q(D1),,q(Dn))s_{q^{\prime}}=(q^{\prime}(D^{\prime}_{1}),\ldots,q^{\prime}(D^{\prime}_{n})). When the reconstruction attack is given these approximate answers, it returns a vector tq=(tq,1,,tq,n)t_{q^{\prime}}=(t_{q^{\prime},1},\ldots,t_{q^{\prime},n}) such that tq,isq,i=q(Di)t_{q^{\prime},i}\approx s_{q^{\prime},i}=q^{\prime}(D^{\prime}_{i}) on average over ii. Running the reconstruction attack for every query qq^{\prime} gives us a collection t=(tq,i)qQ,i[n]t=(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime},i\in[n]} where tq,iq(Di)t_{q^{\prime},i}\approx q^{\prime}(D^{\prime}_{i}) on average over both qq^{\prime} and ii. By an application of Markov’s inequality, for most of the subdatabases DiD^{\prime}_{i}, we have that tq,iq(Di)t_{q^{\prime},i}\approx q^{\prime}(D^{\prime}_{i}) on average over the choice of qQq^{\prime}\in\mathcal{Q}^{\prime}. For each ii such that this guarantee holds, another application of Markov’s inequality shows that for most queries qQq^{\prime}\in\mathcal{Q}^{\prime} we have tq,iq(Di)t_{q^{\prime},i}\approx q^{\prime}(D^{\prime}_{i}), which is our definition of (α,β)(\alpha,\beta)-accuracy (later enabling us to apply a re-identification adversary for Q\mathcal{Q}^{\prime}).

The algorithm we have described for obtaining accurate answers on the subdatabases is formalized in Figure 1.

We are now in a position to state the main lemma that enables our composition technique. The lemma says that if we are given accurate answers to QQ\mathcal{Q}\land\mathcal{Q}^{\prime} on DD^{*} and the database DXnD\in\mathcal{X}^{n} enables a reconstruction attack from accurate answers to Q\mathcal{Q}, then we can obtain accurate answers to Q\mathcal{Q}^{\prime} on most of the subdatabases D1,,Dn(X)nD^{\prime}_{1},\ldots,D^{\prime}_{n}\in(\mathcal{X}^{\prime})^{n^{\prime}}.

The additional bookkeeping in the proof is to handle the case where aa is only accurate for most queries. In this case the reconstruction attack may fail completely for certain queries qQq^{\prime}\in\mathcal{Q}^{\prime} and we need to account for this additional source of error.

Assume the answer vector a=(aqq)qQ,qQa=(a_{q\land q^{\prime}})_{q\in\mathcal{Q},q^{\prime}\in\mathcal{Q}^{\prime}} is (α,β)(\alpha,\beta)-accurate for QQ\mathcal{Q}\land\mathcal{Q}^{\prime} on D=D×(D1,,Dn)D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n}). By assumption, DD enables a reconstruction attack BD\mathcal{B}_{D} that succeeds in reconstructing an approximation to sq=(q(D1),,q(Dn))s_{q^{\prime}}=(q^{\prime}(D^{\prime}_{1}),\ldots,q^{\prime}(D^{\prime}_{n})) when given (α,cβ)(\alpha,c\beta)-accurate answers for the family of queries Q{q}\mathcal{Q}\land\{q^{\prime}\}. Consider the set of qq^{\prime} on which the reconstruction attack succeeds, i.e.

Since aa is (α,β)(\alpha,\beta)-accurate, an application of Markov’s inequality shows that

Thus, Qgood(11/c)Q|\mathcal{Q}^{\prime}_{\mathit{good}}|\geq(1-1/c)|\mathcal{Q}^{\prime}|.

Recall that, by (5), we can interpret answers to QQ\mathcal{Q}\land\mathcal{Q}^{\prime} as subset sums of answers to the subdatabases, so for every qQgoodq^{\prime}\in\mathcal{Q}^{\prime}_{\mathit{good}},

for at least a 1cβ1-c\beta fraction of queries qqQ{q}q\land q^{\prime}\in\mathcal{Q}\land\{q^{\prime}\}. Since DD enables a reconstruction attack from (α,cβ)(\alpha,c\beta)-accurate answers to Q\mathcal{Q}, by Definition 4.1, BD((aqq)qQ)\mathcal{B}_{D}((a_{q\land q^{\prime}})_{q\in\mathcal{Q}}) recovers a vector tqnt_{q^{\prime}}\in^{n} such that

Since this holds for every qQgoodq^{\prime}\in\mathcal{Q}^{\prime}_{\mathit{good}}, we have

The statement inside the final probability is precisely that (tq,i)qQ(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime}} is (6cα,2/c)(6c\alpha^{\prime},2/c)-accurate for Q\mathcal{Q}^{\prime} on DiD^{\prime}_{i}. This completes the proof of the lemma. ∎

We now explain how the main lemma allows us to prove a composition theorem for sample complexity lower bounds. We start with a query family Q\mathcal{Q} on a database DXnD\in\mathcal{X}^{n} that enables a reconstruction attack, and a distribution D\mathcal{D}^{\prime} over databases in (X)n(\mathcal{X}^{\prime})^{n^{\prime}} that is re-identifiable from answers to a family Q\mathcal{Q}^{\prime}. We show how to combine these objects to form a re-identifiable distribution D\mathcal{D}^{*} for queries QQ\mathcal{Q}\land\mathcal{Q}^{\prime} over (X×X)nn(\mathcal{X}\times\mathcal{X}^{\prime})^{n\cdot n^{\prime}}, yielding a sample complexity lower bound of nnn\cdot n^{\prime}.

A sample from D\mathcal{D}^{*} consists of D=D×(D1,,Dn)D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n}) where each subdatabase DiD^{\prime}_{i} is an independent sample from from D\mathcal{D}^{\prime}. The main lemma above shows that if there is an algorithm A\mathcal{A} that is accurate for QQ\mathcal{Q}\land\mathcal{Q}^{\prime} on DD^{*}, then an adversary can reconstruct accurate answers to Q\mathcal{Q}^{\prime} on most of the subdatabases D1,,DnD^{\prime}_{1},\ldots,D^{\prime}_{n}. Since these subdatabases are drawn from a re-identifiable distribution, the adversary can the re-identify a member of one of the subdatabases DiD^{\prime}_{i}. Since the identified member of DiD^{\prime}_{i} is also a member of DD^{*}, we will have a re-identification attack against DD^{*} as well.

We are now ready to formalize our composition theorem.

Let Q\mathcal{Q} be a family of counting queries on X\mathcal{X}, and let Q\mathcal{Q}^{\prime} be a family of counting queries on X\mathcal{X}^{\prime}. Let γ,ξ,α,α,β\gamma,\xi,\alpha^{\prime},\alpha,\beta\in be parameters. Assume that for some parameters c>1c>1, γ,ξ,α,α,β\gamma,\xi,\alpha^{\prime},\alpha,\beta\in, the following both hold:

There exists a database DXnD\in\mathcal{X}^{n} that enables an α\alpha^{\prime}-reconstruction attack from (α,cβ)(\alpha,c\beta)-accurate answers to Q\mathcal{Q}.

There is a distribution D\mathcal{D}^{\prime} on databases D(X)nD\in(\mathcal{X}^{\prime})^{n^{\prime}} that is (γ,ξ)(\gamma,\xi)-re-identifiable from (6cα,2/c)(6c\alpha^{\prime},2/c)-accurate answers to Q\mathcal{Q}^{\prime}.

Then there is a distribution on databases D(X×X)nnD^{*}\in(\mathcal{X}\times\mathcal{X}^{\prime})^{n\cdot n^{\prime}} that is (γ+1/6,ξ)(\gamma+1/6,\xi)-re-identifiable from (α,β)(\alpha,\beta)-accurate answers to QQ\mathcal{Q}\land\mathcal{Q}^{\prime}.

Assume that A(D)\mathcal{A}(D^{*}) is (α,β)(\alpha,\beta)-accurate for QQ\mathcal{Q}\land\mathcal{Q}^{\prime}. By Lemma 4.2, we have

where the last inequality is by (6). Thus, it suffices to prove that

We prove this inequality by giving a reduction to the re-identifiability of D\mathcal{D}^{\prime}. Consider the following sanitizer A\mathcal{A}^{\prime}: On input D\mboxRDD^{\prime}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{\prime}, A\mathcal{A}^{\prime} first chooses a random index i\mboxR[n]i^{*}\leftarrow_{\mbox{\tiny R}}[n]. Next, it samples D1,,Di1,Di+1,,Dn\mboxRDD^{\prime}_{1},\ldots,D^{\prime}_{i^{*}-1},D^{\prime}_{i^{*}+1},\ldots,D^{\prime}_{n}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{\prime} independently, and sets Di=DD^{\prime}_{i^{*}}=D^{\prime}. Finally, it runs A\mathcal{A} on D=D×(D1,,Dn)D^{*}=D\times(D^{\prime}_{1},\ldots,D^{\prime}_{n}) and then runs the reconstruction attack R\mathcal{R}^{*} to recover answers (tq,i)qQ,i[n](t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime},i\in[n]} and outputs (tq,i)qQ(t_{q^{\prime},i^{*}})_{q^{\prime}\in\mathcal{Q}^{\prime}}.

Notice that since D1,,DnD^{\prime}_{1},\ldots,D^{\prime}_{n} are all i.i.d. samples from D\mathcal{D}^{\prime}, their joint distribution is independent of the choice of ii^{*}. Specifically, in the view of B\mathcal{B}^{*}, we could have chosen ii^{*} after seeing its output on DD^{*}. Therefore, the following random variables are identically distributed:

(tq,i)qQ(t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime}}, where (tq,i)qQ,i[n](t_{q^{\prime},i})_{q^{\prime}\in\mathcal{Q}^{\prime},i\in[n]} is the output of RD(A(D))\mathcal{R}_{D}^{*}(\mathcal{A}(D^{*})) on D\mboxRDD^{*}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{*}, and i\mboxR[n]i\leftarrow_{\mbox{\tiny R}}[n].

A(D)\mathcal{A}^{\prime}(D^{\prime}) where D\mboxRDD^{\prime}\leftarrow_{\mbox{\tiny R}}\mathcal{D}^{\prime}.

where the last inequality follows because D\mathcal{D}^{\prime} is a (γ,ξ)(\gamma,\xi)-re-identifiable from (6cα,2/c)(6c\alpha^{\prime},2/c)-accurate answers to Q\mathcal{Q}^{\prime}. Thus we have established (8). Combining (7) and (8) completes the proof of the claim.

The next claim follows directly from the definition of B\mathcal{B}^{*} and the fact that D\mathcal{D}^{\prime} is (γ,ξ)(\gamma,\xi)-re-identifiable.

For every (i,j)[n]×[n](i,j)\in[n]\times[n^{\prime}],

Combining Claims 4.4 and 4.5 suffices to prove that D\mathcal{D}^{*} is (γ+1/6,ξ)(\gamma+1/6,\xi)-re-identifiable from (α,β)(\alpha,\beta)-accurate answers to QQ\mathcal{Q}\land\mathcal{Q}^{\prime}, completing the proof of the theorem. ∎

The proof of Theorem 4.3 also yields a composition theorem for generalized fingerprinting codes. Specifically, Theorem 4.6 below shows how to combine a reconstruction attack for a query family Q\mathcal{Q} on a database DXnD\in\mathcal{X}^{n} with a (n,Q)(n^{\prime},\mathcal{Q}^{\prime})-generalized fingerprinting code to obtain a (nn,QQ)(n\cdot n^{\prime},\mathcal{Q}\land\mathcal{Q}^{\prime})-generalized fingerprinting code.

Let Q\mathcal{Q} be a family of counting queries on X\mathcal{X}, and let Q\mathcal{Q}^{\prime} be a family of counting queries on X\mathcal{X}^{\prime}. Let γ,ξ,α,α,β\gamma,\xi,\alpha^{\prime},\alpha,\beta\in be parameters. Assume that for some parameters c>1c>1, γ,ξ,α,α,β\gamma,\xi,\alpha^{\prime},\alpha,\beta\in, the following both hold:

There exists a database DXnD\in\mathcal{X}^{n} that enables an α\alpha^{\prime}-reconstruction attack from (α,cβ)(\alpha,c\beta)-accurate answers to Q\mathcal{Q}.

There exists a (n,Q)(n^{\prime},\mathcal{Q}^{\prime})-generalized fingerprinting code for (6cα,2/c)(6c\alpha^{\prime},2/c)-accuracy with security (γ,ξ)(\gamma,\xi).

Then there is a (nn,QQ)(n\cdot n^{\prime},\mathcal{Q}\land\mathcal{Q}^{\prime})-generalized fingerprinting code for (α,β)(\alpha,\beta)-accuracy with security (γ+1/6,ξ)(\gamma+1/6,\xi).

Applications of the Composition Theorem

In this section we show how to use our composition theorem (Section 4) to combine our new lower bounds for 11-way marginal queries from Section 3 with (variants of) known lower bounds from the literature to obtain our main results. In Section 5.1 we prove a lower bound for kk-way marginal queries when α\alpha is not too small (at least inverse polynomial in dd), thereby proving Theorem 1.2 in the introduction. Then in Section 5.2 we obtain a similar lower bound for arbitrary counting queries that allows α\alpha to take a wider range of parameters..

A known reconstruction-based lower bound of Ω(k)\Omega(k) for kk-way marginals.

A known reconstruction-based lower bound of Ω(1/α2)\Omega(1/\alpha^{2}) for kk-way marginals.

The lower bound of Ω(k)\Omega(k) for kk-way marginals is a special case of a lower bound of Ω(VC(Q))\Omega(\mathit{VC}(\mathcal{Q})) due to [Rot10] and based on [DN03], where VC(Q)\mathit{VC}(\mathcal{Q}) is the Vapnik-Chervonenkis (VC) dimension of Q\mathcal{Q}. The lower bound of Ω(1/α2)\Omega(1/\alpha^{2}) for kk-way marginals is due to [KRSU10, De12].

To apply our composition theorem, we need to formulate these reconstruction attack in the language of Definition 4.1. In particular, we observe that these reconstruction attacks readily generalize to allow us to reconstruct fractional vectors sns\in^{n}, instead of just boolean vectors as in [DN03, Rot10].

First we state and prove that the linear dependence on kk is necessary.

Let Q\mathcal{Q} be a collection of counting queries over a data universe X\mathcal{X}. We say a set {x1,,xk}X\{x_{1},\ldots,x_{k}\}\subseteq\mathcal{X} is shattered by Q\mathcal{Q} if for every string v{0,1}kv\in\{0,1\}^{k}, there exists a query qQq\in\mathcal{Q} such that (q(x1),,q(xk))=(v1,,vk)(q(x_{1}),\ldots,q(x_{k}))=(v_{1},\ldots,v_{k}). The VC-Dimension of Q\mathcal{Q} denoted VC(Q)\mathit{VC}(\mathcal{Q}) is the cardinality of the largest subset of X\mathcal{X} that is shattered by Q\mathcal{Q}.

For each i=1,,ki=1,\ldots,k, let xi=(1,1,,0,,1)x_{i}=(1,1,\ldots,0,\ldots,1) where the zero is at the ii-th index. We will show that {x1,,xk}\{x_{1},\ldots,x_{k}\} is shattered by Mk,d\mathcal{M}_{k,d}. For a string v{0,1}kv\in\{0,1\}^{k}, let the query qv(x)q_{v}(x) take the conjunction of the bits of xx at indices set to in vv. Then qv(xi)=1q_{v}(x_{i})=1 iff vi=1v_{i}=1, so (qv(x1),,qv(xk))=(v1,,vk)(q_{v}(x_{1}),\ldots,q_{v}(x_{k}))=(v_{1},\ldots,v_{k}). ∎

Let Q\mathcal{Q} be a collection of counting queries over a data universe X\mathcal{X} and let n=VC(Q)n=\mathit{VC}(\mathcal{Q}). Then there is a database DXnD\in\mathcal{X}^{n} which enables a 4α4\alpha-reconstruction attack from (α,0)(\alpha,0)-accurate answers to Q\mathcal{Q}.

Let {x1,,xn}\{x_{1},\ldots,x_{n}\} be shattered by Q\mathcal{Q}, and consider the database D=(x1,,xn)D=(x_{1},\ldots,x_{n}). Let sns\in^{n} be an arbitrary string to be reconstructed and let a=(aq)qQa=(a_{q})_{q\in\mathcal{Q}} be (α,0)(\alpha,0)-accurate answers. That is, for every qQq\in\mathcal{Q}

Consider the brute-force reconstruction attack B\mathcal{B} defined in Figure 4. Notice that, since aa is (α,0)(\alpha,0)-accurate, B\mathcal{B} always finds a suitable vector tt. Namely, the original database ss satisfies the constraints.

We will show that the reconstructed vector tt satisfies

Let TT be the set of coordinates on which ti>sit_{i}>s_{i} and let SS be the set of coordinates where si>tis_{i}>t_{i}. Note that

We will show that absolute values of the sums over TT and SS are each at most 2α2\alpha. Since {x1,,xn}\{x_{1},\ldots,x_{n}\} is shattered by Q\mathcal{Q}, there is a query qQq\in\mathcal{Q} such that q(xi)=1q(x_{i})=1 iff iTi\in T. Therefore, by the definitions of tt and (α,0)(\alpha,0)-accuracy,

so by the triangle inequality, 1niT(tisi)2α\frac{1}{n}\sum_{i\in T}(t_{i}-s_{i})\leq 2\alpha. An identical argument shows that 1niS(siti)2α\frac{1}{n}\sum_{i\in S}(s_{i}-t_{i})\leq 2\alpha, proving that tt is an accurate reconstruction. ∎

We can now state in our terminology the lower bound of De from [De12] (building on [KRSU10]) showing that the inverse-quadratic dependence on α\alpha is necessary.

Let kk be any constant, dkd\geq k be any integer, and let α1/d.499k\alpha\geq 1/d^{.499k} be a sufficiently small parameterThe constant .499.499 was chosen for simplicity, and can be replaced with any constant strictly smaller than .5.5. (i.e. bounded by an absolute constant). There exists a constant β=β(k)>0\beta=\beta(k)>0 such that for every α>0,\alpha^{\prime}>0, there exists a database D({0,1}d)nD\in(\{0,1\}^{d})^{n} with n=Ωα,k(1/α2)n=\Omega_{\alpha^{\prime},k}(1/\alpha^{2}) such that DD enables an α\alpha^{\prime}-reconstruction attack from (α,β)(\alpha,\beta)-accurate answers to the kk-way marginals Mk,d\mathcal{M}_{k,d}.

Although the above theorem is a simple extension of De’s lower bound, we sketch a proof for completeness, and refer the interested reader to [De12] for a more detailed analysis.

To prove that the reconstruction attack succeeds, we will show that there exists a database D=(x1,,xn){0,1}n×dD=(x_{1},\ldots,x_{n})\in\{0,1\}^{n\times d} such that for any sns\in^{n}, if aa satisfies

(i.e. aa has (α,β)(\alpha,\beta)-accurate answers) then BMk,d(D,a)\mathcal{B}_{\mathcal{M}_{k,d}}(D,a) returns a vector tt such that ts1αn.\|t-s\|_{1}\leq\alpha^{\prime}\cdot n. Henceforth we refer to such an aa simply as (α,β)(\alpha,\beta)-accurate for Mk,d\mathcal{M}_{k,d} on (D,s),(D,s), as a shorthand. The above guarantee must hold for suitable choices of n,β,n,\beta, and α\alpha^{\prime} to satisfy the theorem.

We will argue that the reconstruction succeeds in two steps. First, we show that reconstruction succeeds if DD is “nice.” Second, we show that there exists “nice” DD that has the dimensions promised by the theorem.

To explain what we mean by a “nice” database DD, for any D=(x1,,xn){0,1}n×dD=(x_{1},\ldots,x_{n})\in\{0,1\}^{n\times d} and family of queries Q\mathcal{Q} on {0,1}d\{0,1\}^{d}, we define the matrix M=MD,Q{0,1}n×Q,M=M_{D,\mathcal{Q}}\in\{0,1\}^{n\times|\mathcal{Q}|}, as M(i,q)=q(xi).M(i,q)=q(x_{i}).

A matrix M{0,1}n×mM\in\{0,1\}^{n\times m} is a δ\delta-Euclidean section if for every vector aa in the rowspace of MM we have ma2a1δma2.\sqrt{m}\cdot\|a\|_{2}\geq\|a\|_{1}\geq\delta\sqrt{m}\cdot\|a\|_{2}.

Let DD be a database and Q\mathcal{Q} be a set of queries such that MD,Q{0,1}n×QM_{D,\mathcal{Q}}\in\{0,1\}^{n\times|\mathcal{Q}|} is a δ\delta-Euclidean section and the least singular value of MD,QM_{D,\mathcal{Q}} is σ\sigma. Let sns\in^{n} be arbitrary. There exists β=β(δ)>0\beta=\beta(\delta)>0 such that if aa are (α,β)(\alpha,\beta)-accurate answers for Q\mathcal{Q} on (D,s)(D,s), and t=BQ(D,a)t=\mathcal{B}_{\mathcal{Q}}(D,a), then tt satisfies

for γ=O(αnQ/σ).\gamma=O(\alpha\sqrt{n|\mathcal{Q}|}/\sigma). The constant hidden in the O()O(\cdot) notation depends only on δ.\delta.

Thus, it suffices to find database DD such that the matrix MD,Mk,dM_{D,\mathcal{M}_{k,d}} is a Euclidean section (for some fixed constant δ>0\delta>0) and has no “small” singular values. A result of Rudelson [Rud12] (strengthening that of Kasiviswanathan et al. [KRSU10]) guarantees that such a database exists.

In particular, there exists a database D{0,1}n×dD\in\{0,1\}^{n\times d} such that the Hadamard product MM satisfies the two properties above.

Now fix any sns\in^{n} and let aMk,da\in^{|\mathcal{M}_{k,d}|} be (α,β)(\alpha,\beta)-accurate answers to Mk,d\mathcal{M}_{k,d} on (D,s)(D,s). Now, if we let t=BMk,d(D,a)t=\mathcal{B}_{\mathcal{M}_{k,d}}(D,a), by Lemma 5.6, provided that β\beta is smaller than some constant that depends only on δ\delta, which in turn depends only on kk, we will have st1γn\|s-t\|_{1}\leq\gamma\cdot n for

1.3 Putting Together the Lower Bound

Now we show how to combine the various attacks to prove Theorem 1.2 in the introduction. We obtain our lower bound by applying two rounds of composition. In the first round, we compose the reconstruction attack of Theorem 5.4 described above with the re-identifiable distribution for 11-way marginals. We then take the resulting re-identifiable distribution and apply a second round of composition using the reconstruction attack based on the VC-dimension of kk-way marginals.

We remark that it is necessary to apply the two rounds of composition in this order. In particular, we cannot prove Theorem 1.3 by composing first with the VC-dimension-based reconstruction attack. Our composition theorem requires a re-identifiable distribution from (α,β)(\alpha,\beta)-accurate answers for β>0\beta>0, whereas the reconstruction attack described in Lemma 5.3 requires (α,0)(\alpha,0)-accurate answers, and the reconstruction can fail if some queries have error much larger than α\alpha. The resulting re-identifiable distribution obtained from composing with this reconstruction attack will also require (α,0)(\alpha,0)-accurate answers, and thus cannot be composed further.

We can now formally state and prove our sample-complexity lower bound for kk-way marginals, thereby establishing Theorem 1.3 in the introduction.

such that there exists a distribution on nn-row databases D({0,1}d)nD\in(\{0,1\}^{d})^{n} that is (1/2,o(1/n))(1/2,o(1/n))-re-identifiable from (α,0)(\alpha,0)-accurate answers to the kk-way marginals Mk,d\mathcal{M}_{k,d}.

Applying Theorem 4.3 (with parameter c=150c=150), we obtain item 1’ below. We then bring in another reconstruction attack for the composition theorem.

Using the composition Theorem 4.6 in place of Theorem 4.3, we obtain a version of Theorem 5.8 in the language of generalized fingerprinting codes.

such that there exists a (n,Mk,d)(n,\mathcal{M}_{k,d})-generalized fingerprinting code with security (1/2,o(1/n))(1/2,o(1/n)) for (α,0)(\alpha,0)-accuracy.

1.4 A Tight Lower Bound for 2-Way Marginals

Theorem 5.8 does not give any non-trivial lower bound for 22-way marginals. Intuitively, the problem is that the proof uses two rounds of composition, and thus if we try to instantiate the proof for 22-way marginals, one of the three lower bounds being composed will have to be trivial (i.e. will be a lower bound for -way marginals). However, a simple modification of the proof yields a tight lower bound for 22-way marginals that holds even for (α,β)(\alpha,\beta)-accuracy.

such that there exists a distribution on nn-row databases D({0,1}d)nD\in(\{0,1\}^{d})^{n} that is (1/2,o(1/n))(1/2,o(1/n))-re-identifiable from (α,β)(\alpha,\beta)-accurate answers to the 22-way marginals M2,d\mathcal{M}_{2,d}.

Applying Theorem 4.3 (with parameter c=150c=150), we obtain the following: There exists a distribution on databases in ({0,1}d)ndnα(\{0,1\}^{d})^{n_{d}n_{\alpha}} that is (1/3,o(1/ndnα))(1/3,o(1/n_{d}n_{\alpha}))-re-identifiable from (α,4β)(\alpha,4\beta)-accurate answers to M1,d/2M1,d/2M2,d\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2}\subset\mathcal{M}_{2,d}.

To complete the theorem, note that M1,d/2M1,d/2\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2} contains exactly 1/41/4 of all the queries in M2,d\mathcal{M}_{2,d}, so (α,β)(\alpha,\beta)-accurate answers to M2,d\mathcal{M}_{2,d} contain (α,4β)(\alpha,4\beta)-accurate answers to the subset M1,d/2M1,d/2\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2}. So our lower bound for the subset M1,d/2M1,d/2\mathcal{M}_{1,d/2}\land\mathcal{M}_{1,d/2} is sufficient to obtain the desired lower bound. Finally, note that

2 Lower Bounds for Arbitrary Queries

Using our composition theorem, we can also prove a nearly-optimal sample complexity lower bound as a function of the Q,d,|\mathcal{Q}|,d, and α\alpha and establish Theorem 1.3 in the introduction.

Roughly, the results of [DN03] can be interpreted in our framework as showing that there is an Ω(1/α2)\Omega(1/\alpha^{2})-row database that enables a 1/1001/100-reconstruction attack from (α,0)(\alpha,0)-accurate answers to some family of queries Q\mathcal{Q}, but only when the vector to be reconstructed is Boolean. That is, the attack reconstructs a bit vector accurately provided that every query in Q\mathcal{Q} is answered correctly. Dwork et al. [DMT07, DY08] generalized this attack to only require (α,β)(\alpha,\beta)-accuracy for some constant β>0\beta>0, and we will make use of this extension (although we do not require computational efficiency, which was a focus of those works). Finally, we need an extension to the case of fractional vectors sns\in^{n}, instead of Boolean vectors s{0,1}ns\in\{0,1\}^{n}.

The extension is fairly simple and the proof follows the same outline of the original reconstruction attack from [DN03]. We are given accurate answers to queries in Q\mathcal{Q}, which we interpret as approximate “subset-sums” of the vector sns\in^{n} that we wish to reconstruct. The reconstruction attack will output any vector tt from a discretization {0,1/m,,(m1)/m,1}n\left\{0,1/m,\ldots,(m-1)/m,1\right\}^{n} of the unit interval that is “consistent” with these subset-sums. The main lemma we need is an “elimination lemma” that says that if ts1\|t-s\|_{1} is sufficiently large, then for a random subset T[n]T\subseteq[n],

with suitable large constant probability. For m=1m=1 this lemma can be established via combinatorial arguments, whereas for the m>1m>1 case we establish it via the Berry-Esséen Theorem. The lemma is used to argue that for every tt that is sufficiently far from ss, a large fraction of the subset-sum queries will witness the fact that tt is far from ss, and ensure that tt is not chosen as the output.

First we state and prove the lemma that we just described, and then we will verify that it indeed leads to a reconstruction attack.

Let κ>0\kappa>0 be a constant, let α>0\alpha>0 be a parameter with ακ2/240\alpha\leq\kappa^{2}/240, and let n=1/576κ2α2n=1/576\kappa^{2}\alpha^{2}. Then for every rnr\in^{n} such that 1ni=1nri>κ\frac{1}{n}\sum_{i=1}^{n}|r_{i}|>\kappa, and a randomly chosen q[n]q\subseteq[n],

Let rr be as in the statement of the lemma. Define a random variable

The condition on the right-hand side says that iQi\sum_{i}Q_{i} is in some interval of width 6αn6\alpha n. Since the random variables QiQ_{i} are independent, as qq is a randomly chosen subset, we will use the Berry-Esséen Theorem (Theorem 5.13) to conclude that this sum does not fall in any interval of this width too often. Establishing the next claim suffices to prove Lemma 5.11.

In order to apply Theorem 5.13 with Xi=QiX_{i}=Q_{i}, we need to analyze the moments of the random variables QiQ_{i}. The following bounds can be verified from the definition of QiQ_{i} and the assumption that r1κn\|r\|_{1}\geq\kappa n.

where σI\sigma I is an interval of width σ/2\sigma/2. Thus we have obtained that iQi\sum_{i}Q_{i} falls outside of any interval of width σ/2\sigma/2 with probability at least 3/53/5. In order to establish the claim, we simply observe that

when n=1/576κ2α2n=1/576\kappa^{2}\alpha^{2}. Thus, the probability of falling outside an interval of width 6αn6\alpha n is only larger than the probability of falling outside an interval of width σ/2\sigma/2. ∎

Establishing Claim 5.12 completes the proof of Lemma 5.11. ∎

Let α(0,1]\alpha^{\prime}\in(0,1] be a constant, let α>0\alpha>0 be a parameter with α(α)2/960\alpha\leq(\alpha^{\prime})^{2}/960, and let n=1/144(α)2α2n=1/144(\alpha^{\prime})^{2}\alpha^{2}. For any data universe X={x1,,xn}\mathcal{X}=\{x_{1},\ldots,x_{n}\} of size nn, there is a set of counting queries Q\mathcal{Q} over X\mathcal{X} of size at most O(nlog(1/α))O(n\log(1/\alpha)) such that the database D=(x1,,xn)D=(x_{1},\ldots,x_{n}) enables a α\alpha^{\prime}-reconstruction attack from (α,1/3)(\alpha,1/3)-accurate answers to Q\mathcal{Q}.

First we will give a reconstruction algorithm B\mathcal{B} for an arbitrary family of queries. We will then show that for a random set of queries Q\mathcal{Q} of the appropriate size, the reconstruction attack succeeds for every sns\in^{n} with non-zero probability, which implies that there exists a set of queries satisfying the conclusion of the theorem. We will use the shorthand

In order to show that the reconstruction attack B\mathcal{B} from Figure 6 succeeds, we must show that 1ni=1ntisiα.\frac{1}{n}\sum_{i=1}^{n}|t_{i}-s_{i}|\leq\alpha^{\prime}. Let sns\in^{n}, and let s{0,1/m,,(m1)/m,1}ns^{\prime}\in\left\{0,1/m,\ldots,(m-1)/m,1\right\}^{n} be the vector obtained by rounding each entry of ss to the nearest 1/m1/m. Then

so it is enough to show that the reconstruction attack outputs a vector close to ss^{\prime}. Observe that the vector ss^{\prime} itself satisfies

for any subset-sum query qq, so the reconstruction attack always finds some vector tt. To show that the reconstruction is successful, fix any t{0,1/m,,(m1)/m,1}nt\in\left\{0,1/m,\ldots,(m-1)/m,1\right\}^{n} such that 1ni=1ntisi>α2.\frac{1}{n}\sum_{i=1}^{n}|t_{i}-s^{\prime}_{i}|>\frac{\alpha^{\prime}}{2}. If we write r=st{1,,1/m,0,1/m,,1}nr=s^{\prime}-t\in\{-1,\ldots,-1/m,0,1/m,\ldots,1\}^{n}, then 1ni=1nri>α2\frac{1}{n}\sum_{i=1}^{n}|r_{i}|>\frac{\alpha^{\prime}}{2} and q,r=q,tq,s\langle q,r\rangle=\langle q,t\rangle-\langle q,s^{\prime}\rangle. In order to show that no tt that is far from ss^{\prime} can be output by B\mathcal{B}, we will show that for any r{1,,1/m,0,1/m,,1}r\in\{-1,\ldots,-1/m,0,1/m,\ldots,1\} with 1ni=1nr>α2\frac{1}{n}\sum_{i=1}^{n}|r|>\frac{\alpha^{\prime}}{2},

To prove this, we first observe by Lemma 5.11 (setting κ=12α\kappa=\frac{1}{2}\alpha^{\prime}) that for a randomly chosen query qq defined on X\mathcal{X},

The lemma applies because q,r=1ni=1nq(xi)ri\langle q,r\rangle=\frac{1}{n}\sum_{i=1}^{n}q(x_{i})r_{i} is a random subset-sum of the entries of rr.

Next, we apply a concentration bound to show that if the set Q\mathcal{Q} of queries is a sufficiently large random set, then for every vector rr the fraction of queries for which q,r|\langle q,r\rangle| is large will be close to the expected number, which we have just established is at least 3Q/53|\mathcal{Q}|/5. We use the following version of the Chernoff bound.

Consider a set of randomly chosen queries Q\mathcal{Q}. By the above, we have that for every r{1,,1/m,0,1/m,,1}nr\in\{-1,\ldots,-1/m,0,1/m,\ldots,1\}^{n} such that 1ni=1nr>α2\frac{1}{n}\sum_{i=1}^{n}|r|>\frac{\alpha^{\prime}}{2},

Since the queries are chosen independently, by the Chernoff bound we have

Thus, we can choose Q=O(nlogm)|\mathcal{Q}|=O(n\log m) to obtain

Thus, we have established that there exists a family of queries Q\mathcal{Q} such that for every s,ts,t such that 1ni=1ntisi>α\frac{1}{n}\sum_{i=1}^{n}|t_{i}-s_{i}|>\alpha^{\prime},

Moreover, by (α,1/3)(\alpha,1/3)-accuracy, we have

Applying a triangle inequality, we can conclude

which implies that tt cannot be the output of B\mathcal{B}. This completes the proof.

2.2 Putting Together the Lower Bound

Now we show how to combine the various attacks to prove Theorem 1.2 in the introduction. We obtain our lower bound by applying two rounds of composition. In the first round, we compose the reconstruction attack described above with the re-identifiable distribution for 11-way marginals. We then take the resulting re-identifiable distribution and apply a second round of composition using the reconstruction attack for query families of high VC-dimension.

Just like our lower bound for kk-way marginal queries, we remark that it is necessary to apply the two rounds of composition in this order. See Section 5.1.3 for a discussion of this issue.

such that there exists a distribution on nn-row databases D({0,1}d)nD\in(\{0,1\}^{d})^{n} that is (1/2,o(1/n))(1/2,o(1/n))-re-identifiable from (α,0)(\alpha,0)-accurate answers to Q\mathcal{Q}.

Applying Theorem 4.3 (with parameter c=150c=150), we obtain item 1’ below. We then bring in another reconstruction attack for the composition theorem.

By Lemma 5.3, there exists a database D({0,1}d/3)loghD\in(\{0,1\}^{d/3})^{\log h} that enables a (4α)(4\alpha)-reconstruction attack from (α,0)(\alpha,0)-accurate answers to some Qvc\mathcal{Q}_{vc} of size hh. (In particular, the family of queries can be all (logh)(\log h)-way marginals on the first logh\log h bits of the data universe items.)

Again, Theorem has a corresponding statement in terms of generalized fingerprinting codes.

such that there exists a (n,Q)(n,\mathcal{Q})-generalized fingerprinting code with security (1/2,o(1/n))(1/2,o(1/n)) for (α,0)(\alpha,0)-accuracy.

Constructing Error-Robust Fingerprinting Codes

In this section, we show how to construct fingerprinting codes that are robust to a constant fraction of errors, which will establish Theorem 3.4. Our codes are based on the fingerprinting code of Tardos [Tar08], which has a nearly optimal number of users, but is not robust to any constant fraction of errors. The number of users in our code is only a constant factor smaller than that of Tardos, and thus our codes also have a nearly optimal number of users.

To motivate our approach, it is useful to see why the Tardos code (and all other fingerprinting codes we are aware of) are not robust to a constant fraction of errors. The reason is that the the only way to introduce an error is to put a in a column containing only 11’s or vice versa (recall that the set of codewords, C{0,1}n×dC\in\{0,1\}^{n\times d}, can be viewed as an n×dn\times d matrix). We call such columns “marked columns.” Thus, if the adversary is allowed to introduce m\geq m errors where mm is the number of marked columns then he can simply ignore the codewords and output either the all- or all-11 codeword, which cannot be traced. Thus, in order to tolerate a β\beta fraction of errors, it is necessary that mβdm\geq\beta d where dd is the length of the codeword, and this is not satisfied by any construction we know of (when β>0\beta>0 is a constant). However, Tardos’ construction can be shown to remain secure if the adversary is allowed to introduce βm\beta m errors, rather than βd\beta d errors, for some constant β>0\beta>0. We demonstrate this formally in Section 6.2. In addition, we show how to take a fingerprinting code that tolerates βm\beta m errors and modify it so that it can tolerate about βd/3\beta d/3 errors. This reduction is formalized in Section 6.1. Combining these two results will give us a robust fingerprinting code.

We remark that prior work [BN08, BKM10] has shown how to construct fingerprinting codes satisfying a weaker robustness property. Specifically, their codes allow the adversary to introduce a special “?” symbol in a large fraction of coordinates, but still require that any coordinate that is not a “?” satisfies the feasibility constraint.

Before proceeding with the construction and analysis, we restate some terminology and notation from Section 3. Recall that a fingerprinting code is a pair of algorithms (Gen,Trace)(\mathit{Gen},\mathit{Trace}), where Gen\mathit{Gen} specifies a distribution over codebooks C{0,1}n×dC\in\{0,1\}^{n\times d} consiting of nn codewords (c1,,cn)(c_{1},\ldots,c_{n}), and Trace(C,c)\mathit{Trace}(C,c^{\prime}) either outputs the identity i[n]i\in[n] of an accused user or outputs \bot. Recall that Gen\mathit{Gen} and Trace\mathit{Trace} share a common state. For a coalition S[n]S\subseteq[n], we write CS{0,1}S×dC_{S}\in\{0,1\}^{|S|\times d} to denote the subset of codewords belonging to users in SS.

Every codebook CC, coalition SS, and robustness parameter β\beta\in defines a feasible set of combined codewords,

We now recall the definition of an error-robust fingerprinting code from Section 3.1.

where the probability is taken over the coins of Gen,Trace\mathit{Gen},\mathit{Trace}, and AFP\mathcal{A}_{\mathit{FP}}. The algorithms Gen\mathit{Gen} and Trace\mathit{Trace} may share a common state.

The main result of this section is a construction of fingerprinting codes satisfying Definition 6.1

We remark that we have made no attempt to optimize the fraction of errors to which our code is robust. We leave it as an interesting open problem to construct a robust fingerprinting code for a nearly-optimal number of users that is robust to a fraction of errors arbitrarily close to 1/21/2.

A key step in our construction is a reduction from constructing error-robust fingerprinting codes to constructing a weaker object, which we call a weakly-robust fingerprinting code. The difference between a weakly-robust fingerprinting code and an error-robust fingerprinting code of the previous section is that we now demand that only a β\beta fraction of the marked positions can have errors, rather than a β\beta fraction of all positions.

In order to formally define weakly-robust fingerprinting codes, we introduce some terminology. If C{0,1}n×dC\in\{0,1\}^{n\times d} is a codebook, then for b{0,1}b\in\{0,1\}, we say that position j[d]j\in[d] is bb-marked in CC if cij=bc_{ij}=b for every i[n]i\in[n]. That is, jj is bb-marked if every user has the symbol bb in the jj-th position of their codeword. The set Fβ(C)F_{\beta}(C) consists of all codewords cc^{\prime} such that for a 1β1-\beta fraction of positions jj, either jj is not marked, or jj is bb-marked and cj=bc^{\prime}_{j}=b. Notice that this constraint is vacuous if fewer than a β\beta fraction of positions are marked.

For a weakly-robust fingerprinting code, we will define a more constrained feasible set. Intuitively, a codeword cc^{\prime} is feasible if for a 1β1-\beta fraction of positions that are marked, cjc^{\prime}_{j} is set appropriately. Note that this condition is meaningful even when the fraction of marked positions is much smaller than β\beta. More formally, we define

The next theorem states that if we have an (n,d)(n,d)-fingerprinting code that is weakly-robust to a β\beta fraction of errors and satisfies a mild technical condition, then we obtain an (n,O(d))(n,O(d))-fingerprinting code that is robust to an Ω(β)\Omega(\beta) fraction of errors with a similar level of security.

are a (n,d)(n,d)-fingerprinting code with security ξ\xi weakly-robust to a β\beta fraction of errors, and

with probability at least 1ξ1-\xi over C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen}, produce CC that has at least mm -marked columns and mm 11-marked columns.

Then there is a pair of algorithms (Gen,Trace)(\mathit{Gen}^{\prime},\mathit{Trace}^{\prime}) that are a (n,d)(n,d^{\prime})-fingerprinting code with security ξ\xi^{\prime} robust to a β/3\beta/3 fraction of errors, where

The reduction is given in Figure 7. Recall that Gen\mathit{Gen}^{\prime} and Trace\mathit{Trace}^{\prime} may share state, so π\pi and the shared state of Gen\mathit{Gen} and Trace\mathit{Trace} is known to Trace\mathit{Trace}^{\prime}.

Fix a coalition S[n]S\subseteq[n]. Let AFP\mathcal{A}_{\mathit{FP}}^{\prime} be an adversary. Sample C\mboxRGenC^{\prime}\leftarrow_{\mbox{\tiny R}}\mathit{Gen}^{\prime} and let c=AFP(C)c^{\prime}=\mathcal{A}_{\mathit{FP}}^{\prime}(C^{\prime}). We will show that the reduction is successful by proving that if cFβ/3(C)c^{\prime}\in\mathit{F}_{\beta/3}(C^{\prime}), then the modified string cWFβ(C)c\in\mathit{WF}_{\beta}(C) with probability 1exp(Ω(βm2/d))1-\exp(-\Omega(\beta m^{2}/d)). The reason is that an adversary who is given (a subset of the rows of) CC^{\prime} cannot distinguish real columns that are marked from fake columns. Therefore, the fraction of errors in the real marked columns should be close to the fraction of errors that are either real and marked or fake. Since the total fraction of errors in the entire codebook is at most β/3\beta/3, we know that the fraction of errors in real marked columns is not much larger than β/3\beta/3. Thus the fraction of errors in the real marked columns will be at most β\beta with high probability. We formalize this argument in the following claim.

for any choice of kk. An identical argument bounds the probability that the number of errors in real 11-marked columns is more than βm1\beta m_{1}. Therefore, the probability that more than a β\beta fraction of marked columns have errors is at most 2exp(Ω(βm2/d))2\exp(-\Omega(\beta m^{2}/d)). ∎

Now define an adversary AFP\mathcal{A}_{\mathit{FP}} that takes CSC_{S} as input, simulates Gen\mathit{Gen}^{\prime} by appending marked columns to CSC_{S} and applying a random permutation π\pi, and then applies AFP\mathcal{A}_{\mathit{FP}}^{\prime} to the resulting codebook CSC^{\prime}_{S}. Then it takes AFP(CS)\mathcal{A}_{\mathit{FP}}^{\prime}(C^{\prime}_{S}), applies π1\pi^{-1}, removes the fake columns, and outputs the result. Notice that Trace\mathit{Trace}^{\prime} applies Trace\mathit{Trace} to a codebook and codeword generated by exactly the same procedure. If we assume that AFP(CS)\mathcal{A}_{\mathit{FP}}^{\prime}(C^{\prime}_{S}) is feasible with parameter β/3\beta/3, then by the analysis above, with probability at least 1ξexp(Ω(βm2/d))1-\xi-\exp(-\Omega(\beta m^{2}/d)), AFP(CS)\mathcal{A}_{\mathit{FP}}(C_{S}) is weakly feasible with parameter β\beta. Thus,

where the first inequality is by Claim 6.5 and the second inequality is by ξ\xi-security of Trace\mathit{Trace}.

Since Trace\mathit{Trace} does not accuse a user outside of SS (except with probability at most ξ\xi) regardless of whether or not that adversary’s codeword is feasible, it is immediate that Trace\mathit{Trace}^{\prime} also does not accuse a user outside of SS (except with probability at most ξ\xi). ∎

2 Weak Robustness of Tardos’ Fingerprinting Code

In this section we show that Tardos’ fingerprinting code is weakly robust to a β\beta fraction of errors for β1/25\beta\geq 1/25. Specifically we prove the following:

Tardos’ fingerprinting code is described in Figure 8. Note that the shared state of Gen\mathit{Gen} and Trace\mathit{Trace} will include p1,,pdp_{1},\ldots,p_{d}.

Tardos’ proof that no user is falsely accused (except with probability ξ\xi) holds for every adversary, regardless of whether or not the adversary’s output is feasible, therefore it holds without modification even when we allow the adversary to introduce errors. So we will state the following lemma from [Tar08, Section 3] without proof.

Let (Gen,Trace)(\mathit{Gen},\mathit{Trace}) be the fingerprinting code defined in Algorithm 8. Then for every adversary AFP\mathcal{A}_{\mathit{FP}}, and every S[n]S\subseteq[n],

where the probability is taken over the choice of C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen} and the coins of AFP\mathcal{A}_{\mathit{FP}}.

Most of the remainder of this section is devoted to proving that any adversary who introduces errors into at most a 1/251/25 fraction of the marked columns can be traced successfully.

Let (Gen,Trace)(\mathit{Gen},\mathit{Trace}) be the fingerprinting code defined in Algorithm 8. Then for every adversary AFP\mathcal{A}_{\mathit{FP}}, and every S[n]S\subseteq[n],

where the probability is taken over the choice of C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen} and the coins of AFP\mathcal{A}_{\mathit{FP}}.

Before giving the proof, we briefly give a high-level roadmap. Recall that in the construction there is a “score” function Si(c)S_{i}(c^{\prime}) that is computed for each user, and Trace\mathit{Trace} will output some user whose score is larger than the threshold Z/2Z/2, if such a user exists. Tardos shows that the sum of the scores over all users is at least nZ/2nZ/2, which demonstrates that there exists a user whose score is above the threshold. His argument works by balancing two contributions to the score: 1) the contribution from 11-marked columns jj, which will always be positive due to the fact that cj=1c^{\prime}_{j}=1, and 2) the potentially negative contribution from columns that are not 11-marked. Conceptually, he shows that the contribution from the 11-marked columns is larger in expectation than the negative contribution from the other columns, so the expected score is significantly above the threshold. He then applies a Chernoff-type bound to show that the score will be above the threshold with high probability. When the adversary is allowed to introduce errors so that there may be some 11-marked columns jj such that cj=0c^{\prime}_{j}=0, these errors will contribute negatively to the score. The new ingredient in our argument is essentially to bound the negative contribution from these errors. We are able to get a sufficiently good bound to tolerate errors in 1/251/25 of the coordinates. We expect that a tighter analysis and more careful tuning of the parameters can improve the fraction of errors that can be tolerated.

We will write S=[n]S=[n]. Doing so is without loss of generality as users outside of SS are irrelevant. We will use β=1/25\beta=1/25 to denote the allowable fraction of errors. Fix an adversary B\mathcal{B}. Sample C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen} and let c=B(C)c^{\prime}=\mathcal{B}(C). Assume cWFβ(C)c^{\prime}\in\mathit{WF}_{\beta}(C). In order to prove that some user is traced, we will bound the quantity

where xj=i=1nCijx_{j}=\sum_{i=1}^{n}C_{ij} is defined to be the number of codewords cic_{i} such that cij=1c_{ij}=1. Our goal is to show that this quantity is at least nZ/2nZ/2 with high probability. If we can do so, then there must exist a user i[n]i\in[n] such that Si(c)Z/2S_{i}(c^{\prime})\geq Z/2, in which case Trace(C,c)\mathit{Trace}(C,c^{\prime})\neq\bot.

If jj is unmarked, then cj=0\overline{c}_{j}=0.

If jj is -marked, then cj{0,1}\overline{c}_{j}\in\{0,1\}.

If jj is 11-marked, then cj{1,0}\overline{c}_{j}\in\{-1,0\}.

The number of nonzero coordinates of c\overline{c} is at most βm\beta m, where mm is the number of marked columns of cc.

We call a c\overline{c} satisfying the above constraints valid. By the linearity of S()S(\cdot), we can write

We will now establish the following claim.

We start by making an observation about the distribution of S(c)=S(c)C,cS(\overline{c})=S(\overline{c})|_{C,\overline{c}}, which denotes S(c)S(\overline{c}) when we condition on a fixed choice of a codebook CC and a valid choice of c\overline{c}. Because the non-zero coordinates of c\overline{c} are only in marked columns of CC (those in which xj=0x_{j}=0 or xj=nx_{j}=n), the distribution of

depends only on the number of non-zero coordinates of c\overline{c}, and not on their location. To see that this is the case, consider a -marked coordinate jj on which cj=1\overline{c}_{j}=1. The contribution of jj to S(c)S(\overline{c}) is exactly n/qj-n/q_{j}. Similarly, for a 11-marked coordinate jj on which cj=1\overline{c}_{j}=-1, the contribution of jj to S(c)S(\overline{c}) is exactly nqj-nq_{j}. Thus we can write

Each term in the first sum (resp. second sum) is a random variable that depends only on the distribution of qjq_{j} conditioned on the the jj-th column being -marked (resp. 11-marked). Recall that qjq_{j} is determined by pjp_{j}. Moreover, conditioned on a fixed CC, the pjp_{j}’s are independent. To see this, let CjC_{j} denote the jjth column of the codebook CC. Recall that each column CjC_{j} is generated independently using pjp_{j}, and the pjp_{j}’s themselves are chosen independently. Letting fXf_{X} denote the density function of a random variable XX, this means that the joint density

This shows that the conditional random variables pjCjp_{j}|_{C_{j}} are independent. Moreover, since c\overline{c} only depends on the codebook CC and coins of the adversary B\mathcal{B}, the pjp_{j}’s are still independent when we also condition on c\overline{c}. In fact, the following holds:

Conditioned on any fixed choice of CC and c\overline{c}, the following distributions are all identical, independent, and non-negative: 1) (n/q_{j}\mid\textrm{jisis0-marked}) for j[d]j\in[d], and 2) (nq_{j}\mid\textrm{jisis1-marked}).

By the discussion above, we know that these random variables are independent. To see that they are identicially distributed, note that the distribution pjp_{j} used to generate the jjth column of CC is symmetric about 1/21/2. Therefore, the probability that column jj is -marked when its entries are sampled according to pjp_{j} is the same as the probability that jj is 11-marked when its entries are sampled according to 1pj1-p_{j}. Applying Bayes’ rule, again using the fact that pjp_{j} and 1pj1-p_{j} have the same distribution, we see that the random variables (p_{j}\mid\textrm{jisis0-marked}) and (1-p_{j}\mid\textrm{jisis1-marked}) are identically distributed. The claim follows since qj=(1pj)/pjq_{j}=\sqrt{(1-p_{j})/p_{j}}. ∎

and conditioned on having βm\beta m errors, those errors occur on a uniformly random set of marked columns. Thus, if we can show that

provided n,1/ξn,1/\xi are sufficiently large.

First, observe that, since the distribution of pp is symmetric about 1/21/2, A0=AnA_{0}=A_{n}. Second, if we let

In order to obtain a strong enough bound, we need to show that AnBn=O(βα)A_{n}-B_{n}=O(\beta\alpha). We can calculate

Now we apply the approximation eu1+2ue^{u}\leq 1+2u, which holds for 0u10\leq u\leq 1. To do so, we choose α=t/n\alpha=\sqrt{t}/n. Since q=(1p)/pq=\sqrt{(1-p)/p} and ptp\geq t, we have αnq1\alpha nq\leq 1 for this choice of α\alpha. Thus we have

The final inequality holds as long as nn is larger than some absolute constant. (To see that this is the case, recall that t=arcsin(t)=arcsin(1/300n)=Θ(1/n)t^{\prime}=\arcsin(\sqrt{t})=\arcsin(\sqrt{1/300n})=\Theta(1/\sqrt{n}), whereas (11/300n)n=1Ω(1)(1-1/300n)^{n}=1-\Omega(1).) So we have established

Plugging this fact into the analysis above, we have

Now all that remains is to apply Markov’s inequality to bound this quantity by ξn/4\xi^{\sqrt{n}/4}.

To get the desired upper bound, it is sufficient to show

where the last inequality holds when β<1/25\beta<1/25. This is sufficient to complete the proof of Claim 6.10. ∎

Lemma 6.7 and 6.8 are sufficient to imply Lemma 6.6, that Tardos’ fingerprinting code is weakly robust. In order to apply our reduction from full robustness to weak robustness (Lemma 6.4), we need to also establish that with high probability there are many marked columns in the matrix C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen} for Tardos’ fingerprinting code.

With probability at least 1ξ1-\xi over the choice of C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen}, it holds that the number of -marked columns m0m_{0} and the number of 11-marked columns m1m_{1} are both larger than m=5n3/2log(n/ξ)m=5n^{3/2}\log(n/\xi).

To estimate the number of marked columns, define for each j=1,,dj=1,\ldots,d an indicator random variable DjD_{j} for whether column jj is -marked. The DjD_{j}’s are i.i.d., and have expectation at least

A similar argument holds for 11-marked columns. Thus letting m=5nnlog(n/ξ)m=5n\sqrt{n}\log(n/\xi), the codebook CC has at least mm -marked columns and mm 11-marked columns with probability at least 1ξ1-\xi. Now observe that

for nn larger than some absolute constant. ∎

Combining Lemma 6.4 (reduction from robustness to weak robustness), Lemma 6.6 (weak robustness of Tardos’ code), and Lemma 6.12 (Tardos’ code has many marked columns), suffices to prove Theorem 6.2.

Acknowledgements

We thank Kobbi Nissim for drawing our attention to the question of sample complexity and for many helpful discussions. We thank Adam Smith for suggesting that we use the Gaussian mechanism to provide a new proof of the lower bound on the length of fingerprinting codes. Finally, we thank the anonymous reviewers for their helpful comments.

References

Appendix A Lower Bounds on Fingerprinting Codes via Differential Privacy

By the contrapositive of Theorem 3.5, upper bounds on the sample complexity of answering 11-way marginals with differential privacy imply a lower bound on the length dd of any fingerprinting code with a given number of users nn. As pointed out to us by Adam Smith, this yields a particularly simple, self-contained proof of Tardos’ [Tar08] optimal lower bound on the length of fingerprinting codes. Specifically, using the well known Gaussian mechanism for achieving differential privacy, we can design a simple adversary AFP\mathcal{A}_{\mathit{FP}} that violates the security of any traitor tracing scheme with length d=o(n2)d=o(n^{2}).

Before diving into the proof, we will state the following elementary fact about Gaussian random variables. The fact simply says that a Gaussian random variable with suitable variance is “close” to a shifted version of itself in a particular sense. This same fact is used to show that adding Gaussian noise of suitable variance provides differential privacy.

Let AFP(CS)\mathcal{A}_{\mathit{FP}}(C_{S}) be the following adversary. Define the vector cd\overline{c}\in^{d} as

First we claim that AFP\mathcal{A}_{\mathit{FP}} outputs feasible codewords with at least constant probability.

For every SS such that Sn1,|S|\geq n-1, and every codebook C=(cij){0,1}n×d,C=(c_{ij})\in\{0,1\}^{n\times d},

By a standard tail bound for the Gaussian, we have

Now it remains to show that AFP\mathcal{A}_{\mathit{FP}} cannot be traced successfully. By assumption (Gen,Trace)(\mathit{Gen},\mathit{Trace}) has security ξ<1/6en<1/3.\xi<1/6en<1/3. Then we have in particular

Therefore, there exists i[n]i^{*}\in[n] such that

To complete the proof, it now suffices to show that if S=[n]{i}S=[n]\setminus\left\{i^{*}\right\}, then

which will contradict the security of the fingerprinting code.

By Fact A.2 (with δ=1/6en>ξ\delta=1/6en>\xi), for every rr,

Applying (10), and averaging over C\mboxRGenC\leftarrow_{\mbox{\tiny R}}\mathit{Gen} and rr, we have

which is the desired contradiction. This completes the proof. ∎