Fingerprinting Codes and the Price of Approximate Differential Privacy
Mark Bun, Jonathan Ullman, Salil Vadhan
Introduction
Consider a database , in which each of the rows corresponds to an individual’s record, and each record is an element of some data universe (e.g. , corresponding to binary attributes per record). The goal of privacy-preserving data analysis is to enable rich statistical analyses on such a database while protecting the privacy of the individuals. It is especially desirable to achieve -differential privacy [DMNS06, DKM+06], which (for suitable choices of and ) guarantees that no individual’s data has a significant influence on the information released about the database. A natural way to measure the tradeoff between these two goals is via sample complexity—the minimum number of records such that there exists a (possibly computationally unbounded) algorithm that achieves both differential privacy and statistical accuracy.
Some of the most basic statistics are counting queries, which are queries of the form “What fraction of individual records in satisfy some property ?” In particular, we would like to design an algorithm that takes as input a database and, for some family of counting queries , outputs an approximate answer to each of the queries in that is accurate to within, say, . Suppose we are given a bound on the number of queries and the dimensionality of the database records , but otherwise allow the family to be arbitrary. What is the sample complexity required to achieve -differential privacy and statistical accuracy for ?
Of course, if we drop the requirement of privacy, then we could achieve perfect accuracy when contains any number of records. However, in many interesting settings the database consists of random samples from some larger population, and an analyst is actually interested in answering the queries on the population. Thus, even without a privacy constraint, would need to contain enough records to ensure that (with high probability) for every query , the answer to on is close to the answer to on the whole population, say within . To achieve this form of statistical accuracy, it is well-known that it is necessary and sufficient for to contain samples.For a specific family of queries , the necessary and sufficient number of samples is proportional to the VC-dimension of , which can be as large as . In this work we consider whether there is an additional “price of differential privacy” if we require both statistical accuracy and -differential privacy (for, say, , ). This benchmark has often been used to evaluate the utility of differentially private algorithms, beginning with the seminal work of Dinur and Nissim [DN03].
These results show that the price of privacy is small for datasets with few attributes, but may be large for high-dimensional datasets. For example, if we simply want to estimate the mean of each of the attributes without a privacy guarantee, then samples are necessary and sufficient to get statistical accuracy. However, the best known -differentially private algorithm requires samples—an exponential gap. In the special case of pure -differential privacy, a lower bound of is known [HT10]. However, for the general case of approximate -differential privacy the best known lower bound is [DN03]. More generally, there are no known lower bounds that separate the sample complexity of -differential privacy from the sample complexity required for statistical accuracy alone.
In this work we close this gap almost completely, and show that there is indeed a “price of approximate differential privacy” for high-dimensional datasets.
We establish this lower bound using a combinatorial object called a fingerprinting code, which was originally introduced by Boneh and Shaw [BS98] for the problem of watermarking copyrighted content. Specifically, we use Tardos’ construction of optimal fingerprinting codes [Tar08]. The use of “secure content distribution schemes” to prove lower bounds for differential privacy originates with the work of Dwork et al. [DNR+09], who used “traitor-tracing schemes,” which are a cryptographic analogue of information-theoretic fingerprinting codes, to prove computational hardness results for differential privacy. Extending this connection, Ullman [Ull13] used fingerprinting codes to construct a novel traitor-tracing scheme and obtain a strong computational hardness result for differential privacy.In fact, one way to prove Theorem 1.1 is by replacing the one-way functions in [Ull13] with a random oracle, and thereby obtain an information-theoretically secure traitor-tracing scheme. Here we show that a direct use of fingerprinting codes yields information-theoretic lower bounds on sample complexity.
Next, we consider the sample complexity of answering an arbitrary set of counting queries to within error . As above, if we assume the database contains samples from a population, and require only that the answers to queries on the sampled database and the population are close, to within , then samples are necessary and sufficient for just statistical accuracy. When is large (relative to and ), the best sample complexity for differential privacy is again achieved by the private multiplicative weights algorithm, and is . For pure differential privacy, a lower bound of is known [Har11]. On the other hand, the best known lower bound for approximate differential privacy is , which follows from the techniques of [DN03]. To resolve this gap, we give a composition theorem that allows us to obtain a nearly optimal lower bound by combining Theorem 1.1 with (variants of) the existing sample complexity lower bounds. The result shows that the private multiplicative weights algorithm achieves nearly-optimal sample-complexity as a function of , and .
We remark that the condition that is both necessary (up to the constant factor) and fairly mild. Necessary because the noisy histogram algorithm (see, e.g. [Vad16]) requires samples, which is better than the conclusion of the lower bound when . Mild because differential privacy cannot be satisfied for large query sets unless , so the condition is no stronger than assuming , in which case the number of samples is exponential in the dimension. Similarly, the condition is also necessary, since adding independent noise to each query requires only samples.
Finally, we consider the sample complexity of the natural and well studied class of -way marginal queries, also known as -way conjunction queries (see e.g. [BCD+07, KRSU10, GHRU11, TUV12, CTUW14, DNT13]). A -way marginal query on a database is specified by a set , , and a pattern and asks “What fraction of records in has each attribute in set to ?” The number of -way marginal queries on is about . For the special case of , the queries simply ask for the mean of each attribute, which was discussed above. We prove that the lower bound of Theorem 1.2, which applies to worst-case queries, also holds for the special case of -way marginal queries when is not too small.
We remark that, since the number of -way marginal queries is about , the sample complexity lower bound in Theoem 1.3 essentially matches that of Theorem 1.2. The two theorems are incomparable, since Theorem 1.2 applies even when is exponentially small in , but only applies for a worst-case family of queries.
We now describe the main technical ingredients used to prove these results. For concreteness, we will describe the main ideas for the case of -way marginal queries.
Fingerprinting codes, introduced by Boneh and Shaw [BS98], were originally designed to address the problem of watermarking copyrighted content. Roughly speaking, a (fully-collusion-resilient) fingerprinting code is a way of generating codewords for users in such a way that any codeword can be uniquely traced back to a user. Each legitimate copy of a piece of digital content has such a codeword hidden in it, and thus any illegal copy can be traced back to the user who copied it. Moreover, even if an arbitrary subset of the users collude to produce a copy of the content, then under a certain marking assumption, the codeword appearing in the copy can still be traced back to one of the users who contributed to it. The standard marking assumption is that if every colluder has the same bit in the -th bit of their codeword, then the -th bit of the “combined” codeword in the copy they produce must be also . We refer the reader to the original paper of Boneh and Shaw [BS98] for the motivation behind the marking assumption and an explanation of how fingerprinting codes can be used to watermark digital content.
We show that the existence of short fingerprinting codes implies sample complexity lower bounds for -way marginal queries. Recall that a -way marginal query is specified by an integer and asks simply “What fraction of records in have a in the -th bit?” Suppose a coalition of users takes their codewords and builds a database where each record contains one of their codewords, and is the length of the codewords. Consider the -way marginal query . If every user in has a bit in the -th bit of their codeword, then . Thus, if an algorithm answers -way marginal queries on with non-trivial accuracy, its output can be used to obtain a combined codeword that satisfies the marking assumption. By the tracing property of fingerprinting codes, we can use the combined codeword to identify one of the users in the database. However, if we can identify one of the users from the answers, then the algorithm is not differentially private.
Theorems 1.2 and 1.3 are proven by using this composition technique repeatedly to combine our lower bound for -way marginals with (variants of) several known lower bounds that capture the optimal dependence on and .
The connection between fingerprinting codes and differential privacy lower bounds extends to arbitrary families of counting queries. We introduce the notion of a generalized fingerprinting code with respect to , where each codeword corresponds to a data universe element and the bits of the codeword are given by for each , but is the same as an ordinary fingerprinting code otherwise. The existence of a generalized fingerprinting code with respect to , for users, implies a sample complexity lower bound of for privately releasing answers to . We also show a partial converse to the above result, which states that some sort of “fingerprinting-code-like object” is necessary to prove sample complexity lower bounds for answering counting queries under differential privacy. This object has similar semantics to a generalized fingerprinting code, however the marking assumption required for tracing is slightly stronger and the probability that tracing succeeds can be significantly smaller than what is required by the standard definition of fingerprinting codes. Our partial converse parallels the result of Dwork et al. [DNR+09] that shows computational hardness results for differential privacy imply a “traitor-tracing-like object.” We leave it as an open question to pin down precisely the relationship between fingerprinting codes and information-theoretic lower bounds in differential privacy (and also between traitor-tracing schemes and computational hardness results for differential privacy).
2 Other Related Work
There have been attempts to prove optimal sample complexity lower bounds for -way marginals. In particular, when is a constant, Kasiviswanathan et al. [KRSU10] and De [De12] prove a lower bound of on the sample complexity. Note that when is a constant, these lower bounds are .
In addition to the general computational hardness results referenced above, there are several results that show stronger hardness results for restricted types of efficient algorithms [UV11, GHRU11, DNV12].
2.2 Subsequent Work
Subsequent to our work, Steinke and Ullman [SU15a] refined our use of fingerprinting codes to prove a lower bound of on the number of samples required to release the mean of each of the attributes under -differential privacy when . This lower bound is optimal up to constant factors, and improves on Theorem 1.1 by a factor of roughly . They also improve and simplify our analysis of robust fingerprinting codes.
Our fingerprinting code technique has also been used to prove lower bounds for other types of differentially private data analyses. Namely, Dwork et al. [DTTZ14] prove lower bounds for differentially private principal component analysis and Bassily, Smith, and Thakurta [BST14] prove lower bounds for differentially private empirical risk minimization. In order to establish lower bounds for privately releasing threshold functions, Bun et al. [BNSV15] construct a fingerprinting-code-like object that yields a lower bound for the problem of releasing a value between the minimum and maximum of a dataset.
Dwork et al. [DSS+15] observe that the privacy attack implicit in our negative results is closely related to the influential attacks that were employed by Homer et al. [HSR+08] (and further studied in [SOJH09]) to violate privacy of public genetic datasets. Using this connection, they show how to make Homer et al.’s attack robust to very general models of noise and how to make the attack work without detailed knowledge of the population the dataset represents.
A pair of works [HU14, SU15b] show that fingerprinting codes and the related traitor-tracing schemes imply both information-theoretic lower bounds and computational hardness results for the “false discovery” problem in adaptive data analysis. Specifically, they show lower bounds for answering an online sequence of adaptively chosen counting queries where the database is a sample from some unknown distribution and the answers must be accurate with respect to that distribution. These works [HU14, SU15b] effectively reverse a connection established in [DFH+15, BSSU15], which used differentially private algorithms to obtain positive results for this problem.
Our technique for composing lower bounds in differential privacy has also found applications outside of privacy. Specifically, Liberty et al. [LMTU14] used this technique to prove nearly optimal lower bounds on the space required to “sketch” a database while approximately preserving answers to -way marginal queries (called “frequent itemset queries” in their work).
Preliminaries
We define a database to be an ordered tuple of rows chosen from a data universe . We say that two databases are adjacent if they differ only by a single row, and we denote this by . In particular, we can replace the th row of a database with some fixed “junk” element of to obtain another database . We emphasize that if is a database of size , then is also a database of size .
Let be a randomized algorithm (where is a varying parameter). is -differentially private if for every two adjacent databases and every subset ,
Let be a randomized algorithm such that for every , every , and every subset ,
Let denote the fixed junk element of . Then defined by is -differentially private.
Let and be adjacent databases. Then for any , we have
2 Counting Queries and Accuracy
In this paper we study algorithms that answer counting queries. A counting query on is defined by a predicate . Abusing notation, we define the evaluation of the query on a database to be its average value over the rows,
When we may simply write that or is -accurate for .
An important example of a collection of counting queries is the set of -way marginals. For all of our results it will be sufficient to consider only the set of monotone -way marginals.
A (monotone) -way marginal over is specified by a subset of size . It takes the value if and only if for every index . The collection of all (monotone) -way marginals is denoted by .
3 Sample Complexity
In this work we prove lower bounds on the sample complexity required to simultaneously achieve differential privacy and accuracy.
We will focus on the case where and . This setting of the parameters is essentially the most-permissive for which -differential privacy is still a meaningful privacy definition. However, pinning down the exact dependence on and is still of interest. Regarding , this can be done via the following standard lemma, which allows us to take without loss of generality.
For every set of counting queries , universe , . has sample complexity for -accuracy and -differential privacy if and only if it has sample complexity for -accuracy and -differential privacy.
One direction ( samples are sufficient) is the “secrecy-of-the-sample lemma,” which appeared implicitly in [KLN+11]. The other direction ( samples are necessary) appears to be folklore.
The next lemma allows us to generically translate sample complexity lower bounds for constant accuracy into lower bounds that depend on the error parameter . For some sets of queries, such as -way marginals, the dependence we get on is tight. However, as we will see in Section 5, we can obtain lower bounds with an even stronger dependence on for specific sets of queries.
Let be a set of counting queries on and let . Suppose has sample complexity for -accuracy and -differential privacy, where is a constant. Then has sample complexity for -accuracy and -differential privacy.
The mechanism inherits -differential privacy from , since changing one row of changes one row of the padded database . Now we argue accuracy. Suppose is an answer such that . Note that by construction, , and hence . Thus we have
Taking makes this quantity at most , completing the proof. ∎
For context, we can restate some prior results on differentially private counting query release in our sample-complexity terminology.
For every set of counting queries on and every , has sample complexity at most
for -accuracy and -differential privacy.
The next theorem shows that, when the data universe is not too small, the private multiplicative weights algorithm is nearly-optimal as a function of and when each parameter is considered individually.
for -accuracy and -differential privacy.
4 Re-identifiable Distributions
All of our eventual lower bounds will take the form of a “re-identification” attack, in which we possess data from a large number of individuals, and identify one such individual who was included in the database. In this attack, we choose a distribution on databases and give an adversary 1) a database drawn from that distribution and 2) either or for some row , where is an alleged sanitizer. The adversary’s goal is to identify a row of that was given to the sanitizer. We say that the distribution is re-identifiable if there is an adversary who can identify such a row with sufficiently high confidence whenever outputs accurate answers. If the adversary can do so, it means that there must be a pair of adjacent databases such that the adversary can distinguish from , which means cannot be differentially private.
Here the probability is taken over the choice of and as well as the coins of and . We allow and to share a common state.
Note that, when row is not in the dataset, then it would be an error for to declare that row is in the dataset, and condition 2 requires that the probability of this error occurring is at most .
The common state between and should be thought of as auxiliary information about the realization of that may help identify a user . Formally, we could model this shared state by having output an additional string that is given to but not to . However, we make the shared state implicit to reduce notational clutter. The need for this shared state will become apparent when we use fingerprinting codes to construct re-identifiable distributions; in the context of fingerprinting codes, the shared state represents auxiliary information about a codebook that helps the algorithm accuse a guilty pirate.
Lower Bounds via Fingerprinting Codes
In Section 3.1 we give the relevant background on fingerprinting codes and in Section 3.2 we prove our lower bounds for -way marginals.
Fingerprinting codes were introduced by Boneh and Shaw [BS98] to address the problem of watermarking digital content. A fingerprinting code is a pair of randomized algorithms . The code generator outputs a codebook . Each row of is the codeword of user . For a subset of users , we use to denote the set of codewords of users in . The parameter is called the length of the fingerprinting code.
The security property of fingerprinting codes asserts that any codeword can be “traced” to a user . Moreover, we require that the fingerprinting code is “fully-collusion-resilient”—even if any “coalition” of users gets together and “combines” their codewords in any way that respects certain constraints known as a marking assumption, then the combined codeword can be traced to a user . That is, there is a tracing algorithm that takes as inputs the codebook and combined codeword and outputs either a user or , and we require that if satisfies the constraints, then with high probability. Moreover, should accuse an innocent user, i.e. , with very low probability. Analogous to the definition of re-identifiable distributions (Definition 2.10), we allow and to share a common state.As in Definition 2.10, we could model this by having output an additional string that is given to . However, we make the shared state implicit to reduce notational clutter. When designing fingerprinting codes, one tries to make the marking assumption on the combined codeword as weak as possible.
The basic marking assumption is that each bit of the combined word must match the corresponding bit for some user in . Formally, for a codebook , and a coalition , we define the set of feasible codewords for to be
Observe that the combined codeword is only constrained on coordinates where all users in agree on the -th bit.
We are now ready to formally define a fingerprinting code.
where the probability is taken over the coins of , and . The algorithms and may share a common state.
We remark that our proof of Theorem 3.5, showing how to construct re-identifiable distributions from a fingerprinting codes, will only require collusion resilience against coalitions of size . Our choice to state Definition 3.1 using resilience against arbitrary coalitions is more consistent with the literature on fingerprinting codes.
Tardos [Tar08] constructed a family of fingerprinting codes with a nearly optimal number of users for a given length .
As we will see in the next subsection, fingerprinting codes satisfying Definition 3.1 will imply lower bounds on the sample complexity for releasing -way marginals with -accuracy (accuracy for every query). In order to prove sample-complexity lower bounds for -accuracy with , we will need fingerprinting codes satisfying a stronger security property. Specifically, we will expand the feasible set to include all codewords that satisfy most feasibility constraints, and require that even codewords in this expanded set can be traced. Formally, for any , we define
where the probability is taken over the coins of , and . The algorithms and may share a common state.
In Section 6 we show how to construct error-robust fingerprinting codes with a nearly-optimal number of users that are tolerant to a constant fraction of errors.
2 Lower Bounds for 111-Way Marginals
By combining Theorem 3.5 with Theorem 3.2 we obtain a sample complexity lower bound for -way marginals, and thereby establish Theorem 1.1 in the introduction.
Let be the promised fingerprinting code. We define the re-identifiable distribution to simply be the output distribution of the code generator, . And we define the privacy adversary to take the answers , obtain by rounding each entry of to , run the tracing algorithm on the rounded answers , and return its output. The shared state of and will be the shared state of and .
Now we will verify that is -re-identifiable. First, suppose that outputs answers that are -accurate for -way marginals. That is, there is a set such that and for every , the answer estimates the fraction of rows having a in column to within . Let be rounded to the nearest value in . Let be a column in . If column has all ’s, then , and . Similarly, if column has all ’s, then , and . Therefore, we have
By security of the fingerprinting code (Definition 3.3), we have
But the event is exactly the same as , and thus we have established the first condition necessary for to be -re-identifiable.
The second condition for re-identifiability follows directly from the soundness of the fingerprinting code, which asserts that for every adversary , in particular for , it holds that
Using the additional structure of Tardos’ fingerprinting code, and our robust fingerprinting codes, we can prove minimax lower bounds for an “inference version” of the problem computing the -way marginals of a product distribution.
We can now formally define the problem of inferring the marginals as follows.
Our lower bound can thus be stated as follows,
The proof has the same general structure that we used to prove Theorem 3.5. Here, we describe additional observations about the structure of the fingerprinting codes used in that proof (see Section 6 for a description of Tardos’ fingerprinting code) that allow it to carry over to the inference version of computing -way marginals.
First, in Tardos’ (non-robust) fingerprinting code, the codebook is chosen by first sampling marginals from an appropriate distribution and then sampling from The robust fingerprinting codes we construct in Section 6 also have this property.To generate a codebook for our robust fingerprinting code, we sample a codebook from Tardos’ fingerprinting code and then insert additional columns of all ’s or all ’s to in random locations. Equivalently, we can obtain a codebook by appending ’s and ’s in random locations of to obtain a vector and then sampling from Thus the instances used to prove Theorem 3.5 indeed consist of independent samples from a product distribution, which is what the inference problem assumes.
Next, recall that the proof of Theorem 3.5 shows that any string that is -accurate for the -way marginals of can be traced successfully. It is moreover the case that any string that is -accurate for the marginals can also be traced successfully. This is because the rows of are sampled independently from , so accuracy for the -way marginals of and accuracy for coincide with high probability, at least when :
Let and let . Let denote the exact -way marginals of . Then for every , and , we have with probability at least over the choice of .
We remark that Steinke and Ullman [SU15a] showed that accuracy with respect to the marginals actually suffices to trace regardless of the value of .
These two observations suffice to show that, when is too small, a differentially private algorithm cannot be accurate for with high probability over the choices of both and . Thus, for every differentially private algorithm, there exists some such that the algorithm is not accurate with high probability over the choice of , which means that the algorithm does not accurately infer the marginals of an arbitrary product distribution. ∎
3 Fingerprinting Codes for General Query Families
In this section, we generalize the connection between fingerprinting codes and sample complexity lower bounds for arbitrary sets of queries. We show that a generalized fingerprinting code with respect to any family of counting queries yields a sample complexity lower bound for , which is analogous to our lower bound for -way marginals (Theorem 3.5). We then argue that some type of fingerprinting code is necessary to prove any sample complexity lower bound by exhibiting a tight connection between such lower bounds and a weak variant of our generalized fingerprinting codes.
We begin by defining our generalization of fingerprinting codes. Fix a finite data universe and a set of counting queries over . A generalized fingerprinting code with respect to the family consists of a pair of randomized algorithms . The code generation algorithm produces a codebook . Each row of is the codeword corresponding to user . A coalition of pirates receives the subset of codewords, and produces an answer vector . We replace the traditional marking condition on the pirates with the generalized constraint that they output a feasible answer vector. A natural way to define feasibility for answer vectors is to require a condition similar to -accuracy, i.e. an answer vector is feasible if for all but a fraction of queries . We thus define a generalized set of feasible answer vectors by
When , the generalized set of feasible answer vectors captures the traditional marking assumption by rounding each entry of a feasible answer vector to or .An equivalent way to view a codebook is as a set of codewords , where each user’s codeword is for some . Notice that the case where is the class of -way marginals places no constraints on the structure of a codeword, i.e. a codeword can be any binary string. With this viewpoint, the goal of the pirates is to output an answer vector with for all but a fraction of the queries .
A pair of algorithms is an -fingerprinting code for -accuracy with security if outputs a codebook and for every (possibly randomized) adversary , and every coalition with , if we set , then
where the probability is taken over the coins of , and . The algorithms and may share a common state.
The security properties of Definition 3.12 differ from those of an ordinary fingerprinting code in two ways so as to enable a clean statement of a composition theorem for generalized fingerprinting codes (Theorem 4.6). First, we use two separate security parameters for the different types of tracing errors, as in the definition of re-identifiable distributions. Second, security only needs to hold for coalitions of size or . However, this condition implies security for coalitions of arbitrary size with an increased false accusation probability of .
As in Theorem 3.5, the existence of a generalized -fingerprinting code implies a sample complexity lower bound of for privately releasing answers to , with essentially the same proof.
In particular, if and , then there is no algorithm that is -differentially private and -accurate for .
We now turn to investigate whether a converse to Theorem 3.13 holds. We show that a sample complexity lower bound for a family of queries is essentially equivalent to the existence of a weak type of fingerprinting code, where the tracing procedure depends on the family and the tracing error probabilities satisfy certain affine constraint. It remains an interesting open question to determine the precise relationship between privacy lower bounds and our notion of generalized fingerprinting codes.
A pair of algorithms is an -weak fingerprinting code for -accuracy with security if outputs a codebook and for every (possibly randomized) adversary that outputs a feasible answer vector with probability , and every coalition with , if we set , then
where the probabilities are taken over the coins of , , and . The algorithms and may share a common state.
That is, we require the false accusation probability to be much smaller than the total probability of accusing any user. Note that a tracing algorithm that accuses a random user with probability will falsely accuse a user with probability when ; however, this does not satisfy Definition 3.14 because we require the gap between the two probabilities to be at least a factor of .
Observe that taking in Definition 3.12 yields an -weak fingerprinting code with security . However, Definition 3.14 is weaker than Definition 3.12 in a few important ways. First, security only holds against pirates with a failure probability of at most . Second, while Definition 3.12 requires completeness error , a weak fingerprinting code allows as long as is sufficiently small.
The following theorem shows that the existence of an -weak fingerprinting code is essentially equivalent to a sample complexity lower bound of against .
Then there exists an such that . By differential privacy,
On the other hand, by the security of the weak fingerprinting code and differential privacy,
This yields a contradiction whenever and .
We now show the converse direction, i.e. that the high sample complexity of implies the existence of a weak fingerprinting code. We begin with a technical lemma which shows that the high sample complexity of also rules out mechanisms that satisfy only a one-sided constraint on the probability of any event under the replacement of one row:
Let . Let be an -accurate algorithm for on databases . Suppose we have that for all databases , all , and all measurable that
Let be the VC-dimension of and let
Then there exists a -differentially private algorithm on databases of size that gives -accurate answers to on any database with probability at least .
On input a database , consider the algorithm that samples a random subset consisting of rows from (without replacement) and returns . Then by our hypothesis on , for every and every measurable we have
On the other hand, a “secrecy-of-the-sample” argument [KLN+11] enables us to obtain the reverse inequality. For a row , consider the following two experiments:
Experiment 1: Sample a random subset of rows from .
Experiment 2: Sample , and then sample a random subset of rows from .
Any database sampleable under Experiment 1 appears with probability , but appears with probability at least
Combining the two inequalities shows that for every database and every ,
By Lemma 2.2, the algorithm is -differentially private.
Finally, uniform convergence of the sampling error of implies that it remains an accurate algorithm, and hence so is . In particular, when is a random sample of rows from and is the VC-dimension of , we have [AB09]:
Taking as in the theorem statement makes the total failure probability of at most . ∎
Now we proceed to complete the proof of Theorem 3.15. Suppose has sample complexity greater than for -accuracy (with failure probability ) and -differential privacy. By Lemma 3.16, for every -accurate mechanism for there exists a database with , a set , and an index such that
We now argue that it is without loss of generality to restrict our attention to mechanisms whose range is the finite set . To see this, note that the exact answer to any counting query on a database is in the set . Therefore, if an answer satisfies , then the value
is a point in that also satisfies . Thus, we will henceforth assume that the mechanism’s output lies in this finite range.
We now apply the min-max theorem from game theory (or equivalently, linear programming duality), to exhibit a fixed distribution on for which Inequality (3) holds. Specifically, consider a two-player zero-sum game in which Player 1 chooses a triple , where , , and , and Player 2 chooses a randomized function that is -accurate for . Let the payoff to Player 1 be
By inequality (3), the value of this game is greater than . So by the min-max theorem there exists a mixed strategy for Player 1 that achieves a payoff greater than against any mixed strategy for Player 2. (Note that we can apply the min-max theorem because we have assumed that the mechanism’s output lies in a finite range.) That is, there exists a distribution over triples such that for any randomized algorithm that takes any to a feasible vector in with probability at least ,
Now consider the following code: samples a database , a set , and an index according to the promised distribution . The codebook is where is a random permutation. On input an answer vector , the algorithm checks whether . If it is, then outputs , and otherwise outputs .
To analyze the security of this code, fix a coalition of users using a pirate strategy . Because the codebook is a random permutation of the rows of , it is equivalent to analyze the original database and a random coalition of users. Thus the part of the codebook given to the pirates is a random set of rows from , i.e. for a random with the junk row at index removed. The condition that outputs a feasible answer vector is equivalent to being an -accurate answer vector. Therefore, letting be the algorithm that runs on its input with the junk row removed, we have
On the other hand, the probability that outputs the user not in the coalition is
because the events and are independent. Thus by (4),
where both probabilities are taken over the coins of , and . ∎
A Composition Theorem for Sample Complexity
In this section we state and prove a composition theorem for sample complexity lower bounds. At a high-level the composition theorem starts with two pairs, and , for which we know sample-complexity lower bounds of and respectively, and attempts to prove a sample-complexity lower bound of for a related family of queries on a related data universe.
Specifically, our sample-complexity lower bound will apply to the “product” of and , defined on . We define the product to be
Since are boolean-valued, their conjunction can also be written .
We now begin to describe how we can prove a sample complexity lower bound for . First, we describe a certain product operation on databases. Let , , be a database. Let where be databases. We define the product database as follows: For every , let the -th row of be . Note that we index the rows of by . We will sometimes refer to as the “subdatabases” of .
The key property of these databases is that we can use a query to compute a “subset-sum” of the vector consisting of the answers to on each of the subdatabases. That is, for every and ,
Thus, every approximate answer to a query places a subset-sum constraint on the vector . (Namely, ) If the database and family are chosen appropriately, and the answers are sufficiently accurate, then we will be able to reconstruct a good approximation to . Indeed, this sort of “reconstruction attack” is the core of many lower bounds for differential privacy, starting with the work of Dinur and Nissim [DN03]. The setting they consider is essentially the special case of what we have just described where are each just a single bit (, and contains only the identity query). In Section 5 we will discuss choices of and that allow for this reconstruction.
We now state the formal notion of reconstruction attack that we want and to satisfy.
for at least a fraction of queries , outputs a vector such that
Then we say that enables an -reconstruction attack from -accurate answers to .
A reconstruction attack itself implies a sample-complexity lower bound, as in [DN03]. However, we show how to obtain stronger sample complexity lower bounds from the reconstruction attack by applying it to a product database to obtain accurate answers to queries on its subdatabases. For each query , we run the adversary promised by the reconstruction attack on the approximate answers given to queries of the form . As discussed above, answers to these queries will approximate subset sums of the vector . When the reconstruction attack is given these approximate answers, it returns a vector such that on average over . Running the reconstruction attack for every query gives us a collection where on average over both and . By an application of Markov’s inequality, for most of the subdatabases , we have that on average over the choice of . For each such that this guarantee holds, another application of Markov’s inequality shows that for most queries we have , which is our definition of -accuracy (later enabling us to apply a re-identification adversary for ).
The algorithm we have described for obtaining accurate answers on the subdatabases is formalized in Figure 1.
We are now in a position to state the main lemma that enables our composition technique. The lemma says that if we are given accurate answers to on and the database enables a reconstruction attack from accurate answers to , then we can obtain accurate answers to on most of the subdatabases .
The additional bookkeeping in the proof is to handle the case where is only accurate for most queries. In this case the reconstruction attack may fail completely for certain queries and we need to account for this additional source of error.
Assume the answer vector is -accurate for on . By assumption, enables a reconstruction attack that succeeds in reconstructing an approximation to when given -accurate answers for the family of queries . Consider the set of on which the reconstruction attack succeeds, i.e.
Since is -accurate, an application of Markov’s inequality shows that
Thus, .
Recall that, by (5), we can interpret answers to as subset sums of answers to the subdatabases, so for every ,
for at least a fraction of queries . Since enables a reconstruction attack from -accurate answers to , by Definition 4.1, recovers a vector such that
Since this holds for every , we have
The statement inside the final probability is precisely that is -accurate for on . This completes the proof of the lemma. ∎
We now explain how the main lemma allows us to prove a composition theorem for sample complexity lower bounds. We start with a query family on a database that enables a reconstruction attack, and a distribution over databases in that is re-identifiable from answers to a family . We show how to combine these objects to form a re-identifiable distribution for queries over , yielding a sample complexity lower bound of .
A sample from consists of where each subdatabase is an independent sample from from . The main lemma above shows that if there is an algorithm that is accurate for on , then an adversary can reconstruct accurate answers to on most of the subdatabases . Since these subdatabases are drawn from a re-identifiable distribution, the adversary can the re-identify a member of one of the subdatabases . Since the identified member of is also a member of , we will have a re-identification attack against as well.
We are now ready to formalize our composition theorem.
Let be a family of counting queries on , and let be a family of counting queries on . Let be parameters. Assume that for some parameters , , the following both hold:
There exists a database that enables an -reconstruction attack from -accurate answers to .
There is a distribution on databases that is -re-identifiable from -accurate answers to .
Then there is a distribution on databases that is -re-identifiable from -accurate answers to .
Assume that is -accurate for . By Lemma 4.2, we have
where the last inequality is by (6). Thus, it suffices to prove that
We prove this inequality by giving a reduction to the re-identifiability of . Consider the following sanitizer : On input , first chooses a random index . Next, it samples independently, and sets . Finally, it runs on and then runs the reconstruction attack to recover answers and outputs .
Notice that since are all i.i.d. samples from , their joint distribution is independent of the choice of . Specifically, in the view of , we could have chosen after seeing its output on . Therefore, the following random variables are identically distributed:
, where is the output of on , and .
where .
where the last inequality follows because is a -re-identifiable from -accurate answers to . Thus we have established (8). Combining (7) and (8) completes the proof of the claim.
The next claim follows directly from the definition of and the fact that is -re-identifiable.
For every ,
Combining Claims 4.4 and 4.5 suffices to prove that is -re-identifiable from -accurate answers to , completing the proof of the theorem. ∎
The proof of Theorem 4.3 also yields a composition theorem for generalized fingerprinting codes. Specifically, Theorem 4.6 below shows how to combine a reconstruction attack for a query family on a database with a -generalized fingerprinting code to obtain a -generalized fingerprinting code.
Let be a family of counting queries on , and let be a family of counting queries on . Let be parameters. Assume that for some parameters , , the following both hold:
There exists a database that enables an -reconstruction attack from -accurate answers to .
There exists a -generalized fingerprinting code for -accuracy with security .
Then there is a -generalized fingerprinting code for -accuracy with security .
Applications of the Composition Theorem
In this section we show how to use our composition theorem (Section 4) to combine our new lower bounds for -way marginal queries from Section 3 with (variants of) known lower bounds from the literature to obtain our main results. In Section 5.1 we prove a lower bound for -way marginal queries when is not too small (at least inverse polynomial in ), thereby proving Theorem 1.2 in the introduction. Then in Section 5.2 we obtain a similar lower bound for arbitrary counting queries that allows to take a wider range of parameters..
A known reconstruction-based lower bound of for -way marginals.
A known reconstruction-based lower bound of for -way marginals.
The lower bound of for -way marginals is a special case of a lower bound of due to [Rot10] and based on [DN03], where is the Vapnik-Chervonenkis (VC) dimension of . The lower bound of for -way marginals is due to [KRSU10, De12].
To apply our composition theorem, we need to formulate these reconstruction attack in the language of Definition 4.1. In particular, we observe that these reconstruction attacks readily generalize to allow us to reconstruct fractional vectors , instead of just boolean vectors as in [DN03, Rot10].
First we state and prove that the linear dependence on is necessary.
Let be a collection of counting queries over a data universe . We say a set is shattered by if for every string , there exists a query such that . The VC-Dimension of denoted is the cardinality of the largest subset of that is shattered by .
For each , let where the zero is at the -th index. We will show that is shattered by . For a string , let the query take the conjunction of the bits of at indices set to in . Then iff , so . ∎
Let be a collection of counting queries over a data universe and let . Then there is a database which enables a -reconstruction attack from -accurate answers to .
Let be shattered by , and consider the database . Let be an arbitrary string to be reconstructed and let be -accurate answers. That is, for every
Consider the brute-force reconstruction attack defined in Figure 4. Notice that, since is -accurate, always finds a suitable vector . Namely, the original database satisfies the constraints.
We will show that the reconstructed vector satisfies
Let be the set of coordinates on which and let be the set of coordinates where . Note that
We will show that absolute values of the sums over and are each at most . Since is shattered by , there is a query such that iff . Therefore, by the definitions of and -accuracy,
so by the triangle inequality, . An identical argument shows that , proving that is an accurate reconstruction. ∎
We can now state in our terminology the lower bound of De from [De12] (building on [KRSU10]) showing that the inverse-quadratic dependence on is necessary.
Let be any constant, be any integer, and let be a sufficiently small parameterThe constant was chosen for simplicity, and can be replaced with any constant strictly smaller than . (i.e. bounded by an absolute constant). There exists a constant such that for every there exists a database with such that enables an -reconstruction attack from -accurate answers to the -way marginals .
Although the above theorem is a simple extension of De’s lower bound, we sketch a proof for completeness, and refer the interested reader to [De12] for a more detailed analysis.
To prove that the reconstruction attack succeeds, we will show that there exists a database such that for any , if satisfies
(i.e. has -accurate answers) then returns a vector such that Henceforth we refer to such an simply as -accurate for on as a shorthand. The above guarantee must hold for suitable choices of and to satisfy the theorem.
We will argue that the reconstruction succeeds in two steps. First, we show that reconstruction succeeds if is “nice.” Second, we show that there exists “nice” that has the dimensions promised by the theorem.
To explain what we mean by a “nice” database , for any and family of queries on , we define the matrix as
A matrix is a -Euclidean section if for every vector in the rowspace of we have
Let be a database and be a set of queries such that is a -Euclidean section and the least singular value of is . Let be arbitrary. There exists such that if are -accurate answers for on , and , then satisfies
for The constant hidden in the notation depends only on
Thus, it suffices to find database such that the matrix is a Euclidean section (for some fixed constant ) and has no “small” singular values. A result of Rudelson [Rud12] (strengthening that of Kasiviswanathan et al. [KRSU10]) guarantees that such a database exists.
In particular, there exists a database such that the Hadamard product satisfies the two properties above.
Now fix any and let be -accurate answers to on . Now, if we let , by Lemma 5.6, provided that is smaller than some constant that depends only on , which in turn depends only on , we will have for
1.3 Putting Together the Lower Bound
Now we show how to combine the various attacks to prove Theorem 1.2 in the introduction. We obtain our lower bound by applying two rounds of composition. In the first round, we compose the reconstruction attack of Theorem 5.4 described above with the re-identifiable distribution for -way marginals. We then take the resulting re-identifiable distribution and apply a second round of composition using the reconstruction attack based on the VC-dimension of -way marginals.
We remark that it is necessary to apply the two rounds of composition in this order. In particular, we cannot prove Theorem 1.3 by composing first with the VC-dimension-based reconstruction attack. Our composition theorem requires a re-identifiable distribution from -accurate answers for , whereas the reconstruction attack described in Lemma 5.3 requires -accurate answers, and the reconstruction can fail if some queries have error much larger than . The resulting re-identifiable distribution obtained from composing with this reconstruction attack will also require -accurate answers, and thus cannot be composed further.
We can now formally state and prove our sample-complexity lower bound for -way marginals, thereby establishing Theorem 1.3 in the introduction.
such that there exists a distribution on -row databases that is -re-identifiable from -accurate answers to the -way marginals .
Applying Theorem 4.3 (with parameter ), we obtain item 1’ below. We then bring in another reconstruction attack for the composition theorem.
Using the composition Theorem 4.6 in place of Theorem 4.3, we obtain a version of Theorem 5.8 in the language of generalized fingerprinting codes.
such that there exists a -generalized fingerprinting code with security for -accuracy.
1.4 A Tight Lower Bound for 2-Way Marginals
Theorem 5.8 does not give any non-trivial lower bound for -way marginals. Intuitively, the problem is that the proof uses two rounds of composition, and thus if we try to instantiate the proof for -way marginals, one of the three lower bounds being composed will have to be trivial (i.e. will be a lower bound for -way marginals). However, a simple modification of the proof yields a tight lower bound for -way marginals that holds even for -accuracy.
such that there exists a distribution on -row databases that is -re-identifiable from -accurate answers to the -way marginals .
Applying Theorem 4.3 (with parameter ), we obtain the following: There exists a distribution on databases in that is -re-identifiable from -accurate answers to .
To complete the theorem, note that contains exactly of all the queries in , so -accurate answers to contain -accurate answers to the subset . So our lower bound for the subset is sufficient to obtain the desired lower bound. Finally, note that
2 Lower Bounds for Arbitrary Queries
Using our composition theorem, we can also prove a nearly-optimal sample complexity lower bound as a function of the and and establish Theorem 1.3 in the introduction.
Roughly, the results of [DN03] can be interpreted in our framework as showing that there is an -row database that enables a -reconstruction attack from -accurate answers to some family of queries , but only when the vector to be reconstructed is Boolean. That is, the attack reconstructs a bit vector accurately provided that every query in is answered correctly. Dwork et al. [DMT07, DY08] generalized this attack to only require -accuracy for some constant , and we will make use of this extension (although we do not require computational efficiency, which was a focus of those works). Finally, we need an extension to the case of fractional vectors , instead of Boolean vectors .
The extension is fairly simple and the proof follows the same outline of the original reconstruction attack from [DN03]. We are given accurate answers to queries in , which we interpret as approximate “subset-sums” of the vector that we wish to reconstruct. The reconstruction attack will output any vector from a discretization of the unit interval that is “consistent” with these subset-sums. The main lemma we need is an “elimination lemma” that says that if is sufficiently large, then for a random subset ,
with suitable large constant probability. For this lemma can be established via combinatorial arguments, whereas for the case we establish it via the Berry-Esséen Theorem. The lemma is used to argue that for every that is sufficiently far from , a large fraction of the subset-sum queries will witness the fact that is far from , and ensure that is not chosen as the output.
First we state and prove the lemma that we just described, and then we will verify that it indeed leads to a reconstruction attack.
Let be a constant, let be a parameter with , and let . Then for every such that , and a randomly chosen ,
Let be as in the statement of the lemma. Define a random variable
The condition on the right-hand side says that is in some interval of width . Since the random variables are independent, as is a randomly chosen subset, we will use the Berry-Esséen Theorem (Theorem 5.13) to conclude that this sum does not fall in any interval of this width too often. Establishing the next claim suffices to prove Lemma 5.11.
In order to apply Theorem 5.13 with , we need to analyze the moments of the random variables . The following bounds can be verified from the definition of and the assumption that .
where is an interval of width . Thus we have obtained that falls outside of any interval of width with probability at least . In order to establish the claim, we simply observe that
when . Thus, the probability of falling outside an interval of width is only larger than the probability of falling outside an interval of width . ∎
Establishing Claim 5.12 completes the proof of Lemma 5.11. ∎
Let be a constant, let be a parameter with , and let . For any data universe of size , there is a set of counting queries over of size at most such that the database enables a -reconstruction attack from -accurate answers to .
First we will give a reconstruction algorithm for an arbitrary family of queries. We will then show that for a random set of queries of the appropriate size, the reconstruction attack succeeds for every with non-zero probability, which implies that there exists a set of queries satisfying the conclusion of the theorem. We will use the shorthand
In order to show that the reconstruction attack from Figure 6 succeeds, we must show that Let , and let be the vector obtained by rounding each entry of to the nearest . Then
so it is enough to show that the reconstruction attack outputs a vector close to . Observe that the vector itself satisfies
for any subset-sum query , so the reconstruction attack always finds some vector . To show that the reconstruction is successful, fix any such that If we write , then and . In order to show that no that is far from can be output by , we will show that for any with ,
To prove this, we first observe by Lemma 5.11 (setting ) that for a randomly chosen query defined on ,
The lemma applies because is a random subset-sum of the entries of .
Next, we apply a concentration bound to show that if the set of queries is a sufficiently large random set, then for every vector the fraction of queries for which is large will be close to the expected number, which we have just established is at least . We use the following version of the Chernoff bound.
Consider a set of randomly chosen queries . By the above, we have that for every such that ,
Since the queries are chosen independently, by the Chernoff bound we have
Thus, we can choose to obtain
Thus, we have established that there exists a family of queries such that for every such that ,
Moreover, by -accuracy, we have
Applying a triangle inequality, we can conclude
which implies that cannot be the output of . This completes the proof.
2.2 Putting Together the Lower Bound
Now we show how to combine the various attacks to prove Theorem 1.2 in the introduction. We obtain our lower bound by applying two rounds of composition. In the first round, we compose the reconstruction attack described above with the re-identifiable distribution for -way marginals. We then take the resulting re-identifiable distribution and apply a second round of composition using the reconstruction attack for query families of high VC-dimension.
Just like our lower bound for -way marginal queries, we remark that it is necessary to apply the two rounds of composition in this order. See Section 5.1.3 for a discussion of this issue.
such that there exists a distribution on -row databases that is -re-identifiable from -accurate answers to .
Applying Theorem 4.3 (with parameter ), we obtain item 1’ below. We then bring in another reconstruction attack for the composition theorem.
By Lemma 5.3, there exists a database that enables a -reconstruction attack from -accurate answers to some of size . (In particular, the family of queries can be all -way marginals on the first bits of the data universe items.)
Again, Theorem has a corresponding statement in terms of generalized fingerprinting codes.
such that there exists a -generalized fingerprinting code with security for -accuracy.
Constructing Error-Robust Fingerprinting Codes
In this section, we show how to construct fingerprinting codes that are robust to a constant fraction of errors, which will establish Theorem 3.4. Our codes are based on the fingerprinting code of Tardos [Tar08], which has a nearly optimal number of users, but is not robust to any constant fraction of errors. The number of users in our code is only a constant factor smaller than that of Tardos, and thus our codes also have a nearly optimal number of users.
To motivate our approach, it is useful to see why the Tardos code (and all other fingerprinting codes we are aware of) are not robust to a constant fraction of errors. The reason is that the the only way to introduce an error is to put a in a column containing only ’s or vice versa (recall that the set of codewords, , can be viewed as an matrix). We call such columns “marked columns.” Thus, if the adversary is allowed to introduce errors where is the number of marked columns then he can simply ignore the codewords and output either the all- or all- codeword, which cannot be traced. Thus, in order to tolerate a fraction of errors, it is necessary that where is the length of the codeword, and this is not satisfied by any construction we know of (when is a constant). However, Tardos’ construction can be shown to remain secure if the adversary is allowed to introduce errors, rather than errors, for some constant . We demonstrate this formally in Section 6.2. In addition, we show how to take a fingerprinting code that tolerates errors and modify it so that it can tolerate about errors. This reduction is formalized in Section 6.1. Combining these two results will give us a robust fingerprinting code.
We remark that prior work [BN08, BKM10] has shown how to construct fingerprinting codes satisfying a weaker robustness property. Specifically, their codes allow the adversary to introduce a special “?” symbol in a large fraction of coordinates, but still require that any coordinate that is not a “?” satisfies the feasibility constraint.
Before proceeding with the construction and analysis, we restate some terminology and notation from Section 3. Recall that a fingerprinting code is a pair of algorithms , where specifies a distribution over codebooks consiting of codewords , and either outputs the identity of an accused user or outputs . Recall that and share a common state. For a coalition , we write to denote the subset of codewords belonging to users in .
Every codebook , coalition , and robustness parameter defines a feasible set of combined codewords,
We now recall the definition of an error-robust fingerprinting code from Section 3.1.
where the probability is taken over the coins of , and . The algorithms and may share a common state.
The main result of this section is a construction of fingerprinting codes satisfying Definition 6.1
We remark that we have made no attempt to optimize the fraction of errors to which our code is robust. We leave it as an interesting open problem to construct a robust fingerprinting code for a nearly-optimal number of users that is robust to a fraction of errors arbitrarily close to .
A key step in our construction is a reduction from constructing error-robust fingerprinting codes to constructing a weaker object, which we call a weakly-robust fingerprinting code. The difference between a weakly-robust fingerprinting code and an error-robust fingerprinting code of the previous section is that we now demand that only a fraction of the marked positions can have errors, rather than a fraction of all positions.
In order to formally define weakly-robust fingerprinting codes, we introduce some terminology. If is a codebook, then for , we say that position is -marked in if for every . That is, is -marked if every user has the symbol in the -th position of their codeword. The set consists of all codewords such that for a fraction of positions , either is not marked, or is -marked and . Notice that this constraint is vacuous if fewer than a fraction of positions are marked.
For a weakly-robust fingerprinting code, we will define a more constrained feasible set. Intuitively, a codeword is feasible if for a fraction of positions that are marked, is set appropriately. Note that this condition is meaningful even when the fraction of marked positions is much smaller than . More formally, we define
The next theorem states that if we have an -fingerprinting code that is weakly-robust to a fraction of errors and satisfies a mild technical condition, then we obtain an -fingerprinting code that is robust to an fraction of errors with a similar level of security.
are a -fingerprinting code with security weakly-robust to a fraction of errors, and
with probability at least over , produce that has at least -marked columns and -marked columns.
Then there is a pair of algorithms that are a -fingerprinting code with security robust to a fraction of errors, where
The reduction is given in Figure 7. Recall that and may share state, so and the shared state of and is known to .
Fix a coalition . Let be an adversary. Sample and let . We will show that the reduction is successful by proving that if , then the modified string with probability . The reason is that an adversary who is given (a subset of the rows of) cannot distinguish real columns that are marked from fake columns. Therefore, the fraction of errors in the real marked columns should be close to the fraction of errors that are either real and marked or fake. Since the total fraction of errors in the entire codebook is at most , we know that the fraction of errors in real marked columns is not much larger than . Thus the fraction of errors in the real marked columns will be at most with high probability. We formalize this argument in the following claim.
for any choice of . An identical argument bounds the probability that the number of errors in real -marked columns is more than . Therefore, the probability that more than a fraction of marked columns have errors is at most . ∎
Now define an adversary that takes as input, simulates by appending marked columns to and applying a random permutation , and then applies to the resulting codebook . Then it takes , applies , removes the fake columns, and outputs the result. Notice that applies to a codebook and codeword generated by exactly the same procedure. If we assume that is feasible with parameter , then by the analysis above, with probability at least , is weakly feasible with parameter . Thus,
where the first inequality is by Claim 6.5 and the second inequality is by -security of .
Since does not accuse a user outside of (except with probability at most ) regardless of whether or not that adversary’s codeword is feasible, it is immediate that also does not accuse a user outside of (except with probability at most ). ∎
2 Weak Robustness of Tardos’ Fingerprinting Code
In this section we show that Tardos’ fingerprinting code is weakly robust to a fraction of errors for . Specifically we prove the following:
Tardos’ fingerprinting code is described in Figure 8. Note that the shared state of and will include .
Tardos’ proof that no user is falsely accused (except with probability ) holds for every adversary, regardless of whether or not the adversary’s output is feasible, therefore it holds without modification even when we allow the adversary to introduce errors. So we will state the following lemma from [Tar08, Section 3] without proof.
Let be the fingerprinting code defined in Algorithm 8. Then for every adversary , and every ,
where the probability is taken over the choice of and the coins of .
Most of the remainder of this section is devoted to proving that any adversary who introduces errors into at most a fraction of the marked columns can be traced successfully.
Let be the fingerprinting code defined in Algorithm 8. Then for every adversary , and every ,
where the probability is taken over the choice of and the coins of .
Before giving the proof, we briefly give a high-level roadmap. Recall that in the construction there is a “score” function that is computed for each user, and will output some user whose score is larger than the threshold , if such a user exists. Tardos shows that the sum of the scores over all users is at least , which demonstrates that there exists a user whose score is above the threshold. His argument works by balancing two contributions to the score: 1) the contribution from -marked columns , which will always be positive due to the fact that , and 2) the potentially negative contribution from columns that are not -marked. Conceptually, he shows that the contribution from the -marked columns is larger in expectation than the negative contribution from the other columns, so the expected score is significantly above the threshold. He then applies a Chernoff-type bound to show that the score will be above the threshold with high probability. When the adversary is allowed to introduce errors so that there may be some -marked columns such that , these errors will contribute negatively to the score. The new ingredient in our argument is essentially to bound the negative contribution from these errors. We are able to get a sufficiently good bound to tolerate errors in of the coordinates. We expect that a tighter analysis and more careful tuning of the parameters can improve the fraction of errors that can be tolerated.
We will write . Doing so is without loss of generality as users outside of are irrelevant. We will use to denote the allowable fraction of errors. Fix an adversary . Sample and let . Assume . In order to prove that some user is traced, we will bound the quantity
where is defined to be the number of codewords such that . Our goal is to show that this quantity is at least with high probability. If we can do so, then there must exist a user such that , in which case .
If is unmarked, then .
If is -marked, then .
If is -marked, then .
The number of nonzero coordinates of is at most , where is the number of marked columns of .
We call a satisfying the above constraints valid. By the linearity of , we can write
We will now establish the following claim.
We start by making an observation about the distribution of , which denotes when we condition on a fixed choice of a codebook and a valid choice of . Because the non-zero coordinates of are only in marked columns of (those in which or ), the distribution of
depends only on the number of non-zero coordinates of , and not on their location. To see that this is the case, consider a -marked coordinate on which . The contribution of to is exactly . Similarly, for a -marked coordinate on which , the contribution of to is exactly . Thus we can write
Each term in the first sum (resp. second sum) is a random variable that depends only on the distribution of conditioned on the the -th column being -marked (resp. -marked). Recall that is determined by . Moreover, conditioned on a fixed , the ’s are independent. To see this, let denote the th column of the codebook . Recall that each column is generated independently using , and the ’s themselves are chosen independently. Letting denote the density function of a random variable , this means that the joint density
This shows that the conditional random variables are independent. Moreover, since only depends on the codebook and coins of the adversary , the ’s are still independent when we also condition on . In fact, the following holds:
Conditioned on any fixed choice of and , the following distributions are all identical, independent, and non-negative: 1) (n/q_{j}\mid\textrm{j0-marked}) for , and 2) (nq_{j}\mid\textrm{j1-marked}).
By the discussion above, we know that these random variables are independent. To see that they are identicially distributed, note that the distribution used to generate the th column of is symmetric about . Therefore, the probability that column is -marked when its entries are sampled according to is the same as the probability that is -marked when its entries are sampled according to . Applying Bayes’ rule, again using the fact that and have the same distribution, we see that the random variables (p_{j}\mid\textrm{j0-marked}) and (1-p_{j}\mid\textrm{j1-marked}) are identically distributed. The claim follows since . ∎
and conditioned on having errors, those errors occur on a uniformly random set of marked columns. Thus, if we can show that
provided are sufficiently large.
First, observe that, since the distribution of is symmetric about , . Second, if we let
In order to obtain a strong enough bound, we need to show that . We can calculate
Now we apply the approximation , which holds for . To do so, we choose . Since and , we have for this choice of . Thus we have
The final inequality holds as long as is larger than some absolute constant. (To see that this is the case, recall that , whereas .) So we have established
Plugging this fact into the analysis above, we have
Now all that remains is to apply Markov’s inequality to bound this quantity by .
To get the desired upper bound, it is sufficient to show
where the last inequality holds when . This is sufficient to complete the proof of Claim 6.10. ∎
Lemma 6.7 and 6.8 are sufficient to imply Lemma 6.6, that Tardos’ fingerprinting code is weakly robust. In order to apply our reduction from full robustness to weak robustness (Lemma 6.4), we need to also establish that with high probability there are many marked columns in the matrix for Tardos’ fingerprinting code.
With probability at least over the choice of , it holds that the number of -marked columns and the number of -marked columns are both larger than .
To estimate the number of marked columns, define for each an indicator random variable for whether column is -marked. The ’s are i.i.d., and have expectation at least
A similar argument holds for -marked columns. Thus letting , the codebook has at least -marked columns and -marked columns with probability at least . Now observe that
for larger than some absolute constant. ∎
Combining Lemma 6.4 (reduction from robustness to weak robustness), Lemma 6.6 (weak robustness of Tardos’ code), and Lemma 6.12 (Tardos’ code has many marked columns), suffices to prove Theorem 6.2.
Acknowledgements
We thank Kobbi Nissim for drawing our attention to the question of sample complexity and for many helpful discussions. We thank Adam Smith for suggesting that we use the Gaussian mechanism to provide a new proof of the lower bound on the length of fingerprinting codes. Finally, we thank the anonymous reviewers for their helpful comments.
References
Appendix A Lower Bounds on Fingerprinting Codes via Differential Privacy
By the contrapositive of Theorem 3.5, upper bounds on the sample complexity of answering -way marginals with differential privacy imply a lower bound on the length of any fingerprinting code with a given number of users . As pointed out to us by Adam Smith, this yields a particularly simple, self-contained proof of Tardos’ [Tar08] optimal lower bound on the length of fingerprinting codes. Specifically, using the well known Gaussian mechanism for achieving differential privacy, we can design a simple adversary that violates the security of any traitor tracing scheme with length .
Before diving into the proof, we will state the following elementary fact about Gaussian random variables. The fact simply says that a Gaussian random variable with suitable variance is “close” to a shifted version of itself in a particular sense. This same fact is used to show that adding Gaussian noise of suitable variance provides differential privacy.
Let be the following adversary. Define the vector as
First we claim that outputs feasible codewords with at least constant probability.
For every such that and every codebook
By a standard tail bound for the Gaussian, we have
Now it remains to show that cannot be traced successfully. By assumption has security Then we have in particular
Therefore, there exists such that
To complete the proof, it now suffices to show that if , then
which will contradict the security of the fingerprinting code.
By Fact A.2 (with ), for every ,
Applying (10), and averaging over and , we have
which is the desired contradiction. This completes the proof. ∎