Preventing False Discovery in Interactive Data Analysis is Hard
Moritz Hardt, Jonathan Ullman
Introduction
Empirical research is commonly done by testing multiple hypotheses on a finite sample. A test outcome is deemed statistically signficant if it is unliked to have occured by chance alone. False discovery arises if the analyst incorrectly declares an observation as statistically significant. For decades statisticians have been devising methods for preventing false discovery, such as the widely used and highly influential method for controlling the false discovery rate due to Benjamini and Hochberg [BH95].
Nevertheless the problem of false discovery persists across all empirical sciences today. Popular articles report on an increasing number of invalid research findings. Why is it seemingly so difficult to prevent false discovery? Today’s practice of data analysis diverges from classical statistics in its massive scale, heavy use of sophisticated algorithms, and large number of participants in any given project. Importantly, the way modern data analysts interact with the data set is inherently adaptive—many design choices, including the choice and tuning of the algorithm itself, depend on previous interactions with the data set. An extreme example are data science competitions, in which hundreds of data scientists analyze the same data set and repeatedly evaluate their approach against the same data. This level of adaptivity makes it nearly impossible to give a precise a priori description of the experimental setup.
The SQ model has a number of advantages for our purposes. First, almost all natural machine learning algorithms can be compiled into a sequence of statistical queries. Hence, the model does not give up much generality. Second, it makes it convenient to formalize adaptivity. In the adaptive/interactive setting, the analyst is modeled as an efficient algorithm that given a sequence of queries and answers (previously exchanged with the oracle) produces a new query . We say that an oracle is accurate given samples for adaptively chosen queries, if for every distribution given samples from the oracle accurately responds to any computationally efficient adaptive analyst that makes at most queries. A computationally efficient oracle should run time polynomial in and on input of each query.
A recent work by Dwork, Feldman, Hardt, Pitassi, Reingold and Roth [DFH+14] addresses the problem of answering adaptive statistical queries. Their main result implies that there is a computationally inefficient oracle that accurately answers even an exponential number of adaptively chosen statistical queries. Moreover, they show that a quadratic number of queries can be answered accurately and efficiently. Our main theorem shows that these results are essentially as far as it goes. Under a standard cryptographic hardness assumption, we show that there is no efficient oracle that is accurate on more than a cubic number of adaptively chosen queries.
Assuming one-way functions exist, there is no computationally efficient oracle that given samples is accurate on adaptively chosen queries.
An intuitive interpretation of the theorem is that if an efficient oracle attempts to answer more than statistical queries it cannot in general maintain that its answers are statistically valid with respect to the underlying distribution. Of course, the oracle can always report the exact answer of the query on its data set. However, this strategy does not maintain accuracy on adaptive queries in general and—as our theorem shows—neither does any other computationally efficient approach. From a technical perspective our result gives a strong computational lower bound in the statistical query model. Lower bounds in the statistical query model have been studied for more than two decades. But more broadly speaking, we interpret our result as pointing at an inherent computational obstruction to preventing false discovery in collaborative science.
Note that Theorem 1.1 stands in sharp contrast to the non-adaptive setting. If we fix queries and then sample items from the distribution the observed empirical answer to each query on the data set will be close to the correct answer with high probability so long as . This guarantee follows from a Hoeffding bound together with the union bound.
Our hardness result applies when the dimensionality of the data grows with the sample size more than logarithmically so that is no longer polynomial in This is under the stronger but standard assumption that exponentially hard one-way-functions exist. This requirement is rather mild, and is also necessary. If then the empirical distribution of the samples will be close to the underlying distribution in statistical distance, and thus every statistical query can be answered accurately given the sample. More generally, as we discuss in Section 1.2, there are algorithms that run in time polynomial in and and provide accuracy even on an exponential number of adaptively chosen queries [DFH+14]. Thus, our results show that the dimensionality of the data has a major effect on the hardness of the problem. In fact, we provide a second theorem that shows that if the dimensionality is polynomially large in then we cannot even hope for a computationally unbounded oracle that provides accuracy on adaptive queries.
There is no computationally unbounded oracle that given samples of dimension is accurate on adaptively chosen queries.
While the dimension in the previous theorem has to be large, there are important data sets that exhibit this trade-off between sample size and dimension. A good example are genome wide association studies (GWAS). Here, the sample size corresponds to patients with a certain (possibly rare disease) and is often in the hundreds. The dimensionality corresponds to the number of relevent positions in the human genome and is often in the millions. Moreover, the genome resolution is increasing rapidly with new technology whereas the number of available patients is not.
To conclude this discussion of our results, we believe that adaptivity is an essential element of modern data analysis that ought to be taken into account by theoretical models. At the same time, our theorems demonstrate the intrinsic difficulty of coping with adaptivity.
1 Proof overview
The intuition for our proof is rather simple. We will design a challenge distribution and a computationally efficient adaptive analyst so that the following is true. If any compuationally efficient oracle is given samples drawn from then our adaptive analyst is able to reconstruct samples In other words, the analyst is able to find all but a constant number of samples that the oracle is using. While the analyst has a priori information about the distribution it has no information whatsoever about which sample received. Nevertheless, the analyst can reconstruct essentially all of the hidden sample. Quantitatively, the analyst proceeds in rounds and each round consists of roughly queries. In each round the analyst successfully recovers one data item from the oracle provided that the oracle continues to give accurate answers. After the analyst has recovered almost all samples, the effective sample size of the oracle has shrunk down to a constant size. At this point it is easy for the analyst to find queries on which the oracle gives blatantly inaccurate answers.
The first problem is to recover even a single data point inside the oracle’s sample. To solve this problem we rely on a cryptographic primitive known as a fingerprinting code. Fingerprinting codes were introduced by Boneh and Shaw [BS98] for the problem of watermarking digital content. A fingerprinting code has two components. The first component generates a set of “challenge queries.” The second component is a “tracing algorithm” which takes answers to these queries and returns a data item. The fingerprinting code gives the guarantee that if the challenge queries are answered accurately, and by looking only at how each challenge query is defined on then the tracing algorithm will successfully recover one element in Unfortunately, in general nothing prevents the oracle from evaluating the queries at points outside of In fact, information-theoretically the challenge queries used in our attack reveal information about the unknown distribution that the oracle didn’t have previously. Evaluating the query outside the sample is somewhat unnatural. For example, if the oracle simply outputs an empirical quantity that only depends on the sample this situation will not arise. For such natural oracles our proof is somewhat easier and does not require any cryptographic assumptions. We therefore present this illustrative special case in Section 3.
To obtain a result for all computationally bounded oracles, we need to hide from the oracle the additional information that’s revealed by the query definition outside the sample. To do so, we use an encryption scheme to effectively hide the definition of the query on points outside of from the oracle. The encryption is sufficient to show that, assuming that the oracle is computationally bounded, the tracing algorithm of the fingerprinting code will succeed. We note that encryption schemes suitable for our purpose exist under the standard assumption that one-way functions exist. With this one-round approach in mind, we can proceed iteratively. In the next round we exclude the previously learned data item from the definition of the challenge queries, which ensures that the analyst learns a new item in each round.
There is one important subtlety. The tracing algorithm of the fingerprinting code will only succeed if the oracle answers the challenge queries accurately with respect to its sample However, our assumption is that the oracle is accurate with respect to the underlying distribution rather than the sample We need to worry that eventually the sample and the distribution disagree on the challenge queries. In this case the oracle may be inaccurate on its sample (and hence tracing fails), yet still accurate on the distribution. To rule out this pathological situation we use a measure concentration property of our specific choice of fingerprinting code. Specifically, we the fact that the challenge queries of the code are essentially random predicates with a certain bias. This property allows us to use the randomness of the challenge queries to argue that the sample approximately agrees with the distribution on these queries with sufficiently high probability so long as there are at least elements in the sample that we haven’t reconstructed yet. Due to the approximation error incurred here, we also need to use a somewhat stronger primitive called a robust fingerprinting code that was just recently provided in work by Bun, Ullman and Vadhan [BUV14], which also satisfies the necessary measure concentration property.
2 Connection to privacy and reconstruction attacks
Our work builds on a close connection to the problem of designing privacy-preserving oracles. Here, the goal is to provide answers to statistical queries in such a way that the analyst does not learn the specifics of individual data records but rather global properties of the underlying distribution. A successful approach for formalizing this desideratum is the notion of differential privacy [DMNS06]. Differential privacy requires that the answers given by the oracle are randomized in such a way that the presence or absence of any single data item in the sample cannot be detected. It is known that differential privacy prevents so-called reconstruction attacks. A reconstruction attack is an algorithm that is able to reconstruct most entries of a data set by interacting with the oracle. Such an attack demonstrates that the oracle is blatantly non-private (it fails to satisfy not only differential privacy, but any reasonable notion of privacy). Our work can be considered an efficient reconstruction attack as we give an efficient adaptive analyst that reconstructs almost all of the data points that the oracle uses if the oracle provides accuracy on queries. An immediate consequence of our work is therefore the following result.
Assuming one-way functions exist, any computationally efficient oracle that given samples is accurate on adaptively chosen queries must be blatantly non-private.
This result should be compared with recent work of Ullman [Ull13], which showed that oracles satisfying differential privacy cannot answer even non-adaptively chosen queries. Here we show that if the queries are chosen adaptively, then the same conclusion holds even for oracles that merely thwart blatant non-privacy, up to a factor of loss in the number of queries.
An important difference to the privacy setting is how accuracy is defined. In the privacy setting, accuracy is defined with respect to the oracle’s sample. It is trivial to maintain accuracy with respect to the sample by answering each query with the sample mean, which succeeds even when the oracle is blatantly non-private. In the setting of false discovery, we define accuracy with respect to the underlying distribution and show that achieving this notion of accuracy is hard for the oracle.
We emphasize that exponential running time was known to be inherent for differentially private algorithms that answer statistical queries [Ull13], but prior to our results it was possible that there was an efficient oracle that accurately answered exponentially many adaptively chosen statistical queries via a different approach.
3 Related work
The combination of fingerprinting codes and encryption in our one-round approach is a common technique in the construction of “traitor-tracing schemes.” Traitor-tracing schemes were introduced by Chor, Fiat, and Naor [CFN94], also for the problem of secure distribution of digital content. Dwork et al. [DNR+09] were the first to show that traitor-tracing schemes can be used to prove computational hardness results for differential privacy. Ullman [Ull13] showed that traitor-tracing schemes with certain non-standard security properties can be used to prove strong computational hardness results for differential privacy, and showed how to construct such a scheme. In fact, the one-round approach described above closely mirrors the traitor-tracing scheme constructed in [Ull13]. See [Ull13] for a more detailed discussion of prior work on traitor-tracing and the issues that arise when using traitor-tracing schemes in the context of differential privacy.
Our work was also inspired by recent work of Hardt and Woodruff [HW13], which showed that no low-dimensional linear sketch can give valid answers to even a polynomial number of adaptively chosen queries. Technically our results are largely orthogonal to theirs, since we consider arbitrary computationally efficient statistical query oracles, rather than linear sketches. However, their work also noted the connection between differential privacy and validly answering adaptively chosen queries. On the technical side, our iterative approach was inspired by their results.
There is also a large body of work on the computational hardness of certain learning problems. Many of these results have a similar flavor to ours in showing that any computationally efficient algorithm requires either large running time or a large number of samples from the distribution in order to learn a valid hypothesis. However, we are not aware of any result showing a hardness result specific to adaptively chosen queries.
Acknowledgments
We are extremely grateful to Aaron Roth for raising the issue of adaptivity in false discovery control at the Simons Workshop on Differential Privacy. We thank Cynthia Dwork and Omer Reingold for introducing us to the area of False Discovery Control. We also thank Salil Vadhan for helpful discussions. We acknowledge the Simons Institute for Theoretical Computer Science at Berkeley where this work started.
Preliminaries
The goal is to design an oracle that answers statistical queries about the unknown distribution , given only iid samples from . In this work, we are interested in the case where the queries may be adaptively and adversarially chosen.
Specifically, is a stateful algorithm that holds a tuple of samples , takes a statistical query as input, and returns a real-valued answer . We require that when consists of iid samples from , the answer is close to , and moreover that this condition holds for every query in an adaptively chosen sequence . Formally, we define the accuracy guarantee using the following game with a stateful adversary .
An oracle is -accurate for adaptively chosen queries given samples in if for every adversary ,
Specifically, for any set of codewords , we define
We can now formally define error-robust fingerprinting codes
for every (possibly randomized) adversary and every , if , then
Bun, Ullman, and Vadhan [BUV14] introduced error-robust fingerprinting codes. They gave a construction with nearly-optimal length, building on the nearly-optimal construction of standard (non-robust) fingerprinting codes by Tardos [Tar08].
For our results, we will need an additional technical lemma about the fingerprinting code in [BUV14] that we will use for our results. The lemma states that if is at least a sufficiently large constant, then for most columns , the mean of the -th column of and that of are close. In order to prove the lemma, we need to partially describe the algorithm .
For every , and every such that , we have
Thus, by a triangle inequality, it holds that
Lower bound for natural oracles
In this section we prove our main result in the special case where the oracle satisfies a natural condition, roughly speaking, that it does not evaluate a given query outside its sample. The proof is technically simpler in this case as it is unconditional and does not rely on any cryptographic constructions. Nevertheless, the proof outline is essentially the same as in the general case and so it is instructive to begin with this special case.
An oracle is natural if for every input sample and every two queries and such that for all the answers and that the oracle gives on queries and respectively, are identical if the oracle is deterministic and identically distributed if the oracle is randomized. If the oracle is stateful, then this condition should hold when the oracle is in any of its possible states.
We will now show that there is no natural oracle that is accurate for a sufficiently large number of adaptively chosen queries. To do so, we will construct an adversary that chooses a distribution , and then issues queries to the oracle in such a way that any computationally efficient oracle that is given samples from will fail to answer all queries correctly.
The goal of the recovery phase of the algorithm is to identify most of the samples that are held by the oracle. Once the attacker has this information, he can use it to find queries that distinguish the oracle’s samples from the population and force the oracle to be inaccurate.
In order to recover samples, the attacker will force the oracle to give answers that are consistent with the fingerprinting codes , which are then given to to recover an element of the sample. Our first claim establishes that an accurate oracle will indeed force the oracle to give answers consistent with the fingerprinting codes.
Observe that by definition, for every ,
Since and , for every
Applying the triangle inequality to (1) and (2), this shows
By Lemma 2.4, since , for every ,
where the probability is taken over the choice of . By a union bound over , where , if , then
The claim now follows by combining (3) and (4). ∎
Now that we have established Claim 4.1, we know that in every round , the oracle holding returns a set of answers that are consistent with the fingerprinting code . However, this fact alone is not enough to guarantee that returns a user in , because the queries to the oracle depend on rows of for users outside of , whereas the security of the fingerprinting code applies only to algorithms that only have access to the rows of for users in . However, if we assume that the oracle is natural, then its answers do not depend on information about the query at points outside of the sample
Fix any round and let . By the security of the fingerprinting code, we have that for every algorithm
Observe that the oracle is natural and therefore the answer it gives on any query cannot depend on rows of that belong to users outside of Moreover, the query is on points in and The queries issued in rounds depend only on , which is independent from . Hence, the answer of the oracle depends only on points in We therefore have
By a union bound over , we also have
2 Analysis of the Attack Phase
At this point we know that if the oracle is natural and accurately answers all the queries in the recovery phase, then with high probability . Next we show that if this event occurs, then the probability that the oracle answers accurately in the attack phase is bounded away from by a constant. Since an accurate oracle is required to answer each query accurately with probability at least , we will obtain a contradiction.
To begin with we show that the population answer is close to the value in the attack.
In we have
We will now show that the oracle cannot guess the value of with sufficiently high probability provided that the recovery phase succeeded.
Consider the case where We have
where we used that On the other hand, when , we have
Note that because the oracle is natural it answer only depends on for When the oracle sees only that for every , it cannot give an answer that is simultaneously accurate to within for both the case of and for the case of . The event for every occurs with at least probability as shown above. Conditioned on this event, both cases and have constant probability. Hence, the answer of the oracle must be far from with constant probability. Formally,
By Claim 3.4, we have that Claim 3.5 shows that
The statement of the lemma now follows from a triangle inequality. ∎
3 Putting it together
There is no natural oracle that is -accurate for adaptively chosen queries given samples.
Therefore, if is natural, by Lemma 3.2,
However, the definition of an accurate oracle asserts, in particular, that
Lower bound for all computationally bounded oracles
In this section we will show that there is no computationally efficient oracle that is accurate for a sufficiently large number of adaptively chosen queries, and thereby formally establish Theorem 1.1 in the introduction. To do so, we will construct an adversary that chooses a distribution , and then issues queries to the oracle such that no computationally efficient oracle given samples from can answer all the queries correctly.
Our attack relies on the existence of a semantically secure private-key encryption scheme that we briefly recall here. An encryption scheme is a triple of efficient algorithms with the following syntax:
is a randomized algorithm that takes as input a secret key and a one-bit message and outputs a ciphertext . Formally, .
is a deterministic algorithm that takes as input a secret key and a ciphertext and outputs a decryption . If the ciphertext was an encryption of under the key , then . Formally, if , then with probability .
Roughly, security of the encryption scheme asserts that no polynomial time adversary who does not know the secret key can distinguish encryptions of from encryptions of , even if the adversary has access to an oracle that returns the encryption of an arbitrary message under the unknown key. For convenience, we will require that this security property holds simultaneously for an arbitrary polynomial number of secret keys. The existence of an encryption scheme with this property follows immediately from the existence an ordinary semantically secure encryption scheme. We start with the stronger definition only to simplify our proofs. A secure encryption scheme exists under the minimal cryptographic assumption that one-way functions exist. The formal definition of security is not needed until Section A.
2 Description of the attack
3 Analysis of the recovery phase
The goal of the recovery phase of the algorithm is to identify most of the samples that are held by the oracle. Once the attacker has this information, he can use it to find queries that distinguish the oracle’s keys from the population and force the oracle to be inaccurate.
In order to recover keys, the attacker will force the oracle to give answers that are consistent with the fingerprinting codes , which are then given to to recover an element of the sample. Our first claim establishes that an accurate oracle will indeed force the oracle to give answers consistent with the fingerprinting codes.
Observe that by definition, for every ,
where the last equality is because . Since and ,
Applying the triangle inequality to (7) and (8), this shows
By Lemma 2.4, since , for every , if , then
where the probability is taken over the choice of . By a union bound over , where , if , then
The claim now follows by combining (9) and (10). ∎
Now that we have established Claim 4.1, we know that in every round , the oracle holding returns a set of answers that are consistent with the fingerprinting code . However, this fact alone is not enough to guarantee that returns a user in , because the queries to the oracle depend on rows of for users outside of , whereas the security of the fingerprinting code applies only to algorithms that only have access to the rows of for users in . To remedy this problem we rely on the fact that the rows of outside of are encrypted under keys that are not known to the oracle. Thus, a computationally efficient oracle “does not know” those rows. We can formalize this argument by comparing to an where rows of for users outside of are replaced with zeros, and argue that the adversary cannot distinguish between these two attacks without breaking the security of the encryption scheme.
The proof follows from the security of the encryption scheme. We defer the details to Section A.
The proof is immediate by combining Claim 4.1 and Claim 4.2.
Next, we argue that in , with high probability only outputs users contained in the sample .
Fix any round and let . By the security of the fingerprinting code, we have that for every algorithm
Observe that in , the oracle is never given any input that depends on rows of that belong to users outside of : The queries issued in rounds depend only on , which is independent from . And in round the query only depends on ciphertexts for , which are all independent of whenever . Therefore we have
By a union bound over , we also have
Finally, we show that if with high probability in , then with high probability in . Again, we do so by arguing that and are computationally indistinguishable.
The proof follows from the security of the encryption scheme. We defer the details to Section A.
The proof is immediate by combining Claim 4.4 and Claim 4.5.
4 Analysis of the attack phase
By the arguments of Section 4.3, we know that if the oracle is computationally efficient and accurately answers all the queries in the recovery phase, then with high probability . In this section we will show that if these events indeed occur, then the probability that the oracle answers accurately in the attack phase is bounded away from by a constant. Since an accurate oracle is required to answer each query accurately with probability at least , we will have obtained a contradiction.
To begin with we show that the population answer is close to the value in the real attack.
We will show that the oracle cannot guess the value of with sufficiently high probability in provided that the recovery phase succeeded.
Consider the case where We have
On the other hand, when , we have
Note that in the oracle only sees for When the oracle sees only that for every , it cannot give an answer that is simultaneously accurate to within for both the case of and for the case of . The event for every occurs with at least probability as shown above. Conditioned on this event, both cases and have constant probability. Hence, the answer of the oracle must be far from with constant probability. Formally,
As in the analysis of the recovery phase, we will first argue that the probability the oracle is accurate for in is nearly the same as it is in an where the query has been modified to contain no information about users outside of .
The proof will follow from the security of the encryption scheme. We defer the details to Section A.
By Claim 4.7, we have that Combining Claim 4.8 with Claim 4.9, we further have
The statement of the lemma now follows from a triangle inequality. ∎
5 Putting it together
We can now prove Theorem 1.1 from the introduction.
There is no computationally efficient oracle that is -accurate for adaptively chosen queries given samples.
Therefore, if , by Lemma 4.6,
However, the definition of an accurate oracle asserts, in particular, that
6 An information-theoretic lower bound
Lower bounds for avoiding blatant non-privacy
In this section we show how our arguments also imply that computationally efficient oracles that guarantee accuracy for adaptively chosen statistical queries must be blatantly non-private, and thereby establish Theorem 1.3 in the introduction.
Before we can define blatant non-privacy, we need to define a notion of accuracy that is more appropriate for the application to privacy. In contrast to Definition 2.1 where accuracy is defined with respect to the distribution, here we define accurate with respect to the sample itself. With this change in mind, we model blatant non-privacy via the following game.
If the conclusion holds even for computationally inefficient oracles then we replace “for efficient oracles” with “for unbounded oracles” in the definition.
2 Lower bounds
In this section we show the following theorem
Giving accurate answers to adaptively chosen queries is blatantly non-private for computationally efficient oracles.
We establish this theorem via an adversary that essentially performs only the reconstruction phase of . The adversary is described in Figure 7. Observe that in , we have already established that there is an adversary that recovers a set such that when is drawn from a distribution and gives accurate answers for the distribution . The key difference between that guarantee and the one we must establish, is that here we want to establish blatant non-privacy when the oracle is accurate for the sample. However, this can be addressed via a fairly simple modification to the argument.
As before, we introduce a related “ideal attack” for which we can show recovery succeeds.
Fix any round , let . We will show that
By the security of the fingerprinting code, we have that for every and every algorithm , if , then
Observe that if and for every , , then and . In this case, we have
As we did in proving the lower bounds for answering adaptively chosen statistical queries, we now claim that and are computationally indistinguishable
The analysis is essentially identical to what was shown in the proof of the lower bounds for answering adaptively chosen statistical queries, so we omit the proof.
We can now combine these two claims to prove Theorem 5.3
If is -accurate, then
Since , and implies , we have
As we did in Section 4.6, we can prove an information-theoretic analogue of our hardness result for avoiding blatant non-privacy.
Giving accurate answers to adaptively chosen queries on samples of dimension is blatantly non-private for unbounded oracles.
The proof is essentially identical to what is sketched in Section 4.6.
References
Appendix A Security reductions from Section 4
In Sections 4.3 and 4.4 we made several claims comparing the probability of events in to the probability of events in . Each of these claims follow from the assumed security of the encryption scheme. In this section we restate and prove these claims. Since the claims are all of a similar nature, the proof will be somewhat modular.
Before we begin recall the formal definition of security of an encryption scheme. Security is defined via a pair of oracles and . takes as input the index of a key and a message and returns , whereas takes the same input but returns . The security of the encryption scheme asserts that for randomly chosen secret keys, no computationally efficient adversary can tell whether or not it is interacting with or .
To prove each of these claims , we construct an adversary that will attempt to use to break the security of the encryption. We construct in such a way that its advantage in breaking the security of encryption is precisely the difference in the probability of the event between and , which implies that the difference in probabilities is negligible. The simulator is given in Figure 9
First, observe that for , is computationally efficient as long as , and are all computationally efficient. Efficiency of and will be satisfied by the construction in Theorem 2.3 and efficiency of is by assumption of the claim. Also notice can determine whether has occurred efficiently.
Now we observe that when the oracle is (the oracle that takes as input and and returns ), and are chosen randomly from , then the view of the oracle is identical to . Specifically, the oracle holds a random sample of pairs and is shown queries that are encryptions either under keys it knows or random unknown keys. Moreover, the messages being encrypted are chosen from the same distribution. On the other hand, when the oracle is (the oracle that takes as input and and returns ), then the view of the oracle is identical to . Thus we have that for ,