Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery

Thomas Steinke, Jonathan Ullman

Introduction

Empirical research commonly involves asking multiple “queries” on a finite sample drawn from some population(e.g., summary statistics, hypothesis tests, or learning algorithms). The outcome of a query is deemed significant if it is unlikely to have occurred by chance alone, and a “false discovery” occurs if the analyst incorrectly declares an observation significant. For decades statisticians have been devising methods for preventing false discovery, such as the “Bonferroni correction” [Bon36, Dun61] and the widely used and highly influential method of Benjamini and Hochberg [BH95] for controlling the “false discovery rate.”

Nevertheless, false discovery persists across all empirical sciences, and both popular and scientific articles report on an increasing number of invalid research findings. Typically false discovery is attributed to misuse of statistics. However, another possible explanation is that methods for preventing false discovery do not address the fact that data analysis is inherently adaptive—the choice of queries depends on previous interactions with the data. The issue of adaptivity was recently investigated in a striking paper by Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth [DFH+15] and also by [HU14].

These two papers formalized the problem of adaptive data analysis in Kearns’ statistical-query (SQ) model [Kea93]. In the SQ model, there is an algorithm called the oracle that is given nn samples from an unknown distribution D{\cal D} over some finite universe X={0,1}d\mathcal{X}=\{0,1\}^{d}, where the parameter dd is the dimensionality of the distribution. The oracle must answer statistical queries about D\cal D. A statistical query qq is specified by a predicate p ⁣:X{0,1}p\colon{\cal X}\to\{0,1\} and the answer to a statistical query is

The oracle’s answer aa to a query qq is accurate if aq(D)α|a-q({\cal D})|\leq\alpha with high probability (for suitably small α\alpha). Importantly, the goal of the oracle is to provide answers that “generalize” to the underlying distribution, rather than answers that are specific to the sample. The latter is easy to achieve by outputting the empirical average of the query predicate on the sample.

The analyst makes a sequence of queries q1,q2,,qkq^{1},q^{2},\dots,q^{k} to the oracle, which responds with answers a1,a2,,aka^{1},a^{2},\dots,a^{k}. In the adaptive setting, the query qiq^{i} may depend on the previous queries and answers q1,a1,,qi1,ai1q^{1},a^{1},\dots,q^{i-1},a^{i-1} arbitrarily. We say the oracle is accurate given nn samples for kk adaptively chosen queries if, when given nn samples from an arbitrary distribution D{\cal D}, the oracle accurately responds to any adaptive analyst that makes at most kk queries with high probability. A computationally efficient oracle answers each query in time polynomial in nn and dd.We assume that the analyst only asks queries that can be evaluated on the sample in polynomial time.

Unfortunately, we show that this is not the case, and prove the following nearly optimal hardness result for preventing false discovery.

Assuming the existence of one-way functions, there is no computationally efficient oracle that given nn samples is accurate on O(n2)O(n^{2}) adaptively chosen queries.

As in [HU14], our hardness result applies whenever the dimensionality dd of the data grows with the sample size such that 2d2^{d} is not polynomial in n.n.This is under the stronger, but still standard, assumption that exponentially-hard one-way functions exist. This requirement is both mild and necessary. If n2dn\gg 2^{d} then the empirical distribution of the nn samples will be close to the underlying distribution in statistical distance, so every statistical query can be answered accurately given the sample. Thus, the dimensionality of the data has a major effect on the hardness of the problem. In fact, we can prove a nearly optimal information theoretic lower bound when the dimensionality of the data is much larger than nn.

There is no oracle (even a computationally unbounded one) that given nn samples in dimension d=O(n2)d=O(n^{2}) is accurate on O(n2)O(n^{2}) adaptively chosen queries.

Our result builds on the techniques of [HU14], who use fingerprinting code [BS98, Tar08] to prove their hardness result. In this work, we identify a variant called an interactive fingerprinting code [FT01], which abstracts the technique in [HU14] and gives a more direct way of proving hardness results for adaptive data analysis. A slightly weaker version of our results can be obtained using the nice recent construction of interactive fingerprinting codes due to Laarhoven et al. [LDR+13] as a black box. However, we give a new analysis of (a close variant of) their code, which is simpler and achieves stronger parameters.

Thus, we can summarize the contributions of this work as follows.

We identify interactive fingerprinting codes as the key combinatorial object underlying the hardness of preventing false discovery in adaptive environments, analogous to the way in which (non interactive) fingerprinting codes are the key combinatorial object underlying the hardness of differential privacy.

We use this connection to prove nearly optimal hardness results for preventing false discovery in interactive data analysis.

We give a new Fourier-analytic method for analyzing both interactive and non-interactive fingerprinting codes that we believe is more intuitive, more flexible, and also leads to even stronger hardness results. In particular, using our analysis we are able to prove that these codes are optimally robustIn this context, optimal robustness means that all of our hardness results apply even when the oracle answers only a 1/2+Ω(1)1/2+\Omega(1) fraction of the queries accurately. [BUV14], which can be used to strengthen the hardness results in [Ull13, BUV14, SU15]. Given the importance of fingerprinting codes to adaptive data analysis and privacy, we believe this new analysis will find further applications.

The structure of our proof is rather simple, and closely follows the framework in [HU14]. We will design a challenge distribution D{\cal D} and a computationally efficient adaptive analyst A{\cal A} who knows D\cal D. If any computationally efficient oracle O{\cal O} is given nn samples S={x1,,xn}S=\{x_{1},\dots,x_{n}\} drawn from D{\cal D}, then our analyst A{\cal A} can use the answers of O\cal O to reconstruct the set SS. Using this information, the adversary can construct a query on which SS is not representative of D\cal D.

Our adversary A\cal A and the distribution D\cal D, like that of [HU14], is built from a combinatorial object with a computational “wrapper.” The computational wrapper uses queries that cryptographically “hide” information from the oracle O\cal O. In our work he combinatorial object will be an interactive fingerprinting code (IFPC). An IFPC is a generalization of a (standard) fingerprinting code, which was originally introduced by Boneh and Shaw [BS98] as a way to watermark digital content.

This result suffices for the informal statements made above, but our construction is somewhat more general and has additional parameters and security properties, which we detail in Section 2.

2 Applications to Data Privacy

The adversary used to show hardness of preventing false discovery is effectively carrying out a reconstruction attack against the database of samples. Roughly, if there is an adversary who can reconstruct the set of samples SS from the oracle’s answers, then the oracle is said to be “blatantly non-private”—it reveals essentially all of the data it holds, and so cannot guarantee any reasonable notion of privacy to the owners of the data. Since the seminal work of Dinur and Nissim [DN03], such reconstruction attacks have been used to establish strong limitations on the accuracy of privacy-preserving oracles.

Assuming the existence of one-way functions, every computationally efficient oracle that, given nn samples, is accurate on O(n2)O(n^{2}) adaptively chosen queries is blatantly non private.

Every (possibly computationally unbounded) oracle that, given nn samples in dimension d=O(n2)d=O(n^{2}), is accurate on O(n2)O(n^{2}) adaptively chosen queries is blatantly non private.

3 Additional Related Work

Our work and [HU14] is part of a line of work connecting technology for secure watermarking to lower bounds for private and interactive data analysis tasks. This connection first appeared in the work of Dwork, Naor, Reingold, Rothblum, and Vadhan [DNR+09], who showed that the existence of traitor-tracing schemes implies hardness of differential privacy. Traitor-tracing schemes were introduced by Chor, Fiat, and Naor [CFN94], also for the problem of watermarking digital content. The connection between traitor-tracing and differential privacy was strengthened in [Ull13], which introduced the use of fingerprinting codes in the context of differential privacy, and used them to show optimal hardness results for certain settings. [BUV14] showed that fingerprinting codes can be used to prove nearly-optimal information-theoretic lower bounds for differential privacy, which established fingerprinting codes as the key information-theoretic object underlying lower bounds in differential privacy.

Since there introduction by Boneh and Shaw [BS98] there has been extensive work on fingerprinting codes, most of which is beyond the scope of this discussion. For the standard, non-interactive definition of fingerprinting codes, [Tar08] gave an essentially optimal construction, which has been very influential in most of the subsequent work on the topic. The interactive model of fingerprinting codes was first studied by [FT01] under the name “dynamic traitor-tracing schemes.” Formally their results are in a significantly different model and cannot be used to prove hardness of preventing false discovery. [Tas05] gave the first construction in the model we use, but achieved suboptimal code length. Recently Laarhoven, Doumen, Roelse, Škorić, and de Wegner [LDR+13], gave a construction with nearly optimal length by generalizing Tardos’ code to the interactive setting. Their construction is quite similar to ours, but our analysis is substantively different and leads to sharper and more general guarantees (and we feel is more intuitive).

In an exciting recent paper, [DFH+15] gave the first algorithms for answering arbitrary adaptively chosen statistical queries. Their algorithms rely on known algorithms for answering statistical queries under differential privacy in a black box manner. Recently, [Ull14] showed how to design differentially private mechanisms for answering exponentially many adaptively chosen queries from the richer class of convex empirical risk minimization queries. By the results of [DFH+15], this algorithm is also a (computationally inefficient) oracle that is accurate for exponentially many adaptively chosen convex empirical risk minimization queries.

4 Organization

In Section 2 we define and construct interactive fingerprinting codes, the main technical ingredient we use to establish our results. In Sections 3 and 4 we show how interactive fingerprinting codes can be used to obtain hardness results for preventing false discovery and blatant non privacy, respectively. The definition of interactive fingerprinting codes is contained in Section 2.1 and is necessary for Sections 3 and 4, but the remainder of Section 2 and Sections 3 and 4 can be read in either order.

Interactive Fingerprinting Codes

In order to motivate the definition of interactive fingerprinting codes, it will be helpful to review the motivation for standard, non interactive fingerprinting codes.

Fingerprinting codes were introduced by Boneh and Shaw [BS98] for the problem of watermarking digital content (such as a movie or a piece of software). Consider a company that distributes some content to NN users. Some of the users may illegally distribute copies of the content. To combat this, the company gives each user a unique version of the content by adding distinctive “watermarks” to it. Thus, if the company finds an illegal copy, it can be traced back to the user who originally purchased it. Unfortunately, users may be able to remove the watermarks. In particular, a coalition of users may combine their copies in a way that mixes or obfuscates the watermarks. A fingerprinting code ensures that, even if up to nn users collude to combine their codewords, an illegal copy can be still be traced to at least one of the users.

A key drawback of fingerprinting codes is that we can only guarantee that a single user iSi\in S is traced. This is inherent, as setting the pirate codeword aa to be the codeword of a single user prevents any other user from being identified. We will see that this can be circumvented by moving to an interactive setting.

We are now ready to formally define interactive fingerprinting codes. To do so we make use of the following game between an adversary P\mathcal{P} and the fingerprinting code F\mathcal{F}. Both P\mathcal{P} and F\mathcal{F} may be stateful.

In the remainder of this section, we give a construction of interactive fingerprinting codes, and establish the following theorem.

and false accusation probability δ\delta for

We remark on the parameters of our construction and how they relate to the literature.

The expression for the failure probability ε\varepsilon is a bit mysterious. To interpret it, we fix β=1/2Ω(1)\beta=1/2-\Omega(1) and consider two parameter regimes: δ(Nn)1\delta(N-n)\ll 1 and δ(Nn)1\delta(N-n)\gg 1.

In the traditional parameter regime for fingerprinting codes δ(Nn)=ε1\delta(N-n)=\varepsilon^{\prime}\ll 1, and so no users are falsely accused. Then our fingerprinting code has length O(n2log((Nn)/ε))O(n^{2}\log((N-n)/\varepsilon^{\prime})) and a failure probability of ε\varepsilon^{\prime}. This matches the result of [LDR+13].

However, if we are willing to tolerate falsely accusing a small constant fraction of users, then we can set, for example, δ(Nn)=.01N\delta(N-n)=.01N, and our fingerprinting code will have length O(n2)O(n^{2}) and failure probability 2Ω(n)2^{-\Omega(n)}. To our knowledge, such large values of δ\delta have not been considered before. It saves a logarithmic factor in our final result.

Our construction works for any robustness parameter β<1/2\beta<1/2. Previously [BUV14] gave a construction for β=1/75\beta=1/75 in the non-interactive setting. Previous constructions in the interactive setting do not achieve any robustness β>0\beta>0, even for the weaker model of robustness to erasures [BN08].

Our completeness condition differs subtly from previous work. We require that, with high probability,

While our version is less natural in the watermarking setting, it is important to our application to false dicsovery. Our interactive fingerprinting code ensures that the adversary cannot be consistent with respect to the population, rather than that it cannot be consistent with respect to the sample.

2 The Construction

Our construction and analysis is based on the optimal (non interactive) fingerprinting codes of Tardos [Tar08], and the robust variant by Bun et al. [BUV14]. The code is essentially the same, but columns are generated and shown to the adversary one at a time, and tracing is modified to identify users interactively.

We begin with some definitions and notation. For 0a<b10\leq a<b\leq 1, let Da,bD_{a,b} be the distribution with support (a,b)(a,b) and probability density function μ(p)=Ca,b/p(1p)\mu(p)=C_{a,b}/\sqrt{p(1-p)}, where Ca,bC_{a,b} is a normalising constant.To sample from Da,bD_{a,b}, first sample φ(sin1(a),sin1(b))\varphi\in(\sin^{-1}(\sqrt{a}),\sin^{-1}(\sqrt{b})) uniformly, then output sin2(φ)\sin^{2}(\varphi) as the sample. For α,ζ(0,1/2)\alpha,\zeta\in(0,1/2), let Dα,ζ\overline{D}_{\alpha,\zeta} be the distribution on $thatreturnsasamplefromthat returns a sample fromD_{\alpha,1-\alpha}withprobabilitywith probability1-2\zetaandorand or1eachwithprobabilityeach with probability\zeta$.

3 Analysis Overview

Intuitively, the quantity sijs_{i}^{j}, which we call the score of user ii, measures the “correlation” between the answers (a1,,aj)(a^{1},\cdots,a^{j}) of P\mathcal{P} and the ii-th codeword (ci1,,cij)(c_{i}^{1},\cdots,c_{i}^{j}), using a particular measure of correlation that takes into account the choices p1,,pjp^{1},\dots,p^{j}. If sijs_{i}^{j} ever exceeds the threshold σ\sigma, meaning that the answers are significantly correlated with the ii-th codeword, then we accuse user ii. Thus, our goal is to show two things: Soundness, that the score of an innocent user (i.e. i∉S1i\not\in S^{1}) never exceeds the threshold, as the answers cannot be correlated with the unknown ii-th codeword. And completeness, that the score of every guilty user (i.e. iS1i\in S^{1}) will at some point exceed the threshold, meaning that the answers must correlate with the ii-th codeword for every iS1i\in S^{1}.

3.2 Completeness

The hidden constants are set to ensure that Equation (2) conflicts with Equation (1). Thus, we can conclude that P\mathcal{P} cannot give consistent answers for a 1β1-\beta fraction of rounds. That is to say, P\mathcal{P} is forced to be inconsistent because all of S1S^{1} is accused and eventually P\mathcal{P} sees none of the codewords and is reduced to guessing an answer aja^{j}.

3.3 Establishing Correlation

Proving Equation (1) is key to the analysis. Our proof thereof combines and simplifies the analyses of [Tar08] and [BUV14]. For this high level overview, we ignore the issue of robustness and fix β=0\beta=0.

where the expectations are taken over the randomness of pjp^{j}, cjc^{j}, and aja^{j}. Equation (3), combined with a concentration result, implies Equation (1).

The intuition behind Equation (3) and the choice of pjp^{j} is as follows. Consistency guarantees that, if cij=bc^{j}_{i}=b for all iS1i\in S^{1}, then aj=ba^{j}=b. This is a weak correlation guarantee, but it suffices to ensure correlation between aja^{j} and iS1cij\sum_{i\in S^{1}}c^{j}_{i}. The affine scaling ϕpj\phi^{p^{j}} ensures that ϕpj(cij)\phi^{p^{j}}(c^{j}_{i}) has mean zero (i.e. is uncorrelated with a constant) and and unit variance (i.e. has unit correlation with itself). The expectation of ajϕpj(cij)a^{j}\cdot\phi^{p^{j}}(c^{j}_{i}) can be interpreted as the ii-th first-order Fourier coefficient of aja^{j} as a function of cjc^{j}. To understand first-order Fourier coefficients, consider the “dictator” function: Suppose aj=cija^{j}=c^{j}_{i^{*}} for some iS1i^{*}\in S^{1} - that is, P\mathcal{P} always outputs the ii^{*}-th bit. Then

This example can be generalised to aja^{j} being an arbitrary function of cS1jc^{j}_{S^{1}} using Fourier analysis. This calculation also indicates why we choose the probability density function of pjDα,1αp^{j}\sim D_{\alpha,1-\alpha} to be proportional to 1/p(1p)1/\sqrt{p(1-p)}.

To handle robustness (β>0\beta>0) we use the ideas of [BUV14]. With probability 2ζ2\zeta each round is a “special” constant round—i.e. cj=(1)Nc^{j}=(1)^{N} or cj=(1)Nc^{j}=(-1)^{N}. Otherwise it is a “normal” round where cjc^{j} is sampled as before. Intuitively, the adversary P\mathcal{P} cannot distinguish the special rounds from the normal rounds in which cc happens to be constant. If the adversary gives inconsistent answers on normal rounds, then it must also give inconsistent answers on special rounds. Since there are many more special rounds than normal rounds, this means that a small number of inconsistencies in normal rounds implies a large number of inconsistencies on special rounds. Conversely, inconsistencies are absorbed by the special rounds, so we can assume there are very few inconsistencies in normal rounds. Thus P\mathcal{P} is forced to behave consistently on the normal rounds and the analysis on these rounds proceeds as before.

4 Proof of Soundness

We first show that no user is falsely accused except with probability δ/2\delta/2. This boils down to proving a concentration bound. Then another concentration bound shows that with high probability at most a δ\delta fraction of users are falsely accused.

These concentrations bounds are essentially standard. However, we are showing concentration of sums of variables of the form ϕp(c)\phi^{p}(c), which may be quite large if p0p\approx 0 or p1p\approx 1. This technical problem prevents us from directly applying standard concentration bounds. Instead we open up the standard proofs and verify the desired concentration. We take the usual approach of bounding the moment generating function and using that to give a tail bound.

For p[α,1α]{0,1}p\in[\alpha,1-\alpha]\cup\{0,1\} and t[α/2,α/2]t\in[-\sqrt{\alpha}/2,\sqrt{\alpha}/2], we have

Let p1pm[α,1α]{0,1}p_{1}\cdots p_{m}\in[\alpha,1-\alpha]\cup\{0,1\} and c1cmc_{1}\cdots c_{m} drawn independently with cipic_{i}\sim p_{i}. Let a1ama_{1}\cdots a_{m}\in be fixed. For all λ0\lambda\geq 0, we have

By Lemma 2.4, for all t[α/2,α/2]t\in[-\sqrt{\alpha}/2,\sqrt{\alpha}/2],

Set t=min{α/2,λ/2m}t=\min\{\sqrt{\alpha}/2,\lambda/2m\}. If λ[0,mα]\lambda\in[0,m\sqrt{\alpha}], then

On the other hand, if λmα\lambda\geq m\sqrt{\alpha}, then

The result is obtained by adding these expressions. ∎

The following theorem shows how we can beat the union bound for tail bounds on partial sums.

This is a useful if we are in a setting where S1|S^{1}| is unknown: if S1>n|S^{1}|>n, then the interactive fingerprinting code will still not make too many false accusations, even if it fails to identify all of S1S^{1}.

If δ<1/(NS1)\delta<1/(N-|S^{1}|), then this is a very poor bound. Instead we use the fact that the EiE_{i}s are discrete and Markov’s inequality, which amounts to a union bound. For δ(NS1)<1\delta(N-|S^{1}|)<1, we have

The following lemma will be useful later.

5 Proof of Completeness

For this section, assume that the adversary P\mathcal{P} is always consistent - that is, we have no robustness and β=0\beta=0. Robustness will be added in Section 2.5.2. Here we establish that the scores have good expectation, namely

In this section we deviate from the proof in [Tar08]. We use biased Fourier analysis to give a more intuitive proof of the correlation bound.

We have the following lemma and proposition, which relate the correlation ajiS1ϕpj(cij)a^{j}\cdot\sum_{i\in S^{1}}\phi^{p^{j}}(c_{i}^{j}) to the properties of aja^{j} as a function of pjp^{j}. To interpret these imagine that ff represents the adversary P\mathcal{P} with one round viewed in isolation – the fingerprinting code gives the adversary cjc^{j} and the adversary responds with f(cSjj)f(c^{j}_{S^{j}}).

Firstly, the following lemma gives an interpretation of the correlation value for a fixed pjp^{j}.

Thus, for any p(0,1)p\in(0,1), we can write ff in terms of these basis functions:

This decomposition is a generalisation of Fourier analysis to biased distributions [O’D14, §8.4]. For p,q(0,1)p,q\in(0,1), the expansion of ff gives the following expressions for g(q)g(q), g(q)g^{\prime}(q) and g(p)g^{\prime}(p).

Now we can interpret the correlation for a random pjDa,bp^{j}\sim D_{a,b}.

This effectively follows by integrating Lemma 2.11.

Let μ(p)=Ca,b/p(1p)\mu(p)=C_{a,b}/\sqrt{p(1-p)} be the probability density function for the distribution Da,bD_{a,b} on the interval (a,b)(a,b). By Lemma 2.11 and the fundamental theorem of calculus, we have

It remains to show that Ca,b=(2sin1(b)2sin1(a))11/πC_{a,b}=\left(2\sin^{-1}(\sqrt{b})-2\sin^{-1}(\sqrt{a})\right)^{-1}\geq 1/\pi. This follows from observing that

Now we have a lemma to bring consistency into the picture. If ff is consistent, b1b\approx 1, and a0a\approx 0, then

This gives a lower bound on the correlation for consistent ff.

5.2 Robustness

We require the fingerprinting code to be robust to inconsistent answers. We show that the correlation is still good in the presence of inconsistencies.

For f:{±1}n{±1}f:\{\pm 1\}^{n}\to\{\pm 1\}, define a random variable ξα,ζ(f)\xi_{\alpha,\zeta}(f) by

The following bounds the expected increase in scores from one round of interaction.

Let f:{±1}n{±1}f:\{\pm 1\}^{n}\to\{\pm 1\} and α,ζ(0,1/2)\alpha,\zeta\in(0,1/2). Then

5.3 Concentration

So far we have shown that the fingerprinting code achieves good correlation or the adversary is not consistent in expectation. However, we need this to hold with high probability. Thus we now show that sums of ξα,ζ(f)\xi_{\alpha,\zeta}(f) variables concentrate around their expectation.

Again, the proofs in this section are standard. However, the ξα,ζ(f)\xi_{\alpha,\zeta}(f) variables can be quite unwieldy and we are thus unable to apply standard results directly. So instead we must open the proofs and verify that the concentration bounds hold. We proceed by bounding the moment generating function of ξα,ζ(f)\xi_{\alpha,\zeta}(f) and then proving an Azuma-like concentration inequality. These calculations are not novel or insightful.

Let f:{±1}n{±1}f:\{\pm 1\}^{n}\to\{\pm 1\}, α(0,1/2)\alpha\in(0,1/2), ζ[1/4,1/2)\zeta\in[1/4,1/2), and t[α/8,α/8]t\in[-\sqrt{\alpha}/8,\sqrt{\alpha}/8]. Then

where C=64enα/4αC=\frac{64e^{n\alpha/4}}{\alpha}.

Let Y=i[n]ϕp(ci)Y=\sum_{i\in[n]}\phi^{p}(c_{i}). By Lemma 2.4 and independence,

for t[α/2,α/2]t\in[-\sqrt{\alpha}/2,\sqrt{\alpha}/2]. Pick t{±α/2}t\in\{\pm\sqrt{\alpha}/2\} such that

Then by dropping positive terms, for all j1j\geq 1,

Thus we have bounded the even moments of YY. By Cauchy-Schwartz, for k=2j+13k=2j+1\geq 3,

For t[α/8,α/8]t\in[-\sqrt{\alpha}/8,\sqrt{\alpha}/8], we have

XiX_{i} is determined by Ui\mathcal{U}_{i},

μi\mu_{i} is determined by Ui1\mathcal{U}_{i-1}, and

Ui1\mathcal{U}_{i-1} is determined by Ui\mathcal{U}_{i}.

Suppose that, for all i[m]i\in[m], uΩu\in\Omega, and t[c,c]t\in[-c,c],

First we show by induction on k[m]k\in[m] that, for all uΩu\in\Omega and t[c,c]t\in[-c,c],

This clearly holds for k=1k=1, as this is our supposition for i=mi=m. Now suppose this holds for some k[m1]k\in[m-1]. For uΩu\in\Omega and t[c,c]t\in[-c,c], we have

for all t[0,c]t\in[0,c] and λ>0\lambda>0. Set t=min{c,λ/2mC}t=\min\{c,\lambda/2mC\} to obtain the result. ∎

5.4 Bounding the Score

Now we can finally show that the scores are large with high probability.

Since the adversary P\mathcal{P} is computationally unbounded and arbitrary, we may assume it is deterministic. We may also assume n=S1n=|S^{1}| and that the adversary is able to see cS1jc^{j}_{S^{1}} at each round. (This only gives the adversary more power.)

where \sim denotes having the same distribution. We have

Now we can apply the above lemmas to bound the expectation and tail of this random variable.

for all fjf^{j}. Moreover, by Proposition 2.15,

for all t[α/8,α/8]t\in[-\sqrt{\alpha}/8,\sqrt{\alpha}/8], where C=70/α64enα/4/αC=70/\alpha\geq 64e^{n\alpha/4}/\alpha, as α1/4n\alpha\leq 1/4n.

However, we can also prove that the scores are small with high probability. This follows from the fact that users with large scores are accused and therefore no user’s score can be too large:

Now we show that the conflicting bounds of Theorem 2.17 and Lemma 2.18 imply completeness - that is, the adversary P\mathcal{P} cannot be consistent.

We claim this is a contradiction, which then holds with high probability, thus proving the theorem.

Substituting these into Equation (7) gives

Now we use 12ζ=12(12β)1-2\zeta=\frac{1}{2}\left(\frac{1}{2}-\beta\right) and γ=2π12ζζ=(12β)πζ\gamma=\frac{2}{\pi}\frac{1-2\zeta}{\zeta}=\frac{\left(\frac{1}{2}-\beta\right)}{\pi\zeta} to derive a contradiction from Equation (8):

Since ζ=1214(12β)\zeta=\frac{1}{2}-\frac{1}{4}\left(\frac{1}{2}-\beta\right), we have

This gives a contradiction. The total failure probability is bounded by

assuming (12β)n1\left(\frac{1}{2}-\beta\right)n\geq 1. ∎

6 Non-Interactive Fingerprinting Codes

Our construction and analysis also gives a construction of traditional non-interactive fingerprinting codes. First we give a formal definition of a fingerprinting code.

Our construction and analysis is readily adapted to the non-interactive setting. We obtain the following theorem.

and false accusation probability δ\delta for

Hardness of False Discovery

In this section we prove our main result - that answering O(n2)O(n^{2}) adaptive queries given nn samples is hard. But first we must formally define the model in which we are working.

Given a distribution D\mathcal{D} over {0,1}d\{0,1\}^{d}, we would like to answer statistical queries about D\mathcal{D}. A statistical query on {0,1}d\{0,1\}^{d} is specified by a function q:{0,1}dq:\{0,1\}^{d}\to and (abusing notation) is defined to be

Our goal is to design an oracle O\mathcal{O} that answers statistical queries on D\mathcal{D} using only iid samples x1,,xn\mboxRDx_{1},\dots,x_{n}\leftarrow_{\mbox{\tiny R}}\mathcal{D}. Our focus is the case where the queries are chosen adaptively and adversarially.

2 Encryption Schemes

Our attack relies on the existence of a semantically secure private-key encryption scheme. An encryption scheme is a triple of efficient algorithms (Gen,Enc,Dec)(\mathit{\mathit{Gen}},\mathit{\mathit{Enc}},\mathit{\mathit{Dec}}) with the following syntax:

Gen\mathit{\mathit{Gen}} is a randomized algorithm that takes as input a security parameter λ\lambda and outputs a λ\lambda-bit secret key. Formally, sk\mboxRGen(1λ)sk\leftarrow_{\mbox{\tiny R}}\mathit{\mathit{Gen}}(1^{\lambda}).

Dec\mathit{\mathit{Dec}} is a deterministic algorithm that takes as input a secret key and a ciphertext ctct and outputs a decrypted message mm^{\prime}. If the ciphertext ctct was an encryption of mm under the key sksk, then m=mm^{\prime}=m. Formally, if ct\mboxREnc(sk,m)ct\leftarrow_{\mbox{\tiny R}}\mathit{\mathit{Enc}}(sk,m), then Dec(sk,ct)=m\mathit{\mathit{Dec}}(sk,ct)=m with probability 11.

Roughly, security of the encryption scheme asserts that no polynomial time adversary who does not know the secret key can distinguish encryptions of m=0m=0 from encryptions of m=1m=1, even if the adversary has access to an oracle that returns the encryption of an arbitrary message under the unknown key. For convenience, we will require that this security property holds simultaneously for an arbitrary polynomial number of secret keys. The existence of an encryption scheme with this property follows immediately from the existence an ordinary semantically secure encryption scheme. We start with the stronger definition only to simplify our proofs. A secure encryption scheme exists under the minimal cryptographic assumption that one-way functions exist. The formal definition of security is not needed until Section A.

3 The Attack

4 Informal Analysis of the Attack

Before formally analysing the attack, we comment on the overall structure thereof.

In order to do this, the oracle could decrypt qjq^{j} to obtain cjc^{j} for every jj. However, the oracle does not have all the necessary secret keys; it only has the secret keys corresponding to its sample SS. Thus, by the security of the encryption scheme, any efficient oracle effectively can only see cSTjjc^{j}_{S\setminus T^{j}}. That is to say, if the oracle is computationally efficient, then it has the same restriction as a fingerprinting adversary P\mathcal{P}. Thus, any computationally efficient oracle must lose the fingerprinting game, meaning it cannot answer every query (or even just a β=1/2+Ω(1)\beta=1/2+\Omega(1) fraction of the queries) accurately.

One subtly arises since “accuracy” for the oracle is defined with respect to the true answer qj(D)=1Ni[N]Tjcij,q^{j}(\mathcal{D})=\frac{1}{N}\sum_{i\in[N]\setminus T^{j}}c^{j}_{i}, whereas “accuracy” in the fingerprinting game is defined with respect to the average over all of cjc^{j}, that is 1Ni[N]cij\frac{1}{N}\sum_{i\in[N]}c^{j}_{i}. We deal with these subtleties by arguing that TjT^{j}, which is the number of users accused by the interactive fingerprinting code prior to the jj-th query, is small. Here we use the fact that the fingerprinting code only allows a relatively small number of false accusations N/1000N/1000. Therefore Tjn+N/1000N/500|T^{j}|\leq n+N/1000\leq N/500. As a result, the definition of accuracy guaranteed by the oracle will be close enough to the definition of accuracy required for the interactive fingerprinting code to succeed in identifying the sample.

5 Analysis of the Attack

In this section we prove our main result:

This follows straightforwardly from a reduction to the security of the fingerprinting code. Notice that the query qjq^{j} does not depend on any entry cijc^{j}_{i} for i∉STj1i\not\in S\setminus T^{j-1}. Thus, an adversary for the fingerprinting code who has access to cSTj1jc^{j}_{S\setminus T^{j-1}} can simulate the view of the oracle. Since we have for any adversary P\mathcal{P}

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 3.3 and 3.4 we easily obtain the following.

where the second equality is because by construction ctij\mboxREnc(ski,cij)ct^{j}_{i}\leftarrow_{\mbox{\tiny R}}\mathit{\mathit{Enc}}(sk_{i},c^{j}_{i}) and the inequality is because we have cij{±1}c^{j}_{i}\in\{\pm 1\}.

Noting that N/1000+n<N/500N/1000+n<N/500 and combining with (10), we have

Applying the triangle inequality to (9) and (11), we obtain

As before, we can argue that the real attack and the ideal attack are computationally indistinguishable, and thus the oracle must also give consistent answers in the ideal attack.

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 3.6 and 3.7 we easily obtain the following.

Putting the above claims together, we obtain the main theorem:

Assume for the sake of contradiction that there were such an oracle. Theorem 2.2 implies that an interactive fingerprinting code of length O(n2/(12β)4)O(n^{2}/\left(\frac{1}{2}-\beta\right)^{4}) exists, so the attack can be carried out. By Claim 3.8 we would have

Note that the constants in the (0.99,β,1/2)(0.99,\beta,1/2)-accuracy assumption are arbitrary and have only been fixed for simplicity.

6 An Information-Theoretic Lower Bound

As in [HU14], we observe that the techniques underlying our computational hardness result can also be used to prove an information-theoretic lower bound when the dimension of the data is large. At a high level, the argument uses the fact that the encryption scheme we rely on only needs to satisfy relatively weak security properties, specifically security for at most O(n2)O(n^{2}) messages. This security property can actually be achieved against computationally unbounded adversaries provided that the length of the secret keys is O(n2)O(n^{2}). As a result, our lower bound can be made to hold against computationally unbounded oracles, but since the secret keys have length O(n2)O(n^{2}), we will require d=O(n2)d=O(n^{2}). We refer the reader to [HU14] for a slightly more detailed discussion, and simply state the following result.

Hardness of Avoiding Blatant Non Privacy

In this section we show how our arguments also imply that computationally efficient oracles that guarantee accuracy for adaptively chosen statistical queries must be blatantly non-private.

Before we can define blatant non-privacy, we need to define a notion of accuracy that is more appropriate for the application to privacy. In contrast to Definition 3.1 where accuracy is defined with respect to the distribution, here we define accurate with respect to the sample itself. With this change in mind, we model blatant non-privacy via the following game.

where q(x)=1ni[n]q(xi)q(x)=\frac{1}{n}\sum_{i\in[n]}q(x_{i}) is the average over the sample.

2 Lower Bounds

In this section we show the following theorem

Assuming one-way functions exist, any computationally efficient oracle O\mathcal{O} that gives accurate answers to O(n2)O(n^{2}) adaptively chosen queries is blatantly non-private.

We will start by establishing that the number of falsely accused users is small. That is, we have TLxn/10000|T^{L}\setminus x|\leq n/10000 with high probability. As in Section 3, this condition will follow from the security of the interactive fingerprinting code F\mathcal{F} combined with the security of the encryption scheme, via the introduction of an “ideal attack” (Figure 8).

This follows straightforwardly from a reduction to the security of the fingerprinting code. Notice that since the query qjq^{j} does not depend on any entry cijc^{j}_{i} for i∉xTj1i\not\in x\setminus T^{j-1}. Thus, an adversary for the fingerprinting code who has access to cxTj1jc^{j}_{x\setminus T^{j-1}} can simulate the view of the oracle. Since we have for any adversary P\mathcal{P}

Now we can argue that an efficient oracle cannot distinguish between the real attack and the ideal attack. Thus the conclusion that TLxn/10000|T^{L}\setminus x|\leq n/10000 with high probability must also hold in the real game.

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 4.4 and 4.5 we easily obtain the following.

By Claim 4.6 we have xxn/10000|x^{\prime}\setminus x|\leq n/10000. Now, in order to show xxn/100|x^{\prime}\triangle x|\leq n/100, it suffices to show that xxn/200|x\setminus x^{\prime}|\leq n/200. In order to do so we begin with the following claim, which establishes that if the oracle O\mathcal{O} is sufficiently accurate, and xTj1n/200|x\setminus T^{j-1}|\leq n/200, then the oracle returns a consistent answer to the query qjq^{j}. Recalling that we use θj\theta^{j} to denote the number of rounded answers ak\overline{a}^{k} for 1kj1\leq k\leq j that are inconsistent with cjc^{j}, we can state the following claim.

After renormalizing by (n/nTj1)(n/n-|T^{j-1}|) we have

Since 0Tj1xn/100000\leq|T^{j-1}\setminus x|\leq n/10000 (by Claim 4.6), and since the algorithm terminates unless Tj1499n/500|T^{j-1}|\leq 499n/500, we obtain

By the assumption that O\mathcal{O} is (1/1000,β,1/2)(1/1000,\beta,1/2)-sample-accurate, we have that, with probability at least 1/21/2, for (1β)L(1-\beta)L choices of j[L]j\in[L],

Finally, observe that if cij=1c^{j}_{i}=1 for every i[2n]i\in[2n], then we have

and by (15) we have (n/(nTj))aj12/3=1/3(n/(n-|T^{j}|))a^{j}\geq 1-2/3=1/3. Thus, the rounded answer aj=1\overline{a}^{j}=1. Similarly, if cij=1c^{j}_{i}=-1 for every i[2n]i\in[2n], then we have aj=1\overline{a}^{j}=-1. This completes the proof of the claim. ∎

As before, we can argue that the real attack and the ideal attack are computationally indistinguishable, and thus the oracle must also give consistent answers in the ideal attack.

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 4.7 and 4.8 we easily obtain the following.

Putting it together, we obtain the following theorem.

which is a contradiction. This completes the proof of the theorem. ∎

3 An Information-Theoretic Lower Bound

As we did in Section 3.6, we can prove an information-theoretic analogue of our hardness result for avoiding blatant non-privacy.

The proof is essentially identical to what is sketched in Section 3.6.

Acknowledgements

We thank Moritz Hardt and Salil Vadhan for insightful discussions during the early stages of this work. We also thank Thijs Laarhoven for bringing his work on interactive fingerprinting codes to our attention.

References

Appendix A Security Reductions from Sections 3 and 4

In Section 3 we made several claims comparing the probability of events in Attack\mathsf{Attack} to the probability of events in IdealAttack\mathsf{IdealAttack}. Each of these claims follow from the assumed security of the encryption scheme. In this section we restate and prove these claims. Since the claims are all of a similar nature, the proof will be somewhat modular. The claims in Section 4 relating PrivacyAttack\mathsf{PrivacyAttack} to IdealPrivacyAttack\mathsf{IdealPrivacyAttack} can be proven in an essentially identical fashion, and we omit these proofs for brevity.

Before we begin recall the formal definition of security of an encryption scheme. Security is defined via a pair of oracles E0\mathcal{E}_{0} and E1\mathcal{E}_{1}. E1(sk1,,skN,)\mathcal{E}_{1}(sk_{1},\dots,sk_{N},\cdot) takes as input the index of a key i[N]i\in[N] and a message mm and returns Enc(ski,m)\mathit{\mathit{Enc}}(sk_{i},m), whereas E0(sk1,,skN,)\mathcal{E}_{0}(sk_{1},\dots,sk_{N},\cdot) takes the same input but returns Enc(ski,0)\mathit{\mathit{Enc}}(sk_{i},0). The security of the encryption scheme asserts that for randomly chosen secret keys, no computationally efficient adversary can tell whether or not it is interacting with E0\mathcal{E}_{0} or E1\mathcal{E}_{1}.

We now restate the relevant claims from Section 3.

To prove both of these claims, for c{1,2}c\in\{1,2\} we construct an adversary Bc\mathcal{B}_{c} that will attempt to use O\mathcal{O} to break the security of the encryption. We construct Bc\mathcal{B}_{c} in such a way that its advantage in breaking the security of encryption is precisely the difference in the probability of the event ZcZ_{c} between Attack\mathsf{Attack} and IdealAttack\mathsf{IdealAttack}, which implies that the difference in probabilities is negligible. The simulator is given in Figure 9

First, observe that for c{1,2}c\in\left\{1,2\right\}, Bc\mathcal{B}_{c} is computationally efficient as long as F\mathcal{F} and O\mathcal{O} are both computationally efficient. It is not hard to see that our construction F\mathcal{F} is efficient and efficiency of O\mathcal{O} is an assumption of the claim. Also notice B\mathcal{B} can determine whether ZcZ_{c} has occurred efficiently.

Now we observe that when the oracle is E1\mathcal{E}_{1} (the oracle that takes as input ii and mm and returns Enc(ski,m)\mathit{\mathit{Enc}}(\overline{sk}_{i},m)), and sk1,,skN\overline{sk}_{1},\dots,\overline{sk}_{N} are chosen randomly from Gen(1λ)\mathit{\mathit{Gen}}(1^{\lambda}), then the view of the oracle is identical to Attackn,d[O]\mathsf{Attack}_{n,d}[\mathcal{O}]. Specifically, the oracle holds a random sample of pairs (i,ski)(i,sk_{i}) and is shown queries that are encryptions either under keys it knows or random unknown keys. Moreover, the messages being encrypted are chosen from the same distribution. On the other hand, when the oracle is E0\mathcal{E}_{0} (the oracle that takes as input ii and ctct and returns Enc(ski,0)\mathit{\mathit{Enc}}(\overline{sk}_{i},0)), then the view of the oracle is identical to Attackn,d[O]\mathsf{Attack}_{n,d}[\mathcal{O}]. Thus we have that for c{1,2}c\in\{1,2\},