Interactive Fingerprinting Codes and the Hardness of Preventing False Discovery

Thomas Steinke, Jonathan Ullman

Introduction

Empirical research commonly involves asking multiple “queries” (e.g., summary statistics, hypothesis tests, or learning algorithms) on a finite sample drawn from some population. The outcome of a query is deemed significant if it is unlikely to have occurred by chance alone, and a “false discovery” occurs if the analyst incorrectly declares an observation significant. For decades statisticians have been devising methods for preventing false discovery, such as the “Bonferroni correction” [Bon36, Dun61] and the widely used and highly influential method of Benjamini and Hochberg [BH95] for controlling the “false discovery rate.”

Nevertheless, false discovery persists across all empirical sciences, and both popular and scientific articles report on an increasing number of invalid research findings. Typically false discovery is attributed to misuse of statistics. However, another possible explanation is that methods for preventing false discovery do not address the fact that data analysis is inherently adaptive—the choice of queries depends on previous interactions with the data. The issue of adaptivity was recently investigated in a striking paper by Dwork, Feldman, Hardt, Pitassi, Reingold, and Roth [DFH+15] and also by [HU14].

These two papers formalized the problem of adaptive data analysis in Kearns’ statistical-query (SQ) model [Kea93]. In the SQ model, there is an algorithm called the oracle that is given $n$ samples from an unknown distribution ${\cal D}$ over some finite universe $\mathcal{X}=\{0,1\}^{d}$ , where the parameter $d$ is the dimensionality of the distribution. The oracle must answer statistical queries about $\cal D$ . A statistical query $q$ is specified by a predicate $p\colon{\cal X}\to\{0,1\}$ and the answer to a statistical query is $q({\cal D})=\mathbb{E}_{x\sim{\cal D}}[p(x)]$ .

The oracle’s answer $a$ to a query $q$ is accurate if $|a-q({\cal D})|\leq\alpha$ with high probability (for suitably small $\alpha$ ). Importantly, the goal of the oracle is to provide answers that “generalize” to the underlying distribution, rather than answers that are specific to the sample. The latter is easy to achieve by outputting the empirical average of the query predicate on the sample.

The analyst makes a sequence of queries $q^{1},q^{2},\dots,q^{k}$ to the oracle, which responds with answers $a^{1},a^{2},\dots,a^{k}$ . In the adaptive setting, the query $q^{i}$ may depend on the previous queries and answers $q^{1},a^{1},\dots,q^{i-1},a^{i-1}$ arbitrarily. We say the oracle is accurate given $n$ samples for $k$ adaptively chosen queries if, when given $n$ samples from an arbitrary distribution ${\cal D}$ , the oracle accurately responds to any adaptive analyst that makes at most $k$ queries with high probability. A computationally efficient oracle answers each query in time polynomial in $n$ and $d$ . We assume that the analyst only asks queries that can be evaluated on the sample in polynomial time.
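To make the role of adaptivity concrete, the following is a minimal sketch (not taken from the paper) of an analyst defeating the naive oracle that answers with empirical averages. The distribution, the parameter values `n`, `d`, `seed`, and all function names are assumptions chosen for illustration: features and label are independent uniform bits, so every query about agreement with the label has true answer close to $1/2$.

```python
import random

def sample_point(rng, d):
    """Draw one point from the assumed distribution D: d features and a label,
    all independent uniform +/-1 values, so no feature predicts the label."""
    return [1 if rng.random() < 0.5 else -1 for _ in range(d + 1)]

def empirical_answer(query, sample):
    """The naive oracle: answer a statistical query with its sample average."""
    return sum(query(x) for x in sample) / len(sample)

def adaptive_attack(n=100, d=500, seed=0):
    rng = random.Random(seed)
    sample = [sample_point(rng, d) for _ in range(n)]

    # Round 1: d queries of the form "does feature i agree with the label?"
    answers = []
    for i in range(d):
        q_i = lambda x, i=i: 1.0 if x[i] == x[d] else 0.0
        answers.append(empirical_answer(q_i, sample))

    # Round 2: one adaptive query built from the previous answers.
    # Each feature votes with the sign suggested by its empirical answer.
    signs = [1 if a > 0.5 else -1 for a in answers]

    def final_query(x):
        vote = sum(s * x[i] for i, s in enumerate(signs))
        return 1.0 if vote * x[d] > 0 else 0.0

    on_sample = empirical_answer(final_query, sample)

    # Estimate the true answer on fresh points; it stays near 1/2 because
    # the label is independent of every feature.
    fresh = [sample_point(rng, d) for _ in range(2000)]
    on_population = empirical_answer(final_query, fresh)
    return on_sample, on_population
```

Calling `adaptive_attack()` returns the empirical answer to the final query together with an estimate of its true answer; the gap between the two illustrates how a single adaptively chosen query can be far from representative of the population.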

Unfortunately, we show that this is not the case, and prove the following nearly optimal hardness result for preventing false discovery.

Assuming the existence of one-way functions, there is no computationally efficient oracle that given $n$ samples is accurate on $O(n^{2})$ adaptively chosen queries.

As in [HU14], our hardness result applies whenever the dimensionality $d$ of the data grows with the sample size such that $2^{d}$ is not polynomial in $n.$ This is under the stronger, but still standard, assumption that exponentially-hard one-way functions exist. This requirement is both mild and necessary. If $n\gg 2^{d}$ then the empirical distribution of the $n$ samples will be close to the underlying distribution in statistical distance, so every statistical query can be answered accurately given the sample. Thus, the dimensionality of the data has a major effect on the hardness of the problem. In fact, we can prove a nearly optimal information theoretic lower bound when the dimensionality of the data is much larger than $n$ .

There is no oracle (even a computationally unbounded one) that given $n$ samples in dimension $d=O(n^{2})$ is accurate on $O(n^{2})$ adaptively chosen queries.

Our result builds on the techniques of [HU14], who use fingerprinting codes [BS98, Tar08] to prove their hardness result. In this work, we identify a variant called an interactive fingerprinting code [FT01], which abstracts the technique in [HU14] and gives a more direct way of proving hardness results for adaptive data analysis. A slightly weaker version of our results can be obtained using the nice recent construction of interactive fingerprinting codes due to Laarhoven et al. [LDR+13] as a black box. However, we give a new analysis of (a close variant of) their code, which is simpler and achieves stronger parameters.

Thus, we can summarize the contributions of this work as follows.

We identify interactive fingerprinting codes as the key combinatorial object underlying the hardness of preventing false discovery in adaptive environments, analogous to the way in which (non-interactive) fingerprinting codes are the key combinatorial object underlying the hardness of differential privacy.

We use this connection to prove nearly optimal hardness results for preventing false discovery in interactive data analysis.

We give a new Fourier-analytic method for analyzing both interactive and non-interactive fingerprinting codes that we believe is more intuitive, more flexible, and also leads to even stronger hardness results. In particular, using our analysis we are able to prove that these codes are optimally robust [BUV14], which can be used to strengthen the hardness results in [Ull13, BUV14, SU15]. (In this context, optimal robustness means that all of our hardness results apply even when the oracle answers only a $1/2+\Omega(1)$ fraction of the queries accurately.) Given the importance of fingerprinting codes to adaptive data analysis and privacy, we believe this new analysis will find further applications.

The structure of our proof is rather simple, and closely follows the framework in [HU14]. We will design a challenge distribution ${\cal D}$ and a computationally efficient adaptive analyst ${\cal A}$ who knows $\cal D$ . If any computationally efficient oracle ${\cal O}$ is given $n$ samples $S=\{x_{1},\dots,x_{n}\}$ drawn from ${\cal D}$ , then our analyst ${\cal A}$ can use the answers of $\cal O$ to reconstruct the set $S$ . Using this information, the adversary can construct a query on which $S$ is not representative of $\cal D$ .

Our adversary $\cal A$ and the distribution $\cal D$ , like those of [HU14], are built from a combinatorial object with a computational “wrapper.” The computational wrapper uses queries that cryptographically “hide” information from the oracle $\cal O$ . In our work the combinatorial object will be an interactive fingerprinting code (IFPC). An IFPC is a generalization of a (standard) fingerprinting code, which was originally introduced by Boneh and Shaw [BS98] as a way to watermark digital content.

This result suffices for the informal statements made above, but our construction is somewhat more general and has additional parameters and security properties, which we detail in Section 2.

2 Applications to Data Privacy

The adversary used to show hardness of preventing false discovery is effectively carrying out a reconstruction attack against the database of samples. Roughly, if there is an adversary who can reconstruct the set of samples $S$ from the oracle’s answers, then the oracle is said to be “blatantly non-private”—it reveals essentially all of the data it holds, and so cannot guarantee any reasonable notion of privacy to the owners of the data. Since the seminal work of Dinur and Nissim [DN03], such reconstruction attacks have been used to establish strong limitations on the accuracy of privacy-preserving oracles.

Assuming the existence of one-way functions, every computationally efficient oracle that, given $n$ samples, is accurate on $O(n^{2})$ adaptively chosen queries is blatantly non-private.

Every (possibly computationally unbounded) oracle that, given $n$ samples in dimension $d=O(n^{2})$ , is accurate on $O(n^{2})$ adaptively chosen queries is blatantly non-private.

3 Additional Related Work

Our work and [HU14] are part of a line of work connecting technology for secure watermarking to lower bounds for private and interactive data analysis tasks. This connection first appeared in the work of Dwork, Naor, Reingold, Rothblum, and Vadhan [DNR+09], who showed that the existence of traitor-tracing schemes implies hardness of differential privacy. Traitor-tracing schemes were introduced by Chor, Fiat, and Naor [CFN94], also for the problem of watermarking digital content. The connection between traitor-tracing and differential privacy was strengthened in [Ull13], which introduced the use of fingerprinting codes in the context of differential privacy, and used them to show optimal hardness results for certain settings. [BUV14] showed that fingerprinting codes can be used to prove nearly-optimal information-theoretic lower bounds for differential privacy, which established fingerprinting codes as the key information-theoretic object underlying lower bounds in differential privacy.

Since their introduction by Boneh and Shaw [BS98] there has been extensive work on fingerprinting codes, most of which is beyond the scope of this discussion. For the standard, non-interactive definition of fingerprinting codes, [Tar08] gave an essentially optimal construction, which has been very influential in most of the subsequent work on the topic. The interactive model of fingerprinting codes was first studied by [FT01] under the name “dynamic traitor-tracing schemes.” Formally their results are in a significantly different model and cannot be used to prove hardness of preventing false discovery. [Tas05] gave the first construction in the model we use, but achieved suboptimal code length. Recently Laarhoven, Doumen, Roelse, Škorić, and de Weger [LDR+13] gave a construction with nearly optimal length by generalizing Tardos’ code to the interactive setting. Their construction is quite similar to ours, but our analysis is substantively different and leads to sharper and more general guarantees (and we feel is more intuitive).

In an exciting recent paper, [DFH+15] gave the first algorithms for answering arbitrary adaptively chosen statistical queries. Their algorithms rely on known algorithms for answering statistical queries under differential privacy in a black box manner. Recently, [Ull14] showed how to design differentially private mechanisms for answering exponentially many adaptively chosen queries from the richer class of convex empirical risk minimization queries. By the results of [DFH+15], this algorithm is also a (computationally inefficient) oracle that is accurate for exponentially many adaptively chosen convex empirical risk minimization queries.

4 Organization

In Section 2 we define and construct interactive fingerprinting codes, the main technical ingredient we use to establish our results. In Sections 3 and 4 we show how interactive fingerprinting codes can be used to obtain hardness results for preventing false discovery and blatant non-privacy, respectively. The definition of interactive fingerprinting codes is contained in Section 2.1 and is necessary for Sections 3 and 4, but the remainder of Section 2 and Sections 3 and 4 can be read in either order.

Interactive Fingerprinting Codes

In order to motivate the definition of interactive fingerprinting codes, it will be helpful to review the motivation for standard, non-interactive fingerprinting codes.

Fingerprinting codes were introduced by Boneh and Shaw [BS98] for the problem of watermarking digital content (such as a movie or a piece of software). Consider a company that distributes some content to $N$ users. Some of the users may illegally distribute copies of the content. To combat this, the company gives each user a unique version of the content by adding distinctive “watermarks” to it. Thus, if the company finds an illegal copy, it can be traced back to the user who originally purchased it. Unfortunately, users may be able to remove the watermarks. In particular, a coalition of users may combine their copies in a way that mixes or obfuscates the watermarks. A fingerprinting code ensures that, even if up to $n$ users collude to combine their codewords, an illegal copy can still be traced to at least one of the users.

A key drawback of fingerprinting codes is that we can only guarantee that a single user $i\in S$ is traced. This is inherent, as setting the pirate codeword $a$ to be the codeword of a single user prevents any other user from being identified. We will see that this can be circumvented by moving to an interactive setting.

We are now ready to formally define interactive fingerprinting codes. To do so we make use of the following game between an adversary $\mathcal{P}$ and the fingerprinting code $\mathcal{F}$ . Both $\mathcal{P}$ and $\mathcal{F}$ may be stateful.

In the remainder of this section, we give a construction of interactive fingerprinting codes, and establish the following theorem.

and false accusation probability $\delta$ for

We remark on the parameters of our construction and how they relate to the literature.

The expression for the failure probability $\varepsilon$ is a bit mysterious. To interpret it, we fix $\beta=1/2-\Omega(1)$ and consider two parameter regimes: $\delta(N-n)\ll 1$ and $\delta(N-n)\gg 1$ .

In the traditional parameter regime for fingerprinting codes $\delta(N-n)=\varepsilon^{\prime}\ll 1$ , and so no users are falsely accused. Then our fingerprinting code has length $O(n^{2}\log((N-n)/\varepsilon^{\prime}))$ and a failure probability of $\varepsilon^{\prime}$ . This matches the result of [LDR+13].

However, if we are willing to tolerate falsely accusing a small constant fraction of users, then we can set, for example, $\delta(N-n)=.01N$ , and our fingerprinting code will have length $O(n^{2})$ and failure probability $2^{-\Omega(n)}$ . To our knowledge, such large values of $\delta$ have not been considered before. It saves a logarithmic factor in our final result.

Our construction works for any robustness parameter $\beta<1/2$ . Previously [BUV14] gave a construction for $\beta=1/75$ in the non-interactive setting. Previous constructions in the interactive setting do not achieve any robustness $\beta>0$ , even for the weaker model of robustness to erasures [BN08].

Our completeness condition differs subtly from previous work. We require that, with high probability,

While our version is less natural in the watermarking setting, it is important to our application to false discovery. Our interactive fingerprinting code ensures that the adversary cannot be consistent with respect to the population, rather than that it cannot be consistent with respect to the sample.

2 The Construction

Our construction and analysis are based on the optimal (non-interactive) fingerprinting codes of Tardos [Tar08], and the robust variant by Bun et al. [BUV14]. The code is essentially the same, but columns are generated and shown to the adversary one at a time, and tracing is modified to identify users interactively.

We begin with some definitions and notation. For $0\leq a<b\leq 1$ , let $D_{a,b}$ be the distribution with support $(a,b)$ and probability density function $\mu(p)=C_{a,b}/\sqrt{p(1-p)}$ , where $C_{a,b}$ is a normalising constant. To sample from $D_{a,b}$ , first sample $\varphi\in(\sin^{-1}(\sqrt{a}),\sin^{-1}(\sqrt{b}))$ uniformly, then output $\sin^{2}(\varphi)$ as the sample. For $\alpha,\zeta\in(0,1/2)$ , let $\overline{D}_{\alpha,\zeta}$ be the distribution on $[0,1]$ that returns a sample from $D_{\alpha,1-\alpha}$ with probability $1-2\zeta$ and $0$ or $1$ each with probability $\zeta$ .
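The sampling procedure just described can be sketched in a few lines. The function names `sample_D` and `sample_D_bar` and the optional `rng` argument are illustrative assumptions; the procedure itself follows the description above.

```python
import math
import random

def sample_D(a, b, rng=random):
    """Sample from D_{a,b}: draw an angle uniformly between asin(sqrt(a)) and
    asin(sqrt(b)), then return the square of its sine."""
    lo, hi = math.asin(math.sqrt(a)), math.asin(math.sqrt(b))
    angle = rng.uniform(lo, hi)
    return math.sin(angle) ** 2

def sample_D_bar(alpha, zeta, rng=random):
    """Sample from the mixture: 0 or 1 with probability zeta each,
    otherwise a draw from D_{alpha, 1 - alpha}."""
    u = rng.random()
    if u < zeta:
        return 0.0
    if u < 2 * zeta:
        return 1.0
    return sample_D(alpha, 1 - alpha, rng)
```

Passing a seeded `random.Random` instance as `rng` makes the draws reproducible, which is convenient when experimenting with different values of $\alpha$ and $\zeta$.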

3 Analysis Overview

Intuitively, the quantity $s_{i}^{j}$ , which we call the score of user $i$ , measures the “correlation” between the answers $(a^{1},\cdots,a^{j})$ of $\mathcal{P}$ and the $i$ -th codeword $(c_{i}^{1},\cdots,c_{i}^{j})$ , using a particular measure of correlation that takes into account the choices $p^{1},\dots,p^{j}$ . If $s_{i}^{j}$ ever exceeds the threshold $\sigma$ , meaning that the answers are significantly correlated with the $i$ -th codeword, then we accuse user $i$ . Thus, our goal is to show two things: Soundness, that the score of an innocent user (i.e. $i\not\in S^{1}$ ) never exceeds the threshold, as the answers cannot be correlated with the unknown $i$ -th codeword. And completeness, that the score of every guilty user (i.e. $i\in S^{1}$ ) will at some point exceed the threshold, meaning that the answers must correlate with the $i$ -th codeword for every $i\in S^{1}$ .
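The score-and-threshold idea can be sketched as follows. This is an illustration under stated assumptions rather than the construction itself: the explicit formula for `phi` is inferred from the requirement that the scaling has mean zero and unit variance, the convention that `phi` is zero when $p\in\{0,1\}$ is an assumption, and the names `update_scores`, `scores`, and `accused` are hypothetical.

```python
import math

def phi(c, p):
    """Assumed affine scaling of a bit c in {-1, +1} that equals +1 with
    probability p: subtract the mean 2p - 1 and divide by the standard
    deviation 2*sqrt(p(1-p)). For p in {0, 1} the column is constant and
    this sketch takes the contribution to be 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return (c - (2 * p - 1)) / (2 * math.sqrt(p * (1 - p)))

def update_scores(scores, accused, column, p, answer, sigma):
    """One round of tracing: add answer * phi(c_i, p) to every user's score
    and accuse each user whose running score exceeds the threshold sigma."""
    for i, c in enumerate(column):
        scores[i] += answer * phi(c, p)
        if scores[i] > sigma:
            accused.add(i)
    return scores, accused
```

For example, with `p = 0.5` the scaling is the identity on $\{\pm 1\}$, so a user whose bit matches the answer gains one point and a user whose bit disagrees loses one point.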

3.2 Completeness

The hidden constants are set to ensure that Equation (2) conflicts with Equation (1). Thus, we can conclude that $\mathcal{P}$ cannot give consistent answers for a $1-\beta$ fraction of rounds. That is to say, $\mathcal{P}$ is forced to be inconsistent because all of $S^{1}$ is accused and eventually $\mathcal{P}$ sees none of the codewords and is reduced to guessing an answer $a^{j}$ .

3.3 Establishing Correlation

Proving Equation (1) is key to the analysis. Our proof thereof combines and simplifies the analyses of [Tar08] and [BUV14]. For this high level overview, we ignore the issue of robustness and fix $\beta=0$ .

where the expectations are taken over the randomness of $p^{j}$ , $c^{j}$ , and $a^{j}$ . Equation (3), combined with a concentration result, implies Equation (1).

The intuition behind Equation (3) and the choice of $p^{j}$ is as follows. Consistency guarantees that, if $c^{j}_{i}=b$ for all $i\in S^{1}$ , then $a^{j}=b$ . This is a weak correlation guarantee, but it suffices to ensure correlation between $a^{j}$ and $\sum_{i\in S^{1}}c^{j}_{i}$ . The affine scaling $\phi^{p^{j}}$ ensures that $\phi^{p^{j}}(c^{j}_{i})$ has mean zero (i.e. is uncorrelated with a constant) and unit variance (i.e. has unit correlation with itself). The expectation of $a^{j}\cdot\phi^{p^{j}}(c^{j}_{i})$ can be interpreted as the $i$ -th first-order Fourier coefficient of $a^{j}$ as a function of $c^{j}$ . To understand first-order Fourier coefficients, consider the “dictator” function: Suppose $a^{j}=c^{j}_{i^{*}}$ for some $i^{*}\in S^{1}$ , that is, $\mathcal{P}$ always outputs the $i^{*}$ -th bit. Then

This example can be generalised to $a^{j}$ being an arbitrary function of $c^{j}_{S^{1}}$ using Fourier analysis. This calculation also indicates why we choose the probability density function of $p^{j}\sim D_{\alpha,1-\alpha}$ to be proportional to $1/\sqrt{p(1-p)}$ .
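The dictator calculation can be checked directly. The sketch below assumes the explicit scaling $\phi^{p}(c)=(c-(2p-1))/(2\sqrt{p(1-p)})$ for $c\in\{\pm 1\}$ with $\mathbb{P}[c=1]=p$; this formula is an assumption consistent with the mean-zero, unit-variance description, and $p$ stands for $p^{j}$.

```latex
\[
\mathbb{E}_{c^{j}}\left[c^{j}_{i^{*}}\cdot\phi^{p}(c^{j}_{i})\right]
  = \begin{cases} 2\sqrt{p(1-p)} & \text{if } i=i^{*}, \\ 0 & \text{if } i\neq i^{*}, \end{cases}
\]
\[
\int_{\alpha}^{1-\alpha} 2\sqrt{p(1-p)}\cdot\frac{C_{\alpha,1-\alpha}}{\sqrt{p(1-p)}}\,\mathrm{d}p
  = 2\,C_{\alpha,1-\alpha}\,(1-2\alpha).
\]
```

The factor $\sqrt{p(1-p)}$ in the correlation cancels against the density, so the expected correlation of a dictator is a constant that does not depend on how $p$ is spread over the interval.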

To handle robustness ( $\beta>0$ ) we use the ideas of [BUV14]. With probability $2\zeta$ each round is a “special” constant round—i.e. $c^{j}=(1)^{N}$ or $c^{j}=(-1)^{N}$ . Otherwise it is a “normal” round where $c^{j}$ is sampled as before. Intuitively, the adversary $\mathcal{P}$ cannot distinguish the special rounds from the normal rounds in which $c$ happens to be constant. If the adversary gives inconsistent answers on normal rounds, then it must also give inconsistent answers on special rounds. Since there are many more special rounds than normal rounds, this means that a small number of inconsistencies in normal rounds implies a large number of inconsistencies on special rounds. Conversely, inconsistencies are absorbed by the special rounds, so we can assume there are very few inconsistencies in normal rounds. Thus $\mathcal{P}$ is forced to behave consistently on the normal rounds and the analysis on these rounds proceeds as before.

4 Proof of Soundness

We first show that no user is falsely accused except with probability $\delta/2$ . This boils down to proving a concentration bound. Then another concentration bound shows that with high probability at most a $\delta$ fraction of users are falsely accused.

These concentration bounds are essentially standard. However, we are showing concentration of sums of variables of the form $\phi^{p}(c)$ , which may be quite large if $p\approx 0$ or $p\approx 1$ . This technical problem prevents us from directly applying standard concentration bounds. Instead we open up the standard proofs and verify the desired concentration. We take the usual approach of bounding the moment generating function and using that to give a tail bound.

For $p\in[\alpha,1-\alpha]\cup\{0,1\}$ and $t\in[-\sqrt{\alpha}/2,\sqrt{\alpha}/2]$ , we have

Let $p_{1}\cdots p_{m}\in[\alpha,1-\alpha]\cup\{0,1\}$ and $c_{1}\cdots c_{m}$ drawn independently with $c_{i}\sim p_{i}$ . Let $a_{1}\cdots a_{m}\in[-1,1]$ be fixed. For all $\lambda\geq 0$ , we have

By Lemma 2.4, for all $t\in[-\sqrt{\alpha}/2,\sqrt{\alpha}/2]$ ,

Set $t=\min\{\sqrt{\alpha}/2,\lambda/2m\}$ . If $\lambda\in[0,m\sqrt{\alpha}]$ , then

On the other hand, if $\lambda\geq m\sqrt{\alpha}$ , then

The result is obtained by adding these expressions. ∎
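The two cases can be worked through as follows. This is a sketch under an assumed form of the moment generating function bound, namely $\mathbb{E}[e^{t\phi^{p}(c)}]\leq e^{t^{2}}$ for $|t|\leq\sqrt{\alpha}/2$ ; the exact statement of Lemma 2.4 is not reproduced here, so the constants are illustrative. Write $X=\sum_{i\in[m]}a_{i}\phi^{p_{i}}(c_{i})$ with $|a_{i}|\leq 1$ .

```latex
\[
\mathbb{P}[X\geq\lambda]
  \leq e^{-t\lambda}\prod_{i\in[m]}\mathbb{E}\left[e^{t a_{i}\phi^{p_{i}}(c_{i})}\right]
  \leq e^{mt^{2}-t\lambda}
  \qquad\text{for } t\in[0,\sqrt{\alpha}/2].
\]
\[
\lambda\in[0,m\sqrt{\alpha}],\ t=\frac{\lambda}{2m}:
  \quad e^{mt^{2}-t\lambda}=e^{-\lambda^{2}/4m}.
\]
\[
\lambda\geq m\sqrt{\alpha},\ t=\frac{\sqrt{\alpha}}{2}:
  \quad e^{mt^{2}-t\lambda}=e^{m\alpha/4-\sqrt{\alpha}\lambda/2}\leq e^{-\sqrt{\alpha}\lambda/4}.
\]
```

The last inequality uses $m\alpha\leq\sqrt{\alpha}\lambda$ , which holds exactly when $\lambda\geq m\sqrt{\alpha}$ .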

The following theorem shows how we can beat the union bound for tail bounds on partial sums.

This is useful if we are in a setting where $|S^{1}|$ is unknown: if $|S^{1}|>n$ , then the interactive fingerprinting code will still not make too many false accusations, even if it fails to identify all of $S^{1}$ .

If $\delta<1/(N-|S^{1}|)$ , then this is a very poor bound. Instead we use the fact that the $E_{i}$ s are discrete and Markov’s inequality, which amounts to a union bound. For $\delta(N-|S^{1}|)<1$ , we have

The following lemma will be useful later.

5 Proof of Completeness

For this section, assume that the adversary $\mathcal{P}$ is always consistent, that is, we have no robustness and $\beta=0$ . Robustness will be added in Section 2.5.2. Here we establish that the scores have good expectation, namely

In this section we deviate from the proof in [Tar08]. We use biased Fourier analysis to give a more intuitive proof of the correlation bound.

We have the following lemma and proposition, which relate the correlation $a^{j}\cdot\sum_{i\in S^{1}}\phi^{p^{j}}(c_{i}^{j})$ to the properties of $a^{j}$ as a function of $p^{j}$ . To interpret these imagine that $f$ represents the adversary $\mathcal{P}$ with one round viewed in isolation – the fingerprinting code gives the adversary $c^{j}$ and the adversary responds with $f(c^{j}_{S^{j}})$ .

Firstly, the following lemma gives an interpretation of the correlation value for a fixed $p^{j}$ .

Thus, for any $p\in(0,1)$ , we can write $f$ in terms of these basis functions:

This decomposition is a generalisation of Fourier analysis to biased distributions [O’D14, §8.4]. For $p,q\in(0,1)$ , the expansion of $f$ gives the following expressions for $g(q)$ , $g^{\prime}(q)$ and $g^{\prime}(p)$ .
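The biased Fourier expansion can be computed by brute force for small $n$ . The sketch below assumes the convention that each bit equals $+1$ with probability $p$ and uses the mean-zero, unit-variance scaling as the basis function; the function names are illustrative, and the enumeration over all $2^{n}$ points is only practical for small examples.

```python
import itertools
import math

def phi(x, p):
    """Assumed p-biased basis function on a bit x in {-1, +1} with P[x = 1] = p:
    mean zero and unit variance under that distribution."""
    return (x - (2 * p - 1)) / (2 * math.sqrt(p * (1 - p)))

def prob(x, p):
    """Probability of the point x under the product distribution."""
    return math.prod(p if xi == 1 else 1 - p for xi in x)

def biased_coefficients(f, n, p):
    """Return {S: E[f(x) * prod_{i in S} phi(x_i)]} for every subset S of range(n)."""
    points = list(itertools.product([-1, 1], repeat=n))
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            coeffs[S] = sum(
                prob(x, p) * f(x) * math.prod(phi(x[i], p) for i in S)
                for x in points
            )
    return coeffs

def reconstruct(coeffs, x, p):
    """Evaluate the expansion sum_S coeff(S) * prod_{i in S} phi(x_i) at x."""
    return sum(c * math.prod(phi(x[i], p) for i in S) for S, c in coeffs.items())

def majority(x):
    """Majority vote of an odd number of +/-1 bits."""
    return 1 if sum(x) > 0 else -1
```

Evaluating `reconstruct(biased_coefficients(majority, 3, p), x, p)` recovers `majority(x)` at every point `x`, up to floating-point error, and the first-order coefficient of a dictator function equals $2\sqrt{p(1-p)}$ on its own coordinate and zero elsewhere.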

Now we can interpret the correlation for a random $p^{j}\sim D_{a,b}$ .

This effectively follows by integrating Lemma 2.11.

Let $\mu(p)=C_{a,b}/\sqrt{p(1-p)}$ be the probability density function for the distribution $D_{a,b}$ on the interval $(a,b)$ . By Lemma 2.11 and the fundamental theorem of calculus, we have
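The integration step can be sketched as follows, assuming that Lemma 2.11 takes the form $\mathbb{E}_{c\sim p}\left[f(c)\sum_{i}\phi^{p}(c_{i})\right]=g^{\prime}(p)\sqrt{p(1-p)}$ with $g(p)=\mathbb{E}_{c\sim p}[f(c)]$ . The lemma statement is not reproduced here, so this form is an assumption.

```latex
\[
\mathbb{E}_{p\sim D_{a,b}}\,\mathbb{E}_{c\sim p}\left[f(c)\sum_{i}\phi^{p}(c_{i})\right]
  = \int_{a}^{b} g^{\prime}(p)\sqrt{p(1-p)}\,\mu(p)\,\mathrm{d}p
  = C_{a,b}\int_{a}^{b} g^{\prime}(p)\,\mathrm{d}p
  = C_{a,b}\left(g(b)-g(a)\right).
\]
```

The density $\mu(p)=C_{a,b}/\sqrt{p(1-p)}$ is exactly what is needed for the factor $\sqrt{p(1-p)}$ to cancel, leaving an integral of $g^{\prime}$ that the fundamental theorem of calculus evaluates.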

It remains to show that $C_{a,b}=\left(2\sin^{-1}(\sqrt{b})-2\sin^{-1}(\sqrt{a})\right)^{-1}\geq 1/\pi$ . This follows from observing that
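One way to verify the value of the normalising constant is the substitution $p=\sin^{2}(\varphi)$ , which is the same change of variables as in the sampling procedure for $D_{a,b}$ . Under it, $\mathrm{d}p=2\sin(\varphi)\cos(\varphi)\,\mathrm{d}\varphi$ and $\sqrt{p(1-p)}=\sin(\varphi)\cos(\varphi)$ .

```latex
\[
\frac{1}{C_{a,b}}
  = \int_{a}^{b}\frac{\mathrm{d}p}{\sqrt{p(1-p)}}
  = \int_{\sin^{-1}(\sqrt{a})}^{\sin^{-1}(\sqrt{b})} 2\,\mathrm{d}\varphi
  = 2\sin^{-1}(\sqrt{b})-2\sin^{-1}(\sqrt{a})
  \leq 2\cdot\frac{\pi}{2}
  = \pi.
\]
```

The inequality holds because $0\leq a<b\leq 1$ places both inverse sines in $[0,\pi/2]$ .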

Now we have a lemma to bring consistency into the picture. If $f$ is consistent, $b\approx 1$ , and $a\approx 0$ , then

This gives a lower bound on the correlation for consistent $f$ .

5.2 Robustness

We require the fingerprinting code to be robust to inconsistent answers. We show that the correlation is still good in the presence of inconsistencies.

For $f:\{\pm 1\}^{n}\to\{\pm 1\}$ , define a random variable $\xi_{\alpha,\zeta}(f)$ by

The following bounds the expected increase in scores from one round of interaction.

Let $f:\{\pm 1\}^{n}\to\{\pm 1\}$ and $\alpha,\zeta\in(0,1/2)$ . Then

5.3 Concentration

So far we have shown that the fingerprinting code achieves good correlation or the adversary is not consistent in expectation. However, we need this to hold with high probability. Thus we now show that sums of $\xi_{\alpha,\zeta}(f)$ variables concentrate around their expectation.

Again, the proofs in this section are standard. However, the $\xi_{\alpha,\zeta}(f)$ variables can be quite unwieldy and we are thus unable to apply standard results directly. So instead we must open the proofs and verify that the concentration bounds hold. We proceed by bounding the moment generating function of $\xi_{\alpha,\zeta}(f)$ and then proving an Azuma-like concentration inequality. These calculations are not novel or insightful.

Let $f:\{\pm 1\}^{n}\to\{\pm 1\}$ , $\alpha\in(0,1/2)$ , $\zeta\in[1/4,1/2)$ , and $t\in[-\sqrt{\alpha}/8,\sqrt{\alpha}/8]$ . Then

where $C=\frac{64e^{n\alpha/4}}{\alpha}$ .

Let $Y=\sum_{i\in[n]}\phi^{p}(c_{i})$ . By Lemma 2.4 and independence,

for $t\in[-\sqrt{\alpha}/2,\sqrt{\alpha}/2]$ . Pick $t\in\{\pm\sqrt{\alpha}/2\}$ such that

Then by dropping positive terms, for all $j\geq 1$ ,

Thus we have bounded the even moments of $Y$ . By Cauchy-Schwarz, for $k=2j+1\geq 3$ ,

For $t\in[-\sqrt{\alpha}/8,\sqrt{\alpha}/8]$ , we have

$X_{i}$ is determined by $\mathcal{U}_{i}$ ,

$\mu_{i}$ is determined by $\mathcal{U}_{i-1}$ , and

$\mathcal{U}_{i-1}$ is determined by $\mathcal{U}_{i}$ .

Suppose that, for all $i\in[m]$ , $u\in\Omega$ , and $t\in[-c,c]$ ,

First we show by induction on $k\in[m]$ that, for all $u\in\Omega$ and $t\in[-c,c]$ ,

This clearly holds for $k=1$ , as this is our supposition for $i=m$ . Now suppose this holds for some $k\in[m-1]$ . For $u\in\Omega$ and $t\in[-c,c]$ , we have

for all $t\in[0,c]$ and $\lambda>0$ . Set $t=\min\{c,\lambda/2mC\}$ to obtain the result. ∎

5.4 Bounding the Score

Now we can finally show that the scores are large with high probability.

Since the adversary $\mathcal{P}$ is computationally unbounded and arbitrary, we may assume it is deterministic. We may also assume $n=|S^{1}|$ and that the adversary is able to see $c^{j}_{S^{1}}$ at each round. (This only gives the adversary more power.)

where $\sim$ denotes having the same distribution. We have

Now we can apply the above lemmas to bound the expectation and tail of this random variable.

for all $f^{j}$ . Moreover, by Proposition 2.15,

for all $t\in[-\sqrt{\alpha}/8,\sqrt{\alpha}/8]$ , where $C=70/\alpha\geq 64e^{n\alpha/4}/\alpha$ , as $\alpha\leq 1/4n$ .

However, we can also prove that the scores are small with high probability. This follows from the fact that users with large scores are accused and therefore no user’s score can be too large:

Now we show that the conflicting bounds of Theorem 2.17 and Lemma 2.18 imply completeness, that is, the adversary $\mathcal{P}$ cannot be consistent.

We claim this is a contradiction, which then holds with high probability, thus proving the theorem.

Substituting these into Equation (7) gives

Now we use $1-2\zeta=\frac{1}{2}\left(\frac{1}{2}-\beta\right)$ and $\gamma=\frac{2}{\pi}\frac{1-2\zeta}{\zeta}=\frac{\left(\frac{1}{2}-\beta\right)}{\pi\zeta}$ to derive a contradiction from Equation (8):

Since $\zeta=\frac{1}{2}-\frac{1}{4}\left(\frac{1}{2}-\beta\right)$ , we have

This gives a contradiction. The total failure probability is bounded by

assuming $\left(\frac{1}{2}-\beta\right)n\geq 1$ . ∎

6 Non-Interactive Fingerprinting Codes

Our construction and analysis also gives a construction of traditional non-interactive fingerprinting codes. First we give a formal definition of a fingerprinting code.

Our construction and analysis is readily adapted to the non-interactive setting. We obtain the following theorem.

and false accusation probability $\delta$ for

Hardness of False Discovery

In this section we prove our main result: answering $O(n^{2})$ adaptive queries given $n$ samples is hard. But first we must formally define the model in which we are working.

Given a distribution $\mathcal{D}$ over $\{0,1\}^{d}$ , we would like to answer statistical queries about $\mathcal{D}$ . A statistical query on $\{0,1\}^{d}$ is specified by a function $q:\{0,1\}^{d}\to[-1,1]$ and (abusing notation) is defined to be $q(\mathcal{D})=\mathbb{E}_{x\leftarrow_{\mbox{\tiny R}}\mathcal{D}}[q(x)]$ .

Our goal is to design an oracle $\mathcal{O}$ that answers statistical queries on $\mathcal{D}$ using only iid samples $x_{1},\dots,x_{n}\leftarrow_{\mbox{\tiny R}}\mathcal{D}$ . Our focus is the case where the queries are chosen adaptively and adversarially.

2 Encryption Schemes

Our attack relies on the existence of a semantically secure private-key encryption scheme. An encryption scheme is a triple of efficient algorithms $(\mathit{Gen},\mathit{Enc},\mathit{Dec})$ with the following syntax:

$\mathit{Gen}$ is a randomized algorithm that takes as input a security parameter $\lambda$ and outputs a $\lambda$ -bit secret key. Formally, $sk\leftarrow_{\mbox{\tiny R}}\mathit{Gen}(1^{\lambda})$ .

$\mathit{Enc}$ is a randomized algorithm that takes as input a secret key and a message $m$ and outputs a ciphertext $ct$ . Formally, $ct\leftarrow_{\mbox{\tiny R}}\mathit{Enc}(sk,m)$ .

$\mathit{Dec}$ is a deterministic algorithm that takes as input a secret key and a ciphertext $ct$ and outputs a decrypted message $m^{\prime}$ . If the ciphertext $ct$ was an encryption of $m$ under the key $sk$ , then $m^{\prime}=m$ . Formally, if $ct\leftarrow_{\mbox{\tiny R}}\mathit{Enc}(sk,m)$ , then $\mathit{Dec}(sk,ct)=m$ with probability $1$ .

Roughly, security of the encryption scheme asserts that no polynomial time adversary who does not know the secret key can distinguish encryptions of $m=0$ from encryptions of $m=1$ , even if the adversary has access to an oracle that returns the encryption of an arbitrary message under the unknown key. For convenience, we will require that this security property holds simultaneously for an arbitrary polynomial number of secret keys. The existence of an encryption scheme with this property follows immediately from the existence of an ordinary semantically secure encryption scheme. We start with the stronger definition only to simplify our proofs. A secure encryption scheme exists under the minimal cryptographic assumption that one-way functions exist. The formal definition of security is not needed until Section A.
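To illustrate the syntax only, here is a toy single-bit scheme in the textbook style of masking the message with a pseudorandom function applied to a fresh nonce. It is not the scheme used in the paper: treating HMAC-SHA256 as a pseudorandom function is an assumption, the function names are hypothetical, and the sketch is not intended for real use.

```python
import hashlib
import hmac
import secrets

def gen(security_parameter):
    """Gen: output a uniformly random secret key of `security_parameter` bits
    (rounded up to whole bytes)."""
    return secrets.token_bytes((security_parameter + 7) // 8)

def _pad_bit(sk, nonce):
    """One pseudorandom bit derived from the key and a nonce via HMAC-SHA256."""
    return hmac.new(sk, nonce, hashlib.sha256).digest()[0] & 1

def enc(sk, m):
    """Enc: randomized encryption of a single bit m in {0, 1}.
    A fresh random nonce makes repeated encryptions of the same bit differ."""
    nonce = secrets.token_bytes(16)
    return nonce, m ^ _pad_bit(sk, nonce)

def dec(sk, ct):
    """Dec: deterministic decryption; recomputes the pad bit from the nonce."""
    nonce, masked = ct
    return masked ^ _pad_bit(sk, nonce)
```

The three functions mirror the triple above: `gen` is randomized, `enc` is randomized through its nonce, and `dec` is deterministic and always inverts `enc` under the same key.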

3 The Attack

4 Informal Analysis of the Attack

Before formally analysing the attack, we comment on the overall structure thereof.

In order to do this, the oracle could decrypt $q^{j}$ to obtain $c^{j}$ for every $j$ . However, the oracle does not have all the necessary secret keys; it only has the secret keys corresponding to its sample $S$ . Thus, by the security of the encryption scheme, any efficient oracle effectively can only see $c^{j}_{S\setminus T^{j}}$ . That is to say, if the oracle is computationally efficient, then it has the same restriction as a fingerprinting adversary $\mathcal{P}$ . Thus, any computationally efficient oracle must lose the fingerprinting game, meaning it cannot answer every query (or even just a $\beta=1/2+\Omega(1)$ fraction of the queries) accurately.

One subtlety arises since “accuracy” for the oracle is defined with respect to the true answer $q^{j}(\mathcal{D})=\frac{1}{N}\sum_{i\in[N]\setminus T^{j}}c^{j}_{i},$ whereas “accuracy” in the fingerprinting game is defined with respect to the average over all of $c^{j}$ , that is $\frac{1}{N}\sum_{i\in[N]}c^{j}_{i}$ . We deal with this subtlety by arguing that $T^{j}$ , the set of users accused by the interactive fingerprinting code prior to the $j$ -th query, is small. Here we use the fact that the fingerprinting code only allows a relatively small number of false accusations, at most $N/1000$ . Therefore $|T^{j}|\leq n+N/1000\leq N/500$ . As a result, the definition of accuracy guaranteed by the oracle will be close enough to the definition of accuracy required for the interactive fingerprinting code to succeed in identifying the sample.
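The gap between the two notions of accuracy can be bounded in one line. The sketch uses only the facts stated above: $c^{j}_{i}\in\{\pm 1\}$ and $|T^{j}|\leq n+N/1000\leq N/500$ .

```latex
\[
\left|q^{j}(\mathcal{D})-\frac{1}{N}\sum_{i\in[N]}c^{j}_{i}\right|
  = \left|\frac{1}{N}\sum_{i\in T^{j}}c^{j}_{i}\right|
  \leq \frac{|T^{j}|}{N}
  \leq \frac{n+N/1000}{N}
  \leq \frac{1}{500}.
\]
```

An answer that is accurate for $q^{j}(\mathcal{D})$ is therefore accurate for the full average up to an additive $1/500$ , and the triangle inequality absorbs this into the accuracy parameter.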

5 Analysis of the Attack

In this section we prove our main result:

This follows straightforwardly from a reduction to the security of the fingerprinting code. Notice that the query $q^{j}$ does not depend on any entry $c^{j}_{i}$ for $i\not\in S\setminus T^{j-1}$ . Thus, an adversary for the fingerprinting code who has access to $c^{j}_{S\setminus T^{j-1}}$ can simulate the view of the oracle. Since we have for any adversary $\mathcal{P}$

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 3.3 and 3.4 we easily obtain the following.

where the second equality is because by construction $ct^{j}_{i}\leftarrow_{\mbox{\tiny R}}\mathit{Enc}(sk_{i},c^{j}_{i})$ and the inequality is because we have $c^{j}_{i}\in\{\pm 1\}$.

Noting that $N/1000+n<N/500$ and combining with (10), we have

Applying the triangle inequality to (9) and (11), we obtain

As before, we can argue that the real attack and the ideal attack are computationally indistinguishable, and thus the oracle must also give consistent answers in the ideal attack.

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 3.6 and 3.7 we easily obtain the following.

Putting the above claims together, we obtain the main theorem:

Assume for the sake of contradiction that there were such an oracle. Theorem 2.2 implies that an interactive fingerprinting code of length $O(n^{2}/\left(\frac{1}{2}-\beta\right)^{4})$ exists, so the attack can be carried out. By Claim 3.8 we would have

Note that the constants in the $(0.99,\beta,1/2)$ -accuracy assumption are arbitrary and have only been fixed for simplicity.

6 An Information-Theoretic Lower Bound

As in [HU14], we observe that the techniques underlying our computational hardness result can also be used to prove an information-theoretic lower bound when the dimension of the data is large. At a high level, the argument uses the fact that the encryption scheme we rely on only needs to satisfy relatively weak security properties, specifically security for at most $O(n^{2})$ messages. This security property can actually be achieved against computationally unbounded adversaries provided that the length of the secret keys is $O(n^{2})$ . As a result, our lower bound can be made to hold against computationally unbounded oracles, but since the secret keys have length $O(n^{2})$ , we will require $d=O(n^{2})$ . We refer the reader to [HU14] for a slightly more detailed discussion, and simply state the following result.
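To illustrate how security for a bounded number of messages can hold against unbounded adversaries when the key is as long as the number of messages, here is a minimal sketch using a one-time pad on one-bit messages. The one-time pad is our choice of a standard example and is an assumption; we do not claim it is the scheme used here or in [HU14]. The names `gen_key`, `enc`, and `dec` are hypothetical, and messages in $\{\pm 1\}$ are assumed to be encoded as bits.

```python
import secrets


def gen_key(num_messages):
    """Sample one fresh uniformly random key bit per message to be encrypted.

    The key length equals the number of messages, mirroring the
    O(n^2)-bit keys needed for O(n^2) messages.
    """
    return [secrets.randbelow(2) for _ in range(num_messages)]


def enc(key, index, bit):
    """Encrypt the index-th one-bit message by XOR with key bit `index`.

    Each key bit must be used for at most one message.
    """
    return bit ^ key[index]


def dec(key, index, ct):
    """Decrypt by XOR with the same key bit."""
    return ct ^ key[index]


# Usage: a user who will see 5 queries needs a 5-bit key.
key = gen_key(5)
messages = [1, 0, 1, 1, 0]
ciphertexts = [enc(key, j, m) for j, m in enumerate(messages)]
assert [dec(key, j, ct) for j, ct in enumerate(ciphertexts)] == messages
```

Because every key bit is uniform and used once, each ciphertext bit is uniform regardless of the message, which is why no bound on the adversary's running time is needed in this sketch.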

Hardness of Avoiding Blatant Non-Privacy

In this section we show how our arguments also imply that computationally efficient oracles that guarantee accuracy for adaptively chosen statistical queries must be blatantly non-private.

Before we can define blatant non-privacy, we need to define a notion of accuracy that is more appropriate for the application to privacy. In contrast to Definition 3.1, where accuracy is defined with respect to the distribution, here we define accuracy with respect to the sample itself. With this change in mind, we model blatant non-privacy via the following game.

where $q(x)=\frac{1}{n}\sum_{i\in[n]}q(x_{i})$ is the average over the sample.

2 Lower Bounds

In this section we show the following theorem.

Assuming one-way functions exist, any computationally efficient oracle $\mathcal{O}$ that gives accurate answers to $O(n^{2})$ adaptively chosen queries is blatantly non-private.

We will start by establishing that the number of falsely accused users is small. That is, we have $|T^{L}\setminus x|\leq n/10000$ with high probability. As in Section 3, this condition will follow from the security of the interactive fingerprinting code $\mathcal{F}$ combined with the security of the encryption scheme, via the introduction of an “ideal attack” (Figure 8).

This follows straightforwardly from a reduction to the security of the fingerprinting code. Notice that the query $q^{j}$ does not depend on any entry $c^{j}_{i}$ for $i\not\in x\setminus T^{j-1}$. Thus, an adversary for the fingerprinting code who has access to $c^{j}_{x\setminus T^{j-1}}$ can simulate the view of the oracle. Since we have for any adversary $\mathcal{P}$

Now we can argue that an efficient oracle cannot distinguish between the real attack and the ideal attack. Thus the conclusion that $|T^{L}\setminus x|\leq n/10000$ with high probability must also hold in the real game.

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 4.4 and 4.5 we easily obtain the following.

By Claim 4.6 we have $|x^{\prime}\setminus x|\leq n/10000$ . Now, in order to show $|x^{\prime}\triangle x|\leq n/100$ , it suffices to show that $|x\setminus x^{\prime}|\leq n/200$ . In order to do so we begin with the following claim, which establishes that if the oracle $\mathcal{O}$ is sufficiently accurate, and $|x\setminus T^{j-1}|\leq n/200$ , then the oracle returns a consistent answer to the query $q^{j}$ . Recalling that we use $\theta^{j}$ to denote the number of rounded answers $\overline{a}^{k}$ for $1\leq k\leq j$ that are inconsistent with $c^{j}$ , we can state the following claim.

After renormalizing by $n/(n-|T^{j-1}|)$ we have

Since $0\leq|T^{j-1}\setminus x|\leq n/10000$ (by Claim 4.6), and since the algorithm terminates unless $|T^{j-1}|\leq 499n/500$ , we obtain

By the assumption that $\mathcal{O}$ is $(1/1000,\beta,1/2)$ -sample-accurate, we have that, with probability at least $1/2$ , for $(1-\beta)L$ choices of $j\in[L]$ ,

Finally, observe that if $c^{j}_{i}=1$ for every $i\in[2n]$ , then we have

and by (15) we have $(n/(n-|T^{j}|))a^{j}\geq 1-2/3=1/3$ . Thus, the rounded answer $\overline{a}^{j}=1$ . Similarly, if $c^{j}_{i}=-1$ for every $i\in[2n]$ , then we have $\overline{a}^{j}=-1$ . This completes the proof of the claim. ∎
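The renormalize-then-round step at the end of this proof can be sketched as follows. This is an illustration under stated assumptions: the function name `rounded_answer` is hypothetical, and rounding is assumed to take the sign of the renormalized answer (with ties at 0 going to $+1$), which is consistent with the conclusion above that a renormalized answer of at least $1/3$ rounds to $1$. The precise rounding rule is the one in the attack's specification, which is not reproduced in this section.

```python
def rounded_answer(a, n, num_accused):
    """Renormalize the oracle's answer a by n / (n - |T|), then round to +/-1.

    Assumption: rounding is by sign, with 0 mapped to +1.
    """
    if not 0 <= num_accused < n:
        raise ValueError("need 0 <= |T| < n so the renormalization is defined")
    renormalized = a * n / (n - num_accused)
    return 1 if renormalized >= 0 else -1


# Usage with illustrative numbers (assumptions): n = 1000, |T| = 100.
# An answer of 0.3 renormalizes to 0.3 * 1000 / 900 = 1/3, which rounds to +1.
assert rounded_answer(0.3, 1000, 100) == 1
assert rounded_answer(-0.3, 1000, 100) == -1
```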

As before, we can argue that the real attack and the ideal attack are computationally indistinguishable, and thus the oracle must also give consistent answers in the ideal attack.

The proof is straightforward from the definition of security, and is deferred to Section A. Combining Claims 4.7 and 4.8 we easily obtain the following.

Putting it together, we obtain the following theorem.

which is a contradiction. This completes the proof of the theorem. ∎

3 An Information-Theoretic Lower Bound

As we did in Section 3.6, we can prove an information-theoretic analogue of our hardness result for avoiding blatant non-privacy.

The proof is essentially identical to what is sketched in Section 3.6.

Acknowledgements

We thank Moritz Hardt and Salil Vadhan for insightful discussions during the early stages of this work. We also thank Thijs Laarhoven for bringing his work on interactive fingerprinting codes to our attention.

References

Appendix A Security Reductions from Sections 3 and 4

In Section 3 we made several claims comparing the probability of events in $\mathsf{Attack}$ to the probability of events in $\mathsf{IdealAttack}$. Each of these claims follows from the assumed security of the encryption scheme. In this section we restate and prove these claims. Since the claims are all of a similar nature, the proof will be somewhat modular. The claims in Section 4 relating $\mathsf{PrivacyAttack}$ to $\mathsf{IdealPrivacyAttack}$ can be proven in an essentially identical fashion, and we omit these proofs for brevity.

Before we begin, recall the formal definition of security of an encryption scheme. Security is defined via a pair of oracles $\mathcal{E}_{0}$ and $\mathcal{E}_{1}$. The oracle $\mathcal{E}_{1}(sk_{1},\dots,sk_{N},\cdot)$ takes as input the index of a key $i\in[N]$ and a message $m$ and returns $\mathit{Enc}(sk_{i},m)$, whereas $\mathcal{E}_{0}(sk_{1},\dots,sk_{N},\cdot)$ takes the same input but returns $\mathit{Enc}(sk_{i},0)$. The security of the encryption scheme asserts that, for randomly chosen secret keys, no computationally efficient adversary can tell whether it is interacting with $\mathcal{E}_{0}$ or $\mathcal{E}_{1}$.
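The two oracles in this definition can be sketched in code. The encryption scheme below is a toy stand-in and an assumption on our part: it masks a one-bit message with a bit derived from SHA-256 of the key and a fresh nonce, and it is not the scheme used in the paper. The names `gen`, `enc`, `dec`, and `make_oracle` are hypothetical, messages are assumed to be single bits, and key indices start at 0.

```python
import hashlib
import secrets


def gen(key_bytes=16):
    """Sample a random secret key (toy stand-in for Gen)."""
    return secrets.token_bytes(key_bytes)


def _pad_bit(sk, nonce):
    """Derive one pseudorandom mask bit from the key and a nonce."""
    return hashlib.sha256(sk + nonce).digest()[0] & 1


def enc(sk, bit):
    """Toy randomized encryption of a single bit under sk."""
    nonce = secrets.token_bytes(16)
    return nonce, bit ^ _pad_bit(sk, nonce)


def dec(sk, ct):
    """Invert enc using the same key."""
    nonce, masked = ct
    return masked ^ _pad_bit(sk, nonce)


def make_oracle(keys, b):
    """Return E_b for the given keys.

    On input (i, m): E_1 returns Enc(sk_i, m); E_0 ignores m and
    returns Enc(sk_i, 0).
    """
    def oracle(i, m):
        return enc(keys[i], m if b == 1 else 0)
    return oracle


# Usage: an adversary gets one of the two oracles and must guess which.
keys = [gen() for _ in range(4)]
e1, e0 = make_oracle(keys, 1), make_oracle(keys, 0)
assert dec(keys[2], e1(2, 1)) == 1  # E_1 encrypts the real message
assert dec(keys[2], e0(2, 1)) == 0  # E_0 always encrypts 0
```

Only someone holding $sk_{i}$ can tell the two ciphertexts apart by decrypting, which is the sense in which an adversary without the keys should not distinguish $\mathcal{E}_{0}$ from $\mathcal{E}_{1}$.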

We now restate the relevant claims from Section 3.

To prove both of these claims, for $c\in\{1,2\}$ we construct an adversary $\mathcal{B}_{c}$ that will attempt to use $\mathcal{O}$ to break the security of the encryption. We construct $\mathcal{B}_{c}$ in such a way that its advantage in breaking the security of encryption is precisely the difference in the probability of the event $Z_{c}$ between $\mathsf{Attack}$ and $\mathsf{IdealAttack}$, which implies that the difference in probabilities is negligible. The simulator is given in Figure 9.

First, observe that for $c\in\left\{1,2\right\}$, $\mathcal{B}_{c}$ is computationally efficient as long as $\mathcal{F}$ and $\mathcal{O}$ are both computationally efficient. It is not hard to see that our construction $\mathcal{F}$ is efficient, and efficiency of $\mathcal{O}$ is an assumption of the claim. Also notice that $\mathcal{B}_{c}$ can efficiently determine whether $Z_{c}$ has occurred.

Now we observe that when the encryption oracle is $\mathcal{E}_{1}$ (the oracle that takes as input $i$ and $m$ and returns $\mathit{Enc}(\overline{sk}_{i},m)$), and $\overline{sk}_{1},\dots,\overline{sk}_{N}$ are chosen randomly from $\mathit{Gen}(1^{\lambda})$, then the view of $\mathcal{O}$ is identical to its view in $\mathsf{Attack}_{n,d}[\mathcal{O}]$. Specifically, $\mathcal{O}$ holds a random sample of pairs $(i,sk_{i})$ and is shown queries that are encryptions either under keys it knows or under random unknown keys. Moreover, the messages being encrypted are chosen from the same distribution. On the other hand, when the encryption oracle is $\mathcal{E}_{0}$ (the oracle that takes as input $i$ and $m$ and returns $\mathit{Enc}(\overline{sk}_{i},0)$), then the view of $\mathcal{O}$ is identical to its view in $\mathsf{IdealAttack}_{n,d}[\mathcal{O}]$. Thus we have that for $c\in\{1,2\}$,