From average case complexity to improper learning complexity

Amit Daniely, Nati Linial, Shai Shalev-Shwartz

Introduction

Valiant’s celebrated probably approximately correct (PAC) model of machine learning led to extensive research and gave rise to a whole scientific community devoted to computational learning theory. In the PAC learning model, a learner is given oracle access to randomly generated samples $(X,Y)\in{\cal X}\times\{0,1\}$, where $X$ is sampled from some unknown distribution ${\cal D}$ on ${\cal X}$ and $Y=h^{*}(X)$ for some unknown function $h^{*}:{\cal X}\to\{0,1\}$. Furthermore, it is assumed that $h^{*}$ comes from a predefined hypothesis class ${\cal H}$, consisting of $\{0,1\}$-valued functions on ${\cal X}$. The learning problem defined by ${\cal H}$ is to find a function $h:{\cal X}\to\{0,1\}$ that minimizes $\operatorname{Err}_{{\cal D}}(h):=\Pr_{X\sim{\cal D}}(h(X)\neq h^{*}(X))$. For concreteness’ sake we take ${\cal X}=\{\pm 1\}^{n}$, and we consider the learning problem tractable if there is an algorithm that, on input $\epsilon$, runs in time $\operatorname{poly}(n,1/\epsilon)$ and outputs, w.h.p., a hypothesis $h$ with $\operatorname{Err}_{{\cal D}}(h)\leq\epsilon$.

Assuming $\mathbf{P}\neq\mathbf{NP}$, the status of most basic computational problems is fairly well understood. In sharp contrast, almost $30$ years after Valiant’s paper, the status of most basic learning problems is still wide open: there is a huge gap between the performance of the best known algorithms and the known hardness results.

The crux of the matter, leading to this state of affairs, has to do with the learner’s freedom to return any hypothesis. A learner who may return hypotheses outside the class ${\cal H}$ is called an improper learner. This additional freedom makes such algorithms potentially more powerful than proper learners. On the other hand, this added flexibility makes it difficult to apply standard reductions from $\mathbf{NP}$-hard problems. Indeed, there has been no success so far in proving intractability of a learning problem based on $\mathbf{NP}$-hardness. Moreover, as Applebaum, Barak and Xiao showed, many standard ways to do so are doomed to fail, unless the polynomial hierarchy collapses.

The vast majority of existing lower bounds on learning utilize the crypto-based argument. Roughly speaking, to prove that a certain learning problem is hard, one starts with a certain collection of functions that, by assumption, are one-way trapdoor permutations. This immediately yields some hard (usually artificial) learning problem. The final step is to reduce this artificial problem to some natural learning problem.

Unlike the difficulty in establishing lower bounds for improper learning, the situation in proper learning is much better understood. Usually, hardness of proper learning is proved by showing that it is $\mathbf{NP}$ -hard to distinguish a realizable sample from an unrealizable sample. I.e., it is hard to tell whether there is some hypothesis in ${\cal H}$ which has zero error on a given sample. This, however, does not suffice for the purpose of proving lower bounds on improper learning, because it might be the case that the learner finds a hypothesis (not from ${\cal H}$ ) that does not err on the sample even though no $h\in{\cal H}$ can accomplish this. In this paper we present a new methodology for proving hardness of improper learning. Loosely speaking, we show that improper learning is impossible provided that it is hard to distinguish a realizable sample from a randomly generated unrealizable sample.

Agnostically learning halfspaces with a constant approximation ratio is hard, even over the boolean cube.

Learning intersection of $\omega(1)$ halfspaces is hard, even over the boolean cube.

We note that result 4 can be established using the cryptographic technique. Result 5 is often taken as a hardness assumption. We also conjecture that under our generalization of Feige’s assumption it is hard to learn intersections of even a constant number of halfspaces. We present a possible approach to the case of four halfspaces. To the best of our knowledge, these results easily imply most existing lower bounds for improper learning.

There is a crucial reversal of order that works in our favour. To lower bound improper learning, we actually need much less than what is needed in cryptography, where a problem and a distribution on instances are appropriate if they fool every algorithm. In contrast, here we are presented with a concrete learning algorithm and we devise a problem and a distribution on instances on which it fails.

2 On the role of average case complexity

Preliminaries

Distributions on ${\cal Z}_{n}$ (resp. ${\cal Z}_{n}^{m}$) are denoted ${\cal D}_{n}$ (resp. ${\cal D}_{n}^{m}$). Ensembles of distributions are denoted by ${\cal D}$. That is, ${\cal D}=\{{\cal D}_{n}^{m(n)}\}_{n=1}^{\infty}$ where ${\cal D}_{n}^{m(n)}$ is a distribution on ${\cal Z}_{n}^{m(n)}$. We say that ${\cal D}$ is a polynomial ensemble if $m(n)$ is upper bounded by some polynomial in $n$.

The error of a hypothesis $h:{\cal X}_{n}\to\{0,1\}$ w.r.t. ${\cal D}_{n}$ on ${\cal Z}_{n}$ is defined as $\operatorname{Err}_{{\cal D}_{n}}(h)=\Pr_{(x,y)\sim{\cal D}_{n}}\left(h(x)\neq y\right)$ . For a hypothesis class ${\cal H}_{n}$ , we define $\operatorname{Err}_{{\cal D}_{n}}({\cal H}_{n})=\min_{h\in{\cal H}_{n}}\operatorname{Err}_{{\cal D}_{n}}(h)$ . We say that a distribution ${\cal D}_{n}$ is realizable by $h$ (resp. ${\cal H}_{n}$ ) if $\operatorname{Err}_{{\cal D}_{n}}(h)=0$ (resp. $\operatorname{Err}_{{\cal D}_{n}}({\cal H}_{n})=0$ ). Similarly, we say that ${\cal D}_{n}$ is $\epsilon$ -almost realizable by $h$ (resp. ${\cal H}_{n}$ ) if $\operatorname{Err}_{{\cal D}_{n}}(h)\leq\epsilon$ (resp. $\operatorname{Err}_{{\cal D}_{n}}({\cal H}_{n})\leq\epsilon$ ).

A sample is a sequence $S=\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\}\in{\cal Z}^{m}_{n}$. The empirical error of a hypothesis $h:{\cal X}_{n}\to\{0,1\}$ w.r.t. a sample $S$ is $\operatorname{Err}_{S}(h)=\frac{1}{m}\sum_{i=1}^{m}1(h(x_{i})\neq y_{i})$. The empirical error of a hypothesis class ${\cal H}_{n}$ w.r.t. $S$ is $\operatorname{Err}_{S}({\cal H}_{n})=\min_{h\in{\cal H}_{n}}\operatorname{Err}_{S}(h)$. We say that a sample $S$ is realizable by $h$ if $\operatorname{Err}_{S}(h)=0$. The sample $S$ is realizable by ${\cal H}_{n}$ if $\operatorname{Err}_{S}({\cal H}_{n})=0$. Similarly, we define the notion of an $\epsilon$-almost realizable sample (by either a hypothesis $h:{\cal X}_{n}\to\{0,1\}$ or a class ${\cal H}_{n}$).
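To make these definitions concrete, the following minimal Python sketch computes the empirical error of a single hypothesis and of a finite class, counting the fraction of examples on which the hypothesis disagrees with the label. The function names and the toy class of single-coordinate hypotheses are illustrative assumptions and do not appear in the paper.

```python
def empirical_error(h, sample):
    """Err_S(h): fraction of examples (x, y) in S with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def class_empirical_error(hypotheses, sample):
    """Err_S(H): the smallest empirical error attained by a hypothesis in H."""
    return min(empirical_error(h, sample) for h in hypotheses)


def coordinate_hypothesis(i):
    """Hypothetical toy hypothesis: predict 1 exactly when x_i = +1."""
    return lambda x: 1 if x[i] == 1 else 0


# A toy sample over {-1,+1}^2 with labels in {0,1}.
sample = [((1, -1), 1), ((-1, -1), 0), ((1, 1), 1), ((-1, 1), 1)]
toy_class = [coordinate_hypothesis(0), coordinate_hypothesis(1)]
```

In this toy sample each of the two hypotheses errs on exactly one of the four examples, so the sample is not realizable by the toy class but is $\frac{1}{4}$-almost realizable by it.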

A learning algorithm, denoted ${\cal L}$, obtains an error parameter $0<\epsilon<1$, a confidence parameter $0<\delta<1$, a complexity parameter $n$, and access to an oracle that produces samples according to an unknown distribution ${\cal D}_{n}$ on ${\cal Z}_{n}$. It should output a (description of a) hypothesis $h:{\cal X}_{n}\to\{0,1\}$. We say that the algorithm ${\cal L}$ (PAC) learns the hypothesis class ${\cal H}$ if, for every realizable distribution ${\cal D}_{n}$, with probability $\geq 1-\delta$, ${\cal L}$ outputs a hypothesis with error $\leq\epsilon$. We say that an algorithm ${\cal L}$ agnostically learns ${\cal H}$ if, for every distribution ${\cal D}_{n}$, with probability $\geq 1-\delta$, ${\cal L}$ outputs a hypothesis with error $\leq\operatorname{Err}_{{\cal D}_{n}}({\cal H})+\epsilon$. We say that an algorithm ${\cal L}$ approximately agnostically learns ${\cal H}$ with approximation ratio $\alpha=\alpha(n)\geq 1$ if, for every distribution ${\cal D}_{n}$, with probability $\geq 1-\delta$, ${\cal L}$ outputs a hypothesis with error $\leq\alpha\cdot\operatorname{Err}_{{\cal D}_{n}}({\cal H})+\epsilon$. We say that ${\cal L}$ is efficient if it runs in time polynomial in $n,1/\epsilon$ and $1/\delta$, and outputs a hypothesis that can be evaluated in time polynomial in $n,1/\epsilon$ and $1/\delta$. We say that ${\cal L}$ is proper (with respect to ${\cal H}$) if it always outputs a hypothesis in ${\cal H}$. Otherwise, we say that ${\cal L}$ is improper.

2 Constraints Satisfaction Problems

3 Resolution refutation and Davis Putnam algorithms

The methodology

We begin by discussing the methodology in the realm of realizable learning, and we later proceed to agnostic learning. Some of the ideas underlying our methodology have appeared before, in a much more limited context.

To motivate the approach, recall how one usually proves that a class is not efficiently properly learnable. Given a hypothesis class ${\cal H}$, let $\Pi({\cal H})$ be the problem of distinguishing between an ${\cal H}$-realizable sample $S$ and one with $\operatorname{Err}_{S}({\cal H})\geq\frac{1}{4}$. If ${\cal H}$ is efficiently properly learnable then this problem is in $\mathbf{RP}$: to solve $\Pi({\cal H})$, we simply invoke a proper learning algorithm ${\cal A}$ that efficiently learns ${\cal H}$, with examples drawn uniformly from $S$. (The reverse direction is almost true: if the search version of this problem can be solved in polynomial time, then ${\cal H}$ is efficiently learnable.) Let $h$ be the output of ${\cal A}$. Since ${\cal A}$ properly learns ${\cal H}$, we have:

If $S$ is a realizable sample, then $\operatorname{Err}_{S}(h)$ is small.

If $\operatorname{Err}_{S}({\cal H})\geq\frac{1}{4}$ then, since $h\in{\cal H}$ , $\operatorname{Err}_{S}(h)\geq\frac{1}{4}$ .

This gives an efficient way to decide whether $S$ is realizable. We conclude that if $\Pi({\cal H})$ is $\mathbf{NP}$-hard, then ${\cal H}$ is not efficiently properly learnable, unless $\mathbf{NP}=\mathbf{RP}$.
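The reduction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy class of single-coordinate hypotheses, the stand-in learner `toy_proper_learner` (plain empirical risk minimisation over that class), and the two toy samples are hypothetical and are not taken from the paper.

```python
import random
from itertools import product


def empirical_error(h, sample):
    """Fraction of examples (x, y) in the sample with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def decide_realizable(sample, proper_learner, rng):
    """Decide Pi(H): run a proper learner on examples drawn uniformly from S."""
    oracle = lambda: rng.choice(sample)   # example oracle backed by S
    h = proper_learner(oracle)            # h lies in H because the learner is proper
    return "realizable" if empirical_error(h, sample) < 0.25 else "unrealizable"


def dictator(i):
    """Hypothetical toy hypothesis over {-1,+1}^3: predict 1 iff x_i = +1."""
    return lambda x: 1 if x[i] == 1 else 0


def toy_proper_learner(oracle, n=3, draws=50):
    """Stand-in proper learner: pick the toy hypothesis that fits the draws best."""
    drawn = [oracle() for _ in range(draws)]
    best = min(range(n), key=lambda i: empirical_error(dictator(i), drawn))
    return dictator(best)


cube = list(product((-1, 1), repeat=3))
# Labels given by the second coordinate: realizable by the toy class.
realizable_sample = [(x, 1 if x[1] == 1 else 0) for x in cube]
# Labels given by the parity of all three coordinates: every toy hypothesis errs on half.
far_sample = [(x, 1 if x[0] * x[1] * x[2] == 1 else 0) for x in cube]
```

On `far_sample` the answer is "unrealizable" no matter which member of the toy class the learner returns, which is exactly the point where properness is used.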

We see that it is not clear how to establish hardness of improper learning based on the hardness of distinguishing between a realizable and an unrealizable sample. The core problem is that even if $S$ is not realizable, the algorithm might still return a good hypothesis. The crux of our new technique is the observation that if $S$ is a randomly generated unrealizable sample then even an improper algorithm cannot return a hypothesis with a small empirical error. The point is that the returned hypothesis is determined solely by the examples that ${\cal A}$ sees and its random bits. Therefore, if ${\cal A}$ is an efficient algorithm, the number of hypotheses it might return cannot be too large. Hence, if $S$ is “random enough”, it is likely to be far from all these hypotheses, in which case the hypothesis returned by ${\cal A}$ would have a large error on $S$.

We now formalize this idea. Let ${\cal D}=\{{\cal D}^{m(n)}_{n}\}_{n}$ be a polynomial ensemble of distributions, such that ${\cal D}^{m(n)}_{n}$ is a distribution on ${\cal Z}_{n}^{m(n)}$ . Think of ${\cal D}^{m(n)}_{n}$ as a distribution that generates samples that are far from being realizable by ${\cal H}$ . We say that it is hard to distinguish between a ${\cal D}$ -random sample and a realizable sample if there is no efficient randomized algorithm ${\cal A}$ with the following properties:

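For every realizable sample $S\in{\cal Z}^{m(n)}_{n}$, $\Pr_{\text{internal coins of }{\cal A}}\left({\cal A}(S)=\text{“realizable”}\right)\geq\frac{3}{4}$.

If $S\sim{\cal D}_{n}^{m(n)}$, then with probability $1-o_{n}(1)$ over the choice of $S$, it holds that $\Pr_{\text{internal coins of }{\cal A}}\left({\cal A}(S)=\text{“unrealizable”}\right)\geq\frac{3}{4}$.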

Let ${\cal D}^{m(n)}_{n}$ be the distribution over ${\cal Z}_{n}^{m(n)}$ defined by taking $m(n)$ independent uniformly chosen examples from ${\cal X}_{n}\times\{0,1\}$ . For $f:{\cal X}_{n}\to\{0,1\}$ , $\Pr_{S\sim{\cal D}^{m(n)}_{n}}\left(\operatorname{Err}_{S}(f)\leq\frac{1}{4}\right)$ is the probability of getting at most $\frac{m(n)}{4}$ heads in $m(n)$ independent tosses of a fair coin. By Hoeffding’s bound, this probability is $\leq 2^{-\frac{1}{8}m(n)}$ . Therefore, ${\cal D}=\{{\cal D}^{m(n)}_{n}\}_{n}$ is $\left(\frac{1}{8}m(n),1/4\right)$ -scattered.
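The coin-tossing estimate in this example can be checked numerically. The sketch below computes the exact probability of seeing at most $m/4$ heads in $m$ fair tosses and compares it with $2^{-m/8}$. The function name is an illustrative assumption, and only values of $m$ divisible by $8$ are used so that the bound is an exact rational.

```python
from fractions import Fraction
from math import comb


def prob_at_most_quarter_heads(m):
    """Exact Pr[Binomial(m, 1/2) <= m/4], i.e. Pr_S[Err_S(f) <= 1/4] for uniform S."""
    return Fraction(sum(comb(m, k) for k in range(m // 4 + 1)), 2 ** m)


def scattered_bound(m):
    """The bound 2^(-m/8) from the example, for m divisible by 8."""
    return Fraction(1, 2 ** (m // 8))


# Compare the exact tail with the bound for a few sample sizes.
comparison = {m: (prob_at_most_quarter_heads(m), scattered_bound(m)) for m in (8, 16, 40, 80)}
```

For $m=8$ the exact probability is $\frac{37}{256}$, comfortably below the bound $\frac{1}{2}$, and the gap widens as $m$ grows.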

Every hypothesis class that satisfies the following condition is not efficiently learnable. There exists $\beta>0$ such that for every $c>0$ there is an $(n^{c},\beta)$ -scattered ensemble ${\cal D}$ for which it is hard to distinguish between a ${\cal D}$ -random sample and a realizable sample.

The theorem and the proof below work verbatim if we replace $\beta$ by $\beta(n)$ , provided that $\beta(n)>n^{-a}$ for some $a>0$ .

Proof Let ${\cal H}$ be the hypothesis class in question and suppose toward a contradiction that algorithm ${\cal L}$ learns ${\cal H}$ efficiently. Let $M\left(n,1/\epsilon,1/\delta\right)$ be the maximal number of random bits used by ${\cal L}$ when run on the input $n,\epsilon,\delta$. This includes both the bits describing the examples produced by the oracle and “standard” random bits. Since ${\cal L}$ is efficient, $M\left(n,1/\epsilon,1/\delta\right)<\operatorname{poly}(n,1/\epsilon,1/\delta)$. Define $q(n)=M\left(n,1/\beta,4\right)+n$.

By assumption, there is a $(q(n),\beta)$ -scattered ensemble ${\cal D}$ for which it is hard to distinguish a ${\cal D}$ -random sample from a realizable sample. Consider the algorithm ${\cal A}$ defined below. On input $S\in{\cal Z}_{n}^{m(n)}$ ,

Run ${\cal L}$ with parameters $n,\beta$ and $\frac{1}{4}$ , such that the examples’ oracle generates examples by choosing a random example from $S$ .

Let $h$ be the hypothesis that ${\cal L}$ returns. If $\operatorname{Err}_{S}(h)\leq\beta$ , output “realizable”. Otherwise, output “unrealizable”.

Next, we derive a contradiction by showing that ${\cal A}$ distinguishes a realizable sample from a ${\cal D}$ -random sample. Indeed, if the input $S$ is realizable, then ${\cal L}$ is guaranteed to return, with probability $\geq 1-\frac{1}{4}$ , a hypothesis $h:{\cal X}_{n}\to\{0,1\}$ with $\operatorname{Err}_{S}(h)\leq\beta$ . Therefore, w.p. $\geq\frac{3}{4}$ ${\cal A}$ will output “realizable”.

What if the input sample $S$ is drawn from ${\cal D}^{m(n)}_{n}$? Let ${\cal G}\subset\{0,1\}^{{\cal X}_{n}}$ be the collection of functions that ${\cal L}$ might return when run with parameters $n,\beta$ and $\frac{1}{4}$. We note that $|{\cal G}|\leq 2^{q(n)-n}$, since each hypothesis in ${\cal G}$ can be described by $q(n)-n$ bits, namely, the random bits that ${\cal L}$ uses and the description of the examples sampled by the oracle. Now, since ${\cal D}$ is $(q(n),\beta)$-scattered, the probability that $\operatorname{Err}_{S}(h)\leq\beta$ for some $h\in{\cal G}$ is at most $|{\cal G}|2^{-q(n)}\leq 2^{-n}$. It follows that the probability that ${\cal A}$ responds “realizable” is $\leq 2^{-n}$. This leads to the desired contradiction and concludes our proof. $\Box$
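The counting step at the end of the proof is a union bound over ${\cal G}$. The display below merely spells it out, reading "$(q(n),\beta)$-scattered" as the guarantee that $\Pr_{S}\left(\operatorname{Err}_{S}(h)\leq\beta\right)\leq 2^{-q(n)}$ for every fixed $h$, which is how the term is used in the coin-tossing example above.

```latex
\Pr_{S\sim{\cal D}^{m(n)}_{n}}\left(\exists h\in{\cal G}:\ \operatorname{Err}_{S}(h)\leq\beta\right)
\;\leq\;\sum_{h\in{\cal G}}\Pr_{S\sim{\cal D}^{m(n)}_{n}}\left(\operatorname{Err}_{S}(h)\leq\beta\right)
\;\leq\;|{\cal G}|\cdot 2^{-q(n)}
\;\leq\;2^{q(n)-n}\cdot 2^{-q(n)}
\;=\;2^{-n}.
```

The bound uses only the size of ${\cal G}$, not its structure, which is why it applies to improper learners as well.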

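For every sample $S\in{\cal Z}^{m(n)}_{n}$ that is $\epsilon(n)$-almost realizable, $\Pr_{\text{internal coins of }{\cal A}}\left({\cal A}(S)=\text{“almost realizable”}\right)\geq\frac{3}{4}$.

If $S\sim{\cal D}_{n}^{m(n)}$, then with probability $1-o_{n}(1)$ over the choice of $S$, it holds that $\Pr_{\text{internal coins of }{\cal A}}\left({\cal A}(S)=\text{“unrealizable”}\right)\geq\frac{3}{4}$.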

Let $\alpha\geq 1$ . Every hypothesis class that satisfies the following condition is not efficiently agnostically learnable with an approximation ratio of $\alpha$ . For some $\beta$ and every $c>0$ , there is a $(n^{c},\alpha\beta+1/n)$ -scattered ensemble ${\cal D}$ such that it is hard to distinguish between a ${\cal D}$ -random sample and a $\beta$ -almost realizable sample.

As in theorem 3.2, the theorem and the proof below work verbatim if we replace $\alpha$ by $\alpha(n)$ and $\beta$ by $\beta(n)$ , provided that $\beta(n)>n^{-a}$ for some $a>0$ .

Proof Let ${\cal H}$ be the hypothesis class in question and suppose toward a contradiction that ${\cal L}$ efficiently agnostically learns ${\cal H}$ with approximation ratio of $\alpha$. Let $M\left(n,1/\epsilon,1/\delta\right)$ be the maximal number of random bits used by ${\cal L}$ when it runs on the input $n,\epsilon,\delta$. This includes both the bits describing the examples produced by the oracle and the “standard” random bits. Since ${\cal L}$ is efficient, $M\left(n,1/\epsilon,1/\delta\right)<\operatorname{poly}(n,1/\epsilon,1/\delta)$. Define $q(n)=M\left(n,n,4\right)+n$.

By the assumptions of the theorem, there is a $(q(n),\alpha\beta+1/n)$ -scattered ensemble ${\cal D}$ such that it is hard to distinguish between a ${\cal D}$ -random sample and a $\beta$ -almost realizable sample. Consider the following efficient algorithm to distinguish between a ${\cal D}$ -random sample and a $\beta$ -almost realizable sample. On input $S\in{\cal Z}_{n}^{m(n)}$ ,

Run ${\cal L}$ with parameters $n,1/n$ and $\frac{1}{4}$ , such that the examples are sampled uniformly from $S$ .

Let $h$ be the hypothesis returned by the algorithm ${\cal L}$ . If $\operatorname{Err}_{S}(h)\leq\alpha\beta+1/n$ , return “almost realizable”. Otherwise, return “unrealizable”.
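The two steps of this distinguisher translate directly into code. The sketch below is a hedged illustration: `distinguish` follows the steps above, while `majority_label_learner` and the two toy samples are hypothetical stand-ins (the instances are mere placeholders), since the argument treats ${\cal L}$ as a black box.

```python
import random


def empirical_error(h, sample):
    """Fraction of examples (x, y) in the sample with h(x) != y."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)


def distinguish(sample, learner, n, alpha, beta, rng):
    """Run the learner with parameters (n, 1/n, 1/4) on uniform draws from S."""
    epsilon = 1.0 / n
    oracle = lambda: rng.choice(sample)
    h = learner(oracle, n, epsilon, 0.25)
    threshold = alpha * beta + epsilon
    return "almost realizable" if empirical_error(h, sample) <= threshold else "unrealizable"


def majority_label_learner(oracle, n, epsilon, delta, draws=201):
    """Hypothetical stand-in for L: predict the majority label among the draws."""
    labels = [oracle()[1] for _ in range(draws)]
    guess = 1 if 2 * sum(labels) > draws else 0
    return lambda x: guess


# Toy samples with placeholder instances: one label in ten is 0, versus a balanced sample.
mostly_ones = [((i,), 1) for i in range(9)] + [((9,), 0)]
balanced = [((i,), i % 2) for i in range(10)]
```

With $n=10$, $\alpha=1$ and $\beta=0.1$ the threshold is $0.2$. The stand-in learner attains error $0.1$ on `mostly_ones`, while any constant prediction has error $0.5$ on `balanced`.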

Next, we derive a contradiction by showing that this algorithm, which we denote by ${\cal A}$, distinguishes between a $\beta$-almost realizable sample and a ${\cal D}$-random sample. Indeed, if the input $S$ is $\beta$-almost realizable, then ${\cal L}$ is guaranteed to return, with probability $\geq 1-\frac{1}{4}$, a hypothesis $h:{\cal X}_{n}\to\{0,1\}$ with $\operatorname{Err}_{S}(h)\leq\alpha\beta+1/n$. Therefore, the algorithm ${\cal A}$ will return, w.p. $\geq\frac{3}{4}$, “almost realizable”.

Suppose now that the input sample $S$ is drawn according to ${\cal D}_{n}^{m(n)}$. Let ${\cal G}\subset\{0,1\}^{{\cal X}_{n}}$ be the collection of functions that the learning algorithm ${\cal L}$ might return when it runs with the parameters $n,1/n$ and $\frac{1}{4}$. Note that each hypothesis in ${\cal G}$ can be described by $q(n)-n$ bits, namely, the random bits used by ${\cal L}$ and the description of the examples sampled by the oracle. Therefore, $|{\cal G}|\leq 2^{q(n)-n}$. Now, since ${\cal D}$ is $(q(n),\alpha\beta+1/n)$-scattered, the probability that some function $h\in{\cal G}$ will have $\operatorname{Err}_{S}(h)\leq\alpha\beta+1/n$ is at most $|{\cal G}|2^{-q(n)}\leq 2^{-n}$. It follows that the probability that the algorithm ${\cal A}$ will return “almost realizable” is $\leq 2^{-n}$. $\Box$

The strong random CSP assumption

Let us briefly summarize the evidence for this assumption.

For every $\epsilon>0$ and sufficiently large $C>0$ , it is hard to distinguish instances with value $\geq\operatorname*{\overline{VAL}}(P)-\epsilon$ from random instances with $C\cdot n$ constraints.

Summary of results

Following the boosting argument of Schapire, hardness of improper learning of a class ${\cal H}$ immediately implies that for every $\epsilon>0$, there is no efficient algorithm that, when running on a distribution that is realized by ${\cal H}$, is guaranteed to output a hypothesis with error $\leq\frac{1}{2}-\epsilon$. Therefore, hardness results for improper learning are very strong, in the sense that they imply that the algorithm that just makes a random guess for each example is essentially optimal.

2 Agnostically learning halfspaces

The problem of proper agnostic learning of halfspaces was shown to be hard to approximate within a factor of $2^{\log^{1-\epsilon}(n)}$. Using the cryptographic technique, improper learning of halfspaces is known to be hard, under a certain cryptographic assumption regarding the shortest vector problem. No hardness results are known for approximately and improperly learning halfspaces. Here, we show that:

3 Learning intersection of halfspaces

Learning intersection of halfspaces has been a major challenge in machine learning. Besides being a natural generalization of learning halfspaces, its importance stems from neural networks. Learning neural networks was popular in the 80’s, and enjoys a certain comeback nowadays. A neural network is composed of layers, each of which is composed of nodes. The first layer consists of $n$ nodes, containing the input values. The nodes in the rest of the layers calculate a value according to a halfspace (or a “soft” halfspace obtained by replacing the sign function with a sigmoidal function) applied to the values of the nodes in the previous layer. The final layer consists of a single node, which is the output of the whole network.
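The connection between the two notions can be made concrete: an intersection of $k$ halfspaces is computed by a network with one hidden layer of $k$ halfspace nodes and an output node that is itself a halfspace over the hidden values. The sketch below is a minimal illustration. The helper names and the convention that a halfspace outputs $1$ when $\langle w,x\rangle\geq b$ are assumptions made for this example.

```python
from itertools import product


def halfspace(w, b):
    """Threshold node: output 1 if <w, x> >= b, and 0 otherwise."""
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) >= b else 0


def intersection_network(hidden_nodes):
    """Two-layer network computing the AND of the given halfspaces."""
    k = len(hidden_nodes)
    # The output node fires only when all k hidden nodes fire: sum of hidden values >= k.
    output_node = halfspace([1] * k, k)
    return lambda x: output_node([node(x) for node in hidden_nodes])


# Toy example over {-1,+1}^2: the intersection of {x_0 >= 0} and {x_1 >= 0}.
network = intersection_network([halfspace((1, 0), 0), halfspace((0, 1), 0)])
values = {x: network(x) for x in product((-1, 1), repeat=2)}
```

In this toy example the network outputs $1$ only on the point $(1,1)$.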

4 Additional results

5 On the proofs

Future work

We elaborate below on some of the numerous open problems and research directions that the present paper suggests.

As discussed in the previous section, one can try to derive it, even partially, from weaker assumptions.

3 More applications

Likewise for learning large margin halfspaces (see remark 7.4) and for parity.

Proofs of the lower bounds

Let ${\cal D}$ be some distribution on a set ${\cal X}$ . For even $m$ , let $X_{1},\ldots,X_{m}$ be independent random variables drawn according to ${\cal D}$ . Consider the sample $S=\{(X_{1},1),(X_{2},0)\ldots,(X_{m-1},1),(X_{m},0)\}$ . Then, for every $h:{\cal X}\to\{0,1\}$ ,

Proof For $1\leq i\leq\frac{m}{2}$ let $T_{i}=1[h(X_{2i-1})\neq 1]+1[h(X_{2i})\neq 0]$. Note that $\operatorname{Err}_{S}(h)=\frac{1}{m}\sum_{i=1}^{\frac{m}{2}}T_{i}$. Also, the $T_{i}$’s are independent random variables with mean $1$ and values between $0$ and $2$. Therefore, by Hoeffding’s bound,
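The displayed bound is not reproduced here, so the following numerical check rests on an assumption: that the intended statement is $\Pr\left(\operatorname{Err}_{S}(h)\leq\frac{1}{5}\right)\leq 2^{-\frac{9}{100}m}$, which is what Hoeffding’s bound yields for the $T_{i}$’s above and which matches the $\left(\frac{9}{100}m,\frac{1}{5}\right)$-scattered claim used later. Writing $p=\Pr_{X\sim{\cal D}}(h(X)=1)$, the number of mistakes of $h$ on $S$ is a sum of two independent binomial variables, so the probability can be computed exactly. The function names are illustrative.

```python
from math import comb


def binomial_pmf(n, p):
    """Probability mass function of Binomial(n, p) as a list indexed by k."""
    return [comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]


def prob_error_at_most_fifth(m, p):
    """Exact Pr[Err_S(h) <= 1/5] for the alternating-label sample, p = Pr[h(X) = 1]."""
    half = m // 2
    on_positives = binomial_pmf(half, 1 - p)  # mistakes on the examples labelled 1
    on_negatives = binomial_pmf(half, p)      # mistakes on the examples labelled 0
    budget = m // 5                           # at most m/5 mistakes in total
    return sum(pa * pb
               for a, pa in enumerate(on_positives)
               for b, pb in enumerate(on_negatives)
               if a + b <= budget)


def assumed_bound(m):
    """The assumed bound 2^(-9m/100)."""
    return 2.0 ** (-0.09 * m)
```

For $m=10$ and $p=\frac{1}{2}$ the exact probability is $\frac{56}{1024}$, well below the assumed bound. For a constant hypothesis ($p=0$ or $p=1$) the error is exactly $\frac{1}{2}$ and the probability is $0$.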

$H_{k}$ is heredity approximation resistant on satisfiable instances.

For every sufficiently large $k$ , there exists $y^{k}\in\{\pm 1\}^{K}$ such that $H_{k}(x)=1\Rightarrow H_{k}(y^{k}\oplus x)=0$

The reduction works as follows. Let $y^{k}$ be the vector from lemma 7.2. Given an instance

Note that if $J$ is random then so is $J^{\prime}$ . Also, if $J$ is satisfiable with a satisfying assignment $u$ , then, by lemma 7.2, $u$ satisfies in $J^{\prime}$ exactly the constraints with odd indices. Next, we will produce a sample $S\in\left(\{\pm 1\}^{2Kn}\times\{0,1\}\right)^{m}$ from $J^{\prime}$ as follows. We will index the coordinates of vectors in $\{\pm 1\}^{2Kn}$ by $[K]\times\{\pm 1\}\times[n]$ . We define a mapping $\Psi$ from the collection of $H_{k}$ -constraints to $\{\pm 1\}^{2Kn}$ as follows – for each constraint $C=H_{k}(j_{1}x_{i_{1}},\ldots,j_{K}x_{i_{K}})$ we define $\Psi(C)\in\{\pm 1\}^{2Kn}$ by the formula

Finally, if $J^{\prime}=\{C^{\prime}_{1},\ldots,C^{\prime}_{m}\}$ , we will produce the sample

The theorem follows from the following claim:

If $J$ is a random instance then $S$ is $\left(\frac{9}{100}m,\frac{1}{5}\right)$ -scattered.

where, as mentioned before, we index coordinates of $x\in\{\pm 1\}^{2Kn}$ by triplets in $[K]\times\{\pm 1\}\times[n]$ . We claim that for every $H_{k}$ -constraint $C$ , $\phi_{u}(\Psi(C))=C(u)$ . This suffices, since if $u$ satisfies $J$ then $u$ satisfies exactly the constraints with odd indices in $J^{\prime}$ . Therefore, by the definition of $S$ and the fact that $\forall C,\phi_{u}(\Psi(C))=C(u)$ , $\phi_{u}$ realizes $S$ .

Indeed, let $C(x)=H_{k}(j_{1}x_{i_{1}},\ldots,j_{K}x_{i_{K}})$ be a $H_{k}$ -constraint. We have

By a simple scaling argument we can prove corollary 5.2.

2 Agnostically learning halfspaces

Now define $\Psi:\{-1,1,0\}^{n}\to\{\pm 1\}^{2n}$ by

We will use assumption 4.5 with respect to the majority predicate $\operatorname*{MAJ}_{K}:\{\pm 1\}^{K}\to\{0,1\}$ . Recall that $\operatorname*{MAJ}(x)=1$ if and only if $\sum_{i=1}^{K}x_{i}>0$ . The following claim analyses its relevant properties.

$\operatorname*{\overline{VAL}}(\operatorname*{MAJ}_{K})=1-\frac{1}{K+1}$ .

and $\Pr_{x\sim{\cal D}}\left((x_{i},x_{j})=(0,0)\right)=\frac{1}{4}$ .

Since this is true for every pairwise uniform distribution, $\operatorname*{\overline{VAL}}(\operatorname*{MAJ}_{K})\leq\frac{2t+1}{2(t+1)}=1-\frac{1}{K+1}$ . $\Box$

${\cal D}$ is $(\Omega(n^{c}),\alpha\beta+\frac{1}{n})$ -scattered.

Combining all the above we conclude the proof of theorem 5.4. $\Box$

It is not hard to see that the proof of theorem 5.4 shows that it is hard to approximately learn large margin halfspaces with any constant approximation ratio. Taking considerations as in remark 7.3, one might hypothesize that the correct approximation ratio for this problem is about $\frac{1}{\gamma}$. As in the case of learning halfspaces, the best known algorithms do just a bit better, namely, they have an approximation ratio of $\frac{1}{\gamma\sqrt{\log(1/\gamma)}}$. Therefore, one might hypothesize that the best possible approximation ratio is $\frac{1}{\gamma\operatorname{poly}\left(\log(1/\gamma)\right)}$. We note that a recent result shows that this is the best possible approximation ratio, if we restrict ourselves to a large class of learning algorithms (that includes SVM with a kernel, regression, Fourier transform and more).

3 Learning automata

The theorem remains true (with the same proof), even if we restrict to acyclic automata.

Clearly, this automaton calculates the same function as $R$ . $\Box$

4 Toward intersection of 4 halfspaces

There is $k_{0}$ such that for every odd $k\geq k_{0}$ we have

Assuming the unique games conjecture, $P_{k}$ is heredity approximation resistant.

Proof We start with part 1. By , it suffices to show that there is a pairwise uniform distribution that is supported in $P_{k}^{-1}(1)$ . Denote $Q(x^{1},\ldots x^{4})=\wedge_{j=1}^{4}T_{k,\lceil\frac{k}{2}\rceil-1}(x^{j})$ and $R(x^{1},\ldots x^{4})=\neg\left(\wedge_{j=1}^{4}T_{k,\lceil\frac{k}{2}\rceil-1}(x^{j})\right)$ . Note that if ${\cal D}_{Q}$ is a pairwise uniform distribution that is supported in $Q^{-1}(1)$ and ${\cal D}_{R}$ is a pairwise uniform distribution that is supported in $R^{-1}(1)$ , then ${\cal D}_{Q}\times{\cal D}_{R}$ is a pairwise uniform distribution that is supported in $P_{k}^{-1}(1)$ . Therefore, it suffices to show that such ${\cal D}_{Q}$ and ${\cal D}_{R}$ exist.

We first construct ${\cal D}_{Q}$ . Let ${\cal D}_{k}$ be the following distribution over $\{\pm 1\}^{k}$ – with probability $\frac{1}{k+1}$ choose the all-one vector and with probability $\frac{k}{k+1}$ , choose at random a vector with $\lceil\frac{k}{2}\rceil-1$ ones (uniformly among all such vectors). By the argument of claim 2, ${\cal D}_{k}$ is pairwise uniform. Clearly, the distribution ${\cal D}_{Q}={\cal D}_{k}\times{\cal D}_{k}\times{\cal D}_{k}\times{\cal D}_{k}$ over $\left(\{\pm 1\}^{k}\right)^{4}$ is a pairwise uniform distribution that is supported in $Q^{-1}(1)$ .
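Pairwise uniformity of ${\cal D}_{k}$ can be verified by direct enumeration for small $k$. The sketch below builds the distribution with exact rational arithmetic and checks that every pair of coordinates is uniform on $\{\pm 1\}^{2}$. The function names are illustrative, and the enumeration is only feasible for small $k$.

```python
from fractions import Fraction
from itertools import combinations, product


def distribution_dk(k):
    """D_k: all-one vector w.p. 1/(k+1), else a uniform vector with ceil(k/2)-1 ones."""
    ones = -(-k // 2) - 1                      # ceil(k/2) - 1
    supports = list(combinations(range(k), ones))
    dist = {(1,) * k: Fraction(1, k + 1)}
    for support in supports:
        v = tuple(1 if i in support else -1 for i in range(k))
        dist[v] = dist.get(v, 0) + Fraction(k, k + 1) / len(supports)
    return dist


def is_pairwise_uniform(dist, k):
    """Check that each pair of coordinates takes every value in {+-1}^2 w.p. 1/4."""
    for i, j in combinations(range(k), 2):
        for a, b in product((1, -1), repeat=2):
            mass = sum(p for v, p in dist.items() if v[i] == a and v[j] == b)
            if mass != Fraction(1, 4):
                return False
    return True
```

The check succeeds for odd $k$ such as $3$, $5$ and $7$. It fails for $k=4$, where the marginals are no longer balanced, so the parity of $k$ matters.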

Next, we construct ${\cal D}_{R}$. Let $k_{0}$ be large enough so that for every $k\geq k_{0}$, the probability that a random vector from $\{\pm 1\}^{k}$ will have more than $\lceil\frac{k}{2}\rceil$ minus-ones is $\geq\frac{3}{8}$ (it is easy to see that this probability approaches $\frac{1}{2}$ as $k$ approaches $\infty$, and therefore such a $k_{0}$ exists). Now, let $Z\in\{0,1\}^{4}$ be a random variable that satisfies:

$Z_{1},\ldots,Z_{4}$ are pairwise independent.

For every $1\leq i\leq 4$ , $\Pr(Z_{i}=1)=\frac{3}{8}$ .

In a moment, we will show that a random variable with the above properties exists. Now, let $B\subset\{\pm 1\}^{k}$ be a set with $|B|\geq\frac{3}{8}\cdot 2^{k}$ such that every vector in $B$ has more than $\lceil\frac{k}{2}\rceil$ minus-ones. Consider the distribution ${\cal D}_{R}$ of the random variable $(X^{1},\ldots,X^{4})\in\left(\{\pm 1\}^{k}\right)^{4}$ sampled as follows. We first sample $Z$. Then, for $1\leq i\leq 4$, if $Z_{i}=1$, we choose $X^{i}$ to be a random vector from $B$, and otherwise we choose $X^{i}$ to be a random vector from $B^{c}$.

We note that since $Z_{1},\ldots,Z_{4}$ are pairwise independent, $X^{1},\ldots,X^{4}$ are pairwise independent as well. Also, the distribution of $X^{i}$, $i=1,\ldots,4$, is uniform. Therefore, ${\cal D}_{R}$ is pairwise uniform. Also, since $\Pr(Z=(0,0,0,0))=0$, with probability $1$, at least one of the $X^{i}$’s will have more than $\lceil\frac{k}{2}\rceil$ minus-ones. Therefore, ${\cal D}_{R}$ is supported in $R^{-1}(1)$.

It is left to show that there exists a random variable $Z\in\{0,1\}^{4}$ as specified above. Let $Z$ be the random variable defined as follows:

With probability $\frac{140}{192}$ $Z$ is a uniform vector with a single positive coordinate.

With probability $\frac{30}{192}$ $Z$ is a uniform vector with $2$ positive coordinates.

With probability $\frac{22}{192}$ $Z$ is a uniform vector with $4$ positive coordinates.

Clearly, $\Pr(Z=(0,0,0,0))=0$ . Also, for every distinct $1\leq i,j\leq 4$ we have

Therefore, the other two specifications of $Z$ hold as well.
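The stated properties of $Z$ can be confirmed with exact arithmetic. The sketch below enumerates the distribution defined by the three mixture weights and computes the marginals and the pairwise joint probabilities. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def distribution_z():
    """Mixture over {0,1}^4: weight 140/192, 30/192, 22/192 on 1, 2, 4 positive coordinates."""
    weights = {1: Fraction(140, 192), 2: Fraction(30, 192), 4: Fraction(22, 192)}
    dist = {}
    for positives, weight in weights.items():
        supports = list(combinations(range(4), positives))
        for support in supports:
            z = tuple(1 if i in support else 0 for i in range(4))
            dist[z] = weight / len(supports)   # uniform among vectors of this weight
    return dist


def marginal(dist, i):
    """Pr(Z_i = 1)."""
    return sum(p for z, p in dist.items() if z[i] == 1)


def joint(dist, i, j):
    """Pr(Z_i = 1, Z_j = 1)."""
    return sum(p for z, p in dist.items() if z[i] == 1 and z[j] == 1)
```

Each marginal equals $\frac{35}{192}+\frac{15}{192}+\frac{22}{192}=\frac{3}{8}$ and each pairwise joint probability equals $\frac{5}{192}+\frac{22}{192}=\frac{9}{64}=\left(\frac{3}{8}\right)^{2}$, so the coordinates are pairwise independent with the required bias.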

$P_{k}$ is heredity approximation resistant on satisfiable instances.

Given an instance $J$ , we produce two examples for each constraint: for the constraint

we will produce two examples in $\{-1,1,0\}^{4n}\times\{0,1\}$, each of which has exactly $4k$ nonzero coordinates. The first is a positively labelled example whose instance is the vector with the value $j_{q,l},\;1\leq q\leq 4,1\leq l\leq k$ in the $n(q-1)+i_{q,l}$ coordinate. The second is a negatively labelled example whose instance is the vector with the value $j_{q,l},\;5\leq q\leq 8,1\leq l\leq k$ in the $n(q-5)+i_{q,l}$ coordinate.

It is not hard to see that if $J$ is satisfiable then the produced sample is realizable by an intersection of four halfspaces: if $u\in\{\pm 1\}^{n}$ is a satisfying assignment then the sample is realized by the intersection of the $4$ halfspaces $\sum_{i=1}^{n}u_{i}x_{n(q-1)+i}\geq-1,\;\;q=1,2,3,4$. On the other hand, by proposition 7.1, if $J$ is a random instance with $n^{d}$ constraints, then the resulting ensemble is $(\Omega(n^{d}),\frac{1}{5})$-scattered.

5 Agnostically learning parity

We will reduce from the aforementioned problem to the problem of distinguishing between a $\beta$-almost realizable sample and a ${\cal D}$-random sample for a distribution ${\cal D}$ which is $\left(\Omega\left(n^{d}\right),\frac{1}{4}\right)$-scattered. Since both $\beta$ and $d$ are arbitrary, the theorem follows from theorem 3.4.

Resolution lower bounds

Theorem 4.2 now follows from theorem 8.1 and the following two lemmas.

Proof Let $\tau=\{T_{1},\ldots,T_{r}\}$ be a resolution refutation to $J$ . Define $\mu(T_{i})$ as the minimal number $\mu$ such that $T_{i}$ is implied by $\mu$ constraints in $J$ .

If $T_{i}$ is implied by $T_{i_{1}},T_{i_{2}},\;i_{1},i_{2}<i$ then $\mu(T_{i})\leq\mu(T_{i_{1}})+\mu(T_{i_{2}})$ .

For every $1\leq i\leq\frac{\mu}{2}$, $T_{j}$ contains a variable appearing only in $C_{i}$.

The next lemma shows that the condition in lemma 8.2 holds w.h.p. for a suitable random instance. For the sake of readability, it is formulated in terms of sets instead of constraints.

Fix integers $k>r>d$ such that $r>\max\{17d,544\}$ . Suppose that $A_{1},\ldots,A_{n^{d}}\in\binom{[n]}{k}$ are chosen uniformly at random. Then, with probability $1-o_{n}(1)$ , for every $I\subset[n^{d}]$ with $|I|\leq n^{\frac{3}{4}}$ for most $i\in I$ we have $|A_{i}\setminus\cup_{j\in I\setminus\{i\}}A_{j}|\geq k-r$ .

Proof Fix a set $I$ with $2\leq t\leq n^{\frac{3}{4}}$ elements. Order the sets in $I$ arbitrarily and also order the elements in each set arbitrarily. Let $X_{1},\ldots,X_{kt}$ be the following random variables: $X_{1}$ is the first element in the first set of $I$ , $X_{2}$ is the second element in the first set of $I$ and so on till the $k$ ’th element of the last set of $I$ .

Denote by $R_{i}\;\;1\leq i\leq kt$ the indicator random variable of the event that $X_{i}=X_{j}$ for some $j<i$ . We claim that if $\sum R_{i}<\frac{tr}{4}$ , the conclusion of the lemma holds for $I$ . Indeed, let $J_{1}\subset I$ be the set of indices with $R_{i}=1$ , $J_{2}\subset I$ be the set of indices $i$ with $R_{i}=0$ but $X_{i}=X_{j}$ for some $j>i$ and $J=J_{1}\cup J_{2}$ . If the conclusion of the lemma does not hold for $I$ , then $|J|\geq\frac{tr}{2}$ . If in addition $|J_{1}|=\sum R_{i}<\frac{tr}{4}$ we must have $|J_{2}|>\frac{tr}{4}>|J_{1}|$ . For every $i\in J_{2}$ , let $f(i)$ be the minimal index $j>i$ such that $X_{i}=X_{j}$ . We note that $f(i)\in J_{1}$ , therefore $f$ is a mapping from $J_{2}$ to $J_{1}$ . Since $|J_{2}|>|J_{1}|$ , $f(i_{1})=f(i_{2})$ for some $i_{1}<i_{2}$ in $J_{2}$ . Therefore, $X_{i_{1}}=X_{f(i_{1})}=X_{i_{2}}$ and hence, $R_{i_{2}}=1$ contradicting the assumption that $i_{2}\in J_{2}$ .

Note that the probability that $R_{i}=1$ is at most $\frac{tk}{n}$. This estimate holds also given the values of $R_{1},\ldots,R_{i-1}$. It follows that the probability that $R_{i}=1$ for every $i\in A$ for a particular $A\subset I$ with $|A|=\lceil\frac{rt}{4}\rceil$ is at most $\left(\frac{tk}{n}\right)^{\frac{rt}{4}}$. Therefore, for some constants $C^{\prime},C>0$ (that depend only on $d$ and $k$), the probability that $I$ fails to satisfy the conclusion of the lemma is bounded by

The second inequality follows from Stirling’s approximation. Summing over all collections $I$ of size $t$ we conclude that for some $C^{\prime\prime}>0$ , the probability that the conclusion of the lemma does not hold for some collection of size $t$ is at most

Summing over all $2\leq t\leq n^{\frac{3}{4}}$ , we conclude that the probability that the conclusion of the lemma does not hold is at most $C^{\prime\prime}n^{-\frac{1}{4}}=o_{n}(1)$ . $\Box$

Proof (of corollary 9.2) Under the conditions of the corollary, by theorem 9.1, we have $\mathbf{NP}\subset\mathbf{SZKP}$ or $\mathbf{CoNP}\subset\mathbf{SZKP}$. Since $\mathbf{SZKP}$ is closed under taking complement, in both cases, $\mathbf{NP}\subset\mathbf{SZKP}$. Since $\mathbf{SZKP}\subset\mathbf{CoAM}$, we conclude that $\mathbf{NP}\subset\mathbf{CoAM}$, which collapses the polynomial hierarchy. $\Box$

Consider the following problem. The input is a circuit $\Psi:\{0,1\}^{n}\to\{0,1\}^{m}$ and a number $t$. The instance is a YES instance if the entropy of $\Psi$, when it acts on a uniform input sampled from $\{0,1\}^{n}$, is $\leq t-1$ (we consider the standard Shannon entropy, measured in bits). The instance is a NO instance if this entropy is $\geq t$. This problem is known to be in $\mathbf{SZKP}$. To establish the proof, we will show that $L$ can be reduced to this problem.
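To fix ideas about the quantity involved, the following sketch computes the entropy of $\Psi$ on a uniform input by brute-force enumeration and classifies an instance accordingly. It is only an illustration: the circuit is modelled as a hypothetical Python callable on bit tuples, and the enumeration takes time $2^{n}$, so it says nothing about the complexity of the problem.

```python
from collections import Counter
from itertools import product
from math import log2


def output_entropy(circuit, n):
    """Shannon entropy (in bits) of circuit(x) for x uniform in {0,1}^n."""
    counts = Counter(circuit(x) for x in product((0, 1), repeat=n))
    total = 2 ** n
    return -sum((c / total) * log2(c / total) for c in counts.values())


def classify(circuit, n, t):
    """Return "YES" if entropy <= t-1, "NO" if entropy >= t, None if the promise fails."""
    entropy = output_entropy(circuit, n)
    if entropy <= t - 1:
        return "YES"
    if entropy >= t:
        return "NO"
    return None


# Toy circuits on 2 input bits; outputs are tuples so that they can be counted.
identity = lambda x: x
constant = lambda x: (0,)
first_bit = lambda x: (x[0],)
```

The identity on two bits has entropy $2$, the projection on the first bit has entropy $1$, and a constant circuit has entropy $0$.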

Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10. We thank Sangxia Huang for his kind help and for valuable discussions about his paper. We thank Guy Kindler for valuable discussions.
