From average case complexity to improper learning complexity

Amit Daniely, Nati Linial, Shai Shalev-Shwartz

Introduction

Valiant’s celebrated probably approximately correct (=PAC) model of machine learning led to an extensive research that yielded a whole scientific community devoted to computational learning theory. In the PAC learning model, a learner is given an oracle access to randomly generated samples (X,Y)X×{0,1}(X,Y)\in{\cal X}\times\{0,1\} where XX is sampled from some unknown distribution D{\cal D} on X{\cal X} and Y=h(X)Y=h^{*}(X) for some unknown function h:X{0,1}h^{*}:{\cal X}\to\{0,1\}. Furthermore, it is assumed that hh^{*} comes from a predefined hypothesis class H{\cal H}, consisting of 0,10,1 valued functions on X{\cal X}. The learning problem defined by H{\cal H} is to find a function h:X{0,1}h:{\cal X}\to\{0,1\} that minimizes ErrD(h):=PrXD(h(X)h(X))\operatorname{Err}_{{\cal D}}(h):=\Pr_{X\sim{\cal D}}(h(X)\not=h^{*}(X)). For concreteness’ sake we take X={±1}n{\cal X}=\{\pm 1\}^{n}, and we consider the learning problem tractable if there is an algorithm that on input ϵ\epsilon, runs in time poly(n,1/ϵ)\operatorname{poly}(n,1/\epsilon) and outputs, w.h.p., a hypothesis hh with Err(h)ϵ\operatorname{Err}(h)\leq\epsilon.

Assuming PNP\mathbf{P}\neq\mathbf{NP}, the status of most basic computational problems is fairly well understood. In a sharp contrast, almost 3030 years after Valiant’s paper, the status of most basic learning problems is still wide open – there is a huge gap between the performance of the best known algorithms and hardness results:

The crux of the matter, leading to this state of affairs, has to do with the learner’s freedom to return any hypothesis. A learner who may return hypotheses outside the class H{\cal H} is called an improper learner. This additional freedom makes such algorithms potentially more powerful than proper learners. On the other hand, this added flexibility makes it difficult to apply standard reductions from NP\mathbf{NP}-hard problems. Indeed, there was no success so far in proving intractability of a learning problem based on NP\mathbf{NP}-hardness. Moreover, as Applebaum, Barak and Xiao showed, many standard ways to do so are doomed to fail, unless the polynomial hierarchy collapses.

The vast majority of existing lower bounds on learning utilize the crypto-based argument, suggested in . Roughly speaking, to prove that a certain learning problem is hard, one starts with a certain collection of functions, that by assumption are one-way trapdoor permutations. This immediately yields some hard (usually artificial) learning problem. The final step is to reduce this artificial problem to some natural learning problem.

Unlike the difficulty in establishing lower bounds for improper learning, the situation in proper learning is much better understood. Usually, hardness of proper learning is proved by showing that it is NP\mathbf{NP}-hard to distinguish a realizable sample from an unrealizable sample. I.e., it is hard to tell whether there is some hypothesis in H{\cal H} which has zero error on a given sample. This, however, does not suffice for the purpose of proving lower bounds on improper learning, because it might be the case that the learner finds a hypothesis (not from H{\cal H}) that does not err on the sample even though no hHh\in{\cal H} can accomplish this. In this paper we present a new methodology for proving hardness of improper learning. Loosely speaking, we show that improper learning is impossible provided that it is hard to distinguish a realizable sample from a randomly generated unrealizable sample.

Agnostically learning halfspaces with a constant approximation ratio is hard, even over the boolean cube.

Learning intersection of ω(1)\omega(1) halfspaces is hard, even over the boolean cube.

We note that result 4 can be established using the cryptographic technique . Result 5 is often taken as a hardness assumption. We also conjecture that under our generalization of Feige’s assumption it is hard to learn intersections of even constant number of halfspaces. We present a possible approach to the case of four halfspaces. To the best of our knowledge, these results easily imply most existing lower bounds for improper learning.

There is a crucial reversal of order that works in our favour. To lower bound improper learning, we actually need much less than what is needed in cryptography, where a problem and a distribution on instances are appropriate if they fool every algorithm. In contrast, here we are presented with a concrete learning algorithms and we devise a problem and a distribution on instances that fail it.

2 On the role of average case complexity

Preliminaries

Distributions on Zn{\cal Z}_{n} (resp. Znm{\cal Z}_{n}^{m}) are denoted Dn{\cal D}_{n} (resp. Dnm{\cal D}_{n}^{m}). Ensembles of distributions are denoted by D{\cal D}. That is, D={Dnm(n)}n=1{\cal D}=\{{\cal D}_{n}^{m(n)}\}_{n=1}^{\infty} where Dnm(n){\cal D}_{n}^{m(n)} is a distributions on Znm(n){\cal Z}_{n}^{m(n)}. We say that D{\cal D} is a polynomial ensemble if m(n)m(n) is upper bounded by some polynomial in nn.

The error of a hypothesis h:Xn{0,1}h:{\cal X}_{n}\to\{0,1\} w.r.t. Dn{\cal D}_{n} on Zn{\cal Z}_{n} is defined as ErrDn(h)=Pr(x,y)Dn(h(x)y)\operatorname{Err}_{{\cal D}_{n}}(h)=\Pr_{(x,y)\sim{\cal D}_{n}}\left(h(x)\neq y\right). For a hypothesis class Hn{\cal H}_{n}, we define ErrDn(Hn)=minhHnErrDn(h)\operatorname{Err}_{{\cal D}_{n}}({\cal H}_{n})=\min_{h\in{\cal H}_{n}}\operatorname{Err}_{{\cal D}_{n}}(h). We say that a distribution Dn{\cal D}_{n} is realizable by hh (resp. Hn{\cal H}_{n}) if ErrDn(h)=0\operatorname{Err}_{{\cal D}_{n}}(h)=0 (resp. ErrDn(Hn)=0\operatorname{Err}_{{\cal D}_{n}}({\cal H}_{n})=0). Similarly, we say that Dn{\cal D}_{n} is ϵ\epsilon-almost realizable by hh (resp. Hn{\cal H}_{n}) if ErrDn(h)ϵ\operatorname{Err}_{{\cal D}_{n}}(h)\leq\epsilon (resp. ErrDn(Hn)ϵ\operatorname{Err}_{{\cal D}_{n}}({\cal H}_{n})\leq\epsilon).

A sample is a sequence S={(x1,y1),(xm,ym)}ZnmS=\{(x_{1},y_{1}),\ldots(x_{m},y_{m})\}\in{\cal Z}^{m}_{n}. The empirical error of a hypothesis h:Xn{0,1}h:{\cal X}_{n}\to\{0,1\} w.r.t. sample SS is ErrS(h)=1mi=1m1(h(xi)=yi)\operatorname{Err}_{S}(h)=\frac{1}{m}\sum_{i=1}^{m}1(h(x_{i})=y_{i}). The empirical error of a hypothesis class Hn{\cal H}_{n} w.r.t. SS is ErrS(Hn)=minhHnErrS(h)\operatorname{Err}_{S}({\cal H}_{n})=\min_{h\in{\cal H}_{n}}\operatorname{Err}_{S}(h). We say that a sample SS is realizable by hh if ErrS(h)=0\operatorname{Err}_{S}(h)=0. The sample SS is realizable by Hn{\cal H}_{n} if ErrS(Hn)=0\operatorname{Err}_{S}({\cal H}_{n})=0. Similarly, we define the notion of ϵ\epsilon-almost realizable sample (by either a hypothesis h:Xn{0,1}h:{\cal X}_{n}\to\{0,1\} or a class Hn{\cal H}_{n}).

A learning algorithm, denoted L{\cal L}, obtains an error parameter 0<ϵ<10<\epsilon<1, a confidence parameter 0<δ<10<\delta<1, a complexity parameter nn, and an access to an oracle that produces samples according to unknown distribution Dn{\cal D}_{n} on Zn{\cal Z}_{n}. It should output a (description of) hypothesis h:Xn{0,1}h:{\cal X}_{n}\to\{0,1\}. We say that the algorithm L{\cal L} (PAC) learns the hypothesis class H{\cal H} if, for every realizable distribution Dn{\cal D}_{n}, with probability 1δ\geq 1-\delta, L{\cal L} outputs a hypothesis with error ϵ\leq\epsilon. We say that an algorithm L{\cal L} agnostically learns H{\cal H} if, for every distribution Dn{\cal D}_{n}, with probability 1δ\geq 1-\delta, L{\cal L} outputs a hypothesis with error ErrDn(H)+ϵ\leq\operatorname{Err}_{{\cal D}_{n}}({\cal H})+\epsilon. We say that an algorithm L{\cal L} approximately agnostically learns H{\cal H} with approximation ratio α=α(n)1\alpha=\alpha(n)\geq 1 if, for every distribution Dn{\cal D}_{n}, with probability 1δ\geq 1-\delta, L{\cal L} outputs a hypothesis with error αErrDn(H)+ϵ\leq\alpha\cdot\operatorname{Err}_{{\cal D}_{n}}({\cal H})+\epsilon. We say that L{\cal L} is efficient if it runs in time polynomial in n,1/ϵn,1/\epsilon and 1/δ1/\delta, and outputs a hypothesis that can be evaluated in time polynomial in n,1/ϵn,1/\epsilon and 1/δ1/\delta. We say that L{\cal L} is proper (with respect to H{\cal H}) if it always outputs a hypothesis in H{\cal H}. Otherwise, we say that L{\cal L} is improper.

2 Constraints Satisfaction Problems

3 Resolution refutation and Davis Putnam algorithms

The methodology

We begin by discussing the methodology in the realm of realizable learning, and we later proceed to agnostic learning. Some of the ideas underling our methodology appeared, in a much more limited context, in .

To motivate the approach, recall how one usually proves that a class cannot be efficiently properly learnable. Given a hypothesis class H{\cal H}, let Π(H)\Pi({\cal H}) be the problem of distinguishing between an H{\cal H}-realizable sample SS and one with ErrS(H)14\operatorname{Err}_{S}({\cal H})\geq\frac{1}{4}. If H{\cal H} is efficiently properly learnable then this problem is inThe reverse direction is almost true: If the search version of this problem can be solved in polynomial time, then H{\cal H} is efficiently learnable. RP\mathbf{RP}: To solve Π(H)\Pi({\cal H}), we simply invoke a proper learning algorithm A{\cal A} that efficiently learns H{\cal H}, with examples drawn uniformly from SS. Let hh be the output of A{\cal A}. Since A{\cal A} properly learns H{\cal H}, we have

If SS is a realizable sample, then ErrS(h)\operatorname{Err}_{S}(h) is small.

If ErrS(H)14\operatorname{Err}_{S}({\cal H})\geq\frac{1}{4} then, since hHh\in{\cal H}, ErrS(h)14\operatorname{Err}_{S}(h)\geq\frac{1}{4}.

This gives an efficient way to decide whether SS is realizable. We conclude that if Π(H)\Pi({\cal H}) is NP\mathbf{NP}-hard, then H{\cal H} is not efficiently learnable, unless NP=RP\mathbf{NP}=\mathbf{RP}.

We see that it is not clear how to establish hardness of improper learning based on the hardness of distinguishing between a realizable and an unrealizable sample. The core problem is that even if SS is not realizable, the algorithm might still return a good hypothesis. The crux of our new technique is the observation that if SS is randomly generated unrealizable sample then even improper algorithm cannot return a hypothesis with a small empirical error. The point is that the returned hypothesis is determined solely by the examples that A{\cal A} sees and its random bits. Therefore, if A{\cal A} is an efficient algorithm, the number of hypotheses it might return cannot be too large. Hence, if SS is “random enough”, it likely to be far from all these hypotheses, in which case the hypothesis returned by A{\cal A} would have a large error on SS.

We now formalize this idea. Let D={Dnm(n)}n{\cal D}=\{{\cal D}^{m(n)}_{n}\}_{n} be a polynomial ensemble of distributions, such that Dnm(n){\cal D}^{m(n)}_{n} is a distribution on Znm(n){\cal Z}_{n}^{m(n)}. Think of Dnm(n){\cal D}^{m(n)}_{n} as a distribution that generates samples that are far from being realizable by H{\cal H}. We say that it is hard to distinguish between a D{\cal D}-random sample and a realizable sample if there is no efficient randomized algorithm A{\cal A} with the following properties:

For every realizable sample SZnm(n)S\in{\cal Z}^{m(n)}_{n},

If SDnm(n)S\sim{\cal D}_{n}^{m(n)}, then with probability 1on(1)1-o_{n}(1) over the choice of SS, it holds that

Let Dnm(n){\cal D}^{m(n)}_{n} be the distribution over Znm(n){\cal Z}_{n}^{m(n)} defined by taking m(n)m(n) independent uniformly chosen examples from Xn×{0,1}{\cal X}_{n}\times\{0,1\}. For f:Xn{0,1}f:{\cal X}_{n}\to\{0,1\}, PrSDnm(n)(ErrS(f)14)\Pr_{S\sim{\cal D}^{m(n)}_{n}}\left(\operatorname{Err}_{S}(f)\leq\frac{1}{4}\right) is the probability of getting at most m(n)4\frac{m(n)}{4} heads in m(n)m(n) independent tosses of a fair coin. By Hoeffding’s bound, this probability is 218m(n)\leq 2^{-\frac{1}{8}m(n)}. Therefore, D={Dnm(n)}n{\cal D}=\{{\cal D}^{m(n)}_{n}\}_{n} is (18m(n),1/4)\left(\frac{1}{8}m(n),1/4\right)-scattered.

Every hypothesis class that satisfies the following condition is not efficiently learnable. There exists β>0\beta>0 such that for every c>0c>0 there is an (nc,β)(n^{c},\beta)-scattered ensemble D{\cal D} for which it is hard to distinguish between a D{\cal D}-random sample and a realizable sample.

The theorem and the proof below work verbatim if we replace β\beta by β(n)\beta(n), provided that β(n)>na\beta(n)>n^{-a} for some a>0a>0.

Proof Let H{\cal H} be the hypothesis class in question and suppose toward a contradiction that algorithm L{\cal L} learns H{\cal H} efficiently. Let M(n,1/ϵ,1/δ)M\left(n,1/\epsilon,1/\delta\right) be the maximal number of random bits used by L{\cal L} when run on the input n,ϵ,δn,\epsilon,\delta. This includes both the bits describing the examples produced by the oracle and “standard” random bits. Since L{\cal L} is efficient, M(n,1/ϵ,1/δ)<poly(n,1/ϵ,1/δ)M\left(n,1/\epsilon,1/\delta\right)<\operatorname{poly}(n,1/\epsilon,1/\delta). Define

By assumption, there is a (q(n),β)(q(n),\beta)-scattered ensemble D{\cal D} for which it is hard to distinguish a D{\cal D}-random sample from a realizable sample. Consider the algorithm A{\cal A} defined below. On input SZnm(n)S\in{\cal Z}_{n}^{m(n)},

Run L{\cal L} with parameters n,βn,\beta and 14\frac{1}{4}, such that the examples’ oracle generates examples by choosing a random example from SS.

Let hh be the hypothesis that L{\cal L} returns. If ErrS(h)β\operatorname{Err}_{S}(h)\leq\beta, output “realizable”. Otherwise, output “unrealizable”.

Next, we derive a contradiction by showing that A{\cal A} distinguishes a realizable sample from a D{\cal D}-random sample. Indeed, if the input SS is realizable, then L{\cal L} is guaranteed to return, with probability 114\geq 1-\frac{1}{4}, a hypothesis h:Xn{0,1}h:{\cal X}_{n}\to\{0,1\} with ErrS(h)β\operatorname{Err}_{S}(h)\leq\beta. Therefore, w.p. 34\geq\frac{3}{4} A{\cal A} will output “realizable”.

What if the input sample SS is drawn from Dnm(n){\cal D}^{m(n)}_{n}? Let G{0,1}Xn{\cal G}\subset\{0,1\}^{{\cal X}_{n}} be the collection of functions that L{\cal L} might return when run with parameters n,ϵ(n)n,\epsilon(n) and 14\frac{1}{4}. We note that G2q(n)n|{\cal G}|\leq 2^{q(n)-n}, since each hypothesis in G{\cal G} can be described by q(n)nq(n)-n bits. Namely, the random bits that L{\cal L} uses and the description of the examples sampled by the oracle. Now, since D{\cal D} is (q(n),β)(q(n),\beta)-scattered, the probability that ErrS(h)β\operatorname{Err}_{S}(h)\leq\beta for some hGh\in{\cal G} is at most G2q(n)2n|{\cal G}|2^{-q(n)}\leq 2^{-n}. It follows that the probability that A{\cal A} responds “realizable” is 2n\leq 2^{-n}. This leads to the desired contradiction and concludes our proof. \Box

For every sample SZnm(n)S\in{\cal Z}^{m(n)}_{n} that is ϵ(n)\epsilon(n)-almost realizable,

If SDnm(n)S\sim{\cal D}_{n}^{m(n)}, then with probability 1on(1)1-o_{n}(1) over the choice of SS, it holds that

Let α1\alpha\geq 1. Every hypothesis class that satisfies the following condition is not efficiently agnostically learnable with an approximation ratio of α\alpha. For some β\beta and every c>0c>0, there is a (nc,αβ+1/n)(n^{c},\alpha\beta+1/n)-scattered ensemble D{\cal D} such that it is hard to distinguish between a D{\cal D}-random sample and a β\beta-almost realizable sample.

As in theorem 3.2, the theorem and the proof below work verbatim if we replace α\alpha by α(n)\alpha(n) and β\beta by β(n)\beta(n), provided that β(n)>na\beta(n)>n^{-a} for some a>0a>0.

Proof Let H{\cal H} be the hypothesis class in question and suppose toward a contradiction that L{\cal L} efficiently agnostically learns H{\cal H} with approximation ratio of α\alpha. Let M(n,1/ϵ,1/δ)M\left(n,1/\epsilon,1/\delta\right) be the maximal number of random bits used by L{\cal L} when it runs on the input n,ϵ,δn,\epsilon,\delta. This includes both the bits describing the examples produced by the oracle and the “standard” random bits. Since L{\cal L} is efficient, M(n,1/ϵ,1/δ)<poly(n,1/ϵ,1/δ)M\left(n,1/\epsilon,1/\delta\right)<\operatorname{poly}(n,1/\epsilon,1/\delta). Define,

By the assumptions of the theorem, there is a (q(n),αβ+1/n)(q(n),\alpha\beta+1/n)-scattered ensemble D{\cal D} such that it is hard to distinguish between a D{\cal D}-random sample and a β\beta-almost realizable sample. Consider the following efficient algorithm to distinguish between a D{\cal D}-random sample and a β\beta-almost realizable sample. On input SZnm(n)S\in{\cal Z}_{n}^{m(n)},

Run L{\cal L} with parameters n,1/nn,1/n and 14\frac{1}{4}, such that the examples are sampled uniformly from SS.

Let hh be the hypothesis returned by the algorithm L{\cal L}. If ErrS(h)αβ+1/n\operatorname{Err}_{S}(h)\leq\alpha\beta+1/n, return “almost realizable”. Otherwise, return “unrealizable”.

Next, we derive a contradiction by showing that this algorithm, which we denote by A{\cal A}, distinguishes between a realizable sample and a D{\cal D}-random sample. Indeed, if the input SS is β\beta-almost realizable, then L{\cal L} is guaranteed to return, with probability 114\geq 1-\frac{1}{4}, a hypothesis h:Xn{0,1}h:{\cal X}_{n}\to\{0,1\} with ErrS(h)αβ+1/n\operatorname{Err}_{S}(h)\leq\alpha\beta+1/n. Therefore, the algorithm A{\cal A} will return, w.p. 34\geq\frac{3}{4}, “almost realizable”.

Suppose now that the input sample SS is drawn according to Dn{\cal D}_{n}. Let G{0,1}Xn{\cal G}\subset\{0,1\}^{{\cal X}_{n}} be the collection of functions that the learning algorithm L{\cal L} might return when it runs with the parameters n,1/nn,1/n and 14\frac{1}{4}. Note that each hypothesis in G{\cal G} can be described by q(n)nq(n)-n bits, namely, the random bits used by L{\cal L} and the description of the examples sampled by the oracle. Therefore, G2q(n)n|{\cal G}|\leq 2^{q(n)-n}. Now, since D{\cal D} is (q(n),αβ+1/n)(q(n),\alpha\beta+1/n)-scattered, the probability that some function in hGh\in{\cal G} will have ErrS(h)αβ+1/n\operatorname{Err}_{S}(h)\leq\alpha\beta+1/n is at most G2q(n)2n|{\cal G}|2^{-q(n)}\leq 2^{-n}. It follows that the probability that the algorithm A{\cal A} will return “almost realizable” is 2n\leq 2^{-n}. \Box

The strong random CSP assumption

Let us briefly summarize the evidence for this assumption.

For every ϵ>0\epsilon>0 and sufficiently large C>0C>0, it is hard to distinguish instances with value VAL(P)ϵ\geq\operatorname*{\overline{VAL}}(P)-\epsilon from random instances with CnC\cdot n constraints.

Summary of results

Following the Boosting argument of Schapire , hardness of improper learning of a class H{\cal H} immediately implies that for every ϵ>0\epsilon>0, there is no efficient algorithm that when running on a distribution that is realized by H{\cal H}, guaranteed to output a hypothesis with error 12ϵ\leq\frac{1}{2}-\epsilon. Therefore, hardness results of improper learning are very strong, in the sense that they imply that the algorithm that just makes a random guess for each example, is essentially optimal.

2 Agnostically learning halfspaces

The problem of proper agnostic learning of halfspaces was shown to be hard to approximate within a factor of 2log1ϵ(n)2^{\log^{1-\epsilon}(n)} . Using the cryptographic technique, improper learning of halfspaces is known to be hard, under a certain cryptographic assumption regarding the shortest vector problem (, based on ). No hardness results are known for approximately and improperly learning halfspaces. Here, we show that:

3 Learning intersection of halfspaces

Learning intersection of halfspaces has been a major challenge in machine learning. Beside being a natural generalization of learning halfspaces, its importance stems from neural networks . Learning neural networks was popular in the 80’s, and enjoy a certain comeback nowadays. A neural network is composed of layers, each of which is composed of nodes. The first layer consists of nn nodes, containing the input values. The nodes in the rest of the layers calculates a value according to a halfspace (or a “soft” halfspace obtained by replacing the sign function with a sigmoidal function) applied on the values of the nodes in the previous layer. The final layer consists of a single node, which is the output of the whole network.

4 Additional results

5 On the proofs

Future work

We elaborate below on some of the numerous open problems and research directions that the present paper suggests.

As discussed in the previous section, one can try to derive it, even partially, from weaker assumptions.

3 More applications

Likewise for learning large margin halfspaces (see remark 7.4) and for parity.

Proofs of the lower bounds

Let D{\cal D} be some distribution on a set X{\cal X}. For even mm, let X1,,XmX_{1},\ldots,X_{m} be independent random variables drawn according to D{\cal D}. Consider the sample S={(X1,1),(X2,0),(Xm1,1),(Xm,0)}S=\{(X_{1},1),(X_{2},0)\ldots,(X_{m-1},1),(X_{m},0)\}. Then, for every h:X{0,1}h:{\cal X}\to\{0,1\},

Proof For 1im21\leq i\leq\frac{m}{2} let Ti=1[h(X2i1)1]+1[h(X2i)0]T_{i}=1[h(X_{2i-1})\neq 1]+1[h(X_{2i})\neq 0]. Note that ErrS(h)=1mi=1m2Ti\operatorname{Err}_{S}(h)=\frac{1}{m}\sum_{i=1}^{\frac{m}{2}}T_{i}. Also, the TiT_{i}’s are independent random variables with mean 11 and values between and 22. Therefore, by Hoeffding’s bound,

HkH_{k} is heredity approximation resistant on satisfiable instances.

For every sufficiently large kk, there exists yk{±1}Ky^{k}\in\{\pm 1\}^{K} such that Hk(x)=1Hk(ykx)=0H_{k}(x)=1\Rightarrow H_{k}(y^{k}\oplus x)=0

The reduction works as follows. Let yky^{k} be the vector from lemma 7.2. Given an instance

Note that if JJ is random then so is JJ^{\prime}. Also, if JJ is satisfiable with a satisfying assignment uu, then, by lemma 7.2, uu satisfies in JJ^{\prime} exactly the constraints with odd indices. Next, we will produce a sample S({±1}2Kn×{0,1})mS\in\left(\{\pm 1\}^{2Kn}\times\{0,1\}\right)^{m} from JJ^{\prime} as follows. We will index the coordinates of vectors in {±1}2Kn\{\pm 1\}^{2Kn} by [K]×{±1}×[n][K]\times\{\pm 1\}\times[n]. We define a mapping Ψ\Psi from the collection of HkH_{k}-constraints to {±1}2Kn\{\pm 1\}^{2Kn} as follows – for each constraint C=Hk(j1xi1,,jKxiK)C=H_{k}(j_{1}x_{i_{1}},\ldots,j_{K}x_{i_{K}}) we define Ψ(C){±1}2Kn\Psi(C)\in\{\pm 1\}^{2Kn} by the formula

Finally, if J={C1,,Cm}J^{\prime}=\{C^{\prime}_{1},\ldots,C^{\prime}_{m}\}, we will produce the sample

The theorem follows from the following claim:

If JJ is a random instance then SS is (9100m,15)\left(\frac{9}{100}m,\frac{1}{5}\right)-scattered.

where, as mentioned before, we index coordinates of x{±1}2Knx\in\{\pm 1\}^{2Kn} by triplets in [K]×{±1}×[n][K]\times\{\pm 1\}\times[n]. We claim that for every HkH_{k}-constraint CC, ϕu(Ψ(C))=C(u)\phi_{u}(\Psi(C))=C(u). This suffices, since if uu satisfies JJ then uu satisfies exactly the constraints with odd indices in JJ^{\prime}. Therefore, by the definition of SS and the fact that C,ϕu(Ψ(C))=C(u)\forall C,\phi_{u}(\Psi(C))=C(u), ϕu\phi_{u} realizes SS.

Indeed, let C(x)=Hk(j1xi1,,jKxiK)C(x)=H_{k}(j_{1}x_{i_{1}},\ldots,j_{K}x_{i_{K}}) be a HkH_{k}-constraint. We have

By a simple scaling argument we can prove corollary 5.2.

2 Agnostically learning halfspaces

Now define Ψ:{1,1,0}n{±1}2n\Psi:\{-1,1,0\}^{n}\to\{\pm 1\}^{2n} by

We will use assumption 4.5 with respect to the majority predicate MAJK:{±1}K{0,1}\operatorname*{MAJ}_{K}:\{\pm 1\}^{K}\to\{0,1\}. Recall that MAJ(x)=1\operatorname*{MAJ}(x)=1 if and only if i=1Kxi>0\sum_{i=1}^{K}x_{i}>0. The following claim analyses its relevant properties.

VAL(MAJK)=11K+1\operatorname*{\overline{VAL}}(\operatorname*{MAJ}_{K})=1-\frac{1}{K+1}.

and PrxD((xi,xj)=(0,0))=14\Pr_{x\sim{\cal D}}\left((x_{i},x_{j})=(0,0)\right)=\frac{1}{4}.

Since this is true for every pairwise uniform distribution, VAL(MAJK)2t+12(t+1)=11K+1\operatorname*{\overline{VAL}}(\operatorname*{MAJ}_{K})\leq\frac{2t+1}{2(t+1)}=1-\frac{1}{K+1}. \Box

D{\cal D} is (Ω(nc),αβ+1n)(\Omega(n^{c}),\alpha\beta+\frac{1}{n})-scattered.

Combining all the above we conclude the proof of theorem 5.4. \Box

It is not hard to see that the proof of theorem 5.4 shows that it is hard to approximately learn large margin halfspaces with any constant approximation ratio. Taking considerations as in remark 7.3, one might hypothesize that the correct approximation ratio for this problem is about 1γ\frac{1}{\gamma}. As in the case of learning halfspaces, best known algorithms do just a bit better, namely, they have an approximation ratio of 1γlog(1/γ)\frac{1}{\gamma\sqrt{\log(1/\gamma)}}. Therefore, one might hypothesize that the best possible approximation ratio is 1γpoly(log(1/γ))\frac{1}{\gamma\operatorname{poly}\left(\log(1/\gamma)\right)}. We note that a recent result shows that this is the best possible approximation ratio, if we restrict ourselves to a large class of learning algorithms (that includes SVM with a kernel, regression, Fourier transform and more).

3 Learning automata

The theorem remains true (with the same proof), even if we restrict to acyclic automata.

Clearly, this automaton calculates the same function as RR. \Box

4 Toward intersection of 444 halfspaces

There is k0k_{0} such that for every odd kk0k\geq k_{0} we have

Assuming the unique games conjecture, PkP_{k} is heredity approximation resistant.

Proof We start with part 1. By , it suffices to show that there is a pairwise uniform distribution that is supported in Pk1(1)P_{k}^{-1}(1). Denote Q(x1,x4)=j=14Tk,k21(xj)Q(x^{1},\ldots x^{4})=\wedge_{j=1}^{4}T_{k,\lceil\frac{k}{2}\rceil-1}(x^{j}) and R(x1,x4)=¬(j=14Tk,k21(xj))R(x^{1},\ldots x^{4})=\neg\left(\wedge_{j=1}^{4}T_{k,\lceil\frac{k}{2}\rceil-1}(x^{j})\right). Note that if DQ{\cal D}_{Q} is a pairwise uniform distribution that is supported in Q1(1)Q^{-1}(1) and DR{\cal D}_{R} is a pairwise uniform distribution that is supported in R1(1)R^{-1}(1), then DQ×DR{\cal D}_{Q}\times{\cal D}_{R} is a pairwise uniform distribution that is supported in Pk1(1)P_{k}^{-1}(1). Therefore, it suffices to show that such DQ{\cal D}_{Q} and DR{\cal D}_{R} exist.

We first construct DQ{\cal D}_{Q}. Let Dk{\cal D}_{k} be the following distribution over {±1}k\{\pm 1\}^{k} – with probability 1k+1\frac{1}{k+1} choose the all-one vector and with probability kk+1\frac{k}{k+1}, choose at random a vector with k21\lceil\frac{k}{2}\rceil-1 ones (uniformly among all such vectors). By the argument of claim 2, Dk{\cal D}_{k} is pairwise uniform. Clearly, the distribution DQ=Dk×Dk×Dk×Dk{\cal D}_{Q}={\cal D}_{k}\times{\cal D}_{k}\times{\cal D}_{k}\times{\cal D}_{k} over ({±1}k)4\left(\{\pm 1\}^{k}\right)^{4} is a pairwise uniform distribution that is supported in Q1(1)Q^{-1}(1).

Next, we construct DR{\cal D}_{R}. Let k0k_{0} be large enough so that for every kk0k\geq k_{0}, the probability that a random vector from {±1}k\{\pm 1\}^{k} will have more than k2\lceil\frac{k}{2}\rceil minus-ones is 38\geq\frac{3}{8} (it is easy to see that this probability approaches 12\frac{1}{2} as kk approaches \infty. Therefore, such k0k_{0} exists). Now, let Z{0,1}4Z\in\{0,1\}^{4} be a random variable that satisfies:

Z1,,Z4Z_{1},\ldots,Z_{4} are pairwise independent.

For every 1i41\leq i\leq 4, Pr(Zi=1)=38\Pr(Z_{i}=1)=\frac{3}{8}.

In a moment, we will show that a random variable with the above properties exists. Now, let B{±1}kB\subset\{\pm 1\}^{k} be a set with B382k|B|\geq\frac{3}{8}\cdot 2^{k} such that every vector in BB has more than k2\lceil\frac{k}{2}\rceil minus-ones. Consider the distribution DR{\cal D}_{R} of the random variable (X1,,X4)({±1}k)4(X^{1},\ldots,X^{4})\in\left(\{\pm 1\}^{k}\right)^{4} sampled as follows. We first sample ZZ, then, for 1i41\leq i\leq 4, if Zi=1Z_{i}=1, we choose XiX^{i} to be a random vector BB and otherwise, we choose XiX^{i} to be a random vector BcB^{c}.

We note that since Z1,,Z4Z_{1},\ldots,Z_{4} are pairwise independent, X1,,X4X^{1},\ldots,X^{4} are pairwise independent as well. Also, the distribution of Xi,  1=1,,4X^{i},\;1=1,\ldots,4 is uniform. Therefore, DR{\cal D}_{R} is pairwise uniform. Also, since Pr(Z=(0,0,0,0))=0\Pr(Z=(0,0,0,0))=0, with probability 11, at least one of the XiX^{i}’s will have more than k2\lceil\frac{k}{2}\rceil minus-ones. Therefore, DR{\cal D}_{R} is supported in R1(1)R^{-1}(1).

It is left to show that there exists a random variable Z{0,1}4Z\in\{0,1\}^{4} as specified above. Let ZZ be the random variable defined as follows:

With probability 140192\frac{140}{192} ZZ is a uniform vector with a single positive coordinate.

With probability 30192\frac{30}{192} ZZ is a uniform vector with 22 positive coordinates.

With probability 22192\frac{22}{192} ZZ is a uniform vector with 44 positive coordinates.

Clearly, Pr(Z=(0,0,0,0))=0\Pr(Z=(0,0,0,0))=0. Also, for every distinct 1i,j41\leq i,j\leq 4 we have

Therefore, the other two specifications of ZZ hold as well.

PkP_{k} is heredity approximation resistant on satisfiable instances.

Given an instance JJ, we produce two examples for each constraint: for the constraint

we will produce two examples in {1,1,0}4n×{0,1}\{-1,1,0\}^{4n}\times\{0,1\}, each of which has exactly 4k4k non zero coordinates. The first is a positively labelled example whose instance is the vector with the value jq,l,  1q4,1lkj_{q,l},\;1\leq q\leq 4,1\leq l\leq k in the n(q1)+iq,ln(q-1)+i_{q,l} coordinate. the second is a negatively labelled example whose instance is the vector with the value jq,l,  5q8,1lkj_{q,l},\;5\leq q\leq 8,1\leq l\leq k in the n(q5)+iq,ln(q-5)+i_{q,l} coordinate.

It is not hard to see that if JJ is satisfiable then the produced sample is realizable by intersection of four halfspaces: if u{±1}nu\in\{\pm 1\}^{n} is a satisfying assignment then the sample is realized by the intersection of the 44 halfspaces i=1nuixn(q1)+i1,    q=1,2,3,4\sum_{i=1}^{n}u_{i}x_{n(q-1)+i}\geq-1,\;\;q=1,2,3,4. On the other hand, by proposition 7.1, if JJ is random instance with ndn^{d} constraints, then the resulting ensamble is (Ω(nd),15)(\Omega(n^{d}),\frac{1}{5}) scattered.

5 Agnostically learning parity

We will reduce from the aforementioned problem to the problem of distinguishing between β\beta-almost realizable sample and D{\cal D}-random sample for a distribution D{\cal D} which is (Ω(nd),14)\left(\Omega\left(n^{d}\right),\frac{1}{4}\right)-scattered. Since both β\beta and dd are arbitrary, the theorem follows from theorem 3.4.

Resolution lower bounds

Theorem 4.2 now follows from theorem 8.1 and the following two lemmas.

Proof Let τ={T1,,Tr}\tau=\{T_{1},\ldots,T_{r}\} be a resolution refutation to JJ. Define μ(Ti)\mu(T_{i}) as the minimal number μ\mu such that TiT_{i} is implied by μ\mu constraints in JJ.

If TiT_{i} is implied by Ti1,Ti2,  i1,i2<iT_{i_{1}},T_{i_{2}},\;i_{1},i_{2}<i then μ(Ti)μ(Ti1)+μ(Ti2)\mu(T_{i})\leq\mu(T_{i_{1}})+\mu(T_{i_{2}}).

For every 1iμ21\leq i\leq\frac{\mu}{2}, TjT_{j} contains a variable appearing only is CiC_{i}.

The next lemma shows that the condition in lemma 8.2 holds w.h.p. for a suitable random instance. For the sake of readability, it is formulated in terms of sets instead of constraints.

Fix integers k>r>dk>r>d such that r>max{17d,544}r>\max\{17d,544\}. Suppose that A1,,And([n]k)A_{1},\ldots,A_{n^{d}}\in\binom{[n]}{k} are chosen uniformly at random. Then, with probability 1on(1)1-o_{n}(1), for every I[nd]I\subset[n^{d}] with In34|I|\leq n^{\frac{3}{4}} for most iIi\in I we have AijI{i}Ajkr|A_{i}\setminus\cup_{j\in I\setminus\{i\}}A_{j}|\geq k-r.

Proof Fix a set II with 2tn342\leq t\leq n^{\frac{3}{4}} elements. Order the sets in II arbitrarily and also order the elements in each set arbitrarily. Let X1,,XktX_{1},\ldots,X_{kt} be the following random variables: X1X_{1} is the first element in the first set of II, X2X_{2} is the second element in the first set of II and so on till the kk’th element of the last set of II.

Denote by Ri    1iktR_{i}\;\;1\leq i\leq kt the indicator random variable of the event that Xi=XjX_{i}=X_{j} for some j<ij<i. We claim that if Ri<tr4\sum R_{i}<\frac{tr}{4}, the conclusion of the lemma holds for II. Indeed, let J1IJ_{1}\subset I be the set of indices with Ri=1R_{i}=1, J2IJ_{2}\subset I be the set of indices ii with Ri=0R_{i}=0 but Xi=XjX_{i}=X_{j} for some j>ij>i and J=J1J2J=J_{1}\cup J_{2}. If the conclusion of the lemma does not hold for II, then Jtr2|J|\geq\frac{tr}{2}. If in addition J1=Ri<tr4|J_{1}|=\sum R_{i}<\frac{tr}{4} we must have J2>tr4>J1|J_{2}|>\frac{tr}{4}>|J_{1}|. For every iJ2i\in J_{2}, let f(i)f(i) be the minimal index j>ij>i such that Xi=XjX_{i}=X_{j}. We note that f(i)J1f(i)\in J_{1}, therefore ff is a mapping from J2J_{2} to J1J_{1}. Since J2>J1|J_{2}|>|J_{1}|, f(i1)=f(i2)f(i_{1})=f(i_{2}) for some i1<i2i_{1}<i_{2} in J2J_{2}. Therefore, Xi1=Xf(i1)=Xi2X_{i_{1}}=X_{f(i_{1})}=X_{i_{2}} and hence, Ri2=1R_{i_{2}}=1 contradicting the assumption that i2J2i_{2}\in J_{2}.

Note that the probability that Ri=1R_{i}=1 is at most tkn\frac{tk}{n}. This estimate holds also given the values of R1,,Ri1R_{1},\ldots,R_{i-1}. It follows that the probability that Ri=1R_{i}=1 for every iAi\in A for a particular AIA\subset I with A=rt4|A|=\lceil\frac{rt}{4}\rceil is at most (tkn)rt4\left(\frac{tk}{n}\right)^{\frac{rt}{4}}. Therefore, for some constants C,C>0C^{\prime},C>0 (that depend only on dd and kk), the probability that JJ fails to satisfy the conclusion of the lemma is bounded by

The second inequality follows from Stirling’s approximation. Summing over all collections II of size tt we conclude that for some C>0C^{\prime\prime}>0, the probability that the conclusion of the lemma does not hold for some collection of size tt is at most

Summing over all 2tn342\leq t\leq n^{\frac{3}{4}}, we conclude that the probability that the conclusion of the lemma does not hold is at most Cn14=on(1)C^{\prime\prime}n^{-\frac{1}{4}}=o_{n}(1). \Box

Proof (of corollary 9.2) Under the conditions of the corollary, by theorem 9.1, we have NPSZKP\mathbf{NP}\subset\mathbf{SZKP} or CoNPSZKP\mathbf{CoNP}\subset\mathbf{SZKP}. Since SZKP\mathbf{SZKP} is closed under taking complement , in both cases, NPSZKP\mathbf{NP}\subset\mathbf{SZKP}. Since SZKPCoAM\mathbf{SZKP}\subset\mathbf{CoAM} , we conclude that NPCoAM\mathbf{NP}\subset\mathbf{CoAM}, which collapses the polynomial hierarchy . \Box

Consider the following problem. The input is a circuit Ψ:{0,1}n{0,1}m\Psi:\{0,1\}^{n}\to\{0,1\}^{m} and a number tt. The instance is a YES instance if the entropyWe consider the standard Shannon’s entropy with bits units. of Ψ\Psi, when it acts on a uniform input sampled from {0,1}n\{0,1\}^{n}, is t1\leq t-1. The instance is a NO instance if this entropy is t\geq t. By this problem is in SZKP\mathbf{SZKP}. To establish the proof, we will show that LL can be reduced to this problem.

Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10. We thank Sangxia Huang for his kind help and for valuable discussions about his paper . We thank Guy Kindler for valuable discussions.

References