From average case complexity to improper learning complexity
Amit Daniely, Nati Linial, Shai Shalev-Shwartz
Introduction
Valiant’s celebrated probably approximately correct (=PAC) model of machine learning led to an extensive research that yielded a whole scientific community devoted to computational learning theory. In the PAC learning model, a learner is given an oracle access to randomly generated samples where is sampled from some unknown distribution on and for some unknown function . Furthermore, it is assumed that comes from a predefined hypothesis class , consisting of valued functions on . The learning problem defined by is to find a function that minimizes . For concreteness’ sake we take , and we consider the learning problem tractable if there is an algorithm that on input , runs in time and outputs, w.h.p., a hypothesis with .
Assuming , the status of most basic computational problems is fairly well understood. In a sharp contrast, almost years after Valiant’s paper, the status of most basic learning problems is still wide open – there is a huge gap between the performance of the best known algorithms and hardness results:
The crux of the matter, leading to this state of affairs, has to do with the learner’s freedom to return any hypothesis. A learner who may return hypotheses outside the class is called an improper learner. This additional freedom makes such algorithms potentially more powerful than proper learners. On the other hand, this added flexibility makes it difficult to apply standard reductions from -hard problems. Indeed, there was no success so far in proving intractability of a learning problem based on -hardness. Moreover, as Applebaum, Barak and Xiao showed, many standard ways to do so are doomed to fail, unless the polynomial hierarchy collapses.
The vast majority of existing lower bounds on learning utilize the crypto-based argument, suggested in . Roughly speaking, to prove that a certain learning problem is hard, one starts with a certain collection of functions, that by assumption are one-way trapdoor permutations. This immediately yields some hard (usually artificial) learning problem. The final step is to reduce this artificial problem to some natural learning problem.
Unlike the difficulty in establishing lower bounds for improper learning, the situation in proper learning is much better understood. Usually, hardness of proper learning is proved by showing that it is -hard to distinguish a realizable sample from an unrealizable sample. I.e., it is hard to tell whether there is some hypothesis in which has zero error on a given sample. This, however, does not suffice for the purpose of proving lower bounds on improper learning, because it might be the case that the learner finds a hypothesis (not from ) that does not err on the sample even though no can accomplish this. In this paper we present a new methodology for proving hardness of improper learning. Loosely speaking, we show that improper learning is impossible provided that it is hard to distinguish a realizable sample from a randomly generated unrealizable sample.
Agnostically learning halfspaces with a constant approximation ratio is hard, even over the boolean cube.
Learning intersection of halfspaces is hard, even over the boolean cube.
We note that result 4 can be established using the cryptographic technique . Result 5 is often taken as a hardness assumption. We also conjecture that under our generalization of Feige’s assumption it is hard to learn intersections of even constant number of halfspaces. We present a possible approach to the case of four halfspaces. To the best of our knowledge, these results easily imply most existing lower bounds for improper learning.
There is a crucial reversal of order that works in our favour. To lower bound improper learning, we actually need much less than what is needed in cryptography, where a problem and a distribution on instances are appropriate if they fool every algorithm. In contrast, here we are presented with a concrete learning algorithms and we devise a problem and a distribution on instances that fail it.
2 On the role of average case complexity
Preliminaries
Distributions on (resp. ) are denoted (resp. ). Ensembles of distributions are denoted by . That is, where is a distributions on . We say that is a polynomial ensemble if is upper bounded by some polynomial in .
The error of a hypothesis w.r.t. on is defined as . For a hypothesis class , we define . We say that a distribution is realizable by (resp. ) if (resp. ). Similarly, we say that is -almost realizable by (resp. ) if (resp. ).
A sample is a sequence . The empirical error of a hypothesis w.r.t. sample is . The empirical error of a hypothesis class w.r.t. is . We say that a sample is realizable by if . The sample is realizable by if . Similarly, we define the notion of -almost realizable sample (by either a hypothesis or a class ).
A learning algorithm, denoted , obtains an error parameter , a confidence parameter , a complexity parameter , and an access to an oracle that produces samples according to unknown distribution on . It should output a (description of) hypothesis . We say that the algorithm (PAC) learns the hypothesis class if, for every realizable distribution , with probability , outputs a hypothesis with error . We say that an algorithm agnostically learns if, for every distribution , with probability , outputs a hypothesis with error . We say that an algorithm approximately agnostically learns with approximation ratio if, for every distribution , with probability , outputs a hypothesis with error . We say that is efficient if it runs in time polynomial in and , and outputs a hypothesis that can be evaluated in time polynomial in and . We say that is proper (with respect to ) if it always outputs a hypothesis in . Otherwise, we say that is improper.
2 Constraints Satisfaction Problems
3 Resolution refutation and Davis Putnam algorithms
The methodology
We begin by discussing the methodology in the realm of realizable learning, and we later proceed to agnostic learning. Some of the ideas underling our methodology appeared, in a much more limited context, in .
To motivate the approach, recall how one usually proves that a class cannot be efficiently properly learnable. Given a hypothesis class , let be the problem of distinguishing between an -realizable sample and one with . If is efficiently properly learnable then this problem is inThe reverse direction is almost true: If the search version of this problem can be solved in polynomial time, then is efficiently learnable. : To solve , we simply invoke a proper learning algorithm that efficiently learns , with examples drawn uniformly from . Let be the output of . Since properly learns , we have
If is a realizable sample, then is small.
If then, since , .
This gives an efficient way to decide whether is realizable. We conclude that if is -hard, then is not efficiently learnable, unless .
We see that it is not clear how to establish hardness of improper learning based on the hardness of distinguishing between a realizable and an unrealizable sample. The core problem is that even if is not realizable, the algorithm might still return a good hypothesis. The crux of our new technique is the observation that if is randomly generated unrealizable sample then even improper algorithm cannot return a hypothesis with a small empirical error. The point is that the returned hypothesis is determined solely by the examples that sees and its random bits. Therefore, if is an efficient algorithm, the number of hypotheses it might return cannot be too large. Hence, if is “random enough”, it likely to be far from all these hypotheses, in which case the hypothesis returned by would have a large error on .
We now formalize this idea. Let be a polynomial ensemble of distributions, such that is a distribution on . Think of as a distribution that generates samples that are far from being realizable by . We say that it is hard to distinguish between a -random sample and a realizable sample if there is no efficient randomized algorithm with the following properties:
For every realizable sample ,
If , then with probability over the choice of , it holds that
Let be the distribution over defined by taking independent uniformly chosen examples from . For , is the probability of getting at most heads in independent tosses of a fair coin. By Hoeffding’s bound, this probability is . Therefore, is -scattered.
Every hypothesis class that satisfies the following condition is not efficiently learnable. There exists such that for every there is an -scattered ensemble for which it is hard to distinguish between a -random sample and a realizable sample.
The theorem and the proof below work verbatim if we replace by , provided that for some .
Proof Let be the hypothesis class in question and suppose toward a contradiction that algorithm learns efficiently. Let be the maximal number of random bits used by when run on the input . This includes both the bits describing the examples produced by the oracle and “standard” random bits. Since is efficient, . Define
By assumption, there is a -scattered ensemble for which it is hard to distinguish a -random sample from a realizable sample. Consider the algorithm defined below. On input ,
Run with parameters and , such that the examples’ oracle generates examples by choosing a random example from .
Let be the hypothesis that returns. If , output “realizable”. Otherwise, output “unrealizable”.
Next, we derive a contradiction by showing that distinguishes a realizable sample from a -random sample. Indeed, if the input is realizable, then is guaranteed to return, with probability , a hypothesis with . Therefore, w.p. will output “realizable”.
What if the input sample is drawn from ? Let be the collection of functions that might return when run with parameters and . We note that , since each hypothesis in can be described by bits. Namely, the random bits that uses and the description of the examples sampled by the oracle. Now, since is -scattered, the probability that for some is at most . It follows that the probability that responds “realizable” is . This leads to the desired contradiction and concludes our proof.
For every sample that is -almost realizable,
If , then with probability over the choice of , it holds that
Let . Every hypothesis class that satisfies the following condition is not efficiently agnostically learnable with an approximation ratio of . For some and every , there is a -scattered ensemble such that it is hard to distinguish between a -random sample and a -almost realizable sample.
As in theorem 3.2, the theorem and the proof below work verbatim if we replace by and by , provided that for some .
Proof Let be the hypothesis class in question and suppose toward a contradiction that efficiently agnostically learns with approximation ratio of . Let be the maximal number of random bits used by when it runs on the input . This includes both the bits describing the examples produced by the oracle and the “standard” random bits. Since is efficient, . Define,
By the assumptions of the theorem, there is a -scattered ensemble such that it is hard to distinguish between a -random sample and a -almost realizable sample. Consider the following efficient algorithm to distinguish between a -random sample and a -almost realizable sample. On input ,
Run with parameters and , such that the examples are sampled uniformly from .
Let be the hypothesis returned by the algorithm . If , return “almost realizable”. Otherwise, return “unrealizable”.
Next, we derive a contradiction by showing that this algorithm, which we denote by , distinguishes between a realizable sample and a -random sample. Indeed, if the input is -almost realizable, then is guaranteed to return, with probability , a hypothesis with . Therefore, the algorithm will return, w.p. , “almost realizable”.
Suppose now that the input sample is drawn according to . Let be the collection of functions that the learning algorithm might return when it runs with the parameters and . Note that each hypothesis in can be described by bits, namely, the random bits used by and the description of the examples sampled by the oracle. Therefore, . Now, since is -scattered, the probability that some function in will have is at most . It follows that the probability that the algorithm will return “almost realizable” is .
The strong random CSP assumption
Let us briefly summarize the evidence for this assumption.
For every and sufficiently large , it is hard to distinguish instances with value from random instances with constraints.
Summary of results
Following the Boosting argument of Schapire , hardness of improper learning of a class immediately implies that for every , there is no efficient algorithm that when running on a distribution that is realized by , guaranteed to output a hypothesis with error . Therefore, hardness results of improper learning are very strong, in the sense that they imply that the algorithm that just makes a random guess for each example, is essentially optimal.
2 Agnostically learning halfspaces
The problem of proper agnostic learning of halfspaces was shown to be hard to approximate within a factor of . Using the cryptographic technique, improper learning of halfspaces is known to be hard, under a certain cryptographic assumption regarding the shortest vector problem (, based on ). No hardness results are known for approximately and improperly learning halfspaces. Here, we show that:
3 Learning intersection of halfspaces
Learning intersection of halfspaces has been a major challenge in machine learning. Beside being a natural generalization of learning halfspaces, its importance stems from neural networks . Learning neural networks was popular in the 80’s, and enjoy a certain comeback nowadays. A neural network is composed of layers, each of which is composed of nodes. The first layer consists of nodes, containing the input values. The nodes in the rest of the layers calculates a value according to a halfspace (or a “soft” halfspace obtained by replacing the sign function with a sigmoidal function) applied on the values of the nodes in the previous layer. The final layer consists of a single node, which is the output of the whole network.
4 Additional results
5 On the proofs
Future work
We elaborate below on some of the numerous open problems and research directions that the present paper suggests.
As discussed in the previous section, one can try to derive it, even partially, from weaker assumptions.
3 More applications
Likewise for learning large margin halfspaces (see remark 7.4) and for parity.
Proofs of the lower bounds
Let be some distribution on a set . For even , let be independent random variables drawn according to . Consider the sample . Then, for every ,
Proof For let . Note that . Also, the ’s are independent random variables with mean and values between and . Therefore, by Hoeffding’s bound,
is heredity approximation resistant on satisfiable instances.
For every sufficiently large , there exists such that
The reduction works as follows. Let be the vector from lemma 7.2. Given an instance
Note that if is random then so is . Also, if is satisfiable with a satisfying assignment , then, by lemma 7.2, satisfies in exactly the constraints with odd indices. Next, we will produce a sample from as follows. We will index the coordinates of vectors in by . We define a mapping from the collection of -constraints to as follows – for each constraint we define by the formula
Finally, if , we will produce the sample
The theorem follows from the following claim:
If is a random instance then is -scattered.
where, as mentioned before, we index coordinates of by triplets in . We claim that for every -constraint , . This suffices, since if satisfies then satisfies exactly the constraints with odd indices in . Therefore, by the definition of and the fact that , realizes .
Indeed, let be a -constraint. We have
By a simple scaling argument we can prove corollary 5.2.
2 Agnostically learning halfspaces
Now define by
We will use assumption 4.5 with respect to the majority predicate . Recall that if and only if . The following claim analyses its relevant properties.
.
and .
Since this is true for every pairwise uniform distribution, .
is -scattered.
Combining all the above we conclude the proof of theorem 5.4.
It is not hard to see that the proof of theorem 5.4 shows that it is hard to approximately learn large margin halfspaces with any constant approximation ratio. Taking considerations as in remark 7.3, one might hypothesize that the correct approximation ratio for this problem is about . As in the case of learning halfspaces, best known algorithms do just a bit better, namely, they have an approximation ratio of . Therefore, one might hypothesize that the best possible approximation ratio is . We note that a recent result shows that this is the best possible approximation ratio, if we restrict ourselves to a large class of learning algorithms (that includes SVM with a kernel, regression, Fourier transform and more).
3 Learning automata
The theorem remains true (with the same proof), even if we restrict to acyclic automata.
Clearly, this automaton calculates the same function as .
4 Toward intersection of 444 halfspaces
There is such that for every odd we have
Assuming the unique games conjecture, is heredity approximation resistant.
Proof We start with part 1. By , it suffices to show that there is a pairwise uniform distribution that is supported in . Denote and . Note that if is a pairwise uniform distribution that is supported in and is a pairwise uniform distribution that is supported in , then is a pairwise uniform distribution that is supported in . Therefore, it suffices to show that such and exist.
We first construct . Let be the following distribution over – with probability choose the all-one vector and with probability , choose at random a vector with ones (uniformly among all such vectors). By the argument of claim 2, is pairwise uniform. Clearly, the distribution over is a pairwise uniform distribution that is supported in .
Next, we construct . Let be large enough so that for every , the probability that a random vector from will have more than minus-ones is (it is easy to see that this probability approaches as approaches . Therefore, such exists). Now, let be a random variable that satisfies:
are pairwise independent.
For every , .
In a moment, we will show that a random variable with the above properties exists. Now, let be a set with such that every vector in has more than minus-ones. Consider the distribution of the random variable sampled as follows. We first sample , then, for , if , we choose to be a random vector and otherwise, we choose to be a random vector .
We note that since are pairwise independent, are pairwise independent as well. Also, the distribution of is uniform. Therefore, is pairwise uniform. Also, since , with probability , at least one of the ’s will have more than minus-ones. Therefore, is supported in .
It is left to show that there exists a random variable as specified above. Let be the random variable defined as follows:
With probability is a uniform vector with a single positive coordinate.
With probability is a uniform vector with positive coordinates.
With probability is a uniform vector with positive coordinates.
Clearly, . Also, for every distinct we have
Therefore, the other two specifications of hold as well.
is heredity approximation resistant on satisfiable instances.
Given an instance , we produce two examples for each constraint: for the constraint
we will produce two examples in , each of which has exactly non zero coordinates. The first is a positively labelled example whose instance is the vector with the value in the coordinate. the second is a negatively labelled example whose instance is the vector with the value in the coordinate.
It is not hard to see that if is satisfiable then the produced sample is realizable by intersection of four halfspaces: if is a satisfying assignment then the sample is realized by the intersection of the halfspaces . On the other hand, by proposition 7.1, if is random instance with constraints, then the resulting ensamble is scattered.
5 Agnostically learning parity
We will reduce from the aforementioned problem to the problem of distinguishing between -almost realizable sample and -random sample for a distribution which is -scattered. Since both and are arbitrary, the theorem follows from theorem 3.4.
Resolution lower bounds
Theorem 4.2 now follows from theorem 8.1 and the following two lemmas.
Proof Let be a resolution refutation to . Define as the minimal number such that is implied by constraints in .
If is implied by then .
For every , contains a variable appearing only is .
The next lemma shows that the condition in lemma 8.2 holds w.h.p. for a suitable random instance. For the sake of readability, it is formulated in terms of sets instead of constraints.
Fix integers such that . Suppose that are chosen uniformly at random. Then, with probability , for every with for most we have .
Proof Fix a set with elements. Order the sets in arbitrarily and also order the elements in each set arbitrarily. Let be the following random variables: is the first element in the first set of , is the second element in the first set of and so on till the ’th element of the last set of .
Denote by the indicator random variable of the event that for some . We claim that if , the conclusion of the lemma holds for . Indeed, let be the set of indices with , be the set of indices with but for some and . If the conclusion of the lemma does not hold for , then . If in addition we must have . For every , let be the minimal index such that . We note that , therefore is a mapping from to . Since , for some in . Therefore, and hence, contradicting the assumption that .
Note that the probability that is at most . This estimate holds also given the values of . It follows that the probability that for every for a particular with is at most . Therefore, for some constants (that depend only on and ), the probability that fails to satisfy the conclusion of the lemma is bounded by
The second inequality follows from Stirling’s approximation. Summing over all collections of size we conclude that for some , the probability that the conclusion of the lemma does not hold for some collection of size is at most
Summing over all , we conclude that the probability that the conclusion of the lemma does not hold is at most .
Proof (of corollary 9.2) Under the conditions of the corollary, by theorem 9.1, we have or . Since is closed under taking complement , in both cases, . Since , we conclude that , which collapses the polynomial hierarchy .
Consider the following problem. The input is a circuit and a number . The instance is a YES instance if the entropyWe consider the standard Shannon’s entropy with bits units. of , when it acts on a uniform input sampled from , is . The instance is a NO instance if this entropy is . By this problem is in . To establish the proof, we will show that can be reduced to this problem.
Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10. We thank Sangxia Huang for his kind help and for valuable discussions about his paper . We thank Guy Kindler for valuable discussions.