More data speeds up training time in learning halfspaces over sparse vectors

Amit Daniely, Nati Linial, Shai Shalev Shwartz

Introduction

In the modern digital period, we are facing a rapid growth of available datasets in science and technology. In most computing tasks (e.g. storing and searching in such datasets), large datasets are a burden and require more computation. However, for learning tasks the situation is radically different. A simple observation is that more data can never hinder you from performing a task. If you have more data than you need, just ignore it!

A basic question is how to learn from “big data”. The statistical learning literature classically studies questions like “how much data is needed to perform a learning task?” or “how does accuracy improve as the amount of data grows?” etc. In the modern, “data revolution era”, it is often the case that the amount of data available far exceeds the information theoretic requirements. We can wonder whether this, seemingly redundant data, can be used for other purposes. An intriguing question in this vein, studied recently by several researchers ((Decatur et al., 1998; Servedio., 2000; Shalev-Shwartz et al., 2012; Berthet and Rigollet, 2013; Chandrasekaran and Jordan, 2013)), is the following

Question 1: Are there any learning tasks in which more data, beyond the information theoretic barrier, can provably be leveraged to speed up computation time?

To prove this, we present a novel technique to establish computational-statistical tradeoffs in supervised learning problems. To the best of our knowledge, this is the first such a result that is not based on cryptographic primitives.

The natural learning problem we consider is the task of learning the class of halfspaces over kk-sparse vectors. Here, the instance space is the space of kk-sparse vectors,

and the hypothesis class is halfspaces over kk-sparse vectors, namely

In addition, we allow improper learning (a.k.a. representation independent learning), namely, the learning algorithm is not restricted to output a hypothesis from Hn,k{\mathcal{H}}_{n,k}, but only should output a hypothesis whose error is not much larger than the error of the best hypothesis in Hn,k{\mathcal{H}}_{n,k}. This gives the learner a lot of flexibility in choosing an appropriate representation of the problem. This additional freedom to the learner makes it much harder to prove lower bounds in this model. Concretely, it is not clear how to use standard reductions from NP hard problems in order to establish lower bounds for improper learning (moreover, Applebaum et al. (2008) give evidence that such simple reductions do not exist).

The classes Hn,k{\mathcal{H}}_{n,k} and similar classes have been studied by several authors (e.g. Long. and Servedio (2013)). They naturally arise in learning scenarios in which the set of all possible features is very large, but each example has only a small number of active features. For example:

Predicting an advertisement based on a search query: Here, the possible features of each instance are all English words, whereas the active features are only the set of words given in the query.

Learning Preferences (Hazan et al., 2012): Here, we have nn players. A ranking of the players is a permutation σ:[n][n]\sigma:[n]\to[n] (think of σ(i)\sigma(i) as the rank of the ii’th player). Each ranking induces a preference hσh_{\sigma} over the ordered pairs, such that hσ(i,j)=1h_{\sigma}(i,j)=1 iff ii is ranked higher that jj. Namely,

We will show a positive answer to Question 1 for the class Hn,3{\mathcal{H}}_{n,3}. To do so, we showIn fact, similar results hold for every constant k3k\geq 3. Indeed, since Hn,3Hn,k{\mathcal{H}}_{n,3}\subset{\mathcal{H}}_{n,k} for every k3k\geq 3, it is trivial that item 33 below holds for every k3k\geq 3. The upper bound given in item 11 holds for every kk. For item 2, it is not hard to show that Hn,k{\mathcal{H}}_{n,k} can be learnt using a sample of Ω(nkϵ2)\Omega\left(\frac{n^{k}}{\epsilon^{2}}\right) examples by a naive improper learning algorithm, similar to the algorithm we describe in this section for k=3k=3. the following:

Ignoring computational issues, it is possible to learn the class Hn,3{\mathcal{H}}_{n,3} using O(nϵ2)O\left(\frac{n}{\epsilon^{2}}\right) examples.

A graphical illustration of our main results is given below:

runtime2^{O(n)}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mo>&gt;</mo><mi mathvariant="normal">poly</mi><mo>⁡</mo><mo stretchy="false">(</mo><mi>n</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">&gt;\operatorname{poly}(n)</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.5782em;vertical-align:-0.0391em;"></span><span class="mrel">&gt;</span><span class="mspace" style="margin-right:0.2778em;"></span></span><span class="base"><span class="strut" style="height:1em;vertical-align:-0.25em;"></span><span class="mop"><span class="mord mathrm" style="margin-right:0.0139em;">poly</span></span><span class="mopen">(</span><span class="mord mathnormal">n</span><span class="mclose">)</span></span></span></span></span>n^{O(1)}examplesn2<spanclass="katexdisplay"><spanclass="katex"><spanclass="katexmathml"><mathxmlns="http://www.w3.org/1998/Math/MathML"display="block"><semantics><mrow><msup><mi>n</mi><mn>1.5</mn></msup></mrow><annotationencoding="application/xtex">n1.5</annotation></semantics></math></span><spanclass="katexhtml"ariahidden="true"><spanclass="base"><spanclass="strut"style="height:0.8641em;"></span><spanclass="mord"><spanclass="mordmathnormal">n</span><spanclass="msupsub"><spanclass="vlistt"><spanclass="vlistr"><spanclass="vlist"style="height:0.8641em;"><spanstyle="top:3.113em;marginright:0.05em;"><spanclass="pstrut"style="height:2.7em;"></span><spanclass="sizingresetsize6size3mtight"><spanclass="mordmtight"><spanclass="mordmtight">1.5</span></span></span></span></span></span></span></span></span></span></span></span></span>nn^{2}<span class="katex-display"><span class="katex"><span class="katex-mathml"><math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msup><mi>n</mi><mn>1.5</mn></msup></mrow><annotation encoding="application/x-tex">n^{1.5}</annotation></semantics></math></span><span class="katex-html" aria-hidden="true"><span class="base"><span class="strut" style="height:0.8641em;"></span><span class="mord"><span class="mord mathnormal">n</span><span class="msupsub"><span class="vlist-t"><span class="vlist-r"><span class="vlist" style="height:0.8641em;"><span style="top:-3.113em;margin-right:0.05em;"><span class="pstrut" style="height:2.7em;"></span><span class="sizing reset-size6 size3 mtight"><span class="mord mtight"><span class="mord mtight">1.5</span></span></span></span></span></span></span></span></span></span></span></span></span>n The proof of item 1 above is easy – simply note that Hn,3H_{n,3} has VC dimension n+1n+1.

Item 2 is proved in section 4, relying on the results of Hazan et al. (2012). We note, however, that a weaker result, that still suffices for answering Question 1 in the affirmative, can be proven using a naive improper learning algorithm. In particular, we show below how to learn Hn,3{\mathcal{H}}_{n,3} efficiently with a sample of Ω(n3ϵ2)\Omega\left(\frac{n^{3}}{\epsilon^{2}}\right) examples. The idea is to replace the class Hn,3{\mathcal{H}}_{n,3} with the class {±1}Cn,3\{\pm 1\}^{C_{n,3}} containing all functions from Cn,3C_{n,3} to {±1}\{\pm 1\}. Clearly, this class contains Hn,3H_{n,3}. In addition, we can efficiently find a function ff that minimizes the empirical training error over a training set SS as follows: For every xCn,kx\in C_{n,k}, if xx does not appear at all in the training set we will set f(x)f(x) arbitrarily to 11. Otherwise, we will set f(x)f(x) to be the majority of the labels in the training set that correspond to xx. Finally, note that the VC dimension of {±1}Cn,3\{\pm 1\}^{C_{n,3}} is smaller than n3n^{3} (since Cn,3<n3|C_{n,3}|<n^{3}). Hence, standard generalization results (e.g. Vapnik (1995)) implies that a training set size of Ω(n3ϵ2)\Omega\left(\frac{n^{3}}{\epsilon^{2}}\right) suffices for learning this class.

Item 3 is shown in section 3 by presenting a novel technique for establishing statistical-computational tradeoffs.

The class Hn,2{\mathcal{H}}_{n,2}. Our main result gives a positive answer to Question 1 for the task of improperly learning Hn,k{\mathcal{H}}_{n,k} for k3k\geq 3. A natural question is what happens for k=2k=2 and k=1k=1. Since VC(Hn,1)=VC(Hn,2)=n+1\operatorname*{VC}({\mathcal{H}}_{n,1})=\operatorname*{VC}({\mathcal{H}}_{n,2})=n+1, the information theoretic barrier for learning these classes is Θ(nϵ2)\Theta\left(\frac{n}{\epsilon^{2}}\right). In section 4, we prove that Hn,2{\mathcal{H}}_{n,2} (and, consequently, Hn,1Hn,2{\mathcal{H}}_{n,1}\subset{\mathcal{H}}_{n,2}) can be learnt using O(nlog3(n)ϵ2)O\left(\frac{n\log^{3}(n)}{\epsilon^{2}}\right) examples, indicating that significant computational-statistical tradeoffs start to manifest themselves only for k3k\geq 3.

Recently, (Berthet and Rigollet, 2013) gave a positive answer to Question 1 in the context of unsupervised learning. Concretely, they studied the problem of sparse PCA, namely, finding a sparse vector that maximizes the variance of an unsupervised data. Conditioning on the hardness of the planted clique problem, they gave a positive answer to Question 1 for sparse PCA. Our work, as well as the previous work of Decatur et al. (1998); Servedio. (2000); Shalev-Shwartz et al. (2012), studies Question 1 in the supervised learning setup. We emphasize that unsupervised learning problems are radically different than supervised learning problems in the context of deriving lower bounds. The main reason for the difference is that in supervised learning problems, the learner is allowed to employ improper learning, which gives it a lot of power in choosing an adequate representation of the data. For example, the upper bound we have derived for the class of sparse halfspaces switched from representing hypotheses as halfspaces to representation of hypotheses as tables over Cn,3C_{n,3}, which made the learning problem easy from the computational perspective. The crux of the difficulty in constructing lower bounds is due to this freedom of the learner in choosing a convenient representation. This difficulty does not arise in the problem of sparse PCA detection, since there the learner must output a good sparse vector. Therefore, it is not clear whether the approach given in (Berthet and Rigollet, 2013) can be used to establish computational-statistical gaps in supervised learning problems.

Background and notation

For h:Cn,3{±1}h:C_{n,3}\to\{\pm 1\} and a distribution D{\mathcal{D}} on Cn,3×{±1}C_{n,3}\times\{\pm 1\} we denote the error of hh w.r.t. D{\mathcal{D}} by ErrD(h)=Pr(x,y)D(h(x)y)\operatorname{Err}_{\mathcal{D}}(h)=\Pr_{(x,y)\sim{\mathcal{D}}}\left(h(x)\neq y\right). For H{±1}Cn,3{\mathcal{H}}\subset\{\pm 1\}^{C_{n,3}} we denote the error of H{\mathcal{H}} w.r.t. D{\mathcal{D}} by ErrD(H)=minhHErrD(h)\operatorname{Err}_{{\mathcal{D}}}({\mathcal{H}})=\min_{h\in{\mathcal{H}}}\operatorname{Err}_{{\mathcal{D}}}(h). For a sample S(Cn,3×{±1})mS\in\left(C_{n,3}\times\{\pm 1\}\right)^{m} we denote by ErrS(h)\operatorname{Err}_{S}(h) (resp. ErrS(H)\operatorname{Err}_{S}({\mathcal{H}})) the error of hh (resp. H{\mathcal{H}}) w.r.t. the empirical distribution induces by the sample SS.

A learning algorithm, LL, receives a sample S(Cn,3×{±1})mS\in\left(C_{n,3}\times\{\pm 1\}\right)^{m} and return a hypothesis L(S):Cn,3{±1}L(S):C_{n,3}\to\{\pm 1\}. We say that LL learns Hn,3{\mathcal{H}}_{n,3} using m(n,ϵ)m(n,\epsilon) examples if,For simplicity, we require the algorithm to succeed with probability of at least 9/109/10. This can be easily amplified to probability of at least 1δ1-\delta, as in the usual definition of agnostic PAC learning, while increasing the sample complexity by a factor of log(1/δ)\log(1/\delta). for every distribution D{\mathcal{D}} on Cn,3×{±1}C_{n,3}\times\{\pm 1\} and a sample SS of more than m(n,ϵ)m(n,\epsilon) i.i.d. examples drawn from D{\mathcal{D}},

The algorithm LL is efficient if it runs in polynomial time in the sample size and returns a hypothesis that can be evaluated in polynomial time.

Soundness: If Val(ϕ)1ϵ\operatorname{Val}(\phi)\geq 1-\epsilon, then

In fact, for all we know, the following conjecture may be true for every 0μ0.50\leq\mu\leq 0.5.

Note that Feige’s conjecture is equivalent to the -R3SAT hardness assumption.

Let 0μ0.50\leq\mu\leq 0.5. If the μ\mu-R3SAT hardness assumption (conjecture 2.2) is true, then there exists no efficient learning algorithm that learns the class Hn,3{\mathcal{H}}_{n,3} using O(n1+μϵ2)O\left(\frac{n^{1+\mu}}{\epsilon^{2}}\right) examples.

In the proof of Theorem 3.1 we rely on the validity of a conjecture, similar to conjecture 2.2 for 33-variables majority formulas. Following an argument from (Feige, 2002) (Theorem 3.2) the validity of the conjecture on which we rely for majority formulas follows the validity of conjecture 2.2.

An nn-variables 3MAJ clause is a boolean formula of the form

An nn-variables 3MAJ formula is a boolean formula of the form

where the CiC_{i}’s are 3MAJ clauses. By 3MAJn,m\textrm{3MAJ}_{n,m} we denote the set of 3MAJ formulas with nn variables and mm clauses.

Let 0μ0.50\leq\mu\leq 0.5. If the μ\mu-R3SAT hardness assumption is true, then for every ϵ>0\epsilon>0 and for every large enough integer Δ>Δ0(ϵ)\Delta>\Delta_{0}(\epsilon) there exists no efficient algorithm with the following properties.

If Val(ϕ)34ϵ\operatorname{Val}(\phi)\geq\frac{3}{4}-\epsilon, then

Next, we prove Theorem 3.1. In fact, we will prove a slightly stronger result. Namely, define the subclass Hn,3dHn,3{\mathcal{H}}_{n,3}^{d}\subset{\mathcal{H}}_{n,3}, of homogenous halfspaces with binary weights, given by Hn,3d={hw,0w{±1}n}{\mathcal{H}}_{n,3}^{d}=\left\{h_{w,0}\mid w\in\{\pm 1\}^{n}\right\}. As we show, under the μ\mu-R3SAT hardness assumption, it is impossible to efficiently learn this subclass using only O(n1+μϵ2)O\left(\frac{n^{1+\mu}}{\epsilon^{2}}\right) examples.

The crux of the construction is that if ϕ\phi is random, no algorithm (even improper and even inefficient) can return a hypothesis with a small error. The reason for that is that since the sample provided to the algorithm consists of only κn0.012\kappa\frac{n}{0.01^{2}} samples, the algorithm won’t see most of ψ\psi’s clauses, and, consequently, the produced hypothesis hh will be independent of them. Since these clauses are random, hh is likely to err on about half of them, so that ErrDϕ(h)\operatorname{Err}_{D_{\phi}}(h) will be close to half!

To summarize we constructed an efficient algorithm with the following properties: if ϕ\phi is almost satisfiable, the algorithm will return a hypothesis with a small error, and then we will declare “exceptional”, while for random ϕ\phi, the algorithm will return a hypothesis with a large error, and we will declare “typical”.

Our construction crucially relies on the restriction to learning algorithm with a small sample complexity. Indeed, if the learning algorithm obtains more than n1+μn^{1+\mu} examples, then it will see most of ψ\psi’s clauses, and therefore it might succeed in “learning” even when the source of the formula is random. Therefore, we will declare “exceptional” even when the source is random.

(of theorem 3.1) Assume by way of contradiction that the μ\mu-R3SAT hardness assumption is true and yet there exists an efficient learning algorithm that learns the class Hn,3{\mathcal{H}}_{n,3} using O(n1+μϵ2)O\left(\frac{n^{1+\mu}}{\epsilon^{2}}\right) examples. Setting ϵ=1100\epsilon=\frac{1}{100}, we conclude that there exists an efficient algorithm LL and a constant κ>0\kappa>0 such that given a sample SS of more than κn1+μ\kappa\cdot n^{1+\mu} examples drawn from a distribution D{\mathcal{D}} on Cn,3×{±1}C_{n,3}\times\{\pm 1\}, returns a classifier L(S):Cn,3{±1}L(S):C_{n,3}\to\{\pm 1\} such that

W.p. 34\geq\frac{3}{4} over the choice of SS, ErrD(L(S))ErrD(Hn,3)+1100\operatorname{Err}_{{\mathcal{D}}}(L(S))\leq\operatorname{Err}_{{\mathcal{D}}}({\mathcal{H}}_{n,3})+\frac{1}{100}.

Fix Δ\Delta large enough such that Δ>100κ\Delta>100\kappa and the conclusion of Theorem 3.2 holds with ϵ=1100\epsilon=\frac{1}{100}. We will construct an algorithm, AA, contradicting Theorem 3.2. On input ϕ3MAJn,Δn1+μ\phi\in\textrm{3MAJ}_{n,\Delta n^{1+\mu}} consisting of the 3MAJ clauses C1,,CΔn1+μC_{1},\ldots,C_{\Delta n^{1+\mu}}, the algorithm AA proceeds as follows

Generate a sample SS consisting of Δn1+μ\Delta n^{1+\mu} examples as follows. For every clause, Ck=MAJ((1)j1xi1,(1)j2xi2,(1)j3xi3)C_{k}=\textrm{MAJ}((-1)^{j_{1}}x_{i_{1}},(-1)^{j_{2}}x_{i_{2}},(-1)^{j_{3}}x_{i_{3}}), generate an example (xk,yk)Cn,3×{±1}(x_{k},y_{k})\in C_{n,3}\times\{\pm 1\} by choosing b{±1}b\in\{\pm 1\} at random and letting

For example, if n=6n=6, the clause is MAJ(x2,x3,x6)\textrm{MAJ}(-x_{2},x_{3},x_{6}) and b=1b=-1, we generate the example

Choose a sample S1S_{1} consisting of Δn1+μ100κn1+μ\frac{\Delta n^{1+\mu}}{100}\geq\kappa\cdot n^{1+\mu} examples by choosing at random (with repetitions) examples from SS.

We claim that AA contradicts Theorem 3.2. Clearly, AA runs in polynomial time. It remains to show that

If Val(ϕ)341100\operatorname{Val}(\phi)\geq\frac{3}{4}-\frac{1}{100}, then

Assume first that ϕ3MAJn,Δn1+μ\phi\in\textrm{3MAJ}_{n,\Delta n^{1+\mu}} is chosen at random. Given the sample S1S_{1}, the sample S2:=SS1S_{2}:=S\setminus S_{1} is a sample of S2|S_{2}| i.i.d. examples which are independent from the sample S1S_{1}, and hence also from h=L(S1)h=L(S_{1}). Moreover, for every example (xk,yk)S2(x_{k},y_{k})\in S_{2}, yky_{k} is a Bernoulli random variable with parameter 12\frac{1}{2} which is independent of xkx_{k}. To see that, note that an example whose instance is xkx_{k} can be generated by exactly two clauses – one corresponds to yk=1y_{k}=1, while the other corresponds to yk=1y_{k}=-1 (e.g., the instance (1,1,0,1)(1,-1,0,1) can be generated from the clause MAJ(x1,x2,x4)\textrm{MAJ}(x_{1},-x_{2},x_{4}) and b=1b=1 or the clause MAJ(x1,x2,x4)\textrm{MAJ}(-x_{1},x_{2},-x_{4}) and b=1b=-1). Thus, given the instance xkx_{k}, the probability that yk=1y_{k}=1 is 12\frac{1}{2}, independent of xkx_{k}.

It follows that ErrS2(h)\operatorname{Err}_{S_{2}}(h) is an average of at least (11100)Δn1+μ\left(1-\frac{1}{100}\right)\Delta n^{1+\mu} independent Bernoulli random variable. By Chernoff’s bound, with probability 1o(1)\geq 1-o(1), ErrS2(h)>121100\operatorname{Err}_{S_{2}}(h)>\frac{1}{2}-\frac{1}{100}. Thus,

Assume now that Val(ϕ)341100\operatorname{Val}(\phi)\geq\frac{3}{4}-\frac{1}{100} and let ψ{±1}n\psi\in\{\pm 1\}^{n} be an assignment that indicates that. Let ΨHn,3\Psi\in{\mathcal{H}}_{n,3} be the hypothesis Ψ(x)=sign(ψ,x)\Psi(x)=\textrm{sign}\left(\langle\psi,x\rangle\right). It can be easily checked that Ψ(xk)=yk\Psi(x_{k})=y_{k} if and only if ψ\psi satisfies CkC_{k}. Since Val(ϕ)341100\operatorname{Val}(\phi)\geq\frac{3}{4}-\frac{1}{100}, it follows that

By the choice of LL, with probability 114=34\geq 1-\frac{1}{4}=\frac{3}{4},

The following theorem derives upper bounds for learning Hn,2{\mathcal{H}}_{n,2} and Hn,3{\mathcal{H}}_{n,3}. Its proof relies on results from Hazan et al. (2012) about learning β\beta-decomposable matrices, and due to the lack of space is given in the appendix.

There exists an efficient algorithm that learns Hn,2{\mathcal{H}}_{n,2} using O(nlog3(n)ϵ2)O\left(\frac{n\log^{3}(n)}{\epsilon^{2}}\right) examples

There exists an efficient algorithm that learns Hn,3{\mathcal{H}}_{n,3} using O(n2log3(n)ϵ2)O\left(\frac{n^{2}\log^{3}(n)}{\epsilon^{2}}\right) examples

Discussion

We formally established a computational-sample complexity tradeoff for the task of (agnostically and improperly) PAC learning of halfspaces over 33-sparse vectors. Our proof of the lower bound relies on a novel, non cryptographic, technique for establishing such tradeoffs. We also derive a new non-trivial upper bound for this task.

Amit Daniely is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. Nati Linial is supported by grants from ISF, BSF and I-Core. Shai Shalev-Shwartz is supported by the Israeli Science Foundation grant number 590-10.

References

Appendix A Proof of Theorem 4.1

The proof of the theorem relies on results from Hazan et al. about learning β\beta-decomposable matrices. Let WW be an n×mn\times m matrix. We define the symmetrization of WW to be the (n+m)×(n+m)(n+m)\times(n+m) matrix

We say that WW is β\beta-decomposable if there exist positive semi-definite matrices P,NP,N for which

Each matrix in {±1}n×m\{\pm 1\}^{n\times m} can be naturally interpreted as a hypothesis on [n]×[m][n]\times[m].

We say that a learning algorithm LL learns a class Hn{±1}Xn{\mathcal{H}}_{n}\subset\{\pm 1\}^{X_{n}} using m(n,ϵ,δ)m(n,\epsilon,\delta) examples if, for every distribution D{\mathcal{D}} on Xn×{±1}X_{n}\times\{\pm 1\} and a sample SS of more than m(n,ϵ,δ)m(n,\epsilon,\delta) i.i.d. examples drawn from D{\mathcal{D}},

Hazan et al. have provedThe result of Hazan et al. is more general than what is stated here. Also, Hazan et al. considered the online scenario. The result for the statistical scenario, as stated here, can be derived by applying standard online-to-batch conversions (see for example Cesa-Bianchi et al. ). that

Hazan et al. The hypothesis class of β\beta-decomposable n×mn\times m matrices with ±1\pm 1 entries ban be efficiently learnt using a sample of O(β2(n+m)log(n+m)+log(1/δ)ϵ2)O\left(\frac{\beta^{2}(n+m)\log(n+m)+\log(1/\delta)}{\epsilon^{2}}\right) examples.

We start with a generic reduction from a problem of learning a class Gn{\mathcal{G}}_{n} over an instance space Xn{1,1,0}nX_{n}\subset\{-1,1,0\}^{n} to the problem of learning β(n)\beta(n)-decomposable matrices. We say that Gn{\mathcal{G}}_{n} is realized by mn×mnm_{n}\times m_{n} matrices that are β(n)\beta(n)-decomposable if there exists a mapping ψn:Xn[mn]×[mn]\psi_{n}:X_{n}\to[m_{n}]\times[m_{n}] such that for every hGnh\in{\mathcal{G}}_{n} there exists a β(n)\beta(n)-decomposable mn×mnm_{n}\times m_{n} matrix WW for which xXn,  h(x)=Wψn(x)\forall x\in X_{n},\;h(x)=W_{\psi_{n}(x)}. The mapping ψn\psi_{n} is called a realization of Gn{\mathcal{G}}_{n}. In the case that the mapping ψn\psi_{n} can be computed in time polynomial in nn, we say that Gn{\mathcal{G}}_{n} is efficiently realized and ψn\psi_{n} is an efficient realization. It follows from Theorem A.1 that:

If Gn{\mathcal{G}}_{n} is efficiently realized by mn×mnm_{n}\times m_{n} matrices that are β(n)\beta(n)-decomposable then Gn{\mathcal{G}}_{n} can be efficiently learnt using a sample of O(β(n)2mnlog(mn)+log(1/δ)ϵ2)O\left(\frac{\beta(n)^{2}m_{n}\log(m_{n})+\log(1/\delta)}{\epsilon^{2}}\right) examples.

We now turn to the proof of Theorem 4.1. We start with the first assertion, about learning Hn,2{\mathcal{H}}_{n,2}. The idea will be to partition the instance space into a disjoint union of subsets and show that the restriction of the hypothesis class to each subset can be efficiently realized by β(n)\beta(n)-decomposable. Concretely, we decompose Cn,2C_{n,2} into a disjoint union of five sets

For every 2r2-2\leq r\leq 2, Hn,2Anr{\mathcal{H}}_{n,2}|_{A^{r}_{n}} can be efficiently realized by n×nn\times n matrices that are O(log(n))O(\log(n))-decomposable.

To glue together the five restrictions, we will rely on the following Lemma, whose proof is given in section A.1.

Let X1,...,XkX_{1},...,X_{k} be partition of a domain XX and let HH be a hypothesis class over XX. Define Hi=HXiH_{i}=H|_{X_{i}}. Suppose the for every HiH_{i} there exist a learning algorithm that learns HiH_{i} using C(d+log(1/δ))/ϵ2\leq C(d+\log(1/\delta))/\epsilon^{2} examples, for some constant C8C\geq 8. Consider the algorithm AA which receives an i.i.d. training set SS of mm examples from X×{0,1}X\times\{0,1\} and applies the learning algorithm for each HiH_{i} on the examples in SS that belongs to XiX_{i}. Then, AA learns HH using at most

The first part of Theorem 4.1 is therefore follows from Lemma A.3, Lemma A.4 and Corollary A.2.

Having the first part of Theorem 4.1 and Lemma A.4 at hand, it is not hard to prove the second part of Theorem 4.1:

For 1in21\leq i\leq n-2 and b{±1}b\in\{\pm 1\} define

Let ψn:Cn,3Cn,2\psi_{n}:C_{n,3}\to C_{n,2} be the mapping that zeros the first non zero coordinate. It is not hard to see that Hn,3Dn,i,b={hψnDn,i,bhHn,2}{\mathcal{H}}_{n,3}|_{D_{n,i,b}}=\left\{h\circ\psi_{n}|_{D_{n,i,b}}\mid h\in{\mathcal{H}}_{n,2}\right\}. Therefore Hn,3Dn,i,b{\mathcal{H}}_{n,3}|_{D_{n,i,b}} can be identified with Hn,2{\mathcal{H}}_{n,2} using the mapping ψn\psi_{n}, and therefore can efficiently learnt using O(nlog3(n)+log(1/δ)ϵ2)O\left(\frac{n\log^{3}(n)+\log(1/\delta)}{\epsilon^{2}}\right) examples (the dependency on δ\delta does not appear in the statement, but can be easily inferred from the proof). The second part of Theorem 4.1 is therefore follows from the first part of the Theorem and Lemma A.4.

In the proof, we will rely on the following facts. The tensor product of two matrices AMn×mA\in M_{n\times m} and BMk×lB\in M_{k\times l} is defined as the (nk)×(ml)(n\cdot k)\times(m\cdot l) matrix

Let WW be a β\beta-decomposable matrix and let AA be a PSD matrix whose diagonal entries are upper bounded by α\alpha. Then WAW\otimes A is (αβ)(\alpha\cdot\beta)-decomposable.

It is not hard to see that for every matrix WW and a symmetric matrix AA,

Moreover, since the tensor product of two PSD matrices is PSD, if sym(W)=PN\operatorname{sym}(W)=P-N is a β\beta-decomposition of WW, then

is a (αβ)(\alpha\cdot\beta)-decomposition of WAW\otimes A. ∎

If WW is a β\beta-decomposable matrix, then so is every matrix obtained from WW by iteratively deleting rows and columns.

It is enough to show that deleting one row or column leaves WW β\beta-decomposable. Suppose that WW^{\prime} is obtained from WMn×mW\in M_{n\times m} by deleting the ii’th row (the proof for deleting columns is similar). It is not hard to see that sym(W)\operatorname{sym}(W^{\prime}) is the ii’th principal minor of sym(W)\operatorname{sym}(W). Therefore, since principal minors of PSD matrices are PSD matrices as well, if sym(W)=PN\operatorname{sym}(W)=P-N is β\beta-decomposition of WW then sym(W)=[P]i,i[N]i,i\operatorname{sym}(W^{\prime})=[P]_{i,i}-[N]_{i,i} is a β\beta-decomposition of WW^{\prime}. ∎

Hazan et al. Let TnT_{n} be the upper triangular matrix whose all entries in the diagonal and above are 11, and whose all entries beneath the diagonal are 1-1. Then TnT_{n} is O(log(n))O(\log(n))-decomposable.

Lastly, we will also need the following generalization of proposition A.7

Let WW be an n×nn\times n ±1\pm 1 matrix. Assume that there exists a sequence 0j(1),,j(n)n0\leq j(1),\ldots,j(n)\leq n such that

Since switching rows of a β\beta-decomposable matrix leaves a β\beta-decomposable matrix, we can assume without loss of generality that j(1)j(2)j(n)j(1)\leq j(2)\leq\ldots\leq j(n). Let JJ be the n×nn\times n all ones matrix. It is not hard to see that WW can be obtained from TnJT_{n}\otimes J by iteratively deleting rows and columns. Combining propositions A.5, A.6 and A.7, we conclude that WW is O(log(n))O(\log(n))-decomposable, as required. ∎

(of Lemma A.3) Denote Anr=Hn,2Anr{\mathcal{A}}^{r}_{n}={\mathcal{H}}_{n,2}|_{A^{r}_{n}}. We split into cases.

Case 1, r=0: Note that An0={eieji,j[n]}A_{n}^{0}=\{e_{i}-e_{j}\mid i,j\in[n]\}. Define ψn:An0[n]×[n]\psi_{n}:A_{n}^{0}\to[n]\times[n] by ψn(eiej)=(i,j)\psi_{n}(e_{i}-e_{j})=(i,j). We claim that ψn\psi_{n} is an efficient realization of An0{\mathcal{A}}_{n}^{0} by n×nn\times n matrices that are O(log(n))O(\log(n)) decomposable. Indeed, let h=hw,bAn0h=h_{w,b}\in{\mathcal{A}}_{n}^{0}, and let WW be the n×nn\times n matrix Wij=Wψn(eiej)=h(eiej)W_{ij}=W_{\psi_{n}(e_{i}-e_{j})}=h(e_{i}-e_{j}). It is enough to show that WW is O(log(n))O(\log(n))-decomposable.

From equation (1), it is not hard to see that there exist numbers

The conclusion follows from Proposition A.8

Case 2, r=2 and r=-2: We confine ourselves to the case r=2r=2. The case r=2r=-2 is similar. Note that An2={ei+ejij[n]}A_{n}^{2}=\{e_{i}+e_{j}\mid i\neq j\in[n]\}. Define ψn:An2[n]×[n]\psi_{n}:A_{n}^{2}\to[n]\times[n] by ψn(ei+ej)=(i,j)\psi_{n}(e_{i}+e_{j})=(i,j). We claim that ψn\psi_{n} is an efficient realization of An2{\mathcal{A}}_{n}^{2} by n×nn\times n matrices that are O(log(n))O(\log(n)) decomposable. Indeed, let h=hw,bAn2h=h_{w,b}\in{\mathcal{A}}_{n}^{2}, and let WW be the n×nn\times n matrix Wij=Wψn(ei+ej)=h(ei+ej)W_{ij}=W_{\psi_{n}(e_{i}+e_{j})}=h(e_{i}+e_{j}). It is enough to show that WW is O(log(n))O(\log(n))-decomposable.

From equation (2), it is not hard to see that there exist numbers

The conclusion follows from Proposition A.8

Case 3, r=1 and r=-1: We confine ourselves to the case r=1r=1. The case r=1r=-1 is similar. Note that An1={eii[n]}A_{n}^{1}=\{e_{i}\mid i\in[n]\}. Define ψn:An0[n]×[n]\psi_{n}:A_{n}^{0}\to[n]\times[n] by ψn(ei)=(i,i)\psi_{n}(e_{i})=(i,i). We claim that ψn\psi_{n} is an efficient realization of An1{\mathcal{A}}_{n}^{1} by n×nn\times n matrices that are 33-decomposable (let alone, log(n)\log(n)-decomposable). Indeed, let h=hw,bAn1h=h_{w,b}\in{\mathcal{A}}_{n}^{1}, and let WW be the n×nn\times n matrix with Wii=Wψn(ei)=h(ei)W_{ii}=W_{\psi_{n}(e_{i})}=h(e_{i}) and 1-1 outside the diagonal. It is enough to show that WW is 33-decomposable. Since JJ is 11-decomposable, it is enough to show that W+JW+J is 22-decomposable. However, it is not hard to see that every diagonal matrix DD is (maxiDii)(\max_{i}|D_{ii}|)-decomposable. ∎

(of Lemma A.4) Let S=(x1,y1),,(xm,ym)S=(x_{1},y_{1}),\ldots,(x_{m},y_{m}) be a training set and let m^i\hat{m}_{i} be the number of examples in SS that belong to XiX_{i}. Given that the values of the random variables m^1,,m^i\hat{m}_{1},\ldots,\hat{m}_{i} is determined, we have that w.p. of at least 1δ1-\delta,

where DiD_{i} is the induced distribution over XiX_{i}, hih_{i} is the output of the ii’th algorithm, and hh^{*} is the optimal hypothesis w.r.t. the original distribution DD. Define,

It follows from the above that we also have, w.p. at least 1δ1-\delta, for every ii,

Let αi=D{(x,y):xXi}\alpha_{i}=D\{(x,y):x\in{\mathcal{X}}_{i}\}, and note that iαi=1\sum_{i}\alpha_{i}=1. Therefore,

Next note that if αim<C(d+log(k/δ))\alpha_{i}m<C(d+\log(k/\delta)) then αim/mi1\alpha_{i}m/m_{i}\leq 1. Otherwise, using Chernoff’s inequality, for every ii we have

It follows that with probability of at least 1δ1-\delta,

All in all, we have shown that with probability of at least 12δ1-2\delta it holds that

Therefore, the the algorithm learns H{\mathcal{H}} using