Reliably Learning the ReLU in Polynomial Time

Surbhi Goel, Varun Kanade, Adam Klivans, Justin Thaler

Introduction

Recall the following two fundamental machine-learning problems:

The term agnostic above refers to the fact that the labeling on {1,1}\{-1,1\} may be arbitrary. In this work, we relax the notion of success to improper learning, where the learner may output any polynomial-time computable hypothesis achieving a loss that is within ϵ\epsilon of the optimal solution from the concept class.

Taken together, these two problems are at the core of many important techniques from modern Machine Learning and Statistics. It is well-known how to efficiently solve ordinary least squares and other variants of linear regression; we know of multiple polynomial-time solutions, all extensively used in practice . In contrast, Problem 1.2 is thought to be computationally intractable due to the many existing hardness results in the literature .

The ReLU is a hybrid function that lies “in-between” a linear function and a threshold function in the following sense: restricted to inputs x\mathbf{x} such that wx>0\mathbf{w}\cdot\mathbf{x}>0, the ReLU is linear, and for inputs x\mathbf{x} such that wx0\mathbf{w}\cdot\mathbf{x}\leq 0, the ReLU thresholds the value wx\mathbf{w}\cdot\mathbf{x} and simply outputs zero. In this sense, we could view the ReLU as a “one-sided” threshold function. Since learning a ReLU has aspects of both linear regression and threshold learning, it is not straightforward to identify a notion of loss that captures both of these aspects.

We introduce a natural model for learning ReLUs inspired by the Reliable Agnostic learning model that was introduced by Kalai et al. in the context of Boolean functions. The goal will be to minimize both the false positive rate and a loss function (for example, square-loss) on points the distribution labels non-zero. In this work, we give efficient algorithms for learning a ReLU over the unit sphere with respect to any loss function that satisfies mild properties (convexity, monotonicity, boundedness, and Lipschitz-ness).

The Reliable Agnostic model is motivated by the Neyman-Pearson criteria, and is intended to capture settings in which false positive errors are more costly than false negative errors (e.g., spam detection) or vice versa. We observe that the asymmetric manner in which the Reliable Agnostic model treats different types of errors naturally corresponds to the one-sided nature of a ReLU. In particular, there may be settings in which mistakenly predicting a positive value instead of zero carries a high cost.

As a concrete example, imagine that inputs are comments on an online news article. Suppose that each comment is assigned a numerical score of quality or appropriateness, where the true scoring function is reasonably modeled by a linear function of the features of the comment. The newspaper wants to implement an automated system in which comments are either a) rejected outright if the score is below a threshold or b) posted in order of score, possibly after undergoing human review.For example, The New York Times recently announced that they are moving to a hybrid comment moderation system that combines human and automated review . In this situation, it may be costlier to post (or subject to human review) a low-quality or inappropriate comment than it is to automatically reject a comment that is slightly above the threshold for posting.

2 Our Contributions

We can now state our main theorem giving a poly-time algorithm (in nn, the dimension) for reliably learning any ReLU.

Let C={xmax(0,wx) ⁣:w21}\mathcal{C}=\{\mathbf{x}\mapsto\mathsf{max}(0,\mathbf{w}\cdot\mathbf{x})\colon\|\mathbf{w}\|_{2}\leq 1\} be the class of ReLUs with weight vectors w\mathbf{w} satisfying w21\|\mathbf{w}\|_{2}\leq 1. There exists a learning algorithm A{\cal A} that reliably learns C{\cal C} in time 2O(1/ϵ)nO(1)2^{O(1/\epsilon)}\cdot n^{O(1)}.

We can obtain the same complexity bounds for learning ReLUs in the standard agnostic model with respect to the same class of loss functions. This yields a PTAS (polynomial-time approximation scheme) for an optimization problem regarding ReLUs posed by Bach . See Section 3.4 for details.

We can compose our results to obtain efficient algorithms for small-depth networks of ReLUs. For brevity, here we state results only for linear combinations of ReLUs (which are often called depth-two networks of ReLUs, see, e.g., ). Formal results for other types of networks can be found in Section 4.

Let C{\cal C} be a depth-2 network of ReLUs with kk hidden units. Then C{\cal C} is reliably learnable in time 2O(k/ϵ)nO(1)2^{O(\sqrt{k}/\epsilon)}\cdot n^{O(1)}.

The above results are perhaps surprising in light of the hardness result due to Livni et al. who showed that for X={0,1}n{\cal X}=\{0,1\}^{n}, learning the difference of even two ReLUs is as hard as learning a threshold function.

We also obtain results for noisy polynomial reconstruction on the sphere (equivalently, agnostically learning a polynomial) with respect to a large class of loss functions:

Andoni et al. were the first to give efficient algorithms for noisy polynomial reconstruction over non-Boolean domains. In particular, they gave algorithms that succeed on the unit cube but require an underlying product distribution and do not work in the agnostic setting (they also run in time exponential in the degree dd).

At a high level the proofs of both Theorem 1.3 and 1.6 follow the same outline, but we do not know how to obtain one from the other.

3 Applications to Convex Piecewise Regression

We establish a novel connection between learning networks of ReLUs and a broad class of piecewise-linear regression problems studied in machine learning and optimization. The following problem was defined by Boyd and Magnani as a generalization of the well-known MARS (multivariate adaptive regression splines) framework due to Friedman :

Applying our learnability results for networks of ReLUs, we obtain the first polynomial-time algorithms for solving the above max-kk-affine regression problem and the sum of max-22-affine regression problem when k=O(1)k=O(1). Boyd and Magnani specifically highlight the case of k=O(1)k=O(1) and provide a variety of heuristics; we obtain the first provably efficient results.

There is an algorithm A{\cal A} for solving the convex piecewise-linear fitting problem (cf. Definition 1.7) in time 2O((k/ϵ)logk)nO(1)2^{O\left((k/\epsilon)^{{\log k}}\right)}\cdot n^{O(1)}.

We can also use our results for learning networks of ReLUs to learn the so-called “leaky ReLUs” and “parameterized” ReLUs (PReLUs); see Section 4.3 for details. We obtain these results by composing various “ReLU gadgets,” i.e., constant-depth networks of ReLUs with a small number of bounded-weight hidden units.

4 Hardness

Let C{\cal C} be the class of ReLUs over the domain X={0,1}n{\cal X}=\{0,1\}^{n}. Then any algorithm for reliably learning C{\cal C} in time g(ϵ)poly(n)g(\epsilon)\cdot\mathsf{poly}(n) for any function gg will give a polynomial time algorithm for learning ω(1)\omega(1)-sparse parities with noise (for any ϵ=O(1)\epsilon=O(1)).

Efficiently learning sparse parities (of any superconstant length) with noise is considered one the most challenging problems in theoretical computer science.

5 Techniques and Related Work

Our starting point will be to prove the existence of low-degree, low-weight polynomial approximators for every cCc\in{\cal C}. The polynomial method has a well established history in computational learning theory (e.g., Kalai et al. for agnostically learning halfspaces under distributional assumptions), and we can apply classical techniques from approximation theory and recent work due to Sherstov to construct low-weight, low-degree approximators for any ReLU.

We can then relax Optimization Problem 1 to the space of low-weight polynomials and follow the approach of Shalev-Shwartz et al. who used tools from Reproducing Kernel Hilbert Spaces (RKHS) to learn low-weight polynomials efficiently (Shalev-Shwartz et al. focused on a relaxation of the 0/1 loss for halfspaces).

The main challenge is to obtain reliability; i.e., to simultaneously minimize the false-positive rate and the loss dictated by the objective function. To do this we take a “dual-loss” approach and carefully construct two loss functions that will both be minimized with high probability. Proving that these losses generalize for a large class of objective functions is subtle and requires “clipping” in order to apply the appropriate Rademacher bound. Our final output hypothesis is max(0,h)\mathsf{max}(0,h) where hh is a “clipped” version of the optimal low-weight, low-degree polynomial on the training data, appropriately kernelized.

Our learning algorithms for networks of ReLUs are obtained by generalizing a composition technique due to Zhang et al. , who considered networks of “smooth” activation functions computed by power series (we discuss this more in Section 4). Using a sequence of “gadget” reductions, we then show that even small-size networks of ReLUs are surprisingly powerful, yielding the first set of provably efficient algorithms for a variety of piecewise-linear regression problems in high dimension.

Note: A recent manuscript appearing on the Arxiv due to R. Arora et al. considers the complexity of training depth-22 networks of ReLUs with kk hidden units on a sample of size mm but when the dimension n=1n=1. They give a proper learning algorithm that runs in time exponential in kk and mm. These concept classes, however, can be improperly learned in time polynomial in kk and mm using a straightforward reduction to piecewise linear regression on the real line.

Preliminaries

2 Concept Classes

3 Learning Models

We consider two learning models in this paper. The first is the standard agnostic learning model and the second is a generalization of the reliable agnostic learning framework . We describe these models briefly; the reader may refer to the original articles for further details.

4 Kernel Methods

We make use of kernel methods in our learning algorithms. For completeness, we define kernels and a few important results concerning kernel methods. The reader may refer to Hofmann et al. (or any standard text) for further details.

We will use the following variant of the polynomial kernel:

Also define HMKd\mathcal{H}_{{\mathsf{MK}_{d}}} to be the corresponding Reproducing Kernel Hilbert Space (RKHS).

Observe that MKd{\mathsf{MK}_{d}} is the sum of standard polynomial kernels (cf. ) of degree ii for i[d]i\in[d]. However, the feature map conventionally used for a standard polynomial kernel has only (n+dd){n+d\choose d} entries and, under that definition involves coefficients of size as large as dΘ(d)d^{\Theta(d)}. The feature map ψd\psi_{d} used by MKd{\mathsf{MK}_{d}} avoids these coefficients by using NdN_{d} entries as defined above (that is, entries of ψd(x)\psi_{d}(\mathbf{x}) are indexed by ordered subsets of [n][n], while entries of the standard feature map are indexed by unordered subsets of [n][n].)

Consider the representation of pwp_{\mathbf{w}} as an element of HMKd\mathcal{H}_{\mathsf{MK}_{d}} defined as follows: the entry of index (k1,,kj)[n]j(k_{1},\ldots,k_{j})\in[n]^{j} of the representation equals βji=1jwki\beta_{j}\cdot\prod_{i=1}^{j}w_{k_{i}} for j[d]j\in[d]. Abusing notation, we use pwp_{\mathbf{w}} to denote both the multivariate polynomial and the vector in HMKd\mathcal{H}_{\mathsf{MK}_{d}}. The following lemma establishes that pwHMKdp_{\mathbf{w}}\in\mathcal{H}_{\mathsf{MK}_{d}} is indeed a representation of the polynomial pwp_{\mathbf{w}}, and gives a bound on pw,pw\langle p_{\mathbf{w}},p_{\mathbf{w}}\rangle. The proof follows an analysis applied by Shalev-Shwartz et al. [31, Lemma 2.4] to a different kernel (cf. Remark 2.8 below).

Let p(t)=i=0dβitip(t)=\sum_{i=0}^{d}\beta_{i}t^{i} be a given univariate polynomial with i=1dβi2B\sum_{i=1}^{d}\beta_{i}^{2}\leq B. For w\mathbf{w} such that w1\|\mathbf{w}\|\leq 1, consider the polynomial pw(x):=p(wx)p_{\mathbf{w}}(\mathbf{x}):=p(\mathbf{w}\cdot\mathbf{x}). Then pwp_{\mathbf{w}} is represented by the vector pwHMKdp_{\mathbf{w}}\in\mathcal{H}_{{\mathsf{MK}_{d}}} defined above. Moreover, pw,pwB\langle p_{\mathbf{w}},p_{\mathbf{w}}\rangle\leq B.

Shalev-Shwartz et al. proved a bound on the Euclidean norm of representations of polynomials of the form p(wx)p(\mathbf{w}\cdot\mathbf{x}) in the RKHS corresponding to the kernel function K(x,y)=1112x,yK(\mathbf{x},\mathbf{y})=\frac{1}{1-\frac{1}{2}\langle\mathbf{x},\mathbf{y}\rangle}. This allowed them to represent functions computed by power series, as opposed to polynomials of (finite) degree dd. However, for degree dd polynomials, the use of their kernel results in a Euclidean norm bound that is a factor of 2d2^{d} worse than what we obtain from Lemma 2.7. This difference is central to our results on noisy polynomial reconstruction in Section 3.5, where we address this issue in more technical detail.

5 Generalization Bounds

We make use of the following standard generalization bound for hypothesis classes with small Rademacher complexity. Readers unfamiliar with Rademacher complexity may refer to the paper of Bartlett and Mendelson .

where Rm(F)\mathcal{R}_{m}(\mathcal{F}) is the Rademacher complexity of the function class F\mathcal{F}.

We will combine the following two theorems with Theorem 2.9 above to bound the generalization error of our algorithms for agnostic and reliable learning.

Let X\mathcal{X} be a subset of a Hilbert space equipped with inner product ,\langle\cdot,\cdot\rangle such that for each xX\mathbf{x}\in\mathcal{X}, x,xX2\langle\mathbf{x},\mathbf{x}\rangle\leq X^{2}, and let W={xx,w  w,wW2}\mathcal{W}=\{\mathbf{x}\mapsto\langle\mathbf{x},\mathbf{w}\rangle~{}|~{}\langle\mathbf{w},\mathbf{w}\rangle\leq W^{2}\} be a class of linear functions. Then it holds that

The following result as stated appears in but is originally attributed to .

6 Approximation Theory

Finally, let p(x)=2ϵ2(pˉ(x)12)+12p(x)=\frac{2-\epsilon}{2}(\bar{p}(x)-\frac{1}{2})+\frac{1}{2}. We have for xx\in,

Furthermore, it is clearly the case that p()p()\subseteq. ∎

Let p(t)=i=0dβitip(t)=\sum_{i=0}^{d}\beta_{i}t^{i} be a univariate polynomial of degree dd. Let MM be such that maxtp(t)M\displaystyle\max_{t\in}|p(t)|\leq M. Then i=0dβi2(d+1)(4e)2dM2\displaystyle\sum_{i=0}^{d}\beta_{i}^{2}\leq(d+1)\cdot(4e)^{2d}\cdot M^{2}.

Lemma 4.1 from Sherstov states that for any polynomial satisfying the conditions in the statement of the lemma, the following holds for all i{0,,d}i\in\{0,\ldots,d\}:

Reliably Learning the ReLU

In order to reliably learn ReLUs, it would suffice to solve Optimization Problem 1 (see Section 1). This mathematical program, however, is not convex; hence, we consider a suitable convex relaxation.

Thus, an optimal solution to Optimization Problem 2 achieves a loss on the training data that is within {i  yi>0}Lϵ|\{i~{}|~{}y_{i}>0\}|\cdot L\cdot\epsilon of that achieved by the optimal solution to Optimization Problem 1.

While Optimization Problem 2 is convex, it is still not trivial to solve efficiently. For one, the RKHS HMKd\mathcal{H}_{\mathsf{MK}_{d}} has dimension nΘ(d)n^{\Theta(d)}. However, materializing such vectors explicitly requires nΘ(d)n^{\Theta(d)} time, and Theorem 1.3 promises a learning algorithm with runtime 2O(1/ϵ)nO(1)nO(d)2^{O(1/\epsilon)}\cdot n^{O(1)}\ll n^{O(d)}. As in Shalev-Shwartz et al. , we apply the Representer Theorem (see e.g., ), to guarantee that Optimization Problem 2 can be solved in time that is polynomial in the number of samples used.

2 Description of the Output Hypothesis

To address this issue, we will “clip” the function to always output a value between $$:

The hypothesis hh output by our algorithm is as follows.

We use a fact due to Ledoux and Talagrand on the Rademacher complexity of composed function classes (Theorem 2.11) to bound the generalization error. Clipping comes at a small cost, in the sense that it forces us to require that the loss function be monotone. However, we can handle non-monotone losses if the output hypothesis is not clipped, albeit with sample complexity bounds that depend polynomially on the Lipschitz-constant and bound of the loss in the interval [B,B][-\sqrt{B},\sqrt{B}] as opposed to $$.

3 Formal Version of Theorem 1.3 and Its Proof

In order to prove the theorem, we need to bound the following two losses for the output hypothesis hh.

Now consider the following to bound L=0(h;D)\mathcal{L}_{=0}(h;{\cal D}).

It is sometimes possible to avoid this exponential dependence on LL in the setting of agnostic learning (as opposed to reliable learning). Indeed, in the case of agnostic learning there is no need to threshold the output at 2ϵ2\epsilon (this thresholding contributed 2ϵL2\epsilon L to our bound on the excess error established in Inequality (13)); simply clipping the output to be in the range of Y\mathcal{Y} suffices.

4 An Implication for Learning Convex Neural Networks

In a recent work, Bach considered convex relaxations of optimization problems related to learning neural networks with a single hidden layer and non-decreasing homogeneous activation function.His setting allows potentially uncountably many hidden units along with a sparsity-inducing regularizer. One specific problem raised in his paper [3, Sec. 6] is understanding the computational complexity of the following problem.

We describe the modified algorithm and the minor differences in the proof below.

Consider a multivariate polynomial pp of degree dd such that sum of the squared coefficients is bounded by BB. Denote the coefficient of monomial x1i1xninx_{1}^{i_{1}}\cdots x_{n}^{i_{n}} by β(i1,,in)\beta(i_{1},\dots,i_{n}) for (i1,,in){0,,d}n(i_{1},\ldots,i_{n})\in\{0,\ldots,d\}^{n}. We have

Let MM be the map that takes an ordered tuple (k1,,kj)[n]j({k_{1}},\ldots,{k_{j}})\in[n]^{j} for j[d]j\in[d] to the tuple (i1,,in){0,,d}n(i_{1},\ldots,i_{n})\in\{0,\ldots,d\}^{n} such that xk1xkj=x1i1xninx_{k_{1}}\cdots x_{k_{j}}=x_{1}^{i_{1}}\cdots x_{n}^{i_{n}}. Let C(i1,,in)C(i_{1},\ldots,i_{n}) be the number of distinct orderings of the iji_{j}’s for j[n]j\in[n]; C(i1,,in)C(i_{1},\dots,i_{n}) which can be computed from the multinomial theorem (cf. ). Observe that the number of tuples that MM maps to (i1,,in)(i_{1},\ldots,i_{n}) is precisely C(i1,,in)C(i_{1},\ldots,i_{n}).

Recall that HMKd\mathcal{H}_{{\mathsf{MK}_{d}}} denotes the RKHS from Definition 2.6. Observe that the polynomial pp from Equation (22) is represented by the vector vpHMKd\mathbf{v}_{p}\in\mathcal{H}_{{\mathsf{MK}_{d}}} defined as follows. For j[d]j\in[d], entry (k1,,kj)(k_{1},\dots,k_{j}) of vp\mathbf{v}_{p} equals

It is easy to see that vp\mathbf{v}_{p} as defined represents pp. Indeed,

Via a standard analysis identical to that of Section 3.1, Optimization Problem 6 is a convex program and can be solved in time polynomial in mm, nn, and dd. Let α\boldsymbol{\alpha}^{*} denote an optimal solution to Optimization Problem 6 and let f()=i=1mαiMKd(xi,)f(\cdot)=\sum_{i=1}^{m}\alpha_{i}^{*}{\mathsf{MK}_{d}}(\mathbf{x}_{i},\cdot). The hypothesis hh output by our algorithm is as follows.

6 Proper Learning

Observe that the above can be easily computed since we know xi\mathbf{x}_{i} for all i[m]i\in[m], and the function CC can be efficiently computed as discussed before using the multinomial theorem. Hence, the hypothesis is itself a polynomial of degree at most dd, any desired coefficient of which can be computed efficiently.

7 Formal Version of Theorem 1.6 and Its Proof

where opt\mathsf{opt} is the error of the best fitting multivariate polynomial pp of degree dd whose sum of squares of coefficients is bounded by BB.

In order to prove the theorem, we need to bound

In the rest of the proof we assume that for every fP(n,d,B)f\in\mathcal{P}(n,d,B), the following hold:

Now consider the following to bound L(h;D)\mathcal{L}(h;D). Letting pp be any polynomial in P(n,d,B)\mathcal{P}(n,d,B),

Networks of ReLUs

In this section, we extend learnability results for a single ReLU to network of ReLUs. Our results in this section apply to the standard agnostic model of learning in the case that the output is a linear combination of hidden units. If our output layer, however, is a single ReLU, then our results can be extended to the reliable setting using similar techniques from Section 3.

We will use the same framework as Zhang et al. , who showed how to learn networks where the activation function is computed exactly by a power series (with bounded sum of squares of coefficients BB) with respect to loss functions that are bounded on a domain that is a function of BB. Their algorithm works by repeatedly composing the kernel of Shalev-Shwartz et al. and optimizing in the corresponding RKHS.

We generalize their results to activation functions that are approximated by polynomials. This allows us to capture many classes of activation functions including ReLUs. Our clipping technique also allows us to work with respect to a broader class of loss functions.

Our results for learning networks of ReLUs have a number of new applications. First, we give the first efficient algorithms for learning “parameterized” ReLUs and “leaky” ReLUs. Second, we obtain the first polynomial-time approximation schemes for convex piecewise-linear regression (see Section 4.5 for details). As far as we are aware, there were no provably efficient algorithms known for these types of multivariate piecewise-linear regression problems.

where yj(0)(x)=xy^{(0)}_{j}(\mathbf{x})=\mathbf{x} for all jj. We similarly define hj(i)h^{(i)}_{j} to be the function that maps xX\mathbf{x}\in\mathcal{X} to the input of unit jj in layer i+1i+1:

For a better understanding of the above notation, consider a fully-connected network N1\mathcal{N}_{1} with a single hidden layer (these are also known as depth-2 networks) consisting of kk units:

In this case, output of unit i[k]i\in[k] in the hidden layer is yi(1)(x)=σ(wix)y^{(1)}_{i}(\mathbf{x})=\sigma(\mathbf{w}_{i}\cdot\mathbf{x}) and the input to the same unit is hi(0)(x)=wixh^{(0)}_{i}(\mathbf{x})=\mathbf{w}_{i}\cdot\mathbf{x}.

Let N[σ,D,W,M]\mathcal{N}[\sigma,D,W,M] be the class of fully-connected networks with DD hidden layers and σ\sigma as the activation function. Additionally, the weights are constrained such that j=1n(wij(0))2M2\sum_{j=1}^{n}(w_{ij}^{(0)})^{2}\leq M^{2} for all units ii in layer 0 and k=1n(i)wjk(i)W\sum_{k=1}^{n^{(i)}}|w_{jk}^{(i)}|\leq W for all units jj in all layers i{1,,D}i\in\{1,\ldots,D\}. Also, the inputs to each unit are bounded in magnitude by MM, i.e., hj(l)(x)[M,M]h^{(l)}_{j}(\mathbf{x})\in[-M,M] with M1M\geq 1 for each l<Dl<D and j=1,,n(l+1)j=1,\dots,n^{(l+1)}.

We consider activation functions which can be approximated by polynomials with sum of squares of coefficients bounded. We term them low-weight approximable activation functions, formalized as follows.

2 Approximate Polynomial Networks

We first bound the error incurred when each activation function is replaced by a corresponding low-weight polynomial approximation.

Let σ\sigma be an activation function that is 1-LipschitzNote that this is not a restriction, as we have not explicitly constrained the weights WW. Thus, to allow a Lipschitz constant LL, we simply replace WW by WLWL. and such that there exists a degree dd polynomial pp that is a (ϵWDD,2M,B)(\frac{\epsilon}{W^{D}D},2M,B) approximation for σ\sigma, with ϵ(0,1)\epsilon\in(0,1), with d,M,B1d,M,B\geq 1. Then, for all NN[σ,D,W,M]\mathcal{N}\in\mathcal{N}[\sigma,D,W,M], there exists NˉN[p,D,W,2M]\bar{\mathcal{N}}\in\mathcal{N}[p,D,W,2M] such that

Let NN[σ,D,W,M]\mathcal{N}\in\mathcal{N}[\sigma,D,W,M] and let NˉN[p,D,W,M]\bar{\mathcal{N}}\in\mathcal{N}[p,D,W,M] be such that it has the same structure and weights as N\mathcal{N} with the activation replaced with pp. For N\mathcal{N} let h(i)(x)h^{(i)}(\mathbf{x}) be the inputs to layer i+1i+1 and y(i)(x)y^{(i)}(\mathbf{x}) be the outputs of unit jj of layer ii as defined previously. Correspondingly, for Nˉ\bar{\mathcal{N}} let hˉ(i)(x)\bar{h}^{(i)}(\mathbf{x}) be the inputs to layer i+1i+1 and yˉ(i)(x)\bar{y}^{(i)}(\mathbf{x}) be the outputs of layer ii. We prove by induction on layer ii that for all units jj of layer ii,

For layer i=0i=0, we have h(0)(x)j=hˉ(0)(x)j=wj(0)x[M,M]h^{(0)}(\mathbf{x})_{j}=\bar{h}^{(0)}(\mathbf{x})_{j}=\textbf{w}_{j}^{(0)}\cdot\mathbf{x}\in[-M,M] which trivially satisfies (32). Now, we prove that the desired property holds for layer ll, assuming the following holds for layer l1l-1. We have for all units jj in layer l1l-1,

Note that this implies that hˉj(l1)(x)hj(l1)(x)+(l1)ϵWDl+1D2M\left|\bar{h}^{(l-1)}_{j}(\mathbf{x})\right|\leq\left|h^{(l-1)}_{j}(\mathbf{x})\right|+\frac{(l-1)\epsilon}{W^{D-l+1}D}\leq 2M. Here the second inequality follows from the assumption that inputs to each unit are bounded by MM and ϵ<1\epsilon<1. We have for all x\mathbf{x} and jj,

Step (34) follows since hˉj(l1)(x)[2M,2M]\bar{h}^{(l-1)}_{j}(\mathbf{x})\in[-2M,2M] and pp uniformly ϵWDD\frac{\epsilon}{W^{D}D}-approximates σ\sigma in [2M,2M][-2M,2M]. Step (35) follows from σ\sigma being 1-Lipschitz. Step (36) follows from (33). Finally Step (37) follows from wj(l)1W\|\textbf{w}_{j}^{(l)}\|_{1}\leq W which is given. This completes the inductive proof.

We conclude by noting that N(x)=h1(D)(x)\mathcal{N}(\mathbf{x})=h_{1}^{(D)}(\mathbf{x}) and Nˉ(x)=hˉ1(D)(x)\bar{\mathcal{N}}(\mathbf{x})=\bar{h}_{1}^{(D)}(\mathbf{x}). Thus, from above we get,

Given the above transformation to a polynomial network and associated error bounds, we apply the main theorem of Zhang et al. combined with the clipping technique from Section 3 to obtain the following result:

The time complexity of the above algorithm is bounded by nO(1)BO(d)D1log(1/δ)n^{O(1)}\cdot B^{O(d)^{D-1}}\cdot\log(1/\delta), where dd is the degree of pp, and BB is a bound on i=0d2iβi2\sum_{i=0}^{d}2^{i}\beta_{i}^{2} (see Defn. 4.2).

From Theorem 4.3 we have that for all NN[σ,D,W,M]\mathcal{N}\in\mathcal{N}[\sigma,D,W,M], there is a network NˉN[p,D,W,M]\bar{\mathcal{N}}\in\mathcal{N}[p,D,W,M] such that

Now from [36, Theorem 1], we know that there exists an algorithm that outputs a predictor f^\widehat{f} such that with probability at least 1δ1-\delta for any distribution D{\cal D}

For loss functions that take on large values on the range of the predictor, we instead output the clipped version of the predictor clip(f^)\mathsf{clip}(\widehat{f}) in order to satisfy the requirements of the Rademacher bounds (as in Section 3).

Combining the above inequalities, we have

We can now state the learnability result for ReLU networks as follows.

The time complexity of the above algorithm is bounded by nO(1)2((L+1)MWDDϵ1)Dlog(1/δ)n^{O(1)}\cdot 2^{((L+1)\cdot M\cdot W^{D}\cdot D\cdot\epsilon^{-1})^{D}}\cdot\log(1/\delta).

Using Theorem 4.4, we state the following learnability result for depth-2 sigmoid networks.

3 Application: Learning Parametric Rectified Linear Unit

A Parametric Rectified Linear Unit (PReLU) is a generalization of ReLU introduced by He et al. . Compared to the ReLU, it has an additional parameter that is learned. Formally, it is defined as

The parametric rectifier (denoted by σPReLU\sigma_{\text{PReLU}}) is an activation function defined as

The proof of the corollary follows from setting L=1L=1, D=1D=1, M=1M=1 and W=O(1)W=O(1) in Theorem 4.5.

The condition that a|a| be bounded by 1 is reasonable as in practice the value of aa is very rarely above 1 as observed by He et al. . Also note that Leaky-ReLUs are PReLUs with fixed aa (usually 0.01). Hence, we can agnostically learn them under the same conditions using an identical argument as above. Note that a network of PReLU can also be similarly learned as a ReLU by replacing each ReLU in the network by a linear combination of two ReLUs as described before.

4 Application: Learning the Piecewise Linear Transfer Function

Several functions have been used to relax the 0/10/1 loss in the context of learning linear classifiers. The best example is the sigmoid function discussed earlier. Here we consider the piecewise linear transfer function. Formally, it is defined as

The CC-Lipschitz piecewise linear transfer function (denoted by σpw\sigma_{\text{pw}}) is an activation function defined as

The proof of the corollary follows from setting L=1L=1, D=1D=1, M=1M=1 and W=O(L)W=O(L) in Theorem 4.5.

5 Application: Convex Piecewise-Linear Fitting

In this section we can use our learnability results for networks of ReLUs to give polynomial-time approximation schemes for convex piecewise-linear regression . These problems have been studied in optimization and notably in machine learning in the context of Multivariate Adaptive Regression Splines (MARS ). Note that these are not the same as univariate piecewise or segmented regression problems, for which polynomial-time algorithms are known. Although our algorithms run in time exponential in kk (the number of affine functions), we note that no provably efficient algorithms were known prior to our work even for the case k=2k=2.Boyd and Magnani specifically focus on the case of small kk, writing “Our interest, however, is in the case when the number of terms kk is relatively small, say no more than 10, or a few 10s.”

The key idea will be to reduce piecewise regression problems to an optimization problem on networks of ReLUs using simple ReLU “gadgets.” We formally describe the problems and describe the gadgets in detail.

We start with a simple class of convex piecewise linear functions represented as a sum of a fixed number of functions where each of these functions is a maximum of 2 affine functions. This is formally defined as follows.

Each input to each unit is bounded in magnitude by 2.

The first property holds as wj(0)max(w2j1w2j,w2j1,w2j)w2j1+w2j2\|\mathbf{w}_{j}^{(0)}\|\leq\mathsf{max}(\|\mathbf{w}_{2j-1}-\mathbf{w}_{2j}\|,\|\mathbf{w}_{2j-1}\|,\|\mathbf{w}_{2j}\|)\leq\|\mathbf{w}_{2j-1}\|+\|\mathbf{w}_{2j}\|\leq 2 using the triangle inequality. The second holds because each of the kk max sub-networks contributes 3 to w1(1)1\|\mathbf{w}_{1}^{(1)}\|_{1}. The third is implied by the fact that each input to each unit is bounded by max(w1x,w1x,(w1w2)x)2|\mathsf{max}(\mathbf{w}_{1}\cdot\mathbf{x},-\mathbf{w}_{1}\cdot\mathbf{x},(\mathbf{w}_{1}-\mathbf{w}_{2})\cdot\mathbf{x})|\leq 2.

Let C{\cal C} be as in Definition 4.13, then there is an algorithm A{\cal A} for solving sum of kk max 2-affine fitting problem in time nO(1)2O((k2/ϵ))log(1/δ)n^{O(1)}2^{O\left((k^{2}/\epsilon)\right)}\log(1/\delta).

5.2 Max k𝑘k-Affine

In this section, we move to a more general convex piecewise linear function represented as the maximum of kk affine functions. This is formally defined as follows.

Note that this form is universal since any convex piecewise-linear function can be expressed as a max-affine function, for some value of kk. However, we focus on bounded kk and give learnability bounds in terms of kk.

Observe that max kk-affine can be expressed in a complete binary tree structure of height logk\lceil\log k\rceil with a max\mathsf{max} operation at each unit and wix\mathbf{w}_{i}\cdot\mathbf{x} for i[k]i\in[k] at the kk leaf units (for example, see Figure 2 Note that if kk is not a power of 2, then we can trivially add leaves with value w1x\mathbf{w}_{1}\cdot\mathbf{x} and make it a complete tree.

Thus, the class of convex piecewise linear functions can be expressed as a network of ReLUs with logk\lceil\log k\rceil hidden layers by replacing each max\mathsf{max} unit in the tree by 3 ReLUs and adding an output unit. See Figure 3 for the construction for k=4k=4.

for j[32logk1]j\in[3\cdot 2^{\lceil\log k\rceil-1}]. Further, the weight vectors input to hidden layer i{2,,logk}i\in\{2,\ldots,\lceil\log k\rceil\} of the network are

for j[32logki+1]j\in[3\cdot 2^{\lceil\log k\rceil-i+1}]. Note, ei\mathbf{e}_{i} refers to the vector with 1 at position ii and 0 everywhere else. Finally the weight vector for the output unit is w1(logk)=e1+e2e3w_{1}^{(\lceil\log k\rceil)}=\mathbf{e}_{1}+\mathbf{e}_{2}-\mathbf{e}_{3}. The following properties of Nmax\mathcal{N}_{\mathsf{max}} are easy to deduce.

wj(i)16\|\mathbf{w}_{j}^{(i)}\|_{1}\leq 6 for i[logk]i\in[\lceil\log k\rceil]

Here, the first and third conditions are the same conditions as in the previous section. The second holds by the values of the weights defined above. Using the above construction, we obtain the following result.

Let C{\cal C} be as in Definition 4.15, then there is an algorithm A{\cal A} for solving the max kk-affine fitting problem in time nO(1)2O(k/ϵ)logklog(1/δ)n^{O(1)}\cdot 2^{O(k/\epsilon)^{\lceil\log k\rceil}}\cdot\log(1/\delta).

Hardness of Learning ReLU

To begin, we recall the following problem from computational learning theory widely thought to be computationally intractable.

(Learning Sparse Parity with Noise) Let χS:{0,1}n{0,1}n\chi_{S}:\{0,1\}^{n}\rightarrow\{0,1\}^{n} be an unknown parity function on a subset SS, Sk|S|\leq k, of nn inputs bits (i.e., any input, restricted to SS, with an odd number of ones is mapped to 11 and otherwise). Let Ck{\cal C}_{k} be the concept class of all parity functions on subsets SS of size at most k. Let D{\cal D} be a distribution on {0,1}n×{1,1}\{0,1\}^{n}\times\{-1,1\} and define

The Sparse Learning Parity with Noise problem is as follows: Given i.i.d. examples drawn from D{\cal D}, find hh such that Pr(x,y)D[h(x)y]opt+ϵ\Pr_{(\mathbf{x},y)\sim{\cal D}}[h(\mathbf{x})\neq y]\leq\mathsf{opt}+\epsilon.

For every algorithm A{\cal A} that solves the Sparse Learning Parity with Noise problem, there exists ϵ=O(1)\epsilon=O(1) and kω(1)k\in\omega(1) such that A{\cal A} requires time nΩ(k)n^{\Omega(k)}.

Any algorithm breaking the above assumption would be a major result in theoretical computer science. The best known algorithms due to Blum, Kalai, Wasserman and Valiant run in time 2O(n/logn)2^{O(n/\log n)} and n0.8kn^{0.8k}, respectively. Under this assumption, we can rule out polynomial-time algorithms for reliably learning ReLUs on distributions supported on {0,1}n\{0,1\}^{n}.

Let C{\cal C} be the class of ReLUs over the domain X={0,1}n{\cal X}=\{0,1\}^{n} with the added restriction that w12k\|\mathbf{w}\|_{1}\leq 2k. Any algorithm A{\cal A} for reliably learning C{\cal C} in time g(ϵ)poly(n)g(\epsilon)\cdot\mathsf{poly}(n) for any function gg will give a polynomial time algorithm for learning sparse parities with noise of size kk for ϵ=O(1)\epsilon=O(1).

We will show how to use a reliable ReLU learner to agnostically learn conjunctions on {0,1}n\{0,1\}^{n} and use an observation due to Feldman and Kothari who showed that agnostically learning conjunctions is harder than the Sparse Learning Parity with Noise problem. Let COk{\cal CO}_{k} be the concept class of all Boolean conjunctions of length at most kk.

Notice that for the domain X={0,1}n{\cal X}=\{0,1\}^{n}, the conjunction of literals x1,,xkx_{1},\ldots,x_{k} can be computed exactly as max(0, x1++xk(k1))\mathsf{max}(0,~{}x_{1}+\cdots+x_{k}-(k-1)). Fix an arbitrary distribution D{\cal D} on {0,1}n×{0,1}\{0,1\}^{n}\times\{0,1\} and define

Kalai et al. (Theorem 5) observed that in order output a hypothesis hh with error opt+ϵ\mathsf{opt}+\epsilon it suffices to minimize (to within ϵ\epsilon) the following quantity:

The above proof also shows hardness of learning ReLUs agnostically. Note the above hardness result holds if we require the learning algorithm to succeed on all domains where (wx)|(w\cdot x)| can grow without bound with respect to nn:

Finally, we point out Kalai et al. proved that reliably learning conjunctions is also as hard as PAC Learning DNF formulas. Thus, by our above reduction, any efficient algorithm for reliably learning ReLUs would give an efficient algorithm for PAC learning DNF formulas (again this would be considered a breakthrough result in computational learning theory).

Conclusions and Open Problems

We have given the first set of efficient algorithms for ReLUs in a natural learning model. ReLUs are both effective in practice and, unlike linear threshold functions (halfspaces), admit non-trivial learning algorithms for all distributions with respect to adversarial noise. We “sidestepped” the hardness results in Boolean function learning by focusing on problems that are not entirely scale-invariant with respect to the choice of domain (e.g., reliably learning ReLUs). The obvious open question is to improve the dependence of our main result on 1/ϵ1/\epsilon. We have no reason to believe that 2O(1/ϵ)2^{O(1/\epsilon)} is the best possible.

Acknowledgements. The authors are grateful to Sanjeev Arora and Roi Livni for helpful feedback and useful discussions on this work.

References