Reliably Learning the ReLU in Polynomial Time

Surbhi Goel, Varun Kanade, Adam Klivans, Justin Thaler

Introduction

Recall the following two fundamental machine-learning problems:

The term agnostic above refers to the fact that the labeling on $\{-1,1\}$ may be arbitrary. In this work, we relax the notion of success to improper learning, where the learner may output any polynomial-time computable hypothesis achieving a loss that is within $\epsilon$ of the optimal solution from the concept class.

Taken together, these two problems are at the core of many important techniques from modern Machine Learning and Statistics. It is well-known how to efficiently solve ordinary least squares and other variants of linear regression; we know of multiple polynomial-time solutions, all extensively used in practice . In contrast, Problem 1.2 is thought to be computationally intractable due to the many existing hardness results in the literature .

The ReLU is a hybrid function that lies “in-between” a linear function and a threshold function in the following sense: restricted to inputs $\mathbf{x}$ such that $\mathbf{w}\cdot\mathbf{x}>0$ , the ReLU is linear, and for inputs $\mathbf{x}$ such that $\mathbf{w}\cdot\mathbf{x}\leq 0$ , the ReLU thresholds the value $\mathbf{w}\cdot\mathbf{x}$ and simply outputs zero. In this sense, we could view the ReLU as a “one-sided” threshold function. Since learning a ReLU has aspects of both linear regression and threshold learning, it is not straightforward to identify a notion of loss that captures both of these aspects.

We introduce a natural model for learning ReLUs inspired by the Reliable Agnostic learning model that was introduced by Kalai et al. in the context of Boolean functions. The goal will be to minimize both the false positive rate and a loss function (for example, square-loss) on points the distribution labels non-zero. In this work, we give efficient algorithms for learning a ReLU over the unit sphere with respect to any loss function that satisfies mild properties (convexity, monotonicity, boundedness, and Lipschitz-ness).

The Reliable Agnostic model is motivated by the Neyman-Pearson criteria, and is intended to capture settings in which false positive errors are more costly than false negative errors (e.g., spam detection) or vice versa. We observe that the asymmetric manner in which the Reliable Agnostic model treats different types of errors naturally corresponds to the one-sided nature of a ReLU. In particular, there may be settings in which mistakenly predicting a positive value instead of zero carries a high cost.

As a concrete example, imagine that inputs are comments on an online news article. Suppose that each comment is assigned a numerical score of quality or appropriateness, where the true scoring function is reasonably modeled by a linear function of the features of the comment. The newspaper wants to implement an automated system in which comments are either a) rejected outright if the score is below a threshold or b) posted in order of score, possibly after undergoing human review.For example, The New York Times recently announced that they are moving to a hybrid comment moderation system that combines human and automated review . In this situation, it may be costlier to post (or subject to human review) a low-quality or inappropriate comment than it is to automatically reject a comment that is slightly above the threshold for posting.

2 Our Contributions

We can now state our main theorem giving a poly-time algorithm (in $n$ , the dimension) for reliably learning any ReLU.

Let $\mathcal{C}=\{\mathbf{x}\mapsto\mathsf{max}(0,\mathbf{w}\cdot\mathbf{x})\colon\|\mathbf{w}\|_{2}\leq 1\}$ be the class of ReLUs with weight vectors $\mathbf{w}$ satisfying $\|\mathbf{w}\|_{2}\leq 1$ . There exists a learning algorithm ${\cal A}$ that reliably learns ${\cal C}$ in time $2^{O(1/\epsilon)}\cdot n^{O(1)}$ .

We can obtain the same complexity bounds for learning ReLUs in the standard agnostic model with respect to the same class of loss functions. This yields a PTAS (polynomial-time approximation scheme) for an optimization problem regarding ReLUs posed by Bach . See Section 3.4 for details.

We can compose our results to obtain efficient algorithms for small-depth networks of ReLUs. For brevity, here we state results only for linear combinations of ReLUs (which are often called depth-two networks of ReLUs, see, e.g., ). Formal results for other types of networks can be found in Section 4.

Let ${\cal C}$ be a depth-2 network of ReLUs with $k$ hidden units. Then ${\cal C}$ is reliably learnable in time $2^{O(\sqrt{k}/\epsilon)}\cdot n^{O(1)}$ .

The above results are perhaps surprising in light of the hardness result due to Livni et al. who showed that for ${\cal X}=\{0,1\}^{n}$ , learning the difference of even two ReLUs is as hard as learning a threshold function.

We also obtain results for noisy polynomial reconstruction on the sphere (equivalently, agnostically learning a polynomial) with respect to a large class of loss functions:

Andoni et al. were the first to give efficient algorithms for noisy polynomial reconstruction over non-Boolean domains. In particular, they gave algorithms that succeed on the unit cube but require an underlying product distribution and do not work in the agnostic setting (they also run in time exponential in the degree $d$ ).

At a high level the proofs of both Theorem 1.3 and 1.6 follow the same outline, but we do not know how to obtain one from the other.

3 Applications to Convex Piecewise Regression

We establish a novel connection between learning networks of ReLUs and a broad class of piecewise-linear regression problems studied in machine learning and optimization. The following problem was defined by Boyd and Magnani as a generalization of the well-known MARS (multivariate adaptive regression splines) framework due to Friedman :

Applying our learnability results for networks of ReLUs, we obtain the first polynomial-time algorithms for solving the above max- $k$ -affine regression problem and the sum of max- $2$ -affine regression problem when $k=O(1)$ . Boyd and Magnani specifically highlight the case of $k=O(1)$ and provide a variety of heuristics; we obtain the first provably efficient results.

There is an algorithm ${\cal A}$ for solving the convex piecewise-linear fitting problem (cf. Definition 1.7) in time $2^{O\left((k/\epsilon)^{{\log k}}\right)}\cdot n^{O(1)}$ .

We can also use our results for learning networks of ReLUs to learn the so-called “leaky ReLUs” and “parameterized” ReLUs (PReLUs); see Section 4.3 for details. We obtain these results by composing various “ReLU gadgets,” i.e., constant-depth networks of ReLUs with a small number of bounded-weight hidden units.

4 Hardness

Let ${\cal C}$ be the class of ReLUs over the domain ${\cal X}=\{0,1\}^{n}$ . Then any algorithm for reliably learning ${\cal C}$ in time $g(\epsilon)\cdot\mathsf{poly}(n)$ for any function $g$ will give a polynomial time algorithm for learning $\omega(1)$ -sparse parities with noise (for any $\epsilon=O(1)$ ).

Efficiently learning sparse parities (of any superconstant length) with noise is considered one the most challenging problems in theoretical computer science.

5 Techniques and Related Work

Our starting point will be to prove the existence of low-degree, low-weight polynomial approximators for every $c\in{\cal C}$ . The polynomial method has a well established history in computational learning theory (e.g., Kalai et al. for agnostically learning halfspaces under distributional assumptions), and we can apply classical techniques from approximation theory and recent work due to Sherstov to construct low-weight, low-degree approximators for any ReLU.

We can then relax Optimization Problem 1 to the space of low-weight polynomials and follow the approach of Shalev-Shwartz et al. who used tools from Reproducing Kernel Hilbert Spaces (RKHS) to learn low-weight polynomials efficiently (Shalev-Shwartz et al. focused on a relaxation of the 0/1 loss for halfspaces).

The main challenge is to obtain reliability; i.e., to simultaneously minimize the false-positive rate and the loss dictated by the objective function. To do this we take a “dual-loss” approach and carefully construct two loss functions that will both be minimized with high probability. Proving that these losses generalize for a large class of objective functions is subtle and requires “clipping” in order to apply the appropriate Rademacher bound. Our final output hypothesis is $\mathsf{max}(0,h)$ where $h$ is a “clipped” version of the optimal low-weight, low-degree polynomial on the training data, appropriately kernelized.

Our learning algorithms for networks of ReLUs are obtained by generalizing a composition technique due to Zhang et al. , who considered networks of “smooth” activation functions computed by power series (we discuss this more in Section 4). Using a sequence of “gadget” reductions, we then show that even small-size networks of ReLUs are surprisingly powerful, yielding the first set of provably efficient algorithms for a variety of piecewise-linear regression problems in high dimension.

Note: A recent manuscript appearing on the Arxiv due to R. Arora et al. considers the complexity of training depth- $2$ networks of ReLUs with $k$ hidden units on a sample of size $m$ but when the dimension $n=1$ . They give a proper learning algorithm that runs in time exponential in $k$ and $m$ . These concept classes, however, can be improperly learned in time polynomial in $k$ and $m$ using a straightforward reduction to piecewise linear regression on the real line.

Preliminaries

2 Concept Classes

3 Learning Models

We consider two learning models in this paper. The first is the standard agnostic learning model and the second is a generalization of the reliable agnostic learning framework . We describe these models briefly; the reader may refer to the original articles for further details.

4 Kernel Methods

We make use of kernel methods in our learning algorithms. For completeness, we define kernels and a few important results concerning kernel methods. The reader may refer to Hofmann et al. (or any standard text) for further details.

We will use the following variant of the polynomial kernel:

Also define $\mathcal{H}_{{\mathsf{MK}_{d}}}$ to be the corresponding Reproducing Kernel Hilbert Space (RKHS).

Observe that ${\mathsf{MK}_{d}}$ is the sum of standard polynomial kernels (cf. ) of degree $i$ for $i\in[d]$ . However, the feature map conventionally used for a standard polynomial kernel has only ${n+d\choose d}$ entries and, under that definition involves coefficients of size as large as $d^{\Theta(d)}$ . The feature map $\psi_{d}$ used by ${\mathsf{MK}_{d}}$ avoids these coefficients by using $N_{d}$ entries as defined above (that is, entries of $\psi_{d}(\mathbf{x})$ are indexed by ordered subsets of $[n]$ , while entries of the standard feature map are indexed by unordered subsets of $[n]$ .)

Consider the representation of $p_{\mathbf{w}}$ as an element of $\mathcal{H}_{\mathsf{MK}_{d}}$ defined as follows: the entry of index $(k_{1},\ldots,k_{j})\in[n]^{j}$ of the representation equals $\beta_{j}\cdot\prod_{i=1}^{j}w_{k_{i}}$ for $j\in[d]$ . Abusing notation, we use $p_{\mathbf{w}}$ to denote both the multivariate polynomial and the vector in $\mathcal{H}_{\mathsf{MK}_{d}}$ . The following lemma establishes that $p_{\mathbf{w}}\in\mathcal{H}_{\mathsf{MK}_{d}}$ is indeed a representation of the polynomial $p_{\mathbf{w}}$ , and gives a bound on $\langle p_{\mathbf{w}},p_{\mathbf{w}}\rangle$ . The proof follows an analysis applied by Shalev-Shwartz et al. [31, Lemma 2.4] to a different kernel (cf. Remark 2.8 below).

Let $p(t)=\sum_{i=0}^{d}\beta_{i}t^{i}$ be a given univariate polynomial with $\sum_{i=1}^{d}\beta_{i}^{2}\leq B$ . For $\mathbf{w}$ such that $\|\mathbf{w}\|\leq 1$ , consider the polynomial $p_{\mathbf{w}}(\mathbf{x}):=p(\mathbf{w}\cdot\mathbf{x})$ . Then $p_{\mathbf{w}}$ is represented by the vector $p_{\mathbf{w}}\in\mathcal{H}_{{\mathsf{MK}_{d}}}$ defined above. Moreover, $\langle p_{\mathbf{w}},p_{\mathbf{w}}\rangle\leq B$ .

Shalev-Shwartz et al. proved a bound on the Euclidean norm of representations of polynomials of the form $p(\mathbf{w}\cdot\mathbf{x})$ in the RKHS corresponding to the kernel function $K(\mathbf{x},\mathbf{y})=\frac{1}{1-\frac{1}{2}\langle\mathbf{x},\mathbf{y}\rangle}$ . This allowed them to represent functions computed by power series, as opposed to polynomials of (finite) degree $d$ . However, for degree $d$ polynomials, the use of their kernel results in a Euclidean norm bound that is a factor of $2^{d}$ worse than what we obtain from Lemma 2.7. This difference is central to our results on noisy polynomial reconstruction in Section 3.5, where we address this issue in more technical detail.

5 Generalization Bounds

We make use of the following standard generalization bound for hypothesis classes with small Rademacher complexity. Readers unfamiliar with Rademacher complexity may refer to the paper of Bartlett and Mendelson .

where $\mathcal{R}_{m}(\mathcal{F})$ is the Rademacher complexity of the function class $\mathcal{F}$ .

We will combine the following two theorems with Theorem 2.9 above to bound the generalization error of our algorithms for agnostic and reliable learning.

Let $\mathcal{X}$ be a subset of a Hilbert space equipped with inner product $\langle\cdot,\cdot\rangle$ such that for each $\mathbf{x}\in\mathcal{X}$ , $\langle\mathbf{x},\mathbf{x}\rangle\leq X^{2}$ , and let $\mathcal{W}=\{\mathbf{x}\mapsto\langle\mathbf{x},\mathbf{w}\rangle~{}|~{}\langle\mathbf{w},\mathbf{w}\rangle\leq W^{2}\}$ be a class of linear functions. Then it holds that

The following result as stated appears in but is originally attributed to .

6 Approximation Theory

Finally, let $p(x)=\frac{2-\epsilon}{2}(\bar{p}(x)-\frac{1}{2})+\frac{1}{2}$ . We have for $x\in$ ,

Furthermore, it is clearly the case that $p()\subseteq$ . ∎

Let $p(t)=\sum_{i=0}^{d}\beta_{i}t^{i}$ be a univariate polynomial of degree $d$ . Let $M$ be such that $\displaystyle\max_{t\in}|p(t)|\leq M$ . Then $\displaystyle\sum_{i=0}^{d}\beta_{i}^{2}\leq(d+1)\cdot(4e)^{2d}\cdot M^{2}$ .

Lemma 4.1 from Sherstov states that for any polynomial satisfying the conditions in the statement of the lemma, the following holds for all $i\in\{0,\ldots,d\}$ :

Reliably Learning the ReLU

In order to reliably learn ReLUs, it would suffice to solve Optimization Problem 1 (see Section 1). This mathematical program, however, is not convex; hence, we consider a suitable convex relaxation.

Thus, an optimal solution to Optimization Problem 2 achieves a loss on the training data that is within $|\{i~{}|~{}y_{i}>0\}|\cdot L\cdot\epsilon$ of that achieved by the optimal solution to Optimization Problem 1.

While Optimization Problem 2 is convex, it is still not trivial to solve efficiently. For one, the RKHS $\mathcal{H}_{\mathsf{MK}_{d}}$ has dimension $n^{\Theta(d)}$ . However, materializing such vectors explicitly requires $n^{\Theta(d)}$ time, and Theorem 1.3 promises a learning algorithm with runtime $2^{O(1/\epsilon)}\cdot n^{O(1)}\ll n^{O(d)}$ . As in Shalev-Shwartz et al. , we apply the Representer Theorem (see e.g., ), to guarantee that Optimization Problem 2 can be solved in time that is polynomial in the number of samples used.

2 Description of the Output Hypothesis

To address this issue, we will “clip” the function to always output a value between $$:

The hypothesis $h$ output by our algorithm is as follows.

We use a fact due to Ledoux and Talagrand on the Rademacher complexity of composed function classes (Theorem 2.11) to bound the generalization error. Clipping comes at a small cost, in the sense that it forces us to require that the loss function be monotone. However, we can handle non-monotone losses if the output hypothesis is not clipped, albeit with sample complexity bounds that depend polynomially on the Lipschitz-constant and bound of the loss in the interval $[-\sqrt{B},\sqrt{B}]$ as opposed to $$.

3 Formal Version of Theorem 1.3 and Its Proof

In order to prove the theorem, we need to bound the following two losses for the output hypothesis $h$ .

Now consider the following to bound $\mathcal{L}_{=0}(h;{\cal D})$ .

It is sometimes possible to avoid this exponential dependence on $L$ in the setting of agnostic learning (as opposed to reliable learning). Indeed, in the case of agnostic learning there is no need to threshold the output at $2\epsilon$ (this thresholding contributed $2\epsilon L$ to our bound on the excess error established in Inequality (13)); simply clipping the output to be in the range of $\mathcal{Y}$ suffices.

4 An Implication for Learning Convex Neural Networks

In a recent work, Bach considered convex relaxations of optimization problems related to learning neural networks with a single hidden layer and non-decreasing homogeneous activation function.His setting allows potentially uncountably many hidden units along with a sparsity-inducing regularizer. One specific problem raised in his paper [3, Sec. 6] is understanding the computational complexity of the following problem.

We describe the modified algorithm and the minor differences in the proof below.

Consider a multivariate polynomial $p$ of degree $d$ such that sum of the squared coefficients is bounded by $B$ . Denote the coefficient of monomial $x_{1}^{i_{1}}\cdots x_{n}^{i_{n}}$ by $\beta(i_{1},\dots,i_{n})$ for $(i_{1},\ldots,i_{n})\in\{0,\ldots,d\}^{n}$ . We have

Let $M$ be the map that takes an ordered tuple $({k_{1}},\ldots,{k_{j}})\in[n]^{j}$ for $j\in[d]$ to the tuple $(i_{1},\ldots,i_{n})\in\{0,\ldots,d\}^{n}$ such that $x_{k_{1}}\cdots x_{k_{j}}=x_{1}^{i_{1}}\cdots x_{n}^{i_{n}}$ . Let $C(i_{1},\ldots,i_{n})$ be the number of distinct orderings of the $i_{j}$ ’s for $j\in[n]$ ; $C(i_{1},\dots,i_{n})$ which can be computed from the multinomial theorem (cf. ). Observe that the number of tuples that $M$ maps to $(i_{1},\ldots,i_{n})$ is precisely $C(i_{1},\ldots,i_{n})$ .

Recall that $\mathcal{H}_{{\mathsf{MK}_{d}}}$ denotes the RKHS from Definition 2.6. Observe that the polynomial $p$ from Equation (22) is represented by the vector $\mathbf{v}_{p}\in\mathcal{H}_{{\mathsf{MK}_{d}}}$ defined as follows. For $j\in[d]$ , entry $(k_{1},\dots,k_{j})$ of $\mathbf{v}_{p}$ equals

It is easy to see that $\mathbf{v}_{p}$ as defined represents $p$ . Indeed,

Via a standard analysis identical to that of Section 3.1, Optimization Problem 6 is a convex program and can be solved in time polynomial in $m$ , $n$ , and $d$ . Let $\boldsymbol{\alpha}^{*}$ denote an optimal solution to Optimization Problem 6 and let $f(\cdot)=\sum_{i=1}^{m}\alpha_{i}^{*}{\mathsf{MK}_{d}}(\mathbf{x}_{i},\cdot)$ . The hypothesis $h$ output by our algorithm is as follows.

6 Proper Learning

Observe that the above can be easily computed since we know $\mathbf{x}_{i}$ for all $i\in[m]$ , and the function $C$ can be efficiently computed as discussed before using the multinomial theorem. Hence, the hypothesis is itself a polynomial of degree at most $d$ , any desired coefficient of which can be computed efficiently.

7 Formal Version of Theorem 1.6 and Its Proof

where $\mathsf{opt}$ is the error of the best fitting multivariate polynomial $p$ of degree $d$ whose sum of squares of coefficients is bounded by $B$ .

In order to prove the theorem, we need to bound

In the rest of the proof we assume that for every $f\in\mathcal{P}(n,d,B)$ , the following hold:

Now consider the following to bound $\mathcal{L}(h;D)$ . Letting $p$ be any polynomial in $\mathcal{P}(n,d,B)$ ,

Networks of ReLUs

In this section, we extend learnability results for a single ReLU to network of ReLUs. Our results in this section apply to the standard agnostic model of learning in the case that the output is a linear combination of hidden units. If our output layer, however, is a single ReLU, then our results can be extended to the reliable setting using similar techniques from Section 3.

We will use the same framework as Zhang et al. , who showed how to learn networks where the activation function is computed exactly by a power series (with bounded sum of squares of coefficients $B$ ) with respect to loss functions that are bounded on a domain that is a function of $B$ . Their algorithm works by repeatedly composing the kernel of Shalev-Shwartz et al. and optimizing in the corresponding RKHS.

We generalize their results to activation functions that are approximated by polynomials. This allows us to capture many classes of activation functions including ReLUs. Our clipping technique also allows us to work with respect to a broader class of loss functions.

Our results for learning networks of ReLUs have a number of new applications. First, we give the first efficient algorithms for learning “parameterized” ReLUs and “leaky” ReLUs. Second, we obtain the first polynomial-time approximation schemes for convex piecewise-linear regression (see Section 4.5 for details). As far as we are aware, there were no provably efficient algorithms known for these types of multivariate piecewise-linear regression problems.

where $y^{(0)}_{j}(\mathbf{x})=\mathbf{x}$ for all $j$ . We similarly define $h^{(i)}_{j}$ to be the function that maps $\mathbf{x}\in\mathcal{X}$ to the input of unit $j$ in layer $i+1$ :

For a better understanding of the above notation, consider a fully-connected network $\mathcal{N}_{1}$ with a single hidden layer (these are also known as depth-2 networks) consisting of $k$ units:

In this case, output of unit $i\in[k]$ in the hidden layer is $y^{(1)}_{i}(\mathbf{x})=\sigma(\mathbf{w}_{i}\cdot\mathbf{x})$ and the input to the same unit is $h^{(0)}_{i}(\mathbf{x})=\mathbf{w}_{i}\cdot\mathbf{x}$ .

Let $\mathcal{N}[\sigma,D,W,M]$ be the class of fully-connected networks with $D$ hidden layers and $\sigma$ as the activation function. Additionally, the weights are constrained such that $\sum_{j=1}^{n}(w_{ij}^{(0)})^{2}\leq M^{2}$ for all units $i$ in layer 0 and $\sum_{k=1}^{n^{(i)}}|w_{jk}^{(i)}|\leq W$ for all units $j$ in all layers $i\in\{1,\ldots,D\}$ . Also, the inputs to each unit are bounded in magnitude by $M$ , i.e., $h^{(l)}_{j}(\mathbf{x})\in[-M,M]$ with $M\geq 1$ for each $l<D$ and $j=1,\dots,n^{(l+1)}$ .

We consider activation functions which can be approximated by polynomials with sum of squares of coefficients bounded. We term them low-weight approximable activation functions, formalized as follows.

2 Approximate Polynomial Networks

We first bound the error incurred when each activation function is replaced by a corresponding low-weight polynomial approximation.

Let $\sigma$ be an activation function that is 1-LipschitzNote that this is not a restriction, as we have not explicitly constrained the weights $W$ . Thus, to allow a Lipschitz constant $L$ , we simply replace $W$ by $WL$ . and such that there exists a degree $d$ polynomial $p$ that is a $(\frac{\epsilon}{W^{D}D},2M,B)$ approximation for $\sigma$ , with $\epsilon\in(0,1)$ , with $d,M,B\geq 1$ . Then, for all $\mathcal{N}\in\mathcal{N}[\sigma,D,W,M]$ , there exists $\bar{\mathcal{N}}\in\mathcal{N}[p,D,W,2M]$ such that

Let $\mathcal{N}\in\mathcal{N}[\sigma,D,W,M]$ and let $\bar{\mathcal{N}}\in\mathcal{N}[p,D,W,M]$ be such that it has the same structure and weights as $\mathcal{N}$ with the activation replaced with $p$ . For $\mathcal{N}$ let $h^{(i)}(\mathbf{x})$ be the inputs to layer $i+1$ and $y^{(i)}(\mathbf{x})$ be the outputs of unit $j$ of layer $i$ as defined previously. Correspondingly, for $\bar{\mathcal{N}}$ let $\bar{h}^{(i)}(\mathbf{x})$ be the inputs to layer $i+1$ and $\bar{y}^{(i)}(\mathbf{x})$ be the outputs of layer $i$ . We prove by induction on layer $i$ that for all units $j$ of layer $i$ ,

For layer $i=0$ , we have $h^{(0)}(\mathbf{x})_{j}=\bar{h}^{(0)}(\mathbf{x})_{j}=\textbf{w}_{j}^{(0)}\cdot\mathbf{x}\in[-M,M]$ which trivially satisfies (32). Now, we prove that the desired property holds for layer $l$ , assuming the following holds for layer $l-1$ . We have for all units $j$ in layer $l-1$ ,

Note that this implies that $\left|\bar{h}^{(l-1)}_{j}(\mathbf{x})\right|\leq\left|h^{(l-1)}_{j}(\mathbf{x})\right|+\frac{(l-1)\epsilon}{W^{D-l+1}D}\leq 2M$ . Here the second inequality follows from the assumption that inputs to each unit are bounded by $M$ and $\epsilon<1$ . We have for all $\mathbf{x}$ and $j$ ,

Step (34) follows since $\bar{h}^{(l-1)}_{j}(\mathbf{x})\in[-2M,2M]$ and $p$ uniformly $\frac{\epsilon}{W^{D}D}$ -approximates $\sigma$ in $[-2M,2M]$ . Step (35) follows from $\sigma$ being 1-Lipschitz. Step (36) follows from (33). Finally Step (37) follows from $\|\textbf{w}_{j}^{(l)}\|_{1}\leq W$ which is given. This completes the inductive proof.

We conclude by noting that $\mathcal{N}(\mathbf{x})=h_{1}^{(D)}(\mathbf{x})$ and $\bar{\mathcal{N}}(\mathbf{x})=\bar{h}_{1}^{(D)}(\mathbf{x})$ . Thus, from above we get,

Given the above transformation to a polynomial network and associated error bounds, we apply the main theorem of Zhang et al. combined with the clipping technique from Section 3 to obtain the following result:

The time complexity of the above algorithm is bounded by $n^{O(1)}\cdot B^{O(d)^{D-1}}\cdot\log(1/\delta)$ , where $d$ is the degree of $p$ , and $B$ is a bound on $\sum_{i=0}^{d}2^{i}\beta_{i}^{2}$ (see Defn. 4.2).

From Theorem 4.3 we have that for all $\mathcal{N}\in\mathcal{N}[\sigma,D,W,M]$ , there is a network $\bar{\mathcal{N}}\in\mathcal{N}[p,D,W,M]$ such that

Now from [36, Theorem 1], we know that there exists an algorithm that outputs a predictor $\widehat{f}$ such that with probability at least $1-\delta$ for any distribution ${\cal D}$

For loss functions that take on large values on the range of the predictor, we instead output the clipped version of the predictor $\mathsf{clip}(\widehat{f})$ in order to satisfy the requirements of the Rademacher bounds (as in Section 3).

Combining the above inequalities, we have

We can now state the learnability result for ReLU networks as follows.

The time complexity of the above algorithm is bounded by $n^{O(1)}\cdot 2^{((L+1)\cdot M\cdot W^{D}\cdot D\cdot\epsilon^{-1})^{D}}\cdot\log(1/\delta)$ .

Using Theorem 4.4, we state the following learnability result for depth-2 sigmoid networks.

3 Application: Learning Parametric Rectified Linear Unit

A Parametric Rectified Linear Unit (PReLU) is a generalization of ReLU introduced by He et al. . Compared to the ReLU, it has an additional parameter that is learned. Formally, it is defined as

The parametric rectifier (denoted by $\sigma_{\text{PReLU}}$ ) is an activation function defined as

The proof of the corollary follows from setting $L=1$ , $D=1$ , $M=1$ and $W=O(1)$ in Theorem 4.5.

The condition that $|a|$ be bounded by 1 is reasonable as in practice the value of $a$ is very rarely above 1 as observed by He et al. . Also note that Leaky-ReLUs are PReLUs with fixed $a$ (usually 0.01). Hence, we can agnostically learn them under the same conditions using an identical argument as above. Note that a network of PReLU can also be similarly learned as a ReLU by replacing each ReLU in the network by a linear combination of two ReLUs as described before.

4 Application: Learning the Piecewise Linear Transfer Function

Several functions have been used to relax the $0/1$ loss in the context of learning linear classifiers. The best example is the sigmoid function discussed earlier. Here we consider the piecewise linear transfer function. Formally, it is defined as

The $C$ -Lipschitz piecewise linear transfer function (denoted by $\sigma_{\text{pw}}$ ) is an activation function defined as

The proof of the corollary follows from setting $L=1$ , $D=1$ , $M=1$ and $W=O(L)$ in Theorem 4.5.

5 Application: Convex Piecewise-Linear Fitting

In this section we can use our learnability results for networks of ReLUs to give polynomial-time approximation schemes for convex piecewise-linear regression . These problems have been studied in optimization and notably in machine learning in the context of Multivariate Adaptive Regression Splines (MARS ). Note that these are not the same as univariate piecewise or segmented regression problems, for which polynomial-time algorithms are known. Although our algorithms run in time exponential in $k$ (the number of affine functions), we note that no provably efficient algorithms were known prior to our work even for the case $k=2$ .Boyd and Magnani specifically focus on the case of small $k$ , writing “Our interest, however, is in the case when the number of terms $k$ is relatively small, say no more than 10, or a few 10s.”

The key idea will be to reduce piecewise regression problems to an optimization problem on networks of ReLUs using simple ReLU “gadgets.” We formally describe the problems and describe the gadgets in detail.

We start with a simple class of convex piecewise linear functions represented as a sum of a fixed number of functions where each of these functions is a maximum of 2 affine functions. This is formally defined as follows.

Each input to each unit is bounded in magnitude by 2.

The first property holds as $\|\mathbf{w}_{j}^{(0)}\|\leq\mathsf{max}(\|\mathbf{w}_{2j-1}-\mathbf{w}_{2j}\|,\|\mathbf{w}_{2j-1}\|,\|\mathbf{w}_{2j}\|)\leq\|\mathbf{w}_{2j-1}\|+\|\mathbf{w}_{2j}\|\leq 2$ using the triangle inequality. The second holds because each of the $k$ max sub-networks contributes 3 to $\|\mathbf{w}_{1}^{(1)}\|_{1}$ . The third is implied by the fact that each input to each unit is bounded by $|\mathsf{max}(\mathbf{w}_{1}\cdot\mathbf{x},-\mathbf{w}_{1}\cdot\mathbf{x},(\mathbf{w}_{1}-\mathbf{w}_{2})\cdot\mathbf{x})|\leq 2$ .

Let ${\cal C}$ be as in Definition 4.13, then there is an algorithm ${\cal A}$ for solving sum of $k$ max 2-affine fitting problem in time $n^{O(1)}2^{O\left((k^{2}/\epsilon)\right)}\log(1/\delta)$ .

5.2 Max k𝑘k-Affine

In this section, we move to a more general convex piecewise linear function represented as the maximum of $k$ affine functions. This is formally defined as follows.

Note that this form is universal since any convex piecewise-linear function can be expressed as a max-affine function, for some value of $k$ . However, we focus on bounded $k$ and give learnability bounds in terms of $k$ .

Observe that max $k$ -affine can be expressed in a complete binary tree structure of height $\lceil\log k\rceil$ with a $\mathsf{max}$ operation at each unit and $\mathbf{w}_{i}\cdot\mathbf{x}$ for $i\in[k]$ at the $k$ leaf units (for example, see Figure 2 Note that if $k$ is not a power of 2, then we can trivially add leaves with value $\mathbf{w}_{1}\cdot\mathbf{x}$ and make it a complete tree.

Thus, the class of convex piecewise linear functions can be expressed as a network of ReLUs with $\lceil\log k\rceil$ hidden layers by replacing each $\mathsf{max}$ unit in the tree by 3 ReLUs and adding an output unit. See Figure 3 for the construction for $k=4$ .

for $j\in[3\cdot 2^{\lceil\log k\rceil-1}]$ . Further, the weight vectors input to hidden layer $i\in\{2,\ldots,\lceil\log k\rceil\}$ of the network are

for $j\in[3\cdot 2^{\lceil\log k\rceil-i+1}]$ . Note, $\mathbf{e}_{i}$ refers to the vector with 1 at position $i$ and 0 everywhere else. Finally the weight vector for the output unit is $w_{1}^{(\lceil\log k\rceil)}=\mathbf{e}_{1}+\mathbf{e}_{2}-\mathbf{e}_{3}$ . The following properties of $\mathcal{N}_{\mathsf{max}}$ are easy to deduce.

$\|\mathbf{w}_{j}^{(i)}\|_{1}\leq 6$ for $i\in[\lceil\log k\rceil]$

Here, the first and third conditions are the same conditions as in the previous section. The second holds by the values of the weights defined above. Using the above construction, we obtain the following result.

Let ${\cal C}$ be as in Definition 4.15, then there is an algorithm ${\cal A}$ for solving the max $k$ -affine fitting problem in time $n^{O(1)}\cdot 2^{O(k/\epsilon)^{\lceil\log k\rceil}}\cdot\log(1/\delta)$ .

Hardness of Learning ReLU

To begin, we recall the following problem from computational learning theory widely thought to be computationally intractable.

(Learning Sparse Parity with Noise) Let $\chi_{S}:\{0,1\}^{n}\rightarrow\{0,1\}^{n}$ be an unknown parity function on a subset $S$ , $|S|\leq k$ , of $n$ inputs bits (i.e., any input, restricted to $S$ , with an odd number of ones is mapped to $1$ and otherwise). Let ${\cal C}_{k}$ be the concept class of all parity functions on subsets $S$ of size at most k. Let ${\cal D}$ be a distribution on $\{0,1\}^{n}\times\{-1,1\}$ and define

The Sparse Learning Parity with Noise problem is as follows: Given i.i.d. examples drawn from ${\cal D}$ , find $h$ such that $\Pr_{(\mathbf{x},y)\sim{\cal D}}[h(\mathbf{x})\neq y]\leq\mathsf{opt}+\epsilon$ .

For every algorithm ${\cal A}$ that solves the Sparse Learning Parity with Noise problem, there exists $\epsilon=O(1)$ and $k\in\omega(1)$ such that ${\cal A}$ requires time $n^{\Omega(k)}$ .

Any algorithm breaking the above assumption would be a major result in theoretical computer science. The best known algorithms due to Blum, Kalai, Wasserman and Valiant run in time $2^{O(n/\log n)}$ and $n^{0.8k}$ , respectively. Under this assumption, we can rule out polynomial-time algorithms for reliably learning ReLUs on distributions supported on $\{0,1\}^{n}$ .

Let ${\cal C}$ be the class of ReLUs over the domain ${\cal X}=\{0,1\}^{n}$ with the added restriction that $\|\mathbf{w}\|_{1}\leq 2k$ . Any algorithm ${\cal A}$ for reliably learning ${\cal C}$ in time $g(\epsilon)\cdot\mathsf{poly}(n)$ for any function $g$ will give a polynomial time algorithm for learning sparse parities with noise of size $k$ for $\epsilon=O(1)$ .

We will show how to use a reliable ReLU learner to agnostically learn conjunctions on $\{0,1\}^{n}$ and use an observation due to Feldman and Kothari who showed that agnostically learning conjunctions is harder than the Sparse Learning Parity with Noise problem. Let ${\cal CO}_{k}$ be the concept class of all Boolean conjunctions of length at most $k$ .

Notice that for the domain ${\cal X}=\{0,1\}^{n}$ , the conjunction of literals $x_{1},\ldots,x_{k}$ can be computed exactly as $\mathsf{max}(0,~{}x_{1}+\cdots+x_{k}-(k-1))$ . Fix an arbitrary distribution ${\cal D}$ on $\{0,1\}^{n}\times\{0,1\}$ and define

Kalai et al. (Theorem 5) observed that in order output a hypothesis $h$ with error $\mathsf{opt}+\epsilon$ it suffices to minimize (to within $\epsilon$ ) the following quantity:

The above proof also shows hardness of learning ReLUs agnostically. Note the above hardness result holds if we require the learning algorithm to succeed on all domains where $|(w\cdot x)|$ can grow without bound with respect to $n$ :

Finally, we point out Kalai et al. proved that reliably learning conjunctions is also as hard as PAC Learning DNF formulas. Thus, by our above reduction, any efficient algorithm for reliably learning ReLUs would give an efficient algorithm for PAC learning DNF formulas (again this would be considered a breakthrough result in computational learning theory).

Conclusions and Open Problems

We have given the first set of efficient algorithms for ReLUs in a natural learning model. ReLUs are both effective in practice and, unlike linear threshold functions (halfspaces), admit non-trivial learning algorithms for all distributions with respect to adversarial noise. We “sidestepped” the hardness results in Boolean function learning by focusing on problems that are not entirely scale-invariant with respect to the choice of domain (e.g., reliably learning ReLUs). The obvious open question is to improve the dependence of our main result on $1/\epsilon$ . We have no reason to believe that $2^{O(1/\epsilon)}$ is the best possible.

Acknowledgements. The authors are grateful to Sanjeev Arora and Roi Livni for helpful feedback and useful discussions on this work.