How do infinite width bounded norm networks look in function space?

Pedro Savarese, Itay Evron, Daniel Soudry, Nathan Srebro

Introduction

Empirical and theoretical results suggest that neural network models used in practice are not constrained by their size (number of units, or number of weights) but perhaps achieve complexity control by controlling the magnitude of the weights (e.g. Bartlett, 1997; Neyshabur et al., 2014; Zhang et al., 2016), either explicitly or implicitly (Soudry et al., 2018; Gunasekar et al., 2018).

In fact, it is reasonable to think of networks as essentially having infinite size (having an infinite number of units), and controlled only through some norm of the weights. Using infinite (or unbounded) size networks we can approximate virtually any function, and so in training such a model we are essentially searching over the space of all functions.

Learning can be thought of as searching over the entire function space for a function with small representation cost, given by the minimal norm required to represent it in the chosen infinite size architecture. Understanding learning with infinite (or unbounded) size networks thus relies crucially on understanding this “representation cost”, which is the actual inductive bias of learning.

Equivalently, we can think of this question as asking what functions can be represented, or approximated arbitrarily well, with an infinite-size bounded-norm network. There has been considerable work for over three decades on the question of what functions can be approximated by neural networks, establishing that anyAny continuous or appropriately smooth function, depending on the precise model and topology. function can be approximated by a large enough network, and studying the size (number of units) required to achieve good approximation (e.g. Hornik et al., 1989; Cybenko, 1989; Barron, 1993; Pinkus, 1999). However, we are not aware of prior work establishing what norm is required for approximation, which, based on our current understanding of deep learning, is arguably the more important question (See Appendix B for a detailed comparison to Barron (1993)).

There is a significant difference in studying approximation in terms of size (number of units) vs norm: most smooth functions require unbounded size in order to be approximated arbitrarily well (after all, functions that can be represented with a bounded number of weights form a finite dimensional class and thus occupy only zero measure in the infinite dimensional function space). Therefore, in classical approximation results the size goes to infinity as the approximation error goes to zero, and results focus on how quickly the size goes to infinity, or on approximation up to finite error. In contrast, as we shall see here, broad classes of smooth functions can be approximated arbitrarily well, and even perfectly represented, with a bounded norm—the norm need not increase for the approximation to improve. Can we characterize this class of functions?

In this paper we consider infinite-width ReLU networks where the overall Euclidean norm (sum of squares of all weights in the system) is controlled. More specifically, we consider networks with a single hidden layer, consisting of an unbounded number of rectified linear units (ReLUs)—see Section 2 for a precise definition. Such two-layer infinite networks were studied in the context of generalization (Neyshabur et al., 2015) and optimization (Bach, 2017; Chizat and Bach, 2018) and are were shown by Neyshabur et al. (2014) to be equivalent to “convex neural nets” as studied by Bengio et al. (2006).

We further show that fitting data while minimizing this complexity yields to linear spline interpolation. Interestingly, such linear splines are exactly the predictors recently studied by Belkin et al. (2018) as a model for understanding interpolation learning, though they fell short of connecting such models to neural networks. Our work thus closes the loop and establishes a concrete connection between their analysis and neural network learning.

We see then that even for univariate functions, Euclidean norm regularization on the weights already gives rise to very rich and natural induced bias in function space, that is not at all obvious and perhaps even surprising. We of course view this as a starting point for studying multivariate functions, and in Section 5 discuss how our techniques can be extended, and conjecture an answer related to the integral of the nuclear norm of the Hessian.

Our derivation of the induced complexity can be seen as an application of Green’s function of the second derivative, and in Section 4 we elaborate on this view, and describe how fitting a two-layer network can be viewed as a method for solving a variational problem using Green’s functions.

Infinite Width ReLU Networks

We consider 2-layer networks, with a single hidden layer consisting of an unbounded number of rectified linear units (ReLUs), defined by:

We associate with each network the squared Euclidean norm of the non-bias weightsRemoving the output unit bias term $b^{(2)}$ will not change any of the definitions or results, since it can be simulated at no cost by a unit with infinitely large $b^{(1)}_{i}$ , infinitesimally small $w^{(2)}_{i}$ , and ${\mathbf{w}}^{(1)}_{i}=0$ . The unregularized bias terms $b^{(1)}_{i}$ in the hidden layer are important to our analysis, and removing them or regularizing them would substantially change the definitions and results.:

The above definitions of $R(f)$ require $f$ to be exactly implementable by some finite-size ReLU network, and so $R(f)$ is finite only for piece-wise linear functions with a finite number of pieces. However, any continuous function can be approximated by increasing the number of units arbitrarily. Since we are not concerned with the number of units, only the norm, our central object of study is thus the norm required to capture a function as the number of units increases. This is captured by:

To understand $\overline{R}(f)$ observe that by (5), the sub-level set ${\mathcal{F}}_{B}=\left\{f\middle|R(f)\leq B\right\}$ is a scaling of the symmetric convex hull of ReLUs, plus arbitrary constant functions:

Learning an unbounded width ReLU network $h_{\theta}$ by fitting some loss functional $L(\cdot)$ while controlling the norm $C(\theta)$ by minimizing

is thus equivalent to learning a function $f$ while controlling $\overline{R}(f)$ :

One-dimensional Functions

Our main result is an exact and complete characterization of $\overline{R}(f)$ for univariate functions:

where we are treating the measures over $b$ as distributions. Taking the derivative twice w.r.t. $x$ :

where $H(z)$ is the Heaviside step function ( $H(z)=1$ when $z>0$ and zero otherwise), whose distributional derivative is the dirac distribution $\delta_{a}(z)$ , which assigns point-mass to $z=a$ , and we defined $\alpha_{+}(b)=\alpha(1,b)+\alpha(-1,b)$ .

We see that the measure $\alpha$ that represents a function $f$ is almost unique: the component $\alpha_{+}$ corresponding to the total (signed) mass is unique and determined precisely the second derivative of $f$ . The only flexibility is in shifting mass between the forward and backward sloping ReLUs with threshold $b$ , i.e., between $\alpha(1,b)$ and $\alpha(-1,b)$ . We denote this component by $\alpha_{-}(b)=\alpha(1,b)-\alpha(-1,b)$ , which together with $\alpha_{+}$ defines $\alpha$ as $\alpha(w,b)=\frac{1}{2}(\alpha_{+}(b)+w\cdot\alpha_{-}(b))$ . As the following calculation shows the component $\alpha_{-}$ only contributes an affine component to $f=h_{\alpha,c}$ :

Any measure $\alpha_{-}$ that integrates as (18), in conjunction with $\alpha_{+}=f^{\prime\prime}$ and an appropriate $c$ , will yield $f=h_{\alpha,c}$ with

To minimize $\left\lVert{\alpha}\right\rVert_{1}$ we must therefore solve the following convex program:

Consider the possible values $\lambda$ might take:

where $l_{n}=\frac{y_{n+1}-y_{n}}{x_{n+1}-x_{n}}$ for $n\in\{1,\cdots,N-1\}$ , and $l_{0},l_{N}$ are chosen so as to minimize

Theorem 3.1 guarantees that the linear spline interpolation is a global minimum of (25), but it will in general not be unique, as demonstrated in Figure 1, and networks learned in practice might implement much “smoother” functions. The same type of solutions will also be obtained when minimizing any loss over a finite sample:

Then (28) will always have a global minimum which is a piece-wise linear function with at most $N+1$ pieces, with breakpoints at the data points $x_{1},\ldots,x_{N}$ .

A minimizer $h_{\theta^{*}}$ of (28) is also a minimizer of (25) with the alternative labels $y_{n}=h_{\theta^{*}}(x_{n})$ . ∎

An Interpertation in Terms of Green’s Functions

In terms of the representation (12), we get the second derivative directly from the measure $\alpha$ , as $u(x)=\alpha_{+}(x)=\alpha(1,x)+\alpha(-1,x)$ . For the more standard integral representation (9), we have $u(x)=\alpha(1,x)+\alpha(-1,-x)$ . Fitting an actual network with discrete units and weights on each unit, as in (1), corresponds to an approximation of the form:

where to get a smoother approximation we might want to replace the delta function with a smoother bump.

Ignoring for the moment the linear term, we saw in Corollary 24 that fitting a two-layer ReLU network while controlling the norm of the weights corresponds essentially to the variational problem:

An interpretation of how this is done using the ReLU network is as follows: we first introduce an auxiliary function variable $u$ , and rewrite (30) as:

We then use the fact that the ReLU is a Green’s function for the second derivative to write

Approximating Higher Dimensional Functions

where we recall $H$ is the Heaviside step function, and

What remains is to minimize $\left\lVert{\alpha}\right\rVert_{1}$ under this constraint. As an indicative result, we can calculate $\left\lVert{\alpha}\right\rVert_{1}$ for the special case where it is non-negative, and so $f=f_{\alpha,c}$ is convex and the Hessian is p.s.d.:

The right equality is a direct consequence of the divergence theorem. It remains to prove the left equality. From (36) we have

where in $\left(1\right)$ we assume, without loss of generality, that ${\mathbf{w}}^{\top}=\left(1,0,\dots,0\right)$ inside the square brackets and define the following $d-1$ dimensional ball

in $\left(2\right)$ we define $\mathcal{I}$ as an indicator function and calculate the surface area of $\mathcal{A}_{r}\left(z\right)$ , in $\left(3\right)$ we switch the order of the limit and integration using the bounded convergence theorem, in $\left(4\right)$ we denote $x_{1}=b$ , and in $\left(5\right)$ we use our assumption that $\alpha\left({\mathbf{w}},b\right)\geq 0.$ ∎

We do not expect that in general $\alpha$ would be non-negative, nor that the Hessian would be p.s.d. Recall that in the one dimensional case, $\overline{R}(f)$ is related to the integral of the absolute value of $f^{\prime\prime}$ . Therefore, in order to generalize Claim 5.1 to mixed sign eigenvalues, a naive conjecture would be that $\overline{R}(f)$ is given by the integral of the nuclear norm of the Hessian, up to a linear function accounting for the boundary conditions as in the one dimensional case. That is, the sum of the absolute values of the eigenvalues of the Hessian, as opposed to the Laplacian which is the sum of signed eigenvalues.

Unfortunately, such a conjecture cannot be accurate. The issue is that, for some functions with vanishing Hessian, any norm of the Hessian, normalized as in Claim 5.1, would be equal to zero. For example,

Discussion

As we are realizing that (explicit or implicit) regularization plays a key role in deep learning, it is crucial to understand how simple norm control on the weights in parameter space induces rich complexity control in function space. In a sense, for infinite size networks, the only role of the architecture is to give rise to and shape this rich complexity control. Recent work studied this question in linear neural networks, and showed how in regularizing the weights in a convolutional network yields rich complexity control inducing sparsity in the frequency domain (Gunasekar et al., 2018). Here we go beyond linear networks and study infinite ReLU networks.

Much in the same as linear convolutional neural networks can represent any linear function, and the architecture’s only role is to induce complexity control in that space, infinite width ReLU networks can represent any continuous function, and the role of the architecture, in our view, is to induce complexity control over function space. We see that indeed, even for univariate functions, the architecture already induces a natural complexity control that is not obvious nor explicit. Furthermore, we are particularly excited that this complexity control exactly matches the learning rules studied in recent work on interpolation learning (Belkin et al., 2018), and provides a concrete connection of that work to deep learning. We are eager to find out whether this connection carries also to the multivariate case.

We also hope that our study will reinvigorate approximation theory work on neural networks, studying approximation by networks of bounded norm rather than a bounded number of units.

We are in debt to Charlie Smart (University of Chicago) for pointing out the connection to Green’s functions, and would also like to thank Jason Lee (UCS), Holden Lee (Princeton) and Arturs Backurs (TTIC) for helpful discussions. PS and NS were partially supported by NSF awards 1546500 and 1764032.

References

Appendix

Recall the definition of 2-layer ReLU networks $h_{\theta}$ for $\theta\in\Theta_{2}$ :

(4) and (5) are equivalent. More specifically:

where a rescaling given by $c_{i}=\sqrt{\left\lvert{w^{(2)}_{i}}\right\rvert/\left\lVert{{\mathbf{w}}^{(1)}_{i}}\right\rVert_{2}}$ minimizes the left-hand side and achieves equality. Since the right-hand side is invariant to rescaling, we can arbitrarily set $\left\lVert{{\mathbf{w}}^{(1)}_{i}}\right\rVert_{2}=1$ for all $i$ , yielding $\sum_{i=1}^{k}\left\lvert{w^{(2)}_{i}}\right\rvert=\left\lVert{{\mathbf{w}}^{(2)}}\right\rVert_{1}$ . ∎

Appendix B Relationship to Barron’s Analysis

Appendix C The effect of neural networks depth on the inductive bias

We now turn to a different infinite width architecture, where we study the effect of depth. The networks we consider have a parallel structure common in many state of the art system (Cheng et al., 2016; Shazeer et al., 2017). Consider a deep neural network with $L$ layers and a parallel architecture, so the output is a sum of $k$ sub-networks with $L-1$ layers. Such a network is parameterized by $\theta=\left(k,\mathcal{W}_{1},\dots,\mathcal{W}_{k},{\mathbf{w}}^{\left(L\right)}\right)$ , where $\mathcal{W}_{i}=\left(W^{\left(1\right)}_{i},\dots,W^{\left(L-1\right)}_{i}\right)$ is the set of weight matrices of the $i$ th network, and ${\mathbf{w}}^{\left(L\right)}$ is the weight vector of the last layer, linearly combining the (scalar) outputs of all $k$ parallel sub-networks. The parameter class $\Theta_{L}$ is therefore defined as

and for $i\in\left[{k}\right],l\in\left[{L-1}\right]$ :

Unlike the networks in Section 2, the networks in this section do not have biases.

The squared Euclidean norm of the weights (averaged over all layers) is now defined as

yielding the following definition of the infimum norm required to implement a function with this parallel architecture:

In other words, for each $\bar{\mathcal{W}}=\left(W^{\left(1\right)},\ldots,W^{\left(L-1\right)}\right)\in\mathcal{S}$ , the weight matrices are normalized, i.e., $\forall{l}:\left\lVert{W^{\left(l\right)}}\right\rVert_{F}=1$ . It will be convenient to let $\bm{\alpha}$ denote the weights of the last layer (originally ${\mathbf{w}}^{\left(L\right)}$ ) in this setting.

where $\mathcal{S}$ is defined in (45), and for each solution of one problem (either (46) or (44)) we can find an equivalent solution of the other (with the same $h_{\theta}^{L}$ ).

The proof of this Theorem has the immediate implication that

During the proof of Theorem C.1 we used the inequality of arithmetic and geometric means in (53) to show that $\forall{i\in\left[{k}\right]}:$

After finishing that proof, we know that it must hold with equality. But, equality holds if and only if all the numbers in the means are equal. ∎

When learning on finite sample sets, the optimal solutions can be shown to be supported on bounded vector sets. More specifically,

This helps us shed light on the behavior of loss minimization when regularizing all parameters, and not only the ones on the last layer.

to (44), we show how to construct a solution for (46). Start by setting the following solution, where $\forall{i}\in\left[{k^{*}}\right]$ :

We use the $1$ -positive-homogeneity of the activation functions to show $\forall{i\in\left[{k^{*}}\right]}$ it holds that

We now show that the value of the new solution, i.e., its norm, equals

where we used the inequality of arithmetic and geometric means.

As a conclusion, we get the required inequality

When (46) is attainable, and given an optimal solution $\theta=(k^{*},\bm{\alpha}^{*},\left\{\bar{\mathcal{W}}_{i}^{*}\right\}_{i=1}^{k^{*}})$ , we show how to attain a solution to the implementation cost formulation (44).

We construct a solution $\theta$ where $\forall{i\in\left[{k^{*}}\right]}$

By using the $1$ -positive-homogeneity property of the activation functions (once for every matrix in $\bar{\mathcal{W}}_{i}$ ), we get that for all ${i\in\left[{k^{*}}\right]}$ :

We therefore narrow our proof to deeper networks where $L\geq 3\Rightarrow\frac{2}{L}<1$ and the norm in the objective function is no longer convex (but quasi-convex). Following a proof by Petrov (2019), we are able to show that when $L\geq 3$ , not only there exists an optimal solution with a finite support of size at most $N$ , but all optimal solution are such.

where $s_{i}\triangleq\operatorname{sign}{\bm{\alpha}^{*}_{i}}$ . The function ${z}^{\frac{2}{L}}$ is concave $\forall{z\geq 0}$ , and so we apply the Jensen inequality to get

Notice we used the strict Jensen inequality, since neither the function $z^{\frac{2}{L}}$ is linear, nor are the two terms equal (following our choice of $\rho$ in (63)).

Finally, we notice the above imply that one of the two solutions $\bm{\alpha}^{*}\pm\rho\bm{\beta}$ must have a strictly smaller norm than $\bm{\alpha}^{*}$ . As a result of our choice of $\bm{\beta}$ , all three solutions have the same loss, i.e.,

The above necessarily mean that at least one of these two solutions we constructed has an objective value strictly smaller than the objective value of $\bm{\alpha}^{*}$ , in contradiction to its optimality. ∎

Appendix D Proof of Claim 5.2

From here it is straightforward to see that

Therefore, for some constants $r_{0},C$ , and any norm,

Which vanishes as $r\rightarrow\infty$ . ∎