New error bounds for deep networks using sparse grids

Hadrien Montanelli, Qiang Du

Introduction

One of the most important theoretical problems is to determine why and when deep (but not shallow) networks can lessen or break the “curse of dimensionality,” expression first coined by Bellman in . A possible way of addressing this problem is to focus on a particular set of functions which have a very special structure (such as compositional or polynomial), and to show that for this particular set deep networks perform extremely well . We follow a different route. We consider a space of functions that is more generic for multivariate approximation in high dimensions, and prove new error estimates for which the curse of dimensionality is lessened by establishing a connection with sparse grids .

for some norm \|\cdot\|. For deep networks, one also wants to find the asymptotic behavior of the depth as a function of the accuracy ϵ\epsilon. Results of the form (2) are standard approximation results that suffer from the curse of dimensionality. For a small dimensions dd, the size NN of the network increases at a reasonable rate as ϵ\epsilon goes to zero. However, NN grows geometrically with dd.

These are Banach spaces too corresponding to the completion of Cm(Ω)C^{m}(\Omega) (functions with continuous partial derivatives up to order mm) with respect to the norm

with standard Lp(Ω)L^{p}(\Omega)-norm p\|\cdot\|_{p}.

As we mentioned previously, some results without curse of dimensionality for deep (but not shallow) networks have been recently derived; some of them are listed in Table 2. To derive such results, one has to consider functions with a very special structure. For example, in , the authors deal with compositional functions. This class includes in particular functions that are a composition of two-dimensional functions, e.g., in dimension d=8d=8,

for some bivariate functions f11f_{11}, f12f_{12}, f13f_{13}, f14f_{14}, f21f_{21}, f22f_{22} and f3f_{3}. Such functions can be represented by a binary tree with dd inputs variables, log2d\log_{2}d levels and (d1)(d-1) nodes, and each of the nodes can be approximated by a subnetowrk of size N/(d1)N/(d-1); see Figure 1.

The estimate is achieved via sparse grid approximations to functions in X2,p(Ω)X^{2,p}(\Omega). The curse of dimensionality is not totally overcome but is significantly lessened since the exponent dd only affects logarithmic factors log2ϵ|\log_{2}\epsilon|.Let us emphasize that the constant in (7), however, still depends exponentially on the dimension dd. As we will see in Section 2.2, Korobov spaces X2,p(Ω)X^{2,p}(\Omega) are subsets of Sobolev spaces W2,p(Ω)W^{2,p}(\Omega).

The reminder of the paper is structured as follows. We review Korobov spaces and sparse grids in Section 2, and prove our theorem in Section 3.

Korobov spaces and sparse grids

We review in this section Korobov spaces and sparse grids, which go back to Korobov and Smolyak , and were rediscovered by Zenger in for solving partial differential equations. Since 1990, Korobov spaces/sparse grids and related hyperbolic cross approximation have been used extensively in the context of high-dimensional function approximation, and the group of Michael Griebel has been particularly influential. For details, we recommend the exhaustive review .

The first ingredient for cooking sparse grids is a hierarchical basis of functions. To approximate functions of one variable xx on Ω=\Omega=, one considers a family of grids Ωl\Omega_{l} of level ll characterized by a grid size hl=2lh_{l}=2^{-l} and 2l12^{l}-1 points xl,i=ihlx_{l,i}=ih_{l}, 1i2l11\leq i\leq 2^{l}-1. For each Ωl\Omega_{l}, ones considers piecewise linear hat functions ϕl,i\phi_{l,i} centered at xl,ix_{l,i} defined by

where ϕ\phi is the mother of all hat functions,

Note that ϕl,i1\|\phi_{l,i}\|_{\infty}\leq 1 for all ll and ii. One then considers the function spaces VlV_{l} spanned by such functions,

and the hierarchical increment spaces WlW_{l} given by

The basis that corresponds to the WlW_{l}’s for 1ln1\leq l\leq n is called the hierarchical basis, while the basis of VnV_{n} is called the nodal basis. We show both bases for n=3n=3 in Figure 2.

Let us conclude this subsection by mentioning that one-dimensional hierarchical bases are not limited to piecewise linear functions (8). These can be generalized to piecewise higher-order polynomials [2, Th. 4.8], and also to other multiscale bases such as wavelets (see, e.g., ).

2 Multi-dimensional hierarchical basis and Korobov spaces

The second ingredient is to employ a tensor product construction to approximate functions of dd variables x=(x1,,xd)\boldsymbol{x}=(x_{1},\ldots,x_{d}) in Ω=d\Omega=^{d}. One considers a family of grids Ωl\Omega_{\boldsymbol{l}} of level l=(l1,,ld)\boldsymbol{l}=(l_{1},\ldots,l_{d}) with points xl,i=ihl\boldsymbol{x}_{\boldsymbol{l},\boldsymbol{i}}=\boldsymbol{i}\cdot\boldsymbol{h_{l}}, 1i2l1\boldsymbol{1}\leq\boldsymbol{i}\leq\boldsymbol{2^{l}-1}, obtained by the tensor product of dd one-dimensional grids with levels l1,,ldl_{1},\ldots,l_{d}.Multiplications and inequalities have to be understood componentwise. We use the notation 1=(1,,1)\boldsymbol{1}=(1,\ldots,1). For each Ωl\Omega_{\boldsymbol{l}}, one considers hat functions ϕl,i\phi_{\boldsymbol{l},\boldsymbol{i}} centered at points xl,i\boldsymbol{x}_{\boldsymbol{l},\boldsymbol{i}} defined by the product of the one-dimensional basis functions,

As in one dimension, one considers the function spaces spanned by these functions

The multi-dimensional hierarchical basis is the basis that corresponds to the WlW_{\boldsymbol{l}}’s for 1ln\boldsymbol{1}\leq\boldsymbol{l}\leq\boldsymbol{n}. We show all subspaces WlW_{\boldsymbol{l}} in two dimensions for n=(3,3)\boldsymbol{n}=(3,3) in Figure 3.

Equipped with a multi-dimensional hierarchical basis, one may approximate functions of dd variables. The appropriate function spaces in this context are the Korobov spaces X2,p(Ω)X^{2,p}(\Omega) defined for 2p+2\leq p\leq+\infty byFor simplicity, we only consider functions that are zero at the boundary. Sparse grids for functions that are non-zero at the boundary can be derived in an analogous fashion, and have similar approximation properties.

with k=max1jdkj|\boldsymbol{k}|_{\infty}=\max_{1\leq j\leq d}k_{j} and norm

These spaces go back to the 1959 paper of Korobov . Note the difference with the Sobolev spaces W2,p(Ω)W^{2,p}(\Omega) defined in (3): smoothness for X2,p(Ω)X^{2,p}(\Omega) is measured in terms of mixed derivatives of order two. For example in two dimensions, from k=max(k1,k2)2|\boldsymbol{k}|_{\infty}=\max(k_{1},k_{2})\leq 2, one can see that the Korobov spaces X2,p(Ω)X^{2,p}(\Omega) require

whereas k1=(k1+k2)2|\boldsymbol{k}|_{1}=(k_{1}+k_{2})\leq 2 for W2,p(Ω)W^{2,p}(\Omega) yields

In other words, Korobov spaces X2,p(Ω)X^{2,p}(\Omega) are subsets of Sobolev spaces W2,p(Ω)W^{2,p}(\Omega).

The key fact is that any function fX2,p(Ω)f\in X^{2,p}(\Omega) has a unique (infinite) expansion in the hierarchical basis,

3 Discretization

The third and last ingredient is a clever truncation of the expansion (21). Sparse grids are discretizations of X2,p(Ω)X^{2,p}(\Omega) defined by

and correspond to a number of grid points NN given by [2, Lemma 3.6]

A sparse grid in two dimensions is shown in Figure 3. Note that full grids Vn()V_{n}^{(\infty)}, with

correspond to a much larger O(hnd)=O(2nd)\mathcal{O}(h_{n}^{-d})=\mathcal{O}(2^{nd}) number of grid points.

and for any fX2,p(Ω)f\in X^{2,p}(\Omega), the approximation error satisfies [2, Lemma 3.13],

Th approximation error in (28) is slightly worse than the O(N2d)\mathcal{O}(N^{-\frac{2}{d}}) error for approximating functions in X2,p(Ω)X^{2,p}(\Omega) with full grids [2, Lemma 3.5], but using a much smaller number of points.

Error bounds using sparse grids

Results listed in Table 1 are typically proven using the following technique. One shows that certain functions ff can be approximated by polynomials fMf_{M} of degree MM to any prescribed accuracy ϵ\epsilon, and so can polynomials fMf_{M} by neural networks fNf_{N} of size NN, with NN bounded by some function of ϵ\epsilon, as in (2). This amounts to decomposing the approximation error as

for some norm \|\cdot\|. We use the same idea but instead of polynomials, we use approximations by sparse grids.

We first explain in Section 3.1 how to approximate the hat functions ϕl,i\phi_{\boldsymbol{l},\boldsymbol{i}} using ideas introduced independently by Liang and Srikant , and Yartosky . We then prove in Section 3.2 our theorem concerning approximation of functions in X2,p(Ω)X^{2,p}(\Omega) by deep networks.

The following proposition of Yarotsky shows how deep networks can implement multiplication.

Proposition 1 [21, Prop. 3]. For any 0<ϵ<10<\epsilon<1, there is a deep ReLU network with inputs x1x_{1} and x2x_{2}, with x1M|x_{1}|\leq M and x2M|x_{2}|\leq M, that implements the multiplication x1x2x_{1}x_{2} with accuracy ϵ\epsilon, outputs if x1=0x_{1}=0 or x2=0x_{2}=0, and has depth and size O(log2ϵ+log2M)\mathcal{O}(|\log_{2}\epsilon|+\log_{2}M).

From Proposition 1, we obtain the following result, which shows how deep networks can approximate the multi-dimensional hat functions (13) with a binary tree structure. The corresponding network is shown in Figure 4, and the proof can be found in Appendix A.

Proposition 2. For any dimension dd and 0<ϵ<10<\epsilon<1, there is a deep ReLU network with dd inputs x1,,xdx_{1},\ldots,x_{d} that implements the multiplication ϕl,i(x)=j=1dϕlj,ij(xj)\phi_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})=\prod_{j=1}^{d}\phi_{l_{j},i_{j}}(x_{j}) with accuracy ϵ\epsilon, outputs if one of the ϕlj,ij(xj)\phi_{l_{j},i_{j}}(x_{j}) is , and has depth and size O(log2ϵlog2d)\mathcal{O}(|\log_{2}\epsilon|\log_{2}d).

2 Approximating sparse grids by deep networks

We use the fact that functions in X2,p(d)X^{2,p}(^{d}) can be approximated by sparse grids, and then show that sparse grids can be represented by deep networks using the multiplication presented in the previous subsection. The resulting network is shown in Figure 5.

Theorem 1. For any dimension dd and 0<ϵ<10<\epsilon<1, there is a deep ReLU network with dd inputs x1,,xdx_{1},\ldots,x_{d} capable of expressing any function ff in X2,p(d)X^{2,p}(^{d}) that satisfies f2,1|f|_{\boldsymbol{2},\infty}\leq 1 with accuracy ϵ\epsilon, and has depth O(log2ϵlog2d)\mathcal{O}(|\log_{2}\epsilon|\log_{2}d) and size O(ϵ12log2ϵ32(d1)+1log2d)\mathcal{O}(\epsilon^{-\frac{1}{2}}|\log_{2}\epsilon|^{\frac{3}{2}(d-1)+1}\log_{2}d).

Proof. Let us consider fX2,p(d)f\in X^{2,p}(^{d}) and suppose we want to approximate ff with a deep ReLU network fNf_{N} of size NN. Let us write

where fm(1)Vm(1)f^{(1)}_{m}\in V_{m}^{(1)} is the sparse grid approximation of ff with M=O(2mmd1)M=\mathcal{O}(2^{m}m^{d-1}) points. We know from (29) that for any ϵ>0\epsilon>0, we can equal the first term to ϵ/2\epsilon/2 with

Let us now approximate fm(1)f_{m}^{(1)} by a network fNf_{N} consisting of MM subnetworks, each subnetwork implementing the approximate multiplication introduced in the previous subsection, which we write as ϕ~l,i(x)\widetilde{\phi}_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x}), that is,

Let us suppose that each ϕ~l,i(x)\widetilde{\phi}_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x}) is computed to accuracy δ\delta with a network of depth and size O(log2δlog2d)\mathcal{O}(|\log_{2}\delta|\log_{2}d) for some 0<δ<10<\delta<1 (using Proposition 2). From (33), we get

For a given l\boldsymbol{l}, a given x\boldsymbol{x} belongs to the support of at most one ϕl,i(x)\phi_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x}) because these have disjoint supports,Note that the same holds true for ϕ~l,i(x)\widetilde{\phi}_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x}) using the -in--out property of Proposition 2. so the inequality becomes

Using the property of the decay of the coefficients (23), l1m+d122l11\sum_{|\boldsymbol{l}|_{1}\leq m+d-1}2^{-2|\boldsymbol{l}|_{1}}\leq 1 and 2d12^{-d}\leq 1, we obtain

since f2,1|f|_{\boldsymbol{2},\infty}\leq 1. Hence, for δ=ϵ/2\delta=\epsilon/2, one has

The depth of the network is O(log2δlog2d)=O(log2ϵlog2d)\mathcal{O}(|\log_{2}\delta|\log_{2}d)=\mathcal{O}(|\log_{2}\epsilon|\log_{2}d), and its size is

Discussion

We have proven new rigorous upper bounds for the approximation of functions in Korobov spaces X2,p(Ω)X^{2,p}(\Omega) by deep ReLU networks, for which the curse of dimensionality is lessened. The proof is based on the ability of deep networks to approximate sparse grids via a binary tree structure (Figure 4), which resembles the compositional structure used in .

There are many ways in which this work could be profitably continued. To show an advantage of deep networks versus shallow, it would be desirable to obtain a lower bound for approximations in X2,p(Ω)X^{2,p}(\Omega) by shallow networks, for which the curse of dimensionality is not lessened.For approximations by shallow networks in Sobolev spaces W2,2(Ω)W^{2,2}(\Omega), it is well known that both the lower and upper bounds depend exponentially on the dimension dd [13, Th. 6.1]. Another extension would be to derive similar estimates for smoother functions, e.g., functions with mixed derivatives of order m>2m>2. Piecewise smooth functions could also be considered (as in ), as well as Jacobi-weighted Korobov spaces , and energy-based sparse grids (for which the curse of dimensionality can be totally overcome [2, Th. 3.10]). More generally, we could apply our methodology to any expansion of the form (21), as long as the expansion coefficients satisfy a property like (23) (which controls the width of the network) and the basis functions can be implemented efficiently using the multiplication of Section 3.1 (which controls the depth).

Our theorem provides an upper bound for the approximation complexity when the same network is used to approximate all functions in a given Korobov space. In other words, the network architecture does not depend on the function being approximated; only the weights vl,iv_{\boldsymbol{l},\boldsymbol{i}} do. Alternatively, we could consider adaptive architectures where not only the weights but also the architecture is adjusted to the function being approximated. We would expect that this would decrease the complexity of the resulting network. Adaptive network architectures in the context of approximating multivariate functions have been studied by, e.g., Yarotsky in .

As mentioned in the introduction, breaking the curse of dimensionality often relies on taking advantage of special properties of the functions being approximated. In this paper, we followed a different route and considered a more generic space of functions, and approximations by sparse grids. Let us emphasize, however, that sparse grids—in particular the norm (18)—are highly anisotropic: to be efficient, these require the functions being approximated to be aligned with the axes. This is in fact the case for many algorithms for the approximation of multivariate functions, including low-rank compressions and quasi-Monte Carlo methods; we refer the interested reader to for details.

Acknowledgements

We thank the members of the CM3 group (Ran Gu, Hwi Lee, Qi Sun and Yunzhe Tao) at Columbia University, and colleague and programmer Joel R. Clay for fruitful discussions. The first author is much indebted to former PhD supervisor Nick Trefethen for his inspirational contributions to numerical analysis and in particular to the field of approximation theory.

References

Appendix A Proof of Proposition 2

Let us first note that ϕlj,ij\phi_{l_{j},i_{j}} can be written as

i.e., it can be implemented by a network of depth 22 and size 33.

Let us now prove the result concerning the multiplication by induction over dd. For simplicity we will suppose that d=2pd=2^{p} is a power of 22, and will prove the result by induction over pp.

For p=1p=1, i.e., d=2d=2, we want to show that for any 0<ϵ<10<\epsilon<1 there is a deep ReLU network with two inputs x1x_{1} and x2x_{2} that implements the multiplication ϕl1,i1(x1)×ϕl2,i2(x2)\phi_{l_{1},i_{1}}(x_{1})\times\phi_{l_{2},i_{2}}(x_{2}) with accuracy ϵ\epsilon, outputs if ϕl1,i1(x1)=0\phi_{l_{1},i_{1}}(x_{1})=0 or ϕl2,i2(x2)=0\phi_{l_{2},i_{2}}(x_{2})=0 (which we call -in--out property), and has depth and size O(log2ϵ)\mathcal{O}(|\log_{2}\epsilon|). To create such a network, one can combine two networks that compute ϕl1,i1(x1)\phi_{l_{1},i_{1}}(x_{1}) and ϕl2,i2(x1)\phi_{l_{2},i_{2}}(x_{1}) from x1x_{1} and x2x_{2} (each having depth 22 and size 33) with the network of Proposition 1 (with M=1M=1 since ϕlj,ij1\|\phi_{l_{j},i_{j}}\|_{\infty}\leq 1) to multiply ϕl1,i1(x1)\phi_{l_{1},i_{1}}(x_{1}) by ϕl2,i2(x2)\phi_{l_{2},i_{2}}(x_{2}) (depth and size O(log2ϵ)\mathcal{O}(|\log_{2}\epsilon|)). The resulting network has depth O(2+log2ϵ)=O(log2ϵ)\mathcal{O}(2+|\log_{2}\epsilon|)=\mathcal{O}(|\log_{2}\epsilon|) and size O(3×2+log2ϵ)=O(log2ϵ)\mathcal{O}(3\times 2+|\log_{2}\epsilon|)=\mathcal{O}(|\log_{2}\epsilon|), and inherits the -in--out property from Proposition 1.

Let us suppose now that this is true in dimension d/2=2p1d/2=2^{p-1} for some p1p\geq 1, and let us show this is still true in dimension d=2pd=2^{p} for any 0<ϵ<10<\epsilon<1. The induction hypothesis (which we use with ϵ/4\epsilon/4) states that there is a deep ReLU network with d/2d/2 inputs x1,,xd/2x_{1},\ldots,x_{d/2} that implements the multiplication j=1d/2ϕlj,ij(xj)\prod_{j=1}^{d/2}\phi_{l_{j},i_{j}}(x_{j}) with accuracy ϵ/4\epsilon/4, outputs if one of the ϕlj,ij(xj)\phi_{l_{j},i_{j}}(x_{j}) is , and has depth and size O(log2ϵ/4(log2d1))=O(log2ϵ(log2d1))\mathcal{O}(|\log_{2}\epsilon/4|(\log_{2}d-1))=\mathcal{O}(|\log_{2}\epsilon|(\log_{2}d-1)). Let

denote this network, where ×~\widetilde{\times} is the approximate multiplication of Proposition 1. In other words, this network corresponds to the hierarchical combination of d/21d/2-1 products; see Figure 4. The induction hypothesis tells us that this network has accuracy ϵ/4\epsilon/4,

Similarly we consider the network with d/2d/2 inputs xd/2+1,,xdx_{d/2+1},\ldots,x_{d} that implements j=d/2+1dϕlj,ij(xj)\prod_{j=d/2+1}^{d}\phi_{l_{j},i_{j}}(x_{j}),

and -in--out property. To construct a network that implements the full multiplication j=1dϕlj,ij(xj)\prod_{j=1}^{d}\phi_{l_{j},i_{j}}(x_{j}), we combine (41) with (44), that is,

Note that (46) satisfies the -in--out property since (41) and (44) do. Let us now examine the accuracy of this network:

using (42), (43) and (45). Therefore to bound (47) by ϵ\epsilon we would like to bound the first term in (47) by ϵ/2ϵ2/16\epsilon/2-\epsilon^{2}/16. Note that this term corresponds to the top multiplication of Figure 4. To achieve accuracy ϵ/2ϵ2/16\epsilon/2-\epsilon^{2}/16, we use Proposition 1 with M=1+ϵ/4M=1+\epsilon/4. Therefore, this multiplication is implemented by a network that has depth and size