New error bounds for deep networks using sparse grids

Hadrien Montanelli, Qiang Du

Introduction

One of the most important theoretical problems is to determine why and when deep (but not shallow) networks can lessen or break the “curse of dimensionality,” an expression coined by Bellman. A possible way of addressing this problem is to focus on a particular set of functions with a very special structure (such as compositional or polynomial), and to show that deep networks perform extremely well for this particular set. We follow a different route. We consider a space of functions that is more generic for multivariate approximation in high dimensions, and prove new error estimates for which the curse of dimensionality is lessened by establishing a connection with sparse grids.

for some norm $\|\cdot\|$. For deep networks, one also wants to find the asymptotic behavior of the depth as a function of the accuracy $\epsilon$. Results of the form (2) are standard approximation results that suffer from the curse of dimensionality. For a small dimension $d$, the size $N$ of the network increases at a reasonable rate as $\epsilon$ goes to zero. However, $N$ grows geometrically with $d$.

These are also Banach spaces, obtained as the completion of $C^{m}(\Omega)$ (functions with continuous partial derivatives up to order $m$) with respect to the norm

with standard $L^{p}(\Omega)$-norm $\|\cdot\|_{p}$.

As we mentioned previously, some results without curse of dimensionality for deep (but not shallow) networks have been recently derived; some of them are listed in Table 2. To derive such results, one has to consider functions with a very special structure. For example, compositional functions have been treated in this way. This class includes in particular functions that are a composition of two-dimensional functions, e.g., in dimension $d=8$,

$$f(x_{1},\ldots,x_{8})=f_{3}\Big(f_{21}\big(f_{11}(x_{1},x_{2}),f_{12}(x_{3},x_{4})\big),f_{22}\big(f_{13}(x_{5},x_{6}),f_{14}(x_{7},x_{8})\big)\Big),$$

for some bivariate functions $f_{11}$, $f_{12}$, $f_{13}$, $f_{14}$, $f_{21}$, $f_{22}$ and $f_{3}$. Such functions can be represented by a binary tree with $d$ input variables, $\log_{2}d$ levels and $d-1$ nodes, and each of the nodes can be approximated by a subnetwork of size $N/(d-1)$; see Figure 1.
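
The binary-tree structure described above can be sketched in a few lines of Python. This is only an illustration: the function name `compose_binary_tree` and the choice of bivariate nodes (sums and products) are hypothetical and are not taken from the works being discussed.

```python
from typing import Callable, List, Sequence

Bivariate = Callable[[float, float], float]


def compose_binary_tree(x: Sequence[float],
                        levels: Sequence[Sequence[Bivariate]]) -> float:
    """Evaluate a compositional function laid out as a binary tree.

    levels[0] holds the d/2 bivariate functions applied to the inputs,
    levels[1] the d/4 functions applied to their outputs, and so on.
    """
    values: List[float] = list(x)
    for level in levels:
        # Each node consumes two values from the level below.
        assert 2 * len(level) == len(values)
        values = [f(values[2 * k], values[2 * k + 1])
                  for k, f in enumerate(level)]
    assert len(values) == 1
    return values[0]


# Hypothetical bivariate nodes, chosen only for illustration.
def add(a: float, b: float) -> float:
    return a + b


def mul(a: float, b: float) -> float:
    return a * b


# d = 8: nodes (f_11, f_12, f_13, f_14), then (f_21, f_22), then f_3.
levels = [[add, mul, add, mul], [mul, add], [add]]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
value = compose_binary_tree(x, levels)
num_nodes = sum(len(level) for level in levels)
```

For $d=8$ the tree has $\log_{2}8=3$ levels and $d-1=7$ nodes, which is what `len(levels)` and `num_nodes` report.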

The estimate is achieved via sparse grid approximations to functions in $X^{2,p}(\Omega)$. The curse of dimensionality is not totally overcome but is significantly lessened, since the dimension $d$ only affects the exponent of the logarithmic factors $|\log_{2}\epsilon|$. (Let us emphasize, however, that the constant in (7) still depends exponentially on the dimension $d$.) As we will see in Section 2.2, Korobov spaces $X^{2,p}(\Omega)$ are subsets of Sobolev spaces $W^{2,p}(\Omega)$.

The remainder of the paper is structured as follows. We review Korobov spaces and sparse grids in Section 2, and prove our theorem in Section 3.

Korobov spaces and sparse grids

We review in this section Korobov spaces and sparse grids, which go back to Korobov and Smolyak, and were rediscovered by Zenger for solving partial differential equations. Since 1990, Korobov spaces/sparse grids and related hyperbolic cross approximation have been used extensively in the context of high-dimensional function approximation, and the group of Michael Griebel has been particularly influential. For details, we recommend the exhaustive review.

The first ingredient for cooking sparse grids is a hierarchical basis of functions. To approximate functions of one variable $x$ on $\Omega=[0,1]$, one considers a family of grids $\Omega_{l}$ of level $l$ characterized by a grid size $h_{l}=2^{-l}$ and $2^{l}-1$ points $x_{l,i}=ih_{l}$, $1\leq i\leq 2^{l}-1$. For each $\Omega_{l}$, one considers piecewise linear hat functions $\phi_{l,i}$ centered at $x_{l,i}$ defined by

$$\phi_{l,i}(x)=\phi\left(\frac{x-x_{l,i}}{h_{l}}\right),\tag{8}$$

where $\phi$ is the mother of all hat functions,

$$\phi(x)=\begin{cases}1-|x|,&x\in[-1,1],\\0,&\text{otherwise}.\end{cases}$$

Note that $\|\phi_{l,i}\|_{\infty}\leq 1$ for all $l$ and $i$. One then considers the function spaces $V_{l}$ spanned by such functions,

$$V_{l}=\mathrm{span}\{\phi_{l,i}\,:\,1\leq i\leq 2^{l}-1\},$$

and the hierarchical increment spaces $W_{l}$ given by

$$W_{l}=\mathrm{span}\{\phi_{l,i}\,:\,1\leq i\leq 2^{l}-1,\ i\ \text{odd}\}.$$

The basis that corresponds to the $W_{l}$'s for $1\leq l\leq n$ is called the hierarchical basis, while the basis of $V_{n}$ is called the nodal basis. We show both bases for $n=3$ in Figure 2.
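
As an illustration of the one-dimensional construction, the Python sketch below evaluates the hat functions and lists the indices of the nodal and hierarchical bases. It assumes the standard conventions $\phi(x)=\max(0,1-|x|)$ and $\phi_{l,i}(x)=\phi((x-x_{l,i})/h_{l})$, and that the increment space $W_{l}$ keeps the odd indices $i$ of level $l$; the function names are hypothetical.

```python
from typing import List, Tuple


def hat(x: float) -> float:
    """Mother hat function: 1 - |x| on [-1, 1], zero elsewhere."""
    return max(0.0, 1.0 - abs(x))


def hat_li(level: int, i: int, x: float) -> float:
    """Hat function of level `level` centered at x_{l,i} = i * h_l."""
    h = 2.0 ** (-level)
    return hat((x - i * h) / h)


def nodal_indices(n: int) -> List[Tuple[int, int]]:
    """Pairs (l, i) of the nodal basis of V_n: one level, all indices."""
    return [(n, i) for i in range(1, 2 ** n)]


def hierarchical_indices(n: int) -> List[Tuple[int, int]]:
    """Pairs (l, i) of the hierarchical basis (assumed: odd i on each level)."""
    return [(l, i) for l in range(1, n + 1) for i in range(1, 2 ** l, 2)]
```

For $n=3$, both `nodal_indices(3)` and `hierarchical_indices(3)` contain $2^{3}-1=7$ pairs, consistent with the two bases spanning the same space.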

Let us conclude this subsection by mentioning that one-dimensional hierarchical bases are not limited to piecewise linear functions (8). These can be generalized to piecewise higher-order polynomials [2, Th. 4.8], and also to other multiscale bases such as wavelets.

2.2 Multi-dimensional hierarchical basis and Korobov spaces

The second ingredient is to employ a tensor product construction to approximate functions of $d$ variables $\boldsymbol{x}=(x_{1},\ldots,x_{d})$ in $\Omega=[0,1]^{d}$. One considers a family of grids $\Omega_{\boldsymbol{l}}$ of level $\boldsymbol{l}=(l_{1},\ldots,l_{d})$ with points $\boldsymbol{x}_{\boldsymbol{l},\boldsymbol{i}}=\boldsymbol{i}\cdot\boldsymbol{h_{l}}$, $\boldsymbol{1}\leq\boldsymbol{i}\leq\boldsymbol{2^{l}-1}$, obtained by the tensor product of $d$ one-dimensional grids with levels $l_{1},\ldots,l_{d}$. (Multiplications and inequalities have to be understood componentwise, and we use the notation $\boldsymbol{1}=(1,\ldots,1)$.) For each $\Omega_{\boldsymbol{l}}$, one considers hat functions $\phi_{\boldsymbol{l},\boldsymbol{i}}$ centered at points $\boldsymbol{x}_{\boldsymbol{l},\boldsymbol{i}}$ defined by the product of the one-dimensional basis functions,

$$\phi_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})=\prod_{j=1}^{d}\phi_{l_{j},i_{j}}(x_{j}).\tag{13}$$

As in one dimension, one considers the function spaces spanned by these functions

The multi-dimensional hierarchical basis is the basis that corresponds to the $W_{\boldsymbol{l}}$'s for $\boldsymbol{1}\leq\boldsymbol{l}\leq\boldsymbol{n}$. We show all subspaces $W_{\boldsymbol{l}}$ in two dimensions for $\boldsymbol{n}=(3,3)$ in Figure 3.
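
The tensor product construction can be illustrated with a short Python sketch that evaluates $\phi_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})=\prod_{j=1}^{d}\phi_{l_{j},i_{j}}(x_{j})$. The one-dimensional hat function is assumed to be $\max(0,1-|x-x_{l,i}|/h_{l})$, and the names `hat_1d` and `hat_nd` are hypothetical.

```python
from math import prod
from typing import Sequence


def hat_1d(level: int, i: int, x: float) -> float:
    """One-dimensional hat function centered at i * 2**-level."""
    h = 2.0 ** (-level)
    return max(0.0, 1.0 - abs((x - i * h) / h))


def hat_nd(levels: Sequence[int], idx: Sequence[int],
           x: Sequence[float]) -> float:
    """Multi-dimensional hat function: product of the one-dimensional ones."""
    assert len(levels) == len(idx) == len(x)
    return prod(hat_1d(l, i, xj) for l, i, xj in zip(levels, idx, x))


# Value at the center of the basis function of level (1, 2), index (1, 1).
center_value = hat_nd((1, 2), (1, 1), (0.5, 0.25))
```

The product vanishes as soon as one coordinate leaves the support of its one-dimensional factor, which is the property exploited later when the product is implemented by a network.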

Equipped with a multi-dimensional hierarchical basis, one may approximate functions of $d$ variables. For simplicity, we only consider functions that are zero at the boundary; sparse grids for functions that are non-zero at the boundary can be derived in an analogous fashion, and have similar approximation properties. The appropriate function spaces in this context are the Korobov spaces $X^{2,p}(\Omega)$ defined for $2\leq p\leq+\infty$ by

with $|\boldsymbol{k}|_{\infty}=\max_{1\leq j\leq d}k_{j}$ and norm

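These spaces go back to the 1959 paper of Korobov. Note the difference with the Sobolev spaces $W^{2,p}(\Omega)$ defined in (3): smoothness for $X^{2,p}(\Omega)$ is measured in terms of mixed derivatives of order two. For example in two dimensions, from $|\boldsymbol{k}|_{\infty}=\max(k_{1},k_{2})\leq 2$, one can see that the Korobov spaces $X^{2,p}(\Omega)$ require

$$\frac{\partial f}{\partial x_{1}},\ \frac{\partial f}{\partial x_{2}},\ \frac{\partial^{2}f}{\partial x_{1}^{2}},\ \frac{\partial^{2}f}{\partial x_{2}^{2}},\ \frac{\partial^{2}f}{\partial x_{1}\partial x_{2}},\ \frac{\partial^{3}f}{\partial x_{1}^{2}\partial x_{2}},\ \frac{\partial^{3}f}{\partial x_{1}\partial x_{2}^{2}},\ \frac{\partial^{4}f}{\partial x_{1}^{2}\partial x_{2}^{2}}\in L^{p}(\Omega),$$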

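whereas $|\boldsymbol{k}|_{1}=k_{1}+k_{2}\leq 2$ for $W^{2,p}(\Omega)$ yields

$$\frac{\partial f}{\partial x_{1}},\ \frac{\partial f}{\partial x_{2}},\ \frac{\partial^{2}f}{\partial x_{1}^{2}},\ \frac{\partial^{2}f}{\partial x_{2}^{2}},\ \frac{\partial^{2}f}{\partial x_{1}\partial x_{2}}\in L^{p}(\Omega).$$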

In other words, Korobov spaces $X^{2,p}(\Omega)$ are subsets of Sobolev spaces $W^{2,p}(\Omega)$ .
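
The inclusion can be checked at the level of multi-indices. The Python sketch below enumerates the derivative orders $\boldsymbol{k}$ constrained by each space, using only the conditions $|\boldsymbol{k}|_{\infty}\leq 2$ and $|\boldsymbol{k}|_{1}\leq 2$ stated above; the function names are hypothetical, and the index $\boldsymbol{k}=\boldsymbol{0}$ (the function itself) is included in both lists.

```python
from itertools import product
from typing import List, Tuple

MultiIndex = Tuple[int, ...]


def korobov_indices(d: int, m: int = 2) -> List[MultiIndex]:
    """Multi-indices k with |k|_inf <= m (mixed smoothness)."""
    return [k for k in product(range(m + 1), repeat=d) if max(k) <= m]


def sobolev_indices(d: int, m: int = 2) -> List[MultiIndex]:
    """Multi-indices k with |k|_1 <= m (isotropic smoothness)."""
    return [k for k in product(range(m + 1), repeat=d) if sum(k) <= m]


korobov_2d = korobov_indices(2)
sobolev_2d = sobolev_indices(2)
# Every derivative required by W^{2,p} is also required by X^{2,p}.
sobolev_is_subset = set(sobolev_2d) <= set(korobov_2d)
```

In two dimensions this gives $9$ multi-indices for the Korobov condition and $6$ for the Sobolev one; the three extra indices, $(2,1)$, $(1,2)$ and $(2,2)$, are the higher mixed derivatives that make $X^{2,p}(\Omega)$ the smaller space.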

The key fact is that any function $f\in X^{2,p}(\Omega)$ has a unique (infinite) expansion in the hierarchical basis,

2.3 Discretization

The third and last ingredient is a clever truncation of the expansion (21). Sparse grids are discretizations of $X^{2,p}(\Omega)$ defined by

and correspond to a number of grid points $N$ given by [2, Lemma 3.6]

A sparse grid in two dimensions is shown in Figure 3. Note that full grids $V_{n}^{(\infty)}$ , with

correspond to a much larger $\mathcal{O}(h_{n}^{-d})=\mathcal{O}(2^{nd})$ number of grid points.
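
The gap between the two grid sizes can be made concrete with a small count. The Python sketch below assumes that each increment space $W_{\boldsymbol{l}}$ contributes $\prod_{j}2^{l_{j}-1}$ basis functions (odd indices in every direction), that the sparse grid keeps the levels with $|\boldsymbol{l}|_{1}\leq n+d-1$ (the index set that appears in the sum used in the proof of Theorem 1), and that the full grid keeps all levels with $|\boldsymbol{l}|_{\infty}\leq n$; the function names are hypothetical.

```python
from itertools import product
from math import prod
from typing import Sequence


def increment_dim(levels: Sequence[int]) -> int:
    """Number of basis functions in W_l (assumed: odd indices per direction)."""
    return prod(2 ** (l - 1) for l in levels)


def sparse_grid_points(n: int, d: int) -> int:
    """Grid points of the sparse grid: levels with |l|_1 <= n + d - 1."""
    return sum(increment_dim(l)
               for l in product(range(1, n + 1), repeat=d)
               if sum(l) <= n + d - 1)


def full_grid_points(n: int, d: int) -> int:
    """Grid points of the full grid: all levels with |l|_inf <= n."""
    return sum(increment_dim(l) for l in product(range(1, n + 1), repeat=d))
```

With $n=3$ and $d=2$, the sparse grid has $17$ points against $49=(2^{3}-1)^{2}$ for the full grid, and the ratio widens quickly as $d$ grows.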

and for any $f\in X^{2,p}(\Omega)$ , the approximation error satisfies [2, Lemma 3.13],

The approximation error in (28) is slightly worse than the $\mathcal{O}(N^{-\frac{2}{d}})$ error for approximating functions in $X^{2,p}(\Omega)$ with full grids [2, Lemma 3.5], but is achieved with a much smaller number of points.

Error bounds using sparse grids

Results listed in Table 1 are typically proven using the following technique. One shows that certain functions $f$ can be approximated by polynomials $f_{M}$ of degree $M$ to any prescribed accuracy $\epsilon$, and that the polynomials $f_{M}$ can in turn be approximated by neural networks $f_{N}$ of size $N$, with $N$ bounded by some function of $\epsilon$, as in (2). This amounts to decomposing the approximation error as

$$\|f-f_{N}\|\leq\|f-f_{M}\|+\|f_{M}-f_{N}\|,$$

for some norm $\|\cdot\|$. We use the same idea but, instead of polynomials, we use approximations by sparse grids.

We first explain in Section 3.1 how to approximate the hat functions $\phi_{\boldsymbol{l},\boldsymbol{i}}$ using ideas introduced independently by Liang and Srikant, and Yarotsky. We then prove in Section 3.2 our theorem concerning approximation of functions in $X^{2,p}(\Omega)$ by deep networks.

The following proposition of Yarotsky shows how deep networks can implement multiplication.

Proposition 1 [21, Prop. 3]. For any $0<\epsilon<1$, there is a deep ReLU network with inputs $x_{1}$ and $x_{2}$, with $|x_{1}|\leq M$ and $|x_{2}|\leq M$, that implements the multiplication $x_{1}x_{2}$ with accuracy $\epsilon$, outputs $0$ if $x_{1}=0$ or $x_{2}=0$, and has depth and size $\mathcal{O}(|\log_{2}\epsilon|+\log_{2}M)$.
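
To make Proposition 1 concrete, the following Python sketch evaluates an approximate multiplication built from a sawtooth approximation of the squaring function, in the spirit of Yarotsky's construction. It is an illustration under stated assumptions, not the network of [21, Prop. 3]: it takes $M=1$, it evaluates the function that such a network would compute rather than assembling layers, and the names `tooth`, `approx_square` and `approx_mul` are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def tooth(x: float) -> float:
    """Tooth function on [0, 1], written with three ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5) + 2.0 * relu(x - 1.0)


def approx_square(x: float, m: int) -> float:
    """Piecewise linear interpolant of x**2 on [0, 1] with 2**m pieces.

    Uses m compositions of the tooth function, so the depth grows
    linearly in m while the error decays like 2**(-2*m - 2).
    """
    total, g = x, x
    for s in range(1, m + 1):
        g = tooth(g)
        total -= g / 4.0 ** s
    return total


def approx_mul(x: float, y: float, m: int) -> float:
    """Approximate x*y for |x|, |y| <= 1 (assumption: M = 1).

    Based on xy = 2 * [((x+y)/2)**2 - (x/2)**2 - (y/2)**2].
    """
    return 2.0 * (approx_square(abs(x + y) / 2.0, m)
                  - approx_square(abs(x) / 2.0, m)
                  - approx_square(abs(y) / 2.0, m))
```

With this choice the error is at most $2^{-2m}$, so accuracy $\epsilon$ needs $m=\mathcal{O}(|\log_{2}\epsilon|)$ compositions, and the output is exactly zero when either input is zero because the two remaining squared terms cancel.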

From Proposition 1, we obtain the following result, which shows how deep networks can approximate the multi-dimensional hat functions (13) with a binary tree structure. The corresponding network is shown in Figure 4, and the proof can be found in Appendix A.

Proposition 2. For any dimension $d$ and $0<\epsilon<1$, there is a deep ReLU network with $d$ inputs $x_{1},\ldots,x_{d}$ that implements the multiplication $\phi_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})=\prod_{j=1}^{d}\phi_{l_{j},i_{j}}(x_{j})$ with accuracy $\epsilon$, outputs $0$ if one of the $\phi_{l_{j},i_{j}}(x_{j})$ is $0$, and has depth and size $\mathcal{O}(|\log_{2}\epsilon|\log_{2}d)$.
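
The binary tree behind Proposition 2 can be sketched as a pairwise reduction. The Python code below is a schematic illustration only: `tree_product` is a hypothetical name, the number of factors is assumed to be a power of two, and the default `mul` is the exact product, standing in for the approximate multiplication of Proposition 1.

```python
from typing import Callable, Sequence, Tuple


def tree_product(factors: Sequence[float],
                 mul: Callable[[float, float], float] = lambda a, b: a * b
                 ) -> Tuple[float, int, int]:
    """Multiply the factors pairwise along a binary tree.

    Returns (value, number of tree levels, number of multiplications).
    Assumes len(factors) is a power of two.
    """
    values = list(factors)
    levels = 0
    n_mults = 0
    while len(values) > 1:
        # One level of the tree: combine neighbours two by two.
        values = [mul(values[k], values[k + 1])
                  for k in range(0, len(values), 2)]
        n_mults += len(values)
        levels += 1
    return values[0], levels, n_mults


value, levels, n_mults = tree_product([0.5] * 8)
```

For $d=8$ factors the reduction uses $\log_{2}8=3$ levels and $d-1=7$ multiplications, which is where the factor $\log_{2}d$ in the depth comes from once each multiplication costs $\mathcal{O}(|\log_{2}\epsilon|)$ layers.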

3.2 Approximating sparse grids by deep networks

We use the fact that functions in $X^{2,p}([0,1]^{d})$ can be approximated by sparse grids, and then show that sparse grids can be represented by deep networks using the multiplication presented in the previous subsection. The resulting network is shown in Figure 5.

Theorem 1. For any dimension $d$ and $0<\epsilon<1$, there is a deep ReLU network with $d$ inputs $x_{1},\ldots,x_{d}$ capable of expressing any function $f$ in $X^{2,p}([0,1]^{d})$ that satisfies $|f|_{\boldsymbol{2},\infty}\leq 1$ with accuracy $\epsilon$, and has depth $\mathcal{O}(|\log_{2}\epsilon|\log_{2}d)$ and size $\mathcal{O}(\epsilon^{-\frac{1}{2}}|\log_{2}\epsilon|^{\frac{3}{2}(d-1)+1}\log_{2}d)$.

Proof. Let us consider $f\in X^{2,p}([0,1]^{d})$ and suppose we want to approximate $f$ with a deep ReLU network $f_{N}$ of size $N$. Let us write

where $f^{(1)}_{m}\in V_{m}^{(1)}$ is the sparse grid approximation of $f$ with $M=\mathcal{O}(2^{m}m^{d-1})$ points. We know from (29) that for any $\epsilon>0$, we can make the first term equal to $\epsilon/2$ with

Let us now approximate $f_{m}^{(1)}$ by a network $f_{N}$ consisting of $M$ subnetworks, each subnetwork implementing the approximate multiplication introduced in the previous subsection, which we write as $\widetilde{\phi}_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})$ , that is,

Let us suppose that each $\widetilde{\phi}_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})$ is computed to accuracy $\delta$ with a network of depth and size $\mathcal{O}(|\log_{2}\delta|\log_{2}d)$ for some $0<\delta<1$ (using Proposition 2). From (33), we get

For a given $\boldsymbol{l}$, a given $\boldsymbol{x}$ belongs to the support of at most one $\phi_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})$ because these have disjoint supports (the same holds true for $\widetilde{\phi}_{\boldsymbol{l},\boldsymbol{i}}(\boldsymbol{x})$ by the $0$-in-$0$-out property of Proposition 2), so the inequality becomes

Using the property of the decay of the coefficients (23), $\sum_{|\boldsymbol{l}|_{1}\leq m+d-1}2^{-2|\boldsymbol{l}|_{1}}\leq 1$ and $2^{-d}\leq 1$ , we obtain

since $|f|_{\boldsymbol{2},\infty}\leq 1$ . Hence, for $\delta=\epsilon/2$ , one has

The depth of the network is $\mathcal{O}(|\log_{2}\delta|\log_{2}d)=\mathcal{O}(|\log_{2}\epsilon|\log_{2}d)$ , and its size is

Discussion

We have proven new rigorous upper bounds for the approximation of functions in Korobov spaces $X^{2,p}(\Omega)$ by deep ReLU networks, for which the curse of dimensionality is lessened. The proof is based on the ability of deep networks to approximate sparse grids via a binary tree structure (Figure 4), which resembles the compositional structure discussed in the introduction.

There are many ways in which this work could be profitably continued. To show an advantage of deep networks versus shallow, it would be desirable to obtain a lower bound for approximations in $X^{2,p}(\Omega)$ by shallow networks, for which the curse of dimensionality is not lessened. (For approximations by shallow networks in Sobolev spaces $W^{2,2}(\Omega)$, it is well known that both the lower and upper bounds depend exponentially on the dimension $d$ [13, Th. 6.1].) Another extension would be to derive similar estimates for smoother functions, e.g., functions with mixed derivatives of order $m>2$. Piecewise smooth functions could also be considered, as well as Jacobi-weighted Korobov spaces and energy-based sparse grids (for which the curse of dimensionality can be totally overcome [2, Th. 3.10]). More generally, we could apply our methodology to any expansion of the form (21), as long as the expansion coefficients satisfy a property like (23) (which controls the width of the network) and the basis functions can be implemented efficiently using the multiplication of Section 3.1 (which controls the depth).

Our theorem provides an upper bound for the approximation complexity when the same network is used to approximate all functions in a given Korobov space. In other words, the network architecture does not depend on the function being approximated; only the weights $v_{\boldsymbol{l},\boldsymbol{i}}$ do. Alternatively, we could consider adaptive architectures where not only the weights but also the architecture is adjusted to the function being approximated. We would expect this to decrease the complexity of the resulting network. Adaptive network architectures in the context of approximating multivariate functions have been studied by, e.g., Yarotsky.

As mentioned in the introduction, breaking the curse of dimensionality often relies on taking advantage of special properties of the functions being approximated. In this paper, we followed a different route and considered a more generic space of functions, and approximations by sparse grids. Let us emphasize, however, that sparse grids, and in particular the norm (18), are highly anisotropic: to be efficient, these require the functions being approximated to be aligned with the axes. This is in fact the case for many algorithms for the approximation of multivariate functions, including low-rank compressions and quasi-Monte Carlo methods; we refer the interested reader to the literature for details.

Acknowledgements

We thank the members of the CM3 group (Ran Gu, Hwi Lee, Qi Sun and Yunzhe Tao) at Columbia University, and colleague and programmer Joel R. Clay for fruitful discussions. The first author is much indebted to former PhD supervisor Nick Trefethen for his inspirational contributions to numerical analysis and in particular to the field of approximation theory.

References

Appendix A Proof of Proposition 2

Let us first note that $\phi_{l_{j},i_{j}}$ can be written as

i.e., it can be implemented by a network of depth $2$ and size $3$ .
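
One representation consistent with the stated depth $2$ and size $3$ is $\sigma(1-\sigma(u)-\sigma(-u))$ with $u=(x_{j}-x_{l_{j},i_{j}})/h_{l_{j}}$ and $\sigma$ the ReLU function. This specific formula is an assumption made for illustration rather than a quotation, and the Python sketch below simply checks that it reproduces the hat function; the function names are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)


def hat_relu(level: int, i: int, x: float) -> float:
    """Hat function written with three ReLU units in two layers.

    First layer: relu(u) and relu(-u), whose sum is |u|.
    Second layer: relu(1 - relu(u) - relu(-u)) = max(0, 1 - |u|).
    """
    h = 2.0 ** (-level)
    u = (x - i * h) / h
    return relu(1.0 - relu(u) - relu(-u))


def hat_direct(level: int, i: int, x: float) -> float:
    """Reference implementation: max(0, 1 - |x - x_{l,i}| / h_l)."""
    h = 2.0 ** (-level)
    return max(0.0, 1.0 - abs((x - i * h) / h))
```

Since one of `relu(u)` and `relu(-u)` is always zero, the two expressions agree exactly, not just up to rounding.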

Let us now prove the result concerning the multiplication by induction over $d$ . For simplicity we will suppose that $d=2^{p}$ is a power of $2$ , and will prove the result by induction over $p$ .

For $p=1$, i.e., $d=2$, we want to show that for any $0<\epsilon<1$ there is a deep ReLU network with two inputs $x_{1}$ and $x_{2}$ that implements the multiplication $\phi_{l_{1},i_{1}}(x_{1})\times\phi_{l_{2},i_{2}}(x_{2})$ with accuracy $\epsilon$, outputs $0$ if $\phi_{l_{1},i_{1}}(x_{1})=0$ or $\phi_{l_{2},i_{2}}(x_{2})=0$ (which we call the $0$-in-$0$-out property), and has depth and size $\mathcal{O}(|\log_{2}\epsilon|)$. To create such a network, one can combine two networks that compute $\phi_{l_{1},i_{1}}(x_{1})$ and $\phi_{l_{2},i_{2}}(x_{2})$ from $x_{1}$ and $x_{2}$ (each having depth $2$ and size $3$) with the network of Proposition 1 (with $M=1$ since $\|\phi_{l_{j},i_{j}}\|_{\infty}\leq 1$) to multiply $\phi_{l_{1},i_{1}}(x_{1})$ by $\phi_{l_{2},i_{2}}(x_{2})$ (depth and size $\mathcal{O}(|\log_{2}\epsilon|)$). The resulting network has depth $\mathcal{O}(2+|\log_{2}\epsilon|)=\mathcal{O}(|\log_{2}\epsilon|)$ and size $\mathcal{O}(3\times 2+|\log_{2}\epsilon|)=\mathcal{O}(|\log_{2}\epsilon|)$, and inherits the $0$-in-$0$-out property from Proposition 1.

Let us suppose now that this is true in dimension $d/2=2^{p-1}$ for some $p\geq 2$, and let us show this is still true in dimension $d=2^{p}$ for any $0<\epsilon<1$. The induction hypothesis (which we use with $\epsilon/4$) states that there is a deep ReLU network with $d/2$ inputs $x_{1},\ldots,x_{d/2}$ that implements the multiplication $\prod_{j=1}^{d/2}\phi_{l_{j},i_{j}}(x_{j})$ with accuracy $\epsilon/4$, outputs $0$ if one of the $\phi_{l_{j},i_{j}}(x_{j})$ is $0$, and has depth and size $\mathcal{O}(|\log_{2}(\epsilon/4)|(\log_{2}d-1))=\mathcal{O}(|\log_{2}\epsilon|(\log_{2}d-1))$. Let

denote this network, where $\widetilde{\times}$ is the approximate multiplication of Proposition 1. In other words, this network corresponds to the hierarchical combination of $d/2-1$ products; see Figure 4. The induction hypothesis tells us that this network has accuracy $\epsilon/4$ ,

Similarly we consider the network with $d/2$ inputs $x_{d/2+1},\ldots,x_{d}$ that implements $\prod_{j=d/2+1}^{d}\phi_{l_{j},i_{j}}(x_{j})$ ,

and the $0$-in-$0$-out property. To construct a network that implements the full multiplication $\prod_{j=1}^{d}\phi_{l_{j},i_{j}}(x_{j})$, we combine (41) with (44), that is,

Note that (46) satisfies the $0$-in-$0$-out property since (41) and (44) do. Let us now examine the accuracy of this network:

using (42), (43) and (45). Therefore to bound (47) by $\epsilon$ we would like to bound the first term in (47) by $\epsilon/2-\epsilon^{2}/16$ . Note that this term corresponds to the top multiplication of Figure 4. To achieve accuracy $\epsilon/2-\epsilon^{2}/16$ , we use Proposition 1 with $M=1+\epsilon/4$ . Therefore, this multiplication is implemented by a network that has depth and size