Analysis of the Generalization Error: Empirical Risk Minimization over Deep Artificial Neural Networks Overcomes the Curse of Dimensionality in the Numerical Approximation of Black-Scholes Partial Differential Equations

Julius Berner, Philipp Grohs, Arnulf Jentzen

Introduction

In this introductory section we want to present and motivate our problem, provide the reader with background knowledge and references to previous research on the topic, and outline the important steps, as well as possible extensions, of our contribution.

where (Stξ)t[0,T](S_{t}^{\xi})_{t\in[0,T]} is a stochastic process satisfying the stochastic differential equation (SDE)

If the goal, however, is to approximate Fd(T,)F_{d}(T,\cdot) not only at a single value but, for example, on a full hypercube [u,v]d[u,v]^{d}, there has been no known method which does not suffer from the curse of dimensionality. In particular, it has been completely out of range to provably approximate Fd(T,)F_{d}(T,\cdot) on [u,v]d[u,v]^{d} in high dimensions, say, d100d\gg 100.

The present paper introduces and analyzes deep learning–based algorithms for the numerical approximation of Fd(T,)F_{d}(T,\cdot) on a full hypercube [u,v]d[u,v]^{d}. We will prove that the resulting algorithms overcome the curse of dimensionality and can consequently be efficiently applied even in high dimensions. Our proofs will be based on tools from statistical learning theory and the following key properties of linear Kolmogorov equations:

The fact that one can reformulate (1) as a learning problem (see Lemma 3.1).

The fact that typical initial values arising from problems in computational finance, such as, for example, (2), are either exactly representable as neural networks with ReLU activation function (ReLU networks) or can be approximated by such neural networks without incurring the curse of dimensionality (see [29, Section 4]).

The fact that Property P.2 is preserved under the evolution of linear Kolmogorov equations (1) with affine diffusion and drift coefficients, which implies that Fd(T,)F_{d}(T,\cdot) can be approximated by ReLU networks without incurring the curse of dimensionality (see Theorem 3.2).

2 Deep learning and statistical learning theory

In their most basic incarnation, deep learning–based algorithms start with training data

To give a concrete example, Xd(i)X_{d}^{(i)} may consist of different 28×2828\times 28 pixel grayscale images of handwritten digits and Yd(i)Y_{d}^{(i)} may consist of corresponding probabilities describing the likelihood of a certain digit to be shown in image Xd(i)X_{d}^{(i)} . The goal is then to find a functional relation between images and labels and use it for predictive purposes on unseen images.

Empirical risk minimization (ERM) attempts to solve this prediction problem by minimizing the empirical risk

where AW,B(x):=Wx+B{\mathcal{A}}_{W,B}(x):=Wx+B and ρ(x)=(ρ(xi))i=1n\rho_{*}(x)=(\rho(x_{i}))_{i=1}^{n}, i.e., ρ\rho is applied componentwise. Then, neural network hypothesis classes are typically of the form

Despite the great practical success of the “deep learning paradigm,” for generic real world training data it is far out of reach to specify network architectures which guarantee a desired performance on unseen data; see also .

This type of problem can be theoretically studied using tools developed within the field of statistical learning theory. There it is typically postulated that ((Xd(i),Yd(i)))i=1m((X_{d}^{(i)},Y_{d}^{(i)}))_{i=1}^{m} are i.i.d. samples drawn from the distribution of some (unknown) data (Xd,Yd)(X_{d},Y_{d}) and that the optimal functional relation between XdX_{d} and YdY_{d} is given by the regression function

statistical learning problem with data (Xd,Yd)(X_{d},Y_{d}) and quadratic loss function.

Under strong regularity assumptions on the regression function fdf^{*}_{d} and the distribution of (Xd,Yd)(X_{d},Y_{d}) it is possible to obtain bounds on the sample size mm and the “complexity” of the hypothesis class H{\mathcal{H}} in order to guarantee, with high probability, an error

see, for example, . In the above, the regularity of the regression function fdf^{*}_{d} quantifies how well fdf^{*}_{d} can be approximated by the hypothesis class H{\mathcal{H}}.

In the case of the neural network hypothesis classes H=Nρ,a,Ru,v{\mathcal{H}}={\mathcal{N}}^{u,v}_{\rho,{\mathbf{a}},R} these regularity assumptions are met if fdf^{*}_{d} satisfies certain smoothness assumptions; see . Moreover, the complexity of Nρ,a,Ru,v{\mathcal{N}}^{u,v}_{\rho,{\mathbf{a}},R} can mainly be described by the size of the neural network parametrizations, i.e., the number of network parameters

We will show in Subsection 1.3 below that our specific deep learning–based method for numerically solving Kolmogorov equations allows us to rigorously apply tools from statistical learning theory, as we can overcome the following potential problems:

The crucial assumption that the training data consists of i.i.d. samples drawn from an underlying probability distribution is usually debatable or at least hard to verify.

Even if this assumption were satisfied, the underlying distribution of (Xd,Yd)(X_{d},Y_{d}) is typically unknown. Thus, it is hard to ensure a priori the regularity assumptions on fdf^{*}_{d}, which are needed to apply tools from statistical learning theory.

Most of the classical techniques operate in an asymptotic regime where the number of training samples mm exceeds the network size P(a){P}({\mathbf{a}}). However, in many applications the number of training samples is fixed, and it is not possible to generate more training data at will.

3 Kolmogorov equations as learning problem

We will reformulate the numerical approximation of Fd(T,)F_{d}(T,\cdot) on [u,v]d[u,v]^{d} as a statistical learning problem and demonstrate that in this specific case none of the aforementioned Problems R.1–R.4 appears. Let XdU([u,v]d)X_{d}\sim\mathcal{U}([u,v]^{d}) be uniformly distributed on [u,v]d[u,v]^{d} and define Yd:=φd(STXd)Y_{d}:=\varphi_{d}(S^{X_{d}}_{T}), where (StXd)t[0,T](S_{t}^{X_{d}})_{t\in[0,T]} is a stochastic process satisfying the SDE

Under suitable conditions it then follows from the Feynman–Kac formula (3) that Fd(T,)F_{d}(T,\cdot) is the minimizer of the risk functional Ed{\mathcal{E}}_{d}, that is, fd(x)=Fd(T,x)f^{*}_{d}(x)=F_{d}(T,x) for a.e. x[u,v]dx\in[u,v]^{d}; see Lemma 3.1. As outlined in Subsection 1.2, we thus have that

the end value of the Kolmogorov equation Fd(T,)F_{d}(T,\cdot) is the solution to the statistical learning problem with data (Xd,φd(STXd))(X_{d},\varphi_{d}(S^{X_{d}}_{T})) and quadratic loss function.

A natural next step is to apply the deep learning paradigm, that is, for mm i.i.d. samples ((Xd(i),Yd(i)))i=1m((X_{d}^{(i)},Y_{d}^{(i)}))_{i=1}^{m} drawn from the distribution of (Xd,Yd)(X_{d},Y_{d}) to minimize the empirical risk (4) over a hypothesis class of neural networks Nρ,a,Ru,v{\mathcal{N}}^{u,v}_{\rho,{\mathbf{a}},R}.

In this idea has been implemented with suitable classes of deep neural networks of a given architecture as hypothesis class H{\mathcal{H}}. In extensive numerical simulations it was observed that the proposed algorithm is efficient even in very high dimensions, suggesting that it does not suffer from the curse of dimensionality. Similar conclusions can be found in related work , which covers topics ranging from American option pricing problems to fully nonlinear PDEs. Note that in a nonquantitative analysis of the approximation error is given; all the other works are purely empirical. To the best of our knowledge, this is the first joint quantitative analysis of approximation and generalization error confirming the efficiency of deep learning–based methods applied to the numerical solution of high-dimensional PDEs.

Two main parameters influence the complexity of the algorithm described above: the number P(ad,ε)P({\mathbf{a}}_{d,\varepsilon}) of network parameters that need to be optimized as well as the number of training samples md,εm_{d,\varepsilon} needed to guarantee that, with high probability, the estimate

holds true. We are interested in the scaling of md,εm_{d,\varepsilon} and P(ad,ε)P({\mathbf{a}}_{d,\varepsilon}) with respect to the precision ε\varepsilon and dimension dd.

Observe that the data distribution (Xd,φd(STXd))(X_{d},\varphi_{d}(S^{X_{d}}_{T})) is now explicitly known and i.i.d. samples of this distribution can be simulated as needed (XdX_{d} is uniformly distributed and can be simulated using a suitable random number generator, and STXdS^{X_{d}}_{T} can be simulated by any numerical solver for the SDE (7); see ). Moreover, the uniform distribution of the input data XdX_{d} gives rise to typical L2{\mathcal{L}}^{2}-error estimates (8) and, in the case of affine coefficients and suitable initial values, we can establish bounds on how well the regression function can be approximated by neural network hypothesis classes; see Theorem 3.2. In particular, contrary to conventional learning problems, in the statistical learning problem that arises from our reformulation of the Kolmogorov equation, none of the Problems R.1–R.4 described in Subsection 1.2 occurs. We will therefore be able to rigorously invoke tools from statistical learning theory to obtain bounds on the quantities md,εm_{d,\varepsilon} and P(ad,ε)P({\mathbf{a}}_{d,\varepsilon}) above.

4 Contribution

see Corollary 3.5. We conclude that the aforementioned deep learning–based algorithm does not suffer from the curse of dimensionality.

We briefly describe our proof strategy for bounding the error between the empirical risk minimizer and the end value of the Kolmogorov equation

as in (8). By the so-called bias-variance decomposition we can represent this error as the sum of a generalization error and an approximation error, i.e.,

see Lemma 2.1. Bounds on the size of neural networks with ReLU activation function

see Proposition 2.6. In conjunction with Hoeffding’s inequality this allows us to uniformly (over the hypothesis class of neural networks) bound the error between the risk and the empirical risk. However, this requires that the regression function fdf^{*}_{d} as well as all functions in H{\mathcal{H}} are uniformly bounded. To that end we assume the initial value φd\varphi_{d} to be bounded, which by the Feynman–Kac formula (3) implies that also the function fd=Fd(T,)f^{*}_{d}=F_{d}(T,\cdot) is bounded. Moreover, we introduce hypothesis classes of “clipped” neural networks

Note that there exist different concepts and known results in order to bound the generalization error (see, for instance, ). The present paper intends to stress the interplay between the approximation and generalization error and gives a complete proof in order to rigorously show the absence of the curse of dimensionality for our particular problem.

We are now ready to formulate a first specific result of this paper as an appetizer. It demonstrates that deep learning–based ERM succeeds in solving the option pricing problem for European put options without incurring the curse of dimensionality.

P(a)Cdε2{P}({\mathbf{a}})\leq Cd\varepsilon^{-2} (number of parameters),

RCd3/2ε1R\leq Cd^{3/2}\varepsilon^{-1} (parameter bound), and

max{a1,a2}Cdε1\max\{a_{1},a_{2}\}\leq Cd\varepsilon^{-1} (size of the architecture),

Note that our analysis does not consider the computational cost of solving the nonsmooth, nonconvex ERM problem (4). This is typically achieved by stochastic first order optimization methods whose theoretical analysis is still an open problem. While there are many interesting approaches to the latter question, they tend to require very strong assumptions (e.g., (almost) linearity, convexity, extreme overparametrization, or inverse stability of Fρ{\mathcal{F}}_{\rho} ), which we want to avoid in our analysis.

5 Extensions

Our results in Section 3, in conjunction with the results of , can be applied to prove the absence of the curse of dimensionality in the pricing of (capped) basket call, basket put, call on max, and call on min options. Moreover, the results of Section 2 hold for a general statistical learning problem within Setting 2.1. That is, ReLU network approximation results for the regression function translate directly into generalization results without incurring the curse of dimensionality. If suitable learning problems can be established, this work can extend various neural network approximation results for PDEs (see, e.g., ) to also consider the generalization error and get one step closer to a full error analysis. For instance, there are stronger approximation results for more restricted option pricing problems , and there are very recent approximation results for semilinear heat equations and Kolmogorov equations with (time-inhomogeneous) nonlinear coefficients where the dependence on the dimension is polynomial. Using a generalized version of Lemma 3.1, the findings of this paper can be used to prove that the corresponding ERM problem achieves, with high probability, a desired accuracy ε\varepsilon with the number of samples and the size of the hypothesis class scaling only polynomially in dd and ε1\varepsilon^{-1}. In particular this means that the presented methods are not restricted to the case of linear Kolmogorov equations with affine drift and diffusion coefficients. Finally, note that one obtains similar results for any continuous piecewise linear activation function with a finite number of breakpoints; see the comment after Theorem 2.3 and [62, Proposition 1].

6 Outline

Results in statistical learning theory

The present section develops generalization bounds for ERM problems in the spirit of .

such that the mapping Ωωf^d,m,H(ω)\Omega\ni\omega\mapsto\widehat{f}_{d,m,{\mathcal{H}}}(\omega) is measurable.

We want to emphasize that the minima in (9) and (10) will be attained due to the compactness of our hypothesis class, but they need not be unique. We require the mapping Ωωf^d,m,H(ω)\Omega\ni\omega\mapsto\widehat{f}_{d,m,{\mathcal{H}}}(\omega) to be measurable in order to view the risk of the empirical regression function as a random variable \Omega\ni\omega\mapsto{\mathcal{E}}_{d}\big{(}\widehat{f}_{d,m,{\mathcal{H}}}(\omega)\big{)}. This ensures that the probability in the generalization error bound (Theorem 2.2) is well-defined. While this technical assumption is often not explicitly stated in the literature on statistical learning theory it is actually crucial for analyzing the generalization error. In our setting (by choosing a suitable minimizer) measurability can indeed be satisfied; see Appendix A.1.

We now introduce the concept of covering numbers in order to bound the generalization error. {setting}[covering number] For every r(0,1){r}\in(0,1), every normed vector space (Z,)(\mathcal{Z},\|{\cdot}\|), and every compact subset HZ{\mathcal{H}}\subseteq\mathcal{Z} we define the r{r}-covering number of H{\mathcal{H}} w.r.t. \|{\cdot}\| by

where Ballr(f):={gH: fgr}\mathit{Ball}_{{r}}(f):=\{g\in{\mathcal{H}}:\ \|f-g\|\leq{r}\} denotes the ball of radius rr around fHf\in{\mathcal{H}}. If the norm is clear from the context, we will use the abbreviation Cov(H,r):=Cov(H,,r){\operatorname{Cov}}({\mathcal{H}},{r}):={\operatorname{Cov}}({\mathcal{H}},\|{\cdot}\|,{r}). Assume that the functions in our hypothesis class H{\mathcal{H}} are uniformly bounded and that balls of radius rr around the functions (fi)i=1n(f_{i})_{i=1}^{n} cover H{\mathcal{H}}. We can then use the (uniform) Lipschitz continuity of the (empirical) risk to bound the generalization error by

and employ Hoeffding’s inequality and a union bound to obtain the following estimate.

This result is adapted from [4, Theorem 17.1], [19, Theorem 3.14], and [49, Example 3.31], and for the sake of completeness we present a detailed proof in Appendix A.3.

2 Covering numbers of neural network hypothesis classes

As a natural next step we prove estimates on the covering numbers of neural network hypothesis classes in order to leverage the result of Theorem 2.2. Note that for different assumptions (i.e., boundedness assumptions on the activation function, different norms on the parameters, or evaluation of the neural networks on input data) similar approaches can be found in .

L(a):=LL({\mathbf{a}}):=L (number of layers), and

and the hypothesis class of clipped neural networks

The hypothesis classes Na,R,Du,v{\mathcal{N}}_{{\mathbf{a}},R,{D}}^{u,v} are somewhat nonstandard in the sense that the clipping function clipD{\operatorname{clip}}_{{D}} is applied to the output of a neural network realization. The reason for our choice of this definition is that Theorem 2.2 requires that the set of neural networks over which the ERM problem is solved consists of uniformly bounded functions. In Appendix A.4 we show that the clipping function clipD{\operatorname{clip}}_{{D}} can be represented as a small neural network, which implies that the seemingly nonstandard classes Na,R,Du,v{\mathcal{N}}_{{\mathbf{a}},R,{D}}^{u,v} are actually conventional neural network classes that can be trained with standard methods .

The next theorem quantifies the Lipschitz continuity Lip(Fu,v)\operatorname{Lip}({\mathcal{F}}^{u,v}) of the operator

which maps bounded neural network parametrizations with fixed architecture to the corresponding realization functions (restricted to [u,v]d[u,v]^{d}).

The claim follows by a simple counting argument.

Together with Theorem 2.3 this allows us to bound the covering number of our hypothesis class of (clipped) neural networks.

The proof in Appendix A.6 is based on the behavior of covering numbers under the action of a Lipschitz function, i.e.,

3 Analysis of the generalization error

Combining Theorem 2.2 and Proposition 2.6, the following theorem describes our main result related to the generalization capabilities of hypothesis classes consisting of clipped ReLU networks.

and define H:=Na,R,Du,v{\mathcal{H}}:=\mathcal{N}_{\mathbf{a},R,{D}}^{u,v}. Then it holds that

Now Theorem 2.2 and a simple calculation ensures that

Next we show how Theorem 2.7 can be used to leverage bounds on the approximation error in order to obtain quantitative bounds on the generalization error.

Thus, Corollary 2.11 is a direct consequence of Corollary 2.9.

Applications for the numerical approximation of high-dimensional PDEs

In the present section we apply the general results of Section 2 to the numerical solution of high-dimensional Kolmogorov equations.

φd(x)F(ηd,ε)(x)ε(1+x2ν),|\varphi_{d}(x)-{\mathcal{F}}({\bm{\eta}}_{d,\varepsilon})(x)|\leq\varepsilon(1+\|x\|_{{2}}^{\nu}),

F(ηd,ε)(x)D|{\mathcal{F}}({\bm{\eta}}_{d,\varepsilon})(x)|\leq D,

ηd,εζdβεκ,\|{\bm{\eta}}_{d,\varepsilon}\|_{\infty}\leq\zeta d^{\beta}\varepsilon^{-\kappa}, and

P(bd,ε)ζdγελ.{P}(\mathbf{b}_{d,\varepsilon})\leq\zeta d^{\gamma}\varepsilon^{-\lambda}.

FdF_{d} is a viscosity solution of the dd-dimensional Kolmogorov equation

the unique (Gt)({\mathcal{G}}_{t})-adapted stochastic processThe solution process is unique up to indistinguishability; see, for instance, [5, Theorem 6.2.2]. with continuous sample paths satisfying the SDE

The result is based on work from and the following formal calculation:

for a.e. x[u,v]dx\in[u,v]^{d}; see Appendix A.7 for a rigorous proof.

2 Neural network generalization results for solutions of Kolmogorov equations

We first show that the end value [u,v]dxFd(T,x)[u,v]^{d}\ni x\mapsto F_{d}(T,x) of the solution to the Kolmogorov equation can be approximated by hypothesis classes consisting of clipped ReLU networks.

\tfrac{1}{(v-u)^{d}}\big{\|}g-F_{d}(T,{\cdot})\big{\|}_{{\mathcal{L}}^{2}([u,v]^{d})}^{2}\leq\varepsilon,

P(a)Cdνλ/2+γελ/22,{P}({\mathbf{a}})\leq Cd^{\nu\lambda/2+\gamma}\varepsilon^{-\lambda/2-2},

RCd(νκ+3)/2+βεκ/21,R\leq Cd^{(\nu\kappa+3)/2+\beta}\varepsilon^{-\kappa/2-1},

L(a)=L(bd,cdν/2ε1/2){L}({\mathbf{a}})={L}(\mathbf{b}_{d,cd^{-\nu/2}\varepsilon^{1/2}}), and

aCε1bd,cdν/2ε1/2.\|{\mathbf{a}}\|_{\infty}\leq C\varepsilon^{-1}\|\mathbf{b}_{d,cd^{-\nu/2}\varepsilon^{1/2}}\|_{\infty}.

can be decomposed into the sum of the squared bias and the variance, i.e.,

and we bound the parameter magnitudes of θ{\bm{\theta}} with the help of Lemma A.12.

Observe that the approximation result in Theorem 3.2 does not underlie the curse of dimensionality, and by Corollary 2.9 we can establish a generalization result that is free of the curse of dimensionality.

there exist aAd{\mathbf{a}}\in{\mathbf{A}}_{d} and R[1,)R\in[1,\infty) such that it holds that

P(a)Cdνλ/2+γελ/22,{P}({\mathbf{a}})\leq Cd^{\nu\lambda/2+\gamma}\varepsilon^{-\lambda/2-2},

RCd(νκ+3)/2+βεκ/21,R\leq Cd^{(\nu\kappa+3)/2+\beta}\varepsilon^{-\kappa/2-1},

L(a)=L(bd,cdν/2ε1/2){L}({\mathbf{a}})={L}(\mathbf{b}_{d,cd^{-\nu/2}\varepsilon^{1/2}}), and

aCε1bd,cdν/2ε1/2\|{\mathbf{a}}\|_{\infty}\leq C\varepsilon^{-1}\|\mathbf{b}_{d,cd^{-\nu/2}\varepsilon^{1/2}}\|_{\infty},

where H=Na,R,Du,v.{\mathcal{H}}=\mathcal{N}_{{\mathbf{a}},R,{D}}^{u,v}.

This is a direct consequence of Theorem 3.2 (with εε/2\varepsilon\leftarrow\varepsilon/2) and Corollary 2.9.

We can also reformulate this in a more compact form.

there exist aAd{\mathbf{a}}\in{\mathbf{A}}_{d} and R[1,)R\in[1,\infty) such that it holds that

max{R,P(a)}p(d,ε1)\max\{R,{P}({\mathbf{a}})\}\leq p(d,\varepsilon^{-1}),

where H=Na,R,Du,v{\mathcal{H}}=\mathcal{N}_{{\mathbf{a}},R,{D}}^{u,v}.

This follows directly from Theorem 3.2 and Corollary 2.11.

3 Pricing of high-dimensional options

The proof of Theorem 1.1 from the introductory section dealing with the pricing of high-dimensional European put options is now an easy consequence of the above theory.

Accordingly, Setting 3.1 is satisfied with

Now Theorem 3.3 and a straightforward calculation prove the claim.

Appendix A Proofs

This appendix contains various proofs and additional material omitted from the main text.

The following lemma shows that the empirical regression function can be chosen measurable as required in Setting 2.1. This implies that the risk of the empirical regression function \Omega\ni\omega\mapsto{\mathcal{E}}_{d}\big{(}\widehat{f}_{d,m,{\mathcal{H}}}(\omega)\big{)} is measurable which is necessary for bounding the generalization error in Theorem 2.2.

in a way, such that it holds thatWe denote by B(Z){\mathcal{B}}(\mathcal{Z}) the Borel σ\sigma-algebra of a topological space Z\mathcal{Z}.

Ωωf^d,m,H(ω)\Omega\ni\omega\mapsto\widehat{f}_{d,m,{\mathcal{H}}}(\omega) is G{\mathcal{G}}/B(H){\mathcal{B}}({\mathcal{H}})-measurable and

First observe that H{\mathcal{H}} is a separable metric space induced by the uniform norm L\|{\cdot}\|_{{\mathcal{L}}^{\infty}} and that for every fHf\in{\mathcal{H}} the mapping

This shows that for every ωΩ\omega\in\Omega the function HfE^d,m(f)(ω){\mathcal{H}}\ni f\mapsto\widehat{{\mathcal{E}}}_{d,m}(f)(\omega) is continuous and the Measurable Maximum Theorem in [1, Theorem 18.19] ensures that the set-valued function of minimizers of

admits a measurable selector. That is to say, there exists a G{\mathcal{G}}/B(H){\mathcal{B}}({\mathcal{H}})-measurable mapping f^d,m,H ⁣:ΩH\widehat{f}_{d,m,\mathcal{H}}\colon\Omega\to{\mathcal{H}} such that for every ωΩ\omega\in\Omega it holds that

This yields the claim as compositions of measurable functions are again measurable.

A.2 Bias-variance decomposition

This proves items (i) and (ii) and shows that it holds that

Finally, applying (12) (with ffd,Hf\leftarrow f_{d,{\mathcal{H}}}) proves the lemma.

A.3 Bound on the generalization error

First note that by assumption for every fHf\in{\mathcal{H}} it holds that

and analogously for the samples ((Xd(i),Yd(i)))i=1m((X_{d}^{(i)},Y_{d}^{(i)}))_{i=1}^{m}. The elementary identity

Now define N:=\operatorname{Cov}\big{(}\mathcal{H},\|{\cdot}\|_{{\mathcal{L}}^{\infty}},\frac{\varepsilon}{32{D}}\big{)} and choose f1,f2,,fNHf_{1},f_{2},\dots,f_{N}\in\mathcal{H} such that the balls

cover H\mathcal{H}. This establishes that for every i{1,2,,N}i\in\{1,2,\dots,N\}, fBallif\in\mathit{Ball}_{i} it holds that

Our assumptions yield that for every ωΩ\omega\in\Omega it holds that

Observe that for fixed fHf\in{\mathcal{H}} it holds that the random variables E_{i}:=\big{(}f(X^{(i)}_{d})-Y^{(i)}_{d}\big{)}^{2}, i{1,2,,m}i\in\{1,2,\dots,m\}, are independent and satisfy

which by Hoeffding’s inequality (see [36, Theorem 2]) ensures that

Together with (LABEL:eq:subsets), the monotonicity and subadditivity of the probability measure, and the measurability assumptions according to Lemma A.1 this implies that

Using the complement rule and plugging in the definition of NN proves the theorem.

A.4 Clipped neural networks are standard neural networks

We show that “clipped” neural network hypothesis classes Na,R,Du,v{\mathcal{N}}_{{\mathbf{a}},R,{D}}^{u,v} are in fact subsets of “non-clipped” ones.

The proof follows by the representation of the clipping function in Lemma A.5 and the fact that composition with a neural network does not change the magnitude of its parameters;Because of that we did not choose the easier representation {\operatorname{clip}}_{{D}}(x)=-{D}+\operatorname{ReLU}_{*}\big{(}2{D}-\operatorname{ReLU}_{*}({D}-x)\big{)}. see also [54, Definition 2.2] for a formal definition.

A.5 Lipschitz continuity of the realization map

Define L:=L(a)L:={L}({\mathbf{a}}) and \mathfrak{m}:=\max\big{\{}1,|u|,|v|\big{\}}. We will show the following stronger statement. For every θ,ηPa,R{\bm{\theta}},{\bm{\eta}}\in{\mathcal{P}}_{{\mathbf{a}},R} it holds that

This directly implies the statement of Theorem 2.3, as it holds that

For the proof of (16) let us fix θ,ηPa,R{\bm{\theta}},{\bm{\eta}}\in{\mathcal{P}}_{{\mathbf{a}},R} given by

Let r:=θη{r}:=\|{\bm{\theta}}-{\bm{\eta}}\|_{\infty}, and for every s{1,,L}s\in\{1,\dots,L\} define the partial parametrizations

We are interested in estimating the error eL\mathfrak{e}_{L} and try to bound ms\mathfrak{m}_{s} relative to ms1\mathfrak{m}_{s-1} and es\mathfrak{e}_{s} relative to es1\mathfrak{e}_{s-1}. Note that for every s{2,,L}s\in\{2,\dots,L\} it holds that

Analogous computations for the case s=1s=1 and the function gsg_{s} establish that for every s{1,,L}s\in\{1,\dots,L\} it holds that msRams1+R\mathfrak{m}_{s}\leq R\|{\mathbf{a}}\|_{\infty}\mathfrak{m}_{s-1}+R. By induction this implies that

for every s{1,,L}s\in\{1,\dots,L\}. Moreover, note that for every s{2,,L}s\in\{2,\dots,L\} it holds that

Together with (17), one proves by induction that for every s{1,2,,L}s\in\{1,2,\dots,L\} it holds that

A.6 Covering numbers of neural network hypothesis classes

To simplify the notation we define \mathfrak{m}:=\max\big{\{}1,|u|,|v|\big{\}},

Choose θ1,θ2,,θNPa,R{\bm{\theta}}_{1},{\bm{\theta}}_{2},\dots,{\bm{\theta}}_{N}\in{\mathcal{P}}_{{\mathbf{a}},R} such that for every θPa,R{\bm{\theta}}\in{\mathcal{P}}_{{\mathbf{a}},R} there exists i{1,2,,N}i\in\{1,2,\dots,N\} with θθiΔ\|{\bm{\theta}}-{\bm{\theta}}_{i}\|_{\infty}\leq\Delta, which by Theorem 2.3 and the fact that clipD{\operatorname{clip}}_{{D}} is nonexpansive implies that

A.7 Kolmogorov equation as learning problem

see [29, Corollary 2.23(ii)]. We claim that for every AB([u,v]d)A\in\mathcal{B}([u,v]^{d}) it holds that

which by (18) and Setting 3.1 ensures that for a.e. x[u,v]dx\in[u,v]^{d} it holds that

In [10, Lemma 2.6(v)] it is shown that for every ε(0,1)\varepsilon\in(0,1) it holds that

A.8 Neural network approximation result for solutions of Kolmogorov equations

The proof of Theorem 3.2 is given after the following two auxiliary lemmas. First, we show that given an SDE with affine coefficients σd\sigma_{d} and μd\mu_{d}, its solution STxS^{x}_{T} also admits a (random) affine representation.

see [29, Proposition 2.14]. Together with the facts that it holds that

In the next lemma we show that the average of the composition of a neural network with different affine functions can be represented by a single neural network and we bound the number and size of its parameters.

Then there exist aAd{\mathbf{a}}\in{\mathbf{A}}_{d} and θPa{\bm{\theta}}\in{\mathcal{P}}_{\mathbf{a}} such that it holds that

F(θ)=1nj=1nF(η)AM(j),N(j){\mathcal{F}}({\bm{\theta}})=\frac{1}{n}\sum_{j=1}^{n}{\mathcal{F}}({\bm{\eta}})\circ{\mathcal{A}}_{M^{(j)},N^{(j)}},

P(a)n2P(b){P}({\mathbf{a}})\leq n^{2}{P}(\mathbf{b}),

θdηmaxj=1n(M(j)2+N(j)2+1)\|{\bm{\theta}}\|_{\infty}\leq\sqrt{d}\|{\bm{\eta}}\|_{\infty}\max_{j=1}^{n}\left(\|M^{(j)}\|_{{2}}+\|N^{(j)}\|_{{2}}+1\right),

a=nb\|{\mathbf{a}}\|_{\infty}=n\|\mathbf{b}\|_{\infty}.

With the exception of item (iii) this result is proven in [29, Lemma 3.8]. There it is shown that for η=((Vl,Al))l=1L{\bm{\eta}}=((V_{l},A_{l}))_{l=1}^{L} a suitable parametrization θ=((Wl,Bl))l=1L{\bm{\theta}}=((W_{l},B_{l}))_{l=1}^{L} is given by WL:=[1nVL1nVL1nVL]W_{L}:=\begin{bmatrix}\tfrac{1}{n}V_{L}&\tfrac{1}{n}V_{L}&\dots&\tfrac{1}{n}V_{L}\end{bmatrix}, BL:=ALB_{L}:=A_{L},

We now use techniques from [29, Proof of Proposition 3.4] to show that the random variable G ⁣:Ω[0,)G\colon\Omega\to[0,\infty), given by

Next, observe that Lemma A.12 establishes that

By Lemma A.14, our assumptions, and (19) there exist aAd\mathbf{a}\in{\mathbf{A}}_{d} and θPa{\bm{\theta}}\in{\mathcal{P}}_{\mathbf{a}} satisfying the following:

clipDF(θ)=F(θ)=1nj=1nF(ηd,δ)AM(j),N(j)=1nj=1ngAM(j),N(j){\operatorname{clip}}_{{D}}\circ{\mathcal{F}}\left({\bm{\theta}}\right)={\mathcal{F}}\left({\bm{\theta}}\right)=\tfrac{1}{n}\sum_{j=1}^{n}{\mathcal{F}}({\bm{\eta}}_{d,{\delta}})\circ{\mathcal{A}}_{M^{(j)},N^{(j)}}=\tfrac{1}{n}\sum_{j=1}^{n}g\circ{\mathcal{A}}_{M^{(j)},N^{(j)}};

P(a)n2P(bd,δ)322D4ζdγε2δλCdνλ/2+γελ/22{P}({\mathbf{a}})\leq n^{2}{P}(\mathbf{b}_{d,{\delta}})\leq 32^{2}D^{4}\zeta d^{\gamma}\varepsilon^{-2}{\delta}^{-\lambda}\leq Cd^{\nu\lambda/2+\gamma}\varepsilon^{-\lambda/2-2};

θdηd,δ(192D2c1(1)dε1+1)Cd(νκ+3)/2+βϵκ/21\|{\bm{\theta}}\|_{\infty}\leq\sqrt{d}\|{\bm{\eta}}_{d,{\delta}}\|_{\infty}\left(192D^{2}\mathfrak{c}_{1}(1)d\varepsilon^{-1}+1\right)\leq Cd^{(\nu\kappa+3)/2+\beta}\epsilon^{-\kappa/2-1};

L(a)=L(bd,δ)=L(bd,cdν/2ε1/2){L}({\mathbf{a}})={L}(\mathbf{b}_{d,{\delta}})={L}(\mathbf{b}_{d,cd^{-\nu/2}\varepsilon^{1/2}}); and

a=nbd,δ32D2ε1bd,δCε1bd,cdν/2ε1/2\|{\mathbf{a}}\|_{\infty}=n\|\mathbf{b}_{d,{\delta}}\|_{\infty}\leq 32D^{2}\varepsilon^{-1}\|\mathbf{b}_{d,{\delta}}\|_{\infty}\leq C\varepsilon^{-1}\|\mathbf{b}_{d,cd^{-\nu/2}\varepsilon^{1/2}}\|_{\infty},

where C:=\zeta\max\big{\{}32^{2}D^{4}\big{(}8\mathfrak{c}_{\nu}(\mathfrak{m})\big{)}^{\lambda},\big{(}192D^{2}\mathfrak{c}_{1}(1)+1\big{)}\big{(}8\mathfrak{c}_{\nu}(\mathfrak{m})\big{)}^{\kappa}\big{\}} and c:=\big{(}8\mathfrak{c}_{\nu}(\mathfrak{m})\big{)}^{-1}. Together with (21) this proves the theorem.

Acknowledgements

The authors are grateful to Shahar Mendelson and Stefan Steinerberger for their useful comments.

References