Analysis of the Generalization Error: Empirical Risk Minimization over Deep Artificial Neural Networks Overcomes the Curse of Dimensionality in the Numerical Approximation of Black-Scholes Partial Differential Equations
Julius Berner, Philipp Grohs, Arnulf Jentzen
Introduction
In this introductory section we want to present and motivate our problem, provide the reader with background knowledge and references to previous research on the topic, and outline the important steps, as well as possible extensions, of our contribution.
where is a stochastic process satisfying the stochastic differential equation (SDE)
If the goal, however, is to approximate not only at a single value but, for example, on a full hypercube , there has been no known method which does not suffer from the curse of dimensionality. In particular, it has been completely out of range to provably approximate on in high dimensions, say, .
The present paper introduces and analyzes deep learning–based algorithms for the numerical approximation of on a full hypercube . We will prove that the resulting algorithms overcome the curse of dimensionality and can consequently be efficiently applied even in high dimensions. Our proofs will be based on tools from statistical learning theory and the following key properties of linear Kolmogorov equations:
The fact that one can reformulate (1) as a learning problem (see Lemma 3.1).
The fact that typical initial values arising from problems in computational finance, such as, for example, (2), are either exactly representable as neural networks with ReLU activation function (ReLU networks) or can be approximated by such neural networks without incurring the curse of dimensionality (see [29, Section 4]).
The fact that Property P.2 is preserved under the evolution of linear Kolmogorov equations (1) with affine diffusion and drift coefficients, which implies that can be approximated by ReLU networks without incurring the curse of dimensionality (see Theorem 3.2).
2 Deep learning and statistical learning theory
In their most basic incarnation, deep learning–based algorithms start with training data
To give a concrete example, may consist of different pixel grayscale images of handwritten digits and may consist of corresponding probabilities describing the likelihood of a certain digit to be shown in image . The goal is then to find a functional relation between images and labels and use it for predictive purposes on unseen images.
Empirical risk minimization (ERM) attempts to solve this prediction problem by minimizing the empirical risk
where and , i.e., is applied componentwise. Then, neural network hypothesis classes are typically of the form
Despite the great practical success of the “deep learning paradigm,” for generic real world training data it is far out of reach to specify network architectures which guarantee a desired performance on unseen data; see also .
This type of problem can be theoretically studied using tools developed within the field of statistical learning theory. There it is typically postulated that are i.i.d. samples drawn from the distribution of some (unknown) data and that the optimal functional relation between and is given by the regression function
statistical learning problem with data and quadratic loss function.
Under strong regularity assumptions on the regression function and the distribution of it is possible to obtain bounds on the sample size and the “complexity” of the hypothesis class in order to guarantee, with high probability, an error
see, for example, . In the above, the regularity of the regression function quantifies how well can be approximated by the hypothesis class .
In the case of the neural network hypothesis classes these regularity assumptions are met if satisfies certain smoothness assumptions; see . Moreover, the complexity of can mainly be described by the size of the neural network parametrizations, i.e., the number of network parameters
We will show in Subsection 1.3 below that our specific deep learning–based method for numerically solving Kolmogorov equations allows us to rigorously apply tools from statistical learning theory, as we can overcome the following potential problems:
The crucial assumption that the training data consists of i.i.d. samples drawn from an underlying probability distribution is usually debatable or at least hard to verify.
Even if this assumption were satisfied, the underlying distribution of is typically unknown. Thus, it is hard to ensure a priori the regularity assumptions on , which are needed to apply tools from statistical learning theory.
Most of the classical techniques operate in an asymptotic regime where the number of training samples exceeds the network size . However, in many applications the number of training samples is fixed, and it is not possible to generate more training data at will.
3 Kolmogorov equations as learning problem
We will reformulate the numerical approximation of on as a statistical learning problem and demonstrate that in this specific case none of the aforementioned Problems R.1–R.4 appears. Let be uniformly distributed on and define , where is a stochastic process satisfying the SDE
Under suitable conditions it then follows from the Feynman–Kac formula (3) that is the minimizer of the risk functional , that is, for a.e. ; see Lemma 3.1. As outlined in Subsection 1.2, we thus have that
the end value of the Kolmogorov equation is the solution to the statistical learning problem with data and quadratic loss function.
A natural next step is to apply the deep learning paradigm, that is, for i.i.d. samples drawn from the distribution of to minimize the empirical risk (4) over a hypothesis class of neural networks .
In this idea has been implemented with suitable classes of deep neural networks of a given architecture as hypothesis class . In extensive numerical simulations it was observed that the proposed algorithm is efficient even in very high dimensions, suggesting that it does not suffer from the curse of dimensionality. Similar conclusions can be found in related work , which covers topics ranging from American option pricing problems to fully nonlinear PDEs. Note that in a nonquantitative analysis of the approximation error is given; all the other works are purely empirical. To the best of our knowledge, this is the first joint quantitative analysis of approximation and generalization error confirming the efficiency of deep learning–based methods applied to the numerical solution of high-dimensional PDEs.
Two main parameters influence the complexity of the algorithm described above: the number of network parameters that need to be optimized as well as the number of training samples needed to guarantee that, with high probability, the estimate
holds true. We are interested in the scaling of and with respect to the precision and dimension .
Observe that the data distribution is now explicitly known and i.i.d. samples of this distribution can be simulated as needed ( is uniformly distributed and can be simulated using a suitable random number generator, and can be simulated by any numerical solver for the SDE (7); see ). Moreover, the uniform distribution of the input data gives rise to typical -error estimates (8) and, in the case of affine coefficients and suitable initial values, we can establish bounds on how well the regression function can be approximated by neural network hypothesis classes; see Theorem 3.2. In particular, contrary to conventional learning problems, in the statistical learning problem that arises from our reformulation of the Kolmogorov equation, none of the Problems R.1–R.4 described in Subsection 1.2 occurs. We will therefore be able to rigorously invoke tools from statistical learning theory to obtain bounds on the quantities and above.
4 Contribution
see Corollary 3.5. We conclude that the aforementioned deep learning–based algorithm does not suffer from the curse of dimensionality.
We briefly describe our proof strategy for bounding the error between the empirical risk minimizer and the end value of the Kolmogorov equation
as in (8). By the so-called bias-variance decomposition we can represent this error as the sum of a generalization error and an approximation error, i.e.,
see Lemma 2.1. Bounds on the size of neural networks with ReLU activation function
see Proposition 2.6. In conjunction with Hoeffding’s inequality this allows us to uniformly (over the hypothesis class of neural networks) bound the error between the risk and the empirical risk. However, this requires that the regression function as well as all functions in are uniformly bounded. To that end we assume the initial value to be bounded, which by the Feynman–Kac formula (3) implies that also the function is bounded. Moreover, we introduce hypothesis classes of “clipped” neural networks
Note that there exist different concepts and known results in order to bound the generalization error (see, for instance, ). The present paper intends to stress the interplay between the approximation and generalization error and gives a complete proof in order to rigorously show the absence of the curse of dimensionality for our particular problem.
We are now ready to formulate a first specific result of this paper as an appetizer. It demonstrates that deep learning–based ERM succeeds in solving the option pricing problem for European put options without incurring the curse of dimensionality.
(number of parameters),
(parameter bound), and
(size of the architecture),
Note that our analysis does not consider the computational cost of solving the nonsmooth, nonconvex ERM problem (4). This is typically achieved by stochastic first order optimization methods whose theoretical analysis is still an open problem. While there are many interesting approaches to the latter question, they tend to require very strong assumptions (e.g., (almost) linearity, convexity, extreme overparametrization, or inverse stability of ), which we want to avoid in our analysis.
5 Extensions
Our results in Section 3, in conjunction with the results of , can be applied to prove the absence of the curse of dimensionality in the pricing of (capped) basket call, basket put, call on max, and call on min options. Moreover, the results of Section 2 hold for a general statistical learning problem within Setting 2.1. That is, ReLU network approximation results for the regression function translate directly into generalization results without incurring the curse of dimensionality. If suitable learning problems can be established, this work can extend various neural network approximation results for PDEs (see, e.g., ) to also consider the generalization error and get one step closer to a full error analysis. For instance, there are stronger approximation results for more restricted option pricing problems , and there are very recent approximation results for semilinear heat equations and Kolmogorov equations with (time-inhomogeneous) nonlinear coefficients where the dependence on the dimension is polynomial. Using a generalized version of Lemma 3.1, the findings of this paper can be used to prove that the corresponding ERM problem achieves, with high probability, a desired accuracy with the number of samples and the size of the hypothesis class scaling only polynomially in and . In particular this means that the presented methods are not restricted to the case of linear Kolmogorov equations with affine drift and diffusion coefficients. Finally, note that one obtains similar results for any continuous piecewise linear activation function with a finite number of breakpoints; see the comment after Theorem 2.3 and [62, Proposition 1].
6 Outline
Results in statistical learning theory
The present section develops generalization bounds for ERM problems in the spirit of .
such that the mapping is measurable.
We want to emphasize that the minima in (9) and (10) will be attained due to the compactness of our hypothesis class, but they need not be unique. We require the mapping to be measurable in order to view the risk of the empirical regression function as a random variable \Omega\ni\omega\mapsto{\mathcal{E}}_{d}\big{(}\widehat{f}_{d,m,{\mathcal{H}}}(\omega)\big{)}. This ensures that the probability in the generalization error bound (Theorem 2.2) is well-defined. While this technical assumption is often not explicitly stated in the literature on statistical learning theory it is actually crucial for analyzing the generalization error. In our setting (by choosing a suitable minimizer) measurability can indeed be satisfied; see Appendix A.1.
We now introduce the concept of covering numbers in order to bound the generalization error. {setting}[covering number] For every , every normed vector space , and every compact subset we define the -covering number of w.r.t. by
where denotes the ball of radius around . If the norm is clear from the context, we will use the abbreviation . Assume that the functions in our hypothesis class are uniformly bounded and that balls of radius around the functions cover . We can then use the (uniform) Lipschitz continuity of the (empirical) risk to bound the generalization error by
and employ Hoeffding’s inequality and a union bound to obtain the following estimate.
This result is adapted from [4, Theorem 17.1], [19, Theorem 3.14], and [49, Example 3.31], and for the sake of completeness we present a detailed proof in Appendix A.3.
2 Covering numbers of neural network hypothesis classes
As a natural next step we prove estimates on the covering numbers of neural network hypothesis classes in order to leverage the result of Theorem 2.2. Note that for different assumptions (i.e., boundedness assumptions on the activation function, different norms on the parameters, or evaluation of the neural networks on input data) similar approaches can be found in .
(number of layers), and
and the hypothesis class of clipped neural networks
The hypothesis classes are somewhat nonstandard in the sense that the clipping function is applied to the output of a neural network realization. The reason for our choice of this definition is that Theorem 2.2 requires that the set of neural networks over which the ERM problem is solved consists of uniformly bounded functions. In Appendix A.4 we show that the clipping function can be represented as a small neural network, which implies that the seemingly nonstandard classes are actually conventional neural network classes that can be trained with standard methods .
The next theorem quantifies the Lipschitz continuity of the operator
which maps bounded neural network parametrizations with fixed architecture to the corresponding realization functions (restricted to ).
The claim follows by a simple counting argument.
Together with Theorem 2.3 this allows us to bound the covering number of our hypothesis class of (clipped) neural networks.
The proof in Appendix A.6 is based on the behavior of covering numbers under the action of a Lipschitz function, i.e.,
3 Analysis of the generalization error
Combining Theorem 2.2 and Proposition 2.6, the following theorem describes our main result related to the generalization capabilities of hypothesis classes consisting of clipped ReLU networks.
and define . Then it holds that
Now Theorem 2.2 and a simple calculation ensures that
Next we show how Theorem 2.7 can be used to leverage bounds on the approximation error in order to obtain quantitative bounds on the generalization error.
Thus, Corollary 2.11 is a direct consequence of Corollary 2.9.
Applications for the numerical approximation of high-dimensional PDEs
In the present section we apply the general results of Section 2 to the numerical solution of high-dimensional Kolmogorov equations.
,
and
is a viscosity solution of the -dimensional Kolmogorov equation
the unique -adapted stochastic processThe solution process is unique up to indistinguishability; see, for instance, [5, Theorem 6.2.2]. with continuous sample paths satisfying the SDE
The result is based on work from and the following formal calculation:
for a.e. ; see Appendix A.7 for a rigorous proof.
2 Neural network generalization results for solutions of Kolmogorov equations
We first show that the end value of the solution to the Kolmogorov equation can be approximated by hypothesis classes consisting of clipped ReLU networks.
\tfrac{1}{(v-u)^{d}}\big{\|}g-F_{d}(T,{\cdot})\big{\|}_{{\mathcal{L}}^{2}([u,v]^{d})}^{2}\leq\varepsilon,
, and
can be decomposed into the sum of the squared bias and the variance, i.e.,
and we bound the parameter magnitudes of with the help of Lemma A.12.
Observe that the approximation result in Theorem 3.2 does not underlie the curse of dimensionality, and by Corollary 2.9 we can establish a generalization result that is free of the curse of dimensionality.
there exist and such that it holds that
, and
,
where
This is a direct consequence of Theorem 3.2 (with ) and Corollary 2.9.
We can also reformulate this in a more compact form.
there exist and such that it holds that
,
where .
This follows directly from Theorem 3.2 and Corollary 2.11.
3 Pricing of high-dimensional options
The proof of Theorem 1.1 from the introductory section dealing with the pricing of high-dimensional European put options is now an easy consequence of the above theory.
Accordingly, Setting 3.1 is satisfied with
Now Theorem 3.3 and a straightforward calculation prove the claim.
Appendix A Proofs
This appendix contains various proofs and additional material omitted from the main text.
The following lemma shows that the empirical regression function can be chosen measurable as required in Setting 2.1. This implies that the risk of the empirical regression function \Omega\ni\omega\mapsto{\mathcal{E}}_{d}\big{(}\widehat{f}_{d,m,{\mathcal{H}}}(\omega)\big{)} is measurable which is necessary for bounding the generalization error in Theorem 2.2.
in a way, such that it holds thatWe denote by the Borel -algebra of a topological space .
is /-measurable and
First observe that is a separable metric space induced by the uniform norm and that for every the mapping
This shows that for every the function is continuous and the Measurable Maximum Theorem in [1, Theorem 18.19] ensures that the set-valued function of minimizers of
admits a measurable selector. That is to say, there exists a /-measurable mapping such that for every it holds that
This yields the claim as compositions of measurable functions are again measurable.
A.2 Bias-variance decomposition
This proves items (i) and (ii) and shows that it holds that
Finally, applying (12) (with ) proves the lemma.
A.3 Bound on the generalization error
First note that by assumption for every it holds that
and analogously for the samples . The elementary identity
Now define N:=\operatorname{Cov}\big{(}\mathcal{H},\|{\cdot}\|_{{\mathcal{L}}^{\infty}},\frac{\varepsilon}{32{D}}\big{)} and choose such that the balls
cover . This establishes that for every , it holds that
Our assumptions yield that for every it holds that
Observe that for fixed it holds that the random variables E_{i}:=\big{(}f(X^{(i)}_{d})-Y^{(i)}_{d}\big{)}^{2}, , are independent and satisfy
which by Hoeffding’s inequality (see [36, Theorem 2]) ensures that
Together with (LABEL:eq:subsets), the monotonicity and subadditivity of the probability measure, and the measurability assumptions according to Lemma A.1 this implies that
Using the complement rule and plugging in the definition of proves the theorem.
A.4 Clipped neural networks are standard neural networks
We show that “clipped” neural network hypothesis classes are in fact subsets of “non-clipped” ones.
The proof follows by the representation of the clipping function in Lemma A.5 and the fact that composition with a neural network does not change the magnitude of its parameters;Because of that we did not choose the easier representation {\operatorname{clip}}_{{D}}(x)=-{D}+\operatorname{ReLU}_{*}\big{(}2{D}-\operatorname{ReLU}_{*}({D}-x)\big{)}. see also [54, Definition 2.2] for a formal definition.
A.5 Lipschitz continuity of the realization map
Define and \mathfrak{m}:=\max\big{\{}1,|u|,|v|\big{\}}. We will show the following stronger statement. For every it holds that
This directly implies the statement of Theorem 2.3, as it holds that
For the proof of (16) let us fix given by
Let , and for every define the partial parametrizations
We are interested in estimating the error and try to bound relative to and relative to . Note that for every it holds that
Analogous computations for the case and the function establish that for every it holds that . By induction this implies that
for every . Moreover, note that for every it holds that
Together with (17), one proves by induction that for every it holds that
A.6 Covering numbers of neural network hypothesis classes
To simplify the notation we define \mathfrak{m}:=\max\big{\{}1,|u|,|v|\big{\}},
Choose such that for every there exists with , which by Theorem 2.3 and the fact that is nonexpansive implies that
A.7 Kolmogorov equation as learning problem
see [29, Corollary 2.23(ii)]. We claim that for every it holds that
which by (18) and Setting 3.1 ensures that for a.e. it holds that
In [10, Lemma 2.6(v)] it is shown that for every it holds that
A.8 Neural network approximation result for solutions of Kolmogorov equations
The proof of Theorem 3.2 is given after the following two auxiliary lemmas. First, we show that given an SDE with affine coefficients and , its solution also admits a (random) affine representation.
see [29, Proposition 2.14]. Together with the facts that it holds that
In the next lemma we show that the average of the composition of a neural network with different affine functions can be represented by a single neural network and we bound the number and size of its parameters.
Then there exist and such that it holds that
,
,
,
.
With the exception of item (iii) this result is proven in [29, Lemma 3.8]. There it is shown that for a suitable parametrization is given by , ,
We now use techniques from [29, Proof of Proposition 3.4] to show that the random variable , given by
Next, observe that Lemma A.12 establishes that
By Lemma A.14, our assumptions, and (19) there exist and satisfying the following:
;
;
;
; and
,
where C:=\zeta\max\big{\{}32^{2}D^{4}\big{(}8\mathfrak{c}_{\nu}(\mathfrak{m})\big{)}^{\lambda},\big{(}192D^{2}\mathfrak{c}_{1}(1)+1\big{)}\big{(}8\mathfrak{c}_{\nu}(\mathfrak{m})\big{)}^{\kappa}\big{\}} and c:=\big{(}8\mathfrak{c}_{\nu}(\mathfrak{m})\big{)}^{-1}. Together with (21) this proves the theorem.
Acknowledgements
The authors are grateful to Shahar Mendelson and Stefan Steinerberger for their useful comments.