ReLU Deep Neural Networks and Linear Finite Elements

Juncai He, Lin Li, Jinchao Xu, Chunyue Zheng

Introduction

In recent years, deep learning models have achieved unprecedented success in various tasks of machine learning or artificial intelligence, such as computer vision, natural language processing and reinforcement learning . One main technique in deep learning is deep neural network. A typical DNN model is based on a hierarchy of composition of linear functions and a given nonlinear activation function. However, why DNN models can work so well is still unclear.

Mathematical analysis of DNN can be carried out using many different approaches. One approach is to study the approximation properties of the function class provided by DNN. The approximation property of DNN is relevant to the so-called expressive power of a DNN model. Early studies of approximation properties of DNN can be traced back in and where the authors established some approximation properties for the function classes given by a feedforward neural network with a single hidden layer. Further error estimates for such neural networks in terms of number of neurons can be found in for sinusoidal activation functions and in for more general sigmoidal activation functions. There are many other papers on this topic during the 90s and a good review of relevant works can be found in and .

There are many different choices of activation functions. In fact, as shown in , a neural network with a single hidden layer can approximate any continuous function for any activation function which is not a polynomial. Among all the activation functions, the so-called rectified linear unit (ReLU) activation function , namely \mboxReLU(x)=max(x,0)\mbox{{\rm ReLU}}(x)=\max(x,0), has emerged to be one of the most popular activation functions used in the deep learning literature and applications. presents an approximation of ReLU DNNs by relating to wavelets. Recently, establish LL^{\infty} and L2L^{2} error bounds for functions of many variables that are approximated by linear combinations of ReLU. presents rates of approximation by deep CNNs for functions in the Sobolev space Hr(Ω)H^{r}(\Omega) with r>2+d/2r>2+d/2. This paper is devoted to some further mathematical analysis of DNN models with ReLU as the activation function.

Motivated by this result, we study the following two questions on the DNN representation of a given CWPL function:

What is the minimal number of layers that are needed?

To answer the first question, in this paper, we will go through the proof of this representation result to give some explicit estimations of the number of neurons that are needed in a DNN to represent a given CPWL function. As a result, we find that the number of neurons that are needed for a DNN to represent a CPWL on mm-subdomains can be as large as O(d2mm!)\mathcal{O}(d2^{mm!})!

In real applications, many efforts have been made to compress the deep neural networks by using heavily quantized weights, c.f. . Especially, binary and ternary weight models not only give high model compression rate, but also eliminate the need of most floating-point multiplications during interface phase. In particular, for some small data sets such as MINST and CIFAR-10 , the ternary CNNs are shown to have the same accuracy as that of the original CNN. Using the special structure for representing any CPWL functions by ReLU DNNs, we provide certain theoretical justification of the use of ternary CNNs. Furthermore, we also present a modified version of those models with some rigorous mathematical justifications.

Another topic that will be investigated in the paper is the application of artificial neural networks for differential equations. This topic can be traced back to in which collocation methods are studied. Recently, there are increased new research interests in the literature for the application of deep neural networks for numerical approximation of nonlinear and high dimensional PDEs as in . Based on our result about the relationship between FEM and ReLU DNNs, we discuss the application of ReLU DNN for solving PDEs with respect to the convergence properties. In particular, we use an 1D example to demonstrate that a Galerkin method using ReLU DNN can lead to better approximation result than adaptive finite element method that has exactly the same number of degrees of freedom as in the ReLU DNN.

Deep neural network (DNN) generated by ReLU

In this section, we briefly discuss the definition and properties of the deep neural networks generated by using ReLU as the activation function.

Given n,m1n,m\geq 1, the first ingredient in defining a deep neural network (DNN) is (vector) linear functions of the form

The second main ingredient is a nonlinear activation function, usually denoted as

By applying the function to each component, we can extend this naturally to

with f0(x)=Θ(x)f^{0}(x)=\Theta(x). The following more concise notation is often used in computer science literature:

A ReLU DNN with kk hidden layers might be written as:

We note that ReLU is a continuous piecewise linear (CPWL) function. Since the composition of two CPWL functions is obviously still a CPWL function, we have the following simple observation().

For convenience of exposition, we introduce the following notation:

Namely DNNJ{\rm{DNN}_{J}} represents the DNN model with JJ hidden layers and ReLU activation function with arbitrary size.

We note that for J=0J=0, DNN0{\rm DNN}_{0} is a simple function space of global linear functions, which is often used in classic statistical analysis such as linear regression. The structure of DNNJ{\rm DNN}_{J} gets more interesting as JJ becomes larger. We shall now discuss the simple case when J=1J=1, namely

as the restriction of DNN1m{\rm DNN}_{1}^{m} on Ω\Omega.

Approximation property for the function class DNN1m(Ω){\rm DNN}_{1}^{m}(\Omega) has been much studied in the literature. For example, in and , DNN1m(Ω){\rm DNN}_{1}^{m}(\Omega) is proved to be dense in C0(Ω)C^{0}(\Omega) as mm\to\infty, which is known as universal approximation. There are also many works devoted to the asymptotic error estimates. For example, established the following estimate:

where Ω|\Omega| denotes the volume of Ω\Omega and

For a given set of wiw^{i} and bib^{i}, it is tempting to think the functions in DNN1m{\rm DNN}^{m}_{1} are generated by {\mboxReLU(wix+bi)}i=1m\{\mbox{{\rm ReLU}}(w_{i}x+b_{i})\}_{i=1}^{m}. In such a consideration, the following result is of great theoretical interest. The proof will be seen in Section 4.

In real applications, wiw_{i} and bib_{i} are variables. As a result, DNN1m{\rm DNN}^{m}_{1} is generated by variable basis functions {\mboxReLU(wix+bi)}i=1m\{\mbox{{\rm ReLU}}(w_{i}x+b_{i})\}_{i=1}^{m} and in particular DNN1m{\rm DNN}^{m}_{1} is a nonlinear space which is expected to have certain nonlinear approximation property as discussed in .

Linear finite element (LFE) function as a DNN

In this section, we consider a special CPWL function space, namely the space of linear simplicial finite element functions. We will first give a brief description of finite element method and give a constructive proof that any linear simplicial finite element function can be represented by a ReLU DNN.

The finite element method (FEM), as a popular numerical method for approximating the solutions of partial differential equations (PDEs), is a well-studied subject (,). The finite element function space is usually a subspace of the solution space, for example, the space of piecewise linear functions over a given mesh. In , it is shown that piecewise linear functions can be written as ReLU DNNs, which will be discussed in details later. By exploring the relationship between FEM and ReLU DNN, we hope to shed some new light on how DNN works in this special case.

A finite element space is defined in association with a simplicial finite element grid ThΩ{\cal T}_{h}\subset\Omega. A simplicial finite element grid Th{\cal T}_{h} consists of a set of simplexes {τk}\{\tau_{k}\} and the corresponding set of nodal points is denoted by Nh{\cal N}_{h}. For a given grid Th\mathcal{T}_{h}, the corresponding finite element space is given by

Given xiNhx_{i}\in\mathcal{N}_{h}, it is easy to see that there exists a unique function ϕiVh\phi_{i}\in V_{h}, known as the nodal basis function, such that

A typical profile of ϕi\phi_{i} is shown in Fig. 3.2 for d=1d=1 and d=2d=2.

Obviously any vVhv\in V_{h} can be uniquely represented in terms of these nodal basis functions:

Given xiNhx_{i}\in{\cal N}_{h}, let N(i)N(i) denote all the indices jj such that τj\tau_{j} contains the nodal point xix_{i}, namely

and khk_{h} denote the maximum number of neighboring elements in the grid

Let G(i)G(i) denote the support of the nodal basis ϕi\phi_{i}:

We say that the grid Th\mathcal{T}_{h} is locally convex if G(i)G(i) is convex for each ii.

We proceed next to demonstrate how a finite element function can be represented by a ReLU DNN. Our derivation and analysis are based on the representation of the finite element function as a linear combination of basis functions as follows.

2 DNN representation of finite element functions

As an illustration, we will now demonstrate how a linear finite element function associated with a locally convex grid Th\mathcal{T}_{h} can be represented by a ReLU DNN. For more general grids, we refer to Remark 1 and §5.

Thanks to (3.3), it suffices to show that each basis function ϕi\phi_{i} can be represented by a ReLU DNN. We first note that the case where d=1d=1 is trivial as the basis function ϕi\phi_{i} with support in [xi1,xi+1][x_{i-1},x_{i+1}] can be easily written as

In order to consider the cases where d>1d>1, we first prove the following lemma.

Given xiNhx_{i}\in{\cal N}_{h}, if G(i)G(i) is convex, then the corresponding basis function can be written as

where, for each kN(i)k\in N(i), gkg_{k} is the global linear function such that gk=ϕig_{k}=\phi_{i} on τk\tau_{k}.

Let PkP_{k} be the hyperplane that passes through the d1d-1 subsimplex (of τk\tau_{k}) that does not contain xix_{i} (see the left figure in Figure 3.3). Since G(i)G(i) is convex by assumption, all points in τk0\tau_{k_{0}} should be on the same side of the hyperplane PkP_{k}. As a result, for all kN(i)k\in N(i),

By combining the above inequality with the following obvious inequality that

and the fact that all gkg_{k} are linear, we conclude that

This, together with (3.7), proves that (3.6) holds for all xG(i)x\in G(i). Thus

On the other hand, if xG(i)x\notin G(i), there exists a τkG(i)\tau_{k}\subset G(i) such that τk\tau_{k} contains a segment of the straight line that pass through xx and xix_{i}(see the right figure in Figure 3.3). Again let PkP_{k} be the hyperplane associated with τk\tau_{k} as defined above. We note that xx and xix_{i} are on the different sides of PkP_{k}. Since

This finishes the proof of Lemma 3.1. \square

If G(i)G(i) is not convex, we could also write the basis function as some max-min functions. But the form of max-min function is not as simple as the case where G(i)G(i) is convex, and it depends on the shape of the support of the basis function. In some cases, we can write the basis function as the max-min-max form if G(i)G(i) is a special non-convex set.

We are now in a position to state and prove the main result in this section.

Given a locally convex finite element grid Th{\cal T}_{h}, any linear finite element function with NN degrees of freedom, can be written as a ReLU-DNN with at most k=log2kh+1k=\lceil\log_{2}k_{h}\rceil+1 hidden layers and at most O(khN)\mathcal{O}(k_{h}N) number of the neurons.

By Lemma 3.1, the basis function ϕi(x)\phi_{i}(x) can be written as:

According to this procedure, we get the minimum of N(i)|N(i)| terms by splitting them in two, each taking the minimum over at most N(i)/2\lceil|N(i)|/2\rceil terms. This contributes to one ReLU hidden layer. Then we can further split the terms

until all the minimum functions contain only 1 or 2 terms.

which is also a ReLU DNN with 11 hidden layer. So we can write a basis function as a 1+log2kh1+\lceil\log_{2}k_{h}\rceil-hidden-layer DNN. Considering the binary-tree structure, a kk-layer full binary-tree has 2k12^{k}-1 nodes. We can see the number of neurons is at most

By (3.3), the piecewise linear function can be represented as a DNN with k=1+log2khk=1+\lceil\log_{2}k_{h}\rceil hidden layers. The number of neurons is at most O(khN)\mathcal{O}(k_{h}N). \square

We now consider a special class of the so-called shape regular finite element grid Th\mathcal{T}_{h} which satisfies

for some constants κ1\kappa_{1} and κ2\kappa_{2} independent of hh and dd, where rτr_{\tau} (RτR_{\tau}) is the radius of the largest (smallest) ball contained in (containing) τ\tau.

Given a locally convex and shape regular finite element grid Th{\cal T}_{h}, any linear finite element function with NN degrees of freedom(DOFs), can be written as a ReLU-DNN with at most O(d)\mathcal{O}(d) hidden layers. The number of neurons is at most O(κdN)\mathcal{O}(\kappa^{d}N) for some constant κ2\kappa\geq 2 depending on the shape-regularity of Th\mathcal{T}_{h}. The number of non-zero parameters is at most O(dκdN)\mathcal{O}(d\kappa^{d}N).

We note that, using the approach described in this section, a finite element function with NN DOFs can be represented by a DNN with O(N)\mathcal{O}(N) number of weights. This property is expected to be useful when DNNs are used in adaptive mesh-less or vertex-less numerical discretization methods for partial differential equation, which is a subject of further study.

3 Comparison of error estimates in adaptive finite element and DNN methods

Error estimates for adaptive finite element methods are well studied in the literature. For example, an appropriately adapted linear finite element function with O(N)\mathcal{O}(N) DOFs is proved to admit the following error estimate:

if uW2,2dd+2(Ω)u\in W^{2,\frac{2d}{d+2}}(\Omega) and vv is the interpolation based on the adapted finite element grid. More details can be founded in .

For a shallow network DNN1{\rm DNN}_{1} with O(N)\mathcal{O}(N) DOFs (i.e. O(Nd)\mathcal{O}(\frac{N}{d}) neurons), we have the next error estimate in (2.9) as

In comparison, an adaptive finite element function with the same order of O(N)\mathcal{O}(N) DOFs can only have convergence rate of order O(N2d)\mathcal{O}(N^{-\frac{2}{d}}).

As will be shown in §4, shallow neural networks DNN1{\rm DNN}_{1} (namely with only one hidden layer) cannot recover a linear finite element function in general, but may potentially lead to better asymptotic accuracy as the dimension dd gets larger.

One idea that may help us to understand is that the shallow network is a kind of NN-term or basis selection ( )approximation scheme with {σ(wix+bi)}i=1N\{\sigma(w_{i}x+b_{i})\}_{i=1}^{N} as the basis functions (as shown in Theorem 2.1), similar to using {sin(nx)}n=1N\{sin(nx)\}_{n=1}^{N} as the basis functions in Fourier approximation or some others in wavelets.

For deep ReLU neural networks, our connections of FEM and ReLU DNNs in this section help us to construct a special ReLU DNN models with depth O(d)\mathcal{O}(d) and parameters O(dkdN)\mathcal{O}(dk_{d}N) for O(N)\mathcal{O}(N) DOFs. By using the approximation result for adaptive FEM, DNN approximation uDNNu_{DNN} for special structure with O(N)\mathcal{O}(N) DOFs can get

and kd=O(κd)k_{d}=\mathcal{O}(\kappa^{d}). This shows that there exists some special deep ReLU DNN structure which is at least as good as adaptive FEM.

In the previous section, we show that a finite element function can be represented by a ReLU DNN with log2kh+1\log_{2}k_{h}+1 hidden layers.

In particular, we will show the following theorem.

with fi=αi\mboxReLU(wix+bi)f_{i}=\alpha_{i}\mbox{{\rm ReLU}}(w_{i}x+b_{i}), αi0\alpha_{i}\neq 0 and wi0w_{i}\neq 0. Consider the finite element functions, if this one hidden layer ReLU DNN can recover any basis function of FEM, then it can recover the finite element space. Thus let us assume ff is a locally supported basis function for FEM, i.e. supp(f){\rm{supp}}(f) is bounded, where

Furthermore, if Ω\Omega is a bounded domain, we assume that

For more general case, we can define the set of discontinuous points of a function by

for i=1:Ni=1:N with HH be the Heaviside function defined as:

Without loss of generality, we can assume that

is contradictory to the assumption that ff is locally supported.

Hence DNN1{\rm DNN}_{1} cannot recover any piecewise linear function in Ω\Omega for d2d\geq 2. \square

Following the proof above, we can prove Theorem 2.1.

Proof If (wi,bi)(w_{i},b_{i}) and (wj,bj)(w_{j},b_{j}) are linearly independent for any iji\neq j, we know that the set of discontinuous points for any nontrivial combinations of x\mboxReLU(wix+bi)\nabla_{x}\mbox{{\rm ReLU}}(w_{i}x+b_{i}) cannot be empty. So, this is contradictory to

since DfD_{\nabla f}\neq\emptyset where f(x)f(x) is a combination of {\mboxReLU(wix+bi)}\{\mbox{{\rm ReLU}}(w_{i}x+b_{i})\}. \square

This shows that despite it has the so-called universal approximation properties , shallow network is not enough in the case of recovering all CPWL functions. More precisely, although the shallow ReLU DNNs are CPWL functions themselves and can approximate any CPWL functions with any accuracy, there are some CPWL functions they cannot represent exactly. As an example, a local basis function in FEM with compact support and some other simple conditions cannot be represented by ReLU DNNs with one hidden layer for dimensions greater than 22.

As for the upper bound, Theorem 5.2 in provides us with one answer.

This also indicates that log2(d+1)\lceil\log_{2}(d+1)\rceil is “optimal” for d=2,3d=2,3.

General CPWL as a ReLU DNN

In the previous sections, we present a special approach to represent a linear simplicial finite element function by a ReLU DNN. In this section, we discuss a general approach to represent a general CPWL by a ReLU DNN, which is introduced in . In comparison with the special approach in §3, this general approach gives a ReLU DNN with relatively fewer layers but significantly more number of neurons.

Namely, on each Ωi\Omega_{i}, ff is a linear function:

Proof Because mm is finite, so there must exist MM number of subdomains,

Then we proceed to estimate MM. On the one hand, we have mm pieces linear functions, so

For the relationship between ReLU DNNs and CPWL functions, we have the next theorem with some estimation.

the number of hidden layers is bounded by

here MM, satisfying mMm!m\leq M\leq m!, is the number of subdomains as defined in Lemma 5.1.

The main result in Theorem 5.2 is not new, which can be found in . In the next subsection, we will give an outline of the proof of Theorem 5.2 and in particular to derive the new estimate (5.3) on the number of neurons needed in the DNN representation. We will also discuss the application of this theorem to the simplicial finite element space in § 3.1.

2 On the proof of Theorem 5.2

In this section, we give an outline of the proof of Theorem 5.2. We will mainly follow the proof in which is based on many relevant results in existing literature such as , but add some detailed estimate of the number of neurons.

here σk{1,1}\sigma_{k}\in\{1,-1\} and gˉ\bar{g} is one of gg, αg+h\alpha g+h or αˉh\bar{\alpha}h.

With all the lemmas above, now we can start to prove Theorem 5.2.

where MM is the number of subdomains as defined in Lemma 5.1, and sj{1,,m}s_{j}\subset\{1,\dots,m\}, mm is the number of distinct pieces of ff.

Let Φj=max1tj{ministli}, 1jM\Phi_{j}=\max\limits_{1\leq t\leq j}\{\min_{i\in s_{t}}l_{i}\},\ 1\leq j\leq M, then

with K12m1K_{1}\leq 2^{m}-1, sn(1){1,2,,m}s^{(1)}_{n}\subset\{1,2,\dots,m\} and σn(1){1,1}\sigma^{(1)}_{n}\in\{1,-1\}.

For each max{ΦM1,maxisn(1)li}\max\{\Phi_{M-1},\max_{i\in s^{(1)}_{n}}l_{i}\} in (5.8), we can continue this procedure for M1M-1 times. Then we have:

here sn{1,2,,m}s^{\prime}_{n}\subset\{1,2,\dots,m\} and M(2m1)MM^{\prime}\leq(2^{m}-1)^{M}.

Now that we write a piecewise linear function in form of (5.9), in order to get the log2(d+1)\lceil\log_{2}(d+1)\rceil-hidden-layer ReLU DNN, we need to do linear transformations to reduce the cardinality of sns^{\prime}_{n} from mm to n+1n+1. This can be done by Lemma 5.3.

Following the procedures in Lemma 5.3 by , when reducing one cardinality of sns^{\prime}_{n}, one maxisnli\max_{i\in s^{\prime}_{n}}l_{i} will become at most 2d+112^{d+1}-1 terms. If the cardinality is reduced from mm to d+1d+1, then we need to repeat the whole procedure md1m-d-1 times. Hence for each maxisj{li}\max_{i\in s^{\prime}_{j}}\{l_{i}\}, in total we have at most (2d+11)md1(2^{d+1}-1)^{m-d-1} terms. Thus for

with sj{1,+1}s_{j}\in\{-1,+1\} and Sjd+1|S_{j}|\leq d+1, we have p(2m1)M(2d+11)md1p\leq(2^{m}-1)^{M}(2^{d+1}-1)^{m-d-1}.

For each maxiSjli\max_{i\in S_{j}}l_{i} with Sjd+1|S_{j}|\leq d+1, Lemma 5.4 can be used to get a ReLU DNN with log2(d+1)\lceil\log_{2}(d+1)\rceil hidden layers. Again by Lemma 5.4, the size is at most 2(d+1)+4(2(d+1)1)=10d+62(d+1)+4(2(d+1)-1)=10d+6. Adding these maxiSjli\max_{i\in S_{j}}l_{i} together, we have the total size is at most (10d+6)(2m1)M(2d+11)md1(10d+6)(2^{m}-1)^{M}(2^{d+1}-1)^{m-d-1}, which is O(d2mM+(d+1)(md1))\mathcal{O}(d2^{mM+(d+1)(m-d-1)}).

Note that if md+1m\leq d+1, we do not need to use Lemma 5.3, the size will be at most O(d2mM)\mathcal{O}(d2^{mM}). \square

The estimation in Theorem 5.2 is a rough one, but still can provide some insights of this DNN representation. It can be seen that although the depth of this DNN is relatively shallow, the size of it might be extremely large, depending on the numbers of subdomains and distinct pieces.

𝑑1\lceil\log_{2}(d+1)\rceil hidden layers Given a locally convex finite element grid, now we have two different ways to represent a linear finite element function. In this part, we estimate the number of neurons if we write the function as a ReLU DNN with at most log2(d+1)\lceil\log_{2}(d+1)\rceil hidden layers. Then we can compare the sizes of two different approaches. Again we start with the basis functions.

Given xix_{i}, denote the corresponding basis function as ϕi\phi_{i}. If G(i)G(i) is convex ,then the ReLU DNN with at most log2(d+1)\lceil\log_{2}(d+1)\rceil hidden layers has size at most O(d2(d+1)kh)\mathcal{O}(d2^{(d+1)k_{h}}).

For simplicity, let us further assume that

The first step is to write it as the linear combination fo max functions

Our goal is to make every term on the right hand side only take maximum over at most d+1d+1 linear functions, here dd is the dimension.

For any term with linear functions more than d+1d+1, we need to use linear transformation to reduce this number. When reducing by one, one term will become at most 2d+112^{d+1}-1 terms. Thus max{0,g1,,gl}\max\{0,g_{1},\dots,g_{l}\} will become at most (2d+11)ld(2^{d+1}-1)^{l-d} when ldl\geq d.

For any term with number of linear functions less or equal than d+1d+1, it remains unchanged. The number of this kind of terms is

Then in total the number of terms should be

Proof According to Corollary 5.3, every basis function has a size independent of NN, so the size of the DNN function with at most (log2(d+1))(\lceil\log_{2}(d+1)\rceil) hidden layers is at most O(d2(d+1)khN)\mathcal{O}(d2^{(d+1)k_{h}}N). \square

By comparing the above results with Theorem 3.1, we can see that although the DNN with log2(d+1)\lceil\log_{2}(d+1)\rceil hidden layers has shallower depth, the number of neurons is much larger than the one with log2kh+1\lceil\log_{2}k_{h}\rceil+1 hidden layers.

Low bit-width DNN models

In this section, we will show the rationality of low bit-width models with respect to approximation properties in some sense by investigating that a special type of ReLU DNN model can also recover all CPWL functions. In , an incremental network quantization strategy is proposed for transforming a general trained CNN into some low bit-width version in which there parameters are all zeros or powers of two. Mathematically speaking, low bit-width DNN model is defined as:

where σ\sigma is the activation function and

Under this closed form, they propose a projected gradient descent methods with respect to SGD to train a general R-FCN model for object detection. They also find that 6-bit (i.e b=6b=6) model works almost the same with classical model in the object detection tasks.

Then it comes the question: why can those kinds of models work? More precisely, for classification or detection problems, can this model separate those data exactly? By our results in previous sections, we find a special family of ReLU DNN which has at most one general layers and all other layers with low bit-width parameters. The results offer modification and theoretical explanation of the existing low bit-width DNNs proposed in the literature.

Here we try to explain why those low bit-width DNN model also work for classification problems to some extent. We have the following result:

Any continuous piecewise function can be represented by the next model:

with Q0,3\mathcal{Q}_{0,3} defined in (6.2) and Jlog2(d+1)J\geq\lceil\log_{2}(d+1)\rceil.

Proof Because of Theorem 5.2, we can rewrite any piecewise linear function as a ReLU DNN

with J0log2(d+1)J_{0}\leq\lceil\log_{2}(d+1)\rceil. By (3.9), we know that

Also note that DNN~0,3J0DNN~0,3J\widetilde{{{\rm DNN}}}_{0,3}^{J_{0}}\subseteq\widetilde{{{\rm DNN}}}_{0,3}^{J} if J0JJ_{0}\leq J. This completes the proof.

Application to Numerical PDEs

In this section, we discuss the application of DNNs to the numerical solution of partial differential equations (PDEs). In most of our discussion, we consider the following model problem:

The idea of using DNN for numerical PDEs can be traced back to where a collocation method is used. Similar ideas have been explored by many different authors for different types of PDEs.

For the model problem (7.1), roughly speaking, the collocation method amounts to the following least square problem:

here uN(x,Θ)u_{N}(x,\Theta) is taken among the DNN function class in the form of (2.3) with a smooth activation function such as sigmoidal function and xix_{i} are some collocation points.

Recently, applied DNN for numerical PDE in the Galerkin setting which amounts to the solution of the following energy minimization problem:

Numerical experiments have demonstrated the potential of this approach. In the rest of this section, we will discuss a number of aspects of this approach from both theoretical and practical viewpoints. In particular, we will discuss its relationship with two popular finite element methods: adaptive finite element method and moving grid method.

The finite element approximation to (7.1) can be written as

where VhV_{h} is the finite element space as described in §3.1.

In the finite element setting, the optimization problem (7.4) is to find the coefficient (νi)(\nu_{i}) as in (3.3) for a given finite element mesh Th\mathcal{T}_{h}. Some more sophisticated versions of the finite element method can be obtained by varying or optimizing Th\mathcal{T}_{h} so that more accurate finite element approximation can be obtained. Roughly speaking, there are two main approaches for optimizing Th\mathcal{T}_{h}: one is the adaptive finite element method and the other is the moving grid finite element method.

The adaptive finite element method is, roughly speaking, to vary Th\mathcal{T}_{h} by either coarsening or refining the grid. One main theoretical result is that a family of adapted grids Th\mathcal{T}_{h} with O(N)\mathcal{O}(N) degrees of freedom can be obtained so that the corresponding adaptive finite element approximation uNu_{N} satisfies the following error estimate

We refer to for relevant details and its generalizations.

One interesting observation is that the convergence rate O(N1d)\mathcal{O}(N^{-\frac{1}{d}}) in (7.5) deteriorate badly as dd increases. Of course, error estimate in the form (7.5) depends on which Sobolev or Besov function classes that the solution uu belongs to, namely what norms are used in the right hand side of (7.5). But regardless what function classes for the solution uu, no asymptotic error estimate seems to be known in the literature that is better than O(N1d)\mathcal{O}(N^{-\frac{1}{d}}).

The moving grid method is, on the other hand, to optimize Th\mathcal{T}_{h} by varying the location of grid points while preserving topological structure of the grids (in particular the number of grid ponts remain unchanged). This approach proves to be effective in many applications, see . But there are very few theories on the error estimate like (7.5) in the moving grid method.

However, the H1H^{1} approximation properties are still unclear even for DNN1{\rm DNN}_{1}. proves a similar result for H1H^{1} error estimate for DNN1m(Ω){\rm DNN}_{1}^{m}(\Omega) with activation function σ(x)=cos(x)\sigma(x)=\cos(x). For general activation functions, or just for ReLU, it is an open problem.

2 DNN-Galerkin method

The finite element methods discussed above, including adaptive method and moving grid method, depend crucially on the underlying finite element grids. Numerical methods based on DNN, as we shall describe now, are a family of numerical methods that require no grids at all. This is reminiscent of the “mesh-less method” that have been much studied in recent years . But the mesh-less method still requires the use of discretization points. The DNN-Galerkin method (as we shall call), namely the Galerkin version of the DNN-element method such as (7.3), goes one step further: it does not even need any discretization points! It is a totally point-free method!

Let us now give a brief discussion on the error estimate for the DNN-Galerkin method. We first recall a classic result by for a DNN with one hidden layer of O(N)\mathcal{O}(N) DOFs ( i.e. O(Nd)\mathcal{O}(\frac{N}{d}) neurons),

3 An 1D example: a two point-boundary value problem

As a proof of concept, let us discuss a very simple one dimensional example. We focus on the following model problem:

The exact solution uH1(0,1)u\in H^{1}(0,1) satisfies that

We define the space of ReLU DNNs with one hidden layer as follows:

where θi\theta_{i} is the slope of piecewise linear function in [ti1,ti][t_{i-1},t_{i}]. In order to satisfy the condition u(1;t,θ)=0u(1;t,\theta)=0, we have the constraint

where t=(t0,t1,...,tN+1),θ=(θ0,θ1,...,θN+1)t=(t_{0},t_{1},...,t_{N+1}),\theta=(\theta_{0},\theta_{1},...,\theta_{N+1}). We do the alternate iteration as below,

where η\eta is the step-length. Once tt is fixed, the minimization problem is a quadratic optimization, which is the traditional finite element method. So we solve the FEM solution u(x;tk,θk+1)u(x;t^{k},\theta^{k+1}) on grid tt and then compute the slope θi\theta_{i} on each [ti1,ti][t_{i-1},t_{i}].

with K=0.01K=0.01. In this numerical experiment, the learning rate η=0.5\eta=0.5, the max iteration step is 200200, and the degrees of freedom N=53N=53.

At the beginning of the simulation, we use the adaptive finite element method(AFEM) to get the adaptive grid from the uniform grid. Next we construct DNN solution with the same degrees of freedom. Then we minimize the energy and get the DNN solution. We compare the energy and H1H^{1} semi-norm error of uniform grid solution(uFEM), AFEM solution and DNN solution. From Table 1, the energy and H1H^{1} semi-norm of the DNN solution are smaller than uAFEMu_{AFEM} and uuFEMu_{uFEM}, which implies that the DNNs can find better solution than AFEM. Figure 7.1 shows the two different grid points on the same graph, we can easily see the grid points are moving.

Conclusion

By relating ReLU DNN models with linear finite element functions, we provide some theoretical insights on why and how deep neural networks work. It is shown that ReLU DNN models with sufficiently many layers (at least two) can reproduce all the linear finite element functions. This in some sense provides some theoretical explanation of the expressive power of deep learning models and also the necessity of using deep layers in deep learning applications.

Two different approaches are discussed in this paper on the representation of continuous piecewise linear functions by ReLU DNNs. The first approach, as proposed in and described in §5, leads to a DNN representation with a relatively shallow network with log2(d+1)\lceil\log_{2}(d+1)\rceil hidden layers but a relatively larger number of neurons. The second approach, presented in this paper and described in §3, leads to a representation that has a relatively deeper network with log2kh+1\lceil\log_{2}k_{h}\rceil+1 hidden layers (see (3.4)). Further investigations are needed in the future to combine these two approaches to obtain a more balanced representation.

The DNN representation of linear finite element functions opens a door for theoretical explanation and possible improvement on the application of the quantized weights in a convolution neural networks (see ).

One theoretically interesting question addressed in this paper concerns the minimal number of layers that are needed in a DNN model to reproduce general continuous piecewise linear functions. Theorem 4.1 provides a partial answer to this question, namely the minimal number of layers is at least 22. As a result, the number of layers log2(d+1)\lceil\log_{2}(d+1)\rceil as given in Theorem 5.2 is optimal for 1d31\leq d\leq 3. It is still an open question if this number is also optimal for d4d\geq 4.

This paper also briefly touches upon the application of DNN in numerical solution of partial differential equations, which is a topic that was investigated in the literature in 1990s and has attracted much attention recently. Our focus is on the comparison of Galerkin type of discretization methods provided by adaptive linear finite element methods and by deep neural networks. When the dimension dd is large, asymptotic approximation properties are compared for these two different approaches in terms of the number of the dimension dd and the number of the degrees of freedom. When dd is small, we use the simplest case d=1d=1 to demonstrate that the deep neural network would lead to a more accurate Galerkin approximation to a differential equation solution than the adaptive finite element method would under the assumption that the degrees of freedom are the same in both cases but without comparing their computational costs. This preliminary study seems to indicate that deep neural network may provide a potentially viable approach to the numerical solution of partial differential equations for both high and low dimensions although the underlying computational cost is a serious issue that may or may not be properly addressed by further studies in the future.

Appendix A Lattice representation

Let p(t)p(t) be a continuous piecewise linear function and the unique-order region partition is

assume the linear function on [ti,ti+1][t_{i},t_{i+1}] is li(x)=kit+bil_{i}(x)=k_{i}t+b_{i}. If the parameters satisfy

Then there exists lp(t)=kpt+bpl_{p}(t)=k_{p}t+b_{p}, such that

Let kp=mini{ki}k_{p}=\min_{i}\{k_{i}\}, Δti=ti+1ti\Delta t_{i}=t_{i+1}-t_{i}.

Since here we only involve linear functions, so we can represent each point (ti,yi)(t_{i},y_{i}) by using kik_{i}’s and Δti\Delta t_{i}’s.

Then the point (tp,b0+i=0p1kiΔti)(t_{p},b_{0}+\sum_{i=0}^{p-1}k_{i}\Delta t_{i}) is on y=kpt+bpy=k_{p}t+b_{p}, and we can write bpb_{p} as following:

Since here kpk_{p} is the minimum, we have:

which means we find the desired pair of kk and pp.

Notice here kpk0k_{p}\neq k_{0} and kpkrk_{p}\neq k_{r} by the assumptions in the lemma.

and next we show that Φk(x)f(x)\Phi_{k}(x)\leq f(x) for all x and every kk:

Then by Lemma A.1, there must exist ltl_{t} with tk,kt\neq k,k^{\prime} and

Thus for every Φk\Phi_{k}, we have Φk(x)f(x)\Phi_{k}(x)\leq f(x) for all xx. So if we take the maximum over all these functions, we should have:

This is exactly the desired form. Here skm|s_{k}|\leq m, and the number of Φk\Phi_{k} depends on the domain partition we do.

The drawback of this representation is that the number of MM may be too large, so we want to deal with other domain partitions, for example, partitions that produce a set of convex regions. The following theorem is an improvement of Theorem 5.1.

If the arrangement inside a convex region CkC_{k} is the same, then the proof should be the same as Theorem 5.1. If not, the contribution of each region CkC_{k} can be considered as the union of the contributions of its unique-order subsets: Ψk=max1iMkΦi\Psi_{k}=\max_{1\leq i\leq M_{k}}\Phi_{i}, here Φi\Phi_{i} is defined as in Theorem 5.1. We can see that k=1SMk\sum_{k=1}^{S}M_{k} is no less than MM, the number of unique-order regions. By applying properties of lattices, we have:

so according to Theorem 5.1, Ψ=f\Psi=f for all xx in the domain. \square

Appendix B Proof of Lemmas

In this section, we will show the proofs of the lemmas used in previous sections.

When α>1\alpha>1, if gαˉh0g-\bar{\alpha}h\geq 0, then (B.1) becomes

and the right hand side (RHS) of (5.4) becomes the same.

If gαˉh0g-\bar{\alpha}h\leq 0, then (B.1) becomes

and the RHS of (5.4) is also this expression.

For α<1\alpha<1, similarly, the identity (5.4) always holds. Further, if consider the cases α>1\alpha>1, 0<α<10<\alpha<1 and α<0\alpha<0 respectively, we have the following important identity:

here σk{1,1}\sigma_{k}\in\{1,-1\} and gˉ\bar{g} is one of gg, αg+h\alpha g+h or αˉh\bar{\alpha}h. \square

B.2 Proof of Lemma 5.3

If αj=0\alpha_{j}=0 for each 1jnˉ1\leq j\leq\bar{n}, then by taking max{c0,α0}\max\{c_{0},\alpha_{0}\}, we already make the RHS of (B.3) as the RHS of (5.6). Otherwise, αj0\exists\alpha_{j}\neq 0 for some 1jnˉ1\leq j\leq\bar{n}, we can assume that αη0,αη+1=...=αnˉ\alpha_{\eta}\neq 0,\alpha_{\eta+1}=...=\alpha_{\bar{n}}, so

If αi=1\alpha_{i}=1 for each 1iη1\leq i\leq\eta, we can do the following linear transformation (l1,,lnˉ)T=A(l1,,lnˉ)T(l^{\prime}_{1},\dots,l^{\prime}_{\bar{n}})^{T}=A(l_{1},\dots,l_{\bar{n}})^{T}, where

in this case, the coefficients of l1,,lη1l^{\prime}_{1},\dots,l^{\prime}_{\eta-1} are not 1.

So we can assume there is at least one αi1\alpha_{i}\neq 1 for 1iη1\leq i\leq\eta, say αη1\alpha_{\eta}\neq 1. Let

(B.7) is already the desired form, because now we only take maximum over L1L-1 linear functions and one constant.

As for the (B.7), notice that now we have eliminated lηl_{\eta} in the third expression. So continue this procedure, at last we will only have constant in the last expression, by taking maximum of this constant and c0c_{0}, we can reduce one term in the max expression.

For (B.7), consider the linear transformation (l1,,lnˉ)T=B(l1,,lnˉ)T(l^{\prime\prime}_{1},\dots,l^{\prime\prime}_{\bar{n}})^{T}=B(l_{1},\dots,l_{\bar{n}})^{T}:

Then it is the same as (B.7). Follow the same steps as for (B.7), we can achieve the desired result. \square

Whenever we eliminate one lil_{i} in the expression of lLl_{L}, we will gain 3 terms, which is (B.7-B.7). Among these three terms, (B.7) is in desired form, and we need to continue to use (5.5) for (B.7) and (B.7) until we only have constant. Note that in the proof, ηn\eta\leq n. By this procedure, we will gain at most 2n+112^{n+1}-1 terms (see Figure B.1).

B.3 Proof of Lemma 5.4

Since max{x,y}\max\{x,y\} can be represented by a 2-layer ReLU DNN with size 4, we know that ff can be represented by a ReLU DNN of depth at most max{k1,,km}+log(m)+1\max\{k_{1},\dots,k_{m}\}+\lceil\log(m)\rceil+1 and size at most s1++sm+4(2m1)s_{1}+\dots+s_{m}+4(2m-1). \square

References