ReLU Deep Neural Networks and Linear Finite Elements
Juncai He, Lin Li, Jinchao Xu, Chunyue Zheng
Introduction
In recent years, deep learning models have achieved unprecedented success in various tasks of machine learning or artificial intelligence, such as computer vision, natural language processing and reinforcement learning . One main technique in deep learning is deep neural network. A typical DNN model is based on a hierarchy of composition of linear functions and a given nonlinear activation function. However, why DNN models can work so well is still unclear.
Mathematical analysis of DNN can be carried out using many different approaches. One approach is to study the approximation properties of the function class provided by DNN. The approximation property of DNN is relevant to the so-called expressive power of a DNN model. Early studies of approximation properties of DNN can be traced back in and where the authors established some approximation properties for the function classes given by a feedforward neural network with a single hidden layer. Further error estimates for such neural networks in terms of number of neurons can be found in for sinusoidal activation functions and in for more general sigmoidal activation functions. There are many other papers on this topic during the 90s and a good review of relevant works can be found in and .
There are many different choices of activation functions. In fact, as shown in , a neural network with a single hidden layer can approximate any continuous function for any activation function which is not a polynomial. Among all the activation functions, the so-called rectified linear unit (ReLU) activation function , namely , has emerged to be one of the most popular activation functions used in the deep learning literature and applications. presents an approximation of ReLU DNNs by relating to wavelets. Recently, establish and error bounds for functions of many variables that are approximated by linear combinations of ReLU. presents rates of approximation by deep CNNs for functions in the Sobolev space with . This paper is devoted to some further mathematical analysis of DNN models with ReLU as the activation function.
Motivated by this result, we study the following two questions on the DNN representation of a given CWPL function:
What is the minimal number of layers that are needed?
To answer the first question, in this paper, we will go through the proof of this representation result to give some explicit estimations of the number of neurons that are needed in a DNN to represent a given CPWL function. As a result, we find that the number of neurons that are needed for a DNN to represent a CPWL on -subdomains can be as large as !
In real applications, many efforts have been made to compress the deep neural networks by using heavily quantized weights, c.f. . Especially, binary and ternary weight models not only give high model compression rate, but also eliminate the need of most floating-point multiplications during interface phase. In particular, for some small data sets such as MINST and CIFAR-10 , the ternary CNNs are shown to have the same accuracy as that of the original CNN. Using the special structure for representing any CPWL functions by ReLU DNNs, we provide certain theoretical justification of the use of ternary CNNs. Furthermore, we also present a modified version of those models with some rigorous mathematical justifications.
Another topic that will be investigated in the paper is the application of artificial neural networks for differential equations. This topic can be traced back to in which collocation methods are studied. Recently, there are increased new research interests in the literature for the application of deep neural networks for numerical approximation of nonlinear and high dimensional PDEs as in . Based on our result about the relationship between FEM and ReLU DNNs, we discuss the application of ReLU DNN for solving PDEs with respect to the convergence properties. In particular, we use an 1D example to demonstrate that a Galerkin method using ReLU DNN can lead to better approximation result than adaptive finite element method that has exactly the same number of degrees of freedom as in the ReLU DNN.
Deep neural network (DNN) generated by ReLU
In this section, we briefly discuss the definition and properties of the deep neural networks generated by using ReLU as the activation function.
Given , the first ingredient in defining a deep neural network (DNN) is (vector) linear functions of the form
The second main ingredient is a nonlinear activation function, usually denoted as
By applying the function to each component, we can extend this naturally to
with . The following more concise notation is often used in computer science literature:
A ReLU DNN with hidden layers might be written as:
We note that ReLU is a continuous piecewise linear (CPWL) function. Since the composition of two CPWL functions is obviously still a CPWL function, we have the following simple observation().
For convenience of exposition, we introduce the following notation:
Namely represents the DNN model with hidden layers and ReLU activation function with arbitrary size.
We note that for , is a simple function space of global linear functions, which is often used in classic statistical analysis such as linear regression. The structure of gets more interesting as becomes larger. We shall now discuss the simple case when , namely
as the restriction of on .
Approximation property for the function class has been much studied in the literature. For example, in and , is proved to be dense in as , which is known as universal approximation. There are also many works devoted to the asymptotic error estimates. For example, established the following estimate:
where denotes the volume of and
For a given set of and , it is tempting to think the functions in are generated by . In such a consideration, the following result is of great theoretical interest. The proof will be seen in Section 4.
In real applications, and are variables. As a result, is generated by variable basis functions and in particular is a nonlinear space which is expected to have certain nonlinear approximation property as discussed in .
Linear finite element (LFE) function as a DNN
In this section, we consider a special CPWL function space, namely the space of linear simplicial finite element functions. We will first give a brief description of finite element method and give a constructive proof that any linear simplicial finite element function can be represented by a ReLU DNN.
The finite element method (FEM), as a popular numerical method for approximating the solutions of partial differential equations (PDEs), is a well-studied subject (,). The finite element function space is usually a subspace of the solution space, for example, the space of piecewise linear functions over a given mesh. In , it is shown that piecewise linear functions can be written as ReLU DNNs, which will be discussed in details later. By exploring the relationship between FEM and ReLU DNN, we hope to shed some new light on how DNN works in this special case.
A finite element space is defined in association with a simplicial finite element grid . A simplicial finite element grid consists of a set of simplexes and the corresponding set of nodal points is denoted by . For a given grid , the corresponding finite element space is given by
Given , it is easy to see that there exists a unique function , known as the nodal basis function, such that
A typical profile of is shown in Fig. 3.2 for and .
Obviously any can be uniquely represented in terms of these nodal basis functions:
Given , let denote all the indices such that contains the nodal point , namely
and denote the maximum number of neighboring elements in the grid
Let denote the support of the nodal basis :
We say that the grid is locally convex if is convex for each .
We proceed next to demonstrate how a finite element function can be represented by a ReLU DNN. Our derivation and analysis are based on the representation of the finite element function as a linear combination of basis functions as follows.
2 DNN representation of finite element functions
As an illustration, we will now demonstrate how a linear finite element function associated with a locally convex grid can be represented by a ReLU DNN. For more general grids, we refer to Remark 1 and §5.
Thanks to (3.3), it suffices to show that each basis function can be represented by a ReLU DNN. We first note that the case where is trivial as the basis function with support in can be easily written as
In order to consider the cases where , we first prove the following lemma.
Given , if is convex, then the corresponding basis function can be written as
where, for each , is the global linear function such that on .
Let be the hyperplane that passes through the subsimplex (of ) that does not contain (see the left figure in Figure 3.3). Since is convex by assumption, all points in should be on the same side of the hyperplane . As a result, for all ,
By combining the above inequality with the following obvious inequality that
and the fact that all are linear, we conclude that
This, together with (3.7), proves that (3.6) holds for all . Thus
On the other hand, if , there exists a such that contains a segment of the straight line that pass through and (see the right figure in Figure 3.3). Again let be the hyperplane associated with as defined above. We note that and are on the different sides of . Since
This finishes the proof of Lemma 3.1.
If is not convex, we could also write the basis function as some max-min functions. But the form of max-min function is not as simple as the case where is convex, and it depends on the shape of the support of the basis function. In some cases, we can write the basis function as the max-min-max form if is a special non-convex set.
We are now in a position to state and prove the main result in this section.
Given a locally convex finite element grid , any linear finite element function with degrees of freedom, can be written as a ReLU-DNN with at most hidden layers and at most number of the neurons.
By Lemma 3.1, the basis function can be written as:
According to this procedure, we get the minimum of terms by splitting them in two, each taking the minimum over at most terms. This contributes to one ReLU hidden layer. Then we can further split the terms
until all the minimum functions contain only 1 or 2 terms.
which is also a ReLU DNN with hidden layer. So we can write a basis function as a -hidden-layer DNN. Considering the binary-tree structure, a -layer full binary-tree has nodes. We can see the number of neurons is at most
By (3.3), the piecewise linear function can be represented as a DNN with hidden layers. The number of neurons is at most .
We now consider a special class of the so-called shape regular finite element grid which satisfies
for some constants and independent of and , where () is the radius of the largest (smallest) ball contained in (containing) .
Given a locally convex and shape regular finite element grid , any linear finite element function with degrees of freedom(DOFs), can be written as a ReLU-DNN with at most hidden layers. The number of neurons is at most for some constant depending on the shape-regularity of . The number of non-zero parameters is at most .
We note that, using the approach described in this section, a finite element function with DOFs can be represented by a DNN with number of weights. This property is expected to be useful when DNNs are used in adaptive mesh-less or vertex-less numerical discretization methods for partial differential equation, which is a subject of further study.
3 Comparison of error estimates in adaptive finite element and DNN methods
Error estimates for adaptive finite element methods are well studied in the literature. For example, an appropriately adapted linear finite element function with DOFs is proved to admit the following error estimate:
if and is the interpolation based on the adapted finite element grid. More details can be founded in .
For a shallow network with DOFs (i.e. neurons), we have the next error estimate in (2.9) as
In comparison, an adaptive finite element function with the same order of DOFs can only have convergence rate of order .
As will be shown in §4, shallow neural networks (namely with only one hidden layer) cannot recover a linear finite element function in general, but may potentially lead to better asymptotic accuracy as the dimension gets larger.
One idea that may help us to understand is that the shallow network is a kind of -term or basis selection ( )approximation scheme with as the basis functions (as shown in Theorem 2.1), similar to using as the basis functions in Fourier approximation or some others in wavelets.
For deep ReLU neural networks, our connections of FEM and ReLU DNNs in this section help us to construct a special ReLU DNN models with depth and parameters for DOFs. By using the approximation result for adaptive FEM, DNN approximation for special structure with DOFs can get
and . This shows that there exists some special deep ReLU DNN structure which is at least as good as adaptive FEM.
In the previous section, we show that a finite element function can be represented by a ReLU DNN with hidden layers.
In particular, we will show the following theorem.
with , and . Consider the finite element functions, if this one hidden layer ReLU DNN can recover any basis function of FEM, then it can recover the finite element space. Thus let us assume is a locally supported basis function for FEM, i.e. is bounded, where
Furthermore, if is a bounded domain, we assume that
For more general case, we can define the set of discontinuous points of a function by
for with be the Heaviside function defined as:
Without loss of generality, we can assume that
is contradictory to the assumption that is locally supported.
Hence cannot recover any piecewise linear function in for .
Following the proof above, we can prove Theorem 2.1.
Proof If and are linearly independent for any , we know that the set of discontinuous points for any nontrivial combinations of cannot be empty. So, this is contradictory to
since where is a combination of .
This shows that despite it has the so-called universal approximation properties , shallow network is not enough in the case of recovering all CPWL functions. More precisely, although the shallow ReLU DNNs are CPWL functions themselves and can approximate any CPWL functions with any accuracy, there are some CPWL functions they cannot represent exactly. As an example, a local basis function in FEM with compact support and some other simple conditions cannot be represented by ReLU DNNs with one hidden layer for dimensions greater than .
As for the upper bound, Theorem 5.2 in provides us with one answer.
This also indicates that is “optimal” for .
General CPWL as a ReLU DNN
In the previous sections, we present a special approach to represent a linear simplicial finite element function by a ReLU DNN. In this section, we discuss a general approach to represent a general CPWL by a ReLU DNN, which is introduced in . In comparison with the special approach in §3, this general approach gives a ReLU DNN with relatively fewer layers but significantly more number of neurons.
Namely, on each , is a linear function:
Proof Because is finite, so there must exist number of subdomains,
Then we proceed to estimate . On the one hand, we have pieces linear functions, so
For the relationship between ReLU DNNs and CPWL functions, we have the next theorem with some estimation.
the number of hidden layers is bounded by
here , satisfying , is the number of subdomains as defined in Lemma 5.1.
The main result in Theorem 5.2 is not new, which can be found in . In the next subsection, we will give an outline of the proof of Theorem 5.2 and in particular to derive the new estimate (5.3) on the number of neurons needed in the DNN representation. We will also discuss the application of this theorem to the simplicial finite element space in § 3.1.
2 On the proof of Theorem 5.2
In this section, we give an outline of the proof of Theorem 5.2. We will mainly follow the proof in which is based on many relevant results in existing literature such as , but add some detailed estimate of the number of neurons.
here and is one of , or .
With all the lemmas above, now we can start to prove Theorem 5.2.
where is the number of subdomains as defined in Lemma 5.1, and , is the number of distinct pieces of .
Let , then
with , and .
For each in (5.8), we can continue this procedure for times. Then we have:
here and .
Now that we write a piecewise linear function in form of (5.9), in order to get the -hidden-layer ReLU DNN, we need to do linear transformations to reduce the cardinality of from to . This can be done by Lemma 5.3.
Following the procedures in Lemma 5.3 by , when reducing one cardinality of , one will become at most terms. If the cardinality is reduced from to , then we need to repeat the whole procedure times. Hence for each , in total we have at most terms. Thus for
with and , we have .
For each with , Lemma 5.4 can be used to get a ReLU DNN with hidden layers. Again by Lemma 5.4, the size is at most . Adding these together, we have the total size is at most , which is .
Note that if , we do not need to use Lemma 5.3, the size will be at most .
The estimation in Theorem 5.2 is a rough one, but still can provide some insights of this DNN representation. It can be seen that although the depth of this DNN is relatively shallow, the size of it might be extremely large, depending on the numbers of subdomains and distinct pieces.
𝑑1\lceil\log_{2}(d+1)\rceil hidden layers Given a locally convex finite element grid, now we have two different ways to represent a linear finite element function. In this part, we estimate the number of neurons if we write the function as a ReLU DNN with at most hidden layers. Then we can compare the sizes of two different approaches. Again we start with the basis functions.
Given , denote the corresponding basis function as . If is convex ,then the ReLU DNN with at most hidden layers has size at most .
For simplicity, let us further assume that
The first step is to write it as the linear combination fo max functions
Our goal is to make every term on the right hand side only take maximum over at most linear functions, here is the dimension.
For any term with linear functions more than , we need to use linear transformation to reduce this number. When reducing by one, one term will become at most terms. Thus will become at most when .
For any term with number of linear functions less or equal than , it remains unchanged. The number of this kind of terms is
Then in total the number of terms should be
Proof According to Corollary 5.3, every basis function has a size independent of , so the size of the DNN function with at most hidden layers is at most .
By comparing the above results with Theorem 3.1, we can see that although the DNN with hidden layers has shallower depth, the number of neurons is much larger than the one with hidden layers.
Low bit-width DNN models
In this section, we will show the rationality of low bit-width models with respect to approximation properties in some sense by investigating that a special type of ReLU DNN model can also recover all CPWL functions. In , an incremental network quantization strategy is proposed for transforming a general trained CNN into some low bit-width version in which there parameters are all zeros or powers of two. Mathematically speaking, low bit-width DNN model is defined as:
where is the activation function and
Under this closed form, they propose a projected gradient descent methods with respect to SGD to train a general R-FCN model for object detection. They also find that 6-bit (i.e ) model works almost the same with classical model in the object detection tasks.
Then it comes the question: why can those kinds of models work? More precisely, for classification or detection problems, can this model separate those data exactly? By our results in previous sections, we find a special family of ReLU DNN which has at most one general layers and all other layers with low bit-width parameters. The results offer modification and theoretical explanation of the existing low bit-width DNNs proposed in the literature.
Here we try to explain why those low bit-width DNN model also work for classification problems to some extent. We have the following result:
Any continuous piecewise function can be represented by the next model:
with defined in (6.2) and .
Proof Because of Theorem 5.2, we can rewrite any piecewise linear function as a ReLU DNN
with . By (3.9), we know that
Also note that if . This completes the proof.
Application to Numerical PDEs
In this section, we discuss the application of DNNs to the numerical solution of partial differential equations (PDEs). In most of our discussion, we consider the following model problem:
The idea of using DNN for numerical PDEs can be traced back to where a collocation method is used. Similar ideas have been explored by many different authors for different types of PDEs.
For the model problem (7.1), roughly speaking, the collocation method amounts to the following least square problem:
here is taken among the DNN function class in the form of (2.3) with a smooth activation function such as sigmoidal function and are some collocation points.
Recently, applied DNN for numerical PDE in the Galerkin setting which amounts to the solution of the following energy minimization problem:
Numerical experiments have demonstrated the potential of this approach. In the rest of this section, we will discuss a number of aspects of this approach from both theoretical and practical viewpoints. In particular, we will discuss its relationship with two popular finite element methods: adaptive finite element method and moving grid method.
The finite element approximation to (7.1) can be written as
where is the finite element space as described in §3.1.
In the finite element setting, the optimization problem (7.4) is to find the coefficient as in (3.3) for a given finite element mesh . Some more sophisticated versions of the finite element method can be obtained by varying or optimizing so that more accurate finite element approximation can be obtained. Roughly speaking, there are two main approaches for optimizing : one is the adaptive finite element method and the other is the moving grid finite element method.
The adaptive finite element method is, roughly speaking, to vary by either coarsening or refining the grid. One main theoretical result is that a family of adapted grids with degrees of freedom can be obtained so that the corresponding adaptive finite element approximation satisfies the following error estimate
We refer to for relevant details and its generalizations.
One interesting observation is that the convergence rate in (7.5) deteriorate badly as increases. Of course, error estimate in the form (7.5) depends on which Sobolev or Besov function classes that the solution belongs to, namely what norms are used in the right hand side of (7.5). But regardless what function classes for the solution , no asymptotic error estimate seems to be known in the literature that is better than .
The moving grid method is, on the other hand, to optimize by varying the location of grid points while preserving topological structure of the grids (in particular the number of grid ponts remain unchanged). This approach proves to be effective in many applications, see . But there are very few theories on the error estimate like (7.5) in the moving grid method.
However, the approximation properties are still unclear even for . proves a similar result for error estimate for with activation function . For general activation functions, or just for ReLU, it is an open problem.
2 DNN-Galerkin method
The finite element methods discussed above, including adaptive method and moving grid method, depend crucially on the underlying finite element grids. Numerical methods based on DNN, as we shall describe now, are a family of numerical methods that require no grids at all. This is reminiscent of the “mesh-less method” that have been much studied in recent years . But the mesh-less method still requires the use of discretization points. The DNN-Galerkin method (as we shall call), namely the Galerkin version of the DNN-element method such as (7.3), goes one step further: it does not even need any discretization points! It is a totally point-free method!
Let us now give a brief discussion on the error estimate for the DNN-Galerkin method. We first recall a classic result by for a DNN with one hidden layer of DOFs ( i.e. neurons),
3 An 1D example: a two point-boundary value problem
As a proof of concept, let us discuss a very simple one dimensional example. We focus on the following model problem:
The exact solution satisfies that
We define the space of ReLU DNNs with one hidden layer as follows:
where is the slope of piecewise linear function in . In order to satisfy the condition , we have the constraint
where . We do the alternate iteration as below,
where is the step-length. Once is fixed, the minimization problem is a quadratic optimization, which is the traditional finite element method. So we solve the FEM solution on grid and then compute the slope on each .
with . In this numerical experiment, the learning rate , the max iteration step is , and the degrees of freedom .
At the beginning of the simulation, we use the adaptive finite element method(AFEM) to get the adaptive grid from the uniform grid. Next we construct DNN solution with the same degrees of freedom. Then we minimize the energy and get the DNN solution. We compare the energy and semi-norm error of uniform grid solution(uFEM), AFEM solution and DNN solution. From Table 1, the energy and semi-norm of the DNN solution are smaller than and , which implies that the DNNs can find better solution than AFEM. Figure 7.1 shows the two different grid points on the same graph, we can easily see the grid points are moving.
Conclusion
By relating ReLU DNN models with linear finite element functions, we provide some theoretical insights on why and how deep neural networks work. It is shown that ReLU DNN models with sufficiently many layers (at least two) can reproduce all the linear finite element functions. This in some sense provides some theoretical explanation of the expressive power of deep learning models and also the necessity of using deep layers in deep learning applications.
Two different approaches are discussed in this paper on the representation of continuous piecewise linear functions by ReLU DNNs. The first approach, as proposed in and described in §5, leads to a DNN representation with a relatively shallow network with hidden layers but a relatively larger number of neurons. The second approach, presented in this paper and described in §3, leads to a representation that has a relatively deeper network with hidden layers (see (3.4)). Further investigations are needed in the future to combine these two approaches to obtain a more balanced representation.
The DNN representation of linear finite element functions opens a door for theoretical explanation and possible improvement on the application of the quantized weights in a convolution neural networks (see ).
One theoretically interesting question addressed in this paper concerns the minimal number of layers that are needed in a DNN model to reproduce general continuous piecewise linear functions. Theorem 4.1 provides a partial answer to this question, namely the minimal number of layers is at least . As a result, the number of layers as given in Theorem 5.2 is optimal for . It is still an open question if this number is also optimal for .
This paper also briefly touches upon the application of DNN in numerical solution of partial differential equations, which is a topic that was investigated in the literature in 1990s and has attracted much attention recently. Our focus is on the comparison of Galerkin type of discretization methods provided by adaptive linear finite element methods and by deep neural networks. When the dimension is large, asymptotic approximation properties are compared for these two different approaches in terms of the number of the dimension and the number of the degrees of freedom. When is small, we use the simplest case to demonstrate that the deep neural network would lead to a more accurate Galerkin approximation to a differential equation solution than the adaptive finite element method would under the assumption that the degrees of freedom are the same in both cases but without comparing their computational costs. This preliminary study seems to indicate that deep neural network may provide a potentially viable approach to the numerical solution of partial differential equations for both high and low dimensions although the underlying computational cost is a serious issue that may or may not be properly addressed by further studies in the future.
Appendix A Lattice representation
Let be a continuous piecewise linear function and the unique-order region partition is
assume the linear function on is . If the parameters satisfy
Then there exists , such that
Let , .
Since here we only involve linear functions, so we can represent each point by using ’s and ’s.
Then the point is on , and we can write as following:
Since here is the minimum, we have:
which means we find the desired pair of and .
Notice here and by the assumptions in the lemma.
and next we show that for all x and every :
Then by Lemma A.1, there must exist with and
Thus for every , we have for all . So if we take the maximum over all these functions, we should have:
This is exactly the desired form. Here , and the number of depends on the domain partition we do.
The drawback of this representation is that the number of may be too large, so we want to deal with other domain partitions, for example, partitions that produce a set of convex regions. The following theorem is an improvement of Theorem 5.1.
If the arrangement inside a convex region is the same, then the proof should be the same as Theorem 5.1. If not, the contribution of each region can be considered as the union of the contributions of its unique-order subsets: , here is defined as in Theorem 5.1. We can see that is no less than , the number of unique-order regions. By applying properties of lattices, we have:
so according to Theorem 5.1, for all in the domain.
Appendix B Proof of Lemmas
In this section, we will show the proofs of the lemmas used in previous sections.
When , if , then (B.1) becomes
and the right hand side (RHS) of (5.4) becomes the same.
If , then (B.1) becomes
and the RHS of (5.4) is also this expression.
For , similarly, the identity (5.4) always holds. Further, if consider the cases , and respectively, we have the following important identity:
here and is one of , or .
B.2 Proof of Lemma 5.3
If for each , then by taking , we already make the RHS of (B.3) as the RHS of (5.6). Otherwise, for some , we can assume that , so
If for each , we can do the following linear transformation , where
in this case, the coefficients of are not 1.
So we can assume there is at least one for , say . Let
(B.7) is already the desired form, because now we only take maximum over linear functions and one constant.
As for the (B.7), notice that now we have eliminated in the third expression. So continue this procedure, at last we will only have constant in the last expression, by taking maximum of this constant and , we can reduce one term in the max expression.
For (B.7), consider the linear transformation :
Then it is the same as (B.7). Follow the same steps as for (B.7), we can achieve the desired result.
Whenever we eliminate one in the expression of , we will gain 3 terms, which is (B.7-B.7). Among these three terms, (B.7) is in desired form, and we need to continue to use (5.5) for (B.7) and (B.7) until we only have constant. Note that in the proof, . By this procedure, we will gain at most terms (see Figure B.1).
B.3 Proof of Lemma 5.4
Since can be represented by a 2-layer ReLU DNN with size 4, we know that can be represented by a ReLU DNN of depth at most and size at most .