Optimal approximation of continuous functions by very deep ReLU networks
Dmitry Yarotsky
Introduction
Expressiveness of deep neural networks with piecewise-linear (in particular, ReLU) activation functions has been a topic of much theoretical research in recent years. The topic has many aspects, with connections to combinatorics (Montufar et al., 2014; Telgarsky, 2016), topology (Bianchini and Scarselli, 2014), Vapnik-Chervonenkis dimension (Bartlett et al., 1998; Sakurai, 1999) and fat-shattering dimension (Kearns and Schapire, 1990; Anthony and Bartlett, 2009), hierarchical decompositions of functions (Mhaskar et al., 2016), information theory (Petersen and Voigtlaender, 2017), etc.
Here we adopt the perspective of classical approximation theory, in which the problem of expressiveness can be basically described as follows. Suppose that is a multivariate function, say on the cube , and has some prescribed regularity properties; how efficiently can one approximate by deep neural networks? The question has been studied in several recent publications. Depth-separation results for some explicit families of functions have been obtained in Safran and Shamir (2016); Telgarsky (2016). General upper and lower bounds on approximation rates for functions characterized by their degree of smoothness have been obtained in Liang and Srikant (2016); Yarotsky (2017). Hanin and Sellke (2017); Lu et al. (2017) establish the universal approximation property and convergence rates for deep and “narrow” (fixed-width) networks. Petersen and Voigtlaender (2017) establish convergence rates for approximations of discontinuous functions. Generalization capabilities of deep ReLU networks trained on finite noisy samples are studied in Schmidt-Hieber (2017).
In the slightly different but closely related context of approximation on balls in Sobolev spaces this question of optimal convergence rate has been studied in Yarotsky (2017). That paper described ReLU network architectures with weights ensuring approximation with error (Theorem 1). The construction was linear in the sense that the network weights depended on the approximated function linearly. Up to the logarithmic factor, this approximation rate matches the optimal rate over all parametric models under assumption of continuous parameter selection (DeVore et al. (1989)). It was also shown in Theorem 2 of Yarotsky (2017) that one can slightly (by a logarithmic factor) improve over this conditionally optimal rate by adjusting network architectures to the approximated function.
On the other hand, it was proved in Theorem 4 of Yarotsky (2017) that ReLU networks generally cannot provide approximation with accuracy better than – a bound with the power twice as big as in the previously mentioned existence result. As was shown in the same theorem, this bound can be strengthened for shallow networks. However, without imposing depth constraints, there was a serious gap between the powers and in the lower and upper accuracy bounds that was left open in that paper.
In the present paper we bridge this gap in the setting of continuous functions (which is slightly more general than the setting of the Sobolev space of Lipschitz functions, , i.e. the case ). Our key insight is the close connection between approximation theory and VC dimension bounds. The lower bound on the approximation accuracy in Theorem 4 of Yarotsky (2017) was derived using the upper VCdim bound from Goldberg and Jerrum (1995). More accurate upper and lower bounds involving the network depth have been given in Bartlett et al. (1998); Sakurai (1999). The recent paper Bartlett et al. (2017) establishes nearly tight lower and upper VCdim bounds: , where is the largest VC dimension of a piecewise linear network with weights and layers. The key element in the proof of the lower bound is the “bit extraction technique” (Bartlett et al. (1998)) providing a way to compress significant expressiveness in a single network weight. In the present paper we adapt this technique to the approximation theory setting.
Our main result is the complete phase diagram for the parameterized family of approximation rates involving the modulus of continuity of the function and the number of weights . We prove that using very deep networks one can approximate function with error , and this rate is optimal up to a logarithmic factor. In fact, the depth of the networks must necessarily grow almost linearly with to achieve this rate, in sharp contrast to shallow networks that can provide approximation with error . Moreover, whereas the slower rate can be achieved using a continuous weight assignment in the network, the optimal rate necessarily requires a discontinuous weight assignment. All this allows us to regard these two kinds of approximations as being in different “phases”. In addition, we explore the intermediate rates with and show that they are also in the discontinuous phase and require network depths We show that the optimal rate can be achieved with a deep constant-width fully-connected architecture, whereas the rates with and depth can be achieved by stacking the deep constant-width architecture with a shallow parallel architecture. Apart from the bit extraction technique, we use the idea of the two-scales expansion from Theorem 2 in Yarotsky (2017) as an essential tool in the proofs of our results.
We formulate precisely the results in Section 2, discuss them in Section 3, and give the proofs in Sections 4, 5.
The results
where is the euclidean norm of .
The network is determined by its architecture and weights. Clearly, the total number of weights, denoted by , is equal to the total number of connections and computation units (not counting the input units). We don’t impose any constraints on the network architecture (see Fig. 1 for an example of a valid architecture).
Throughout the paper, we consider the input dimension as fixed. Accordingly, by constants we will generally mean values that may depend on .
We are interested in relating the approximation errors to the complexity of the function , measured by its modulus of continuity , and to the complexity of the approximating network, measured by its total number of weights . More precisely, we consider approximation rates in terms of the following procedure.
with some constants possibly depending on and but not on or ?
Clearly, if inequality (2) holds for some , then it also holds for any smaller . However, we expect that for smaller the inequality can be in some sense easier to satisfy. In this paper we show that there is in fact a qualitative difference between different regions of ’s.
Our findings are best summarized by the phase diagram shown in Fig. 2. We give an informal overview of the diagram before moving on to precise statements. The region of generally feasible rates is This region includes two qualitatively distinct phases corresponding to and . At the rate (2) can be achieved by fixed-depth networks whose weights depend linearly on the approximated function . In contrast, at the rate can only be achieved by networks with growing depths and whose weights depend discontinuously on the approximated function. In particular, at the rightmost feasible point the approximating architectures have and are thus necessarily extremely deep and narrow.
There exist network architectures with weights and, for each , a weight assignment linear in such that Eq. (2) is satisfied with The network architectures can be chosen as consisting of parallel blocks each having the same architecture that only depends on (see Fig. 4). In particular, the depths of the networks depend on but not on .
The detailed proof is given in Section 4.2; we explain now the idea. The approximating function is constructed as a linear combination of “spike” functions sitting at the knots of the regular grid in , with coefficients given by the values of at these knots (see Fig. 4). For a grid of spacing with an appropriate , the number of knots is while the approximation error is We implement each spike by a block in the network, and implement the whole approximation by summing blocks connected in parallel and weighted. Then the whole network has weights and, by expressing as , the approximation error is , i.e. we obtain the rate (2) with .
We note that the weights of the resulting network either do not depend on at all or are given by with some In particular, the weight assignment is continuous in with respect to the standard topology of .
We turn now to the region Several properties of this region are either direct consequences or slight modifications of existing results, and it is convenient to combine them in a single theorem.
(Feasibility) Approximation rate (2) cannot be achieved with .
(Inherent discontinuity) Approximation rate (2) cannot be achieved with if the weights of are required to depend on continuously with respect to the standard topology of .
(Inherent depth) If approximation rate (2) is achieved with a particular , then the architectures must have depths with some possibly - and -dependent constant
The proofs of these statements have the common element of considering the approximation for functions from the unit ball in the Sobolev space of Lipschitz functions. Namely, suppose that the approximation rate (2) holds with some . Then all can be approximated by architectures with accuracy
with some constant independent of . The three statements of the theorem are then obtained as follows.
a) This statement is a consequence of Theorem 4a) of Yarotsky (2017), which is in turn a consequence of the upper bound for the VC dimension of a ReLU network (Goldberg and Jerrum (1995)). Precisely, Theorem 4a) implies that if an architecture allows to approximate all with accuracy , then with some . Comparing this with Eq. (3), we get
b) This statement is a consequence of the general bound of DeVore et al. (1989) on the efficiency of approximation of Sobolev balls with parametric models having parameters continuously depending on the approximated function. Namely, if the weights of the networks depend on continuously, then Theorem 4.2 of DeVore et al. (1989) implies that with some constant , which implies that
c) This statement can be obtained by combining arguments of Theorem 4 of Yarotsky (2017) with the recently established tight upper bound for the VC dimension of ReLU networks (Bartlett et al. (2017), Theorem 6) with given depth and the number of weights :
Theorem 1 suggests the existence of an approximation phase drastically different from the phase . This new phase would provide better approximation rates, up to , at the cost of deeper networks and some complex, discontinuous weight assignment. The main contribution of the present paper is the proof that this phase indeed exists.
We describe some architectures that, as we will show, correspond to this phase. First we describe the architecture for i.e. for the fastest possible approximation rate. Consider the usual fully-connected architecture connecting neighboring layers and having a constant number of neurons in each layer, see Fig. 6. We refer to this constant number of neurons as the “width” of the network. Such a network of width and depth has weights in total. We will be interested in the scenario of “narrow” networks where is fixed and the network grows by increasing ; then grows linearly with . Below we will refer to the “narrow fully-connected architecture of width having weights”: the depth is supposed in this case to be determined from the above equality; we will assume without loss of generality that the equality is satisfied with an integer . We will show that these narrow architectures provide the approximation rate if the width is large enough (say, ).
In the case we consider another kind of architectures obtained by stacking parallel shallow architectures (akin to those of Proposition 1) with the above narrow fully-connected architectures, see Fig. 6. The first, parallelized part of these architectures consists of blocks that only depend on (but not on or ). The second, narrow fully-connected part will again have a fixed width, and we will take its depth to be . All the remaining weights then go into the first parallel subnetwork, which in particular determines the number of blocks in it. Since the blocks are parallel and their architectures do not depend on , the overall depth of the network is determined by the second, deep subnetwork and is . On the other hand, in terms of the number of weights, for most computation is performed by the first, parallel subnetwork (the deep subnetwork has weights while the parallel one has an asymptotically larger number of weights, ).
Clearly, these stacked architectures can be said to “interpolate” between the purely parallel architectures for and the purely serial architectures for . Note that a parallel computation can be converted into a serial one at the cost of increasing the depth of the network. For , rearranging the parallel subnetwork of the stacked architecture into a serial one would destroy the bound on the depth of the full network, since the parallel subnetwork has weights. However, for this rearrangement does not affect the asymptotic of the depth more than by a constant factor – that’s why we don’t include the parallel subnetwork into the full network in this case.
We state now our main result as the following theorem.
For any , there exist a sequence of architectures with depths and respective weight assignments such that inequality (2) holds with this .
For , an example of such architectures is the narrow fully-connected architectures of constant width .
For , an example of such architectures are stacked architectures described above, with the narrow fully-connected subnetwork having width and depth .
Comparing this theorem with Theorem 1a) we see that the narrow fully-connected architectures provide the best possible approximation in the sense of Eq. (2). Moreover, for the upper bound on the network depth in Theorem 2c) matches the lower bound in Theorem 1c) up to a logarithmic factor. This proves that for our stacked architectures are also optimal (up to a logarithmic correction) if we additionally impose the asymptotic constraint on the network depth.
The full proof of Theorem 2 is given in Section 5; we explain now its main idea. Given a function and some , we first proceed as in Proposition 1 and construct its piecewise linear interpolation on the length scale with . This approximation has uniform error . Then, we improve this approximation by constructing an additional approximation for the discrepancy . This second approximation lives on a smaller length scale with . In contrast to , the second approximation is inherently discrete: we consider a finite set of possible shapes of in patches of linear size , and in each patch we use a special single network weight to encode the shape closest to . The second approximation is then fully determined by the collection of these special encoding weights found for all patches. We make the parallel subnetwork of the full network serve two purposes: in addition to computing the initial approximation as in Proposition 1, the subnetwork returns the position of within its patch along with the weight that encodes the second approximation within this patch. The remaining, deep narrow part of the network then serves to decode the second approximation within this patch from the special weight and compute the value . Since the second approximation lives on the smaller length scale , there are possible approximations within the patch that might need to be encoded in the special weight. It then takes a narrow network of depth to reconstruct the approximation from the special weight using the bit extraction technique of Bartlett et al. (1998). As , we get . At the same time, the second approximation allows us to effectively improve the overall approximation scale from down to , i.e. to , while keeping the total number of weights in the network. This gives us the desired error bound
Discussion
We discuss now our result in the context of general approximation theory and practical machine learning. First, a theorem of Kainen et al. (1999) shows that in the optimal approximations by neural networks the weights generally discontinuously depend on the approximated function, so the discontinuity property that we have established is not surprizing. However, this theorem of Kainen et al. (1999) does not in any way quantify the accuracy gain that can be acquired by giving up the continuity of the weights. Our result does this in the practically important case of deep ReLU networks, and explicitly describes a relevant mechanism.
In general, many nonlinear approximation schemes involve some form of discontinuity, often explicit (e.g., using different expansion bases for different approximated functions (DeVore (1998)). At the same time, discontinuous selection of parameters in parametric models is often perceived as an undesirable phenomenon associated with unreliable approximation (DeVore et al. (1989); DeVore (1998)). We point out, however, that deep architectures considered in the present paper resemble some popular state-of-the-art practical networks for highly accurate image recognition – residual networks (He et al., 2016) and highway networks (Srivastava et al., 2015) that may have dozens or even hundreds of layers. While our model does not explicitly include shortcut connections as in ResNets, a very similar element is effectively present in the proof of Theorem 2 (in the form of channels reserved for passing forward the data). We expect, therefore, that our result may help better understand the properties of ResNet-like networks.
Quantized network weights have been previously considered from the information-theoretic point of view in Bölcskei et al. (2017); Petersen and Voigtlaender (2017). In the present paper we do not use quantized weights in the statement of the approximation problem, but they appear in the solution (namely, we use them to store small-scale descriptions of the approximated function). One can expect that weight quantization may play an important role in the future development of the theory of deep networks.
Preliminaries and proof of Proposition 1
The modulus of continuity defined by (1) is monotone nondecreasing in . By the convexity of the cube for any integer we have More generally, for any we can write
The ReLU function allows us to implement the binary operation as and the binary min operation as . The maximum or minimum of any numbers can then be implemented by chaining binary ’s or ’s. Computation of the absolute value can be implemented by
Without loss of generality, we can assume that hidden units in the ReLU network may not include the ReLU nonlinearity (i.e., may compute just a linear combination of the input values). Indeed, we can simulate nonlinearity-free units just by increasing the weight in the formula , so that is always nonnegative (this is possible since the network inputs are from a compact set and the network implements a continuous function), and then compensating the shift by subtracting appropriate amounts in the units receiving input from the unit in question. In particular, we can simulate in this way trivial “pass forward” (identity) units that simply deliver given values to some later layers of the network.
In the context of deep networks of width considered in Section 5, it will be occasionally convenient to think of the network as consisting of “channels”, i.e. sequences of units with one unit per layer. Channels can be used to pass forward values or to perform computations. For example, suppose that we already have channels that pass forward numbers. If we also need to compute the maximum of these numbers, then, by chaining binary ’s, this can be done in a subnetwork including one additional channel that spans layers.
We denote vectors by boldface characters; the scalar components of a vector are denoted
2 Proof of Proposition 1
Indeed, if then it is easy to see that we can find some such that , hence vanishes outside , as required. On the other hand, consider the restriction of to . Each is nonnegative on this set. For each and , the value is a convex combination of the values of at the vertices of the simplex that belongs to. Since is nonnegative and for all , the minimum in (6) is attained at such that vanishes at all vertices of other than , i.e. at .
Note that each map in (6) is either of the form or (different simplexes may share the same map ), so the minimum actually needs to be taken only over different : \phi(\mathbf{x})=\big{(}\min(\min_{k\neq s}(1+x_{k}-x_{s}),\min_{k}(1+x_{k}),\min_{k}(1-x_{k}))\big{)}_{+}.
We define now the piecewise linear interpolation by
We can bound the modulus of continuity of by for Since any point is within the distance of one of the interpolation knots where vanishes, we can also write
where in the second inequality we again used Eq.(5).
We observe now that formula (7) can be represented by a parallel network shown in Fig. 4 in which the blocks compute the values and their output connections carry the weights . Since the blocks have the same architecture only depending on , for a network with blocks the total number of weights is . It follows that the number of weights will not exceed if we take with a suficiently small constant . Then, the error bound (8) ensures the desired approximation rate (2) with .
Proof of Theorem 2
We divide the proof into three parts. In Section 5.1 we construct the “two-scales” approximation for the given function and estimate its accuracy. In Section 5.2 we describe an efficient way to store and evaluate the refined approximation using the bit extraction technique. Finally, in Section 5.3 we describe the neural network implementations of and verify the network size constraints.
We follow the outline of the proof given after the statement of Theorem 2 and start by constructing the initial interpolating approximation to using Proposition 1. We may assume without loss of generality that must be implemented by a network with not more than weights, reserving the remaining weights for the second approximation. Then is given by Eq. (7), where
with a sufficiently small constant . The error of the approximation is given by Eq. (8). We turn now to approximating the discrepancy
It is convenient to represent as a finite sum of functions with supports consisting of disjoint “patches” of linear size Precisely, let and consider the partition of unity on the cube
Since we have by Eq. (8). Also, if then
where in the last step we used inequality (5).
1.2 The second (discrete) approximation.
We will construct the full approximation in the form
where are approximations to on a smaller length scale . We set
Here the discretization parameter is given by
1.3 Accuracy of the full approximation.
Summing the errors over all , we can bound the error of the full approximation using representations (11), (13) and the above bound:
Observe that, by Eq.(14), this yields the desired approximation rate (2) provided we can indeed implement with not more than weights:
2 Storing and decoding the refined approximation
We have reduced our task to implementing the functions subject to the network size and depth constraints. We describe now an efficient way to compute these functions using a version of the bit extraction technique.
Since we can encode all the values by a single ternary number
where is some enumeration of the multi-indices . The values for all and will be stored as weights in the network and encode the approximation in all the patches.
Such a function can be implemented by Observe that if , then for all the number belongs to one of the three intervals in the r.h.s. of Eq.(21) and hence . Thus, we can reconstruct the values for all by a ReLU network with layers and weights.
On the patch , the function can be expanded over the spike functions as
It is convenient to rewrite this computation in terms of the numbers and the expressions
using summation by parts in the direction :
where we used the identities .
The above representation involves products and we show now how to implement them by a ReLU network. Note that and . But for any we can write
2.4 Mapping 𝐱𝐱\mathbf{x} to the respective patch.
We will use the obtained formula (23) as a basis for computing . But the formula is dependent on the patch the input belongs to: the value depends on the patch index by Eq.(22), and the values must be recovered from the patch-specific encoding weight .
Given , the relevant value of can be selected among the values for all patches by the ReLU network computing
If belongs to some patch then , as required. If does not belong to any patch, then computes some garbage value which will be unimportant because we will ensure that all the factors appearing in the sum (23) will vanish for such .
Now we show how to perform a properly adjusted computation of . Namely, we will compute the function such that
Now if we try to define just by replacing with in the definition (22), then we fulfill the requirement (26), but not (27). To also fulfill (27), consider the auxiliary “suppressing” function
then this satisfies both (26) and (27). Then, based on (23), we can write the final formula for as
We summarize now how is computed by the ReLU network.
The patch-encoding number and the auxiliary functions are computed by Eqs.(25), (28)-(30), (31), respectively.
The numbers that describe the patch containing are decoded from by a deep network with layers and weights as detailed in Section 5.2.2. We choose the enumeration so that same- multi-indices form contiguous blocks of length corresponding to summation over in (33); the ordering of the multi-indices is arbitrary.
The values are computed iteratively in parallel to the respective , using Eq.(32). For , the sum is computed by adding the next term to the sum retained from the previous step. This shows that all the values can be computed with linear and ReLU operations.
The sum (33) is computed with the help of multiplication formula (24).
3 Network implementations
We can now establish statement a) of Theorem 2. Recall that we have already established the approximation rate in Eq. (20), but still need to show that the total number of weights is bounded by . The initial approximation was constructed so as to be implementable by a network architecture with weights, and we need to show now that the second approximation can also be implemented by an architecture with not more than weights.
By Eq. (13), computation of amounts to computations of the terms . These latter computations are detailed in stages 1–4 in Section 5.2.5. Stage 1 can be performed by a parallellized ReLU network of depth with elementary operations. If the constant in our choice (9) of is small enough, then the total number of weights required for stage 1 does not exceed . Next, stages 2–4 require a deep network that has layers and weights. By our choice of in Eqs. (9), (14) we can write as . Assuming is small enough, the number of weights for stages 2–4 does not exceed as well, and so the total number of weights required for does not exceed , as desired. In addition, the depth of the network implementation of is determined by the depth of stages 2–4, i.e. is also as claimed in Theorem 2a).
We describe now in more detail the implementation of by the narrow fully-connected or by the stacked architectures as stated in parts b) and c) of Theorem 2.
3.2 Implementation of f~~𝑓\widetilde{f} in the case p=2ν𝑝2𝜈p=\frac{2}{\nu}.
In this case we implement by a narrow fully-connected network and estimate now its sufficient width. We reserve channels to pass forward the components of the input and one channel to store partial results of the sum The terms in this sum can be computed in a serial way, so we only need to estimate the network width sufficient for implementing and any .
Given , the value or, more generally, , can be computed by (6) using chained binary ’s and just one additional channel. Therefore, by (7), can be implemented with the standard network of width .
Now consider implementation of as detailed in Section 5.2.5. In stage 1, we compute the numbers and the -dimensional vector . This can be done with channels (one extra channel is reserved for storing partial sums). We reserve channels for the next stages to pass forward the values . In stage 2, we decode the numbers from This can be done with 4 channels: one is used to store and refresh the values , another to output the sequence , and two more channels to compute . In stage 3, we compute the numbers . This can be done with 3 channels: one is used to keep partial sums , another to compute current , and the third to compute by Eq.(32). Stages 2 and 3 are performed in parallel and their channels are aligned so that for each the values and can be found in the same layer. In stage 4, we compute the sum (33). This stage is performed in parallel to stages 2 and 3 and requires one more channel for computing the ReLU operations in the multiplication formula (24). We conclude that channels are sufficient for the whole computation of and hence for the whole computation of
3.3 Implementation of f~~𝑓\widetilde{f} in the case p∈(1ν,2ν)𝑝1𝜈2𝜈p\in(\frac{1}{\nu},\frac{2}{\nu}).
In this case the initial approximation as well as stage 1 in the computations of are implemented by the parallel shallow subnetwork and only stages 2–4 of the computations of are implemented by the deep narrow subnetwork. We can make some blocks of the parallel subnetwork output the terms of the expansion of while other blocks output the numbers and the -dimensional vector . The stages 2–4 are then implemented as in the previous case , but now all the must be processed in parallel – that’s why the width of the deep subnetwork is taken to be
Acknowledgments
The author thanks Alexander Kuleshov and the anonymous referees for helpful comments and suggestions. The research was supported by the Skoltech SBI Bazykin/Yarotsky project.