Self-Normalizing Neural Networks

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter

Introduction

Deep Learning has set new records at different benchmarks and led to various commercial applications. Recurrent neural networks (RNNs) achieved new levels at speech and natural language processing, for example at the TIMIT benchmark or at language translation, and are already employed in mobile devices. RNNs have won handwriting recognition challenges (Chinese and Arabic handwriting) and Kaggle challenges, such as the “Grasp-and-Lift EEG” competition. Their counterparts, convolutional neural networks (CNNs), excel at vision and video tasks. CNNs are on par with human dermatologists at the visual detection of skin cancer. The visual processing for self-driving cars is based on CNNs, as is the visual input to AlphaGo, which has beaten one of the best human Go players. At vision challenges, CNNs are constantly winning, for example at the large ImageNet competition, but also at almost all Kaggle vision challenges, such as the “Diabetic Retinopathy” and the “Right Whale” challenges.

However, looking at Kaggle challenges that are not related to vision or sequential tasks, gradient boosting, random forests, or support vector machines (SVMs) are winning most of the competitions. Deep Learning is notably absent, and for the few cases where feed-forward neural networks (FNNs) won, they are shallow. For example, the HIGGS challenge, the Merck Molecular Activity challenge, and the Tox21 Data challenge were all won by FNNs with at most four hidden layers. Surprisingly, it is hard to find success stories with FNNs that have many hidden layers, though they would allow for different levels of abstract representations of the input.

To robustly train very deep CNNs, batch normalization evolved into a standard to normalize neuron activations to zero mean and unit variance. Layer normalization also ensures zero mean and unit variance, while weight normalization ensures zero mean and unit variance if in the previous layer the activations have zero mean and unit variance. However, training with normalization techniques is perturbed by stochastic gradient descent (SGD), stochastic regularization (like dropout), and the estimation of the normalization parameters. Both RNNs and CNNs can stabilize learning via weight sharing, therefore they are less prone to these perturbations. In contrast, FNNs trained with normalization techniques suffer from these perturbations and have high variance in the training error (see Figure 1). This high variance hinders learning and slows it down. Furthermore, strong regularization, such as dropout, is not possible as it would further increase the variance, which in turn would lead to divergence of the learning process. We believe that this sensitivity to perturbations is the reason that FNNs are less successful than RNNs and CNNs.

Self-normalizing neural networks (SNNs) are robust to perturbations and do not have high variance in their training errors (see Figure 1). SNNs push neuron activations to zero mean and unit variance, thereby leading to the same effect as batch normalization, which enables robust learning of many layers. SNNs are based on scaled exponential linear units (SELUs), which induce self-normalizing properties like variance stabilization, which in turn avoids exploding and vanishing gradients.

Self-normalizing Neural Networks (SNNs)

We consider the mapping $g$ that maps mean and variance of the activations from one layer to mean and variance of the activations in the next layer.

A neural network is self-normalizing if it possesses a mapping $g:\Omega\mapsto\Omega$ for each activation $y$ that maps mean and variance from one layer to the next and has a stable and attracting fixed point depending on $(\omega,\tau)$ in $\Omega$. Furthermore, the mean and the variance remain in the domain $\Omega$, that is $g(\Omega)\subseteq\Omega$, where $\Omega=\{(\mu,\nu)\ |\ \mu\in[\mu_{\min},\mu_{\max}],\nu\in[\nu_{\min},\nu_{\max}]\}$. When iteratively applying the mapping $g$, each point within $\Omega$ converges to this fixed point.

Therefore, we consider activations of a neural network to be normalized if both their mean and their variance across samples are within predefined intervals. If mean and variance of $\bm{x}$ are already within these intervals, then also mean and variance of $\bm{y}$ remain in these intervals, i.e., the normalization is transitive across layers. Within these intervals, the mean and variance both converge to a fixed point if the mapping $g$ is applied iteratively.

Therefore, SNNs keep normalization of activations when propagating them through layers of the network. The normalization effect is observed across layers of a network: in each layer the activations are getting closer to the fixed point. The normalization effect can also be observed for two fixed layers across learning steps: perturbations of lower layer activations or weights are damped in the higher layer by drawing the activations towards the fixed point. If for all $y$ in the higher layer, $\omega$ and $\tau$ of the corresponding weight vector are the same, then the fixed points are also the same. In this case we have a unique fixed point for all activations $y$. Otherwise, in the more general case, $\omega$ and $\tau$ differ for different $y$ but the mean activations are drawn into $[\mu_{\min},\mu_{\max}]$ and the variances are drawn into $[\nu_{\min},\nu_{\max}]$.

We aim at constructing self-normalizing neural networks by adjusting the properties of the function $g$. Only two design choices are available for the function $g$: (1) the activation function and (2) the initialization of the weights.

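For the activation function, we propose “scaled exponential linear units” (SELUs) to render an FNN self-normalizing. The SELU activation function is given by $\operatorname{selu}(x)=\lambda x$ if $x>0$ and $\operatorname{selu}(x)=\lambda\left(\alpha e^{x}-\alpha\right)$ if $x\leqslant 0$.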

SELUs allow the construction of a mapping $g$ with properties that lead to SNNs. SNNs cannot be derived with (scaled) rectified linear units (ReLUs), sigmoid units, $\tanh$ units, and leaky ReLUs. The activation function is required to have (1) negative and positive values for controlling the mean, (2) saturation regions (derivatives approaching zero) to dampen the variance if it is too large in the lower layer, (3) a slope larger than one to increase the variance if it is too small in the lower layer, (4) a continuous curve. The latter ensures a fixed point, where variance damping is equalized by variance increasing. We met these properties of the activation function by multiplying the exponential linear unit (ELU) with $\lambda>1$ to ensure a slope larger than one for positive net inputs.
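The four requirements listed above can be checked on a small implementation. The following is a minimal sketch, assuming the commonly published numerical values of $\alpha_{\rm 01}$ and $\lambda_{\rm 01}$ (they are not stated in this text) and a hypothetical helper name `selu`.

```python
import numpy as np

# Assumed constants: commonly published values of alpha_01 and lambda_01.
# They do not appear in the surrounding text.
ALPHA_01 = 1.6732632423543772
LAMBDA_01 = 1.0507009873554805


def selu(x, alpha=ALPHA_01, lam=LAMBDA_01):
    """Scaled ELU: lam * x for x > 0, lam * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # Clamp the argument of expm1 so large positive inputs cannot overflow.
    negative_branch = alpha * np.expm1(np.minimum(x, 0.0))
    return lam * np.where(x > 0.0, x, negative_branch)


xs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
ys = selu(xs)
print(ys)
```

In this sketch the output is negative for negative inputs and positive for positive inputs (property 1), saturates near $-\lambda\alpha$ for very negative inputs (property 2), has slope $\lambda>1$ on the positive side (property 3), and is continuous at zero (property 4).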

For the weight initialization, we propose $\omega=0$ and $\tau=1$ for all units in the higher layer. The next paragraphs will show the advantages of this initialization. Of course, during learning these assumptions on the weight vector will be violated. However, we can prove the self-normalizing property even for weight vectors that are not normalized; therefore, the self-normalizing property can be kept during learning and weight changes.
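As an illustration of what $\omega=0$ and $\tau=1$ mean for a single unit, the sketch below assumes that $\omega$ is the sum of the unit's incoming weights and $\tau$ the sum of their squares (these definitions are not spelled out in this text). Centering and rescaling a random vector is one hypothetical way to satisfy both conditions exactly; it is not claimed to be the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256  # hypothetical number of inputs feeding one unit

w = rng.normal(size=n)
w = w - w.mean()           # enforce omega = sum_i w_i = 0
w = w / np.linalg.norm(w)  # enforce tau = sum_i w_i^2 = 1 (keeps the sum at 0)

omega = w.sum()
tau = np.sum(w ** 2)
print(f"omega={omega:.2e}, tau={tau:.6f}")
```

Drawing weights from a zero-mean Gaussian with variance $1/n$ would satisfy the two conditions only in expectation; the explicit normalization here is used purely to make the conditions exact for the example.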

These integrals can be analytically computed and lead to the following mappings of the moments:

The spectral norm of $\mathcal{J}(0,1)$ (its largest singular value) is $0.7877<1$. That means $g$ is a contraction mapping around the fixed point $(0,1)$ (the mapping is depicted in Figure 2). Therefore, $(0,1)$ is a stable fixed point of the mapping $g$.
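The fixed point and the spectral norm quoted above can be reproduced numerically. The sketch below is an assumption-laden reconstruction: it takes the net input to be Gaussian with mean $\mu\omega$ and variance $\nu\tau$, derives the first two moments of the SELU output from standard Gaussian integrals, and uses the commonly published values of $\alpha_{\rm 01}$ and $\lambda_{\rm 01}$. The function names and the finite-difference Jacobian are illustrative choices, not the authors' code.

```python
import math
import numpy as np

ALPHA_01 = 1.6732632423543772   # assumed published value
LAMBDA_01 = 1.0507009873554805  # assumed published value


def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def g(mu, nu, omega=0.0, tau=1.0, alpha=ALPHA_01, lam=LAMBDA_01):
    """Map (mean, variance) of one layer to (mean, variance) of the next."""
    m, v = mu * omega, nu * tau          # mean and variance of the net input z
    s = math.sqrt(v)
    p0 = Phi(-m / s)                                         # P(z <= 0)
    e1 = math.exp(m + 0.5 * v) * Phi(-(m + v) / s)           # E[e^z; z <= 0]
    e2 = math.exp(2 * m + 2 * v) * Phi(-(m + 2 * v) / s)     # E[e^{2z}; z <= 0]
    pos1 = m * Phi(m / s) + s * phi(m / s)                   # E[z; z > 0]
    pos2 = (m * m + v) * Phi(m / s) + m * s * phi(m / s)     # E[z^2; z > 0]
    mean = lam * (pos1 + alpha * (e1 - p0))
    second = lam ** 2 * (pos2 + alpha ** 2 * (e2 - 2 * e1 + p0))
    return mean, second - mean ** 2


# Iterate the mapping from a point away from (0, 1).
mu, nu = 0.1, 1.5
for _ in range(100):
    mu, nu = g(mu, nu)
print(f"after 100 iterations: mu={mu:.6f}, nu={nu:.6f}")


def jacobian(mu, nu, h=1e-6):
    """Central finite-difference Jacobian of g with respect to (mu, nu)."""
    J = np.zeros((2, 2))
    for col, (dm, dn) in enumerate([(h, 0.0), (0.0, h)]):
        hi = np.array(g(mu + dm, nu + dn))
        lo = np.array(g(mu - dm, nu - dn))
        J[:, col] = (hi - lo) / (2 * h)
    return J


spectral_norm = np.linalg.norm(jacobian(0.0, 1.0), 2)  # largest singular value
print(f"spectral norm at (0, 1): {spectral_norm:.4f}")
```

With $\omega=0$ the net input mean is zero regardless of $\mu$, so the first column of the Jacobian vanishes and the norm is determined by the derivatives with respect to $\nu$.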

A normalized weight vector $\bm{w}$ cannot be ensured during learning. For SELU parameters $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$, we show in the next theorem that if $(\omega,\tau)$ is close to $(0,1)$, then $g$ still has an attracting and stable fixed point that is close to $(0,1)$. Thus, in the general case there still exists a stable fixed point which, however, depends on $(\omega,\tau)$. If we restrict $(\mu,\nu,\omega,\tau)$ to certain intervals, then we can show that $(\mu,\nu)$ is mapped to the respective intervals. Next we present the central theorem of this paper, from which follows that SELU networks are self-normalizing under mild conditions on the weights.

We assume $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$. We restrict the range of the variables to the following intervals $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.95,1.1]$, that define the functions’ domain $\Omega$. For $\omega=0$ and $\tau=1$, the mapping Eq. (3) has the stable fixed point $(\mu,\nu)=(0,1)$, whereas for other $\omega$ and $\tau$ the mapping Eq. (3) has a stable and attracting fixed point depending on $(\omega,\tau)$ in the $(\mu,\nu)$-domain: $\mu\in[-0.03106,0.06773]$ and $\nu\in[0.80009,1.48617]$. All points within the $(\mu,\nu)$-domain converge when iteratively applying the mapping Eq. (3) to this fixed point.

Consequently, feed-forward neural networks with many units in each layer and with the SELU activation function are self-normalizing (see Definition 1), which readily follows from Theorem 1. To give an intuition, the main property of SELUs is that they damp the variance for negative net inputs and increase the variance for positive net inputs. The variance damping is stronger if net inputs are further away from zero while the variance increase is stronger if net inputs are close to zero. Thus, for large variance of the activations in the lower layer the damping effect is dominant and the variance decreases in the higher layer. Vice versa, for small variance the variance increase is dominant and the variance increases in the higher layer.
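The self-normalizing behavior described above can be observed empirically. The following is a minimal simulation sketch under stated assumptions: layer sizes are hypothetical, weights are drawn from a zero-mean Gaussian with variance $1/n$ (so that $\tau=1$ holds in expectation), the SELU constants are the commonly published values, and the statistics are pooled over all units and samples rather than computed per unit.

```python
import numpy as np

ALPHA_01 = 1.6732632423543772   # assumed published value
LAMBDA_01 = 1.0507009873554805  # assumed published value


def selu(x):
    return LAMBDA_01 * np.where(x > 0.0, x, ALPHA_01 * np.expm1(np.minimum(x, 0.0)))


rng = np.random.default_rng(0)
n_units, n_samples, n_layers = 512, 1000, 32  # hypothetical sizes

# Inputs that are deliberately not normalized (mean 0.3, variance 1.69).
x = rng.normal(loc=0.3, scale=1.3, size=(n_samples, n_units))

history = []
for _ in range(n_layers):
    # Fresh random weights per layer: zero mean, variance 1 / n_units.
    W = rng.normal(scale=1.0 / np.sqrt(n_units), size=(n_units, n_units))
    x = selu(x @ W)
    history.append((x.mean(), x.var()))  # pooled over units and samples

final_mean, final_var = history[-1]
print(f"layer {n_layers}: mean={final_mean:+.3f}, variance={final_var:.3f}")
```

Because the network is random and of finite width, the pooled statistics only approach zero mean and unit variance up to sampling noise; the sketch illustrates the tendency and does not verify the theorem.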

However, we cannot guarantee that mean and variance remain in the domain $\Omega$. Therefore, we next treat the case where $(\mu,\nu)$ are outside $\Omega$. It is especially crucial to consider $\nu$ because this variable has much stronger influence than $\mu$. Mapping $\nu$ across layers to a high value corresponds to an exploding gradient, since the Jacobian of the activation of high layers with respect to activations in lower layers has large singular values. Analogously, mapping $\nu$ across layers to a low value corresponds to a vanishing gradient. Bounding the mapping of $\nu$ from above and below would avoid both exploding and vanishing gradients. Theorem 2 states that the variance of neuron activations of SNNs is bounded from above, and therefore ensures that SNNs learn robustly and do not suffer from exploding gradients.

The proof can be found in the Appendix Section A3. Thus, when mapped across many layers, the variance in the interval $[3,16]$ is mapped to a value below $3$. Consequently, all fixed points $(\mu,\nu)$ of the mapping $g$ (Eq. (3)) have $\nu<3$. Analogously, Theorem 3 states that the variance of neuron activations of SNNs is bounded from below, and therefore ensures that SNNs do not suffer from vanishing gradients.
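The decrease of large variances can be sanity-checked by direct numerical integration. The sketch below assumes a Gaussian net input with mean $\mu\omega$ and variance $\nu\tau$, the commonly published SELU constants, and only the single slice $\mu=0$, $\omega=0$, $\tau=1$; it does not cover the full domain of Theorem 2.

```python
import numpy as np

ALPHA_01 = 1.6732632423543772   # assumed published value
LAMBDA_01 = 1.0507009873554805  # assumed published value


def selu(z):
    return LAMBDA_01 * np.where(z > 0.0, z, ALPHA_01 * np.expm1(np.minimum(z, 0.0)))


def next_moments(mu, nu, omega=0.0, tau=1.0, n_grid=200_001):
    """Mean and variance of selu(z) for z ~ N(mu*omega, nu*tau), by a Riemann sum."""
    m, s = mu * omega, np.sqrt(nu * tau)
    z = np.linspace(m - 12.0 * s, m + 12.0 * s, n_grid)
    dz = z[1] - z[0]
    pdf = np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
    a = selu(z)
    mean = np.sum(a * pdf) * dz
    second = np.sum(a * a * pdf) * dz
    return mean, second - mean ** 2


nus = np.arange(3.0, 16.5, 1.0)  # variances 3, 4, ..., 16
mapped = np.array([next_moments(0.0, nu)[1] for nu in nus])
for nu, nu_next in zip(nus, mapped):
    print(f"nu={nu:5.1f} -> {nu_next:.4f}")
```

On this slice every mapped variance is smaller than the input variance, which is the behavior the theorem asserts for the interval $[3,16]$.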

In the derivation of the mapping (Eq. (3)), we used the central limit theorem (CLT) to approximate the network inputs $z=\sum_{i=1}^{n}w_{i}x_{i}$ with a normal distribution. We justified normality because network inputs represent a weighted sum of the inputs $x_{i}$, where for Deep Learning $n$ is typically large. The Berry-Esseen theorem states that the convergence rate to normality is $n^{-1/2}$. In the classical version of the CLT, the random variables have to be independent and identically distributed, which typically does not hold for neural networks. However, the Lyapunov CLT does not require the variables to be identically distributed anymore. Furthermore, even under weak dependence, sums of random variables converge in distribution to a Gaussian distribution.
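The normal approximation of the net input can be illustrated with a small experiment. The sketch below is a toy setup under stated assumptions: inputs are independent uniform variables with zero mean and unit variance, weights are equal with $\sum_i w_i^2=1$, and the excess kurtosis (zero for a Gaussian) is used as a simple, hypothetical measure of non-normality.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 50_000


def excess_kurtosis(z):
    """Sample excess kurtosis; it is 0 for a Gaussian distribution."""
    z = z - z.mean()
    return np.mean(z ** 4) / np.mean(z ** 2) ** 2 - 3.0


kurt = {}
for n in (1, 4, 16, 64):
    # Uniform inputs on [-sqrt(3), sqrt(3)]: zero mean, unit variance.
    x = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n_samples, n))
    w = np.full(n, 1.0 / np.sqrt(n))  # equal weights, sum of squares = 1
    z = x @ w                         # net input z = sum_i w_i x_i
    kurt[n] = excess_kurtosis(z)
    print(f"n={n:3d}: excess kurtosis = {kurt[n]:+.3f}")
```

For a single uniform input the excess kurtosis is clearly negative, and it shrinks toward zero as $n$ grows, in line with the argument that net inputs with many summands are close to Gaussian.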

Experiments

We compare SNNs to other deep networks at different benchmarks. Hyperparameters such as number of layers (blocks), neurons per layer, learning rate, and dropout rate, are adjusted by grid-search for each dataset on a separate validation set (see Section A4). We compare the following FNN methods:

“MSRAinit”: FNNs without normalization and with ReLU activations and “Microsoft weight initialization”.

“BatchNorm”: FNNs with batch normalization.

“LayerNorm”: FNNs with layer normalization.

“WeightNorm”: FNNs with weight normalization.

“ResNet”: Residual networks adapted to FNNs using residual blocks with 2 or 3 layers with rectangular or diavolo shape.

The benchmark comprises 121 classification datasets from the UCI Machine Learning repository from diverse application areas, such as physics, geology, or biology. The size of the datasets ranges between 10 and 130,000 data points and the number of features from 4 to 250. In the abovementioned work, there were methodological mistakes which we avoided here. Each compared FNN method was optimized with respect to its architecture and hyperparameters on a validation set that was then removed from the subsequent analysis. The selected hyperparameters served to evaluate the methods in terms of accuracy on the pre-defined test sets (details on the hyperparameter selection are given in Section A4). The accuracies are reported in Table A11. We ranked the methods by their accuracy for each prediction task and compared their average ranks. SNNs significantly outperform all competing networks in pairwise comparisons (paired Wilcoxon test across datasets) as reported in Table 1 (left panel).

We further included 17 machine learning methods representing diverse method groups in the comparison and grouped the datasets into “small” and “large” datasets (for details see Section A4). On 75 small datasets with fewer than 1,000 data points, random forests and SVMs outperform SNNs and other FNNs. On 46 larger datasets with at least 1,000 data points, SNNs show the highest performance followed by SVMs and random forests (see right panel of Table 1, for complete results see Tables A12 and A13). Overall, SNNs have outperformed state-of-the-art machine learning methods on UCI datasets with at least 1,000 data points.

Typically, hyperparameter selection chose SNN architectures that were much deeper than the selected architectures of other FNNs, with an average depth of 10.8 layers, compared to average depths of 6.0 for BatchNorm, 3.8 for WeightNorm, 7.0 for LayerNorm, 5.9 for Highway, and 7.1 for MSRAinit networks. For ResNet, the average number of blocks was 6.35. SNNs with many more than 4 layers often provide the best predictive accuracies across all neural networks.

The Tox21 challenge dataset comprises about 12,000 chemical compounds whose twelve toxic effects have to be predicted based on their chemical structure. We used the validation sets of the challenge winners for hyperparameter selection (see Section A4) and the challenge test set for performance comparison. We repeated the whole evaluation procedure 5 times to obtain error bars. The results in terms of average AUC are given in Table 2. In 2015, the challenge organized by the US NIH was won by an ensemble of shallow ReLU FNNs which achieved an AUC of 0.846. Besides FNNs, this ensemble also contained random forests and SVMs. Single SNNs came close with an AUC of 0.845 ± 0.003. The best performing SNNs have 8 layers, compared to the runners-up, ReLU networks with layer normalization, which have 2 and 3 layers. Batchnorm and weightnorm networks also typically perform best with shallow networks of 2 to 4 layers (Table 2). The deeper the networks, the larger the difference in performance between SNNs and other methods (see columns 5–8 of Table 2). The best performing method is an SNN with 8 layers.

For a decade, machine learning methods have been used to identify pulsars in radio wave signals. Recently, the High Time Resolution Universe Survey (HTRU2) dataset has been released with 1,639 real pulsars and 16,259 spurious signals. Currently, the highest AUC value of a 10-fold cross-validation is 0.976, which has been achieved by Naive Bayes classifiers, followed by decision tree C4.5 with 0.949 and SVMs with 0.929. We used eight features constructed by the PulsarFeatureLab as used previously. We assessed the performance of FNNs using 10-fold nested cross-validation, where the hyperparameters were selected in the inner loop on a validation set (for details on the hyperparameter selection see Section A4). Table 3 reports the results in terms of AUC. SNNs outperform all other methods and have pushed the state-of-the-art to an AUC of 0.98.

Conclusion

We have introduced self-normalizing neural networks for which we have proved that neuron activations are pushed towards zero mean and unit variance when propagated through the network. Additionally, for activations not close to unit variance, we have proved an upper and lower bound on the variance mapping. Consequently, SNNs do not face vanishing and exploding gradient problems. Therefore, SNNs work well for architectures with many layers, allowed us to introduce a novel regularization scheme, and learn very robustly. On 121 UCI benchmark datasets, SNNs have outperformed other FNNs with and without normalization techniques, such as batch, layer, and weight normalization, or specialized architectures, such as Highway or Residual networks. SNNs also yielded the best results on drug discovery and astronomy tasks. The best performing SNN architectures are typically very deep in contrast to other FNNs.

Acknowledgments

This work was supported by IWT research grant IWT150865 (Exaptation), H2020 project grant 671555 (ExCAPE), grant IWT135122 (ChemBioBridge), Zalando SE with Research Agreement 01/2016, Audi.JKU Deep Learning Center, Audi Electronic Venture GmbH, and the NVIDIA Corporation.

References

The references are provided in Section A7.

Appendix

This appendix is organized as follows: the first section sets the background, definitions, and formulations. The main theorems are presented in the next section. The following section is devoted to the proofs of these theorems. The next section reports additional results and details on the performed computational experiments, such as hyperparameter selection. The last section shows that our theoretical bounds can be confirmed by numerical methods as a sanity check.

The proof of Theorem 1 is based on Banach’s fixed point theorem, for which we require (1) a contraction mapping, which is proved in Subsection A3.4.1, and (2) that the mapping stays within its domain, which is proved in Subsection A3.4.2. For part (1), the proof relies on the main Lemma 12, which is a computer-assisted proof, and can be found in Subsection A3.4.1. The validity of the computer-assisted proof is shown in Subsection A3.4.5 by error analysis and the precision of the functions’ implementation. The last Subsection A3.4.6 compiles various lemmata with intermediate results that support the proofs of the main lemmata and theorems.

A1 Background

For neural networks with scaled exponential linear units, the mean of the activations in the next layer is computed according to

and the second moment of the activations in the next layer is computed according to

The parameters $\alpha_{\rm 01}$ and $\lambda_{\rm 01}$ ensure

Figure 2 visualizes the mapping $g$ for $\omega=0$ and $\tau=1$ and $\alpha_{\rm 01}$ and $\lambda_{\rm 01}$ at a few pre-selected points. It can be seen that $(0,1)$ is an attracting fixed point of the mapping $g$.

A2 Theorems

Theorem 1 shows that the mapping $g$ defined by Eq. (4) and Eq. (5) exhibits a stable and attracting fixed point close to zero mean and unit variance. Theorem 1 establishes the self-normalizing property of self-normalizing neural networks (SNNs). The stable and attracting fixed point leads to robust learning through many layers.

We assume $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$. We restrict the range of the variables to the domain $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.95,1.1]$. For $\omega=0$ and $\tau=1$, the mapping Eq. (4) and Eq. (5) has the stable fixed point $(\mu,\nu)=(0,1)$. For other $\omega$ and $\tau$ the mapping Eq. (4) and Eq. (5) has a stable and attracting fixed point depending on $(\omega,\tau)$ in the $(\mu,\nu)$-domain: $\mu\in[-0.03106,0.06773]$ and $\nu\in[0.80009,1.48617]$. All points within the $(\mu,\nu)$-domain converge when iteratively applying the mapping Eq. (4) and Eq. (5) to this fixed point.

A2.2 Theorem 2: Decreasing Variance from Above

The variance decreases in $[3,16]$ and all fixed points $(\mu,\nu)$ of mapping Eq. (5) and Eq. (4) have $\nu<3$.

A2.3 Theorem 3: Increasing Variance from Below

We consider $\lambda=\lambda_{\rm 01}$, $\alpha=\alpha_{\rm 01}$ and the two domains $\Omega_{1}^{-}=\{(\mu,\omega,\nu,\tau)\ |\ -0.1\leqslant\mu\leqslant 0.1,-0.1\leqslant\omega\leqslant 0.1,0.05\leqslant\nu\leqslant 0.16,0.8\leqslant\tau\leqslant 1.25\}$ and $\Omega_{2}^{-}=\{(\mu,\omega,\nu,\tau)\ |\ -0.1\leqslant\mu\leqslant 0.1,-0.1\leqslant\omega\leqslant 0.1,0.05\leqslant\nu\leqslant 0.24,0.9\leqslant\tau\leqslant 1.25\}$.

A3 Proofs of the Theorems

We have to show that the mapping $g$ defined by Eq. (4) and Eq. (5) has a stable and attracting fixed point close to $(0,1)$. To prove this statement and Theorem 1, we apply the Banach fixed point theorem, which requires (1) that $g$ is a contraction mapping and (2) that $g$ does not map outside the function’s domain, concretely:

Let $(X,d)$ be a non-empty complete metric space with a contraction mapping $f:X\to X$. Then $f$ has a unique fixed-point $x_{f}\in X$ with $f(x_{f})=x_{f}$. Every sequence $x_{n}=f(x_{n-1})$ with starting element $x_{0}\in X$ converges to the fixed point: $x_{n}\xrightarrow[n\to\infty]{\ }x_{f}$.

Contraction mappings are functions that map two points such that their distance is decreasing:

A function $f:X\to X$ on a metric space $X$ with distance $d$ is a contraction mapping, if there is a $0\leqslant\delta<1$, such that for all points $\bm{u}$ and $\bm{v}$ in $X$: $d(f(\bm{u}),f(\bm{v}))\leqslant\delta d(\bm{u},\bm{v})$.
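The Banach fixed point theorem and the contraction property can be illustrated on a one-dimensional toy example. The sketch below uses $f(x)=\cos(x)$ on $[0,1]$, which maps the interval into itself and has $|f'(x)|\leqslant\sin(1)<1$ there; this example and the helper name `fixed_point` are illustrative assumptions and unrelated to the mapping $g$.

```python
import math


def fixed_point(f, x0, tol=1e-12, max_iter=1000):
    """Iterate x_n = f(x_{n-1}) until successive iterates differ by less than tol."""
    x = x0
    for k in range(max_iter):
        x_next = f(x)
        if abs(x_next - x) < tol:
            return x_next, k + 1
        x = x_next
    raise RuntimeError("fixed-point iteration did not converge")


# cos is a contraction on [0, 1] with constant delta = sin(1) < 1.
x_star, steps = fixed_point(math.cos, 0.0)
print(f"fixed point {x_star:.10f} reached after {steps} iterations")
```

Starting from any other point in $[0,1]$ leads to the same limit, which mirrors the statement that every sequence $x_{n}=f(x_{n-1})$ converges to the unique fixed point.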

To show that $g$ is a contraction mapping in $\Omega$ with distance $\|.\|_{2}$, we use the Mean Value Theorem for $u,v\in\Omega$

in which $M$ is an upper bound on the spectral norm of the Jacobian $\mathcal{H}$ of $g$. The spectral norm is given by the largest singular value of the Jacobian of $g$. If the largest singular value of the Jacobian is smaller than 1, the mapping $g$ of the mean and variance to the mean and variance in the next layer is contracting. We show that the largest singular value is smaller than 1 by evaluating the function for the singular value $S(\mu,\omega,\nu,\tau,\lambda,\alpha)$ on a grid. Then we use the Mean Value Theorem to bound the deviation of the function $S$ between grid points. To this end, we have to bound the gradient of $S$ with respect to $(\mu,\omega,\nu,\tau)$. If all function values plus gradient times the deltas (differences between grid points and evaluated points) are still smaller than 1, then we have proved that the function is below 1 (Lemma 12). To show that the mapping does not map outside the function’s domain, we derive bounds on the expressions for the mean and the variance (Lemma 13). Section A3.4.1 and Section A3.4.2 are concerned with the contraction mapping and the image of the function domain of $g$, respectively.
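The grid-plus-gradient-bound argument can be sketched on a toy function. The example below is hypothetical: `f` stands in for the singular value function $S$, the domain is the unit square rather than the four-dimensional domain of the proof, and the partial-derivative bound is known analytically for this particular `f`.

```python
import numpy as np


def f(x, y):
    """Toy stand-in for S; its true maximum on [0, 1]^2 is 0.6 + 0.3 * sin(1)."""
    return 0.6 + 0.3 * np.sin(x) * np.cos(y)


GRAD_BOUND = 0.3  # |df/dx| <= 0.3 and |df/dy| <= 0.3 everywhere

n = 101
xs = np.linspace(0.0, 1.0, n)
ys = np.linspace(0.0, 1.0, n)
h = xs[1] - xs[0]
X, Y = np.meshgrid(xs, ys, indexing="ij")
grid_max = f(X, Y).max()

# Every point of the domain lies within h/2 of a grid point in each coordinate,
# so by the Mean Value Theorem f can exceed the grid maximum by at most:
slack = GRAD_BOUND * (h / 2) + GRAD_BOUND * (h / 2)
certified_upper = grid_max + slack
print(f"grid max={grid_max:.4f}, certified upper bound={certified_upper:.4f}")
```

If the certified bound is below 1, the function is below 1 on the whole domain, not only at the grid points. A finer grid reduces the slack at the cost of more evaluations.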

With the results that the largest singular value of the Jacobian is smaller than one (Lemma 12) and that the mapping stays in the domain Ω\Omega (Lemma 13), we can prove Theorem 1. We first recall Theorem 1:

We assume $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$. We restrict the range of the variables to the domain $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.95,1.1]$. For $\omega=0$ and $\tau=1$, the mapping Eq. (4) and Eq. (5) has the stable fixed point $(\mu,\nu)=(0,1)$. For other $\omega$ and $\tau$ the mapping Eq. (4) and Eq. (5) has a stable and attracting fixed point depending on $(\omega,\tau)$ in the $(\mu,\nu)$-domain: $\mu\in[-0.03106,0.06773]$ and $\nu\in[0.80009,1.48617]$. All points within the $(\mu,\nu)$-domain converge when iteratively applying the mapping Eq. (4) and Eq. (5) to this fixed point.

According to Lemma 12 the mapping $g$ (Eq. (4) and Eq. (5)) is a contraction mapping in the given domain, that is, it has a Lipschitz constant smaller than one. We showed that $(\mu,\nu)=(0,1)$ is a fixed point of the mapping for $(\omega,\tau)=(0,1)$.

The domain is compact (bounded and closed), therefore it is a complete metric space. We further have to make sure the mapping $g$ does not map outside its domain $\Omega$. According to Lemma 13, the mapping maps into the domain $\mu\in[-0.03106,0.06773]$ and $\nu\in[0.80009,1.48617]$.

Now we can apply the Banach fixed point theorem given in Theorem 4 from which the statement of the theorem follows. ∎

A3.2 Proof of Theorem 2

The variance decreases in $[3,16]$ and all fixed points $(\mu,\nu)$ of mapping Eq. (5) and Eq. (4) have $\nu<3$.

Therefore we have according to Theorem 16:

We set $x=\nu\tau$ and $y=\mu\omega$ and obtain

The derivative of this sub-function with respect to $y$ is

The inequality follows from Lemma 24, which states that $ze^{z^{2}}\operatorname{erfc}(z)$ is monotonically increasing in $z$. Therefore the sub-function is increasing in $y$. The derivative of this sub-function with respect to $x$ is

The sub-function is increasing in $x$, since the derivative is larger than zero:

First inequality: We applied Lemma 22 two times.

Equalities factor out $\sqrt{2}\sqrt{x}$ and reformulate.

Second inequality part 2: we show that for $a=\frac{1}{10}\left(\sqrt{\frac{960+169\pi}{\pi}}-13\right)$ the following holds: $\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\geqslant 0$. We have $\frac{\partial}{\partial x}\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)=\frac{8}{\pi}-2a>0$ and $\frac{\partial}{\partial y}\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)=-2a<0$. Therefore the minimum is at the border for minimal $x$ and maximal $y$:

for $a=\frac{1}{10}\left(\sqrt{\frac{960+169\pi}{\pi}}-13\right)>0.878$.

Equalities only solve the square root and factor out the resulting terms $(2(2x+y)+1)$ and $(2(x+y)+0.878)$.

We set $\alpha=\alpha_{01}$ and multiplied out. Thereafter we also factored out $x$ in the numerator. Finally a quadratic equation was solved.

The sub-function has its minimal value for minimal $x=\nu\tau=1.5\cdot 0.8=1.2$ and minimal $y=\mu\omega=-1\cdot 0.1=-0.1$. We further minimize the function

We now divide the $\mu$-domain into $-1\leqslant\mu\leqslant 0$ and $0\leqslant\mu\leqslant 1$. Analogously we divide the $\omega$-domain into $-0.1\leqslant\omega\leqslant 0$ and $0\leqslant\omega\leqslant 0.1$. In these domains $g$ is strictly monotonic.

For all domains $g$ is strictly monotonically decreasing in $\nu$ and strictly monotonically increasing in $\tau$. Note that we now consider the range $3\leqslant\nu\leqslant 16$. For the maximal value of $g$ we set $\nu=3$ (we set it to 3!) and $\tau=1.25$.

We now consider all combinations of these domains:

$-1\leqslant\mu\leqslant 0$ and $-0.1\leqslant\omega\leqslant 0$:

$g$ is decreasing in $\mu$ and decreasing in $\omega$. We set $\mu=-1$ and $\omega=-0.1$.

$-1\leqslant\mu\leqslant 0$ and $0\leqslant\omega\leqslant 0.1$:

$g$ is increasing in $\mu$ and decreasing in $\omega$. We set $\mu=0$ and $\omega=0$.

$0\leqslant\mu\leqslant 1$ and $-0.1\leqslant\omega\leqslant 0$:

$g$ is decreasing in $\mu$ and increasing in $\omega$. We set $\mu=0$ and $\omega=0$.

$0\leqslant\mu\leqslant 1$ and $0\leqslant\omega\leqslant 0.1$:

$g$ is increasing in $\mu$ and increasing in $\omega$. We set $\mu=1$ and $\omega=0.1$.

Therefore the maximal value of $g$ is $-0.0180173$.

A3.3 Proof of Theorem 3

We consider $\lambda=\lambda_{\rm 01}$, $\alpha=\alpha_{\rm 01}$ and the two domains $\Omega_{1}^{-}=\{(\mu,\omega,\nu,\tau)\ |\ -0.1\leqslant\mu\leqslant 0.1,-0.1\leqslant\omega\leqslant 0.1,0.05\leqslant\nu\leqslant 0.16,0.8\leqslant\tau\leqslant 1.25\}$ and $\Omega_{2}^{-}=\{(\mu,\omega,\nu,\tau)\ |\ -0.1\leqslant\mu\leqslant 0.1,-0.1\leqslant\omega\leqslant 0.1,0.05\leqslant\nu\leqslant 0.24,0.9\leqslant\tau\leqslant 1.25\}$.

The mean value theorem states that there exists a $t\in[0,1]$ for which

Therefore we are interested in bounding the derivative of the $\xi$-mapping Eq. (13) with respect to $\nu$:

The sub-term Eq. (322) enters the derivative Eq. (47) with a negative sign! According to Lemma 18, the minimal value of sub-term Eq. (322) is obtained by the largest $\nu$, by the smallest $\tau$, and the largest $y=\mu\omega=0.01$. Also the positive term $\operatorname{erfc}\left(\frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}}\right)+2$ is multiplied by $\tau$, which is minimized by using the smallest $\tau$. Therefore we can use the smallest $\tau$ in the whole formula Eq. (47) to lower bound it.

First we consider the domain $0.05\leqslant\nu\leqslant 0.16$ and $0.8\leqslant\tau\leqslant 1.25$. The factor consisting of the exponential in front of the brackets has its smallest value for $e^{-\frac{0.01\cdot 0.01}{2\cdot 0.05\cdot 0.8}}$. Since $\operatorname{erfc}$ is monotonically decreasing we inserted the smallest argument via $\operatorname{erfc}\left(-\frac{0.01}{\sqrt{2}\sqrt{0.05\cdot 0.8}}\right)$ in order to obtain the maximal negative contribution. Thus, applying Lemma 18, we obtain the lower bound on the derivative:

Next we consider the domain $0.05\leqslant\nu\leqslant 0.24$ and $0.9\leqslant\tau\leqslant 1.25$. The factor consisting of the exponential in front of the brackets has its smallest value for $e^{-\frac{0.01\cdot 0.01}{2\cdot 0.05\cdot 0.9}}$. Since $\operatorname{erfc}$ is monotonically decreasing we inserted the smallest argument via $\operatorname{erfc}\left(-\frac{0.01}{\sqrt{2}\sqrt{0.05\cdot 0.9}}\right)$ in order to obtain the maximal negative contribution.

Thus, applying Lemma 18, we obtain the lower bound on the derivative:

A3.4 Lemmata and Other Tools Required for the Proofs

In this section, we show that the largest singular value of the Jacobian of the mapping $g$ is smaller than one. Therefore, $g$ is a contraction mapping. This is even true in a larger domain than the original $\Omega$. We do not need to restrict $\tau\in[0.95,1.1]$, but we can extend to $\tau\in[0.8,1.25]$. The range of the other variables is unchanged such that we consider the following domain throughout this section: $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.8,1.25]$.

The definition of the entries of the Jacobian $\mathcal{J}$ is:

If the largest singular value of the Jacobian is smaller than 1, then the spectral norm of the Jacobian is smaller than 1. Then the mapping Eq. (4) and Eq. (5) of the mean and variance to the mean and variance in the next layer is contracting.

We show that the largest singular value is smaller than 1 by evaluating the function S(μ,ω,ν,τ,λ,α)S(\mu,\omega,\nu,\tau,\lambda,\alpha) on a grid. Then we use the Mean Value Theorem to bound the deviation of the function SS between grid points. Toward this end we have to bound the gradient of SS with respect to (μ,ω,ν,τ)(\mu,\omega,\nu,\tau). If all function values plus gradient times the deltas (differences between grid points and evaluated points) is still smaller than 1, then we have proofed that the function is below 1.

The singular values of the $2\times 2$ matrix

We used an explicit formula for the singular values. We now set $\mathcal{H}_{11}=a_{11}$, $\mathcal{H}_{12}=a_{12}$, $\mathcal{H}_{21}=a_{21}$, $\mathcal{H}_{22}=a_{22}$ to obtain a formula for the largest singular value of the Jacobian depending on $(\mu,\omega,\nu,\tau,\lambda,\alpha)$. The formula for the largest singular value for the Jacobian is:

where the entries of $\mathcal{J}$ are defined in Eq. (62) and we left out the dependencies on $(\mu,\omega,\nu,\tau,\lambda,\alpha)$ in order to keep the notation uncluttered, e.g. we wrote $\mathcal{J}_{11}$ instead of $\mathcal{J}_{11}(\mu,\omega,\nu,\tau,\lambda,\alpha)$.
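As a sanity check on the explicit-formula approach, the sketch below compares a standard closed form for the largest singular value of a $2\times 2$ matrix with a reference value obtained from the eigenvalues of $A^{\top}A$. The closed form used here is the textbook one and is assumed to match the formula referred to above; the function names and the example entries are illustrative.

```python
import math

def largest_singular_value_2x2(a11, a12, a21, a22):
    """Closed-form largest singular value of [[a11, a12], [a21, a22]]."""
    return 0.5 * (math.hypot(a11 + a22, a21 - a12)
                  + math.hypot(a11 - a22, a12 + a21))

def largest_singular_value_reference(a11, a12, a21, a22):
    """Reference: square root of the largest eigenvalue of A^T A."""
    trace = a11**2 + a12**2 + a21**2 + a22**2      # trace(A^T A)
    det = (a11 * a22 - a12 * a21) ** 2             # det(A^T A)
    disc = max(trace * trace - 4.0 * det, 0.0)     # guard against tiny negatives
    return math.sqrt(0.5 * (trace + math.sqrt(disc)))

example = (0.3, -1.2, 0.7, 0.5)   # arbitrary illustrative entries
print(largest_singular_value_2x2(*example))
print(largest_singular_value_reference(*example))
```

Both functions should agree up to rounding for any real entries.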

In order to bound the gradient of the singular value, we have to bound the derivatives of the Jacobian entries $\mathcal{J}_{11}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, $\mathcal{J}_{12}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, $\mathcal{J}_{21}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, and $\mathcal{J}_{22}(\mu,\omega,\nu,\tau,\lambda,\alpha)$ with respect to $\mu$, $\omega$, $\nu$, and $\tau$. The values $\lambda$ and $\alpha$ are fixed to $\lambda_{\rm 01}$ and $\alpha_{\rm 01}$. The 16 derivatives of the 4 Jacobian entries with respect to the 4 variables are:

The following bounds on the absolute values of the derivatives of the Jacobian entries $\mathcal{J}_{11}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, $\mathcal{J}_{12}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, $\mathcal{J}_{21}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, and $\mathcal{J}_{22}(\mu,\omega,\nu,\tau,\lambda,\alpha)$ with respect to $\mu$, $\omega$, $\nu$, and $\tau$ hold:

where we used that (a) $\mathcal{J}_{11}$ is strictly monotonically increasing in $\mu\omega$ and $\left|2-\operatorname{erfc}\left(\frac{0.01}{\sqrt{2}\sqrt{\nu\tau}}\right)\right|\leqslant 1.00584$ and (b) Lemma 47 that $\left|e^{\mu\omega+\frac{\nu\tau}{2}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)\right|\leqslant e^{0.01+\frac{0.64}{2}}\operatorname{erfc}\left(\frac{0.01+0.64}{\sqrt{2}\sqrt{0.64}}\right)=0.587622$.

For the first term we have $0.434947\leqslant e^{\mu\omega+\frac{\nu\tau}{2}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)\leqslant 0.587622$ after Lemma 47 and for the second term $0.582677\leqslant\sqrt{\frac{2}{\pi\nu\tau}}e^{-\frac{\mu^{2}\omega^{2}}{2\nu\tau}}\leqslant 0.997356$, which can easily be seen by maximizing or minimizing the arguments of the exponential or the square root function. The first term scaled by $\alpha$ is $0.727780\leqslant\alpha e^{\mu\omega+\frac{\nu\tau}{2}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)\leqslant 0.983247$ and the second term scaled by $\alpha-1$ is $0.392294\leqslant(\alpha-1)\sqrt{\frac{2}{\pi\nu\tau}}e^{-\frac{\mu^{2}\omega^{2}}{2\nu\tau}}\leqslant 0.671484$. Therefore, the absolute difference between these terms is at most $0.983247-0.392294$, leading to the derived bound.
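The scaling step in this argument is plain arithmetic and can be reproduced in a few lines. This is a small sketch that assumes the numerical value $\alpha_{\rm 01}\approx 1.6732632423543772$ from the main text; the variable names are illustrative.

```python
ALPHA_01 = 1.6732632423543772   # assumed numerical value of alpha_01

# Bounds quoted in the text for the two unscaled terms.
first_lo, first_hi = 0.434947, 0.587622
second_lo, second_hi = 0.582677, 0.997356

scaled_first = (ALPHA_01 * first_lo, ALPHA_01 * first_hi)
scaled_second = ((ALPHA_01 - 1) * second_lo, (ALPHA_01 - 1) * second_hi)

# Largest possible gap between the two scaled terms over both orderings.
max_abs_difference = max(scaled_first[1] - scaled_second[0],
                         scaled_second[1] - scaled_first[0])
print(scaled_first, scaled_second, max_abs_difference)
```

The larger of the two gaps is the one between the largest first term and the smallest second term, which is the difference used in the text.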

We assume $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$. We restrict the range of the variables to the domain $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.8,1.25]$.

We set $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$ and restrict the range of the variables to $\mu\in[\mu_{\rm min},\mu_{\rm max}]=[-0.1,0.1]$, $\omega\in[\omega_{\rm min},\omega_{\rm max}]=[-0.1,0.1]$, $\nu\in[\nu_{\rm min},\nu_{\rm max}]=[0.8,1.5]$, and $\tau\in[\tau_{\rm min},\tau_{\rm max}]=[0.8,1.25]$.

The absolute values of derivatives of the largest singular value $S(\mu,\omega,\nu,\tau,\lambda,\alpha)$ given in Eq. (71) with respect to $(\mu,\omega,\nu,\tau)$ are bounded as follows:

The Jacobian of our mapping Eq. (4) and Eq. (5) is defined as

from which follows using the bounds from Lemma 5:

Derivative of the singular value w.r.t. $\mu$:

where we used the results from the lemmata 5, 6, 7, and 9.

Derivative of the singular value w.r.t. $\omega$:

Derivative of the singular value w.r.t. $\nu$:

where we used the results from the lemmata 5, 6, 7, and 9.

Derivative of the singular value w.r.t. $\tau$:

We set $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$ and restrict the range of the variables to $\mu\in[\mu_{\rm min},\mu_{\rm max}]=[-0.1,0.1]$, $\omega\in[\omega_{\rm min},\omega_{\rm max}]=[-0.1,0.1]$, $\nu\in[\nu_{\rm min},\nu_{\rm max}]=[0.8,1.5]$, and $\tau\in[\tau_{\rm min},\tau_{\rm max}]=[0.8,1.25]$.

The distance of the singular value at $S(\mu,\omega,\nu,\tau,\lambda_{\rm 01},\alpha_{\rm 01})$ and that at $S(\mu+\Delta\mu,\omega+\Delta\omega,\nu+\Delta\nu,\tau+\Delta\tau,\lambda_{\rm 01},\alpha_{\rm 01})$ is bounded as follows:

The mean value theorem states that a $t\in[0,1]$ exists for which

We now apply Lemma 10 which gives bounds on the derivatives, which immediately gives the statement of the lemma. ∎

We set $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$ and restrict the range of the variables to $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.8,1.25]$.

The largest singular value of the Jacobian is smaller than 1:

Therefore the mapping Eq. (4) and Eq. (5) is a contraction mapping.

We set $\Delta\mu=0.0068097371$, $\Delta\omega=0.0008292885$, $\Delta\nu=0.0009580840$, and $\Delta\tau=0.0007323095$.

For a grid with grid length $\Delta\mu=0.0068097371$, $\Delta\omega=0.0008292885$, $\Delta\nu=0.0009580840$, and $\Delta\tau=0.0007323095$, we evaluated the function Eq. (71) for the largest singular value in the domain $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.8,1.25]$. We did this using a computer. According to Subsection A3.4.5, the precision, taking error propagation and the precision of the implemented functions into account, is better than $10^{-13}$. We performed the evaluation on different operating systems and different hardware architectures including CPUs and GPUs. In all cases the function Eq. (71) for the largest singular value of the Jacobian is bounded by $0.9912524171058772$.

A3.4.2 Lemmata for proving Theorem 1 (part 2): Mapping within domain

We further have to investigate whether the mapping Eq. (4) and Eq. (5) maps into a predefined domain.

The mapping Eq. (4) and Eq. (5) maps for $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$ into the domain $\mu\in[-0.03106,0.06773]$ and $\nu\in[0.80009,1.48617]$ with $\omega\in[-0.1,0.1]$ and $\tau\in[0.95,1.1]$.

for all $\omega\in[-0.1,0.1]$ and $\tau\in[0.95,1.1]$.

A3.4.3 Lemmata for proving Theorem 2: The variance is contracting

We consider the main sub-function of the derivative of the second moment, $\mathcal{J}_{22}$ (Eq. (62)):

that depends on $\mu\omega$ and $\nu\tau$, therefore we set $x=\nu\tau$ and $y=\mu\omega$. Algebraic reformulations provide the formula in the following form:

For $\lambda=\lambda_{\rm 01}$ and $\alpha=\alpha_{\rm 01}$, we consider the domain $-1\leqslant\mu\leqslant 1$, $-0.1\leqslant\omega\leqslant 0.1$, $1.5\leqslant\nu\leqslant 16$, and $0.8\leqslant\tau\leqslant 1.25$.

For $x$ and $y$ we obtain: $0.8\cdot 1.5=1.2\leqslant x\leqslant 20=1.25\cdot 16$ and $0.1\cdot(-1)=-0.1\leqslant y\leqslant 0.1=0.1\cdot 1$. In the following we assume to remain within this domain.

For $1.2\leqslant x\leqslant 20$ and $-0.1\leqslant y\leqslant 0.1$,

is smaller than zero, is strictly monotonically increasing in $x$, and strictly monotonically decreasing in $y$ for the minimal $x=12/10=1.2$.

The graph of the subfunction in the specified domain is displayed in Figure A3.

For $\lambda=\lambda_{\rm 01}$, $\alpha=\alpha_{\rm 01}$, $-1\leqslant\mu\leqslant 1$, $-0.1\leqslant\omega\leqslant 0.1$, $1.5\leqslant\nu\leqslant 16$, and $0.8\leqslant\tau\leqslant 1.25$, we first show that the derivative is positive and then upper bound it.

is negative. This expression multiplied by positive factors is subtracted in the derivative Eq. (118), therefore, the whole term is positive. The remaining term

of the derivative Eq. (118) is also positive according to Lemma 21. All factors outside the brackets in Eq. (118) are positive. Hence, the derivative Eq. (118) is positive.

First equality brings the expression into a shape where we can apply Lemma 15 for the function Eq. (115).

First inequality: The overall factor $\tau$ is bounded by 1.25.

Second inequality: We apply Lemma 15. According to Lemma 15 the function Eq. (115) is negative. The largest contribution is to subtract the most negative value of the function Eq. (115), that is, the minimum of function Eq. (115). According to Lemma 15 the function Eq. (115) is strictly monotonically increasing in $x$ and strictly monotonically decreasing in $y$ for $x=1.2$. Therefore the function Eq. (115) has its minimum at minimal $x=\nu\tau=1.5\cdot 0.8=1.2$ and maximal $y=\mu\omega=1.0\cdot 0.1=0.1$. We insert these values into the expression.

Third inequality: We use for the whole expression the maximal factor $e^{-\frac{\mu^{2}\omega^{2}}{2\nu\tau}}<1$ by setting this factor to 1.

Fourth inequality: $\operatorname{erfc}$ is strictly monotonically decreasing. Therefore we maximize its argument to obtain the least value which is subtracted. We use the minimal $x=\nu\tau=1.5\cdot 0.8=1.2$ and the maximal $y=\mu\omega=1.0\cdot 0.1=0.1$.

Sixth inequality: evaluation of the terms.

is larger than zero when the term $\sqrt{\pi}\alpha e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)-\frac{\sqrt{2}(\alpha-1)}{\sqrt{\nu\tau}}$ is larger than zero. This term obtains its minimal value at $\mu\omega=0.01$ and $\nu\tau=16\cdot 1.25$, which can easily be shown using the Abramowitz bounds (Lemma 22) and evaluates to $0.16$, therefore $\mathcal{J}_{12}>0$ in $\Omega^{+}$.

A3.4.4 Lemmata for proving Theorem 3: The variance is expanding

We consider functions in $\mu\omega$ and $\nu\tau$, therefore we set $x=\nu\tau$ and $y=\mu\omega$.

For $\lambda=\lambda_{\rm 01}$ and $\alpha=\alpha_{\rm 01}$, we consider the domain $-0.1\leqslant\mu\leqslant 0.1$, $-0.1\leqslant\omega\leqslant 0.1$, $0.00875\leqslant\nu\leqslant 0.7$, and $0.8\leqslant\tau\leqslant 1.25$.

For $x$ and $y$ we obtain: $0.8\cdot 0.00875=0.007\leqslant x\leqslant 0.875=1.25\cdot 0.7$ and $0.1\cdot(-0.1)=-0.01\leqslant y\leqslant 0.01=0.1\cdot 0.1$. In the following we assume to be within this domain.

In this domain, we consider the main sub-function of the derivative of the second moment in the next layer, $\mathcal{J}_{22}$ (Eq. (62)):

that depends on $\mu\omega$ and $\nu\tau$, therefore we set $x=\nu\tau$ and $y=\mu\omega$. Algebraic reformulations provide the formula in the following form:

For $0.007\leqslant x\leqslant 0.875$ and $-0.01\leqslant y\leqslant 0.01$, the function

is smaller than zero, is strictly monotonically increasing in $x$ and strictly monotonically increasing in $y$ for the minimal $x=0.007=0.00875\cdot 0.8$, $x=0.56=0.7\cdot 0.8$, $x=0.128=0.16\cdot 0.8$, and $x=0.216=0.24\cdot 0.9$ (lower bound of $0.9$ on $\tau$).

We consider $\lambda=\lambda_{\rm 01}$, $\alpha=\alpha_{\rm 01}$ and the domain $-0.1\leqslant\mu\leqslant 0.1$, $-0.1\leqslant\omega\leqslant 0.1$, $0.00875\leqslant\nu\leqslant 0.7$, and $0.8\leqslant\tau\leqslant 1.25$. We are interested in the derivative of

The derivative of the equation above with respect to

$\tau$ is smaller than zero for maximal $\nu=0.7$, $\nu=0.16$, and $\nu=0.24$ (with $0.9\leqslant\tau$);

$y=\mu\omega$ is larger than zero for $\nu\tau=0.00875\cdot 0.8=0.007$, $\nu\tau=0.7\cdot 0.8=0.56$, $\nu\tau=0.16\cdot 0.8=0.128$, and $\nu\tau=0.24\cdot 0.9=0.216$.

A3.4.5 Computer-assisted proof details for main Lemma 12 in Section A3.4.1.

We investigate the error propagation for the singular value (Eq. (71)) if the function arguments $\mu,\omega,\nu,\tau$ suffer from numerical imprecisions up to $\epsilon$. To this end, we first derive error propagation rules based on the mean value theorem and then we apply these rules to the formula for the singular value.

For a real-valued function $f$ which is differentiable in the closed interval $[a,b]$, there exists $t\in[0,1]$ with

It follows that for computation with error $\Delta x$, there exists a $t\in[0,1]$ with

Therefore the increase of the norm of the error after applying function $f$ is bounded by the norm of the gradient $\left\|\nabla f(\bm{x}+t\Delta\bm{x})\right\|$.

For the functions that we consider, we now compute the gradient and its 2-norm:

$f(\bm{x})=x_{1}+x_{2}$ and $\nabla f(\bm{x})=(1,1)$, which gives $\left\|\nabla f(\bm{x})\right\|=\sqrt{2}$.

$f(\bm{x})=x_{1}-x_{2}$ and $\nabla f(\bm{x})=(1,-1)$, which gives $\left\|\nabla f(\bm{x})\right\|=\sqrt{2}$.

$f(\bm{x})=x_{1}x_{2}$ and $\nabla f(\bm{x})=(x_{2},x_{1})$, which gives $\left\|\nabla f(\bm{x})\right\|=\left\|\bm{x}\right\|$.

$f(\bm{x})=\frac{x_{1}}{x_{2}}$ and $\nabla f(\bm{x})=\left(\frac{1}{x_{2}},-\frac{x_{1}}{x_{2}^{2}}\right)$, which gives $\left\|\nabla f(\bm{x})\right\|=\frac{\left\|\bm{x}\right\|}{x_{2}^{2}}$.

$f(x)=\sqrt{x}$ and $f^{\prime}(x)=\frac{1}{2\sqrt{x}}$, which gives $\left|f^{\prime}(x)\right|=\frac{1}{2\sqrt{x}}$.

$f(x)=\exp(x)$ and $f^{\prime}(x)=\exp(x)$, which gives $\left|f^{\prime}(x)\right|=\exp(x)$.
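The gradient norms listed above can be cross-checked numerically. The sketch below compares each closed-form norm for the two-argument operations with a central finite-difference estimate; the function names and the step size `h` are illustrative choices.

```python
import math

def grad_norm_add(x1, x2):      # f = x1 + x2, gradient (1, 1)
    return math.sqrt(2.0)

def grad_norm_sub(x1, x2):      # f = x1 - x2, gradient (1, -1)
    return math.sqrt(2.0)

def grad_norm_mul(x1, x2):      # f = x1 * x2, gradient (x2, x1)
    return math.hypot(x1, x2)

def grad_norm_div(x1, x2):      # f = x1 / x2, gradient (1/x2, -x1/x2**2)
    return math.hypot(x1, x2) / x2**2

def numeric_grad_norm(f, x1, x2, h=1e-6):
    """Central finite-difference estimate of the gradient 2-norm."""
    d1 = (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)
    d2 = (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)
    return math.hypot(d1, d2)

point = (0.7, 1.3)              # arbitrary test point with x2 != 0
print(grad_norm_mul(*point), numeric_grad_norm(lambda a, b: a * b, *point))
print(grad_norm_div(*point), numeric_grad_norm(lambda a, b: a / b, *point))
```

The finite-difference values agree with the closed forms only up to the discretization error of the chosen step size.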

If the values $\mu,\omega,\nu,\tau$ have a precision of $\epsilon$, the singular value (Eq. (71)) evaluated with the formulas given in Eq. (62) and Eq. (71) has a precision better than $292\epsilon$.

This means that on a machine with a typical precision of $2^{-52}=2.220446\cdot 10^{-16}$, where the rounding error is $\epsilon\approx 10^{-16}$, the evaluation of the singular value (Eq. (71)) with the formulas given in Eq. (62) and Eq. (71) has a precision better than $10^{-13}>292\epsilon$.
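The arithmetic behind this statement is easy to reproduce. The snippet below is a minimal check, assuming IEEE-754 double precision as reported by Python's `sys.float_info`; the variable names are illustrative.

```python
import sys

machine_eps = sys.float_info.epsilon      # 2**-52 for IEEE-754 double precision
amplification = 292                       # factor stated in the lemma above
worst_case = amplification * machine_eps  # conservative: uses the full 2**-52

print(machine_eps, worst_case, worst_case < 1e-13)
```

Even with the full machine epsilon $2^{-52}$ instead of $\epsilon\approx 10^{-16}$, the amplified error stays below $10^{-13}$.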

We have the numerical precision $\epsilon$ of the parameters $\mu,\omega,\nu,\tau$, that we denote by $\Delta\mu,\Delta\omega,\Delta\nu,\Delta\tau$ together with our domain $\Omega$.

With the error propagation rules that we derived in Subsection A3.4.5, we can obtain bounds for the numerical errors on the following simple expressions:

Using these bounds on the simple expressions, we can now calculate bounds on the numerical errors of compound expressions:

Subsequently, we can use the above results to get bounds for the numerical errors on the Jacobian entries (Eq. (62)), applying the rules from Subsection A3.4.5 again:

We will show that our computations are correct up to 3 ulps. For our implementation in GNU C library and the hardware architectures that we used, the precision of all mathematical functions that we used is at least one ulp. The term “ulp” (acronym for “unit in the last place”) was coined by W. Kahan in 1960. It is the highest precision (up to some factor smaller than 1), which can be achieved for the given hardware and floating point representation.

“Ulp($x$) is the gap between the two finite floating-point numbers nearest $x$, even if $x$ is one of them. (But ulp(NaN) is NaN.)”

“an ulp in $x$ is the distance between the two closest straddling floating point numbers $a$ and $b$, i.e. those with $a\leqslant x\leqslant b$ and $a\not=b$ assuming an unbounded exponent range.”

In the literature we also find slightly different definitions.
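To make the notion of an ulp concrete, the following sketch inspects the spacing of double-precision numbers with Python's `math.ulp` and `math.nextafter` (both available from Python 3.9). It is only an illustration of the definitions quoted above, not part of the proof.

```python
import math

x = 1.5
gap_above = math.nextafter(x, math.inf) - x      # distance to the next larger double
gap_below = x - math.nextafter(x, -math.inf)     # distance to the next smaller double
print(math.ulp(x), gap_above, gap_below)         # identical away from powers of two

# At a power of two the spacing changes, so different definitions can disagree.
one_above = math.nextafter(1.0, math.inf) - 1.0
one_below = 1.0 - math.nextafter(1.0, -math.inf)
print(math.ulp(1.0), one_above, one_below)
```

At $x=1$ the gap to the next smaller double is half the gap to the next larger one, which is one reason why the definitions found in the literature differ slightly.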

“IEEE-754 mandates four standard rounding modes:”

“Round-to-nearest: $r(x)$ is the floating-point value closest to $x$ with the usual distance; if two floating-point value are equally close to $x$, then $r(x)$ is the one whose least significant bit is equal to zero.”

“IEEE-754 standardises 5 operations: addition (which we shall note $\oplus$ in order to distinguish it from the operation over the reals), subtraction ($\ominus$), multiplication ($\otimes$), division ($\oslash$), and also square root.”

Consequently, the IEEE-754 standard guarantees that addition, subtraction, multiplication, division, and square root are precise up to one ulp.

“With the Intel486 processor and Intel 387 math coprocessor, the worst-case, transcendental function error is typically $3$ or $3.5$ ulps, but is sometimes as large as $4.5$ ulps.”

According to https://www.mirbsd.org/htman/i386/man3/exp.htm and http://man.openbsd.org/OpenBSD-current/man3/exp.3:

“exp($x$), log($x$), expm1($x$) and log1p($x$) are accurate to within an ulp”

which is the same for FreeBSD https://www.freebsd.org/cgi/man.cgi?query=exp&sektion=3&apropos=0&manpath=freebsd:

“The values of exp(0), expm1(0), exp2(integer), and pow(integer, integer) are exact provided that they are representable. Otherwise the error in these functions is generally below one ulp.”

The same holds for “FDLIBM” http://www.netlib.org/fdlibm/readme:

“FDLIBM is intended to provide a reasonably portable (see assumptions below), reference quality (below one ulp for major functions like sin,cos,exp,log) math library (libm.a).”

A3.4.6 Intermediate Lemmata and Proofs

Since we focus on the fixed point $(\mu,\nu)=(0,1)$, we assume for our whole analysis that $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$. Furthermore, we restrict the range of the variables $\mu\in[\mu_{\rm min},\mu_{\rm max}]=[-0.1,0.1]$, $\omega\in[\omega_{\rm min},\omega_{\rm max}]=[-0.1,0.1]$, $\nu\in[\nu_{\rm min},\nu_{\rm max}]=[0.8,1.5]$, and $\tau\in[\tau_{\rm min},\tau_{\rm max}]=[0.8,1.25]$.

For bounding different partial derivatives we need properties of different functions. We will bound the absolute value of a function by computing an upper bound on its maximum and a lower bound on its minimum. These bounds are computed by upper or lower bounding terms. The bounds get tighter if we can combine terms to a more complex function and bound this function. The following lemmata give some properties of functions that we will use in bounding complex functions.

Throughout this work, we use the error function $\operatorname{erf}(x):=\frac{1}{\sqrt{\pi}}\int_{-x}^{x}e^{-t^{2}}\,\mathrm{d}t$ and the complementary error function $\operatorname{erfc}(x)=1-\operatorname{erf}(x)$.

$\exp(x)$ is strictly monotonically increasing from $0$ at $-\infty$ to $\infty$ at $\infty$ and has positive curvature.

According to its definition $\operatorname{erfc}(x)$ is strictly monotonically decreasing from 2 at $-\infty$ to 0 at $\infty$.
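The definition of $\operatorname{erf}$ and the stated behavior of $\operatorname{erfc}$ can be checked against the standard library. The sketch below integrates $e^{-t^{2}}$ with a composite Simpson rule; the helper name `erf_by_definition` and the number of subintervals are illustrative choices.

```python
import math

def erf_by_definition(x, n=2000):
    """(1/sqrt(pi)) * integral of exp(-t^2) over [-x, x], composite Simpson rule."""
    if x == 0:
        return 0.0
    h = 2.0 * x / n                       # n must be even for Simpson's rule
    total = 2.0 * math.exp(-x * x)        # the two endpoints t = -x and t = x
    for k in range(1, n):
        t = -x + k * h
        total += (4.0 if k % 2 else 2.0) * math.exp(-t * t)
    return total * h / 3.0 / math.sqrt(math.pi)

print(erf_by_definition(1.0), math.erf(1.0))

# erfc decreases from (about) 2 on the far left to (about) 0 on the far right.
grid = [-3.0 + 0.25 * k for k in range(25)]
values = [math.erfc(x) for x in grid]
print(all(a > b for a, b in zip(values, values[1:])), math.erfc(-10.0), math.erfc(10.0))
```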

Next we introduce a bound on $\operatorname{erfc}$:

The statement follows immediately from Abramowitz and Stegun (page 298, formula 7.1.13). ∎

$e^{x^{2}}\operatorname{erfc}(x)$ is strictly monotonically decreasing for $x>0$ and has positive curvature (positive 2nd order derivative), that is, the decrease slows down.

A graph of the function is displayed in Figure A5.

The derivative of $e^{x^{2}}\operatorname{erfc}(x)$ is

Thus $e^{x^{2}}\operatorname{erfc}(x)$ is strictly monotonically decreasing for $x>0$.

The second order derivative of $e^{x^{2}}\operatorname{erfc}(x)$ is

Again using Lemma 22 (first inequality), we get

For the last inequality we added 1 in the numerator in the square root which is subtracted, that is, making a larger negative term in the numerator. ∎

The function $xe^{x^{2}}\operatorname{erfc}(x)$ has the sign of $x$ and is monotonically increasing to $\frac{1}{\sqrt{\pi}}$.

The derivative of $xe^{x^{2}}\operatorname{erfc}(x)$ is

We apply Lemma 22 to $x\operatorname{erfc}(x)e^{x^{2}}$ and divide the terms of the lemma by $x$, which gives

In the limit $x\to\infty$ both the upper and the lower bound go to $\frac{1}{\sqrt{\pi}}$. ∎
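The monotonicity and curvature statements for $e^{x^{2}}\operatorname{erfc}(x)$ and $xe^{x^{2}}\operatorname{erfc}(x)$ can be probed numerically on a grid. This is only a plausibility check on $(0,20]$ with an illustrative step size, not a replacement for the proofs; the naive product `exp(x*x) * erfc(x)` is used, which would overflow for much larger $x$.

```python
import math

def scaled_erfc(x):
    """exp(x^2) * erfc(x); fine for moderate x, overflows for large x."""
    return math.exp(x * x) * math.erfc(x)

h = 0.05
xs = [k * h for k in range(1, 401)]            # grid on (0, 20]
f = [scaled_erfc(x) for x in xs]
g = [x * fx for x, fx in zip(xs, f)]           # x * exp(x^2) * erfc(x)

decreasing = all(a > b for a, b in zip(f, f[1:]))
# Positive second differences indicate positive curvature on the grid.
convex = all(f[i - 1] - 2 * f[i] + f[i + 1] > 0 for i in range(1, len(f) - 1))
increasing = all(a < b for a, b in zip(g, g[1:]))
limit_gap = 1 / math.sqrt(math.pi) - g[-1]     # distance to the limit at x = 20
print(decreasing, convex, increasing, limit_gap)
```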

$h_{11}(\mu,\omega)=\mu\omega$ is monotonically increasing in $\mu\omega$. It has minimal value $t_{11}=-0.01$ and maximal value $T_{11}=0.01$.

$h_{22}(\nu,\tau)=\nu\tau$ is monotonically increasing in $\nu\tau$ and is positive. It has minimal value $t_{22}=0.64$ and maximal value $T_{22}=1.875$.

$h_{1}(\mu,\omega,\nu,\tau)=\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}$ is larger than zero and increasing in both $\nu\tau$ and $\mu\omega$. It has minimal value $t_{1}=0.5568$ and maximal value $T_{1}=0.9734$.

The derivative of the function $\frac{\mu\omega+x}{\sqrt{2}\sqrt{x}}$ with respect to $x$ is

since $x>0.8\cdot 0.8$ and $\mu\omega<0.1\cdot 0.1$. ∎

$h_{2}(\mu,\omega,\nu,\tau)=\frac{\mu\omega+2\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}$ is larger than zero and increasing in both $\nu\tau$ and $\mu\omega$. It has minimal value $t_{2}=1.1225$ and maximal value $T_{2}=1.9417$.

The derivative of the function $\frac{\mu\omega+2x}{\sqrt{2}\sqrt{x}}$ with respect to $x$ is

$h_{3}(\mu,\omega,\nu,\tau)=\frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}}$ is monotonically decreasing in $\nu\tau$ and monotonically increasing in $\mu\omega$. It has minimal value $t_{3}=-0.0088388$ and maximal value $T_{3}=0.0088388$.

$h_{4}(\mu,\omega,\nu,\tau)=\left(\frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}}\right)^{2}$ has a minimum at 0 for $\mu=0$ or $\omega=0$ and has a maximum for the smallest $\nu\tau$ and largest $\left|\mu\omega\right|$ and is larger or equal to zero. It has minimal value $t_{4}=0$ and maximal value $T_{4}=0.000078126$.
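The extreme values $t_{1},T_{1},\ldots,T_{4}$ follow from evaluating the functions at the corners of the domain. The sketch below recomputes them, using the ranges $\mu\omega\in[-0.01,0.01]$ and $\nu\tau\in[0.64,1.875]$ from the lemmata above; the Python names are illustrative.

```python
import math

MU_OMEGA = (-0.01, 0.01)    # range of mu*omega: [-0.1*0.1, 0.1*0.1]
NU_TAU = (0.64, 1.875)      # range of nu*tau:   [0.8*0.8, 1.5*1.25]

def h1(mo, nt):
    return (mo + nt) / (math.sqrt(2) * math.sqrt(nt))

def h2(mo, nt):
    return (mo + 2 * nt) / (math.sqrt(2) * math.sqrt(nt))

def h3(mo, nt):
    return mo / (math.sqrt(2) * math.sqrt(nt))

# h1 and h2 increase in both arguments, so the extremes sit at opposite corners.
t1, T1 = h1(MU_OMEGA[0], NU_TAU[0]), h1(MU_OMEGA[1], NU_TAU[1])
t2, T2 = h2(MU_OMEGA[0], NU_TAU[0]), h2(MU_OMEGA[1], NU_TAU[1])
# |h3| is largest for the smallest nu*tau and the largest |mu*omega|.
t3, T3 = h3(MU_OMEGA[0], NU_TAU[0]), h3(MU_OMEGA[1], NU_TAU[0])
T4 = T3 ** 2                # h4 = h3^2
print(t1, T1, t2, T2, t3, T3, T4)
```

The printed values agree with the constants in the lemmata up to the rounding used in the text.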

$\frac{\sqrt{\frac{2}{\pi}}(\alpha-1)}{\sqrt{\nu\tau}}>0$ and decreasing in $\nu\tau$.

Statements follow directly from elementary functions square root and division. ∎

$2-\operatorname{erfc}\left(\frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}}\right)>0$ and decreasing in $\nu\tau$ and increasing in $\mu\omega$.

Statements follow directly from Lemma 21 and $\operatorname{erfc}$. ∎

For $\lambda=\lambda_{\rm 01}$ and $\alpha=\alpha_{\rm 01}$, $\sqrt{\frac{2}{\pi}}\left(\frac{(\alpha-1)\mu\omega}{(\nu\tau)^{3/2}}-\frac{\alpha}{\sqrt{\nu\tau}}\right)<0$ and increasing in both $\nu\tau$ and $\mu\omega$.

We consider the function $\sqrt{\frac{2}{\pi}}\left(\frac{(\alpha-1)\mu\omega}{x^{3/2}}-\frac{\alpha}{\sqrt{x}}\right)$, which has the derivative with respect to $x$:

This derivative is larger than zero, since

The last inequality follows from $\alpha-\frac{3\cdot 0.1\cdot 0.1(\alpha-1)}{0.8\cdot 0.8}>0$ for $\alpha=\alpha_{\rm 01}$.

We next consider the function $\sqrt{\frac{2}{\pi}}\left(\frac{(\alpha-1)x}{(\nu\tau)^{3/2}}-\frac{\alpha}{\sqrt{\nu\tau}}\right)$, which has the derivative with respect to $x$:

which has as derivative with respect to $x$:

The derivative of the term $3(\alpha-1)\mu^{2}\omega^{2}-x(-\alpha+\alpha\mu\omega+1)-\alpha x^{2}$ with respect to $x$ is $-1+\alpha-\mu\omega\alpha-2\alpha x<0$, since $2\alpha x>1.6\alpha$. Therefore the term is maximized with the smallest value for $x$, which is $x=\nu\tau=0.8\cdot 0.8$. For $\mu\omega$ we use for each term the value which gives maximal contribution. We obtain an upper bound for the term:

Therefore the derivative with respect to $x=\nu\tau$ is smaller than zero and the original function is decreasing in $\nu\tau$.

We now consider the derivative with respect to $x=\mu\omega$. The derivative with respect to $x$ of the function

Since $-2x(-1+\alpha)+\nu\tau\alpha>-2\cdot 0.01\cdot(-1+\alpha_{\rm 01})+0.8\cdot 0.8\alpha_{\rm 01}>1.0574>0$, the derivative is larger than zero. Consequently, the original function is increasing in $\mu\omega$.

The maximal value is obtained with the minimal $\nu\tau=0.8\cdot 0.8$ and the maximal $\mu\omega=0.1\cdot 0.1$. The maximal value is

Therefore the original function is smaller than zero. ∎

For $\lambda=\lambda_{\rm 01}$ and $\alpha=\alpha_{\rm 01}$, $\sqrt{\frac{2}{\pi}}\left(\frac{\left(\alpha^{2}-1\right)\mu\omega}{(\nu\tau)^{3/2}}-\frac{3\alpha^{2}}{\sqrt{\nu\tau}}\right)<0$ and increasing in both $\nu\tau$ and $\mu\omega$.

since $\alpha^{2}x-\mu\omega(-1+\alpha^{2})>\alpha_{\rm 01}^{2}\cdot 0.8\cdot 0.8-0.1\cdot 0.1\cdot(-1+\alpha_{\rm 01}^{2})>1.77387$.

The maximal function value is obtained by maximal $\nu\tau=1.5\cdot 1.25$ and the maximal $\mu\omega=0.1\cdot 0.1$. The maximal value is $\sqrt{\frac{2}{\pi}}\left(\frac{0.1\cdot 0.1\left(\alpha_{\rm 01}^{2}-1\right)}{(1.5\cdot 1.25)^{3/2}}-\frac{3\alpha_{\rm 01}^{2}}{\sqrt{1.5\cdot 1.25}}\right)=-4.88869$. Therefore the function is negative. ∎

The function $\sqrt{\frac{2}{\pi}}\left(\frac{\left(\alpha^{2}-1\right)\mu\omega}{\sqrt{\nu\tau}}-3\alpha^{2}\sqrt{\nu\tau}\right)<0$ is decreasing in $\nu\tau$ and increasing in $\mu\omega$.

since $-3\alpha^{2}x-\mu\omega(-1+\alpha^{2})<-3\alpha_{\rm 01}^{2}\cdot 0.8\cdot 0.8+0.1\cdot 0.1(-1+\alpha_{\rm 01}^{2})<-5.35764$.

The maximal function value is obtained for minimal $\nu\tau=0.8\cdot 0.8$ and the maximal $\mu\omega=0.1\cdot 0.1$. The value is $\sqrt{\frac{2}{\pi}}\left(\frac{0.1\cdot 0.1\left(\alpha_{\rm 01}^{2}-1\right)}{\sqrt{0.8\cdot 0.8}}-3\sqrt{0.8\cdot 0.8}\alpha_{\rm 01}^{2}\right)=-5.34347$. Thus, the function is negative. ∎
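The two maximal values quoted above can be recomputed directly. The sketch assumes the numerical value $\alpha_{\rm 01}\approx 1.6732632423543772$ from the main text; the function names `f_div` and `f_mul` are illustrative labels for the two expressions.

```python
import math

ALPHA_01 = 1.6732632423543772   # assumed numerical value of alpha_01
A2 = ALPHA_01 ** 2
C = math.sqrt(2 / math.pi)

def f_div(mo, nt):
    """sqrt(2/pi) * ((a^2 - 1) * mo / nt^(3/2) - 3 a^2 / sqrt(nt))."""
    return C * ((A2 - 1) * mo / nt ** 1.5 - 3 * A2 / math.sqrt(nt))

def f_mul(mo, nt):
    """sqrt(2/pi) * ((a^2 - 1) * mo / sqrt(nt) - 3 a^2 * sqrt(nt))."""
    return C * ((A2 - 1) * mo / math.sqrt(nt) - 3 * A2 * math.sqrt(nt))

print(f_div(0.01, 1.5 * 1.25))   # the text quotes -4.88869
print(f_mul(0.01, 0.8 * 0.8))    # the text quotes -5.34347
```

Both values are negative, so each function is negative on the whole domain provided the stated monotonicity holds.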

The function $\nu\tau e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)>0$ is increasing in $\nu\tau$ and decreasing in $\mu\omega$.

This derivative is larger than zero, since

The first inequality follows by applying Lemma 23 which says that $e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$ is strictly monotonically decreasing. The minimal value that is larger than 0.4349 is taken on at the maximal values $\nu\tau=1.5\cdot 1.25$ and $\mu\omega=0.1\cdot 0.1$.

The second inequality uses $\frac{1}{2}\cdot 0.4349\sqrt{2\pi}=0.545066>0.5$.

The equalities are just algebraic reformulations.

The last inequality follows from $-0.5\mu^{2}\omega^{2}+\mu\omega\sqrt{\nu\tau}+0.25(\nu\tau)^{2}>0.25(0.8\cdot 0.8)^{2}-0.5\cdot(0.1)^{2}(0.1)^{2}-0.1\cdot 0.1\cdot\sqrt{0.8\cdot 0.8}=0.09435>0$.

Therefore the function is increasing in $\nu\tau$.

Decreasing in $\mu\omega$ follows from decreasing of $e^{x^{2}}\operatorname{erfc}(x)$ according to Lemma 23. Positivity follows from the fact that $\operatorname{erfc}$ and the exponential function are positive and that $\nu\tau>0$. ∎

The function $\nu\tau e^{\frac{(\mu\omega+2\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+2\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)>0$ is increasing in $\nu\tau$ and decreasing in $\mu\omega$.

We only have to determine the sign of $\sqrt{\pi}e^{\frac{(\mu\omega+2x)^{2}}{4x}}\left(2x(2x+1)-\mu^{2}\omega^{2}\right)\operatorname{erfc}\left(\frac{\mu\omega+2x}{2\sqrt{x}}\right)+\sqrt{x}(\mu\omega-2x)$ since all other factors are obviously larger than zero.

This derivative is larger than zero, since

The first inequality follows by applying Lemma 23 which says that $e^{\frac{(\mu\omega+2\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+2\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$ is strictly monotonically decreasing. The minimal value that is larger than 0.261772 is taken on at the maximal values $\nu\tau=1.5\cdot 1.25$ and $\mu\omega=0.1\cdot 0.1$. $0.261772\sqrt{\pi}>0.463979$.

The equalities are just algebraic reformulations.

The last inequality follows from $\mu\omega\left(\sqrt{\nu\tau}-0.463979\mu\omega\right)+0.85592(\nu\tau)^{2}-0.0720421\nu\tau>0.85592\cdot(0.8\cdot 0.8)^{2}-0.1\cdot 0.1\left(\sqrt{1.5\cdot 1.25}+0.1\cdot 0.1\cdot 0.463979\right)-0.0720421\cdot 1.5\cdot 1.25>0.201766$.

Therefore the function is increasing in $\nu\tau$.

Decreasing in $\mu\omega$ follows from decreasing of $e^{x^{2}}\operatorname{erfc}(x)$ according to Lemma 23. Positivity follows from the fact that $\operatorname{erfc}$ and the exponential function are positive and that $\nu\tau>0$. ∎

The following bounds on the absolute values of the derivatives of the Jacobian entries $\mathcal{J}_{11}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, $\mathcal{J}_{12}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, $\mathcal{J}_{21}(\mu,\omega,\nu,\tau,\lambda,\alpha)$, and $\mathcal{J}_{22}(\mu,\omega,\nu,\tau,\lambda,\alpha)$ with respect to $\mu$, $\omega$, $\nu$, and $\tau$ hold:

For each derivative we compute a lower and an upper bound and take the maximum of the absolute value. A lower bound is determined by minimizing the single terms of the functions that represent the derivative. An upper bound is determined by maximizing the single terms of the functions that represent the derivative. Terms can be combined to larger terms for which the maximum and the minimum must be known. We apply many previous lemmata which state properties of functions representing single or combined terms. The more terms are combined, the tighter the bounds can be made.

Next we go through all the derivatives, where we use Lemma 25, Lemma 26, Lemma 27, Lemma 28, Lemma 29, Lemma 30, Lemma 21, and Lemma 23 without citing. Furthermore, we use the bounds on the simple expressions $t_{11}$, $t_{22}$, …, and $T_{4}$ as defined in the aforementioned lemmata:

$\frac{\partial{\mathcal{J}}_{11}}{\partial\mu}$

We use Lemma 31 and consider the expression $\alpha e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)-\frac{\sqrt{\frac{2}{\pi}}(\alpha-1)}{\sqrt{\nu\tau}}$ in brackets. An upper bound on the maximum is

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{11}}{\partial\omega}$

We use Lemma 31 and consider the expression $\frac{\sqrt{\frac{2}{\pi}}(\alpha-1)\mu\omega}{\sqrt{\nu\tau}}-\alpha(\mu\omega+1)e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$ in brackets.

This term is subtracted, and $2-\operatorname{erfc}(x)>0$, therefore we have to use the minimum and the maximum for the argument of $\operatorname{erfc}$.

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{11}}{\partial\nu}$

We apply Lemma 33 for the first sub-term. An upper bound on the maximum is

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{11}}{\partial\tau}$

We use the results of item $\frac{\partial\mathcal{J}_{11}}{\partial\nu}$, where the brackets are only differently scaled. Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{12}}{\partial\mu}$

Since $\frac{\partial\mathcal{J}_{12}}{\partial\mu}=\frac{\partial\mathcal{J}_{11}}{\partial\nu}$, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{12}}{\partial\omega}$

We use the results of item $\frac{\partial\mathcal{J}_{11}}{\partial\nu}$, where the brackets are only differently scaled. Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{12}}{\partial\nu}$

For the second term in brackets, we see that $\alpha_{\rm 01}\tau_{\rm min}^{2}e^{T_{1}^{2}}\operatorname{erfc}(T_{1})=0.465793$ and $\alpha_{\rm 01}\tau_{\rm max}^{2}e^{t_{1}^{2}}\operatorname{erfc}(t_{1})=1.53644$.

where we maximize or minimize all single terms.

A lower bound on the minimum of this expression is

An upper bound on the maximum of this expression is

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{12}}{\partial\tau}$

We use Lemma 34 to obtain an upper bound on the maximum of the expression of the lemma:

We use Lemma 34 to obtain a lower bound on the minimum of the expression of the lemma:

Next we apply Lemma 37 for the expression $\nu\tau e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$. We use Lemma 37 to obtain an upper bound on the maximum of this expression:

We use Lemma 37 to obtain a lower bound on the minimum of this expression:

Next we apply Lemma 23 for $2\alpha e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$. An upper bound on this expression is

The sum of the minimal values of the terms is $-2.23019+0.62046+1.45560=-0.154133$.

The sum of the maximal values of the terms is $-1.72295+1.37380+1.96664=1.61749$.

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{21}}{\partial\mu}$

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{21}}{\partial\omega}$

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{21}}{\partial\nu}$

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{21}}{\partial\tau}$

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{22}}{\partial\mu}$

We use the fact that $\frac{\partial\mathcal{J}_{22}}{\partial\mu}=\frac{\partial\mathcal{J}_{21}}{\partial\nu}$. Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{22}}{\partial\omega}$

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{22}}{\partial\nu}$

We apply Lemma 35 to the expression $\sqrt{\frac{2}{\pi}}\left(\frac{\left(\alpha^{2}-1\right)\mu\omega}{(\nu\tau)^{3/2}}-\frac{3\alpha^{2}}{\sqrt{\nu\tau}}\right)$. Using Lemma 35, an upper bound on the maximum is

Using Lemma 35, a lower bound on the minimum is

Thus, an upper bound on the maximal absolute value is

$\frac{\partial\mathcal{J}_{22}}{\partial\tau}$

We apply Lemma 36 to the expression $\sqrt{\frac{2}{\pi}}\left(\frac{\left(\alpha^{2}-1\right)\mu\omega}{\sqrt{\nu\tau}}-3\alpha^{2}\sqrt{\nu\tau}\right)$. We apply Lemma 37 to the expression $\nu\tau e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$. We apply Lemma 38 to the expression $\nu\tau e^{\frac{(\mu\omega+2\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+2\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$.

We combine the results of these lemmata to obtain an upper bound on the maximum:

We combine the results of these lemmata to obtain a lower bound on the minimum:

Thus, an upper bound on the maximal absolute value is

We assume $\alpha=\alpha_{\rm 01}$ and $\lambda=\lambda_{\rm 01}$. We restrict the range of the variables to the domain $\mu\in[-0.1,0.1]$, $\omega\in[-0.1,0.1]$, $\nu\in[0.8,1.5]$, and $\tau\in[0.8,1.25]$.

Lemma 23 says $e^{x^{2}}\operatorname{erfc}(x)$ is decreasing in $\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}$. The first term (negative) is increasing in $\nu\tau$ since it is proportional to minus one over the square root of $\nu\tau$.

We obtain a lower bound by setting $\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}=\frac{1.5\cdot 1.25+0.1\cdot 0.1}{\sqrt{2}\sqrt{1.5\cdot 1.25}}$ for the $e^{x^{2}}\operatorname{erfc}(x)$ term. The term in brackets is larger than $e^{\left(\frac{1.5\cdot 1.25+0.1\cdot 0.1}{\sqrt{2}\sqrt{1.5\cdot 1.25}}\right)^{2}}\alpha_{\rm 01}\operatorname{erfc}\left(\frac{1.5\cdot 1.25+0.1\cdot 0.1}{\sqrt{2}\sqrt{1.5\cdot 1.25}}\right)-\sqrt{\frac{2}{\pi\cdot 0.8\cdot 0.8}}(\alpha_{\rm 01}-1)=0.056$. Consequently, the function is larger than zero.
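The value 0.056 can be checked numerically with the Python standard library. The sketch below assumes the usual SELU constant $\alpha_{\rm 01}\approx 1.6733$ for the fixed point $(0,1)$; the function name is illustrative only.

```python
import math

ALPHA_01 = 1.6732632423543772  # SELU alpha for the fixed point (0, 1)

def bracket_lower_bound():
    """Evaluate the lower bound on the bracketed term at the domain border."""
    nt_max = 1.5 * 1.25   # maximal nu*tau
    nt_min = 0.8 * 0.8    # minimal nu*tau
    mw_max = 0.1 * 0.1    # maximal mu*omega
    # Largest erfc argument: e^{z^2} erfc(z) is decreasing (Lemma 23).
    z = (nt_max + mw_max) / (math.sqrt(2.0) * math.sqrt(nt_max))
    erfc_term = ALPHA_01 * math.exp(z * z) * math.erfc(z)
    # The negative term is most negative for the smallest nu*tau.
    negative_term = math.sqrt(2.0 / (math.pi * nt_min)) * (ALPHA_01 - 1.0)
    return erfc_term - negative_term

value = bracket_lower_bound()
print(value)
```

The result is positive and agrees with the rounded value 0.056 given in the text.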

We set $x=\nu\tau$ and $y=\mu\omega$ and obtain

The derivative of this sub-function with respect to $y$ is

The inequality follows from Lemma 24, which states that $ze^{z^{2}}\operatorname{erfc}(z)$ is monotonically increasing in $z$. Therefore the sub-function is increasing in $y$.
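The property of Lemma 24 used here can be spot-checked numerically. The following sketch evaluates $ze^{z^{2}}\operatorname{erfc}(z)$ on a grid over $[0,5]$ (the grid and its range are arbitrary choices made for illustration) and checks that the values increase and stay below $\frac{1}{\sqrt{\pi}}$. It is a sanity check, not a substitute for the lemma.

```python
import math

def g(z):
    """The function z * exp(z^2) * erfc(z) from Lemma 24."""
    return z * math.exp(z * z) * math.erfc(z)

zs = [i / 100.0 for i in range(0, 501)]   # grid on [0, 5]
vals = [g(z) for z in zs]

# Monotonically increasing on the grid and bounded by 1/sqrt(pi).
is_increasing = all(a < b for a, b in zip(vals, vals[1:]))
below_limit = all(v < 1.0 / math.sqrt(math.pi) for v in vals)
print(is_increasing, below_limit)
```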

The derivative of this sub-function with respect to $x$ is

The sub-function is increasing in $x$, since the derivative is larger than zero:

First inequality: We applied Lemma 22 two times.

Equalities factor out $\sqrt{2}\sqrt{x}$ and reformulate.

Second inequality part 2: we show that for $a=\frac{1}{20}\left(\sqrt{\frac{2048+169\pi}{\pi}}-13\right)$ the following holds: $\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\geqslant 0$. We have $\frac{\partial}{\partial x}\left(\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\right)=\frac{8}{\pi}-2a>0$ and $\frac{\partial}{\partial y}\left(\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\right)=-2a<0$. Therefore the minimum is at the border for minimal $x$ and maximal $y$:

for $a=\frac{1}{20}\left(\sqrt{\frac{2048+169\pi}{\pi}}-13\right)>0.782$.
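A short numerical check illustrates why this particular $a$ works. The sketch assumes the border point is $x=0.8\cdot 0.8$ and $y=0.1\cdot 0.1$, taken from the variable ranges stated for this lemma; at that point the expression vanishes up to rounding, so $a$ is the largest value for which the inequality holds on the domain.

```python
import math

a = (math.sqrt((2048 + 169 * math.pi) / math.pi) - 13) / 20

def g(x, y):
    """8x/pi - (a^2 + 2a(x + y)), which has to be non-negative."""
    return 8 * x / math.pi - (a * a + 2 * a * (x + y))

# Assumed border of the domain: minimal x = nu*tau, maximal y = mu*omega.
x_min, y_max = 0.8 * 0.8, 0.1 * 0.1
border_value = g(x_min, y_max)
slope_x = 8 / math.pi - 2 * a   # partial derivative with respect to x
print(a, border_value, slope_x)
```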

Equalities only solve the square root and factor out the resulting terms $(2(2x+y)+1)$ and $(2(x+y)+0.782)$.

We set $\alpha=\alpha_{\rm 01}$ and multiplied out. Thereafter we also factored out $x$ in the numerator. Finally, a quadratic equation was solved.

The sub-function has its minimal value for minimal $x$ and minimal $y$: $x=\nu\tau=0.8\cdot 0.8=0.64$ and $y=\mu\omega=-0.1\cdot 0.1=-0.01$. We further minimize the function

Therefore the term in brackets is larger than zero.

First inequality: We applied Lemma 22 two times.

Equalities factor out $\sqrt{2}\sqrt{x}$ and reformulate.

Second inequality part 2: we show that for $a=\frac{1}{20}\left(\sqrt{\frac{2048+169\pi}{\pi}}-13\right)$ the following holds: $\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\geqslant 0$. We have $\frac{\partial}{\partial x}\left(\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\right)=\frac{8}{\pi}-2a>0$ and $\frac{\partial}{\partial y}\left(\frac{8x}{\pi}-\left(a^{2}+2a(x+y)\right)\right)=-2a<0$. Therefore the minimum is at the border for minimal $x$ and maximal $y$:

for $a=\frac{1}{20}\left(\sqrt{\frac{2048+169\pi}{\pi}}-13\right)>0.782$.

Equalities only solve the square root and factor out the resulting terms $(2(2x+y)+1)$ and $(2(x+y)+0.782)$.

We know that $2-\operatorname{erfc}(x)>0$ according to Lemma 21. For the sub-term we derived

in the domain $-0.1\leqslant\mu\leqslant 0.1$, $-0.1\leqslant\omega\leqslant 0.1$, and $0.02\leqslant\nu\tau\leqslant 0.5$ is bounded by

where we have used the monotonicity of the terms in $\nu\tau$.

Similarly, we can use the monotonicity of the terms in $\nu\tau$ to show that

Furthermore, when $(\nu\tau)\rightarrow 0$, the terms with the arguments of the complementary error functions $\operatorname{erfc}$ and the exponential function go to infinity, therefore these three terms converge to zero. Hence, the remaining terms are only $2\mu\omega\frac{1}{2}\lambda$. ∎

contains the terms $e^{\frac{(\mu\omega)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega}{\sqrt{2}\sqrt{\nu\tau}}\right)$ and $e^{\frac{(\mu\omega+\nu\tau)^{2}}{2\nu\tau}}\operatorname{erfc}\left(\frac{\mu\omega+\nu\tau}{\sqrt{2}\sqrt{\nu\tau}}\right)$ which are monotonically decreasing in their arguments (Lemma 23). We can therefore obtain their minima and maxima at the minimal and maximal arguments. Since the first term has a negative sign in the expression, both terms reach their maximal value at $\mu\omega=-0.01$, $\nu=0.05$, and $\tau=0.8$.

We use the argumentation that the term with the error function is monotonically decreasing (Lemma 23) again for the expression

in the domain $\Omega^{-}=\{\mu,\omega,\nu,\tau\ |\ -0.1\leqslant\mu\leqslant 0.1,\ -0.1\leqslant\omega\leqslant 0.1,\ 0.05\leqslant\nu\leqslant 0.24,\ 0.8\leqslant\tau\leqslant 1.25\}$.

We use a similar strategy to the one we have used to show the bound on the singular value (Lemmata 10, 11, and 12), where we evaluated the function on a grid and used bounds on the derivatives together with the mean value theorem. Here we have

Furthermore we used error propagation to estimate the numerical error on the function evaluation. Using the error propagation rules derived in Subsection A3.4.5, we found that the numerical error is smaller than $10^{-13}$ in the worst case. ∎
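The grid-plus-mean-value-theorem strategy can be illustrated in one dimension. The sketch below is not the four-dimensional computation of the proof: it certifies a lower bound for the stand-in function $e^{x^{2}}\operatorname{erfc}(x)$ on $[0,2]$, using the derivative bound $\frac{2}{\sqrt{\pi}}$ that follows from Lemma 24 for $x\geqslant 0$. The helper name, grid size, and interval are assumptions chosen for illustration.

```python
import math

def f(x):
    """Stand-in function: exp(x^2) * erfc(x)."""
    return math.exp(x * x) * math.erfc(x)

def certified_lower_bound(func, lo, hi, n, deriv_bound):
    """Lower bound on func over [lo, hi] from n + 1 equally spaced grid values.

    Every point lies within h/2 of a grid point, so by the mean value
    theorem func cannot drop more than deriv_bound * h / 2 below the
    smallest grid value.
    """
    h = (hi - lo) / n
    grid_min = min(func(lo + i * h) for i in range(n + 1))
    return grid_min - deriv_bound * h / 2.0

# |f'(x)| = |2x exp(x^2) erfc(x) - 2/sqrt(pi)| <= 2/sqrt(pi) for x >= 0.
deriv_bound = 2.0 / math.sqrt(math.pi)
bound = certified_lower_bound(f, 0.0, 2.0, 2000, deriv_bound)
print(bound)
```

A finer grid shrinks the correction term, so the certified bound approaches the true minimum at the cost of more evaluations.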

For $1.2\leqslant x\leqslant 20$ and $-0.1\leqslant y\leqslant 0.1$,

is smaller than zero, is strictly monotonically increasing in $x$, and strictly monotonically decreasing in $y$ for the minimal $x=12/10=1.2$.

We first consider the derivative of sub-function Eq. (115) with respect to $x$. The derivative of the function

For bounding this value, we use the approximation

from Ren and MacKenzie. We start with an error analysis of this approximation. According to Ren and MacKenzie (Figure 1), the approximation error is positive in the range $[0.7,3.2]$. This range contains all possible arguments of $\operatorname{erfc}$ that we consider. Numerically we maximized and minimized the approximation error of the whole expression

We numerically determined $0.0113556\leqslant E(x,y)\leqslant 0.0169551$ for $1.2\leqslant x\leqslant 20$ and $-0.1\leqslant y\leqslant 0.1$. We used different numerical optimization techniques like gradient-based constrained BFGS algorithms and non-gradient-based Nelder-Mead methods with different start points. Therefore our approximation is smaller than the function that we approximate. We subtract an additional safety gap of 0.0131259 from our approximation to ensure that the inequality via the approximation holds true. With this safety gap the inequality would hold true even for negative $x$, where the approximation error becomes negative and the safety gap would compensate. Of course, the safety gap of 0.0131259 is not necessary for our analysis but may help for future investigations.
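To get a feeling for the size of such approximation errors, the following Python sketch compares a closed-form approximation of $e^{x^{2}}\operatorname{erfc}(x)$ with the exact value on $[0.7,3.2]$. The form $\frac{2.911}{\sqrt{\pi}(2.911-1)x+\sqrt{\pi x^{2}+2.911^{2}}}$ is an assumption reconstructed from the constant 2.911 that appears in the denominators of this proof. The error measured here is that of the bare approximation, not the error $E(x,y)$ of the whole expression analyzed in the text.

```python
import math

A = 2.911  # constant of the assumed approximation form

def erfcx_exact(x):
    """exp(x^2) * erfc(x) from the standard library."""
    return math.exp(x * x) * math.erfc(x)

def erfcx_approx(x):
    """Assumed closed-form approximation of exp(x^2) * erfc(x)."""
    sqrt_pi = math.sqrt(math.pi)
    return A / (sqrt_pi * (A - 1.0) * x + math.sqrt(math.pi * x * x + A * A))

xs = [0.7 + i * (3.2 - 0.7) / 1000 for i in range(1001)]
errors = [erfcx_approx(x) - erfcx_exact(x) for x in xs]
max_abs_error = max(abs(e) for e in errors)
print(max_abs_error)
```

A grid scan like this is cruder than the BFGS and Nelder-Mead searches mentioned above, but it is a quick plausibility check on the magnitude of the error.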

We have the sequences of inequalities using the approximation of Ren and MacKenzie:

We explain this sequence of inequalities:

First inequality: applies the approximation of Ren and MacKenzie and then subtracts a safety gap (which would not be necessary for the current analysis).

Equalities: The factor $\sqrt{2}\sqrt{x}$ is factored out and canceled.

Second inequality: adds a positive term in the first root to obtain a binomial form. The term containing the root is positive and the root is in the denominator, therefore the whole term becomes smaller.

Equalities: solve for the term and factor out.

Bringing all terms to the denominator $\left((x+y)+\frac{2.911}{\pi}\right)\left((2.911-1)(2x+y)+\sqrt{(2x+y)^{2}+\frac{2\cdot 2.911^{2}x}{\pi}}\right)$.

Equalities: Multiplying out and expanding terms.

Last inequality $>0$ is proved in the following sequence of inequalities.

We look at the numerator of the last expression of Eq. (262), which we show to be positive in order to show $>0$ in Eq. (262). The numerator is

The factor in front of the root is positive. If the term that does not contain the root were positive, then the whole expression would be positive and we would have proved that the numerator is positive. Therefore we consider the case that the term that does not contain the root is negative. The term that contains the root must be larger than the other term in absolute value.

Therefore the square of the root term has to be larger than the square of the other term to show $>0$ in Eq. (262). Thus, we have the inequality:

We used $24.7796\cdot(20)^{4}-1.2227\cdot(20)^{5}=52090.9>0$ and $x\leqslant 20$. We have proved the last inequality $>0$ of Eq. (262).

Consequently the derivative is always positive independent of $y$, thus

is strictly monotonically increasing in $x$.

Next we show that the sub-function Eq. (115) is smaller than zero. We consider the limit:

The limit follows from Lemma 22. Since the function is monotonically increasing in $x$, it has to approach from below. Thus,

We now consider the derivative of sub-function Eq. (115) with respect to $y$. We proved that sub-function Eq. (115) is strictly monotonically increasing independent of $y$. In the proof of Theorem 16, we need the minimum of sub-function Eq. (115). Therefore we are only interested in the derivative of sub-function Eq. (115) with respect to $y$ for the minimum $x=12/10=1.2$.

Consequently, we insert the minimum $x=12/10=1.2$ into the sub-function Eq. (115). The main terms become

The derivative of this function with respect to $y$ is

We will again use the approximation of Ren and MacKenzie:

Therefore we first perform an error analysis. We estimated the maximum and minimum of

We obtained for the maximal absolute error the value $0.163052$. We added an approximation error of $0.2$ to the approximation of the derivative. Since we want to show that the approximation upper bounds the true expression, the addition of the approximation error is required here. We get a sequence of inequalities:

We explain this sequence of inequalities.

First inequality: applies the approximation of Ren and MacKenzie and then adds the error bound to ensure that the approximation is larger than the true value.

First equality: The factors $2\sqrt{15}$ and $2\sqrt{\pi}$ are factored out and canceled.

Second equality: Bringing all terms to the denominator

Last inequality $<0$ is proved in the following sequence of inequalities.

We look at the numerator of the last term in Eq. (278). We have to prove that this numerator is smaller than zero in order to prove the last inequality of Eq. (278). The numerator is

We now compute upper bounds for this numerator:

For the first inequality we choose $y$ in the roots, so that positive terms maximally increase and negative terms maximally decrease. The second inequality just removed the $y^{2}$ term which is always negative, therefore increased the expression. For the last inequality, the term in brackets is negative for all settings of $y$. Therefore we make the brackets as negative as possible and make the whole term positive by multiplying with $y=-0.1$.

is strictly monotonically decreasing in $y$ for the minimal $x=1.2$. ∎

For $0.007\leqslant x\leqslant 0.875$ and $-0.01\leqslant y\leqslant 0.01$, the function

is smaller than zero, is strictly monotonically increasing in $x$ and strictly monotonically increasing in $y$ for the minimal $x=0.007=0.00875\cdot 0.8$, $x=0.56=0.7\cdot 0.8$, $x=0.128=0.16\cdot 0.8$, and $x=0.216=0.24\cdot 0.9$ (lower bound of $0.9$ on $\tau$).

We first consider the derivative of sub-function Eq. (125) with respect to $x$. The derivative of the function

For bounding this value, we use the approximation

from Ren and MacKenzie. We start with an error analysis of this approximation. According to Ren and MacKenzie (Figure 1), the approximation error is both positive and negative in the range $[0.175,1.33]$. This range contains all possible arguments of $\operatorname{erfc}$ that we consider in this subsection. Numerically we maximized and minimized the approximation error of the whole expression

We numerically determined $-0.000228141\leqslant E(x,y)\leqslant 0.00495688$ for $0.08\leqslant x\leqslant 0.875$ and $-0.01\leqslant y\leqslant 0.01$. We used different numerical optimization techniques like gradient-based constrained BFGS algorithms and non-gradient-based Nelder-Mead methods with different start points. Therefore our approximation is smaller than the function that we approximate.

We use an error gap of $-0.0003$ to counteract the error due to the approximation. We have the sequences of inequalities using the approximation of Ren and MacKenzie:

We explain this sequence of inequalities:

First inequality: applies the approximation of Ren and MacKenzie and then subtracts an error gap of $0.0003$.

Equalities: The factor $\sqrt{2}\sqrt{x}$ is factored out and canceled.

Second inequality: adds a positive term in the first root to obtain a binomial form. The term containing the root is positive and the root is in the denominator, therefore the whole term becomes smaller.

Equalities: solve for the term and factor out.

Bringing all terms to the denominator $\left((x+y)+\frac{2.911}{\pi}\right)\left((2.911-1)(2x+y)+\sqrt{(2x+y)^{2}+\frac{2\cdot 2.911^{2}x}{\pi}}\right)$.

Equalities: Multiplying out and expanding terms.

Last inequality $>0$ is proved in the following sequence of inequalities.

We look at the numerator of the last expression of Eq. (289), which we show to be positive in order to show $>0$ in Eq. (289). The numerator is

The factor $4x^{2}+2xy+2.7795x-2y^{2}-0.9269y-0.00027798$ in front of the root is positive:

If the term that does not contain the root were positive, then everything would be positive and we would have proved that the numerator is positive. Therefore we consider the case that the term that does not contain the root is negative. The term that contains the root must be larger than the other term in absolute value.

Therefore the square of the root term has to be larger than the square of the other term to show $>0$ in Eq. (289). Thus, we have the inequality:

We used $x\geqslant 0.007$ and $x\leqslant 0.875$ (reducing the negative $x^{4}$-term to an $x^{3}$-term). We have proved the last inequality $>0$ of Eq. (289).

Consequently the derivative is always positive independent of $y$, thus

is strictly monotonically increasing in $x$.

Next we show that the sub-function Eq. (125) is smaller than zero. We consider the limit:

The limit follows from Lemma 22. Since the function is monotonically increasing in $x$, it has to approach from below. Thus,

We now consider the derivative of sub-function Eq. (125) with respect to $y$. We proved that sub-function Eq. (125) is strictly monotonically increasing independent of $y$. In the proof of Theorem 3, we need the minimum of sub-function Eq. (125). First, we are interested in the derivative of sub-function Eq. (125) with respect to $y$ for the minimum $x=0.007=7/1000$.

Consequently, we insert the minimum $x=0.007=7/1000$ into the sub-function Eq. (125):

The derivative of this function with respect to $y$ is

For the first inequality, we use Lemma 24. Lemma 24 says that the function $xe^{x^{2}}\operatorname{erfc}(x)$ has the sign of $x$ and is monotonically increasing to $\frac{1}{\sqrt{\pi}}$. Consequently, we inserted the maximal $y=0.01$ to make the negative term more negative and the minimal $y=-0.01$ to make the positive term less positive.

is strictly monotonically increasing in $y$ for the minimal $x=0.007$.

Next, we consider $x=0.7\cdot 0.8=0.56$, which corresponds to the maximal $\nu=0.7$ and minimal $\tau=0.8$. We insert the minimum $x=0.56=56/100$ into the sub-function Eq. (125):

For the first inequality we applied Lemma 24 which states that the function $xe^{x^{2}}\operatorname{erfc}(x)$ is monotonically increasing. Consequently, we inserted the maximal $y=0.01$ to make the negative term more negative and the minimal $y=-0.01$ to make the positive term less positive.

is strictly monotonically increasing in $y$ for $x=0.56$.

Next, we consider $x=0.16\cdot 0.8=0.128$, which corresponds to the minimal $\tau=0.8$. We insert the minimum $x=0.128=128/1000$ into the sub-function Eq. (125):

For the first inequality we applied Lemma 24 which states that the function $xe^{x^{2}}\operatorname{erfc}(x)$ is monotonically increasing. Consequently, we inserted the maximal $y=0.01$ to make the negative term more negative and the minimal $y=-0.01$ to make the positive term less positive.

is strictly monotonically increasing in $y$ for $x=0.128$.

Next, we consider $x=0.24\cdot 0.9=0.216$, which corresponds to the minimal $\tau=0.9$ (here we consider $0.9$ as lower bound for $\tau$). We insert the minimum $x=0.216=216/1000$ into the sub-function Eq. (125):

For the first inequality we applied Lemma 24 which states that the function $xe^{x^{2}}\operatorname{erfc}(x)$ is monotonically increasing. Consequently, we inserted the maximal $y=0.01$ to make the negative term more negative and the minimal $y=-0.01$ to make the positive term less positive.

is strictly monotonically increasing in $y$ for $x=0.216$. ∎

For $\lambda=\lambda_{\rm 01}$, $\alpha=\alpha_{\rm 01}$ and the domain $-0.1\leqslant\mu\leqslant 0.1$, $-0.1\leqslant\omega\leqslant 0.1$, $0.00875\leqslant\nu\leqslant 0.7$, and $0.8\leqslant\tau\leqslant 1.25$, we are interested in the derivative of

The derivative of the equation above with respect to

$\tau$ is smaller than zero for maximal $\nu=0.7$, $\nu=0.16$, and $\nu=0.24$ (with $0.9\leqslant\tau$);

$y=\mu\omega$ is larger than zero for $\nu\tau=0.00875\cdot 0.8=0.007$, $\nu\tau=0.7\cdot 0.8=0.56$, $\nu\tau=0.16\cdot 0.8=0.128$, and $\nu\tau=0.24\cdot 0.9=0.216$.

We consider the domain: $-0.1\leqslant\mu\leqslant 0.1$, $-0.1\leqslant\omega\leqslant 0.1$, $0.00875\leqslant\nu\leqslant 0.7$, and $0.8\leqslant\tau\leqslant 1.25$.

We use Lemma 17 to determine the derivatives. Consequently, the derivative of

with respect to $\nu$ is larger than zero, which follows directly from Lemma 17 using the chain rule.

with respect to $y=\mu\omega$ is larger than zero for $\nu\tau=0.00875\cdot 0.8=0.007$, $\nu\tau=0.7\cdot 0.8=0.56$, $\nu\tau=0.16\cdot 0.8=0.128$, and $\nu\tau=0.24\cdot 0.9=0.216$, which also follows directly from Lemma 17.

We now consider the derivative with respect to $\tau$, which is not trivial since $\tau$ is a factor of the whole expression. The sub-expression should be maximized as it appears with negative sign in the mapping for $\nu$.

First, we consider the function for the largest $\nu=0.7$ and the largest $y=\mu\omega=0.01$ for determining the derivative with respect to $\tau$.

We are considering only the numerator and use again the approximation of Ren and MacKenzie. The error analysis on the whole numerator gives an approximation error $97<E<186$. Therefore we add 200 to the numerator when we use the approximation of Ren and MacKenzie. We obtain the inequalities:

After applying the approximation of Ren and MacKenzie and adding 200, we first factored out $20\sqrt{35}\sqrt{\tau}$. Then we brought all terms to the same denominator.

First we expanded the term (multiplied it out). Then we put the terms multiplied by the same square root into brackets. The next inequality sign stems from inserting the maximal value of $1.25$ for $\tau$ for some positive terms and the value of $0.8$ for negative terms. These terms are then expanded at the $=$-sign. The next equality factors the terms under the square root. We decreased the negative term by setting $\tau=\tau+0.0000263835$ under the root. We increased positive terms by setting $\tau+0.000026286=1.00003\tau$ and $\tau+0.000026383=1.00003\tau$ under the root for positive terms. The positive terms are increased, since $\frac{0.8+0.000026383}{0.8}=1.00003$, thus $\tau+0.000026286<\tau+0.000026383\leqslant 1.00003\tau$. For the next inequality we decreased negative terms by inserting $\tau=0.8$ and increased positive terms by inserting $\tau=1.25$. The next equality expands the terms. We use the upper bound of $1.25$ and the lower bound of $0.8$ to obtain terms with corresponding exponents of $\tau$.

For the last $\leqslant$-sign we used the function

The derivative at 0.8 is smaller than zero:

Since the second order derivative is negative, the derivative decreases with increasing $\tau$. Therefore the derivative is negative for all values of $\tau$ that we consider, that is, the function Eq. (318) is strictly monotonically decreasing. The maximum of the function Eq. (318) is therefore at $0.8$. We inserted $0.8$ to obtain the maximum.

with respect to $\tau$ is smaller than zero for maximal $\nu=0.7$.

Next, we consider the function for the largest $\nu=0.16$ and the largest $y=\mu\omega=0.01$ for determining the derivative with respect to $\tau$.

We are considering only the numerator and use again the approximation of Ren and MacKenzie. The error analysis on the whole numerator gives an approximation error $1.1<E<12$. Therefore we add 20 to the numerator when we use the approximation of Ren and MacKenzie. We obtain the inequalities:

After applying the approximation of Ren and MacKenzie and adding 20, we first factored out $40\sqrt{2}\sqrt{\tau}$. Then we brought all terms to the same denominator.

First we expanded the term (multiplied it out). Then we put the terms multiplied by the same square root into brackets. The next inequality sign stems from inserting the maximal value of $1.25$ for $\tau$ for some positive terms and the value of $0.8$ for negative terms. These terms are then expanded at the $=$-sign. The next equality factors the terms under the square root. We decreased the negative term by setting $\tau=\tau+0.00011542$ under the root. We increased positive terms by setting $\tau+0.00011542=1.00014\tau$ and $\tau+0.000115004=1.00014\tau$ under the root for positive terms. The positive terms are increased, since $\frac{0.8+0.00011542}{0.8}<1.000142$, thus $\tau+0.000115004<\tau+0.00011542\leqslant 1.00014\tau$. For the next inequality we decreased negative terms by inserting $\tau=0.8$ and increased positive terms by inserting $\tau=1.25$. The next equality expands the terms. We use the upper bound of $1.25$ and the lower bound of $0.8$ to obtain terms with corresponding exponents of $\tau$.

with respect to $\tau$ is smaller than zero for maximal $\nu=0.16$.

Next, we consider the function for the largest $\nu=0.24$ and the largest $y=\mu\omega=0.01$ for determining the derivative with respect to $\tau$. However we assume $0.9\leqslant\tau$, in order to restrict the domain of $\tau$.

We are considering only the numerator and use again the approximation of Ren and MacKenzie. The error analysis on the whole numerator gives an approximation error $14<E<32$. Therefore we add 32 to the numerator when we use the approximation of Ren and MacKenzie. We obtain the inequalities:

After applying the approximation of Ren and MacKenzie and adding 32, we first factored out $40\sqrt{3}\sqrt{\tau}$. Then we brought all terms to the same denominator.

First we expanded the term (multiplied it out). Then we put the terms multiplied by the same square root into brackets. The next inequality sign stems from inserting the maximal value of $1.25$ for $\tau$ for some positive terms and the value of $0.9$ for negative terms. These terms are then expanded at the $=$-sign. The next equality factors the terms under the square root. We decreased the negative term by setting $\tau=\tau+0.0000769518$ under the root. We increased positive terms by setting $\tau+0.0000769518=1.0000962\tau$ and $\tau+0.0000766694=1.0000962\tau$ under the root for positive terms. The positive terms are increased, since $\frac{0.8+0.0000769518}{0.8}<1.0000962$, thus $\tau+0.0000766694<\tau+0.0000769518\leqslant 1.0000962\tau$. For the next inequality we decreased negative terms by inserting $\tau=0.9$ and increased positive terms by inserting $\tau=1.25$. The next equality expands the terms. We use the upper bound of $1.25$ and the lower bound of $0.9$ to obtain terms with corresponding exponents of $\tau$.

with respect to $\tau$ is smaller than zero for maximal $\nu=0.24$ and the domain $0.9\leqslant\tau\leqslant 1.25$. ∎

In the domain $-0.01\leqslant y\leqslant 0.01$ and $0.64\leqslant x\leqslant 1.875$, the function $f(x,y)=e^{\frac{1}{2}(2y+x)}\operatorname{erfc}\left(\frac{x+y}{\sqrt{2x}}\right)$ has a global maximum at $x=0.64$ and $y=-0.01$ and a global minimum at $x=1.875$ and $y=0.01$.

$f(x,y)=e^{\frac{1}{2}(2y+x)}\operatorname{erfc}\left(\frac{x+y}{\sqrt{2x}}\right)$ is strictly monotonically decreasing in $x$, since its derivative with respect to $x$ is negative:

The two last inequalities come from applying the Abramowitz bounds (Lemma 22) and from the fact that the expression $\frac{2x^{3/2}}{\frac{x+y}{\sqrt{2}\sqrt{x}}+\sqrt{\frac{(x+y)^{2}}{2x}+\frac{4}{\pi}}}+y\sqrt{2}-x\sqrt{2}$ does not change monotonicity in the domain and hence the maximum must be found at the border. For $x=0.64$, which maximizes the function, $f(x,y)$ is monotonic in $y$, because its derivative with respect to $y$ at $x=0.64$ is

Therefore, the values $x=0.64$ and $y=-0.01$ give a global maximum of the function $f(x,y)$ in the domain $-0.01\leqslant y\leqslant 0.01$ and $0.64\leqslant x\leqslant 1.875$ and the values $x=1.875$ and $y=0.01$ give the global minimum. ∎
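The location of the extrema can be cross-checked with a brute-force grid search. This is a numerical sanity check only and does not replace the monotonicity argument; the grid resolution is an arbitrary choice.

```python
import math

def f(x, y):
    """f(x, y) = exp((2y + x) / 2) * erfc((x + y) / sqrt(2x))."""
    return math.exp(0.5 * (2.0 * y + x)) * math.erfc((x + y) / math.sqrt(2.0 * x))

n = 200
xs = [0.64 + (1.875 - 0.64) * i / n for i in range(n + 1)]   # 0.64 <= x <= 1.875
ys = [-0.01 + 0.02 * j / n for j in range(n + 1)]            # -0.01 <= y <= 0.01
points = [(x, y) for x in xs for y in ys]

arg_max = max(points, key=lambda p: f(*p))
arg_min = min(points, key=lambda p: f(*p))
print(arg_max, arg_min)
```

On this grid the maximum sits at the corner with the smallest $x$ and smallest $y$, and the minimum at the corner with the largest $x$ and largest $y$, in line with the function decreasing in both variables.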

A4 Additional information on experiments

In this section, we report the hyperparameters that were considered for each method and data set and give details on the processing of the data sets.

For the UCI data sets, the best hyperparameter setting was determined by a grid-search over all hyperparameter combinations using 15% of the training data as validation set. The early stopping parameter was determined on the smoothed learning curves of 100 epochs of the validation set. Smoothing was done using moving averages of 10 consecutive values. We tested “rectangular” and “conic” layers: rectangular layers have a constant number of hidden units in each layer, whereas conic layers start with the given number of hidden units in the first layer and then decrease the number of hidden units to the size of the output layer according to a geometric progression. If multiple hyperparameter settings provided identical performance on the validation set, we preferred settings with a higher number of layers, lower learning rates and higher dropout rates. All methods had the chance to adjust their hyperparameters to the data set at hand.
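The early-stopping rule can be illustrated with a short sketch. It assumes that the validation curve is an error to be minimized and that the selected epoch is the last epoch of the best 10-value window; neither detail is stated in the text, and the function names `moving_average` and `early_stopping_epoch` are hypothetical.

```python
def moving_average(values, window=10):
    """Moving average over `window` consecutive values (no padding)."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    smoothed = [running / window]
    for i in range(window, len(values)):
        # Slide the window by one position: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        smoothed.append(running / window)
    return smoothed

def early_stopping_epoch(val_errors, window=10):
    """1-based epoch at which the best smoothed validation error is reached.

    Assumption: lower is better, and a window is attributed to its last epoch.
    """
    smoothed = moving_average(val_errors, window)
    best = min(range(len(smoothed)), key=smoothed.__getitem__)
    return best + window

# Toy validation curve over 100 epochs with its minimum at epoch 50.
toy_curve = [abs(epoch - 50) for epoch in range(1, 101)]
stop_epoch = early_stopping_epoch(toy_curve)
```

For a validation accuracy instead of an error, `min` would be replaced by `max`. Smoothing before selecting the epoch makes the choice less sensitive to single-epoch fluctuations of the learning curve.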

A4.2 121 UCI Machine Learning Repository data sets: detailed results

We used the data sets and preprocessing scripts of Fernández-Delgado et al. for data preparation and for defining training and test sets. The authors compared 179 machine learning methods from 17 groups in their experiments; their method comparison had several flaws, which we avoided. The method groups were defined by Fernández-Delgado et al. as follows: Support Vector Machines, RandomForest, Multivariate adaptive regression splines (MARS), Boosting, Rule-based, logistic and multinomial regression, Discriminant Analysis (DA), Bagging, Nearest Neighbour, DecisionTree, other Ensembles, Neural Networks, Bayesian, Other Methods, generalized linear models (GLM), Partial least squares and principal component regression (PLSR), and Stacking. However, many of the methods assigned to those groups were merely different implementations of the same method. Therefore, we selected one representative of each of the 17 groups for method comparison. The representative method was chosen as the group’s method with the median performance across all tasks. Finally, we included 17 other machine learning methods of Fernández-Delgado et al., 6 FNNs (BatchNorm, WeightNorm, LayerNorm, Highway, Residual and MSRAinit networks), and self-normalizing neural networks (SNNs), giving a total of 24 compared methods.
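One possible reading of the selection of a group representative is sketched below. It assumes that "performance across all tasks" means the mean accuracy over tasks, and that for a group with an even number of methods the lower of the two middle methods is taken; both are assumptions, and the function name and the toy numbers are hypothetical.

```python
from statistics import mean, median_low

def group_representative(accuracies):
    """Pick the method whose average accuracy is the (lower) median of its group.

    `accuracies` maps a method name to its list of per-task accuracies.
    """
    average = {method: mean(acc) for method, acc in accuracies.items()}
    target = median_low(average.values())
    # If several methods share the median value, pick the first by name
    # so that the result is deterministic.
    return min(m for m, value in average.items() if value == target)

# Hypothetical group of three implementations evaluated on two tasks.
toy_group = {
    "svm_a": [0.80, 0.90],
    "svm_b": [0.70, 0.70],
    "svm_c": [0.90, 0.95],
}
representative = group_representative(toy_group)
```

Choosing the median method rather than the best one avoids favouring a group merely because it contains many implementations of the same method.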

The results of the compared FNN methods can be found in Table A11.

We assigned each of the 121 UCI data sets to the group “large datasets” if it had more than 1,000 data points and to the group “small datasets” otherwise. We expected that Deep Learning methods require large data sets to be competitive with other machine learning methods. This resulted in 75 small and 46 large data sets.

The results of the method comparison are given in Tables A12 and A13 for small and large data sets, respectively. On small data sets, SVMs performed best, followed by RandomForest and SNNs. On large data sets, SNNs performed best, followed by SVMs and RandomForest.

A4.3 Tox21 challenge data set: Hyperparameters

For the Tox21 data set, the best hyperparameter setting was determined by a grid-search over all hyperparameter combinations using the validation set defined by the challenge winners. The hyperparameter space was chosen to be similar to the hyperparameters that were tested by Mayr et al. The early stopping parameter was determined on the smoothed learning curves of 100 epochs of the validation set. Smoothing was done using moving averages of 10 consecutive values. We tested “rectangular” and “conic” layers: rectangular layers have a constant number of hidden units in each layer, whereas conic layers start with the given number of hidden units in the first layer and then decrease the number of hidden units to the size of the output layer according to a geometric progression. All methods had the chance to adjust their hyperparameters to the data set at hand.
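The two layer layouts can be made concrete with a small sketch. The exact rounding and the common ratio of the geometric progression are not specified in the text, so the version below is one plausible interpretation: the ratio is chosen such that one further step after the last hidden layer would reach the output size. The function names are hypothetical.

```python
def rectangular_layer_sizes(n_units, n_hidden_layers):
    """Constant number of hidden units in every hidden layer."""
    return [n_units] * n_hidden_layers

def conic_layer_sizes(n_first, n_hidden_layers, n_out):
    """Hidden layer widths shrinking geometrically from n_first toward n_out.

    Assumption: width_i = n_first * r**i with r = (n_out / n_first)**(1 / n_hidden_layers),
    rounded to the nearest integer and never smaller than n_out.
    """
    if n_hidden_layers < 1 or n_first < n_out:
        raise ValueError("need at least one hidden layer and n_first >= n_out")
    ratio = (n_out / n_first) ** (1.0 / n_hidden_layers)
    return [max(n_out, round(n_first * ratio ** i)) for i in range(n_hidden_layers)]

# Example: 4 hidden layers starting at 1024 units, followed by 4 output units.
conic = conic_layer_sizes(1024, 4, 4)
rect = rectangular_layer_sizes(1024, 4)
```

With these example values the common ratio is 1/4, so each hidden layer has a quarter of the units of the previous one before the output layer is reached.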

We empirically checked the assumption that the distribution of network inputs can be well approximated by a normal distribution. To this end, we investigated the density of the network inputs before and during learning and found that these densities are close to normal distributions (see Figure A8).

A4.4 HTRU2 data set: Hyperparameters

For the HTRU2 data set, the best hyperparameter setting was determined by a grid-search over all hyperparameter combinations using one of the 9 non-testing folds as validation fold in a nested cross-validation procedure. Concretely, if $M$ was the testing fold, we used $M-1$ as validation fold, and for $M=1$ we used fold $10$ for validation. The early stopping parameter was determined on the smoothed learning curves of 100 epochs of the validation set. Smoothing was done using moving averages of 10 consecutive values. We tested “rectangular” and “conic” layers: rectangular layers have a constant number of hidden units in each layer, whereas conic layers start with the given number of hidden units in the first layer and then decrease the number of hidden units to the size of the output layer according to a geometric progression. All methods had the chance to adjust their hyperparameters to the data set at hand.
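The fold assignment described above can be written down in a few lines. This is a minimal sketch with hypothetical function names, assuming folds are numbered from 1 to 10 and that the remaining 8 folds are used for training.

```python
def validation_fold(test_fold, n_folds=10):
    """Validation fold for a given testing fold M: M - 1, wrapping to n_folds for M = 1."""
    if not 1 <= test_fold <= n_folds:
        raise ValueError("test_fold must be between 1 and n_folds")
    return test_fold - 1 if test_fold > 1 else n_folds

def split_folds(test_fold, n_folds=10):
    """Return (training folds, validation fold, testing fold) for one outer iteration."""
    val_fold = validation_fold(test_fold, n_folds)
    train_folds = [k for k in range(1, n_folds + 1) if k not in (test_fold, val_fold)]
    return train_folds, val_fold, test_fold

# Enumerate all ten outer iterations of the nested cross-validation.
splits = [split_folds(m) for m in range(1, 11)]
```

With this scheme every fold serves exactly once as testing fold and exactly once as validation fold.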

A5 Other fixed points

A6 Bounds determined by numerical methods

In this section we report bounds on previously discussed expressions as determined by numerical methods (min and max have been computed).

A7 References
