Recovery Guarantees for One-hidden-layer Neural Networks

Kai Zhong, Zhao Song, Prateek Jain, Peter L. Bartlett, Inderjit S. Dhillon

Introduction

Neural Networks (NNs) have achieved great practical success recently. Many theoretical contributions have been made very recently to understand the extraordinary performance of NNs. The remarkable results of NNs on complex tasks in computer vision and natural language processing inspired works on the expressive power of NNs [CSS16, CS16, RPK+16, DFS16, PLR+16, MPCB14, Tel16]. Indeed, several works found NNs are very powerful and the deeper the more powerful. However, due to the high non-convexity of NNs, knowing the expressivity of NNs doesn’t guarantee that the targeted functions will be learned. Therefore, several other works focused on the achievability of global optima. Many of them considered the over-parameterized setting, where the global optima or local minima close to the global optima will be achieved when the number of parameters is large enough, including [FB16, HV15, LSSS14, DPG+14, SS16, HM17]. This, however, leads to overfitting easily and can’t provide any generalization guarantees, which are actually the essential goal in most tasks.

A few works have considered generalization performance. For example, [XLS17] provide generalization bound under the Rademacher generalization analysis framework. Recently [ZBH+17] describe some experiments showing that NNs are complex enough that they actually memorize the training data but still generalize well. As they claim, this cannot be explained by applying generalization analysis techniques, like VC dimension and Rademacher complexity, to classification loss (although it does not rule out a margins analysis—see, for example, [Bar98]; their experiments involve the unbounded cross-entropy loss).

In this paper, we don’t develop a new generalization analysis. Instead we focus on parameter recovery setting, where we assume there are underlying ground-truth parameters and we provide recovery guarantees for the ground-truth parameters up to equivalent permutations. Since the parameters are exactly recovered, the generalization performance will also be guaranteed.

Several other techniques are also provided to recover the parameters or to guarantee generalization performance, such as tensor methods [JSA15] and kernel methods [AGMR17]. These methods require sample complexity O(d3)O(d^{3}) or computational complexity O~(n2)\widetilde{O}(n^{2}), which can be intractable in practice. We propose an algorithm that has recovery guarantees for 1NN with sample complexity O~(d)\widetilde{O}(d) and computational time O~(dn)\widetilde{O}(dn) under some mild assumptions.

Recently [Sha16] show that neither specific assumptions on the niceness of the input distribution or niceness of the target function alone is sufficient to guarantee learnability using gradient-based methods. In this paper, we assume data points are sampled from Gaussian distribution and the parameters of hidden neurons are linearly independent.

We distill some properties for activation functions, which are satisfied by a wide range of activations, including ReLU, squared ReLU, sigmoid and tanh. With these properties we show positive definiteness (PD) of the Hessian in the neighborhood of the ground-truth parameters given enough samples (Theorem 4.2). Further, for activations that are also smooth, we show local linear convergence is guaranteed using gradient descent.

We propose a tensor method to initialize the parameters such that the initialized parameters fall into the local positive definiteness area. Our contribution is that we reduce the sample/computational complexity from cubic dependency on dimension to linear dependency (Theorem 5.6).

Combining the above two results, we provide a globally converging algorithm (Algorithm 2) for smooth homogeneous activations satisfying the distilled properties. The whole procedure requires sample/computational complexity linear in dimension and logarithmic in precision (Theorem 6.1).

Related Work

The recent empirical success of NNs has boosted their theoretical analyses [FZK+16, Bal16, BMBY16, SBL16, APVZ14, AGMR17, GKKT17]. In this paper, we classify them into three main directions.

Expressive power is studied to understand the remarkable performance of neural networks on complex tasks. Although one-hidden-layer neural networks with sufficiently many hidden nodes can approximate any continuous function [Hor91], shallow networks can’t achieve the same performance in practice as deep networks. Theoretically, several recent works show the depth of NNs plays an essential role in the expressive power of neural networks [DFS16]. As shown in [CSS16, CS16, Tel16], functions that can be implemented by a deep network of polynomial size require exponential size in order to be implemented by a shallow network. [RPK+16, PLR+16, MPCB14, AGMR17] design some measures of expressivity that display an exponential dependence on the depth of the network. However, the increasing of the expressivity of NNs or its depth also increases the difficulty of the learning process to achieve a good enough model. In this paper, we focus on 1NNs and provide recovery guarantees using a finite number of samples.

2 Achievability of Global Optima

The global convergence is in general not guaranteed for NNs due to their non-convexity. It is widely believed that training deep models using gradient-based methods works so well because the error surface either has no local minima, or if they exist they need to be close in value to the global minima. [SCP16] present examples showing that for this to be true additional assumptions on the data, initialization schemes and/or the model classes have to be made. Indeed the achievability of global optima has been shown under many different types of assumptions.

In particular, [CHM+15] analyze the loss surface of a special random neural network through spin-glass theory and show that it has exponentially many local optima, whose loss is small and close to that of a global optimum. Later on, [Kaw16] eliminate some assumptions made by [CHM+15] but still require the independence of activations as [CHM+15], which is unrealistic. [SS16] study the geometric structure of the neural network objective function. They have shown that with high probability random initialization will fall into a basin with a small objective value when the network is over-parameterized. [LSSS14] consider polynomial networks where the activations are square functions, which are typically not used in practice. [HV15] show that when a local minimum has zero parameters related to a hidden node, a global optimum is achieved. [FB16] study the landscape of 1NN in terms of topology and geometry, and show that the level set becomes connected as the network is increasingly over-parameterized. [HM17] show that products of matrices don’t have spurious local minima and that deep residual networks can represent any function on a sample, as long as the number of parameters is larger than the sample size. [SC16] consider over-specified NNs, where the number of samples is smaller than the number of weights. [DPG+14] propose a new approach to second-order optimization that identifies and attacks the saddle point problem in high-dimensional non-convex optimization. They apply the approach to recurrent neural networks and show practical performance. [AGMR17] use results from tropical geometry to show global optimality of an algorithm, but it requires (2n)kpoly(n)(2n)^{k}\operatorname{poly}(n) computational complexity.

Almost all of these results require the number of parameters is larger than the number of points, which probably overfits the model and no generalization performance will be guaranteed. In this paper, we propose an efficient and provable algorithm for 1NNs that can achieve the underlying ground-truth parameters.

3 Generalization Bound / Recovery Guarantees

The achievability of global optima of the objective from the training data doesn’t guarantee the learned model to be able to generalize well on unseen testing data. In the literature, we find three main approaches to generalization guarantees.

1) Use generalization analysis frameworks, including VC dimension/Rademacher complexity, to bound the generalization error. A few works have studied the generalization performance for NNs. [XLS17] follow [SC16] but additionally provide generalization bounds using Rademacher complexity. They assume the obtained parameters are in a regularization set so that the generalization performance is guaranteed, but this assumption can’t be justified theoretically. [HRS16] apply stability analysis to the generalization analysis of SGD for convex and non-convex problems, arguing early stopping is important for generalization performance.

2) Assume an underlying model and try to recover this model. This direction is popular for many non-convex problems including matrix completion/sensing [JNS13, Har14, SL15, BLWZ17], mixed linear regression [ZJD16], subspace recovery [EV09] and other latent models [AGH+14].

Without making any assumptions, those non-convex problems are intractable [AGKM12, GV15, SWZ17a, GG11, RSW16, SR11, HM13, AGM12, YCS14]. Recovery guarantees for NNs also need assumptions. Several different approaches under different assumptions are provided to have recovery guarantees on different NN settings.

Tensor methods [AGH+14, WTSA15, WA16, SWZ16] are a general tool for recovering models with latent factors by assuming the data distribution is known. Some existing recovery guarantees for NNs are provided by tensor methods [SA15, JSA15]. However, [SA15] only provide guarantees to recover the subspace spanned by the weight matrix and no sample complexity is given, while [JSA15] require O(d3/ϵ2)O(d^{3}/\epsilon^{2}) sample complexity. In this paper, we use tensor methods as an initialization step so that we don’t need very accurate estimation of the moments, which enables us to reduce the total sample complexity from 1/ϵ21/\epsilon^{2} to log(1/ϵ)\log(1/\epsilon).

[ABGM14] provide polynomial sample complexity and computational complexity bounds for learning deep representations in unsupervised setting, and they need to assume the weights are sparse and randomly distributed in $$.

[Tia17] analyze 1NN by assuming Gaussian inputs in a supervised setting, in particular, regression and classification with a teacher. This paper also considers this setting. However, there are some key differences. a) [Tia17] require the second-layer parameters are all ones, while we can learn these parameters. b) In [Tia17], the ground-truth first-layer weight vectors are required to be orthogonal, while we only require linear independence. c) [Tia17] require a good initialization but doesn’t provide initialization methods, while we show the parameters can be efficiently initialized by tensor methods. d) In [Tia17], only the population case (infinite sample size) is considered, so there is no sample complexity analysis, while we show finite sample complexity.

Recovery guarantees for convolution neural network with Gaussian inputs are provided in [BG17], where they show a globally converging guarantee of gradient descent on a one-hidden-layer no-overlap convolution neural network. However, they consider population case, so no sample complexity is provided. Also their analysis depends on ReLU\mathsf{ReLU} activations and the no-overlap case is very unlikely to be used in practice. In this paper, we consider a large range of activation functions, but for one-hidden-layer fully-connected NNs.

3) Improper Learning. In the improper learning setting for NNs, the learning algorithm is not restricted to output a NN, but only should output a prediction function whose error is not much larger than the error of the best NN among all the NNs considered. [ZLJ16, ZLW16] propose kernel methods to learn the prediction function which is guaranteed to have generalization performance close to that of the NN. However, the sample complexity and computational complexity are exponential. [AZS14] transform NNs to convex semi-definite programming. The works by [Bac14] and [BRV+05] are also in this direction. However, these methods are actually not learning the original NNs. Another work by [ZLWJ17] uses random initializations to achieve arbitrary small excess risk. However, their algorithm has exponential running time in 1/ϵ1/\epsilon.

The paper is organized as follows. In Section 3, we present our problem setting and show three key properties of activations required for our guarantees. In Section 4, we introduce the formal theorem of local strong convexity and show local linear convergence for smooth activations. Section 5 presents a tensor method to initialize the parameters so that they fall into the basin of the local strong convexity region.

Problem Formulation

We consider the following regression problem. Given a set of nn samples

such that each sample (x,y)S(x,y)\in S is sampled i.i.d. from this distribution, with

where ϕ(z)\phi(z) is the activation function, kk is the number of nodes in the hidden layer. The main question we want to answer is: How many samples are sufficient to recover the underlying parameters?

It is well-known that, training one hidden layer neural network is NP-complete [BR88]. Thus, without making any assumptions, learning deep neural network is intractable. Throughout the paper, we assume xx follows a standard normal distribution; the data is noiseless; the dimension of input data is at least the number of hidden nodes; and activation function ϕ(z)\phi(z) satisfies some reasonable properties.

Actually our results can be easily extended to multivariate Gaussian distribution with positive definite covariance and zero mean since we can estimate the covariance first and then transform the input to a standard normal distribution but with some loss of accuracy. Although this paper focuses on the regression problem, we can transform classification problems to regression problems if a good teacher is provided as described in [Tia17]. Our analysis requires kk to be no greater than dd, since the first-layer parameters will be linearly dependent otherwise.

For activation function ϕ(z)\phi(z), we assume it is continuous and if it is non-smooth let its first derivative be left derivative. Furthermore, we assume it satisfies Property 3.1, 3.2, and 3.3. These properties are critical for the later analyses. We also observe that most activation functions actually satisfy these three properties.

The first derivative ϕ(z)\phi^{\prime}(z) is nonnegative and homogeneously bounded, i.e., 0ϕ(z)L1zp0\leq\phi^{\prime}(z)\leq L_{1}|z|^{p} for some constants L1>0L_{1}>0 and p0p\geq 0.

The second derivative ϕ(z)\phi^{\prime\prime}(z) is either (a) globally bounded ϕ(z)L2|\phi^{\prime\prime}(z)|\leq L_{2} for some constant L2L_{2}, i.e., ϕ(z)\phi(z) is L2L_{2}-smooth, or (b) ϕ(z)=0\phi^{\prime\prime}(z)=0 except for ee (ee is a finite constant) points.

The first two properties are related to the first derivative ϕ(z)\phi^{\prime}(z) and the last one is about the second derivative ϕ(z)\phi^{\prime\prime}(z). At high level, Property 3.1 requires ϕ\phi to be non-decreasing with homogeneously bounded derivative; Property 3.2 requires ϕ\phi to be highly non-linear; Property 3.3 requires ϕ\phi to be either smooth or piece-wise linear.

Positive Definiteness of Hessian

In this section, we study the Hessian of empirical risk near the ground truth. We consider the case when vv^{*} is already known. Note that for homogeneous activations, we can assume vi{1,1}v^{*}_{i}\in\{-1,1\} since vϕ(z)=vvϕ(v1/pz)v\phi(z)=\frac{v}{|v|}\phi(|v|^{1/p}z), where pp is the degree of homogeneity. As viv_{i}^{*} only takes discrete values for homogeneous activations, in the next section, we show we can exactly recover vv^{*} using tensor methods with finite samples.

For a set of samples SS, we define the Empirical Risk,

For a distribution D{\cal D}, we define the Expected Risk,

Let’s calculate the gradient and the Hessian of f^S(W)\widehat{f}_{S}(W) and fD(W)f_{\cal D}(W). For each j[k]j\in[k], the partial gradient of fD(W)f_{\cal D}(W) with respect to wjw_{j} can be represented as

For each j,l[k]j,l\in[k] and jlj\neq l, the second partial derivative of fD(W)f_{\cal D}(W) for the (j,l)(j,l)-th off-diagonal block is,

and for each j[k]j\in[k], the second partial derivative of fD(W)f_{\cal D}(W) for the jj-th diagonal block is

If Property 3.3(b) is satisfied, ϕ(z)=0\phi^{\prime\prime}(z)=0 almost surely. So in this case the diagonal blocks of the empirical Hessian can be written as,

Now we show the Hessian of the objective near the global optimum is positive definite.

As we can see from Theorem 4.2, ρ(σk)\rho(\sigma_{k}) from Property 3.2 plays an important role for positive definite (PD) property. Interestingly, many popular activations, like ReLU, sigmoid and tanh, have ρ(σk)>0\rho(\sigma_{k})>0, while some simple functions like linear (ϕ(z)=z\phi(z)=z) and square (ϕ(z)=z2\phi(z)=z^{2}) functions have ρ(σk)=0\rho(\sigma_{k})=0 and their Hessians are rank-deficient. Another important numbers are κ\kappa and λ\lambda, two different condition numbers of the weight matrix, which directly influences the positive definiteness. If WW^{*} is rank deficient, λ\lambda\to\infty, κ\kappa\to\infty and we don’t have PD property. In the best case when WW^{*} is orthogonal, λ=κ=1\lambda=\kappa=1. In the worse case, λ\lambda can be exponential in kk. Also WW should be close enough to WW^{*}. In the next section, we provide tensor methods to initialize wiw_{i}^{*} and viv_{i}^{*} such that they satisfy the conditions in Theorem 4.2.

For the PD property to hold, we need the samples to be independent of the current parameters. Therefore, we need to do resampling at each iteration to guarantee the convergence in iterative algorithms like gradient descent. The following theorem provides the linear convergence guarantee of gradient descent for smooth activations.

Let WW be the current iterate satisfying WWpoly(1/ν,1/k,1/λ,ρ/σ12p)W\|W-W^{*}\|\leq\operatorname{poly}(1/\nu,1/k,1/\lambda,\rho/\sigma_{1}^{2p})\|W^{*}\|. Let SS denote a set of i.i.d. samples from distribution D{\cal D} (defined in (1)) with Sdpoly(logd,t,k,ν,τ,λ,σ12p/ρ)|S|\geq d\cdot\operatorname{poly}(\log d,t,k,\nu,\tau,\lambda,\sigma_{1}^{2p}/\rho) and let the activation function satisfy Property 3.1,3.2 and 3.3(a). Define m0:=Θ(vmin2ρ(σk)/(κ2λ))m_{0}:=\Theta(v_{\min}^{2}\rho(\sigma_{k})/(\kappa^{2}\lambda)) and M0:=Θ(kvmax2σ12p)M_{0}:=\Theta(kv_{\max}^{2}\sigma_{1}^{2p}). If we perform gradient descent with step size 1/M01/M_{0} on f^S(W)\widehat{f}_{S}(W) and obtain the next iterate,

then with probability at least 1dΩ(t)1-d^{-\Omega(t)},

We provide the proofs in the Appendix D.1

Tensor Methods for Initialization

In this section, we show that Tensor methods can recover the parameters WW^{*} to some precision and exactly recover vv^{*} for homogeneous activations.

It is known that most tensor problems are NP-hard [Hås90, HL13] or even hard to approximate [SWZ17b]. However, by making some assumptions, tensor decomposition method becomes efficient [AGH+14, WTSA15, WA16, SWZ16]. Here we utilize the noiseless assumption and Gaussian inputs assumption to show a provable and efficient tensor methods.

where A1,i,j=viviejejA_{1,i,j}=v_{i}\otimes v_{i}\otimes e_{j}\otimes e_{j}, A2,i,j=viejviejA_{2,i,j}=v_{i}\otimes e_{j}\otimes v_{i}\otimes e_{j}, A3,i,j=ejviviejA_{3,i,j}=e_{j}\otimes v_{i}\otimes v_{i}\otimes e_{j}, A4,i,j=viejejviA_{4,i,j}=v_{i}\otimes e_{j}\otimes e_{j}\otimes v_{i}, A5,i,j=ejviejviA_{5,i,j}=e_{j}\otimes v_{i}\otimes e_{j}\otimes v_{i} and A6,i,j=ejejviviA_{6,i,j}=e_{j}\otimes e_{j}\otimes v_{i}\otimes v_{i}.

Denote w=w/w\overline{w}=w/\|w\|. Now let’s calculate some moments.

According to Definition 5.1, we have the following results,

For each jj\in, Mj=i=1kvimj,iwijM_{j}=\sum_{i=1}^{k}v_{i}^{*}m_{j,i}\overline{w}_{i}^{*\otimes j}.

Note that some mj,im_{j,i}’s will be zero for specific activations. For example, for activations with symmetric first derivatives, i.e., ϕ(z)=ϕ(z)\phi^{\prime}(z)=\phi^{\prime}(-z), like sigmoid and erf, we have ϕ(z)+ϕ(z)\phi(z)+\phi(-z) being a constant and M2=0M_{2}=0 since γ0(σ)=γ2(σ)\gamma_{0}(\sigma)=\gamma_{2}(\sigma). Another example is ReLU\mathsf{ReLU}. ReLU\mathsf{ReLU} functions have vanishing M3M_{3}, i.e., M3=0M_{3}=0, as γ3(σ)=3γ1(σ)\gamma_{3}(\sigma)=3\gamma_{1}(\sigma). To make tensor methods work, we make the following assumption.

Assume the activation function ϕ(z)\phi(z) satisfies the following conditions: 1. If Mj0M_{j}\neq 0, then mj,i0m_{j,i}\neq 0 for all i[k]i\in[k]. 2. At least one of M3M_{3} and M4M_{4} is non-zero. 3. If M1=M3=0M_{1}=M_{3}=0, then ϕ(z)\phi(z) is an even function, i.e., ϕ(z)=ϕ(z)\phi(z)=\phi(-z). 4. If M2=M4=0M_{2}=M_{4}=0, then ϕ(z)\phi(z) is an odd function, i.e., ϕ(z)=ϕ(z)\phi(z)=-\phi(-z).

If ϕ(z)\phi(z) is an odd function then ϕ(z)=ϕ(z)\phi(z)=-\phi(-z) and vϕ(wx)=vϕ(wx)v\phi(w^{\top}x)=-v\phi(-w^{\top}x). Hence we can always assume v>0v>0. If ϕ(z)\phi(z) is an even function, then vϕ(wx)=vϕ(wx)v\phi(w^{\top}x)=v\phi(-w^{\top}x). So if ww recovers ww^{*} then w-w also recovers ww^{*}. Note that ReLU\mathsf{ReLU}, leaky ReLU\mathsf{ReLU} and squared ReLU\mathsf{ReLU} satisfy Assumption 5.3. We further define the following non-zero moments.

According to Definition 5.1 and 5.4, we have,

P2=i=1kvimj2,i(αwi)j22wi2P_{2}=\sum_{i=1}^{k}v_{i}^{*}m_{j_{2},i}(\alpha^{\top}\overline{w}_{i}^{*})^{{j_{2}-2}}\overline{w}_{i}^{*\otimes 2} and P3=i=1kvimj3,i(αwi)j33wi3P_{3}=\sum_{i=1}^{k}v_{i}^{*}m_{j_{3},i}(\alpha^{\top}\overline{w}_{i}^{*})^{{j_{3}-3}}\overline{w}_{i}^{*\otimes 3}.

2 Algorithm

Now we briefly introduce how to use a set of samples with size linear in dimension to recover the ground truth parameters to some precision. As shown in the previous section, we have a rank-kk 3rd-order moment P3P_{3} that has tensor decomposition formed by {w1,w2,,wk}\{\overline{w}_{1}^{*},\overline{w}_{2}^{*},\cdots,\overline{w}_{k}^{*}\}. Therefore, we can use the non-orthogonal decomposition method [KCL15] to decompose the corresponding estimated tensor P^3\widehat{P}_{3} and obtain an approximation of the parameters. The precision of the obtained parameters depends on the estimation error of P3P_{3}, which requires Ω(d3/ϵ2)\Omega(d^{3}/\epsilon^{2}) samples to achieve ϵ\epsilon error. Also, the time complexity for tensor decomposition on a d×d×dd\times d\times d tensor is Ω(d3)\Omega(d^{3}).

The detailed algorithm is shown in Algorithm 1. First, we randomly partition the dataset into three subsets each with size O~(d)\widetilde{O}(d). Then apply the power method on P^2\widehat{P}_{2}, which is the estimation of P2P_{2} from S2S_{2}, to estimate VV. After that, the non-orthogonal tensor decomposition (KCL)[KCL15] on R^3\widehat{R}_{3} outputs u^i\widehat{u}_{i} which estimates siVwis_{i}V^{\top}\overline{w}_{i}^{*} for i[k]i\in[k] with unknown sign si{1,1}s_{i}\in\{-1,1\}. Hence wi\overline{w}_{i}^{*} can be estimated by siVu^is_{i}V\widehat{u}_{i}. Finally we estimate the magnitude of wiw_{i}^{*} and the signs si,vis_{i},v_{i}^{*} in the RecMagSign function for homogeneous activations. We discuss the details of each procedure and provide PowerMethod and RecMagSign algorithms in Appendix E.

3 Theoretical Analysis

We formally present our theorem for Algorithm 1, and provide the proof in the Appendix E.2.

Global Convergence

Combining the positive definiteness of the Hessian near the global optimal in Section 4 and the tensor initialization methods in Section 5, we come up with the overall globally converging algorithm Algorithm 2 and its guarantee Theorem 6.1.

with probability at least 1dΩ(t)1-d^{-\Omega(t)}.

This follows by combining Theorem 4.4 and Theorem 5.6.

Numerical Experiments

First we show the number of samples required to recover the parameters for different dimensions. We fix k=5k=5, change dd for d=10,20,,100d=10,20,\cdots,100 and nn for n=1000,2000,,10000n=1000,2000,\cdots,10000. For each pair of dd and nn, we run 1010 trials. We say a trial successfully recovers the parameters if there exists a permutation π:[k][k]\pi:[k]\rightarrow[k], such that the returned parameters WW and vv satisfy

We record the recovery rates and represent them as grey scale in Fig. 1(a). As we can see from Fig. 1(a), the least number of samples required to have 100% recovery rate is about proportional to the dimension.

Next we test the tensor initialization. We show the error between the output of the tensor method and the ground truth parameters against the number of samples under different dimensions in Fig 1(b). The pure dark blocks indicate, in at least one of the 10 trials, i=1kvi(0)i=1kvi\sum_{i=1}^{k}v^{{(0)}}_{i}\neq\sum_{i=1}^{k}v^{*}_{i}, which means vi(0)v^{(0)}_{i} is not correctly initialized. Let Π(k)\Pi(k) denote the set of all possible permutations π:[k][k]\pi:[k]\rightarrow[k]. The grey scale represents the averaged error,

over 10 trials. As we can see, with a fixed dimension, the more samples we have the better initialization we obtain. We can also see that to achieve the same initialization error, the sample complexity required is about proportional to the dimension.

We also compare different initialization methods for gradient descent in Fig. 1(c). We fix d=10,k=5,n=10000d=10,k=5,n=10000 and compare three different initialization approaches, (\@slowromancapi@) Let both vv and WW be initialized from tensor methods, and then do gradient descent for WW while vv is fixed; (\@slowromancapii@) Let both vv and WW be initialized from random Gaussian, and then do gradient descent for both WW and vv; (\@slowromancapiii@) Let v=vv=v^{*} and WW be initialized from random Gaussian, and then do gradient descent for WW while vv is fixed. As we can see from Fig 1(c), Approach (\@slowromancapi@) is the fastest and Approach (\@slowromancapii@) doesn’t converge even if more iterations are allowed. Both Approach (\@slowromancapi@) and (\@slowromancapiii@) have linear convergence rate when the objective value is small enough, which verifies our local linear convergence claim.

Conclusion

As shown in Theorem 6.1, the tensor initialization followed by gradient descent will provide a globally converging algorithm with linear time/sample complexity in dimension, logarithmic in precision and polynomial in other factors for smooth homogeneous activation functions. Our distilled properties for activation functions include a wide range of non-linear functions and hopefully provide an intuition to understand the role of non-linear activations played in optimization. Deeper neural networks and convergence for SGD will be considered in the future.

References

Appendix A Notation

We use T\|T\| to denote the operator norm of the tensor TT, i.e.,

For any function ff, we define O~(f)\widetilde{O}(f) to be flogO(1)(f)f\cdot\log^{O(1)}(f). In addition to O()O(\cdot) notation, for two functions f,gf,g, we use the shorthand fgf\lesssim g (resp. \gtrsim) to indicate that fCgf\leq Cg (resp. \geq) for an absolute constant CC. We use fgf\eqsim g to mean cfgCfcf\leq g\leq Cf for constants c,Cc,C.

Appendix B Preliminaries

In this section, we introduce some lemmata and corollaries that will be used in the proofs.

We provide some facts that will be used in the later proofs.

Let zz denote a fixed dd-dimensional vector, then for any C1C\geq 1 and n1n\geq 1, we have

This follows by Proposition 1.1 in [HKZ12]. ∎

This follows by Proposition 1.1 in [HKZ12]. ∎

Part (\@slowromancapii@). We first show how to lower bound σk(W)\sigma_{k}(\overline{W}),

It remains to upper bound σ1(W)\sigma_{1}(\overline{W}),

We show how to simplify VU\|V_{\perp}^{\top}U\|,

Similarly we can simplify UV\|U_{\perp}^{\top}V\|,

For each i[d3]i\in[d_{3}], let bib_{i} denote the ii-th column of BB. We can upper bound CB\|CB\|,

Since all the three components ux|u^{\top}x|, vx|v^{\top}x|, wx|w^{\top}x| are positive and related to a common random vector xx, we can show a lower bound,

B.2 Matrix Bernstein

In many proofs we need to bound the difference between some population matrices/tensors and their empirical versions. Typically, the classic matrix Bernstein inequality requires the norm of the random matrix be bounded almost surely (e.g., Theorem 6.1 in [Tro12]) or the random matrix satisfies subexponential property (Theorem 6.2 in [Tro12]) . However, in our cases, most of the random matrices don’t satisfy these conditions. So we derive the following lemmata that can deal with random matrices that are not bounded almost surely or follow subexponential distribution, but are bounded with high probability.

Then we have for any 0<ϵ<10<\epsilon<1 and t1t\geq 1, if

with probability at least 11/d2tnγ1-1/d^{2t}-n\gamma,

In the next a few paragraphs, we will upper bound the above three terms.

The first term in Eq. (4). Denote ξc\xi^{c} as the complementary set of ξ\xi, thus Pr[ξic]γ\Pr[\xi^{c}_{i}]\leq\gamma. By a union bound over i[n]i\in[n], with probability 1nγ1-n\gamma, Bim\|B_{i}\|\leq m for all i[n]i\in[n]. Thus M^=B^\widehat{M}=\widehat{B}.

The second term in Eq. (4). For a matrix BB sampled from B{\cal B}, we use ξ\xi to denote the event that ξ={Bm}\xi=\{\|B\|\leq m\}. Then, we can upper bound MB\|\overline{M}-\overline{B}\| in the following way,

Since ϵ<1\epsilon<1, we also have MB12B\|\overline{M}-\overline{B}\|\leq\frac{1}{2}\|\overline{B}\| and 32BM12B\frac{3}{2}\|\overline{B}\|\geq\|\overline{M}\|\geq\frac{1}{2}\|\overline{B}\|.

The third term in Eq. (4). We can bound M^M\|\widehat{M}-\overline{M}\| by Matrix Bernstein’s inequality [Tro12].

for t1t\geq 1, we have with probability at least 11/d2t1-1/d^{2t},

Putting it all together, we have for 0<ϵ<10<\epsilon<1, if

with probability at least 11/d2tnγ1-1/d^{2t}-n\gamma,

We show that the four Properties in Lemma B.7 are satisfied. Define function B(x)=h(x)xxB(x)=h(x)xx^{\top}.

(\@slowromancapi@) B(x)=h(x)xx=h(x)x2\|B(x)\|=\|h(x)xx^{\top}\|=|h(x)|\|x\|^{2}.

Applying Lemma B.7, we obtain, for any 0<ϵ<10<\epsilon<1 and t1t\geq 1, if

Appendix C Properties of Activation Functions

We can easily verify that ReLU\mathsf{ReLU} , leaky ReLU\mathsf{ReLU} and squared ReLU\mathsf{ReLU} satisfy Property 3.2 by calculating ρ(σ)\rho(\sigma) in Property 3.2, which is shown in Table 1. Property 3.1 for ReLU\mathsf{ReLU} , leaky ReLU\mathsf{ReLU} and squared ReLU\mathsf{ReLU} can be verified since they are non-decreasing with bounded first derivative. ReLU\mathsf{ReLU} and leaky ReLU\mathsf{ReLU} are piece-wise linear, so they satisfy Property 3.3(b). Squared ReLU\mathsf{ReLU} is smooth so it satisfies Property 3.3(a).

The equality in the first inequality happens when ϕ(σz)\phi^{\prime}(\sigma\cdot z) is a constant a.e.. The equality in the second inequality happens when ϕ(σz)|\phi^{\prime}(\sigma\cdot z)| is a constant a.e., which is invalidated by the non-linearity and smoothness condition. The equality in the third inequality holds only when ϕ(z)=0\phi^{\prime}(z)=0 a.e., which leads to a constant function under non-decreasing condition. Therefore, ρ(σ)>0\rho(\sigma)>0 for any smooth non-decreasing non-linear activations with bounded symmetric first derivatives. The statements about linear activations and quadratic activation follow direct calculations. ∎

Appendix D Local Positive Definiteness of Hessian

The main idea of the proof follows the following inequalities,

The proof sketch is first to bound the range of the eigenvalues of 2fD(W)\nabla^{2}f_{\cal D}(W^{*}) (Lemma D.3) and then bound the spectral norm of the remaining error, 2f^S(W)2fD(W)\|\nabla^{2}\widehat{f}_{S}(W)-\nabla^{2}f_{\cal D}(W^{*})\|. 2f^S(W)2fD(W)\|\nabla^{2}\widehat{f}_{S}(W)-\nabla^{2}f_{\cal D}(W^{*})\| can be further decomposed into two parts, 2f^S(W)H\|\nabla^{2}\widehat{f}_{S}(W)-H\| and H2fD(W)\|H-\nabla^{2}f_{\cal D}(W^{*})\|, where HH is 2fD(W)\nabla^{2}f_{\cal D}(W) if ϕ\phi is smooth, otherwise HH is a specially designed matrix . We can upper bound them when WW is close enough to WW^{*} and there are enough samples. In particular, if the activation satisfies Property 3.3(a), see Lemma D.10 for bounding H2fD(W)\|H-\nabla^{2}f_{\cal D}(W^{*})\| and Lemma D.11 for bounding H2f^S(W)\|H-\nabla^{2}\widehat{f}_{S}(W)\|. If the activation satisfies Property 3.3(b), see Lemma D.15.

Finally we can complete the proof by setting δ=O(vmin2ρ(σ1)/(kvmax2κ2λσ12p))\delta=O(v_{\min}^{2}\rho(\sigma_{1})/(kv_{\max}^{2}\kappa^{2}\lambda\sigma_{1}^{2p})) in Lemma D.11 and Lemma D.15, setting WWvmin2ρ(σk)/(kκ2λvmax2σ1p)\|W-W^{*}\|\lesssim v_{\min}^{2}\rho(\sigma_{k})/(k\kappa^{2}\lambda v_{\max}^{2}\sigma_{1}^{p}) in Lemma D.10 and setting WWvmin4ρ2(σk)σk/(k2κ4λ2vmax4σ14p)\|W-W^{*}\|\leq v_{\min}^{4}\rho^{2}(\sigma_{k})\sigma_{k}/(k^{2}\kappa^{4}{\lambda}^{2}v_{\max}^{4}\sigma_{1}^{4p}) in Lemma D.15. ∎

D.1.2 Local Linear Convergence of Gradient Descent

Although Theorem D.1 gives upper and lower bounds for the spectrum of the Hessian w.h.p., it only holds when the current set of parameters WW are independent of samples. When we use iterative methods, like gradient descent, to optimize this objective, the next iterate calculated from the current set of samples will depend on this set of samples. Therefore, we need to do resampling at each iteration. Here we show that for activations that satisfies Properties 3.1, 3.2 and 3.3(a), linear convergence of gradient descent is guaranteed. To the best of our knowledge, there is no linear convergence guarantees for general non-smooth objective. So the following proposition also applies to smooth objectives only, which excludes ReLU\mathsf{ReLU}.

Let SS denote a set of i.i.d. samples from distribution D{\cal D} (defined in (1)) Let the activation function satisfy Property 3.1,3.2 and 3.3(a). Define

and perform gradient descent with step size 1/M01/M_{0} on f^S(Wc)\widehat{f}_{S}(W^{c}) and obtain the next iterate,

then with probability at least 1dΩ(t)1-d^{-\Omega(t)},

To prove Theorem D.2, we need to show the positive definite properties on the entire line between the current iterate and the optimum by constructing a set of anchor points, which are independent of the samples. Then we apply traditional analysis for the linear convergence of gradient descent.

In particular, given a current iterate WcW^{c}, we set d(p+1)/2d^{(p+1)/2} anchor points {Wa}a=1,2,,d(p+1)/2\{W^{a}\}_{a=1,2,\cdots,d^{(p+1)/2}} equally along the line ξW+(1ξ)Wc\xi W^{*}+(1-\xi)W^{c} for ξ\xi\in.

According to Theorem D.1, by setting tt+(p+1)/2t\leftarrow t+(p+1)/2, we have with probability at least 1d(t+(p+1)/2)1-d^{-(t+(p+1)/2)} for each anchor point {Wa}\{W^{a}\},

Then given an anchor point WaW^{a}, according to Lemma D.16, we have with probability 12d(t+(p+1)/2)1-2d^{-(t+(p+1)/2)}, for any points WW between (Wa1+Wa)/2(W^{a-1}+W^{a})/2 and (Wa+Wa+1)/2(W^{a}+W^{a+1})/2,

Finally by applying union bound over these d(p+1)/2d^{(p+1)/2} small intervals, we have with probability at least 1dt1-d^{-t} for any points WW on the line between WcW^{c} and WW^{*},

Now we can apply traditional analysis for linear convergence of gradient descent.

We can upper bound f^S(Wc)F2\|\nabla\widehat{f}_{S}(W^{c})\|_{F}^{2},

where the third equality holds by setting η=1M0\eta=\frac{1}{M_{0}}. ∎

D.2 Positive Definiteness of Population Hessian at the Ground Truth

The goal of this Section is to prove Lemma D.3.

If ϕ(z)\phi(z) satisfies Property 3.1,3.2 and 3.3, we have the following property for the second derivative of function fD(W)f_{\cal D}(W) at WW^{*},

The proof directly follows Lemma D.6 (Section D.2.2) and Lemma D.7(Section D.2.3). ∎

The main idea is to explicitly calculate the LHS of Eq (8), then reformulate the equation and find a lower bound represented by α0,α1,α2,β0,β2\alpha_{0},\alpha_{1},\alpha_{2},\beta_{0},\beta_{2}.

Further, we can rewrite the diagonal term in the following way,

We can rewrite the off-diagonal term in the following way,

A+B1+B2+B3=C1+C2+C3+C4+C5A+B_{1}+B_{2}+B_{3}=C_{1}+C_{2}+C_{3}+C_{4}+C_{5}.

The key properties we need are, for two vectors a,ba,b, a+b2=a2+2a,b+b2\|a+b\|^{2}=\|a\|^{2}+2\langle a,b\rangle+\|b\|^{2}; for two matrices A,BA,B, A+BF2=AF2+2A,B+BF2\|A+B\|_{F}^{2}=\|A\|_{F}^{2}+2\langle A,B\rangle+\|B\|_{F}^{2}. Then, we have

D.2.2 Lower Bound on the Eigenvalues of Hessian for Non-orthogonal Case

First we show the lower bound of the eigenvalues. The main idea is to reduce the problem to a kk-by-kk problem and then lower bound the eigenvalues using orthogonal weight matrices.

If ϕ(z)\phi(z) satisfies Property 3.1,3.2 and 3.3, we have the following property for the second derivative of function fD(W)f_{\cal D}(W) at WW^{*},

We use Dd{\cal D}_{d} to denote Gaussian distribution N(0,Id){\cal N}(0,I_{d}), Ddk{\cal D}_{d-k} to denote Gaussian distribution N(0,Idk){\cal N}(0,I_{d-k}), and Dk{\cal D}_{k} to denote Gaussian distribution N(0,Ik){\cal N}(0,I_{k}). Then we can rewrite formulation (10) (removing vmin2v_{\min}^{2}) as

We calculate A,B,CA,B,C separately. First, we can show

Third, we have C=0C=0 since UxU_{\perp}^{\top}x is independent of wixw_{i}^{*\top}x and UxU^{\top}x. Thus, putting them all together,

For BB, similar to the proof of Lemma D.4, we have,

Note that 1=a2=b2+c21=\|a\|^{2}=\|b\|^{2}+\|c\|^{2}. Thus, we finish the proof for the lower bound. ∎

D.2.3 Upper Bound on the Eigenvalues of Hessian for Non-orthogonal Case

If ϕ(z)\phi(z) satisfies Property 3.1,3.2 and 3.3, we have the following property for the second derivative of function fD(W)f_{\cal D}(W) at WW^{*},

Similarly, we can calculate the upper bound of the eigenvalues by

where the first inequality follows Hölder’s inequality, the second inequality follows by Property 3.1, the third inequality follows by wiσ1(W)\|w_{i}^{*}\|\leq\sigma_{1}(W^{*}), and the last inequality by Cauchy-Schwarz inequality.

D.3 Error Bound of Hessians near the Ground Truth for Smooth Activations

The goal of this Section is to prove Lemma D.8

Let ϕ(z)\phi(z) satisfy Property 3.1, Property 3.2 and Property 3.3(a). Let WW satisfy WWσk/2\|W-W^{*}\|\leq\sigma_{k}/2. Let SS denote a set of i.i.d. samples from the distribution defined in (1). Then for any t1t\geq 1 and 0<ϵ<1/20<\epsilon<1/2, if

then we have, with probability at least 11/dΩ(t)1-1/d^{\Omega(t)},

This follows by combining Lemma D.10 and Lemma D.11 directly. ∎

The goal of this Section is to prove Lemma D.10.

Note that if WWσk(W)/2\|W-W^{*}\|\leq\sigma_{k}(W^{*})/2, we have σk(W)/2σi(W)32σ1(W)\sigma_{k}(W^{*})/2\leq\sigma_{i}(W)\leq\frac{3}{2}\sigma_{1}(W^{*}) for all i[k]i\in[k] by Weyl’s inequality. By definition of singular value, we have σk(W)wiσ1(W)\sigma_{k}(W^{*})\leq\|w_{i}^{*}\|\leq\sigma_{1}(W^{*}). By definition of spectral norm, we have wiwiWW\|w_{i}-w_{i}^{*}\|\leq\|W-W^{*}\|. Thus, we can lower bound wi\|w_{i}\|,

Similarly, we have wi12wi\|w_{i}\|\geq\frac{1}{2}\|w_{i}^{*}\|. ∎

If ϕ(z)\phi(z) satisfies Property 3.1, Property 3.2 and Property 3.3(a), then for any WW with WWσk/2\|W-W^{*}\|\leq\sigma_{k}/2, we have

In the next a few paragraphs, we first show how to bound the off-diagonal term, and then show how to bound the diagonal term.

where the first step follows by definition of Δi,l\Delta_{i,l}, the second step follows by definition of spectral norm and vivlvmax2v_{i}^{*}v_{l}^{*}\leq v_{\max}^{2}, the third step follows by triangle inequality, the fourth step follows by linearity of expectation, the fifth step follows by Property 3.3(a), i.e., ϕ(wix)ϕ(wix)L2(wiwi)x|\phi^{\prime}(w_{i}^{\top}x)-\phi^{\prime}(w_{i}^{*\top}x)|\leq L_{2}|(w_{i}-w_{i}^{*})^{\top}x|, the sixth step follows by Property 3.1, i.e., ϕ(z)L1zp\phi^{\prime}(z)\leq L_{1}|z|^{p}, the seventh step follows by Fact B.6, and the last step follows by a2=1\|a\|^{2}=1, wiwiWW\|w_{i}-w_{i}^{*}\|\leq\|W-W^{*}\|, wi32wi\|w_{i}\|\leq\frac{3}{2}\|w_{i}^{*}\|, and wiσ1(W)\|w_{i}^{*}\|\leq\sigma_{1}(W^{*}).

Note that the proof for the off-diagonal terms also applies to bounding the second-term in the diagonal block Eq. (D.3.1). Thus we only need to show how to bound the first term in the diagonal block Eq. (D.3.1).

where the first step follows by y=r=1kvrϕ(wrx)y=\sum_{r=1}^{k}v_{r}^{*}\phi(w_{r}^{*\top}x), the second step follows by definition of spectral norm and vrvivmax2v_{r}^{*}v_{i}^{*}\leq|v_{\max}|^{2}, the third step follows by Property 3.3(a), i.e., ϕ(wix)L2|\phi^{\prime\prime}(w_{i}^{\top}x)|\leq L_{2}, the fourth step follows by ϕ(wrx)ϕ(wrx)maxz[wrx,wrx]ϕ(z)(wrwr)x|\phi(w_{r}^{\top}x)-\phi(w_{r}^{*\top}x)\leq\max_{z\in[w_{r}^{\top}x,w_{r}^{*\top}x]}|\phi^{\prime}(z)||(w_{r}-w_{r}^{*})^{\top}x|, the fifth step follows Property 3.1, i.e., ϕ(z)L1zp|\phi^{\prime}(z)|\leq L_{1}|z|^{p}, the sixth step follows by maxz[wrx,wrx]zp(wrxp+wrxp)\max_{z\in[w_{r}^{\top}x,w_{r}^{*\top}x]}|z|^{p}\leq(|w_{r}^{\top}x|^{p}+|w_{r}^{*\top}x|^{p}), the seventh step follows by Fact B.6.

Putting it all together, we can bound the error by

D.3.2 Empirical and Population Difference for Smooth Activations

Note that Bernstein inequality requires the spectral norm of each random matrix to be bounded almost surely. However, since we assume Gaussian distribution for xx, x2\|x\|^{2} is not bounded almost surely. The main idea is to do truncation and then use Matrix Bernstein inequality. Details can be found in Lemma B.7 and Corollary B.8.

Let ϕ(z)\phi(z) satisfy Property 3.1,3.2 and 3.3(a). Let WW satisfy WWσk/2\|W-W^{*}\|\leq\sigma_{k}/2. Let SS denote a set of i.i.d. samples from distribution D{\cal D} (defined in (1)). Then for any t1t\geq 1 and 0<ϵ<1/20<\epsilon<1/2, if

then we have, with probability at least 1dΩ(t)1-d^{-\Omega(t)},

Define Δ=2fD(W)2f^S(W)\Delta=\nabla^{2}f_{\cal D}(W)-\nabla^{2}\widehat{f}_{S}(W). Let’s first consider the diagonal blocks. Define

Let’s further decompose Δi,i\Delta_{i,i} into Δi,i=Δi,i(1)+Δi,i(2)\Delta_{i,i}=\Delta_{i,i}^{(1)}+\Delta_{i,i}^{(2)}, where

Combining Claims. D.12, D.13 and D.14, and taking a union bound over k2k^{2} different Δi,j\Delta_{i,j}, we obtain if nϵ2κ2(W)τdpoly(t,logd)n\geq\epsilon^{-2}\kappa^{2}(W^{*})\tau d\cdot\operatorname{poly}(t,\log d) for any ϵ(0,1/2)\epsilon\in(0,1/2), with probability at least 11/dt1-1/d^{t},

For each i[k]i\in[k], if ndpoly(logd,t)n\geq d\operatorname{poly}(\log d,t)

According to Properties 3.1,3.2 and 3.3(a), we have for each xSx\in S,

(\@slowromancapi@) Bounding hr(x)|h_{r}(x)|.

According to Fact B.1, we have for any constant t1t\geq 1, with probability 11/(nd4t)1-1/(nd^{4t}),

(\@slowromancapii@) Bounding Br\|\overline{B}_{r}\|.

where the first step follows by definition of spectral norm, and last step follows by Fact B.6. Using Fact B.6, we can also prove an upper bound Br\|\overline{B}_{r}\|, BrL1L2wrpwrwr\|\overline{B}_{r}\|\lesssim L_{1}L_{2}\|w_{r}\|^{p}\|w_{r}-w_{r}^{*}\|.

By applying Corollary B.8, if nϵ2dpoly(logd,t)n\geq\epsilon^{-2}d\operatorname{poly}(\log d,t), then with probability 11/d4t1-1/d^{4t},

For each i[k]i\in[k], if nϵ2dτpoly(logd,t)n\geq\epsilon^{-2}d\tau\operatorname{poly}(\log d,t) , then

Recall the definition of Δi,i(2)\Delta_{i,i}^{(2)}.

(\@slowromancapi@) Bounding hi(x)|h_{i}(x)|.

For any constant t1t\geq 1, (ϕ(wix))2L12wix2pL12wi2ptplogp(n)(\phi^{\prime}(w_{i}^{\top}x))^{2}\leq L_{1}^{2}|w_{i}^{\top}x|^{2p}\lesssim L_{1}^{2}\|w_{i}\|^{2p}t^{p}\log^{p}(n) with probability 11/(nd4t)1-1/(nd^{4t}) according to Fact B.1.

(\@slowromancapii@) Bounding Bi\|\overline{B}_{i}\|.

where wi=wi/wi\overline{w}_{i}=w_{i}/\|w_{i}\| and vv is a unit vector orthogonal to wiw_{i} such that a=αwi+βva=\alpha\overline{w}_{i}+\beta v. Now from Property 3.2, we have

By applying Corollary B.8, we have, for any 0<ϵ<10<\epsilon<1, if nϵ2dwi4pρ2(wi)poly(logd,t)n\geq\epsilon^{-2}d\frac{\|w_{i}\|^{4p}}{\rho^{2}(\|w_{i}\|)}\operatorname{poly}(\log d,t) the following bound holds

Therefore, if nϵ2dτpoly(logd,t)n\geq\epsilon^{-2}d\tau\operatorname{poly}(\log d,t), where τ=(3σ1/2)4pminσ[σk/2,3σ1/2]ρ2(σ)\tau=\frac{(3\sigma_{1}/2)^{4p}}{\min_{\sigma\in[\sigma_{k}/2,3\sigma_{1}/2]}\rho^{2}(\sigma)}, we have with probability 11/d4t1-1/d^{4t}

For each i[k],j[k],iji\in[k],j\in[k],i\neq j, if nϵ2κ2τdpoly(logd,t)n\geq\epsilon^{-2}\kappa^{2}\tau d\operatorname{poly}(\log d,t), then

Recall the definition of off-diagonal blocks Δi,l\Delta_{i,l},

(\@slowromancapi@) Bounding hi,l(x)|h_{i,l}(x)|.

For any constant t1t\geq 1, we have with probability 11/(nd4t)1-1/(nd^{4t})

where the third step follows by Fact B.1.

(\@slowromancapii@) Bounding Bi,l\|\overline{B}_{i,l}\|.

For simplicity, we just set b=1\|b\|=1 and c=0\|c\|=0 to lower bound,

where the second step is from Property 3.2 and the fact that σk/2σ2(V)σ1(V)3σ1/2\sigma_{k}/2\leq\sigma_{2}(V)\leq\sigma_{1}(V)\leq 3\sigma_{1}/2.

where the first step follows by Property 3.1, the second step follows by Fact B.6, and the last step follows by v1σ1\|v_{1}\|\leq\sigma_{1}, v2σ1\|v_{2}\|\leq\sigma_{1} and b1\|b\|\leq 1. Similarly, we can upper bound,

Therefore, applying Corollary B.8, we have, if nϵ2κ2(W)τdpoly(logd,t)n\geq\epsilon^{-2}\kappa^{2}(W^{*})\tau d\operatorname{poly}(\log d,t), then

holds with probability at least 11/d4t1-1/d^{4t}. ∎

D.4 Error Bound of Hessians near the Ground Truth for Non-smooth Activations

The goal of this Section is to prove Lemma D.15,

Let ϕ(z)\phi(z) satisfy Property 3.1,3.2 and 3.3(b). Let WW satisfy WWσk/2\|W-W^{*}\|\leq\sigma_{k}/2. Let SS denote a set of i.i.d. samples from the distribution defined in (1). Then for any t1t\geq 1 and 0<ϵ<1/20<\epsilon<1/2, if

with probability at least 1dΩ(t)1-d^{-\Omega(t)},

As we noted previously, when Property 3.3(b) holds, the diagonal blocks of the empirical Hessian can be written as, with probability 11,

Note that H2fD(W)H\neq\nabla^{2}f_{\operatorname*{{\mathcal{D}}}}(W). However we can still bound H2f^S(W)\|H-\nabla^{2}\widehat{f}_{S}(W)\| and H2fD(W)\|H-\nabla^{2}f_{\operatorname*{{\mathcal{D}}}}(W^{*})\| when we have enough samples and WW\|W-W^{*}\| is small enough. The proof for H2f^S(W)\|H-\nabla^{2}\widehat{f}_{S}(W)\| basically follows the proof of Lemma D.11 as Δii(2)\Delta_{ii}^{(2)} in Eq. (14) and Δil\Delta_{il} in Eq. (16) forms the blocks of H2f^S(W)H-\nabla^{2}\widehat{f}_{S}(W) and we can bound them without smoothness of ϕ()\phi(\cdot).

Now we focus on H2fD(W)H-\nabla^{2}f_{\operatorname*{{\mathcal{D}}}}(W^{*}). We again consider each block.

We used the boundedness of ϕ(z)\phi^{\prime\prime}(z) to prove Lemma D.10. Here we can’t use this condition. Without smoothness, we will stop at the following position.

where the first step follows by definition of spectral norm, the second step follows by triangle inequality, and the last step follows by linearity of expectation.

where the first step follows by a=Ub+Uca=Ub+U_{\bot}c, the last step follows by (a+b)22a2+2b2(a+b)^{2}\leq 2a^{2}+2b^{2}.

By Property 3.3(b), we have ee exceptional points which have ϕ(z)0\phi^{\prime\prime}(z)\neq 0. Let these ee points be p1,p2,,pep_{1},p_{2},\cdots,p_{e}. Note that if vizv_{i}^{\top}z and vizv_{i}^{*\top}z are not separated by any of these exceptional points, i.e., there exists no j[e]j\in[e] such that vizpjvizv_{i}^{\top}z\leq p_{j}\leq v_{i}^{*\top}z or vizpjvizv_{i}^{*\top}z\leq p_{j}\leq v_{i}^{\top}z, then we have ϕ(viz)=ϕ(viz)\phi^{\prime}(v_{i}^{\top}z)=\phi^{\prime}(v_{i}^{*\top}z) since ϕ(s)\phi^{\prime\prime}(s) are zeros except for {pj}j=1,2,,e\{p_{j}\}_{j=1,2,\cdots,e}. So we consider the probability that viz,vizv_{i}^{\top}z,v_{i}^{*\top}z are separated by any exception point. We use ξj\xi_{j} to denote the event that viz,vizv_{i}^{\top}z,v_{i}^{*\top}z are separated by an exceptional point pjp_{j}. By union bound, 1j=1ePr[ξj]1-\sum_{j=1}^{e}\Pr[\xi_{j}] is the probability that viz,vizv_{i}^{\top}z,v_{i}^{*\top}z are not separated by any exceptional point. The first term of Equation (D.4) can be bounded as,

where the first step follows by if viz,vizv_{i}^{\top}z,v_{i}^{*\top}z are not separated by any exceptional point then ϕ(viz)=ϕ(viz)\phi^{\prime}(v_{i}^{\top}z)=\phi^{\prime}(v_{i}^{*\top}z) and the last step follows by Hölder’s inequality and Property 3.1.

It remains to upper bound PrzD3[ξj]\Pr_{z\sim\operatorname*{{\mathcal{D}}}_{3}}[\xi_{j}]. First note that if viz,vizv_{i}^{\top}z,v_{i}^{*\top}z are separated by an exceptional point, pjp_{j}, then vizpjvizvizviviz|v_{i}^{*\top}z-p_{j}|\leq|v_{i}^{\top}z-v_{i}^{*\top}z|\leq\|v_{i}-v_{i}^{*}\|\|z\|. Therefore,

Note that (vizzvi+1)/2(\frac{v_{i}^{*\top}z}{\|z\|\|v_{i}^{*}\|}+1)/2 follows Beta(1,1) distribution which is uniform distribution on $$.

where the first step is because we can view vizz\frac{v_{i}^{*\top}z}{\|z\|} and pjz\frac{p_{j}}{\|z\|} as two independent random variables: the former is about the direction of zz and the later is related to the magnitude of zz. Thus, we have

Finally combining Eq. (17), Eq. (19) and Eq. (20), we have

D.5 Positive Definiteness for a Small Region

Here we introduce a lemma which shows that the Hessian of any WW, which may be dependent on the samples but is very close to an anchor point, is close to the Hessian of this anchor point.

with probability at least 1dt1-d^{-t}, for any WW (which is not necessarily to be independent of samples) satisfying WaWσk/4\|W^{a}-W\|\leq\sigma_{k}/4, we have

We further decompose Δi,i\Delta_{i,i} into Δi,i=Δi,i(1)+Δi,i(2)\Delta_{i,i}=\Delta_{i,i}^{(1)}+\Delta_{i,i}^{(2)}, where

We can further decompose Δi,i(1)\Delta_{i,i}^{(1)} into Δi,i(1,1)\Delta_{i,i}^{(1,1)} and Δi,i(1,2)\Delta_{i,i}^{(1,2)},

Combining Claim D.17 and Claim D.18 , we have if

Therefore, combining Eq. (22), Claim D.19 and Claim D.20, we complete the proof. ∎

For each i[k]i\in[k], if ndpoly(logd,t)n\geq d\operatorname{poly}(\log d,t), then

Define function h1(x)=xp+1h_{1}(x)=\|x\|^{p+1} and h2(x)=wqxp(wqwqa)xh_{2}(x)=|w_{q}^{*\top}x|^{p}|(w_{q}^{*}-w_{q}^{a})^{\top}x|. Note that h1h_{1} and h2h_{2} don’t contain WW which maybe depend on the samples. Therefore, we can use the modified matrix Bernstein inequality (Corollary B.8) to bound Δi,i(1)\Delta_{i,i}^{(1)}.

(\@slowromancapi@) Bounding h1(x)|h_{1}(x)|.

By Fact B.2, we have h1(x)(tdlogn)(p+1)/2h_{1}(x)\lesssim(td\log n)^{(p+1)/2} with probability at least 11/(nd4t)1-1/(nd^{4t}).

Therefore we have with probability at least 11/dt1-1/d^{t},

For each i[k]i\in[k], if ndpoly(logd,t)n\geq d\operatorname{poly}(\log d,t), then

holds with probability at least 11/d4t1-1/d^{4t}.

Recall the definition of Δi,i(1,2)\Delta_{i,i}^{(1,2)},

In order to upper bound the Δi,i(1,2)\|\Delta_{i,i}^{(1,2)}\|, it suffices to upper bound the spectral norm of this quantity,

By Property 3.3, we have ϕ(wix)ϕ(wiax)2L2|\phi^{\prime\prime}(w_{i}^{\top}x)-\phi^{\prime\prime}(w_{i}^{a\top}x)|\leq 2L_{2}.

For the second part wqxp(wqwqa)xxx|w_{q}^{*\top}x|^{p}|(w_{q}^{*}-w_{q}^{a})^{\top}x|xx^{\top}, according to Eq. (D.3.2), we have with probability 1dt1-d^{-t}, if ndpoly(logd,t)n\geq d\operatorname{poly}(\log d,t),

For each i[k]i\in[k], if ndpoly(logd,t)n\geq d\operatorname{poly}(\log d,t), then

Applying Corollary B.8 finishes the proof.

For each i[k],j[k],iji\in[k],j\in[k],i\neq j, if ndpoly(logd,t)n\geq d\operatorname{poly}(\log d,t), then

Applying Corollary B.8 completes the proof. ∎

Appendix E Tensor Methods

We describe the details of each procedure in Algorithm 1 in this section.

a) Compute the subspace estimation from P^2\widehat{P}_{2} (Algorithm 3). Note that the eigenvalues of P2P_{2} and P^2\widehat{P}_{2} are not necessarily nonnegative. However, only kk of the eigenvalues will have large magnitude. So we can first compute the top-k eigenvectors/eigenvalues of both CI+P^2C\cdot I+\widehat{P}_{2} and CIP^2C\cdot I-\widehat{P}_{2}, where CC is large enough such that C2P2C\geq 2\|P_{2}\|. Then from the 2k2k eigen-pairs, we pick the top-kk eigenvectors with the largest eigenvalues in magnitude, which is executed in TopK in Algorithm 3. For the outputs of TopK, k1,k2k_{1},k_{2} are the numbers of picked largest eigenvalues from CI+P^2C\cdot I+\widehat{P}_{2} and CIP^2C\cdot I-\widehat{P}_{2} respectively and π1(i)\pi_{1}(i) returns the original index of ii-th largest eigenvalue from CI+P^2C\cdot I+\widehat{P}_{2} and similarly π2\pi_{2} is for CIP^2C\cdot I-\widehat{P}_{2}. Finally orthogonalizing the picked eigenvectors leads to an estimation of the subspace spanned by {w1  w2    wk}\{w_{1}^{*}\;w_{2}^{*}\;\cdots\;w_{k}^{*}\}. Also note that forming P^2\widehat{P}_{2} takes O(nd2)O(n\cdot d^{2}) time and each step of the power method doing a multiplication between a d×dd\times d matrix and a d×kd\times k matrix takes kd2k\cdot d^{2} time by a naive implementation. Here we reduce this complexity from O((k+n)d2)O((k+n)d^{2}) to O(knd)O(knd). The idea is to compute each step of the power method without explicitly forming P^2\widehat{P}_{2}. We take P2=M2P_{2}=M_{2} as an example; other cases are similar. In Algorithm 3, let the step P^2V\widehat{P}_{2}V be calculated as P^2V=1S(x,y)Sy(x(xV)V)\widehat{P}_{2}V=\frac{1}{|S|}\sum_{(x,y)\in S}y(x(x^{\top}V)-V). Now each iteration only needs O(knd)O(knd) time. Furthermore, the number of iterations required will be a small number, since the power method has a linear convergence rate and as an initialization method we don’t need a very accurate solution. The detailed algorithm is shown in Algorithm 3. The approximation error bound of P^2\widehat{P}_{2} to P2P_{2} is provided in Lemma E.5. Lemma E.6 provides the theoretical bound for Algorithm 3.

b) Form and decompose the 3rd-moment R^3\widehat{R}_{3} (Algorithm 1 in [KCL15]). We apply the non-orthogonal tensor factorization algorithm, Algorithm 1 in [KCL15], to decompose R^3\widehat{R}_{3}. According to Theorem 3 in [KCL15], when R^3\widehat{R}_{3} is close enough to R3R_{3}, the output of the algorithm, u^i\widehat{u}_{i} will close enough to siVwis_{i}V^{\top}\overline{w}_{i}^{*}, where sis_{i} is an unknown sign. Lemma E.10 provides the error bound for R^3R3\|\widehat{R}_{3}-R_{3}\|.

c) Recover the magnitude of wiw_{i}^{*} and the signs si,vis_{i},v_{i}^{*} (Algorithm 4). For Algorithm 4, we only consider homogeneous functions. Hence we can assume vi{1,1}v_{i}^{*}\in\{-1,1\} and there exist some universal constants cjc_{j} such that mj,i=cjwip+1m_{j,i}=c_{j}\|w_{i}^{*}\|^{p+1} for j=1,2,3,4j=1,2,3,4, where p+1p+1 is the degree of homogeneity. Note that different activations have different situations even under Assumption 5.3. In particular, if M4=M2=0M_{4}=M_{2}=0, ϕ()\phi(\cdot) is an odd function and we only need to know sivis_{i}v_{i}^{*}. If M3=M1=0M_{3}=M_{1}=0, ϕ()\phi(\cdot) is an even function, so we don’t care about what sis_{i} is.

Let’s describe the details for Algorithm 4. First define two quantities Q1Q_{1} and Q2Q_{2},

where l11l_{1}\geq 1 such that Ml10M_{l_{1}}\neq 0 and l22l_{2}\geq 2 such that Ml20M_{l_{2}}\neq 0. There are possibly multiple choices for l1l_{1} and l2l_{2}. We will discuss later on how to choose them. Now we solve two linear systems.

The solutions of the above linear systems are

We can approximate siwis_{i}\overline{w}_{i}^{*} by Vu^iV\widehat{u}_{i} and approximate Q1Q_{1} and Q2Q_{2} by their empirical versions Q^1\widehat{Q}_{1} and Q^2\widehat{Q}_{2} respectively. Hence, in practice, we solve

In Lemma E.13 and Lemma E.14, we provide robustness of the above two linear systems, i.e., the solution errors, z^z\|\widehat{z}-z^{*}\| and r^r\|\widehat{r}-r^{*}\|, are bounded under small perturbations of wi,Q1\overline{w}_{i}^{*},Q_{1} and Q2Q_{2}. Recall that the final goal is to approximate wi\|w_{i}^{*}\| and the signs vi,siv_{i}^{*},s_{i}. Now we can approximate wi\|w_{i}^{*}\| by (z^i/(cl1(αVu^i)l11))1/(p+1)(|\widehat{z}_{i}/(c_{l_{1}}(\alpha^{\top}V\widehat{u}_{i})^{{l_{1}-1}})|)^{1/(p+1)}. To recover vi,siv_{i}^{*},s_{i}, we need to note that if l1l_{1} and l2l_{2} are both odd or both even, we can’t recover both viv_{i}^{*} and sis_{i}. So we consider the following situations,

The 1st situation corresponds to part 3 of Assumption 5.3,where sis_{i} doesn’t matter, and the 2nd situation corresponds to part 4 of Assumption 5.3, where only sivis_{i}v_{i}^{*} matters. So we recover wi\|w_{i}^{*}\| to some precision and vi,siv_{i}^{*},s_{i} exactly provided enough samples. The recovery of wiw_{i}^{*} and viv_{i}^{*} then follows.

Time complexity: In Part a), by using a specially designed power method, we only need O(knd)O(knd) time to compute the subspace estimation VV. Part b) needs O(knd)O(knd) to form R^3\widehat{R}_{3} and the tensor factorization needs O(k3)O(k^{3}) time. Part c) requires calculation of d×kd\times k and k2×kk^{2}\times k linear systems in Eq. (28), which takes at most O(knd)O(knd) running time. Hence, the total time complexity is O(knd)O(knd).

E.2 Main Result for Tensor Methods

The goal of this Section is to prove Theorem 5.6.

The success of Algorithm 1 depends on two approximations. The first is the estimation of the normalized wiw_{i}^{*} up to some unknown sign flip, i.e., the error wisiVu^i\|\overline{w}_{i}^{*}-s_{i}V\widehat{u}_{i}\| for some si{1,1}s_{i}\in\{-1,1\}. The second is the estimation of the magnitude of wiw_{i}^{*} and the signs vi,siv_{i}^{*},s_{i} which is conducted in Algorithm 4.

where the first step follows by triangle inequality, the second step follows by VV=IV^{\top}V=I.

We can upper bound VVwiwi\|VV^{\top}\overline{w}_{i}^{*}-\overline{w}_{i}^{*}\|,

where the first step follows by Lemma E.6, the second step follows by σk(P2)1/poly(k,κ)\sigma_{k}(P_{2})\geq 1/\operatorname{poly}(k,\kappa), and the last step follows by P^2P2ϵpoly(k,κ)\|\widehat{P}_{2}-P_{2}\|\leq\epsilon\operatorname{poly}(k,\kappa) if the number of samples is proportional to O~(d/ϵ2)\widetilde{O}(d/\epsilon^{2}) as shown in Lemma E.5.

We can upper bound Vwisiu^i\|V^{\top}\overline{w}_{i}^{*}-s_{i}\widehat{u}_{i}\|,

where the first step follows by Theorem 3 in [KCL15], and the last step follows by R^3R3ϵpoly(k,κ)\|\widehat{R}_{3}-R_{3}\|\leq\epsilon\operatorname{poly}(k,\kappa) if the number of samples is proportional to O~(k2/ϵ2)\widetilde{O}(k^{2}/\epsilon^{2}) as shown in Lemma E.10.

Combining Eq. (E.2), (E.2) and (31) together,

For the second one, we can bound the error of the estimation of moments, Q1Q_{1} and Q2Q_{2}, using number of samples proportional to O~(d)\widetilde{O}(d) by Lemma E.12 and Lemma E.5 respectively. The error of the solutions of the linear systems Eq.(28) can be bounded by Q1Q^1,Q2Q^2,u^iVwi\|Q_{1}-\widehat{Q}_{1}\|,\|Q_{2}-\widehat{Q}_{2}\|,\|\widehat{u}_{i}-V^{\top}\overline{w}_{i}^{*}\| and (IVV)wi\|(I-VV^{\top})\overline{w}_{i}^{*}\| according to Lemma E.13 and Lemma E.14. Then we can bound the error of the output of Algorithm 4. Furthermore, since viv_{i}^{*}’s are discrete values, they can be exactly recovered. All the sample complexities mentioned in the above lemmata are linear in dimension and polynomial in other factors to achieve a constant error. So accumulating all these errors we complete our proof.

The proofs of these lemmata for Theorem 1 can be found in the following sections. Note that these lemmata also hold for any activations satisfying Property 3.1 and Assumption 5.3. However, since we are unclear how to implement the last step of Algorithm 1 (Algorithm 4) for general non-homogeneous activations, we restrict our theorem to homogeneous activations only.

E.3 Error Bound for the Subspace Spanned by the Weight Matrix

Let M2M_{2} be defined as in Definition 5.1. Let M^2\widehat{M}_{2} be the empirical version of M2M_{2}, i.e.,

where SS denote a set of samples from Distribution D\operatorname*{{\mathcal{D}}} defined in Eq. (1). Assume M20M_{2}\neq 0, i.e., mi(2)0m_{i}^{(2)}\neq 0 for any ii. Then for any 0<ϵ<1,t10<\epsilon<1,t\geq 1, if

(\@slowromancapi@) Bounding Bi(x)\|B_{i}(x)\|.

where the last step follows by Fact B.1 and Fact B.2.

Note that Bi(x)B_{i}(x) is a symmetric matrix, thus it suffices to only bound one of them.

Note that Bi(x)B_{i}(x) is a symmetric matrix, thus it suffices to consider the case where a=ba=b.

Define L=wip+1+ϕ(0)L=\|w_{i}^{*}\|^{p+1}+|\phi(0)|. Then we have for any 0<ϵ<10<\epsilon<1, if

Let M3M_{3} be defined as in Definition 5.1. Let M^3\widehat{M}_{3} be the empirical version of M3M_{3}, i.e.,

where SS denote a set of samples (each sample is i.i.d. sampled from Distribution D\operatorname*{{\mathcal{D}}} defined in Eq. (1)). Assume M30M_{3}\neq 0, i.e., m3,i0m_{3,i}\neq 0 for any ii. Let α\alpha be a fixed unit vector. Then for any 0<ϵ<1,t10<\epsilon<1,t\geq 1, if

Since y=i=1kviϕ(wix)y=\sum_{i=1}^{k}v_{i}^{*}\phi(w_{i}^{*\top}x). We consider each component i[k]i\in[k].

Define g(z)=ϕ(z)ϕ(0)g(z)=\phi(z)-\phi(0), then g(z)=0zϕ(s)dsL1p+1zp+1zp+1|g(z)|=|\int_{0}^{z}\phi^{\prime}(s)ds|\leq\frac{L_{1}}{p+1}|z|^{p+1}\lesssim|z|^{p+1}, which follows Property 3.1. In order to apply Lemma B.7, we need to bound the following four quantities,

(\@slowromancapi@) Bounding Bi(x)\|B_{i}(x)\|.

where the third step follows by definition of g(z)g(z), and last step follows by definition of spectral norm and triangle inequality.

Using Fact B.1 and Fact B.2, we have for any constant t1t\geq 1, with probability 11/(nd4t)1-1/(nd^{4t}),

Because matrix Bi(x)B_{i}(x) is symmetric, thus it suffices to bound one of them,

Define L=wip+1+ϕ(0)L=\|w_{i}^{*}\|^{p+1}+|\phi(0)|. Then we have for any 0<ϵ<10<\epsilon<1, if

Let M4M_{4} be defined as in Definition 5.1. Let M^4\widehat{M}_{4} be the empirical version of M4M_{4}, i.e.,

where SS denote a set of samples (where each sample is i.i.d. sampled are sampled from Distribution D\operatorname*{{\mathcal{D}}} defined in Eq. (1)). Assume M40M_{4}\neq 0, i.e., m4,i0m_{4,i}\neq 0 for any ii. Let α\alpha be a fixed unit vector. Then for any 0<ϵ<1,t10<\epsilon<1,t\geq 1, if

Since y=i=1kviϕ(wix)y=\sum_{i=1}^{k}v_{i}^{*}\phi(w_{i}^{*\top}x). We consider each component i[k]i\in[k].

Define g(z)=ϕ(z)ϕ(0)g(z)=\phi(z)-\phi(0), then g(z)=0zϕ(s)dsL1/(p+1)zp+1zp+1|g(z)|=|\int_{0}^{z}\phi^{\prime}(s)ds|\leq L_{1}/(p+1)|z|^{p+1}\lesssim|z|^{p+1}, which follows Property 3.1.

(\@slowromancapi@) Bounding Bi(x)\|B_{i}(x)\|.

Using Fact B.1 and Fact B.2, we have for any constant t1t\geq 1, with probability 11/(nd4t)1-1/(nd^{4t}),

Define L=wip+1+ϕ(0)L=\|w_{i}^{*}\|^{p+1}+|\phi(0)|. Then we have for any 0<ϵ<10<\epsilon<1, if

E.3.2 Error Bound for the Second-order Moment

The goal of this section is to prove Lemma E.5, which shows we can approximate the second order moments up to some precision by using linear sample complexity in dd.

then with probability at least 1dΩ(t)1-d^{-\Omega(t)},

We have shown the bound for j2=2,3,4j_{2}=2,3,4 in Lemma E.2, Lemma E.3 and Lemma E.4 respectively. To summarize, for any 0<ϵ<10<\epsilon<1 we have if

E.3.3 Subspace Estimation Using Power Method

Lemma E.6 shows a small number of power iterations can estimate the subspace of {wi}i[k]\{w_{i}^{*}\}_{i\in[k]} to some precision, which provides guarantees for Algorithm 3.

According to Weyl’s inequality, we are able to pick up the correct numbers of positive eigenvalues and negative eigenvalues in Algorithm 3 as long as P^2\widehat{P}_{2} and P2P_{2} are close enough.

According to Lemma 9 in [HK13], we have (IU1U1)V1P^2P2/σk(P2)\|(I-U_{1}U_{1}^{\top})\overline{V}_{1}\|\lesssim\|\widehat{P}_{2}-P_{2}\|/\sigma_{k}(P_{2}),(IU2U2)V2P^2P2/σk(P2)\|(I-U_{2}U_{2}^{\top})\overline{V}_{2}\|\lesssim\|\widehat{P}_{2}-P_{2}\|/\sigma_{k}(P_{2}) . Using Fact B.4, we have (IUU)V=UUVV\|(I-UU^{\top})\overline{V}\|=\|UU^{\top}-\overline{V}\overline{V}^{\top}\|.

According to Theorem 7.2 in [AKZ12], we have V1V1V1V1ϵ\|\overline{V}_{1}\overline{V}_{1}^{\top}-V_{1}V_{1}^{\top}\|\leq\epsilon and V2V2V2V2ϵ\|\overline{V}_{2}\overline{V}_{2}^{\top}-V_{2}V_{2}^{\top}\|\leq\epsilon.

Let UU_{\perp} be the complementary matrix of UU. Then we have,

where the first step follows by definition of spectral norm, the second step follows by I=U1U1+U2U2+UUI=U_{1}U_{1}^{\top}+U_{2}U_{2}^{\top}+U_{\perp}^{\top}U_{\perp}^{\top}, the third step follows by U2U=0U_{2}^{\top}U_{\perp}=0, and last step follows by definition of spectral norm.

We can upper bound (IUU)V\|(I-UU^{\top})\overline{V}\|,

where the first step follows by triangle inequality, the second step follows by Eq. (E.3.3), and the last step follows by Lemma 9 in [HK13].

We define matrix RR such that V2R=(IV1V1)V2\overline{V}_{2}R=(I-V_{1}V_{1}^{\top})V_{2} is the QR decomposition of (IV1V1)V2(I-V_{1}V_{1}^{\top})V_{2}, then we have

where the first step follows by V2=(IV1V1)V2R1\overline{V}_{2}=(I-V_{1}V_{1}^{\top})V_{2}R^{-1}, and the last step follows by triangle inequality.

where the first step follows by definition of α,β\alpha,\beta, the second step follows by triangle inequality, the third step follows by ABAB\|AB\|\leq\|A\|\|B\|, the fourth step follows by (IV2V2)=1\|(I-\overline{V}_{2}\overline{V}_{2}^{\top})\|=1, the fifth step follows by Eq. (E.3.3), the sixth step follows by Fact B.4, the seventh step follows by V1V1V1V1ϵ\|\overline{V}_{1}\overline{V}_{1}^{\top}-V_{1}V_{1}^{\top}\|\leq\epsilon and V2V2V2V2ϵ\|\overline{V}_{2}\overline{V}_{2}^{\top}-V_{2}V_{2}^{\top}\|\leq\epsilon, and the last step follows by R12\|R^{-1}\|\leq 2 (Claim E.7).

where the first step follows by triangle inequality, the second step follows by Fact B.4, the third step follows by Eq. (E.3.3), the fourth step follows by triangle inequality, and the last step follows by V1V1V1V1ϵ\|\overline{V}_{1}\overline{V}_{1}^{\top}-V_{1}V_{1}^{\top}\|\leq\epsilon and V2V2V2V2ϵ\|\overline{V}_{2}\overline{V}_{2}^{\top}-V_{2}V_{2}^{\top}\|\leq\epsilon.

First, we can rewrite RRR^{\top}R in the follow way,

Second, we can upper bound V2V1\|V_{2}^{\top}V_{1}\| by 1/41/4,

where the first step follows by V2V2=IV_{2}^{\top}V_{2}=I, the second step follows by triangle inequality, the third step follows by triangle inequality, the fourth step follows by V2V1V1=0\|\overline{V}_{2}^{\top}\overline{V}_{1}\overline{V}_{1}^{\top}\|=0, the fifth step follows by ABAB\|AB\|\leq\|A\|\cdot\|B\|, and the last step follows by V1V1V1V1ϵ\|\overline{V}_{1}\overline{V}_{1}^{\top}-V_{1}V_{1}^{\top}\|\leq\epsilon, V1=1\|V_{1}\|=1, V2V2V2V2ϵ\|\overline{V}_{2}\overline{V}_{2}^{\top}-V_{2}V_{2}^{\top}\|\leq\epsilon and V2=1\|\overline{V}_{2}^{\top}\|=1, and the last step follows by ϵ<1/8\epsilon<1/8.

Thus, we can lower bound σk2(R)\sigma_{k}^{2}(R),

E.4 Error Bound for the Reduced Third-order Moment

Let M3M_{3} be defined as in Definition 5.1. Let M^3\widehat{M}_{3} be the empirical version of M3M_{3}, i.e.,

(\@slowromancapi@) Bounding Bi(x)\|B_{i}(x)\|.

Note that VxN(0,Ik)V^{\top}x\sim\operatorname*{{\mathcal{N}}}(0,I_{k}). According to Fact B.1 and Fact B.2, we have for any constant t1t\geq 1, with probability 11/(ndt)1-1/(nd^{t}),

Define L=wip+1+ϕ(0)L=\|w_{i}^{*}\|^{p+1}+|\phi(0)|. Then we have for any 0<ϵ<10<\epsilon<1, if

Also note that for any symmetric 3rd-order tensor RR, the operator norm of RR,

Let M4M_{4} be defined as in Definition 5.1. Let M^4\widehat{M}_{4} be the empirical version of M4M_{4}, i.e.,

(\@slowromancapi@) Bounding Bi(x)\|B_{i}(x)\|.

Note that VxN(0,Ik)V^{\top}x\sim\operatorname*{{\mathcal{N}}}(0,I_{k}). According to Fact B.1 and Fact B.2, we have for any constant t1t\geq 1, with probability 11/(ndt)1-1/(nd^{t}),

Define L=wip+1+ϕ(0)L=\|w_{i}^{*}\|^{p+1}+|\phi(0)|. Then we have for any 0<ϵ<10<\epsilon<1, if

with probability at least 1dT1-d^{-T}, Eq. (34) holds. Also note that for any symmetric 3rd-order tensor RR, the operator norm of RR,

E.4.2 Final Error Bound for the Reduced Third-order Moment

Lemma E.10 shows R^3\widehat{R}_{3} can approximate R3R_{3} to some small precision with poly(k)\operatorname{poly}(k) samples.

holds with probability at least 1dΩ(t)1-d^{-\Omega(t)}.

The main idea is to use matrix Bernstein bound after matricizing the third-order tensor. Similar to the proof of Lemma E.5, we consider each node component individually and then sum up the errors and apply union bound.

We have shown the bound for j3=3,4j_{3}=3,4 in Lemma E.8 and Lemma E.9 respectively. To summarize, for any 0<ϵ<10<\epsilon<1 we have if

E.5 Error Bound for the Magnitude and Sign of the Weight Vectors

The lemmata in this section together with Lemma E.5 provide guarantees for Algorithm 4. In particular, Lemma E.12 shows with linear sample complexity in dd, we can approximate the 1st-order moment to some precision. And Lemma E.13 and Lemma E.14 provide the error bounds of linear systems Eq. (28) under some perturbations.

By definition of zz and z^\widehat{z}, we can rewrite zz and z^\widehat{z},

As A~14κσk(A)\|\widetilde{A}\|\leq\frac{1}{4\kappa}\sigma_{k}(A), we have A~A+AA~(AA)11/4\|\widetilde{A}^{\top}A+A^{\top}\widetilde{A}\|\|(A^{\top}A)^{-1}\|\leq 1/4. Together with b~14b\|\widetilde{b}\|\leq\frac{1}{4}\|b\|, we can ignore the high-order errors. So we have

E.5.2 Error Bound for the First-order Moment

Let Q1Q_{1} be defined as in Eq. (25) and Q^1\widehat{Q}_{1} be the empirical version of Q1Q_{1} using dataset SS, where each sample of SS is i.i.d. sampled from distribution D{\cal D}(defined in (1)). Assume the activation function satisfies Property 3.1 and Assumption 5.3. Then for any 0<ϵ<10<\epsilon<1 and t1t\geq 1, define j1=min{j1Mj0}j_{1}=\min\{j\geq 1|M_{j}\neq 0\} and m0=mini[k](mj1,i(wiα)j11)2m_{0}=\min_{i\in[k]}(m_{j_{1},i}(\overline{w}_{i}^{*\top}\alpha)^{j_{1}-1})^{2} if

we have with probability at least 1dΩ(t)1-d^{-\Omega(t)},

We consider the case when l1=3l_{1}=3, i.e.,

Since y=i=1kviϕ(wix)y=\sum_{i=1}^{k}v_{i}^{*}\phi(w_{i}^{*\top}x). We consider each component i[k]i\in[k].

(\@slowromancapi@) Bounding Bi(x)\|B_{i}(x)\|.

According to Fact B.1 and Fact B.2, we have for any constant t1t\geq 1, with probability 11/(ndt)1-1/(nd^{t}),

Define L=wip+1+ϕ(0)L=\|w_{i}^{*}\|^{p+1}+|\phi(0)|. Then we have for any 0<ϵ<10<\epsilon<1, if

Summing up all kk components, we obtain if

Other cases (j1=1,2,4j_{1}=1,2,4) are similar, so we complete the proof. ∎

E.5.3 Linear System for the First-order Moment

The following lemma provides estimation error bound for the first linear system in Eq. (28).

Using Fact B.3, we can lower bound σk(A)\sigma_{k}(A),

We can upper bound A~\|\widetilde{A}\| in the following way,

We can upper bound b\|b\| and b~\|\widetilde{b}\|,

To apply Lemma E.11, we need δ^41/4\widehat{\delta}_{4}\leq 1/4 and δ^21/(kκ2)\widehat{\delta}_{2}\lesssim 1/(\sqrt{k}\kappa^{2}), δ^31/(kκ2)\widehat{\delta}_{3}\lesssim 1/(\sqrt{k}\kappa^{2}). So we have,

E.5.4 Linear System for the Second-order Moment

The following lemma provides estimation error bound for the second linear system in Eq. (28).

Let \circ be the element-wise matrix product (a.k.a. Hadamard product), W=[w1  w2    wk]\overline{W}=[\overline{w}_{1}^{*}\;\overline{w}_{2}^{*}\;\cdots\;\overline{w}_{k}^{*}] and U=[u1  u2    uk]=VWU=[u_{1}\;u_{2}\;\cdots\;u_{k}]=V^{\top}\overline{W} . We can upper bound A\|A\| and A~\|\widetilde{A}\| as follows,

where fourth step follows Schur product theorem, the last step follows by the fact that CBFσmin(C)BF\|CB\|_{F}\geq\sigma_{\min}(C)\|B\|_{F} and \circ is the element-wise multiplication of two matrices.

We can upper bound b\|b\| and b~\|\widetilde{b}\|,

Note that according to Fact B.3, σk(W)1/κ\sigma_{k}(\overline{W})\geq 1/\kappa. Therefore, if δ^21/(2κk)\widehat{\delta}_{2}\leq 1/(2\kappa\sqrt{k}), we will have σk(VW)1/(2κ)\sigma_{k}(V^{\top}\overline{W})\geq 1/(2\kappa). Similarly, we have σ1(VW)VWk\sigma_{1}(V^{\top}\overline{W})\leq\|V\|\|\overline{W}\|\leq\sqrt{k}. Then applying Lemma E.11 and setting δ^21kκ3\widehat{\delta}_{2}\lesssim\frac{1}{\sqrt{k}\kappa^{3}}, we complete the proof. ∎

Appendix F Acknowledgments

The authors would like to thank Surbhi Goel, Adam Klivans, Qi Lei, Eric Price, David P. Woodruff, Peilin Zhong, Hongyang Zhang and Jiong Zhang for useful discussions.