Neural Networks can Learn Representations with Gradient Descent

Alex Damian, Jason D. Lee, Mahdi Soltanolkotabi

Introduction

Crucial to the practical success of deep learning is the ability of gradient-based algorithms to learn good feature representations from the training data and to learn simple functions on top of these representations. Despite significant progress towards a theoretical foundation for neural networks, a robust understanding of this unique representation learning capability of gradient descent methods has remained elusive. A major challenge is that the loss landscape is highly nonconvex, which makes it difficult to establish convergence to a global optimum that achieves near zero training loss. Furthermore, due to the overparameterized nature of modern neural nets (containing many more parameters than training data), the training landscape has many global optima. In fact, there are many global optima with poor generalization performance. This paper thus focuses on answering this intriguing question:

How do gradient-based methods learn feature representations and why do these representations allow for efficient generalization and transfer learning?

The most prominent contemporary approach to understanding neural networks is the linearization or neural tangent kernel (NTK) technique. The premise of the linearization method is that the dynamics of gradient descent are well-approximated by gradient descent on a linear regression instance with a fixed feature representation. Using this linearization technique, it is possible to prove convergence to a point with zero training loss. However, this technique often requires unrealistic hyper-parameter choices (e.g. small learning rate, large initialization, or wide networks) that do not allow the features to evolve across the iterations, and thus the generalization error obtained with this technique cannot be better than that of a kernel method. Indeed, precise lower bounds show that the NTK solutions do not generalize better than the polynomial kernel. As a result this regime of training is also sometimes referred to as the lazy regime. See Section 4 for a more in-depth discussion of this literature and other related work. In practice, neural networks far outperform their corresponding induced kernels. Therefore, understanding the representation learning of neural networks beyond the lazy regime is of fundamental importance.

In this paper, we initiate the study of the representation learning of neural networks beyond this NTK/linear/lazy regime. To this aim, we consider the problem of learning polynomials with a low-dimensional latent representation of the form $f^{\star}(x)=g(Ux)$ , where $U$ maps from $d$ to $r$ dimensions with $d\gg r$ and $g$ is a multivariate polynomial of degree $p$ . This is a natural choice as the failure of the NTK solution is in part due to its inability to learn data-dependent feature representations that adapt to the intrinsic low latent dimensionality of the ground truth function. Existing analyses based on the NTK regime provably require $n\asymp d^{p}$ samples to learn any degree $p$ polynomial, even if it only depends on a few relevant directions. In contrast, we show that gradient descent from random initialization only requires $n\asymp d^{2}r+dr^{p}$ samples, breaking the sample complexity barrier dictated by NTK proof techniques. More specifically, our contributions are as follows:

Feature Learning: When the target function $f^{\star}=g(Ux)$ only depends on the projection of $x$ onto a hidden subspace $\operatorname{span}(U)$ , we show that gradient descent learns features that span $\operatorname{span}(U)$ . Leveraging these features, gradient descent can reach vanishing training loss with a very small network which guarantees good generalization performance. See Section 5.1.

Lower Bound: Finally, we show a lower bound that demonstrates that our non-degeneracy assumption (Assumption 2) is strictly necessary. Without non-degeneracy, there is a family of polynomials, each depending on a single relevant dimension (i.e. of the form $f^{\star}(x)=g(u\cdot x)$ ), which cannot be learned with fewer than $n\asymp d^{p/2}$ samples by any gradient descent based learner.

Setup

where $\varsigma^{2}$ controls the strength of the label noise.

In order to make the problem of learning $f^{\star}$ tractable, additional assumptions are necessary. The set of degree $p$ polynomials in $d$ dimensions spans a linear subspace of $L^{2}(\mathcal{D})$ of dimension $\Theta(d^{p})$ . Learning arbitrary degree $p$ polynomials therefore requires $n\gtrsim d^{p}$ samples. We follow Chen and Meka and Chen et al. in assuming that the ground truth $f^{\star}$ has a special low-dimensional latent structure. Specifically, we assume that $f^{\star}$ only depends on a small number of relevant dimensions and that the expected Hessian is non-degenerate. We show in Theorem 2 that this non-degeneracy assumption is strictly necessary to avoid sample complexity $d^{\Omega(p)}$ .

We will call $S^{\star}:=\operatorname{span}(u_{1},\ldots,u_{r})$ the principal subspace of $f^{\star}$ . We will also denote by $\Pi^{\star}:=\Pi_{S^{\star}}$ the orthogonal projection onto $S^{\star}$ .

We will also denote the normalized condition number of $H$ by $\kappa:=\frac{\norm*{H^{\dagger}}}{\sqrt{r}}$ .

The Network and Loss

where $m$ denotes the width of the network. We use a symmetric initialization, so that $f_{\theta_{0}}(x)=0$ . Explicitly, we will assume that $m$ is an even number and that

We will use the following initialization:

We note that while we focus on such symmetric initialization for clarity of exposition, our results also hold with small random initialization that is not necessarily symmetric. This holds by simple modifications in the proof accounting for the small nonzero output of the network at initialization. We will also denote the empirical and population losses by $\mathcal{L}(\theta)$ and $\mathcal{L}_{\mathcal{D}}(\theta)$ respectively:
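To make the role of the symmetric initialization concrete, the following is a minimal sketch of a two layer ReLU network whose output is identically zero at initialization. The pairing convention (neuron $j$ mirrored by neuron $j+m/2$), the scale of the hidden weights, and the names `symmetric_init` and `forward` are illustrative assumptions rather than the exact initialization used in our analysis.

```python
import numpy as np

def symmetric_init(m, d, rng):
    """Mirrored initialization (assumed convention): the second half of the
    neurons copies the hidden weights and biases of the first half and
    negates the output weights, so the two halves cancel."""
    assert m % 2 == 0, "the width m must be even"
    half = m // 2
    a_half = rng.choice([-1.0, 1.0], size=half)          # random signs
    W_half = rng.standard_normal((half, d)) / np.sqrt(d)  # assumed scale
    a = np.concatenate([a_half, -a_half])
    W = np.vstack([W_half, W_half])
    b = np.zeros(m)
    return a, W, b

def forward(a, W, b, X):
    """Two layer ReLU network: f(x) = sum_j a_j * relu(w_j . x + b_j)."""
    return np.maximum(X @ W.T + b, 0.0) @ a

rng = np.random.default_rng(0)
a, W, b = symmetric_init(m=8, d=5, rng=rng)
X = rng.standard_normal((4, 5))
out = forward(a, W, b, X)  # each mirrored pair contributes a*relu(.) - a*relu(.) = 0
```

Because every neuron is paired with a copy carrying the opposite output sign, the cancellation holds for every input and not only for the sampled ones.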

Notation

Main Results

Before we formally state our main result let us specify the exact form of gradient-based training we use in our theory.

With this algorithm in place, we are now ready to state our main result.

It is useful to note that the use of $\lambda$ in the algorithm corresponds to the common practice of weight decay and its value is chosen in such a way that $\norm{a^{(T)}}\leq B_{a}$ , i.e. to solve a constrained minimization problem (see Section 5.1). In practice, one simply tunes the hyperparameter $\lambda$ in order to achieve the desired tradeoff between training and test loss.
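The correspondence between weight decay and a norm constraint can be illustrated numerically. The sketch below assumes the head is fit by closed-form ridge regression on fixed features (a stand-in for running gradient descent with weight decay to convergence) and uses synthetic Gaussian features. The helper names `ridge_head` and `lambda_for_norm`, the normalization of the objective, and the bisection range are hypothetical choices made for illustration.

```python
import numpy as np

def ridge_head(Phi, y, lam):
    """Minimizer of (1/2n)||Phi a - y||^2 + (lam/2)||a||^2 (assumed normalization)."""
    n, m = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(m), Phi.T @ y / n)

def lambda_for_norm(Phi, y, B, lo=1e-8, hi=1e4, iters=60):
    """Geometric bisection for a weight decay with ||a(lam)|| <= B.

    Relies on the fact that the norm of the ridge solution is
    non-increasing in lam, and on the bracket [lo, hi] containing the
    crossing point."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if np.linalg.norm(ridge_head(Phi, y, mid)) > B:
            lo = mid
        else:
            hi = mid
    return hi

rng = np.random.default_rng(0)
Phi = rng.standard_normal((200, 20))   # synthetic stand-in for the features
y = rng.standard_normal(200)
B = 0.5 * np.linalg.norm(ridge_head(Phi, y, 1e-8))  # target norm budget
lam = lambda_for_norm(Phi, y, B)
a_lam = ridge_head(Phi, y, lam)
norms = [np.linalg.norm(ridge_head(Phi, y, l)) for l in (1e-3, 1e-2, 1e-1, 1.0, 10.0)]
```

The list `norms` decreases as the weight decay grows, which is why a single scalar $\lambda$ suffices to enforce any norm budget between zero and the norm of the unregularized solution.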

An intriguing aspect of the above result is that despite the fact that $f^{\star}$ may be of arbitrarily high degree, learning $f^{\star}$ requires only $n\gtrsim dr^{p}+d^{2}r$ samples and only requires a very small network with $m\gtrsim r^{p}$ . We note that our dependence on the latent dimension $r$ is near optimal as the minimax sample complexity even when the principal subspace $S^{\star}$ is known is $\Theta(r^{p})$ .
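For a sense of scale, the two rates can be compared at concrete values. The snippet below ignores all constants, logarithmic factors, and the dependence on $\kappa$, so the numbers are only indicative; the function names and the choice $d=100$, $r=2$, $p=4$ are assumptions made for illustration.

```python
def kernel_regime_scale(d, p):
    """Order of samples needed in the NTK / random feature regime: d^p."""
    return d ** p

def gradient_descent_scale(d, r, p):
    """Order of samples in the main result: d^2 r + d r^p."""
    return d ** 2 * r + d * r ** p

d, r, p = 100, 2, 4
ntk = kernel_regime_scale(d, p)        # 100^4
gd = gradient_descent_scale(d, r, p)   # 100^2 * 2 + 100 * 2^4
ratio = ntk / gd
```

With these values the kernel regime scale is $10^{8}$ while the gradient descent scale is $21{,}600$, and the $d^{2}r$ term dominates the latter.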

We show in Theorem 3 that by resampling the data after the first step, the sample complexity can be further reduced to $d^{2}r+r^{p}$ , dropping a factor of $d$ from the second term. The extra factor of $d$ results from the dependence between the data used in the first and second stages and we believe that a more careful analysis could remove this additional factor.

We contrast Theorem 1 with the following lower bound for learning a function class which satisfies Assumption 1 with $r=1$ but does not satisfy Assumption 2.

For any $p\geq 0$ , there exists a function class $\mathcal{F}_{p}$ of polynomials of degree $p$ , each of which depends on a single relevant dimension, such that any correlational statistical query learner using $q$ queries requires a tolerance $\tau$ of at most

in order to output a function $f\in\mathcal{F}_{p}$ with $L^{2}(\mathcal{D})$ loss at most $1$ .

Using the heuristic $\tau\approx\frac{1}{\sqrt{n}}$ , which represents the expected scale of the concentration error, we get the immediate corollary that violating Assumption 2 allows us to construct a function class which any neural network with polynomially many parameters trained for polynomially many steps of gradient descent cannot learn without at least $n\gtrsim d^{p/2}$ samples. We emphasize that this is only a heuristic argument as concentration errors are random rather than adversarial.

On the other hand, Theorem 1 shows that incorporating Assumption 2 allows gradient descent to efficiently learn polynomials of arbitrarily high degree with only $d^{2}r+dr^{p}$ samples.

The difference in sample complexity between Theorem 1 and Theorem 2 arises because in Theorem 1 our non-degeneracy assumption (Assumption 2) allows the network $f_{\theta}$ to extract useful features that aid robust learning and allow learning high degree polynomials with $n\gtrsim d^{2}$ samples. Theorem 2 shows that violating this assumption allows us to construct a function class which cannot be learned without $d^{\Omega(p)}$ samples, demonstrating the necessity of Assumption 2.

The fact that the network $f_{\theta}$ extracts useful features not only allows it to learn $f^{\star}$ efficiently, but also allows for efficient transfer learning. In particular, Theorem 3 shows that we can efficiently learn any target polynomial $g^{\star}(x)$ that depends on the same relevant dimensions as $f^{\star}$ with sample complexity independent of $d$ by simply truncating and retraining the head of the network:

Learning $g^{\star}(x)$ therefore only requires $N,m\gtrsim r^{p}$ , which is independent of the ambient dimension $d$ . We note that this is minimax optimal for learning arbitrary degree $p$ polynomials even when the hidden subspace $S^{\star}$ is known. Theorem 3 also shows that $n\gtrsim d^{2}r$ pre-training samples are sufficient for gradient descent to learn the subspace $S^{\star}$ from the pre-training data.

Related work

A growing body of recent work shows the connection between gradient descent on the full network and the Neural Tangent Kernel (NTK). Using this technique one can prove concrete results about neural network training and generalization in the kernel regime. The key idea is that for a large enough initialization, it suffices to consider a linearization of the neural network around the origin. This allows connecting the analysis of neural networks with the well-studied theory of kernel methods. This is also sometimes referred to as lazy training, as with such an initialization the parameters of the neural network stay close to their values at initialization, and these results can only show that neural networks are as powerful as shallow learners such as kernels. There is however growing evidence that this NTK-style analysis might not be sufficient to completely explain the success of neural networks in practice. Prior work provides empirical evidence that choosing a smaller initialization decreases the test error of the neural network. A similar gap between the performance of the NTK and that of neural networks has also been observed. Moreover, this NTK-style analysis does not yield satisfactory results in the setting studied in this paper. In particular, for learning polynomials of the form we study in this paper, existing lower bounds demonstrate that one needs at least $d^{p}$ samples in the kernel regime. In contrast, our results only require on the order of $d^{2}$ samples.

Leveraging the fact that linearized models are not feature learners, Ghorbani et al. showed precise upper and lower bounds on the sample complexity of NTK methods. They showed that because the NTK is unable to learn new features, learning any polynomial in dimension $d$ of degree $p$ requires $n=\Theta(d^{p})$ samples, which gives no improvement over polynomial kernels. On the empirical front, the NTK linearization analysis is also lacking. Arora et al. demonstrated that the kernel predictor loses more than $20\%$ in test accuracy relative to a deep network trained with SGD and state-of-the-art regularization on CIFAR-10. Our work is motivated by the contrast between these negative theoretical results for linearized NTK models and the spectacular empirical performance of deep learning.

The gap between such shallow learners and the full neural network has been established in theory and observed in practice. There is an emerging literature on learning beyond the lazy/NTK regime in the small initialization setting. For the problem of low-rank reconstruction, it has been shown that in a non-lazy regime with small random initialization, gradient descent finds globally optimal solutions with good generalization capability. This is carried out by utilizing a spectral bias phenomenon exhibited by the early stages of gradient descent from small random initialization, which puts the iterates on a trajectory towards generalizable models. For the problem of tensor decomposition it has also been shown that gradient descent with small initialization is able to leverage low-rank structure. It has also been shown that neural networks with orthogonal weights can be learned via SGD and outperform any kernel method. One crucial element in that analysis is that the early stage of the training is connected with learning the first and second moments of the data. Higher-order approximations of the training dynamics and the Neural Tangent Hierarchy have also been recently proposed towards closing this gap. None of the above papers, however, focus on learning polynomial representations efficiently via neural networks as carried out in this paper.

Another line of work focuses on learning single activations such as the ReLU function. In this context, it has been shown that it is hard to learn a single ReLU activation via stochastic gradient descent with random features, whereas learning such activations is possible in a non-NTK regime, again highlighting this important gap. In related work where the label also only depends on a single relevant direction, the authors show that in the context of learning the parity function, gradient descent is able to efficiently learn the planted set. However, this is a result of the unbalanced data distribution, which skews the gradient towards the planted set. In contrast, we consider isotropic Gaussian data so that no information can be extracted from the data distribution itself and features must be extracted from higher order correlations between the data and the labels. Chen and Meka also studied the problem of learning polynomials of few relevant dimensions. They provide an algorithm that learns polynomials of degree $p$ in $d$ dimensions that depend on $r$ hidden dimensions with $n\gtrsim C(r,p)d$ samples, where $C(r,p)$ is an unspecified function of $r,p$ which is likely exponential in $r$ . However, their algorithm is not a variant of gradient descent, and requires a clever spectral initialization. On the other hand, this work focuses on the ability of gradient descent to automatically extract hidden features and learn representations from the data.

There is also a line of work concerned with the mean-field analysis of neural networks. The insight is that for sufficiently large width the training dynamics of the neural network can be coupled with the evolution of a probability distribution described by a PDE. These papers use a smaller initialization than in the NTK regime and, hence, the parameters can move away from the initialization. However, these results do not provide explicit convergence rates and require an unrealistically large width of the neural network. To the best of our knowledge such an analysis technique has not been used to show efficient learning of polynomial representations using neural networks as carried out in this paper.

A concurrent line of work studied the feature learning ability of gradient descent in the mean field regime with data sampled from the boolean cube. The authors identified a necessary and sufficient condition for learning with sample complexity linear in $d$ , dubbed the merged staircase property, in the special case when the hidden weights of the two layer neural network are initialized at $0$ . However, the zero initialization hinders the feature learning ability of the network. For example, the boolean function XOR violates the merged staircase property, yet noisy XOR is known to be learnable by two layer neural networks with sample complexity linear in $d$ . In this work we study the impact that the nonzero initialization of the hidden weights has on the feature learning ability of neural networks.

Proof Sketches

Using the chain rule, we can further expand this as

With high probability over the random initialization,

Note that the remainder term, of order $d^{-1}$ , contains all higher order terms in the series expansion.

However, it is also important to note that the population gradient is bounded by $\norm{\nabla_{w_{j}}\mathcal{L}_{\mathcal{D}}(\theta)}=O(d^{-1/2})$ and we only have access to the empirical gradient $\nabla_{w_{j}}\mathcal{L}(\theta)$ . As mentioned above, extracting the necessary subspace information from $\nabla_{w_{j}}\mathcal{L}_{\mathcal{D}}(\theta)$ to learn $f^{\star}$ therefore requires $n\gtrsim d^{2}$ samples, which is the dominant term in our final sample complexity result.

Once we show that the gradient at initialization contains all the relevant features, we note that after the first step of gradient descent,

After the first step, the model therefore resembles a random feature model with random features $\{Hw\}_{w\in S^{d-1}}\subset S^{\star}$ . Previous results have shown that in these linearized regimes, e.g. random feature models/NTK, learning degree $p$ polynomials requires $n\gtrsim d^{p}$ samples and width $m\gtrsim d^{p}$ . As our “random features” are now constrained to the hidden subspace $S^{\star}$ , which has dimension $r$ , we should expect that our sample complexity improves to $n\gtrsim r^{p}$ .
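The two-stage picture described above can be sketched in a few lines of NumPy. The code below is an illustration under stated assumptions and not a faithful implementation of Algorithm 1: it uses the single-index toy target $\frac{1}{2}He_{2}(u\cdot x)+\frac{1}{4\sqrt{3}}He_{4}(u\cdot x)$, a mirrored initialization on the unit sphere, the update $w^{(1)}=2\eta_{1}a\,g_{n}(w)$ with $g_{n}(w)=\frac{1}{n}\sum_{i}\widehat{y}_{i}\,\sigma'(w\cdot x_{i})\,x_{i}$, and closed-form ridge regression in place of gradient descent with weight decay on the head. The step size `eta`, the ridge parameter `lam`, and all function names are hypothetical.

```python
import numpy as np

def target(X, u):
    """Toy target He_2(z)/2 + He_4(z)/(4*sqrt(3)) with z = u . x."""
    z = X @ u
    return (z**2 - 1) / 2 + (z**4 - 6 * z**2 + 3) / (4 * np.sqrt(3))

def first_step(X, y, W0, a, eta):
    """One large gradient step from a zero-output initialization.

    Returns W1 with rows 2 * eta * a_j * g_n(w_j), together with the
    empirical mean alpha and linear part beta removed from the labels."""
    alpha = y.mean()
    beta = (y[:, None] * X).mean(axis=0)
    y_hat = y - alpha - X @ beta
    active = (X @ W0.T > 0).astype(float)          # ReLU derivative, shape (n, m)
    G = (active * y_hat[:, None]).T @ X / len(y)   # row j is g_n(w_j), shape (m, d)
    return 2 * eta * a[:, None] * G, alpha, beta

def fit_head(Phi, y, lam):
    """Closed-form ridge regression on fixed features (stand-in for stage two)."""
    n, m = Phi.shape
    return np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(m), Phi.T @ y / n)

rng = np.random.default_rng(0)
d, m, n = 10, 64, 50_000
u = np.zeros(d)
u[0] = 1.0
X = rng.standard_normal((n, d))
y = target(X, u) + rng.choice([-1.0, 1.0], size=n)   # label noise in {-1, +1}

# Mirrored initialization with unit-norm hidden weights.
W_half = rng.standard_normal((m // 2, d))
W_half /= np.linalg.norm(W_half, axis=1, keepdims=True)
W0 = np.vstack([W_half, W_half])
a_half = rng.choice([-1.0, 1.0], size=m // 2)
a0 = np.concatenate([a_half, -a_half])

# Stage one: the updated hidden weights concentrate near span(u).
W1, alpha, beta = first_step(X, y, W0, a0, eta=np.sqrt(d))
alignment = np.sum((W1 @ u) ** 2) / np.sum(W1**2)    # energy fraction along u

# Stage two: resample the biases and fit only the head.
b1 = rng.standard_normal(m)
Phi = np.maximum(X @ W1.T + b1, 0.0)
a1 = fit_head(Phi, y - alpha - X @ beta, lam=1e-3)

def predict(X_new):
    """Add back the removed mean and linear part to the head's output."""
    return alpha + X_new @ beta + np.maximum(X_new @ W1.T + b1, 0.0) @ a1
```

The quantity `alignment` measures how much of the energy of $W^{(1)}$ lies along $u$; in this sketch it should be close to one when $n$ is well above $d^{2}$, which is the empirical counterpart of the statement that the features after the first step are approximately constrained to $S^{\star}$.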

The remainder of Algorithm 1 runs ridge regression on the network head $a$ with fixed features $x\to\sigma(W^{(1)}x+b)$ . We can directly analyze the generalization of this algorithm using standard techniques from Rademacher complexity. In particular, a high level sketch of the remainder of the proof goes as follows:

(Section A.3): We show the equivalence between ridge regression and norm constrained linear regression implies the existence of $\lambda>0$ such that the $T$ th iterate $a^{(T)}$ satisfies

Proof of Theorem 2

Let $\mathcal{F}$ be a class of functions and $\mathcal{D}$ be a data distribution such that

Then any correlational statistical query learner requires at least $\frac{\absolutevalue{\mathcal{F}}(\tau^{2}-\epsilon)}{2}$ queries of tolerance $\tau$ to output a function in $\mathcal{F}$ with $L^{2}(\mathcal{D})$ loss at most $2-2\epsilon$ .

To construct $\mathcal{F}_{p}$ , we begin by showing that there are a large number of approximately orthogonal unit vectors in $S^{d-1}$ :

There exists an absolute constant $c$ such that for any $\epsilon>0$ , there exists a set $S$ of $\frac{1}{2}e^{c\epsilon^{2}d}$ unit vectors such that for any $v,w\in S$ such that $v\neq w$ , we have $\absolutevalue{v\cdot w}\leq\epsilon$ .

Therefore $\absolutevalue{u\cdot v}\leq d^{-1/2}\sqrt{\log m}$ implies $\absolutevalue{E_{x\sim\mathcal{D}}[f_{u}(x)f_{v}(x)]}\leq d^{-k/2}(\log m)^{k/2}$ . Theorem 2 then directly follows from Lemma 2 (see Appendix D for a more detailed proof).
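The existence of many nearly orthogonal unit vectors is easy to observe numerically, since independent uniformly random directions already have this property. The sketch below draws $m$ random unit vectors and compares the largest pairwise overlap with the scale $d^{-1/2}\sqrt{\log m}$ appearing above. The values $d=1000$, $m=200$ and the constant $3$ used in the check are illustrative assumptions, and random sampling is only one way to realize such a set.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 1000, 200

# Normalized Gaussian vectors are uniformly distributed on the unit sphere.
V = rng.standard_normal((m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

gram = V @ V.T
np.fill_diagonal(gram, 0.0)               # ignore the trivial overlaps v . v = 1
max_overlap = np.abs(gram).max()          # largest |v . w| over distinct pairs
scale = np.sqrt(np.log(m) / d)            # the d^{-1/2} sqrt(log m) scale
```

Each overlap is approximately Gaussian with variance $1/d$, so the maximum over all pairs stays within a small constant multiple of `scale`.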

Experiments

In this section we present a toy example that clearly demonstrates the gap between kernel methods and gradient descent on two layer networks. For $u\in S^{d-1}$ , consider the target function

which satisfies $E_{x\sim\mathcal{D}}[f_{u}^{\star}(x)^{2}]=1$ . Note that $f^{\star}$ only depends on the projection of $x$ onto a single relevant direction, $u$ . We show in Section 5.1 that gradient descent is capable of isolating the subspace spanned by $u$ and then fitting a one dimensional random feature model to $g$ , and that this entire process requires $n\asymp d^{2}$ samples to generalize.

On the other hand, Ghorbani et al. have shown that $n\asymp d^{p}$ samples are strictly necessary in order to learn $f^{\star}$ in the NTK or random features regime. The theory predicts that with $n<d^{2}$ samples, kernel regression will return the zero predictor, and with $d^{2}<n<d^{p}$ samples, kernel regression will return $\frac{1}{2}He_{2}(u\cdot x)$ , incurring an $L^{2}(\mathcal{D})$ loss of $\frac{1}{2}$ .

We empirically verify these predictions. We take $d=10$ and $p=4$ and consider the function $f_{e_{1}}^{\star}(x)=\frac{He_{2}(x_{1})}{2}+\frac{He_{4}(x_{1})}{4\sqrt{3}}$ . We use label noise $\sigma^{2}=1$ and attempt to learn $f^{\star}$ using Algorithm 1, a random feature model, and a linearized NTK model. All experiments are conducted on a two layer neural network with widths $m=100$ and $m=1000$ . For each value of $n$ , the weight decay parameter $\lambda$ is tuned on a holdout set of size $10^{5}$ and test accuracies are reported over a separate test set of size $10^{5}$ . Error bars reflect the mean and standard deviation over $10$ random seeds.
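The normalization of this target and the loss of the quadratic-only predictor can be checked with Gauss-Hermite quadrature. The sketch below uses `numpy.polynomial.hermite_e`, whose probabilists' Hermite polynomials match the $He_{k}$ convention used here; the variable names and the number of quadrature nodes are arbitrary choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

# Quadrature for the weight exp(-z^2 / 2); rescale so the weights sum to one
# and expectations are taken under the standard Gaussian.
nodes, weights = hermegauss(40)
weights = weights / np.sqrt(2 * np.pi)

# Coefficients in the He_k basis: He_2 / 2 + He_4 / (4 sqrt(3)).
f_coeffs = [0.0, 0.0, 0.5, 0.0, 1.0 / (4.0 * np.sqrt(3.0))]
quad_coeffs = [0.0, 0.0, 0.5]             # the quadratic part He_2 / 2 only

f_vals = hermeval(nodes, f_coeffs)
quad_vals = hermeval(nodes, quad_coeffs)

second_moment = np.sum(weights * f_vals**2)                 # E[f(z)^2]
quadratic_loss = np.sum(weights * (f_vals - quad_vals)**2)  # E[(f - He_2/2)^2]
```

Since $E[He_{2}^{2}]=2$ and $E[He_{4}^{2}]=24$, each term contributes $\frac{1}{2}$ to the second moment, so `second_moment` evaluates to $1$ and `quadratic_loss` to $\frac{1}{2}$, matching the loss predicted for a kernel method that only fits the quadratic term.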

We note that while Algorithm 1 easily converged to vanishing excess risk, even at width $m=100$ , both the random features model and the neural tangent kernel model only managed to fit the quadratic term $\frac{1}{2}He_{2}(u\cdot x)$ , as predicted by the theory in Ghorbani et al.

Transfer Learning

The proof of Theorem 1 involves showing that Algorithm 1 learns features corresponding to $S^{\star}$ (see Section 5.1) and the proof of Theorem 3 shows that this implies efficient transfer learning. We again verify this empirically. We consider the function:

Note that this was exactly the hard example in Theorem 2 that was unlearnable without $n\gtrsim d^{\frac{p}{2}}$ samples by a correlational statistical query learner (and in particular, gradient-based learners).

We pretrain with $n$ samples on the $f^{\star}(x)$ from Section 6.1, then train the output layer using $N$ samples from $f^{\star}_{\text{target}}$ . As in Section 6.1, we use a label noise strength of $\sigma^{2}=1$ . We pick $p=3$ so that random feature methods or the neural tangent kernel will require at least $n\gtrsim d^{3}$ samples to learn $f^{\star}$ .

We note that in Figure 2, when $n=d^{0},d^{1}$ , fine tuning on $N$ target samples gives trivial risk until $N\gtrsim d^{3}$ , which is to be expected of a kernel method with no prior information. However, for $n\geq d^{2}$ pretraining samples, we can fine tune on just $N=O(1)$ target samples to reach nontrivial loss and the loss decays rapidly as a function of $N$ . This experiment therefore fully supports the conclusion of Theorem 3.

Discussion and Future Work

In this work we provide a clear separation between gradient-based training and kernel methods. We show that there is a large family of degree $p$ polynomials which are efficiently learnable by gradient descent with $n\asymp d^{2}$ samples, in contrast to the lower bound of $d^{p}$ for random feature/NTK analysis. The main idea driving both our sample complexity result (Theorem 1) and our transfer learning result (Theorem 3) is that gradient descent learns useful representations of the data.

One promising direction for future work is tightening the dimension dependence of our upper bound. In particular, our $n\asymp d^{2}$ sample complexity is driven by the difficulty in learning from a degree $2$ Hermite polynomial. However, our lower bound for such functions (Theorem 2) only rules out learning with $n\leq d$ samples. In this situation the lower bound is tight as Chen et al. show that sparse degree $2$ polynomials can be efficiently learned with $n\asymp d$ samples.

Another promising direction for future work is generalizing our result to the situation in which the hidden layer and the output layer are trained together. This introduces dependencies between the hidden and output layers which are difficult to control. However, such analysis may lead to a better understanding of learning order and inductive bias in deep learning.

Acknowledgements

AD acknowledges support from an NSF Graduate Research Fellowship. JDL and AD acknowledge support of the ARO under MURI Award W911NF-11-1-0304, the Sloan Research Fellowship, NSF CCF 2002272, NSF IIS 2107304, ONR Young Investigator Award, and NSF-CAREER under award #2144994. MS is supported by the Packard Fellowship in Science and Engineering, a Sloan Fellowship in Mathematics, an NSF-CAREER under award #1846369, DARPA Learning with Less Labels (LwLL) and FastNICS programs, and NSF-CIF awards #1813877 and #2008443.

References

Appendix A Proofs

We define $\iota=C_{\iota}\log(nmd)$ for a sufficiently large constant $C_{\iota}$ . Throughout the appendix we will use $e^{-\iota}$ to track failure probabilities of various lemmas and theorems.

We say that an event $A$ happens with high probability if it happens with probability at least $1-\operatorname{poly}(n,m,d)e^{-\iota}$ where $\operatorname{poly}(n,m,d)$ does not depend on $C_{\iota}$ .

Note that high probability events are closed under taking union bounds over sets of size $\operatorname{poly}(n,m,d)$ . We will assume throughout that $\iota\leq cd$ for a sufficiently small absolute constant $c$ .

The following lemma bounds $\norm{x_{i}}$ and is a direct corollary of Lemma 15:

With high probability, $\norm{x_{i}}^{2}\in\quantity[\frac{d}{2},2d]$ for $i=1,\ldots,n$ .

All remaining proofs will be conditioned on this high probability event.
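The norm bound is easy to observe in simulation, since $\norm{x_{i}}^{2}$ is a chi-squared random variable with $d$ degrees of freedom and therefore concentrates around $d$ with fluctuations of order $\sqrt{d}$. The sketch below assumes $x_{i}\sim N(0,I_{d})$ and uses the illustrative values $d=200$ and $n=1000$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 200, 1000

X = rng.standard_normal((n, d))      # rows are samples x_i ~ N(0, I_d)
sq_norms = np.sum(X**2, axis=1)      # ||x_i||^2, chi-squared with d degrees of freedom

# Check the event ||x_i||^2 in [d/2, 2d] simultaneously for all samples.
all_inside = bool(np.all((sq_norms >= d / 2) & (sq_norms <= 2 * d)))
relative_spread = sq_norms.std() / d  # roughly sqrt(2 / d)
```

At this dimension the interval $[\frac{d}{2},2d]$ lies several standard deviations away from the mean on both sides, so the event holds for every sample simultaneously.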

Let $\sigma(x):=\operatorname{ReLU}(x)=\max(0,x)$ . Then the Hermite expansion of $\sigma(x)$ is

Let $c_{k}$ denote the Hermite coefficients of $\sigma$ , i.e. $\sigma(x)=\sum_{k\geq 0}\frac{c_{k}}{k!}He_{k}(x)$ . Note that

Let the Hermite expansion of $f^{\star}$ be

Note that as an immediate consequence of Lemma 5, $\norm{C_{k}}_{F}^{2}\leq k!$ . In addition, Assumption 1 guarantees that $C_{k}\quantity(x^{\otimes k})=C_{k}\quantity(\quantity(\Pi^{\star}x)^{\otimes k})$ .
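The Hermite coefficients of ReLU, in the convention $\sigma(x)=\sum_{k\geq 0}\frac{c_{k}}{k!}He_{k}(x)$ so that $c_{k}=E_{z\sim N(0,1)}[\sigma(z)He_{k}(z)]$, can be evaluated numerically as a sanity check. The sketch below integrates over the half line $z\geq 0$, where ReLU is nonzero, with a midpoint rule; the grid size, the truncation point, and the function name are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def relu_hermite_coefficient(k, num=400_000, upper=12.0):
    """c_k = E[relu(z) He_k(z)] for z ~ N(0, 1), via a midpoint rule on [0, upper]."""
    h = upper / num
    z = (np.arange(num) + 0.5) * h            # midpoints of the grid cells
    basis = np.zeros(k + 1)
    basis[k] = 1.0                            # selects He_k in the series
    he_k = hermeval(z, basis)
    density = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
    return np.sum(z * he_k * density) * h     # relu(z) = z on [0, upper]

coeffs = [relu_hermite_coefficient(k) for k in range(5)]
inv_sqrt_2pi = 1.0 / np.sqrt(2 * np.pi)
```

Under this convention the first coefficients are $c_{0}=\frac{1}{\sqrt{2\pi}}$, $c_{1}=\frac{1}{2}$, $c_{2}=\frac{1}{\sqrt{2\pi}}$, $c_{3}=0$, and $c_{4}=-\frac{1}{\sqrt{2\pi}}$; the value of $c_{2}$ is the source of the factor $\frac{1}{\sqrt{2\pi}}$ in the leading term $\frac{Hw}{\sqrt{2\pi}}$ of the features.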

A.1.3 Concentrating $\alpha,\beta$

Let $\alpha=\frac{1}{n}\sum_{i=1}^{n}y_{i}$ and $\beta=\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i}$ . Then, with high probability,

Let $F(x_{1},\ldots,x_{n})=\frac{1}{n}\sum_{i=1}^{n}f^{\star}(x_{i})-C_{0}$ . Note that

The bound on $\absolutevalue{\alpha-C_{0}}$ therefore immediately follows from Lemma 17 applied to $F$ . The bound on $\norm{\beta-C_{1}}$ is a special case of Lemma 19 with $\sigma(x)=x$ . ∎

A.1.4 Hermite Expanding the Features

Note that by the scale invariance of $\sigma(x)=\operatorname{ReLU}(x)$ , Algorithm 1 does not depend on $\norm{w_{j}}$ for $j=1,\ldots,m$ . Therefore we can assume WLOG that $\norm{w_{j}}=1$ for $j=1,\ldots,m$ and $w_{j}\sim\operatorname{Unif}(S^{d-1})$ . For the remainder of the appendix we will assume that $\norm{w_{j}}=1$ .

We define $\widehat{f}^{\star}(x):=f^{\star}(x)-\alpha-\beta\cdot x$ .

The functions $g(w)$ and $g_{n}(w)$ capture the features that can be learned after one step of gradient descent:

By Stein’s lemma and the orthogonality of Hermite polynomials,

Note that these sums are finite as $C_{k}=0$ for $k>p$ . Next, by Corollary 12 we have the high probability bounds,

Applying these bounds term by term and using Lemma 6 to bound $\absolutevalue{C_{0}-\alpha}$ and $\norm{C_{1}-\beta}$ gives the desired result. ∎

Furthermore, it will become necessary to bound terms of the form $g_{n}(w)\cdot x_{i}$ . Note that $g_{n}(w)$ and $x_{i}$ are dependent random variables. The following lemma handles this dependence.

Let $w\sim S^{d-1}$ and assume $n\geq d^{2}\iota^{p}$ . Then with high probability,

For the first term, note that $g(w)$ and $x_{j}$ are independent so $g(w)\cdot x_{j}\sim N(0,\norm{g(w)}^{2})$ so with high probability,

Note that in the first term, the $x_{j}$ and the sum are independent. Therefore by Corollary 7 the first term is bounded with high probability by $O\quantity(\sqrt{\frac{d\iota^{p+2}}{n}})$ . In addition, by Lemma 17, the second term is bounded by $O\quantity(\frac{\iota^{p/2}d}{n})$ which completes the proof. ∎

A.2 Random Feature Approximation

This section shows that after we reinitialize the biases we can use random features to transform the activation $\sigma(x)=\operatorname{ReLU}(x)$ into $\sigma(x)=x^{p}$ which is more natural for learning polynomials.

Let $a\sim\operatorname{Unif}(\quantity{-1,1})$ , and $b\sim\operatorname{Unif}([-1,1])$ . Then for any $k\geq 0$ there exists $v_{k}(a,b)$ such that for $\absolutevalue{x}\leq 1$ ,

First, for $k=0$ we can take $v_{0}(a,b):=6b$ . Then,

and $\sup_{a,b}\absolutevalue{v_{0}(a,b)}=6$ . Next, for $k=1$ we can take $v_{1}(a,b):=2a$ . Then,

and we have $\sup_{a,b}\absolutevalue{v_{1}(a,b)}=2$ . Next, note that by integration by parts we have for any function $f$ ,

Therefore for $k\geq 2$ if $f(x)=x^{k}$ and
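The base cases $k=0$ and $k=1$ of this construction can be verified numerically. The sketch below assumes that the identity being established is $E_{a,b}[v_{k}(a,b)\,\sigma(ax+b)]=x^{k}$ for $\absolutevalue{x}\leq 1$ with $b$ uniform on $[-1,1]$, which is consistent with the choices $v_{0}(a,b)=6b$ and $v_{1}(a,b)=2a$ above. The expectation over $b$ is computed with a midpoint rule, and the helper name `expected_feature` is hypothetical.

```python
import numpy as np

def expected_feature(v, x, num=200_000):
    """E over a in {-1, +1} and b ~ Unif([-1, 1]) of v(a, b) * relu(a * x + b)."""
    b = -1.0 + (2.0 * np.arange(num) + 1.0) / num   # midpoints of a grid on [-1, 1]
    total = 0.0
    for a in (-1.0, 1.0):
        total += 0.5 * np.mean(v(a, b) * np.maximum(a * x + b, 0.0))
    return total

def v0(a, b):
    return 6.0 * b                    # candidate weight for the monomial x^0

def v1(a, b):
    return 2.0 * a * np.ones_like(b)  # candidate weight for the monomial x^1

xs = np.array([-1.0, -0.3, 0.0, 0.5, 1.0])
const_vals = np.array([expected_feature(v0, x) for x in xs])   # expected: 1 for every x
linear_vals = np.array([expected_feature(v1, x) for x in xs])  # expected: x
```

A direct computation gives $E_{b}[\sigma(x+b)]=\frac{(1+x)^{2}}{4}$ for $\absolutevalue{x}\leq 1$, so the $k=1$ case reduces to $\frac{(1+x)^{2}-(1-x)^{2}}{4}=x$, in agreement with `linear_vals`.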

Let $a\sim\operatorname{Unif}(\quantity{-1,1})$ , and $b\sim N(0,1)$ . Then for any $k\geq 0$ there exists $v_{k}(a,b)$ such that for $\absolutevalue{x}\leq 1$ ,

Let $\overline{v}_{k}$ be the function constructed in Lemma 9 and let

where $\mu(b):=\frac{e^{-\frac{b^{2}}{2}}}{\sqrt{2\pi}}$ denotes the density of $b$ . Then,

A.2.2 Multivariable Random Feature Approximation

With high probability over the data $\{x_{i}\}_{i\in[n]}$ , we have for $j\leq 4p$ ,

We can decompose $r(w)=\quantity[g_{n}(w)-g(w)]+\quantity[g(w)-\frac{Hw}{\sqrt{2\pi}}]$ and note that

We can bound the $j$ th moment term by term. We have by Corollary 8 and Lemma 24 that for $k\geq 2$ ,

We can now show that the random features $g_{n}(w)$ are sufficiently expressive to allow us to efficiently represent any polynomial of degree $p$ restricted to the principal subspace $S^{\star}$ .

For any $k\leq p$ , there exists an absolute constant $C$ such that if $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ and $d\geq C\kappa r^{3/2}$ ,

where $\Pi_{\operatorname{Sym}^{k}(S^{\star})}$ denotes the orthogonal projection onto symmetric $k$ tensors restricted to $S^{\star}$ .

for all symmetric $k$ tensors $T$ with $\norm{T}_{F}^{2}=1$ . Recall that $g_{n}(w)=\frac{Hw}{\sqrt{2\pi}}+r(w)$ . Therefore by the binomial theorem,

where $\absolutevalue{\delta(w)}\lesssim\sum_{i=1}^{k}\norm{T\quantity((Hw)^{\otimes k-i})}_{F}\norm{\Pi^{\star}r(w)}^{i}.$ Therefore by Young’s inequality,

Let $\hat{T}$ be the symmetric $k$ tensor defined by $\hat{T}(v_{1},\ldots,v_{k})=T(Hv_{1},\ldots,Hv_{k})$ . Then by Corollary 13,

Because we assumed $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ and $d\geq C\kappa r^{3/2}$ for a sufficiently large constant $C$ , we have

Assume $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ and $d\geq C\kappa r^{3/2}$ for a sufficiently large constant $C$ . Then for any $k\leq p$ and any symmetric $k$ tensor $T$ supported on $S^{\star}$ , there exists $z_{T}(w)$ such that

Assume $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ and $d\geq C\kappa r^{3/2}$ for a sufficiently large constant $C$ . Let $\eta_{1}=\sqrt{\frac{d}{C^{2}\iota^{3}}}$ , let $k\leq p$ and let $T$ be a $k$ tensor. Then with high probability, there exists $h_{T}(a,w,b)$ such that if

where $v_{k}(a,b)$ and $z_{T}(w)$ are constructed in Corollary 3 and Corollary 4 respectively. Recall that $w^{(1)}=2\eta_{1}ag_{n}(w)$ . Then for $x\in\quantity{x_{1},\ldots,x_{n}}$ ,

where the second to last line followed from Lemma 8. The first part of the lemma now follows from a union bound over $x_{1},\ldots,x_{n}$ . For the bounds on $h$ , we have

Assume $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ and $d\geq C\kappa r^{3/2}$ for a sufficiently large constant $C$ and let $\eta_{1}=\sqrt{\frac{d}{\iota^{3}}}$ . Then with high probability, there exists $h(a,w,b)$ such that if

with $\norm{T_{k}}_{F}\lesssim r^{\frac{p-k}{4}}$ . Let

Then $\frac{1}{n}\sum_{i=1}^{n}(f_{h}(x_{i})-f^{\star}(x_{i}))^{2}\lesssim\frac{1}{n}$ is immediate from Lemma 12 and

Let $a^{\star}_{j}:=\frac{1}{m}h(a_{j},w_{j},b_{j})$ where $h$ is the function constructed in Corollary 5. Then,

Then with probability $1-\operatorname{poly}(n,m,d)e^{-\iota}$ we have that $Z_{j}(x_{i})=\overline{Z}_{j}(x_{i})$ for $i=1,\ldots,n$ . Therefore,

For the first term, by Bernstein’s inequality we have with probability at least $1-2e^{-\iota}$ ,

and the first part of the lemma follows from a union bound.

We will now turn to the bound on $\norm{a^{\star}}^{2}$ . Let $z_{i}=(a^{\star}_{i})^{2}+(a^{\star}_{m-i})^{2}$ . Note that $\{z_{i}\}_{i\leq m/2}$ are positive, i.i.d., and bounded by $O(m^{-2}r^{2p}\kappa^{4p}\iota^{12p})$ . In addition, they have expectation $O(m^{-2}r^{p}\kappa^{2p}\iota^{3p})$ . Therefore by Popoviciu’s inequality they have variance bounded by

Therefore by Bernstein’s inequality we have that with high probability,

A.3 Proof of Theorem 1

to be the empirical $L^{2}$ losses with respect to the true labels (recall $y_{i}=f^{\star}(x_{i})+\epsilon_{i}$, $\epsilon_{i}\sim\{-\varsigma,\varsigma\}$).

Assume $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ and $d\geq C\kappa r^{3/2}$ for a sufficiently large constant $C$ and let $\eta_{1}=\sqrt{\frac{d}{\iota^{3}}}$ . Let $a^{\star}$ be the vector constructed in the proof of Lemma 13 and let $\theta=(a^{\star},W^{(1)},b^{(1)})$ . Then with high probability,

Let $\delta_{i}=f_{\theta}(x_{i})-f^{\star}(x_{i})$ . Then,

First, by Hoeffding’s inequality, we have with high probability,

We are now ready to directly prove Theorem 1.

Note that we can assume that there is an absolute constant $C$ such that $n\geq Cd^{2}r\kappa^{2}\iota^{p+1}$ , and $m\geq r^{p}\kappa^{2p}\iota^{6p+1}$ . Otherwise, we can simply take $\lambda\to\infty$ and return the zero predictor.

From Lemma 14 we know that with high probability, there exists $a^{\star}$ such that if $\theta=(a^{\star},W^{(1)},b^{(1)})$ ,

and $\norm{a^{\star}}_{2}^{2}\lesssim\frac{r^{p}\kappa^{2p}\iota^{6p+1}}{m}.$ Therefore by equality of norm constrained linear regression and ridge regression, there exists $\lambda>0$ such that if

Then with high probability, $f_{(a^{(T)},W^{(1)},b^{(1)})}\in\mathcal{F}$ . In addition, from Lemma 28,

Appendix B Transfer Learning

The proof of Theorem 3 is virtually identical to that of Theorem 1. We can use Lemma 13 to construct $a^{\star}$ such that if $\theta^{\star}=(a^{\star},W^{(1)},b^{(1)})$ then with high probability,

In addition, there exists $\lambda$ such that if $T\geq\Theta(\eta^{-1}\lambda^{-1})$ ,

Now let $\mathcal{F}=\quantity{f_{(a,W,b)}~{}:~{}\norm{a}_{2}\leq\norm{a^{\star}}}$ . Then by Lemma 27 we have with high probability,

Appendix C Concentration Lemmas

Let $X\sim\chi^{2}(d)$ . Then, for any $t\geq 0$ ,

Let $w\sim N(0,I_{d})$ . Then for some constant $C$ ,

Let $g$ be a polynomial of degree $p$ . Then there exists an absolute constant $C_{p}$ depending only on $p$ such that for any $\delta$ ,

Therefore by Theorem 1.2 of , there exists an absolute constant $C_{p}$ such that

Note that the planes $w\cdot x_{1}=0,\ldots,w\cdot x_{n}=0$ divide the sphere $S^{d-1}$ into at most $\sum_{i=0}^{d}\binom{n}{i}\lesssim n^{d}$ convex regions. For each region there exists an $\epsilon$ net of size $\quantity(\frac{3}{\epsilon})^{d}$. Therefore we can take the union of these nets over all regions, which has size at most $\quantity(\frac{3n}{\epsilon})^{d}=e^{Cd\log(n/\epsilon)}$. ∎
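The counting step above can be checked numerically. The following is a minimal sketch, not part of the original proof: the helper names `max_regions` and `net_size_bound` and the values of `n`, `d`, and `eps` are illustrative assumptions. It evaluates the region count $\sum_{i=0}^{d}\binom{n}{i}$ and compares the resulting union of nets with $(3n/\epsilon)^{d}$.

```python
from math import comb

def max_regions(n: int, d: int) -> int:
    """Region count sum_{i=0}^{d} C(n, i) quoted in the proof."""
    return sum(comb(n, i) for i in range(d + 1))

def net_size_bound(n: int, d: int, eps: float) -> float:
    """Final bound (3n/eps)^d on the size of the union of nets."""
    return (3 * n / eps) ** d

# Illustrative values only (hypothetical, not taken from the paper).
n, d, eps = 50, 4, 0.1
regions = max_regions(n, d)

# The proof's bound n^d holds up to constants in general;
# for these particular values it holds exactly.
regions_below_n_pow_d = regions <= n ** d

# One net of size (3/eps)^d per region, then take the union.
union_size = regions * (3 / eps) ** d
union_below_bound = union_size <= net_size_bound(n, d, eps)
```

For small $d$ relative to $n$ the region count is far below $n^{d}$, so the stated bound is loose but sufficient for the union bound that follows.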

Let $f(x)$ be a polynomial of degree $p$ and let $\sigma(x)\in\{x,\operatorname{ReLU}(x)\}$ . Then there exists an absolute constant $C_{p}$ depending only on $p$ such that for any $\iota>0$ , with probability at least $1-2ne^{-\iota}$ , we have

Let $Z_{i}(w):=g(x_{i})(u\cdot x_{i})\sigma^{\prime}(w\cdot x_{i})\mathbf{1}_{\{\absolutevalue{g(x_{i})}<R\}}$ so that

Then note that for fixed $w$, $Z_{i}(w)$ is $R$-sub-Gaussian, so for each $u\in\mathcal{N}_{1/4}$, with probability $1-2e^{-z}$ we have

so by a union bound we have with probability $1-2e^{Cd\log(n/\epsilon)}e^{-z}$ ,

so setting $z=Cd\log(n/\epsilon)+\iota$ we have with probability $1-2e^{-\iota}$ ,

Using $\epsilon=\sqrt{\frac{d}{n}}$ and putting everything together gives with probability $1-2ne^{-\iota}$ ,

Let $\epsilon_{i}\sim\{-\varsigma,\varsigma\}$ . Then with high probability,

Next, note that for fixed $u,w$ , $\epsilon_{i}(u\cdot x_{i})\sigma^{\prime}(w\cdot x_{i})$ is $\varsigma^{2}$ sub-Gaussian so for any $\iota>0$ , with probability $1-2e^{-\iota}$ ,

By a union bound, with probability at least $1-2e^{-\iota}$ ,

Appendix D CSQ Lower Bound

The proof is a modified version of the proof in Szörényi. Let $\langle\cdot,\cdot\rangle_{\mathcal{D}}$ denote the $L^{2}$ inner product with respect to $\mathcal{D}$. We will show that there are at least two functions $f,g\in\mathcal{F}$ such that for each query $h_{k}$, $\absolutevalue{\langle f,h_{k}\rangle_{\mathcal{D}}}\leq\tau$ and $\absolutevalue{\langle g,h_{k}\rangle_{\mathcal{D}}}\leq\tau$. Therefore, we can simply respond to each query adversarially with $0$, and it is impossible for the learner to distinguish between $f$ and $g$. Note that failing to do so will result in a loss of $\norm{f-g}_{\mathcal{D}}^{2}\geq 2-2\epsilon$. Let the $k$th query be $h_{k}$ and let

Similarly, we have that $\absolutevalue{A_{k}^{-}}\leq\frac{1}{\tau^{2}-\epsilon}$, so the number of functions that are eliminated by the $k$th query is at most $\frac{2}{\tau^{2}-\epsilon}$. We can continue this process for at most $\frac{\absolutevalue{\mathcal{F}}(\tau^{2}-\epsilon)}{2}$ iterations. ∎

Let $v_{1},\ldots,v_{k}\sim S^{d-1}$ . Then for every pair $i\neq j$ , $v_{i}\cdot v_{j}$ is $O(d^{-1})$ subgaussian so for an absolute constant $c$ , with probability $1-2e^{-2c\epsilon^{2}d}$ , $\absolutevalue{v_{i}\cdot v_{j}}\leq\epsilon$ . Therefore with probability $1-k^{2}e^{-2c\epsilon^{2}d}>0$ this holds for all $i\neq j$ so there must exist at least one collection of such points. ∎
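The probabilistic argument above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the construction used in the paper: the function names, the seed, and the sizes `d` and `k` are hypothetical choices made for illustration. It samples points uniformly from $S^{d-1}$ by normalizing Gaussian vectors and reports the largest pairwise inner product in absolute value.

```python
import math
import random

def random_unit_vector(d: int, rng: random.Random) -> list[float]:
    """Sample uniformly from S^{d-1} by normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def max_abs_inner_product(vectors: list[list[float]]) -> float:
    """Largest |v_i . v_j| over all pairs i != j."""
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            best = max(best, abs(dot))
    return best

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
d, k = 400, 40          # illustrative sizes, not taken from the paper
points = [random_unit_vector(d, rng) for _ in range(k)]
worst = max_abs_inner_product(points)
```

Each inner product has standard deviation about $d^{-1/2}$, so `worst` should sit at a small multiple of $d^{-1/2}$, consistent with the $O(d^{-1})$ subgaussian claim.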

Let $S$ be the set constructed in Lemma 3. Let

and note that for all $f\in\mathcal{F}$ , $\norm{f}_{\mathcal{D}}=1$ . Then for $v,w\in S$ and $v\neq w$ ,

Therefore, by Lemma 2 we have for any $\epsilon$ ,

In particular if we take $\epsilon=\sqrt{\frac{\log(4q(cd)^{k/2})}{cd}}$ we get

Appendix E Additional Technical Lemmas

For a $k$ tensor $T$ , let $\operatorname{Sym}(T)$ denote the symmetrization of $T$ along all $k!$ permutations of indices.
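As a concrete reading of this definition, here is a minimal sketch of $\operatorname{Sym}$. It assumes the usual convention that the symmetrization averages over the $k!$ permutations (the normalization by $1/k!$ is an assumption, since the text does not state it), and it stores a tensor as a dictionary from index tuples to values, which is an illustrative choice. The name `symmetrize` is hypothetical.

```python
from itertools import permutations, product
from math import factorial

Tensor = dict[tuple[int, ...], float]

def symmetrize(T: Tensor, k: int, d: int) -> Tensor:
    """Average T over all k! permutations of its k index positions.

    Missing keys of T are treated as zero entries. The 1/k! normalization
    is an assumed convention, not stated in the source text.
    """
    perms = list(permutations(range(k)))
    out: Tensor = {}
    for idx in product(range(d), repeat=k):
        total = sum(T.get(tuple(idx[p] for p in perm), 0.0) for perm in perms)
        out[idx] = total / factorial(k)
    return out

# A 2-tensor on R^2 with a single off-diagonal entry.
T = {(0, 1): 1.0}
S = symmetrize(T, k=2, d=2)
```

With this convention a tensor that is already symmetric is left unchanged, which gives a quick sanity check for any implementation.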

There exist $T_{0},\ldots,T_{p}$ such that

and $\norm{T_{k}}_{F}\lesssim r^{\frac{p-k}{4}}$ for $k\leq p$ .

Note that from the Taylor series of $f^{\star}(x)$ we have

Note that by a simple counting argument, the number of permutations such that this product of indicators is nonzero is exactly $k!\prod_{j=1}^{d}\frac{c_{j}!}{(c_{j}/2)!}$, since one can first order the indices corresponding to each $c_{j}$, then split them into groups of two, then shuffle these groups of two. Therefore,

because $\sum_{j}c_{j}=2k$ , which completes the proof. ∎

Let $\{h_{kl}\}$ and $\{h^{-1}_{kl}\}$ denote the change of basis matrices between Hermite polynomials and monomials, i.e.

Let $T$ be a symmetric $p$ -tensor and let $w\sim N(0,I_{d})$ . Then for $k\leq p$ ,

Let $T=\sum_{i}c_{i}v_{i}^{p}$ with $\|v_{i}\|=1$ . Using the change of basis $x^{k}\to\sum_{l\leq k}h^{-1}_{kl}He_{l}(x)$ ,

Let $T$ be a symmetric $p$ -tensor with $\dim(\operatorname{span}(T))=r$ . For $k\leq p$ ,

The proof follows directly from Lemma 23 and the inequality $\|T(I^{\otimes l})\|_{F}^{2}=\|T(\Pi_{\operatorname{span}(T)}^{\otimes l})\|_{F}^{2}\leq\norm{T}_{F}^{2}\norm{\Pi_{\operatorname{span}(T)}^{\otimes l}}_{F}^{2}=r^{l}\norm{T}_{F}^{2}$ for $2l\leq k$. ∎
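The value of the projection norm used in this proof can be verified in one line. The worked step below is a sketch relying only on two standard facts: Frobenius norms are multiplicative over tensor products, and an orthogonal projection $\Pi$ of rank $r$ satisfies $\Pi^{T}\Pi=\Pi$.

```latex
\norm{\Pi_{\operatorname{span}(T)}^{\otimes l}}_{F}^{2}
  = \quantity(\norm{\Pi_{\operatorname{span}(T)}}_{F}^{2})^{l}
  = \quantity(\operatorname{tr}\quantity(\Pi_{\operatorname{span}(T)}^{T}\Pi_{\operatorname{span}(T)}))^{l}
  = \quantity(\operatorname{tr}\Pi_{\operatorname{span}(T)})^{l}
  = r^{l}.
```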

Let $T$ be a symmetric $p$ -tensor with $\dim(\operatorname{span}(T))=r$ . With probability at least $1-2e^{-\iota}$ ,

Therefore by Lemma 17, with probability at least $1-2e^{-\iota}$ , $F(w)\lesssim\norm{T}_{F}^{2}r^{\lfloor\frac{k}{2}\rfloor}\iota^{k}$ and taking square roots completes the proof. ∎

This follows immediately from Lemma 23 and $(k-2l)!\binom{k}{2l}^{2}\leq(p-2l)!\binom{p}{2l}^{2}$ . ∎

Let $f$ be a polynomial of degree $p$ . Then

E.2 Sphere Lemmas

This follows from the decomposition $w=\nu\overline{w}$ with $\nu\sim\chi(d),\overline{w}\sim S^{d-1}$ independent. ∎
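The decomposition $w=\nu\overline{w}$ can be sketched as follows. The helper name `polar_decompose`, the seed, and the sample sizes are illustrative assumptions. The code splits a Gaussian vector into its norm $\nu$ and its direction $\overline{w}$, then estimates $\mathbb{E}[\nu^{2}]$, which equals $d$ because $\nu^{2}\sim\chi^{2}(d)$.

```python
import math
import random

def polar_decompose(w: list[float]) -> tuple[float, list[float]]:
    """Split w into its Euclidean norm nu and its direction w_bar = w / nu."""
    nu = math.sqrt(sum(x * x for x in w))
    return nu, [x / nu for x in w]

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
d = 50                  # illustrative dimension, not taken from the paper

# One draw w ~ N(0, I_d) and its decomposition.
w = [rng.gauss(0.0, 1.0) for _ in range(d)]
nu, w_bar = polar_decompose(w)

# Monte Carlo estimate of E[nu^2], which should be close to d.
trials = 2000
mean_nu_sq = sum(
    polar_decompose([rng.gauss(0.0, 1.0) for _ in range(d)])[0] ** 2
    for _ in range(trials)
) / trials
```

The simulation only checks the marginal behavior of $\nu$; the independence of $\nu$ and $\overline{w}$ follows from the rotational invariance of the Gaussian and is not tested here.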

Let $T$ be a symmetric $p$ -tensor with $\dim(\operatorname{span}(T))=r$ . With probability at least $1-2e^{-\iota}$ ,

Let $\overline{w}\sim S^{d-1}$ . For $k\leq p$ ,

E.3 Rademacher Complexity Bounds

Let $f=a^{T}\sigma(W^{\star}x+b)$ be a two layer neural network. For fixed $W,b$, let

Let $\theta=(a,W,b)$ and let $f=a^{T}\sigma(Wx+b)$ be a two layer neural network. Let