Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks

Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro

Introduction

Deep neural networks have enjoyed great success in learning across a wide variety of tasks. They played a crucial role in the seminal work of Krizhevsky et al., starting an arms race of training larger networks with more hidden units, in pursuit of better test performance. In fact, the networks used in practice are over-parametrized to the extent that they can easily fit random labels to the data. Even though they have such high capacity, when trained with real labels they still achieve small generalization error.

Traditional wisdom in learning suggests that using models with increasing capacity will result in overfitting to the training data. Hence the capacity of models is generally controlled either by limiting the size of the model (number of parameters) or by adding explicit regularization, to prevent overfitting to the training data. Surprisingly, in the case of neural networks we notice that increasing the model size only helps in improving the generalization error, even when the networks are trained without any explicit regularization such as weight decay or early stopping. In particular, Neyshabur et al. observed that training models with an increasing number of hidden units leads to a decrease in the test error for image classification on MNIST and CIFAR-10. Similar empirical observations have been made over a wide range of architectural and hyper-parameter choices. What explains this improvement in generalization with over-parametrization? What is the right measure of complexity of neural networks that captures this generalization phenomenon?

To study and analyze this phenomenon more carefully, we need to simplify the architecture while making sure that the property of interest is preserved after the simplification. We therefore chose two layer ReLU networks since, as shown in the left and middle panels of Figure 1, they exhibit the same behavior with over-parametrization as the more complex pre-activation ResNet18 architecture. In this paper we prove a tighter generalization bound (Theorem 2) for two layer ReLU networks. Our capacity bound, unlike existing bounds, correlates with the test error and decreases with the increasing number of hidden units. Our key insight is to characterize complexity at a unit level, and as we see in the right panel of Figure 1, these unit level measures shrink at a rate faster than $1/\sqrt{h}$ for each hidden unit, decreasing the overall measure as the network size increases. When measured in terms of layer norms, our generalization bound depends on the Frobenius norm of the top layer and the Frobenius norm of the difference of the hidden layer weights with the initialization, which decreases with increasing network size (see Figure 2).

The closeness of learned weights to initialization in the over-parametrized setting can be understood by considering the limiting case as the number of hidden units goes to infinity, as considered in Bengio et al. and Bach. In this extreme setting, just training the top layer of the network, which is a convex optimization problem for convex losses, will result in minimizing the training error, as the randomly initialized hidden layer has all possible features. Intuitively, the large number of hidden units here represents all possible features, and hence the optimization problem involves just picking the right features that will minimize the training loss. This suggests that as we over-parametrize the networks, the optimization algorithms need to do less work in tuning the weights of the hidden units to find the right solution. Dziugaite and Roy indeed have numerically evaluated a PAC-Bayes measure from the initialization used by the algorithms and state that the Euclidean distance to the initialization is smaller than the Frobenius norm of the parameters. Nagarajan and Kolter also make a similar empirical observation on the significant role of initialization, and in fact prove an initialization dependent generalization bound for linear networks. However, they do not prove a similar generalization bound for neural networks. Alternatively, Liang et al. suggested a Fisher-Rao metric based complexity measure that correlates with generalization behavior in larger networks, but they also prove the capacity bound only for linear networks.

Contributions: Our contributions in this paper are as follows.

We empirically investigate the role of over-parametrization in generalization of neural networks on 3 different datasets (MNIST, CIFAR10 and SVHN), and show that the existing complexity measures increase with the number of hidden units - hence do not explain the generalization behavior with over-parametrization.

We prove tighter generalization bounds (Theorems 2 and 5) for two layer ReLU networks. Our proposed complexity measure actually decreases with the increasing number of hidden units, and can potentially explain the effect of over-parametrization on generalization of neural networks.

We provide a matching lower bound for the Rademacher complexity of two layer ReLU networks. Our lower bound considerably improves over the best known bound given in Bartlett et al., and to our knowledge is the first such lower bound that is bigger than the Lipschitz constant of the network class.

Generalization of Two Layer ReLU Networks

where $\mathcal{R}_{S}(\mathcal{H})$ is the Rademacher complexity of a class $\mathcal{H}$ of functions with respect to the training set $\mathcal{S}$ which is defined as:

Rademacher complexity is a capacity measure that captures the ability of functions in a function class to fit random labels which increases with the complexity of the class.
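To make the "ability to fit random labels" reading concrete, here is a minimal sketch that computes the empirical Rademacher complexity of a small finite function class by brute force. It assumes the standard definition $\frac{1}{m}\mathbb{E}_{\bm{\xi}\in\{\pm 1\}^{m}}\left[\sup_{f}\sum_{i=1}^{m}\xi_{i}f(\mathbf{x}_{i})\right]$ for scalar-valued functions; the function name and the toy classes are illustrative only.

```python
from itertools import product

def empirical_rademacher(outputs):
    """Exact empirical Rademacher complexity of a finite function class.

    outputs[k][i] is the value of the k-th function on the i-th sample.
    All 2^m sign vectors are enumerated, so this is only usable for small m.
    """
    m = len(outputs[0])
    total = 0.0
    for xi in product((-1.0, 1.0), repeat=m):
        # Best correlation any function in the class achieves with these signs.
        total += max(sum(s * v for s, v in zip(xi, f)) for f in outputs)
    return total / (2 ** m * m)

# A class with a single function cannot adapt to the random signs.
single = [[1.0, -1.0, 1.0]]
# A class containing every sign pattern fits any random labeling perfectly.
rich = [list(signs) for signs in product((-1.0, 1.0), repeat=3)]

print(empirical_rademacher(single))  # 0.0
print(empirical_rademacher(rich))    # 1.0
```

The two extremes illustrate why the choice of function class matters: the richer the class, the closer the value gets to its maximum.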

We will bound the Rademacher complexity of neural networks to get a bound on the generalization error. Since the Rademacher complexity depends on the function class considered, we need to choose the right function class that only captures the real trained networks, which is potentially much smaller than networks with all possible weights, to get a complexity measure that explains the decrease in generalization error with increasing width. Choosing a bigger function class can result in weaker bounds that do not capture this phenomenon. To that end, we first investigate the behavior of different measures of network layers with an increasing number of hidden units. The experiments discussed below are done on the CIFAR-10 dataset. Please see Section A for similar observations on the SVHN and MNIST datasets.

First layer: As we see in the second panel of Figure 2, even though the spectral and Frobenius norms of the learned layer decrease initially, they eventually increase with $h$, with the Frobenius norm increasing at a faster rate. However, the Frobenius distance to initialization ( $\left\lVert{\mathbf{U}-\mathbf{U}_{0}}\right\rVert_{F}$ ) decreases. This suggests that the increase in the Frobenius norm of the weights in larger networks is due to the increase in the Frobenius norm of the random initialization. To understand this behavior in more detail, we also plot the distance to initialization per unit and the distribution of angles between learned weights and initial weights in the last two panels of Figure 2. We indeed observe that the per unit distance to initialization decreases with increasing $h$, and we see a significant shift in the distribution of angles between learned and initial weights, from almost orthogonal in small networks to almost aligned in large networks. This per unit distance to initialization is a key quantity that appears in our capacity bounds and we refer to it as unit capacity in the remainder of the paper.

Unit capacity. We define $\beta_{i}=\left\lVert{\mathbf{u}_{i}-\mathbf{u}^{0}_{i}}\right\rVert_{2}$ as the unit capacity of the hidden unit $i$ .

Second layer: Similar to the first layer, we look at the behavior of different measures of the second layer of the trained networks with increasing $h$ in the first panel of Figure 2. Here, unlike the first layer, we notice that the Frobenius norm and the distance to initialization both decrease and are quite close, suggesting a limited role of initialization for this layer. Moreover, as the size grows, since the Frobenius norm $\left\lVert{\mathbf{V}}\right\rVert_{F}$ of the second layer slightly decreases, we can argue that the norm of outgoing weights $\mathbf{v}_{i}$ from a hidden unit $i$ decreases at a rate faster than $1/\sqrt{h}$. If we think of each hidden unit as a linear separator and the top layer as an ensemble over classifiers, this means the impact of each classifier on the final decision is shrinking at a rate faster than $1/\sqrt{h}$. This per unit measure again plays an important role and we define it as unit impact for the remainder of this paper.
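The $1/\sqrt{h}$ rate can be read off from the identity between the Frobenius norm and the per unit norms. The following short derivation is a sketch of that step; it is a statement about the root-mean-square norm over units, not about every individual unit:

$$\left\lVert{\mathbf{V}}\right\rVert_{F}^{2}=\sum_{i=1}^{h}\left\lVert{\mathbf{v}_{i}}\right\rVert_{2}^{2}\quad\Longrightarrow\quad\sqrt{\frac{1}{h}\sum_{i=1}^{h}\left\lVert{\mathbf{v}_{i}}\right\rVert_{2}^{2}}=\frac{\left\lVert{\mathbf{V}}\right\rVert_{F}}{\sqrt{h}}.$$

If $\left\lVert{\mathbf{V}}\right\rVert_{F}$ stayed constant as $h$ grows, the root-mean-square norm of the outgoing weights would decay exactly as $1/\sqrt{h}$; a decreasing $\left\lVert{\mathbf{V}}\right\rVert_{F}$ makes it decay faster.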

Unit impact. We define $\alpha_{i}=\left\lVert{\mathbf{v}_{i}}\right\rVert_{2}$ as the unit impact, which is the magnitude of the outgoing weights from the unit $i$ .
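As an illustration of the two definitions, the sketch below computes unit capacity and unit impact from explicit weight matrices. It assumes the layout suggested by $f(\mathbf{x})=\mathbf{V}[\mathbf{U}\mathbf{x}]_{+}$: row $i$ of the $h\times d$ matrix $\mathbf{U}$ holds the incoming weights $\mathbf{u}_{i}$, and column $i$ of the $c\times h$ matrix $\mathbf{V}$ holds the outgoing weights $\mathbf{v}_{i}$. The function names and the toy matrices are hypothetical.

```python
import math

def unit_capacity(U, U0):
    """beta_i = ||u_i - u_i^0||_2 for each row i of the h x d hidden layer."""
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(u, u0)))
            for u, u0 in zip(U, U0)]

def unit_impact(V):
    """alpha_i = ||v_i||_2 for each column i of the c x h output layer."""
    h = len(V[0])
    return [math.sqrt(sum(row[i] ** 2 for row in V)) for i in range(h)]

# Toy network with h = 2 hidden units, d = 2 inputs, c = 2 outputs.
U0 = [[1.0, 0.0], [0.0, 1.0]]   # initialization
U = [[4.0, 4.0], [0.0, 1.0]]    # learned hidden layer (unit 2 did not move)
V = [[3.0, 0.0], [4.0, 2.0]]    # learned output layer

beta = unit_capacity(U, U0)
alpha = unit_impact(V)
print(beta)   # [5.0, 0.0]
print(alpha)  # [5.0, 2.0]
```

A unit that stays at its initial weights has zero capacity regardless of how large its initial norm was, which is the property that separates this measure from plain Frobenius norms.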

Motivated by our empirical observations we consider the following class of two layer neural networks that depend on the capacity and impact of the hidden units of a network. Let $\mathcal{W}$ be the following restricted set of parameters:

We now consider the hypothesis class of neural networks represented using parameters in the set $\mathcal{W}$ :

Our empirical observations indicate that networks we learn from real data have bounded unit capacity and unit impact and therefore studying the generalization behavior of the above function class can potentially provide us with a better understanding of these networks. Given the above function class, we will now study its generalization properties.

2 Generalization Bound

In this section we prove a generalization bound for two layer ReLU networks. We first bound the Rademacher complexity of the class $\mathcal{F}_{\mathcal{W}}$ in terms of the sum over hidden units of the product of unit capacity and unit impact. Combining this with equation (2) will give a generalization bound.
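The link between this unit-level sum and the layer norms mentioned in the introduction follows from the Cauchy-Schwarz inequality. As a short sketch, using only the definitions of $\alpha_{i}$ and $\beta_{i}$ given earlier:

$$\sum_{i=1}^{h}\alpha_{i}\beta_{i}\leq\left\lVert{\bm{\alpha}}\right\rVert_{2}\left\lVert{\bm{\beta}}\right\rVert_{2}=\left\lVert{\mathbf{V}}\right\rVert_{F}\left\lVert{\mathbf{U}-\mathbf{U}_{0}}\right\rVert_{F},$$

since $\left\lVert{\bm{\alpha}}\right\rVert_{2}^{2}=\sum_{i}\left\lVert{\mathbf{v}_{i}}\right\rVert_{2}^{2}$ and $\left\lVert{\bm{\beta}}\right\rVert_{2}^{2}=\sum_{i}\left\lVert{\mathbf{u}_{i}-\mathbf{u}_{i}^{0}}\right\rVert_{2}^{2}$. Equality holds when $\bm{\alpha}$ and $\bm{\beta}$ are proportional.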

The proof is given in the supplementary Section B. The main idea behind the proof is a new technique to decompose the complexity of the network into complexity of the hidden units. To our knowledge, all previous works decompose the complexity to that of layers and use Lipschitz property of the network to bound the generalization error. However, Lipschitzness of the layer is a rather weak property that ignores the linear structure of each individual layer. Instead, by decomposing the complexity across the hidden units, we get the above tighter bound on the Rademacher complexity of the networks.

The generalization bound in Theorem 1 is for any function in the function class defined by a specific choice of $\bm{\alpha}$ and $\bm{\beta}$ fixed before the training procedure. To get a generalization bound that holds for all networks, we need to cover the space of possible values for $\bm{\alpha}$ and $\bm{\beta}$ and take a union bound over it. The following theorem states the generalization bound for any two layer ReLU network (for the statement with exact constants, see Lemma 13 in Supplementary Section B).

3 Comparison with Existing Results

Neyshabur et al. and Golowich et al. showed a bound depending on the product of Frobenius norms of layers, which increases with $h$, showing the important role of initialization in our bounds. In fact, the proof technique of Neyshabur et al. does not allow for a bound with norms measured from initialization, and our new decomposition approach is the key to the tighter bound.

Experimental comparison. We train two layer ReLU networks of size $h$ on the CIFAR-10 and SVHN datasets with values of $h$ ranging from $2^{6}$ to $2^{15}$. The training and test error for CIFAR-10 are shown in the first panel of Figure 1, and for SVHN in the left panel of Figure 4. We observe for both datasets that even though a network of size 128 is enough to get to zero training error, networks with sizes well beyond 128 can still get better generalization, even when trained without any regularization. We further measure the unit-wise properties introduced in the paper, namely unit capacity and unit impact. These quantities decrease with increasing $h$, and are reported in the right panel of Figure 1 and the second panel of Figure 4. Also notice that the number of epochs required for each network size to reach 0.01 cross-entropy loss decreases for larger networks, as shown in the third panel of Figure 4.

For the same experimental setup, Figure 5 compares the behavior of different capacity bounds over networks of increasing sizes. Generalization bounds typically scale as $\sqrt{C/m}$ where $C$ is the effective capacity of the function class. The left panel reports the effective capacity $C$ based on different measures calculated with all the terms and constants. We can see that our bound is the only one that decreases with $h$ and is consistently lower than other norm-based data-independent bounds. Our bound even improves over VC-dimension for networks with size larger than 1024. While the actual numerical values are very loose, we believe they are useful tools to understand the relative generalization behavior with respect to different complexity measures, and in many cases, by applying a set of data-dependent techniques, one can improve the numerical values of these bounds significantly. In the middle and right panels we present each capacity bound normalized by its maximum in the range of the study for networks trained on CIFAR-10 and SVHN respectively. For both datasets, our capacity bound is the only one that decreases with the size even for networks with about $100$ million parameters. All other existing norm-based bounds initially decrease for smaller networks but then increase significantly for larger networks. Our capacity bound therefore could potentially point to the right properties that allow the over-parametrized networks to generalize.
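The two operations used to read these plots, the $\sqrt{C/m}$ scaling and the normalization of each curve by its maximum over the studied range, can be sketched as follows. The capacity values in the example are placeholders chosen for illustration, not measurements from the experiments.

```python
import math

def bound_from_capacity(capacity, num_samples):
    """Typical scaling of a generalization bound: sqrt(C / m)."""
    return math.sqrt(capacity / num_samples)

def normalize_by_max(capacities):
    """Rescale one capacity curve so that its largest value equals 1."""
    peak = max(capacities)
    return [c / peak for c in capacities]

# Hypothetical capacity values for three increasing network sizes.
decreasing_curve = [4.0, 2.0, 1.0]
increasing_curve = [1.0, 3.0, 12.0]

print(normalize_by_max(decreasing_curve))  # [1.0, 0.5, 0.25]
print(normalize_by_max(increasing_curve))
```

Normalizing each curve separately removes the differences in absolute scale between bounds, so only the trend with network size remains visible.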

Finally, we check the behavior of our complexity measure under a different setting, where we compare this measure between networks trained on real and random labels. We plot the distribution of the margin normalized by our measure, computed on networks trained with true and random labels, in the last panel of Figure 4. As expected, the normalized margins correlate well with the generalization behavior.

Lower Bound

In this section we will prove a Rademacher complexity lower bound for neural networks matching the dominant term in the upper bound of Theorem 1. We will show our lower bound on a smaller function class than $\mathcal{F}_{\mathcal{W}}$ , with an additional constraint on spectral norm of the hidden layer, as it allows for comparison with the existing results, and extends also to the bigger class $\mathcal{F}_{\mathcal{W}}$ .

The proof is given in the supplementary Section B.3. Clearly, $\mathcal{W}^{\prime}\subseteq\mathcal{W}$ since it has an extra constraint. The above lower bound matches the first term, $\frac{\sum_{i=1}^{h}\alpha_{i}\beta_{i}\|\mathbf{X}\|_{F}}{m\gamma}$ , in the upper bound of Theorem 1, up to $\frac{1}{\gamma}$ which comes from the $\frac{1}{\gamma}$ -Lipschitz constant of the ramp loss $l_{\gamma}$ . Also when $c=1$ and $\bm{\beta}=\bm{0}$ ,

To our knowledge, all previous capacity lower bounds for spectral norm bounded classes of neural networks correspond to the Lipschitz constant of the network. Our lower bound strictly improves over this, and shows a gap between the Lipschitz constant of the network (which can be achieved even by linear models) and the capacity of neural networks. This lower bound is non-trivial in the sense that the smaller function class excludes the neural networks with all rank-1 matrices as weights, and thus shows a $\Theta(\sqrt{h})$ -gap between the neural networks with and without ReLU. The lower bound therefore does not hold for linear networks. Finally, one can extend the construction in this bound to more layers by setting all weight matrices in intermediate layers to be the identity matrix.

In particular, Bartlett et al. have proved a lower bound of $\Omega\left(\frac{s_{1}s_{2}\|\mathbf{X}\|_{F}}{m}\right)$ , for the function class defined by the parameter set,

Note that $s_{1}s_{2}$ is a Lipschitz bound of the function class $\mathcal{F}_{\mathcal{W}_{spec}}$ . Given $\mathcal{W}_{spec}$ with bounds $s_{1}$ and $s_{2}$ , choosing $\bm{\alpha}$ and $\bm{\beta}$ such that $\left\lVert{\bm{\alpha}}\right\rVert_{2}=s_{1}$ and $\max_{i\in[h]}\beta_{i}=s_{2}$ results in $\mathcal{W}^{\prime}\subset\mathcal{W}_{spec}$ . Hence we get the following result from Theorem 3.

Hence our result improves the lower bound in Bartlett et al. by a factor of $\sqrt{h}$. Theorem 7 in Golowich et al. also gives an $\Omega(s_{1}s_{2}\sqrt{c})$ lower bound, where $c$ is the number of outputs of the network, for the composition of a 1-Lipschitz loss function and neural networks with bounded spectral norm, or $\infty$ -Schatten norm. Our above result even improves on this lower bound.

Generalization for Extremely Large Values of $h$

In contrast to Theorem 2 the additive $\sqrt{h}$ term is replaced by $\sqrt{\frac{h}{2^{p}}}$ . For $p$ of order $\ln h$ , $\sqrt{\frac{h}{2^{p}}}\approx$ constant improves on the $\sqrt{h}$ additive term in Theorem 2. However the norms in the first term $\left\lVert{\mathbf{V}}\right\rVert_{F}$ and $\left\lVert{\mathbf{U}-\mathbf{U}_{0}}\right\rVert_{F}$ are replaced by $h^{\frac{1}{2}-\frac{1}{p}}\left\lVert{\mathbf{V}^{T}}\right\rVert_{p,2}$ and $h^{\frac{1}{2}-\frac{1}{p}}\left\lVert{\mathbf{U}-\mathbf{U}_{0}}\right\rVert_{p,2}$ . For $p\approx\ln h$ , $h^{\frac{1}{2}-\frac{1}{p}}\left\lVert{\mathbf{V}^{T}}\right\rVert_{p,2}\approx h^{\frac{1}{2}-\frac{1}{\ln h}}\left\lVert{\mathbf{V}^{T}}\right\rVert_{\ln h,2}$ which is a tight upper bound for $\left\lVert{\mathbf{V}}\right\rVert_{F}$ and is of the same order if all rows of $\mathbf{V}$ have the same norm - hence giving a tighter bound that decreases with $h$ for larger values. In particular for $p=\ln h$ we get the following bound.
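The claim that the rescaled $(p,2)$ norm upper-bounds the Frobenius norm, with equality when all units have the same norm, can be checked numerically. The sketch below assumes that $\left\lVert{\mathbf{V}^{T}}\right\rVert_{p,2}$ denotes the $\ell_{p}$ norm of the per unit $\ell_{2}$ norms $\alpha_{i}$, so that $\left\lVert{\mathbf{V}}\right\rVert_{F}=\left\lVert{\bm{\alpha}}\right\rVert_{2}$; the helper names and the two example vectors are illustrative.

```python
import math

def lp_norm(values, p):
    """l_p norm of a list of non-negative per unit norms."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)

def rescaled_p_norm(alpha, p):
    """h^(1/2 - 1/p) * ||alpha||_p, the quantity replacing the Frobenius norm."""
    h = len(alpha)
    return h ** (0.5 - 1.0 / p) * lp_norm(alpha, p)

h = 16
equal_units = [0.5] * h                # every unit has the same norm
one_heavy_unit = [2.0] + [0.0] * (h - 1)  # all the norm sits in one unit

for alpha in (equal_units, one_heavy_unit):
    frobenius = lp_norm(alpha, 2)
    for p in (2, 4, math.log(h)):
        # The rescaled norm is never smaller than the Frobenius norm for p >= 2.
        print(p, frobenius, rescaled_p_norm(alpha, p))
```

For `equal_units` the two quantities coincide for every $p$, while for `one_heavy_unit` the rescaled norm grows with $p$, which is the price paid for the smaller additive term.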

Under the settings of Theorem 5, with probability $1-\delta$ over the choice of the training set $\mathcal{S}=\{\mathbf{x}_{i}\}_{i=1}^{m}$ , for any function $f(\mathbf{x})=\mathbf{V}[\mathbf{U}\mathbf{x}]_{+}$ , the generalization error is bounded as follows:

Discussion

In this paper we present a new capacity bound for neural networks that decreases with the increasing number of hidden units, and could potentially explain the better generalization performance of larger networks. However our results are currently limited to two layer networks and it is of interest to understand and extend these results to deeper networks. Also while these bounds are useful for relative comparison between networks of different size, their absolute values are still much larger than the number of training samples, and it is of interest to get smaller bounds. Finally we provided a matching lower bound for the capacity improving on the current lower bounds for neural networks.

In this paper we do not address the question of whether optimization algorithms converge to low complexity networks in the function class considered in this paper, or in general how different hyper-parameter choices affect the complexity of the recovered solutions. It is interesting to understand the implicit regularization effects of the optimization algorithms for neural networks, which we leave for future work.

Acknowledgements

The authors thank Sanjeev Arora for many fruitful discussions on generalization of neural networks and David McAllester for discussion on the distance to random initialization. This research was supported in part by NSF IIS-RI award 1302662 and Schmidt Foundation.

References

Appendix A Experiments

Below we describe the setting for each reported experiment.

In this experiment, we trained a pre-activation ResNet18 architecture on the CIFAR-10 dataset. The architecture consists of a convolution layer followed by 8 residual blocks (each of which consists of two convolutions) and a linear layer on top. Let $k$ be the number of channels in the first convolution layer. The number of output channels and strides in residual blocks is then $[k,k,2k,2k,4k,4k,8k,8k]$ and $ $ respectively. Finally, we use kernel size 3 in all convolutional layers. We train 11 architectures, where for architecture $i$ we set $k=\lceil 2^{2+i/2}\rceil$. In each experiment we train using SGD with mini-batch size 64, momentum 0.9 and initial learning rate 0.1, where we reduce the learning rate to 0.01 when the cross-entropy loss reaches 0.01 and stop when the loss reaches 0.001 or the number of epochs reaches 1000. We use the reference line in the plots to differentiate the architectures that achieved 0.001 loss. We do not use weight decay or dropout but perform data augmentation by random horizontal flip of the image and random crop of size $28\times 28$ followed by zero padding.
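The width schedule can be tabulated with a few lines of code. This sketch assumes the architecture index runs over $i=0,\dots,10$ (the index range is not stated here, only that there are 11 architectures), and it uses integer arithmetic via $2^{2+i/2}=\sqrt{2^{4+i}}$ to avoid floating-point rounding in the ceiling. The function names are hypothetical.

```python
import math

def first_layer_channels(i):
    """k = ceil(2^(2 + i/2)), computed exactly as ceil(sqrt(2^(4 + i)))."""
    n = 2 ** (4 + i)
    return math.isqrt(n - 1) + 1  # ceil(sqrt(n)) for any integer n >= 1

def block_channels(k):
    """Output channels of the 8 residual blocks: [k, k, 2k, 2k, 4k, 4k, 8k, 8k]."""
    return [k * mult for mult in (1, 1, 2, 2, 4, 4, 8, 8)]

# Assumed index range: i = 0, ..., 10 (11 architectures).
widths = [first_layer_channels(i) for i in range(11)]
print(widths)             # [4, 6, 8, 12, 16, 23, 32, 46, 64, 91, 128]
print(block_channels(4))  # [4, 4, 8, 8, 16, 16, 32, 32]
```

Under this assumption the width grows by a factor of $\sqrt{2}$ per architecture, so the parameter count roughly doubles from one architecture to the next.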

Two Layer ReLU Networks

We trained fully connected feedforward networks on the CIFAR-10, SVHN and MNIST datasets. For each dataset, we trained 13 architectures with sizes from $2^{3}$ to $2^{15}$, each time increasing the number of hidden units by a factor of 2. For each experiment, we trained the network using SGD with mini-batch size 64, momentum 0.9 and fixed step size 0.01 for MNIST and 0.001 for CIFAR-10 and SVHN. We did not use weight decay, dropout or batch normalization. For each experiment, we stopped the training when the cross-entropy loss reached 0.01 or when the number of epochs reached 1000. We use the reference line in the plots to differentiate the architectures that achieved 0.01 loss.
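The sweep and the stopping rule described above can be written down compactly. This is only a sketch of the configuration, not the training code; the names `hidden_sizes`, `step_size` and `should_stop` are hypothetical.

```python
# 13 architectures: hidden layer sizes 2^3, 2^4, ..., 2^15.
hidden_sizes = [2 ** exponent for exponent in range(3, 16)]

# Fixed SGD step size per dataset, as stated in the text.
step_size = {"MNIST": 0.01, "CIFAR-10": 0.001, "SVHN": 0.001}

def should_stop(cross_entropy, epoch, target_loss=0.01, max_epochs=1000):
    """Stop once the loss target is reached or the epoch budget is used up."""
    return cross_entropy <= target_loss or epoch >= max_epochs

print(len(hidden_sizes), hidden_sizes[0], hidden_sizes[-1])  # 13 8 32768
print(should_stop(0.5, 10))     # False
print(should_stop(0.009, 10))   # True
print(should_stop(0.5, 1000))   # True
```

Runs that stop only because of the epoch budget are the ones the reference line in the plots is meant to separate from those that reached the loss target.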

Evaluations

For each generalization bound, we have calculated the exact bound including the log-terms and constants. We set the margin to the $5$th percentile of the margin of data points. Since bounds in and are given for binary classification, we multiplied by factor $c$ and by factor $\sqrt{c}$ to make sure that the bound increases linearly with the number of classes (assuming that all output units have the same norm). Furthermore, since the reference matrices can be used in the bounds given in and , we used random initialization as the reference matrix. When plotting distributions, we estimate the distribution using standard Gaussian kernel density estimation.
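A minimal sketch of the margin selection is given below. It assumes the usual multi-class margin (score of the correct label minus the largest score among the other labels) and the nearest-rank convention for the percentile; other percentile conventions interpolate and can give slightly different values. The score table is a toy example.

```python
def margin(scores, label):
    """Score of the correct label minus the best score among the other labels."""
    others = [s for j, s in enumerate(scores) if j != label]
    return scores[label] - max(others)

def percentile_nearest_rank(values, q):
    """q-th percentile (integer q in 1..100) using the nearest-rank convention."""
    ordered = sorted(values)
    rank = max(1, -(-q * len(ordered) // 100))  # ceil(q * n / 100)
    return ordered[rank - 1]

# Toy outputs for 20 data points with 3 classes; the correct class is always 0.
all_scores = [[float(i), 0.0, -1.0] for i in range(1, 21)]
labels = [0] * 20

margins = [margin(s, y) for s, y in zip(all_scores, labels)]
gamma = percentile_nearest_rank(margins, 5)
print(gamma)  # 1.0
```

Choosing a low percentile rather than the minimum keeps the margin from being dictated by a handful of hard examples.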

A.2 Supplementary Figures

Figures 6 and 7 show the behavior of several measures on networks with different sizes trained on the SVHN and MNIST datasets respectively. The left panel of Figure 8 shows the over-parametrization phenomenon on the MNIST dataset, and the middle and right panels compare our generalization bound to others.

Appendix B Proofs

We start by stating a simple lemma which is a vector-contraction inequality for Rademacher complexities and relates the norm of a vector to the expected magnitude of its inner product with a vector of Rademacher random variables. We use the following technical result from Maurer in our proof.

The above lemma can be useful to get Rademacher complexities in multi-class settings. The lemma below bounds the Rademacher-like complexity term for linear operators with multiple outputs centered around a reference matrix. The proof is very simple and similar to that of linear separators. See for similar arguments.

$(i)$ follows from Jensen’s inequality. ∎

We next show that the Rademacher complexity of the class of networks defined in (5) and (4) can be decomposed to that of hidden units.

Given a training set $\mathcal{S}=\{\mathbf{x}_{i}\}_{i=1}^{m}$ and $\gamma>0$ , Rademacher complexity of the class $\mathcal{F}_{\mathcal{W}}$ defined in equations (5) and (4) is bounded as follows:

Let $\rho_{ij}={\left\lvert{\left\langle\mathbf{u}^{0}_{j},\mathbf{x}_{i}\right\rangle}\right\rvert}$ . We prove the lemma by showing the following statement by induction on $t$ :

The above statement holds trivially for the base case of $t=1$ by the definition of the Rademacher complexity (3). We now assume that it is true for any $t^{\prime}\leq t$ and prove it is true for $t^{\prime}=t+1$ .

The last inequality follows from the $\frac{\sqrt{2}}{\gamma}$ Lipschitzness of the ramp loss. The ramp loss is $1/\gamma$ Lipschitz with respect to each dimension but since the loss at each point only depends on score of the correct labels and the maximum score among other labels, it is $\frac{\sqrt{2}}{\gamma}$ -Lipschitz.
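For reference, the $1/\gamma$ Lipschitz property with respect to the margin can be checked directly. The sketch assumes the standard ramp loss as a function of the margin (equal to 1 for negative margins, 0 above $\gamma$, and linear in between), since the definition is not reproduced here; `ramp_loss` is an illustrative name.

```python
def ramp_loss(margin, gamma):
    """Standard ramp loss as a function of the classification margin."""
    if margin < 0:
        return 1.0
    if margin > gamma:
        return 0.0
    return 1.0 - margin / gamma

gamma = 0.5
print(ramp_loss(-1.0, gamma))  # 1.0
print(ramp_loss(0.25, gamma))  # 0.5
print(ramp_loss(1.0, gamma))   # 0.0

# Numerical check of the 1/gamma Lipschitz property on a grid of margins.
grid = [x / 100.0 for x in range(-100, 101)]
worst_slope = max(
    abs(ramp_loss(a, gamma) - ramp_loss(b, gamma)) / abs(a - b)
    for a in grid for b in grid if a != b
)
print(worst_slope <= 1.0 / gamma + 1e-9)  # True
```

The extra $\sqrt{2}$ in the text arises because the margin itself depends on two coordinates of the score vector, the correct label and the best competing label, so it is $\sqrt{2}$-Lipschitz in the Euclidean norm of the scores.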

Using the triangle inequality we can bound the first term in the above bound as follows.

We will now add and subtract the initialization $\mathbf{U}^{0}$ terms.

From equations (9), (10), (11) and Lemma 7 we get,

Hence the induction step at $t=m$ gives us:

Using Lemma 8, we can bound the right hand side of the upper bound on the Rademacher complexity given in Lemma 9:

B.2 Proof of Theorems 2 and 5

Therefore, we have $\alpha^{\prime}\in Q$ . Furthermore for any $\alpha\in Q$ , we have:

Therefore, to complete the proof, we only need to bound the size of the set $Q$ . The size of the set $Q$ is equal to the number of unique solutions of the problem $\sum_{i=1}^{D}z_{i}=K+D-1$ for positive integer variables $z_{i}$ , which is $\binom{K+D-2}{D-1}$ . ∎
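The counting step is the classical stars-and-bars argument, and it can be verified by brute force for small values. The sketch below reads the variables $z_{i}$ as positive integers (each at least 1), which is the reading under which the stated binomial coefficient is correct; the function name is illustrative.

```python
from itertools import product
from math import comb

def count_positive_solutions(D, K):
    """Number of (z_1, ..., z_D) with z_i >= 1 and z_1 + ... + z_D = K + D - 1."""
    target = K + D - 1
    return sum(
        1
        for z in product(range(1, target + 1), repeat=D)
        if sum(z) == target
    )

for D, K in [(2, 3), (3, 4), (4, 2)]:
    brute_force = count_positive_solutions(D, K)
    closed_form = comb(K + D - 2, D - 1)
    print(D, K, brute_force, closed_form)
```

For example, $D=3$ and $K=4$ gives 10 solutions, matching $\binom{5}{2}$.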

The proof of this lemma follows from using the result of Theorem 1 and taking a union bound to cover all possible values in the sets $\{\mathbf{V}\mid\left\lVert{\mathbf{V}}\right\rVert_{p,2}\leq C_{1}\}$ and $\{\mathbf{U}\mid\left\lVert{\mathbf{U}-\mathbf{U}^{0}}\right\rVert_{p,2}\leq C_{2}\}$ .

By Lemma 10, picking $\epsilon=((1+\mu)^{1/p}-1)$ , we can find a set of vectors, $\{\bm{\alpha}^{i}\}_{i=1}^{N_{p,h}}$ , where $K=\left\lceil\frac{h}{\mu}\right\rceil,N_{p,h}=\binom{K+h-2}{h-1}$ such that $\forall\mathbf{x},\left\lVert{\mathbf{x}}\right\rVert_{p}\leq C_{1}$ , $\exists 1\leq i\leq N_{p,h}$ , $x_{j}\leq\alpha^{i}_{j},\ \forall j\in[h]$ . Similarly, picking $\epsilon=((1+\mu)^{1/p}-1)$ , we can find a set of vectors, $\{\bm{\beta}^{i}\}_{i=1}^{N_{p,h}}$ , where $K=\left\lceil\frac{h}{\mu}\right\rceil,N_{p,h}=\binom{K+h-2}{h-1}$ such that $\forall\mathbf{x},\left\lVert{\mathbf{x}}\right\rVert_{p}\leq C_{2}$ , $\exists 1\leq i\leq N_{p,h}$ , $x_{j}\leq\beta^{i}_{j},\ \forall j\in[h]$ . ∎

This lemma can be proved by directly applying a union bound on Lemma 11 for every $C_{1}\in\left\{\frac{i}{h^{1/2-1/p}}\mid i\in\left[\left\lceil\frac{\gamma\sqrt{m}}{4}\right\rceil\right]\right\}$ and every $C_{2}\in\left\{\frac{i}{h^{1/2-1/p}\left\lVert{\mathbf{X}}\right\rVert_{F}}\mid i\in\left[\left\lceil\frac{\gamma\sqrt{m}}{4}\right\rceil\right]\right\}$ . For $\|\mathbf{V}^{\top}\|_{p,2}\leq\frac{1}{h^{1/2-1/p}}$ , we can use the bound where $C_{1}=1$ , and the additional constant $1$ in Eq. 14 will cover that. The same is true for the case of $\|\mathbf{U}\|_{p,2}\leq\frac{i}{h^{1/2-1/p}\left\lVert{\mathbf{X}}\right\rVert_{F}}$ . When either $h^{1/2-1/p}\|\mathbf{V}^{\top}\|_{p,2}$ or $h^{1/2-1/p}\left\lVert{\mathbf{X}}\right\rVert_{F}\|\mathbf{U}\|_{p,2}$ is larger than $\left\lceil\frac{\gamma\sqrt{m}}{4}\right\rceil$ , the second term in Eq. 14 is larger than 1, so the bound holds trivially. In the remaining cases, there exists $(C_{1},C_{2})$ such that $h^{1/2-1/p}C_{1}\leq h^{1/2-1/p}\|\mathbf{V}^{\top}\|_{p,2}+1$ and $h^{1/2-1/p}\left\lVert{\mathbf{X}}\right\rVert_{F}C_{2}\leq h^{1/2-1/p}\left\lVert{\mathbf{X}}\right\rVert_{F}\|\mathbf{U}\|_{p,2}+1$ . Finally, we have $\frac{\gamma\sqrt{m}}{4}\geq 1$ , since otherwise the second term in Eq. 14 is larger than 1. Therefore, $\left\lceil\frac{\gamma\sqrt{m}}{4}\right\rceil\leq\frac{\gamma\sqrt{m}}{4}+1\leq\frac{\gamma\sqrt{m}}{2}$ . ∎

We next use the general results in Lemma 12 to give specific results for the case $p=2$ .

To prove the lemma, we directly upper bound the generalization bound given in Lemma 12 for $p=2$ and $\mu=\frac{3\sqrt{2}}{4}-1$ . For this choice of $\mu$ and $p$ , we have $4(\mu+1)^{2/p}\leq 3\sqrt{2}$ and $\ln N_{p,h}$ is bounded as follows:

The next lemma states a generalization bound for any $p\geq 2$ , which is looser than Lemma 13 for $p=2$ due to extra constants and logarithmic factors.

To prove the lemma, we directly upper bound the generalization bound given in Lemma 12 for $\mu=e^{p}-1$ . For this choice of $\mu$ and $p$ , we have $(\mu+1)^{2/p}=e^{2}$ . Furthermore, if $\mu\geq h$ , then $K=1$ and $\ln N_{p,h}=0$ ; otherwise $\ln N_{p,h}$ is bounded as follows:

Since the right hand side of the above inequality is greater than zero for $\mu\geq h$ , it is true for every $\mu>0$ . ∎
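The constants used in the last two proofs can be checked numerically. The sketch below evaluates the cover size from the formulas $K=\left\lceil\frac{h}{\mu}\right\rceil$ and $N_{p,h}=\binom{K+h-2}{h-1}$ stated earlier in this section; `cover_size` is an illustrative helper and the values of $h$ and $p$ are arbitrary.

```python
import math

def cover_size(h, mu):
    """N_{p,h} = C(K + h - 2, h - 1) with K = ceil(h / mu)."""
    K = math.ceil(h / mu)
    return math.comb(K + h - 2, h - 1)

# Choice used for p = 2: mu = 3*sqrt(2)/4 - 1, so 4 * (mu + 1)^(2/p) = 3*sqrt(2).
mu_2 = 3 * math.sqrt(2) / 4 - 1
print(4 * (mu_2 + 1) ** (2 / 2), 3 * math.sqrt(2))

# Choice used for general p: mu = e^p - 1, so (mu + 1)^(2/p) = e^2.
p = 3
mu_p = math.exp(p) - 1
print((mu_p + 1) ** (2 / p), math.exp(2))

# When mu >= h we get K = 1, hence a single cover element and ln N = 0.
h = 8
print(cover_size(h, mu_p))  # 1
print(cover_size(h, 4.0))   # 8
```

A single cover element means the union bound over the cover costs nothing in that regime, which is why the logarithmic term vanishes for $\mu\geq h$.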

B.3 Proof of the Lower Bound

We will pick $\mathbf{V}=\bm{\alpha}^{\top}=[\alpha_{1}\ldots\alpha_{2^{k}}]$ for every $\bm{\xi}$ , and $\mathcal{S}=\{\mathbf{x}_{i}\}_{i=1}^{m}$ , where $\mathbf{x}_{i}:=\mathbf{e}_{\left\lceil\frac{i}{n}\right\rceil}$ . That is, the whole dataset is divided into $2^{k}$ groups, where each group has $n$ copies of a different element of the standard orthonormal basis.
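The dataset used in this construction is easy to write out explicitly. The sketch below assumes $m=n\cdot 2^{k}$ samples and input dimension $2^{k}$, which is what "$2^{k}$ groups of $n$ copies" implies; `build_dataset` is a hypothetical helper.

```python
def build_dataset(k, n):
    """x_i = e_{ceil(i / n)} for i = 1, ..., n * 2^k (standard basis vectors)."""
    dim = 2 ** k
    data = []
    for i in range(1, dim * n + 1):
        group = -(-i // n)      # ceil(i / n), a 1-based group index
        x = [0.0] * dim
        x[group - 1] = 1.0      # the group-th standard basis vector
        data.append(x)
    return data

k, n = 2, 3
S = build_dataset(k, n)
print(len(S))        # 12 samples in 4 groups of 3
print(S[0], S[3])    # e_1 and e_2

# Every sample has unit norm, so the squared Frobenius norm of the data is m.
frobenius_sq = sum(v * v for x in S for v in x)
print(frobenius_sq)  # 12.0
```

Because all samples are unit vectors, $\left\lVert{\mathbf{X}}\right\rVert_{F}=\sqrt{m}$ for this dataset, which makes the data-dependent factor in the bounds easy to track.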

Thus without loss of generality, we can assume $\forall i\in[2^{k}]$ , $\left\langle\mathbf{s},\mathbf{f}_{i}\right\rangle\geq 2^{-\frac{k}{2}-1}\bm{\alpha}^{\top}\bm{\beta}$ by flipping the signs of $\mathbf{f}_{i}$ .

For any $\bm{\xi}\in\{-1,1\}^{n}$ , let $\textrm{Diag}(\bm{\beta})$ be the square diagonal matrix with its diagonal equal to $\bm{\beta}$ and $\widetilde{\mathbf{F}}(\bm{\xi})$ be the following:

and we will choose $\mathbf{U}(\bm{\xi})$ as $\textrm{Diag}(\bm{\beta})\times\widetilde{\mathbf{F}}(\bm{\xi})$ .

The last inequality uses the previous assumption, that $\forall i\in[2^{k}]$ , $\left\langle\mathbf{s},\mathbf{f}_{i}\right\rangle\geq 2^{-\frac{k}{2}-1}\bm{\alpha}^{\top}\bm{\beta}$ .