SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

Introduction

Neural networks have achieved remarkable performance in many machine learning tasks. Although recently there have been numerous theoretical contributions to understand their success, it is still largely unexplained and remains a mystery. In particular, it is not known why in the over-parameterized setting, in which there are far more parameters than training points, stochastic gradient descent (SGD) can learn networks that generalize well, as been observed in practice (Neyshabur et al., 2014; Zhang et al., 2016).

In such over-parameterized settings, the loss function can contain multiple global minima that generalize poorly. Therefore, learning can in principle lead to models with low training error, but high test error. However, as often observed in practice, SGD is in fact able to find models with low training error and good generalization performance. This suggests that the optimization procedure, which depends on the optimization method (SGD) and the training data, introduces some form of inductive bias which directs it towards a low complexity solution. Thus, in order to explain the success of neural networks, it is crucial to characterize this inductive bias and understand what are the guarantees for generalization of over-parameterized neural networks.

In this work, we address these problems in a binary classification setting where SGD optimizes a two-layer over-parameterized network with the goal of learning a linearly separable function. Clearly, an over-parameterized network is not necessary for classifying linearly separable data, since this is possible with linear classifiers (e.g., with the Perceptron algorithm) which also have good generalization guarantees (Shalev-Shwartz & Ben-David, 2014). But, the key question which we address here is whether a large network will overfit in such a case or not. As we shall see, it turns out that although the networks we consider are rich enough to considerably overfit the data, this does not happen when SGD is used for optimization. In other words, SGD introduces an inductive bias which allows it to learn over-parameterized networks that can generalize well. Therefore, this setting serves as a good test bed for studying the effect of over-paramaterization.

Problem Formulation

where $\sigma$ is a non-linear activation function applied element-wise.

We define the empirical loss over $S$ to be the mean hinge-loss:

In our setting we fix the second layer to be $\boldsymbol{v}=(\overbrace{v\dots v}^{k},\overbrace{-v\dots-v}^{k})$ such that $v>0$ and only learn the weight matrix $W$ . We will consider only positive homogeneous activations (Leaky ReLU and ReLU) and thus the network we consider with $2k$ hidden neurons is as expressive as networks with $k$ hidden neurons and any vector $\boldsymbol{v}$ in the second layer. For example, consider a network with $k$ hidden neurons with positive homogeneous activations, where each hidden neuron $i$ has incoming weight vector $\boldsymbol{w}_{i}$ and outgoing weight $v_{i}$ . Then we can express this network with the network defined in Eq. 1 as follows. For each $i$ such that $v_{i}>0$ , we define a neuron in the new network with incoming weight vector $\frac{\boldsymbol{w}_{i}}{v_{i}}$ and outgoing weight $1$ . Similarly, if $v_{i}<0$ , we define a neuron in the new network with incoming weight vector $\frac{\boldsymbol{u}_{i}}{-v_{i}}$ and outgoing weight $-1$ . For all other neurons we define an incoming zero weight vector. Due to the positive homogeneity, it follows that this network is equivalent to the network with $k$ hidden neurons. Hence, we can fix the second layer without limiting the expressive power of the two-layer network. Although it is relatively simpler than the case where the second layer is not fixed, the effect of over-parameterization can be studied in this setting as well.

Hence, the objective of the optimization problem is to find:

We focus on the case where $L_{S}\left(W\right)$ is minimized using an SGD algorithm with batch of size $1$ , and where only the weights of the first layer (namely $W$ ) are updated. At iteration $t$ , SGD randomly chooses a point $(\boldsymbol{x}_{t},y_{t})\in S$ and updates the weights with a constant learning rate $\eta$ . Formally, let $\mathcal{W}_{t}=(W_{t},\boldsymbol{v})$ be the parameters at iteration $t$ , then the update at iteration $t$ is given by

Main Result

We now present our main results, for the case where $\sigma$ is the Leaky ReLU function. Namely, $\sigma(z)=\max\{\alpha z,z\}$ where $0<\alpha<1$ .

First, we show that SGD can find a global optimum of $L_{S}\left(W\right)$ . Note that this is by no means obvious, since $L_{S}\left(W\right)$ is a non-convex function (see Proposition 5.1). Specifically, we show that SGD converges to such an optimum while making at most:

non-zero update steps (see Corollary 5.2). In particular, the bound is independent of the number of neurons $2k$ . To the best of our knowledge, this is the first convergence guarantee of SGD for neural networks with the hinge loss. Furthermore, we prove a lower bound of $\Omega\Big{(}\frac{\|\boldsymbol{w}^{*}\|}{\eta}+\|\boldsymbol{w}^{*}\|^{2}\Big{)}$ for the number of non-zero updates (see Theorem 2).

Next, we address the question of generalization. As noted earlier, since the network is large, it can in principle overfit. Indeed, there are parameter settings for which the network will have arbitrarily bad test error (see Section 6.2). However, as we show here, this will not happen in our setting where SGD is used for optimization. In Theorem 4 we use a compression bound to show that the model learned by SGD will have a generalization error of $O\Big{(}\frac{M\log{n}}{n}\Big{)}$ .See discussion in Remark 5 on the dependence of the generalizaion bound on $\eta$ . This implies that for any network size, given a sufficiently large number of training samples that is independent of the network size, SGD converges to a global minimum with good generalization behaviour. This is despite the fact that for sufficiently large $k$ there are multiple global minima which overfit the training set (see Section 6.2). This implies that SGD is biased towards solutions that can be expressed by a small set of training points and thus generalizes well.

To summarize, when the activation is the Leaky ReLU and the data is linearly separable, we provide provable guarantees of optimization, generalization and expressive power for over-parameterized networks. This allows us to provide a rigorous explanation of the performance of over-parameterized networks in this setting. This is a first step in unraveling the mystery of the success of over-parameterized networks in practice.

We further study the same over-parameterized setting where the non-linear activation is the ReLU function (i.e., $\sigma(z)=\max\{0,z\}$ ). Surprisingly, this case has different properties. Indeed, we show that the loss contains spurious local minima and thus the previous convergence result of SGD to a global minimum does not hold in this case. Furthermore, we show an example where over-parameterization is favorable from an optimization point of view. Namely, for a sufficiently small number of hidden neurons, SGD will converge to a local minimum with high probability, whereas for a sufficiently large number of hidden neurons, SGD will converge to a global minimum with high probability.

The paper is organized as follows. We discuss related work in Section 4 . In Section 5 we prove the convergence bounds, in Section 6 we give the generalization guarantees and in Section 7 the results for the ReLU activation. We conclude our work in Section 8.

Related Work

The generalization performance of neural networks has been studied extensively. Earlier results (Anthony & Bartlett, 2009) provided bounds that depend on the VC dimension of the network, and the VC dimension was shown to scale linearly with the number of parameters. More recent works, study alternative notions of complexity, such as Rademacher compexity (Bartlett & Mendelson, 2002; Neyshabur et al., 2015; Bartlett et al., 2017; Kawaguchi et al., 2017), Robustness (Xu & Mannor, 2012) and PAC-Bayes (Neyshabur et al., 2017b). However, all of these notions fail to explain the generalization performance of over-parameterized networks (Neyshabur et al., 2017a). This is because these bounds either depend on the number of parameters or on the number of hidden neurons (directly or indirectly via norms of the weights) and become loose when these quantities become sufficiently large. The main disadvantage of these approaches, is that they do not depend on the optimization method (e.g., SGD), and thus do not capture its role in the generalization performance. In our work, we give generalization guarantees based on a compression bound that follows from convergence rate guarantees of SGD, and thus take into account the effect of the optimization method on the generalization performance. This analysis results in generalization bounds that are independent of the network size and thus hold for over-parameterized networks.

In parallel to our work, Kawaguchi et al. (2017) give generalization bounds for neural networks that are based on Rademacher complexity. Here too, the analysis does not take into account the optimization algorithm and the bound depends on the norm of the weights. Therefore, the bound can become vacuous for over-parameterized networks.

Stability bounds for SGD in non-convex settings were given in Hardt et al. (2016); Kuzborskij & Lampert (2017). However, their results hold for smooth loss functions, whereas the loss function we consider is not smooth due to the non-smooth activation functions (Leaky ReLU, ReLU).

Other works have studied generalization of neural networks in a model recovery setting, where assumptions are made on the underlying model and the input distribution (Brutzkus & Globerson, 2017; Zhong et al., 2017; Li & Yuan, 2017; Du et al., 2017; Tian, 2017). However, in their works the neural networks are not over-parameterized as in our setting.

Soltanolkotabi et al. (2017) analyze the optimization landscape of over-parameterized networks and give convergence guarantees for gradient descent to a global minimum when the data follows a Gaussian distribution and the activation functions are differentiable. The main difference from our work is that they do not provide generalization guarantees for the resulting model. Furthermore, we do not make any assumptions on the distribution of the feature vectors.

In a recent work, Nguyen & Hein (2017) show that if training points are linearly separable then under assumptions on the rank of the weight matrices of a fully-connected neural network, every critical point of the loss function is a global minimum. Their work extends previous results in Gori & Tesi (1992); Frasconi et al. (1997); Yu & Chen (1995). Our work differs from these in several respects. First, we show global convergence guarantees of SGD, whereas they only analyze the optimization landscape, without direct implications on performance of optimization methods. Second, we provide generalization bounds and their focus is solely on optimization. Third, we consider non-differentiable activation functions (Leaky ReLU, ReLU) while their results hold only for continuously differentiable activation functions.

Convergence Analysis

In this section we consider the setting of Section 2 with a leaky ReLU activation function. In Section 5.1 we show SGD will converge to a globally optimal solution, and analyze the rate of convergence. In Section 5.1 we also provide lower bounds on the rate of convergence. The results in this section are interesting for two reasons. First, they show convergence of SGD for a non-convex objective. Second, the rate of convergence results will be used to derive generalization bounds in Section 6.

Before proving convergence of SGD to a global minimum, we show that every critical point is a global minimum and the loss function is non-convex. The proof is deferred to the appendix.

$L_{S}\left(W\right)$ satisfies the following properties: 1) Every critical point is a global minimum. 2) It is non-convex.

We assume that SGD is initialized such that the norms of all rows of $W_{0}$ are upper bounded by some constant $R>0$ . Namely for all $1\leq i\leq k$ it holds that:

Define $M_{k}:=\frac{\|\boldsymbol{w}^{*}\|^{2}}{\alpha^{2}}+\frac{\|\boldsymbol{w}^{*}\|^{2}}{k\eta v^{2}\alpha^{2}}+\frac{\sqrt{R(8k^{2}\eta^{2}v^{2}+8\eta k)}{\|\boldsymbol{w}^{*}\|}^{1.5}}{2k(\eta v\alpha)^{1.5}}+\frac{2R\|\boldsymbol{w}^{*}\|}{\eta v\alpha}$ . We give an upper bound on the number of non-zero updates SGD makes until convergence to a critical point (which is a global minimum by Proposition 5.1). The result is summarized in the following theorem.

SGD converges to a global minimum after performing at most $M_{k}$ non-zero updates.

To obtain a simpler bound than the one obtained in Theorem 1, we use the fact that we can set $R,v$ arbitrarily, and choose:This initialization resembles other initializations that are used in practice (Bengio, 2012; Glorot & Bengio, 2010)

Then by Theorem 1 we get the following. The derivation is given in the Appendix (Section A.1.3).

Let $R=v=\frac{1}{\sqrt{2k}}$ , then SGD converges to a global minimum after perfoming at most $M_{k}=\frac{\|\boldsymbol{w}^{*}\|^{2}}{\alpha^{2}}+O\Bigg{(}\frac{\|\boldsymbol{w}^{*}\|^{2}}{\min\{\eta,\sqrt{\eta}\}}\Bigg{)}$ non-zero updates.

Thus the bound consists of two terms, the first which only depends on the margin (via $\|\boldsymbol{w}^{*}\|$ ) and the second which scales inversely with $\eta$ . More importantly, the bound is independent of the network size.

2 Lower Bound

We use the same notations as in Section 5.1. The lower bound is given in the following theorem, which is proved in the Appendix (Section A.1.4).

Assume SGD is initialized according to Eq. 6, then for any $d$ there exists a sequence of linearly separable points on which SGD will make at least $\Omega\Big{(}\frac{\|\boldsymbol{w}^{*}\|}{\eta}+\|\boldsymbol{w}^{*}\|^{2}\Big{)}$ mistakes.

Although this lower bound is not tight, it does show that the upper bound in Corollary 5.2 cannot be much improved. Furthermore, the example presented in the proof of Theorem 2, demonstrates that $\eta\rightarrow\infty$ can be optimal in terms of optimization and generalization, i.e., SGD makes the minimum number of updates ( $\|\boldsymbol{w}^{*}\|^{2}$ ) and the learned model is equivalent to the true classifier $\boldsymbol{w}^{*}$ . We will use this observation in the discussion on the dependence of the generalization bound in Theorem 4 on $\eta$ (see Remark 5).

Generalization

In this section we give generalization guarantees for SGD learning of over-parameterized networks with Leaky ReLU activations. These results are obtained by combining Theorem 1 with a compression generalization bound (see Section 6.1). In Section 6.2 we show that over-parameterized networks are sufficiently expressive to contain global minima that overfit the training set. Taken together, these results show that although there are models that overfit, SGD effectively avoids these, and finds the models that generalize well.

Given the bound in Theorem 1 we can invoke compression bounds for generalization guarantees with respect to the - $1$ loss (Littlestone & Warmuth, 1986) . Denote by $N_{k}$ a two-layer neural network with $2k$ hidden neurons defined in Section 1 where $\sigma$ is the Leaky ReLU. Let $SGD_{k}(S,W_{0})$ be the output of running SGD for training this network on a set $S$ and initialized with $W_{0}$ that satisfies Eq. 5. Define $\mathcal{H}_{k}$ to be the set of all possible hypotheses that $SGD_{k}(S,W_{0})$ can output for any $S$ and $W_{0}$ which satisfies Eq. 5.

Let $n\geq 2c_{k}$ , then with probability of at least $1-\delta$ over the choice of $S$ and $W_{0}$ we have

Since $L_{V}^{0-1}(SGD_{k}(S,W_{0}))=0$ holds at a global minimum of $L_{S}$ , then by Combining the results of Corollary 5.2 and Theorem 3, we get the following theorem.

If $n\geq 2c_{k}$ and assuming the initialization defined in Eq. 6, then with probability at least $1-\delta$ over the choice of $S$ and $W_{0}$ , SGD converges to a global minimum of $L_{S}$ with - $1$ test error at most

Thus for fixed $\|\boldsymbol{w}^{*}\|$ and $\eta$ we obtain a sample complexity guarantee that is independent of the network size (See Remark 5 for a discussion on the dependence of the bound on $\eta$ ). This is despite the fact that for sufficiently large $k$ , the network has global minima that have arbitrarily high test errors, as we show in the next section. Thus, SGD and the linearly separable data introduce an inductive bias which directs SGD to the global minimum with low test error while avoiding global minima with high test error. In Figure 1 we demonstrate this empirically for a linearly separable data set (from a subset of MNIST) learned using over-parameterized networks. The figure indeed shows that SGD converges to a global minimum which generalizes well.

The generelization bound in Eq. 7 holds for $\eta\rightarrow\infty$ , which is unique for the setting that we consider, and may seem surprising, given that a choice of large $\eta$ often fails in practice. Furthermore, the bound is optimal for $\eta\rightarrow\infty$ . To support this theoretical result, we show in Theorem 2 an example where indeed $\eta\rightarrow\infty$ is optimal in terms of the number of updates and generalization. On the other hand, we note that in practice, it may not be optimal to use large $\eta$ in our setting, since this bound results from a worst-case analysis of a sequence of examples encountered by SGD. Finally, the important thing to note is that the bound holds for any $\eta$ , and is thus applicable to realistic applications of SGD.

2 Expressiveness

Theorem 6 implies that for sufficiently large networks, the optimization problem (2) can have arbitrarely bad global minima with respect to a given test set, i.e., ones which do not generalize well on a given test set.

ReLU- Success and Failure Cases

In this section we consider the same setting as in section 5, but with the ReLU activation function $\sigma(x)=\max\{0,x\}$ . In Section 7.1 we show that the loss function contains arbitrarely bad local minima. In Section 7.2 we give an example where for a sufficiently small network, with high probability SGD will converge to a local minimum. On the other hand, for a sufficiently large network, with high probability SGD will converge to a global minimum.

The result is summarized in the following theorem and the proof is deferred to the Appendix (Section A.3.1). The main idea is to construct a network with weight paramater $W$ such that for at least $\frac{|S|}{2}$ points $(\boldsymbol{x},y)\in S$ it holds that $\left\langle\boldsymbol{w},\boldsymbol{x}\right\rangle<0$ for each neuron with weight vector $\boldsymbol{w}$ . Furthermore, the remaining points satisfy $yN_{\mathcal{W}}(\boldsymbol{x})>1$ and thus the gradient is zero and $L_{S}(W)>\frac{1}{2}$ .

2 Orthogonal vectors - simple case analysis

The main result of this section is given in the following theorem. The proof is given in the Appendix (Section A.3.2). The main observation is that the convergence to non-global minimum depends solely on the initialization and occurs if and only if there exists a point $\boldsymbol{x}$ such that for all neurons, the corresponding initialized weight vector $\boldsymbol{w}$ satisfies $\left\langle\boldsymbol{w},\boldsymbol{x}\right\rangle\leq 0$ .

Fix $\delta>0$ and assume we run SGD with examples from $S=\{\boldsymbol{e}_{1}\dots\boldsymbol{e}_{d}\}\times\{1\}$ . If $k\leq\log_{2}(\frac{d}{-\ln(\delta)})$ , then with probability of at least $1-\delta$ , SGD will converge to a non global minimum point. On the other hand, if $k\geq\log_{2}(\frac{2d}{\delta})$ , then with probability of at least $1-\delta$ , SGD will converge to a global minimum point after $\lceil\max\{\frac{dC}{\eta},\frac{d}{\eta}\}\rceil$ iterations.

Note that in the first part of the theorem, we can make the basin of attraction of the non-global minimum exponentially large by setting $\delta=e^{-\alpha d}$ for $\alpha\leq\frac{1}{2}$ .

Conclusion

Understanding the performance of over-parameterized neural networks is essential for explaining the success of deep learning models in practice. Despite a plethora of theoretical results for generalization of neural networks, none of them give guarantees for over-parameterized networks. In this work, we give the first provable guarantees for the generalization performance of over-parameterized networks, in a setting where the data is linearly separable and the network has Leaky ReLU activations. We show that SGD compresses its output when learning over-parameterized networks, and thus exhibits good generalization performance.

The analysis for networks with Leaky ReLU activations does not hold for networks with ReLU activations, since in this case the loss contains spurious local minima. However, due to the success of over-parameterized networks with ReLU activations in practice, it is likely that similar results hold here as well. It would be very interesting to provide convergence guarantees and generalization bounds for this case. Another direction for future work is to show that similar results hold under different assumptions on the data.

This research is supported by the Blavatnik Computer Science Research Fund and the European Research Council (TheoryDL project).

References

Appendix A Appendix

Otherwise, if $yN_{\mathcal{W}}{}(\boldsymbol{x})\geq 1$ , then the gradient vanishes and thus

It follows that if there exists $(\boldsymbol{x},y)\in S$ , such that $yN_{\mathcal{W}}{}(\boldsymbol{x})<1$ , then we have

and thus $\frac{\partial}{\partial\vec{W}}L_{S}(\vec{W})\neq 0$ . Therefore, for any critical point it holds that $yN_{\mathcal{W}}{}(\boldsymbol{x})\geq 1$ for all $(\boldsymbol{x},y)\in S$ , which implies that it is a global minimum.

For simplicity consider the function $f_{\boldsymbol{x}}(\boldsymbol{w},\boldsymbol{u})=\sigma(\left\langle\boldsymbol{w},\boldsymbol{x}\right\rangle)-\sigma(\left\langle\boldsymbol{u},\boldsymbol{x}\right\rangle)$ for $\boldsymbol{x}\neq 0$ . Define $\boldsymbol{w}_{1}=\boldsymbol{w}_{2}=\boldsymbol{u}_{1}=\boldsymbol{x}$ and $\boldsymbol{u}_{2}=-\boldsymbol{x}$ . Then

and thus $f_{\boldsymbol{x}}(\frac{\boldsymbol{w}_{1}+\boldsymbol{w}_{2}}{2},\frac{\boldsymbol{u}_{1}+\boldsymbol{u}_{2}}{2})>\frac{1}{2}f_{\boldsymbol{x}}(\boldsymbol{w}_{1},\boldsymbol{u}_{1})+\frac{1}{2}f_{\boldsymbol{x}}(\boldsymbol{w}_{2},\boldsymbol{u}_{2})$ which implies that the function is not convex.

A.1.2 Proof of Theorem 1

Then, from Cauchy-Schwartz inequality we have

Since the update at iteration $t$ is non-zero, we have $y_{t}N_{t-1}(\boldsymbol{x}_{t})<1$ and the update rule is given by

where $p_{t}^{(i)}=1$ if $\left\langle\boldsymbol{w}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle\geq 0$ and $p_{t}^{(i)}=\alpha$ otherwise. Similarly $q_{t}^{(i)}=1$ if $\left\langle\boldsymbol{u}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle\geq 0$ and $q_{t}^{(i)}=\alpha$ otherwise. It follows that:

where the second inequality follows since $y_{t}v\left(\sum_{i=1}^{k}\left\langle\boldsymbol{w}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle p_{t}^{(i)}-\sum_{i=1}^{k}\left\langle\boldsymbol{u}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle q_{t}^{(i)}\right)=y_{t}N_{t-1}(\boldsymbol{x}_{t})<1$ . Using the above recursively, we obtain:

where the inequality follows since $\left\langle y_{t}\boldsymbol{x}_{t},\boldsymbol{w}^{*}\right\rangle\geq 1$ . This implies that

By combining equations Eq. 8, Eq. 10 and Eq. 11 we get,

Using $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ the above implies,

Since $\|\boldsymbol{w}^{(i)}_{0}\|,\|\boldsymbol{u}^{(i)}_{0}\|\leq R$ we have $G(W_{0})\leq\sqrt{2k}R$ . Noting that $\|\vec{W}^{*}\|=\sqrt{2k}\|\boldsymbol{w}^{*}\|$ we get,

where $a=2k\eta v\alpha$ , $b=\sqrt{(4k^{2}\eta^{2}v^{2}+4\eta k)}\|\boldsymbol{w}^{*}\|$ and $c=4kR\|\boldsymbol{w}^{*}\|$ . By inspecting the roots of the parabola $P(x)=x^{2}-\frac{b}{a}x-\frac{c}{a}$ we conclude that

A.1.3 Proof of Corollary 5.2

Since $\frac{R}{v}=1$ , we have by Theorem 1 and the inequality $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ ,

A.1.4 Proof of Theorem 2

We will prove a more general theorem. Theorem 2 follows by setting $R=v=\frac{1}{\sqrt{2k}}$ .

For any $d$ there exists a sequence of linearly separable points on which SGD will make at least

Define a sequence $\mathcal{S}$ of size $d$ ,

for all $1\leq i\leq k$ . Note that $\|\boldsymbol{w}^{(i)}_{0}\|,\|\boldsymbol{u}^{(i)}_{0}\|\leq R$ for all $1\leq i\leq k$ .

Since $\boldsymbol{w}^{(i)}_{0}=\boldsymbol{w}^{(j)}_{0}$ and $\boldsymbol{u}^{(i)}_{0}=\boldsymbol{u}^{(j)}_{0}$ for all $i\neq j$ , we have by induction that $\boldsymbol{w}^{(i)}_{t}=\boldsymbol{w}^{(j)}_{t}$ and $\boldsymbol{u}^{(i)}_{t}=\boldsymbol{u}^{(j)}_{t}$ for all $i\neq j$ and $t>0$ . Hence, we will denote $\boldsymbol{w}_{t}=\boldsymbol{w}^{(i)}_{t}$ and $\boldsymbol{u}_{t}=\boldsymbol{u}^{(i)}_{t}$ for all $1\leq i\leq d$ and $t>0$ . Then for all $1\leq j\leq d$ we have $N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})=kv\sigma^{\prime}(\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{j}\right\rangle)\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{j}\right\rangle-kv\sigma^{\prime}(\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{j}\right\rangle)\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{j}\right\rangle$ .

Since at the global minumum $\mathcal{N}_{W_{t},\boldsymbol{v}}(\boldsymbol{e}_{j})\geq 1$ for all $1\leq j\leq d$ , it follows that a necessary condition for convergence to a global minimum is that there exists an iteration $t$ in which either $kv\sigma^{\prime}(\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle)\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2}$ or $-kv\sigma^{\prime}(\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle)\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2}$ . Equivalently, either $\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2kv}$ or $\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle\leq-\frac{1}{2\alpha kv}$ .

Since $\left\langle\boldsymbol{w}_{0},\boldsymbol{e}_{d}\right\rangle=-\frac{R}{\sqrt{d}}$ , then by the gradient updates (Eq. 9) it follows that after at least $\frac{R}{\eta v\alpha\sqrt{d}}$ copies of $\mathcal{S}$ , or equivalently, after at least $\frac{Rd}{\eta v\alpha\sqrt{d}}$ iterations we will have $0\leq\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\leq\eta v\alpha$ . Then, after at least $\frac{d\min\{\frac{1}{2kv}-\eta v\alpha,0\}}{\eta v}$ iterations we have $\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2kv}$ . Thus, in total, after at least $\frac{R\|\boldsymbol{w}^{*}\|}{\eta v\alpha}+\min\{\frac{\|\boldsymbol{w}^{*}\|^{2}}{2\eta kv^{2}}-\alpha\|\boldsymbol{w}^{*}\|^{2},0\}$ iterations, we have $\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2kv}$ .

By the same reasoning, we have $\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle\leq-\frac{1}{2\alpha kv}$ after at least $\frac{R\|\boldsymbol{w}^{*}\|}{\eta v}+\min\{\frac{\|\boldsymbol{w}^{*}\|^{2}}{2\alpha^{2}\eta kv^{2}}-\frac{\|\boldsymbol{w}^{*}\|^{2}}{\alpha},0\}$ iterations. Finally, SGD must update on at least $d$ points in order to converge to the global minimum. The claim now follows. ∎

A.2 Missing Proofs for Section 6

By Theorem 30.2 and Corollary 30.3 in Shalev-Shwartz & Ben-David (2014), for $n\geq 2c_{k}$ we have that with probability of at least $1-\delta$ over the choice of $S$

A.2.2 Proof of Theorem 6

We can easily extend Theorem 8 in (Soudry & Hoffer, 2017) to hold for labels in $\{-1,1\}$ . By the theorem we can construct networks $N_{\mathcal{W}_{1}}$ and $N_{\mathcal{W}_{2}}$ such that for all $i$ :

$N_{\mathcal{W}_{1}}(\boldsymbol{x}_{i})=1$ if $y_{i}=1$ and $N_{\mathcal{W}_{1}}(\boldsymbol{x}_{i})=0$ otherwise.

$N_{\mathcal{W}_{2}}(\boldsymbol{x}_{i})=1$ if $y_{i}=-1$ and $N_{\mathcal{W}_{2}}(\boldsymbol{x}_{i})=0$ otherwise.

A.3 Missing Proofs for Section 7

There exists $\alpha>0$ such that for each $(\boldsymbol{x},y)\in S$ we have $|\left\langle\boldsymbol{x},\hat{\boldsymbol{w}}\right\rangle|>\alpha$ .

$\#\{(\boldsymbol{x},y)\in S:~{}\left\langle\hat{\boldsymbol{w}},\boldsymbol{x}\right\rangle<0\}>\frac{1}{2}|S|$ .

we can choose $\hat{\boldsymbol{w}}$ and $\alpha=\frac{\beta}{2}$ and we are done. Otherwise, choosing $-\hat{\boldsymbol{w}}$ and $\alpha=\frac{\beta}{2}$ satisfies all the assumptions of the lemma. ∎

Let $(\boldsymbol{x},y)\in S$ be an arbitrary example. If $\left\langle\hat{\boldsymbol{w}},\boldsymbol{x}\right\rangle>\alpha$ , then

Therefore $yN_{\mathcal{W}}(\boldsymbol{x})>1$ , so we get zero loss for this example, and therefore the gradient of the loss will also be zero. If, on the other hand, $\left\langle\hat{\boldsymbol{w}},\boldsymbol{x}\right\rangle<-\alpha$ , then

In this case the loss on the example would be $\max\{1-yN_{\mathcal{W}}(\boldsymbol{x}),0\}=1$ , but the gradient will also be zero. Along with assumption 2, we would conclude that:

A.3.2 Proof of Theorem 8

Denote $W_{t}=[\boldsymbol{w}^{(1)}_{t}\dots\boldsymbol{w}^{(k)}_{t}\boldsymbol{u}^{(1)}_{t}\dots\boldsymbol{u}^{(k)}_{t}]$ and define $K_{t}=\{\boldsymbol{e}_{j}:~{}\forall_{i\in[k]}\left\langle\boldsymbol{w}^{(i)}_{t},\boldsymbol{e}_{j}\right\rangle\leq 0\}$ . We first prove the following lemma.

Let $\boldsymbol{e}_{j}$ be the example seen in time $t$ . If $N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})\geq 1$ then there is no update and we are done. Otherwise, if $\boldsymbol{e}_{j}\in K_{t}$ then for each $i\in[k]$ we have $\frac{\partial}{\partial\boldsymbol{w}^{(i)}_{t}}N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})=0$ and therefore the update does not change the value of $\boldsymbol{w}^{(i)}_{t}$ , and thus $K_{t+1}=K_{t}$ . If $\boldsymbol{e}_{j}\notin K_{t}$ then there exists $i\in[k]$ such that $\left\langle\boldsymbol{w}^{(i)}_{t},\boldsymbol{e}_{j}\right\rangle>0$ . In that case, we update $\boldsymbol{w}^{(i)}_{t+1}\leftarrow\boldsymbol{w}^{(i)}_{t}+\eta\boldsymbol{e}_{j}$ . Now, note that

We can now prove the theorem. For each $j\in[d]$ , by the symmetry of the initialization, with probability $\frac{1}{2}$ over the initialization of $\boldsymbol{w}^{(i)}_{0}$ , we get that $\left\langle\boldsymbol{w}^{(i)}_{0},\boldsymbol{e}_{j}\right\rangle\leq 0$ . Since all $\boldsymbol{w}_{i}$ ’s are initialized independently, we get that:

Now, assuming $k\leq\log_{2}(\frac{d}{-\ln(\delta)})$ , from the independence of the initialization of $\boldsymbol{w}^{(i)}_{0}$ ’s coordinates we get

On the other hand, assuming $k\geq\log_{2}(\frac{d}{\delta})$ , using the union bound we get:

and therefore $N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})>1$ , which implies that $L_{\{(\boldsymbol{e}_{j},1)\}}(W_{t})=0$ . From this, we can conclude that for each $j\in[d]$ , we perform at most $\lceil\max\{\frac{C}{\eta},\frac{1}{\eta}\}\rceil$ update iterations on $\boldsymbol{e}_{j}$ before reaching zero loss, and therefore we can perform at most $\lceil\max\{\frac{dC}{\eta},\frac{d}{\eta}\}\rceil$ update iterations until convergence. Since we show that we never get stuck with zero gradient on an example with loss greater than zero, this means we converge to a global optimum after at most $\lceil\max\{\frac{dC}{\eta},\frac{d}{\eta}\}\rceil$ iterations.