SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Alon Brutzkus, Amir Globerson, Eran Malach, Shai Shalev-Shwartz

Introduction

Neural networks have achieved remarkable performance in many machine learning tasks. Although recently there have been numerous theoretical contributions to understand their success, it is still largely unexplained and remains a mystery. In particular, it is not known why in the over-parameterized setting, in which there are far more parameters than training points, stochastic gradient descent (SGD) can learn networks that generalize well, as been observed in practice (Neyshabur et al., 2014; Zhang et al., 2016).

In such over-parameterized settings, the loss function can contain multiple global minima that generalize poorly. Therefore, learning can in principle lead to models with low training error, but high test error. However, as often observed in practice, SGD is in fact able to find models with low training error and good generalization performance. This suggests that the optimization procedure, which depends on the optimization method (SGD) and the training data, introduces some form of inductive bias which directs it towards a low complexity solution. Thus, in order to explain the success of neural networks, it is crucial to characterize this inductive bias and understand what are the guarantees for generalization of over-parameterized neural networks.

In this work, we address these problems in a binary classification setting where SGD optimizes a two-layer over-parameterized network with the goal of learning a linearly separable function. Clearly, an over-parameterized network is not necessary for classifying linearly separable data, since this is possible with linear classifiers (e.g., with the Perceptron algorithm) which also have good generalization guarantees (Shalev-Shwartz & Ben-David, 2014). But, the key question which we address here is whether a large network will overfit in such a case or not. As we shall see, it turns out that although the networks we consider are rich enough to considerably overfit the data, this does not happen when SGD is used for optimization. In other words, SGD introduces an inductive bias which allows it to learn over-parameterized networks that can generalize well. Therefore, this setting serves as a good test bed for studying the effect of over-paramaterization.

Problem Formulation

where σ\sigma is a non-linear activation function applied element-wise.

We define the empirical loss over SS to be the mean hinge-loss:

In our setting we fix the second layer to be v=(vvk,vvk)\boldsymbol{v}=(\overbrace{v\dots v}^{k},\overbrace{-v\dots-v}^{k}) such that v>0v>0 and only learn the weight matrix WW. We will consider only positive homogeneous activations (Leaky ReLU and ReLU) and thus the network we consider with 2k2k hidden neurons is as expressive as networks with kk hidden neurons and any vector v\boldsymbol{v} in the second layer. For example, consider a network with kk hidden neurons with positive homogeneous activations, where each hidden neuron ii has incoming weight vector wi\boldsymbol{w}_{i} and outgoing weight viv_{i}. Then we can express this network with the network defined in Eq. 1 as follows. For each ii such that vi>0v_{i}>0, we define a neuron in the new network with incoming weight vector wivi\frac{\boldsymbol{w}_{i}}{v_{i}} and outgoing weight 11. Similarly, if vi<0v_{i}<0, we define a neuron in the new network with incoming weight vector uivi\frac{\boldsymbol{u}_{i}}{-v_{i}} and outgoing weight 1-1. For all other neurons we define an incoming zero weight vector. Due to the positive homogeneity, it follows that this network is equivalent to the network with kk hidden neurons. Hence, we can fix the second layer without limiting the expressive power of the two-layer network. Although it is relatively simpler than the case where the second layer is not fixed, the effect of over-parameterization can be studied in this setting as well.

Hence, the objective of the optimization problem is to find:

We focus on the case where LS(W)L_{S}\left(W\right) is minimized using an SGD algorithm with batch of size 11, and where only the weights of the first layer (namely WW) are updated. At iteration tt, SGD randomly chooses a point (xt,yt)S(\boldsymbol{x}_{t},y_{t})\in S and updates the weights with a constant learning rate η\eta. Formally, let Wt=(Wt,v)\mathcal{W}_{t}=(W_{t},\boldsymbol{v}) be the parameters at iteration tt, then the update at iteration tt is given by

Main Result

We now present our main results, for the case where σ\sigma is the Leaky ReLU function. Namely, σ(z)=max{αz,z}\sigma(z)=\max\{\alpha z,z\} where 0<α<10<\alpha<1.

First, we show that SGD can find a global optimum of LS(W)L_{S}\left(W\right). Note that this is by no means obvious, since LS(W)L_{S}\left(W\right) is a non-convex function (see Proposition 5.1). Specifically, we show that SGD converges to such an optimum while making at most:

non-zero update steps (see Corollary 5.2). In particular, the bound is independent of the number of neurons 2k2k. To the best of our knowledge, this is the first convergence guarantee of SGD for neural networks with the hinge loss. Furthermore, we prove a lower bound of \Omega\Big{(}\frac{\|\boldsymbol{w}^{*}\|}{\eta}+\|\boldsymbol{w}^{*}\|^{2}\Big{)} for the number of non-zero updates (see Theorem 2).

Next, we address the question of generalization. As noted earlier, since the network is large, it can in principle overfit. Indeed, there are parameter settings for which the network will have arbitrarily bad test error (see Section 6.2). However, as we show here, this will not happen in our setting where SGD is used for optimization. In Theorem 4 we use a compression bound to show that the model learned by SGD will have a generalization error of O\Big{(}\frac{M\log{n}}{n}\Big{)}.See discussion in Remark 5 on the dependence of the generalizaion bound on η\eta. This implies that for any network size, given a sufficiently large number of training samples that is independent of the network size, SGD converges to a global minimum with good generalization behaviour. This is despite the fact that for sufficiently large kk there are multiple global minima which overfit the training set (see Section 6.2). This implies that SGD is biased towards solutions that can be expressed by a small set of training points and thus generalizes well.

To summarize, when the activation is the Leaky ReLU and the data is linearly separable, we provide provable guarantees of optimization, generalization and expressive power for over-parameterized networks. This allows us to provide a rigorous explanation of the performance of over-parameterized networks in this setting. This is a first step in unraveling the mystery of the success of over-parameterized networks in practice.

We further study the same over-parameterized setting where the non-linear activation is the ReLU function (i.e., σ(z)=max{0,z}\sigma(z)=\max\{0,z\}). Surprisingly, this case has different properties. Indeed, we show that the loss contains spurious local minima and thus the previous convergence result of SGD to a global minimum does not hold in this case. Furthermore, we show an example where over-parameterization is favorable from an optimization point of view. Namely, for a sufficiently small number of hidden neurons, SGD will converge to a local minimum with high probability, whereas for a sufficiently large number of hidden neurons, SGD will converge to a global minimum with high probability.

The paper is organized as follows. We discuss related work in Section 4 . In Section 5 we prove the convergence bounds, in Section 6 we give the generalization guarantees and in Section 7 the results for the ReLU activation. We conclude our work in Section 8.

Related Work

The generalization performance of neural networks has been studied extensively. Earlier results (Anthony & Bartlett, 2009) provided bounds that depend on the VC dimension of the network, and the VC dimension was shown to scale linearly with the number of parameters. More recent works, study alternative notions of complexity, such as Rademacher compexity (Bartlett & Mendelson, 2002; Neyshabur et al., 2015; Bartlett et al., 2017; Kawaguchi et al., 2017), Robustness (Xu & Mannor, 2012) and PAC-Bayes (Neyshabur et al., 2017b). However, all of these notions fail to explain the generalization performance of over-parameterized networks (Neyshabur et al., 2017a). This is because these bounds either depend on the number of parameters or on the number of hidden neurons (directly or indirectly via norms of the weights) and become loose when these quantities become sufficiently large. The main disadvantage of these approaches, is that they do not depend on the optimization method (e.g., SGD), and thus do not capture its role in the generalization performance. In our work, we give generalization guarantees based on a compression bound that follows from convergence rate guarantees of SGD, and thus take into account the effect of the optimization method on the generalization performance. This analysis results in generalization bounds that are independent of the network size and thus hold for over-parameterized networks.

In parallel to our work, Kawaguchi et al. (2017) give generalization bounds for neural networks that are based on Rademacher complexity. Here too, the analysis does not take into account the optimization algorithm and the bound depends on the norm of the weights. Therefore, the bound can become vacuous for over-parameterized networks.

Stability bounds for SGD in non-convex settings were given in Hardt et al. (2016); Kuzborskij & Lampert (2017). However, their results hold for smooth loss functions, whereas the loss function we consider is not smooth due to the non-smooth activation functions (Leaky ReLU, ReLU).

Other works have studied generalization of neural networks in a model recovery setting, where assumptions are made on the underlying model and the input distribution (Brutzkus & Globerson, 2017; Zhong et al., 2017; Li & Yuan, 2017; Du et al., 2017; Tian, 2017). However, in their works the neural networks are not over-parameterized as in our setting.

Soltanolkotabi et al. (2017) analyze the optimization landscape of over-parameterized networks and give convergence guarantees for gradient descent to a global minimum when the data follows a Gaussian distribution and the activation functions are differentiable. The main difference from our work is that they do not provide generalization guarantees for the resulting model. Furthermore, we do not make any assumptions on the distribution of the feature vectors.

In a recent work, Nguyen & Hein (2017) show that if training points are linearly separable then under assumptions on the rank of the weight matrices of a fully-connected neural network, every critical point of the loss function is a global minimum. Their work extends previous results in Gori & Tesi (1992); Frasconi et al. (1997); Yu & Chen (1995). Our work differs from these in several respects. First, we show global convergence guarantees of SGD, whereas they only analyze the optimization landscape, without direct implications on performance of optimization methods. Second, we provide generalization bounds and their focus is solely on optimization. Third, we consider non-differentiable activation functions (Leaky ReLU, ReLU) while their results hold only for continuously differentiable activation functions.

Convergence Analysis

In this section we consider the setting of Section 2 with a leaky ReLU activation function. In Section 5.1 we show SGD will converge to a globally optimal solution, and analyze the rate of convergence. In Section 5.1 we also provide lower bounds on the rate of convergence. The results in this section are interesting for two reasons. First, they show convergence of SGD for a non-convex objective. Second, the rate of convergence results will be used to derive generalization bounds in Section 6.

Before proving convergence of SGD to a global minimum, we show that every critical point is a global minimum and the loss function is non-convex. The proof is deferred to the appendix.

LS(W)L_{S}\left(W\right) satisfies the following properties: 1) Every critical point is a global minimum. 2) It is non-convex.

We assume that SGD is initialized such that the norms of all rows of W0W_{0} are upper bounded by some constant R>0R>0. Namely for all 1ik1\leq i\leq k it holds that:

Define Mk:=w2α2+w2kηv2α2+R(8k2η2v2+8ηk)w1.52k(ηvα)1.5+2RwηvαM_{k}:=\frac{\|\boldsymbol{w}^{*}\|^{2}}{\alpha^{2}}+\frac{\|\boldsymbol{w}^{*}\|^{2}}{k\eta v^{2}\alpha^{2}}+\frac{\sqrt{R(8k^{2}\eta^{2}v^{2}+8\eta k)}{\|\boldsymbol{w}^{*}\|}^{1.5}}{2k(\eta v\alpha)^{1.5}}+\frac{2R\|\boldsymbol{w}^{*}\|}{\eta v\alpha}. We give an upper bound on the number of non-zero updates SGD makes until convergence to a critical point (which is a global minimum by Proposition 5.1). The result is summarized in the following theorem.

SGD converges to a global minimum after performing at most MkM_{k} non-zero updates.

To obtain a simpler bound than the one obtained in Theorem 1, we use the fact that we can set R,vR,v arbitrarily, and choose:This initialization resembles other initializations that are used in practice (Bengio, 2012; Glorot & Bengio, 2010)

Then by Theorem 1 we get the following. The derivation is given in the Appendix (Section A.1.3).

Let R=v=12kR=v=\frac{1}{\sqrt{2k}}, then SGD converges to a global minimum after perfoming at most M_{k}=\frac{\|\boldsymbol{w}^{*}\|^{2}}{\alpha^{2}}+O\Bigg{(}\frac{\|\boldsymbol{w}^{*}\|^{2}}{\min\{\eta,\sqrt{\eta}\}}\Bigg{)} non-zero updates.

Thus the bound consists of two terms, the first which only depends on the margin (via w\|\boldsymbol{w}^{*}\|) and the second which scales inversely with η\eta. More importantly, the bound is independent of the network size.

2 Lower Bound

We use the same notations as in Section 5.1. The lower bound is given in the following theorem, which is proved in the Appendix (Section A.1.4).

Assume SGD is initialized according to Eq. 6, then for any dd there exists a sequence of linearly separable points on which SGD will make at least \Omega\Big{(}\frac{\|\boldsymbol{w}^{*}\|}{\eta}+\|\boldsymbol{w}^{*}\|^{2}\Big{)} mistakes.

Although this lower bound is not tight, it does show that the upper bound in Corollary 5.2 cannot be much improved. Furthermore, the example presented in the proof of Theorem 2, demonstrates that η\eta\rightarrow\infty can be optimal in terms of optimization and generalization, i.e., SGD makes the minimum number of updates (w2\|\boldsymbol{w}^{*}\|^{2}) and the learned model is equivalent to the true classifier w\boldsymbol{w}^{*}. We will use this observation in the discussion on the dependence of the generalization bound in Theorem 4 on η\eta (see Remark 5).

Generalization

In this section we give generalization guarantees for SGD learning of over-parameterized networks with Leaky ReLU activations. These results are obtained by combining Theorem 1 with a compression generalization bound (see Section 6.1). In Section 6.2 we show that over-parameterized networks are sufficiently expressive to contain global minima that overfit the training set. Taken together, these results show that although there are models that overfit, SGD effectively avoids these, and finds the models that generalize well.

Given the bound in Theorem 1 we can invoke compression bounds for generalization guarantees with respect to the -11 loss (Littlestone & Warmuth, 1986) . Denote by NkN_{k} a two-layer neural network with 2k2k hidden neurons defined in Section 1 where σ\sigma is the Leaky ReLU. Let SGDk(S,W0)SGD_{k}(S,W_{0}) be the output of running SGD for training this network on a set SS and initialized with W0W_{0} that satisfies Eq. 5. Define Hk\mathcal{H}_{k} to be the set of all possible hypotheses that SGDk(S,W0)SGD_{k}(S,W_{0}) can output for any SS and W0W_{0} which satisfies Eq. 5.

Let n2ckn\geq 2c_{k}, then with probability of at least 1δ1-\delta over the choice of SS and W0W_{0} we have

Since LV01(SGDk(S,W0))=0L_{V}^{0-1}(SGD_{k}(S,W_{0}))=0 holds at a global minimum of LSL_{S}, then by Combining the results of Corollary 5.2 and Theorem 3, we get the following theorem.

If n2ckn\geq 2c_{k} and assuming the initialization defined in Eq. 6, then with probability at least 1δ1-\delta over the choice of SS and W0W_{0}, SGD converges to a global minimum of LSL_{S} with -11 test error at most

Thus for fixed w\|\boldsymbol{w}^{*}\| and η\eta we obtain a sample complexity guarantee that is independent of the network size (See Remark 5 for a discussion on the dependence of the bound on η\eta). This is despite the fact that for sufficiently large kk, the network has global minima that have arbitrarily high test errors, as we show in the next section. Thus, SGD and the linearly separable data introduce an inductive bias which directs SGD to the global minimum with low test error while avoiding global minima with high test error. In Figure 1 we demonstrate this empirically for a linearly separable data set (from a subset of MNIST) learned using over-parameterized networks. The figure indeed shows that SGD converges to a global minimum which generalizes well.

The generelization bound in Eq. 7 holds for η\eta\rightarrow\infty, which is unique for the setting that we consider, and may seem surprising, given that a choice of large η\eta often fails in practice. Furthermore, the bound is optimal for η\eta\rightarrow\infty. To support this theoretical result, we show in Theorem 2 an example where indeed η\eta\rightarrow\infty is optimal in terms of the number of updates and generalization. On the other hand, we note that in practice, it may not be optimal to use large η\eta in our setting, since this bound results from a worst-case analysis of a sequence of examples encountered by SGD. Finally, the important thing to note is that the bound holds for any η\eta, and is thus applicable to realistic applications of SGD.

2 Expressiveness

Theorem 6 implies that for sufficiently large networks, the optimization problem (2) can have arbitrarely bad global minima with respect to a given test set, i.e., ones which do not generalize well on a given test set.

ReLU- Success and Failure Cases

In this section we consider the same setting as in section 5, but with the ReLU activation function σ(x)=max{0,x}\sigma(x)=\max\{0,x\}. In Section 7.1 we show that the loss function contains arbitrarely bad local minima. In Section 7.2 we give an example where for a sufficiently small network, with high probability SGD will converge to a local minimum. On the other hand, for a sufficiently large network, with high probability SGD will converge to a global minimum.

The result is summarized in the following theorem and the proof is deferred to the Appendix (Section A.3.1). The main idea is to construct a network with weight paramater WW such that for at least S2\frac{|S|}{2} points (x,y)S(\boldsymbol{x},y)\in S it holds that w,x<0\left\langle\boldsymbol{w},\boldsymbol{x}\right\rangle<0 for each neuron with weight vector w\boldsymbol{w}. Furthermore, the remaining points satisfy yNW(x)>1yN_{\mathcal{W}}(\boldsymbol{x})>1 and thus the gradient is zero and LS(W)>12L_{S}(W)>\frac{1}{2}.

2 Orthogonal vectors - simple case analysis

The main result of this section is given in the following theorem. The proof is given in the Appendix (Section A.3.2). The main observation is that the convergence to non-global minimum depends solely on the initialization and occurs if and only if there exists a point x\boldsymbol{x} such that for all neurons, the corresponding initialized weight vector w\boldsymbol{w} satisfies w,x0\left\langle\boldsymbol{w},\boldsymbol{x}\right\rangle\leq 0.

Fix δ>0\delta>0 and assume we run SGD with examples from S={e1ed}×{1}S=\{\boldsymbol{e}_{1}\dots\boldsymbol{e}_{d}\}\times\{1\}. If klog2(dln(δ))k\leq\log_{2}(\frac{d}{-\ln(\delta)}), then with probability of at least 1δ1-\delta, SGD will converge to a non global minimum point. On the other hand, if klog2(2dδ)k\geq\log_{2}(\frac{2d}{\delta}), then with probability of at least 1δ1-\delta, SGD will converge to a global minimum point after max{dCη,dη}\lceil\max\{\frac{dC}{\eta},\frac{d}{\eta}\}\rceil iterations.

Note that in the first part of the theorem, we can make the basin of attraction of the non-global minimum exponentially large by setting δ=eαd\delta=e^{-\alpha d} for α12\alpha\leq\frac{1}{2}.

Conclusion

Understanding the performance of over-parameterized neural networks is essential for explaining the success of deep learning models in practice. Despite a plethora of theoretical results for generalization of neural networks, none of them give guarantees for over-parameterized networks. In this work, we give the first provable guarantees for the generalization performance of over-parameterized networks, in a setting where the data is linearly separable and the network has Leaky ReLU activations. We show that SGD compresses its output when learning over-parameterized networks, and thus exhibits good generalization performance.

The analysis for networks with Leaky ReLU activations does not hold for networks with ReLU activations, since in this case the loss contains spurious local minima. However, due to the success of over-parameterized networks with ReLU activations in practice, it is likely that similar results hold here as well. It would be very interesting to provide convergence guarantees and generalization bounds for this case. Another direction for future work is to show that similar results hold under different assumptions on the data.

This research is supported by the Blavatnik Computer Science Research Fund and the European Research Council (TheoryDL project).

References

Appendix A Appendix

Otherwise, if yNW(x)1yN_{\mathcal{W}}{}(\boldsymbol{x})\geq 1, then the gradient vanishes and thus

It follows that if there exists (x,y)S(\boldsymbol{x},y)\in S, such that yNW(x)<1yN_{\mathcal{W}}{}(\boldsymbol{x})<1, then we have

and thus WLS(W)0\frac{\partial}{\partial\vec{W}}L_{S}(\vec{W})\neq 0. Therefore, for any critical point it holds that yNW(x)1yN_{\mathcal{W}}{}(\boldsymbol{x})\geq 1 for all (x,y)S(\boldsymbol{x},y)\in S, which implies that it is a global minimum.

For simplicity consider the function fx(w,u)=σ(w,x)σ(u,x)f_{\boldsymbol{x}}(\boldsymbol{w},\boldsymbol{u})=\sigma(\left\langle\boldsymbol{w},\boldsymbol{x}\right\rangle)-\sigma(\left\langle\boldsymbol{u},\boldsymbol{x}\right\rangle) for x0\boldsymbol{x}\neq 0. Define w1=w2=u1=x\boldsymbol{w}_{1}=\boldsymbol{w}_{2}=\boldsymbol{u}_{1}=\boldsymbol{x} and u2=x\boldsymbol{u}_{2}=-\boldsymbol{x}. Then

and thus fx(w1+w22,u1+u22)>12fx(w1,u1)+12fx(w2,u2)f_{\boldsymbol{x}}(\frac{\boldsymbol{w}_{1}+\boldsymbol{w}_{2}}{2},\frac{\boldsymbol{u}_{1}+\boldsymbol{u}_{2}}{2})>\frac{1}{2}f_{\boldsymbol{x}}(\boldsymbol{w}_{1},\boldsymbol{u}_{1})+\frac{1}{2}f_{\boldsymbol{x}}(\boldsymbol{w}_{2},\boldsymbol{u}_{2}) which implies that the function is not convex.

A.1.2 Proof of Theorem 1

Then, from Cauchy-Schwartz inequality we have

Since the update at iteration tt is non-zero, we have ytNt1(xt)<1y_{t}N_{t-1}(\boldsymbol{x}_{t})<1 and the update rule is given by

where pt(i)=1p_{t}^{(i)}=1 if wt1(i),xt0\left\langle\boldsymbol{w}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle\geq 0 and pt(i)=αp_{t}^{(i)}=\alpha otherwise. Similarly qt(i)=1q_{t}^{(i)}=1 if ut1(i),xt0\left\langle\boldsymbol{u}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle\geq 0 and qt(i)=αq_{t}^{(i)}=\alpha otherwise. It follows that:

where the second inequality follows since ytv(i=1kwt1(i),xtpt(i)i=1kut1(i),xtqt(i))=ytNt1(xt)<1y_{t}v\left(\sum_{i=1}^{k}\left\langle\boldsymbol{w}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle p_{t}^{(i)}-\sum_{i=1}^{k}\left\langle\boldsymbol{u}^{(i)}_{t-1},\boldsymbol{x}_{t}\right\rangle q_{t}^{(i)}\right)=y_{t}N_{t-1}(\boldsymbol{x}_{t})<1. Using the above recursively, we obtain:

where the inequality follows since ytxt,w1\left\langle y_{t}\boldsymbol{x}_{t},\boldsymbol{w}^{*}\right\rangle\geq 1. This implies that

By combining equations Eq. 8, Eq. 10 and Eq. 11 we get,

Using a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} the above implies,

Since w0(i),u0(i)R\|\boldsymbol{w}^{(i)}_{0}\|,\|\boldsymbol{u}^{(i)}_{0}\|\leq R we have G(W0)2kRG(W_{0})\leq\sqrt{2k}R. Noting that W=2kw\|\vec{W}^{*}\|=\sqrt{2k}\|\boldsymbol{w}^{*}\| we get,

where a=2kηvαa=2k\eta v\alpha, b=(4k2η2v2+4ηk)wb=\sqrt{(4k^{2}\eta^{2}v^{2}+4\eta k)}\|\boldsymbol{w}^{*}\| and c=4kRwc=4kR\|\boldsymbol{w}^{*}\|. By inspecting the roots of the parabola P(x)=x2baxcaP(x)=x^{2}-\frac{b}{a}x-\frac{c}{a} we conclude that

A.1.3 Proof of Corollary 5.2

Since Rv=1\frac{R}{v}=1, we have by Theorem 1 and the inequality a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b},

A.1.4 Proof of Theorem 2

We will prove a more general theorem. Theorem 2 follows by setting R=v=12kR=v=\frac{1}{\sqrt{2k}}.

For any dd there exists a sequence of linearly separable points on which SGD will make at least

Define a sequence S\mathcal{S} of size dd,

for all 1ik1\leq i\leq k. Note that w0(i),u0(i)R\|\boldsymbol{w}^{(i)}_{0}\|,\|\boldsymbol{u}^{(i)}_{0}\|\leq R for all 1ik1\leq i\leq k.

Since w0(i)=w0(j)\boldsymbol{w}^{(i)}_{0}=\boldsymbol{w}^{(j)}_{0} and u0(i)=u0(j)\boldsymbol{u}^{(i)}_{0}=\boldsymbol{u}^{(j)}_{0} for all iji\neq j, we have by induction that wt(i)=wt(j)\boldsymbol{w}^{(i)}_{t}=\boldsymbol{w}^{(j)}_{t} and ut(i)=ut(j)\boldsymbol{u}^{(i)}_{t}=\boldsymbol{u}^{(j)}_{t} for all iji\neq j and t>0t>0. Hence, we will denote wt=wt(i)\boldsymbol{w}_{t}=\boldsymbol{w}^{(i)}_{t} and ut=ut(i)\boldsymbol{u}_{t}=\boldsymbol{u}^{(i)}_{t} for all 1id1\leq i\leq d and t>0t>0. Then for all 1jd1\leq j\leq d we have NWt(ej)=kvσ(wt,ej)wt,ejkvσ(ut,ej)ut,ejN_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})=kv\sigma^{\prime}(\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{j}\right\rangle)\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{j}\right\rangle-kv\sigma^{\prime}(\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{j}\right\rangle)\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{j}\right\rangle.

Since at the global minumum NWt,v(ej)1\mathcal{N}_{W_{t},\boldsymbol{v}}(\boldsymbol{e}_{j})\geq 1 for all 1jd1\leq j\leq d, it follows that a necessary condition for convergence to a global minimum is that there exists an iteration tt in which either kvσ(wt,ed)wt,ed12kv\sigma^{\prime}(\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle)\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2} or kvσ(ut,ed)ut,ed12-kv\sigma^{\prime}(\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle)\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2}. Equivalently, either wt,ed12kv\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2kv} or ut,ed12αkv\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle\leq-\frac{1}{2\alpha kv}.

Since w0,ed=Rd\left\langle\boldsymbol{w}_{0},\boldsymbol{e}_{d}\right\rangle=-\frac{R}{\sqrt{d}}, then by the gradient updates (Eq. 9) it follows that after at least Rηvαd\frac{R}{\eta v\alpha\sqrt{d}} copies of S\mathcal{S}, or equivalently, after at least Rdηvαd\frac{Rd}{\eta v\alpha\sqrt{d}} iterations we will have 0wt,edηvα0\leq\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\leq\eta v\alpha. Then, after at least dmin{12kvηvα,0}ηv\frac{d\min\{\frac{1}{2kv}-\eta v\alpha,0\}}{\eta v} iterations we have wt,ed12kv\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2kv}. Thus, in total, after at least Rwηvα+min{w22ηkv2αw2,0}\frac{R\|\boldsymbol{w}^{*}\|}{\eta v\alpha}+\min\{\frac{\|\boldsymbol{w}^{*}\|^{2}}{2\eta kv^{2}}-\alpha\|\boldsymbol{w}^{*}\|^{2},0\} iterations, we have wt,ed12kv\left\langle\boldsymbol{w}_{t},\boldsymbol{e}_{d}\right\rangle\geq\frac{1}{2kv}.

By the same reasoning, we have ut,ed12αkv\left\langle\boldsymbol{u}_{t},\boldsymbol{e}_{d}\right\rangle\leq-\frac{1}{2\alpha kv} after at least Rwηv+min{w22α2ηkv2w2α,0}\frac{R\|\boldsymbol{w}^{*}\|}{\eta v}+\min\{\frac{\|\boldsymbol{w}^{*}\|^{2}}{2\alpha^{2}\eta kv^{2}}-\frac{\|\boldsymbol{w}^{*}\|^{2}}{\alpha},0\} iterations. Finally, SGD must update on at least dd points in order to converge to the global minimum. The claim now follows. ∎

A.2 Missing Proofs for Section 6

By Theorem 30.2 and Corollary 30.3 in Shalev-Shwartz & Ben-David (2014), for n2ckn\geq 2c_{k} we have that with probability of at least 1δ1-\delta over the choice of SS

A.2.2 Proof of Theorem 6

We can easily extend Theorem 8 in (Soudry & Hoffer, 2017) to hold for labels in {1,1}\{-1,1\}. By the theorem we can construct networks NW1N_{\mathcal{W}_{1}} and NW2N_{\mathcal{W}_{2}} such that for all ii:

NW1(xi)=1N_{\mathcal{W}_{1}}(\boldsymbol{x}_{i})=1 if yi=1y_{i}=1 and NW1(xi)=0N_{\mathcal{W}_{1}}(\boldsymbol{x}_{i})=0 otherwise.

NW2(xi)=1N_{\mathcal{W}_{2}}(\boldsymbol{x}_{i})=1 if yi=1y_{i}=-1 and NW2(xi)=0N_{\mathcal{W}_{2}}(\boldsymbol{x}_{i})=0 otherwise.

A.3 Missing Proofs for Section 7

There exists α>0\alpha>0 such that for each (x,y)S(\boldsymbol{x},y)\in S we have x,w^>α|\left\langle\boldsymbol{x},\hat{\boldsymbol{w}}\right\rangle|>\alpha.

#{(x,y)S: w^,x<0}>12S\#\{(\boldsymbol{x},y)\in S:~{}\left\langle\hat{\boldsymbol{w}},\boldsymbol{x}\right\rangle<0\}>\frac{1}{2}|S|.

we can choose w^\hat{\boldsymbol{w}} and α=β2\alpha=\frac{\beta}{2} and we are done. Otherwise, choosing w^-\hat{\boldsymbol{w}} and α=β2\alpha=\frac{\beta}{2} satisfies all the assumptions of the lemma. ∎

Let (x,y)S(\boldsymbol{x},y)\in S be an arbitrary example. If w^,x>α\left\langle\hat{\boldsymbol{w}},\boldsymbol{x}\right\rangle>\alpha, then

Therefore yNW(x)>1yN_{\mathcal{W}}(\boldsymbol{x})>1, so we get zero loss for this example, and therefore the gradient of the loss will also be zero. If, on the other hand, w^,x<α\left\langle\hat{\boldsymbol{w}},\boldsymbol{x}\right\rangle<-\alpha, then

In this case the loss on the example would be max{1yNW(x),0}=1\max\{1-yN_{\mathcal{W}}(\boldsymbol{x}),0\}=1, but the gradient will also be zero. Along with assumption 2, we would conclude that:

A.3.2 Proof of Theorem 8

Denote Wt=[wt(1)wt(k)ut(1)ut(k)]W_{t}=[\boldsymbol{w}^{(1)}_{t}\dots\boldsymbol{w}^{(k)}_{t}\boldsymbol{u}^{(1)}_{t}\dots\boldsymbol{u}^{(k)}_{t}] and define Kt={ej: i[k]wt(i),ej0}K_{t}=\{\boldsymbol{e}_{j}:~{}\forall_{i\in[k]}\left\langle\boldsymbol{w}^{(i)}_{t},\boldsymbol{e}_{j}\right\rangle\leq 0\}. We first prove the following lemma.

Let ej\boldsymbol{e}_{j} be the example seen in time tt. If NWt(ej)1N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})\geq 1 then there is no update and we are done. Otherwise, if ejKt\boldsymbol{e}_{j}\in K_{t} then for each i[k]i\in[k] we have wt(i)NWt(ej)=0\frac{\partial}{\partial\boldsymbol{w}^{(i)}_{t}}N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})=0 and therefore the update does not change the value of wt(i)\boldsymbol{w}^{(i)}_{t}, and thus Kt+1=KtK_{t+1}=K_{t}. If ejKt\boldsymbol{e}_{j}\notin K_{t} then there exists i[k]i\in[k] such that wt(i),ej>0\left\langle\boldsymbol{w}^{(i)}_{t},\boldsymbol{e}_{j}\right\rangle>0. In that case, we update wt+1(i)wt(i)+ηej\boldsymbol{w}^{(i)}_{t+1}\leftarrow\boldsymbol{w}^{(i)}_{t}+\eta\boldsymbol{e}_{j}. Now, note that

We can now prove the theorem. For each j[d]j\in[d], by the symmetry of the initialization, with probability 12\frac{1}{2} over the initialization of w0(i)\boldsymbol{w}^{(i)}_{0}, we get that w0(i),ej0\left\langle\boldsymbol{w}^{(i)}_{0},\boldsymbol{e}_{j}\right\rangle\leq 0. Since all wi\boldsymbol{w}_{i}’s are initialized independently, we get that:

Now, assuming klog2(dln(δ))k\leq\log_{2}(\frac{d}{-\ln(\delta)}), from the independence of the initialization of w0(i)\boldsymbol{w}^{(i)}_{0}’s coordinates we get

On the other hand, assuming klog2(dδ)k\geq\log_{2}(\frac{d}{\delta}), using the union bound we get:

and therefore NWt(ej)>1N_{\mathcal{W}_{t}}(\boldsymbol{e}_{j})>1, which implies that L{(ej,1)}(Wt)=0L_{\{(\boldsymbol{e}_{j},1)\}}(W_{t})=0. From this, we can conclude that for each j[d]j\in[d], we perform at most max{Cη,1η}\lceil\max\{\frac{C}{\eta},\frac{1}{\eta}\}\rceil update iterations on ej\boldsymbol{e}_{j} before reaching zero loss, and therefore we can perform at most max{dCη,dη}\lceil\max\{\frac{dC}{\eta},\frac{d}{\eta}\}\rceil update iterations until convergence. Since we show that we never get stuck with zero gradient on an example with loss greater than zero, this means we converge to a global optimum after at most max{dCη,dη}\lceil\max\{\frac{dC}{\eta},\frac{d}{\eta}\}\rceil iterations.