Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs

Alon Brutzkus, Amir Globerson

Introduction

Deep neural networks have achieved state-of-the-art performance on many machine learning tasks in areas such as natural language processing (Wu et al., 2016), computer vision (Krizhevsky et al., 2012) and speech recognition (Hinton et al., 2012). Training of such networks is often successfully performed by minimizing a high-dimensional non-convex objective function, using simple first-order methods such as stochastic gradient descent.

Nonetheless, the success of deep learning from an optimization perspective is poorly understood theoretically. Current results are mostly pessimistic, suggesting that even training a 3-node neural network is NP-hard (Blum & Rivest, 1993), and that the objective function of a single neuron can admit exponentially many local minima (Auer et al., 1996; Safran & Shamir, 2016). There have been recent attempts to bridge this gap between theory and practice. Several works focus on the geometric properties of loss functions that neural networks attempt to minimize. For some simplified architectures, such as linear activations, it can be shown that there are no bad local minima (Kawaguchi, 2016). Extension of these results to the non-linear case currently requires very strong independence assumptions (Kawaguchi, 2016).

Since gradient descent is the main “work-horse” of deep learning it is of key interest to understand its convergence properties. However, there are no results showing that gradient descent is globally optimal for non-linear models, except for the case of many hidden neurons (Andoni et al., 2014) and non-linear activation functions that are not widely used in practice (Zhang et al., 2017).See more related work in Section 2. Here we provide the first such result for a neural architecture that has two very common components: namely a ReLU activation function and a convolution layer.

We note that such architectures have been used in several works (Lin et al., 2013; Milletari et al., 2016), but we view them as important firstly because they capture key properties of general convolutional networks.

We address the realizable case, where training data is generated from a function as in Eq. 1 with weight vector w\boldsymbol{w}^{*}. Training data is then generated by sampling nn training points x1,,xn\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n} from a distribution D{\cal D}, and assigning them labels using y=f(x;w)y=f(\boldsymbol{x};\boldsymbol{w}^{*}). The learning problem is then to find a w\boldsymbol{w} that minimizes the squared loss. In other words, solve the optimization problem:

In the limit nn\to\infty, this is equivalent to minimizing the population risk:

Like several recent works (Hardt et al., 2016; Hardt & Ma, 2016) we focus on minimizing the population risk, leaving the finite sample case to future work. We believe the population risk captures the key characteristics of the problem, since the large data regime is the one of interest.

Worst Case Hardness: Despite the simplicity of No-Overlap Networks, we show that learning them is in fact hard if D{\cal D} is unconstrained. Specifically, in Section 4, we show that learning No-Overlap Networks is NP complete via a reduction from a variant of the set splitting problem.

Distribution Dependent Tractability: When D{\cal D} corresponds to independent Gaussian variables with μ=0,σ2=1\mu=0,\sigma^{2}=1, we show in Section 5 that No-Overlap Networks can be learned in polynomial time using gradient descent.

The above two results nicely demonstrate the gap between worst-case intractability and tractability under assumptions on the data. We provide an empirical demonstration of this in Section 6 where gradient descent is shown to succeed on the Gaussian case and fail for a different distribution.

To further understand the role of overlap in the network, we consider networks that do have overlap between the filters. In Section 7.1 we show that in this case, even under Gaussian distributed inputs, there will be non-optimal local minima. Thus, gradient descent will no longer be optimal in the overlap case. In Section 7.2 we show empirically that these local optima may be overcome in practice by using gradient descent with multiple restarts.

Taken together, our results are the first to demonstrate distribution dependent optimality of gradient descent for learning a neural architecture with a convolutional like architecture and a ReLU activation function.

Related Work

Hardness of learning neural networks has been demonstrated for many different settings. For example, (Blum & Rivest, 1993) show that learning a neural network with one hidden layer with a sign activation function is NP-hard in the realizable case. (Livni et al., 2014) extend this to other activation functions and bounded norm optimization. Hardness can also be shown for improper learning under certain cryptographic assumptions (e.g., see Daniely et al., 2014; Klivans, 2008; Livni et al., 2014). Note that these hardness results do not hold for the regression and tied parameter setting that we consider.

Due to the above hardness results, it is clear that the success of deep-learning can only be explained by making additional assumptions about the data generating distribution. The classic algorithm by (Baum, 1990) shows that intersection of halfspaces (i.e., a specific instance of a one hidden layer network) is PAC learnable under any symmetric distribution. This was later extended in (Klivans et al., 2009) to log-concave distributions.

The above works do not consider gradient descent as the optimization method, leaving open the question of which assumptions can lead to global optimality of gradient descent. Such results have been hard to obtain, and we survey some recent ones below. One instance when gradient descent can succeed is when there are enough hidden units such that random initialization of the first layer can lead to zero error even if only the second layer is trained. Such over-specified networks have been considered in (Andoni et al., 2014; Livni et al., 2014) and it was shown that gradient descent can globally learn them in some cases (Andoni et al., 2014). However, the assumption of over-specification is very restrictive and limits generalization. In contrast, we show convergence of gradient descent to a global optimum for any network size and consider convolutional neural networks with shared parameters. Another interesting case is linear dynamical systems, where (Hardt et al., 2016) show that under independence assumptions maximum likelihood is quasi-concave and hence solvable with gradient ascent.

Recent work by (Mei et al., 2016) shows that regression with a single neuron and certain non-linear activation functions, can be learned with gradient descent for sub-Gaussian inputs. We note that their architecture is significantly simpler than ours, in that it uses a single neuron. In fact, their regression problem can also be solved via methods for generalized linear models such as (Kakade et al., 2011).

(Shamir, 2016) recently showed that there is a limit to what distribution dependent results can achieve. Namely, it was shown that for large enough one-hidden layer networks, no distributional assumptions can make gradient descent tractable. Importantly, the construction in (Shamir, 2016) does not use parameter tying and thus is not applicable to the architecture we study here.

Several works have focused on understanding the loss surface of neural network objectives, but without direct algorithmic implications. (Kawaguchi, 2016) show that linear neural networks do not suffer from bad local minima. (Hardt & Ma, 2016) consider objectives of linear residual networks and prove that there are no critical points other than the global optimum. (Soudry & Carmon, 2016) show that in the objective of over-parameterized neural networks with dropout-like noise, all differentiable local minima are global. Other works (Safran & Shamir, 2016; Haeffele & Vidal, 2015) give similar results for over-specified networks. All of these results are purely geometric and do not have direct implications on convergence of optimization algorithms. In a different approach, (Janzamin et al., 2015), suggest alternatives to gradient-based methods for learning neural networks. However, these algorithms are not widely used in practice. Finally, (Choromanska et al., 2015) use spin glass models to argue that, under certain generative modelling and architectural constraints, local minima are likely to have low loss values.

The theory of non-convex optimization is closely related to the theory of neural networks. Recently, there has been substantial progress in proving convergence guarantees of simple first-order methods in various machine learning problems, that don’t correspond to typical neural nets. These include for example matrix completion (Ge et al., 2016) and tensor decompositions (Ge et al., 2015).

Finally, recent work by (Zhang et al., 2016) shows that neural nets can perfectly fit random labelings of the data. Understanding this from an optimization perspective is largely an open problem.

Preliminaries

We use bold-faced letters for vectors and capital letters for matrices. The ithi^{th} row of a matrix AA is denoted by ai\mathbf{a}_{i}.

where σ()\sigma\left(\right) is the pointwise ReLU function.

We consider the realizable setting where there exists a true WW^{*} using which the training data is generated. The population risk (see Eq. 3) is then:

The next Lemma from (Cho & Saul, 2009) shows that g(u,v)g(\boldsymbol{u},\boldsymbol{v}) has a simple form.

The gradient of gg with respect to u\boldsymbol{u} also turns out to have a simple form, as stated in the lemma below. The proof is deferred to the Appendix A.

Let gg be as defined in Eq. 6. Then gg is differentiable at all points u0\boldsymbol{u}\neq\mathbf{0} and

where β=k2k2π\beta=\frac{k^{2}-k}{2\pi} and γ=β+k2\gamma=\beta+\frac{k}{2}.

Learning No-Overlap Networks is NP-Complete

The No-Overlap Networks architecture is a simplified convolutional layer with average pooling. However, as we show here, learning it is still a hard problem. This will motivate our exploration of distribution dependent results in Section 5.

We begin by defining the Set-Splitting-by-k-Sets problem, which is a variant of the classic Set-Splitting problem (Garey & Johnson, 1990). After establishing the hardness of Set-Splitting-by-k-Sets, we will provide a reduction from it to learning No-Overlap Networks.

The Set-Splitting-by-k-Sets decision problem is defined as follows: Given a finite set SS of dd elements and a collection C\cal C of at most (k1)d(k-1)d subsets CjC_{j} of SS, do there exist disjoint sets S1,S2,...,SkS_{1},S_{2},...,S_{k} such that iSi=S\bigcup_{i}{S_{i}}=S and for all jj and ii, Cj⊈SiC_{j}\not\subseteq S_{i}?

For k=2k=2 and without the upper bound on C|\cal C| this is known as the Set-Splitting decision problem which is NP-complete (Garey & Johnson, 1990). Next, we show that Set-Splitting-by-k-Sets is NP-complete. The proof is via a reduction from 3SAT and induction, and is provided in Appendix B.

Set-Splitting-by-k-Sets is NP-complete for all k2k\geq 2.

We next formulate the No-Overlap Networks optimization problem.

Otherwise an arbitrary weight vector is returned.

The above problem returns a w\boldsymbol{w} that minimizes the population-risk up to 14k5d\frac{1}{4k^{5}d} accuracy. It is thus easier than minimizing the risk to an arbitrary precision ϵ\epsilon (see Section 5, Theorem 5.2).

We prove the following theorem, which uses some ideas from (Blum & Rivest, 1993), but introduces additional constructions needed for the no overlap case.

For all the k2k\geq 2, the k-Non-Overlap-Opt problem is NP-complete.

where the last equality follows since for all ll and jj, Cj⊈SlC_{j}\not\subseteq S_{l}. Therefore there exists a w\boldsymbol{w} for which the error in Eq. 9 is zero and k-Non-Overlap-Opt will return a weight vector with low risk.

Define sets Sl={iwlTei>12k}S_{l}=\{i\mid\boldsymbol{w}_{l}^{T}\boldsymbol{e}_{i}>\frac{1}{2k}\} for 1lk1\leq l\leq k and WLOG assume they are disjoint by arbitrarily assigning points that belong to more than one set, to one of the sets they belong to. We will next show that these SlS_{l} are splitting. Namely, it holds that lSl=S\bigcup_{l}{S_{l}}=S and no subset CjC_{j} is a subset of some SlS_{l}.

Since f(d(ei);w)=l=1kσ(wlTei)k>1k12k2>12kf(\boldsymbol{d}(\boldsymbol{e}_{i});\boldsymbol{w})=\frac{\sum_{l=1}^{k}{\sigma(\boldsymbol{w}_{l}^{T}\boldsymbol{e}_{i})}}{k}>\frac{1}{k}-\frac{1}{2k^{2}}>\frac{1}{2k} for all ii, it follows that for each iSi\in S there exists 1lk1\leq l\leq k such that wlTei>12k\boldsymbol{w}_{l}^{T}\boldsymbol{e}_{i}>\frac{1}{2k}. Therefore, by the definition of SlS_{l} we deduce that lSl=S\bigcup_{l}{S_{l}}=S. To show the second property, assume by contradiction that for some jj and mm, CjSmC_{j}\subseteq S_{m}. Then wmT(iCjei)>Cj2k\boldsymbol{w}_{m}^{T}(\sum_{i\in C_{j}}{\boldsymbol{e}_{i}})>\frac{|C_{j}|}{2k}, which implies that f(d(iCjei);w)=l=1kσ(wlT(iCjei))k>Cj2k212k2f(\boldsymbol{d}(\sum_{i\in C_{j}}{\boldsymbol{e}_{i}});\boldsymbol{w})=\frac{\sum_{l=1}^{k}{\sigma(\boldsymbol{w}_{l}^{T}(\sum_{i\in C_{j}}{\boldsymbol{e}_{i}}))}}{k}>\frac{|C_{j}|}{2k^{2}}\geq\frac{1}{2k^{2}}, a contradiction. This concludes our proof. ∎

To conclude, we have shown that No-Overlap Networks are hard to learn if one does not make any assumptions about the training data. In fact we have shown that finding a w\boldsymbol{w} with loss at most 14k5d{1\over 4k^{5}d} is hard. In the next section, we show that certain distributional assumptions make the problem tractable.

No-Overlap Networks can be Learned for Gaussian Inputs

A local maximum at w=0\boldsymbol{w}=\mathbf{0}.

A unique global minimum at w=w\boldsymbol{w}=\boldsymbol{w}^{*}.

A degenerate saddle point at w=(k2kk2+(π1)k)w\boldsymbol{w}=-(\frac{k^{2}-k}{k^{2}+(\pi-1)k})\boldsymbol{w}^{*}.

For k=1k=1, w=0\boldsymbol{w}=\mathbf{0} is not a local maximum and the unique global minimum w\boldsymbol{w}^{*} is the only differentiable critical point.

Our main result, stated formally below, is that the above update is guaranteed to converge to an ϵ\epsilon accurate solution after O(1ϵ2)O({1\over\epsilon^{2}}) iterations. We note that the dependence of the convergence rate on ϵ\epsilon is similar to standard results on convergence of gradient descent to stationary points (e.g., see discussion in Allen-Zhu & Hazan, 2016).

The complete proof is provided in Appendix C. Here we provide a high level overview. In particular, we first explain why gradient descent will stay away from the two bad points mentioned in Lemma 5.1.

where c1c_{1} and c2c_{2} are two functions such that c11c_{1}\geq-1 and c20c_{2}\geq 0. Thus the gradient is a sum of a vector in the direction of wt\boldsymbol{w}_{t} and a vector in the direction of w\boldsymbol{w}^{*}. At iteration t+1t+1 we have:

It follows that for λ<1\lambda<1 the angle between wt\boldsymbol{w}_{t} and w\boldsymbol{w}^{*} will decrease in each iteration. Therefore, if w0\boldsymbol{w}_{0} has an angle with w\boldsymbol{w}^{*} that is not π\pi, we will never converge to the saddle point in Lemma 5.1.

Gradient descent solves the k-Non-Overlap-Opt problem under the Gaussian assumption on D\mathcal{D} with high probability and in polynomial time.

Empirical Illustration of Tractability Gap

The results in the previous sections showed that No-Overlap Networks optimization is hard in the general case, but tractable for Gaussian inputs. Here we empirically demonstrate both the easy and hard cases. The training data for the two cases will be generated by using the same w\boldsymbol{w}^{*} but different distributions over x\boldsymbol{x}.

To generate the “hard” case, we begin with a set splitting problem. In particular, we consider a set SS with 4040 elements and a collection C{\cal C} of 760760 subsets of SS, each of size 2020. We choose CjC_{j} such that there exists subsets S1S_{1},S2S_{2} that split the subsets CjC_{j}. We use the reduction in Section 4 to convert this into a No-Overlap Networks optimization problem. This results in a training set of size 800800.

Since we know the w\boldsymbol{w}^{*} that solves the set splitting problem, we can use it to label data from a different distribution. Motivated by Section 5 we use a Gaussian distribution G{\cal G} as defined earlier and generate a training set of the same size (namely 800800) and labels given by the no-overlap network with weight w\boldsymbol{w}^{*}.

For these two learning problems we used AdaGrad (Duchi et al., 2011) to optimize the empirical risk (plain gradient descent also converges, but AdaGrad requires less tuning of step size). For both datasets we used a random normal initializer and for each we chose the best performing learning rate schedule. The training error for each setting as a function of the number of epochs is shown in Figure 2. It is clear that in the non-Gaussian case, AdaGrad gets trapped at a sub-optimal point, whereas the Gaussian case is solved optimally.We note that the value of 0.060.06 attained by the non-Gaussian case is quite high, since the zero weight vector in this case has loss of order 0.10.1. In the Gaussian case AdaGrad converged to w\boldsymbol{w}^{*}. Therefore, given the Gaussian dataset we were able to recover the true weight vector w\boldsymbol{w}^{*}, whereas given the data constructed via the reduction we were not, even though both datasets were of the same size. We conclude that these empirical findings are in line with our theoretical results.

Networks with Overlapping Filters

Thus far we showed that the non-overlapping case becomes tractable under Gaussian inputs. A natural question is then what happens when overlaps are allowed (namely, the stride is smaller than the filter size). Will gradient descent still find a global optimum? Here we show that this is in fact not the case, and that with probability greater than 14\frac{1}{4} gradient descent will get stuck in a sub-optimal region. In Section 7.1 we analyze this setting for a two dimensional example and provide bounds on the level of suboptimality. In Section 7.2 we report on an empirical study of optimization for networks with overlapping filters. Our results suggest that by restarting gradient descent a constant number of times, it will converge to the global minimum with high probability. Complete proofs of the results are provided in Appendix D.

One might wonder why the analysis of the overlapping case should be any different than the non-overlapping case. However, even for a filter of size two, as above, the loss function and consequently the gradient, are more complex in the overlapping case. Indeed, the loss function in this case is given by:

where \alpha=\frac{1}{k^{2}}\big{(}\frac{k}{2}+\frac{k^{2}-3k+2}{2\pi}\big{)}, β=2k\beta=2k and γ=k23k+2π\gamma=\frac{k^{2}-3k+2}{\pi}.

Compared to the objective in Eq. 8 which depends only on w\left\|\boldsymbol{w}\right\|, w\left\|\boldsymbol{w}\right\| and θw,w\theta_{\boldsymbol{w},\boldsymbol{w}^{*}}, we see that the objective in Eq. 16 has new terms such as g(wr,wl)g(\boldsymbol{w}_{r},\boldsymbol{w}_{l}^{*}) which has a more complicated dependence on the weight vectors w\boldsymbol{w}^{*} and w\boldsymbol{w}. This does not only have implications on the analysis, but also on the geometric properties of the loss function and the dynamics of gradient descent. In particular, in Figure 3 we see that the objective has a large sub-optimal region which is not the case when the filters are non-overlapping.

Define h(k)h(k) as in Proposition 7.2. Then with probability 14\geq\frac{1}{4}, a randomly initialized gradient descent with learning rate λ(0,13)\lambda\in(0,\frac{1}{3}) will get stuck in a sub-optimal region, where each point in this region has loss at least 2h(k)+1k2(2h(k)+2)w2\frac{2h(k)+1}{k^{2}(2h(k)+2)}{\left\|\boldsymbol{w}^{*}\right\|}^{2} and this bound is tight.

2 Empirical study of Gradient Descent for m>2𝑚2m>2

We experimented with a range of d,md,m and overlap values (see Appendix E for details of the experimental setup). For each value of dd, mm and overlap we sampled 9090 values of w\boldsymbol{w}^{*} from various uniform input distributions with different supports and several pre-defined deterministic values. This resulted in more than 1200 different sampled w\boldsymbol{w}^{*}. For each such w\boldsymbol{w}^{*} we ran gradient descent multiple times, each initialized randomly from a different w0\boldsymbol{w}_{0}. Using the results from these runs, we could estimate the probability of sampling a w0\boldsymbol{w}_{0} that would converge to the unique global minimum. Viewed differently, this is the probability mass of the basin of attraction of the global optimum. We note that the uniqueness of the global minimum follows easily from equating the population risk (Eq. 3) to 0 and the full proof is deferred to Appendix F.

Our results are that across all values of d,md,m, overlap and w\boldsymbol{w}^{*}, the probability mass of the basin of attraction is at least 117\frac{1}{17}. The practical implication is that multiple restarts of gradient descent (in this case a few dozen) will find the global optimum with high probability. We leave formal analysis of this intriguing fact for future work.

Discussion

The key theoretical question in deep learning is why it succeeds in finding good models despite the non-convexity of the training loss. It is clear that an answer must characterize specific settings where deep learning provably works. Despite considerable recent effort, such a case has not been shown. Here we provide the first analysis of a non-linear architecture where gradient descent is globally optimal, for a certain input distribution, namely Gaussian. Thus our specific characterization is both in terms of architecture (no-overlap networks, single hidden layer, and average pooling) and input distribution. We show that learning in no-overlap architectures is hard, so that some input distribution restriction is necessary for tractability. Note however, that it is certainly possible that other, non-Gaussian, distributions also result in tractability. Some candidates would be sub-Gaussian and log-concave distributions.

Our derivation addressed the population risk, which for the Gaussian case can be calculated in closed form. In practice, one minimizes an empirical risk. Our experiments in Section 6 suggest that optimizing the empirical risk in the Gaussian case is tractable. It would be interesting to prove this formally. It is likely that measure concentration results can be used to get similar results to those we had for the population risk (e.g., see Mei et al., 2016; Xu et al., 2016, for use of such tools).

Convolution layers are among the basic building block of neural networks. Our work is among the first to analyze optimization for these. The architecture we study is similar in structure to convolutional networks, in the sense of using parameter tying and pooling. However, most standard convolutional layers have overlap and use max pooling. In Section 7 we provide initial results for the case of overlap, showing there is hope for proving optimality for gradient descent with random restarts. Analyzing max pooling would be very interesting and is left for future work.

Finally, we note that distribution dependent tractability has been shown for intersection of halfspaces (Klivans et al., 2009), which is a non-convolutional architecture. However, these results do not use gradient descent. It would be very interesting to use our techniques to try and understand gradient descent for the population risk in these settings.

References

Appendix A Proof of Lemma 3.2

First assume that θu,v0,π\theta_{\mathbf{u},\mathbf{v}}\neq 0,\pi . Then by straightforward calculation we have

Now we assume that u\mathbf{u} is parallel to v\mathbf{v}. We first show that gg is differentiable in this case. Without loss of generality we can assume that u\boldsymbol{u} and v\boldsymbol{v} lie on the u1u_{1} axis. This follows since gg is a function of u\left\|\mathbf{u}\right\|, v\left\|\mathbf{v}\right\| and θu,v\theta_{\mathbf{u},\mathbf{v}} and therefore g(,v)g(\cdot,\mathbf{v}) has a directional derivative in direction d\mathbf{d} at u\mathbf{u} if and only if g(,Rv)g(\cdot,R\mathbf{v}) has a directional derivative in direction RdR\mathbf{d} at RuR\mathbf{u} where RR is a rotation matrix. Hence g(,v)g(\cdot,\mathbf{v}) is differentiable at u\mathbf{u} if and only if g(,Rv)g(\cdot,R\mathbf{v}) is differentiable at RuR\mathbf{u}. Furthermore, if v\mathbf{v} and u\mathbf{u} are on the u1u_{1} axis, then by symmetry the partial derivatives with respect to other axes at u\mathbf{u} are all equal, hence we only need to consider the partial derivative with respect to the u1u_{1} and u2u_{2} axes.

Let v=(1,0,...,0)\boldsymbol{v}=(1,0,...,0) and u=(u,0,...,0)\boldsymbol{u}=(u,0,...,0) where u0u\neq 0. In order to show differentiability, we will prove that g(u,v)g(\mathbf{u},\mathbf{v}) has continuous partial derivatives at u\boldsymbol{u} (by equality (18) the partial derivatives are clearly continuous at points that are not on the u1u_{1} axis. Define uϵ=(u,ϵ,0,...,0)\mathbf{u}_{\epsilon}=(u,\epsilon,0,...,0). Then

By L’hopital’s rule and the calculation of equality (18) we get

Furthermore, by equality (18) we see that limuugu2(u,v)=0\lim_{\mathbf{u}^{\prime}\to\mathbf{u}}{\frac{\partial g}{\partial u_{2}}(\mathbf{u}^{\prime},\mathbf{v})}=0 since limuusinθu,v=0\lim_{\mathbf{u}^{\prime}\to\mathbf{u}}{\sin\theta_{\boldsymbol{u}^{\prime},\boldsymbol{v}}}=0.

For a fixed θu,v\theta_{\mathbf{u},\mathbf{v}} equal to or π\pi, gu1(u,v)\frac{\partial g}{\partial u_{1}}(\mathbf{u},\mathbf{v}) is the same as gu(u,v)\frac{\partial g}{\partial\left\|\mathbf{u}\right\|}(\mathbf{u},\mathbf{v}). Hence,

and the partial derivative is continuous since

Finally, we see that for the case where u\mathbf{u} and v\mathbf{v} are parallel, the values we got for the partial derivatives coincide with equation Eq. 18. This concludes the proof.

Appendix B Proof of Proposition 4.1

We will prove the claim by induction on kk. For the base case we will show that Set-Splitting-by-2-Sets is NP-complete. We will prove this via a reduction from a variant of the 3-SAT problem with the restriction of equal number of variables and clauses, which we denote Equal-3SAT. We will first prove that Equal-3SAT is NP-complete.

This can be shown via a reduction from 3SAT. Given a formula ϕ\phi with nn variables and mm clauses we can increase nmn-m by 11 by adding a new clause of the form (xyx\vee y) for new variables xx and yy. Furthermore, we can decrease nmn-m by 11 by adding two new identical clauses of the form (zz) for a new variable zz. In each case the formula with the new clause(s) is satisfiable if and only if ϕ\phi is. Therefore given a formula ϕ\phi we can construct a new formula ψ\psi with equal number of variables and clauses such that ϕ\phi is satisfiable if and only if ψ\psi is. ∎

We will now give a reduction from Equal-3SAT to Set-Splitting-by-2-Sets.

The following reduction is exactly the reduction from 3SAT to Splitting-Sets and we include it here for completeness. Let ϕ\phi be a formula with set of variables VV and equal number of variables and clauses. We construct the sets SS and C\cal C as follows. Define

where xˉ\bar{x} is the negation of variable xx and nn is a new variable not in VV. For each clause cc with set of variables or negations of variables VcV_{c} that appear in the clause (for example, if c=(xˉy)c=(\bar{x}\vee y) then Vc={xˉ,y}V_{c}=\{\bar{x},y\}) construct a set Sc=Vc{n}S_{c}=V_{c}\cup\{n\}. Furthermore, for each variable xVx\in V construct a set Sx={x,xˉ}S_{x}=\{x,\bar{x}\}. Let C\cal C be the family of subsets ScS_{c} and SxS_{x} for all clauses cc and xVx\in V. Note that C|\cal C|S\leq|S| which is required by the definition of Set-Splitting-by-2-Sets.

Assume that ϕ\phi is satisfiable and let AA be the satisfying assignment. Define S1={xA(x)=true}{xˉA(x)=false}S_{1}=\{x|A(x)=true\}\cup\{\bar{x}|A(x)=false\} and S2={xA(x)=false}{xˉA(x)=true}{n}S_{2}=\{x|A(x)=false\}\cup\{\bar{x}|A(x)=true\}\cup\{n\}. Note that S1S2=SS_{1}\cup S_{2}=S. Assume by contradiction that there exists a set TCT\in\cal C such that TS1T\subseteq S_{1} or TS2T\subseteq S_{2}. If TS1T\subseteq S_{1} then TT is not a set ScS_{c} for some clause cc because nS1n\notin S_{1}. However, by the construction of S1S_{1} a variable and its negation cannot be in S1S_{1}. Hence TS1T\subseteq S_{1} is impossible. If TS2T\subseteq S_{2} then as in the previous claim TT cannot be a set SxS_{x} for a variable xx. Hence T=ScT=S_{c} for some clause cc. However, this implies that A(c)=falseA(c)=false, a contradiction.

Conversely, assume there exists splitting sets S1S_{1} and S2S_{2} and w.l.o.g. nS1n\in S_{1}. We note that it follows that no variable xx and its negation xˉ\bar{x} are both contained in one of the sets S1S_{1} or S2S_{2}. Define the following assignment AA for ϕ\phi. For all xVx\in V if xS1x\in S_{1} let A(x)=falseA(x)=false, otherwise let A(x)=trueA(x)=true. Note that AA is a well defined assignment. Assume by contradiction that there is a clause cc in ϕ\phi which is not satisfiable. Since S2S_{2} splits ScS_{c} it follows that there exists a variable xx such that it or its negation xˉ\bar{x} are in S2S_{2} (recall that nS1n\in S_{1}). If xS2x\in S_{2} then A(x)=trueA(x)=true and if xˉS2\bar{x}\in S_{2} then A(xˉ)=trueA(\bar{x})=true since xS1x\in S_{1}. In both cases cc is satisfiable, a contradiction. ∎

This proves the base case. We will now prove the induction step by giving a reduction from Set-Splitting-by-k-Sets to Set-Splitting-by-(k+1)-Sets. Given S={1,2,...,d}S=\{1,2,...,d\} and C\cal C ={Cj}j=\{C_{j}\}_{j} such that C|\cal C| (k1)d\leq(k-1)d, define S={1,2,...,d+1}S^{\prime}=\{1,2,...,d+1\} and C\cal C^{\prime} =C=\cal C {Dj}j\cup\{D_{j}\}_{j} where Dj={j,d+1}D_{j}=\{j,d+1\} for all 1jd1\leq j\leq d. Note that C|\cal C^{\prime}| kd<k(d+1)\leq kd<k(d+1). Assume that there are S1,...,SkS_{1},...,S_{k} that split the sets in C\cal C. Then if we define Sk+1={d+1}S_{k+1}=\{d+1\}, it follows that i=1k+1Si=S\bigcup_{i=1}^{k+1}{S_{i}}=S and S1,...,Sk,Sk+1S_{1},...,S_{k},S_{k+1} are disjoint and split the sets in C\cal C^{\prime}.

Conversely, assume that S1,...,Sk,Sk+1S_{1},...,S_{k},S_{k+1} split the sets in C\cal C^{\prime}. Let w.l.o.g. Sk+1S_{k+1} be the set that contains d+1d+1. Then for all 1jd1\leq j\leq d we have Dj⊈Sk+1D_{j}\not\subseteq S_{k+1}. It follows that for all 1jd1\leq j\leq d, jSk+1j\notin S_{k+1}, or equivalently, Sk+1={d+1}S_{k+1}=\{d+1\}. Hence, i=1kSi=S\bigcup_{i=1}^{k}{S_{i}}=S and S1,...,SkS_{1},...,S_{k} are disjoint and split the sets in C\cal C, as desired.

Appendix C Missing Proofs for Section 5

For w0\mathbf{w}\neq\mathbf{0}, the claim follows from Lemma 3.2. As in the proof of Lemma 3.2 we can assume w.l.o.g. that w=(0,0,...,0)\mathbf{w}=(0,0,...,0) and w=(1,0,...,0)\mathbf{w}^{*}=(1,0,...,0). Let f(w,w)=2kg(w,w)+(k2k)wwπf(\mathbf{w},\mathbf{w}^{*})=2kg(\mathbf{w},\mathbf{w}^{*})+(k^{2}-k)\frac{\left\|\mathbf{w}\right\|\left\|\mathbf{w}^{*}\right\|}{\pi}. It suffices to show that fu2(w,w)\frac{\partial f}{\partial u_{2}}(\mathbf{w},\mathbf{w}^{*}) does not exist. Indeed, let wϵ=(0,ϵ,0,...,0)\mathbf{w}_{\epsilon}=(0,\epsilon,0,...,0) then by L’hopital’s rule

Hence the left and right partial derivatives with respect to variable u2u_{2} are not equal, and thus fu2(w,w)\frac{\partial f}{\partial u_{2}}(\mathbf{w},\mathbf{w}^{*}) does not exist.

Denote θθw,w\theta\triangleq\theta_{\boldsymbol{w},\boldsymbol{w}^{*}}. If θ=0\theta=0 then let w=αw\mathbf{w}=\alpha\mathbf{w}^{*} for some α>0\alpha>0. It follows that

or equivalently α=1\alpha=1, and thus w=w\mathbf{w}=\mathbf{w}^{*}.

If θ=π\theta=\pi then w=k2kk2+(π1)kw\left\|\mathbf{w}\right\|=\frac{k^{2}-k}{k^{2}+(\pi-1)k}\left\|\mathbf{w}^{*}\right\| and thus w=(k2kk2+(π1)k)w\mathbf{w}=-(\frac{k^{2}-k}{k^{2}+(\pi-1)k})\mathbf{w}^{*}. By setting θ=π\theta=\pi in the loss function, one can see that w=(k2kk2+(π1)k)w\mathbf{w}=-(\frac{k^{2}-k}{k^{2}+(\pi-1)k})\mathbf{w}^{*} is a one-dimensional local minimum, whereas by fixing w\left\|\mathbf{w}\right\| and decreasing θ\theta, the loss function decreases. It follows that w=(k2kk2+(π1)k)w\mathbf{w}=-(\frac{k^{2}-k}{k^{2}+(\pi-1)k})\mathbf{w}^{*} is a saddle point. If θ0,π\theta\neq 0,\pi then w\mathbf{w} and w\mathbf{w}^{*} are linearly independent and thus \frac{k}{\pi}\Big{(}\pi-\theta\Big{)}=0 which is a contradiction.

for any w\mathbf{w} and directions d1\mathbf{d}_{1} and d2\mathbf{d}_{2}. It follows that

for ij,sti\neq j,s\neq t such that i,j,s,t1i,j,s,t\neq 1. It follows that we only need to consider second partial derivatives with respect to 3 axes w1w_{1},w2w_{2} and w3w_{3}. Denote uϵ=(γ(k),ϵ,0,...,0)\mathbf{u}_{\epsilon}=(-\gamma(k),\epsilon,0,...,0) and w=(1,0,...,0)\boldsymbol{w}^{*}=(1,0,...,0) and β(k)=k2kπ\beta(k)=\frac{k^{2}-k}{\pi} and note that γ(k)=β(k)β(k)+k\gamma(k)=\frac{\beta(k)}{\beta(k)+k}. Then by equation Eq. 19 we have

where θuϵ,w=arccos(γ(k)ϵ2+γ2(k))\theta_{\mathbf{u}_{\epsilon},\mathbf{w}^{*}}=\arccos(\frac{-\gamma(k)}{\sqrt{\epsilon^{2}+\gamma^{2}(k)}}).

Taking derivatives of the gradient with respect to w1w_{1} is easier because the expressions in Eq. 19 that depend on θw,w\theta_{\mathbf{w},\mathbf{w}^{*}} and ww\frac{\mathbf{w}}{\left\|\mathbf{w}\right\|} are constant. Therefore,

C.2 Proof of Theorem 5.2

If 0<θt<π0<\theta_{t}<\pi and λ<1\lambda<1 then θt+1<θt\theta_{t+1}<\theta_{t}.

to wt\mathbf{w}_{t} does not change θt\theta_{t} for λ<1\lambda<1, since k+k2kπk21\frac{k+\frac{k^{2}-k}{\pi}}{k^{2}}\leq 1 for k1k\geq 1. In addition, adding \frac{\lambda}{\pi k}\Big{(}\pi-\theta\Big{)}\mathbf{w}^{*} decreases θt\theta_{t}. ∎

We will need the following two lemmas to establish a lower bound on wt\left\|\mathbf{w}_{t}\right\|.

If π2<θt<π\frac{\pi}{2}<\theta_{t}<\pi then wt+1sinθtsinθt+1min{wt,wsinθtα(k)π}\left\|\mathbf{w}_{t+1}\right\|\geq\frac{\sin{\theta_{t}}}{\sin{\theta_{t+1}}}\min\{\left\|\mathbf{w}_{t}\right\|,\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\alpha(k)\pi}\}.

Notice that if wtwsinθtα(k)π\left\|\mathbf{w}_{t}\right\|\leq\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\alpha(k)\pi} then

Similarly, if wtwsinθtα(k)π\left\|\mathbf{w}_{t}\right\|\geq\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\alpha(k)\pi} then utwsinθtα(k)π\left\|\mathbf{u}_{t}\right\|\geq\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\alpha(k)\pi}. Furthermore, by a simple geometric observation we see that wt+1cos(θt+1π2)=utcos(θtπ2)\left\|\mathbf{w}_{t+1}\right\|\cos(\theta_{t+1}-\frac{\pi}{2})=\left\|\mathbf{u}_{t}\right\|\cos(\theta_{t}-\frac{\pi}{2}) if θt+1>π2\theta_{t+1}>\frac{\pi}{2} and wt+1cos(π2θt+1)=utcos(θtπ2)\left\|\mathbf{w}_{t+1}\right\|\cos(\frac{\pi}{2}-\theta_{t+1})=\left\|\mathbf{u}_{t}\right\|\cos(\theta_{t}-\frac{\pi}{2}) if θt+1π2\theta_{t+1}\leq\frac{\pi}{2}. This is equivalent to wt+1=sinθtsinθt+1ut\left\|\mathbf{w}_{t+1}\right\|=\frac{\sin{\theta_{t}}}{\sin{\theta_{t+1}}}\left\|\mathbf{u}_{t}\right\|. It follows that wt+1sinθtsinθt+1min{wt,wsinθtα(k)π}\left\|\mathbf{w}_{t+1}\right\|\geq\frac{\sin{\theta_{t}}}{\sin{\theta_{t+1}}}\min\{\left\|\mathbf{w}_{t}\right\|,\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\alpha(k)\pi}\} as desired. ∎

If 0<θtπ20<\theta_{t}\leq\frac{\pi}{2} and 0<λ<120<\lambda<\frac{1}{2} then wt+1min{wt,w8}\left\|\mathbf{w}_{t+1}\right\|\geq\min\{\left\|\mathbf{w}_{t}\right\|,\frac{\left\|\mathbf{w}^{*}\right\|}{8}\}

First assume that k2k\geq 2. Let ut\boldsymbol{u}_{t} be as in Lemma C.2, then

It follows that if wt(k2k)wα(k)πk2w2π\left\|\mathbf{w}_{t}\right\|\geq\frac{(k^{2}-k)\left\|\mathbf{w}^{*}\right\|}{\alpha(k)\pi k^{2}}\geq\frac{\left\|\mathbf{w}^{*}\right\|}{2\pi} then utw2π\left\|\mathbf{u}_{t}\right\|\geq\frac{\left\|\mathbf{w}^{*}\right\|}{2\pi}. Otherwise if wt(k2k)wα(k)πk2\left\|\mathbf{w}_{t}\right\|\leq\frac{(k^{2}-k)\left\|\mathbf{w}^{*}\right\|}{\alpha(k)\pi k^{2}} then utwt\left\|\mathbf{u}_{t}\right\|\geq\left\|\mathbf{w}_{t}\right\|. Since \mathbf{w}_{t+1}=\mathbf{u}_{t}+\frac{\lambda}{\pi k}\Big{(}\pi-\theta\Big{)}\mathbf{w}^{*} and 0<θtπ20<\theta_{t}\leq\frac{\pi}{2} we have wt+1utmin{w2π,wt}\left\|\mathbf{w}_{t+1}\right\|\geq\left\|\mathbf{u}_{t}\right\|\geq\min\{\frac{\left\|\mathbf{w}^{*}\right\|}{2\pi},\left\|\mathbf{w}_{t}\right\|\}.

Finally, assume θtπ3\theta_{t}\geq\frac{\pi}{3}. As in the proof of Lemma C.2, if wtwsinθtπ32wπ\left\|\mathbf{w}_{t}\right\|\geq\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\pi}\geq\frac{\sqrt{3}}{2}\frac{\left\|\mathbf{w}^{*}\right\|}{\pi} then wt+1ut32wπ\left\|\mathbf{w}_{t+1}\right\|\geq\left\|\mathbf{u}_{t}\right\|\geq\frac{\sqrt{3}}{2}\frac{\left\|\mathbf{w}^{*}\right\|}{\pi}. Otherwise, if wt<wsinθtπ\left\|\mathbf{w}_{t}\right\|<\frac{\left\|\mathbf{w}^{*}\right\|\sin\theta_{t}}{\pi} then wt+1utwt\left\|\mathbf{w}_{t+1}\right\|\geq\left\|\mathbf{u}_{t}\right\|\geq\left\|\mathbf{w}_{t}\right\|. This concludes our proof. ∎

We can now show that in each iteration wt\left\|\mathbf{w}_{t}\right\| is bounded away from by a constant.

Assume GD is initialized at w0\mathbf{w}_{0} such that θ0π\theta_{0}\neq\pi and runs for TT iterations with learning rate 0<λ<120<\lambda<\frac{1}{2}. Then for all 0tT0\leq t\leq T,

Let θ0>θ1>...>θT\theta_{0}>\theta_{1}>...>\theta_{T} (by Lemma C.1). Let ii be the last index such that θi>π2\theta_{i}>\frac{\pi}{2} (if such ii does not exist let i=1i=-1). Since sinθj>sinθ0\sin{\theta_{j}}>\sin{\theta_{0}} for all 0ji0\leq j\leq i, by applying Lemma C.2 at most j+1j+1 times we have

Finally, by Lemma C.3 and the fact that θjπ2\theta_{j}\leq\frac{\pi}{2} for all i<jTi<j\leq T, we get

for all i+1<jTi+1<j\leq T, from which the claim follows. ∎

Assume w1,w2M\left\|\mathbf{w}_{1}\right\|,\left\|\mathbf{w}_{2}\right\|\geq M, w1,w2\mathbf{w}_{1},\mathbf{w}_{2} and w\mathbf{w}^{*} are on the same two dimensional half-plane defined by w\boldsymbol{w}^{*}, then

for L=1+3wML=1+\frac{3\left\|\mathbf{w}^{*}\right\|}{M}.

Let θ1\theta_{1} and θ2\theta_{2} be the angles between w1\boldsymbol{w}_{1},w\boldsymbol{w}^{*} and w2\boldsymbol{w}_{2},w\boldsymbol{w}^{*}, respectively. By the inequality x0sinxsinx0x\frac{x_{0}\sin{x}}{\sin{x_{0}}}\geq x for 0xx0<π0\leq x\leq x_{0}<\pi and since θ1θ22π2\frac{|\theta_{1}-\theta_{2}|}{2}\leq\frac{\pi}{2} we have

Furthermore w1w2\left\|\mathbf{w}_{1}-\mathbf{w}_{2}\right\| is minimized (for fixed angles θ1\theta_{1} and θ2\theta_{2}) when w1=w2=M\left\|\mathbf{w}_{1}\right\|=\left\|\mathbf{w}_{2}\right\|=M and is equal to 2Msinθ1θ222M\sin\frac{|\theta_{1}-\theta_{2}|}{2}. Thus, under our assumptions we have,

For the first summand, we will first find the parameterization of a two dimensional vector of length sinθ\sin\theta where θ\theta is the angle between the vector and the positive xx axis. Denote this vector by (a,b)(a,b), then the following holds

The solution to these equations is (a,b)=(sin2θ2,sin2θ)(a,b)=(\frac{\sin 2\theta}{2},\sin^{2}\theta). Hence (here we use the fact that w1\boldsymbol{w}_{1},w2\boldsymbol{w}_{2} are on the same half-plane)

where the first inequality follows from the fact that sinxsinyxy|\sin x-\sin y|\leq|x-y| and the second inequality from previous results. In conclusion, we have

Similarly, in order to show that the function f(w)=wwf(\boldsymbol{w})=\frac{\boldsymbol{w}}{\left\|\boldsymbol{w}\right\|} is Lipschitz continuous, we parameterize the unit vector by (cosθ,sinθ)(\cos\theta,\sin\theta) where θ\theta is the angle between the vector and the positive xx axis. We now obtain

The proof exactly follows the proof of Lemma 1.2.3 in (Nesterov, 2004) and note that the proof only requires Lipschitz continuity of the gradient on the set S={x+τ(yx)0τ1}S=\{x+\tau(y-x)\mid 0\leq\tau\leq 1\} and that SDS\subseteq D. ∎

By Proposition C.4, for all tt, wtM\left\|\mathbf{w}_{t}\right\|\geq M^{\prime} where

. Furthermore, by a simple geometric observation we have

It follows by Lemma C.5 that for any tt and x1,x2St{wt+τ(wt+1wt)0τ1}\boldsymbol{x}_{1},\boldsymbol{x}_{2}\in S_{t}\triangleq\{\mathbf{w}_{t}+\tau(\mathbf{w}_{t+1}-\mathbf{w}_{t})\mid 0\leq\tau\leq 1\},

where L=1+3wML=1+\frac{3\left\|\mathbf{w}^{*}\right\|}{M} and M=Mcosθ02M=M^{\prime}\cos{\frac{\theta_{0}}{2}} (Note that cosθtθt+12cosθ02\cos{\frac{\theta_{t}-\theta_{t+1}}{2}}\geq\cos{\frac{\theta_{0}}{2}} for all tt by Lemma C.1).

Proof of Theorem 5.2. First, we observe that for a randomly initialized point w0\mathbf{w}_{0}, 0θ0π(1δ)0\leq\theta_{0}\leq\pi(1-\delta) with probability 1δ1-\delta. Hence by Proposition C.6 we have for L=1+3wML=1+\frac{3\left\|\mathbf{w}^{*}\right\|}{M} where M=min{sin(π(1δ)),sin2(π(1δ))α(k)π,18}cos(π(1δ)2)M=\min\{\sin(\pi(1-\delta)),\frac{\sin^{2}(\pi(1-\delta))}{\alpha(k)\pi},\frac{1}{8}\}\cos(\frac{\pi(1-\delta)}{2}) and α(k)=k+k2kπ\alpha(k)=k+\frac{k^{2}-k}{\pi}, and for λ=1L\lambda=\frac{1}{L} (we assume w.l.o.g. that L>2L>2),

Similarly to the previous argument, we have

Hence, θt<arcsin(2kϵ)=O(ϵ)\theta_{t}<\arcsin(2k\epsilon)=O(\epsilon). It follows by the triangle inequality that

where the last inequality follows since the arc of a circle is larger than its corresponding segment.

Therefore we get wtw<O(ϵ)|\left\|\mathbf{w}_{t}\right\|-\left\|\mathbf{w}^{*}\right\||<O(\epsilon). By the bounds on θt\theta_{t} and wtw|\left\|\mathbf{w}_{t}\right\|-\left\|\mathbf{w}^{*}\right\|| and the inequality cosx1x\cos x\geq 1-x for x0x\geq 0, we can give an upper bound on wtw\left\|\mathbf{w}_{t}-\mathbf{w}^{*}\right\|:

where the second inequality follows from Lipschitz continuity of σ\sigma, the third inequality from the Cauchy-Schwarz inequality and the last equality since x2{\left\|\mathbf{x}\right\|}^{2} follows a chi-squared distribution with dd degrees of freedom.

Appendix D Missing Proofs for Section 7.1

Define wp=(w2,w1)\boldsymbol{w}_{p}=(w_{2},w_{1}), wp1=(0,w)\boldsymbol{w}^{*}_{p_{1}}=(0,-w^{*}) and wp2=(w,0)\boldsymbol{w}^{*}_{p_{2}}=(w^{*},0). We first prove the following lemma.

The gradient does not follow immediately from Lemma 3.2 because the loss has expressions with of the function gg but with different dependencies on the parameters in w\boldsymbol{w}. We will only calculate g(wr,wl)w\frac{\partial g(\mathbf{w}_{r},\mathbf{w}_{l})}{\partial\mathbf{w}}, the other expressions are calculated in the same manner.

Let w=(w1,w2)\mathbf{w}=(w_{1},w_{2}) then cosθwr,wl=w1w2w12+w22\cos\theta_{\mathbf{w}_{r},\mathbf{w}_{l}}=\frac{w_{1}w_{2}}{w_{1}^{2}+w_{2}^{2}}. Then,

or equivalently cosθwr,wlw=wpw22wcosθwr,wlw2\frac{\partial\cos\theta_{\mathbf{w}_{r},\mathbf{w}_{l}}}{\partial\mathbf{w}}=\frac{\mathbf{w}_{p}}{{\left\|\mathbf{w}\right\|}^{2}}-\frac{2\mathbf{w}\cos\theta_{\mathbf{w}_{r},\mathbf{w}_{l}}}{{\left\|\mathbf{w}\right\|}^{2}}. It follows that

We will prove that wt+10\mathbf{w}_{t+1}\neq 0 and that it is in the interior of the fourth quadrant. Denote w=wt\mathbf{w}=\mathbf{w}_{t} and \nabla l(\mathbf{w})=\frac{1}{k^{2}}\big{(}B_{1}(\mathbf{w})+B_{2}(\mathbf{w})+B_{3}(\mathbf{w})\big{)} where

Let w=(w,mw)\mathbf{w}=(w,-mw) for w,m0w,m\geq 0. Straightforward calculation shows that cosθwl,wr=12(1+m2)\cos\theta_{\mathbf{w}_{l},\mathbf{w}^{*}_{r}}=\frac{1}{\sqrt{2(1+m^{2}})} and cosθwr,wl=m2(m2+1)\cos\theta_{\mathbf{w}_{r},\mathbf{w}^{*}_{l}}=\frac{m}{\sqrt{2(m^{2}+1)}}. Hence π4θwl,wr,θwr,wlπ2\frac{\pi}{4}\leq\theta_{\mathbf{w}_{l},\mathbf{w}^{*}_{r}},\theta_{\mathbf{w}_{r},\mathbf{w}^{*}_{l}}\leq\frac{\pi}{2}. Since w\mathbf{w} is in the fourth quadrant we also have 3π4θw,wπ\frac{3\pi}{4}\leq\theta_{\mathbf{w},\mathbf{w}^{*}}\leq\pi. Therefore, adding λB3(w)-\lambda B_{3}(\mathbf{w}) can only increase w\left\|\mathbf{w}\right\|. This follows since in the worst case (the least possible increase of w\left\|\mathbf{w}\right\|)

which is in the fourth quadrant for k2k\geq 2. In addition, since wp-\mathbf{w}_{p} is in the fourth quadrant then adding λB2(w)-\lambda B_{2}(\mathbf{w}) increases w\left\|\mathbf{w}\right\|.

If w<w16\left\|\mathbf{w}\right\|<\frac{\left\|\mathbf{w}^{*}\right\|}{16} then B1(w)-B_{1}(\mathbf{w}) points in the direction of w\mathbf{w} since in this case B1(w)=αw-B_{1}(\mathbf{w})=\alpha\mathbf{w} where

for k2k\geq 2. If B1(w)-B_{1}(\mathbf{w}) points in the direction of w-\mathbf{w} then by the assumption that λ(0,13)\lambda\in(0,\frac{1}{3}) we have λB1(w)<w\left\|\lambda B_{1}(\mathbf{w})\right\|<\left\|\mathbf{w}\right\|. Thus we can conclude that wt+10\mathbf{w}_{t+1}\neq 0.

Now, let w=(w1,w2)\mathbf{w}=(w_{1},w_{2}), θt\theta_{t} be the angle between w=wt\mathbf{w}=\mathbf{w}_{t} and the positive xx axis and first assume that w1>w2w_{1}>-w_{2}. In this case B3(w)-B_{3}(\mathbf{w}) least increases (or even most decreases) θt\theta_{t} when

which is a vector in the fourth quadrant for k2k\geq 2. Otherwise, B3(w)-B_{3}(\mathbf{w}) is a vector in the fourth quadrant as well. Note that we used the facts π4θwl,wr,θwr,wlπ2\frac{\pi}{4}\leq\theta_{\mathbf{w}_{l},\mathbf{w}^{*}_{r}},\theta_{\mathbf{w}_{r},\mathbf{w}^{*}_{l}}\leq\frac{\pi}{2} and 3π4θw,wπ\frac{3\pi}{4}\leq\theta_{\mathbf{w},\mathbf{w}^{*}}\leq\pi. Since λB1(w)-\lambda B_{1}(\mathbf{w}) does not change θt\theta_{t} and λB2(w)-\lambda B_{2}(\mathbf{w}) increases θt\theta_{t} but never to an angle greater than or equal to π2\frac{\pi}{2}, it follows that 0<θt+1<π20<\theta_{t+1}<\frac{\pi}{2}.

If w1w2w_{1}\leq-w_{2} then by defining all angles with respect to the negative yy axis, we get the same argument as before. This shows that wt+1\mathbf{w}_{t+1} is in the interior of the fourth quadrant, which concludes our proof.

D.2 Proof of Proposition 7.2

We will need the following auxiliary lemmas.

Let w\mathbf{w} be in the fourth quadrant, then g(\mathbf{w}_{l},\mathbf{w}_{r})\geq\frac{1}{2\pi}\big{(}\frac{\sqrt{3}}{2}-\frac{\pi}{6}\big{)}{\left\|\mathbf{w}\right\|}^{2}.

First note that the function s(θ)=sinθ+(πθ)cosθs(\theta)=\sin\theta+(\pi-\theta)\cos\theta is decreasing as a function of θ[0,π]\theta\in[0,\pi]. Let w=(w,mw)\mathbf{w}=(w,-mw) for w,m0w,m\geq 0. Straightforward calculation shows that cosθwl,wr=mm2+1\cos\theta_{\mathbf{w}_{l},\mathbf{w}_{r}}=-\frac{m}{m^{2}+1}. As a function of m[0,)m\in[0,\infty), cosθwl,wr\cos\theta_{\mathbf{w}_{l},\mathbf{w}_{r}} is minimized for m=1m=1 with value 12-\frac{1}{2}, i.e., when θ(wl,wr)=2π3\theta(\mathbf{w}_{l},\mathbf{w}_{r})=\frac{2\pi}{3} and this is the largest angle possible. Thus g(\mathbf{w}_{l},\mathbf{w}_{r})\geq\frac{1}{2\pi}s(\frac{2\pi}{3})\big{)}{\left\|\mathbf{w}\right\|}^{2}=\frac{1}{2\pi}\big{(}\frac{\sqrt{3}}{2}-\frac{\pi}{6}\big{)}{\left\|\mathbf{w}\right\|}^{2}. ∎

, then in the interval θ[0,π4]\theta\in[0,\frac{\pi}{4}], f(θ)f(\theta) is maximized at θ=π4\theta=\frac{\pi}{4} for all k2k\geq 2.

We will maximize the function f(θ)2(k1)=kk1f1(θ)+f2(θ)+f3(θ)\frac{f(\theta)}{2(k-1)}=\frac{k}{k-1}f_{1}(\theta)+f_{2}(\theta)+f_{3}(\theta) where f1(θ),f2(θ),f3(θ)f_{1}(\theta),f_{2}(\theta),f_{3}(\theta) correspond to the three summands in the expression of f(θ)f(\theta).

Since for h(x)=1x2+(πarccos(x))xh(x)=\sqrt{1-x^{2}}+(\pi-\arccos(x))x we have h(x)=πarccos(x)h^{\prime}(x)=\pi-\arccos(x), it follows that f2(θ)=(πarccoscosθ2)sinθ2f_{2}^{\prime}(\theta)=-(\pi-\arccos{\frac{\cos\theta}{\sqrt{2}}})\frac{\sin\theta}{\sqrt{2}}, f3(θ)=(πarccossinθ2)cosθ2f_{3}^{\prime}(\theta)=(\pi-\arccos{\frac{\sin\theta}{\sqrt{2}}})\frac{\cos\theta}{\sqrt{2}} and f1(θ)=(π4θ)sin(3π4+θ)f_{1}^{\prime}(\theta)=-(\frac{\pi}{4}-\theta)\sin(\frac{3\pi}{4}+\theta). It therefore suffices to show that

By applying the inequalities arccos(x)π2x\arccos(x)\leq\frac{\pi}{2}-x for xx\in and arccos(x)π2x110\arccos(x)\geq\frac{\pi}{2}-x-\frac{1}{10} for x[12,12]x\in[\frac{1}{2},\frac{1}{\sqrt{2}}] we get d1(θ)d2(θ)d_{1}(\theta)\geq d_{2}(\theta) where

We notice that d2(0)0d_{2}(0)\geq 0 and d2(34)0d_{2}(\frac{3}{4})\geq 0 for all k2k\geq 2. In addition,

and d2(0)>0d_{2}^{\prime}(0)>0 for all k2k\geq 2. It follows that in order to show that d2(θ)0d_{2}(\theta)\geq 0 for θ[0,34]\theta\in[0,\frac{3}{4}] and k2k\geq 2, it suffices to show that d2(θ)0d_{2}^{\prime\prime}(\theta)\leq 0 for θ[0,34]\theta\in[0,\frac{3}{4}] and k2k\geq 2. Indeed,

for all θ[0,34]\theta\in[0,\frac{3}{4}] and k2k\geq 2. Note that the first inequality follows since cosθsinθ\cos\theta\geq\sin\theta and the second since cos(3π4+θ)max{sinθ,sin(3π4+θ)}\cos(\frac{3\pi}{4}+\theta)\geq\max\{\sin\theta,\sin(\frac{3\pi}{4}+\theta)\}, both for θ[0,34]\theta\in[0,\frac{3}{4}]. This shows that d1(θ)0d_{1}(\theta)\geq 0 for θ[0,34]\theta\in[0,\frac{3}{4}].

Now assume that θ[34,π4]\theta\in[\frac{3}{4},\frac{\pi}{4}]. Since d1(34)0d_{1}(\frac{3}{4})\geq 0 and d1(π4)0d_{1}(\frac{\pi}{4})\geq 0, it suffices to prove that d1(θ)0d_{1}^{\prime}(\theta)\leq 0 for θ[34,π4]\theta\in[\frac{3}{4},\frac{\pi}{4}]. Indeed, for all θ[34,π4]\theta\in[\frac{3}{4},\frac{\pi}{4}]

We conclude that d1(θ)0d_{1}(\theta)\geq 0 for all θ[0,π4]\theta\in[0,\frac{\pi}{4}] as desired.

Proof of Proposition 7.2. First assume that w1w2w_{1}\geq-w_{2}. Let θ\theta be the angle between w\mathbf{w} and the positive xx axis. Then cosθ=w1w\cos\theta=\frac{w_{1}}{\left\|\mathbf{w}\right\|} and tanθ=w2w1\tan\theta=-\frac{w_{2}}{w_{1}}. Therefore we get

By setting w=αw\left\|\mathbf{w}\right\|=\alpha\left\|\mathbf{w}^{*}\right\| we get

Solving for α\alpha that minimizes the latter expression we obtain

Plugging α\alpha^{*} back to the inequality we get

Finally, assume w1w2w_{1}\leq-w_{2}. In this case, let θ\theta be the angle between w\mathbf{w} and the negative yy axis. Then cosθ=w2w\cos\theta=\frac{-w_{2}}{\left\|\mathbf{w}\right\|} and tanθ=w1w2\tan\theta=-\frac{w_{1}}{w_{2}}. Therefore

Notice that from now on we get the same analysis as in the case where w1w2w_{1}\geq-w_{2}, where we switch between expressions with wl,wr\mathbf{w}_{l},\mathbf{w}^{*}_{r} and expressions with wr,wl\mathbf{w}_{r},\mathbf{w}^{*}_{l}. This concludes our proof. \square

Appendix E Experimental Setup for Section 7.2

In our experiments we estimated the probability of convergence to the global minimum of a randomly initialized gradient descent for many different ground truths w\mathbf{w}^{*} of a convolutional neural network with overlapping filters. For each value of number of hidden neurons, filter size, stride length and ground truth distribution we randomly selected 3030 different ground truths w\mathbf{w}^{*} with respect to the given distribution. We tested with all combinations of values given in Table 1.

Furthermore, for each combination of values of number of hidden neurons, filter size and stride length we tested with deterministic ground truths: ground truth with all entries equal to 1, all entries equal to -1 and with entries that form an increasing sequence from -1 to 1, -2 to 0 and 0 to 2 or decreasing sequence from 1 to -1, 0 to -2 and 2 to 0.

For each ground truth, we ran gradient descent 20 times and for each run we recorded whether it reached a point very close to the unique global minimum or it repeatedly (5000 consecutive iterations) incurred very low gradient values and stayed away from the global minimum. We then calculated the empirical probability p^=#times reached global minimum20\hat{p}=\frac{\text{\#times reached global minimum}}{20}. To compute the one-sided confidence interval we used the Wilson method ((Brown et al., 2001)) which gives a lower bound

where zαz_{\alpha} is the ZZ-score with α=0.05\alpha=0.05 and in our experiments n=20n=20. Note that we initialized gradient descent inside a large hypercube such that outside the hypercube the gradient does not vanish (this can be easily proved after writing out the gradient for each setting).

For all ground truths we got p^0.15\hat{p}\geq 0.15, i.e., for each ground truth we reached the global minimum at least 33 times. Hence the confidence interval lower bound Eq. 43 is greater than 117\frac{1}{17} in all settings. This suggests that with a few dozen repeated runs of a randomly initialized gradient descent, with high probability it will converge to the global minimum.

Appendix F Uniqueness of Global Minimum in the Population Risk