Implicit Bias of Gradient Descent on Linear Convolutional Networks

Suriya Gunasekar, Jason Lee, Daniel Soudry, Nathan Srebro

Introduction

Implicit biases introduced by optimization algorithms play an crucial role in learning deep neural networks (neyshabur2015search; neyshabur2015path; hochreiter1997flat; keskar2016large; chaudhari2016entropy; dinh2017sharp; andrychowicz2016learning; neyshabur2017geometry; zhang2017understanding; wilson2017marginal; hoffer2017train; Smith2018). Large scale neural networks used in practice are highly over-parameterized with far more trainable model parameters compared to the number of training examples. Consequently, optimization objectives for learning such high capacity models have many global minima that fit training data perfectly. However, minimizing the training loss using specific optimization algorithms take us to not just any global minima, but some special global minima, e.g., global minima minimizing some regularizer $\mathcal{R}(\boldsymbol{\beta})$ . In over-parameterized models, specially deep neural networks, much, if not most, of the inductive bias of the learned model comes from this implicit regularization from the optimization algorithm. Understanding the implicit bias, e.g., via characterizing $\mathcal{R}(\boldsymbol{\beta})$ , is thus essential for understanding how and what the model learns.

Similarly, and as we shall see in this paper, changing to a different parameterization of the same model class can also dramatically change the implicit bias gunasekar2017implicit. In particular, we study the implicit bias of optimizing multi-layer fully connected linear networks, and linear convolutional networks (multiple full width convolutional layers followed by a single fully connected layer) using gradient descent. Both of these types of models ultimately implement linear transformations, and can implement any linear transformation. The model class defined by these networks is thus simply the class of all linear predictors, and these models can be seen as mere (over) parameterizations of the class of linear predictors. Minimizing the training loss on these models is therefore entirely equivalent to minimizing the training loss for linear classification. Nevertheless, as we shall see, optimizing these networks with gradient descent leads to very different solutions.

In particular, we show that for fully connected networks with single output, optimizing the exponential loss over linearly separable data using gradient loss again converges to the homogeneous hard margin support vector machine solution. This holds regardless of the depth of the network, and hence, at least with a single output, gradient descent on fully connected networks has the same implicit bias as direct gradient descent on the parameters of the linear predictor. In contrast, training a linear convolutional network with gradient descent biases us toward linear separators that are sparse in the frequency domain. Furthermore, this bias changes with the depth of the network, and a network of depth $L$ (with $L-1$ convolutional layers), implicitly biases towards minimizing the $\|{\widehat{\boldsymbol{\beta}}}\|_{\nicefrac{{2}}{{L}}}$ bridge penalty with $\nicefrac{{2}}{{L}}\leq 1$ of the Fourier transform $\widehat{\boldsymbol{\beta}}$ of the learned linear predictor $\boldsymbol{\beta}$ subject to margin constraints (the gradient descent predictor reaches a stationary point of the $\|{\widehat{\boldsymbol{\beta}}}\|_{\nicefrac{{2}}{{L}}}$ minimization problem). This is a sparsity inducing regularizer, which induces sparsity more aggressively as the depth increases.

Finally, in this paper we focus on characterizing which global minimum does gradient descent on over-parameterized linear models converge to, while assuming that for appropriate choice of step sizes gradient descent iterates asymptotically minimize the optimization objective. A related challenge in neural networks, not addressed in this paper, is an answer to when does gradient descent minimize the non-convex empirical loss objective to reach a global minimum. This problem while hard in worst case, has been studied for linear networks. Recent work have concluded that with sufficient over-parameterization (as is the case with our settings), loss landscape of linear models are well behaved and all local minima are global minima making the problem tractable burer2003nonlinear; journee2010low; kawaguchi2016deep; nguyen2017loss; lee2016gradient.

Multi-layer Linear Networks

Remark: We use circular convolution with a scaling of $\nicefrac{{1}}{{\sqrt{D}}}$ to make the analysis cleaner. For convolutions with zero-padding, we expect a similar behavior. Secondly, since our goal here to study implicit bias in sufficiently over-parameterized models, we only study full dimensional convolutional filters. In practice it is common to have filters of width $K$ smaller than the number of input features, which can change the implicit bias.

Thus, the empirical risk minimization problem in (4) is equivalent to the following optimization over the linear predictors $\boldsymbol{\beta}=\mathcal{P}(\mathbf{w})$ :

Although the optimization problems (4) and (5) are exactly equivalent in terms of the set of global minima, in this paper, we show that optimizing (4) with different parameterizations leads to very different classifiers compared to optimizing (5) directly.

In particular, consider problems (4)/(5) on a linearly separable dataset $\{\mathbf{x}_{n},y_{n}\}_{n=1}^{N}$ and using the logistic loss (the two class version of the cross entropy loss typically used in deep learning). The global infimum of $\mathcal{L}(\boldsymbol{\beta})$ is , but this is not attainable by any finite $\boldsymbol{\beta}$ . Instead, the loss can be minimized by scaling the norm of any linear predictor that separates the data to infinity. Thus, any sequence of predictors $\boldsymbol{\beta}^{(t)}$ (say, from an optimization algorithm) that asymptotically minimizes the loss in eq. (5) necessarily separates the data and diverges in norm, $\|{\boldsymbol{\beta}^{(t)}}\|\to\infty$ . In general there are many linear separators that correctly label the training data, each corresponding to a direction in which we can minimize (5). Which of these separators will we converge to when optimizing (4)/(5)? In other words, what is the direction $\overline{}$