Gradient Descent Learns Linear Dynamical Systems

Moritz Hardt, Tengyu Ma, Benjamin Recht

Introduction

Many learning problems are by their nature sequence problems where the goal is to fit a model that maps a sequence of input words x1,,xTx_{1},\dots,x_{T} to a corresponding sequence of observations y1,,yT.y_{1},\dots,y_{T}. Text translation, speech recognition, time series prediction, video captioning and question answering systems, to name a few, are all sequence to sequence learning problems. For a sequence model to be both expressive and parsimonious in its parameterization, it is crucial to equip the model with memory thus allowing its prediction at time tt to depend on previously seen inputs.

Recurrent neural networks form an expressive class of non-linear sequence models. Through their many variants, such as long-short-term-memory Hochreiter and Schmidhuber (1997), recurrent neural networks have seen remarkable empirical success in a broad range of domains. At the core, neural networks are typically trained using some form of (stochastic) gradient descent. Even though the training objective is non-convex, it is widely observed in practice that gradient descent quickly approaches a good set of model parameters. Understanding the effectiveness of gradient descent for non-convex objectives on theoretical grounds is a major open problem in this area.

If we remove all non-linear state transitions from a recurrent neural network, we are left with the state transition representation of a linear dynamical system. Notwithstanding, the natural training objective for linear systems remains non-convex due to the composition of multiple linear operators in the system. If there is any hope of eventually understanding recurrent neural networks, it will be inevitable to develop a solid understanding of this special case first.

To be sure, linear dynamical systems are important in their own right and have been studied for many decades independently of machine learning within the control theory community. Control theory provides a rich set techniques for identifying and manipulating linear systems. The learning problem in this context corresponds to “linear dynamical system identification”. Maximum likelihood estimation with gradient descent is a popular heuristic for dynamical system identification Ljung (1998). In the context of machine learning, linear systems play an important role in numerous tasks. For example, their estimation arises as subroutines of reinforcement learning in robotics Levine and Koltun (2013), location and mapping estimation in robotic systems Durrant-Whyte and Bailey (2006), and estimation of pose from video Rahimi et al. (2005).

In this work, we show that gradient descent efficiently minimizes the maximum likelihood objective of an unknown linear system given noisy observations generated by the system. More formally, we receive noisy observations generated by the following time-invariant linear system:

Here, A,B,C,DA,B,C,D are linear transformations with compatible dimensions and we denote by Θ=(A,B,C,D)\Theta=(A,B,C,D) the parameters of the system. The vector hth_{t} represents the hidden state of the model at time t.t. Its dimension nn is called the order of the system. The stochastic noise variables {ξt}\{\xi_{t}\} perturb the output of the system which is why the model is called an output error model in control theory. We assume the variables are drawn i.i.d. from a distribution with mean and variance σ2.\sigma^{2}.

Throughout the paper we focus on controllable and externally stable systems. A linear system is externally stable (or equivalently bounded-input bounded-output stable) if and only if the spectral radius of AA, denoted ρ(A),\rho(A), is strictly bounded by 1.1. Controllability is a mild non-degeneracy assumption that we formally define later. Without loss of generality, we further assume that the transformations BB, CC and DD have bounded Frobenius norm. This can be achieved by a rescaling of the output variables. We assume we have NN pairs of sequences (x,y)(x,y) as training examples,

Our goal is to fit a linear system to the observations. We parameterize our model in exactly the same way as (1.1). That is, for linear mappings (A^,B^,C^,D^)(\hat{A},\hat{B},\hat{C},\hat{D}), the trained model is defined as:

The (population) risk of the model is obtained by feeding the learned system with the correct initial states and comparing its predictions with the ground truth in expectation over inputs and errors. Denoting by y^t\hat{y}_{t} the tt-th prediction of the trained model starting from the correct initial state that generated yty_{t}, and using Θ^\widehat{\Theta} as a short hand for (A^,B^,C^,D^)(\hat{A},\hat{B},\hat{C},\hat{D}), we formally define population risk as:

Note that even though the prediction y^t\hat{y}_{t} is generated from the correct initial state, the learning algorithm does not have access to the correct initial state for its training sequences.

While the squared loss objective turns out to be non-convex, it has many appealing properties. Assuming the inputs xtx_{t} and errors ξt\xi_{t} are drawn independently from a Gaussian distribution, the corresponding population objective corresponds to maximum likelihood estimation. In this work, we make the weaker assumption that the inputs and errors are drawn independently from possibly different distributions. The independence assumption is certainly idealized for some learning applications. However, in control applications the inputs can often be chosen by the controller rather than by nature. Moreover, the outputs of the system at various time steps are correlated through the unknown hidden state and therefore not independent even if the inputs are.

2 Proper learning

Under our assumption, projected stochastic gradient descent, when given NN sample sequence of length TT, returns parameters Θ^\widehat{\Theta} with population risk

Here the expectation on LHS is with respect to the randomness of the algorithm. Recall that f(Θ)f(\Theta) is the population risk of the optimal system, and σ2\sigma^{2} refers to the variance of the noise variables. We also assume that the inputs xtx_{t} are drawn from a pairwise independent distribution with mean and variance 1.1. Note, however, that this does not imply independence of the outputs as these are correlated by a common hidden state. The stated version of our result glosses over the fact that we need our assumption to hold with a small amount of slack; a precise version follows in Section 4. Our theorem establishes a polynomial convergence rate for stochastic gradient descent. Since each iteration of the algorithm only requires a sequence of matrix operations and an efficient projection step, the running time is polynomial, as well. Likewise, the sample requirements are polynomial since each iteration requires only a single fresh example. An important feature of this result is that the error decreases with both the length TT and the number of samples NN. The dependency on the dimension nn, on the other hand, is likely to be quite loose, and tighter bounds are left for future work.

The algorithm requires a (polynomial-time) projection step to a convex set at every iteration (formally defined in Section 4 and Algorithm 1). Computationally, it can be a bottleneck, although it is unlikely to be required in practice and may be an artifact of our analysis.

3 The power of over-parameterization

Endowing the model with additional parameters compared to the ground truth turns out to be surprisingly powerful. We show that we can essentially remove the assumption we previously made in proper learning. The idea is simple. If pp is the characteristic polynomial of AA of degree nn. We can find a system of order n>nn^{\prime}>n such that the characteristic polynomial of its transition matrix becomes ppp\cdot p^{\prime} for some polynomial pp^{\prime} of order nn.n^{\prime}-n. This means that to apply our result we only need the polynomial ppp\cdot p^{\prime} to satisfy our assumption. At this point, we can choose pp^{\prime} to be an approximation of the inverse p1p^{-1}. For sufficiently good approximation, the resulting polynomial ppp\cdot p^{\prime} is close to 11 and therefore satisfies our assumption. Such an approximation exists generically for n=O(n)n^{\prime}=O(n) under mild non-degeneracy assumptions on the roots of pp. In particular, any small random perturbation of the roots would suffice.

Under a mild non-degeneracy assumption, stochastic gradient descent returns parameters Θ^\widehat{\Theta} corresponding to a system of order n=O(n)n^{\prime}=O(n) with population risk

when given NN sample sequences of length TT.

We remark that the idea we sketched also shows that, in the extreme, improper learning of linear dynamical systems becomes easy in the sense that the problem can be solved using linear regression against the outputs of the system. However, our general result interpolates between the proper case and the regime where linear regression works. We discuss in more details in Section 6.3.

4 Multi-input multi-output systems

Both results we saw immediately extend to single-input multi-output (SIMO) systems as the dimensionality of CC and DD are irrelevant for us. The case of multi-input multi-output (MIMO) systems is more delicate. Specifically, our results carry over to a broad family of multi-input multi-output systems. However, in general MIMO systems no longer enjoy canonical forms like SISO systems. In Section 7, we introduce a natural generalization of controllable canonical form for MIMO systems and extend our results to this case.

5 Related work

System identification is a core problem in dynamical systems and has been studied in depth for many years. The most popular reference on this topic is the text by Ljung Ljung (1998). Nonetheless, the list of non-asymptotic results on identifying linear systems from noisy data is surprisingly short. Several authors have recently tried to estimate the sample complexity of dynamical system identification using machine learning tools Vidyasagar and Karandikar (2008); Campi and Weyer (2002); Weyer and Campi (1999). All of these result are rather pessimistic with sample complexity bounds that are exponential in the degree of the linear system and other relevant quantities. Contrastingly, we prove that gradient descent has an associated polynomial sample complexity in all of these parameters. Moreover, all of these papers only focus on how well empirical risk approximates the true population risk and do not provide guarantees about any algorithmic schemes for minimizing the empirical risk.

The only result to our knowledge which provides polynomial sample complexity for identifying linear dynamical systems is in Shah et al Shah et al. (2012). Here, the authors show that if certain frequency domain information about the linear dynamical system is observed, then the linear system can be identified by solving a second-order cone programming problem. This result is about improper learning only, and the size of the resulting system may be quite large, scaling as the (1ρ(A))2(1-\rho(A))^{-2}. As we describe in this work, very simple algorithms work in the improper setting when the system degree is allowed to be polynomial in (1ρ(A))1(1-\rho(A))^{-1}. Moreover, it is not immediately clear how to translate the frequency-domain results to the time-domain identification problem discussed above.

Our main assumption about the image of the polynomial q(z)q(z) is an appeal to the theory of passive systems. A system is passive if the dot product between the input sequence utu_{t} and output sequence yty_{t} are strictly positive. Physically, this notion corresponds to systems that cannot create energy. For example, a circuit made solely of resistors, capacitors, and inductors would be a passive electrical system. If one added an amplifier to the internals of the system, then it would no longer be passive. The set of passive systems is a subset of the set of stable systems, and the subclass is somewhat easier to work with mathematically. Indeed, Megretski used tools from passive systems to provide a relaxation technique for a family of identification problems in dynamical systems Megretski (2008). His approach is to lower bound a nonlinear least-squares cost with a convex functional. However, he does not prove that his technique can identify any of the systems, even asymptotically. Söderström and Stoica (1982); Stoica and Söderström (1982, 1984) and later Bazanella et al. (2008); Eckhard and Bazanella (2011) prove the quasi-convexity of a cost function under a passivity condition in the context of system identification, but no sample complexity or global convergence proofs are provided.

6 Proof overview

Our results on improper learning in Section 6 rely on a surprisingly simple but powerful insight. We can extend the degree of the transfer function G(z)G(z) by extending both numerator and denominator with a polynomial u(z)u(z) such that G(z)=s(z)u(z)/p(z)u(z).G(z)=s(z)u(z)/p(z)u(z). While this results in an equivalent system in terms of input-output behavior, it can dramatically change the geometry of the optimization landscape. In particular, we can see that only p(z)u(z)p(z)u(z) has to satisfy the assumption of our proper learning algorithm. This allows us, for example, to put u(z)p(z)1u(z)\approx p(z)^{-1} so that p(z)u(z)1,p(z)u(z)\approx 1, hence trivially satisfying our assumption. A suitable inverse approximation exists under light assumptions and requires degree no more than d=O(n).d=O(n). Algorithmically, there is almost no change. We simply run stochastic gradient descent with n+dn+d model parameters rather than nn model parameters.

7 Preliminaries

A SISO of order nn is in controllable canonical form if AA and BB have the following form

We will parameterize A^,B^,C^,D^\hat{A},\hat{B},\hat{C},\hat{D} accordingly. We will write A=CC(a)A=\textup{CC}(a) for brevity, where aa is used to denote the unknown last row [an,,a1][-a_{n},\dots,-a_{1}] of matrix AA. We will use a^\hat{a} to denote the corresponding training variables for aa. Since here BB is known, so B^\hat{B} is no longer a trainable parameter, and is forced to be equal to BB. Moreover, CC is a row vector and we use [c1,,cn][c_{1},\cdots,c_{n}] to denote its coordinates (and similarly for C^\hat{C}).

A SISO is controllable if and only if the matrix [BABA2BAn1B][B\mid AB\mid A^{2}B\mid\cdots\mid A^{n-1}B] has rank n.n. This statement corresponds to the condition that all hidden states should be reachable from some initial condition and input trajectory. Any controllable system admits a controllable canonical form Heij et al. (2007). For vector a=[an,,a1]a=[a_{n},\dots,a_{1}], let pa(z)p_{a}(z) denote the polynomial

Gradient descent and quasi-convexity

It is known that under certain mild conditions (stochastic) gradient descent converges even on non-convex functions to local minimum Ge et al. (2015); Lee et al. (2016). Though usually for concrete problems the challenge is to prove that there is no spurious local minimum other than the target solution. Here we introduce a condition similar to the quasi-convexity notion in Hazan et al. (2015), which ensures that any point with vanishing gradient is the optimal solution . Roughly speaking, the condition says that at any point θ\theta the negative of the gradient f(θ)-\nabla f(\theta) should be positively correlated with direction θθ\theta^{*}-\theta pointing towards the optimum. Our condition is slightly weaker than that in Hazan et al. (2015) since we only require quasi-convexity and smoothness with respect to the optimum, and this (simple) extension will be necessary for our analysis.

We say an objective function ff is τ\tau-weakly-quasi-convex (τ\tau-WQC) over a domain B\mathcal{B} with respect to global minimum θ\theta^{*} if there is a positive constant τ>0\tau>0 such that for all θB\theta\in\mathcal{B},

We further say ff is Γ\Gamma-weakly-smooth if for for any point θ\theta, f(θ)2Γ(f(θ)f(θ))\|\nabla f(\theta)\|^{2}\leq\Gamma(f(\theta)-f(\theta^{*})).

Projected stochastic gradient descent over some closed convex set B\mathcal{B} with learning rate η>0\eta>0 refers to the following algorithm in which ΠB\Pi_{\mathcal{B}} denotes the Euclidean projection onto B\mathcal{B}:

The following Proposition is well known for convex objective functions (corresponding to 11-weakly-quasi-convex functions). We extend it (straightforwardly) to the case when τ\tau-WQC holds with any positive constant τ\tau.

Suppose the objective function ff is τ\tau-weakly-quasi-convex and Γ\Gamma-weakly-smooth, and r()\mathfrak{r}(\cdot) is an unbiased estimator for f(θ)\nabla f(\theta) with variance VV. Moreover, suppose the global minimum θ\theta^{*} belongs to B\mathcal{B}, and the initial point θ0\theta_{0} satisfies θ0θR\|\theta_{0}-\theta^{*}\|\leq R. Then projected gradient descent (2.2) with a proper learning rate returns θK\theta_{K} in KK iterations with expected error

We defer the proof of Proposition 2.3 to Appendix A which is a simple variation of the standard convergence analysis of stochastic gradient descent (see, for example, Bottou (1998)). Finally, we note that the sum of two quasi-convex functions may no longer be quasi-convex. However, if a sequence functions is τ\tau-WQC with respect to a common point θ\theta^{*}, then their sum is also τ\tau-WQC. This follows from the linearity of gradient operation.

Suppose functions f1,,fnf_{1},\dots,f_{n} are individually τ\tau-weakly-quasi-convex in B\mathcal{B} with respect to a common global minimum θ\theta^{*} , then for non-negative w1,,wnw_{1},\dots,w_{n} the linear combination f=i=1nwifif=\sum_{i=1}^{n}w_{i}f_{i} is also τ\tau-weakly-quasi-convex with respect to θ\theta^{*} in B\mathcal{B}.

Population risk in frequency domain

We next establish conditions under which risk is weakly-quasi-convex. Our strategy is to first approximate the risk functional f(Θ^)f(\widehat{\Theta}) by what we call idealized risk. This approximation of the objective function is fairly straightforward; we justify it toward the end of the section. We can show that

The leading term DD^2\|D-\hat{D}\|^{2} is convex in D^\hat{D} which appears nowhere else in the objective. It therefore doesn’t affect the convergence of the algorithm (up to lower order terms) by virtue of Proposition 2.5, and we restrict our attention to the remaining terms.

We now use basic concepts from control theory (see Heij et al. (2007); Hespanha (2009) for more background) to express the idealized risk (3.2) in Fourier domain. The transfer function of the linear system is

Note that G(z)G(z) is a rational function over the complex numbers of degree nn and hence we can find polynomials s(z)s(z) and p(z)p(z) such that G(z)=s(z)p(z),G(z)=\frac{s(z)}{p(z)}\,, with the convention that the leading coefficient of p(z)p(z) is 11. In controllable canonical form (1.4), the coefficients of pp will correspond to the last row of the A,A, while the coefficients of ss correspond to the entries of CC. Also note that

The sequence r=(r0,r1,,rt,)=(CB,CAB,,CAtB,)r=(r_{0},r_{1},\ldots,r_{t},\ldots)=(CB,CAB,\ldots,CA^{t}B,\ldots) is called the impulse response of the linear system. The behavior of a linear system is uniquely determined by the impulse response and therefore by G(z).G(z). Analogously, we denote the transfer function of the learned system by G^(z)=C^(zIA^)1B=s^(z)/p^(z).\widehat{G}(z)=\hat{C}(zI-\hat{A})^{-1}B=\hat{s}(z)/\hat{p}(z)\,. The idealized risk (3.2) is only a function of the impulse response r^\hat{r} of the learned system, and therefore it is only a function of G^(z)\widehat{G}(z).

Recall that C=[c1,,cn]C=[c_{1},\dots,c_{n}] is defined in Section 1.7. For future reference, we note that by some elementary calculation (see Lemma B.1), we have

which implies that s(z)=c1++cnzn1s(z)=c_{1}+\dots+c_{n}z^{n-1} and p(z)=zn+a1zn1++anp(z)=z^{n}+a_{1}z^{n-1}+\dots+a_{n}.

With these definitions in mind, we are ready to express idealized risk in Fourier domain.

Suppose pa^(z)p_{\hat{a}}(z) has all its roots inside unit circle, then the idealized risk g(a^,C^)g(\hat{a},\hat{C}) can be written in the Fourier domain as

Proof Note that G(eiθ)G(e^{i\theta}) is the Fourier transform of the sequence {rk}\{r_{k}\} and so is G^(eiθ)\widehat{G}(e^{i\theta}) the Fourier transformThe Fourier transform exists since rk2=C^A^kB^2C^A^kB^cρ(A^)k\|r_{k}\|^{2}=\|\hat{C}\hat{A}^{k}\hat{B}\|^{2}\leq\|\hat{C}\|\|\hat{A}^{k}\|\|\hat{B}\|\leq c\rho(\hat{A})^{k} where cc doesn’t depend on kk and ρ(A^)<1\rho(\hat{A})<1. of r^k\hat{r}_{k}. Therefore by Parseval’ Theorem, we have that

Now that we have a convenient expression for risk in Fourier domain, we can prove that the idealized risk g(A^,C^)g(\hat{A},\hat{C}) is weakly-quasi-convex when a^\hat{a} is not so far from the true aa in the sense that pa(z)p_{a}(z) and p^a(z)\hat{p}_{a}(z) have an angle less than π/2\pi/2 for every zz on the unit circle. We will use the convention that aa and a^\hat{a} refer to the parameters that specify AA and A^,\hat{A}, respectively.

For τ>0\tau>0 and every C^\hat{C}, the idealized risk g(A^,C^)g(\hat{A},\hat{C}) is τ\tau-weakly-quasi-convex over the domain

Proof We first analyze a single term h=G^(z)G(z)2h=|\widehat{G}(z)-G(z)|^{2}. Recall that G^(z)=s^(z)/p^(z)\widehat{G}(z)=\hat{s}(z)/\hat{p}(z) where p^(z)=pa^(z)=zn+a^1zn1++a^n\hat{p}(z)=p_{\hat{a}}(z)=z^{n}+\hat{a}_{1}z^{n-1}+\dots+\hat{a}_{n}. Note that zz is fixed and hh is a function of a^\hat{a} and C^\hat{C}. Then it is straightforward to see that

Since s^(z)\hat{s}(z) and p^(z)\hat{p}(z) are linear in C^\hat{C} and a^\hat{a} respectively, by chain rule we have that

Plugging the formulas (3.7) and (3.8) for hs^(z){\frac{\partial h}{\partial\hat{s}(z)}} and hp^(z){\frac{\partial h}{\partial\hat{p}(z)}} into the equation above, we obtain that

Hence h=G^(z)G(z)2h=|\widehat{G}(z)-G(z)|^{2} is τ\tau-weakly-quasi-convex with τ=2minz=1{p(z)p^(z)}\tau=2\min_{|z|=1}\Re\left\{\frac{p(z)}{\hat{p}(z)}\right\}. This implies our claim, since by Proposition 3.2, the idealized risk gg is convex combination of functions of the form G^(z)G(z)2|\widehat{G}(z)-G(z)|^{2} for z=1|z|=1. Moreover, Proposition 2.5 shows that convex combination preserves weak quasi-convexity.

For future reference, we also prove that the idealized risk is O(n2/τ14)O(n^{2}/\tau_{1}^{4})-weakly smooth.

The idealized risk g(A^,C^)g(\hat{A},\hat{C}) is Γ\Gamma-weakly smooth with Γ=O(n2/τ14)\Gamma=O(n^{2}/\tau_{1}^{4}).

Proof By equation (3.8) and the chain rule we get that

therefore we can bound the norm of the gradient by

Similarly, we could show that ga^2O(n2/τ12)g(A^,C^)\left\|{\frac{\partial g}{\partial\hat{a}}}\right\|^{2}\leq O(n^{2}/\tau_{1}^{2})\cdot g(\hat{A},\hat{C}).

2 Justifying idealized risk

We need to justify the approximation we made in Equation (3.1).

Assume that ξt\xi_{t} and xtx_{t} are drawn i.i.d. from an arbitrary distribution with mean and variance 1.1. Then the population risk f(Θ^)f(\widehat{\Theta}) can be written as,

The idealized risk is upper bound of the population risk f(Θ^)f(\widehat{\Theta}) according to equation (3.1) and (3.9). We don’t have to quantify the gap between them because later in Algorithm 1, we will directly optimize the idealized risk by constructing an estimator of its gradient, and thus the optimization will guarantee a bounded idealized risk which translates to a bounded population risk. See Section 5 for details.

Proof [Proof of Lemma 3.5] Under the distributional assumptions on ξt\xi_{t} and xt,x_{t}, we can calculate the objective functions above analytically. We write out yt,y^ty_{t},\hat{y}_{t} in terms of the inputs,

Therefore, using the fact that xtx_{t}’s are independent and with mean 0 and covariance II, the expectation of the error can be calculated (formally by Claim B.2),

Recall that under the controllable canonical form (1.4), B=enB=e_{n} is known and therefore B^=B\hat{B}=B is no longer a variable. Then the expected objective function (3.11) simplifies to

The previous lemma does not yet control higher order contributions present in the idealized risk. This requires additional structure that we introduce in the next section.

Effective relaxations of spectral radius

The previous section showed quasi-convexity of the idealized risk. However, several steps are missing towards showing finite sample guarantees for stochastic gradient descent. In particular, we will need to control the variance of the stochastic gradient at any system that we encounter in the training. For this purpose we formally introduce our main assumption now and show that it serves as an effective relaxation of spectral radius. This results below will be used for proving convergence of stochastic gradient descent in Section 5.

Consider the following convex region C\mathcal{C} in the complex plane,

We say a polynomial p(z)p(z) is α\alpha-acquiescent if {p(z)/zn ⁣:z=α}C.\{p(z)/z^{n}\colon|z|=\alpha\}\subseteq\mathcal{C}. A linear system with transfer function G(z)=s(z)/p(z)G(z)=s(z)/p(z) is α\alpha-acquiescent if the denominator p(z)p(z) is.

Suppose aBαa\in\mathcal{B}_{\alpha}, then the roots of polynomial pa(z)p_{a}(z) have magnitudes bounded by α\alpha. Therefore the controllable canonical form A=CC(a)A=\textup{CC}(a) defined by aa has spectral radius ρ(A)<α\rho(A)<\alpha.

Proof Define holomorphic function f(z)=znf(z)=z^{n} and g(z)=pa(z)=zn+a1zn1++ang(z)=p_{a}(z)=z^{n}+a_{1}z^{n-1}+\dots+a_{n}. We apply the symmetric form of Rouche’s theorem Estermann (1962) on the circle K={z:z=α}\mathcal{K}=\{z:|z|=\alpha\}. For any point zz on K\mathcal{K}, we have that f(z)=αn|f(z)|=\alpha^{n}, and that f(z)g(z)=αn1pa(z)/zn|f(z)-g(z)|=\alpha^{n}\cdot|1-p_{a}(z)/z^{n}|. Since aBαa\in\mathcal{B}_{\alpha}, we have that pa(z)/znCp_{a}(z)/z^{n}\in\mathcal{C} for any zz with z=α|z|=\alpha. Observe that for any cCc\in\mathcal{C} we have that 1c<1+c|1-c|<1+|c|, therefore we have that

Hence, using Rouche’s Theorem, we conclude that ff and gg have same number of roots inside circle K\mathcal{K}. Note that function f=znf=z^{n} has exactly nn roots in K\mathcal{K} and therefore gg have all its nn roots inside circle K\mathcal{K}.

The following lemma establishes the fact that Bα\mathcal{B}_{\alpha} is a monotone family of sets in α\alpha. The proof follows from the maximum modulo principle of the harmonic functions (znp(1/z))\Re(z^{n}p(1/z)) and (znp(1/z))\Im(z^{n}p(1/z)). We defer the short proof to Section C.1. We remark that there are larger convex sets than Bα\mathcal{B}_{\alpha} that ensure bounded spectral radius. However, in order to also guarantee monotonicity and the no blow-up property below, we have to restrict our attention to Bα.\mathcal{B}_{\alpha}.

For any 0<α<β0<\alpha<\beta, we have that BαBβ\mathcal{B}_{\alpha}\subset\mathcal{B}_{\beta}.

Our next lemma entails that acquiescent systems have well behaved impulse responses.

Suppose aBαa\in\mathcal{B}_{\alpha} for some α1\alpha\leq 1. Then the companion matrix A=CC(a)A=\textup{CC}(a) satisfies

Moreover, it holds that for any k0k\geq 0,

Let fλ=k=0eiλkαkAkBf_{\lambda}=\sum_{k=0}^{\infty}e^{i\lambda k}\alpha^{-k}A^{k}B be the Fourier transform of the series αkAkB\alpha^{-k}A^{k}B. Then using Parseval’s Theorem, we have

where at the last step we used the fact that (IwA)1B=1pa(w1)[w1,w2,zn](I-wA)^{-1}B=\frac{1}{p_{a}(w^{-1})}[w^{-1},w^{-2}\dots,z^{-n}]^{\top} (see Lemma B.1), and that α1\alpha\leq 1. Since aBαa\in\mathcal{B}_{\alpha}, we have that qa(α1eiλ)τ1|q_{a}(\alpha^{-1}e^{i\lambda})|\geq\tau_{1}, and therefore pa(αeiλ)=αneinλq(eiλ/α)p_{a}(\alpha e^{-i\lambda})=\alpha^{n}e^{-in\lambda}q(e^{i\lambda}/\alpha) has magnitude at least τ1αn\tau_{1}\alpha^{n}. Plugging in this into equation (4.4), we conclude that

Finally we establish the bound for AkB2\|A^{k}B\|^{2}. By Lemma 4.3, we have BαB1\mathcal{B}_{\alpha}\subset\mathcal{B}_{1} for α1\alpha\leq 1. Therefore we can pick α=1\alpha=1 in equation (4.3) and it still holds. That is, we have that

This also implies that AkB22πn/τ12\|A^{k}B\|^{2}\leq 2\pi n/\tau_{1}^{2}.

Learning acquiescent systems

Next we show that we can learn acquiescent systems.

Suppose the true system Θ\Theta is α\alpha-acquiescent and satisfies C1\|C\|\leq 1. Then with NN samples of length TΩ(n+1/(1α))T\geq\Omega(n+1/(1-\alpha)), stochastic gradient descent (Algorithm 1) with projection set Bα\mathcal{B}_{\alpha} returns parameters Θ^=(A^,B^,C^,D^)\widehat{\Theta}=(\hat{A},\hat{B},\hat{C},\hat{D}) with population risk

where O()O(\cdot)-notation hides polynomial dependencies on 1/(1α)1/(1-\alpha), 1/τ0,1/τ1,τ21/\tau_{0},1/\tau_{1},\tau_{2}, and R=aR=\|a\|. The expectation is taken over the randomness of the algorithms and the examples.

Recall that TT is the length of the sequence and NN is the number of samples. The first term in the bound (5.1) comes from the smoothness of the population risk and the second comes from the variance of the gradient estimator of population risk (which will be described in detail below). An important (but not surprising) feature here is the variance scale in 1/T1/T and therefore for long sequence actually we got 1/N1/N convergence instead of 1/N1/\sqrt{N} (for relatively small NN).

Computational complexity: Step 2 in each iteration of the algorithm takes O(Tn)O(Tn) arithmetic operations, and the projection step takes O(n3.5)O(n^{3.5}) time to solve an linear programming problem. The project step is unlikely to be required in practice and may be an artifact of our analysis.

We can further balance the variance of the estimator with the number of samples by breaking each long sequence of length TT into Θ(T/n)\Theta(T/n) short sequences of length Θ(n)\Theta(n), and then run back-propagation (1) on these TN/nTN/n shorter sequences. This leads us to the following bound which gives the right dependency in TT and NN as we expected: TNTN should be counted as the true number of samples for the sequence-to-sequence model.

Under the assumption of Theorem 5.1, Algorithm 2 returns parameters Θ^\widehat{\Theta} with population risk

where O()O(\cdot)-notation hides polynomial dependencies on 1/(1α)1/(1-\alpha), 1/τ0,1/τ1,τ21/\tau_{0},1/\tau_{1},\tau_{2}, and R=a.R=\|a\|.

We remark the the gradient computation procedure takes time linear in TnTn since one can use chain-rule (also called back-propagation) to compute the gradient efficiently . For completeness, Algorithm 3 gives a detailed implementation. Finally and importantly, we remark that although we defined the population risk as the expected error with respected to sequence of length TT, actually our error bound generalizes to any longer (or shorter) sequences of length Tmax{n,1/(1α)}T^{\prime}\gg\max\{n,1/{(1-\alpha)}\}. By the explicit formula for f(Θ^)f(\widehat{\Theta}) (Lemma 3.5) and the fact that CAkB\|CA^{k}B\| decays exponentially for knk\gg n (Lemma 4.4), we can bound the population risk on sequences of different lengths. Concretely, let fT(Θ^)f_{T^{\prime}}(\widehat{\Theta}) denote the population risk on sequence of length TT^{\prime}, we have for all Tmax{n,1/(1α)}T^{\prime}\gg\max\{n,1/{(1-\alpha)}\},

We note that generalization to longer sequence does deserve attention. Indeed in practice, it’s usually difficult to train non-linear recurrent networks that generalize to longer sequences than the training data.

We could hope to achieve linear convergence by showing that the empirical risk also satisfies the weakly-quasi-convexity. Then, we can re-use the samples and hope to use strong optimization tools (such as SVRG) to achieve the linear convergence. This is beyond the scope of this paper and left to future work.

Our proof of Theorem 5.1 simply consists of three parts: a) showing the idealized risk is weakly quasi-convex in the convex set Bα\mathcal{B}_{\alpha} (Lemma 5.3); b) designing an (almost) unbiased estimator of the gradient of the idealized risk (Lemma 5.4); c) variance bounds of the gradient estimator (Lemma 5.5).

Proof [Proof of Lemma 5.3] It suffices to show that for all a^,aBα\hat{a},a\in\mathcal{B}_{\alpha}, it satisfies a^Nτ(a)\hat{a}\in\mathcal{N}_{\tau}(a) for τ=Ω(τ0τ1/τ2)\tau=\Omega(\tau_{0}\tau_{1}/\tau_{2}). Indeed, by the monotonicity of the family of sets Bα\mathcal{B}_{\alpha} (Lemma 4.3), we have that a^,aB1\hat{a},a\in\mathcal{B}_{1}, which by definition means for every zz on unit circle, pa(z)/zn,pa^(z)/znCp_{a}(z)/z^{n},p_{\hat{a}}(z)/z^{n}\in\mathcal{C}. By definition of C\mathcal{C}, for any point w,w^Cw,\hat{w}\in\mathcal{C}, the angle ϕ\phi between ww and w^\hat{w} is at most πΩ(τ0)\pi-\Omega(\tau_{0}) and ratio of the magnitude is at least τ1/τ2\tau_{1}/\tau_{2}, which implies that (w/w^)=w/w^cos(ϕ)Ω(τ0τ1/τ2)\Re(w/\hat{w})=|w|/|\hat{w}|\cdot\cos(\phi)\geq\Omega(\tau_{0}\tau_{1}/\tau_{2}). Therefore (pa(z)/pa^(z))Ω(τ0τ1/τ2)\Re(p_{a}(z)/p_{\hat{a}}(z))\geq\Omega(\tau_{0}\tau_{1}/\tau_{2}), and we conclude that a^Nτ(a)\hat{a}\in\mathcal{N}_{\tau}(a). The smoothness bound was established in Lemma 3.4.

Towards designing an unbiased estimator of the gradient, we note that there is a small caveat here that prevents us to just use the gradient of the empirical risk, as commonly done for other (static) problems. Recall that the population risk is defined as the expected risk with known initial state h0h_{0}, while in the training we don’t have access to the initial states and therefore using the naive approach we couldn’t even estimate population risk from samples without knowing the initial states.

We argue that being able to handle the missing initial states is indeed desired: in most of the interesting applications h0h_{0} is unknown (or even to be learned). Moreover, the ability of handling unknown h0h_{0} allows us to break a long sequence into shorter sequences, which helps us to obtain Corollary 5.2. Here the difficulty is essentially that we have a supervised learning problem with missing data h0h_{0}. We get around it by simply ignoring first T1=Ω(T)T_{1}=\Omega(T) outputs of the system and setting the corresponding errors to 0. Since the influence of h0h_{0} to any outputs later than time kT1max{n,1/(1α)}k\geq T_{1}\gg\max\{n,1/{(1-\alpha)}\} is inverse exponentially small, we could safely assume h0=0h_{0}=0 when the error earlier than time T1T_{1} is not taken into account.

This small trick also makes our algorithm suitable to the cases when these early outputs are actually not observed. This is indeed an interesting setting, since in many sequence-to-sequence model Sutskever et al. (2014), there is no output in the first half fraction of iterations (of course these models have non-linear operation that we cannot handle).

The proof of the correctness of the estimator is almost trivial and deferred to Section C.

Under the assumption of Theorem 5.1, suppose a^,aBα\hat{a},a\in\mathcal{B}_{\alpha}. Then in Algorithm 1, at each iteration, GA,GCG_{A},G_{C} are unbiased estimators of the gradient of the idealized risk (3.2) in the sense that:

Finally, we control the variance of the gradient estimator.

The (almost) unbiased estimator (GA,GC)(G_{A},G_{C}) of the gradient of g(A^,C^)g(\hat{A},\hat{C}) has variance bounded by

where Λ=O(max{n,1/(1α)log1/(1α)})\Lambda=O(\max\{n,1/{(1-\alpha)}\log 1/{(1-\alpha)}\}).

We bound directly the variance instead. It’s tedious but simple in spirit. We mainly need Lemma 4.4 to control various difference sums that shows up from calculating the expectation. The only tricky part is to obtain the 1/T1/T dependency which corresponds to the cancellation of the contribution from the cross terms. In the proof we will basically write out the variance as a (complicated) function of A^,C^\hat{A},\hat{C} which consists of sums of terms involving (C^A^kBCAkB)(\hat{C}\hat{A}^{k}B-CA^{k}B) and A^kB\hat{A}^{k}B. We control these sums using Lemma 4.4. The proof is deferred to Section C.

Finally we are ready to prove Theorem 5.1. We essentially just combine Lemma 5.3, Lemma 5.4 and Lemma 5.5 with the generic convergence Proposition 2.3. This will give us low error in idealized risk and then we relate the idealized risk to the population risk.

Proof [Proof of Theorem 5.1] We consider g(A^,C^,D^)=(D^D)2+g(A^,C^)g^{\prime}(\hat{A},\hat{C},\hat{D})=(\hat{D}-D)^{2}+g(\hat{A},\hat{C}), an extended version of the idealized risk which takes the contribution of D^\hat{D} into account. By Lemma 5.4 we have that Algorithm 1 computes GA,GCG_{A},G_{C} which are almost unbiased estimators of the gradients of gg^{\prime} up to negligible error exp(Ω((1α)T))\exp(-\Omega({(1-\alpha)}T)), and by Lemma C.2 we have GDG_{D} is an unbiased estimator of gg^{\prime} with respect to D^\hat{D}. Moreover by Lemma 5.5, these unbiased estimator has total variance V=O(n5+σ2n3)TV=\frac{O\left(n^{5}+\sigma^{2}n^{3}\right)}{T} where O()O(\cdot) hides dependency on τ1\tau_{1} and (1α){(1-\alpha)}. Applying Proposition 2.3 (which only requires an unbiased estimator of the gradient of gg^{\prime}), we obtain that after TT iterations, we converge to a point with g(a^,C^,D^)O(n2N+n5+σ2n3TN)g^{\prime}(\hat{a},\hat{C},\hat{D})\leq O\left(\frac{n^{2}}{N}+\sqrt{\frac{n^{5}+\sigma^{2}n^{3}}{TN}}\right). Then, by Lemma 3.5 we have f(Θ^)g(a^,C^,D^)+σ2=g(a^,C^,D^)+f(Θ)O(n2N+n5+σ2n3TN)+f(Θ)f(\widehat{\Theta})\leq g^{\prime}(\hat{a},\hat{C},\hat{D})+\sigma^{2}=g^{\prime}(\hat{a},\hat{C},\hat{D})+f(\Theta)\leq O\left(\frac{n^{2}}{N}+\sqrt{\frac{n^{5}+\sigma^{2}n^{3}}{TN}}\right)+f(\Theta) which completes the proof.

The power of improper learning

We observe an interesting and important fact about the theory in Section 5: it solely requires a condition on the characteristic function p(z)p(z). This suggests that the geometry of the training objective function depends mostly on the denominator of the transfer function, even though the system is uniquely determined by the transfer function G(z)=s(z)/p(z)G(z)=s(z)/p(z). This might seem to be an undesirable discrepancy between the behavior of the system and our analysis of the optimization problem.

However, we can actually exploit this discrepancy to design improper learning algorithms that succeed under much weaker assumptions. We rely on the following simple observation about the invariance of a system G(z)=s(z)p(z)G(z)=\frac{s(z)}{p(z)}. For an arbitrary polynomial u(z)u(z) of leading coefficient 1, we can write G(z)G(z) as

Our discussion motivates the following definition.

A polynomial p(z)p(z) of degree nn is α\alpha-acquiescent by extension of degree dd if there exists a polynomial u(z)u(z) of degree dd and leading coefficient 1 such that p(z)u(z)p(z)u(z) is α\alpha-acquiescent.

For a transfer function G(z)G(z), we define it’s H2\mathcal{H}_{2} norm as

We assume (with loss of generality) that the true transfer function G(z)G(z) has bounded H2\mathcal{H}_{2} norm, that is, GH21\|G\|_{\mathcal{H}_{2}}\leq 1. This can be achieve by a rescalingIn fact, this is a natural scaling that makes comparing error easier. Recall that the population risk is essentially G^GH2\|\hat{G}-G\|_{\mathcal{H}_{2}}, therefore rescaling CC so that GH2=1\|G\|_{\mathcal{H}_{2}}=1 implies that when error 1\ll 1 we achieve non-trivial performance. of the matrix CC.

Suppose the true system has transfer function G(z)=s(z)/p(z)G(z)=s(z)/p(z) with a characteristic function p(z)p(z) that is α\alpha-acquiescent by extension of degree dd, and GH21\|G\|_{\mathcal{H}_{2}}\leq 1, then projected stochastic gradient descent with m=n+dm=n+d states (that is, Algorithm 2 with mm states) returns a system Θ^\widehat{\Theta} with population risk

where the O()O(\cdot) notation hides polynomial dependencies on τ0,τ1,τ2,1/(1α)\tau_{0},\tau_{1},\tau_{2},1/(1-\alpha).

The theorem follows directly from Corollary 5.2 (with some additional care about the scaling.

In the rest of this section, we discuss in subsection 6.1 the instability of the minimum representation in subsection, and in subsection 6.2 we show several examples where the characteristic function p(z)p(z) is not α\alpha-acquiescent but is α\alpha-acquiescent by extension with small degree dd.

As a final remark, the examples illustrated in the following sub-sections may be far from optimally analyzed. It is beyond the scope of this paper to understand the optimal condition under which p(z)p(z) is acquiescent by extension.

We begin by constructing a contrived example where the minimum representation of G(z)G(z) is not stable at all and as a consequence one can’t hope to recover the minimum representation of G(z)G(z).

Consider G(z)=s(z)p(z):=zn0.8n(z0.1)(zn0.9n)G(z)=\frac{s(z)}{p(z)}:=\frac{z^{n}-0.8^{-n}}{(z-0.1)(z^{n}-0.9^{-n})} and G(z)=s(z)p(z):=1z0.1G^{\prime}(z)=\frac{s^{\prime}(z)}{p^{\prime}(z)}:=\frac{1}{z-0.1}. Clearly these are the minimum representations of the G(z)G(z) and G(z)G^{\prime}(z), which also both satisfy acquiescence. On the one hand, the characteristic polynomial p(z)p(z) and p(z)p^{\prime}(z) are very different. On the other hand, the transfer functions G(z)G(z) and G(z)G^{\prime}(z) have almost the same values on unit circle up to exponentially small error,

Moreover, the transfer functions G(z)G(z) and G^(z)\hat{G}(z) are on the order of Θ(1)\Theta(1) on unit circle. These suggest that from an (inverse polynomially accurate) approximation of the transfer function G(z)G(z), we cannot hope to recover the minimum representation in any sense, even if the minimum representation satisfies acquiescence.

2 Power of improper learning in various cases

We illustrate the use of improper learning through various examples below.

We consider a simple contrived example where improper learning can help us learn the transfer function dramatically. We will show an example of characteristic function which is not 11-acquiescent but (α+1)/2(\alpha+1)/2-(α+1)/2(\alpha+1)/2-acquiescent by extension of degree 33.

Let nn be a large enough integer and α\alpha be a constant. Let J={1,n1,n}J=\{1,n-1,n\} and ω=e2πi/n\omega=e^{2\pi i/n}, and then define p(z)=z3j[n],jJ(zαωj)p(z)=z^{3}\prod_{j\in[n],j\notin J}(z-\alpha\omega^{j}). Therefore we have that

Taking z=eiπ/2z=e^{-i\pi/2} we have that p(z)/znp(z)/z^{n} has argument (phase) roughly 3π/4-3\pi/4, and therefore it’s not in C\mathcal{C}, which implies that p(z)p(z) is not 11-acquiescent. On the other hand, picking u(z)=(zω)(z1)(zω1)u(z)=(z-\omega)(z-1)(z-\omega^{-1}) as the helper function, from equation (6.1) we have p(z)u(z)/zn+3=1αn/znp(z)u(z)/z^{n+3}=1-\alpha^{n}/z^{n} takes values inverse exponentially close to 11 on the circle with radius (α+1)/2(\alpha+1)/2. Therefore p(z)u(z)p(z)u(z) is (α+1)/2(\alpha+1)/2-acquiescent.

2.2 Example: characteristic function with separated roots

A characteristic polynomial with well separated roots will be acquiescent by extension. Our bound will depend on the following quantity of pp that characterizes the separateness of the roots.

For a polynomial h(z)h(z) of degree nn with roots λ1,,λn\lambda_{1},\dots,\lambda_{n} inside unit circle, define the quantity Γ()\Gamma(\cdot) of the polynomial hh as:

Suppose p(z)p(z) is a polynomial of degree nn with distinct roots inside circle with radius α\alpha. Let Γ=Γ(p)\Gamma=\Gamma(p), then p(z)p(z) is α\alpha-acquiescent by extension of degree dd  =O(max{(1α)1log(nΓpH2),0})~{}=O(\max\{(1-\alpha)^{-1}\log(\sqrt{n}\Gamma\cdot\|p\|_{\mathcal{H}_{2}}),0\}).

Our main idea to extend p(z)p(z) by multiplying some polynomial uu that approximates p1p^{-1} (in a relatively weak sense) and therefore pupu will always take values in the set C\mathcal{C}. We believe the following lemma should be known though for completeness we provide the proof in Section D.

Suppose p(z)p(z) is a polynomial of degree nn and leading coefficient 1 with distinct roots inside circle with radius α\alpha, and Γ=Γ(p)\Gamma=\Gamma(p). Then for d=O(max{(11αlogΓ(1α)ζ,0})d=O(\max\{(\frac{1}{1-\alpha}\log\frac{\Gamma}{(1-\alpha)\zeta},0\}), there exists a polynomial h(z)h(z) of degree dd and leading coefficient 1 such that for all zz on unit circle,

Proof [Proof of Lemma 6.4] Let γ=1α\gamma=1-\alpha. Using Lemma 6.5 with ζ=0.5pH1\zeta=0.5\|p\|_{\mathcal{H}_{\infty}}^{-1}, we have that there exists polynomial uu of degree d=O(max{11αlog(ΓpH),0})d=O(\max\{\frac{1}{1-\alpha}\log(\Gamma\|p\|_{\mathcal{H}_{\infty}}),0\}) such that

Therefore p(z)u(z)/zn+dCτ0,τ1,τ2p(z)u(z)/z^{n+d}\in\mathcal{C}_{\tau_{0},\tau_{1},\tau_{2}} for constant τ0,τ1,τ2\tau_{0},\tau_{1},\tau_{2}. Finally noting that for degree nn polynomial we have hHnhH2\|h\|_{\mathcal{H}_{\infty}}\leq\sqrt{n}\cdot\|h\|_{\mathcal{H}_{2}}, which completes the proof.

2.3 Example: Characteristic polynomial with random roots

We consider the following generative model for characteristic polynomial of degree 2n2n. We generate nn complex numbers λ1,,λn\lambda_{1},\dots,\lambda_{n} uniformly randomly on circle with radius α<1\alpha<1, and take λi,λˉi\lambda_{i},\bar{\lambda}_{i} for i=1,,ni=1,\dots,n as the roots of p(z)p(z). That is, p(z)=(zλ1)(zλˉ1)(zλn)(zλˉn)p(z)=(z-\lambda_{1})(z-\bar{\lambda}_{1})\dots(z-\lambda_{n})(z-\bar{\lambda}_{n}). We show that with good probability (over the randomness of λi\lambda_{i}’s), polynomial p(z)p(z) will satisfy the condition in subsection 6.2.2 so that it can be learned efficiently by our improper learning algorithm.

Towards proving Theorem 6.6, we need the following lemma about the expected distance of two random points with radius ρ\rho and rr in log-space.

Proof When rρr\neq\rho, let NN be an integer and ω=e2iπ/N\omega=e^{2i\pi/N}. Then we have that

2.4 Example: Passive systems

We will show that with improper learning we can learn almost all passive systems, an important class of stable linear dynamical system as we discussed earlier. We start off with the definition of a strict-input passive system.

A SISO linear system is strict-input passive if and only if for some τ0>0\tau_{0}>0 and any zz on unit circle, (G(z))τ0.\Re(G(z))\geq\tau_{0}\,.

In order to learn the passive system, we need to add assumptions in the definition of strict passivity. To make it precise, we define the following subsets of complex plane: For positive constant τ0,τ1,τ2\tau_{0},\tau_{1},\tau_{2}, define

We say a transfer function G(z)=s(z)/p(z)G(z)=s(z)/p(z) is (τ0,τ1,τ2)(\tau_{0},\tau_{1},\tau_{2})-strict input passive if for any zz on unit circle we have G(z)Cτ0,τ1,τ2+G(z)\in\mathcal{C}^{+}_{\tau_{0},\tau_{1},\tau_{2}}. Note that for small constant τ0,τ1\tau_{0},\tau_{1} and large constant τ2\tau_{2}, this basically means the system is strict-input passive.

Now we are ready to state our main theorem in this subsection. We will prove that passive systems could be learned improperly with a constant factor more states (dimensions), assuming s(z)s(z) has all its roots strictly inside unit circles and Γ(s)exp(O(n))\Gamma(s)\leq\exp(O(n)).

Suppose G(z)=s(z)/p(z)G(z)=s(z)/p(z) is (τ0,τ1,τ2)(\tau_{0},\tau_{1},\tau_{2})-strict-input passive. Moreover, suppose the roots of s(z)s(z) have magnitudes inside circle with radius α\alpha and Γ=Γ(s)exp(O(n))\Gamma=\Gamma(s)\leq\exp(O(n)) and pH2exp(O(n))\|p\|_{\mathcal{H}_{2}}\leq\exp(O(n)). Then p(z)p(z) is α\alpha-acquiescent by extension of degree d=Oτ,α(n)d=O_{\tau,\alpha}(n), and as a consequence we can learn G(z)G(z) with n+dn+d states in polynomial time.

Moreover, suppose in addition we assume that G(z)Cτ0,τ1,τ2G(z)\in\mathcal{C}_{\tau_{0},\tau_{1},\tau_{2}} for every zz on unit circle. Then p(z)p(z) is α\alpha-acquiescent by extension of degree d=Oτ,α(n)d=O_{\tau,\alpha}(n).

The proof of Theorem 6.9 is similar in spirit to that of Lemma 6.4, and is deferred to Section D.

3 Improper learning using linear regression

In this subsection, we show that under stronger assumption than α\alpha-acquiescent by extension, we can improperly learn a linear dynamical system with linear regression, up to some fixed bias.

A polynomial p(z)p(z) of degree nn is extremely-acquiescent by extension of degree dd with bias ε\varepsilon if there exists a polynomial u(z)u(z) of degree dd and leading coefficient 1 such that for all zz on unit circle,

We remark that if p(z)p(z) is 11-acquiescent by extension of degree dd, then there exists u(z)u(z) such that p(z)u(z)/zn+dCp(z)u(z)/z^{n+d}\in\mathcal{C}. Therefore, equation (6.4) above is a much stronger requirement than acquiescence by extension.We need (1δ)(1-\delta)-acquiescence by extension in previous subsections for small δ>0\delta>0, though this is merely additional technicality needed for the sample complexity. We ignore this difference between 1δ1-\delta-acquiescence and 11-acquiescence and for the purpose of this subsection

Learning multi-input multi-output (MIMO) systems

We consider multi-input multi-output systems with the transfer functions that have a common denominator p(z)p(z),

Let’s define the risk function in the frequency domain as,

The following lemma is an analog of Lemma 3.3 for the MIMO case. Itss proof actually follows from a straightforward extension of the proof of Lemma 3.3 by observing that matrix S(z)S(z) (or S^(z)\hat{S}(z)) commute with scalar p(z)p(z) and p^(z)\hat{p}(z), and that S^(z),p^(z)\hat{S}(z),\hat{p}(z) are linear in a^,C^\hat{a},\hat{C}.

The risk function g(a^,C^)g(\hat{a},\hat{C}) defined in (7.2) is τ\tau-weakly-quasi-convex in the domain

Finally, as alluded before, we use a particular state space representation for learning the system in time domain with example sequences. It is known that any transfer function of the form (7.1) can be realized uniquely by the state space system of the following special case of Brunovsky normal form Brunovsky (1970),

The following Theorem is a straightforward extension of Corollary 5.2 and Theorem 6.2 to the MIMO case.

Suppose transfer function G(z)G(z) of a MIMO system takes form (7.1), and has norm GH21\|G\|_{\mathcal{H}_{2}}\leq 1. If the common denominator p(z)p(z) is α\alpha-acquiescent by extension of degree dd then projected stochastic gradient descent over the state space representation (7.3) will return Θ^\widehat{\Theta} with risk

Simulations

In this section, we provide proof-of-concepts experiments on synthetic data. We will demonstrate that

plain SGD tends to blow up even with relatively small learning rate, especially on hard instances

SGD with our projection step converges with reasonably large learning rate, and with over-parameterization the final error is competitive

SGD with gradient clipping has the strongest performance in terms both of the convergence speed and the final error

Here gradient clipping refers to the technique of using a normalized gradient instead of the true gradient. Specifically, for some positive hyper parameter BB, we follow the approximate gradient

This method is commonly applied in training recurrent neural networks Pascanu et al. (2013).

Bullet 1) suggests that stability is indeed a real concern. Bullet 2) corroborates our theoretical study. Finding 3) suggests the instability of SGD partly arises from the noise in the batches, and such noise is reduced by the gradient clipping. Our experiments suggest that the landscape of the objective function may be even nicer than what is predicted by our theoretical development. It remains possible that the objective has no non-global local minima, possibly even outside the convex set to which our algorithm projects.

We generate the true system with state dimension d=20d=20 by randomly picking the conjugate pairs of roots of the characteristic polynomial inside the circle with radius ρ=0.95\rho=0.95 and randomly generating the vector CC from standard normal distribution. The distribution of the norm of the impulse response rr (defined in Section 3) of such systems has a heavy-tail. When the norm of rr is several magnitudes larger than the median it’s difficult to learn the system. Thus we select systems with reasonable r\|r\| for experiments, and we observe that the difficulty of learning increases as r\|r\| increases. The inputs of the dynamical model are generated from standard normal distribution with length T=500T=500. We note that we generate new fresh inputs and outputs at every iterations and therefore the training loss is equal to the test loss (in expectation.) We use initial learning rate 0.01 in the projected gradient descent and SGD with gradient clipping. We use batch size 100 for all experiments, and decay the learning rate at 200K and 250K iteration by a factor of 10 in all experiments.

Acknowledgments

We thank Amir Globerson, Alexandre Megretski, Pablo A. Parrilo, Yoram Singer, Peter Stoica, and Ruixiang Zhang for helpful discussions. We are indebted to Mark Tobenkin for pointers to relevant prior work. We also thank Alexandre Megretski for helpful feedback, insights into passive systems and suggestions on how to organize Section 3.

References

A Background on optimization

The proof below uses the standard analysis of gradient descent for non-smooth objectives and demonstrates that the argument still works for weakly-quasi-convex functions.

Proof [Proof of Proposition 2.3] We start by using the weakly-quasi-convex condition and then the rest follows a variant of the standard analysis of non-smooth projected sub-gradient descentAlthough we used weak smoothness to get a slightly better bound. We conditioned on θk\theta_{k}, and have that

where the first inequality uses weakly-quasi-convex and the rest of lines are simply algebraic manipulations. Since θk+1\theta_{k+1} is the projection of wk+1w_{k+1} to B\mathcal{B} and θ\theta^{*} belongs to B\mathcal{B}, we have wk+1θθk+1θ\|w_{k+1}-\theta^{*}\|\geq\|\theta_{k+1}-\theta^{*}\|. Together with (A.1), and

Taking expectation over all the randomness and summing over kk we obtain that

where we use the assumption that θ0θR\|\theta_{0}-\theta^{*}\|\leq R. Suppose K4R2Γ2Vτ2K\geq\frac{4R^{2}\Gamma^{2}}{V\tau^{2}}, then we take η=RVK\eta=\frac{R}{\sqrt{VK}}. Therefore we have that τηΓτ/2\tau-\eta\Gamma\geq\tau/2 and therefore

On the other hand, if K4R2Γ2Vτ2K\leq\frac{4R^{2}\Gamma^{2}}{V\tau^{2}}, we pick η=τ2Γ\eta=\frac{\tau}{2\Gamma} and obtain that

Therefore using equation (A.3) and (A.2) we obtain that when choosing η\eta properly according to KK as above,

B Toolbox

where pa(x)=xn+a1xn1++anp_{a}(x)=x^{n}+a_{1}x^{n-1}+\dots+a_{n} is the characteristic polynomial of AA.

Proof let v=(IwA)1Bv=(I-wA)^{-1}B then we have (IwA)v=B(I-wA)v=B. Note that B=enB=e_{n}, and IwAI-wA is of the form

Therefore we obtain that vk=wvk+1v_{k}=wv_{k+1} for 1kn11\leq k\leq n-1. That is, vk=v0wkv_{k}=v_{0}w^{-k} for v0=v1w1v_{0}=v_{1}w^{1}. Using the fact that ((IwA)v)n=1((I-wA)v)_{n}=1, we obtain that v0=pa(w1)1v_{0}=p_{a}(w^{-1})^{-1} where pa()p_{a}(\cdot) is the polynomial pa(x)=xn+a1xn1++anp_{a}(x)=x^{n}+a_{1}x^{n-1}+\dots+a_{n}. Then we have that u(IwA)1B=u1w1++unwnpa(w1)u(I-wA)^{-1}B=\frac{u_{1}w^{-1}+\dots+u_{n}w^{-n}}{p_{a}(w^{-1})}

Suppose x1,,xnx_{1},\dots,x_{n} are independent variables with mean 0 and covariance matrices and II, U1,,UdU_{1},\dots,U_{d} are fixed matrices, then

C Missing proofs in Sections 4 and 5

For any 0<α<β0<\alpha<\beta, we have that BαBβ\mathcal{B}_{\alpha}\subset\mathcal{B}_{\beta}.

Proof Let qa(z)=1+a1z++anznq_{a}(z)=1+a_{1}z+\dots+a_{n}z^{n}. Note that q(z1)=pa(z)/znq(z^{-1})=p_{a}(z)/z^{n}. Therefore we note that Bα={a:qa(z)C,z=1/α}B_{\alpha}=\{a:q_{a}(z)\in\mathcal{C},\forall|z|=1/\alpha\}. Suppose aBαa\in\mathcal{B}_{\alpha}, then (qa(z))τ1\Re(q_{a}(z))\geq\tau_{1} for any zz with z=1/α|z|=1/\alpha. Since (qa(z))\Re(q_{a}(z)) is the real part of the holomorphic function qa(z)q_{a}(z), its a harmonic function. By maximum (minimum) principle of the harmonic functions, we have that for any z1/α|z|\leq 1/\alpha, (qa(z))infz=1/α(qa(z))τ1\Re(q_{a}(z))\geq\inf_{|z|=1/\alpha}\Re(q_{a}(z))\geq\tau_{1}. In particular, it holds that for z=1/β<1/α|z|=1/\beta<1/\alpha, (qa(z))τ1\Re(q_{a}(z))\geq\tau_{1}. Similarly we can prove that for zz with z=1/β|z|=1/\beta, (qa(z))(1+τ0)(qa(z))\Re(q_{a}(z))\geq(1+\tau_{0})\Im(q_{a}(z)), and other conditions for aa being in Bβ\mathcal{B}_{\beta}.

C.2 Proof of Lemma 5.4

In algorithm 3 the values of GA,GC,GDG_{A},G_{C},G_{D} are equal to the gradients of g(A^,C^)+(D^D)2g(\hat{A},\hat{C})+(\hat{D}-D)^{2} with respect to A^\hat{A}, C^\hat{C} and D^\hat{D} up to inverse exponentially small error.

This can be seen simply from similar calculation to the proof of Lemma 3.5. Note that

Therefore noting that when tT1Ω(T)t\geq T_{1}\geq\Omega(T), we have that CAt1h0exp(Ω((1α)T)\|CA^{t-1}h_{0}\|\leq\exp(-\Omega((1-\alpha)T) and therefore the effect of h0h_{0} is negligible. Then we have that

where the first line use the fact that CAt1h0exp(Ω((1α)T)\|CA^{t-1}h_{0}\|\leq\exp(-\Omega((1-\alpha)T), the second uses equation (3.10) and the last line uses the no-blowing up property of AkBA^{k}B (Lemma 4.4).

C.3 Proof of Lemma 5.5

Proof [Proof of Lemma 5.5] Both GAG_{A} and GCG_{C} can be written in the form of a quadratic form (with vector coefficients) of x1,,xTx_{1},\dots,x_{T} and ξ1,,ξT\xi_{1},\dots,\xi_{T}. That is, we will write

where ustu_{st} and vstv_{st} are vectors that will be calculated later. By Lemma C.3, we have that

Therefore in order to bound from above Var[GA]\textup{Var}\left[G_{A}\right], it suffices to bound ust2\sum\|u_{st}\|^{2} and ust2\sum\|u^{\prime}_{st}\|^{2}, and similarly for GCG_{C}.

We begin by writing out ustu_{st} for fixed s,t[T]s,t\in[T] and bounding its norm. We use the same set of notations as int the proof of Lemma 5.4. Recall that we set rk=CAkBr_{k}=CA^{k}B and r^k=C^A^kB\hat{r}_{k}=\hat{C}\hat{A}^{k}B, and Δrk=r^krk\Delta r_{k}=\hat{r}_{k}-r_{k}. Moreover, let zk=A^kBz_{k}=\hat{A}^{k}B. We note that the sums of zk2\|z_{k}\|^{2} and rk2r_{k}^{2} can be controlled. By the assumption of the Lemma, we have that

which will be used many times in the proof that follows.

We calculate the explicit form of GAG_{A} using the explicit back-propagation Algorithm 3. We have that in Algorithm 3,

Then using GA=k2BΔhkhk1G_{A}=\sum_{k\geq 2}B^{\top}\Delta h_{k}h_{k-1}^{{}^{\prime}\top} and equation (C.5) and equation (C.6) above, we have that

Towards bounding ust\|u_{st}\|, we consider four different cases. Let Λ=Ω({max{n,(1α)1log(11α)})\Lambda=\Omega\left(\{\max\{n,(1-\alpha)^{-1}\log(\frac{1}{1-\alpha})\}\right) be a threshold.

When 0stΛ0\leq s-t\leq\Lambda, we rewrite ustu_{st} by rearranging equation (C.7),

We could bound the contribution from Δrk2\Delta r_{k}^{2} ssing equation (C.4), and it remains to bound terms T1T_{1} and T2T_{2}. Using the tail bounds for zk\|z_{k}\| (equation (C.3)) and the fact that r^k=C^A^kBA^kB=zk|\hat{r}_{k}|=|\hat{C}\hat{A}^{k}B|\leq\|\hat{A}^{k}B\|=\|z_{k}\| , we have that

We bound the inner sum of RHS of (C.10) using the fact that zk2O(nα2k2n/τ12)\|z_{k}\|^{2}\leq O(n\alpha^{2k-2n}/\tau_{1}^{2}) and obtain that,

Plugging equation (C.12) and (C.11) into equation (C.10), we have that

For the second term in equation (C.9), we bound similarly,

Therefore using the bounds for T1T_{1} and T2T_{2} we obtain that,

When st>Λs-t>\Lambda, we tighten equation (C.13) by observing that,

where we used equation (C.11). Similarly we can prove that

Therefore, we have when stΛs-t\geq\Lambda,

When Λst0-\Lambda\leq s-t\leq 0, we can rewrite ustu_{s}t and use the Cauchy-Schwartz inequality and obtain that

Using almost the same arguments as in equation (C.11) and (C.12), we that

Then using a same type of argument as equation (C.13), we can have that

It follows that in this case ust\|u_{st}\| can be bounded with the same bound in (C.15).

When stΛs-t\leq-\Lambda, we use a different simplification of ustu_{st} from above. First of all, it follows (C.7) that

Since jsks>4nj-s\geq k-s>4n and it follows that

Therefore, using the bound for ust2\|u_{st}\|^{2} obtained in the four cases above, taking sum over s,ts,t, we obtain that

We finished the bounds for ust\|u_{st}\| and now we turn to bound ust2\|u^{\prime}_{st}\|^{2}. Using the formula for ustu_{st}^{\prime} (equation C.8), we have that for ts+1t\leq s+1, ust=0u_{st}^{\prime}=0. For s+Λts+2s+\Lambda\geq t\geq s+2, we have that by Cauchy-Schwartz inequality,

On the other hand, for t>s+Λt>s+\Lambda, by the bound that r^k2zk2O(nα2k2n/τ12)|\hat{r}_{k}^{\prime}|^{2}\leq\|z_{k}^{\prime}\|^{2}\leq O(n\alpha^{2k-2n}/\tau_{1}^{2}), we have,

Therefore taking sum over s,ts,t, similarly to equation (C.19),

Then using equation (C.2) and equation (C.19) and (C.20), we obtain that

We can prove the bound for GCG_{C} similarly.

Let x1,,xTx_{1},\dots,x_{T} be independent random variables with mean 0 and variance 1 and 4-th moment bounded by O(1)O(1), and uiju_{ij} be vectors for i,j[T]i,j\in[T]. Moreover, let ξ1,,ξT\xi_{1},\dots,\xi_{T} be independent random variables with mean 0 and variance σ2\sigma^{2} and uiju^{\prime}_{ij} be vectors for i,j[T]i,j\in[T]. Then,

Note that the two sums in the target are independent with mean 0, therefore we only need to bound the variance of both sums individually. The proof follows the linearity of expectation and the independence of xix_{i}’s:

Similarly, we can control Var[i,jxiξjuij]\textup{Var}\left[\sum_{i,j}x_{i}\xi_{j}u^{\prime}_{ij}\right] by O(σ2)i,juij2O(\sigma^{2})\sum_{i,j}\|u_{ij}^{\prime}\|^{2}.

D Missing proofs in Section 6

Towards proving Lemma 6.5, we use the following lemma to express the inverse of a polynomial as a sum of inverses of degree-1 polynomials.

Let p(z)=(zλ1)(zλn)p(z)=(z-\lambda_{1})\dots(z-\lambda_{n}) where λj\lambda_{j}’s are distinct. Then we have that

Proof [Proof of Lemma D.1] By interpolating constant function at points λ1,,λn\lambda_{1},\dots,\lambda_{n} using Lagrange interpolating formula, we have that

Dividing p(z)p(z) on both sides we obtain equation (D.1).

The following lemma computes the Fourier transform of function 1/(zλ)1/(z-\lambda).

Proof [Proof of Lemma D.2] For m0m\geq 0, since zmz^{m} is a holomorphic function, by Cauchy’s integral formula, we have that

For m<0m<0, by changing of variable y=z1y=z^{-1} we have that

since λy=λ<1|\lambda y|=|\lambda|<1, then we by Taylor expansion we have,

Since the series λy\lambda y is dominated by λk|\lambda|^{k} which converges, we can switch the integral with the sum. Note that ym1y^{-m-1} is holomorphic for m<0m<0, and therefore we conclude that

Proof [Proof of Lemma 6.5] Let m=n+dm=n+d. We compute the Fourier transform of zm/p(z)z^{m}/p(z). That is, we write

Indeed these can be obtained by writing out the lagrange interpolation for polynomial f(x)=xsf(x)=x^{s} with sn1s\leq n-1 and compare the leading coefficient. Therefore, we further simplify βk\beta_{k} to

Let h(z)=k0βkzkh(z)=\sum_{k\geq 0}\beta_{k}z^{k}. Then we have that h(z)h(z) is a polynomial with degree d=mnd=m-n and leading term 1. Moreover, for our choice of d,d,

D.2 Proof of Theorem 6.9

Theorem 6.9 follows directly from a combination of Lemma D.3 and Lemma D.4 below. Lemma D.3 shows that the denominator of a function (under the stated assumptions) can be extended to a polynomial that takes values in C+\mathcal{C}^{+} on unit circle. Lemma D.4 shows that it can be further extended to another polynomial that takes values in C\mathcal{C}.

Suppose the roots of ss are inside circle with radius α<1\alpha<1, and Γ=Γ(s)\Gamma=\Gamma(s). If transfer function G(z)=s(z)/p(z)G(z)=s(z)/p(z) satisfies that G(z)Cτ0,τ1,τ2G(z)\in\mathcal{C}_{\tau_{0},\tau_{1},\tau_{2}} (or G(z)Cτ0,τ1,τ2+G(z)\in\mathcal{C}^{+}_{\tau_{0},\tau_{1},\tau_{2}}) for any zz on unit circle, then there exists u(z)u(z) of degree d=Oτ(max{(11αlognΓpH21α,0})d=O_{\tau}(\max\{(\frac{1}{1-\alpha}\log\frac{\sqrt{n}\Gamma\cdot\|p\|_{\mathcal{H}_{2}}}{1-\alpha},0\}) such that p(z)u(z)/zn+dCτ0,τ1,τ2p(z)u(z)/z^{n+d}\in\mathcal{C}_{{}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}} (or p(z)u(z)/zn+dCτ0,τ1,τ2+p(z)u(z)/z^{n+d}\in\mathcal{C}^{+}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}} respectively) for τ=Θτ(1)\tau^{\prime}=\Theta_{\tau}(1) , where Oτ(),Θτ()O_{\tau}(\cdot),\Theta_{\tau}(\cdot) hide the polynomial dependencies on τ0,τ1,τ2\tau_{0},\tau_{1},\tau_{2}.

Proof [Proof of Lemma D.3] By the fact that G(z)=s(z)/p(z)Cτ0,τ1,τ2G(z)=s(z)/p(z)\in\mathcal{C}_{\tau_{0},\tau_{1},\tau_{2}}, we have that p(z)/s(z)Cτ0,τ1,τ2p(z)/s(z)\in\mathcal{C}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}} for some τ\tau^{\prime} that polynomially depend on τ\tau. Using Lemma 6.5, there exists u(z)u(z) of degree dd such that

where we set ζmin{τ0,τ1}/τ2pH1\zeta\ll\min\{\tau_{0}^{\prime},\tau_{1}^{\prime}\}/\tau_{2}^{\prime}\cdot\|p\|_{\mathcal{H}_{\infty}}^{-1}. Then we have that

It follows from equation (D.5) implies that that p(z)u(z)/zn+dCτ0,τ1,τ2p(z)u(z)/z^{n+d}\in\mathcal{C}_{\tau_{0}^{\prime\prime},\tau_{1}^{\prime\prime},\tau_{2}^{\prime\prime}}, where τ\tau^{\prime\prime} polynomially depends on τ\tau. The same proof still works when we replace C\mathcal{C} by C+\mathcal{C}^{+}.

Suppose p(z)p(z) of degree nn and leading coefficient 1 satisfies that p(z)Cτ0,τ1,τ2+p(z)\in\mathcal{C}^{+}_{\tau_{0},\tau_{1},\tau_{2}} for any zz on unit circle. Then there exists u(z)u(z) of degree dd such that p(z)u(z)/zn+dCτ0,τ1,τ2p(z)u(z)/z^{n+d}\in\mathcal{C}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}} for any zz on unit circle with d=Oτ(n)d=O_{\tau}(n) and τ0,τ1,τ2=Θτ(1)\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}=\Theta_{\tau}(1), where Oτ(),Θτ()O_{\tau}(\cdot),\Theta_{\tau}(\cdot) hide the dependencies on τ0,τ1,τ2\tau_{0},\tau_{1},\tau_{2}.

Proof [Proof of Lemma D.4] We fix zz on unit circle first. Let’s defined p(z)/zn\sqrt{p(z)/z^{n}} be the square root of p(z)/znp(z)/z^{n} with principle value. Let’s write p(z)/zn=τ2(1+(p(z)τ2zn1))p(z)/z^{n}=\tau_{2}(1+(\frac{p(z)}{\tau_{2}z^{n}}-1)) and we take Taylor expansion for 1p(z)/zn=τ21/2(1+(p(z)τ2zn1))1/2=τ21/2(k=0(p(z)τ2zn1)k)\frac{1}{\sqrt{p(z)/z^{n}}}=\tau_{2}^{-1/2}(1+(\frac{p(z)}{\tau_{2}z^{n}}-1))^{-1/2}=\tau_{2}^{-1/2}\left(\sum_{k=0}^{\infty}(\frac{p(z)}{\tau_{2}z^{n}}-1)^{k}\right). Note that since τ1<p(z)<τ2\tau_{1}<|p(z)|<\tau_{2}, we have that p(z)τ2zn1<1τ1/τ2|\frac{p(z)}{\tau_{2}z^{n}}-1|<1-\tau_{1}/\tau_{2}. Therefore truncating the Taylor series at k=Oτ(1)k=O_{\tau}(1) we obtain a polynomial a rational function h(z)h(z) of the form

which approximates 1p(z)/zn\frac{1}{\sqrt{p(z)/z^{n}}} with precision ζmin{τ0,τ1}/τ2\zeta\ll\min\{\tau_{0},\tau_{1}\}/\tau_{2}, that is, 1p(z)/znh(z)ζ.\left|\frac{1}{\sqrt{p(z)/z^{n}}}-h(z)\right|\leq\zeta\,. Therefore, we obtain that p(z)h(z)znp(z)/znζp(z)/znζτ2.\left|\frac{p(z)h(z)}{z^{n}}-\sqrt{p(z)/z^{n}}\right|\leq\zeta|p(z)/z^{n}|\leq\zeta\tau_{2}\,. Note that since p(z)/znCτ0,τ1,τ2+p(z)/z^{n}\in\mathcal{C}^{+}_{\tau_{0},\tau_{1},\tau_{2}}, we have that p(z)/znCτ0,τ1,τ2\sqrt{p(z)/z^{n}}\in\mathcal{C}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}} for some constants τ0,τ1,τ2\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}. Therefore p(z)h(z)znCτ0,τ1,τ2\frac{p(z)h(z)}{z^{n}}\in\mathcal{C}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}. Note that h(z)h(z) is not a polynomial yet. Let u(z)=znkh(z)u(z)=z^{nk}h(z) and then u(z)u(z) is polynomial of degree at most nknk and p(z)u(z)/z(n+1)kCτ0,τ1,τ2p(z)u(z)/z^{(n+1)k}\in\mathcal{C}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}} for any zz on unit circle.

E Back-propagation implementation

In order to have a fast projection algorithm to the convex set Bα\mathcal{B}_{\alpha}, we consider a grid GM\mathcal{G}_{M} of size MM over the circle with radius α\alpha. We will show that M=Oτ(n)M=O_{\tau}(n) will be enough to approximate the set Bα\mathcal{B}_{\alpha} in the sense that projecting to the approximating set suffices for the convergence.

We will first show that with M=Oτ(n)M=O_{\tau}(n), we can make Bα,τ1,τ2,τ3\mathcal{B}^{\prime}_{\alpha,\tau_{1},\tau_{2},\tau_{3}} to be sandwiched within to two sets Bα,τ0,τ1,τ2\mathcal{B}_{\alpha,\tau_{0},\tau_{1},\tau_{2}} and Bα,τ0,τ1,τ2\mathcal{B}_{\alpha,\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}.

For any τ0>τ0,τ1>τ1,τ2<τ2\tau_{0}>\tau_{0}^{\prime},\tau_{1}>\tau_{1}^{\prime},\tau_{2}<\tau_{2}^{\prime}, we have that for M=Oτ(n)M=O_{\tau}(n), there exists κ0,κ1,κ2\kappa_{0},\kappa_{1},\kappa_{2} that polynomially depend on τi,τi\tau_{i},\tau_{i}^{\prime}’s such that Bα,τ0,τ1,τ2Bα,κ0,κ1,κ2Bα,τ0,τ1,τ2\mathcal{B}_{\alpha,\tau_{0},\tau_{1},\tau_{2}}\subset\mathcal{B}_{\alpha,\kappa_{0},\kappa_{1},\kappa_{2}}^{\prime}\subset\mathcal{B}_{\alpha,\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}

Before proving the lemma, we demonstrate how to use the lemma in our algorithm: We will pick τ0=τ0/2\tau_{0}^{\prime}=\tau_{0}/2, τ1=τ1/2\tau_{1}^{\prime}=\tau_{1}/2 and τ2=2τ2\tau_{2}^{\prime}=2\tau_{2}, and find κi\kappa_{i}’s guaranteed in the lemma above. Then we use Bα,κ0,κ1,κ2\mathcal{B}_{\alpha,\kappa_{0},\kappa_{1},\kappa_{2}}^{\prime} as the projection set in the algorithm (instead of Bα,τ0,τ1,τ2)\mathcal{B}_{\alpha,\tau_{0},\tau_{1},\tau_{2}})). First of all, the ground-truth solution Θ\Theta is in the set Bα,κ0,κ1,κ2\mathcal{B}^{\prime}_{\alpha,\kappa_{0},\kappa_{1},\kappa_{2}}. Moreover, since Bα,κ0,κ1,κ2Bα,τ0,τ1,τ2\mathcal{B}_{\alpha,\kappa_{0},\kappa_{1},\kappa_{2}}^{\prime}\subset\mathcal{B}_{\alpha,\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}, we will guarantee that the iterates Θ^\widehat{\Theta} will remain in the set Bα,τ0,τ1,τ2\mathcal{B}_{\alpha,\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}} and therefore the quasi-convexity of the objective function still holdswith a slightly worse parameter up to constant factor since τi\tau_{i}’s are different from τi\tau_{i}’s up to constant factors.

Note that the set Bα,κ0,κ1,κ2\mathcal{B}^{\prime}_{\alpha,\kappa_{0},\kappa_{1},\kappa_{2}} contains O(n)O(n) linear constraints and therefore we can use linear programming to solve the projection problem. Moreover, since the points on the grid forms a Fourier basis and therefore Fast Fourier transform can be potentially used to speed up the projection. Finally, we will prove Lemma F.1. We need S. Bernstein’s inequality for polynomials.

Let p(z)p(z) be any polynomial of degree nn with complex coefficients. Then,

We will use the following corollary of Bernstein’s inequality.

Let p(z)p(z) be any polynomial of degree nn with complex coefficients. Then, for m=20nm=20n,

Proof For simplicity let τ=supk[m]p(e2ikπ/m)\tau=\sup_{k\in[m]}|p(e^{2ik\pi/m})|, and let τ=supk[m]p(e2ikπ/m)\tau^{\prime}=\sup_{k\in[m]}|p(e^{2ik\pi/m})|. If τ2τ\tau^{\prime}\leq 2\tau then we are done by Bernstein’s inequality. Now let’s assume that τ>2τ\tau^{\prime}>2\tau. Suppose p(z)=τp(z)=\tau^{\prime}. Then there exists kk such that ze2πik/m4/m|z-e^{2\pi ik/m}|\leq 4/m and p(e2πik/m)τ|p(e^{2\pi ik/m})|\leq\tau. Therefore by Cauchy mean-value theorem we have that there exists ξ\xi that lies between zz and e2πik/me^{2\pi ik/m} such that p(ξ)m(ττ)/41.1nτp^{\prime}(\xi)\geq m(\tau^{\prime}-\tau)/4\geq 1.1n\tau^{\prime}, which contradicts Bernstein’s inequality.

Suppose a polynomial of degree nn satisfies that p(w)τ|p(w)|\leq\tau for every w=αe2iπk/mw=\alpha e^{2i\pi k/m} for some m20nm\geq 20n. Then for every zz with z=α|z|=\alpha there exists w=αe2iπk/mw=\alpha e^{2i\pi k/m} such that p(z)p(w)O(nατ/m)|p(z)-p(w)|\leq O(n\alpha\tau/m).

Proof Let g(z)=p(αz)g(z)=p(\alpha z) by a polynomial of degree at most nn. Therefore we have g(z)=αp(z)g^{\prime}(z)=\alpha p(z). Let w=αe2iπk/mw=\alpha e^{2i\pi k/m} such that zwO(α/m)|z-w|\leq O(\alpha/m). Then we have

Proof [Proof of Lemma F.1] We choose κi=12(τi+τi)\kappa_{i}=\frac{1}{2}(\tau_{i}+\tau_{i}^{\prime}).The first inequality is trivial. We prove the second one. Consider aa such that aBα,κ0,κ1,κ2a\in\mathcal{B}_{\alpha,\kappa_{0},\kappa_{1},\kappa_{2}}. We wil show that aBα,τ0,τ1,τ2a\in\mathcal{B}^{\prime}_{\alpha,\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}. Let qa(z)=p(z1)znq_{a}(z)=p(z^{-1})z^{n}. By Lemma F.4, for every zz with z=1/α|z|=1/\alpha, we have that there exists w=α1e2πik/Mw=\alpha^{-1}e^{2\pi ik/M} for some integer kk such that qa(z)qa(w)O(τ2n/(αM))|q_{a}(z)-q_{a}(w)|\leq O(\tau_{2}n/(\alpha M)). Therefore let M=cnM=cn for sufficiently large cc (which depends on τi\tau_{i}’s), we have that for every zz with z=1/α|z|=1/\alpha, qa(z)Cτ0,τ1,τ2q_{a}(z)\in\mathcal{C}_{\tau_{0}^{\prime},\tau_{1}^{\prime},\tau_{2}^{\prime}}. This completes the proof.