Physics-Informed Probabilistic Learning of Linear Embeddings of Non-linear Dynamics With Guaranteed Stability

Shaowu Pan, Karthik Duraisamy

Introduction

Nonlinear dynamical systems are prevalent in physical, mathematical and engineering problems . Nonlinearity gives rise to rich phenomena, and is much more challenging to characterize in comparison to linear systems for which comprehensive techniques have been developed for decades .

Specifically, for measure-preserving systems, e.g., Hamiltonian system or the dynamics on an attractor of the Naiver–Stokes equations, one can guarantee a spectral decomposition of any observable $f\in\mathcal{F}$ , or the existence of a spectral resolution of the Koopman operator restricted on the attractor, i.e., $\mathcal{K}_{t}|_{\mathcal{A}}$ , where $\mathcal{A}\subset\mathcal{M}$ is the attractor. Such a spectral decomposition is a sum of three essentials: temporal averaging of the observable, the contribution from the point spectrum, which is almost periodic in time, and that from the continuous spectrum, which is chaotic . For a more comprehensive discussion, readers are referred to the excellent review by Budišić et al. . It should be stressed that for measure-preserving systems, the Koopman operator is not only well-defined on $L^{2}(\mathcal{M},\mu)$ but also can be shown to be unitary, which ensures properties such as simple eigenvalues and the existence of the aforementioned spectral resolution . Throughout this work, we employ the simple but practical assumption $\mathcal{F}=L^{2}(\mathcal{M},\mu)$ , where $\mu$ is some positive measure with support equal to $\mathcal{M}$ . Note that this implies that the Koopman operator is well-defined on $\mathcal{F}$ .

Koopman analysis has also gained increased attention in different communities. Koopman eigenfunctions have been leveraged in nonlinear optimal control . In modal analysis, the equivalence between the dynamic mode decomposition and the approximation of Koopman modes was recognized by Rowley et al. . In a reduced order modeling context, the operator-theoretic viewpoint of the Koopman operator is critical to the formulation of Mori-Zwanzig (M-Z) formalism , which has recently found utility in addressing closure and stabilization of multiscale problems.

Despite its appealing properties, the Koopman decomposition cannot be pursued in its original form described above for practical scientific applications, because the operator is defined on the infinite dimensional Hilbert space. To accommodate practical computation of the Koopman operator, we consider two simplifications:

Only a finite dimensional space invariant to $\mathcal{K}_{t}$ is of interest. This excludes the possibility of dealing with a chaotic system since it is impossible for a finite dimensional linear system to be topologically conjugate to a chaotic system . However, one could assign a state-dependent Koopman operator to work around this limitation .

Nonlinear reconstruction of observables from Koopman eigenfunctions. The original concept of the Koopman operator studies the evolution of any observable in the Koopman invariant subspace. Thus, the observables of interest can be linearly reconstructed from the Koopman eigenfunctions. Regarding the latter Linear reconstruction is desirable especially in the content of control . Li et al. considered augmenting the Koopman invariant subspace with neural network-trained observables together with the system state $\mathbf{x}$ to force linear reconstruction. However, for systems with multiple fixed points, there does not exist a finite-dimensional Koopman invariant subspace that also spans the state globallyIn this paper, the term “global embedding” is used to imply non-locality in phase space. Note that the Hartman-Grobman theorem establishes a topological conjugacy to a linear system with the same eigenvalues in a small neighborhood of the fixed point. due to the impossibility of establishing such a topological conjugacy with the linear system. Therefore, attention naturally moves to the introduction of nonlinear reconstruction that is more expressive than linear reconstruction, together with extra modes that indicate different basins of each attractor . Unfortunately, nonlinear reconstruction is equivalent to removing Koopman modes for the observables of interest . In other words, this is equivalent to removing the constraint that the system state lies in the finite dimensional Koopman invariant subspace. Although for systems with a single attractor, one could still retain a linear reconstruction representation, which might lead to a decrease in accuracy .

To determine the Koopman eigenfunctions, we consider addressing the problem from a data-driven perspective. A common approximation is to assume that the Koopman invariant subspace $\mathcal{F}_{D}$ is spanned by a finite dictionary of functions such as monomials, and then minimize the empirical risk of the residual that comes from the imperfect $\mathcal{F}_{D}$ . This is the general idea behind Extended Dynamic Mode Decomposition (EDMD) and Kernel-based Dynamic Mode Decomposition (KDMD) where an implicitly-defined infinite dimensional Koopman invariant subspace can be computationally affordable if the kernel satisfies Mercer’s condition . Note that EDMD is essentially a least-squares regression of the (linear) action of the Koopman operator for the features in the dictionary, with samples drawn from some measure $\mu$ with fixed features, which can be proved as a $L^{2}(\mathcal{M},\mu)$ orthogonal projection in the limit of large independent identically distributed data .

By framing the data-derived Koopman operator in the Hilbert space endowed with a proper measure, one can prove optimality in the asymptotic sense for EDMD and Hankel-DMD . As a side note, if the dynamics on the attractor is of central interest, it has been shown that Hankel-DMD with appropriate delay embedding and sampling rate can recover the entire system, even without sampling the full attractor . Further, one can establish a duality between the measure-preserving dynamics on the attractor with a stationary stochastic process, which reflects a close relationship between the estimation of a continuous spectrum from the trajectories, and the extraction of the power spectral density of stochastic signals . Other algorithms include generalized Laplace analysis (GLA) , and the Ulam Galerkin method . If the representation of the Koopman eigenfunctions is sparse in the pre-defined dictionary in EDMD, then one can leverage sparse regression techniques to select observables in an iterative fashion . In this work, we do not address continuous spectra. One way of addressing continuous spectra is to use an auxiliary sub-network to characterize the eigenvalues as state-dependent.

However, it has been shown that traditional EDMD/KDMD method is prone to overfitting and is hard to interpret . For instance, the number of approximated Koopman eigenfunctions scales with either the number of features in the dictionary or the number of training snapshots even though only a few of them are genuine Koopman eigenfunctions. Recently, there have been several alternative attempts to leverage deep learning architectures , to extract Koopman decompositions. Yeung et al. used feedforward neural networks to learn the dictionary, but the reconstruction loss was found to be non-optimal. Li et al. enforced several non-trainable functions, e.g., components of the system state, in the Koopman observables to ensure an accurate reconstruction but one that could be inefficient in terms of obtaining a finite-dimensional Koopman observable subspace . Further, Takeishi et al. utilized linear time delay embedding in the feedforward neural network framework to construct Koopman observables with nonlinear reconstruction which is critical for partially observed systems . Lusch et al. further extended the deep learning framework to chaotic systems. Otto and Rowley considered a recurrent loss for better performance on long time prediction on trajectories that transit to the attractor. Recently, Morton et al. addressed the uncertainty in a deep learning model with a focus on control. The benefit of formulating the search for the Koopman operator in an optimization setting enables the enforcement of stability. For example, it is also feasible to constrain eigenvalues in optimized DMD . Specifically in the neural network context, Erichson et al. considered a stability promoting loss to encourage Lyapunov stability of the dynamical system.

A unified approach towards uncertainty quantification, stabilization, and incorporation of physics information is lacking, particularly in a continuous setting. This motivates us to establish a probabilistic stabilized deep learning framework specifically to learn the Koopman decomposition for a continuous dynamical system. We employ automatic differentiation variational inference (ADVI) to quantify parametric uncertainty in deep neural networks, and structural parameterization to enforce stability of the Koopman operator extracted from the deep neural networks. A broad comparison of the present work with other approaches is shown in Table 1.

The outline of this paper is as follows: In Section 2, our framework of deep learning of Koopman operators of continuous dynamics, together with structured parameterization that enforces stabilization is presented. In Section 3, Bayesian deep learning, specifically the employment of variational inference is introduced. Results and discussion on several problems are given in Section 4. Conclusions are drawn in Section 5.

Data-driven framework to learn continuous-time Koopman decompositions

In contrast to previous approaches , our framework pursues continuous-time Koopman decompositions. The continuous formulation is more amenable to posit desired constraints and contend with non-uniform sampling , which is frequently encountered in experiments or temporal multi-scale data. To begin with, recall the general form of autonomous continuous nonlinear dynamical systems,

We seek a finite dimensional Koopman invariant subspace $\mathcal{F}_{D}$ with $D$ linearly independent smooth observation functions, defined as

By the chain rule, the relationship between Eq. 4 and Eq. 1 is

The goal is to find a minimal set of basis functions that spans a Koopman invariant subspace. There are several existing methods for approximation:

Dynamic mode decomposition (DMD): observation functions are linear transformations of the state, usually POD modes for robustness purposes .

Extended/Kernel DMD: observation functions are pre-specified, either explicit polynomials or implicitly by the kernel .

Deep learning Koopman/Neural Network DMD searching a set of fixed size for the nonlinear observations by artificial neural networks.

In this work, we specifically focus on using neural networks.

2 Problem formulation

We first define the ideal problem of learning the Koopman decomposition as a constrained variational problem, and incorporate assumptions to make it tractable step by step. Then, we introduce the effect of data as an empirical measure into the optimization in the function space. Specifically, we propose two slightly different formulations based on varying requirements.

Differential form: for low-dimensional systems when the governing equations, i.e., Eq. 5 are known, they can be leveraged without the trajectory data.

Recurrent form: for high-dimensional systems where only discrete trajectory data is obtained, which implies the access to the action of $\mathcal{K}_{t}$ over discrete $t$ .

As a natural extension, for any finite $n$ , given the vector-valued function $\mathbf{\Phi}=\begin{bmatrix}\phi_{1}&\ldots&\phi_{n}\end{bmatrix}\in\mathcal{F}^{n}$ , we define the corresponding norm as,

which describes the extent to which $\mathcal{F}_{D}$ is invariant to the Koopman operator $\mathcal{K}_{t}$ with $t\rightarrow 0$ with respect to each basis. Ideally, if we can find $\mathbf{\Phi}$ such that the corresponding $\mathcal{F}$ is invariant to $\mathcal{K}$ , i.e., $\mathcal{J}[\mathbf{\Phi}]=0$ , then $\mathcal{F}_{D}$ is invariant to $\mathcal{K}_{t}$ , $\forall t>0$ , i.e., a perfect linear embedding in the $L^{2}(\mu)$ sense. Once such an embedding is determined, the realization of $\mathcal{K}$ is simply the matrix $\mathbf{K}$ . In this work, we are interested in $\mathbf{\Phi}$ such that one can recover $\mathbf{x}$ and $\mathcal{J}[\mathbf{\Phi}]\geq 0$ .

Although the above problem setup only contains minimal necessary assumptions, it is both mathematically and computationally challenging. For practical purposes, following previous studies , we consider the following assumptions to make the problem tractable:

Instead of solving the equation $\mathcal{J}[\mathbf{\Phi}]=0$ , we search for $\mathbf{\Phi}$ by finding the minimum of the following constrained variational problem,

Instead of directly solving the variational problem in the infinite dimensional $\mathcal{F}^{D}$ , we optimize $\mathbf{\Phi}$ in the finite dimensional parameter space of feedforward neural networks with fixed architecture in which the number of layers and the number of hidden units in each layer is determined heuristically. Note that we are searching in a subset of $\mathcal{F}^{D}$ described by $\mathbf{W}_{\mathbf{\Phi}}$ , which might induce a gap due to the choice of the neural network architecture,

Clearly, the gap is bounded by the right hand side of the second inequality above. In addition, it should be noted that the requirement of linear independence in $\{\phi_{1},\ldots,\phi_{D}\}$ is relaxed, but $\dim\mathcal{F}_{D}$ is bounded by $D$ .

As a standard procedure in deep learning , we use the penalty method to approximate the constrained optimization problem with an unconstrained optimization with unity penalty coefficient. Since this still entails nonconvex optimization, we define a global minima as follows:

We then optimize the sum of square error $\widehat{\mathcal{J}}$ over all components of $\mathbf{\Phi}$ , which serves a upper bound of $\mathcal{J}$ ,

The above formulation also implies equal importance among all components of $\mathbf{\Phi}$ .

Despite the non-convex nature of the problem, we employ gradient-descent optimization to search for a local minimum.

In summary based on the above assumptions, we will solve the following optimization problem by gradient-descent:

2.2 Unknown physics with only the trajectory data: recurrent form

From the viewpoint of approximation, it can be argued that the most natural way to determine the continuous Koopman operator is in the aforementioned differential form. However, higher accuracy can be achieved by taking advantage of trajectory information and minimizing the error over multiple time steps in the trajectory. This is the key idea behind optimized DMD . Recently, Lusch et al. , Otto and Rowley extended this idea to determine the discrete-time Koopman operator using deep learning.

and solve the following optimization problem via a gradient-based algorithm:

As a side note, one might also define the following reconstruction functional similar to previous differential form that is independent of $\mathbf{K}$ ,

which can bound the prediction error functional together with the Koopman error functional by triangular inequality,

where $L_{\mathbf{\Psi}}$ is the Lipschitz constant for $\mathbf{\Psi}$ . However, in this work, we do not have control over $L_{\mathbf{\Psi}}$ , and thus we prefer to directly minimize the prediction error function as in previous studies .

3 Measures

We consider the situation where $M$ data points on $\mathcal{M}$ , i.e., $\{\mathbf{x}_{m}\}_{m=1}^{M}$ are drawn independently from some measure $\mu$ on $\mathcal{M}$ , e.g., uniform distribution. This induces the following empirical measure $\hat{\mu}_{M}$ ,

where $\delta_{x}$ is the Dirac measure for $x$ . Note that $\hat{\mu}_{M}$ uniformly converges to $\mu$ as $M\rightarrow\infty$ . Thus, one can rewrite the differential form in Eq. 18 as an empirical risk minimization ,

where $\widehat{\mathcal{F}}_{M}=L^{2}(\mathcal{M},\hat{\mu}_{M})$ .

3.2 Measure for trajectory data generated by solving the initial value problem

In the general case, information content in trajectory data resulting from the solution of the initial value problem is strongly dependent on the initial condition. For instance, when the initial condition is in a region of phase space with sharp changes, it would be sensible to use a high sampling rate. On the other hand, if the initial condition is near a fixed point attractor, one might prefer to stop collecting the data after the system arrives at the fixed point. Such a sampling pattern for the specific initial state can be summarized as a Markov kernel $\kappa:\mathcal{M}\times\Sigma_{\mathcal{T}}\mapsto$ . For every fixed initial state $\mathbf{x}\in\mathcal{M}$ , the map $\kappa(\mathcal{E}_{\mathcal{T}},\mathbf{x})$ is a measure on $(\mathcal{T},\Sigma_{\mathcal{T}})$ , $\mathcal{E}_{\mathcal{T}}\in\Sigma_{\mathcal{T}}$ , where $\Sigma_{\mathcal{T}}$ is the $\sigma$ -algebra on $\mathcal{T}$ . Then, there exists a unique measure $\nu$ on $\mathcal{U}$ such that,

for $\mathcal{E}_{\mathcal{T}}\in\Sigma_{\mathcal{T}},\mathcal{E}_{\mathcal{M}}\in\Sigma_{\mathcal{M}}$ .

Assume we are given $M$ trajectories, $\{\{\mathbf{x}_{m,j}\}_{j=1}^{T_{m}}\}_{m=1}^{M}$ , $\{\{t_{m,j}\}_{j=1}^{T_{m}}\}_{m=1}^{M}$ . The initial condition for each trajectory is drawn independently from $\mu$ . For the $m$ -th trajectory with initial condition $\mathbf{x}_{m}$ , there are $T_{m}$ samples drawn independently from measure $\kappa(\cdot,\mathbf{x}_{m,1})$ where the time elapse of $j$ -th sample away from initial condition $\mathbf{x}_{m,1}$ is $t_{m,j}$ ( $t_{m,1}=0$ ), where $m=1,\ldots,M$ . Then, we can define the corresponding empirical measure from Eq. 28,

where $\displaystyle\widehat{T}=\max_{m=1,\ldots,M}T_{m}$ , and $\hat{\nu}_{M,\widehat{T}}$ uniformly converges to $\nu$ as $M,\widehat{T}\rightarrow\infty$ .

Similarly, we can rewrite Eq. 22 as the following,

where $\widehat{G}=L^{2}(\mathcal{U},\widehat{\nu}_{M,\widehat{T}})$ .Note that Eq. 30 generalizes the LRAN model with a simpler loss function and the discrete spectrum model in the paper of Lusch et al. .

Note that the above data setup also generalizes to cases where data along a single long trajectory is “Hankelized” , i.e., dividing a long trajectory into several smaller-sized consecutive trajectories. This would lead to a truncated time horizon in the model, which leads to better computational efficiency compared to the original data, at the cost of some loss in prediction accuracy.

4 Guaranteed stabilization of the Koopman operator

Eigenvalues of the Koopman operator are critically important in understanding the temporal behavior of certain modes in the dynamical system. For a measure-preserving system or systems on an attractor, e.g., post-transient flow dynamics , even if the system is chaotic, the eigenspectrum of the continuous-time Koopman operator would still be on the imaginary axis. It should be noted that although the Koopman operator still accepts unstable modes, i.e., the real part of the eigenvalues of continuous-time Koopman operator being positive, its absence in systems governed by Navier-Stokes equation in fluid mechanics has been documented . Hence, in this work, we assume that the Koopman eigenvalues corresponding to the finite dimensional Koopman invariant subspace of interest have non-positive real parts. It is important to note that the concept of unstable Koopman modes should not be confused with that of flow instability. Prior models (for instance, ) have not explicitly taken stability into account, and thus resulted in slightly unstable Koopman modes. While this is acceptable for relatively short time predictions, long time predictions will be problematic.

Before presenting the stabilization technique, it is instructive to note the non-uniqueness of ideal observation functions $\mathbf{\Phi}$ , i.e., the one corresponding to the exact Koopman invariant subspace. This is because one can simultaneously multiply any $D\times D$ invertible real matrix and its inverse before and after the observation vector $\mathbf{\Phi}$ while keeping the Koopman eigenfunction, eigenvalues, and the output from the reconstruction the same. Thus, observation functions described by neural networks cannot be expected to be uniquely determined. We will leverage this non-uniqueness to enforce stability.

Enlightened by recent studies in the design of stable deep neural network structures where skew-symmetric weights are used inside nonlinear activations, we devised a novel parameterization for the realization of the Koopman operator in the following form:

In Appendix A, we show that the real parts of the eigenvalues of a $n\times n$ real (possibly non-symmetric) negative semi-definite matrix $\mathbf{A}$ are non-positive. We then prove that the above parameterization posits a constraint that the time evolution associated with $\mathbf{K}_{stable}$ in Eq. 31 for any choice of parameters in $\mathbf{R}$ is always stable. Further, the constraint from the parameterization in Eq. 31 is actually rich enough such that any diagonalizable matrix corresponding to a stable Koopman operator can be represented without loss of expressivity.

Without loss of generality, the diagonal matrix $\mathbf{J}$ contains $2D_{c}$ complex eigenvalues $\{\lambda^{c}_{j}\}_{j=1}^{2D_{c}}$ and $D_{r}$ real eigenvalues $\{\lambda^{r}_{j}\}_{j=1}^{D_{r}}$ where $2D_{c}+D_{r}=D$ .

Consider a $2\times 2$ real matrix, $\mathbf{A}_{j}=\begin{bmatrix}-\sigma_{j}^{2}&\zeta_{j}\\ -\zeta_{j}&-\sigma_{j}^{2}\end{bmatrix}$ where the eigenvalues are $\lambda_{1,2}=-\sigma_{j}^{2}\pm j\zeta_{j}^{2}$ . We use this matrix to construct a $2\times 2$ matrix that has eigenvalues $\lambda^{c}_{j}$ . For each pair of complex eigenvalues, $j=1,\ldots,D_{c}$ , we have $\mathbf{A}_{j}=\begin{bmatrix}\textrm{Re}(\lambda^{c}_{j})&\sqrt{|\textrm{Im}(\lambda^{c}_{j})|}\\ -\sqrt{|\textrm{Im}(\lambda^{c}_{j})|}&\textrm{Re}(\lambda^{c}_{j})\end{bmatrix}$ . Next, combining with the $D_{r}$ real eigenvalues, we have the following block diagonal matrix that shares the same eigenvalues as $\mathbf{J}$ ,

Therefore, $\mathbf{K}_{stable}$ is a negative semi-definite matrix. Following Lemma A.2, the real part of any eigenvalue of $\mathbf{K}_{stable}$ is non-positive.

Careful readers might notice that indeed one can further truncate the parameterization with a $2\times 2$ block matrix instead of the tridiagonal form used in Eq. 31. However, since this reduction takes more effort in the implementation while the parameterization in Eq. 31 has already reduced the number of parameters from $O(D^{2})$ to $O(D)$ , we prefer the tridiagonal form. For the rest of the work, we will use this parameterization for all of the cases concerned.

5 Design of neural network architecture with SVD-DMD embedding

In the previous subsection, we described the deep learning formulation for the Koopman operator as a finite dimensional optimization problem that approximates the constrained variational problem, together with a stable parameterization for $\mathbf{K}$ . In this subsection, we will describe the design of our neural network that further embeds the SVD-DMD for differential and recurrent forms, which were presented in Sections 2.2.1 and 2.2.2.

As a standard procedure, we consider normalization on the snapshot matrix of state variable $\mathbf{X}$ ,

for better training performance on neural networks . Specifically, we consider Z-normalization shown in Eq. 36, i.e., subtracting the mean of $\mathbf{x}$ then dividing the standard deviation to obtain $\mathbf{z}$ .

where $\overline{\mathbf{x}}=\frac{1}{M}\sum_{m=1}^{M}\mathbf{x}_{m}$ , $\mathbf{\Lambda}=\textrm{diag}\{d_{1},\ldots,d_{N}\}$ , $d_{j}$ is the uncorrected standard deviation of $j$ -th component of $\mathbf{x}$ , i.e.,

where $\mathbf{x}_{m,j}$ is $j$ -th component of $m$ -th data, $j=1,\ldots,N$ .

While such a normalization is helpful for neural network training in most cases, in some cases where the data is a set of POD coefficients, it cannot differentiate between components that could be more significant than others. Therefore, for those cases, we consider a different normalization with the $\mathbf{\Lambda}$ that sets the ratio of standard deviation between components as:

where $d_{max}=\max_{j=1,\ldots,N}d_{j}$ , and $\mathbf{I}$ is the identity matrix.

5.2 Embedding with SVD-DMD

Instead of directly using the standard feedforward neural network structure employed in previous works, we embed SVD-DMD into the framework and learn the residual. Recall for the standard SVD-DMD algorithm, given $M$ sequential snapshots in Eq. 35 uniformly sampled at intervals $\Delta t$ , one linearly approximates the action of $\mathcal{K}_{\Delta t}$ of the first $r$ dominant SVD modes of centered snapshots, i.e., $\mathbf{q}=(\mathbf{x}-\overline{\mathbf{x}})\mathbf{V}_{r}$ . Here, $r$ is empirically chosen as a balance between numerical robustness and reconstruction accuracy, and $\mathbf{V}_{r}$ are the first $r$ columns from $\mathbf{V}$ of the SVD of centered snapshots,

Then the SVD-DMD operator is simply the matrix $\mathbf{A}$ that minimizes $\sum_{m=1}^{M-1}\lVert\mathbf{q}_{m+1}-\mathbf{q}_{m}\mathbf{A}\rVert$ , where $\mathbf{q}_{m}$ is the corresponding orthogonal projection of $\mathbf{x}_{m}$ .

To embed the above SVD-DMD structure into the neural network, we introduce two modifications. First, we take $r=D$ . Since $D$ is arbitrary, if $N<D$ , we simply append zero columns in $\mathbf{V}_{D}$ . Second, we would also need to accommodate $\mathbf{z}$ with the input normalization in Section 2.5.1. Thus, we cast $\mathbf{q}=\mathbf{z}\mathbf{\Lambda}\mathbf{V}_{D}$ . We then have,

where $\mathbf{\Phi}_{nn}\triangleq\mathbf{\Phi}(\cdot;{\color[rgb]{0,0,0}\mathbf{W}_{\mathbf{\Phi}}\setminus{\{W_{enc,L}\}}})$ , $\mathbf{W}_{\mathbf{\Phi}}=\{W_{enc,1},b_{enc,1},\ldots,W_{enc,L}\}$ , $\mathbf{\Psi}_{nn}\triangleq\mathbf{\Psi}(\cdot;\mathbf{W}_{\mathbf{\Psi}})$ , $\mathbf{W}_{\mathbf{\Psi}}=\{W_{dec,1},\ldots,W_{dec,L},b_{dec,L}\}$ , $L$ is the number of layers for encoder or decoder neural network. The embedding is illustrated in Fig. 1.

The intuition behind the embedding of the SVD-DMD into the framework is given below:

Schmid’s DMD algorithm relies on the dominant POD modes to reduce the effect of noise, and has been shown to approximate the Koopman invariant subspace . This has been demonstrated even for high-dimensional nonlinear dynamical systems with millions of degrees of freedom . Moreover, for systems with continuous spectra, POD appears to be a robust alternative to Koopman mode decomposition . Thus, we assume that the true Koopman invariant subspace is easier to obtain by only learning the residual with respect to the dominant POD subspace.

Although we did not precisely implement ResNet blocks (empirically, for the problem concerned in this paper, our network does not have to be as deep as common architectures in the deep learning community ), we believe that such fixed mappings may have similar benefits in ResNets.

The neural network architecture for the differential form is shown in Fig. 2,

Note that, if the nonlinear part, i.e., the feedforward neural network is not activated, the above formulation reverts to an over-parameterized SVD-DMD. Specifically, the recurrent form model with neural network deactivated can be viewed as a simplified version of optimized DMD .

6 Implementation

The framework is built using Tensorflow . Neural network parameters $\mathbf{W}_{\mathbf{\Phi}},\mathbf{W}_{\mathbf{\Psi}}$ , are initialized with the standard truncated normal distribution. $\mathbf{K}$ is initialized with the corresponding DMD approximation. The objective function is optimized using Adam , which is an adaptive first order stochastic optimization method using gradient updates scaled by square roots of exponential moving averages of previous squared gradients. Note that we also include weight decay regularization, $\mathcal{L}_{reg}=\left\lVert\bm{W}_{\mathbf{\Phi}}\right\rVert^{2}+\left\lVert\bm{W}_{\mathbf{\Psi}}\right\rVert^{2}$ , in the objective function, to avoid spurious oscillations in the learned Koopman functions, which helps generalization in an interpolation sense .

Probabilistic formulation

A Bayesian formalism is adopted to quantify the impact of several sources of uncertainty in model construction on the model predictions. Bayes’ rule is

where $\mathcal{D}$ is data, and $\mathbf{\Theta}$ is the set of parameters. For simplicity, $P$ represents the probability density function (PDF) on the measure space generated by the data and parameters. From a traditional Bayesian standpoint, as the number of parameters in the neural network is large, it is impossible for common inference tools such as Markov Chain Monte Carlo (MCMC) to be practical. To overcome the curse of dimensionality, several general approaches such as Laplacian approximation and variational inference have been proposed. The former is computationally economical but has two major limitations: First, computing the full Hessian is impossible and expensive for a high dimensional problem and most often approximations such as $\mathcal{J}^{\top}\mathcal{J}$ , where $\mathcal{J}$ is the Jacobian are employed . Second, it only provides a local approximation, which can often be far removed from the true posterior. Variational inference has become popular in the deep learning community as it offers an informed balance between the computationally expensive MCMC method, and the cheap but less descriptive models such as the Laplacian approximation. Historically, variational inference for neural networks has been difficult to formulate, largely due to the difficulty of deriving analytical solutions to the required integrals over the variational posteriors even for simple network structures. Graves proposed a stochastic method for variational inference with a diagonal Gaussian posterior that can be applied to almost any differentiable log-loss parametric model, including neural networks. However, there is always a trade-off between the complexity of the posterior and scalability and robustness . In this work, we adopt the mean-field variational inference .

2 Mean-field variational inference

However, direct computation of the KL divergence is intractable, since we do not have access to $\log P(\mathcal{D})$ . Instead, we choose an alternative, the evidence lower bound (ELBO), i.e., the negative KL divergence plus $\log P(\mathcal{D})$ , in Eq. 44 to be maximized.

To maximize the ELBO, we leverage automatic differentiation from Tensorflow to compute the gradients with respect to $\xi$ , following the framework of Automatic Differentiation Variational Inference (ADVI) where Gaussian distributions are considered as the variational family. Specifically, we employ the mean-field assumption in Eq. 45 such that,

where $Z$ is the total number of parameters. For $j=1,\ldots,Z$ , $\theta_{j}$ is the $j$ -th parameter as random variable, and $\xi_{j}$ is the corresponding variational parameter that describes the distribution. This is particularly convenient for neural network models constructed in Tensorflow since weights and biases are naturally defined on some real coordinate space. If the support of the parameter distribution is restricted, one can simply consider a one-to-one differentiable coordinate transformation $\Upsilon(\mathbf{\Theta})=\mathbf{Z}$ , such that $\mathbf{Z}$ is not restricted in some real coordinate space, and posit a Gaussian distribution on $\mathbf{Z}$ . Note that this naturally induces non-Gaussian distribution. Here, we employ the ADVI functionality in Edward , which is built upon Tensorflow to implement ADVI. Interested readers should refer to the original paper of ADVI for the specific details of implementing mean-field variational inference including the usage of the reparameterization-trick to compute the gradients. In contrast, note that the maximum a posteriori (MAP) estimation of the posterior can be cast as a regularized deterministic model as illustrated in Fig. 4. Since weight decay is employed in previous deep learning models to learn the Koopman decomposition, one can show that the previous model is essentially a MAP estimation of the corresponding posterior.

Note that the mean-field Gaussian assumption is still simplified, yet effective and scalable for deep neural nets . It is interesting to note that several recent works leverage Stein Variational Gradient Descent (SVGD) , a non-parametric, particle-based inference method, which is able to capture multi-modal posteriors. Robustness and scalability to high dimensions, e.g., deep neural nets, is still an area of active research .

3 Bayesian hierarchical model setup

Recall that we have the following parameters for the deep learning models introduced in Section 2.2.1 and Section 2.2.2: 1) weights and biases for the “encoder”, $\mathbf{W}_{\mathbf{\Phi}}$ , 2) weights and biases for the “decoder”, $\mathbf{W}_{\mathbf{\Psi}}$ , and 3) stabilized $\mathbf{K}$ with $\zeta_{1},\ldots,\zeta_{D}$ and $\sigma_{1},\ldots,\sigma_{D}$ . Based on mean-field assumptions, we just need to prescribe the prior and variational posterior for each parameter. For each weight and bias, we posit a Gaussian prior with zero mean, with the scaling associated with each parameter to follow the recommended half–Cauchy distribution , which has zero mean and scale as 1 (empirical). The variational posterior for each weight and bias is Gaussian, and Log-normal for scale parameters. For the off-diagonal part of $\mathbf{K}$ , $\zeta_{i}$ , we posit the same type of Gaussian prior as before with the scale parameter following a hierarchical model for each $i=1,\ldots,D$ . The corresponding variational posterior for $\zeta_{i}$ is still Gaussian while log-normal for the scale parameter. For the non-negative diagonal part of $\mathbf{K}$ , we posit a Gamma distribution for each $\sigma_{i}^{2}$ , $i=1,\ldots,D$ , with rate parameter as 0.5 and shape parameter following the previous half-Cauchy distribution. Variational posteriors for both $\sigma^{2}_{i}$ and its shape parameter are log-normal.

For the differential form in Section 2.2.1, we set up the following likelihood functionNote that independence between the data and the structure of the aleatoric uncertainty is assumed. Such an assumption correlates well with existing deterministic models based on mean-square-error. based on Z-normalized data $\mathcal{D}=\{\mathbf{z}_{m},\mathbf{\dot{z}}_{m}\}_{m=1}^{M}$ :

where $\mathbf{\Lambda}_{rec},\mathbf{\Lambda}_{lin}$ are diagonal covariance matrices with the prior of each diagonal element following the previous half–Cauchy distribution. We also posit log-normals to infer the posterior of $\mathbf{\Lambda}_{rec},\mathbf{\Lambda}_{lin}$ . We denote $\mathbf{\Theta}$ as $\mathbf{W}_{\mathbf{\Phi}},\mathbf{W}_{\mathbf{\Psi}}$ , and associated scale parameters together with $\mathbf{\Lambda}_{rec}$ , $\mathbf{\Lambda}_{lin}$ .

For the recurrent form in Section 2.2.2, given normalized data,

we consider the following likelihood, Alternative likelihoods can be chosen to account for the variation of aleatoric noise in time, which is well-suited for short-horizon forecasting. However, we are more interested in a free-run situation where only the initial condition is given.:

4 Propagation of uncertainties

Given data $\mathcal{D}$ and the inferred posterior $P(\mathbf{\Theta}|\mathcal{D})$ , assuming a noise-free initial condition $\mathbf{z}_{0}$ , we are interested in future state predictions with uncertainties. For the differential form in Eq. 46, we have

However, $P(\mathbf{\Phi}(t)|\mathbf{K},\mathbf{\Phi}(\mathbf{z}_{0};\mathbf{W}_{\mathbf{\Phi}}))$ is unknown because the differential form does not use trajectory information. If we assume multivariate Gaussian white noise with the same covariance in the linear loss $\mathbf{\Lambda}_{lin}$ , then one can forward propagate the aleatoric uncertainty associated with the likelihood function of the linear loss. Then one immediately recognizes that the continuous-time random process of $\mathbf{\Phi}(t)$ becomes a multivariate Ornstein–Uhlenbeck process,

where $\mathbf{B}(t)$ is a $D$ -dimensional Gaussian white noise vector with unit variance. Note that

where $\int_{0}^{t}e^{s\mathbf{K}}\mathbf{\Lambda}_{lin}e^{s\mathbf{K}^{\top}}ds=\mathbf{vec}^{-1}(-(\mathbf{K}\oplus\mathbf{K})^{-1}(\mathbf{I}-e^{t(\mathbf{K}\oplus\mathbf{K})})\mathbf{vec}(\mathbf{\Lambda}_{lin}))$ . $\mathbf{vec}(\cdot)$ is the stack operator, and $\oplus$ is the Kronecker sum .

It is interesting to note that, since $\mathbf{K}$ is restricted by Eq. 31 and does not contain any eigenvalues with positive real part, the variance in Eq. 51 will not diverge in finite time. One can thus simply draw samples of $\mathbf{\Phi}(t)$ from Eq. 51. Thus, we approximate Eq. 49 with Monte Carlo sampling from the corresponding variational posterior,

For the recurrent form in Eq. 48, the posterior predictive distribution of $\mathbf{z}(t)$ given the initial condition is straightforward:

Numerical Results & discussion

To demonstrate and analyze the approaches presented herein, three numerical examples are pursued.

Consider the two-dimensional nonlinear dynamical system with a fixed point,

where $\mu=-0.05$ , $\lambda=-1$ . For this low dimensional system with known governing equations, we consider the differential form in Section 2.2.1, with 1600 states as training samples, sampled from $[-0.5,0.5]$ using the standard Latin-Hypercube-Sampling method for both $x_{1}$ and $x_{2}$ . The embedding of SVD-DMD is not employed in this case since it is not very meaningful. The hyperparameters for training are given in Table 2.

After we obtained the inferred posterior, we consider Monte Carlo sampling described in Section 3.4 with $N_{mc}=100$ , $M_{mc}=10$ to approximate the posterior distribution of the Koopman observables and prediction on the dynamical system given an initial condition $\mathbf{x}_{0}$ . The mean Koopman eigenvalues are $\lambda_{1}=-0.99656$ and $\lambda_{2}=-0.05049$ , and the amplitude and phase angle for the mean eigenfunctions are shown in Fig. 5, which resembles the analytic solution with $\varphi_{1}=x_{2}-\lambda x_{1}^{2}/(\lambda-2\mu)$ , $\varphi_{2}=x_{1}$ .

Finally, since we have a learned a distribution over the Koopman operator, we can obtain the posterior distribution of the predicted dynamics, given an arbitrary unseen initial condition following Section 3.4, for example, at $\mathbf{x}_{0}=(0.4,-0.4)$ . The effect of the number of data samples $M$ on the confidence of the predicted dynamics can be seen in Fig. 6. Clearly, as more data is collected in the region of interest, the propagated uncertainty of the evolution of the dynamics predicted on the testing data decreases as expected. It is interesting to note that, even when the data is halved, the standard deviation of the Koopman eigenvalue is very small compared to the mean.

2 2D unforced duffing oscillator

Next, we consider the unforced duffing system:

where $\delta=0.5$ , $\beta=-1$ , $\alpha=1$ . We use 10000 samples of the state $\mathbf{x}$ distributed on $x_{1},x_{2}\in$ from Latin-Hypercube-Sampling. We infer the posterior using the differential form model in Section 2.2.1, with the following hyperparameters in Table 3.

To assess the uncertainty in the Koopman eigenfunctions, we draw 100 samples from the variational posterior of $\mathbf{W}_{\mathbf{\Phi}}$ and $\mathbf{K}$ . First, the mean of the non-trivial Koopman eigenvalues are $\lambda_{1,2}=-0.535\pm 0.750i$ and the mean of the third eigenvalue is $\lambda_{3}=-2\times 10^{-5}$ . The mean of the module and phase angle of the Koopman eigenfunctions is shown in Fig. 7. The results are similar to Ref. in which a deterministic model is employed. The Koopman eigenfunction associated with $\lambda_{3}$ acts as an indicator of the basin of attraction. Second, to better visualize the effect of uncertainty on unseen data, we normalize the standard deviation of the module of Koopman eigfunctions by the minimal standard deviation over $ $. Fig. 7 shows a uniformly distributed standard deviation within$ \times$ where sampling data is distributed, and a drastic increase outside that training region as expected. Note that the area where the normalized standard deviation is larger than ten is cropped.

To obtain the posterior distribution of the evolution of the dynamics predicted by the Koopman operator, we arbitrarily choose an initial condition within the range of training data as $\mathbf{x}(0)=[1.2,1.2]$ . Note that we did not input the model with any trajectory data other than the state and the corresponding time derivative obtained from the governing equation. With Monte Carlo sampling, we obtain the distribution of the trajectory in Fig. 7. Note that the uncertainty is quite small since it is within the training data region with enough data.

3 Flow wake behind a cylinder

We consider the velocity field of a two-dimensional laminar flow past a cylinder in a transient regime at a Reynolds number $Re_{D}=U_{\infty}D/\nu=100$ , which is above 47, the critical Reynolds number associated with Hopf bifurcation. $U_{\infty}$ is the freestream velocity, $D$ is the cylinder diameter, and $\nu$ is the kinematic viscosity. The transient regime is rather difficult due to the high nonlinearity of post-Hopf-bifurcation dynamics between unstable equilibrium and the limit cycle . Data is generated by solving the two dimensional incompressible Navier–Stokes equations using OpenFOAM . Grid convergence was verified using a sequence of successively refined meshes.

The initial condition is a uniform flow with the freestream velocity superimposed with standard Gaussian random noise, and with pressure initialized with Gaussian random noise. Although this initial condition is a rough approximation to the development of true instabilities from equilibrium , the flow is observed to rapidly converge to a quasi-steady solution after a few steps, and then starts to oscillate and form a long separation bubble with two counter-rotating vortices. The first 50 POD snapshots of velocity are considered with the kinetic energy captured upto 99%. We sample 1245 snapshots of data on the trajectory starting from the unstable equilibrium point to the vortex shedding attractor with $\Delta t=0.1t_{ref}=0.1D/U_{\infty}$ where the characteristic advection time scale $t_{ref}=D/U_{\infty}=2\textrm{ sec}$ . The first 600 snapshots are considered as training data, i.e., $0\leq t\leq 60t_{ref}$ .

To further analyze the robustness of the model to noisy data, Gaussian white noise is added Note that noise was not added to the original flow field since taking the dominant POD modes on the flow field would contribute to de-noising. to the temporal data of POD coefficients by considering a fixed signal-to-noise ratio as 5%, 10%, 20%, 30%, for each component at each time instance. The model performance is evaluated by predicting the entire trajectory with the (noisy) initial condition given, including the remaining 645 snapshots. We consider the recurrent form model described in Section 2.2.2 together with SVD-DMD described in Section 2.5.2 with the corresponding hyperparameters given in Table 4. Note that we consider 20 intrinsic modes, i.e., at most 10 distinct frequencies can be captured, which are empirically chosen. Also, the finite-horizon window length corresponding to $T=100$ is $10t_{ref}$ , which is much less than the time required for the system to transit from unstable equilibrium to the attractor.

The continuous-time Koopman eigenvalue distribution of 20 modes for the training data with five different signal-to-noise ratios is shown in Fig. 8. First, all Koopman eigenvalues are stable according to the stable parameterization of $\mathbf{K}$ in Eq. 31. Second, when noise is added, the eigenvalues are seen to deviate except the one on the imaginary axis with $\lambda=\pm 0.528j$ , which corresponds to the dominant vortex shedding frequency on the limiting cycle with $St=\lambda D/(2\pi U_{\infty})=0.168$ . This exercise verifies the robustness of the present approach to Gaussian noise.

Recall that we have a posterior distribution of the predicted trajectory of POD components for with five different noise ratios. Monte Carlo sampling of the posterior distribution is shown for the noisy training data in Fig. 9. Clearly, uncertainty from the data due to the Gaussian white noise is well characterized by the ensemble of Monte Carlo sampling on the distribution. To further analyze the effect of the difference between the mean of the posterior distribution of the prediction and the ground truth clean trajectory on the flow field, we show the (projected) mean and standard deviation of the predicted POD coefficients at $t=100t_{ref}$ , in Fig. 10 and Fig. 11. As seen in Fig. 10, there is hardly any difference between the mean of the posterior distribution and the ground truth. The contour of standard deviation projected onto the flow field shows a similar pattern to the vortex shedding, and the standard deviation near the wake region is relatively small compared to other domains in the flow field.

Conclusions

A probabilistic, stabilized deep learning framework was presented towards the end of extracting the Koopman decomposition for continuous nonlinear dynamical systems. We formulated the deep learning problem from a measure-theoretic perspective with a clear layout of all the assumptions. Two different forms: differential and recurrent, suitable for different situations were proposed and discussed. A parameterization of Koopman operator was proposed, with guaranteed stability. Further, a novel deep learning architecture was devised such that the SVD-DMD is naturally embedded. Finally, mean-field variational inference was used to quantify uncertainty in the modeling. To evaluate the posterior distribution, Monte Carlo sampling procedures corresponding to different forms were derived. Finally, the model was evaluated on three continuous nonlinear dynamical systems ranging from toy polynomial problems to an unstable wake flow behind a cylinder, from three different aspects: the uncertainty with respect to the density of data in the domain, unseen data, and noise in the data. The results show that the proposed model is able to capture the uncertainty in all the above cases and is robust to noise.

Appendix A Real part of eigenvalues of a negative semi-definite real square matrix

For a negative semi-definite real square matrix, the real part of all of its eigenvalues is non-positive.

Then we have $0=(\mathbf{A}-\lambda)\mathbf{v}=(\mathbf{A}-\alpha-j\beta)(\mathbf{v}_{r}+j\mathbf{v}_{i})$ , which further leads to $(\mathbf{A}-\alpha)\mathbf{v}_{r}=-\beta\mathbf{v}_{i}$ and $(\mathbf{A}-\alpha)\mathbf{v}_{i}=\beta\mathbf{v}_{r}$ . Then we have $\mathbf{v}_{r}^{\top}(\mathbf{A}-\alpha)\mathbf{v}_{r}=-\beta\mathbf{v}_{i}^{\top}\mathbf{v}_{r}$ , and $\mathbf{v}_{i}^{\top}(\mathbf{A}-\alpha)\mathbf{v}_{i}=\mathbf{v}_{i}^{\top}\beta\mathbf{v}_{r}$ . Thus adding the two equations, we have $\mathbf{v}_{r}^{\top}(\mathbf{A}-\alpha)\mathbf{v}_{r}+\mathbf{v}_{i}^{\top}(\mathbf{A}-\alpha)\mathbf{v}_{i}=0$ , which implies $\alpha=(\mathbf{v}_{r}^{\top}\mathbf{A}\mathbf{v}_{r}+\mathbf{v}^{\top}_{i}\mathbf{A}\mathbf{v}_{i})/(\mathbf{v}_{r}^{\top}\mathbf{v}_{r}+\mathbf{v}_{i}^{\top}\mathbf{v}_{i})$ .

Using the definition of negative semi-definite matrices, we have $\mathbf{v}_{r}^{\top}\mathbf{A}\mathbf{v}_{r}\leq 0$ and $\mathbf{v}_{i}^{\top}\mathbf{A}\mathbf{v}_{i}\leq 0$ , even if $\mathbf{v}_{r}$ or $\mathbf{v}_{i}=\mathbf{0}$ . Since there is at least one non-zero vector between $\mathbf{v}_{r}$ and $\mathbf{v}_{i}$ , one can safely arrive at $\alpha\leq 0$ that the real part of any eigenvalue of $\mathbf{A}$ is non-positive.

Acknowledgements

This work was supported by DARPA under the grant titled Physics Inspired Learning and Learning the Order and Structure Of Physics,. Computing resources were provided by the NSF via grant 1531752 MRI: Acquisition of Conflux, A Novel Platform for Data-Driven Computational Physics.