Long short-term memory embedded nudging schemes for nonlinear data assimilation of geophysical flows

Suraj Pawar, Shady E. Ahmed, Omer San, Adil Rasheed, Ionel M. Navon

Introduction

Data assimilation (DA) is a methodology in which observations are used to correct the results of a mathematical model in order to reconstruct the spatiotemporal dynamics of a system. DA is used extensively in weather forecasting, where a growing number of observations come from satellites and in situ monitoring. Variational and sequential schemes are two of the most widely used approaches in dynamical data assimilation. In the former, DA is formulated as a minimization problem whose objective function is the discrepancy between real observations and the model’s predictions for a given set of initial conditions and parameters. The argument of this minimization problem is the set of initial conditions and model parameters that must be tuned to drive the predictions towards the observations. Sequential methods, on the other hand, usually rely on statistical inference using Bayesian analysis: the current measurements are used to correct the prior model forecast and obtain a better posterior estimate, usually called the analysis in DA terminology.

One of the key limitations of DA methods is that they rely on a forward model whose dynamics are known. For high-dimensional systems like geophysical flows, standard DA methods suffer from the curse of dimensionality. With the increasing resolution of numerical models, the nonlinearities are likely to become so strong that DA algorithms based on linearization might fail. In recent years, with an explosion of data generated from observations, experimental measurements, and numerical simulations, there has been growing interest in applying data-driven methods along with DA. Most efforts have focused on using data-driven models in lieu of conventional (physics-based) models in order to accelerate the DA computations. Tang et al. developed a surrogate model based on convolutional and recurrent neural networks for predicting dynamical subsurface flows and employed it in the DA framework as an emulator of the forward dynamical model. Several other studies have demonstrated the potential of data-driven methods for the accurate prediction of complex physical systems such as flooding, global atmospheric models, quasigeostrophic flows, chaotic systems, soil water dynamics, and tsunami modeling. Recent works have also combined DA with reduced order models. Bocquet et al. proposed a hybrid framework combining DA and machine learning (ML) to estimate the model, the state trajectory, and the model error statistics for high-dimensional chaotic systems from partial and noisy observations. Brajard et al. proposed an algorithm in which neural networks provide a surrogate forward model to DA, and DA provides a time series of complete states to train the neural network. They illustrated the convergence of the proposed algorithm for the Lorenz 96 system and achieved accurate forecasts up to two Lyapunov time units.

Conversely, ML tools can also benefit from DA algorithms. Abarbanel et al. offer a perspective on the equivalence between ML and statistical data assimilation and discuss how methods developed in DA can be useful for ML. Bocquet et al. proposed DA as a learning tool to infer ordinary differential equations for dynamical systems solely from noisy data and showed its connection with deep learning methods. Pérez-Ortiz et al. showed that the long short-term memory (LSTM) network can be trained efficiently, with better generalization, using the decoupled extended Kalman filter.

As an extension of current efforts to use ML tools in the DA context, we propose a modular neural-network-based DA framework. In other words, we utilize ML to fuse the model’s estimates with noisy observations and provide more accurate predictions, rather than using ML merely as a facilitator to accelerate existing DA algorithms. To accomplish this, we train a long short-term memory (LSTM) neural network to “nudge” the model’s forecast given a set of sparse observations. Nudging is a relatively simple DA approach that uses the forecast error, defined as the difference between model predictions and measurements, to constrain and correct the model evolution. Nudging was introduced by Anthes for the initialization of hurricane models from real observational data. In nudging methods, the state analysis is approximated as a linear superposition of the model forecast and the forecast error. Despite their conceptual simplicity, nudging schemes often require an ad hoc approximation of the nudging (or weighting) matrix. In our framework, we relax this linear superposition assumption and avoid those ad hoc approximations by training an LSTM neural network to nonlinearly blend the model’s forecast and sparse observations.

We demonstrate and test the proposed LSTM-DA framework using the Lorenz 96 system as a benchmark problem in geophysical science applications. We illustrate the success of LSTM-DA using different sets of observations with varying levels of noise and sparsity. In particular, we consider combinations of data-rich, data-deficient, observation-rich, and observation-deficient settings. We also compare our results against some common DA techniques, namely the extended Kalman filter (EKF), the ensemble Kalman filter (EnKF), the deterministic ensemble Kalman filter (DEnKF), and a simple forward nudging method. Our LSTM-DA framework is similar to the methodology proposed by Zhu et al., in which a fully connected neural network was used to learn the uncertainty in the mathematical model arising from linearization, discretization, and model reduction. The difference in our framework is that we employ an LSTM neural network to learn the nudging correction term, in order to cure the discrepancy between prior predictions and measurements that might arise from inaccurate initial conditions, boundary conditions, or model parameters.

The rest of the manuscript is organized as follows. In Section 2, we describe three of the most common nonlinear filtering techniques, which serve as benchmarks for our framework. In particular, we briefly outline the extended Kalman filter, which is a first-order adaptation of the standard Kalman filter to nonlinear models. We then introduce the ensemble Kalman filter and its deterministic version as reduced-rank variants of nonlinear filters. In Section 3, we discuss the nudging method as a simple alternative to nonlinear filters, which is then extended as the basis for our proposed LSTM-DA framework in Section 4. We define the DA set-up using the Lorenz 96 system in Section 5. We then present our results in Section 6, along with discussions and comparisons using different sets of historical data and observations. Finally, we draw our conclusions and discuss the limitations and potential extensions of the present study in Section 7.

Nonlinear filtering

The central goal of DA is to extract information from observational data to correct dynamical models and improve their predictions. Different approaches are widely used in DA, such as variational methods like 4D-Var and stochastic methods like ensemble filters. Several textbooks on data assimilation offer thorough explanations and discussions of these methods.

In this section, we discuss the sequential data assimilation problem and then outline the algorithmic procedure for the extended Kalman filter (EKF) and the ensemble Kalman filter (EnKF). The complete derivation of the Kalman filter and its variants can be found in the literature.

For demonstration, we consider the dynamical system whose evolution is governed by

We will use the notation $\widehat{\mathbf{x}}_{k}$ to denote an analyzed state of the system at time $t_{k}$ when all of the observations up to and including time $t_{k}$ are used in determining the state of the system. When all the observations before (but not including) time $t_{k}$ are utilized for estimating the state of the system, then we call it the forecast estimate and denote it as $\mathbf{x}^{f}_{k}$ . We use the notation $\mathbf{P}_{k}$ to denote the error covariance matrix. The error covariance matrix for the state vector $\mathbf{x}_{k}$ is defined as

where $\text{E}[\cdot]$ denotes the expected value. We use $\widehat{\mathbf{P}}_{k}$ to denote the error covariance for an analyzed state $\widehat{\mathbf{x}}_{k}$ and $\mathbf{P}^{f}_{k}$ denotes the error covariance for the forecast estimate $\mathbf{x}^{f}_{k}$ .
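For reference, in standard sequential DA notation (an assumption here: the symbols $M$, $h$, $\mathbf{w}_{k}$, and $\mathbf{v}_{k}$ are not defined in the surrounding text, and the covariance is written about the mean of the state), the dynamical model, the observation model, and the error covariance are commonly written as:

```latex
\begin{align}
\mathbf{x}_{k+1} &= M(\mathbf{x}_{k}) + \mathbf{w}_{k+1}, \\
\mathbf{z}_{k} &= h(\mathbf{x}_{k}) + \mathbf{v}_{k}, \\
\mathbf{P}_{k} &= \text{E}\left[(\mathbf{x}_{k}-\text{E}[\mathbf{x}_{k}])(\mathbf{x}_{k}-\text{E}[\mathbf{x}_{k}])^{T}\right],
\end{align}
```

where $\mathbf{w}_{k}$ and $\mathbf{v}_{k}$ would denote the model and observation noise, respectively.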

We first outline the algorithm for extended Kalman filter (EKF) and then discuss in detail its important steps. The procedure for the EKF is summarized in Algorithm 1.
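A minimal sketch of one EKF forecast and analysis cycle may help fix ideas. It follows the textbook form of the filter rather than reproducing Algorithm 1, and it assumes a linear observation operator `H`; the names `ekf_step`, `model`, `jac_model`, `Q`, and `R` are illustrative and not taken from the text.

```python
import numpy as np

def ekf_step(x_a, P_a, z, model, jac_model, H, Q, R):
    """One textbook EKF cycle (illustrative sketch, not Algorithm 1 verbatim).

    x_a, P_a : previous analysis state (n,) and its error covariance (n, n)
    z        : observation vector (m,) at the new time
    model    : nonlinear forecast map x -> x_next
    jac_model: function returning the Jacobian of `model` at x, shape (n, n)
    H        : linear observation operator (m, n), an assumption of this sketch
    Q, R     : model and observation noise covariances
    """
    # Forecast step: propagate the state with the nonlinear model and the
    # covariance with the model linearized about the previous analysis.
    x_f = model(x_a)
    D = jac_model(x_a)
    P_f = D @ P_a @ D.T + Q

    # Kalman gain K = P_f H^T (H P_f H^T + R)^{-1}.
    S = H @ P_f @ H.T + R
    K = P_f @ H.T @ np.linalg.inv(S)

    # Analysis step: correct the forecast with the innovation.
    x_new = x_f + K @ (z - H @ x_f)
    P_new = (np.eye(x_f.size) - K @ H) @ P_f
    return x_new, P_new
```

The covariance propagation `D @ P_a @ D.T` is the part that scales poorly with the state dimension, which is one motivation for the ensemble variants discussed next.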

Ensemble Kalman filter

When the system is high-dimensional, i.e., $n$ is very large, then the computations for the EKF algorithm are practically infeasible. In addition, the EKF algorithm requires computation of Jacobians and it might be numerically difficult to compute Jacobians for complex models. Ensemble filtering techniques are attractive for such systems where the approximate state of the system is estimated using the standard Monte Carlo framework.

In the EKF, the mean estimate of the state $\widehat{\mathbf{x}}_{k}$ and the error covariance matrix $\widehat{\mathbf{P}}_{k}$ are updated sequentially. In contrast to the EKF algorithm, the EnKF algorithm applies the forecast step to an ensemble of states. The sample mean and covariance of the ensemble analyses represent the analyzed state estimate $\widehat{\mathbf{x}}_{k}$ and error covariance matrix $\widehat{\mathbf{P}}_{k}$.

Let $x_{0}$ be an initial condition drawn from the Gaussian distribution with mean $\mathbf{m}_{0}$ and covariance matrix $\mathbf{P}_{0}$, i.e., $x_{0}\sim{\cal{N}}(\mathbf{m}_{0},\mathbf{P}_{0})$. In our notation, $\mathbf{X}_{k}(i)$ denotes the $i^{\text{th}}$ ensemble member and $N$ is the ensemble size, i.e., $i=1,2,\dots,N$. The procedure for the EnKF is summarized in Algorithm 2. We initialize the state of the system for all ensemble members from the known distribution of the initial condition, as given in Equation 11. Then, we forecast the state of the system for all ensemble members between two observation points (i.e., from time $t_{k}$ to $t_{k+1}$) using the nonlinear model dynamics, as given in Equation 13. The forecast state estimate and error covariance are calculated from the sample mean and sample covariance of the ensemble, as given in Equation 14 and Equation 16, respectively.

Once the observations $\mathbf{z}_{k+1}$ are available at time $t_{k+1}$, we create $N$ different virtual observations using Equation 17. In the original formulation of the EnKF algorithm proposed by Evensen, virtual observations were not used in the assimilation step. However, Burgers et al. showed that it is essential to add random perturbations to the observations to ensure that the analyzed covariance is not underestimated. Once the virtual observations are generated, the forecast state estimates of all ensemble members are assimilated using Equation 18. The Kalman gain $\mathbf{K}$ is computed using the same formula as in the EKF algorithm. The analysis state estimate is calculated as the sample mean of the analyzed state estimates of all ensemble members, as given in Equation 20.
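The analysis step described above can be sketched as follows. This is an illustrative version of the perturbed-observation EnKF under the assumption of a linear observation operator `H`; the function name `enkf_analysis` and the array layout (state dimension by ensemble size) are choices made for this sketch, not taken from Algorithm 2.

```python
import numpy as np

def enkf_analysis(X_f, z, H, R, rng):
    """Perturbed-observation EnKF analysis (illustrative sketch).

    X_f : forecast ensemble, shape (n, N), one column per member
    z   : observation vector, shape (m,)
    H   : linear observation operator (m, n), an assumption of this sketch
    R   : observation noise covariance (m, m)
    rng : numpy.random.Generator used to draw the virtual observations
    """
    n, N = X_f.shape

    # Sample mean and anomalies give the forecast error covariance.
    x_mean = X_f.mean(axis=1, keepdims=True)
    A = X_f - x_mean
    P_f = A @ A.T / (N - 1)

    # Same Kalman gain formula as in the EKF.
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)

    # Virtual observations: one randomly perturbed copy of z per member.
    noise = rng.multivariate_normal(np.zeros(z.size), R, size=N).T  # (m, N)
    Z = z[:, None] + noise

    # Assimilate each member against its own virtual observation.
    X_a = X_f + K @ (Z - H @ X_f)
    return X_a, X_a.mean(axis=1)
```

For large state dimensions one would avoid forming `P_f` explicitly; it is kept here only because it makes the correspondence with the EKF gain easy to see.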

Deterministic ensemble Kalman filter

Sakov et al. proposed a modification of the traditional EnKF that results in matching the analyzed error covariance to that of the standard Kalman filter without the need for virtual observations.

The procedure for the deterministic EnKF (DEnKF) is summarized in Algorithm 3. In practice (e.g., when $n\gg N$), we compute Equation 28 using its square root version (without storing or computing $\mathbf{P}^{f}_{k+1}$ explicitly) as follows

In other words, we skip computing Equation 26, and use its reduced-rank square root definition given by

We start the DEnKF algorithm in the same manner as the EnKF algorithm, by initializing the state estimate for all ensemble members using Equation 21. The anomalies between the forecast estimates of all ensemble members and their sample mean are computed using Equation 25. Once the observations are available at time $t_{k+1}$, the forecast state estimate is assimilated as given in Equation 27, where the Kalman gain $\mathbf{K}$ is computed in the same manner as in the EKF algorithm. The anomalies of all ensemble members are updated separately with half the Kalman gain, so the analyzed anomalies are calculated using Equation 29. The analyzed state estimates of all ensemble members are obtained by offsetting the analyzed anomalies with the analyzed state estimate, as given in Equation 30.
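The DEnKF analysis described in this paragraph can be sketched in a few lines. The version below is illustrative only: it assumes a linear observation operator `H`, uses the reduced-rank products of the anomalies so that the forecast covariance is never formed, and the name `denkf_analysis` is hypothetical.

```python
import numpy as np

def denkf_analysis(X_f, z, H, R):
    """Deterministic EnKF analysis (illustrative sketch).

    X_f : forecast ensemble, shape (n, N)
    z   : observation vector, shape (m,)
    H   : linear observation operator (m, n), an assumption of this sketch
    R   : observation noise covariance (m, m)
    """
    n, N = X_f.shape

    # Forecast mean and anomalies about that mean.
    x_f = X_f.mean(axis=1)
    A_f = X_f - x_f[:, None]

    # Reduced-rank products: P_f H^T and H P_f H^T from the anomalies only.
    HA = H @ A_f
    PHt = A_f @ HA.T / (N - 1)
    S = HA @ HA.T / (N - 1) + R
    K = PHt @ np.linalg.inv(S)

    # Mean is updated with the full gain, no virtual observations needed.
    x_a = x_f + K @ (z - H @ x_f)

    # Anomalies are updated separately with half the Kalman gain.
    A_a = A_f - 0.5 * K @ HA

    # Offset the analyzed anomalies by the analyzed mean.
    X_a = A_a + x_a[:, None]
    return X_a, x_a
```

Because the anomalies have zero sample mean before and after the update, the mean of the returned ensemble coincides with `x_a`.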

Nudging dynamics

Nudging is another data assimilation method, introduced by Anthes for the initialization of hurricane models from real observational data. In contrast to variational and sequential data assimilation methods, which minimize a cost function based on the error between the model forecast and observations, nudging methods use the forecast error as a constraint on the model evolution equation. The evolution of the dynamical system based on nudging methods can be written as
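A minimal sketch of one common discrete form of this evolution is given here: the model forecast is corrected by a term proportional to the forecast error. The linear observation operator `H`, the gain matrix `G`, and the function name `nudging_step` are assumptions of this sketch and may differ from the notation of the original equation.

```python
import numpy as np

def nudging_step(x, z, model, H, G):
    """One forward nudging update (illustrative sketch).

    x     : current state estimate, shape (n,)
    z     : observation at the new time, shape (m,)
    model : forecast map x -> x_next
    H     : linear observation operator (m, n), an assumption of this sketch
    G     : nudging (weighting) matrix, shape (n, m), chosen by the user
    """
    x_f = model(x)                     # model forecast
    forecast_error = z - H @ x_f       # mismatch at observed locations
    return x_f + G @ forecast_error    # linear superposition of the two
```

With `G` set to zero the update reduces to the free model run, which makes the role of the weighting matrix explicit.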

Spectral nudging is another technique, in which the nudging term is added in the spectral domain with maximum efficiency for large scales and no effect on small scales. This method has been successfully applied to force large-scale atmospheric states from global climate models onto a regional climate model. The main idea behind spectral nudging is that small-scale details in weather prediction are governed by the interplay between the larger-scale atmospheric flow and geographic features such as mountains and the land-sea distribution. It is computationally impractical to resolve these small scales in global climate models. Therefore, spectral nudging is applied to match the overlapping scales of global and regional climate models by forcing the regional model to behave like the global model. The spectral nudging method has also been applied to infer flow parameters for turbulent flows and to three-dimensional homogeneous isotropic turbulence. There are also nudging methods that use both present and past observations in the formulation of the forcing term to drive the model evolution toward the observations. An et al. used a time-delayed nudging method to estimate the state of a geophysical system from sparse observation data.

Long short-term memory nudging

With the huge amount of data generated from high-fidelity numerical simulations, non-invasive experimental techniques like particle image velocimetry (PIV), and satellites, there is growing interest in using machine learning for data assimilation. One of the difficulties in weather and climate prediction is that atmospheric flows are multiscale in nature and their dynamics are typically chaotic. Several data-driven algorithms can address these challenges. Recurrent neural networks (RNNs) are particularly attractive for complex dynamical systems because they capture temporal dependencies and take the state history into account when predicting the future state. One problem with standard RNNs is that the gradient can vanish during training. Long short-term memory is a type of RNN that alleviates the vanishing gradient issue by employing a cell architecture that selectively remembers or forgets information.
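To make the remember-or-forget mechanism concrete, a minimal sketch of a single step of a standard LSTM cell is shown here. It is the textbook formulation with sigmoid gates and tanh activations, not necessarily the configuration used in this study; the stacked weight layout and the gate ordering are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell (illustrative sketch).

    x      : input at the current time, shape (d,)
    h_prev : previous hidden state, shape (hidden,)
    c_prev : previous cell state, shape (hidden,)
    W, U, b: stacked input weights (4*hidden, d), recurrent weights
             (4*hidden, hidden), and biases (4*hidden,), assumed to be
             ordered as input gate, forget gate, candidate, output gate
    """
    hidden = h_prev.shape[0]
    a = W @ x + U @ h_prev + b

    i = sigmoid(a[0:hidden])               # input gate: how much to write
    f = sigmoid(a[hidden:2 * hidden])      # forget gate: how much to keep
    g = np.tanh(a[2 * hidden:3 * hidden])  # candidate cell update
    o = sigmoid(a[3 * hidden:4 * hidden])  # output gate: how much to expose

    # The additive cell update lets information persist when f is near one,
    # which is what mitigates the vanishing gradient.
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c
```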

There is a rich literature on the application of LSTM to modeling chaotic dynamical systems. Vlachas et al. proposed a data-driven forecasting method for high-dimensional chaotic systems that models their temporal dynamics in a reduced order space using LSTM. They also integrated the LSTM with a mean stochastic model to ensure convergence and demonstrated improved prediction performance compared to Gaussian processes. In Wan et al., the LSTM was employed to learn the mismatch between an imperfect Galerkin-based reduced order model and the actual dynamics projected onto the reduced order space. They showed improved performance of the proposed framework for the prediction of extreme events. Jia et al. introduced a physics-guided RNN that combines the LSTM with a physics-based model to describe the dynamics of temperature in lakes. They added a physics-based regularization as a penalty term in the optimization cost function to enforce physics during training. Apart from LSTM, other machine learning algorithms have been used, such as reservoir computing for modeling chaotic dynamical systems and residual networks for predicting dynamical system evolution. In a recent study, Vlachas et al. investigated the performance of LSTM trained with backpropagation through time and reservoir computing for long-term forecasting of chaotic dynamical systems.

Zhang et al. presented an LSTM-based Kalman filter for data assimilation of two-dimensional spatio-temporal varying depth of ocean field for underwater glider path planning. In their study, the temporal evolution of the spatial basis functions was modeled using LSTM: the network was trained to predict future temporal coefficients from the historical states of these coefficients. Jin et al. utilized LSTM to perform observation bias correction for data assimilation in dust storm prediction. They showed that, with the LSTM model for bias correction, existing measurements are used more precisely, which improves the resulting prediction accuracy. In the work by Loh et al., the LSTM was deployed as the prediction model in their EnKF approach to achieve real-time production forecasts for natural gas wells. Xingjian et al. proposed a convolutional LSTM framework to predict rainfall intensity over a short period of time and illustrated its ability to capture correlations better than existing methods.

We highlight some features of the LSTM nudging framework here. The framework is highly modular and can be implemented with other types of neural network architectures, depending on the size or type of the problem. For example, convolutional autoencoders are gaining popularity for finding nonlinear basis functions of complex physical systems, and they are complemented with an LSTM network for learning the latent-space dynamics. The LSTM nudging framework can be applied to high-dimensional systems by employing convolutional autoencoders for dimensionality reduction and training the LSTM to learn the nudging dynamics in the latent space instead of the high-dimensional space. Novel neural network architectures like generative adversarial networks (GANs) can also be applied to learn the nudging dynamics. Another feature of the LSTM nudging scheme is that, once the network is trained with archival or background data, it can be retrained efficiently with transfer learning as new observation data become available. Therefore, training the LSTM network for the first time is the only computationally heavy part of the LSTM nudging scheme. Our main goal in this study is to illustrate that a neural network can be effectively trained to provide accurate and stable nudging dynamics.

Data assimilation problem set-up

In this section, we describe the Lorenz 96 model proposed by Lorenz, which is commonly used as a prototypical test case in data assimilation. This model describes the temporal evolution of an atmospheric quantity discretized spatially over a single latitude circle. The system of ordinary differential equations governing the Lorenz 96 model can be written as

for $i\in\{1,2,\dots,n\}$. The first term on the right-hand side of Equation 47 is the nonlinear advection term, the second term represents internal dissipation, and the third term represents external forcing. We use $n=40$ and $F=10$ in our analysis. We apply periodic boundary conditions at the ghost points, i.e., $u_{0}=u_{n}$, $u_{-1}=u_{n-1}$, and $u_{n+1}=u_{1}$.

We use the fourth-order Runge-Kutta scheme for time integration with a time step of $\Delta t=0.005$. To generate a physical initial condition for the forward run, we start from an equilibrium condition at time $t=-5$. The equilibrium condition for the model is $u_{i}=F$ for $i\in\{1,2,\dots,n\}$. We introduce a very small perturbation to the equilibrium state at $u_{20}$, i.e., we set $u_{20}=F+0.01$, to generate chaotic dynamics, and then integrate in time up to $t=0$. Once the true initial condition is generated, we run the forward solver up to time $t=10$.
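The set-up described above can be sketched compactly. The code assumes the standard form of the Lorenz 96 model, $du_{i}/dt=(u_{i+1}-u_{i-2})u_{i-1}-u_{i}+F$, which matches the three terms and the periodic ghost points described in the text; the function names are illustrative.

```python
import numpy as np

def lorenz96_rhs(u, F=10.0):
    """Right-hand side (u_{i+1} - u_{i-2}) u_{i-1} - u_i + F.

    np.roll wraps the indices around, which enforces the periodic
    boundary conditions without explicit ghost points.
    """
    return (np.roll(u, -1) - np.roll(u, 2)) * np.roll(u, 1) - u + F

def rk4_step(u, dt, F=10.0):
    """One classical fourth-order Runge-Kutta step."""
    k1 = lorenz96_rhs(u, F)
    k2 = lorenz96_rhs(u + 0.5 * dt * k1, F)
    k3 = lorenz96_rhs(u + 0.5 * dt * k2, F)
    k4 = lorenz96_rhs(u + dt * k3, F)
    return u + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0

def spin_up(n=40, F=10.0, dt=0.005, t_start=-5.0, t_end=0.0):
    """Integrate from the perturbed equilibrium at t_start up to t_end."""
    u = np.full(n, F)
    u[19] += 0.01  # perturb u_20 (zero-based index 19)
    steps = int(round((t_end - t_start) / dt))
    for _ in range(steps):
        u = rk4_step(u, dt, F)
    return u

def forward_run(u0, t_final=10.0, dt=0.005, F=10.0):
    """Store the full trajectory from t = 0 to t_final, shape (steps + 1, n)."""
    steps = int(round(t_final / dt))
    traj = np.empty((steps + 1, u0.size))
    traj[0] = u0
    for k in range(steps):
        traj[k + 1] = rk4_step(traj[k], dt, F)
    return traj
```

Calling `forward_run(spin_up())` would produce a reference trajectory of the kind used as the truth in the twin experiments.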

The twin experiment is one of the most commonly used methods to validate a data assimilation algorithm before it is applied to real-life applications. For twin experiments, we first generate the $n$-dimensional data for the Lorenz 96 model and select $m$ observations. These observations are obtained by adding noise to the true state of the system to account for experimental uncertainties and measurement error. The observations are also sparse in time, meaning that the time interval between two observations can differ from the time step of the model. For our twin experiments, we assume that observations are recorded at every $10^{\text{th}}$ time step of the model. Therefore, the time difference between two observations is $\delta t=0.05$. The analysis time step $\delta t=0.05$ is representative of a six-hour data assimilation cycle in global meteorological models. The accurate estimation of the full state of the system depends on the number of observations that are assimilated by the model. We assume that the observation locations are fixed in time, unlike asynchronous observations, where they can be rotated. We compare the performance of traditional data assimilation algorithms and the proposed LSTM nudging algorithm for three sets of observations. The first set of observations is very sparse, with only 10% of the full state of the system (i.e., $m=4$), utilizing observations for states $[u_{10},u_{20},u_{30},u_{40}]\in R^{4}$. In the second set of observations ($m=8$), we employ observations at $[u_{5},u_{10},\dots,u_{40}]\in R^{8}$ for the assimilation. The third set of observations consists of 50% of the full state of the system ($m=20$), i.e., observations at states $[u_{2},u_{4},\dots,u_{40}]\in R^{20}$ for the assimilation.
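The three observation sets and the sampling in time can be sketched as follows. The helper names are hypothetical, the observation noise is taken to be Gaussian with a user-supplied variance, and the choice to start sampling at the tenth model step rather than at $t=0$ is an assumption of this sketch.

```python
import numpy as np

def observation_indices(n, m):
    """Zero-based indices of m equally spaced observed states out of n.

    For n = 40 this gives u_10, u_20, u_30, u_40 when m = 4,
    u_5, u_10, ..., u_40 when m = 8, and u_2, u_4, ..., u_40 when m = 20.
    """
    stride = n // m
    return np.arange(stride, n + 1, stride) - 1

def observation_operator(n, m):
    """Linear operator H (m, n) that selects the observed states."""
    return np.eye(n)[observation_indices(n, m)]

def make_observations(truth, m, variance, rng, every=10):
    """Noisy, sparse observations from a true trajectory.

    truth    : array of shape (steps + 1, n), one row per model time step
    variance : observation noise variance
    every    : observations are recorded at every `every`-th model step
    """
    H = observation_operator(truth.shape[1], m)
    snapshots = truth[every::every]  # assumption: no observation at step 0
    noise = rng.normal(0.0, np.sqrt(variance), size=(snapshots.shape[0], m))
    return snapshots @ H.T + noise
```

With a model time step of 0.005 and `every=10`, consecutive rows of the returned array are separated by 0.05 time units, which is the analysis interval used in the text.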

Results

In this section, we describe the results of numerical experiments with the Lorenz 96 model using the algorithms discussed in Sections 2, 3, and 4. We assume that our model is perfect for all numerical experiments except those with the EKF and EnKF algorithms. For these two algorithms, we found that introducing a small uncertainty in the model provides more accurate predictions than assuming a perfect model, so we assume that the model noise is drawn from a Gaussian distribution with zero mean and variance $1\times 10^{-4}$. The observations are created by adding random noise from a Gaussian distribution with zero mean and variance $1\times 10^{-2}$ to the true state of the system. The erroneous initial condition is generated by adding noise from a Gaussian distribution with zero mean and variance $1\times 10^{-2}$ to the true initial condition. To ensure a fair comparison between EnKF and DEnKF, we use an equal number of ensemble members in both algorithms. For the comparison, we plot the time evolution of states $u_{10}$, $u_{21}$, and $u_{39}$, and also the full state trajectory of the Lorenz 96 model. We use black lines to denote true states, dashed blue lines to denote states obtained with the erroneous initial condition, and dashed-dotted green lines for assimilated states. The observations for the state $u_{10}$ are shown with red circles in all of the time series plots.

In Figure 2, we present the time evolution of selected states for three different numbers of observations included in the assimilation with the EKF algorithm. There is excellent agreement between the true and assimilated states $u_{21}$ and $u_{39}$ when at least 20% of the states are observed. We also provide the full state trajectory of the Lorenz 96 model in Figure 3. The results clearly show that the EKF algorithm can determine the correct state trajectory with at least 20% observations, i.e., for $m\geq 8$. We observe a discrepancy in the prediction after $t\sim 7$ when only four observations are used in the assimilation step. Figure 4 shows the time evolution of selected states predicted by the EnKF algorithm with $N=40$ ensemble members. We notice some discrepancy between the true and predicted states with $m=12$ observations after $t\sim 7.5$. If we compare the full state trajectory predicted by the EnKF algorithm in Figure 5, we can conclude that there is an almost perfect match between the true and assimilated states with more than 8 observations. Since the EnKF algorithm is based on the Monte Carlo framework, its accuracy can be improved by increasing the number of ensemble members. The typical ensemble size is $O(100)$ for high-dimensional systems. Considering that the Lorenz 96 model is a lower-dimensional system with $n=40$ states, we apply only 40 ensemble members. Regarding the computational cost of the EKF algorithm, the major bottleneck is the propagation of the error covariance matrix, as given in Equation 7. The computational overhead of the EnKF algorithm grows with the number of ensemble members. However, with advances in parallel algorithms and high-performance computing, ensemble Kalman filter algorithms are particularly attractive for data assimilation of complex physical systems.

As we observed in Figure 5, the use of virtual observations in the EnKF algorithm leads to suboptimal performance when few observations are assimilated with a small ensemble. The EnKF solution converges towards the true solution as the number of ensemble members increases. The DEnKF algorithm is the deterministic version of the EnKF algorithm, in which no virtual observations are used. Instead, the DEnKF algorithm updates the ensemble mean with the standard analysis equation, and the ensemble anomalies are updated separately with half the Kalman gain in the same equation. In Figure 6, we illustrate the time evolution of selected states for different percentages of observations used in the assimilation step. We notice that even with just four observations, the DEnKF algorithm is able to correct the erroneous states up to the final time $t=10$. From the results depicted in Figure 7, we can deduce that the DEnKF algorithm performs better than the EnKF algorithm when the number of observations is small.

The nonlinear filtering methods discussed in Section 2 are computationally expensive and are prone to the curse of dimensionality as the resolution of the forward numerical model increases. Nudging methods, on the other hand, are computationally inexpensive and straightforward to implement. As described in Section 3, nudging is accomplished by adding a correction term to the dynamical model that is proportional to the difference between the observations and the model forecast. One of the main limitations of nudging methods is the ad hoc specification of the nudging relaxation coefficient; it is not clear how to choose this coefficient to obtain an optimal solution. Here, we demonstrate how the choice of the nudging coefficient affects the prediction when 20 observations are available for assimilation. Since the nudging coefficient represents a relaxation time scale, we express it as a function of the time step of the model and keep it constant throughout the time integration, i.e., $G_{k}=\tau$, where $\tau$ is a function of the time step of the model. Figure 8 displays the time evolution of selected states for three different values of the nudging coefficient. We notice that for a higher value of $\tau$, the nudging method is not able to correct the model forecast accurately. Figure 9 provides the full state trajectory of the Lorenz 96 model for the different nudging coefficient matrices. We observe that the error is sufficiently low at $\tau=100\Delta t$, although the prediction can be further refined by fine-tuning the nudging coefficient matrix.

Figure 10 and Figure 11 present the time evolution of selected states and of the full Lorenz 96 system for different numbers of observations with $\tau=150\Delta t$. We can easily see that the prediction capability of the nudging scheme is poor when fewer observations are available for the assimilation. The performance of the nudging scheme can indeed be improved by the optimal specification of the nudging coefficient, or by using the back and forth nudging algorithm. However, computing the optimal nudging coefficient involves obtaining an adjoint model and solving a constrained minimization problem. Also, the back and forth nudging algorithm requires $O(10)$ iterations for convergence, so its computational cost will be large for high-dimensional systems. Therefore, machine learning algorithms, which are successful in finding nonlinear mappings between two quantities, can be exploited to learn the nudging dynamics.

Now, we describe the results of numerical experiments with the LSTM nudging scheme described in Section 4. For a fair comparison with the EnKF and DEnKF algorithms, we use the data generated from $N=40$ perturbed initial conditions to train the LSTM network. These perturbed initial conditions are created by adding noise from a Gaussian distribution with zero mean and variance $1\times 10^{-2}$ to the erroneous initial condition. The training data are obtained by integrating the model with these perturbed initial conditions from time $t=0$ to $t=10$ with $dt=5\times 10^{-3}$ and then storing the states at all times where observations are present. Therefore, there will be 40,000 samples available for training the LSTM network. The LSTM network is trained using the procedure described in Algorithm 4. We use a fairly simple LSTM architecture with two hidden layers consisting of 80 LSTM cells each, and train the network for 2500 epochs. We apply the ReLU activation function and the Adam optimizer. We found that our training is not highly sensitive to the neural network hyperparameters, and a similar level of accuracy can be achieved with other sets of hyperparameters. Figure 12 presents the time evolution of selected states for three different numbers of observations. We see that the LSTM network has learned the mapping from the input data to the correction term and is able to produce the correct trajectory even for those states for which observations are not available. In Figure 13, we provide the full state trajectory of the Lorenz 96 model for the LSTM nudging method. With 20% observations, we obtain a level of accuracy comparable to that of the nonlinear filtering algorithms.

From our analysis of numerical experiments with three sets of observations, we can conclude that the LSTM network can learn the nudging dynamics efficiently. Two other questions that we want to investigate in this study are: how sparse can the observations be for an accurate prediction, and how much training data is required to train the network effectively? Figure 14 displays the time evolution of selected states for the LSTM nudging scheme with very sparse observations, i.e., $m=2,3,$ and $4$, and we observe a large discrepancy between the true and predicted states with less than 10% observations. Figure 15 reports the full state trajectory for the Lorenz 96 model with very sparse observations. The results in Figure 14 and Figure 15 suggest that at least 10% observations are necessary to produce a correct prediction with low error. We point out here that we utilized the data created from only 40 perturbed initial conditions for training, and it is well known that the performance of a neural network can be improved by training with more data.

In Figure 16 and Figure 17, we illustrate the improvement in prediction for highly sparse observations as the amount of data employed for training the LSTM network is increased. For conciseness, we show only the error plot (the difference between true and predicted states). We can easily observe that the error is larger for the EnKF and DEnKF algorithms than for the LSTM nudging scheme when only two or three observations are available for assimilation. When four observations are present, we see a similar level of accuracy for the EnKF, DEnKF, and LSTM nudging methods. If we compare the error in Figure 16 and Figure 17, there is an improvement in the prediction as we increase the training data. The results presented in Figure 16 and Figure 17 are obtained by utilizing $N=200$ and $N=400$ ensemble members, respectively, for the EnKF and DEnKF algorithms. The same number of perturbed initial conditions is also used for training the LSTM network. Therefore, in terms of computational cost, all three methods can be considered equivalent because the same number of forward numerical models is integrated from the initial time to the final time for all three methods. In terms of storage, the LSTM nudging scheme is more demanding, as training requires storing the full state for all training sets (i.e., perturbed initial conditions) at all observation points. By contrast, the EnKF and DEnKF algorithms do not need to store the solutions of all ensemble members. This limitation can be addressed by transfer learning, where the weights and biases of the neural network are updated by training its last few layers with new data. Thus, although training the LSTM network for the first time is a computationally intensive task, the network can subsequently be retrained as new observations become available.

Conclusions

In the present study, we introduced the LSTM nudging scheme that learns the nudging dynamics from the full state of the system and partial observations. We illustrate the approach for the Lorenz 96 system and compare its performance against the extended Kalman filter (EKF), ensemble Kalman filter (EnKF), and deterministic ensemble Kalman filter (DEnKF) approaches. We consider different aspects of the LSTM nudging scheme, such as the sparsity of observations and the amount of data available for training the LSTM network. We successfully demonstrate that the LSTM network can be trained to learn the nudging dynamics with extremely sparse observations, provided there is a large amount of training data. In terms of computational overhead, training the neural network is the most demanding task. However, this is a one-time task, and future observations can be incorporated by retraining the neural network with transfer learning at a much lower computational cost.

The results of our numerical experiments with the LSTM nudging scheme indicate its potential for assimilating very sparse observations. Another benefit is that it requires no matrix operations such as the Kalman gain calculation. One important caveat of the LSTM nudging scheme is that neural networks are data-hungry, and hence a large amount of archival or background data will be necessary to train the neural network. The suitability of the LSTM nudging scheme for DA problems is summarized in Figure 18, where the DA problems are classified based on the sparsity of observations and the amount of archival background information. The LSTM nudging scheme is well suited for problems where observations are very sparse and archival background information is available (i.e., type I problems). Another limitation is that the training procedure in its present form will not be feasible for very high-dimensional systems. One way to address this constraint is to utilize reduced order modeling (ROM) approaches for dimensionality reduction; recently, machine learning methods have been found to give accurate, stable, and robust ROMs for physical systems. Since the LSTM nudging scheme is flexible, we foresee that this approach can be extended to large-scale systems by blending it with ROM approaches. One more limitation of the LSTM nudging method is that it does not predict the uncertainty in the analyzed states.

We re-emphasize here that the significance of the proposed LSTM nudging method on the prototype model does not mean that it can be directly extended to higher-dimensional and more complex problems. In this work, we assumed that the model is perfect and the noise is Gaussian, which is a very idealized condition. In actual scenarios, real weather forecast models are approximate and contain many parameterizations for subgrid-scale processes. Therefore, the results of the numerical experiments presented in this study should be viewed as early findings, and substantial future work is required to demonstrate the proposed method in a realistic situation. As a part of future studies, we plan to illustrate the LSTM nudging method for a two-dimensional quasi-geostrophic model, using a convolutional autoencoder for dimensionality reduction. Neural networks have also been shown to be capable of discovering hidden information about the physical processes embedded in the data, and we will integrate these methods with the LSTM nudging scheme for imperfect models.

Acknowledgement

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research under Award Number DE-SC0019290. O.S. gratefully acknowledges their support.

Disclaimer. This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.

Appendix A Jacobian of the model and observation matrix

We apply a fourth-order Runge-Kutta (RK4) numerical scheme for temporal integration of the Lorenz 96 model, and it can be written as follows
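For an autonomous system $\dot{\mathbf{x}}=\mathbf{f}(\mathbf{x})$, the classical RK4 update takes the standard form given here. The time index $k$, the step size $\Delta t$, and the stage vectors $\mathbf{r}_{1},\dots,\mathbf{r}_{4}$ are notation assumed for this illustration.

$$
\begin{aligned}
\mathbf{r}_{1} &= \mathbf{f}\left(\mathbf{x}^{k}\right), \\
\mathbf{r}_{2} &= \mathbf{f}\left(\mathbf{x}^{k}+\tfrac{\Delta t}{2}\,\mathbf{r}_{1}\right), \\
\mathbf{r}_{3} &= \mathbf{f}\left(\mathbf{x}^{k}+\tfrac{\Delta t}{2}\,\mathbf{r}_{2}\right), \\
\mathbf{r}_{4} &= \mathbf{f}\left(\mathbf{x}^{k}+\Delta t\,\mathbf{r}_{3}\right), \\
\mathbf{x}^{k+1} &= \mathbf{x}^{k}+\tfrac{\Delta t}{6}\left(\mathbf{r}_{1}+2\mathbf{r}_{2}+2\mathbf{r}_{3}+\mathbf{r}_{4}\right).
\end{aligned}
$$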

The function $\mathbf{f}$ is the right-hand side of the Lorenz 96 model, and in discrete form it can be written as
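In its standard form, with a constant forcing $F$ and periodic indexing (the symbols $F$ and $x_{i}$ are notation assumed for this illustration), the $i$th component of the right-hand side reads

$$
f_{i}(\mathbf{x}) = \left(x_{i+1}-x_{i-2}\right)x_{i-1}-x_{i}+F, \qquad i=1,\dots,n,
$$

with $x_{-1}=x_{n-1}$, $x_{0}=x_{n}$, and $x_{n+1}=x_{1}$.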

The Jacobian of the function $\mathbf{f}$ is defined as follows

where $1\leq i\leq m$ and $1\leq j\leq n$. Since we use linear observations, $\mathbf{D_{h}}$ will be a constant sparse matrix. Each row of the matrix $\mathbf{D_{h}}$ will consist of all zeros except for the corresponding observation location, where it will have the value of one.
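As an illustration of the two matrices discussed in this appendix, the following NumPy sketch assembles the Jacobian of the standard Lorenz 96 right-hand side and a linear observation matrix of the kind described above. The function names, the forcing value of 8, and the zero-based observation indices are assumptions made for the example, not details taken from the original implementation.

```python
import numpy as np


def lorenz96_rhs(x, forcing=8.0):
    """Standard Lorenz 96 right-hand side with periodic indexing (forcing assumed)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing


def lorenz96_jacobian(x):
    """Jacobian d f_i / d x_j of the Lorenz 96 right-hand side (requires n >= 4)."""
    n = x.size
    idx = np.arange(n)
    jac = np.zeros((n, n))
    # d f_i / d x_{i-2} = -x_{i-1}
    jac[idx, (idx - 2) % n] = -np.roll(x, 1)
    # d f_i / d x_{i-1} = x_{i+1} - x_{i-2}
    jac[idx, (idx - 1) % n] = np.roll(x, -1) - np.roll(x, 2)
    # d f_i / d x_i = -1
    jac[idx, idx] = -1.0
    # d f_i / d x_{i+1} = x_{i-1}
    jac[idx, (idx + 1) % n] = np.roll(x, 1)
    return jac


def observation_matrix(n, obs_idx):
    """Constant m-by-n matrix with a single one per row at each observed state."""
    obs_idx = np.asarray(obs_idx)
    d_h = np.zeros((obs_idx.size, n))
    d_h[np.arange(obs_idx.size), obs_idx] = 1.0
    return d_h
```

Multiplying `observation_matrix(n, obs_idx)` by a state vector simply extracts the observed components, which is why the matrix is constant and sparse for linear observations.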