Machine Learning of Linear Differential Equations using Gaussian Processes

Maziar Raissi, George Em. Karniadakis

Introduction

A grand challenge with great opportunities facing researchers is to develop a coherent framework that enables scientists to blend conservation laws expressed by differential equations with the vast data sets available in many fields of engineering, science and technology. In particular, this article investigates conservation laws of the form

$\mathcal{L}_{x}^{\phi}:$ $\phi{\ =\ }?<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><mi>u</mi><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow><annotation encoding="application/x-tex">u(x)</annotation></semantics></math>u(x)f(x)$ , which model the relationship between two black-box functions $u(x)$ and $f(x)$ . Here,

and $\mathcal{L}_{x}^{\phi}$ is a parametric linear operator identified by its parameters $\phi$ . Given noisy observations $\{\bm{X}_{u},\bm{y}_{u}\}$ , $\{\bm{X}_{f},\bm{y}_{f}\}$ of $u(x)$ , $f(x)$ , respectively, the aim is to learn the parameters $\phi$ and hence the governing equation which best describes the data. Such problems are ubiquitous in science, and in mathematical literature are often referred to as “inverse problems” (see ). To provide a unified framework for resolving such problems, this work employs and modifies Gaussian processes (GPs) (see ), which is a non-parametric Bayesian machine learning technique. Quoting Diaconis , “once we allow that we don’t know $f$ (and $u$ ), but do know some things, it becomes natural to take a Bayesian approach”. The Bayesian procedure adopted here, namely Gaussian processes, provides a flexible prior distribution over functions, enjoys analytical tractability, and has a fully probabilistic work-flow that returns robust posterior variance estimates which quantify uncertainty in a natural way. Moreover, Gaussian processes are among a class of methods known as kernel machines (see ) and have analogies with regularization approaches (see ). However, they are distinguished by their probabilistic viewpoint and their powerful traning procedure. Along exactly the same lines, the methodology outlined in this work is related to and yet fundamentally differentiated from the so-called “meshless” methods (see ) for differential equations. Furthermore, differential equations are the cornerstone of a diverse range of applied sciences and engineering fields. However, use within statistics and machine learning, and combination with probabilistic models is less explored. Perhaps the most significant related work in this direction is latent force models . Such models generalize latent variable models using differential equations. In sharp contrast to latent force models, this work bypasses the need for solving differential equations either analytically or numerically by placing the Gaussian process prior on $u(x)$ rather than on $f(x)$ . Additionally, equation 1 can be further motivated by considering the familiar cases where

for some kernel $\kappa$ . This particular instance corresponds to the well-known convolved Gaussian processes which are suitable for multi-output purposes . Moreover, yet another special and interesting case arises by assuming $\kappa(x-x^{\prime};\phi)=\phi\delta(x-x^{\prime})$ , with $\delta$ being the Kronecker delta, which yields

This results in the so-called recursive co-kriging model $f(x)=\phi u(x)+v(x)$ . Here, $v(x)$ is a latent function and is included to capture unexplained factors. Recursive co-kriging models can be employed to create a platform for blending multiple information sources of variable fidelity, e.g., experimental data, high-fidelity numerical simulations, expert opinion, etc. The main assumption in multi-fidelity settings is that the data on the high-fidelity function $f(x)$ are scarce since they are generated by an accurate but costly process, while the data on the low fidelity function $u(x)$ are less accurate, cheap to generate, and hence abundant.

Methodology

The proposed data-driven algorithm for learning general parametric linear equations of the form 1 employs Gaussian process priors that are tailored to the corresponding differential operators.

Specifically, the algorithm starts by making the assumption that $u(x)$ is Gaussian process with mean and covariance function $k_{uu}(x,x^{\prime};\theta)$ , i.e.,

where $\theta$ denotes the hyper-parameters of the kernel $k_{uu}$ . The key observation to make is that any linear transformation of a Gaussian process such as differentiation and integration is still a Gaussian process. Consequently,

with the following fundamental relationship between the kernels $k_{uu}$ and $k_{ff}$ ,

Moreover, the covariance between $u(x)$ and $f(x^{\prime})$ , and similarly the one between $f(x)$ and $u(x^{\prime})$ , are given by $k_{uf}(x,x^{\prime};\theta,\phi)=\mathcal{L}^{\phi}_{x^{\prime}}k_{uu}(x,x^{\prime};\theta)$ , and $k_{fu}(x,x^{\prime};\theta,\phi)=\mathcal{L}^{\phi}_{x}k_{uu}(x,x^{\prime};\theta)$ , respectively. The main contribution of this work can be best recognized by noticing how the parameters $\phi$ of the operator $\mathcal{L}_{x}^{\phi}$ are turned into hyper-paramters of the kernels $k_{ff},k_{uf},$ and $k_{fu}$ .

Without loss of generality, all Gaussian process priors used in this work are assumed to have a squared exponential covariance function, i.e.,

where $\sigma_{u}^{2}$ is a variance parameter, $x$ is a $D$ -dimensional vector that includes spatial or temporal coordinates, and $\theta=\left(\sigma_{u}^{2},(w_{d})_{d=1}^{D}\right)$ . Moreover, anisotropy across input dimensions is handled by Automatic Relevance Determination (ARD) weights $w_{d}$ . From a theoretical point of view, each kernel gives rise to a Reproducing Kernel Hilbert Space that defines a class of functions that can be represented by this kernel. In particular, the squared exponential covariance function chosen above, implies smooth approximations. More complex function classes can be accommodated by appropriately choosing kernels.

2 Training

The hyper-parameters $\theta$ and more importantly the parameters $\phi$ of the linear operator $\mathcal{L}_{x}^{\phi}$ can be trained by employing a Quasi-Newton optimizer L-BFGS to minimize the negative log marginal likelihood

where $\bm{y}=\left[\begin{array}[]{c}\bm{y}_{u}\\ \bm{y}_{f}\end{array}\right]$ , $p(\bm{y}|\phi,\theta,\sigma_{n_{u}}^{2},\sigma_{n_{f}}^{2})=\mathcal{N}\left(\bm{0},\bm{K}\right)$ , and $\bm{K}$ is given by

Here, $\sigma_{n_{u}}^{2}$ and $\sigma_{n_{f}}^{2}$ are included to capture noise in the data. The implicit underlying assumption is that $\bm{y}_{u}=u(\bm{X}_{u})+\bm{\epsilon}_{u}$ , $\bm{y}_{f}=f(\bm{X}_{f})+\bm{\epsilon}_{f}$ with $\bm{\epsilon}_{u}\sim\mathcal{N}(\bm{0},\sigma_{n_{u}}^{2}\bm{I}_{n_{u}})$ , $\bm{\epsilon}_{f}\sim\mathcal{N}(\bm{0},\sigma_{n_{f}}^{2}\bm{I}_{n_{f}})$ . Moreover, $\bm{\epsilon}_{u}$ and $\bm{\epsilon}_{f}$ are assumed to be independent. It is worth mentioning that the marginal likelihood does not simply favor the models that fit the training data best. In fact, it induces an automatic trade-off between data-fit and model complexity. This effect is called Occam’s razor after William of Occam 1285–1349 who encouraged simplicity in explanations by the principle: “plurality should not be assumed without necessity”. The flexible training procedure outlined above distinguishes Gaussian processes and consequently this work from other kernel-based methods. The most computationally intensive part of training is associated with inverting dense covariance matrices $\bm{K}$ . This scales cubically with the number of training data in $\bm{y}$ . Although this scaling is a well-known limitation of Gaussian process regression, it must be emphasize that it has been effectively addressed by the recent works of .

3 Prediction

Having trained the model, one can predict the values $u(x)$ and $f(x)$ at a new test point $x$ by writing the posterior distributions

where for notational convenience the dependence of kernels on hyper-parameters and parameters is dropped. The posterior variances $s_{u}^{2}(x)$ and $s_{f}^{2}(x)$ can be used as good indicators of how confident one could be about the estimated parameters $\phi$ of the linear operator $\mathcal{L}_{x}^{\phi}$ and predictions made based on these parameters. Furthermore, such built-in quantification of uncertainty encoded in the posterior variances is a direct consequence of the Bayesian approach adopted in this work. Although not pursued here, this information is very useful in designing a data acquisition plan, often referred to as active learning , that can be used to optimally enhance our knowledge about the parametric linear equation under consideration.

Results

The proposed algorithm provides an entirely agnostic treatment of linear operators, which can be of fundamentally different nature. For example, one can seamlessly learn parametric integro-differential, time-dependent, or even fractional equations. This generality will be demonstrated using three benchmark problems with simulated data. Moreover, the methodology will be applied to a fundamental problem in functional genomics, namely determining the structure and dynamics of genetic networks based on real expression data on early Drosophila melanogaster development.

Consider the following integro-differential equation,

Note that for $(\alpha,\beta)=(2,5)$ , the functions $u(x)=\sin(2\pi x)$ and $f(x)=2\pi\cos(2\pi x)+\frac{5}{\pi}\sin(\pi x)^{2}+2\sin(2\pi x)$ satisfy this equation. In the following, the parameters $(\alpha,\beta)$ will be infered from two types of data, namely, noise-free and noisy observations.

Assume that the noise-free data $\{\bm{x}_{u},\bm{y}_{u}\}$ , $\{\bm{x}_{f},\bm{y}_{f}\}$ on $u(x)$ , $f(x)$ are generated according to $\bm{y}_{u}=u(\bm{x}_{u})$ , $\bm{y}_{f}=f(\bm{x}_{f})$ with $\bm{x}_{u}$ , $\bm{x}_{f}$ constituting of $n_{u}=4$ , $n_{f}=3$ data points chosen at random in the interval $ $, respectively. Given these noise-free training data, the algorithm learns the parameters$ (\alpha,\beta) $to have values$ (2.012627,\ 4.977879) $. It should be emphasized that the algorithm is able to learn the parameters of the operator using only$ 7 $training data. Moreover, the resulting posterior distributions for$ u(x) $and$ f(x)$ are depicted in Figure 1(A, B). The posterior variances could be used as good indicators of how uncertain one should be about the estimated parameters and predictions made based on these parameters.

Noisy data

Consider the case where the noisy data $\{\bm{x}_{u},\bm{y}_{u}\}$ , $\{\bm{x}_{f},\bm{y}_{f}\}$ on $u(x)$ , $f(x)$ are generated according to $\bm{y}_{u}=u(\bm{x}_{u})+\bm{\epsilon}_{u}$ , $\bm{y}_{f}=f(\bm{x}_{f})+\bm{\epsilon}_{f}$ with $\bm{x}_{u}$ , $\bm{x}_{f}$ constituting of $n_{u}=14$ , $n_{f}=10$ data points chosen at random in the interval $ $, respectively. Here, the noise$ \bm{\epsilon}_{u} $and$ \bm{\epsilon}_{f} $are randomly generated according to the normal distributions$ \mathcal{N}(\bm{0},0.1^{2}\ \bm{I}_{n_{u}}) $and$ \mathcal{N}(\bm{0},0.5^{2}\ \bm{I}_{n_{f}}) $, respectively. Given these noisy training data, the algorithm learns the parameters$ (\alpha,\beta) $to have values$ (2.073054,\ 5.627249) $. It should be emphasized that for this example the data is deliberately chosen to have a sizable noise. This highlights the ability of the method to handle highly noisy observations without any modifications. The resulting posterior distributions for$ u(x) $and$ f(x)$ are depicted in Figure 1(C, D). The posterior variances not only quantify scarcity in observations but also signify noise levels in data.

2 Heat Equation

This example is chosen to highlight the capability of the proposed framework to handle time-dependent problems using only scattered space-time observations. To this end, consider the heat equation

Note that for $\alpha=1$ , the functions $f(t,x)=e^{-t}(4\pi^{2}-1)\sin(2\pi x)$ and $u(t,x)=e^{-t}\sin(2\pi x)$ satisfy this equation. Assume that the noise-free data $\{(\bm{t}_{u},\bm{x}_{u}),\bm{y}_{u}\}$ , $\{(\bm{t}_{f},\bm{x}_{f}),\bm{y}_{f}\}$ on $u(t,x)$ , $f(t,x)$ are generated according to $\bm{y}_{u}=u(\bm{t}_{u},\bm{x}_{u})$ , $\bm{y}_{f}=f(\bm{t}_{f},\bm{x}_{f})$ with $(\bm{t}_{u},\bm{x}_{u})$ , $(\bm{t}_{f},\bm{x}_{f})$ constituting of $n_{u}=n_{f}=20$ data points chosen at random in the domain $^{2}$ , respectively. Given these training data, the algorithm learns the parameter $\alpha$ to have value $0.999943$ . The resulting posterior distributions for $u(t,x)$ and $f(t,x)$ are depicted in Figure 2. A visual inspection of this figure illustrates how closely uncertainty in predictions measured by posterior variances (see Figure 2(E, F)) correlate with prediction errors (see Figure 2(C, D)). Remarkably, the proposed methodology circumvents the need for temporal discretization, and is essentially immune to any restrictions arising due to time-stepping, e.g., the fundamental consistency and stability issues in classical numerical analysis.

3 Fractional Equation

Consider the one dimensional fractional equation

where $\widehat{k}_{uu}(w,w^{\prime};\theta)$ is the Fourier transform of the kernel $k_{uu}(x,x^{\prime};\theta)$ . Similarly, one can obtain $k_{uf}(x,x^{\prime};\theta,\alpha)$ and $k_{fu}(x,x^{\prime};\theta,\alpha)$ .

Note that for $\alpha=\sqrt{2}$ , the functions $u(x)=\frac{1}{2}e^{-2i\pi x}\left(\frac{(2\pi+i)e^{4i\pi x}}{-1+(2i\pi)^{\sqrt{2}}}+\frac{2\pi-i}{-1+(-2i\pi)^{\sqrt{2}}}\right)$ and $f(x)=2\pi\cos(2\pi x)-\sin(2\pi x)$ satisfy the fractional equation. Assume that the noise-free data $\{\bm{x}_{u},\bm{y}_{u}\}$ , $\{\bm{x}_{f},\bm{y}_{f}\}$ on $u(x)$ , $f(x)$ are generated according to $\bm{y}_{u}=u(\bm{x}_{u})$ , $\bm{y}_{f}=f(\bm{x}_{f})$ with $\bm{x}_{u}$ , $\bm{x}_{f}$ constituting of $n_{u}=5$ , $n_{f}=4$ data points chosen at random in the interval $ $, respectively. Given these noise-free training data, the algorithm learns the parameter$ \alpha $to have value$ 1.412104 $. The resulting posterior distributions for$ u(x) $and$ f(x)$ are depicted in Figure 3.

4 Drosophila melanogaster gap gene dynamics [36, 16]

The gap gene dynamics of protein $a\in\{Hb,Kr,Gt,Kni\}$ (see figure 4) can be modeled using a reaction-diffusion partial differential equation

where $u^{a}(t,x)$ denotes the relative concentration of gap protein $a$ (unitless, ranging from 0 to 255) at space point $x$ (from 35% to 92% of embryo length) and time $t$ (0 min to 68 min after the start of cleavage cycle 13). Here, $\lambda^{a}$ and $D^{a}$ are decay and diffusion rates of protein $a$ , respectively. Moreover, the right-hand-side is given by

models the doubling of nuclei and shutdown of transcription during mitosis and

specifies the production rate of protein $a$ . The model combines the processes of transcription and translation into a single production process $P^{a}(t,x)$ . Here, $R^{a}$ is the maximum production rate,

and $b\in\{Bcd,Cad,Hb,Kr,Gt,Kni,Tll\}$ ranges over the seven genes (see figure 4). The regulatory weights $T^{ab}$ , encode the effect protein $b$ has on the production rate of protein $a$ . If $T^{ab} >0$ (or $T^{ab} <0$ ), then gene $b$ is interpreted as being an activator (or a repressor) of gene $a$ .

This work assumes the maximum production rate $R^{a}$ , the regulatory weights $T^{ab}$ , and the bias or offset $h^{a}$ to be specified as in table 1 and seeks to learn the decay $\lambda^{a}$ and diffusion $D^{a}$ rates of protein $a$ . In fact, table 2 summarizes the values learned by the algorithm for these parameters and figure 5 depicts the corresponding posterior distributions for $u^{a}(t,x)$ and $f^{a}(t,x)$ . Indeed, figure 5 gives a good indication of how certain one could be about the estimated parameters and the predictions made based on them.

Discussion

In summary, this work introduced a possibly disruptive probabilistic technology for learning general parametric linear equations from noisy data. This generality was demostrated using various bechmark problems with utterly different attributes along with an example application in functional genomics. Furthermore, the current methodology can be applied to inverse problems involving characterization of materials, tomography and electrophysiology, design of effective metamaterials, etc. The methodology can be straightforwardly generalized to address data with multiple levels of fidelity and equations with variable coefficients and complex geometries. Non-Gaussian and input-dependent noise models (e.g., student-t, heteroscedastic, etc.) can also be accommodated. Moreover, systems of linear integro-differential equations can be addressed using multi-output Gaussian process regressions . These scenarios are all feasible because they do not affect the key observation that any linear transformation of a Gaussian process is still a Gaussian process. In its current form, despite its generality regarding linear equations, the proposed framework cannot deal with non-linear equations. However, some specific non-linear operators can be addressed with extensions of the current framework by transforming such equations into systems of linear equations – albeit in high dimensions. In the end, the proposed methodology in this work, being essentially a regression technology, is suitable for resolving such high-dimensional problems.