Signature Kernel Conditional Independence Tests in Causal Discovery for Stochastic Processes

Georg Manten, Cecilia Casolo, Emilio Ferrucci, Søren Wengel Mogensen, Cristopher Salvi, Niki Kilbertus

Introduction

Understanding cause-effect relationships from observational data can help identify causal drivers for disease progression in longitudinal data and aid the development of new treatments, understand the underlying influences of stock prices to support lucrative trading strategies, or speed up scientific discovery by uncovering interactions in complex biological or chemical systems such as gene regulatory pathways.

Causal discovery (or causal structure learning) has received continued attention from the scientific community for at least two decades (Glymour et al.,, 2019, Spirtes et al.,, 2000, Vowels et al.,, 2022) with a notable uptick in previous years particularly regarding differentiable score-based causal discovery (Zheng et al.,, 2018, Brouillard et al.,, 2020, Charpentier et al.,, 2022, Hägele et al.,, 2023, Lorch et al.,, 2021, Annadani et al.,, 2023, Zheng et al.,, 2020), score matching (Rolland et al.,, 2022, Montagna et al., 2023b, , Montagna et al., 2023a, ), and other deep-learning based approaches (Chen et al.,, 2022, Ke et al.,, 2023, Yu et al.,, 2019, Ke et al.,, 2020). Many of these approaches aim at improving the scalability of causal discovery in the number of variables and observations as well as at incorporating uncertainty or efficiently making use of interventional data.

Despite these limitations, eq. 1 goes far beyond common assumptions in causal discovery from time-series data, also in comparison with similar models as in Peters et al., (2022): (a) It does not rely on the ‘discrete-time’ assumption (neither in the underlying model nor the actual observations). (b) Path-dependence ( $\mu^{k},\sigma^{k}$ may depend on the entire previous history $X_{[0,t]}$ of paths), including delayed SDEs, is typically not captured in existing approaches. (c) It goes beyond ‘additive observation noise’ as it incorporates ‘driving noise’ where causal dependence may also come from the diffusion, which cannot be captured by the mean of observed paths.

Contributions. We develop sound and complete constraint-based algorithms for (a) inferring the full DAG, (b) inferring the completed partially directed acyclic graph (CPDAG) representing the Markov equivalence class from path-valued observations (assuming a conditional independence test oracle and faithfulness). While (a) yields strictly stronger results, it relies on a different independence model. To render these algorithms applicable, we also propose flexible CI tests on path-space building on recent advances in signature kernels. After extensive empirical evaluations, demonstrating the superior performance of our CI tests (for different types of CI models), we combine our test with the causal discovery algorithms to achieve promising performance in graph discovery for systems with up to 5 variables. Finally, we discuss benefits and drawbacks of the two algorithms.

Background and Related Work

Kernel methods are at the core of many key techniques in machine learning. A kernel defines a feature map that lifts observations into a (possibly infinite-dimensional) Hilbert space in which non-linear optimization problems on the original data often become linear and solutions can be expressed entirely in terms of kernel evaluations of the data. Selecting an effective kernel for a particular task and data modality is often challenging, particularly when the data is sequential. Signature kernels Király and Oberhauser, (2019), Salvi et al., 2021a are a class of universal kernels on sequential data which have received attention in recent years thanks to their efficiency in handling path-dependent problems Lemercier et al., (2021), Salvi et al., 2021c , Cochrane et al., (2021), Salvi et al., 2021b , Cirone et al., (2023), Issa et al., (2023).

A kernel-trick was derived by Salvi et al., 2021a (, Theorem 2.5) who showed that the signature kernel satisfies the following path-dependent integral equation

Several off-the-shelf python packages offer efficient PDE solvers for eq. 3 such as sigkerax.

2 Conditional Independence Tests

Conditional independence (CI) is a fundamental notion in computational statistics and forms the foundation of graphical models. CI testing involves assessing whether two random variables, $X$ and $Y$ , are independent conditioned on a third random variables, $Z$ , i.e., whether the null hypothesis $H_{0}:X\operatorname{\perp\perp}Y\mid Z$ can be rejected against the alternative $H_{1}:X\perp\not\hskip 1.00006pt\perp Y\mid Z$ . When $Z$ is discrete, conditional independence testing simplifies to a collection of unconditional tests for each value of $Z$ (Tsamardinos and Borboudakis,, 2010). When $Z$ is continuous, most CI tests either rely on parametric assumptions about the underlying distributions (Baba et al.,, 2004) or rely primarily on kernel methods for nonparametric approaches (Li and Fan,, 2020, Zhang et al.,, 2017).

Kernel methods for CI testing typically leverage either kernel (ridge) regression or kernel mean embeddings and the Hilbert-Schmidt (conditional) independence criterion (HS(C)IC) to measure the reproducing kernel Hilbert space (RKHS) distance of the (conditional) mean embeddings of the joint distribution and the product of marginal distributions (Gretton et al.,, 2007, Muandet et al.,, 2017, Park and Muandet,, 2020). Examples include KCIT (Zhang et al.,, 2011) that draws on partial associations (Daudin,, 1980) of regression functions connecting $X$ , $Y$ , $Z$ , together with analytical or empirical (bootstrapped) expressions for the asymptotic distribution of the test statistic under the null. Strobl et al., (2019) approximate kernel computations in KCIT via random Fourier features for improved efficiency.

Other approaches rely on reducing CI to a two-sample test for the equivalent null $H_{0}:P(X,Y,Z)=P(X\mid Z)P(Y\mid Z)P(Z)$ . KCIPT (Doran et al.,, 2014) simulates samples from $P(X\mid Z)P(Y\mid Z)P(Z)$ by learning a permutation of the samples that approximately preserves $Z$ values. Lee and Honavar, (2017) propose a modified, unbiased estimate of the Maximum Mean Discrepancy (MMD) (Gretton et al.,, 2006) as the test statistic in KCIPT claiming improved calibration. Finally, other permutation-based approaches require large sample sizes (Sen et al.,, 2017) or rely on densities (Kim et al.,, 2022), which do not exist in our setting. Recently, Shah and Peters, (2020) proposed a simple test based on double regression for increased robustness and later generalized this technique to a kernel-based regression techniques (Lundborg et al.,, 2022), which also applies to functional data via function-on-function ridge regressions. In our setting, these regressions are difficult to perform as they would have to map between arbitrary time segments of multivariate SDE solution paths. Laumann et al., (2023), perhaps the closest work to ours, develop a conditional independence test for functional data based on HSCIC (Park and Muandet,, 2020) as the test statistic paired with a permutation test (Berrett et al.,, 2019). They further combine existing techniques such as the PC-algorithm (Glymour et al.,, 2019) with a regression-based method (Hoyer et al.,, 2008) for a causal discovery heuristic. Crucially, their method assumes knowledge of $P_{X\mid Z}$ , a major challenge in our SDE setting. We focus on constraint-based methods leveraging the ability of the signature kernel to capture filtrations on arbitrary intervals. Since Laumann et al., (2023) propose the only other method we are aware of that can be applied in our setting, we benchmark our CI test against them.

On a practical level, large conditioning sets inevitably imply curse-of-dimensionality type issues as the ‘number possible values to condition on’ grows exponentially. Moreover, existing approaches often assume observations in real Euclidean vector spaces and do not directly generalize to path-valued random variables where no densities exist. On a theoretical level, CI testing faces a fundamental impossibility in that there cannot be a universally powerful test for all distributions while also controlling the type I error (Shah and Peters,, 2020, Lundborg et al.,, 2022). This ‘no free lunch’ statement highlights the importance of carefully selecting an appropriate test for the setting under consideration. Currently, there exists no practical CI test tailored to our case—a gap we fill.

3 Causal Discovery on Time Series

A growing body of work considers differential equations at equilibrium to obtain ‘solvable’ structural causal models that may include cycles (Mooij et al.,, 2013, Bongers et al.,, 2018, Bongers and Mooij,, 2018). Due to their assumptions about the data generating mechanism, this work does not apply to our setting. Other works instead aim at learning the full dynamical law by leveraging invariance under the assumption of mass-action-kinetics (Peters et al.,, 2022) or relying on non-convex optimization of heavily overparameterized neural network models for reconstruction performance (Aliee et al.,, 2021, 2022, Bellot et al.,, 2022). In continuous-time stochastic processes, so-called (conditional) local independence defines an asymmetric independence relation (Schweder,, 1970, Mogensen et al.,, 2018). Tests of local independence may be used to learn a (partial) causal graph in, e.g., point process models and diffusions (Didelez,, 2008, Meek,, 2014, Mogensen et al.,, 2018, Mogensen and Hansen,, 2020).

Method

We assume the existence of a conditional independence oracle for statements of the form

Asymptotically, such an oracle can realistically be replaced in practice by a consistent finite-sample CI test for path-valued random variables (see Section 3.2). The following example demonstrates that constraint-based causal discovery of the full dependency graph is impossible using just 3.1 when allowing for cycles.

The oracle in 3.1 is not powerful enough to rule out a directed edge $X^{1}\to X^{3}$ for an SDE model with the following dependency graph:

$X^{1}<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msup><mi>X</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">X^{2}</annotation></semantics></math>X2X^{3}$ We go through this example and describe a different type of oracle that would be required for causal discovery in cyclic SDE models in Section A.2. Example 3.2 provides additional motivation for considering acyclic SDE models, still allowing for loops. Loops are crucial in the dynamic setting, as they represent dependencies of a variable on itself, infinitesimally into the past.

We will use the oracle in three distinct ways ( $i,j\in[d]$ , $K\subset[d]$ , $\{i,j\}\cap K=\varnothing$ ):

$X^{i}$ is conditionally $h$ -locally self-independent given $X^{K}$ at $s$ , $X^{i}\operatorname{\perp\perp}^{\circlearrowleft}_{s,h}\mid X^{K}$ if

Assume $\mathcal{G}$ is directed and acyclic except for loops. Then $\widetilde{\mathcal{G}}$ is acyclic, and the assignment of $X^{i}_{[0,s]}$ to the node $i_{0}$ and $X^{i}_{[s,s+h]}$ to $i_{1}$ , for each $i\in V$ defines a fully observed SCM (Peters et al.,, 2017). In particular, its observational distribution satisfies the global Markov property w.r.t. to $\widetilde{\mathcal{G}}$ . We use the term Markov for the Markov property in causality, not as in ‘Markov process’.

Our proof is in Section A.3. Under 3.1 and assuming faithfulness for the SCM of Proposition 3.3, we now develop modified versions of the PC algorithm. More details on faithfulness and minimality can be found in Section A.4.

Discovering the full DAG $\mathcal{G}$ . To discover the full graph including loops we propose Algorithm 1 that makes use of the time-ordering via our $h$ -local independence models (consider $s,h>0$ arbitrary but fixed). Algorithm 1 is kept conceptually simple and not an optimized version. As suggested by Spirtes et al., (2000), we could for instance restrict the conditioning set ( $K$ ) to nodes that lie on undirected paths between $i$ and $j$ or introduce an additional for-loop that slowly grows the cardinality of the conditioning set in the second stage (line 8 of Algorithm 1 as well).

Algorithm 1 is sound and complete for the SDE model eq. 1, assuming acyclicity except for loops and faithfulness: its output coincides with the dependency graph $\mathcal{G}$ of the SDE.

The proof of the theorem can be found in Section A.6.

If we knew in advance that there is no path-dependency, the test $X^{i}\operatorname{\perp\perp}^{+}_{s,h}X^{j}$ (with no additional conditioning set) is equivalent to the test on values of the process $X^{i}_{s}\operatorname{\perp\perp}X^{j}_{s+h}\mid X^{j}_{s}$ : this is because $X^{j}_{u}$ , for $u\in[s,t]$ only depends on $X_{[s,t]}^{j}$ via $X^{j}_{s}$ . However, this kind of statement no longer works in the conditional case (by arguments similar to Example 3.2), and testing on whole paths is strictly necessary when allowing for path-dependency.

CI testing of SDEs is conceptually similar to static CI on discretely sampled data, when the sampling rate is lower than the actual frequency at which causal effects propagate. This is because, in continuous time and for any $h>0$ , there are going to be causal effects that occur at time scales less than $h$ . Whenever we just discretely sample time series with no instantaneous effects, Peters et al., (2017, Thm. 10.3) provides full causal discovery, even in the presence of cycles in the summary graph, by performing $d^{2}$ tests $X^{i}_{s}\operatorname{\perp\perp}X^{j}_{s+\Delta s}\mid X^{[d]\setminus\{i,j\}}_{[0,s]}$ . Potentially absent loops could then be removed as in Algorithm 1. Ignoring that this might suffer from the conditioning set being too large, our signature method can be of help in this setting too, if there is path-dependence: the conditioning variable can be taken to be $S(X^{[d]\setminus\{i,j\}})_{0,s}$ .

Given these potential benefits, we propose to further augment the CPDAG discovery by a post-processing step in which we aim at orienting all remaining unoriented edges $(i,j)$ within the Markov equivalence class represented by the CPDAG leveraging the following corollary.

2 Signature Kernel Conditional Independence Test

KCIPT with the signature kernel on path-space is consistent for the tests in 3.1.

We provide the precise conditions and a proof in Section A.8. While such consistency proofs provide theoretical grounding, as described in Section 2, CI tests must be carefully selected and validated in a given domain empirically. Therefore, the extensive evaluation of our CI test in the next section is perhaps more important in practice.

Experiments

First, we compare how different kernel-based CI tests perform in conjunction with the signature kernel in Appendix C. Based on results in Figure 5 and the consistency statement in Theorem 3.8, we use KCIPT in all subsequent experiments.

Baselines. We compare against established causal discovery methods in time series, specifically CCM (Sugihara et al.,, 2012), PCMCI (Runge et al.,, 2019), Granger causality (Granger,, 1969), and the recent approach by Laumann et al., (2023). CCM and Granger causality are restricted to bivariate settings and Laumann et al., (2023) conditional independence test requires additional knowledge about the underlying distribution and is therefore not suited for a comparison outside the unconditional case. Details about the baseline implementations are in Appendix C.

In the first series of experiments, we measure the test power using the unconditional test in the case $X^{1}\rightarrow X^{2}$ for SDEs of the form eq. 5, with constant diffusion $B\equiv 0$ , $d_{i}=0.4$

Figure 2 shows that test power swiftly approaches 1 for as few as 40-60 samples as the ratio between causal- and self-interaction strength $\frac{a_{21}}{a_{22}}$ at $X^{2}$ increases, so as soon as the causal interaction of $X^{1}$ feeding into $X^{2}$ reaches a substantial level compared to $X^{2}$ self-driven dynamics. Our method consistently outperforms Laumann et al., (2023) across sample sizes. In addition, since the signature kernel naturally handles irregularly observed paths, Figure 6 shows that our test also consistently outperforms the baselines even at high levels of data missingness (for details see Section B.4).

2 Leveraging the Direction of Time

Next, we explore a bivariate scenario with $X^{1}\to X^{2}$ , where we leverage the direction of time to orient the edge using the future-extended h-local conditional independence with $K=\emptyset$ for different interaction types.

Linear case. In the case of linear drift-interaction, SDEs are drawn from eq. 5 with $a_{21}\sim\mathcal{U}([0.8,1])$ , $a_{11},a_{22}\sim\mathcal{U}([0,0.5])$ , $a_{12}=0$ , $B\equiv 0$ , $d_{i}\sim\mathcal{U}([0.3,0.4])$ . Table 1 (setting linear) shows that our criterion not only accurately captures the presence of the edge but also captures the direction of dependence for sufficient amounts of samples, outperforming both CCM, Granger causality and PCMCI.

Non-linear case. Our proposed criterion is also able to almost perfectly predict the graph for non-linear SDEs as

where $\nu\sim\mathcal{U}()$ , $\omega=2\pi\nu$ , $r\sim\mathcal{U}([2,2.5])$ and $d_{i}\sim\mathcal{U}([0.3,0.4])$ , outperforming the baselines (see Table 1).

Diffusion case. In the final bivariate experiment, we test the prediction-capability of our independence criterion in setting (5) where all entries of $A,B$ are zero except for $a_{11},a_{22}\sim\mathcal{U}([0.5,1])$ and $b_{21}\sim\mathcal{U}([2.5,3])$ , and $d\equiv 0$ . As the only method capable of handling this type of dependence, we again outperform the baselines (see Table 1).

3 Building Blocks for Causal Discovery

The performance of the conditional and unconditional independence tests was evaluated using widely recognized causal structures that are at the foundation of causal discovery. Specifically, we examined scenarios involving two variables, both connected and unconnected, and three-variable configurations comprising chain, fork, and collider structures. More details on the experiments are found in Section B.5. The analysis of type I and type II errors, as shown in Table 2, indicates robust performance for both conditional KCIPT and unconditional HSIC tests. These results refer to the symmetric conditional independence $\operatorname{\perp\perp}_{sym}$ , while future-extended h-local conditional independence are presented in Table 6. In Section B.5, further results in the diffusion dependency setting (Table 5).

4 Causal Discovery in Higher Dimensions

After having dealt with the bivariate case and the building blocks of causal discovery, we now test the performance of our CI test in conjunction with our causal discovery algorithms. We sample $100$ DAG adjacency structures for dimensions $d\in\{3,4,5\}$ and draw samples from $d$ -dimensional SDEs as in (5) satisfying this given adjacency structure and $a_{ij}\sim\mathcal{U}(\cup),j\neq i$ , $a_{ii}\sim\mathcal{U}([-0.5,0.5])$ .

5 Real-World Pairs Trading Example

Finally, we apply our method on real-world data for evaluating pair trading strategies on ten stocks from the VBR Small-Cap ETF over a three-year period from January 1, 2010, to December 31, 2012. While financial data typically lacks a ground truth for validation, we assess the effectiveness of our method by quantifying the product-and-loss (P&L) profile of the pair trading strategy generated via our model. We select pairs based on the pairwise p-values of different hypothesis tests: 1) cointegration, tested via the classical Augmented Dickey-Fuller (ADF) test, 2) classical Granger causality test and 3) our method. As shown in Table 3, the P&L of our trading strategy outperforms the P&L of the two benchmarks in terms of most classical trading metrics: total return, APR, Sharpe ratio, maxDDD. More details on the implementation can be found in Section B.6.

Conclusions and Future Work

In this work, we introduced a kernel-based test for conditional independence on path-space, utilizing signature kernels, and applied it to constraint-based causal discovery in stochastic dynamical systems. We developed an algorithm for comprehensive causal discovery in acyclic systems. This algorithm operates under the assumptions of faithfulness and the availability of a conditional independence oracle.

To deploy this algorithm in practice, we developed a consistent conditional independence test based on the signature kernel and extensively evaluated its performance empirically. Our results indicate that our method outperforms existing baselines for causal discovery on time series, highlighting its potential in complex dynamic systems. Additionally, we successfully applied our method to a real-world pairs trading example, where it also demonstrated strong performance.

An interesting direction for future work involves extending our approach to cyclic settings and accommodating dynamics with temporally changing causal structure or with partial observations. Future research could expand on these applications in real-world scenarios, to evaluate the method’s practical impact in concrete domains. Moreover, the field of causality represents just one of the many potential applications of conditional independence tests for path-valued random variables. Exploring other applications in different domains could also catalyze new insights and developments.

Acknowledgments

Cecilia Casolo is supported by the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research. Emilio Ferrucci is supported by UKRI EPSRC Programme Grant EP/S026347/1. Søren Wengel Mogensen is supported by Independent Research Fund Denmark (DFF-International Postdoctoral Grant 0164-00023B) and a member of the ELLIIT Strategic Research Area at Lund University.

References

Appendix A Proofs and theoretical digressions

Making the actual dependence of $\mu^{k}$ and $\sigma^{k}$ on their arguments in eq. 1 explicit, we can rewrite it as

Given that we are considering the dynamic setting, it is generally natural to allow for $\mathcal{G}$ to have cycles. This comes at no additional requirement of consistency constraints as it does in the static case (see for example Bongers et al., (2021)), since the causal arrows in the model eq. 1 should be thought of as ‘pointing towards the infinitesimal future, with infinitesimal magnitude’, integrated over the whole time interval considered. Indeed, the system of SDEs does not require any global consistency to be well-posed, other than the regularity and growth conditions that guarantee global-in-time existence and uniqueness. On the other hand, we will be interested in the potential for constraint-based causal discovery of such systems, qualified by the following assumption on the types of conditional independencies that we allow to be tested in continuous time:

A.2 Counterexample for Cyclic Causal Discovery in the SDE Model

We restate our example, namely that the oracle in 3.1 is not powerful enough to rule out a directed edge $X^{1}\to X^{3}$ for an SDE model with the following dependency graph:

$X^{1}<math xmlns="http://www.w3.org/1998/Math/MathML" display="block"><semantics><mrow><msup><mi>X</mi><mn>2</mn></msup></mrow><annotation encoding="application/x-tex">X^{2}</annotation></semantics></math>X2X^{3}$ Testing $X^{1}_{[0,s]}\operatorname{\perp\perp}X^{3}_{[s,s+h]}\mid X^{2,3}_{[0,s]}$ , will not remove the edge, due to the open path

Testing $X^{1}_{[0,s]}\operatorname{\perp\perp}X^{3}_{[s,s+h]}\mid X^{3}_{[0,s]},X^{2}_{[0,s+h]}$ , on the other hand, runs into the collider

There is no other way of splitting up the time interval (even allowing for more than 2 subintervals) that overcomes both these problems: when testing $X^{1}_{[0,s]}\operatorname{\perp\perp}X^{3}_{[s,s+h]}$ (as is necessary in order to rely on time to obtain direction of the arrow), if $X^{2}$ is not conditioned on over $[s,s+h]$ it will be a mediator, and if it is, it will be a collider.

What would be needed to detect the edge in the above example (and to perform causal discovery more generally in the cyclic case), is the availability of tests of strong instantaneous non-causality in the Granger sense as defined in Florens and Fougere, (1996). This reflects the fact that the SDE model is ‘acyclic at infinitesimal scales’. Unfortunately, such a property—which has to do with the Doob-Meyer decompositions of functions of the process w.r.t. two different filtrations—is much more difficult to test for, as it is infinitesimal in nature. Recent progress in this direction was made in Christgau et al., (2023) for the case of SDEs driven by jump processes; this local independence criterion, which goes back to Schweder, (1970), is however weak in the sense that it only takes into account the Doob-Meyer decomposition of $X$ and not functions of it, and is not therefore able, for example, to detect dependence in the diffusion coefficient Gégout-Petit and Commenges, (2010). Here we take the view that a continuously-indexed path, queried over an interval, can be considered as an observational distribution (cf. the discussion in Sokol and Hansen, (2013) on whether the infinitesimal generator can be considered ‘observational’).

A.3 Proof of Proposition 3.3

Since $\mathcal{G}$ is acyclic, a directed cycle in $\widetilde{\mathcal{G}}$ that is not a loop must contain nodes both in $V_{0},V_{1}$ . But there can be no such directed cycles, since edges only travel in the direction $V_{0}\to V_{1}$ , by construction of $\widetilde{\mathcal{G}}$ . Since $\widetilde{\mathcal{G}}$ also free of loops, it is a DAG. We now show that there exists an SCM over $\widetilde{\mathcal{G}}$ with the path-valued random variables of the statement: Markovianity will then follow from Peters et al., (2017, Proposition 6.31), original to Verma and Pearl, (1990). In other words, we must show there exist Borel-measurable functions (setting $t\coloneqq s+h$ )

with the initial conditions $x^{0}$ and Brownian path segments $W^{k}_{[0,s]}$ , $W^{k}_{[s,t]}$ collectively independent. This independence follows from the definition of the SDE model and from independence of Brownian increments over disjoint or consecutive intervals. These functions are partial in that they are not defined on the whole space of continuous functions on the interval: we only need to show them to be defined on a measurable set of paths that contains all solutions of SDEs on the required interval. $F^{k}_{1}$ is defined as follows, with care to make explicit the dependence on the solution on the two intervals via the operation of path-concatenation $*$ :

A.4 Faithfulness and causal minimality

A.5 Proof of Corollary 3.7

A.6 Proof of Theorem 3.4

We begin with proving correctness of the recovery of the skeleton modulo loops. This will follow if we show that for $(i\to j)\not\in\mathcal{G}$ for $i,j\in V$ distinct

We argue similarly for the loop-removal phase. If $(i\to i)\not\in\mathcal{G}\iff(i_{0}\to i_{1})\not\in\widetilde{\mathcal{G}}$ , we must show

A.7 Global Markov Property for the Symmetric Independence Criterion

In this section, we provide a proof for the global Markov property of the symmetric independence criterion in order for the PC-algorithm to be applicable. It is based on the notion of independence models:

Let $V\cong\{1,\ldots,n\}$ be a set. An independence model $\mathcal{J}(V)$ over $V$ is a ternary relation over disjoint subsets of $V$ ,

When $(A,B,C)$ is a triple in $\mathcal{J}(V)$ , $(A,B,C)\in\mathcal{J}(V)$ , we also use $\langle A,B\mid C\rangle$ to denote $(A,B,C)$ . This notation highlights the fact that $C$ is a conditioning set. An independence model $\mathcal{J}(V)$ is called a semigraphoid if 1.-4. hold for all disjoint $A,B,C\subset V$ .

(symmetry) $\langle A,B\mid C\rangle\in\mathcal{J}(V)$ $\Rightarrow$ $\langle B,A\mid C\rangle\in\mathcal{J}(V)$

(decomposition) $\langle A,B\mid C\rangle\in\mathcal{J}(V)$ and $D\subset B$ $\Rightarrow$ $\langle A,D\mid C\rangle\in\mathcal{J}(V)$

(weak union) $\langle A,B\cup D\mid C\rangle\in\mathcal{J}(V)$ $\Rightarrow$ $\langle A,B\mid C\cup D\rangle\in\mathcal{J}(V)$

(contraction) $\langle A,B\mid C\rangle\in\mathcal{J}(V)$ and $\langle A,D\mid B\cup C\rangle\in\mathcal{J}(V)$ $\Rightarrow$ $\langle A,B\cup D\mid C\rangle\in\mathcal{J}(V)$

A semigraphoid $\mathcal{J}(V)$ is called graphoid if 5. holds.

(intersection) $\langle A,B\mid C\cup D\rangle\in\mathcal{J}(V)$ and $\langle A,D\mid B\cup C\rangle\in\mathcal{J}(V)\Rightarrow\langle A,B\cup D\mid C\rangle\in\mathcal{J}(V)$

An independence model $\mathcal{J}(V)$ called compositional if 6. holds.

(composition) $\langle A,B\mid C\rangle\in\mathcal{J}(V)$ and $\langle A,D\mid C\rangle\in\mathcal{J}(V)\Rightarrow\langle A,B\cup D\mid C\rangle\in\mathcal{J}(V)$

The graphical criterion of d-separation defines a independence model on a DAG $\mathcal{G}=(V,\mathcal{E})$ by

For any DAG $\mathcal{G}=(V,\mathcal{E})$ , the independence model $\mathcal{J}_{d-sep}(\mathcal{G})$ is a compositional graphoid.

Let $V=\{1,\ldots,n\}$ . Given random variables $X^{i}:(\Omega,\mathcal{A},P)\rightarrow(\mathcal{X}^{i},\mathcal{A}_{i}),\ i\in V,$ and $A,B,C\subset V$ disjoint, conditional independence

defines the probabilistic independence model

We will use this as a symmetric conditional independence relation. Note that $X^{i}$ may take values in a path-space in which case it is a stochastic process. The independence model $\mathcal{J}(P)$ is a semigraphoid which is an immediate consequence of sub- $\sigma$ -algebra properties.

Let $\mathcal{G}=(V,\mathcal{E})$ be a DAG and $\mathcal{J}(V)$ be an independence model over $V$ . The independence model $\mathcal{J}(V)$ satisfies the global Markov property w.r.t. $\mathcal{G}$ : $\Leftrightarrow$

The independence model $\mathcal{J}(V)$ satisfies the directed local Markov property w.r.t. $\mathcal{G}$ : $\Leftrightarrow$

Let $\mathcal{G}=(V,\mathcal{E})$ be a DAG, let and $\mathcal{J}(V)$ a semigraphoid over $V$ . The directed global and local Markov properties are equivalent, that is,

Let $X$ be a set of variables induces by the model in eq. 6 with a constant and diagonal diffusion matrix s.t. it induces the DAG $\mathcal{G}=(V,\mathcal{E})$ . Then the system satisfies the directed local Markov property w.r.t $\mathcal{G}$ ,

establishing eq. 10 for $\mathcal{J}(P)$ w.r.t. $\mathcal{G}$ . ∎

A.8 Consistency of the conditional independence test

Our measure-theoretical argumentation now follows the definitions in Park and Muandet, (2020) and the original proof in Doran et al., (2014). Let $\mathcal{H}$ be a Banach space, $\mathcal{E}\subset\mathcal{F}$ a sub- $\sigma$ -algebra of the Borel- $\sigma$ -algebra.

Let $H:(\Omega,\mathcal{A},P)\rightarrow(\mathcal{H},\mathcal{F})$ a Bochner $P$ -integrable random variable. The conditional expectation of $H$ w.r.t $\mathcal{E}$ is a Bochner $P$ -integrable RV $H^{\prime}:(\Omega,\mathcal{A},P)\rightarrow(\mathcal{H},\mathcal{E})$ such that

where the second equivalence is trivial. Let now $\sigma\in\mathcal{S}_{n}$ be a permutation such that its permutation matrix $M_{\sigma}=(m_{ij})$

We can now write the distance between test statistic and the approximated test statistic in terms of the sum of the distance between the empirical estimates and its respective counterparts by using the inverse and the regular triangle inequality

By Park and Muandet, (2020), the first term tends to zero for $n\rightarrow\infty$ . With a similar line of argumentation as in Doran et al., (2014) on can establish a majorant for the estimated HSCIC, namely

where the assumptions can be made due to the boundedness of the kernels. Hence, under the assumption that

a statement that holds true in the case of densities (see supplementary material of Janzing et al., (2013)), since $P_{\mid Z}$ is regular, $\hat{\mu}_{P^{\prime}_{X,Y,Z}}\rightarrow\mu_{X\mid Z}\otimes\mu_{Y\mid Z}\otimes\mu_{Z}$ .

Appendix B Experiments and Evaluation

This section delves into the experimental framework, elaborating on the data generation process, the parameters and architectures employed, and presents additional results.

where $\|\cdot\|_{L^{2}}$ is the is the $L^{2}$ norm of entire paths, i.e., integrated over the time interval $I$ . Given that in practice we observe the paths $x_{i}$ still at fixed time points $t_{1}<\ldots<t_{m}$ within the interval $I$ , we can also consider the heuristic

B.2 Empirical Comparison of Different Tests

Different kernel-based conditional and unconditional tests are compared. HSIC with bootstrap (HSICb) and with $\gamma$ -distribution approximation (HSICγ) (Gretton et al.,, 2007) were implemented for the unconditional setting, while KCIPT (Doran et al.,, 2014), KCIT with bootstrap (KCITb), KCIT with $\gamma$ approximation (KCITγ) (Zhang et al.,, 2011) and SDCIT for the conditional setting (Lee and Honavar,, 2017). Figure 5 shows that KCIPT and HSICb are overall the best performing ones in terms of type I and type II error.

B.3 Simulated Datasets

The data-generating process corresponding to Table 2 consists in linear Stochastic Differential Equations (SDEs) as defined in Equation 5. The drift matrix elements are created using a uniform distribution, with $a_{11}$ and $a_{22}$ following $\mathcal{U}[-0.5,0.5]$ . The off-diagonal elements, $a_{21}$ and $a_{12}$ , are sampled from $\mathcal{U}[(-2.5,-1.0)\cup(1.0,2.5)]$ . Additionally, we set both drift and diffusion biases by sampling $b_{i}$ and $d_{i}$ from $\mathcal{U}[-0.1,0.1]$ and $\mathcal{U}[-0.2,0.2]$ , respectively. It’s also important to note that the diffusion matrix in these simulations contains only zero elements.

B.4 Additional Experimental Results for the Bivariate Case

Figure 6 shows that in the simple unconditional bivariate setting of Section 4.1 our test maintains high power up to substantial levels of missingness and only starts degrading once more than 80% of observations are dropped at random starting from $64$ time steps over the interval $ $for$ 1000$ different SDEs.

B.5 Building blocks for causal discovery

Similarly to the results in Table 2, we perform the same experiments for the setting in which the causal dependency only comes from the diffusion. Specifically, referring to the linear SDE in eq. 5, $a_{ij}=a_{ji}=0$ , $a_{11},a_{22}\sim\mathcal{U}[-0.5,0.5]$ , $\sigma_{12},\sigma_{21}\sim\mathcal{U}[(-2.5,-1.0)\cup(1.0,2.5)]$ and $\sigma_{11},\sigma_{22}\sim\mathcal{U}[-0.5,0.5]$ . Moreover, $b_{i}$ and $d_{i}$ are sampled from $\mathcal{U}[-0.1,0.1]$ and $\mathcal{U}[-0.2,0.2]$ . Table 5 includes the results for all the different causal structures. We also assess the performance of the future-extended h-locally conditionally independence criterion ( $\operatorname{\perp\perp}_{s,t}^{+}$ ) in the different three-dimensional causal structures: chain, collider and fork. The results are presented respectively in Table 6. The tables show that the test is able to reconstruct the presence and the directionality of the edges overall all building blocks of causal structure learning.

B.6 Real-World Pairs Trading Example

In the pairs trading experiment, stock price data is downloaded from Yahoo Finance for a predefined list of stocks over a specific period, divided into training (1st January 2010 to 31st December 2011) and trading intervals (1st January 2012 to 31st December 2012). The chose stocks are Trinity Industries (TRN), Brandywine Realty Trust (BDN), Commercial Metals Company (CMC), The New York Times Company (NYT), New York Community Bancorp (NYCB), The Wendy’s Company (WEN), CNX Resources Corporation (CNX). Logarithms of the stock prices are computed to stabilize variance and normalize the prices. During the training period, pairs of stocks are selected based on various statistical tests: Cointegration Test (Engle-Granger) to determine if a pair of stocks is cointegrated, suggesting a long-term equilibrium relationship; Augmented Dickey-Fuller (ADF) Test to assess the stationarity of the spread (ratio) of a pair’s prices; and Granger Causality Test to check if the price of one stock in the pair can predict the price of the other. Pairs are selected based on the p-values from these tests, with a threshold of $0.05$ determining significance. In the trading phase, a rolling window is used to continuously recalculate the mean and standard deviation of the spread between the selected pairs. Trades are initiated based on the z-score of the spread, where a high z-score indicates a short position and a low z-score indicates a long position. Positions are managed and closed when the spread returns to a predefined z-score threshold. The strategy’s performance is evaluated based on total return, annual percentage rate (APR), Sharpe ratio, maximum drawdown (maxDD), and maximum drawdown duration (maxDDD).

B.7 Evaluation Metrics

Following common conventions in the causal discovery literature, we evaluate our algorithms using the following metrics.

Given two directed graphs $\mathcal{G}_{1}=\left(V,\mathcal{E}_{1}\right)$ , $\mathcal{G}_{2}=\left(V,\mathcal{E}_{2}\right)$ over the same nodes with different sets of edges and let $A_{1},A_{2}\in\{0,1\}^{n\times n}$ be their adjacency matrices. Then we define the structural hamming distance (SHD) as

Normalized SHD

In order to compare the recovery performance of our algorithms for different types of graph-sizes, the normalized SHD is defined as

Appendix C Baselines

In the experiment section Section 4, we benchmark our method against other baselines, including Granger causality, CCM, PCMCI, and Laumann (Laumann et al.,, 2023). For the latter, we used their implementation provided in the causal-fda package. For the Granger-implementation for two variables ( $d=2$ ), we used Seabold and Perktold, (2010), for CCM we used Javier, (2021), and for PCMCI we used the tigramite package (Runge et al.,, 2019). In PCMCI, tests for edges are conducted by applying distance correlation-based independence tests (Székely et al.,, 2007) between the variables’ residuals after regressing out other nodes using Gaussian processes.

Appendix D Summary of notations

A summary of notation is provided in Table 7.