Variational Autoencoders and Nonlinear ICA: A Unifying Framework

Ilyes Khemakhem, Diederik P. Kingma, Ricardo Pio Monti, Aapo Hyvärinen

INTRODUCTION

The framework of variational autoencoders (Kingma and Welling,, 2013; Rezende et al.,, 2014) (VAEs) and its extensions (e.g. Burda et al., (2015); Kingma et al., (2016); Tucker et al., (2018); Maaløe et al., (2019)) offers a scalable set of techniques for learning deep latent-variable models and corresponding inference models. With VAEs, we can in principle learn flexible models of data such that, after optimization, the model’s implicit marginal distribution over the observed variables approximates their true (but unknown) distribution. With VAEs we can also efficiently synthesize pseudo-data from the model.

However, we’re often interested in going a step further and want to learn the true joint distribution over both observed and latent variables. This is generally a very difficult task, since by definition we only ever observe the observed variables, never the latent variables, therefore we cannot directly estimate their joint distribution. If we could however somehow achieve this task and learn the true joint distribution, this would imply that we have also learned to approximate the true prior and posterior distributions over latent variables. Learning about these distributions can be very interesting for various purposes, for example in order to learn about latent structure behind the data, or in order to infer the latent variables from which the data originated.

Learning the true joint distribution is only possible when the model is identifiable, as we will explain. The original VAE theory doesn’t tell us when this is the case; it only tells us how to optimize the model’s parameters such that its (marginal) distribution over the observed variables matches the data. The original theory doesn’t tell us if or when we learn the correct joint distribution over observed and latent variables.

Almost no literature exists on achieving this goal. A pocket of the VAE literature works towards the related goal of disentanglement, but offers no proofs or theoretic guarantees of identifiability of the model or its latent variables. The most prominent of such models are $\beta$ -VAEs and their extensions (Burgess et al.,, 2018; Higgins et al.,, 2016, 2018; Esmaeili et al.,, 2018; Kim and Mnih,, 2018; Chen et al.,, 2018), in which the authors introduce adjustable hyperparameters in the VAE objective to encourage disentanglement. Other work attempts to find maximally independent components through the GAN framework (Brakel and Bengio,, 2017). However, models in these earlier works are actually non-identifiable due to non-conditional latent priors, as has been seen empirically (Locatello et al.,, 2018), and we will show formally below.

Recent work in nonlinear Independent Component Analysis (ICA) theory (Hyvärinen and Morioka,, 2016, 2017; Hyvärinen et al.,, 2019) provided the first identifiability results for deep latent-variable models. Nonlinear ICA provides a rigorous framework for recovering independent latents that were transformed by some invertible nonlinear transformation into the data. Some special but not very restrictive conditions are necessary, since it is known that when the function from latent to observed variables is nonlinear, the general problem is ill-posed, and one cannot recover the independent latents (Hyvärinen and Pajunen,, 1999). However, existing nonlinear ICA methods do not learn to model the data distribution (pdf), nor do they allow us to synthesize pseudo-data.

In this paper we show that under relatively mild conditions the joint distribution over observed and latent variables in VAEs is identifiable and learnable, thus bridging the gap between VAEs and nonlinear ICA. To this end, we establish a principled connection between VAEs and an identifiable nonlinear ICA model, providing a unified view of two complementary methods in unsupervised representation learning. This integration is achieved by using a latent prior that has a factorized distribution that is conditioned on additionally observed variables, such as a class label, time index, or almost any other further observation. Our theoretical results trivially apply to any consistent parameter estimation method for deep latent-variable models, not just the VAE framework. We found the VAE a logical choice since it allows for efficient latent-variable inference and scales to large datasets and models. Finally, we put our theoretical results to a test in experiments. Perhaps most notably, we find that on a synthetic dataset with known ground-truth model, our method with an identifiable VAE indeed learns to closely approximate the true joint distribution over observed and latent variables, in contrast with a baseline non-identifiable model.

UNIDENTIFIABILITY OF DEEP LATENT VARIABLE MODELS

where $\thetab\in\Theta$ is a vector of parameters, $p_{\thetab}(\mathbf{z})$ is called a prior distribution over the latent variables. The distribution $p_{\thetab}(\mathbf{x}|\mathbf{z})$ , often parameterized with a neural network called the decoder, tells us how the distribution on $\mathbf{x}$ depends on the values of $\mathbf{z}$ . The model then gives rise to the observed distribution of the data as:

Assuming $p_{\thetab}(\mathbf{x}|\mathbf{z})$ is modelled by a deep neural network, this can model a rich class of data distributions $p_{\thetab}(\mathbf{x})$ .

We assume that we observe data which is generated from an underlying joint distribution $p_{\thetab^{*}}(\mathbf{x},\mathbf{z})=p_{\thetab^{*}}(\mathbf{x}|\mathbf{z})p_{\thetab^{*}}(\mathbf{z})$ where $\thetab^{*}$ are its true but unknown parameters. We then collect a dataset of observations of $\mathbf{x}$ :

Note that the original values $\mathbf{z}^{*(i)}$ of the latent variables $\mathbf{z}$ are by definition not observed and unknown. The ICA literature, including this work, uses the term sources to refer to $\mathbf{z}^{*(i)}$ . Also note that we could just as well have written: $\mathbf{x}^{(i)}\sim p_{\thetab^{*}}(\mathbf{x})$ .

The VAE framework (Kingma and Welling,, 2013; Rezende et al.,, 2014) allows us to efficiently optimize the parameters $\thetab$ of such models towards the (approximate) maximum marginal likelihood objective, such that after optimization:

In other words, after optimization we have then estimated the marginal density of $\mathbf{x}$ .

2 Parameter Space vs Function Space

In this work we use slightly non-standard notation and nomenclature: we use $\thetab\in\Theta$ to refer to the model parameters in function space. In contrast, let $\mathbf{w}\in W$ refer to the space of original neural network parameters (weights, biases, etc.) in which we usually perform gradient ascent.

3 Identifiability

The VAE model actually learns a full generative model $p_{\thetab}(\mathbf{x},\mathbf{z})=p_{\thetab}(\mathbf{x}|\mathbf{z})p_{\thetab}(\mathbf{z})$ and an inference model $q_{\phib}(\mathbf{z}|\mathbf{x})$ that approximates its posterior $p_{\thetab}(\mathbf{z}|\mathbf{x})$ . The problem is that we generally have no guarantees about what these learned distributions actually are: all we know is that the marginal distribution over $\mathbf{x}$ is meaningful (Eq. 3). The rest of the learned distributions are, generally, quite meaningless.

What we are looking for is models for which the following implication holds for all $(\mathbf{x},\mathbf{z})$ :

That is: if any two different choices of model parameter $\thetab$ and $\thetab^{\prime}$ lead to the same marginal density $p_{\thetab}(\mathbf{x})$ , then this would imply that they are equal and thus have matching joint distributions $p_{\thetab}(\mathbf{x},\mathbf{z})$ . This means that if we learn a parameter $\thetab$ that fits the data perfectly: $p_{\thetab}(\mathbf{x})=p_{\thetab^{*}}(\mathbf{x})$ (the ideal case of Eq. 3), then its joint density also matches perfectly: $p_{\thetab}(\mathbf{x},\mathbf{z})=p_{\thetab^{*}}(\mathbf{x},\mathbf{z})$ . If the joint density matches, this also means that we found the correct prior $p_{\thetab}(\mathbf{z})=p_{\thetab^{*}}(\mathbf{z})$ and correct posteriors $p_{\thetab}(\mathbf{z}|\mathbf{x})=p_{\thetab^{*}}(\mathbf{z}|\mathbf{x})$ . In case of VAEs, we can then also use the inference model $q_{\phib}(\mathbf{z}|\mathbf{x})$ to efficiently perform inference over the sources $\mathbf{z}^{*}$ from which the data originates.

The general problem here is a lack of identifiability guarantees of the deep latent-variable model. We illustrate this by showing that any model with unconditional latent distribution $p_{\thetab}(\mathbf{z})$ is unidentifiable, i.e. that Eq. (4) does not hold. In this case, we can always find transformations of $\mathbf{z}$ that changes its value but does not change its distribution. For a spherical Gaussian distribution $p_{\thetab}(\mathbf{z})$ , for example, applying a rotation keeps its distribution the same. We can then incorporate this transformation as the first operation in $p_{\thetab}(\mathbf{x}|\mathbf{z})$ . This will not change $p_{\thetab}(\mathbf{x})$ , but it will change $p_{\thetab}(\mathbf{z}|\mathbf{x})$ , since now the values of $\mathbf{x}$ come from different values of $\mathbf{z}$ . This is an example of a broad class of commonly used models that are non-identifiable. We show rigorously in Supplementary Material D that, in fact, models with any form of unconditional prior $p_{\thetab}(\mathbf{z})$ are unidentifiable.

AN IDENTIFIABLE MODEL BASED ON CONDITIONALLY FACTORIAL PRIORS

In this section, we define a broad family of deep latent-variable models which is identifiable, and we show how to estimate the model and its posterior through the VAE framework. We call this family of models, together with its estimation method, Identifiable VAE, or iVAE for short.

The primary assumption leading to identifiability is a conditionally factorized prior distribution over the latent variables $p_{\thetab}(\mathbf{z}|\mathbf{u})$ , where $\mathbf{u}$ is an additionally observed variable (Hyvärinen et al.,, 2019). The variable $\mathbf{u}$ could be, for example, the time index in a time series (Hyvärinen and Morioka,, 2016), previous data points in a time series, some kind of (possibly noisy) class label, or another concurrently observed variable.

We describe the model above with noisy and continuous-valued observations $\mathbf{x}=\mathbf{f}(\mathbf{z})+\epsb$ .Equation (6) can be modified to model discrete variables too, as is detailed in Supplementary Material C, but that requires a bespoke identifiability theory. However, our identifiability results also apply to non-noisy observations $\mathbf{x}=\mathbf{f}(\mathbf{z})$ , which are a special case of Eq. (6) where $p_{\epsb}(\epsb)$ is Gaussian with infinitesimal variance. For these reasons, we can use flow-based generative models (Dinh et al.,, 2014) for $p_{\thetab}(\mathbf{x}|\mathbf{z})$ , while maintaining identifiability.

The prior on the latent variables $p_{\thetab}(\mathbf{z}|\mathbf{u})$ is assumed to be conditionally factorial, where each element of $z_{i}\in\mathbf{z}$ has a univariate exponential family distribution given conditioning variable $\mathbf{u}$ . The conditioning on $\mathbf{u}$ is through an arbitrary function $\lambdab(\mathbf{u})$ (such as a look-up table or neural network) that outputs the individual exponential family parameters $\lambda_{i,j}$ . The probability density function is thus given by:

where $Q_{i}$ is the base measure, $Z_{i}(\mathbf{u})$ is the normalizing constant and $\mathbf{T}_{i}=(T_{i,1},\dots,T_{i,k})$ are the sufficient statistics and $\lambdab_{i}(\mathbf{u})=(\lambda_{i,1}(\mathbf{u}),\dots,\lambda_{i,k}(\mathbf{u}))$ the corresponding parameters, crucially depending on $\mathbf{u}$ . Finally, $k$ , the dimension of each sufficient statistic, is fixed (not estimated). Note that exponential families have universal approximation capabilities, so this assumption is not very restrictive (Sriperumbudur et al.,, 2017).

2 Estimation by VAE

Next we propose a practical estimation method for the model introduced above. Consider we have a dataset $\mathcal{D}=\left\{\left(\mathbf{x}^{(1)},\mathbf{u}^{(1)}\right),\dots,\left(\mathbf{x}^{(N)},\mathbf{u}^{(N)}\right)\right\}$ of observations generated according to the generative model defined in Eq. (5). We propose to use a VAE as a means of learning the true generating parameters $\thetab^{*}:=(\mathbf{f}^{*},\mathbf{T}^{*},\lambdab^{*})$ , up to the indeterminacies discussed below.

We use the reparameterization trick (Kingma and Welling,, 2013) to sample from $q_{\phib}(\mathbf{z}|\mathbf{x},\mathbf{u})$ . This trick provides a low-variance stochastic estimator for gradients of the lower bound with respect to $\phib$ . The training algorithm is the same as in a regular VAE. Estimates of the latent variables can be obtained by sampling from the variational posterior.

3 Identifiability and consistency results

As discussed in section 2.3, identifiability as defined by equation (4) is very hard to achieve in deep latent variable models. As a first step towards an identifiable model, we seek to recover the model parameters or the latent variables up to trivial transformations. Here, we state informally our results on this weaker form of identifiability of the model —a rigorous treatment is given in Section 4. Consider for simplicity the case of no noise and sufficient statistics of size $k=1$ , and define $T_{i}:=T_{i,1}$ . Then we can recover $\mathbf{z}$ which are related to the original $\mathbf{z}^{*}$ as follows:

for an invertible matrix $A$ . That is, we can recover the original latent variables up to a component-wise (point-wise) transformations $T_{i}^{*},T_{i}$ , which are defined as the sufficient statistics of exponential families, and up to a subsequent linear transformation $A$ . Importantly, the linear transformation $A$ can often be resolved by excluding families where, roughly speaking, only the location (mean) is changing. Then $A$ is simply a permutation matrix, and equation (9) becomes

for a permuted index $i^{\prime}$ . Thus, the only real indeterminacy is often the component-wise transformations of the latents, which may be inconsequential in many applications.

4 Interpretation as nonlinear ICA

where $\mathbf{z}$ are assumed to follow a factorized (but typically unknown) distribution $p(\mathbf{z})=\prod_{i=1}^{d}p_{i}(z_{i})$ . This model is essentially a deep generative model. The difference to the definition above is mainly in the lack of noise and the equality of the dimensions: The transformation $\mathbf{f}$ is deterministic and invertible. Thus, any posteriors would be degenerate.

The goal is then to recover (identify) $\mathbf{f}^{-1}$ , which gives the independent components as $\mathbf{z}=\mathbf{f}^{-1}(\mathbf{x})$ , based on a dataset of observations of $\mathbf{x}$ alone. Thus, the goal of nonlinear ICA was always identifiability, which is in general not attained by deep latent variable models, as was discussed in Section 2 above.

To obtain identifiability, we either have to restrict $\mathbf{f}$ (for instance make it linear) and/or we have to introduce some additional constraints on the distribution of the sources $\mathbf{z}$ . Recently, three new nonlinear ICA frameworks (Hyvärinen and Morioka,, 2016, 2017; Hyvärinen et al.,, 2019) exploring the latter direction were proposed, in which it is possible to recover identifiable sources, up to some trivial transformations.

The framework in Hyvärinen et al., (2019) is particularly close to what we proposed above. However, there are several important differences. First, here we define a generative model where posteriors are non-degenerate, which allows us to show an explicit connection to VAE. We are thus also able to perform maximum likelihood estimation, in terms of evidence lower bound, while previous nonlinear ICA used more heuristic self-supervised schemes. Computing a lower bound on the likelihood is useful, for example, for model selection and validation. In addition, we can in fact prove a tight link between maximum likelihood estimation and maximization of independence of latents, as discussed in Supplementary Material F. We also learn both the forward and backward models, which allows for recovering independent latents from data, but also generating new data. The forward model is also likely to help investigate the meaning of the latents. At the same time, we are able to provide stronger identifiability results which apply for more general models than earlier theory, and in particular considers the case where the number of latent variables is smaller than the number of observed variables and is corrupted by noise. Given the popularity of VAEs, our current framework should thus be of interest. Further discussion can be found in Supplementary Material G.

IDENTIFIABILITY THEORY

Now we give our main technical results. The proofs are in Supplementary Material B.

1 General results

In practice, we are often interested in models that are identifiable up to a class of transformation. Thus, we introduce the following definition:

Let $\sim$ be an equivalence relation on $\Theta$ . We say that (1) is identifiable up to $\sim$ (or $\sim$ -identifiable) if

The elements of the quotient space ${\raisebox{1.99997pt}{$ \Theta $}\left/\raisebox{-1.99997pt}{$ \sim $}\right.}$ are called the identifiability classes.

We now define two equivalence relations on the set of parameters $\Theta$ .

Let $\sim$ be the equivalence relation on $\Theta$ defined as follows:

where $A$ is an $nk\times nk$ matrix and $\mathbf{c}$ is a vector

Our main result is the following Theoreman alternative version is in Supplementary Material E.:

Assume that we observe data sampled from a generative model defined according to (5)-(7), with parameters $(\mathbf{f},\mathbf{T},\lambdab)$ . Assume the following holds:

The set $\{\mathbf{x}\in\mathcal{X}|\varphi_{\varepsilon}(\mathbf{x})=0\}$ has measure zero, where $\varphi_{\varepsilon}$ is the characteristic function of the density $p_{\varepsilon}$ defined in (6).

The mixing function $\mathbf{f}$ in (6) is injective.

The sufficient statistics $T_{i,j}$ in (7) are differentiable almost everywhere, and $(T_{i,j})_{1\leq j\leq k}$ are linearly independent on any subset of $\mathcal{X}$ of measure greater than zero.

There exist $nk+1$ distinct points $\mathbf{u}^{0},\dots,\mathbf{u}^{nk}$ such that the matrix

of size $nk\times nk$ is invertible.the intuition and feasibility of this assumption are discussed in Supplementary Material B.2.3.

then the parameters $(\mathbf{f},\mathbf{T},\lambdab)$ are $\sim_{A}$ -identifiable.

2 Characterization of the linear indeterminacy

The equivalence relation $\sim_{A}$ provides a useful form of identifiability, but it is very desirable to remove the linear indeterminacy $A$ , and reduce the equivalence relation to $\sim_{P}$ by analogy with linear ICA where such matrix is resolved up to a permutation and signed scaling. We present in this section sufficient conditions for such reduction, and special cases to avoid.

We will start by giving two Theorems that provide sufficient conditions. Theorem 2 deals with the more general case $k\geq 2$ , while Theorem 3 deals with the special case $k=1$ .

Assume the hypotheses of Theorem 1 hold, and that $k\geq 2$ . Further assume:

The sufficient statistics $T_{i,j}$ in (7) are twice differentiable.

The mixing function $\mathbf{f}$ has all second order cross derivatives.

then the parameters $(\mathbf{f},\mathbf{T},\lambdab)$ are $\sim_{P}$ -identifiable.

Assume the hypotheses of Theorem 1 hold, and that $k=1$ . Further assume:

The sufficient statistics $T_{i,1}$ are not monotonicmonotonic means it is strictly increasing or decreasing..

All partial derivatives of $\mathbf{f}$ are continuous.

then the parameters $(\mathbf{f},\mathbf{T},\lambdab)$ are $\sim_{P}$ -identifiable.

This kind of identifiability is stronger than any previous results in the literature, and considered sufficient in many applications. On the other hand, there are very special cases where a linear indeterminacy cannot be resolved, as shown by the following:

$Q_{i}(z_{i})=1$ or $Q_{i}(z_{i})=e^{-z_{i}^{2}}$ for all $i$ .

Then $A$ can not be reduced to a permutation matrix.

This Proposition stipulates that if the components are Gaussian (or exponential in the case of non-negative components) and only the location is changing, we can’t hope to reduce the matrix $A$ in $\sim_{A}$ to a permutation. In fact, to prove this in the Gaussian case, we simply consider orthogonal transformations of the latent variables, which all give rise to the same observational distribution with a simple adjustment of parameters.

3 Consistency of Estimation

The theory above further implies a consistency result on the VAE. If the variational distribution $q_{\phib}$ is a broad parametric family that includes the true posterior, then we have the following result.

The family of distributions $q_{\phib}(\mathbf{z}|\mathbf{x},\mathbf{u})$ contains $p_{\mathbf{f},\mathbf{T},\lambdab}(\mathbf{z}|\mathbf{x},\mathbf{u})$ .

We maximize $\mathcal{L}(\thetab,\phib)$ with respect to both $\thetab$ and $\phib$ .

then in the limit of infinite data, the VAE learns the true parameters $\thetab^{*}:=(\mathbf{f}^{*},\mathbf{T}^{*},\lambdab^{*})$ up to the equivalence class defined by $\sim$ in (13).

EXPERIMENTS

We run simulations on data used previously in the nonlinear ICA literature (Hyvärinen and Morioka,, 2016; Hyvärinen et al.,, 2019). We generate synthetic datasets where the sources are non-stationary Gaussian time-series: we divide the sources into $M$ segments of $L$ samples each. The conditioning variable $\mathbf{u}$ is the segment label, and its distribution is uniform on the integer set $[\![1,M]\!]$ . Within each segment, the conditional prior distribution is chosen from the family (7) for small $k$ . When $k=2$ , we used mean and variance modulated Gaussian distribution. When $k=1$ , we used variance modulated Gaussian or Laplace (to fall within the hypotheses of Theorem 3). The true parameters $\lambda_{i}$ were randomly and independently generated across the segments and the components from a non degenerate distributions to satisfy assumption (iv) of Theorem 1. Following Hyvärinen et al., (2019), we mix the sources using a multi-layer perceptron (MLP) and add small Gaussian noise.

Our estimates of the latent variables are generated from the variational posterior $q_{\phib}(\mathbf{z}|\mathbf{u},\mathbf{x})$ , for which we chose the following form: $q_{\phib}(\mathbf{z}|\mathbf{x},\mathbf{u})=\mathcal{N}\left(\mathbf{z}|\mathbf{g}(\mathbf{x},\mathbf{u};\phi_{\mathbf{g}}),\mathop{\bf diag}{\sigmab^{2}(\mathbf{x},\mathbf{u};\phi_{\sigmab})}\right)$ , a multivariate Gaussian with a diagonal covariance. The noise distribution $p_{\varepsilon}$ is Gaussian with small variance. The functional parameters of the decoder and the inference model, as well as the conditional prior are chosen to be MLPs. We use an Adam optimizer (Kingma and Ba,, 2014) to update the parameters of the network by maximizing $\mathcal{L}(\thetab,\phib)$ in equation (8). The data generation process as well as hyperparameter choices are detailed in Supplementary Material H.1.

To evaluate the performance of the method, we compute the mean correlation coefficient (MCC) between the original sources and the corresponding latents sampled from the learned posterior. To compute this performance metric, we first calculate all pairs of correlation coefficients between source and latent components. We then solve a linear sum assignment problem to assign each latent component to the source component that best correlates with it, thus reversing any permutations in the latent space. A high MCC means that we successfully identified the true parameters and recovered the true sources, up to point-wise transformations. This is a standard measure used in ICA.

First, we show a visualization of identifiability of iVAE in a 2D case in Figure 1, where we plot the original sources, observed data and the posterior distributions learned by our model, compared to a vanilla VAE. Our method recovers the original sources up to trivial indeterminacies (rotation and sign flip), whereas the VAE fails to do a good separation of the latent variables.

We compared the performance of iVAE to a vanilla VAE. We used the same network architecture for both models, with the sole exception of the addition of the conditional prior in iVAE . When the data is centered, the VAE prior is Gaussian or Laplace. We also compared the performance to two models from the disentanglement literature, namely a $\beta$ -VAE (Higgins et al.,, 2016) and a $\beta$ -TC-VAE (Chen et al.,, 2018). The parameter $\beta$ of the $\beta$ -VAE and the parameters $\alpha$ , $\beta$ and $\gamma$ for $\beta$ -TC-VAE were chosen by following the instructions of their respective authors. We trained these 4 models on the dataset described above, with $M=40$ , $L=1000$ , $d=5$ and $n\in$ . Figure 2(a) compares performances obtained from an optimal choice of parameters achieved by iVAE and the three models discussed above, when the dimension of the latent space equals the dimension of the data ( $n=d=5$ ). iVAE achieved an MCC score of above $95\%$ , whereas the other three models fail at finding a good estimation of the true parameters. We further investigated the impact of the latent dimension on the performance in Figure 2(b). iVAE has much higher correlations than the three other models, especially as the dimension increases. Further visualization are in Supplementary Material I.4.

Next, we compared our method to previous nonlinear ICA methods, namely TCL by Hyvärinen and Morioka, (2016), which is based on a self supervised classification task (see Supplementary Material G.1). We run simulations on the same dataset as Figure 2(a), where we varied the number of segments from 10 to 50. Our method slightly outperformed TCL in our experiments. The results are reported in Figure 3(a). Note that according to Hyvärinen et al., (2019), TCL performs best among previously proposed methods for this kind of data.

Finally, we wanted to show that our method is robust to some failure modes which occur in the context of self-supervised methods. The theory of TCL is premised on the notion that in order to accurately classify observations into their relative segments, the model must learn the true log-densities of sources within each segment. While such theory will hold in the limit of infinite data, we considered here a special case where accurate classification did not require learning the log-densities very precisely. This was achieved by generating synthetic data where $x_{2}$ alone contained sufficient information to perform classification, by making the mean of $x_{2}$ significantly modulated across segments; further details in Supplementary Material H.2. In such a setting, TCL is able to obtain high classification accuracy without unmixing observations, resulting in its failure to recover latent variables as reflected in Figure 3(b). In contrast, the proposed iVAE, by virtue of optimizing a maximum likelihood objective, does not suffer from such degenerate behaviour.

Further simulations on hyperparameter selection and discrete data are in Supplementary Material I.

2 Nonlinear causal discovery in fMRI

An important application of ICA methods is within the domain of causal discovery (Peters et al.,, 2017). The use of ICA methods in this domain is premised on the equivalence between a (nonlinear) ICA model and the corresponding structural equation model (SEM). Such a connection was initially exploited in the linear case (Shimizu et al.,, 2006) and extended to the nonlinear case by Monti et al., (2019) who employed TCL.

Briefly, consider data $\mathbf{x}=(x_{1},x_{2})$ . The goal is to establish if the causal direction is $x_{1}\rightarrow x_{2}$ , or $x_{2}\rightarrow x_{1}$ , or conclude that no (acyclic) causal relationship exists. Assuming $x_{1}\rightarrow x_{2}$ , then the problem can be described by the following SEM: $x_{1}=f_{1}(n_{1}),~{}x_{2}=f_{2}(x_{1},n_{2})$ where $\mathbf{f}=(f_{1},f_{2})$ is a (possibly nonlinear) mapping and $\mathbf{n}=(n_{1},n_{2})$ are latent disturbances that are assumed to be independent. The above SEM can be seen as a nonlinear ICA model where latent disturbances, $\mathbf{n}$ , are the sources. As such, we may perform causal discovery by first recovering latent disturbances (using TCL or iVAE) and then running a series of independence tests. Formally, if $x_{1}\rightarrow x_{2}$ then, denoting statistical independence by $\perp\mkern-9.5mu\perp$ , it suffices to verify that $x_{1}\perp\mkern-9.5mu\perp n_{2}$ whereas $x_{1}\mathchoice{\mathrel{\hbox to0.0pt{\kern 7.63889pt\kern-5.27776pt$ \displaystyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 7.63889pt\kern-5.27776pt$ \textstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 2.61118pt\kern-4.11108pt$ \scriptstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 1.11118pt\kern-3.3333pt$ \scriptscriptstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}n_{1}$ , $x_{2}\mathchoice{\mathrel{\hbox to0.0pt{\kern 7.63889pt\kern-5.27776pt$ \displaystyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 7.63889pt\kern-5.27776pt$ \textstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 2.61118pt\kern-4.11108pt$ \scriptstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 1.11118pt\kern-3.3333pt$ \scriptscriptstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}n_{1}$ and $x_{2}\mathchoice{\mathrel{\hbox to0.0pt{\kern 7.63889pt\kern-5.27776pt$ \displaystyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 7.63889pt\kern-5.27776pt$ \textstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 2.61118pt\kern-4.11108pt$ \scriptstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}{\mathrel{\hbox to0.0pt{\kern 1.11118pt\kern-3.3333pt$ \scriptscriptstyle\not $\hss}{\perp\mkern-9.5mu\perp}}}n_{2}$ . Such an approach can be extended beyond two-dimensional observations as described in Monti et al., (2019).

To demonstrate the benefits of iVAE as compared to TCL, both algorithms were employed to learn causal structure from fMRI data (details in Supplementary Material I.3). The recovered causal graphs are shown in Figure 4. Blue edges are anatomically feasible whilst red edges are not. There is significant overlap between the estimated causal networks, but in the case of iVAE both anatomically incorrect edges correspond to indirect causal effects. This is in contrast with TCL where incorrect edges are incompatible with anatomical structure and cannot be explained as indirect effects.

CONCLUSION

Unsupervised learning can have many different goals, such as: $(i)$ approximate the data distribution, $(ii)$ generate new samples, $(iii)$ learn useful features, and above all $(iv)$ learn the original latent code that generated the data (identifiability). Deep latent-variable models typically implemented by VAEs are an excellent framework to achieve $(i)$ , and are thus our first building block. The nonlinear ICA model discussed in section 3.4 is the only existing framework to provably achieve $(iv)$ . We bring these two pieces together to create our new model termed iVAE . In particular, this is the first rigorous proof of identifiability in the context of VAEs. Our model in fact checks all the four boxes above that are desired in unsupervised learning.

The advantage of the new framework over typical deep latent-variable models used with VAEs is that we actually recover the original latents, thus providing principled disentanglement. On the other hand, the advantages of this algorithm for solving nonlinear ICA over Hyvärinen et al., (2019) are several; briefly, we significantly strengthen the identifiability results, we obtain the likelihood and can use MLE, we learn a forward model as well and can generate new data, and we consider the more general cases of noisy data with fewer components.

Corrigendum

The published version of this paper claimed that the identifiability result extends to the case with discrete observations. Unfortunately, after publication, we found an error in the proof for the discrete case, so we can no longer claim that this is the case. We suspect that the discrete case requires a different type of proof, which we leave for future work. To remove incorrect claims, we made minimal changes to the abstract, Section 3.1 and Section 6; we also rewrote Appendix C which contained the erroneous proof. We do still provide experiments in Supplementary Material I that strongly suggest that identifiability is achievable in such setting.

Appendix A LEMMAS FOR THE EXPONENTIAL FAMILIES

A univariate exponential family is a set of distributions whose probability density function can be written as

A.2 Strongly exponential distributions

In other words, the density of a strongly exponential distribution has almost surely the exponential component in its expression and can only be reduced to the base measure on a set of measure zero.

The strongly exponential condition is very general, and is satisfied by all the usual exponential family distributions like the Gaussian, Laplace, Pareto, Chi-squared, Gamma, Beta, etc.

We will now give useful Lemmas that will be used in the proofs of the technical Theorems.

Consider a strongly exponential distribution of size $k\geq 2$ with sufficient statistic $\mathbf{T}$ . Further assume that $\mathbf{T}$ is twice differentiable almost everywhere. Then

We will give now an example of an exponential family distribution that is not strongly exponential.

Consider an exponential family distribution with density function

Appendix B PROOFS

The binary relations $\sim_{A}$ and $\sim_{P}$ are equivalence relations on $\Theta$ .

The following proof applies to both $\sim_{A}$ and $\sim_{P}$ which we will simply denote by $\sim$ .

B.2 Proof of Theorem 1

The proof of this Theorem is done in three steps.

In the first step, we use a simple convolutional trick made possible by assumption (i), to transform the equality of observed data distributions into equality of noiseless distributions. In other words, it simplifies the noisy case into a noiseless case. This step results in equation $\eqref{eq:php}$ .

The second step consists of removing all terms that are either a function of observations $\mathbf{x}$ or auxiliary variables $\mathbf{u}$ . This is done by introducing the points provided by assumption (iv), and using $\mathbf{u}_{0}$ as a "pivot". This is simply done in equations (29)-(32).

The last step of the proof is slightly technical. The the goal is to show that the linear transformation is invertible thus resulting in an equivalence relation. This is where we use assumption (iii).

B.2.2 Proof

on the left hand side, and similarly on the right hand side.

in equation (24), we used $*$ for the convolution operator.

in equation (25), we used $F[.]$ to designate the Fourier transform, and where $\varphi_{\varepsilon}=F[p_{\varepsilon}]$ (by definition of the characteristic function).

in equation (26), we dropped $\varphi_{\varepsilon}(\omega)$ from both sides as it is non-zero almost everywhere (by assumption (i)).

Equation (27) is valid for all $(\mathbf{x},\mathbf{u})\in\mathcal{X}\times\mathcal{U}$ . What is basically says is that for the distributions to be the same after adding the noise, the noise-free distributions have to be the same. Note that $\mathbf{x}$ here is a general variable and we are actually dealing with the noise-free probability densities.

By taking the logarithm on both sides of equation (27) and replacing $p_{\mathbf{T},\lambdab}$ by its expression from (7), we get:

Let $\mathbf{u}_{0},\dots,\mathbf{u}_{nk}$ be the points provided by assumption (iv) of the Theorem, and define $\overline{\lambdab}(\mathbf{u})=\lambdab(\mathbf{u})-\lambdab(\mathbf{u}_{0})$ . We plug each of those $\mathbf{u}_{l}$ in (29) to obtain $nk+1$ such equations. We subtract the first equation for $\mathbf{u}_{0}$ from the remaining $nk$ equations to get for $l=1,\dots,nk$ :

We multiply both sides of (31) by the transpose of the inverse of $L^{T}$ from the left to find:

If $k=1$ , then this means that $A$ is invertible (because $A$ is $n\times n$ ).

Moreover, we have the following observations:

B.2.3 Understanding assumption (iv) in Theorem 1

We next gvive a simple example where this assumption always holds. Suppose $n=2$ and $k=1$ , and that the auxiliary variable is a positive scalar. Consider sources $z_{i}\sim\mathcal{N}(0,\lambda_{i}(u))$ that are distributed according to Gaussian distributions with zero mean and variances modulated as follows:

Because the functions $u\mapsto u$ and $u\mapsto u^{2}$ are linearly independent (as functions), then for any choice of "pivot" point $u_{0}$ , for instance $u_{0}=1$ , and any choice of distinct non-zero scalars $u_{1}$ and $u_{2}$ , the columns of the matrix $L:=(\lambdab(u_{1})-1,\lambdab(u_{2})-1)$ are linearly independent, and the matrix is invertible.

B.3 Proof of Theorem 2

The proof of this Theorem is done in two main steps.

B.3.2 Proof

In this Theorem we suppose that $k\geq 2$ . The assumptions of Theorem 1 hold, and so we have

and by differentiating (37) with respect to $z_{t},t>s$ :

Now by Lemma 5, we know that $\overline{e}(\mathbf{z})$ has rank $2n$ almost surely on $\mathcal{Z}$ . Since A is invertible, it is full rank, and thus $\operatorname*{rank}(\overline{e}\left(\mathbf{z})A\right)=2n$ almost surely on $\mathcal{Z}$ . It suffices then to multiply by its pseudo-inverse from the right to get

Let $D=A^{-1}$ . The last equation is valid for every component:

By differentiating both sides with respect to $z_{s}$ where $s\neq i$ we get

By Lemma 1, we get $D_{i,l,s,b}=0$ for all $1\leq b\leq k$ . Since (43) is valid for all $l$ and all $s\neq i$ , we deduce that the matrix $D$ has a block diagonal form:

We conclude that $A$ has the same block diagonal form. Each block $i$ transforms $\mathbf{T}_{i}(\mathbf{z})$ into $\overline{\mathbf{T}}_{i}(\mathbf{z})$ , which achieves the proof. $\square$

B.4 Proof of Theorem 3

This proof uses concepts borrowed from differential geometry. A good reference is the monograph by Lee, (2003).

Intuitively, since $T_{i}$ is not monotonic, it admits a local extremum (supposed to be a minimum). By working locally around this minimum, we can suppose that it is global and attained at a unique point $y_{i}$ . The smoothness condition on $\mathbf{v}$ imply that the manifold where $T_{i}\circ v_{i}$ is minimized has dimension $n-1$ . This is where we need assumption (3.ii) of the Theorem.

On the other hand, because of the separability in the sum, each non constant $h_{i,k}$ (minimized as a consequence of minimizing $T_{i}\circ v_{i}$ ) introduces a constraint on this manifold that reduces its dimension by $1$ . That’s why we can only have one non constant $h_{i,k}$ for each $i$ .

B.4.2 Proof

for which we have $m_{i}=\sum_{a}\mu_{i,a}$ .

B.5 Proof of Proposition 1

B.6 Proof of Theorem 4

If the family $q_{\phib}(\mathbf{z}|\mathbf{x},\mathbf{u})$ is large enough to include $p_{\thetab}(\mathbf{z}|\mathbf{x},\mathbf{u})$ , then by optimizing the loss over its parameter $\phib$ , we will minimize the KL term, eventually reaching zero, and the loss will be equal to the log-likelihood. The VAE in this case inherits all the properties of maximum likelihood estimation. In this particular case, since our identifiability is guaranteed up to equivalence classes, the consistency of MLE means that we converge to the equivalence classthis is easy to show: because true identifiability is one of the assumptions for MLE consistency, replacing it by identifiability up to equivalence class doesn’t change the proof but only the conclusion. (Theorem 1) of true parameter $\thetab^{*}$ i.e. in the limit of infinite data. $\square$

Appendix C DISCRETE OBSERVATIONS

We can use a well-known logistic model to replace the additive Gaussian noise to model discrete observations. For example, in the binary case, let:

where $\text{sigmoid}()$ is the element-wise sigmoid nonlinearity.

However, by the very nature of discrete variables, the mapping $\mathbf{z}\rightarrow\mathbf{x}$ can no longer be injective. This is one of the key assumptions in our identifiability theory, which can no longer hold. The discrete observation case requires a bespoke identifiability proof. Nevertheless, we provide experiments in Supplementary Material I that strongly suggest that identifiability is achievable in such setting.

Appendix D UNIDENTIFIABILITY OF GENERATIVE MODELS WITH UNCONDITIONAL PRIOR

In this section, we present two well-known proofs of unidentifiability of generative models. The first proof is simpler and considers factorial priors, which are widely-used in deep generative models and the VAE literature. The second proof is extremely general, and shows how any random vector can be transformed into independent components, in particular components which are standardized Gaussian. Thus, we see how in the general nonlinear case, there is little hope of finding the original latent variables based on the (unconditional, marginal) statistics of $\mathbf{x}$ alone.

Let us start with factorial, Gaussian priors. In other words, let $\mathbf{z}\sim p_{\thetab}(\mathbf{z})=N(\mathbf{0},\mathbf{I})$ . Now, a well-known result says that any orthogonal transformation of $\mathbf{z}$ has exactly the same distribution. Thus, we could transform the latent variable by any orthogonal transformation $\mathbf{z}^{\prime}=M\mathbf{z}$ , and cancel that transformation in $p(\mathbf{x}|\mathbf{z})$ (e.g. in the first layer of the neural network), and we would get exactly the same observed data (and thus obviously the same distribution of observed data) with $\mathbf{z}^{\prime}$ .

where we have used the fact that the determinant of an orthogonal matrix is equal to unity.

This result applies easily to any factorial prior. For $z_{i}$ of any distribution, we can transform it to a uniform distribution by $F_{i}(z_{i})$ where $F_{i}$ is the cumulative distribution function of $z_{i}$ . Next, we can transform it into standardized Gaussian by $\Phi^{-1}(F_{i}(z_{i}))$ where $\Phi$ is the standardized Gaussian cdf. After this transformation, we can again take any orthogonal transformation without changing the distribution. And we can even transform back to the same marginal distributions by $F_{i}^{-1}(\Phi(.))$ . Thus, the original latents are not identifiable.

D.2 General priors

The second proof comes from the theory of nonlinear ICA (Hyvärinen and Pajunen,, 1999), from which the following Theorem is adapted.

The proof is based on an iterative procedure reminiscent of Gram-Schmidt, where a new variable can always be transformed to be independent of any previously considered variables, which is why $z_{1}$ is essentially unchanged.

This Theorem means that there are infinitely many ways of defining independent components $\mathbf{z}$ that nonlinearly generated an observation $\mathbf{x}$ . This is because we can first transform $\mathbf{z}$ any way we like and then apply the Theorem. The arbitrariness of the components is seen in the fact that we will always find that one arbitrary chosen variable in the transformation is one of the independent components. This is in some sense an alternative kind of indeterminacy to the one in the previous subsection.

In particular, we can even apply this Theorem on the observed data, taking $\mathbf{x}$ instead of $\mathbf{z}$ . Then, in the case of factorial priors, just permuting the data variables, we would arrive at the conclusion that any of the $x_{i}$ can be taken to be one of the independent components, which is absurd.

Now, to apply this theory in the case of a general prior on $\mathbf{z}$ , it is enough to point out that we can transform any variable into independent Gaussian variables, apply any orthogonal transformation, then invert the transformation in the Theorem, and we get a nonlinear transformation $\mathbf{z}^{\prime}=\mathbf{g}^{-1}(M\mathbf{g}(\mathbf{z}))$ which has exactly the same distribution as $\mathbf{z}$ but is a complex nonlinear transformation. Thus, no matter what the prior may be, by looking at the data alone, it is not possible to recover the true latents based an unconditional prior distribution, in the general nonlinear case.

Appendix E ALTERNATIVE FORMULATION OF THEOREM 1

Assume that we observe data sampled from a generative model defined according to (5)-(7), with parameters $(\mathbf{f},\mathbf{T},\lambdab)$ . Assume the following holds:

The mixing function $\mathbf{f}$ in (6) is injective.

The sufficient statistics $T_{i,j}$ in (7) are differentiable almost everywhere, and $T_{i,j}^{\prime}\neq 0$ almost everywhere for all $1\leq i\leq n$ and $1\leq j\leq k$ .

$\lambdab$ is differentiable, and there exists $\mathbf{u}_{0}\in\mathcal{U}$ such that $J_{\lambdab}(\mathbf{u}_{0})$ is invertible.

Proof: The start of the proof is similar to the proof of Theorem 1. When we get to equation (29):

By evaluating both sides at $\mathbf{u}_{0}$ provided by assumption (iv), and multiplying both sides by $J_{\lambdab}(\mathbf{u}_{0})^{-T}$ (invertible by hypothesis), we find:

Appendix F LINK BETWEEN MAXIMUM LIKELIHOOD AND TOTAL CORRELATION

where the components of the latent variable are independent given the auxiliary variable $\mathbf{u}$ . We can relate the log-likelihood of the data to the total correlation of the latent variables. To see this connection, let’s use the change of variable formula in the expression of the log-likelihood:

where $H(z_{i}|\mathbf{u})$ is the conditional differential entropy of $z_{i}$ given $\mathbf{u}$ . The same change of variable formula applied to $H(\mathbf{x}|\mathbf{u})$ yields:

which we then use in the expression of the conditional total correlation:

Putting equations (64) and (66) together, we get:

The last term in this equation is a function of the data only and is thus a constant. An algorithm which learns to maximize the data likelihood is decreasing the total correlation of the latent variable. The total correlation is measure of independence as it is equal to zero if and only if the components of the latent variable are independent. Thus, by using a VAE to maximize a lower bound on the data likelihood, we are trying to learn an estimate of the inverse of the mixing function that gives the most independent components.

Appendix G REMARKS ON PREVIOUS WORK

Time Contrastive Learning (TCL) introduced in Hyvärinen and Morioka, (2016) is a method for nonlinear ICA based on the assumption that while the sources are independent, they are also non-stationary time series. This implies that they can be divided into known non-overlapping segments, such that their distributions vary across segments. The non-stationarity is supposed to be slow compared to the sampling rate, so that we can consider the distributions within each segment to be unchanged over time; resulting in a piecewise stationary distribution across segments. Formally, given a segment index $\tau\in\mathcal{T}$ , where $\mathcal{T}$ is a finite set of indices, the distribution of the sources within that segment is modelled as an exponential family, which is in their notation:

where $q_{j,0}$ is a stationary baseline and $q$ is the sufficient statistic for the exponential family of the sources (note that exponential families with different sufficient statistics for each source, or more than one sufficient statistic per source are allowed, but we focus on this simpler case here). Note that parameters $\lambda_{j}$ depend on the segment index, indicating that the distribution of sources changes across segments. It follows from equation (11) that the observations are piece-wise stationary.

TCL recovers the inverse transformation $\mathbf{f}^{-1}$ by self-supervised learning, where the goal is to classify original data points against segment indices in a multinomial classification task. To this end, TCL employed a deep network consisting of a feature extractor $h(\mathbf{x}^{(i)};\eta)$ with parameters $\eta$ in the form of a neural network, followed by a final classifying layer (e.g. softmax). The theory of TCL, as stated in Theorem 1 of Hyvärinen and Morioka, (2016), is premised on the fact that in order to optimally classify observations into their corresponding segments the feature extractor, $h(\mathbf{x}^{(i)};\eta)$ , must learn about the changes in the underlying distribution of latent sources. The theory shows that the method can learn the independent components up to transformations by sufficient statistics and a linear transformation, as in $\sim_{A}$ identifiability. It is further proposed that a linear ICA can recover the final $A$ if the number of segments grows infinite and the segment distributions are random in a certain sense, but this latter assumption is unrealistic in applications where the number of segments is small. We also emphasize that our estimation method based on VAE is very different from such a self-supervised scheme.

A more recent development in nonlinear ICA is given by Hyvärinen et al., (2019) where it is assumed that we observe data following a noiseless conditional nonlinear ICA case:

This formulation is so general that it subsumes previous models by Hyvärinen and Morioka, (2016, 2017) in the sense of the data model. However, their estimation method is very different from TCL: They rely on a self-supervised binary discrimination task based on randomization to learn the unmixing function. More specifically, from a dataset of observations and auxiliary variables pairs $\mathcal{D}=\{\mathbf{x}^{(i)},\mathbf{u}^{(i)}\}$ , they construct a randomized dataset $\mathcal{D^{*}}=\{\mathbf{x}^{(i)},\mathbf{u}^{*}\}$ where $\mathbf{u}^{*}$ is randomly drawn from the observed distribution of $\mathbf{u}$ . To distinguish between both datasets, a deep logistic regression is used. The last hidden layer of the neural network is a feature extractor denoted $\mathbf{h}(\mathbf{x})$ ; like in TCL, the purpose of the feature extractor is therefore to extract the relevant features which will allow to distinguish between the two datasets. The identifiability results by Hyvärinen et al., (2019) have a lot of similarity to ours, and several of our proofs are inspired by them. However, we strengthen those results, while concentrating on the case of exponential family models. In particular, we show how any non-monotonic sufficient statistics for $k=1$ leads to identifiability in Theorem 3, and also Theorem 2 generalizes the corresponding result (Theorem 2, case 2) in Hyvärinen et al., (2019). Again, their estimation method is completely different from ours.

G.2 Previous work on identifiability in VAEs

Our framework might look similar to semi-supervised learning methods in the VAE context, due to the inclusion of the auxiliary variable $\mathbf{u}$ . However, the auxiliary variable $\mathbf{u}$ can play a more general role. For instance, in time-series, it can simply be the the time index or history; in audiovisual data, it can be either one of the modalities, where the other is used as an observation. More importantly, and to our knowledge, there is no proof of identifiability in the semi-supervised literature.

The question of identifiability, or lack of, in deep latent variable models especially VAEs has been tackled in work related to disentanglement. In Mathieu et al., (2018); Rolinek et al., (2018); Locatello et al., (2018) the authors show how isotropic priors lead to rotation invariance in the ELBO. We proved here (section 2.3 and supplementary material D) a much more general result: unconditional priors lead to unidentifiable models. These papers however focused on showcasing this problem, or how it can avoided in practice, and didn’t provide alternative models that can be shown to be identifiable. This is what we try to achieve in this work, to provide a complementary analysis to previous research. Our proof of identifiability applies to the generative model itself, regardless of the estimation method. This is why we didn’t focus in our analysis on the role of the encoder, which has been claimed to have a central role in some of the work cited above.

Appendix H SIMULATION DETAILS

We give here more detail on the data generation process for our simulations. The dataset is described in section 5.1. The conditioning variable $\mathbf{u}$ is the segment label, and its distribution is uniform on the integer set $[\![1,M]\!]$ . Within each segment, the conditional prior distribution is chosen from the family (7), where $k=1$ , $T_{i,1}(z_{i})=z_{i}^{2}$ and $Q_{i}(z_{i})=1$ , and the true $\lambda_{i}$ were randomly and independently generated across the segments and the components so that the variances have a uniform distribution on $[.5,3]$ . We sample latent variable $\mathbf{z}$ from these distribution, and then mix them using a 4-layer multi-layer perceptron (MLP). An example of what the sources look like is plotted in Figure 5(a). We finally add small noise ( $\sigma^{2}=0.01$ ) to the observations. When comparing to previous ICA methods, we omit this step, as these methods are for the noiseless case.

For the decoder (6), we chose $p_{\varepsilon}=\mathcal{N}\left(0,\sigma^{2}I\right)$ a zero mean Gaussian, where the scalar $\sigma^{2}$ controls the noise level. We fix the noise level $\sigma^{2}=0.01$ . As for the inference model, we let $q_{\phib}(\mathbf{z}|\mathbf{x},\mathbf{u})=\mathcal{N}\left(\mathbf{z}|\mathbf{g}(\mathbf{x},\mathbf{u};\phi_{\mathbf{g}}),\mathop{\bf diag}{\sigmab^{2}(\mathbf{x},\mathbf{u};\phi_{\sigmab})}\right)$ be a multivariate Gaussian with a diagonal covariance. The functional parameters of the decoder ( $\mathbf{f}$ ) and the inference model ( $\mathbf{g}$ , $\sigmab^{2}$ ) as well as the conditional prior ( $\lambdab$ ) are chosen to be MLPs, where the dimension of the hidden layers is varied between $10$ and $200$ , the activation function is a leaky ReLU, and the number of layers is chosen from $\{3,4,5,6\}$ . Mini-batches are of size 64, and the learning rate of the Adam optimizer is chosen from $\{0.01,0.001\}$ . We also use a scheduler to decay the learning rate as a function of epochs.

To implement the VAE, we followed Kingma and Welling, (2013). We made sure the range of the hyperparameters (mainly number of layers and dimension of hidden layers) of the VAE is large enough for it to be comparable in complexity to our method (which has the extra $\lambda$ network to learn). To implement a $\beta$ -VAE, we followed the instructions of Higgins et al., (2016) for the choice of hyperparameter $\beta$ , which was chosen in the set $ $. Similarly, we followed Chen et al., (2018) for the choice of the hyperparameters$ \alpha $,$ \beta $and$ \gamma $when implementing a$ \beta $-TC-VAE: we chose$ \alpha=\gamma=1 $and$ \beta $was chosen in the set$ $.

H.2 Description of significant mean modulated data

Appendix I FURTHER EXPERIMENTS

As discussed in section 3.4, our estimation method has many benefits over previously proposed self-supervised nonlinear ICA methods: it allows for dimensionality reduction, latent dimension selection based on the ELBO as a cross validation metric, and solving discrete ICA. We performed a series of simulations to test these claims.

To further test the capabilities of our method, we tested it on discrete data, and compared its identifiability performance to a vanilla VAE. The dimensions of the data and latents are $d=100$ and $n=10$ . The results are shown in Figure 6(a) and proves that our method is capable of performing discrete ICA.

The examples in section 5.1 already showcased dimensionality reduction. In Figure 2(b) for example, we have a mismatch between the dimensions of the latents and observations. In real world ICA applications, we usually don’t know the dimension of the latents beforehand. One way to guess it is to use the ELBO as a proxy to select the dimension. Our method enables this when compared to previous nonlinear ICA methods like TCL (Hyvärinen and Morioka,, 2016). This is showcased in Figure 6(b), where the real dimensions of the simulated data are $d^{*}=80$ and $n^{*}=15$ , and we run multiple experiments where we vary the latent dimensions between 2 and 40. We can see that the ELBO can be a good proxy for dimension selection, since it has a "knee" around the right value of dimension.

One important benefit of the proposed method is that it seeks to optimize an objective function derived from the marginal log-likelihood of observations. As such, it follows that we may employ the ELBO to perform hyperparameter selection. To verify this claim, we run experiments for various distinct choices of hyperparameters (for example the dimension of hidden layers, number of hidden layers in the estimation network, learning rate, nonlinearities) on a synthetic dataset. Results are provided in Figure 6(c) which serves to empirically demonstrate that the ELBO is indeed a good proxy for how accurately we are able to recover the true latent variables. In contrast, alternative methods for nonlinear ICA, such as TCL, do not provide principled and reliable proxies which reflect the accuracy of estimated latent sources.

I.2 Additional causality experiments for comparison to TCL

The data generation process used by (Monti et al.,, 2019, Section 4) is similar to the one we described in section 5.1, with the difference that the mixing should be in such a way that we get an acyclic causal relationship between the observations. This can be achieved by ensuring weight matrices in the mixing network are all lower-triangular, thereby introducing acyclic causal structure over observations.

We seek to compare iVAE and TCL in the context of causal discovery, as described in Section 5.2. Such an approach involves a two-step procedure whereby first either TCL or iVAE are employed to recover latent disturbances, followed by a series of independence tests. Throughout all causal discovery experiments we employ HSIC as a general test of statistical independence (Gretton et al.,, 2005). When comparing iVAE and TCL in this setting we report the proportion of times the correct causal direction is reported. It is important to note that the aforementioned testing procedure can produce one of three decisions: $x_{1}\rightarrow x_{2}$ , $x_{2}\rightarrow x_{1}$ or a third decision which states that no acyclic causal direction can be determined. The first two outcomes correspond to identifying causal structure and will occur when we fail to reject the null hypothesis in only one of the four tests. Whereas the third decision (no evidence of acyclic causal structure) will be reported when either there is evidence to reject the null in all four tests or we fail to reject the null more than once. Typically, this will occur if the nonlinear unmixing has failed to accurately recover the true latent sources. The results are reported in Figure 7(a) where we note that both TCL and iVAE perform comparably.

As a further experiment, we consider causal discovery in the scenario where one or both of the underlying sources demonstrate a significant mean modulation as shown in Figure 5. In such a setting the surrogate classification problem which is solved as part of TCL training becomes significantly easier, to the extent that TCL no longer needs to learn an accurate representation of the log-density of sources within each segment. This is to the detriment of TCL as it implies that it cannot accurately recover latent sources and therefore fails at the task of causal discovery. This can be seen in Figure 7(b) where TCL based causal discovery fails whereas iVAE continues to perform well. This is a result of the fact that iVAE directly optimizes the log-likelihood as opposed to a surrogate classification problem. Moreover, Figure 7(c) visualizes the mean classification accuracy for TCL as a function of the number of segments. We note that TCL consistently obtains classification accuracy that are significantly better that random classification. This provides evidence that the poor performance of TCL in the context of data with significant mean modulations is not a result of sub-optimal optimisation but are instead a a negative consequence of TCL’s reliance on solving a surrogate classification problem to perform nonlinear unmixing.

I.3 Real data experiments

Here we provide further details relating to the resting-state Hippocampal data provided by Poldrack et al., (2015) and studied in Section 5.2, closely following the earlier causal work using TCL by Monti et al., (2019). The data corresponds to daily fMRI scans from a single individual (Caucasian male, aged 45) collected over a period of 84 successive days. We consider data collected from each day as corresponding to a distinct segment, encoded in $\mathbf{u}$ . Within each day 518 BOLD observations are provided across the following six brain regions: perirhinal cortex (PRc), parahippocampal cortex (PHc), entorhinal cortex (ERc), subiculum (Sub), CA1 and CA3/Dentate Gyrus (DG).

I.4 Additional visualisations for comparison to VAE variants

As a further visualization we show in Figures 8 and 9 the recovered latents for VAE and iVAE ; we sampled a random (contiguous) subset of the sources from the dataset, and compared them to the recovered latents (after inverting any permutation in the components). We can see that iVAE has an excellent estimation of the original sources compared to VAE (other models were almost indistinguishable from vanilla VAE).

Appendix J ACKNOWLEDGEMENT

I.K. and R.P.M. were supported by the Gatsby Charitable Foundation. A.H. was supported by a Fellowship from CIFAR, and from the DATAIA convergence institute as part of the "Programme d’Investissement d’Avenir", (ANR-17-CONV-0003) operated by Inria.