Bayesian Models of Graphs, Arrays and Other Exchangeable Random Structures

Peter Orbanz, Daniel M. Roy

I Introduction

For data represented by exchangeable sequences, Bayesian nonparametrics has developed into a flexible and powerful toolbox of models and algorithms. Its modeling primitives—Dirichlet processes, Gaussian processes, etc.—are widely applied and well-understood, and can be used as components in hierarchical models or dependent models to address a wide variety of data analysis problems. One of the main challenges for Bayesian statistics and machine learning is arguably to extend this toolbox to the analysis of data sets with additional structure, such as graph, network, and relational data.

In this article, we consider structured data—sequences, graphs, trees, matrices, etc.—and ask:

What is the appropriate class of statistical models for a given type of structured data?

Representation theorems for exchangeable random structures lead us to an answer, and they do so in a very precise way: They characterize the class of possible Bayesian models for the given type of data, show how these models are parametrized, and even provide basic convergence guarantees. The probability literature provides such results for dozens of exchangeable random structures, including sequences, graphs, partitions, arrays, trees, etc. The purpose of this article is to explain how to interpret these results and how to translate them into a statistical modeling approach.

Consider a data analysis problem in which observations are represented as edges in a graph. As we observe more data, the graph grows. For statistical purposes, we might model the graph as a sample from some probability distribution on graphs. Can we estimate the distribution or, at least, some of its properties? If we observed multiple graphs, all sampled independently from the same distribution, we would be in the standard setting of statistical inference. For many problems in e.g. network analysis, that is clearly not the approach we are looking for: There is just one graph, and so our sample size should relate to the size of the graph. Of course, we could assume that the edges are independent and identically distributed random variables, and estimate their distribution—that would indeed be a way of performing inference within a single graph. The resulting model, however, is sensitive only to the number of edges in a graph and so has only a single parameter ${p\in}$ , the probability that an edge is present. What are more expressive models for graph-valued data?

Compare the problem to a more familiar one, where data is represented by a random sequence ${X:=(X_{1},X_{2},\ldots)}$ , whose elements take values in a sample space $\mathbf{X}$ , and $n$ observations are interpreted as the values of the initial elements ${X_{1},\ldots,X_{n}}$ of the sequence. In this case, a statistical model is a family ${\mathcal{P}=\{P_{\theta}|\theta\in\mathbf{T}\}}$ of distributions on $\mathbf{X}$ , indexed by elements $\theta$ of some parameter space $\mathbf{T}$ . If the sequence $X$ is exchangeable, de Finetti’s theorem tells us that there is some model $\mathcal{P}$ and some distribution $\nu$ on $\mathbf{T}$ such that the joint distribution of $X$ is

This means the sequence can be generated by first generating a random parameter value ${\Theta\sim\nu}$ , and then sampling ${X_{1},X_{2},\ldots|\Theta\sim_{\mbox{\tiny iid}}P_{\Theta}}$ . In particular, the elements of the sequence are conditionally i.i.d. given $\Theta$ .

For the purposes of statistical inference, the (conditional) i.i.d. structure implies that we can regard the elements $X_{i}$ of the sequence as repetitive samples from an unknown distribution $P_{\Theta}$ , and pool these observations to extract information about the value of $\Theta$ . This may be done using a Bayesian approach, by making a modeling assumption on $\nu$ (i.e., by defining a prior distribution) and computing the posterior given data, or in a frequentist way, by assuming $\theta$ is non-random and deriving a suitable estimator.

To generalize this idea to the graph case, we regard the infinite sequence ${X:=(X_{1},X_{2},\ldots)}$ as an infinite random structure, of which we observe a finite substructure (the initial segment ${X_{1},\ldots,X_{n}}$ ). From this perspective, de Finetti’s theorem tells us how to break down a random structure (the sequence) into components (the conditionally independent elements), which in turn permits the definition of statistical inference procedures. What if, instead of an infinite sequence, $X$ is an infinite graph, of which we observe a finite subgraph? To mimic the sequence case, we would need a definition of exchangeability applicable to graphs, and a suitable representation theorem to substitute for de Finetti’s.

Since generating a parameter ${\Theta\sim\nu}$ randomly determines a distribution $P_{\Theta}$ , we can think of $P_{\Theta}$ as a random probability measure with distribution defined by $\nu$ , and paraphrase de Finetti’s theorem as follows:

The joint distribution of any exchangeable sequence of random values in $\mathbf{X}$ is characterized by the distribution of a random probability measure on $\mathbf{X}$ .

If we assume a random graph to be exchangeable—where we put off the precise definition for now—it is indeed also characterized by a representation theorem. The implications for statistical models are perhaps more surprising than in the case of sequences:

The distribution of any exchangeable graph is characterized by a distribution on the space of functions from $^{2}$ to $$.

Hence, any specific function ${w:^{2}\rightarrow}$ defines a distribution $P_{w}$ on graphs (we will see in Section III how we can sample a graph from $P_{w}$ ).

For modeling purposes, this means that any statistical model of exchangeable graphs is a family of such distributions $P_{w}$ , and $w$ can be regarded as the model parameter. Density estimation in exchangeable graphs can therefore be formulated as a regression problem: It is equivalent to recovering the function $w$ from data. Once again, we can choose a frequentist approach (define an estimator for $w$ ) or a Bayesian approach (define a prior distribution on a random function $W$ ); we can obtain nonparametric models by choosing infinite-dimensional subspaces of functions, or parametric models by keeping the dimension finite.

Since a graph can be regarded as a special type of matrix (the adjacency matrix), we can ask more generally for models of exchangeable matrices, and obtain a similar result:

There is a wide variety of random structures for which exchangeability can be defined; Table I lists some important examples. Borrowing language from , we collectively refer to such random objects as exchangeable random structures. This article explains representation theorems for exchangeable random structures and their implications for Bayesian statistics and machine learning. The overarching theme is that key aspects of de Finetti’s theorem can be generalized to many types of data, and that these results are directly applicable to the derivation and interpretation of statistical models.

reviews exchangeable random structures, their representation theorems, and the role of such theorems in Bayesian statistics.

introduces the generalization of de Finetti’s theorem to models of graph- and matrix-valued data, the Aldous-Hoover theorem, and explains how Bayesian models of such data can be constructed.

surveys models of graph- and relational data available in the machine learning and statistics literature. Using the Aldous-Hoover representation, models can be classified and some close connections emerge between models which seem, at first glance, only loosely related.

describes recent development in the mathematical theory of graph limits. The results of this theory refine the Aldous-Hoover representation of graphs and provide a precise understanding of how graphs converge and how random graph models are parametrized.

explains the general Aldous-Hoover representation for higher-order arrays.

discusses sparse random structures and networks, why these models contradict exchangeability, and open questions arising from this contradiction.

II Bayesian Models of Exchangeable Structures

The fundamental Bayesian modeling paradigm based on exchangeable sequences can be extended to a very general approach, where data is represented by a random structure. Exchangeability properties are then used to deduce valid statistical models and useful parametrizations. This section sketches out the ideas underlying this approach, before we focus on graphs, matrices, and arrays in Section III.

The simplest example of an exchangeable random structure is an exchangeable sequence. We use the customary shorthand notation ${(x_{i}):=(x_{1},x_{2},\ldots)}$ for a sequence, and similarly $(x_{ij})$ for a matrix, etc. Suppose $(X_{i})$ is an infinite sequence of random variables in a sample space $\mathbf{X}$ . We call $(X_{i})$ exchangeable if its joint distribution satisfies

or even $(X_{i})\overset{\mbox{\tiny d}}{=}(X_{\pi(i)})$ , where the notation $Y\overset{\mbox{\tiny d}}{=}Z$ means that the random variables $Y$ and $Z$ have the same distribution. Informally, exchangeability means that the probability of observing a particular sequence does not depend on the order of the elements in the sequence.

If the elements of a sequence are exchangeable, de Finetti’s representation theorem implies they are conditionally i.i.d. The conditional independence structure is represented by a random probability measure, a random variable with values in the set $\mathbf{M}(\mathbf{X})$ of probability distributions on $\mathbf{X}$ .

Let ${(X_{1},X_{2},\dotsc)}$ be an infinite sequence of random variables with values in a space $\mathbf{X}$ . The sequence $X_{1},X_{2},\dotsc$ is exchangeable if and only if there is a random probability measure $\Theta$ on $\mathbf{X}$ such that the $X_{i}$ are conditionally i.i.d. given $\Theta$ and

where $\nu$ is the distribution of $\Theta$ .

The integral on the right-hand side of (II.3) can be interpreted as a two-stage sampling procedure:

Sample ${\Theta\sim\nu}$ , i.e., draw a probability distribution at random from the distribution $\nu$ .

Conditioned on $\Theta$ , sample the $X_{n}$ conditionally i.i.d. as

The theorem says that any exchangeable sequence can be sampled by such a two-stage procedure; the distribution of the sequence is determined by the choice of $\nu$ . The random measure $\Theta$ is called the directing random measure of $X$ . Its distribution $\nu$ is called the mixing measure or de Finetti measure.

Statistical inference is only possible if the distribution of the data, or at least some of its properties, can be recovered from observations. For i.i.d. random variables, this is ensured by the law of large numbers. The proof of de Finetti’s theorem also implies a law of large numbers for exchangeable sequences:

If the sequence $(X_{i})$ is exchangeable, the empirical distributions

holds with probability 1 for every set $A$ .

The two theorems have fundamental implications for Bayesian modeling. If we assume the data can be represented by (some finite prefix of) an exchangeable sequence, this implies without any further assumptions:

Conditioned on a random probability measure $\Theta$ representing an unknown distribution $\theta$ , every sample $X_{1},X_{2},\dotsc$ is i.i.d. with distribution $\Theta$ .

Every exchangeable sequence model is characterized by a unique distribution $\nu$ on $\mathbf{M}(\mathbf{X})$ .

A statistical model can be taken to be some subset of $\mathbf{M}(\mathbf{X})$ rather than $\mathbf{M}(\mathbf{X}^{\infty})$ , which we would have to consider for a general random sequence.

Statistical inference is possible in principle: With probability one, the empirical distributions $\hat{S}_{n}$ converge to the distribution $\Theta$ generating the data, according to (II.6).

A modeling application might look like this: We consider a specific data source or measurement process, and assume that data generated by this source can be represented as an exchangeable sequence. The definition of exchangeability for an infinite sequence does not mean we have to observe an infinite number of data points to invoke de Finetti’s theorem; rather, it expresses the assumption that samples of any finite size generated by the source would be exchangeable. Hence, exchangeability is an assumption on the data source, rather than the data.

Given observations ${X_{1},\ldots,X_{n}}$ , we then compute the posterior distribution, by conditioning $\nu$ on the observations. Theorem II.2 implies that, if the empirical measure converges asymptotically to a specific measure ${\theta\in\mathbf{M}(\mathbf{X})}$ , the posterior converges to a point mass at $\theta$ . This result has to be interpreted very cautiously, however: It only holds for a sequence $(X_{i})$ which was actually generated from the measure $\nu$ we use as a prior. In other words, suppose someone generates $(X_{i})$ from a distribution $\nu_{1}$ on $\mathbf{M}(\mathbf{X})$ by the two-stage sampling procedure above, without disclosing $\nu_{1}$ to us. In the sampling procedure, the variable ${\Theta\sim\nu_{1}}$ assumes as its value a specific distribution $\theta_{1}$ , from which the data is then generated independently. We model the observed sequence by choosing a prior $\nu_{2}$ . The posterior under $\nu_{2}$ still converges to a point mass, but there is no guarantee that it is a point mass at $\theta_{1}$ , and (II.6) only holds if ${\nu_{2}=\nu_{1}}$ .

Thus, there are several important questions that exchangeability does not answer:

The de Finetti theorem says that there is some prior which adequately represents the data, but provides no guidance regarding the choice of $\nu$ : Any probability measure $\nu$ on $\mathbf{M}(\mathbf{X})$ is the prior for some exchangeable sequence.

Theorem II.2 only guarantees convergence for sequences of random variables generated from the prior $\nu$ .

Theorem II.2 is a first-order result: It provides no information on how quickly the sequence converges. Results on convergence rates can only be obtained for more specific models; the set of all exchangeable distributions is too large and too complicated to obtain non-trivial statements.

Answers to these questions typically require further modeling assumptions.

II-B The general form of exchangeability results

Many problems in machine learning and modern statistics involve data which is more naturally represented by a random structure that is not a sequence: often a graph, matrix, array, tree, partition, etc. is a better fit. If it is possible to define a suitable notion of exchangeability, the main features of de Finetti’s theorem typically generalize. Although results differ in their details, there is a general pattern, which we sketch in this section before considering specific types of exchangeable structures.

The setup is as follows: The product space $\mathbf{X}^{\infty}$ of infinite sequences is substituted by a suitable space $\mathbf{X}_{\infty}$ of more general, infinite structures. An infinite random structure $X_{\infty}$ is a random variable with values in $\mathbf{X}_{\infty}$ . Each element of $\mathbf{X}_{\infty}$ can be thought of as a representation of an infinitely large data set or “asymptotic” sample. An actual, finite sample of size $n$ is modeled as a substructure $X_{n}$ of $X_{\infty}$ , such as a the length- $n$ prefix of an infinite sequence or a $n$ -vertex subgraph of an infinite graph.

The first step in identifying a notion of exchangeability is to specify what it means to permute components of a structure ${x_{\infty}\in\mathbf{X}_{\infty}}$ . If $x_{\infty}$ is an infinite matrix, for example, a very useful notion of exchangeability arises when one considers all permutations that exchange the ordering of rows/columns, rather than the ordering of individual entries. Exchangeability of a random structure $X_{\infty}$ then means that the distribution of $X_{\infty}$ is invariant under the specified family of permutations.

Once a specific exchangeable random structure $X_{\infty}$ is defined, the next step is to invoke a representation theorem that generalizes de Finetti’s theorem to $X_{\infty}$ . Probability theory provides such theorems for a range of random structures; see Table I for examples. A representation theorem can be interpreted as determining (1) a natural parameter space $\mathbf{T}$ for exchangeable models on $\mathbf{X}_{\infty}$ , and (2) a special family of distributions on $\mathbf{X}_{\infty}$ , which are called the ergodic distributions or ergodic measures. Each element ${\theta\in\mathbf{T}}$ determines an ergodic distribution, and we denote this distribution as ${\mathbf{p}_{\theta}}$ . The set of ergodic distributions is

The distribution of any exchangeable random structure $X_{\infty}$ can then be represented as a mixture of these ergodic distributions,

In the specific case of exchangeable sequences, (II.8) is precisely the integral representation (II.3) in de Finetti’s theorem, and the ergodic measures are the distributions of i.i.d. sequences, that is,

For more general random structures, the ergodic measures are not usually product distributions, but they retain some key properties:

They are particularly simple distributions on $\mathbf{X}_{\infty}$ , and form a “small” subset of all exchangeable distributions.

They have a conditional independence property, in the sense that a random structure $X_{\infty}$ sampled from one of the ergodic distributions decomposes into conditionally independent components. In de Finetti’s theorem, these conditionally independent components are the elements $X_{i}$ of the sequence.

As in the sequence case, the integral (II.8) in the general case represents a two-stage sampling scheme:

A Bayesian model for an exchangeable random structure $X_{\infty}$ with representation (II.8) is characterized by a prior distribution on $\mathbf{T}$ .

Suppose the prior $\nu$ concentrates on a subset ${\mathcal{T}\subset\mathbf{T}}$ , that is, $\mathcal{T}$ is the smallest subset to which the prior assigns probability 1. Then $\mathcal{T}$ defines a subset

of ergodic measures. We thus have defined a Bayesian model on $\mathbf{X}_{\infty}$ , with prior $\nu$ and observation model $\mathcal{P}$ . In summary:

$\mathbf{T}$ is the natural parameter space for Bayesian models of $X_{\infty}$ , and the prior distribution is a distribution on $\mathbf{T}$ .

The observation model $\mathcal{P}$ is a subset of the ergodic measures. An exchangeability theorem characterizing the ergodic measures therefore also characterizes the possible observation models.

The representation (II.8) is typically complemented by a convergence results: A specific function of the samples converges to $\Theta$ almost surely as ${n\to\infty}$ , generalizing Theorem II.2. In particular, the parameter space $\mathbf{T}$ can be interpreted as the set of all possible limit objects.

If the set $\mathcal{T}$ on which the prior concentrates its mass is a finite-dimensional subspace of $\mathbf{T}$ , we call the resulting Bayesian model parametric. If $\mathcal{T}$ has infinite dimension, the model is nonparametric.

II-C Exchangeable partitions

Kingman showed that exchangeable random partitions can again be represented in the form of Eq. II.8. The parameter space $\mathbf{T}$ consists of all sequence ${\theta:=(s_{1},s_{2},\dotsc)}$ of scalars ${s_{i}\in}$ which satisfy

Let $\bar{s}_{n}:=\sum_{i=1}^{n}s_{i}$ . Then $\theta$ defines a partition of $$ into intervals

Generate ${U_{1},U_{2},\dotsc\sim_{\mbox{\tiny iid}}\mbox{Uniform}}$ .

Kingman called this distribution a paint-box distribution.

$X_{\infty}$ is exchangeable if and only if

for some distribution $\nu$ on $\mathbf{T}$ , where ${p_{\theta}}$ is the paint-box distribution with parameter ${\theta\in\mathbf{T}}$ .

If $X_{\infty}$ is exchangeable, the scalars $s_{i}$ can be recovered asymptotically as limiting relative block sizes

Part 1) is of course the counterpart to de Finetti’s theorem, and part 2) corresponds to Theorem II.2. In (II.15), we compute averages within a single random structure, having observed only a substructure of size $n$ . Nonetheless, we can recover the parameter $\theta$ asymptotically from data. This is a direct consequence of exchangeability, and would not generally be true for an arbitrary random partition.

II-D “Non-exchangeable” data

Exchangeability seems at odds with many types of data; for example, a sequence of stock prices over time will be poorly modeled by an exchangeable sequence. Nonetheless, a Bayesian model of a time series will almost certainly imply an exchangeability assumption—the crucial question is which components of the overall model are assumed to be exchangeable. As the next example illustrates, these components need not be the variables representing the observations.

In other words, $X$ is a Lévy process if, for any disjoint ${I_{1},I_{2},\ldots}$ of equal length, the increments ${(\Delta X_{I_{1}},\Delta X_{I_{2}},\ldots)}$ form an i.i.d. sequence. It is then natural to ask for an exchangeable process: We say that $X$ is an exchangeable continuous-time process if, again for any disjoint ${I_{1},I_{2},\ldots}$ of equal length, the sequence ${(\Delta X_{I_{1}},\Delta X_{I_{2}},\ldots)}$ is exchangeable (see Fig. 2). The representation theorem for such processes is due to Hans Bühlmann [e.g. 39, Theorem 1.19]:

Hence, each ergodic measure $\mathbf{p}_{\theta}$ is the distribution of a Lévy process, and the measure $\nu$ is a distribution on parameters of Lévy processes or—in the parlance of stochastic process theory—on Lévy characteristics.

Another important type of exchangeability property is defined for sequences $X_{1},X_{2},\dotsc$ taking values in a countable space $\mathbf{X}$ . Such a sequence is called Markov exchangeable if the probability of observing an initial trajectory $x_{1},\dotsc,x_{n}$ depends only on the initial state $x_{1}$ and, for every pair ${y,y^{\prime}\in\mathbf{X}}$ , on the number of transitions ${t_{y,y^{\prime}}=\#\{j<n:x_{j}=y,x_{j+1}=y^{\prime}\}}$ . In particular, the probability does not depend on when each transition occurs. Diaconis and Freedman showed the following:

If a (recurrent) process is Markov exchangeable, it is a mixture of Markov chains.

(Recurrence means that each visited state is visited infinitely often if the process is run for an infinite number of steps.) Thus, each ergodic distribution $\mathbf{p}_{\theta}$ is the distribution of a Markov chain, and a parameter value $\theta$ consists of a distribution on $\mathbf{X}$ (the distribution of the initial state) and a transition matrix. If a Markov exchangeable process is substituted for the Markov chain in a hidden Markov model, i.e., if the Markov exchangeable variables are latent variables of the model, the resulting model can express much more general dependencies than Markov exchangeability. The infinite hidden Markov model is an example; see . Recent work by Bacallado, Favaro, and Trippa constructs prior distributions on random walks that are Markov exchangeable and almost surely reversible.

A very general approach to modeling is to assume that an exchangeability assumption holds marginally at each value of a covariate variable $z$ , e.g., a time or a location in space: Suppose $\mathbf{X}^{\infty}$ is a set of structures as described above, and $\mathbf{Z}$ is a space of covariate values. A marginally exchangeable random structure is a random measurable mapping

such that, for each $z\in\mathbf{Z}$ , the random variable $\xi(z)$ is an exchangeable random structure in $\mathbf{X}_{\infty}$ .

A popular example of a marginally exchangeable model is the dependent Dirichlet process (DDP) of MacEachern . In this case, for each ${z\in\mathbf{Z}}$ , the random variable $\xi(z)$ is a random probability measure whose distribution is a Dirichlet process. More formally, $\mathbf{Y}$ is some sample space, ${\mathbf{X}_{\infty}=\mathbf{M}(\mathbf{Y})}$ , and the DDP is a distribution on mappings ${\mathbf{Z}\to\mathbf{M}(\mathbf{Y})}$ ; thus, the DDP is a random conditional probability. Since $\xi(z)$ is a Dirichlet process if $z$ is fixed, samples from $\xi(z)$ are exchangeable.

Eq. II.16 is, of course, just another way of saying that $\xi$ is a $\mathbf{X}_{\infty}$ -valued stochastic process indexed by $\mathbf{Z}$ , although we have made no specific requirements on the paths of $\xi$ . The interpretation as a path is more apparent in the next example.

II-E Random functions vs random measures

De Finetti’s theorem can be equivalently formulated in terms of a random function, rather than a random measure, and this formulation provides some useful intuition for Section III. Roughly speaking, this random function is the inverse of the cumulative distribution function (CDF) of the random measure $\Theta$ in de Finetti’s theorem; see Fig. 3.

More precisely, suppose that ${\mathbf{X}=[a,b]}$ . A measure $\mu$ on $[a,b]$ can be represented by its CDF, defined as ${\psi(x):=\mu([a,x])}$ . Hence, sampling the random measure $\Theta$ in de Finetti’s theorem is equivalent to sampling a random CDF $\Psi$ . A CDF is not necessarily an invertible function, but it always admits a so-called right-continuous inverse $\overline{\psi^{-1}}$ , given by

This function inverts $\psi$ in the sense that $\psi\circ\overline{\psi^{-1}}(u)=u$ for all $u\in$ . It is well-known that any scalar random variable $X_{i}$ with CDF $\psi$ can be generated as

In the special case $\mathbf{X}=[a,b]$ , de Finetti’s theorem therefore translates as follows: If $X_{1},X_{2},\dotsc$ is an exchangeable sequence, then there is a random function $F:=\overline{\Psi^{-1}}$ such that

where $U_{1},U_{2},\dotsc$ are i.i.d. uniform variables.

It is much less obvious that the same should hold on an arbitrary sample space, but that is indeed the case:

Let $X_{1},X_{2},\dotsc$ be an infinite, exchangeable sequence of random variables with values in a space $\mathbf{X}$ . Then there exists a random function $F$ from $ $to$ \mathbf{X} $such that, if$ U_{1},U_{2},\dotsc$ is an i.i.d. sequence of uniform random variables,

As we will see in the next section, this random function representation generalizes to the more complicated case of array data, whereas the random measure representation in Eq. II.3 does not.

III Exchangeable Graphs, Matrices, and Arrays

Random arrays are a very general type of random structure, which include important special cases, such as random graphs and random matrices. The representation theorem for exchangeable arrays is the Aldous-Hoover theorem. In this section, we focus on $2$ -arrays, matrices and graphs. The general case for $d$ -arrays is conceptually similar, but considerably more technical, and we postpone it until Section VI.

A random matrix is a random $2$ -array, although the term matrix usually implies that $\mathbf{X}$ has the algebraic structure of a field. A random graph is a random matrix with ${\mathbf{X}=\{0,1\}}$ . As in the sequence case, we assume $X_{\infty}$ is infinite in size, and the statistical interpretation is that an observed, finite array is a sub-array of $X_{\infty}$ . In network analysis problems, for example, an observed graph with $n$ vertices would be interpreted as a random induced subgraph of an underlying graph $X_{\infty}$ , which represents an infinite population.

In this section, we are interested in the characterization of random arrays whose distributions are invariant to permutations reordering the rows and columns. For a $2$ -array, there are two natural ways to define exchangeability: we can ask that the distribution of the array be invariant only to the joint (simultaneous) permutation of the rows and columns or also to separate permutations of rows and columns.

A random 2-array $(X_{ij})$ is called jointly exchangeable if

The analogue of de Finetti’s theorem for exchangeable arrays is the Aldous-Hoover theorem . It has two versions, for jointly and for separately exchangeable arrays.

A random array $(X_{ij})$ is jointly exchangeable if and only if it can be represented as follows: There is a random function $F:^{3}\to\mathbf{X}$ such that

Because the variables $U_{\{i,j\}}$ are indexed by a set, the indices are unordered, and we can think of the array $(U_{\{i,j\}})$ as an upper-triagonal matrix with i.i.d. uniform entries. If the function $F$ is symmetric in its first two arguments, then $X_{\infty}$ is symmetric—that is, if $F(x,y,\,.\,)=F(y,x,\,.\,)$ for all $x$ and $y$ , then ${X_{ij}=X_{ji}}$ for all $i$ and $j$ . In general, however, a jointly exchangeability matrix or $2$ -array need not be symmetric.

Separately exchangeable arrays can also be given a precise characterization using Theorem III.2:

A random array $(X_{ij})$ is separately exchangeable if and only if it can be represented as follows: There is a random function $F:^{3}\to\mathbf{X}$ such that

In the prototypical version of a collaborative filtering problem, users assign scores to movies. Scores may be binary (“like/don’t like”, $X_{ij}\in\{0,1\}$ ), have a finite range (“one to five stars”, $X_{ij}\in\{1,\dotsc 5\}$ ), etc. Separate exchangeability then simply means that the probability of seeing any particular realization of the matrix does not depend on the way in which either the users or the movies are ordered.

Like de Finetti’s and Kingman’s theorem, the representation results are complemented by a convergence result, due to Kallenberg [37, Theorem 3]. The general result involves some technicalities and is not stated here; the special case for random graphs is discussed in Section V (see Theorem V.1).

The variables $(U_{i})$ and $(U_{ij})$ in the representation (III.4) of a jointly exchangeable array need not be uniform: We can substitute variables with a different marginal distribution if we modify the random function $F$ accordingly. For (III.4) to represent some jointly exchangeable array, it is sufficient that:

The variables are independent of the random function $F$ .

Conversely, suppose we define variables $(U_{i})$ and $(U_{ij})$ satisfying 1) and 2), and additionally:

Values do not re-occur, with probability one; that is, the distributions substituted for the uniform is atomless.

III-B Exchangeable Graphs

An important special case is when $G$ is simple—i.e., undirected and without self-loops. In this case, $X$ is symmetric with a zero diagonal, and its representation (III.4) can be simplified: Let $F$ be a random function satisfying (III.4). Without loss of generality, we may assume that $F$ is symmetric in its first two arguments. Consider the (random) function $W$ from $^{2}$ to $ $given by$ W(x,x):=0$ on the diagonal and otherwise by

where $U\sim\mbox{\rm Uniform}$ is independent of $F$ . Then $W$ is random symmetric function from $^{2}$ to $$, and, by construction,

We thus obtain the following specialization of the Aldous-Hoover theorem for exchangeable simple graphs:

where $U_{i}$ and $U_{\{i,j\}}$ are independent i.i.d. uniform variables as in (III.4), which are independent of $W$ .

The representation (III.8) yields the following generative process:

where ${X_{ij}=1}$ indicates the edge connecting $i$ and $j$ is present; if ${X_{ij}=0}$ , it is absent.

Fig. 4 illustrates the generation of random simple graph.

Following , we call a (measurable) function from $^{2}$ to $ $a graphon. Thus, every exchangeable graph is represented by a random graphon$ W $. In the language of integral decompositions, the ergodic distributions of exchangeable simple graphs are parametrized by graphons: In (II.8), we could take$ \nu $to be a distribution on the space of graphons. We will see in Section V that$ \theta$ has an interpretation as a limit of the empirical adjacency matrices for larger and larger subgraphs.

III-C Application to Bayesian Models

The representation results above have fundamental implications for Bayesian modeling—in fact, they provide a general characterization of Bayesian models of array-valued data:

Statistical models of exchangeable arrays can be parametrized by functions from ${^{3}\to}$ . Every Bayesian model of an exchangeable array is characterized by a prior distribution on the space of functions from $^{3}$ to $$.

For the special case of simple graphs, we can rephrase this idea in terms of graphons:

Statistical models of exchangeable simple graphs are parametrized by graphons. Every Bayesian model of an exchangeable simple graph is characterized by a prior distribution on the space of graphons.

In a Bayesian model of an exchangeable $2$ -array, the random function $F$ plays the role of the random parameter $\Theta$ in (II.10), and the parameter space $\mathbf{T}$ is the set of measurable function ${^{3}\rightarrow}$ . Every possible value $f$ of $F$ defines an ergodic distribution $\mathbf{p}_{f}$ : In the jointly exchangeable case, for example, Theorem III.2 shows that $X_{\infty}$ can be sampled from $\mathbf{p}_{f}$ by sampling

Similarly, the ergodic distributions for separately exchangeable 2-arrays are given by (III.5). In the special case of exchangeable simple graphs, the parameter space $\mathbf{T}$ can be reduced to the set of graphons, and the ergodic distribution $\mathbf{p}_{w}$ defined by a graphon $w$ is given by (III.8).

To the best of our knowledge, Hoff was the first to invoke the Aldous-Hoover theorem for statistical modeling. The problem of estimating the distribution of an exchangeable graph can be formulated as a regression problem on the unknown function $w$ . This perspective was proposed in , where the regression problem is formulated as a Bayesian nonparametric model with a Gaussian process prior. The regression model need not be Bayesian, however, and recent work formulates the estimation of $w$ under suitable modeling conditions as a maximum likelihood problem .

Various types of array-valued data depend on time or some other covariate. In this case, joint or separate exchangeability might be assumed to hold marginally, as described in Section II-D. E.g., in the case of a graph evolving over time, one could posit the existence of a graphon $W(.,.,t)$ depending also on the time $t$ . More generally, the discussion in II-D applies to joint and separate exchangeability just as it does to exchangeable sequences. On the other hand, sometimes exchangeability will not be an appropriate assumption, even marginally. In Section VII, we highlight some reasons why exchangeable graphs may be poor models of very large sparse graphs.

III-D Uniqueness of representations

In the representation Eq. III.8, random graph distributions are parametrized by functions ${w:^{2}\to}$ . This representation is not unique: Two distinct graphons may parametrize the same random graph. In this case, the two graphons are called weakly isomorphic . From a statistical perspective, this means the graphon is not identifiable when regarded as a model parameter, although it is possible to treat the estimation problem up to equivalence of functions [37, Theorem 4].

Although any graphon $w^{\phi}$ obtained from $w$ by a MPT $\phi$ is weakly isomorphic to $w$ , the converse is not true: For two weakly isomorphic graphons, there need not be a MPT that transforms one into the other [see 47, Example 7.11].

IV Models in the Machine Learning Literature

The representation theorems show that any Bayesian model of an exchangeable array can be specified by a prior on functions. Models can therefore be classified according to the type of random function they employ. This section surveys several common categories of such random functions, including random piece-wise constant (p.w.c.) functions, which account for the structure of models built using Chinese restaurant processes, Indian buffet processes and other combinatorial stochastic processes; and random continuous functions with, e.g., Gaussian process priors. Special cases of the latter include a range of matrix factorization and dimension reduction models proposed in the machine learning literature. Table II summarizes the classes in terms of restrictions on the random function and the values it takes, and Fig. 7 depicts typical random functions across these classes.

Cluster-based models assume that the rows and columns of the random array $X:=(X_{ij})$ can be partitioned into (disjoint) classes, such that the probabilistic structure between every row- and column-class is homogeneous. Within social science, this idea is captured by assumptions underlying stochastic block models .

The collaborative filtering problem described in Example III.4 is a prototypical application: here, a cluster-based model would assume that the users can be partitioned into classes/groups/types/kinds (of users), and likewise, the movies can also be partitioned into classes/groups/types/kinds (of movies). Having identified the underlying partition of users and movies, each class of user would be assumed to have a prototypical preference for each class of movie.

Because a cluster-based model is described by two partitions, the exchangeable partition models well-known from Bayesian nonparametric clustering can be used as building blocks. If the partitions are in particular generated by a by a Chinese restaurant process, we obtain the Infinite Relational Model (IRM), introduced in and independently in . The IRM can be seen as a nonparametric generalization of parametric stochastic block models . In the following example, we describe the model for the special case of a $\{0,1\}$ -valued array.

Under the IRM, the generative process for a finite subarray of binary random variables $X_{ij}$ , $i\leq n$ , $j\leq m$ , is as follows: To begin, we partition the rows (and then columns) into clusters according to a Chinese restaurant process: The first and second row are chosen to belong to the same cluster with probability proportional to 1 and to belong to different clusters with probability proportional to a parameter $c>0$ . Subsequently, each row is chosen to belong to an existing cluster with probability proportional to the current size of the cluster, and to a new cluster with probability proportional to $c$ . Let $\Pi:=\{\Pi_{1},\dotsc,\Pi_{\kappa}\}$ be the random partition of $\{1,\dotsc,n\}$ induced by this process, where $\Pi_{1}$ is the cluster containing 1, and $\Pi_{2}$ is the cluster containing the first row not belonging to $\Pi_{1}$ , and so on. Note that the number of clusters, $\kappa$ , is also a random variable. Let $\Pi^{\prime}:=\{\Pi^{\prime}_{1},\dotsc,\Pi^{\prime}_{\kappa^{\prime}}\}$ be the random partition of $\{1,\dotsc,m\}$ induced by this process on the columns, possibly with a different parameter $c^{\prime}>0$ determining the probability of creating new clusters. Next, for every pair $(k,k^{\prime})$ of cluster indices, $k\leq\kappa$ , $k^{\prime}\leq\kappa^{\prime}$ , we generate an independent beta random variable $\theta_{k,k^{\prime}}$ . Finally, we generate each $X_{ij}$ independently from a Bernoulli distribution with mean $\theta_{k,k^{\prime}}$ , where $i\in\Pi_{k}$ and $j\in\Pi^{\prime}_{k^{\prime}}$ . As we can see, $\theta_{k,k^{\prime}}$ represents the probability of links arising between elements in clusters $k$ and $k^{\prime}$ .

The Chinese restaurant process generating $\Pi$ and $\Pi^{\prime}$ is known to be exchangeable in the sense that the distribution of $\Pi$ is invariant to a permutation of the underlying set $\{1,\dotsc,n\}$ . It is then straightforward to see that the distribution on the subarray is exchangeable. In addition, it is straightforward to verify that, were we to have generated an $n+1\times m+1$ array, the marginal distribution on the $n\times m$ subarray would have agreed with that of the above process. This implies that we have defined a so-called projective family and so results from probability theory imply that there exists an infinite array and that the above process describes the distribution of every finite subarray.

We say that a Bayesian model of an exchangeable array is simple cluster-based if, for some random function $F$ representing $X$ , there are random partitions $B_{1},B_{2},\dotsc$ and $C_{1},C_{2},\dotsc$ of the unit interval $$ such that:

On each block ${A_{i,j}:=B_{i}\times C_{j}\times}$ , $F$ is constant. Let $f_{ij}$ be the value $F$ takes on block $A_{i,j}$ .

The block values $(f_{ij})$ are themselves an exchangeable array, and independent from $(B_{i})$ and $(C_{j})$ .

We call an array simple cluster-based if its distribution is.

Most examples of simple cluster-based models in the literature—including, e.g., the IRM—take the block values $f_{ij}$ to be conditionally i.i.d. (and so the array $(f_{ij})$ is then trivially exchangeable). As an example of a more flexible model for $(f_{ij})$ , which is merely exchangeable, consider the following:

Like with cluster-based models of exchangeable sequences, if the number of classes in each partition is bounded, then a simple cluster-based model of an exchangeable array is a mixture of a finite-dimensional family of ergodic distributions. Therefore, mixtures of an infinite-dimensional family must place positive mass on partitions with arbitrarily many classes.Exchangeable partitions may contain blocks consisting of only a single element (see Section II-C); these are also known as dust. Our definitions exclude this case because it complicates presentation, but the generalization is straightforward.

In order to define the more general class of cluster-based models, we relax the piece-wise constant nature of the random function $F$ . We do so by starting with a simple cluster-based array ${\Theta=(\Theta_{ij})}$ with values in a space $\mathbf{T}$ . Since $\Theta$ is exchangeable, it is represented by a random function $G$ , and we define $F$ as

for some function ${\phi:\mathbf{T}\times\to\mathbf{X}}$ .

If $\theta\in\mathbf{T}$ is a fixed parameter value and $U$ a uniform random variable in $ $, then$ \phi(\theta,U) $is a random variable with values in$ \mathbf{X} $. Suppose$ P_{\theta} $is the distribution of$ \phi(\theta,U) $. Then the array$ {X=(X_{ij})} $represented by$ F $can be sampled by sampling$ {(\Theta_{ij})} $from$ G $, and then generating$ X_{ij} $from the distribution$ P_{\Theta_{ij}} $. In other words, we can posit a family$ {P=\{P_{\theta}|\theta\in\mathbf{T}\}} $of distributions, generate an array from simple cluster-based model, and then generate$ X_{ij} $, conditionally independently given$ \Theta $, from$ P_{\Theta_{ij}} $. This is additional layer of randomness is also called a randomization of$ \Theta$ .

Let $P$ be a family ${\{P_{\theta}:\theta\in\mathbf{T}\}}$ of distributions on $\mathbf{X}$ , and let ${\Theta:=(\Theta_{i}:i\in I)}$ be a collection of random variables taking values in $\mathbf{T}$ , indexed by elements of a set $I$ . We say that a collection ${X:=(X_{i}:i\in I)}$ of random variables, indexed by the same set $I$ , is a $P$ -randomization of $\Theta$ when the elements $X_{i}$ are conditionally independent given $\Theta$ , and ${X_{i}\mid\Theta\sim P_{\Theta_{i}}}$ for all $i\in I$ .

We say that a Bayesian model for an exchangeable array ${X:=(X_{ij})}$ in $\mathbf{X}$ is cluster-based if $X$ is a $P$ -randomization of a simple cluster-based exchangeable array $\Theta:=(\Theta_{ij})$ taking values in a space $\mathbf{T}$ , for some family ${\{P_{\theta}:\theta\in\mathbf{T}\}}$ of distributions on $\mathbf{X}$ . We say an array is cluster-based when its distribution is.

The intuition is that the cluster memberships of every pair $i,j$ of individuals determines a parameter $\theta_{ij}$ , which in turn determines a distribution $P_{\theta_{ij}}$ . The actual observed relationship $X_{ij}$ is then a sample from $P_{\theta_{ij}}$ .

We may alternatively describe the IRM distribution on exchangeable arrays as follows: Let $P$ be a family ${\{P_{\theta}:\theta\in\mathbf{T}\}}$ of distributions on $\mathbf{X}$ (e.g., a family of Bernoulli distributions indexed by their means in $ $) and let$ H $be a prior distribution on the parameter space$ $(e.g., a Beta distribution, so as to achieve conjugacy). The IRM is a cluster-based model, and an array$ {X:=(X_{ij})} $distributed according to an IRM is hence a$ P $-randomization of a simple cluster-based array$ {\Theta:=(\Theta_{ij})} $of parameters in$ \mathbf{T}$.

In order to describe the structure of $\Theta$ , it suffices to describe the distribution of the partitions $(B_{k})$ and $(C_{k})$ as well as that of the block values. For the latter, the IRM simply chooses the block values to be i.i.d. draws from the distribution $H$ . (While the block values can be taken to be merely exchangeable, we have not seen this generalization in the literature.) For the partitions, the IRM utilizes the stick-breaking construction of a Dirichlet process .

The IRM is originally defined in terms of a Chinese restaurant process (rather than a Dirichlet process), as we have done in Example IV.1. In terms of the CRP, each random probability $V_{k}$ in (IV.2) is the limiting fraction of rows in the $k$ th cluster $\Pi_{k}$ as the number of rows tends to infinity.

IV-B Feature-based models

Feature-based models of exchangeable arrays have similar structure to cluster-based models. Like cluster-based models, feature-based models partition the rows and columns into clusters, but unlike cluster-based models, feature-based models allow the rows and columns to belong to multiple clusters simultaneously. The set of clusters that a row belongs to are then called its features. The interaction between row $i$ and column $j$ is then determined by the features that the row and column possess.

The stochastic process at the heart of most existing feature-based models of exchangeable arrays is the Indian buffet process, introduced by Griffiths and Ghahramani . The Indian buffet process (IBP) produces an allocation of features in a sequential fashion, much like the Chinese restaurant process produces a partition in a sequential fashion. In the follow example, we will describe the Latent Feature Relational Model (LFRM) of Miller et al. , one of the first nonparametric, feature-based models of exchangeable arrays. For simplicity, consider the special case of a $\{0,1\}$ -valued, separately-exchangeable array.

Under the LFRM, the generative process for a finite subarray of binary random variables $X_{ij}$ , $i\leq n$ , $j\leq m$ , is as follows: To begin, we allocate features to the rows (and then columns) according to an IBP. In particular, the first row is allocated a Poisson number of features, with mean $\gamma>0$ . Each subsequent row will, in general, share some features with earlier rows, and possess some features not possessed by any earlier row. Specifically, the second row is also allocated a Poisson number of altogether new features, but with mean $\gamma/2$ , and, for every feature possessed by the first row, the second row is allocated that feature, independently, with probability $1/2$ . In general, the $k$ th row is allocated a Poisson number of altogether new features, with mean $\gamma/k$ ; and, for every subset $K\subseteq\{1,\dotsc,k-1\}$ of the previous rows, and every feature possessed by exactly those rows in $K$ , is allocated that feature, independently, with probability $|K|/n$ . (We use the same process to allocate a distinct set of features to the $m$ columns, though potentially with a different constant $\gamma^{\prime}>0$ governing the overall number of features.)

The exchangeability of the subarray follows from the exchangeability of the IBP itself. In particular, define the family of counts $\Pi_{N}$ , $N\subseteq\{1,\dotsc,n\}$ , where $\Pi_{N}$ is the number of features possessed by exactly those rows in $N$ . We say that $\Pi:=(\Pi_{N})$ is a random feature allocation for $\{1,\dotsc,n\}$ . (Let $\Pi^{\prime}$ be the random feature allocation for the columns induced by the IBP.) The IBP is exchangeable is the sense that

for every permutation $\pi$ of $\{1,\dots,n\}$ , where $\sigma(N):={\{\sigma(n):n\in N\}}$ . Moreover, the conditional distribution of the subarray given the feature assignments $(N_{i},M_{j})$ is the same as the conditional distribution given the feature allocations $(\Pi_{N},\Pi^{\prime}_{M})$ . It is then straightforward to verify that the subarray is itself exchangeable. Like with the IRM example, the family of distributions on subarrays of different sizes is projective, and so there exists an infinite array and the above process describes the distribution of every subarray.

The LFRM is a special case of a class of models that we will call feature-based. From the perspective of simple cluster-based models, simple feature-based models also have a block structured representing function, but relax the assumption that values of each block form an exchangeable array.

To state the definition of this class more formally, we begin by generalizing the notion of a partition of $$. Definition IV.8 was also introduced and studied in much more detail by Broderick et al. —which we were unaware of when this article was first submitted—and we have adjusted our terminology to match that of , where the term feature paint-box was coined. The perspectives differ slightly: In , the feature paint-box is introduced based on the theory of exchangeable partitions, and the authors show that key aspects of this theory generalize elegantly. Our perspective here is that feature models form a special type of exchangeable arrays—as indeed do exchangeable partitions, see [39, Theorem 7.38].

Let $U$ be a uniformly-distributed random variable and ${E:=(E_{1},E_{2},\dotsc)}$ a sequence of subsets of $ $. Given$ E $, we say that$ U $has feature$ n $if$ U\in E_{n} $. We call the sequence$ E$ a feature paint-box if

To parse (IV.4), first recall that a partition is a special case of a feature paint-box. In this case, the sets $E_{n}$ are disjoint and represent blocks of a partition. The relation $U\in E_{k}$ then indicates that an object represented by the random variable $U$ is in block $k$ of the partition. In a feature paint-box, the sets $E_{k}$ may overlap. The relation $U\in E_{n}$ now indicates that the object has feature $n$ . Because the sets may overlap, the object may possess multiple features. However, condition Eq. IV.4 ensures that the number of features per object remains finite (with probability 1).

However, if we want to capture the idea that the relationships between objects depend on the individual features the objects possess, we would not want to assume that the entries of $f_{N,M}$ formed an exchangeable array, as in the case of a simple, cluster-based model. E.g., we might choose to induce more dependence between $f_{N,M}$ and $f_{N^{\prime},M}$ when $N\cap N^{\prime}\neq\emptyset$ than otherwise. The following definition captures the appropriate relaxation of exchangeability:

and the values $f:=(f_{N,M})$ themselves form a feature-exchangeable array, independent of $B$ and $C$ . We say an array is simple feature-based if its distribution is.

Simple feature-based arrays specialize to simple cluster-based arrays if either i) the feature allocations are partitions or ii) the array $f$ is exchangeable. The latter case highlights the fact that feature-based arrays relax the exchangeability assumption of the underlying block values.

As in the case of simple cluster-based models, nonparametric simple feature-based models will place positive mass on feature allocations with an arbitrary number of distinct sets. As for cluster-based models, we define general feature-based models as randomizations of simple models:

We say that a Bayesian model for an exchangeable array $X:=(X_{ij})$ in $\mathbf{X}$ is feature-based when $X$ is a $P$ -randomization of a simple, feature-based, exchangeable array $\Theta:=(\Theta_{ij})$ taking values in a space $T$ , for some family ${\{P_{\theta}:\theta\in\mathbf{T}\}}$ of distributions on $\mathbf{X}$ . We say an array is feature-based when its distribution is.

Comparing Definitions IV.5 and IV.12, we see that the relationship between random functions representing $\Theta$ and $X$ are the same as in cluster-based models. The LFRM described in Example IV.7 is a special case of a feature-based model:

A feature-based model is determined by the randomization family $P$ , the distribution of the underlying feature-exchangeable array $f$ of link probabilities, and the distribution of the random feature allocation. In the case of the LFRM, $P$ is the family $\mbox{\rm Bernoulli}(p)$ distributions, for $p\in$ (although this is easily generalized, and does not represent an important aspect of the model). The underlying feature-exchangeable array $f$ is that described in Example IV.10.

The random feature allocations underlying the LFRM can be described in terms of so-called “stick-breaking” constructions of the Indian buffet process. One of the simplest stick-breaking constructions, and the one we will use here, is due to Teh, Görür, and Ghahramani . (See also , and .)

where $B_{n}=[b_{1},b_{2})\cup[b_{3},b_{4})\cup\dotsm\cup[b_{2^{n}-1},b_{2^{n}})$ . (As one can see, this representation obscures the conditional independence inherent in the feature allocation induced by the IBP.)

IV-C Piece-wise constant models

Simple partition- and feature-based models have piece-wise constant structure, which arises because both types of models posit prototypical relationships on the basis of a discrete set of classes or features assignments, respectively. More concretely, a partition of $^{3}$ is induced by partitions of $$.

An alternative approach is to consider partitions of $^{3}$ directly, or partitions of $^{3}$ induced by partitions of $^{2}$ . Rather than attempting a definition capturing a large, natural class of such models, we present an illustrative example:

A Mondrian process is a partition-valued stochastic process introduced by Roy and Teh . (See also Roy [57, Chp. V] for a formal treatment.) More specifically, a homogeneous Mondrian process on $^{2}$ is a continuous-time Markov chain $(M_{t}\colon t\geq 0)$ , where, for every time $t\geq 0$ , $M_{t}$ is a floorplan-partition of $^{2}$ —i.e., a partition of $^{2}$ comprised of axis-aligned rectangles of the form $A=B\times C$ , for intervals $B,C\subseteq$ . It is assumed that $M_{0}$ is the trivial partition containing a single rectangle.

Every continuous-time Markov chain is characterized by the mean waiting times between jumps and the discrete-time Markov process of jumps (i.e., the jump chain) embedded in the continuous-time chain. In the case of a Mondrian process, the mean waiting time from a partition composed of a finite set of rectangles $\{B_{1}\times C_{1},\dotsc,B_{k}\times C_{k}\}$ is $\sum_{j=1}^{k}(|B_{j}|+|C_{j}|)^{-1}$ . The jump chain of the Mondrian process is entirely characterized by its transition probability kernel, which is defined as follows: From a partition $\{B_{1}\times C_{1},\dotsc,B_{k}\times C_{k}\}$ of $^{2}$ , we choose to “cut” exactly one rectangle, say $B_{j}\times C_{j}$ , with probability proportional to $|B_{j}|+|C_{j}|$ ; Choosing $j$ , we then cut the rectangle vertically with probability proportional to $|C_{j}|$ and horizontally with probability proportional to $|B_{j}|$ ; Assuming the cut is horizontal, we partition $B_{j}$ into two intervals $B_{j,1}$ and $B_{j,2}$ , uniformly at random; The jump chain then transitions to the partition where $B_{j}\times C_{j}$ is replaced by $B_{j,1}\times C_{j}$ and $B_{j,2}\times C_{j}$ ; The analogous transformation occurs in the vertical case.

As is plain to see, each partition is produced by a sequence of cuts that hierarchically partition the space. The types of floorplan partitions of this form are called guillotine partitions. Guillotine partitions are precisely the partitions represented by $k$ d-trees, the classical data structure used to represent hierarchical, axis-aligned partitions.

Another, very popular example of a piece-wise constant model is the mixed-membership stochastic blockmodel of Airoldi et al. .

IV-D Gaussian-process-based models

All models discussed so far are characterized by a random function $F$ with piece-wise constant structure. In this section, we briefly discuss a large and important class of models that relax this restriction by modeling the random function as a Gaussian process.

If $\kappa$ is chosen appropriately, a random function sampled from the Gaussian process is continuous with probability 1.

We say that a Bayesian model for an exchangeable array ${X:=(X_{ij})}$ in $\mathbf{X}$ is Gaussian-process-based when, for some random function $F$ representing $X$ , the process ${F=(F_{x,y,z};\ x,y,z\in)}$ is Gaussian on $^{3}$ . We will say that an array $X$ is Gaussian-process-based when its distribution is.

In terms of Eq. III.12, a Gaussian-process-based model is one where a Gaussian process prior is placed on the function $f$ . The definition is stated in terms of the space $^{3}$ as domain of the uniform random variables $U$ to match our statement of the Aldous-Hoover theorem and of previous models. In the case of Gaussian processes, however, it is arguably more natural to use the real line instead of $ $, and we note that this is indeed possible: Given an embedding$ \phi:^{3}\to J $and a Gaussian process$ G $on$ J $, the process$ G^{\prime} $on$ ^{3} $given by$ G^{\prime}_{x,y,z}=G_{\phi(x,y,z)} $is Gaussian. More specifically, if the former has a mean function$ \mu $and covariance function$ \kappa $, then the latter has mean$ \mu\circ\phi $and covariance$ \kappa\circ(\phi\otimes\phi) $. We can therefore talk about Gaussian processes on spaces$ J$ that can be put into correspondence with the unit interval.

The random variables $X_{ij}$ in Definition IV.15 are real-valued. To model ${\{0,1\}}$ -valued arrays, such as random graphs, we can use a suitable randomization:

Many of the most popular parametric models for exchangeable arrays of random variables can be constructed as (randomizations of) Gaussian-process-based arrays. For a catalog of such models and several nonparametric variants, as well as their covariance functions, see . Here we will focus on the parametric eigenmodel, introduced by Hoff , and its nonparametric cousin, introduced Xu, Yan and Qi . To simplify the presentation, we will consider the case of a $\{0,1\}$ -valued array.

In the case of a $\{0,1\}$ -valued array, both the eigenmodel and its nonparametric extension can be interpreted as an $L$ -randomizations of a Gaussian-process-based array $\Theta:=(\Theta_{ij})$ , where $L$ is given as in Definition IV.16 for some mean, variance and sigmoid. To complete the description, we define the Gaussian processes underlying $\Theta$ .

A nonparametric counterpart to the eigenmodel was introduced by Xu et al. :

A related nonparametric model for exchangeable arrays, which places fewer restrictions on the covariance structure and is derived directly from the Aldous-Hoover representation, is described in .

V Convergence, Concentration and Graph Limits

We have already noted that the parametrization of random arrays by functions in the Aldous-Hoover theorem is not unique. Our statement of the theorem also lacks an asymptotic convergence result such as the convergence of the empirical measure in de Finetti’s theorem. The tools to fill these gaps have only recently become available in a new branch of combinatorics which studies objects known as graph limits. This section summarizes a few elementary notions of this rapidly emerging field and shows how they apply to the Aldous-Hoover theorem for graphs.

Convergence results in statistics study the behavior of the empirical distribution (II.5). The corresponding object for exchangeable graphs is an empirical estimate of the graphon, which is a “checkerboard function”: Given a finite graph $g_{n}$ with $n$ vertices, we subdivide $^{2}$ into ${n\times n}$ square patches, resembling the ${n\times n}$ adjacency matrix. We then define a function $w_{g_{n}}$ with constant value 0 or 1 on each patch, equal to the corresponding entry of the adjacency matrix. We call $w_{g_{n}}$ the empirical graphon of $g_{n}$ . Examples are plotted in Fig. 8. Since $w_{g_{n}}$ is a valid graphon, it parametrizes an infinite random graph, even though $n$ is finite. Aldous-Hoover theory provides a graph counterpart to the law of large numbers:

Let $w$ be a graphon. Suppose we sample a random graph from $w$ one vertex at a time, and $G_{n}$ is the graph given by the first $n$ vertices. Then the distributions defined by $w_{G_{n}}$ converge weakly to the distribution defined by $w$ with probability 1.

A recent development in combinatorics and graph theory is the theory of graph limits . This theory defines a distance measure $\delta_{{}_{\square}}$ on graphons (more details below). The metric can be applied to finite graphs, since the graphs can be represented by the empirical graphon. It is then possible to study conditions under which sequences of graphs converge to a limit. It turns out that limits of graphs can be represented by graphons, and the convergence of graphs corresponds precisely to the weak convergence of the distributions defined by the empirical graphons. This theory refines the Aldous-Hoover theory with a large toolbox of powerful results. We describe a few aspects in the following. The authoritative (and very well-written) reference is .

The most convenient way to define convergence is by defining a metric: If $d$ is a distance measure, we can define $w$ as the limit of $w_{g_{n}}$ if $d(w,w_{g_{n}})\to 0$ as $n\to\infty$ . The metric on functions which has emerged as the “right” choice for graph convergence is called the cut metric, and is defined as follows: We first define a norm as

Suppose $w$ and $w^{\prime}$ are two distinct functions which parametrize the same distribution on graphs. The distance $d_{{}_{\square}}$ in general perceives such functions as different: The functions in Fig. 8, for instance, define the same graph, but have non-zero distance under $d_{{}_{\square}}$ . Hence, if we were to use $d_{{}_{\square}}$ to define convergence, the two sequences of graphs in the figure would converge to two different limits. We therefore modify $d_{{}_{\square}}$ as follows: For any given $w$ , let $[w]$ be the set of all functions $w^{\prime}$ which define the same random graph.

Informally, we can think of the functions in $[w_{2}]$ as functions obtained from $w_{2}$ by a “rearrangement” like the one illustrated in Fig. 5. The definition above says that, before we measure the distance between $w_{1}$ and $w_{2}$ using $d_{{}_{\square}}$ , we rearrange $w_{2}$ in the way that aligns it most closely with $w_{1}$ . In Fig. 5, this closest rearrangement would simply reverse the permutation of blocks, so that the two functions would look identical.

The function $\delta_{{}_{\square}}$ is called the cut pseudometric: It is not an actual metric, since it can take value 0 for two distinct functions. It does, however, have all other properties of a metric. By definition, $\delta_{{}_{\square}}(w,w^{\prime})=0$ holds if and only if $w$ and $w^{\prime}$ parametrize the same random graph.

Clearly, the graph limit is a graphon, and the two terms are used interchangeably. This definition indeed provides a metric counterpart to convergence of exchangeable graph distributions:

V-B Unique parametrization in the Aldous-Hoover theorem

Recall that two graphons can be equivalent, in the sense that they are distinct functions but define the same random graph (they are weakly isomorphic in the language of Section III-D). The equivalence classes $[w]$ form a partition of the space $\mathbf{W}$ of all graphons, which motivates the definition of a “quotient space”: We can define a new space $\widehat{\mathbf{W}}$ by collapsing each equivalence class to a single point. Each element ${\widehat{w}\in\widehat{\mathbf{W}}}$ corresponds to all functions in one equivalence class, and hence to one specific random graph distribution. Since the pseudometric $\delta_{{}_{\square}}$ only assigns distance 0 to two distinct functions if they are equivalent, it turns into a metric on $\widehat{\mathbf{W}}$ , and $(\widehat{\mathbf{W}},\delta_{{}_{\square}})$ is a metric space. Although the elements of this space are abstract objects, not actual functions, the space has remarkable analytical properties, and is one of the central objects of graph limit theory.

Since $\widehat{\mathbf{W}}$ contains precisely one element for each ergodic distribution on exchangeable graphs, we can obtain a unique parametrization of exchangeable graph models by using ${\mathbf{T}:=\widehat{\mathbf{W}}}$ as a parameter space: If ${w\in\mathbf{W}}$ is a graphon and $\widehat{w}$ the corresponding element of $\widehat{\mathbf{W}}$ —the element to which $w$ was collapsed in the definition of $\widehat{\mathbf{W}}$ —we define a family ${\{\mathbf{p}_{\widehat{w}}:\widehat{w}\in\widehat{\mathbf{W}}\}}$ of distributions on exchangeable arrays by taking $\mathbf{p}_{\widehat{w}}$ to be the distribution induced by the uniform sampling scheme described by Eq. III.8 when $W=w$ .

Although the existence of such a probability kernel is not a trivial fact, it follows from a technical result of Orbanz and Szegedy . In particular, the Aldous-Hoover theorem for an exchangeable random graph $G$ can then be written as a unique integral decomposition

in analogy to the de Finetti representation.

V-C Regularity and Concentration

All convergence results we have seen so far provide only asymptotic guarantees of convergence, but no convergence rates. We give two examples of concentration results from graph limit theory, which address similar questions as those asked in mathematical statistics and empirical process theory: How large a graph do we have to observe to obtain reliable estimates?

Underlying these ideas is one of the deepest results of modern graph theory, Szemeredi’s regularity lemma: For every very large graph $g$ , there is a small, weighted graph $\hat{g}$ that summarizes all essential structure in $g$ . The only condition is that $g$ is sufficiently large. In principle, this means that $\hat{g}$ can be used as an approximation or summary of $g$ , but unfortunately, the result is only valid for graphs which are much larger than possible in most conceivable applications. There are, however, weaker forms of this result which hold for much smaller graphs.

To define $\hat{g}$ for a given graph $g$ , we proceed as follows: Suppose $\Pi:=\{V_{1},\dotsc,V_{k}\}$ is a partition of $\mathbf{V}(g)$ into $k$ sets. For any two sets $V_{i}$ and $V_{j}$ , we define $p_{ij}$ as the probability that two vertices $v\in V_{i}$ and $v^{\prime}\in V_{j}$ , each chosen uniformly at random from its set, are connected by an edge. That is,

The graph $\hat{g}_{\Pi}$ is now defined as the weighted graph with vertex set $\{1,\dotsc,k\}$ and edge weights $p_{ij}$ for edge $(i,j)$ . To compare this graph to $g$ , it can be helpful to blow it up to a graph $g_{\Pi}$ of the same size as $g$ , constructed as follows:

Each node $i$ is replaced by a clique of size $|V_{i}|$ (with all edges weighted by 1).

For each pair $V_{i}$ and $V_{j}$ , all possible edges between the sets are inserted and weighted by $p_{ij}$ .

If we measure how much two graphs differ in terms of the cut distance, $g$ can be approximated by $g_{\Pi}$ as follows:

This form of the result is called “weak” since it uses a less restrictive definition of what it means for $g$ and $g_{\Pi}$ to be close then Szemerédi’s original result. The weaker hypothesis makes the theorem applicable to graphs that are, by the standards of combinatorics, of modest size.

A prototypical concentration result based on Theorem V.4 is the following:

The relevance of such results to statistics becomes evident if we think of $f$ as a statistic of a graph or network (such as the edge density) which we try to estimate from an observed subgraph of size $k$ . Results of this type, for graphs and other random structures, are collectively known under the term property testing, and are covered by a sizeable literature in combinatorics and theoretical compute science .

VI Exchangeability and higher-dimensional arrays

The theory of exchangeable arrays extends beyond 2-dimensional arrays, and, indeed, some of the more exciting implications and applications of the theory rely on the general results. In this section we begin by defining the natural extension of (joint) exchangeability to higher dimensions, and then give higher-dimensional analogues of the theorems of Aldous and Hoover, due to Kallenberg. These theorems introduce exponentially-many additional random variables as the dimension increases, but only a linear number are necessary to produce an arbitrarily good approximation. The presentation owes much to Kallenberg .

Let $(X_{k_{1},\dotsc,k_{d}})$ be a $d$ -dimensional array (or simply $d$ -array) of random variables in $\mathbf{X}$ . We say that $X$ is jointly exchangeable when

When $d=2$ , we recover Theorem III.2 characterizing two-dimensional exchangeable arrays. Indeed, if we write $U_{i}:=U_{\{\!\{i\}\!\}}$ and $U_{ij}:=U_{\{\!\{i,j\}\!\}}$ for notational convenience, then the right hand side of Eq. VI.4 reduces to

for some random $F:^{3}\to\mathbf{X}$ . When $d=3$ , we instead have

for some random $F:^{7}\to\mathbf{X}$ , where we have additionally taken $U_{ijk}:=U_{\{\!\{i,j,k\}\!\}}$ for notational convenience.

As in the two-dimensional case, arrays with certain additional symmetries can be treated as special cases. In this section, we consider separate exchangeability in the setting of $d$ -arrays, and in the next section we consider further generalizations. We begin by defining:

We say that $d$ -array $X$ is separately exchangeable when

for the element of the function space $^{2^{[d]}\setminus\emptyset}$ that maps each nonempty subset $I\subseteq[d]$ to the real $U_{k_{I}}$ , i.e., the element in the collection $U$ indexed by the vector $k_{I}$ . Then we have:

We can consider the special cases of $d=2$ and $d=3$ arrays. Then we have, respectively,

for some random $F:^{3}\to\mathbf{X}$ ; and

for some random $F:^{7}\to\mathbf{X}$ . As we can see, jointly exchangeable arrays, which are required to satisfy fewer symmetries than their separately exchangeable counterparts, may take $U_{ij0}=U_{0ij}=U_{i0j}=U_{ji0}=\dotsc$ . Indeed, one can show that these additional assumptions make jointly exchangeable arrays a strict superset of separately exchangeable arrays, for $d\geq 2$ .

VI-B Further generalizations

In applications, it is common for the distribution of an array to be invariant to permutations that act simultaneously on some but not all of the dimensions. E.g., if the first two dimensions of an array index into the same collection of users, and the users are a priori exchangeable, then a sensible notion of exchangeability for the array would be one for which these first two dimensions could be permuted jointly together, but separately from the remaining dimensions.

More generally, we consider arrays that, given a partition of the dimensions of an array into classes, are invariant to permutations that act jointly within each class and separately across classes. More carefully:

for every collection $p$ of permutations, where $\pi_{i}$ denotes the subset $I\in\pi$ containing $i$ .

We may now cast both jointly and separately exchangeable arrays as $\pi$ -exchangeable arrays for particular choices of partitions $\pi$ . In particular, when $\pi=\{[d]\}$ we recover joint exchangeability, and when $\pi=\{\{1\},\dotsc,\{d\}\}$ , we recover separate exchangeability. Just as we characterized jointly and separately exchangeable arrays, we can characterize $\pi$ -exchangeable arrays.

VI-C Approximations by simple arrays

These representational results require a number of latent random variables exponential in the dimension of the array, i.e., roughly twice as many latent variables are needed as the entries generated in some subarray. Even if a $d$ -array is sparsely observed, each observation requires the introduction of potentially $2^{d}$ variables. (In a densely observed array, there will be overlap, and most latent variables will be reused.)

Regardless of whether this blowup poses a problem for a particular application, it is interesting to note that exchangeable $d$ -arrays can be approximated by arrays with much simpler structure, known as simple arrays.

Again, it is instructive to study special cases: in the jointly exchangeable case, taking $U_{j}:=U^{\{[d]\}}_{j}$ , we get

and, in the separately exchangeable case, we get

taking $U^{i}_{j}:=U^{\{i\}}_{j}$ . We may now state the relationship between general arrays and simple arrays:

VII Sparse random structures and networks

Exchangeable random structures are not “sparse”. In an exchangeable infinite graph, for example, the expected number of edges attached to each node is either infinite or zero. In contrast, graphs representing network data typically have a finite number of edges per vertex, and exhibit properties like power-laws and “small-world phenomena”, which can only occur in sparse graphs. Hence, even though exchangeable graph models are widely used in network analysis, they are inherently misspecified. Since the lack of sparseness is a direct mathematical consequence of exchangeability, networks and sparse random structures pose a problem that seems to require genuinely non-exchangeable models. The development of a coherent theory is, despite intense efforts in mathematics, a largely unsolved problem. In this section, we make the problem more precise and describe how, at least in principle, exchangeability might be substituted by other symmetry properties. The topic raises a host of challenging questions to which, in most cases, we have no answers.

In an exchangeable structure, events either never occur, or they occur infinitely often with a fixed (though unknown) probability. The simplest example is an exchangeable binary sequence: By de Finetti’s theorem, the binary variables are conditionally i.i.d. with a Bernoulli distribution. If we sample infinitely often, conditionally on the random Bernoulli parameter taking value ${p\in}$ , the fraction of ones in the sequence will be precisely $p$ . Therefore, we either observe a constant proportion of ones (if ${p>0}$ ) or no ones at all (if ${p=0}$ ).

In an exchangeable graph, rather than ones and zeros, we have to consider the possible subgraphs (single edges, triangles, five-stars, etc). Each possible subgraph occurs either never, or infinitely often. Since an infinite graph may have infinitely many edges even if it is sparsely connected, the number of edges is best quantified in terms of a rate:

Let $(g_{n})$ be a sequence of graphs ${g_{n}=(v_{n},e_{n})}$ , where $g_{n}$ has $n$ vertices. We say that the sequence is sparse if, as $n$ increases, $|e_{n}|$ is of size $O(n)$ (is upper-bounded by $c\cdot n$ for a constant $c$ ). It is called dense if ${|e_{n}|=\Omega(n^{2})}$ (lower-bounded by ${c\cdot n^{2}}$ for a constant $c$ ).

If a random graph is sampled step-wise one vertex at a time, the partial graphs at each step also form a sequence, and we can refer to the random graph as dense or sparse, depending on whether the sequence is dense or sparse. (This definition has to be used with caution, since changing the order in which vertices are generated may affect the rate.) A typical example of dense random graphs are infinite random graphs in which each vertex has infinite degree. Random graphs with bounded degrees are sparse. Many important types of graph and array data are inherently sparse: In a social network with billions of users, individual users do not, on average, have billions of friends.

Exchangeable graphs are not sparse. If a random graph is exchangeable, it is either dense or empty.

The theory of graph limits described in Section V is closely related to exchangeability, and is inherently a theory of dense graphs: If we construct a sequence of graphs with sparsely growing edge sets, convergence in cut metric is still well-defined, but the limit object is always the empty graphon, i.e., a function on $^{2}$ which vanishes almost everywhere.

One possible way to generate sparse graphs is of course to modify the sampling scheme for exchangeable graphs to generate fewer edges.

There is a very simple way to modify the Aldous-Hoover approach into one that generates sparse random graphs: Suppose we sample a finite graph with a fixed number $n$ of vertices. We simply multiply the probability in our usual sampling scheme by $1/n$ :

Comparison with our argument why exchangeable graphs are dense immediately shows that a graph sampled this way is sparse. More generally, we can multiply $w$ by some other rate function $\rho_{n}$ (instead of specifically ${\rho_{n}=1/n}$ ), and ask how this model behaves for ${n\to\infty}$ . Statistical properties of this model are studied by Bickel, Chen, and Levina , who consider the behavior of moment estimators for the edge density, triangle density and other subgraph densities.

An obvious limitation of the BJR model is that it does not actually attempt to model network structure: It can equivalently be sampled by sampling from a graphon as in (III.8) and then deleting each edge independently at random, with probability $(1-\rho_{n})$ . (In the parlance of random graph theory, this is exchangeable sampling followed by i.i.d. bond percolation.) In other words, the BJR model modifies an exchangeable graph to fit a first-order statistic of a network (the number of edges), but it cannot generate typical network structures, such as power laws.

VII-B Beyond exchangeability: Symmetry and ergodic theory

The example of networks and sparse structures shows that there are important random structures which are not exchangeable. This raises the question whether integral decompositions and statistical models, which we have throughout derived from exchangeability, can be obtained in a similar manner for structures that are not exchangeable. In principle, that is possible: Exchangeability is a special case of a probabilistic symmetry. It turns out that integral decompositions can be derived from much more general symmetries than exchangeability.

A very general result, the ergodic decomposition theorem, shows that integral decompositions of the form (II.8) are a general consequence of probabilistic symmetries, rather than specifically of exchangeability. The general theme is that there is some correspondence of the form

In principle, Bayesian models can be constructed based on any type of symmetry, as long as this symmetry defines a useful set of ergodic distributions. The following statement of the ergodic decomposition theorem glosses over various technical details; for a precise statement, see [39, Theorem A1.4].

We have already encountered the components of (VII.1) in Section II: In Bayesian terms, $\mathbf{p}_{\theta}$ again corresponds to the observation distribution and $\nu$ to the prior. Geometrically, the integral representation Eq. VII.1 can be regarded as convex combination. Fig. 9 illustrates this idea for a toy example with three ergodic measures. A special case of the ergodic decomposition theorem is well-known in Bayesian theory as a result of Freedman :

In the language of Theorem VII.5, the group $G$ is the set of all rotations and reflections acting on all finite prefixes of a sequence. For every $\sigma>0$ , let $\mathcal{N}_{\sigma}$ be the distribution of zero-mean Gaussian random variable with standard deviation $\sigma$ . Freedman showed that, if $X^{\infty}$ satisfies Eq. VII.2, then its distribution is a scale mixture of Gaussians:

An alternative way to define symmetry in statistical models is through sufficient statistics: Intuitively, a symmetry property identifies information which is not relevant to the statistical problem; so does a sufficient statistic. For example, the empirical distribution retains all information about a sample except for the order in which observations were recorded. The empirical distribution is hence a sufficient statistic for the set distributions of exchangeable sequences. In an exchangeable graph model, the empirical graphon (the checkerboard function in Fig. 8) is a sufficient statistic. If the sufficient statistic is finite-dimensional and computes an average $\frac{1}{n}\sum_{i}S_{0}(x_{i})$ over observations for some function $S_{0}$ , the ergodic distributions are exponential family models . A readable introduction to this topic is given by Diaconis . The definitive reference is the monograph of Lauritzen , who refers to the set $\mathcal{E}$ of ergodic distributions as an extremal family.

Not every probabilistic symmetry is applicable in statistics in the same way as exchangeability is. To be useful to statistics, the symmetry must satisfy two conditions:

The set $\mathcal{E}$ of ergodic measures should be a “small” subset of the set of symmetric measures.

The measures $\mathbf{p}_{\theta}$ should have a tractable representation, such as Kingman’s paint-box or the Aldous-Hoover sampling scheme.

Theorem VII.5 guarantees neither. If (1) is not satisfied, the representation is useless for statistical purposes: The integral representation Eq. VII.1 means that the information in $X_{\infty}$ is split into two parts, the information contained in the parameter value $\theta$ (which a statistical procedure tries to extract) and the randomness represented by $\mathbf{p}_{\theta}$ (which the statistical procedure discards). If the set $\mathcal{E}$ is too large, $\Theta$ contains almost all the information in $X_{\infty}$ , and the decomposition becomes meaningless. We will encounter an appealing notion of symmetry for sparse networks in the next section—which, however, seems to satisfy neither condition (1) or (2). It is not clear at present whether there are useful types of symmetries which do not imply some form of invariance to a group of permutations. Although the question is abstract, the incompatibility of sparseness and exchangeability means it is directly relevant to Bayesian statistics.

VII-C Stationary networks and involution invariance

Is there a form of invariance that yields statistical models for network data? There is indeed a very natural notion of invariance in networks, called involution invariance, which we describe in more detail below. This property has interesting mathematical properties and admits an ergodic decomposition as in Theorem VII.5, but it seems to be too weak for applications in statistics.

A crucial difference between network structures and exchangeable graphs is that, in most networks, location in the graph matters. If conditioning on location is informative, exchangeability is broken. Probabilistically, location is modeled by marking a distinguished vertex in the graph. A rooted graph $(g,v)$ is simply a graph $g$ in which a particular vertex $v$ has been marked. A very natural notion of invariance for networks is called involution invariance or unimodularity , and can be thought of as a form of stationarity:

The definition says that, if an observer randomly walks along the graph $G$ by moving to a uniformly selected neighbor in each step, the distribution of the network around the observer remains unchanged (although the actual neighborhoods in a sampled graph may vary).

Involution invariance is a symmetry property which admits an ergodic decomposition, and Aldous and Lyons have characterized the ergodic measures. This characterization is abstract, however, and there is no known “nice” representation resembling, for example, the sampling scheme for exchangeable graphs. Thus, of the two desiderata described in Section VII-B, property (2) does not seem to hold. We believe that property (1) does not hold either: Although we have no proof at present, it seems that involution invariance is too weak a constraint to yield interesting statistical models (i.e., the set of ergodic distributions is a “large” subset of the involution invariant distributions).

Since exchangeability and involution invariance are the only well-studied probabilistic symmetries for random graphs, the question how statistical models of networks can be characterized is an open problem:

Is there a notion of probabilistic symmetry whose ergodic measures in (VII.1) describe useful statistical models for sparse graphs with network properties?

There is a sizeable literature on sparse random graph models which can model power laws and other network properties; see, for example . These are probability models and can be simulated, but estimation from data is often intractable, due to stochastic dependencies between the edges in the random graph. On the other hand, some dependence between edges is necessary to obtain a power law and similar properties. Hence, a suitable notion of symmetry would have to restrict dependencies between edges sufficiently to permit statistical inference, but not to the full conditional independence characteristic of the exchangeable case.

VIII Further References

Excellent non-technical references on the general theory of exchangeable arrays and other exchangeable random structures are two recent surveys by Aldous . His well-known lecture notes also cover exchangeable arrays. The most comprehensive available reference on the general theory is the monograph by Kallenberg (which presupposes in-depth knowledge of measure-theoretic probability). Kingman’s original article provides a concise reference on exchangeable random partitions. A thorough, more technical treatment of exchangeable partitions can be found in .

Schervish gives an insightful discussion of the application of exchangeability to Bayesian statistics. There is a close connection between probabilistic symmetries (such as exchangeability) and sufficient statistics, which is covered by a substantial literature. See Diaconis for an introduction and further references. For applications of exchangeability results to machine learning models, see , who discuss applications of the partial exchangeability result of Diaconis and Freedman to the infinite hidden Markov model .

The theory of graph limits in its current form was initiated by Lovász and Szegedy and Borgs et al. . It builds on work of Frieze and Kannan , who introduced both the weak regularity lemma (Theorem V.4) and the cut norm $d_{{}_{\square}}$ . In the framework of this theory, the Aldous-Hoover representation of exchangeable graphs can be derived by purely analytic means [48, Theorem 2.7]. The connection between graph limits and Aldous-Hoover theory was established, independently of each other, by Diaconis and Janson and by Austin . An accessible introduction to the analytic perspective is the survey , which assumes basic familiarity with measure-theoretic probability and functional analysis, but is largely non-technical. The monograph gives a comprehensive account.

Historically, the Aldous-Hoover representation was established in independent works of Aldous and of Hoover in the late 1970s. Aldous’ proof used probability-theoretic methods, whereas Hoover, a logician, leveraged techniques from model theory. In 1979, Kingman writes

…a general solution has now been supplied by Dr David Aldous of Cambridge. […] The proof is at present very complicated, but there is reason to hope that the techniques developed can be applied to more general experimental designs.

Aldous’ paper , published in 1981, attributes the idea of the published version of the proof to Kingman. The results were later generalized considerably by Kallenberg .

Acknowledgments. We have learned much about random graphs from Cameron Freer, and are also indebted to James Robert Lloyd for many useful discussions. Two anonymous reviewers have provided very detailed feedback, which in our opinion has greatly improved the article. We thank Karolina Dziugaite, Creighton Heaukalani, Jonathan Huggins and Christian Steinrücken for helpful comments on the manuscript.