Deep Models of Interactions Across Sets

Jason Hartford, Devon R Graham, Kevin Leyton-Brown, Siamak Ravanbakhsh

Introduction

Exchangeability has a long history in machine learning and statistics. de Finetti’s theorem states that exchangeable sequences are the product of a latent variable model. Extensions of this theorem characterize distributions over other exchangeable structures, from matrices to graphs; see Orbanz & Roy (2015) for a detailed treatment. In machine learning, a variety of frameworks formalize exchangeability in data, from plate notation to statistical relational models (Raedt et al., 2016; Getoor & Taskar, 2007). When dealing with exchangeable arrays (or tensors), a common approach is tensor factorization. In particular, one thread of work leverages tensor decomposition for inference in latent variable models (Anandkumar et al., 2014). However, in addition to having limited expressive power, tensor factorization requires retraining models for each new input.

We call a learning algorithm Permutation Equivariant (PE) if it is constrained to give the same answer across all exchangeable inputs; we argue that PE is an important form of inductive bias in exchangeable domains. However, it is not trivial to achieve; indeed, all existing parametric factorization and matrix/tensor completion methods associate parameters with each row and column, and thus are not PE. How can we enforce this property? One approach is to augment the input with all $M!\times N!$ permutations of rows and columns. However, this is computationally wasteful and becomes infeasible for all but the smallest $N$ and $M$ . Instead, we show that a simple parameter-sharing scheme suffices to ensure that a deep model can represent only PE functions. The result is analogous to the idea of a convolution layer: a lower-dimensional effective parameter space that enforces a desired equivariance property. Indeed, parameter-sharing is a generic and efficient approach for achieving equivariance in deep models (Ravanbakhsh et al., 2017).

When the matrix models the interaction between the members of the same group, one could further constrain permutations to be identical across both rows and columns. An example of such a jointly exchangeable matrix (Orbanz & Roy, 2015), modelling the interaction of the nodes in a graph, is the adjacency matrix deployed by graph convolution. Our approach reduces to graph convolution in the special case of 2D arrays with this additional parameter-sharing constraint, but also applies to arbitrary matrices and higher dimensional arrays.

Finally, we explain connections to some of our own past work. First, we introduced a similar parameter-sharing scheme in the context of behavioral game theory (Hartford et al., 2016): rows and columns represented players’ actions and the exchangeable matrix encoded payoffs. The current work provides a theoretical foundation for these ideas and shows how a similar architecture can be applied much more generally. Second, our model for exchangeable matrices can be seen as a generalization of the deep sets architecture (Zaheer et al., 2017), where a set can be seen as a one-dimensional exchangeable array.

In what follows, we begin by introducing our parameter-sharing approach in Section 2, considering the cases of both dense and sparse matrices. In Section 3.2 we study two architectures for matrix completion that use an exchangeable matrix layer. In particular the factorized autoencoding model provides a powerful alternative to commonly used matrix factorization methods; notably, it does not require retraining to be evaluated on previously unseen data. In Section 4 we present extensive results on benchmark matrix completion datasets. We generalize our approach to higher-dimensional tensors in Section 5.

Exchangeable Matrix Layer

Let ${X}\in\mathds{R}^{N\times M}$ be our “exchangeable” input matrix. We use $\operatorname{vec}({{X}})\in\mathds{R}^{NM}$ to denote its vectorized form and $\operatorname{vec}^{-1}({{x}})$ to denote the inverse of the vectorization that reshapes a vector of length $NM$ into an $N\times M$ matrix—i.e., $\operatorname{vec}^{-1}({\operatorname{vec}({{X}})})={X}$ . Consider a fully connected layer ${{Y}}:=\operatorname{vec}^{-1}({\sigma(W\operatorname{vec}({{X}}))})$ where $\sigma$ is an element-wise nonlinear function such as sigmoid, $W\in\mathds{R}^{NM\times NM}$ , and ${Y}\in\mathds{R}^{N\times M}$ is the output matrix. Exchangeablity of ${X}$ motivates the following property: suppose we permute the rows and columns ${X}$ using two arbitrary permutation matrices $G^{(N)}\in\{0,1\}^{N\times N}$ and $G^{(M)}\in\{0,1\}^{M\times M}$ to get ${X}^{\prime}:=G^{(N)}{X}G^{(M)}$ . Permutation Equivariance (PE) requires the new output matrix ${Y}^{\prime}:=\operatorname{vec}^{-1}({\sigma(W\operatorname{vec}({X^{\prime}}))})$ to experience the same permutation of rows and columns—that is, we require ${Y}^{\prime}=G^{(N)}{Y}G^{(M)}$ .

Given $X\in\mathds{R}^{N\times M}$ , a fully connected layer $\sigma(W\operatorname{vec}({X}))$ with $W\in\mathds{R}^{NM\times NM}$ is called an exchangeable matrix layer if, for all permutation matrices $G^{(N)}\in\{0,1\}^{N\times N}$ and $G^{(M)}\in\{0,1\}^{M\times M}$ , permutation of the rows and columns results in the same permutations of the output:

This requirement heavily constrains the weight matrix $W$ : indeed, its number of effective degrees of freedom cannot even grow with $N$ and $M$ . Instead, the resulting layer is forced to have the following, simple form:

For each output element ${Y}_{n,m}$ , we have the following parameters: one connecting it to its counterpart ${X}_{n,m}$ ; one each connecting it to the inputs of the same row and the inputs of the same column; and one shared by all the other connections. We also include a bias parameter; see Fig. 2 for an illustration of the action of this parameter matrix, and Fig. 1 for a visual illustration of it. Theorem 2.1 formalizes the requirement on our parameter matrix. All proofs are deferred to the appendix.

Given a strictly monotonic function $\sigma$ , a neural layer $\sigma(W\operatorname{vec}({X}))$ is an exchangeable matrix layer iff the elements of the parameter matrix $W$ are tied together such that the resulting fully connected layer simplifies to

where $\mathbf{1}_{N}=\underbrace{[1,\ldots,1]^{\mathsf{T}}}_{\text{length}N}$ and $w_{1}^{\prime},\ldots,w_{5}^{\prime}\in\mathds{R}$ .

This parameter sharing is translated to summing or averaging elements across rows and columns; more generally, PE is preserved by any commutative pooling operation. Moreover, stacking multiple layers with the same equivariance properties preserves equivariance (Ravanbakhsh et al., 2017). This allows us to build deep PE models.

Input Features for Rows and Columns

In some applications, in addition to the matrix ${X}\in\mathds{R}^{N\times M\times K}$ , where $K$ is the number of input channels/features, we may have additional features for rows $R\in\mathds{R}^{N\times K^{\prime}}$ and/or columns $C\in\mathds{R}^{M\times K^{\prime\prime}}$ . We can preserve PE by broadcasting these row/column features over the whole matrix and treating them as additional input channels.

Jointly Exchangeable Matrices

For jointly exchangeable matrices, such as adjacency matrix, Equation Eq. 1 is constrained to have $N=M$ and $G^{(N)}=G^{(M)}$ . This will in turn constrain the corresponding parameter-sharing so that $w_{2}=w_{3}$ in Equation Eq. 2.

1 Sparse Inputs

Real-world arrays are often extremely sparse. Indeed, matrix and tensor completion is only meaningful with missing entries. Fortunately, the equivariance properties of Theorem 2.1 continue to hold when we only consider the nonzero (observed) entries. For sparse matrices, we continue to use the same parameter-sharing scheme across rows and columns, with the only difference being that we limit the model to observed entries. We now make this precise.

Matrix Factorization and Completion

Recommender systems are very widely applied, with many modern applications suggesting new items (e.g., movies, friends, restaurants, etc.) to users based on previous ratings of other items. The core underlying problem is naturally posed as a matrix completion task: each row corresponds to a user and each column corresponds to an item; the matrix has a value for each rating given to an item by a given user; the goal is to fill in the missing values of this matrix.

In Section 3.1 we review two types of analysis in dealing with exchangeable matrices. Section 3.2 introduces two architectures: a self-supervised model—a simple composition of exchangeable matrix layers—that is trained to produce randomly removed entries of the observed matrix during the training; and a factorized model that uses an auto-encoding nonlinear factorization scheme. While there are innumerable methods for (nonlinear) matrix factorization and completion, both of these models are the first to generalize to inductive settings while achieving competitive performance in transductive settings. Section 3.3 introduces two subsampling techniques for large sparse matrices followed by a literature review in Section 3.4.

2 Architectures

Factorized Exchangeable Autoencoder (FEA)

This procedure is also analogous to classical matrix factorization, with an an important distinction that once trained, we can readily factorize the unseen matrices without performing any optimization. Note that the same architecture trivially extends to tensor-factorization, where we use an exchangeable tensor layer (see Section 5).

Channel Dropout

Both the factorized autoencoder and self-supervised exchangeable model are flexible enough to make regularization important for good generalization performance. Dropout (Srivastava et al., 2014) can be extended to apply to exchangeable matrix layers by noticing that each channel in an exchangeable matrix layer is analogous to a single unit in a standard feed-forward network. We therefore randomly drop out whole channels during training (as opposed to dropping out individual elements). This procedure is equivalent to the SpatialDropout technique used in convolutional networks (Tompson et al., 2015).

3 Subsampling in Large Matrices

A key practical challenge arising from our approach is that our models are designed to take the whole data matrix ${X}$ as input and will make different (and typically worse) predictions if given only a subset of the data matrix. As datasets grow, the model and input data become too large to fit within fixed-size GPU memory. This is problematic both during training and at evaluation time because our models rely on aggregating shared representations across data points to make their predictions. To address this, we explore two subsampling procedures.

The simplest approach is to sample sub-matrices of ${X}$ by uniformly sampling from its (typically sparse) elements. This has the advantage that each submatrix is an unbiased sample of the full matrix and that the procedure is computationally cheap, but has the potential to limit the performance of the model because the relationships between the elements of ${X}$ are sparsified.

Conditional sampling

Sampling at test time

4 Related Literature

The literature in matrix factorization and completion is vast. The classical approach to solving the matrix completion problem is to find some low rank (or sparse) approximation that minimizes a reconstruction loss for the observed ratings (see e.g., Candès & Recht, 2009; Koren et al., 2009; Mnih & Salakhutdinov, 2008). Because these procedures learn embeddings for each user and item to make predictions, they are transductive, meaning they can only make predictions about users and items observed during training. To our knowledge this is also true for all recent deep factorization and other collaborative filtering techniques (e.g., Salakhutdinov et al., 2007; Deng et al., ; Sedhain et al., 2015; Wang et al., 2015; Li et al., 2015; Zheng et al., 2016; Dziugaite & Roy, 2015). An exception is a recent work by Yang et al. (2016) that extends factorization-style approaches to the inductive setting (where predictions can be made on unseen users and items). However their method relies on additional side information to represent users and items. By contrast, our approach is able to make inductive completions on rating matrices that may differ from that which was observed during training without using any side information (though our approach can easily incorporate side information).

Matrix completion may also be posed as predicting edge weights in bipartite graphs (Berg et al., 2017; Monti et al., 2017). This approach builds on recent work applying convolutional neural networks to graph-structured data (Scarselli et al., 2009; Bruna et al., 2014; Duvenaud et al., 2015; Defferrard et al., 2016; Kipf & Welling, 2016; Hamilton et al., 2017). As we saw, parameter sharing in graph convolution is closely related to parameter sharing in exchangeable matrices, and indeed it is a special case where $w_{2}=w_{3}$ in Equation Eq. 2.

Empirical Results

For reproducibility we have released Tensorflow and Pytorch implementations of our model.Tensorflow: https://github.com/mravanba/deep_exchangeable_tensors. Pytorch: https://github.com/jhartford/AutoEncSets. Details on the training and architectures appear in the appendix. Section 4.1 reports experimental results in the standard transductive (matrix interpolation) setting. However, more interesting results are reported in Section 4.2, where we test a trained deep model on a completely different dataset. Finally Section 4.3 compares the model’s performance when using different sampling procedures. The datasets used in our experiments are summarized in Table 1.

Here, we demonstrate that exploiting the PE structure of the exchangeable matrices allows us to achieve results competitive with state-of-the-art, while maintaining a constant number of parameters. Note that the number of parameters in all the competing methods grow with $N$ and/or $M$ .

In Table 2 we report that the self-supervised exchangeable model is able to achieve state of the art performance on MovieLens-100K dataset. For MovieLens-1M dataset, we cannot fit the whole dataset into the GPU memory for training and therefore use conditional subsampling; also see Section 4.3. Our results on this dataset are summarized in table Table 3. On this larger dataset both models gave comparatively weaker performance than what we observed on the smaller ML-100k dataset and in the extrapolation results. We suspect this is largely due to memory constraints: there is a trade-off between the size of the model (in terms of number of layers and units per layer) and the batch size one can train. We found that both larger batches and deeper models tended to offer better performance, but on these larger datasets it is not currently possible to have both. The results for the factorized exchangeable autoencoder architecture are similar and reported in the same table.

2 Inductive Setting (Matrix Extrapolation)

Because our model does not rely on any per-user or per-movie parameters, it should be able to generalize to new users and movies that were not present during training. We tested this by training an exchangeable factorized autoencoder on the MovieLens-100k dataset and then evaluating it on a subsample of data from the MovieLens-1M dataset. At test time, the model was shown a portion of the new ratings and then made to make predictions on the remaining ratings.

Figure 5 summarizes the results where we vary the amount of data shown to the model from 5% of the new ratings up to 95% and compare against K-nearest neighbours approaches. Our model significantly outperforms the baselines in this task and performance degrades gracefully as we reduce the amount of data observed.

Perhaps most surprisingly, we were able to achieve competitive results when training and testing on completely disjoint datasets. For this experiment we stress-tested our model’s inductive ability by testing how it generalizes to new datasets without retraining. We used a Factorized Exchangeable Autoencoder that was trained to make predictions on the MovieLens-100k dataset and tasked it with making predictions on the Flixster, Douban and YahooMusic datasets. We then evaluated its performance against models trained for each of these individual datasets. All the datasets involve rating prediction tasks, so they share some similar semantics with MovieLens, but they have different user bases and (in the case of YahooMusic) deal with music not movie ratings, so we may expect some change in the rating distributions and user-item interactions. Furthermore, the Flixster ratings are in 0.5 point increments from $1-5$ and YahooMusic allows ratings from $1-100$ , while Douban and MovieLens ratings are on $1-5$ scale. To account for the different rating scales, we simply binned the inputs to our model to a $1-5$ range and, where applicable, linearly rescaled the output before comparing it to the true ratingBecause of this binning procedure, our model received input data that is considerably coarser-grained than that which was used for the comparison models.. Despite all of this, Table 4 shows that our model achieves very competitive results with models that were trained for the task.

For comparison, we also include the performance of a Factorized EAE trained on the respective datasets. This improves performance of our model over previous state of the art results on the Flixster and YahooMusic datasets and gives very similar performance to Berg et al. (2017)’s GC-MC model on the Douban dataset. Interestingly, we see the largest performance gains over existing approaches on the datasets in which ratings are sparse (see Table 1).

3 Comparison of sampling procedures

We evaluated the effect of subsampling the input matrix ${X}$ on performance, for the MovieLens-100k dataset. The results are summarized in Figure 6. The two key findings are: I) even with large batch sizes of 20 000 examples, performance for both sampling methods is diminished compared to the full batch case. We suspect that our models’ weaker results on the larger ML-1M dataset may be partially attributed to the need to subsample. II) the conditional sampling method was able to recover some of the diminished performance. We believe it is likely that more sophisticated sampling schemes that explicitly take into account the dependence between hidden nodes will lead to better performance but we leave that to future work.

Extention to Tensors

In this section we generalize the exchangeable matrix layer to higher-dimensional arrays (tensors) and formalize its optimal qualities. Suppose ${X}\in\mathds{R}^{N_{1}\times...\times N_{D}}$ is our $D$ -dimensional data tensor, and $\operatorname{vec}({{X}})$ its vectorized form. We index $\operatorname{vec}({{X}})$ by tuples $(n_{1},n_{2},...,n_{D})$ , corresponding to the original dimensions of ${X}$ . The precise method of vectorization is irrelevant, provided it is used consistently. Let $\mathscr{N}=\prod_{i}N_{i}$ and let $\underline{n}=(n_{i},n_{-i})$ be an element of ${N_{1}\times...\times N_{D}}$ such that $n_{i}$ is the value of the $i$ -th entry of $\underline{n}$ , and $n_{-i}$ the values of the remaining entries (where it is understood that the ordering of the elements of $\underline{n}$ is unchanged). We seek a layer that is equivariant only to certain, meaningful, permutations of $\operatorname{vec}({{X}})$ . This motivates our definition of an exchangeable tensor layer in a manner that is completely analogous to Definition 2.1 for matrices.

For any positive integer $N$ , let $\mathcal{S}^{(N)}$ denote the symmetric group of all permutations of $N$ objects. Then $\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ refers to the product group of all permutations of $N_{1}$ through $N_{D}$ objects, while $\mathcal{S}^{(\mathscr{N})}$ refers to the group of all permutations of $\mathscr{N}=\prod_{i}N_{i}$ objects. So $\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ is a proper subgroup of $\mathcal{S}^{(\mathscr{N})}$ having $\prod_{i}(N_{i}!)$ members, compared to $(\prod_{i}N_{i})!$ members in $\mathcal{S}^{(\mathscr{N})}$ . We want a layer that is equivariant to only those permutations in $\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ , but not to any other member of $\mathcal{S}^{(\mathscr{N})}$ .

Let $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}\times...\times\mathcal{S}^{(N_{D})}$ and ${G}^{(\mathscr{N})}$ be the corresponding permutation matrix. Then the neural layer $\operatorname{vec}^{-1}({\sigma({W}\operatorname{vec}({{X}}))})$ with ${X}\in\mathds{R}^{N_{1}\times\ldots\times N_{D}}$ and ${W}\in\mathds{R}^{\mathscr{N}\times\mathscr{N}}$ is an exchangeable tensor layer if permuting the elements of the input along any set of axes results in the same permutation of the output tensor:

and moreover for any other permutation of the elements ${X}$ (i.e., permutations that are not along axes), there exists an input ${X}$ for which this equality does not hold.

The following theorem asserts that a simple parameter tying scheme achieves the desired permutation equivariance for tensor-structured data.

The layer ${{Y}}=\operatorname{vec}^{-1}({\sigma({W}\operatorname{vec}({{X}}))})$ , where $\sigma$ is any strictly monotonic, element-wise nonlinearity (e.g., sigmoid), is an exchangeable tensor layer iff the parameter matrix ${W}\in\mathds{R}^{\mathscr{N}\times\mathscr{N}}$ is defined as follows.

That is, the $(\underline{n},\underline{n}^{\prime})$ -th element of ${W}$ is uniquely determined by the set of indices at which $n$ and $n^{\prime}$ are equal.

In the special case that ${X}\in\mathds{R}^{N_{1}\times N_{2}}$ is a matrix, this gives the formulation of ${W}$ described in Section 2. Theorem 5.1 says that a layer constructed in this manner is equivariant with respect to only those permutations of $\mathscr{N}$ objects that correspond to permutations of sub-tensors along the $D$ dimensions of the input. The proof is in the Appendix. Equivariance with respect to permutations in $\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}\times...\times\mathcal{S}^{(N_{D})}$ ) follows as a simple corollary of Theorem 2.1 in (Ravanbakhsh et al., 2017). Non-equivariance with respect to other permutations follows from the following Propositions.

Let $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(\mathscr{N})}$ be an “illegal” permutation of elements of the tensor $X$ – that is $\mathcal{g}^{(\mathscr{N})}\not\in\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}\times...\times\mathcal{S}^{(N_{D})}$ . Then there exists a dimension $i\in[D]$ such that, for some $n_{i},n_{i}^{\prime},n_{-i},n_{-i}^{\prime}$ :

If we consider the sub-tensor of ${X}$ obtained by fixing the value of the $i$ -th dimension to $n_{i}$ , we expect a “legal” permutation to move this whole subtensor to some $n_{i}^{\prime}$ (it could additionally shuffle the elements within this subtensor.) This Proposition asserts that an “illegal” permutation $\mathcal{g}^{(\mathscr{N})}\not\in\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}\times...\times\mathcal{S}^{(N_{D})}$ is guaranteed to violate this constraint for some dimension/subtensor combination. The next proposition asserts that if we can identify such inconsistency in permutation of sub-tensors then we can identify and entry in ${G}^{(\mathscr{N})}{W}$ that will differ from ${W}{G}^{(\mathscr{N})}$ , and therefore for some input tensor $X$ , the equivariance property is violated – i.e., $G^{(\mathscr{N})}\sigma(W\operatorname{vec}({X}))\neq\sigma(WG^{(\mathscr{N})}\operatorname{vec}({X}))$ .

Let $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(\mathscr{N})}$ with ${G}^{(\mathscr{N})}\in\{0,1\}^{\mathscr{N}\times\mathscr{N}}$ the corresponding permutation matrix. Suppose $\mathcal{g}^{(\mathscr{N})}\not\in\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}\times...\times\mathcal{S}^{(N_{D})}$ , and let ${W}\in\mathds{R}^{\mathscr{N}\times\mathscr{N}}$ be as defined above. If there exists an $i\in[D]$ , and some $n_{i},n_{i}^{\prime},n_{-i},n_{-i}^{\prime}$ such that

Proposition 5.3 makes explicit a particular element at which the products ${G}^{(N)}{W}$ and ${W}{G}^{(N)}$ will differ, provided $\mathcal{g}^{(\mathscr{N})}$ is not of the desired form.

Theorem 5.1 shows that this parameter sharing scheme produces a layer that is equvariant to exactly those permutations we desire, and moreover, it is optimal in the sense that any layer having fewer ties in its parameters (i.e., more parameters) would fail to be equivariant.

Conclusion

This paper has considered the problem of predicting relationships between the elements of two or more distinct sets of objects, where the data can also be expressed as an exchangeable matrix or tensor. We introduced a weight tying scheme enabling the application of deep models to this type of data. We proved that our scheme always produces permutation equivariant models and that no increase in model expressiveness is possible without violating this guarantee. Experimentally, we showed that our models achieve state-of-the-art or competitive performance on widely studied matrix completion benchmarks. Notably, our models achieved this performance despite having a number of parameters independent of the size of the matrix to be completed, unlike all other approaches that offer strong performance. Also unlike these other approaches, our models can achieve competitive results for the problem of matrix extrapolation: asking an already-trained model to complete a new matrix involving new objects that were unobserved at training time. Finally, we observed that our methods were surprisingly strong on various transfer learning tasks: extrapolating from MovieLens ratings to Fixter, Douban, and YahooMusic ratings. All of these contained different user populations and different distributions over the objects being rated; indeed, in the YahooMusic dataset, the underlying objects were not even of the same kind as those rated in training data.

Acknowledgment

We want to thank the anonymous reviewers for their constructive feedback. This research was enabled in part by support provided by NSERC Discovery Grant, WestGrid and Compute Canada.

References

Appendix A Notation

${X}\in\mathds{R}^{N_{1}\times...\times N_{D}}$ , the data tensor

${x}\in\mathds{R}^{\prod_{i}N_{i}}$ , vectorized ${X}$ , also denoted by $\operatorname{vec}({{X}})$ .

$[N]$ : the sequence $\{n\}_{n=1,...,N}=(1,2,...N)$

$[N_{1}\times...\times N_{D}]$ , the sequence $\{(n_{1},...,n_{D})\}_{n_{1}\in[N_{1}],...,n_{D}\in[N_{D}]}$ of $D$ -dimensional tuples over $N_{1}\times...\times N_{D}$

$\underline{n}=(n_{i},n_{-i})$ , an element of ${N_{1}\times...\times N_{D}}$

$\mathcal{S}^{(N)}$ , symmetric group of all permutations of $N$ objects

$\mathcal{G}^{(N)}=\{\mathcal{g}^{(N)}_{i}\}_{i}$ , a permutation group of $N$ objects

$\mathcal{g}^{(N)}_{i}$ or ${G}_{i}^{(N)}$ , both can refer to the matrix form of the permutation $\mathcal{g}^{(N)}_{i}\in\mathcal{G}^{(N)}$ .

$\mathcal{g}^{(N)}_{i}(n)$ refers to the result of applying $\mathcal{g}^{(N)}_{i}$ to $n$ , for any $n\in[N]$ .

$\sigma$ , an arbitrary, element-wise, strictly monotonic, nonlinear function such as sigmoid.

Appendix B Proofs

Let ${X}\in\mathds{R}^{N_{1}\times...\times N_{D}}$ be the data matrix. We prove the contrapositive by induction on $D$ , the dimension of ${X}$ . Suppose that, for all $i\in[D]$ and all $n_{i},n_{i}^{\prime},n_{-i},n_{-i}^{\prime}$ , we have that $\mathcal{g}^{(\mathscr{N})}((n_{i},n_{-i}))=(n_{i}^{\prime},n_{-i})$ implies $\mathcal{g}^{(\mathscr{N})}((n_{i},n_{-i}^{\prime}))=(n_{i}^{\prime},n_{-i}^{\prime})$ . This means that, for any $\underline{n}=(n_{i},n_{-i})=(n_{1},...,n_{i},...,n_{D})$ the action $\mathcal{g}^{(\mathscr{N})}$ takes on $n_{i}$ is independent of the action $\mathcal{g}^{(\mathscr{N})}$ takes on $n_{-i}$ , the remaining dimensions. Thus we can write

Where it is understood that the order of the group product is maintained (this is a slight abuse of notation). If $D=2$ (base case) we have $\mathcal{g}^{(\mathscr{N})}((n_{1},n_{2}))=\mathcal{g}^{(N_{1})}(n_{1})\times\mathcal{g}^{(N_{2})}(n_{2})$ . So $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}$ , and we are done. Otherwise, an inductive argument on $\mathcal{g}^{(\mathscr{N}/N_{i})}$ allows us to write $\mathcal{g}^{(\mathscr{N})}(\underline{n})=\mathcal{g}^{(N_{1})}(n_{1})\times\mathcal{g}^{(N_{2})}(n_{2})\times...\times\mathcal{g}^{(N_{D})}(n_{D})$ . And so $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(N_{1})}\times\mathcal{S}^{(N_{2})}\times...\times\mathcal{S}^{(N_{D})}$ , completing the proof. $\blacksquare$

B.2 Proof of Proposition 5.3

Now, let $i\in[D]$ be such that, for some $n_{i},n_{i}^{\prime},n_{-i},n_{-i}^{\prime}$ we have

The last line follows from the observation above and the fact that ${G}^{(\mathscr{N})}$ is a permutation matrix and so has only one 1 per row. Similarly,

Which completes the proof. $\blacksquare$

B.3 Proof of Theorem 5.1

We will prove both the forward and backward direction: $(\Leftarrow)$ Suppose ${W}$ has the form given by (6). We must show the layer is only equivariant with respect to permutations in $\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ :

Equivariance: Let $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ , and let ${G}^{(\mathscr{N})}$ be the corresponding permutation matrix. Then a simple extension of Theorem 2.1 in (Ravanbakhsh et al., 2017) implies ${G}^{(\mathscr{N})}{W}{X}={W}{G}^{(\mathscr{N})}{X}$ for all ${X}$ , and thus the layer is equivariant. Intuitively, if $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ we can “decompose” $\mathcal{g}^{(\mathscr{N})}(\underline{n})$ into $D$ permutations $\mathcal{S}^{(N_{1})}(n_{1}),...,\mathcal{S}^{(N_{D})}(n_{D})$ which act independently on the $D$ dimensions of ${X}$ .

No equivariance wrt any other permutation: Let $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(\mathscr{N})}$ , with $\mathcal{g}^{(\mathscr{N})}\not\in\mathcal{S}^{(N_{1})}\times\dots\times\mathcal{S}^{(N_{D})}$ , and let ${G}^{(\mathscr{N})}$ be the corresponding permutation matrix. Using Propositions 5.2 and 5.3 we have:

So there exists an index at which these two matrices differ, call it $(\underline{n},\underline{n}^{\prime})$ . Then if we take $\operatorname{vec}({{X}})$ to be the vector of all 0’s with a single 1 in position $\underline{n}^{\prime}$ , we will have:

And since $\sigma$ is assumed to be strictly monotonic, we have:

And finally, since ${G}$ (N) is a permutation matrix and $\sigma$ is applied element-wise, we have:

Therefore, the layer $\sigma({W}\operatorname{vec}({{X}}))$ is not a exchangeable tensor layer, and the proof is completed.

This proves the first direction. $(\Rightarrow)$ We prove the contrapositive. Suppose ${W}_{\underline{n},\underline{n}^{\prime}}\neq{W}_{\underline{m},\underline{m}^{\prime}}$ for some $\underline{n},\underline{n}^{\prime},\underline{m},\underline{m}^{\prime}\in N_{1}\times...\times N_{D}$ with $\{i\;:\;n_{i}=n_{i}^{\prime}\}=\{i\;:\;m_{i}=m_{i}^{\prime}\}$ . We want to show that the layer ${{Y}}=\operatorname{vec}^{-1}({\sigma({W}\operatorname{vec}({{X}}))})$ is not equivariant to some permutation $\mathcal{g}^{(\mathscr{N})}\in\mathcal{S}^{(N_{1})}\times...\times\mathcal{S}^{(N_{D})}$ . We define this permutation as follows:

That is, $\mathcal{g}^{(\mathscr{N})}$ “swaps” $\underline{n}$ with $\underline{m}$ and $\underline{n}^{\prime}$ with $\underline{m}^{\prime}$ . This is a valid permutation first since it acts element-wise, but also since $\{i\;:\;n_{i}=n_{i}^{\prime}\}=\{i\;:\;m_{i}=m_{i}^{\prime}\}$ implies that $\underline{n}=\underline{n}^{\prime}$ iff $\underline{m}=\underline{m}^{\prime}$ (and so $\mathcal{g}^{(\mathscr{N})}$ is injective, and thus bijective). So if ${G}^{(\mathscr{N})}$ is the permutation matrix of $\mathcal{g}^{(\mathscr{N})}$ then we have $({G}^{(\mathscr{N})})_{(\underline{n}^{\prime},\underline{n})}=({G}^{(\mathscr{N})})_{(\underline{m}^{\prime},\underline{m})}=1$ , and:

And so ${G}^{(\mathscr{N})}{W}\neq{W}{G}^{(\mathscr{N})}$ and by an argument identical to the above, with appropriate choice of $X$ , this layer does not satisfy the requirements of an exchangeable tensor layer. This completes the proof. $\blacksquare$

B.4 Proof of Theorem 2.1

A simple reparameterization allows us to write the matrix ${W}$ of (6) in the form of (3). Thus Theorem 2.1 is just the special case of Theorem 5.1 where $D=2$ and the proof follows from that.

Appendix C Details of Architecture and Training

Self-Supervised Model. Details of architecture and training: we train a simple feed-forward network with 9 exchangeable matrix layers using a leaky ReLU activation function. Each hidden layer has 256 channels and we apply a channel-wise dropout with probability 0.5 after the first to seventh layers. We found this channel-wise dropout to be crucial to achieving good performance. Before the input layer we mask out a proportion of the ratings be setting their values to 0 uniformly at random with probability 0.15. We convert the input ratings to one-hot vectors and interpret the model output as a probability distribution over potential rating levels. We tuned hyper-parameters by training on 75% of the data, evaluating on a 5% validation set. We test this model using the canonical u1.base/u1.test training/test split, which reserves 20% of the ratings for testing. For the MovieLens-1M dataset, we use the same architecture as for ML-100k and trained on 85% of the data, validating on 5%, and reserving 10% for testing. The limited size of GPU memory becomes an issue for this larger dataset, so we had to employ conditional sampling for training. At validation time we used full batch predictions using the CPU in order to avoid memory issues.

Factorized Exchangeable Autoencoder Model. Details of architecture and training: we use three exchangeable matrix layers for the encoder. The first two have 220 channels, and the third layer maps the input to 100 features for each entry, with no activation function applied. This is followed by mean pooling along both dimensions of the input. Thus, each user and movie is encoded into a length 100 vector of real-valued latent features. The decoder uses five similar exchangeable matrix layers, with the final layer having five channels. We apply a channel-wise dropout with probability 0.5 after the third and fourth layers, which we again found to be crucial for good performance. We convert the input ratings to one-hot vectors and optimize using cross-entropy loss.