Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs

Ryan L. Murphy, Balasubramaniam Srinivasan, Vinayak Rao, Bruno Ribeiro

Introduction

Pooling is a fundamental operation in deep learning architectures (LeCun et al., 2015). The role of pooling is to merge a collection of related features into a single, possibly vector-valued, summary feature. A prototypical example is in convolutional neural networks (CNNs) (LeCun et al., 1995), where linear activations of features in neighborhoods of image locations are pooled together to construct more abstract features. A more modern example is in neural networks for graphs, where each layer pools together embeddings of neighbors of a vertex to form a new embedding for that vertex, see for instance, (Kipf & Welling, 2016; Atwood & Towsley, 2016; Hamilton et al., 2017; Velickovic et al., 2017; Monti et al., 2017; Xu et al., 2018; Liu et al., 2018; Liben-Nowell & Kleinberg, 2007; van den Berg et al., 2017; Duvenaud et al., 2015; Gilmer et al., 2017; Ying et al., 2018; Xu et al., 2019).

A common requirement of a pooling operator is invariance to the ordering of the input features. In CNNs for images, pooling allows invariance to translations and rotations, while for graphs, it allows invariance to graph isomorphisms. Existing pooling operators are mostly limited to pre-defined heuristics such as max-pool, min-pool, sum, or average. Another desirable characteristic of pooling layers is the ability to take variable-size inputs. This is less important in images, where neighborhoods are usually fixed a priori. However in applications involving graphs, the number of neighbors of different vertices can vary widely. Our goal is to design flexible and learnable pooling operators satisfying these two desiderata.

Our goal in this paper is to model and learn permutation-sensitive functions $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ that can be used to construct flexible and learnable permutation-invariant neural networks. A recent step in this direction is work on DeepSets by Zaheer et al. (2017), who argued for learning permutation-invariant functions through the following composition:

In practice, there is a gap between flexibility and learnability. While the architecture of equations 1 and 2 is a universal approximator to permutation-invariant functions, it does not easily encode structural knowledge about $\overline{\overline{\lower 0.86108pt\hbox{$ y $}}}$ . Consider trying to learn the permutation-invariant function $\overline{\overline{\lower 0.86108pt\hbox{$ y $}}}({\bm{x}})=\max_{i,j\leq|{\bm{x}}|}|{x}_{i}-{x}_{j}|$ . With higher-order interactions between the elements of ${\bm{h}}$ , the functions $f$ of equation 2 cannot capture any useful intermediate representations towards the final output, with the burden shifted entirely to the function $\rho$ . Learning $\rho$ means learning to undo mixing performed by the summation layer $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}(|{\bm{h}}|,{\bm{h}};{\bm{\theta}}^{(f)})=\sum_{j=1}^{|{\bm{h}}|}f({h}_{j};{\bm{\theta}}^{(f)})$ . As we show in our experiments, in many applications this is too much to ask of $\rho$ .

We investigate a learnable permutation-invariant pooling layer for variable-size inputs inspired by the Janossy density framework, widely used in the theory of point processes (Daley & Vere-Jones, 2003, Chapter 7). This approach, which we call Janossy pooling, directly allows the user to model what higher-order dependencies in ${\bm{h}}$ are relevant in the pooling.

Figure 1 summarizes a neural network with a single Janossy pooling layer $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ (detailed in Definition 2.1 below): given an input embedding ${\bm{h}}$ , we apply a learnable (permutation-sensitive) function $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ to every permutation ${\bm{h}}_{\pi}$ of the input sequence ${\bm{h}}$ . These outputs are added together, and fed to the second function $\rho$ . Examples of function $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ include feedforward and recurrent neural networks (RNNs). We call the operation used to construct $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ from $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ the Janossy pooling. Definition 2.1 gives a more detailed description. We will detail three broad strategies for making this computation tractable and discuss how existing methods can be seen as tractability strategies under the Janossy pooling framework.

Thus, we propose a framework and tractability strategies that unify and extend existing methods in the literature. We contribute the following analysis: (a) We show DeepSets (Zaheer et al., 2017) is a special case of Janossy pooling where the function $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ depends only on the first element of the sequence ${\bm{h}}_{\pi}$ . In the most general form of Janossy pooling (as described above), $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ depends on its entire input sequence ${\bm{h}}_{\pi}$ . This naturally raises the possibility of intermediate choices of $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ that allow practitioners to trade between flexibility and tractability. We will show that functions $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ that depend on their first $k$ arguments of ${\bm{h}}_{\pi}$ allow the Janossy pooling layer to capture up to $k$ -ary dependencies in ${\bm{h}}$ . (b) We show Janossy pooling can be used to learn permutation-invariant neural networks $\overline{\overline{\lower 0.86108pt\hbox{$ y $}}}({\bm{x}})$ by sampling a random permutation of ${\bm{h}}$ during training, and then modeling this permuted sequence using a sequence model such as a recurrent neural network (LSTMs (Hochreiter & Schmidhuber, 1997), GRUs (Cho et al., 2014)) or a vector model such as a feedforward network. We call this permutation-sampling learning algorithm $\pi$ -SGD ( $\pi$ -Stochastic Gradient Descent). Our analysis explains why this seemingly unsound procedure is theoretically justified, which sheds light on the recent puzzling success of permutation sampling and LSTMs in relational models (Moore & Neville, 2017; Hamilton et al., 2017). We show that this property relates to randomized model ensemble techniques. (c) In Zaheer et al. (2017), the authors describe a connection between DeepSets and infinite de Finetti exchangeabilty. We provide a probabilistic connection between Janossy pooling and finite de Finetti exchangeabilty (Diaconis, 1977).

Janossy Pooling

We first formalize the Janossy pooling function $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ . Start with a function $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ , parameterized by $\bm{\theta}^{(f)}$ , which can take any variable-size sequence as input: a sequence of matrices (such as images), a sequence of vectors (such as a sequence of vector embeddings), or a variable-size sequence of features or embeddings representing the neighbors of a node in an attributed graph. In practice, we implement $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ with a neural network. Formalizing Figure 1 from Section 1, we use $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ to define $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ :

where $\Pi_{|{\bm{h}}|}$ is the set of all permutations of the integers $1$ to $|{\bm{h}}|$ , and ${\bm{h}}_{\pi}$ represents a particular reordering of the elements of sequence ${\bm{h}}$ according to $\pi\in\Pi_{|{\bm{h}}|}$ . We refer the operation used to construct $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ from $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ as Janossy pooling.

Definition 2.1 provides a conceptually simple approach for constructing permutation-invariant functions from arbitrary and powerful permutation-sensitive functions such as feedforward networks, recurrent neural networks, or convolutional neural networks. If $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ is a vector-valued function, then so is $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ , and in practice, one might pass this vector output of $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ through a second function $\rho$ (e.g. a neural network parameterized by $\theta^{(\rho)}$ ):

Equation 3 can capture any permutation-invariant function $\overline{\overline{\lower 0.86108pt\hbox{$ g $}}}$ for a flexible enough family of permutation-sensitive functions $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ (for instance, one could always set $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}=\overline{\overline{\lower 0.86108pt\hbox{$ g $}}}$ ). Thus, at least theoretically, $\rho$ in equation 4 provides no additional representational power. In practice, however, $\rho$ can improve learnability by capturing common aspects across all terms in the summation. Furthermore, when we look at approximations to equation 3 or restrictions of $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ to more tractable families, adding $\rho$ can help recover some of the lost model capacity. Overall then, equation 4 represents one layer of Janossy pooling, forming a constituent part of a bigger neural network. Figure 1 summarizes this.

Janossy pooling, as defined in equation 3 and 4 is intractable; the computational cost of summing over all permutations (for prediction), and backpropagating gradients (for learning) is likely prohibitive for most problems of interest. Nevertheless, it provides an overarching framework to unify existing methods, and to extend them. In what follows we present strategies for mitigating this, allowing novel and effective trade-offs between learnability and computational cost.

Rather than pre-defining a good canonical order, one can try to learn it from the data. This requires searching over the discrete space of all $|{\bm{h}}|!$ permutations of the input vector ${\bm{h}}$ . In practice, this discrete optimization relies on heuristics (Vinyals et al., 2016; Rezatofighi et al., 2018). Alternatively, instead of choosing a single canonical ordering, one can choose multiple orderings, resulting in ensemble methods that average across multiple permutations. These can be viewed as more refined (possibly data-driven) approximations to equation 3.

2 Tractability through k𝑘k-ary dependencies

Here, we provide a different spectrum of options to trade-off flexibility, complexity, and generalizability in Janossy pooling. Now, to simplify the sum over permutations in equation 3, we impose structural constraints where $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}({\bm{h}})$ depends only on the first $k$ elements of its input sequence. This amounts to the assumption that only $k$ -ary dependencies in ${\bm{h}}$ are relevant to the task at hand.

Note that if some of the embeddings have length $|{\bm{h}}|<k$ , then we can zero pad to form the length- $k$ sequence $(\downarrow_{k}\!\!({\bm{h}}_{\pi}),0,\ldots,0)$ . Proposition 2.1 shows that if $|{\bm{h}}|>k$ , equation 5 only needs to sum over $|{\bm{h}}|!/(|{\bm{h}}|-k)!$ terms, which can be tractable for small $k$ .

Note that the value of $k$ balances computational savings and the capacity to model higher-order interactions; it can be selected as a hyperparameter based on a-priori beliefs or through typical hyperparameter tuning strategies.

Equation 5 represented with $k=1$ and composing with $\rho$ as in equation 4 yields the model $\rho\big{(}\frac{1}{|{\bm{h}}|}\sum_{j=1}^{|{\bm{h}}|}\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,{h}_{j};\bm{\theta}^{(f)});{\bm{\theta}}^{(\rho)}\big{)}$ and thus equations 1 and 2 for an appropriate choice of $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ .

Not surprisingly, the computational savings obtained from $k$ -ary Janossy pooling come at the cost of reduced model flexibility. The next result formalizes this.

The proof is given in the Supplementary Material. Theorem 2.1 has the following implication:

For $k>1$ , the DeepSets function in equation 1 (Zaheer et al., 2017) pushes the modeling of $k$ -ary relationships to $\rho$ .

DeepSets functions can be expressed via Janossy pooling with $k=1$ . Thus, by Theorem 2.1, $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ in equation 2 cannot express all functions that can be expressed by higher-order (i.e. $k>1$ ) Janossy pooling operations. Consequently, if the DeepSets function can express any permutation-invariant function, the expressive power must have been pushed to $\rho$ . ∎

3 Tractability through Permutation Sampling

Another approach to tractable Janossy pooling samples random permutations of the input ${\bm{h}}$ during training. Like the canonical ordering approach of Section 2.1, this offers significant computational savings, allowing more complex models for $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ such as LSTMs and GRUs. However, in contrast with that approach, this is considerably more flexible, avoiding the need to learn a canonical ordering or to make assumptions about the dependencies between the elements of ${\bm{h}}$ and the objective function. Rather, it can be viewed as implicitly assuming simpler structure in these functions. The approach of sampling random permutations has been previously used in relational learning tasks (Moore & Neville, 2017; Hamilton et al., 2017) as a heuristic with an LSTM as $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ . Both these papers report that permutation sampling outperforms or closely matches other tested neural network models they tried. Therefore, this section not only proposes a tractable approximation for equation 3 but also provides a theoretical framework to understand and extend such approaches.

Permutation sampling. Consider replacing the Janossy sum in equation 7 with the estimate

where ${\mathbf{s}}$ is a random permutation sampled uniformly, ${\mathbf{s}}\sim\text{Unif}(\Pi_{|{\bm{h}}|})$ . The estimator in equation 8 is unbiased: $E_{\mathbf{s}}[\hat{\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}}(|{\bm{h}}|,{\bm{h}}_{{\mathbf{s}}};\bm{\theta}^{(f)})]=\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}(|{\bm{h}}|,{\bm{h}};\bm{\theta}^{(f)}).$ Note however that when $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ is chained with another nonlinear function $\rho$ and/or nonlinear loss $L$ , the composition is no longer unbiased: $E_{\mathbf{s}}[L({\bm{y}},\rho(\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}_{\mathbf{s}}|,{\bm{h}}_{\mathbf{s}};{\bm{\theta}}^{(f)});{\bm{\theta}}^{(\rho)}))]\neq L({\bm{y}},\rho(E_{\mathbf{s}}[\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}_{\mathbf{s}}|,{\bm{h}}_{\mathbf{s}};{\bm{\theta}}^{(f)})];{\bm{\theta}}^{(\rho)}))$ . Nevertheless, we use this estimate to propose the following stochastic approximation algorithm for gradient descent:

Let ${\mathcal{B}}=\{({\bm{x}}(1),{\bm{y}}(1)),\ldots,({\bm{x}}(B),{\bm{y}}(B))\}$ be a mini-batch i.i.d. sampled uniformly from the training data $\mathcal{D}$ . At step $t$ , consider the stochastic gradient descent update

where ${\mathbf{Z}}_{t}=\frac{1}{B}\sum_{i=1}^{B}\nabla_{{\bm{\theta}}}L\left({\bm{y}}(i),\rho\Big{(}\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}^{(i,t-1)}|,{\bm{h}}^{(i,t-1)}_{{\mathbf{s}}_{i}};{\bm{\theta}}^{(f)}_{t-1});{\bm{\theta}}^{(\rho)}_{t-1}\Big{)}\right)$ is the random gradient, where ${\bm{h}}^{(i,t-1)}_{\pi}\equiv(h({\bm{x}}(i);{\bm{\theta}}_{t-1}^{(h)}))_{\pi}$ for a permutation $\pi$ , ${\bm{\theta}}\equiv({\bm{\theta}}^{(\rho)},{\bm{\theta}}^{(f)},{\bm{\theta}}^{(h)})$ , with the random permutations $\{{\mathbf{s}}_{i}\}_{i=1}^{B}$ , sampled independently ${\mathbf{s}}_{i}\sim\text{Uniform}(\Pi_{|{\bm{h}}^{(i)}|})$ ; the learning rate is $\eta_{t}\in(0,1)$ s.t. $\lim_{t\to\infty}\eta_{t}=0$ , $\sum_{t=1}^{\infty}\eta_{t}=\infty$ , and $\sum_{t=1}^{\infty}\eta_{t}^{2}<\infty$ .

Effectively, this is a Robbins-Monro stochastic approximation algorithm of gradient descent (Robbins & Monro, 1951; Bottou & LeCun, 2004) and optimizes the following modified objective:

Observe that the expectation over permutations is now outside the $L$ and $\rho$ functions. Like equation 6, the loss in equation 10 is also permutation-invariant, though we note that $\pi$ -SGD, after a finite number of iterations, returns a $\rho(\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(\cdots,{\bm{h}}^{(i)},\cdots))$ sensitive to the random input permutations of ${\bm{h}}^{(i)}$ presented to the algorithm. Further, unless the function $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ itself is permutation-invariant ( $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}=\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ ), the optima of $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ are different from those of the original objective function $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ . Instead, $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ is an upper bound to $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ via Jensen’s inequality if $L$ is convex and $\rho$ is the identity function (equation 3); minimizing this upper bound forms a tractable surrogate to the original Janossy objective. If the function class used to model $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ is rich enough to include permutation-invariant functions, then the global minima of $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ will include those of $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ . In general, minimizing the upper bound implicitly regularizes $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ to return functions that are insensitive to permutations of the training data. While a general $\rho$ no longer upper bounds the original objective, the implicit regularization of permutation-sensitive functions still applies to the composition $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}^{\prime}\equiv\rho\circ\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ and we show competitive results.

It is important to observe that the function $\rho$ plays a very different role in our $\pi$ -SGD formulation compared to $k$ -ary Janossy pooling. Previously $\rho$ was composed with an average over $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ to model dependencies not captured in the average– and was in some sense separate from $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ – whereas here it becomes absorbed directly into $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}^{\prime}=\rho\circ\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ .

The next result, which we state and prove more formally in the Supplementary Material, provides some insight into the convergence properties of our algorithm. Although the conditions are difficult to check, they are similar to those used to demonstrate the convergence of SGD, which has been empirically demonstrated to yield strong performance in practice.

[ $\pi$ -SGD Convergence] The optimization of $\pi$ -SGD enjoys properties of almost sure convergence to the optimal ${\bm{\theta}}$ under similar conditions as SGD.

Variance reduction of the output of a sampled permutation $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,{\bm{h}}_{{\mathbf{s}}};\bm{\theta}^{(f)})$ , ${\mathbf{s}}\sim\text{Unif}(\Pi_{|{\bm{h}}|})$ , allows $E_{\mathbf{s}}[L({\bm{y}},\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}_{\mathbf{s}}|,{\bm{h}}_{\mathbf{s}};\bm{\theta}^{(f)})]\approx L({\bm{y}},E_{\mathbf{s}}[\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}_{\mathbf{s}}|,{\bm{h}}_{\mathbf{s}};\bm{\theta}^{(f)})]),$ inducing a near-equivalence between optimizing equation 6 and equation 10. Possible approaches include importance sampling (used by Chen et al. (2018b) for $1$ -ary Janossy), control variates (also used by Chen et al. (2018a) also used for $1$ -ary Janossy), Rao-Blackwellization (Lehmann & Casella, 2006, Section 1.7), and an output regularization, which includes a penalty for two distinct sampled permutations ${\mathbf{s}}$ and ${\mathbf{s}}^{\prime}$ , $\|\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,{\bm{h}}_{{\mathbf{s}}};\bm{\theta}^{(f)})-\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,{\bm{h}}_{{\mathbf{s}}^{\prime}};\bm{\theta}^{(f)})\|_{2}^{2}$ , so as to reduce the variance of the sampled Janossy pooling output (used before to improve Dropout masks by Zolna et al. (2018)).

The use of $\pi$ -SGD to optimize the Janossy pooling layer optimizes the objective $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ , and thus has the following implication on how outputs should be calculated at inference time:

Assume $L({\bm{y}},\hat{{\bm{y}}})$ is convex as a function of $\hat{{\bm{y}}}$ (e.g., $L$ is the $L^{2}$ norm, cross-entropy, or negative log-likelihood losses). At test time we estimate the output ${\bm{y}}(i)$ of input ${\bm{x}}(i)$ by computing (or estimating)

where $\mathstrut\mkern 2.5muf^{\prime}\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}\equiv\rho\circ\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ , ${\bm{\theta}}^{(f^{\prime})\star}\equiv({\bm{\theta}}^{(f)\star},{\bm{\theta}}^{(\rho)\star})$ , ${\bm{h}}^{(i,\star)}_{{\mathbf{s}}_{i}}\equiv(h({\bm{x}}(i);{\bm{\theta}}^{(h)\star}))_{{\mathbf{s}}_{i}}$ and ${\bm{\theta}}^{(\rho)\star},{\bm{\theta}}^{(f)\star},{\bm{\theta}}^{(h)\star}$ are fixed points of the $\pi$ -SGD optimization. Equation 11 is a permutation-invariant function.

In some cases one may consider $k$ -ary Janossy pooling with a moderately large value of $k$ in which case even the summation over $\frac{|{\bm{h}}|!}{(|{\bm{h}}|-k)!}$ terms (see proposition 2.1) becomes expensive. In these cases, one may sample ${\mathbf{s}}\sim\text{Unif}\big{(}\Pi_{|{\bm{h}}|}\big{)}$ and compute $\hat{\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}}_{k}=\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,\downarrow_{k}\!\!({\bm{h}}_{{\mathbf{s}}});\bm{\theta}^{(f)})$ in lieu of the sum in equation 5. Note that equation 5 defining $k$ -ary Janossy pooling constitutes exact inference of a simplified model whereas $\pi$ -SGD with $k$ -ary dependencies constitutes approximate inference. We will return to this idea in our results section where we note that the GraphSAGE model of Hamilton et al. (2017) can be cast as a $\pi$ -SGD approximation of $k$ -ary Janossy pooling.

Experiments

In what follows we empirically evaluate two tractable Janossy pooling approaches, $k$ -ary dependencies (section 2.2) and sampling permutations for stochastic optimization (section 2.3), to learn permutation-invariant functions for tasks of different complexities. One baseline we compare against is DeepSets (Zaheer et al., 2017); recall that this corresponds to unary ( $k=1$ ) Janossy pooling (Remark 2.1). Corollary 2.1 shows that explicitly modeling higher-order dependencies during pooling simplifies the task of the upper layers ( $\rho$ ) of the neural network, and we evaluate this experimentally by letting $k=1,2,3,|{\bm{h}}|$ over different arithmetic tasks. We also evaluate Janossy pooling in graph tasks, where it can be used as a permutation-invariant function to aggregate the features and embeddings of the neighbors of a vertex in the graph. Note that in graph tasks, permutation-invariance is required to ensure that the neural network is invariant to permutations in the adjacency matrix (graph isomorphism). The code used to generate the results in this section are available on GitHubhttps://github.com/PurdueMINDS/JanossyPooling.

We first consider the task of predicting the sum of a sequence of integers (Zaheer et al., 2017) and extend it to predicting other permutation-invariant functions: range, unique sum, unique count, and variance. In the sum task we predict the sum of a sequence of 5 integers drawn uniformly at random with replacement from $\{0,1,\ldots,99\}$ ; the range task also receives a sequence 5 integers distributed the same way and tries to predict the range (the difference between the maximum and minimum values); the unique sum task receives a sequence of 10 integers, sampled uniformly with replacement from $\{0,1,\ldots,9\}$ , and predicts the sum of all unique elements; the unique count task also receives a sequence of repeating elements from $\{0,1,\ldots,9\}$ , distributed in the same was as with the unique sum task, and predicts the number of unique elements; the variance task receives a sequence of 10 integers drawn uniformly with replacement from $\{0,1,\ldots,99\}$ and tries to predict the variance $\frac{1}{|{\bm{x}}|}\sum_{i}({x}_{i}-\bar{{\bm{x}}})^{2}=\frac{1}{2|{\bm{x}}|^{2}}\sum_{i,j}({x}_{i}-{x}_{j})^{2}$ , where $\bar{{\bm{x}}}$ denotes the mean of ${\bm{x}}$ . Unlike Zaheer et al. (2017), we choose to work with the digits themselves, to allow a more direct assessment of the different Janossy pooling approximations. Note that the summation task of Zaheer et al. (2017) is naturally a unary task that lends itself to the approach of embedding individual digits then adding them together while the other tasks require exploiting high-order relationships within the sequence. Following Zaheer et al. (2017), we report accuracy (0-1 loss) for all tasks with an integer target; we report root mean squared error (RMSE) for the variance task.

Here we explore two Janossy pooling tractable approximations: (a) ( $k$ -ary dependencies) Janossy ( $k=1$ ) (DeepSets), and Janossy $k=2,3$ where $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ is a feedforward network with a single hidden layer comprised of 30 neurons. As detailed in the Supplementary Material, the models are constructed to have the same number of parameters regardless of $k$ by modifying the embedding (output) dimension of $h$ . In the Supplementary Material, we also show results for experiments that relax this constraint. (b) ( $\pi$ -SGD) Full $k=|{\bm{h}}|$ Janossy pooling where $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ is an LSTM or a GRU that returns the short-term hidden state of the last temporal unit (the ${\bm{h}}_{t}$ of Cho et al. with $t=|{\bm{h}}|$ ). The LSTM has 50 hidden units and the GRU 80, trained with the $\pi$ -SGD stochastic optimization. The number of hidden units was chosen to be consistent with Zaheer et al. (2017). At test time, we experiment with approximating (estimating) equation 11 using 1 and 20 sampled permutations.

We also explore two functions for $\rho$ of equation 4 (upper-layer): (i) [Linear] a single dense layer with identity activation as in the experiments of Zaheer et al. (2017), and (ii) [MLP (100)] a feedforward network with one hidden layer using tanh activations and 100 units. Choosing a simple and complex form for $\rho$ allows insight into the extent to which $\rho$ supplements the capacity of the model by capturing relationships not exploited during pooling, and serves as an evaluation of the strategy of optimizing $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ as a tractable approximation of $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ .

Much of our implementation, architectural, and experimental design are based on the DeepSets codehttps://github.com/manzilzaheer/DeepSets of Zaheer et al. (2017), see the Supplementary Material for details. We tuned the Adam learning rate for each model and report the results using the rate yielding top performance on the validation set. Table 3.1 shows the accuracy (average 0-1 loss) of all tasks except variance, for which we report RMSE in the last column. Performance was similar between the LSTM and GRU models, with the GRU performing slightly better, thus we moved the LSTM results to Table B.1 in the Supplementary Material for the sake of clarity. We trained each model with 15 random initializations of the weights to quantify variability. Table B.1 in the Supplementary Material shows the same results measured by mean absolute error. The data consists of 100,000 training examples and 10,000 test examples.

The results in Table 3.1 and Table B.1 show that: (1) models trained with $\pi$ -SGD using LSTMs and GRUs as $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ typically achieve top performance or are comparable to the top performer (within confidence intervals) on all tasks, for any choice of $\rho$ . We also observe for LSTMs and GRUs that adding complexity to $\rho$ can yield small but meaningful performance gains or maintain similar performance, lending credence to the approach of optimizing $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ as a tractable approximation to $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ . (2) Specifically, in the variance task, GRUs and LSTMs with $\pi$ -SGD provide significant accuracy gains over $k\in\{1,2,3\}$ , showing that modeling full-dependencies can be advantageous even if model training with $\pi$ -SGD is approximate. (3) For a more complex $\rho$ (MLP as opposed to Linear), lower-complexity Janossy pooling achieves consistently better results: $k\in\{2,3\}$ gives good results when $\rho$ is linear but poorer results when $\rho$ is an MLP (as these models are more expressive, the only feasible explanation is an optimization issue since we also observed poorer performance on the training data). We also note that when $\rho$ is an MLP, it takes significantly more epochs for $k\in\{2,3\}$ to find the best model (2000 epochs) while $k=1$ finds good models much quicker (1000 epochs). The results we report come from training with 1000 epochs on all models with a linear $\rho$ and 2000 epochs for all models where $\rho$ is an MLP. (4) We observe that for $k=1$ (DeepSets), a more complex $\rho$ (MLP) is required as the pooling pushes the complexity of modeling high-order interactions over the input to $\rho$ . The converse is also true, if $\rho$ is simple (Linear) then a Janossy pooling that models high-order interactions $k\in\{2,3,|{\bm{h}}|\}$ gives higher accuracy, as shown in the range, unique sum, unique count, and variance tasks.

Entries denoted by – all differ by less than 0.01. Typical neighborhoods in Cora and Pubmed are small, so that sampling $\geq 5$ neighbors is often equivalent to using the entire neighborhood.

Some neighbor sequences in PPI are prohibitively large, so we take $k_{1}=k_{2}=100$ .

Under the Janossy pooling framework presented in this work, existing literature falls under one of three approaches to approximating to the intractable Janossy-pooling layer: Canonical orderings, $k$ -ary dependencies, and permutation sampling. We also discuss the broader context of invariant models and probabilistic interpretations.

In section 2.1, we saw how permutation invariance can be achieved by mapping permutations to a canonical ordering. Rather than trying to define a good canonical ordering, one can try to learn it from the data, however searching among all $|{\bm{h}}|!$ permutations for one that correlates with the task of interest is a difficult discrete optimization problem. Recently, Rezatofighi et al. (2018) proposed a method that computes the posterior distribution of all permutations, conditioned on the model and the data. This posterior-sampling approach is intractable for large inputs, unfortunately. We note in passing that Rezatofighi et al. (2018) is interested in permutation-invariant outputs, and that Janossy pooling is also trivially applicable to these tasks. Vinyals et al. (2016) proposes a heuristic using ancestral sampling while learning the model.

In section 2.2 we described $k$ -ary Janossy pooling, which considers $k$ -order relationships in the input vector ${\bm{h}}$ to simplify optimization. DeepSets (Zaheer et al., 2017) can be characterized as unary Janossy pooling (i.e., $k$ -ary for $k=1$ ). . Qi et al. (2017) and Ravanbakhsh et al. (2017a) propose similar unary Janossy pooling models. Cotter et al. (2018) proposes to add inductive biases to the DeepSets model in the form of monotonicity constraints with respect to the vector valued elements of the input sequence by modeling $f$ and $\rho$ with Deep Lattice Networks (You et al., 2017); one can extend Cotter et al. (2018) by using higher-order ( $k>1$ ) pooling.

Exploiting dependencies within a sequence to learn a permutation-invariant function has been discussed elsewhere. For instance Santoro et al. (2017) exploits pairwise relationships to perform relational reasoning about pairs of objects in an image and Battaglia et al. (2018) contemplates modeling the center of mass of a solar system by including the pairwise interactions among planets. However, Janossy pooling provides a general framework for capturing dependencies within a permutation-invariant pooling layer.

In section 2.3 we have seen a that permutation sampling can be used as a stochastic gradient procedure ( $\pi$ -SGD) to learn a model with a Janossy pooling layer. The learned model provides only an approximate solution to original permutation-invariant function. Permutation sampling has been used as a heuristic (without a theoretical justification) in both Moore & Neville (2017) and Hamilton et al. (2017), which found that randomly permuting sequences and feeding them forward to an LSTM is effective in relational learning tasks that require permutation-invariant pooling layers.

Conclusions

Our work has a strong connection with finite exchangeability. Some researchers may be more familiar with the concept of infinite exchangeability through de Finetti’s theorem (De Finetti, 1937; Diaconis, 1977), which imposes strong structural requirements: the probability of any subsequence must equal the marginalized probability of the original sequence (projectivity). Korshunova et al. (2018) noted the importance of this property for generative models and propose a model that learns a distribution without variational approximations. Finite exchangeability drops this projectivity requirement (Diaconis, 1977), which in general, cannot be simplified beyond first sampling the number of observations $m$ , and then sampling their locations from some exchangeable but non-i.i.d. distribution $p^{m}_{\text{exch}}(x_{1},\ldots,x_{m})$ (Daley & Vere-Jones, 2003). Equivalently, de Finetti’s theorem for infinitely exchangeable sequences implies that the joint distribution can represented as a mixture distribution over conditionally independent random variables (given $\theta$ ) (De Finetti, 1937; Orbanz & Roy, 2015) whereas the probability distribution of a finitely exchangeability sequence is a mixture over dependent random variables as shown by Diaconis (1977).

In comparison, the restrictive assumption of letting $k=1$ in $k$ -ary Janossy Pooling yields the form of a log-likelihood of conditionally iid random variables (consider $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ a log pdf), the strong requirement of de Finetti’s theorem for infinitely exchangeable sequences. Conversely, higher-order Janossy pooling was designed to exploit dependencies among the random variables such as those that arise under finitely exchangeable distributions. Indeed, finite exchangeability also arises from the theory of spatial point processes; our framework of Janossy pooling is inspired by Janossy densities (Daley & Vere-Jones, 2003), which model the finite exchangeable distributions as mixtures of non-exchangeable distributions applied to permutations. This literature also studies simplified exchangeable point processes such as finite Gibbs models (Vo et al., 2018; Moller & Waagepetersen, 2003) that restrict the structure of $p_{\text{exch}}$ to fixed-order dependencies, and are related to $k$ -ary Janossy.

More broadly, there are other connections between permutation-invariant deterministic functions and exchangeability in probability distributions, as recently discussed by Bloem-Reddy & Teh (2019). There, the authors also contemplate more general invariances through the language of group actions. An example is permutation equivariance: one form of permutation equivariance asserts that ${f}(X_{\pi})={f}(X)_{\pi}\forall\pi\in\Pi_{|X|}$ where $f(X)$ is a sequence of length greater than 1. Ravanbakhsh et al. (2017b) provides a weight-sharing scheme for maintaining general neural network equivariances characterized as automorphisms of a colored multi-edged bipartite graph. Hartford et al. (2018) proposes a matrix completion model invariant to (possibly separate) permutations of the rows or columns. Other invariances are studied through a probabilistic perspective in Orbanz & Roy (2015).

Our approach of permutation-invariance through Janossy pooling unifies a number of existing approaches, and opens up avenues to develop both new methodological extensions, as well as better theory. Our paper focused on two main approaches: $k$ -ary interactions and random permutations. The former involves exact Janossy pooling for a restricted class of functions $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ . Adding an additional neural network $\rho$ can recover lost model capacity and capture additional higher-order interactions, but hurts tractability and identifiability. Placing restrictions on $\rho$ (convexity, Lipschitz continuity etc.) can allow a more refined control of this trade-off, allowing theoretical and empirical work to shed light on the compromises involved. The second was a random permutation approach which conversely involves no clear trade-offs between model capacity and computation when $\rho$ is made more complex, instead it modifies the relationship between the tractable approximate loss $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ and the original Janossy loss $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ . While there is a difference between $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ and $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ , we saw the strongest empirical performance coming from this approach in our experiments (shown in the last row of Table 3.1); future work is required to identify which problems $\pi$ -SGD is best suited for and when its convergence criteria are satisfied. Further, a better understanding how the loss-functions $\overline{\overline{\lower 0.86108pt\hbox{$ L $}}}$ and $\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}$ relate to each other can shed light on the slightly black-box nature of this procedure. It is also important to understand the relationship between the random permutation optimization to canonical ordering and how one might be used to improve the other. Finally, it is important to apply our methodology to a wider range of applications. Two immediate domains are more challenging tasks involving graphs and tasks involving non-Poisson point processes.

This work was sponsored in part by the ARO, under the U.S. Army Research Laboratory contract number W911NF-09-2-0053, by the Purdue Integrative Data Science Initiative and Purdue Research foundation, the DOD through SERC under Contract No. HQ0034-13-D-0004 RT #206, the National Science Foundation under contract numbers IIS-1816499 and DMS-1812197, and the NVIDIA GPU grant program for hardware donation.

A Proofs of Results

Define two permutations $\pi,\pi^{\prime}\in\Pi_{|{\bm{h}}|}$ that agree on the first $k$ elements as $k$ -equivalent. Such permutations satisfy $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,\downarrow_{k}\!\!({\bm{h}}_{\pi});\bm{\theta}^{(f)})=\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,\downarrow_{k}\!\!({\bm{h}}_{\pi^{\prime}});\bm{\theta}^{(f)})$ . These two permutations belong to the same equivalence class, containing a total of $(|{\bm{h}}|-k)!$ permutations (obtained by permuting the last $(|{\bm{h}}|-k)$ elements). Overall, we then have a total of $|{\bm{h}}|!/(|{\bm{h}}|-k)!$ equivalence classes. Write the set of equivalence classes as $\Pi_{|{\bm{h}}|}^{k}$ , and represent each by one of its elements. Then,

is now a summation over only $|{\bm{h}}|!/(|{\bm{h}}|-k)!$ terms. We can conclude that

Next, we restate and prove the remaining portion of Theorem 2.1. See 2.1

( $\mathcal{F}_{k-1}\subset\mathcal{F}_{k}$ ): Consider any element $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k-1}\in\mathcal{F}_{k-1}$ , and write $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,~{}\cdot~{};\bm{\theta}^{(f)})$ for its associated Janossy function. For any sequence ${\bm{h}}$ , $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,\downarrow_{k-1}\!\!({\bm{h}});\bm{\theta}^{(f)})=\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}(|{\bm{h}}|,\downarrow_{k-1}\!\!(\downarrow_{k}\!\!({\bm{h}}));\bm{\theta}^{(f)}):=\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}_{+}(|{\bm{h}}|,\downarrow_{k}\!\!({\bm{h}});\bm{\theta}^{(f)})$ , where the function $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}_{+}$ looks at its first $k$ elements. Thus,

where $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k}$ is the Janossy function associated with $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}_{+}$ and thus belongs to $\mathcal{F}_{k}$ .

( $\mathcal{F}_{k}\not\subset\mathcal{F}_{k-1}$ ): the case where $k=1$ is trivial, so assume $k>1$ . We will demonstrate the existence of $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k}\in\mathcal{F}_{k}$ such that $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k-1}\neq\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k}$ for all $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k-1}\in\mathcal{F}_{k-1}$ . Let $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k}$ and $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k-1}$ be associated with $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}_{k}$ and $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}_{k-1}$ , respectively.

It suffices to consider $|{\bm{h}}|=k$ . Let $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}_{k}(|{\bm{h}}|,{\bm{h}}_{\pi};{\bm{\theta}}^{(f)}_{k})=\prod_{l=1}^{|{\bm{h}}|}{h}_{\pi(l)}$ whence $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k}(|{\bm{h}}|,{\bm{h}};{\bm{\theta}}^{(f)}_{k})=\prod_{l=1}^{|{\bm{h}}|}{h}_{l}$ . Thus, for any $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}_{k-1}$ and any ${\bm{\theta}}^{(f)}_{k-1}$ ,

Proposition 2.2 is repeated below and is followed by a more rigorous restatement.

The following statement is similar to that in Yuille (2004), which also provides intuition behind the theoretical assumptions, which are indeed quite general. See also (Younes, 1999). This is a familiar application of stochastic approximation algorithms already used in training neural networks.

Consider the $\pi$ -SGD algorithm in Definition 2.3. If

there exists a constant $M>0$ such that for all ${\bm{\theta}}$ , $-{\bm{G}}_{t}^{T}{\bm{\theta}}\leq M\|{\bm{\theta}}-{\bm{\theta}}^{\star}\|_{2}^{2}$ , where ${\bm{G}}_{t}$ is the true gradient for the full batch over all permutations, ${\bm{G}}_{t}=\nabla_{\bm{\theta}}\overline{\overline{\lower 0.86108pt\hbox{$ J $}}}(\mathcal{D};{\bm{\theta}}^{(\rho)}_{t},{\bm{\theta}}^{(f)}_{t},{\bm{\theta}}^{(h)}_{t}),$ where ${\bm{\theta}}\equiv({\bm{\theta}}^{(\rho)},{\bm{\theta}}^{(f)},{\bm{\theta}}^{(h)})$ , and ${\bm{\theta}}^{\star}$ is the optimum.

there exists a constant $\delta>0$ such that for all ${\bm{\theta}}$ , $E_{t}[\|{\mathbf{Z}}_{t}\|_{2}^{2}]\leq\delta^{2}(1+\|{\bm{\theta}}_{t}-{\bm{\theta}}^{\star}_{t}\|_{2}^{2})$ , where the expectation is taken with respect to all the data prior to step $t$ .

Then, the algorithm in equation 9 converges to ${\bm{\theta}}^{\star}$ with probability one.

First, we can show that $E_{t}[{\mathbf{Z}}_{t}]={\bm{G}}_{t}$ by equation 10, the linearity of the derivative operator, and the fact that the permutations are independently sampled for each training example in the mini-batch and are assumed independent of ${\bm{\theta}}$ . That equation 9 converges to ${\bm{\theta}}^{\star}$ is a consequence of our conditions and the supermartingale convergence theorem (Grimmett & Stirzaker, 2001, pp. 481). The following argument follows Yuille (2004). Let $A_{t}=\|{\bm{\theta}}_{t}-{\bm{\theta}}^{\star}\|_{2}^{2}$ , $B_{t}=\delta^{2}\eta_{t}^{2}$ , and $C_{t}=-\|{\bm{\theta}}_{t}-{\bm{\theta}}^{\star}\|_{2}^{2}(\delta^{2}\eta_{t}^{2}-2M\eta_{t})$ . Note that $C_{t}$ is positive for a sufficiently large $t$ , and $\sum_{t=1}^{\infty}B_{t}\leq\infty$ by our definition of $\eta_{t}$ (Definition 2.3). We will demonstrate that $E_{t}[A_{t}]\leq A_{t-1}+B_{t-1}-C_{t-1}$ , for all $t$ , in the Supplementary Material from which it follows that $A_{t}$ converges to zero with probability one and $\sum_{t=1}^{\infty}C_{t}<\infty$ . We write

The accuracy scores for all models (including the LSTM) on the sequence arithmetic tasks are shown in Table B.1. This table repeats results shown in Table 3.1, except here we show additional rows representing models that use LSTM as $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ . We chose accuracy (0-1 loss) to be consistent with Zaheer et al. (2017); here we report mean absolute error to evaluate the differences it makes on our results. These can be found in Tables B.1 and 5. The message is similar to the one told by accuracy scores; there is a drop in the mean absolute error as the value of $k$ increases and when using more sampled permutations at test-time (e.g., Janossy-20inf-LSTM versus Janossy-1inf-LSTM). Again, the power of using an RNN for $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ and training with $\pi$ -SGD is salient on the variance task where it is important to exploit dependencies in the sequence. Beyond the performance gains, we also observe a drop in variance when sampling more permutations at test time. Furthermore, as discussed in the implementation section, we constructed $k$ -ary models to have the same number of parameters regardless of $k$ for the results reported in the main body. We show results where this constraint is relaxed in Table 6. Here we see a modest improvement of $k$ -ary models which stands to reason considering the embedding dimension fed to the Janossy pooling layer was reduced from 100 with $k=1$ to 33 with $k=3$ (please see the implementation section for details).

For the graph tasks, the plot of performance as a function of number of inference-time permutations is shown in Figure 2.

For the $k$ -ary results shown in the body, we made sure the number of parameters was consistent for $k\in\{1,2,3\}$ (see Table 8). We unify the number of parameters by adjusting the output dimension of the embedding. We also experimented with relaxing the restriction that $k$ -ary models have the same numbers of parameters (Table 6), and the numbers of parameters in these models is also shown in Table 8. For the LSTM than GRU models, we follow the choice of Zaheer et al. (2017) which also reports that the choices were made to keep the numbers of parameters consistent.

Optimization is done with Adam (Kingma & Ba, 2015) with a tuned the learning rate, searching over $\{0.01,0.001,0.0001,0.00001\}$ . Training was performed on GeForce GTX 1080 Ti GPUs.

C LaTeXfor Janossy function markers

The datasets used for this task are summarized in Table 9. Our implementation is in PyTorch using Python 2.7, following the PyTorch code associated with Hamilton et al. (2017). That repo did not include an LSTM aggregator, so we implemented our own following the TensorFlow implementation of GraphSAGE, and describe it here. At the beginning of every forward pass, each vertex $v$ is associated with a $p$ -dimensional vertex attribute ${\bm{h}}$ (see Table9). For every vertex in a batch, $k_{1}$ neighbors of $v$ are sampled, their order is shuffled, and their features are fed through an LSTM. From the LSTM, we take the short-term hidden state associated with the last element in the input sequence (denoted ${\bm{h}}_{(T)}$ in the LSTM literature, but this ${\bm{h}}$ is not to be confused with a vertex attribute). This short-term hidden state is passed through a fully connected layer to yield a vector of dimension $\frac{q}{2}$ , where $q$ is a user-specified positive even integer referred to as the embedding dimension. The vertex’s own attribute ${\bm{h}}$ is also fed forward through a fully connected layer with $\frac{q}{2}$ output neurons. At this point, for each vertex, we have two representation vectors of size $\frac{q}{2}$ representing the vertex $v$ and its neighbor multiset, which we concatenate to form an embedding of size $q$ . This describes one convolution layer, and it is repeated a second time with a distinct set of learnable weights for the fully connected and LSTM layers, sampling $k_{2}$ vertices from each neighborhood and using the embeddings of the first layer as features. After each convolution, we may optionally apply a ReLU activation and/or embedding normalization, and we follow the decisions shown in the GraphSAGE code Hamilton et al. (2017). After both convolution operations, we apply a final fully connected layer to obtain the score, followed by a softmax (Cora, Pubmed) or sigmoid (PPI). The loss function is cross entropy for Cora and Pubmed, and binary cross entropy for PPI.

The number of trainable parameters in each model is independent of $k_{1}$ and $k_{2}$ by the design of LSTMs (the same is true for the mean-pooling aggregator). The only variation in the number of weights is in the dimensions of the input and output features, which differ by dataset. Please see Table 10 for details.

Optimization is performed with the Adam optimizer (Kingma & Ba, 2015). The training routine for the smaller graphs (Cora, Pubmed) is not guaranteed to see the entire training data, in contrast with the scheme applied to the larger PPI graph. For Cora and Pubmed, we form 100 minibatches by randomly sampling subsets of 256 vertices from the training dataset (with replacement). With PPI, we perform 10 full epochs: at each epoch, the training data is shuffled, partitioned into minibatches of size 512, and we pass over each. In either case, the weights are updated after computing the gradient of the loss on each minibatch.

The hyperparameters were set by following Hamilton et al. (2017); no hyperparameter optimization was performed. For every dataset, the embedding dimension was set to $q=256$ at both conv layers. For Pubmed and PPI, the learning rate is set at 0.01 while for Cora it is set at 0.005.

At test time, we load the weights obtained from training, perform 20 forward passes – which shuffles the input sequence by design – average the predicted probabilities (i.e. softmax output) from each forward pass, and choose the class that maximizes the averaged probabilities.

The implementation for Mean Pooling is similar in spirit but replaces $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ with a permutation invariant function. The details can be found by viewing our repo on GitHub.

The commands below can be directly copied and pasted into LaTeXsource to create the Janossy function markers. Please use the amsmath package.

To typeset $\overline{\overline{\lower 0.86108pt\hbox{$ f $}}}$ , we define in LaTeXas

Similarly, for $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ , we define in LaTeXas

Last, the definition above for $\mathstrut\mkern 2.5muf\mkern-11.0mu\raise 6.88889pt\hbox{$ \scriptscriptstyle\rightharpoonup $}$ caused difficulties in environments such as figure, so we defined and occasionally used in LaTeX