Hyperbolic Attention Networks

Caglar Gulcehre, Misha Denil, Mateusz Malinowski, Ali Razavi, Razvan Pascanu, Karl Moritz Hermann, Peter Battaglia, Victor Bapst, David Raposo, Adam Santoro, Nando de Freitas

Introduction

The focus of this work is to endow neural network representations with suitable geometry to capture fundamental properties of data, including hierarchy and clustering behaviour. These properties emerge in many real-world scenarios that approximately follow power-law distributions . This includes a wide variety of natural phenomena in physics , biology , and even human-made structures such as metabolic-mass relationships , social networks , and frequencies of words .

Complex networks , which connect distinguishable heterogeneous sets of elements represented as nodes, provide us with an intuitive way of understanding these structures. They will also serve as our starting point for introducing hyperbolic geometry, which is by itself difficult to visualize. Nodes in complex networks are referred to as heterogeneous, in the sense that they can be divided into sub-nodes which are themselves distinguishable from each other. The scale-free structure of natural data manifests itself as a power law distribution on the node degrees of the complex network that describes it.

Complex networks can be approximated with tree-like structures, such as taxonomies and dendrograms. Let us begin by recalling a simple property of nn-ary trees that will help us understand hyperbolic space and why Euclidean geometry is perhaps inadequate to model relational data.

In an nn-ary tree, the number of nodes at distance rr from the root and the number of nodes at distance no more than rr from the root both grow as nrn^{r}. Just like trees, the volume of hyperbolic space also expands exponentially. For example, in a two-dimensional hyperbolic space with curvature ζ2,ζ>0-\zeta^{2},\zeta>0, the length and area of the disc of radius rr grow as 2πsinh(ζr)2\pi\sinh(\zeta r) and 2π(cosh(ζr)1)2\pi(\cosh(\zeta r)-1), respectively, both of which grow exponentially in rr .

The growth of volume in hyperbolic space should be contrasted with Euclidean space where the corresponding quantities expand only polynomially, length as 2πr2\pi r and area as πr2\pi r^{2} in the two-dimensional example. Figure 1 is an attempt at visualizing this. For hierarchical data with scale-free structure, the polynomially expanding Euclidean space cannot capture the exponential complexity in the data (and so the images overlap). On the other hand, the exponentially expanding hyperbolic space is able to match the complexity of the data (here we have used the visualization trick of the famous artist Escher of making the images progressively smaller toward the boundary to illustrate the exponential expansion).

The intimate connection between hyperbolic space scale free networks (where node degree follows a power law) is made more precise in Krioukov et al. . In particular, there it is shown that the heterogeneous topology implies hyperbolic geometry, and conversely hyperbolic geometry yields heterogeneous topology.

Moreover, Sarkar describes a construction that embeds trees in two-dimensional hyperbolic space with arbitrarily low distortion, which is not possible in Euclidean space of any dimension . Following this exciting line of research, recently the machine learning community has gained interest in learning non-Euclidean embeddings directly from data .

Fuelled by the desire of increasing the capacity of neural networks so as to match the complexity of data, we propose hyperbolic attention networks. As opposed to previous approaches, which impose hyperbolic geometry on the parameters of shallow networks , we impose hyperbolic geometry on their activations. This allows us to exploit hyperbolic geometry to reason about embeddings produced by deep networks. We achieve this by introducing efficient hyperbolic operations to express the popular, ubiquitous mechanism of attention . Our method shows improvements in terms of generalization on neural machine translation , learning on graphs and visual question answering tasks while keeping the representations compact.

Models of hyperbolic space

Hyperbolic space cannot be isometrically embedded into Euclidean space . There are several ways to endow different subsets of Euclidean space with a hyperbolic metric, leading to different models of hyperbolic space. This leads to the well known Poincaré ball model and many others.

The different models of hyperbolic space are all necessarily the same, but different models define different coordinate systems, which offer different affordances for computation. In this paper, we primarily make use of the hyperboloid, whose status as the only commonly used unbounded model makes it a convenient target for projecting into hyperbolic space. We also make use of the Klein model, because it admits an efficient expression for the hyperbolic aggregation operation we define in Section 4.2.

We briefly review the definitions of the hyperboloid and Klein models and the relationship between them, in just enough detail to support the presentation in the remainder of the paper. A more thorough treatment can be found in Iversen . The geometric relationship between the the two models is diagrammed in Figure 5 of the supplementary material.

With this definition in had, the hyperboloid model consists of the set

and a point in the Klein model can be obtained from the corresponding point in the hyperboloid model by projection

Attention as a building block for relational reasoning

Learning relations in a graph by using neural networks to model the interactions or relations has shown promising results in visual question answering , modelling physical dynamics , and reasoning over graphs . Graph neural networks incorporate a message passing as part of the architecture in order to capture the intrinsic relations between entities. Graph convolution networks use convolutions to efficiently learn a continuous-space representation for a graph of interest.

Many of these relational reasoning models can be expressed in terms of an attentive read operation. In the following we give a general description of the attentive read, and then discuss its specific instantiations in two relational reasoning models from the literature.

First introduced for translation in Bahdanau et al. , attention has seen widespread use in deep learning, not only for applications in NLP but also for image processing imitation learning and memory . The core computation is the attentive read operation, which has the following form:

Here qi\mathbf{q}_{i} is a vector called the query and the kj\mathbf{k}_{j} are keys for the memory locations being read from. The pairwise function α(,)\alpha(\cdot,\cdot) computes a scalar matching score between a query and a key, and the vector vij\mathbf{v}_{ij} is a value to be read from location jj by query ii. Z>0Z>0 is a normalization factor for the full sum. Both vij\mathbf{v}_{ij} and ZZ are free to depend on arbitrary information, but we leave any dependencies here implicit.

It will be useful in the discussion to break this operation down into two parts. The first is the matching, which computes attention weights αij=α(qi,kj)\alpha_{ij}=\alpha(\mathbf{q}_{i},\mathbf{k}_{j}) and the second is the aggregation, which takes a weighted average of the values using these weights,

Instantiating a particular attentive read operation involves specifying both α(,)\alpha(\cdot,\cdot) and vij\mathbf{v}_{ij} along with the normalization constant ZZ.

If one performs an attentive read for each element of the set jj then the resulting operation corresponds in a natural way to message passing on a graph, where each node ii aggregates messages {vij}j\{\mathbf{v}_{ij}\}_{j} from its neighbours along edges of weight α(qi,kj)/Z\alpha(\mathbf{q}_{i},\mathbf{k}_{j})/Z.

We can express many (although not all) message passing neural network architectures using the attentive read operation of Equation 1 as a primitive. In the following sections we do this for two architectures and then discuss how we can replace both the matching and aggregation steps with versions that leverage hyperbolic geometry.

2 Relation networks

Relation Networks (RNs) are a neural network architecture designed for reasoning about the relationships between objects. An RN operates on a set of objects OO by applying a shared operator to each pair of objects (oi, oj)O×O\left(\mathbf{o}_{i},~{}\mathbf{o}_{j}\right)\in O\times O. The pairs can be augmented by a global information, and the result of each relational operation is passed through a further global transformation.

Using the notation of the previous section, we can write the RN as

where α(oi, oj)=1\alpha(\mathbf{o}_{i},~{}\mathbf{o}_{j})=1, vij=g(oi, oj, c)\mathbf{v}_{ij}=g(\mathbf{o}_{i},~{}\mathbf{o}_{j},~{}\mathbf{c}), Z=1Z=1. ff is the global transformation gg is the global transformation, the local transformation and c\mathbf{c} is the global context, as described in Santoro et al. . We augment the basic RN to allow α(oi,oj)\alpha(\mathbf{o}_{i},\mathbf{o}_{j})\in to be a general learnable function.

Interpreting the RN as learned message passing on a graph over objects, the attention weights take on the semantics of edge weights, where αij\alpha_{ij} can be thought of as the probability of the the (directed) edge ojoi\mathbf{o}_{j}\to\mathbf{o}_{i} appearing in the underlying reasoning graph.

3 Scaled dot-product attention

In the Transformer model of Vaswani et al. the authors define an all-to-all message passing operation on a set of vectors which they call scaled dot-product attention. In the language of Section 3.1 the scaled dot-product attention operation performs several attentive reads in parallel, one for each element of the input set.

Vaswani et al. write scaled dot-product attention as R=softmax(QKTd)V\mathbf{R}=\operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V}, where Q\mathbf{Q}, K\mathbf{K} and V\mathbf{V} are referred to as the queries, keys, and values respectively, and dd is the shared dimensionality of the queries and keys. Using lowercase letters to denote rows of the corresponding matrices, we can write each row of R\mathbf{R} as the result of an attentive read with

We experiment with both softmax and sigmoid operations for computing the attention weights in our hyperbolic models. The motivation for considering sigmoid attention weights is that in some applications (e.g. visual question answering) it makes sense for the attention weights over different entities to not compete with each other.

4 The path forward

At this point, we have discussed how many natural hierarchies posses scale-free structures, and also how the geometry of hyperbolic space naturally encodes this structure. We have considered two models of hyperbolic space (hyperboloid and Klein), and have seen how points in the two models correspond.

We described the attentive read operation, and how it can be broken into two steps which we call matching and aggregation. We then showed how the relational operations from Relation Networks and the Transformer could be expressed using the attentive read as a primitive.

The following section is devoted to explaining how we can redefine the attentive read as an operation on points in hyperbolic space. This will allow us to create hyperbolic versions of Relation Networks and the Transformer by replacing their attentive read with our hyperbolic version.

Hyperbolic attention networks

In this section we show how to redefine the attentive read operation of Section 3.1 as an operation on points in hyperbolic space. The key to doing this is to define new matching and aggregation functions that operate on hyperbolic points and take advantage of the metric structure of the manifold they live on. However, in order to apply these operations inside of a network we first we need a way to interpret network activations as points in hyperbolic space.

Pseudo-polar coordinates: In polar coordinates, we express an nn-dimensional point as a scalar radius, and n1n-1 angles. Pseudo-polar coordinates consist of a radius rr, as in ordinary polar coordinates, and an nn-dimensional vector d\mathbf{d} representing the direction of the point from the origin. In the following discussion we assume that the coordinates are normalized, i.e. that d=1\|\mathbf{d}\|=1.

It is easily verified that the resulting point lies in the hyperboloid,

and to verify that we maintain the appropriate scaling properties we compute the distance between a point and the origin using this projection:

2 Hyperbolic attention

In this section, we show how to build an attentive read operation that operates on points in hyperbolic space. We consider how to exploit hyperbolic geometry in both the matching and the aggregation steps of the attentive read operation separately.

Hyperbolic matching: The most natural way to exploit hyperbolic geometry for matching pairs of points is to use the hyperbolic distance between them. Given a query qi\mathbf{q}_{i} and a key kj\mathbf{k}_{j} both lying in hyperbolic space we take,

Hyperbolic aggregation: The path to extend the weighted midpoint to hyperbolic space is much less obvious, but fortunately such a extension already exists as the Einstein midpoint. The Einstein midpoint is straightforward to compute by adjusting the aggregation weights appropriately (see Ungar [39, Definition 4.21])

where the γ(vij)\gamma(\mathbf{v}_{ij}) are the Lorentz factors, that are given by γ(vij)=11vij2\gamma(\mathbf{v}_{ij})=\frac{1}{\sqrt{1-\|\mathbf{v}_{ij}\|^{2}}}. The norm in the denominator of the Lorentz factor is the Euclidean norm of the Klein coordinates of the point vij\mathbf{v}_{ij}, and the correctness of Equation 3 also relies on the points vij\mathbf{v}_{ij} being represented by their Klein coordinates. Fortunately the various models of hyperbolic space in common use are all isomorphic, so we can work in an arbitrary hyperbolic model and simply project to and from the Klein model to execute midpoint computations, as we discuss in the following section.

The reason for using the Einstein midpoint for hyperbolic aggregation is that it obeys many of the properties that we expect from a weighted average in Euclidean space. In particular, it is invariant to translating the vij\mathbf{v}_{ij}’s by a fixed distance in a common direction, and also by rotations of the constellation of points about the origin. The derivation of this operation is quite involved, and beyond the scope of this paper. We point the interested reader to Ungar for a full exposition.

Experiments

We evaluate our models on synthetic and real-world tasks. Experiments where the underlying graph structure is explicitly known clearly show the benefits of using hyperbolic geometry as an inductive bias. At the same time, we show that real-world tasks withi implicit graph strucutre such as a diagostic visual question answering task , and neural machine translation, equally benefit from relying on hyperbolic geometry.

We provide experiments with feedforward networks, the Transformer and Relation Networks endowed with hyperbolic attention.

Our results show the effectiveness of our approach on diverse tasks and architectures. The benefit of our approach is particularly prominent in relatively small models, which supports our hypothesis that hyperbolic geometry induces compact representations and is therefore better able to represent complex functions in limited space.

We use the algorithm of von Looz et al. to efficiently generate large scale-free graphs, and define two predictive tasks that test our model’s ability to represent different aspects of the structure of these networks. For both tasks in this section, we train Recursive Transformer (RT) models, using hyperbolic and Euclidean attention. A Recursive Transformer is identical to the original transformer, except that the weights of each self-attention layer are tied across depth. We use models with 3 recursive self-attention layers, each of which has 4 heads with 4 units each for each of q\mathbf{q}, k\mathbf{k}, and v\mathbf{v}. This model has similarities to Graph Attention Networks .

Link prediction (LP): Link prediction is a classical graph problem, where the task is to predict if an edge exists between two nodes in the graph. We experimented with graphs of 10001000 and 12001200 nodes and observed that the hyperbolic RT performs better than the Euclidean RT on both tasks. We report the results in Figure 3 (middle). In general, we observed that for graphs of size 10001000 and 12001200 the hyperbolic transformer performs better than the Euclidean transformer given the same amount of capacity.

Shortest path length prediction (SPLP): In this task, the goal is to predict the length of the shortest path between a pair of nodes in the graph. We treat this as a classification problem with a maximum path-length of 25 which becomes naturally an unbalanced classification problem. We use rejection sampling during training to ensure the network is trained on an approximately uniform distribution of path lengths. At test time we sample paths uniformly at random, so the length distribution follows that of the underlying graphs. We report the results in Figure 3 (left). In Figure 3 (right), we visualize the distribution of the scale of the learned activations (rr in the projection of Section 4.1) when training on graphs of size 100 and 400. We observe that our model tends to use larger scales for the larger graphs. As a baseline, we compare to the optimal constant predictor, which always predicts the most common expected path length. This baseline does quite well since the path length distribution on the test set is quite skewed.

For both tasks, we generate training data online. Each example is a new graph in which we query the connectivity of a randomly chosen pair of nodes. To make training easier, we use a curriculum, whereby we start training on smaller graphs and gradually increase the number of vertices towards the final number. More details on the dataset generation procedure and the curriculum scheme are found in the supplementary material.

2 Sort-of-CLEVR

Since we expect hyperbolic attention to be particularly well suited to relational modelling, we investigate our models on the relational variant of the Sort-of-CLEVR dataset . This dataset consists of simple visual scenes allowing us to solely focus on the relational aspect of the problem. Our models extend Relation Nets (RNs) with the attention mechanism in hyperbolic space (with the Euclidean or Einstein midpoint aggregation), but otherwise we follow the standard setup-up . Our best method yields accuracy of 99.2%99.2\% that significantly exceeds the accuracy of the original RN (96%96\%).

However, we are more interested in evaluating models on the low-capacity regime. Indeed, as Figure 4 (left) shows, the attention mechanism computed in the hyperbolic space improves around 20 percent points over the standard RN, where all the models use only two units of the relational MLP.

3 CLEVR

We train our Relation Network with various attention mechanisms on the CLEVR dataset . CLEVR is a synthetic visual question answering datasets consisting of 3D rendered objects like spheres, cubes, or cylinders of various size, material, or color. In contrast to other visual question answering datasets , the focus of CLEVR is on relational reasoning.

In our experiments, we closely follow the procedure established in , both in terms of the model architecture, capacity, or the choice of the hyperparameters, and only differ by the attention mechanism (Euclidean or hyperbolic attention), or sigmoid activations.

Results are shown in Figure 4 (Right). For each model, we vary the capacity of the relational part of the network and report the resulting test accuracy. We find that hyperbolic attention with sigmoid consistently outperforms other models.

Our RN with hyperbolic attention and sigmoid achieves 95.7%95.7\% accuracy on the test set at the same capacity level as RN, whereas the latter reportedly achieves 95.5%95.5\% accuracy .

4 Neural machine translation

The Transformer is a recently introduced state of the art model for neural machine translation. It relies heavily on attention as its core operation. As described in Section 3.3, we have extended the TransformerWe use a publicly available version: https://github.com/tensorflow/tensor2tensor by replacing its scaled dot-product attention operation with its hyperbolic counterpart. We evaluate all the models on the WMT14 En-De dataset .

We train several versions of the Transformer model with hyperbolic attention. They use different coordinate systems (Weierstrass or polar), or different attention functions (softmax or sigmoid). We consider two model sizes, referred to here as tiny and base. The tiny model has two layers of encoders and decoders, each with 128 units and 4 attention heads. The base model has 6 layers of encoders and decoders, each with 512 units and 8 attention heads. All hyperparameter configurations for the Euclidean versions of these models are available in the Tensor2tensor repository.

Results are shown in Table 1. We observe improvements over the Euclidean model by using hyperbolic attention, in particular when coupled with the sigmoid activation function for the attention weights. The improvements are more significant when the model capacity is restricted. In addition, our best model (with sigmoid activation function and without pseudo-polar coordinates) using the big architecture from Tensor2tensor, achieves 28.45 BLEU score, whereas Vaswani et al. report 28.4 BLEU score with the original version of this model.We achieve 28.3 BLEU score using the big Transformer with the publicly available framework..

Conclusion

We have presented a novel way to impose the inductive biases from hyperbolic geometry on the activations of deep neural networks. Our proposed hyperbolic attention operation makes use of hyperbolic geometry in both the computation of the attention weights, and in the aggregation operation over values. We implemented our proposed hyperbolic attention mechanism in both Relation Networks and the Transformer and showed that we achieve improved performance on a diverse set of tasks. We have shown improved performance on link prediction and shortest path length prediction in scale free graphs, on two visual question answering datasets, and finally on English to German machine translation. The gains are particularly prominent in relatively small models, which confirms our hypothesis that hyperbolic geometry induces more compact representations.

Acknowledgements

We would like to thank Neil Rabinowitz, Chris Dyer and Serkan Cabi for constructive comments on earlier versions of this draft. We thank Yannis Assael for helping us with the styles of the plots in this draft. We would like to thank Thomas Paine for the discussions.

References

Appendix A Appendix

A.2 Scale-free graph generation

We use the algorithm described by von Looz et al. . In our experiments, we set the α\alpha to 0.95 and edge_radius_R_factor to 0.350.35.

A.3 Scale-free graph curriculum

Curriculum was an essential part of our training on the scale-free graph tasks. On LP and SPLP tasks, we use a curriculum where we extract the connected components from the graph by cutting the disk that the graphs generated on into slices by starting from a 3030 degree angle and gradually increasing the size of the slice from the disk by increasing the angle during the training according to the number of lessons that are involved in the curriculum. This process is also visualized in Figure 6.

A.4 Travelling salesman problem (TSP)

We train an off-policy DQN-like agent with the HRT. The graphs for the TSP is generated following the procedure introduced in .

On this task, as an ablation we just compared the hyperbolic networks with and The results are provided in Figure 4 (Right) with and without implicit coordinates. Overall, we found that the hyperbolic transformer networks performs better when using the implicit polar coordinates.

A.5 Hyperbolic Recursive Transformer

As shown in Figure 9, the hyperbolic RT is an extension of transformer that ties the parameters of the self-attention layers. The self-attention layer gets the representations of the nodes of the graph coming from the encoder and the decoder decodes that representation from the recursive self-attention layers for the prediction.