Pathways: Asynchronous Distributed Dataflow for ML

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, Yonghui Wu

Introduction

Deep learning has seen remarkable achievements over the last decade, across domains from image understanding Krizhevsky et al. (2012); He et al. (2016) to natural language processing Devlin et al. (2019); Brown et al. (2020). This rapid recent progress of machine learning (ML) has been characterized by the co-evolution of ML models, accelerator hardware, and the software systems that tie the two together. This co-evolution poses a danger that systems become over-specialized to current workloads and fail to anticipate future needs. In this paper, we describe Pathways, a new system built for distributed ML. Pathways is designed to target specific capabilities that we believe will be needed by future ML workloads Dean (2021) – and are therefore needed today to support research into those workloads – but which are poorly supported by state-of-the-art systems.

For example, most of today’s state-of-the-art ML workloads use a “single program multiple data” (SPMD) model, inspired by MPI Clarke et al. (1994), where all accelerators run the same computation in lockstep and communication between accelerators is described by collectives like AllReduce. Recently, researchers have begun to run into the limits of SPMD for ML computations. Very large language models have been scaled up using pipelining rather than pure data-parallelism Narayanan et al. (2019); Rasley et al. (2020); Narayanan et al. (2021), and models such as Mixture of Experts (MoE) Shazeer et al. (2017) have started to explore computational sparsity that is most naturally expressed using fine-grain control flow and heterogeneous computation across accelerators. System designers have adopted ingenious techniques to execute pipelined Narayanan et al. (2021); Rasley et al. (2020); Narayanan et al. (2019); Huang et al. (2019) and homogeneous MoE Lepikhin et al. (2020); Fedus et al. (2021) models on MPI-style systems, but as we argue in detail later, the MPI programming model is too restrictive both for users and for the underlying system.

On the other hand, with each new generation of accelerators, ML clusters are becoming increasingly heterogeneous Jeon et al. (2019); Chaudhary et al. (2020); Weng et al. (2022). Providing exclusive access to large “islands” of homogeneous accelerators connected over high-bandwidth interconnects is expensive, and often wasteful as a single user program must try to keep all of the accelerators continuously busy. Such constraints are further driving researchers towards “multiple program multiple data” (MPMD) computations that allow more flexibility by mapping sub-parts of the overall computation to a collection of more readily available smaller islands of accelerators. To increase utilization, some ML hardware resource management researchers Xiao et al. (2020); Bai et al. (2020); Yu and Chowdhury (2020); Wang et al. (2021); Lim et al. (2021); Zhao et al. (2022); Weng et al. (2022) multiplex hardware in a fine-grained manner between workloads, enabling workload elasticity, and improving fault tolerance.

Finally, researchers are beginning to standardize on a set of foundation models Bommasani and et. al. (2021); Dean (2021) that are trained on large data at scale and are adaptable to multiple downstream tasks. Training and inference for such models offers opportunities for improving cluster utilization by multiplexing resources across many tasks, and efficiently sharing state between them. For example, several researchers might concurrently fine-tune Houlsby et al. (2019); Zhang et al. (2021) a foundation model for different tasks, using the same accelerators to hold the fixed foundation model layers. Training or inference over shared sub-models can benefit from techniques that allow examples from different tasks to be combined in a single vectorized batch to get better accelerator utilization Crankshaw et al. (2017).

This paper describes our system, Pathways, which matches the functionality and performance of state of the art ML systems, while providing the capabilities needed to support future ML workloads. Pathways uses a client-server architecture that enables Pathways’s runtime to execute programs on system-managed islands of compute on behalf of many clients. Pathways is the first system designed to transparently and efficiently execute programs spanning multiple “pods” of TPUs Google (2021), and it scales to thousands of accelerators by adopting a new dataflow execution model. Pathways’s programming model makes it easy to express non-SPMD computations and enables centralized resource management and virtualization to improve accelerator utilization.

In the remainder of the paper we first discuss the limitations of current distributed ML systems and motivate our design choices for Pathways (§2), and next describe the flexible programming model that Pathways supports (§3). We describe Pathways’s architecture (§4), highlighting how we have addressed the key limitations of older client-server ML systems using a sharded dataflow model and asynchronous gang-scheduling. We present both micro-benchmarks and end-to-end evaluations using real ML models that demonstrate we have met the goal of matching the performance of state-of-the-art multi-controllers for realistic workloads (§5), and validate that Pathways’s mechanisms are well suited to support the features needed for the research and deployment of novel and efficient ML methods.

Design Motivation

The design choices of distributed ML systems are often driven by the properties of the underlying target hardware accelerators. We refer readers to Appendix A for a discussion on some of these properties and how they typically influence distributed ML systems. Here, we focus on how some of the design and implementation choices of existing distributed ML systems make it hard for them to support large, sparse or irregular models.

Distributed ML systems for training state-of-the-art SPMD models often adopt a multi-controller architecture where the same client executable is run directly on all the hosts in the system, taking exclusive ownership of the resources on those hosts for the duration of the program execution. Examples of this architecture include MPI Clarke et al. (1994), PyTorch Paszke et al. (2019), JAX Bradbury et al. (2018), and more recent configurations of TensorFlow Shazeer et al. (2018); Agrawal et al. (2019). The key advantage of this architecture is the low latency for dispatching accelerator computations (see Figure 1(a)) since an identical copy of the user’s code runs on each of the accelerator hosts and dispatch involves communication only over (relatively) fast PCIe links. All other communication across hosts only happens through collectives that use dedicated interconnects like NVLink Foley and Danskin (2017) and ICI Jouppi et al. (2020) without going via host memory. However, this architecture is a poor match for modern ML workloads that use pipelining or computational sparsity. Any communication beyond standard collectives in multi-controller systems requires users to implement their own coordination primitives. The multi-controller approach also typically assumes exclusive ownership of hardware resources. This not only shifts the responsibility of ensuring high utilization of the expensive accelerators on to the user, but also complicates the design of features like resource virtualization and multiplexing that are needed to build efficient cluster-wide ML infrastructure.

Single-controller systems such as TensorFlow v1 Abadi et al. (2016) offer a very general distributed dataflow model, including optimized in-graph control flow Yu et al. (2018). A TensorFlow (TF) Python client builds a computation graph and hands it off to a coordinator runtime, which partitions the graph into a subgraph for each worker and delegates the execution of the subgraphs to local runtimes on workers. Coordination between workers is performed using data- and control-edges passing messages over the datacenter network (DCN). While the single-controller design offers a flexible programming model and virtualization of resources, it presents implementation challenges.

Firstly, while multi-controller systems only require communication over PCIe to dispatch accelerator computations (Figure 1(a)), clients in single-controller systems are “farther away” and the dispatch latency involves communication over DCN, typically an order of magnitude slower than PCIe (Figure 1(b)). Secondly, to support concurrent execution of MPMD programs with SPMD sub-computations, each spanning a subset of accelerators drawn from a shared cluster, the runtime must have some mechanism to support gang-scheduling of accelerator computations. Gang-scheduling is essential in the case of TPUs, since they are single-threaded and only run non-preemptible kernels, so the system will deadlock if communicating computations are not enqueued in a consistent order. Even for GPUs or other accelerators that can execute concurrent computations, gang scheduling allows more efficient execution of collectives Feitelson and Rudolph (1992). Single-controller systems for ML therefore require a distributed scheduling mechanism to order the computations enqueued on behalf of different programs. Finally, a system for modern ML workloads must be designed to run computations distributed over thousands of accelerators, with first class support for sharded representations and data structures. For instance, a naive dataflow graph representing an edge between an M-way sharded computation and an N-way sharded computation would require M+NM+N nodes and M×NM\times N edges, rapidly becoming unwieldy.

The implementation choices made by TF v1 were over-specialized to assume a single, smallish, exclusively-owned island of accelerators. This over-specialization makes it practically infeasible to use TF for contemporary or future ML workloads. While TF can run computations that require cross-host coordination or data transfer through send and recv ops (Figure 1(c)), host side work at the destination like dispatching the accelerator computation is triggered only after the transfer is completed. In programs involving many cross-host transfers, for example pipelined models with a large number of stages, these dispatch latencies accumulate, leading to inefficient accelerator utilization. While TF v1 users can (inefficiently) enforce a consistent ordering for gang-scheduling within a single program, by using control edges, the lack of a centralized scheduler in single-controller systems like TF v1 makes it impossible to ensure consistent ordering between computations across programs. TF also materializes the full sharded computation graph, which introduces substantial overhead in both graph serialization and execution when the number of shards reaches into the thousands, leading to millions of graph edges between sub-computations.

Pathways combines the flexibility of single-controller frameworks with the performance of multi-controllers. We adopt a single-controller model since we believe it offers much better opportunities than multi-controller for novel and efficient ML computation, both by exploiting computational sparsity and heterogeneity, and by enabling cluster management systems that promote sharing and virtualizing resources. Our design differs from older single-controller ML systems in that it uses asynchronous dispatch to match the performance of multi-controller systems, supports centralized resource management and scheduling with first-class support for gangs of SPMD accelerator computations, and uses a sharded dataflow system for efficient coordination.

Pathways Programming Model

We have implemented support to target Pathways from source programs written in TensorFlow and JAX, but we concentrate on JAX for the evaluation in this paper. JAX users can explicitly wrap standard Python code with decorators to indicate fragments that should be compiled into (potentially SPMD) XLA computations. These XLA computations are usually characterized by known input and output types and shapes, bounded loops, and with few (if any) conditionals (see Appendix B for more details) making it feasible to estimate the resource requirements of computations in advance. We refer to these computations with known resource requirements as “compiled functions”. Each such function maps to a single (sharded) computation node in a Pathways program.

JAX today cannot scale beyond a single TPU pod since JAX programs that run in multi-controller configurations transfer all data using XLA collectives, and these are only currently available over ICI on TPU. Pathways can be used as a plug-in replacement for the JAX backend, allowing JAX code to run unmodified except that SPMD computations now have access not just to the locally connected TPU cores, but to as many cores as are provisioned in the system. And since Pathways can communicate over both ICI and DCN, it allows JAX programs to scale for the first time to multiple TPU pods, containing many thousands of TPU cores.

The ability to run unmodified JAX code is convenient but does not unlock the full performance of Pathways. A Pathways user may request sets of “virtual devices”, with optional constraints on the device types, locations or interconnect topology, and is then able to place specific compiled functions on those devices (Figure 2). The system will automatically handle all data movement and resharding between dependent computations.

By default, we convert each compiled function into a standalone Pathways program containing just one (sharded) computation, meaning that if a user wants to run many functions back to back, a separate Python call and RPC from client to coordinator is required for each function. We therefore also implemented a new program tracer (Figure 2) that a user can wrap around a block of Python code that calls many compiled functions. The tracer generates a single Pathways program where each compiled function is represented by a computation node in a dataflow graph.

JAX’s philosophy of supporting transforms of traced code is a good match for the research directions we want to explore. For example, JAX has a companion library called FLAX Heek et al. (2020) that is used to express layered DNN models, and we have written a library that automatically converts a FLAX model into a pipelined Pathways program. In addition, JAX supports transforms to vectorize “per-example” Python functions, producing efficient batched code, and such transforms are a good basis for exploring new forms of data-dependent vectorized control flow, as we briefly describe later (§6.3).

Pathways System Architecture

Pathways builds extensively on prior systems, including XLA TensorFlow (2019) to represent and execute TPU computations, TensorFlow graphs and executors Abadi et al. (2016) to represent and execute distributed CPU computations, and Python programming frameworks including JAX Bradbury et al. (2018) and TensorFlow APIs. By leveraging these building blocks we are able to focus on the novel coordination aspects of Pathways, while being able to run existing ML models with minimal code changes.

A Pathways backend consists of a set of accelerators grouped into tightly-coupled islands that are in turn connected to each other over DCN (Figure 3). Pathways has a “resource manager” which is responsible for the centralized management of devices across all of the islands. A client may ask for “virtual slices” of the island with specific 2D or 3D mesh shapes that suit their communication pattern. Each virtual slice contains “virtual devices“ that allow the client to express how computations are laid out on the mesh. The resource manager dynamically assigns physical devices for virtual devices satisfying the desired interconnect topology, memory capacity, etc.

Our initial resource manager implementation uses a simple heuristic that attempts to statically balance load by spreading computations across all available devices, and keeps a one to one mapping between virtual and physical devices. If future workloads require it we can adopt a more sophisticated allocation algorithm, for example taking into account the resource requirements of all client computations and the current state of the system to approximate an optimal allocation of physical devices to computations.

Pathways allows backend compute resources to be added and removed dynamically, with the resource manager tracking available devices. The layer of indirection between virtual and physical devices, as enabled by our single-controller design, will allow us in future to support features like transparent suspend/resume and migration, where a client’s virtual devices are temporarily reclaimed or reassigned without the need for cooperation from the user program.

2 Client

When the user wants to run a traced program, it calls the Pathways client library which first assigns virtual devices to any computations that have not been run before, and registers the computations with the resource manager, triggering the servers to compile the computations in the background. The client then constructs a device location-agnostic Pathways intermediate representation (IR) for the program, expressed as a custom MLIR Lattner et al. (2021) dialect. The IR is progressively “lowered” via a series of standard compiler passes, which eventually output a low-level representation that includes the physical device locations. This low-level program takes into account the network connectivity between physical devices and includes operations to transfer outputs from a source computation shard to the locations of its destination shards, including scatter and gather operations when a data exchange is required. It is efficient to repeatedly run the low-level program in the common case that the virtual device locations do not change, and the program can be re-lowered if the resource manager changes the mapping between virtual and physical devices.

The client in older single controller systems can quickly become a performance bottleneck as it coordinates thousands of individual computations and data buffers corresponding to each shard of computations spread across thousands of accelerators. The Pathways client uses a sharded buffer abstraction to represent a logical buffer that may be distributed over multiple devices. This abstraction helps the client scale by amortizing the cost of bookkeeping tasks (including reference counting) at the granularity of logical buffers instead of individual shards.

3 Coordination implementation

Pathways relies on Plaque for all cross-host coordination that uses DCN. Plaque is an existing (closed-source) production sharded dataflow system used at Google for many customer-facing services where high-fanout or high-fanin communication is necessary, and both scalability and latency are important. The low-level Pathways IR is converted directly to a Plaque program, represented as a dataflow graph. Pathways has stringent requirements for its coordination substrate, all of which are met by Plaque.

First, the representation used to describe the Pathways IR must contain a single node for each sharded computation, to ensure a compact representation for computations that span many shards, i.e. a chained execution of 2 computations AA and BB with NN computation shards each should have 4 nodes in the dataflow representation: ArgCompute(A)Compute(B)ResultArg\xrightarrow{}Compute(A)\xrightarrow{}Compute(B)\xrightarrow{}Result, regardless of the choice of NN. In the Plaque runtime implementation each node generates output data tuples tagged with a destination shard, so when performing data-parallel execution NN data tuples would flow, one between each adjacent pair of IR nodes.

The coordination runtime must also support sparse data exchanges along sharded edges, in which messages can be sent between a dynamically chosen subset of shards, using standard progress tracking mechanisms Akidau et al. (2013); Murray et al. (2013) to detect when all messages for a shard have been received. Efficient sparse communication is a requirement to avoid the DCN becoming a bottleneck for data-dependent control flow on accelerators, which is one of the key capabilities that we want Pathways to enable.

The coordination substrate is used to send DCN messages that are in the critical path for transmitting scheduling messages and data handles (Figure 4), so it must send critical messages with low latency, and batch messages destined for the same host when high throughput is required.

It is also convenient to use an extensible, general-purpose, dataflow engine to handle DCN communication, since this means that Pathways can also use it for background housekeeping tasks such as distributing configuration information, monitoring programs, cleaning them up, delivering errors on failures, and so on.

We believe that it would be feasible to re-implement the full Pathways design using other distributed frameworks such as Ray Moritz et al. (2018) rather than Plaque to realize the low-level coordination framework. In such an implementation, Pathways executors and schedulers would be replaced by long-running Ray actors that would implement Pathways scheduling on top of the underlying Ray cluster scheduling, and executors could use PyTorch for GPU computation and collectives. Some additions might be required to attain comparable performance (see §5) because Ray lacks, for example, an HBM object store, or primitives to efficiently transfer remote objects over the GPU interconnect.

4 Gang-scheduled dynamic dispatch

As discussed previously (§2), one requirement for supporting SPMD computations on a shared set of accelerators is to support efficient gang-scheduling. The Pathways runtime includes a centralized scheduler per island that consistently orders all of the computations in the island. As Pathways enqueues a program for execution, the Plaque dataflow program is responsible for (i) enqueueing the execution of local compiled functions at each accelerator, with buffer futures as inputs; (ii) enqueueing network sends to remote accelerators for the buffer futures output by function executions; and (iii) communicating with the scheduler to determine a consistent order of function executions across all programs running on the island. The scheduler must implement policies for allocating accelerators at a time-scale of milliseconds. Our current implementation simply enqueues work in FIFO order, but more sophisticated schedulers might for example reorder computations based on estimated execution times.

5 Parallel asynchronous dispatch

When running computations on accelerators, systems can take advantage of asynchronous APIs to overlap computation with coordination Kwon et al. (2020). Consider the three-node graph in Figure 4(a), where the squares correspond to three nodes A, B, and C running on accelerators attached to hosts A, B, and C. All node computations are regular compiled functions. Host A enqueues node A, receives a future for A’s outputs, and transmits the future to host B. Host B allocates B’s inputs, transmits the input buffer addresses to host A, and performs most of the preparatory work to launch node B’s function. When node A completes, its outputs are sent via the accelerator interconnect directly into node B’s input buffers, and then host B starts node B. The latency between one node completing and the next node starting can be made to be little more than the data transfer time.

The above design works well when a predecessor node’s computation takes longer than the time spent in scheduling, resource allocation, and coordination between hosts. However if the computation time is too short, which is the case shown in the figure, the asynchronous pipeline stalls and the host-side work becomes the critical bottleneck for executing the overall sequence of computations. Given that the compiled functions are all regular, a successor node’s input shapes can in practice be computed before the predecessor computation was even enqueued.

We therefore introduce a novel parallel asynchronous dispatch design shown in Figure 4(b), which exploits the statically known resource usage of regular compiled functions to run most of the host-side work for a computation’s nodes in parallel, rather than serializing the work for a node to happen after its predecessors have been enqueued. Since work can only be scheduled in parallel when functions are regular, Pathways treats parallel scheduling as an optimization and falls back to the traditional model when a node’s resource requirements are not known until a predecessor computation has completed (e.g., due to data-dependent control flow). When a subgraph of a computation can be scheduled statically, the program sends a single message (describing the entire subgraph) to the scheduler, which is able to sequence the execution of all the active shards in the subgraph back to back. The use of a single message is designed to minimize network traffic, but does not require the scheduler to actually enqueue all the subgraph’s shards as a batch: computations may still be interleaved with those submitted by other concurrently executing programs. We evaluate the cost of different dispatch mechanisms in §5.

6 Data management

Each host manages a sharded object store that is similar to Ray’s object stores Moritz et al. (2018), but extended to also track buffers held in accelerator HBM at each shard. Client programs can hold references to objects in remote host or accelerator memory, and the client and servers refer to them using opaque handles that allow the system to migrate them if needed. Intermediate program values are also kept in the object stores, for example while the system is waiting to transfer them between accelerators, or pass them to a subsequent computation. The objects are tagged with ownership labels so that they can be garbage collected if a program or client fails. We can use simple back-pressure to stall a computation if it cannot allocate memory because other computations’ buffers are temporarily occupying HBM.

Evaluation

For evaluating JAX, Pathways, and TensorFlow on TPU we use three different configurations. Configuration (A) has 44 TPUs per host, and the largest instance we report on has 512512 hosts, resulting in 20482048 total TPUs connected via ICI. Configuration (B) has 88 TPUs per host, and the largest instance we report on has 6464 hosts, and a total 512512 TPUs. Configuration (C) uses four islands of TPUs, where each island has 44 hosts and 3232 TPUs. We note in the text when experiments use a subset of the TPUs of a particular configuration.

When evaluating Ray on GPU we use Ray v1.3 and PyTorch 1.8.1 running on p3.2xlarge VMsThese VMs have 1×1\timesV100 GPU and 8×8\timesCPU cores. with hosts connected via DCN and scheduled using Amazon placement groups.

We mostly compare Pathways against multi-controller JAX, since JAX has demonstrated state of the art performance in industry standard benchmarks Mattson et al. (2020) and we can easily run JAX and Pathways (PW) on identical hardware configurations. We also compare against TensorFlow (TF) and Ray in micro-benchmarks, to examine specific aspects of Pathways’s distributed system performance, and show pipelined performance of a TF model running on Pathways.

Our first experiment is a micro-benchmark to compare the overheads of JAX multi-controller with single-controller frameworks. We construct programs that repeatedly run a trivial gang-scheduled computation containing a single AllReduce of a scalar followed by a scalar addition, feeding the output of one computation to the input of the next. We measure the throughput: the number of computations per second that execute on the accelerators. We compare three ways that the user code can enqueue the computations:

OpByOp (-O): The user code contains a separate call for each execution of the computation.

Chained (-C): The user code contains a series of calls each of which executes a chain of 128128 nodes, where each node executes the computation. The system executes the entire chain in response to a single client call.

Fused (-F): The user code contains a series of calls each of which executes a single computation node, where the node contains a chain of 128128 computations.

For JAX multi-controller, OpByOp means JIT-compiling a function containing one computation and calling it repeatedly from Python, and Fused means JIT-compiling a function containing a chain of computations. There is no analog of Chained for a multi-controller. For Pathways, OpByOp and Fused use the same JAX source as for the multi-controller, and Chained uses the Pathways program tracer to form a multi-node program where each node contains a simple computation. TF is similar to Pathways, where we construct the same TPU computations and execute them using TF graphs instead of Pathways. For Ray, OpByOp means executing a separate actor method for each computation which executes a PyTorch AllReduce. Chained means chaining a sequence of actor methods (by passing Ray futures), each of which executes a single PyTorch AllReduce. Fused means executing a single actor method which runs a chain of PyTorch AllReduce commands in a loop.

Figure shows the result. Note that OpByOp is a worst-case experiment that is not idiomatic for any of the frameworks, and it is present merely to stress the underlying systems. As expected, for OpByOp the JAX multi-controller throughput is much better than the single-controller systems, particularly as the number of accelerators increases. Most of Pathways’s overhead comes from the fact that the client waits until the coordinator has enqueued one computation and returned its output handles before enqueueing the next. We could eliminate most of this overhead by allowing user code to proceed in parallel with the enqueue RPC, and opportunistically batching multiple small computations into a single Pathways program. We have not focused on optimizing overheads of very small computations since, on real models with computations involving more than scalars, Pathways already matches the performance of multi-controller JAX (see §5.3). Once enough work is Fused into a single node Pathways matches JAX’s performance up to 1000 TPU cores, and Pathways Chained outperforms JAX OpByOp up to 256 cores, because Pathways can execute back-to-back accelerator computations directly from C++ while JAX OpByOp transitions to Python for every computation.

TensorFlow and Ray suffer from their lack of a device object store: Ray must transfer the result of a computation from GPU to DRAM before returning the object handle to the client, while TensorFlow transfers the data back to the client. This overhead hurts their OpByOp performance but is largely amortized for Chained and Fused. The performance of Ray and Pathways are not directly comparable since they use different hardware, but we interpret the results to suggest that, if the full Pathways design were implemented substituting Ray for Plaque, it should be possible to achieve comparable performance. Out of the box, Ray shows about an order of magnitude worse performance per computation than Pathways, but that is unsurprising since Ray can execute general-purpose Python actors and Pathways is specialized to TPU computations launched from C++. With careful attention to engineering, it might be possible to add fast paths to Ray, such as an on-GPU object store and primitives to transfer objects efficiently over the GPU interconnect, that eliminate most of its additional overheads. TensorFlow is slow when running over many cores because it uses a centralized barrier, implemented with control edges, to serialize the gang-scheduled computations.

Figure 6 varies the amount of time spent in each computation to find the smallest computation for which Pathways matches JAX’s throughput. For 16 hosts with 128 TPUs on configuration (B), parity is reached with only 2.32.3 ms, and even for 512 hosts with 2048 TPUs on configuration (A), a computation of at least 3535 ms masks all of Pathways’s single-controller overhead.

Our next micro-benchmark, also on configuration (B), evaluates the benefit of the parallel asynchronous dispatch mechanism described in §4.5. We construct a more realistic pipeline benchmark in which the simple computations from the earlier benchmark are again chained together, but now each computation runs on a different set of 4 TPU cores, each on a different host, and data output from one computation must be sent via ICI before the next computation can execute. Figure 7 shows three “phases”: at first the fixed client overhead is amortized as the number of hosts increases; then the increasing transfer costs of adding more stages begin to dominate; finally the system starts to amortize the fixed scheduling overhead. Eventually we expect that transfer overheads would dominate again. For comparison, we also show the performance when we force the Pathways dataflow execution to use sequential asynchronous dispatch, and wait for one computation to be enqueued before enqueueing the next, to measure the benefit we get from parallel asynchronous dispatch.

2 Multi-tenancy

We validate in Figure 8 (performed on configuration (B)) that Pathways is able to time-multiplex accelerators between concurrent programs. Pathways can achieve at least the same aggregated throughput as JAX when multiple clients concurrently submit different Pathways programs, i.e., there is no overhead to context switch between programs from different clients, at least when their resources concurrently fit in HBM (traces in Appendix D). As already shown in Figure 6, the degree of concurrency required to match the throughput is lower for larger computation sizes since the TPU cores reach full utilization sooner. It is noteworthy that the maximum throughput of Pathways exceeds that that of JAX for very small computations, achieving higher TPU utilization. This is because a Pathways worker can accept more computations from remote clients than JAX can dispatch using Python locally.

Figure 9 shows traces of a sample of 128 cores on Pathways for the above workload. This experiment highlights that Pathways performs gang-scheduling of programs submitted by 4 independent clients while controlling allocation of accelerator time for fairness; for example, the scheduler can enforce proportional share in this multi-tenancy setting.

3 Large scale model performance

Finally, we show the performance of Pathways in training real machine learning models that can be expressed as SPMD programs. We compared JAX and TF models running on their native systems to the same models running on Pathways, and verified that at numerical results are identical, so we focus only on performance.

We first compare to JAX multi-controller running a Transformer model with an Encoder-Decoder architecture that is used for several text-to-text natural language processing tasks. We use model configurations from Raffel et al. (2019) and run the experiments on TPUv3s with 16GB memory per accelerator. Table 1 shows the training throughput (tokens/second) for Text-to-text Transformer model with various model sizes (up to 11 billion parameters), training on different number of accelerators. As expected, since the model code is the same, the models trained on JAX and Pathways achieve the same perplexity in the same number of steps. Over all tested model sizes, the two systems show identical performance since realistic computations are large enough to mask single-controller overheads. While we do not report detailed results, we have substantial experience of running JAX models on Pathways, which corroborates the finding that the performance of the two systems is comparable across a broad range of settings.

Next, we compare the performance of Pathways when training a Transformer-based language model with a Decoder-only architecture on configurations (B) and (C). For this experiment, we use a model expressed in Python using TF. The model consists of 62 Transformer layers with a model dimension of 2048 and a hidden dimension of 8192, which results in 3 billion parameters in total. We compare an SPMD configuration to a pipeline using a GPipe-like schedule Huang et al. (2019). The pipelined model is split into multiple stages with balanced computation in each stage. Since the first stage has an extra embedding lookup layer and the last stage has an extra softmax layer, we took out one Transformer layer from the first and last stage to balance the amount of compute per stage. Each stage is assigned to a different set of accelerators spanning multiple hosts.

Table 2 shows the training throughput for different numbers of stages (S) and micro-batches (M), while keeping the global batch size and training hyperparameters fixed.Unlike Megatron Shoeybi et al. (2019), the SPMD-sharded model evaluated here is similar to GShard Lepikhin et al. (2020) and does not have communication proportional to batch size, so it is fair to evaluate pipelined and SPMD with the same batch size. The number of examples per micro-batch is fixed at 4 for all cases, and, hence, the global batch size per step is 2048 for the 128-core configurations (8192 for the 512-core one).

Pathways’s training throughput increases proportionally with the number of TPU cores per pipeline stage (Table 2), in line with other systems Rasley et al. (2020); Narayanan et al. (2021). This result is consistent with Figure 5 showing that the throughput of Pathways linearly scales with the number of hosts. Increasing the number of pipeline stages adds minimal overhead, the throughput being reduced from 133.7k tokens/sec to 131.4k tokens/sec when the number of stages increases from 4 to 16. We compare the pipelined models’ performance to an equivalent model expressed using SPMD, and observe that at least in this instance, the pipeline has competitive performance to SPMD, since collective communication within the SPMD computation incurs higher overhead than pipeline bubble overhead.

We also demonstrate that Pathways can efficiently train models over islands of TPUs connected via DCN. In the S=16, M=64S=16,\ M=64 configuration with 128 cores, we measure the same throughput (131.4131.4k tokens/sec) using a single island of 128 cores on configuration (B), or 4 islands of 32 cores each on configuration (C). Figure 10 shows a trace of a sample of cores when the stages are partitioned into islands. DCN transfers occur between every group of 8 rows in the trace, and are not visible in the trace because communication time is effectively overlapped with computation.

Finally, we scale up training of large Decoder-only Transformer models to 64B and 136B parameters using two islands of accelerators. When trained using using Pathways over two islands of compute connected over DCN, Pathways achieves 97%\sim 97\% of the throughput as compared to a single island with twice as many devices. For the 136B (64B) LM model, we train over two islands of 1024 (512) cores that uses the fast ICI within island reduction followed by DCN transfer across islands (execution trace available in Appendix D) of 1030GB (457GB) for global reduction.

Discussion

Pathways was designed to target large collections of TPU accelerators. The use of TPU instead of GPU affects many of our low-level design decisions. The biggest difference between TPU and GPU is that far longer-running and more complex computations can be fused into a single TPU kernel, because the TPU supports rich control flow and communication primitives that must instead be executed by driver code on GPU systems. GPUs, by contrast, are more tightly integrated with host memory systems and DCNs NVIDIA (2021) (more details in Appendix A.5). TPUs are a good fit for Pathways because XLA can compile high performance functions containing fused collectives, and the large islands of high-performance TPU interconnects allow flexible scheduling of computations of many different sizes. Nevertheless, we believe that most of the high-level architectural choices we made in Pathways and describe in this paper would also be valid for large-scale GPU systems.

2 Resource management

Pathways is designed to allow a wide variety of fine-grained dynamic resource-management policies. Our initial research has focused on efficient dynamic time-multiplexing of TPU computations. For more complex future multi-tenancy use cases, Pathways will need to handle more diverse resource types including but not limited to device and host memory, and ICI, DCN, and PCIe bandwidth. Pathways’s single-controller model grants the system an extensive ability to track available resources and to allocate resources at large scale. We are planning to explore common multi-tenancy requirements such as priorities, performance isolation, access control, and resource accounting, but at timescales that are significantly smaller than prior work, and for orders-of-magnitude larger pools of resources (e.g., thousands of cores and TBs of accelerator memory).

3 Data-dependent vectorized control flow

Almost all ML models currently update every model weight based on every training example in every step. We want to enable research that uses fine-grain control flow so that different model weights can be updated per example, or even per sub-example (patch of an image, or word of a sentence). Models like Mixture of Experts (MoE) Shazeer et al. (2017) and routed capsule networks Hinton et al. (2018); Barham and Isard (2019) exploit computational sparsity by “routing” different (sub-)examples to the accelerators hosting different subsets of model weights based on learned functions that are updated as training progresses. This routing requires fine-grain data-dependent data exchanges between nodes. Our ML research colleagues have told us that they would like to use sparsity more effectively when training ever larger models, with ever more tasks, but that current frameworks limit their ability to experiment with novel model architectures. It is the subject of future work to support data-dependent vectorized control flow with both a clean programming model and good performance.

Related work

We have examined closely related work in detail in §2. This section expands on related research that addresses ML workloads that need capabilities beyond those offered by SPMD multi-controllers, and validates our Pathways design choices.

Sharing accelerators across multiple tasks is crucial for achieving high resource utilization. Conventional resource sharing is performed in a coarse-grained manner. For example, general-purpose virtualization enables cloud applications to efficiently share multi-tenant resources with performance isolation Angel et al. (2014); Wentzlaff et al. (2010); Shahrad and Wentzlaff (2016); Baumann et al. (2009), but cloud providers dedicate accelerators to individual users. Cluster schedulers optimize for heterogeneity of ML workloads Narayanan et al. (2020) and multi-job, multi-user fairness and performance Xiao et al. (2018); Ren et al. (2015); Mahajan et al. (2020); Jeon et al. (2018), but resources are still exclusively dedicated to single jobs at long time scales (seconds or more).

Recent work shows that finer-grained sharing can improve resource efficiency further: virtualizing accelerators Yu et al. (2020); Gupta et al. (2011); Vijaykumar et al. (2016) avoids dedicating a whole accelerator to a single user. Large models Brown et al. (2020) can be limited by available accelerator memory, requiring GPU memory virtualization Rhu et al. (2016); Ausavarungnirun et al. (2018) or DRAM offload Rajbhandari et al. (2021). Concurrent (time-multiplexed or overlapping) ML task execution Gupta et al. (2018); Xiao et al. (2020); Bai et al. (2020); Yu and Chowdhury (2020); Wang et al. (2021); Lim et al. (2021) helps harvest idle resources within accelerators. These fine-grained sharing techniques demonstrate opportunities for sharing accelerators that are hard to capitalize on at scale without a single-controller system like Pathways.

Many works have shown that deviating from SPMD computations can improve efficiency on large workloads: pipelining Huang et al. (2019); Narayanan et al. (2019); Yang et al. (2021) partitions ML models into static heterogeneous computations across accelerators. Graph neural network training Jia et al. (2020), neural architecture search Pham et al. (2018), and multi-modal multi-task learning systems Ma et al. (2018); Lepikhin et al. (2020); Zhao et al. (2019) are examples of inherently heterogeneous and dynamic tasks that do not fit naturally in the SPMD model. We anticipate that upcoming large-scale efficient ML models may form a collection of shared layers and exclusive layers Bommasani and et. al. (2021), which are natural to express as MPMD.

Conclusions

Pathways matches state of the art multi-controller performance on current ML models which are single-tenant SPMD. We have ensured strict compatibility with multi-controller JAX, and as we demonstrate in §5, Pathways matches JAX’s performance across very large system scales, for all but the smallest computations.

At the same time, Pathways upends the execution model of JAX programs, pulling user code back into a single-controller model, and interposing a centralized resource management and scheduling framework between client and accelerators. The single-controller programming model allows users simple access to much richer computation patterns. The resource management and scheduling layer permits the reintroduction of cluster management policies including multi-tenant sharing, virtualization and elasticity, all tailored to the requirements of ML workloads and accelerators. Our micro-benchmarks show interleaving of concurrent client workloads, and efficient pipelined execution, convincingly demonstrating that the system mechanisms we have built are fast and flexible, and form a solid basis for research into novel policies to make use of them.

We have shown that careful system design and engineering lets us “get the best of both worlds”, matching performance on today’s ML models while delivering the features needed to write the models of tomorrow.

Acknowledgements

We gratefully acknowledge contributions to the design and implementation of the Pathways system from many colleagues at Google, and from members of the wider machine learning community. We also thank Martín Abadi, James Laudon, Martin Maas, and the anonymous MLSys reviewers for their helpful suggestions on the presentation of the work.

References

Appendix A Accelerator design considerations

Hardware acceleration is critical to modern deep learning; unfortunately, achieving high performance with accelerators is a non-trivial systems exercise. The following subsections list established techniques commonly employed in deep learning systems to achieve good performance.

Given the end of Dennard-scaling, accelerators implement hardware parallelism, often using SIMT Kirk or systolic array Jouppi et al. designs. While these hardware architectures remove the arithmetic bottleneck, memory bandwidth quickly becomes the critical resource, necessitating high-bandwidth memory (HBM), an expensive and limited-capacity memory technology. Training schemes for modern neural networks leverage batching to unlock parallelism (good for feeding parallel ALUs) and enable memory re-use (a float is read from memory once and used for multiple computations, substantially reducing a computation’s memory bandwidth needs). Nevertheless, batching is not a panacea: it puts pressure on the limited HBM memory capacity, and very large batch sizes can slow model convergence rates Shallue et al. , You et al. , Lanchantin et al. , Anil et al. . While modern GPUs support unified memory—a capability to transparently page memory between accelerators, or from HBM to the host’s DRAM—if the user is not careful, an HBM-bandwidth bound computation could slow to PCIe bandwidth, dropping accelerator utilization by an order of magnitude Lim et al. .

A.2 Asynchronous programming

Accelerator abstractions rely on an asynchronous programming model to achieve performance; a synchronous abstraction wastes too many accelerator computation resources between PCIe latency, kernel scheduling overheads, and interrupt delays. Computations are enqueued on streams to be executed on the accelerator at some point in the future. This asynchronous abstraction effectively masks dispatch latency for small operations, so long as a sufficiently large pipeline of work is maintained.

A.3 High performance interconnects

Modern deep neural networks are orders of magnitude larger than the capacity of accelerator (HBM) memory Lepikhin et al. , Huang et al. . The parallelism within these neural networks is amenable to sharding across multiple accelerators simultaneously, however high speed interconnects between accelerators then become critical for performance. GPUs use interconnects such as NVLink for high-speed communication between “islands” of accelerators on a small number of hosts Naumov et al. , and use RDMA capabilities of ethernet and Infiniband NICs (GPUDirect) to rapidly communicate between the islands. TPUs have a custom mesh network built directly into the chips, and chips can communicate directly without involving the host or the data-center network. Dedicated GPU and TPU interconnects are typically exposed to applications via 30 year old MPI primitives (e.g., AllReduce) that must be gang-scheduled so that every program enters the same primitive at the same time. As larger computations are run (e.g., training a larger neural network, or training a fixed-size neural network over more accelerators through a form of weak scaling called data-parallel scaling), faster collective operations and thus network bandwidth are required to maintain efficient utilization of aggregate cluster resources. This has prompted significant experimentation with alternate chip-network topologies including hypercubes, and 2-D and 3-D mesh tori Naumov et al. .

A.4 Single-tenancy

Unlike most resources in a computer, accelerators are not often shared by multiple programs simultaneously. Deep learning models can be easily scaled to use more memory by increasing parameter counts or batch sizes, and thus programs in practice consume most available accelerator (HBM) memory. PCIe bandwidth is much smaller than HBM- or accelerator interconnect-bandwidth. This means that fine-grained context-switching (where much of the data in HBM is paged out to host DRAM over PCIe) results in wasting a significant fraction of accelerator cycles. Thus, when a host program is not fully utilizing an accelerator, the computational resources are stranded and cannot be used productively. Further, preemption of accelerator resources is minimized in practice, resulting in sub-optimal resource scheduling in large, shared clusters serving heterogeneous workloads; it is difficult to allocate large quantities of physically proximate devices to take advantage of network locality.

A.5 Contrasting GPUs and TPUs

While there are many similarities between GPUs and TPUs, there are some important differences. GPU systems tend to have small islands of NVLink-connected devices (e.g., 8 GPUs within one host), with larger aggregations connected over infiniband or data-center networking technology. GPUs are typically programmed by dispatching many small pre-compiled “kernels” to the accelerator, and because they are pre-compiled, the kernels must support dynamic shapes. Any communication between GPUs, whether over NVLink or via DCN, is performed via the NCCL library and initiated by the host.

TPU systems have thousands of devices connected all-to-all, with hundreds of hosts per “island” (Figure 3 Middle). TPUs contain a capable “scalar core” that coordinates the TPU’s vector computation units, allowing a TPU to execute long-running functions written in XLA TensorFlow without any host interaction, and these functions may include collective communication across the dedicated ICI network. Consequently, on TPU, an ML framework typically constructs a large XLA program, which is just-in-time (JIT) compiled and dispatched to the accelerator. The fact that a single XLA computation may run for orders of magnitude longer than a GPU kernel justifies increased optimization effort by the compiler such as static buffer assignment and automatic rematerialization of intermediate program values (saving memory capacity). As a consequence of this static buffer assignment, TPUs have only limited support for dynamic shapes, making them a good fit to the Pathways concept of regular compiled functions.

TPUs are restricted to run a single program at a time, with no local pre-emption, mostly because their high-performance RDMA communication implementation between devices makes safe pre-emption difficult without distributed coordination. Because computations are not pre-emptible, it is essential to enqueue communicating computations in a consistent order across devices, or the system will deadlock. This requirement translates to the necessity for Pathways to perform centralized gang-scheduling. As noted in the main text of the paper, however, gang-scheduling is also highly advantageous for GPU efficiency. For an cluster prioritizing ML training workloads, where throughput is more important than latency, it is more efficient to dedicate an entire GPU, or a static fraction of a GPU, to a single carefully sized computation at a time, than to allow the GPU driver and hardware runtime to dynamically multiplex its computational resources across competing concurrent computations. Therefore, even though GPUs can execute concurrent programs without centralized scheduling, there is still a benefit from using a design like Pathways to make more efficient use of resources.

Appendix B Structure of a typical ML program

This subsection describes a typical contemporary ML computation in terms of the high level structure that maps sub-computations to accelerators, and the lowering of a sub-computation to accelerator kernels.

The computations that are executed by an accelerator running an ML workload are dominated by what we call “compiled functions”. These are sub-computations with the following characteristics:

Input and output types, and the shapes of any input/output tensors, are known before the input data have been computed.

Bounds of any loops are either known when the node computation is scheduled, or specified as a maximum trip count with potential early termination.

Conditionals are “functional” where both branches have the same output type, and resources are allocated in advance sufficient for either branch.

The constraints on compiled functions are mostly due to the co-evolution of ML models with hardware, discussed in detail in §A. Here we discuss some of the implications of the fact that the resource requirements of compiled functions are known in advance.

Almost all of today’s high performance ML computations are expressed as long stretches of compiled functions and only occasionally (if ever) branch based on data that is computed by a compiled function. Since the system can perform resource allocation for compiled functions in advance, contemporary ML frameworks exploit this property by enqueueing compiled functions asynchronously before their predecessors have run, allowing host-side work to be done in parallel with accelerator computations Bradbury et al. , Paszke et al. . Wherever possible the frameworks submit graphs of compiled functions to a “just in time” (JIT) compiler Chen et al. , TensorFlow that is able to exploit optimizations like layout assignment and fusion that can substantially improve the efficiency of the resulting accelerator code.

The need to optimize graphs of compiled functions to achieve peak accelerator performance means that frameworks typically trace the execution of fragments of high level (Python) code that can be lowered to compiled functions. Thus, even though client code may be written in a high level language with complex state bound to the running context at a host, performance-sensitive node computations are often lowered to an internal representation (IR) that is serializable and relatively easy to send to a remote host for execution.

Appendix C Input Data Processing

JAX has deliberately avoided re-implementing data loading pipelines, and tensorflow/datasets TensorFlow are commonly used for JAX input processing, so it is not difficult for JAX programs to be adapted to offload input processing to the CPU-based TensorFlow executors run on Pathways workers. Pathways instantiates a CPU-based TensorFlow executor on each host, so that user programs can serialize input processing into a TensorFlow graph and distribute it across the workers. We plan to support streaming data protocols so that CPU-based computation can be performed on an independently managed set of servers, thus decoupling the expensive TPU-connected hosts from the CPU resources available for input processing.

Appendix D Evaluation Workload Traces

Figure 11 presents the traces for the workload of Figure 8 with a varied number of clients submitting programs concurrently (§5.2). A single client uses a very small per-program compute time of 0.33 ms that is insufficient to saturate accelerators. With Pathways’s multi-tenency support, using multiple clients increases the device utilization to 100\sim{}100%. All client programs are gang-scheduled across all cores, and interleaved at a millisecond scale or less, showing little context-switch overhead.

Figure 12 shows a trace profile for multiple training steps when the 64B Decoder only Transformer model is trained data parallel over two islands of accelerators with 512 chips each (§5.3). The first eight rows (blue) correspond to TPU computations on a host in the first island and the next eight rows (green) correspond to TPU computations on a host in the second island. In this case, each island computes gradients and then enqueues the transfers of these gradients to the other island. When the transfer of gradients is over DCN completes, each island applies the received gradients and starts the next training step. DCN transfers incur minimal overhead even at the scale of pairs of 128 hosts resulting in 97.2% training throughput compared to an SPMD configuration that uses ICI communication over total equivalent number of chips.