Federated Optimization: Distributed Machine Learning for On-Device Intelligence

Jakub Konečný, H. Brendan McMahan, Daniel Ramage, Peter Richtárik

Introduction

Mobile phones and tablets are now the primary computing devices for many people. In many cases, these devices are rarely separated from their owners , and the combination of rich user interactions and powerful sensors means they have access to an unprecedented amount of data, much of it private in nature. Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications, but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location.

We advocate an alternative — federated learning — that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally computed updates via a central coordinating server. This is a direct application of the principle of focused collection or data minimization proposed by the 2012 White House report on the privacy of consumer data . Since these updates are specific to improving the current model, they can be purely ephemeral — there is no reason to store them on the server once they have been applied. Further, they will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less. A principal advantage of this approach is the decoupling of model training from the need for direct access to the raw training data. Clearly, some trust of the server coordinating the training is still required, and depending on the details of the model and algorithm, the updates may still contain private information. However, for applications where the training objective can be specified on the basis of data available on each client, federated learning can significantly reduce privacy and security risks by limiting the attack surface to only the device, rather than the device and the cloud.

If additional privacy is needed, randomization techniques from differential privacy can be used. The centralized algorithm could be modified to produce a differentially private model , which allows the model to be released while protecting the privacy of the individuals contributing updates to the training process. If protection from even a malicious (or compromised) coordinating server is needed, techniques from local differential privacy can be applied to privatize the individual updates . Details of this are beyond the scope of the current work, but it is a promising direction for future research.

A more complete discussion of applications of federated learning as well as privacy ramifications can be found in . Our focus in this work will be on federated optimization, the optimization problem that must be solved in order to make federated learning a practical alternative to current approaches.

The optimization community has seen an explosion of interest in solving problems with finite-sum structure in recent years. In general, the objective is formulated as

The main source of motivation are problems arising in machine learning. The problem structure (1) covers linear or logistic regressions, support vector machines, but also more complicated models such as conditional random fields or neural networks.

logistic regression: $f_{i}(w)=-\log(1+\exp(-y_{i}x_{i}^{T}w))$ , $y_{i}\in\{-1,1\}$

support vector machines: $f_{i}(w)=\max\{0,1-y_{i}x_{i}^{T}w\}$ , $y_{i}\in\{-1,1\}$

More complicated non-convex problems arise in the context of neural networks, where rather than via the linear-in-the-features mapping $x_{i}^{T}w$ , the network makes prediction through a non-convex function of the feature vector $x_{i}$ . However, the resulting loss can still be written as $f_{i}(w)$ , and gradients can be computed efficiently using backpropagation.

The amount of data that businesses, governments and academic projects collect is rapidly increasing. Consequently, solving problem (1) arising in practice is often impossible on a single node, as merely storing the whole dataset on a single node becomes infeasible. This necessitates the use of a distributed computational framework, in which the training data describing the problem is stored in a distributed fashion across a number of interconnected nodes and the optimization problem is solved collectively by the cluster of nodes.

Loosely speaking, one can use any network of nodes to simulate a single powerful node, on which one can run any algorithm. The practical issue is that the time it takes to communicate between a processor and memory on the same node is normally many orders of magnitude smaller than the time needed for two nodes to communicate; similar conclusions hold for the energy required . Further, in order to take advantage of parallel computing power on each node, it is necessary to subdivide the problem into subproblems suitable for independent/parallel computation.

State-of-the-art optimization algorithms are typically inherently sequential. Moreover, they usually rely on performing a large number of very fast iterations. The problem stems from the fact that if one needs to perform a round of communication after each iteration, practical performance drops down dramatically, as the round of communication is much more time-consuming than a single iteration of the algorithm.

These considerations have lead to the development of novel algorithms specialized for distributed optimization (we defer thorough review until Section 2). For now, we note that most of the results in literature work in the setting where the data is evenly distributed, and further suppose that $K\ll n/K$ where $K$ is the number of nodes. This is indeed often close to reality when data is stored in a large data center. Additionally, an important subfield of the field of distributed learning relies on the assumption that each machine has a representative sample of the data available locally. That is, it is assumed that each machine has an IID sample from the underlying distribution. However, this assumption is often too strong; in fact, even in the data center paradigm this is often not the case since the data on a single node can be close to each other on a temporal scale, or clustered by its geographical origin. Since the patterns in the data can change over time, a feature might be present frequently on one node, while not appear on another at all.

The federated optimization setting describes a novel optimization scenario where none of the above assumptions hold. We outline this setting in more detail in the following section.

2 The Setting of Federated Optimization

The main purpose of this paper is to bring to the attention of the machine learning and optimization communities a new and increasingly practically relevant setting for distributed optimization, where none of the typical assumptions are satisfied, and communication efficiency is of utmost importance. In particular, algorithms for federated optimization must handle training data with the following characteristics:

Massively Distributed: Data points are stored across a large number of nodes $K$ . In particular, the number of nodes can be much bigger than the average number of training examples stored on a given node ( $n/K$ ).

Non-IID: Data on each node may be drawn from a different distribution; that is, the data points available locally are far from being a representative sample of the overall distribution.

Unbalanced: Different nodes may vary by orders of magnitude in the number of training examples they hold.

In this work, we are particularly concerned with sparse data, where some features occur on a small subset of nodes or data points only. Although this is not necessary characteristic of the setting of federated optimization, we will show that the sparsity structure can be used to develop an effective algorithm for federated optimization. Note that data arising in the largest machine learning problems being solved nowadays, ad click-through rate predictions, are extremely sparse.

We are particularly interested in the setting where training data lives on users’ mobile devices (phones and tablets), and the data may be privacy sensitive. The data $\{x_{i},y_{i}\}$ is generated through device usage, e.g., via interaction with apps. Examples include predicting the next word a user will type (language modelling for smarter keyboard apps), predicting which photos a user is most likely to share, or predicting which notifications are most important.

To train such models using traditional distributed learning algorithms, one would collect the training examples in a centralized location (data center) where it could be shuffled and distributed evenly over proprietary compute nodes. In this paper we propose and study an alternative model: the training examples are not sent to a centralized location, potentially saving significant network bandwidth and providing additional privacy protection. In exchange, users allow some use of their devices’ computing power, which shall be used to train the model.

Communication constraints arise naturally in the massively distributed setting, as network connectivity may be limited (e.g., we may wish to deffer all communication until the mobile device is charging and connected to a wi-fi network). Thus, in realistic scenarios we may be limited to only a single round of communication per day. This implies that, within reasonable bounds, we have access to essentially unlimited local computational power. Consequently, the practical objective is solely to minimize the number of communication rounds.

The main purpose of this work is initiate research into, and design a first practical implementation of federated optimization. Our results suggest that with suitable optimization algorithms, very little is lost by not having an IID sample of the data available, and that even in the presence of a large number of nodes, we can still achieve convergence in relatively few rounds of communication.

Related Work

In this section we provide a detailed overview of the relevant literature. We particularly focus on algorithms that can be used to solve problem (1) in various contexts. First, in Sections 2.1 and 2.2 we look at algorithms designed to be run on a single computer. In Section 2.3 we follow with a discussion of the distributed setting, where no single node has direct access to all data describing $f$ . We describe a paradigm for measuring the efficiency of distributed methods, followed by overview of existing methods and commentary on whether they were designed with communication efficiency in mind or not.

In this section we shall describe several fundamental baseline algorithms which can be used to solve problems of the form (1).

A trivial benchmark for solving problems of structure (1) is Gradient Descent (GD) in the case when functions $f_{i}$ are smooth (or Subgradient Descent for non-smooth functions) . The GD algorithm performs the iteration

where $h_{t}>0$ is a stepsize parameter. As we mentioned earlier, the number of functions, or equivalently, the number of training data pairs, $n$ , is typically very large. This makes GD impractical, as it needs to process the whole dataset in order to evaluate a single gradient and update the model.

Gradient descent can be substantially accelerated, in theory and practice, via the addition of a momentum term. Acceleration ideas for gradient methods in convex optimization can be traced back to the work of Polyak and Nesterov . While accelerated GD methods have a substantially better convergence rate, in each iteration they still need to do at least one pass over all data. As a result, they are not practical for problems where $n$ very large.

At present a basic, albeit in practice extremely popular, alternative to GD is Stochastic Gradient Descent (SGD), dating back to the seminal work of Robbins and Monro . In the context of (1), SGD samples a random function (i.e., a random data-label pair) $i_{t}\in\{1,2,\dots,n\}$ in iteration $t$ , and performs the update

One common trick that has been practically observed to provide superior performance, is to replace random sampling in each iteration by going through all the functions in a random order. This ordering is replaced by another random order after each such cycle . Theoretical understanding of this phenomenon had been a long standing open problem, understood recently in .

The core differences between GD and SGD can be summarized as follows. GD has a fast convergence rate, but each iteration in the context of (1) is potentially very slow, as it needs to process the entire dataset in each iteration. On the other hand, SGD has slower convergence rate, but each iteration is fast, as the work needed is independent of number of data points $n$ . For the problem structure of (1), SGD is usually better, as for practical purposes relatively low accuracy is required, which SGD can in extreme cases achieve after single pass through data, while GD would make just a single update. However, if a high accuracy was needed, GD or its faster variants would prevail.

2 A Novel Breed of Randomized Algorithms

Recent years have seen an explosion of new randomized methods which, in a first approximation, combine the benefits of cheap iterations of SGD with fast convergence of GD. Most of these methods can be said to belong to one of two classes — dual methods of the randomized coordinate descent variety, and primal methods of the stochastic gradient descent with variance reduction variety.

Although the idea of coordinate descent has been around for several decades in various contexts (and for quadratic functions dates back even much further, to works on the Gauss-Seidel methods), it came to prominence in machine learning and optimization with the work of Nesterov which equipped the method with a randomization strategy. Nesterov’s work on Randomized Coordinate Descent (RCD) popularized the method and demonstrated that randomization can be very useful for problems of structure (1).

The RCD algorithm in each iteration chooses a random coordinate $j_{t}\in\{1,\dots,d\}$ and performs the update

Numerous follow-up works extended the concept to proximal setting , single processor parallelism and develop efficiently implementable acceleration . All of these three properties were connected in a single algorithm in , to which we refer the reader for a review of the early developments in the area of RCD, particularly to overview in Table 1 therein.

When an explicit strongly convex, but not necessarily smooth, regularizer is added to the average loss (1), it is possible to write down its (Fenchel) dual and the dual variables live in $n$ -dimensional space. Applying RCD leads to an algorithm for solving (1) known under the name Stochastic Dual Coordinate Ascent . This method has gained broad popularity with practicioners, likely due to the fact that for a number of loss functions, the method comes without the need to tune any hyper-parameters. The work was first to show that by applying RCD to the dual problem, one also solves the primal problem (1). For a theoretical and computational comparison of applying RCD to the primal versus the dual problems, see .

A directly primal-dual randomized coordinate descent method called Quartz, was developed in . It has been recently shown in SDNA that incorporating curvature information contained in random low dimensional subspaces spanned by a few coordinates can sometimes lead to dramatic speedups. Recent works interpret the SDCA method in primal-only setting, shedding light onto why this method works as a SGD method with a version of variance reduction property.

We now move the the second class of novel randomized algorithms which can be generally interpreted as variants of SGD, with an attempt to reduce variance inherent in the process of gradient estimation.

The first notable algorithm from this class is the Stochastic Average Gradient (SAG) . The SAG algorithm stores an average of $n$ gradients of functions $f_{i}$ evalueated at different points in the history of the algorithm. In each iteration, the algotithm, updates randomly chosen gradient out of this average, and makes a step in the direction of the average. This way, complexity of each iteration is independent of $n$ , and the algorithm enjoys a fast convergence. The drawback of this algorithm is that it needs to store $n$ gradients in memory because of the update operation. In the case of generalized linear models, this memory requirement can be reduced to the need of $n$ scalars, as the gradient is a scalar multiple of the data point. This methods has been recently extended for use in Conditional Random Fields . Nevertheless, the memory requirement makes the algorithm infeasible for application even in relatively small neural networks.

A followup algorithm SAGA and its simplification , modifies the SAG algorithm to achieve unbiased estimate of the gradients. The memory requirement is still present, but the method significantly simplifies theoretical analysis, and yields a slightly stronger convergence guarantee.

Another algorithm from the SGD class of methods is Stochastic Variance Reduced GradientThe same algorithm was simultaneously introduced as Semi-Stochastic Gradient Descent (S2GD) . Since the former work gained more attention, we will for clarity use the name SVRG throughout this paper. (SVRG) and . The SVRG algorithm runs in two nested loops. In the outer loop, it computes full gradient of the whole function, $\nabla f(w^{t})$ , the expensive operation one tries to avoid in general. In the inner loop, the update step is iteratively computed as

The core idea is that the stochastic gradients are used to estimate the change of the gradient between point $w^{t}$ and $w$ , as opposed to estimating the gradient directly. We return to more detailed description of this algorithm in Section 3.2.

The SVRG has the advantage that it does not have the additional memory requirements of SAG/SAGA, but it needs to process the whole dataset every now and then. Indeed, comparing to SGD, which typically makes significant progress in the first pass through data, SVRG does not make any update whatsoever, as it needs to compute the full gradient. This and several other practical issues have been recently addressed in , making the algorithm competitive with SGD early on, and superior in later iterations. Although there is nothing that prevents one from applying SVRG and its variants in deep learning, we are not aware of any systematic assessment of its performance in this setting. Vanilla experiments in suggest that SVRG matches basic SGD, and even outperforms in the sense that variance of the iterates seems to be significantly smaller for SVRG. However, in order to draw any meaningful conclusions, one would need to perform extensive experiments and compare with state-of-the-art methods usually equipped with numerous heuristics.

There already exist attempts at combining SVRG type algorithms with randomized coordinate descent . Although these works highlight some interesting theoretical properties, the algorithms do not seem to be practical at the moment; more work is needed in this area. The first attempt to unify algorithms such as SVRG and SAG/SAGA already appeared in the SAGA paper , where the authors interpret SAGA as a midpoint between SAG and SVRG. Recent work presents a general algorithm, which recovers SVRG, SAGA, SAG and GD as special cases, and obtains an asynchronous variant of these algorithms as a byproduct of the formulation. SVRG can be equipped with momentum (and negative momentum), leading to a new accelerated SVRG method known as Katyusha . SVRG can be further accelerated via a raw clustering mechanism .

A third class of new algorithms are the Stochastic quasi-Newton methods . These algorithms in general try to mimic the limited memory BFGS method (L-BFGS) , but model the local curvature information using inexact gradients — coming from the SGD procedure. A recent attempt at combining these methods with SVRG can be found in . In , the authors utilize recent progress in the area of stochastic matrix inversion revealing new connections with quasi-Newton methods, and devise a new stochastic limited memory BFGS method working in tandem with SVRG. The fact that the theoretical understanding of this branch of research is the least understood and having several details making the implementation more difficult compared to the methods above may limit its wider use. However, this approach could be most promising for deep learning once understood better.

One important aspect of machine learning is that the Empirical Risk Minimization problem (1) we are solving is just a proxy for the Expected Risk we are ultimately interested in. When one can find exact minimum of the empirical risk, everything reduces to balancing approximation–estimation tradeoff that is the object of abundant literature — see for instance . An assessment of asymptotic performance of some optimization algorithms as learning algorithms in large-scale learning problemsSee [13, Section 2.3] for their definition of large scale learning problem. has been introduced in . Recent extension in has shown that the variance reduced algorithms (SAG, SVRG, …) can in certain setting be better learning algorithms than SGD, not just better optimization algorithms.

A general method, referred to as Universal Catalyst , effectively enables conversion of a number of the algorithms mentioned in the previous sections to their ‘accelerated’ variants. The resulting convergence guarantees nearly match lower bounds in a number of cases. However, the need to tune additional parameter makes the method rather impractical.

Recently, lower and upper bounds for complexity of stochastic methods on problems of the form (1) were recently obtained in .

3 Distributed Setting

In this section we review the literature concerning algorithms for solving (1) in the distributed setting. When we speak about distributed setting, we refer to the case when the data describing the functions $f_{i}$ are not stored on any single storage device. This can include setting where one’s data just don’t fit into a single RAM/computer/node, but two is enough. This also covers the case where data are distributed across several datacenters around the world, and across many nodes in those datacenters. The point is that in the system, there is no single processing unit that would have direct access to all the data. Thus, the distributed setting does not include single processor parallelismIt should be noted that some of the works presented in this section were originally presented as parallel algorithms. We include them anyway as many of the general ideas in carry over to the distributed setting.. Compared with local computation on any single node, the cost of communication between nodes is much higher both in terms of speed and energy consumption , introducing new computational challenges, not only for optimization procedures.

We first reveiew a theoretical decision rule for determining the practically best algorithm for a given problem in Section 2.3.1, followed by overview of distributed algorithms in Section 2.3.2, and communication efficient algorithms in Section 2.3.3. The following paradigm highlights why the class of communication efficient algorithms are not only preferable choice in the trivial sense. The communication efficient algorithms provide us with much more flexible tools for designing overall optimization procedure, which can make the algorithms inherently adaptive to differences in computing resources and architectures.

This section reviews a paradigm for comparing efficency of distributed algorithms. Let us suppose we have many algorithms $\mathcal{A}$ readily available to solve the problem (1). The question is: “How do we decide which algorithm is the best for our purpose?” Initial version of this reasoning already appeared in , and applies also to .

First, consider the basic setting on a single machine. Let us define $\mathcal{I}_{\mathcal{A}}(\epsilon)$ as the number of iterations algorithm $\mathcal{A}$ needs to converge to some fixed $\epsilon$ accuracy. Let $\mathcal{T}_{\mathcal{A}}$ be the time needed for a single iteration. Then, in practice, the best algorithm is one that minimizes the following quantity.Considering only algorithms that can be run on a given machine.

The number of iterations $\mathcal{I}_{\mathcal{A}}(\epsilon)$ is usually given by theoretical guarantees or observed from experience. The $\mathcal{T}_{\mathcal{A}}$ can be empirically observed, or one can have idea of how the time needed per iteration varies between different algorithms in question. The main point of this simplified setting is to highlight key issue with extending algorithms to the distributed setting.

The communication cost $c$ does not only consist of actual exchange of the data, but also many other things like setting up and closing a connection between nodes. Consequently, even if we need to communicate very small amount of information, $c$ always remains above a nontrivial threshold.

Most, if not all, of the current state-of-the-art algorithms that are the best in setting of (2), are stochastic and rely on doing very large number (big $\mathcal{I}_{\mathcal{A}}(\epsilon)$ ) of very fast (small $\mathcal{T}_{\mathcal{A}}$ ) iterations. As a result, even relatively small $c$ can cause the practical performance of those algorithms drop down dramatically, because $c\gg\mathcal{T}_{\mathcal{A}}$ .

This has been indeed observed in practice, and motivated development of new methods, designed with this fact in mind from scratch, which we review in Section 2.3.2. Although this is a good development for academia — motivation to explore new setting, it is not necessarily a good news for the industry.

Many companies have spent significant resources to build excellent algorithms to tackle their problems of form (1), fine tuned to the specific patterns arising in their data and side applications required. When the data companies collect grows too large to be processed on a single machine, it is understandable that they would be reluctant to throw away their fine tuned algorithms. This issue was first time explicitly addressed in CoCoA , which is rather framework than a algorithm, which works as follows (more detailed description follows in Section 2.3.3).

The CoCoA framework formulates a general way to form a specific subproblem on each node, based on data available locally and a single shared vector that needs to be distributed to all nodes. Within a iteration of the framework, each node uses any optimization algorithm $\mathcal{A}$ , to reach a relative $\Theta$ accuracy on the local subproblem. Updates from all nodes are then aggregated to form an update to the global model.

The efficiency paradigm changes as follows:

The number of iterations $\mathcal{I}(\epsilon,\Theta)$ is independent of choice of the algorithm $\mathcal{A}$ used as a local solver, because there is theory predicting how many iterations of the CoCoA framework are needed to achieve $\epsilon$ accuracy, if we solve the local subproblems to relative $\Theta$ accuracy. Here, $\Theta=0$ would mean we require the subproblem to be solved to optimality, and $\Theta=1$ that we don’t need any progress whatsoever. The general upper bound on number of iterations of the CoCoA framework is $\mathcal{I}(\epsilon,\Theta)=\frac{\mathcal{O}(\log(1/\epsilon))}{1-\Theta}$ for strongly convex objectives. From the inverse dependence on $1-\Theta$ , we can see that there is a fundamental limit to the number of communication rounds needed. Hence, it will probably not be efficient to spend excessive resources to attain very high local accuracy (small $\Theta$ ). Time per iteration $\mathcal{T}_{\mathcal{A}}(\Theta)$ denotes the time algorithm $\mathcal{A}$ needs to reach the relative $\Theta$ accuracy on the local subproblem.

This efficiency paradigm is more powerful for a number of reasons.

It allows practicioners to continue using their fine-tuned solvers, that can run only on single machine, instead of having to implement completely new algorithms from scratch.

The actual performance in terms of number of rounds of communication is independent from the choice of optimization algorithm, making it much easier to optimize the overall performance.

Since the constant $c$ is architecture dependent, running optimal algorithm on one node network does not have to be optimal on another. In the setting (3), this could mean moving from one cluster to another, a completely different algorithm is optimal, which is a major change. In the setting (4), this can be improved by simply changing $\Theta$ , which is typically implicitly determined by number of iterations algorithm $\mathcal{A}$ runs for.

In this work we propose a different way to formulate the local subproblems, which does not rely on duality as in the case of CoCoA. We also highlight that some algorithms seem to be particularly suitable to solve those local subproblems, effectively leading to novel algorithms for distributed optimization.

3.2 Distributed Algorithms

As discussed below in Section 2.3.1, this setting creates unique challenges. Distributed optimization algorithms typically require a small number (1–4) of communication rounds per iteration. By communication round we typically understand a single MapReduce operation , implemented efficiently for iterative procedures , such as optimization algorithms. Spark has been established as a popular open source framework for implementing distributed iterative algorithms, and includes several of the algorithms mentioned in this section.

Optimization in distributed setting has been studied for decades, tracing back to at least works of Bertsekas and Tsitsiklis . Recent decade has seen an explosion of interest in this area, greatly motivated by rapid increase of data availability in machine learning applications.

Much of the recent effort was focused on creating new optimization algorithms, by building variants of popular algorithms suitable for running on a single processor (See Section 2.1). A relatively common feature of many of these efforts is a) The computation overhead in the case of synchronous algorithms, and b) The difficulty of analysing asynchronous algorithms without restrictive assumptions. By computation overhead we mean that if optimization program runs in a compute-communicate-update cycle, the update part cannot start until all nodes finish their computation. This causes some of the nodes be idle, while remaining nodes finish their part of computation, clearly an inefficient use of computational resources. This pattern often diminishes or completely reverts potential speed-ups from distributed computation. In the asynchronous setting in general, an update can be applied to a parameter vector, followed by computation done based on a now-outdated version of that parameter vector. Formally grasping this pattern, while keeping the setting realistic is often quite challenging. Consequently, this is very open area, and optimal choice of algorithm in any particular case is often heavily dependent on the problem size, details in its structure, computing architecture available, and above all, expertise of the practitioner.

This general issue is best exhibited with numerous attempts at parallelizing the Stochastic Gradient Descent and its variants. As an example, provide theoretically linear speedup with number of nodes, but are difficult to implement efficiently, as the nodes need to synchronize frequently in order to compute reasonable gradient averages. As an alternative, no synchronization between workers is assumed in . Consequently, each worker reads $w^{t}$ from memory, parameter vector $w$ at time point $t$ , computes a stochastic gradient $\nabla f_{i}(w^{t})$ and applies it to already changed state of the parameter vector $w^{t+\tau}$ . The above mentioned methods assume that the delay $\tau$ is bounded by a constant, which is not necessarily realistic assumptionA bound on the delay $\tau$ can be deterministic or probabilistic. However, in practice, the delays are mostly about the number of nodes in the network, and there rare very long delays, when a variety of operating system-related events can temporarily postpone computation of a single node. To the best of our knowledge, no formal assumptions reflect this setting well. In fact, two recent works highlight subtle but important issue with labelling of iterates in the presence of asynchrony, rendering most of the existing analyses of asynchronous optimization algorithms incorrect.. Some of the works also introduce assumptions on the sparsity structures or conditioning of the Hessian of $f$ . Asymptotically optimal convergent rates were proven in with considerably milder assumptions. Improved analysis of asynchronous SGD was also presented in , simultaneously with a version that uses lower-precision arithmetic was introduced without sacrificing performance, which is a trend that might find use in other parts of machine learning in the following years.

The negative effect of asynchronous distributed implementations of SGD seem to be negligible, when applied to the task of training very large deep networks — which is the ultimate industrial application of today. The practical usefulness has been demonstrated for instance by Google’s Downpour SGD and Microsoft’s Project Adam .

The first distributed versions of Coordinate Descent algorithms were the Hydra and its accelerated variant, Hydra2, , which has been demonstrated to be very efficient on large sparse problems implemented on a computing cluster. An extended version with description of implementation details is presented in . Effect of asynchrony has been explored and partially theoretically understood in the works of . Another asynchronous, rather framework than an algorithm, for coordinate updates, applicable to wider class of objectives is presented in .

The data are assumed to be partitioned to nodes by features/coordinates in the above algorithms. This setting can be restrictive if one is not able to distribute the data beforehand, but instead the data are distributed “as is” — in which case the data are most commonly distributed by data points. This does not need to be an issue, if a dual version of coordinate descent is used — in which the distribution is done by data points followed by works on Communication Efficient Dual Cooridante Ascent, described in next section. The use of duality however requires usage of additional explicit strongly convex regularization term, hence can be used to solve smaller class of problems. Despite the apparent practical disadvantages, variants of distributed coordinate descent algorithms are among the most widely used methods in practice.

Moving to variance reduced methods, distributed versions of SAG/SAGA algorithms have not been proposed yet. However, several distributed versions of the SVRG algorithm already exist. A scheme for replicating data to simulate iid sampling in distributed environment was proposed in . Although the performance is well analysed, the setting requires significantly stronger control of data distribution which is rarely practicaly feasible. A relatively similar method to Algorithm 3 presented here has been proposed in , which was analysed, and in , a largely experimental work that can be also cast as communication efficient — described in detail in Section 2.3.3.

Another class of algorithms relevant for this work is Alternating Direction Method of Multipliers (ADMM) . These algorithms are in general applicable to much broader class of problems, and hasn’t been observed to perform better than other algorithms presented in this section, in the machine learning setting of (1).

3.3 Communication-Efficient Algorithms

In this Section, we describe algorithms that can be cast as “communication efficient”. The common theme of the algorithms presented here, is that in order to perform better in the sense of (3), one should design algorithms with high $\mathcal{T}_{\mathcal{A}}$ , in order to make the cost of communcation $c$ negligible.

Before moving onto specific methods, it is worth the noting some of the core limits concerning the problem we are trying to solve in distributed setting. Fundamental limitations of stochastic versions of the problem (1) in terms of runtime, communication costs and number of samples used are studied in . Efficient algorithms and lower bounds for distributed statistical estimation are established in .

However, these works do not fit into our framework, because they assume that each node has access to data generated IID from a single distribution. In the case of also $K\ll n/K$ , that the number of nodes $K$ is much smaller than the number of data point on each node is also assumed. As we stress in the Introduction, these assumptions are far from being satisfied in our setting. Intuitively, relaxing these assumptions should make the problem harder. However, it is not as straightforward to conclude this, as there are certainly particular non-iid data distributions that simplify the problem — for instance if data are distributed according to separability structure of the objective. Lower bounds on communication complexity of distributed convex optimization of (1) are presented in , concluding that for IID data distributions, existing algorithms already achieve optimal complexity in specific settings.

Probably first, rather extreme, work proposed to parallelize SGD in a single round of communication. Each node simply runs SGD on the data available locally, and their outputs are averaged to form a final result. This approach is however not very robust to differences in data distributions available locally, and it has been shown [91, Appendix A] that in general it cannot perform better than using output of a single machine, ignoring all the other data.

Shamir et al. proposed the DANE algorithm, Distributed Approximate Newton , to exactly solve a general subproblem available locally, before averaging their solutions. The method relies on similarity of Hessians of local objectives, representing their iterations as an average of inexact Newton steps. We describe the algorithm in greater detail in Section 3.4 as our proposed work builds on it. A quite similar approach was proposed in , with richer class class of subproblems that can be formulated locally, and solved approximately. An analysis of inexact version of DANE and its accelerated variant, AIDE, appeared recently in . Inexact DANE is closely related to the algorithms presented in this paper. We, however, continue in different direction shaped by the setting of federated optimization.

The DiSCO algorithm of Zhang and Xiao is based on inexact damped Newton method. The core idea is that the inexact Newton steps are computed by distributed preconditioned conjugate gradient, which can be very fast, if the data are distributed in an IID fashion, enabling a good preconditioner to be computed locally. The theoretical upper bound on number of rounds of communication improves upon DANE and other methods, and in certain settings matches the lower bound presented in . The DiSCO algorithm is related to , a distributed truncated Newton method. Although it was reported to perform well in practice, the total number of conjugate gradient iterations may still be high to be considered a communication efficient algorithm.

Common to the above algorithms is the assumption that each node has access to data points sampled IID from the same distribution. This assumption is not required only in theory, but can cause the algorithms to converge significantly slower or even diverge (as reported for instance in [91, Table 3]). Thus, these algorithms, at least in their default form, are not suitable for the setting of Federated Optimization presented here.

An algorithm that bypasses the need for IID data assumption is CoCoA, which provably converges under any distribution of the data, while the convergence rate does depend on properties of the data distribution. The first version of the algorithm was proposed as DisDCA in , without convergence guarantees. First analysis was introduced in , with further improvements in , and a more general version in . Recently, its variant for L1-regularized objectives was introduced in .

The CoCoA framework formulates general local subproblems based on the dual form of (1) (See for instance [57, Eq. (2)]). Data points are distributed to nodes, along with corresponding dual variables. Arbitrary optimization algorithm is used to attain a relative $\Theta$ accuracy on the local subproblem — by changing only local dual variables. These updates have their corresponding updates to primal variable $w$ , which are synchronously aggregated (could be averaging, adding up, or anything in between; depending on the local subproblem formulation).

From the description in this section it appears that the CoCoA framework is the only usable tool for the setting of Federated Optimization. However, the theoretical bound on number of rounds of communications for ill-conditioned problems scales with the number of nodes $K$ . Indeed, as we will show in Section 4 on real data, CoCoA framework does converge very slowly.

Algorithms for Federated Optimization

In this section we introduce the first algorithm that was designed with the unique challenges of federated optimization in mind. Before proceeding with the explanation, we first revisit two important and at first sight unrelated algorithms. The connection between these algorithms helped to motivate our research. Namely, the algorithms are the Stochastic Variance Reduced Gradient (SVRG) , a stochastic method with explicit variance reduction, and the Distributed Approximate Newton (DANE) for distributed optimization.

The descriptions are followed by their connection, giving rise to a new distributed optimization algorithm, at first sight almost identical to the SVRG algorithm, which we call Federated SVRG (FSVRG).

Although this algorithm seems to work well in practice in simple circumstances, its performance is still unsatisfactory in the general setting we specify in Section 3.3. We proceed by making the FSVRG algorithm adaptive to different local data sizes, general sparsity patterns and significant differences in patterns in data available locally, and those present in the entire data set.

It is a useful thought experiment to consider the properties one would hope to find in an algorithm for the non-IID, unbalanced, and massively-distributed setting we consider. In particular:

If the algorithm is initialized to the optimal solution, it stays there.

If all the data is on a single node, the algorithm should converge in $\mathcal{O}(1)$ rounds of communication.

If each feature occurs on a single node, so the problems are fully decomposable (each machine is essentially learning a disjoint block of parameters), then the algorithm should converge in $\mathcal{O}(1)$ rounds of communicationThis is valid only for generalized linear models..

If each node contains an identical dataset, then the algorithm should converge in $\mathcal{O}(1)$ rounds of communication.

For convex problems, “converges” has the usual technical meaning of finding a solution sufficiently close to the global minimum, but these properties also make sense for non-convex problems where “converge” can be read as “finds a solution of sufficient quality”. In these statements, $\mathcal{O}(1)$ round is ideally exactly one round of communication.

Property A is valuable in any optimization setting. Properties B and C are extreme cases of the federated optimization setting (non-IID, unbalanced, and sparse), whereas D is an extreme case of the classic distributed optimization setting (large amounts of IID data per machine). Thus, D is the least important property for algorithms in the federated optimization setting.

2 SVRG

The SVRG algorithm is a stochastic method designed to solve problem (1) on a single node. We present it as Algorithm 1 in a slightly simplified form.

The algorithm runs in two nested loops. In the outer loop, it computes gradient of the entire function $f$ (Line 3). This constitutes for a full pass through data — in general expensive operation one tries to avoid unless necessary. This is followed by an inner loop, where $m$ fast stochastic updates are performed. In practice, $m$ is typically set to be a small multiple (1–5) of $n$ . Although the theoretically optimal choice for $m$ is a small multiple of a condition number [47, Theorem 6], this is often of the same order as $n$ in practice.

The central idea of the algorithm is to avoid using the stochastic gradients to estimate the entire gradient $\nabla f(w)$ directly. Instead, in the stochastic update in Line 7, the algorithm evaluates two stochastic gradients, $\nabla f_{i}(w)$ and $\nabla f_{i}(w^{t})$ . These gradients are used to estimate the change of the gradient of the entire function between points $w^{t}$ and $w$ , namely $\nabla f(w)-\nabla f(w^{t})$ . Using this estimate together with $\nabla f(w^{t})$ pre-computed in the outer loop, yields an unbiased estimate of $\nabla f(w)$ .

Apart from being an unbiased estimate, it could be intuitively clear that if $w$ and $w^{t}$ are close to each other, the variance of the estimate $\nabla f_{i}(w)-\nabla f_{i}(w^{t})$ should be small, resulting in estimate of $\nabla f(w)$ with small variance. As the inner iterate $w$ goes further, variance grows, and the algorithm starts a new outer loop to compute new full gradient $\nabla f(w^{t+1})$ and reset the variance.

The performance is well understood in theory. For $\lambda$ -strongly convex $f$ and $L$ -smooth functions $f_{i}$ , convergence results are in the form

where $w^{*}$ is the optimal solution, and $c=\Theta\left(\frac{1}{mh}\right)+\Theta(h)$ .See [47, Theorem 4] and [43, Theorem 1] for details.

It is possible to show [47, Theorem 6] that for appropriate choice of parameters $m$ and $h$ , the convergence rate (5) translates to the need of

3 Distributed Problem Formulation

In this section, we introduce notation and specify the structure of the distributed version of the problem we consider (1), focusing on the case where the $f_{i}$ are convex. We assume the data $\{x_{i},y_{i}\}_{i=1}^{n}$ , describing functions $f_{i}$ are stored across a large number of nodes.

Let $K$ be the number of nodes. Let $\mathcal{P}_{k}$ for $k\in\{1,\dots,K\}$ denote a partition of data point indices $\{1,\dots,n\}$ , so $\mathcal{P}_{k}$ is the set stored on node $k$ , and define $n_{k}=|\mathcal{P}_{k}|$ . That is, we assume that $\mathcal{P}_{k}\cap\mathcal{P}_{l}=\emptyset$ whenever $k\neq l$ , and $\sum_{k=1}^{K}n_{k}=n$ . We then define local empirical loss as

which is the local objective based on the data stored on machine $k$ . We can then rephrase the objective (1) as

The way to interpret this structure is to see the empirical loss $f(w)=\frac{1}{n}\sum_{i=1}^{n}f_{i}(w)$ as a convex combination of the local empirical losses $F_{k}(w)$ , available locally to node $k$ . Problem (1) then takes the simplified form

4 DANE

In this section, we introduce a general reasoning providing stronger intuitive support for the DANE algorithm , which we describe in detail below. We will follow up on this reasoning in Appendix A and draw a connection between two existing methods that was not known in the literature.

If we wanted to design a distributed algorithm for solving the above problem (8), where node $k$ contains the data describing function $F_{k}$ . The first, and as we shall see, a rather naive idea is to ask each node to minimize their local functions, and average the results (a variant of this idea appeared in ):

One remedy to the above issue is to modify the local problems before each aggregation step. One of the simplest strategies would be to perturb the local function $F_{k}$ in iteration $t$ by a quadratic term of the form: $-(a_{k}^{t})^{T}w+\tfrac{\mu}{2}\|w-w^{t}\|^{2}$ and to ask each node to solve the perturbed problem instead. With this change, the improved method then takes the form

The idea behind iterations of this form is the following. We would like each node $k\in[K]$ to use as much curvature information stored in $F_{k}$ as possible. By keeping the function $F_{k}$ in the subproblem in its entirety, we are keeping the curvature information nearly intact — the Hessian of the subproblem is $\nabla^{2}F_{k}+\mu I$ , and we can even choose $\mu=0$ .

As described, the method is not yet well defined, since we have not described how the vectors $a_{k}^{t}$ would change from iteration to iteration, and how one should choose $\mu$ . In order to get some insight into how such a method might work, let us examine the optimality conditions. Asymptotically as $t\to\infty$ , we would like $a_{k}^{t}$ to be such that the minimum of each subproblem is equal to $w^{*}$ ; the minimizer of (8). Hence, we would wish for $w^{*}$ to be the solution of

Hence, in the limit, we would ideally like to choose $a_{k}^{t}=\nabla F_{k}(w^{*})+\mu(w^{*}-w^{t})\approx\nabla F_{k}(w^{*})$ , since $w^{*}\approx w^{t}$ . Not knowing $w^{*}$ however, we cannot hope to be able to simply set $a_{k}^{t}$ to this value. Hence, the second option is to come up with an update rule which would guarantee that $a_{k}^{t}$ converges to $\nabla F_{k}(w^{*})$ as $t\to\infty$ . Notice at this point that it has been long known in the optimization community that the gradient of the objective at the optimal point is intimately related to the optimal solution of a dual problem. Here the situation is further complicated by the fact that we need to learn $K$ such gradients. In the following, we show that DANE is in fact a particular instantiation of the scheme above.

We present the Distributed Approximate Newton algorithm (DANE) , as Algorithm 2. The algorithm was originally analysed for solving the problem of structure (7), with $n_{k}$ being identical for each $k$ — i.e., each computer has the same number of data points. Nothing prevents us from running it in our more general setting though.

As alluded to earlier, the main idea of DANE is to form a local subproblem, dependent only on local data, and gradient of the entire function — which can be computed in a single round of communication (Line 3). The subproblem is then solved exactly (Line 4), and updates from individual nodes are averaged to form a new iterate (Line 5). This approach allows any algorithm to be used to solve the local subproblem (10). As a result, it often achieves communication efficiency in the sense of requiring expensive local computation between rounds of communication, hopefully rendering the time needed for communication insignificant (see Section 2.3.1). Further, note that DANE belongs to the family of distributed method that operate via the quadratic perturbation trick (9) with

If we assumed that the method works, i.e., that $w^{t}\to w^{*}$ and hence $\nabla f(w^{t})\to\nabla f(w^{*})=0$ , then $a_{k}^{t}\to\nabla F_{k}(w^{*})$ , which agrees with the earlier discussion.

In the default setting when $\mu=0$ and $\eta=1$ , DANE achieves desirable property D (immediate convergence when all local datasets are identical), since in this case $\nabla F_{k}(w^{t})-\eta\nabla f(w^{t})=0$ , and so we exactly minimize $F_{k}(w)=f(w)$ on each machine. For any choice of $\mu$ and $\eta$ , DANE also achieves property A, since in this case $\nabla f(w^{t})=0$ , and $w^{t}$ is a minimizer of $F_{k}(w)-\nabla F_{k}(w^{t})\cdot w$ as well as of the regularization term. Unfortunately, DANE does not achieve the more federated optimization-specific desirable properties B and C.

The convergence analysis for DANE assumes that the functions are twice differentiable, and relies on the assumption that each node has access to IID samples from the same underlying distribution. This implies that that the Hessians of $\nabla^{2}F_{k}(w)$ are similar to each other [91, Lemma 1]. In case of linear regression, with $\lambda=\mathcal{O}(1/\sqrt{n})$ -strongly convex functions, the number of DANE iterations needed to achieve $\epsilon$ -accuracy is $\mathcal{O}(K\log(1/\epsilon))$ . However, for general $L$ -smooth loss, the theory is significantly worse, and does not match its practical performance.

The practical performance also depends on the additional local regularization parameter $\mu$ . For small number of nodes $K$ , the algorithm converges quickly with $\mu=0$ . However, as reported [91, Figure 3], it can diverge quickly with growing $K$ . Bigger $\mu$ makes the algorithm more stable at the cost of slower convergence. Practical choice of $\mu$ remains an open question.

5 SVRG meets DANE

As we mentioned above, the DANE algorithm can perform poorly in certain settings, even without the challenging aspects of federated optimization. Another point that is seen as drawback of DANE is the need to find the exact minimum of (10) — this can be feasible for quadratics with relatively small dimension, but infeasible or extremely expensive to achieve for other problems. We adapt the idea from the CoCoA algorithm , in which an arbitrary optimization algorithm is used to obtain relative $\Theta$ accuracy on a locally defined subproblem. We replace the exact optimization with an approximate solution obtained by using any optimization algorithm.

Considering all the algorithms one could use to solve (10), the SVRG algorithm seems to be a particularly good candidate. Starting the local optimization of (10) from point $w^{t}$ , the algorithm automatically has access to the derivative at $w^{t}$ , which is identical for each node— $\nabla f(w^{t})$ . Hence, the SVRG algorithm can skip the initial expensive operation, evaluation of the entire gradient (Line 3, Algorithm 1), and proceed only with the stochastic updates in the inner loop.

It turns out that this modified version of the DANE algorithm is equivalent to a distributed version of SVRG.

Run the DANE algorithm (Algorithm 2) with $\eta=1$ and $\mu=0$ , and use SVRG (Algorithm 1) as a local solver for (10), running it for a single iteration, initialized at point $w^{t}$ .

Run a distributed variant of the SVRG algorithm, described in Algorithm 3.

The algorithms are equivalent in the following sense. If both start from the same point $w^{t}$ , they generate identical sequence of iterates $\{w^{t}\}$ .

We construct the proof by showing that single step of the SVRG algorithm applied to the problem (10) on computer $k$ is identical to the update on Line 8 in Algorithm 3.

The way to obtain a stochastic gradient of (10) is to sample one of the functions composing $F_{k}(w)=\frac{1}{n_{k}}\sum_{i\in\mathcal{P}_{k}}f_{i}(w)$ , and add the linear term $\nabla F_{k}(w^{t})-\eta f(w^{t})$ , which is known and does not need to be estimated. Upon sampling an index $i\in\mathcal{P}_{k}$ , the update direction follows as

which is identical to the direction in Line 8 in Algorithm 3. The claim follows by chaining the identical updates to form identical iterate $w^{t+1}$ . ∎

The algorithms considered in Proposition 1 are inherently stochastic. The statement of the proposition is valid under the assumption that in both cases, identical sequence of samples $i\in\mathcal{P}_{k}$ would be generated by all nodes $k\in\{1,2,\dots,K\}$ .

In the Proposition 1 we consider the DANE algorithm with particular values of $\eta$ and $\mu$ . The Algorithm 3 and the Proposition can be easily gereralized, but we present only the default version for the sake of clarity.

Since the first version of this paper, this connection has been mentioned in , which analyses an inexact version of the DANE algorithm. We proceed by adapting the above algorithm to other challenges arising in the context of federated optimization.

6 Federated SVRG

Empirically, the Algorithm 3 fits in the model of distributed optimization efficiency described in Section 2.3.1, since we can balance how many stochastic iterations should be performed locally against communication costs. However, several modifications are necessary to achieve good performance in the full federated optimization setting (Section 3.3). Very important aspect that needs to be addressed is that the number of data points available to a given node can differ greatly from the average number of data points available to any single node. Furthermore, this setting always comes with the data available locally being clustered around a specific pattern, and thus not being a representative sample of the overall distribution we are trying to learn. In the Experiments section we focus on the case of L2 regularized logistic regression, but the ideas carry over to other generalized linear prediction problems.

Note that in large scale generalized linear prediction problems, the data arising are almost always sparse, for example due to bag-of-words style feature representations. This means that only a small subset of $d$ elements of vector $x_{i}$ have nonzero values. In this class of problems, the gradient $\nabla f_{i}(w)$ is a multiple of the data vector $x_{i}$ . This creates additional complications, but also potential for exploitation of the problem structure and thus faster algorithms. Before continuing, let us summarize and denote a number of quantities needed to describe the algorithm.

$n$ — number of data points / training examples / functions.

$\mathcal{P}_{k}$ — set of indices, corresponding to data points stored on device $k$ .

$n_{k}=|\mathcal{P}_{k}|$ — number of data points stored on device $k$ .

$n^{j}=\left|\{i\in\{1,\dots,n\}:x_{i}^{T}e_{j}\neq 0\}\right|$ — the number of data points with nonzero $j^{th}$ coordinate

$n_{k}^{j}=\left|\{i\in\mathcal{P}_{k}:x_{i}^{T}e_{j}\neq 0\}\right|$ — the number of data points stored on node $k$ with nonzero $j^{th}$ coordinate

$\phi^{j}=n^{j}/n$ — frequency of appearance of nonzero elements in $j^{th}$ coordinate

$\phi_{k}^{j}=n_{k}^{j}/n_{k}$ — frequency of appearance of nonzero elements in $j^{th}$ coordinate on node $k$

$s_{k}^{j}=\phi^{j}/\phi_{k}^{j}$ — ratio of global and local appearance frequencies on node $k$ in $j^{th}$ coordinate

$S_{k}=\text{Diag}(s_{k}^{j})$ — diagonal matrix, composed of $s_{k}^{j}$ as $j^{th}$ diagonal element

$\omega^{j}=\left|\{\mathcal{P}_{k}:n_{k}^{j}\neq 0\}\right|$ — Number of nodes that contain data point with nonzero $j^{th}$ coordinate

$a^{j}=K/\omega^{j}$ — aggregation parameter for coordinate $j$

$A=\text{Diag}(a_{j})$ — diagonal matrix composed of $a_{j}$ as $j^{th}$ diagonal element

With these quantities defined, we can state our proposed algorithm as Algorithm 4. Our experiments show that this algorithm works very well in practice, but the motivation for the particular scaling of the updates may not be immediately clear. In the following section we provide the intuition that lead to the development of this algorithm.

6.2 Intuition Behind FSVRG Updates

The difference between the Algorithm 4 and Algorithm 3 is in the introduction of the following properties.

Aggregation of updates proportional to partition sizes — $\frac{n_{k}}{n}(w_{k}-w^{t})$

Scaling stochastic gradients by diagonal matrix — $S_{k}$

Per-coordinate scaling of aggregated updates — $A(w_{k}-w^{t})$

Let us now explain what motivated us to get this particular implementation.

As a simplification, assume that at some point in time, we have for some $w$ , $w_{k}=w$ for all $k\in[K]$ . In other words, all the nodes have the same local iterate. Although this is not exactly the case in practice, thinking about the issue in this simplified setting will give us insight into what would be meaningful to do if it was true. Further, we can hope that the reality is not too far from the simplification and it will still work in practice. Indeed, all nodes do start from the same point, and adding the linear term $\nabla F_{k}(w^{t})-\nabla f(w^{t})$ to the local objective forces all nodes to move in the same direction, at least initially.

Suppose the nodes are about to make a single step synchronously. Denote the update direction on node $k$ as $G_{k}=\nabla f_{i}(w)-\nabla f_{i}(w^{t})+\nabla f(w^{t})$ , where $i$ is sampled uniformly at random from $\mathcal{P}_{k}$ .

By setting $\alpha_{k}=\frac{n_{k}}{n}$ , we get

This motivates the aggregation of updates from nodes proportional to $n_{k}$ , the number of data points available locally (Point 2).

Next, we realize that if the local data sizes, $n_{k}$ , are not identical, we likely don’t want to do the same number of local iterations on each node $k$ . Intuitively, doing one pass through data (or a fixed number of passes) makes sense. As a result, the aggregation motivated above does not make perfect sense anymore. Nevertheless, we can even it out, by setting the stepsize $h_{k}$ inversely proportional to $n_{k}$ , making sure each node makes progress of roughly the same magnitude overall. Hence, $h_{k}=h/n_{k}$ (Point 1).

To motivate the Point 3, scaling of stochastic gradients by diagonal matrix $S_{k}$ , consider the following example. We have $1,000,000$ data points, distributed across $K=1,000$ nodes. When we look at a particular feature of the data points, we observe it is non-zero only in $1,000$ of them. Moreover, all of them happen to be stored on a single node, that stores only these $1,000$ data points. Sampling a data point from this node and evaluating the corresponding gradient, will clearly yield an estimate of the gradient $\nabla f(w)$ with $1000$ -times larger magnitude. This would not necessarily be a problem if done only once. However, repeatedly sampling and overshooting the magnitude of the gradient will likely cause the iterative process to diverge quickly.

Hence, we scale the stochastic gradients by a diagonal matrix. This can be seen as an attempt to enforce the estimates of the gradient to be of the correct magnitude, conditioned on us, algorithm designers, being aware of the structure of distribution of the sparsity pattern.

Let us now highlight some properties of the modification in Point 4. Without any extra information, or in the case of fully dense data, averaging the local updates is the only way that actually makes sense — because each node outputs approximate solution of a proxy to the overall objective, and there is no induced separability structure in the outputs such as in CoCoA . However, we could do much more in the other extreme. If the sparsity structure is such that each data point only depends on one of disjoint groups of variables, and the data were distributed according to this structure, we would efficiently have several disjoint problems. Solving each of them locally, and adding up the results would solve the problem in single iteration — desired algorithm property (C).

What we propose is an interpolation between these two settings, on a per-variable basis. If a variable appears in data on each node, we are going to take average. However, the less nodes a particular variable appear on, the more we want to trust those few nodes in informing us about the meaningful update to this variable — or alternatively, take a longer step. Hence the per-variable scaling of aggregated updates.

7 Further Notes

Looking at the Proposition 1, we identify equivalence of two algorithms, take the second one and try modify it to make it suitable for the setting of federated optimization. A question naturally arise: Is it possible to achieve the same by modifying the first algorithm suitable for federated optimization — by only altering the local optimization objective?

We indeed tried to experiment with idea, but we don’t report the details for two reasons. First, the requirement of exact solution of the local subproblem is often impractical. Relaxing it gradually moves us to the setting we presented in the previous sections. But more importantly, using this approach we have only managed to get results significantly inferior to those reported later in the Experiments section.

Experiments

In this section we present the first experimental results in the setting of federated optimization. In particular, we provide results on a dataset based on public Google+ postsThe posts were public at the time the experiment was performed, but since a user may decide to delete the post or make it non-public, we cannot release (or even permanently store) any copies of the data., clustered by user — simulating each user as a independent node. This preliminary experiment demonstrates why none of the existing algorithms are suitable for federated optimization, and the robustness of our proposed method to challenges arising there.

The dataset presented here was generated based on public Google+ posts. We randomly picked $10,000$ authors that have at least $100$ public posts in English, and try to predict whether a post will receive at least one comment (that is, a binary classification task).

We split the data chronologically on a per-author basis, taking the earlier $75\%$ for training and the following $25\%$ for testing. The total number of training examples is $n=2,166,693$ . We created a simple bag-of-words language model, based on the $20,000$ most frequent words in dictionary based on all Google+ data. This results in a problem with dimension $d=20,002$ . The extra two features represent a bias term and variable for unknown word. We then use a logistic regression model to make a prediction based on these features.

We shape the distributed optimization problem as follows. Suppose that each user corresponds to one node, resulting in $K=10,000$ . The average $n_{k}$ , number of data points on node $k$ is thus roughly $216$ . However, the actual numbers $n_{k}$ range from $75$ to $9,000$ , showing the data is in fact substantially unbalanced.

It is natural to expect that different users can exhibit very different patterns in the data generated. This is indeed the case, and hence the distribution to nodes cannot be considered an IID sample from the overall distribution. Since we have a bag-of-words model, our data are very sparse — most posts contain only small fraction of all the words in the dictionary. This, together with the fact that the data are naturally clustered on a per-user basis, creates additional challenge that is not present in the traditional distributed setting.

Figure 1 shows the frequency of different features across nodes. Some features are present everywhere, such as the bias term, while most features are relatively rare. In particular, over $88\%$ of features are present on fewer than $1,000$ nodes. However, this distribution does not necessarily resemble the overall appearance of the features in data examples. For instance, while an unknown word is present in data of almost every user, it is far from being contained in every data point.

Before presenting the results, it is useful to look at some of the important basic prediction properties of the data. We use L2-regularized logistic regression, with regularization parameter $\lambda=1/n$ . We chose $\lambda$ to be the best in terms of test error in the optimal solution.

If one chooses to predict $-1$ (no comment), classification error is $\textbf{33.16}\%$ .

The optimal solution of the global logistic regression problem yields $\textbf{26.27}\%$ test set error.

Predicting the per-author majority from the training data yields $\textbf{17.14}\%$ test error. That is, predict $+1$ or $-1$ for all the posts of an author, based on which label was more common in that author’s training data. This indicates that knowing the author is actually more useful than knowing what they said, which is perhaps not surprising.

In summary, this data is representative for our motivating application in federated optimization. It is possible to improve upon naive baseline using a fixed global model. Further, the per-author majority result suggests it is possible to improve further by adapting the global model to each user individually. Model personalization is common practice in industrial applications, and the techniques used to do this are orthogonal to the challenges of federated optimization. Exploring its performance is a natural next step, but beyond the scope of this work.

While we do not provide experiments for per user personalized models, we remark that this could be a good descriptor of how far from IID the data is distributed. Indeed, if each node has access to an IID sample, any adaptation to local data is merely over-fitting. However, if we can significantly improve upon the global model by per user/node adaptation, this means that the data available locally exhibit patterns specific to the particular node.

The performance of the Algorithm 4 is presented below. The only parameter that remains to be chosen by user is the stepsize $h$ . We tried a set of stepsizes, and retrospectively choose one that works best — a typical practice in machine learning.

In Figure 2, we compare the following optimization algorithmsWe thank Mark Schmidt for his prettyPlot function, available on his website.:

The blue squares (OPT) represent the best possible offline value (the optimal value of the optimization task in the first plot, and the test error corresponding to the optimum in the second plot).

The teal diamonds (GD) correspond to a simple distributed gradient descent.

The purple triangles (COCOA) are for the CoCoA+ algorithm .

The green circles (FSVRG) give values for our proposed algorithm.

The red stars (FSVRGR) correspond to the same algorithm applied to the same problem with randomly reshuffled data. That is, we keep the unbalanced number of examples per node, but populate each node with randomly selected examples.

The first thing to notice is that CoCoA+ seems to be worse than trivial benchmark — distributed gradient descent. This behaviour can be predicted from theory, as the overall convergence rate directly depends on the best choice of aggregation parameter $\sigma^{\prime}$ . For sparse problems, it is upperbounded by the maximum of the values reported in Figure 1, which is $K$ , and it is close to it also in practice. Althought it is expected that the algorithm could be modified to depend on average of these quantities (which could be orders of magnitude smaller), akin to coordinate descent algorithms , it has not been done yet. Note that other communication efficient algorithms fail to converge altogether.

The algorithm we propose, FSVRG, converges to optimal test classification accuracy in just $30$ iterations. Recall that in the setting of federated optimization we introduced in Section 1.2, minimization of rounds of communication is the principal goal. However, concluding that the approach is stunningly superior to existing methods would not be completely fair nor correct. The conclusion is that the FSVRG is the first algorithm to tackle federated optimization, a problem that existing methods fail to generalize to. It is important to stress that none of the existing methods were designed with these particular challenges in mind, and we formulate the first benchmark.

Since the core reason other methods fail to converge is the non-IID data distribution, we test our method on the same problem, with data randomly reshuffled among the same number of nodes (FSVRGR; red stars). Since the difference in convergence is subtle, we can conclude that the techniques described in Section 3.6.2 serve its purpose and make the algorithm robust to challenges present in federated optimization.

This experiment demonstrates that learning from massively decentralized data, clustered on a per-user basis is indeed problem we can tackle in practice. Since the first version of this paper , additional experimental results were presented in . We refer the reader to this paper for experiments in more challenging setting of deep learning, and a further discussion on how such system would be implemented in practice.

Conclusions and Future Challenges

We have introduced a new setting for distributed optimization, which we call federated optimization. This setting is motivated by the outlined vision, in which users do not send the data they generate to companies at all, but rather provide part of their computational power to be used to solve optimization problems. This comes with a unique set of challenges for distributed optimization. In particular, we argue that the massively distributed, non-IID, unbalanced, and sparse properties of federated optimization problems need to be addressed by the optimization community.

We explain why existing methods are not applicable or effective in this setting. Even the distributed algorithms that can be applied converge very slowly in the presence of large number of nodes on which the data are stored. We demonstrate that in practice, it is possible to design algorithms that work surprisingly efficiently in the challenging setting of federated optimization, which makes the vision conceptually feasible.

We realize that it is important to scale stochastic gradients on a per-coordinate basis, differently on each node to improve performance. To the best of our knowledge, this is the first time such per-node scaling has been used in distributed optimization. Additionally, we use per-coordinate aggregation of updates from each node, based on distribution of the sparsity patterns in the data.

Even though our results are encouraging, there is a lot of room for future work. One natural direction is to consider fully asynchronous versions of our algorithms, where the updates are applied as soon as they arrive. Another is developing a better theoretical understanding of our algorithm, as we believe that development of a strong understanding of the convergence properties will drive further research in this area.

Study of the federated optimization problem for non-convex objectives is another important avenue of research. In particular, neural networks are the most important example of a machine learning tool that yields non-convex functions $f_{i}$ , without any convenient general structure. Consequently, there are no useful results describing convergence guarantees of optimization algorithms. Despite the lack of theoretical understanding, neural networks are now state-of-the-art in many application areas, ranging from natural language understanding to visual object detection. Such applications arise naturally in federated optimization settings, and so extending our work to such problems is an important direction.

The non-IID data distribution assumed in federated optimization, and mobile applications in particular, suggest that one should consider the problem of training a personalized model together with that of learning a global model. That is, if there is enough data available on a given node, and we assume that data is drawn from the same distribution as future test examples for that node, it may be preferable to make predictions based on a personalized model that is biased toward good performance on the local data, rather than simply using the global model.

References

Appendix A Distributed Optimization via Quadratic Perturbations

This appendix follows from the discussion motivating DANE algorithm by a general algorithmic perturbation template (9) for $\lambda$ -strongly convex objectives. We use this to propose a similar but new method, which unlike DANE converges under arbitrary data partitioning $\{\mathcal{P}_{k}\}_{k=1}^{K}$ , and we highlight its relation to the dual CoCoA algorithm for distributed optimization.

For simplicity and ease of drawing the above connections we assume that $n_{k}$ is identical for all $k\in\{1,2,\dots,K\}$ throughout the appendix. All the arguments can be simply extended, but would unnecessarily complicate the notation for current purpose.

We now present a new method (Algorithm 5), which also belongs to the family of quadratic perturbation methods (9). However, the perturbation vectors $a_{k}^{t}$ are different from those of DANE. In particular, we set

where $\eta>0$ is a parameter, and the vectors $g_{k}^{t}$ are maintained by the method. As we show in Lemma 4, Algorithm 5 satisfies

for all iterations $t$ . This implies that $\tfrac{1}{K}\sum_{k=1}^{K}a_{k}^{t}=(1-\eta)\nabla f(w^{t})$ . That is, both DANE and the new method use a linear perturbation which, when averaged over the nodes, involves the gradient of the objective function $f$ at the latest iterate $w^{t}$ . Therefore, the methods have one more property in common beyond both being of the form (9). However, as we shall see in the rest of this section, Algorithm 5 allows an insightful dual interpretation. Moreover, while DANE may not converge for arbitrary problems (even when restricted to ridge regression)—and is only known to converge under the assumption that the data stored on each node are in some precise way similar, Algorithm 5 converges for any ridge regression problem and any data partitioning.

Let us denote by $X_{k}$ the matrix obtained by stacking the data points $x_{i}$ as column vectors for all $i\in\mathcal{P}_{k}$ . We have the following Lemma.

For all $t\geq 0$ we have $\sum_{k=1}^{K}g_{k}^{t}=0$ .

where the last step follows from the definition of $w^{0}$ . Assume now that the statement hold for $t$ . Then

The first equation follows from the way $g_{k}$ is updated in the algorithm. The second equation follows from the inductive assumption, and the last equation follows from the definition of $w^{t+1}$ in the algorithm. ∎

A.2 L2-Regularized Linear Predictors

In the rest of this section we consider the case of L2-regularized linear predictors. That is, we focus on problem (1) with $f_{i}$ of the form

where $\lambda>0$ is a regularization parameter. This leads to L2 regularized empirical risk minimization (ERM) problem

A.3 A Dual Method: Dual Block Proximal Gradient Ascent

where $\phi_{i}^{*}$ is the convex conjugate of $\phi_{i}$ . Since we assume that $\phi_{i}$ is $1/\gamma$ smooth, it follows that $\phi_{i}^{*}$ is $\gamma$ strongly convex. Therefore, $D$ is a strongly concave function.

It is well known that if $\alpha^{*}$ is the optimal solution of the dual problem (11), then $w^{*}\overset{\text{def}}{=}\frac{1}{\lambda n}X\alpha^{*}$ is the optimal solution of the primal problem. Therefore, for any dual algorithm producing a sequence of iterates $\alpha^{t}$ , we can define a corresponding primal algorithm via the linear mapping

Clearly, if $\alpha^{t}\to\alpha^{*}$ , then $w^{t}\to w^{*}$ . We shall now design a method for maximizing the dual function $D$ and then in Theorem 5 we claim that for quadratic loss functions, Algorithm 5 arises as an image, defined via (14), of dual iterations of this dual ascent method.

Let $\xi(\alpha)\overset{\text{def}}{=}\tfrac{1}{2}\|X\alpha\|^{2}$ . Since $\xi$ is a convex quadratic, we have

where $\nabla\xi(\alpha)=X^{T}X\alpha$ and $\nabla^{2}\xi(\alpha)=X^{T}X$ . Further, we define the block-diagonal matrix $B\overset{\text{def}}{=}Diag(X_{1}^{T}X_{1},\dots,X_{K}^{T}X_{K})$ , and a norm associate with this matrix:

where $\nabla_{k}\xi(\alpha^{t})$ corresponds to the subvector of $\nabla\xi(\alpha^{t})$ formed by entries $i\in{\cal P}_{k}$ .

We now let $h^{t}=(h^{t}_{1},\dots,h^{t}_{K})$ be the maximizer of this lower bound. Since the lower bound is separable in the blocks $\{h^{t}_{k}\}_{k}$ , we can simply set

Having computed $h_{k}^{t}$ for all $k$ , we can set $\alpha_{k}^{t+1}=\alpha_{k}^{t}+h_{k}^{t}$ for all $k$ , or equivalently, $\alpha^{t+1}=\alpha^{t}+h^{t}$ . This is formalized as Algorithm 6. Algorithm 6 is a proximal gradient ascent method applied to the dual problem, with smoothness being measured using the block norm $\|h\|_{B}$ . It is known that gradient ascent converges at a linear rate for smooth and strongly convex (for minimization problems) objectives.

One of the main insights of this section is the following equivalence result.

Consider the ridge regression problem. That is, set $\phi_{i}(t)=\tfrac{1}{2}(t-y_{i})^{2}$ for all $i$ . Assume $\alpha_{1}^{0},\dots,\alpha_{K}^{0}$ is chosen in the same way in Algorithms 5 and 6. Then the dual iterates $\alpha^{t}$ and the primal iterates $w^{t}$ produced by the two algorithms are related via (14) for all $t\geq 0$ .

Since the dual method converges linearly, in view of the above theorem, so does the primal method. Here we only remark that the popular algorithm CoCoA+ arises if Step 5 in Algorithm 6 is done inexactly. Hence, we show that duality provides a deep relationship between the CoCoA+ and DANE algorithms, which were previously considered completely different.

A.4 Proof of Theorem 5

Since $\phi_{i}(t)=\tfrac{1}{2}(t-y_{i})^{2}$ , the primal problem (11) is a ridge regression problem of the form

The primal objective function is of the form (8), where in view of (12), we have $F_{k}(w)=\frac{K}{2n}\|X_{k}^{T}w-y_{k}\|^{2}+\frac{\lambda}{2}\|w\|^{2}$ . Therefore,

and $\nabla f(w)=\frac{1}{K}\sum_{k}\nabla F_{k}(w)=\frac{1}{K}\sum_{k}\left(\frac{K}{n}X_{k}(X_{k}^{T}w-y_{k})+\lambda w\right).$

Note that (19) has the same form as (17), with $X$ replaced by $X_{k}$ , $\lambda$ replaced by $\lambda/\sigma$ and $y$ replaced by $c_{k}:=y_{k}-X_{k}^{T}w^{t}-\alpha^{t}_{k}$ . Hence, we know that

is the optimal solution of the primal problem of (22):

Hence, the primal version of method (20) is given by

With the change of variables $w:=w^{t}+\frac{K}{\sigma}s$ (i.e., $s=\frac{\sigma}{K}(w-w^{t})$ ), from (22) we know that $w_{k}^{t+1}:=w^{t}+\frac{K}{\sigma}s_{k}^{t}$ solves

and $w^{t+1}=\frac{1}{K}\sum_{k=1}^{K}w_{k}^{t+1}$ .

Let us now rewrite the function in (23) so as to connect it to Algorithm 5:

Next, since $\|w\|^{2}=\|w-w^{t}\|^{2}-\|w^{t}\|^{2}+2(w^{t})^{T}w$ , we can further write

where the last step follows from the claim that $\eta z_{k}^{t}=\eta\nabla F_{k}(w^{t})+g_{k}^{t}$ . We now prove the claim. First, we have

Due to the definition of $g_{k}^{0}$ in Step 5 of Algorithm 5 as $g_{k}^{0}=\eta(\frac{K}{n}X_{k}\alpha_{k}^{0}-\lambda w^{0})$ , we observe that the claim holds for $t=0$ . If we show that

for all $t\geq 0$ , then we are done. This can be shown by induction. This finishes the proof of Theorem 5.