Delay and Cooperation in Nonstochastic Bandits

Nicolo' Cesa-Bianchi, Claudio Gentile, Yishay Mansour, Alberto Minora

Introduction

Delayed feedback naturally arises in many sequential decision problems. For instance, a recommender system typically learns the utility of a recommendation by detecting the occurrence of certain events (e.g., a user conversion), which may happen with a variable delay after the recommendation was issued. Other examples are the communication delays experienced by interacting learning agents. Concretely, consider a network of geographically distributed ad servers using real-time bidding to sell their inventory. Each server sequentially learns how to set the auction parameters (e.g., reserve price) in order to maximize the network’s overall revenue, and shares feedback information with other servers in order to speed up learning. However, the rate at which information is exchanged through the communication network is slower than the typical rate at which ads are served. This causes each learner to acquire feedback information from other servers with a delay that depends on the network’s structure.

Motivated by the ad network example, we consider networks of learning agents that cooperate to solve the same nonstochastic bandit problem, and study the impact of delay on the global performance of these agents. We introduce the Exp3-Coop algorithm, a distributed and cooperative version of the Exp3 algorithm of [auer2002nonstochastic]. Exp3-Coop works within a distributed and synced model where each agent runs an instance of the same bandit algorithm (Exp3). All bandit instances are initialized in the same way irrespective to the agent’s location in the network (that is, agents have no preliminary knowledge of the network), and we assume the information about an agent’s actions is propagated through the network with a unit delay for each crossed edge. In each round tt, each agent selects an action and incurs the corresponding loss (which is the same for all agents that pick that action in round tt). Besides observing the loss of the selected action, each agent obtains the information previously broadcast by other agents with a delay equal to the shortest-path distance between the agents. Namely, at time tt an agent learns what the agents at shortest-path distance ss did at time tst-s for each s=1,,ds=1,\ldots,d, where dd is a delay parameter. In this scenario, we aim at controlling the growth of the regret averaged over all agents (the so-called average welfare regret).

In the noncooperative case, when agents ignore the information received from other agents, the average welfare regret grows like KT\sqrt{KT} (the minimax rate for standard bandit setting), where KK is the number of actions and TT is the time horizon. We show that, using cooperation, NN agents with communication graph GG can achieve an average welfare regret of order \sqrt{\bigl{(}d+1+\tfrac{K}{N}\alpha_{\leq d}\bigr{)}(T\ln K)}. Here αd\alpha_{\leq d} denotes the independence number of the dd-th power of GG (i.e., the graph GG augmented with all edges between any two pair of nodes at shortest-path distance less than or equal to dd). When d=Kd=\sqrt{K} this bound is at most K1/4TlnK+K(lnT)K^{1/4}\sqrt{T\ln K}+\sqrt{K}(\ln T) for any connected graph —see Remark LABEL:rem:choiceofd in Section LABEL:s:single-d— which is asymptotically better than KT\sqrt{KT}.

Networks of nonstochastic bandits were also investigated by [awerbuch2008competitive] in a setting where the distribution over actions is shared among the agents without delay. [awerbuch2008competitive] prove a bound on the average welfare regret of order \sqrt{\bigl{(}1+\tfrac{K}{N}\bigr{)}T} ignoring polylog factors. The rate proven in (awerbuch2008competitive, Theorem 2.1) has a worse dependence on TT, but we believe this is due to the fact that their setting allows for dishonest agents and agent-specific loss vectors. We recover the same bound as a special case of our bound when GG is a clique and d=1d=1. In the clique case our bound is also similar to the bound KN(TlnK)\sqrt{\tfrac{K}{N}(T\ln K)} achieved by [seldin2014prediction] in a single-agent bandit setting where, at each time step, the agent can choose a subset of NKN\leq K actions and observe their loss. In the case when N=1N=1 (single agent), our analysis can be applied to the nonstochastic bandit problem where the player observes the loss of each played action with a delay of dd steps. In this case we improve on the previous result of (d+1)KT\sqrt{(d+1)KT} by [neu2010online, neu2014online], and give the first characterization (up to logarithmic factors) of the minimax regret, which is of order (d+K)T\sqrt{(d+K)\,T}.

In principle, the problem of delays in online learning could be tackled by simple reductions. Yet, these reductions give rise to suboptimal results. In the single agent setting, where the delay is constant and equal to dd, one can use the technique of [weinberger2002delayed] and run d+1d+1 instances of an online algorithm for the nondelayed case, where each instance is used every d+1d+1 steps. This delivers a suboptimal regret bound of (d+1)KT\sqrt{(d+1)KT}. In the case of multiple delays, like in our multi-agent setting, one can repeat the same action for d+1d+1 steps while accumulating information from the other agents, and then perform an update on scaled-up losses. The resulting (suboptimal) bound on the average welfare regret would be of the form \sqrt{(d+1)\bigl{(}1+\tfrac{K}{N}\alpha_{\leq d}\bigr{)}(T\ln K)}.

Rather than using reductions, the analysis of Exp3-Coop rests on quantifying the performance of suitable importance weighted estimates. In fact, in the single-agent setting with delay parameter dd, using Exp3-Coop reduces to running the standard Exp3 algorithm performing an update as soon a new loss becomes available. This implies that at any round t>dt>d, Exp3 selects an action without knowing the losses incurred during the last dd rounds. The resulting regret is bounded by relating the standard analysis of Exp3 to a detailed quantification of the extent to which the distribution maintained by Exp3 can drift in dd steps.

In the multi-agent case, the importance weighted estimate of Exp3-Coop is designed in such a way that at each time t>dt>d the instance of the algorithm run by an agent vv updates all actions that were played at time tdt-d by agent vv or by other agents not further away than dd from vv. Compared to the single agent case, here each agent can exploit the information circulated by the other agents. However, in order to compute the importance weighted estimates used locally by each agent, the probabilities maintained by the agents must be propagated together with the observed losses. Here, further concerns may show up, like the amount of communication, and the location of each agent within the network. In particular, when GG has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality within GG, strictly improves on the regret of Exp3-Coop.

Additional Related Work

Many important ideas in delayed online learning, including the observation that the effect of delays can be limited by controlling the amount of change in the agent strategy, were introduced by