Delay and Cooperation in Nonstochastic Bandits

Nicolo' Cesa-Bianchi, Claudio Gentile, Yishay Mansour, Alberto Minora

Introduction

Delayed feedback naturally arises in many sequential decision problems. For instance, a recommender system typically learns the utility of a recommendation by detecting the occurrence of certain events (e.g., a user conversion), which may happen with a variable delay after the recommendation was issued. Other examples are the communication delays experienced by interacting learning agents. Concretely, consider a network of geographically distributed ad servers using real-time bidding to sell their inventory. Each server sequentially learns how to set the auction parameters (e.g., reserve price) in order to maximize the network’s overall revenue, and shares feedback information with other servers in order to speed up learning. However, the rate at which information is exchanged through the communication network is slower than the typical rate at which ads are served. This causes each learner to acquire feedback information from other servers with a delay that depends on the network’s structure.

Motivated by the ad network example, we consider networks of learning agents that cooperate to solve the same nonstochastic bandit problem, and study the impact of delay on the global performance of these agents. We introduce the Exp3-Coop algorithm, a distributed and cooperative version of the Exp3 algorithm of [auer2002nonstochastic]. Exp3-Coop works within a distributed and synced model where each agent runs an instance of the same bandit algorithm (Exp3). All bandit instances are initialized in the same way irrespective to the agent’s location in the network (that is, agents have no preliminary knowledge of the network), and we assume the information about an agent’s actions is propagated through the network with a unit delay for each crossed edge. In each round $t$ , each agent selects an action and incurs the corresponding loss (which is the same for all agents that pick that action in round $t$ ). Besides observing the loss of the selected action, each agent obtains the information previously broadcast by other agents with a delay equal to the shortest-path distance between the agents. Namely, at time $t$ an agent learns what the agents at shortest-path distance $s$ did at time $t-s$ for each $s=1,\ldots,d$ , where $d$ is a delay parameter. In this scenario, we aim at controlling the growth of the regret averaged over all agents (the so-called average welfare regret).

In the noncooperative case, when agents ignore the information received from other agents, the average welfare regret grows like $\sqrt{KT}$ (the minimax rate for standard bandit setting), where $K$ is the number of actions and $T$ is the time horizon. We show that, using cooperation, $N$ agents with communication graph $G$ can achieve an average welfare regret of order $\sqrt{\bigl{(}d+1+\tfrac{K}{N}\alpha_{\leq d}\bigr{)}(T\ln K)}$ . Here $\alpha_{\leq d}$ denotes the independence number of the $d$ -th power of $G$ (i.e., the graph $G$ augmented with all edges between any two pair of nodes at shortest-path distance less than or equal to $d$ ). When $d=\sqrt{K}$ this bound is at most $K^{1/4}\sqrt{T\ln K}+\sqrt{K}(\ln T)$ for any connected graph —see Remark LABEL:rem:choiceofd in Section LABEL:s:single-d— which is asymptotically better than $\sqrt{KT}$ .

Networks of nonstochastic bandits were also investigated by [awerbuch2008competitive] in a setting where the distribution over actions is shared among the agents without delay. [awerbuch2008competitive] prove a bound on the average welfare regret of order $\sqrt{\bigl{(}1+\tfrac{K}{N}\bigr{)}T}$ ignoring polylog factors. The rate proven in (awerbuch2008competitive, Theorem 2.1) has a worse dependence on $T$ , but we believe this is due to the fact that their setting allows for dishonest agents and agent-specific loss vectors. We recover the same bound as a special case of our bound when $G$ is a clique and $d=1$ . In the clique case our bound is also similar to the bound $\sqrt{\tfrac{K}{N}(T\ln K)}$ achieved by [seldin2014prediction] in a single-agent bandit setting where, at each time step, the agent can choose a subset of $N\leq K$ actions and observe their loss. In the case when $N=1$ (single agent), our analysis can be applied to the nonstochastic bandit problem where the player observes the loss of each played action with a delay of $d$ steps. In this case we improve on the previous result of $\sqrt{(d+1)KT}$ by [neu2010online, neu2014online], and give the first characterization (up to logarithmic factors) of the minimax regret, which is of order $\sqrt{(d+K)\,T}$ .

In principle, the problem of delays in online learning could be tackled by simple reductions. Yet, these reductions give rise to suboptimal results. In the single agent setting, where the delay is constant and equal to $d$ , one can use the technique of [weinberger2002delayed] and run $d+1$ instances of an online algorithm for the nondelayed case, where each instance is used every $d+1$ steps. This delivers a suboptimal regret bound of $\sqrt{(d+1)KT}$ . In the case of multiple delays, like in our multi-agent setting, one can repeat the same action for $d+1$ steps while accumulating information from the other agents, and then perform an update on scaled-up losses. The resulting (suboptimal) bound on the average welfare regret would be of the form $\sqrt{(d+1)\bigl{(}1+\tfrac{K}{N}\alpha_{\leq d}\bigr{)}(T\ln K)}$ .

Rather than using reductions, the analysis of Exp3-Coop rests on quantifying the performance of suitable importance weighted estimates. In fact, in the single-agent setting with delay parameter $d$ , using Exp3-Coop reduces to running the standard Exp3 algorithm performing an update as soon a new loss becomes available. This implies that at any round $t>d$ , Exp3 selects an action without knowing the losses incurred during the last $d$ rounds. The resulting regret is bounded by relating the standard analysis of Exp3 to a detailed quantification of the extent to which the distribution maintained by Exp3 can drift in $d$ steps.

In the multi-agent case, the importance weighted estimate of Exp3-Coop is designed in such a way that at each time $t>d$ the instance of the algorithm run by an agent $v$ updates all actions that were played at time $t-d$ by agent $v$ or by other agents not further away than $d$ from $v$ . Compared to the single agent case, here each agent can exploit the information circulated by the other agents. However, in order to compute the importance weighted estimates used locally by each agent, the probabilities maintained by the agents must be propagated together with the observed losses. Here, further concerns may show up, like the amount of communication, and the location of each agent within the network. In particular, when $G$ has sparse components, we show that a variant of Exp3-Coop, allowing agents to choose their parameters according to their centrality within $G$ , strictly improves on the regret of Exp3-Coop.

Additional Related Work

Many important ideas in delayed online learning, including the observation that the effect of delays can be limited by controlling the amount of change in the agent strategy, were introduced by