On Distributed Cooperative Decision-Making in Multiarmed Bandits

Peter Landgren, Vaibhav Srivastava, Naomi Ehrich Leonard

I Introduction

Cooperative decision-making under uncertainty is ubiquitous in natural systems as well as in engineering networks. Typically in a distributed cooperative decision-making scenario, there is assimilation of information across a network followed by decision-making based on the collective information. The result is a kind of collective intelligence, which is of fundamental interest both in terms of understanding natural systems and designing efficient engineered systems.

A fundamental feature of decision-making under uncertainty is the explore-exploit tradeoff. The explore-exploit tradeoff refers to the tension between learning and optimizing: the decision-making agent needs to learn the unknown system parameters (exploration), while maximizing its decision-making objective, which depends on the unknown parameters (exploitation).

MAB problems are canonical formulations of the explore-exploit tradeoff. In a stochastic MAB problem a set of options (arms) are given. A stochastic reward with an unknown mean is associated with each option. A player can pick only one option at a time, and the player’s objective is to maximize the cumulative expected reward over a sequence of choices. In an MAB problem, the player needs to balance the tradeoff between learning the mean reward at each arm (exploration), and picking the arm with maximum mean reward (exploitation).

MAB problems are pervasive across a variety of scientific communities and have found application in diverse areas including controls and robotics , ecology , psychology , and communications . Despite the prevalence of the MAB problem, the research on MAB problems has primarily focused on policies for a single agent. The increasing importance of networked systems warrants the development of distributed algorithms for multiple communicating agents faced with MAB problems. In this paper, we extend a popular single-agent algorithm for the stochastic MAB problem to the distributed multiple agent setting and analyze decision-making performance as a function of the network structure.

The MAB problem has been extensively studied (see for a survey). In their seminal work, Lai and Robbins established a logarithmic lower bound on the expected number of times a sub-optimal arm needs to be selected by an optimal policy. In another seminal work, Auer et al. developed the upper confidence bound (UCB) algorithm for the stochastic MAB problem that achieves the lower bound in uniformly in time. Anantharam et al. extended the results of to the setting of multiple centralized players.

Recently, researchers have studied the MAB problem with multiple players in the decentralized setting. Primarily motivated by communication networks, these researchers assume no communication among agents and design efficient decentralized policies. Kar et al. investigated the multiagent MAB problem in a leader-follower setting. They designed efficient policies for systems in which there is one major player that can access the rewards and the remaining minor players can only observe the sampling patterns of the major player. The MAB problem has also been used to perform empirical study of collaborative learning in social networks and the influence of network structure on decision-making performance of human agents .

Here, we use a running consensus algorithm for assimilation of information, which is an extension of the classical DeGroot model in the social networks literature. Running consensus and related models have been used to study learning and decision-making in social networks.

In the present paper we study the distributed cooperative MAB problem in which a set of agents are faced with a stochastic MAB problem and communicate their information with their neighbors in an undirected, connected communication graph. We use a set of running consensus algorithms for cooperative estimation of the mean reward at each arm, and we design an arm selection heuristic that leads to an order-optimal performance for the group. The major contributions of this paper are as follows.

First, we employ and rigorously analyze running consensus algorithms for distributed cooperative estimation of mean reward at each arm, and we derive bounds on several relevant quantities.

Second, we propose and thoroughly analyze the cooperative UCB algorithm. We derive bounds on decision-making performance for the group and characterize the influence of the network structure on the performance.

Third, we introduce a novel graph centrality measure and numerically demonstrate that this measure captures the ordering of explore-exploit performance of each agent.

The remainder of the paper is organized as follows. In Section II we recall some preliminaries about the stochastic MAB problem and consensus algorithms. In Section III we present and analyze the cooperative estimation algorithm. We propose and analyze the cooperative UCB algorithm in Section IV. We illustrate our analytic results with numerical examples in Section V. We conclude in Section VI.

II Background

In this section we recall the standard MAB problem, the UCB algorithm, and some preliminaries on discrete-time consensus.

Consider an NN-armed bandit problem, i.e., an MAB problem with NN arms. The reward associated with arm i{1,,N}i\in\{1,\dots,N\} is a random variable with an unknown mean mim_{i}. Let the agent choose arm i(t)i(t) at time t{1,,T}t\in\{1,\dots,T\} and receive a reward r(t)r(t). The decision-maker’s objective is to choose a sequence of arms {i(t)}t{1,,T}\{i(t)\}_{t\in\{1,\dots,T\}} that maximizes the expected cumulative reward t=1Tmi(t)\sum_{t=1}^{T}m_{i(t)}, where TT is the horizon length of the sequential allocation process.

In this paper, we focus on Gaussian rewards, i.e., the reward at arm ii is sampled from a Gaussian distribution with mean mim_{i} and variance σ2\sigma^{2}. We assume that the variance σ2\sigma^{2} is known and is the same at each arm.

II-B The UCB Algorithm

A popular solution to the stochastic MAB problem is the UCB algorithm proposed in . The UCB algorithm initializes by sampling each arm once, and then selects an arm with maximum

where ni(t)n_{i}(t) is the number of times arm ii has been chosen up to and including time tt, and μ^i(t)\hat{\mu}_{i}(t) and Ci(t)=2ln(t)ni(t)C_{i}(t)=\sqrt{\frac{2\ln(t)}{n_{i}(t)}} are the empirical mean reward of arm ii and the associated measure of the uncertainty associated with that mean at time tt, respectively.

The function Qi(t)Q_{i}(t) is judiciously designed to balance the tradeoff between explore and exploit: the terms μ^i(t)\hat{\mu}_{i}(t) and Ci(t)C_{i}(t) facilitate exploitation and exploration, respectively. The UCB algorithm as described above assumes that rewards have a bounded support $$, but this algorithm can be easily extended to distributions with unbounded support .

II-C The Cooperative MAB Problem

In the following, we will design a distributed algorithm that samples a suboptimal arm ii within a constant factor of the above bound.

II-D Discrete-Time Consensus

Consider a set of agents {1,,M}\{1,\dots,M\}, each of which maintains bidirectional communication with a set of neighboring agents. The objective of the consensus algorithms is to ensure agreement among agents on a common value. In the discrete-time consensus algorithm , agents average their opinion with their neighbors’ opinions at each time. A discrete-time consensus algorithm can be expressed as

where x(t)\mathbf{x}(t) is the vector of each agent’s opinion, and PP is a row stochastic matrix given by

IM\mathcal{I}_{M} is the identity matrix of order MM, κ(0,1]\kappa\in(0,1] is a step size parameter , dmax=max{deg(i)    i{1,,M}}d_{\text{max}}=\max\{\text{deg}(i)\;|\;i\in\{1,\dots,M\}\}, and deg(i)\text{deg}(i) is the degree of node ii. In the following, we assume without loss of generality that the eigenvalues of PP are ordered such that λ1=1>λ2...λM>1\lambda_{1}=1>\lambda_{2}\geq...\geq\lambda_{M}>-1.

In the context of social networks, the consensus algorithm (2) is referred to as the Degroot model and has been successfully used to describe evolution of opinions .

One drawback of the consensus algorithm (2) is that it does not allow for incorporating new external information. This drawback can be mitigated by adding a forcing term and the resulting algorithm is called the running consensus . Similar to (2), the running consensus updates the opinion at time tt as

where υ(t)\bm{\upsilon}(t) is the information received at time tt. In the running consensus update (4), each agent kk collects information υk(t)\upsilon_{k}(t) at time tt, adds it to its current opinion, and then averages its updated opinion with the updated opinion of its neighbors.

III Cooperative Estimation of Mean Rewards

In this section we investigate the cooperative estimation of mean rewards at each arm. To this end, we propose two running consensus algorithms for each arm and analyze their performance.

For distributed cooperative estimation of the mean reward at each arm ii, we employ two running consensus algorithms: (i) for estimation of total reward provided at the arm, and (ii) for estimation of the total number of times the arm has been sampled.

Let s^ik(t)\hat{s}_{i}^{k}(t) and n^ik(t)\hat{n}_{i}^{k}(t) be agent kk’s estimate of the total reward provided at arm ii per unit agent and the total number of times arm ii has been selected until time tt per unit agent, respectively. Using s^ik(t)\hat{s}_{i}^{k}(t) and n^ik(t)\hat{n}_{i}^{k}(t) agent kk can calculate μ^ik(t)\hat{\mu}_{i}^{k}(t), the estimated empirical mean of arm ii at time tt defined by

Let ik(t)i^{k}(t) be the arm sampled by agent kk at time tt and let ξik(t)=\mathds1(ik(t)=i)\xi_{i}^{k}(t)=\mathds{1}(i^{k}(t)=i). \mathds1()\mathds{1}(\cdot) is the indicator function, here equal to 1 if ik(t)=ii^{k}(t)=i and 0 otherwise. For simplicity of notation we define rik(t)r_{i}^{k}(t) as the realized reward at arm ii for agent kk, which is a random variable sampled from N(mi,σ2)\mathcal{N}(m_{i},\sigma^{2}), and the corresponding accumulated reward is rk(t)=rik(t)\mathds1(ik(t)=i)r^{k}(t)=r_{i}^{k}(t)\cdot\mathds{1}(i^{k}(t)=i).

The estimates s^ik(t)\hat{s}_{i}^{k}(t) and n^ik(t)\hat{n}_{i}^{k}(t) are updated using running consensus as follows

where n^i(t)\mathbf{\hat{n}}_{i}(t), s^i(t)\mathbf{\hat{s}}_{i}(t), ξi(t)\bm{\xi}_{i}(t), and ri(t)\mathbf{r}_{i}(t) are vectors of n^ik(t)\hat{n}_{i}^{k}(t), s^ik(t)\hat{s}_{i}^{k}(t), ξik(t)\xi_{i}^{k}(t), and rik(t)r_{i}^{k}(t), k{1,,M}k\in\{1,\dots,M\}, respectively, and \circ denotes element-wise multiplication (Hadamard product).

III-B Analysis of the Cooperative Estimation Algorithm

We now analyze the performance of the estimation algorithm defined by (5), (6) and (7). Let nicent(t)1Mτ=1t1Mξi(τ)n_{i}^{\text{cent}}(t)\equiv\frac{1}{M}\sum_{\tau=1}^{t}\mathbf{1}_{M}^{\top}\bm{\xi}_{i}(\tau) be the total number of times arm ii has been selected per unit agent up to and including time tt, and let sicent(t)1Mτ=1tξi(t)ri(t)s_{i}^{\text{cent}}(t)\equiv\frac{1}{M}\sum_{\tau=1}^{t}\bm{\xi}_{i}^{\top}(t)\mathbf{r}_{i}(t) be the total reward provided at arm ii per unit agent up to and including time tt. Also, let λi\lambda_{i} denote the ii-th largest eigenvalue of PP, ui\mathbf{u}_{i} the eigenvector corresponding to λi\lambda_{i}, uidu_{i}^{d} the dd-th entry of ui\mathbf{u}_{i}, and

Note that λ1=1\lambda_{1}=1 and u1=1M/M\mathbf{u}_{1}=\mathbf{1}_{M}/\sqrt{M}. Let us define

where νpjmax=max{νpj-sum,νpj+sum}\nu_{pj}^{\text{max}}=\max{\{|\nu_{pj}^{\text{-sum}}|,\nu_{pj}^{\text{+sum}}\}}. Furthermore, let

We note that both ϵn\epsilon_{n} and ϵck\epsilon_{c}^{k} depend only on the topology of the communication graph. These are measures of distributed cooperative estimation performance.

For the distributed estimation algorithm defined in (5), (6) and (7), and a doubly stochastic matrix PP defined in (3), the following statements hold

the estimate n^ik(t)\hat{n}_{i}^{k}(t) satisfies

the following inequality holds for the estimate n^ik(t)\hat{n}_{i}^{k}(t) and the sequence {ξij(τ)}τ{1,,t}\{\xi_{i}^{j}(\tau)\}_{\tau\in\{1,\dots,t\}},j{1,,M}j\in\{1,\dots,M\}

We begin with the first statement. From (6) it follows that

We now bound the kk-th entry of the second term on the right hand side of (11):

To prove the second statement, we note that

where νpwi(τ) ⁣= ⁣j=1Mupjuwjξij(τ)\nu_{pwi}(\tau)\!=\!\sum_{j=1}^{M}u_{p}^{j}u_{w}^{j}\xi_{i}^{j}(\tau).

Bounds in (13) establish the second statement. ∎

For the estimates s^ik(t)\hat{s}_{i}^{k}(t) and n^ik(t)\hat{n}_{i}^{k}(t) obtained using equations (6) and (7), the following concentration inequality holds

where δ>0\delta>0, η>0\eta>0, G(η)=(1η216)G(\eta)=(1-\frac{\eta^{2}}{16}), and ϵck\epsilon_{c}^{k} and ϵn\epsilon_{n} are defined in (10) and (8), respectively.

We begin by noting that s^ik(t)\hat{s}_{i}^{k}(t) can be decomposed as

Let s^ikp(t)=τ=1tλptτ+1j=1Mupkupjrij(τ)ξij(τ)\hat{s}_{i}^{kp}(t)=\sum_{\tau=1}^{t}\lambda_{p}^{t-\tau+1}\sum_{j=1}^{M}u_{p}^{k}u_{p}^{j}r_{i}^{j}(\tau)\xi_{i}^{j}(\tau). Then,

It follows from (15) and (16) that for any Θ>0\Theta>0

where the last line follows using the fact that conditioned on Ft1\mathcal{F}_{t-1}, ξij(t)\xi_{i}^{j}(t) is a deterministic variable and rij(t)r_{i}^{j}(t) are i.i.d. for each j{1,,M}j\in\{1,\dots,M\}. Therefore, it follows that

Using the above argument recursively with the fact that sik(0)=0s_{i}^{k}(0)=0, we obtain

Since for Gaussian random variables ϕi(β)=βmi+12σ2β2\phi_{i}(\beta)=\beta m_{i}+\frac{1}{2}\sigma^{2}\beta^{2}, we have

where the last inequality follows from the second statement of Proposition 1. Now using the Markov Inequality, we obtain

The right hand side of the above inequality contains a random variable n^ik(t)\hat{n}_{i}^{k}(t) which is dependent on the random variable on the left hand side. Therefore, we use union bounds on n^ik(t)\hat{n}_{i}^{k}(t) to obtain the desired concentration inequality. Towards this end, we consider an exponentially increasing sequence of time indices {(1+η)g1    g{1,,D}}\{(1+\eta)^{g-1}\;|\;g\in\{1,\dots,D\}\}, where D=ln(t+ϵn)ln(1+η)D=\left\lceil\frac{\ln\left(t+\epsilon_{n}\right)}{\ln\left(1+\eta\right)}\right\rceil and η>0\eta>0. For every g{1,,D}g\in\{1,\dots,D\}, define

Thus, if (1+η)g1n^ik(t)(1+η)g(1+\eta)^{g-1}\leq\hat{n}_{i}^{k}(t)\leq(1+\eta)^{g}, then

where the last inequality follows from inequality (19).

Setting σa2((1+η)14+(1+η)14)=δ\sigma\sqrt{\frac{a}{2}}\left((1+\eta)^{\frac{1}{4}}+(1+\eta)^{-\frac{1}{4}}\right)=\delta, this yields

It can be verified that the first three terms in the Taylor series for 4((1+η)14+(1+η)14)2\frac{4}{\left((1+\eta)^{\frac{1}{4}}+(1+\eta)^{-\frac{1}{4}}\right)^{2}} provide a lower bound, i.e.,

IV Cooperative Decision-Making

In this section, we extend the UCB algorithm to the distributed cooperative setting in which multiple agents can communicate with each other according to a given graph topology. Intuitively, compared to the single agent setting, in the cooperative setting each agent will be able to perform better due to communication with neighbors. However, the extent of an agent’s performance advantage depends on the network structure. We compute bounds on the performance of the group in terms of the expected group cumulative regret. We also propose a metric that orders the contribution of agents to the cumulative group regret in terms of their location in the graph.

The cooperative UCB algorithm is analogous to the UCB algorithm, and uses a modified decision-making heuristic that captures the effect of the additional information an agent receives through communication with other agents as well as the rate of information propagation through the network.

The cooperative UCB algorithm is initialized by each agent sampling each arm once and proceeds as follows. At each time tt each agent kk selects the arm with maximum Qik(t1)=μ^ik(t1)+Cik(t1)Q_{i}^{k}(t-1)=\hat{\mu}_{i}^{k}(t-1)+C_{i}^{k}(t-1), where

and receives realized reward rik(t)r_{i}^{k}(t), where γ>1\gamma>1, G(η)=(1η216)G(\eta)=(1-{\eta^{2}}{16}), and η(0,4)\eta\in(0,4). Each agent kk updates its cooperative estimate of the mean reward at each arm using the distributed cooperative estimation algorithm described in (5), (6), and (7). Note that the heuristic QikQ_{i}^{k} requires the agent kk to know ϵck\epsilon_{c}^{k}, which depends on the global graph structure. This requirement can be relaxed by replacing ϵck\epsilon_{c}^{k} with an increasing sub-logarithmic function of time. We leave rigorous analysis of the alternative policy for future investigation.

IV-B Regret Analysis of the Cooperative UCB Algorithm

We now derive a bound on the expected cumulative group regret using the distributed cooperative UCB algorithm. This bound recovers the upper bound given in (1) within a constant factor. The contribution of each agent to the group regret is a function of its location in the network.

For the cooperative UCB algorithm and the Gaussian multiarmed bandit problem the number of times a suboptimal arm ii is selected by all agents until time TT satisfies

We proceed similarly to . The number of times a suboptimal arm ii is selected by all agents until time TT is

where A>0A>0 is a constant that will be chosen later.

At a given time t+1t+1 an individual agent kk will choose a suboptimal arm only if Qik(t)Qik(t)Q_{i}^{k}(t)\geq Q_{i^{*}}^{k}(t). For this condition to be true at least one of the following three conditions must hold:

We now bound the probability that (23) holds. Applying Theorem 1, it follows for tNt\geq N (i.e., after initialization) that

It follows analogously with a slight modification to Theorem 1 that

Finally, we examine the probability that (24) holds. It follows that

From monotonicity of ln(t)\ln(t), it follows that (24) does not hold if n_{i}^{\text{cent}}(t)\geq\big{\lceil}\epsilon_{n}+\frac{8\sigma^{2}\gamma(1+\epsilon_{c}^{k})\ln(T)}{M\Delta_{i}^{2}}\big{\rceil}. Now, let A=max{M,Mϵn+k=1M8σ2γ(1+ϵck)ln(T)MΔi2}A=\max\left\{M,\lceil M\epsilon_{n}+\sum_{k=1}^{M}\frac{8\sigma^{2}\gamma(1+\epsilon_{c}^{k})\ln(T)}{M\Delta_{i}^{2}}\rceil\right\}, where the first element of the set corresponds to the selection of each arm ii once by each player during the initialization and the second element ensures that (24) does not hold. Then, it follows from (21) that

Theorem 2 provides bounds on the performance of the group as a function of the graph structure. However, the bound is dependent on the values of ϵck\epsilon_{c}^{k} for each individual agent. In this sense, ςk1/ϵck\varsigma^{k}\equiv 1/\epsilon_{c}^{k} can be thought of as a measure of node certainty in the context of explore-exploit problems. For ϵck=0\epsilon_{c}^{k}=0, the agent behaves like a centralized agent. Higher values of ϵck\epsilon_{c}^{k} reflect behavior of an agent with sparser connectivity. Rigorous connections between ϵck\epsilon_{c}^{k} and standard notions of network centralities is an interesting open problem that we leave for future investigation. \square

V Numerical Illustrations

In this section, we elucidate our theoretical analyses from the previous sections with numerical examples. We first demonstrate that the ordering on the performance of nodes obtained through numerical simulations is identical to the ordering by our upper bounds. We then investigate the effect of connectivity on the performance of agents in random graphs.

For all simulations below we consider a 1010-armed bandit problem with mean rewards as in Table I, σ=30\sigma=30, η=0\eta=0, and γ=1\gamma=1.

Consider the set of agents communicating according to the graph in Fig. 1 and using the cooperative UCB algorithm to handle the explore-exploit tradeoff in the distributed cooperative MAB problem. The values of ϵck\epsilon_{c}^{k} for nodes 1,2,3,1,2,3, and 44 are 2.31,2.31,0,2.31,2.31,0, and 5.435.43, respectively. As predicted by Remark 1, agent 33 should have the lowest regret, agents 11 and 22 should have equal and intermediate regret, and agent 44 should have the highest regret. These predictions are validated in our simulations shown in Fig. 1. The expected cumulative regret in our simulations is computed using 500500 Monte-Carlo runs.

We now explore the effect of ϵck\epsilon_{c}^{k} on the performance of an agent in an Erdös-Réyni random (ER) graph. ER graphs are a widely used class of random graphs where any two agents are connected with a given probability ρ\rho .

Consider a set of 1010 agents communicating according to an ER graph and using the cooperative UCB algorithm to handle the explore-exploit tradeoff in the aforementioned MAB problem. In our simulations, we consider 100100 connected ER graphs, and for each ER graph we compute the expected cumulative regret of agents using 3030 Monte-Carlo simulations. We show the behavior of the expected cumulative regret of each agent as a function of ςk\varsigma^{k} in Fig. 2.

It is evident that increased ςk\varsigma^{k} results in a sharp increase in performance. Conversely, low ςk\varsigma^{k} is indicative of very poor performance. This strong disparity is due to agents with lower ςk\varsigma^{k} doing a disproportionately high amount of exploration over the network, allowing other agents to exploit. The disparity is also seen in Fig. 1, as agent 44 does considerably worse than the others. Additionally, as shown in Fig. 1 the expected cumulative regret averaged over all agents is higher than the centralized (all-to-all) case.

VI Final Remarks

Here we used the distributed multi-agent MAB problem to explore cooperative decision-making in networks. We designed the cooperative UCB algorithm that achieves logarithmic regret for the group. Additionally, we investigated the performance of individual agents in the network as a function of the graph topology. We derived a node certainty measure ςk\varsigma^{k} that predicts the relative performance of the agents.

Several directions of future research are of interest. First, the arm selection heuristic designed in this paper requires some knowledge of global parameters. Relaxation of this constraint will be addressed in a future publication. Another interesting direction is to study the tradeoff between the communication and the performance. Specifically, if agents do not communicate at each time but only intermittently, then it is of interest to characterize performance as a function of the communication frequency.

References