Adaptive Shortest-Path Routing under Unknown and Stochastically Varying Link States

Keqin Liu, Qing Zhao

I Introduction

We consider a wireless network where the quality state of each link is unknown and stochastically varying over time. Specifically, the state of each link is modeled as a random cost (or reward) evolving i.i.d. over time under an unknown distribution. At each time, a path from the source to the destination is selected as the communication route and the total end-to-end cost, given by the sum of the costs of all links on the path, is subsequently observed. The cost of each individual link is not observable. The objective is to design the optimal sequential path selection policy to minimize the long-term total cost.

The above problem can be modeled as a generalized Multi-Armed Bandit (MAB) with dependent arms. In the classic MAB , there are NN independent arms and a player needs to decide which arm to play at each time. An arm, when played, offers a random cost drawn from an unknown distribution. The performance of a sequential arm selection policy is measured by regret, defined as the difference in expected total cost with respect to the optimal policy in the ideal scenario of known cost models where the player always plays the best arm. The optimal regret was shown to be logarithmic with time in . Furthermore, since cost observations from one arm do not provide information about the cost of other arms, the optimal regret grows linearly with the number of arms. We can model the adaptive routing problem as an MAB by treating each path as an arm. The difference is that paths are dependent through shared links. While the dependency across paths can be ignored in learning and the policies for the classic MAB directly apply, such a naive approach yields poor performance with a regret growing linearly with the number of paths, thus exponentially with the network size (i.e., the number of links in the worst case).

In this paper, we show that by exploiting the structure of the path dependencies, a regret polynomial with network size can be achieved while preserving its optimal logarithmic order with time. Specifically, we propose an algorithm that achieves an O(d3logT)O(d^{3}\log T) regret for all light-tailed cost distributions, where dmd\leq m is the dimension of the path set, mm the number of links, and TT the length of time horizon. We further show that a modification to the proposed algorithm leads to a regret linear with dd by sacrificing an arbitrarily small regret order with time. This result allows a performance tradeoff in terms of the network size and the time horizon. For example, the algorithm with a smaller regret order in dd can perform better over a finite time horizon when the network size is large. For heavy-tailed cost distributions, a regret linear with the number of edges and sublinear with time can be achieved. We point out that any regret sublinear with time implies the convergence of the time-average cost to the minimum one of the best path.

We further generalize the adaptive routing problem to stochastic online linear optimization problems, where the action space is a compact subset of Rd{\cal R}^{d} and the random cost function is linear on the actions. We show that for all light-tailed cost functions, regret polynomial or linear with dd and sublinear with time can be achieved. We point out that for cases where there exists a nonzero gap in the expected cost between the optimal action and the rest of the action space (e.g., when the action set is a polytope or finite), the same regret orders obtained for the adaptive routing problem can be achieved.

I-B Applications

One application example is adaptive routing in cognitive radio networks where secondary users communicate by exploiting channels temporarily unoccupied by primary users. In this case, the availability of each link dynamically varies according to the communication activities of nearby primary users. The delay on each link can thus be modeled as a stochastic process unknown to the secondary users. The objective is to route through the path with the smallest latency (i.e., the lightest primary traffic) through stochastic online learning.

Other applications include ad hoc networks where link states vary stochastically due to channel fading or random contentions with other users.

I-C Related Work

The classic MAB was addressed by Lai and Robbins in 1985, where they showed that the minimum regret has a logarithmic order with time and proposed specific policies to asymptotically achieve the optimal regret for several cost distributions in the light-tailed family . Since Lai and Robbins’s seminar work, simpler index policies were proposed for different classes of light-tailed cost distributions to achieve the logarithmic regret order with time . In , we proposed the DSEE approach that achieves the logarithmic regret order with time for all light-tailed cost distributions and sublinear regrets with time for heavy-tailed cost distributions. All regrets established in grow linearly with the number of arms. This work builds upon the general DSEE structure proposed in . By incorporating the exploitation of the arm dependencies into DSEE, we show that the learning efficiency can be significant improved in terms of network size while preserved in terms of time.

This work generalizes the previous work that assumes fully observable link costs on a chosen path. In particular, the adaptive routing problem under this more informative observation model was considered in and an algorithm was proposed to achieve O(m4logT)O(m^{4}\log T) regret, where mm is the number of links in the network. The problems considered in this paper were also studied under an adversarial bandit model in which the cost functions are chosen by an adversary and are treated as arbitrary bounded deterministic quantities . Algorithms were proposed to achieve regrets sublinear with time and polynomial with network size. The problem formulation and results established in this paper can be considered as a stochastic version of those in .

For the special class of stochastic online linear optimization problems where there exists a nonzero gap in expected cost between the optimal action and the rest of the action space and the cost has a finite support, an algorithm was proposed in to achieve an O(d2log3T)O(d^{2}\log^{3}T) regret given that a nontrivial lower bound on the gap is known. The algorithm proposed in this paper achieves a better regret in terms of both dd and TT without any knowledge on the cost model. For the general case with finite-support cost distributions, the regret was shown to be lower bounded by O(dT)O(d\sqrt{T}) and an efficient algorithm was proposed to achieve an O((dlnT)3/2T)O((d\ln T)^{3/2}\sqrt{T}) regret . Compared to the algorithm in , our algorithm for the general stochastic online linear optimization problems performs worse in TT but better in dd.

II Problem Statement

In this section, we address the adaptive routing problem under unknown and stochastically time-varying link states. Consider a network with a source ss and a destination rr. Let G=(V,E)G=(V,E) denote the directed graph consisting of all simple paths from ss to rr. Let mm and nn denote, respectively, the number of edges and vertices in graph GG.

At each time tt, a random weight/cost We(t)W_{e}(t) drawn from an unknown distribution is assigned to each edge in EE. We assume that {We(t)}\{W_{e}(t)\} are i.i.d. over time for each edge ee. At the beginning of each time slot tt, a path pnP (i{1,2,,P})p_{n}\in{\cal P}~{}(i\in\{1,2,\ldots,|{\cal P}|\}) from ss to rr is chosen, where P{\cal P} is the set of all paths from ss to rr in GG. Subsequently, the total end-to-end cost Ct(pn)C_{t}(p_{n}) of the path pnp_{n}, given by the sum of the weights of all edges on the path, is revealed in the end of the slot. We point out that the individual cost on each edge eEe\in E is unobservable.

The regret of a path selection policy π\pi is defined as the expected extra cost incurred over time TT compared to the optimal policy that always selects the best path (i.e., the path with the minimum expected total end-to-end cost). The objective is to design a path selection policy π\pi to minimize the growth rate of the regret. Let σ\sigma be a permutation on all paths such that

where Ct(π)C_{t}(\pi) denotes the total end-to-end cost of the selected path under policy π\pi at time tt.

III Adaptive Shortest Path Routing Algorithm

In this section, we present the proposed adaptive shortest-path routing algorithm. We consider both light-tailed and heavy-tailed cost distributions.

We consider the light-tailed cost distributions as defined below.

A random variable XX is light-tailed if its moment-generating function exists, i.e., there exists a u0>0u_{0}>0 such that

For a zero-mean light-tailed random variable XX, we have ,

where M(2)()M^{(2)}(\cdot) denotes the second derivative of M()M(\cdot) and u0u_{0} the parameter specified in Definition 1. From (1), we have the following extended Chernoff-Hoeffding bound on the deviation of the sample mean from the true mean for light-tailed random variables .

Note that the path cost is the sum of the costs of all edges on the path. Since the number of edges in a path is upper bounded by mm, the bound on the moment generating function in (1) holds on the path cost by replacing ζ\zeta by mζm\zeta, and so does the Chernoff-Hoeffding bound in (2).

In the following, we propose an algorithm to achieve a regret polynomial with mm and logarithmic with TT. We first represent each path pnp_{n} as a vector pn\vec{p}_{n} with mm entries consisting of s and 11s representing whether or not an edge is on the path. The vector space of all paths is embedded in a dd-dimensional (dmd\leq m) subspaceIf graph GG is acyclic, then d=mn+2d=m-n+2. of Rm{\cal R}^{m}. The cost on each path pnp_{n} at time tt is thus given by the linear function

The general structure of the algorithm follows the DSEE framework established in for the classic MAB. More specifically, we partition time into an exploration sequence and an exploration sequence. In the exploration sequence, we sample the dd basis vectors (barycentric spanner as defined in the following) evenly to estimate the qualities of these vectors. In the exploitation sequence, we select the action estimated as the best by linearly interpolating the estimated qualities of the basis vectors. A detailed algorithm is given in Fig. 1.

A set B={x1,,xd}B=\{x_{1},\ldots,x_{d}\} is a barycentric spanner for a dd-dimensional set AA if every xAx\in A can be expressed as a linear combination of elements of BB using coefficients in $$.

Note that a compact subset of Rd{\cal R}^{d} always has a barycentric spanner, which can be constructed efficiently (see an algorithm in ). For the problem at hand, the path set P{\cal P} is a compact subset of Rd{\cal R}^{d} with dimension dmd\leq m. We can thus construct a barycentric spanner consisting of dd paths in P{\cal P} by using the algorithm in .

Construct an exploration sequence as follows. Let a,ζ,u0a,\zeta,u_{0} be the constants such that (2) holds on each edge cost. Choose a constant b>2m/ab>2m/a, a constant

and a constant wmax{b/(mdζu0)2,4b/c2}w\geq\max\{b/(md\zeta u_{0})^{2},4b/c^{2}\}. For each t>1t>1, if A(t1)<dd2wlogt|{\cal A}(t-1)|<d\lceil d^{2}w\log t\rceil, then include tt in A(t){\cal A}(t). Under this exploration sequence, the resulting policy π\pi^{*} has regret

for some constant AA independent of dd, mm and T>1T>1.

Since A(T)dd2wlogT{\cal A}(T)\leq d\lceil d^{2}w\log T\rceil, the regret caused in the exploration sequence is at the order of md3logTmd^{3}\log T. Now we consider the regret caused in the exploitation sequence. Let EkE_{k} denote the kkth exploitation period which is the kkth contiguous segment in the exploitation sequence. Let EkE_{k} denote the kkth exploitation period. Similar to the proof of Theorem 3.1 in , we have

for some constant hh independent of dd and mm. Let tk>1t_{k}>1 denote the starting time of the kkth exploitation period. Next, we show that by applying the Chernoff-Hoeffding bound in (2) on the path cost, for any tt in the kkth exploitation period and i=1,,di=1,\ldots,d, we have

To show this, we define the parameter ϵi(t)=Δblogt/τi(t)\epsilon_{i}(t){\,\stackrel{{\scriptstyle\Delta}}{{=}}}\,\sqrt{b\log t/\tau_{i}(t)}, where τi(t)\tau_{i}(t) is the number of times that path ii has been sampled up to time tt. From the definition of parameter bb, we have

Applying the Chernoff-Hoeffding bound, we arrive at

In the exploitation sequence, the expected times that at least one path in BB has a sample mean deviating from its true mean cost by c/(2d)c/(2d) is thus bounded by

for some constant gg independent of dd and mm. Based on the property of the barycentric spanner, the best path would not be selected in the exploitation sequence only if one of the basis vector in BB has a sample mean deviating from its true mean cost by at least c/(2d)c/(2d). We thus proved the theorem. ∎

In Theorem 1, we need a lower bound (parameter cc) on the difference in the cost means of the best and the second best paths. We also need to know the bounds on parameters ζ\zeta and u0u_{0} such that the Chernoff-Hoeffding bound (2) holds. These bounds are required in defining ww that specifies the minimum leading constant of the logarithmic cardinality of the exploration sequence necessary for identifying the best path. Similar to , we can show that without any knowledge of the cost models, increasing the cardinality of the exploration sequence of π\pi^{*} by an arbitrarily small amount leads to a regret linear with dd and arbitrarily close to the logarithmic order with time.

Let f(t)f(t) be any positive increasing sequence with f(t)f(t)\rightarrow\infty as tt\rightarrow\infty. Revise policy π\pi^{*} in Theorem 1 as follows: include t (t>1)t~{}(t>1) in A(t){\cal A}(t) if A(t1)<df(t)logt|{\cal A}(t-1)|<d\lceil f(t)\log t\rceil. Under the revised policy π\pi^{\prime}, we have

It is sufficient to show that the regret caused in the exploitation sequence is bounded by O(d)O(d), independent of TT. Since the exploration sequence is denser than the logarithmic order as in Theorem 1, it is not difficult to show that the bound on Ek|E_{k}| given in (3) still holds with a different value of hh.

We consider any positive increasing sequence b(t)b(t) such that b(t)=o(f(t))b(t)=o(f(t)) and b(t)b(t)\rightarrow\infty as tt\rightarrow\infty. By replacing bb in the proof of Theorem 1 with b(t)b(t), we notice that after some finite time T0T_{0}, the parameter ϵi(t)\epsilon_{i}(t) will be small enough to ensure (4) holds and b(t)b(t) will be large enough to ensure (5) holds. The proof thus follows. ∎

III-B Heavy-Tailed Cost Distributions

We now consider the heavy-tailed cost distributions where the moment of the cost exists up to the qq-th (q>1q>1) order. From , we have the following bound on the deviation of the sample mean from the true mean for heavy-tailed cost distributions.

Construct an exploration sequence as follows. Choose a constant v>0v>0. For each t>1t>1, if A(t1)<vt1/q|{\cal A}(t-1)|<vt^{1/q}, then include tt in A(t){\cal A}(t). Under this exploration sequence, the resulting policy πq\pi^{q} has regret Rπq(T)DdT1/qR^{\pi^{q}}(T)\leq DdT^{1/q} for some constant DD independent of dd and TT.

Based on the construction of the exploration sequence, it is sufficient to show that the regret in the exploitation sequence is o(T1/q)do(T^{1/q})\cdot d. From (6), we have, for any i=1,,di=1,\ldots,d,

For any exploitation slot tA(t)t\in{\cal A}(t), we have A(t)vt1/q|{\cal A}(t)|\geq vt{1/q}. We arrive at

Since the best path will not be chosen only if at least one of the basis vector has the sample deviating from the true mean by c/(2d)c/(2d), the regret in the exploitation sequence is thus bounded by

IV Generalization to Stochastic Online Linear Optimization Problems

where x(π)x(\pi) denotes the action under policy π\pi at time tt.

As a special case, the action space of the adaptive routing problem consists of all path vectors with dimension dmd\leq m. The main difficulty in a general SOLO problem is on identifying the optimal action in an infinite action set. Our basic approach is to implement the shortest-path algorithm in Fig. 1 repeatedly for the chosen action to gradually converge to the optimal action. Specifically, the proposed algorithm runs under an epoch structure where each epoch simulates one round of the algorithm and the epoch lengths form a geometric progression.

Assume that the cost for each action is light-tailed. Let TkT_{k} be the length of the kk-th epoch. Construct an exploration sequence in this epoch as follows. Let a,ζ,u0a,\zeta,u_{0} be the constants such that (2) holds on each cost. Choose a constant b>2/ab>2/a, a constant c=logTk1/3/Tk1/3c=\log T_{k}^{1/3}/T_{k}^{1/3}, and a constant wmax{b/(dζu0)2,4b/c2}w\geq\max\{b/(d\zeta u_{0})^{2},4b/c^{2}\}. For each t>1t>1, if A(t1)<dd2wlogt|{\cal A}(t-1)|<d\lceil d^{2}w\log t\rceil, then include tt in A(t){\cal A}(t). The resulting policy πg\pi^{g} has regret

We point out that the regret order can be improved to linear with dd by sacrificing the order with TT by an arbitrarily small amount, as similar to Theorem 2.

V Conclusion

In this paper, we considered the adaptive routing problem in networks with unknown and stochastically varying link states, where only the total end-to-end cost of a path is observable after the path is selected for routing. For both light-tailed and heavy-tailed link-state distributions, we proposed efficient online learning algorithms to minimize the regret in terms of both time and network size. The result was further extended to the general stochastic online liner optimization problems. Future work includes extending the i.i.d. cost evolution over time to more general stochastic processes.

References