Learning to Search Better Than Your Teacher

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé, John Langford

Introduction

In structured prediction problems, a learner makes joint predictions over a set of interdependent output variables and observes a joint loss. For example, in a parsing task, the output is a parse tree over a sentence. Achieving optimal performance commonly requires the prediction of each output variable to depend on neighboring variables. One approach to structured prediction is learning to search (l2s) (Collins & Roark, 2004; Daumé III & Marcu, 2005; Daumé III et al., 2009; Ross et al., 2011; Doppa et al., 2014; Ross & Bagnell, 2014), which solves the problem by:

converting structured prediction into a search problem with specified search space and actions;

defining structured features over each state to capture the interdependency between output variables;

constructing a reference policy based on training data;

learning a policy that imitates the reference policy.

Empirically, l2s approaches have been shown to be competitive with other structured prediction approaches both in accuracy and running time (see e.g. Daumé III et al. (2014)). Theoretically, existing l2s algorithms guarantee that if the learning step performs well, then the learned policy is almost as good as the reference policy, implicitly assuming that the reference policy attains good performance. Good reference policies are typically derived using labels in the training data, such as assigning each word to its correct POS tag. However, when the reference policy is suboptimal, which can arise for reasons such as computational constraints, nothing can be said for existing approaches.

This problem is most obviously manifest in a “structured contextual bandit”The key difference from (1) contextual bandits is that the action space is exponentially large (in the length of trajectories in the search space); and from (2) reinforcement learning is that a baseline reference policy exists before learning starts. setting. For example, one might want to predict how the landing page of a high profile website should be displayed; this involves many interdependent predictions: items to show, position and size of those items, font, color, layout, etc. It may be plausible to derive a quality signal for the displayed page based on user feedback, and we may have access to a reasonable reference policy (namely the existing rule-based system that renders the current web page). But, applying l2s techniques results in nonsense—learning something almost as good as the existing policy is useless as we can just keep using the current system and obtain that guarantee. Unlike the full feedback settings, label information is not even available during learning to define a substantially better reference. The goal of learning here is to improve upon the current system, which is most likely far from optimal. This naturally leads to the question: is learning to search useless when the reference policy is poor?

This is the core question of the paper, which we address first with a new l2s algorithm, LOLS (Locally Optimal Learning to Search) in Section 2. LOLS operates in an online fashion and achieves a bound on a convex combination of regret-to-reference and regret-to-own-one-step-deviations. The first part ensures that good reference policies can be leveraged effectively; the second part ensures that even if the reference policy is very sub-optimal, the learned policy is approximately “locally optimal” in a sense made formal in Section 3.

LOLS operates according to a general schematic that encompases many past l2s algorithms (see Section 2), including Searn (Daumé III et al., 2009), DAgger (Ross et al., 2011) and AggreVaTe (Ross & Bagnell, 2014). A secondary contribution of this paper is a theoretical analysis of both good and bad ways of instantiating this schematic under a variety of conditions, including: whether the reference policy is optimal or not, and whether the reference policy is in the hypothesis class or not. We find that, while past algorithms achieve good regret guarantees when the reference policy is optimal, they can fail rather dramatically when it is not. LOLS, on the other hand, has superior performance to other l2s algorithms when the reference policy performs poorly but local hill-climbing in policy space is effective. In Section 5, we empirically confirm that LOLS can significantly outperform the reference policy in practice on real-world datasets.

In Section 4 we extend LOLS to address the structured contextual bandit setting, giving a natural modification to the algorithm as well as the corresponding regret analysis.

The algorithm LOLS, the new kind of regret guarantee it satisfies, the modifications for the structured contextual bandit setting, and all experiments are new here.

Learning to Search

An optimal policy chooses the action leading to the minimal expected loss at each state. For losses decomposable over the states in a trajectory, generating an optimal policy is trivial given ${\boldsymbol{y}}^{*}$ (e.g., the sequence tagging example in (Daumé III et al., 2009)). In general, finding the optimal action at states not in the optimal trajectory can be tricky (e.g., (Goldberg & Nivre, 2013; Goldberg et al., 2014)).

Finally, like most other l2s algorithms, LOLS assumes access to a cost-sensitive classification algorithm. A cost-sensitive classifier predicts a label $\hat{y}$ given an example ${\boldsymbol{x}}$ , and receives a loss ${\boldsymbol{c}}_{{\boldsymbol{x}}}(\hat{y})$ , where ${\boldsymbol{c}}_{{\boldsymbol{x}}}$ is a vector containing the cost for each possible label. In order to perform online updates, we assume access to a no-regret online cost-sensitive learner, which we formally define below.

Given a hypothesis class $\mathcal{H}~{}:~{}\mathcal{X}\rightarrow[K]$ , the regret of an online cost-sensitive classification algorithm which produces hypotheses $h_{1},\ldots,h_{M}$ on cost-sensitive example sequence $\{({\boldsymbol{x}}_{1},{\boldsymbol{c}}_{1}),\ldots,({\boldsymbol{x}}_{M},{\boldsymbol{c}}_{M})\}$ is

Such no-regret guarantees can be obtained, for instance, by applying the SECOC technique (Langford & Beygelzimer, 2005) on top of any importance weighted binary classification algorithm that operates in an online fashion, examples being the perceptron algorithm or online ridge regression.

where $e(a)$ is the end state reached with rollout by ${\pi_{i}^{\text{out}}}$ after taking action $a$ in state $s_{t}$ . LOLS collects the $T$ examples from the different roll-out points and feeds the set of examples $\Gamma$ into an online cost-sensitive multiclass learner, thereby updating the learned policy from $\hat{\pi}_{i}$ to $\hat{\pi}_{i+1}$ . By default, we use the learned policy $\hat{\pi}_{i}$ for roll-in and a mixture policy for roll-out. For each roll-out, the mixture policy either executes ${\pi^{\text{ref}}}$ to an end-state with probability $\beta$ or $\hat{\pi}_{i}$ with probability $1-\beta$ . LOLS converts into a batch algorithm with a standard online-to-batch conversion where the final model $\bar{\pi}$ is generated by averaging $\hat{\pi}_{i}$ across all rounds (i.e., picking one of $\hat{\pi}_{1},\ldots\hat{\pi}_{N}$ uniformly at random).

Theoretical Analysis

In this section, we analyze LOLS and answer the questions raised in Section 1. Throughout this section we use $\bar{\pi}$ to denote the average policy obtained by first choosing $n\in[1,N]$ uniformly at random and then acting according to $\pi_{n}$ .We begin with discussing the choices of roll-in and roll-out policies. Table 1 summarizes the results of using different strategies for roll-in and roll-out.

An obvious bad choice is roll-in and roll-out with the learned policy, because the learner is blind to the reference policy. It reduces the structured learning problem to a reinforcement learning problem, which is much harder. To build intuition, we show two other bad cases.

Roll-in with ${\pi^{\text{ref}}}$ is bad. Roll-in with a reference policy causes the state distribution to be unrealistically good. As a result, the learned policy never learns to correct for previous mistakes, performing poorly when testing. A related discussion can be found at Theorem 2.1 in (Ross & Bagnell, 2010). We show a theorem below.

We demonstrate examples where the claim is true.

We start with the case where ${\pi_{i}^{\text{out}}}={\pi_{i}^{\text{in}}}={\pi^{\text{ref}}}$ . In this case, suppose we have one structured example, whose search space is defined as in Figure 3(a). From state $s_{1}$ , there are two possible actions: $a$ and $b$ (we will use actions and features interchangeably since features uniquely identify actions here); the (optimal) reference policy takes action $a$ . From state $s_{2}$ , there are again two actions ( $c$ and $d$ ); the reference takes $c$ . Finally, even though the reference policy would never visit $s_{3}$ , from that state it chooses action $f$ . When rolling in with ${\pi^{\text{ref}}}$ , the cost-sensitive examples are generated only at state $s_{1}$ (if we take a one-step deviation on $s_{1}$ ) and $s_{2}$ but never at $s_{3}$ (since that would require a two deviations, one at $s_{1}$ and one at $s_{3}$ ). As a result, we can never learn how to make predictions at state $s_{3}$ . Furthermore, under a rollout with ${\pi^{\text{ref}}}$ , both actions from state $s_{1}$ lead to a loss of zero. The learner can therefore learn to take action $c$ at state $s_{2}$ and $b$ at state $s_{1}$ , and achieve zero cost-sensitive regret, thereby “thinking” it is doing a good job. Unfortunately, when this policy is actually run, it performs as badly as possible (by taking action $e$ half the time in $s_{3}$ ), which results in the large structured regret.

Next we consider the case where ${\pi_{i}^{\text{out}}}$ is either the learned policy or a mixture with ${\pi^{\text{ref}}}$ . When applied to the example in Figure 3(b), our feature representation is not expressive enough to differentiate between the two actions at state $s_{1}$ , so the learned policy can do no better than pick randomly between the top and bottom branches from this state. The algorithm either rolls in with ${\pi^{\text{ref}}}$ on $s_{1}$ and generates a cost-sensitive example at $s_{2}$ , or generates a cost-sensitive example on $s_{1}$ and then completes a roll out with ${\pi_{i}^{\text{out}}}$ . Crucially, the algorithm still never generates a cost-sensitive example at the state $s_{3}$ (since it would have already taken a one-step deviation to reach $s_{3}$ and is constrained to do a roll out from $s_{3}$ ). As a result, if the learned policy were to choose the action $e$ in $s_{3}$ , it leads to a zero cost-sensitive regret but large structured regret.∎

Despite these negative results, rolling in with the learned policy is robust to both the above failure modes. In Figure 3(a), if the learned policy picks action $b$ in state $s_{1}$ , then we can roll in to the state $s_{3}$ , then generate a cost-sensitive example and learn that $f$ is a better action than $e$ . Similarly, we also observe a cost-sensitive example in $s_{3}$ in the example of Figure 3(b), which clearly demonstrates the benefits of rolling in with the learned policy as opposed to ${\pi^{\text{ref}}}$ .

Roll-out with ${\pi^{\text{ref}}}$ is bad if ${\pi^{\text{ref}}}$ is not optimal. When the reference policy is not optimal or the reference policy is not in the hypothesis class, roll-out with ${\pi^{\text{ref}}}$ can make the learner blind to compounding errors. The following theorem holds. We state this in terms of “local optimality”: a policy is locally optimal if changing any one decision it makes never improves its performance.

Suppose we have only one structured example, whose search space is defined as in Figure 3(c) and the reference policy chooses $a$ or $c$ depending on the node. If we roll-out with ${\pi^{\text{ref}}}$ , we observe expected losses 1 and $1+\epsilon$ for actions $a$ and $b$ at state $s_{1}$ , respectively. Therefore, the policy with zero cost-sensitive classification regret chooses actions $a$ and $d$ depending on the node. However, a one step deviation ( $a\rightarrow b$ ) does radically better and can be learned by instead rolling out with a mixture policy. ∎

The above theorems show the bad cases and motivate a good l2s algorithm which generates a learned policy that competes with the reference policy and deviations from the learned policy. In the following section, we show that Algorithm 1 is such an algorithm.

2 Regret Guarantees

Let $Q^{\pi}(s_{t},a)$ represent the expected loss of executing action $a$ at state $s_{t}$ and then executing policy $\pi$ until reaching an end state. $T$ is the number of decisions required before reaching an end state. For notational simplicity, we use $Q^{\pi}(s_{t},\pi^{\prime})$ as a shorthand for $Q^{\pi}(s_{t},\pi^{\prime}(s_{t}))$ , where $\pi^{\prime}(s_{t})$ is the action that $\pi^{\prime}$ takes at state $s_{t}$ . Finally, we use $d^{t}_{\pi}$ to denote the distribution over states at time $t$ when acting according to the policy $\pi$ . The expected loss of a policy is:

for any $t\in[0,T]$ . In words, this is the expected cost of rolling in with $\pi$ up to some time $t$ , taking $\pi$ ’s action at time $t$ and then completing the roll out with $\pi$ .

Our main regret guarantee for Algorithm 1 shows that LOLS minimizes a combination of regret to the reference policy ${\pi^{\text{ref}}}$ and regret its own one-step deviations. In order to concisely present the result, we present an additional definition which captures the regret of our approach:

where ${\pi_{i}^{\text{out}}}=\beta{\pi^{\text{ref}}}+(1-\beta)\hat{\pi}_{i}$ is the mixture policy used to roll-out in Algorithm 1. With these definitions in place, we can now state our main result for Algorithm 1.

Let $\delta_{N}$ be as defined in Equation 4. The averaged policy $\bar{\pi}$ generated by running $N$ steps of Algorithm 1 with a mixing parameter $\beta$ satisfies

It might appear that the LHS of the theorem combines one term which is constant to another scaling with $T$ . We point the reader to Lemma 1 in the appendix to see why the terms are comparable in magnitude. Note that the theorem does not assume anything about the quality of the reference policy, and it might be arbitrarily suboptimal. Assuming that Algorithm 1 uses a no-regret cost-sensitive classification algorithm (recall Definition 1), the first term in the definition of $\delta_{N}$ converges to

This observation is formalized in the next corollary.

Suppose we use a no-regret cost-sensitive classifier in Algorithm 1. As $N\rightarrow\infty$ , $\delta_{N}\rightarrow{\delta_{\mbox{\small class}}}$ , where

In order to avoid this asymptotic gap, it seems desirable to have regrets to reference policy and one-step deviations controlled individually, which is equivalent to having the guarantee of Theorem 3 for all values of $\beta$ in $$ rather than a specific one. As we show in the next section, guaranteeing a regret bound to one-step deviations when the reference policy is arbitrarily bad is rather tricky and can take an exponentially long time. Understanding structures where this can be done more tractably is an important question for future research. Nevertheless, the result of Theorem 3 has interesting consequences in several settings, some of which we discuss next.

The second term on the left in the theorem is always non-negative by definition, so the conclusion of Theorem 3 is at least as powerful as existing regret guarantee to reference policy when $\beta=1$ . Since the previous works in this area (Daumé III et al., 2009; Ross et al., 2011; Ross & Bagnell, 2014) have only studied regret guarantees to the reference policy, the quantity we’re studying is strictly more difficult.

The asymptotic regret incurred by using a mixture policy for roll-out might be larger than that using the reference policy alone, when the reference policy is near-optimal. How the combination of these factors manifests in practice is empirically evaluated in Section 5.

When the reference policy is optimal, the first term is non-negative. Consequently, the theorem demonstrates that our algorithm competes with one-step deviations in this case. This is true irrespective of whether ${\pi^{\text{ref}}}$ is in the policy class $\Pi$ or not.

When the reference policy is very suboptimal, then the first term can be negative. In this case, the regret to one-step deviations can be large despite the guarantee of Theorem 3, since the first negative term allows the second term to be large while the sum stays bounded. However, when the first term is significantly negative, then the learned policy has already improved upon the reference policy substantially! This ability to improve upon a poor reference policy by using a mixture policy for rolling out is an important distinction for Algorithm 1 compared with previous approaches.

Overall, Theorem 3 shows that the learned policy is either competitive with the reference policy and nearly locally optimal, or improves substantially upon the reference policy.

3 Hardness of local optimality

In this section we demonstrate that the process of reaching a local optimum (under one-step deviations) can be exponentially slow when the initial starting policy is arbitrary. This reflects the hardness of learning to search problems when equipped with a poor reference policy, even if local rather than global optimality is considered a yardstick. We establish this lower bound for a class of algorithms substantially more powerful than LOLS. We start by defining a search space and a policy class. Our search space consists of trajectories of length $T$ , with $2$ actions available at each step of the trajectory. We use and $1$ to index the two actions. We consider policies whose only feature in a state is the depth of the state in the trajectory, meaning that the action taken by any policy $\pi$ in a state $s_{t}$ depends only on $t$ . Consequently, each policy can be indexed by a bit string of length $T$ . For instance, the policy $0100\ldots 0$ executes action in the first step of any trajectory, action $1$ in the second step and at all other levels. It is easily seen that two policies are one-step deviations of each other if the corresponding bit strings have a Hamming distance of 1.

To establish a lower bound, consider the following powerful algorithmic pattern. Given a current policy $\pi$ , the algorithm examines the cost $J(\pi^{\prime})$ for all the one-step deviations $\pi^{\prime}$ of $\pi$ . It then chooses the policy with the smallest cost as its new learned policy. Note that access to the actual costs $J(\pi)$ makes this algorithm more powerful than existing l2s algorithms, which can only estimate costs of policies through rollouts on individual examples. Suppose this algorithm starts from an initial policy $\hat{\pi}_{0}$ . How long does it take for the algorithm to reach a policy $\hat{\pi}_{i}$ which is locally optimal compared with all its one-step deviations? We next present a lower bound for algorithms of this style.

Consider any algorithm which updates policies only by moving from the current policy to a one-step deviation. Then there is a search space, a policy class and a cost function where the any such algorithm must make $\Omega(2^{T})$ updates before reaching a locally optimal policy. Specifically, the lower bound also applies to Algorithm 1.

The result shows that competing with the seemingly reasonable benchmark of one-step deviations may be very challenging from an algorithmic perspective, at least without assumptions on the search space, policy class, loss function, or starting policy. For instance, the construction used to prove Theorem 4 does not apply to Hamming loss.

Structured Contextual Bandit

We now show that a variant of LOLS can be run in a “structured contextual bandit” setting, where only the loss of a single structured label can be observed. As mentioned, this setting has applications to webpage layout, personalized search, and several other domains.

Our approach is based on the $\epsilon$ -greedy algorithm which is a common strategy in partial feedback problems. Upon receiving an example ${\boldsymbol{x}}_{i}$ , the algorithm randomly chooses whether to explore or exploit on this example. With probability $1-\epsilon$ , the algorithm chooses to exploit and follows the recommendation of the current learned policy. With the remaining probability, the algorithm performs a randomized variant of the LOLS update. A detailed description is given in Algorithm 2.

We assess the algorithm’s performance via a measure of regret, where the comparator is a mixture of the reference policy and the best one-step deviation. Let $\bar{\pi}_{i}$ be the averaged policy based on all policies in $\mathcal{I}$ at round $i$ . ${\boldsymbol{y}}_{ie}$ is the predicted label in either step 9 or step 14 of Algorithm 2. The average regret is defined as:

Recalling our earlier definition of $\delta_{i}$ (4), we bound on the regret of Algorithm 2 with a proof in the appendix.

Algorithm 2 with parameter $\epsilon$ satisfies:

With a no-regret learning algorithm, we expect

where $|\Pi|$ is the cardinality of the policy class. This leads to the following corollary with a proof in the appendix.

In the setup of Theorem 5, suppose further that the underlying no-regret learner satisfies (5). Then with probability at least $1-2/(N^{5}K^{2}T^{2}\log(N|\Pi|))^{3}$ ,

Experiments

This section shows that LOLS is able to improve upon a suboptimal reference policy and provides empirical evidence to support the analysis in Section 3. We conducted experiments on the following three applications.

Cost-Sensitive Multiclass classification. For each cost-sensitive multiclass sample, each choice of label has an associated cost. The search space for this task is a binary search tree. The root of the tree corresponds to the whole set of labels. We recursively split the set of labels in half, until each subset contains only one label. A trajectory through the search space is a path from root-to-leaf in this tree. The loss of the end state is defined by the cost. An optimal reference policy can lead the agent to the end state with the minimal cost. We also show results of using a bad reference policy which arbitrarily chooses an action at each state. The experiments are conducted on KDDCup 99 datasethttp://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html generated from a computer network intrusion detection task. The dataset contains $5$ classes, $4,898,431$ training and $311,029$ test instances.

Part of speech tagging. The search space for POS tagging is left-to-right prediction. Under Hamming loss the trivial optimal reference policy simply chooses the correct part of speech for each word. We train on $38k$ sentences and test on $11k$ from the Penn Treebank (Marcus et al., 1993). One can construct suboptimal or even bad reference policies, but under Hamming loss these are all equivalent to the optimal policy because roll-outs by any fixed policy will incur exactly the same loss and the learner can immediately learn from one-step deviations.

Dependency parsing. A dependency parser learns to generate a tree structure describing the syntactic dependencies between words in a sentence (McDonald et al., 2005; Nivre, 2003). We implemented a hybrid transition system (Kuhlmann et al., 2011) which parses a sentence from left to right with three actions: Shift, ReduceLeft and ReduceRight. We used the “non-deterministic oracle” (Goldberg & Nivre, 2013) as the optimal reference policy, which leads the agent to the best end state reachable from each state. We also designed two suboptimal reference policies. A bad reference policy chooses an arbitrary legal action at each state. A suboptimal policy applies a greedy selection and chooses the action which leads to a good tree when it is obvious; otherwise, it arbitrarily chooses a legal action. (This suboptimal reference was the default reference policy used prior to the work on “non-deterministic oracles.”) We used data from the Penn Treebank Wall Street Journal corpus: the standard data split for training (sections 02-21) and test (section 23). The loss is evaluated in UAS (unlabeled attachment score), which measures the fraction of words that pick the correct parent.

For each task and each reference policy, we compare 6 different combinations of roll-in (learned or reference) and roll-out (learned, mixture or reference) strategies. We also include Searn in the comparison, since it has notable differences from LOLS. Searn rolls in and out with a mixture where a different policy is drawn for each state, while LOLS draws a policy once per example. Searn uses a batch learner, while LOLS uses online. The policy in Searn is a mixture over the policies produced at each iteration. For LOLS, it suffices to keep just the most recent one. It is an open research question whether an analogous theoretical guarantee of Theorem 3 can be established for Searn.

Our implementation is based on Vowpal Wabbithttp://hunch.net/~vw/, a machine learning system that supports online learning and l2s. For LOLS’s mixture policy, we set $\beta=0.5$ . We found that LOLS is not sensitive to $\beta$ , and setting $\beta$ to be 0.5 works well in practice. For Searn, we set the mixture parameter to be $1-(1-\alpha)^{t}$ , where $t$ is the number of rounds and $\alpha=10^{-5}$ . Unless stated otherwise all the learners take 5 passes over the data.

Tables 2, 3 and 4 show the results on cost-sensitive multiclass classification, POS tagging and dependency parsing, respectively. The empirical results qualitatively agree with the theory. Rolling in with reference is always bad. When the reference policy is optimal, then doing roll-outs with reference is a good idea. However, when the reference policy is suboptimal or bad, then rolling out with reference is a bad idea, and mixture rollouts perform substantially better. LOLS also significantly outperforms Searn on all tasks.

Proofs of Main Results

Let $\pi^{t}$ be a policy that executes $\pi_{1}$ in the first $t$ steps and then executes $\pi_{2}$ from time steps $t+1$ to $T$ . We have $J(\pi_{1})=J(\pi^{T})$ and $J(\pi_{2})=J(\pi^{0})$ . Consequently, we can set up the telescoping sum:

The second equality in the lemma can be obtained by reversing the roles of $\pi_{1}$ and $\pi_{2}$ above. ∎

We start with an application of Lemma 1. Using the lemma, we have:

Combining the above bounds from Equations 6 and 7, we see that

2 Proof of Corollary 1

The proof is fairly straightforward from definitions. By definition of no-regret, it is immediate that the gap

with $e_{i,t}$ being the end-state reached on completing the roll-out with the policy ${\pi_{i}^{\text{out}}}$ on round $i$ , when action $a$ was taken on the decision point $t$ . Recalling that we rolled in following the trajectory of ${\pi_{i}^{\text{in}}}$ , this expectation further simplifies to

Now taking expectations in Equation 8 and combining with the above observation, we obtain that for any policy $\pi\in\Pi$ ,

Taking the best policy $\pi\in\Pi$ and dividing through by $NT$ completes the proof.

3 Proof sketch of Theorem 5

(Sketch only) We decompose the analysis over exploration and exploitation rounds. For the exploration rounds, we bound the regret by its maximum possible value of 1. To control the regret on the exploitation rounds, we focus on the updates performed during exploration.

The cost vector $\hat{c}(a)$ used at an exploration round $i$ satisfies

Corollary 1. Since the cost vector is identical in expectation as that used in Algorithm 1, the proof of theorem 3, which only depends on expectations, can be reused to prove a result similar to theorem 3 for the exploration rounds. That is, letting $\bar{\pi}_{i}$ to be the averaged policy over all the policies in $\mathcal{I}$ at exploration round $i$ , we have the bound

where $\delta_{i}$ is as defined in Equation 4.

On the exploitation rounds, we can now invoke this guarantee. Recalling that we have $n_{i}$ exploration rounds until round $i$ , the expected regret at an exploitation round $i$ is at most $\delta_{n_{i}}$ . Thus the overall regret of the algorithm is at most

4 Proof of corollary 2

We start by substituting Equation 5 in the regret bound of Theorem 5. This yields

We would like to further replace $n_{i}$ with its expectation which is $\epsilon i$ . However, this does not yield a valid upper bound directly. Instead, we apply a Chernoff bound to the quantity $n_{i}$ , which is a sum of $i$ i.i.d. Bernoulli random variables with mean $\epsilon$ . Consequently, we have

for $\gamma=1/2$ . Let $i_{0}=16\log N/\epsilon+1$ . Then we can sum the failure probabilities above for all $i\geq i_{0}$ and obtain

where the last inequality uses $1+x\leq\exp(x)$ . Consequently, we can now allow a regret of 1 on the first $i_{0}$ rounds, and control the regret on the remaining rounds using $n_{i}\leq\epsilon i/2$ . Doing so, we see that with probability at least $1-2/(N^{2}\epsilon)$

Choosing $\epsilon=(KT)^{2/3}(\log(N|\Pi|)/N)^{1/3}$ completes the proof.

5 Proof of Theorem 4

The proof follows from results in combinatorics. The dynamics of algorithms considered here can be thought of as a path through a graph where the vertices are the corners of the boolean hypercube in $T$ dimensions with two vertices at Hamming distance 1 sharing an edge. We demonstrate that there is a cost function such that the algorithm is forced to traverse a long path before reaching a local optimum. Without loss of generality, assume that the algorithm always moves to a one-step deviation with the lowest cost since otherwise longer paths exist.

To gain some intuition, first consider $T=3$ which is depicted in Figure 4. Suppose the algorithm starts from the policy $000$ then moves to the policy $001$ . If the algorithm picks the best amongst the one-step deviations, we know that $J(001)\leq\min\{J(000),J(010),J(100)\}$ , placing constraints on the costs of these policies which force the algorithm to not visit any of these policies later. Similarly, if the algorithm moves to the policy $011$ next, we obtain a further constraint $J(011)<\min\{J(101),J(001)\}$ . It is easy to check that the only feasible move (corresponding to policies not crossed in Figure 4(c)) which decreases the cost under these constraints is to the policy $111$ and then $110$ , at which point the algorithm attains local optimality since no more moves that decrease the cost are possible. In general, at any step $i$ of the path, the policy $\hat{\pi}_{i}$ is a one-step deviation of $\hat{\pi}_{i-1}$ and at least 2 or more steps away from $\hat{\pi}_{j}$ for $j<i-1$ . The policy never moves to a neighbor of an ancestor (excluding the immediate parent) in the path.

This property is the key element to understand more generally. Suppose we have a current path $\hat{\pi}_{1}\rightarrow\hat{\pi}_{2}\ldots\rightarrow\hat{\pi}_{i-1}\rightarrow\hat{\pi}_{i}$ . Since we picked the best neighbor of $\hat{\pi}_{j}$ as $\hat{\pi}_{j+1}$ , $\hat{\pi}_{i+1}$ cannot be a neighbor of any $\hat{\pi}_{j}$ for $j<i$ . Consequently, the maximum number of updates the algorithm must make is given by the length of the longest such path on a hypercube, where each vertex (other than start and end) neighbors exactly two other vertices on the path. This is called the snake-in-the-box problem in combinatorics, and arises in the study of error correcting codes. It is shown by Abbott & Katchalski (1988) that the length of longest such path is $\Theta(2^{T})$ . With monotonically decreasing costs for policies in the path and maximal cost for all policies not in the path, the traversal time is $\Theta(2^{T})$ .

Finally, it might appear that Algorithm 1 is capable of moving to policies which are not just one-step deviations of the currently learned policy, since it performs updates on “mini-batches” of $T$ cost-sensitive examples. However, on this lower bound instance, Algorithm 1 will be forced to follow one-step deviations only due to the structure of the cost function. For instance, from the policy $000$ when we assign maximal cost to policies $010$ and $100$ in our example, this corresponds to making the cost of taking action $1$ on first and second step very large in the induced cost-sensitive problem. Consequently, $001$ is the policy which minimizes the cost-sensitive loss even when all the $T$ roll-outs are accumulated, implying the algorithm is forced to traverse the same long path to local optimality.

Acknowledgements

Part of this work was carried out while Kai-Wei, Akshay and Hal were visiting Microsoft Research.

References

Appendix A Details of cost-sensitive reduction

There is one missing detail in the specification of Algorithm 3, which is the update step. The specifics of this step depend on the form of the function $f(x)$ being used. For instance, if $f(x)=w^{T}x$ , then a simple update rule is to use online ridge regression (see e.g. Section 11.7 in (Cesa-Bianchi & Lugosi, 2006)). Online gradient descent (Zinkevich, 2003) on the squared loss $\sum_{i=1}^{K}(f(x_{t,i})-c_{t,i})^{2}$ is another simple alternative, which can be used more generally. The specific implementation in our experiments uses a more sophisticated variant of online gradient descent with linear functions.

Appendix B Details of Experiments

Our implementation is based on Vowpal Wabbit (VW) version 7.8 (http://hunch.net/~vw/). It is available at https://github.com/KaiWeiChang/vowpal_wabbit/tree/icmlexp. For LOLS, we use flags “–search_rollin”, “–search_rollout”, “–search_beta” to set the rollin policy, the rollout policy, and $\beta$ , respectively. We use “–search_interpolation policy –search_passes_per_policy –passes 5” to enable Searn. The details settings of various VW flags for the three experiments are shown below:

POS tagging: we use “–search_task sequence –search 45 –holdout_off –affix -2w,+2w –search_neighbor_features -1:w,1:w -b 28”

Dependency parsing: we use “ –search_task dep_parser –search 12 –holdout_off –search_history_length 3 –search_no_caching -b 24 –root_label 8 –num_label 12”

Cost-sensitive multiclass: we use “–search_task multiclasstask –search 5 –holdout_off –mc_cost”

The data sets used in the experiments are available upon request.