DARLA: Improving Zero-Shot Transfer in Reinforcement Learning

Irina Higgins, Arka Pal, Andrei A. Rusu, Loic Matthey, Christopher P Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, Alexander Lerchner

Introduction

Autonomous agents can learn how to maximise future expected rewards by choosing how to act based on incoming sensory observations via reinforcement learning (RL). Early RL approaches did not scale well to environments with large state spaces and high-dimensional raw observations (Sutton & Barto, 1998). A commonly used workaround was to embed the observations in a lower-dimensional space, typically via hand-crafted and/or privileged-information features. Recently, the advent of deep learning and its successful combination with RL has enabled end-to-end learning of such embeddings directly from raw inputs, sparking success in a wide variety of previously challenging RL domains (Mnih et al., 2015, 2016; Jaderberg et al., 2017). Despite the seemingly universal efficacy of deep RL, however, fundamental issues remain. These include data inefficiency, the reactive nature and general brittleness of learnt policies to changes in input data distribution, and lack of model interpretability (Garnelo et al., 2016; Lake et al., 2016). This paper focuses on one of these outstanding issues: the ability of RL agents to deal with changes to the input distribution, a form of transfer learning known as domain adaptation (Bengio et al., 2013). In domain adaptation scenarios, an agent trained on a particular input distribution with a specified reward structure (termed the source domain) is placed in a setting where the input distribution is modified but the reward structure remains largely intact (the target domain). We aim to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain. Here, a policy is considered as robust if it generalises with minimal drop in performance to the target domain without extra fine-tuning.

Past attempts to build RL agents with strong domain adaptation performance highlighted the importance of learning good internal representations of raw observations (Finn et al., 2015; Raffin et al., 2017; Pan & Yang, 2009; Barreto et al., 2016; Littman et al., 2001). Typically, these approaches tried to align the source and target domain representations by utilising observation and reward signals from both domains (Tzeng et al., 2016; Daftry et al., 2016; Parisotto et al., 2015; Guez et al., 2012; Talvitie & Singh, 2007; Niekum et al., 2013; Gupta et al., 2017; Finn et al., 2017; Rajendran et al., 2017). In many scenarios, such as robotics, this reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain (Finn et al., 2017; Rusu et al., 2016). Furthermore, the target domain may simply not be known in advance. On the other hand, policies learnt exclusively on the source domain using existing deep RL approaches that have few constraints on the nature of the learnt representations often overfit to the source input distribution, resulting in poor domain adaptation performance (Lake et al., 2016; Rusu et al., 2016).

We propose tackling both of these issues by focusing instead on learning representations which capture an underlying low-dimensional factorised representation of the world and are therefore not task or domain specific. Many naturalistic domains such as video game environments, simulations and our own world are well described in terms of such a structure. Examples of such factors of variation are object properties like colour, scale, or position; other examples correspond to general environmental factors, such as geometry and lighting. We think of these factors as a set of high-level parameters that can be used by a world graphics engine to generate a particular natural visual scene (Kulkarni et al., 2015). Learning how to project raw observations into such a factorised description of the world is addressed by the large body of literature on disentangled representation learning (Schmidhuber, 1992; Desjardins et al., 2012; Cohen & Welling, 2014, 2015; Kulkarni et al., 2015; Hinton et al., 2011; Rippel & Adams, 2013; Reed et al., 2014; Yang et al., 2015; Goroshin et al., 2015; Kulkarni et al., 2015; Cheung et al., 2015; Whitney et al., 2016; Karaletsos et al., 2016; Chen et al., 2016; Higgins et al., 2017). Disentangled representations are defined as interpretable, factorised latent representations where either a single latent or a group of latent units are sensitive to changes in single ground truth factors of variation used to generate the visual world, while being invariant to changes in other factors (Bengio et al., 2013). The theoretical utility of disentangled representations for supervised and reinforcement learning has been described before (Bengio et al., 2013; Higgins et al., 2017; Ridgeway, 2016); however, to our knowledge, it has not been empirically validated to date.

We demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios by introducing DARLA (DisentAngled Representation Learning Agent), a new RL agent capable of learning a robust policy on the source domain that achieves significantly better out-of-the-box performance in domain adaptation scenarios compared to various baselines. DARLA relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment’s generative factors. Crucially, DARLA does not require target domain data to form its representations. Our approach utilises a three stage pipeline: 1) learning to see, 2) learning to act, 3) transfer. During the first stage, DARLA develops its vision, learning to parse the world in terms of basic visual concepts, such as objects, positions, colours, etc. by utilising a stream of raw unlabelled observations – not unlike human babies in their first few months of life (Leat et al., 2009; Candy et al., 2009). In the second stage, the agent utilises this disentangled visual representation to learn a robust source policy. In stage three, we demonstrate that the DARLA source policy is more robust to domain shifts, leading to a significantly smaller drop in performance in the target domain even when no further policy finetuning is allowed (median 270.3% improvement). These effects hold consistently across a number of different RL environments (DeepMind Lab and Jaco/MuJoCo: Beattie et al., 2016; Todorov et al., 2012) and algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015, 2016; Blundell et al., 2016).

Framework

We now formalise domain adaptation scenarios in a reinforcement learning (RL) setting. We denote the source and target domains as $D_{S}$ and $D_{T}$, respectively. Each domain corresponds to an MDP defined as a tuple $D_{S}\equiv(\mathcal{S}_{S},\mathcal{A}_{S},\mathcal{T}_{S},R_{S})$ or $D_{T}\equiv(\mathcal{S}_{T},\mathcal{A}_{T},\mathcal{T}_{T},R_{T})$ (we assume a shared fixed discount factor $\gamma$), each with its own state space $\mathcal{S}$, action space $\mathcal{A}$, transition function $\mathcal{T}$ and reward function $R$ (for further background on the notation relating to the RL paradigm, see Section A.1 in the Supplementary Materials). In domain adaptation scenarios the states $\mathcal{S}$ of the source and the target domains can be quite different, while the action spaces $\mathcal{A}$ are shared and the transitions $\mathcal{T}$ and reward functions $R$ have structural similarity. For example, consider a domain adaptation scenario for the Jaco robotic arm, where the MuJoCo (Todorov et al., 2012) simulation of the arm is the source domain, and the real world setting is the target domain. The state spaces (raw pixels) of the source and the target domains differ significantly due to the perceptual-reality gap (Rusu et al., 2016); that is to say, $\mathcal{S}_{S}\neq\mathcal{S}_{T}$. Both domains, however, share action spaces ($\mathcal{A}_{S}=\mathcal{A}_{T}$), since the policy learns to control the same set of actuators within the arm. Finally, the source and target domain transition and reward functions share structural similarity ($\mathcal{T}_{S}\approx\mathcal{T}_{T}$ and $R_{S}\approx R_{T}$), since in both domains transitions between states are governed by the physics of the world and the performance on the task depends on the relative position of the arm’s end effectors (i.e. fingertips) with respect to an object of interest.

2.2 DARLA

In order to describe our proposed DARLA framework, we assume that there exists a set $\mathcal{M}$ of MDPs that is the set of all natural world MDPs, and each MDP $D_{i}$ is sampled from $\mathcal{M}$. We define $\mathcal{M}$ in terms of the state space $\hat{\mathcal{S}}$ that contains all possible conjunctions of high-level factors of variation necessary to generate any naturalistic observation in any $D_{i}\in\mathcal{M}$. A natural world MDP $D_{i}$ is then one whose state space $\mathcal{S}$ corresponds to some subset of $\hat{\mathcal{S}}$. In simple terms, we assume that there exists some shared underlying structure between the MDPs $D_{i}$ sampled from $\mathcal{M}$. We contend that this is a reasonable assumption that permits inclusion of many interesting problems, including being able to characterise our own reality (Lake et al., 2016).

We now introduce notation for two state space variables that may in principle be used interchangeably within the source and target domain MDPs $D_{S}$ and $D_{T}$: the agent observation state space $\mathcal{S}^{o}$, and the agent’s internal latent state space $\mathcal{S}^{z}$. (Note that we do not assume these to be Markovian, i.e. it is not necessarily the case that $p(s^{o}_{t+1}|s^{o}_{t})=p(s^{o}_{t+1}|s^{o}_{t},s^{o}_{t-1},\ldots,s^{o}_{1})$, and similarly for $s^{z}$; the index $t$ here corresponds to time.) $\mathcal{S}^{o}_{i}$ in $D_{i}$ consists of raw (pixel) observations $s^{o}_{i}$ generated by the true world simulator from a sampled set of data generative factors $\hat{s}_{i}$, i.e. $s^{o}_{i}\sim\mathbf{Sim}(\hat{s}_{i})$. $\hat{s}_{i}$ is sampled by some distribution or process $\mathcal{G}_{i}$ on $\hat{\mathcal{S}}$, $\hat{s}_{i}\sim\mathcal{G}_{i}(\hat{\mathcal{S}})$.

Using the newly introduced notation, domain adaptation scenarios can be described as having different sampling processes $\mathcal{G}_{S}$ and $\mathcal{G}_{T}$ such that $\hat{s}_{S}\sim\mathcal{G}_{S}(\hat{\mathcal{S}})$ and $\hat{s}_{T}\sim\mathcal{G}_{T}(\hat{\mathcal{S}})$ for the source and target domains respectively, and then using these to generate different agent observation states $s^{o}_{S}\sim\mathbf{Sim}(\hat{s}_{S})$ and $s^{o}_{T}\sim\mathbf{Sim}(\hat{s}_{T})$. Intuitively, consider a source domain where oranges appear in blue rooms and apples appear in red rooms, and a target domain where the object/room conjunctions are reversed and oranges appear in red rooms and apples appear in blue rooms. While the true data generative factors of variation $\hat{\mathcal{S}}$ remain the same (room colour: blue or red; object type: apples or oranges), the particular source and target distributions $\mathcal{G}_{S}$ and $\mathcal{G}_{T}$ differ.

Typically deep RL agents (e.g. Mnih et al., 2015, 2016) operating in an MDP $D_{i}\in\mathcal{M}$ learn an end-to-end mapping from raw (pixel) observations $s^{o}_{i}\in\mathcal{S}^{o}_{i}$ to actions $a_{i}\in\mathcal{A}_{i}$ (either directly or via a value function $Q_{i}(s^{o}_{i},a_{i})$ from which actions can be derived). In the process of doing so, the agent implicitly learns a function $\mathcal{F}:\mathcal{S}^{o}_{i}\rightarrow\mathcal{S}^{z}_{i}$ that maps the typically high-dimensional raw observations $s^{o}_{i}$ to typically low-dimensional latent states $s^{z}_{i}$; followed by a policy function $\pi_{i}:\mathcal{S}^{z}_{i}\rightarrow\mathcal{A}_{i}$ that maps the latent states $s^{z}_{i}$ to actions $a_{i}\in\mathcal{A}_{i}$. In the context of domain adaptation, if the agent learns a naive latent state mapping function $\mathcal{F}_{S}:\mathcal{S}^{o}_{S}\rightarrow\mathcal{S}^{z}_{S}$ on the source domain using reward signals to shape the representation learning, it is likely that $\mathcal{F}_{S}$ will overfit to the source domain and will not generalise well to the target domain. Returning to our intuitive example, imagine an agent that has learnt a policy to pick up oranges and avoid apples on the source domain. Such a source policy $\pi_{S}$ is likely to be based on an entangled latent state space $\mathcal{S}^{z}_{S}$ of object/room conjunctions: oranges/blue $\rightarrow$ good, apples/red $\rightarrow$ bad, since this is arguably the most efficient representation for maximising expected rewards on the source task in the absence of extra supervision signals suggesting otherwise. A source policy $\pi_{S}(a|s^{z}_{S};\theta)$ based on such an entangled latent representation $s^{z}_{S}$ will not generalise well to the target domain without further fine-tuning, since $\mathcal{F}_{S}(s^{o}_{S})\neq\mathcal{F}_{S}(s^{o}_{T})$ and therefore crucially $\mathcal{S}^{z}_{S}\neq\mathcal{S}^{z}_{T}$.

On the other hand, since both $\hat{s}_{S}\sim\mathcal{G}_{S}(\hat{\mathcal{S}})$ and $\hat{s}_{T}\sim\mathcal{G}_{T}(\hat{\mathcal{S}})$ are sampled from the same natural world state space $\hat{\mathcal{S}}$ for the source and target domains respectively, it should be possible to learn a latent state mapping function $\hat{\mathcal{F}}:\mathcal{S}^{o}\rightarrow\mathcal{S}^{z}_{\hat{\mathcal{S}}}$, which projects the agent observation state space $\mathcal{S}^{o}$ to a latent state space $\mathcal{S}^{z}_{\hat{\mathcal{S}}}$ expressed in terms of factorised data generative factors that are representative of the natural world, i.e. $\mathcal{S}^{z}_{\hat{\mathcal{S}}}\approx\hat{\mathcal{S}}$. Consider again our intuitive example, where $\hat{\mathcal{F}}$ maps agent observations ($s^{o}_{S}$: orange in a blue room) to a factorised or disentangled representation expressed in terms of the data generative factors ($s^{z}_{\hat{\mathcal{S}}}$: room type = blue; object type = orange). Such a disentangled latent state mapping function should then directly generalise to both the source and the target domains, so that $\hat{\mathcal{F}}(s^{o}_{S})=\hat{\mathcal{F}}(s^{o}_{T})=s^{z}_{\hat{\mathcal{S}}}$. Since $\mathcal{S}^{z}_{\hat{\mathcal{S}}}$ is a disentangled representation of object and room attributes, the source policy $\pi_{S}$ can learn a decision boundary that ignores the irrelevant room attributes: oranges $\rightarrow$ good, apples $\rightarrow$ bad. Such a policy would then generalise well to the target domain out of the box, since $\pi_{S}(a|\hat{\mathcal{F}}(s^{o}_{S});\theta)=\pi_{T}(a|\hat{\mathcal{F}}(s^{o}_{T});\theta)=\pi_{T}(a|s^{z}_{\hat{\mathcal{S}}};\theta)$. Hence, DARLA is based on the idea that a good quality $\hat{\mathcal{F}}$ learnt exclusively on the source domain $D_{S}\in\mathcal{M}$ will zero-shot-generalise to all target domains $D_{i}\in\mathcal{M}$, and therefore the source policy $\pi(a|\mathcal{S}^{z}_{\hat{\mathcal{S}}};\theta)$ will also generalise to all target domains $D_{i}\in\mathcal{M}$ out of the box.
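The orange/apple example can be made concrete with a toy sketch. It is an illustration only: the factor tuples, the two "policies" and the default of avoiding unseen conjunctions are hypothetical stand-ins, not the DARLA networks. In particular, the disentangled policy is hard-coded to key on the object factor, whereas in practice the RL algorithm would have to discover that decision boundary.

```python
# Toy illustration only: factor names and both policies are hypothetical stand-ins.

# Source domain: oranges appear in blue rooms, apples in red rooms.
SOURCE = [("blue", "orange"), ("red", "apple")]
# Target domain: the object/room conjunctions are reversed.
TARGET = [("red", "orange"), ("blue", "apple")]


def reward(room, obj):
    """Ground-truth reward: oranges are good, apples are bad, in any room."""
    return 1 if obj == "orange" else -1


def fit_entangled_policy(states):
    """Memorise whole (room, object) conjunctions, as an entangled latent might."""
    table = {state: reward(*state) > 0 for state in states}
    # Assumption: an unseen conjunction defaults to "do not pick up".
    return lambda state: table.get(state, False)


def fit_disentangled_policy(states):
    """Learn a rule on the object factor alone, ignoring the room factor."""
    good_objects = {obj for room, obj in states if reward(room, obj) > 0}
    return lambda state: state[1] in good_objects


def accuracy(policy, states):
    """Fraction of states where the pick-up decision matches a positive reward."""
    return sum(policy(s) == (reward(*s) > 0) for s in states) / len(states)


entangled = fit_entangled_policy(SOURCE)
disentangled = fit_disentangled_policy(SOURCE)

# Both policies are perfect on the source domain...
source_scores = (accuracy(entangled, SOURCE), accuracy(disentangled, SOURCE))
# ...but only the factorised one keeps its accuracy on the reversed conjunctions.
target_scores = (accuracy(entangled, TARGET), accuracy(disentangled, TARGET))
print(source_scores, target_scores)
```

Both policies fit the source domain equally well, so source performance alone cannot distinguish them; the difference only shows up once the conjunctions change.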

Next we describe each of the stages of the DARLA pipeline that allow it to learn source policies $\pi_{S}$ that are robust to domain adaptation scenarios, despite being trained with no knowledge of the target domains (see Fig. 1 for a graphical representation of these steps):

1) Learn to see (unsupervised learning of $\mathcal{F}_{U}$) – the task of inferring a factorised set of generative factors $\mathcal{S}^{z}_{\hat{\mathcal{S}}}=\hat{\mathcal{S}}$ from observations $\mathcal{S}^{o}$ is the goal of the extensive disentangled factor learning literature (e.g. Chen et al., 2016; Higgins et al., 2017). Hence, in stage one we learn a mapping $\mathcal{F}_{U}:\mathcal{S}^{o}_{U}\rightarrow\mathcal{S}^{z}_{U}$, where $\mathcal{S}^{z}_{U}\approx\mathcal{S}^{z}_{\hat{\mathcal{S}}}$ ($U$ stands for ‘unsupervised’) using an unsupervised model for learning disentangled factors that utilises observations collected by an agent with a random policy $\pi_{U}$ from a visual pre-training MDP $D_{U}\in\mathcal{M}$. Note that we require sufficient variability of factors and their conjunctions in $D_{U}$ in order to have $\mathcal{S}^{z}_{U}\approx\mathcal{S}^{z}_{\hat{\mathcal{S}}}$;

2) Learn to act (reinforcement learning of $\pi_{S}$ in the source domain $D_{S}$ utilising previously learned $\mathcal{F}_{U}$) – an agent that has learnt to see the world in stage one in terms of the natural data generative factors is now exposed to a source domain $D_{S}\in\mathcal{M}$. The agent is tasked with learning the source policy $\pi_{S}(a|s^{z}_{S};\theta)$, where $s^{z}_{S}=\mathcal{F}_{U}(s^{o}_{S})\approx s^{z}_{\hat{\mathcal{S}}}$, via a standard reinforcement learning algorithm. Crucially, we do not allow $\mathcal{F}_{U}$ to be modified (e.g. by gradient updates) during this phase;

3) Transfer (to a target domain $D_{T}$) – in the final step, we test how well the policy $\pi_{S}$ learnt on the source domain generalises to the target domain $D_{T}\in\mathcal{M}$ in a zero-shot domain adaptation setting, i.e. the agent is evaluated on the target domain without retraining. We compare the performance of policies learnt with a disentangled latent state $\mathcal{S}^{z}_{\hat{\mathcal{S}}}$ to various baselines where the latent state mapping function $\mathcal{F}_{U}$ projects agent observations $s^{o}$ to entangled latent state representations $s^{z}$.

2.3 Learning disentangled representations

In order to learn $\mathcal{F}_{U}$, DARLA utilises $\beta$-VAE (Higgins et al., 2017), a state-of-the-art unsupervised model for automated discovery of factorised latent representations from raw image data. $\beta$-VAE is a modification of the variational autoencoder framework (Kingma & Welling, 2014; Rezende et al., 2014) that controls the nature of the learnt latent representations by introducing an adjustable hyperparameter $\beta$ to balance reconstruction accuracy with latent channel capacity and independence constraints. It maximises the objective:

$$\mathcal{L}(\theta,\phi;\mathbf{x},\mathbf{z},\beta)=\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right]-\beta\,D_{KL}\left(q_{\phi}(\mathbf{z}|\mathbf{x})\,\|\,p(\mathbf{z})\right)$$

where $\phi$, $\theta$ parametrise the distributions of the encoder and the decoder respectively. Well-chosen values of $\beta$, usually larger than one ($\beta>1$), typically result in more disentangled latent representations $\mathbf{z}$ by limiting the capacity of the latent information channel, and hence encouraging a more efficient factorised encoding through the increased pressure to match the isotropic unit Gaussian prior $p(\mathbf{z})$ (Higgins et al., 2017).
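To make the role of $\beta$ concrete, the following minimal sketch evaluates the $\beta$-VAE objective (reconstruction log-likelihood minus a $\beta$-weighted KL term) for a single data point. It assumes a diagonal Gaussian posterior, for which the KL divergence to the unit Gaussian prior has a closed form; the function names and the numbers are hypothetical, and the reconstruction log-likelihood is passed in as a given scalar rather than computed by a decoder.

```python
import math


def gaussian_kl_to_unit_prior(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))


def beta_vae_objective(log_likelihood, mu, log_var, beta):
    """Quantity to maximise: reconstruction log-likelihood minus beta-weighted KL."""
    return log_likelihood - beta * gaussian_kl_to_unit_prior(mu, log_var)


# Hypothetical values for one data point with a two-dimensional latent.
mu, log_var = [1.0, 0.0], [0.0, 0.0]
log_likelihood = -10.0

kl = gaussian_kl_to_unit_prior(mu, log_var)
standard_vae = beta_vae_objective(log_likelihood, mu, log_var, beta=1.0)
beta_vae = beta_vae_objective(log_likelihood, mu, log_var, beta=4.0)
print(kl, standard_vae, beta_vae)
```

With $\beta=1$ the expression reduces to the standard VAE lower bound; a larger $\beta$ penalises the same posterior more heavily for deviating from the prior, which is the capacity pressure described above.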

2.4 Reinforcement Learning Algorithms

We used various RL algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015, 2016; Blundell et al., 2016) to learn the source policy $\pi_{S}$ during stage two of the pipeline using the latent states $s^{z}$ acquired by $\beta$-VAE based models during stage one of the DARLA pipeline.

Deep Q Network (DQN) (Mnih et al., 2015) is a variant of the Q-learning algorithm (Watkins, 1989) that utilises deep learning. It uses a neural network to parametrise an approximation for the action-value function $Q(s,a;\theta)$ using parameters $\theta$.
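For reference, the tabular Q-learning update that DQN builds on can be sketched as follows. This is not DQN itself: there is no neural network, experience replay or target network here, and the state names, step size and discount are hypothetical values chosen for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning step.

    Moves Q(s, a) towards the TD target r + gamma * max_a' Q(s', a').
    Unvisited (state, action) pairs are treated as having value 0.0.
    """
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    old_value = q.get((state, action), 0.0)
    q[(state, action)] = old_value + alpha * (td_target - old_value)
    return q[(state, action)]


q_table = {}
actions = ["left", "right"]

# Hypothetical transitions: (s0, right) yields reward 1, then (s1, left) leads back to s0.
first = q_learning_update(q_table, "s0", "right", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
second = q_learning_update(q_table, "s1", "left", 0.0, "s0", actions, alpha=0.5, gamma=0.9)
print(first, second)
```

In DQN the table lookup is replaced by the network output $Q(s,a;\theta)$, and the same TD target is used to define a regression loss on $\theta$.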

Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) is an asynchronous implementation of the advantage actor-critic paradigm (Sutton & Barto, 1998; Degris & Sutton, 2012), where separate threads run in parallel and perform updates to shared parameters. The different threads each hold their own instance of the environment and have different exploration policies, thereby decorrelating parameter updates without the need for experience replay. Therefore, A3C is an online algorithm, whereas DQN learns its policy offline, resulting in different learning dynamics between the two algorithms.

Model-Free Episodic Control (EC) (Blundell et al., 2016) was proposed as a complementary learning system to the other RL algorithms described above. The EC algorithm relies on near-determinism of state transitions and rewards in RL environments; in settings where this holds, it can exploit these properties to memorise which action led to high returns in similar situations in the past. Since in its simplest form EC relies on a lookup table, it learns good policies much faster than value-function-approximation based deep RL algorithms like DQN trained via gradient descent - at the cost of generality (i.e. potentially poor performance in non-deterministic environments).
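The lookup-table idea behind EC can be sketched as below. This is a simplified illustration under stated assumptions: states are matched exactly (the nearest-neighbour generalisation to novel states used by Blundell et al. is omitted), unseen state-action pairs default to a value of 0.0 when acting, and the class and method names are hypothetical.

```python
class EpisodicControlTable:
    """Simplified episodic control: exact-match lookup of the best return seen so far."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.q = {}  # (state, action) -> highest discounted return observed

    def update(self, episode, gamma=1.0):
        """Write back returns for one finished episode.

        episode is a list of (state, action, reward) tuples in time order.
        Each entry keeps the maximum of its stored value and the new return.
        """
        ret = 0.0
        for state, action, reward in reversed(episode):
            ret = reward + gamma * ret
            key = (state, action)
            self.q[key] = max(self.q.get(key, float("-inf")), ret)

    def act(self, state):
        """Greedy action; unseen pairs default to 0.0 (an assumption of this sketch)."""
        return max(self.actions, key=lambda a: self.q.get((state, a), 0.0))


ec = EpisodicControlTable(["left", "right"])
ec.update([("s0", "left", 0.0), ("s1", "left", 1.0)])    # return 1.0 from s0 via left
ec.update([("s0", "right", 0.0), ("s1", "right", 3.0)])  # return 3.0 from s0 via right
ec.update([("s0", "right", 0.0), ("s1", "left", 1.0)])   # lower return does not overwrite
best_action = ec.act("s0")
print(best_action, ec.q[("s0", "right")])
```

Because each entry stores a maximum rather than an average, a single lucky outcome in a stochastic environment would be remembered as if it were reliable, which is the loss of generality noted above.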

We also compared our approach to that of UNREAL (Jaderberg et al., 2017), a recently proposed RL algorithm which also attempts to utilise unsupervised data in the environment. The UNREAL agent takes as a base an LSTM A3C agent (Mnih et al., 2016) and augments it with a number of unsupervised auxiliary tasks that make use of the rich perceptual data available to the agent besides the (sometimes very sparse) extrinsic reward signals. This auxiliary learning tends to improve the representation learnt by the agent. See Sec. A.6 in Supplementary Materials for further details of the algorithms above.

Tasks

We evaluate the performance of DARLA on different task and environment setups that probe subtly different aspects of domain adaptation. As a reminder, in Sec. 2.2 we defined $\hat{\mathcal{S}}$ as a state space that contains all possible conjunctions of high-level factors of variation necessary to generate any naturalistic observation in any $D_{i}\in\mathcal{M}$. During domain adaptation scenarios agent observation states are generated according to $s^{o}_{S}\sim\mathbf{Sim}_{S}(\hat{s}_{S})$ and $s^{o}_{T}\sim\mathbf{Sim}_{T}(\hat{s}_{T})$ for the source and target domains respectively, where $\hat{s}_{S}$ and $\hat{s}_{T}$ are sampled by some distributions or processes $\mathcal{G}_{S}$ and $\mathcal{G}_{T}$ according to $\hat{s}_{S}\sim\mathcal{G}_{S}(\hat{\mathcal{S}})$ and $\hat{s}_{T}\sim\mathcal{G}_{T}(\hat{\mathcal{S}})$.

We use DeepMind Lab (Beattie et al., 2016) to test a version of domain adaptation setup where the source and target domain observation simulators are equal ($\mathbf{Sim}_{S}=\mathbf{Sim}_{T}$), but the processes used to sample $\hat{s}_{S}$ and $\hat{s}_{T}$ are different ($\mathcal{G}_{S}\neq\mathcal{G}_{T}$). We use the Jaco arm with a matching MuJoCo simulation environment (Todorov et al., 2012) in two domain adaptation scenarios: simulation to simulation (sim2sim) and simulation to reality (sim2real). The sim2sim domain adaptation setup is relatively similar to DeepMind Lab, i.e. the source and target domains differ in terms of processes $\mathcal{G}_{S}$ and $\mathcal{G}_{T}$. However, there is a significant point of difference. In DeepMind Lab, all values of factors in the target domain, $\hat{s}_{T}$, are previously seen in the source domain; however, $\hat{s}_{S}\neq\hat{s}_{T}$ as the conjunctions of these factor values are different. In sim2sim, by contrast, novel factor values are experienced in the target domain (this accordingly also leads to novel factor conjunctions). Hence, DeepMind Lab may be considered to be assessing domain interpolation performance, whereas sim2sim tests domain extrapolation.

The sim2real setup, on the other hand, is based on identical processes $\mathcal{G}_{S}=\mathcal{G}_{T}$, but different observation simulators $\mathbf{Sim}_{S}\neq\mathbf{Sim}_{T}$ corresponding to the MuJoCo simulation and the real world, which results in the so-called ‘perceptual reality gap’ (Rusu et al., 2016). More details of the tasks are given below.

DeepMind Lab is a first person 3D game environment with rich visuals and realistic physics. We used a standard seek-avoid object gathering setup, where a room is initialised with an equal number of randomly placed objects of two different types. One of the object varieties is ‘good’ (its collection is rewarded with $+1$), while the other is ‘bad’ (its collection is punished with $-1$). The full state space $\hat{\mathcal{S}}$ consisted of all conjunctions of two room types (pink and green based on the colour of the walls) and four object types (hat, can, cake and balloon) (see Fig. 2A). The source domain $D_{S}$ contained environments with hats/cans presented in the green room, and balloons/cakes presented in either the green or the pink room. The target domain $D_{T}$ contained hats/cans presented in the pink room. In both domains cans and balloons were the rewarded objects.
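The distinction between domain interpolation and domain extrapolation can be checked mechanically from the factor conjunctions. The sketch below encodes the DeepMind Lab source and target conjunctions as described above; the helper `transfer_type`, the "in-distribution" label and the two-factor sim2sim stand-in (camera position and object colour values) are hypothetical and only illustrate the definitions.

```python
def transfer_type(source, target):
    """Classify a pair of source/target sets of factor tuples.

    'extrapolation'  : some factor value in the target never appears in the source
    'interpolation'  : all target factor values were seen, but a conjunction is new
    'in-distribution': every target conjunction already appears in the source
    """
    n_factors = len(next(iter(source)))
    for i in range(n_factors):
        seen_values = {s[i] for s in source}
        if any(t[i] not in seen_values for t in target):
            return "extrapolation"
    return "interpolation" if set(target) - set(source) else "in-distribution"


# DeepMind Lab setup from the text, as (room, object) conjunctions.
lab_source = {
    ("green", "hat"), ("green", "can"),
    ("green", "balloon"), ("green", "cake"),
    ("pink", "balloon"), ("pink", "cake"),
}
lab_target = {("pink", "hat"), ("pink", "can")}

# Hypothetical stand-in for sim2sim: the target contains a held-out object colour.
sim_source = {("near", "red"), ("far", "blue")}
sim_target = {("near", "green")}

lab_kind = transfer_type(lab_source, lab_target)
sim_kind = transfer_type(sim_source, sim_target)
print(lab_kind, sim_kind)
```

In the DeepMind Lab case every room and object value in the target has been seen during source training, so only the conjunctions are new.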

1) Learn to see: we used $\beta\text{-VAE}_{DAE}$ to learn the disentangled latent state representation $s^{z}$ that includes both the room and the object generative factors of variation within DeepMind Lab. We had to use the high-level feature space of a pre-trained DAE within the $\beta\text{-VAE}_{DAE}$ framework (see Section 2.3.1), instead of the pixel space of vanilla $\beta$-VAE, because we found that objects failed to reconstruct when using the values of $\beta$ necessary to disentangle the generative factors of variation within DeepMind Lab (see Fig. 2B).

$\beta\text{-VAE}_{DAE}$ was trained on observations $s^{o}_{U}$ collected by an RL agent with a simple wall-avoiding policy $\pi_{U}$ (otherwise the training data was dominated by close up images of walls). In order to enable the model to learn $\mathcal{F}(s^{o}_{U})\approx\hat{\mathcal{S}}$, it is important to expose the agent to at least a minimal set of environments that span the range of values for each factor, and where no extraneous correlations are added between different factors (see Fig. 2A, yellow). In our setup of the DeepMind Lab domain adaptation task, the object and environment factors are supposed to be independent. In order to ensure that $\beta\text{-VAE}_{DAE}$ learns a factorised representation that reflects this ground truth independence, we present observations of all possible conjunctions of room and individual object types. See Section A.3.1 in Supplementary Materials for details of $\beta\text{-VAE}_{DAE}$ training.

2) Learn to act: the agent was trained with the algorithms detailed in Section 2.4 on a seek-avoid task using the source domain ($D_{S}$) conjunctions of object/room shown in Fig. 2A (green). Pre-trained $\beta\text{-VAE}_{DAE}$ from stage one was used as the ‘vision’ part of various RL algorithms (DQN, A3C and Episodic Control: Mnih et al., 2015, 2016; Blundell et al., 2016) to learn a source policy $\pi_{S}$ that picks up balloons and avoids cakes in both the green and the pink rooms, and picks up cans and avoids hats in the green rooms. See Section A.3.1 in Supplementary Materials for more details of the various versions of DARLA we have tried, each based on a different base RL algorithm.

3) Transfer: we tested the ability of DARLA to transfer the seek-avoid policy $\pi_{S}$ it had learnt on the source domain in stage two using the domain adaptation condition $D_{T}$ illustrated in Figure 2A (red). The agent had to continue picking up cans and avoid hats in the pink room, even though these objects had only been seen in the green room during source policy training. The optimal policy $\pi_{T}$ is one that maintains the reward polarity from the source domain (cans are good and hats are bad). For further details, see Appendix A.2.1.

3.2 Jaco Arm and MuJoCo

We used frames from an RGB camera facing a robotic Jaco arm, or a matching rendered camera view from a MuJoCo physics simulation environment (Todorov et al., 2012) to investigate the performance of DARLA in two domain adaptation scenarios: 1) simulation to simulation (sim2sim), and 2) simulation to reality (sim2real). The sim2real setup is of particular importance, since the progress that deep RL has brought to control tasks in simulation (Schulman et al., 2015; Mnih et al., 2016; Levine & Abbeel, 2014; Heess et al., 2015; Lillicrap et al., 2015; Schulman et al., 2016) has not yet translated as well to reality, despite various attempts (Tobin et al., 2017; Tzeng et al., 2016; Daftry et al., 2016; Finn et al., 2015; Rusu et al., 2016). Solving control problems in reality is hard due to sparse reward signals, expensive data acquisition and the attendant danger of breaking the robot (or its human minders) during exploration.

In both sim2sim and sim2real, we trained the agent to perform an object reaching policy where the goal is to place the end effector as close to the object as possible. While conceptually the reaching task is simple, it is a hard control problem since it requires correct inference of the arm and object positions and velocities from raw visual inputs.

1) Learn to see: $\beta$-VAE was trained on observations collected in MuJoCo simulations with the same factors of variation as in $D_{S}$. In order to enable the model to learn $\mathcal{F}(s^{o}_{U})\approx\hat{s}$, a reaching policy was applied to phantom objects placed in random positions, therefore ensuring that the agent learnt the independent nature of the arm position and object position (see Fig. 2C, left);

2) Learn to act: a feedforward-A3C based agent with the vision module pre-trained in stage one was taught a source reaching policy $\pi_{S}$ towards the real object in simulation (see Fig. 2C (left) for an example frame, and Sec. A.4 in Supplementary Materials for a fuller description of the agent). In the source domain $D_{S}$ the agent was trained on a distribution of camera angles and positions. The colour of the tabletop on which the arm rests and the object colour were both sampled anew every episode.

3) Transfer: sim2sim: in the target domain, $D_{T}$, the agent was faced with a new distribution of camera angles and positions with little overlap with the source domain distributions, as well as a completely held out set of object colours (see Fig. 2C, middle). sim2real: in the target domain $D_{T}$ the camera position and angle as well as the tabletop colour and object colour were sampled from the same distributions as seen in the source domain $D_{S}$, but the target domain $D_{T}$ was now the real world. Many details present in the real world such as shadows, specularity, multiple light sources and so on are not modelled in the simulation; the physics engine is also not a perfect model of reality. Thus sim2real tests the ability of the agent to cross the perceptual-reality gap and generalise its source policy $\pi_{S}$ to the real world (see Fig. 2C, right). For further details, see Appendix A.2.2.

Results

We evaluated the robustness of DARLA’s policy $\pi_{S}$ learnt on the source domain to various shifts in the input data distribution. In particular, we used domain adaptation scenarios based on the DeepMind Lab seek-avoid task and the Jaco arm reaching task described in Sec. 3. On each task we compared DARLA’s performance to that of various baselines. We evaluated the importance of learning ‘good’ vision during stage one of the pipeline, i.e. one that maps the input observations $s^{o}$ to disentangled representations $s^{z}\approx\hat{s}$. In order to do this, we ran the DARLA pipeline with different vision models: the encoders of a disentangled $\beta$-VAE (the original DARLA), an entangled $\beta$-VAE ($\text{DARLA}_{\text{ENT}}$), and a denoising autoencoder ($\text{DARLA}_{\text{DAE}}$). In this section of the paper, we use the term $\beta$-VAE to refer to a standard $\beta$-VAE for the MuJoCo experiments, and a $\beta\text{-VAE}_{DAE}$ for the DeepMind Lab experiments (as described in stage 1 of Sec. 3.1). Apart from the nature of the learnt representations $s^{z}$, DARLA and all versions of its baselines were equivalent throughout the three stages of our proposed pipeline in terms of architecture and the observed data distribution (see Sec. A.3 in Supplementary Materials for more details).

Figs. 3-4 display the degree of disentanglement learnt by the vision modules of DARLA and DARLA$_{\text{ENT}}$ on DeepMind Lab and MuJoCo. DARLA’s vision learnt to independently represent environment variables (such as room colour-scheme and geometry) and object-related variables (change of object type, size, rotation) on DeepMind Lab (Fig. 3, left). Disentangling was also evident in MuJoCo. Fig. 4, left, shows that DARLA’s single latent units $z_{i}$ learnt to represent different aspects of the Jaco arm, the object, and the camera. By contrast, in the representations learnt by DARLA$_{\text{ENT}}$, each latent is responsible for changes to both the environment and objects (Fig. 3, right) in DeepMind Lab, or a mixture of camera, object and/or arm movements (Fig. 4, right) in MuJoCo.

The table in Fig. 5 shows the average performance (across different seeds) in terms of rewards per episode of the various agents on the target domain with no fine-tuning of the source policy $\pi_{S}$. It can be seen that DARLA is able to zero-shot-generalise significantly better than DARLA$_{\text{ENT}}$ or DARLA$_{\text{DAE}}$, highlighting the importance of learning a disentangled representation $s^{z}=s^{z}_{\hat{\mathcal{S}}}$ during the unsupervised stage one of the DARLA pipeline. In particular, this also demonstrates that the improved domain transfer performance is not simply a function of increased exposure to training observations, as both DARLA$_{\text{ENT}}$ and DARLA$_{\text{DAE}}$ were exposed to the same data. The results are mostly consistent across target domains and in most cases DARLA is significantly better than the second-best-performing agent. This holds in the sim2real task, where being able to perform zero-shot policy transfer is highly valuable due to the particular difficulties of gathering data in the real world. (See https://youtu.be/sZqrWFl0wQ4 for example sim2sim and sim2real zero-shot transfer policies of DARLA and the baseline A3C agent.)

DARLA’s performance is particularly surprising as it actually preserves less information about the raw observations $s^{o}$ than DARLA$_{\text{ENT}}$ and DARLA$_{\text{DAE}}$. This is due to the nature of the $\beta$-VAE and how it achieves disentangling; the disentangled model utilised a significantly higher value of the hyperparameter $\beta$ than the entangled model (see Appendix A.3 for further details), which constrains the capacity of the latent channel. Indeed, DARLA’s $\beta$-VAE only utilises 8 of its possible 32 Gaussian latents to store observation-specific information for MuJoCo/Jaco (and 20 in DeepMind Lab), whereas DARLA$_{\text{ENT}}$ utilises all 32 for both environments (as does DARLA$_{\text{DAE}}$).

Furthermore, we examined what happens if DARLA’s vision (i.e. the encoder of the disentangled $\beta$-VAE) is allowed to be fine-tuned via gradient updates while learning the source policy during stage two of the pipeline. This is denoted by DARLA$_{\text{FT}}$ in the table in Fig. 5. We see that it exhibits significantly worse performance than that of DARLA in zero-shot domain adaptation using an A3C-based agent in all tasks. This suggests that a favourable initialisation does not make up for subsequent overfitting to the source domain for the on-policy A3C. However, the off-policy DQN-based fine-tuned agent performs very well. We leave further investigation of this curious effect for future work.

Finally, we compared the performance of DARLA to an UNREAL (Jaderberg et al., 2017) agent with the same architecture. Despite also exploiting the unsupervised data available in the source domain, UNREAL performed worse than baseline A3C on the DeepMind Lab domain adaptation task. This further demonstrates that use of unsupervised data in itself is not a panacea for transfer performance; it must be utilised in a careful and structured manner conducive to learning disentangled latent states $s^{z}=s^{z}_{\hat{\mathcal{S}}}$.

In order to quantitatively evaluate our hypothesis that disentangled representations are essential for DARLA’s performance in domain adaptation scenarios, we trained various DARLAs with different degrees of learnt disentanglement in $s^{z}$ by varying $\beta$ (of the $\beta$-VAE) during stage one of the pipeline. We then calculated the correlation between the performance of the EC-based DARLA on the DeepMind Lab domain adaptation task and the transfer metric, which approximately measures the quality of disentanglement of DARLA’s latent representations $s^{z}$ (see Sec. A.5.2 in Supplementary Materials). This is shown in the chart in Fig. 5; as can be seen, there is a strong positive correlation between the level of disentanglement and DARLA’s zero-shot domain transfer performance ($r=0.6$, $p<0.001$).

Having shown the robust utility of disentangled representations in agents for domain adaptation, we note that there is evidence that they can provide an important additional benefit. We found significantly improved speed of learning of $\pi_{S}$ on the source domain itself, as a function of how disentangled the model was. The gain in data efficiency from disentangled representations for source policy learning is not the main focus of this paper so we leave it out of the main text; however, we provide results and discussion in Section A.7 in Supplementary Materials.

Conclusion

We have demonstrated the benefits of using disentangled representations in a deep RL setting for domain adaptation. In particular, we have proposed DARLA, a multi-stage RL agent. DARLA first learns a visual system that encodes the observations it receives from the environment as disentangled representations, in a completely unsupervised manner. It then uses these representations to learn a robust source policy that is capable of zero-shot domain adaptation.

We have demonstrated the efficacy of this approach in a range of domains and task setups: a 3D naturalistic first-person environment (DeepMind Lab), a simulated graphics and physics engine (MuJoCo), and crossing the simulation to reality gap (MuJoCo to Jaco sim2real). We have also shown that the effect of disentangling is consistent across very different RL algorithms (DQN, A3C, EC), achieving significant improvements over the baseline algorithms (median 2.7 times improvement in zero-shot transfer across tasks and algorithms). To the best of our knowledge, this is the first comprehensive empirical demonstration of the strength of disentangled representations for domain adaptation in a deep RL setting.

References

Appendix A Supplementary Materials

A.2 Further task details

As described in Sec. 3.1, in each source episode of DeepMind Lab the agent was presented with one of three possible room/object type conjunctions, chosen at random. These are marked $D_{S}$ in Fig. 2. The setup was a seek-avoid style task, where one of the two object types in the room gave a reward of +1 and the other gave a reward of -1. The agent was allowed to pick up objects for 60 seconds after which the episode would terminate and a new one would begin; if the agent was able to pick up all the ‘good’ objects in less than 60 seconds, a new episode was begun immediately. The agent was spawned in a random location in the room at the start of each new episode.

During transfer, the agent was placed into the held out conjunction of object types and room background; see $D_{T}$ in Fig. 2.

Visual pre-training was performed in other conjunctions of object type and room background, denoted $D_{U}$ in Fig. 2.

The observation size of frames in the DeepMind Lab task was 84x84x3 ($H \times W \times C$).

A.2.2 MuJoCo/Jaco Arm Experiments

As described in Sec 3.2, the source task consisted of an agent learning to control a simulated arm in order to reach toward an object. A shaping reward was used, with a maximum value of 1 when the centre of the object fell between the pinch and grip sites of the end effector, or within a 10cm distance of the two. Distances on the x and y dimensions counted double compared to distances on the z dimension.

During each episode the object was placed at a random drop point within a 40x40cm area, and the arm was set to a random initial start position high above the work-space, independent of the object’s position. Each episode lasted for 150 steps, or 7.5 seconds, with a control step of 50ms. Observations $s^{o}_{U}$ were sampled randomly across episodes. Overall, 4 million frames of dimensions 64x64x3 ($H \times W \times C$) were used for this stage of the curriculum. For each episode the camera position and orientation were randomly sampled from an isotropic normal distribution centred around the approximate position and orientation of the real camera, with standard deviation 0.01. No precise measurements were used to match the two. Work-space table colour was sampled uniformly between $-5\%$ and $+5\%$ around the midpoint, independently for each RGB channel; object colours were sampled uniformly at random in RGB space, rejecting colours which fell within a ball around 10 held-out intensities (radius $10\%$ of range); the latter were only used for simulated transfer experiments, i.e. in $D_{T}$ in the sim2sim experiments. Additionally, Gaussian noise with standard deviation 0.01 was added to the observations $s^{o}_{T}$ in the sim2sim task.

For the real Jaco arm and its MuJoCo simulation counterpart, each of the nine joints could independently take 11 different actions (a linear discretisation of the continuous velocity action space). In simulation, Gaussian noise with standard deviation 0.1 was added to each discrete velocity output; delays in the real setup between observations and action execution were simulated by randomly mixing velocity outputs from two previous steps instead of emitting the last output directly. Speed ranges were between $-50\%$ and $50\%$ of the Jaco arm’s top speed on joints 1 through 6 starting at the base, while the fingers could use a full range. For safety reasons the speed ranges were reduced by a factor of 0.3 while evaluating agents on the Jaco arm, without significant performance degradation.

A.3 Vision model details

A denoising autoencoder (DAE) was used as a model to provide the feature space for the $\beta$-VAE reconstruction loss to be computed over (for motivation, see Sec. 2.3.1). The DAE was trained with occlusion-style masking noise in the vein of (Pathak et al., 2016), with the aim for the DAE to learn a semantic representation of the input frames. Concretely, two values were independently sampled from $U[0,W]$ and two from $U[0,H]$, where $W$ and $H$ were the width and height of the input frames. These four values determined the corners of the rectangular mask applied; all pixels that fell within the mask were set to zero.
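The masking noise described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `apply_occlusion_mask`, the use of NumPy, and the rounding of the continuous corner samples down to pixel indices are all assumptions.

```python
import numpy as np

def apply_occlusion_mask(frame, rng):
    """Zero out a random axis-aligned rectangle of an (H, W, C) frame."""
    height, width = frame.shape[:2]
    # Two values from U[0, W] and two from U[0, H] give the mask corners.
    x0, x1 = np.sort(rng.uniform(0, width, size=2))
    y0, y1 = np.sort(rng.uniform(0, height, size=2))
    # Assumption: continuous corners are rounded down to pixel indices.
    rows = slice(int(y0), int(y1))
    cols = slice(int(x0), int(x1))
    noisy = frame.copy()
    noisy[rows, cols, :] = 0.0  # all pixels inside the mask are set to zero
    return noisy

rng = np.random.default_rng(0)
frame = np.ones((84, 84, 3), dtype=np.float32)
noisy = apply_occlusion_mask(frame, rng)
```

The DAE would then be trained to map `noisy` back to the un-noised `frame`.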

The DAE architecture consisted of four convolutional layers, each with kernel size 4 and stride 2 in both the height and width dimensions. The number of filters learnt for each layer was {32, 32, 64, 64} respectively. The bottleneck layer consisted of a fully connected layer of size 100 neurons. This was followed by four deconvolutional layers, again with kernel sizes 4, strides 2, and {64, 64, 32, 32} filters. The padding algorithm used was ‘SAME’ in TensorFlow (Abadi et al., 2015). ReLU non-linearities were used throughout.
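As a quick check on the feature-map sizes implied by the four stride-2 convolutions, the sketch below applies the TensorFlow ‘SAME’ padding rule, under which the spatial output size is the input size divided by the stride, rounded up, independent of kernel size. The helper name is illustrative, and the frame sizes are those quoted in A.2; only the downsampling (convolutional) half of the network is covered.

```python
import math

def same_padding_output_sizes(input_size, num_layers=4, stride=2):
    """Spatial size after each strided conv layer under 'SAME' padding.

    With 'SAME' padding the output size is ceil(input / stride),
    independent of the kernel size.
    """
    sizes = []
    size = input_size
    for _ in range(num_layers):
        size = math.ceil(size / stride)
        sizes.append(size)
    return sizes

lab_sizes = same_padding_output_sizes(84)   # 84x84 DeepMind Lab frames
jaco_sizes = same_padding_output_sizes(64)  # 64x64 MuJoCo/Jaco frames
print(lab_sizes, jaco_sizes)
```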

The model was trained with loss given by the L2 distance of the outputs from the original, un-noised inputs. The optimiser used was Adam (Kingma & Ba, 2014) with a learning rate of 1e-3.

A.3.2 $\beta$-VAE with Perceptual Similarity Loss

After training a DAE, as detailed in the previous section (in principle, the $\beta\text{-VAE}_{DAE}$ could also have been trained end-to-end in one pass, but we did not experiment with this), a $\beta\text{-VAE}_{DAE}$ was trained with perceptual similarity loss given by Eq. 2, repeated here:

Specifically, the input was passed through the $\beta$-VAE and a sampled reconstruction was passed through the pre-trained DAE up to a designated layer. (It is more typical to use the mean of the reconstruction distribution, but this does not induce any pressure on the Gaussians parametrising the decoder to reduce their variances. Hence full samples were used instead.) The L2 distance of this representation from the representation of the original input passed through the same layers of the DAE was then computed, and this formed the training loss for the $\beta$-VAE part of the $\beta\text{-VAE}_{DAE}$. (The representations were taken after passing through the layer but before passing through the following non-linearity. We also briefly experimented with taking the L2 loss post-activation but did not find a significant difference.) The DAE weights remained frozen throughout.
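The following NumPy sketch illustrates how such a loss could be assembled. It is not the authors' code: `dae_features` stands in for the frozen DAE truncated at the designated layer, `toy_dae_features` is a hypothetical placeholder, and the use of a squared L2 distance, the closed-form KL term for a diagonal Gaussian against a unit Gaussian prior, and the combination `reconstruction + beta * KL` are assumptions based on the standard $\beta$-VAE formulation.

```python
import numpy as np

def perceptual_similarity_loss(x, x_recon_sample, dae_features):
    """Squared L2 distance between DAE features of a sampled
    reconstruction and DAE features of the original input."""
    target = dae_features(x)
    recon = dae_features(x_recon_sample)
    return float(np.sum((recon - target) ** 2))

def gaussian_kl_to_unit_prior(mean, log_var):
    """KL(N(mean, exp(log_var)) || N(0, I)) for a diagonal Gaussian."""
    return float(0.5 * np.sum(np.exp(log_var) + mean ** 2 - 1.0 - log_var))

def beta_vae_dae_loss(x, x_recon_sample, mean, log_var, beta, dae_features):
    # Assumption: the two terms are combined as reconstruction + beta * KL.
    return (perceptual_similarity_loss(x, x_recon_sample, dae_features)
            + beta * gaussian_kl_to_unit_prior(mean, log_var))

def toy_dae_features(frame):
    """Hypothetical stand-in for the frozen, pre-trained DAE layers."""
    return frame.reshape(-1)[::2]

x = np.zeros((4, 4, 3))
x_hat = np.ones((4, 4, 3))
loss = beta_vae_dae_loss(x, x_hat, np.zeros(32), np.zeros(32), 1.0,
                         toy_dae_features)
```

Because the features are computed by a frozen network, gradients would flow only into the $\beta$-VAE parameters; the sketch omits the optimisation step.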

The $\beta$-VAE architecture consisted of an encoder of four convolutional layers, each with kernel size 4, and stride 2 in the height and width dimensions. The number of filters learnt for each layer was {32, 32, 64, 64} respectively. This was followed by a fully connected layer of size 256 neurons. The latent layer comprised 64 neurons parametrising 32 (marginally) independent Gaussian distributions. The decoder architecture was simply the reverse of the encoder, utilising deconvolutional layers. The decoder used was Gaussian, so that the number of output channels was $2C$, where $C$ was the number of channels that the input frames had. The padding algorithm used was ‘SAME’ in TensorFlow. ReLU non-linearities were used throughout.

The model was trained with the loss given by Eq. 3. Specifically, the disentangled model used for DARLA was trained with a $\beta$ hyperparameter value of 1 and the layer of the DAE used to compute the perceptual similarity loss was the last deconvolutional layer. The entangled model used for DARLA$_{\text{ENT}}$ was trained with a $\beta$ hyperparameter value of 0.1, and the last deconvolutional layer of the DAE was again used to compute the perceptual similarity loss.

The optimiser used was Adam with a learning rate of 1e-4.

A.3.3 $\beta$-VAE

For the MuJoCo/Jaco tasks, a standard $\beta$-VAE was used rather than the $\beta\text{-VAE}_{DAE}$ used for DeepMind Lab. The architecture of the VAE encoder, decoder and the latent size were exactly as described in the previous section A.3.2. $\beta$ for the disentangled $\beta$-VAE in DARLA was 175. $\beta$ for the entangled model DARLA$_{\text{ENT}}$ was 1, corresponding to the standard VAE of (Kingma & Welling, 2014).

The optimizer used was Adam with a learning rate of 1e-4.

A.3.4 Denoising Autoencoder for baseline

For the baseline model DARLA$_{\text{DAE}}$, we trained a denoising autoencoder with occlusion-style masking noise as described in Appendix Section A.3.1. The architecture exactly matched that of the $\beta$-VAE described in Appendix Section A.3.2; however, all stochastic nodes were replaced with deterministic neurons.

The optimizer used was Adam with a learning rate of 1e-4.

A.4 Reinforcement Learning Algorithm Details

The action space in the DeepMind Lab task consisted of 8 discrete actions.

DQN: in DQN, the convolutional (or ‘vision’) part of the Q-net was replaced with the encoder of the $\beta\text{-VAE}_{DAE}$ from stage 1 and frozen. DQN takes four consecutive frames as input in order to capture some aspect of environment dynamics in the agent’s state. In order to match this in our setup with a pre-trained vision stack $\mathcal{F}_{U}$, we passed each observation frame $s^{o}_{\{1..4\}}$ through the pre-trained model $s^{z}_{\{1..4\}}=\mathcal{F}_{U}(s^{o}_{\{1..4\}})$ and then concatenated the outputs together to form the $k$-dimensional (where $k=4|s^{z}|$) input to the policy network. In this case the size of $s^{z}$ was 64 for DARLA as well as DARLA$_{\text{ENT}}$, DARLA$_{\text{DAE}}$ and DARLA$_{\text{FT}}$.
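The frame-stacking step above can be illustrated with a short sketch. The encoder here is a hypothetical placeholder for the frozen $\mathcal{F}_{U}$ (it simply truncates the flattened frame) and is used only to make the shapes concrete; with $|s^{z}|=64$ the policy input has $k=4\times 64=256$ dimensions.

```python
import numpy as np

LATENT_SIZE = 64  # |s^z| as stated in the text
NUM_FRAMES = 4    # consecutive frames per DQN state

def hypothetical_encoder(frame):
    """Placeholder for the frozen vision model F_U (not a real encoder)."""
    flat = frame.reshape(-1).astype(np.float64)
    return np.resize(flat, LATENT_SIZE)

def build_policy_input(frames, encoder):
    """Encode each frame separately, then concatenate the latents."""
    latents = [encoder(frame) for frame in frames]
    return np.concatenate(latents, axis=0)

frames = [np.full((84, 84, 3), i, dtype=np.float32) for i in range(NUM_FRAMES)]
policy_input = build_policy_input(frames, hypothetical_encoder)
print(policy_input.shape)
```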

On top of the frozen convolutional stack, two ‘policy’ layers of 512 neurons each were used, with a final linear layer of 8 neurons corresponding to the size of the action space in the DeepMind Lab task. ReLU non-linearities were used throughout. All other hyperparameters were as reported in (Mnih et al., 2015).

A3C: in A3C, as with DQN, the convolutional part of the network that is shared between the policy net and the value net was replaced with the encoder of the $\beta\text{-VAE}_{DAE}$ in DeepMind Lab tasks. All other hyperparameters were as reported in (Mnih et al., 2016).

Episodic Control: for the Episodic Controller-based DARLA we used mostly the same hyperparameters as in the original paper by Blundell et al. (2016). We explored the following hyperparameter settings: number of nearest neighbours $\in\{10,50\}$, return horizon $\in\{100,400,800,1800,500000\}$, kernel type $\in$ {inverse, gaussian}, kernel width $\in$ {1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.5, 0.99}, and we tried training EC with and without Peng’s $Q(\lambda)$ (Peng, 1993). In practice we found that none of the explored hyperparameter choices significantly influenced the results of our experiments. The final hyperparameters used for all experiments reported in the paper were the following: number of nearest neighbours: 10, return horizon: 400, kernel type: inverse, kernel width: 1e-6 and no Peng’s $Q(\lambda)$ (Peng, 1993).

UNREAL: We used a vanilla version of UNREAL, with parameters as reported in (Jaderberg et al., 2017).

A.4.2 MuJoCo/Jaco Arm Experiments

For the real Jaco arm and its MuJoCo simulation, each of the nine joints could independently take 11 different actions (a linear discretisation of the continuous velocity action space). Therefore the action space size was 99.

DARLA for MuJoCo/Jaco was based on feedforward A3C (Mnih et al., 2016). We closely followed the simulation training setup of (Rusu et al., 2016) for feed-forward networks using raw visual-input only. In place of the usual conv-stack, however, we used the encoder of the $\beta$-VAE as described in Appendix A.3.3. This was followed by a linear layer with 512 units, a ReLU non-linearity and a collection of 9 linear and softmax layers for the 9 independent policy outputs, as well as a single output layer that produced the value function.
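To make the factorised action space concrete, the sketch below samples one of 11 discrete actions for each of the 9 joints from independent softmax heads, giving $9\times 11=99$ policy outputs in total. The logits are random placeholders rather than network outputs, and the helper names are illustrative assumptions.

```python
import numpy as np

NUM_JOINTS = 9
ACTIONS_PER_JOINT = 11

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def sample_joint_actions(logits, rng):
    """Sample one discrete action per joint from independent softmax heads."""
    probs = softmax(logits)  # shape (NUM_JOINTS, ACTIONS_PER_JOINT)
    return np.array([rng.choice(ACTIONS_PER_JOINT, p=p) for p in probs])

rng = np.random.default_rng(0)
# Placeholder logits; in the agent these would come from the 9 linear heads.
logits = rng.normal(size=(NUM_JOINTS, ACTIONS_PER_JOINT))
actions = sample_joint_actions(logits, rng)
total_policy_outputs = NUM_JOINTS * ACTIONS_PER_JOINT
```

Factorising the policy this way keeps the output size linear in the number of joints, instead of enumerating all $11^{9}$ joint action combinations.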

A.5 Disentanglement Evaluation

In order to choose the optimal value of $\beta$ for the $\beta\text{-VAE}_{DAE}$ models and evaluate the fitness of the representations $s^{z}_{U}$ learnt in stage 1 of our pipeline (in terms of disentanglement achieved), we used the visual inspection heuristic described in (Higgins et al., 2017). The heuristic involved clustering trained $\beta$-VAE based models based on the number of informative latents (estimated as the number of latents $z_{i}$ with average inferred standard deviation below 0.75). For each cluster we examined the degree of learnt disentanglement by running inference on a number of seed images, then traversing each latent unit $z_{\{i\}}$ one at a time over three standard deviations away from its average inferred mean while keeping all other latents $z_{\{\setminus i\}}$ fixed to their inferred values. This allowed us to visually examine whether each individual latent unit $z_{i}$ learnt to control a single interpretable factor of variation in the data. A similar heuristic has been the de rigueur method for exhibiting disentanglement in the disentanglement literature (Chen et al., 2016; Kulkarni et al., 2015).
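The two ingredients of this heuristic, counting informative latents and building a single-latent traversal, might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, the traversal spans three units either side of the mean (i.e. it assumes a unit standard deviation, as under the prior), and the number of traversal steps is arbitrary.

```python
import numpy as np

def count_informative_latents(inferred_std, threshold=0.75):
    """inferred_std: (num_images, num_latents) posterior std devs.

    A latent counts as informative if its average inferred standard
    deviation falls below the threshold.
    """
    return int(np.sum(inferred_std.mean(axis=0) < threshold))

def traverse_latent(z_inferred, index, latent_mean, num_steps=7, span=3.0):
    """Return copies of z_inferred in which only latent `index` varies.

    Assumption: the sweep covers latent_mean +/- span with unit std.
    """
    values = latent_mean + np.linspace(-span, span, num_steps)
    traversal = np.tile(z_inferred, (num_steps, 1))
    traversal[:, index] = values
    return traversal

stds = np.array([[0.1, 1.0, 0.5],
                 [0.2, 0.98, 0.9]])
num_informative = count_informative_latents(stds)

z = np.array([0.5, -1.0, 2.0])
traversal = traverse_latent(z, index=1, latent_mean=0.0)
```

Each row of `traversal` would be decoded into an image, and the resulting sequence inspected for a single interpretable change.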

A.5.2 Transfer Metric Details

In the case of DeepMind Lab, we were able to use the ground truth labels corresponding to the two factors of variation of the object type and the background to design a proxy to the disentanglement metric proposed in (Higgins et al., 2017). The procedure used consisted of the following steps:

1) Train the model under consideration on observations $s^{o}_{U}$ to learn $\mathcal{F}_{U}$, as described in stage 1 of the DARLA pipeline.

3) The trained linear model $\mathcal{L}$’s accuracy is evaluated on the held out subset of the Cartesian product $M\times N$.

Although the above procedure only measures disentangling up to linearity, and only does so for the latents of object type and room background, we nevertheless found that the metric was highly correlated with disentanglement as determined via visual inspection (see Fig. 6).

A.6 Background on RL Algorithms

In this Appendix, we provide background on the different RL algorithms that the DARLA framework was tested on in this paper.

A.6.2 A3C

Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016) is an asynchronous implementation of the advantage actor-critic paradigm (Sutton & Barto, 1998; Degris & Sutton, 2012), where separate threads run in parallel and perform updates to shared parameters. The different threads each hold their own instance of the environment and have different exploration policies, thereby decorrelating parameter updates without the need for experience replay.

A.6.3 UNREAL

The UNREAL agent (Jaderberg et al., 2017) takes as a base an LSTM A3C agent (Mnih et al., 2016) and augments it with a number of unsupervised auxiliary tasks that make use of the rich perceptual data available to the agent besides the (sometimes very sparse) extrinsic reward signals. This auxiliary learning tends to improve the representation learnt by the agent. While training the base agent, its observations, rewards, and actions are stored in a replay buffer, which is used by the auxiliary learning tasks. The tasks include: 1) pixel control – the agent learns how to control the environment by training auxiliary policies to maximally change pixel intensities in different parts of the input; 2) reward prediction - given a replay buffer of observations within a short time period of an extrinsic reward, the agent has to predict the reward obtained during the next unobserved timestep using a sequence of three preceding steps; 3) value function replay - extra training of the value function to promote faster value iteration.

A.6.4 Episodic Control

In its simplest form EC is a lookup table of states and actions denoted as $Q^{EC}(s,a)$. In each state EC picks the action with the highest $Q^{EC}$ value. At the end of each episode $Q^{EC}(s,a)$ is set to $R_{t}$ if $(s_{t},a_{t})\notin Q^{EC}$, where $R_{t}$ is the discounted return. Otherwise $Q^{EC}(s,a)=\max\left\{Q^{EC}(s,a),R_{t}\right\}$. In order to generalise its policy to novel states that are not in $Q^{EC}$, EC uses a non-parametric nearest neighbours search $\widehat{Q^{EC}}(s,a)=\frac{1}{k}\sum_{i=1}^{k}Q^{EC}(s^{i},a)$, where $s^{i},i=1,...,k$ are $k$ states with the smallest distance to the novel state $s$. Like DQN, EC takes a concatenation of four frames as input.
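The table update and nearest-neighbour estimate above can be sketched in a few lines of pure Python. This is an illustration of the simplest form described here, not the implementation used in the experiments: the class name, the Euclidean distance, the default value of 0 for actions with no stored entries, and averaging over fewer than $k$ neighbours when the table is small are all assumptions.

```python
import math

class EpisodicController:
    def __init__(self, num_neighbours=10):
        self.k = num_neighbours
        self.table = {}  # action -> {state tuple: highest return seen}

    def update(self, state, action, discounted_return):
        """Q(s,a) <- R if unseen, else max(Q(s,a), R)."""
        per_action = self.table.setdefault(action, {})
        key = tuple(state)
        previous = per_action.get(key, -math.inf)
        per_action[key] = max(previous, discounted_return)

    def estimate(self, state, action):
        """Exact lookup if stored, else the mean over the k nearest states."""
        per_action = self.table.get(action, {})
        key = tuple(state)
        if key in per_action:
            return per_action[key]
        if not per_action:
            return 0.0  # assumption: default value for untried actions
        nearest = sorted(per_action, key=lambda s: math.dist(s, key))[: self.k]
        return sum(per_action[s] for s in nearest) / len(nearest)

    def act(self, state, actions):
        """Pick the action with the highest (estimated) value."""
        return max(actions, key=lambda a: self.estimate(state, a))

ec = EpisodicController(num_neighbours=2)
ec.update((0.0, 0.0), 0, 1.0)
ec.update((0.0, 0.0), 0, 0.5)   # lower return: stored value stays 1.0
ec.update((1.0, 0.0), 0, 3.0)
ec.update((5.0, 5.0), 0, 10.0)
ec.update((0.0, 0.0), 1, 0.2)
novel_value = ec.estimate((0.5, 0.1), 0)  # mean of the two nearest entries
```

In DARLA the states would be the concatenated latent vectors $s^{z}$ rather than the two-dimensional toy states used here.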

The EC algorithm is proposed as a model of fast hippocampal instance-based learning in the brain (Marr, 1971; Sutherland & Rudy, 1989), while the deep RL algorithms described above are more analogous to slow cortical learning that relies on generalised statistical summaries of the input distribution (McClelland et al., 1995; Norman & O’Reilly, 2003; Tulving et al., 1991).

A.7 Source Task Performance Results

The focus of this paper is primarily on zero-shot domain adaptation performance. However, it is also interesting to analyse the effect of the DARLA approach on source domain policy performance. In order to compare the models’ behaviour on the source task, we examined the training curves (see Figures 7-10) and noted in particular their:

Asymptotic task performance, i.e. the rewards per episode at the point where $\pi_{S}$ has converged for the agent under consideration.

Data efficiency, i.e. how quickly the training curve was able to achieve convergence.

We note the following consistent trends across the results:

Using DARLA provided an initial boost in learning performance, which depended on the degree of disentanglement of the representation. This was particularly observable in A3C, see Fig. 8.

Baseline algorithms where $\mathcal{F}$ could be fine-tuned to the source task were able to achieve higher asymptotic performance. This was particularly notable on DQN and A3C (see Figs. 7 and 8) in DeepMind Lab. However, in both those cases, DARLA was able to learn very reasonable policies on the source task which were on the order of 20% lower than the fine-tuned models, arguably a worthwhile sacrifice for a subsequent median 270% improvement in target domain performance noted in the main text.

Allowing DARLA to fine-tune its vision module (DARLA$_{\text{FT}}$) boosted its source task learning speed, and allowed the agent to asymptote at the same level as the baseline algorithms. As discussed in the main text, this comes at the cost of significantly reduced domain transfer performance on A3C. For DQN, however, fine-tuning appears to offer the best of both worlds.

Perhaps most relevantly for this paper, even if solely examining source task performance, DARLA outperforms both DARLA$_{\text{ENT}}$ and DARLA$_{\text{DAE}}$ on both asymptotic performance and data efficiency, suggesting that disentangled representations have wider applicability in RL beyond the zero-shot domain adaptation that is the focus of this paper.