Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning

Denis Yarats, Rob Fergus, Alessandro Lazaric, Lerrel Pinto

Introduction

Creating sample-efficient continuous control methods that observe high-dimensional images has been a long-standing challenge in reinforcement learning (RL). Over the last three years, the RL community has made significant headway on this problem, substantially improving sample efficiency. The key insight to solving visual control is the learning of better low-dimensional representations, either through autoencoders (Yarats et al., 2019; Finn et al., 2015), variational inference (Hafner et al., 2018, 2019; Lee et al., 2019), contrastive learning (Srinivas et al., 2020; Yarats et al., 2021a), self-prediction (Schwarzer et al., 2020b), or data augmentations (Yarats et al., 2021b; Laskin et al., 2020). However, current state-of-the-art model-free methods are still limited in three ways. First, they are unable to solve the more challenging visual control problems such as quadruped and humanoid locomotion. Second, they often require significant computational resources, i.e., lengthy training times using distributed multi-GPU infrastructure. Lastly, it is often unclear how different design choices affect overall system performance.

In this paper we present DrQ-v2, a simple model-free algorithm that builds on the idea of using data augmentations (Yarats et al., 2021b; Laskin et al., 2020) to solve hard visual control problems. Most notably, it is the first model-free method that solves complex humanoid tasks directly from pixels. Compared to previous state-of-the-art model-free methods, DrQ-v2 provides significant improvements in sample efficiency across tasks from the DeepMind Control Suite (Tassa et al., 2018). Conceptually simple, DrQ-v2 is also computationally efficient, which allows solving most tasks in the DeepMind Control Suite in just $8$ hours on a single GPU (see Figure 1). Recently, a model-based method, DreamerV2 (Hafner et al., 2020), was also shown to solve visual continuous control problems, and it was the first to solve the humanoid locomotion problem from pixels. While our model-free DrQ-v2 matches DreamerV2 in terms of sample efficiency and performance, it does so $4\times$ faster in terms of wall-clock training time. We believe this makes DrQ-v2 a more accessible approach to support research in visual continuous control, and it reopens the question of whether model-free or model-based methods are the more suitable approach for this type of task.

DrQ-v2, which is detailed in Section 3, improves upon DrQ (Yarats et al., 2021b) by making several algorithmic changes: (i) switching the base RL algorithm from SAC (Haarnoja et al., 2018b) to DDPG (Lillicrap et al., 2015a), (ii) incorporating multi-step returns, which this switch makes straightforward, (iii) adding bilinear interpolation to the random shift image augmentation, (iv) introducing an exploration schedule, and (v) selecting better hyper-parameters, including a larger replay buffer capacity. A careful ablation study of these design choices is presented in Section 4.4. Furthermore, we re-examine the original implementation of DrQ and identify several computational bottlenecks such as replay buffer management, data augmentation processing, batch size, and frequency of learning updates (see Section 3.2). To remedy these, we have developed a new implementation that both achieves better performance and trains around $3.5$ times faster with respect to wall-clock time than the previous implementation on the same hardware, with an increase in environment frame throughput (FPS) from $28$ to $96$ (i.e., it takes $10^{6}/96/3600\approx 2.9$ hours to train for 1M environment steps).

Background

2.2 Deep Deterministic Policy Gradient

2.3 Data Augmentation in Reinforcement Learning

Recently, it has been shown that data augmentation techniques, commonplace in computer vision, are also important for achieving state-of-the-art performance in image-based RL (Yarats et al., 2021b; Laskin et al., 2020). For example, the state-of-the-art algorithm for visual RL, DrQ (Yarats et al., 2021b), builds on top of Soft Actor-Critic (Haarnoja et al., 2018b), a model-free actor-critic algorithm, by adding a convolutional encoder and data augmentation in the form of random shifts. The use of such data augmentations now forms an essential component of several recent visual RL algorithms (Srinivas et al., 2020; Raileanu et al., 2020; Yarats et al., 2021a; Stooke et al., 2020; Hansen and Wang, 2021; Schwarzer et al., 2020b).

DrQ-v2: Improved Data-Augmented Reinforcement Learning

In this section, we describe DrQ-v2, a simple model-free actor-critic RL algorithm for image-based continuous control that builds upon DrQ.

As in DrQ, we apply a random shift image augmentation to pixel observations of the environment. In the visual continuous control setting of DMC, this augmentation can be instantiated by first padding each side of the $84\times 84$ observation rendering by $4$ pixels (by repeating boundary pixels), and then selecting a random $84\times 84$ crop, yielding the original image shifted by $\pm 4$ pixels. We also find it useful to apply bilinear interpolation on top of the shifted image (i.e., we replace each pixel value with the average of the four nearest pixel values). In our experience, this modification provides an additional performance boost across the board.
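The pad-and-crop step can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: it operates on a single-channel image stored as a list of rows, it omits the bilinear interpolation step, and the function name `random_shift` and its parameters are hypothetical. A practical version would apply the same offsets to every channel of a batched tensor.

```python
import random


def random_shift(image, pad=4, rng=random):
    """Shift a 2-D image (a list of rows) by up to +/- `pad` pixels.

    The image is padded on every side by repeating its boundary pixels,
    and a random crop of the original size is taken from the padded image.
    Bilinear interpolation is intentionally left out of this sketch.
    """
    h, w = len(image), len(image[0])
    # Replicate-pad the columns of every row, then replicate the first/last rows.
    rows = [[row[0]] * pad + list(row) + [row[-1]] * pad for row in image]
    padded = (
        [list(rows[0]) for _ in range(pad)]
        + rows
        + [list(rows[-1]) for _ in range(pad)]
    )
    # An offset of `pad` reproduces the input; offsets 0..2*pad give +/- pad shifts.
    top = rng.randint(0, 2 * pad)
    left = rng.randint(0, 2 * pad)
    return [r[left:left + w] for r in padded[top:top + h]]


# Example: an 84x84 image keeps its size after the shift.
image = [[i * 84 + j for j in range(84)] for i in range(84)]
shifted = random_shift(image, pad=4, rng=random.Random(0))
```

Because the padding repeats boundary pixels, every value in the shifted image already occurs in the input; only its position changes.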

Image Encoder

Actor-Critic Algorithm

We use DDPG (Lillicrap et al., 2015a) as a backbone actor-critic RL algorithm and, similarly to Barth-Maron et al. (2018), augment it with $n$-step returns to estimate TD error. This results in faster reward propagation and overall learning progress (Mnih et al., 2016a). While some methods (Hafner et al., 2020) employ more sophisticated techniques such as TD($\lambda$) or Retrace($\lambda$) (Munos et al., 2016), they are often computationally demanding when $n$ is large. We find that using simple $n$-step returns, without an importance sampling correction, strikes a good balance between performance and efficiency. We also employ clipped double Q-learning (Fujimoto et al., 2018) to reduce overestimation bias in the target value. Practically, this requires training two Q-functions $Q_{\theta_{1}}$ and $Q_{\theta_{2}}$. For this, we sample a mini-batch of transitions $\tau=({\bm{x}}_{t},{\bm{a}}_{t},r_{t:t+n-1},{\bm{x}}_{t+n})$ from the replay buffer ${\mathcal{D}}$ and compute the following two losses:
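One common way to combine $n$-step returns with clipped double Q-learning can be sketched on scalars as follows. This is an illustrative formulation under stated assumptions, not necessarily the exact losses used here: the discount `gamma`, the function names, and the use of target-critic estimates `q1_next` and `q2_next` at $({\bm{x}}_{t+n},{\bm{a}}_{t+n})$ are assumptions introduced for the example.

```python
def n_step_target(rewards, q1_next, q2_next, gamma=0.99):
    """TD target built from n rewards plus a bootstrapped clipped double-Q value.

    rewards:  [r_t, ..., r_{t+n-1}] taken from the sampled transition
    q1_next:  first target-critic estimate at (x_{t+n}, a_{t+n})   (assumed input)
    q2_next:  second target-critic estimate at (x_{t+n}, a_{t+n})  (assumed input)
    gamma:    discount factor (illustrative default)
    """
    n = len(rewards)
    discounted_rewards = sum(gamma ** i * r for i, r in enumerate(rewards))
    # Clipped double Q-learning: bootstrap from the smaller of the two estimates.
    return discounted_rewards + gamma ** n * min(q1_next, q2_next)


def critic_losses(q1, q2, target):
    """Squared TD error of each critic against the shared target."""
    return (q1 - target) ** 2, (q2 - target) ** 2


# Example with n = 3 rewards and an easy-to-check discount of 0.5.
y = n_step_target([1.0, 1.0, 1.0], q1_next=10.0, q2_next=12.0, gamma=0.5)
loss_1, loss_2 = critic_losses(2.0, 4.0, y)
```

Both critics regress onto the same target, so no per-step entropy term or importance weight appears anywhere in the computation.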

Scheduled Exploration Noise

Empirically, we observe that it is helpful to have different levels of exploration at different stages of learning. At the beginning of training we want the agent to be more stochastic and explore the environment more effectively, while at the later stages of training, when the agent has already identified promising behaviors, it is better to be more deterministic and master those behaviors. Similar to Amos et al. (2020), we instantiate this idea by using linear decay $\sigma(t)$ for the variance $\sigma^{2}$ of the exploration noise defined as:
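A linear decay of this kind can be sketched as a clipped interpolation between an initial and a final value. The parameter names `sigma_init`, `sigma_final`, and `duration`, as well as the numbers in the example, are hypothetical and chosen only for illustration.

```python
def linear_schedule(step, sigma_init, sigma_final, duration):
    """Linearly move from sigma_init to sigma_final over `duration` steps.

    After `duration` steps the schedule stays at sigma_final.
    """
    mix = min(max(step / duration, 0.0), 1.0)
    return (1.0 - mix) * sigma_init + mix * sigma_final


# Example: more exploration noise early on, less later (illustrative values).
early = linear_schedule(0, sigma_init=1.0, sigma_final=0.1, duration=100)
late = linear_schedule(200, sigma_init=1.0, sigma_final=0.1, duration=100)
```

Clipping the interpolation weight keeps the noise level constant once the schedule has finished, so the agent acts with a fixed, small amount of noise for the remainder of training.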

Key Hyper-Parameter Changes

We also conduct an extensive hyper-parameter search and identify several useful hyper-parameter modifications compared to DrQ. The three most important hyper-parameters are: (i) the size of the replay buffer, (ii) the mini-batch size, and (iii) the learning rate. Specifically, we use a $10$ times larger replay buffer than DrQ. We also use a smaller mini-batch size of $256$ without any noticeable performance degradation. This is in contrast to CURL (Srinivas et al., 2020) and DrQ (Yarats et al., 2021b), which both use a larger batch size of $512$ to attain more stable training at the expense of computational efficiency. Finally, we find that using a smaller learning rate of $1\times 10^{-4}$, rather than DrQ’s learning rate of $1\times 10^{-3}$, results in more stable training without any loss in learning speed.

3.2 Implementation Details

We replace DrQ’s random shifts augmentation (i.e., kornia.augmentation.RandomCrop) by a custom implementation that uses flow-field image sampling provided in PyTorch (i.e., grid_sample). This is done for two reasons. First, we noticed that Kornia’s implementation does not fully utilize GPU pipelining, since it performs some intermediate CPU-to-GPU data transfers that break the computational flow. Second, using grid_sample allows straightforward addition of bilinear interpolation. Our custom random shifts augmentation improves training throughput by a factor of $2$.

Faster Replay Buffer

Another computational bottleneck of DrQ was the replay buffer. The specific implementation had poor memory management, which resulted in slow CPU-to-GPU data transfer and also restricted the number of image-based transitions that could be stored. We reimplemented the replay buffer to address these issues, which led to a ten-fold increase in storage capacity and faster data transfer. More details are available in our open-source release. We note that the improved training speed of DrQ-v2 was key to solving humanoid tasks, as it enabled much faster experimentation.

Experiments

In this section we provide empirical evaluation of DrQ-v2 on an extensive set of visual continuous control tasks from DMC (Tassa et al., 2018). We first present comparison to prior methods, both model-free and model-based, in terms of sample efficiency and wall-clock time. We then present a large scale ablation study that guided the final version of DrQ-v2.

We consider a set of MuJoCo tasks (Todorov et al., 2012) provided by DMC (Tassa et al., 2018), a widely used benchmark for continuous control. DMC offers environments of various difficulty, ranging from simple control problems such as the single degree of freedom (DOF) pendulum and cartpole, to the control of complex multi-joint bodies such as the humanoid (21 DOF). We consider learning from pixels. In this setting, environment observations are stacks of $3$ consecutive RGB images of size $84\times 84$, stacked along the channel dimension to enable inference of dynamic information like velocity and acceleration. In total, we consider $24$ different tasks, which we group into three buckets, easy, medium, and hard, according to the sample complexity to reach near-optimal performance (see Appendix A). Our motivation for this is to encourage RL practitioners to focus on the medium and hard tasks and stop using the easy tasks for evaluation, as they are mostly solved at this point and may no longer provide any valuable signal in comparing different methods.

Training Details

For all tasks in the suite an episode corresponds to $1000$ steps, where the per-step reward lies in the unit interval $[0,1]$. This upper-bounds the episode return by $1000$, making it easier to compute aggregated performance measures across tasks, as we do in Figure 1a. To facilitate a fair wall-clock time comparison, all algorithms are trained on the same hardware (i.e., a single NVIDIA V100 GPU machine) and evaluated with the same periodicity of $20000$ environment steps. Each evaluation query averages episode returns over $10$ episodes. Per common practice (Hafner et al., 2019), we employ an action repeat of $2$ and measure sample complexity in environment steps rather than actor steps. In all figures we plot the mean performance over $10$ seeds together with shaded regions that represent $95\%$ confidence intervals. A full list of hyper-parameters can be found in Appendix B.
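The relationship between actor steps, environment steps, and the bound on the episode return can be made concrete with a small sketch. The toy environment `CountingEnv` and the helper names below are hypothetical and stand in for a real DMC task. The environment always emits the maximal reward of $1$, so the episode return reaches its upper bound.

```python
class CountingEnv:
    """Hypothetical toy environment: each step yields a reward in [0, 1]."""

    def __init__(self, episode_len=1000):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0

    def step(self, action):
        self.t += 1
        reward = 1.0  # maximal per-step reward
        done = self.t >= self.episode_len
        return reward, done


def step_with_repeat(env, action, repeat=2):
    """Apply one agent action for `repeat` environment steps, summing rewards."""
    total, done, env_steps = 0.0, False, 0
    for _ in range(repeat):
        reward, done = env.step(action)
        total += reward
        env_steps += 1
        if done:
            break
    return total, done, env_steps


def run_episode(env, repeat=2):
    """Return (episode return, actor steps, environment steps) for one episode."""
    env.reset()
    episode_return, actor_steps, env_steps, done = 0.0, 0, 0, False
    while not done:
        reward, done, used = step_with_repeat(env, action=0, repeat=repeat)
        episode_return += reward
        actor_steps += 1
        env_steps += used
    return episode_return, actor_steps, env_steps


episode_return, actor_steps, env_steps = run_episode(CountingEnv())
```

With an action repeat of $2$, the agent makes $500$ decisions in a $1000$-step episode, which is why sample complexity is reported in environment steps.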

Comparison Axes

In many real-world applications, taking a step in the environment incurs significant computational cost, making sample efficiency a critical feature of an RL algorithm. It is hence important to compare RL algorithms in terms of their sample efficiency. We facilitate this comparison by measuring an algorithm’s performance as episode return with respect to environment steps. On the other hand, striving for low sample complexity often comes at the cost of poor computational efficiency. Unfortunately, recent deep RL literature has paid very little attention to this important axis, which has led to skyrocketing hardware requirements. Such a trend has made it virtually impossible for an RL practitioner with modest hardware capacity to contribute to advancements in image-based RL, leaving research in this area to a few well-equipped labs. To democratize research in visual RL, we additionally propose to compare the agents in terms of wall-clock training time given the same single-GPU hardware. We note that it is possible to adapt DrQ-v2 to a distributed setup, as has been done for DDPG in prior work (Barth-Maron et al., 2018; Hoffman et al., 2020).

4.2 Comparison to Model-Free Methods

We compare our method to several state-of-the-art model-free algorithms for visual RL, including CURL (Srinivas et al., 2020), DrQ (Yarats et al., 2021b), and vanilla SAC (Haarnoja et al., 2018b) augmented with the convolutional encoder from SAC-AE (Yarats et al., 2019). Vanilla SAC is a weak baseline and is only included as a reference point to showcase the recent progress in visual RL.

Sample Efficiency Axis

We present results on the hard (Figure 3), medium (Figure 4), and easy (Figure 5) subsets of the DMC tasks, where the maximum number of environment interactions is limited to thirty, three, and one million steps respectively. Our empirical study reveals that DrQ-v2 outperforms prior model-free methods in terms of sample efficiency across the three benchmarks with different levels of difficulty. Importantly, DrQ-v2’s advantage is more pronounced on harder tasks (i.e., acrobot, quadruped, and humanoid), where exploration is especially challenging. Finally, DrQ-v2 solves the DMC humanoid locomotion tasks directly from pixels, which, to the best of our knowledge, is the first successful demonstration of such a feat by a model-free method.

Compute Efficiency Axis

To facilitate a fair comparison in terms of sheer wall-clock training time, besides employing the identical training protocol (see Section 4.1), we also use the same mini-batch size of $256$ for each agent. In Figure 6, we evaluate DrQ-v2 on a subset of DMC tasks for the sake of brevity, and note that the demonstrated results can be easily extrapolated to the other tasks given the linear dependency between training time and sample complexity. In our benchmarks, DrQ-v2 is able to achieve a throughput of $96$ FPS, which compares favorably to DrQ’s $28$ FPS (a $3.4\times$ increase) and CURL’s $16$ FPS (a $6\times$ increase). Practically, DrQ-v2 solves easy, medium, and hard tasks within $2.9$, $8.6$, and $86$ hours respectively.
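The wall-clock figures follow from the throughput by simple arithmetic, which the sketch below reproduces. The helper names are hypothetical. The step budgets of 1M, 3M, and 30M environment steps are the limits used for the easy, medium, and hard benchmarks.

```python
def training_hours(env_steps, fps):
    """Wall-clock hours needed to process `env_steps` at `fps` steps per second."""
    return env_steps / fps / 3600


def speedup(fps_new, fps_old):
    """Throughput ratio between two implementations."""
    return fps_new / fps_old


# Step budgets for the three benchmarks at DrQ-v2's throughput of 96 FPS.
hours = {name: training_hours(steps, 96)
         for name, steps in [("easy", 1e6), ("medium", 3e6), ("hard", 30e6)]}

# Throughput ratios against DrQ (28 FPS) and CURL (16 FPS).
vs_drq = speedup(96, 28)
vs_curl = speedup(96, 16)
```

Evaluating these expressions gives roughly $2.9$, $8.7$, and $86.8$ hours, in line with the reported training times, and throughput ratios of about $3.4$ and exactly $6$.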

4.3 Comparison to Model-Based Methods

To see how DrQ-v2 stacks up against model-based methods, which tend to achieve better sample complexity at the expense of a larger computational footprint, we also compare to recent and unpublished improvements to Dreamer-v2 (Hafner et al., 2020), a leading model-based approach for visual continuous control (the arXiv v3 revision from May 3, 2021 introduces a new result on the Humanoid Walk task in Appendix A). The recent update shows that the model-based approach can solve the DMC humanoid tasks directly from pixel inputs. The open-source implementation of Dreamer-v2 (https://github.com/danijar/dreamerv2) only provides learning curves for Humanoid Walk. For this reason we run their code to obtain results on other DMC tasks. To limit the hardware requirements of the compute-expensive Dreamer-v2, we only run it on a subset of $12$ out of the $24$ considered tasks. This subset, however, overlaps with all three (i.e., easy, medium, and hard) benchmarks.

Sample Efficiency Axis

Our empirical study in Figure 7 reveals that in many cases DrQ-v2, despite being a model-free method, can rival the sample efficiency of the state-of-the-art model-based Dreamer-v2. We note, however, that on several tasks (for example Acrobot Swingup and Finger Turn Hard) Dreamer-v2 outperforms DrQ-v2. We leave investigation of this discrepancy for future work.

Compute Efficiency Axis

A different picture emerges if the comparison is done with respect to wall-clock training time. Dreamer-v2, being a model-based method, performs significantly more floating point operations to reach its sample efficiency. In our benchmarks, Dreamer-v2 records a throughput of $24$ FPS, which is $4\times$ lower than DrQ-v2’s throughput of $96$ FPS, measured on the same hardware. In Figure 8 we plot learning curves against wall-clock time and observe that DrQ-v2 takes less time to solve the tasks.

4.4 Ablation Study

In this section we present an extensive ablation study that guided us to the final version of DrQ-v2. Here, for brevity we only discuss experiments that were most impactful and omit others that did not pan out. For computational reasons, we only ablate on $3$ different control tasks of various difficulty levels. Our findings are summarized in Figure 9 and detailed below.

DrQ (Yarats et al., 2021b) leverages SAC (Haarnoja et al., 2018b) as the backbone RL algorithm. While it has been demonstrated by many works, including the original manuscripts (Haarnoja et al., 2018b, a), that SAC is superior to DDPG (Lillicrap et al., 2015b), our careful examination identifies two shortcomings that prevent SAC (within DrQ) from solving image-based tasks that are hard in terms of exploration. First, the automatic entropy adjustment strategy, introduced in Haarnoja et al. (2018a), is inadequate and in some cases leads to a premature entropy collapse. This prevents the agent from finding better behaviors due to insufficient exploration. In Figure 9a, we empirically verify our intuition and, indeed, observe that DDPG demonstrates better exploration properties than SAC. Here, DDPG uses a constant $\sigma=0.2$ for the exploration noise.

$\mathbf{N}$-step Returns

The second issue concerns the inability of soft Q-learning to incorporate $n$-step returns to estimate TD error in a straightforward manner. The reason for this is that computing a target value for the soft Q-function requires estimating the per-step entropy of the policy, which is challenging to do for large $n$ in the off-policy regime. In contrast, DDPG does not require estimating per-step entropy to compute targets and is more amenable to $n$-step returns. In Figure 9b we demonstrate that estimating TD error with $n$-step returns improves sample efficiency over vanilla DDPG. We select $3$-step returns as a sensible choice for our method.

Replay Buffer Size

We hypothesize that a larger replay buffer plays an important role in circumventing the catastrophic forgetting problem (Fedus et al., 2020). This issue is especially prominent in tasks with more diverse initial state distributions (e.g., reacher or humanoid tasks), where the vast variety of possible behaviors requires significantly larger memory. We confirm this intuition by ablating the size of the replay buffer in Figure 9c, where we observe that a buffer size of 1M helps to improve performance on Reacher Hard considerably.

Scheduled Exploration Noise

Related Work

Successes of visual representation learning in computer vision (Vincent et al., 2008; Doersch et al., 2015; Wang and Gupta, 2015; Noroozi and Favaro, 2016; Zhang et al., 2017; Gidaris et al., 2018) have inspired successes in visual RL, where coherent representations are learned alongside RL. Works such as SAC-AE (Yarats et al., 2019), PlaNet (Hafner et al., 2018), and SLAC (Lee et al., 2019) demonstrated how auto-encoders (Finn et al., 2015) could improve visual RL. Following this, other self-supervised objectives such as contrastive learning in CURL (Srinivas et al., 2020) and ATC (Stooke et al., 2020), self-prediction in SPR (Schwarzer et al., 2020a), contrastive cluster assignment in Proto-RL (Yarats et al., 2021a), and augmented data in DrQ (Yarats et al., 2021b) and RAD (Laskin et al., 2020) have significantly narrowed the gap between state-based and image-based RL. Future prediction objectives (Hafner et al., 2018, 2019; Yan et al., 2020; Finn et al., 2015; Pinto et al., 2016; Agrawal et al., 2016) and other auxiliary objectives (Jaderberg et al., 2016; Zhan et al., 2020; Young et al., 2020; Chen et al., 2020) have shown improvements on a variety of problems ranging from gameplay to continuous control and robotics. In the context of visual control settings, clever use of augmented data (Yarats et al., 2021b; Laskin et al., 2020) currently produces state-of-the-art results on visual tasks from DMC (Tassa et al., 2018).

Humanoid Control

The humanoid control problem, first presented in Tassa et al. (2012), has been studied as one of the hardest control problems due to its large state and action spaces. The earliest solutions to this problem use ideas from model-based optimal control to generate policies given an accurate model of the humanoid. Subsequent works in RL have shown that model-free policies can solve the humanoid control problem given access to proprioceptive state observations. However, solving this problem from visual observations has remained challenging, with leading RL algorithms making little progress on the task (Tassa et al., 2018). Recently, Hafner et al. (2020) were able to solve this problem through a model-based technique in around 30M environment steps and $340$ hours of training on a single GPU machine. DrQ-v2, presented in this paper, marks the first model-free RL method that can solve humanoid control from visual observations, also taking around 30M steps but only $86$ hours of training on the same hardware.

Conclusion

We have introduced DrQ-v2, a conceptually simple model-free actor-critic RL algorithm for image-based continuous control. Our method provides a significantly better computational footprint and masters tasks from DMC (Tassa et al., 2018) directly from pixels, most notably the humanoid locomotion tasks that were previously unsolved by model-free approaches. Additionally, we have provided an efficient PyTorch implementation of DrQ-v2 that is publicly available at https://github.com/facebookresearch/drqv2. We hope that our algorithm will help to inspire and democratize further research in visual RL.

Acknowledgements

This research is supported in part by DARPA through the Machine Common Sense Program. We thank Brandon Amos, Kimin Lee, Mandi Zhao, and Younggyo Seo for insightful discussions that helped to shape our paper.

References

Appendix

Appendix A Benchmarks

We classify a set of $24$ continuous control tasks from DMC (Tassa et al., 2018) into easy, medium, and hard benchmarks and provide a summary for each task in Table 1.

Appendix B Hyper-parameters

The full list of hyper-parameters is presented in Table 2. While we tried to keep the settings identical for each task, there are a few specific deviations for some tasks. Walker Stand/Walk/Run: for all three tasks we use a mini-batch size of $512$ and an $n$-step return of $1$. Quadruped Run: we set the replay buffer size to $10^{5}$. Humanoid Stand/Walk: we set the learning rate to $8\times 10^{-5}$ and increase the feature dimension to $100$.
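The per-task deviations can be expressed as overrides on top of shared defaults, as in the sketch below. The dictionary layout, key names, and task identifiers are hypothetical and do not mirror the released configuration files. The defaults list only values stated in the text (mini-batch size $256$, $3$-step returns, learning rate $1\times 10^{-4}$, and a 1M replay buffer), so other entries of Table 2 are omitted.

```python
# Shared settings stated in the text (hypothetical key names).
DEFAULTS = {
    "batch_size": 256,
    "n_step": 3,
    "learning_rate": 1e-4,
    "replay_buffer_size": 10**6,
}

# Task-specific deviations listed in this appendix (hypothetical task identifiers).
TASK_OVERRIDES = {
    "walker_stand": {"batch_size": 512, "n_step": 1},
    "walker_walk": {"batch_size": 512, "n_step": 1},
    "walker_run": {"batch_size": 512, "n_step": 1},
    "quadruped_run": {"replay_buffer_size": 10**5},
    "humanoid_stand": {"learning_rate": 8e-5, "feature_dim": 100},
    "humanoid_walk": {"learning_rate": 8e-5, "feature_dim": 100},
}


def config_for(task):
    """Merge the shared defaults with any overrides registered for `task`."""
    return {**DEFAULTS, **TASK_OVERRIDES.get(task, {})}


walker_cfg = config_for("walker_run")
other_cfg = config_for("cheetah_run")  # no overrides: the defaults are returned
```

Keeping the deviations in one table makes it easy to see which tasks depart from the shared settings and by how much.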