d3rlpy: An Offline Deep Reinforcement Learning Library

Takuma Seno, Michita Imai

Introduction

Deep reinforcement learning (RL) has led to significant advancements in numerous domains such as gaming (Wurman et al., 2022) and robotics (Lee et al., 2020). While RL algorithms have the potential to solve complex tasks, active data collection is a major challenge, especially in environments where interaction is expensive. Offline RL (Levine et al., 2020), where algorithms find a good policy within a previously collected static dataset, has been considered as a solution to this problem.

Although recent offline deep RL papers are published with author-provided implementations, these implementations are scattered across different repositories and do not provide standardized interfaces, which makes it difficult for researchers to incorporate the algorithms into their projects. It is also crucial for researchers to have access to faithfully benchmarked implementations to deal with the reproducibility problem (Henderson et al., 2017).

In this paper, we introduce the Data-Driven Deep Reinforcement Learning library for Python (d3rlpy), an offline deep RL library. d3rlpy provides a set of off-policy offline and online RL algorithms built with PyTorch (Paszke et al., 2019). The API of all implemented algorithms is fully documented, plug-and-play, and standardized so that users can easily start experiments with d3rlpy. To address the reproducibility issue in offline RL, we conducted a large-scale faithful benchmark with d3rlpy.

Related work

The choice of API design determines the user experience and has to balance the tradeoff between ease of use and flexibility. KerasRL (Plappert, 2016) and Stable-Baselines3 (Raffin et al., 2021) provide deep RL algorithms with a plug-and-play API and extensive documentation. Tensorforce (Kuhnle et al., 2017), MushroomRL (D’Eramo et al., 2021) and Tianshou (Weng et al., 2021) provide modularized deep RL components that allow users to conduct custom experiments. SaLinA (Denoyer et al., 2021) provides a general framework for decision-making agents on top of which users can implement scalable algorithms. To encourage a broader RL community to start offline RL research, d3rlpy is designed to be the first library that provides a fully documented plug-and-play API for offline RL experiments.

From the reproducibility perspective, ChainerRL (Fujita et al., 2021) and Tonic (Pardo, 2020) provide many deep RL algorithms with faithful reproduction results. Dopamine (Castro et al., 2018) is also a widely used implementation in the community, which focuses on making DQN variants (Hessel et al., 2018) available for researchers. d3rlpy is the first library accompanied by a number of offline RL algorithms and extensive benchmark results in this research field.

Because d3rlpy integrates a fully documented plug-and-play API with a number of faithfully benchmarked offline RL algorithms, it is difficult for existing libraries to readily provide the same offline RL research experience.

Design of d3rlpy

In this section, the library design of d3rlpy is described.

d3rlpy provides a scikit-learn-style API (Pedregosa et al., 2011) to make the library as easy to use as possible. In terms of the library design, there are two main differences from existing libraries. First, d3rlpy has an interface for offline RL training, which takes a dedicated RL dataset component, MDPDataset, described in Section 3.2. Second, all training methods, such as fit and fit_online, are implemented in Algorithm components to make d3rlpy as plug-and-play as possible. For the plug-and-play user experience, the neural network architecture is automatically selected from an MLP and the convolutional model (Mnih et al., 2015) depending on the observation, which allows users to start training without composing neural network models unless they use the customized architectures described in Section 3.2. These design choices are expected to lower the barrier to using this library.
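
The automatic architecture selection can be pictured with a small sketch. The function name `select_default_encoder` and the rule that rank-3 observations are images are assumptions made for illustration; they are not taken from the d3rlpy source code.

```python
from typing import Sequence


def select_default_encoder(observation_shape: Sequence[int]) -> str:
    """Pick a default encoder type from the observation shape.

    Assumption for this sketch: image observations are rank-3
    (channels, height, width); everything else is a flat vector.
    """
    if len(observation_shape) == 3:
        return "convolutional"
    return "mlp"


print(select_default_encoder((4, 84, 84)))  # e.g. stacked image frames
print(select_default_encoder((17,)))        # e.g. a state vector
```

With a rule of this kind, the same training call can be used for vector and image observations, and the user supplies a network only when the default is not suitable.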

Since d3rlpy supports both offline and online training, it enables a seamless transition from offline training to online fine-tuning. Fine-tuning policies trained offline is in demand but remains a challenging problem (Nair et al., 2020; Kostrikov et al., 2021). This seamless transition supports further research by allowing RL researchers to easily conduct fine-tuning experiments.

Components

We highlight the main components provided by d3rlpy. Figure 1 depicts the module components in d3rlpy. All of these components provide a standardized API. The full documentation of the library, including extensive tutorials, is available at https://d3rlpy.readthedocs.io.

Algorithm. Algorithm components provide the offline and online training methods described in Section 3.1. Algorithm is implemented in a hierarchical design that internally instantiates AlgorithmImpl. This hierarchy provides a high-level user-friendly API, such as the fit method, in Algorithm and a low-level API, such as update_actor and update_critic, in AlgorithmImpl. The main motivation for this hierarchical API system is to increase the module reusability of the algorithms when a new algorithm only requires high-level changes. For example, the delayed policy update of TD3, which updates policy parameters every two gradient steps, can be implemented by adjusting the frequency of update_actor method calls in the high-level module without changing the low-level logic.
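
The delayed policy update example can be sketched as follows. The classes `ToyImpl` and `ToyDelayedAlgorithm` and the parameter `update_actor_interval` are hypothetical names for this illustration; only the method names fit, update_actor and update_critic come from the text, and the update bodies are placeholders rather than real gradient steps.

```python
class ToyImpl:
    """Low-level component: owns the individual update steps."""

    def __init__(self) -> None:
        self.critic_updates = 0
        self.actor_updates = 0

    def update_critic(self, batch) -> None:
        # Placeholder: a real implementation would take a gradient step here.
        self.critic_updates += 1

    def update_actor(self, batch) -> None:
        # Placeholder: a real implementation would take a gradient step here.
        self.actor_updates += 1


class ToyDelayedAlgorithm:
    """High-level component: decides when each low-level update runs."""

    def __init__(self, update_actor_interval: int = 2) -> None:
        self.impl = ToyImpl()
        self.update_actor_interval = update_actor_interval

    def fit(self, batches) -> None:
        for step, batch in enumerate(batches, start=1):
            self.impl.update_critic(batch)
            # Delayed policy update: the actor runs only every N-th step.
            if step % self.update_actor_interval == 0:
                self.impl.update_actor(batch)


algo = ToyDelayedAlgorithm(update_actor_interval=2)
algo.fit(batches=[None] * 10)
print(algo.impl.critic_updates, algo.impl.actor_updates)  # 10 5
```

Only the high-level loop knows about the update interval, so the low-level update code can be shared with an algorithm that updates the actor at every step.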

MDPDataset. MDPDataset provides the standardized offline RL dataset interface. Users can build their own dataset from logged data consisting of NumPy arrays (Harris et al., 2020) of observations, actions, rewards, terminals and, optionally, timeouts (Pardo et al., 2018). Popular benchmark datasets such as D4RL and the Atari 2600 datasets are also provided by the d3rlpy.datasets package, which converts them into MDPDataset objects. In addition, d3rlpy supports automatic data collection from an OpenAI Gym (Brockman et al., 2016) style environment and exports the collected data as an MDPDataset object. To support diverse kinds of dataset creation, data collection can be performed with or without parameter updates.
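
To illustrate how flat logged arrays relate to episodes, the sketch below groups transition indices using the terminal and timeout flags. The helper `split_into_episodes` is hypothetical and uses plain Python lists instead of NumPy arrays; it is not the MDPDataset implementation.

```python
from typing import List, Sequence


def split_into_episodes(
    terminals: Sequence[bool], timeouts: Sequence[bool]
) -> List[List[int]]:
    """Group transition indices into episodes.

    An episode ends when the environment terminates or when it is
    cut off by a timeout.
    """
    if len(terminals) != len(timeouts):
        raise ValueError("terminals and timeouts must have the same length")
    episodes: List[List[int]] = []
    current: List[int] = []
    for i, (terminal, timeout) in enumerate(zip(terminals, timeouts)):
        current.append(i)
        if terminal or timeout:
            episodes.append(current)
            current = []
    if current:  # keep a trailing, unfinished episode
        episodes.append(current)
    return episodes


terminals = [False, False, True, False, False, False]
timeouts = [False, False, False, False, True, False]
episodes = split_into_episodes(terminals, timeouts)
print(episodes)  # [[0, 1, 2], [3, 4], [5]]
```

Keeping the two flags separate matters because a timeout ends an episode without the environment having reached a terminal state, which is the distinction discussed by Pardo et al. (2018).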

EncoderFactory. User-defined custom neural network models are supported via EncoderFactory components. Users can define all function approximators to be trained in d3rlpy by building their own EncoderFactory components. This flexibility allows the use of complex architectures (He et al., 2016) and experiments with partially pretrained models (Shah and Kumar, 2021).

QFunctionFactory. d3rlpy provides QFunctionFactory components that allow users to use distributional Q-functions: Quantile Regression (Dabney et al., 2018b) and Implicit Quantile Network (Dabney et al., 2018a). The distributional Q-functions dramatically improve performance by capturing the variance of returns. Unlike conventional RL libraries that implement distributional Q-functions as DQN variants, d3rlpy enables users to use them with all implemented algorithms, which reduces the complexity of supporting algorithmic variants such as QR-DQN (Dabney et al., 2018b) and a discrete version of CQL (Kumar et al., 2020).
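
As a reminder of what a Quantile Regression Q-function optimizes, the sketch below implements the quantile Huber loss of Dabney et al. (2018b) for a single scalar target in plain Python. The function names, the midpoint quantile fractions and the averaging over quantiles are choices made for this illustration, not d3rlpy's implementation.

```python
from typing import Sequence


def huber(u: float, kappa: float = 1.0) -> float:
    """Huber loss: quadratic near zero, linear beyond kappa."""
    if abs(u) <= kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber(u: float, tau: float, kappa: float = 1.0) -> float:
    """Asymmetric Huber loss for the error u = target - prediction."""
    weight = abs(tau - (1.0 if u < 0.0 else 0.0))
    return weight * huber(u, kappa) / kappa


def quantile_regression_loss(
    quantiles: Sequence[float], target: float, kappa: float = 1.0
) -> float:
    """Average quantile Huber loss over N quantile estimates."""
    n = len(quantiles)
    total = 0.0
    for i, q in enumerate(quantiles):
        tau = (i + 0.5) / n  # midpoint quantile fraction
        total += quantile_huber(target - q, tau, kappa)
    return total / n


def q_value(quantiles: Sequence[float]) -> float:
    """The scalar Q-value is the mean of the quantile estimates."""
    return sum(quantiles) / len(quantiles)


print(quantile_huber(1.0, tau=0.9), quantile_huber(-1.0, tau=0.9))
print(q_value([0.0, 1.0, 2.0]))  # 1.0
```

For a high quantile fraction such as 0.9, underestimating the target is penalized more than overestimating it, which pushes each output toward its own quantile of the return distribution.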

Scaler, ActionScaler and RewardScaler. By exploiting the static dataset in offline RL training, d3rlpy provides various preprocessing and postprocessing methods through Scaler, ActionScaler and RewardScaler components. For observation preprocessing, normalization, standardization and pixel scaling are available. Observation standardization has been shown to improve policy performance in the offline RL setting (Fujimoto and Gu, 2021). For action preprocessing, normalization is available, and the action output from a trained policy is denormalized to the original scale as postprocessing. Lastly, reward preprocessing supports normalization, standardization, clipping and constant multiplication.
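
The transformations named above can be sketched in a few lines. The function names, the `eps` constant and the convention that normalized actions lie in [-1, 1] are assumptions for this illustration; the statistics are computed once from the static dataset, which is what offline training makes possible.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max_normalize(xs: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1] using the dataset minimum and maximum."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("cannot normalize a constant sequence")
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs: Sequence[float], eps: float = 1e-3) -> List[float]:
    """Shift to zero mean and scale by the dataset standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / (sigma + eps) for x in xs]


def scale_pixels(xs: Sequence[int]) -> List[float]:
    """Map 8-bit pixel intensities to [0, 1]."""
    return [x / 255.0 for x in xs]


def denormalize_action(action: float, low: float, high: float) -> float:
    """Map a policy output in [-1, 1] back to the original range [low, high]."""
    return low + (action + 1.0) * 0.5 * (high - low)


print(min_max_normalize([0.0, 5.0, 10.0]))  # [0.0, 0.5, 1.0]
print(scale_pixels([0, 255]))               # [0.0, 1.0]
print(denormalize_action(1.0, 0.0, 4.0))    # 4.0
```

Denormalization is the inverse of the action normalization, so a policy trained on normalized actions can still be executed in the original environment.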

Large-scale benchmark

To address the reproducibility problem (Henderson et al., 2017), the implemented algorithms are faithfully benchmarked with D4RL (Fu et al., 2020) and Atari 2600 datasets (Agarwal et al., 2020). d3rlpy implements NFQ (Riedmiller, 2005), DQN (Mnih et al., 2015), Double DQN (van Hasselt et al., 2015), DDPG (Lillicrap et al., 2016), TD3 (Fujimoto et al., 2018), SAC (Haarnoja et al., 2018), BCQ (Fujimoto et al., 2019b), BEAR (Kumar et al., 2019), CQL (Kumar et al., 2020), AWAC (Nair et al., 2020), CRR (Wang et al., 2020), PLAS (Zhou et al., 2020), PLAS+P (Zhou et al., 2020), TD3+BC (Fujimoto and Gu, 2021) and IQL (Kostrikov et al., 2021). The full Python scripts used in this benchmark are also included in our source code (https://github.com/takuseno/d3rlpy/tree/master/reproductions), which allows users to conduct additional benchmark experiments. Full tables of benchmark results are reported in Appendix A, Appendix B and Appendix C. All logged metrics are released in our GitHub repository (https://github.com/takuseno/d3rlpy-benchmarks).

Conclusion

In this paper, we introduced an offline deep reinforcement learning library, d3rlpy. d3rlpy provides a set of offline and online RL algorithms through the standardized plug-and-play API. The large-scale faithful benchmark was conducted to address the reproducibility issue.

This work is supported by the Information-technology Promotion Agency, Japan (IPA), Exploratory IT Human Resources Project (MITOU Program) in the fiscal year 2020. We would like to express our genuine gratitude for the contributions made by the voluntary contributors. We would also like to thank our users who have provided constructive feedback and insights.

Appendix A. Benchmark Results: D4RL

We evaluated the implemented algorithms on the open-sourced D4RL benchmark of OpenAI Gym MuJoCo tasks (Brockman et al., 2016; Fu et al., 2020), following the experimental procedure described in Fu et al. (2020). We trained each algorithm for 500K gradient steps and evaluated it every 5000 steps by running 10 episodes in the environment. Table LABEL:tab:params shows the hyperparameters used in benchmarking. We used the same hyperparameters as those reported in previous papers or recommended in author-provided repositories. We used a discount factor of $0.99$, a target update rate of 5e-3 and an Adam optimizer (Kingma and Ba, 2014) across all algorithms. The default architecture was MLP with hidden layers of unless we explicitly address it. We repeated all experiments with 10 random seeds.
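
The target update rate can be read as the coefficient of a soft (Polyak) target update, in which the target parameters move a small fraction of the way toward the online parameters at each step. The sketch below assumes this common interpretation and uses plain lists of floats instead of network parameters; `soft_update` is a hypothetical helper.

```python
from typing import List, Sequence


def soft_update(
    target_params: Sequence[float],
    online_params: Sequence[float],
    tau: float = 5e-3,
) -> List[float]:
    """Return target parameters moved toward the online parameters by tau."""
    return [
        (1.0 - tau) * t + tau * o
        for t, o in zip(target_params, online_params)
    ]


target = [0.0, 0.0]
online = [1.0, -1.0]
target = soft_update(target, online)
print(target)
```

With a rate of 5e-3, the target parameters cover 0.5% of the remaining distance to the online parameters per update, which keeps the bootstrapped targets slowly varying.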

Table 2 shows the results of the benchmark in the normalized scale (Fu et al., 2020). Tables 3, 4, 5, 6, 7, 8, 9, 10 and 11 show side-by-side comparisons with reference scores. CRR is not included in this side-by-side comparison because CRR has not been benchmarked with D4RL and an author-provided implementation is not publicly available. Considering that standard deviations for some algorithms were not reported by the authors and that many algorithms in offline RL research were originally benchmarked with only 3 or 4 random seeds, we believe that the performance discrepancies are minor.
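
The normalized scale of Fu et al. (2020) maps the return of a random policy to 0 and the return of an expert policy to 100. A minimal sketch of this conversion follows; the function name and the reference returns in the usage lines are made-up values for illustration, not D4RL's per-task constants.

```python
def normalized_score(
    score: float, random_score: float, expert_score: float
) -> float:
    """Map a raw return to the scale where random = 0 and expert = 100."""
    return 100.0 * (score - random_score) / (expert_score - random_score)


# Hypothetical reference returns, chosen only to make the arithmetic easy.
random_return = 0.0
expert_return = 4000.0
print(normalized_score(3000.0, random_return, expert_return))  # 75.0
```

Scores above 100 are possible when a policy outperforms the expert reference, and negative scores when it performs worse than the random reference.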

Appendix B. Benchmark Results: Atari 2600

We evaluated the implemented algorithms with the open-sourced Atari 2600 datasets (Agarwal et al., 2020), following the experimental procedure described in Agarwal et al. (2020). We used 1% of the transitions (500K datapoints), trained each algorithm for 12.5M gradient steps and evaluated it every 125K steps by running 10 episodes in the environment. Table LABEL:tab:params_atari shows the hyperparameters used in benchmarking. We used the same hyperparameters for QR-DQN and CQL as those reported in Kumar et al. (2020). For NFQ and BCQ, the hyperparameters were chosen based on the QR-DQN setup for a fair comparison because there are no benchmark results for them with publicly available datasets. We used a discount factor of $0.99$, the Adam optimizer (Kingma and Ba, 2014) and the convolutional neural network (Mnih et al., 2015) across all algorithms. Note that we configured BCQ with the Quantile Regression Q-function introduced in Section 3.2 to match the CQL setup. In evaluation, we used an $\epsilon$-greedy policy with $\epsilon=0.001$ and a 25% probability of sticky actions (Agarwal et al., 2020). We repeated all experiments with 10 random seeds.
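
The evaluation protocol combines two sources of randomness, which the sketch below separates: the agent picks a random action with probability epsilon, and the environment repeats the previously executed action with the sticky-action probability. The function `executed_action` and its arguments are hypothetical and only illustrate the mechanism; in practice the sticky action is applied inside the emulator.

```python
import random
from typing import Optional


def executed_action(
    greedy_action: int,
    previous_action: Optional[int],
    n_actions: int,
    rng: random.Random,
    epsilon: float = 0.001,
    sticky_prob: float = 0.25,
) -> int:
    """Return the action that is actually executed in one evaluation step."""
    # Agent side: epsilon-greedy choice.
    if rng.random() < epsilon:
        chosen = rng.randrange(n_actions)
    else:
        chosen = greedy_action
    # Environment side: with probability sticky_prob, repeat the last action.
    if previous_action is not None and rng.random() < sticky_prob:
        return previous_action
    return chosen


rng = random.Random(0)
previous = None
for _ in range(5):
    previous = executed_action(2, previous, n_actions=4, rng=rng)
    print(previous)
```

Because both effects are stochastic, evaluation returns vary between runs even for a fixed policy, which is one reason scores are averaged over episodes and seeds.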

Table 13 shows the benchmark results, and Tables 14 and 15 show side-by-side comparisons with reference scores. NFQ and BCQ are not included in the side-by-side comparisons because they have not been evaluated with publicly available datasets, an author-provided implementation of NFQ is not publicly available, and the author-provided implementation of BCQ (https://github.com/sfujim/BCQ) is not directly applicable to this evaluation. Considering that the authors did not report standard deviations, we believe that the performance discrepancies are minor.

Appendix C. Benchmark Results: Fine-tuning

We evaluated the implemented algorithms with AntMaze datasets (Fu et al., 2020) in a fine-tuning scenario where a policy is pretrained with a static dataset and then fine-tuned with online experience. We followed the experimental procedure described in Kostrikov et al. (2021). In this evaluation, we chose AWAC and IQL, which were proposed as RL algorithms with fine-tuning capability. Table LABEL:tab:antmaze_params shows the hyperparameters used in benchmarking. We subtracted $1$ from all reward values. We used the same hyperparameters as those described in the original papers (Nair et al., 2020; Kostrikov et al., 2021). In each training run, the policy was fine-tuned for 1M steps after pretraining. We repeated all experiments with 10 random seeds.

Table 17 shows the benchmark results. Tables 18 and 19 show side-by-side comparisons with reference scores. Considering that the authors did not report standard deviations, we believe that the performance discrepancies are minor.