OGBench: Benchmarking Offline Goal-Conditioned RL


Overview

OGBench is a benchmark designed to facilitate algorithms research in offline goal-conditioned reinforcement learning (RL), offline unsupervised RL, and offline RL. See the project page for videos and more details about the environments, tasks, and datasets.

Features

  • 8 types of realistic and diverse environments (videos):
    • Locomotion: PointMaze, AntMaze, HumanoidMaze, and AntSoccer.
    • Manipulation: Cube, Scene, and Puzzle.
    • Drawing: Powderworld.
  • 85 datasets covering various challenges in offline goal-conditioned RL.
  • 410 tasks for standard (i.e., non-goal-conditioned) offline RL.
  • Support for both pixel-based and state-based observations.
  • Clean, well-tuned reference implementations of 6 offline goal-conditioned RL algorithms (GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL) based on JAX.
  • Fully reproducible scripts for the entire benchmark table and datasets.
  • pip-installable, easy-to-use APIs based on Gymnasium.
  • No major dependencies other than MuJoCo.

Quick Start

Installation

OGBench can be easily installed via PyPI:

pip install ogbench

It requires Python 3.8+ and has only three dependencies: mujoco >= 3.1.6, dm_control >= 1.0.20, and gymnasium.

To use OGBench for offline goal-conditioned RL, see the "Usage for offline goal-conditioned RL" section below. To use OGBench for standard (non-goal-conditioned) offline RL, see the "Usage for standard (non-goal-conditioned) offline RL" section below.

Usage for offline goal-conditioned RL

After installing OGBench, you can create an environment and datasets using ogbench.make_env_and_datasets. The environment follows the Gymnasium interface. The datasets will be automatically downloaded during the first run.

Here is an example of how to use OGBench for offline goal-conditioned RL:

Caution

Do not use gymnasium.make to create an environment. Use ogbench.make_env_and_datasets instead. To create an environment without loading datasets, use env_only=True in ogbench.make_env_and_datasets.

import ogbench

# Make an environment and datasets (they will be automatically downloaded).
dataset_name = 'humanoidmaze-large-navigate-v0'
env, train_dataset, val_dataset = ogbench.make_env_and_datasets(dataset_name)

# Train your offline goal-conditioned RL agent on the dataset.
# ...

# Evaluate the agent.
for task_id in [1, 2, 3, 4, 5]:
    # Reset the environment and set the evaluation task.
    ob, info = env.reset(
        options=dict(
            task_id=task_id,  # Set the evaluation task. Each environment provides five
                              # evaluation goals, and `task_id` must be in [1, 5].
            render_goal=True,  # Set to `True` to get a rendered goal image (optional).
        )
    )

    goal = info['goal']  # Get the goal observation to pass to the agent.
    goal_rendered = info['goal_rendered']  # Get the rendered goal image (optional).

    done = False
    while not done:
        action = env.action_space.sample()  # Replace this with your agent's action.
        ob, reward, terminated, truncated, info = env.step(action)  # Gymnasium-style step.
        # If the agent reaches the goal, `terminated` will be `True`. If the episode length
        # exceeds the maximum length without reaching the goal, `truncated` will be `True`.
        # `reward` is 1 if the agent reaches the goal and 0 otherwise.
        done = terminated or truncated
        frame = env.render()  # Render the current frame (optional).

    success = info['success']  # Whether the agent reached the goal (0 or 1).
                               # `terminated` also indicates this.
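
The returned train_dataset and val_dataset are dictionaries of NumPy arrays (see the "Dataset APIs" section below for their exact structure). As a rough illustration of the training step above, here is a minimal, hypothetical hindsight goal-sampling function; the function name and sampling scheme are illustrative and not the exact procedure used by the reference implementations:

import numpy as np

def sample_gc_batch(dataset, batch_size, rng):
    # Sample random transition indices.
    num_transitions = len(dataset['observations'])
    idxs = rng.integers(0, num_transitions, size=batch_size)

    # `terminals` is 1 at the last transition of each trajectory, so the end of
    # the trajectory containing index i is the first terminal index >= i.
    terminal_idxs = np.nonzero(dataset['terminals'])[0]
    traj_ends = terminal_idxs[np.searchsorted(terminal_idxs, idxs)]

    # Relabel goals: draw a goal uniformly from the future states (including the
    # current one) of the same trajectory.
    goal_idxs = rng.integers(idxs, traj_ends + 1)

    return dict(
        observations=dataset['observations'][idxs],
        actions=dataset['actions'][idxs],
        next_observations=dataset['next_observations'][idxs],
        goals=dataset['observations'][goal_idxs],
    )

# Example: sample a batch of 256 goal-conditioned transitions from the training set.
batch = sample_gc_batch(train_dataset, 256, np.random.default_rng(0))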

You can find a complete example of a training script for offline goal-conditioned RL in the impls directory. See the next section for more details on the reference implementations.

Usage for standard (non-goal-conditioned) offline RL

OGBench also provides single-task variants of the environments for standard (reward-maximizing) offline RL. Each locomotion and manipulation environment provides five single-task variants corresponding to the five evaluation goals, named with the suffix singletask-task[n] (e.g., scene-play-singletask-task2-v0), where [n] is a number between 1 and 5 (inclusive). Among the five tasks in each environment, the most representative one is chosen as the "default" task and is aliased by the suffix singletask without a task number. Default tasks are useful for reducing the number of benchmarking environments or for tuning hyperparameters.

List of default tasks:
  • pointmaze-*: task1
  • antmaze-*: task1
  • humanoidmaze-*: task1
  • antsoccer-*: task4
  • cube-*: task2
  • scene-*: task2
  • puzzle-{3x3, 4x4}: task4
  • puzzle-{4x5, 4x6}: task2
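
Since the default task is just an alias, the two names refer to the same environment. The following minimal sketch also demonstrates the env_only=True option mentioned in the caution below, which skips dataset loading:

import ogbench

# `scene-play-singletask-v0` is an alias for `scene-play-singletask-task2-v0`
# (the default task for `scene-*` is `task2`; see the list above).
# With `env_only=True`, only the environment is created and no datasets are loaded.
env = ogbench.make_env_and_datasets('scene-play-singletask-v0', env_only=True)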

Here is an example of how to use OGBench for standard (non-goal-conditioned) offline RL:

Caution

Do not use gymnasium.make to create an environment. Use ogbench.make_env_and_datasets instead. To create an environment without loading datasets, use env_only=True in ogbench.make_env_and_datasets.

Note

Offline RL datasets contain both the terminals and masks fields.

  • masks denotes whether the agent should perform a Bellman backup from the next observation. It is 0 only when the task is complete (and 1 otherwise); when masks is 0, the agent should set the target Q-value to 0 instead of bootstrapping from the next observation's target Q-value.
  • terminals simply denotes whether the dataset trajectory is over, regardless of task completion.

For example, in antmaze-large-navigate-singletask-v0, the dataset contains 1M transitions, with each trajectory having a length of 1000. Hence, sum(dataset['terminals']) is exactly 1000 (i.e., 1 at the end of each trajectory), whereas sum(dataset['masks']) can vary depending on how many times the agent reaches the goal. Note that dataset trajectories do not terminate even when the agent reaches the goal, as they are collected by a scripted policy that is not task-aware.

For standard Q-learning, you likely only need masks, but for other trajectory-aware algorithms (e.g., hierarchical RL or trajectory modeling-based approaches), you may need both masks and terminals. See the IQL implementation in the FQL repository for an example of how to use masks.
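
As a rough sketch of the idea (illustrative only; the exact loss depends on the algorithm, and next_q and discount are assumed to be computed elsewhere):

# Illustrative TD target using `masks`: when `masks` is 0 (task complete), the
# bootstrapped term is dropped and the target reduces to the immediate reward.
# `batch` is a sampled batch of dataset fields, `next_q` is the target network's
# Q-value at the next observation, and `discount` is the discount factor.
target_q = batch['rewards'] + discount * batch['masks'] * next_q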

import ogbench

# Make an environment and datasets (they will be automatically downloaded).
# In `cube-double`, the default task is `task2`, and it is also callable by
# `cube-double-play-singletask-v0`.
dataset_name = 'cube-double-play-singletask-task2-v0'
env, train_dataset, val_dataset = ogbench.make_env_and_datasets(dataset_name)

# Train your offline RL agent on the dataset.
# ...

# Evaluate the agent.
ob, info = env.reset()  # Reset the environment.

done = False
while not done:
    action = env.action_space.sample()  # Replace this with your agent's action.
    ob, reward, terminated, truncated, info = env.step(action)  # Gymnasium-style step.
    # If the agent achieves the task, `terminated` will be `True`. If the episode length
    # exceeds the maximum length without achieving the task, `truncated` will be `True`.
    done = terminated or truncated
    frame = env.render()  # Render the current frame (optional).

success = info['success']  # Whether the agent achieved the task (0 or 1).

For standard offline RL, we do not provide official reference implementations or benchmarking results. However, you can find implementations of some offline RL algorithms (e.g., IQL, ReBRAC, and FQL), along with partial benchmarking results, in the FQL repository mentioned above.

Dataset APIs

OGBench provides several APIs to download and load datasets. The simplest way is to use ogbench.make_env_and_datasets as shown above, which creates an environment and loads training and validation datasets. The datasets will automatically be downloaded to the directory specified by dataset_dir during the first run (default: ~/.ogbench/data). ogbench.make_env_and_datasets also provides the compact_dataset option, which returns a dataset without the next_observations field (see below). For example:

import ogbench

# Make an environment and load datasets.
dataset_name = 'antmaze-large-navigate-v0'
env, train_dataset, val_dataset = ogbench.make_env_and_datasets(
    dataset_name,  # Dataset name.
    dataset_dir='~/.ogbench/data',  # Directory to save datasets (optional).
    compact_dataset=False,  # Whether to use a compact dataset (optional; see below).
)

# Assume each dataset trajectory has a length of 4, and (s0, a0, s1), (s1, a1, s2),
# (s2, a2, s3), (s3, a3, s4) are the transition tuples.
# If `compact_dataset` is `False`, the dataset will have the following structure:
#                       |<- traj 1 ->|  |<- traj 2 ->|  ...
# ----------------------------------------------------------
# 'observations'     : [s0, s1, s2, s3, s0, s1, s2, s3, ...]
# 'actions'          : [a0, a1, a2, a3, a0, a1, a2, a3, ...]
# 'next_observations': [s1, s2, s3, s4, s1, s2, s3, s4, ...]
# 'terminals'        : [ 0,  0,  0,  1,  0,  0,  0,  1, ...]

# If `compact_dataset` is `True`, the dataset will have the following structure, where the
# `next_observations` field is omitted. Instead, it includes a `valids` field indicating
# whether the next observation is valid:
#                       |<--- traj 1 --->|  |<--- traj 2 --->|  ...
# ------------------------------------------------------------------
# 'observations'     : [s0, s1, s2, s3, s4, s0, s1, s2, s3, s4, ...]
# 'actions'          : [a0, a1, a2, a3, a4, a0, a1, a2, a3, a4, ...]
# 'terminals'        : [ 0,  0,  0,  1,  1,  0,  0,  0,  1,  1, ...]
# 'valids'           : [ 1,  1,  1,  1,  0,  1,  1,  1,  1,  0, ...]
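
If your training code expects a next_observations field, a compact dataset can be converted back. Here is a minimal sketch (not an official OGBench API; the helper name is hypothetical):

import numpy as np

def add_next_observations(dataset):
    # The next observation of step i is observations[i + 1]; `valids` marks the
    # indices that have a valid successor within the same trajectory.
    valids = dataset['valids'] > 0
    return dict(
        observations=dataset['observations'][valids],
        actions=dataset['actions'][valids],
        next_observations=dataset['observations'][1:][valids[:-1]],
        terminals=dataset['terminals'][valids],
    )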

To download multiple datasets at once, you can use ogbench.download_datasets:

import ogbench

dataset_names = [
    'humanoidmaze-medium-navigate-v0',
    'visual-puzzle-3x3-play-v0',
    'powderworld-easy-play-v0',
]
ogbench.download_datasets(
    dataset_names,  # List of dataset names.
    dataset_dir='~/.ogbench/data',  # Directory to save datasets (optional).
)

Reference Implementations

OGBench also provides JAX-based reference implementations of six offline goal-conditioned RL algorithms (GCBC, GCIVL, GCIQL, QRL, CRL, and HIQL). They are provided in the impls directory as a standalone codebase. If you only need the reference implementations and do not want to modify the environments, you can safely remove the rest of the repository.

Installation

Our reference implementations require Python 3.9+ and additional dependencies, including jax >= 0.4.26. To install these dependencies, run:

cd impls
pip install -r requirements.txt

By default, this uses the PyPI version of OGBench. If you want to use a local version of OGBench (e.g., for training methods on modified environments), instead run pip install -e ".[train]" in the root directory.

Running the reference implementations

Each algorithm is implemented in a separate file in the agents directory. We provide implementations of the following offline goal-conditioned RL algorithms:

  • gcbc.py: Goal-Conditioned Behavioral Cloning (GCBC)
  • gcivl.py: Goal-Conditioned Implicit V-Learning (GCIVL)
  • gciql.py: Goal-Conditioned Implicit Q-Learning (GCIQL)
  • qrl.py: Quasimetric Reinforcement Learning (QRL)
  • crl.py: Contrastive Reinforcement Learning (CRL)
  • hiql.py: Hierarchical Implicit Q-Learning (HIQL)

To train an agent, you can run the main.py script. Training metrics, evaluation metrics, and videos are logged via wandb by default. Here are some example commands (see hyperparameters.sh for the full list of commands):

# antmaze-large-navigate-v0 (GCBC)
python main.py --env_name=antmaze-large-navigate-v0 --agent=agents/gcbc.py
# antmaze-large-navigate-v0 (GCIVL)
python main.py --env_name=antmaze-large-navigate-v0 --agent=agents/gcivl.py --agent.alpha=10.0
# antmaze-large-navigate-v0 (GCIQL)
python main.py --env_name=antmaze-large-navigate-v0 --agent=agents/gciql.py --agent.alpha=0.3
# antmaze-large-navigate-v0 (QRL)
python main.py --env_name=antmaze-large-navigate-v0 --agent=agents/qrl.py --agent.alpha=0.003
# antmaze-large-navigate-v0 (CRL)
python main.py --env_name=antmaze-large-navigate-v0 --agent=agents/crl.py --agent.alpha=0.1
# antmaze-large-navigate-v0 (HIQL)
python main.py --env_name=antmaze-large-navigate-v0 --agent=agents/hiql.py --agent.high_alpha=3.0 --agent.low_alpha=3.0

Each run typically takes 2-5 hours (on state-based tasks) or 5-12 hours (on pixel-based tasks) on a single A5000 GPU. For large pixel-based datasets (e.g., visual-puzzle-4x6-play-v0 with 5M transitions), up to 120GB of RAM may be required.

Tips for hyperparameters and flags

To reproduce the results in the paper, you need to use the provided hyperparameters. hyperparameters.sh contains the complete list of exact command-line flags used to produce the main benchmark table in the paper. Below, we highlight some important hyperparameters and common pitfalls:

  • Regardless of the algorithm, one of the most important hyperparameters is agent.alpha (the AWR temperature or the DDPG+BC coefficient) in the actor loss. It is crucial to tune this hyperparameter when running an algorithm on a new environment. In the paper, we provide a separate table of policy extraction hyperparameters, individually tuned for each environment and dataset category.
  • By default, actor goals are uniformly sampled from the future states in the same trajectory. We found this works best in most cases, but you can adjust this to allow random actor goals (e.g., by setting --agent.actor_p_trajgoal=0.5 --agent.actor_p_randomgoal=0.5). This is especially important for datasets that require stitching. See the hyperparameter table in the paper for the values used in benchmarking.
  • For GCIQL, CRL, and QRL, we provide two policy extraction methods: AWR and DDPG+BC. In general, DDPG+BC works better than AWR (see this paper for the reasons), but DDPG+BC is usually more sensitive to the alpha hyperparameter than AWR. As such, in a new environment, we recommend starting with AWR to get a sense of the performance and then switching to DDPG+BC to further improve it.
  • Our QRL implementation provides two quasimetric parameterizations: MRN and IQE. We found that IQE (default) works better in general, but it is almost twice as slow as MRN.
  • In CRL, we found that using --agent.actor_log_q=True (which is set by default) is important for strong performance, especially in locomotion environments. We found this doesn't help much with other algorithms.
  • In HIQL, setting --agent.low_actor_rep_grad=True (which is False by default) is crucial in pixel-based environments. This allows gradients to flow from the low-level actor loss to the subgoal representation, which helps maintain better representations.
  • In pixel-based environments, don't forget to set agent.encoder. We used --agent.encoder=impala_small across all pixel-based environments.
  • In discrete-action environments (e.g., Powderworld), don't forget to set --agent.discrete=True.
  • In Powderworld, use --eval_temperature=0.3, which helps prevent the agent from getting stuck in certain states. Example commands combining these flags are sketched below.
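
For illustration, the flags above might be combined as follows. These exact commands are not taken from hyperparameters.sh, and algorithm-specific hyperparameters (e.g., agent.alpha) are omitted; always consult hyperparameters.sh for the benchmarked flags:

# Pixel-based environment: set the image encoder.
python main.py --env_name=visual-puzzle-3x3-play-v0 --agent=agents/gcbc.py --agent.encoder=impala_small
# Discrete-action Powderworld environment: set the discrete flag and evaluation temperature
# (additional flags, such as the encoder, may also be needed; see hyperparameters.sh).
python main.py --env_name=powderworld-easy-play-v0 --agent=agents/gcbc.py --agent.discrete=True --eval_temperature=0.3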

Reproducing Datasets

We provide the full scripts and exact command-line flags used to produce all the datasets in OGBench. The scripts are provided in the data_gen_scripts directory.

Installation

Data-generation scripts for locomotion environments require Python 3.9+ and additional dependencies, including jax >= 0.4.26, to train and load expert agents. For manipulation and drawing environments, no additional dependencies are required. To install the necessary dependencies for locomotion environments, run the following command in the root directory:

pip install -e ".[train]"

This installs the same dependencies as the reference implementations, but in editable mode (-e).

Reproducing datasets

To reproduce datasets, you can run the scripts in the data_gen_scripts directory. For locomotion environments, you need to first download the expert policies. We provide the exact command-line flags used to produce the datasets in commands.sh. Here is an example of how to reproduce a dataset for the antmaze-large-navigate-v0 task:

cd data_gen_scripts
# Download the expert policies for locomotion environments (not required for other environments).
wget https://rail.eecs.berkeley.edu/datasets/ogbench/experts.tar.gz
tar xf experts.tar.gz && rm experts.tar.gz
# Create a directory to save datasets.
mkdir -p data
# Add the `impls` directory to PYTHONPATH.
# Alternatively, you can move the contents of `data_gen_scripts` to `impls` instead of setting PYTHONPATH.
export PYTHONPATH="../impls:${PYTHONPATH}"  
# Generate a dataset for `antmaze-large-navigate-v0`.
python generate_locomaze.py --env_name=antmaze-large-v0 --save_path=data/antmaze-large-navigate-v0.npz

Reproducing expert policies

If you want to train your own expert policies from scratch, you can run the corresponding commands in commands.sh. For example, to train an Ant expert policy, you can run the following command in the data_gen_scripts directory after setting PYTHONPATH as above:

python main_sac.py --env_name=online-ant-xy-v0

Acknowledgments

This codebase is inspired by or partly uses code from the following repositories:

  • D4RL for the dataset structure and the AntMaze environment.
  • Gymnasium and dm_control for the agents (Ant and Humanoid) in the locomotion environments.
  • MuJoCo Menagerie for the robot descriptions (Universal Robots UR5e and Robotiq 2F-85) in the manipulation environments.
  • jaxlie for Lie group operations in the manipulation environments.
  • Meta-World for the objects (drawer, window, and button) in the manipulation environments.
  • Powderworld for the Powderworld environment.
  • NumPyConv2D for the NumPy Conv2D implementation in the Powderworld environment.
  • jaxrl_m, rlbase, HIQL, and cmd-notebook for JAX-based implementations of RL algorithms.

Special thanks to Kevin Zakka for providing the initial codebase for the manipulation environments.

Citation

@inproceedings{ogbench_park2025,
  title={OGBench: Benchmarking Offline Goal-Conditioned RL},
  author={Park, Seohong and Frans, Kevin and Eysenbach, Benjamin and Levine, Sergey},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2025},
}