Safe Policy Optimization (SafePO) is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL). It provides the RL research community with a unified platform for developing and evaluating algorithms across a variety of safe reinforcement learning environments. To better help the community study this problem, SafePO is developed with the following key features:
- Comprehensive Safe RL benchmark: We offer high-quality implementation of both single-agent safe reinforcement learning algorithms (CPO, PCPO, FOCOPS, PPO-Lag, TRPO-Lag, CUP, CPPO-PID, and RCPO) and multi-agent safe reinforcement learning algorithms (HAPPO, MAPPO-Lag, IPPO, MACPO, and MAPPO).
- Richer interfaces: In SafePO, you can modify algorithm parameters to suit your requirements by passing the parameters you want to change via `argparse` at the terminal, as illustrated in the sketch below.
- Single-file style: SafePO implements each algorithm in a single file, aiming to serve as an algorithm library that combines tutorial and tool capabilities. This design prioritizes readability and extensibility, albeit at the expense of inheritance and code reuse. Unlike modular frameworks, users can grasp the essence of an algorithm without having to navigate the entire library.
- More information: We provide rich data visualization methods. Reinforcement learning algorithms typically involve a large number of parameters. To better understand how each parameter changes during training, we log them to files, TensorBoard, and wandb. We believe this helps developers tune each algorithm more efficiently.
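The sketch below illustrates, under the assumption of a typical `argparse` setup, how a single-file algorithm script might expose a few of its hyperparameters; the flag names mirror the argument table later in this README, but the authoritative definitions live in each algorithm file.

```python
# Illustrative sketch only: how a single-file SafePO algorithm script might
# expose hyperparameters via argparse. The flag names mirror the argument
# table in this README; the authoritative definitions are in each ALGO.py.
import argparse


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Single-file Safe RL training script (sketch)")
    parser.add_argument("--env-id", type=str, default="SafetyPointGoal1-v0",
                        help="the id of the environment")
    parser.add_argument("--seed", type=int, default=0,
                        help="the random seed of the experiment")
    parser.add_argument("--total-steps", type=int, default=10_000_000,
                        help="total timesteps of the experiment")
    parser.add_argument("--cost-limit", type=float, default=25.0,
                        help="the cost limit for the safety constraint")
    parser.add_argument("--device", type=str, default="cpu",
                        help="the device (cpu or cuda) to run the code")
    return parser.parse_args()


if __name__ == "__main__":
    args = parse_args()
    print(f"Training on {args.env_id} with seed {args.seed} (device={args.device})")
```

Overriding any of these values from the terminal then looks like `python ppo_lag.py --env-id SafetyAntVelocity-v1 --seed 5`.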
- Overview of Algorithms
- Supported Environments
- Safety-Gymnasium
- Safe-Dexterous-Hands
- What's More
- Pre-requisites
- Conda-Environment
- Getting Started
- Machine Configuration
- Ethical and Responsible Use
- PKU-MARL Team
Here we provide a table of Safe RL algorithms that the benchmark includes.
Algorithm | Proceedings & Cites | Official Code Repo | Official Code Last Update | Official GitHub Stars |
---|---|---|---|---|
PPO-Lag | ❌ | TensorFlow 1 | | |
TRPO-Lag | ❌ | TensorFlow 1 | | |
CUP | NeurIPS 2022 (Cite: 6) | PyTorch | | |
FOCOPS | NeurIPS 2020 (Cite: 27) | PyTorch | | |
CPO | ICML 2017 (Cite: 663) | ❌ | ❌ | ❌ |
PCPO | ICLR 2020 (Cite: 67) | Theano | ❌ | ❌ |
RCPO | ICLR 2019 (Cite: 238) | ❌ | ❌ | ❌ |
CPPO-PID | NeurIPS 2020 (Cite: 71) | PyTorch | | |
MACPO | Preprint (Cite: 4) | PyTorch | | |
MAPPO-Lag | Preprint (Cite: 4) | PyTorch | | |
HAPPO (Purely reward optimisation) | ICLR 2022 (Cite: 10) | PyTorch | | |
MAPPO (Purely reward optimisation) | Preprint (Cite: 98) | PyTorch | | |
IPPO (Purely reward optimisation) | Preprint (Cite: 28) | ❌ | ❌ | ❌ |
Here is a list of all the environments Safety-Gymnasium currently supports; some are being tested in our baselines, and we will gradually release them in later updates. For more details, please refer to Safety-Gymnasium.
Category | Task | Agent | Example |
---|---|---|---|
Safe Navigation | Goal[012] | Point, Car, Doggo, Racecar, Ant | SafetyPointGoal1-v0 |
Safe Navigation | Button[012] | | |
Safe Navigation | Push[012] | | |
Safe Navigation | Circle[012] | | |
Velocity | Velocity | HalfCheetah, Hopper, Swimmer, Walker2d, Ant, Humanoid | SafetyAntVelocity-v1 |
Note: Safe velocity tasks support both single-agent and multi-agent algorithms, while safe navigation tasks currently support only single-agent algorithms.
Note: The multi-agent velocity tasks (e.g., Safety2x4AntVelocity-v0) currently support multi-agent algorithms only.
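For readers unfamiliar with the task interface, here is a minimal interaction sketch. It assumes Safety-Gymnasium's Gymnasium-style API, in which `env.step()` additionally returns a scalar safety cost; please consult the Safety-Gymnasium documentation for the authoritative interface.

```python
# Minimal interaction sketch (assumes Safety-Gymnasium's Gymnasium-style API,
# where env.step() also returns a scalar safety cost).
import safety_gymnasium

env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)
episode_cost = 0.0
for _ in range(100):
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, cost, terminated, truncated, info = env.step(action)
    episode_cost += cost  # the constraint signal consumed by safe RL algorithms
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print(f"Accumulated cost over the rollout: {episode_cost}")
```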
We implement several different constraints on top of the base environments, extending the setting to both single-agent and multi-agent training.
Our team has also designed a number of more interesting safety tasks for two-handed dexterous manipulation, and we will soon release the code so that more Safe RL researchers can use them.
Base Environments | Description | Demo |
---|---|---|
ShadowHandOverWall | None | |
ShadowHandOverWallDown | None | |
ShadowHandCatchOver2UnderarmWall | None | |
ShadowHandCatchOver2UnderarmWallDown | None |
The safe regions are:
Wall | Wall Down |
---|---|
To use SafePO-Baselines, you need to install the environments. Please refer to Mujoco and Safety-Gymnasium for details on installation. Details regarding the installation of IsaacGym can be found here. We currently support the Preview Release 3 version of IsaacGym.
```bash
conda create -n safe python=3.8
conda activate safe
# Because of CUDA version constraints, we recommend installing PyTorch manually.
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -e .
```
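After installation, you can optionally run a quick sanity check (not part of the official setup) to confirm that the CUDA build of PyTorch is visible:

```bash
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```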
For detailed instructions, please refer to Installation.md.
Each algorithm file is the entry point. Running `ALGO.py` with arguments specifying the algorithm and environment starts training. For example, to run PPO-Lag in SafetyPointGoal1-v0 with seed 0, you can use the following command:
```bash
cd safepo/single_agent
python ppo_lag.py --env-id SafetyPointGoal1-v0 --seed 0
```
To run a benchmark in parallel, for example, you can use the following command to run PPO-Lag and TRPO-Lag in SafetyAntVelocity-v1 and SafetyHalfCheetahVelocity-v1:
```bash
cd safepo/single_agent
python benchmark.py --env-id SafetyAntVelocity-v1 SafetyHalfCheetahVelocity-v1 --algo ppo_lag trpo_lag --workers 2
```
The command above runs two processes in parallel; each process runs one algorithm in one environment. The results are saved in `./runs/`.
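Since training metrics are written to log files, TensorBoard, and wandb, one convenient way to inspect a running or finished experiment is to point TensorBoard at the output directory; this assumes the event files end up under `./runs/`:

```bash
tensorboard --logdir ./runs
```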
Here we provide the list of arguments:
Argument | Default | Info |
---|---|---|
--seed | 0 | the random seed of the experiment |
--device | cpu | the device (cpu or cuda) to run the code |
--torch-threads | 4 | number of threads for torch |
--total-steps | 10000000 | total timesteps of the experiments |
--env-id | SafetyPointGoal1-v0 | the id of the environment |
--use-eval | False | toggles evaluation |
--eval-episodes | 1 | the number of episodes for final evaluation |
--steps-per-epoch | 20000 | the number of steps to run in each environment per rollout |
--update-iters | 10 | the max iteration to update the policy |
--batch-size | 64/128 | the size of each mini-batch |
--entropy-coef | 0.0 | coefficient of the entropy |
--target-kl | 0.01/0.02 | the target KL divergence threshold |
--max-grad-norm | 40.0 | the maximum norm for the gradient clipping |
--critic-norm-coef | 0.001 | the critic norm coefficient |
--gamma | 0.99 | the discount factor gamma |
--lam | 0.95 | the lambda for generalized advantage estimation (GAE) of rewards |
--lam-c | 0.95 | the lambda for generalized advantage estimation (GAE) of costs |
--standardized-adv-r | True | toggles reward advantages standardization |
--standardized-adv-c | True | toggles cost advantages standardization |
--critic-lr | 1e-3 | the learning rate of the critic network |
--actor-lr | 3e-4/None | the learning rate of the actor network |
--log-dir | ../runs | directory to save agent logs |
--write-terminal | True | toggles terminal logging |
--fvp-sample-freq | 1 | the sub-sampling rate of the observation |
--cg-damping | 0.1 | the damping value for conjugate gradient |
--cg-iters | 15 | the number of conjugate gradient iterations |
--backtrack-iters | 15 | the number of backtracking line search iterations |
--backtrack-coef | 0.8 | the coefficient for backtracking line search |
--cost-limit | 25.0 | the cost limit for the safety constraint |
Note: Some hyper-parameters vary across algorithms. For more details, please refer to the corresponding code files.
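As a concrete example of combining several of these arguments, the command below runs PPO-Lag on a GPU with a shorter training budget. Exact flag handling may differ slightly between algorithm files, so treat this as a sketch rather than a canonical invocation:

```bash
cd safepo/single_agent
python ppo_lag.py --env-id SafetyAntVelocity-v1 --seed 0 \
    --device cuda --total-steps 2000000 --cost-limit 25.0
```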
We also provide a safe MARL algorithm benchmark for safe MARL research on the challenging Safety DexterousHands tasks and Safety-Gymnasium multi-agent velocity tasks. HAPPO, IPPO, MACPO, MAPPO-Lag, and MAPPO have already been implemented.
`safepo/envs/safe_dexteroushands/train_marl.py` is the entry file. Running `train_marl.py` with arguments specifying the algorithm and task starts training. For example, you can use the following command:
```bash
# algo: macpo, mappolag, mappo, ippo, happo
python train_marl.py --task=ShadowHandOver --algo=macpo
```
`safepo/multi_agent/train_marl.py` is the entry file. Running `train_marl.py` with arguments specifying the algorithm and task starts training. For example, you can use the following command to run MACPO in Safety2x4AntVelocity-v0 with default arguments:
```bash
# algo: macpo, mappolag, mappo, ippo, happo
# env: Safety2x4AntVelocity-v0
python train_marl.py --algo=macpo --scenario=Ant-v4 --agent_conf=2x4
```
The SafePO multi-agent algorithms share almost all hyperparameters for Safety DexterousHands and Safety-Gymnasium multi-agent velocity tasks. However, there are some differences in certain hyperparameters, which are listed below:
Argument | Info | Default (for Safety DexterousHands) | Default (for Safety-Gymnasium multi-agent velocity) |
---|---|---|---|
--episode-length | Episode length | 8 | 200 |
--num-env-steps | The number of total steps | 100000000 | 10000000 |
--cost-lim | The tolerance of cost violation | 0.0 | 1.0 |
--n-rollout-threads | The number of parallel rollout environments | 80 | 32 |
--hidden-size | The size of hidden layers of neural network | 512 | 64 |
--entropy-coef | The coefficient of entropy | 0.00 | 0.01 |
--use-value-active-masks | Whether to use value active masks | False | True |
--use-policy-active-masks | Whether to use policy active masks | False | True |
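To override any of these defaults, append the corresponding flags to the training command. The exact flag spelling (hyphens vs. underscores) is defined by each script's argument parser, so the example below is a hedged sketch rather than a verified invocation:

```bash
# Sketch: run MACPO on the multi-agent Ant velocity task with a larger entropy
# coefficient and fewer total environment steps (check train_marl.py for the
# exact flag spelling).
python train_marl.py --algo=macpo --scenario=Ant-v4 --agent_conf=2x4 \
    --entropy-coef 0.01 --num-env-steps 10000000
```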
We test all algorithms and experiments on a machine with an AMD Ryzen Threadripper PRO 3975WX 32-core CPU and an NVIDIA GeForce RTX 3090 GPU (driver version 495.44).
SafePO aims to benefit the safe RL research community and is released under the Apache-2.0 license. Illegal usage or any violation of the license is not permitted.
SafePO is a project contributed by PKU-Alignment at Peking University. We also thank the contributors of the following open-source repositories: Spinning Up, Bullet-Safety-Gym, and Safety-Gym.