Meta-reinforcement learning (Meta-RL) exploits shared structure among tasks to enable rapid adaptation to new tasks from only a small amount of experience. However, most existing Meta-RL algorithms either lack theoretical generalization guarantees or offer such guarantees only under restrictive assumptions (e.g., strong assumptions on the data distribution). This paper, for the first time, conducts a theoretical analysis of the generalization performance of the Meta-RL learner using PAC-Bayesian theory. Applying PAC-Bayesian theory to Meta-RL is challenging because dependencies in the training data render the independent and identically distributed (i.i.d.) assumption invalid. To address this challenge, we propose a dependency graph-based offline decomposition (DGOD) approach, which decomposes non-i.i.d. Meta-RL data into multiple offline i.i.d. datasets using offline sampling and graph decomposition. With the DGOD approach, we derive practical PAC-Bayesian offline Meta-RL generalization bounds and design an algorithm with generalization guarantees that optimizes them. Experiments on several challenging Meta-RL benchmarks demonstrate that our algorithm effectively avoids meta-overfitting and outperforms recent state-of-the-art Meta-RL algorithms that lack generalization bounds.
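As background only (this is not the bound derived in the paper), the classical i.i.d. PAC-Bayes bound below (McAllester-style) illustrates the type of guarantee that the DGOD decomposition makes applicable to Meta-RL data: for any data-independent prior P and any posterior Q over hypotheses, with probability at least 1 − δ over an i.i.d. sample of size n,

```latex
% Classical PAC-Bayes (McAllester-style) bound, shown only as background.
% L(h): expected loss, \hat{L}_n(h): empirical loss on n i.i.d. samples,
% P: data-independent prior, Q: posterior, \delta: confidence parameter.
\mathbb{E}_{h \sim Q}\!\left[L(h)\right]
  \;\le\;
\mathbb{E}_{h \sim Q}\!\left[\hat{L}_n(h)\right]
  + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

Because offline Meta-RL data violate the i.i.d. assumption behind such bounds, DGOD first decomposes the data into multiple offline i.i.d. datasets to which these tools can then be applied.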
This repository provides the implementation of PBOMAC (PAC-Bayesian offline Meta-RL); see the citation at the end of this README.
To run the code, you first need to install PyTorch and MuJoCo (version 1.50 or above). If you run into any problems, please follow the installation steps from FOCAL.
For the remaining software dependencies, create a conda environment with:
conda env create -f environment.yaml
and then activate it (the environment name is defined in environment.yaml).
PBOMAC requires a fixed batch of offline data for meta-training and meta-testing, generated by trained SAC behavior policies. The training samples required for the comparative experiments can be found in the `data` folder.
If you want to generate offline sample data for other environments, `cd` into the `src.algos` folder and generate trajectories with:
python agent.py --env_name cheetah-vel --total_timesteps 2000000 --n_tasks 10
Generated data will be saved in ./data/
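If you want to collect data for several environments in one go, a minimal sketch like the one below can drive `agent.py` with the flags documented above. The environment names other than `cheetah-vel` are assumptions; adjust them to whatever `agent.py` accepts, and run the script from the same folder as `agent.py`.

```python
# batch_collect.py -- a minimal sketch (not part of the repository) that invokes
# agent.py repeatedly using only its documented flags. Trajectories are written
# to ./data/ by agent.py itself.
import subprocess
import sys

ENV_NAMES = ["cheetah-vel", "ant-dir"]  # assumed task names; edit as needed

for env_name in ENV_NAMES:
    cmd = [
        sys.executable, "agent.py",
        "--env_name", env_name,
        "--total_timesteps", "2000000",
        "--n_tasks", "10",
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```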
To run a meta-training experiment, e.g., on the ant-dir benchmark:
python main.py --env_config ./configs/ant-dir.json
Experiments are configured via JSON configuration files located in `./configs`. You can run the other experiments by changing the `--env_config` argument, e.g., to `cheetah-dir.json`. You can also manually adjust other parameters in `main.py`.
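If you prefer to derive a new experiment configuration programmatically instead of editing the JSON by hand, a minimal sketch along these lines may help. Only the config file paths come from this README; the keys you override depend on the contents of the shipped config files, so the key names below are placeholders.

```python
# make_config.py -- a minimal sketch (not part of the repository) for deriving a
# new experiment config from an existing one and running it via --env_config.
import json
from pathlib import Path

# Load an existing config (assumed to be a JSON object).
base = json.loads(Path("./configs/ant-dir.json").read_text())

# Override whichever entries you need (key names here are placeholders).
overrides = {}  # e.g., {"some_key": some_value}
base.update(overrides)

out_path = Path("./configs/ant-dir-custom.json")
out_path.write_text(json.dumps(base, indent=2))
print(f"Wrote {out_path}; run it with: python main.py --env_config {out_path}")
```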
Please open an issue on the issue tracker to report problems or ask questions, or send an email to me (outshine-J).
@article{PBOMAC,
title={{PBOMAC}: PAC-Bayesian offline Meta-reinforcement learning},
author={Zheng Sun and Chenheng Jing and Shangqi Guo and Lingling An},
journal={Applied Intelligence},
year={2023},
url={https://doi.org/10.1007/s10489-023-04911-y}
}