[WIP] Non-adaptive Agent Comparisons (rlberry-py#276)
* fix bug dill and compress always

* change version

* permutation tests

* comparison tukey_hsd and permutation and plot more or less working

* api and misc doc improvements

* simpler returns

* remove test thing

* add test

* doc, test, typos

* revert mistaken changes to readme

* add decision column,  remove old compare_agent

* correct test

* fix bug when merge

* fix agent manager
TimotheeMathieu authored Aug 21, 2023
1 parent 80a7ae4 commit cc84a0f
Showing 12 changed files with 609 additions and 74 deletions.
2 changes: 0 additions & 2 deletions README.md
@@ -124,12 +124,10 @@ The modules listed below are experimental at the moment, that is, they are not t
* `rlberry.network`: Allows communication between a server and client via sockets, and can be used to run agents remotely.
* `rlberry.agents.experimental`: Experimental agents that are not thoroughly tested.


## About us
This project was initiated and is actively maintained by [INRIA SCOOL team](https://team.inria.fr/scool/).
More information [here](https://rlberry.readthedocs.io/en/latest/about.html#).


## Contributing

Want to contribute to `rlberry`? Please check [our contribution guidelines](https://rlberry.readthedocs.io/en/latest/contributing.html). **If you want to add any new agents or environments, do not hesitate
1 change: 1 addition & 0 deletions docs/api.rst
@@ -28,6 +28,7 @@ Evaluation and plot
manager.evaluate_agents
manager.read_writer_data
manager.plot_writer_data
manager.compare_agents


Agents
71 changes: 0 additions & 71 deletions docs/basics/compare_agents.rst

This file was deleted.

126 changes: 126 additions & 0 deletions docs/basics/comparison.md
@@ -0,0 +1,126 @@

# Comparison of Agents

The performance of a single run of a deep RL algorithm is a random quantity, so independent runs are needed to assess it precisely.
In this section we use multiple hypothesis testing to check that enough fits were performed to be able to say that the agents are indeed different, and that the perceived differences are not just the result of randomness.


## Quick reminder on hypothesis testing

### Two sample testing

In its simplest form, a statistical test aims at deciding whether a given collection of data $X_1,\dots,X_N$ adheres to some hypothesis $H_0$ (called the null hypothesis), or whether it is a better fit for an alternative hypothesis $H_1$.

Consider two samples $X_1,\dots,X_N$ and $Y_1,\dots,Y_N$, and perform a two-sample test to decide whether the mean of the distribution of the $X_i$'s is equal to the mean of the distribution of the $Y_i$'s:

\begin{equation*}
H_0 : \mathbb{E}[X] = \mathbb{E}[Y] \quad \text{vs}\quad H_1: \mathbb{E}[X] \neq \mathbb{E}[Y]
\end{equation*}

In either case, the result of a test is to either accept $H_0$ or reject $H_0$. This answer is not a ground truth: there is some probability that we make an error. However, this probability of error can often be controlled, and it decomposes into type I and type II errors (denoted $\alpha$ and $\beta$ respectively).

<center>

| | $H_0$ is true | $H_0$ is false |
|-----------------|-----------------------|-----------------------|
| We accept $H_0$ | No error | type II error $\beta$ |
| We reject $H_0$ | type I error $\alpha$ | No error |

</center>

Note that the problem is not symmetric: failing to reject the null hypothesis does not mean that the null hypothesis is true. It can be that there is not enough data to reject $H_0$.
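
As a purely illustrative sketch of what such a test looks like in code, here is a two-sample Welch test at level $\alpha = 0.05$ on synthetic scores, using `scipy` rather than rlberry; the numbers are made up and merely stand in for agent evaluations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores_x = rng.normal(loc=200, scale=30, size=8)  # synthetic evaluations of agent X
scores_y = rng.normal(loc=230, scale=30, size=8)  # synthetic evaluations of agent Y

# Two-sample Welch t-test of H0: E[X] = E[Y] against H1: E[X] != E[Y]
result = stats.ttest_ind(scores_x, scores_y, equal_var=False)
alpha = 0.05  # allowed type I error
decision = "reject" if result.pvalue < alpha else "accept"
print(f"p-value = {result.pvalue:.3f} -> {decision} H0")
```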

### Multiple testing

When performing several statistical tests simultaneously, one must be careful: the errors of the individual tests accumulate, and if one is not cautious, the overall error may become non-negligible. As a consequence, several strategies have been developed to deal with the multiple testing problem.
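
To see why, a rough back-of-the-envelope computation (assuming, for simplicity, that the tests are independent) already shows how fast the error grows:

```python
# With 10 independent tests, each run at level alpha = 0.05, the probability
# of at least one false rejection when all null hypotheses are true is
alpha, n_tests = 0.05, 10
print(1 - (1 - alpha) ** n_tests)  # ~0.40, much larger than 0.05
```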

To deal with the multiple testing problem, the first step is to define what counts as an error. There are several definitions of error in multiple testing; one possibility is the family-wise error (FWE), defined as the probability of making at least one false rejection (at least one type I error):

$$\mathrm{FWE} = \mathbb{P}_{H_j, j \in \textbf{I}}\left(\exists j \in \textbf{I}:\quad \text{reject }H_j \right),$$

where $\mathbb{P}_{H_j, j \in \textbf{I}}$ is used to denote the probability when $\textbf{I}$ is the set of indices of the hypotheses that are actually true (and $\textbf{I}^c$ the set of hypotheses that are actually false).
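
A classical (if conservative) way to control the FWE at level $\alpha$ is the Bonferroni correction, which runs each of the $K$ individual tests at level $\alpha / K$. The procedures used by `compare_agents` (Tukey's HSD and, per the commit history, permutation-based tests) are less conservative, but the sketch below, on made-up p-values, conveys the idea of adjusting each test so that the overall error stays controlled:

```python
# Bonferroni correction: reject hypothesis j only if its p-value is below alpha / K.
p_values = [0.002, 0.030, 0.400]  # made-up p-values for K = 3 pairwise tests
alpha = 0.05
K = len(p_values)
decisions = ["reject" if p < alpha / K else "accept" for p in p_values]
print(decisions)  # ['reject', 'accept', 'accept']
```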

## Multiple agent comparison in rlberry

We compute the performance of an agent as follows:

```python
import numpy as np
from rlberry.envs import gym_make
from rlberry.agents.torch import A2CAgent
from rlberry.manager import AgentManager, evaluate_agents

env_ctor = gym_make
env_kwargs = dict(id="CartPole-v1")

n_simulations = 50
n_fit = 8

rbagent = AgentManager(
    A2CAgent,
    (env_ctor, env_kwargs),
    agent_name="A2CAgent",
    fit_budget=3e4,
    eval_kwargs=dict(eval_horizon=500),
    n_fit=n_fit,
)

rbagent.fit()  # train n_fit = 8 instances of the agent
performances = [
    np.mean(rbagent.eval_agents(n_simulations, agent_id=idx)) for idx in range(n_fit)
]
```

We begin by training `n_fit` instances of the agent (here $8$). Then, we evaluate each trained instance `n_simulations` times (here $50$); the performance of one trained instance is the mean of its evaluations. Doing this for every instance we trained gives `n_fit` performance values, and these `n_fit` numbers are the random variables on which the hypothesis tests are run.

The evaluation and the statistical hypothesis testing are handled by the function {func}`~rlberry.manager.compare_agents`.

For example, we may compare the PPO, A2C and DQN agents on CartPole with the following code.

```python
from rlberry.agents.torch import A2CAgent, PPOAgent, DQNAgent
from rlberry.manager.comparison import compare_agents

agents = [
    AgentManager(
        A2CAgent,
        (env_ctor, env_kwargs),
        agent_name="A2CAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        PPOAgent,
        (env_ctor, env_kwargs),
        agent_name="PPOAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        DQNAgent,
        (env_ctor, env_kwargs),
        agent_name="DQNAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
]

for agent in agents:
    agent.fit()

print(compare_agents(agents))
```

```
Agent1 vs Agent2 mean Agent1 mean Agent2 mean diff std diff decisions p-val significance
0 A2CAgent vs PPOAgent 213.600875 423.431500 -209.830625 144.600160 reject 0.002048 **
1 A2CAgent vs DQNAgent 213.600875 443.296625 -229.695750 152.368506 reject 0.000849 ***
2 PPOAgent vs DQNAgent 423.431500 443.296625 -19.865125 104.279024 accept 0.926234
```

The results of `compare_agents(agents)` show the p-values and significance levels when the method is `tukey_hsd`, and in all cases they show the decision of the test (accept or reject) with the family-wise error controlled at level $0.05$. In our case, we see that A2C seems significantly worse than both PPO and DQN, while the difference between PPO and DQN is not statistically significant. Note that the absence of significance (that is, a decision to accept $H_0$) does not necessarily mean that the algorithms perform the same: it can be that there is not enough data.
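
Since the object printed above looks like a pandas DataFrame, the output of `compare_agents` can be post-processed like any DataFrame. This is a sketch, assuming the column names shown in the table above:

```python
results = compare_agents(agents)
# Keep only the comparisons where H0 is rejected, i.e. a significant difference.
significant = results[results["decisions"] == "reject"]
print(significant[["Agent1 vs Agent2", "mean diff", "p-val"]])
```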

*Remark*: the comparison we do here is a black-box comparison, in the sense that we don't care how the algorithms were tuned or how many training steps were used; we assume that the user already tuned these parameters adequately for a fair comparison.
15 changes: 15 additions & 0 deletions docs/conf.py
@@ -42,13 +42,28 @@
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.mathjax",
"sphinx_math_dollar",
"sphinx.ext.autosectionlabel",
"sphinxcontrib.video",
"numpydoc",
"sphinx_gallery.gen_gallery",
"myst_parser",
]

myst_enable_extensions = ["amsmath"]
# myst_enable_extensions = [
# "amsmath",
# "colon_fence",
# "deflist",
# "dollarmath",
# "fieldlist",
# "html_admonition",
# "html_image",
# "replacements",
# "smartquotes",
# "substitution",
# "tasklist",
# ]

autodoc_default_options = {
"members": True,
1 change: 1 addition & 0 deletions docs/installation.rst
@@ -75,6 +75,7 @@ Deep RL agents require extra libraries, like PyTorch.
$ pip install git+https://github.com/rlberry-py/rlberry.git#egg=rlberry[torch_agents]
$ pip install tensorboard
* JAX agents (**Linux only, experimental**):


* Stable-baselines3 agents with Gymnasium support:
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -1,4 +1,5 @@
sphinx-gallery
sphinx-math-dollar
numpydoc
myst-parser
git+https://github.com/sphinx-contrib/video
2 changes: 1 addition & 1 deletion docs/user_guide.rst
@@ -45,10 +45,10 @@ Agents, hyperparameter optimization and experiment setup

basics/create_agent.rst
basics/evaluate_agent.rst
basics/compare_agents.rst
basics/experiment_setup.rst
basics/seeding.rst
basics/multiprocess.rst
basics/comparison.md

We also provide examples to show how to use :ref:`torch checkpointing<checkpointing_example>`
in rlberry and :ref:`tensorboard<dqn_example>`
106 changes: 106 additions & 0 deletions examples/comparison_agents.py
@@ -0,0 +1,106 @@
"""
=========================
Compare Bandit Algorithms
=========================
This example illustrates the use of compare_agents, a function that uses multiple testing to assess whether trained agents are
statistically different or not.
Note that when two agents are not deemed statistically different, it can mean either that they are equally efficient,
or that there have not been enough fits to assess the variability of the agents.
"""

import numpy as np

from rlberry.manager.comparison import compare_agents
from rlberry.manager import AgentManager
from rlberry.envs.bandits import BernoulliBandit
from rlberry.wrappers import WriterWrapper
from rlberry.agents.bandits import (
    IndexAgent,
    makeBoundedMOSSIndex,
    makeBoundedNPTSIndex,
    makeBoundedUCBIndex,
    makeETCIndex,
)

# Parameters of the problem
means = np.array([0.6, 0.6, 0.6, 0.9]) # means of the arms
A = len(means)
T = 2000 # Horizon
N = 50 # number of fits

# Construction of the experiment

env_ctor = BernoulliBandit
env_kwargs = {"p": means}


class UCBAgent(IndexAgent):
    name = "UCB"

    def __init__(self, env, **kwargs):
        index, _ = makeBoundedUCBIndex()
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")


class ETCAgent(IndexAgent):
    name = "ETC"

    def __init__(self, env, m=20, **kwargs):
        index, _ = makeETCIndex(A, m)
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(
            self.env, self.writer, write_scalar="action_and_reward"
        )


class MOSSAgent(IndexAgent):
    name = "MOSS"

    def __init__(self, env, **kwargs):
        index, _ = makeBoundedMOSSIndex(T, A)
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(
            self.env, self.writer, write_scalar="action_and_reward"
        )


class NPTSAgent(IndexAgent):
    name = "NPTS"

    def __init__(self, env, **kwargs):
        index, tracker_params = makeBoundedNPTSIndex()
        IndexAgent.__init__(self, env, index, tracker_params=tracker_params, **kwargs)
        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")


Agents_class = [MOSSAgent, NPTSAgent, UCBAgent, ETCAgent]

managers = [
    AgentManager(
        Agent,
        train_env=(env_ctor, env_kwargs),
        fit_budget=T,
        parallelization="process",
        mp_context="fork",
        n_fit=N,
    )
    for Agent in Agents_class
]


for manager in managers:
    manager.fit()


def eval_function(manager, eval_budget=None, agent_id=0):
    # Evaluate a trained bandit agent through its empirical regret: the expected
    # reward of always pulling the best arm minus the reward actually collected.
    df = manager.get_writer_data()[agent_id]
    return T * np.max(means) - np.sum(df.loc[df["tag"] == "reward", "value"])


print(
    compare_agents(managers, method="tukey_hsd", eval_function=eval_function, B=10_000)
)
1 change: 1 addition & 0 deletions rlberry/manager/__init__.py
@@ -2,6 +2,7 @@
from .multiple_managers import MultipleManagers
from .remote_experiment_manager import RemoteExperimentManager
from .evaluation import evaluate_agents, plot_writer_data, read_writer_data
from .comparison import compare_agents

# (Remote)AgentManager alias for the (Remote)ExperimentManager class, for backward compatibility
AgentManager = ExperimentManager