[WIP] Non-adaptive Agent Comparisons (rlberry-py#276)

* fix bug dill and compress always
* change version
* permutation tests
* comparison tukey_hsd and permutation and plot more or less working
* api and misc doc improvements
* simpler returns
* remove test thing
* add test
* doc, test, typos
* revert mistaken changes to readme
* add decision column, remove old compare_agent
* correct test
* fix bug when merge
* fix agent manager

1 parent 80a7ae4 · commit cc84a0f

Showing 12 changed files with 609 additions and 74 deletions.

@@ -0,0 +1,126 @@

# Comparison of Agents

The performance of one execution of a Deep RL algorithm is random, so independent executions are needed to assess it precisely.
In this section we use multiple hypothesis testing to check that enough fits were used to be able to say that the agents are indeed different, and that the perceived differences are not just a result of randomness.

## Quick reminder on hypothesis testing

### Two sample testing

In its most simple form, a statistical test aims at deciding whether a given collection of data $X_1,\dots,X_N$ adheres to some hypothesis $H_0$ (called the null hypothesis), or whether it is a better fit for an alternative hypothesis $H_1$.

Consider two samples $X_1,\dots,X_N$ and $Y_1,\dots,Y_N$, and perform a two-sample test deciding whether the mean of the distribution of the $X_i$'s is equal to the mean of the distribution of the $Y_i$'s:

\begin{equation*}
H_0 : \mathbb{E}[X] = \mathbb{E}[Y] \quad \text{vs}\quad H_1: \mathbb{E}[X] \neq \mathbb{E}[Y]
\end{equation*}

In both cases, the result of a test is either to accept $H_0$ or to reject $H_0$. This answer is not a ground truth: there is some probability that we make an error. However, this probability of error is often controlled and can be decomposed into type I and type II errors (often denoted $\alpha$ and $\beta$ respectively).

<center>

|                 | $H_0$ is true         | $H_0$ is false        |
|-----------------|-----------------------|-----------------------|
| We accept $H_0$ | No error              | type II error $\beta$ |
| We reject $H_0$ | type I error $\alpha$ | No error              |

</center>

Note that the problem is not symmetric: failing to reject the null hypothesis does not mean that the null hypothesis is true. It can simply be that there is not enough data to reject $H_0$.
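
To make this concrete, here is a minimal sketch (independent of rlberry) of a two-sample permutation test on the difference of means; permutation tests are one of the methods mentioned in this commit. The sample values and the number of permutations below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hypothetical samples of agent performances (illustration only).
x = rng.normal(loc=200.0, scale=30.0, size=8)
y = rng.normal(loc=230.0, scale=30.0, size=8)

observed = np.abs(x.mean() - y.mean())
pooled = np.concatenate([x, y])

B = 10_000  # number of random permutations
count = 0
for _ in range(B):
    perm = rng.permutation(pooled)
    diff = np.abs(perm[: len(x)].mean() - perm[len(x) :].mean())
    count += diff >= observed

# Permutation p-value (with the usual add-one correction).
p_value = (1 + count) / (1 + B)
print(f"two-sample permutation p-value: {p_value:.4f}")
```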

### Multiple testing

When doing several statistical tests simultaneously, one must be careful: the errors of the individual tests accumulate, and if one is not cautious, the overall error may become non-negligible. As a consequence, multiple strategies have been developed to deal with the multiple testing problem.

To deal with the multiple testing problem, the first step is to define what an error is. There are several definitions of error in multiple testing; one possibility is the family-wise error, which is defined as the probability of making at least one false rejection (at least one type I error):

$$\mathrm{FWE} = \mathbb{P}_{H_j, j \in \textbf{I}}\left(\exists j \in \textbf{I}:\quad \text{reject }H_j \right),$$

where $\mathbb{P}_{H_j, j \in \textbf{I}}$ is used to denote the probability when $\textbf{I}$ is the set of indices of the hypotheses that are actually true (and $\textbf{I}^c$ the set of hypotheses that are actually false).
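
To see why a correction is needed when many tests are run, here is a quick, purely illustrative computation (not part of rlberry): with $K$ independent tests each performed at level $\alpha$, the probability of at least one false rejection when all null hypotheses are true is $1 - (1 - \alpha)^K$, which grows quickly with $K$; the classical Bonferroni correction runs each test at level $\alpha / K$ to keep the family-wise error below $\alpha$.

```python
alpha = 0.05
for K in (1, 3, 10, 50):
    fwe_uncorrected = 1 - (1 - alpha) ** K     # all K tests at level alpha
    fwe_bonferroni = 1 - (1 - alpha / K) ** K  # each test at level alpha / K
    print(f"K={K:>2}  uncorrected FWE={fwe_uncorrected:.3f}  Bonferroni FWE={fwe_bonferroni:.3f}")
```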

## Multiple agent comparison in rlberry

We compute the performance of one agent as follows:

```python
import numpy as np
from rlberry.envs import gym_make
from rlberry.agents.torch import A2CAgent
from rlberry.manager import AgentManager

env_ctor = gym_make
env_kwargs = dict(id="CartPole-v1")

n_simulations = 50
n_fit = 8

rbagent = AgentManager(
    A2CAgent,
    (env_ctor, env_kwargs),
    agent_name="A2CAgent",
    fit_budget=3e4,
    eval_kwargs=dict(eval_horizon=500),
    n_fit=n_fit,
)

rbagent.fit()  # train n_fit = 8 agents
performances = [
    np.mean(rbagent.eval_agents(n_simulations, agent_id=idx)) for idx in range(n_fit)
]
```

We begin by training all the agents (here $8$ of them). Then, we evaluate each trained agent `n_simulations` times (here $50$). The performance of one trained agent is then the mean of its evaluations. We do this for each agent that we trained, obtaining `n_fit` evaluation performances. These `n_fit` numbers are the random numbers that will be used for hypothesis testing.
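
As an illustration of what the testing machinery receives (this snippet is not part of rlberry's API and the numbers are made up), two such lists of per-fit performances can already be fed to an off-the-shelf two-sample test, for instance Welch's t-test from SciPy:

```python
from scipy import stats

# Hypothetical per-fit mean evaluations of two agents (n_fit numbers each).
performances_a2c = [190.2, 230.5, 201.7, 188.3, 245.0, 210.9, 199.4, 222.1]
performances_ppo = [410.8, 398.5, 455.2, 430.1, 441.7, 405.3, 460.0, 425.9]

# Welch's two-sample t-test on the difference of mean performances.
t_stat, p_value = stats.ttest_ind(performances_a2c, performances_ppo, equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
```

In practice, `compare_agents` (introduced below) takes care of this step and additionally controls the error made when several pairs of agents are compared at once.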

The evaluation and statistical hypothesis testing are handled through the function {class}`~rlberry.manager.compare_agents`.

For example, we may compare PPO, A2C and DQN agents on CartPole with the following code.

```python
from rlberry.agents.torch import A2CAgent, PPOAgent, DQNAgent
from rlberry.manager.comparison import compare_agents

agents = [
    AgentManager(
        A2CAgent,
        (env_ctor, env_kwargs),
        agent_name="A2CAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        PPOAgent,
        (env_ctor, env_kwargs),
        agent_name="PPOAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        DQNAgent,
        (env_ctor, env_kwargs),
        agent_name="DQNAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
]

for agent in agents:
    agent.fit()

print(compare_agents(agents))
```

```
       Agent1 vs Agent2  mean Agent1  mean Agent2   mean diff    std diff decisions     p-val significance
0  A2CAgent vs PPOAgent   213.600875   423.431500 -209.830625  144.600160    reject  0.002048           **
1  A2CAgent vs DQNAgent   213.600875   443.296625 -229.695750  152.368506    reject  0.000849          ***
2  PPOAgent vs DQNAgent   423.431500   443.296625  -19.865125  104.279024    accept  0.926234
```

The results of `compare_agents(agents)` show the p-values and significance levels when the method is `tukey_hsd`, and in all cases they show the decision (accept or reject) of the test, with the family-wise error controlled at $0.05$. In our case, we see that A2C seems significantly worse than both PPO and DQN, but the difference between PPO and DQN is not statistically significant. Remark that no significance (that is, a decision to accept $H_0$) does not necessarily mean that the algorithms perform the same; it can be that there is not enough data.
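
The bandit example further down in this commit also passes `method`, `eval_function` and `B` arguments to `compare_agents`. The commit message mentions a permutation-based comparison in addition to `tukey_hsd`; the exact string selecting it is not visible in this diff, so the second call below is a hypothetical sketch:

```python
# Explicit method and B arguments (as used in the bandit example
# at the end of this commit).
print(compare_agents(agents, method="tukey_hsd", B=10_000))

# Hypothetical: the commit message also mentions a permutation-based test;
# "permutation" is an assumed value for `method`, not confirmed by this diff.
print(compare_agents(agents, method="permutation", B=10_000))
```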

*Remark*: the comparison we do here is a black-box comparison, in the sense that we do not care how the algorithms were tuned or how many training steps are used; we suppose that the user has already tuned these parameters adequately for a fair comparison.

@@ -1,4 +1,5 @@
sphinx-gallery
sphinx-math-dollar
numpydoc
myst-parser
git+https://github.com/sphinx-contrib/video

@@ -0,0 +1,106 @@
""" | ||
========================= | ||
Compare Bandit Algorithms | ||
========================= | ||
This example illustrate the use of compare_agents, a function that uses multiple-testing to assess whether traine agents are | ||
statistically different or not. | ||
Remark that in the case where two agents are not deemed statistically different it can mean either that they are as efficient, | ||
or it can mean that there have not been enough fits to assess the variability of the agents. | ||
""" | ||

import numpy as np

from rlberry.manager.comparison import compare_agents
from rlberry.manager import AgentManager
from rlberry.envs.bandits import BernoulliBandit
from rlberry.wrappers import WriterWrapper
from rlberry.agents.bandits import (
    IndexAgent,
    makeBoundedMOSSIndex,
    makeBoundedNPTSIndex,
    makeBoundedUCBIndex,
    makeETCIndex,
)

# Parameters of the problem
means = np.array([0.6, 0.6, 0.6, 0.9])  # means of the arms
A = len(means)
T = 2000  # Horizon
N = 50  # number of fits

# Construction of the experiment

env_ctor = BernoulliBandit
env_kwargs = {"p": means}


class UCBAgent(IndexAgent):
    name = "UCB"

    def __init__(self, env, **kwargs):
        index, _ = makeBoundedUCBIndex()
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")


class ETCAgent(IndexAgent):
    name = "ETC"

    def __init__(self, env, m=20, **kwargs):
        index, _ = makeETCIndex(A, m)
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(
            self.env, self.writer, write_scalar="action_and_reward"
        )


class MOSSAgent(IndexAgent):
    name = "MOSS"

    def __init__(self, env, **kwargs):
        index, _ = makeBoundedMOSSIndex(T, A)
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(
            self.env, self.writer, write_scalar="action_and_reward"
        )


class NPTSAgent(IndexAgent):
    name = "NPTS"

    def __init__(self, env, **kwargs):
        index, tracker_params = makeBoundedNPTSIndex()
        IndexAgent.__init__(self, env, index, tracker_params=tracker_params, **kwargs)
        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")


Agents_class = [MOSSAgent, NPTSAgent, UCBAgent, ETCAgent]

managers = [
    AgentManager(
        Agent,
        train_env=(env_ctor, env_kwargs),
        fit_budget=T,
        parallelization="process",
        mp_context="fork",
        n_fit=N,
    )
    for Agent in Agents_class
]


for manager in managers:
    manager.fit()


def eval_function(manager, eval_budget=None, agent_id=0):
    # Evaluation used for the comparison: estimated cumulative regret of one
    # fitted agent over the horizon T, computed from the rewards logged
    # by the writer.
    df = manager.get_writer_data()[agent_id]
    return T * np.max(means) - np.sum(df.loc[df["tag"] == "reward", "value"])


print(
    compare_agents(managers, method="tukey_hsd", eval_function=eval_function, B=10_000)
)