[WIP] Non-adaptive Agent Comparisons (rlberry-py#276)
* fix bug dill and compress always

* change version

* permutation tests

* comparison tukey_hsd and permutation and plot more or less working

* api and misc doc improvements

* simpler returns

* remove test thing

* add test

* doc, test, typos

* revert mistaken changes to readme

* add decision column,  remove old compare_agent

* correct test

* fix bug when merge

* fix agent manager
TimotheeMathieu authored Aug 21, 2023
1 parent 80a7ae4 commit cc84a0f
Showing 12 changed files with 609 additions and 74 deletions.
2 changes: 0 additions & 2 deletions README.md
@@ -124,12 +124,10 @@ The modules listed below are experimental at the moment, that is, they are not t
* `rlberry.network`: Allows communication between a server and client via sockets, and can be used to run agents remotely.
* `rlberry.agents.experimental`: Experimental agents that are not thoroughly tested.


## About us
This project was initiated and is actively maintained by [INRIA SCOOL team](https://team.inria.fr/scool/).
More information [here](https://rlberry.readthedocs.io/en/latest/about.html#).


## Contributing

Want to contribute to `rlberry`? Please check [our contribution guidelines](https://rlberry.readthedocs.io/en/latest/contributing.html). **If you want to add any new agents or environments, do not hesitate
1 change: 1 addition & 0 deletions docs/api.rst
@@ -28,6 +28,7 @@ Evaluation and plot
manager.evaluate_agents
manager.read_writer_data
manager.plot_writer_data
manager.compare_agents


Agents
71 changes: 0 additions & 71 deletions docs/basics/compare_agents.rst

This file was deleted.

126 changes: 126 additions & 0 deletions docs/basics/comparison.md
@@ -0,0 +1,126 @@

# Comparison of Agents

The performance of a single run of a deep RL algorithm is a random quantity, so independent runs are needed to assess it precisely.
In this section we use multiple hypothesis testing to check that enough fits were performed to be able to say that the agents are indeed different, and that the perceived differences are not just the result of randomness.


## Quick reminder on hypothesis testing

### Two sample testing

In its simplest form, a statistical test aims at deciding whether a given collection of data $X_1,\dots,X_N$ adheres to some hypothesis $H_0$ (called the null hypothesis), or whether it is a better fit for an alternative hypothesis $H_1$.

Consider two samples $X_1,\dots,X_N$ and $Y_1,\dots,Y_N$, and perform a two-sample test to decide whether the mean of the distribution of the $X_i$'s is equal to the mean of the distribution of the $Y_i$'s:

\begin{equation*}
H_0 : \mathbb{E}[X] = \mathbb{E}[Y] \quad \text{vs}\quad H_1: \mathbb{E}[X] \neq \mathbb{E}[Y]
\end{equation*}

In either case, the result of a test is to either accept $H_0$ or reject $H_0$. This answer is not a ground truth: there is some probability that we make an error. However, this probability of error can often be controlled, and it decomposes into type I and type II errors (denoted $\alpha$ and $\beta$ respectively).

<center>

| | $H_0$ is true | $H_0$ is false |
|-----------------|-----------------------|-----------------------|
| We accept $H_0$ | No error | type II error $\beta$ |
| We reject $H_0$ | type I error $\alpha$ | No error |

</center>

Note that the problem is not symmetric: failing to reject the null hypothesis does not mean that the null hypothesis is true. It can be that there is not enough data to reject $H_0$.
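
As a purely illustrative sketch of what such a test looks like in code, here is a two-sample Welch test at level $\alpha = 0.05$ on synthetic scores, using `scipy` rather than rlberry; the numbers are made up and merely stand in for agent evaluations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
scores_x = rng.normal(loc=200, scale=30, size=8)  # synthetic evaluations of agent X
scores_y = rng.normal(loc=230, scale=30, size=8)  # synthetic evaluations of agent Y

# Two-sample Welch t-test of H0: E[X] = E[Y] against H1: E[X] != E[Y]
result = stats.ttest_ind(scores_x, scores_y, equal_var=False)
alpha = 0.05  # allowed type I error
decision = "reject" if result.pvalue < alpha else "accept"
print(f"p-value = {result.pvalue:.3f} -> {decision} H0")
```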

### Multiple testing

When performing several statistical tests simultaneously, one must be careful: the errors of the individual tests accumulate, and if one is not cautious, the overall error may become non-negligible. As a consequence, several strategies have been developed to deal with the multiple testing problem.
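
To see why, a rough back-of-the-envelope computation (assuming, for simplicity, that the tests are independent) already shows how fast the error grows:

```python
# With 10 independent tests, each run at level alpha = 0.05, the probability
# of at least one false rejection when all null hypotheses are true is
alpha, n_tests = 0.05, 10
print(1 - (1 - alpha) ** n_tests)  # ~0.40, much larger than 0.05
```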

To deal with the multiple testing problem, the first step is to define what counts as an error. There are several definitions of error in multiple testing; one possibility is the family-wise error (FWE), defined as the probability of making at least one false rejection (at least one type I error):

$$\mathrm{FWE} = \mathbb{P}_{H_j, j \in \textbf{I}}\left(\exists j \in \textbf{I}:\quad \text{reject }H_j \right),$$

where $\mathbb{P}_{H_j, j \in \textbf{I}}$ is used to denote the probability when $\textbf{I}$ is the set of indices of the hypotheses that are actually true (and $\textbf{I}^c$ the set of hypotheses that are actually false).
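
A classical (if conservative) way to control the FWE at level $\alpha$ is the Bonferroni correction, which runs each of the $K$ individual tests at level $\alpha / K$. The procedures used by `compare_agents` (Tukey's HSD and, per the commit history, permutation-based tests) are less conservative, but the sketch below, on made-up p-values, conveys the idea of adjusting each test so that the overall error stays controlled:

```python
# Bonferroni correction: reject hypothesis j only if its p-value is below alpha / K.
p_values = [0.002, 0.030, 0.400]  # made-up p-values for K = 3 pairwise tests
alpha = 0.05
K = len(p_values)
decisions = ["reject" if p < alpha / K else "accept" for p in p_values]
print(decisions)  # ['reject', 'accept', 'accept']
```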

## Multiple agent comparison in rlberry

We compute the performance of an agent as follows:

```python
import numpy as np
from rlberry.envs import gym_make
from rlberry.agents.torch import A2CAgent
from rlberry.manager import AgentManager, evaluate_agents

env_ctor = gym_make
env_kwargs = dict(id="CartPole-v1")

n_simulations = 50
n_fit = 8

rbagent = AgentManager(
    A2CAgent,
    (env_ctor, env_kwargs),
    agent_name="A2CAgent",
    fit_budget=3e4,
    eval_kwargs=dict(eval_horizon=500),
    n_fit=n_fit,
)

rbagent.fit()  # train n_fit = 8 instances of the agent
performances = [
    np.mean(rbagent.eval_agents(n_simulations, agent_id=idx)) for idx in range(n_fit)
]
```

We begin by training `n_fit` instances of the agent (here $8$). Then, we evaluate each trained instance `n_simulations` times (here $50$); the performance of one trained instance is the mean of its evaluations. Doing this for every instance we trained gives `n_fit` performance values, and these `n_fit` numbers are the random variables on which the hypothesis tests are run.

The evaluation and the statistical hypothesis testing are handled by the function {func}`~rlberry.manager.compare_agents`.

For example, we may compare the PPO, A2C and DQN agents on CartPole with the following code.

```python
from rlberry.agents.torch import A2CAgent, PPOAgent, DQNAgent
from rlberry.manager.comparison import compare_agents

agents = [
    AgentManager(
        A2CAgent,
        (env_ctor, env_kwargs),
        agent_name="A2CAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        PPOAgent,
        (env_ctor, env_kwargs),
        agent_name="PPOAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
    AgentManager(
        DQNAgent,
        (env_ctor, env_kwargs),
        agent_name="DQNAgent",
        fit_budget=3e4,
        eval_kwargs=dict(eval_horizon=500),
        n_fit=n_fit,
    ),
]

for agent in agents:
    agent.fit()

print(compare_agents(agents))
```

```
Agent1 vs Agent2 mean Agent1 mean Agent2 mean diff std diff decisions p-val significance
0 A2CAgent vs PPOAgent 213.600875 423.431500 -209.830625 144.600160 reject 0.002048 **
1 A2CAgent vs DQNAgent 213.600875 443.296625 -229.695750 152.368506 reject 0.000849 ***
2 PPOAgent vs DQNAgent 423.431500 443.296625 -19.865125 104.279024 accept 0.926234
```

The results of `compare_agents(agents)` show the p-values and significance levels when the method is `tukey_hsd`, and in all cases they show the decision of the test (accept or reject) with the family-wise error controlled at level $0.05$. In our case, we see that A2C seems significantly worse than both PPO and DQN, while the difference between PPO and DQN is not statistically significant. Note that the absence of significance (that is, a decision to accept $H_0$) does not necessarily mean that the algorithms perform the same: it can be that there is not enough data.
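
Since the object printed above looks like a pandas DataFrame, the output of `compare_agents` can be post-processed like any DataFrame. This is a sketch, assuming the column names shown in the table above:

```python
results = compare_agents(agents)
# Keep only the comparisons where H0 is rejected, i.e. a significant difference.
significant = results[results["decisions"] == "reject"]
print(significant[["Agent1 vs Agent2", "mean diff", "p-val"]])
```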

*Remark*: the comparison we do here is a black-box comparison, in the sense that we don't care how the algorithms were tuned or how many training steps were used; we assume that the user already tuned these parameters adequately for a fair comparison.
15 changes: 15 additions & 0 deletions docs/conf.py
@@ -42,13 +42,28 @@
"sphinx.ext.autodoc",
"sphinx.ext.autosummary",
"sphinx.ext.mathjax",
"sphinx_math_dollar",
"sphinx.ext.autosectionlabel",
"sphinxcontrib.video",
"numpydoc",
"sphinx_gallery.gen_gallery",
"myst_parser",
]

myst_enable_extensions = ["amsmath"]
# myst_enable_extensions = [
# "amsmath",
# "colon_fence",
# "deflist",
# "dollarmath",
# "fieldlist",
# "html_admonition",
# "html_image",
# "replacements",
# "smartquotes",
# "substitution",
# "tasklist",
# ]

autodoc_default_options = {
"members": True,
1 change: 1 addition & 0 deletions docs/installation.rst
@@ -75,6 +75,7 @@ Deep RL agents require extra libraries, like PyTorch.
$ pip install git+https://github.com/rlberry-py/rlberry.git#egg=rlberry[torch_agents]
$ pip install tensorboard
* JAX agents (**Linux only, experimental**):


* Stable-baselines3 agents with Gymnasium support:
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -1,4 +1,5 @@
sphinx-gallery
sphinx-math-dollar
numpydoc
myst-parser
git+https://github.com/sphinx-contrib/video
2 changes: 1 addition & 1 deletion docs/user_guide.rst
@@ -45,10 +45,10 @@ Agents, hyperparameter optimization and experiment setup

basics/create_agent.rst
basics/evaluate_agent.rst
basics/compare_agents.rst
basics/experiment_setup.rst
basics/seeding.rst
basics/multiprocess.rst
basics/comparison.md

We also provide examples to show how to use :ref:`torch checkpointing<checkpointing_example>`
in rlberry and :ref:`tensorboard<dqn_example>`
106 changes: 106 additions & 0 deletions examples/comparison_agents.py
@@ -0,0 +1,106 @@
"""
=========================
Compare Bandit Algorithms
=========================
This example illustrates the use of compare_agents, a function that uses multiple testing to assess whether trained agents are
statistically different or not.
Note that when two agents are not deemed statistically different, it can mean either that they are equally efficient,
or that there have not been enough fits to assess the variability of the agents.
"""

import numpy as np

from rlberry.manager.comparison import compare_agents
from rlberry.manager import AgentManager
from rlberry.envs.bandits import BernoulliBandit
from rlberry.wrappers import WriterWrapper
from rlberry.agents.bandits import (
    IndexAgent,
    makeBoundedMOSSIndex,
    makeBoundedNPTSIndex,
    makeBoundedUCBIndex,
    makeETCIndex,
)

# Parameters of the problem
means = np.array([0.6, 0.6, 0.6, 0.9]) # means of the arms
A = len(means)
T = 2000 # Horizon
N = 50 # number of fits

# Construction of the experiment

env_ctor = BernoulliBandit
env_kwargs = {"p": means}


class UCBAgent(IndexAgent):
    name = "UCB"

    def __init__(self, env, **kwargs):
        index, _ = makeBoundedUCBIndex()
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")


class ETCAgent(IndexAgent):
    name = "ETC"

    def __init__(self, env, m=20, **kwargs):
        index, _ = makeETCIndex(A, m)
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(
            self.env, self.writer, write_scalar="action_and_reward"
        )


class MOSSAgent(IndexAgent):
    name = "MOSS"

    def __init__(self, env, **kwargs):
        index, _ = makeBoundedMOSSIndex(T, A)
        IndexAgent.__init__(self, env, index, **kwargs)
        self.env = WriterWrapper(
            self.env, self.writer, write_scalar="action_and_reward"
        )


class NPTSAgent(IndexAgent):
    name = "NPTS"

    def __init__(self, env, **kwargs):
        index, tracker_params = makeBoundedNPTSIndex()
        IndexAgent.__init__(self, env, index, tracker_params=tracker_params, **kwargs)
        self.env = WriterWrapper(self.env, self.writer, write_scalar="reward")


Agents_class = [MOSSAgent, NPTSAgent, UCBAgent, ETCAgent]

managers = [
    AgentManager(
        Agent,
        train_env=(env_ctor, env_kwargs),
        fit_budget=T,
        parallelization="process",
        mp_context="fork",
        n_fit=N,
    )
    for Agent in Agents_class
]


for manager in managers:
    manager.fit()


def eval_function(manager, eval_budget=None, agent_id=0):
    # Evaluate a trained bandit agent through its empirical regret: the expected
    # reward of always pulling the best arm minus the reward actually collected.
    df = manager.get_writer_data()[agent_id]
    return T * np.max(means) - np.sum(df.loc[df["tag"] == "reward", "value"])


print(
    compare_agents(managers, method="tukey_hsd", eval_function=eval_function, B=10_000)
)
1 change: 1 addition & 0 deletions rlberry/manager/__init__.py
@@ -2,6 +2,7 @@
from .multiple_managers import MultipleManagers
from .remote_experiment_manager import RemoteExperimentManager
from .evaluation import evaluate_agents, plot_writer_data, read_writer_data
from .comparison import compare_agents

# (Remote)AgentManager alias for the (Remote)ExperimentManager class, for backward compatibility
AgentManager = ExperimentManager