merge upstream changes (#186)
* freeze pytorch version to fix mypy crash (Learning-and-Intelligent-Systems#1563)

* Implement infinite-horizon for exploration (Learning-and-Intelligent-Systems#1565)

* simple initial implementation

* fix checks

* okay - really fix checks now

* MyPy Bump and changes (Learning-and-Intelligent-Systems#1568)

* minor changes to fix bugs (Learning-and-Intelligent-Systems#1569)

* fix get_objects in hierarchical typing case (Learning-and-Intelligent-Systems#1572)

Co-authored-by: Tom Silver <[email protected]>

* fix hierarchical typing edge case (Learning-and-Intelligent-Systems#1574)

* Fix + raise awareness of subtle bugs with active sampler exploration (Learning-and-Intelligent-Systems#1575)

* fix subtle bugs

* yapf

* Ball and Cup Sticky Table Env (Learning-and-Intelligent-Systems#1576)

* initial commit that seems to run without error...

* fix bug in placing logic

* delete outdated comment

* fix replanning bug

* more data = better results???

* starting tests

* try oracle feature selection?

* fix buggy test

* increase training time?

* yapf + fix tom comment

* fix reachability issue in placing

* minor

* more unit tests

* fix and more tests

* this should be interesting

* see if this yields a difference

* let's see what happens now

* woops

* try removing placing cup with the ball on the table

* hail mary

* minor changes + logging

* run task repeat first

* sticky table with moving radius

* yay! try other approaches...

* polar coordinates ftw!

* try a simpler thing

* let's see how this does.

* try more probability of success

* all baselines

* try running grid row env

* most things passing

* try this

* progress towards PR

* should be ready!

* revert unnecessary change

* fix linting

* tom comments

---------

Co-authored-by: Tom Silver <[email protected]>

* allow third party users to define their own oracle NSRTs (Learning-and-Intelligent-Systems#1578)

* allow third party users to define their own oracle NSRTs

* test fixes

* mypy

* Clustering via reverse engineering (Learning-and-Intelligent-Systems#1556)

* Initial commit.

* Fix a minor bug.

* Small changes to satisfy mypy.

* Fix linting.

* Add tests.

* fixes

* fix minor grammatical issue

* Change check for non-zero types.

---------

Co-authored-by: Nishanth Kumar <[email protected]>
Co-authored-by: Nishanth Kumar <[email protected]>

* pin openai dependency (Learning-and-Intelligent-Systems#1580)

* changes to produce prettier grid row graphs (Learning-and-Intelligent-Systems#1577)

* add functionality for rendering videos within cogman, rather than within the environment (Learning-and-Intelligent-Systems#1581)

* add info to FD crashes (Learning-and-Intelligent-Systems#1582)

* disable flakey tests (Learning-and-Intelligent-Systems#1586)

* Remove dead email and add NJK email in README (Learning-and-Intelligent-Systems#1583)

with Rohan's blessing

* handle planning failures within task planning in active sampler explorer (Learning-and-Intelligent-Systems#1584)

* add separate flag for approach wrapper (Learning-and-Intelligent-Systems#1585)

* fix expected atoms monitoring (Learning-and-Intelligent-Systems#1587)

* Split fail focus into UCB and non-UCB baselines (Learning-and-Intelligent-Systems#1579)

* try non-ucb exploration baseline

* update plotting script

* ready!

* yapf

* Sample a random point inside a `_Geom2D` (Learning-and-Intelligent-Systems#1591)

* should be gtg!

* should be gtg

* pursue task goal during exploration only every n cycles (Learning-and-Intelligent-Systems#1589)

* fix loading during online learning (Learning-and-Intelligent-Systems#1592)

* a few fixes to saving and loading in active sampler learning (Learning-and-Intelligent-Systems#1593)

* use a highly optimistic initial competence until the second cycle (Learning-and-Intelligent-Systems#1595)

* try a simpler fix

* try again

* merge in upstream

---------

Co-authored-by: Tom Silver <[email protected]>
Co-authored-by: Nishanth Kumar <[email protected]>
Co-authored-by: Bartłomiej Cieślar <[email protected]>
Co-authored-by: Tom Silver <[email protected]>
Co-authored-by: Ashay Athalye <[email protected]>
Co-authored-by: Nishanth Kumar <[email protected]>
7 people authored Dec 5, 2023
1 parent 208cbd9 commit 2d008b9
Showing 7 changed files with 66 additions and 28 deletions.
6 changes: 3 additions & 3 deletions predicators/approaches/active_sampler_learning_approach.py
@@ -136,7 +136,8 @@ def load(self, online_learning_cycle: Optional[int]) -> None:
self._nsrt_to_explorer_sampler = save_dict["nsrt_to_explorer_sampler"]
self._seen_train_task_idxs = save_dict["seen_train_task_idxs"]
self._train_tasks = save_dict["train_tasks"]
self._online_learning_cycle = save_dict["online_learning_cycle"]
self._online_learning_cycle = 0 if online_learning_cycle is None \
else online_learning_cycle + 1

def _learn_nsrts(self, trajectories: List[LowLevelTrajectory],
online_learning_cycle: Optional[int],
@@ -179,20 +180,19 @@ def _learn_nsrts(self, trajectories: List[LowLevelTrajectory],
with open(f"{save_path}_{online_learning_cycle}.DATA", "wb") as f:
pkl.dump(
{
"dataset": self._dataset,
"sampler_data": self._sampler_data,
"ground_op_hist": self._ground_op_hist,
"competence_models": self._competence_models,
"last_seen_segment_traj_idx":
self._last_seen_segment_traj_idx,
"nsrt_to_explorer_sampler": self._nsrt_to_explorer_sampler,
"seen_train_task_idxs": self._seen_train_task_idxs,
"dataset": self._dataset,
# We need to save train tasks because they get modified
# in the explorer. The original sin is that tasks are
# generated before reset with default init states, which
# are subsequently overwritten after reset is called.
"train_tasks": self._train_tasks,
"online_learning_cycle": self._online_learning_cycle,
},
f)

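
The load() hunk above stops trusting the pickled "online_learning_cycle" value and instead derives the counter from the checkpoint being loaded. A minimal sketch of that rule, using a helper name of our own choosing rather than the repository's:

from typing import Optional


def next_online_learning_cycle(loaded_cycle: Optional[int]) -> int:
    # Loading the pre-learning checkpoint (None) resumes at cycle 0;
    # loading the checkpoint saved after cycle k resumes at cycle k + 1.
    return 0 if loaded_cycle is None else loaded_cycle + 1


assert next_online_learning_cycle(None) == 0
assert next_online_learning_cycle(3) == 4

In other words, resuming a run whose last saved checkpoint was cycle 3 continues learning at cycle 4.
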
13 changes: 10 additions & 3 deletions predicators/competence_models.py
@@ -66,7 +66,10 @@ def predict_competence(self, num_additional_data: int) -> float:
# Highly naive: predict a constant improvement in competence.
del num_additional_data # unused
current_competence = self.get_current_competence()
return min(1.0, current_competence + 1e-2)
# Use a highly optimistic initial competence until the second cycle.
return min(
1.0,
current_competence + CFG.skill_competence_initial_prediction_bonus)


class OptimisticSkillCompetenceModel(SkillCompetenceModel):
@@ -100,7 +103,9 @@ def predict_competence(self, num_additional_data: int) -> float:
nonempty_cycle_obs = self._get_nonempty_cycle_observations()
current_competence = self.get_current_competence()
if len(nonempty_cycle_obs) < 2:
return min(1.0, current_competence + 1e-2) # default
return min(
1.0, current_competence +
CFG.skill_competence_initial_prediction_bonus) # default
# Look at changes between individual cycles.
inference_window = 1
recency_size = CFG.skill_competence_model_optimistic_recency_size
@@ -143,7 +148,9 @@ def predict_competence(self, num_additional_data: int) -> float:
# the LegacySkillCompetenceModel.
if self._competence_regressor is None:
current_competence = self.get_current_competence()
return min(1.0, current_competence + 1e-2)
return min(
1.0, current_competence +
CFG.skill_competence_initial_prediction_bonus)
# Use the regressor to predict future competence.
current_num_data = self._get_current_num_data()
current_rv = self._competence_regressor.predict_beta(current_num_data)
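
The three hunks above replace the hard-coded 1e-2 bump with the new CFG.skill_competence_initial_prediction_bonus, so every competence model uses the same highly optimistic prediction until it has enough cycle data. A standalone sketch of that fallback, not the library's class, with the new 0.5 default from settings.py inlined:

def predict_initial_competence(current_competence: float,
                               bonus: float = 0.5) -> float:
    # Optimistic prediction used until at least two cycles of data exist;
    # bonus mirrors CFG.skill_competence_initial_prediction_bonus.
    return min(1.0, current_competence + bonus)


# For example, a skill whose current competence estimate is 0.5 is
# predicted to reach 1.0, which encourages the explorer to keep
# practicing it until real outcome data arrives.
assert predict_initial_competence(0.5) == 1.0
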
12 changes: 7 additions & 5 deletions predicators/explorers/active_sampler_explorer.py
@@ -408,11 +408,13 @@ def _score_ground_op(
num_tries = len(history)
success_rate = sum(history) / num_tries
total_trials = sum(len(h) for h in self._ground_op_hist.values())
# Try less successful operators more often.
# UCB-like bonus.
c = CFG.active_sampler_explore_bonus
bonus = c * np.sqrt(np.log(total_trials) / num_tries)
score = (1.0 - success_rate) + bonus
score = (1.0 - success_rate)
if CFG.active_sampler_explore_use_ucb_bonus:
# Try less successful operators more often.
# UCB-like bonus.
c = CFG.active_sampler_explore_bonus
bonus = c * np.sqrt(np.log(total_trials) / num_tries)
score += bonus
elif CFG.active_sampler_explore_task_strategy == "random":
# Random scores baseline.
score = self._rng.uniform()
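
The hunk above gates the exploration bonus behind the new active_sampler_explore_use_ucb_bonus flag. A self-contained sketch of the resulting scoring rule for the "success_rate" strategy, where the function name and the example numbers are ours, not the repository's:

import numpy as np


def score_ground_op(history, total_trials, use_ucb, explore_bonus=1e-1):
    # history is a list of 1/0 outcomes for this operator; total_trials
    # counts attempts across all operators.
    num_tries = len(history)
    success_rate = sum(history) / num_tries
    score = 1.0 - success_rate  # prefer operators that fail more often
    if use_ucb:
        # UCB-like bonus: rarely tried operators get an extra boost.
        score += explore_bonus * np.sqrt(np.log(total_trials) / num_tries)
    return score


# An operator that failed 3 of its 4 tries, out of 20 trials overall,
# scores 0.75 without the bonus and roughly 0.84 with it.
print(score_ground_op([1, 0, 0, 0], total_trials=20, use_ucb=False))
print(score_ground_op([1, 0, 0, 0], total_trials=20, use_ucb=True))
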
2 changes: 2 additions & 0 deletions predicators/settings.py
@@ -576,6 +576,7 @@ class GlobalSettings:
skill_competence_model_optimistic_window_size = 5
skill_competence_model_optimistic_recency_size = 5
skill_competence_default_alpha_beta = (10.0, 1.0)
skill_competence_initial_prediction_bonus = 0.5

# refinement cost estimation parameters
refinement_estimator = "oracle" # default refinement cost estimator
@@ -608,6 +609,7 @@ class GlobalSettings:
greedy_lookahead_max_num_resamples = 10

# active sampler explorer parameters
active_sampler_explore_use_ucb_bonus = True
active_sampler_explore_bonus = 1e-1
active_sampler_explore_task_strategy = "planning_progress"
active_sampler_explorer_replan_frequency = 100
11 changes: 9 additions & 2 deletions scripts/configs/active_sampler_learning.yaml
@@ -11,11 +11,18 @@ APPROACHES:
FLAGS:
explorer: "active_sampler"
active_sampler_explore_task_strategy: "task_repeat"
success_rate_explore:
success_rate_explore_ucb:
NAME: "active_sampler_learning"
FLAGS:
explorer: "active_sampler"
active_sampler_explore_task_strategy: "success_rate"
active_sampler_explore_use_ucb_bonus: True
success_rate_explore_no_ucb:
NAME: "active_sampler_learning"
FLAGS:
explorer: "active_sampler"
active_sampler_explore_task_strategy: "success_rate"
active_sampler_explore_use_ucb_bonus: False
random_score_explore:
NAME: "active_sampler_learning"
FLAGS:
Expand Down Expand Up @@ -112,4 +119,4 @@ FLAGS:
sesame_grounder: "fd_translator"
active_sampler_learning_n_iter_no_change: 5000
START_SEED: 456
NUM_SEEDS: 10
NUM_SEEDS: 10
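
For context on this file's structure: each key under APPROACHES names one experiment, and its FLAGS entries override the GlobalSettings defaults shown above (the UCB variant could also simply omit the flag, since the default is True). A hedged reading sketch, not the repository's own launcher code:

import yaml  # requires PyYAML

with open("scripts/configs/active_sampler_learning.yaml") as f:
    config = yaml.safe_load(f)

for experiment_id, spec in config["APPROACHES"].items():
    flags = spec.get("FLAGS", {})
    use_ucb = flags.get("active_sampler_explore_use_ucb_bonus", True)
    print(experiment_id, spec.get("NAME"),
          flags.get("active_sampler_explore_task_strategy"), use_ucb)
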
25 changes: 17 additions & 8 deletions scripts/plotting/create_active_sampler_learning_plots.py
@@ -86,8 +86,10 @@ def _derive_per_task_average(metric: str,
lambda v: "kitchen-planning_progress_explore" in v)),
("Task Repeat", "orange", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "kitchen-task_repeat_explore" in v)),
("Fail Focus", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "kitchen-success_rate_explore" in v)),
("Fail Focus Non-UCB", "brown", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "kitchen-success_rate_explore_no_ucb" in v)),
("Fail Focus UCB", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "kitchen-success_rate_explore_ucb" in v)),
("Task-Relevant", "purple", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "kitchen-random_score_explore" in v)),
("Random Skills", "blue", lambda df: df["EXPERIMENT_ID"].apply(
@@ -98,8 +100,11 @@ def _derive_per_task_average(metric: str,
lambda v: "regional_bumpy_cover-planning_progress_explore" in v)),
("Task Repeat", "orange", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "regional_bumpy_cover-task_repeat_explore" in v)),
("Fail Focus", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "regional_bumpy_cover-success_rate_explore" in v)),
("Fail Focus Non-UCB", "brown", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "regional_bumpy_cover-success_rate_explore_no_ucb" in v)
),
("Fail Focus UCB", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "regional_bumpy_cover-success_rate_explore_ucb" in v)),
("Task-Relevant", "purple", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "regional_bumpy_cover-random_score_explore" in v)),
("Random Skills", "blue", lambda df: df["EXPERIMENT_ID"].apply(
@@ -110,8 +115,10 @@ def _derive_per_task_average(metric: str,
lambda v: "grid_row-planning_progress_explore" in v)),
("Task Repeat", "orange", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "grid_row-task_repeat_explore" in v)),
("Fail Focus", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "grid_row-success_rate_explore" in v)),
("Fail Focus Non-UCB", "brown", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "grid_row-success_rate_explore_no_ucb" in v)),
("Fail Focus UCB", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "grid_row-success_rate_explore_ucb" in v)),
("Task-Relevant", "purple", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "grid_row-random_score_explore" in v)),
("Random Skills", "blue", lambda df: df["EXPERIMENT_ID"].apply(
@@ -122,8 +129,10 @@ def _derive_per_task_average(metric: str,
lambda v: "sticky_table-planning_progress_explore" in v)),
("Task Repeat", "orange", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "sticky_table-task_repeat_explore" in v)),
("Fail Focus", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "sticky_table-success_rate_explore" in v)),
("Fail Focus Non-UCB", "brown", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "sticky_table-success_rate_explore_no_ucb" in v)),
("Fail Focus UCB", "red", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "sticky_table-success_rate_explore_ucb" in v)),
("Task-Relevant", "purple", lambda df: df["EXPERIMENT_ID"].apply(
lambda v: "sticky_table-random_score_explore" in v)),
("Random Skills", "blue", lambda df: df["EXPERIMENT_ID"].apply(
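
The entries updated above are (legend label, color, row selector) triples; each selector is a lambda that returns a boolean mask over the results DataFrame's EXPERIMENT_ID column. A toy illustration with a fabricated DataFrame, which also shows that the "_ucb" pattern does not accidentally match "_no_ucb" experiment IDs:

import pandas as pd

df = pd.DataFrame({
    "EXPERIMENT_ID": [
        "grid_row-success_rate_explore_ucb-456",
        "grid_row-success_rate_explore_no_ucb-456",
    ],
    "num_solved": [7, 5],  # fabricated metric for illustration
})

lines = [
    ("Fail Focus UCB", "red", lambda d: d["EXPERIMENT_ID"].apply(
        lambda v: "grid_row-success_rate_explore_ucb" in v)),
    ("Fail Focus Non-UCB", "brown", lambda d: d["EXPERIMENT_ID"].apply(
        lambda v: "grid_row-success_rate_explore_no_ucb" in v)),
]
for label, color, selector in lines:
    print(label, color, df[selector(df)]["num_solved"].tolist())
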
25 changes: 18 additions & 7 deletions tests/test_competence_models.py
@@ -29,20 +29,25 @@ def test_legacy_skill_competence_model():
"""Tests for LegacySkillCompetenceModel()."""
utils.reset_config({
"skill_competence_default_alpha_beta": (1.0, 1.0),
"skill_competence_initial_prediction_bonus": 1e-2,
})
model = create_competence_model("legacy", "test")
assert isinstance(model, LegacySkillCompetenceModel)
assert np.isclose(model.get_current_competence(), 0.5)
assert np.isclose(model.predict_competence(1), 0.5 + 1e-2)
assert np.isclose(model.predict_competence(1),
0.5 + CFG.skill_competence_initial_prediction_bonus)
model.observe(True)
assert model.get_current_competence() > 0.5
assert model.predict_competence(1) > 0.5 + 1e-2
assert model.predict_competence(
1) > 0.5 + CFG.skill_competence_initial_prediction_bonus
model.observe(False)
assert np.isclose(model.get_current_competence(), 0.5)
assert np.isclose(model.predict_competence(1), 0.5 + 1e-2)
assert np.isclose(model.predict_competence(1),
0.5 + CFG.skill_competence_initial_prediction_bonus)
model.advance_cycle()
assert np.isclose(model.get_current_competence(), 0.5)
assert np.isclose(model.predict_competence(1), 0.5 + 1e-2)
assert np.isclose(model.predict_competence(1),
0.5 + CFG.skill_competence_initial_prediction_bonus)
model.observe(True)
assert model.get_current_competence() > 0.5

@@ -53,10 +58,12 @@ def test_latent_variable_skill_competence_model_short():
"skill_competence_model_num_em_iters": 1,
"skill_competence_model_max_train_iters": 10,
"skill_competence_default_alpha_beta": (1.0, 1.0),
"skill_competence_initial_prediction_bonus": 1e-2,
})
model = create_competence_model("latent_variable", "test")
assert np.isclose(model.get_current_competence(), 0.5)
assert np.isclose(model.predict_competence(1), 0.5 + 1e-2)
assert np.isclose(model.predict_competence(1),
0.5 + CFG.skill_competence_initial_prediction_bonus)
model.observe(True)
assert model.get_current_competence() > 0.5
assert model.predict_competence(1) > model.get_current_competence()
@@ -72,12 +79,14 @@ def test_optimistic_skill_competence_model():
"""Tests for OptimisticSkillCompetenceModel()."""
utils.reset_config({
"skill_competence_default_alpha_beta": (1.0, 1.0),
"skill_competence_initial_prediction_bonus": 1e-2,
})
h = CFG.skill_competence_model_lookahead

model = create_competence_model("optimistic", "test")
assert np.isclose(model.get_current_competence(), 0.5)
assert np.isclose(model.predict_competence(h), 0.5 + 1e-2)
assert np.isclose(model.predict_competence(h),
0.5 + CFG.skill_competence_initial_prediction_bonus)

# Test impossible skill.
model = create_competence_model("optimistic", "impossible-skill")
@@ -154,12 +163,14 @@ def test_latent_variable_skill_competence_model_long():
"""Long tests for LatentVariableSkillCompetenceModel()."""
utils.reset_config({
"skill_competence_default_alpha_beta": (1.0, 1.0),
"skill_competence_initial_prediction_bonus": 1e-2,
})
h = CFG.skill_competence_model_lookahead

model = create_competence_model("latent_variable", "test")
assert np.isclose(model.get_current_competence(), 0.5)
assert np.isclose(model.predict_competence(h), 0.5 + 1e-2)
assert np.isclose(model.predict_competence(h),
0.5 + CFG.skill_competence_initial_prediction_bonus)

# Test impossible skill.
model = create_competence_model("latent_variable", "impossible-skill")
