Commit
[ci skip] Integrate new version of the model that jointly models the competing risk incidences (#53)

* removing Incidence

* adding MultiIncidence to hazardous

* reinserting example for multiincidence

* Hardcode uniform time sampler to simplify estimator

* Remove any reference to the dataset-specific oracle scorer

* Rename to SurvivalBoost

* Update the .score method to work in the any-event setting

* Rename test file

* Fix test_gradient_boosting_incidence_parameter_tuning

* Simplify code

* Raise error if hard-zero resampling censors too many events

* adding test

* Refactor survival boost tests

* Fix missing assertion

* fix TODO

* fix linter

* Fix LaTeX

* More informative error message for bad value for hard_zero_fraction

* Remove untested code related to Cox PH based conditional censoring estimation

* Avoid using 'Est' as a short hand for 'Estimator'

* FIX accept both float and array-like for time_horizon in predict_proba (see the usage sketch after this list)

* DOC update predict_proba docstring

* TST add a test that checks the behaviour of predict_proba

* DOC add docstring to test

* TST check that probabilities sum to one for all events

* DOC fix docstring type for time_horizon

* Rename BaseIPCW to KaplanMeierIPCW. Adding some documentation.

* API correct the API of predict_proba

* Use Arturo's formulation

* Silence 'DeprecationWarning: invalid escape sequence' by using raw docstrings with LaTeX expressions

* update docstring

* Simplify KaplanMeierIPCW and improve its docstring

* adding some documentation to SurvivalBoost

* More docstring fixes

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <[email protected]>

* TST update test and code to follow scikit-learn convention

* TST check that we override properly

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <[email protected]>

* DOC start documenting parameter

* DOC documents parameters in SurvivalBoost docstring

* DOC add ipcw estimator docstring

* DOC add attributes, reference, and examples

* DOC document some parameters and attributes of WeightedMultiClassTargetSampler

* API change the shape of the incidence curves

* DOC Use data generator in marginal cumulative incidence example

* FIX slice properly the prediction

* fix install instructions with flit

* MAINT Remove old python version in black config (#57)

* MAINT activate doctest check with pytest by default

* Fix a pandas warning in load_seer

* FIX do not pytest collect outside of the hazardous module folder

* FIX update the example to integrate on the right columns

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <[email protected]>

* Set explicit value for n_events

* Improve variable names and comments in WeightedMultiClassTargetSampler

* Use n_horizons_per_observation=3 by default

* Update hazardous/_survival_boost.py

* Remove reference to "column" to describe the second axis of a 3D array.

* Improve docstrings for SurvivalBoost

* Make sure to use zero-width param ranges in marginal example

* Silence AJ warning on tied events

* Small improvements

* Use a string in the public API of SurvivalBoost to select the censoring estimator

* Keep IPCW estimator private API for now

* replacing any occurrence of _est with _estimator

* fixing math, harmonising different notations.

* Example survival analysis (#70)

* survival analysis example

* change the `predict_cumulative_incidence` function to `predict_survival_function`

* fix the example and adapt it to sphinx

* add torch to doc deps because pycox doesn't require it

* clean example

---------

Co-authored-by: LeoGrin <[email protected]>
Co-authored-by: Vincent Maladiere <[email protected]>

* enhance documentation

---------

Co-authored-by: Julie Alberge <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Judith Abecassis <[email protected]>
Co-authored-by: Jovan Stojanovic <[email protected]>
Co-authored-by: Julie <[email protected]>
Co-authored-by: LeoGrin <[email protected]>
Co-authored-by: Vincent Maladiere <[email protected]>
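The commit messages above describe the renamed SurvivalBoost estimator and its reworked public API (a string-valued ipcw_strategy, time_horizon accepted as a float or array-like in predict_proba, and a reshaped incidence-curve output). The following is a minimal usage sketch pieced together from those messages and the example diff further down; exact signatures and defaults in the released hazardous package may differ.

import numpy as np

from hazardous import SurvivalBoost
from hazardous.data import make_synthetic_competing_weibull

# Synthetic competing-risks data: y has an "event" column (0 = censored) and a
# "duration" column, as used throughout the example below.
X, y = make_synthetic_competing_weibull(
    n_samples=1_000, n_events=3, return_X_y=True, random_state=0
)

survival_boost = SurvivalBoost(
    n_iter=50,
    learning_rate=0.05,
    ipcw_strategy="kaplan-meier",  # string-valued strategy, per the commits above
    random_state=0,
)
survival_boost.fit(X, y)

# Cumulative incidence curves for all events on a shared time grid. Judging from
# the cif_preds[:, event_id] indexing in the example, the output shape should be
# (n_samples, n_events + 1, n_time_steps), with index 0 for the "survival" curve.
time_grid = np.linspace(0, y["duration"].max(), num=100)
cif = survival_boost.predict_cumulative_incidence(X, time_grid)

# predict_proba accepts a single float or an array-like of time horizons.
proba_at_horizon = survival_boost.predict_proba(X, time_horizon=1_000.0)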
8 people committed Aug 20, 2024
1 parent 7376fd6 commit f972960
Showing 67 changed files with 4,000 additions and 1,465 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: e3800b2185b6bf6d86ab083fedf1fa5d
+config: fb4d4fb8638e781e53ce9360531dfb3f
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified .doctrees/api.doctree
Binary file modified .doctrees/auto_examples/index.doctree
Binary file modified .doctrees/auto_examples/sg_execution_times.doctree
Binary file modified .doctrees/environment.pickle
Binary file removed .doctrees/generated/hazardous.IPCWEstimator.doctree
Binary file modified .doctrees/generated/hazardous.data.load_seer.doctree
Binary file modified .doctrees/sg_execution_times.doctree
@@ -4,76 +4,57 @@
 ==================================================
 This example demonstrates how to estimate the marginal cumulative incidence
-using :class:`hazardous.GradientBoostingIncidence` and compares the results to
-the Aalen-Johansen estimator and to the theoretical cumulated incidence curves
-on synthetic data.
+using :class:`hazardous.SurvivalBoost` and compares the results to the
+Aalen-Johansen estimator and to the theoretical cumulated incidence curves on
+synthetic data.
 Here the data is generated by taking the minimum time of samples from three
 competing Weibull distributions with fixed parameters and without any
 conditioning covariate. In this case, the Aalen-Johansen estimator is expected
 to be unbiased, and this is empirically confirmed by this example.
-The :class:`hazardous.GradientBoostingIncidence` estimator on the other hand is
-a predictive estimator that expects at least one conditioning covariate. In
-this example, we use a dummy covariate that is constant for all samples. Here
-we are not interested in the discrimination power of the estimator: there is
-none by construction, since we do not have access to informative covariates.
-Instead we empirically study its marginal calibration, that is, its ability to
+The :class:`hazardous.SurvivalBoost` estimator on the other hand is a
+predictive estimator that expects at least one conditioning covariate. In this
+example, we use a dummy covariate that is constant for all samples. Here we are
+not interested in the discrimination power of the estimator: there is none by
+construction, since we do not have access to informative covariates. Instead we
+empirically study its marginal calibration, that is, its ability to
 approximately recover an unbiased estimate of the marginal cumulative incidence
 function for each competing event.
-This example also highlights that :class:`hazardous.GradientBoostingIncidence`
-estimates noisy cumulative incidence functions, which are not smooth and not
-even monotonically increasing. This is a known limitation of the estimator,
-and attempting to enforce monotonicity at training time typically introduces
-severe over-estimation bias for large time horizons.
+This example also highlights that :class:`hazardous.SurvivalBoost` estimates
+noisy cumulative incidence functions, which are not smooth and not even
+monotonically increasing. This is a known limitation of the estimator, and
+attempting to enforce monotonicity at training time typically introduces severe
+over-estimation bias for large time horizons.
 """

 # %%
 from time import perf_counter
 import numpy as np
-from scipy.stats import weibull_min
-from sklearn.base import clone
-import pandas as pd
 import matplotlib.pyplot as plt

-from hazardous import GradientBoostingIncidence
+from hazardous import SurvivalBoost
+from hazardous.data import make_synthetic_competing_weibull
 from lifelines import AalenJohansenFitter

-rng = np.random.default_rng(0)
 n_samples = 3_000

-# Non-informative covariate because scikit-learn estimators expect at least
-# one feature.
-X_dummy = np.zeros(shape=(n_samples, 1), dtype=np.float32)
-
 base_scale = 1_000.0  # some arbitrary time unit

-distributions = [
-    {"event_id": 1, "scale": 10 * base_scale, "shape": 0.5},
-    {"event_id": 2, "scale": 3 * base_scale, "shape": 1},
-    {"event_id": 3, "scale": 3 * base_scale, "shape": 5},
-]
-event_times = np.concatenate(
-    [
-        weibull_min.rvs(
-            dist["shape"],
-            scale=dist["scale"],
-            size=n_samples,
-            random_state=rng,
-        ).reshape(-1, 1)
-        for dist in distributions
-    ],
-    axis=1,
-)
-first_event_idx = np.argmin(event_times, axis=1)
-
-y_uncensored = pd.DataFrame(
-    dict(
-        event=first_event_idx + 1,  # 0 is reserved as the censoring marker
-        duration=event_times[np.arange(n_samples), first_event_idx],
-    )
-)
+event_dist_shapes = (0.5, 1.0, 5.0)
+event_dist_scales = (10, 3, 3)
+n_events = len(event_dist_shapes)
+
+_, y_uncensored = make_synthetic_competing_weibull(
+    n_samples=n_samples,
+    n_events=n_events,
+    censoring_relative_scale=0,
+    return_X_y=True,
+    shape_ranges=[(shape, shape) for shape in event_dist_shapes],
+    scale_ranges=[(scale, scale) for scale in event_dist_scales],
+    base_scale=base_scale,
+    random_state=0,
+)
 y_uncensored["event"].value_counts().sort_index()
 t_max = y_uncensored["duration"].max()

 # %%
@@ -88,7 +69,7 @@
 # distribution <https://en.wikipedia.org/wiki/Weibull_distribution>`_:


-def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
+def weibull_hazard(t, shape=1.0, scale=1.0):
     # Plug an arbitrary finite hazard value at t==0 because fractional powers
     # of 0 are undefined.
     #
@@ -103,19 +84,25 @@ def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
 #
 # Note that true CIFs are independent of the censoring distribution. We can use
 # them as reference to check that the estimators are unbiased by the censoring.
-# Here are the two estimators of interest:
+#
+# We first define the two estimators of interest. The
+# :class:`hazardous.SurvivalBoost` instance uses the Kaplan-Meier estimator on
+# the negated event labels (1 for censoring, 0 for any event) to estimate
+# internal IPCW weights. This is a valid choice in this context because we do
+# not have access to any informative covariate (either for censoring or for the
+# events of interest).

 calculate_variance = n_samples <= 5_000
 aj = AalenJohansenFitter(calculate_variance=calculate_variance, seed=0)

-gb_incidence = GradientBoostingIncidence(
+survival_boost = SurvivalBoost(
     learning_rate=0.03,
     n_iter=100,
     max_leaf_nodes=5,
     hard_zero_fraction=0.1,
     min_samples_leaf=50,
-    loss="ibs",
     show_progressbar=False,
+    ipcw_strategy="kaplan-meier",
+    random_state=0,
 )

@@ -128,21 +115,25 @@ def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
 # theoretical CIFs:


-def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=None):
-    _, axes = plt.subplots(figsize=(12, 4), ncols=len(distributions), sharey=True)
+def plot_cumulative_incidence_functions(y, survival_boost=None, aj=None):
+    """Plot cause-specific cumulative incidence per event using a dummy covariate"""
+    _, axes = plt.subplots(figsize=(12, 4), ncols=n_events, sharey=True)

     # Compute the estimate of the CIFs on a coarse grid.
     coarse_timegrid = np.linspace(0, t_max, num=100)

     # Compute the theoretical CIFs by integrating the hazard functions on a
     # fine-grained time grid. Note that integration errors can accumulate quite
-    # quickly if the time grid is resolution too coarse, especially for the
+    # quickly if the time grid's resolution is too coarse, especially for the
     # Weibull distribution with shape < 1.
     tic = perf_counter()
     fine_time_grid = np.linspace(0, t_max, num=10_000_000)
     dt = np.diff(fine_time_grid)[0]
     all_hazards = np.stack(
-        [weibull_hazard(fine_time_grid, **dist) for dist in distributions],
+        [
+            weibull_hazard(fine_time_grid, shape, scale * base_scale)
+            for shape, scale in zip(event_dist_shapes, event_dist_scales)
+        ],
         axis=0,
     )
     any_event_hazards = all_hazards.sum(axis=0)
@@ -157,6 +148,20 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
         "Cause-specific cumulative incidence functions"
         f" ({censoring_fraction:.1%} censoring)"
     )
+    # Non-informative covariate because scikit-learn estimators expect at least
+    # one feature.
+    X_dummy = np.zeros(shape=(n_samples, 1), dtype=np.float32)
+    if survival_boost is not None:
+        tic = perf_counter()
+        survival_boost.fit(X_dummy, y)
+        duration = perf_counter() - tic
+        print(f"SurvivalBoost fit: {duration:.3f} s")
+        tic = perf_counter()
+        cif_preds = survival_boost.predict_cumulative_incidence(
+            X_dummy, coarse_timegrid
+        )
+        duration = perf_counter() - tic
+        print(f"SurvivalBoost prediction: {duration:.3f} s")

     for event_id, (ax, hazards_i) in enumerate(zip(axes, all_hazards), 1):
         theoretical_cif = (hazards_i * any_event_survival).cumsum(axis=-1) * dt
@@ -172,28 +177,22 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
             label="Theoretical incidence",
         ),

-        if gb_incidence is not None:
-            tic = perf_counter()
-            gb_incidence.set_params(event_of_interest=event_id)
-            gb_incidence.fit(X_dummy, y)
-            duration = perf_counter() - tic
-            print(f"GB Incidence for event {event_id} fit in {duration:.3f} s")
-            tic = perf_counter()
-            cif_pred = gb_incidence.predict_cumulative_incidence(
-                X_dummy[0:1], coarse_timegrid
-            )[0]
-            duration = perf_counter() - tic
-            print(f"GB Incidence for event {event_id} prediction in {duration:.3f} s")
+        if survival_boost is not None:
+            cif_pred = cif_preds[:, event_id][0]
             ax.plot(
                 coarse_timegrid,
                 cif_pred,
-                label="GradientBoostingIncidence",
+                label="SurvivalBoost",
             )
             ax.set(title=f"Event {event_id}")

         if aj is not None:
+            # Randomly break tied durations, to silence a warning raised by the
+            # Aalen-Johansen estimator.
+            rng = np.random.default_rng(0)
+            jitter = rng.normal(scale=1e-3, size=y.shape[0])
             tic = perf_counter()
-            aj.fit(y["duration"], y["event"], event_of_interest=event_id)
+            aj.fit(y["duration"] + jitter, y["event"], event_of_interest=event_id)
             duration = perf_counter() - tic
             print(f"Aalen-Johansen for event {event_id} fit in {duration:.3f} s")
             aj.plot(label="Aalen-Johansen", ax=ax)
@@ -205,7 +204,7 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=


 plot_cumulative_incidence_functions(
-    distributions, gb_incidence=gb_incidence, aj=aj, y=y_uncensored
+    survival_boost=survival_boost, aj=aj, y=y_uncensored
 )

 # %%
@@ -216,31 +215,23 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
 # Add some independent censoring with some arbitrary parameters to control the
 # amount of censoring: lowering the expected value bound increases the amount
 # of censoring.
-censoring_times = weibull_min.rvs(
-    1.0,
-    scale=1.5 * base_scale,
-    size=n_samples,
-    random_state=rng,
-)
-y_censored = pd.DataFrame(
-    dict(
-        event=np.where(
-            censoring_times < y_uncensored["duration"], 0, y_uncensored["event"]
-        ),
-        duration=np.minimum(censoring_times, y_uncensored["duration"]),
-    )
-)
+_, y_censored = make_synthetic_competing_weibull(
+    n_samples=n_samples,
+    n_events=n_events,
+    censoring_relative_scale=1.5,
+    return_X_y=True,
+    shape_ranges=[(shape, shape) for shape in event_dist_shapes],
+    scale_ranges=[(scale, scale) for scale in event_dist_scales],
+    base_scale=base_scale,
+    random_state=0,
+)
 y_censored["event"].value_counts().sort_index()

 # %%
-plot_cumulative_incidence_functions(
-    distributions, gb_incidence=gb_incidence, aj=aj, y=y_censored
-)
+plot_cumulative_incidence_functions(survival_boost=survival_boost, aj=aj, y=y_censored)
 # %%
 #
 # Note that the Aalen-Johansen estimator is unbiased and empirically recovers
 # the theoretical curves both with and without censoring. The
-# GradientBoostingIncidence estimator also appears unbiased by censoring, but
+# SurvivalBoost estimator also appears unbiased by censoring, but
 # the predicted curves are not smooth and not even monotonically increasing. By
 # adjusting the hyper-parameters, notably the learning rate, the number of
 # boosting iterations and leaf nodes, it is possible to somewhat control the
@@ -251,22 +242,3 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
 # Alternatively, we could try to enable a monotonicity constraint at training
 # time, however, in practice this often causes a sever over-estimation bias for
 # the large time horizons:
-
-# %%
-#
-# Finally let's try again to fit the GB Incidence models using a monotonicity
-# constraint:
-
-monotonic_gb_incidence = clone(gb_incidence).set_params(
-    monotonic_incidence="at_training_time"
-)
-plot_cumulative_incidence_functions(
-    distributions, gb_incidence=monotonic_gb_incidence, y=y_censored
-)
-# %%
-#
-# The resulting incidence curves are indeed monotonic. However, for smaller
-# training set sizes, the resulting models can be significantly biased, in
-# particular large time horizons, where the CIFs are getting flatter. This
-# effect diminishes with larger training set sizes (lower epistemic
-# uncertainty).