Commit
[ci skip] Integrate new version of the model that jointly models the competing risk incidences (#53)

* removing Incidence

* adding MultiIncidence to hazardous

* reinserting example for multiincidence

* Hardcode uniform time sampler to simplify estimator

* Remove any reference to the dataset-specific oracle scorer

* Rename to SurvivalBoost

* Update the .score method to work in the any-event setting

* Rename test file

* Fix test_gradient_boosting_incidence_parameter_tuning

* Simplify code

* Raise error if hard-zero resampling censors too many events

* adding test

* Refactor survival boost tests

* Fix missing assertion

* fix TODO

* fix linter

* Fix LaTeX

* More informative error message for bad value for hard_zero_fraction

* Remove untested code related to Cox PH based conditional censoring estimation

* Avoid using 'Est' as a short hand for 'Estimator'

* FIX accept both float and array-like for time_horizon in predict_proba (see the usage sketch after this list)

* DOC update predict_proba docstring

* TST add a test that checks the behaviour of predict_proba

* DOC add docstring to test

* TST check that probabilities sum to one for all events

* DOC fix docstring type for time_horizon

* Rename BaseIPCW to KaplanMeierIPCW. Adding some documentation.

* API correct the API of predict_proba

* Use Arturo's formulation

* Silence 'DeprecationWarning: invalid escape sequence' by using raw docstrings with LaTeX expressions

* update docstring

* Simplify KaplanMeierIPCW and improve its docstring

* adding some documentation to SurvivalBoost

* More docstring fixes

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <[email protected]>

* TST update test and code to follow scikit-learn convention

* TST check that we override properly

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <[email protected]>

* DOC start documenting parameter

* DOC documents parameters in SurvivalBoost docstring

* DOC add ipcw estimator docstring

* DOC add attributes, reference, and examples

* DOC document some parameters and attributes of WeightedMultiClassTargetSampler

* API change the shape of the incidence curves

* DOC Use data generator in marginal cumulative incidence example

* FIX slice properly the prediction

* fix install instructions with flit

* MAINT Remove old python version in black config (#57)

* MAINT activate doctest check with pytest by default

* Fix a pandas warning in load_seer

* FIX do not pytest collect outside of the hazardous module folder

* FIX update the example to integrate on the right columns

* Apply suggestions from code review

Co-authored-by: Olivier Grisel <[email protected]>

* Set explicit value for n_events

* Improve variable names and comments in WeightedMultiClassTargetSampler

* Use n_horizons_per_observation=3 by default

* Update hazardous/_survival_boost.py

* Remove reference to "column" to describe the second axis of a 3D array.

* Improve docstrings for SurvivalBoost

* Make sure to use zero-width param ranges in marginal example

* Silence AJ warning on tied events

* Small improvements

* Use a string in the public API of SurvivalBoost to select the censoring estimator

* Keep IPCW estimator private API for now

* replacing any occurrence of _est with _estimator

* fixing math, harmonising different notations.

* Example survival analysis (#70)

* survival analysis example

* change the `predict_cumulative_incidence` function to `predict_survival_function`

* fix the example and adapt it to sphinx

* add torch to doc deps because pycox doesn't require it

* clean example

---------

Co-authored-by: LeoGrin <[email protected]>
Co-authored-by: Vincent Maladiere <[email protected]>

* enhance documentation

---------

Co-authored-by: Julie Alberge <[email protected]>
Co-authored-by: Guillaume Lemaitre <[email protected]>
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Judith Abecassis <[email protected]>
Co-authored-by: Jovan Stojanovic <[email protected]>
Co-authored-by: Julie <[email protected]>
Co-authored-by: LeoGrin <[email protected]>
Co-authored-by: Vincent Maladiere <[email protected]>
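The commit messages above describe the renamed SurvivalBoost estimator and its reworked public API (a string-valued ipcw_strategy, time_horizon accepted as a float or array-like in predict_proba, and a reshaped incidence-curve output). The following is a minimal usage sketch pieced together from those messages and the example diff further down; exact signatures and defaults in the released hazardous package may differ.

import numpy as np

from hazardous import SurvivalBoost
from hazardous.data import make_synthetic_competing_weibull

# Synthetic competing-risks data: y has an "event" column (0 = censored) and a
# "duration" column, as used throughout the example below.
X, y = make_synthetic_competing_weibull(
    n_samples=1_000, n_events=3, return_X_y=True, random_state=0
)

survival_boost = SurvivalBoost(
    n_iter=50,
    learning_rate=0.05,
    ipcw_strategy="kaplan-meier",  # string-valued strategy, per the commits above
    random_state=0,
)
survival_boost.fit(X, y)

# Cumulative incidence curves for all events on a shared time grid. Judging from
# the cif_preds[:, event_id] indexing in the example, the output shape should be
# (n_samples, n_events + 1, n_time_steps), with index 0 for the "survival" curve.
time_grid = np.linspace(0, y["duration"].max(), num=100)
cif = survival_boost.predict_cumulative_incidence(X, time_grid)

# predict_proba accepts a single float or an array-like of time horizons.
proba_at_horizon = survival_boost.predict_proba(X, time_horizon=1_000.0)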
8 people committed Aug 20, 2024
1 parent 7376fd6 commit f972960
Showing 67 changed files with 4,000 additions and 1,465 deletions.
2 changes: 1 addition & 1 deletion .buildinfo
@@ -1,4 +1,4 @@
 # Sphinx build info version 1
 # This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
-config: e3800b2185b6bf6d86ab083fedf1fa5d
+config: fb4d4fb8638e781e53ce9360531dfb3f
 tags: 645f666f9bcd5a90fca523b33c5a78b7
Binary file modified .doctrees/api.doctree
Binary file modified .doctrees/auto_examples/index.doctree
Binary file modified .doctrees/auto_examples/sg_execution_times.doctree
Binary file modified .doctrees/environment.pickle
Binary file removed .doctrees/generated/hazardous.IPCWEstimator.doctree
Binary file modified .doctrees/generated/hazardous.data.load_seer.doctree
Binary file modified .doctrees/sg_execution_times.doctree
@@ -4,76 +4,57 @@
 ==================================================
 This example demonstrates how to estimate the marginal cumulative incidence
-using :class:`hazardous.GradientBoostingIncidence` and compares the results to
-the Aalen-Johansen estimator and to the theoretical cumulated incidence curves
-on synthetic data.
+using :class:`hazardous.SurvivalBoost` and compares the results to the
+Aalen-Johansen estimator and to the theoretical cumulated incidence curves on
+synthetic data.
 Here the data is generated by taking the minimum time of samples from three
 competing Weibull distributions with fixed parameters and without any
 conditioning covariate. In this case, the Aalen-Johansen estimator is expected
 to be unbiased, and this is empirically confirmed by this example.
-The :class:`hazardous.GradientBoostingIncidence` estimator on the other hand is
-a predictive estimator that expects at least one conditioning covariate. In
-this example, we use a dummy covariate that is constant for all samples. Here
-we are not interested in the discrimination power of the estimator: there is
-none by construction, since we do not have access to informative covariates.
-Instead we empirically study its marginal calibration, that is, its ability to
+The :class:`hazardous.SurvivalBoost` estimator on the other hand is a
+predictive estimator that expects at least one conditioning covariate. In this
+example, we use a dummy covariate that is constant for all samples. Here we are
+not interested in the discrimination power of the estimator: there is none by
+construction, since we do not have access to informative covariates. Instead we
+empirically study its marginal calibration, that is, its ability to
 approximately recover an unbiased estimate of the marginal cumulative incidence
 function for each competing event.
-This example also highlights that :class:`hazardous.GradientBoostingIncidence`
-estimates noisy cumulative incidence functions, which are not smooth and not
-even monotonically increasing. This is a known limitation of the estimator,
-and attempting to enforce monotonicity at training time typically introduces
-severe over-estimation bias for large time horizons.
+This example also highlights that :class:`hazardous.SurvivalBoost` estimates
+noisy cumulative incidence functions, which are not smooth and not even
+monotonically increasing. This is a known limitation of the estimator, and
+attempting to enforce monotonicity at training time typically introduces severe
+over-estimation bias for large time horizons.
 """

 # %%
 from time import perf_counter
 import numpy as np
-from scipy.stats import weibull_min
-from sklearn.base import clone
-import pandas as pd
 import matplotlib.pyplot as plt

-from hazardous import GradientBoostingIncidence
+from hazardous import SurvivalBoost
+from hazardous.data import make_synthetic_competing_weibull
 from lifelines import AalenJohansenFitter

-rng = np.random.default_rng(0)
 n_samples = 3_000

-# Non-informative covariate because scikit-learn estimators expect at least
-# one feature.
-X_dummy = np.zeros(shape=(n_samples, 1), dtype=np.float32)
-
 base_scale = 1_000.0  # some arbitrary time unit

-distributions = [
-    {"event_id": 1, "scale": 10 * base_scale, "shape": 0.5},
-    {"event_id": 2, "scale": 3 * base_scale, "shape": 1},
-    {"event_id": 3, "scale": 3 * base_scale, "shape": 5},
-]
-event_times = np.concatenate(
-    [
-        weibull_min.rvs(
-            dist["shape"],
-            scale=dist["scale"],
-            size=n_samples,
-            random_state=rng,
-        ).reshape(-1, 1)
-        for dist in distributions
-    ],
-    axis=1,
-)
-first_event_idx = np.argmin(event_times, axis=1)
-
-y_uncensored = pd.DataFrame(
-    dict(
-        event=first_event_idx + 1,  # 0 is reserved as the censoring marker
-        duration=event_times[np.arange(n_samples), first_event_idx],
-    )
-)
+event_dist_shapes = (0.5, 1.0, 5.0)
+event_dist_scales = (10, 3, 3)
+n_events = len(event_dist_shapes)
+
+_, y_uncensored = make_synthetic_competing_weibull(
+    n_samples=n_samples,
+    n_events=n_events,
+    censoring_relative_scale=0,
+    return_X_y=True,
+    shape_ranges=[(shape, shape) for shape in event_dist_shapes],
+    scale_ranges=[(scale, scale) for scale in event_dist_scales],
+    base_scale=base_scale,
+    random_state=0,
+)
 y_uncensored["event"].value_counts().sort_index()
 t_max = y_uncensored["duration"].max()

 # %%
@@ -88,7 +69,7 @@
 # distribution <https://en.wikipedia.org/wiki/Weibull_distribution>`_:


-def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
+def weibull_hazard(t, shape=1.0, scale=1.0):
     # Plug an arbitrary finite hazard value at t==0 because fractional powers
     # of 0 are undefined.
     #
@@ -103,19 +84,25 @@ def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
 #
 # Note that true CIFs are independent of the censoring distribution. We can use
 # them as reference to check that the estimators are unbiased by the censoring.
-# Here are the two estimators of interest:
+#
+# We first define the two estimators of interest. The
+# :class:`hazardous.SurvivalBoost` instance uses the Kaplan-Meier estimator on
+# the negated event labels (1 for censoring, 0 for any event) to estimate
+# internal IPCW weights. This is a valid choice in this context because we do
+# not have access to any informative covariate (either for censoring or for the
+# events of interest).

 calculate_variance = n_samples <= 5_000
 aj = AalenJohansenFitter(calculate_variance=calculate_variance, seed=0)

-gb_incidence = GradientBoostingIncidence(
+survival_boost = SurvivalBoost(
     learning_rate=0.03,
     n_iter=100,
     max_leaf_nodes=5,
     hard_zero_fraction=0.1,
     min_samples_leaf=50,
-    loss="ibs",
     show_progressbar=False,
+    ipcw_strategy="kaplan-meier",
+    random_state=0,
 )

@@ -128,21 +115,25 @@ def weibull_hazard(t, shape=1.0, scale=1.0, **ignored_kwargs):
 # theoretical CIFs:


-def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=None):
-    _, axes = plt.subplots(figsize=(12, 4), ncols=len(distributions), sharey=True)
+def plot_cumulative_incidence_functions(y, survival_boost=None, aj=None):
+    """Plot cause-specific cumulative incidence per event using a dummy covariate"""
+    _, axes = plt.subplots(figsize=(12, 4), ncols=n_events, sharey=True)

     # Compute the estimate of the CIFs on a coarse grid.
     coarse_timegrid = np.linspace(0, t_max, num=100)

     # Compute the theoretical CIFs by integrating the hazard functions on a
     # fine-grained time grid. Note that integration errors can accumulate quite
-    # quickly if the time grid is resolution too coarse, especially for the
+    # quickly if the time grid's resolution is too coarse, especially for the
     # Weibull distribution with shape < 1.
     tic = perf_counter()
     fine_time_grid = np.linspace(0, t_max, num=10_000_000)
     dt = np.diff(fine_time_grid)[0]
     all_hazards = np.stack(
-        [weibull_hazard(fine_time_grid, **dist) for dist in distributions],
+        [
+            weibull_hazard(fine_time_grid, shape, scale * base_scale)
+            for shape, scale in zip(event_dist_shapes, event_dist_scales)
+        ],
         axis=0,
     )
     any_event_hazards = all_hazards.sum(axis=0)
@@ -157,6 +148,20 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
         "Cause-specific cumulative incidence functions"
         f" ({censoring_fraction:.1%} censoring)"
     )
+    # Non-informative covariate because scikit-learn estimators expect at least
+    # one feature.
+    X_dummy = np.zeros(shape=(n_samples, 1), dtype=np.float32)
+    if survival_boost is not None:
+        tic = perf_counter()
+        survival_boost.fit(X_dummy, y)
+        duration = perf_counter() - tic
+        print(f"SurvivalBoost fit: {duration:.3f} s")
+        tic = perf_counter()
+        cif_preds = survival_boost.predict_cumulative_incidence(
+            X_dummy, coarse_timegrid
+        )
+        duration = perf_counter() - tic
+        print(f"SurvivalBoost prediction: {duration:.3f} s")

     for event_id, (ax, hazards_i) in enumerate(zip(axes, all_hazards), 1):
         theoretical_cif = (hazards_i * any_event_survival).cumsum(axis=-1) * dt
@@ -172,28 +177,22 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
             label="Theoretical incidence",
         ),

-        if gb_incidence is not None:
-            tic = perf_counter()
-            gb_incidence.set_params(event_of_interest=event_id)
-            gb_incidence.fit(X_dummy, y)
-            duration = perf_counter() - tic
-            print(f"GB Incidence for event {event_id} fit in {duration:.3f} s")
-            tic = perf_counter()
-            cif_pred = gb_incidence.predict_cumulative_incidence(
-                X_dummy[0:1], coarse_timegrid
-            )[0]
-            duration = perf_counter() - tic
-            print(f"GB Incidence for event {event_id} prediction in {duration:.3f} s")
+        if survival_boost is not None:
+            cif_pred = cif_preds[:, event_id][0]
             ax.plot(
                 coarse_timegrid,
                 cif_pred,
-                label="GradientBoostingIncidence",
+                label="SurvivalBoost",
             )
             ax.set(title=f"Event {event_id}")

         if aj is not None:
+            # Randomly break tied durations, to silence a warning raised by the
+            # Aalen-Johansen estimator.
+            rng = np.random.default_rng(0)
+            jitter = rng.normal(scale=1e-3, size=y.shape[0])
             tic = perf_counter()
-            aj.fit(y["duration"], y["event"], event_of_interest=event_id)
+            aj.fit(y["duration"] + jitter, y["event"], event_of_interest=event_id)
             duration = perf_counter() - tic
             print(f"Aalen-Johansen for event {event_id} fit in {duration:.3f} s")
             aj.plot(label="Aalen-Johansen", ax=ax)
@@ -205,7 +204,7 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=


 plot_cumulative_incidence_functions(
-    distributions, gb_incidence=gb_incidence, aj=aj, y=y_uncensored
+    survival_boost=survival_boost, aj=aj, y=y_uncensored
 )

 # %%
@@ -216,31 +215,23 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
 # Add some independent censoring with some arbitrary parameters to control the
 # amount of censoring: lowering the expected value bound increases the amount
 # of censoring.
-censoring_times = weibull_min.rvs(
-    1.0,
-    scale=1.5 * base_scale,
-    size=n_samples,
-    random_state=rng,
-)
-y_censored = pd.DataFrame(
-    dict(
-        event=np.where(
-            censoring_times < y_uncensored["duration"], 0, y_uncensored["event"]
-        ),
-        duration=np.minimum(censoring_times, y_uncensored["duration"]),
-    )
-)
+_, y_censored = make_synthetic_competing_weibull(
+    n_samples=n_samples,
+    n_events=n_events,
+    censoring_relative_scale=1.5,
+    return_X_y=True,
+    shape_ranges=[(shape, shape) for shape in event_dist_shapes],
+    scale_ranges=[(scale, scale) for scale in event_dist_scales],
+    base_scale=base_scale,
+    random_state=0,
+)
 y_censored["event"].value_counts().sort_index()

 # %%
-plot_cumulative_incidence_functions(
-    distributions, gb_incidence=gb_incidence, aj=aj, y=y_censored
-)
+plot_cumulative_incidence_functions(survival_boost=survival_boost, aj=aj, y=y_censored)
 # %%
 #
 # Note that the Aalen-Johansen estimator is unbiased and empirically recovers
 # the theoretical curves both with and without censoring. The
-# GradientBoostingIncidence estimator also appears unbiased by censoring, but
+# SurvivalBoost estimator also appears unbiased by censoring, but
 # the predicted curves are not smooth and not even monotonically increasing. By
 # adjusting the hyper-parameters, notably the learning rate, the number of
 # boosting iterations and leaf nodes, it is possible to somewhat control the
@@ -251,22 +242,3 @@ def plot_cumulative_incidence_functions(distributions, y, gb_incidence=None, aj=
 # Alternatively, we could try to enable a monotonicity constraint at training
 # time, however, in practice this often causes a sever over-estimation bias for
 # the large time horizons:
-
-# %%
-#
-# Finally let's try again to fit the GB Incidence models using a monotonicity
-# constraint:
-
-monotonic_gb_incidence = clone(gb_incidence).set_params(
-    monotonic_incidence="at_training_time"
-)
-plot_cumulative_incidence_functions(
-    distributions, gb_incidence=monotonic_gb_incidence, y=y_censored
-)
-# %%
-#
-# The resulting incidence curves are indeed monotonic. However, for smaller
-# training set sizes, the resulting models can be significantly biased, in
-# particular large time horizons, where the CIFs are getting flatter. This
-# effect diminishes with larger training set sizes (lower epistemic
-# uncertainty).