I am using the Weibull Cure model implementation proposed in the docs to predict the conversion of subjects (leads) for a specific business process.
A cure model is necessary here because most conversion events will never happen. In this context we are typically interested in the CDF, since it can be interpreted as the probability that a subject has converted by time $t$.
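For concreteness, in the standard mixture cure formulation (which, as far as I understand, is what the docs example implements) a fraction of subjects will never convert, so

$$S(t \mid x) = p(x) + \bigl(1 - p(x)\bigr)\, S_u(t \mid x), \qquad F(t \mid x) = \bigl(1 - p(x)\bigr)\bigl(1 - S_u(t \mid x)\bigr),$$

where $p(x)$ is the never-converting ("cured") fraction and $S_u$ is a Weibull survival function for the susceptible subjects. The CDF therefore plateaus at $1 - p(x)$, the probability of ever converting.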
The way I plan on using the model is by making predictions on subjects that were created up to `max_age` ago and have not converted yet.
For survival regression I routinely see people split the data on the subject IDs (sometimes with stratification on `event_col`), and I am trying to understand why this is the right thing to do, especially if I want to use the model to make predictions over a predictive horizon with `.predict_cumulative_density()` (and the `conditional_after` argument). The performance of the model over that predictive horizon will be important to understand.
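For reference, the prediction call I have in mind looks roughly like this (a sketch; `model`, `open_leads`, and `age_days` are placeholder names, assuming the fitted lifelines regression model exposes `predict_cumulative_density` with `conditional_after`):

```python
import numpy as np

# Leads created up to `max_age` ago that have not converted yet;
# `age_days` is how long each lead has already survived without converting.
horizon = np.arange(1, 91)  # score the next 90 days

# P(converted by t | still unconverted after age_days), one column per lead:
cdf = model.predict_cumulative_density(
    open_leads,
    times=horizon,
    conditional_after=open_leads["age_days"].values,
)
```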
In practice, one would train the model on a time window ranging from some point in the past up to the time of training, which becomes the censoring time, typically ’now’.
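Concretely, building the training set from raw timestamps would look something like this (a sketch; `leads`, `created_at`, and `converted_at` are hypothetical names):

```python
import pandas as pd

now = pd.Timestamp.now()  # censoring time = time of training

# One row per subject; `converted_at` is NaT for leads that have not converted.
observed = leads["converted_at"].notna() & (leads["converted_at"] <= now)
end_time = leads["converted_at"].where(observed, now)  # conversion or censoring

train = leads.assign(
    event_col=observed,
    duration=(end_time - leads["created_at"]).dt.days,
)
```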
Setting aside hyper-parameter tuning and CV for the sake of this example, my intuition is to evaluate the model using a data split similar to scikit-learn's TimeSeriesSplit, so I can understand how the model's performance changes over the predictive horizon.
As someone discussed here, I am tempted to follow this approach (sketched in code after the two lists below):
For training:
- set a date in the past as the censoring time, so that the duration between that censoring time and now is (at least roughly) equal to the predictive horizon we will use;
- train the model on a time window extending up to that censoring time. Events observed after that time would be ignored, i.e. marked as `False` in the `event_col` column, and durations would be calculated up to that censoring time in the past.
For evaluation:
- evaluate on all leads created up to the censoring time and going back up to `max_age` before it;
- don’t ignore the events observed after the (artificial) censoring time, effectively using the actual training time (’now’) as the censoring time and computing the durations up to that time.
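Putting the pieces together, the split would look roughly like this (again a sketch with hypothetical column names; fitting and scoring are left out):

```python
import pandas as pd

def censor_at(leads, censor_time):
    """Build durations/events as if data collection had stopped at censor_time."""
    df = leads[leads["created_at"] <= censor_time].copy()
    observed = df["converted_at"].notna() & (df["converted_at"] <= censor_time)
    end_time = df["converted_at"].where(observed, censor_time)
    df["event_col"] = observed
    # Durations in days; parametric models typically need strictly positive values.
    df["duration"] = (end_time - df["created_at"]).dt.days.clip(lower=1)
    return df

now = pd.Timestamp.now()
horizon = pd.Timedelta(days=90)    # predictive horizon we want to validate
max_age = pd.Timedelta(days=365)   # how far back we score open leads
censor_time = now - horizon        # artificial censoring date in the past

# Training: events after censor_time are hidden (event_col=False, truncated durations).
train = censor_at(leads, censor_time)

# Evaluation: leads created within max_age of the artificial censoring time,
# but with labels computed from everything observed up to 'now'.
test = censor_at(leads, now)
test = test[test["created_at"].between(censor_time - max_age, censor_time)]
```

The model fit on `train` can then be scored against the fuller labels in `test` at each step of the horizon.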
On one hand I think this approach introduces some bias, as the training and test sets will not be sampled from exactly the same distribution, but on the other hand it seems closer to a typical use case.
Curious to know if some implementations of this exist and what people are doing.