DOC Add note on cv splits in CalibratedClassifierCV missing classes (
lucyleeow authored Jul 22, 2024
1 parent b4e1192 commit 24b581b
Showing 1 changed file with 17 additions and 3 deletions.
20 changes: 17 additions & 3 deletions doc/modules/calibration.rst
@@ -149,9 +149,14 @@ The :class:`CalibratedClassifierCV` class is used to calibrate a classifier.
unbiased data is always used to fit the calibrator. The data is split into k
`(train_set, test_set)` couples (as determined by `cv`). When `ensemble=True`
(default), the following procedure is repeated independently for each
-cross-validation split: a clone of `base_estimator` is first trained on the
-train subset. Then its predictions on the test subset are used to fit a
-calibrator (either a sigmoid or isotonic regressor). This results in an
+cross-validation split:
+
+1. a clone of `base_estimator` is trained on the train subset
+2. the trained `base_estimator` makes predictions on the test subset
+3. the predictions are used to fit a calibrator (either a sigmoid or isotonic
+regressor) (when the data is multiclass, a calibrator is fit for every class)
+
+This results in an
ensemble of k `(classifier, calibrator)` couples where each calibrator maps
the output of its corresponding classifier into [0, 1]. Each couple is exposed
in the `calibrated_classifiers_` attribute, where each entry is a calibrated
@@ -162,6 +167,15 @@ predicted probabilities of the `k` estimators in the `calibrated_classifiers_`
list. The output of :term:`predict` is the class that has the highest
probability.

+It is important to choose `cv` carefully when using `ensemble=True`.
+All classes should be present in both train and test subsets for every split.
+When a class is absent in the train subset, the predicted probability for that
+class will default to 0 for the `(classifier, calibrator)` couple of that split.
+This skews the :term:`predict_proba` as it averages across all couples.
+When a class is absent in the test subset, the calibrator for that class
+(within the `(classifier, calibrator)` couple of that split) is
+fit on data with no positive class. This results in ineffective calibration.
+
When `ensemble=False`, cross-validation is used to obtain 'unbiased'
predictions for all the data, via
:func:`~sklearn.model_selection.cross_val_predict`.
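
The note added in this commit can be illustrated with a small check. The following is a minimal sketch, not part of the commit: the helper `splits_missing_a_class` and the toy data are hypothetical and exist only for illustration. It assumes a label array sorted by class, which is the situation where an unshuffled `KFold` produces splits lacking a class, and it uses `StratifiedKFold` as one possible way to keep every class in both subsets of each split.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold


def splits_missing_a_class(cv, X, y):
    """Hypothetical helper: indices of splits whose train or test subset
    does not contain every class present in y."""
    classes = np.unique(y)
    bad = []
    for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        train_ok = np.array_equal(np.unique(y[train_idx]), classes)
        test_ok = np.array_equal(np.unique(y[test_idx]), classes)
        if not (train_ok and test_ok):
            bad.append(i)
    return bad


# Toy 3-class data (assumption for illustration), with labels sorted by class.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 4)) + y[:, None]

# Unshuffled KFold on sorted labels: each test subset holds a single class,
# so the matching train subset lacks that class.
bad_plain = splits_missing_a_class(KFold(n_splits=3), X, y)

# StratifiedKFold preserves class proportions within each fold.
stratified_cv = StratifiedKFold(n_splits=3)
bad_stratified = splits_missing_a_class(stratified_cv, X, y)

# Fit the ensemble of (classifier, calibrator) couples on the safe splitter.
calibrated = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), cv=stratified_cv, ensemble=True
)
calibrated.fit(X, y)
n_couples = len(calibrated.calibrated_classifiers_)
proba = calibrated.predict_proba(X)  # averaged over the fitted couples
```

Running a check like this before fitting makes it easy to spot a `cv` choice that would leave some couple without a class in its train or test subset.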