Releases: alexzwanenburg/familiar
Version 1.5.0 (Whole Whale)
Major changes
- The source code now uses the `tidyverse` code style.
- Power transformation is now handled by the `power.transform` package. This package replaces the internal routines that were previously used.
Future deprecation
- Functionality reliant on the `mboost`, `VGAM` or `qvalue` packages will be deprecated in version 2.0.0.
- The `count` outcome type will be deprecated by merging it into the `continuous` outcome type, starting from version 2.0.0.
Bug fixes
- Prevented errors due to parsing columns called `else`, `for`, `function`, `if`, `in`, or `while`.
- The presence of features with integer values no longer leads to rare errors during evaluation.
- The main panel of composite plots (e.g. calibration plots, Kaplan-Meier curves) is no longer of fixed width when a title or subtitle is present.
- Thresholds for clustering with correlation-based metrics are now computed correctly.
Version 1.4.8 (Valorous Viper)
Bug fixes
- Adapted tests to work when suggested packages are missing (addresses the CRAN noSuggests check).
- Fixed an issue that prevented hyperparameter optimisation of `xgboost` models for survival tasks.
Version 1.4.7 (Uncertain Unicorn)
Bug fixes
- Computing distance matrices no longer produces an error due to applying `rownames` to a `data.table`. The exact cause is unclear, but the error was introduced by `data.table` version 1.15.0, `R` version 4.4.0, or both.
- Several fixes related to changes introduced in `ggplot2` version 3.5 were made:
  - Plot margins are now correctly set in the default familiar plotting theme.
  - Plot elements of composite plots are now correctly set.
- Fixed an incorrect `data.table` merge when computing survival predictions from random forests.
Version 1.4.6 (Talented Toad)
Bug fixes
- Fixed unused arguments that appeared in the documentation.
Version 1.4.5 (Reminiscing Rat)
Bug fixes
- Creating data objects (`as_data_object`) using naive learners now works and no longer throws an error.
Version 1.4.4 (Quixotic Quail)
Bug fixes
- Prevented an error that could occur when computing net benefit for decision curves of models that predict class probabilities of exactly 1. This error was very rare, as it only occurred if the predicted class probabilities had at most two distinct values, one of which was 1.0.
- Prevented an issue that could occur when computing linear calibration fits where the fit can be computed without residual error. This issue prevented the t-statistic and p-value from being correctly computed for binomial, multinomial and survival outcomes.
- Prevented an issue when computing linear calibration fits when all the expected values are the same. The model then lacks a slope. In this case, we now add a slope of 0 with an infinite confidence interval.
Version 1.4.3 (Puzzled Prawn)
Bug fixes
- Prevented an error due to an overzealous check for hyperparameters being present for training a model.
Version 1.4.2 (Omnicompetent Owl)
Bug fixes
- Fixed an error that could occur when creating a lasso model for imputation using just a single feature.
Version 1.4.1 (Nefarious Newt)
Minor changes
- Robust methods for power transformations were added, based on the work of Raymaekers and Rousseeuw (Transforming variables to central normality. Mach Learn. 2021. doi:10.1007/s10994-021-05960-5). These methods are `yeo_johnson_robust` and `box_cox_robust`.
- A robust normalisation method, based on Huber's M-estimators for location and scale, was added: `standardisation_robust`. A configuration sketch is shown after this list.
- Improved efficiency of aggregating and computing point estimates for evaluation steps. For each grouping (e.g. samples for pairwise sample similarity), multiple values may be available that should be aggregated to a point estimate. Previously, we split on all unique combinations of the grouping columns and processed each split separately. This is a valid approach, but it can incur significant overhead when it produces a large number (>100k) of splits. We now first determine which data (if any) require computation of a (bias-corrected) point estimate because of grouping. Often, each split contains only a single instance, which forms a point estimate on its own, and extra computation is avoided in these cases.
- Plots now always show the evaluation time point. This is relevant for, for example, calibration plots, where both the observed and expected (predicted) probabilities are time-dependent and change depending on the time point.
- Improved support for providing a file name for storing a plot. The plotting device is now selected based on the file name, if it has an extension. If multiple plots would be created, e.g. due to splitting on a grouping variable such as the underlying dataset, the provided file name is used as a base.
- Methods for setting labels could previously update the ordering of the labels for `familiarCollection` objects, which could produce unexpected changes. Setting new labels now does not change the label order. Use the `order` argument to update the order of the labels instead (see the sketch after this list).
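As an illustration of the new robust preprocessing methods, the sketch below selects them when configuring an experiment. This is a minimal, hypothetical configuration, not a complete experiment: the data set, outcome settings and learner are placeholders, and it assumes the `transformation_method` and `normalisation_method` configuration parameters accept the new method names.

```r
library(familiar)

# Minimal sketch (assumptions noted above): select the robust methods
# introduced in this release when setting up an experiment.
summon_familiar(
  data = my_data,                                  # placeholder data.frame
  outcome_type = "continuous",
  outcome_column = "outcome",
  transformation_method = "yeo_johnson_robust",    # or "box_cox_robust"
  normalisation_method = "standardisation_robust", # robust standardisation
  learner = "glm"                                  # illustrative learner
)
```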
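And a hedged sketch of the new labelling behaviour. The setter name `set_class_names()` and its `old`, `new` and `order` arguments are assumptions based on this release note; check the package documentation for the exact setters available for `familiarCollection` objects.

```r
library(familiar)

# Hypothetical sketch: renaming class labels on a familiarCollection no longer
# changes their order; reordering is done explicitly via the order argument.
collection <- set_class_names(
  collection,
  old = c("0", "1"),
  new = c("control", "case")
)

# Explicitly update the label order.
collection <- set_class_names(collection, order = c("case", "control"))
```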
Bug fixes
- Fixed an error that would occur when attempting to create risk group labels for a `familiarCollection` object that is composed of externally provided `familiarData` objects.
- Fixed an issue that would prevent a `familiarCollection` object from being returned if an experiment was run using a temporary folder.
- Fixed an issue with apply functions in familiar taking a long time to aggregate their results.
- Fixed an issue that would prevent Kaplan-Meier curves from being plotted when more than three risk strata were present.
- Fixed an error that would occur if Kaplan-Meier curves were plotted for more than one stratification method and different risk groups.
- Fixed an issue that could potentially cause wrong transformation and normalisation parameter values to be matched when forming ensemble models. This may have affected sample cluster plots, which use this information.
Version 1.4.0 (Misanthropic Muskrat)
Major changes
- Hyperparameter optimisation now trains a naive model if none of the hyperparameter sets leads to a model that performs better than a naive model. Previously, a model was trained regardless of whether it would actually be better than a naive model. Naive models, for example, predict the majority class or the median value, depending on the problem.
Minor changes
- Metrics for assessing the performance of regression models, such as mean squared error, can now be computed in winsorised or trimmed (truncated) forms. These can be specified by appending `_winsor` or `_trim` as a suffix to the metric name. Winsorising clips the predicted values of the 5% of instances with the most extreme absolute errors prior to computing the performance metric, whereas trimming removes these instances. With either option, for many metrics the assessed model performance is less skewed by rare outliers.
- Two additional optimisation functions were defined to assess the suitability of hyperparameter sets:
  - `model_balanced_estimate`: seeks to maximise the estimate of the balanced IB and OOB score. This is similar to the `balanced` score, and in fact uses a hyperparameter learner to predict said score (not available for random search).
  - `model_balanced_estimate_minus_sd`: seeks to maximise the estimate of the balanced IB and OOB score, minus its estimated standard deviation. This is similar to the `balanced` score, but takes its estimated spread into account. Note that, as for `model_estimate_minus_sd`, the width of the distribution of balanced scores is more difficult to determine than its estimate.
- The `balanced` optimisation function now adds a penalty when the trained model performs worse on the training data than a naive model.
- A new exploration method for hyperparameter optimisation was added, namely `single_shot`. As the name suggests, this performs a single pass on the challenger and incumbent models during each intensification iteration. This is also the new default. Extensive tests have shown that single-shot selection leads to comparable performance. A configuration sketch covering these hyperparameter optimisation settings follows this list.
- Convergence checks for hyperparameter sets now depend on the validation optimisation score, as this is more stable than the summary score for some `optimisation_function` methods, such as `model_estimate_minus_sd`. Moreover, the tolerance has been changed to allow for values above `0.01` for sample sizes smaller than `100`. This prevents convergence issues where the expected statistical fluctuation for small sample sizes would easily break convergence checks, and hence force long searches for suitable hyperparameters.
- The default familiar plotting theme is now exported as `theme_familiar`. This allows for tweaking the default theme, for example, setting a larger font size or selecting a different font family. After changing the theme, it can be provided as the `ggtheme` argument (see the sketch after this list).
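The sketch below gathers the hyperparameter optimisation options mentioned above into one hypothetical configuration. `optimisation_function` is named in the convergence item above; the `optimisation_metric` and `exploration_method` parameter names, the `mse_winsor` metric name and the learner are assumptions for illustration only.

```r
library(familiar)

# Hypothetical configuration sketch: combine a winsorised regression metric,
# one of the new optimisation functions, and single-shot exploration.
# Parameter names are assumptions; consult the hyperparameter optimisation
# documentation for the exact names.
summon_familiar(
  data = my_data,                               # placeholder data.frame
  outcome_type = "continuous",
  outcome_column = "outcome",
  learner = "glm",                              # illustrative learner
  optimisation_metric = "mse_winsor",           # "_trim" removes outliers instead
  optimisation_function = "model_balanced_estimate_minus_sd",
  exploration_method = "single_shot"            # new default
)
```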
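Below is a small sketch of tweaking the exported theme and passing it to a plotting function. `theme_familiar` and the `ggtheme` argument are named in the notes above and the theme is assumed to be callable with default arguments; the `plot_model_performance()` call and the `collection` object are placeholders.

```r
library(familiar)
library(ggplot2)

# Sketch: start from the exported default theme and adjust font size and family
# using standard ggplot2 theme layering.
my_theme <- theme_familiar() +
  theme(text = element_text(size = 14, family = "serif"))

# Pass the adjusted theme through the ggtheme argument of a plotting function
# (plot_model_performance() is used here as a placeholder).
plot_model_performance(collection, ggtheme = my_theme)
```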
Bug fixes
- `ggtheme` is now checked for completeness, which prevents errors with unclear causes or solutions.
- We previously checked that any coefficient of a regression model could be estimated. This could lead to large models being formed where all features were insufficiently converged, even if this resulted in a meaningless model. We now check that all (instead of any) coefficients could be estimated for GLM, Cox and survival regression models.
- Fixed an error caused by unsuccessfully retraining an anonymous random forest for variable importance estimation.
- Fixed errors due to the introduction of `linewidth` elements in version 3.4.0 of `ggplot2`. Versions of `ggplot2` prior to 3.4.0 are still supported.