diff --git a/teaching/mlprojects24/slides/20240123-ml.md b/teaching/mlprojects24/slides/20240123-ml.md
index 52250622..770bc0c9 100644
--- a/teaching/mlprojects24/slides/20240123-ml.md
+++ b/teaching/mlprojects24/slides/20240123-ml.md
@@ -237,7 +237,7 @@ $$g = \hat{h} = \argmin_{h \in \mathcal{H}} R_N(h)$$
 
 --
 
-\.highlight1[Important question]: how well does the empirical risk approximate the true risk?
+.highlight1[Important question]: how well does the empirical risk approximate the true risk?
 
 ---
 
@@ -373,7 +373,7 @@ It is generally a good idea to split the data into (at least) a train set and a
 
 --
 
-An even better idea: a .highlight1[validation split].
+An even better idea: .highlight1[validation split(s)].
 
 ???
 
@@ -393,9 +393,9 @@ A resampling method to split the data into multiple folds, for either evaluation
 When is cross-validation a good idea?
 
 * When the model is not computationally too expensive.
-* When the amount of data is particularly small.
+* When the amount of data is rather small.
 
-In the extreme, use leave-one-out cross-validation.
+In the extreme, use .highlight1[leave-one-out] cross-validation.
 
 ---
 
@@ -404,7 +404,7 @@ In the extreme, use leave-one-out cross-validation.
 Normalising the data is generally a good idea too, for several reasons:
 
 * Numerical stability.
-* It may make training easier or faster.
+* It may make training (optimisation via gradient descent) easier or faster.
 * It equalises artificial differences in scale/importance between features.
 
 --
 
@@ -421,6 +421,23 @@ There are multiple ways of doing [feature normalisation](https://en.wikipedia.or
 
 ---
 
+## Class imbalance
+
+Practical applications of machine learning to classification problems typically involve data sets in which the classes contain different numbers of data points. This is known as an .highlight1[imbalanced classification problem].
+
+There are multiple techniques to deal with class imbalance:
+
+- Re-sampling: under-sample the majority class or over-sample the minority class.
+- Re-weighting the loss function: give a larger weight to errors on the minority class and a smaller weight to errors on the majority class.
+- Choosing appropriate metrics, not only accuracy: precision, recall, F1 score, confusion matrix...
+
+*Figure: under- and over-sampling. Adapted from: http://www.capallen.top*
+
+---
+
 name: title
 class: title, middle
 
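The new slide above lists three ways of handling class imbalance: re-sampling, re-weighting the loss, and imbalance-aware metrics. A minimal sketch of how these can look in practice, not part of the slide deck — the synthetic 95%/5% data set and the logistic-regression model are arbitrary choices for illustration, using scikit-learn:

```python
# Illustrative sketch only: class re-weighting, minority over-sampling and
# imbalance-aware metrics with scikit-learn on a synthetic 95%/5% problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic binary classification problem with a strong class imbalance.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.25,
                                          random_state=0)

# (1) Re-weight the loss: errors on the minority class cost more
# ("balanced" uses weights inversely proportional to class frequencies).
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_weighted.fit(X_tr, y_tr)

# (2) Re-sample: over-sample the minority class until both classes match.
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_over, y_over = resample(X_min, y_min, replace=True,
                          n_samples=len(y_maj), random_state=0)
clf_resampled = LogisticRegression(max_iter=1000)
clf_resampled.fit(np.vstack([X_maj, X_over]), np.concatenate([y_maj, y_over]))

# (3) Report metrics that expose minority-class performance, not only accuracy.
for name, clf in [("weighted", clf_weighted), ("re-sampled", clf_resampled)]:
    y_pred = clf.predict(X_te)
    print(name)
    print(confusion_matrix(y_te, y_pred))
    print(classification_report(y_te, y_pred, digits=3))
```

Under-sampling the majority class works the same way, e.g. `resample(X_maj, y_maj, replace=False, n_samples=len(y_min))` before fitting.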
diff --git a/teaching/mlprojects24/slides/index.md b/teaching/mlprojects24/slides/index.md
index e60a623e..498f4c96 100644
--- a/teaching/mlprojects24/slides/index.md
+++ b/teaching/mlprojects24/slides/index.md
@@ -12,3 +12,5 @@ title: IFT 3710/6759 - Slides
 
 ### [18 janvier - Tutoriel clusters HPC](20240118-cluster)
 
+### [23 janvier - Revue de l'apprentissage automatique](20240123-ml)
+
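The earlier hunks also touch the slides on validation splits, (leave-one-out) cross-validation and feature normalisation. A possible companion sketch, again not taken from the course material — the iris data set and the model are placeholders — showing the scaler fitted inside a pipeline so that each fold is normalised with training-fold statistics only:

```python
# Illustrative sketch only: feature normalisation inside cross-validation.
# Wrapping StandardScaler in a pipeline avoids leaking test-fold statistics
# into the normalisation step.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# k-fold cross-validation: one fit per fold, mean score and spread.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Leave-one-out: the extreme case, mostly worthwhile for very small data sets
# (here it means one fit per sample, i.e. 150 fits).
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f"Leave-one-out accuracy: {loo_scores.mean():.3f}")
```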