Merging main
Signed-off-by: Adam Li <[email protected]>
adam2392 committed Nov 8, 2023
2 parents 739c7be + 386578a commit 30e9e95
Showing 23 changed files with 2,242 additions and 835 deletions.
7 changes: 2 additions & 5 deletions .github/workflows/release.yml
@@ -32,7 +32,7 @@ jobs:
           architecture: "x64"
       - name: Install dependencies
         run: |
-          python -m pip install --progress-bar off --upgrade pip setuptools wheel
+          python -m pip install --progress-bar off --upgrade pip
           python -m pip install --progress-bar off build twine
       - name: Prepare environment
         run: |
@@ -43,12 +43,9 @@ jobs:
         with:
           name: package
           path: dist
-      # TODO: refactor scripts to generate release notes from `whats_new.rst` file instead
-      # - name: Generate release notes
-      #   run: |
-      #     python scripts/release_notes.py > ${{ github.workspace }}-RELEASE_NOTES.md
       - name: Publish package to PyPI
         run: |
+          ls dist/*
           twine upload -u ${{ secrets.PYPI_USERNAME }} -p ${{ secrets.PYPI_PASSWORD }} dist/*
       - name: Publish GitHub release
         uses: softprops/action-gh-release@v1
4 changes: 3 additions & 1 deletion DEVELOPING.md
@@ -190,4 +190,6 @@ or if you have two-factor authentication enabled: https://pypi.org/help/#apitoken
 
        twine upload dist/* --repository scikit-tree
 
-4. Update version number on ``meson.build`` and ``_version.py`` to the relevant version.
+4. Update version number on ``meson.build`` and ``pyproject.toml`` to the relevant version.
+
+See https://github.com/neurodata/scikit-tree/pull/160 as an example.
7 changes: 6 additions & 1 deletion doc/_static/versions.json
@@ -1,9 +1,14 @@
 [
     {
-        "name": "0.3 (devel)",
+        "name": "0.4 (devel)",
         "version": "dev",
         "url": "https://docs.neurodata.io/scikit-tree/dev/"
     },
+    {
+        "name": "0.3",
+        "version": "0.3",
+        "url": "https://docs.neurodata.io/scikit-tree/v0.3/"
+    },
     {
         "name": "0.2",
         "version": "0.2",
3 changes: 2 additions & 1 deletion doc/whats_new.rst
@@ -17,5 +17,6 @@ on libraries.io to be notified when new versions are released.
 
    Version 0.1 <whats_new/v0.1.rst>
    Version 0.2 <whats_new/v0.2.rst>
-   Version 0.3 (Unreleased) <whats_new/v0.3.rst>
+   Version 0.3 <whats_new/v0.3.rst>
+   Version 0.4 (Unreleased) <whats_new/v0.4.rst>

10 changes: 7 additions & 3 deletions doc/whats_new/v0.3.rst
@@ -3,20 +3,24 @@
 .. include:: _contributors.rst
 .. currentmodule:: sktree
 
-.. _current:
+.. _v0_3:
 
 Version 0.3
 ===========
 
-**In Development**
+This release includes a number of bug fixes and enhancements related to hypothesis testing with decision trees.
+Moreover, we have added an experimental multi-view decision tree / random forest, which considers multiple views
+of the data when building trees. The documentation page has also undergone an organizational overhaul
+making it easier for users to find examples related to specific use cases.
 
 Changelog
 ---------
 - |Fix| Fixes a bug in consistency of train/test samples when ``random_state`` is not set in FeatureImportanceForestClassifier and FeatureImportanceForestRegressor, by `Adam Li`_ (:pr:`135`)
 - |Fix| Fixes a bug where covariate indices were not shuffled by default when running FeatureImportanceForestClassifier and FeatureImportanceForestRegressor test methods, by `Sambit Panda`_ (:pr:`140`)
 - |Enhancement| Add multi-view splitter for axis-aligned decision trees, by `Adam Li`_ (:pr:`129`)
 - |Enhancement| Add stratified sampling option to ``FeatureImportance*`` via the ``stratify`` keyword argument, by `Yuxin Bai`_ (:pr:`143`)
-- |API| ``FeatureImportanceForest*`` now has a hyperparameter to control the number of permutations is done per forest ``permute_per_forest_fraction``, by `Adam Li`_ (:pr:`145`)
+- |Fix| Fixed usage of ``feature_importances_`` property in ``HonestForestClassifier``, by `Adam Li`_ (:pr:`156`)
+- |Fix| Fixed ``HonestForestClassifier`` to allow decision-trees from sklearn, albeit with a limited API, by `Adam Li`_ (:pr:`158`)
 
 Code and Documentation Contributors
 -----------------------------------
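
For reference, the experimental multi-view decision tree mentioned in the release notes above is exercised by the new example added in this commit (examples/hypothesis_testing/plot_might_mv_auc.py). A minimal usage sketch, with illustrative shapes and parameter values that are not taken from the library's documentation, might look like this:

import numpy as np
from sktree.tree import MultiViewDecisionTreeClassifier

rng = np.random.default_rng(0)
X_view1 = rng.normal(size=(200, 10))  # first feature set (view 1)
X_view2 = rng.normal(size=(200, 30))  # second feature set (view 2)
X = np.hstack([X_view1, X_view2])  # stacked multi-view design matrix
y = rng.integers(0, 2, size=200)

# feature_set_ends marks the column index at which each view ends
clf = MultiViewDecisionTreeClassifier(feature_set_ends=[10, 40], random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
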
25 changes: 25 additions & 0 deletions doc/whats_new/v0.4.rst
@@ -0,0 +1,25 @@
:orphan:

.. include:: _contributors.rst
.. currentmodule:: sktree

.. _current:

Version 0.4
===========

**In Development**

Changelog
---------

- |API| ``FeatureImportanceForest*`` now has a hyperparameter, ``permute_per_forest_fraction``, to control the number of permutations done per forest, by `Adam Li`_ (:pr:`145`)

Code and Documentation Contributors
-----------------------------------

Thanks to everyone who has contributed to the maintenance and improvement of
the project since version inception, including:

* `Adam Li`_

135 changes: 135 additions & 0 deletions examples/hypothesis_testing/plot_might_mv_auc.py
@@ -0,0 +1,135 @@
"""
=====================================================
Compute partial AUC using multi-view MIGHT (MV-MIGHT)
=====================================================
An example using :class:`~sktree.stats.FeatureImportanceForestClassifier` for a nonparametric
multivariate hypothesis test, on simulated multi-view datasets. Here, we present
how to estimate the partial AUROC from a multi-view feature set.
We simulate a dataset with 510 features, 1000 samples, and a binary class target variable.
The first feature set (the first 10 features) is strongly correlated with the target (y),
while the second feature set (the remaining 500 features) is uninformative noise.
We then use MV-MIGHT to calculate the partial AUC of these sets.
"""

import numpy as np
from scipy.special import expit

from sktree import HonestForestClassifier
from sktree.stats import FeatureImportanceForestClassifier
from sktree.tree import DecisionTreeClassifier, MultiViewDecisionTreeClassifier

seed = 12345
rng = np.random.default_rng(seed)

# %%
# Simulate data
# -------------
# We simulate the two feature sets, and the target variable. We then combine them
# into a single dataset to perform hypothesis testing.

n_samples = 1000
n_features_set = 500
mean = 1.0
sigma = 2.0
beta = 5.0

unimportant_mean = 0.0
unimportant_sigma = 4.5

# first sample the informative features, and then the uninformative features
X_important = rng.normal(loc=mean, scale=sigma, size=(n_samples, 10))
X_unimportant = rng.normal(
    loc=unimportant_mean, scale=unimportant_sigma, size=(n_samples, n_features_set)
)
X = np.hstack([X_important, X_unimportant])

# simulate the binary target variable
y = rng.binomial(n=1, p=expit(beta * X_important[:, :10].sum(axis=1)), size=n_samples)

# %%
# Use partial AUC as test statistic
# ---------------------------------
# You can restrict the false positive rate used for the partial AUC (and thereby set a
# minimum specificity) by passing ``max_fpr`` to ``statistic``.

n_estimators = 125
max_features = "sqrt"
metric = "auc"
test_size = 0.2
n_jobs = -1
honest_fraction = 0.5
max_fpr = 0.1
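
# %%
# As a standalone reference for the partial-AUC idea (an illustrative aside, not part of
# the original example, and not necessarily the exact computation used internally by
# ``statistic``), scikit-learn's ``roc_auc_score`` accepts a ``max_fpr`` argument and
# returns the standardized partial AUC over the restricted false-positive-rate range.

from sklearn.metrics import roc_auc_score

toy_y_true = [0, 0, 1, 1]
toy_y_score = [0.1, 0.4, 0.35, 0.8]
print("Toy partial AUC:", roc_auc_score(toy_y_true, toy_y_score, max_fpr=max_fpr))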

est_mv = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=MultiViewDecisionTreeClassifier(feature_set_ends=[10, 10 + n_features_set]),
        honest_fraction=honest_fraction,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=True,
    sample_dataset_per_tree=True,
)

# testing with the multi-view setting should return a higher AUC
stat, posterior_arr, samples = est_mv.statistic(
    X,
    y,
    metric=metric,
    return_posteriors=True,
    max_fpr=max_fpr,
)

print(f"ASH-90 / Partial AUC: {stat}")
print(f"Shape of Observed Samples: {samples.shape}")
print(f"Shape of Tree Posteriors for the positive class: {posterior_arr.shape}")

# %%
# Repeat without multi-view
# ---------------------------------
# Using the same feature set without the multi-view splitter yields a smaller statistic,
# as expected, since the multi-view structure of the data is ignored.

est = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=DecisionTreeClassifier(),
        honest_fraction=honest_fraction,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=True,
    sample_dataset_per_tree=True,
)

stat, posterior_arr, samples = est.statistic(
    X,
    y,
    metric=metric,
    return_posteriors=True,
    max_fpr=max_fpr,
)

print(f"ASH-90 / Partial AUC: {stat}")
print(f"Shape of Observed Samples: {samples.shape}")
print(f"Shape of Tree Posteriors for the positive class: {posterior_arr.shape}")

# %%
# All posteriors are saved within the model
# -----------------------------------------
# The results can be extracted from the fitted model's attributes at any time, and the
# model itself can be saved with ``pickle``.
#
# ASH-90 / Partial AUC: ``est_mv.observe_stat_``
#
# Observed Samples: ``est_mv.observe_samples_``
#
# Tree Posteriors for the positive class: ``est_mv.observe_posteriors_``
# (n_trees, n_samples_test, 1)
#
# True Labels: ``est_mv.y_true_final_``
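
# %%
# A minimal sketch (not part of the original example) of persisting the fitted model with
# ``pickle`` and reading the observed statistic back from disk:

import pickle

with open("est_mv.pkl", "wb") as f:
    pickle.dump(est_mv, f)

with open("est_mv.pkl", "rb") as f:
    est_loaded = pickle.load(f)

print(f"Reloaded ASH-90 / Partial AUC: {est_loaded.observe_stat_}")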
6 changes: 6 additions & 0 deletions examples/quantile_predictions/README.txt
@@ -0,0 +1,6 @@
.. _quantile_examples:

Quantile Predictions with Random Forest
---------------------------------------

Examples demonstrating how to generate quantile predictions using Random Forest variants.
111 changes: 111 additions & 0 deletions examples/quantile_predictions/plot_quantile_interpolation_with_RF.py
@@ -0,0 +1,111 @@
"""
========================================================
Predicting with different quantile interpolation methods
========================================================
An example comparison of interpolation methods that can be applied during
prediction when the desired quantile lies between two data points.
"""

from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# %%
# Generate the data
# -----------------
# We use five simple data points to illustrate the difference between the intervals that are
# generated using different interpolation methods.

X = np.array([[-1, -1], [-1, -1], [-1, -1], [1, 1], [1, 1]])
y = np.array([-2, -1, 0, 1, 2])

# %%
# The interpolation methods
# -------------------------
# The interpolation methods demonstrated here are ``linear``, ``lower``, ``higher``,
# ``midpoint``, and ``nearest``. Each one defines how a quantile is interpolated between
# two data points ``i`` and ``j`` (``i <= j``) when the desired quantile lies between them.
# For more details, see `sktree.RandomForestRegressor`.
# The difference between the methods is illustrated below, first with a quick standalone
# call to ``np.quantile`` and then with a random forest:
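
# %%
# As a quick standalone illustration (an aside, not part of the original example),
# ``np.quantile`` exposes the same ``method`` argument used further below. The 0.4
# quantile of ``[1, 2, 3, 4]`` lies between 2 and 3, so each method resolves it
# differently (linear -> 2.2, lower -> 2, higher -> 3, midpoint -> 2.5, nearest -> 2).

toy_sample = [1, 2, 3, 4]
for m in ["linear", "lower", "higher", "midpoint", "nearest"]:
    print(m, np.quantile(toy_sample, 0.4, method=m))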

interpolations = ["linear", "lower", "higher", "midpoint", "nearest"]
colors = ["#006aff", "#ffd237", "#0d4599", "#f2a619", "#a6e5ff"]
quantiles = [0.025, 0.5, 0.975]

y_medians = []
y_errs = []
est = RandomForestRegressor(
    n_estimators=1,
    random_state=0,
)
# fit the model
est.fit(X, y)
# get the leaf nodes that each sample fell into
leaf_ids = est.apply(X)
# for each tree, create a dictionary that maps each leaf node to the indices of the
# training samples that fell into it
node_to_indices = []
for tree in range(leaf_ids.shape[1]):
    d = defaultdict(list)
    for sample_id, leaf in enumerate(leaf_ids[:, tree]):
        d[leaf].append(sample_id)
    node_to_indices.append(d)
# pass the test data (here we reuse the training data) through the trained forest and
# get the leaf node that each test sample falls into
leaf_ids_test = est.apply(X)
# for each test sample, collect the indices of the training samples that fell into
# the same leaf node, for each tree
y_pred_quantile = []
for sample in range(leaf_ids_test.shape[0]):
    li = [
        node_to_indices[tree][leaf_ids_test[sample][tree]] for tree in range(leaf_ids_test.shape[1])
    ]
    # merge the per-tree lists of indices into one
    idx = [item for sublist in li for item in sublist]
    # collect the training targets for the corresponding indices
    y_pred_quantile.append(y[idx])

for interpolation in interpolations:
    # get the quantile predictions for each predicted sample
    y_pred = [
        np.array(
            [
                np.quantile(y_pred_quantile[i], quantile, method=interpolation)
                for i in range(len(y_pred_quantile))
            ]
        )
        for quantile in quantiles
    ]
    y_medians.append(y_pred[1])
    y_errs.append(
        np.concatenate(
            (
                [y_pred[1] - y_pred[0]],
                [y_pred[2] - y_pred[1]],
            ),
            axis=0,
        )
    )

sc = plt.scatter(np.arange(len(y)) - 0.35, y, color="k", zorder=10)
ebs = []
for i, (median, y_err) in enumerate(zip(y_medians, y_errs)):
    ebs.append(
        plt.errorbar(
            np.arange(len(y)) + (0.15 * (i + 1)) - 0.35,
            median,
            yerr=y_err,
            color=colors[i],
            ecolor=colors[i],
            fmt="o",
        )
    )
plt.xlim([-0.75, len(y) - 0.25])
plt.xticks(np.arange(len(y)), X.tolist())
plt.xlabel("Samples (Feature Values)")
plt.ylabel("Actual and Predicted Values")
plt.legend([sc] + ebs, ["actual"] + interpolations, loc=2)
plt.show()
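
# %%
# For reuse, the leaf-aggregation logic above can be distilled into a small helper
# function. This is a sketch based on the code in this example (it is not part of the
# original script): it pools, per test sample, the training targets that share a leaf
# with that sample in each tree, and then takes the requested quantiles.


def predict_quantiles(forest, X_train, y_train, X_test, pred_quantiles, method="linear"):
    """Leaf-based quantile predictions from a fitted forest regressor."""
    train_leaves = forest.apply(X_train)  # shape (n_train, n_trees)
    test_leaves = forest.apply(X_test)  # shape (n_test, n_trees)
    preds = []
    for leaves in test_leaves:
        # concatenate, over trees, the training targets that share this sample's leaf
        pooled = np.concatenate(
            [y_train[train_leaves[:, t] == leaves[t]] for t in range(train_leaves.shape[1])]
        )
        preds.append(np.quantile(pooled, pred_quantiles, method=method))
    return np.asarray(preds)


# same quantiles as above, computed with the "midpoint" interpolation
print(predict_quantiles(est, X, y, X, quantiles, method="midpoint"))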