Merging main
Signed-off-by: Adam Li <[email protected]>
adam2392 committed Nov 8, 2023
2 parents 739c7be + 386578a commit 30e9e95
Showing 23 changed files with 2,242 additions and 835 deletions.
7 changes: 2 additions & 5 deletions .github/workflows/release.yml
@@ -32,7 +32,7 @@ jobs:
           architecture: "x64"
       - name: Install dependencies
         run: |
-          python -m pip install --progress-bar off --upgrade pip setuptools wheel
+          python -m pip install --progress-bar off --upgrade pip
           python -m pip install --progress-bar off build twine
       - name: Prepare environment
         run: |
@@ -43,12 +43,9 @@ jobs:
         with:
           name: package
           path: dist
-      # TODO: refactor scripts to generate release notes from `whats_new.rst` file instead
-      # - name: Generate release notes
-      #   run: |
-      #     python scripts/release_notes.py > ${{ github.workspace }}-RELEASE_NOTES.md
       - name: Publish package to PyPI
         run: |
+          ls dist/*
           twine upload -u ${{ secrets.PYPI_USERNAME }} -p ${{ secrets.PYPI_PASSWORD }} dist/*
       - name: Publish GitHub release
         uses: softprops/action-gh-release@v1
4 changes: 3 additions & 1 deletion DEVELOPING.md
@@ -190,4 +190,6 @@ or if you have two-factor authentication enabled: https://pypi.org/help/#apitoken
 
        twine upload dist/* --repository scikit-tree
 
-4. Update version number on ``meson.build`` and ``_version.py`` to the relevant version.
+4. Update version number on ``meson.build`` and ``pyproject.toml`` to the relevant version.
+
+See https://github.com/neurodata/scikit-tree/pull/160 as an example.
7 changes: 6 additions & 1 deletion doc/_static/versions.json
@@ -1,9 +1,14 @@
 [
     {
-        "name": "0.3 (devel)",
+        "name": "0.4 (devel)",
         "version": "dev",
         "url": "https://docs.neurodata.io/scikit-tree/dev/"
     },
+    {
+        "name": "0.3",
+        "version": "0.3",
+        "url": "https://docs.neurodata.io/scikit-tree/v0.3/"
+    },
     {
         "name": "0.2",
         "version": "0.2",
3 changes: 2 additions & 1 deletion doc/whats_new.rst
@@ -17,5 +17,6 @@ on libraries.io to be notified when new versions are released.
 
    Version 0.1 <whats_new/v0.1.rst>
    Version 0.2 <whats_new/v0.2.rst>
-   Version 0.3 (Unreleased) <whats_new/v0.3.rst>
+   Version 0.3 <whats_new/v0.3.rst>
+   Version 0.4 (Unreleased) <whats_new/v0.4.rst>

10 changes: 7 additions & 3 deletions doc/whats_new/v0.3.rst
@@ -3,20 +3,24 @@
 .. include:: _contributors.rst
 .. currentmodule:: sktree
 
-.. _current:
+.. _v0_3:
 
 Version 0.3
 ===========
 
-**In Development**
+This release includes a number of bug fixes and enhancements related to hypothesis testing with decision trees.
+Moreover, we have added an experimental multi-view decision tree / random forest, which considers multiple views
+of the data when building trees. The documentation page has also undergone an organizational overhaul
+making it easier for users to find examples related to specific use cases.
 
 Changelog
 ---------
 - |Fix| Fixes a bug in consistency of train/test samples when ``random_state`` is not set in FeatureImportanceForestClassifier and FeatureImportanceForestRegressor, by `Adam Li`_ (:pr:`135`)
 - |Fix| Fixes a bug where covariate indices were not shuffled by default when running FeatureImportanceForestClassifier and FeatureImportanceForestRegressor test methods, by `Sambit Panda`_ (:pr:`140`)
 - |Enhancement| Add multi-view splitter for axis-aligned decision trees, by `Adam Li`_ (:pr:`129`)
 - |Enhancement| Add stratified sampling option to ``FeatureImportance*`` via the ``stratify`` keyword argument, by `Yuxin Bai`_ (:pr:`143`)
-- |API| ``FeatureImportanceForest*`` now has a hyperparameter to control the number of permutations is done per forest ``permute_per_forest_fraction``, by `Adam Li`_ (:pr:`145`)
+- |Fix| Fixed usage of ``feature_importances_`` property in ``HonestForestClassifier``, by `Adam Li`_ (:pr:`156`)
+- |Fix| Fixed ``HonestForestClassifier`` to allow decision-trees from sklearn, albeit with a limited API, by `Adam Li`_ (:pr:`158`)
 
 Code and Documentation Contributors
 -----------------------------------
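
For reference, the experimental multi-view decision tree mentioned in the release notes above is exercised by the new example added in this commit (examples/hypothesis_testing/plot_might_mv_auc.py). A minimal usage sketch, with illustrative shapes and parameter values that are not taken from the library's documentation, might look like this:

import numpy as np
from sktree.tree import MultiViewDecisionTreeClassifier

rng = np.random.default_rng(0)
X_view1 = rng.normal(size=(200, 10))  # first feature set (view 1)
X_view2 = rng.normal(size=(200, 30))  # second feature set (view 2)
X = np.hstack([X_view1, X_view2])  # stacked multi-view design matrix
y = rng.integers(0, 2, size=200)

# feature_set_ends marks the column index at which each view ends
clf = MultiViewDecisionTreeClassifier(feature_set_ends=[10, 40], random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
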
25 changes: 25 additions & 0 deletions doc/whats_new/v0.4.rst
@@ -0,0 +1,25 @@
:orphan:

.. include:: _contributors.rst
.. currentmodule:: sktree

.. _current:

Version 0.4
===========

**In Development**

Changelog
---------

- |API| ``FeatureImportanceForest*`` now has a hyperparameter, ``permute_per_forest_fraction``, to control the number of permutations done per forest, by `Adam Li`_ (:pr:`145`)

Code and Documentation Contributors
-----------------------------------

Thanks to everyone who has contributed to the maintenance and improvement of
the project since version inception, including:

* `Adam Li`_

135 changes: 135 additions & 0 deletions examples/hypothesis_testing/plot_might_mv_auc.py
@@ -0,0 +1,135 @@
"""
=====================================================
Compute partial AUC using multi-view MIGHT (MV-MIGHT)
=====================================================
An example using :class:`~sktree.stats.FeatureImportanceForestClassifier` for a nonparametric
multivariate hypothesis test, on simulated multi-view datasets. Here, we present
how to estimate the partial AUROC from a multi-view feature set.
We simulate a dataset with 510 features, 1000 samples, and a binary class target variable.
The first feature set (the first 10 features) is strongly correlated with the target (y),
while the second feature set (the remaining 500 features) is uninformative noise.
We then use MV-MIGHT to calculate the partial AUC of these sets.
"""

import numpy as np
from scipy.special import expit

from sktree import HonestForestClassifier
from sktree.stats import FeatureImportanceForestClassifier
from sktree.tree import DecisionTreeClassifier, MultiViewDecisionTreeClassifier

seed = 12345
rng = np.random.default_rng(seed)

# %%
# Simulate data
# -------------
# We simulate the two feature sets, and the target variable. We then combine them
# into a single dataset to perform hypothesis testing.

n_samples = 1000
n_features_set = 500
mean = 1.0
sigma = 2.0
beta = 5.0

unimportant_mean = 0.0
unimportant_sigma = 4.5

# first sample the informative features, and then the uninformative features
X_important = rng.normal(loc=mean, scale=sigma, size=(n_samples, 10))
X_unimportant = rng.normal(
    loc=unimportant_mean, scale=unimportant_sigma, size=(n_samples, n_features_set)
)
X = np.hstack([X_important, X_unimportant])

# simulate the binary target variable
y = rng.binomial(n=1, p=expit(beta * X_important[:, :10].sum(axis=1)), size=n_samples)

# %%
# Use partial AUC as test statistic
# ---------------------------------
# You can restrict the false positive rate used for the partial AUC (and thereby set a
# minimum specificity) by passing ``max_fpr`` to ``statistic``.

n_estimators = 125
max_features = "sqrt"
metric = "auc"
test_size = 0.2
n_jobs = -1
honest_fraction = 0.5
max_fpr = 0.1
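
# %%
# As a standalone reference for the partial-AUC idea (an illustrative aside, not part of
# the original example, and not necessarily the exact computation used internally by
# ``statistic``), scikit-learn's ``roc_auc_score`` accepts a ``max_fpr`` argument and
# returns the standardized partial AUC over the restricted false-positive-rate range.

from sklearn.metrics import roc_auc_score

toy_y_true = [0, 0, 1, 1]
toy_y_score = [0.1, 0.4, 0.35, 0.8]
print("Toy partial AUC:", roc_auc_score(toy_y_true, toy_y_score, max_fpr=max_fpr))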

est_mv = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=MultiViewDecisionTreeClassifier(feature_set_ends=[10, 10 + n_features_set]),
        honest_fraction=honest_fraction,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=True,
    sample_dataset_per_tree=True,
)

# testing with the multi-view setting should return a higher AUC
stat, posterior_arr, samples = est_mv.statistic(
    X,
    y,
    metric=metric,
    return_posteriors=True,
    max_fpr=max_fpr,
)

print(f"ASH-90 / Partial AUC: {stat}")
print(f"Shape of Observed Samples: {samples.shape}")
print(f"Shape of Tree Posteriors for the positive class: {posterior_arr.shape}")

# %%
# Repeat without multi-view
# ---------------------------------
# Using the same feature set without the multi-view splitter yields a smaller statistic,
# as expected, since the multi-view structure of the data is ignored.

est = FeatureImportanceForestClassifier(
    estimator=HonestForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        tree_estimator=DecisionTreeClassifier(),
        honest_fraction=honest_fraction,
        n_jobs=n_jobs,
    ),
    random_state=seed,
    test_size=test_size,
    permute_per_tree=True,
    sample_dataset_per_tree=True,
)

stat, posterior_arr, samples = est.statistic(
    X,
    y,
    metric=metric,
    return_posteriors=True,
    max_fpr=max_fpr,
)

print(f"ASH-90 / Partial AUC: {stat}")
print(f"Shape of Observed Samples: {samples.shape}")
print(f"Shape of Tree Posteriors for the positive class: {posterior_arr.shape}")

# %%
# All posteriors are saved within the model
# -----------------------------------------
# The results can be extracted from the fitted model's attributes at any time, and the
# model itself can be saved with ``pickle``.
#
# ASH-90 / Partial AUC: ``est_mv.observe_stat_``
#
# Observed Samples: ``est_mv.observe_samples_``
#
# Tree Posteriors for the positive class: ``est_mv.observe_posteriors_``
# (n_trees, n_samples_test, 1)
#
# True Labels: ``est_mv.y_true_final_``
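
# %%
# A minimal sketch (not part of the original example) of persisting the fitted model with
# ``pickle`` and reading the observed statistic back from disk:

import pickle

with open("est_mv.pkl", "wb") as f:
    pickle.dump(est_mv, f)

with open("est_mv.pkl", "rb") as f:
    est_loaded = pickle.load(f)

print(f"Reloaded ASH-90 / Partial AUC: {est_loaded.observe_stat_}")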
6 changes: 6 additions & 0 deletions examples/quantile_predictions/README.txt
@@ -0,0 +1,6 @@
.. _quantile_examples:

Quantile Predictions with Random Forest
---------------------------------------

Examples demonstrating how to generate quantile predictions using Random Forest variants.
111 changes: 111 additions & 0 deletions examples/quantile_predictions/plot_quantile_interpolation_with_RF.py
@@ -0,0 +1,111 @@
"""
========================================================
Predicting with different quantile interpolation methods
========================================================
An example comparison of interpolation methods that can be applied during
prediction when the desired quantile lies between two data points.
"""

from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# %%
# Generate the data
# -----------------
# We use five simple data points to illustrate the difference between the intervals that are
# generated using different interpolation methods.

X = np.array([[-1, -1], [-1, -1], [-1, -1], [1, 1], [1, 1]])
y = np.array([-2, -1, 0, 1, 2])

# %%
# The interpolation methods
# -------------------------
# The interpolation methods demonstrated here are ``linear``, ``lower``, ``higher``,
# ``midpoint``, and ``nearest``. Each one defines how a quantile is interpolated between
# two data points ``i`` and ``j`` (``i <= j``) when the desired quantile lies between them.
# For more details, see `sktree.RandomForestRegressor`.
# The difference between the methods is illustrated below, first with a quick standalone
# call to ``np.quantile`` and then with a random forest:
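
# %%
# As a quick standalone illustration (an aside, not part of the original example),
# ``np.quantile`` exposes the same ``method`` argument used further below. The 0.4
# quantile of ``[1, 2, 3, 4]`` lies between 2 and 3, so each method resolves it
# differently (linear -> 2.2, lower -> 2, higher -> 3, midpoint -> 2.5, nearest -> 2).

toy_sample = [1, 2, 3, 4]
for m in ["linear", "lower", "higher", "midpoint", "nearest"]:
    print(m, np.quantile(toy_sample, 0.4, method=m))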

interpolations = ["linear", "lower", "higher", "midpoint", "nearest"]
colors = ["#006aff", "#ffd237", "#0d4599", "#f2a619", "#a6e5ff"]
quantiles = [0.025, 0.5, 0.975]

y_medians = []
y_errs = []
est = RandomForestRegressor(
    n_estimators=1,
    random_state=0,
)
# fit the model
est.fit(X, y)
# get the leaf nodes that each sample fell into
leaf_ids = est.apply(X)
# for each tree, create a dictionary that maps each leaf node to the indices of the
# training samples that fell into it
node_to_indices = []
for tree in range(leaf_ids.shape[1]):
    d = defaultdict(list)
    for sample_id, leaf in enumerate(leaf_ids[:, tree]):
        d[leaf].append(sample_id)
    node_to_indices.append(d)
# pass the test data (here we reuse the training data) through the trained forest and
# get the leaf node that each test sample falls into
leaf_ids_test = est.apply(X)
# for each test sample, collect the indices of the training samples that fell into
# the same leaf node, for each tree
y_pred_quantile = []
for sample in range(leaf_ids_test.shape[0]):
    li = [
        node_to_indices[tree][leaf_ids_test[sample][tree]] for tree in range(leaf_ids_test.shape[1])
    ]
    # merge the per-tree lists of indices into one
    idx = [item for sublist in li for item in sublist]
    # collect the training targets for the corresponding indices
    y_pred_quantile.append(y[idx])

for interpolation in interpolations:
    # get the quantile predictions for each predicted sample
    y_pred = [
        np.array(
            [
                np.quantile(y_pred_quantile[i], quantile, method=interpolation)
                for i in range(len(y_pred_quantile))
            ]
        )
        for quantile in quantiles
    ]
    y_medians.append(y_pred[1])
    y_errs.append(
        np.concatenate(
            (
                [y_pred[1] - y_pred[0]],
                [y_pred[2] - y_pred[1]],
            ),
            axis=0,
        )
    )

sc = plt.scatter(np.arange(len(y)) - 0.35, y, color="k", zorder=10)
ebs = []
for i, (median, y_err) in enumerate(zip(y_medians, y_errs)):
    ebs.append(
        plt.errorbar(
            np.arange(len(y)) + (0.15 * (i + 1)) - 0.35,
            median,
            yerr=y_err,
            color=colors[i],
            ecolor=colors[i],
            fmt="o",
        )
    )
plt.xlim([-0.75, len(y) - 0.25])
plt.xticks(np.arange(len(y)), X.tolist())
plt.xlabel("Samples (Feature Values)")
plt.ylabel("Actual and Predicted Values")
plt.legend([sc] + ebs, ["actual"] + interpolations, loc=2)
plt.show()
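
# %%
# For reuse, the leaf-aggregation logic above can be distilled into a small helper
# function. This is a sketch based on the code in this example (it is not part of the
# original script): it pools, per test sample, the training targets that share a leaf
# with that sample in each tree, and then takes the requested quantiles.


def predict_quantiles(forest, X_train, y_train, X_test, pred_quantiles, method="linear"):
    """Leaf-based quantile predictions from a fitted forest regressor."""
    train_leaves = forest.apply(X_train)  # shape (n_train, n_trees)
    test_leaves = forest.apply(X_test)  # shape (n_test, n_trees)
    preds = []
    for leaves in test_leaves:
        # concatenate, over trees, the training targets that share this sample's leaf
        pooled = np.concatenate(
            [y_train[train_leaves[:, t] == leaves[t]] for t in range(train_leaves.shape[1])]
        )
        preds.append(np.quantile(pooled, pred_quantiles, method=method))
    return np.asarray(preds)


# same quantiles as above, computed with the "midpoint" interpolation
print(predict_quantiles(est, X, y, X, quantiles, method="midpoint"))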