(feat): allow boolean indices to pass down to X
#1365
Conversation
ilan-gold commented Feb 9, 2024 (edited)
- Towards #1293 (Check indexing arg types)
- Tests added
- Release note added (or unnecessary)
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
```
@@            Coverage Diff             @@
##             main    #1365      +/-   ##
==========================================
- Coverage   86.17%   84.08%     -2.10%
==========================================
  Files          35       35
  Lines        5584     5592        +8
==========================================
- Hits         4812     4702      -110
- Misses        772      890      +118
```
Flags with carried forward coverage won't be shown.
@ivirshup I have a fix here, but I'm unsure of the reasoning for integer indices. Their primary use seemed to be that sparse/h5py needed integers. The numpy/pandas subsetting seems to work fine with booleans. For example, we don't seem to need them for forming a proper mesh using …
I'm not sure what you mean by this. Are you wondering why we were doing …
I'm guessing the reason was for sparse/hdf5 compatibility, but I'm not sure and don't want to miss something I'm not thinking of.
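For concreteness, a minimal sketch of the kind of conversion being discussed, assuming boolean masks get normalized to integer positions for backends that prefer integer fancy indexing. This is not anndata's actual code, and the helper name `to_integer_index` is made up:

```python
# Minimal sketch (hypothetical helper, not anndata's implementation):
# normalize a boolean mask to sorted integer positions, pass integers through.
import numpy as np


def to_integer_index(idx, axis_len):
    idx = np.asarray(idx)
    if idx.dtype == bool:
        if idx.shape[0] != axis_len:
            raise IndexError("boolean mask length does not match axis length")
        return np.flatnonzero(idx)  # sorted, unique integer positions
    return idx


mask = np.array([True, False, True, True, False])
print(to_integer_index(mask, 5))  # [0 2 3]
```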
It's also possible we didn't support masking because it was easier to support fewer cases, and the performance hit from doing …
This is how I went about this PR, by removing …
Great idea, will definitely do this.
You could invert the mask?
@ivirshup I got no errors in scanpy, and nothing looks particularly out of place in AnnData either. As mentioned on Slack, also, …
Why would the subset func being randomized here matter?
Oh @ivirshup, you just want to test for failing or not? I thought we wanted to do equality, i.e. …
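A hypothetical sketch of the equality-style check being referred to, under the assumption that the test compares boolean-mask subsetting against the equivalent integer-position subsetting (this is not the actual test in the PR):

```python
# Hypothetical equality check: subsetting with a boolean mask should give the
# same result as subsetting with the equivalent integer positions.
import numpy as np
from scipy import sparse

X = sparse.random(50, 20, density=0.1, format="csr", random_state=0)
mask = np.zeros(50, dtype=bool)
mask[[1, 5, 7, 30]] = True

np.testing.assert_array_equal(
    X[mask].toarray(),
    X[np.flatnonzero(mask)].toarray(),
)
```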
Force-pushed from 64ab900 to 8d685d2
@ivirshup Done. Sorry for the confusion.
So, I just tried seeing if we would continue using boolean masks on repeated indexing. Not surprisingly, we don't:
```python
import scanpy as sc

adata = sc.datasets.pbmc3k_processed().raw.to_adata()
v = adata[adata.obs["louvain"].isin(["B cells", "NK cells"])]
v2 = v[v.obs["louvain"] == "B cells"]

v._oidx
# array([False, True, False, ..., True, True, False])

v2._oidx
# array([  1,  10,  18,  19,  20,  23,  31,  55,  75,  85,  88,
#         89,  92,  93, 105, 107, 117, 122, 127, 134, 139, 143,
#        146, 160, 162, 165, 178, 179, 181, 190, 192, 218, 231,
#        ...
```
This means it's quite easy to hit the unoptimized case for any sort of more complex querying.
Do you think we should fix this by continuing to pass through booleans, or do we now need to address the integer case? I think masks on top of masks could happen in this PR (though it doesn't strictly have to), but the integer case would be a separate PR.
We could resolve nested boolean masks like:

```python
def update_mask_numpy(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new
```

There is a maybe better solution, if we could use numba or add some compiled code:

```python
import numpy as np, numba


@numba.njit
def update_mask_numba(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    incoming_pos = 0
    for orig_pos, val in enumerate(mask_orig):
        if val:
            mask_new[orig_pos] = mask_incoming[incoming_pos]
            incoming_pos += 1
    if incoming_pos != len(mask_incoming):
        raise ValueError("Incoming mask is too long")
    return mask_new


def update_mask_numpy(mask_orig, mask_incoming):
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new


rng = np.random.default_rng()


def update_mask_bench(size1, size2, size3):
    int_idx1 = rng.choice(size1, size2, replace=False)
    int_idx2 = rng.choice(size2, size3, replace=False)
    bool_idx1 = np.zeros(size1, dtype=bool)
    bool_idx2 = np.zeros(size2, dtype=bool)
    bool_idx1[int_idx1] = True
    bool_idx2[int_idx2] = True
    %timeit update_mask_numba(bool_idx1, bool_idx2)
    %timeit update_mask_numpy(bool_idx1, bool_idx2)
```

```
In [57]: update_mask_bench(1_000_000, 500_000, 10_000)
4.55 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.58 ms ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [58]: update_mask_bench(1_000_000, 500_000, 400_000)
4.51 ms ± 1.78 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.75 ms ± 7.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [59]: update_mask_bench(1_000_000, 500_000, 450_000)
4.51 ms ± 2.86 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
5.08 ms ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [60]: update_mask_bench(10_000_000, 9_000_000, 8_000_000)
18.9 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
63.1 ms ± 89.3 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [61]: update_mask_bench(10_000_000, 9_000_000, 500_000)
18.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
39 ms ± 61 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [62]: update_mask_bench(10_000_000, 1_000_000, 500_000)
14.8 ms ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.2 ms ± 62.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [64]: update_mask_bench(10_000_000, 1_000_000, 100_000)
14.7 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
23.4 ms ± 62.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

So the differences are never too huge, and we're talking about tens of milliseconds of difference for 10 million rows in the worst cases. We could dispatch to this in:

```python
@_resolve_idx.register(np.ndarray)
def _resolve_idx_ndarray(old, new, l):
    if is_bool_dtype(old) and is_bool_dtype(new):
        return update_mask(old, new)
    if is_bool_dtype(old):
        old = np.where(old)[0]
    return old[new]
```
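As a quick sanity check (my own example, not part of the PR), composing two boolean masks with `update_mask_numpy` from the comment above should match chaining the subsets directly:

```python
# Verify that the combined mask selects the same elements as chained subsetting.
import numpy as np


def update_mask_numpy(mask_orig, mask_incoming):
    # Repeated from the comment above so the snippet is self-contained.
    mask_new = np.zeros_like(mask_orig)
    mask_new[np.flatnonzero(mask_orig)[mask_incoming]] = True
    return mask_new


rng = np.random.default_rng(0)
x = rng.normal(size=100)
mask1 = x > 0            # first boolean subset
mask2 = x[mask1] > 0.5   # second mask, relative to the first subset
combined = update_mask_numpy(mask1, mask2)

np.testing.assert_array_equal(x[mask1][mask2], x[combined])
```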
This is just a toy case, where the data is unoptimized for querying. I just wanted to see if we would propagate the masks. In data optimized for querying, we would sort by the fields being queried on, which would lead to longer runs.
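A toy illustration of that point (my own example, not from the discussion): sorting by the field being queried turns a scattered boolean mask into a single contiguous run, which is friendlier for chunked backends.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.choice(["B cells", "NK cells", "T cells"], size=12)

mask_unsorted = labels == "B cells"        # True values scattered across the axis
order = np.argsort(labels, kind="stable")
mask_sorted = labels[order] == "B cells"   # one contiguous run of True values

print(mask_unsorted.astype(int))
print(mask_sorted.astype(int))
```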
I think the mask type should stay boolean, then, yes. I also think the performance is acceptable in …
Great!
…pass down to `X`) (#1387) Co-authored-by: Ilan Gold <[email protected]>