Pearson Residuals: normalization and hvg #2980
base: main
Conversation
    n_cells: int,
) -> np.ndarray[np.float64]:
    # TODO: potentially a lot to rewrite here
    # TODO: is there a better/more common way we use? e.g. with dispatching?
something I've asked myself multiple times
    return x

    # np.clip doesn't work with sparse-in-dask, although pre_res is not sparse here since it is computed as an outer product
    # TODO: we have such a clip function in multiple places..?
not a priority here
@@ -184,7 +231,11 @@ def _highly_variable_pearson_residuals(
    if clip < 0:
        raise ValueError("Pearson residuals require `clip>=0` or `clip=None`.")

    if sp_sparse.issparse(X_batch):
    if isinstance(X_batch, DaskArray):
Having _calculate_res_dense and _calculate_res_sparse was there before; maybe singledispatch would be cleaner here.
The non-jitted dask computation is ~10x slower at the moment.
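For context, a rough sketch of what a singledispatch-based version could look like; the helper name `_calculate_res`, its signature, and the spelled-out residual formula are illustrative stand-ins rather than the PR's actual code:

```python
from functools import singledispatch

import numpy as np
import scipy.sparse as sp_sparse


@singledispatch
def _calculate_res(X, *, sums_genes, sums_cells, sum_total, clip, theta):
    # default (dense np.ndarray) path: expected counts from the outer product
    # of the cell and gene margins, then clipped Pearson residuals
    mu = sums_cells @ sums_genes / sum_total
    res = (X - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(res, -clip, clip)


@_calculate_res.register(sp_sparse.spmatrix)
def _(X, *, sums_genes, sums_cells, sum_total, clip, theta):
    # sparse path; densifying here is only a placeholder for a real sparse kernel
    return _calculate_res(
        X.toarray(),
        sums_genes=sums_genes,
        sums_cells=sums_cells,
        sum_total=sum_total,
        clip=clip,
        theta=theta,
    )
```

A DaskArray overload could be registered the same way, which would replace the isinstance/issparse branching shown in the diff.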
    sums_cells = axis_sum(X, axis=1, dtype=np.float64).reshape(-1, 1)
    sum_total = sums_genes.sum()

    # TODO: Consider deduplicating computations below which are similarly required in _highly_variable_genes?
There's actually quite some duplication; I think this could be reduced.
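One option, sketched under the assumption that both code paths only need the two marginal sums and the grand total; the helper name is made up, and plain `.sum()` stands in for `axis_sum` so the snippet is self-contained:

```python
import numpy as np


def _pearson_residual_marginals(X):
    """Return per-gene sums, per-cell sums, and the grand total as float64."""
    sums_genes = np.asarray(X.sum(axis=0), dtype=np.float64).reshape(1, -1)
    sums_cells = np.asarray(X.sum(axis=1), dtype=np.float64).reshape(-1, 1)
    sum_total = sums_genes.sum()
    return sums_genes, sums_cells, sum_total
```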
@pytest.mark.parametrize("array_type", ARRAY_TYPES)
@pytest.mark.parametrize("dtype", ["float32", "int64"])
def test_pearson_residuals_inputchecks(array_type, dtype):
    # TODO: do we have a preferred way of making such a small dataset, with the array types option?
I copied the array_type(adata.X) pattern from other tests.
Is this how we want to set up datasets to test through the combinations?
I think so?
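For reference, a minimal version of the pattern being discussed; ARRAY_TYPES normally comes from scanpy's shared test helpers, so the two entries below are stand-ins:

```python
import anndata as ad
import numpy as np
import pytest
import scipy.sparse as sp_sparse

# stand-ins for scanpy's shared ARRAY_TYPES list
ARRAY_TYPES = [np.asarray, sp_sparse.csr_matrix]


@pytest.mark.parametrize("array_type", ARRAY_TYPES)
@pytest.mark.parametrize("dtype", ["float32", "int64"])
def test_small_dataset_setup(array_type, dtype):
    # build a tiny counts matrix once and convert it with array_type
    counts = np.array([[3, 0, 1], [0, 2, 5]], dtype=dtype)
    adata = ad.AnnData(X=array_type(counts))
    assert adata.X.shape == (2, 3)
```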
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2980      +/-   ##
==========================================
- Coverage   76.31%   75.54%   -0.77%
==========================================
  Files         109      117       +8
  Lines       12513    12971     +458
==========================================
+ Hits         9549     9799     +250
- Misses       2964     3172     +208
def custom_clip(x, clip_val):
    x[x < -clip_val] = -clip_val
    x[x > clip_val] = clip_val
    return x
Make this a utility and use it with the other implementation. Also, maybe a better name, like dask_sparse_compat_clip?
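Roughly what that utility could look like, using the name suggested above (it does not exist in scanpy yet); boolean masking is kept because, per the comment in the diff, np.clip does not work with sparse-in-dask:

```python
def dask_sparse_compat_clip(x, clip_val):
    """Clip values of `x` to [-clip_val, clip_val] in place and return it."""
    x[x < -clip_val] = -clip_val
    x[x > clip_val] = clip_val
    return x
```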
    sums_genes = axis_sum(X, axis=0, dtype=np.float64).reshape(1, -1)
    sums_cells = axis_sum(X, axis=1, dtype=np.float64).reshape(-1, 1)
    sum_total = sums_genes.sum()
These functions should work across the types, no? They rely on single dispatch of the first argument and just call the other implementations.
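A quick illustration of the point: the same reduction call works for dense and sparse inputs alike, which is what dispatching on the first argument provides (plain `.sum()` is used here instead of axis_sum so the snippet runs on its own):

```python
import numpy as np
import scipy.sparse as sp_sparse

X_dense = np.arange(6, dtype=np.float64).reshape(2, 3)
X_sparse = sp_sparse.csr_matrix(X_dense)

for X in (X_dense, X_sparse):
    # same margin computation regardless of input type
    sums_genes = np.asarray(X.sum(axis=0), dtype=np.float64).reshape(1, -1)
    sums_cells = np.asarray(X.sum(axis=1), dtype=np.float64).reshape(-1, 1)
    assert np.isclose(sums_genes.sum(), sums_cells.sum())
```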
def clip(x, min, max):
    x[x < min] = min
    x[x > max] = max
    return x
Definitely a feather in the cap for making this a utility.
scanpy/tests/test_normalization.py
Outdated
def general_max(x, array_type):
    if "dask" in array_type.__name__:
        return x.compute().max()
    return x.max()


def general_min(x, array_type):
    if "dask" in array_type.__name__:
        return x.compute().min()
    return x.min()
better as maybe_compute_{min,max}
Also, can't we check isinstance(x, DaskArray)?
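Both suggestions combined might look something like this; the DaskArray import comes from dask's public API rather than scanpy's compat alias, and the helper names follow the rename proposed above:

```python
try:
    from dask.array import Array as DaskArray
except ImportError:  # dask is an optional dependency
    DaskArray = ()  # isinstance(x, ()) is always False


def maybe_compute_max(x):
    # compute lazily evaluated dask arrays before reducing
    return x.compute().max() if isinstance(x, DaskArray) else x.max()


def maybe_compute_min(x):
    return x.compute().min() if isinstance(x, DaskArray) else x.min()
```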
@@ -90,6 +92,47 @@ def clac_clipped_res_sparse(gene: int, cell: int, value: np.float64) -> np.float
    return residuals


def _calculate_res_dense_vectorized(
Why is this called dense if it operates on sparse inputs? Also, why are the type hints np.ndarray then? And why the extra dtype? Can we guarantee that?
@ilan-gold, I put some comments where I'm most interested in your feedback for a first step forward