Pearson Residuals: normalization and hvg #2980
base: main
Conversation
    n_cells: int,
) -> np.ndarray[np.float64]:
    # TODO: potentially a lot to rewrite here
    # TODO: is there a better/more common way we use? e.g. with dispatching?
something I've asked myself multiple times
    return x

    # np.clip doesn't work with sparse-in-dask, although pre_res is not sparse here since it is computed as an outer product
    # TODO: we have such a clip function in multiple places..?
not a priority here
@@ -184,7 +231,11 @@ def _highly_variable_pearson_residuals(
    if clip < 0:
        raise ValueError("Pearson residuals require `clip>=0` or `clip=None`.")

    if sp_sparse.issparse(X_batch):
    if isinstance(X_batch, DaskArray):
Having _calculate_res_dense and _calculate_res_sparse was there before; maybe singledispatch would be cleaner here.
The non-jitted dask computation is ~10x slower at the moment.
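For context, a rough sketch of what a singledispatch-based version could look like; the helper name `_calculate_res`, its signature, and the spelled-out residual formula are illustrative stand-ins rather than the PR's actual code:

```python
from functools import singledispatch

import numpy as np
import scipy.sparse as sp_sparse


@singledispatch
def _calculate_res(X, *, sums_genes, sums_cells, sum_total, clip, theta):
    # default (dense np.ndarray) path: expected counts from the outer product
    # of the cell and gene margins, then clipped Pearson residuals
    mu = sums_cells @ sums_genes / sum_total
    res = (X - mu) / np.sqrt(mu + mu**2 / theta)
    return np.clip(res, -clip, clip)


@_calculate_res.register(sp_sparse.spmatrix)
def _(X, *, sums_genes, sums_cells, sum_total, clip, theta):
    # sparse path; densifying here is only a placeholder for a real sparse kernel
    return _calculate_res(
        X.toarray(),
        sums_genes=sums_genes,
        sums_cells=sums_cells,
        sum_total=sum_total,
        clip=clip,
        theta=theta,
    )
```

A DaskArray overload could be registered the same way, which would replace the isinstance/issparse branching shown in the diff.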
    sums_cells = axis_sum(X, axis=1, dtype=np.float64).reshape(-1, 1)
    sum_total = sums_genes.sum()

    # TODO: Consider deduplicating computations below which are similarly required in _highly_variable_genes?
There's actually quite some duplication; I think this could be reduced.
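One option, sketched under the assumption that both code paths only need the two marginal sums and the grand total; the helper name is made up, and plain `.sum()` stands in for `axis_sum` so the snippet is self-contained:

```python
import numpy as np


def _pearson_residual_marginals(X):
    """Return per-gene sums, per-cell sums, and the grand total as float64."""
    sums_genes = np.asarray(X.sum(axis=0), dtype=np.float64).reshape(1, -1)
    sums_cells = np.asarray(X.sum(axis=1), dtype=np.float64).reshape(-1, 1)
    sum_total = sums_genes.sum()
    return sums_genes, sums_cells, sum_total
```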
@pytest.mark.parametrize("array_type", ARRAY_TYPES)
@pytest.mark.parametrize("dtype", ["float32", "int64"])
def test_pearson_residuals_inputchecks(array_type, dtype):
    # TODO: do we have a preferred way of making such a small dataset, with the array types option?
I copied the array_type(adata.X) pattern from other tests.
Is this how we want to set up datasets to test through the combinations?
I think so?
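For reference, a minimal version of the pattern being discussed; ARRAY_TYPES normally comes from scanpy's shared test helpers, so the two entries below are stand-ins:

```python
import anndata as ad
import numpy as np
import pytest
import scipy.sparse as sp_sparse

# stand-ins for scanpy's shared ARRAY_TYPES list
ARRAY_TYPES = [np.asarray, sp_sparse.csr_matrix]


@pytest.mark.parametrize("array_type", ARRAY_TYPES)
@pytest.mark.parametrize("dtype", ["float32", "int64"])
def test_small_dataset_setup(array_type, dtype):
    # build a tiny counts matrix once and convert it with array_type
    counts = np.array([[3, 0, 1], [0, 2, 5]], dtype=dtype)
    adata = ad.AnnData(X=array_type(counts))
    assert adata.X.shape == (2, 3)
```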
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2980      +/-   ##
==========================================
- Coverage   76.31%   75.54%   -0.77%
==========================================
  Files         109      117       +8
  Lines       12513    12971     +458
==========================================
+ Hits         9549     9799     +250
- Misses       2964     3172     +208
def custom_clip(x, clip_val):
    x[x < -clip_val] = -clip_val
    x[x > clip_val] = clip_val
    return x
Make this a utility and use it with the other implementation. Also, maybe a better name, like dask_sparse_compat_clip?
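Roughly what that utility could look like, using the name suggested above (it does not exist in scanpy yet); boolean masking is kept because, per the comment in the diff, np.clip does not work with sparse-in-dask:

```python
def dask_sparse_compat_clip(x, clip_val):
    """Clip values of `x` to [-clip_val, clip_val] in place and return it."""
    x[x < -clip_val] = -clip_val
    x[x > clip_val] = clip_val
    return x
```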
    sums_genes = axis_sum(X, axis=0, dtype=np.float64).reshape(1, -1)
    sums_cells = axis_sum(X, axis=1, dtype=np.float64).reshape(-1, 1)
    sum_total = sums_genes.sum()
These functions should work across the types, no? They rely on single dispatch of the first argument and just call the other implementations.
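A quick illustration of the point: the same reduction call works for dense and sparse inputs alike, which is what dispatching on the first argument provides (plain `.sum()` is used here instead of axis_sum so the snippet runs on its own):

```python
import numpy as np
import scipy.sparse as sp_sparse

X_dense = np.arange(6, dtype=np.float64).reshape(2, 3)
X_sparse = sp_sparse.csr_matrix(X_dense)

for X in (X_dense, X_sparse):
    # same margin computation regardless of input type
    sums_genes = np.asarray(X.sum(axis=0), dtype=np.float64).reshape(1, -1)
    sums_cells = np.asarray(X.sum(axis=1), dtype=np.float64).reshape(-1, 1)
    assert np.isclose(sums_genes.sum(), sums_cells.sum())
```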
def clip(x, min, max):
    x[x < min] = min
    x[x > max] = max
    return x
Definitely a feather in the cap for making this a utility.
scanpy/tests/test_normalization.py
Outdated
def general_max(x, array_type):
    if "dask" in array_type.__name__:
        return x.compute().max()
    return x.max()


def general_min(x, array_type):
    if "dask" in array_type.__name__:
        return x.compute().min()
    return x.min()
better as maybe_compute_{min,max}
Also, can't we check isinstance(x, DaskArray)?
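Both suggestions combined might look something like this; the DaskArray import comes from dask's public API rather than scanpy's compat alias, and the helper names follow the rename proposed above:

```python
try:
    from dask.array import Array as DaskArray
except ImportError:  # dask is an optional dependency
    DaskArray = ()  # isinstance(x, ()) is always False


def maybe_compute_max(x):
    # compute lazily evaluated dask arrays before reducing
    return x.compute().max() if isinstance(x, DaskArray) else x.max()


def maybe_compute_min(x):
    return x.compute().min() if isinstance(x, DaskArray) else x.min()
```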
@@ -90,6 +92,47 @@ def clac_clipped_res_sparse(gene: int, cell: int, value: np.float64) -> np.float
    return residuals


def _calculate_res_dense_vectorized(
Why is this called dense if it operates on sparse inputs? Also, why are the type hints np.ndarray then? And why the extra dtype? Can we guarantee that?
@ilan-gold, I put some comments where I'm most interested in your feedback for a first step forward