Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pearson Residuals: normalization and hvg #2980

Draft
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

eroell
Copy link
Contributor

@eroell eroell commented Apr 5, 2024

  • Release notes not necessary because: tbd

@ilan-gold, I put some comments where I'm most interested in your feedback for a first step forward

@eroell eroell requested a review from ilan-gold April 5, 2024 10:41
n_cells: int,
) -> np.ndarray[np.float64]:
# TODO: potentially a lot to rewrite here
# TODO: is there a better/more common way we use? e.g. with dispatching?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something I've asked myself multiple times

return x

# np clip doesn't work with sparse-in-dask: although pre_res is not sparse since computed as outer product
# TODO: we have such a clip function in multiple places..?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not priority here

@@ -184,7 +231,11 @@ def _highly_variable_pearson_residuals(
if clip < 0:
raise ValueError("Pearson residuals require `clip>=0` or `clip=None`.")

if sp_sparse.issparse(X_batch):
if isinstance(X_batch, DaskArray):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

having _calculate_res_dense and _calculate_res_sparse was there before, maybe singledispatch cleaner here.

the non-jitted dask computation is ~10x slower at the moment.

sums_cells = axis_sum(X, axis=1, dtype=np.float64).reshape(-1, 1)
sum_total = sums_genes.sum()

# TODO: Consider deduplicating computations below which are similarly required in _highly_variable_genes?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's actually quite some duplication, think this might be reduced

@pytest.mark.parametrize("array_type", ARRAY_TYPES)
@pytest.mark.parametrize("dtype", ["float32", "int64"])
def test_pearson_residuals_inputchecks(array_type, dtype):
# TODO: do we have a preferred way of making such a small dataset, wich the array types option?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied the array_type(adata.X) from other tests.
is this how we want to set up datasets to test through the combinations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so?

Copy link

codecov bot commented Apr 5, 2024

Codecov Report

Attention: Patch coverage is 97.56098% with 1 line in your changes missing coverage. Please review.

Project coverage is 75.54%. Comparing base (896e249) to head (7eb31ff).

Current head 7eb31ff differs from pull request most recent head d58e083

Please upload reports for the commit d58e083 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2980      +/-   ##
==========================================
- Coverage   76.31%   75.54%   -0.77%     
==========================================
  Files         109      117       +8     
  Lines       12513    12971     +458     
==========================================
+ Hits         9549     9799     +250     
- Misses       2964     3172     +208     
Files Coverage Δ
scanpy/_utils/__init__.py 74.84% <100.00%> (ø)
scanpy/experimental/pp/_normalization.py 95.06% <100.00%> (ø)
scanpy/experimental/pp/_highly_variable_genes.py 66.27% <96.00%> (ø)

... and 223 files with indirect coverage changes

Comment on lines 118 to 121
def custom_clip(x, clip_val):
x[x < -clip_val] = -clip_val
x[x > clip_val] = clip_val
return x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make a utility and use with the other implementation. Also maybe a better name? dask_sparse_compat_clip?

Comment on lines 75 to 77
sums_genes = axis_sum(X, axis=0, dtype=np.float64).reshape(1, -1)
sums_cells = axis_sum(X, axis=1, dtype=np.float64).reshape(-1, 1)
sum_total = sums_genes.sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions should work across the types, no? They rely on single dispatch of the first argument and just call the other imeplementations

Comment on lines 154 to 157
def clip(x, min, max):
x[x < min] = min
x[x > max] = max
return x
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely a feather in the cap for making this a utility.

Comment on lines 35 to 44
def general_max(x, array_type):
if "dask" in array_type.__name__:
return x.compute().max()
return x.max()


def general_min(x, array_type):
if "dask" in array_type.__name__:
return x.compute().min()
return x.min()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better as maybe_compute_{min,max}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also can't we check isintance(x, DaskArray)?

@@ -90,6 +92,47 @@ def clac_clipped_res_sparse(gene: int, cell: int, value: np.float64) -> np.float
return residuals


def _calculate_res_dense_vectorized(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this called dense if it operates on sparse inputs? Also why are the type hints np.ndarray then? Also, why the extra dtype? Can we guarantee that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dask support for normalize_pearson_residuals, highly_variable_genes(flavor='pearson_residuals')
2 participants