-
Notifications
You must be signed in to change notification settings - Fork 611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Leiden clustering returns different results for Scanpy 1.10.4
and 1.9.3
#3422
Comments
Hm, I need to take a closer look at that. I didn‘t try to reproduce it, but logically, it seems impossible, the code’s behavior looks identical to me: git diff 1.9.3:scanpy/tools/_leiden.py 1.10.4:src/scanpy/tools/_leiden.py |
@flying-sheep I just tested it again and it still seems to be inconsistent between Another possibility is that the implementation of |
@alam-shahul Thanks for the pointers, both the |
I think the culprit lie with our random state handling: #2946 (comment) |
I suspect (at least for #2946) the culprit is not ScanPy directly but one of its dependencies (specifically, a change in the dependency's internals between two versions). I am not sure which dependency specifically at the moment. |
#2536 So this PR caused the change. I'll need to look into why. |
That PR was huge. I did my best to keep it compatible but I’m not surprised that something slipped through. IIRC think the main difference is that after the PR, If it does, the ideal fix would be
On the other hand, people have been using the new code for a year, so going back might break their backwards compatibility. |
Ok @flying-sheep following your suggestion I was able to get the connectivities to match but not the distances. I would assume the culprit lies here: https://github.com/scverse/scanpy/pull/2536/files#diff-934eec0a4b88db7c4f7c099d803a25f6b81ca654579bd1ee84fd28b7858b2de2L380-L423 We don't re-get the self._distances = _get_sparse_matrix_from_indices_distances_umap(
knn_indices, knn_distances, self._adata.shape[0], self.n_neighbors
) seems to fix things up to 5 decimal places. But I am super out of my depth here - I don't really understand why it was removed or there in the first place. Here's the very rough diff --git a/src/scanpy/neighbors/__init__.py b/src/scanpy/neighbors/__init__.py
index 21404372..d4e24d07 100644
--- a/src/scanpy/neighbors/__init__.py
+++ b/src/scanpy/neighbors/__init__.py
@@ -10,6 +10,7 @@ from warnings import warn
import numpy as np
import scipy
from scipy.sparse import issparse
+from sklearn.metrics import pairwise_distances
from sklearn.utils import check_random_state
from .. import _utils
@@ -19,6 +20,7 @@ from .._settings import settings
from .._utils import NeighborsView, _doc_params, get_literal_vals
from . import _connectivity
from ._common import (
+ _get_indices_distances_from_dense_matrix,
_get_indices_distances_from_sparse_matrix,
_get_sparse_matrix_from_indices_distances,
)
@@ -571,8 +573,8 @@ class Neighbors:
self.n_neighbors = n_neighbors
self.knn = knn
X = _choose_representation(self._adata, use_rep=use_rep, n_pcs=n_pcs)
- self._distances = transformer.fit_transform(X)
- knn_indices, knn_distances = _get_indices_distances_from_sparse_matrix(
+ self._distances = pairwise_distances(X, metric=metric, **metric_kwds)
+ knn_indices, knn_distances = _get_indices_distances_from_dense_matrix(
self._distances, n_neighbors
)
if shortcut:
@@ -593,14 +595,42 @@ class Neighbors:
with contextlib.suppress(Exception):
self._rp_forest = _make_forest_dict(index)
start_connect = logg.debug("computed neighbors", time=start_neighbors)
-
if method == "umap":
+
+ def _get_sparse_matrix_from_indices_distances_umap(
+ knn_indices, knn_dists, n_obs, n_neighbors
+ ):
+ rows = np.zeros((n_obs * n_neighbors), dtype=np.int64)
+ cols = np.zeros((n_obs * n_neighbors), dtype=np.int64)
+ vals = np.zeros((n_obs * n_neighbors), dtype=np.float64)
+
+ for i in range(knn_indices.shape[0]):
+ for j in range(n_neighbors):
+ if knn_indices[i, j] == -1:
+ continue # We didn't get the full knn for i
+ if knn_indices[i, j] == i:
+ val = 0.0
+ else:
+ val = knn_dists[i, j]
+
+ rows[i * n_neighbors + j] = i
+ cols[i * n_neighbors + j] = knn_indices[i, j]
+ vals[i * n_neighbors + j] = val
+ import scipy.sparse as sp
+
+ result = sp.coo_matrix((vals, (rows, cols)), shape=(n_obs, n_obs))
+ result.eliminate_zeros()
+ return result.tocsr()
+
self._connectivities = _connectivity.umap(
knn_indices,
knn_distances,
n_obs=self._adata.shape[0],
n_neighbors=self.n_neighbors,
)
+ self._distances = _get_sparse_matrix_from_indices_distances_umap(
+ knn_indices, knn_distances, self._adata.shape[0], self.n_neighbors
+ )
elif method == "gauss":
self._connectivities = _connectivity.gauss(
self._distances, self.n_neighbors, knn=self.knn As for what we should do about this, I am also a bit at a loss. I would tend towards making a public announcement and going back to the original behavior. Certainly, something should be done, or at least stated. |
OK, let’s think about this. What it does should already happen here: scanpy/src/scanpy/neighbors/__init__.py Lines 581 to 584 in 75c246d
Indeed: why and how does
|
I think I am also curious as to why was |
The umap version also deals with -1s which we never feed in. Otherwise I can’t see a difference other than the umap version being dog slow. That’s why it was removed: it’s a slower version of something we have with features we don’t use. But you’re saying switching it out makes a difference so seems like I’m wrong. I just wonder what I’m not seeing |
Ok great to know. I'll have a closer look then. |
Hm, I don’t think I can do much right now, I’m blocked by lmcinnes/pynndescent#250 /edit: nevermind, I found a workaround: lmcinnes/pynndescent#250 (comment) |
Hmm, if we look at the number of distances per knn matrix row, we get this: (the entries <19 are just identical rows, i.e. distance 0, so they get eliminated in the old version of the code)
the result is a high variance in columns being different between the two versions (up to 12 different distances), but the rows are very close (mostly 0–2, sometimes 4 different distances)
as a particularly egregious example, here’s column 91 (h) and row 91 (v), respectively ( As you can see, it varies a lot in which knn lists sample 91 is. Diff details
|
hmm, I think the new code just has better support for duplicate data rows actually … |
@alam-shahul @keller-mark Could you both try out this #3444 in your respective use cases and report back? I believe this should help. @alam-shahul your distances/connectivities match with good tolerance and the leiden results appear to be completely identical, although I don't have access to the "full" data. |
to be clear: please try running we won’t restore full backwards compatibility because of downsides, but we will add an option to get the old results. |
Thanks @flying-sheep and @ilan-gold . I don't think this solves the issue with #2946 . The image comparison tests in #3446 are still failing in the python 3.10 + min-dependencies environment (but not a python 3.12 + normal environment). I think this points to the issue in #2946 being with a change in how computations are performed between versions of one or more dependencies and therefore despite the same random seed, the end results diverge. I am not sure that we should expect random results, even with a seed set, to be identical in different environments. |
Especially since numpy’s recommended random number API has no stability guarantee. So going forward, we can only have one if people pin a bunch of other dependencies, which means that I don’t feel like it will forever be worth the effort to have a stability guarantee in scanpy. |
Right, I think one could only guarantee stability given some version of scanpy and a fully pinned set of dependencies. |
I should have checked better. I wasted so much time on this. We don’t support duplicated data. To not completely waste a week of debugging this, here the final results: On macOS ARM64, using But instead, you should deduplicate your data or otherwise deal with duplicates. Here is the code. specify from __future__ import annotations
from dataclasses import KW_ONLY, dataclass
from typing import TYPE_CHECKING
import numpy as np
from sklearn.base import TransformerMixin
from .._common import (
_get_indices_distances_from_dense_matrix,
_get_sparse_matrix_from_indices_distances,
)
if TYPE_CHECKING:
from collections.abc import Mapping
from typing import Literal, Self
from numpy.typing import NDArray
from ..._utils import _CSMatrix
_Metric = Literal["cityblock", "cosine", "euclidean", "l1", "l2", "manhattan"]
_MatrixLike = NDArray | _CSMatrix
@dataclass
class PairwiseDistancesTransformer(TransformerMixin):
_: KW_ONLY
algorithm: Literal["brute"]
n_jobs: int
n_neighbors: int
metric: _Metric
metric_params: Mapping[str, object]
def fit(self, x: _MatrixLike) -> Self:
self.x_ = x
return self
def transform(self, y: _MatrixLike | None) -> _CSMatrix:
from sklearn.metrics import pairwise_distances
d_arr = pairwise_distances(self.x_, y, metric=self.metric, **self.metric_params)
ind, dist = _get_indices_distances_from_dense_matrix(
d_arr, self.n_neighbors + 1
)
rv = _get_sparse_matrix_from_indices_distances(ind, dist, keep_self=True)
return rv |
Thanks so much, sorry about taking up your time. This is useful, I'll use this to ensure reproducibility in my own code. I didn't realize there was duplicated data here. Indeed, this is some simulated data, so I could see there being duplicates (although I'm a bit surprised, since it should be randomly generated). I also found that the clustering results changed when applied to real data, which I wouldn't have thought would have duplicates, so I'll have to check that. Maybe it would be possible to add a documentation note that duplicated rows are not supported? |
Good point, but no idea where we could add a note about this. Very few libraries are capable of dealing with duplicates, so this was always implicitly true for basically everything we do, especially kNN. After all, the following question has no answer: “given two identical rows a and b, and a different row c, is a or b closer to c?”
this is not the first time someone had duplicates in “real” data, it can happen for multiple reasons, such as measurement artifacts, all-0 rows, subsampling-with-replacement, … |
Please make sure these conditions are met
What happened?
I am doing some data analysis that depends on Scanpy, and for reproducibility reasons it is important that the results from Scanpy's implementation of Leiden clustering are consistent.
It seems that the changes to the
sc.tl.leiden
implementation inv1.10
change the clustering results, although they are not supposed to (there is a warning that the defaults will change in the future, but that they have not yet).Can this be patched?
Minimal code sample
Error output
Versions
The text was updated successfully, but these errors were encountered: