Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

Closed
wants to merge 4 commits into from

Conversation

fedarko
Copy link
Contributor

@fedarko fedarko commented Feb 23, 2020

Currently, pandas v1 is the reason folks are running into problems like #837. This is also breaking a couple of other packages that rely on biom, including but probably not limited to songbird and mmvec.

This PR should fix the problem for most datasets. However, there are a few complicating factors that I think will prevent merging this PR in right now, both due to pd.DataFrame.sparse.from_spmatrix() (which is pandas' recommended way to convert SciPy CSR matrices to DataFrames-with-sparse-values):

  1. I'm pretty sure from_spmatrix() doesn't handle the sparse data in these matrices properly. When I make a huge but super sparse matrix in SciPy and try to convert it to a DF, my computer explodes at the conversion step (even though initializing the matrix takes on the order of 15-ish seconds for SciPy). I've opened DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices? pandas-dev/pandas#32196 accordingly. Update: this assumption was incorrect on my part, and it's expected that really wide matrices like this really will slow down pandas. See the pandas issue thread linked for profiling details / their response.

    To ensure that this change doesn't cause a repeat of [python] to_dataframe does not produce sparse data frames #808, I added a test in this PR that makes sure that these sorts of large matrices can be converted to sparse DataFrames in a sane amount of time (max 30 seconds). This currently fails, since biom.table.Table.to_dataframe() in this PR uses from_spmatrix() under the hood. Making to_dataframe() faster should cause this test to pass. (If I'm being overzealous and sparse DataFrames are really going to be slower to create than I'd thought, it's totally possible to bump up the 30-second timer in this test.) Update: will need to do this and/or just try a less crazy sparse matrix than a 1mil x 1mil one.

  2. ...Also I think from_spmatrix() has other bugs that might make it unreliable.

ANYWAY TLDR hopefully these changes are useful at least as a starting point for a solution to this problem. Thanks!

pd.SparseDataFrame has been removed -- the replacement, per the
pandas docs*, is making a normal DataFrame with sparse columns.

Making this change actually simplifies a few things on our end in
the codebase, which is nice :)

I bumped up the min pandas version in setup.py here because the new
sparse stuff was added in v0.25.0, but I imagine other version things
for the biom-format package (e.g. conda / conda-forge stuff) will
also need to be adjusted accordingly.

*https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
(It currently fails. Will explain in PR message.)
@wasade
Copy link
Member

wasade commented Feb 23, 2020

Thanks, @fedarko and thank you for the detailed exploration here. The observation with from_spmatrix in (2) is very concerning. I believe we need to wait on upstream to address that issue.

@ebolyen
Copy link
Member

ebolyen commented Feb 26, 2020

Re: 2, is this not an issue with pandas < 1.0?
I was planning to do an skbio release to get pandas 1.0 in for QIIME 2, but if from_spmatrix is broken, then we might need to just pin to the older pandas for a bit longer.

@ebolyen
Copy link
Member

ebolyen commented Feb 26, 2020

Ah, from the linked PR in songbird, that appears to be the case.

@fedarko
Copy link
Contributor Author

fedarko commented Feb 26, 2020

@ebolyen I haven't checked if (2) is a problem with pandas < 1.0 -- the reason songbird (and a few other things) were breaking IIRC is that biom.Table.to_dataframe() calls pandas.SparseDataFrame() directly, which was removed in pandas 1.0.

@ebolyen
Copy link
Member

ebolyen commented Feb 26, 2020

I will try to reproduce in the last non-1.0 version to see what's going on.

@ebolyen
Copy link
Member

ebolyen commented Feb 26, 2020

Yep, this is an issue in our last release also (2019.10):

In [1]: from scipy.sparse import csr_matrix                                     

In [2]: import pandas as pd                                                     

In [3]: a = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])                       

In [4]: a[0, 2] = 0                                                             

In [5]: a.todense()                                                             
Out[5]: 
matrix([[1, 0, 0],
        [2, 2, 0],
        [0, 2, 2]], dtype=int64)

In [6]: pd.DataFrame.sparse.from_spmatrix(a)                                    
Out[6]: 
   0  1  2
0  1  0  0
1  2  2  0
2  0  2  0

In [7]: pd.__version__                                                          
Out[7]: '0.25.2'

To what extent does any of our code mutate the CSR matrix before loading it as a df? @wasade do we need a workaround for this?

@wasade
Copy link
Member

wasade commented Feb 26, 2020

@ebolyen, the .matrix_data member is mutable...

@ebolyen
Copy link
Member

ebolyen commented Feb 26, 2020

It doesn't look like any q2 code attempts mutation, the most common path is toarray() which shouldn't have this issue.

grep qiime2
§ grep -rI '.matrix_data' *
paper2/notebooks/qiime2-protocol-API.ipynb:      "  for r in self.matrix_data.tocsr()]\n",
q2-diversity/build/lib/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/build/lib/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/build/lib/q2_diversity/_beta/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/build/lib/q2_diversity/_beta/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/q2_diversity/_beta/_method.py:    counts = table.matrix_data.toarray().T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = presence_absence_table.matrix_data.toarray().astype(int).T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = presence_absence_table.matrix_data.toarray().astype(int).T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = table.matrix_data.toarray().T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = table.matrix_data.toarray().T
q2-feature-table/q2_feature_table/tests/test_transform.py:        npt.assert_array_equal(a.matrix_data.toarray(),
q2-feature-table/q2_feature_table/tests/test_transform.py:        npt.assert_array_equal(a.matrix_data.toarray(),
q2-gamma/q2_gamma/visualizers/plot/visualizer.py:    array = table.pa().matrix_data.astype(np.uint8).toarray()
q2-sample-classifier/q2_sample_classifier/utilities.py:    for i, row in enumerate(feature_data.matrix_data.T):
q2-types/q2_types/feature_data/tests/test_transformer.py:        table = biom.Table(table.matrix_data,
q2-types/q2_types/feature_data/tests/test_transformer.py:        table = biom.Table(table.matrix_data,
q2-types/q2_types/feature_table/_transformer.py:    return biom.Table(table.matrix_data,
q2-types/q2_types/feature_table/_transformer.py:    array = table.matrix_data.toarray().T
grep skbio
§ grep -rI '.matrix_data' *
biom-format/biom/table.py:    def matrix_data(self):
biom-format/tests/test_table.py:        obs = self.simple_derived.matrix_data
biom-format/tests/test_table.py:            self.simple_derived.matrix_data = 'foo'
biom-format/ChangeLog.md:* Matrix data can now be accessed by the ``Table.matrix_data`` property
biom-format/ChangeLog.md:* added ``Table.matrix_data`` property
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:    matrix_data = center_distance_matrix(distance_matrix.data, inplace=inplace)
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        if method == "fsvd" and matrix_data.shape[0] > 10:
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        number_of_dimensions = matrix_data.shape[0]
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = eigh(matrix_data)
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = _fsvd(matrix_data, number_of_dimensions)
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        sum_eigenvalues = np.trace(matrix_data)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:    matrix_data = center_distance_matrix(distance_matrix.data, inplace=inplace)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        if method == "fsvd" and matrix_data.shape[0] > 10:
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        number_of_dimensions = matrix_data.shape[0]
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = eigh(matrix_data)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = _fsvd(matrix_data, number_of_dimensions)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        sum_eigenvalues = np.trace(matrix_data)
unifrac/unifrac/tests/test_api.py:        cnts = table_inmem.matrix_data.astype(int).toarray().T


It's a bummer, but I don't think we've been directly impacted by this (yet at least).

@wasade
Copy link
Member

wasade commented Feb 26, 2020

.transform(), for example, can operate in place on the CSR/CSC representation. I'm unsure about impact to QIIME 2 -- hopefully most operations rely on the Table object rather than the DataFrame. I cannot predict the impact for other users of biom.

@mortonjt
Copy link
Contributor

mortonjt commented Feb 26, 2020 via email

@wasade
Copy link
Member

wasade commented Feb 26, 2020

Dense representations were explicitly removed from BIOM a long time ago. I'm not very eager to add support back in because a dependency has a bug.

@wasade
Copy link
Member

wasade commented Sep 4, 2020

@fedarko, I ended up just putting in the from_spmatrix piece similar to what's done here within #848. It looks like this is going to be or was fixed in Pandas...

@wasade wasade closed this Sep 4, 2020
@wasade wasade mentioned this pull request Nov 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants