Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

fedarko · 2020-02-23T07:30:55Z

Currently, pandas v1 is the reason folks are running into problems like #837. This is also breaking a couple of other packages that rely on biom, including but probably not limited to songbird and mmvec.

This PR should fix the problem for most datasets. However, there are a few complicating factors that I think will prevent merging this PR in right now, both due to pd.DataFrame.sparse.from_spmatrix() (which is pandas' recommended way to convert SciPy CSR matrices to DataFrames-with-sparse-values):

I'm pretty sure from_spmatrix() doesn't handle the sparse data in these matrices properly. When I make a huge but super sparse matrix in SciPy and try to convert it to a DF, my computer explodes at the conversion step (even though initializing the matrix takes on the order of 15-ish seconds for SciPy). I've opened DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices? pandas-dev/pandas#32196 accordingly. Update: this assumption was incorrect on my part, and it's expected that really wide matrices like this really will slow down pandas. See the pandas issue thread linked for profiling details / their response.

To ensure that this change doesn't cause a repeat of [python] to_dataframe does not produce sparse data frames #808, I added a test in this PR that makes sure that these sorts of large matrices can be converted to sparse DataFrames in a sane amount of time (max 30 seconds). This currently fails, since biom.table.Table.to_dataframe() in this PR uses from_spmatrix() under the hood. Making to_dataframe() faster should cause this test to pass. (If I'm being overzealous and sparse DataFrames are really going to be slower to create than I'd thought, it's totally possible to bump up the 30-second timer in this test.) Update: will need to do this and/or just try a less crazy sparse matrix than a 1mil x 1mil one.
...Also I think from_spmatrix() has other bugs that might make it unreliable.

ANYWAY TLDR hopefully these changes are useful at least as a starting point for a solution to this problem. Thanks!

pd.SparseDataFrame has been removed -- the replacement, per the pandas docs*, is making a normal DataFrame with sparse columns. Making this change actually simplifies a few things on our end in the codebase, which is nice :) I bumped up the min pandas version in setup.py here because the new sparse stuff was added in v0.25.0, but I imagine other version things for the biom-format package (e.g. conda / conda-forge stuff) will also need to be adjusted accordingly. *https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

(It currently fails. Will explain in PR message.)

wasade · 2020-02-23T16:45:28Z

Thanks, @fedarko and thank you for the detailed exploration here. The observation with from_spmatrix in (2) is very concerning. I believe we need to wait on upstream to address that issue.

ebolyen · 2020-02-26T15:32:16Z

Re: 2, is this not an issue with pandas < 1.0?
I was planning to do an skbio release to get pandas 1.0 in for QIIME 2, but if from_spmatrix is broken, then we might need to just pin to the older pandas for a bit longer.

ebolyen · 2020-02-26T15:33:09Z

Ah, from the linked PR in songbird, that appears to be the case.

fedarko · 2020-02-26T16:54:02Z

@ebolyen I haven't checked if (2) is a problem with pandas < 1.0 -- the reason songbird (and a few other things) were breaking IIRC is that biom.Table.to_dataframe() calls pandas.SparseDataFrame() directly, which was removed in pandas 1.0.

ebolyen · 2020-02-26T17:02:36Z

I will try to reproduce in the last non-1.0 version to see what's going on.

ebolyen · 2020-02-26T17:05:54Z

Yep, this is an issue in our last release also (2019.10):

In [1]: from scipy.sparse import csr_matrix                                     

In [2]: import pandas as pd                                                     

In [3]: a = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])                       

In [4]: a[0, 2] = 0                                                             

In [5]: a.todense()                                                             
Out[5]: 
matrix([[1, 0, 0],
        [2, 2, 0],
        [0, 2, 2]], dtype=int64)

In [6]: pd.DataFrame.sparse.from_spmatrix(a)                                    
Out[6]: 
   0  1  2
0  1  0  0
1  2  2  0
2  0  2  0

In [7]: pd.__version__                                                          
Out[7]: '0.25.2'

To what extent does any of our code mutate the CSR matrix before loading it as a df? @wasade do we need a workaround for this?

wasade · 2020-02-26T17:08:19Z

@ebolyen, the .matrix_data member is mutable...

ebolyen · 2020-02-26T17:16:18Z

It doesn't look like any q2 code attempts mutation, the most common path is toarray() which shouldn't have this issue.

grep qiime2

§ grep -rI '.matrix_data' *
paper2/notebooks/qiime2-protocol-API.ipynb:      "  for r in self.matrix_data.tocsr()]\n",
q2-diversity/build/lib/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/build/lib/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/build/lib/q2_diversity/_beta/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/build/lib/q2_diversity/_beta/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/q2_diversity/_alpha/_method.py:    counts = table.matrix_data.toarray().astype(int).T
q2-diversity/q2_diversity/_beta/_method.py:    counts = table.matrix_data.toarray().T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = presence_absence_table.matrix_data.toarray().astype(int).T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = presence_absence_table.matrix_data.toarray().astype(int).T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = table.matrix_data.toarray().T
q2-diversity-lib/q2_diversity_lib/alpha.py:    counts = table.matrix_data.toarray().T
q2-feature-table/q2_feature_table/tests/test_transform.py:        npt.assert_array_equal(a.matrix_data.toarray(),
q2-feature-table/q2_feature_table/tests/test_transform.py:        npt.assert_array_equal(a.matrix_data.toarray(),
q2-gamma/q2_gamma/visualizers/plot/visualizer.py:    array = table.pa().matrix_data.astype(np.uint8).toarray()
q2-sample-classifier/q2_sample_classifier/utilities.py:    for i, row in enumerate(feature_data.matrix_data.T):
q2-types/q2_types/feature_data/tests/test_transformer.py:        table = biom.Table(table.matrix_data,
q2-types/q2_types/feature_data/tests/test_transformer.py:        table = biom.Table(table.matrix_data,
q2-types/q2_types/feature_table/_transformer.py:    return biom.Table(table.matrix_data,
q2-types/q2_types/feature_table/_transformer.py:    array = table.matrix_data.toarray().T

grep skbio

§ grep -rI '.matrix_data' *
biom-format/biom/table.py:    def matrix_data(self):
biom-format/tests/test_table.py:        obs = self.simple_derived.matrix_data
biom-format/tests/test_table.py:            self.simple_derived.matrix_data = 'foo'
biom-format/ChangeLog.md:* Matrix data can now be accessed by the ``Table.matrix_data`` property
biom-format/ChangeLog.md:* added ``Table.matrix_data`` property
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:    matrix_data = center_distance_matrix(distance_matrix.data, inplace=inplace)
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        if method == "fsvd" and matrix_data.shape[0] > 10:
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        number_of_dimensions = matrix_data.shape[0]
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = eigh(matrix_data)
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = _fsvd(matrix_data, number_of_dimensions)
scikit-bio/build/lib.linux-x86_64-3.7/skbio/stats/ordination/_principal_coordinate_analysis.py:        sum_eigenvalues = np.trace(matrix_data)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:    matrix_data = center_distance_matrix(distance_matrix.data, inplace=inplace)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        if method == "fsvd" and matrix_data.shape[0] > 10:
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        number_of_dimensions = matrix_data.shape[0]
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = eigh(matrix_data)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        eigvals, eigvecs = _fsvd(matrix_data, number_of_dimensions)
scikit-bio/skbio/stats/ordination/_principal_coordinate_analysis.py:        sum_eigenvalues = np.trace(matrix_data)
unifrac/unifrac/tests/test_api.py:        cnts = table_inmem.matrix_data.astype(int).toarray().T

It's a bummer, but I don't think we've been directly impacted by this (yet at least).

wasade · 2020-02-26T17:49:09Z

.transform(), for example, can operate in place on the CSR/CSC representation. I'm unsure about impact to QIIME 2 -- hopefully most operations rely on the Table object rather than the DataFrame. I cannot predict the impact for other users of biom.

mortonjt · 2020-02-26T18:18:33Z

One thing to note is that many users who are using the dataframe conversion rely on dense dataframes rather than sparse dataframes. In Songbird, we do the conversion from sparse to dense versions anyways. One possibility is to open up the option to do an explicit dense conversion without the an intermediate sparse dataframe (with the warning that the memory consumption maybe large).

…

On Wed, Feb 26, 2020 at 12:49 PM Daniel McDonald ***@***.***> wrote: .transform(), for exampel, can operate in place <https://github.com/biocore/biom-format/blob/master/biom/_transform.pyx#L51> on the CSR/CSC representation. I'm unsure about impact to QIIME 2 -- hopefully most operations rely on the Table object rather than the DataFrame. I cannot predict the impact for other users of biom. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#838?email_source=notifications&email_token=AA75VXNHCI4FRE5VCISNIM3RE2TRPA5CNFSM4KZYLIOKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENBGWNI#issuecomment-591555381>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA75VXP4XP3E3CUKHTEFG4DRE2TRPANCNFSM4KZYLIOA> .

wasade · 2020-02-26T18:22:50Z

Dense representations were explicitly removed from BIOM a long time ago. I'm not very eager to add support back in because a dependency has a bug.

wasade · 2020-09-04T17:54:54Z

@fedarko, I ended up just putting in the from_spmatrix piece similar to what's done here within #848. It looks like this is going to be or was fixed in Pandas...

fedarko added 4 commits February 22, 2020 21:12

TST: Add test checking sparse DF conversion speed

565dc5d

(It currently fails. Will explain in PR message.)

TST: fix typo in comment

d495a36

TST/DOC: fix typo in a test and improve its docs

b08ce66

fedarko mentioned this pull request Feb 24, 2020

(Temporarily) pin pandas version in the README's standalone install instructions biocore/songbird#118

Merged

wasade mentioned this pull request Apr 14, 2020

DEP: update mafft pin biocore/deblur#202

Closed

fedarko mentioned this pull request Aug 30, 2020

TypeError: astype() got an unexpected keyword argument 'copy' biocore/qurro#312

Closed

wasade closed this Sep 4, 2020

wasade mentioned this pull request Nov 3, 2020

Test issue with pandas1.0 #837

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

fedarko commented Feb 23, 2020 •

edited

Loading

wasade commented Feb 23, 2020

ebolyen commented Feb 26, 2020

ebolyen commented Feb 26, 2020

fedarko commented Feb 26, 2020

ebolyen commented Feb 26, 2020

ebolyen commented Feb 26, 2020 •

edited

Loading

wasade commented Feb 26, 2020

ebolyen commented Feb 26, 2020 •

edited

Loading

wasade commented Feb 26, 2020 •

edited

Loading

mortonjt commented Feb 26, 2020 via email

wasade commented Feb 26, 2020

wasade commented Sep 4, 2020

Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838

Conversation

fedarko commented Feb 23, 2020 • edited Loading

wasade commented Feb 23, 2020

ebolyen commented Feb 26, 2020

ebolyen commented Feb 26, 2020

fedarko commented Feb 26, 2020

ebolyen commented Feb 26, 2020

ebolyen commented Feb 26, 2020 • edited Loading

wasade commented Feb 26, 2020

ebolyen commented Feb 26, 2020 • edited Loading

wasade commented Feb 26, 2020 • edited Loading

mortonjt commented Feb 26, 2020 via email

wasade commented Feb 26, 2020

wasade commented Sep 4, 2020

fedarko commented Feb 23, 2020 •

edited

Loading

ebolyen commented Feb 26, 2020 •

edited

Loading

ebolyen commented Feb 26, 2020 •

edited

Loading

wasade commented Feb 26, 2020 •

edited

Loading