Make to_dataframe use the new Sparse data structures from pandas >= 0.25 #838
pd.SparseDataFrame has been removed -- the replacement, per the pandas docs*, is making a normal DataFrame with sparse columns. Making this change actually simplifies a few things on our end in the codebase, which is nice :)
I bumped up the min pandas version in setup.py here because the new sparse stuff was added in v0.25.0, but I imagine other version things for the biom-format package (e.g. conda / conda-forge stuff) will also need to be adjusted accordingly.
*https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
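For readers unfamiliar with the replacement API, here is a minimal sketch of what "a normal DataFrame with sparse columns" looks like. The toy matrix and the O1/S1-style labels are made up for illustration and are not taken from biom; only the pandas and SciPy calls are real.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy matrix; the row and column labels below are hypothetical.
mat = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])

# pandas >= 0.25: a regular DataFrame whose columns hold sparse arrays.
df = pd.DataFrame.sparse.from_spmatrix(
    mat,
    index=["O1", "O2", "O3"],
    columns=["S1", "S2", "S3"],
)

# Every column should report a SparseDtype.
all_sparse = all(isinstance(dt, pd.SparseDtype) for dt in df.dtypes)

# The .sparse accessor exposes the density and a dense conversion.
density = df.sparse.density
dense = df.sparse.to_dense()
```

The result behaves like an ordinary DataFrame, so downstream code that only indexes or iterates should not need to know the columns are sparse.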
(It currently fails. Will explain in PR message.)
Thanks, @fedarko and thank you for the detailed exploration here. The observation with |
Re: 2, is this not an issue with pandas < 1.0? |
Ah, from the linked PR in songbird, that appears to be the case. |
@ebolyen I haven't checked if (2) is a problem with pandas < 1.0 -- the reason songbird (and a few other things) were breaking IIRC is that biom.Table.to_dataframe() calls pandas.SparseDataFrame() directly, which was removed in pandas 1.0. |
I will try to reproduce in the last non-1.0 version to see what's going on. |
Yep, this is an issue in our last release also (2019.10):
In [1]: from scipy.sparse import csr_matrix
In [2]: import pandas as pd
In [3]: a = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])
In [4]: a[0, 2] = 0
In [5]: a.todense()
Out[5]:
matrix([[1, 0, 0],
        [2, 2, 0],
        [0, 2, 2]], dtype=int64)
In [6]: pd.DataFrame.sparse.from_spmatrix(a)
Out[6]:
   0  1  2
0  1  0  0
1  2  2  0
2  0  2  0
In [7]: pd.__version__
Out[7]: '0.25.2'
To what extent does any of our code mutate the CSR matrix before loading it as a df? @wasade do we need a workaround for this? |
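One way to check whether a given environment is affected is to round-trip the same mutated matrix and compare against the dense form. This is only a sketch of such a check, not something from the PR; the variable names are illustrative.

```python
import warnings

import numpy as np
import pandas as pd
from scipy.sparse import SparseEfficiencyWarning, csr_matrix

a = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])

# Assigning to a CSR matrix changes its sparsity structure; SciPy warns
# about the cost, which is irrelevant for this check.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", SparseEfficiencyWarning)
    a[0, 2] = 0

expected = a.toarray()
converted = pd.DataFrame.sparse.from_spmatrix(a).sparse.to_dense().to_numpy()

# False on an affected pandas version (the session above shows 0.25.2
# returning a wrong last row); True when the conversion round-trips.
round_trip_ok = bool(np.array_equal(expected, converted))
```

Running this against each pandas version of interest would show where the conversion starts to round-trip correctly.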
@ebolyen, the |
It doesn't look like any q2 code attempts mutation, the most common path is grep qiime2
grep skbio
It's a bummer, but I don't think we've been directly impacted by this (yet at least). |
|
One thing to note is that many users of the dataframe conversion
rely on dense dataframes rather than sparse dataframes.
In Songbird, we do the conversion from sparse to dense anyway.
One possibility is to open up the option to do an explicit dense conversion
without an intermediate sparse dataframe (with the warning that the
memory consumption may be large).
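A rough sketch of what such a direct dense conversion could look like, skipping the sparse DataFrame entirely. The helper name dense_frame_from_spmatrix and the labels are hypothetical; this is not an existing biom function.

```python
import pandas as pd
from scipy.sparse import csr_matrix


def dense_frame_from_spmatrix(matrix, index=None, columns=None):
    """Hypothetical helper: build a dense DataFrame straight from a SciPy matrix."""
    # toarray() materializes every cell, so memory scales with rows * columns.
    return pd.DataFrame(matrix.toarray(), index=index, columns=columns)


counts = csr_matrix([[1, 0, 0], [2, 2, 0], [0, 2, 2]])
dense_df = dense_frame_from_spmatrix(
    counts, index=["O1", "O2", "O3"], columns=["S1", "S2", "S3"]
)
```

Because this path never calls from_spmatrix(), it avoids that function entirely, at the cost of storing every zero.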
…On Wed, Feb 26, 2020 at 12:49 PM Daniel McDonald ***@***.***> wrote:
.transform(), for example, can operate in place
<https://github.com/biocore/biom-format/blob/master/biom/_transform.pyx#L51>
on the CSR/CSC representation. I'm unsure about impact to QIIME 2 --
hopefully most operations rely on the Table object rather than the
DataFrame. I cannot predict the impact for other users of biom.
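To illustrate why in-place operations matter here, the following plain SciPy sketch (not biom's Cython code) shows that editing the stored values of a CSR matrix in place can leave explicitly stored zeros behind until they are removed.

```python
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1, 0, 3], [0, 5, 0]]))

# In-place edit of the stored values: the sparsity structure is untouched,
# so the entry that became 0 is still stored explicitly.
m.data[m.data < 2] = 0
stored_with_explicit_zero = m.nnz  # nnz counts stored entries, zeros included

# Dropping explicit zeros restores a canonical structure.
m.eliminate_zeros()
stored_after_cleanup = m.nnz
```

Whether a cleanup step like eliminate_zeros() would sidestep the pandas problem shown earlier in the thread is an assumption that would need testing.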
|
Dense representations were explicitly removed from BIOM a long time ago. I'm not very eager to add support back in because a dependency has a bug. |
Currently, pandas v1 is the reason folks are running into problems like #837. This is also breaking a couple of other packages that rely on biom, including but probably not limited to songbird and mmvec.
This PR should fix the problem for most datasets. However, there are a few complicating factors that I think will prevent merging this PR in right now, both due to pd.DataFrame.sparse.from_spmatrix() (which is pandas' recommended way to convert SciPy CSR matrices to DataFrames-with-sparse-values):
1. I'm pretty sure from_spmatrix() doesn't handle the sparse data in these matrices properly. (Update: this assumption was incorrect on my part, and it's expected that really wide matrices like this really will slow down pandas. See the pandas issue thread linked for profiling details / their response.) When I make a huge but super sparse matrix in SciPy and try to convert it to a DF, my computer explodes at the conversion step (even though initializing the matrix takes on the order of 15-ish seconds for SciPy). I've opened pandas-dev/pandas#32196 ("DataFrame.sparse.from_spmatrix seems inefficient with large (but very sparse) matrices?") accordingly.
To ensure that this change doesn't cause a repeat of #808 ("[python] to_dataframe does not produce sparse data frames"), I added a test in this PR that makes sure that these sorts of large matrices can be converted to sparse DataFrames in a sane amount of time (max 30 seconds). This currently fails, since biom.table.Table.to_dataframe() in this PR uses from_spmatrix() under the hood. Making to_dataframe() faster should cause this test to pass. (If I'm being overzealous and sparse DataFrames are really going to be slower to create than I'd thought, it's totally possible to bump up the 30-second timer in this test.) Update: will need to do this and/or just try a less crazy sparse matrix than a 1mil x 1mil one....
2. Also I think from_spmatrix() has other bugs that might make it unreliable.
ANYWAY TLDR hopefully these changes are useful at least as a starting point for a solution to this problem. Thanks!
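The timing test described above could look roughly like the sketch below. The matrix size, density, and seed are assumptions chosen to run quickly; they are far smaller than the 1mil x 1mil case in the PR, and this is not the test that was actually added.

```python
import time

import pandas as pd
from scipy.sparse import random as sparse_random

# Assumed sizes: much smaller than the 1mil x 1mil matrix discussed above.
n_rows, n_cols, density = 2000, 2000, 0.001
matrix = sparse_random(n_rows, n_cols, density=density, format="csr", random_state=0)

start = time.perf_counter()
df = pd.DataFrame.sparse.from_spmatrix(matrix)
elapsed = time.perf_counter() - start

MAX_SECONDS = 30  # the limit mentioned in the PR description
assert elapsed < MAX_SECONDS, f"conversion took {elapsed:.1f}s"
assert all(isinstance(dt, pd.SparseDtype) for dt in df.dtypes)
```

Since the pandas thread indicates that conversion cost grows with the number of columns, varying n_cols is the most informative way to use this sketch.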