Support pandas v1: Switch from SparseDataFrame to "regular ... DataFrame with sparse values" #258

fedarko · 2019-12-18T01:35:51Z

See https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating. For the time being, running Qurro prints a bunch of warning messages repeating this migration notice, so if we can avoid that, it'd make the UX a lot nicer.

Addressing this might not actually be that much of a pain: it seems like the only place Sparse structures are explicitly used is in biom_table_to_sparse_df(), and the remaining references are in comments/etc.:

/Users/mfedarko/Dropbox/Work/KnightLab/qurro> grep -ri "Sparse" qurro/*
Binary file qurro/__pycache__/_df_utils.cpython-36.pyc matches
Binary file qurro/__pycache__/generate.cpython-36.pyc matches
Binary file qurro/__pycache__/_rank_utils.cpython-36.pyc matches
qurro/_df_utils.py:def biom_table_to_sparse_df(table, min_row_ct=2, min_col_ct=1):
qurro/_df_utils.py:    """Loads a BIOM table as a pd.SparseDataFrame. Also calls validate_df().
qurro/_df_utils.py:       To get around this, we extract the scipy.sparse.csr_matrix data from the
qurro/_df_utils.py:       BIOM table and directly convert that to a pandas SparseDataFrame.
qurro/_df_utils.py:    logging.debug("Creating a SparseDataFrame from BIOM table.")
qurro/_df_utils.py:    table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)
qurro/_df_utils.py:    # in to the SparseDataFrame.
qurro/_df_utils.py:    logging.debug("Converted BIOM table to SparseDataFrame.")
qurro/_df_utils.py:       df_old: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       df_new: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       table: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       (m_table, m_sample_metadata): both pd.[Sparse]DataFrame
qurro/_df_utils.py:    """Returns a "sparse" representation of a dict of counts data.
qurro/_df_utils.py:    sparse_count_dict = {}
qurro/_df_utils.py:        sparse_count_dict[feature_id] = fdict
qurro/_df_utils.py:    return sparse_count_dict
qurro/_df_utils.py:       table_sdf: pd.SparseDataFrame
qurro/_df_utils.py:       table_sdf: pd.DataFrame (or pd.SparseDataFrame)
qurro/_rank_utils.py:    table: pd.SparseDataFrame, ranks: pd.DataFrame, extreme_feature_count: int
qurro/_rank_utils.py:       table: pd.SparseDataFrame
qurro/_rank_utils.py:            A SparseDataFrame representation of a BIOM table. This can be
qurro/_rank_utils.py:            qurro._df_utils.biom_table_to_sparse_df().
qurro/_rank_utils.py:       (table, ranks): (pandas.SparseDataFrame, pandas.DataFrame)
qurro/_rank_utils.py:    # >>> df = pd.SparseDataFrame(np.zeros(34000000).reshape(17000, 2000))
qurro/generate.py:    biom_table_to_sparse_df,
qurro/generate.py:       3. Converts the BIOM table to a SparseDataFrame by calling
qurro/generate.py:          biom_table_to_sparse_df().
qurro/generate.py:       output_table: pd.SparseDataFrame
qurro/generate.py:    table = biom_table_to_sparse_df(biom_table)
qurro/generate.py:    table_sdf: pd.SparseDataFrame
qurro/support_files/js/display.js:        /* Gets count data from the featureCts object. This uses a sparse
qurro/tests/test_filter_unextreme_features.py:from qurro.generate import biom_table_to_sparse_df, process_input
qurro/tests/test_filter_unextreme_features.py:    # ...And yeah we're actually making it into a Sparse DF because that's what
qurro/tests/test_filter_unextreme_features.py:    output_table = biom_table_to_sparse_df(biom_table)
qurro/tests/test_df_utils.py:    # Test that it works even when the data is totally sparse
Binary file qurro/tests/__pycache__/testing_utilities.cpython-36.pyc matches
Binary file qurro/tests/__pycache__/test_filter_unextreme_features.cpython-36-pytest-5.1.2.pyc matches
qurro/tests/testing_utilities.py:    biom_table_to_sparse_df,
qurro/tests/testing_utilities.py:    # Load the table as a Sparse DF, and then match it up with the sample
qurro/tests/testing_utilities.py:    table = biom_table_to_sparse_df(load_table(biom_table_loc))
qurro/tests/web_tests/tests/test_rrvdisplay_compute_balance.js:                // Check that sparse data is handled properly (i.e. 0s are
qurro/tests/web_tests/instrumented_js/display.js:         */validateSampleID(sampleID){cov_1wpg1oiw7k.f[58]++;cov_1wpg1oiw7k.s[313]++;if(!this.sampleIDs.includes(sampleID)){cov_1wpg1oiw7k.b[54][0]++;cov_1wpg1oiw7k.s[314]++;throw new Error("Invalid sample ID: "+sampleID);}else{cov_1wpg1oiw7k.b[54][1]++;}}/* Gets count data from the featureCts object. This uses a sparse
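
For reference, the migration guide linked at the top of this issue replaces `pd.SparseDataFrame(sp_matrix)` with `pd.DataFrame.sparse.from_spmatrix(sp_matrix)`. Below is a minimal sketch of what that swap could look like for the one constructor call flagged in the grep output. It uses a toy scipy matrix in place of `table.matrix_data`; the feature/sample IDs and values are made up, and this is not necessarily the change that ended up in Qurro.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-in for table.matrix_data (features x samples); values are made up.
matrix_data = csr_matrix(
    np.array(
        [
            [0.0, 2.0, 0.0],
            [1.0, 0.0, 0.0],
        ]
    )
)

# Old approach (SparseDataFrame was removed in pandas 1.0):
#   table_sdf = pd.SparseDataFrame(matrix_data, default_fill_value=0.0)
# Replacement: a regular DataFrame whose columns hold sparse values.
table_df = pd.DataFrame.sparse.from_spmatrix(
    matrix_data,
    index=pd.Index(["F1", "F2"]),  # hypothetical feature IDs
    columns=pd.Index(["S1", "S2", "S3"]),  # hypothetical sample IDs
)

# Each column should carry a SparseDtype; the .sparse accessor exposes
# sparse-specific helpers such as to_dense().
all_sparse = all(isinstance(dt, pd.SparseDtype) for dt in table_df.dtypes)
dense_df = table_df.sparse.to_dense()
```

Because the result is an ordinary `pd.DataFrame`, type annotations and docstrings that mention `pd.SparseDataFrame` would only need a wording update rather than a behavioral change.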


fedarko · 2019-12-18T01:36:13Z

issue labels somehow got messed up, huh

fedarko · 2020-02-21T23:04:49Z

Upgrading to important, since we need to get this done for the next pandas release: biocore/songbird#117

mortonjt · 2020-02-21T23:07:33Z

I don't think it is a songbird problem, but rather a problem with biom: biocore/biom-format#837

ElDeveloper · 2020-02-21T23:08:48Z

If at all possible, it would be worth considering the Table API from biom. Thanks! Yoshiki

See biocore#258 and biocore#315. not confident this is done yet (and if nothing else the rest of the code gleefully refers to "SparseDataFrame" because 2019 marcus was a schmuck), but this at least fixes a fair amount of failing tests

The problem was using .loc[] on these sparse dataframes. whoops

…QIIME 2 (#322)

* DEP: Update setup.py re: python and pandas #315
* DEV: port CI from Travis to GH Actions: close #316
* TST: For now, omit "make notebooks" from CI Maybe we can make another GitHub Actions for these later; but Songbird is causing tensorflow nonsense to pop up, and this is not the sort of thing I think we should spend time fixing (esp with the advent of birdman)
* DEP: pin min biom vsn and add some comments
* DEP: Fix biom_table_to_sparse_df for pandas >= 1 See #258 and #315. not confident this is done yet (and if nothing else the rest of the code gleefully refers to "SparseDataFrame" because 2019 marcus was a schmuck), but this at least fixes a fair amount of failing tests
* DEP: remove some warnings, docs, fix a test re: pd
* TST: Fix the python tests!!! #258 The problem was using .loc[] on these sparse dataframes. whoops
* STY: tiny style fixes
* DEP: knock out some pandas warnings
* DEP: np.matrix() -> np.array() in qarcoal tests since apparently it's deprecated, or about to be deprecated, idk
* DEP/STY: Fix more warnings; remove unused import most of these warnings were just pd.DataFrame.append() being deprecated and replaced with pd.concat()
* DOC: one of the demos' JS data slightly changed looks like it's a tiny floating-point thing -- probably an artifact of working here on a new operating system, on a new python version, a new pandas version, a new biom version, etc. shouldn't make a noticeable difference
* DOC: update readme re: min Q2 vsn
* TST: matrix of qiime 2 versions nice!
* TST: more detailed comment about Q2 vsn matrix
* DOC: remove the "Sparse" from "SparseDataFrame"
* REL: version kick
* TST: Add standalone CI IIRC something about how our specific altair version works makes it incompatible with python 3.10. let's test that here -- if needed, we can update the README to disallow python versions >= 3.10. (And then we can look into removing the altair pin when absolutely needed.)
* TST: attempt to get standalone tests working
* TST: attempt to fix pytest q2 exclusion
* DEP: ok py 3.10 is a no go
* STY: fix formatting
* DOC: Rerun 4 / 6 example notebooks Songbird and ALDEx2 ones will cause problems
* DOC: tidy/update readme refs
* DOC: update jake fish dataset ref on website
* DOC: Fix songbird notebook!, standardize output rm
* BLD: rm (now-)unused comments from q2 ci
* DOC: fix transcriptomics ntbk :)
* REL: update changelog
* REL: update changelog
* TST: see if we can finagle q2 2020.6 / 2020.8? since i thiiiink these versions mighta worked with the pandas >= 1 syntax that being said, i don't think it makes sense to devote time/energy to officially supporting them; just wanna check
* TST: remove Q2 2020.6 / 2020.8 in CI Looks like the tests themselves pass for these versions, but the style-checking with black fails due to incompatibility with click. yeah this is enough for me to not bother supporting these versions imo
* DEV/DOC: update dev docs re: 2022 the apocalypse came and all i got was this pull request
* REL: update changelog about updating contributing about about about about aboot
* REL: minor chglog tidying
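
One of the commit notes above says most of the remaining warnings came from `pd.DataFrame.append()` being deprecated in favor of `pd.concat()`. Here is a small sketch of that substitution on made-up data. The frames and column names are hypothetical and are not taken from Qurro's code.

```python
import pandas as pd

# Hypothetical data: two small frames with the same column.
ranks = pd.DataFrame({"Rank 1": [0.5, -1.2]}, index=["F1", "F2"])
new_row = pd.DataFrame({"Rank 1": [2.0]}, index=["F3"])

# Deprecated in pandas 1.4 and removed in pandas 2.0:
#   combined = ranks.append(new_row)
# Replacement: pass the frames to pd.concat() as a list.
combined = pd.concat([ranks, new_row])
```

`pd.concat()` takes all the frames at once, so code that appended rows in a loop can usually collect them in a list and concatenate a single time.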

fedarko added the "administrative" (Logistical matters that don't have much or anything to do with code) and "backburner" (Low-priority things that are still good to keep track of) labels Dec 18, 2019

fedarko self-assigned this Dec 18, 2019

fedarko added the "good first issue" (Good for newcomers) label Dec 18, 2019

This was referenced Feb 16, 2020

Update BIOM version required #272

Closed

Some Qarcoal tests failing on Travis #271

Closed

fedarko added the "important" (Things that are critical for getting Qurro in a working/useful state) label Feb 21, 2020

fedarko added a commit to fedarko/qurro that referenced this issue Feb 29, 2020

REL: temporarily pin pandas below v1 biocore#258

46075ab

fedarko mentioned this issue Aug 30, 2020

TypeError: astype() got an unexpected keyword argument 'copy' #312

Closed

fedarko changed the title ~~Switch from SparseDataFrame to "regular ... DataFrame with sparse values"~~ Support pandas v1: Switch from SparseDataFrame to "regular ... DataFrame with sparse values" Sep 8, 2020

fedarko mentioned this issue Dec 11, 2020

Qurro will break on newer (>= 2020.11) versions of QIIME 2 #315

Closed

fedarko added a commit to fedarko/qurro that referenced this issue Jul 5, 2022

TST: Fix the python tests!!! biocore#258

e773b95

The problem was using .loc[] on these sparse dataframes. whoops

fedarko mentioned this issue Jul 5, 2022

Update Qurro to support pandas v1 and up, and thus newer versions of QIIME 2 #322

Merged

2 tasks

fedarko closed this as completed in #322 Oct 20, 2022
