Support pandas v1: Switch from SparseDataFrame to "regular ... DataFrame with sparse values" #258

Closed
fedarko opened this issue Dec 18, 2019 · 4 comments · Fixed by #322
Assignees
Labels
external issues/bugs with other libraries, frameworks, etc.; might include reproducing an issue minimally good first issue Good for newcomers important Things that are critical for getting Qurro in a working/useful state optimization Making code faster or cleaner

Comments

@fedarko
Collaborator

fedarko commented Dec 18, 2019

See https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating. For the time being, running Qurro prints a bunch of warning messages repeating this deprecation notice, so if we can avoid them that'd make the UX a lot nicer.

Addressing this might not actually be that much of a pain: it seems like the only place sparse structures are explicitly used is in biom_table_to_sparse_df(), and the remaining references to them are in comments, docstrings, etc.:

/Users/mfedarko/Dropbox/Work/KnightLab/qurro> grep -ri "Sparse" qurro/*
Binary file qurro/__pycache__/_df_utils.cpython-36.pyc matches
Binary file qurro/__pycache__/generate.cpython-36.pyc matches
Binary file qurro/__pycache__/_rank_utils.cpython-36.pyc matches
qurro/_df_utils.py:def biom_table_to_sparse_df(table, min_row_ct=2, min_col_ct=1):
qurro/_df_utils.py:    """Loads a BIOM table as a pd.SparseDataFrame. Also calls validate_df().
qurro/_df_utils.py:       To get around this, we extract the scipy.sparse.csr_matrix data from the
qurro/_df_utils.py:       BIOM table and directly convert that to a pandas SparseDataFrame.
qurro/_df_utils.py:    logging.debug("Creating a SparseDataFrame from BIOM table.")
qurro/_df_utils.py:    table_sdf = pd.SparseDataFrame(table.matrix_data, default_fill_value=0.0)
qurro/_df_utils.py:    # in to the SparseDataFrame.
qurro/_df_utils.py:    logging.debug("Converted BIOM table to SparseDataFrame.")
qurro/_df_utils.py:       df_old: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       df_new: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       table: pd.DataFrame (or pd.SparseDataFrame)
qurro/_df_utils.py:       (m_table, m_sample_metadata): both pd.[Sparse]DataFrame
qurro/_df_utils.py:    """Returns a "sparse" representation of a dict of counts data.
qurro/_df_utils.py:    sparse_count_dict = {}
qurro/_df_utils.py:        sparse_count_dict[feature_id] = fdict
qurro/_df_utils.py:    return sparse_count_dict
qurro/_df_utils.py:       table_sdf: pd.SparseDataFrame
qurro/_df_utils.py:       table_sdf: pd.DataFrame (or pd.SparseDataFrame)
qurro/_rank_utils.py:    table: pd.SparseDataFrame, ranks: pd.DataFrame, extreme_feature_count: int
qurro/_rank_utils.py:       table: pd.SparseDataFrame
qurro/_rank_utils.py:            A SparseDataFrame representation of a BIOM table. This can be
qurro/_rank_utils.py:            qurro._df_utils.biom_table_to_sparse_df().
qurro/_rank_utils.py:       (table, ranks): (pandas.SparseDataFrame, pandas.DataFrame)
qurro/_rank_utils.py:    # >>> df = pd.SparseDataFrame(np.zeros(34000000).reshape(17000, 2000))
qurro/generate.py:    biom_table_to_sparse_df,
qurro/generate.py:       3. Converts the BIOM table to a SparseDataFrame by calling
qurro/generate.py:          biom_table_to_sparse_df().
qurro/generate.py:       output_table: pd.SparseDataFrame
qurro/generate.py:    table = biom_table_to_sparse_df(biom_table)
qurro/generate.py:    table_sdf: pd.SparseDataFrame
qurro/support_files/js/display.js:        /* Gets count data from the featureCts object. This uses a sparse
qurro/tests/test_filter_unextreme_features.py:from qurro.generate import biom_table_to_sparse_df, process_input
qurro/tests/test_filter_unextreme_features.py:    # ...And yeah we're actually making it into a Sparse DF because that's what
qurro/tests/test_filter_unextreme_features.py:    output_table = biom_table_to_sparse_df(biom_table)
qurro/tests/test_df_utils.py:    # Test that it works even when the data is totally sparse
Binary file qurro/tests/__pycache__/testing_utilities.cpython-36.pyc matches
Binary file qurro/tests/__pycache__/test_filter_unextreme_features.cpython-36-pytest-5.1.2.pyc matches
qurro/tests/testing_utilities.py:    biom_table_to_sparse_df,
qurro/tests/testing_utilities.py:    # Load the table as a Sparse DF, and then match it up with the sample
qurro/tests/testing_utilities.py:    table = biom_table_to_sparse_df(load_table(biom_table_loc))
qurro/tests/web_tests/tests/test_rrvdisplay_compute_balance.js:                // Check that sparse data is handled properly (i.e. 0s are
qurro/tests/web_tests/instrumented_js/display.js:         */validateSampleID(sampleID){cov_1wpg1oiw7k.f[58]++;cov_1wpg1oiw7k.s[313]++;if(!this.sampleIDs.includes(sampleID)){cov_1wpg1oiw7k.b[54][0]++;cov_1wpg1oiw7k.s[314]++;throw new Error("Invalid sample ID: "+sampleID);}else{cov_1wpg1oiw7k.b[54][1]++;}}/* Gets count data from the featureCts object. This uses a sparse
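
For reference, a minimal sketch of the migration described in the linked pandas guide. It assumes `table.matrix_data` is a `scipy.sparse` matrix, as the docstring in the grep output above says; the small matrix below is made-up stand-in data rather than a real BIOM table, and this is not necessarily the change that was eventually merged.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical stand-in for table.matrix_data (rows = features, cols = samples).
matrix_data = csr_matrix(np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]))

# Old (removed in pandas 1.0):
#   table_sdf = pd.SparseDataFrame(matrix_data, default_fill_value=0.0)
# New: a regular DataFrame whose columns hold sparse values.
table_df = pd.DataFrame.sparse.from_spmatrix(matrix_data)

# Every column now has a SparseDtype; the .sparse accessor exposes
# sparse-specific helpers such as density and to_dense().
print(table_df.dtypes)
print(table_df.sparse.density)
```

`from_spmatrix()` also accepts `index` and `columns` arguments, so feature and sample IDs could be attached at construction time instead of being assigned afterwards.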
@fedarko fedarko added administrative Logistical matters that don't have much or anything to do with code backburner Low-priority things that are still good to keep track of labels Dec 18, 2019
@fedarko fedarko self-assigned this Dec 18, 2019
@fedarko fedarko added external issues/bugs with other libraries, frameworks, etc.; might include reproducing an issue minimally optimization Making code faster or cleaner and removed administrative Logistical matters that don't have much or anything to do with code backburner Low-priority things that are still good to keep track of labels Dec 18, 2019
@fedarko
Collaborator Author

fedarko commented Dec 18, 2019

issue labels somehow got messed up, huh

@fedarko fedarko added the good first issue Good for newcomers label Dec 18, 2019
@fedarko fedarko added the important Things that are critical for getting Qurro in a working/useful state label Feb 21, 2020
@fedarko
Collaborator Author

fedarko commented Feb 21, 2020

Upgrading to important, since we need to get this done for the next pandas release: biocore/songbird#117

@mortonjt

mortonjt commented Feb 21, 2020 via email

@ElDeveloper
Member

ElDeveloper commented Feb 21, 2020 via email

fedarko added a commit to fedarko/qurro that referenced this issue Feb 29, 2020
@fedarko fedarko changed the title Switch from SparseDataFrame to "regular ... DataFrame with sparse values" Support pandas v1: Switch from SparseDataFrame to "regular ... DataFrame with sparse values" Sep 8, 2020
fedarko added a commit to fedarko/qurro that referenced this issue Jul 5, 2022
See biocore#258 and biocore#315. not confident this is done yet (and if nothing
else the rest of the code gleefully refers to "SparseDataFrame"
because 2019 marcus was a schmuck), but this at least fixes a fair
amount of failing tests
fedarko added a commit to fedarko/qurro that referenced this issue Jul 5, 2022
The problem was using .loc[] on these sparse dataframes. whoops
fedarko added a commit that referenced this issue Oct 20, 2022
…QIIME 2 (#322)

* DEP: Update setup.py re: python and pandas #315

* DEV: port CI from Travis to GH Actions: close #316

* TST: For now, omit "make notebooks" from CI

Maybe we can make another GitHub Actions for these later; but
Songbird is causing tensorflow nonsense to pop up, and this is not
the sort of thing I think we should spend time fixing (esp with
the advent of birdman)

* DEP: pin min biom vsn and add some comments

* DEP: Fix biom_table_to_sparse_df for pandas >= 1

See #258 and #315. not confident this is done yet (and if nothing
else the rest of the code gleefully refers to "SparseDataFrame"
because 2019 marcus was a schmuck), but this at least fixes a fair
amount of failing tests

* DEP: remove some warnings, docs, fix a test re: pd

* TST: Fix the python tests!!! #258

The problem was using .loc[] on these sparse dataframes. whoops

* STY: tiny style fixes

* DEP: knock out some pandas warnings

* DEP: np.matrix() -> np.array() in qarcoal tests

since apparently it's deprecated, or about to be deprecated, idk

* DEP/STY: Fix more warnings; remove unused import

most of these warnings were just pd.DataFrame.append() being
deprecated and replaced with pd.concat()

* DOC: one of the demos' JS data slightly changed

looks like it's a tiny floating-point thing -- probably an artifact
of working here on a new operating system, on a new python version,
a new pandas version, a new biom version, etc. shouldn't make a
noticeable difference

* DOC: update readme re: min Q2 vsn

* TST: matrix of qiime 2 versions

nice!

* TST: more detailed comment about Q2 vsn matrix

* DOC: remove the "Sparse" from "SparseDataFrame"

* REL: version kick

* TST: Add standalone CI

IIRC something about how our specific altair version works makes it
incompatible with python 3.10. let's test that here -- if needed,
we can update the README to disallow python versions >= 3.10. (And
then we can look into removing the altair pin when absolutely needed.)

* TST: attempt to get standalone tests working

* TST: attempt to fix pytest q2 exclusion

* DEP: ok py 3.10 is a no go

* STY: fix formatting

* DOC: Rerun 4 / 6 example notebooks

Songbird and ALDEx2 ones will cause problems

* DOC: tidy/update readme refs

* DOC: update jake fish dataset ref on website

* DOC: Fix songbird notebook!, standardize output rm

* BLD: rm (now-)unused comments from q2 ci

* DOC: fix transcriptomics ntbk :)

* REL: update changelog

* REL: update changelog

* TST: see if we can finagle q2 2020.6 / 2020.8?

since i thiiiink these versions mighta worked with the pandas >= 1
syntax

that being said, i don't think it makes sense to devote time/energy
to officially supporting them; just wanna check

* TST: remove Q2 2020.6 / 2020.8 in CI

Looks like the tests themselves pass for these versions, but the
style-checking with black fails due to incompatibility with click.

yeah this is enough for me to not bother supporting these versions
imo

* DOC: songbird compatibility deets

* DEV/DOC: update dev docs re: 2022

the apocalypse came and all i got was this pull request

* REL: update changelog about updating contributing

about about about about aboot

* REL: minor chglog tidying
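
Two of the deprecation fixes mentioned in the commit message above (pd.DataFrame.append() replaced with pd.concat(), and np.matrix() replaced with np.array()) can be sketched roughly as follows. The data frames and values here are hypothetical and are not taken from Qurro's code or tests.

```python
import numpy as np
import pandas as pd

# Hypothetical example data.
df_a = pd.DataFrame({"feature": ["F1", "F2"], "count": [3, 0]})
df_b = pd.DataFrame({"feature": ["F3"], "count": [5]})

# Old (deprecated in pandas 1.4, removed in 2.0):
#   combined = df_a.append(df_b, ignore_index=True)
# New: pass all frames to pd.concat() in one call.
combined = pd.concat([df_a, df_b], ignore_index=True)

# Old: m = np.matrix([[1, 2], [3, 4]]), where m * m is a matrix product.
# New: a plain 2-D array; use @ for the matrix product, since * on arrays
# is element-wise.
arr = np.array([[1, 2], [3, 4]])
product = arr @ arr

print(combined)
print(product)
```

When many frames are combined in a loop, collecting them in a list and calling pd.concat() once avoids copying the accumulated frame on every iteration.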

3 participants