
Add Dask engine to dataset generation functions #404

Merged · 7 commits · Apr 25, 2024

Conversation

@zmbc (Collaborator) commented Apr 22, 2024


Description

  • Category: feature
  • JIRA issue: none

Successor to #349 -- using Dask instead of Modin. This simplifies things and hopefully gives us a shorter path to releasing distributed noising.

Testing

  • all tests pass (pytest --runslow)

Added some tests that use Dask.

@aflaxman (Member) left a comment

Can you also add something to the docs about how to install and use this, and a cautionary note that this is not important for the small data distributed with psp, but can be a big time saver for people working with the full dataset?

Also, I think I'm doing it wrong... E.g. without Dask I loaded Alabama in 1.29 hours, but with it, on a 32-core cluster node, it took 1.57 hours.

@zmbc (Collaborator, Author) commented Apr 23, 2024

@aflaxman How did you start your Dask cluster? And which dataset did you load?

@aflaxman (Member) commented:

I didn't start one! I just used

        df = psp.generate_decennial_census(source=full_data_path, state=name,
                                           config=my_config, verbose=False, engine='dask')

with a path to the full data that you sent me yesterday.

@zmbc (Collaborator, Author) commented Apr 23, 2024

I have a feeling you might be starting Dask with as many workers as there are physical CPUs on your cluster node, and then it is thrashing.

I'm in the process of testing this branch on the case study.

@zmbc (Collaborator, Author) commented Apr 23, 2024

Note: running into a fiddly bug with dtypes on a few columns. I don't think it will be overly complicated to fix.

@zmbc mentioned this pull request on Apr 24, 2024
@zmbc (Collaborator, Author) commented Apr 24, 2024

Blocked by #405 -- Dask currently can't handle writing our strange dtypes to Parquet files.

@aflaxman (Member) commented:

> I didn't start one! I just used
>
>         df = psp.generate_decennial_census(source=full_data_path, state=name,
>                                            config=my_config, verbose=False, engine='dask')
>
> with a path to the full data that you sent me yesterday.

When I added this code block (on a node with the srun from the commented line, after installing the packages from the other commented lines), I was able to load the full USA data in 40 minutes:

# pip install "dask[distributed]" --upgrade
# pip install --upgrade pyarrow

# srun -t 14-00:00:00 --mem=100G -c 32 -A proj_simscience -p long.q --pty bash

from dask.distributed import LocalCluster
cluster = LocalCluster(n_workers=10, threads_per_worker=1)
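A note on the snippet above: creating a `LocalCluster` by itself does not route Dask computations to it; you typically also connect a `distributed.Client`, which registers the cluster as the default scheduler for subsequent Dask work. The sketch below (not taken from the PR; worker count reduced for illustration) shows the usual pattern:

```python
# Hedged sketch: connect a Client so later Dask computations run on the
# LocalCluster instead of the default threaded scheduler.
# (The PR's snippet used n_workers=10; 2 is used here only for illustration.)
from dask.distributed import LocalCluster, Client

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)  # becomes the default scheduler for Dask collections

# Inspect the cluster: scheduler_info() reports the connected workers.
n_workers = len(client.scheduler_info()["workers"])
print(f"workers connected: {n_workers}")

client.close()
cluster.close()
```

With the `Client` in place, a subsequent call such as `psp.generate_decennial_census(..., engine='dask')` should execute on those workers, which may explain the speedup aflaxman observed once the worker count was capped below the node's core count.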

@rmudambi (Collaborator) commented:

@zmbc can you target a (new) release branch rather than main? We probably want to release all of your PRs together as one release, and merging to main requires that we release.

@zmbc zmbc changed the base branch from main to release-candidate/dtypes-distributed-noising April 25, 2024 20:06
@zmbc zmbc requested review from pletale and a team as code owners April 25, 2024 20:25
@zmbc zmbc merged commit a96d337 into release-candidate/dtypes-distributed-noising Apr 25, 2024
7 checks passed
@zmbc zmbc deleted the feature/dask branch April 25, 2024 21:24