Add Dask engine to dataset generation functions #404
Can you also add something to the docs about how to install and use this, and a cautionary note that this is not important for the small data distributed with psp, but can be a big time saver for people working with the full dataset?
Also, I think I'm doing it wrong... without dask
E.g. without this I loaded Alabama in 1.29 hours, but with it, on a 32-core cluster node, it took 1.57 hours.
@aflaxman How did you start your Dask cluster? And which dataset did you load?
I didn't start one! I just used
with a path to the full data that you sent me yesterday.
I have a feeling you might be starting Dask with as many workers as there are physical CPUs on your cluster node, and then it is thrashing. I'm in the process of testing this branch on the case study.
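To make the worker-count concern concrete, here is a minimal sketch of sizing a local Dask cluster below the node's core count instead of relying on the defaults. The helper names `pick_worker_layout` and `start_local_cluster`, the two reserved cores, and the memory figures are illustrative assumptions, not part of this PR; only `LocalCluster` and `Client` from `dask.distributed` are real Dask APIs. Whether this actually removes the slowdown reported above is untested.

```python
import os


def pick_worker_layout(total_cores, total_memory_gb, threads_per_worker=1, reserve_cores=2):
    """Return LocalCluster keyword arguments sized below the node's core count.

    Hypothetical helper: leaving `reserve_cores` free for the scheduler and the
    OS is an assumption, not a Dask requirement.
    """
    usable_cores = max(1, total_cores - reserve_cores)
    n_workers = max(1, usable_cores // threads_per_worker)
    # In LocalCluster, memory_limit applies per worker, so split the budget evenly.
    memory_per_worker_gb = total_memory_gb / n_workers
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        "memory_limit": f"{memory_per_worker_gb:.1f}GB",
    }


def start_local_cluster(total_memory_gb, **layout_overrides):
    """Start a local cluster with an explicit layout and return a connected Client."""
    # Imported lazily so the sizing helper can be used without Dask installed.
    from dask.distributed import Client, LocalCluster

    # os.cpu_count() reports the logical CPUs on the node, which can exceed what a
    # batch scheduler actually allocated to the job; pass a smaller number if so.
    layout = pick_worker_layout(os.cpu_count() or 1, total_memory_gb, **layout_overrides)
    return Client(LocalCluster(**layout))
```

On a 32-core node with 128 GB available to the job, `pick_worker_layout(32, 128)` asks for 30 single-threaded workers. Creating the `Client` before calling the dataset generation function should make Dask use this cluster rather than a default one.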
Note: running into a fiddly bug with dtypes on a few columns. I don't think it will be overly complicated to fix.
Blocked by #405 -- Dask currently can't handle writing our strange dtypes to Parquet files.
When I added this code block (on a node with the
@zmbc can you target a (new) release branch rather than
…uted-noising' into feature/dask
a96d337 into release-candidate/dtypes-distributed-noising
Description
Successor to #349 -- using Dask instead of Modin. This simplifies things and hopefully gives us a shorter path to releasing distributed noising.
Testing
pytest --runslow
Added some tests that use Dask.