Add redundancy to OpenML datasets with Figshare #1218
Conversation
Datasets to upload as CSV, in a zip:
nice! a first few comments :)
Update examples
I think something is off with CircleCI.
TODO left:
skrub/datasets/_utils.py
```python
    return bunch


def _load_dataset_files(dataset_name, data_home):
```
I guess this method should be named `load_dataset_files`, as it is used outside of this module?
There seems to be a problem with the videogame sales dataset (example 6 -- see the negative r2 scores), I'll look into it. Edit: the dataset is ordered by decreasing global sales, so we needed KFold with shuffle=True; I just pushed the fix.
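(For context, a minimal sketch of that kind of fix, using placeholder data and a placeholder estimator rather than the actual example code:)

```python
# Minimal sketch with placeholder data and estimator, not the actual example code.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# Simulate a table sorted by decreasing target, like the videogame sales data.
order = np.argsort(-y)
X, y = X[order], y[order]

# With shuffle=False, each fold would cover a different range of y, which can
# produce misleading (even negative) r2 scores; shuffling before splitting avoids that.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(Ridge(), X, y, cv=cv, scoring="r2"))
```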
I added the docstrings. @Vincent-Maladiere what was the second TODO?
I uploaded the zip files to OSF and updated the URLs in this PR, so it should be mostly OK now. Checking the uploads:

```python
import requests
import hashlib
from skrub.datasets._utils import DATASET_INFO

for name, data in DATASET_INFO.items():
    print(name)
    assert len(data['urls']) == 2
    assert 'github.com' in data['urls'][0]
    assert 'osf.io' in data['urls'][1]
    for url in data['urls']:
        content = requests.get(url).content
        assert hashlib.sha256(content).hexdigest() == data['sha256']
```
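For reference, a rough sketch of how a fetcher could use the redundant URLs, trying each mirror in order and verifying the checksum. The `fetch_with_fallback` helper is hypothetical, not the actual skrub implementation; it only assumes the `urls`/`sha256` keys shown in the check above.

```python
import hashlib
import requests

from skrub.datasets._utils import DATASET_INFO


def fetch_with_fallback(dataset_name):
    """Hypothetical helper: try each mirror in order, verify the checksum."""
    info = DATASET_INFO[dataset_name]
    for url in info["urls"]:
        try:
            content = requests.get(url, timeout=30).content
        except requests.RequestException:
            continue  # network error: try the next mirror
        if hashlib.sha256(content).hexdigest() == info["sha256"]:
            return content
    raise OSError(f"Could not download {dataset_name} from any mirror.")
```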
@glemaitre if you have time you may want to have a look at this PR, too.
+1 for the changes. I think this PR is ready for a final review before merging (@glemaitre if you have some spare time) :)
Addresses #1217