Continuous integration for tests #13
base: main
Conversation
Thanks for doing this! I think it looks good. If I understand correctly, it runs the tests on all collections that are new? Will it fail if the dataset download is huge?
Sorry, I should have had a clearer message in the PR (just updated it). Yeah, it finds collections which have changed or been added in the particular PR and runs tests for those. I figured collections are mostly independent of each other, up to some changes in common code, so we don't need to test everything when anything changes. It's possible it could fail for really huge datasets, though the GitHub Actions docs only specify a runtime limit. We could try to test more before merging if you'd like. cc @shayne-longpre
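For context, the per-collection test selection described above might be sketched roughly like this. This is a minimal illustration only: the `data_collections/` directory layout and the helper name are assumptions, not confirmed details of this PR.

```python
# Hypothetical sketch of per-collection test selection in CI.
# Assumes one directory per collection under data_collections/ --
# that layout is an assumption for illustration, not this repo's
# confirmed structure.
import subprocess


def changed_collections(base_ref="origin/main"):
    """Return the collection directories touched by commits since base_ref."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}...HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    touched = set()
    for path in diff.splitlines():
        parts = path.split("/")
        if len(parts) > 1 and parts[0] == "data_collections":
            touched.add(parts[1])
    return sorted(touched)


if __name__ == "__main__":
    # CI would run the test suite once per changed collection.
    for name in changed_collections():
        print(name)
```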
@wwbrannon Great feature! The runtime limit of 6 hours is relatively long. I would suggest that we divide the test into smaller subtests and use streaming (https://huggingface.co/docs/datasets/stream), loading just 1-10 samples from a dataset for the tests. In that case, the actual dataset size doesn't matter for our tests.
Once it's merged, I will add something like:

```python
from datasets import load_dataset


def huggingface_download(
    data_address, name=None, data_dir=None, data_files=None, split=None
):
    """Download a dataset from the Hugging Face Hub.

    It supports various options for specifying the dataset to download,
    such as providing a name, a data directory, data files, or a split.

    Args:
        data_address (str): The address or identifier of the dataset.
        name (str, optional): Name of the dataset configuration to download. Defaults to None.
        data_dir (str, optional): Path to the directory containing the dataset files. Defaults to None.
        data_files (str or list, optional): Path(s) to specific dataset files. Defaults to None.
        split (str, optional): Name of the split to take (usually "train"). Defaults to None.

    Returns:
        list: The first few streamed examples of the dataset.
    """
    # data_dir and data_files are mutually exclusive ways to point at the data
    assert not (data_dir and data_files)
    if data_files:
        dset = load_dataset(data_address, data_files=data_files, streaming=True)
    elif data_dir:
        dset = load_dataset(data_address, data_dir=data_dir, streaming=True)
    elif name:
        dset = load_dataset(data_address, name, streaming=True)
    else:
        dset = load_dataset(data_address, streaming=True)
    if split:
        dset = dset[split]
    # Only materialize the first two examples; nothing else is downloaded.
    subsample = list(dset.take(2))
    return subsample
```

While I was checking out the collections, I noticed that a few of them might cause errors and could need an update.
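For illustration, the sketch above might be exercised like this in a test. The dataset id `"imdb"` is just an example of a public Hub dataset, not one of this repo's collections:

```python
# Hypothetical smoke test: stream two examples without downloading the full dataset.
samples = huggingface_download("imdb", split="train")
assert len(samples) == 2
print(samples[0].keys())  # inspect the available fields of the first example
```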
@chkla Hmm, I don't think I understand. The existing test script doesn't check output statistics, but running on a sample would, I think, preclude doing that in the future.
No, I tried several collections, but if we increase the number of collections, we might encounter issues in the future. Actually, I thought it would help us with time constraints since we wouldn't need to load everything, which would speed up the code. But maybe I’m misunderstanding something here.
Yeah, that would also be fine for me.
Try to run tests automatically. The idea is: