Continuous integration for tests #13

Open · wants to merge 2 commits into main
Conversation

@wwbrannon (Collaborator) commented Jan 6, 2024

Try to run tests automatically. The idea is:

  • Find data_summaries JSON files modified, created, or deleted in the PR that was opened.
  • Skip the deleted ones.
  • Get all the collection names in the modified/created ones and run tests for each of those.
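The file-selection steps above could be sketched roughly as follows. This is a minimal illustration, not the actual workflow code; the helper names and the `data_summaries/` path convention are assumptions:

import subprocess

def changed_summary_files(base_ref, head_ref):
    """Return data_summaries JSON files added or modified in a PR.

    Runs `git diff --name-status` between the base and head refs and
    filters the output; deleted files are skipped.
    """
    out = subprocess.run(
        ["git", "diff", "--name-status", f"{base_ref}...{head_ref}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return select_summary_files(out.splitlines())

def select_summary_files(diff_lines):
    """Filter name-status lines down to kept data_summaries/*.json paths."""
    kept = []
    for line in diff_lines:
        status, _, path = line.partition("\t")
        if status.startswith("D"):  # skip deleted files
            continue
        if path.startswith("data_summaries/") and path.endswith(".json"):
            kept.append(path)
    return kept

The collection names to test would then be read out of each surviving file.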

@shayne-longpre (Contributor) left a comment

Thanks for doing this! I think it looks good. If I understand correctly it runs the tests on all collections that are new? Will it fail if the dataset download is huge?

@wwbrannon (Collaborator, Author)

> Thanks for doing this! I think it looks good. If I understand correctly it runs the tests on all collections that are new? Will it fail if the dataset download is huge?

Sorry, I should have had a clearer message in the PR (just updated it). Yeah, it finds collections that have changed or been added in the particular PR and runs tests for those. I figured collections are mostly independent of each other, up to some changes in common code, so we don't need to test everything when anything changes.

It's possible it could fail for really huge datasets, though the GitHub Actions docs only specify a runtime limit. We could try to test more before merging if you'd like cc @shayne-longpre
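For reference, the job-level limit could also be tightened below the 6-hour default with `timeout-minutes`, something like the following (illustrative fragment; the job name is made up):

```yaml
jobs:
  test-collections:
    runs-on: ubuntu-latest
    timeout-minutes: 60   # fail fast instead of running to the 6-hour default
```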

@chkla (Contributor) commented Jan 23, 2024

@wwbrannon great feature! The runtime limit of 6 hours is relatively long, though. I would suggest that we divide the tests into smaller subtests and use streaming (https://huggingface.co/docs/datasets/stream) to load just 1-10 samples from each dataset for the tests. In that case, the actual dataset size doesn't matter for our tests.

@chkla (Contributor) commented Jan 23, 2024

Once it's merged, I will add something like:

from datasets import load_dataset

def huggingface_download(
        data_address, name=None, data_dir=None, data_files=None, split=None
):
    """Download a small sample of a dataset from the Hugging Face Hub.

    It supports various options for specifying the dataset to download,
    such as providing a name, a data directory, data files, or a split.
    Streaming mode is used so only the requested samples are fetched.

    Args:
        data_address (str): The address or identifier of the dataset.
        name (str, optional): Name of the dataset configuration to download. Defaults to None.
        data_dir (str, optional): Path to the directory containing the dataset files. Defaults to None.
        data_files (str or list, optional): Path(s) to specific dataset files. Defaults to None.
        split (str, optional): Name of the split to take (usually "train"). Defaults to None.

    Returns:
        list: The first few samples of the dataset as a list of items.
    """
    assert not (data_dir and data_files), "data_dir and data_files are mutually exclusive"

    if data_files:
        dset = load_dataset(data_address, data_files=data_files, streaming=True)
    elif data_dir:
        dset = load_dataset(data_address, data_dir=data_dir, streaming=True)
    elif name:
        dset = load_dataset(data_address, name, streaming=True)
    else:
        dset = load_dataset(data_address, streaming=True)

    if split:
        dset = dset[split]
    else:
        # Without a split, load_dataset returns a dict of splits; take the first.
        dset = next(iter(dset.values()))

    return list(dset.take(2))

While I was checking out the collections, I noticed that a few of them might cause errors and could need an update.

@wwbrannon (Collaborator, Author)

@chkla Hmm, I don't think I understand.

  • Do you mean making all of the huggingface downloads streaming by default for the whole package? That would avoid space constraints, but not time constraints. On the other hand, do we actually have (or would we be adding) anything that runs tests for >6 hours?
  • Alternatively we could monkey-patch the huggingface_download function from the test-runner code.

The existing test script doesn't check output statistics, but running with a sample I think would preclude doing that in the future.
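The monkey-patching idea could look roughly like this. The module layout and function names below are purely illustrative, not the project's actual structure:

import types

# Stand-in for the package module that defines huggingface_download.
downloader = types.SimpleNamespace()

def real_huggingface_download(data_address, **kwargs):
    # Imagine this downloads the full dataset (slow and large).
    raise RuntimeError("full download attempted during tests")

downloader.huggingface_download = real_huggingface_download

def streaming_stub(data_address, **kwargs):
    """Return a tiny fixed sample instead of downloading anything."""
    return [{"text": f"sample from {data_address}"}] * 2

# The monkey-patch: the test runner swaps in the stub before tests run,
# so every later call through the module uses it.
downloader.huggingface_download = streaming_stub

After the patch, `downloader.huggingface_download("some/dataset")` returns two stub samples instead of touching the network, regardless of dataset size.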

@chkla (Contributor) commented Feb 2, 2024

@wwbrannon

> @chkla Hmm, I don't think I understand.
>
> • Do you mean making all of the huggingface downloads streaming by default for the whole package? That would avoid space constraints, but not time constraints. On the other hand, do we actually have (or would we be adding) anything that runs tests for >6 hours?

No, I tried several collections, but if we increase the number of collections, we might encounter issues in the future. Actually, I thought streaming would help us with time constraints too, since we wouldn't need to load everything, which would speed up the code. But maybe I'm misunderstanding something here.

> • Alternatively we could monkey-patch the huggingface_download function from the test-runner code.

Yeah, that would also be fine for me.

> The existing test script doesn't check output statistics, but running with a sample I think would preclude doing that in the future.

👍
