Continuous integration for tests #13

Open · wants to merge 2 commits into main
Conversation

@wwbrannon (Collaborator) commented Jan 6, 2024

Try to run tests automatically. The idea is:

  • Find data_summaries JSON files modified, created, or deleted in the PR that was opened.
  • Skip the deleted ones.
  • Get all the collection names in the modified/created ones and run tests for each of those.
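The file-selection steps above could be sketched roughly as follows. This is a minimal illustration, not the actual workflow code; the helper names and the `data_summaries/` path convention are assumptions:

import subprocess

def changed_summary_files(base_ref, head_ref):
    """Return data_summaries JSON files added or modified in a PR.

    Runs `git diff --name-status` between the base and head refs and
    filters the output; deleted files are skipped.
    """
    out = subprocess.run(
        ["git", "diff", "--name-status", f"{base_ref}...{head_ref}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return select_summary_files(out.splitlines())

def select_summary_files(diff_lines):
    """Filter name-status lines down to kept data_summaries/*.json paths."""
    kept = []
    for line in diff_lines:
        status, _, path = line.partition("\t")
        if status.startswith("D"):  # skip deleted files
            continue
        if path.startswith("data_summaries/") and path.endswith(".json"):
            kept.append(path)
    return kept

The collection names to test would then be read out of each surviving file.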

@shayne-longpre (Contributor) left a comment

Thanks for doing this! I think it looks good. If I understand correctly it runs the tests on all collections that are new? Will it fail if the dataset download is huge?

@wwbrannon (Collaborator, Author)

> Thanks for doing this! I think it looks good. If I understand correctly it runs the tests on all collections that are new? Will it fail if the dataset download is huge?

Sorry, I should have had a clearer message in the PR (just updated it). Yeah, it finds collections that have changed or been added in the particular PR and runs tests for those. I figured collections are mostly independent of each other, up to some changes in common code, so we don't need to test everything when anything changes.

It's possible it could fail for really huge datasets, though the GitHub Actions docs only specify a runtime limit. We could try to test more before merging if you'd like cc @shayne-longpre
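For reference, the job-level limit could also be tightened below the 6-hour default with `timeout-minutes`, something like the following (illustrative fragment; the job name is made up):

```yaml
jobs:
  test-collections:
    runs-on: ubuntu-latest
    timeout-minutes: 60   # fail fast instead of running to the 6-hour default
```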

@chkla (Contributor) commented Jan 23, 2024

@wwbrannon great feature! The runtime limit of 6 hours is relatively long, though. I would suggest that we divide the tests into smaller subtests and use streaming (https://huggingface.co/docs/datasets/stream) to load just 1-10 samples from each dataset for the tests. In that case, the actual dataset size doesn't matter for our tests.

@chkla (Contributor) commented Jan 23, 2024

Once it's merged, I will add something like:

from datasets import load_dataset

def huggingface_download(
        data_address, name=None, data_dir=None, data_files=None, split=None
):
    """Download a small sample of a dataset from the Hugging Face Hub.

    It supports various options for specifying the dataset to download,
    such as providing a name, a data directory, data files, or a split.
    Streaming mode is used so only the requested samples are fetched.

    Args:
        data_address (str): The address or identifier of the dataset.
        name (str, optional): Name of the dataset configuration to download. Defaults to None.
        data_dir (str, optional): Path to the directory containing the dataset files. Defaults to None.
        data_files (str or list, optional): Path(s) to specific dataset files. Defaults to None.
        split (str, optional): Name of the split to take (usually "train"). Defaults to None.

    Returns:
        list: The first few samples of the dataset as a list of items.
    """
    assert not (data_dir and data_files), "data_dir and data_files are mutually exclusive"

    if data_files:
        dset = load_dataset(data_address, data_files=data_files, streaming=True)
    elif data_dir:
        dset = load_dataset(data_address, data_dir=data_dir, streaming=True)
    elif name:
        dset = load_dataset(data_address, name, streaming=True)
    else:
        dset = load_dataset(data_address, streaming=True)

    if split:
        dset = dset[split]
    else:
        # Without a split, load_dataset returns a dict of splits; take the first.
        dset = next(iter(dset.values()))

    return list(dset.take(2))

While I was checking out the collections, I noticed that a few of them might cause errors and could need an update.

@wwbrannon (Collaborator, Author)

@chkla Hmm, I don't think I understand.

  • Do you mean making all of the huggingface downloads streaming by default for the whole package? That would avoid space constraints, but not time constraints. On the other hand, do we actually have (or would we be adding) anything that runs tests for >6 hours?
  • Alternatively we could monkey-patch the huggingface_download function from the test-runner code.

The existing test script doesn't check output statistics, but running with a sample I think would preclude doing that in the future.
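The monkey-patching idea could look roughly like this. The module layout and function names below are purely illustrative, not the project's actual structure:

import types

# Stand-in for the package module that defines huggingface_download.
downloader = types.SimpleNamespace()

def real_huggingface_download(data_address, **kwargs):
    # Imagine this downloads the full dataset (slow and large).
    raise RuntimeError("full download attempted during tests")

downloader.huggingface_download = real_huggingface_download

def streaming_stub(data_address, **kwargs):
    """Return a tiny fixed sample instead of downloading anything."""
    return [{"text": f"sample from {data_address}"}] * 2

# The monkey-patch: the test runner swaps in the stub before tests run,
# so every later call through the module uses it.
downloader.huggingface_download = streaming_stub

After the patch, `downloader.huggingface_download("some/dataset")` returns two stub samples instead of touching the network, regardless of dataset size.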

@chkla (Contributor) commented Feb 2, 2024

@wwbrannon

> @chkla Hmm, I don't think I understand.
>
> • Do you mean making all of the huggingface downloads streaming by default for the whole package? That would avoid space constraints, but not time constraints. On the other hand, do we actually have (or would we be adding) anything that runs tests for >6 hours?

No, I tried several collections, but if we increase the number of collections, we might encounter issues in the future. Actually, I thought streaming would help us with time constraints too, since we wouldn't need to load everything, which would speed up the code. But maybe I'm misunderstanding something here.

> • Alternatively we could monkey-patch the huggingface_download function from the test-runner code.

Yeah, that would also be fine for me.

> The existing test script doesn't check output statistics, but running with a sample I think would preclude doing that in the future.

👍
