Skip reading files with incorrect extension #318

Merged
merged 14 commits into from
Nov 18, 2024

sarahyurick
Collaborator

Closes #214.

@ayushdg
Collaborator

ayushdg commented Oct 22, 2024

We might need to expand the list of extensions, since some files have compound formats like .json.gz.
I wonder if an alternative could be to expand get_all_files_paths_under to also filter on an extension. That way users can specify what extension they want to filter on.
I'm hoping that #50 will make things easier in this regard.

@sarahyurick
Collaborator Author

sarahyurick commented Oct 23, 2024

Thanks @ayushdg ! I like your idea of having it in get_all_files_paths_under so I changed it to use that instead.

Also, I agree with you about .json.gz. I think it is outside the scope of this PR, but I have added it to #50 for now.
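
The .json.gz concern comes down to how the extension is extracted. A minimal sketch using only the standard library (the file name is made up for illustration): os.path.splitext returns only the last suffix, while pathlib's suffixes and a plain endswith check can see the compound one.

```python
import os
from pathlib import Path

path = "data/part-000.json.gz"  # hypothetical file name, for illustration only

# splitext only splits off the final suffix, so the ".json" part is lost.
last_suffix = os.path.splitext(path)[-1]

# pathlib exposes every suffix, which can be joined back together.
compound_suffix = "".join(Path(path).suffixes)

# A plain endswith check matches the compound extension directly.
matches_json_gz = path.endswith(".json.gz")

print(last_suffix, compound_suffix, matches_json_gz)
```

Any extension check built on splitext would therefore treat this file as a .gz file rather than a compressed JSON file.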


input_extensions = {os.path.splitext(f)[-1] for f in input_files}
if len(input_extensions) != 1:
    raise RuntimeError(
Collaborator Author

An example of when we would expect this RuntimeError is for:

doc = DocumentDataset.read_json(in_files)

Where in_files is a string path to a directory with multiple JSONL files and a CRC file. Since the CRC file is not explicitly being filtered out, we raise the error.

Collaborator

We can leave this as is for now.
In theory there might be cases where a user filters by [.json, .jsonl] using the file filter, which would raise an error here. In practice I expect that to be unlikely, so we can wait and see if there is any user feedback around this.
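
To make the two cases in this thread concrete, here is a rough sketch of the check under discussion. The helper name check_single_extension and the error message are assumptions (the quoted diff is truncated at the raise); only the set comprehension and the length test come from the snippet above.

```python
import os
from typing import List


def check_single_extension(input_files: List[str]) -> str:
    """Hypothetical helper: return the shared extension or raise RuntimeError."""
    input_extensions = {os.path.splitext(f)[-1] for f in input_files}
    if len(input_extensions) != 1:
        # The message text is an assumption, not the wording used in the PR.
        raise RuntimeError(
            f"Expected one file extension, found: {sorted(input_extensions)}"
        )
    return input_extensions.pop()


def raises_runtime_error(files: List[str]) -> bool:
    """Small wrapper so the outcomes below can be compared directly."""
    try:
        check_single_extension(files)
    except RuntimeError:
        return True
    return False


# JSONL files plus an unfiltered CRC file: two extensions, so the check fails.
crc_case = raises_runtime_error(["data/a.jsonl", "data/b.jsonl", "data/a.jsonl.crc"])

# Filtering by both json and jsonl also leaves two extensions behind.
mixed_case = raises_runtime_error(["data/a.json", "data/b.jsonl"])

# A uniformly named list passes.
clean_case = raises_runtime_error(["data/a.jsonl", "data/b.jsonl"])
```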

root: str,
recurse_subdirectories: bool = True,
followlinks: bool = False,
filter_by: Optional[Union[str, List[str]]] = None,
Collaborator Author

All of these examples work:
(1)

input_files = get_all_files_paths_under(in_files, filter_by="jsonl")
input_dataset = DocumentDataset.read_json(input_files)

(2)

input_files = get_all_files_paths_under(in_files, filter_by=["jsonl"])
input_dataset = DocumentDataset.read_json(input_files)

(3)

# Returns a list containing only .jsonl, .parquet, and .csv files
input_files = get_all_files_paths_under(in_files, filter_by=["jsonl", "parquet", "csv"])
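
For readers who want to see how a filter_by argument like this could behave, below is a standalone sketch. It is not the code merged in this PR: list_files_under is a hypothetical stand-in that borrows the parameter names from the signature above, and normalizing each entry to a dotted suffix is a choice of this sketch.

```python
import os
import tempfile
from typing import List, Optional, Union


def list_files_under(
    root: str,
    recurse_subdirectories: bool = True,
    followlinks: bool = False,
    filter_by: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical stand-in for get_all_files_paths_under."""
    if recurse_subdirectories:
        paths = [
            os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root, followlinks=followlinks)
            for name in names
        ]
    else:
        with os.scandir(root) as entries:
            paths = [entry.path for entry in entries if entry.is_file()]

    if filter_by is not None:
        extensions = [filter_by] if isinstance(filter_by, str) else filter_by
        # Accept both "jsonl" and ".jsonl" by normalizing to a dotted suffix.
        suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
        paths = [p for p in paths if p.endswith(suffixes)]

    return sorted(paths)


# Build a throwaway directory with mixed file types to exercise the filter.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.jsonl", "b.jsonl", "c.parquet", "_SUCCESS.crc"]:
        open(os.path.join(tmp, name), "w").close()
    jsonl_only = [
        os.path.basename(p) for p in list_files_under(tmp, filter_by="jsonl")
    ]
    jsonl_and_parquet = [
        os.path.basename(p)
        for p in list_files_under(tmp, filter_by=["jsonl", "parquet"])
    ]
    unfiltered_count = len(list_files_under(tmp))
```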

@sarahyurick sarahyurick requested a review from ayushdg October 28, 2024 18:51
ayushdg
ayushdg previously requested changes Nov 5, 2024
Collaborator

@ayushdg ayushdg left a comment

Thanks for the changes. Overall changes lgtm! Minor nits/comments.

As a follow-up, it might make sense to track updating the tutorials/notebooks to use this newer filter arg in the API, but that is not required for this PR.

if file.endswith(tuple(file_extensions)):
    filtered_files.append(file)
else:
    warnings.warn(f"Skipping read for file: {file}")
Collaborator

I wonder if this might get too noisy in some cases. I'm leaning towards warning once if we have to skip, but not for every file we skip.
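
One way the "warn once" suggestion could look is sketched below. The function name filter_files_by_extension appears in this PR's commit list, but the parameter names, the dotted-suffix normalization, and the warning text here are assumptions made for illustration.

```python
import warnings
from typing import List, Union


def filter_files_by_extension(
    files_list: List[str], keep_extensions: Union[str, List[str]]
) -> List[str]:
    """Sketch: keep matching files and warn once if anything was skipped."""
    if isinstance(keep_extensions, str):
        keep_extensions = [keep_extensions]
    suffixes = tuple("." + ext.lstrip(".") for ext in keep_extensions)

    filtered_files = [f for f in files_list if f.endswith(suffixes)]
    skipped = len(files_list) - len(filtered_files)
    if skipped > 0:
        # One summary warning instead of one warning per skipped file.
        warnings.warn(f"Skipped {skipped} file(s) not ending in {suffixes}")
    return filtered_files


# Record warnings so the number emitted can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = filter_files_by_extension(["a.jsonl", "b.crc", "c.crc"], "jsonl")
    warning_count = len(caught)
```

With two skipped files the sketch emits a single warning, which keeps logs readable for directories containing many stray files.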


Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick requested a review from ayushdg November 6, 2024 00:07
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick
Collaborator Author

Thanks @ayushdg ! Updated.

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick
Collaborator Author

Thank you @praateekmahajan ! I have addressed all your comments.

Collaborator

@praateekmahajan praateekmahajan left a comment

LGTM! Thanks for adding type hint as well. (left two small nits on comment / typehint)

@sarahyurick sarahyurick dismissed ayushdg’s stale review November 18, 2024 19:28

Review has been addressed, thanks!

@sarahyurick sarahyurick merged commit d0dd30b into NVIDIA:main Nov 18, 2024
3 checks passed
davzoku pushed a commit to davzoku/NeMo-Curator that referenced this pull request Nov 19, 2024
* filter_files_by_extension function

Signed-off-by: Sarah Yurick <[email protected]>

* add type checking

Signed-off-by: Sarah Yurick <[email protected]>

* add filter_by param to get_all_files_paths_under

Signed-off-by: Sarah Yurick <[email protected]>

* isort

Signed-off-by: Sarah Yurick <[email protected]>

* address ayush's comments

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* trailing whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* more whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* address praateek's review

Signed-off-by: Sarah Yurick <[email protected]>

* praateek's review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>
VibhuJawa pushed a commit that referenced this pull request Nov 19, 2024
* update obsolete flag

Signed-off-by: Walter Teng <[email protected]>

* build: Improve caching (#352)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Run on main (#354)

* ci: Run gpuci on main
* fix checkout

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Run on merge commit (#355)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* build: Add conda env to `$PATH` (#357)

* build: Add conda env to `$PATH`

Signed-off-by: Oliver Koenig <[email protected]>

* test

Signed-off-by: Oliver Koenig <[email protected]>

* add newline

Signed-off-by: Oliver Koenig <[email protected]>

* run cleanup always

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Add `build-test-publish-wheel` CI file (#356)

* Create build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

* Create package_info.py

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* Update __init__.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update package_info.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

* remove extra version string

Signed-off-by: Sarah Yurick <[email protected]>

* Update __init__.py

Signed-off-by: Sarah Yurick <[email protected]>

* add `__all__`

Signed-off-by: Sarah Yurick <[email protected]>

* Fix version

Signed-off-by: oliver könig <[email protected]>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: oliver könig <[email protected]>

* Ko3n1g/sarahyurick/ci/build test publish wheel (#358)

* fix

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

* fix

---------

Signed-off-by: Oliver Koenig <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* run isort

Signed-off-by: Sarah Yurick <[email protected]>

* Update __init__.py

Signed-off-by: Sarah Yurick <[email protected]>

* Update pyproject.toml

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Fix broken TestPyPi builder (#362)

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

* Update Dockerfile

Signed-off-by: Sarah Yurick <[email protected]>

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* chore: Add `CHANGELOG.md` file (#359)

* chore: Add `CHANGELOG.md` file

* fix

* add end of line

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Release workflow (#360)

* add file

Signed-off-by: Sarah Yurick <[email protected]>

* trailing whitespace

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Bump release workflow to allow of `devN` semver (#366)

* ci: Bump release workflow for `devN`

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Add code-freeze workflow (#367)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Add cherry pick workflow (#368)

* ci: Add cherry pick workflow

Signed-off-by: Oliver Koenig <[email protected]>

* fix

Signed-off-by: Oliver Koenig <[email protected]>

---------

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Fix broken NeMo dependencies (#372)

* add packaging

Signed-off-by: Sarah Yurick <[email protected]>

* move to requires

Signed-off-by: Sarah Yurick <[email protected]>

* move to github ci file

Signed-off-by: Sarah Yurick <[email protected]>

* add pin

Signed-off-by: Sarah Yurick <[email protected]>

* add torch

Signed-off-by: Sarah Yurick <[email protected]>

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick <[email protected]>

* try github install

Signed-off-by: Sarah Yurick <[email protected]>

* add comma

Signed-off-by: Sarah Yurick <[email protected]>

* another attempt

Signed-off-by: Sarah Yurick <[email protected]>

* remove nemo toolkit

Signed-off-by: Sarah Yurick <[email protected]>

* add datasets

Signed-off-by: Sarah Yurick <[email protected]>

* try removing cython

Signed-off-by: Sarah Yurick <[email protected]>

* remove cython

Signed-off-by: Sarah Yurick <[email protected]>

* sentencepiece

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* apply ryan's suggestion

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* ci: Bump release workflow (#373)

Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* Skip reading files with incorrect extension (#318)

* filter_files_by_extension function

Signed-off-by: Sarah Yurick <[email protected]>

* add type checking

Signed-off-by: Sarah Yurick <[email protected]>

* add filter_by param to get_all_files_paths_under

Signed-off-by: Sarah Yurick <[email protected]>

* isort

Signed-off-by: Sarah Yurick <[email protected]>

* address ayush's comments

Signed-off-by: Sarah Yurick <[email protected]>

* run black

Signed-off-by: Sarah Yurick <[email protected]>

* trailing whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* more whitespace

Signed-off-by: Sarah Yurick <[email protected]>

* address praateek's review

Signed-off-by: Sarah Yurick <[email protected]>

* praateek's review

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: Walter Teng <[email protected]>

* remove deprecated convert_str_ids args  from ConnectedComponents

Signed-off-by: Walter Teng <[email protected]>

---------

Signed-off-by: Walter Teng <[email protected]>
Signed-off-by: Oliver Koenig <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Signed-off-by: oliver könig <[email protected]>
Signed-off-by: Sarah Yurick <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: Sarah Yurick <[email protected]>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
Signed-off-by: Vinay Raman <[email protected]>
vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
Signed-off-by: Vinay Raman <[email protected]>
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Signed-off-by: Rucha Apte <[email protected]>
ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Signed-off-by: Rucha Apte <[email protected]>
Development

Successfully merging this pull request may close these issues.

DocumentDataset read errors when other files are present in directory
3 participants