Fix gpu_exact_dups due to deprecated flag #351

Merged: 1 commit, Nov 13, 2024


@davzoku (Contributor) commented Nov 12, 2024

Description

This PR fixes the gpu_exact_dups failure reported in issue #350.

Referencing these 2 commits:

I noticed that args.no_gpu is a deprecated argument; however, during the refactoring process, one reference to it was left behind. This prevents gpu_exact_dups from executing correctly.
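A minimal sketch of this kind of failure, not the actual NeMo-Curator code: the helper names `build_parser`, `uses_gpu_stale`, and `uses_gpu_fixed` are hypothetical, and the assumption is that the parser now exposes only a `--device` option while one code path still reads the removed `no_gpu` attribute.

```python
import argparse


def build_parser():
    # Hypothetical parser: after the refactor only --device exists,
    # the old --no-gpu flag is no longer registered.
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", choices=["cpu", "gpu"], default="gpu")
    return parser


def uses_gpu_stale(args):
    # Leftover reference to the removed flag: the Namespace has no
    # attribute `no_gpu`, so this raises AttributeError at runtime.
    return not args.no_gpu


def uses_gpu_fixed(args):
    # Updated logic reads the replacement option instead.
    return args.device == "gpu"


args = build_parser().parse_args(["--device=gpu"])

stale_error = None
try:
    uses_gpu_stale(args)
except AttributeError as exc:
    stale_error = exc

print(type(stale_error).__name__)
print(uses_gpu_fixed(args))
```

Because argparse only creates attributes for registered arguments, a stale reference like this fails only when that code path runs, which is why it can survive a refactor unnoticed.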

Usage

Prereqs:

add_id has been run on the .jsonl files to add an ID field for deduplication
existing .jsonl files are in the input directory, e.g. books_dedup/

gpu_exact_dups \
 --input-json-id-field="nemo_id" \
 --input-data-dir=books_dedup/ \
 --output-dir=exact_dedup \
 --device=gpu \
> logs/exact_dedup2.log 2>&1
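The prerequisites above can be sanity-checked before running the command. The following is a rough sketch, not part of NeMo-Curator: the helper `find_records_missing_id` is hypothetical, and it assumes each input file holds one JSON object per line with the ID stored under the field passed to `--input-json-id-field` (here `nemo_id`).

```python
import json
import tempfile
from pathlib import Path


def find_records_missing_id(data_dir, id_field="nemo_id"):
    """Return (file name, line number) pairs for JSONL records lacking id_field."""
    missing = []
    for path in sorted(Path(data_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line_no, line in enumerate(handle, start=1):
                if not line.strip():
                    continue  # ignore blank lines
                record = json.loads(line)
                if id_field not in record:
                    missing.append((path.name, line_no))
    return missing


# Demo on a throwaway directory: the second record has no ID field.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        json.dumps({"nemo_id": "doc-0", "text": "a"}) + "\n"
        + json.dumps({"text": "b"}) + "\n",
        encoding="utf-8",
    )
    demo_missing = find_records_missing_id(tmp)

print(demo_missing)
```

An empty result for the real input directory (for example `find_records_missing_id("books_dedup/")`) would suggest the ID step has been applied to every record.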

Checklist

Results

After this commit, gpu_exact_dups runs to completion.

@VibhuJawa (Collaborator) left a comment

Thanks for fixing this. The PR looks good to me, but you will have to sign your commits and sign them off under the DCO.

1) Make sure your PR does one thing. Have a clear answer to "What does this PR do?".
2) Read General Principles and style guide above
3) Make sure you sign your commits. E.g. use ``git commit -sS`` when committing.
   1) If you forget to do this, please follow the steps below to undo the commits and reapply the changes under a new (signed and signed-off) commit. Note: This will preserve your changes, but delete the git history of commits.
```bash
git reset --soft HEAD~N
git add <insert all files you want to include>
git commit -sS -m "My commit message"
git push --force
```

See the failing check below for how to fix this:
https://github.com/NVIDIA/NeMo-Curator/pull/351/checks?check_run_id=32854401493

@davzoku (Contributor, Author) commented Nov 12, 2024

Hi @VibhuJawa, I have done the sign-off. Let me know if there are any other issues. Thank you!

@VibhuJawa (Collaborator) commented

@davzoku, the commits are still not signed; I think we missed the -S flag. I recommend following the guide below:

https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits
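The two flags do different things: `git commit -s` appends a `Signed-off-by:` trailer to the commit message, while `-S` attaches a cryptographic signature to the commit object. As a rough illustration of why a sign-off alone does not make a commit signed, the hypothetical helper below checks only for the trailer in the message text; the author name and email are made up, and a signature cannot be detected this way because it is not part of the message.

```python
import re

# Matches a sign-off trailer line such as the one `git commit -s` appends.
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <.+>$", re.MULTILINE)


def has_signoff(commit_message):
    """Return True if the message contains a Signed-off-by trailer line."""
    return bool(SIGNOFF_RE.search(commit_message))


# Hypothetical commit messages for illustration only.
signed_off = "Fix stale flag\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
plain = "Fix stale flag\n"

print(has_signoff(signed_off))
print(has_signoff(plain))
```

A commit can pass this check and still show as unsigned, which matches the situation in this thread where the sign-off was present but the signature was missing.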

Signed-off-by: Walter Teng <[email protected]>
@davzoku (Contributor, Author) commented Nov 13, 2024

@VibhuJawa, the commit should be signed now

@ayushdg ayushdg added the bugfix Fixes a bug in the codebase label Nov 13, 2024
@VibhuJawa VibhuJawa merged commit cdfb47a into NVIDIA:main Nov 13, 2024
5 checks passed
@VibhuJawa (Collaborator) commented

@davzoku, thanks a lot for your contribution. I've merged the fix. Appreciate it.

vinay-raman pushed a commit to vinay-raman/NeMo-Curator that referenced this pull request Nov 26, 2024
Signed-off-by: Walter Teng <[email protected]>
Signed-off-by: Vinay Raman <[email protected]>
@ms-leemina commented Dec 12, 2024

@VibhuJawa @davzoku Has this update been applied to the Docker image? With the nvcr.io/nvidia/nemo:24.09 image, exact deduplication still does not work.

ruchaa-apte pushed a commit to ruchaa-apte/NeMo-Curator that referenced this pull request Dec 13, 2024
Signed-off-by: Walter Teng <[email protected]>
Signed-off-by: Rucha Apte <[email protected]>