[BUG] Semdedup Embedding Restart not working cleanly #211
Comments
@sarahyurick, I think that given your PRs, you should probably just take this on. Happy to provide input as needed. Let me know what you think.
I'm not sure I can reproduce this. I ran:
LMK if there is anything else I should be setting or changing; otherwise we can close this issue.
NVM, the issue is that it should not rerun if the embeddings are already present.
Closed by #327.
Describe the bug
Currently our semdedup restart mechanism for embedding is not working cleanly.
This is because of the following:
add_filename=False in NeMo-Curator/nemo_curator/scripts/semdedup/compute_embeddings.py (lines 62 to 64 in 3a31ab1).
Write to filename is also False in NeMo-Curator/nemo_curator/scripts/semdedup/compute_embeddings.py (line 78 in 3a31ab1).
And get_remaining_files by default can't handle comparing files with different extensions; see NeMo-Curator/nemo_curator/utils/file_utils.py (lines 66 to 80 in 3a31ab1).
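To illustrate the last point, here is a minimal sketch of an extension-insensitive "remaining files" check. It is not the NeMo-Curator implementation: remaining_files_by_stem and its parameters are hypothetical names, and it assumes each input file produces one output file with the same base name (which would only hold if the embeddings were written per input filename).

```python
import os
from pathlib import Path


def remaining_files_by_stem(input_dir, output_dir, input_ext, output_ext):
    """Hypothetical helper (not part of NeMo-Curator).

    Returns the input files that have no output file with the same base
    name, so differing extensions (e.g. .jsonl vs .parquet) do not matter.
    """
    # Base names of outputs that already exist; empty if nothing was written yet.
    done = set()
    if os.path.isdir(output_dir):
        done = {
            Path(name).stem
            for name in os.listdir(output_dir)
            if name.endswith(output_ext)
        }

    # Keep only the inputs whose base name has no finished counterpart.
    return sorted(
        os.path.join(input_dir, name)
        for name in os.listdir(input_dir)
        if name.endswith(input_ext) and Path(name).stem not in done
    )
```

With inputs a.jsonl, b.jsonl, c.jsonl and an existing output a.parquet, this would return only the paths for b.jsonl and c.jsonl, so a restart skips the file whose embeddings are already present.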