Fix DAPT tutorial #503
Signed-off-by: Sarah Yurick <[email protected]>
Minor nit about launching the tutorial in GPU mode (which should be the default). Everything else looks good to me.
tutorials/dapt-curation/README.md
python main.py
# or python main.py --device "gpu"
Is this an "or" here? Should this be the default?
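To make the question concrete, here is a minimal sketch of a command-line parser where GPU is the default device, so that a plain `python main.py` and `python main.py --device "gpu"` behave identically. This is not the tutorial's actual `main.py`: the `build_parser` helper, the allowed choices, and the help text are assumptions for illustration only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real tutorial script may define its flags differently.
    parser = argparse.ArgumentParser(description="Illustrative DAPT curation entry point")
    parser.add_argument(
        "--device",
        choices=["cpu", "gpu"],
        default="gpu",  # assumption: GPU mode is the default, as the review suggests
        help="Device to run the curation pipeline on.",
    )
    return parser


# No flag given: falls back to the default device.
args_default = build_parser().parse_args([])
# Explicit override to CPU.
args_cpu = build_parser().parse_args(["--device", "cpu"])
```

With a default like this, the README would only need to document the flag for the non-default case.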
LGTM
Looks like the failures are caused by flaky |
@@ -37,11 +37,8 @@
)

import nemo_curator as nc
from nemo_curator import ExactDuplicates, Modify, ScoreFilter, Sequential
Hey @sarahyurick - Is it possible to revert these sections that are deleted from the DAPT tutorial?
The semantic deduplication portion of the DAPT tutorial is currently failing with a ValueError("'category' column dtypes are currently not supported by the gpu accelerated parquet writer"). A quick way to avoid this error is to avoid writing the embeddings to disk.
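A different workaround from the one suggested above, in case the embeddings do need to be written, might be to cast categorical columns to their underlying dtype before the write. The sketch below uses pandas as a stand-in for the GPU dataframe; whether the same cast resolves the error in the tutorial's pipeline is an assumption, and the `decategorize` helper and the sample columns are hypothetical.

```python
import pandas as pd


def decategorize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every categorical column cast to its categories' dtype."""
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # Cast back to the dtype of the category labels (e.g. strings).
            out[col] = out[col].astype(out[col].cat.categories.dtype)
    return out


# Hypothetical sample data with one categorical column.
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "file_type": pd.Categorical(["text", "code", "text"]),
    }
)
clean = decategorize(sample)
```

The original frame is left untouched, so the cast can be applied only at the point where the data is written.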