Improvements for semantic deduplication and DAPT tutorial #564

Merged: 6 commits merged into NVIDIA:main on Feb 24, 2025

sarahyurick
Collaborator

Closes #341.
Closes #505.

Signed-off-by: Sarah Yurick <[email protected]>
Comment on lines +232 to +235
# category column dtypes are not supported by the GPU-accelerated Parquet writer
for col in embedding_ddf.columns:
    if embedding_ddf[col].dtype.name == "category":
        embedding_ddf[col] = embedding_ddf[col].astype("str")
Collaborator Author

Closes #505.
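
The cast in the snippet above can be tried in isolation. The following is a minimal sketch, assuming a plain pandas DataFrame as a stand-in for the Dask/cuDF `embedding_ddf` in the PR; the helper name `cast_category_columns_to_str` and the sample columns are hypothetical and are not part of the PR.

```python
import pandas as pd


def cast_category_columns_to_str(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every category-dtype column cast to str.

    Hypothetical helper: it mirrors the loop in the PR snippet, but runs on
    a pandas DataFrame rather than the Dask/cuDF frame used there.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype.name == "category":
            out[col] = out[col].astype("str")
    return out


# Sample data (assumed for illustration): one integer and one category column.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "source": pd.Categorical(["web", "news", "web"]),
    }
)

converted = cast_category_columns_to_str(df)
print(converted.dtypes)
```

Only columns whose dtype name is `category` are touched, so numeric and string columns pass through unchanged. Whether the same loop behaves identically on a GPU-backed frame depends on the cuDF and Dask versions in use.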

@@ -6,7 +6,7 @@ num_files: 16
 embeddings_save_loc: "embeddings"
 embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
 embedding_batch_size: 128
-write_embeddings_to_disk: false
+write_embeddings_to_disk: true
Collaborator Author

Since this closes #505, we can switch this back to true.

- dataset: DocumentDataset, sem_dedupe_config_yaml_path: str, cache_dir: str
+ dataset: DocumentDataset, sem_dedupe_config_yaml_path: str,
Collaborator Author

Remove the unused cache_dir parameter.

Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added the gpuci Run GPU CI/CD on PR label Feb 21, 2025
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 21, 2025
Signed-off-by: Sarah Yurick <[email protected]>
@sarahyurick sarahyurick added gpuci Run GPU CI/CD on PR and removed gpuci Run GPU CI/CD on PR labels Feb 24, 2025
Collaborator

@VibhuJawa VibhuJawa left a comment

Thanks again for the quick fixes. This mostly looks good; the only nits are around variable naming and error message details.

Signed-off-by: Sarah Yurick <[email protected]>
Collaborator

@VibhuJawa VibhuJawa left a comment

LGTM

@sarahyurick sarahyurick merged commit 119edd4 into NVIDIA:main Feb 24, 2025
4 checks passed