Commit
* Applying SEO Best Pratices (#104)
  * Rename CPUvsGPU.rst to cpuvsgpu.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DataCuration.rsts to datacuration.rsts
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DistributedDataClassification.rst to distributeddataclassification.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DocumentDataset.rst to documentdataset.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename Download.rst to download.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename GpuDeduplication.rst to gpudeduplication.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename KubernetesCurator.rst to kubernetescurator.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename QualityFiltering.rst to qualityfiltering.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename TaskDecontamination.rst to taskdecontamination.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Update index.rst
    Setting all RST files to lowercase names.
    Signed-off-by: Andrew Schilling <[email protected]>
  * Ignore docs for EOF fixer hook
    Signed-off-by: Ayush Dattagupta <[email protected]>
  ---------
  Signed-off-by: Andrew Schilling <[email protected]>
  Signed-off-by: Ayush Dattagupta <[email protected]>
  Co-authored-by: Ayush Dattagupta <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* Shuffle CC result on group before writing out (#110)
  Signed-off-by: Ayush Dattagupta <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst (#113)
  Added links to tutorials
  Signed-off-by: jgerh <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* first commit
  Signed-off-by: avinashvem <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir
  Signed-off-by: avinashvem <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* first commit
  Signed-off-by: avinashvem <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir
  Signed-off-by: avinashvem <[email protected]>
  Signed-off-by: Vibhu Jawa <[email protected]>
* first commit
  Signed-off-by: Vibhu Jawa <[email protected]>
* mv under modules dir
  Signed-off-by: Vibhu Jawa <[email protected]>
* embed by cluster saved
  Signed-off-by: Vibhu Jawa <[email protected]>
* id map script
  Signed-off-by: Vibhu Jawa <[email protected]>
* test commit
  Signed-off-by: Vibhu Jawa <[email protected]>
* add id map script
  Signed-off-by: Vibhu Jawa <[email protected]>
* Cleanup compute_embeddings_crossfit.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* Cleanup compute_embeddings_crossfit.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* Pre-commit style fixes
  Signed-off-by: Vibhu Jawa <[email protected]>
* clustering_dask_crossfit.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* Minor clean up to sort_clusters_crossfit.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* cleanup semdedup_crossfit
  Signed-off-by: Vibhu Jawa <[email protected]>
* Remove undo changes
  Signed-off-by: Vibhu Jawa <[email protected]>
* Remove rename changes
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix rename
  Signed-off-by: Vibhu Jawa <[email protected]>
* Readme formatting
  Signed-off-by: Vibhu Jawa <[email protected]>
* add dask to semdedup_crossfit.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates
  Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates
  Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates
  Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates
  Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates
  Signed-off-by: Vibhu Jawa <[email protected]>
* configure max memory using a cli
  Signed-off-by: Vibhu Jawa <[email protected]>
* Dumb id results to parquet
  Signed-off-by: Vibhu Jawa <[email protected]>
* Embedding fixes
  Signed-off-by: Vibhu Jawa <[email protected]>
* README.md updates
  Signed-off-by: Vibhu Jawa <[email protected]>
* Working end to end
  Signed-off-by: Vibhu Jawa <[email protected]>
* Minor yaml fixes
  Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update .pre-commit-config.yaml
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update fuzzy_dedup.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* Undo changes to docs/personalidentifiableinformationidentificationandremoval.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add end to end script in readme.md
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add type hints
  Signed-off-by: Vibhu Jawa <[email protected]>
* Use dask for sort_clusters
  Signed-off-by: Vibhu Jawa <[email protected]>
* Make sort_clusters work on MNMG scales
  Signed-off-by: Vibhu Jawa <[email protected]>
* Cleaned up dask shutdown
  Signed-off-by: Vibhu Jawa <[email protected]>
* Decrease noise in E2E scripts
  Signed-off-by: Vibhu Jawa <[email protected]>
* Clean up scripts
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix scripts/end_to_end_script.sh
  Signed-off-by: Vibhu Jawa <[email protected]>
* Some more cleanup
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add copyright
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix README.md
  Signed-off-by: Vibhu Jawa <[email protected]>
* Address reviews
  Signed-off-by: Vibhu Jawa <[email protected]>
* Make work with a SemDedupConfig
  Signed-off-by: Vibhu Jawa <[email protected]>
* Make work with SemDedupConfig
  Signed-off-by: Vibhu Jawa <[email protected]>
* Move to nemo-curator's logger
  Signed-off-by: Vibhu Jawa <[email protected]>
* Semdedup-extract_dedup_data.py
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Applying SEO Best Pratices (#104)
  * Rename CPUvsGPU.rst to cpuvsgpu.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DataCuration.rsts to datacuration.rsts
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DistributedDataClassification.rst to distributeddataclassification.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename DocumentDataset.rst to documentdataset.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename Download.rst to download.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename GpuDeduplication.rst to gpudeduplication.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename KubernetesCurator.rst to kubernetescurator.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename LanguageIdentificationUnicodeFormatting.rst to languageidentificationunicodeformatting.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename PersonalIdentifiableInformationIdentificationAndRemoval.rst to personalidentifiableinformationidentificationandremoval.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename QualityFiltering.rst to qualityfiltering.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Rename TaskDecontamination.rst to taskdecontamination.rst
    Signed-off-by: Andrew Schilling <[email protected]>
  * Update index.rst
    Setting all RST files to lowercase names.
    Signed-off-by: Andrew Schilling <[email protected]>
  * Ignore docs for EOF fixer hook
    Signed-off-by: Ayush Dattagupta <[email protected]>
  ---------
  Signed-off-by: Andrew Schilling <[email protected]>
  Signed-off-by: Ayush Dattagupta <[email protected]>
  Co-authored-by: Ayush Dattagupta <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix bad merge
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Update index.rst
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add Module for embedding+clustering
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add sorting to clustering
  Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules
  Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules
  Signed-off-by: Vibhu Jawa <[email protected]>
* Refactor Semdup modules
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix Readme.md
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add a environment variable to silence HF warnings
  Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix
  Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix
  Signed-off-by: Vibhu Jawa <[email protected]>
* dask-cudf fix
  Signed-off-by: Vibhu Jawa <[email protected]>
* Make config a flat file based on reviews
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add docstrings
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix argparse and seed function
  Signed-off-by: Vibhu Jawa <[email protected]>
* Use argparse to read config
  Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files
  Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files
  Signed-off-by: Vibhu Jawa <[email protected]>
* Move around config files
  Signed-off-by: Vibhu Jawa <[email protected]>
* Remove end_to_end_script.sh
  Signed-off-by: Vibhu Jawa <[email protected]>
* Append Readme
  Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews
  Signed-off-by: Vibhu Jawa <[email protected]>
* Change config
  Signed-off-by: Vibhu Jawa <[email protected]>
* Make embedding creation optionally lazy
  Signed-off-by: Vibhu Jawa <[email protected]>
* fix docstring
  Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews and docstrings
  Signed-off-by: Vibhu Jawa <[email protected]>
* Address Reviews and make eps_thresholds a list of values
  Signed-off-by: Vibhu Jawa <[email protected]>
* Minor import fix
  Signed-off-by: Vibhu Jawa <[email protected]>
* Empty Commit
  Signed-off-by: Vibhu Jawa <[email protected]>
* Add modules to __init__ and README.md
  Signed-off-by: Vibhu Jawa <[email protected]>
* Fix init
  Signed-off-by: Vibhu Jawa <[email protected]>
* Move comment
  Signed-off-by: Vibhu Jawa <[email protected]>
* Empty commit to restart CI (which failed due to a download issue)
  Signed-off-by: Vibhu Jawa <[email protected]>
* Empty commit to restart CI (which failed due to a download issue)
  Signed-off-by: Vibhu Jawa <[email protected]>
---------
Signed-off-by: Andrew Schilling <[email protected]>
Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Vibhu Jawa <[email protected]>
Signed-off-by: jgerh <[email protected]>
Signed-off-by: avinashvem <[email protected]>
Co-authored-by: Andrew Schilling <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Co-authored-by: jgerh <[email protected]>
Co-authored-by: avinashvem <[email protected]>