update docs

Signed-off-by: Sarah Yurick <[email protected]>
NVIDIA · Oct 25, 2024 · 08349d5
1 parent c60c6e6
commit 08349d5

Showing 2 changed files with 5 additions and 4 deletions.
diff --git a/docs/user-guide/semdedup.rst b/docs/user-guide/semdedup.rst
@@ -41,7 +41,7 @@ Semantic deduplication in NeMo Curator can be configured using a YAML file. Here
     cache_dir: "semdedup_cache"
     num_files: -1
     id_col_name: "id"
-    id_col_type: "int"
+    id_col_type: "int" # or "str" if the `add_id` module was used
     input_column: "text"
 
     # Embeddings configuration
@@ -159,6 +159,7 @@ You can use the ``add_id`` module from NeMo Curator if needed:
 
 
 To perform semantic deduplication, you can either use individual components or the SemDedup class with a configuration file.
+Please note that if you use the ``add_id`` module, then the ``id_col_type`` in your configuration file should be a "str".
 
 Use Individual Components
 ##########################

diff --git a/nemo_curator/scripts/semdedup/README.md b/nemo_curator/scripts/semdedup/README.md
@@ -10,15 +10,15 @@ Please edit `config/sem_dedup_config.yaml` to configure the pipeline and run it
 
 2) Compute embeddings:
     ```sh
-    python compute_embeddings.py --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --input-file-extension "json" --config-file "$CONFIG_FILE"
+    semdedup_extract_embeddings --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --input-file-extension "json" --config-file "$CONFIG_FILE"
     ```
     **Input:** `input_data_dir/*.jsonl` and YAML file from step (1)
 
     **Output:** Embedding Parquet files in the `{config.cache_dir}/{config.embeddings_save_loc}` directory
 
 3) Clustering
     ```sh
-    python clustering.py --config-file "$CONFIG_FILE"
+    semdedup_clustering --config-file "$CONFIG_FILE"
     ```
     **Input:** Output from step (2) and YAML file from step (1)
 
@@ -30,7 +30,7 @@ Please edit `config/sem_dedup_config.yaml` to configure the pipeline and run it
 
 4) Extract deduplicated data
     ```sh
-    python extract_dedup_data.py --config-file "$CONFIG_FILE"
+    semdedup_extract_dedup_ids --config-file "$CONFIG_FILE"
     ```
     **Input:** Output from step (3) and YAML file from step (1)
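
The documentation change above says that `id_col_type` should be `"str"` when the `add_id` module was used, and `"int"` otherwise. The sketch below shows one way to check a few sample IDs against that setting before running the pipeline. It is not part of NeMo Curator: the helper names `infer_id_col_type` and `check_id_col_type`, the plain-dict config, and the sample ID values are all assumptions made for illustration.

```python
# Hypothetical pre-flight check; none of these helpers exist in NeMo Curator.

def infer_id_col_type(sample_ids):
    """Return "int" or "str" depending on the Python type of the sample IDs."""
    if not sample_ids:
        raise ValueError("need at least one sample ID")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in sample_ids):
        return "int"
    if all(isinstance(v, str) for v in sample_ids):
        return "str"
    raise ValueError("sample IDs have mixed or unsupported types")


def check_id_col_type(config, sample_ids):
    """Raise ValueError if config["id_col_type"] disagrees with the sample IDs."""
    expected = config["id_col_type"]
    actual = infer_id_col_type(sample_ids)
    if expected != actual:
        raise ValueError(
            f'id_col_type is "{expected}" but the sample IDs look like "{actual}"'
        )
    return actual


# Assumed config fragment, mirroring the keys shown in the YAML hunk above.
config = {"id_col_name": "id", "id_col_type": "str"}

# Made-up string IDs standing in for IDs that were generated as strings.
check_id_col_type(config, ["doc-0000000001", "doc-0000000002"])
```

A check like this only inspects the values it is handed, so it is meant as a quick sanity test on a small sample rather than a validation of the whole dataset.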