Commit

update docs
Signed-off-by: Sarah Yurick <[email protected]>
sarahyurick committed Oct 25, 2024
1 parent c60c6e6 commit 08349d5
Showing 2 changed files with 5 additions and 4 deletions.
3 changes: 2 additions & 1 deletion docs/user-guide/semdedup.rst
@@ -41,7 +41,7 @@ Semantic deduplication in NeMo Curator can be configured using a YAML file. Here
cache_dir: "semdedup_cache"
num_files: -1
id_col_name: "id"
-id_col_type: "int"
+id_col_type: "int" # or "str" if the `add_id` module was used
input_column: "text"
# Embeddings configuration
@@ -159,6 +159,7 @@ You can use the ``add_id`` module from NeMo Curator if needed:
To perform semantic deduplication, you can either use individual components or the SemDedup class with a configuration file.
+Please note that if you use the ``add_id`` module, then the ``id_col_type`` in your configuration file should be a "str".

Use Individual Components
##########################
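
The `id_col_type` note in this file's change can be illustrated with a minimal sketch. It assumes each input record is a JSON object and that the ID lives under the `id_col_name` key from the configuration. The helper name `infer_id_col_type` and the sample records are hypothetical and are not part of NeMo Curator.

```python
import json
from typing import Any, Mapping


def infer_id_col_type(record: Mapping[str, Any], id_col_name: str = "id") -> str:
    """Return "int" or "str" to match the ID value found in one record.

    Hypothetical helper for illustration only; not a NeMo Curator API.
    """
    value = record[id_col_name]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError(f"unsupported ID type: {type(value).__name__}")
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "str"
    raise TypeError(f"unsupported ID type: {type(value).__name__}")


# Hypothetical sample records, not taken from the commit.
numeric_record = json.loads('{"id": 42, "text": "hello"}')
string_record = json.loads('{"id": "doc-42", "text": "hello"}')

print(infer_id_col_type(numeric_record))
print(infer_id_col_type(string_record))
```

Running a check like this on one record before editing the YAML file is a cheap way to confirm that `id_col_type` agrees with the data, including the string IDs that the note above attributes to the ``add_id`` module.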
6 changes: 3 additions & 3 deletions nemo_curator/scripts/semdedup/README.md
@@ -10,15 +10,15 @@ Please edit `config/sem_dedup_config.yaml` to configure the pipeline and run it

2) Compute embeddings:
```sh
-python compute_embeddings.py --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --input-file-extension "json" --config-file "$CONFIG_FILE"
+semdedup_extract_embeddings --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --input-file-extension "json" --config-file "$CONFIG_FILE"
```
**Input:** `input_data_dir/*.jsonl` and YAML file from step (1)

**Output:** Embedding Parquet files in the `{config.cache_dir}/{config.embeddings_save_loc}` directory

3) Clustering
```sh
-python clustering.py --config-file "$CONFIG_FILE"
+semdedup_clustering --config-file "$CONFIG_FILE"
```
**Input:** Output from step (2) and YAML file from step (1)

@@ -30,7 +30,7 @@ Please edit `config/sem_dedup_config.yaml` to configure the pipeline and run it

4) Extract deduplicated data
```sh
-python extract_dedup_data.py --config-file "$CONFIG_FILE"
+semdedup_extract_dedup_ids --config-file "$CONFIG_FILE"
```
**Input:** Output from step (3) and YAML file from step (1)
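
The three renamed entry points in this README can be chained from a small driver. The sketch below only builds the argument lists, using the command names and flags visible in this diff. The function name `semdedup_commands` and the example paths are hypothetical, and actually running the commands assumes the entry points are installed on `PATH`.

```python
import shlex
from typing import List


def semdedup_commands(input_data_dir: str, config_file: str) -> List[List[str]]:
    """Build argv lists for steps 2 to 4, in the order shown in the README diff.

    Hypothetical helper; command names and flags are copied from the diff.
    """
    return [
        # Step 2: compute embeddings.
        [
            "semdedup_extract_embeddings",
            "--input-data-dir", input_data_dir,
            "--input-file-type", "jsonl",
            "--input-file-extension", "json",
            "--config-file", config_file,
        ],
        # Step 3: clustering.
        ["semdedup_clustering", "--config-file", config_file],
        # Step 4: extract deduplicated data.
        ["semdedup_extract_dedup_ids", "--config-file", config_file],
    ]


# Example paths are placeholders; print each command as a shell-quoted line.
for argv in semdedup_commands("input_data", "config/sem_dedup_config.yaml"):
    print(shlex.join(argv))
```

To execute the steps rather than print them, each list could be passed to `subprocess.run(argv, check=True)`, which stops the sequence at the first failing step.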
