* edit docs

Signed-off-by: Sarah Yurick <[email protected]>

* fix bullet pts

Signed-off-by: Sarah Yurick <[email protected]>

---------

Signed-off-by: Sarah Yurick <[email protected]>
1 parent d539cac, commit 8d9ba84
Showing 2 changed files with 28 additions and 29 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,40 +1,37 @@ | ||
# SemDeDup Pipeline | ||
|
||
This pipeline is used to cluster and deduplicate data points based on their embeddings. | ||
Please edit "semdedup_config.yaml" to configure the pipeline and run it using the following commands. | ||
Please edit `config/sem_dedup_config.yaml` to configure the pipeline and run it using the following commands. | ||
|
||
|
||
## Pipeline Steps | ||
|
||
1) Modify "semdedup_config.yaml" | ||
1) Modify `config/sem_dedup_config.yaml` | ||
|
||
2) Compute embeddings: | ||
```sh | ||
python compute_embeddings.py --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --input-file-extension "json" --config-file "$CONFIG_FILE" | ||
``` | ||
**Input:** `config.embeddings.input_data_dir/*.jsonl` and output from step (2) | ||
**Output:** Embedding parquet files in the embedding directory | ||
**Input:** `input_data_dir/*.jsonl` and YAML file from step (1) | ||
|
||
**Output:** Embedding Parquet files in the `{config.cache_dir}/{config.embeddings_save_loc}` directory | ||
|
||
3) Clustering | ||
```sh | ||
python clustering.py --config-file "$CONFIG_FILE" | ||
``` | ||
**Input:** Output from step (3) | ||
**Input:** Output from step (2) and YAML file from step (1) | ||
|
||
**Output:** Under `{config.cache_dir}/{config.clustering_save_loc}` directory, including: | ||
|
||
- `kmeans_centroids.npy` | ||
- `embs_by_nearest_center` directory, containing `nearest_cent={x}` where x ranges from 0 to `num_clusters - 1` | ||
- Parquet files within `embs_by_nearest_center/nearest_cent={x}` containing the data points in each cluster | ||
|
||
|
||
3) Extract deduplicated data | ||
4) Extract deduplicated data | ||
```sh | ||
python extract_dedup_data.py --config-file "$CONFIG_FILE" | ||
``` | ||
**Input:** Output from step (3) | ||
**Output:** `{config.cache_dir}/{config.clustering_save_loc}/unique_ids_{}.parquet` | ||
**Input:** Output from step (3) and YAML file from step (1) | ||
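The three step commands above must run in order, since each step consumes the previous step's output. A minimal sketch of chaining them from Python follows. The script names and flags are copied from the README, while the helper name `build_pipeline_commands` and the example paths are hypothetical, chosen only for illustration.

```python
import shlex


def build_pipeline_commands(input_data_dir, config_file):
    """Return the argv lists for steps 2-4, in the order they must run."""
    return [
        # Step 2: compute embeddings from the JSONL input files.
        [
            "python", "compute_embeddings.py",
            "--input-data-dir", input_data_dir,
            "--input-file-type", "jsonl",
            "--input-file-extension", "json",
            "--config-file", config_file,
        ],
        # Step 3: cluster the embeddings written by step 2.
        ["python", "clustering.py", "--config-file", config_file],
        # Step 4: extract the deduplicated data from the clusters.
        ["python", "extract_dedup_data.py", "--config-file", config_file],
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    commands = build_pipeline_commands("/data/input", "config/sem_dedup_config.yaml")
    for argv in commands:
        # Print a shell-safe rendering of each command instead of running it.
        print(shlex.join(argv))
```

To execute the steps rather than print them, each `argv` could be passed to `subprocess.run(argv, check=True)`, which stops at the first failing step.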
|
||
## End to End Script | ||
|
||
python3 end_to_end_example.py --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --config-file "$CONFIG_FILE" | ||
**Output:** `{config.cache_dir}/{config.clustering_save_loc}/unique_ids_{}.parquet` |
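The output locations named in the steps above are all derived from a few config values. The sketch below shows how those paths could be assembled, assuming the config keys `cache_dir`, `embeddings_save_loc`, `clustering_save_loc`, and `num_clusters` are available as plain values. The function name `expected_outputs` and the example values are hypothetical, and the `{}` placeholder in `unique_ids_{}.parquet` is not specified in the README, so it is represented only as a glob pattern.

```python
from pathlib import Path


def expected_outputs(cache_dir, embeddings_save_loc, clustering_save_loc, num_clusters):
    """Assemble the output paths described in the README from config values."""
    cache = Path(cache_dir)
    clustering = cache / clustering_save_loc
    by_center = clustering / "embs_by_nearest_center"
    return {
        # Step 2 output: directory of embedding Parquet files.
        "embeddings_dir": cache / embeddings_save_loc,
        # Step 3 output: centroids plus one directory per cluster.
        "centroids": clustering / "kmeans_centroids.npy",
        "cluster_dirs": [by_center / f"nearest_cent={x}" for x in range(num_clusters)],
        # Step 4 output: the `{}` placeholder is unspecified in the README,
        # so a glob pattern stands in for it here.
        "unique_ids_pattern": str(clustering / "unique_ids_*.parquet"),
    }


if __name__ == "__main__":
    # Hypothetical config values, for illustration only.
    outputs = expected_outputs("/tmp/cache", "embeddings", "clustering_results", 3)
    for name, value in outputs.items():
        print(name, value)
```

A check like this can help confirm that each step wrote its output where the next step expects to find it.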