diff --git a/docs/user-guide/semdedup.rst b/docs/user-guide/semdedup.rst index 80a73868..269432f8 100644 --- a/docs/user-guide/semdedup.rst +++ b/docs/user-guide/semdedup.rst @@ -33,7 +33,7 @@ The SemDeDup algorithm consists of the following main steps: Configure Semantic Deduplication ----------------------------------------- -Semantic deduplication in NeMo Curator can be configured using a YAML file. Here's an example `sem_dedup_config.yaml`: +Semantic deduplication in NeMo Curator can be configured using a YAML file. Here's an example ``sem_dedup_config.yaml``: .. code-block:: yaml @@ -77,7 +77,7 @@ Change Embedding Models ----------------------------------------- One of the key advantages of the semantic deduplication module is its flexibility in using different pre-trained models for embedding generation. -You can easily change the embedding model by modifying the `embedding_model_name_or_path` parameter in the configuration file. +You can easily change the embedding model by modifying the ``embedding_model_name_or_path`` parameter in the configuration file. For example, to use a different sentence transformer model, you could change: @@ -99,7 +99,7 @@ The module supports various types of models, including: When changing the model, ensure that: 1. The model is compatible with the data type you're working with (primarily text for this module). -2. You adjust the `embedding_batch_size` and `embedding_max_mem_gb` parameters as needed, as different models may have different memory requirements. +2. You adjust the ``embedding_batch_size`` and ``embedding_max_mem_gb`` parameters as needed, as different models may have different memory requirements. 3. The chosen model is appropriate for the language or domain of your dataset. By selecting an appropriate embedding model, you can optimize the semantic deduplication process for your specific use case and potentially improve the quality of the deduplicated dataset. @@ -118,32 +118,34 @@ The semantic deduplication process is controlled by two key threshold parameters eps_to_extract: 0.01 -1. `eps_thresholds`: A list of similarity thresholds used to compute semantic matches. Each threshold represents a different level of strictness in determining duplicates. +1. ``eps_thresholds``: A list of similarity thresholds used to compute semantic matches. Each threshold represents a different level of strictness in determining duplicates. Lower values are more strict, requiring higher similarity for documents to be considered duplicates. -2. `eps_to_extract`: The specific threshold used for the final extraction of deduplicated data. - This value must be one of the thresholds listed in `eps_thresholds`. +2. ``eps_to_extract``: The specific threshold used for the final extraction of deduplicated data. + This value must be one of the thresholds listed in ``eps_thresholds``. This two-step approach offers several advantages: + * Flexibility to compute matches at multiple thresholds without rerunning the entire process. * Ability to analyze the impact of different thresholds on your dataset. * Option to fine-tune the final threshold based on specific needs without recomputing all matches. When choosing appropriate thresholds: + * Lower thresholds (e.g., 0.001): More strict, resulting in less deduplication but higher confidence in the identified duplicates. * Higher thresholds (e.g., 0.1): Less strict, leading to more aggressive deduplication but potentially removing documents that are only somewhat similar. 
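+For intuition, here is a minimal sketch of how an ``eps`` value can be read in terms of cosine similarity: a pair of embeddings counts as a semantic match when its cosine distance falls below ``eps``. This is an illustration of the thresholding idea only (the helper name is hypothetical), not the library's internal matching code:
+
+.. code-block:: python
+
+    import numpy as np
+
+    def is_semantic_match(emb_a: np.ndarray, emb_b: np.ndarray, eps: float) -> bool:
+        # Cosine similarity between the two embeddings.
+        cos_sim = float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
+        # A smaller eps requires a higher similarity for the pair to count as a match.
+        return (1.0 - cos_sim) < eps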
We recommend that you experiment with different threshold values to find the optimal balance between data reduction and maintaining dataset diversity and quality. The impact of these thresholds can vary depending on the nature and size of your dataset. -Remember, if you want to extract data using a threshold that's not in `eps_thresholds`, you'll need to recompute the semantic matches with the new threshold included in the list. +Remember, if you want to extract data using a threshold that's not in ``eps_thresholds``, you'll need to recompute the semantic matches with the new threshold included in the list. ----------------------------------------- Usage ----------------------------------------- Before running semantic deduplication, ensure that each document/datapoint in your dataset has a unique identifier. -You can use the `add_id` module from NeMo Curator if needed: +You can use the ``add_id`` module from NeMo Curator if needed: .. code-block:: python @@ -156,7 +158,7 @@ You can use the `add_id` module from NeMo Curator if needed: id_dataset.to_json("output_file_path", write_to_filename=True) -To perform semantic deduplication, you can either use individual components or the SemDedup class with a configuration file: +To perform semantic deduplication, you can either use individual components or the SemDedup class with a configuration file. Use Individual Components ########################## @@ -246,12 +248,12 @@ Parameters Key parameters in the configuration file include: -- `embedding_model_name_or_path`: Path or identifier for the pre-trained model used for embedding generation. -- `embedding_max_mem_gb`: Maximum memory usage for the embedding process. -- `embedding_batch_size`: Number of samples to process in each embedding batch. -- `n_clusters`: Number of clusters for k-means clustering. -- `eps_to_extract`: Deduplication threshold. Higher values result in more aggressive deduplication. -- `which_to_keep`: Strategy for choosing which duplicate to keep ("hard" or "soft"). +- ``embedding_model_name_or_path``: Path or identifier for the pre-trained model used for embedding generation. +- ``embedding_max_mem_gb``: Maximum memory usage for the embedding process. +- ``embedding_batch_size``: Number of samples to process in each embedding batch. +- ``n_clusters``: Number of clusters for k-means clustering. +- ``eps_to_extract``: Deduplication threshold. Higher values result in more aggressive deduplication. +- ``which_to_keep``: Strategy for choosing which duplicate to keep ("hard" or "soft"). ----------------------------------------- Output ----------------------------------------- @@ -271,7 +273,7 @@ Performance Considerations Semantic deduplication is computationally intensive, especially for large datasets. However, the benefits in terms of reduced training time and improved model performance often outweigh the upfront cost. Consider the following: - Use GPU acceleration for faster embedding generation and clustering. -- Adjust the number of clusters (`n_clusters`) based on your dataset size and available computational resources. -- The `eps_to_extract` parameter allows you to control the trade-off between dataset size reduction and potential information loss. +- Adjust the number of clusters (``n_clusters``) based on your dataset size and available computational resources. +- The ``eps_to_extract`` parameter allows you to control the trade-off between dataset size reduction and potential information loss.
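+These parameters can also be tuned programmatically. The sketch below is a hedged example: it assumes that ``SemDedupConfig`` exposes the YAML fields as attributes, that both classes are importable from ``nemo_curator``, and that ``SemDedup`` can be applied directly to a ``DocumentDataset``. Treat it as a starting point rather than a definitive recipe:
+
+.. code-block:: python
+
+    from nemo_curator import SemDedup, SemDedupConfig
+    from nemo_curator.datasets import DocumentDataset
+
+    # Load the YAML configuration described earlier.
+    config = SemDedupConfig.from_yaml("sem_dedup_config.yaml")
+
+    # Example adjustments: more clusters for a larger dataset, and an
+    # extraction threshold that must be one of config.eps_thresholds.
+    config.n_clusters = 2000
+    config.eps_to_extract = 0.01
+
+    # GPU-backed dataset, since embedding generation and clustering run on GPU.
+    dataset = DocumentDataset.read_json("input_file_path", backend="cudf")
+
+    sem_dedup = SemDedup(config)
+    dedup_ids = sem_dedup(dataset)  # IDs produced by the pipeline; see the Output section.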
For more details on the algorithm and its performance implications, refer to the original paper: `SemDeDup: Data-efficient learning at web-scale through semantic deduplication `_ by Abbas et al. diff --git a/nemo_curator/scripts/semdedup/README.md b/nemo_curator/scripts/semdedup/README.md index 68bc1680..d27aeea1 100644 --- a/nemo_curator/scripts/semdedup/README.md +++ b/nemo_curator/scripts/semdedup/README.md @@ -1,25 +1,26 @@ # SemDeDup Pipeline This pipeline is used to cluster and deduplicate data points based on their embeddings. -Please edit "semdedup_config.yaml" to configure the pipeline and run it using the following commands. +Please edit `config/sem_dedup_config.yaml` to configure the pipeline and run it using the following commands. ## Pipeline Steps -1) Modify "semdedup_config.yaml" +1) Modify `config/sem_dedup_config.yaml` 2) Compute embeddings: ```sh python compute_embeddings.py --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --input-file-extension "json" --config-file "$CONFIG_FILE" ``` - **Input:** `config.embeddings.input_data_dir/*.jsonl` and output from step (2) - **Output:** Embedding parquet files in the embedding directory + **Input:** `input_data_dir/*.jsonl` and YAML file from step (1) + + **Output:** Embedding Parquet files in the `{config.cache_dir}/{config.embeddings_save_loc}` directory 3) Clustering ```sh python clustering.py --config-file "$CONFIG_FILE" ``` - **Input:** Output from step (3) + **Input:** Output from step (2) and YAML file from step (1) **Output:** Under `{config.cache_dir}/{config.clustering_save_loc}` directory, including: @@ -27,14 +28,10 @@ Please edit "semdedup_config.yaml" to configure the pipeline and run it using th - `embs_by_nearest_center` directory, containing `nearest_cent={x}` where x ranges from 0 to `num_clusters - 1` - Parquet files within `embs_by_nearest_center/nearest_cent={x}` containing the data points in each cluster - -3) Extract deduplicated data +4) Extract deduplicated data ```sh python extract_dedup_data.py --config-file "$CONFIG_FILE" ``` - **Input:** Output from step (3) - **Output:** `{config.cache_dir}/{config.clustering_save_loc}/unique_ids_{}.parquet` + **Input:** Output from step (3) and YAML file from step (1) -## End to End Script - -python3 end_to_end_example.py --input-data-dir "$INPUT_DATA_DIR" --input-file-type "jsonl" --config-file "$CONFIG_FILE" + **Output:** `{config.cache_dir}/{config.clustering_save_loc}/unique_ids_{}.parquet`
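+## Working with the output (sketch)
+
+The snippet below is a rough, assumption-based sketch of joining the extracted IDs back to the original JSONL files to materialize the deduplicated dataset. The exact Parquet file name (the `{}` placeholder above is filled in by the pipeline) and the ID column name depend on your configuration, so adjust both before running:
+
+```python
+import dask.dataframe as dd
+
+# Assumed path and column name; match them to
+# {config.cache_dir}/{config.clustering_save_loc} and to your dataset's ID field.
+kept_ids = (
+    dd.read_parquet("cache/cluster_output/unique_ids_0.01.parquet")["id"]
+    .compute()
+    .tolist()
+)
+
+docs = dd.read_json("input_data_dir/*.jsonl", lines=True)
+deduped = docs[docs["id"].isin(kept_ids)]
+deduped.to_json("deduplicated_output")
+```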