[Tutorials] Synthetic data generation + reward model + curation for PEFT (NVIDIA#157)

This tutorial demonstrates the use of NeMo Curator's Python API for data curation, as well as
synthetic data generation, to prepare a dataset for parameter-efficient fine-tuning (PEFT) of LLMs.

Signed-off-by: Mehran Maghoumi <[email protected]>
Maghoumi authored Jul 23, 2024
1 parent a8ed614 commit 9452189
Showing 7 changed files with 1,015 additions and 0 deletions.
57 changes: 57 additions & 0 deletions tutorials/peft-curation-with-sdg/README.md
@@ -0,0 +1,57 @@
# Curating Datasets for Parameter Efficient Fine-tuning with Synthetic Data Generation

This tutorial demonstrates the use of NeMo Curator's Python API for data curation, synthetic
data generation, and qualitative score assignment to prepare a dataset for parameter-efficient fine-tuning (PEFT) of LLMs.

We demonstrate the pipeline using the [Law StackExchange dataset](https://huggingface.co/datasets/ymoslem/Law-StackExchange),
a dataset of legal questions and answers. Each record consists of a question, some accompanying
context, and human-provided answers.

In this tutorial, we apply various filtering and processing operations to the records. We then
demonstrate the use of external LLM services for synthetic data generation, as well as reward
models that assign qualitative metrics to each synthetic record. We further use NeMo Curator's
facilities to iteratively augment and refine the data until the dataset reaches the desired size,
as sketched below.
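
The sketch below is a minimal, self-contained illustration of that loop; `generate_synthetic_records` and `curate` are hypothetical stand-ins for the LLM-backed generation and curation steps implemented in `main.py`:

```python
# Minimal, hypothetical sketch of the iterative augmentation loop.
# Both helpers are illustrative stand-ins, not NeMo Curator APIs.
def generate_synthetic_records(seeds):
    # Stand-in: a real implementation would prompt an external LLM service.
    return [{"text": seed["text"] + " (synthetic variant)"} for seed in seeds]


def curate(records):
    # Stand-in: a real implementation would apply NeMo Curator filters,
    # reward-model scoring, and deduplication.
    seen, unique = set(), []
    for record in records:
        if record["text"] not in seen:
            seen.add(record["text"])
            unique.append(record)
    return unique


def augment(records, synth_gen_rounds=2, synth_gen_ratio=0.5):
    # Mirrors the --synth-gen-rounds / --synth-gen-ratio flags shown in "Usage".
    for _ in range(synth_gen_rounds):
        seeds = records[: int(len(records) * synth_gen_ratio)]
        records = curate(records + generate_synthetic_records(seeds))
    return records
```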

> **Note:** The use of external LLM services for synthetic data generation is entirely optional.
> Similarly, this tutorial can be executed on a local machine without a GPU. To fully
> experience all the capabilities of this code, see the "Optional Prerequisites" section below.

## Optional Prerequisites

The following optional dependencies allow you to experiment with all the features showcased in
this code:

* To run the data curation pipeline with semantic deduplication enabled, you need an NVIDIA GPU.
* To generate synthetic data, you need a synthetic data generation model compatible with the OpenAI API (such as the [Nemotron-4 340B Instruct](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-instruct) model).
* To assign qualitative metrics to the generated records, you need a reward model compatible with the OpenAI API (such as the [Nemotron-4 340B Reward](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-reward) model).

For synthetic data generation and quality assignment, this tutorial demonstrates the use of the
Nemotron-4 340B models through the [build.nvidia.com](https://build.nvidia.com) API gateway. As such,
a valid API key for prompting these models is required.
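
For reference, an OpenAI-compatible client pointed at this gateway can be set up roughly as follows. This is a sketch: the base URL and model identifier are assumptions and should be verified against the build.nvidia.com documentation for the model you use.

```python
from openai import OpenAI

# Sketch: connect to the build.nvidia.com OpenAI-compatible gateway.
# The base URL and model name below are assumptions; verify on build.nvidia.com.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_BUILD.NVIDIA.COM_API_KEY",
)

completion = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-instruct",  # assumed model identifier
    messages=[{"role": "user", "content": "Rephrase this legal question: ..."}],
    temperature=0.2,
)
print(completion.choices[0].message.content)
```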

## Usage
After installing the NeMo Curator package, you can simply run the following commands:
```bash
# Running the basic pipeline (no GPUs or external LLMs needed)
python tutorials/peft-curation-with-sdg/main.py

# Run with synthetic data generation and semantic deduplication
python tutorials/peft-curation-with-sdg/main.py \
    --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
    --device gpu

# Control the amount of synthetic data to generate: run 2 rounds of synthetic
# data generation, using 50% of the real data as seeds in each round
python tutorials/peft-curation-with-sdg/main.py \
    --api-key YOUR_BUILD.NVIDIA.COM_API_KEY \
    --device gpu \
    --synth-gen-rounds 2 \
    --synth-gen-ratio 0.5
```

By default, this tutorial uses at most 8 workers to run the curation pipeline. If you run into
out-of-memory issues, you can reduce the number of workers by supplying the `--n-workers=N`
argument, where `N` is the number of workers to spawn.
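
Under the hood, NeMo Curator distributes work with Dask. As a rough illustration of what the worker count corresponds to (this is not the exact setup used by `main.py`):

```python
from dask.distributed import Client, LocalCluster

# Illustrative sketch: a local Dask cluster with a bounded worker count.
# Fewer workers means fewer concurrent tasks and a lower peak memory footprint.
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)
```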

Once the code finishes executing, the curated dataset will be available under `data/curated/final`.
By default, the script outputs splits for training (80%), validation (10%) and testing (10%).
32 changes: 32 additions & 0 deletions tutorials/peft-curation-with-sdg/config/sem_dedup_config.yaml
@@ -0,0 +1,32 @@
# Configuration file for semantic dedup
cache_dir: "_temp/semdedup_cache"
num_files: 16
id_col_name: "id"
id_col_type: "str"
input_column: "text"

# Embeddings configuration
embeddings_save_loc: "embeddings"
embedding_model_name_or_path: "sentence-transformers/all-MiniLM-L6-v2"
embedding_batch_size: 128
embedding_max_mem_gb: 20

# Clustering configuration
clustering_save_loc: "clustering_results"
n_clusters: 1000
seed: 1234
max_iter: 100
kmeans_with_cos_dist: false

# Semdedup configuration
which_to_keep: "hard"
largest_cluster_size_to_process: 100000
sim_metric: "cosine"

# Extract dedup configuration
eps_thresholds:
- 0.01
- 0.001

# Which threshold to use for extracting deduped data
eps_to_extract: 0.01
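
For context, a config like this is typically loaded and applied along the following lines. This is only a sketch: it assumes NeMo Curator's `SemDedupConfig` and `SemDedup` classes (import paths and call signatures may differ across versions), a GPU-backed (`cudf`) dataset, and an illustrative input path.

```python
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modules.config import SemDedupConfig
from nemo_curator.modules.semantic_dedup import SemDedup

# Sketch only: load the YAML above and run semantic dedup on a GPU-backed dataset.
config = SemDedupConfig.from_yaml(
    "tutorials/peft-curation-with-sdg/config/sem_dedup_config.yaml"
)
dataset = DocumentDataset.read_json("data/jsonl", backend="cudf")  # illustrative path
duplicates = SemDedup(config)(dataset)  # IDs of semantically duplicate records
```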
164 changes: 164 additions & 0 deletions tutorials/peft-curation-with-sdg/docbuilder.py
@@ -0,0 +1,164 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import json
import os
import re
import warnings
from typing import Dict, Optional, Tuple

import requests
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning

from nemo_curator.download.doc_builder import (
    DocumentDownloader,
    DocumentExtractor,
    DocumentIterator,
)

# Ignore the specific BeautifulSoup warning
warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)


class LawQADownloader(DocumentDownloader):
    """
    A class for downloading the Law QA dataset.
    """

    def __init__(self, download_dir: str):
        super().__init__()

        if not os.path.isdir(download_dir):
            os.makedirs(download_dir)

        self._download_dir = download_dir
        print("Download directory: ", self._download_dir)

    def download(self, url: str) -> str:
        """
        Downloads the Law QA dataset from the given URL.

        Args:
            url (str): The URL of the Law QA dataset.

        Returns:
            str: The path of the downloaded file.
        """
        filename = os.path.basename(url)
        output_file = os.path.join(self._download_dir, filename)

        if os.path.exists(output_file):
            print(f"File '{output_file}' already exists, skipping download.")
            return output_file

        print(f"Downloading Law QA dataset from '{url}'...")
        response = requests.get(url)
        # Fail early on HTTP errors instead of writing an error page to disk.
        response.raise_for_status()

        with open(output_file, "wb") as file:
            file.write(response.content)

        return output_file


class LawQAIterator(DocumentIterator):

    def __init__(self):
        super().__init__()
        self._counter = -1
        self._extractor = LawQAExtractor()

    def iterate(self, file_path):
        """
        Iterates over the content of a file and yields extracted records.

        Args:
            file_path (str): The path to the file to be iterated.

        Yields:
            dict: A dictionary representing a record extracted from the file.
        """
        self._counter = -1
        file_name = os.path.basename(file_path)

        with open(file_path, "r", encoding="utf-8") as file:
            json_content = json.load(file)

        for row in json_content:
            self._counter += 1
            extracted_content = self._extractor.extract(row)

            # Skip if the question has no answers.
            if extracted_content is None:
                continue

            id, extracted_content = extracted_content
            meta = {
                "filename": file_name,
                "id": f"law-stackexchange-qa-{id}",
            }

            record = {**meta, **extracted_content}
            yield record


class LawQAExtractor(DocumentExtractor):

    def extract(self, content: dict) -> Optional[Tuple[str, Dict[str, str]]]:
        """
        Extracts relevant information from a law-related question and its best answer.

        Args:
            content (dict): A single record containing a question and its answers.

        Returns:
            A tuple of the question ID and a dictionary with the extracted fields
            (title, question body and score, best answer and its score, and tags),
            or None if the question has no answers.
        """
        id = content["question_id"]
        q_title = content["question_title"]
        q_body = content["question_body"]
        q_score = content["score"]
        tags = ",".join(sorted(content["tags"]))

        # If this question has no answers, skip it.
        if len(content["answers"]) == 0:
            return None

        # All answers are sorted by votes, so take the first answer as the best one.
        best_answer = content["answers"][0]
        best_answer_score = best_answer["score"]
        best_answer = best_answer["body"]

        # Get rid of HTML tags using BeautifulSoup.
        # NOTE: Doing this here so that I can split the dataset without having to worry about curating the test split.
        q_title = self._clean_html(q_title)
        q_body = self._clean_html(q_body)
        best_answer = self._clean_html(best_answer)

        return id, {
            "title": q_title,
            "question": q_body,
            "question_score": q_score,
            "answer": best_answer,
            "answer_score": best_answer_score,
            "tags": tags,
        }

    def _clean_html(self, text: str) -> str:
        text = BeautifulSoup(text, "lxml").get_text()
        return re.sub(r"\s+", " ", text).strip()
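

# As a hypothetical usage sketch, the builders above can be wired together as
# follows; the URL below is illustrative rather than the exact one used by main.py.
if __name__ == "__main__":
    downloader = LawQADownloader(download_dir="data/raw")
    iterator = LawQAIterator()

    json_path = downloader.download(
        "https://huggingface.co/datasets/ymoslem/Law-StackExchange/resolve/main/"
        "law-stackexchange-questions-answers.json"  # illustrative file name
    )

    records = list(iterator.iterate(json_path))
    print(f"Extracted {len(records)} question/answer records")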
31 changes: 31 additions & 0 deletions tutorials/peft-curation-with-sdg/filters.py
@@ -0,0 +1,31 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo_curator.filters import DocumentFilter


class FilterLowScores(DocumentFilter):
    """
    Discards documents with scores (human-assigned or reward-model-assigned)
    below a threshold.
    """

    def __init__(self, score_threshold: int):
        super().__init__()
        self._score_threshold = score_threshold

    def score_document(self, score: int) -> bool:
        # The field this filter operates on is a numeric score column, so the
        # value passed in is the score itself rather than document text.
        return score >= self._score_threshold

    def keep_document(self, score: bool) -> bool:
        return score
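

# As a hypothetical usage sketch, this filter is typically applied through
# NeMo Curator's ScoreFilter wrapper. Field names mirror the extractor's
# output, and the path and threshold value are illustrative.
if __name__ == "__main__":
    from nemo_curator import ScoreFilter
    from nemo_curator.datasets import DocumentDataset

    dataset = DocumentDataset.read_json("data/jsonl")  # illustrative path
    filter_step = ScoreFilter(
        FilterLowScores(score_threshold=0),
        text_field="question_score",
        score_type=bool,
    )
    high_quality = filter_step(dataset)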