Add improved cleaning methods from Nemotron-CC #517

Merged: 4 commits, Feb 6, 2025
4 changes: 2 additions & 2 deletions README.md
@@ -23,8 +23,8 @@ All of our text pipelines have great multilingual support.
- [Download and Extraction](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/download.html)
- Default implementations for Common Crawl, Wikipedia, and ArXiv sources
- Easily customize and extend to other sources
- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- [Unicode Reformatting](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentificationunicodeformatting.html)
- [Language Identification](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/languageidentification.html)
- [Text Cleaning](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/textcleaning.html)
- [Heuristic Filtering](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
- Classifier Filtering
- [fastText](https://docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/qualityfiltering.html)
7 changes: 5 additions & 2 deletions docs/user-guide/index.rst
@@ -16,8 +16,11 @@ Text Curation
:ref:`Document Filtering <data-curator-qualityfiltering>`
This section describes how to use the 30+ heuristic and classifier filters available within NeMo Curator and how to implement custom filters to apply to the documents within your corpora.

:ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
Large, unlabeled text corpora often contain a variety of languages. The NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
:ref:`Language Identification <data-curator-languageidentification>`
Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.

:ref:`Text Cleaning <data-curator-text-cleaning>`
Many parts of the Internet contain malformed or poorly formatted text. NeMo Curator can fix many of these issues.

:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
docs/user-guide/languageidentificationunicodeformatting.rst → docs/user-guide/languageidentification.rst
@@ -11,40 +11,17 @@ Background
Large unlabeled text corpora often contain a variety of languages.
However, data curation usually includes steps that are language specific (e.g. using language-tuned heuristics for quality filtering)
and many curators are only interested in curating a monolingual dataset.
Datasets also may have improperly decoded unicode characters (e.g. "The Mona Lisa doesn't have eyebrows." decoding as "The Mona Lisa doesn’t have eyebrows.").

NeMo Curator provides utilities to identify languages and fix improperly decoded unicode characters.
The language identification is performed using `fastText <https://fasttext.cc/docs/en/language-identification.html>`_ and unicode fixing is performed using `ftfy <https://ftfy.readthedocs.io/en/latest/>`_.
NeMo Curator provides utilities to identify languages using `fastText <https://fasttext.cc/docs/en/language-identification.html>`_.
Even though a preliminary language identification may have been performed on the unextracted text (as is the case in our Common Crawl pipeline
using pyCLD2), `fastText <https://fasttext.cc/docs/en/language-identification.html>`_ is more accurate, so it can be used for a second pass.
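
For illustration, a minimal sketch of this second pass might look like the following (it assumes a locally downloaded fastText LID model and an input file named ``books.jsonl``; the names mirror the structure of ``examples/identify_languages.py``):

.. code-block:: python

    import nemo_curator as nc
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.filters import FastTextLangId

    # Assumes the fastText language ID model was downloaded beforehand from
    # https://fasttext.cc/docs/en/language-identification.html
    model_path = "/path/to/lid.176.bin"

    dataset = DocumentDataset.read_json("books.jsonl")

    # Score each document's language and drop low-confidence predictions
    language_id_step = nc.ScoreFilter(
        FastTextLangId(model_path), score_field="language", score_type="object"
    )
    identified_dataset = language_id_step(dataset)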

-----------------------------------------
Usage
-----------------------------------------

We provide an example of how to use the language identification and unicode reformatting utility at ``examples/identify_languages_and_fix_unicode.py``.
We provide an example of how to use the language identification utility at ``examples/identify_languages.py``.
At a high level, the module first identifies the languages of the documents and removes any documents for which it has high uncertainty about the language.
Notably, these lines use one of the ``DocumentModifier`` implementations that NeMo Curator provides:

.. code-block:: python

    cleaner = nc.Modify(UnicodeReformatter())
    cleaned_data = cleaner(lang_data)

``DocumentModifier`` classes like ``UnicodeReformatter`` are very similar to ``DocumentFilter`` classes.
They implement a single ``modify_document`` function that takes in a document and outputs a modified document.
Here is the implementation of the ``UnicodeReformatter`` modifier:

.. code-block:: python

    class UnicodeReformatter(DocumentModifier):
        def __init__(self):
            super().__init__()

        def modify_document(self, text: str) -> str:
            return ftfy.fix_text(text)

Also like the ``DocumentFilter`` functions, ``modify_document`` can be annotated with ``batched`` to take in a pandas series of documents instead of a single document.

-----------------------------------------
Related Scripts
@@ -79,15 +56,4 @@ within that file. Below is an example run command for :code:`separate_by_metadata`:
--output-metadata-distribution=./data/lang_distro.json

After running this module, the output directory will consist of one directory per language present within the corpus and all documents
within those directories will contain text that originates from the same language. Finally, the text within a specific language can have
its unicode fixed using the :code:`text_cleaning` module

.. code-block:: bash

    text_cleaning \
      --input-data-dir=<Output directory containing sub-directories>/EN \
      --output-clean-dir=<Output directory to which cleaned English documents will be written>


The above :code:`text_cleaning` module uses the heuristics defined within the :code:`ftfy` package that is commonly used for fixing
improperly decoded unicode.
within those directories will contain text that originates from the same language.
98 changes: 98 additions & 0 deletions docs/user-guide/text-cleaning.rst
@@ -0,0 +1,98 @@
.. _data-curator-text-cleaning:

=========================
Text Cleaning
=========================

--------------------
Overview
--------------------
Use NeMo Curator's text cleaning modules to remove undesirable text such as improperly decoded Unicode characters, inconsistent line spacing, or excessive URLs from documents being pre-processed for your dataset.

For example, the input sentence ``"The Mona Lisa doesn't have eyebrows."`` from a given document may not have included a properly encoded apostrophe (``'``), resulting in the sentence decoding as ``"The Mona Lisa doesn’t have eyebrows."`` NeMo Curator enables you to easily run this document through the default ``UnicodeReformatter()`` module to detect and fix the improperly decoded characters, or you can define your own custom Unicode text cleaner tailored to your needs.
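
As a quick illustration, ``UnicodeReformatter`` delegates to ``ftfy``, which repairs this kind of mojibake directly:

.. code-block:: python

    import ftfy

    ftfy.fix_text("The Mona Lisa doesn’t have eyebrows.")
    # -> "The Mona Lisa doesn't have eyebrows."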

--------------------
Use Cases
--------------------
* Fix improperly decoded Unicode characters from webpages.
* Standardize document layout by removing excessive newlines.
* Remove URLs in documents.

--------------------
Modules
--------------------
NeMo Curator provides the following modules for cleaning text:

- ``UnicodeReformatter()``: Uses `ftfy <https://ftfy.readthedocs.io/en/latest/>`_ to fix broken Unicode characters. Modifies the "text" field of the dataset by default.
- ``NewlineNormalizer()``: Uses regex to replace 3 or more consecutive newline characters in each document with only 2 newline characters.
- ``UrlRemover()``: Uses regex to remove all URLs in each document.

You can use these modules individually or sequentially in a cleaning pipeline.
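
Each modifier also exposes a ``modify_document`` method that operates on a plain string, which is handy for spot-checking behavior before building a full pipeline (a minimal sketch):

.. code-block:: python

    from nemo_curator.modifiers import NewlineNormalizer, UrlRemover

    NewlineNormalizer().modify_document("Title\n\n\n\nBody")  # "Title\n\nBody"
    UrlRemover().modify_document("See https://example.com for details")  # "See  for details"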

Consider the following example, which loads a dataset (``books.jsonl``), steps through each module in a cleaning pipeline, and outputs the processed dataset as ``cleaned_books.jsonl``:


.. code-block:: python

    from nemo_curator import Sequential, Modify, get_client
    from nemo_curator.datasets import DocumentDataset
    from nemo_curator.modifiers import UnicodeReformatter, UrlRemover, NewlineNormalizer

    def main():
        client = get_client(cluster_type="cpu")

        dataset = DocumentDataset.read_json("books.jsonl")
        cleaning_pipeline = Sequential([
            Modify(UnicodeReformatter()),
            Modify(NewlineNormalizer()),
            Modify(UrlRemover()),
        ])

        cleaned_dataset = cleaning_pipeline(dataset)

        cleaned_dataset.to_json("cleaned_books.jsonl")

    if __name__ == "__main__":
        main()

You can also perform text cleaning operations using the CLI by running the ``text_cleaning`` command:

.. code-block:: bash

    text_cleaning \
      --input-data-dir=/path/to/input/ \
      --output-clean-dir=/path/to/output/ \
      --normalize-newlines \
      --remove-urls

By default, the CLI performs only Unicode reformatting. Adding the ``--normalize-newlines`` and ``--remove-urls`` flags enables the other text cleaning operations.

------------------------
Custom Text Cleaner
------------------------
It's easy to write your own custom text cleaner. The implementation of ``UnicodeReformatter`` can be used as an example.

.. code-block:: python

    import ftfy

    from nemo_curator.modifiers import DocumentModifier


    class UnicodeReformatter(DocumentModifier):
        def __init__(self):
            super().__init__()

        def modify_document(self, text: str) -> str:
            return ftfy.fix_text(text)

Simply define a new class that inherits from ``DocumentModifier`` and implement the constructor and the ``modify_document`` method.
Also, like the ``DocumentFilter`` class, ``modify_document`` can be annotated with ``batched`` to take in a pandas series of documents instead of a single document.
See the :ref:`document filtering page <data-curator-qualityfiltering>` for more information.
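
For instance, a batched variant of the Unicode cleaner might look like the following sketch (it assumes the ``batched`` decorator is importable from ``nemo_curator.utils.decorators``):

.. code-block:: python

    import ftfy
    import pandas as pd

    from nemo_curator.modifiers import DocumentModifier
    from nemo_curator.utils.decorators import batched


    class BatchedUnicodeReformatter(DocumentModifier):
        @batched
        def modify_document(self, text: pd.Series) -> pd.Series:
            # Fix improperly decoded unicode across a whole partition at once
            return text.apply(ftfy.fix_text)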

---------------------------
Additional Resources
---------------------------
* `Single GPU Tutorial <https://github.com/NVIDIA/NeMo-Curator/blob/main/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb>`_
* `ftfy <https://ftfy.readthedocs.io/en/latest/>`_
* `RefinedWeb Paper <https://arxiv.org/abs/2306.01116>`_
* `Nemotron-CC Paper <https://arxiv.org/abs/2412.02595>`_
10 changes: 7 additions & 3 deletions docs/user-guide/text-curation.rst
@@ -13,8 +13,11 @@ Text Curation
:ref:`Document Filtering <data-curator-qualityfiltering>`
This section describes how to use the 30+ heuristic and classifier filters available within NeMo Curator and how to implement custom filters to apply to the documents within your corpora.

:ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
Large, unlabeled text corpora often contain a variety of languages. The NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
:ref:`Language Identification <data-curator-languageidentification>`
Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages.

:ref:`Text Cleaning <data-curator-text-cleaning>`
Many parts of the Internet contain malformed or poorly formatted text. NeMo Curator can fix many of these issues.

:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
@@ -43,7 +46,8 @@ Text Curation
documentdataset.rst
cpuvsgpu.rst
qualityfiltering.rst
languageidentificationunicodeformatting.rst
languageidentification.rst
textcleaning.rst
gpudeduplication.rst
semdedup.rst
syntheticdata.rst
2 changes: 1 addition & 1 deletion examples/README.md
@@ -14,7 +14,7 @@ These include:
| exact_deduplication.py | Use the `ExactDuplicates` class to perform exact deduplication on text data. |
| find_pii_and_deidentify.py | Use the `PiiModifier` and `Modify` classes to remove personally identifiable information from text data. |
| fuzzy_deduplication.py | Use the `FuzzyDuplicatesConfig` and `FuzzyDuplicates` classes to perform fuzzy deduplication on text data. |
| identify_languages_and_fix_unicode.py | Use `FastTextLangId` to filter data by language, then fix the unicode in it. |
| identify_languages.py | Use `FastTextLangId` to filter data by language. |
| raw_download_common_crawl.py | Download the raw compressed WARC files from Common Crawl without extracting them. |
| semdedup_example.py | Use the `SemDedup` class to perform semantic deduplication on text data. |
| task_decontamination.py | Remove segments of downstream evaluation tasks from a dataset. |
examples/identify_languages_and_fix_unicode.py → examples/identify_languages.py
@@ -1,4 +1,4 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -13,13 +13,11 @@
# limitations under the License.

import argparse
import os

import nemo_curator as nc
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import FastTextLangId
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.utils.distributed_utils import get_client, read_data, write_to_disk
from nemo_curator.utils.distributed_utils import get_client, read_data
from nemo_curator.utils.file_utils import (
    get_all_files_paths_under,
    separate_by_metadata,
@@ -45,7 +43,6 @@ def main(args):
    # and see a list of supported languages here:
    # https://fasttext.cc/docs/en/language-identification.html
    model_path = "/path/to/model.bin"
    target_language = "EN"
    language_field = "language"

    # Prepare samples for the classifier
@@ -70,18 +67,6 @@
        metadata_field=language_field,
    ).compute()

    # Read the language specific data and fix the unicode in it
    lang_data_path = os.path.join(language_separated_output_path, target_language)
    if not os.path.exists(lang_data_path):
        raise RuntimeError(f"Dataset did not have language: {target_language}")
    lang_data = load_dataset(lang_data_path)

    cleaner = nc.Modify(UnicodeReformatter())
    cleaned_data = cleaner(lang_data)

    # Write the cleaned_data
    write_to_disk(cleaned_data.df, cleaned_data_output_path, write_to_filename=True)


def attach_args(
    parser=argparse.ArgumentParser(
4 changes: 4 additions & 0 deletions nemo_curator/modifiers/__init__.py
@@ -15,13 +15,17 @@
from .c4 import BoilerPlateStringModifier
from .doc_modifier import DocumentModifier
from .fasttext import FastTextLabelModifier
from .newline_normalizer import NewlineNormalizer
from .pii_modifier import PiiModifier
from .unicode_reformatter import UnicodeReformatter
from .url_remover import UrlRemover

__all__ = [
"DocumentModifier",
"BoilerPlateStringModifier",
"FastTextLabelModifier",
"UnicodeReformatter",
"PiiModifier",
"NewlineNormalizer",
"UrlRemover",
]
33 changes: 33 additions & 0 deletions nemo_curator/modifiers/newline_normalizer.py
@@ -0,0 +1,33 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re

from nemo_curator.modifiers import DocumentModifier

THREE_OR_MORE_NEWLINES_REGEX = re.compile(r"(\n){3,}")
THREE_OR_MORE_WINDOWS_NEWLINES_REGEX = re.compile(r"(\r\n){3,}")


class NewlineNormalizer(DocumentModifier):
    """
    Replaces 3 or more consecutive newline characters with only 2 newline characters.
    """

    def __init__(self):
        super().__init__()

    def modify_document(self, text):
        text = THREE_OR_MORE_NEWLINES_REGEX.sub("\n\n", text)
        text = THREE_OR_MORE_WINDOWS_NEWLINES_REGEX.sub("\r\n\r\n", text)
        return text
30 changes: 30 additions & 0 deletions nemo_curator/modifiers/url_remover.py
@@ -0,0 +1,30 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re

from nemo_curator.modifiers import DocumentModifier

URL_REGEX = re.compile(r"https?://\S+|www\.\S+", flags=re.IGNORECASE)


class UrlRemover(DocumentModifier):
    """
    Removes all URLs in a document.
    """

    def __init__(self):
        super().__init__()

    def modify_document(self, text):
        return URL_REGEX.sub("", text)
24 changes: 19 additions & 5 deletions nemo_curator/scripts/text_cleaning.py
@@ -14,9 +14,9 @@

import argparse

import nemo_curator
from nemo_curator import Modify, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.modifiers import UnicodeReformatter
from nemo_curator.modifiers import NewlineNormalizer, UnicodeReformatter, UrlRemover
from nemo_curator.utils.distributed_utils import get_client, read_data, write_to_disk
from nemo_curator.utils.file_utils import expand_outdir_and_mkdir, get_batched_files
from nemo_curator.utils.script_utils import ArgumentHelper
@@ -28,9 +28,14 @@ def main(args):
    # Make the output directories
    output_clean_dir = expand_outdir_and_mkdir(args.output_clean_dir)

    cleaner = nemo_curator.Modify(
        UnicodeReformatter(), text_field=args.input_text_field
    )
    stages = [Modify(UnicodeReformatter(), text_field=args.input_text_field)]

    if args.normalize_newlines:
        stages.append(Modify(NewlineNormalizer(), text_field=args.input_text_field))
    if args.remove_urls:
        stages.append(Modify(UrlRemover(), text_field=args.input_text_field))

    cleaner = Sequential(stages)

    for files in get_batched_files(
        args.input_data_dir,
@@ -79,6 +84,15 @@ def attach_args(
    argumentHelper.add_arg_input_text_field()
    argumentHelper.add_arg_output_file_type()
    argumentHelper.add_distributed_args()
    argumentHelper.attach_bool_arg(
        parser,
        "normalize-newlines",
        default=False,
        help="Replace 3 or more consecutive newline characters in each document with only 2 newline characters.",
    )
    argumentHelper.attach_bool_arg(
        parser, "remove-urls", default=False, help="Remove all URLs in each document."
    )
    parser.add_argument(
        "--output-clean-dir",
        type=str,