🤗 [TxT360 Download] • 📈 [TxT360 Details and Analysis]
TxT360 (Trillion eXtracted Text), introduced by LLM360, is the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and the rich metadata stored alongside each document give pretraining teams a recipe to easily adjust data weighting, obtain the largest high-quality open-source dataset, and train the most performant models.
- Global Deduplication: Deduplicates 99 CommonCrawl snapshots and 14 curated high-quality sources.
- Customizable: Metadata enables easy data weighting adjustments to optimize training (a sketch of this follows the list).
- High Quality: Meticulously curated and processed to ensure high-quality data.
- Diverse Domains: Covers a wide range of topics (e.g., FreeLaw, PG-19) to ensure comprehensive data variety.
- Fully Open Source: Provides the largest open-source pretraining dataset together with detailed documentation and scripts, fostering reproducibility and transparency.
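To make the weighting point concrete, here is a minimal sketch in Python of how per-source weights could be applied to JSONL-formatted documents using stored metadata. The field names (`text`, `meta`, `source`), the weights, and the sampling scheme are assumptions for illustration, not the actual TxT360 schema or training recipe.

```python
# Illustrative sketch only: upsample/downsample documents by source using
# per-document metadata. Field names and weights are assumptions, not the
# actual TxT360 schema or training recipe.
import json
import random

SOURCE_WEIGHTS = {"common-crawl": 1.0, "freelaw": 2.0, "pg19": 3.0}  # hypothetical

def reweight(in_path, out_path, seed=0):
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            weight = SOURCE_WEIGHTS.get(doc["meta"]["source"], 1.0)
            repeats = int(weight)                # floor(weight) guaranteed copies
            if rng.random() < weight - repeats:  # plus one more with prob frac(weight)
                repeats += 1
            for _ in range(repeats):
                fout.write(json.dumps(doc) + "\n")
```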
This repository contains the scripts for replicating TxT360, reflecting our dedication to the 360° open-source spirit. You can find the full TxT360 data on our Hugging Face dataset page, along with an in-depth blog post detailing the project.
This repository is structured into three primary components: `common-crawl`, `curated-sources`, and `deduplication`:

- `common-crawl`: Contains scripts for processing raw data from Common Crawl.
- `curated-sources`: Includes scripts for processing data from various curated sources.
- `deduplication`: Features scripts for applying global deduplication to data from both Common Crawl and curated sources.
Further details can be found within each respective folder.
- Navigate to the `common-crawl` folder.
- Run the scripts provided and follow the instructions in the folder to process raw Common Crawl data.
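Before running the scripts, it can help to see the shape of the raw input. The sketch below uses the `warcio` package to iterate over HTTP response records in a `.warc.gz` file; it is illustrative only and does not reproduce the extraction and filtering pipeline implemented by the scripts in this folder.

```python
# Illustrative sketch only: reading raw Common Crawl WARC records with the
# `warcio` package. The repo's scripts implement the actual extraction and
# filtering pipeline, which this does not reproduce.
from warcio.archiveiterator import ArchiveIterator

def iter_html_payloads(warc_path):
    """Yield (target URI, raw HTML bytes) for each HTTP response record."""
    with open(warc_path, "rb") as stream:  # warcio transparently handles .warc.gz
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" in ctype:
                yield (record.rec_headers.get_header("WARC-Target-URI"),
                       record.content_stream().read())
```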
- Navigate to the `curated-sources` folder.
- Run the scripts provided to process data from curated sources.
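Curated sources arrive in heterogeneous formats, so a common intermediate representation makes the later deduplication step uniform. A minimal sketch of such a normalization pass is below; the JSONL schema (`text` plus a `meta` block with `source` and `id`) is an assumption for illustration and may differ from what the repo's scripts emit.

```python
# Illustrative sketch only: normalizing a curated source into one JSON
# document per line with source metadata. The schema is an assumption,
# not necessarily what the repo's scripts produce.
import json

def normalize(texts, source_name, out_path):
    with open(out_path, "w") as fout:
        for i, text in enumerate(texts):
            doc = {
                "text": text,
                "meta": {"source": source_name, "id": f"{source_name}-{i}"},
            }
            fout.write(json.dumps(doc) + "\n")

# Example: split a Project Gutenberg book into paragraph-level records.
# normalize(open("pg19_book.txt").read().split("\n\n"), "pg19", "pg19.jsonl")
```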
- Navigate to the `deduplication` folder.
- Run the deduplication scripts to globally deduplicate data from both Common Crawl and curated sources.
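To give a sense of what global near-deduplication involves, here is a minimal MinHash + LSH sketch using the `datasketch` package. The similarity threshold, number of permutations, and whitespace tokenization are assumptions for illustration; the scripts in this folder implement TxT360's actual deduplication.

```python
# Illustrative sketch only: near-duplicate detection with MinHash + LSH via
# the `datasketch` package. Threshold, num_perm, and tokenization are
# assumptions; the repo's scripts implement the actual global deduplication.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

def find_near_duplicates(docs, threshold=0.8, num_perm=128):
    """docs: dict mapping id -> text. Returns the ids flagged as duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    dup_ids = set()
    for doc_id, text in docs.items():
        m = minhash(text, num_perm)
        if lsh.query(m):          # a similar document was already indexed
            dup_ids.add(doc_id)   # keep the first occurrence, flag the rest
        else:
            lsh.insert(doc_id, m)
    return dup_ids
```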
- Return to the `common-crawl` and `curated-sources` folders.
- Remove the duplicates identified during deduplication and re-organize the data.
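The final pass is mechanical: drop every document flagged in the previous step. A minimal sketch is below, assuming the deduplication step wrote the flagged ids to a text file, one per line; the id field and file layout are assumptions for illustration.

```python
# Illustrative sketch only: filtering out documents flagged by deduplication.
# Assumes each JSONL document carries a unique id in its metadata and that
# the flagged ids were written to a file, one per line.
import json

def remove_duplicates(in_path, dup_ids_path, out_path):
    with open(dup_ids_path) as f:
        dup_ids = {line.strip() for line in f}
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            if doc["meta"]["id"] not in dup_ids:
                fout.write(line)
```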
By following these steps, you can replicate the TxT360 dataset or adapt the process to build your own custom datasets.
If you use the TxT360 dataset or our scripts, please cite the following items:
@article{txt360,
  title  = {{TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend}},
  author = {Liping Tang and Nikhil Ranjan and Omkar Pangarkar and Xuezhi Liang and Zhen Wang and Li An and Bhaskar Rao and Linghao Jin and Huijuan Wang and Zhoujun Cheng and Suqi Sun and Cun Mu and Victor Miller and Xuezhe Ma and Yue Peng and Zhengzhong Liu and Eric P. Xing},
  year   = {2024},
  url    = {https://huggingface.co/spaces/LLM360/TxT360}
}