🤗 [TxT360 Download] • 📈 [TxT360 Details and Analysis]
TxT360 (Trillion eXtracted Text), introduced by LLM360, is the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19, etc.). The large-scale deduplication process and the rich metadata stored alongside each document give pretraining teams a recipe to easily adjust data weighting, obtain the largest high-quality open-source dataset, and train the most performant models.
- Global Deduplication: Deduplicates 99 CommonCrawl snapshots and 14 curated high-quality sources.
- Customizable: Metadata enables easy data weighting adjustments to optimize training (a sketch of this follows the list).
- High Quality: Meticulously curated and processed to ensure high-quality data.
- Diverse Domains: Covers a wide range of topics (e.g., FreeLaw, PG-19) to ensure comprehensive data variety.
- Fully Open Source: Provides the largest open-source pretraining dataset together with detailed documentation and scripts, fostering reproducibility and transparency.
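To make the weighting point concrete, here is a minimal sketch in Python of how per-source weights could be applied to JSONL-formatted documents using stored metadata. The field names (`text`, `meta`, `source`), the weights, and the sampling scheme are assumptions for illustration, not the actual TxT360 schema or training recipe.

```python
# Illustrative sketch only: upsample/downsample documents by source using
# per-document metadata. Field names and weights are assumptions, not the
# actual TxT360 schema or training recipe.
import json
import random

SOURCE_WEIGHTS = {"common-crawl": 1.0, "freelaw": 2.0, "pg19": 3.0}  # hypothetical

def reweight(in_path, out_path, seed=0):
    rng = random.Random(seed)
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            weight = SOURCE_WEIGHTS.get(doc["meta"]["source"], 1.0)
            repeats = int(weight)                # floor(weight) guaranteed copies
            if rng.random() < weight - repeats:  # plus one more with prob frac(weight)
                repeats += 1
            for _ in range(repeats):
                fout.write(json.dumps(doc) + "\n")
```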
This repository contains the scripts for replicating TxT360, reflecting our dedication to the 360° open-source spirit. You can find the full TxT360 data on our Hugging Face dataset page, along with an in-depth blog post detailing the project.
This repository is structured into three primary components: `common-crawl`, `curated-sources`, and `deduplication`:

- `common-crawl`: Contains scripts for processing raw data from Common Crawl.
- `curated-sources`: Includes scripts for processing data from various curated sources.
- `deduplication`: Features scripts for applying global deduplication to data from both Common Crawl and curated sources.
Further details can be found within each respective folder.
- Navigate to the `common-crawl` folder.
- Run the scripts provided and follow the instructions in the folder to process raw Common Crawl data.
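Before running the scripts, it can help to see the shape of the raw input. The sketch below uses the `warcio` package to iterate over HTTP response records in a `.warc.gz` file; it is illustrative only and does not reproduce the extraction and filtering pipeline implemented by the scripts in this folder.

```python
# Illustrative sketch only: reading raw Common Crawl WARC records with the
# `warcio` package. The repo's scripts implement the actual extraction and
# filtering pipeline, which this does not reproduce.
from warcio.archiveiterator import ArchiveIterator

def iter_html_payloads(warc_path):
    """Yield (target URI, raw HTML bytes) for each HTTP response record."""
    with open(warc_path, "rb") as stream:  # warcio transparently handles .warc.gz
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" in ctype:
                yield (record.rec_headers.get_header("WARC-Target-URI"),
                       record.content_stream().read())
```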
- Navigate to the `curated-sources` folder.
- Run the scripts provided to process data from curated sources.
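Curated sources arrive in heterogeneous formats, so a common intermediate representation makes the later deduplication step uniform. A minimal sketch of such a normalization pass is below; the JSONL schema (`text` plus a `meta` block with `source` and `id`) is an assumption for illustration and may differ from what the repo's scripts emit.

```python
# Illustrative sketch only: normalizing a curated source into one JSON
# document per line with source metadata. The schema is an assumption,
# not necessarily what the repo's scripts produce.
import json

def normalize(texts, source_name, out_path):
    with open(out_path, "w") as fout:
        for i, text in enumerate(texts):
            doc = {
                "text": text,
                "meta": {"source": source_name, "id": f"{source_name}-{i}"},
            }
            fout.write(json.dumps(doc) + "\n")

# Example: split a Project Gutenberg book into paragraph-level records.
# normalize(open("pg19_book.txt").read().split("\n\n"), "pg19", "pg19.jsonl")
```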
- Navigate to the `deduplication` folder.
- Run the deduplication scripts to globally deduplicate data from both Common Crawl and curated sources.
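To give a sense of what global near-deduplication involves, here is a minimal MinHash + LSH sketch using the `datasketch` package. The similarity threshold, number of permutations, and whitespace tokenization are assumptions for illustration; the scripts in this folder implement TxT360's actual deduplication.

```python
# Illustrative sketch only: near-duplicate detection with MinHash + LSH via
# the `datasketch` package. Threshold, num_perm, and tokenization are
# assumptions; the repo's scripts implement the actual global deduplication.
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

def find_near_duplicates(docs, threshold=0.8, num_perm=128):
    """docs: dict mapping id -> text. Returns the ids flagged as duplicates."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    dup_ids = set()
    for doc_id, text in docs.items():
        m = minhash(text, num_perm)
        if lsh.query(m):          # a similar document was already indexed
            dup_ids.add(doc_id)   # keep the first occurrence, flag the rest
        else:
            lsh.insert(doc_id, m)
    return dup_ids
```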
- Return to the `common-crawl` and `curated-sources` folders.
- Remove the duplicates identified during deduplication and re-organize the data.
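The final pass is mechanical: drop every document flagged in the previous step. A minimal sketch is below, assuming the deduplication step wrote the flagged ids to a text file, one per line; the id field and file layout are assumptions for illustration.

```python
# Illustrative sketch only: filtering out documents flagged by deduplication.
# Assumes each JSONL document carries a unique id in its metadata and that
# the flagged ids were written to a file, one per line.
import json

def remove_duplicates(in_path, dup_ids_path, out_path):
    with open(dup_ids_path) as f:
        dup_ids = {line.strip() for line in f}
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            doc = json.loads(line)
            if doc["meta"]["id"] not in dup_ids:
                fout.write(line)
```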
By following these steps, you can replicate the TxT360 dataset or adapt the process to build your own custom datasets.
If you use the TxT360 dataset or our scripts, please cite the following items:
@article{txt360,
  title  = {{TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend}},
  author = {Liping Tang and Nikhil Ranjan and Omkar Pangarkar and Xuezhi Liang and Zhen Wang and Li An and Bhaskar Rao and Linghao Jin and Huijuan Wang and Zhoujun Cheng and Suqi Sun and Cun Mu and Victor Miller and Xuezhe Ma and Yue Peng and Zhengzhong Liu and Eric P. Xing},
  year   = {2024},
  url    = {https://huggingface.co/spaces/LLM360/TxT360}
}