
TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend

🤗 [TxT360 Download] • 📈 [TxT360 Details and Analysis]

About TxT360

TxT360 (Trillion eXtracted Text), introduced by LLM360, is the first dataset to globally deduplicate 99 Common Crawl snapshots together with 14 high-quality data sources from diverse domains (e.g., FreeLaw, PG-19). The large-scale deduplication process and the rich metadata retained for each document give pretraining teams a recipe to easily adjust data weights, work with the largest high-quality open-source dataset, and train highly performant models.

Key Advantages of TxT360

  • Global Deduplication: Deduplicates 99 Common Crawl snapshots and 14 curated high-quality sources.
  • Customizable: Per-document metadata enables easy data-weighting adjustments to optimize training (a weighting sketch follows this list).
  • High Quality: Meticulously curated and processed to ensure high-quality data.
  • Diverse Domains: Covers a wide range of topics (e.g., FreeLaw, PG-19) to ensure comprehensive data variety.
  • Fully Open Source: Provides the largest open-source pretraining dataset together with detailed documentation and scripts, fostering reproducibility and transparency.
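As an illustration of the customizability point above, the following is a minimal sketch of source-level data weighting in Python. The weights, the "source" field name, and the repetition scheme are assumptions chosen for illustration; consult the TxT360 documentation for the actual metadata schema.

import random

# Hypothetical source weights: >1 upsamples a source, <1 downsamples it.
WEIGHTS = {"common-crawl": 1.0, "freelaw": 2.5, "pg19": 3.0}

def weighted_repeat(documents, weights=WEIGHTS, seed=0):
    """Yield each document a number of times proportional to its source weight."""
    rng = random.Random(seed)
    for doc in documents:
        w = weights.get(doc["source"], 1.0)  # "source" is an assumed field name
        copies = int(w)                      # whole repetitions
        if rng.random() < w - copies:        # fractional part, realized by sampling
            copies += 1
        for _ in range(copies):
            yield doc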

(Figure: comparison)


This repository contains the scripts for replicating TxT360, reflecting our dedication to the 360° open-source spirit. The full TxT360 data is available on our Hugging Face dataset page, along with an in-depth blog post detailing the project.
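For example, here is a minimal sketch of streaming the dataset from the Hugging Face Hub with the datasets library; whether a subset or config name is required beyond the repository id is an assumption to verify on the dataset page.

from datasets import load_dataset

# Stream rather than download: TxT360 is far too large to materialize locally.
ds = load_dataset("LLM360/TxT360", split="train", streaming=True)
for example in ds:
    print(example)  # inspect the fields of the first record
    break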

Overview

This repository is structured into three primary components, each in its own folder: common-crawl, curated-sources, and deduplication.

  • common-crawl: Contains scripts for processing raw data from Common Crawl.
  • curated-sources: Includes scripts for processing data from various curated sources.
  • deduplication: Features scripts for applying global deduplication to data from both Common Crawl and curated sources.

Further details can be found within each respective folder.

How to Use the Repository

1. Prepare Data Before Global Deduplication

1.1 Generate Common Crawl Data

  • Navigate to the common-crawl folder.
  • Run the provided scripts, following the instructions in the folder, to process raw Common Crawl data (a sketch of the kind of processing involved follows).
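Purely as an illustration of the kind of processing these scripts perform, below is a minimal sketch of iterating over a raw Common Crawl WARC file with the warcio library. The library choice and the file name are assumptions, not necessarily what the repository's scripts use.

from warcio.archiveiterator import ArchiveIterator

# "example.warc.gz" is a placeholder for a downloaded Common Crawl WARC shard.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":  # an archived HTTP response
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read()  # raw page bytes; text extraction comes next
            print(url, len(html))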

1.2 Generate Curated Sources Data

  • Navigate to the curated-sources folder.
  • Run the scripts provided to process data from curated sources.

2. Perform Global Deduplication

  • Navigate to the deduplication folder.
  • Run the deduplication scripts to globally deduplicate data from both Common Crawl and curated sources (a conceptual sketch follows).
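Fuzzy (near-duplicate) matching is the conceptually interesting part of global deduplication. The following is a minimal sketch of MinHash-LSH near-duplicate detection using the datasketch library; the library, threshold, and tokenization are illustrative assumptions, not a description of the repository's actual pipeline.

from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    """Build a MinHash signature over the document's token set."""
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the very lazy dog",  # near-duplicate of "a"
    "c": "an entirely different sentence about pretraining data",
}

lsh = MinHashLSH(threshold=0.8, num_perm=128)
for key, text in docs.items():
    sig = minhash(text)
    matches = lsh.query(sig)  # previously inserted docs that look near-identical
    if matches:
        print(f"{key} duplicates {matches}")  # likely output: b duplicates ['a']
    lsh.insert(key, sig)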

3. Finalize the Data

  • Return to the common-crawl and curated-sources folders.
  • Remove the duplicates identified during deduplication and re-organize the data (a minimal filtering sketch follows).
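For instance, here is a minimal sketch of dropping flagged records from a JSONL shard. The file names and the "id" field are hypothetical; each folder documents its own formats.

import json

# Hypothetical inputs: one duplicate id per line, plus a JSONL data shard.
with open("duplicate_ids.txt") as f:
    dup_ids = {line.strip() for line in f}

with open("shard.jsonl") as src, open("shard.dedup.jsonl", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record["id"] not in dup_ids:  # keep only non-duplicates
            dst.write(line)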

By following these steps, you can replicate the TxT360 dataset or adapt the process to create custom datasets of your own.

Citation

If you use the TxT360 dataset or our scripts, please cite the following:

@article{txt360,
  title  = {{TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend}},
  author = {Liping Tang and Nikhil Ranjan and Omkar Pangarkar and Xuezhi Liang and Zhen Wang and Li An and Bhaskar Rao and Linghao Jin and Huijuan Wang and Zhoujun Cheng and Suqi Sun and Cun Mu and Victor Miller and Xuezhe Ma and Yue Peng and Zhengzhong Liu and Eric P. Xing},
  year   = {2024},
  url    = {https://huggingface.co/spaces/LLM360/TxT360}
}
