Complete improvements of download script. #55

populated · 2023-11-11T14:15:15Z

I rewrote the download script to use data classes, JSON serialization, and the tqdm library, so the downloaded data is structured and the download shows its progress. It also adds options to specify the data size, the split, and a target example count. (cool, cool!)

Little list of changes:

Added a data class (ChatData) for structuring GPT-related data.
Implemented a JSON encoder (ChatDataEncoder) for custom serialization.
Created a class (GPTData) to manage data download, processing, and saving.
Introduced methods for validating data sizes and splits.
Utilized tqdm for a progress bar during data download.
Provided options for truncating data based on a target example count.
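
For reviewers who want a picture of the first two items, here is a minimal sketch of how a data class and a custom JSON encoder can fit together. The class names `ChatData` and `ChatDataEncoder` come from this PR, but the fields (`id`, `text`, `length`) and the helper `to_json_line` are assumptions made up for illustration and may not match the actual diff.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ChatData:
    # Hypothetical fields; the PR description does not list the real ones.
    id: int
    text: str
    length: int


class ChatDataEncoder(json.JSONEncoder):
    """Serialize dataclass instances as plain dictionaries."""

    def default(self, o):
        # is_dataclass() is also true for the class itself, so exclude types.
        if is_dataclass(o) and not isinstance(o, type):
            return asdict(o)
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


def to_json_line(record: ChatData) -> str:
    """Hypothetical helper: encode one record as a single JSON line."""
    return json.dumps(record, cls=ChatDataEncoder)


if __name__ == "__main__":
    print(to_json_line(ChatData(id=1, text="hi", length=2)))
```

Routing serialization through an encoder subclass keeps the `json.dumps` call sites unchanged if more record types are added later.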

Usage (I thought this was necessary, soooo):

gpt_data = GPTData(target_examples=None)
gpt_data.download_and_save_data(data_size_fn='webtext', split_fn='train')
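
As a rough idea of what the validation and truncation behind that call could look like, here is a sketch with no network access. `GPTDataSketch`, `validate`, `truncate`, and the contents of `VALID_SIZES` and `VALID_SPLITS` are all assumptions for illustration; only the parameter names `target_examples`, `data_size_fn`, and `split_fn` and the values `'webtext'` and `'train'` appear in this PR.

```python
from itertools import islice
from typing import Iterable, Iterator, Optional

# Assumed, illustrative values; the PR does not enumerate the accepted names.
VALID_SIZES = {"webtext"}
VALID_SPLITS = {"train"}


class GPTDataSketch:
    """Hypothetical stand-in for GPTData, covering validation and truncation only."""

    def __init__(self, target_examples: Optional[int] = None):
        if target_examples is not None and target_examples < 0:
            raise ValueError("target_examples must be non-negative or None")
        self.target_examples = target_examples

    @staticmethod
    def validate(data_size_fn: str, split_fn: str) -> None:
        # Fail early, before any download would start.
        if data_size_fn not in VALID_SIZES:
            raise ValueError(f"unknown data size: {data_size_fn!r}")
        if split_fn not in VALID_SPLITS:
            raise ValueError(f"unknown split: {split_fn!r}")

    def truncate(self, examples: Iterable[str]) -> Iterator[str]:
        # islice(..., None) yields every item, so None means "no limit".
        return islice(examples, self.target_examples)


if __name__ == "__main__":
    GPTDataSketch.validate("webtext", "train")
    print(list(GPTDataSketch(target_examples=2).truncate(["a", "b", "c"])))
```

Treating `target_examples=None` as "keep everything" matches the usage shown above, where `None` is passed explicitly.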

Testing:

I tested every data size and split, plus several target example counts. Everything ran without errors on my local machine (Linux).

populated added 2 commits November 11, 2023 09:06

full rewrite.

ae99757

full rewrite + improved.

9e6a482
