Complete improvements of download script. #55
I modified the script to use data classes, JSON serialization, and the `tqdm` library, to make the data download process smoother and more informative. It also offers options to specify data sizes, splits, and target example counts. (cool, cool!)

Little list of changes:
- `ChatData` for structuring GPT-related data.
- `ChatDataEncoder` for custom serialization.
- `GPTData` to manage data download, processing, and saving.
- `tqdm` for a progress bar during data download.

Usage (I thought this was necessary, soooo):
Testing:
It works: I've tested all sizes and splits, and tried a range of target example counts. Everything ran flawlessly on my local machine (Linux).
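
For reviewers who want a quick picture of the data class and custom encoder mentioned in the list of changes, here is a minimal sketch of how the two can fit together. It is not the code from this PR: the `prompt` and `response` fields and the `to_json` helper are hypothetical names chosen for illustration, and the download logic and `tqdm` progress bar are left out.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


@dataclass
class ChatData:
    # Hypothetical fields; the actual script may structure its data differently.
    prompt: str
    response: str


class ChatDataEncoder(json.JSONEncoder):
    """Serialize dataclass instances as plain dictionaries."""

    def default(self, o):
        # is_dataclass() is also true for the class itself, so exclude types.
        if is_dataclass(o) and not isinstance(o, type):
            return asdict(o)
        return super().default(o)


def to_json(records):
    # Hypothetical helper: dump a list of ChatData records to a JSON string.
    return json.dumps(records, cls=ChatDataEncoder)
```

Passing the encoder through `cls=` keeps serialization in one place, so the data class itself does not need a hand-written `to_dict` method.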