Transformer-Based Text Generation

This project implements a text generation model built on the transformer architecture, inspired by Andrej Karpathy's "Let's Build GPT" video and MIT's open-source educational materials. The model learns from classic literature and generates coherent text in a similar style.

Features

  • Transformer Architecture: Implements a scaled-down transformer model, incorporating key components such as multi-head self-attention, feed-forward networks, and residual connections.
  • Custom Dataset: Processes and trains on text extracted from public-domain classics like:
    • The Mysterious Stranger by Mark Twain
    • Paradise Lost by John Milton
    • The Picture of Dorian Gray by Oscar Wilde
    • Notes from the Underground by Fyodor Dostoyevsky
    • The Sorrows of Satan by Marie Corelli
    • The Monk by M. G. Lewis
  • Training and Evaluation: Includes separate training and testing splits (90/10) to measure model performance.
  • Dynamic Generation: Generates text using a context window with a user-specified number of new tokens.

Setup

Prerequisites

Ensure the following libraries are installed (see the installation steps below):

  • torch (PyTorch)
  • PyPDF2

Installation

  1. Clone the repository:
    git clone https://github.com/your-username/transformer-gpt.git
    cd transformer-gpt
  2. Install the required packages:
    pip install torch PyPDF2

Files Needed

Ensure you have the following text files in the specified directory structure:

/Users/<your_name>/Desktop/microGPT/
|-- The Project Gutenberg Book of The Mysterious Stranger and Other Stories, by Mark Twain.pdf
|-- The Project Gutenberg eBook of Paradise Lost, by John Milton.pdf
|-- The Project Gutenberg eBook of The Picture of Dorian Gray, by Oscar Wilde.pdf
|-- The Project Gutenberg eBook of Notes from the Underground, by Fyodor Dostoyevsky.pdf
|-- The Sorrows of Satan, by Marie Corelli—A Project Gutenberg eBook.pdf
|-- The Project Gutenberg eBook of The Monk, by M. G. Lewis.pdf

Usage

  1. Preprocess the text data from the PDF files using the PyPDF2 library to extract and clean the content (a minimal extraction sketch follows this list).
  2. Train the model by running:
    python main.py
  3. During training, the script outputs the loss metrics every 100 iterations.
  4. Once trained, generate new text using the generate() method. Modify the max_new_tokens parameter to control the length of the output.
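
The preprocessing itself is handled by the project's code; the following is only a sketch of what the PDF-extraction and encoding steps might look like. The helper names, the single-file example, and the character-level tokenization are illustrative assumptions, not the project's exact code.

    from PyPDF2 import PdfReader

    def extract_text(pdf_path: str) -> str:
        """Extract raw text from every page of a PDF (empty string for unreadable pages)."""
        reader = PdfReader(pdf_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # Build a corpus from one book; in practice, concatenate all six files listed above.
    text = extract_text("The Project Gutenberg eBook of Paradise Lost, by John Milton.pdf")

    # Character-level vocabulary and encode/decode helpers (assumed tokenization scheme).
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)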

Code Highlights

Transformer Components

  1. Self-Attention Mechanism:

    • Each token attends to other tokens in the context window to capture relationships.
    • Implements scaled dot-product attention with masking to ensure causal behavior.
  2. Multi-Head Attention:

    • Parallel attention heads allow the model to capture multiple patterns simultaneously.
  3. Feed-Forward Network:

    • Adds non-linear transformations to improve the model's expressiveness.
  4. Residual Connections:

    • Helps mitigate vanishing gradients and improve information flow (a combined sketch of these four components follows).
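
A minimal PyTorch sketch of these four components, in the spirit of Karpathy's nanoGPT-style blocks. The class names and hyperparameters (n_embd, n_head, block_size, dropout) are illustrative assumptions and may differ from the project's actual values.

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    n_embd, n_head, block_size, dropout = 384, 6, 256, 0.2  # assumed hyperparameters

    class Head(nn.Module):
        """One head of masked (causal) self-attention."""
        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            B, T, C = x.shape
            k, q = self.key(x), self.query(x)
            # Scaled dot-product attention with a causal mask so tokens only attend backwards.
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
            wei = self.dropout(F.softmax(wei, dim=-1))
            return wei @ self.value(x)

    class MultiHeadAttention(nn.Module):
        """Several attention heads in parallel, concatenated and projected back."""
        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(num_heads * head_size, n_embd)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            return self.dropout(self.proj(out))

    class FeedForward(nn.Module):
        """Position-wise non-linear transformation."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        """Transformer block: attention and feed-forward, each wrapped in a residual connection."""
        def __init__(self):
            super().__init__()
            self.sa = MultiHeadAttention(n_head, n_embd // n_head)
            self.ffwd = FeedForward()
            self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))    # residual connection around attention
            x = x + self.ffwd(self.ln2(x))  # residual connection around feed-forward
            return x

The pre-norm LayerNorm placement (normalizing before each sub-layer) is a common choice in small GPT-style models; the project's own code may order these operations differently.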

Training

  • Batching: Divides data into 64-token samples for efficient parallel training.
  • Loss Function: Uses Cross Entropy to measure prediction accuracy.
  • Optimizer: AdamW, which adapts per-parameter learning rates and decouples weight decay (a loop sketch follows this list).
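
A sketch of what the training loop described above could look like. The get_batch helper, learning rate, iteration count, and the assumption that the model's forward pass returns (logits, loss) follow common nanoGPT-style code and are not taken from this repository; `encode` and `text` refer to the preprocessing sketch earlier.

    import torch

    batch_size, block_size = 64, 256        # assumed split of the 64 and 256 figures mentioned above
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9 * len(data))                # 90/10 train/test split
    train_data, val_data = data[:n], data[n:]

    def get_batch(split: str):
        """Sample a random batch of (input, target) token sequences."""
        d = train_data if split == "train" else val_data
        ix = torch.randint(len(d) - block_size, (batch_size,))
        x = torch.stack([d[i:i + block_size] for i in ix])
        y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
        return x, y

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # `model` and lr are assumptions
    for step in range(5000):                                    # iteration count is an assumption
        xb, yb = get_batch("train")
        logits, loss = model(xb, yb)        # assumed to return cross-entropy loss on the targets
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        if step % 100 == 0:                 # matches the "loss every 100 iterations" behaviour noted earlier
            print(f"step {step}: train loss {loss.item():.4f}")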

Text Generation

  • Employs a sampling-based approach using multinomial probability distributions.
  • Context is dynamically cropped to fit the model's context window (256 tokens); a sampling sketch follows.
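
A minimal sketch of this sampling loop, again assuming the nanoGPT-style convention that the model's forward pass returns (logits, loss); the function name and signature are illustrative, not the project's exact generate() method.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def generate(model, idx, max_new_tokens, block_size=256):
        """Autoregressively append max_new_tokens tokens to the context idx of shape (B, T)."""
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]          # crop to the 256-token context window
            logits, _ = model(idx_cond)              # assumed forward signature
            logits = logits[:, -1, :]                # keep only the last time step
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

    # Example: start from a single placeholder token and decode 500 new tokens.
    context = torch.zeros((1, 1), dtype=torch.long)
    print(decode(generate(model, context, max_new_tokens=500)[0].tolist()))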

Results

After training, the model generates text that mimics the style and structure of the training corpus. Example outputs demonstrate coherent phrasing and vocabulary consistent with the classics.

Future Improvements

  • Scaling: Extend to larger transformer architectures for more complex text generation.
  • Data Augmentation: Include more diverse literary texts to enhance generalization.
  • Fine-Tuning: Allow for transfer learning on specific genres or modern text.

Acknowledgments

  • Inspired by Andrej Karpathy's "Let's Build GPT" video.
  • Texts sourced from Project Gutenberg (public domain).
  • PyTorch framework for deep learning implementation.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or collaboration, feel free to reach out at [email protected].