Transformer-Based Text Generation

This project implements a text generation model built on the transformer architecture, inspired by Andrej Karpathy's "Let's Build GPT" video and MIT's open-source educational materials. The model learns from classic literature and generates coherent text in a similar style.

Features

  • Transformer Architecture: Implements a scaled-down transformer model, incorporating key components such as multi-head self-attention, feed-forward networks, and residual connections.
  • Custom Dataset: Processes and trains on text extracted from public-domain classics like:
    • The Mysterious Stranger by Mark Twain
    • Paradise Lost by John Milton
    • The Picture of Dorian Gray by Oscar Wilde
    • Notes from the Underground by Fyodor Dostoyevsky
    • The Sorrows of Satan by Marie Corelli
    • The Monk by M. G. Lewis
  • Training and Evaluation: Includes separate training and testing splits (90/10) to measure model performance.
  • Dynamic Generation: Generates text using a context window with a user-specified number of new tokens.

Setup

Prerequisites

Ensure the following libraries are installed (see the installation steps below):

  • torch (PyTorch)
  • PyPDF2

Installation

  1. Clone the repository:
    git clone https://github.com/your-username/transformer-gpt.git
    cd transformer-gpt
  2. Install the required packages:
    pip install torch PyPDF2

Files Needed

Ensure you have the following text files in the specified directory structure:

/Users/<your_name>/Desktop/microGPT/
|-- The Project Gutenberg Book of The Mysterious Stranger and Other Stories, by Mark Twain.pdf
|-- The Project Gutenberg eBook of Paradise Lost, by John Milton.pdf
|-- The Project Gutenberg eBook of The Picture of Dorian Gray, by Oscar Wilde.pdf
|-- The Project Gutenberg eBook of Notes from the Underground, by Fyodor Dostoyevsky.pdf
|-- The Sorrows of Satan, by Marie Corelli—A Project Gutenberg eBook.pdf
|-- The Project Gutenberg eBook of The Monk, by M. G. Lewis.pdf

Usage

  1. Preprocess the text data from the PDF files using the PyPDF2 library to extract and clean the content (a minimal extraction sketch follows this list).
  2. Train the model by running:
    python main.py
  3. During training, the script outputs the loss metrics every 100 iterations.
  4. Once trained, generate new text using the generate() method. Modify the max_new_tokens parameter to control the length of the output.
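
The preprocessing itself is handled by the project's code; the following is only a sketch of what the PDF-extraction and encoding steps might look like. The helper names, the single-file example, and the character-level tokenization are illustrative assumptions, not the project's exact code.

    from PyPDF2 import PdfReader

    def extract_text(pdf_path: str) -> str:
        """Extract raw text from every page of a PDF (empty string for unreadable pages)."""
        reader = PdfReader(pdf_path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # Build a corpus from one book; in practice, concatenate all six files listed above.
    text = extract_text("The Project Gutenberg eBook of Paradise Lost, by John Milton.pdf")

    # Character-level vocabulary and encode/decode helpers (assumed tokenization scheme).
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    encode = lambda s: [stoi[c] for c in s]
    decode = lambda ids: "".join(itos[i] for i in ids)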

Code Highlights

Transformer Components

  1. Self-Attention Mechanism:

    • Each token attends to other tokens in the context window to capture relationships.
    • Implements scaled dot-product attention with masking to ensure causal behavior.
  2. Multi-Head Attention:

    • Parallel attention heads allow the model to capture multiple patterns simultaneously.
  3. Feed-Forward Network:

    • Adds non-linear transformations to improve the model's expressiveness.
  4. Residual Connections:

    • Helps mitigate vanishing gradients and improve information flow (a combined sketch of these four components follows).
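
A minimal PyTorch sketch of these four components, in the spirit of Karpathy's nanoGPT-style blocks. The class names and hyperparameters (n_embd, n_head, block_size, dropout) are illustrative assumptions and may differ from the project's actual values.

    import torch
    import torch.nn as nn
    from torch.nn import functional as F

    n_embd, n_head, block_size, dropout = 384, 6, 256, 0.2  # assumed hyperparameters

    class Head(nn.Module):
        """One head of masked (causal) self-attention."""
        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            B, T, C = x.shape
            k, q = self.key(x), self.query(x)
            # Scaled dot-product attention with a causal mask so tokens only attend backwards.
            wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
            wei = self.dropout(F.softmax(wei, dim=-1))
            return wei @ self.value(x)

    class MultiHeadAttention(nn.Module):
        """Several attention heads in parallel, concatenated and projected back."""
        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(num_heads * head_size, n_embd)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            return self.dropout(self.proj(out))

    class FeedForward(nn.Module):
        """Position-wise non-linear transformation."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        """Transformer block: attention and feed-forward, each wrapped in a residual connection."""
        def __init__(self):
            super().__init__()
            self.sa = MultiHeadAttention(n_head, n_embd // n_head)
            self.ffwd = FeedForward()
            self.ln1, self.ln2 = nn.LayerNorm(n_embd), nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))    # residual connection around attention
            x = x + self.ffwd(self.ln2(x))  # residual connection around feed-forward
            return x

The pre-norm LayerNorm placement (normalizing before each sub-layer) is a common choice in small GPT-style models; the project's own code may order these operations differently.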

Training

  • Batching: Divides data into 64-token samples for efficient parallel training.
  • Loss Function: Uses Cross Entropy to measure prediction accuracy.
  • Optimizer: AdamW, which adapts per-parameter learning rates and decouples weight decay (a loop sketch follows this list).
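
A sketch of what the training loop described above could look like. The get_batch helper, learning rate, iteration count, and the assumption that the model's forward pass returns (logits, loss) follow common nanoGPT-style code and are not taken from this repository; `encode` and `text` refer to the preprocessing sketch earlier.

    import torch

    batch_size, block_size = 64, 256        # assumed split of the 64 and 256 figures mentioned above
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9 * len(data))                # 90/10 train/test split
    train_data, val_data = data[:n], data[n:]

    def get_batch(split: str):
        """Sample a random batch of (input, target) token sequences."""
        d = train_data if split == "train" else val_data
        ix = torch.randint(len(d) - block_size, (batch_size,))
        x = torch.stack([d[i:i + block_size] for i in ix])
        y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])
        return x, y

    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # `model` and lr are assumptions
    for step in range(5000):                                    # iteration count is an assumption
        xb, yb = get_batch("train")
        logits, loss = model(xb, yb)        # assumed to return cross-entropy loss on the targets
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        if step % 100 == 0:                 # matches the "loss every 100 iterations" behaviour noted earlier
            print(f"step {step}: train loss {loss.item():.4f}")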

Text Generation

  • Employs a sampling-based approach using multinomial probability distributions.
  • Context is dynamically cropped to fit the model's context window (256 tokens); a sampling sketch follows.
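
A minimal sketch of this sampling loop, again assuming the nanoGPT-style convention that the model's forward pass returns (logits, loss); the function name and signature are illustrative, not the project's exact generate() method.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def generate(model, idx, max_new_tokens, block_size=256):
        """Autoregressively append max_new_tokens tokens to the context idx of shape (B, T)."""
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]          # crop to the 256-token context window
            logits, _ = model(idx_cond)              # assumed forward signature
            logits = logits[:, -1, :]                # keep only the last time step
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)  # sample the next token
            idx = torch.cat((idx, idx_next), dim=1)
        return idx

    # Example: start from a single placeholder token and decode 500 new tokens.
    context = torch.zeros((1, 1), dtype=torch.long)
    print(decode(generate(model, context, max_new_tokens=500)[0].tolist()))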

Results

After training, the model generates text that mimics the style and structure of the training corpus. Example outputs demonstrate coherent phrasing and vocabulary consistent with the classics.

Future Improvements

  • Scaling: Extend to larger transformer architectures for more complex text generation.
  • Data Augmentation: Include more diverse literary texts to enhance generalization.
  • Fine-Tuning: Allow for transfer learning on specific genres or modern text.

Acknowledgments

  • Inspired by Andrej Karpathy's "Let's Build GPT" video.
  • Texts sourced from Project Gutenberg (public domain).
  • PyTorch framework for deep learning implementation.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or collaboration, feel free to reach out at [email protected].