
Adds speculative decoding #27

Open · wants to merge 1 commit into main
Conversation

@yashkgp yashkgp commented Nov 17, 2024

Add Speculative Decoding to picoGPT

This PR implements speculative decoding to improve text generation performance in picoGPT. The implementation uses a smaller draft model (124M) to cheaply propose several tokens ahead, which the main model then verifies in a single forward pass, potentially reducing the number of expensive main-model forward passes needed for generation.

Key Changes

  1. Added generate_speculative() function in gpt2.py that implements:

    • Draft model token generation (default: 3 tokens at a time)
    • Main model verification of speculative tokens
    • Acceptance/rejection mechanism for speculative predictions
  2. Modified main() function to support both standard and speculative generation modes

    • Added use_generate_speculative flag (defaults to True)
    • Loads both main and draft model parameters
    • Routes to appropriate generation function based on the flag
  3. Added benchmark_speculative.py for performance comparison:

    • Benchmarks both standard and speculative generation
    • Supports multiple model sizes (124M, 355M)
    • Includes warm-up runs for more accurate measurements
    • Reports generation time and percentage improvement
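The flag-based routing in `main()` can be sketched roughly as follows. This is a minimal illustration of the wiring described above, not the PR's actual code: the generator functions here are stand-ins, and picoGPT's real `main()` also handles tokenization and parameter loading.

```python
# stand-in generators so the routing is runnable on its own;
# in the PR these would be picoGPT's real generation functions
def generate(prompt_ids, n_tokens):
    return prompt_ids + ["std"] * n_tokens

def generate_speculative(prompt_ids, n_tokens, n_speculative=3):
    return prompt_ids + ["spec"] * n_tokens

def run(prompt_ids, n_tokens, use_generate_speculative=True):
    # route to the appropriate generation function based on the flag,
    # as main() in the PR does (names here are illustrative)
    if use_generate_speculative:
        return generate_speculative(prompt_ids, n_tokens)
    return generate(prompt_ids, n_tokens)
```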

Implementation Details

The speculative decoding algorithm works as follows:

  1. Draft model (124M) generates N speculative tokens (default N=3)
  2. Main model verifies these predictions in a single forward pass
  3. Accepted tokens are added to the sequence
  4. If a draft token is rejected, it and all later draft tokens are discarded, and the main model's own prediction for that position is used instead

Benefits

  • Potential speedup in text generation: each main-model forward pass can yield multiple accepted tokens
  • Configurable number of speculative tokens
  • Minimal memory overhead (reuses existing 124M model as draft model)
  • Easy to toggle between standard and speculative modes

Testing

The PR includes a benchmarking script that measures performance improvements across different model sizes. Results can be reproduced by running:

python benchmark_speculative.py
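The timing harness can be sketched as below. This is an illustration of the measurement approach described above (warm-up runs, averaged wall-clock time, percentage improvement), with hypothetical names; it is not the PR's `benchmark_speculative.py` verbatim.

```python
import time

def benchmark(label, generate_fn, n_warmup=1, n_runs=3):
    """Time a generation callable, excluding one-time costs via warm-up runs."""
    for _ in range(n_warmup):
        generate_fn()  # warm-up: caches, lazy loading, etc.
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        generate_fn()
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)
    print(f"{label}: {avg:.3f}s avg over {n_runs} runs")
    return avg

def improvement(t_standard, t_speculative):
    # percentage improvement of speculative over standard generation
    return 100.0 * (t_standard - t_speculative) / t_standard
```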

Notes

  • The draft model is fixed to 124M for simplicity, but this could be made configurable
  • The number of speculative tokens (N=3) can be adjusted through the n_speculative parameter
  • Implementation is kept minimal and numpy-based, consistent with picoGPT's philosophy

Future Work

  • Make draft model size configurable
  • Add adaptive speculation length based on acceptance rate
  • Optimize verification step for better performance
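One possible shape for the adaptive speculation length mentioned above: grow `n_spec` when the draft model's recent acceptance rate is high and shrink it when proposals are mostly rejected. The thresholds and bounds here are hypothetical, not part of the PR.

```python
def adapt_n_spec(n_spec, acceptance_rate, lo=0.5, hi=0.9, n_min=1, n_max=8):
    # hypothetical rule: speculate further when the draft is usually right,
    # back off when verification keeps rejecting its proposals
    if acceptance_rate > hi:
        return min(n_spec + 1, n_max)
    if acceptance_rate < lo:
        return max(n_spec - 1, n_min)
    return n_spec
```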
