Skip to content

andrewn6/tinyllm

Repository files navigation

TinyLLM

Minimal, high-performance inference engine for LLM's -- used in development environments

Overview

TinyLLM streamlines the inference pipeline with minimal overhead, focusing on memory efficiency and throughput optimization. We include a custom tokenizer for self-developed models, and it's compataibile with existing LLM's through our scheduling systme.

Features

  • Memory managment pruning
  • Efficient batch processing and response streaming
  • Optimized scheduling for multi-model deployments
  • Custom tokenizer implmentation for self-developed models
  • Inference API
  • KV cache implementation
  • Training CLI for development models
  • Byte-level tokenization

This is very much still an experiment, especially the tokenizer, our scheduler is somewhat well-written, memory management is decent.

I'll continue to slowly improve these components over my weekends.

Scope

This is solely a inference engine. It does not:

  • Implement large model architectures
  • Include pre-trained models
  • Support distributed training

How to use?

Clone repository

git clone https://github.com/andrewn6/tinyllm
pip install -e . 

Register your trained model

tinyllm model register transformer-19m v1 \
    --checkpoint models/tiny-19m.pt \
    --model-type native \
    --description "19M parameter transformer"

Serve and expose to localhost

tinyllm serve \
    --model-name mymodel \
    --port 8000 \
    --model-type native

List models

tinyllm model list

About

Minimal, fast inference engine for LLM's

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages