- Drop support for Python 3.8
- Properly apply prompt format when providing `choices`
- Do not add special tokens before `choices`
- Support multilingual-e5-small embedding model
- Support Falcon 3 Instruct 1B and 3B
- Pin Llama 3.2 model versions
- Decrease repetition penalty for Llama 3.2 models
- Support SmolLM2
- Add `embed` function (see the sketch after this list)
- Support Llama 3.1 8B Instruct
- Use models directly from Hugging Face with `config.use_hf_model()`
- Add "echo" config option to allow streaming tokens to stdout as they are generated
- Skip checking for model updates
- Download entire model upfront even if we only need the tokenizer initially
- Use most recent version of CTranslate2
- Add per-model repetition penalties
- Support Llama 3.2 1B and 3B
- Support Danube3
- Support SmolLM
- Add new separators to document chunking heuristic
- Allow missing query prefixes for embedding models
- Support Phi-3-mini-4k-instruct
- Support GIST-small-Embedding-v0 embedding model
- Store model runtime stats to improve benchmarking and analysis
- Support Meta-Llama-3-8B-Instruct
- Support gemma-2b-it
- Support h2o-danube2-1.8b-chat
- Support WizardLM-2-7B
- Correct issue causing `choices` to be scored improperly
- CUDA 12 support
- Run embedding models on CPU to work around memory copy issue
- Improve embedding search performance
- Add openchat-3.5-0106 model
- Add h2o-danube-1.8b-chat model
- Simplified dialogstudio system message
- Correct missing instruction in openchat prompt
- Improved search speed across many documents
- Reduce memory usage for large document embeddings
- Updated to TinyLlama Chat v1.0
- Remove auto model scaling on Colab
- Correct phi-1.5 prompt format
- Correct model license metadata
- Add Mistral-7B-Instruct-v0.2 model
- Add openchat-3.5-1210 model
- Add phi-2 model
- Support static batching by passing lists to `do` (see the usage sketch after this list)
- Support `choices` list on `do` to restrict possible outputs
- Remove explicit setuptools dependency (see CTranslate2#1526)
- Reduce model size when not using a CPU in Colab
- Default to 8GB model size on Colab
- Allow 2048 token response by default on Colab
- Use Colab GPU by default if available
- Skip returning prompt for decoder-only models
- Ensure whitespace is removed from decoder-only outputs
- Add neural-chat-7b-v3-1 as default 8GB model
- Add `max_tokens` config option
- Add gte-tiny embedding model
- Properly support Python 3.12
- Removed extra prompt when performing classification with generative models
- Prevent doubling of special tokens during classification
- Use per-model instruction formats
- Batch chunk embeddings to embed larger documents faster
- Automatically use query prefixes as needed for embeddings
- Add phi-1.5 model
- Add dialogstudio base model
- Add support for gte-small embeddings
- Add support for bge-small-en embeddings
- Allow token suppression on decoder-only models
- Remove HTML comments appearing in some wiki pages
- Model names no longer include backend and quantization info
- Default to CPU inference unless GPU is enabled via `lm.config["device"] = "auto"`
- Add quantization info to config and use it for memory usage calculation
- Increase repetition penalty to 1.3 from 1.2 to help avoid repetition in smaller models
- Improve semantic meaning of chunk heading
- Remove sentencepiece dependency
- Support GPT-based models
- Add `code` generation function
- Create new configuration system
- Use CUDA if available
- Use non-greedy sampling on the `complete` function
- Decrease chance of splitting chunks on decimal points
- Correct assistant example
- Attempt to chunk context on semantic boundaries
- Allow filtering by model license
- Update classification to only allow valid classes to be returned
- Disable beam search for faster inference
- Normalize output
- Rename some functions
- Support xl models
- Less verbose chat syntax
- Use CTranslate2 for greater efficiency
- Original version using Hugging Face Transformers
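
The entries above that mention passing lists to `do`, the `choices` option, and the `device`, `max_tokens`, and `echo` config keys combine roughly as in the sketch below. This is a minimal usage sketch, not the definitive API: exact accepted values, defaults, and return types are not spelled out in this changelog and may differ between releases.

```python
import languagemodels as lm

# Config options named in this changelog; accepted values and defaults
# may vary between releases.
lm.config["device"] = "auto"    # prefer the GPU (e.g. CUDA) when available
lm.config["max_tokens"] = 200   # cap response length
lm.config["echo"] = True        # stream tokens to stdout as they are generated

# A single prompt returns a single completion.
print(lm.do("What color is the sky?"))

# Passing a list of prompts uses static batching and returns one result per prompt.
print(lm.do(["Classify as positive or negative: I loved it",
             "Classify as positive or negative: I was bored"]))

# choices restricts the output to one of the provided strings; the model's
# prompt format is applied and no extra special tokens are added before scoring.
print(lm.do("Is Mars a planet?", choices=["Yes", "No"]))
```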
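
The `embed` function and `config.use_hf_model()` are only named above, so the sketch below rests on assumptions: that `embed` takes a string and returns a flat vector of floats, that `use_hf_model` is reached through `lm.config` and accepts a Hugging Face repository id, and that the model id shown is merely illustrative.

```python
import languagemodels as lm

# Assumed signature: embed(text) -> list of floats suitable for similarity search.
vector = lm.embed("The quick brown fox jumps over the lazy dog")
print(len(vector))

# Assumed usage: load a model directly from the Hugging Face Hub by repository id.
# Consult the project documentation for the exact signature and supported models.
lm.config.use_hf_model("HuggingFaceTB/SmolLM2-135M-Instruct")
print(lm.do("What is the boiling point of water in Celsius?"))
```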