Releases: huggingface/text-generation-inference

v3.0.2

24 Jan 11:16
b70f29d

TL;DR

New transformers backend: models that are not officially supported can now run directly in TGI with flash attention, at roughly the same performance as native TGI models. Congrats @Cyrilvallez

New models unlocked: Cohere2, olmo, olmo2, helium.

Full Changelog: v3.0.1...v3.0.2

v3.0.1

11 Dec 20:13
bb9095a

Summary

Patch release to handle a few older models and corner cases.

Full Changelog: v3.0.0...v3.0.1

v3.0.0

09 Dec 20:22
8f326c9

TL;DR

Big new release

[Figure: benchmarks_v3]

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

Full Changelog: v2.4.1...v3.0.0

v2.4.1

22 Nov 17:35
d2ed52f

Notable changes

  • Choose input/total tokens automatically based on available VRAM
  • Support Qwen2 VL
  • Decrease latency of very large batches (> 128)
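
As a rough illustration of what choosing token limits from available VRAM involves, the sketch below estimates how many tokens of KV cache fit in a given amount of free memory. This is a back-of-the-envelope calculation, not TGI's actual logic: the function name, the model shape, and the free-memory figure are all assumptions made for the example.

```python
def kv_cache_token_budget(free_bytes, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Estimate how many tokens of KV cache fit in `free_bytes` (illustrative only)."""
    # Each token stores one key vector and one value vector per layer and KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return free_bytes // bytes_per_token


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
budget = kv_cache_token_budget(
    free_bytes=8 * 1024**3,  # assume 8 GiB left after loading the weights
    num_layers=32,
    num_kv_heads=8,
    head_dim=128,
)
print(budget)
```

A real server also has to reserve memory for activations and fragmentation, so a practical budget would be lower than this estimate.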

Full Changelog: v2.3.0...v2.4.1

v2.4.0

25 Oct 21:14
0a655a0

Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1)
  • Experimental FP8 KV cache support
  • Greatly decrease latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of Mllama
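
In general terms, prefill chunking means processing a long prompt in several bounded pieces rather than in one large forward pass. The toy sketch below only shows the splitting step. The function name and chunk size are hypothetical, and it does not reflect how TGI schedules chunks internally.

```python
def chunk_prefill(prompt_tokens, chunk_size):
    """Yield consecutive slices of at most `chunk_size` tokens (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(prompt_tokens), chunk_size):
        yield prompt_tokens[start:start + chunk_size]


# A 10-token prompt split into chunks of at most 4 tokens.
chunks = list(chunk_prefill(list(range(10)), chunk_size=4))
print(chunks)
```

Bounding the size of each piece is what lets a scheduler interleave other work between chunks instead of stalling on one very long prompt.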

v2.3.1

03 Oct 13:01
a094729

Important changes

  • Added support for Mllama (Llama 3.2 vision models), implemented with flash attention and without padding.
  • FP8 performance improvements
  • MoE performance improvements
  • BREAKING CHANGE: when using tools, models could previously answer with a notify_error tool call carrying the error as its content. They now output a regular generation instead.

Full Changelog: v2.3.0...v2.3.1

v2.3.0

20 Sep 16:20
169178b

Important changes

  • Switched from HUGGINGFACE_HUB_CACHE to HF_HOME, to harmonize environment variables across the HF ecosystem.
    As a result, data in the Docker image moved from /data/models-.... to /data/hub/models-....

  • Prefix caching by default! To help with long-running queries, TGI now uses prefix caching and reuses pre-existing queries in the KV cache to speed up TTFT. This should be totally transparent for most users, but it required an intense rewrite of the internals, so bugs may exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer does not support).

  • Lots of performance improvements with Marlin and quantization.
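
To make the prefix-caching idea concrete, here is a toy sketch that finds how many leading tokens of a new request were already seen, so that only the remainder would need a fresh prefill. The class and method names are hypothetical, and real KV-cache reuse works on cached attention blocks rather than a plain trie of token ids. Treat this as an illustration of the lookup, not of TGI's implementation.

```python
class ToyPrefixCache:
    """Trie over token ids that remembers previously seen prompts (illustrative only)."""

    def __init__(self):
        self._root = {}

    def insert(self, tokens):
        # Walk or create one trie node per token.
        node = self._root
        for token in tokens:
            node = node.setdefault(token, {})

    def longest_cached_prefix(self, tokens):
        # Count how many leading tokens match a previously inserted prompt.
        node = self._root
        matched = 0
        for token in tokens:
            if token not in node:
                break
            node = node[token]
            matched += 1
        return matched


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4])          # first request
new_request = [1, 2, 3, 9, 9]       # shares a 3-token prefix with it
reused = cache.longest_cached_prefix(new_request)
to_prefill = len(new_request) - reused
print(reused, to_prefill)
```

The shorter the part that still needs prefill, the sooner the first token can be produced, which is why shared prefixes such as system prompts or chat history benefit most.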

v2.2.0

23 Jul 16:30

Notable changes

  • Llama 3.1 support, including 405B and FP8 support in many mixed configurations (FP8, AWQ, GPTQ, FP8+FP16).
  • Gemma2 softcap support
  • Deepseek v2 support.
  • Lots of internal reworks/cleanup (allowing for cool features)
  • Lots of AWQ/GPTQ work with Marlin kernels (everything should be faster by default)
  • Flash decoding support (set the FLASH_DECODING=1 environment variable; this will probably enable some nice improvements in the future)

Full Changelog: v2.1.1...v2.2.0

v2.1.1

04 Jul 10:43
4dfdb48

Main changes

  • Bugfixes
  • Added FlashDecoding support (beta): set FLASH_DECODING=1 to run TGI with flash decoding (large speedups on long queries). #1940
  • Use Marlin over GPTQ kernels for faster GPTQ inference #2111
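
One way to opt in to the beta flag is to set it in the environment passed to the launcher process. The sketch below only builds that environment. The helper name is hypothetical, and the commented-out launch line uses a placeholder model id, so adapt it to your own deployment.

```python
import os


def launcher_env(flash_decoding=True, base_env=None):
    """Return a copy of the environment with FLASH_DECODING set when requested."""
    env = dict(os.environ if base_env is None else base_env)
    if flash_decoding:
        env["FLASH_DECODING"] = "1"
    return env


env = launcher_env(base_env={})
print(env)

# Hypothetical launch, left commented out so the sketch has no side effects:
# import subprocess
# subprocess.run(["text-generation-launcher", "--model-id", "<model-id>"], env=env)
```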

What's Changed

  • Fixing the CI to also run in release when it's a tag ? by @Narsil in #2138
  • fix microsoft/Phi-3-mini-4k-instruct crash in batch.slots[batch.slot_… by @sywangyi in #2148
  • Fixing clippy. by @Narsil in #2149
  • fix: use weights from base_layer by @drbh in #2141
  • feat: download lora adapter weights from launcher by @drbh in #2140
  • Use GPTQ-Marlin for supported GPTQ configurations by @danieldk in #2111
  • fix AttributeError: 'MixtralLayer' object has no attribute 'mlp' by @icyxp in #2123
  • refine get xpu free memory/enable Qwen2/gemma2/gemma/phi in intel platform by @sywangyi in #2132
  • fix: prefer serde structs over custom functions by @drbh in #2127
  • Fixing test. by @Narsil in #2152
  • GH router. by @Narsil in #2153
  • Fixing baichuan override. by @Narsil in #2158
  • [Major Change][Undecided yet] Move to FlashDecoding instead of PagedAttention kernel. by @Narsil in #1940
  • Fixing graph capture for flash decoding. by @Narsil in #2163
  • fix FlashDecoding change's regression in intel platform by @sywangyi in #2161
  • fix: use the base layers weight in mistral rocm by @drbh in #2155
  • Fixing rocm. by @Narsil in #2164
  • Ci test by @glegendre01 in #2124
  • Hotfixing qwen2 and starcoder2 (which also get clamping). by @Narsil in #2167
  • feat: improve update_docs for openapi schema by @drbh in #2169
  • Fixing the dockerfile warnings. by @Narsil in #2173
  • Fixing missing object field for regular completions. by @Narsil in #2175

Full Changelog: v2.1.0...v2.1.1

v2.1.0

28 Jun 06:26
192d49a

Notable changes

  • New models: gemma2

  • Multi-LoRA adapters: you can now run multiple LoRAs on the same TGI deployment. #2010

  • Faster GPTQ inference and Marlin support (up to 2x speedup).

  • Reworked the entire scheduling logic (better block allocations, and allowing further speedups in new releases)

  • Lots of ROCm support and bugfixes.

  • Lots of new contributors! Thanks a lot for these contributions.
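
With several LoRA adapters loaded on one deployment, each request has to say which adapter it wants. The sketch below builds request bodies that select an adapter per request. The `adapter_id` parameter name is an assumption that should be checked against the TGI documentation for your version, and the adapter names and helper function are hypothetical.

```python
import json


def build_generate_payload(prompt, adapter_id=None, max_new_tokens=32):
    """Build a JSON-serializable request body, optionally selecting a LoRA adapter."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Assumed field name; omit it to use the base model.
        parameters["adapter_id"] = adapter_id
    return {"inputs": prompt, "parameters": parameters}


# Two requests against the same deployment, each targeting a different adapter.
payloads = [
    build_generate_payload("Hello", adapter_id=name)
    for name in ("adapter-a", "adapter-b")
]
body = json.dumps(payloads[0])
print(body)
```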
