## TL;DR
New Transformers backend supporting FlashAttention at roughly the same performance as native TGI implementations, bringing all models not officially supported by TGI directly into TGI. Congrats @Cyrilvallez!
New models unlocked: Cohere2, OLMo, OLMo2, Helium.
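With the Transformers backend, a model that has no native TGI implementation can be served with the usual launcher invocation; TGI falls back to the Transformers modeling code automatically when possible. A minimal sketch (the image tag and model id below are illustrative, not prescriptive):

```shell
# Serve a model through TGI; models without a native implementation
# now run via the new Transformers backend (illustrative model id).
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:3.0.2 \
    --model-id allenai/OLMo-2-1124-7B

# Then query the server from another terminal:
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 32}}'
```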
## What's Changed
- docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in #2814
- Fixing latest flavor by disabling it. by @Narsil in #2831
- fix facebook/opt-125m not working issue by @sywangyi in #2824
- Fixup opt to reduce the amount of odd if statements. by @Narsil in #2833
- TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in #2791
- Feat/trtllm cancellation dev container by @Hugoch in #2795
- New arg. by @Narsil in #2845
- Fixing CI. by @Narsil in #2846
- fix: lint backend and doc files by @drbh in #2850
- Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in #2840
- Update vllm kernels for ROCM by @mht-sharma in #2826
- change xpu lib download link by @sywangyi in #2852
- fix: include add_special_tokens in kserve request by @drbh in #2859
- chore: fixed some typos and attribute issues in README by @ruidazeng in #2891
- update ipex xpu to fix issue in ARC770 by @sywangyi in #2884
- Basic flashinfer 0.2 support by @danieldk in #2862
- Improve vlm support (add idefics3 support) by @drbh in #2437
- Update to marlin-kernels 0.3.7 by @danieldk in #2882
- chore: Update jsonschema to 0.28.0 by @Stranger6667 in #2870
- Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in #2837
- Update using_guidance.md by @nbroad1881 in #2901
- fix crash in torch2.6 if TP=1 by @sywangyi in #2885
- Add Flash decoding kernel ROCm by @mht-sharma in #2855
- Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in #2825
- Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in #2903
- docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in #2863
- Fix `docker run` in README.md by @alvarobartt in #2861
- add guide on using TPU with TGI in the docs by @baptistecolle in #2907
- Upgrading our rustc version. by @Narsil in #2908
- Fix typo in TPU docs by @baptistecolle in #2911
- Removing the github runner. by @Narsil in #2912
- Upgrading bitsandbytes. by @Narsil in #2910
- Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in #2917
- feat: improve star coder to support multi lora layers by @drbh in #2883
- Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in #2815
- nix: update to PyTorch 2.5.1 by @danieldk in #2921
- Moving to `uv` instead of `poetry`. by @Narsil in #2919
- Add fp8 kv cache for ROCm by @mht-sharma in #2856
- fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in #2918
- feat: improve qwen2-vl startup by @drbh in #2802
- Revert "feat: improve qwen2-vl startup " by @drbh in #2924
- flashinfer: switch to plan API by @danieldk in #2904
- Fixing TRTLLM dockerfile. by @Narsil in #2922
- Flash Transformers modeling backend support by @Cyrilvallez in #2913
- Give TensorRT-LLM a proper CI/CD by @mfuntowicz in #2886
- Trying to avoid the random timeout. by @Narsil in #2929
- Run `pre-commit run --all-files` to fix CI by @alvarobartt in #2933
- Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in #2937
- fix moe in quantization path by @sywangyi in #2935
- Clarify FP8-Marlin use on capability 8.9 by @danieldk in #2940
- Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in #2931
- Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in #2932
- Add NVIDIA A40 to known cards by @kldzj in #2941
- [TRTLLM] Expose finish reason by @mfuntowicz in #2841
- Tmp tp transformers by @Narsil in #2942
- Transformers backend TP fix by @Cyrilvallez in #2945
- Trying to put back the archlist (to fix the oom). by @Narsil in #2947
## New Contributors
- @janne-alatalo made their first contribution in #2840
- @ruidazeng made their first contribution in #2891
- @Stranger6667 made their first contribution in #2870
- @lazariv made their first contribution in #2837
- @baptistecolle made their first contribution in #2907
- @Cyrilvallez made their first contribution in #2913
- @kldzj made their first contribution in #2941
**Full Changelog**: v3.0.1...v3.0.2