chore(main): release 0.0.5 (flashinfer-ai#232)
🤖 I have created a release *beep* *boop*

---

## [0.1.0](flashinfer-ai/flashinfer@v0.0.4...v0.1.0) (2024-06-20)

### Highlights

* Support any GQA group size for tensor-core kernels.
* Support any page size for tensor-core kernels.
* Support CUDA Graphs for the prefill/decode APIs.
* Add a `use_tensor_cores` option to accelerate decode kernels with tensor cores (see the sketch after these notes).
* Support custom attention masks (https://docs.flashinfer.ai/tutorials/kv_layout.html#mask-layout-2d-ragged-tensor); see the sketch after these notes.
* Support logits cap, as used in Grok-1 models.
* Fused GPU sampling kernels: top-p, top-k, and speculative verification (https://docs.flashinfer.ai/api/python/sampling.html); see the sketch after these notes.
* PyTorch wrappers for the grouped-GEMM CUTLASS kernels.

### Acknowledgement

We thank [@ibsidorenko](https://github.com/ibsidorenko), [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU), [@Yard1](https://github.com/Yard1), [@AgrawalAmey](https://github.com/AgrawalAmey), [@xuzhenqi](https://github.com/xuzhenqi), [@mgerstgrasser](https://github.com/mgerstgrasser), [@esmeetu](https://github.com/esmeetu), [@yz-tang](https://github.com/yz-tang), [@HSQ79815](https://github.com/HSQ79815), [@Qubitium](https://github.com/Qubitium), [@shreygupta2809](https://github.com/shreygupta2809), [@sighingnow](https://github.com/sighingnow), [@vinx13](https://github.com/vinx13), [@tqchen](https://github.com/tqchen), [@merrymercy](https://github.com/merrymercy), [@comaniac](https://github.com/comaniac), and many others for their contributions and helpful discussions on this release.

### Refactor

* support any GQA group size for tensor-cores kernels ([flashinfer-ai#301](flashinfer-ai#301)) ([c111ca](flashinfer-ai@c111ca6))
* support any page size for tensor-cores kernels ([flashinfer-ai#306](flashinfer-ai#306)) ([82fd8c](flashinfer-ai@82fd8c7))

### Features

* add `use_tensor_cores` option to decode kernels to accelerate GQA ([flashinfer-ai#317](flashinfer-ai#317)) ([3b50dd5](flashinfer-ai@3b50dd5))
* add group gemm operators ([flashinfer-ai#282](flashinfer-ai#282)) ([e08ba42](flashinfer-ai@e08ba42))
* initial support of distributed operators ([flashinfer-ai#289](flashinfer-ai#289)) ([03553da](flashinfer-ai@03553da))
* initial support of logits hook ([flashinfer-ai#298](flashinfer-ai#298)) ([ab1e2ad](flashinfer-ai@ab1e2ad))
* separate Q and KV dtypes for decode ([flashinfer-ai#286](flashinfer-ai#286)) ([5602659](flashinfer-ai@5602659))
* support cuda graph for batched multi-query (prefill/append) attention ([flashinfer-ai#275](flashinfer-ai#275)) ([83ceb67](flashinfer-ai@83ceb67))
* support cuda graph for batched multi-query (prefill/append) attention ([flashinfer-ai#277](flashinfer-ai#277)) ([24cc583](flashinfer-ai@24cc583))
* support custom attention mask in prefill/append attention kernels ([flashinfer-ai#266](flashinfer-ai#266)) ([7304282](flashinfer-ai@7304282))
* fused speculative sampling kernels ([flashinfer-ai#259](flashinfer-ai#259)) ([cea2bb](flashinfer-ai@cea2bb9))
* expose sampling APIs in pytorch ([flashinfer-ai#238](flashinfer-ai#238)) ([092902](flashinfer-ai@0929023))

### Performance Improvements

* initial cuda graph support ([flashinfer-ai#256](flashinfer-ai#256)) ([7e9cc7f](flashinfer-ai@7e9cc7f))
* split kv-cache for prefill/append kernels ([flashinfer-ai#310](flashinfer-ai#310)) ([f0bb0a3](flashinfer-ai@f0bb0a3))
* use packed bit array for attention mask ([flashinfer-ai#308](flashinfer-ai#308)) ([3d43dc9](flashinfer-ai@3d43dc9))
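A minimal sketch of the new `use_tensor_cores` decode path, assuming the 0.1.0-era `BatchDecodeWithPagedKVCacheWrapper` API (`begin_forward`/`forward`/`end_forward`); all shapes, page counts, and the 128 MB workspace size are illustrative rather than prescribed by the release:

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
batch_size, pages_per_seq = 4, 2
num_pages = batch_size * pages_per_seq

# Workspace buffer used internally by the wrapper's scheduler.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda:0")

# use_tensor_cores=True dispatches GQA decode to the tensor-core (prefill)
# kernels, which pays off for large GQA group sizes.
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(
    workspace, "NHD", use_tensor_cores=True
)

# Page table in CSR form: each request owns pages_per_seq full pages.
kv_indptr = torch.arange(0, num_pages + 1, pages_per_seq, dtype=torch.int32, device="cuda:0")
kv_indices = torch.arange(num_pages, dtype=torch.int32, device="cuda:0")
kv_last_page_len = torch.full((batch_size,), page_size, dtype=torch.int32, device="cuda:0")

wrapper.begin_forward(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda:0")
# Paged KV cache in NHD layout: (num_pages, 2, page_size, num_kv_heads, head_dim).
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda:0")
out = wrapper.forward(q, kv_cache)  # (batch_size, num_qo_heads, head_dim)
wrapper.end_forward()
```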
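The fused GPU sampling kernels live under `flashinfer.sampling`. Below is a sketch of top-p sampling, assuming the 0.1.0-era rejection-sampling interface that takes pre-drawn uniform noise and returns a per-request success flag; the argument order follows my reading of the docs of that era, so verify against your installed version:

```python
import torch
import flashinfer

batch_size, vocab_size, max_rounds = 4, 32000, 32

# Probabilities from some model forward pass (random here for illustration).
probs = torch.softmax(torch.randn(batch_size, vocab_size, device="cuda:0"), dim=-1)

# One row of uniform noise per rejection round.
uniform_samples = torch.rand(max_rounds, batch_size, device="cuda:0")

# samples: (batch_size,) token ids; success: (batch_size,) bool flags that
# indicate whether rejection sampling converged within max_rounds.
samples, success = flashinfer.sampling.top_p_sampling_from_probs(
    probs, uniform_samples, 0.9
)
```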
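For the custom attention mask highlight, a sketch against the single-request prefill API, assuming `flashinfer.single_prefill_with_kv_cache` accepts a boolean `custom_mask` of shape `(qo_len, kv_len)` as described in the 0.1.0 docs; the causal-plus-sliding-window pattern is an illustrative choice, not part of the release:

```python
import torch
import flashinfer

qo_len, kv_len, num_heads, head_dim = 128, 128, 32, 128
q = torch.randn(qo_len, num_heads, head_dim, dtype=torch.float16, device="cuda:0")
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda:0")
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16, device="cuda:0")

# True = position may be attended to. Causal mask intersected with a
# 64-token sliding window (any boolean pattern works).
pos = torch.arange(kv_len, device="cuda:0")
mask = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < 64)

out = flashinfer.single_prefill_with_kv_cache(q, k, v, custom_mask=mask)
```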
---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <[email protected]>