Releases: l3utterfly/llama.cpp
Releases · l3utterfly/llama.cpp
b4767
[SYCL] Optimize mul_mat for Q4_0 on Intel GPU (#12035) * opt performance by reorder for Intel GPU * detect hw type and save opt feature, and print opt feature * correct name * support optimize graph once when compute graph, record the opt status in tensor->extra, make CI passed * add env variable GGML_SYCL_DISABLE_OPT for debug * use syclex::architecture replace the custom hw define, update the guide for GGML_SYCL_DISABLE_OPT * add performance data * mv getrows functions to separeted files * fix global variables --------- Co-authored-by: arthw <[email protected]>
b4713
vulkan: linux builds + small subgroup size fixes (#11767) * mm subgroup size * upload vulkan x86 builds
b4687
CUDA: use arch list for compatibility check (#11775) * CUDA: use arch list for feature availability check --------- Co-authored-by: Diego Devesa <[email protected]>
b4519
common : add -hfd option for the draft model (#11318) * common : add -hfd option for the draft model * cont : fix env var * cont : more fixes
b4393
vulkan: multi-row k quants (#10846) * multi row k quant shaders! * better row selection * more row choices * readjust row selection * rm_kq=2 by default
b4302
ggml: load all backends from a user-provided search path (#10699) * feat: load all backends from a user-provided search path * fix: Windows search path * refactor: rename `ggml_backend_load_all_in_search_path` to `ggml_backend_load_all_from_path` * refactor: rename `search_path` to `dir_path` * fix: change `NULL` to `nullptr` Co-authored-by: Diego Devesa <[email protected]> * fix: change `NULL` to `nullptr` --------- Co-authored-by: Diego Devesa <[email protected]>
b4219
sycl : Reroute permuted mul_mats through oneMKL (#10408) This PR fixes the failing MUL_MAT tests for the sycl backend.
b4200
ci : faster CUDA toolkit installation method and use ccache (#10537) * ci : faster CUDA toolkit installation method and use ccache * remove fetch-depth * only pack CUDA runtime on master
b4098
vulkan: Optimize some mat-vec mul quant shaders (#10296) Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops. Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before. Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.
b4033
ggml : fix arch check in bf16_to_fp32 (#10164)