forked from ggml-org/llama.cpp
merge upstream #50
Merged
Conversation
…g#10942)
* tests: Add im2col perf tests
* vulkan: optimize im2col, more elements per thread
* vulkan: increase small tile size for NV_coopmat2
* vulkan: change im2col to 512 elements per workgroup
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better. Share some code for reducing the result values to memory in mul_mat_vec_base.
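As a rough illustration of the loop structure described above, here is a scalar C++ sketch (not the actual Vulkan shader; the names, layout, and stride handling are assumptions):

```cpp
#include <cstddef>

// Hypothetical scalar reference: y[:, c] = A * B[:, c] for c in [0, NUM_COLS).
// Mirrors the ordering in the commit message: the loads from B sit in the
// innermost loop, and b_stride stands in for the overloaded batch/row stride.
template <int NUM_COLS>
void mul_mat_vec_ref(const float * A, const float * B, float * y,
                     size_t nrows, size_t ncols, size_t b_stride) {
    for (size_t r = 0; r < nrows; ++r) {
        float acc[NUM_COLS] = {0.0f};
        for (size_t k = 0; k < ncols; ++k) {
            const float a = A[r * ncols + k];
            for (int c = 0; c < NUM_COLS; ++c) {
                acc[c] += a * B[c * b_stride + k];  // B loads in the innermost loop
            }
        }
        for (int c = 0; c < NUM_COLS; ++c) {
            y[c * nrows + r] = acc[c];  // shared reduction/store step per column
        }
    }
}
```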
…hen building with LLAMA_CURL=ON and GGML_OPENCL=ON (ggml-org#11013)

In common/common.cpp:
* Convert the stat() call used to check whether a file exists to the standard library function std::filesystem::exists (error: unable to match the correct function signature)
* Add conditions to check whether PATH_MAX is already defined in the WIN32 environment (warning: it is already defined in MSYS2)

In examples/run/run.cpp:
* Add the io.h header inclusion (error: cannot find function _get_osfhandle)
* Change initialisers for OVERLAPPED to an empty struct (warning about uninitialised members)
* Add an initialiser for hFile (warning: it may be uninitialised)
* Cast the curl_off_t percentage value to long int in the generate_progress_prefix function (warning: curl_off_t is long long int)

In ggml/src/ggml-opencl/ggml-opencl.cpp:
* Initialise certain declared cl_mem variables to nullptr for greater safety (warning: the B_d variable may be used unassigned)
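The stat()-to-std::filesystem change described above amounts to roughly the following (a simplified sketch of the idea, not the exact diff from common/common.cpp):

```cpp
#include <filesystem>
#include <string>
#include <sys/stat.h>

// Before: POSIX stat() used only to test whether the file exists.
static bool file_exists_stat(const std::string & path) {
    struct stat st;
    return stat(path.c_str(), &st) == 0;
}

// After: the portable C++17 equivalent, no platform-specific struct needed.
static bool file_exists_fs(const std::string & path) {
    return std::filesystem::exists(path);
}
```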
* conflict resolution
* move comments after a bracket onto their own line
* DeciLMCausalModel now reads rope_theta from config.json properly
* server : add OAI compat for /v1/completions
* add test
* add docs
* better docs
* server : clean up built-in template detection
* fix compilation
* add chat template test
* fix condition
…g#11027)
* Fixes for clang AVX VNNI
* enable AVX VNNI and Alder Lake build for MSVC
* Apply suggestions from code review
---------
Co-authored-by: slaren <[email protected]>
* list llama-swap under tools in README
* readme: add llama-swap to Infrastructure
* slot.can_batch_with
* lora per request
* test: force disable cache prompt
* move can_batch_with check
* fix condition
* add slow test with llama 8b
* update docs
* move lora change task to queue
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <[email protected]>
* lora_base
* remove redundant check
---------
Co-authored-by: Georgi Gerganov <[email protected]>
* server/bench:
  - support OpenAI streaming standard output with [DONE]\n\n
  - export k6 raw results in csv
  - fix too many idle TCP connections in tcp_wait
  - add metric: time to emit first token
* server/bench:
  - fix when prometheus not started
  - wait for server to be ready before starting bench
* llama : scatter llama.cpp into multiple modules (wip)
* llama : control-vector -> adapter
* llama : arch
* llama : mmap ggml-ci
* ci : remove BUILD_SHARED_LIBS=OFF ggml-ci
* llama : arch (cont) ggml-ci
* llama : chat ggml-ci
* llama : model ggml-ci
* llama : hparams ggml-ci
* llama : adapter ggml-ci
* examples : fix ggml-ci
* rebase ggml-ci
* minor
* llama : kv cache ggml-ci
* llama : impl ggml-ci
* llama : batch ggml-ci
* cont ggml-ci
* llama : context ggml-ci
* minor
* llama : context (cont) ggml-ci
* llama : model loader ggml-ci
* common : update lora ggml-ci
* llama : quant ggml-ci
* llama : quant (cont) ggml-ci
* minor [no ci]
…ls (ggml-org#11053)
* Disable KV cache shifting automatically for unsupported models instead of exiting directly
Signed-off-by: Molly Sophia <[email protected]>
* Update common/common.cpp
Co-authored-by: Georgi Gerganov <[email protected]>
---------
Signed-off-by: Molly Sophia <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
This commit attempts to improve the log message for the inputs of the splits in the sched_print_assignments function. The motivation for this change is that currently, even if there are no inputs, a colon is displayed at the end of the line, which can be confusing when reading the output: it could be interpreted as meaning that the lines below are inputs when they are in fact nodes. With this change the colon is only printed if there actually are inputs.
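In other words, the fix has roughly the following shape (a hypothetical sketch; the function and variable names here are illustrative, not the actual sched_print_assignments code):

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Print a split's summary line; emit the trailing "inputs:" part only when the
// split actually has inputs, so an empty input list is not mistaken for one.
static void print_split(int split_id, size_t n_nodes, const std::vector<std::string> & inputs) {
    printf("split #%d: %zu nodes", split_id, n_nodes);
    if (!inputs.empty()) {
        printf(", inputs:");
        for (const auto & name : inputs) {
            printf(" %s", name.c_str());
        }
    }
    printf("\n");
}
```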
…ggml-org#11047)
* Added init tensor calling code
* Added get_alloc_size forwarding
* Cleaned up and improved type/error handling.
* fix: remove trailing whitespaces.
* Cleanup and use GGML error logging functions.
* Handle potentially dangerous edge cases.
* Apply suggestions from code review
Co-authored-by: Diego Devesa <[email protected]>
---------
Co-authored-by: Diego Devesa <[email protected]>
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types
---------
Co-authored-by: Stanisław Szymczyk <[email protected]>
…tary driver (ggml-org#11074)
* Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver
* Add (TM) to AMD name check
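A name-based check of that kind could look roughly like this (a guessed sketch; the real blacklist lives in the Vulkan backend and keys off the reported device and driver properties):

```cpp
#include <string>

// Hypothetical helper: treat a device as matching the AMD proprietary-driver
// blacklist when the reported name contains both "AMD" and "(TM)", as the
// commit's name check suggests.
static bool amd_proprietary_name(const std::string & device_name) {
    return device_name.find("AMD")  != std::string::npos &&
           device_name.find("(TM)") != std::string::npos;
}
```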
* CUDA: add BF16 support
* mmap : fix fileno macro clash ggml-ci
* cont ggml-ci
* tokenize : escape the prompt
* tokenize : update help
Do masking on whole dwords, fetch all scales at once.
…l-org#11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
* ggml: copy q->f32 assumes some contiguity in the destination
Register RPC devices early and do not propagate RPC specifics into the llama model structures. ref: ggml-org#10609
…al tokens when sending a message (ggml-org#11270)
…org#11281) Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression. Add noncontiguous FA tests in test-backend-ops. Fixes ggml-org#11268.
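Forcing stride alignment usually comes down to rounding a stride up to the next multiple of the required alignment, roughly as below (a generic helper sketch under that assumption; the actual shader-side handling may differ, e.g. by selecting an aligned code path instead):

```cpp
#include <cassert>
#include <cstdint>

// Round `stride` up to the next multiple of `align` (power of two assumed,
// as alignment requirements typically are).
static uint32_t align_stride(uint32_t stride, uint32_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (stride + align - 1) & ~(align - 1);
}
```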
* Added the ability to use guide tokens for OuteTTS, greatly improving TTS recitation accuracy over long input sequences.
* applied linting suggestions, updated to latest llama_vocab changes, added a safety check, added newline to guide token start
* server : implement cancellable request
* fix typo
* httplib 0.18.5
* fix i underflow
* cmake : add sanitizer flags for llama.cpp ggml-ci
* tests : fix compile warnings ggml-ci
* cmake : move sanitizer flags to llama_add_compile_flags ggml-ci
* cmake : move llama.cpp compile flags to top level lists ggml-ci
* cmake : apply only sanitizer flags at top level ggml-ci
* tests : fix gguf context use in same_tensor_data
* gguf-test: tensor data comparison
* dummy : trigger ggml-ci
* unicode : silence gcc warnings ggml-ci
* ci : use sanitizer builds only in Debug mode ggml-ci
* cmake : add status messages [no ci]
---------
Co-authored-by: Johannes Gäßler <[email protected]>
Labels: android, Apple Metal, build, devops, documentation, examples, ggml, Nvidia GPU, python, script, server, SYCL, testing, Vulkan