forked from ggml-org/llama.cpp
merge upstream #53
Merged
Conversation
…-org#11740)

* server : use common_token_to_piece instead of common_detokenize

  This commit replaces the call to common_detokenize with common_token_to_piece in populate_token_probs. The motivation for this change is to avoid an issue where common_detokenize would remove the word-boundary character from tokens, which caused a regression in the server-generated token probabilities.

  Resolves: ggml-org#11728

* squash! server : use common_token_to_piece instead of common_detokenize

  Use common_token_to_piece for post_sampling_probs as well.
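A minimal sketch of the difference the commit above describes, assuming the helper signatures from llama.cpp's common library; this is not the actual server code, just an illustration of why converting a single token directly preserves its word-boundary prefix.

```cpp
// Illustrative sketch only (not the server code from the PR).
// Assumes the llama.cpp common helpers common_token_to_piece() and common_detokenize().
#include <string>
#include <vector>
#include "common.h"

// Before: detokenizing a single-token vector may strip the leading word-boundary
// character (e.g. the SentencePiece "▁"), distorting per-token probability output.
std::string piece_via_detokenize(llama_context * ctx, llama_token tok) {
    return common_detokenize(ctx, std::vector<llama_token>{tok});
}

// After: converting the token directly returns the raw piece, including the
// word-boundary prefix.
std::string piece_via_token_to_piece(llama_context * ctx, llama_token tok) {
    return common_token_to_piece(ctx, tok);
}
```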
…yValueEx (ggml-org#11803)

* Fix ggml-org#11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx
* Fix ggml-org#11802: PR ggml-org#11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
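A hedged sketch of the Windows API issue the commit above refers to: pairing the ANSI entry point RegQueryValueExA with the TEXT() macro breaks under UNICODE builds, because TEXT() then expands to a wide-string literal. The value name below is a hypothetical placeholder, not taken from the PR.

```cpp
// Illustrative sketch only, not the code from the PR.
#ifdef _WIN32
#include <windows.h>

DWORD read_dword_value(HKEY key) {
    DWORD value = 0;
    DWORD size  = sizeof(value);
    // RegQueryValueExA expects an ANSI (char*) value name. Under a UNICODE build,
    // TEXT("SomeValue") would expand to a wide string and fail to compile, so the
    // name is passed as a plain ANSI literal instead.
    if (RegQueryValueExA(key, "SomeValue", nullptr, nullptr,
                         reinterpret_cast<LPBYTE>(&value), &size) != ERROR_SUCCESS) {
        return 0;
    }
    return value;
}
#endif
```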
Signed-off-by: Weizhao Ouyang <[email protected]>
* Bug fix for clamp_f32

  When using tensors larger than 1D, the clamp operation does not work because of the restriction of returning early when ith is not 0.

* Bug fix for clamp_f32
* Bug fix for clamp_f32
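A hedged sketch of the threading pattern the commit above alludes to, not the backend's actual implementation: in ggml-style kernels each thread (index ith of nth) owns a slice of rows, so bailing out when ith != 0 leaves the other threads' rows untouched once the tensor has more than one row.

```cpp
// Illustrative sketch only; names and layout are assumptions, not the PR code.
#include <algorithm>
#include <cstdint>

static void clamp_f32_rows(const float * src, float * dst,
                           int64_t nrows, int64_t ncols,
                           float lo, float hi,
                           int ith, int nth) {
    // Buggy variant: if (ith != 0) return;  // only thread 0 would clamp anything
    for (int64_t r = ith; r < nrows; r += nth) {  // every thread clamps its own rows
        for (int64_t c = 0; c < ncols; ++c) {
            dst[r*ncols + c] = std::clamp(src[r*ncols + c], lo, hi);
        }
    }
}
```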
…rg#11814)

* All messages get the copy button
* Update index.html.gz
* ggml : x2 speed for WASM by optimizing SIMD
* fix bad merging
* rm trailing spaces
* rm redundant clamp
* better quantize_row_q8_K (Co-authored-by: camel-cdr <[email protected]>)
* remove memset that causes buffer overflow (Co-authored-by: camel-cdr <[email protected]>)

---------

Co-authored-by: camel-cdr <[email protected]>
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata

  parallelize src1 quantization by column to allow parallelization even when there is only one row

* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
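An illustrative sketch of chunked work distribution through a shared counter, the general technique the commit above describes for mul_mat_id; this is not the ggml-cpu code itself, and the names are assumptions.

```cpp
// Illustrative sketch only.
#include <atomic>
#include <cstdint>

struct chunk_state {
    // In ggml this counter would live in shared scratch memory (wdata); it must be
    // initialized even when only one thread runs, as the commit's fix notes.
    std::atomic<int64_t> next_chunk{0};
};

static void worker(chunk_state & st, int64_t n_chunks,
                   void (*process_chunk)(int64_t chunk)) {
    // Each thread repeatedly claims the next unprocessed chunk until none remain,
    // which balances load even when chunks take uneven amounts of time.
    for (;;) {
        const int64_t chunk = st.next_chunk.fetch_add(1, std::memory_order_relaxed);
        if (chunk >= n_chunks) {
            break;
        }
        process_chunk(chunk);
    }
}
```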
This commit updates the comment in llama_kv_cache.h to reflect the change of the function name from llama_decode_internal to llama_decode_impl.
There was a typo-like error which would print the same number twice if a request is received with n_predict greater than the server-side configuration.

Before the fix:

```
slot launch_slot_: id 0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:

```
slot launch_slot_: id 0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
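A tiny sketch of the class of fix described above; the function and variable names are illustrative, not the actual server code. The point is to capture the requested value before clamping so the log line can show both the request and the configured limit.

```cpp
// Illustrative sketch only.
#include <algorithm>
#include <cstdio>

void apply_n_predict(int & n_predict, int n_predict_max) {
    const int requested = n_predict;              // keep the original request
    n_predict = std::min(n_predict, n_predict_max);
    if (requested > n_predict_max) {
        // Before the fix the clamped value was printed twice; log the request instead.
        std::printf("n_predict = %d exceeds server configuration, setting to %d\n",
                    requested, n_predict_max);
    }
}
```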
* initial sampling changes:
* completed top nsigma sampler implementation
* apply parameter to only llama-cli
* updated readme
* added tests and fixed nsigma impl
* cleaned up pr
* format
* format
* format
* removed commented tests
* cleanup pr and remove explicit floats
* added top-k sampler to improve performance
* changed sigma to float
* fixed string format to float
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update common/sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <[email protected]>
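A hedged sketch of the top-nσ filtering idea behind the sampler added above: keep only tokens whose logit is within n standard deviations of the maximum logit. The real sampler lives in src/llama-sampling.cpp and is structured differently (and, as the commit notes, applies top-k first for performance); the code below is only an illustration.

```cpp
// Illustrative sketch only.
#include <algorithm>
#include <cmath>
#include <vector>

struct token_logit { int id; float logit; };

// Keep only tokens whose logit is within n standard deviations of the maximum logit.
std::vector<token_logit> top_n_sigma(std::vector<token_logit> cand, float n) {
    if (cand.empty()) return cand;
    float max_l = cand[0].logit;
    float mean  = 0.0f;
    for (const auto & c : cand) {
        max_l = std::max(max_l, c.logit);
        mean += c.logit;
    }
    mean /= (float) cand.size();
    float var = 0.0f;
    for (const auto & c : cand) {
        var += (c.logit - mean) * (c.logit - mean);
    }
    const float sigma = std::sqrt(var / (float) cand.size());

    std::vector<token_logit> kept;
    for (const auto & c : cand) {
        if (c.logit >= max_l - n * sigma) kept.push_back(c);
    }
    return kept;
}
```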
… (Command R7B & DeepSeek R1) unless `--reasoning-format none` (ggml-org#11607)

* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B
* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template
* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out
* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability
* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* musa: Update MUSA SDK version to rc3.1.1

  Signed-off-by: Xiaodong Ye <[email protected]>

* musa: Remove workaround in PR ggml-org#10042

  Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
This commit adds a new option `--completion-bash` to llama.cpp which outputs a source-able bash completion script.

The motivation for this change is to provide a more user-friendly experience for users of the llama.cpp command-line interface. The completion is currently basic and all options are displayed for all llama executables, but this can be improved in the future if needed.

Example usage:

```console
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
$ ./build/bin/llama-server --m<TAB>
--main-gpu --mirostat --mirostat-lr --model --multiline-input
--min-p --mirostat-ent --mlock --model-url
```
Call updated to match the tool used in the output just below, following the example in ggml-org#9639
ggml-org#11832)

* llama-bench : fix unexpected global variable initialize sequence issue
* Update examples/llama-bench/llama-bench.cpp

---------

Co-authored-by: Diego Devesa <[email protected]>
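A general sketch of the class of bug named above (static initialization order), not the llama-bench code: a global whose initializer reads another global defined in a different translation unit may run before that other global has been constructed. One common fix is to defer construction to first use via a function-local static; whether the PR uses this exact pattern is an assumption.

```cpp
// Illustrative sketch only; values and names are hypothetical.
#include <string>

// Fragile: if this initializer runs before `default_model` (a global defined in another
// translation unit) has been constructed, it copies an empty value.
//   static std::string current_model = default_model;

// Safer: the function-local static is constructed the first time the function is
// called, so it is always initialized before it is read.
static const std::string & default_model() {
    static const std::string value = "model.gguf";
    return value;
}
```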
* mm subgroup size
* upload vulkan x86 builds
Labels: devops, documentation, examples, ggml, Nvidia GPU, python, script, server, testing, Vulkan