merge upstream #53

Merged
merged 26 commits into from
Feb 14, 2025
Conversation

l3utterfly
Owner

No description provided.

danbev and others added 26 commits February 11, 2025 14:06
…-org#11740)

* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs function.

The motivation for this change is to avoid an issue where
common_detokenize would remove the word-boundary character from tokens,
which caused a regression in the server-generated token probabilities.

Resolves: ggml-org#11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
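A minimal sketch of the distinction, assuming the `common_token_to_piece` and `common_detokenize` helpers from llama.cpp's common library; the wrapper function here is illustrative, not the actual populate_token_probs code:

```cpp
#include "common.h"

#include <string>

// Illustrative wrapper: convert one candidate token to text for the
// probability output. Detokenizing a single-token vector could strip the
// token's leading word-boundary marker (e.g. the "▁" prefix), while
// common_token_to_piece preserves the raw piece.
static std::string token_to_text(llama_context * ctx, llama_token token) {
    return common_token_to_piece(ctx, token);
    // regression variant: common_detokenize(ctx, {token}) dropped the
    // boundary character, corrupting the per-token probability output
}
```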
…yValueEx (ggml-org#11803)

* Fix ggml-org#11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix ggml-org#11802: PR ggml-org#11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
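As a hedged illustration of why the ANSI entry point matters (the registry path and value name here are a plausible example, not necessarily what ggml queries):

```cpp
#include <windows.h>
#include <cstdio>

// RegQueryValueExA takes an ANSI (char) value name. Wrapping the literal in
// TEXT() expands to a wide string under UNICODE builds and breaks the call,
// hence the fix: keep the A-suffixed function and drop the TEXT macro.
static void print_cpu_description() {
    HKEY key;
    if (RegOpenKeyExA(HKEY_LOCAL_MACHINE,
            "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
            0, KEY_READ, &key) != ERROR_SUCCESS) {
        return;
    }
    char desc[256] = {0};
    DWORD size = sizeof(desc);
    if (RegQueryValueExA(key, "ProcessorNameString", nullptr, nullptr,
                         (LPBYTE) desc, &size) == ERROR_SUCCESS) {
        std::printf("%s\n", desc);
    }
    RegCloseKey(key);
}
```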
* Bug fix for clamp_f32

When using tensors with more than one dimension, the clamp operation does not work because of the restriction of returning early when `ith` is not 0.

* Bug fix for clamp_f32

* Bug fix for clamp_f32
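A hedged sketch of the shape of such a fix (the real code lives in ggml's CPU backend and works on its tensor and params structs): stripe rows across threads rather than returning early on every thread but 0, so tensors with more than one dimension are fully clamped.

```cpp
#include <cstdint>

// Illustrative row-striped clamp: thread `ith` of `nth` handles rows
// ith, ith + nth, ith + 2*nth, ... An early `if (ith != 0) return;` would
// instead leave most rows of a multi-dimensional tensor untouched.
static void clamp_f32_rows(const float * src, float * dst,
                           int64_t nrows, int64_t ncols,
                           float min, float max, int ith, int nth) {
    for (int64_t ir = ith; ir < nrows; ir += nth) {
        const float * s = src + ir * ncols;
        float       * d = dst + ir * ncols;
        for (int64_t ic = 0; ic < ncols; ic++) {
            const float v = s[ic];
            d[ic] = v < min ? min : (v > max ? max : v);
        }
    }
}
```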
…rg#11814)

* All messages get the copy button

* Update index.html.gz
* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <[email protected]>

* remove memset that causes buffer overflow

Co-authored-by: camel-cdr <[email protected]>

---------

Co-authored-by: camel-cdr <[email protected]>
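A hedged illustration of the kind of rewrite a WASM SIMD optimization involves (a generic dot product, not the actual quantize_row_q8_K code): process four floats per instruction with the `wasm_simd128.h` intrinsics instead of a scalar loop.

```cpp
#include <wasm_simd128.h>

// Generic example: 4-wide multiply-accumulate with a scalar tail.
static float dot_f32(const float * a, const float * b, int n) {
    v128_t acc = wasm_f32x4_splat(0.0f);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = wasm_f32x4_add(acc,
              wasm_f32x4_mul(wasm_v128_load(a + i), wasm_v128_load(b + i)));
    }
    float sum = wasm_f32x4_extract_lane(acc, 0)
              + wasm_f32x4_extract_lane(acc, 1)
              + wasm_f32x4_extract_lane(acc, 2)
              + wasm_f32x4_extract_lane(acc, 3);
    for (; i < n; i++) {
        sum += a[i] * b[i]; // leftover elements
    }
    return sum;
}
```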
* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata

parallelize src1 quantization by column to allow parallelization even when there is only one row

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes
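A hedged sketch of the chunked work distribution described above (names are illustrative; the real counter lives in ggml's per-op `wdata` scratch buffer): threads claim chunks from a shared atomic counter until none remain, which also explains why the counter matters even in the single-thread case.

```cpp
#include <atomic>

// Each thread claims the next unprocessed chunk; there is no fixed
// per-thread split, so load stays balanced when chunks take uneven time.
// The counter must be zeroed before any thread enters the loop - even with
// one thread, a stale value would skip work (the "uninitialized counter"
// fix mentioned above).
static void process_chunks(std::atomic<int> & next_chunk, int n_chunks,
                           void (*work)(int chunk)) {
    for (;;) {
        const int chunk = next_chunk.fetch_add(1, std::memory_order_relaxed);
        if (chunk >= n_chunks) {
            break;
        }
        work(chunk);
    }
}
```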
This commit updates the comment in llama_kv_cache.h to reflect the
change of the function name from llama_decode_internal to
llama_decode_impl.
There was a typo-like error that would print the same number twice if a
request was received with n_predict greater than the server-side configuration.

Before the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
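A hedged sketch of the logging pattern after the fix (variable and function names are illustrative): capture the requested value before clamping, so the message can show both numbers.

```cpp
#include <cstdio>

// Log the incoming request value, then apply the server-side limit.
static int clamp_n_predict(int requested, int server_limit) {
    if (requested > server_limit) {
        std::printf("n_predict = %d exceeds server configuration, setting to %d\n",
                    requested, server_limit);
        return server_limit;
    }
    return requested;
}
```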
* initial sampling changes:

* completed top nsigma sampler implementation

* apply parameter to only llama-cli

* updated readme

* added tests and fixed nsigma impl

* cleaned up pr

* format

* format

* format

* removed commented tests

* cleanup pr and remove explicit floats

* added top-k sampler to improve performance

* changed sigma to float

* fixed string format to float

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update common/sampling.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <[email protected]>
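For orientation, a hedged sketch of the top-nσ idea (not the llama.cpp sampler code): keep only tokens whose logit lies within n standard deviations of the maximum logit, masking the rest before softmax and sampling.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Mask logits below max_logit - n * sigma, where sigma is the standard
// deviation of all logits; the surviving tokens are sampled normally.
static void top_n_sigma(std::vector<float> & logits, float n) {
    if (logits.empty()) {
        return;
    }
    float max_l = logits[0], sum = 0.0f;
    for (const float l : logits) {
        max_l = std::max(max_l, l);
        sum  += l;
    }
    const float mean = sum / logits.size();
    float var = 0.0f;
    for (const float l : logits) {
        var += (l - mean) * (l - mean);
    }
    const float sigma     = std::sqrt(var / logits.size());
    const float threshold = max_l - n * sigma;
    for (float & l : logits) {
        if (l < threshold) {
            l = -INFINITY; // excluded from sampling
        }
    }
}
```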
… (Command R7B & DeepSeek R1) unless `--reasoning-format none` (ggml-org#11607)

* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B

* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template

* tool-calls: accommodate the variety of wrong tool-call opening tags that both the R1 Qwen 32B and 7B distills like to spit out

* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability

* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
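A hedged sketch of the splitting idea (the tag names and happy-path logic are illustrative; the real parser also tolerates the malformed opening tags mentioned above):

```cpp
#include <string>

// Split DeepSeek-R1-style output into the thought block and the visible
// answer: "<think>...</think>rest" -> { reasoning_content, content }.
struct parsed_msg {
    std::string reasoning_content;
    std::string content;
};

static parsed_msg split_reasoning(const std::string & text) {
    parsed_msg out;
    const std::string open  = "<think>";
    const std::string close = "</think>";
    const size_t b = text.find(open);
    const size_t e = text.find(close);
    if (b != std::string::npos && e != std::string::npos && e > b) {
        out.reasoning_content = text.substr(b + open.size(), e - (b + open.size()));
        out.content           = text.substr(e + close.size());
    } else {
        out.content = text; // no thought block found
    }
    return out;
}
```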
* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: Remove workaround in PR ggml-org#10042

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
This commit adds a new option `--completion-bash` to llama.cpp which
outputs a sourceable bash completion script.

The motivation for this change is to provide a more user-friendly
experience for users of llama.cpp's command-line interface.

The completion is currently basic and all options are offered for all
llama executables, but this can be improved in the future if needed.

Example usage:
```console
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash

$ ./build/bin/llama-server --m<TAB>
--main-gpu         --mirostat         --mirostat-lr      --model            --multiline-input
--min-p            --mirostat-ent     --mlock            --model-url
```
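As a hedged sketch of what such a generator can emit (the function name and registration list are illustrative, not the actual script llama.cpp produces), the printed stanza uses bash's standard compgen/complete machinery:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Print a sourceable completion stanza: a shell function that offers every
// known flag, registered for the llama executables via `complete -F`.
static void print_bash_completion(const std::vector<std::string> & opts) {
    std::printf("_llama_completions() {\n");
    std::printf("    local cur=\"${COMP_WORDS[COMP_CWORD]}\"\n");
    std::printf("    COMPREPLY=( $(compgen -W \"");
    for (const std::string & o : opts) {
        std::printf("%s ", o.c_str());
    }
    std::printf("\" -- \"$cur\") )\n");
    std::printf("}\n");
    std::printf("complete -F _llama_completions llama-cli llama-server\n");
}
```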
The call was updated to match the tool used in the output just below, following the example in ggml-org#9639.
ggml-org#11832)

* llama-bench : fix unexpected global-variable initialization-order issue

* Update examples/llama-bench/llama-bench.cpp

---------

Co-authored-by: Diego Devesa <[email protected]>
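The usual remedy for this class of bug, sketched here in hedged form (names are illustrative, not the llama-bench change itself), is to replace an order-dependent namespace-scope global with a function-local static, which C++ guarantees is constructed on first use:

```cpp
#include <string>
#include <vector>

// A namespace-scope global whose initializer reads another global has
// unspecified construction order across translation units. A function-local
// static is initialized the first time the function runs, so it is always
// ready when accessed.
static const std::vector<std::string> & supported_backends() {
    static const std::vector<std::string> backends = { "CPU", "CUDA", "Vulkan" };
    return backends;
}
// callers use supported_backends() instead of touching a global object
```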
* mm subgroup size

* upload vulkan x86 builds
l3utterfly merged commit e9d8bb1 into layla-build Feb 14, 2025
52 of 53 checks passed
github-actions bot added the documentation, Nvidia GPU, and Vulkan labels Feb 14, 2025