forked from ggml-org/llama.cpp
merge upstream #53
Merged
Conversation
…-org#11740)

* server : use common_token_to_piece instead of common_detokenize

  This commit replaces the call to common_detokenize with common_token_to_piece in populate_token_probs. The motivation for this change is to avoid an issue where common_detokenize would remove the word-boundary character from tokens, which caused a regression in the server-generated token probabilities.

  Resolves: ggml-org#11728

* squash! server : use common_token_to_piece instead of common_detokenize

  Use common_token_to_piece for post_sampling_probs as well.
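A minimal sketch of the difference the commit above describes, assuming the helper signatures from llama.cpp's common library; this is not the actual server code, just an illustration of why converting a single token directly preserves its word-boundary prefix.

```cpp
// Illustrative sketch only (not the server code from the PR).
// Assumes the llama.cpp common helpers common_token_to_piece() and common_detokenize().
#include <string>
#include <vector>
#include "common.h"

// Before: detokenizing a single-token vector may strip the leading word-boundary
// character (e.g. the SentencePiece "▁"), distorting per-token probability output.
std::string piece_via_detokenize(llama_context * ctx, llama_token tok) {
    return common_detokenize(ctx, std::vector<llama_token>{tok});
}

// After: converting the token directly returns the raw piece, including the
// word-boundary prefix.
std::string piece_via_token_to_piece(llama_context * ctx, llama_token tok) {
    return common_token_to_piece(ctx, tok);
}
```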
…yValueEx (ggml-org#11803)

* Fix ggml-org#11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx
* Fix ggml-org#11802: PR ggml-org#11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
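A hedged sketch of the Windows API issue the commit above refers to: pairing the ANSI entry point RegQueryValueExA with the TEXT() macro breaks under UNICODE builds, because TEXT() then expands to a wide-string literal. The value name below is a hypothetical placeholder, not taken from the PR.

```cpp
// Illustrative sketch only, not the code from the PR.
#ifdef _WIN32
#include <windows.h>

DWORD read_dword_value(HKEY key) {
    DWORD value = 0;
    DWORD size  = sizeof(value);
    // RegQueryValueExA expects an ANSI (char*) value name. Under a UNICODE build,
    // TEXT("SomeValue") would expand to a wide string and fail to compile, so the
    // name is passed as a plain ANSI literal instead.
    if (RegQueryValueExA(key, "SomeValue", nullptr, nullptr,
                         reinterpret_cast<LPBYTE>(&value), &size) != ERROR_SUCCESS) {
        return 0;
    }
    return value;
}
#endif
```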
Signed-off-by: Weizhao Ouyang <[email protected]>
* Bug fix for clamp_f32

  When using tensors larger than 1D, the clamp operation does not work because of the restriction of returning early when ith is not 0.

* Bug fix for clamp_f32
* Bug fix for clamp_f32
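A hedged sketch of the threading pattern the commit above alludes to, not the backend's actual implementation: in ggml-style kernels each thread (index ith of nth) owns a slice of rows, so bailing out when ith != 0 leaves the other threads' rows untouched once the tensor has more than one row.

```cpp
// Illustrative sketch only; names and layout are assumptions, not the PR code.
#include <algorithm>
#include <cstdint>

static void clamp_f32_rows(const float * src, float * dst,
                           int64_t nrows, int64_t ncols,
                           float lo, float hi,
                           int ith, int nth) {
    // Buggy variant: if (ith != 0) return;  // only thread 0 would clamp anything
    for (int64_t r = ith; r < nrows; r += nth) {  // every thread clamps its own rows
        for (int64_t c = 0; c < ncols; ++c) {
            dst[r*ncols + c] = std::clamp(src[r*ncols + c], lo, hi);
        }
    }
}
```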
…rg#11814)

* All messages get the copy button
* Update index.html.gz
* ggml : x2 speed for WASM by optimizing SIMD
* fix bad merging
* rm trailing spaces
* rm redundant clamp
* better quantize_row_q8_K (Co-authored-by: camel-cdr <[email protected]>)
* remove memset that causes buffer overflow (Co-authored-by: camel-cdr <[email protected]>)

---------

Co-authored-by: camel-cdr <[email protected]>
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata

  parallelize src1 quantization by column to allow parallelization even when there is only one row

* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
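An illustrative sketch of chunked work distribution through a shared counter, the general technique the commit above describes for mul_mat_id; this is not the ggml-cpu code itself, and the names are assumptions.

```cpp
// Illustrative sketch only.
#include <atomic>
#include <cstdint>

struct chunk_state {
    // In ggml this counter would live in shared scratch memory (wdata); it must be
    // initialized even when only one thread runs, as the commit's fix notes.
    std::atomic<int64_t> next_chunk{0};
};

static void worker(chunk_state & st, int64_t n_chunks,
                   void (*process_chunk)(int64_t chunk)) {
    // Each thread repeatedly claims the next unprocessed chunk until none remain,
    // which balances load even when chunks take uneven amounts of time.
    for (;;) {
        const int64_t chunk = st.next_chunk.fetch_add(1, std::memory_order_relaxed);
        if (chunk >= n_chunks) {
            break;
        }
        process_chunk(chunk);
    }
}
```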
This commit updates the comment in llama_kv_cache.h to reflect the change of the function name from llama_decode_internal to llama_decode_impl.
There was a typo-like error which would print the same number twice if a request is received with n_predict greater than the server-side configuration.

Before the fix:

```
slot launch_slot_: id 0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:

```
slot launch_slot_: id 0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
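A tiny sketch of the class of fix described above; the function and variable names are illustrative, not the actual server code. The point is to capture the requested value before clamping so the log line can show both the request and the configured limit.

```cpp
// Illustrative sketch only.
#include <algorithm>
#include <cstdio>

void apply_n_predict(int & n_predict, int n_predict_max) {
    const int requested = n_predict;              // keep the original request
    n_predict = std::min(n_predict, n_predict_max);
    if (requested > n_predict_max) {
        // Before the fix the clamped value was printed twice; log the request instead.
        std::printf("n_predict = %d exceeds server configuration, setting to %d\n",
                    requested, n_predict_max);
    }
}
```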
* initial sampling changes:
* completed top nsigma sampler implementation
* apply parameter to only llama-cli
* updated readme
* added tests and fixed nsigma impl
* cleaned up pr
* format
* format
* format
* removed commented tests
* cleanup pr and remove explicit floats
* added top-k sampler to improve performance
* changed sigma to float
* fixed string format to float
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update common/sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* Update src/llama-sampling.cpp (Co-authored-by: Georgi Gerganov <[email protected]>)
* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <[email protected]>
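A hedged sketch of the top-nσ filtering idea behind the sampler added above: keep only tokens whose logit is within n standard deviations of the maximum logit. The real sampler lives in src/llama-sampling.cpp and is structured differently (and, as the commit notes, applies top-k first for performance); the code below is only an illustration.

```cpp
// Illustrative sketch only.
#include <algorithm>
#include <cmath>
#include <vector>

struct token_logit { int id; float logit; };

// Keep only tokens whose logit is within n standard deviations of the maximum logit.
std::vector<token_logit> top_n_sigma(std::vector<token_logit> cand, float n) {
    if (cand.empty()) return cand;
    float max_l = cand[0].logit;
    float mean  = 0.0f;
    for (const auto & c : cand) {
        max_l = std::max(max_l, c.logit);
        mean += c.logit;
    }
    mean /= (float) cand.size();
    float var = 0.0f;
    for (const auto & c : cand) {
        var += (c.logit - mean) * (c.logit - mean);
    }
    const float sigma = std::sqrt(var / (float) cand.size());

    std::vector<token_logit> kept;
    for (const auto & c : cand) {
        if (c.logit >= max_l - n * sigma) kept.push_back(c);
    }
    return kept;
}
```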
… (Command R7B & DeepSeek R1) unless `--reasoning-format none` (ggml-org#11607)

* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B
* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template
* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out
* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability
* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <[email protected]>

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* musa: Update MUSA SDK version to rc3.1.1

  Signed-off-by: Xiaodong Ye <[email protected]>

* musa: Remove workaround in PR ggml-org#10042

  Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
This commit adds a new option `--completion-bash` to llama.cpp which outputs a source-able bash completion script.

The motivation for this change is to provide a more user-friendly experience for users of the llama.cpp command-line interface. The completion is currently basic and all options are displayed for all llama executables, but this can be improved in the future if needed.

Example usage:

```console
$ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
$ source ~/.llama-completion.bash
$ ./build/bin/llama-server --m<TAB>
--main-gpu --mirostat --mirostat-lr --model --multiline-input
--min-p --mirostat-ent --mlock --model-url
```
Call updated to match the tool used in the output just below, following the example in ggml-org#9639
ggml-org#11832)

* llama-bench : fix unexpected global variable initialize sequence issue
* Update examples/llama-bench/llama-bench.cpp

---------

Co-authored-by: Diego Devesa <[email protected]>
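A general sketch of the class of bug named above (static initialization order), not the llama-bench code: a global whose initializer reads another global defined in a different translation unit may run before that other global has been constructed. One common fix is to defer construction to first use via a function-local static; whether the PR uses this exact pattern is an assumption.

```cpp
// Illustrative sketch only; values and names are hypothetical.
#include <string>

// Fragile: if this initializer runs before `default_model` (a global defined in another
// translation unit) has been constructed, it copies an empty value.
//   static std::string current_model = default_model;

// Safer: the function-local static is constructed the first time the function is
// called, so it is always initialized before it is read.
static const std::string & default_model() {
    static const std::string value = "model.gguf";
    return value;
}
```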
* mm subgroup size
* upload vulkan x86 builds
Labels: devops, documentation, examples, ggml, Nvidia GPU, python, script, server, testing, Vulkan