Releases: LostRuins/koboldcpp

koboldcpp-1.82.4

18 Jan 08:20

Old kobo yells at cloud edition

  • NEW: Added OuteTTS for Text-To-Speech: OuteTTS is a text-to-speech model that can be used for narration by generating audio from KoboldCpp.
    • You need two models, an OuteTTS GGUF and a WavTokenizer GGUF which you can find here.
    • Once downloaded, load them in the Audio tab or with --ttsmodel and --ttswavtokenizer. You can also use --ttsgpu to load them on the GPU instead, and --ttsthreads to set a custom thread count.
    • When enabled, KoboldCpp sets up OpenAI Speech API and XTTS API compatibility endpoints, allowing you to easily hook KoboldCpp TTS into existing TTS frontends.
    • Comes with a set of included voices, as well as New Speaker Synthesis, allowing you to create hundreds of new unique voices just by entering a random name. Read more here.
    • All OuteTTS GGUF v0.2 and the NEW v0.3 models are supported, including both 500m and 1B models.
    • Credits to @ggerganov and @edwko for the original upstream implementation
  • NEW: Bundled GGUF file analyzer: In the GUI Extras tab, or with --analyze, you can now analyze any GGUF file, which will display the metadata and tensor names, dimensions and types within that file.
  • TAESD is now also available for SD3 and Flux! Enable with --sdvaeauto or "AutoFix VAE" in the GUI. TAESD is now compressed to fp8, making this VAE only about 3 MB in size.
  • VAE tiling for image generation can now be disabled with --sdnotile; this fixes the bleeding graphical artifacts on some cards.
  • Adjusted compatibility build targets: CLBlast (Older CPU) mode no longer requires AVX, providing a good option for very old/cheap systems to still have some level of GPU support. For users with AVX but not AVX2, you can use the Vulkan (Old CPU) mode instead.
  • mmap is no longer the default option. To enable it, you now need --usemmap or set it in the GUI.
  • Fix for save file GUI prompt not working
  • Fix for web browser not launching with --launch in Linux GUI.
  • Added more GUI slider options for context sizes.
  • Max supported images per API request for Multimodal Vision is now increased to 8.
  • Enabled multilingual support for Whisper (Voice Recognition) by setting specific language codes.
  • KoboldCpp now displays which capabilities and endpoints are enabled on launch.
    • Available Modules: TextGeneration ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer WebSearchProxy TextToSpeech ApiKeyPassword
    • Available APIs: KoboldCppApi OpenAiApi OllamaApi A1111ForgeApi ComfyUiApi WhisperTranscribeApi XttsApi OpenAiSpeechApi
  • Updated Kobold Lite, multiple fixes and improvements
    • Added Whisper language selection: Instead of automatically detecting the speaker language, you can now optionally specify it with a 2 character language code (e.g. ja for Japanese, fr for French). This ensures the output is in the right language.
    • Added Text-To-Speech support for KoboldCpp backend
  • Merged fixes and improvements from upstream
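
As a rough illustration of the OpenAI Speech compatibility endpoint mentioned above, the sketch below only builds the HTTP request and does not send it. The /v1/audio/speech path and the model/input/voice fields follow the public OpenAI Speech API. Whether KoboldCpp accepts them in exactly this form, the placeholder model name, and the speaker name "kobo" are all assumptions to verify against your own instance.

```python
import json
import urllib.request


def build_speech_request(text: str, voice: str = "kobo",
                         base_url: str = "http://localhost:5001") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style text-to-speech request."""
    payload = {
        "model": "tts-1",  # placeholder name from OpenAI's API; a local backend may ignore it (assumption)
        "input": text,     # the text to narrate
        "voice": voice,    # hypothetical speaker name
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_speech_request("Hello from Kobo.")
print(req.get_method(), req.full_url)
```

With a TTS model loaded, passing the request to urllib.request.urlopen(req) and writing the response bytes to a file should yield the audio, if the endpoint behaves like the OpenAI one.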

Hotfix 1.82.1: Fixed --analyze, which should be working correctly now. Minor fixes to OuteTTS v0.3 handling and updated Lite UI. Whisper now accepts 8-bit and 32-bit wav files, and form data input.
Hotfix 1.82.2: Added support for Deepseek R1 Qwen Distill
Hotfix 1.82.3: Fixed a TTS crash, CLBlast mislabeling, quiet now overrides debug
Hotfix 1.82.4: Fixed deepseek adapter, draft decoding now accepts slightly different vocabs

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you have an Nvidia GPU, but use an old CPU and koboldcpp.exe does not work, try koboldcpp_oldcpu.exe
If you have a newer Nvidia GPU, you can use the CUDA 12 version koboldcpp_cu12.exe (much larger, slightly faster).
If you're using Linux, select the appropriate Linux binary file instead (not exe).
If you're on a modern MacOS (M1, M2, M3) you can try the koboldcpp-mac-arm64 MacOS binary.
If you're using AMD, we recommend trying the Vulkan option (available in all releases) first, for best support. Alternatively, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
Once loaded, you can connect at this address (or use the full KoboldAI client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag. You can also refer to the readme and the wiki.

koboldcpp-1.81.1

04 Jan 14:19

New year, New Kobo edition

  • NEW: Added WebSearch functionality: When enabled, KoboldCpp now optionally functions as a WebSearch proxy with a new /api/extra/websearch endpoint, allowing your queries to be augmented with web searches! Works with all models, but needs to be enabled both in Lite and in KoboldCpp (with --websearch or in the GUI). The websearch is executed locally from the KoboldCpp instance, and is powered by DuckDuckGo.
  • NEW: Heuristic chat templates: Added AutoGuess.json bundled chat completions adapter. When this is selected, KoboldCpp will try to heuristically infer the correct instruct template to be used for the chat completions endpoint, based on the detected Jinja template from the model. (Thanks to @kallewoof)
  • Fixed issues with building quantization tools
  • Compilation changes on Windows to unify the windows and linux build flags: Now requires specifying desired build targets on Windows, similar to linux. For example, to do a full nocuda build on windows you now need make LLAMA_PORTABLE=1 LLAMA_VULKAN=1 LLAMA_CLBLAST=1, where previously you would just do make.
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: TextDB Document Lookup - This is a very rudimentary form of browser-based RAG. You can access it from the Context > TextDB tab. It's powered by a text-based minisearch engine: you can paste a very large text document, which is chunked and stored in the database, and at runtime it will find relevant snippets to add to the context depending on the query/instruction you send to the AI. You can use the historical context as a document, or paste a custom text document to use. Note that this is NOT an embedding model; it uses lunr and minisearch for retrieval scoring instead. (Credits to @esolithe)
    • Increased max supported browser save size by switching to use indexedDb for autosaves and slots. Your existing localStorage autosave and saveslot data will be automatically converted and migrated over when you launch the new version. Note that you will not be able to access this new data from older versions of KoboldAI Lite anymore. Downloaded .json savefiles will continue to be accessible in all versions.
    • Allowed more resolutions and aspect ratios for generated and uploaded images
    • Improved quality of multimodal image handling, can upload and recognize larger and more detailed images now. Multimodal should work nicely on a typical screenshot.
  • Merged fixes and improvements from upstream, with Vulkan improvements and new model support.
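
The TextDB idea above (chunk a document, then pull keyword-matched snippets into the context) can be sketched in a few lines. This is only an illustration of the general technique: the scoring here is a plain term count standing in for lunr/minisearch, and the function names and chunk size are made up for the example.

```python
import re
from collections import Counter
from typing import List


def chunk_text(text: str, chunk_words: int = 40) -> List[str]:
    """Split a document into chunks of at most chunk_words words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_snippets(chunks: List[str], query: str, limit: int = 2) -> List[str]:
    """Return up to `limit` chunks that share the most terms with the query."""
    query_terms = set(tokenize(query))
    scored = []
    for position, chunk in enumerate(chunks):
        counts = Counter(tokenize(chunk))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, -position, chunk))
    scored.sort(reverse=True)  # highest score first; earlier chunks win ties
    return [chunk for _, _, chunk in scored[:limit]]


chunks = [
    "the kobold guards the gate",
    "dragons hoard gold",
    "a kobold and a dragon share gold",
]
print(top_snippets(chunks, "kobold gold"))
```

No embedding model is involved, which matches the note above: retrieval quality depends entirely on word overlap between the query and the stored chunks.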

Hotfix 1.81.1 - Fixed nocertify mode for websearch, fixed aesthetic ui being broken.

koboldcpp-1.80.3

20 Dec 05:21

End of the year edition

  • NEW: Added support for image Multimodal with Qwen2-VL! You can grab the quantized mmproj here for the 2B and 7B models, and then grab the 2B or 7B Instruct models from Bartowski.
  • NEW: Vulkan now has coopmat1 support, making it significantly faster on modern Nvidia cards (credits @0cc4m)
  • Added a few new QoL flags:
    • --moeexperts - Overwrite the number of experts to use in MoE models
    • --failsafe - A proper way to set failsafe mode, which disables all CPU intrinsics and GPU usage.
    • --draftgpulayers - Set number of layers to offload for speculative decoding draft model
    • --draftgpusplit - GPU layer distribution ratio for draft model (default=same as main). Only works if using multi-GPUs.
  • Fixes for buggy tkinter GUI launcher window in Linux (thanks @henk717)
  • Restored support for ARM quants in Kobold (e.g. Q4_0_4_4), but you should consider switching to q4_0 eventually.
  • Fixed a bug that caused context corruption when aborting a generation while halfway processing a prompt
  • Added new field suppress_non_speech to Whisper allowing banning "noise annotation" logits (e.g. Barking, Doorbell, Chime, Muzak)
  • Improved compile flags on ARM, self-compiled builds now use correct native flags and should be significantly faster (tested on Pi and Termux). Simply run make for native ARM builds, or make LLAMA_PORTABLE=1 for a slower portable build.
  • trim_stop now defaults to true (output will no longer contain stop sequence by default)
  • Debug mode now shows drafted tokens and, when enabled, allows incompatible vocabs for speculative decoding (not recommended)
  • Handle more generation parameters in ollama API emulation
  • Handle pyinstaller temp paths for chat adapters when saving a kcpps config file
  • Default image gen sampler set to Euler
  • MMQ is now the default for CLI as well. Use nommq flag to disable (e.g. --usecublas all nommq). Old flags still work.
  • Upgrade build to use C++17
  • Always use PCI Bus ID order for CUDA GPU listing consistency (match nvidia-smi)
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Added LaTeX rendering together with markdown. Uses standard \[...\] \(...\) and $$...$$ syntax.
    • You can now manually upload an audio file to transcribe in settings.
    • Better regex to trigger image generation
    • Aesthetic UI fixes
    • Added q as an alias to query for direct URL querying (e.g. http://localhost:5001?q=what+is+love)
    • Added support for AllTalk v2 API. AllTalk v1 is still supported automatically (credits @erew123)
    • Added support for Mantella XTTS (XTTS fork)
    • Toggle to disable "non-speech" whisper output (see above)
    • Consolidated Instruct templates (Mistral V3 merged to V7)
  • Merged fixes and improvements from upstream
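
The q alias for direct URL querying can be exercised from a script by URL-encoding the question. The sketch below reproduces the example URL from the list above; the helper name is made up for illustration, and the default address is the one used throughout these notes.

```python
from urllib.parse import urlencode


def lite_query_url(question: str, base_url: str = "http://localhost:5001") -> str:
    """Build a Kobold Lite URL that pre-fills a query via the q alias."""
    # urlencode escapes the value and turns spaces into '+'
    return f"{base_url}?{urlencode({'q': question})}"


print(lite_query_url("what is love"))
```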

Hotfix 1.80.1 - Fixed macOS and vulkan clip for qwen2-vl
Hotfix 1.80.2 - Fixed drafting EOS issue
Hotfix 1.80.3 - Fixed clblast oldcpu not getting set correctly

koboldcpp-1.79.1

30 Nov 12:59

One Kobo To Rule Them All Edition

  • NEW: Added Multiplayer Support: You can now enable Multiplayer mode on your KoboldCpp instances! Enable it with the --multiplayer flag or in the GUI launcher Network tab. Then connect with your browser, open KoboldAI Lite and click the "Join Multiplayer" button.
    • Multiplayer allows multiple users to view and edit a KoboldAI Lite session, live at the same time! You can take turns to chat with the AI together, host a shared adventure or collaborate on a shared story, which is automatically synced between all participants.
    • Multiplayer mode also gives you an easy way to sync a story/session across several of your devices over the network. You can treat it like a temporary online save file.
    • To prevent conflicts when two users edit text simultaneously, observe the (Idle) or (Busy) indicator at the top right corner.
    • Multiplayer utilizes the new endpoints /api/extra/multiplayer/status, /api/extra/multiplayer/getstory and /api/extra/multiplayer/setstory; however, these are intended only for internal use in Kobold Lite and not for third-party integration.
  • NEW: Added Ollama API Emulation: Adds Ollama compatible endpoints /api/chat and /api/generate which provide basic Ollama API emulation. Streaming is not supported. This will allow you to use KoboldCpp to try out amateur 3rd party tools that only support the Ollama API. Simply point that tool to KoboldCpp (at http://localhost:5001 by default, but you may also need to run KoboldCpp on port 11434 for some exceptionally poorly written tools) and connect normally. If the tool you want to use supports OpenAI API, you're strongly encouraged to use that instead. Here's a sample tool to verify it works. All other KoboldCpp endpoints remain functional and all of them can run at the same time.
  • NEW: Added ComfyUI Emulation: Likewise, a new endpoint at /prompt emulates a ComfyUI backend, allowing you to use tools that require the ComfyUI API but lack A1111 API support. Right now only txt2img is supported.
  • NEW: Speculative Decoding (Drafting) is now added: You can specify a second lightweight text model with the same vocab to perform speculative decoding, which can offer a speedup in some cases.
    • The small model drafts tokens which the large model evaluates and accepts/rejects. Output should match the large model's quality.
    • Not well supported on Vulkan, will likely be slower.
    • Only works well for low temperatures, generally worse for creative writing.
  • Added /props endpoint, which provides instruction/chat template data from the model (thanks @kallewoof)
  • Added /api/extra/detokenize endpoint, which allows converting an array of token IDs into a detokenized string.
  • Added chunked encoding support (thanks @mkarr)
  • Added Version metadata info tags on Windows .exe binaries.
  • Restored compatibility support for old Mixtral GGUF models. You should still update them.
  • Bugfix for Grammar not being reset, Bugfix for Qwen2.5 missing some UTF-8 characters when streaming.
  • GGUF format text encoders (clip/t5) are now supported for Flux and SD3.5
  • Updated Kobold Lite, multiple fixes and improvements
    • Multiplayer mode support added
    • Added a new toggle switch to Adventure mode, "Dice Action", which allows the AI to roll a die to determine the outcome of an action.
    • Allow disabling sentence trimming in all modes now.
    • Removed some legacy unused features such as pseudostreaming.
  • Merged fixes and improvements from upstream, including some nice Vulkan speedups and enhancements
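
To make the draft/verify loop behind speculative decoding concrete, here is a toy greedy version. It is a sketch of the general technique, not KoboldCpp's implementation: the two "models" are plain functions over token ids, and verification is done one token at a time where a real backend would check the whole draft in a single batched pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to the next token id


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, draft_len: int = 4) -> List[int]:
    """Greedy speculative decoding: the output always equals the target's own greedy output."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # The cheap draft model guesses a short run of tokens ahead.
        proposal: List[int] = []
        for _ in range(min(draft_len, goal - len(out))):
            proposal.append(draft(out + proposal))
        # The target model checks each guess in order.
        for guess in proposal:
            correct = target(out)
            out.append(correct)
            if correct != guess:
                break  # first mismatch: discard the rest of the draft
    return out[len(prompt):]


def target_model(ctx: List[int]) -> int:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except on multiples of 3."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt % 3 == 0 else nxt


print(speculative_decode(target_model, draft_model, [0], max_new=6))
```

The speedup comes from accepted guesses: every drafted token the target agrees with is one the large model did not have to generate on its own step. When the draft is often wrong, the extra work is wasted, which fits the note above about low temperatures working best.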

Hotfix 1.79.1: Fixed a bug that affected image model loading.

koboldcpp-1.78

16 Nov 02:15

  • NEW: Added support for Flux and Stable Diffusion 3.5 models: Image generation has been updated with new arch support (thanks to stable-diffusion.cpp) with additional enhancements. You can use either fp16 or fp8 safetensor models, or the GGUF models. Supports all-in-one models (bundled T5XXL, Clip-L/G, VAE) or loading them individually.
  • Debug mode prints penalties for XTC
  • Added a new flag --nofastforward, this forces full prompt reprocessing on every request. It can potentially give more repeatable/reliable/consistent results in some cases.
  • CLBlast support is still retained, but has been further downgraded to "compatibility mode" and is no longer recommended (use Vulkan instead). CLBlast GPU offload must now maintain a duplicate copy of the layers in RAM as well, as it now piggybacks off the CPU backend.
  • Added common identity provider /.well-known/serviceinfo (see Haidra-Org/AI-Horde#466, aphrodite-engine/aphrodite-engine#807, theroyallab/tabbyAPI#232)
  • Reverted some changes that reduced speed in HIPBLAS.
  • Fixed a bug where bad logprobs JSON was output when logits were -Infinity
  • Updated Kobold Lite, multiple fixes and improvements
    • Added support for custom CSS styles
    • Added support for generating larger images (select BigSquare in image gen settings)
    • Fixed some streaming issues when connecting to Tabby backend
    • Better world info length limiting (capped at 50% of max context before appending to memory)
    • Added support for Clip Skip for local image generation.
  • Merged fixes and improvements from upstream
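
For context on the logprobs bug fixed above: negative infinity has no representation in standard JSON, so a serializer that writes it out verbatim produces output that strict parsers reject. The Python sketch below shows the problem and one possible workaround (clamping to a finite floor). The floor value and helper name are assumptions for illustration; the notes do not say how KoboldCpp itself handles it.

```python
import json
from typing import Dict

NEG_INF = float("-inf")


def clamp_logprobs(logprobs: Dict[str, float], floor: float = -9999.0) -> Dict[str, float]:
    """Replace -inf log-probabilities with a large finite negative value."""
    return {tok: (floor if lp == NEG_INF else lp) for tok, lp in logprobs.items()}


raw = {"yes": -0.1, "no": NEG_INF}

# Python's default encoder emits the non-standard literal -Infinity.
print(json.dumps(raw))

# With allow_nan=False the encoder refuses instead, so clamp first.
print(json.dumps(clamp_logprobs(raw), allow_nan=False))
```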

koboldcpp-1.77

01 Nov 16:32

the road not taken edition

  • NEW: Token Probabilities (logprobs) are now available over the API! They are currently only included in responses from the sync (non-streaming) API, but a dedicated /api/extra/last_logprobs endpoint is also provided. If "logprobs" is enabled in KoboldAI Lite settings, Lite will provide a link to view alternate token probabilities for both streaming and non-streaming responses. This will also work in SillyTavern when streaming is disabled, once the latest build is out.
  • Response prompt_tokens, completion_tokens and total_tokens are now accurate values instead of placeholders.
  • Enabled CUDA graphs for the cuda12 build, which can improve performance on some cards.
  • Fixed a bug where .wav audio files uploaded directly to the /v1/audio/transcriptions endpoint got fragmented and cut off early. Audio sent as base64 within JSON payloads is unaffected.
  • Fixed a bug where Whisper transcription blocked generation in non-multiuser mode.
  • Fixed a bug where trim_stop did not remove a stop sequence that was divided across multiple tokens in some cases.
  • Significantly increased the maximum limits for stop sequences, anti-slop token bans, logit biases and DRY sequence breakers (thanks to @mayaeary for the PR which changes the way some parameters are passed to the CPP side)
  • Added link to help page if user fails to select a model.
  • Flash Attention GUI quick launcher toggle hidden by default if Vulkan is selected (usually reduced performance).
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Experimental ComfyUI Support Added!: ComfyUI can now be used as an image generation backend API from within KoboldAI Lite. No workflow customization is necessary. Note: ComfyUI must be launched with the flags --listen --enable-cors-header '*' to enable API access. Then you may use it normally like any other Image Gen backend.
    • Clarified the option for selecting A1111/Forge/KoboldCpp as an image gen backend, since Forge is gradually superseding A1111. This option is compatible with all 3 of the above.
    • You are now able to generate images from instruct mode via natural language, similar to ChatGPT (e.g. Please generate an image of a bag of sand). This option requires having an image model loaded. It uses regex and is enabled by default, but can be disabled in settings.
    • Added support for Tavern "V3" character cards: Actually, V3 is not a real format, it's an augmented V2 card used by Risu that adds additional metadata chunks. These chunks are not supported in Lite, but the base "V2" card functionality will work.
    • Added new scenario "Interactive Storywriter": This is similar to story writing mode, but allows you to secretly steer the story with hidden instruction prompts.
    • Added Token Probability Viewer - You can now see a table of alternative token probabilities in responses. Disabled by default, enable in advanced settings.
    • Fixed JSON file selection problems in some mobile browsers.
    • Fixed Aetherroom importer.
    • Minor Corpo UI layout tweaks by @Ace-Lite
  • Merged fixes and improvements from upstream
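
The trim_stop fix above concerns stop sequences that arrive split across several tokens. A minimal sketch of why matching must happen on the joined text rather than on individual tokens follows; the function below is an illustration written for these notes, not KoboldCpp's code.

```python
from typing import List


def trim_stop(pieces: List[str], stop_sequences: List[str]) -> str:
    """Join streamed token pieces and cut the text at the earliest stop sequence."""
    text = "".join(pieces)
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop) if stop else -1
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# The stop sequence "\n### User:" is spread over two token pieces here,
# so checking each piece on its own would never find it.
pieces = ["Hello", "\n##", "# User:"]
print(repr(trim_stop(pieces, ["\n### User:"])))
```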

koboldcpp-1.76

11 Oct 13:06

shivers down your spine edition

  • NEW: Added Anti-Slop Sampling (Phrase Banning) - You can now provide a list of words or phrases that are prevented from being generated, by backtracking and regenerating when they appear. This capability has been merged into the existing token banning feature. It's now also aliased into the banned_strings field.
    • Note: When using Anti-Slop phrase banning, streaming outputs are slightly delayed - this is to allow space for the AI to backtrack a response if necessary. This delay is proportional to the length of the longest banned slop phrase.
    • Up to 48 phrase banning sequences can be used; they are not case sensitive.
  • The /api/extra/perf/ endpoint now includes whether the instance was launched in quiet mode (terminal outputs). Note that this is not foolproof - instances can be running modified versions of KoboldCpp.
  • Added timestamp information when each request starts.
  • Increased some limits for number of stop sequences, logit biases, and banned phrases.
  • Fixed a GUI launcher bug when a changed backend dropdown was overridden by a CLI flag.
  • Updated Kobold Lite, multiple fixes and improvements
    • NEW: Added a new scenario - Roleplay Character Creator. This Kobold Lite scenario presents users with an easy-to-use wizard for creating their own roleplay bots with the Aesthetic UI. Simply fill in the requested fields and you're good to go. The character can always be edited subsequently from the 'Context' menu. Alternatively, you can also load a pre-existing Tavern Character Card.
    • Updated token banning settings to include Phrase Banning (Anti-Slop).
    • Minor fixes and tweaks
  • Merged fixes and improvements from upstream
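
The streaming delay described above follows from a simple constraint: text can only be shown once no banned phrase could still begin inside it. The sketch below illustrates that hold-back logic and the case-insensitive matching. It is a simplified model written for these notes, with made-up function names, and it does not perform the actual backtracking and regeneration.

```python
from typing import List, Optional


def find_banned(text: str, phrases: List[str]) -> Optional[int]:
    """Return the index where the earliest banned phrase starts, ignoring case."""
    lowered = text.lower()  # index math assumes lowercasing keeps length (true for ASCII)
    hits = [lowered.find(p.lower()) for p in phrases if p]
    hits = [h for h in hits if h != -1]
    return min(hits) if hits else None


def streamable_prefix(text: str, phrases: List[str]) -> str:
    """Return the part of the generated text that is safe to show to the user."""
    hit = find_banned(text, phrases)
    if hit is not None:
        return text[:hit]  # everything from the phrase onward must be regenerated
    longest = max((len(p) for p in phrases), default=0)
    hold_back = max(longest - 1, 0)  # a phrase could still be forming in this tail
    return text[:max(len(text) - hold_back, 0)]


banned = ["shivers down your spine"]
print(repr(streamable_prefix("She felt Shivers Down Your Spine.", banned)))
```

The hold-back is one character shorter than the longest phrase, which is why the delay grows with the longest banned phrase.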

koboldcpp-1.75.2

21 Sep 08:01

Nothing lasts forever edition

  • Important: When running from command line, if no backend was explicitly selected (--use...), a GPU backend is now auto selected by default if available. This can be overridden by picking a specific backend (eg. --usecpu, --usevulkan, --usecublas). As a result, dragging and dropping a gguf model onto the koboldcpp.exe executable will allow it to be launched with GPU and gpulayers auto configured.
  • Important: OpenBLAS backend has been removed, and unified with the NoBLAS backend, to form a single Use CPU option. This utilizes the sgemm functionality that llamafile upstreamed, so processing speeds should still be comparable. --noblas flag is also deprecated, instead CPU Mode can be enabled with the --usecpu flag.
  • Added support for RWKV v6 models (context shifting not supported)
  • Added a new flag --showgui that allows the GUI to be shown even when command line flags are used. Instead, command line flags will get imported into the GUI itself, allowing them to be modified. This also works with .kcpps config files.
  • Added a warning display when loading legacy GGML models
  • Fix for DRY sampler occasionally segfaulting on bad unicode input.
  • Embedded Horde workers now work with password protected instances.
  • Updated Kobold Lite, multiple fixes and improvements
    • Added first-start welcome screen, to pick a starting UI Theme
    • Added support for OpenAI-Compatible TTS endpoints
    • Added a preview option for alternate greetings within a V2 Tavern character card.
    • Now works with Kobold API backends with gated model lists e.g. Tabby
    • Added display-only regex replacement, allowing you to hide or replace displayed text while keeping the original used with the AI in context.
    • Added a new Instruct scenario to mimic CoT Reflection (Thinking)
    • Sampler presets now reset seed, but no longer reset generation amount setting.
    • Markdown parser fixes
    • Added system role for Metharme instruct format
    • Added a toggle for chat name format matching, allowing matching any name or only predefined names.
    • Fixed markdown image scaling
  • Merged fixes and improvements from upstream

Hotfix 1.75.1: Auto backend selection and clblast fixes
Hotfix 1.75.2: Fixed RWKV, modified mistral templates

koboldcpp-1.74

31 Aug 03:41

Kobo's all grown up now

  • NEW: Added XTC (Exclude Top Choices) sampler, a brand new creative writing sampler designed by the author of DRY (@p-e-w). To use it, increase xtc_probability above 0 (recommended values to try: xtc_threshold=0.15, xtc_probability=0.5)
  • Added automatic image resizing and letterboxing for llava/minicpm images, this should improve handling of oddly-sized images.
  • Added a new flag --nomodel which allows launching the Lite WebUI without loading any model at all. You can then select an external api provider like Horde, Gemini or OpenAI
  • MacOS defaults to full offload when -1 gpulayers selected
  • Minor tweaks to context shifting thresholds
  • Horde Worker now has a 5 minute timeout for each request, which should reduce the likelihood of getting stuck (e.g. internet issues). Also, horde worker now supports connecting to SSL secured Kcpp instances (remember to enable --nocertify if using self signed certs)
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes and improvements from upstream (plus Llama-3.1-Minitron-4B-Width support)
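
For readers unfamiliar with XTC, the sketch below shows the idea as it is generally described by its author: with probability xtc_probability, every token at or above xtc_threshold is removed except the least likely of them, which pushes the model away from its most predictable choices. This is an illustration over a plain dictionary of probabilities and is not taken from KoboldCpp's source.

```python
import random
from typing import Dict, Optional


def xtc_filter(probs: Dict[str, float], threshold: float = 0.15,
               probability: float = 0.5,
               rng: Optional[random.Random] = None) -> Dict[str, float]:
    """Apply an XTC-style filter to a token -> probability mapping."""
    rng = rng or random.Random()
    # Only act on a random fraction of sampling steps.
    if probability <= 0 or rng.random() >= probability:
        return dict(probs)
    # Tokens at or above the threshold are the "top choices", most likely first.
    above = sorted((t for t, p in probs.items() if p >= threshold),
                   key=probs.get, reverse=True)
    if len(above) < 2:
        return dict(probs)  # nothing to exclude without leaving a viable choice
    removed = set(above[:-1])  # keep only the least likely top choice
    kept = {t: p for t, p in probs.items() if t not in removed}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}  # renormalize


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "dog": 0.05}
print(xtc_filter(probs, threshold=0.15, probability=1.0))
```

With probability set to 1.0 the filter always fires, so "the" and "a" are dropped and the remaining mass is shared between "cat" and "dog".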

koboldcpp-1.73.1

19 Aug 08:45

  • NEW: Added dual-stack (IPv6) network support. KoboldCpp now properly runs on IPv6 networks, the same instance can serve both IPv4 and IPv6 addresses automatically on the same port. This should also fix problems with resolving localhost on some systems. Please report any issues you face.
  • NEW: Added official MacOS pyinstaller binary builds! Modern MacOS (M1, M2, M3) users can now use KoboldCpp without having to self-compile, simply download and run koboldcpp-mac-arm64. Special thanks to @henk717 for setting this up.
  • NEW: Pure CLI Mode - Added --prompt, allowing KoboldCpp to be used entirely from command-line alone. When running with --prompt, all other console outputs are suppressed, except for that prompt's response which is piped directly to stdout. You can control the output length with --promptlimit. These 2 flags can also be combined with --benchmark, allowing benchmarking with a custom prompt and returning the response. Note that this mode is only intended for quick testing and simple usage, no sampler settings will be configurable.
  • Changed the default benchmark prompt to prevent stack overflow on old bpe tokenizer.
  • Pre-filter to the top 5000 token candidates before sampling, this greatly improves sampling speed on models with massive vocab sizes with negligible response changes.
  • Moved chat completions adapter selection to Model Files tab.
  • Improve GPU layer estimation by accounting for in-use VRAM.
  • --multiuser now defaults to true. Set --multiuser 0 to disable it.
  • Updated Kobold Lite, multiple fixes and improvements
  • Merged fixes and improvements from upstream, including Minitron and MiniCPM features (note: there are some broken minitron models floating around - if stuck, try this one first!)
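
The pre-filtering step above can be pictured as a partial selection: rather than sorting the whole vocabulary, keep only the k highest-scoring candidates and let the samplers work on those. The sketch below is a generic illustration using Python's heapq, with a made-up function name; it is not the code used in KoboldCpp.

```python
import heapq
from typing import List, Tuple


def prefilter_top_k(logits: List[float], k: int = 5000) -> List[Tuple[int, float]]:
    """Return (token_id, logit) pairs for the k largest logits, highest first."""
    if k >= len(logits):
        return sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    # nlargest avoids fully sorting a vocabulary with hundreds of thousands of entries.
    return heapq.nlargest(k, enumerate(logits), key=lambda pair: pair[1])


print(prefilter_top_k([0.1, 2.0, -1.0, 1.5], k=2))
```

Tokens outside the top 5000 almost never carry meaningful probability, which is why the notes describe the effect on responses as negligible.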

Hotfix 1.73.1 - Fixed the broken DRY sampler, fixed sporadic streaming issues, added letterboxing mode for images in Lite. The previous v1.73 release was buggy, so upgrading to this patch release is strongly recommended.

To use minicpm:
