Skip to content

Releases: LostRuins/koboldcpp

koboldcpp-1.52.2

13 Dec 14:21
Compare
Choose a tag to compare

koboldcpp-1.52.2

something old, something new edition

image

  • NEW: Added a new bare-bones KoboldCpp NoScript WebUI, which does not require Javascript to work. It should be W3C HTML compliant and should run on every browser in the last 20 years, even text-based ones like Lynx (e.g. in the terminal over SSH). It is accessible by default at /noscript e.g. http://localhost:5001/noscript . This can be helpful when running KoboldCpp from systems which do not support a modern browser with Javascript.
  • Partial per-layer KV offloading is now merged for CUDA. Important: this means that the number of layers you can offload to GPU might be reduced, as each layer now takes up more space. To avoid per-layer KV offloading, use the --usecublas lowvram option (equivalent to -nkvo in llama.cpp). Fully offloaded models should behave the same as before.
  • The /api/extra/tokencount endpoint now also returns an array of token ids in the response body from the tokenizer.
  • Merged support for QWEN and Mixtral from upstream. Note: Mixtral seems to perform large batch prompt processing extremely slowly. This is probably an implementation issue. For now, you might have better luck using --noblas or setting --blasbatchsize -1 when using Mixtral
  • Selecting a .kcpps in the GUI when choosing a model will load the model specified inside that config file instead.
  • Added the Mamba Multitool script (from @henk717). This is a shell script that can be used in Linux to setup an environment with all dependencies required for building and running KoboldCpp on Linux.
  • Improved KCPP Embedded Horde Worker fault tolerance, should now gracefully backoff for increasing durations whenever encountering errors polling from AI Horde, and will automatically recover from up to 24 hours of Horde downtime.
  • Added a new parameter that shows number of Horde Worker errors in the /api/extra/perf endpoint, this can be used to monitor your embedded horde worker if it goes down.
  • Pulled other fixes and improvements from upstream, updated Kobold Lite, added asynchronous file autosaves (thanks @aleksusklim), various other improvements.

Hotfix 1.52.1: Fixed 'not enough memory' loading errors for large (20B+) models. See #563
NEW: Added Linux PyInstaller binaries

Hotfix 1.52.2: Merged fixes for Mixtral prompt processing

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.51.1

01 Dec 16:30
Compare
Choose a tag to compare

koboldcpp-1.51.1

all quiet on the kobold front edition

  • Added a new flag --quiet which allows you to suppress input and outputs from appearing in the console.
  • When context shift is enabled, allocate a small amount (about 80 tokens) of reserved space to reduce the Failed to predict errors that occur due to running out of KV cache space caused by KV cache fragmentation when shifting.
  • Auto rope scaling will not be automatically applied if the model already overrides the RoPE freq scale with a value below 1.
  • Increased the graph node limit for older models to fix AiDungeon GPT2 not working.
  • Display the available endpoint KAI and OAI URLs in the terminal on startup.
  • Updated some API examples in the documentation
  • --multiuser now accepts an extra optional parameter that indicates how many concurrent requests to allow to queue. If unset, or set to 1, it defaults to the default value of 5.
  • Pulled fixed and improvements from upstream, updated Kobold Lite, fixed Chub imports, optimized for Firefox, added multiline input in aesthetic mode, various other improvements.

1.51.1 Hotfix:

  • Reverted an upstream change that caused a CLBlast segfault that occurred when context size exceeded 2k.
  • Stripped out the OAI SSE carriage return after end message that was causing issues in Janitor.
  • Moved the 80 extra tokens allocated for handling KV fragmentation to be added on top of the specified max context length instead of subtracted from it at runtime, which could cause padding issues when counting tokens in Tavern. This means that loading --contextsize 2048 will actually allocate a size of 2128 behind the scenes for example.
  • Changed the API url printouts to include the tunnel url when using --remotetunnel

Added a linux test build provided by @henk717

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.50.1

18 Nov 08:45
Compare
Choose a tag to compare

koboldcpp-1.50.1

  • Improved automatic GPU layer selection: In the GUI launcher with CuBLAS, it will now automatically select all layers to do a full GPU offload if it thinks you have enough VRAM to support it.
  • Added a short delay to the Abort function in Lite, hopefully fixes the glitches with retry and abort.
  • Fixed automatic RoPE values for Yi and Deepseek. If no --ropeconfig is set, the preconfigured rope values in the model now take priority over the automatic context rope scale.
  • The above fix should also allow YaRN RoPE scaled models to work correctly by default, assuming the model has been correctly converted. Note: Customized YaRN configurations flags are not yet available.
  • The OpenAI compatible /v1/completions has been enhanced, adding extra unofficial parameters that Aphrodite uses, such as Min-P, Top-A and Mirostat. However, OpenAI does not support separate memory fields or sampler order, so the Kobold API will still give better results there.
  • SSE streaming support has been added for OpenAI /v1/completions endpoint (tested working in SillyTavern)
  • Custom DALL-E endpoints are now supported, for use with OAI proxies.
  • Pulled fixed and improvements from upstream, updated Kobold Lite

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Hotfix 1.50.1:

  • Fixed a regression with older RWKV/GPT-2/GPT-J/GPT-NeoX models that caused a segfault.
  • If ropeconfig is not set, apply auto linear rope scaling multiplier for rope-tuned models such as Yi when used outside their original context limit.
  • Fixed another bug in Lite with the retry/abort button.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.49

11 Nov 03:28
Compare
Choose a tag to compare

koboldcpp-1.49

  • New API feature: Split Memory - The generation payload also supports a new field memory in addition to the usual prompt field. If set, forcefully appends this string to the beginning of any submitted prompt text. If resulting context exceeds the limit, forcefully overwrites text from the beginning of the main prompt until it can fit. Useful to guarantee full memory insertion even when you cannot determine exact token count. Automatically used in Lite.
  • New API feature: trim_stop can be added to the generate payload. If true, removes detected stop_sequences from the output and truncates all text after them. Does not work with SSE streaming.
  • New API feature: --preloadstory now allows you to specify a json file (such as a story savefile) when launching the server. This file will be hosted on the server at /api/extra/preloadstory, which frontends (such as Kobold Lite) can access over the API.
  • Pulled various improvements and fixes from upstream llama.cpp
  • Updated Kobold Lite, added new TTS options and fixed some bugs with the Retry button when Aborting. Added support for World Info inject position, split memory and preloaded stories. Also added support for optional image generation using DALL-E 3 (OAI API).
  • Fixed KoboldCpp colab prebuilts crashing on some older Colab CPUs. It should now also work on A100 and V100 GPUs in addition to the free tier T4s. If it fails, try enabling the ForceRebuild checkbox. LLAMA_PORTABLE=1 makefile flag can now be used when making builds that target colab or Docker.
  • Various other minor fixes.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.48.1

04 Nov 07:50
Compare
Choose a tag to compare

koboldcpp-1.48.1

Harder Better Faster Stronger Edition

  • NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.
    • Note: Context Shifting is enabled by default, and will override smartcontext if both are enabled. Context Shifting still needs more testing. Your outputs may be different with shifting enabled, but both seem equally coherent. To disable Context Shifting, use the flag --noshift. If you observe a bug, please report and issue or send a PR fix.
  • 'Tensor Core' Changes: KoboldCpp now handles MMQ/Tensor Cores differently from upstream. Here's a breakdown:
    • old approach (everybody): if mmq is enabled, just use mmq. If cublas is enabled, just use cublas. MMQ dimensions set to "FAVOR BIG"
    • new approach (upstream llama.cpp): you cannot toggle mmq anymore. It is always enabled. MMQ dimensions set to "FAVOR SMALL". CuBLAS always kicks in if batch > 32.
    • new approach (koboldcpp): you CAN toggle MMQ. It is always enabled, until batch > 32, then CuBLAS only kicks in if MMQ flag is false, otherwise it still uses MMQ for all batches. MMQ dimensions set to "FAVOR BIG".
  • Added GPU Info Display and Auto GPU Layer Selection For Newbies - Uses a combination of clinfo and nvidia-smi queries to automatically determine and display the user's GPU name in the GUI, and helps newbies suggest the GPU layers to use when first choosing a model, based on available VRAM and model filesizes. Not optimal, but it should give usable defaults and be even more newbie friendly now. You can thereafter edit the actual GPU layers to use. (Credit: Original concept adapted from @YellowRoseCx )
  • Added Min-P sampler - It is now available over the API, and can also be set in Lite from the Advanced settings tab. (Credit: @kalomaze)
  • Added --remotetunnel flag, which downloads and creates a TryCloudFlare remote tunnel, allowing you to access koboldcpp remotely over the internet even behind a firewall. Note: This downloads a tool called Cloudflared to the same directory.
  • Added a new build target for Windows exe users koboldcpp_clblast_noavx2, now providing a "CLBlast NoAVX2 (Old CPU)" option that finally supports CLBlast acceleration for windows devices without AVX2 intrinsics.
  • Include Content-Length header in responses.
  • Fixed some crashes with other uncommon models in cuda mode.
  • Retained support for GGUFv1, but you're encouraged to update as upstream has removed support.
  • Minor tweaks and optimizations to streaming timings. Fixed segfault that happens when streaming in multiuser mode and aborting connection halfway.
  • freq_base_train is now taken into account when setting automatic rope scale, that should handle codellama correctly now.
  • Updated Kobold Lite, added support for selecting Min-P and Sampler Seeds (for proper deterministic generation).
  • Improved KoboldCpp Colab, now with prebuilt CUDA binaries. Time to load after launch is less than a minute, excluding model downloads. Added a few more default model options, you can also use any custom GGUF model URL. (Try it here!)

Hotfix 1.48.1 - Fixed issues with Multi-GPU setups. GUI defaults to CuBLAS if available. Other minor fixes

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.47.2

20 Oct 08:11
Compare
Choose a tag to compare

koboldcpp-1.47.2

  • Added OpenAI optional adapter from #466 (thanks @lofcz) . This is an unofficial extension of the v1 OpenAI Chat Completions endpoint that allows customization of the instruct tags over the API. The Kobold API still provides better functionality and flexibility overall.
  • Pulled upstream support for ChatML added token merges (they have to be from a correctly converted GGUF model though, overall ChatML is still an inferior prompt template compared to Alpaca/Vicuna/LLAMA2).
  • Embedded Horde Worker improvements: Added auto-recovery pause timeout on too many errors, instead of halting the worker outright. The worker will still be halted if the total error count exceeds a high enough threshold.
  • Bug fixes for a multiuser race condition in polled streaming and for Top-K values being clamped (thanks @raefu @kalomaze)
  • Improved server CORS and content-type handling.
  • Added GUI input for tensor_split fields (thanks @AAbushady)
  • Fixed support for GGUFv1 Falcon models, which was broken due to the upstream rewrite of the BPE tokenizer.
  • Pulled other fixes and optimizations from upstream
  • Updated KoboldCpp Colab, now with the new Tiefighter model (try it here)

Hotfix 1.47.1 - Fixed a race condition with SSE streaming. Tavern streaming should be reliable now.
Hotfix 1.47.2 - Fixed an issue with older multilingual GGUFs needing an alternate BPE tokenizer.

Updates for Embedded Kobold Lite:

  • SSE streaming for Kobold Lite has been implemented! It requires a relatively recent browser. Toggle it on in settings.
  • Added Browser Storage Save Slots! You can now directly save stories within the browser session itself. This is intended to be a temporary storage allowing you to swap between and try multiple stories - the browser storage is wiped when the browser cache/history is cleared!
  • Added World Info Search Depth
  • Added Group Chat Management Panel (You can temporarily toggle the participants in a group chat)
  • Added AUTOMATIC1111 integration! It's finally here, you can now generate images from a local A1111 install, as an alternative to Horde,
  • Lots of miscellaneous fixes and improvements. If you encounter any issues, do report them here.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.46.1

08 Oct 07:03
Compare
Choose a tag to compare

koboldcpp-1.46.1

Important: Deprecation Notice for KoboldCpp 1.46

  • The following command line arguments are deprecated and have been removed from this version on.
--psutil_set_threads - parameter will be removed as it's now generally unhelpful, the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat mirostat_tau and mirostat_eta json fields.
  • Removed the original deprecated tkinter GUI, now only the new customtkinter GUI remains.
  • Improved embedded horde worker, added even more session stats, job pulls and job submits are now done in parallel so it should run about 20% faster for horde requests.
  • Changed the default model name from concedo/koboldcpp to koboldcpp/[model_filename]. This does prevent old "Kobold AI-Client" users from connecting via the API, so if you're still using that, either switch to a newer client or connect via the Basic/OpenAI API instead of the Kobold API.
  • Added proper API documentation, which can be found by navigating to /api or the web one at https://lite.koboldai.net/koboldcpp_api
  • Allow .kcpps files to be drag & dropped, as well as working via OpenWith in windows.
  • Added a new OpenAI Chat Completions compatible endpoint at /v1/chat/completions (credit: @teddybear082)
  • --onready processes are now started with subprocess.run instead of Popen (#462)
  • Both /check and /abort can now function together with multiuser mode, provided the correct genkey is used by the client (automatically handled in Lite).
  • Allow 64k --contextsize (for GGUF only, still 16k otherwise).
  • Minor UI fixes and enhancements.
  • Updated Lite, pulled fixes and improvements from upstream.

v1.46.1 hotfix: fixed an issue where blasthreads was used for values between 1 and 32 tokens.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.45.2

01 Oct 07:39
Compare
Choose a tag to compare

koboldcpp-1.45.2

  • Improved embedded horde worker: more responsive, and added Session Stats (Total Kudos Earned, EarnRate, Timings)
  • Added a new parameter to grammar sampler API grammar_retain_state which lets you persist the grammar state across multiple requests.
  • Allow launching by picking a .kcpps file in the file selector GUI combined with --skiplauncher. That settings file must already have a model selected. (Similar to --config, but that one doesn't use GUI at all.)
  • Added a new flag toggle --foreground for windows users. This sends the console terminal to the foreground every time a new prompt is generated, to avoid some idling slowdown issues.
  • Increased max support context with --contextsize to 32k, but only for GGUF models. It's still limited to 16k for older model versions. GGUF now actually has no hard limit to max context since it switched to using allocators, but it's not be compatible with older models. Additionally, models not trained with extended context are unlikely to work when RoPE scaled beyond 32k.
  • Added a simple OpenAI compatible completions API, which you can access at /v1/completions. You're still recommended to use the Kobold API as it has many more settings.
  • Increased stop_sequence limit to 16.
  • Improved SSE streaming by batching pending tokens between events.
  • Upgraded Lite polled-streaming to work even in multiuser mode. This works by sending a unique key for each request.
  • Improved Makefile to reduce unnecessary builds, added flag for skipping K-quants.
  • Enhanced Remote-Link.cmd to also work on Linux, simply run it to create a Cloudflare tunnel to access koboldcpp anywhere.
  • Improved the default colab notebook to use mmq.
  • Updated Lite and pulled other fixes and improvements from upstream llama.cpp.

Important: Deprecation Notice for KoboldCpp 1.45.1

The following command line arguments are considered deprecated and will be removed soon, in a future version.

--psutil_set_threads - parameter will be removed as it's now generally unhelpful, the defaults are usually sufficient.
--stream - a Kobold Lite only parameter, which is now a toggle saved inside Lite's settings and thus no longer necessary.
--unbantokens - EOS unbans should only be set via the generate API, in the use_default_badwordsids json field.
--usemirostat - Mirostat values should only be set via the generate API, in the mirostat mirostat_tau and mirostat_eta json fields.

Hotfix for 1.45.2 - Fixed a bug with reading thread counts in 1.45 and 1.45.1, also moved the OpenAI endpoint from /api/extra/oai/v1/completions to just /v1/completions

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.44.2

20 Sep 10:27
Compare
Choose a tag to compare

koboldcpp-1.44.2

A.K.A The "Mom: we have SillyTavern at home edition"

  • Added multi-user mode with --multiuser which allows up to 5 concurrent incoming /generate requests from multiple clients to be queued up and processed in sequence, instead of rejecting other requests while busy. Note that the /check and /abort endpoints are inactive while multiple requests are in-queue, this is to prevent one user from accidentally reading or cancelling a different user's request.
  • Added a new launcher argument --onready which allows you to pass a terminal command (e.g. start a python script) to be executed after Koboldcpp has finished loading. This runs as a subprocess, and can be useful for starting cloudflare tunnels, displaying URLs etc.
  • Added Grammar Sampling for all architectures, which can be accessed via the web API (also in Lite). Older models are also supported.
  • Added a new API endpoint /api/extra/true_max_context_length which allows fetching the true max context limit, separate from the horde-friendly value.
  • Added support for selecting from a 4th GPU from the UI and command line (was max 3 before).
  • Tweaked automatic RoPE scaling
  • Pulled other fixes and improvements from upstream.
  • Note: Using --usecublas with the prebuilt Windows executables here are only intended for Nvidia devices. For AMD users, please check out @YellowRoseCx koboldcpp-rocm fork instead.

Major Update for Kobold Lite:

taverny

  • Kobold Lite has undergone a massive overhaul, renamed and rearranged elements for a cleaner UI.
  • Added Aesthetic UI for chat mode, which is now automatically selected when importing Tavern cards. You can easily switch between the different UIs for chat and instruct modes from the settings panel.
  • Added Mirostat UI configs to settings panel.
  • Allowed Idle Responses in all modes, it is now a global setting. Also fixed an idle response detection bug.
  • Smarter group chats, mentioning a specific name when inside a group chat will cause that user to respond, instead of being random.
  • Added support for automagically increasing the max context size slider limit, if a larger context is detected.
  • Added scenario for importing characters from Chub.Ai
  • Added a settings checkbox to enable streaming whenever applicable without requiring messing with URLs. Streaming can be easily toggled from the settings UI now, similar to EOS unbanning, although the --stream flag is still kept for compatibility.
  • Added a few Instruct Tag Presets in a dropdown.
  • Supports instruct placeholders, allowing easy switching between instruct formats without rewriting the text. Added a toggle option to use "Raw Instruct Tags" (the old method) as an alternative to placeholder tags like {{[INPUT]}} and {{[OUTPUT]}}
  • Added a toggle for "Newline After Memory" which can be set in the memory panel.
  • Added a toggle for "Show Rename Save File" which shows a popup the user can use to rename the json save file before saving.
  • You can specify a BNDF grammar string in settings to use when generating, this controls grammar sampling.
  • Various minor bugfixes, also fixed stop_sequences still appearing in the AI outputs, they should be correctly truncated now.

v1.44.1 update - added queue number to perf endpoint, and updated lite to fix a few formatting bugs.
v1.44.2 update - fixed a speed regression from sched_yield again.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.
If you're using AMD, you can try koboldcpp_rocm at YellowRoseCx's fork here

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

koboldcpp-1.43

07 Sep 09:04
Compare
Choose a tag to compare

koboldcpp-1.43

  • Re-added support for automatic rope scale calculations based on a model's training context (n_ctx_train), this triggers if you do not explicitly specify a --ropeconfig. For example, this means llama2 models will (by default) use a smaller rope scale compared to llama1 models, for the same specified --contextsize. Setting --ropeconfig will override this. This was bugged and removed in the previous release, but it should be working fine now.
  • HIP and CUDA visible devices set to that GPU only, if GPU number is provided and tensor split is not specified.
  • Fixed RWKV models being broken after recent upgrades.
  • Tweaked --unbantokens to decrease the banned token logit values further, as very rarely they could still appear. Still not using -inf as that causes issues with typical sampling.
  • Integrate SSE streaming improvements from @kalomaze
  • Added mutex for thread-safe polled-streaming from @Elbios
  • Added support for older GGML (ggjt_v3) for 34B llama2 models by @vxiiduu, note that this may still have issues if n_gqa is not 1, in which case using GGUF would be better.
  • Fixed support for Windows 7, which should work in noavx2 and failsafe modes again. Also, SSE3 flags are now enabled for failsafe mode.
  • Updated Kobold Lite, now uses placeholders for instruct tags that get swapped during generation.
  • Tab navigation order improved in GUI launcher, though some elements like checkboxes still require mouse to toggle.
  • Pulled other fixes and improvements from upstream.

To use, download and run the koboldcpp.exe, which is a one-file pyinstaller.
If you don't need CUDA, you can use koboldcpp_nocuda.exe which is much smaller.

Run it from the command line with the desired launch parameters (see --help), or manually select the model in the GUI.
and then once loaded, you can connect like this (or use the full koboldai client):
http://localhost:5001

For more information, be sure to run the program from command line with the --help flag.

Of Note:

  • Reminder that HIPBLAS requires self compilation, and is not included by default in the prebuilt executables.
  • Remember that token unbans can now be set via API (and Lite) in addition to the command line.