Releases: oobabooga/textgen
v4.4 - MCP server support!
Changes
- MCP server support: Use remote MCP servers from the UI. Just add one server URL per line in the new "MCP servers" field in the Chat tab and send a message. Tools will be discovered automatically and used alongside local tools. [Tutorial]
- Several UI improvements, further modernizing the theme:
  - Improve hover menu appearance in the Chat tab.
  - Improve scrollbar styling (thinner, more rounded).
  - Improve message text contrast and heading colors.
  - Improve message action icon visibility in light mode.
  - Make blockquote, table, and hr borders more subtle and consistent.
  - Improve accordion outline styling.
  - Reduce empty space between the chat input and message contents.
  - Hide spin buttons on all sliders (these looked ugly on Windows).
  - Show filename tooltip on file attachments in the chat input.
- Add Windows + ROCm portable builds.
- Image generation: Embed metadata in API responses. PNG images returned by the API now include generation settings (model, seed, dimensions, steps, CFG scale, sampler) in the file metadata.
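As an illustration, the embedded settings can be read back with a small stdlib parser for PNG `tEXt` chunks. The exact metadata key names the API writes are not enumerated here, so treat the keys in the example as assumptions:

```python
import struct

def png_text_chunks(data: bytes) -> dict:
    """Extract tEXt metadata chunks (keyword -> value) from raw PNG bytes."""
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    meta = {}
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            # tEXt body is: keyword, NUL separator, value
            key, _, value = body.partition(b"\x00")
            meta[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC (not validated)
    return meta
```

Libraries like Pillow expose the same data via `Image.open(...).text`.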
- API: Add `instruction_template` and `instruction_template_str` parameters to the model load endpoint.
- API: Remove the deprecated `settings` parameter from the model load endpoint.
- Move the `cpu-moe` checkbox to extra flags (no longer needed now that `--fit` exists).
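A hedged sketch of using the new parameter. The payload shape below (and the `/v1/internal/model/load` path) is an assumption based on the project's existing model-load API, not a verbatim spec:

```python
import json

# Hypothetical model-load request body; "Alpaca" and the args values are
# placeholders. instruction_template_str would carry a raw template string
# instead of a template name.
payload = {
    "model_name": "my-model.gguf",
    "args": {"ctx_size": 8192},
    "instruction_template": "Alpaca",
}
body = json.dumps(payload)  # POST this to /v1/internal/model/load
```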
Bug fixes
- Fix inline LaTeX rendering: `$...$` expressions are now protected from being parsed as markdown (#7423).
- Fix crash when truncating prompts with tool call messages.
- Fix "address already in use" on server restart (Linux/macOS).
- Fix GPT-OSS reasoning tags briefly leaking into streamed output between thinking and tool calls.
- Fix tool call check sometimes truncating visible text at end of generation.
- Fix image generation failing with Flash Attention 2 errors by defaulting attention to SDPA.
- Fix loader args leaking between sequential API model loads.
- Fix IPv6 address formatting in the API.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@d0a6dfe
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@67fc9c5 (adds Gemma 4 support)
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (777 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (698 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (207 MB) | — |
| AMD (ROCm 7.2) | Download (516 MB) | — |
| CPU only | Download (191 MB) | Download (192 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (761 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (712 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (223 MB) | — |
| AMD (ROCm 7.2) | Download (329 MB) | — |
| CPU only | Download (207 MB) | Download (217 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (181 MB) |
| Intel (x86_64) | Download (187 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.3.3 - Gemma 4 support!
Changes
- Gemma 4 support with tool-calling in the API and UI. 🆕 - v4.3.1.
- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new `textgen-portable-ik` portable builds and a new `--ik` flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
- API: Add echo + logprobs for `/v1/completions`. The completions endpoint now supports the `echo` and `logprobs` parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new `top_logprobs_ids` field.
- Further optimize my custom gradio fork, saving up to 50 ms per UI event (button click, etc.).
- Transformers: Autodetect `torch_dtype` from model config instead of always forcing bfloat16/float16. The `--bf16` flag still works as an override.
- Remove the obsolete `models/config.yaml` file. Instruction templates are now detected from model metadata instead of filename patterns.
- Rename "truncation length" to "context length" in the terminal log message.
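To make the shape concrete, here is a sketch of pairing tokens with their logprobs from a `/v1/completions` response. The layout assumes the standard OpenAI `logprobs` fields (`tokens`, `token_logprobs`); with `echo` enabled, prompt tokens appear first with their own logprobs:

```python
def token_logprob_pairs(response: dict) -> list:
    """Pair each token with its logprob from a completions response."""
    lp = response["choices"][0]["logprobs"]
    return list(zip(lp["tokens"], lp["token_logprobs"]))

# Example response fragment (token strings and values are made up)
sample = {
    "choices": [{
        "text": "Hello world",
        "logprobs": {
            "tokens": ["Hello", " world"],
            "token_logprobs": [-0.12, -0.53],
        },
    }]
}
pairs = token_logprob_pairs(sample)
```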
Security
- Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
- Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
- Sanitize filenames in all prompt file operations (CWE-22). Thanks, @ffulbtech. 🆕 - v4.3.3.
- Fix SSRF in superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.
Bug fixes
- Fix `--idle-timeout` failing on encode/decode requests and not tracking parallel generation properly.
- Fix stopping string detection for chromadb/context-1 (`<|return|>` vs `<|result|>`).
- Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
- Fix `ban_eos_token` not working for ExLlamaV3. EOS is now suppressed at the logit level.
- Fix "Value: None is not in the list of choices: []" Gradio error introduced in v4.3. 🆕 - v4.3.2.
- Fix Dropdown/Radio/CheckboxGroup crash when choices list is empty. 🆕 - v4.3.3.
- Fix API crash when parsing tool calls from non-dict JSON model output. 🆕 - v4.3.3.
- Fix llama.cpp crashing due to failing to parse the Gemma 4 template (even though we don't use llama.cpp's jinja parser). 🆕 - v4.3.2.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@277ff5f
- Adds Gemma-4 support
- Adds improved KV cache quantization via activations rotation, based on TurboQuant ggml-org/llama.cpp#21038
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@d557d6c
- Update ExLlamaV3 to 0.0.28
- Update transformers to 5.5
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (758 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (681 MB) | Download (1.17 GB) |
| AMD/Intel (Vulkan) | Download (191 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (175 MB) | Download (175 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (753 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (706 MB) | Download (1.2 GB) |
| AMD/Intel (Vulkan) | Download (217 MB) | — |
| AMD (ROCm 7.2) | Download (323 MB) | — |
| CPU only | Download (201 MB) | Download (211 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (173 MB) |
| Intel (x86_64) | Download (179 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.3.2
v4.3.1
v4.3
v4.2
Before/after screenshots of the updated UI theme.
Changes
- Anthropic-compatible API: A new `/v1/messages` endpoint lets you connect Claude Code, Cursor, and other Anthropic API clients. Supports system messages, content blocks, tool use, tool results, image inputs, and thinking blocks. To use with Claude Code: `ANTHROPIC_BASE_URL=http://127.0.0.1:5000 claude`.
- Updated UI theme: New colors, borders, and button styles across light and dark modes.
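For clients other than Claude Code, a minimal request against the new endpoint follows the standard Anthropic Messages shape; the model name below is a placeholder (the server answers with whatever model is loaded):

```python
import json

# Anthropic-style request body for POST http://127.0.0.1:5000/v1/messages
payload = {
    "model": "local-model",  # placeholder; the loaded model is used
    "max_tokens": 512,
    "system": "You are a concise assistant.",
    "messages": [
        {"role": "user", "content": "Summarize SSE in one sentence."},
    ],
}
body = json.dumps(payload)
```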
- `--extra-flags` now supports literal flags: You can now pass flags directly, e.g. `--extra-flags "--rpc 192.168.1.100:50052 --jinja"`. The old `key=value` format is still accepted for backwards compatibility.
- Training
  - Enable `gradient_checkpointing` by default for lower VRAM usage during training.
  - Remove the arbitrary `higher_rank_limit` parameter.
  - Reorganize the training UI.
- Strip thinking blocks before tool-call parsing to prevent false-positive tool call detection from `<think>` content.
- Move the OpenAI-compatible API from `extensions/openai` to `modules/api`. The old `--extensions openai` flag is still accepted as an alias for `--api`.
- Set `top_p=0.95` as the default sampling parameter for API requests.
- Remove 52 obsolete instruction templates from 2023 (Airoboros, Baichuan, Guanaco, Koala, Vicuna v0, MOSS, etc.).
- Reduce portable build sizes by using a stripped Python distribution.
Bug fixes
- Fix prompt corruption when continuing a chat with context truncation (#7439). Thanks, @Phrosty1.
- Fix multi-turn thinking block corruption for Kimi models.
- Fix AMD installer failing to resolve ROCm triton dependency.
- Fix the `--share` feature in the Gradio fork.
- Fix `--extra-flags` breaking short long-form-only flags like `--rpc`.
- Fix the instruction template delete dialog not appearing.
- Fix file handle leaks and redundant re-reads in model metadata loading (#7422). Thanks, @alvinttang.
- Fix superboogav2 broken delete endpoint (#6010). Thanks, @Raunak-Kumar7.
- Fix leading spaces in post-reasoning `content` in API responses.
- Fix Cloudflare tunnel retry logic raising after the first failed attempt instead of exhausting retries.
- Fix `OPENEDAI_DEBUG=0` being treated as truthy.
- Fix mutable default argument in LogitsBiasProcessor (#7426). Thanks, @Jah-yee.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/3fc6f1aed172602790e9088b57786109438c2466
- Update ExLlamaV3 to 0.0.26
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.1.1
Changes
- Tool-calling in the UI!: Models can now call custom functions during chat. Each tool is a single `.py` file in `user_data/tools`, and five examples are provided: `web_search`, `fetch_webpage`, `calculate`, `get_datetime`, and `roll_dice`. During streaming, each tool call appears as a collapsible accordion similar to the existing thinking blocks, showing the called function, the arguments chosen by the LLM, and the output. [Tutorial]
- Replace `html2text` with `trafilatura` for extracting text from web pages, significantly reducing boilerplate like navigation bars and saving tokens in agentic tool-calling loops.
- OpenAI API improvements:
  - Rewrite `logprobs` support for full spec compliance across llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
  - Add a `reasoning_content` field for thinking blocks in both streaming and non-streaming chat completions. Thinking blocks now go exclusively in this field, and `content` only shows the post-thinking reply, even when tool calls are present.
  - Add `tool_choice` support and fix the `tool_calls` response format for strict spec compliance.
  - Put mid-conversation system messages in the correct positions in the prompt instead of collapsing all system messages at the top.
  - Add support for the `developer` role, which is mapped to `system`.
  - Add `max_completion_tokens` as an alias for `max_tokens`.
  - Include `/v1` in the API URL printed to the terminal since that's what most clients expect.
  - Make the `/v1/models` endpoint show only the currently loaded model.
  - Add `stream_options` support with `include_usage` for streaming responses.
  - Return `finish_reason: tool_calls` when tool calls are detected.
  - Several other spec compliance improvements after careful auditing.
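A small sketch of consuming the new `reasoning_content` field on the client side, assuming a non-streaming chat completion response:

```python
def split_reply(response: dict):
    """Return (thinking, reply) from a chat completion message.

    Assumes thinking goes in `reasoning_content` and the final answer in
    `content`, as described above.
    """
    msg = response["choices"][0]["message"]
    return msg.get("reasoning_content", ""), msg.get("content", "")

# Example response fragment (values are illustrative)
sample = {"choices": [{"message": {
    "reasoning_content": "The user wants a greeting.",
    "content": "Hello!",
}}]}
thinking, reply = split_reply(sample)
```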
- llama.cpp
  - Set `ctx-size` to `0` (auto) by default. Note: this only works when `--gpu-layers` is also set to `-1`, which is the default value. When using other loaders, 0 maps to 8192.
  - Reduce the `--fit-target` default from 1024 MiB to 512 MiB.
  - Use `--fit-ctx 8192` to set 8192 as the minimum acceptable ctx size for `--fit on` (llama.cpp uses 4096 by default).
  - Make `logit_bias` and `logprobs` functional in API calls.
  - Add missing `custom_token_bans` parameter in the UI.
- ExLlamaV3
  - Add native `logit_bias` and `logprobs` support.
  - Load the vision model and the draft model before the main model so memory auto-splitting accounts for them.
- New default preset: "Top-P" (`top_p: 0.95`), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
- Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends `<think>` to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
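In spirit, the extraction looks like the following minimal sketch for the Qwen-style `<think>` format (the real module handles several formats plus streaming edge cases):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def extract_thinking(text: str):
    """Split a reply into (thinking, visible_text) for <think> blocks."""
    match = THINK_RE.search(text)
    if not match:
        return "", text
    return match.group(1).strip(), THINK_RE.sub("", text, count=1)
```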
<think>to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming. - Incognito chat: This option has been added next to the existing "New chat" button. Incognito chats are temporary, live in RAM and are never saved to disk.
- Optimize chat streaming performance by updating the DOM only once per animation frame.
- Increase the `ctx-size` slider maximum to 1M tokens in the UI, with a step of 1024.
- Add a new drag-and-drop UI component for reordering "Sampler priority" items.
- Make all chat styles consistent with the instruct style in spacing, line heights, etc., improving the quality and consistency of those styles.
- Remove the gradio import in `--nowebui` mode, saving some 0.5-0.8 seconds on startup.
- Force-exit the webui on repeated Ctrl+C.
- Improve the `--multi-user` warning to make the known limitations transparent.
- Remove the rope scaling parameters (`alpha_value`, `rope_freq_base`, `compress_pos_emb`). Models now have 128k+ context, and those parameters are from the 4096-context era; they can still be passed to llama.cpp through `--extra-flags` if needed.
- Optimize wheel downloads in the one-click installer to only download wheels that actually changed between updates. Previously all wheels would be downloaded if at least one of them had changed.
- Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
- Security: server-side file save roots, image URL SSRF protection, extension allowlist (new in 4.1.1)
Bug fixes
- Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
- Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
- Fix passing `adaptive-p` to llama-server.
- Fix `truncation_length` not propagating correctly when `ctx_size` is set to auto (0).
- Fix dark theme using light theme syntax highlighting.
- Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
- Fix the OpenAI API server not respecting `--listen-host`.
- Fix a crash loading the MiniMax-M2.5 jinja template.
- Fix `reasoning_effort` not appearing in the UI for ExLlamaV3.
- Fix ExLlamaV3 draft cache size to match main cache.
- Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
- Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/67a2209fabe2e3498d458561933d5380655085d2
- Update ExLlamaV3 to 0.0.25
- Update diffusers to 0.37
- Update AMD ROCm from 6.4 to 7.2
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.1
v4.0
Changes
- Custom Gradio fork: Gradio has been replaced with a custom fork at oobabooga/gradio where major performance optimizations were made. The UI now does far less redundant work on every update, startup is faster, SSE message delivery is instant instead of polling every 50 ms, and a new zero-rendering `gr.Headless` component reduces overhead during chat streaming. Analytics, unused dependencies, and unused assets have also been removed from the wheel.
- Tool-calling overhaul: Tool-calling now actually works for Qwen 3.5, Devstral 2, GPT-OSS, DeepSeek V3.2, GLM 5, MiniMax M2.5, Kimi K2/K2.5, and Llama 4 models. Several improvements have been made for strict OpenAI format compliance. Extensive testing has been done to make sure tool-calling works flawlessly for the supported models. [Documentation]
- Parallel API requests: For llama.cpp, ExLlamaV3, and TensorRT-LLM loaders, it is now possible to make concurrent API requests for maximum throughput. For llama.cpp, it is necessary to use the `--parallel N` option and multiply the context length by `N`. [Documentation]
- Training overhaul (documentation): The training code has been completely rewritten. It is now fully in line with axolotl for both raw text training and chat training.
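Client-side, concurrency is just a matter of firing requests from multiple threads; the sketch below substitutes a placeholder `send_request` for the actual HTTP POST to the completions endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    # Placeholder for an HTTP POST to /v1/completions on the local server.
    return f"completion for: {prompt}"

prompts = ["Hello", "Bonjour", "Hola", "Ciao"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order even though requests run concurrently
    results = list(pool.map(send_request, prompts))
```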
  - For chat training, datasets in OpenAI `messages` format or ShareGPT `conversations` format are now used. Multi-turn chats are supported, with correct masking of user inputs so that training only happens on assistant messages. See `user_data/training/example_messages.json` and `user_data/training/example_sharegpt.json` for examples.
  - For raw text training, JSONL files are used, with correct BOS and EOS addition for each sub-document. See `user_data/training/example_text.json` for an example input.
  - Chat training now uses jinja2 templates for formatting prompts. You can use either the model's built-in template (if it has one) or a custom user-provided template.
  - New "Target all linear layers" checkbox that applies LoRA to every `nn.Linear` layer except `lm_head`. It works for any model architecture.
  - Checkpoint resumption: HF Trainer checkpoint directories are detected automatically and training resumes with full optimizer/scheduler state.
  - All training input parameters now have good, reviewed default values.
  - Conversations exceeding the cutoff length are now dropped instead of silently truncated (configurable).
  - Dynamic padding (chat datasets): batches are now padded to the longest sequence in each batch instead of always padding to `cutoff_len`, reducing wasted computation.
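For reference, one record in the OpenAI `messages` format looks like the following; the concrete strings are illustrative, and the bundled example files are the authoritative reference for the file layout:

```python
import json

# During training, loss is computed only on the assistant turn; the
# system and user turns are masked out.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}
line = json.dumps(record)
```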
- llama.cpp
  - `--fit` support: GPU layers now default to `-1` (auto), letting llama.cpp determine the optimal number of layers and GPU split automatically. The new `--fit-target` parameter controls how much VRAM headroom to leave per GPU (default: 1024 MiB). Context size can also be set to `0` to let llama.cpp determine that automatically as well.
  - Integrate N-gram speculative decoding support for faster generation without the need for a draft model, through the `--spec-type`, `--spec-ngram-size-n`, `--spec-ngram-size-m`, and `--spec-ngram-min-hits` parameters. Good defaults are provided; just change `--spec-type` to `ngram-mod` to activate.
  - Binaries now work for any CPU instruction set (AVX, AVX2, AVX-512) by autodetecting at runtime, replacing the old separate AVX/AVX2 builds.
- Add ROCm portable builds for Windows.
- Add CUDA 13.1 portable builds.
- Add back macOS x86_64 (Intel) portable builds.
- Smaller CUDA binaries after improving compilation flags.
- Compilation workflows at oobabooga/llama-cpp-binaries have been fully audited and aligned with upstream.
- Handle SIGTERM to properly stop llama-server on pkill.
- llama-server is now spawned on port 5005 by default instead of a random port.
- Adaptive-p sampler for llama.cpp, Transformers, ExLlamaV3, and ExLlamaV3_HF loaders. This sampler reshapes the logit distribution to favor tokens near a target probability.
- New CLI flags to set default API generation parameters: `--temperature`, `--min-p`, `--top-k`, `--repetition-penalty`, etc., and also `--enable-thinking`, `--reasoning-effort`, and `--chat-template-file`. The last parameter accepts `.jinja` or `.yaml` files.
- Chat completion requests are now ~85 ms faster after optimizations.
- SSE separator for streaming over the API changed from `\r\n` to `\n` to match OpenAI.
- Migrate TensorRT-LLM from the old ModelRunner API to the new LLM API, which can take any Transformers model as input and has more sampling parameters.
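For client code, the separator change means events are delimited by a blank line using plain `\n`; a minimal parser sketch:

```python
def parse_sse(stream: str) -> list:
    """Collect `data:` payloads from an OpenAI-style SSE stream."""
    events = []
    for block in stream.split("\n\n"):  # events are separated by \n\n
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(line[len("data: "):])
    return events
```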
- Security
  - Prevent path traversal on file save/delete operations for characters, users, and uploaded files.
  - Restrict model loading over API to block `extra_flags` and `trust_remote_code` parameters.
  - Restrict file writes to the `user_data_dir`.
- New `--user-data-dir` flag to customize the user data directory location. The program also auto-detects a `../user_data` folder in portable mode if present, making updates easier.
- User persona support: A new dropdown in the Character settings tab lets you save and load user profiles (name, bio, profile picture), so you can switch between different personas without re-entering your details (#7367). Thanks, @q5sys.
- Replace PyPDF2 with pymupdf for much more accurate conversion of PDF inputs to text.
- Markdown rendering improvements, all by @mamei16.
- Add Qwen 3.5 thinking block support to the UI.
- Add Solar Open thinking block support to the UI.
- Update the entire documentation to match the current code.
- Update all dockerfiles. [Documentation]
- Update the Google Colab notebook.
- Remove the ExLlamaV2 loader, which has been archived. EXL2 users should migrate to EXL3, which has much better quantization accuracy.
- Remove the Training_PRO extension, which has become obsolete after the Training tab updates.
- Remove obsolete DeepSpeed inference code from 2023.
- Remove unused colorama and psutil dependencies.
- Update outdated GitHub Actions versions (#7384). Thanks, @pgoslatara.
Bug fixes
- Fix `temperature_last` having no effect in llama.cpp server sampler order.
- Fix code block copy button not working over HTTP (Clipboard API fallback) (#7358). Thanks, @jakubartur.
- Fix message copy buttons not working over HTTP (extend Clipboard API fallback).
- Fix ExLlamaV3 CFG cache initialization and speculative decoding parameter handling.
- Fix blank prompt dropdown in Notebook/Default tabs on first startup.
- Use absolute Python path in Windows batch scripts to fix some rare edge cases.
- Bump sentence-transformers from 2.2.2 to 3.3.1 in superbooga (#7406). Thanks, @OiPunk.
- Fix installer state being saved before requirements were fully installed.
- Fix ExLlamav3 race condition that could cause AssertionError or hangs during generation.
- Fix API server continuing to generate tokens after client disconnects for non-streaming requests.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/6fce5c6a7dba6a3e1df0aad1574b78d1a1970621
- Update Transformers to 5.3
- Update ExLlamaV3 to 0.0.23
- Update TensorRT-LLM to 1.1.0
- Update PyTorch to 2.9.1
- Update Python to 3.13
- Update ROCm wheels to ROCm 6.4
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v3.23
Changes
- Improve the style of tables and horizontal separators in chat messages
Bug fixes
- Fix loading models which have their eos token disabled (#7363). Thanks, @jin-eld.
- Fix a symbolic link issue in llama-cpp-binaries while updating non-portable installs
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/55abc393552f3f2097f168cb6db4dc495a514d56
- Update bitsandbytes to 0.49
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4`.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.

