Releases: oobabooga/textgen
v4.4 - MCP server support!
Changes
- MCP server support: Use remote MCP servers from the UI. Just add one server URL per line in the new "MCP servers" field in the Chat tab and send a message. Tools will be discovered automatically and used alongside local tools. [Tutorial]
- Several UI improvements, further modernizing the theme:
  - Improve hover menu appearance in the Chat tab.
  - Improve scrollbar styling (thinner, more rounded).
  - Improve message text contrast and heading colors.
  - Improve message action icon visibility in light mode.
  - Make blockquote, table, and hr borders more subtle and consistent.
  - Improve accordion outline styling.
  - Reduce empty space between the chat input and message contents.
  - Hide spin buttons on all sliders (these looked ugly on Windows).
  - Show filename tooltip on file attachments in the chat input.
- Add Windows + ROCm portable builds.
- Image generation: Embed metadata in API responses. PNG images returned by the API now include generation settings (model, seed, dimensions, steps, CFG scale, sampler) in the file metadata.
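As an illustration, the embedded settings can be read back with a small stdlib parser for PNG `tEXt` chunks. The exact metadata key names the API writes are not enumerated here, so treat the keys in the example as assumptions:

```python
import struct

def png_text_chunks(data: bytes) -> dict:
    """Extract tEXt metadata chunks (keyword -> value) from raw PNG bytes."""
    if data[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG file")
    meta = {}
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            # tEXt body is: keyword, NUL separator, value
            key, _, value = body.partition(b"\x00")
            meta[key.decode("latin-1")] = value.decode("latin-1")
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC (not validated)
    return meta
```

Libraries like Pillow expose the same data via `Image.open(...).text`.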
- API: Add `instruction_template` and `instruction_template_str` parameters to the model load endpoint.
- API: Remove the deprecated `settings` parameter from the model load endpoint.
- Move the `cpu-moe` checkbox to extra flags (no longer needed now that `--fit` exists).
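A hedged sketch of using the new parameter. The payload shape below (and the `/v1/internal/model/load` path) is an assumption based on the project's existing model-load API, not a verbatim spec:

```python
import json

# Hypothetical model-load request body; "Alpaca" and the args values are
# placeholders. instruction_template_str would carry a raw template string
# instead of a template name.
payload = {
    "model_name": "my-model.gguf",
    "args": {"ctx_size": 8192},
    "instruction_template": "Alpaca",
}
body = json.dumps(payload)  # POST this to /v1/internal/model/load
```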
Bug fixes
- Fix inline LaTeX rendering: `$...$` expressions are now protected from being parsed as markdown (#7423).
- Fix crash when truncating prompts with tool call messages.
- Fix "address already in use" on server restart (Linux/macOS).
- Fix GPT-OSS reasoning tags briefly leaking into streamed output between thinking and tool calls.
- Fix tool call check sometimes truncating visible text at end of generation.
- Fix image generation failing with Flash Attention 2 errors by defaulting attention to SDPA.
- Fix loader args leaking between sequential API model loads.
- Fix IPv6 address formatting in the API.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@d0a6dfe
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@67fc9c5 (adds Gemma 4 support)
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (777 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (698 MB) | Download (1.19 GB) |
| AMD/Intel (Vulkan) | Download (207 MB) | — |
| AMD (ROCm 7.2) | Download (516 MB) | — |
| CPU only | Download (191 MB) | Download (192 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (761 MB) | Download (1.09 GB) |
| NVIDIA (CUDA 13.1) | Download (712 MB) | Download (1.21 GB) |
| AMD/Intel (Vulkan) | Download (223 MB) | — |
| AMD (ROCm 7.2) | Download (329 MB) | — |
| CPU only | Download (207 MB) | Download (217 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (181 MB) |
| Intel (x86_64) | Download (187 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.3.3 - Gemma 4 support!
Changes
- Gemma 4 support with tool-calling in the API and UI. 🆕 - v4.3.1.
- ik_llama.cpp support: Add ik_llama.cpp as a new backend through new `textgen-portable-ik` portable builds and a new `--ik` flag for full installs. ik_llama.cpp is a fork by the author of the imatrix quants, including support for new quant types, significantly more accurate KV cache quantization (via Hadamard KV cache rotation, enabled by default), and optimizations for MoE models and CPU inference.
- API: Add echo + logprobs for `/v1/completions`. The completions endpoint now supports the `echo` and `logprobs` parameters, returning token-level log probabilities for both prompt and generated tokens. Token IDs are also included in the output via a new `top_logprobs_ids` field.
- Further optimize my custom gradio fork, saving up to 50 ms per UI event (button click, etc.).
- Transformers: Autodetect `torch_dtype` from model config instead of always forcing bfloat16/float16. The `--bf16` flag still works as an override.
- Remove the obsolete `models/config.yaml` file. Instruction templates are now detected from model metadata instead of filename patterns.
- Rename "truncation length" to "context length" in the terminal log message.
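To make the shape concrete, here is a sketch of pairing tokens with their logprobs from a `/v1/completions` response. The layout assumes the standard OpenAI `logprobs` fields (`tokens`, `token_logprobs`); with `echo` enabled, prompt tokens appear first with their own logprobs:

```python
def token_logprob_pairs(response: dict) -> list:
    """Pair each token with its logprob from a completions response."""
    lp = response["choices"][0]["logprobs"]
    return list(zip(lp["tokens"], lp["token_logprobs"]))

# Example response fragment (token strings and values are made up)
sample = {
    "choices": [{
        "text": "Hello world",
        "logprobs": {
            "tokens": ["Hello", " world"],
            "token_logprobs": [-0.12, -0.53],
        },
    }]
}
pairs = token_logprob_pairs(sample)
```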
Security
- Gradio fork: Fix ACL bypass via case-insensitive path matching on Windows/macOS.
- Gradio fork: Add server-side validation for Dropdown, Radio, and CheckboxGroup.
- Sanitize filenames in all prompt file operations (CWE-22). Thanks, @ffulbtech. 🆕 - v4.3.3.
- Fix SSRF in superbooga extensions: URLs fetched by superbooga/superboogav2 are now validated to block requests to private/internal networks.
Bug fixes
- Fix `--idle-timeout` failing on encode/decode requests and not tracking parallel generation properly.
- Fix stopping string detection for chromadb/context-1 (`<|return|>` vs `<|result|>`).
- Fix Qwen3.5 MoE failing to load via ExLlamav3_HF.
- Fix `ban_eos_token` not working for ExLlamaV3. EOS is now suppressed at the logit level.
- Fix "Value: None is not in the list of choices: []" Gradio error introduced in v4.3. 🆕 - v4.3.2.
- Fix Dropdown/Radio/CheckboxGroup crash when choices list is empty. 🆕 - v4.3.3.
- Fix API crash when parsing tool calls from non-dict JSON model output. 🆕 - v4.3.3.
- Fix llama.cpp crashing due to failing to parse the Gemma 4 template (even though we don't use llama.cpp's jinja parser). 🆕 - v4.3.2.
Dependency updates
- Update llama.cpp to ggml-org/llama.cpp@277ff5f
- Adds Gemma-4 support
- Adds improved KV cache quantization via activations rotation, based on TurboQuant ggml-org/llama.cpp#21038
- Update ik_llama.cpp to ikawrakow/ik_llama.cpp@d557d6c
- Update ExLlamaV3 to 0.0.28
- Update transformers to 5.5
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Note
NVIDIA GPU: If nvidia-smi reports CUDA Version >= 13.1, use the cuda13.1 build. Otherwise, use cuda12.4.
ik_llama.cpp is a llama.cpp fork with new quant types. If unsure, use the llama.cpp column.
Windows
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (758 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (681 MB) | Download (1.17 GB) |
| AMD/Intel (Vulkan) | Download (191 MB) | — |
| AMD (ROCm 7.2) | Download (499 MB) | — |
| CPU only | Download (175 MB) | Download (175 MB) |
Linux
| GPU/Platform | llama.cpp | ik_llama.cpp |
|---|---|---|
| NVIDIA (CUDA 12.4) | Download (753 MB) | Download (1.12 GB) |
| NVIDIA (CUDA 13.1) | Download (706 MB) | Download (1.2 GB) |
| AMD/Intel (Vulkan) | Download (217 MB) | — |
| AMD (ROCm 7.2) | Download (323 MB) | — |
| CPU only | Download (201 MB) | Download (211 MB) |
macOS
| Architecture | llama.cpp |
|---|---|
| Apple Silicon (arm64) | Download (173 MB) |
| Intel (x86_64) | Download (179 MB) |
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.3.2
v4.3.1
v4.3
v4.2
Before/after screenshots of the updated UI theme.
Changes
- Anthropic-compatible API: A new `/v1/messages` endpoint lets you connect Claude Code, Cursor, and other Anthropic API clients. Supports system messages, content blocks, tool use, tool results, image inputs, and thinking blocks. To use with Claude Code: `ANTHROPIC_BASE_URL=http://127.0.0.1:5000 claude`.
- Updated UI theme: New colors, borders, and button styles across light and dark modes.
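For clients other than Claude Code, a minimal request against the new endpoint follows the standard Anthropic Messages shape; the model name below is a placeholder (the server answers with whatever model is loaded):

```python
import json

# Anthropic-style request body for POST http://127.0.0.1:5000/v1/messages
payload = {
    "model": "local-model",  # placeholder; the loaded model is used
    "max_tokens": 512,
    "system": "You are a concise assistant.",
    "messages": [
        {"role": "user", "content": "Summarize SSE in one sentence."},
    ],
}
body = json.dumps(payload)
```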
- `--extra-flags` now supports literal flags: You can now pass flags directly, e.g. `--extra-flags "--rpc 192.168.1.100:50052 --jinja"`. The old `key=value` format is still accepted for backwards compatibility.
- Training
  - Enable `gradient_checkpointing` by default for lower VRAM usage during training.
  - Remove the arbitrary `higher_rank_limit` parameter.
  - Reorganize the training UI.
- Strip thinking blocks before tool-call parsing to prevent false-positive tool call detection from `<think>` content.
- Move the OpenAI-compatible API from `extensions/openai` to `modules/api`. The old `--extensions openai` flag is still accepted as an alias for `--api`.
- Set `top_p=0.95` as the default sampling parameter for API requests.
- Remove 52 obsolete instruction templates from 2023 (Airoboros, Baichuan, Guanaco, Koala, Vicuna v0, MOSS, etc.).
- Reduce portable build sizes by using a stripped Python distribution.
Bug fixes
- Fix prompt corruption when continuing a chat with context truncation (#7439). Thanks, @Phrosty1.
- Fix multi-turn thinking block corruption for Kimi models.
- Fix AMD installer failing to resolve ROCm triton dependency.
- Fix the `--share` feature in the Gradio fork.
- Fix `--extra-flags` breaking short long-form-only flags like `--rpc`.
- Fix the instruction template delete dialog not appearing.
- Fix file handle leaks and redundant re-reads in model metadata loading (#7422). Thanks, @alvinttang.
- Fix superboogav2 broken delete endpoint (#6010). Thanks, @Raunak-Kumar7.
- Fix leading spaces in post-reasoning `content` in API responses.
- Fix Cloudflare tunnel retry logic raising after the first failed attempt instead of exhausting retries.
- Fix `OPENEDAI_DEBUG=0` being treated as truthy.
- Fix mutable default argument in LogitsBiasProcessor (#7426). Thanks, @Jah-yee.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/3fc6f1aed172602790e9088b57786109438c2466
- Update ExLlamaV3 to 0.0.26
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.1.1
Changes
- Tool-calling in the UI!: Models can now call custom functions during chat. Each tool is a single `.py` file in `user_data/tools`, and five examples are provided: `web_search`, `fetch_webpage`, `calculate`, `get_datetime`, and `roll_dice`. During streaming, each tool call appears as a collapsible accordion similar to the existing thinking blocks, showing the called function, the arguments chosen by the LLM, and the output. [Tutorial]
- Replace `html2text` with `trafilatura` for extracting text from web pages, significantly reducing boilerplate like navigation bars and saving tokens in agentic tool-calling loops.
- OpenAI API improvements:
  - Rewrite `logprobs` support for full spec compliance across llama.cpp, ExLlamaV3, and Transformers backends. Both streaming and non-streaming responses now return token-by-token logprobs.
  - Add a `reasoning_content` field for thinking blocks in both streaming and non-streaming chat completions. Thinking blocks now go exclusively in this field, and `content` only shows the post-thinking reply, even when tool calls are present.
  - Add `tool_choice` support and fix the `tool_calls` response format for strict spec compliance.
  - Put mid-conversation system messages in the correct positions in the prompt instead of collapsing all system messages at the top.
  - Add support for the `developer` role, which is mapped to `system`.
  - Add `max_completion_tokens` as an alias for `max_tokens`.
  - Include `/v1` in the API URL printed to the terminal since that's what most clients expect.
  - Make the `/v1/models` endpoint show only the currently loaded model.
  - Add `stream_options` support with `include_usage` for streaming responses.
  - Return `finish_reason: tool_calls` when tool calls are detected.
  - Several other spec compliance improvements after careful auditing.
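A small sketch of consuming the new `reasoning_content` field on the client side, assuming a non-streaming chat completion response:

```python
def split_reply(response: dict):
    """Return (thinking, reply) from a chat completion message.

    Assumes thinking goes in `reasoning_content` and the final answer in
    `content`, as described above.
    """
    msg = response["choices"][0]["message"]
    return msg.get("reasoning_content", ""), msg.get("content", "")

# Example response fragment (values are illustrative)
sample = {"choices": [{"message": {
    "reasoning_content": "The user wants a greeting.",
    "content": "Hello!",
}}]}
thinking, reply = split_reply(sample)
```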
- llama.cpp
  - Set `ctx-size` to `0` (auto) by default. Note: this only works when `--gpu-layers` is also set to `-1`, which is the default value. When using other loaders, 0 maps to 8192.
  - Reduce the `--fit-target` default from 1024 MiB to 512 MiB.
  - Use `--fit-ctx 8192` to set 8192 as the minimum acceptable ctx size for `--fit on` (llama.cpp uses 4096 by default).
  - Make `logit_bias` and `logprobs` functional in API calls.
  - Add missing `custom_token_bans` parameter in the UI.
- ExLlamaV3
  - Add native `logit_bias` and `logprobs` support.
  - Load the vision model and the draft model before the main model so memory auto-splitting accounts for them.
- New default preset: "Top-P" (`top_p: 0.95`), following recommendations for several SOTA open-weights models. The old "Qwen3 - Thinking", "Qwen3 - No Thinking", "min_p", and "Instruct" presets have been removed.
- Refactor reasoning/thinking extraction into a standalone module supporting multiple model formats (Qwen, GPT-OSS, Solar, seed:think, and others). Also detect when a chat template appends `<think>` to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming.
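In spirit, the extraction looks like the following minimal sketch for the Qwen-style `<think>` format (the real module handles several formats plus streaming edge cases):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>\s*", re.DOTALL)

def extract_thinking(text: str):
    """Split a reply into (thinking, visible_text) for <think> blocks."""
    match = THINK_RE.search(text)
    if not match:
        return "", text
    return match.group(1).strip(), THINK_RE.sub("", text, count=1)
```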
<think>to the prompt and prepend it to the reply, so the thinking block appears immediately during streaming. - Incognito chat: This option has been added next to the existing "New chat" button. Incognito chats are temporary, live in RAM and are never saved to disk.
- Optimize chat streaming performance by updating the DOM only once per animation frame.
- Increase the `ctx-size` slider maximum to 1M tokens in the UI, with a step of 1024.
- Add a new drag-and-drop UI component for reordering "Sampler priority" items.
- Make all chat styles consistent with the instruct style in spacing, line heights, etc., improving the quality and consistency of those styles.
- Remove the gradio import in `--nowebui` mode, saving some 0.5-0.8 seconds on startup.
- Force-exit the webui on repeated Ctrl+C.
- Improve the `--multi-user` warning to make the known limitations transparent.
- Remove the rope scaling parameters (`alpha_value`, `rope_freq_base`, `compress_pos_emb`). Models now have 128k+ context, and those parameters are from the 4096-context era; they can still be passed to llama.cpp through `--extra-flags` if needed.
- Optimize wheel downloads in the one-click installer to only download wheels that actually changed between updates. Previously all wheels would be downloaded if at least one of them had changed.
- Update the Intel Arc PyTorch installation command in the one-click installer, removing the dependency on Intel oneAPI conda packages.
- Security: server-side file save roots, image URL SSRF protection, extension allowlist (new in 4.1.1)
Bug fixes
- Fix pip accidentally installing to the system Miniconda on Windows instead of the project environment.
- Fix crash on non-UTF-8 Windows locales (e.g. Chinese GBK).
- Fix passing `adaptive-p` to llama-server.
- Fix `truncation_length` not propagating correctly when `ctx_size` is set to auto (0).
- Fix dark theme using light theme syntax highlighting.
- Fix word breaks in tables. Tables now scroll horizontally instead of breaking words.
- Fix the OpenAI API server not respecting `--listen-host`.
- Fix a crash loading the MiniMax-M2.5 jinja template.
- Fix `reasoning_effort` not appearing in the UI for ExLlamaV3.
- Fix ExLlamaV3 draft cache size to match main cache.
- Fix ExLlamaV3 EOS handling for models with multiple end-of-sequence tokens.
- Fix ExLlamaV3 perplexity evaluation giving incorrect values for sequences longer than 2048 tokens.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/67a2209fabe2e3498d458561933d5380655085d2
- Update ExLlamaV3 to 0.0.25
- Update diffusers to 0.37
- Update AMD ROCm from 6.4 to 7.2
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v4.1
v4.0
Changes
- Custom Gradio fork: Gradio has been replaced with a custom fork at oobabooga/gradio where major performance optimizations were made. The UI now does far less redundant work on every update, startup is faster, SSE message delivery is instant instead of polling every 50 ms, and a new zero-rendering `gr.Headless` component reduces overhead during chat streaming. Analytics, unused dependencies, and unused assets have also been removed from the wheel.
- Tool-calling overhaul: Tool-calling now actually works for Qwen 3.5, Devstral 2, GPT-OSS, DeepSeek V3.2, GLM 5, MiniMax M2.5, Kimi K2/K2.5, and Llama 4 models. Several improvements have been made for strict OpenAI format compliance. Extensive testing has been done to make sure tool-calling works flawlessly for the supported models. [Documentation]
- Parallel API requests: For llama.cpp, ExLlamaV3, and TensorRT-LLM loaders, it is now possible to make concurrent API requests for maximum throughput. For llama.cpp, it is necessary to use the `--parallel N` option and multiply the context length by `N`. [Documentation]
- Training overhaul (documentation): The training code has been completely rewritten. It is now fully in line with axolotl for both raw text training and chat training.
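Client-side, concurrency is just a matter of firing requests from multiple threads; the sketch below substitutes a placeholder `send_request` for the actual HTTP POST to the completions endpoint:

```python
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt: str) -> str:
    # Placeholder for an HTTP POST to /v1/completions on the local server.
    return f"completion for: {prompt}"

prompts = ["Hello", "Bonjour", "Hola", "Ciao"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() preserves input order even though requests run concurrently
    results = list(pool.map(send_request, prompts))
```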
  - For chat training, datasets in OpenAI `messages` format or ShareGPT `conversations` format are now used. Multi-turn chats are supported, with correct masking of user inputs so that training only happens on assistant messages. See `user_data/training/example_messages.json` and `user_data/training/example_sharegpt.json` for examples.
  - For raw text training, JSONL files are used, with correct BOS and EOS addition for each sub-document. See `user_data/training/example_text.json` for an example input.
  - Chat training now uses jinja2 templates for formatting prompts. You can use either the model's built-in template (if it has one) or a custom user-provided template.
  - New "Target all linear layers" checkbox that applies LoRA to every `nn.Linear` layer except `lm_head`. It works for any model architecture.
  - Checkpoint resumption: HF Trainer checkpoint directories are detected automatically and training resumes with full optimizer/scheduler state.
  - All training input parameters now have good, reviewed default values.
  - Conversations exceeding the cutoff length are now dropped instead of silently truncated (configurable).
  - Dynamic padding (chat datasets): batches are now padded to the longest sequence in each batch instead of always padding to `cutoff_len`, reducing wasted computation.
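For reference, one record in the OpenAI `messages` format looks like the following; the concrete strings are illustrative, and the bundled example files are the authoritative reference for the file layout:

```python
import json

# During training, loss is computed only on the assistant turn; the
# system and user turns are masked out.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ]
}
line = json.dumps(record)
```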
- llama.cpp
  - `--fit` support: GPU layers now default to `-1` (auto), letting llama.cpp determine the optimal number of layers and GPU split automatically. The new `--fit-target` parameter controls how much VRAM headroom to leave per GPU (default: 1024 MiB). Context size can also be set to `0` to let llama.cpp determine that automatically as well.
  - Integrate N-gram speculative decoding support for faster generation without the need for a draft model, through the `--spec-type`, `--spec-ngram-size-n`, `--spec-ngram-size-m`, and `--spec-ngram-min-hits` parameters. Good defaults are provided; just change `--spec-type` to `ngram-mod` to activate.
  - Binaries now work for any CPU instruction set (AVX, AVX2, AVX-512) by autodetecting at runtime, replacing the old separate AVX/AVX2 builds.
- Add ROCm portable builds for Windows.
- Add CUDA 13.1 portable builds.
- Add back macOS x86_64 (Intel) portable builds.
- Smaller CUDA binaries after improving compilation flags.
- Compilation workflows at oobabooga/llama-cpp-binaries have been fully audited and aligned with upstream.
- Handle SIGTERM to properly stop llama-server on pkill.
- llama-server is now spawned on port 5005 by default instead of a random port.
- Adaptive-p sampler for llama.cpp, Transformers, ExLlamaV3, and ExLlamaV3_HF loaders. This sampler reshapes the logit distribution to favor tokens near a target probability.
- New CLI flags to set default API generation parameters: `--temperature`, `--min-p`, `--top-k`, `--repetition-penalty`, etc., and also `--enable-thinking`, `--reasoning-effort`, and `--chat-template-file`. The last parameter accepts `.jinja` or `.yaml` files.
- Chat completion requests are now ~85 ms faster after optimizations.
- SSE separator for streaming over the API changed from `\r\n` to `\n` to match OpenAI.
- Migrate TensorRT-LLM from the old ModelRunner API to the new LLM API, which can take any Transformers model as input and has more sampling parameters.
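For client code, the separator change means events are delimited by a blank line using plain `\n`; a minimal parser sketch:

```python
def parse_sse(stream: str) -> list:
    """Collect `data:` payloads from an OpenAI-style SSE stream."""
    events = []
    for block in stream.split("\n\n"):  # events are separated by \n\n
        for line in block.splitlines():
            if line.startswith("data: "):
                events.append(line[len("data: "):])
    return events
```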
- Security
  - Prevent path traversal on file save/delete operations for characters, users, and uploaded files.
  - Restrict model loading over API to block `extra_flags` and `trust_remote_code` parameters.
  - Restrict file writes to the `user_data_dir`.
- New `--user-data-dir` flag to customize the user data directory location. The program also auto-detects a `../user_data` folder in portable mode if present, making updates easier.
- User persona support: A new dropdown in the Character settings tab lets you save and load user profiles (name, bio, profile picture), so you can switch between different personas without re-entering your details (#7367). Thanks, @q5sys.
- Replace PyPDF2 with pymupdf for much more accurate conversion of PDF inputs to text.
- Markdown rendering improvements, all by @mamei16.
- Add Qwen 3.5 thinking block support to the UI.
- Add Solar Open thinking block support to the UI.
- Update the entire documentation to match the current code.
- Update all dockerfiles. [Documentation]
- Update the Google Colab notebook.
- Remove the ExLlamaV2 loader, which has been archived. EXL2 users should migrate to EXL3, which has much better quantization accuracy.
- Remove the Training_PRO extension, which has become obsolete after the Training tab updates.
- Remove obsolete DeepSpeed inference code from 2023.
- Remove unused colorama and psutil dependencies.
- Update outdated GitHub Actions versions (#7384). Thanks, @pgoslatara.
Bug fixes
- Fix `temperature_last` having no effect in llama.cpp server sampler order.
- Fix code block copy button not working over HTTP (Clipboard API fallback) (#7358). Thanks, @jakubartur.
- Fix message copy buttons not working over HTTP (extend Clipboard API fallback).
- Fix ExLlamaV3 CFG cache initialization and speculative decoding parameter handling.
- Fix blank prompt dropdown in Notebook/Default tabs on first startup.
- Use absolute Python path in Windows batch scripts to fix some rare edge cases.
- Bump sentence-transformers from 2.2.2 to 3.3.1 in superbooga (#7406). Thanks, @OiPunk.
- Fix installer state being saved before requirements were fully installed.
- Fix ExLlamav3 race condition that could cause AssertionError or hangs during generation.
- Fix API server continuing to generate tokens after client disconnects for non-streaming requests.
Dependency updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/6fce5c6a7dba6a3e1df0aad1574b78d1a1970621
- Update Transformers to 5.3
- Update ExLlamaV3 to 0.0.23
- Update TensorRT-LLM to 1.1.0
- Update PyTorch to 2.9.1
- Update Python to 3.13
- Update ROCm wheels to ROCm 6.4
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip/extract, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda13.1`, or `cuda12.4` if you have older drivers.
  - AMD/Intel GPU: Use `vulkan` builds.
  - AMD GPU (ROCm): Use `rocm` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
  - Intel: Use `macos-x86_64`.
Updating a portable install:
- Download and extract the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.
Starting with 4.0, you can also move user_data one folder up, next to the install folder. It will be detected automatically, making updates easier:
text-generation-webui-4.0/
text-generation-webui-4.1/
user_data/  <-- shared by both installs

v3.23
Changes
- Improve the style of tables and horizontal separators in chat messages
Bug fixes
- Fix loading models which have their eos token disabled (#7363). Thanks, @jin-eld.
- Fix a symbolic link issue in llama-cpp-binaries while updating non-portable installs
Backend updates
- Update llama.cpp to https://github.com/ggml-org/llama.cpp/tree/55abc393552f3f2097f168cb6db4dc495a514d56
- Update bitsandbytes to 0.49
Portable builds
Below you can find self-contained packages that work with GGUF models (llama.cpp) and require no installation! Just download the right version for your system, unzip, and run.
Which version to download:
- Windows/Linux:
  - NVIDIA GPU: Use `cuda12.4`.
  - AMD/Intel GPU: Use `vulkan` builds.
  - CPU only: Use `cpu` builds.
- Mac:
  - Apple Silicon: Use `macos-arm64`.
Updating a portable install:
- Download and unzip the latest version.
- Replace the `user_data` folder with the one in your existing install. All your settings and models will be moved.

