llama.cpp SYNC by akapoor3518 · Pull Request #45 · tsisw/llama.cpp

akapoor3518 · 2025-09-02T20:24:37Z

llama.cpp SYNC

Regenerate docs/ops/Metal.csv using test-backend-ops on Apple M5 and rebuild docs/ops.md via scripts/create_ops_docs.py.

Five ops were incorrectly marked as not supported (❌) for Metal:

- DIAG: ❌ → ✅
- POOL_1D: ❌ → ✅
- SET: ❌ → ✅
- SOLVE_TRI: ❌ → ✅
- GATED_DELTA_NET: ❌ → 🟡 (partial, depends on head_size % 32)

* CANN: add BF16 support for core operators

Add BF16 (bfloat16) type support to the CANN backend for the following operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and OUT_PROD. This enables BF16 models to run on Ascend NPUs.

* CANN: skip NZ weight format for BF16 and add 310P compile guards

NZ weight format conversion does not support BF16 tensors, so skip it in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P guards for all BF16 operator support since 310P does not support BF16.

Address GHSA-645x-v54x-34w8. When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers can underflow (unsigned wrap), which corrupts n_layer_kv_from_start. Assert nextn_predict_layers immediately after parsing the GGUF key. Found-by: Pwno
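The wrap described above can be reproduced in isolation. The following is a minimal sketch, not the llama.cpp code: `n_layer_without_nextn` and `unchecked_sub` are hypothetical helper names, and the sketch throws an exception where the actual fix uses an assertion.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Unchecked unsigned subtraction, shown only to illustrate the wrap:
// with 32-bit unsigned operands, 2 - 3 yields 4294967295 rather than -1.
inline uint32_t unchecked_sub(uint32_t a, uint32_t b) {
    return a - b;
}

// Hypothetical helper (not llama.cpp API): number of layers left once the
// trailing next-token-prediction layers are excluded. The value is validated
// before the subtraction so the result can never wrap.
inline uint32_t n_layer_without_nextn(uint32_t n_layer, uint32_t nextn_predict_layers) {
    if (nextn_predict_layers >= n_layer) {
        throw std::invalid_argument("nextn_predict_layers must be smaller than n_layer");
    }
    return n_layer - nextn_predict_layers;
}
```

Checking the value as soon as it is read keeps every later computation that depends on the difference inside its valid range.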

* context: zero output buffer on allocation

Address GHSA-wqq9-25mr-rw76. The logits output buffer allocated in output_reserve() uses posix_memalign(), which does not zero memory. The buffer is only written during decode when needs_raw_logits() returns true. When backend samplers cover all output sequences, needs_raw_logits() returns false and the buffer is never written, but llama_get_logits() still returns a pointer to it, exposing stale heap content.

Zero the buffer after allocation to prevent information disclosure through the public logits API.

Found-by: Pwno

* Update src/llama-context.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
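The allocate-then-clear pattern from this commit can be sketched as follows. This is an illustration under assumptions: `alloc_zeroed_logits` and the 64-byte alignment are hypothetical, and the real change lives in src/llama-context.cpp. `posix_memalign()` is POSIX-only and leaves the returned memory uninitialized.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// Hypothetical helper: allocate an aligned float buffer and clear it right
// away, so a caller that reads before any write sees zeros instead of
// whatever was previously stored on the heap.
inline float * alloc_zeroed_logits(size_t n_floats, size_t alignment = 64) {
    void * ptr = nullptr;
    if (posix_memalign(&ptr, alignment, n_floats * sizeof(float)) != 0) {
        return nullptr;  // allocation failed
    }
    std::memset(ptr, 0, n_floats * sizeof(float));
    return static_cast<float *>(ptr);
}
```

The caller releases the buffer with `free()`. The extra `memset` is a one-time cost per allocation, not per decode.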

…20662)

* vulkan: change gated_delta_net to shard a column across a subgroup

This is based on #20391. I used an LLM to port the CUDA code to Vulkan, and guided it to make various fixes to work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of subgroup to invocation id, using subgroupAdd optionally, etc.).

This fixes a perf regression from the transposing of the values in memory (!20443).

* vulkan: Spread columns across fewer lanes to reduce the number of workgroups

* Check granite hybrid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B
* Use feed fwd dim instead of num of experts

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* HunyuanOCR: add support for text and vision models

- Add HunyuanOCR vision projector (perceiver-based) with Conv2d merge
- Add separate HUNYUAN_OCR chat template (content-before-role format)
- Handle HunyuanOCR's invalid pad_token_id=-1 in converter
- Fix EOS/EOT token IDs from generation_config.json
- Support xdrope RoPE scaling type
- Add tensor mappings for perceiver projector (mm.before_rms, mm.after_rms, etc.)
- Register HunYuanVLForConditionalGeneration for both text and mmproj conversion

* fix proper mapping
* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* address comments
* update
* Fix typecheck
* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…21478) Check the return value of sink.write() in the chunked content provider and return false when the write fails, matching cpp-httplib's own streaming contract. This prevents logging chunks as sent when the sink rejected them and properly aborts the stream on connection failure.
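The contract described above (stop and report failure as soon as the sink rejects a write) can be sketched with a stand-in type. `Sink` below is a simplified assumption that only mirrors the shape of a streaming sink with a `write` callback; it is not the cpp-httplib type, and `send_chunks` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for a streaming sink (assumption, not the cpp-httplib type):
// write() returns false when the peer can no longer accept data.
struct Sink {
    std::function<bool(const char * data, size_t len)> write;
};

// Sends chunks in order and stops at the first rejected write.
// n_sent counts only the chunks the sink actually accepted, so a caller
// never logs a rejected chunk as sent.
inline bool send_chunks(Sink & sink, const std::vector<std::string> & chunks, size_t & n_sent) {
    n_sent = 0;
    for (const std::string & chunk : chunks) {
        if (!sink.write(chunk.data(), chunk.size())) {
            return false;  // propagate the failure so the stream is aborted
        }
        ++n_sent;
    }
    return true;
}
```

Returning `false` from the provider is what lets the caller tear down the stream instead of continuing to produce data for a closed connection.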

* Write an optimized flash_attn_stream_k_fixup kernel

Write a specialized and more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst. Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst.

* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original

* llama-cli: fix stripping of \n in multiline input
* Change & string to string_view
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix EditorConfig linter error

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU)
* add generic fallback for x86
* remove Q1_0 (group size 32)
* rename Q1_0_g128 => Q1_0
* fix Q1_0 LlamaFileType Enum
* Fix trailing spaces; add generic fallback for other backends
* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix /r/n spacing + arch-fallback

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

…21527)

Extend the existing reorder optimization to Q8_0. The reorder separates scale factors from weight data for coalesced memory access. It was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.

On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x) on Qwen3.5-27B. BW utilization: 21% -> 66%.

The key fix beyond the kernels: Q8_0 was missing from the type check in ggml_backend_sycl_buffer_init_tensor() that allocates the extra struct carrying the reorder flag, so the optimization was silently skipped.

AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware.

Fixes: #21517
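To make the reorder idea concrete, here is a host-side sketch of splitting interleaved blocks into one contiguous array of quantized values and one contiguous array of scales. It is an illustration only: `BlockQ8`, `ReorderedQ8`, `reorder` and `dequant_at` are hypothetical names, the scale is stored as `float` for brevity (Q8_0 stores it as fp16), and the actual SYCL buffer layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t QK = 32;  // values per block, as in Q8_0

// Interleaved (array-of-structs) layout: each block carries its own scale.
struct BlockQ8 {
    float  d;        // scale (fp16 in the real format; float here for brevity)
    int8_t qs[QK];   // quantized values
};

// Reordered (struct-of-arrays) layout: quants and scales live in separate
// contiguous arrays, so neighbouring work-items read neighbouring bytes.
struct ReorderedQ8 {
    std::vector<int8_t> qs;
    std::vector<float>  d;
};

inline ReorderedQ8 reorder(const std::vector<BlockQ8> & blocks) {
    ReorderedQ8 out;
    out.qs.reserve(blocks.size() * QK);
    out.d.reserve(blocks.size());
    for (const BlockQ8 & b : blocks) {
        out.qs.insert(out.qs.end(), b.qs, b.qs + QK);
        out.d.push_back(b.d);
    }
    return out;
}

// Dequantize element i from the reordered layout.
inline float dequant_at(const ReorderedQ8 & r, size_t i) {
    return r.d[i / QK] * static_cast<float>(r.qs[i]);
}
```

The values are unchanged by the reorder; only the memory access pattern differs, which is consistent with the bandwidth-utilization gain reported above.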

* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content

* Clean up commented duplicate function

Remove the commented-out duplicate transformMdastNode function that was left over from refactoring.

* Fix Arabic RTL text rendering in web UI

- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements

This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.

* Implement rehype plugin for comprehensive RTL text support

- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content

* Fix RTL text rendering with rehype plugin and cleanup
* fix: prettier formatting

#21257)

* unicode : add custom Qwen2 regex handler to fix segfault on long input

std::regex uses recursive backtracking internally, which causes a stack overflow (segfault) when tokenizing long sequences of repeated characters (e.g. 43K 'A's). The Qwen2 tokenizer regex differs from Llama3 only in the digit pattern (\p{N} vs \p{N}{1,3}), so it was falling through to the std::regex fallback path instead of using a custom handler.

Add unicode_regex_split_custom_qwen2() following the established pattern used by gpt2, llama3, kimi_k2, and afmoe custom handlers.

Closes: #21113

* cont : remove TODO comment
* cont : update comment to reflect original regex
* use the correct regex in the comment this time... [no ci]

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>
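The difference between the two digit patterns, and why a hand-written loop avoids the stack overflow, can be shown with a deliberately simplified splitter. This is a sketch, not `unicode_regex_split_custom_qwen2()`: `split_simple` is a hypothetical name, it handles ASCII only, and it covers just letter runs and digit groups rather than the full pre-tokenizer pattern.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Simplified, ASCII-only illustration of an iterative splitter.
// max_digits = 1 mimics a single-digit pattern (\p{N});
// max_digits = 3 mimics a grouped pattern (\p{N}{1,3}).
// The loop uses constant stack space regardless of input length.
inline std::vector<std::string> split_simple(const std::string & text, size_t max_digits) {
    assert(max_digits >= 1);
    std::vector<std::string> out;
    size_t i = 0;
    while (i < text.size()) {
        const size_t start = i;
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isdigit(c)) {
            // consume at most max_digits consecutive digits
            while (i < text.size() && i - start < max_digits &&
                   std::isdigit(static_cast<unsigned char>(text[i]))) {
                ++i;
            }
        } else if (std::isalpha(c)) {
            // consume the whole run of letters
            while (i < text.size() && std::isalpha(static_cast<unsigned char>(text[i]))) {
                ++i;
            }
        } else {
            ++i;  // any other character becomes its own piece
        }
        out.push_back(text.substr(start, i - start));
    }
    return out;
}
```

With `max_digits = 1` the input `ab12345` splits into `ab`, `1`, `2`, `3`, `4`, `5`; with `max_digits = 3` it splits into `ab`, `123`, `45`. A run of 43,000 `A` characters comes back as a single piece without any recursion.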

akapoor3518 changed the base branch from master to llama.cpp-syn-sept2 September 2, 2025 20:28

github-actions bot added the documentation, Apple Metal, SYCL, Nvidia GPU, Vulkan, IBM zDNN, testing, build, examples, devops, python, script, android, server, ggml, nix, Ascend NPU, and OpenCL labels Sep 17, 2025

github-actions bot added the model label Nov 4, 2025

ggerganov and others added 10 commits March 20, 2026 10:28

ai : update find-related action (#20790)

464fd0e

* ai : update "related issues" prompt
* cont
* cont
* cont

server : improve mtmd ctx checkpoints (#20726)

ab9d4c3

* server : improve mtmd ctx checkpoints
* server : fix off-by-one in pos_min_thold

server: (doc) clarify in-scope and out-scope features (#20794)

fb78ad2

* server: (doc) clarify in-scope and out-scope features
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

ai : do not run bash commands in the prompt (#20810)

b31b30f

M1DNYT3 and others added 30 commits April 5, 2026 09:04

ci: lower cuda12 floor to 12.8.1 for broader host compatibility (#21438)

c08d28d

Co-authored-by: M1DNYT3 <m1dnyt3@MacBookPro.lan>

server : fix logging of build + system info (#21460)

5d3a4a7

This PR changes the logging that occurs at startup of llama-server. Currently, it is redundant (including CPU information twice) and it is missing the build + commit info.

ci : use default RISE RISC-V Runners (#21263)

761797f

llama : correct platform-independent loading of BOOL metadata (#21428)

58190cc

* model-loader : fix GGUF bool array conversion
* model-loader : fix remaining GGUF bool pointer uses

hexagon: slight optimization for argsort output init (#21463)

25eec6f

sycl : handle other FA case (#21377)

f51fd36

convert : set "add bos" == True for Gemma 4 (#21500)

400ac8e

* convert : set "add bos" == True for Gemma 4
* cont : handle old GGUFs

docs: add hunyuan-ocr gguf, also add test [no ci] (#21490)

3979f2b

convert : fix block_ff_dim retrieval for lfm2 (#21508)

941146b

vocab : add byte token handling to BPE detokenizer for Gemma4 (#21488)

4aa962e

llama-bench: add -fitc and -fitt to arguments (#21304)

94ca829

* llama-bench: add `-fitc` and `-fitt` to arguments
* update README.md
* address review comments
* update compare-llama-bench.py

ggml-webgpu: Add the support of MUL_MAT_ID (#21147)

d0a6dfe

* Add mul_mat_id support to WebGPU
* Apply suggestion from @reeselevine

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>

docs: fix typo in build.md (emdawbwebgpu -> emdawnwebgpu) (#21518)

0033f53

fix: Detect streaming state in reasoning content blocks (#21549)

ecce008

ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519)

71a81f6

GGML_CUDA_CC_CDNA2 was set to 0x910. Fix by setting the constant to 0x90a to match the actual gfx90a ISA.
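A small sketch shows why the old value matters. It assumes, purely for illustration, that a device value is obtained by reading the hex digits of the `gfx` name and is then compared against the constant with `>=`; `arch_from_gfx_name`, `is_at_least` and the two constant names are hypothetical and not ggml identifiers.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: interpret the characters after "gfx" as a hex number,
// so "gfx90a" becomes 0x90a.
inline int arch_from_gfx_name(const std::string & name) {
    assert(name.rfind("gfx", 0) == 0);  // name must start with "gfx"
    return std::stoi(name.substr(3), nullptr, 16);
}

constexpr int CC_CDNA2_OLD = 0x910;  // value before the fix
constexpr int CC_CDNA2_NEW = 0x90a;  // value after the fix, matching gfx90a

// Threshold-style check of the kind a capability constant is typically used for.
inline bool is_at_least(int device_cc, int threshold) {
    return device_cc >= threshold;
}
```

Under these assumptions a gfx90a device yields 0x90a, which is below 0x910, so a threshold check against the old constant would not classify it as CDNA2. With 0x90a the check passes.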

webui : store reasoning_content so it is sent back in subsequent requests (#21249)

482192f

vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029)

edd4d9b

Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL in the flash attention base shader. Register them in the shader generator, pipeline creation, and enable in the scalar/coopmat1 FA support check.

ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868) (#20904)

2a619f6

ggml : deprecate GGML_OP_ADD1 (#21363)

22fc791

* ggml : deprecate GGML_OP_ADD1
* cont : remove tests
* cont : re-enable vulkan check

server : fix restore for checkpoints with pos_min == 0 (#21510)

e8f5082

llama: remove per-arch tensor name lists (#21531)

a8ec0df

llama-server: fix model params not propagated (#21509)

69c28f1

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

akapoor3518 wants to merge 3240 commits into tsisw:llama.cpp-syn-sept2 from ggml-org:master

20 participants
