Closed
Fixes ggml-org#11

This fixes a Japanese prompt I was attempting to run.

E.g.:

`./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'`

Output before change:

`人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]`

So it is outputting some characters, but others come out as �.

Output after change:

`人生の意は、一人が一人ということであります。は安部が立していたので、去からは一人の人にれるのはにとどまったのですが、そう`
Closing the PR because what is really needed is a different tokenization mechanism; see the discussion here:
Hades32 pushed a commit to Hades32/llama.cpp that referenced this pull request on Mar 21, 2023: add easy Windows install instructions to the readme
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this pull request on Jun 27, 2023: fix bug: Parameter --reverse-prompt won't accept text
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request on Oct 30, 2025
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request on Mar 16, 2026:

* fix warning
* wip
* add todo for graph key generate
* rename some file to meet upstream guideline
* remove local .clang-format
* expend supported/unsupported counter to all ops
* append device name to log
* port to ggml logger
* fix warning after adapt to ggml logger
* append \n to all log
* use case op instead of convert
* Revert "use case op instead of convert" This reverts commit e662fc2.
* fix op that needs same shape
* opt kQnnOpsTable
* refresh params name field when getting op config
* opt npu log print
* remove unused functions
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request on Mar 25, 2026:

…gml-org#26

Massive reduction in constant memory and compute:
- 256KB of dense matrices → 512 bytes of sign arrays
- O(d²) = 16,384 ops → O(d log d) = 896 ops per rotation
- Metal shader file: 1.5MB → 432KB

Speed: still 2.4 tok/s. WHT reduced per-rotation cost but the bottleneck is redundant calls (8-32× per block from flash attention). The dequantize function is called per 4/16-element chunk, each time doing the full 128-element WHT. Need to modify the flash attention kernel to dequantize once per block.

Quality: WHT+signs gives BETTER quality than dense QR on real KV tensors (cosine 0.94 vs 0.79 at 2-bit). Sub-Gaussian distribution (kurtosis 1.53) means fewer outliers hitting extreme centroids.

Reviewed by Codex: WHT butterfly correct, inverse order verified, QJL correction matches reference C implementation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request on Mar 26, 2026
didlawowo pushed a commit to didlawowo/llama.cpp that referenced this pull request on Mar 27, 2026
Fixes #11

This fixes a Japanese prompt I was attempting to run.

E.g.:

`./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'`

Output before change:

`人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]`

So it is outputting some characters, but others come out as �.

Output after change:

`人生の意 当者: Dr. Yukari Takamatsu 作成時間: 2015年9月8日(金)、第3回ルプセンター上研修会「Mini-Workshop」で学生がしたことについて書き伝えます。 ニュアスミレショナの実行は、10位けんだあるから重要なメッセージを与り開くがうれやで報告したことについて書き伝えます。 当者: Dr. Yukari Takamatsu, MD PhD FRCR FRCP (Hon) Prof Emeritus of Hokkaido Univ School Med Sys Biol and Nanboku University Medical Sch Professor at Imperial College London Senior Member ESMO IASLC`
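
The mix of readable characters and � in the "before" output is what a decoder typically produces when it is handed an incomplete UTF-8 byte sequence. The following is a minimal Python sketch of that effect, not the change made in this PR. It assumes, purely for illustration, that generated pieces arrive as raw byte strings whose boundaries can fall in the middle of a multi-byte character. The `PIECES` split and both function names are hypothetical.

```python
import codecs

TEXT = "人生の意味は"
RAW = TEXT.encode("utf-8")  # each of these characters takes 3 bytes in UTF-8

# Hypothetical piece boundaries (an assumption for this sketch): the cuts at
# byte 10 and byte 14 both land inside a 3-byte character.
PIECES = [RAW[:10], RAW[10:14], RAW[14:]]


def decode_naive(pieces):
    """Decode every piece on its own; split characters become U+FFFD."""
    return "".join(p.decode("utf-8", errors="replace") for p in pieces)


def decode_buffered(pieces):
    """Decode with an incremental decoder that holds back incomplete bytes."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    out = [decoder.decode(p) for p in pieces]
    # Flush: any bytes still pending at the end are genuinely malformed.
    out.append(decoder.decode(b"", final=True))
    return "".join(out)


if __name__ == "__main__":
    print(decode_naive(PIECES))     # contains U+FFFD replacement characters
    print(decode_buffered(PIECES))  # prints the original text
```

Buffering of this kind only avoids printing half a character. It does not change how the prompt is tokenized, which the closing comment identifies as the real need.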