Remove Unprintable #26

Closed
beiller wants to merge 1 commit into ggml-org:master from beiller:feature/remove_unprintable

Conversation

@beiller
Contributor

@beiller beiller commented Mar 11, 2023

Fixes #11

This fixes a Japanese prompt I was attempting to run, e.g.:

`./main -m ./models/13B/ggml-model-q4_0.bin -t 8 -n 128 -n 512 -p $'人生の意味は'`

Output before change:

`人生の意���、フロントカードに���いてる。 2019年3月 © All Rights Reserved. [end of text]`

So it is outputting some characters correctly, but others come through as �.
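The � glyphs are U+FFFD replacement characters: the tokenizer can split a multi-byte UTF-8 sequence across several tokens, and printing each token the moment it arrives sends an incomplete sequence to the terminal. A minimal sketch of detecting such a split (the function name is mine, not llama.cpp's):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes at the end of `s` that form an incomplete UTF-8
// sequence (0 if the string ends on a codepoint boundary). A printer
// can hold these bytes back and prepend them to the next token instead
// of sending a broken sequence to the terminal.
size_t utf8_incomplete_tail(const std::string & s) {
    for (size_t back = 1; back <= 4 && back <= s.size(); ++back) {
        unsigned char c = s[s.size() - back];
        if ((c & 0xC0) == 0x80) continue;            // continuation byte, keep scanning
        if ((c & 0x80) == 0x00) return 0;            // ASCII: sequence complete
        size_t need = (c & 0xE0) == 0xC0 ? 2         // lead byte: expected length
                    : (c & 0xF0) == 0xE0 ? 3
                    : (c & 0xF8) == 0xF0 ? 4 : 1;    // 1 = invalid lead, treat as complete
        return back < need ? back : 0;               // short sequence -> hold `back` bytes
    }
    return 0; // only continuation bytes in the last 4 positions: malformed, print as-is
}
```

For example, "人" is the three bytes E4 BA BA; a token stream that ends after E4 BA reports a 2-byte tail to hold back.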

Output after change:

`人生の意 当者: Dr. Yukari Takamatsu 作成時間: 2015年9月8日(金)、第3回ルプセンター上研修会「Mini-Workshop」で学生がしたことについて書き伝えます。 ニュアスミレショナの実行は、10位けんだあるから重要なメッセージを与り開くがうれやで報告したことについて書き伝えます。 当者: Dr. Yukari Takamatsu, MD PhD FRCR FRCP (Hon) Prof Emeritus of Hokkaido Univ School Med Sys Biol and Nanboku University Medical Sch Professor at Imperial College London Senior Member ESMO IASLC`

@beiller
Contributor Author

beiller commented Mar 12, 2023

Closing the PR because what is really needed is a different tokenization mechanism; see the discussion in #11.
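A hedged sketch of the direction that discussion points at (the struct and its names are mine, not llama.cpp's): rather than stripping bytes, buffer each token's raw bytes and flush only whole UTF-8 codepoints, carrying an unfinished tail into the next token so split characters are printed intact.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical buffering printer: accumulates token bytes and emits
// only complete UTF-8 codepoints, holding an unfinished tail back
// until the next token supplies the missing continuation bytes.
struct TokenPrinter {
    std::string pending; // unfinished tail carried between tokens
    std::string out;     // stand-in for writing to stdout

    // Bytes at the end of `s` that belong to an unfinished codepoint.
    static size_t tail(const std::string & s) {
        for (size_t back = 1; back <= 4 && back <= s.size(); ++back) {
            unsigned char c = s[s.size() - back];
            if ((c & 0xC0) == 0x80) continue;         // continuation byte
            if ((c & 0x80) == 0x00) return 0;         // ASCII: complete
            size_t need = (c & 0xE0) == 0xC0 ? 2
                        : (c & 0xF0) == 0xE0 ? 3
                        : (c & 0xF8) == 0xF0 ? 4 : 1; // 1 = invalid lead
            return back < need ? back : 0;
        }
        return 0;
    }

    void feed(const std::string & token_bytes) {
        pending += token_bytes;
        const size_t hold = tail(pending);
        out += pending.substr(0, pending.size() - hold);
        pending.erase(0, pending.size() - hold);
    }
};
```

Feeding the two halves of a split "人" emits nothing after the first token and the whole character after the second, instead of two broken fragments.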

@beiller beiller closed this Mar 12, 2023
Hades32 pushed a commit to Hades32/llama.cpp that referenced this pull request Mar 21, 2023
add easy Windows install instructions to the readme
flowgrad pushed a commit to flowgrad/llama.cpp that referenced this pull request Jun 27, 2023
fix bug: Parameter --reverse-prompt won't accept text
jesusmb1995 pushed a commit to jesusmb1995/llama.cpp that referenced this pull request Oct 30, 2025
rururush pushed a commit to USTC-ADSL/llama.cpp that referenced this pull request Mar 16, 2026
* fix warning

* wip

* add todo for graph key generate

* rename some file to meet upstream guideline

* remove local .clang-format

* extend supported/unsupported counter to all ops

* append device name to log

* port to ggml logger

* fix warning after adapt to ggml logger

* append \n to all log

* use case op instead of convert

* Revert "use case op instead of convert"

This reverts commit e662fc2.

* fix op that needs same shape

* opt kQnnOpsTable

* refresh params name field when getting op config

* opt npu log print

* remove unused functions
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request Mar 25, 2026
…gml-org#26

Massive reduction in constant memory and compute:
- 256KB of dense matrices → 512 bytes of sign arrays
- O(d²) = 16,384 ops → O(d log d) = 896 ops per rotation
- Metal shader file: 1.5MB → 432KB

Speed: still 2.4 tok/s. WHT reduced per-rotation cost but the
bottleneck is redundant calls (8-32× per block from flash attention).
The dequantize function is called per 4/16-element chunk, each time
doing the full 128-element WHT. Need to modify the flash attention
kernel to dequantize once per block.

Quality: WHT+signs gives BETTER quality than dense QR on real KV
tensors (cosine 0.94 vs 0.79 at 2-bit). Sub-Gaussian distribution
(kurtosis 1.53) means fewer outliers hitting extreme centroids.

Reviewed by Codex: WHT butterfly correct, inverse order verified,
QJL correction matches reference C implementation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
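The commit message above can be made concrete with a small sketch (assumptions: d is a power of two and `sign` holds +1/-1 entries; this is a plain C-style illustration of the technique, not the repo's Metal kernel). The orthonormal Walsh-Hadamard butterfly mixes d values in d·log2(d) additions/subtractions — 896 for d = 128 — and the per-dimension sign flips replace a stored d×d rotation matrix with d bytes of state:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// In-place randomized Walsh-Hadamard rotation: sign flips, then the
// log2(d) butterfly stages, then 1/sqrt(d) normalization so the
// transform is orthonormal.
void wht_rotate(float * x, const signed char * sign, size_t d) {
    for (size_t i = 0; i < d; ++i) x[i] *= sign[i];  // random sign flips
    for (size_t h = 1; h < d; h <<= 1) {             // log2(d) butterfly stages
        for (size_t i = 0; i < d; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                const float a = x[j], b = x[j + h];
                x[j]     = a + b;
                x[j + h] = a - b;
            }
        }
    }
    const float scale = 1.0f / sqrtf((float) d);     // orthonormal scaling
    for (size_t i = 0; i < d; ++i) x[i] *= scale;
}
```

With unit signs the transform is its own inverse (H·H = d·I, and the two 1/√d factors cancel); with nontrivial signs the inverse applies the same butterfly first and the sign flips last, which is presumably the "inverse order" the commit message says was verified.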
TheTom added a commit to TheTom/llama-cpp-turboquant that referenced this pull request Mar 26, 2026
didlawowo pushed a commit to didlawowo/llama.cpp that referenced this pull request Mar 27, 2026
Development

Successfully merging this pull request may close these issues.

Unicode support
