ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083)#20130

Merged
taronaeo merged 1 commit into ggml-org:master from shalinib-ibm:fix_issue_20083
Mar 6, 2026

Conversation

@shalinib-ibm (Contributor) commented Mar 5, 2026

This patch addresses an Internal Compiler Error (segmentation fault) observed with gcc 15 by replacing the intrinsic-then-cast pattern with casting the data pointer first and then calling the intrinsic.
This bypasses the buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original code and this fix generate the identical Power10 prefixed load instruction:
plxv 40, 2(14)

This ensures zero performance regression while unblocking builds on newer toolchains.

Reproduced on:

  • Alpine Linux + GCC 15.2.0-r2
  • RHEL 9 + GCC 15.1.1 (gcc-toolset-15)


@shalinib-ibm (Contributor Author) commented:

This is the assembly gcc 13.3 generates for the failing line with the original code (which compiles without error).

 g++ -mcpu=power10 -O3 -mvsx -S -fverbose-asm -g     -I/home/shalini/llama_5_3_26/llama.cpp/ggml/src -I /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu -I /home/shalini/llama_5_3_26/llama.cpp/ggml/include/ /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp     -o sgemm.s
 
# /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp:2500:                     vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));
 97836         .loc 1 2500 40 view .LVU26813
 97837         plxv 40,2(14)    # v_qs, MEM <__vector signed char> [(void *)_10028 + 2B]
 97838 .LBB24085:
 97839 .LBB24074:

This is the assembly with this change and gcc15:

 g++ -mcpu=power10 -O3 -mvsx -S -fverbose-asm -g     -I/home/shalini/llama_5_3_26/llama.cpp/ggml/src -I /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu -I /home/shalini/llama_5_3_26/llama.cpp/ggml/include/ /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp     -o sgemm_gcc15.s
 
.LBE24857:
 # /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp:2500:                 vector signed char v_qs = *(const vector signed char *)(const void *)current_blk->qs;
        .loc 1 2500 26 view .LVU24176
        plxv 40,2(10)    # v_qs, MEM[(const __vector signed char *)_7982 + 2B]
.LBB24870:
.LBB24859:

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 5, 2026
@taronaeo (Member) commented Mar 5, 2026

This is an alternative but what about unaligned loads? Are there any chances of unaligned loads happening and any performance impact?

vec_xl at least on s390x handles both aligned and unaligned loads so I'd assume it's the same for POWER.

@shalinib-ibm (Contributor Author) commented Mar 5, 2026

> This is an alternative but what about unaligned loads? Are there any chances of unaligned loads happening and any performance impact?
>
> vec_xl at least on s390x handles both aligned and unaligned loads so I'd assume it's the same for POWER.

Thanks @taronaeo for the quick reply.

You're absolutely right that vec_xl handles unaligned loads. On the Power10 target (-mcpu=power10), both the vec_xl intrinsic and a direct pointer dereference to a vector type lower to the same plxv (Prefixed Load VSX Vector) instruction.
I've verified this with the assembly for both paths (see the comment above).

Since the hardware instruction is identical, the alignment handling and performance characteristics are 1:1.
Please let me know your thoughts.

perf results of llama-bench with gcc13.3 and the original code:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 117.52 ± 0.10 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.37 ± 0.00 |

Perf results of llama-bench with gcc15 and the above change:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 116.07 ± 0.14 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.24 ± 0.01 |

@shalinib-ibm (Contributor Author) commented Mar 5, 2026

Also, interestingly, GCC 15 does not throw an error at this line:

c1[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset1->qs));

probably because here the result of the reinterpret_cast is stored directly into an array slot, whereas in the case below, where GCC 15 errors out,

vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));

the result of the cast is being assigned to a new local variable.

@taronaeo (Member) commented Mar 5, 2026

Very interesting. Okay it would be good to raise this to the respective development teams internally.

We had a bug previously with vec_xl as well but seems like a different problem. In case you're interested: #12848

  const block_q4_0 * current_blk = rows_base[r] + blk;
  vector float v_scale = vec_extract_fp32_from_shorth(vec_splats(current_blk->d));
- vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));
+ vector signed char v_qs = *(const vector signed char *)(const void *)current_blk->qs;
@taronaeo (Member) commented on the changed line:

Would be good to add a comment above informing that this is a fix to #20083 since maintainers would expect vec_xl here.

@taronaeo (Member):

@shalinib-ibm Please let me know if you would be considering this. Otherwise, we're good to merge :)

@shalinib-ibm (Contributor Author):

@taronaeo agree that using vec_xl is a readable and expected way here.
Below is another valid way of writing the code which works fine with gcc 15:
vector signed char v_qs = vec_xl(0, (const vector signed char *)current_blk->qs);
Can you please take a look now?

@CISC CISC linked an issue Mar 5, 2026 that may be closed by this pull request
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast with a cast
on the data first, followed by the intrinsic call. This bypasses the
buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
    `plxv 40, 2(14)`

This ensures zero performance regression while unblocking builds on
newer toolchains.

Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9  + GCC 15.1.1 (gcc-toolset-15)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
@taronaeo taronaeo merged commit c6980ff into ggml-org:master Mar 6, 2026
78 checks passed
@shalinib-ibm (Contributor Author)

Thank you @taronaeo !

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026

Labels

ggml changes relating to the ggml tensor library for machine learning

Development

Successfully merging this pull request may close these issues.

Compile bug: internal compiler error: Segmentation fault

2 participants