ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083)#20130

Merged
taronaeo merged 1 commit into ggml-org:master from shalinib-ibm:fix_issue_20083
Mar 6, 2026

Conversation

@shalinib-ibm (Contributor) commented Mar 5, 2026

This patch addresses an Internal Compiler Error (segmentation fault) observed with gcc 15 by replacing the intrinsic-then-cast pattern with casting the data pointer first and then calling the intrinsic.
This bypasses the buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original code and this fix generate the identical Power10 prefixed load instruction:
plxv 40, 2(14)

This ensures zero performance regression while unblocking builds on newer toolchains.

Reproduced on:

  • Alpine Linux + GCC 15.2.0-r2
  • RHEL 9 + GCC 15.1.1 (gcc-toolset-15)


@shalinib-ibm (Contributor Author) commented:

This is the assembly gcc 13.3 generates for the failing line with the original code (which compiles without error).

 g++ -mcpu=power10 -O3 -mvsx -S -fverbose-asm -g     -I/home/shalini/llama_5_3_26/llama.cpp/ggml/src -I /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu -I /home/shalini/llama_5_3_26/llama.cpp/ggml/include/ /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp     -o sgemm.s
 
# /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp:2500:                     vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));
 97836         .loc 1 2500 40 view .LVU26813
 97837         plxv 40,2(14)    # v_qs, MEM <__vector signed char> [(void *)_10028 + 2B]
 97838 .LBB24085:
 97839 .LBB24074:

This is the assembly with this change and gcc15:

 g++ -mcpu=power10 -O3 -mvsx -S -fverbose-asm -g     -I/home/shalini/llama_5_3_26/llama.cpp/ggml/src -I /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu -I /home/shalini/llama_5_3_26/llama.cpp/ggml/include/ /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp     -o sgemm_gcc15.s
 
.LBE24857:
 # /home/shalini/llama_5_3_26/llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp:2500:                 vector signed char v_qs = *(const vector signed char *)(const void *)current_blk->qs;
        .loc 1 2500 26 view .LVU24176
        plxv 40,2(10)    # v_qs, MEM[(const __vector signed char *)_7982 + 2B]
.LBB24870:
.LBB24859:

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 5, 2026
@taronaeo (Member) commented Mar 5, 2026

This is an alternative but what about unaligned loads? Are there any chances of unaligned loads happening and any performance impact?

vec_xl at least on s390x handles both aligned and unaligned loads so I'd assume it's the same for POWER.

@shalinib-ibm (Contributor Author) commented Mar 5, 2026

> This is an alternative but what about unaligned loads? Are there any chances of unaligned loads happening and any performance impact?
>
> vec_xl at least on s390x handles both aligned and unaligned loads so I'd assume it's the same for POWER.

Thanks @taronaeo for the quick reply.

You're absolutely right that vec_xl handles unaligned loads. On the Power10 target (-mcpu=power10), both the vec_xl intrinsic and a direct pointer dereference to a vector type lower to the same plxv (Prefixed Load VSX Vector) instruction.
I've verified this with the assembly for both paths (see the comment above).

Since the hardware instruction is identical, the alignment handling and performance characteristics are 1:1.
Please let me know your thoughts.

perf results of llama-bench with gcc13.3 and the original code:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 117.52 ± 0.10 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.37 ± 0.00 |

Perf results of llama-bench with gcc15 and the above change:

| model | size | params | backend | threads | test | t/s |
| ----- | ---- | ------ | ------- | ------- | ---- | --- |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | pp512 | 116.07 ± 0.14 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CPU | 10 | tg128 | 13.24 ± 0.01 |

@shalinib-ibm (Contributor Author) commented Mar 5, 2026

Also, interestingly, GCC 15 does not throw an error at this line:

c1[1] = reinterpret_cast<vector signed char>(vec_xl(0, aoffset1->qs));

probably because here the result of the reinterpret_cast is stored directly into an array slot, whereas in the case below, where GCC 15 errors out,

vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));

the result of the cast is being assigned to a new local variable.

@taronaeo (Member) commented Mar 5, 2026

Very interesting. Okay it would be good to raise this to the respective development teams internally.

We had a bug previously with vec_xl as well but seems like a different problem. In case you're interested: #12848

  const block_q4_0 * current_blk = rows_base[r] + blk;
  vector float v_scale = vec_extract_fp32_from_shorth(vec_splats(current_blk->d));
- vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs));
+ vector signed char v_qs = *(const vector signed char *)(const void *)current_blk->qs;
@taronaeo (Member) commented on the changed line:

Would be good to add a comment above informing that this is a fix to #20083 since maintainers would expect vec_xl here.

@taronaeo (Member):

@shalinib-ibm Please let me know if you would be considering this. Otherwise, we're good to merge :)

@shalinib-ibm (Contributor Author):

@taronaeo agree that using vec_xl is a readable and expected way here.
Below is another valid way of writing the code which works fine with gcc 15:
vector signed char v_qs = vec_xl(0, (const vector signed char *)current_blk->qs);
Can you please take a look now?

@CISC CISC linked an issue Mar 5, 2026 that may be closed by this pull request
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast with a cast
on the data first, followed by the intrinsic call. This bypasses the
buggy compiler path while maintaining identical instruction selection.

Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
    `plxv 40, 2(14)`

This ensures zero performance regression while unblocking builds on
newer toolchains.

Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9  + GCC 15.1.1 (gcc-toolset-15)

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
@taronaeo taronaeo merged commit c6980ff into ggml-org:master Mar 6, 2026
78 checks passed
@shalinib-ibm (Contributor Author)

Thank you @taronaeo !

bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 10, 2026
Ethan-a2 pushed a commit to Ethan-a2/llama.cpp that referenced this pull request Mar 20, 2026

Labels

ggml changes relating to the ggml tensor library for machine learning

Development

Successfully merging this pull request may close these issues.

Compile bug: internal compiler error: Segmentation fault

2 participants