ggml-cpu: Fix gcc 15 ICE on ppc64le (#20083)#20130
Conversation
9bf09bb to
37aefa5
Compare
|
This is the assembly that gets generated for the error line with gcc 13.3 with the original code which gives no error. This is the assembly with this change and gcc15: |
|
This is an alternative but what about unaligned loads? Are there any chances of unaligned loads happening and any performance impact?
|
Thanks @taronaeo for the quick reply. You're absolutely right that vec_xl handles unaligned loads. On the Power10 target (-mcpu=power10), both the vec_xl intrinsic and a direct pointer dereference to a vector type lower to the same plxv (Prefixed Load VSX Vector) instruction. Since the hardware instruction is identical, the alignment handling and performance characteristics are 1:1. perf results of llama-bench with gcc13.3 and the original code:
Perf results of llama-bench with gcc15 and the above change:
|
|
Also, interestingly, GCC15 does not throw error at this line llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp Line 2614 in b5ed0e0 probably because here the result of reinterpret_cast is being moved directly to an array slot. Whereas, in the below case where GCC15 errors out, llama.cpp/ggml/src/ggml-cpu/llamafile/sgemm.cpp Line 2500 in b5ed0e0 the result of the cast is being assigned to a new local variable. |
|
Very interesting. Okay it would be good to raise this to the respective development teams internally. We had a bug previously with |
| const block_q4_0 * current_blk = rows_base[r] + blk; | ||
| vector float v_scale = vec_extract_fp32_from_shorth(vec_splats(current_blk->d)); | ||
| vector signed char v_qs = reinterpret_cast<vector signed char>(vec_xl(0, current_blk->qs)); | ||
| vector signed char v_qs = *(const vector signed char *)(const void *)current_blk->qs; |
There was a problem hiding this comment.
Would be good to add a comment above informing that this is a fix to #20083 since maintainers would expect vec_xl here.
There was a problem hiding this comment.
@shalinib-ibm Please let me know if you would be considering this. Otherwise, we're good to merge :)
There was a problem hiding this comment.
@taronaeo agree that using vec_xl is a readable and expected way here.
Below is another valid way of writing the code which works fine with gcc15 .
vector signed char v_qs = vec_xl(0, (const vector signed char *)current_blk->qs);
Can you please take a look now?
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast by doing
a cat on the data first and then calling the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
`plxv 40, 2(14)`
This ensures zero performance regression while unblocking builds on
newer toolchains.
Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9 + GCC 15.1.1 (gcc-toolset-15)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
37aefa5 to
cab9514
Compare
|
Thank you @taronaeo ! |
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast by doing
a cat on the data first and then calling the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
`plxv 40, 2(14)`
This ensures zero performance regression while unblocking builds on
newer toolchains.
Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9 + GCC 15.1.1 (gcc-toolset-15)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with gcc 15 by replacing the intrinsic + cast by doing
a cat on the data first and then calling the intrinsic. This bypasses the
buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
`plxv 40, 2(14)`
This ensures zero performance regression while unblocking builds on
newer toolchains.
Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9 + GCC 15.1.1 (gcc-toolset-15)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
This patch addresses an Internal Compiler Error (Segmentation fault) observed with gcc 15 by replacing the intrinsic + cast with casting the data first and then calling the intrinsic.
This bypasses the buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original code and this fix generate the identical Power10 prefixed load instruction:
plxv 40, 2(14)This ensures zero performance regression while unblocking builds on newer toolchains.
Reproduced on:
Make sure to read the contributing guidelines before submitting a PR