ggml-cuda : fix INT_MAX overflow in cpy kernels (#18140) #18340
Muhammad-Kamran-Khan wants to merge 3 commits into ggml-org:master
Hi @JohannesGaessler, @lilblam Thanks!
// determine indices i03/i13, i02/i12, i01/i11, i00/i10 as a function of index i of flattened tensor
// then combine those indices with the corresponding byte offsets to get the total offsets
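The two comment lines above describe a standard flat-index decomposition. As a rough host-side illustration (not the code from `cpy.cu`), the sketch below splits a flat element index into four per-dimension indices and combines them with byte strides. The struct `Shape4` and the function `flat_to_byte_offset` are hypothetical names introduced only for this example, and every intermediate value is kept in `int64_t` so that tensors larger than 2GB do not wrap.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration only; they are not part of ggml.
struct Shape4 {
    int64_t ne0, ne1, ne2;      // element counts of the three innermost dimensions
    int64_t nb0, nb1, nb2, nb3; // byte stride of each dimension
};

// Maps a flat element index i to a byte offset.
// All arithmetic is 64-bit, so i and the result may exceed INT_MAX.
inline int64_t flat_to_byte_offset(int64_t i, const Shape4 & s) {
    const int64_t i3 = i / (s.ne0 * s.ne1 * s.ne2);    // outermost index
    const int64_t r3 = i - i3 * s.ne0 * s.ne1 * s.ne2; // remainder inside that slice
    const int64_t i2 = r3 / (s.ne0 * s.ne1);
    const int64_t r2 = r3 - i2 * s.ne0 * s.ne1;
    const int64_t i1 = r2 / s.ne0;
    const int64_t i0 = r2 - i1 * s.ne0;                // innermost index
    return i0 * s.nb0 + i1 * s.nb1 + i2 * s.nb2 + i3 * s.nb3;
}
```

For a contiguous tensor the result is simply `i` times the element size. The decomposition matters when source and destination use different strides, which is why a copy kernel computes one offset per tensor.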
You are expected to read the contribution guidelines where it clearly states:
Using AI to generate PRs is permitted. However, you must (1) explicitly disclose how AI was used and (2) conduct a thorough manual review before publishing the PR. Note that trivial tab autocompletions do not require disclosure.
This PR is not acceptable in terms of quality control.
Apologies for the oversight. Per the guidelines, I disclose that AI was used to assist with these type changes. I have now manually reviewed the code, restored the original comments, and fixed the formatting. Ready for re-review.
Added comments to clarify index calculations and assertions.
Description
This PR addresses the crash reported in #18140, where loading a context of ~126k tokens with a large batch size causes an `INT_MAX` overflow in the CUDA `cpy` kernels.
Motivation and Context
The crash is triggered by `ggml_nbytes(src0)` exceeding the signed 32-bit integer limit (~2.14GB). The existing kernel logic in `ggml-cuda/cpy.cu` uses `int` for element counts (`ne`) and byte offsets. When processing large contexts (e.g., Qwen3-Next-80B with 128k context), this results in integer overflow and assertion failures.
Changes
- Changed the size and stride variables (`ne`, `ne00`, `nb00`, etc.) from `int` to `int64_t` in `cpy_scalar`, `cpy_f32_q`, and related templates.
- Added `(int64_t)` casting to thread index calculations (e.g., `blockDim.x * blockIdx.x`) to ensure 64-bit arithmetic is used for memory addressing.
- Removed the `GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX)` checks to allow processing tensors larger than 2GB.
Testing Status
Important Note: I have implemented this fix based on the stack trace analysis and the clear integer overflow root cause.
No Local Verification: I was unable to verify this fix locally due to hardware limitations (GitHub Codespaces resource constraints prevented full compilation, and I do not have access to the high-VRAM hardware required to reproduce the 126k token crash).
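Even without a GPU, the arithmetic behind the casts listed under Changes can be sanity-checked on the host. The sketch below is not code from `ggml-cuda/cpy.cu`. The function names are hypothetical, and it assumes only that CUDA's `blockDim.x`, `blockIdx.x`, and `threadIdx.x` are `unsigned int`, so a product formed without a cast wraps modulo 2^32 before it is stored. It also shows the size check that the removed assertion used to enforce.

```cpp
#include <climits>
#include <cstdint>

// Emulates the index computed WITHOUT a cast: the 32-bit unsigned product
// wraps modulo 2^32 before it is widened. (Hypothetical name.)
inline int64_t global_index_32(uint32_t block_dim, uint32_t block_idx, uint32_t thread_idx) {
    const uint32_t wrapped = block_dim * block_idx + thread_idx;
    return static_cast<int64_t>(wrapped);
}

// Emulates the index computed WITH an (int64_t) cast on the first operand:
// the remaining operands are converted to int64_t, so nothing wraps.
inline int64_t global_index_64(uint32_t block_dim, uint32_t block_idx, uint32_t thread_idx) {
    return static_cast<int64_t>(block_dim) * block_idx + thread_idx;
}

// True when a tensor of ne elements of type_size bytes is larger than INT_MAX
// bytes, i.e. when the old INT_MAX assertion would have fired.
inline bool exceeds_int_max(int64_t ne, int64_t type_size) {
    return ne * type_size > static_cast<int64_t>(INT_MAX);
}
```

With a block size of 256 and block index 16777216, the 32-bit variant wraps back to a small index while the 64-bit variant keeps the intended value above 2^32. A tensor of 2^29 four-byte elements is the smallest power-of-two case that crosses the `INT_MAX` byte limit. This checks the arithmetic only and does not replace running the kernels on real hardware.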
Fixes #18140