Conversation

@jukofyork (Collaborator) commented on Nov 1, 2025

This is a partial fix for #15049 and #14325: it handles the specific case where cudaMemcpyAsync is used, so that path no longer fails.

See #15298 and the discussion in #15049 (comment) for details on fixing this for the non-cudaMemcpyAsync case.
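
To make the intent concrete, here is a minimal sketch of the idea (my own illustration, not the actual diff, and assuming ggml-cuda's usual helpers such as CUDA_CHECK and ctx.stream()): cudaMemcpyAsync takes a size_t byte count, so the contiguous same-type case has no 32-bit limit, while the element-indexed copy kernels still need the INT_MAX guard.

// Sketch only: the real ggml_cuda_cpy handles many more type combinations.
// Assumes ggml-cuda's common headers for CUDA_CHECK, ggml_backend_cuda_context,
// ggml_nbytes and ggml_is_contiguous.
static void ggml_cuda_cpy_sketch(ggml_backend_cuda_context & ctx,
                                 const ggml_tensor * src0, ggml_tensor * src1) {
    const size_t nbytes = ggml_nbytes(src0);

    if (src0->type == src1->type && ggml_is_contiguous(src0) && ggml_is_contiguous(src1)) {
        // raw byte copy: cudaMemcpyAsync takes a size_t count, so tensors
        // larger than INT_MAX bytes are fine on this path
        CUDA_CHECK(cudaMemcpyAsync(src1->data, src0->data, nbytes,
                                   cudaMemcpyDeviceToDevice, ctx.stream()));
        return;
    }

    // the templated copy kernels still index elements with 32-bit ints,
    // so the guard has to stay for every other case
    GGML_ASSERT(nbytes <= INT_MAX);
    // ... dispatch to the per-type copy kernels here ...
}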

I'm unsure about mudnnMemcpyAsync; the only reference to it I can find is here:

https://github.com/SJTU-IPADS/PowerInfer/blob/d3ebd7c5666348cf43c22f0d62dfbc9a763cffb8/smallthinker/ggml/src/ggml-musa/mudnn.cu#L88

// Asynchronous memory copy using mudnn::Unary::IDENTITY
musaError_t mudnnMemcpyAsync(ggml_backend_cuda_context& ctx, const ggml_tensor* dst, const ggml_tensor* src) {
    mudnn::Tensor tensor_dst, tensor_src;

    MUDNN_CHECK(tensor_dst.SetType(ggml_type_to_mudnn_type(dst->type)));
    MUDNN_CHECK(tensor_src.SetType(ggml_type_to_mudnn_type(src->type)));

    std::vector<int64_t> dims, strides;
    const int ndims = get_ggml_dims_and_strides(src, dims, strides);

    MUDNN_CHECK(tensor_dst.SetNdInfo(ndims, dims.data(), strides.data()));
    MUDNN_CHECK(tensor_src.SetNdInfo(ndims, dims.data(), strides.data()));
    MUDNN_CHECK(tensor_dst.SetAddr(dst->data));
    MUDNN_CHECK(tensor_src.SetAddr(src->data));

    mudnn::Unary op;
    MUDNN_CHECK(op.SetMode(mudnn::Unary::Mode::IDENTITY));
    MUDNN_CHECK(op.SetAlpha(0.0f));
    MUDNN_CHECK(op.SetBeta(0.0f));

    mudnn::Handle* handle = get_cached_handle(ctx.device);
    MUDNN_CHECK(handle->SetStream(ctx.stream()));
    MUDNN_CHECK(op.Run(*handle, tensor_dst, tensor_src));

    return musaSuccess;
}

It doesn't look like it needs the <= INT_MAX assertion (the dims and strides are collected as int64_t), but I'm not 100% sure without any proper API references.
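
If the mudnn path does turn out to have a 32-bit limit internally, a conservative option would be to keep the assertion only on that branch and let plain cudaMemcpyAsync handle the large contiguous tensors. A rough sketch below; the GGML_USE_MUSA / GGML_MUSA_MUDNN_COPY guard and the src0/src1 naming are assumptions based on how cpy.cu wraps this call, not verified against the current tree.

#if defined(GGML_USE_MUSA) && defined(GGML_MUSA_MUDNN_COPY)
    // conservative: keep the guard here until the muDNN docs confirm that
    // 64-bit sizes are supported end-to-end
    GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX);
    CUDA_CHECK(mudnnMemcpyAsync(ctx, src1, src0));
#else
    // size_t byte count: no INT_MAX restriction needed on this path
    CUDA_CHECK(cudaMemcpyAsync(src1->data, src0->data, ggml_nbytes(src0),
                               cudaMemcpyDeviceToDevice, ctx.stream()));
#endif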

@jukofyork (Collaborator, Author) commented:

Just testing this now and will un-draft after.

@jukofyork changed the title from "cuda: allow ggml_cuda_cpy to copy contiguous F32 tensors greater than INT_MAX" to "cuda: allow ggml_cuda_cpy to copy contiguous F32 and F16 tensors greater than INT_MAX" on Nov 1, 2025
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 1, 2025
@jukofyork (Collaborator, Author) commented:

Sadly, it doesn't fix my crash:

slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 42410
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 18432, progress = 0.434614
slot update_slots: id  0 | task 0 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 18432, progress = 0.869229
/home/juk/llama.cpp/ggml/src/ggml-cuda/cpy.cu:326: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
/home/juk/llama.cpp/build/bin/libggml-base.so(+0x16298)[0x7f4c5c95d298]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x1e4)[0x7f4c5c95d664]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x11e)[0x7f4c5c95d7ee]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_+0xef7)[0x7f4c552eb337]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x133b28)[0x7f4c55333b28]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x13401f)[0x7f4c5533401f]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x807)[0x7f4c5c977a67]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7f4c5c69c1c1]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe1)[0x7f4c5c69d831]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x291)[0x7f4c5c6a3021]
/home/juk/llama.cpp/build/bin/libllama.so(llama_decode+0xb)[0x7f4c5c6a3f1b]
/home/juk/llama.cpp/build/bin/llama-server(+0xdd166)[0x555ade363166]
/home/juk/llama.cpp/build/bin/llama-server(+0xa29d9)[0x555ade3289d9]
/home/juk/llama.cpp/build/bin/llama-server(+0x62a52)[0x555ade2e8a52]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f4c5c44624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f4c5c446305]
/home/juk/llama.cpp/build/bin/llama-server(+0x64791)[0x555ade2ea791]

Closing for now and will take a look at that other PR that tries to do this for all types.
