Conversation

@jukofyork (Collaborator) commented on Nov 1, 2025

This is a partial fix for #15049 and #14325: it handles the specific case where cudaMemcpyAsync is used, so that path no longer fails.

See #15298 and the discussion in #15049 (comment) for details on fixing this for the non-cudaMemcpyAsync case.
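
To make the intent concrete, here is a minimal sketch of the idea (my own illustration, not the actual diff, and assuming ggml-cuda's usual helpers such as CUDA_CHECK and ctx.stream()): cudaMemcpyAsync takes a size_t byte count, so the contiguous same-type case has no 32-bit limit, while the element-indexed copy kernels still need the INT_MAX guard.

// Sketch only: the real ggml_cuda_cpy handles many more type combinations.
// Assumes ggml-cuda's common headers for CUDA_CHECK, ggml_backend_cuda_context,
// ggml_nbytes and ggml_is_contiguous.
static void ggml_cuda_cpy_sketch(ggml_backend_cuda_context & ctx,
                                 const ggml_tensor * src0, ggml_tensor * src1) {
    const size_t nbytes = ggml_nbytes(src0);

    if (src0->type == src1->type && ggml_is_contiguous(src0) && ggml_is_contiguous(src1)) {
        // raw byte copy: cudaMemcpyAsync takes a size_t count, so tensors
        // larger than INT_MAX bytes are fine on this path
        CUDA_CHECK(cudaMemcpyAsync(src1->data, src0->data, nbytes,
                                   cudaMemcpyDeviceToDevice, ctx.stream()));
        return;
    }

    // the templated copy kernels still index elements with 32-bit ints,
    // so the guard has to stay for every other case
    GGML_ASSERT(nbytes <= INT_MAX);
    // ... dispatch to the per-type copy kernels here ...
}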

I'm unsure about mudnnMemcpyAsync; the only reference to it I can find is here:

https://github.com/SJTU-IPADS/PowerInfer/blob/d3ebd7c5666348cf43c22f0d62dfbc9a763cffb8/smallthinker/ggml/src/ggml-musa/mudnn.cu#L88

// Asynchronous memory copy using mudnn::Unary::IDENTITY
musaError_t mudnnMemcpyAsync(ggml_backend_cuda_context& ctx, const ggml_tensor* dst, const ggml_tensor* src) {
    mudnn::Tensor tensor_dst, tensor_src;

    MUDNN_CHECK(tensor_dst.SetType(ggml_type_to_mudnn_type(dst->type)));
    MUDNN_CHECK(tensor_src.SetType(ggml_type_to_mudnn_type(src->type)));

    std::vector<int64_t> dims, strides;
    const int ndims = get_ggml_dims_and_strides(src, dims, strides);

    MUDNN_CHECK(tensor_dst.SetNdInfo(ndims, dims.data(), strides.data()));
    MUDNN_CHECK(tensor_src.SetNdInfo(ndims, dims.data(), strides.data()));
    MUDNN_CHECK(tensor_dst.SetAddr(dst->data));
    MUDNN_CHECK(tensor_src.SetAddr(src->data));

    mudnn::Unary op;
    MUDNN_CHECK(op.SetMode(mudnn::Unary::Mode::IDENTITY));
    MUDNN_CHECK(op.SetAlpha(0.0f));
    MUDNN_CHECK(op.SetBeta(0.0f));

    mudnn::Handle* handle = get_cached_handle(ctx.device);
    MUDNN_CHECK(handle->SetStream(ctx.stream()));
    MUDNN_CHECK(op.Run(*handle, tensor_dst, tensor_src));

    return musaSuccess;
}

It doesn't look like it needs the <= INT_MAX assertion (the dims and strides are collected as int64_t), but I'm not 100% sure without any proper API references.
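
If the mudnn path does turn out to have a 32-bit limit internally, a conservative option would be to keep the assertion only on that branch and let plain cudaMemcpyAsync handle the large contiguous tensors. A rough sketch below; the GGML_USE_MUSA / GGML_MUSA_MUDNN_COPY guard and the src0/src1 naming are assumptions based on how cpy.cu wraps this call, not verified against the current tree.

#if defined(GGML_USE_MUSA) && defined(GGML_MUSA_MUDNN_COPY)
    // conservative: keep the guard here until the muDNN docs confirm that
    // 64-bit sizes are supported end-to-end
    GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX);
    CUDA_CHECK(mudnnMemcpyAsync(ctx, src1, src0));
#else
    // size_t byte count: no INT_MAX restriction needed on this path
    CUDA_CHECK(cudaMemcpyAsync(src1->data, src0->data, ggml_nbytes(src0),
                               cudaMemcpyDeviceToDevice, ctx.stream()));
#endif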

@jukofyork (Collaborator, Author) commented:

Just testing this now and will un-draft after.

@jukofyork changed the title from "cuda: allow ggml_cuda_cpy to copy contiguous F32 tensors greater than INT_MAX" to "cuda: allow ggml_cuda_cpy to copy contiguous F32 and F16 tensors greater than INT_MAX" on Nov 1, 2025
@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 1, 2025
@jukofyork (Collaborator, Author) commented:

Sadly, it doesn't fix my crash:

slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 65536, n_keep = 0, task.n_tokens = 42410
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 18432, batch.n_tokens = 18432, progress = 0.434614
slot update_slots: id  0 | task 0 | n_tokens = 18432, memory_seq_rm [18432, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 36864, batch.n_tokens = 18432, progress = 0.869229
/home/juk/llama.cpp/ggml/src/ggml-cuda/cpy.cu:326: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
/home/juk/llama.cpp/build/bin/libggml-base.so(+0x16298)[0x7f4c5c95d298]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x1e4)[0x7f4c5c95d664]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x11e)[0x7f4c5c95d7ee]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(_Z13ggml_cuda_cpyR25ggml_backend_cuda_contextPK11ggml_tensorPS1_+0xef7)[0x7f4c552eb337]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x133b28)[0x7f4c55333b28]
/home/juk/llama.cpp/build/bin/libggml-cuda.so(+0x13401f)[0x7f4c5533401f]
/home/juk/llama.cpp/build/bin/libggml-base.so(ggml_backend_sched_graph_compute_async+0x807)[0x7f4c5c977a67]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_computeEP11ggml_cgraphb+0xa1)[0x7f4c5c69c1c1]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe1)[0x7f4c5c69d831]
/home/juk/llama.cpp/build/bin/libllama.so(_ZN13llama_context6decodeERK11llama_batch+0x291)[0x7f4c5c6a3021]
/home/juk/llama.cpp/build/bin/libllama.so(llama_decode+0xb)[0x7f4c5c6a3f1b]
/home/juk/llama.cpp/build/bin/llama-server(+0xdd166)[0x555ade363166]
/home/juk/llama.cpp/build/bin/llama-server(+0xa29d9)[0x555ade3289d9]
/home/juk/llama.cpp/build/bin/llama-server(+0x62a52)[0x555ade2e8a52]
/lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7f4c5c44624a]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7f4c5c446305]
/home/juk/llama.cpp/build/bin/llama-server(+0x64791)[0x555ade2ea791]

Closing for now and will take a look at that other PR that tries to do this for all types.
