[Java] CuVSMatrix for device memory #1232
rapids-bot[bot] merged 34 commits into rapidsai:branch-25.10 from …read-graph-cuvsmatrix

Conversation
Preliminary micro benchmarks revealed that the CuVS C API functions (e.g. …) behave differently when fed memory in different ways. By "different ways" I mean different memory: "normal" paged memory (native), direct access to Java heap memory, and pinned memory (allocated with cudaHostAlloc).

For the record: these are run on my laptop, which has a 2070 Max-Q on a PCI Express x16 Gen3 bus. The dimensions used lead to a matrix of 16K × 2K → 32M floats (128 MB of data). So 33 ops/s → 4.1 GB/s, or ~1/4 of the theoretical bandwidth.
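A quick jshell-style sanity check of those figures (the constants are taken from the numbers above; 15.75 GB/s is the usual theoretical ceiling for PCIe Gen3 x16):

```java
// 16K x 2K floats copied 33 times/s -> effective PCIe throughput.
long bytes = 16L * 1024 * 2 * 1024 * Float.BYTES;   // 134,217,728 B = 128 MiB
double gbPerSec = 33.0 * bytes / 1e9;               // ~4.4 GB/s (~4.1 GiB/s)
double fraction = gbPerSec / 15.75;                 // vs. PCIe Gen3 x16 theoretical max
System.out.printf("%.1f GB/s = %.0f%% of theoretical%n", gbPerSec, 100 * fraction);
// -> "4.4 GB/s = 28% of theoretical" (i.e. ~1/4)
```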
Further benchmarks/investigation:
Final numbers:
Mostly 👍.
I'd like clarification on the addVector() APIs: in particular, in the case of the device matrices, whether the copy is expected to happen row-by-row.
(Edit: That, and a minor nit regarding cudaMemcpyAsync using CudaMemcpyKind.)
Yep, it is expected; I measured it, and it is slower. But it's not meant to be used on a hot path, and I have ideas and half-baked code to improve it. That said, this PR is already quite big and complex, so I'd like to address this separately (in a follow-up PR), if that's OK with you.
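For the curious, a minimal sketch of the row-by-row pattern under discussion, assuming Panama downcall bindings to the CUDA runtime exist elsewhere; the method names, stub bodies, and staging strategy below are illustrative assumptions, not this PR's actual code:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class RowByRowCopySketch {

    // Stand-ins for downcall handles to the CUDA runtime, assumed to be bound
    // elsewhere via Linker.downcallHandle. 1 == cudaMemcpyHostToDevice in
    // CUDA's cudaMemcpyKind.
    static void cudaMemcpyAsync(MemorySegment dst, MemorySegment src,
                                long bytes, int kind, MemorySegment stream) {
        throw new UnsupportedOperationException("assumed native binding");
    }

    static void cudaStreamSynchronize(MemorySegment stream) {
        throw new UnsupportedOperationException("assumed native binding");
    }

    /** One small H2D transfer per row: simple, but slower than a bulk copy. */
    static void addRows(MemorySegment deviceMatrix, float[][] rows, int dim,
                        MemorySegment stream) {
        long rowBytes = (long) dim * Float.BYTES;
        try (Arena arena = Arena.ofConfined()) {
            for (int i = 0; i < rows.length; i++) {
                // Java heap arrays are not directly addressable by CUDA:
                // stage each row in native memory first.
                MemorySegment staging = arena.allocate(rowBytes);
                MemorySegment.copy(rows[i], 0, staging, ValueLayout.JAVA_FLOAT, 0, dim);
                cudaMemcpyAsync(deviceMatrix.asSlice(i * rowBytes, rowBytes),
                                staging, rowBytes, /* cudaMemcpyHostToDevice */ 1, stream);
            }
            // All staging buffers must outlive the async copies.
            cudaStreamSynchronize(stream);
        }
    }
}
```

The per-row transfer and staging overhead is why this path is slower than a bulk copy, which, as noted above, is acceptable off the hot path.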
/ok to test 699e933

Seems to be conda-related errors in CI at the moment. I'll look into this tomorrow.

/ok to test 699e933

@mythrocks thanks for the heads up!

/ok to test c26e640

/merge
…as a CagraIndex input dataset (#1340)

When we introduced `CuVSDeviceMatrix` in #1232, we made it possible to use a device-memory-backed dataset as an input for index build: since we accept a `CuVSMatrix`, we have correct `toTensor` implementations for CPU and GPU, and the underlying functions in libcuvs support different memory types and sizes (through the DLManagedTensor information), this became supported "naturally". However, we never tested this explicitly.

This PR adds tests to check and show that using `CuVSDeviceMatrix` (device memory) directly as a CagraIndex input dataset works as intended. (Similar tests for other index types will be added as follow-ups.)

Authors:
- Lorenzo Dematté (https://github.com/ldematte)
- MithunR (https://github.com/mythrocks)

Approvers:
- Chris Hegarty (https://github.com/ChrisHegarty)
- MithunR (https://github.com/mythrocks)

URL: #1340
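A hedged sketch of what such a test could look like; `CuVSMatrix.deviceOf` and the exact builder methods are assumptions in the style of the cuvs-java API, not code from #1340:

```java
import com.nvidia.cuvs.CagraIndex;
import com.nvidia.cuvs.CagraIndexParams;
import com.nvidia.cuvs.CuVSMatrix;
import com.nvidia.cuvs.CuVSResources;

public class DeviceMatrixBuildSketch {
    public static void main(String[] args) throws Throwable {
        float[][] vectors = { {0.1f, 0.2f}, {0.3f, 0.4f}, {0.5f, 0.6f} };
        try (CuVSResources resources = CuVSResources.create()) {
            // Hypothetical factory: copy host vectors into a device-memory matrix.
            CuVSMatrix dataset = CuVSMatrix.deviceOf(resources, vectors);
            // The build path accepts any CuVSMatrix; with a device matrix the
            // dataset does not round-trip through host memory at build time.
            CagraIndex index = CagraIndex.newBuilder(resources)
                    .withDataset(dataset)
                    .withIndexParams(new CagraIndexParams.Builder().build())
                    .build();
            index.destroyIndex();
        }
    }
}
```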
This PR introduces implementation classes for `CuVSDeviceMatrix` (a `CuVSMatrix` backed by device memory). It reworks the base implementation classes a bit to increase reuse, and adds benchmarks and tests for the new classes.

Benchmarks were used to try out different implementations so the best one could be chosen:

- `cudaMemcpyAsync` with the `critical` linker option to directly access heap-based memory

`cuvsMatrixCopy` is used across the board, as it has the same performance as the various `cudaMemcpy*` functions.

There are some places in the codebase that will benefit from refactoring to use `CuVSDeviceMatrix` (or a generic `CuVSMatrix` plus `toHost`/`toTensor`/`fromTensor` functions); replacing these multiple ad-hoc implementations with `CuVSDeviceMatrix` will be addressed in a follow-up PR.

Final numbers:
With a theoretical max for the PCI-E bus of 15.7 GB/s and a data size of 128 MB, we get close to 2/3 of the maximum theoretical throughput (see comments for details).
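That works out to roughly 15.7 × 2/3 ≈ 10.5 GB/s, i.e. about 78 transfers of the 128 MB matrix per second, up from the ~33 ops/s measured in the preliminary benchmarks.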