
[Java] CuVSMatrix for device memory #1232

Merged
rapids-bot[bot] merged 34 commits into rapidsai:branch-25.10 from ldematte:java/device-matrix on Aug 22, 2025

Conversation

@ldematte ldematte (Contributor) commented Aug 8, 2025

This PR introduces implementation classes for CuVSDeviceMatrix (a CuVSMatrix backed by device memory).
It reworks the base implementation classes a bit to increase reuse, and adds benchmarks and tests for the new classes.

Benchmarks were used to try out different implementations so the best one could be chosen:

  • Row access is backed by a buffer of pinned memory
  • Builders for device memory use cudaMemcpyAsync with the `critical` linker option to access heap-based memory directly (see the sketch below)
  • cuvsMatrixCopy is used across the board, as it has the same performance as the various cudaMemcpy* functions

There are some places in the codebase that will benefit from refactoring to use CuVSDeviceMatrix (or a generic CuVSMatrix plus toHost/toTensor/fromTensor functions); replacing these multiple ad-hoc implementations with CuVSDeviceMatrix will be addressed in a follow-up PR.
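As a rough sketch of the second bullet (this is not the code in this PR; the symbol lookup, error handling, and prior loading of the CUDA runtime library are simplified assumptions), a `cudaMemcpyAsync` downcall bound with `Linker.Option.critical(true)` can read directly from a heap-backed `MemorySegment`:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public final class CriticalCopySketch {
  // Hypothetical binding: assumes the CUDA runtime library has already been loaded.
  private static final MethodHandle CUDA_MEMCPY_ASYNC = Linker.nativeLinker().downcallHandle(
      SymbolLookup.loaderLookup().find("cudaMemcpyAsync").orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_INT,  // cudaError_t
          ValueLayout.ADDRESS,                     // dst (device pointer)
          ValueLayout.ADDRESS,                     // src (host pointer, may be heap-backed)
          ValueLayout.JAVA_LONG,                   // count, in bytes
          ValueLayout.JAVA_INT,                    // cudaMemcpyKind
          ValueLayout.ADDRESS),                    // stream
      Linker.Option.critical(true));               // allow direct access to Java heap memory

  /** Copies one row of a float[][] dataset straight from the Java heap to device memory. */
  static void copyRowToDevice(MemorySegment devicePtr, float[] row, MemorySegment stream)
      throws Throwable {
    MemorySegment heapRow = MemorySegment.ofArray(row); // heap-backed, no intermediate native copy
    int err = (int) CUDA_MEMCPY_ASYNC.invokeExact(
        devicePtr, heapRow, (long) row.length * Float.BYTES,
        1 /* cudaMemcpyHostToDevice */, stream);
    if (err != 0) {
      throw new IllegalStateException("cudaMemcpyAsync failed with error " + err);
    }
  }
}
```

The `critical(true)` option is what lets the FFM linker pass a heap segment (e.g. one wrapping a `float[]` row) to native code without first copying it into off-heap memory.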

Final numbers:

Benchmark                                            (dims)  (size)   Mode  Cnt   Score   Error  Units
CuVSDeviceMatrixBenchmarks.matrixCopyDeviceToHost      2048   16384  thrpt    5  70.531 ± 0.322  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilder         2048   16384  thrpt    5  35.493 ± 0.772  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsFromDevice    2048   16384  thrpt    5  83.616 ± 0.745  ops/s

With a theoretical maximum of 15.7 GB/s for the PCIe bus and a data size of 128 MB, the row-read path gets close to 2/3 of the maximum theoretical throughput (83.6 ops/s × 128 MB ≈ 10.5 GB/s; see comments for details).

@copy-pr-bot copy-pr-bot Bot commented Aug 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet cjnolet added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Aug 8, 2025
@cjnolet cjnolet moved this from Todo to In Progress in Unstructured Data Processing Aug 8, 2025
@ldematte ldematte (Contributor, Author) commented:

Preliminary micro-benchmarks revealed that the CuVS C API functions (e.g. cuvsMatrixCopy) have performance comparable to calling the corresponding CUDA functions directly, so I removed the direct-CUDA variants and focused on different ways of using the C API functions.

By "different ways" I mean different memory: "normal" paged memory (native), direct access to Java heap memory, and pinned memory (allocated with cudaHostAlloc)

Benchmark                                                         (dims)  (size)   Mode  Cnt   Score   Error  Units
CuVSDeviceMatrixBenchmarks.matrixCopyToHost                         2048   16384  thrpt    5  34.889 ± 1.166  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilderCudaMemcpyCudaHost    2048   16384  thrpt    5  27.454 ± 0.209  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilderCudaMemcpyHeap        2048   16384  thrpt    5  33.176 ± 0.578  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilderCudaMemcpyNative      2048   16384  thrpt    5  31.574 ± 2.331  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsWithNativeBuffer           2048   16384  thrpt    4  68.148 ± 4.488  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsWithPinnedBuffer           2048   16384  thrpt    5  84.511 ± 1.174  ops/s

For the record: these were run on my laptop, which has an RTX 2070 Max-Q on a PCI Express x16 Gen3 bus; the theoretical max for the bus is 15.7 GB/s. I plan to run them on a more modern, enterprise-grade GPU too, but I wanted to get some numbers to spot obvious problems first.

The dimensions used give a matrix of 16K × 2K = 32M floats (128 MB of data).

So 33 ops/s → 4.1 GB/s, or ~1/4 of the theoretical bandwidth,
while 85 ops/s → 10.6 GB/s, or ~66% of the theoretical bandwidth.

matrixCopyToHost and matrixDeviceBuilderCudaMemcpyCudaHost are a bit surprising, especially the second: it seems to indicate that using pinned memory in that case is comparable or even a bit slower, while it should be roughly 2x faster.
Probably these scenarios don't matter too much: the way we are shaping the API, the preferred way is to read a matrix by row. Still, I want to investigate why they are ~2.5x slower, and check whether I see the same results on other GPUs.

@ldematte ldematte (Contributor, Author) commented:

Further benchmarks/investigation: the matrixCopyToHost difference is due to Java allocations/deallocations. If we take those out of the loop, we are back up to > 65 ops/s, very much like matrixReadRowsWithNativeBuffer, which is what one should expect.

The matrixDeviceBuilder* performance is due to the size of the data copied: since data is copied row by row, each call moves KBs (with dims = 2048, that is 2048 floats × 4 bytes = 8 KB per copy), not MBs.
This can be optimized; it probably isn't a hot path (it's used to get data from the Java heap to the device, an already sub-optimal path), and it's not unusably slow, but I'll give it a go. A rough sketch of the idea follows below.
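Purely to illustrate the direction (none of this is in the PR; `BatchedRowUploaderSketch`, `stagingBuffer`, and `flushToDevice` are hypothetical names), rows could be staged in a pinned host buffer and flushed to the device in one larger copy:

```java
import java.lang.foreign.MemorySegment;

final class BatchedRowUploaderSketch {
  private final MemorySegment stagingBuffer; // pinned host memory, holds batchRows rows
  private final int rowLength;               // floats per row (e.g. 2048)
  private final int batchRows;               // rows to accumulate before flushing
  private int pendingRows = 0;

  BatchedRowUploaderSketch(MemorySegment stagingBuffer, int rowLength, int batchRows) {
    this.stagingBuffer = stagingBuffer;
    this.rowLength = rowLength;
    this.batchRows = batchRows;
  }

  void addVector(float[] row) {
    long rowBytes = (long) rowLength * Float.BYTES;
    // Stage the row in pinned host memory instead of issuing an 8 KB device copy per call.
    MemorySegment.copy(MemorySegment.ofArray(row), 0,
        stagingBuffer, pendingRows * rowBytes, rowBytes);
    if (++pendingRows == batchRows) {
      flushToDevice(stagingBuffer, pendingRows); // one large copy instead of many small ones
      pendingRows = 0;
    }
  }

  private void flushToDevice(MemorySegment staged, int rows) {
    // Hypothetical: would issue a single cudaMemcpyAsync of rows * rowLength * 4 bytes.
  }
}
```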

@ldematte ldematte (Contributor, Author) commented:

Final numbers:

Benchmark                                            (dims)  (size)   Mode  Cnt   Score   Error  Units
CuVSDeviceMatrixBenchmarks.matrixCopyDeviceToHost      2048   16384  thrpt    5  70.531 ± 0.322  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilder         2048   16384  thrpt    5  35.493 ± 0.772  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsFromDevice    2048   16384  thrpt    5  83.616 ± 0.745  ops/s

@mythrocks mythrocks (Contributor) left a comment


Mostly 👍.

I'd like clarification on the addVector() APIs: in particular, for the device matrices, whether the copy is expected to be row-by-row.

(Edit: That, and a minor nit regarding cudaMemcpyAsync using CudaMemcpyKind.)

@ldematte ldematte (Contributor, Author) commented:

> I'd like clarification on the addVector() APIs. Particularly in the case of the device-matrices, I'd like clarification on whether the copy is expected to be row-by-row.

Yep, it is expected; I measured it and it is slower, but it's not meant to be used on a hot path, and I have ideas and half-baked code to improve it. This PR is already quite big and complex, though, so I'd like to address that separately in a follow-up PR, if that's OK with you.

@mythrocks mythrocks (Contributor) left a comment


Thank you for your patience with the review, @ldematte. LGTM.

@mythrocks mythrocks (Contributor) commented:

/ok to test 699e933

@mythrocks mythrocks changed the title [REVIEW][Java] CuVSMatrix for device memory [Java] CuVSMatrix for device memory Aug 21, 2025
@mythrocks mythrocks (Contributor) commented:

/ok to test 699e933

@mythrocks mythrocks (Contributor) commented:

Seems to be conda-related errors in CI at the moment. I'll look into this tomorrow.

@mythrocks mythrocks (Contributor) commented:

/ok to test 699e933

@mythrocks mythrocks (Contributor) commented Aug 22, 2025

Please ignore the merge conflicts here. I intended to merge this ahead of #1104.

I have raised #1274 to revert #1104, so as to prioritize #1232.

The error is regretted. I'll re-kick CI when the revert has gone through.

@ldematte ldematte (Contributor, Author) commented:

@mythrocks thanks for the heads up!

@mythrocks mythrocks (Contributor) commented:

/ok to test c26e640

@mythrocks mythrocks (Contributor) commented:

/merge

@rapids-bot rapids-bot Bot merged commit 50ed710 into rapidsai:branch-25.10 Aug 22, 2025
55 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Aug 22, 2025
@ldematte ldematte deleted the java/device-matrix branch August 25, 2025 06:47
rapids-bot Bot pushed a commit that referenced this pull request Sep 25, 2025
…as a CagraIndex input dataset (#1340)

When we introduced `CuVSDeviceMatrix` in #1232, we made it possible to use a device-memory-backed dataset as an input for index build: since we accept a `CuVSMatrix`, we have correct `toTensor` implementations for CPU and GPU, and the underlying functions in libcuvs support different memory types and sizes (through the DLManagedTensor information), this became supported "naturally".
However, we never tested this explicitly.

This PR adds tests to check and show that using CuVSDeviceMatrix (device memory) directly as a CagraIndex input dataset works as intended.

(similar tests for other index types will be added as follow-ups)

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: #1340
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Oct 22, 2025
…as a CagraIndex input dataset (rapidsai#1340)


Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

Projects

Status: Done

4 participants