
[Java] CuVSMatrix for device memory #1232

Merged
rapids-bot[bot] merged 34 commits into rapidsai:branch-25.10 from ldematte:java/device-matrix on Aug 22, 2025

Conversation

@ldematte ldematte (Contributor) commented Aug 8, 2025

This PR introduces implementation classes for CuVSDeviceMatrix (a CuVSMatrix backed by device memory).
It reworks the base implementation classes a bit to increase reuse, and adds benchmarks and tests for the new classes.

Benchmarks were used to try out different implementations so the best one could be chosen:

  • Row access is backed by a buffer of pinned memory
  • Builders for device memory use cudaMemcpyAsync with the `critical` linker option to access heap-based memory directly (see the sketch below)
  • cuvsMatrixCopy is used across the board, as it has the same performance as the various cudaMemcpy* functions

There are some places in the codebase that will benefit from refactoring to use CuVSDeviceMatrix (or a generic CuVSMatrix plus toHost/toTensor/fromTensor functions); replacing these multiple ad-hoc implementations with CuVSDeviceMatrix will be addressed in a follow-up PR.
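As a rough sketch of the second bullet (this is not the code in this PR; the symbol lookup, error handling, and prior loading of the CUDA runtime library are simplified assumptions), a `cudaMemcpyAsync` downcall bound with `Linker.Option.critical(true)` can read directly from a heap-backed `MemorySegment`:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

public final class CriticalCopySketch {
  // Hypothetical binding: assumes the CUDA runtime library has already been loaded.
  private static final MethodHandle CUDA_MEMCPY_ASYNC = Linker.nativeLinker().downcallHandle(
      SymbolLookup.loaderLookup().find("cudaMemcpyAsync").orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_INT,  // cudaError_t
          ValueLayout.ADDRESS,                     // dst (device pointer)
          ValueLayout.ADDRESS,                     // src (host pointer, may be heap-backed)
          ValueLayout.JAVA_LONG,                   // count, in bytes
          ValueLayout.JAVA_INT,                    // cudaMemcpyKind
          ValueLayout.ADDRESS),                    // stream
      Linker.Option.critical(true));               // allow direct access to Java heap memory

  /** Copies one row of a float[][] dataset straight from the Java heap to device memory. */
  static void copyRowToDevice(MemorySegment devicePtr, float[] row, MemorySegment stream)
      throws Throwable {
    MemorySegment heapRow = MemorySegment.ofArray(row); // heap-backed, no intermediate native copy
    int err = (int) CUDA_MEMCPY_ASYNC.invokeExact(
        devicePtr, heapRow, (long) row.length * Float.BYTES,
        1 /* cudaMemcpyHostToDevice */, stream);
    if (err != 0) {
      throw new IllegalStateException("cudaMemcpyAsync failed with error " + err);
    }
  }
}
```

The `critical(true)` option is what lets the FFM linker pass a heap segment (e.g. one wrapping a `float[]` row) to native code without first copying it into off-heap memory.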

Final numbers:

Benchmark                                            (dims)  (size)   Mode  Cnt   Score   Error  Units
CuVSDeviceMatrixBenchmarks.matrixCopyDeviceToHost      2048   16384  thrpt    5  70.531 ± 0.322  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilder         2048   16384  thrpt    5  35.493 ± 0.772  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsFromDevice    2048   16384  thrpt    5  83.616 ± 0.745  ops/s

With a theoretical maximum of 15.7 GB/s for the PCIe bus and a data size of 128 MB, the row-read path gets close to 2/3 of the maximum theoretical throughput (83.6 ops/s × 128 MB ≈ 10.5 GB/s; see comments for details).

@copy-pr-bot copy-pr-bot Bot commented Aug 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjnolet cjnolet added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels Aug 8, 2025
@cjnolet cjnolet moved this from Todo to In Progress in Unstructured Data Processing Aug 8, 2025
@ldematte ldematte (Contributor, Author) commented:

Preliminary micro-benchmarks revealed that the CuVS C API functions (e.g. cuvsMatrixCopy) have performance comparable to calling the corresponding CUDA functions directly, so I removed the direct-CUDA variants and focused on different ways of using the C API functions.

By "different ways" I mean different memory: "normal" paged memory (native), direct access to Java heap memory, and pinned memory (allocated with cudaHostAlloc)

Benchmark                                                         (dims)  (size)   Mode  Cnt   Score   Error  Units
CuVSDeviceMatrixBenchmarks.matrixCopyToHost                         2048   16384  thrpt    5  34.889 ± 1.166  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilderCudaMemcpyCudaHost    2048   16384  thrpt    5  27.454 ± 0.209  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilderCudaMemcpyHeap        2048   16384  thrpt    5  33.176 ± 0.578  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilderCudaMemcpyNative      2048   16384  thrpt    5  31.574 ± 2.331  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsWithNativeBuffer           2048   16384  thrpt    4  68.148 ± 4.488  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsWithPinnedBuffer           2048   16384  thrpt    5  84.511 ± 1.174  ops/s

For the record: these were run on my laptop, which has an RTX 2070 Max-Q on a PCI Express x16 Gen3 bus; the theoretical max for the bus is 15.7 GB/s. I plan to run them on a more modern, enterprise-grade GPU too, but I wanted to get some numbers to spot obvious problems first.

The dimensions used give a matrix of 16K × 2K = 32M floats (128 MB of data).

So 33 ops/s → 4.1 GB/s, or ~1/4 of the theoretical bandwidth,
while 85 ops/s → 10.6 GB/s, or ~66% of the theoretical bandwidth.

matrixCopyToHost and matrixDeviceBuilderCudaMemcpyCudaHost are a bit surprising, especially the second: it seems to indicate that using pinned memory in that case is comparable or even a bit slower, while it should be roughly 2x faster.
Probably these scenarios don't matter too much: the way we are shaping the API, the preferred way is to read a matrix by row. Still, I want to investigate why they are ~2.5x slower, and check whether I see the same results on other GPUs.

@ldematte ldematte (Contributor, Author) commented:

Further benchmarks/investigation: the matrixCopyToHost difference is due to Java allocations/deallocations. If we take those out of the loop, we are back up to > 65 ops/s, very much like matrixReadRowsWithNativeBuffer, which is what one should expect.

The matrixDeviceBuilder* performance is due to the size of the data copied: since data is copied row by row, each call moves KBs (with dims = 2048, that is 2048 floats × 4 bytes = 8 KB per copy), not MBs.
This can be optimized; it probably isn't a hot path (it's used to get data from the Java heap to the device, an already sub-optimal path), and it's not unusably slow, but I'll give it a go. A rough sketch of the idea follows below.
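Purely to illustrate the direction (none of this is in the PR; `BatchedRowUploaderSketch`, `stagingBuffer`, and `flushToDevice` are hypothetical names), rows could be staged in a pinned host buffer and flushed to the device in one larger copy:

```java
import java.lang.foreign.MemorySegment;

final class BatchedRowUploaderSketch {
  private final MemorySegment stagingBuffer; // pinned host memory, holds batchRows rows
  private final int rowLength;               // floats per row (e.g. 2048)
  private final int batchRows;               // rows to accumulate before flushing
  private int pendingRows = 0;

  BatchedRowUploaderSketch(MemorySegment stagingBuffer, int rowLength, int batchRows) {
    this.stagingBuffer = stagingBuffer;
    this.rowLength = rowLength;
    this.batchRows = batchRows;
  }

  void addVector(float[] row) {
    long rowBytes = (long) rowLength * Float.BYTES;
    // Stage the row in pinned host memory instead of issuing an 8 KB device copy per call.
    MemorySegment.copy(MemorySegment.ofArray(row), 0,
        stagingBuffer, pendingRows * rowBytes, rowBytes);
    if (++pendingRows == batchRows) {
      flushToDevice(stagingBuffer, pendingRows); // one large copy instead of many small ones
      pendingRows = 0;
    }
  }

  private void flushToDevice(MemorySegment staged, int rows) {
    // Hypothetical: would issue a single cudaMemcpyAsync of rows * rowLength * 4 bytes.
  }
}
```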

@ldematte ldematte (Contributor, Author) commented:

Final numbers:

Benchmark                                            (dims)  (size)   Mode  Cnt   Score   Error  Units
CuVSDeviceMatrixBenchmarks.matrixCopyDeviceToHost      2048   16384  thrpt    5  70.531 ± 0.322  ops/s
CuVSDeviceMatrixBenchmarks.matrixDeviceBuilder         2048   16384  thrpt    5  35.493 ± 0.772  ops/s
CuVSDeviceMatrixBenchmarks.matrixReadRowsFromDevice    2048   16384  thrpt    5  83.616 ± 0.745  ops/s

@mythrocks mythrocks (Contributor) left a comment


Mostly 👍.

I'd like clarification on the addVector() APIs: in particular, for the device matrices, whether the copy is expected to be row-by-row.

(Edit: That, and a minor nit regarding cudaMemcpyAsync using CudaMemcpyKind.)

@ldematte ldematte (Contributor, Author) commented:

> I'd like clarification on the addVector() APIs. Particularly in the case of the device-matrices, I'd like clarification on whether the copy is expected to be row-by-row.

Yep, it is expected; I measured it and it is slower, but it's not meant to be used on a hot path, and I have ideas and half-baked code to improve it. This PR is already quite big and complex, though, so I'd like to address that separately in a follow-up PR, if that's OK with you.

@mythrocks mythrocks (Contributor) left a comment


Thank you for your patience with the review, @ldematte. LGTM.

@mythrocks mythrocks (Contributor) commented:

/ok to test 699e933

@mythrocks mythrocks changed the title [REVIEW][Java] CuVSMatrix for device memory [Java] CuVSMatrix for device memory Aug 21, 2025
@mythrocks mythrocks (Contributor) commented:

/ok to test 699e933

@mythrocks mythrocks (Contributor) commented:

Seems to be conda-related errors in CI at the moment. I'll look into this tomorrow.

@mythrocks mythrocks (Contributor) commented:

/ok to test 699e933

@mythrocks mythrocks (Contributor) commented Aug 22, 2025

Please ignore the merge conflicts here. I intended to merge this ahead of #1104.

I have raised #1274 to revert #1104, so as to prioritize #1232.

The error is regretted. I'll re-kick CI when the revert has gone through.

@ldematte ldematte (Contributor, Author) commented:

@mythrocks thanks for the heads up!

@mythrocks mythrocks (Contributor) commented:

/ok to test c26e640

@mythrocks mythrocks (Contributor) commented:

/merge

@rapids-bot rapids-bot Bot merged commit 50ed710 into rapidsai:branch-25.10 Aug 22, 2025
55 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Unstructured Data Processing Aug 22, 2025
@ldematte ldematte deleted the java/device-matrix branch August 25, 2025 06:47
rapids-bot Bot pushed a commit that referenced this pull request Sep 25, 2025
…as a CagraIndex input dataset (#1340)

When we introduced `CuVSDeviceMatrix` in #1232, we made it possible to use a device-memory-backed dataset as an input for index build: since we accept a `CuVSMatrix`, we have correct `toTensor` implementations for CPU and GPU, and the underlying functions in libcuvs support different memory types and sizes (through the DLManagedTensor information), this became supported "naturally".
However, we never tested this explicitly.

This PR adds tests to check and show that using CuVSDeviceMatrix (device memory) directly as a CagraIndex input dataset works as intended.

(similar tests for other index types will be added as follow-ups)

Authors:
  - Lorenzo Dematté (https://github.com/ldematte)
  - MithunR (https://github.com/mythrocks)

Approvers:
  - Chris Hegarty (https://github.com/ChrisHegarty)
  - MithunR (https://github.com/mythrocks)

URL: #1340
enp1s0 pushed a commit to enp1s0/cuvs that referenced this pull request Oct 22, 2025
…as a CagraIndex input dataset (rapidsai#1340)


Labels

improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

Projects

Status: Done

4 participants