Use C linkage for JIT LTO kernels #1909
rapids-bot[bot] merged 7 commits into rapidsai:release/26.04
Conversation
KyleFromNVIDIA
left a comment
I love the overall idea of using C linkage for the entry point, especially if it lets us wind down the non-JIT+LTO path early, but I think there's a simpler and smarter way to do it.
KyleFromNVIDIA
left a comment
Approved with one small note
| "interleaved_scan_kernel_capacity_@capacity@_veclen_@veclen@_@ascending_descending@_@compute_norm_name@", | ||
| embedded_fatbin, | ||
| sizeof(embedded_fatbin)); | ||
| registerAlgorithm("@kernel_name@", embedded_fatbin, sizeof(embedded_fatbin)); |
One thing to note is that the `kernel_name` variable is currently an implementation detail of `process_matrix_entry()`. We could document in `generate_jit_lto_kernels()` that this variable is set and available for use inside the source file.
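For illustration, a minimal sketch of what a generated registration entry might expand to once CMake has substituted `@kernel_name@`. The `registerAlgorithm()` signature and the fragment registry below are assumptions modeled on the diff above, not the actual cuvs implementation, and the kernel name string is illustrative only:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical fragment registry standing in for whatever backs
// registerAlgorithm() in cuvs; the real signature may differ.
struct Fragment {
  const void* data;
  std::size_t size;
};

std::map<std::string, Fragment>& fragment_registry() {
  static std::map<std::string, Fragment> registry;
  return registry;
}

void registerAlgorithm(const std::string& kernel_name, const void* fatbin, std::size_t size) {
  fragment_registry()[kernel_name] = Fragment{fatbin, size};
}

// What a source file produced by generate_jit_lto_kernels() might look like
// after configure-time substitution of @kernel_name@ (value is made up here).
static const unsigned char embedded_fatbin[] = {0xde, 0xad, 0xbe, 0xef};

void register_generated_kernel() {
  registerAlgorithm("interleaved_scan_kernel_capacity_128_veclen_4_true_false",
                    embedded_fatbin, sizeof(embedded_fatbin));
}
```

The point is that the string literal is baked in at configure time, so the generated file itself never needs to know how the name was assembled.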
struct tag_idx_l {};

template <typename T>
struct tag_abbrev;
The fragment key is now assembled in C++ from tag_abbrev<>, while the generated embedded file key comes from CMake/JSON via @kernel_name@, no?
If so, that'd mean the planner side and generator side must stay perfectly synchronized across:
- interleaved_scan_tags.hpp
- interleaved_scan_planner.hpp
- the CMake NAME_FORMAT
- the JSON matrix abbrevs
If any one of those drifts, the failure mode might be late and opaque.
I’d strongly suggest either deriving both names from one source, or adding a smoke test that exercises one generated interleaved scan kernel through the full registration and cudaLibraryGetKernel() lookup path.
This is not a blocking comment; if I'm correct, it can be addressed as a follow-up, so I'd just request opening an issue to track it.
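To make the synchronization concern concrete, here is a minimal sketch of how the planner side could assemble a fragment key from `tag_abbrev<>` specializations. The tag names and abbreviation strings are assumptions, not the actual definitions from interleaved_scan_tags.hpp; the risk described above is that this string and the generator-side `@kernel_name@` must render identically:

```cpp
#include <cassert>
#include <string>

// Hypothetical tag types standing in for those in interleaved_scan_tags.hpp.
struct tag_idx_l {};
struct tag_idx_u {};

// Planner-side abbreviation table; each specialization must match the
// abbreviations used by the JSON matrix and the CMake NAME_FORMAT.
template <typename T>
struct tag_abbrev;

template <>
struct tag_abbrev<tag_idx_l> {
  static constexpr const char* value = "l";
};

template <>
struct tag_abbrev<tag_idx_u> {
  static constexpr const char* value = "u";
};

// Assemble a fragment key the way the planner might. If the generator side
// renders a different string for the same combination, the lookup fails only
// at runtime, which is the late-and-opaque failure mode in question.
template <typename IdxTag>
std::string fragment_key(int capacity, int veclen) {
  return "interleaved_scan_kernel_capacity_" + std::to_string(capacity) +
         "_veclen_" + std::to_string(veclen) + "_" + tag_abbrev<IdxTag>::value;
}
```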
I don't know if it's possible to derive those from one source. CMake NAME_FORMAT is done at configure/build time, while the C++ implementation is done at runtime.
The way we're doing things here is not new, and any time the naming conventions have drifted, it has failed very loudly, with either nvjitlink or the fragment database unable to find the appropriate fragment.
We could at least inject the NAME_FORMAT string, placeholders included, into the Planner class, so that at runtime developers just substitute the placeholders instead of constructing the string piece by piece?
Wouldn't we then have to generate a whole matrix of files that instantiate Planner for each possible combination?
We wouldn't substitute the real types/values at build time. Just the string with placeholders for the types/values so the developers don't have to know how to construct and match the string.
How would that work? There are hundreds of possible strings. How do you substitute in @kernel_name@ without generating another matrix of hundreds of files?
Oh, I see. You're thinking of having it look like:
this->set_name_format("some_kernel_@param1@_@param2@");
and then doing the substitution of @param1@ and @param2@ at runtime.
Yes, that's a good idea. We should do that in a follow-up.
Yes that's exactly what I was thinking. Follow-up is perfect 👍 will merge this PR now
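The follow-up idea above can be sketched as a small placeholder-substitution helper. `set_name_format` and the `@key@` placeholder syntax are assumptions about a future API, not existing cuvs code; this just shows the runtime substitution step:

```cpp
#include <cassert>
#include <map>
#include <string>

// Replace every @key@ placeholder in a NAME_FORMAT string with its runtime
// value, so callers never construct the kernel name piece by piece.
std::string substitute_name_format(std::string format,
                                   const std::map<std::string, std::string>& values) {
  for (const auto& [key, value] : values) {
    const std::string placeholder = "@" + key + "@";
    std::string::size_type pos = 0;
    while ((pos = format.find(placeholder, pos)) != std::string::npos) {
      format.replace(pos, placeholder.size(), value);
      pos += value.size();  // skip past the inserted value
    }
  }
  return format;
}
```

With this, the Planner could receive the exact NAME_FORMAT string from CMake at build time and only fill in values at runtime, keeping both sides derived from one source.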
/merge
Since #1909, we've been able to use older versions of the CUDA driver, since we no longer rely on `cudaLibraryEnumerateKernels()`. Since #1918, we've been using static cudart, which allows us to run on platforms with versions of CUDA older than 12.8 installed, since the runtime library API is now bundled with cuvs. Always build with JIT+LTO so that we can get the full compile time and binary size benefits in CUDA 12 too.

Authors:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Divye Gala (https://github.com/divyegala)
- Ben Frederickson (https://github.com/benfred)
- Bradley Dice (https://github.com/bdice)

URL: #1923

Rather than register each fragment in a runtime class with a string key, "register" them with the linker using template specialization. This solves a number of problems:
1. It simplifies the code by removing the `FragmentDatabase` class.
2. It addresses #1909 (comment) by bypassing the issue entirely. There is no longer a need to build the fragment name string at runtime.
3. For clients that use the `cuvs_static` static library, it allows the linker to pick and choose which fragment symbols it needs rather than including all of them with every client just in case any of them are needed.
4. Since there is no longer a need for `$<WHOLE_ARCHIVE:...>` linkage, there is no need for the `cuvs_jit_lto_kernels` target at all, thus simplifying the CMake code too.

Authors:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
- Divye Gala (https://github.com/divyegala)

URL: #1927
This helps us control the kernel symbol names.
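A minimal sketch of why C linkage matters here: with C++ linkage the compiler mangles the symbol name, so it cannot be predicted from a NAME_FORMAT-style string, whereas `extern "C"` pins the symbol to exactly the identifier, making lookup by string (as with `cudaLibraryGetKernel()`) reliable. The entry point name and body below are illustrative, not actual cuvs symbols:

```cpp
#include <cassert>

// Under C++ linkage this symbol would be mangled (e.g. to something like
// _Z28interleaved_scan_entry_pointi), so a string-based lookup could not
// find it by its source-level name. extern "C" disables mangling, so the
// exported symbol is exactly "interleaved_scan_entry_point".
extern "C" int interleaved_scan_entry_point(int capacity) {
  return capacity * 2;  // stand-in body; a real entry point would launch the kernel
}
```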