
Use C linkage for JIT LTO kernels#1909

Merged
rapids-bot[bot] merged 7 commits into rapidsai:release/26.04 from divyegala:jit-lto-c-linkage
Mar 16, 2026

Conversation

@divyegala
Member

This helps us control the kernel symbol names.

@divyegala self-assigned this Mar 11, 2026
@divyegala added the improvement (Improves an existing functionality) label Mar 11, 2026
@divyegala requested review from a team as code owners March 11, 2026 03:43
@divyegala added the non-breaking (Introduces a non-breaking change) label Mar 11, 2026
Member

@KyleFromNVIDIA left a comment

I love the overall idea of using C linkage for the entry point, especially if it lets us wind down the non-JIT+LTO path early, but I think there's a simpler and smarter way to do it.

Outdated comment threads:

  • cpp/cmake/modules/generate_jit_lto_kernels.cmake
  • cpp/src/neighbors/ivf_flat/jit_lto_kernels/interleaved_scan_planner.hpp
  • cpp/include/cuvs/detail/jit_lto/FragmentDatabase.hpp
  • cpp/src/detail/jit_lto/AlgorithmPlanner.cu
Member

@KyleFromNVIDIA left a comment

Approved with one small note

```cpp
  "interleaved_scan_kernel_capacity_@capacity@_veclen_@veclen@_@ascending_descending@_@compute_norm_name@",
  embedded_fatbin,
  sizeof(embedded_fatbin));
registerAlgorithm("@kernel_name@", embedded_fatbin, sizeof(embedded_fatbin));
```
Member

One thing to note is that the `kernel_name` variable is currently an implementation detail of `process_matrix_entry()`. We could document in `generate_jit_lto_kernels()` that this variable is set and available for use inside the source file.
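A minimal sketch of what that documentation could describe, assuming `process_matrix_entry()` uses configure-style `@var@` substitution (the variable and file names here are hypothetical, not the actual cuvs CMake):

```cmake
# Inside process_matrix_entry(): each matrix entry sets kernel_name before
# configuring the per-kernel source, so the source template may reference
# @kernel_name@ directly.
set(kernel_name "interleaved_scan_kernel_capacity_${capacity}_veclen_${veclen}")
configure_file("${source_template}" "${generated_source}" @ONLY)
```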

@divyegala changed the base branch from main to release/26.04 March 12, 2026 21:11
```cpp
struct tag_idx_l {};

template <typename T>
struct tag_abbrev;
```
Member

The fragment key is now assembled in C++ from `tag_abbrev<>`, while the generated embedded file's key comes from CMake/JSON via `@kernel_name@`, no?

If so, that'd mean the planner side and generator side must stay perfectly synchronized across:

  • interleaved_scan_tags.hpp
  • interleaved_scan_planner.hpp
  • the CMake NAME_FORMAT
  • the JSON matrix abbrevs

If any one of those drifts, the failure mode might be late and opaque.

I'd strongly suggest either deriving both names from one source, or adding a smoke test that exercises one generated interleaved scan kernel through the full registration-to-`cudaLibraryGetKernel` path.

This is not a blocking comment; if I'm right, this can be addressed as a follow-up, so I'd just request opening an issue to track it.

Member

I don't know if it's possible to derive those from one source: CMake's NAME_FORMAT is resolved at configure/build time, while the C++ side constructs names at runtime.

The way we're doing things here is not new, and any time the naming conventions have drifted, it's failed very loudly: either nvjitlink or the fragment database fails to find the appropriate fragment.

Member Author

We could at least inject the NAME_FORMAT string, placeholders included, into the Planner class, so that at runtime developers just substitute the placeholders instead of constructing the string piece by piece?

Member

Wouldn't we then have to generate a whole matrix of files that instantiate Planner for each possible combination?

Member Author

We wouldn't substitute the real types/values at build time; we'd just pass the string with placeholders for them, so developers don't have to know how to construct and match the string.

Member

How would that work? There are hundreds of possible strings. How do you substitute in @kernel_name@ without generating another matrix of hundreds of files?

Member

@KyleFromNVIDIA Mar 16, 2026

Oh, I see. You're thinking of having it look like:

```cpp
this->set_name_format("some_kernel_@param1@_@param2@");
```

and then doing the substitution of `@param1@` and `@param2@` at runtime.

Yes, that's a good idea. We should do that in a follow-up.

Member Author

Yes, that's exactly what I was thinking. A follow-up is perfect 👍 will merge this PR now

@divyegala
Member Author

/merge

@rapids-bot[bot] merged commit 1c6400c into rapidsai:release/26.04 Mar 16, 2026
79 checks passed
KyleFromNVIDIA added a commit to KyleFromNVIDIA/cuvs that referenced this pull request Mar 16, 2026
Since rapidsai#1909, we've been able to use
older versions of the CUDA driver, since we no longer rely on
`cudaLibraryEnumerateKernels()`. Since
rapidsai#1918, we've been using static
cudart, which allows us to run on platforms with versions of CUDA older
than 12.8 installed, since the runtime library API is now bundled with
cuvs. Always build with JIT+LTO so that we can get the full compile time
and binary size benefits in CUDA 12 too.
KyleFromNVIDIA added a commit to KyleFromNVIDIA/cuvs that referenced this pull request Mar 17, 2026
Rather than register each fragment in a runtime class with a string
key, "register" them with the linker using template specialization. This
solves a number of problems:

1. It simplifies the code by removing the `FragmentDatabase` class.
2. It addresses rapidsai#1909 (comment)
   by bypassing the issue entirely. There is no longer a need to build
   the fragment name string at runtime.
3. For clients that use the `cuvs_static` static library, it allows the
   linker to pick and choose which fragment symbols it needs rather than
   including all of them with every client just in case any of them are
   needed.
4. Since there is no longer a need for `$<WHOLE_ARCHIVE:...>` linkage,
   there is no need for the `cuvs_jit_lto_kernels` target at all, thus
   simplifying the CMake code too.
rapids-bot[bot] pushed a commit that referenced this pull request Mar 17, 2026
Rather than register each fragment in a runtime class with a string key, "register" them with the linker using template specialization. This solves a number of problems:

1. It simplifies the code by removing the `FragmentDatabase` class.
2. It addresses #1909 (comment) by bypassing the issue entirely. There is no longer a need to build the fragment name string at runtime.
3. For clients that use the `cuvs_static` static library, it allows the linker to pick and choose which fragment symbols it needs rather than including all of them with every client just in case any of them are needed.
4. Since there is no longer a need for `$<WHOLE_ARCHIVE:...>` linkage, there is no need for the `cuvs_jit_lto_kernels` target at all, thus simplifying the CMake code too.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #1927
rapids-bot[bot] pushed a commit that referenced this pull request Mar 25, 2026
Since #1909, we've been able to use older versions of the CUDA driver, since we no longer rely on `cudaLibraryEnumerateKernels()`. Since #1918, we've been using static cudart, which allows us to run on platforms with versions of CUDA older than 12.8 installed, since the runtime library API is now bundled with cuvs. Always build with JIT+LTO so that we can get the full compile time and binary size benefits in CUDA 12 too.

Authors:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Divye Gala (https://github.com/divyegala)
  - Ben Frederickson (https://github.com/benfred)
  - Bradley Dice (https://github.com/bdice)

URL: #1923
jrbourbeau pushed two commits to jrbourbeau/cuvs that referenced this pull request Mar 25, 2026 (same message as #1923 above).
lowener pushed a commit to lowener/cuvs that referenced this pull request Mar 30, 2026 (same message as #1927 above).