Use C linkage for JIT LTO kernels #1909
rapids-bot[bot] merged 7 commits into rapidsai:release/26.04
Conversation
KyleFromNVIDIA
left a comment
I love the overall idea of using C linkage for the entry point, especially if it lets us wind down the non-JIT+LTO path early, but I think there's a simpler and smarter way to do it.
KyleFromNVIDIA
left a comment
Approved with one small note
| "interleaved_scan_kernel_capacity_@capacity@_veclen_@veclen@_@ascending_descending@_@compute_norm_name@", | ||
| embedded_fatbin, | ||
| sizeof(embedded_fatbin)); | ||
| registerAlgorithm("@kernel_name@", embedded_fatbin, sizeof(embedded_fatbin)); |
One thing to note is that the `kernel_name` variable is currently an implementation detail of `process_matrix_entry()`. We could document in `generate_jit_lto_kernels()` that this variable is set and available for use inside the source file.
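For illustration, a minimal sketch of what a generated registration entry might expand to once CMake has substituted `@kernel_name@`. The `registerAlgorithm()` signature and the fragment registry below are assumptions modeled on the diff above, not the actual cuvs implementation, and the kernel name string is illustrative only:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical fragment registry standing in for whatever backs
// registerAlgorithm() in cuvs; the real signature may differ.
struct Fragment {
  const void* data;
  std::size_t size;
};

std::map<std::string, Fragment>& fragment_registry() {
  static std::map<std::string, Fragment> registry;
  return registry;
}

void registerAlgorithm(const std::string& kernel_name, const void* fatbin, std::size_t size) {
  fragment_registry()[kernel_name] = Fragment{fatbin, size};
}

// What a source file produced by generate_jit_lto_kernels() might look like
// after configure-time substitution of @kernel_name@ (value is made up here).
static const unsigned char embedded_fatbin[] = {0xde, 0xad, 0xbe, 0xef};

void register_generated_kernel() {
  registerAlgorithm("interleaved_scan_kernel_capacity_128_veclen_4_true_false",
                    embedded_fatbin, sizeof(embedded_fatbin));
}
```

The point is that the string literal is baked in at configure time, so the generated file itself never needs to know how the name was assembled.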
struct tag_idx_l {};

template <typename T>
struct tag_abbrev;
The fragment key is now assembled in C++ from tag_abbrev<>, while the generated embedded file key comes from CMake/JSON via @kernel_name@, no?
If so, that'd mean the planner side and generator side must stay perfectly synchronized across:
- interleaved_scan_tags.hpp
- interleaved_scan_planner.hpp
- the CMake NAME_FORMAT
- the JSON matrix abbrevs
If any one of those drifts, the failure mode might be late and opaque.
I’d strongly suggest either deriving both names from one source, or adding a smoke test that exercises one generated interleaved scan kernel through the full registration and cudaLibraryGetKernel() lookup path.
This is not a blocking comment; if I'm correct, it can be addressed as a follow-up, so I'd just request opening an issue to track it.
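To make the synchronization concern concrete, here is a minimal sketch of how the planner side could assemble a fragment key from `tag_abbrev<>` specializations. The tag names and abbreviation strings are assumptions, not the actual definitions from interleaved_scan_tags.hpp; the risk described above is that this string and the generator-side `@kernel_name@` must render identically:

```cpp
#include <cassert>
#include <string>

// Hypothetical tag types standing in for those in interleaved_scan_tags.hpp.
struct tag_idx_l {};
struct tag_idx_u {};

// Planner-side abbreviation table; each specialization must match the
// abbreviations used by the JSON matrix and the CMake NAME_FORMAT.
template <typename T>
struct tag_abbrev;

template <>
struct tag_abbrev<tag_idx_l> {
  static constexpr const char* value = "l";
};

template <>
struct tag_abbrev<tag_idx_u> {
  static constexpr const char* value = "u";
};

// Assemble a fragment key the way the planner might. If the generator side
// renders a different string for the same combination, the lookup fails only
// at runtime, which is the late-and-opaque failure mode in question.
template <typename IdxTag>
std::string fragment_key(int capacity, int veclen) {
  return "interleaved_scan_kernel_capacity_" + std::to_string(capacity) +
         "_veclen_" + std::to_string(veclen) + "_" + tag_abbrev<IdxTag>::value;
}
```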
I don't know if it's possible to derive those from one source. CMake NAME_FORMAT is done at configure/build time, while the C++ implementation is done at runtime.
The way we're doing things here is not new, and any time the naming conventions have drifted, it has failed very loudly, with either nvjitlink or the fragment database unable to find the appropriate fragment.
We could at least inject the NAME_FORMAT string, placeholders included, into the Planner class, so that at runtime developers just substitute the placeholders instead of constructing the string piece by piece?
Wouldn't we then have to generate a whole matrix of files that instantiate Planner for each possible combination?
We wouldn't substitute the real types/values at build time. Just the string with placeholders for the types/values so the developers don't have to know how to construct and match the string.
How would that work? There are hundreds of possible strings. How do you substitute in @kernel_name@ without generating another matrix of hundreds of files?
Oh, I see. You're thinking of having it look like:
this->set_name_format("some_kernel_@param1@_@param2@");
and then doing the substitution of @param1@ and @param2@ at runtime.
Yes, that's a good idea. We should do that in a follow-up.
Yes that's exactly what I was thinking. Follow-up is perfect 👍 will merge this PR now
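The follow-up idea above can be sketched as a small placeholder-substitution helper. `set_name_format` and the `@key@` placeholder syntax are assumptions about a future API, not existing cuvs code; this just shows the runtime substitution step:

```cpp
#include <cassert>
#include <map>
#include <string>

// Replace every @key@ placeholder in a NAME_FORMAT string with its runtime
// value, so callers never construct the kernel name piece by piece.
std::string substitute_name_format(std::string format,
                                   const std::map<std::string, std::string>& values) {
  for (const auto& [key, value] : values) {
    const std::string placeholder = "@" + key + "@";
    std::string::size_type pos = 0;
    while ((pos = format.find(placeholder, pos)) != std::string::npos) {
      format.replace(pos, placeholder.size(), value);
      pos += value.size();  // skip past the inserted value
    }
  }
  return format;
}
```

With this, the Planner could receive the exact NAME_FORMAT string from CMake at build time and only fill in values at runtime, keeping both sides derived from one source.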
/merge
Since #1909, we've been able to use older versions of the CUDA driver, since we no longer rely on `cudaLibraryEnumerateKernels()`. Since #1918, we've been using static cudart, which allows us to run on platforms with versions of CUDA older than 12.8 installed, since the runtime library API is now bundled with cuvs. Always build with JIT+LTO so that we can get the full compile time and binary size benefits in CUDA 12 too.

Authors:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Divye Gala (https://github.com/divyegala)
- Ben Frederickson (https://github.com/benfred)
- Bradley Dice (https://github.com/bdice)

URL: #1923

Rather than register each fragment in a runtime class with a string key, "register" them with the linker using template specialization. This solves a number of problems:
1. It simplifies the code by removing the `FragmentDatabase` class.
2. It addresses #1909 (comment) by bypassing the issue entirely. There is no longer a need to build the fragment name string at runtime.
3. For clients that use the `cuvs_static` static library, it allows the linker to pick and choose which fragment symbols it needs rather than including all of them with every client just in case any of them are needed.
4. Since there is no longer a need for `$<WHOLE_ARCHIVE:...>` linkage, there is no need for the `cuvs_jit_lto_kernels` target at all, thus simplifying the CMake code too.

Authors:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)

Approvers:
- Divye Gala (https://github.com/divyegala)

URL: #1927
This helps us control the kernel symbol names.
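A minimal sketch of why C linkage matters here: with C++ linkage the compiler mangles the symbol name, so it cannot be predicted from a NAME_FORMAT-style string, whereas `extern "C"` pins the symbol to exactly the identifier, making lookup by string (as with `cudaLibraryGetKernel()`) reliable. The entry point name and body below are illustrative, not actual cuvs symbols:

```cpp
#include <cassert>

// Under C++ linkage this symbol would be mangled (e.g. to something like
// _Z28interleaved_scan_entry_pointi), so a string-based lookup could not
// find it by its source-level name. extern "C" disables mangling, so the
// exported symbol is exactly "interleaved_scan_entry_point".
extern "C" int interleaved_scan_entry_point(int capacity) {
  return capacity * 2;  // stand-in body; a real entry point would launch the kernel
}
```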