Skip to content

Add a philox PRNG engine#6109

Merged
davebayer merged 38 commits intoNVIDIA:mainfrom
RAMitchell:philox
Oct 27, 2025
Merged

Add a philox PRNG engine#6109
davebayer merged 38 commits intoNVIDIA:mainfrom
RAMitchell:philox

Conversation

@RAMitchell
Copy link
Contributor

@RAMitchell RAMitchell commented Oct 2, 2025

Implements part of #5679

This implementation follows the C++ standard https://en.cppreference.com/w/cpp/numeric/random/philox_engine/set_counter.html

I will additionally implement an optimised discard operator for this engine for use in parallel.

This engine depends on a mulhilo operation for good performance. We will have to implement a few versions of this for different platforms.

@copy-pr-bot
Copy link
Contributor

copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Oct 2, 2025
@davebayer
Copy link
Contributor

If it is standard compliant, we can move this directly to libcu++ instead of thrust :)

@davebayer
Copy link
Contributor

/ok to test ebd31c7

@github-actions

This comment has been minimized.

@fbusato
Copy link
Contributor

fbusato commented Oct 2, 2025

entirely agree with @davebayer. std::philox_engine P2075R6 is part of C++26.

Also, about mulhilo, please take a look at fast_modulo_division.h that already provides a good implementation

@iburyl
Copy link

iburyl commented Oct 2, 2025

Comment on mulhilo. C++ spec requires w to be: 0 < w && w <= numeric_limits<UIntType>​::​digits
That is mulhilo should be able to split result of 31-bits values multiplied into 31-bits lo and 31-bits hi parts.
That said 99.9% use cases will use w = 32 or 64. Another 0.01% will go for w=16. All other cases are hypothetical but formally should be supported. So, any half-decent generic implementation is needed, but 2 (or 3) special cases would benefit from specific optimizations.

@RAMitchell
Copy link
Contributor Author

If it is standard compliant, we can move this directly to libcu++ instead of thrust :)

I started in thrust because the test infrastructure is here, but if everyone is in agreement I can move it. I will focus on getting everything working correctly first.

Also, about mulhilo, please take a look at fast_modulo_division.h that already provides a good implementation

Thank you! This could definitely simplify things for me if that function handles GCC/Clang/MSVC.

Currently fiddling with the discard operator to make it efficient, but will probably ask for review soon.

@bernhardmgruber
Copy link
Contributor

As others have pointed out, we should definitely aim to bring this to libcu++ instead of Thrust.

Signed-off-by: Rory Mitchell <[email protected]>
Signed-off-by: Rory Mitchell <[email protected]>
Signed-off-by: Rory Mitchell <[email protected]>
@RAMitchell
Copy link
Contributor Author

Also, about mulhilo, please take a look at fast_modulo_division.h that already provides a good implementation

@fbusato after looking through I think the cases covered here aren't useful. We only need __umul64 to handle the u64 case. The u32 bit case is handled by 64 bit multiplication. This function also doesn't look like it includes anything for CPU so it would be pretty crippled there. Let me know if I am missing something.

@fbusato
Copy link
Contributor

fbusato commented Oct 6, 2025

This choice is up to you. The libcu++ implementation supports all integral types and constant expression evaluation, as well as 128-bit integers. It is also optimized for device code. Please note that __umulhi is also more efficient than multiplication followed by shifting.

I'm actually considering exposing it in the public API, given that it is used for several purposes: RNG, cryptography and modulo arithmetic.

@fbusato fbusato mentioned this pull request Oct 7, 2025
Signed-off-by: Rory Mitchell <[email protected]>
Signed-off-by: Rory Mitchell <[email protected]>
Signed-off-by: Rory Mitchell <[email protected]>
@RAMitchell RAMitchell changed the title Add a philox PRNG engine to thrust Add a philox PRNG engine Oct 13, 2025
@RAMitchell RAMitchell marked this pull request as ready for review October 13, 2025 13:16
@RAMitchell RAMitchell requested a review from a team as a code owner October 13, 2025 13:16
@RAMitchell RAMitchell requested a review from griwes October 13, 2025 13:16
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Oct 13, 2025

_CCCL_TEMPLATE(class _Sseq)
_CCCL_REQUIRES(__is_seed_sequence<_Sseq, philox_engine>)
_CCCL_API constexpr explicit philox_engine(_Sseq& __seq)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not be noexcept too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seed sequence could throw.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but something like noexcept(seed(__seq)) better expresses the behavior

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I follow sorry!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not mandatory, more a suggestion. You can express the exception behavior of a function with the syntax noexcept(expression_that_could_throw)

__x_[__i] = (__x_[__i] + 1) & max();
if (__x_[__i] != 0)
{
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. My idea is to just ignore the computation instead of adding a break. Similar to a manual loop unrolling

if constexpr (word_size == 32 || word_size == 64)
{
using _Up = ::cuda::std::__make_nbit_uint_t<word_size>;
auto __hi = static_cast<result_type>(::cuda::mul_hi(static_cast<_Up>(__a), static_cast<_Up>(__b)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is the most efficient way to implement mul_hilo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can fiddle with it later.

// Only two variants are allowed, n=2 or n=4
if constexpr (word_count == 2)
{
auto [__hi, __lo] = __mulhilo(__S[0], multipliers[0]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice to check the generated code. I would like to prevent inefficient code caused by structured binding / cuda::std::pair

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally was returning by reference, but it made constexpr difficult so I used this method.

__hi += __ahbl_albh >> __w_half;
__hi += ((__lo >> __w_half) < (__ahbl_albh & __lo_mask));

return ::cuda::std::pair(__hi, __lo);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should not __hi and __lo be masked too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how the paper author does it, but I will just mask these for safety. This branch should never be used unless a user sets w themselves, so I am less worried about performance here.

break;
}
result_type __new_x_j = (__x_[__j] + (__increment & max()) + __carry) & max();
__carry = (__new_x_j < __x_[__j]) ? 1 : 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if it is right way to check for carry because __new_x_j is masked with max()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x_j is masked too

Signed-off-by: Rory Mitchell <[email protected]>
@RAMitchell
Copy link
Contributor Author

Does anyone have further changes? Would someone kindly run CI.

@davebayer
Copy link
Contributor

/ok to test 226f831

#include <cuda/std/cstdint>

#if !_CCCL_COMPILER(NVRTC)
# include <iostream>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need the whole heavyweight <iostream>, I think <sstream> should be enough

@RAMitchell
Copy link
Contributor Author

CI failure looks unrelated.

@github-actions

This comment has been minimized.

@davebayer
Copy link
Contributor

/ok to test 0c058f1

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@RAMitchell
Copy link
Contributor Author

@fbusato @davebayer @miscco you have requested changes, would you please re-check :)

Copy link
Contributor

@davebayer davebayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more details

@davebayer
Copy link
Contributor

/ok to test a716c3d

@github-actions

This comment has been minimized.

#include "test_engine.h"

template <typename Engine>
__host__ __device__ constexpr bool test_set_counter()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests are perfectly fine. However, I'm a bit concerned that we don't compare the results with some references.
e.g. we could compute some values from the implementation in libstdc++ and see if they match

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its about a month old in libstdc++, so I I think I would have to compile it from scratch. Let me try the code from the original paper first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was able to test a bunch of reference values out of libstdc++ and everthing lines up.

@fbusato
Copy link
Contributor

fbusato commented Oct 27, 2025

/ok to test 8be223b

@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Oct 27, 2025
@github-actions
Copy link
Contributor

🥳 CI Workflow Results

🟩 Finished in 40m 27s: Pass: 100%/84 | Total: 9h 50m | Max: 29m 16s | Hits: 99%/212164

See results here.

@davebayer davebayer merged commit a4132d6 into NVIDIA:main Oct 27, 2025
94 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Oct 27, 2025
@RAMitchell RAMitchell deleted the philox branch October 31, 2025 09:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

6 participants