Add a philox PRNG engine by RAMitchell · Pull Request #6109 · NVIDIA/cccl

RAMitchell · 2025-10-02T08:17:13Z

Implements part of #5679

This implementation follows the C++ standard https://en.cppreference.com/w/cpp/numeric/random/philox_engine/set_counter.html

I will additionally implement an optimised discard operator for this engine for use in parallel.

This engine depends on a mulhilo operation for good performance. We will have to implement a few versions of this for different platforms.

Signed-off-by: Rory Mitchell <[email protected]>

copy-pr-bot · 2025-10-02T08:17:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

davebayer · 2025-10-02T09:46:12Z

If it is standard compliant, we can move this directly to libcu++ instead of thrust :)

davebayer · 2025-10-02T10:11:29Z

/ok to test ebd31c7

fbusato · 2025-10-02T16:17:22Z

Entirely agree with @davebayer. std::philox_engine (P2075R6) is part of C++26.

Also, about mulhilo, please take a look at fast_modulo_division.h that already provides a good implementation

iburyl · 2025-10-02T16:48:15Z

Comment on mulhilo: the C++ spec requires w to satisfy 0 < w && w <= numeric_limits<UIntType>::digits.
That is, mulhilo should be able to split the product of two 31-bit values into a 31-bit lo part and a 31-bit hi part.
That said, 99.9% of use cases will use w = 32 or 64. Another 0.01% will go for w = 16. All other cases are hypothetical but formally should be supported. So a half-decent generic implementation is needed, but 2 (or 3) special cases would benefit from specific optimizations.
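To make the generic case concrete, here is a minimal sketch of a mulhilo for any word size w up to 32, assuming the full product is formed in a 64-bit integer. The name `mulhilo_small` and its signature are hypothetical and are not the PR's code; w = 64 needs a different strategy (a 128-bit product or a high-multiply intrinsic).

```cpp
#include <cstdint>
#include <utility>

// Hypothetical helper: multiply two W-bit values and return {hi, lo},
// each reduced to W bits. Limited to W <= 32 so that the full product
// always fits in a 64-bit integer.
template <unsigned W>
constexpr std::pair<std::uint64_t, std::uint64_t> mulhilo_small(std::uint64_t a, std::uint64_t b)
{
  static_assert(W > 0 && W <= 32, "sketch only covers products that fit in 64 bits");
  constexpr std::uint64_t mask = (std::uint64_t{1} << W) - 1;
  // Reduce the inputs to W bits first, so stray upper bits cannot leak in.
  const std::uint64_t product = (a & mask) * (b & mask);
  return {(product >> W) & mask, product & mask};
}
```

With W = 31 the two halves are each 31 bits wide, which is the formally required but rarely used case described above.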

RAMitchell · 2025-10-03T09:36:07Z

> If it is standard compliant, we can move this directly to libcu++ instead of thrust :)

I started in thrust because the test infrastructure is here, but if everyone is in agreement I can move it. I will focus on getting everything working correctly first.

> Also, about mulhilo, please take a look at fast_modulo_division.h that already provides a good implementation

Thank you! This could definitely simplify things for me if that function handles GCC/Clang/MSVC.

Currently fiddling with the discard operator to make it efficient, but will probably ask for review soon.

bernhardmgruber · 2025-10-03T11:49:34Z

As others have pointed out, we should definitely aim to bring this to libcu++ instead of Thrust.

RAMitchell · 2025-10-06T18:42:34Z

> Also, about mulhilo, please take a look at fast_modulo_division.h that already provides a good implementation

@fbusato after looking through it, I think the cases covered there aren't useful here. We only need __umul64 to handle the u64 case; the u32 case is handled by 64-bit multiplication. This function also doesn't look like it includes anything for the CPU, so it would be pretty crippled there. Let me know if I am missing something.

fbusato · 2025-10-06T21:12:37Z

This choice is up to you. The libcu++ implementation supports all integral types and constant expression evaluation, as well as 128-bit integers. It is also optimized for device code. Please note that __umulhi is also more efficient than multiplication followed by shifting.

I'm actually considering exposing it in the public API, given that it is used for several purposes: RNG, cryptography and modulo arithmetic.


fbusato · 2025-10-21T16:22:18Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+
+  _CCCL_TEMPLATE(class _Sseq)
+  _CCCL_REQUIRES(__is_seed_sequence<_Sseq, philox_engine>)
+  _CCCL_API constexpr explicit philox_engine(_Sseq& __seq)


Shouldn't this be noexcept too?

The seed sequence could throw.

Yes, but something like noexcept(seed(__seq)) better expresses the behavior.

Not sure I follow, sorry!

Not mandatory, more a suggestion. You can express the exception behavior of a function with the syntax noexcept(expression_that_could_throw).
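For readers unfamiliar with the idiom, the following is a small sketch of a conditional exception specification. In standard C++ the usual spelling nests the `noexcept` operator inside the `noexcept` specifier, i.e. `noexcept(noexcept(expr))`. The types `toy_engine`, `quiet_seq` and `throwing_seq` are made up for illustration and have nothing to do with the real `philox_engine` or its seed-sequence constraint.

```cpp
#include <type_traits>
#include <utility>

// Two hypothetical seed sequences: one promises not to throw, one does not.
struct quiet_seq
{
  void generate(unsigned* first, unsigned* last) noexcept
  {
    for (; first != last; ++first) { *first = 1u; }
  }
};

struct throwing_seq
{
  void generate(unsigned* first, unsigned* last) // may throw
  {
    for (; first != last; ++first) { *first = 2u; }
  }
};

// Toy engine whose constructor is noexcept exactly when seeding is.
struct toy_engine
{
  unsigned state[2] = {0u, 0u};

  template <class Sseq>
  explicit toy_engine(Sseq& seq) noexcept(
    noexcept(seq.generate(std::declval<unsigned*>(), std::declval<unsigned*>())))
  {
    seq.generate(state, state + 2);
  }
};
```

The inner `noexcept(...)` is an operator that yields `true` or `false` at compile time; the outer one is the specifier that consumes that value.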

fbusato · 2025-10-21T16:38:12Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+      __x_[__i] = (__x_[__i] + 1) & max();
+      if (__x_[__i] != 0)
+      {
+        break;


I don't think so. My idea is to just skip the computation instead of adding a break, similar to manual loop unrolling.
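One way to read this suggestion is a loop that always visits every counter word and carries a flag along, rather than exiting early. The sketch below assumes the counter is stored least-significant word first in 64-bit slots holding W-bit values; `incremented` is a hypothetical name and this is not the code that was merged. It needs C++17 for constexpr element access on `std::array`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: add 1 to an N-word counter of W-bit words without
// an early exit. Once the carry is zero the remaining iterations add 0.
template <unsigned W, std::size_t N>
constexpr std::array<std::uint64_t, N> incremented(std::array<std::uint64_t, N> x)
{
  static_assert(W > 0 && W <= 64, "word size must be between 1 and 64 bits");
  constexpr std::uint64_t mask = ~std::uint64_t{0} >> (64 - W);
  std::uint64_t carry = 1;
  for (std::size_t i = 0; i < N; ++i)
  {
    x[i] = (x[i] + carry) & mask;
    // The carry survives only if it was set and this word wrapped to zero.
    carry &= static_cast<std::uint64_t>(x[i] == 0);
  }
  return x;
}
```

Because the trip count is a compile-time constant and there is no data-dependent exit, a compiler is free to unroll the loop fully; whether that is actually faster would need to be checked in the generated code.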

fbusato · 2025-10-21T16:39:55Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+    if constexpr (word_size == 32 || word_size == 64)
+    {
+      using _Up = ::cuda::std::__make_nbit_uint_t<word_size>;
+      auto __hi = static_cast<result_type>(::cuda::mul_hi(static_cast<_Up>(__a), static_cast<_Up>(__b)));


Not sure if this is the most efficient way to implement mul_hilo.

We can fiddle with it later.

fbusato · 2025-10-21T16:47:00Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+      // Only two variants are allowed, n=2 or n=4
+      if constexpr (word_count == 2)
+      {
+        auto [__hi, __lo] = __mulhilo(__S[0], multipliers[0]);


It would be nice to check the generated code. I would like to prevent inefficient code caused by structured bindings / cuda::std::pair.

I originally was returning by reference, but it made constexpr difficult so I used this method.
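As a rough illustration of the return-by-value approach, here is a host-only sketch of a two-word, 32-bit round that consumes a `{hi, lo}` pair through a structured binding inside a constexpr function. The round structure (high half XOR round key XOR second word, low half as the new second word) follows the Philox round as I understand it for n = 2. The names `mulhilo32` and `round2x32` are hypothetical, and the multiplier is passed as a parameter rather than hard-coding any real constant.

```cpp
#include <array>
#include <cstdint>
#include <utility>

// Hypothetical helper: full 32x32 -> 64 bit product, split into {hi, lo}.
constexpr std::pair<std::uint32_t, std::uint32_t> mulhilo32(std::uint32_t a, std::uint32_t b)
{
  const std::uint64_t p = std::uint64_t{a} * b;
  return {static_cast<std::uint32_t>(p >> 32), static_cast<std::uint32_t>(p)};
}

// Hypothetical sketch of one round on a two-word state.
constexpr std::array<std::uint32_t, 2>
round2x32(std::array<std::uint32_t, 2> s, std::uint32_t multiplier, std::uint32_t round_key)
{
  // Returning by value keeps the function usable in constant expressions.
  const auto [hi, lo] = mulhilo32(s[0], multiplier);
  return {hi ^ round_key ^ s[1], lo};
}
```

Since everything here is a small trivially copyable aggregate, an optimizing compiler can usually keep both halves in registers, but that is exactly the kind of claim worth confirming by inspecting the generated code.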

fbusato · 2025-10-21T17:21:08Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+    __hi += __ahbl_albh >> __w_half;
+    __hi += ((__lo >> __w_half) < (__ahbl_albh & __lo_mask));
+
+    return ::cuda::std::pair(__hi, __lo);


Shouldn't __hi and __lo be masked too?

This is how the paper author does it, but I will just mask these for safety. This branch should never be used unless a user sets w themselves, so I am less worried about performance here.
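For context, the half-word technique behind this fallback branch can be sketched for the plain 64-bit case as follows. This is the textbook decomposition into four 32x32 partial products, written with a hypothetical name `mulhilo64`; it is not the PR's arithmetic (which tracks the cross terms differently) and it does not handle word sizes other than 64.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical sketch: portable 64x64 -> 128 bit multiply returning {hi, lo},
// using only 64-bit arithmetic on 32-bit halves.
constexpr std::pair<std::uint64_t, std::uint64_t> mulhilo64(std::uint64_t a, std::uint64_t b)
{
  constexpr unsigned half          = 32;
  constexpr std::uint64_t lo_mask  = 0xFFFFFFFFull;
  const std::uint64_t al = a & lo_mask, ah = a >> half;
  const std::uint64_t bl = b & lo_mask, bh = b >> half;

  const std::uint64_t ll = al * bl; // contributes to bits   0..63
  const std::uint64_t lh = al * bh; // contributes to bits  32..95
  const std::uint64_t hl = ah * bl; // contributes to bits  32..95
  const std::uint64_t hh = ah * bh; // contributes to bits 64..127

  // Sum of everything that lands in bits 32..63; cannot overflow 64 bits
  // because each of the three terms is below 2^32.
  const std::uint64_t mid = (ll >> half) + (lh & lo_mask) + (hl & lo_mask);

  const std::uint64_t lo = (mid << half) | (ll & lo_mask);
  const std::uint64_t hi = hh + (lh >> half) + (hl >> half) + (mid >> half);
  return {hi, lo};
}
```

For a word size w below the width of the storage type, the same idea applies with half = w / 2, and masking the two results to w bits (as discussed above) keeps stray upper bits out of the state.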

fbusato · 2025-10-21T17:27:18Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+        break;
+      }
+      result_type __new_x_j = (__x_[__j] + (__increment & max()) + __carry) & max();
+      __carry               = (__new_x_j < __x_[__j]) ? 1 : 0;


Not sure if this is the right way to check for carry, because __new_x_j is masked with max().

x_j is masked too
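The point of the reply is that when both the old and the new word are reduced modulo 2^w, a wrapped sum compares lower than the original operand, so the comparison still detects overflow. A minimal sketch of a w-bit add-with-carry built on that comparison is shown below. The helper name `add_carry` is hypothetical, and unlike the snippet under review it performs the addition in two steps, so that a carry-in combined with an all-ones increment is also detected.

```cpp
#include <cstdint>
#include <utility>

// Hypothetical sketch: returns {sum mod 2^W, carry_out} for W-bit operands
// stored in 64-bit integers. carry_in is expected to be 0 or 1.
template <unsigned W>
constexpr std::pair<std::uint64_t, std::uint64_t>
add_carry(std::uint64_t x, std::uint64_t inc, std::uint64_t carry_in)
{
  static_assert(W > 0 && W <= 64, "word size must be between 1 and 64 bits");
  constexpr std::uint64_t mask = ~std::uint64_t{0} >> (64 - W);
  x &= mask;
  inc &= mask;

  // Step 1: add the increment. With both operands below 2^W, the masked
  // sum is smaller than x exactly when the addition wrapped.
  const std::uint64_t s1 = (x + inc) & mask;
  const std::uint64_t c1 = s1 < x ? 1 : 0;

  // Step 2: add the incoming carry and apply the same test again.
  const std::uint64_t s2 = (s1 + carry_in) & mask;
  const std::uint64_t c2 = s2 < s1 ? 1 : 0;

  // At most one of the two steps can wrap, so OR is enough.
  return {s2, c1 | c2};
}
```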

RAMitchell · 2025-10-23T07:15:18Z

Does anyone have further changes? Would someone kindly run CI?

davebayer · 2025-10-23T07:32:36Z

/ok to test 226f831

davebayer · 2025-10-23T07:33:25Z

libcudacxx/include/cuda/std/__random/philox_engine.h

+#include <cuda/std/cstdint>
+
+#if !_CCCL_COMPILER(NVRTC)
+#  include <iostream>


I'm not sure we need the whole heavyweight <iostream>; I think <sstream> should be enough.

RAMitchell · 2025-10-23T09:40:00Z

CI failure looks unrelated.

davebayer · 2025-10-23T10:07:11Z

/ok to test 0c058f1

RAMitchell · 2025-10-27T08:06:05Z

@fbusato @davebayer @miscco you have requested changes; would you please re-check :)

davebayer

Some more details

libcudacxx/include/cuda/std/__random/philox_engine.h

davebayer · 2025-10-27T09:44:38Z

/ok to test a716c3d

fbusato · 2025-10-27T16:17:38Z

libcudacxx/test/libcudacxx/std/random/engine/philox.pass.cpp

+#include "test_engine.h"
+
+template <typename Engine>
+__host__ __device__ constexpr bool test_set_counter()


These tests are perfectly fine. However, I'm a bit concerned that we don't compare the results against reference values.
E.g., we could compute some values with the libstdc++ implementation and check that they match.

It's only about a month old in libstdc++, so I think I would have to compile it from scratch. Let me try the code from the original paper first.

I was able to test a bunch of reference values out of libstdc++ and everything lines up.

fbusato · 2025-10-27T18:05:50Z

/ok to test 8be223b

github-actions · 2025-10-27T18:49:28Z

🥳 CI Workflow Results

🟩 Finished in 40m 27s: Pass: 100%/84 | Total: 9h 50m | Max: 29m 16s | Hits: 99%/212164

See results here.

RAMitchell added 2 commits September 30, 2025 02:22

First attempt at philox

f19e36b

Tests passing

ebd31c7

Signed-off-by: Rory Mitchell <[email protected]>

github-project-automation bot added this to CCCL Oct 2, 2025

github-project-automation bot moved this to Todo in CCCL Oct 2, 2025

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Oct 2, 2025

RAMitchell added 3 commits October 6, 2025 04:58

Efficient discard operator

d7387dc

Signed-off-by: Rory Mitchell <[email protected]>

Use cuda::std::array

ff140ea

Signed-off-by: Rory Mitchell <[email protected]>

Use fast mulhilo from cccl

3ab8e69

Signed-off-by: Rory Mitchell <[email protected]>

Revert "Use fast mulhilo from cccl"

a3e594a

This reverts commit 3ab8e69.

fbusato mentioned this pull request Oct 7, 2025

Expose cuda::mul_hi #6146

Merged

RAMitchell added 4 commits October 10, 2025 02:25

Use internal mulhi implementation

267bc34

Signed-off-by: Rory Mitchell <[email protected]>

Add philox to libcu++

563bf20

Signed-off-by: Rory Mitchell <[email protected]>

Add more tests

ac14e16

Signed-off-by: Rory Mitchell <[email protected]>

Remove thrust stuff

a0ecbe0

Signed-off-by: Rory Mitchell <[email protected]>

RAMitchell changed the title ~~Add a philox PRNG engine to thrust~~ Add a philox PRNG engine Oct 13, 2025

RAMitchell marked this pull request as ready for review October 13, 2025 13:16

RAMitchell requested a review from a team as a code owner October 13, 2025 13:16

RAMitchell requested a review from griwes October 13, 2025 13:16

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Oct 13, 2025

miscco requested changes Oct 13, 2025

RAMitchell mentioned this pull request Oct 20, 2025

Implement PCG64 as extension #6292

Merged

fbusato requested changes Oct 21, 2025

Review comments

226f831

Signed-off-by: Rory Mitchell <[email protected]>

davebayer reviewed Oct 23, 2025

Don't include iostream

0c058f1

davebayer reviewed Oct 27, 2025

Review comments

a716c3d

miscco approved these changes Oct 27, 2025

fbusato requested changes Oct 27, 2025

Add reference values

8be223b

davebayer approved these changes Oct 27, 2025

fbusato approved these changes Oct 27, 2025

github-project-automation bot moved this from In Progress to In Review in CCCL Oct 27, 2025

davebayer merged commit a4132d6 into NVIDIA:main Oct 27, 2025
94 checks passed

github-project-automation bot moved this from In Review to Done in CCCL Oct 27, 2025

RAMitchell deleted the philox branch October 31, 2025 09:28
