Conversation
Signed-off-by: Rory Mitchell <[email protected]>
If it is standard compliant, we can move this directly to libcu++ instead of thrust :)
/ok to test ebd31c7
entirely agree with @davebayer. Also, about |
|
Comment on |
I started in thrust because the test infrastructure is here, but if everyone is in agreement I can move it. I will focus on getting everything working correctly first.
Thank you! This could definitely simplify things for me if that function handles GCC/Clang/MSVC. Currently fiddling with the discard operator to make it efficient, but will probably ask for review soon.
As others have pointed out, we should definitely aim to bring this to libcu++ instead of Thrust.
@fbusato after looking through it, I think the cases covered here aren't useful. We only need __umul64 to handle the u64 case; the u32 case is handled by 64-bit multiplication. This function also doesn't appear to include a CPU path, so it would be pretty crippled there. Let me know if I am missing something.
This reverts commit 3ab8e69.
This choice is up to you. The libcu++ implementation supports all integral types and constant expression evaluation, as well as 128-bit integers. It is also optimized for device code. Please note that I'm actually considering exposing it in the public API, given that it is used for several purposes: RNG, cryptography and modulo arithmetic.
_CCCL_TEMPLATE(class _Sseq)
_CCCL_REQUIRES(__is_seed_sequence<_Sseq, philox_engine>)
_CCCL_API constexpr explicit philox_engine(_Sseq& __seq)
Shouldn't this be noexcept too?
Seed sequence could throw.
Yes, but something like noexcept(noexcept(seed(__seq))) better expresses the behavior.
Not sure I follow, sorry!
Not mandatory, more of a suggestion. You can express the exception behavior of a function conditionally with the syntax noexcept(noexcept(expression_that_could_throw)).
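To make the suggestion concrete, here is a minimal sketch of a conditional noexcept specifier. Everything in it is hypothetical and only for illustration: `toy_engine`, `quiet_seq` and `throwing_seq` are made-up names, not the PR's `philox_engine` or a real seed sequence. The inner `noexcept(...)` is the operator that asks whether the expression can throw, and the outer one is the specifier that uses the answer.

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical seed sequence whose generate() promises not to throw.
struct quiet_seq
{
  void generate(unsigned* first, unsigned* last) noexcept
  {
    for (; first != last; ++first)
    {
      *first = 1u;
    }
  }
};

// Hypothetical seed sequence whose generate() may throw.
struct throwing_seq
{
  void generate(unsigned*, unsigned*)
  {
    throw std::runtime_error("no entropy available");
  }
};

// Hypothetical engine, not the PR's philox_engine.
struct toy_engine
{
  unsigned state[2] = {0u, 0u};

  // The constructor is noexcept exactly when the call it forwards to is noexcept.
  template <class Sseq>
  explicit toy_engine(Sseq& seq) noexcept(
    noexcept(seq.generate(std::declval<unsigned*>(), std::declval<unsigned*>())))
  {
    seq.generate(state, state + 2);
  }
};

static_assert(std::is_nothrow_constructible<toy_engine, quiet_seq&>::value,
              "non-throwing seed sequence gives a noexcept constructor");
static_assert(!std::is_nothrow_constructible<toy_engine, throwing_seq&>::value,
              "potentially throwing seed sequence gives a throwing constructor");
```

With this form the exception guarantee follows the seed sequence type instead of being fixed for all callers.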
__x_[__i] = (__x_[__i] + 1) & max();
if (__x_[__i] != 0)
{
  break;
I don't think so. My idea is to just ignore the computation instead of adding a break. Similar to manual loop unrolling.
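For reference, the two styles being discussed might look roughly like this. This is a sketch under assumptions: the helper names are hypothetical, the counter is a plain array of full 32-bit words stored least significant word first, and no masking with max() is needed because the word fills the type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: increment a multi-word counter, stopping at the
// first word that does not wrap around to zero.
template <std::size_t N>
void increment_with_break(std::array<std::uint32_t, N>& x)
{
  for (std::size_t i = 0; i < N; ++i)
  {
    x[i] = x[i] + 1u;
    if (x[i] != 0u)
    {
      break; // no carry into the next word
    }
  }
}

// Hypothetical helper: same result, but every word is always visited and
// the carry is propagated arithmetically, so the loop body has no early exit.
template <std::size_t N>
void increment_branchless(std::array<std::uint32_t, N>& x)
{
  std::uint32_t carry = 1u;
  for (std::size_t i = 0; i < N; ++i)
  {
    x[i] += carry;
    // The carry survives only if it was set and the word wrapped to zero.
    carry &= static_cast<std::uint32_t>(x[i] == 0u);
  }
}
```

The second form does a fixed amount of work per call, which is what makes it easy for a compiler to unroll.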
if constexpr (word_size == 32 || word_size == 64)
{
  using _Up = ::cuda::std::__make_nbit_uint_t<word_size>;
  auto __hi = static_cast<result_type>(::cuda::mul_hi(static_cast<_Up>(__a), static_cast<_Up>(__b)));
Not sure if this is the most efficient way to implement mul_hilo.
We can fiddle with it later.
// Only two variants are allowed, n=2 or n=4
if constexpr (word_count == 2)
{
  auto [__hi, __lo] = __mulhilo(__S[0], multipliers[0]);
It would be nice to check the generated code. I would like to prevent inefficient code caused by structured binding / cuda::std::pair.
I originally was returning by reference, but it made constexpr difficult so I used this method.
__hi += __ahbl_albh >> __w_half;
__hi += ((__lo >> __w_half) < (__ahbl_albh & __lo_mask));

return ::cuda::std::pair(__hi, __lo);
Shouldn't __hi and __lo be masked too?
This is how the paper author does it, but I will just mask these for safety. This branch should never be used unless a user sets w themselves, so I am less worried about performance here.
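As a point of comparison for the half-word approach in this hunk, here is a minimal portable sketch of a 64x64 to 128-bit multiply that returns the high and low words as a pair. It is an illustration under assumptions, not the PR's `__mulhilo`: the name `mulhilo64` is hypothetical, the word size is fixed at 64 bits so no extra masking of the results is needed, and it uses `std::pair` rather than `cuda::std::pair`.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: returns {high 64 bits, low 64 bits} of a * b,
// built from four 32x32 -> 64-bit partial products.
constexpr std::pair<std::uint64_t, std::uint64_t> mulhilo64(std::uint64_t a, std::uint64_t b)
{
  constexpr std::uint64_t mask = 0xFFFFFFFFull;
  const std::uint64_t al = a & mask;
  const std::uint64_t ah = a >> 32;
  const std::uint64_t bl = b & mask;
  const std::uint64_t bh = b >> 32;

  const std::uint64_t ll = al * bl;
  const std::uint64_t lh = al * bh;
  const std::uint64_t hl = ah * bl;
  const std::uint64_t hh = ah * bh;

  // Sum of the middle column. Each term is below 2^32, so this cannot overflow.
  const std::uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);

  const std::uint64_t lo = (ll & mask) | (mid << 32);
  const std::uint64_t hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
  return {hi, lo};
}

// Usable in constant expressions: (2^64 - 1)^2 = (2^64 - 2) * 2^64 + 1.
static_assert(mulhilo64(~0ull, ~0ull).first == 0xFFFFFFFFFFFFFFFEull, "high word");
static_assert(mulhilo64(~0ull, ~0ull).second == 1ull, "low word");
```

Returning by value keeps the function usable in constant expressions, and in C++17 the call site can still read `auto [hi, lo] = mulhilo64(a, b);`.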
  break;
}
result_type __new_x_j = (__x_[__j] + (__increment & max()) + __carry) & max();
__carry = (__new_x_j < __x_[__j]) ? 1 : 0;
Not sure this is the right way to check for carry, because __new_x_j is masked with max().
x_j is masked too
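A small sketch of word-wise addition with carry on masked w-bit words may help when reasoning about this check. The helper below is hypothetical and makes its own assumptions: words live in `uint64_t`, are stored least significant first, and the increment is already split into words. It detects the carry in two steps, because a single `new < old` comparison cannot see the case where the increment word is all ones and the incoming carry is 1 (the masked sum then equals the old value).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: x += inc, where both are arrays of W-bit words
// stored least significant word first.
template <unsigned W, std::size_t N>
void add_words(std::array<std::uint64_t, N>& x, const std::array<std::uint64_t, N>& inc)
{
  static_assert(W >= 1 && W <= 64, "word size must fit in uint64_t");
  const std::uint64_t mask = (W == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << (W % 64)) - 1);

  std::uint64_t carry = 0;
  for (std::size_t j = 0; j < N; ++j)
  {
    const std::uint64_t old = x[j] & mask;

    // Step 1: add the increment word. Both operands are masked, so the
    // sum wrapped around exactly when the result is smaller than old.
    const std::uint64_t t  = (old + (inc[j] & mask)) & mask;
    const std::uint64_t c1 = static_cast<std::uint64_t>(t < old);

    // Step 2: add the incoming carry (0 or 1). This wraps only when t was all ones.
    const std::uint64_t s  = (t + carry) & mask;
    const std::uint64_t c2 = static_cast<std::uint64_t>(s < t);

    x[j]  = s;
    carry = c1 | c2; // the two steps can never both wrap
  }
}
```

For example, with 32-bit words, adding {1, 0xFFFFFFFF, 0} to {0xFFFFFFFF, 0, 7} must carry through the middle word and leave {0, 0, 8}.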
Does anyone have further changes? Would someone kindly run CI.
/ok to test 226f831
#include <cuda/std/cstdint>

#if !_CCCL_COMPILER(NVRTC)
#  include <iostream>
I'm not sure if we need the whole heavyweight <iostream>, I think <sstream> should be enough
CI failure looks unrelated.
/ok to test 0c058f1
@fbusato @davebayer @miscco you have requested changes, would you please re-check :)
/ok to test a716c3d
#include "test_engine.h"

template <typename Engine>
__host__ __device__ constexpr bool test_set_counter()
These tests are perfectly fine. However, I'm a bit concerned that we don't compare the results against reference values.
E.g., we could compute some values with the implementation in libstdc++ and check that they match.
It's only about a month old in libstdc++, so I think I would have to compile it from scratch. Let me try the code from the original paper first.
I was able to test a bunch of reference values from libstdc++ and everything lines up.
/ok to test 8be223b
🥳 CI Workflow Results
🟩 Finished in 40m 27s: Pass: 100%/84 | Total: 9h 50m | Max: 29m 16s | Hits: 99%/212164
Implements part of #5679
This implementation follows the C++ standard https://en.cppreference.com/w/cpp/numeric/random/philox_engine/set_counter.html
I will additionally implement an optimised discard operator for this engine for use in parallel.
This engine depends on a mulhilo operation for good performance. We will have to implement a few versions of this for different platforms.
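To illustrate why the discard operator mentioned above matters for parallel use, here is a minimal sketch of jumping directly to a position in a stream. It assumes `std::minstd_rand` as a stand-in engine, since `std::philox_engine` requires C++26, and the helper names are hypothetical. The standard specifies that `discard(z)` leaves the engine in the same state as `z` consecutive calls, so each worker can seed identically and skip to its own offset.

```cpp
#include <cassert>
#include <random>

// Hypothetical helper: value number `index` of the stream started by `seed`,
// reached by skipping ahead with discard().
inline std::minstd_rand::result_type nth_value(std::minstd_rand::result_type seed, unsigned long long index)
{
  std::minstd_rand engine(seed);
  engine.discard(index);
  return engine();
}

// Hypothetical helper: the same value, reached by generating and throwing
// away every earlier element one at a time.
inline std::minstd_rand::result_type nth_value_sequential(std::minstd_rand::result_type seed, unsigned long long index)
{
  std::minstd_rand engine(seed);
  for (unsigned long long i = 0; i < index; ++i)
  {
    engine();
  }
  return engine();
}
```

For a counter-based engine the skip can in principle be done by adding to the counter rather than stepping through the stream, which is what makes an optimised discard attractive.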