Complex asinh accuracy refinement #6428
…h header disappears
libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h
Applied suggested constant initializers. Co-authored-by: Michael Schellenberger Costa <[email protected]> Co-authored-by: David Bayer <[email protected]>
…ions.h Replace __device__ __host__ Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h Add guards for non-cuda compilers Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h undo-ing clang-format Co-authored-by: David Bayer <[email protected]>
/ok to test 684c44c
// We still need to do the extended-sqrt on these values, so we fix them now.
// It occurs around "real ~= small" and "imag ~= (1 - small)", and imag < 1.
// Worked out through targeted testing on fp64 and fp32.
_Tp __realx_small_bound = _Tp{1.0e-13};
Anticipating a future when other types (e.g. fp128) will need their own values.
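For context on why the region "real ~= small, imag ~= (1 - small)" is delicate: with z = x + iy, the quantity 1 + z² has real part 1 + x² - y², and for y close to 1 the term 1 - y² cancels almost all of its significant bits. The following is a minimal sketch of that cancellation in plain fp64, not the code from this PR; the function names are hypothetical and chosen only for illustration.

```cpp
#include <cmath>

// Naive evaluation of 1 - y*y. The product is rounded to double first
// (the volatile store keeps the compiler from fusing it into an FMA),
// so for y close to 1 the subtraction cancels most significant bits.
inline double one_minus_y2_naive(double y)
{
  volatile double yy = y * y;
  return 1.0 - yy;
}

// Factored form (1 - y)(1 + y). For y in [0.5, 2] the difference 1 - y
// is exact (Sterbenz lemma), so only the sum and the product round.
inline double one_minus_y2_factored(double y)
{
  return (1.0 - y) * (1.0 + y);
}

// Reference value: std::fma rounds the exact result of 1 - y*y once.
inline double one_minus_y2_reference(double y)
{
  return std::fma(-y, y, 1.0);
}
```

For y = 1 - 2^-30 the naive form returns 2^-29, while the exact value is 2^-29 - 2^-60; the factored form reproduces the reference. The relative error of the naive form grows as y approaches 1, which is consistent with the need for a type-dependent bound such as `__realx_small_bound`.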
// but not small enough that the asinh(x) ~ log(2x) estimate does
// not break down. We are not able to reduce this with a single simple reduction,
// so we do a fast/inlined frexp/ldexp:
const auto __exp_biased = static_cast<int32_t>(::cuda::std::bit_cast<__uint_t>(__max) >> __mant_nbits);
Suggested change:
- const auto __exp_biased = static_cast<int32_t>(::cuda::std::bit_cast<__uint_t>(__max) >> __mant_nbits);
+ const auto __exp_biased = ::cuda::std::__fp_get_exp_biased(__max);
Could you create a version of __fp_get_exp_biased for values that are known to be positive? Like __fp_get_exp_biased_pos or whatever you want.
The few extra operations and the mask value it needs tend to have a measurable perf effect on the math functions when they are called in a loop-unrolled fashion, if the compiler can't work out what it can optimize away.
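The difference being discussed can be sketched in plain C++ for fp64 as follows. This is an illustration under stated assumptions, not the libcu++ implementation of `__fp_get_exp_biased`: the function names are hypothetical, the layout is hard-coded to IEEE 754 binary64 (52 mantissa bits, 11 exponent bits), and `std::memcpy` stands in for `bit_cast`.

```cpp
#include <cstdint>
#include <cstring>

// Biased exponent of a double known to be positive: with the sign bit
// clear, shifting out the 52 mantissa bits leaves only the exponent field.
inline std::int32_t biased_exp_positive(double x)
{
  std::uint64_t bits;
  std::memcpy(&bits, &x, sizeof(bits)); // portable stand-in for bit_cast
  return static_cast<std::int32_t>(bits >> 52);
}

// General version: the extra mask strips the sign bit, which costs one
// more operation and one more constant.
inline std::int32_t biased_exp_any(double x)
{
  std::uint64_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  return static_cast<std::int32_t>((bits >> 52) & 0x7FF);
}
```

Both functions agree for positive inputs (for example, 1.0 has biased exponent 1023). For a negative input the shift-only version also carries the sign bit, so it is only valid where positivity is guaranteed, as it is for `__max` in the snippet above.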
constexpr __uint_t __max_allowed_val_as_uint =
  (__uint_t(__max_allowed_exponent + __exp_bias) << __mant_nbits) | __fp_explicit_bit_mask_of_v<_Tp>;
Suggested change:
- constexpr __uint_t __max_allowed_val_as_uint =
-   (__uint_t(__max_allowed_exponent + __exp_bias) << __mant_nbits) | __fp_explicit_bit_mask_of_v<_Tp>;
+ constexpr __uint_t __max_allowed_val_as_uint = ::cuda::std::__fp_set_exp(__uint_t{}, __max_allowed_exponent);
Same with the extra mask and operation used in this function.
const __uint_t __exp_reduce_factor =
  (__uint_t((2 * __exp_max) + __max_allowed_exponent - __exp_biased) << __mant_nbits)
  | __fp_explicit_bit_mask_of_v<_Tp>;
Suggested change:
- const __uint_t __exp_reduce_factor =
-   (__uint_t((2 * __exp_max) + __max_allowed_exponent - __exp_biased) << __mant_nbits)
-   | __fp_explicit_bit_mask_of_v<_Tp>;
+ const auto __exp_reduce_factor = ::cuda::std::__fp_set_exp_biased(__uint_t{}, (2 * __exp_max) + __max_allowed_exponent - __exp_biased);
It wouldn't be needed here, but a version of __fp_set_exp_biased for fp types that need an explicit bit, but which doesn't set it, would be good, just to remove the unneeded ops.
Added some suggestions Co-authored-by: Federico Busato <[email protected]> Co-authored-by: David Bayer <[email protected]>
/ok to test eacbe9e
/ok to test 39dbc1c
🥳 CI Workflow Results: 🟩 Finished in 1h 25m: Pass: 100%/90 | Total: 1d 13h | Max: 1h 22m | Hits: 95%/202196
* moving machine
* Comment cleanup
* clang-format
* More const's, comment cleanup
* Remove unneeded headers. May need to add some back in when the roots.h header disappears
* re-enable header error
* Add noexcept to internal function
* spell-check
* Apply suggestions from code review Applied suggested constant initializers. Co-authored-by: Michael Schellenberger Costa <[email protected]> Co-authored-by: David Bayer <[email protected]>
* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h Replace __device__ __host__ Co-authored-by: Michael Schellenberger Costa <[email protected]>
* Updated all constant initialization
* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h Add guards for non-cuda compilers Co-authored-by: Michael Schellenberger Costa <[email protected]>
* More non-cuda compiler guards
* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h undo-ing clang-format Co-authored-by: David Bayer <[email protected]>
* Changed return method for extended_sqrt
* Added headers back
* removed noexcept in favour of _CCCL_FORCEINLINE
* last commit does not work, switched back to using inline and noexcept in plain terms
* Removed inline
* reverted bfloat16 change
* removed noexcept
* Minor cleanups
* addde back inline to prevent build errors, since static breaks
* Added warning back
* reverted victims of find-replace
* Added warning back in
* changed inline semantics
* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h Co-authored-by: Michael Schellenberger Costa <[email protected]>
* rollback last commit
* Updated constant init method
* Added suggestions
* Added explicit bit
* Changed exp_max to exp_bias
* build errors
* build errors
* build errors
* Apply suggestions from code review Added some suggestions Co-authored-by: Federico Busato <[email protected]> Co-authored-by: David Bayer <[email protected]>
* More small changes
---------
Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>
Co-authored-by: Federico Busato <[email protected]>
Update the complex asinh function to avoid numerical issues.
The current complex asinh function loses accuracy in several places.
These mostly relate to overflow/underflow and to catastrophic cancellation for difficult input values.
This new version fixes these accuracy issues while retaining its performance.
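To illustrate the kind of failure meant here, the sketch below uses the real-valued textbook formula in fp64 rather than the complex implementation touched by this PR. The function names are hypothetical, and `asinh_large` only shows the asinh(x) ~ log(2x) estimate referred to in the code comments, not the PR's actual reduction.

```cpp
#include <cmath>

// Textbook formula asinh(x) = log(x + sqrt(x*x + 1)).
// x*x overflows for large x, and x is absorbed by the 1 for tiny x.
inline double asinh_naive(double x)
{
  return std::log(x + std::sqrt(x * x + 1.0));
}

// Large-argument estimate asinh(x) ~ log(2x), written as log(x) + log(2)
// so that 2x itself cannot overflow. Only meaningful for large positive x.
inline double asinh_large(double x)
{
  return std::log(x) + std::log(2.0);
}
```

With these definitions, `asinh_naive(1e200)` returns infinity because `x * x` overflows, while `asinh_large(1e200)` agrees closely with `std::asinh(1e200)`. At the other extreme, `asinh_naive(1e-20)` returns exactly 0, although the true value is approximately 1e-20.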
Perf
On GH100 there is not much perf difference. (There used to be a much larger perf gap until #5371 was merged, which the current version takes advantage of.)
Using the math team's standard math_bench test we have the following:
Operations/SM/cycle:
casinh():
Correctness
The current version has several intervals where accuracy is lost.
Apart from the usual overflow/underflow suspects, there are also some very subtle intervals where accuracy is badly degraded, especially by catastrophic cancellation very close to ±i.
This new version fixes these, and testing gives the following:
GPU Correctness
For the new version, an intensive bracket and bisect search, along with testing special hard values, gives:
CPU Correctness