
Complex asinh accuracy refinement#6428

Merged
s-oboyle merged 51 commits into NVIDIA:main from s-oboyle:complex_asinh_accuracy_refinement
Dec 3, 2025

Conversation

@s-oboyle
Contributor

Update the complex asinh function to avoid numerical issues.

The current complex asinh function loses accuracy in several places, mostly related to over/underflow and to catastrophic cancellation for difficult inputs.

This new version fixes these accuracy issues while retaining its performance.

Perf

On GH100 we don't see much perf difference. (There used to be a much larger perf gap until #5371 was merged, which the current version takes advantage of.)
Using the math team's standard math_bench test, we get the following:

Operations/SM/cycle for `casinh()`:

| H100 | old    | new    | new/old |
|------|--------|--------|---------|
| fp64 | 0.2531 | 0.2549 | 1.01    |
| fp32 | 0.6072 | 0.6334 | 1.04    |

Correctness

The current version has several intervals where accuracy is lost.
Apart from the usual over/underflow suspects, there are also some very subtle intervals where accuracy is badly degraded, especially by catastrophic cancellation very close to ±i.

This new version fixes these and testing gives the following:

GPU Correctness

For the new version, an intensive bracket and bisect search, along with testing special hard values, gives:

GPU fp64:
Max ulp real error (4.867,1.742) @ (0.007757045272,-0.0002247045536)    (0x3f7fc5d9fc1f5662,0xbf2d73d56affd72d)
        Ours = (0.007756967678,-0.0002246977954)    Ref = (0.007756967678,-0.0002246977954)
        Ours = (0x3f7fc5c527dd1d58,0xbf2d739b5d7a3961)               Ref = (0x3f7fc5c527dd1d53,0xbf2d739b5d7a3963)

Max ulp imag error (0.1719,5.453) @ (7.198570162e+103,-5.623976789e+101)        (0x558011effdb5a3ad,0xd510120190e898df)
        Ours = (239.8333247,-0.007812471425)    Ref = (239.8333247,-0.007812471425)
        Ours = (0x406dfaaa988c99ba,0xbf7ffff854599f01)               Ref = (0x406dfaaa988c99ba,0xbf7ffff854599efc)
GPU fp32:
Max ulp real error (6.619,2.232) @ (0.007812378462,-0.0007928675623)    (0x3bfffefb,0xba4fd871)
        Ours = (0.007812298369,-0.0007928435807)    Ref = (0.007812301628,-0.0007928434643)
        Ours = (0x3bfffe4f,0xba4fd6d5)               Ref = (0x3bfffe56,0xba4fd6d3)

Max ulp imag error (3.732,5.528) @ (0.007806597743,3.029832988e-05)     (0x3bffce7d,0x37fe292b)
        Ours = (0.007806516718,3.029741674e-05)    Ref = (0.007806518581,3.029740583e-05)
        Ours = (0x3bffcdcf,0x37fe2735)               Ref = (0x3bffcdd3,0x37fe272f)

CPU Correctness

CPU fp64:
Max ulp real error (4.125,0) @ (0.01542893159,-0)       (0x3f8f993424afec00,0x8000000000000000)
        Ours = (0.01542831951,-0)    Ref = (0.01542831951,-0)
        Ours = (0x3f8f98e1fdb37251,0x8000000000000000)               Ref = (0x3f8f98e1fdb3724d,0x8000000000000000)

Max ulp imag error (0.5078,3.484) @ (0.8869326854,-1.12001698e-254)     (0x3fec61c0a7b18800,0x8b3505783ad41800)
        Ours = (0.7991224705,-8.379245502e-255)    Ref = (0.7991224705,-8.379245502e-255)
        Ours = (0x3fe99269498d3a37,0x8b2f742347588681)               Ref = (0x3fe99269498d3a36,0x8b2f742347588684)
CPU fp32:
Max ulp real error (4.827,0.5125) @ (0.001131535857,0.9893865585)       (0x3a94500b,0x3f7d4870)
        Ours = (0.007776218932,1.424767017)    Ref = (0.007776216604,1.424766898)
        Ours = (0x3bfecfa7,0x3fb65ec4)               Ref = (0x3bfecfa2,0x3fb65ec3)

Max ulp imag error (0.498,4.695) @ (-5.695841894e+11,3.510264218e+10)   (0xd3049ddd,0x5102c47d)
        Ours = (-27.76321411,0.06155067682)    Ref = (-27.76321411,0.06155069545)
        Ours = (0xc1de1b10,0x3d7c1c90)               Ref = (0xc1de1b10,0x3d7c1c95)

@s-oboyle s-oboyle requested a review from a team as a code owner October 31, 2025 20:57
@s-oboyle s-oboyle requested a review from pciolkosz October 31, 2025 20:57
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 31, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Oct 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 31, 2025
Contributor

@davebayer davebayer left a comment


Some small things

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 3, 2025
s-oboyle and others added 7 commits November 3, 2025 16:23
Applied suggested constant initializers.

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>
…ions.h


Replace __device__ __host__

Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h


Add guards for non-cuda compilers

Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h


undo-ing clang-format

Co-authored-by: David Bayer <[email protected]>

@s-oboyle
Contributor Author

/ok to test 684c44c


// We still need to do the extended-sqrt on these values, so we fix them now.
// It occurs around "real ~= small" and "imag ~= (1 - small)", and imag < 1.
// Worked out through targeted testing on fp64 and fp32.
_Tp __realx_small_bound = _Tp{1.0e-13};
Contributor Author


Anticipating a future when other types (e.g. fp128) will need their own values.

@miscco miscco requested review from davebayer and fbusato November 24, 2025 09:29
Contributor

@davebayer davebayer left a comment


Please try to use more of the __fp_meow machinery. In #6749, I am fixing the behaviour of __fp_set_exp so that it automatically sets the explicit bit.

// but not so small that the asinh(x) ~ log(2x) estimate breaks down.
// We are not able to reduce this with a single simple reduction,
// so we do a fast/inlined frexp/ldexp:
const auto __exp_biased = static_cast<int32_t>(::cuda::std::bit_cast<__uint_t>(__max) >> __mant_nbits);
Contributor


Suggested change
const auto __exp_biased = static_cast<int32_t>(::cuda::std::bit_cast<__uint_t>(__max) >> __mant_nbits);
const auto __exp_biased = ::cuda::std::__fp_get_exp_biased(__max);

Contributor Author


Could you create a version of __fp_get_exp_biased for values that are known to be positive? Something like __fp_get_exp_biased_pos, or whatever you want to call it.
The few extra operations and the mask value needed tend to have a measurable perf effect on the math functions when they are called in a loop-unrolled fashion, if the compiler can't work out what it can optimize away.

Comment on lines +188 to +189
constexpr __uint_t __max_allowed_val_as_uint =
(__uint_t(__max_allowed_exponent + __exp_bias) << __mant_nbits) | __fp_explicit_bit_mask_of_v<_Tp>;
Contributor


Suggested change
constexpr __uint_t __max_allowed_val_as_uint =
(__uint_t(__max_allowed_exponent + __exp_bias) << __mant_nbits) | __fp_explicit_bit_mask_of_v<_Tp>;
constexpr __uint_t __max_allowed_val_as_uint = ::cuda::std::__fp_set_exp(__uint_t{}, __max_allowed_exponent);

Contributor Author


Same with the extra mask and operation used in this function.

Comment on lines +205 to +207
const __uint_t __exp_reduce_factor =
(__uint_t((2 * __exp_max) + __max_allowed_exponent - __exp_biased) << __mant_nbits)
| __fp_explicit_bit_mask_of_v<_Tp>;
Contributor


Suggested change
const __uint_t __exp_reduce_factor =
(__uint_t((2 * __exp_max) + __max_allowed_exponent - __exp_biased) << __mant_nbits)
| __fp_explicit_bit_mask_of_v<_Tp>;
const auto __exp_reduce_factor = ::cuda::std::__fp_set_exp_biased(__uint_t{}, (2 * __exp_max) + __max_allowed_exponent - __exp_biased);

Contributor Author


It wouldn't be needed here, but a version of __fp_set_exp_biased for fp types that need an explicit bit, but that doesn't set it, would be good, just to remove the unneeded ops.

@s-oboyle
Contributor Author

/ok to test eacbe9e


@s-oboyle
Contributor Author

s-oboyle commented Dec 2, 2025

/ok to test 39dbc1c

@github-actions
Contributor

github-actions bot commented Dec 2, 2025

🥳 CI Workflow Results

🟩 Finished in 1h 25m: Pass: 100%/90 | Total: 1d 13h | Max: 1h 22m | Hits: 95%/202196


@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Dec 2, 2025
@s-oboyle s-oboyle merged commit 3e672a2 into NVIDIA:main Dec 3, 2025
105 of 106 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 3, 2025
srinivasyadav18 pushed a commit to srinivasyadav18/cccl that referenced this pull request Dec 9, 2025
* moving machine

* Comment cleanup

* clang-format

* More const's, comment cleanup

* Remove unneeded headers. May need to add some back in when the roots.h header disappears

* re-enable header error

* Add noexcept to internal function

* spell-check

* Apply suggestions from code review

Applied suggested constant initializers.

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

Replace __device__ __host__

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* Updated all constant initialization

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

Add guards for non-cuda compilers

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* More non-cuda compiler guards

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

undo-ing clang-format

Co-authored-by: David Bayer <[email protected]>

* Changed return method for extended_sqrt

* Added headers back

* removed noexcept in favour of _CCCL_FORCEINLINE

* last commit does not work, switched back to using inline and noexcept in plain terms

* Removed inline

* reverted bfloat16 change

* removed noexcept

* Minor cleanups

* addde back inline to prevent build errors, since static breaks

* Added warning back

* reverted victims of find-replace

* Added warning back in

* changed inline semantics

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* rollback last commit

* Updated constant init method

* Added suggestions

* Added explicit bit

* Changed exp_max to exp_bias

* build errors

* build errors

* build errors

* Apply suggestions from code review

Added some suggestions

Co-authored-by: Federico Busato <[email protected]>
Co-authored-by: David Bayer <[email protected]>

* More small changes

---------

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>
Co-authored-by: Federico Busato <[email protected]>