
Complex asinh accuracy refinement#6428

Merged
s-oboyle merged 51 commits into NVIDIA:main from s-oboyle:complex_asinh_accuracy_refinement
Dec 3, 2025

Conversation

@s-oboyle
Contributor

Update the complex asinh function to avoid numerical issues.

The current complex asinh function loses accuracy in several places, mostly related to over/underflow and to catastrophic cancellation for difficult inputs.

This new version fixes these accuracy issues while retaining its performance.

Perf

On GH100 we don't see much perf difference. (There used to be a much larger perf gap until #5371 was merged, which the current version takes advantage of.)
Using the math team's standard math_bench test, we get the following:

Operations/SM/cycle for `casinh()`:

| H100 | old    | new    | new/old |
|------|--------|--------|---------|
| fp64 | 0.2531 | 0.2549 | 1.01    |
| fp32 | 0.6072 | 0.6334 | 1.04    |

Correctness

The current version has several intervals where accuracy is lost.
Apart from the usual over/underflow suspects, there are also some very subtle intervals where accuracy is badly degraded, especially by catastrophic cancellation very close to ±i.

This new version fixes these and testing gives the following:

GPU Correctness

For the new version, an intensive bracket and bisect search, along with testing special hard values, gives:

GPU fp64:
Max ulp real error (4.867,1.742) @ (0.007757045272,-0.0002247045536)    (0x3f7fc5d9fc1f5662,0xbf2d73d56affd72d)
        Ours = (0.007756967678,-0.0002246977954)    Ref = (0.007756967678,-0.0002246977954)
        Ours = (0x3f7fc5c527dd1d58,0xbf2d739b5d7a3961)               Ref = (0x3f7fc5c527dd1d53,0xbf2d739b5d7a3963)

Max ulp imag error (0.1719,5.453) @ (7.198570162e+103,-5.623976789e+101)        (0x558011effdb5a3ad,0xd510120190e898df)
        Ours = (239.8333247,-0.007812471425)    Ref = (239.8333247,-0.007812471425)
        Ours = (0x406dfaaa988c99ba,0xbf7ffff854599f01)               Ref = (0x406dfaaa988c99ba,0xbf7ffff854599efc)
GPU fp32:
Max ulp real error (6.619,2.232) @ (0.007812378462,-0.0007928675623)    (0x3bfffefb,0xba4fd871)
        Ours = (0.007812298369,-0.0007928435807)    Ref = (0.007812301628,-0.0007928434643)
        Ours = (0x3bfffe4f,0xba4fd6d5)               Ref = (0x3bfffe56,0xba4fd6d3)

Max ulp imag error (3.732,5.528) @ (0.007806597743,3.029832988e-05)     (0x3bffce7d,0x37fe292b)
        Ours = (0.007806516718,3.029741674e-05)    Ref = (0.007806518581,3.029740583e-05)
        Ours = (0x3bffcdcf,0x37fe2735)               Ref = (0x3bffcdd3,0x37fe272f)

CPU Correctness

CPU fp64:
Max ulp real error (4.125,0) @ (0.01542893159,-0)       (0x3f8f993424afec00,0x8000000000000000)
        Ours = (0.01542831951,-0)    Ref = (0.01542831951,-0)
        Ours = (0x3f8f98e1fdb37251,0x8000000000000000)               Ref = (0x3f8f98e1fdb3724d,0x8000000000000000)

Max ulp imag error (0.5078,3.484) @ (0.8869326854,-1.12001698e-254)     (0x3fec61c0a7b18800,0x8b3505783ad41800)
        Ours = (0.7991224705,-8.379245502e-255)    Ref = (0.7991224705,-8.379245502e-255)
        Ours = (0x3fe99269498d3a37,0x8b2f742347588681)               Ref = (0x3fe99269498d3a36,0x8b2f742347588684)
CPU fp32:
Max ulp real error (4.827,0.5125) @ (0.001131535857,0.9893865585)       (0x3a94500b,0x3f7d4870)
        Ours = (0.007776218932,1.424767017)    Ref = (0.007776216604,1.424766898)
        Ours = (0x3bfecfa7,0x3fb65ec4)               Ref = (0x3bfecfa2,0x3fb65ec3)

Max ulp imag error (0.498,4.695) @ (-5.695841894e+11,3.510264218e+10)   (0xd3049ddd,0x5102c47d)
        Ours = (-27.76321411,0.06155067682)    Ref = (-27.76321411,0.06155069545)
        Ours = (0xc1de1b10,0x3d7c1c90)               Ref = (0xc1de1b10,0x3d7c1c95)

@s-oboyle s-oboyle requested a review from a team as a code owner October 31, 2025 20:57
@s-oboyle s-oboyle requested a review from pciolkosz October 31, 2025 20:57
@github-project-automation github-project-automation bot moved this to Todo in CCCL Oct 31, 2025
@copy-pr-bot
Contributor

copy-pr-bot bot commented Oct 31, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Review in CCCL Oct 31, 2025
Contributor

@davebayer davebayer left a comment


Some small things

@github-project-automation github-project-automation bot moved this from In Review to In Progress in CCCL Nov 3, 2025
s-oboyle and others added 7 commits November 3, 2025 16:23
Applied suggested constant initializers.

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>
…ions.h


Replace __device__ __host__

Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h


Add guards for non-cuda compilers

Co-authored-by: Michael Schellenberger Costa <[email protected]>
…ions.h


undo-ing clang-format

Co-authored-by: David Bayer <[email protected]>

@s-oboyle
Contributor Author

/ok to test 684c44c


// We still need to do the extended-sqrt on these values, so we fix them now.
// It occurs around "real ~= small" and "imag ~= (1 - small)", and imag < 1.
// Worked out through targeted testing on fp64 and fp32.
_Tp __realx_small_bound = _Tp{1.0e-13};
Contributor Author


Anticipating a future when other types (e.g. fp128) will need their own values.

@miscco miscco requested review from davebayer and fbusato November 24, 2025 09:29
Contributor

@davebayer davebayer left a comment


Please try to use more of the __fp_meow machinery. In #6749, I am fixing the behaviour of __fp_set_exp so that it automatically sets the explicit bit.

// but not so small that the asinh(x) ~ log(2x) estimate breaks down.
// We are not able to reduce this with a single simple reduction,
// so we do a fast/inlined frexp/ldexp:
const auto __exp_biased = static_cast<int32_t>(::cuda::std::bit_cast<__uint_t>(__max) >> __mant_nbits);
Contributor


Suggested change
const auto __exp_biased = static_cast<int32_t>(::cuda::std::bit_cast<__uint_t>(__max) >> __mant_nbits);
const auto __exp_biased = ::cuda::std::__fp_get_exp_biased(__max);

Contributor Author


Could you create a version of __fp_get_exp_biased for values that are known to be positive? Something like __fp_get_exp_biased_pos, or whatever you want to call it.
The few extra operations and the mask value needed tend to have a measurable perf effect on the math functions when they are called in a loop-unrolled fashion, if the compiler can't work out what it can optimize away.

Comment on lines +188 to +189
constexpr __uint_t __max_allowed_val_as_uint =
(__uint_t(__max_allowed_exponent + __exp_bias) << __mant_nbits) | __fp_explicit_bit_mask_of_v<_Tp>;
Contributor


Suggested change
constexpr __uint_t __max_allowed_val_as_uint =
(__uint_t(__max_allowed_exponent + __exp_bias) << __mant_nbits) | __fp_explicit_bit_mask_of_v<_Tp>;
constexpr __uint_t __max_allowed_val_as_uint = ::cuda::std::__fp_set_exp(__uint_t{}, __max_allowed_exponent);

Contributor Author


Same with the extra mask and operation used in this function.

Comment on lines +205 to +207
const __uint_t __exp_reduce_factor =
(__uint_t((2 * __exp_max) + __max_allowed_exponent - __exp_biased) << __mant_nbits)
| __fp_explicit_bit_mask_of_v<_Tp>;
Contributor


Suggested change
const __uint_t __exp_reduce_factor =
(__uint_t((2 * __exp_max) + __max_allowed_exponent - __exp_biased) << __mant_nbits)
| __fp_explicit_bit_mask_of_v<_Tp>;
const auto __exp_reduce_factor = ::cuda::std::__fp_set_exp_biased(__uint_t{}, (2 * __exp_max) + __max_allowed_exponent - __exp_biased);

Contributor Author


It wouldn't be needed here, but a version of __fp_set_exp_biased for fp types that need an explicit bit, but that doesn't set it, would be good, just to remove the unneeded ops.

@s-oboyle
Contributor Author

/ok to test eacbe9e


@s-oboyle
Contributor Author

s-oboyle commented Dec 2, 2025

/ok to test 39dbc1c

@github-actions
Contributor

github-actions bot commented Dec 2, 2025

🥳 CI Workflow Results

🟩 Finished in 1h 25m: Pass: 100%/90 | Total: 1d 13h | Max: 1h 22m | Hits: 95%/202196


@github-project-automation github-project-automation bot moved this from In Progress to In Review in CCCL Dec 2, 2025
@s-oboyle s-oboyle merged commit 3e672a2 into NVIDIA:main Dec 3, 2025
105 of 106 checks passed
@github-project-automation github-project-automation bot moved this from In Review to Done in CCCL Dec 3, 2025
srinivasyadav18 pushed a commit to srinivasyadav18/cccl that referenced this pull request Dec 9, 2025
* moving machine

* Comment cleanup

* clang-format

* More const's, comment cleanup

* Remove unneeded headers. May need to add some back in when the roots.h header disappears

* re-enable header error

* Add noexcept to internal function

* spell-check

* Apply suggestions from code review

Applied suggested constant initializers.

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

Replace __device__ __host__

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* Updated all constant initialization

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

Add guards for non-cuda compilers

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* More non-cuda compiler guards

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

undo-ing clang-format

Co-authored-by: David Bayer <[email protected]>

* Changed return method for extended_sqrt

* Added headers back

* removed noexcept in favour of _CCCL_FORCEINLINE

* last commit does not work, switched back to using inline and noexcept in plain terms

* Removed inline

* reverted bfloat16 change

* removed noexcept

* Minor cleanups

* addde back inline to prevent build errors, since static breaks

* Added warning back

* reverted victims of find-replace

* Added warning back in

* changed inline semantics

* Update libcudacxx/include/cuda/std/__complex/inverse_hyperbolic_functions.h

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* rollback last commit

* Updated constant init method

* Added suggestions

* Added explicit bit

* Changed exp_max to exp_bias

* build errors

* build errors

* build errors

* Apply suggestions from code review

Added some suggestions

Co-authored-by: Federico Busato <[email protected]>
Co-authored-by: David Bayer <[email protected]>

* More small changes

---------

Co-authored-by: Michael Schellenberger Costa <[email protected]>
Co-authored-by: David Bayer <[email protected]>
Co-authored-by: Federico Busato <[email protected]>