shared: attempt a faster date <-> rata die algorithm #449
Conversation
cc @benjoffe - I hope you don't mind the ping. :-)
This comes from: https://www.benjoffe.com/fast-date-64

More specifically, the code is here: https://github.com/benjoffe/fast-date-benchmarks/blob/7fcf82b07d340ddbec866e15cfe333485439ab7f/algorithms/benjoffe_fast64.hpp

But the benchmarks don't show an improvement:

```
$ critcmp base x01
group                                                 base                          x01
-----                                                 ----                          ---
civil_datetime/to_timestamp_static/bundled/jiff       1.00    10.5±0.09ns  ? ?/sec  1.03    10.9±0.07ns  ? ?/sec
civil_datetime/to_timestamp_static/zoneinfo/jiff      1.00    10.4±0.09ns  ? ?/sec  1.03    10.8±0.07ns  ? ?/sec
timestamp/to_civil_datetime_offset_conversion/jiff    1.00     4.4±0.05ns  ? ?/sec  1.03     4.6±0.03ns  ? ?/sec
```

I ran the benchmarks like this:

```
cd bench
```

Before the change:

```
cargo bench -- '(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline base
```

And then after the change:

```
cargo bench -- '(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline x01
```

Then I used [`critcmp`] to compare them:

```
critcmp base x01
```

It's very possible I didn't port it correctly. I haven't scrutinized the codegen. It's also possible that there is an improvement, but that it's hard to write a benchmark using Jiff APIs to observe it.

(Note that I left out the ARM-specific bits. I'm testing this on x86-64. I wanted to test there first before digging into the platform-specific optimizations.)

[`critcmp`]: https://github.com/BurntSushi/critcmp
Hi, thanks for trying out the algorithm! I'm very keen to help out; it'll be great to see this used successfully in the wild. I should note up front that I have never programmed in Rust before, though it's been high on my list to try, and this is the perfect push. At first glance the port looks right, and if it passes the tests I assume it's correct.
And finally, perhaps you are testing this on an x64 Mac? There is a note at the end of my blog post that x64 Mac does not surface the same speed gains in the benchmarks (it's still faster, but by less), which I presume is due to power management features of the OS or chip activating when 64-bit math runs in a tight loop.

If it turns out that Rust is unable to compile the 128-bit arithmetic down to direct register access, then the 32-bit friendly alternative may end up faster in this language. It would be unfortunate to have to fall back to that, as it only achieves about 3/4 of the speed gains over Neri-Schneider, but it could be an interesting thing to test if you run out of ideas.

Finally, I note that the inverse function can be one cycle faster if you only need to support 32-bit inputs but are only targeting speed on 64-bit machines, by using 64-bit internal arithmetic. I'll be able to have a deeper dive later, hopefully in the next week or two.
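For reference, the 128-bit arithmetic in question boils down to a high-half multiply. A minimal sketch of the pattern (an illustration of the technique being discussed, not code from this PR):

```rust
/// Returns the high 64 bits of the 128-bit product `a * b`.
///
/// On x86-64, LLVM is expected to lower this to a single `mul`
/// instruction, with the high half of the product read from a
/// register; no real 128-bit arithmetic should be materialized.
#[inline(always)]
fn umulh(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) >> 64) as u64
}
```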
I don't have hands on a keyboard, so excuse the short reply. But thank you for responding! rustc uses LLVM, so I would be very surprised if rudimentary constant folding or simple branch avoidance weren't happening. The 128-bit handling I'm less sure about, though. For your benchmarks, did you use gcc or clang? Either way, when I get hands on a keyboard, I'll post the codegen here (and show you how I got it). The previous Neri-Schneider implementation used similar techniques, and I had looked at its codegen and at least verified that there were no branches or div instructions. My benchmarks above were on a quiet Linux desktop with the CPU governor set to performance mode. Thanks again for taking a look!
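(For the curious, one common way to dump the codegen with stock cargo — a sketch; exact artifact names vary by crate and hash:)

```
# From the crate root: emit assembly for the library crate.
# The .s files land in target/release/deps/.
cargo rustc --release -- --emit asm

# Then search the generated assembly for the relevant symbols.
grep -n 'mul' target/release/deps/*.s | less
```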
Also, Jiff only requires 32-bit integers to represent rata die. Namely, Jiff only supports years -9999 to 9999.
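(A quick back-of-the-envelope check of that claim — my sketch, not code from Jiff:)

```rust
// ~20,000 years at no more than 366 days each is under 8 million days,
// which is far inside i32's range of roughly +/- 2.1 billion.
const MAX_ABS_RATA_DIE: i64 = 20_000 * 366;
const _: () = assert!(MAX_ABS_RATA_DIE < i32::MAX as i64);
```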
The benchmark table at the bottom of the blog post specifies the compilers used: for Windows it was MSVC 19.44, and on the Apple M4 Pro it was Apple clang 17.0.0. Since the date range is restricted to just ±9999, you don't need to hoist to 64-bit to safely use the more compact and faster code in the inverse:
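(What follows is a hypothetical sketch of the idea rather than the exact snippet from the blog post: with a 32-bit input, a fixed-point multiply by a 32-bit constant fits entirely in 64 bits, so no 128-bit widening is needed.)

```rust
// Hypothetical sketch of the trick: with `n` known to fit in 32 bits,
// floor(n * k / 2^32) can be computed entirely in 64-bit arithmetic,
// with no 128-bit widening multiply at all.
#[inline(always)]
fn mul_shift_32(n: u32, k: u32) -> u32 {
    (((n as u64) * (k as u64)) >> 32) as u32
}
```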
Upon further investigation, perhaps the issue is the usage of `(A as u128) * (B as u128)`.
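One way to test that theory (a suggestion, not something verified here) is to isolate the pattern so it's easy to find in the assembly:

```rust
// If this lowers to a single `mul` with the high half read from a
// register, the u128 widening multiply is not the culprit; a call into
// a runtime helper such as `__multi3` would point the other way.
#[inline(never)] // kept out-of-line on purpose so it's easy to spot
pub fn probe_mul_hi(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) >> 64) as u64
}
```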