Skip to content

Conversation

@BurntSushi
Copy link
Owner

This comes from:
https://www.benjoffe.com/fast-date-64

More specifically, the code is here:
https://github.com/benjoffe/fast-date-benchmarks/blob/7fcf82b07d340ddbec866e15cfe333485439ab7f/algorithms/benjoffe_fast64.hpp

But the benchmarks don't show an improvement:

$ critcmp base x01
group                                                   base                                   x01
-----                                                   ----                                   ---
civil_datetime/to_timestamp_static/bundled/jiff         1.00     10.5±0.09ns        ? ?/sec    1.03     10.9±0.07ns        ? ?/sec
civil_datetime/to_timestamp_static/zoneinfo/jiff        1.00     10.4±0.09ns        ? ?/sec    1.03     10.8±0.07ns        ? ?/sec
timestamp/to_civil_datetime_offset_conversion/jiff      1.00      4.4±0.05ns        ? ?/sec    1.03      4.6±0.03ns        ? ?/sec

I ran the benchmarks like this:

cd bench

Before the change:

cargo bench --'(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline base

And then after the change:

cargo bench --'(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline x01

Then I used critcmp to compare them:

critcmp base x01

It's very possible I didn't port it correctly. I haven't scrutinized the
codegen. It's also possible that there is an improvement, but that it's
hard to write a benchmark using Jiff APIs to observe it.

(Note that I left out the ARM-specific bits. I'm testing this on x86-64.
I wanted to test there first before digging into the platform specific
optimizations.)

@BurntSushi
Copy link
Owner Author

cc @benjoffe - I hope you don't mind the ping. :-)

This comes from:
https://www.benjoffe.com/fast-date-64

More specifically, the code is here:
https://github.com/benjoffe/fast-date-benchmarks/blob/7fcf82b07d340ddbec866e15cfe333485439ab7f/algorithms/benjoffe_fast64.hpp

But the benchmarks don't show an improvement:

```
$ critcmp base x01
group                                                   base                                   x01
-----                                                   ----                                   ---
civil_datetime/to_timestamp_static/bundled/jiff         1.00     10.5±0.09ns        ? ?/sec    1.03     10.9±0.07ns        ? ?/sec
civil_datetime/to_timestamp_static/zoneinfo/jiff        1.00     10.4±0.09ns        ? ?/sec    1.03     10.8±0.07ns        ? ?/sec
timestamp/to_civil_datetime_offset_conversion/jiff      1.00      4.4±0.05ns        ? ?/sec    1.03      4.6±0.03ns        ? ?/sec
```

I ran the benchmarks like this:

```
cd bench
```

Before the change:

```
cargo bench --'(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline base
```

And then after the change:

```
cargo bench --'(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline x01
```

Then I used [`critcmp`] to compare them:

```
critcmp base x01
```

It's very possible I didn't port it correctly. I haven't scrutinized the
codegen. It's also possible that there is an improvement, but that it's
hard to write a benchmark using Jiff APIs to observe it.

(Note that I left out the ARM-specific bits. I'm testing this on x86-64.
I wanted to test there first before digging into the platform specific
optimizations.)

[`critcmp`]: https://github.com/BurntSushi/critcmp
@benjoffe
Copy link

benjoffe commented Nov 26, 2025

Hi, thanks for trying out the algorithm!

I'm very keen to help out, it'll be great to see this used successfully in the wild.

I have to first note, I have never programmed in Rust before, however it's been high on my list to try out, and this is the perfect push for me.

At first glance, it looks right, and if it passes tests I assume it's been ported correctly.

  1. perhaps rust is not pre-compiling constants such as 2048 * SCALE?
  2. perhaps the 128-bit arithmetic does not compile directly to raw 64-bit register access without extra work?
  3. perhaps the line let shift = if bump ... is compiling to an actual branch?

And finally, perhaps you are testing this on a x64 Mac? There is a note at the end of my blog post that x64 Mac does not surface the same speed gains in the benchmarks (it's still faster, but by less), which I presume is due to the power management features of the OS or chip activating when 64 bit math is run in a tight loop.

If it turns out that Rust is unable to compile the 128-bit arithmetic to direct register access, then it might be that the 32-bit friendly alternative is going to end up faster in this language*. It would be unfortunate to have to fallback to that, as it only achieves 3/4 of the speed gains over Neri-Schneider, but could be an interesting thing to test if you run out of ideas.

Finally, I note that the inverse function can be 1-cycle faster if you only need to support 32-bit inputs but are only targeting speed on 64-bit machines - by using 64-bit internal arithmetic and then changing yrs * 365 + yrs / 4 to yrs * 1461 / 4.

I'll be able to have a deeper dive later, hopefully in the next week or two.

@BurntSushi
Copy link
Owner Author

BurntSushi commented Nov 27, 2025

I don't have hands on a keyboard, so excuse the short reply. But thank you for responding!

rustc uses llvm, so I would be very surprised if rudimentary constant folding or simple branch avoidance wasn't happening. The 128-bit handling I'm less sure about though. For your benchmarks, did you use gcc or clang? Either way, when I get hands on a keyboard, I'll post the codegen here (and show you how I got it). The previous Neri-Schneider implementation used similar techniques, and I had looked at its codegen and at least verified there was no branching or div instructions.

My benchmarks above were on a quiet Linux desktop with the CPU governor set to performance mode.

Thanks again for taking a look!

@BurntSushi
Copy link
Owner Author

Also, Jiff only requires 32 bit integers to represent rata die. Namely, Jiff only supports years -9999 to 9999.

@benjoffe
Copy link

benjoffe commented Nov 27, 2025

The benchmark table at the bottom of the blog post specifies the compilers used. For Windows it was MSVC 19.44, and on Apple M4 Pro: Apple clang 17.0.0.

Since the date range is restricted to just +-9999 - you don't need to hoist to 64-bit to safely use the more compact, and faster code in the inverse: yrs * 1461 / 4

@benjoffe
Copy link

Upon further investigation, perhaps the issue is the usage of (A as u128) * (B as u128).
In the worst case, the compiler might possibly turn this into four u64 bit multiplications and stitch the results together.
There seems to be a feature: core::arch::x86_64::_mulx_u64 which might do this more directly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants