Faster vartime division for `Uint` #608

andrewwhitehead · 2024-06-12T23:43:41Z

Implements faster vartime division (vartime with the divisor only) for Uint based on Knuth's TAOCP volume 2, as outlined at https://janmr.com/blog/2014/04/basic-multiple-precision-long-division/

This does not address vartime division for BoxedUint or the other TODOs in #511

Relevant benchmarks:

wrapping ops/div/rem_vartime, U256/U128, full size
                        time:   [84.091 ns 84.290 ns 84.525 ns]
                        change: [-85.312% -85.155% -84.896%] (p = 0.00 < 0.05)
                        Performance has improved.
wrapping ops/rem_vartime, U256/U128, full size
                        time:   [84.129 ns 84.474 ns 84.890 ns]
                        change: [-84.208% -84.057% -83.898%] (p = 0.00 < 0.05)
                        Performance has improved.
wrapping ops/rem_wide_vartime, U256
                        time:   [166.95 ns 167.27 ns 167.69 ns]
                        change: [-94.750% -94.705% -94.636%] (p = 0.00 < 0.05)
                        Performance has improved.
Dynamic Montgomery arithmetic/MontyParams::new_vartime
                        time:   [459.28 ns 460.48 ns 461.88 ns]
                        change: [-80.512% -80.452% -80.390%] (p = 0.00 < 0.05)
                        Performance has improved.

Signed-off-by: Andrew Whitehead <[email protected]>

andrewwhitehead · 2024-06-13T17:15:40Z

It looks like Uint::mul_mod would also be 90% faster using this update and split_mut/rem_wide_vartime, and could accept an even modulus. The current implementation uses MontyParams::new_vartime so it seems like it is vartime for the modulus, but maybe that should be renamed mul_mod_vartime anyway? Accepting a &NonZero<Uint> for the modulus would also simplify things.

Signed-off-by: Andrew Whitehead <[email protected]>

src/uint/div.rs

andrewwhitehead · 2024-06-13T18:11:10Z

src/uint/div.rs

+        // If the subtraction borrowed, then decrement q and add back the divisor
+        let ct_borrow = ConstChoice::from_word_mask(borrow.0);
+        carry = Limb::ZERO;
+        i = 0;
+        while i < yc {
+            (x[xi + i + 1 - yc], carry) =
+                x[xi + i + 1 - yc].adc(Limb::select(Limb::ZERO, y[i], ct_borrow), carry);
+            i += 1;
+        }
+        quo -= ct_borrow.select_word(0, 1);


I can't actually get this to trigger in practice (except by incrementing quo earlier). That's probably because "the probability that the adding back in step L5 must be executed is of order 2/b" ie. 1/2**31 or 1/2**63 in this case.

Signed-off-by: Andrew Whitehead <[email protected]>

tarcieri · 2024-07-26T17:31:24Z

The current implementation uses MontyParams::new_vartime so it seems like it is vartime for the modulus, but maybe that should be renamed mul_mod_vartime anyway?

That's a good point... perhaps there should be both Uint::mul_mod as well as Uint::mul_mod_vartime with the former using MontyParams::new? (looks like it would need some extra bounds)

tarcieri · 2024-07-26T18:59:25Z

@fjarri thoughts on this versus #511?

fjarri · 2024-07-27T00:24:31Z

This is definitely much faster than what I have there, and as for the naming stuff, I'll make a separate PR.

src/uint/div.rs

fjarri · 2024-07-27T00:29:24Z

src/uint/div.rs

-        }
-
-        rem
+        self.div_rem_vartime(rhs).1


I wonder if manually modifying div_rem_vartime() with the assumption that the quotient is discarded produces a faster result.

I don't believe there would be much benefit, it would just avoid storing the quotient words and then shifting them at the end. I think I also tried an inlined version to see if the optimizer would catch anything and it didn't seem to have any impact.

src/uint/div.rs

fjarri · 2024-07-27T00:31:47Z

src/uint/div.rs

    ///
    /// When used with a fixed `rhs`, this function is constant-time with respect
    /// to `self`.
    pub const fn rem_wide_vartime(lower_upper: (Self, Self), rhs: &NonZero<Self>) -> Self {


Could this be reduced to div_rem_vartime() by widening rhs to double width, and then halving the result? So that you don't have to spell out the whole algorithm again.

I don't think that can be done without extra trait bounds or const evaluation support.

tarcieri · 2024-07-27T15:43:40Z

This splits Uint::mul_mod and Uint::mul_mod_vartime: #623

Signed-off-by: Andrew Whitehead <[email protected]>

andrewwhitehead · 2024-07-27T15:48:47Z

I also have vartime division for BoxedUint implemented, which I was planning to add in a separate PR.

fjarri · 2024-08-01T22:33:32Z

Just a note: I experimented a little with trying to piggyback rem_wide_vartime() on div_rem_vartime(), and the following kinda works:

    pub const fn rem_wide_vartime<const R: usize>(lower_upper: (Self, Self), rhs: &NonZero<Self>) -> Self
    where
        Self: Concat<Output=Uint<R>>,
        Uint<R>: Split<Output=Self>,
    {
        let (lo, hi) = lower_upper;
        let wide_self: <Self as Concat>::Output = lo.concat(&hi);
        let (_q, r) = wide_self.div_rem_vartime(rhs);
        r
    }

The problem is, it then requires an <R> generic parameter in some Monty trait methods, which doesn't make sense when they're implemented for BoxedUint. So yeah, we'll have to wait until const traits are a thing so that we can say "Concat::Output is an Uint (or Integer)" without spelling out the concrete type.

implement faster vartime division for Uint

67c85cd

Signed-off-by: Andrew Whitehead <[email protected]>

flatten nested block

47a15b4

Signed-off-by: Andrew Whitehead <[email protected]>

tarcieri reviewed Jun 13, 2024

View reviewed changes

src/uint/div.rs Outdated Show resolved Hide resolved

andrewwhitehead commented Jun 13, 2024

View reviewed changes

move div/rem implementation into method body

9b0dc1a

Signed-off-by: Andrew Whitehead <[email protected]>

tarcieri requested a review from fjarri July 26, 2024 18:59

fjarri reviewed Jul 27, 2024

View reviewed changes

src/uint/div.rs Outdated Show resolved Hide resolved

fjarri reviewed Jul 27, 2024

View reviewed changes

src/uint/div.rs Outdated Show resolved Hide resolved

fjarri reviewed Jul 27, 2024

View reviewed changes

remove unnecessary checks for zero divisor

95c8a61

Signed-off-by: Andrew Whitehead <[email protected]>

tarcieri approved these changes Jul 31, 2024

View reviewed changes

tarcieri merged commit e948539 into RustCrypto:master Jul 31, 2024

tarcieri changed the title ~~Faster vartime division for Uint~~ Faster vartime division for Uint Jul 31, 2024

tarcieri mentioned this pull request Jul 31, 2024

Variable-time division improvements #511

Closed

andrewwhitehead mentioned this pull request Aug 2, 2024

Faster vartime division for BoxedUint #626

Merged

fjarri mentioned this pull request Aug 5, 2024

rem_wide_vartime() issues #638

Closed

Faster vartime division for Uint #608

Faster vartime division for Uint #608

Uh oh!

Conversation

andrewwhitehead commented Jun 12, 2024

Uh oh!

andrewwhitehead commented Jun 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

andrewwhitehead Jun 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tarcieri commented Jul 26, 2024

Uh oh!

tarcieri commented Jul 26, 2024

Uh oh!

fjarri commented Jul 27, 2024

Uh oh!

Uh oh!

fjarri Jul 27, 2024

Choose a reason for hiding this comment

Uh oh!

andrewwhitehead Jul 27, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fjarri Jul 27, 2024

Choose a reason for hiding this comment

Uh oh!

andrewwhitehead Jul 27, 2024

Choose a reason for hiding this comment

Uh oh!

tarcieri commented Jul 27, 2024

Uh oh!

andrewwhitehead commented Jul 27, 2024

Uh oh!

fjarri commented Aug 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Faster vartime division for `Uint` #608

Faster vartime division for `Uint` #608

andrewwhitehead commented Jun 13, 2024 •

edited

Loading

andrewwhitehead Jun 13, 2024 •

edited

Loading