Skip to content

Conversation

@andrewwhitehead
Copy link
Contributor

Implements faster vartime division (vartime with the divisor only) for Uint based on Knuth's TAOCP volume 2, as outlined at https://janmr.com/blog/2014/04/basic-multiple-precision-long-division/

This does not address vartime division for BoxedUint or the other TODOs in #511

Relevant benchmarks:

wrapping ops/div/rem_vartime, U256/U128, full size
                        time:   [84.091 ns 84.290 ns 84.525 ns]
                        change: [-85.312% -85.155% -84.896%] (p = 0.00 < 0.05)
                        Performance has improved.
wrapping ops/rem_vartime, U256/U128, full size
                        time:   [84.129 ns 84.474 ns 84.890 ns]
                        change: [-84.208% -84.057% -83.898%] (p = 0.00 < 0.05)
                        Performance has improved.
wrapping ops/rem_wide_vartime, U256
                        time:   [166.95 ns 167.27 ns 167.69 ns]
                        change: [-94.750% -94.705% -94.636%] (p = 0.00 < 0.05)
                        Performance has improved.
Dynamic Montgomery arithmetic/MontyParams::new_vartime
                        time:   [459.28 ns 460.48 ns 461.88 ns]
                        change: [-80.512% -80.452% -80.390%] (p = 0.00 < 0.05)
                        Performance has improved.

@andrewwhitehead
Copy link
Contributor Author

andrewwhitehead commented Jun 13, 2024

It looks like Uint::mul_mod would also be 90% faster using this update and split_mut/rem_wide_vartime, and could accept an even modulus. The current implementation uses MontyParams::new_vartime so it seems like it is vartime for the modulus, but maybe that should be renamed mul_mod_vartime anyway? Accepting a &NonZero<Uint> for the modulus would also simplify things.

Signed-off-by: Andrew Whitehead <[email protected]>
src/uint/div.rs Outdated
Comment on lines 695 to 704
// If the subtraction borrowed, then decrement q and add back the divisor
let ct_borrow = ConstChoice::from_word_mask(borrow.0);
carry = Limb::ZERO;
i = 0;
while i < yc {
(x[xi + i + 1 - yc], carry) =
x[xi + i + 1 - yc].adc(Limb::select(Limb::ZERO, y[i], ct_borrow), carry);
i += 1;
}
quo -= ct_borrow.select_word(0, 1);
Copy link
Contributor Author

@andrewwhitehead andrewwhitehead Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't actually get this to trigger in practice (except by incrementing quo earlier). That's probably because "the probability that the adding back in step L5 must be executed is of order 2/b" ie. 1/2**31 or 1/2**63 in this case.

@tarcieri
Copy link
Member

The current implementation uses MontyParams::new_vartime so it seems like it is vartime for the modulus, but maybe that should be renamed mul_mod_vartime anyway?

That's a good point... perhaps there should be both Uint::mul_mod as well as Uint::mul_mod_vartime with the former using MontyParams::new? (looks like it would need some extra bounds)

@tarcieri
Copy link
Member

@fjarri thoughts on this versus #511?

@tarcieri tarcieri requested a review from fjarri July 26, 2024 18:59
@fjarri
Copy link
Contributor

fjarri commented Jul 27, 2024

This is definitely much faster than what I have there, and as for the naming stuff, I'll make a separate PR.

}

rem
self.div_rem_vartime(rhs).1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if manually modifying div_rem_vartime() with the assumption that the quotient is discarded produces a faster result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there would be much benefit, it would just avoid storing the quotient words and then shifting them at the end. I think I also tried an inlined version to see if the optimizer would catch anything and it didn't seem to have any impact.

///
/// When used with a fixed `rhs`, this function is constant-time with respect
/// to `self`.
pub const fn rem_wide_vartime(lower_upper: (Self, Self), rhs: &NonZero<Self>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be reduced to div_rem_vartime() by widening rhs to double width, and then halving the result? So that you don't have to spell out the whole algorithm again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that can be done without extra trait bounds or const evaluation support.

@tarcieri
Copy link
Member

This splits Uint::mul_mod and Uint::mul_mod_vartime: #623

@andrewwhitehead
Copy link
Contributor Author

I also have vartime division for BoxedUint implemented, which I was planning to add in a separate PR.

@tarcieri tarcieri merged commit e948539 into RustCrypto:master Jul 31, 2024
@tarcieri tarcieri changed the title Faster vartime division for Uint Faster vartime division for Uint Jul 31, 2024
@fjarri
Copy link
Contributor

fjarri commented Aug 1, 2024

Just a note: I experimented a little with trying to piggyback rem_wide_vartime() on div_rem_vartime(), and the following kinda works:

    pub const fn rem_wide_vartime<const R: usize>(lower_upper: (Self, Self), rhs: &NonZero<Self>) -> Self
    where
        Self: Concat<Output=Uint<R>>,
        Uint<R>: Split<Output=Self>,
    {
        let (lo, hi) = lower_upper;
        let wide_self: <Self as Concat>::Output = lo.concat(&hi);
        let (_q, r) = wide_self.div_rem_vartime(rhs);
        r
    }

The problem is, it then requires an <R> generic parameter in some Monty trait methods, which doesn't make sense when they're implemented for BoxedUint. So yeah, we'll have to wait until const traits are a thing so that we can say "Concat::Output is an Uint (or Integer)" without spelling out the concrete type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants