Faster GCD and MOD (single limb) #203
|
@madhur4127 : Looks like a great improvement. Could you post your benchmark code so I could look at the asm? |
|
@NAThompson, sure, here's the code: I generally do One thing I noticed is that GMP's speedup is approximately constant, which suggests that this is the optimal algorithmic choice and that the only thing needing work is constant-factor optimization of the modulus operation. |
|
Good catch! I had to change the int64 to int32 on msvc but then I see: |
My bad, I thought I had turned on compiler optimizations. I have posted the updated findings. Thanks, John, for posting your results. These results look counter-intuitive to me: 400 bits means double the number of limbs with 32-bit limbs as opposed to 64-bit limbs, and because modulus is a costly operation it should have been a bottleneck. Considering the differences in CPU, I could see that the GMP code runs 2 times faster, but this PR's code changed the speedup from 2 to 3 and the original code runs almost 2 times faster! GMP can be made the baseline as it has assembly in most parts. But like I mentioned, this PR's code has a speedup of around 3 while it should run slower in 32 bit! Did the speedup increase with an increase in the number of operations? Or are some compiler tricks at play here? Can somebody else share their results?
|
|
Ubuntu/gcc-9.2.1 results: |
|
With the Intel 19.1 compiler: |
Somebody committed a file that doesn't actually build: I'll look into it.
Tiny bit faster here!
I think because this is still marked as a draft PR? |
|
This looks good, and raises some interesting questions:
|
This is easy to handle in a similar fashion, I should probably make it a template function to handle mod(unsigned __int128, unsigned __int128):
sub rsp, 8
call __umodti3
add rsp, 8
ret
so it is known to the compiler how to efficiently do this. Modulus must also be taken in this way too, like a sliding window for
This is where the hard part is: algorithmically converting a gcd of two cpp_ints to a single limb. I believe converting one of them to a single or double limb is not possible without modifying the other, so in the end, modulus will be inefficient.
Surely binary GCD is faster on approximately similar sizes. I will have to widen my algorithmic toolbox to handle two arbitrary big integers. I am thinking of picking up Knuth's Vol 2 and it will take me some time to fully understand the algorithms to the point where I can efficiently code it up.
Then we'll have to simplify the modulus, something like this: Godbolt link.
Compilers can also do that, see the |
Do you remember the strategy I showed you for 32*32 mul -> 64 result? I wonder if you can think the same way here for an efficient modulus operation. The assembly you show seems like it might be for a full-on heavy 128%128 -> 128 result. That might be why the compiler generates potentially time-consuming code that calls umod, jumps there, and restores a register or two upon return (slow stuff). If you seek speed, and if possible, coerce the compiler to do the mod of 128 % 64 -> 64, if this satisfies the math requirements in your algorithm (I do not know if this is the case). Kind regards, Chris |
I don't know how to convert |
I didn't really know if it would work in your particular use-case either. I just got the possibility on the table in case it was relevant for your case at hand (which I was unsure of). Best, Chris |
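The 128 % 64 -> 64 reduction suggested above can be written down as a minimal sketch. The helper name `mod_128_by_64` is hypothetical and not part of the PR, and the code assumes a compiler that provides `unsigned __int128` (GCC or Clang). Reducing the high word first guarantees that the quotient fits in 64 bits, which is the precondition for a single hardware divide on x86-64. Whether a given compiler actually emits that instruction rather than a call to `__umodti3` is not guaranteed.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the PR.
// Returns (hi * 2^64 + lo) mod m for a nonzero 64-bit modulus m.
// Assumes unsigned __int128 is available (GCC/Clang extension).
inline std::uint64_t mod_128_by_64(std::uint64_t hi, std::uint64_t lo, std::uint64_t m)
{
    // Reduce the high word first: afterwards hi < m, so the quotient of
    // the 128-bit value divided by m fits in 64 bits.
    hi %= m;
    const unsigned __int128 v = (static_cast<unsigned __int128>(hi) << 64) | lo;
    return static_cast<std::uint64_t>(v % m);
}
```

Inspecting the generated assembly (for example on Godbolt) is the only reliable way to see which path a particular compiler takes.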
|
AFAICT this PR is in good shape for a single limb. I don't have any optimizations for a double limb except to take the modulus and convert both operands to double limbs; this approach is feasible only for 32-bit limbs (MSVC). I initially thought of converting the double-limb modulus to two (or more) single-limb moduli and then combining the results using the Chinese Remainder Theorem. This approach would be too slow. The general case for GCD will be handled in a subsequent PR. Also, the motive for implementing the general case first is that it will make clear whether modulus with a double limb is slower than the general case, and we can then decide accordingly. |
|
One thing I noticed is that modulus on a single limb can be done in O(number_of_limbs) by using a sliding window technique (similar to rolling hashes) instead of calling generic So specializing this function can improve the performance of GCD several times over, I guess. |
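A minimal sketch of the sliding-window idea, under stated assumptions: the number is held as a `std::vector` of 32-bit limbs stored least-significant first, and the modulus is a single nonzero 32-bit limb, so every intermediate value fits in a `std::uint64_t`. The function name `mod_single_limb` is hypothetical. It does not use the `cpp_int` backend internals, which are not shown in this thread.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's code.
// limbs: 32-bit limbs, least-significant limb first.
// m:     nonzero single-limb modulus.
inline std::uint32_t mod_single_limb(const std::vector<std::uint32_t>& limbs, std::uint32_t m)
{
    std::uint64_t r = 0;
    // Walk from the most significant limb down, Horner style:
    // r = (r * 2^32 + limb) mod m. Because r < m < 2^32 after each step,
    // (r << 32) | limb never overflows 64 bits.
    for (auto it = limbs.rbegin(); it != limbs.rend(); ++it)
        r = ((r << 32) | *it) % m;
    return static_cast<std::uint32_t>(r);
}
```

This performs one hardware modulus per limb and needs no temporaries beyond the running remainder. With 64-bit limbs the same loop needs a 128-bit intermediate, which is where the cost of the double-width modulus discussed in this thread comes in.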
|
I think the sliding window is an excellent idea, I quickly coded up: Which is much better than a full divide and remainder; the casts could also be elided if we know that the modulus is small enough in value. |
|
Ha, that code is actually slower than calling eval_modulus! :( |
|
Best I can come up with for single limb modulus is: Which has about the same performance as the existing divide routine, @madhur4127 did you have something better in mind? |
|
@jzmaddock, I reduced the modulo operations to N; your code takes 2*N modulo operations. My idea was like Horner's rule. Can you run the benchmark on ICC? I think ICC can optimize the 128-bit modulus operation. |
|
I'm clueless how to do 64*64%64, compilers call up I observed around 11-16% speedup. Edit: Montgomery Multiplication is a candidate for fixed modulus. |
|
Bit slower for me with msvc: |
|
There are too many 746667 in the above benchmark with different CPU times! |
|
Ubuntu results: So about the same? |
Yes, they look pretty similar. I used It's funny how an O(N) time, O(1) space algorithm lost because of a single operation. I hope this will get better in the future :) The most curious case is MSVC, where the 32*32%64bit modulo must have been done in native hardware instructions. |
|
The algorithm you have there looks like the "right way" to me, and with a bit of jiggling around I think I have about 30% knocked off the times; bear with me a bit while I do some more testing though... |
This version is also untested after the modular commits. I also plan to stress it. |
|
Those last 2 commits cause massive failures for me locally - I doubt very much that line is performance critical anyway (even if that approach would be more elegant)? |
|
@madhur4127 : I've fixed some test failures, removed some tabs and other SNAFU's, but also streamlined the code slightly (fewer temporaries and fewer function calls) in this: #215. Can you performance compare to what you have now? I make it about a third quicker, mostly through fewer temporaries. |
@jzmaddock, my apologies, it was very late when I pushed those commits. I wanted to resume work in the morning, so that was a sort of reminder.
I see a 10% improvement from yesterday, so I went ahead and removed the temporary completely. |
Changes Binary GCD to Euclid for first iteration of GCD. This reduces the computation of N x 1 gcd to 1 x 1 gcd, which can be further solved by Binary GCD.
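The strategy in the description can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: limbs are 32-bit and stored least-significant first, the single-limb operand is nonzero, and the names `binary_gcd` and `gcd_big_by_limb` are hypothetical.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical 1 x 1 binary (Stein) GCD on single limbs.
inline std::uint32_t binary_gcd(std::uint32_t a, std::uint32_t b)
{
    if (a == 0) return b;
    if (b == 0) return a;
    unsigned shift = 0;
    // Factor out the powers of two common to both operands.
    while (((a | b) & 1u) == 0) { a >>= 1; b >>= 1; ++shift; }
    while ((a & 1u) == 0) a >>= 1;        // make a odd
    while (b != 0) {
        while ((b & 1u) == 0) b >>= 1;    // make b odd
        if (a > b) std::swap(a, b);       // keep a <= b
        b -= a;                           // odd - odd is even (or zero)
    }
    return a << shift;
}

// Hypothetical N x 1 GCD: one Euclid step, then binary GCD.
// limbs: 32-bit limbs, least-significant first; m must be nonzero.
inline std::uint32_t gcd_big_by_limb(const std::vector<std::uint32_t>& limbs, std::uint32_t m)
{
    // Euclid step: gcd(big, m) == gcd(m, big mod m).
    std::uint64_t r = 0;
    for (auto it = limbs.rbegin(); it != limbs.rend(); ++it)
        r = ((r << 32) | *it) % m;
    return binary_gcd(m, static_cast<std::uint32_t>(r));
}
```

After the single Euclid step both operands fit in one limb, so the remaining work no longer depends on N.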
The number of bits (N) of the bigger operand is shown by the numeral in the benchmark name with the type used.
EDIT: I erroneously thought I had turned on O2