Improve memory usage and overall performance #8
Conversation
Preliminary benchmark improvements
cstockton left a comment
Good work overall! Just wanted to illustrate an additional optimization that would save a lot of allocs and didn't apply to the prior pull request.
ff1/ff1.go
Outdated
```go
// temp must be 16 bytes including j
// This will only be needed if maxJ > 1, for the inner for loop
if maxJ > 1 {
	temp = make([]byte, 16)
```
Lines 153, 173, 174 and this one could all share sections of a single N-byte buf. This saves on allocations not only at entry but across loop iterations in the general case, since chances are you can derive the length of some of those as a best guess from the input. To illustrate:
```go
const (
	// Every buffer here needs at least aes.BlockSize bytes, excluding the
	// primary buffer since that will be given a larger input-relative backing.
	minSliceBackings   = 6
	minPerSliceBufSize = aes.BlockSize * minSliceBackings
	// Average additional scratch space needed relative to input, no idea what
	// the figure is :) BlockSize * iteration count looks close?
	avgInputRelativeSize = aes.BlockSize * 10
)

// These are re-used in the for loop below.
var (
	relSize = avgInputRelativeSize * len(X)
	bufSize = relSize + minPerSliceBufSize
	buf     = make([]byte, bufSize)

	// Q's length is a multiple of 16; start it with a length of 16.
	// Q seems to be the main buffer here at a quick glance? If so we can
	// spill our relSize into that.
	Q     = buf[0:16:relSize]
	QOrig = buf[relSize : relSize : relSize+16]
	// R is guaranteed to be 16 bytes since it holds output of PRF.
	R = buf[relSize+16 : relSize+32 : relSize+32]
	// TODO: rename this
	temp      = buf[relSize+32 : relSize+32 : relSize+48]
	numBBytes = buf[relSize+48 : relSize+48 : relSize+64]
	Y         = buf[relSize+64 : relSize+64 : relSize+80]

	numB, br, bm, mod big.Int
	y, c              big.Int
)
```
The obnoxiously long var names are for illustration of course; you could put a clear derivative in a const to help document the size of the backings. But setting the capacity of these backings allows us to just append as we were before, with the benefit of saving our minimum allocation count. You can play with the sections you give each slice relative to input length until you can get away with the common best case of a single make.
Oh, and I've omitted the math to calculate the offsets for the capacity and length portions. You only have to write it once, but you could circumvent it with a counter and short variable declarations.
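The shared-backing idea can be boiled down to a tiny runnable sketch (names and sizes are illustrative, not the library's actual code): one allocation is carved into several scratch buffers with full slice expressions `buf[low:high:max]`, and capping each section's capacity means an append that would outgrow its section reallocates instead of clobbering the neighboring section.

```go
package main

import "fmt"

func main() {
	const blockSize = 16
	buf := make([]byte, 3*blockSize) // a single allocation backs three buffers

	Q := buf[0:0:blockSize]                           // len 0, cap 16
	R := buf[blockSize : blockSize : 2*blockSize]     // len 0, cap 16
	Y := buf[2*blockSize : 2*blockSize : 3*blockSize] // len 0, cap 16

	Q = append(Q, 0xAA) // lands at buf[0]
	R = append(R, 0xBB) // lands at buf[16]
	Y = append(Y, 0xCC) // lands at buf[32]

	// Each append stayed inside its own section of the one allocation.
	fmt.Println(buf[0], buf[16], buf[32]) // 170 187 204
}
```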
This is a good suggestion, thanks! I was initially a little apprehensive because I was worried about readability, but the variables remain as-is; it's just the underlying backing array that would change.
I think another thing I should be able to do is change A, B, and C to be byte slices, and only convert them to strings when needed, like when passing into SetString or at the final return
For large inputs where d > 16, maxJ becomes > 1, which causes the code to enter the inner for loop. Previously there was an issue where an input could be encrypted but the decrypted form didn't match the original plaintext. This commit fixes that. Note that correctness of the ciphertext in these scenarios is still unknown, since no test vector covers it.
This reduces only a few allocations for short inputs, but improves allocations on long inputs quite a bit.
```
benchmark                    old ns/op  new ns/op  delta
BenchmarkEncrypt/Sample7-8   10105      8913       -11.80%
BenchmarkEncryptLong-8       42491      40936      -3.66%

benchmark                    old allocs  new allocs  delta
BenchmarkEncrypt/Sample7-8   90          84          -6.67%
BenchmarkEncryptLong-8       109         99          -9.17%

benchmark                    old bytes  new bytes  delta
BenchmarkEncrypt/Sample7-8   2752       2464       -10.47%
BenchmarkEncryptLong-8       6979       6306       -9.64%
```
```
benchmark                    old ns/op  new ns/op  delta
BenchmarkEncrypt/Sample7-8   9309       5123       -44.97%
BenchmarkEncryptLong-8       42819      18058      -57.83%

benchmark                    old allocs  new allocs  delta
BenchmarkEncrypt/Sample7-8   84          43          -48.81%
BenchmarkEncryptLong-8       99          58          -41.41%

benchmark                    old bytes  new bytes  delta
BenchmarkEncrypt/Sample7-8   2464       1424       -42.21%
BenchmarkEncryptLong-8       6306       3905       -38.07%
```
- rev now just wraps revB; ~20% improvement in speed
- xorBytes is now inlined depending on use
- numRadix is replaced with just SetString directly
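A hedged sketch of the `rev`/`revB` pattern named in the commit (the function bodies here are assumptions; only the names come from the commit message): the byte-slice version does the work, and the string version is a thin wrapper so callers that already hold a `[]byte` skip the conversions.

```go
package main

import "fmt"

// revB returns a reversed copy of b.
func revB(b []byte) []byte {
	out := make([]byte, len(b))
	for i := range b {
		out[len(b)-1-i] = b[i]
	}
	return out
}

// rev wraps revB, paying the string conversions only at this boundary.
func rev(s string) string {
	return string(revB([]byte(s)))
}

func main() {
	fmt.Println(rev("12345")) // 54321
}
```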
There's no situation in which the input to the AES block will not be 16 bytes, so there's no need for CBC.
What's in this PR?
This builds on the changes in #3 by applying the memory usage / allocation / performance optimizations.
Changes to FF1:
- `Q`, `PQ`, `Y`, `R`, and the xored buffers now share a single backing buffer; `P` remains a single `blockSize` array on the stack (example for reference on long inputs).
- `numA`, `numB`, and `numC` are swapped via assignment and `SetBytes` rather than repetitive `SetString` and `Text` calls inside the Feistel round for loop.
- `m = u` (i is even) and `m = v` (i is odd), since there are only 2 that are used in an alternating fashion.

On average, FF1 performance improves quite a bit from the previous master HEAD (3d2762d).

Changes to FF3:
- `rev` now uses `revB`, which avoids `string`/`[]byte` conversions.
- Removed `cbcEncryptor`, since input blocks are always 16 bytes with iv = 0, so just use normal AES (ECB).
- `ciph` has been replaced with direct calls to AES encrypt.

MINOR BREAKING CHANGE TO BOTH:
`ff1` and `ff3` `NewCipher` now return a `Cipher` rather than a `*Cipher`, and methods are now on value receivers instead of pointer receivers, for consistency. Because of Go's transparent indirection, and since the struct fields are not exported, this shouldn't cause issues for most practical uses, as the only way to interact with the cipher is using `Encrypt` and `Decrypt`, which remain unchanged. This was done because in situations where many `NewCipher` calls must be made (ex. the tweak is different for each piece of data), using value semantics causes fewer allocations.

TODOs