Update Wasm benchmarks #2957
Robbepop merged 28 commits into paritytech:master from Robbepop:rf-update-wasm-benchmarks
Conversation
The former `instr_i64const` benchmark had the problem that it was too simple to optimize for the new Wasmi (register), resulting in an artificial 20x speedup since Wasmi (register) mostly optimized the whole function body away. The new benchmark body still makes use of the new Wasmi (register) optimizations, but in a way that no computation is lost. The benchmark is therefore better aligned with real-world base weights for both the Wasmi (stack) and Wasmi (register) executors.
It now uses `local.get 0` instead of `i32.const` for the `if` condition, which makes it impossible for Wasmi (register) to aggressively optimize whole parts of the `if` away at compilation time, which would have created an unfair advantage over Wasmi (stack) in the benchmarks.
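The change to the `if` condition can be sketched in WAT. This is an illustrative shape only, not the exact generated benchmark bytecode:

```wat
;; Before: the condition is a compile-time constant, so Wasmi (register)
;; can fold the `if` (and the chosen branch) away during translation.
(if (i32.const 1) (then (nop)))

;; After: the condition comes from the first function parameter and is only
;; known at run time, so the `if` must be kept and actually executed.
(if (local.get 0) (then (nop)))
```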
bot bench substrate-pallet --pallet=pallet_contracts |
@Robbepop
bot cancel |
@Robbepop
bot bench substrate-pallet --pallet=pallet_contracts |
@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941568 was started for your command
@Robbepop
bot bench substrate-pallet --pallet=pallet_contracts |
@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941733 was started for your command
This is weird to me since we are changing an automatically generated file, but code review recommended this change to make the CI happy, so I am doing it.
@Robbepop
bot bench substrate-pallet --pallet=pallet_contracts |
@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4942135 was started for your command
…=dev --target_dir=substrate --pallet=pallet_contracts
@Robbepop
I have an explanation for the performance difference between the old and the new benchmark. The explanation is the following: Wasmi (stack) has optimized …
Given your findings we need to keep using big numbers to exercise this worst case. While it is "unrealistic" in practice that big numbers are used, it is very easy for an attacker to fill a contract with these instructions. If we want to profit from this optimization it has to be accounted for in the wasmi gas model, not by optimistically benchmarking for that optimization. Benchmarks have to cover the worst case.
While the benchmark itself only improves by 20%, you have to keep in mind that we are dividing that by 4 to account for the increased number of instructions. So you actually have to compare what is being put into the schedule, not only the raw benchmark results.
Before, we divided by 2. So with this PR we are actually using 40% less gas per instruction.
In defense of the new benchmark using proper …: the issue is that the previous benchmark used … If we want to do a proper job of accounting for the worst-case instruction, we probably should at least be explicit about our intent and use …

There technically are even worse offenders with respect to performance, though it always varies from engine to engine. For example, Wasmi (stack) is very efficient at handling function calls, whereas Wasmi (register) is much more efficient at handling computation and conditional branches. At the very moment where we are about to change the underlying Wasm execution engine soon, and furthermore where a thorough Polkadot SDK built-in gas model is in sight, I don't see a huge gain in trying to do a splendid job here right now. If division by 4 is a problem, then division by 2 (as planned by me originally) would be a good enough compromise.
But in this PR we are using the stack machine. We don't merge code to master saying "will be fixed with the next PR". If we want to write a benchmark for the register machine we need to do that on the PR that activates the register machine. So maybe you should merge the two PRs after all. You can still compare weights between different commits in the same PR.
Why would it matter if it was accidental? When I implemented it I was just paranoid and used random numbers so that no optimization could take place. And this paid off, as it turned out that the optimization was dodged, avoiding a potential security problem.
Then you should do exactly that. You can just use a maximum sized memory (as defined in the …
I don't agree with this "we can't be perfect so we might as well be less correct than right now" argument. I am not asking for perfection. I am just asking not to knowingly benchmark something that exercises a best-case scenario: executing 4 very fast instructions and then dividing this by 4. Please keep in mind that having too-high numbers is totally fine. It is safe. This code is in production. If we are reducing it by fiddling with the benchmark we need to be very careful. So please just be safe here. You can optimize with the register machine. So if it is less work for you, you can just do all of this in one PR so you don't have to care about the stack machine anymore.
After some discussion in chat we agreed to use the proposed design.

@athei I just implemented the agreed test case.
bot bench substrate-pallet --pallet=pallet_contracts |
bot bench substrate-pallet --pallet=pallet_contracts |
@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4975857 was started for your command
athei left a comment
Looks good. Let's wait for the benchmark results.
…=dev --target_dir=substrate --pallet=pallet_contracts
@Robbepop
Applied your review suggestions. Note that … After division by 6 (because of 6 instructions) the benchmark is roughly twice as fast (equal to half the gas costs). This is explained by the fact that before, every second instruction was causing worst-case performance behavior due to the random access of big value constants, whereas now only 2 of the 6 instructions have this worst-case behavior. However, at least we are now explicit about it and engine agnostic. So this should probably be fine.
In paritytech#2941 we found out that the new Wasmi (register) is very effective at optimizing away certain benchmark bytecode constructs in a way that created an unfair advantage over Wasmi (stack), which rendered our former benchmarks ineffective at properly measuring the performance impact. This PR adjusts both affected benchmarks to fix the stated problems. Affected are:

- `instr_i64const` -> `instr_i64add`: Renamed since it now measures the performance impact of the Wasm `i64.add` instruction with locals as inputs and outputs. This makes it impossible for Wasmi (register) to aggressively optimize away the entire function body (as it previously did) but still provides a way for Wasmi (register) to shine with its register-based execution model.
- `call_with_code_per_byte`: Now uses `local.get` instead of `i32.const` for the `if` condition, which prevents Wasmi (register) from aggressively optimizing away whole parts of the `if`, creating an unfair advantage.

cc @athei

---------

Co-authored-by: command-bot <>
Co-authored-by: Alexander Theißen <[email protected]>
Co-authored-by: Ignacio Palacios <[email protected]>
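The `instr_i64add` body described in the summary can be sketched in WAT. This is an illustrative shape only; the actual benchmark bytecode is generated by the pallet's benchmarking code and may differ:

```wat
;; Illustrative: adds flow through locals, so every input and output is only
;; known at run time and no computation can be constant-folded away, yet a
;; register-based engine can still map the locals directly to registers.
(func (param $x i64) (param $y i64) (result i64)
  (local.set $x (i64.add (local.get $x) (local.get $y)))
  (local.set $x (i64.add (local.get $x) (local.get $y)))
  (local.get $x)
)
```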
