
Update Wasm benchmarks #2957

Merged
Robbepop merged 28 commits into paritytech:master from Robbepop:rf-update-wasm-benchmarks
Jan 19, 2024

Conversation

@Robbepop
Contributor

In #2941 we found out that the new Wasmi (register) is very effective at optimizing away certain benchmark bytecode constructs, in a way that created an unfair advantage over Wasmi (stack) and rendered our former benchmarks ineffective at properly measuring the performance impact.

This PR adjusts both affected benchmarks to fix the stated problems.
Affected are

  • `instr_i64const` -> `instr_i64add`: Renamed since it now measures the performance impact of the Wasm `i64.add` instruction with locals as inputs and outputs. This makes it impossible for Wasmi (register) to aggressively optimize away the entire function body (as it previously did) while still providing a way for Wasmi (register) to shine with its register-based execution model.
  • `call_with_code_per_byte`: Now uses `local.get` instead of `i32.const` for the `if` condition, which prevents Wasmi (register) from aggressively optimizing away whole parts of the `if` and thereby gaining an unfair advantage.
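To illustrate, here is a minimal sketch of the shape of the adjusted `instr_i64add` body. The function signature, local names, and repetition count are illustrative assumptions, not the exact generated benchmark bytecode:

```wat
(module
  (func $instr_i64add (param $acc i64) (param $x i64) (result i64)
    ;; The benchmark generator repeats this pattern many times.
    ;; Inputs and outputs are locals, so the additions cannot be
    ;; constant-folded away at translation time:
    (local.set $acc (i64.add (local.get $acc) (local.get $x)))
    (local.set $acc (i64.add (local.get $acc) (local.get $x)))
    (local.get $acc)))
```

Because the operands live in locals, Wasmi (register) can still map them directly to registers (its strength), but it can no longer eliminate the computation entirely.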

cc @athei

The former `instr_i64const` benchmark had the problem that it was too simple for the new Wasmi (register) to optimize, resulting in an artificial 20x speedup since Wasmi (register) mostly optimized the whole function body away.
The new benchmark body still makes use of the new Wasmi (register) optimizations, but in a way that no computation is lost, so the benchmark is more closely aligned with real-world base weights for both the Wasmi (stack) and Wasmi (register) executors.
It now uses `local.get(0)` instead of `i32.const` for the `if` condition, which makes it impossible for Wasmi (register) to aggressively optimize whole parts of the `if` away at compilation time, which had previously created an unfair advantage over Wasmi (stack) in the benchmarks.
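The difference can be sketched as follows (illustrative snippets, not the generated benchmark code):

```wat
(module
  ;; Before: constant condition, so the translator can fold the
  ;; `if` (and the dead arm) away entirely at compilation time:
  (func $before
    (if (i32.const 1)
      (then (nop))))
  ;; After: the condition comes from a local, so its value is
  ;; unknown at translation time and the `if` must be kept:
  (func $after (param $cond i32)
    (if (local.get $cond)
      (then (nop)))))
```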
@Robbepop Robbepop requested a review from athei as a code owner January 17, 2024 10:02
@Robbepop Robbepop requested a review from a team January 17, 2024 10:02
@Robbepop
Contributor Author

bot bench substrate-pallet --pallet=pallet_contracts

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts (https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4940621) was cancelled in #2957 (comment)

@pgherveou pgherveou added the R0-no-crate-publish-required The change does not require any crates to be re-published. label Jan 17, 2024
@Robbepop
Contributor Author

bot cancel

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop Command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4940621 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4940621/artifacts/download.

@Robbepop
Contributor Author

bot bench substrate-pallet --pallet=pallet_contracts

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941568 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 2-5601ff89-fc96-4e77-ae30-0aefea757569 to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop Command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941568 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941568/artifacts/download.

@Robbepop
Contributor Author

bot bench substrate-pallet --pallet=pallet_contracts

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941733 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 3-ae5d6a3c-59f0-43bb-b0e3-da828d8189cb to cancel this command or bot cancel to cancel all commands in this pull request.

This is weird to me since we are changing an automatically generated file, but code review recommended this change to make the CI happy, so I did.
@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop Command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941733 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4941733/artifacts/download.

@Robbepop
Contributor Author

bot bench substrate-pallet --pallet=pallet_contracts

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4942135 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 4-a4beb28b-b880-41c2-85f3-8a120f623fcf to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Jan 17, 2024

@Robbepop Command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4942135 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4942135/artifacts/download.

@Robbepop
Contributor Author

I have an explanation for the performance difference between the old `instr_i64const` and the new `instr_i64add`.
Recap: Since `instr_i64add` executes roughly twice as many instructions as `instr_i64const`, we assumed it would also run twice as slow. However, benchmarks showed that it in fact runs roughly 20% faster. Why is that?

An explanation is the following: Wasmi (stack) has an optimized `i64.const` instruction for `i64` (and `i32`) values in the range `i32::MIN..i32::MAX`. For all other `i64` values it needs to load the constant from a global constant pool, which is slow because the access is neither local nor cache friendly (more or less random access).
However, in the old `instr_i64const` benchmark all of the `i64.const` values were generated uniformly in `i64::MIN..i64::MAX`, making it extremely unlikely (roughly 1/2^32) that a constant would be small enough for Wasmi (stack) to use the optimized instruction. The former `instr_i64const` benchmark therefore always tested the absolute, and somewhat unrealistic, worst case for `i64.const` instructions, whereas the new `instr_i64add` takes a more realistic approach: Wasmi (stack) can execute all of its instructions in a cache-friendly manner, making the results more realistic.
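Based on this description, the two paths can be sketched like so (the internal encoding behavior described in the comments is an assumption drawn from the discussion above, not quoted from the Wasmi sources):

```wat
;; Fits into the i32 range: Wasmi (stack) can use its optimized,
;; inline-encoded constant instruction (the fast path):
(i64.const 42)
;; Outside the i32 range (the overwhelmingly likely case for a value
;; drawn uniformly from i64::MIN..i64::MAX): loaded from the global
;; constant pool, a cache-unfriendly indirection (the slow path):
(i64.const 0x123456789ABCDEF0)
```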

Member

@athei athei left a comment


Given your findings we need to keep using big numbers to exercise this worst case. While it is "unrealistic" in practice that big numbers are used, it is very easy for an attacker to fill a contract with these instructions. If we want to profit from this optimization it has to be accounted for in the Wasmi gas model, not by optimistically benchmarking for that optimization. Benchmarks have to be for the worst case.

While the benchmark itself only improves by 20%, you have to keep in mind that we are dividing that by 4 to account for the increased number of instructions. So you actually have to compare what is being put into the schedule and not only the raw benchmark results.

Before, we divided by 2. So with this PR we are actually using 40% less gas per instruction.

@Robbepop
Contributor Author

Robbepop commented Jan 18, 2024

In defense of the new benchmark using proper `i64.add` instructions as base weight, I have to disagree.

The issue is that the previous benchmark using `i64.const` was accidentally hitting a worst case of Wasmi (stack); under Wasmi (register) this might behave totally differently since both Wasmi versions handle these big constants differently.
Therefore the old benchmark was only implicitly testing the worst case, and only by accident, since I am sure that the person who implemented it was not even aware of this Wasmi optimization for small constant values.

If we want to do a proper job of accounting for the worst-case instruction, we should at least be explicit about our intent and use `i64.load` or `i64.store`, or ideally both, with random access patterns over a huge linear memory area that does not fit into the CPU cache. However, at this point we are benchmarking CPU details. As discussed in chat, this should behave extremely similarly to the previous `i64.const` with big values, but is explicit about the worst-case random access rather than implicit.

Though, there technically are even worse offenders with respect to performance, it always varies from engine to engine. For example, Wasmi (stack) is very efficient at handling function calls, whereas Wasmi (register) is much more efficient at handling computation and conditional branches.

Given that we are about to change the underlying Wasm execution engine soon, and that a thorough Polkadot SDK built-in gas model is in sight, I don't see a huge gain in trying to do a splendid job here right now.

If division by 4 is a problem then division by 2 (as planned by me originally) would be a good enough compromise.

@athei
Member

athei commented Jan 18, 2024

> The issue is that the previous benchmark using i64.const was accidentally using a worst-case for the Wasmi (stack), however, under Wasmi (register) this might totally behave differently since both Wasmi versions handle these big constants differently.

But in this PR we are using the stack machine. We don't merge code to master saying "will be fixed with the next PR". If we want to write a benchmark for the register machine, we need to do that in the PR that activates the register machine. So maybe you should merge the two PRs after all. You can still compare weights between different commits within the same PR.

> Therefore the old benchmark was only implicitly testing the worst case and only by accident since I am sure that the person who implemented this was not even aware of this Wasmi optimization for small constant values.

Why would it matter if it was accidental? When I implemented it I was just paranoid and used random numbers so that no optimization could take place. And this paid off, as it turned out that the optimization was dodged, avoiding a potential security problem.

> If we want to do a proper job with accounting for the worst case instruction we probably should at least be explicit about our intent and use i64.load or i64.store or ideally both with random access patterns and a huge linear memory area that does not fit into the CPU cache. However, at this point we are benchmarking CPU details. As discussed in chat this should behave extremely similar to the previous i64.const with big values but is explicit about the worst-case random access and not implicit.

Then you should do exactly that. You can just use a maximum-sized memory (as defined in the `Config` trait). I know it is annoying that we overestimate the costs, but without a more precise gas model we simply can't reduce the weight safely.
But contract memory is pretty small, so chances are it does not tank performance too much.

> At the very moment where we are about to change the underlying Wasm execution engine soon and furthermore where a thorough Polkadot SDK built-in gas model is in sight, I don't see a huge gain in trying to do a splendid job here right now.

I don't agree with this "we can't be perfect so we might as well be less correct than right now" argument. I am not asking for perfection. I am just asking not to knowingly benchmark something that exercises a best-case scenario: executing 4 very fast instructions and then dividing by 4. Please keep in mind that having too-high numbers is totally fine. It is safe. This code is in production. If we are reducing it by fiddling with the benchmark, we need to be very careful. So please just be safe here. You can optimize with the register machine. So if it is less work for you, you can just do all of this in one PR so you don't have to care about the stack machine anymore.

@Robbepop
Contributor Author

Robbepop commented Jan 19, 2024

After some discussion in chat we agreed to use the proposed `i64.{load,store}` based benchmark, since it is quite similar to what was previously done by accident and models worst-case behavior for a base instruction with random memory access patterns, as described earlier. The new benchmark test case is suited for both Wasmi (stack) and Wasmi (register). Wasmi (register) can still profit from some minor optimizations inherent to its design but suffers from the same worst-case random memory accesses.
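A rough sketch of the shape of such a benchmark body follows. The memory size, locals, and addressing are illustrative assumptions; the actual benchmark derives its random access pattern differently:

```wat
(module
  (memory 1)
  (func $instr_i64_load_store (param $addr i32)
    ;; Repeated by the benchmark generator with (pseudo-)random
    ;; addresses, so that neither engine can fold the accesses away
    ;; and the CPU cache is defeated:
    (i64.store
      (local.get $addr)
      (i64.load (local.get $addr)))))
```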

@athei I just implemented this agreed design test case.

@Robbepop
Contributor Author

bot bench substrate-pallet --pallet=pallet_contracts

@command-bot

This comment was marked as outdated.

@command-bot

This comment was marked as outdated.

@Robbepop
Contributor Author

bot bench substrate-pallet --pallet=pallet_contracts

@command-bot

command-bot bot commented Jan 19, 2024

@Robbepop https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4975857 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts. Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 5-17907f2b-f59e-42ef-85b2-bd5fd0902264 to cancel this command or bot cancel to cancel all commands in this pull request.

Copy link
Member

@athei athei left a comment


Looks good. Let's wait for the benchmark results.

@command-bot

command-bot bot commented Jan 19, 2024

@Robbepop Command "$PIPELINE_SCRIPTS_DIR/commands/bench/bench.sh" --subcommand=pallet --runtime=dev --target_dir=substrate --pallet=pallet_contracts has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4975857 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4975857/artifacts/download.

@Robbepop
Contributor Author

Robbepop commented Jan 19, 2024

Applied your review suggestions.

Rendered benchmark results:
[screenshot: rendered benchmark results table]

Note that `instr_i64const` is actually named `instr_i64_load_store` now.

After division by 6 (because of 6 instructions) the benchmark is roughly twice as fast (equal to half the gas costs). This is explained by the fact that before, every second instruction caused worst-case performance behavior due to the random access of big constant values, whereas now only 2 of the 6 instructions have this worst-case behavior. At least we are now explicit about it and engine agnostic, so this should probably be fine.

@Robbepop Robbepop enabled auto-merge January 19, 2024 19:13
@Robbepop Robbepop added this pull request to the merge queue Jan 19, 2024
Merged via the queue into paritytech:master with commit e02c520 Jan 19, 2024
@Robbepop Robbepop deleted the rf-update-wasm-benchmarks branch January 19, 2024 20:23
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this pull request Mar 25, 2024
Co-authored-by: command-bot <>
Co-authored-by: Alexander Theißen <[email protected]>
Co-authored-by: Ignacio Palacios <[email protected]>