subsystem-bench: cache misses profiling #2893
Conversation
alindima
left a comment
Generally looking good to me.
I think we should avoid printing cachegrind output directly to stdout, as it can be confusing. Either print to a file or prepend the valgrind stdout with a header that specifies that valgrind output follows.
#[cfg(target_os = "linux")]
fn is_valgrind_mode() -> bool {
nit: we could add all of these functions to a linux-only valgrind module for better encapsulation. also, we could avoid having empty valgrind functions.
Yes, it's good to extract it to a module, but how do we avoid the empty functions?
if you add #![cfg(target_os = "linux")] to the top of the valgrind file, it'll only be compiled on linux. Then you'd only call the valgrind functions on linux (add #[cfg()]s to the calling code), and you wouldn't need the empty functions
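A minimal sketch of the gating pattern being suggested: a linux-only module plus `#[cfg]`-gated call sites, so no empty stubs are needed on other platforms. The module name, function name, and the `UNDER_VALGRIND` marker variable are illustrative assumptions, not the PR's actual identifiers.

```rust
// Linux-only module: with the cfg attribute it simply does not exist
// on other targets, so no empty fallback functions are required.
#[cfg(target_os = "linux")]
mod valgrind {
    /// Hypothetical check: assumes the tool sets a marker variable
    /// (here called UNDER_VALGRIND) before re-running itself under valgrind.
    pub fn is_valgrind_mode() -> bool {
        std::env::var_os("UNDER_VALGRIND").is_some()
    }
}

fn main() {
    // Call sites are gated the same way, so this compiles everywhere.
    #[cfg(target_os = "linux")]
    println!("valgrind mode: {}", valgrind::is_valgrind_mode());

    #[cfg(not(target_os = "linux"))]
    println!("valgrind profiling unavailable on this OS");
}
```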
sandreim
left a comment
LGTM! There are additional options to cache sim which might be useful:
--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache. Only useful with --cache-sim=yes.
--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache. Only useful with --cache-sim=yes.
--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache. Only useful with --cache-sim=yes.
The documentation states that the simulator currently approximates an AMD Athlon CPU circa 2002, which is worse than the reference hardware spec. I think we should tune these values to the reference hardware or the actual host configuration.
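Since the benchmark already launches valgrind via `std::process::Command`, the flags above could be wired in like this. The `--I1`/`--D1`/`--LL` options are from the Cachegrind manual; the cache sizes are illustrative values roughly matching an Intel Ice Lake core (32 KiB 8-way L1i, 48 KiB 12-way L1d, 64-byte lines), not measured host parameters.

```rust
use std::process::Command;

fn main() {
    // Build (but don't spawn) a valgrind invocation with a tuned cache model.
    let mut cmd = Command::new("valgrind");
    cmd.arg("--tool=cachegrind")
        .arg("--cache-sim=yes")
        .arg("--I1=32768,8,64") // L1 instruction: 32 KiB, 8-way, 64-byte lines
        .arg("--D1=49152,12,64") // L1 data: 48 KiB, 12-way, 64-byte lines
        .arg("--LL=8388608,16,64"); // last-level: 8 MiB, 16-way (assumed)

    // Inspect the resulting command line without running valgrind.
    let args: Vec<String> =
        cmd.get_args().map(|a| a.to_string_lossy().into_owned()).collect();
    println!("{}", args.join(" "));
}
```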
#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
    use std::os::unix::process::CommandExt;
    std::process::Command::new("valgrind")
it doesn't look like we get an error printed if valgrind is missing
That's a good idea. Unfortunately, I couldn't find a way to catch the report from stderr, because it appears only after the process has completed. So I print it to a report file, which is a good option imho.
I think you could use https://doc.rust-lang.org/std/process/struct.Command.html#method.output for this (which lets you capture stderr as well). But printing to a file is good as well IMO 👍🏻
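A sketch of that suggestion: `Command::output` waits for the child to exit and buffers both streams, so stderr is available even though the report is only written at process end. `sh -c` stands in for the real valgrind child here so the example runs without valgrind installed.

```rust
use std::process::Command;

fn main() {
    // output() waits for completion and captures stdout and stderr.
    let out = Command::new("sh")
        .arg("-c")
        .arg("echo report >&2") // child writes to stderr, like valgrind does
        .output()
        .expect("failed to run child process");

    // The child's stderr is now fully buffered and inspectable.
    let stderr = String::from_utf8_lossy(&out.stderr);
    print!("captured stderr: {stderr}");
}
```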
@sandreim I tuned the simulation config to Intel Ice Lake CPU.
Why we need it
To provide another level of understanding of why polkadot's subsystems may perform slower than expected. Cache misses occur when processing large amounts of data, such as during availability recovery.
Why Cachegrind
Cachegrind has drawbacks: it is slow, and it uses its own cache simulation, which is very basic. But unlike perf, which is a great tool, Cachegrind can run in a virtual machine. This means we can easily run it on remote installations and even use it in CI/CD to catch possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? Simply because profiling runs were empirically faster with Cachegrind.
First results
The first results were obtained while testing the approach. Here is an example.
The CLI output shows that 1.4% of L1 data cache accesses missed, which is not so bad, given that the last-level cache held that data most of the time, missing only 0.3%. The L1 instruction cache shows 0.00% misses. Looking at the output file with cg_annotate shows that most of the misses occur during Reed-Solomon coding, which is expected.