
subsystem-bench: cache misses profiling #2893

Merged
AndreiEres merged 28 commits into master from AndreiEres/bench-cache-misses
Jan 16, 2024

Conversation

Contributor

@AndreiEres AndreiEres commented Jan 9, 2024

Why we need it

To provide another level of understanding of why Polkadot's subsystems may perform slower than expected. Cache misses occur when processing large amounts of data, such as during availability recovery.

Why Cachegrind

Cachegrind has notable drawbacks: it is slow, and it uses its own, very basic cache simulation. But unlike perf, which is a great tool, Cachegrind can run in a virtual machine. This means we can easily run it on remote installations and even use it in CI/CD to catch possible regressions.

Why Cachegrind and not Callgrind, another part of Valgrind? Simply because we found empirically that profiling runs faster with Cachegrind.

First results

The first results were obtained while testing the approach. Here is an example.

$ target/testnet/subsystem-bench --n-cores 10 --cache-misses data-availability-read
$ cat cachegrind_report.txt
I refs:        64,622,081,485
I1  misses:         3,018,168
LLi misses:           437,654
I1  miss rate:           0.00%
LLi miss rate:           0.00%

D refs:        12,161,833,115  (9,868,356,364 rd   + 2,293,476,751 wr)
D1  misses:       167,940,701  (   71,060,073 rd   +    96,880,628 wr)
LLd misses:        33,550,018  (   16,685,853 rd   +    16,864,165 wr)
D1  miss rate:            1.4% (          0.7%     +           4.2%  )
LLd miss rate:            0.3% (          0.2%     +           0.7%  )

LL refs:          170,958,869  (   74,078,241 rd   +    96,880,628 wr)
LL misses:         33,987,672  (   17,123,507 rd   +    16,864,165 wr)
LL miss rate:             0.0% (          0.0%     +           0.7%  )

The CLI output shows an L1 data cache miss rate of 1.4%, which is not bad, given that the last-level cache held that data most of the time, missing only 0.3%. The L1 instruction cache miss rate is 0.00%. Annotating the output file with cg_annotate shows that most of the misses occur during Reed-Solomon coding, which is expected.
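For reference, the equivalent manual flow looks roughly like this. This is a sketch: the exact invocation that subsystem-bench's --cache-misses flag performs internally, and the output file name it writes, are not shown in this PR, so the commands below are illustrative.

```shell
# Run the benchmark binary under cachegrind with cache simulation enabled
# (needed to get miss counts rather than just instruction counts).
# Cachegrind writes its machine-readable profile to cachegrind.out.<pid>.
valgrind --tool=cachegrind --cache-sim=yes \
    target/testnet/subsystem-bench --n-cores 10 data-availability-read

# Annotate the profile to get per-function (and per-line) miss counts;
# this is where the reed-solomon hotspots show up.
cg_annotate cachegrind.out.*
```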

@AndreiEres AndreiEres changed the title from "[WIP] subsystem-bench: cache misses" to "subsystem-bench: cache misses profiling" Jan 10, 2024
@AndreiEres AndreiEres added the T10-tests (This PR/Issue is related to tests.) and T12-benchmarks (This PR/Issue is related to benchmarking and weights.) labels Jan 10, 2024
@AndreiEres AndreiEres marked this pull request as ready for review January 10, 2024 11:42
Contributor

@alindima alindima left a comment


Generally looking good to me.

I think we should avoid printing cachegrind output directly to stdout, as it can be confusing. Either print to a file or prepend the valgrind stdout with a header that specifies that valgrind output follows.

}

#[cfg(target_os = "linux")]
fn is_valgrind_mode() -> bool {
Contributor


nit: we could add all of these functions to a linux-only valgrind module for better encapsulation. also, we could avoid having empty valgrind functions.

Contributor Author


Yes, it's good to extract it into a module, but how do we avoid the empty functions?

Contributor


if you add #![cfg(target_os = "linux")] to the top of the valgrind file, it'll only be compiled on Linux. Then you'd only call the valgrind functions on Linux (adding #[cfg()]s to the calling code), and you wouldn't need empty functions

Contributor

@sandreim sandreim left a comment


LGTM! There are additional options to cache sim which might be useful:

--I1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 instruction cache. Only useful with --cache-sim=yes.

--D1=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the level 1 data cache. Only useful with --cache-sim=yes.

--LL=<size>,<associativity>,<line size>
Specify the size, associativity and line size of the last-level cache. Only useful with --cache-sim=yes.

The documentation states that the simulator currently approximates an AMD Athlon CPU from circa 2002, which is worse than the ref hardware spec. I think we should tune these values to the ref hardware or the actual host configuration.
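As a sketch, pinning the simulated caches to explicit geometries would look like this. The sizes below are only illustrative placeholders (each flag takes size, associativity, and line size in bytes), and `./my-benchmark` stands in for the real binary; the actual values the PR settled on are not reproduced here.

```shell
# --I1/--D1/--LL take <size>,<associativity>,<line size>;
# the numbers below are examples, not the PR's tuned configuration.
valgrind --tool=cachegrind --cache-sim=yes \
    --I1=32768,8,64 \
    --D1=49152,12,64 \
    --LL=2097152,16,64 \
    ./my-benchmark
```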

#[cfg(target_os = "linux")]
fn valgrind_init() -> eyre::Result<()> {
use std::os::unix::process::CommandExt;
std::process::Command::new("valgrind")
Contributor


it doesn't look like we get an error printed if valgrind is missing

Contributor Author


Good catch!

@AndreiEres
Contributor Author

I think we should avoid printing cachegrind output directly to stdout, as it can be confusing.

That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it only appears after the process has completed. So I print it to a report file, which is a good option IMHO.

@alindima
Contributor

That's a good idea. Unfortunately, I couldn't find a way to capture the report from stderr, because it only appears after the process has completed. So I print it to a report file, which is a good option IMHO.

I think you could use https://doc.rust-lang.org/std/process/struct.Command.html#method.output for this (which enables you to get stderr as well). But printing to a file is good as well IMO 👍🏻

@AndreiEres
Contributor Author

I think we should tune these values to the ref hardware or the actual host configuration.

@sandreim I tuned the simulation config to Intel Ice Lake CPU.

@AndreiEres AndreiEres added the R0-no-crate-publish-required (The change does not require any crates to be re-published.) label Jan 15, 2024
