Reactive syncing metrics #5410
Force-pushed (`…task of chain sync and make them reactive`) from e8fde12 to 9135bb1.
dmitry-markin
left a comment
Looks good, thank you! With metrics updated on every operation it's more difficult to verify that every update is handled. I hope you have searched through the code and made sure everything is covered.
```rust
            metrics.pending.inc();
        }
    } else {
        trace!(target: LOG_TARGET,
```
This is out of scope of this PR, but it looks like we can legitimately hit this if the response was delivered after the peer disconnected event arrived. This can happen depending on the order different protocols/streams are polled as these events are emitted by different objects (notifications protocol for sync peers and request-response protocol for responses).
Created an issue for this: #5414.
nazar-pc
left a comment
> Hope you have searched through the code and made sure everything is covered.
I checked all occurrences and invariants and I don't think I missed anything.
```rust
if let Some(request) = self.active_requests.remove(&who) {
    if let Some(metrics) = &self.metrics {
        metrics.active.dec();
    }
```
Would grouping the request hashmaps (or vec) with the metrics help here?
Some time in the future we might do an `active_requests.remove()` and forget to decrement the appropriate metric.
One suggestion might be to group them into a wrapper:

```rust
struct ActiveRequestsMetered {
    inner: HashMap<..>, // Current self.active_requests
    metrics: Option<Metrics>,
}
```

Maybe an even simpler approach would be to introduce a `fn active_requests_remove() { ...; metrics.active.dec(); }`.
Could also be a followup if it would take too long to implement :D
There are several of these in the code and I'm not sure how much it would help with ergonomics to be completely honest. Metrics are one of those things that are just inherently invasive unfortunately.
lexnv
left a comment
LGTM! Thanks for contributing!
I resolved a minor merge conflict, but also had to restore the useless tick. Without this tick, tests like the one at substrate/client/network/test/src/sync.rs lines 1020 to 1024 (in 9fecd89) break. If someone with more knowledge in this area can take a look at it that'd be great, but it shouldn't block this PR anymore.

CI seems to be good now.
This PR untangles syncing metrics and makes them reactive, the way metrics are supposed to be in general.
Syncing metrics were bundled in a way that caused coupling across multiple layers: justifications metrics were defined and managed by `ChainSync`, but only updated periodically on tick in `SyncingEngine`, while actual values were queried from `ExtraRequests`. This convoluted architecture was hard to follow when I was looking into #5333.

Now the metrics that correspond to each component are owned by that component and updated as changes are made, instead of on a tick every 1100ms.
This does add some annoying boilerplate that is a bit harder to maintain, but it separates metrics more nicely, and if someone queries them more frequently they will get arbitrary resolution. Since metric updates are just atomic operations, I do not expect any performance impact from these changes.
Will add prdoc if changes look good otherwise.
P.S. I noticed that importing requests (and corresponding metrics) were never cleared since the corresponding code was introduced in dc41558#r145518721. I left it as is to not change the behavior, but it might be something worth fixing.
cc @dmitry-markin