feat(node): new observability architecture + events#1053
Conversation
|
@cryptoAtwill Alright, I have resolved all the comments. |
Close #1057 There is a problem with using the configuration file because we want to set up tracing before the configuration is loaded. At least, this is how it currently works. Therefore, some kind of dynamic layer will need to be added to the subscriber layer when we want to use the configuration file. That is why I just used environment variables for now to cut corners and get this out as soon as possible. Once all metrics are implemented, I can fix the configuration reloading.
| #[arg(long, env = "FM_CONFIG_DIR")] | ||
| config_dir: Option<PathBuf>, | ||
|
|
||
| // TODO Karel - move all FM_LOG_FILE* flags to a configuration file instead |
There was a problem hiding this comment.
Let's land the final config experience before merging this to main. Agree that this should be configured via the config.toml. Best way would be to group all these options in a struct, and place that struct in the ipc-observability crate, where it can then be embedded in the various configuration files of the various binaries (e.g. the relayer has its own configuration file, and it should also write a journal).
There was a problem hiding this comment.
Alright - there is an issue for it. I can pick it up once we finish review of this PR.
There was a problem hiding this comment.
#1078 dropping it here to have the configuration PR linked. It's almost done!
raulk
left a comment
There was a problem hiding this comment.
Overall, great job here! This will bring about a massive improvement in observability and monitoring. Excited to see it operating live. Kudos on your first major epic, @karlem! 👏
If you're able to get the changes addressed in the next hours, we should be able to merge today and release tomorrow with Axon r02! 🚀
|
@raulk Thanks! I have addressed all your comments, and it should be ready to go. However, please bear in mind that there are still two outstanding PRs that require review. |
raulk
left a comment
There was a problem hiding this comment.
A few follow-up comments. I'm scrutinizing the ProcessResult and ProcessError changes as we speak, so expect one more final review of that in the next minutes.
| impl Recordable for MsgExec { | ||
| fn record_metrics(&self) { | ||
| EXEC_FVM_CALL_EXECUTION_TIME_SECS.observe(self.duration); | ||
| match self.purpose { | ||
| MsgExecPurpose::Check => EXEC_FVM_CHECK_EXECUTION_TIME_SECS.observe(self.duration), | ||
| MsgExecPurpose::Estimate => { | ||
| EXEC_FVM_ESTIMATE_EXECUTION_TIME_SECS.observe(self.duration) | ||
| } | ||
| MsgExecPurpose::Apply => EXEC_FVM_APPLY_EXECUTION_TIME_SECS.observe(self.duration), | ||
| MsgExecPurpose::Call => EXEC_FVM_CALL_EXECUTION_TIME_SECS.observe(self.duration), | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
Nice! It's all coming together. This would've been excrutiatingly difficult to do with the previous code.
There was a problem hiding this comment.
Yes, it is also much cleaner this way.
| let accept = self | ||
| let process_result = self | ||
| .interpreter | ||
| .process(self.chain_env.clone(), txs) | ||
| .await | ||
| .context("failed to process proposal")?; |
There was a problem hiding this comment.
Dug into this further:
- Returning an error from the
Applicationwill induce a panic and cause the Tower Service to crash. - This approach is sensible if (a) Tower is configured to restart the service upon crash, AND (b) we're in the middle of some state update (e.g. after begin_block and before end_block). In that case, we want to reset transient state and start over.
- However, (b) is not true for process_proposal -- this is a stateless method. So returning an error here is unnecessary and problematic.
Conclusion:
- Thanks to your diff, we discovered a deeper issue! 🌟
- But because this is a delicate part of the system, I prefer to extract this diff out to its own changeset and changelog entry.
- Behaviourally speaking, these changes seem reasonable, but I have some hesitations on introducing new types to signal a binary outcome (idiomatic way is to use Result or Option, either flattened or nested).
Next steps:
- Let's back out these changes and forego setting a reason for now.
- Open a separate PR with your changes that also prevents
process_proposalfrom propagating an error.
There was a problem hiding this comment.
Yes, so firstly I will revert my changes but just a few comments.
- 100% - That is the reason why I changed the error handling in the previous change. Instead of propagating the error up, I matched the error to the cometBFT accept/reject.
- Yes, so I thought that wrapping the result with Option is a bit odd, and unwrapping something nested twice is a bit of a PITA. Therefore, even though it might not seem idiomatic at first glance, it felt more elegant.
In any case, I will revert this change for now and will create an issue.
raulk
left a comment
There was a problem hiding this comment.
Feel free to merge once the above comments have been addressed ;-)
Close #1054