Skip to content

feat(node): new observability architecture + events#1053

Merged
karlem merged 33 commits intomainfrom
new-observability
Jul 19, 2024
Merged

feat(node): new observability architecture + events#1053
karlem merged 33 commits intomainfrom
new-observability

Conversation

@karlem
Copy link
Copy Markdown
Contributor

@karlem karlem commented Jul 4, 2024

Close #1054

@karlem karlem force-pushed the new-observability branch from 1100032 to 06eb54a Compare July 4, 2024 18:16
@karlem karlem changed the title Introduce a new observability model Introduce New Observability Architecture with Top-Down Tracing/Metrics Events Jul 8, 2024
@karlem karlem marked this pull request as ready for review July 9, 2024 11:07
@raulk raulk requested review from cryptoAtwill and raulk July 9, 2024 15:27
@karlem karlem requested a review from cryptoAtwill July 10, 2024 06:04
@karlem
Copy link
Copy Markdown
Contributor Author

karlem commented Jul 10, 2024

@cryptoAtwill Alright, I have resolved all the comments.

karlem added 6 commits July 12, 2024 13:49
Close #1057 

There is a problem with using the configuration file because we want to
set up tracing before the configuration is loaded. At least, this is how
it currently works. Therefore, some kind of dynamic layer will need to
be added to the subscriber layer when we want to use the configuration
file. That is why I just used environment variables for now to cut
corners and get this out as soon as possible. Once all metrics are
implemented, I can fix the configuration reloading.
Copy link
Copy Markdown
Contributor

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/3 reviewed.

#[arg(long, env = "FM_CONFIG_DIR")]
config_dir: Option<PathBuf>,

// TODO Karel - move all FM_LOG_FILE* flags to a configuration file instead
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's land the final config experience before merging this to main. Agree that this should be configured via the config.toml. Best way would be to group all these options in a struct, and place that struct in the ipc-observability crate, where it can then be embedded in the various configuration files of the various binaries (e.g. the relayer has its own configuration file, and it should also write a journal).

Copy link
Copy Markdown
Contributor Author

@karlem karlem Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright - there is an issue for it. I can pick it up once we finish review of this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1078 dropping it here to have the configuration PR linked. It's almost done!

@raulk raulk changed the title Introduce New Observability Architecture with Top-Down Tracing/Metrics Events feat(node): new observability architecture + events Jul 16, 2024
Copy link
Copy Markdown
Contributor

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2/3.

Copy link
Copy Markdown
Contributor

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, great job here! This will bring about a massive improvement in observability and monitoring. Excited to see it operating live. Kudos on your first major epic, @karlem! 👏

If you're able to get the changes addressed in the next hours, we should be able to merge today and release tomorrow with Axon r02! 🚀

@karlem
Copy link
Copy Markdown
Contributor Author

karlem commented Jul 19, 2024

@raulk Thanks! I have addressed all your comments, and it should be ready to go. However, please bear in mind that there are still two outstanding PRs that require review.

Copy link
Copy Markdown
Contributor

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few follow-up comments. I'm scrutinizing the ProcessResult and ProcessError changes as we speak, so expect one more final review of that in the next minutes.

Comment on lines +44 to 55
impl Recordable for MsgExec {
fn record_metrics(&self) {
EXEC_FVM_CALL_EXECUTION_TIME_SECS.observe(self.duration);
match self.purpose {
MsgExecPurpose::Check => EXEC_FVM_CHECK_EXECUTION_TIME_SECS.observe(self.duration),
MsgExecPurpose::Estimate => {
EXEC_FVM_ESTIMATE_EXECUTION_TIME_SECS.observe(self.duration)
}
MsgExecPurpose::Apply => EXEC_FVM_APPLY_EXECUTION_TIME_SECS.observe(self.duration),
MsgExecPurpose::Call => EXEC_FVM_CALL_EXECUTION_TIME_SECS.observe(self.duration),
}
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! It's all coming together. This would've been excrutiatingly difficult to do with the previous code.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is also much cleaner this way.

Comment on lines -671 to -675
let accept = self
let process_result = self
.interpreter
.process(self.chain_env.clone(), txs)
.await
.context("failed to process proposal")?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dug into this further:

  • Returning an error from the Application will induce a panic and cause the Tower Service to crash.
  • This approach is sensible if (a) Tower is configured to restart the service upon crash, AND (b) we're in the middle of some state update (e.g. after begin_block and before end_block). In that case, we want to reset transient state and start over.
  • However, (b) is not true for process_proposal -- this is a stateless method. So returning an error here is unnecessary and problematic.

Conclusion:

  • Thanks to your diff, we discovered a deeper issue! 🌟
  • But because this is a delicate part of the system, I prefer to extract this diff out to its own changeset and changelog entry.
  • Behaviourally speaking, these changes seem reasonable, but I have some hesitations on introducing new types to signal a binary outcome (idiomatic way is to use Result or Option, either flattened or nested).

Next steps:

  • Let's back out these changes and forego setting a reason for now.
  • Open a separate PR with your changes that also prevents process_proposal from propagating an error.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, so firstly I will revert my changes but just a few comments.

  • 100% - That is the reason why I changed the error handling in the previous change. Instead of propagating the error up, I matched the error to the cometBFT accept/reject.
  • Yes, so I thought that wrapping the result with Option is a bit odd, and unwrapping something nested twice is a bit of a PITA. Therefore, even though it might not seem idiomatic at first glance, it felt more elegant.

In any case, I will revert this change for now and will create an issue.

Copy link
Copy Markdown
Contributor

@raulk raulk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to merge once the above comments have been addressed ;-)

@karlem karlem merged commit 6aaa95c into main Jul 19, 2024
@karlem karlem deleted the new-observability branch July 19, 2024 13:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

No open projects
Archived in project

Development

Successfully merging this pull request may close these issues.

Introduce new observibility - Emit Top-Down Traces

3 participants