@panta-123
Contributor

  • Add figures to illustrate the monitoring setup.
  • Add configuration examples for common monitoring tools.
  • Give each monitoring tool its own section.
  • Add a note that the dashboards might be outdated.
  • Remove some development material, since these are operator docs, not developer docs.
  • A hand-maintained list of metrics is hard to keep track of and risks becoming outdated, so code search links are added instead.

@panta-123 panta-123 requested review from bari12 and voetberg November 11, 2025 15:03
```
metrics_port = 8080
```
The metrics in use can be found via the following code search links:
- [Counter](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.Counter&type=code)
Contributor

I like how elegant this solution is but we should probably figure out a way to describe what these are actually monitoring. It's useful to have the list of names like this though

Contributor Author

@voetberg,

I can't see any way other than listing the names manually if we want to add descriptions.
Do you want me to add the list back?

Contributor

Sorry, I was mostly musing out loud here. I think this is a good solution, but we should probably include auto-docs or something similar for core/metrics.
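
One way the auto-docs idea could be approached is sketched below: scan source text for metric calls and emit a grouped list of names. This is only an illustration. The call shape matched by the regex (`METRICS.counter('name')`, `METRICS.gauge(name='name')`, `METRICS.timer("name")`) is an assumption, and `collect_metric_names` and `render_markdown` are hypothetical helpers, not part of Rucio.

```python
import re
from collections import defaultdict
from typing import Dict, Set

# Assumption: metrics are created through calls shaped like
# METRICS.counter('name'), METRICS.gauge(name='name') or METRICS.timer("name").
# Names built dynamically (f-strings, variables) are not captured by this sketch.
METRIC_CALL = re.compile(
    r"metrics\.(counter|gauge|timer)\(\s*(?:name\s*=\s*)?['\"]([^'\"]+)['\"]",
    re.IGNORECASE,
)


def collect_metric_names(source: str) -> Dict[str, Set[str]]:
    """Group the metric names found in one source string by metric kind."""
    found = defaultdict(set)
    for kind, name in METRIC_CALL.findall(source):
        found[kind.lower()].add(name)
    return dict(found)


def render_markdown(found: Dict[str, Set[str]]) -> str:
    """Render the collected names as a nested markdown list."""
    lines = []
    for kind in sorted(found):
        lines.append(f"- {kind}")
        for name in sorted(found[kind]):
            lines.append(f"  - `{name}`")
    return "\n".join(lines)
```

A script like this only recovers names, so descriptions of what each metric monitors would still have to come from somewhere else, for example docstrings or a description argument.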

- [Gauge](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.gauge&type=code)
- [Timer](https://github.com/search?q=repo%3Arucio%2Frucio+Metrics.timer&type=code)
A [Grafana Dashboard JSON](https://github.com/rucio/rucio/blob/master/tools/monitoring/visualization/rucio-internal.json) for Graphite is available here.
Contributor

Can we include an example screenshot of what these dashboards end up looking like? Just so people know what they're getting.

Contributor Author

@panta-123 panta-123 Nov 11, 2025

Yes. People might not use these examples as they are. I can add a screenshot of what we have for our experiment, which is a little different from these but gives some idea of the dashboard.

@voetberg
Contributor

These are overall really good changes! I think it is worth it to talk about how these can be actually set up (e.g., what sort of pods should be running, what sort of infra people need to run these monitoring things, the exact executables to run a hermes daemon, etc). I am unsure if this is outside the scope of this PR though (mostly just want to call this out as something that's missing)

@panta-123
Contributor Author

> These are overall really good changes! I think it is worth it to talk about how these can be actually set up (e.g., what sort of pods should be running, what sort of infra people need to run these monitoring things, the exact executables to run a hermes daemon, etc). I am unsure if this is outside the scope of this PR though (mostly just want to call this out as something that's missing)

I think we can mention that the hermes daemon is required and point to the daemon deployment doc.
But other infrastructure deployment strategies should not be covered, as they are outside the scope of the Rucio docs.
The available infrastructure integrations are already listed in the docs, and the related config choices are mentioned in this PR.

- **Jobber** acts as a cron-like scheduler inside the container.
- **Output options:**
  - **Prometheus Pushgateway:** for time-series metrics. Alerts can then be raised through Prometheus Alertmanager or Grafana alerting.
  - **Nagios:** for exit-code-based alerting.
Contributor

@dchristidis Do you have a few words for how Atlas is using Nagios?

Contributor

Nagios is used as a glorified cron engine. Most of the probes don’t serve a monitoring purpose: creating scopes, accounts, identities, RSE, RSE attributes, and more. When a probe does monitoring, it’s typically pushing data to Prometheus. The exit code mostly reflects how well the probe ran, not the state of a service or metric.

Despite its shortcomings, we have had a positive experience with Nagios mainly because we can configure the alerts to our liking: being sent after a certain number of failures, not being resent unless the output changes, and more. As a bonus, its web interface provides a nice overview of the state of the probes, plus an easy way to make tweaks (e.g. trigger a probe run or silence an alarm).
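
For readers unfamiliar with exit-code-based alerting, here is a minimal sketch of the standard Nagios plugin convention (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, with optional performance data after a `|`). The `run_probe` wrapper and its thresholds are hypothetical and are not taken from the Rucio probes repository.

```python
import sys
from typing import Callable, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
_LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value: float, warn: float, crit: float) -> int:
    """Map a measured value to an exit code using upper thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_probe(measure: Callable[[], float], warn: float, crit: float) -> Tuple[int, str]:
    """Run the measurement and return (exit_code, status_line)."""
    try:
        value = measure()
    except Exception as exc:  # the probe itself failed, so the state is unknown
        return UNKNOWN, f"UNKNOWN - probe failed to run: {exc}"
    status = classify(value, warn, crit)
    # Text after '|' is Nagios performance data: label=value;warn;crit
    return status, f"{_LABELS[status]} - value={value} | value={value};{warn};{crit}"


def main(measure: Callable[[], float], warn: float, crit: float) -> None:
    """Print the status line and exit with the matching code."""
    status, message = run_probe(measure, warn, crit)
    print(message)
    sys.exit(status)
```

In the usage described in the comment above, the exit code would mostly report whether the probe ran, so a probe of that kind could skip the thresholds and return OK unless an exception was raised.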


Rucio provides a collection of **monitoring probes** that check various status metrics of a Rucio instance.
The list of probes is available [here](https://github.com/rucio/probes/tree/master).
There are [common](https://github.com/rucio/probes/tree/master/common) probes shared across experiments, and you can also create your own experiment-specific probes for custom monitoring.
Contributor

Here are some probe templates that could be useful to include here:

https://gist.github.com/voetberg/19e8f0f621f3e5c6afc346179573e8ac (Less sure on the nagios probe, as I haven't written one before lol)

```
}
}
```
There are other events, for replicas, DIDs, etc., that are not listed here.
Contributor

If we don't show all event types, I don't know how useful it is to list payload events. Do we instead have a way to discover this information? (e.g., something to run to look at the payload?)
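
One possible way to discover this information is to summarize captured messages instead of documenting every payload by hand. The sketch below assumes each message is a JSON object with an `event_type` string and a `payload` object, one message per line. That layout is an assumption about the message shape, and `summarize_events` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def summarize_events(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return the payload field names observed for each event type.

    Assumes one JSON message per line, each with 'event_type' and 'payload'.
    """
    fields = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        event = json.loads(line)
        payload = event.get("payload") or {}
        fields[event.get("event_type", "<missing>")].update(payload)
    return {etype: sorted(keys) for etype, keys in sorted(fields.items())}
```

Run over a file of captured messages (for example `summarize_events(open("events.jsonl"))`, where the file name is a placeholder), this lists every event type that was actually seen together with the union of its payload keys.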

An example [Kibana Dashboard](https://github.com/rucio/rucio/tree/master/tools/monitoring/visualization) is provided.
An example [Grafana Dashboard](https://github.com/rucio/monitoring-templates/blob/main/message-monitoring/Dashboards/Rucio-Transfer.json) for transfers, backed by Elasticsearch/OpenSearch, is also provided.
Note: The dashboard examples are only meant to give an idea; they might need to be tweaked according to your setup and needs. They might also be based on old versions.
Contributor

Could we instead just check whether these dashboards are up to date rather than giving this disclaimer, or maybe just give a template or link to a generic one?
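
A quick first pass at such a check could be a script that summarizes a Grafana dashboard JSON export, listing its `schemaVersion` and the panel types it uses. The sketch below relies only on the top-level `title`, `schemaVersion`, `panels` and (older, row-based) `rows` keys of the Grafana dashboard JSON model. `summarize_dashboard` is a hypothetical helper, and the output still needs human judgment.

```python
from typing import Dict, Iterable, Iterator


def _flatten(panels: Iterable[dict]) -> Iterator[dict]:
    """Yield panels, descending into row panels that nest their own panels."""
    for panel in panels:
        yield panel
        yield from _flatten(panel.get("panels", []))


def summarize_dashboard(dashboard: Dict) -> Dict:
    """Summarize a parsed Grafana dashboard export for a manual freshness review."""
    top = list(dashboard.get("panels", []))
    for row in dashboard.get("rows", []):  # older dashboards group panels in rows
        top.extend(row.get("panels", []))
    panels = [p for p in _flatten(top) if p.get("type") != "row"]
    return {
        "title": dashboard.get("title"),
        "schemaVersion": dashboard.get("schemaVersion"),
        "panel_types": sorted({p.get("type", "unknown") for p in panels}),
        "panels": [(p.get("title", ""), p.get("type", "unknown")) for p in panels],
    }
```

Deprecated panel types such as `graph` or `singlestat`, or a low `schemaVersion`, are hints that a dashboard was exported from an older Grafana release and may need to be re-exported.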

- some spacing edits
- added list of different event types
- condensed some Traces descriptions