🎨 Add prometheus metrics: vector, loki, tempo, grafana, jaeger #1280
Conversation
* [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (ITISFoundation#979)
  * Introduce longhorn chart
  * Further longhorn configuration
  * Longhorn: further settings configuration
  * Fix longhorn configuration bugs; extra: introduce longhorn pv values for portainer
  * Add comment for deletion of longhorn
  * Add README.md for Longhorn with FAQ
  * Update and improve Longhorn's README (several iterations)
  * LH: reduce reserved default disk space to 5% (since we use a dedicated disk for LH, we can go ahead with 5%)
  * Use values to set Longhorn storage class
  * LH README: add requirements reference
  * PR review: bring back portainer s3 pv
  * LH: decrease portainer volume size
* This reverts commit 2d3adb1.
* wip
* Add csi-s3 and have portainer use it
* Change request @Hrytsuk: 1GB max portainer volume size
* Arch Linux Certificates Customization
* Fix pgsql exporter failure
* [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (ITISFoundation#979) (repeat of the commit list above)
* Experimental: try to add tracing to simcore-traefik on master
* Fixes ITISFoundation/osparc-simcore#7363
* Arch Linux Certificates Customization - 2
* wip (several commits)
* this might work
* k8s wip

Co-authored-by: Dustin Kaiser <[email protected]>
Co-authored-by: YH <[email protected]>
YuryHrytsuk
left a comment
Thanks! Left comments
services/logging/vector.yaml (outdated)

```yaml
# Receive GELF messages from Docker containers via UDP
vector_metrics:
  type: internal_metrics
  scrape_interval_secs: 23
```
why 23? Shall we not use the same scraping interval?
I picked 23 since it is a prime number (not joking, this is important). Apart from that it is not very important, I think; it only controls the resolution of the time series.
I believe we use 15s for the simcore services, at least. I suggest keeping the same here.
Background I gathered:
The case for having only one global scrape interval: https://www.robustperception.io/keep-it-simple-scrape_interval-id/
OK, I will harmonize this, makes sense 🙏
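The harmonized setting would then look like the following (a sketch, assuming the shared 15s interval discussed above is adopted):

```yaml
vector_metrics:
  type: internal_metrics
  scrape_interval_secs: 15  # match the global Prometheus scrape interval
```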
bisgaard-itis
left a comment
Cool, thanks a lot 👍🏻
bisgaard-itis
left a comment
Very nice. Thanks a lot for the effort. 👍🏻
```diff
-scrape_interval: 15s     # By default, scrape targets every 15 seconds.
-evaluation_interval: 15s # By default, scrape targets every 15 seconds.
-# scrape_timeout global default would be (10s).
+scrape_interval: ${PROMETHEUS_SCRAPE_INTERVAL}s
```
`PROMETHEUS_SCRAPE_INTERVAL_SECONDS` would clearly define both the purpose and the unit.
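A sketch of the suggested rename (the variable name is the reviewer's proposal; how it is templated into the file is an assumption):

```yaml
# prometheus.yml (template) - the _SECONDS suffix documents the unit
global:
  scrape_interval: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}s
  evaluation_interval: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}s
```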
```yaml
healthcheck:
  enabled: true

prometheus_exporter:
```
For future reference, please add documentation that makes clear why this is necessary.
Will add this, thx
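For context, a minimal sketch of what an OpenTelemetry Collector `filter/drop_healthcheck` processor can look like (the attribute key and matching expression are assumptions for illustration, not the exact definition in this PR):

```yaml
processors:
  filter/drop_healthcheck:
    error_mode: ignore
    traces:
      span:
        # drop spans produced by health-check requests (route is an assumed attribute)
        - 'attributes["http.route"] == "/healthcheck"'
```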
```yaml
exporters: [otlphttp, otlp]
processors: [batch, filter/drop_healthcheck]
telemetry:
  metrics:
```
Would be nice to have a link to documentation for this config excerpt, for future reference.
It looks like it exports metrics to Prometheus (i.e. sends them directly), but the URL looks more like it exposes metrics 🤔
Looks like this is the link: https://opentelemetry.io/docs/collector/internal-telemetry/#prometheus-endpoint-for-internal-metrics
This is the right link, but there is no concise paragraph about it; the information is scattered across the page.
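To summarize what that page documents: the Collector does not push its internal metrics anywhere; it exposes them on a Prometheus-style pull endpoint configured under `service.telemetry`. A sketch using the documented defaults (host/port values are the defaults from the docs, not necessarily this PR's config):

```yaml
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888   # Prometheus then scrapes <collector>:8888/metrics
```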
YuryHrytsuk
left a comment
Thanks 🙏
What do these changes do?
Add prometheus metric scraping for: vector, loki, tempo, grafana, jaeger
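A hedged sketch of what scraping these targets might look like (job names, hostnames, and ports are assumptions based on each tool's common defaults, not the exact config in this PR):

```yaml
scrape_configs:
  - job_name: vector
    static_configs: [{targets: ["vector:9598"]}]   # Vector prometheus_exporter default port
  - job_name: loki
    static_configs: [{targets: ["loki:3100"]}]     # Loki serves /metrics on its HTTP port
  - job_name: tempo
    static_configs: [{targets: ["tempo:3200"]}]
  - job_name: grafana
    static_configs: [{targets: ["grafana:3000"]}]
  - job_name: jaeger
    static_configs: [{targets: ["jaeger:14269"]}]  # Jaeger admin port exposes /metrics
```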
Bonus:
tested this on osparc.local

Related issue/s
#1279
Related PR/s
https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1692
Checklist