Conversation

@mrnicegyu11 mrnicegyu11 commented Nov 26, 2025

What do these changes do?

Add Prometheus metric scraping for: vector, loki, tempo, grafana, jaeger

Bonus:

  • Disable telemetry sending on tempo
  • Fix a shellcheck pre-commit hook failure in an unrelated bash script

Tested on osparc.local.

Related issue/s

#1279

Related PR/s

https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1692

Checklist

  • I tested and it works

mrnicegyu11 and others added 30 commits October 15, 2024 16:18
Merge remote-tracking branch 'upstream/main'
…oundation#979)

* Introduce longhorn chart

* Further longhorn configuration

* Longhorn: further settings configuration

* Fix longhorn configuration bugs

Extra: introduce longhorn pv values for portainer

* Add comment for deletion longhorn

* Further longhorn configuration

* Add README.md for Longhorn with FAQ

* Update Longhorn readme

* Update readme

* Further LH configuration

* Update LH's Readme

* Update Longhorn Readme

* Improve LH's Readme

* LH: Reduce reserved default disk space to 5%

Since we use a dedicated disk for LH, we can go ahead with 5%

* Use values to set Longhorn storage class

* Update LH's Readme

* LH Readme: add requirements reference

* PR Review: bring back portainer s3 pv

* LH: decrease portainer volume size
Merge remote-tracking branch 'upstream/main'
mrnicegyu11 and others added 7 commits October 8, 2025 11:51
* wip

* Add csi-s3 and have portainer use it

* Change request @Hrytsuk 1GB max portainer volume size

* Arch Linux Certificates Customization

* Fix pgsql exporter failure

* [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉  (ITISFoundation#979)


* Experimental: Try to add tracing to simcore-traefik on master

* Fixes ITISFoundation/osparc-simcore#7363

* Arch Linux Certificates Customization - 2

* wip

* wip

* this might work

* k8s wip

* wip

* wip

---------

Co-authored-by: Dustin Kaiser <[email protected]>
Co-authored-by: YH <[email protected]>
@mrnicegyu11 mrnicegyu11 added this to the Imparable milestone Nov 26, 2025
@mrnicegyu11 mrnicegyu11 self-assigned this Nov 26, 2025
@mrnicegyu11 mrnicegyu11 added observability alerting/monitoring FAST labels Nov 26, 2025
@mrnicegyu11 mrnicegyu11 changed the title Add prometheus metric scraping for: vector, loki, tempo, grafana, jaeger 🎨 Add prometheus metrics: vector, loki, tempo, grafana, jaeger Nov 26, 2025
@mrnicegyu11 mrnicegyu11 marked this pull request as ready for review November 26, 2025 09:33
Collaborator

@YuryHrytsuk YuryHrytsuk left a comment

Thanks! Left comments

# Receive GELF messages from Docker containers via UDP
vector_metrics:
type: internal_metrics
scrape_interval_secs: 23
Collaborator

why 23? Shall we not use the same scraping interval?

Member Author

I picked 23 since it is a prime number (not joking, this is important). Apart from that it is not very important, I think; it only controls the resolution of the time series.

Contributor

I believe we use 15s for the simcore services, at least. I suggest keeping the same here.

Member Author

Background I gathered:

A case for having only one global scrape interval: https://www.robustperception.io/keep-it-simple-scrape_interval-id/

OK, I will harmonize this; makes sense 🙏
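For reference, harmonizing would mean changing the excerpt above to the shared interval (a sketch; the `vector_metrics` source name and the 15s value are taken from this thread):

```yaml
sources:
  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 15  # harmonized with the 15s used for simcore services
```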

Contributor

@bisgaard-itis bisgaard-itis left a comment

Cool, thanks a lot 👍🏻


Contributor

@bisgaard-itis bisgaard-itis left a comment

Very nice. Thanks a lot for the effort. 👍🏻

scrape_interval: 15s # By default, scrape targets every 15 seconds.
evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
# scrape_timeout global default would be (10s).
scrape_interval: ${PROMETHEUS_SCRAPE_INTERVAL}s
Collaborator

@YuryHrytsuk YuryHrytsuk Nov 27, 2025

PROMETHEUS_SCRAPE_INTERVAL_SECONDS would clearly define purpose and units
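With that rename, the global block would read roughly as follows (a sketch; it assumes the deployment substitutes environment variables into the config, as the existing `${PROMETHEUS_SCRAPE_INTERVAL}` placeholder suggests):

```yaml
global:
  # e.g. PROMETHEUS_SCRAPE_INTERVAL_SECONDS=15 renders as "15s"
  scrape_interval: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}s
  evaluation_interval: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}s
  # scrape_timeout keeps the global default (10s)
```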

@YuryHrytsuk YuryHrytsuk self-requested a review November 27, 2025 12:09
healthcheck:
enabled: true

prometheus_exporter:
Collaborator

For future reference, documentation that makes clear why this is necessary.

https://vector.dev/docs/administration/monitoring/#metrics

Member Author

Will add this, thanks.
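Per the linked Vector docs, the `internal_metrics` source only becomes scrapable once it is wired into a `prometheus_exporter` sink. A minimal sketch (the bind address is an assumption about this deployment; `9598` is Vector's documented default port for this sink):

```yaml
sinks:
  prometheus_exporter:
    type: prometheus_exporter
    inputs: [vector_metrics]
    address: 0.0.0.0:9598  # Prometheus then scrapes http://<vector-host>:9598/metrics
```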

exporters: [otlphttp,otlp]
processors: [batch,filter/drop_healthcheck]
telemetry:
metrics:
Collaborator

Would be nice to have a documentation link alongside this config excerpt for future reference.

It looks like it exports metrics to Prometheus (i.e. sends them directly), but the URL looks more like it exposes metrics 🤔

Looks like it is this link: https://opentelemetry.io/docs/collector/internal-telemetry/#prometheus-endpoint-for-internal-metrics

Member Author

This is the right link, but there is no concise paragraph about it; the information is scattered across the page.
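Summarizing the scattered information: the `service.telemetry.metrics` block does not push metrics anywhere; it exposes the collector's own metrics on a local endpoint that Prometheus scrapes. A sketch of the shape described on that page (keys vary by collector version; the host and port `8888`, the conventional internal-metrics port, are assumptions):

```yaml
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888  # scraped at http://<collector>:8888/metrics
```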

Collaborator

@YuryHrytsuk YuryHrytsuk left a comment

Thanks 🙏
