Conversation

@mrnicegyu11 mrnicegyu11 commented Nov 26, 2025

What do these changes do?

Add Prometheus metric scraping for: vector, loki, tempo, grafana, jaeger

Bonus:

  • Disable telemetry sending on tempo
  • Fix a shellcheck pre-commit hook failure in an unrelated bash script

Tested on osparc.local.

Related issue/s

#1279

Related PR/s

https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1692

Checklist

  • I tested and it works

mrnicegyu11 and others added 30 commits October 15, 2024 16:18
Merge remote-tracking branch 'upstream/main'
…oundation#979)

* Introduce longhorn chart

* Further longhorn configuration

* Longhorn: further settings configuration

* Fix longhorn configuration bugs

Extra: introduce longhorn pv values for portainer

* Add comment for deletion longhorn

* Further longhorn configuration

* Add README.md for Longhorn with FAQ

* Update Longhorn readme

* Update readme

* Further LH configuration

* Update LH's Readme

* Update Longhorn Readme

* Improve LH's Readme

* LH: Reduce reserved default disk space to 5%

Since we use a dedicated disk for LH, we can go ahead with 5%

* Use values to set Longhorn storage class

* Update LH's Readme

* LH Readme: add requirements reference

* PR Review: bring back portainer s3 pv

* LH: decrease portainer volume size
Merge remote-tracking branch 'upstream/main'
mrnicegyu11 and others added 7 commits October 8, 2025 11:51
* wip

* Add csi-s3 and have portainer use it

* Change request @Hrytsuk 1GB max portainer volume size

* Arch Linux Certificates Customization

* Fix pgsql exporter failure

* [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉  (ITISFoundation#979)


* Experimental: Try to add tracing to simcore-traefik on master

* Fixes ITISFoundation/osparc-simcore#7363

* Arch Linux Certificates Customization - 2

* wip

* wip

* this might work

* k8s wip

* wip

* wip

---------

Co-authored-by: Dustin Kaiser <[email protected]>
Co-authored-by: YH <[email protected]>
@mrnicegyu11 mrnicegyu11 added this to the Imparable milestone Nov 26, 2025
@mrnicegyu11 mrnicegyu11 self-assigned this Nov 26, 2025
@mrnicegyu11 mrnicegyu11 added observability alerting/monitoring FAST labels Nov 26, 2025
@mrnicegyu11 mrnicegyu11 changed the title Add prometheus metric scraping for: vector, loki, tempo, grafana, jaeger 🎨 Add prometheus metrics: vector, loki, tempo, grafana, jaeger Nov 26, 2025
@mrnicegyu11 mrnicegyu11 marked this pull request as ready for review November 26, 2025 09:33
Collaborator

@YuryHrytsuk YuryHrytsuk left a comment

Thanks! Left comments

# Receive GELF messages from Docker containers via UDP
vector_metrics:
type: internal_metrics
scrape_interval_secs: 23
Collaborator

why 23? Shall we not use the same scraping interval?

Member Author

I picked 23 since it is a prime number (not joking, this is important). Apart from that it is not very important, I think; it only controls the resolution of the time series.

Contributor

I believe we use 15s for the simcore services, at least. I suggest keeping the same here.

Member Author

Background I gathered:

A case for having only one global scrape interval: https://www.robustperception.io/keep-it-simple-scrape_interval-id/

OK, I will harmonize this; makes sense 🙏
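For reference, harmonizing would mean changing the excerpt above to the shared interval (a sketch; the `vector_metrics` source name and the 15s value are taken from this thread):

```yaml
sources:
  vector_metrics:
    type: internal_metrics
    scrape_interval_secs: 15  # harmonized with the 15s used for simcore services
```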

Contributor

@bisgaard-itis bisgaard-itis left a comment

Cool, thanks a lot 👍🏻


Contributor

@bisgaard-itis bisgaard-itis left a comment

Very nice. Thanks a lot for the effort. 👍🏻

scrape_interval: 15s # By default, scrape targets every 15 seconds.
evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
# scrape_timeout global default would be (10s).
scrape_interval: ${PROMETHEUS_SCRAPE_INTERVAL}s
Collaborator

@YuryHrytsuk YuryHrytsuk Nov 27, 2025

PROMETHEUS_SCRAPE_INTERVAL_SECONDS would clearly define purpose and units
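With that rename, the global block would read roughly as follows (a sketch; it assumes the deployment substitutes environment variables into the config, as the existing `${PROMETHEUS_SCRAPE_INTERVAL}` placeholder suggests):

```yaml
global:
  # e.g. PROMETHEUS_SCRAPE_INTERVAL_SECONDS=15 renders as "15s"
  scrape_interval: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}s
  evaluation_interval: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}s
  # scrape_timeout keeps the global default (10s)
```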

@YuryHrytsuk YuryHrytsuk self-requested a review November 27, 2025 12:09
healthcheck:
enabled: true

prometheus_exporter:
Collaborator

For future reference, documentation that makes clear why this is necessary.

https://vector.dev/docs/administration/monitoring/#metrics

Member Author

Will add this, thanks.
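Per the linked Vector docs, the `internal_metrics` source only becomes scrapable once it is wired into a `prometheus_exporter` sink. A minimal sketch (the bind address is an assumption about this deployment; `9598` is Vector's documented default port for this sink):

```yaml
sinks:
  prometheus_exporter:
    type: prometheus_exporter
    inputs: [vector_metrics]
    address: 0.0.0.0:9598  # Prometheus then scrapes http://<vector-host>:9598/metrics
```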

exporters: [otlphttp,otlp]
processors: [batch,filter/drop_healthcheck]
telemetry:
metrics:
Collaborator

Would be nice to have a documentation link alongside this config excerpt for future reference.

It looks like it exports metrics to Prometheus (i.e. sends them directly), but the URL looks more like it exposes metrics 🤔

Looks like it is this link: https://opentelemetry.io/docs/collector/internal-telemetry/#prometheus-endpoint-for-internal-metrics

Member Author

This is the right link, but there is no concise paragraph about it; the information is scattered across the page.
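Summarizing the scattered information: the `service.telemetry.metrics` block does not push metrics anywhere; it exposes the collector's own metrics on a local endpoint that Prometheus scrapes. A sketch of the shape described on that page (keys vary by collector version; the host and port `8888`, the conventional internal-metrics port, are assumptions):

```yaml
service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: 0.0.0.0
                port: 8888  # scraped at http://<collector>:8888/metrics
```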

Collaborator

@YuryHrytsuk YuryHrytsuk left a comment

Thanks 🙏
