Commit fc6694a

🎨 Add prometheus metrics: vector, loki, tempo, grafana, jaeger (#1280)
* wip * Add csi-s3 and have portainer use it * Change request @Hrytsuk 1GB max portainer volume size * Arch Linux Certificates Customization * Fix pgsql exporter failure * [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (#979) * Introduce longhorn chart * Further longhorn configuration * Longhorn: further settings configuration * Fix longhorn configuration bugs Extra: introduce longhorn pv vales for portainer * Add comment for deletion longhorn * Further longhorn configuration * Add README.md for Longhorn wit FAQ * Update Longhorn readme * Update readme * Futher LH configuration * Update LH's Readme * Update Longhorn Readme * Improve LH's Readme * LH: Reduce reserved default disk space to 5% Since we use a dedicated disk for LH, we can go ahead with 5% * Use values to set Longhorn storage class * Update LH's Readme * LH Readme: add requirements reference * PR Review: bring back portainer s3 pv * LH: decrease portinaer volume size * Experimental: Try to add tracing to simcore-traefik on master * Fixes ITISFoundation/osparc-simcore#7363 * Arch Linux Certificates Customization - 2 * Revert: disable loki & vector-dev, oldschool graylog logging (#1223) * wip * Add csi-s3 and have portainer use it * Change request @Hrytsuk 1GB max portainer volume size * Arch Linux Certificates Customization * Fix pgsql exporter failure * [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (#979) * Introduce longhorn chart * Further longhorn configuration * Longhorn: further settings configuration * Fix longhorn configuration bugs Extra: introduce longhorn pv vales for portainer * Add comment for deletion longhorn * Further longhorn configuration * Add README.md for Longhorn wit FAQ * Update Longhorn readme * Update readme * Futher LH configuration * Update LH's Readme * Update Longhorn Readme * Improve LH's Readme * LH: Reduce reserved default disk space to 5% Since we use a dedicated disk for LH, we can go ahead with 5% * Use values to set Longhorn storage 
class * Update LH's Readme * LH Readme: add requirements reference * PR Review: bring back portainer s3 pv * LH: decrease portinaer volume size * Experimental: Try to add tracing to simcore-traefik on master * Fixes ITISFoundation/osparc-simcore#7363 * Arch Linux Certificates Customization - 2 * Send docker logs directly to graylog * revert arch linux customization --------- Co-authored-by: Dustin Kaiser <[email protected]> Co-authored-by: YH <[email protected]> * Enable Chatbot for S4L products (#1221) * wip * Add csi-s3 and have portainer use it * Change request @Hrytsuk 1GB max portainer volume size * Arch Linux Certificates Customization * Fix pgsql exporter failure * [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (#979) * Introduce longhorn chart * Further longhorn configuration * Longhorn: further settings configuration * Fix longhorn configuration bugs Extra: introduce longhorn pv vales for portainer * Add comment for deletion longhorn * Further longhorn configuration * Add README.md for Longhorn wit FAQ * Update Longhorn readme * Update readme * Futher LH configuration * Update LH's Readme * Update Longhorn Readme * Improve LH's Readme * LH: Reduce reserved default disk space to 5% Since we use a dedicated disk for LH, we can go ahead with 5% * Use values to set Longhorn storage class * Update LH's Readme * LH Readme: add requirements reference * PR Review: bring back portainer s3 pv * LH: decrease portinaer volume size * Experimental: Try to add tracing to simcore-traefik on master * Fixes ITISFoundation/osparc-simcore#7363 * Arch Linux Certificates Customization - 2 * Remove frontend vendor chatbot service * wip --------- Co-authored-by: Dustin Kaiser <[email protected]> Co-authored-by: YH <[email protected]> * Kubernetes: fix global network policy (#1227) * Add authentication middleware to cahtbot vendor service * Revert "Kubernetes: fix global network policy (#1227)" This reverts commit 2d3adb1. 
* Add ACME DNS Resolver for gitlabCD and k8s (#1217) * wip * Add csi-s3 and have portainer use it * Change request @Hrytsuk 1GB max portainer volume size * Arch Linux Certificates Customization * Fix pgsql exporter failure * [Kubernetes] Introduce on-prem persistent Storage (Longhorn) 🎉 (#979) * Introduce longhorn chart * Further longhorn configuration * Longhorn: further settings configuration * Fix longhorn configuration bugs Extra: introduce longhorn pv vales for portainer * Add comment for deletion longhorn * Further longhorn configuration * Add README.md for Longhorn wit FAQ * Update Longhorn readme * Update readme * Futher LH configuration * Update LH's Readme * Update Longhorn Readme * Improve LH's Readme * LH: Reduce reserved default disk space to 5% Since we use a dedicated disk for LH, we can go ahead with 5% * Use values to set Longhorn storage class * Update LH's Readme * LH Readme: add requirements reference * PR Review: bring back portainer s3 pv * LH: decrease portinaer volume size * Experimental: Try to add tracing to simcore-traefik on master * Fixes ITISFoundation/osparc-simcore#7363 * Arch Linux Certificates Customization - 2 * wip * wip * this might work * k8s wip * wip * wip --------- Co-authored-by: Dustin Kaiser <[email protected]> Co-authored-by: YH <[email protected]> * Add prometheus metric scraping for: vector, loki, tempo, grafana, jaeger * fix * add sink for vector prom metrics * Make scrape interval and scrape timeout global and configurable * Fix format * fix * Fixes @Hrytsuk --------- Co-authored-by: Dustin Kaiser <[email protected]> Co-authored-by: YH <[email protected]>
1 parent 80036d9 commit fc6694a

16 files changed, +62 -17 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # See https://pre-commit.com/hooks.html for more hooks
 exclude: "^.venv$|^.cache$|^.pytest_cache$"
 default_language_version:
-  python: python3.10
+  python: python3
 repos:
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.4.0

scripts/purge-docker-registry/docker-registry-curl.bash

Lines changed: 2 additions & 2 deletions
@@ -29,7 +29,7 @@ main() {
     console "${REGISTRY_HOST}"
     console "${WWW_AUTHENTICATE}"

-    if [ "x${WWW_AUTHENTICATE}" != "x" ];then
+    if [ "${WWW_AUTHENTICATE}" != "" ];then
         # we need to get a token
         DOCKER_AUTH_TYPE=$(echo "${WWW_AUTHENTICATE}" | cut --delimiter=" " --fields=1)
         DETAILS=$(echo "${WWW_AUTHENTICATE}" | cut --delimiter=" " --fields=2-)
@@ -42,7 +42,7 @@ main() {
         SCOPE=$(echo "${DETAILS}" | cut --delimiter=',' --fields=3 | cut --delimiter="=" --fields=2 | tr --delete '"')
         if [ -v DOCKER_AUTH ];then
             :
-        elif [[ "x${DOCKER_USERNAME}" != "x" && "x${DOCKER_PASSWORD}" != "x" ]];then
+        elif [[ "${DOCKER_USERNAME}" != "" && "${DOCKER_PASSWORD}" != "" ]];then
             DOCKER_AUTH="${DOCKER_USERNAME}:${DOCKER_PASSWORD}"
         elif [ -e ~/.docker/config.json ];then
             DOCKER_AUTH=$(jq -r ".[\"auths\"][\"${REGISTRY_HOST}\"][\"auth\"]" ~/.docker/config.json | base64 -d)
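
The hunk above drops the legacy `x` prefix from the emptiness checks. As a minimal sketch (the function names are hypothetical and not part of `docker-registry-curl.bash`), the following compares the old form, the new form, and the `-n` idiom on a few inputs; with quoted expansions all three should agree.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; they do not exist in the repository.

# Legacy form removed by this commit: prefix both sides with "x".
is_set_legacy() {
    if [ "x${1}" != "x" ]; then echo "set"; else echo "empty"; fi
}

# Form introduced by this commit: compare the quoted value with "".
is_set_plain() {
    if [ "${1}" != "" ]; then echo "set"; else echo "empty"; fi
}

# Equivalent idiom using the -n (non-empty) test.
is_set_n() {
    if [ -n "${1}" ]; then echo "set"; else echo "empty"; fi
}

# Compare the three forms on an empty value, a header-like value (made up),
# and two values that look like test operators.
for value in "" 'Bearer realm="https://auth.example.invalid/token"' "-n" "!"; do
    printf '%-6s %-6s %-6s <%s>\n' \
        "$(is_set_legacy "${value}")" \
        "$(is_set_plain "${value}")" \
        "$(is_set_n "${value}")" \
        "${value}"
done
```

The key requirement is that the expansion stays quoted; an unquoted empty variable would leave `[` with too few arguments regardless of which form is used.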

services/jaeger/docker-compose.yml.j2

Lines changed: 3 additions & 0 deletions
@@ -41,6 +41,9 @@ services:
     command:
       - "--config=/etc/otel/config.yaml"
     deploy:
+      labels:
+        - prometheus-job=otel-collector
+        - prometheus-port=8888
       placement:
         constraints:
           - node.labels.ops==true

services/jaeger/opentelemetry-collector-config.yaml

Lines changed: 8 additions & 0 deletions
@@ -19,6 +19,14 @@ service:
       exporters: [otlphttp,otlp]
       processors: [batch,filter/drop_healthcheck]
   telemetry:
+    metrics: # https://opentelemetry.io/docs/collector/internal-telemetry/#prometheus-endpoint-for-internal-metrics
+      readers:
+        - pull:
+            exporter:
+              prometheus:
+                host: '0.0.0.0'
+                port: 8888
+
     logs:
       level: ${TRACING_OPENTELEMETRY_COLLECTOR_SERVICE_TELEMETRY_LOG_LEVEL}
 processors:

services/logging/docker-compose.yml.j2

Lines changed: 7 additions & 1 deletion
@@ -118,18 +118,21 @@ services:
       - VECTOR_CONFIG=/etc/vector/vector.yaml
       - VECTOR_LOG=info
       - VECTOR_LOG_DESTINATION=${VECTOR_LOG_DESTINATION}
+      - PROMETHEUS_SCRAPE_INTERVAL_SECONDS=${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}
     configs:
       - source: vector_config
         target: /etc/vector/vector.yaml
     deploy:
       replicas: 1
+      labels:
+        - prometheus-job=vector
+        - prometheus-port=9598
       resources:
         limits:
           cpus: "1.0"
           memory: 512M
         reservations:
           memory: 256M
-      labels: []
     networks:
       logging:

@@ -153,6 +156,9 @@ services:
       - S3_ENDPOINT_LOKI=${S3_ENDPOINT_LOKI}
       - LOKI_RETENTION_PERIOD=${LOKI_RETENTION_PERIOD}
     deploy:
+      labels:
+        - prometheus-job=loki
+        - prometheus-port=3100
       placement:
         constraints: []
       replicas: 1
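
The services touched by this commit are tagged with a pair of deploy labels, `prometheus-job=<name>` and `prometheus-port=<port>`. How Prometheus consumes these labels is not shown in this diff, so the following is only a sketch of the label convention itself: a hypothetical helper (`scrape_target_from_labels`, not part of the repository) that extracts the pair from a list of labels. Treating a service without both labels as "not-scraped" is an assumption made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; the real discovery logic lives in the
# Prometheus configuration, which is not part of the hunks shown here.

scrape_target_from_labels() {
    job=""
    port=""
    for label in "$@"; do
        case "${label}" in
            prometheus-job=*)  job="${label#prometheus-job=}" ;;
            prometheus-port=*) port="${label#prometheus-port=}" ;;
        esac
    done
    # Assumption: both labels are required for a service to count as a target.
    if [ -n "${job}" ] && [ -n "${port}" ]; then
        echo "${job}:${port}"
    else
        echo "not-scraped"
    fi
}

# Label pairs taken from this commit, plus one service with no prometheus labels.
scrape_target_from_labels prometheus-job=vector prometheus-port=9598
scrape_target_from_labels prometheus-job=loki prometheus-port=3100
scrape_target_from_labels traefik.enable=true
```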

services/logging/template.env

Lines changed: 1 addition & 0 deletions
@@ -25,3 +25,4 @@ S3_REGION_LOKI=${S3_REGION_LOKI}
 S3_SECRET_KEY_LOKI=${S3_SECRET_KEY_LOKI}
 STORAGE_DOMAIN=${STORAGE_DOMAIN}
 VECTOR_LOG_DESTINATION=${VECTOR_LOG_DESTINATION}
+PROMETHEUS_SCRAPE_INTERVAL_SECONDS=${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}

services/logging/vector.yaml

Lines changed: 8 additions & 1 deletion
@@ -2,6 +2,9 @@

 sources:
   # Receive GELF messages from Docker containers via UDP
+  vector_metrics:
+    type: internal_metrics
+    scrape_interval_secs: ${PROMETHEUS_SCRAPE_INTERVAL_SECONDS}
   docker_gelf:
     type: socket
     address: "0.0.0.0:12201"
@@ -124,7 +127,11 @@ sinks:

     healthcheck:
       enabled: true
-
+  prometheus_exporter: # https://vector.dev/docs/administration/monitoring/#metrics
+    type: prometheus_exporter
+    inputs:
+      - vector_metrics
+    address: "0.0.0.0:9598"
   # Send to Graylog via GELF over TCP
   graylog:
     type: socket

services/monitoring/Makefile

Lines changed: 10 additions & 0 deletions
@@ -132,6 +132,16 @@ config.prometheus.simcore: ${REPO_CONFIG_LOCATION} venv
 	envsubst < prometheus/prometheus.yml > prometheus/prometheus.temp.yml; \
 	mv prometheus/prometheus.temp.yml prometheus/prometheus.yml

+.PHONY: config.prometheus.federation
+config.prometheus.federation: ${REPO_CONFIG_LOCATION} venv
+	@set -o allexport; \
+	source $(REPO_CONFIG_LOCATION); \
+	set +o allexport; \
+	envsubst < prometheus/prometheus-federation.template.yml > prometheus/prometheus-federation.yml
+
+.PHONY: prometheus/prometheus-federation.yml
+prometheus/prometheus-federation.yml: config.prometheus.federation
+
 .PHONY: config.prometheus.simcore.aws
 config.prometheus.simcore.aws: ${REPO_CONFIG_LOCATION} venv
 	@set -o allexport; \
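
The new `config.prometheus.federation` recipe sources the repo config between `set -o allexport` and `set +o allexport` before calling `envsubst`. The sketch below illustrates why, under the assumption that the config file holds plain `NAME=value` assignments: `envsubst` runs as a child process and only sees exported variables. The temporary file and the value `15` are made up for the example, and `sh -c` stands in for `envsubst` so that the sketch has no extra dependencies.

```shell
#!/usr/bin/env bash
# Sketch only: the config file name and the value 15 are illustrative assumptions.

workdir="$(mktemp -d)"
printf 'PROMETHEUS_SCRAPE_INTERVAL_SECONDS=15\n' > "${workdir}/repo.config"

# Without allexport, the sourced assignment stays local to the shell,
# so a child process does not see it.
plain="$(
    unset PROMETHEUS_SCRAPE_INTERVAL_SECONDS
    . "${workdir}/repo.config"
    sh -c 'echo "${PROMETHEUS_SCRAPE_INTERVAL_SECONDS:-unset}"'
)"

# With allexport, every assignment made while the option is on is exported,
# so the child process (envsubst in the Makefile) can read it.
exported="$(
    unset PROMETHEUS_SCRAPE_INTERVAL_SECONDS
    set -o allexport
    . "${workdir}/repo.config"
    set +o allexport
    sh -c 'echo "${PROMETHEUS_SCRAPE_INTERVAL_SECONDS:-unset}"'
)"

echo "without allexport: ${plain}"
echo "with allexport:    ${exported}"
rm -rf "${workdir}"
```

Turning the option off again right after sourcing keeps later assignments in the same recipe from being exported unintentionally.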

services/monitoring/docker-compose.yml.j2

Lines changed: 4 additions & 3 deletions
@@ -230,12 +230,11 @@ services:
       - monitored # needed to access postgres
       - public
     deploy:
-      #restart_policy:
-      # condition: on-failure
       labels:
+        - prometheus-job=grafana
+        - prometheus-port=3000
         - traefik.enable=true
         - traefik.swarm.network=${PUBLIC_NETWORK}
-        # direct access through port
         - traefik.http.services.grafana.loadbalancer.server.port=3000
         - traefik.http.routers.grafana.rule=Host(`${MONITORING_DOMAIN}`) && PathPrefix(`/grafana`)
         - traefik.http.routers.grafana.entrypoints=https
@@ -391,6 +390,8 @@ services:
       - monitored
     deploy:
       labels:
+        - prometheus-job=tempo
+        - prometheus-port=3200
         - traefik.enable=true
         - traefik.swarm.network=${PUBLIC_NETWORK}
         - traefik.http.services.tempo.loadbalancer.server.port=9095
Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
 prometheus-ceph.yml
 prometheus.yml
+prometheus-federation.yml
