Merged
Changes from 9 commits
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -49,6 +49,7 @@
- `MimirContinuousTestNotRunningOnWrites`
- `MimirContinuousTestNotRunningOnReads`
- `MimirContinuousTestFailed`
* [ENHANCEMENT] Added `per_cluster_label` support, which allows changing the label name used to differentiate between Kubernetes clusters.
* [BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629

### Jsonnet
@@ -13,7 +13,7 @@ The following table shows the required label names and whether they can be custo

| Label name | Configurable | Description |
| :---------- | :----------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cluster` | No | The Kubernetes cluster or datacenter where the Mimir cluster is running. |
| `cluster` | Yes | The Kubernetes cluster or datacenter where the Mimir cluster is running. The cluster label can be configured with the `per_cluster_label` field in the mixin config. |
| `namespace` | No | The Kubernetes namespace where the Mimir cluster is running. |
| `job` | Partially | The Kubernetes namespace and Mimir component in the format `<namespace>/<component>`. When running in monolithic mode, the `<component>` should be `mimir`. When running in microservices mode, the `<component>` should be the name of the specific Mimir component (singular), like `distributor`, `ingester` or `store-gateway`. The label name can't be configured, while the regular expressions used to match components can be configured with the `job_names` field in the mixin config. |
| `pod` | Yes | The unique identifier of a Mimir replica (e.g. Pod ID when running on Kubernetes). The label name can be configured with the `per_instance_label` field in the mixin config. |
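
As an illustration of the `cluster` row above, the following is a minimal sketch of overriding `per_cluster_label` when consuming the mixin. The import path and the label value `k8s_cluster` are assumptions for the example, not something this change prescribes; adjust them to however the mixin is vendored in your setup.

```jsonnet
// Hypothetical consumer file. The import path is an assumption and depends on
// how the mixin is vendored.
local mixin = import 'mimir-mixin/mixin.libsonnet';

mixin {
  _config+:: {
    // Use `k8s_cluster` (example value) instead of the default `cluster` label.
    per_cluster_label: 'k8s_cluster',
  },
}
```

Because recording rule names in this change are built from the configured label, the rules, alerts and dashboards likely need to be regenerated together after changing this value.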
2 changes: 1 addition & 1 deletion operations/mimir-mixin/alerts/alerts.libsonnet
@@ -435,7 +435,7 @@
alert: $.alertName('ProvisioningTooManyWrites'),
// 80k writes / s per ingester max.
expr: |||
avg by (%(alert_aggregation_labels)s) (cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m) > 80e3
avg by (%(alert_aggregation_labels)s) (%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m) > 80e3
||| % $._config,
'for': '15m',
labels: {
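
To make the effect of replacing the hard-coded `cluster_namespace` prefix concrete, here is a standalone sketch of how the expression might expand. It assumes that `alert_aggregation_labels` and `alert_aggregation_rule_prefix` are derived by joining `cluster_labels` with `, ` and `_` respectively; those definitions are not part of this diff, so treat them as assumptions. The label value `k8s_cluster` is hypothetical.

```jsonnet
// Standalone sketch, not the mixin's actual config object.
local config = {
  per_cluster_label: 'k8s_cluster',  // hypothetical custom label
  per_instance_label: 'pod',
  cluster_labels: [self.per_cluster_label, 'namespace'],
  // Assumed derivations (not shown in this diff):
  alert_aggregation_labels: std.join(', ', self.cluster_labels),
  alert_aggregation_rule_prefix: std.join('_', self.cluster_labels),
};

{
  // Named placeholders are filled from the fields of `config`.
  expr: |||
    avg by (%(alert_aggregation_labels)s) (%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m) > 80e3
  ||| % config,
}
```

Under these assumptions the rendered expression aggregates `by (k8s_cluster, namespace)` over the recording rule `k8s_cluster_namespace_pod:cortex_ingester_ingested_samples_total:rate1m`.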
6 changes: 3 additions & 3 deletions operations/mimir-mixin/alerts/blocks.libsonnet
@@ -14,13 +14,13 @@
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) > 0)
and
# Only if the ingester has ingested samples over the last 4h.
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
and
# Only if the ingester was ingesting samples 4h ago. This protects against the case where the ingester
# instance had ingested samples in the past, then received no traffic for a long period, and then starts
# receiving samples again. Without this check, the alert would fire as soon as the ingester resumes
# receiving samples, while block shipping is expected within the next 4h.
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[1h] offset 4h)) > 0)
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[1h] offset 4h)) > 0)
||| % $._config,
labels: {
severity: 'critical',
@@ -37,7 +37,7 @@
expr: |||
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) == 0)
and
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
||| % $._config,
labels: {
severity: 'critical',
7 changes: 5 additions & 2 deletions operations/mimir-mixin/config.libsonnet
@@ -35,9 +35,12 @@
overrides_exporter: 'overrides-exporter',
},

// The label used to differentiate between different Kubernetes clusters.
per_cluster_label: 'cluster',

// Grouping labels, to uniquely identify and group by {jobs, clusters}
job_labels: ['cluster', 'namespace', 'job'],
cluster_labels: ['cluster', 'namespace'],
job_labels: [$._config.per_cluster_label, 'namespace', 'job'],
cluster_labels: [$._config.per_cluster_label, 'namespace'],

cortex_p99_latency_threshold_seconds: 2.5,
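
The reason `job_labels` and `cluster_labels` reference `$._config.per_cluster_label` rather than a literal is Jsonnet's late binding: a downstream override of the field is picked up by every expression that refers to it. A reduced, self-contained sketch (the objects `base` and `custom` and the value `k8s_cluster` are hypothetical, not part of the mixin):

```jsonnet
// Reduced model of the config above.
local base = {
  _config:: {
    per_cluster_label: 'cluster',
    job_labels: [$._config.per_cluster_label, 'namespace', 'job'],
    cluster_labels: [$._config.per_cluster_label, 'namespace'],
  },
};

// A consumer overrides only the label name.
local custom = base { _config+:: { per_cluster_label: 'k8s_cluster' } };

{
  // `$` is resolved against the final object, so both lists use the override.
  job_labels: custom._config.job_labels,  // ["k8s_cluster", "namespace", "job"]
  cluster_labels: custom._config.cluster_labels,  // ["k8s_cluster", "namespace"]
}
```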

48 changes: 24 additions & 24 deletions operations/mimir-mixin/dashboards/alertmanager.libsonnet
@@ -11,11 +11,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
})
.addPanel(
$.panel('Total alerts') +
$.statPanel('sum(cluster_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
$.statPanel('sum(%s_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
)
.addPanel(
$.panel('Total silences') +
$.statPanel('sum(cluster_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
$.statPanel('sum(%s_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
)
.addPanel(
$.panel('Tenants') +
@@ -29,11 +29,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job:cortex_alertmanager_alerts_received_total:rate5m{%s})
sum(%s_job:cortex_alertmanager_alerts_received_total:rate5m{%s})
-
sum(cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
@@ -46,11 +46,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job_integration:cortex_alertmanager_notifications_total:rate5m{%s})
sum(%s_job_integration:cortex_alertmanager_notifications_total:rate5m{%s})
-
sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
@@ -61,13 +61,13 @@ local utils = import 'mixin-utils/utils.libsonnet';
[
|||
(
sum(cluster_job_integration:cortex_alertmanager_notifications_total:rate5m{%s}) by(integration)
sum(%s_job_integration:cortex_alertmanager_notifications_total:rate5m{%s}) by(integration)
-
sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)
sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)
) > 0
or on () vector(0)
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)' % $.jobMatcher($._config.job_names.alertmanager),
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success - {{ integration }}', 'failed - {{ integration }}']
)
@@ -104,15 +104,15 @@ local utils = import 'mixin-utils/utils.libsonnet';
.addPanel(
$.panel('Per %s alerts' % $._config.per_instance_label) +
$.queryPanel(
'sum by(%s) (cluster_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum by(%s) (%s_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'{{%s}}' % $._config.per_instance_label
) +
$.stack
)
.addPanel(
$.panel('Per %s silences' % $._config.per_instance_label) +
$.queryPanel(
'sum by(%s) (cluster_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum by(%s) (%s_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'{{%s}}' % $._config.per_instance_label
) +
$.stack
@@ -205,11 +205,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job:cortex_alertmanager_state_replication_total:rate5m{%s})
sum(%s_job:cortex_alertmanager_state_replication_total:rate5m{%s})
-
sum(cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
@@ -219,11 +219,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job:cortex_alertmanager_partial_state_merges_total:rate5m{%s})
sum(%s_job:cortex_alertmanager_partial_state_merges_total:rate5m{%s})
-
sum(cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
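
The dashboard queries in this file rely on positional `%s` placeholders, so the order of the arguments must match the order of the placeholders exactly. The sketch below shows how one such query renders, and a possible alternative using named placeholders, as the alert expressions do. The `config` object, the `matcher` value and the label `k8s_cluster` are stand-ins for illustration; this is not a proposed change to the PR.

```jsonnet
// Stand-ins for $._config and $.jobMatcher(...).
local config = { per_cluster_label: 'k8s_cluster', per_instance_label: 'pod' };
local matcher = '%s=~"$cluster", job=~"($namespace)/(alertmanager)"' % config.per_cluster_label;

{
  // Positional form, as used in the dashboard: argument order matters.
  positional: 'sum(%s_job_%s:cortex_alertmanager_alerts:sum{%s})' % [
    config.per_cluster_label,
    config.per_instance_label,
    matcher,
  ],

  // Named form: harder to get wrong when a placeholder is added later.
  named: 'sum(%(per_cluster_label)s_job_%(per_instance_label)s:cortex_alertmanager_alerts:sum{%(matcher)s})'
         % (config { matcher: matcher }),
}
```

Both fields evaluate to the same query string, which starts with `sum(k8s_cluster_job_pod:cortex_alertmanager_alerts:sum{`.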
18 changes: 9 additions & 9 deletions operations/mimir-mixin/dashboards/dashboard-utils.libsonnet
@@ -54,17 +54,17 @@ local utils = import 'mixin-utils/utils.libsonnet';
if $._config.singleBinary
then d.addMultiTemplate('job', 'cortex_build_info', 'job')
else d
.addMultiTemplate('cluster', 'cortex_build_info', 'cluster')
.addMultiTemplate('namespace', 'cortex_build_info{cluster=~"$cluster"}', 'namespace')
.addMultiTemplate('cluster', 'cortex_build_info', $._config.per_cluster_label)
.addMultiTemplate('namespace', 'cortex_build_info{%s=~"$cluster"}' % $._config.per_cluster_label, 'namespace')
else
if $._config.singleBinary
then d.addTemplate('job', 'cortex_build_info', 'job')
else d
.addTemplate('cluster', 'cortex_build_info', 'cluster')
.addTemplate('namespace', 'cortex_build_info{cluster=~"$cluster"}', 'namespace'),
.addTemplate('cluster', 'cortex_build_info', $._config.per_cluster_label)
.addTemplate('namespace', 'cortex_build_info{%s=~"$cluster"}' % $._config.per_cluster_label, 'namespace'),

addActiveUserSelectorTemplates()::
self.addTemplate('user', 'cortex_ingester_active_series{cluster=~"$cluster", namespace=~"$namespace"}', 'user'),
self.addTemplate('user', 'cortex_ingester_active_series{%s=~"$cluster", namespace=~"$namespace"}' % $._config.per_cluster_label, 'user'),

addCustomTemplate(name, values, defaultIndex=0):: self {
templating+: {
@@ -99,17 +99,17 @@ local utils = import 'mixin-utils/utils.libsonnet';
jobMatcher(job)::
if $._config.singleBinary
then 'job=~"$job"'
else 'cluster=~"$cluster", job=~"($namespace)/(%s)"' % job,
else '%s=~"$cluster", job=~"($namespace)/(%s)"' % [$._config.per_cluster_label, job],

namespaceMatcher()::
if $._config.singleBinary
then 'job=~"$job"'
else 'cluster=~"$cluster", namespace=~"$namespace"',
else '%s=~"$cluster", namespace=~"$namespace"' % $._config.per_cluster_label,

jobSelector(job)::
if $._config.singleBinary
then [utils.selector.noop('cluster'), utils.selector.re('job', '$job')]
else [utils.selector.re('cluster', '$cluster'), utils.selector.re('job', '($namespace)/(%s)' % job)],
then [utils.selector.noop($._config.per_cluster_label), utils.selector.re('job', '$job')]
else [utils.selector.re($._config.per_cluster_label, '$cluster'), utils.selector.re('job', '($namespace)/(%s)' % job)],

queryPanel(queries, legends, legendLink=null)::
super.queryPanel(queries, legends, legendLink) + {
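
Note that only the label name in the matcher changes; the Grafana template variable is still called `$cluster`. A reduced sketch of `jobMatcher` that can be evaluated on its own (the `dashboards` wrapper function and the value `k8s_cluster` are hypothetical helpers for this example):

```jsonnet
// Reduced model of jobMatcher; `config` stands in for $._config.
local dashboards(config) = {
  jobMatcher(job)::
    if config.singleBinary
    then 'job=~"$job"'
    else '%s=~"$cluster", job=~"($namespace)/(%s)"' % [config.per_cluster_label, job],
};

local d = dashboards({ singleBinary: false, per_cluster_label: 'k8s_cluster' });

{
  // Renders: k8s_cluster=~"$cluster", job=~"($namespace)/(ingester)"
  matcher: d.jobMatcher('ingester'),
}
```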
4 changes: 2 additions & 2 deletions operations/mimir-mixin/dashboards/overrides.libsonnet
@@ -13,7 +13,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
datasource: '${datasource}',
targets: [
{
expr: 'max by(limit_name) (cortex_limits_defaults{cluster=~"$cluster",namespace=~"$namespace"})',
expr: 'max by(limit_name) (cortex_limits_defaults{%s=~"$cluster",namespace=~"$namespace"})' % $._config.per_cluster_label,
instant: true,
legendFormat: '',
refId: 'A',
@@ -69,7 +69,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
datasource: '${datasource}',
targets: [
{
expr: 'max by(user, limit_name) (cortex_limits_overrides{cluster=~"$cluster",namespace=~"$namespace",user=~"${tenant_id}"})',
expr: 'max by(user, limit_name) (cortex_limits_overrides{%s=~"$cluster",namespace=~"$namespace",user=~"${tenant_id}"})' % $._config.per_cluster_label,
instant: true,
legendFormat: '',
refId: 'A',