docs: Document all OPA metrics definitions #7929
@@ -20,10 +20,17 @@ for all OpenTelemetry-related configurables.
## Prometheus
OPA exposes an HTTP endpoint at `/metrics` that can be used to collect performance metrics
for all API calls. The Prometheus endpoint is enabled by default when you run
OPA as a server.
OPA provides two ways to access performance metrics:

1. **System-wide metrics** via the `/metrics` Prometheus endpoint: instance-level metrics across all OPA operations
2. **Per-query metrics** via API responses with `?metrics=true`: metrics for individual query executions

These serve different purposes: system-wide metrics are for monitoring and alerting on an OPA instance, while per-query metrics are for debugging and optimizing individual queries.
You can enable metric collection from OPA with the following `prometheus.yml` config:
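A minimal sketch of such a scrape config, assuming OPA is serving on its default `localhost:8181` (the job name and scrape interval here are illustrative, not prescribed):

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "opa"            # illustrative job name
    metrics_path: "/metrics"   # OPA's Prometheus endpoint
    static_configs:
      - targets:
          - "localhost:8181"   # OPA's default listen address
```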
@@ -86,6 +93,24 @@ When Prometheus is enabled in the status plugin (see [Configuration](./configura
| last_success_bundle_request | gauge | Last successful bundle request in UNIX nanoseconds. | STABLE |
| bundle_loading_duration_ns | histogram | A histogram of duration for bundle loading. | STABLE |
## Available Metrics

The Prometheus `/metrics` endpoint exposes the following instance-level metrics:
> **Contributor:** This no longer really introduces the content in this section.
- **URL**: `http://localhost:8181/metrics` (default configuration)
- **Method**: HTTP GET
> **Contributor** (suggested change): This is not needed as it's the default.
- **Format**: Prometheus text format
> **Contributor** (suggested change): This is not needed as it's expected to be in that format.
- **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
- **Use case**: Monitoring dashboards, alerting, performance trends
> **Contributor** (suggested change): Not really needed, as I think users understand how to use the metrics if they're looking for which are available.
> **Contributor:** Please list
### Additional Resources
- **Per-query metrics**: See [REST API Performance Metrics](./rest-api#performance-metrics) for debugging individual queries
- **Policy performance**: See [Policy Performance](./policy-performance#performance-metrics) for optimization guidance
- **Status API**: See [Status API](./management-status) for metrics reporting via status updates
- **Decision logs**: See [Decision Logs](./management-decision-logs) for including metrics in decision logs
- **CLI tools**: See [opa eval](./cli#eval) and [opa bench](./cli#bench) for command-line metric collection
## Health Checks
OPA exposes a `/health` API endpoint that can be used to perform health checks.
@@ -977,6 +977,66 @@ This feature can be enabled for `opa run`, `opa eval`, and `opa bench` by settin
Users are encouraged to do performance testing to determine the optimal configuration for their use case.
## Performance Metrics
> **Contributor:** We still seem to have the per-builtin metrics here in this doc as well as in the builtin docs themselves; I think they're better in the builtin docs only.
OPA exposes metrics for policy evaluation performance. These are available through:
- **System-wide metrics** at the `/metrics` Prometheus endpoint
- **Per-query metrics** with individual API responses when `?metrics=true` is specified
See [Monitoring](./monitoring#metrics-overview) for more details.
### Common Built-in Function Metrics
#### HTTP Built-ins
`http.send` metrics help identify I/O bottlenecks:
- `timer_rego_builtin_http_send_ns` - Total time spent in http.send calls
- `counter_rego_builtin_http_send_interquery_cache_hits` - Inter-query cache hits
- `counter_rego_builtin_http_send_network_requests` - Actual network requests made
High cache hit ratios indicate effective caching and reduced network overhead.
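As a sketch of how such a ratio can be computed from per-query metrics — the metric names come from the list above, but the values below are invented:

```python
# Hypothetical per-query metrics map, e.g. from a response with ?metrics=true.
metrics = {
    "timer_rego_builtin_http_send_ns": 1_250_000,
    "counter_rego_builtin_http_send_interquery_cache_hits": 90,
    "counter_rego_builtin_http_send_network_requests": 10,
}

hits = metrics["counter_rego_builtin_http_send_interquery_cache_hits"]
network = metrics["counter_rego_builtin_http_send_network_requests"]
total = hits + network

# Fraction of http.send lookups served from the inter-query cache.
hit_ratio = hits / total if total else 0.0
print(f"cache hit ratio: {hit_ratio:.0%}")  # -> cache hit ratio: 90%
```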
#### Regex Built-ins
Regex operation metrics help optimize pattern matching:
- `timer_rego_builtin_regex_interquery_ns` - Time spent in regex operations
- `counter_rego_builtin_regex_interquery_cache_hits` - Regex pattern cache hits
- `counter_rego_builtin_regex_interquery_value_cache_hits` - Regex value cache hits
Effective regex caching improves performance when the same patterns are used repeatedly.
### Core Query Metrics
Basic query evaluation phases:
- `timer_rego_query_parse_ns` - Time parsing the query string
- `timer_rego_query_compile_ns` - Time compiling the query
- `timer_rego_query_eval_ns` - Time executing the compiled query
Compilation time often dominates in complex policies.
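A small sketch of that comparison, using the timer names above with invented values:

```python
# Hypothetical per-query timers (nanoseconds); only the key names are real.
metrics = {
    "timer_rego_query_parse_ns": 40_000,
    "timer_rego_query_compile_ns": 1_800_000,
    "timer_rego_query_eval_ns": 600_000,
}

# Identify the dominant phase and its share of the total query time.
dominant = max(metrics, key=metrics.get)
share = metrics[dominant] / sum(metrics.values())
print(dominant, f"{share:.0%}")  # -> timer_rego_query_compile_ns 74%
```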
### High-Level Metrics
Server-level metrics for overall performance:
- `timer_server_handler_ns` - Total request handler execution time
- `counter_server_query_cache_hit` - Server-level query cache hits
### Using Metrics for Optimization
1. **Query phases**: Compare parse, compile, and eval times to identify bottlenecks
2. **Cache effectiveness**: Low cache hit rates suggest tuning opportunities
3. **I/O bottlenecks**: High `http.send` network request counts indicate caching issues
4. **Pattern matching**: Monitor regex cache hits for frequently used patterns
Access metrics via:

- REST API: Add `?metrics=true` to policy evaluation requests
- CLI: Use the `--metrics` flag with `opa eval` or `opa bench`
- Prometheus: See [Monitoring](./monitoring#prometheus) for system-wide metrics
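For the REST API route, a sketch of reading the per-query numbers out of a response. The `{"result": ..., "metrics": {...}}` shape follows OPA's REST API docs for `?metrics=true`, but the response below is invented:

```python
# Hypothetical Data API response when ?metrics=true is set; values are made up.
response = {
    "result": True,
    "metrics": {
        "timer_rego_query_parse_ns": 30_000,
        "timer_rego_query_compile_ns": 200_000,
        "timer_rego_query_eval_ns": 90_000,
        "timer_server_handler_ns": 350_000,
    },
}

query_ns = sum(
    v for k, v in response["metrics"].items() if k.startswith("timer_rego_query_")
)
# Handler time not accounted for by parse/compile/eval (routing, serialization, ...).
overhead_ns = response["metrics"]["timer_server_handler_ns"] - query_ns
print(overhead_ns)  # -> 30000
```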
## Key Takeaways
For high-performance use cases:
@@ -987,3 +1047,4 @@ For high-performance use cases:
- Write your policies with indexed statements so that [rule-indexing](https://blog.openpolicyagent.org/optimizing-opa-rule-indexing-59f03f17caf3) is effective.
- Use the profiler to help identify portions of the policy that would benefit the most from improved performance.
- Use the benchmark tools to help get real world timing data and detect policy performance changes.
- Monitor performance metrics to track optimization impact and identify bottlenecks.
@@ -27,3 +27,13 @@ The following table shows examples of how `glob.match` works:
| `output := glob.match("{cat,bat,[fr]at}", [], "bat")` | `true` | A glob with pattern-alternatives matchers. |
| `output := glob.match("{cat,bat,[fr]at}", [], "rat")` | `true` | A glob with pattern-alternatives matchers. |
| `output := glob.match("{cat,bat,[fr]at}", [], "at")` | `false` | A glob with pattern-alternatives matchers. |
## Performance Metrics
When OPA is configured with metrics enabled, `glob.match` operations expose the following per-query metrics (available when `?metrics=true` is specified in API requests):
| Metric | Description |
| ------ | ----------- |
| `counter_rego_builtin_glob_interquery_value_cache_hits` | Number of inter-query cache hits for compiled glob patterns |
Effective glob pattern caching improves performance when the same patterns are used repeatedly across queries. High cache hit ratios indicate that glob compilation overhead is being minimized through caching.
@@ -110,3 +110,13 @@ overlap. This can be useful when using patterns to define permissions or access
rules. The function returns `true` if the two patterns overlap and `false` otherwise.
<PlaygroundExample dir={require.context('../_examples/regex/globs_match/role_patterns')} />
## Performance Metrics
When OPA is configured with metrics enabled, regex operations expose the following per-query metrics (available when `?metrics=true` is specified in API requests):
| Metric | Description |
| ------ | ----------- |
| `counter_rego_builtin_regex_interquery_value_cache_hits` | Number of regex cache hits for compiled patterns |
Effective regex caching improves performance when the same patterns are used repeatedly. High cache hit ratios indicate that regex compilation overhead is being minimized through caching.
@@ -2333,9 +2333,12 @@ Query instrumentation can help diagnose performance problems, however, it can
add significant overhead to query evaluation. We recommend leaving query
instrumentation off unless you are debugging a performance problem.
When query instrumentation is enabled (`instrument=true`), the following additional detailed evaluation metrics are included:
> **Contributor:** Here are some examples of how to learn what the different metrics are: query `v1/data`. This is not an exhaustive list; if you can, it'd be best to run some example queries for each endpoint so you can learn what the different metrics are. Also note that the different metrics will depend on the data you post, builtin functions, etc. If you want to document this section, some significant research will be needed in order to gather what is available and what the metrics mean.
- **timer_eval_op_***: Various evaluation operation timers (e.g., `timer_eval_op_plug_ns`, `timer_eval_op_resolve_ns`)
> **Contributor:** Would be good to explain what these are rather than just 'various operation timers'.
- **histogram_eval_op_***: Histograms tracking evaluation operation time distributions
> **Contributor:** `histogram_eval_op_builtin_call` is one; it'd be good to give examples of them all.
- **timer_rego_builtin_***: Built-in function execution times
- **counter_rego_builtin_***: Built-in function call counts and cache hits
- **timer_compile_stage_*_ns**: Compilation stage timers for the query and module compilation stages
## Provenance
> **Contributor:** I would put this in an admonition instead, since it's related but not 100% on topic for the Prometheus section; this section is just about `/metrics`, but `?metrics=true` is important further reading.