diff --git a/docs/docs/monitoring.md b/docs/docs/monitoring.md
index eaedb631d6..abce9f53c1 100644
--- a/docs/docs/monitoring.md
+++ b/docs/docs/monitoring.md
@@ -20,10 +20,17 @@ for all OpenTelemetry-related configurables.
 
 ## Prometheus
 
-OPA exposes an HTTP endpoint that can be used to collect performance metrics
+OPA exposes an HTTP endpoint at `/metrics` that can be used to collect performance metrics
 for all API calls. The Prometheus endpoint is enabled by default when you run OPA
 as a server.
 
+OPA provides two ways to access performance metrics:
+
+1. **System-wide metrics** via the `/metrics` Prometheus endpoint - Instance-level metrics across all OPA operations
+2. **Per-query metrics** via API responses with `?metrics=true` - Metrics for individual query executions
+
+These serve different purposes: system-wide metrics support instance monitoring and alerting, while per-query metrics support debugging and optimization.
+
 You can enable metric collection from OPA with the following `prometheus.yml` config:
 
 ```yaml
@@ -86,6 +93,24 @@ When Prometheus is enabled in the status plugin (see [Configuration](./configura
 | last_success_bundle_request | gauge | Last successful bundle request in UNIX nanoseconds. | STABLE |
 | bundle_loading_duration_ns | histogram | A histogram of duration for bundle loading. | STABLE |
 
+## Available Metrics
+
+The Prometheus `/metrics` endpoint serves instance-level metrics as follows:
+
+- **URL**: `http://localhost:8181/metrics` (default configuration)
+- **Method**: HTTP GET
+- **Format**: Prometheus text format
+- **Contents**: Instance-level counters, timers, histograms, Go runtime metrics
+- **Use case**: Monitoring dashboards, alerting, performance trends
+
+### Additional Resources
+
+- **Per-query metrics**: See [REST API Performance Metrics](./rest-api#performance-metrics) for debugging individual queries
+- **Policy performance**: See [Policy Performance](./policy-performance#performance-metrics) for optimization guidance
+- **Status API**: See [Status API](./management-status) for metrics reporting via status updates
+- **Decision logs**: See [Decision Logs](./management-decision-logs) for including metrics in decision logs
+- **CLI tools**: See [opa eval](./cli#eval) and [opa bench](./cli#bench) for command-line metric collection
+
 ## Health Checks
 
 OPA exposes a `/health` API endpoint that can be used to perform health checks.
diff --git a/docs/docs/policy-performance.md b/docs/docs/policy-performance.md
index 5875dd0c77..8b822fbe55 100644
--- a/docs/docs/policy-performance.md
+++ b/docs/docs/policy-performance.md
@@ -977,6 +977,66 @@ This feature can be enabled for `opa run`, `opa eval`, and `opa bench` by settin
 
 Users are recommended to do performance testing to determine the optimal configuration for their use case.
 
+## Performance Metrics
+
+OPA exposes metrics for policy evaluation performance. These are available through:
+
+- **System-wide metrics** at the `/metrics` Prometheus endpoint
+- **Per-query metrics** with individual API responses when `?metrics=true` is specified
+
+See [Monitoring](./monitoring#prometheus) for more details.
+
+### Common Built-in Function Metrics
+
+#### HTTP Built-ins
+
+`http.send` metrics help identify I/O bottlenecks:
+
+- `timer_rego_builtin_http_send_ns` - Total time spent in http.send calls
+- `counter_rego_builtin_http_send_interquery_cache_hits` - Inter-query cache hits
+- `counter_rego_builtin_http_send_network_requests` - Actual network requests made
+
+High cache hit ratios indicate effective caching and reduced network overhead.
+
+#### Regex Built-ins
+
+Regex operation metrics help optimize pattern matching:
+
+- `timer_rego_builtin_regex_interquery_ns` - Time spent in regex operations
+- `counter_rego_builtin_regex_interquery_cache_hits` - Regex pattern cache hits
+- `counter_rego_builtin_regex_interquery_value_cache_hits` - Regex value cache hits
+
+Effective regex caching improves performance when the same patterns are used repeatedly.
+
+### Core Query Metrics
+
+Basic query evaluation phases:
+
+- `timer_rego_query_parse_ns` - Time parsing the query string
+- `timer_rego_query_compile_ns` - Time compiling the query
+- `timer_rego_query_eval_ns` - Time executing the compiled query
+
+For complex policies, compilation time can exceed evaluation time, so compare all three timers before optimizing.
+
+### High-Level Metrics
+
+Server-level metrics for overall performance:
+
+- `timer_server_handler_ns` - Total request handler execution time
+- `counter_server_query_cache_hit` - Server-level query cache hits
+
+### Using Metrics for Optimization
+
+1. **Query phases**: Compare parse, compile, and eval times to identify bottlenecks
+2. **Cache effectiveness**: Low cache hit rates suggest that cache settings or request patterns may need tuning
+3. **I/O bottlenecks**: A high `http.send` network request count relative to cache hits suggests that responses are not being served from the cache
+4. **Pattern matching**: Monitor regex cache hits for frequently used patterns
+
+Access metrics via:
+- REST API: Add `?metrics=true` to policy evaluation requests
+- CLI: Use the `--metrics` flag with `opa eval` or `opa bench`
+- Prometheus: See [Monitoring](./monitoring#prometheus) for system-wide metrics
+
 ## Key Takeaways
 
 For high-performance use cases:
@@ -987,3 +1047,4 @@ For high-performance use cases:
 - Write your policies with indexed statements so that [rule-indexing](https://blog.openpolicyagent.org/optimizing-opa-rule-indexing-59f03f17caf3) is effective.
 - Use the profiler to help identify portions of the policy that would benefit the most from improved performance.
 - Use the benchmark tools to help get real world timing data and detect policy performance changes.
+- Monitor performance metrics to track optimization impact and identify bottlenecks.
diff --git a/docs/docs/policy-reference/builtins/glob.mdx b/docs/docs/policy-reference/builtins/glob.mdx
index 799665e91c..3d17b8f496 100644
--- a/docs/docs/policy-reference/builtins/glob.mdx
+++ b/docs/docs/policy-reference/builtins/glob.mdx
@@ -27,3 +27,13 @@ The following table shows examples of how `glob.match` works:
 | `output := glob.match("{cat,bat,[fr]at}", [], "bat")` | `true` | A glob with pattern-alternatives matchers. |
 | `output := glob.match("{cat,bat,[fr]at}", [], "rat")` | `true` | A glob with pattern-alternatives matchers. |
 | `output := glob.match("{cat,bat,[fr]at}", [], "at")` | `false` | A glob with pattern-alternatives matchers. |
+
+## Performance Metrics
+
+When metrics are enabled, `glob.match` operations report the following per-query metrics (returned when `?metrics=true` is specified in API requests):
+
+| Metric | Description |
+| ------ | ----------- |
+| `counter_rego_builtin_glob_interquery_value_cache_hits` | Number of inter-query cache hits for compiled glob patterns |
+
+Effective glob pattern caching improves performance when the same patterns are used repeatedly across queries. High cache hit ratios indicate that glob compilation overhead is being minimized through caching.
diff --git a/docs/docs/policy-reference/builtins/http.mdx b/docs/docs/policy-reference/builtins/http.mdx
index b09cf3f2b9..aacb475b88 100644
--- a/docs/docs/policy-reference/builtins/http.mdx
+++ b/docs/docs/policy-reference/builtins/http.mdx
@@ -113,3 +113,15 @@ The table below shows examples of calling `http.send`:
 | Files containing TLS material | `http.send({"method": "get", "url": "https://127.0.0.1:65331", "tls_ca_cert_file": "testdata/ca.pem", "tls_client_cert_file": "testdata/client-cert.pem", "tls_client_key_file": "testdata/client-key.pem"})` |
 | Environment variables containing TLS material | `http.send({"method": "get", "url": "https://127.0.0.1:65360", "tls_ca_cert_env_variable": "CLIENT_CA_ENV", "tls_client_cert_env_variable": "CLIENT_CERT_ENV", "tls_client_key_env_variable": "CLIENT_KEY_ENV"})` |
 | Unix Socket URL Format | `http.send({"method": "get", "url": "unix://localhost/?socket=%F2path%F2file.socket"})` |
+
+## Performance Metrics
+
+When metrics are enabled, `http.send` operations report the following per-query metrics (returned when `?metrics=true` is specified in API requests):
+
+| Metric | Description |
+| ------ | ----------- |
+| `timer_rego_builtin_http_send_ns` | Total time spent in `http.send` calls during query evaluation |
+| `counter_rego_builtin_http_send_interquery_cache_hits` | Number of inter-query cache hits for `http.send` requests |
+| `counter_rego_builtin_http_send_network_requests` | Number of actual network requests made by `http.send` |
+
+High cache hit ratios indicate effective caching and reduced network overhead. These metrics help identify I/O bottlenecks in policies that make external HTTP requests.
diff --git a/docs/docs/policy-reference/builtins/regex.mdx b/docs/docs/policy-reference/builtins/regex.mdx
index eb3f99e2d2..35a1400324 100644
--- a/docs/docs/policy-reference/builtins/regex.mdx
+++ b/docs/docs/policy-reference/builtins/regex.mdx
@@ -110,3 +110,13 @@ overlap.
 
 This can be useful when using patterns to define permissions or access rules.
 The function returns `true` if the two patterns overlap and `false` otherwise.
+
+## Performance Metrics
+
+When metrics are enabled, regex operations report the following per-query metrics (returned when `?metrics=true` is specified in API requests):
+
+| Metric | Description |
+| ------ | ----------- |
+| `counter_rego_builtin_regex_interquery_value_cache_hits` | Number of inter-query value cache hits for compiled regex patterns |
+
+Effective regex caching improves performance when the same patterns are used repeatedly. High cache hit ratios indicate that regex compilation overhead is being minimized through caching.
diff --git a/docs/docs/rest-api.md b/docs/docs/rest-api.md
index d9eace22d1..793372a7b9 100644
--- a/docs/docs/rest-api.md
+++ b/docs/docs/rest-api.md
@@ -2333,9 +2333,12 @@ Query instrumentation can help diagnose performance problems, however, it can
 add significant overhead to query evaluation. We recommend leaving query
 instrumentation off unless you are debugging a performance problem.
 
-When instrumentation is enabled there are several additional performance metrics
-for the compilation stages. They follow the format of `timer_compile_stage_*_ns`
-and `timer_query_compile_stage_*_ns` for the query and module compilation stages.
+When query instrumentation is enabled (`instrument=true`), the following additional metrics are included:
+- **`timer_eval_op_*`**: Various evaluation operation timers (e.g., `timer_eval_op_plug_ns`, `timer_eval_op_resolve_ns`)
+- **`histogram_eval_op_*`**: Histograms tracking evaluation operation time distributions
+- **`timer_rego_builtin_*`**: Built-in function execution times
+- **`counter_rego_builtin_*`**: Built-in function call counts and cache hits
+- **`timer_compile_stage_*_ns`** and **`timer_query_compile_stage_*_ns`**: Compilation stage timers for the module and query compilation stages
 
 ## Provenance
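
The per-query metrics documented in this change can be turned into the comparisons the policy-performance section recommends (phase breakdown and `http.send` cache effectiveness). The following is a minimal Python sketch, not part of the patch. It assumes the Data API response carries a top-level `metrics` object when `?metrics=true` is set, and it treats inter-query cache hits and network requests as disjoint outcomes of an `http.send` call, which is a simplification. The helper names, the sample values, and the policy path in the usage note are hypothetical.

```python
import json
import urllib.request


def fetch_metrics(url, input_doc):
    """POST an input document to an OPA Data API URL and return its metrics.

    Assumption: `url` already ends in `?metrics=true` and the response body
    is a JSON object with a top-level "metrics" key.
    """
    body = json.dumps({"input": input_doc}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp).get("metrics", {})


def phase_breakdown(metrics):
    """Return the share of parse/compile/eval time as fractions of their sum."""
    phases = {
        name: metrics.get(f"timer_rego_query_{name}_ns", 0)
        for name in ("parse", "compile", "eval")
    }
    total = sum(phases.values())
    if total == 0:
        return {name: 0.0 for name in phases}
    return {name: ns / total for name, ns in phases.items()}


def http_send_cache_hit_ratio(metrics):
    """Return hits / (hits + network requests), or None if http.send was unused.

    Assumption: each counted http.send call is either an inter-query cache
    hit or a network request, never both.
    """
    hits = metrics.get("counter_rego_builtin_http_send_interquery_cache_hits", 0)
    requests = metrics.get("counter_rego_builtin_http_send_network_requests", 0)
    lookups = hits + requests
    return hits / lookups if lookups else None


# Hypothetical metrics payload, shaped like the metric names in the docs above.
SAMPLE_METRICS = {
    "timer_rego_query_parse_ns": 1000,
    "timer_rego_query_compile_ns": 6000,
    "timer_rego_query_eval_ns": 3000,
    "counter_rego_builtin_http_send_interquery_cache_hits": 3,
    "counter_rego_builtin_http_send_network_requests": 1,
}

if __name__ == "__main__":
    print(phase_breakdown(SAMPLE_METRICS))
    print(http_send_cache_hit_ratio(SAMPLE_METRICS))
```

Against a running server, `fetch_metrics("http://localhost:8181/v1/data/example/allow?metrics=true", {"user": "alice"})` would supply the dictionary instead of `SAMPLE_METRICS`; the `example/allow` path and the input document are placeholders for whatever policy is loaded.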