diff --git a/docs/developers/l1/dashboards.md b/docs/developers/l1/dashboards.md
index bd6727783f3..7b9316c3af4 100644
--- a/docs/developers/l1/dashboards.md
+++ b/docs/developers/l1/dashboards.md
@@ -1,4 +1,4 @@
-# Ethrex L1 Performance Dashboard (Oct 2025)
+# Ethrex L1 Performance Dashboard (Nov 2025)
 
 Our Grafana dashboard provides a comprehensive overview of key metrics to help developers and operators ensure optimal performance and reliability of their Ethrex nodes. The only configured datasource today is `prometheus`, and the `job` variable defaults to `ethrex L1`, which is the job configured by default in our provisioning.
 
@@ -67,10 +67,78 @@ _**Limitations**: This panel has the same limitations as the "Ggas/s by Block" p
 
 ## Block execution breakdown
 
-This row repeats a pie chart for each instance showing how execution time splits between storage reads, account reads, and non-database work so you can confirm performance tuning effects.
+Collapsed row that surfaces instrumentation from the `add_block_pipeline` and `execute_block_pipeline` timer series so you can understand how each instance spends its time when processing blocks. Every panel is repeated vertically per instance to make comparisons easier.
 
 ![Block Execution Breakdown](img/block_execution_breakdown.png)
 
+### Block Execution Breakdown pie
+Pie chart showing how execution time splits between storage reads, account reads, and non-database work so you can see which bottlenecks lie outside of execution itself.
+
+![Block Execution Breakdown pie](img/block_execution_breakdown_pie.png)
+
+### Execution vs Merkleization Diff %
+Tracks how much longer we spend merkleizing versus running the execution phase inside `execute_block_pipeline`. Values above zero mean merkleization dominates; negative readings flag when pure execution becomes the bottleneck (which should be extremely rare). Since both phases run concurrently and merkleization depends on execution, the actual `execute_block_pipeline` time is, in practice, just the maximum of the two.
+
+![Execution vs Merkleization Diff %](img/execution_vs_merkleization_diff.png)
+
+### Block Execution Deaggregated by Block
+Plots execution-stage timers (storage/account reads, execution without reads, merkleization) against the block number once all selected instances report the same head.
+
+![Block Execution Deaggregated by Block](img/block_execution_deaggregated_by_block.png)
+
+_**Limitations**: This panel has the same limitations as the other `by block` panels, as it relies on the same logic to align blocks across instances, and can look odd during multi-slot reorgs._
+
+## Engine API
+
+Collapsed row that surfaces the `namespace="engine"` Prometheus timers so you can keep an eye on EL <> CL Engine API health. Each panel is repeated per instance so you can compare behaviour across nodes.
+
+![Engine API row](img/engine_api_row.png)
+
+### Engine Request Rate by Method
+Shows how many Engine API calls per second we process, split by JSON-RPC method and averaged across the currently selected dashboard range.
+
+![Engine Request Rate by Method](img/engine_request_rate_by_method.png)
+
+### Engine Latency by Methods (Avg Duration)
+Bar gauge of the historical average latency per Engine method over the selected time range.
+
+![Engine Latency by Methods](img/engine_latency_by_methods.png)
+
+### Engine Latency by Method
+Live timeseries meant to correlate with per-block execution time, showing real-time latency per Engine method over an 18 s lookback window.
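+
+The exact expression lives in the provisioned dashboard JSON; as a rough sketch only (assuming the `function_duration_seconds` histogram series with `namespace` and `function_name` labels that the profiling layer exports, plus an `$instance` dashboard variable), the per-method latency can be derived roughly like this:
+
+```promql
+# Hypothetical sketch, not the exact panel query: average Engine API latency
+# per method over the 18 s lookback window.
+sum by (function_name) (rate(function_duration_seconds_sum{namespace="engine", instance="$instance"}[18s]))
+  / sum by (function_name) (rate(function_duration_seconds_count{namespace="engine", instance="$instance"}[18s]))
+```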
+
+![Engine Latency by Method](img/engine_latency_by_method.png)
+
+_**Limitations**: The aggregated panels pull averages across the current dashboard range, so very short ranges can look noisy while long ranges may smooth out brief incidents. The live latency chart still relies on an 18 s window to calculate the average, which should closely track per-block execution times, but some intermediate measurements can be lost._
+
+## RPC API
+
+Another collapsed row focused on the public JSON-RPC surface (`namespace="rpc"`). Expand it when you need to diagnose endpoint hotspots or validate rate limiting. Each panel is repeated per instance so you can compare behaviour across nodes.
+
+![RPC API row](img/rpc_api_row.png)
+
+### RPC Time per Method
+Pie chart that shows where RPC time is spent across methods over the selected range. Quickly surfaces which endpoints dominate total processing time.
+
+![RPC Time per Method](img/rpc_time_per_method.png)
+
+### Slowest RPC Methods
+Table listing the highest average-latency methods over the active dashboard range. Used to prioritise optimisation or caching efforts.
+
+![Slowest RPC Methods](img/slowest_rpc_methods.png)
+
+### RPC Request Rate by Method
+Timeseries showing request throughput broken down by method, averaged across the selected range.
+
+![RPC Request Rate by Method](img/rpc_request_rate_by_method.png)
+
+### RPC Latency by Methods
+Live timeseries meant to correlate with per-block execution time, showing real-time latency per RPC method over an 18 s lookback window.
+
+![RPC Latency by Methods](img/rpc_latency_by_methods.png)
+
+_**Limitations**: The RPC latency views inherit the same windowing caveats as the Engine charts: averages use the dashboard time range while the live chart relies on an 18 s window._
+
 ## Process and server info
 
 Row panels showing process-level and host-level metrics to help you monitor resource usage and spot potential issues.
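+
+As an illustration only (the actual panel queries live in the provisioned dashboard JSON and may differ), host CPU utilisation in this row can be derived from `node_exporter` series, for example:
+
+```promql
+# Hypothetical sketch: percentage of CPU time spent in non-idle modes, per host.
+100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
+```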
diff --git a/docs/developers/l1/img/block_execution_breakdown.png b/docs/developers/l1/img/block_execution_breakdown.png index a28cfdc5eb5..ee8e1050571 100644 Binary files a/docs/developers/l1/img/block_execution_breakdown.png and b/docs/developers/l1/img/block_execution_breakdown.png differ diff --git a/docs/developers/l1/img/block_execution_breakdown.png.png b/docs/developers/l1/img/block_execution_breakdown.png.png new file mode 100644 index 00000000000..8e5a509bfb1 Binary files /dev/null and b/docs/developers/l1/img/block_execution_breakdown.png.png differ diff --git a/docs/developers/l1/img/block_execution_breakdown_pie.png b/docs/developers/l1/img/block_execution_breakdown_pie.png new file mode 100644 index 00000000000..40c0206ff89 Binary files /dev/null and b/docs/developers/l1/img/block_execution_breakdown_pie.png differ diff --git a/docs/developers/l1/img/block_execution_deaggregated_by_block.png b/docs/developers/l1/img/block_execution_deaggregated_by_block.png new file mode 100644 index 00000000000..db7e195c9b7 Binary files /dev/null and b/docs/developers/l1/img/block_execution_deaggregated_by_block.png differ diff --git a/docs/developers/l1/img/engine_api_row.png b/docs/developers/l1/img/engine_api_row.png new file mode 100644 index 00000000000..a77024cfb25 Binary files /dev/null and b/docs/developers/l1/img/engine_api_row.png differ diff --git a/docs/developers/l1/img/engine_latency_by_method.png b/docs/developers/l1/img/engine_latency_by_method.png new file mode 100644 index 00000000000..195fd6ea92c Binary files /dev/null and b/docs/developers/l1/img/engine_latency_by_method.png differ diff --git a/docs/developers/l1/img/engine_latency_by_methods.png b/docs/developers/l1/img/engine_latency_by_methods.png new file mode 100644 index 00000000000..8c7137a33af Binary files /dev/null and b/docs/developers/l1/img/engine_latency_by_methods.png differ diff --git a/docs/developers/l1/img/engine_request_rate_by_method.png b/docs/developers/l1/img/engine_request_rate_by_method.png new file mode 100644 index 00000000000..3c84ea7afdc Binary files /dev/null and b/docs/developers/l1/img/engine_request_rate_by_method.png differ diff --git a/docs/developers/l1/img/execution_vs_merkleization_diff.png b/docs/developers/l1/img/execution_vs_merkleization_diff.png new file mode 100644 index 00000000000..46e4c82aab1 Binary files /dev/null and b/docs/developers/l1/img/execution_vs_merkleization_diff.png differ diff --git a/docs/developers/l1/img/rpc_api_row.png b/docs/developers/l1/img/rpc_api_row.png new file mode 100644 index 00000000000..0bd5ae99edc Binary files /dev/null and b/docs/developers/l1/img/rpc_api_row.png differ diff --git a/docs/developers/l1/img/rpc_latency_by_methods.png b/docs/developers/l1/img/rpc_latency_by_methods.png new file mode 100644 index 00000000000..48b7c5d41cd Binary files /dev/null and b/docs/developers/l1/img/rpc_latency_by_methods.png differ diff --git a/docs/developers/l1/img/rpc_request_rate_by_method.png b/docs/developers/l1/img/rpc_request_rate_by_method.png new file mode 100644 index 00000000000..54524eaece3 Binary files /dev/null and b/docs/developers/l1/img/rpc_request_rate_by_method.png differ diff --git a/docs/developers/l1/img/rpc_time_per_method.png b/docs/developers/l1/img/rpc_time_per_method.png new file mode 100644 index 00000000000..f41c7c5b6ec Binary files /dev/null and b/docs/developers/l1/img/rpc_time_per_method.png differ diff --git a/docs/developers/l1/img/slowest_rpc_methods.png b/docs/developers/l1/img/slowest_rpc_methods.png new file mode 100644 
index 00000000000..49a04f8eb47
Binary files /dev/null and b/docs/developers/l1/img/slowest_rpc_methods.png differ
diff --git a/docs/developers/l1/metrics.md b/docs/developers/l1/metrics.md
index 82b12713c5c..b754bad2e07 100644
--- a/docs/developers/l1/metrics.md
+++ b/docs/developers/l1/metrics.md
@@ -1,43 +1,26 @@
 # Metrics
 
-## Ethereum Metrics Exporter
-
-We use the [Ethereum Metrics Exporter](https://github.com/ethpandaops/ethereum-metrics-exporter), a Prometheus metrics exporter for Ethereum execution and consensus nodes, to gather metrics during syncing for L1. The exporter uses the prometheus data source to create a Grafana dashboard and display the metrics. For the syncing to work there must be a consensus node running along with the execution node.
-
-Currently we have two make targets to easily start an execution node and a consensus node on either hoodi or holesky, and display the syncing metrics. In both cases we use a lighthouse consensus node.
-
-### Quickstart guide
+## Quickstart
+For a high-level quickstart guide, please refer to [Monitoring](../../l1/running/monitoring.md).
 
-Make sure you have your docker daemon running.
-
-- **Code Location**: The targets are defined in `tooling/sync/Makefile`.
-- **How to Run**:
-
-  ```bash
-  # Navigate to tooling/sync directory
-  cd tooling/sync
-
-  # Run target for hoodi
-  make start-hoodi-metrics-docker
+## Ethereum Metrics Exporter
 
-  # Run target for holesky
-  make start-holesky-metrics-docker
-  ```
+We use the [Ethereum Metrics Exporter](https://github.com/ethpandaops/ethereum-metrics-exporter), a Prometheus metrics exporter for Ethereum execution and consensus nodes, as an additional tool to gather metrics during L1 execution. The exporter uses the Prometheus data source to create a Grafana dashboard and display the metrics.
 
-To see the dashboards go to [http://localhost:3001](http://localhost:3001). Use “admin” for user and password. Select the Dashboards menu and go to Ethereum Metrics Exporter (Single) to see the exported metrics.
+## L1 Metrics Dashboard
 
-To see the prometheus exported metrics and its respective requests with more detail in case you need to debug go to [http://localhost:9093/metrics](http://localhost:9093/metrics).
+We provide a pre-configured Grafana dashboard to monitor Ethrex L1 nodes. For detailed information on the provided dashboard, see our [L1 Dashboard document](./dashboards.md).
 
 ### Running the execution node on other networks with metrics enabled
 
-A `docker-compose` is used to bundle prometheus and grafana services, the `*overrides` files define the ports and mounts the prometheus' configuration file.
+As shown in [Monitoring](../../l1/running/monitoring.md), `docker-compose` is used to bundle the Prometheus and Grafana services; the `*overrides` files define the ports and mount the Prometheus configuration file. If a new dashboard is designed, it only needs to be mounted in that `*overrides` file.
 
 A consensus node must be running for the syncing to work.
 
 To run the execution node on any network with metrics, the next steps should be followed:
 
 1. Build the `ethrex` binary for the network you want (see node options in [CLI Commands](../../CLI.md#cli-commands)) with the `metrics` feature enabled.
 2. Enable metrics by using the `--metrics` flag when starting the node.
-3. Set the `--metrics.port` cli arg of the ethrex binary to match the port defined in `metrics/provisioning/prometheus/prometheus_l1_sync_docker.yaml`
+3.
Set the `--metrics.port` cli arg of the ethrex binary to match the port defined in `metrics/provisioning/prometheus/prometheus_l1_sync_docker.yaml`, which is `3701` right now. 4. Run the docker containers: ```bash @@ -45,4 +28,3 @@ To run the execution node on any network with metrics, the next steps should be docker compose -f docker-compose-metrics.yaml -f docker-compose-metrics-l1.overrides.yaml up ``` -For more details on running a sync go to `tooling/sync/readme.md`. diff --git a/docs/internal/l1/metrics_coverage_gap_analysis.md b/docs/internal/l1/metrics_coverage_gap_analysis.md index d60ea7b8913..a27533f4e00 100644 --- a/docs/internal/l1/metrics_coverage_gap_analysis.md +++ b/docs/internal/l1/metrics_coverage_gap_analysis.md @@ -4,9 +4,9 @@ This note tracks the current state of metrics and dashboard observability for the L1, highlights the gaps against a cross-client baseline. It covers runtime metrics exposed through our crates, the existing Grafana "Ethrex L1 - Perf" dashboard, and supporting exporters already wired in provisioning. ### At a glance -- **Covered today**: Block execution timings, gas throughput, and host/process health are exported and graphed through `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`. -- **Missing**: Sync/peer awareness, txpool depth, Engine API visibility, JSON-RPC health, and state/storage IO metrics are absent or only logged. -- **Near-term focus**: Ship sync & peer gauges, surface txpool counters we already emit, and extend instrumentation around Engine API and storage before adding alert rules. +- **Covered today**: Block execution timings, detailed execution breakdown, Engine API and JSON-RPC method telemetry, and host/process health are exported and graphed through `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`. The refreshed [L1 Dashboard doc](./dashboards.md) has screenshots and panel descriptions. +- **Missing**: Sync/peer awareness, txpool depth, storage IO metrics, and richer error taxonomy are absent or only logged. +- **Near-term focus**: Ship sync & peer gauges, surface txpool counters we already emit, extend storage instrumentation, and harden alerting before widening coverage further. ## Baseline We Compare Against The gap analysis below uses a cross-client checklist we gathered after looking at Geth and Nethermind metrics and dashboard setups; this works as a baseline of "must-have" coverage for execution clients. The key categories are: @@ -19,7 +19,7 @@ The gap analysis below uses a cross-client checklist we gathered after looking a - **Process & host health**: CPU, memory, FDs, uptime, disk headroom (usually covered by node_exporter but treated as must-have). - **Error & anomaly counters**: explicit counters for reorgs, failed imports, sync retries, bad peer events. -Snapshot: October 2025. +Snapshot: November 2025. 
| Client | Dashboard snapshot | @@ -40,7 +40,7 @@ Ethrex exposes the metrics API by default when the CLI `--metrics` flag is enabl | Peer health | Yes | Yes | No | | Block & payload pipeline | Yes | Yes | Yes (latency + throughput) | | Transaction pool | Yes (basic) | Yes | Partial (counters, no panels) | -| Engine API & RPC | Partial (metrics exist, limited panels) | Yes | No | +| Engine API & RPC | Partial (metrics exist, limited panels) | Yes | Partial (per-method rate/latency) | | State & storage | Yes | Yes | Partial (datadir size; no pruning) | | Process & host health | Yes | Yes | Yes (node exporter + process) | | Error & anomaly counters | Yes | Yes | No | @@ -48,7 +48,7 @@ Ethrex exposes the metrics API by default when the CLI `--metrics` flag is enabl - **Block execution pipeline** - Gauges exposed in `ethrex_metrics::metrics_blocks`: `gas_limit`, `gas_used`, `gigagas`, `block_number`, `head_height`, `execution_ms`, `merkle_ms`, `store_ms`, `transaction_count`, plus block-building focused gauges that need to be reviewed first (`gigagas_block_building`, `block_building_ms`, `block_building_base_fee`). - Updated on the hot path in `crates/blockchain/blockchain.rs`, `crates/blockchain/payload.rs`, and `crates/blockchain/fork_choice.rs`; block-building throughput updates live in `crates/blockchain/payload.rs`. - - Exposed via `/metrics` when the `metrics` feature or CLI flag is enabled and visualised in Grafana panels "Gas Used %", "Ggas/s", "Ggas/s by Block", "Block Height", and "Block Execution Breakdown" inside `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`. + - Exposed via `/metrics` when the `metrics` feature or CLI flag is enabled and visualised in Grafana panels "Gas Used %", "Ggas/s", "Ggas/s by Block", "Block Height", and the expanded "Block Execution Breakdown" row (pie, diff %, deaggregated by block) inside `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`. - **Transaction pipeline** - `crates/blockchain/metrics/metrics_transactions.rs` defines counters and gauges: `transactions_tracker{tx_type}`, `transaction_errors_count{tx_error}`, `transactions_total`, `mempool_tx_count{type}`, `transactions_per_second`. - L1 currently uses the per-type success/error counters via `metrics!(METRICS_TX...)` in `crates/blockchain/payload.rs`. Aggregate setters (`set_tx_count`, `set_mempool_tx_count`, `set_transactions_per_second`) are only invoked from the L2 sequencer (`crates/l2/sequencer/metrics.rs` and `crates/l2/sequencer/block_producer.rs`), so there is no TPS gauge driven by the execution client today. @@ -59,6 +59,9 @@ Ethrex exposes the metrics API by default when the CLI `--metrics` flag is enabl - **Tracing-driven profiling** - `crates/blockchain/metrics/profiling.rs` installs a `FunctionProfilingLayer` whenever the CLI `--metrics` flag is set. Histograms (`function_duration_seconds{function_name}`) capture tracing span durations across block processing. - The current "Block Execution Breakdown" pie panel pulls straight from the gauges in `METRICS_BLOCKS` (`execution_ms`, `merkle_ms`, `store_ms`). The profiling histograms are scraped by Prometheus but are not charted in Grafana yet. +- **Engine & RPC telemetry** + - `function_duration_seconds_*{namespace="engine"|"rpc"}` histograms are emitted by the same profiling layer. + - Grafana now charts per-method request rates and range-based latency averages for both Engine API and JSON-RPC namespaces via the "Engine API" and "RPC API" rows. 
 - **Metrics API**
   - `crates/blockchain/metrics/api.rs` exposes `/metrics` and `/health`; orchestration defined in `cmd/ethrex/initializers.rs` ensures the Axum server starts alongside the node when metrics are enabled.
   - The provisioning stack (docker-compose, Makefile targets) ships Prometheus and Grafana wiring, so any new metric family automatically appears in the scrape.
@@ -79,7 +82,7 @@ Before addressing the gaps listed below, we should also consider some general im
 | Peer health | `net_peerCount` RPC endpoint exists. | No Prometheus gauges for active peers, peer limits, snap-capable availability, or handshake failures; dashboard lacks a networking row. |
 | Block & payload pipeline | `METRICS_BLOCKS` tracks gas throughput and execution stage timings; `transaction_count` is exported but not visualised yet. | Add p50/p95 histograms for execution stages, block import success/failure counters, and an L1-driven TPS gauge so operators can read execution throughput without relying on L2 metrics. |
 | Transaction pool | Success/error counters per tx type emitted from `crates/blockchain/payload.rs`. | No exported pending depth, blob/regular split, drop reasons, or gossip throughput; aggregates exist only in L2 (`crates/l2/sequencer/metrics.rs`). |
-| Engine API & RPC | None published. | Instrument `newPayload`, `forkChoiceUpdated`, `getPayload` handlers with histograms/counters and wrap JSON-RPC handlers (`crates/networking/rpc`) with per-method rate/latency/error metrics, then chart them in Grafana. |
+| Engine API & RPC | Per-method request rate, latency (range-based + 18 s lookback) covering `namespace="engine"` and `namespace="rpc"` metrics. | Deepen the error taxonomy (error rates and distinguishing failure reasons), add payload build latency distributions, and baseline alert thresholds. |
 | State & storage | Only `datadir_size_bytes` today. | Export healing/download progress, snapshot sync %, DB read/write throughput, pruning/backfill counters (we need to check what makes sense here), and cache hit/miss ratios. |
 | Process & host health | Process collector + `datadir_size_bytes`; node_exporter covers CPU/RSS/disk. | Add cache pressure indicators (fd saturation, async task backlog) and ensure dashboards surface alert thresholds. |
 | Error & anomaly counters | None published. | Add Prometheus counters for failed block imports, reorg depth, RPC errors, Engine API retries, sync failures, and wire alerting. |
@@ -89,7 +92,7 @@ Before addressing the gaps listed below, we should also consider some general im
 2. Implement sync & peer metrics (best-peer lag, stage progress) and add corresponding Grafana row.
 3. Surface txpool metrics by wiring existing counters and charting them.
 4. Add the metrics relying on `ethereum-metrics-exporter` into the existing metrics, and avoid our dashboard dependence on it.
-5. Instrument Engine API handlers with histograms/counters.
+5. Extend Engine API / JSON-RPC metrics with richer error taxonomy and payload construction latency distributions.
 6. State and Storage metrics, specially related to snapsync, pruning, db and cache.
 7. Process health improvements, specially related to read/write latencies and probably tokio tasks.
 8. Review block building metrics.
diff --git a/docs/l1/running/monitoring.md b/docs/l1/running/monitoring.md
index a32595b0dc4..2156d6e4fbc 100644
--- a/docs/l1/running/monitoring.md
+++ b/docs/l1/running/monitoring.md
@@ -1,6 +1,6 @@
 # Monitoring and Metrics
 
-Ethrex exposes metrics in Prometheus format on port `9090` by default. The easiest way to monitor your node is to use the provided Docker Compose stack, which includes Prometheus and Grafana preconfigured.
+Ethrex exposes metrics in Prometheus format on port `9090` by default, but the easiest way to monitor your node is to use the provided Docker Compose stack, which includes Prometheus and Grafana preconfigured. For that stack we currently use port `3701`; this will match the default in the future, but for now, when running the containers, ethrex metrics are expected to be exposed on port `3701`.
 
 ## Quickstart: Monitoring Stack with Docker Compose
 
@@ -13,9 +13,21 @@ Ethrex exposes metrics in Prometheus format on port `9090` by default. The easie
 2. **Start the monitoring stack:**
 
    ```sh
+   # Optional: if you are updating from a previous version, stop the Docker Compose stack first.
+   # docker compose -f docker-compose-metrics.yaml -f docker-compose-metrics-l1.overrides.yaml down
   docker compose -f docker-compose-metrics.yaml -f docker-compose-metrics-l1.overrides.yaml up -d
   ```
 
+_**Note:** If you are updating from a previous ethrex version, you might want to restart the Docker containers to make sure the latest provisioned configurations are applied._
+
+3. **Run ethrex with metrics enabled:**
+
+   Make sure to start ethrex with the `--metrics` flag and set the port to `3701`:
+
+   ```sh
+   ethrex --authrpc.jwtsecret ./secrets/jwt.hex --network hoodi --metrics --metrics.port 3701
+   ```
+
 This will launch Prometheus and Grafana, already set up to scrape ethrex metrics.
 
 ## Accessing Metrics and Dashboards
 
@@ -26,12 +38,20 @@ This will launch Prometheus and Grafana, already set up to scrape ethrex metrics
 - Prometheus is preconfigured as a data source
 - Example dashboards are included in the repo
 
-Metrics from ethrex will be available at `http://localhost:9090/metrics` in Prometheus format.
+Metrics from ethrex will be available at `http://localhost:3701/metrics` in Prometheus format if you followed [step 3](#run-ethrex-with-metrics-enabled).
+
+For detailed information on the provided Grafana dashboards, see our [L1 Dashboard document](../../developers/l1/dashboards.md).
 
 ## Custom Configuration
 
 Your ethrex setup may differ from the default configuration. Check your endpoints at `provisioning/prometheus/prometheus_l1_sync_docker.yaml`.
 
+If you have a centralized Prometheus or Grafana setup, you can also adapt the provided configuration files to fit your environment, or even stop the Docker containers that run Prometheus and/or Grafana, leaving only the additional `ethereum-metrics-exporter` running alongside ethrex to export the metrics to your existing monitoring stack.
+
+```sh
+docker compose -f docker-compose-metrics.yaml -f docker-compose-metrics-l1.overrides.yaml up -d ethereum-metrics-exporter
+```
+
 ---
 
 For manual setup or more details, see the [Prometheus documentation](https://prometheus.io/docs/introduction/overview/) and [Grafana documentation](https://grafana.com/docs/).
diff --git a/metrics/provisioning/grafana/datasources/prometheus.yaml b/metrics/provisioning/grafana/datasources/prometheus.yaml
index e14e02af74e..eda24be3844 100644
--- a/metrics/provisioning/grafana/datasources/prometheus.yaml
+++ b/metrics/provisioning/grafana/datasources/prometheus.yaml
@@ -9,7 +9,8 @@ datasources:
     isDefault: true
     version: 2
     jsonData:
-      timeInterval: 10s
+      # Should match the Prometheus scrape interval (5s).
+      timeInterval: 5s
      httpMethod: POST
    readOnly: false
    editable: true