72 changes: 70 additions & 2 deletions docs/developers/l1/dashboards.md
@@ -1,4 +1,4 @@
# Ethrex L1 Performance Dashboard (Oct 2025)
# Ethrex L1 Performance Dashboard (Nov 2025)

Our Grafana dashboard provides a comprehensive overview of key metrics to help developers and operators ensure optimal performance and reliability of their Ethrex nodes. The only configured datasource today is `prometheus`, and the `job` variable defaults to `ethrex L1`, which is the job configured by default in our provisioning.
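
If you want to double-check that the `job` label matches what Prometheus is actually scraping, you can query the server directly. A minimal sketch, assuming Prometheus is reachable on `localhost:9090` (adjust host and port to your provisioning):

```bash
# List the scrape targets Prometheus knows about and look for the ethrex job.
curl -s http://localhost:9090/api/v1/targets | grep -o '"job":"ethrex L1"'

# Spot-check that samples are coming in under that job label.
curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up{job="ethrex L1"}'
```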

@@ -67,10 +67,78 @@ _**Limitations**: This panel has the same limitations as the "Ggas/s by Block" p

## Block execution breakdown

This row repeats a pie chart for each instance showing how execution time splits between storage reads, account reads, and non-database work so you can confirm performance tuning effects.
Collapsed row that surfaces instrumentation from the `add_block_pipeline` and `execute_block_pipeline` timer series so you can understand how each instance spends time when processing blocks. Every panel repeats per instance vertically to facilitate comparisons.

![Block Execution Breakdown](img/block_execution_breakdown.png)
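
To see which timer series feed this row on a given node, you can inspect the raw exposition endpoint directly. A rough sketch, assuming the node exports metrics on port `3701` as in the default provisioning (the exact series names can differ between releases):

```bash
# Dump the raw /metrics output and keep only the pipeline timer families.
curl -s http://localhost:3701/metrics | grep -E 'add_block_pipeline|execute_block_pipeline'
```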

### Block Execution Breakdown pie
Pie chart showing how execution time splits between storage reads, account reads, and non-database work so you can identify the bottlenecks outside of execution itself.

![Block Execution Breakdown pie](img/block_execution_breakdown_pie.png)

### Execution vs Merkleization Diff %
Tracks how much longer we spend merkleizing versus running the execution phase inside `execute_block_pipeline`. Values above zero mean merkleization dominates; negative readings flag when pure execution becomes the bottleneck (which should be extremely rare). Because both run concurrently and merkleization depends on execution, the actual `execute_block_pipeline` time is essentially the maximum of the two.

![Execution vs Merkleization Diff %](img/execution_vs_merkleization_diff.png)
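
Conceptually the panel is just the relative difference between the two stage timers. A hedged PromQL sketch of the idea (same `localhost:9090` assumption as above), written against the `execution_ms` and `merkle_ms` gauges from our metrics inventory; the dashboard's real query reads the `execute_block_pipeline` timers, so treat the series names as illustrative:

```bash
# Relative difference in percent: positive when merkleization takes longer than execution.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=100 * (merkle_ms - execution_ms) / execution_ms'
```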

### Block Execution Deaggregated by Block
Plots execution-stage timers (storage/account reads, execution without reads, merkleization) against the block number once all selected instances report the same head.

![Block Execution Deaggregated by Block](img/block_execution_deaggregated_by_block.png)

_**Limitations**: This panel has the same limitations as the other `by block` panels, as it relies on the same logic to align blocks across instances. It can look odd during multi-slot reorgs._

## Engine API

Collapsed row that surfaces the `namespace="engine"` Prometheus timers so you can keep an eye on EL <-> CL Engine API health. Each panel repeats per instance so you can compare behaviour across nodes.

![Engine API row](img/engine_api_row.png)

### Engine Request Rate by Method
Shows how many Engine API calls per second we process, split by JSON-RPC method and averaged across the currently selected dashboard range.

![Engine Request Rate by Method](img/engine_request_rate_by_method.png)
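
Under the hood this is a per-second rate over the duration-histogram count series. An illustrative query, assuming the `function_duration_seconds` histogram labelled with `namespace="engine"` and `function_name` described in our metrics notes:

```bash
# Engine API requests per second per method over the last 5 minutes.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (function_name) (rate(function_duration_seconds_count{namespace="engine"}[5m]))'
```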

### Engine Latency by Methods (Avg Duration)
Bar gauge of the historical average latency per Engine method over the selected time range.

![Engine Latency by Methods](img/engine_latency_by_methods.png)
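
The range-based average is the usual sum-over-count ratio of the same histogram; Grafana substitutes the dashboard range where this sketch uses a fixed one-hour window:

```bash
# Average latency per Engine method over the last hour, in seconds.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (function_name) (increase(function_duration_seconds_sum{namespace="engine"}[1h])) / sum by (function_name) (increase(function_duration_seconds_count{namespace="engine"}[1h]))'
```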

### Engine Latency by Method
Live timeseries intended to correlate with per-block execution time by showing real-time latency per Engine method over an 18 s lookback window.

![Engine Latency by Method](img/engine_latency_by_method.png)
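
The live variant computes the same ratio with `rate()` over the short lookback, which is what ties it to roughly one block's worth of samples (the scrape interval has to be well under 18 s for this to return data):

```bash
# Near-real-time latency per Engine method using the 18 s lookback window.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (function_name) (rate(function_duration_seconds_sum{namespace="engine"}[18s])) / sum by (function_name) (rate(function_duration_seconds_count{namespace="engine"}[18s]))'
```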

_**Limitations**: The aggregated panels pull averages across the current dashboard range, so very short ranges can look noisy while long ranges may smooth out brief incidents. The live latency chart still relies on an 18 s window to calculate the average, which should track per-block execution closely, though some intermediate measurements can be lost._

## RPC API

Another collapsed row focused on the public JSON-RPC surface (`namespace="rpc"`). Expand it when you need to diagnose endpoint hotspots or validate rate limiting. Each panel repeats per instance so you can compare behaviour across nodes.

![RPC API row](img/rpc_api_row.png)

### RPC Time per Method
Pie chart that shows where RPC time is spent across methods over the selected range. Quickly surfaces which endpoints dominate total processing time.

![RPC Time per Method](img/rpc_time_per_method.png)
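
The shares come from the increase of the per-method duration sums over the selected range, which the pie then normalises. An illustrative query, with the same caveats about exact series names:

```bash
# Total time spent per RPC method over the last hour, in seconds.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (function_name) (increase(function_duration_seconds_sum{namespace="rpc"}[1h]))'
```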

### Slowest RPC Methods
Table listing the highest average-latency methods over the active dashboard range. Used to prioritise optimisation or caching efforts.

![Slowest RPC Methods](img/slowest_rpc_methods.png)
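
The table is effectively a top-k over the average-latency ratio. A hedged sketch:

```bash
# Five slowest RPC methods by average latency over the last hour.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, sum by (function_name) (increase(function_duration_seconds_sum{namespace="rpc"}[1h])) / sum by (function_name) (increase(function_duration_seconds_count{namespace="rpc"}[1h])))'
```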

### RPC Request Rate by Method
Timeseries showing request throughput broken down by method, averaged across the selected range.

![RPC Request Rate by Method](img/rpc_request_rate_by_method.png)

### RPC Latency by Methods
Live timeseries intended to correlate with per-block execution time by showing real-time latency per RPC method over an 18 s lookback window.

![RPC Latency by Methods](img/rpc_latency_by_methods.png)

_**Limitations**: The RPC latency views inherit the same windowing caveats as the Engine charts: averages use the dashboard time range while the live chart relies on an 18 s window._

## Process and server info

Row panels showing process-level and host-level metrics to help you monitor resource usage and spot potential issues.
Binary file modified docs/developers/l1/img/block_execution_breakdown.png
Binary file added docs/developers/l1/img/engine_api_row.png
Binary file added docs/developers/l1/img/rpc_api_row.png
Binary file added docs/developers/l1/img/rpc_time_per_method.png
Binary file added docs/developers/l1/img/slowest_rpc_methods.png
34 changes: 8 additions & 26 deletions docs/developers/l1/metrics.md
@@ -1,48 +1,30 @@
# Metrics

## Ethereum Metrics Exporter

We use the [Ethereum Metrics Exporter](https://github.com/ethpandaops/ethereum-metrics-exporter), a Prometheus metrics exporter for Ethereum execution and consensus nodes, to gather metrics during syncing for L1. The exporter uses the prometheus data source to create a Grafana dashboard and display the metrics. For the syncing to work there must be a consensus node running along with the execution node.

Currently we have two make targets to easily start an execution node and a consensus node on either hoodi or holesky, and display the syncing metrics. In both cases we use a lighthouse consensus node.

### Quickstart guide
## Quickstart
For a high-level quickstart guide, please refer to [Monitoring](../../l1/running/monitoring.md).

Make sure you have your docker daemon running.

- **Code Location**: The targets are defined in `tooling/sync/Makefile`.
- **How to Run**:

```bash
# Navigate to tooling/sync directory
cd tooling/sync

# Run target for hoodi
make start-hoodi-metrics-docker
## Ethereum Metrics Exporter

# Run target for holesky
make start-holesky-metrics-docker
```
We use the [Ethereum Metrics Exporter](https://github.com/ethpandaops/ethereum-metrics-exporter), a Prometheus metrics exporter for Ethereum execution and consensus nodes, as an additional tool to gather metrics during L1 execution. The exporter uses the Prometheus data source to create a Grafana dashboard and display the metrics.

To see the dashboards go to [http://localhost:3001](http://localhost:3001). Use “admin” for user and password. Select the Dashboards menu and go to Ethereum Metrics Exporter (Single) to see the exported metrics.
## L1 Metrics Dashboard

To see the prometheus exported metrics and its respective requests with more detail in case you need to debug go to [http://localhost:9093/metrics](http://localhost:9093/metrics).
We provide a pre-configured Grafana dashboard to monitor Ethrex L1 nodes. For detailed information on the provided dashboard, see our [L1 Dashboard document](./dashboards.md).

### Running the execution node on other networks with metrics enabled

A `docker-compose` is used to bundle prometheus and grafana services, the `*overrides` files define the ports and mounts the prometheus' configuration file.
As shown in [Monitoring](../../l1/running/monitoring.md), `docker-compose` is used to bundle the Prometheus and Grafana services; the `*overrides` files define the ports and mount Prometheus' configuration file.
If a new dashboard is designed, it can be mounted only in that `*overrides` file.
A consensus node must be running for the syncing to work.

To run the execution node on any network with metrics, the next steps should be followed:
1. Build the `ethrex` binary for the network you want (see node options in [CLI Commands](../../CLI.md#cli-commands)) with the `metrics` feature enabled.
2. Enable metrics by using the `--metrics` flag when starting the node.
3. Set the `--metrics.port` cli arg of the ethrex binary to match the port defined in `metrics/provisioning/prometheus/prometheus_l1_sync_docker.yaml`
3. Set the `--metrics.port` CLI arg of the ethrex binary to match the port defined in `metrics/provisioning/prometheus/prometheus_l1_sync_docker.yaml`, which is currently `3701` (see the example invocation after this list).
4. Run the docker containers:

```bash
cd metrics

docker compose -f docker-compose-metrics.yaml -f docker-compose-metrics-l1.overrides.yaml up
```
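
For reference, a hedged example of starting the node so the metrics port lines up with the Prometheus scrape config; the binary path and network flag are placeholders — only `--metrics` and `--metrics.port` are the flags discussed above:

```bash
# Illustrative invocation: expose metrics on the port Prometheus scrapes (3701 today).
./ethrex --network hoodi --metrics --metrics.port 3701
```
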
For more details on running a sync, see `tooling/sync/readme.md`.
19 changes: 11 additions & 8 deletions docs/internal/l1/metrics_coverage_gap_analysis.md
@@ -4,9 +4,9 @@
This note tracks the current state of metrics and dashboard observability for the L1 and highlights the gaps against a cross-client baseline. It covers runtime metrics exposed through our crates, the existing Grafana "Ethrex L1 - Perf" dashboard, and supporting exporters already wired in provisioning.

### At a glance
- **Covered today**: Block execution timings, gas throughput, and host/process health are exported and graphed through `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`.
- **Missing**: Sync/peer awareness, txpool depth, Engine API visibility, JSON-RPC health, and state/storage IO metrics are absent or only logged.
- **Near-term focus**: Ship sync & peer gauges, surface txpool counters we already emit, and extend instrumentation around Engine API and storage before adding alert rules.
- **Covered today**: Block execution timings, detailed execution breakdown, Engine API and JSON-RPC method telemetry, and host/process health are exported and graphed through `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`. The refreshed [L1 Dashboard doc](./dashboards.md) has screenshots and panel descriptions.
- **Missing**: Sync/peer awareness, txpool depth, storage IO metrics, and richer error taxonomy are absent or only logged.
- **Near-term focus**: Ship sync & peer gauges, surface txpool counters we already emit, extend storage instrumentation, and harden alerting before widening coverage further.

## Baseline We Compare Against
The gap analysis below uses a cross-client checklist we gathered after looking at Geth and Nethermind metrics and dashboard setups; this works as a baseline of "must-have" coverage for execution clients. The key categories are:
@@ -19,7 +19,7 @@ The gap analysis below uses a cross-client checklist we gathered after looking a
- **Process & host health**: CPU, memory, FDs, uptime, disk headroom (usually covered by node_exporter but treated as must-have).
- **Error & anomaly counters**: explicit counters for reorgs, failed imports, sync retries, bad peer events.

Snapshot: October 2025.
Snapshot: November 2025.


| Client | Dashboard snapshot |
@@ -40,15 +40,15 @@ Ethrex exposes the metrics API by default when the CLI `--metrics` flag is enabl
| Peer health | Yes | Yes | No |
| Block & payload pipeline | Yes | Yes | Yes (latency + throughput) |
| Transaction pool | Yes (basic) | Yes | Partial (counters, no panels) |
| Engine API & RPC | Partial (metrics exist, limited panels) | Yes | No |
| Engine API & RPC | Partial (metrics exist, limited panels) | Yes | Partial (per-method rate/latency) |
| State & storage | Yes | Yes | Partial (datadir size; no pruning) |
| Process & host health | Yes | Yes | Yes (node exporter + process) |
| Error & anomaly counters | Yes | Yes | No |

- **Block execution pipeline**
- Gauges exposed in `ethrex_metrics::metrics_blocks`: `gas_limit`, `gas_used`, `gigagas`, `block_number`, `head_height`, `execution_ms`, `merkle_ms`, `store_ms`, `transaction_count`, plus block-building focused gauges that need to be reviewed first (`gigagas_block_building`, `block_building_ms`, `block_building_base_fee`).
- Updated on the hot path in `crates/blockchain/blockchain.rs`, `crates/blockchain/payload.rs`, and `crates/blockchain/fork_choice.rs`; block-building throughput updates live in `crates/blockchain/payload.rs`.
- Exposed via `/metrics` when the `metrics` feature or CLI flag is enabled and visualised in Grafana panels "Gas Used %", "Ggas/s", "Ggas/s by Block", "Block Height", and "Block Execution Breakdown" inside `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`.
- Exposed via `/metrics` when the `metrics` feature or CLI flag is enabled and visualised in Grafana panels "Gas Used %", "Ggas/s", "Ggas/s by Block", "Block Height", and the expanded "Block Execution Breakdown" row (pie, diff %, deaggregated by block) inside `metrics/provisioning/grafana/dashboards/common_dashboards/ethrex_l1_perf.json`.
- **Transaction pipeline**
- `crates/blockchain/metrics/metrics_transactions.rs` defines counters and gauges: `transactions_tracker{tx_type}`, `transaction_errors_count{tx_error}`, `transactions_total`, `mempool_tx_count{type}`, `transactions_per_second`.
- L1 currently uses the per-type success/error counters via `metrics!(METRICS_TX...)` in `crates/blockchain/payload.rs`. Aggregate setters (`set_tx_count`, `set_mempool_tx_count`, `set_transactions_per_second`) are only invoked from the L2 sequencer (`crates/l2/sequencer/metrics.rs` and `crates/l2/sequencer/block_producer.rs`), so there is no TPS gauge driven by the execution client today.
@@ -59,6 +59,9 @@ Ethrex exposes the metrics API by default when the CLI `--metrics` flag is enabl
- **Tracing-driven profiling**
- `crates/blockchain/metrics/profiling.rs` installs a `FunctionProfilingLayer` whenever the CLI `--metrics` flag is set. Histograms (`function_duration_seconds{function_name}`) capture tracing span durations across block processing.
- The current "Block Execution Breakdown" pie panel pulls straight from the gauges in `METRICS_BLOCKS` (`execution_ms`, `merkle_ms`, `store_ms`). The profiling histograms are scraped by Prometheus but are not charted in Grafana yet.
- **Engine & RPC telemetry**
- `function_duration_seconds_*{namespace="engine"|"rpc"}` histograms are emitted by the same profiling layer.
- Grafana now charts per-method request rates and range-based latency averages for both Engine API and JSON-RPC namespaces via the "Engine API" and "RPC API" rows (a percentile query sketch over these histograms follows this list).
- **Metrics API**
- `crates/blockchain/metrics/api.rs` exposes `/metrics` and `/health`; orchestration defined in `cmd/ethrex/initializers.rs` ensures the Axum server starts alongside the node when metrics are enabled.
- The provisioning stack (docker-compose, Makefile targets) ships Prometheus and Grafana wiring, so any new metric family automatically appears in the scrape.
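
As a pointer for the charting gap above, a percentile view over the same profiling histograms is a one-liner once they are wired into Grafana. A hedged sketch, assuming Prometheus is reachable on `localhost:9090`:

```bash
# p95 latency per Engine method from the profiling histograms (not charted yet).
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (function_name, le) (rate(function_duration_seconds_bucket{namespace="engine"}[5m])))'
```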
@@ -79,7 +82,7 @@ Before addressing the gaps listed below, we should also consider some general im
| Peer health | `net_peerCount` RPC endpoint exists. | No Prometheus gauges for active peers, peer limits, snap-capable availability, or handshake failures; dashboard lacks a networking row. |
| Block & payload pipeline | `METRICS_BLOCKS` tracks gas throughput and execution stage timings; `transaction_count` is exported but not visualised yet. | Add p50/p95 histograms for execution stages, block import success/failure counters, and an L1-driven TPS gauge so operators can read execution throughput without relying on L2 metrics. |
| Transaction pool | Success/error counters per tx type emitted from `crates/blockchain/payload.rs`. | No exported pending depth, blob/regular split, drop reasons, or gossip throughput; aggregates exist only in L2 (`crates/l2/sequencer/metrics.rs`). |
| Engine API & RPC | None published. | Instrument `newPayload`, `forkChoiceUpdated`, `getPayload` handlers with histograms/counters and wrap JSON-RPC handlers (`crates/networking/rpc`) with per-method rate/latency/error metrics, then chart them in Grafana. |
| Engine API & RPC | Per-method request rate and latency (range-based + 18 s lookback) covering `namespace="engine"` and `namespace="rpc"` metrics. | Deepen the error taxonomy (error rates and distinguishing failure reasons), add payload build latency distributions, and baseline alert thresholds. |
| State & storage | Only `datadir_size_bytes` today. | Export healing/download progress, snapshot sync %, DB read/write throughput, pruning/backfill counters (we need to check what makes sense here), and cache hit/miss ratios. |
| Process & host health | Process collector + `datadir_size_bytes`; node_exporter covers CPU/RSS/disk. | Add cache pressure indicators (fd saturation, async task backlog) and ensure dashboards surface alert thresholds. |
| Error & anomaly counters | None published. | Add Prometheus counters for failed block imports, reorg depth, RPC errors, Engine API retries, sync failures, and wire alerting. |
@@ -89,7 +92,7 @@ Before addressing the gaps listed below, we should also consider some general im
2. Implement sync & peer metrics (best-peer lag, stage progress) and add corresponding Grafana row.
3. Surface txpool metrics by wiring existing counters and charting them.
4. Add the metrics relying on `ethereum-metrics-exporter` into the existing metrics, and avoid our dashboard dependence on it.
5. Instrument Engine API handlers with histograms/counters.
5. Extend Engine API / JSON-RPC metrics with richer error taxonomy and payload construction latency distributions.
6. State and storage metrics, especially related to snap sync, pruning, DB, and cache.
7. Process health improvements, especially related to read/write latencies and probably tokio tasks.
8. Review block building metrics.