Added janitor task for automatic deposit recovery with Prometheus metrics by sigurpol · Pull Request #1095 · paritytech/polkadot-staking-miner

sigurpol · 2025-06-19T14:20:34Z

Implements automatic cleanup of old discarded submissions to reclaim deposits.
Close #980.

Features:

Triggers when the election round increments (runs once per new round)
Scans at most the last 5 rounds for old submissions (normally only the previous round needs to be cleared).
Calls clear_old_round_data() to recover deposits and clean storage
Non-blocking integration with existing mining operations
Comprehensive error handling (critical vs recoverable errors)
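
The round-based trigger and the bounded scan window described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from this PR: `RoundTracker`, `rounds_to_scan` and `MAX_ROUNDS_TO_SCAN` are hypothetical names, and treating round 0 as scannable is an assumption.

```rust
/// Hypothetical cap on how many past rounds are scanned (the PR text says 5).
const MAX_ROUNDS_TO_SCAN: u32 = 5;

/// Remembers the last observed round so the janitor fires once per new round.
#[derive(Default)]
struct RoundTracker {
    last_seen: Option<u32>,
}

impl RoundTracker {
    /// Returns `true` only for the first observation of a higher round number.
    /// The very first observation never triggers, since there is nothing to compare to.
    fn observe(&mut self, round: u32) -> bool {
        let incremented = matches!(self.last_seen, Some(prev) if round > prev);
        self.last_seen = Some(round);
        incremented
    }
}

/// Rounds to inspect, newest first: the previous round, then up to four older ones.
fn rounds_to_scan(current_round: u32) -> Vec<u32> {
    let oldest = current_round.saturating_sub(MAX_ROUNDS_TO_SCAN);
    (oldest..current_round).rev().collect()
}
```

Scanning newest first matches the log line "scanning rounds from 9" for current round 10, and `saturating_sub` keeps the window valid during the first few rounds of a chain.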

Architecture

Introduce JanitorMessage enum and janitor_task for deposit recovery
Use dedicated bounded channel for janitor communication
Update listener to send janitor ticks to janitor task
Ensure mining and janitor operations are independent and non-blocking

┌──────────────────────────────────────────────────────────────────────────┐
│   ┌─────────────┐                      ┌─────────────┐            ┌─────────────┐
└──▶│ Listener    │                      │   Miner     │            │ Blockchain  │
    │             │  Snapshot/Signed     │             │            │             │
    │ ┌─────────┐ │ ────────────────────▶│ ┌─────────┐ │ (solutions)│             │
    │ │ Stream  │ │  (mining work)       │ │ Mining  │ │───────────▶│             │
    │ └─────────┘ │                      │ └─────────┘ │            │             │
    │      │      │  Round++             │ ┌─────────┐ │            │             │
    │      ▼      │ ────────────────────▶│ │ Clear   │ │            │             │
    │ ┌─────────┐ │                      │ │ Snapshot│ │            │             │
    │ │ Phase   │ │                      │ └─────────┘ │            │             │
    │ │ Check   │ │  Round++             └─────────────┘            │             │
    │ └─────────┘ │ ────────────────────▶┌─────────────┐            │             │
    │             │  (deposit cleanup)   │  Janitor    │ (cleanup)  │             │
    │             │                      │ ┌─────────┐ │───────────▶│             │
    │             │                      │ │ Cleanup │ │            │             │
    │             │                      │ └─────────┘ │            │             │
    └─────────────┘                      └─────────────┘            └─────────────┘
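
The listener-to-janitor hand-off in the diagram can be approximated with a bounded channel and a non-blocking send. The sketch below uses `std::sync::mpsc::sync_channel` and a plain thread so that it is self-contained; the actual miner is asynchronous and most likely uses a different channel type. The `Tick` variant, its field, and the function signatures are assumptions made for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::thread;

/// Message from the listener to the janitor (variant and field are assumed).
#[derive(Debug, PartialEq)]
enum JanitorMessage {
    /// A new round started; carries the current round number.
    Tick { current_round: u32 },
}

/// Listener side: never blocks. If the bounded channel is full (janitor still
/// busy) or closed, the tick is dropped and `false` is returned.
fn send_janitor_tick(tx: &SyncSender<JanitorMessage>, current_round: u32) -> bool {
    match tx.try_send(JanitorMessage::Tick { current_round }) {
        Ok(()) => true,
        Err(TrySendError::Full(_)) | Err(TrySendError::Disconnected(_)) => false,
    }
}

/// Janitor side: drains ticks until every sender is dropped and returns the
/// rounds it handled. A real task would scan and clear old rounds here.
fn janitor_task(rx: Receiver<JanitorMessage>) -> Vec<u32> {
    let mut handled = Vec::new();
    for msg in rx {
        match msg {
            JanitorMessage::Tick { current_round } => handled.push(current_round),
        }
    }
    handled
}

/// Wires both halves together with a channel of capacity 1.
fn run_demo(rounds: &[u32]) -> Vec<u32> {
    let (tx, rx) = sync_channel(1);
    let janitor = thread::spawn(move || janitor_task(rx));
    for &round in rounds {
        send_janitor_tick(&tx, round);
    }
    drop(tx); // closing the channel lets the janitor loop end
    janitor.join().expect("janitor thread panicked")
}
```

Because the listener uses `try_send`, a slow cleanup (36 s in the logs in this PR) cannot stall block processing. The trade-off in this sketch is that a tick may be dropped while the janitor is busy, which a multi-round scan on the next tick would compensate for.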

Prometheus Metrics Added:

Counters:

staking_miner_janitor_cleanup_success_total: Successful cleanup operations
staking_miner_janitor_cleanup_failures_total: Failed cleanup operations

Gauges:

staking_miner_janitor_cleanup_duration_ms: Time taken for last cleanup
staking_miner_janitor_old_submissions_found: Old submissions discovered
staking_miner_janitor_old_submissions_cleared: Submissions successfully cleared
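
To make the counter/gauge distinction concrete, here is a standard-library stand-in for the five metrics listed above. It does not use the Prometheus client that the miner relies on; `JanitorMetrics` and `record_pass` are hypothetical names. Whether the real counters increment once per cleanup pass or once per cleared submission is not stated here, so the per-pass behavior below is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// In-memory stand-in for the janitor metrics (field names mirror the metric suffixes).
#[derive(Default)]
struct JanitorMetrics {
    cleanup_success_total: AtomicU64,   // counter: only ever increases
    cleanup_failures_total: AtomicU64,  // counter: only ever increases
    cleanup_duration_ms: AtomicU64,     // gauge: overwritten by each pass
    old_submissions_found: AtomicU64,   // gauge: overwritten by each pass
    old_submissions_cleared: AtomicU64, // gauge: overwritten by each pass
}

impl JanitorMetrics {
    /// Records the outcome of one cleanup pass (assumed granularity).
    fn record_pass(&self, found: u64, cleared: u64, elapsed: Duration, ok: bool) {
        let counter = if ok {
            &self.cleanup_success_total
        } else {
            &self.cleanup_failures_total
        };
        counter.fetch_add(1, Ordering::Relaxed);
        self.cleanup_duration_ms
            .store(elapsed.as_millis() as u64, Ordering::Relaxed);
        self.old_submissions_found.store(found, Ordering::Relaxed);
        self.old_submissions_cleared.store(cleared, Ordering::Relaxed);
    }
}
```

Since the gauges hold only the most recent pass, rates over time are better derived from the counters.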

Key Metrics:

Success rate: success_total / (success_total + failures_total)
Performance: avg(cleanup_duration_ms)
Activity: old_submissions_cleared shows actual deposit recovery

Example Prometheus queries:

staking_miner_janitor_cleanup_success_total / (staking_miner_janitor_cleanup_success_total + staking_miner_janitor_cleanup_failures_total)

increase(staking_miner_janitor_cleanup_success_total[24h])

staking_miner_janitor_old_submissions_cleared / staking_miner_janitor_old_submissions_found
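
The two ratio queries have no meaningful value while their denominator is still zero, for example before the first cleanup has run. If the same ratios are computed in code (say, for a health check), that case is worth handling explicitly. The helper names below are hypothetical:

```rust
/// `part / whole` as a fraction, or `None` when `whole` is zero.
fn ratio(part: u64, whole: u64) -> Option<f64> {
    if whole == 0 {
        None
    } else {
        Some(part as f64 / whole as f64)
    }
}

/// Success rate as defined under "Key Metrics":
/// success_total / (success_total + failures_total).
fn success_rate(success_total: u64, failures_total: u64) -> Option<f64> {
    let attempts = success_total.checked_add(failures_total)?;
    ratio(success_total, attempts)
}
```

For instance, 3 successes and 1 failure give `Some(0.75)`, while a freshly started miner with both counters at zero gives `None` rather than a misleading 0% or 100%.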

Logs

Example logs from a run in which a previous submission is cleared:

2025-06-20T20:09:03.221683Z DEBUG polkadot-staking-miner: Detected round increment 9 -> 10    
2025-06-20T20:09:03.221707Z TRACE polkadot-staking-miner: Sent janitor tick for round 10    
2025-06-20T20:09:03.221711Z DEBUG polkadot-staking-miner: Round increment in Off phase, signaling snapshot cleanup    
2025-06-20T20:09:03.221718Z TRACE polkadot-staking-miner: Block #792, Phase Off - nothing to do    
2025-06-20T20:09:03.221731Z TRACE polkadot-staking-miner: Running janitor cleanup for round 10    
2025-06-20T20:09:03.221745Z TRACE polkadot-staking-miner: Clearing snapshots    
2025-06-20T20:09:03.221749Z TRACE polkadot-staking-miner: Scanning round 9 for old submissions (current round: 10, scanning rounds from 9)    
2025-06-20T20:09:03.222313Z DEBUG polkadot-staking-miner: Found old submission in round 9 with 4 pages, attempting cleanup    
2025-06-20T20:09:03.222324Z DEBUG polkadot-staking-miner: Clearing old round data for round 9 with 4 witness pages    
2025-06-20T20:09:11.232603Z TRACE polkadot-staking-miner: Block #793, Phase Off - nothing to do    
2025-06-20T20:09:15.243377Z TRACE polkadot-staking-miner: Block #794, Phase Off - nothing to do    
2025-06-20T20:09:23.253658Z TRACE polkadot-staking-miner: Block #795, Phase Off - nothing to do    
2025-06-20T20:09:27.257928Z TRACE polkadot-staking-miner: Block #796, Phase Off - nothing to do    
2025-06-20T20:09:35.275623Z TRACE polkadot-staking-miner: Block #797, Phase Off - nothing to do    
2025-06-20T20:09:39.278849Z DEBUG polkadot-staking-miner: Successfully submitted clear_old_round_data for round 9    
2025-06-20T20:09:39.278892Z  INFO polkadot-staking-miner: Successfully cleaned up old submission from round 9 (4 witness pages)    
2025-06-20T20:09:39.278901Z  INFO polkadot-staking-miner: Janitor cleaned up 1 old submissions in 36057ms

Integration test

An integration test now verifies the scenario where two miners submit identical solutions. Both solutions are submitted successfully; one is rewarded, and the other is discarded once the miner with the non-winning solution has explicitly called clear_old_round_data().

2025-06-20T13:13:08.930566Z  INFO monitor: Bob solution discarded!    
2025-06-20T13:13:08.998134Z  INFO monitor: 🤑 Successfully completed two-miner test: both submitted solutions, one rewarded, one discarded! Duration: 1179.356788208s 🤑    
test submit_works ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1179.38s

Implements automatic cleanup of old discarded submissions to reclaim deposits:

Features:
- Triggers on Done → Off phase transitions (first Off block only)
- Scans last 5 rounds for old submissions
- Calls clear_old_round_data() to recover deposits and clean storage
- Non-blocking integration with existing mining operations
- Comprehensive error handling (critical vs recoverable errors)

Prometheus Metrics Added:

Counters:
- staking_miner_janitor_cleanup_success_total: Successful cleanup operations
- staking_miner_janitor_cleanup_failures_total: Failed cleanup operations

Gauges:
- staking_miner_janitor_cleanup_duration_ms: Time taken for last cleanup
- staking_miner_janitor_old_submissions_found: Old submissions discovered
- staking_miner_janitor_old_submissions_cleared: Submissions successfully cleared

Key Metrics:
- Success rate: success_total / (success_total + failures_total)
- Performance: avg(cleanup_duration_ms)
- Activity: old_submissions_cleared shows actual deposit recovery

Example Prometheus queries:

staking_miner_janitor_cleanup_success_total / (staking_miner_janitor_cleanup_success_total + staking_miner_janitor_cleanup_failures_total)

increase(staking_miner_janitor_cleanup_success_total[24h])

staking_miner_janitor_old_submissions_cleared / staking_miner_janitor_old_submissions_found

- Introduce JanitorMessage enum and janitor_task for deposit recovery
- Use dedicated bounded channel for janitor communication
- Update listener to send janitor ticks to janitor task
- Improve documentation and diagrams to reflect new architecture
- Ensure mining and janitor operations are independent and non-blocking

sigurpol · 2025-06-20T09:46:01Z

My usual suspects as reviewers are all OOO these days. This PR is needed for Kusama AH migration.
@jsdw, @Overkillus or @tdimitrov: if you have time to spare / waste, I would appreciate a review 🙏 🙇

jsdw · 2025-06-20T16:31:21Z

README.md

+This ensures that deposits from unsuccessful submissions are automatically recovered, maintaining
+the economic viability of long-term mining operations.


For somebody who doesn't know this stuff so well: if this didn't exist, would the old deposits be locked up forever without such a "cleanup" phase to claim them back?

Correct, there is no automatic reclaim: the election pallet does not automatically return deposits for solutions that are valid but not the best.

jsdw · 2025-06-20T16:32:32Z

README.md

Nice job on the README rewrite/tidyup!

src/commands/multi_block/monitor.rs

jsdw

I am not an expert in the specifics of staking, but the code looks clean and the approach makes sense to me, modulo a couple of tiny comments; nice one!

The change switches from phase-based to round-based triggers for the janitor cleanup task and for clearing the snapshot.

sigurpol added 3 commits June 19, 2025 16:17

Trigger janitor cleanup on Done or Export to Off transition

6581550

Update janitor logic to run when transitioning from Done or Export phase to Off, not just Done. Improve log message to include previous phase.

sigurpol force-pushed the clear_old_round_data branch from 033a2ff to 912685c June 19, 2025 17:15

Update README.md

f1218d0

sigurpol force-pushed the clear_old_round_data branch from 912685c to f1218d0 June 19, 2025 17:17

sigurpol mentioned this pull request Jun 20, 2025

Create a wiki or gh-pages hosted knowledge-base about what this is #870

Open

sigurpol requested review from Ank4n, kianenigma and seadanda June 20, 2025 09:34

Cleanup

64678c1

sigurpol force-pushed the clear_old_round_data branch from e595dfb to 64678c1 June 20, 2025 11:04

Extend integration test to cover two-miners scenario

68cda80

We expect to see one solution rewarded and the other discarded.

sigurpol force-pushed the clear_old_round_data branch from dbdb333 to 68cda80 June 20, 2025 13:21

sigurpol mentioned this pull request Jun 20, 2025

Add e2e integration tests with multiple submissions #1094

Closed