feat: reclaim deposit for discarded solutions #1029

Closed
sigurpol wants to merge 18 commits into main from reclaim-deposit

Conversation

@sigurpol
Contributor

sigurpol commented Apr 22, 2025

This pull request introduces logic to track submitted rounds and automatically call the clear_submission function (which dispatches the clear_old_round_data extrinsic [1]) for any solution that is determined to be discarded (i.e., not the winning solution) in past rounds.

The monitor now checks past rounds on each block, and if a better solution has been validated or the round has ended and our submission still exists, it triggers the clearing process to reclaim the deposit.

We have adopted a queue-based approach for clearing rounds, along with a dedicated task to call clear_old_round_data. This strategy helps us avoid blocking the main event loop and potentially delaying the processing of blocks in monitor_cmd.

Dedicated Prometheus metrics have been added as well, tracking the number of failures, the queue size, and the duration of each clearing attempt.

This ensures that locked deposits for discarded solutions are properly returned to the miner without manual intervention.

Fix #980.

Driven-by: the miner now only listens to finalized blocks and not also to best blocks, avoiding potential race conditions and simplifying round submission handling.

Example:
Let's run a miner for both //Alice and //Bob in the usual Zombienet setup provided by the SDK. We are processing round 2; in round 1, Alice was rewarded and Bob's solution was discarded.

Relevant logs of the miner for Alice in round 2:

2025-04-22T07:12:17.825910Z DEBUG polkadot-staking-miner: Checking status of PAST rounds (< 2): [1]
2025-04-22T07:12:17.826488Z DEBUG polkadot-staking-miner: Submission metadata for past round 1 gone. Removing from tracking.
2025-04-22T07:12:17.826511Z  INFO polkadot-staking-miner: Removed past round 1 from local tracking.

Relevant logs for Bob in round 2:

2025-04-22T07:12:17.825890Z DEBUG polkadot-staking-miner: Checking status of PAST rounds (< 2): [1]
2025-04-22T07:12:17.826790Z  INFO polkadot-staking-miner: Submission for past round 1 detected as DISCARDED (Reason: Phase is past Signed/SignedValidation (or round gone) and metadata still exists). Attempting to clear.
2025-04-22T07:12:17.826806Z  INFO polkadot-staking-miner: Attempting to clear submission data for round 1 (16 pages)
2025-04-22T07:12:53.907228Z  INFO polkadot-staking-miner: Successfully cleared submission data for round 1 in block 0x7334857ad8a2a322198cbdbbdd788953ce52048352f6a66732447fa852b00ffa
2025-04-22T07:12:53.907293Z  INFO polkadot-staking-miner: Successfully cleared submission for past round 1
2025-04-22T07:12:53.907336Z  INFO polkadot-staking-miner: Removed past round 1 from local tracking.

On Polkadot JS I can see a multiBlockSigned.Discarded event associated with Bob after Bob has cleared his submission.

Footnotes

  1. Important distinction: bail is for self-withdrawing a solution during the active mining phase, whereas clear_old_round_data is for reclaiming deposits from solutions that were discarded after the validation phase.

sigurpol marked this pull request as draft April 22, 2025 07:52
sigurpol marked this pull request as ready for review April 22, 2025 09:44
sigurpol requested a review from niklasad1 April 22, 2025 09:45
@sigurpol
Contributor Author

@niklasad1 , @kianenigma PTAL 🙏

This commit introduces logic to track submitted
rounds and automatically call the clear_submission
function (which dispatches the clear_old_round_data
extrinsic) for any solution that is determined to
be discarded (i.e., not the winning solution) in
past rounds. The monitor now checks all previously
submitted rounds on each block, and if a better
solution has been validated or the round has ended
and our submission still exists, it triggers the
clearing process to reclaim the deposit.
This ensures that locked deposits for discarded
solutions are properly returned to the miner
without manual intervention.
if has_submitted(&storage, round, signer.account_id(), n_pages).await? {
// 2. If the solution has already been submitted:
// 2.1 Check local tracking first
if submitted_rounds.lock().await.contains_key(&round) {
Contributor


Because we support both best blocks and finalized blocks at the moment, it's possible that a block gets reverted, and then this local submitted_rounds may not be accurate with respect to the on-chain state.

Finalized blocks can't be reverted (except in hard forks, IIRC), so to support such functionality we would need to remove support for listening to best blocks, which may simplify our life a bit.

That's the reason why we are checking the state at the latest head, like:

if has_submitted(
        &utils::storage_at_head(&client, listen).await?,
        round,
        signer.account_id(),
        n_pages,
).await? { ... }

Contributor Author


Good point. What are you suggesting as the best option?

  1. Only listen to finalized blocks,
    vs.
  2. Always double-check with chain state, i.e. use local tracking as now but then validate against chain state, e.g.
// Check local tracking first, but validate against chain state
if submitted_rounds.lock().await.contains_key(&round) {
    // Verify with chain state before trusting local tracking
    let storage_head = utils::storage_at_head(&client, listen).await?;
    if has_submitted(&storage_head, round, signer.account_id(), n_pages).await? {
        // Confirmed both locally and on-chain
        return Ok(());
    } else {
        // Local tracking is wrong, remove it and continue
        untrack_round_submission(
            &submitted_rounds,
            round,
            Some(format!("Removing inconsistent local tracking for round {round}"))
        ).await;
        // Continue with submission attempt
    }
}

Contributor

@niklasad1 Apr 23, 2025


The entire point of best blocks is that they're faster and you may win against finalized blocks, but it may bite you in the ass if something got included in a block but never got finalized, or got reverted.

Personally I don't think it's worth the effort of maintaining such annoyances when we are just using finalized blocks when running the miner. My two cents: remove best-block listening completely.

This would be annoying to implement; alternatively, users who use best blocks simply have to accept that this weird stuff can happen and that best blocks are unreliable.

/cc @kianenigma what's your take on best blocks?

Contributor Author


I wouldn't be against just going with finalized blocks and checking the impact on the overall performance of the miner. I'll make a commit on top, if you agree.

Contributor Author

@sigurpol Apr 23, 2025


Changes to support ONLY finalized blocks in this commit

sigurpol force-pushed the reclaim-deposit branch 2 times, most recently from 319a705 to b71e2b3 on April 23, 2025 16:46
@sigurpol
Contributor Author

@niklasad1 , @kianenigma : CI test failures will be addressed once paritytech/polkadot-sdk#8310 is merged into polkadot-sdk, so please don't pay too much attention to them.

- Created a new helper function `storage_at_finalized` that gets
storage only from the finalized chain state
- Updated the `monitor_cmd` function to always use finalized blocks
- Replaced all instances of utils::storage_at_head(&client, listen)
with storage_at_finalized(&client) in three locations:
  - After acquiring the submission lock
  - After mining a solution
  - Before submitting a solution to check if it's better than what's
  on chain
- Removed the unused functions from utils.rs:
  - Removed storage_at_head which was replaced by our new
  storage_at_finalized function
  - Removed rpc_get_latest_head which is no longer needed since
  we're only using finalized blocks

This change ensures:

1. The miner only listens to finalized blocks, avoiding potential
race conditions
2. All storage queries are made against finalized blocks, ensuring
consistent state
3. We only submit solutions when we're certain they're better than
the current best solution on the finalized chain
This reverts commit 1a5718e and
leverages `is_solution_already_submitted`.

if maybe_submission_metadata.is_none() {
log::debug!(target: LOG_TARGET, "Submission metadata for past round {} gone. Removing from tracking.", round_to_check);
rounds_to_remove.push(round_to_check);
Contributor


remove directly instead of pushing to a vec?

Contributor Author

@sigurpol Apr 24, 2025


Wouldn't we invalidate the iterator, though? And I thought retain or similar would be problematic in this async code…

- Introduce a queue-based approach where we enqueue rounds to clear and
  a task processes the queue. We do not consider only the most recent
  past rounds, but we still cap the size of the queue.
- Add Prometheus metrics to monitor the clearing-round mechanism
@kianenigma
Contributor

kianenigma commented Apr 25, 2025

I would do this slightly differently; let me first elaborate on that before I do a more detailed review.

Note that I am not against this; if it works and the code is clear to you, you can keep the current approach.

now checks past rounds on each block

and the fact that you keep internal state of what you have submitted is my main point of disagreement. I would have done this more simply:

  1. Keep the event loop of monitor-and-submit as separate as possible. This thread should always look at the current round and, if in Phase::Signed, mine something and submit it. Its responsibilities end here. It won't keep track of what has been submitted or not.
  2. A totally separate thread is always looking at the SignedSubmissions storage, checking if there is any data starting with key (x, our_account) where x < current_round, and cleaning that up. This thread can run every block, every 10 blocks, or every day; it doesn't need to be super on-time.

In some sense, I am suggesting we don't re-store what is already in chain. The question of "okay, what have I submitted in any old round that needs to be cleared now?" is 100% answerable through looking at the chain state, we don't need to store it here again.

I think this would be a simpler approach, perhaps worth considering before you commit to this.

The benefit of your current approach, keeping an internal record of what you have submitted or not, is that if a lot of other submitters submit garbage and don't clean it up, then the query that we have to do ("checking if there is any data starting with key (x, our_account) where x < current_round") becomes a bit slow.

A downside of the current approach is that if the miner crashes and restarts, we lose track of what we have submitted in the past, and won't clear it up.

Another downside of the current approach is that it is not really compatible with the idea of allowing anyone to clear anyone else's data, if we want to do that.

// Process any rounds in the queue
let mut processed = false;

while let Ok(round_to_clear) = clear_receiver.try_recv() {
Contributor

@niklasad1 Apr 25, 2025


this will essentially poll the receiver; let's use the async API to avoid waking up the task so often when there is nothing to consume

Suggested change:

- while let Ok(round_to_clear) = clear_receiver.try_recv() {
+ while let Ok(round_to_clear) = clear_receiver.recv().await {

Contributor Author


Done a variant of this in 4a7899a

@sigurpol
Contributor Author

Thanks for the comment, @kianenigma. I see your valid points. Let me explore your approach and see how it would look. I'm also interested in @niklasad1's opinion on this!

@niklasad1
Contributor

niklasad1 commented Apr 25, 2025

and the fact that you keep your internal state of what you have submitted, is my main point of disagreement. I would have done this simpler:

  1. Keep the event loop of monitor-and-submit as separate as possible. This thread should always look at the current round, if Phase::Signed mine something and submit it. Its responsibilities end here. It won't keep track of what has submitted or not
  2. A totally separate thread is always looking at the SignedSubmissions storage, and checking if there are any data starting with key (x, our_account) where x < current_round in the storage, and clean that up. This thread can run every block, every 10 blocks, or every day, it doesn't need to be super on-time.

It seems like a good idea to me because the current implementation in this PR became a bit complicated, and I don't think reconnecting/restarting should take that long (tens of seconds is my guess), so the probability of missing the round transition should be quite low, though possible.

There is also the possibility of using the reconnecting-rpc-client from subxt to protect against this stuff as well, but it's still possible that the miner stops for some other reason.

I think Kian's suggestion would be a good first step; then, if losing submissions really turns out to be a big issue, we could improve it by writing them to a file or something to re-use on start-up.

but really it's up to you @sigurpol :)

@sigurpol
Contributor Author

and the fact that you keep internal state of what you have submitted is my main point of disagreement. I would have done this more simply:

  1. Keep the event loop of monitor-and-submit as separate as possible. This thread should always look at the current round and, if in Phase::Signed, mine something and submit it. Its responsibilities end here. It won't keep track of what has been submitted or not.
  2. A totally separate thread is always looking at the SignedSubmissions storage, checking if there is any data starting with key (x, our_account) where x < current_round, and cleaning that up. This thread can run every block, every 10 blocks, or every day; it doesn't need to be super on-time.

It seems like a good idea to me because the current implementation in this PR became a bit complicated, and I don't think reconnecting/restarting should take that long (tens of seconds is my guess), so the probability of missing the round transition should be quite low, though possible.

There is also the possibility of using the reconnecting-rpc-client from subxt to protect against this stuff as well, but it's still possible that the miner stops for some other reason.

I think Kian's suggestion would be a good first step; then, if losing submissions really turns out to be a big issue, we could improve it by writing them to a file or something to re-use on start-up.

but really it's up to you @sigurpol :)

Thanks @kianenigma and @niklasad1 . I do actually agree with @kianenigma 's suggestion. I think the pros are pretty strong:

  • Chain as source of truth: we no longer duplicate state that's already tracked on-chain
  • Crash resistance: If the miner crashes and restarts, it will still find and clear old submissions
  • Cleaner separation of concerns: The main monitor process focuses solely on current round submissions
  • Simplicity: No need for complex coordination between processes with channels and notifications

so it seems a clear winner 🚀

@sigurpol
Contributor Author

Closing this pull request; I will implement the feature on the miner's new architecture (now that we have a listener and a miner task, I will also add a janitor task responsible for calling clear_old_round_data(); it should be simpler now to keep concerns well isolated, famous last words 😅). Stay tuned 📻

sigurpol closed this Jun 19, 2025
@sigurpol
Contributor Author

Feature has been implemented in #1095


Development

Successfully merging this pull request may close these issues.

multi block miner: reclaim deposit