
Migrate pool to dashmap #304

Open

Shourya742 wants to merge 2 commits into stratum-mining:main from Shourya742:02-03-2026-migrate-pool-to-dashmap

Conversation

@Shourya742
Collaborator

closes: #205

@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch from 2ecf9d3 to 5a23188 Compare March 2, 2026 03:04
@Shourya742 Shourya742 marked this pull request as ready for review March 3, 2026 00:12
@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch from 247919a to 281946f Compare March 3, 2026 00:13
Contributor

@average-gary average-gary left a comment

Two concurrency concerns flagged below.

let Some(downstream) = channel_manager_data.downstream.get(&downstream_id) else {
return Err(PoolError::disconnect(PoolErrorKind::DownstreamNotFound(downstream_id), downstream_id));
};
let Some(downstream) = self.downstream.get(&downstream_id) else {
Contributor

DashMap guard held across .await

Unlike every other handler in this file, handle_update_channel doesn't use the closure pattern to scope DashMap guards. The downstream Ref acquired here lives through the for message in messages { message.forward(...).await; } loop at the bottom, blocking the entire shard for the duration of the async send.

Wrap the body in a closure like the other handlers do:

let process_update_channel = || {
    let Some(downstream) = self.downstream.get(&downstream_id) else { ... };
    // ... build messages ...
    Ok(messages)
};
let messages = process_update_channel()?;
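For illustration, the scoping the reviewer describes can be sketched synchronously with `std::sync::Mutex` standing in for `DashMap` (all names here are hypothetical, not the PR's actual types): the guard lives only inside the closure, and the messages are "forwarded" only after it is dropped.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical downstream registry; Mutex<HashMap> stands in for DashMap.
struct Handler {
    downstream: Mutex<HashMap<u32, String>>,
}

impl Handler {
    fn handle_update_channel(&self, downstream_id: u32) -> Result<Vec<String>, String> {
        // All guard-holding work happens inside this closure; by the time
        // the messages are sent below, the lock has been released.
        let process_update_channel = || -> Result<Vec<String>, String> {
            let guard = self.downstream.lock().unwrap();
            let downstream = guard
                .get(&downstream_id)
                .ok_or_else(|| format!("downstream {downstream_id} not found"))?;
            Ok(vec![format!("update-for-{downstream}")])
        }; // guard is dropped when the closure returns
        let messages = process_update_channel()?;
        // In the real handler this is `message.forward(...).await`; holding
        // a DashMap guard across that await would block the whole shard.
        Ok(messages)
    }
}
```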

Collaborator Author

The downstream object is dropped as soon as it stops being used, and the await is called at the very end; their scopes don't intersect.

let vardiff_key = vardiff.key().clone();
let vardiff_state = vardiff.value_mut();
let downstream_id = &vardiff_key.downstream_id;
let channel_id = &vardiff_key.channel_id;
Contributor

Deadlock risk: inverted lock ordering with submit handlers

This loop holds three nested DashMap guards simultaneously: self.vardiff (iter_mut) → self.downstream (get_mut) → downstream.standard_channels (get_mut).

The submit handlers acquire these in the opposite order: self.downstream → standard_channels → self.vardiff.

Under shard collision this is a classic lock-ordering deadlock. Consider collecting the keys first to avoid holding the vardiff iter guard while acquiring the others:

let keys: Vec<_> = self.vardiff.iter().map(|r| r.key().clone()).collect();
for key in keys {
    let Some(mut vardiff) = self.vardiff.get_mut(&key) else { continue };
    // ...
    drop(vardiff); // or scope it tightly
}
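
The collect-then-reacquire pattern above can be shown runnably with `std::sync::Mutex` in place of `DashMap` (struct and field names are hypothetical stand-ins for the PR's maps): keys are snapshotted first, so no two guards are ever held at the same time and no ordering cycle with the submit handlers can form.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical stand-ins for the two maps discussed above.
struct Manager {
    vardiff: Mutex<HashMap<u32, u64>>,       // key -> vardiff state
    downstream: Mutex<HashMap<u32, String>>, // key -> connection info
}

impl Manager {
    // Snapshot the keys first so the `vardiff` guard is dropped before
    // `downstream` is locked; each lock below is taken and released on
    // its own, never nested inside another.
    fn update_all(&self) -> Vec<u32> {
        let keys: Vec<u32> = self.vardiff.lock().unwrap().keys().copied().collect();
        let mut updated = Vec::new();
        for key in keys {
            let connected = self.downstream.lock().unwrap().contains_key(&key);
            if connected {
                if let Some(state) = self.vardiff.lock().unwrap().get_mut(&key) {
                    *state += 1;
                    updated.push(key);
                }
            }
        }
        updated
    }
}
```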

Collaborator Author

TBH, I don't understand this. ;)

let messages = self.channel_manager_data.super_safe_lock(|channel_manager_data| {
let Some(downstream) = channel_manager_data.downstream.get_mut(&downstream_id) else {
return Err(PoolError::disconnect(PoolErrorKind::DownstreamIdNotFound, downstream_id));
let process_open_standard_mining_channel = || {
Member

Why did you put this here instead of the let messages = ?

Collaborator Author

The reason we require a closure here is that the block contains return statements. Without the closure, those returns would exit the entire handler method instead of just the block.

That said, I am not a big fan of this pattern anymore. It originally existed to work with the nested locking pattern we had before. Since that is no longer the case, we don’t really need this structure anymore. The code can likely be simplified to something much leaner and easier to reason about.
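
The scoping issue described above can be shown with a minimal, hypothetical handler (none of these names are from the PR): the early `return Err(...)` exits only the closure, and the `?` afterwards decides whether the handler itself bails out.

```rust
// Hypothetical handler shape: the block that builds messages contains
// early returns, so it is wrapped in a closure to confine them.
fn handle_request(downstream_id: u32) -> Result<String, String> {
    let build_messages = || -> Result<Vec<String>, String> {
        if downstream_id == 0 {
            // Exits the closure only; without the closure this `return`
            // would exit `handle_request` entirely.
            return Err(format!("downstream {downstream_id} not found"));
        }
        Ok(vec![format!("msg-for-{downstream_id}")])
    };
    let messages = build_messages()?;
    Ok(messages.join(","))
}
```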

Member

I would simplify this block though, and I would do the same in all the other places where you introduced this closure.

Collaborator Author

For sure, I am currently doing that. Will push the changes in some time.

Collaborator Author

Made the changes; the handlers should be leaner now. The commits are structured so that each method change is in its own atomic commit, making the latest set of changes easier to review.

3af7b21
e567925
f11b665
6fde432
cc66092

Member

Before merging this, I would squash them in the previous commits accordingly.

Collaborator Author

For sure, they were there for ease of review.

Collaborator Author

Commits are squashed now.

@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch from 2645ba9 to cc66092 Compare March 11, 2026 11:08
if downstream.requires_custom_work.load(Ordering::SeqCst) {
error!("OpenStandardMiningChannel: Standard Channels are not supported for this connection");
let open_standard_mining_channel_error = OpenMiningChannelError {
let send_error = |error_code: &'static str| async {
Member

Shouldn't this become a function in utils.rs, which can be called from every place where we need it?

I see it's currently defined and repeated multiple times.

Collaborator Author

If we look at the implementation of all such closures, we can see that each points to a very specific error message tied to the method.

Member

It's not completely true, because we have some cases where the error message is exactly the same.

For example, we have two identical closures for the OpenMiningChannelError.

Since it seems like something that can be used for different error messages, why can't it be a function in utils.rs, where you also pass the error message you want and it does the job (probably matching on the error message that is passed)?

Collaborator Author

Passing the entire error message to a helper method somewhat defeats the purpose of the closure in the first place. The closure was introduced to avoid constructing the error message repeatedly and to eliminate boilerplate across multiple call sites when the only variation is the error code.

Member

@GitGab19 GitGab19 Mar 12, 2026

With one helper method you put the logic which is inside the different closures only in one place, but you can use it for different error messages, and then call it from different contexts to send a specific error message with a specific error code.

Example:

forward_error_message_to_channel_manager(error_message_type, error_code)

and then:

forward_error_message_to_channel_manager(OPEN_MINING_CHANNEL_ERROR_MESSAGE_TYPE, "standard-channels-not-supported-for-custom-work")

or

forward_error_message_to_channel_manager(SET_CUSTOM_MINING_JOB_ERROR_MESSAGE_TYPE, "pool-payout-script-missing")
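
The suggested utils.rs helper could look roughly like this sketch (the marker enum and the string output are hypothetical; the real code would construct and forward the actual SV2 error messages):

```rust
// Hypothetical marker types for the two error messages mentioned above;
// the real types live in the protocol crates.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ErrorMessageType {
    OpenMiningChannelError,
    SetCustomMiningJobError,
}

// One shared place for the per-message-type logic, parameterized by the
// error code; this sketch just renders the result as a String.
fn forward_error_message_to_channel_manager(
    message_type: ErrorMessageType,
    error_code: &'static str,
) -> String {
    match message_type {
        ErrorMessageType::OpenMiningChannelError => {
            format!("OpenMiningChannelError: {error_code}")
        }
        ErrorMessageType::SetCustomMiningJobError => {
            format!("SetCustomMiningJob.Error: {error_code}")
        }
    }
}
```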

Member

unless I'm missing context, which is a very real possibility

Collaborator Author

I don't think that needs an issue to track it; it's just a helper closure to remove repetitive message construction during method execution.

Member

@plebhash plebhash Mar 13, 2026

there seems to be two topics of discussion here:

  1. RouteTo
  2. closure

I'm just pointing out that (IIUC) @GitGab19 said we need an issue to keep track of 2. (replying to your ping); #330 was presented as the answer, while its scope only covers 1

I'm fine if we decide to move forward without addressing the concerns raised about closure convolution, I'm just trying to make sure we're all on the same page and not masquerading one issue with another

anyways, I'm hitting the road in a bit so won't be able to do a deep dive on this PR today so I'll leave it for you guys to figure it out

Collaborator Author

I am removing the closure.

Collaborator Author

Should be good now; updated the commits. The integration tests pass. :)

@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch 3 times, most recently from 5a88034 to 6f19477 Compare March 13, 2026 14:31
@Shourya742 Shourya742 marked this pull request as draft March 13, 2026 17:38
@Shourya742 Shourya742 marked this pull request as ready for review March 14, 2026 11:35
@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch from 6c5dd6a to 07f9f0d Compare March 17, 2026 04:00
@GitGab19
Member

I'm looking into your last commit (07f9f0d), and I notice that JDC vardiff functions don't take the self. Are we risking deadlocks there?

@Shourya742
Collaborator Author

I'm looking into your last commit (07f9f0d), and I notice that JDC vardiff functions don't take the self. Are we risking deadlocks there?

The lock acquisition order is different: it changed from vardiff → downstream → channel to downstream → channel → vardiff.

In the JDC case, the original order works because all components operate under the channel_manager_data lock, which ensures synchronized access. In JDC, no operation on the channel manager can proceed without first acquiring the channel_manager_data lock, effectively acting as a guardrail.
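
The "guardrail" arrangement described for JDC can be sketched like this (struct and field names are hypothetical): the inner maps are plain HashMaps behind one Mutex, so their inner access order cannot deadlock because only one thread is ever inside at a time.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Everything lives behind one Mutex, mirroring the JDC layout described
// above; the inner maps need no locks of their own.
struct ChannelManagerData {
    vardiff: HashMap<u32, u64>,
    downstream: HashMap<u32, String>,
}

struct ChannelManager {
    data: Mutex<ChannelManagerData>,
}

impl ChannelManager {
    fn new() -> Self {
        Self {
            data: Mutex::new(ChannelManagerData {
                vardiff: HashMap::new(),
                downstream: HashMap::new(),
            }),
        }
    }

    // The single entry point: every operation on the channel manager
    // acquires this one lock first, acting as the guardrail.
    fn with_data<R>(&self, f: impl FnOnce(&mut ChannelManagerData) -> R) -> R {
        let mut guard = self.data.lock().unwrap();
        f(&mut guard)
    }
}
```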

@GitGab19
Member

And why are we handling this differently in the Pool? They are servers with downstreams in the same way.

@Shourya742
Collaborator Author

And why are we handling this differently in the Pool? They are servers with downstreams in the same way.

Because we are migrating to DashMap, which moves vardiff and downstream (which were earlier part of ChannelManagerData) into the Channel Manager, and the Channel Manager doesn't have a centralized lock the way ChannelManagerData does.

@GitGab19
Member

Right, I'm sorry but for a moment I forgot that we haven't introduced the Dashmap in JDC yet.

@Shourya742
Collaborator Author

This looks terrible, need to design it better

Screenshot from 2026-03-17 23-24-35

@Shourya742 Shourya742 marked this pull request as draft March 17, 2026 17:58
@Shourya742
Collaborator Author

This looks terrible, need to design it better

Screenshot from 2026-03-17 23-24-35

This is definitely a solid improvement over what we had before, more context here: #299 (comment). Taking my words back.

That said, we still need to dig deeper into the root cause of the latency. Even though this PR reduces it significantly, seeing delays in the order of seconds is still concerning.

@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch 3 times, most recently from ae73953 to c85d149 Compare March 23, 2026 07:17
@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch from c85d149 to af9f7db Compare March 23, 2026 07:22
@Shourya742 Shourya742 force-pushed the 02-03-2026-migrate-pool-to-dashmap branch from af9f7db to e0a0f75 Compare March 23, 2026 07:29
@Shourya742 Shourya742 marked this pull request as ready for review March 23, 2026 11:37
@Shourya742
Collaborator Author

Opening this for review: this PR also introduces a wrapper around DashMap, helping us avoid its common footguns and keeping lock semantics out of the rest of the codebase.
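
A wrapper of that shape might look roughly like the following sketch; it uses `RwLock<HashMap>` instead of the actual DashMap so it stays dependency-free, and the method names are assumptions, not the PR's API. The point is the same either way: access goes only through closures, so a guard can never escape the call and be held across an `.await`.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::RwLock;

// Illustrative wrapper: callers get closure results, never guards.
struct GuardedMap<K, V> {
    inner: RwLock<HashMap<K, V>>,
}

impl<K: Eq + Hash, V> GuardedMap<K, V> {
    fn new() -> Self {
        Self { inner: RwLock::new(HashMap::new()) }
    }

    fn insert(&self, key: K, value: V) -> Option<V> {
        self.inner.write().unwrap().insert(key, value)
    }

    // The read guard lives only for the duration of `f`.
    fn with<R>(&self, key: &K, f: impl FnOnce(&V) -> R) -> Option<R> {
        self.inner.read().unwrap().get(key).map(f)
    }

    // Same for the write guard: it is released before this returns.
    fn with_mut<R>(&self, key: &K, f: impl FnOnce(&mut V) -> R) -> Option<R> {
        self.inner.write().unwrap().get_mut(key).map(f)
    }
}
```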


Development

Successfully merging this pull request may close these issues.

Refactor Pool to Reduce Nested Locking

4 participants