Maintain deterministic order of CLUSTER SHARDS response #411
Codecov Report
Attention: Patch coverage is

@@             Coverage Diff              @@
##           unstable     #411      +/-  ##
============================================
- Coverage     70.30%   70.28%    -0.03%
============================================
  Files           111      111
  Lines         60300    60285       -15
============================================
- Hits          42393    42370       -23
- Misses        17907    17915        +8
Thanks @VoletiRam for the PR. There was discussion around using
@valkey-io/core-team Please take a look.
Just want to confirm with you @VoletiRam: if the goal of adding the new parameter "topology" to CLUSTER SHARDS is that, from every client's view, the output on the 2-masters and 2-replicas nodes is always:
127.0.0.1:6321> cluster shards topology
BTW, I am reviewing this PR's code. Thanks.
@hwware The ordering is based on the primary node ID, lexicographically, irrespective of the
Thank you @hwware. As @hpatro pointed out, the view will be the same for both
Does it filter out failed and loading nodes? It would be ideal if the client's topology map could be solely based on the results of this command, eliminating the need for subsequent checks on the nodes' status.
Thanks. Then, according to this rule (sorted lexicographically by primary node ID), all clients should get the same view from any node. (At least the primary output is the same.)
@barshaul We are only filtering out fields that contribute to non-deterministic output, not nodes based on their health status. I think the ask was to eliminate volatile fields that can vary across clients; at least, filtering nodes is not clear from the discussion in #114. We can filter out nodes' information as well if everyone agrees.
src/cluster.c
Outdated
    (c->argc == 2 || c->argc == 3)) {
        /* CLUSTER SHARDS [TOPOLOGY] */
        int topology = 1;
Usually I set the default value of this kind of variable to 0 (here the variable is closer to a bool). But it is not a big issue, I think.
A few questions on which we need to reach consensus:
Thank you @hpatro for raising the questions that need consensus. I want to add a couple of questions as well. I am checking a few scenarios with 2 primaries and 2 replicas in a 4-node cluster, with slot coverage on the primaries.
With my implementation, the empty-slot-coverage issue in ##5 can be solved, as we go over each master in the masters list and print the corresponding slots; but the response will still show the old master with empty slots and a fail health status, unless we decide to filter out either masters marked fail or masters with no slot coverage. Please share your opinion.
Not sure if I understand your question. Can you elaborate?
Based on the use case as described in #114, including
I don't think 2-shard deployments are legit given the current design/implementation. We need to officially support 2-shard clusters first, and then it makes sense to discuss the output of
@zuiderkwast should we resurrect the "voting replicas" discussion? redis/redis#12390
If all clients would prefer using
I do not think we should deprecate the CLUSTER SHARDS command. Clients need to remember one more command and
Thus my suggestion is:
In fact, 2 primaries / 2 replicas, 2 primaries / 0 replicas, and 2 primaries / 4 replicas clusters are totally different.
Case 1: 2 primaries / 0 replicas. If cluster-require-full-coverage is set to no in the conf file, the cluster still works even if one primary node fails.
Case 2: 2 primaries / 2 replicas. If any primary fails, no vote happens, and a replica can fail over immediately.
Case 3: 2 primaries / 4 replicas. A vote happens if any primary fails.
So I agree with Ping: let us first support 2-shard clusters, then discuss the output of cluster shards [topology].
@PingXie Whether replicas can vote or whether the cluster has quorum to perform failovers, or even what kind of consensus algorithm is used, should be irrelevant to the clients. (It's even possible to have some external watchdog that performs manual failover.) So let's decouple those discussions from this PR?
I don't think clients should make their own decisions about the health of nodes. That's something the cluster does for them. The clients should only be concerned with routing according to what the cluster tells them. For this, there's no need to include shards without slots. Maybe it's better to exclude them, because such nodes are usually going to be taken down, or are just being set up and not really ready to be used for pubsub and other stuff clients may want to send to them. To summarize: I think CLUSTER SHARDS TOPOLOGY should return no more info than what's included in CLUSTER SLOTS. (Just in a different format.)
I agree with @hwware about this. If clients have started using CLUSTER SHARDS, we can let them do that. Let's not break it.
If we accept this premise, I think we should consider that maybe we are trying to force
It seems like we are saying that clients just shouldn't care about all the extra data provided by
The asks from @barshaul are basically: "I don't want any more information, I just want to know what slots are healthy and able to be served from." That is what
So, we can make
Yes, (1) was what I meant, but I wasn't completely aware of the background and details. It seems like the main point of this new CLUSTER SHARDS variant is that it's deterministic, so you (or a test case) can check that the nodes' views of the cluster are consistent. This isn't the use case for client slot routing. It's rather a use case for test cases and for admins, to check that the cluster converges after adding/removing nodes, slot migrations, etc.

If it's deterministic for a healthy cluster even with health info included, then I'm not going to argue against it. It can be used by clients too, just to save some bytes, but if some clients feel they want more info, they'll just use the full version of the command, or CLUSTER NODES. That can't be helped.

So I guess the question should be: how common or important is it for cluster admins to check that a cluster converges in this way? (In our own test framework we can solve it in some other way if it's just for us.)
I've done a fair amount of "diff" between various cluster outputs, and usually have to do some pre-processing to make sure they agree. It would be nice if the node ordering was the same in that case. You could then trivially ignore the fields that are known to be slightly different (replication offset). |
To make my suggestion about CLUSTER SLOTS more concrete, I'm proposing a change so that the response of CLUSTER SLOTS becomes:
Besides that, it behaves the exact same as cluster
@madolson I don't think the reason clients haven't adopted CLUSTER SHARDS (added in 7.0) is that it's hard to parse. The reason is rather that clients want to be backward compatible and support old Redis versions. If we add CLUSTER SLOTS PACKED, it will have the same problem: clients can only use it if they know the server supports it, and then they still need a fallback for versions that don't. Once Redis 6.2 and all Redis 6-compatible services are EOL (or about the time Valkey 9 is released), all deployments will support CLUSTER SHARDS, and then we can start expecting clients to switch to CLUSTER SHARDS.
I agree! People don't want to use the
I suppose there is another option. If we implement a
I don't agree this will happen. Lots of people will continue to use old versions because they will be supported.
Regarding this PR: Can we just settle with sorting what can be sorted in CLUSTER SHARDS? No new argument. That's my vote. Then we document what needs to be ignored when comparing the result from two different nodes. That means a doc PR.
@VoletiRam @hpatro Can you review what Victor posted in the previous message? Instead of adding a new command, let's just make the existing version deterministically ordered, without making any changes to arguments.
    }
}

test "Deterministic order of CLUSTER SHARDS response" {
This test isn't part of the CI since it uses the legacy clustering test framework. We should either move cluster-shards to the new framework or just move this part of the file over and we can remove the rest of the file in a separate PR.
@VoletiRam One of the things which came up while discussing with @madolson: we could sort the
@VoletiRam do you have any updates here?
Maintain deterministic order of CLUSTER SHARDS response. Currently we don't maintain the shards/masters in a sorted fashion, and hence the order of the CLUSTER SHARDS response is non-deterministic across nodes. Maintain a sorted list of master pointers, similar to replicas, and get rid of the <shards, list<nodes>> dict, which is not suitable for sorting. Add a TOPOLOGY argument to get a deterministic response that removes the replication offset and node health status from the CLUSTER SHARDS response. Sort the nodes based on the node ID. Use it in the proc `cluster_config_consistent` for test coverage and sanity purposes. Signed-off-by: Ram Prasad Voleti <[email protected]>
Remove topology argument and clean up related code changes. Signed-off-by: Ram Prasad Voleti <[email protected]>
Force-pushed 45b7926 to 20d8225
Replace dict with rax for cluster nodes and construct the primaries list on the fly, instead of maintaining a shards/masters list. Signed-off-by: Ram Prasad Voleti <[email protected]>
Force-pushed 20d8225 to d2303cf
Sorry for the delayed response. I was busy with other commitments at work. I addressed the comments: I replaced the dict data structure with a rax for cluster->nodes, and constructed the list of primaries from it when the CLUSTER SHARDS command is requested.
@madolson we would still want to improve
Yes, I am in for the
hpatro left a comment
Overall LGTM. There are plenty of touchpoints (iteration over the rax), but the idea is to replace the dict with a rax for maintaining the cluster nodes information, so that primaries/replicas come out in lexicographical ordering.
list *primaries = clusterGetPrimaries();
addReplyArrayLen(c, listLength(primaries));
listIter li;
listRewind(primaries, &li);
for (listNode *ln = listNext(&li); ln != NULL; ln = listNext(&li)) {
    clusterNode *n = listNodeValue(ln);
    addShardReplyForClusterShards(c, n);
}
dictReleaseIterator(di);
listRelease(primaries);
This is the crux of the change. Here we would get primaries in lexicographical ordering due to underlying RAX structure.
@PingXie Could you also take a look at this? It removes one of the abstractions you had introduced of
Yeah, I'm still fine with improving this. I wasn't sure I was happy with removing the shards abstraction from the code internals, though, since we intend to add it in the
We could maybe still keep the abstraction with the small overhead we were already paying.
I like the high level idea of single-sourcing the shard membership management. However, I have a few questions regarding the impact of switching from a dictionary to a Rax:
These are good callouts but might be difficult to measure. @PingXie Do you have any suggestions or scenarios in mind to reproduce this? It will be helpful for @VoletiRam.
Can we decouple the two changes: single-sourcing the shard membership management, and switching to rax? I think we need some time to better understand the impact of rax, but we could benefit from the single-sourcing change sooner.

For the performance impact analysis of rax, I'm thinking about writing a small program that constructs a rax tree with 1000 cluster nodes and performs queries on it. To simulate real-world conditions, we'll periodically flush the CPU cache by writing a large amount of data to a 32 MB memory block. We'll repeat the same process for a hash-table-based implementation. Afterward, we can compare the aggregated lookup times, excluding the memory copy time.

For the distribution analysis, we can take a similar approach by having the program log its random node selections. We can then generate charts to compare the distribution patterns between the rax-based and dictionary-based implementations. Thoughts?
For now, the conclusion is we are okay leaving
Maintain deterministic order of CLUSTER SHARDS response. Currently we don't maintain the shards/masters in a sorted fashion, and hence the order of the CLUSTER SHARDS response is non-deterministic across nodes. Maintain a sorted list of master pointers, similar to replicas, and replace the current <shards, list[nodes]> dict, which is not suitable for sorting. Add the `TOPOLOGY` argument to get a deterministic response that removes the replication offset and node health status from the CLUSTER SHARDS response. Sort the masters based on the node ID. Include the new CLUSTER SHARDS TOPOLOGY command in the `cluster_config_consistent` procedure to ensure thorough test coverage and conduct a sanity check on cluster consistency.

Example response of `CLUSTER SHARDS TOPOLOGY` in a 2-primaries, 2-replicas cluster.
Response from Primary 1:
Response from Primary 2:

Ref: #114