Allow multi-slot MGET in Cluster Mode #707
base: unstable
Conversation
|
We fail the test:
Since the intent is to allow cross-slot access, I need to investigate whether pubsub has a special restriction we need to continue to hold |
|
For reference, this was discussed in #507. To recap my perspective: I was okay with allowing commands that aren't read-modify-write commands. I was okay with pure writes (DEL, MSET) or pure reads (MGET) operating across slots. |
|
We have use cases that we simply can't run without this, so it's pretty important for very high scale jobs. I will dive into the pubsub issue, but I can't think of a reason it needs to restrict this. |
One question we did want to understand was the actual performance benefit of this. In the other thread we discussed whether or not you could just do a deep pipeline of GET operations. |
|
What was the rationale for not permitting read-modify-write? |
It has been a year since we did this, but it's an order of magnitude difference in performance. I can rerun some benchmarks with memtier to get modern data.
It felt like it was logically breaking the abstraction, since we practically have serializability within a slot. Allowing an atomic read, modify, and then write to a different slot felt like something you would have to carefully architect on the client side to make sure the slots were on the same node. Possible, but not something we wanted to encourage end users to do. |
|
Ah, you are looking at it from the perspective of higher-order atomics. This is not designed to allow that, and it doesn't: since at any time you can get a CROSSSLOT, the client must be prepared to break the operation down into lower-order operations. It does require more intelligence from the client, but the performance wins are worth it. Re performance, here are some examples with valkey-benchmark for MSET vs SET: you can see that MSET and SET do the same QPS, but MSET is doing 10x the number of keys per second. We use MGET more than MSET, but MGET isn't in valkey's benchmark util for some reason. Regardless, we've proven this out in production over the last year and it's super efficient; in fact the biggest issue is that when you add shards the MGET batch sizes shrink, so you lose efficiency, which offsets the scaling. However, this can be worked around by adding more replicas instead of full shards. |
|
With a pipeline of 10 it closes up a bit, but it's still a win: 1,136,363 keys/sec for SET vs 4,405,280 for MSET, so still ~4x faster. EDIT: For good measure, here is pipeline 100: 1,886,792 keys/sec for SET vs 5,025,120 keys/sec for MSET, so 2.66x faster at the extremes. |
Another thing that was discussed was that we could use instruction prefetching or alter the way MGET works to parallelize the fetching of the data out of the dictionary, since a big chunk of the time is spent on TLB and L3 cache misses. We only get that with MGET. |
|
My experiments with prefetch were mostly failures, but it's always something that rolls around in the back of your head. I'm going to clean this one up since it's really a huge win; then, if you do anything to make non-cluster MGET better, the cluster version will improve as well. Also do take a look at the features field of INFO, as that's a fairly decent change for Redis, but I think it will be more necessary now that there are so many variants out there.
Thanks a lot for sharing the performance data, @JohnSully. This answers one of my earlier questions on "why not pipelining". I wonder if part of the performance improvement comes from the lower RESP protocol parsing overhead?
I am looking forward to the new update. The way it is coded now, however, is too specific/narrow to be accepted into the mainline, and it changes the contract (whether it is for better or not is a different question). We discussed a similar situation where externally visible behavior would be changed by a proposal and you can find the core team's thoughts here. As a result of that discussion, there is now a new |
zuiderkwast left a comment
@JohnSully Did you verify that you got the correct result from a cross-slot MGET?
In cluster mode, we store the keys in one hashtable per slot. We use server.current_client->slot to avoid computing the slot again when looking up each key in the right hashtable. See getKeySlot().
I think it should be client opt-in, a CLIENT CAPA as Ping suggested.
If a client receives -CROSSSLOT because one of the slots is just being migrated, then if the client splits up the command into multiple ones, the atomicity guarantee of the command is lost. Therefore, a client should make it clear to the user of the client that this is a non-atomic command or feature.
With the client opt-in, I don't think the INFO field is necessary. (If we do add it, then I think it should be under '# Cluster' rather than under '# Server'.)
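To make the per-slot lookup issue above concrete, here is a minimal, self-contained C sketch (not Valkey code; the names and the toy hash are made up) showing how a slot cached from the first key of an MGET makes the lookup of a second, differently-slotted key miss even though the key exists on the node:

```c
#include <stdio.h>
#include <string.h>

#define NSLOTS  4   /* real clusters use 16384; 4 keeps the demo small */
#define PERSLOT 8

typedef struct { const char *key, *val; } entry;
static entry slots[NSLOTS][PERSLOT];          /* one tiny "dict" per slot */

static int hashslot(const char *key) {        /* stand-in for CRC16 % 16384 */
    unsigned h = 0;
    while (*key) h = h * 31 + (unsigned char)*key++;
    return (int)(h % NSLOTS);
}

static void setkey(const char *key, const char *val) {
    entry *t = slots[hashslot(key)];
    for (int i = 0; i < PERSLOT; i++)
        if (!t[i].key) { t[i].key = key; t[i].val = val; return; }
}

static const char *lookup_in_slot(int slot, const char *key) {
    for (int i = 0; i < PERSLOT; i++)
        if (slots[slot][i].key && !strcmp(slots[slot][i].key, key))
            return slots[slot][i].val;
    return NULL;   /* key is not in this slot's table */
}

int main(void) {
    setkey("user:1", "alice");
    setkey("user:2", "bob");

    /* Slot cached from the first key of the MGET, as described above. */
    int cached = hashslot("user:1");

    const char *a = lookup_in_slot(cached, "user:1");
    const char *b = lookup_in_slot(cached, "user:2");   /* wrong table: NULL */
    const char *c = lookup_in_slot(hashslot("user:2"), "user:2");

    printf("user:1 via cached slot: %s\n", a ? a : "(nil)");
    printf("user:2 via cached slot: %s\n", b ? b : "(nil)");
    printf("user:2 via own slot   : %s\n", c ? c : "(nil)");
    return 0;
}
```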
|
@PingXie re changing the contract: it is only additive, so it should not be a backward-compatibility problem. I.e. cases that used to be errors are now accepted, but no case that did not error before will error now. This is also a major area where Redis is deficient against databases like Aerospike, so it would be really limiting to gate it behind a config. |
|
@zuiderkwast The code is correct in KeyDB but I will check if anything has changed enough to break it. However, we check hash slots for both the first and all later keys, ensuring none are migrating. As mentioned earlier, this is purely for performance; it is not intended to increase atomicity guarantees, and the client must be prepared for slots to migrate away. The reason for the feature flag is that there is a forward-compatibility issue. You cannot naively try the batch on a non-supporting server because of the performance hit of doubling traffic (first the failure, then the GET retries). Customers aren't going to want to update their client and see twice the traffic, so if you don't provide a flag, clients will have a hard time implementing support. In our environment we had exactly this problem, as not all clusters could be upgraded right away, so we ran with a mix for a period. |
|
@zuiderkwast I will modify the change to only permit cross-slot on reads. This should eliminate the issue of needing current_slot, since you only need to update the counts on write. |
|
I pushed the following changes (which should also unblock the tests)
|
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@             Coverage Diff              @@
##            unstable     #707      +/-   ##
============================================
- Coverage      70.99%   70.86%    -0.13%
============================================
  Files            121      121
  Lines          65175    65184        +9
============================================
- Hits           46269    46192       -77
- Misses         18906    18992       +86
|
I'm pretty sure KeyDB doesn't have the recent refactoring that stores the keys in one dict per slot. This is not even released yet. :) This refactoring avoids the need to keep a separate key-to-slot mapping, which takes extra memory per key. If you write a simple test case where you read from two different slots, you will notice that you get NULL for the second slot, because it looks for the key in the wrong dict. (A simple fix for that would be to set
Right, we need some way for clients to find out. The thing is, calling INFO on connect is not something clients are encouraged to do, even though your client does that. Clients can see the version in the HELLO response, or we can add more fields in the HELLO response; alternatively, another command to opt in that clients can call after connecting, just as we have, for example, CLIENT TRACKING to opt in to client-side caching.
I don't think we need to limit MSET actually. The problem is not about counts. |
Ah OK, as mentioned the tests are still pending, so this should not be merged before those. But I wanted to start the conversation earlier. I will write the tests and look for issues integrating with your new feature. |
|
In the meantime we should sort out the following open questions:
|
@JohnSully were your numbers collected on KeyDB or Valkey unstable? Since the key argument here is performance I am also interested in a rerun after we merge the async IO work. |
|
These questions already have a discussion started in issue #507, so we can already see some people's ideas there.
I vote NO to a config, because enabling this per client is better than enabling it globally. The client needs to handle this special logic, so I prefer client opt-in rather than enabled by default. If it's enabled for everyone always, then clients and applications start assuming things and suddenly it's not possible to scale and migrate slots. I'm not sure we want to go down that path.
In the issue #507 we discussed three categories of commands:
@madolson argued that it should be allowed for 1-2 but not for 3. IIUC it's because those commands aren't simply a batch of jobs and can't be used for fan-in fan-out. I tend to agree, but OTOH I'm not sure we need to restrict that. It's cleaner, simpler and more predictable to have no exception to this rule. When cross-slot commands are allowed, then they're all allowed. So I vote for no exceptions. (There may be use cases like finding a lot of sets using SCAN and then aggregating them using SUNIONSTORE, or other yet unknown use cases.) Worth noting: There's already a hack that allows cross-slot commands. It's called Lua scripts without declared keys.
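For illustration only, the Lua-script loophole mentioned above looks roughly like this (a hiredis sketch; the key names are made up, and servers that enforce key declarations may reject such scripts):

```c
#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* No declared keys (numkeys = 0); the keys travel in ARGV, so the
     * cluster cross-slot check never sees them. This is exactly the kind
     * of undeclared-key script the documentation discourages. */
    redisReply *r = redisCommand(c,
        "EVAL %s 0 %s %s",
        "return {redis.call('GET', ARGV[1]), redis.call('GET', ARGV[2])}",
        "key-in-slot-A", "key-in-slot-B");

    if (r && r->type == REDIS_REPLY_ARRAY)
        printf("got %zu replies from one script call\n", r->elements);
    else if (r)
        printf("error: %s\n", r->str);

    if (r) freeReplyObject(r);
    redisFree(c);
    return 0;
}
```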
Synopsis: The client can announce capabilities it supports. To avoid an error when the client sends an unknown capability that exists in a future valkey version (or a fork of Valkey or something) the server just ignores unknown capabilities and returns OK. Thus, this command doesn't allow the client to check if the server supports a capability. (We can probably change the return value if we want, since it's not released yet.) Another option is a new command, such as
(@PingXie asked) I want to know this too. If they were done on Valkey with the existing code, then maybe most of the keys were looked up in the wrong dict and got NULL instead of a correct reply. That most likely affects performance. |
|
I don't understand why the server cares if the client "supports" this or not. Since non-supporting clients already know they can't send cross-slot commands, nothing changes for them. It's the client that needs to know if the server supports it, not the other way around. What is the bad-case scenario for legacy clients here? Now, it may still make sense to use the CAPA handshake for the client to query support for this, and that is fine. But I don't see why the server would change its behavior if a CAPA isn't sent. |
Not always. Some clients are dumb and do minimal checks. Some clients just look up the first key in the command and route the command based on that.
Unaware application programmers start using Valkey cluster without knowing about slots and the cross-slot rule. It seems to work because their keys happen to be on the same node. |
I can see the argument to make things more explicit upstream, however this will only work 33% of the time or less, since the minimum cluster has 3 primaries. They would really have to do no validation not to notice. Also, clients have to do slot->server mapping anyway, so they are probably already going to throw exceptions before it even hits the server. I also don't think we can assume all clients will implement this as a transparent cross-server MGET, since the implicit assumption of atomicity for MGET is broken when it's sent to different servers. Clients may instead choose to expose this in a different way than a transparent MGET that, under the hood, cracks the operation into smaller ones. At Snap we did choose the weak-consistency option since we have deep control of our use cases and could make that decision. In the context of the server itself this is all irrelevant, since MGETs are atomic at that level, so we are not breaking any ABI. It is purely additive at that level and no existing use case will be broken. |
|
OK, now let's wait for others to catch up and share their opinions. :) Can you tell us more about Aerospike? It seems interesting. I found https://aerospike.com/docs/server/architecture/overview. Does it support cross-node transactions? If it does, there's no unpredictable cross-slot error to worry about, so that's a bit different situation. |
Let's re-establish this on Valkey 8.0 RC first. I fully support performance improvements, but it's important to have a holistic understanding of the change and its impact on the rest of the system. As it stands, the proposed PR does not align with the proposal in #507, which also needs thorough deliberation. Given the scope of Valkey use cases, we cannot afford to implement pointed fixes without comprehensive consideration. Here is the process that I am thinking of:
Additionally, I would like to understand other ways to help onboard existing KeyDB users with Valkey. Please feel free to file issues for any further support.
SGTM. BTW, there is also the correctness issue that @zuiderkwast mentioned in #707 (comment) (copied below).
|
|
I'm strongly against this feature. From the PR description the use case is to improve performance. If the feature has a different purpose and I'm missing something please update the PR description with the motivation. |
|
The issue is that there's a 1/16,000 chance you'll have two keys going to the same slot, so for practical purposes you can't really use MGET in cluster mode. Whereas with a 3-shard cluster that should be a 1/3 probability. We typically have batch sizes of 200 to 500 keys, so we can generally get decent batches out to the servers if they would accept them.
In this case the client is doing the job of making sure that the batches are all served by the server we send the batch to; it's just loosening the arbitrary restriction that they be in the same slot.
…On Thu, Jul 4, 2024 at 8:55 AM asafpamzn ***@***.***> wrote:
I'm strongly against this feature. From the PR description the use case is to improve performance.
IMHO, this should be solved at the client side.
If the user wants an atomic multi-GET they should use MGET and verify that all the keys are in the same slot.
If the user wants enhanced performance they should not use a transaction or MGET; they should use a batch API to send a single batch. The client driver will split the batch between the nodes and send small batches.
If the feature has a different purpose and I'm missing something please update the PR description with the motivation.
The batching can be implemented on the client side; we will be happy to get a contribution or issue for the https://github.com/aws/glide-for-redis repo and we can support it.
|
It helps if you explain why you're against it.
That depends how you use it. If you're accessing a bunch of keys related to the same user, the same product, etc., then they should be designed with tags to put them in the same slot. Then MGET works for these keys. Now, what you're doing is different. You're doing batch processing for performance reasons, which is a special optimization that not every simple web app does. If you fix this PR so it actually works correctly, then we can check the performance and see how much faster than a pipeline it actually is. |
I think that I explained it :) but probably it was not clear. It won't guarantee the atomicity, but it will have better performance than your server-side suggestion. Why? The new client batching API will just stream the commands and the server will handle them as they go. It will allow multi-slot operations. |
|
@asafpamzn yes, many clients allow pipelining or async api, where the client can already send one MGET per slot in a single round trip, but @JohnSully claims that a single cross-slot MGET is several times faster than that, so that solution is not satisfying IIUC. I still haven't seen compelling evidence for this though. |
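For comparison, the "one MGET per slot in a single round trip" approach would look roughly like this with hiredis (a sketch; `toy_slot()` stands in for the client's real CRC16-and-node mapping, and everything is sent on a single connection):

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Toy slot function standing in for CRC16(key) % 16384; a real cluster
 * client would also map each slot to its owning node. */
static int toy_slot(const char *key) {
    unsigned h = 0;
    while (*key) h = h * 31 + (unsigned char)*key++;
    return (int)(h % 16384);
}

int main(void) {
    const char *keys[] = { "k:1", "k:2", "k:3", "k:4" };
    const int nkeys = 4;
    int done[4] = {0};

    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Group keys by slot and append one MGET per group (pipelined). */
    int sent = 0;
    for (int i = 0; i < nkeys; i++) {
        if (done[i]) continue;
        const char *argv[1 + 4];
        size_t argvlen[1 + 4];
        int argc = 0;
        argv[argc] = "MGET"; argvlen[argc++] = 4;
        for (int j = i; j < nkeys; j++) {
            if (!done[j] && toy_slot(keys[j]) == toy_slot(keys[i])) {
                argv[argc] = keys[j];
                argvlen[argc++] = strlen(keys[j]);
                done[j] = 1;
            }
        }
        redisAppendCommandArgv(c, argc, argv, argvlen);
        sent++;
    }

    /* Read all replies back; the commands go out in one round trip. */
    for (int i = 0; i < sent; i++) {
        redisReply *r = NULL;
        if (redisGetReply(c, (void **)&r) != REDIS_OK) break;
        printf("MGET #%d returned %zu values\n", i + 1, r->elements);
        freeReplyObject(r);
    }
    redisFree(c);
    return 0;
}
```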
|
Thanks for the clarification. We did some benchmarking back then with redis-py; see https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/. I don't know how to link a specific section, but please search for "The following table summarizes the performance of pipelines with a single connection (in RPS)". It shows that in redis-py the
Thus, I don't think that we should change the server, but rather optimize the clients. In addition, due to clients' implementations of MGET there is confusion about whether MGET is atomic or not in cluster mode, and this change is going to make it more confusing. I saw many users who use MGET for batching. I think that it should be solved by updating the existing client APIs to be clearer about what is atomic and what is not. I would like to see the documentation of MGET following this change and how we are going to explain it to users that are less advanced and don't pin slots to specific nodes. I think that the docs are not going to be easy for the common user. |
…n with MGET. Currently we check if all keys are in the same slot, but we could return if all slots are on this node instead. Signed-off-by: John Sully <[email protected]>
Signed-off-by: John Sully <[email protected]>
69bc27e to 9ec45be
|
Just a quick update: I fixed the code to work with the slot cache on the client. We just disable this optimization for cross-slot MGET but leave it on if the MGET is served by a single slot. It still needs tests, but we're at full functionality now. I rebased it onto unstable since this PR is a few months old. |
|
I strongly support this change. Our custom module uses a read-only command that retrieves keys from multiple hashslots but only once per cluster node. With the current implementation, where commands run per hashslot, performance is significantly reduced. This command works as expected in Redis, and the discrepancy in Valkey is quite frustrating. Enabling read-only multi-slot commands would align Valkey's behavior with Redis and restore optimal performance. |
|
I am open to this best effort "cluster mget" idea but I don't think we should change the semantics of the existing |
What is the benefit of having a dedicated command for it? The behavior of MGET will be the same, regardless of whether it crosses slots or not. We could implement some type of flag on the command, "SUPPORTS_FANOUT" or something, that clients can consume to know if they can fan out the operation (one to each node). In another thread I mentioned that fanout works as long as all keys are independently read or written (DEL, MSET, MSETNX, MGET, EXISTS, TOUCH, UNLINK, that's it, that's the list) but not for commands that cross that boundary (SUNIONSTORE, XREAD, etc.). Of the commands that can be trivially fanned out, only MGET and EXISTS are read commands. It does feel like in this case we are just making an optimization for two commands, of which only one is really useful. |
We would be changing the behavior/semantics/contract of the existing |
MSET is a write command; do we want to allow cross-slot write operations? That will break cross-slot write atomicity. It's complicated to do the write during slot migration when one slot is being migrated and the other isn't. I suppose the line I want to draw is that we should only consider multi-slot read commands, as long as all keys are on the same shard. Having multiple READ commands in a transaction would also meet my criteria, as long as each read query only operates on one key at a time. For example, a multi-exec of HGETALL could be fanned out. In this case though, the value isn't as clear because you're adding overhead with the MULTI, which defeats the ask in this PR to improve performance. We could consider the behavior if we added a new HGETALL command that operates on multiple keys; then it would also be eligible for the fanout optimization. Maybe a new command is the best approach then? It seems like we're really trying to make a single tactical optimization. The capability seems like more overhead. |
Just like MSET will break write atomicity, cross-slot MGET will break read atomicity. It's equally complicated to do the atomic read during slot migration when one is being migrated and the other isn't. I see no difference in this regard between read and write commands. The way I see us handling edge cases like these is to reject the command if one of the slots is being or has already been migrated. We would return an error like -CROSSSLOT or -TRYAGAIN, so the client still needs to handle these errors. Just like @PingXie I don't like to make this optimization for a single command, but I like the idea of a client opt-in, signaling that the client is able to handle these edge cases. |
I'm not sure the two types are exactly the same. If all keys are present in a shard, you can execute the MGET command (by the contract). Even if all the keys are present, you still can't execute the write command, as you are supposed to forward the write command to the target shard, but only for the slot that is migrating. You basically can't execute a cross-slot write command while the slot is being migrated without breaking it up. I think it also poses a greater risk when we get to atomic slot migration, since these cross-slot write commands will be getting injected into two different streams. I guess all I'm saying is that I think read atomicity is something we can soften more easily than write atomicity. There is overlap.
I also agree with this. I think the open question is do we prefer to use |
|
We're about to open up a new dimension of commands. If we add, let's say, MGETXS (mget cross-slot), then what's next? existsxs, delxs, msetxs, msetnxxs, ... I don't like where this is going. I'm not sure the difference between reads and writes is that significant either. There are very similar limitations for read commands, though they're not exactly the same as for write commands.
Yes, but if some keys don't exist, it can mean they have been migrated in an ongoing slot migration. For single-key commands, we reply with an -ASK redirect if it's a single key and -TRYAGAIN for multi-key commands. We'd probably return -CROSSSLOT for cross-slot commands in this case. As long as it can happen, the client needs to be able to handle it, same as for write commands. With atomic slot migration, the transfer of a slot is atomic. That's the contract, so I don't see the problem here. We'll have no more -ASK or -TRYAGAIN for either read or write commands. Only if one slot's ownership has been transferred and another's hasn't will we need to return -CROSSSLOT. It's a race condition, yet it can happen, so clients need to handle it, for read and write commands equally. |
Yeah, I basically think we should just start with MXGET (and MXEXISTS if anyone wants it). (For some reason, I really like MX for multi-slot). With the justification that we are just allowing cross-slot on reads.
Today, |
|
Removing this from the 9.1 backlog. Right now this complicates a lot of things in our roadmap. If we can come up with a clean solution we can consider moving forward on it. |
The current restriction for MGET in cluster mode is that all keys must reside in the same slot. This results in very high inefficiency for high-batch use cases, as the batch has to be cracked into individual GETs. Because there are 16k slots, it's very unlikely we will find a pair of keys we can send together with MGET, making it nearly useless in cluster mode.
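For context, the slot of a key is CRC16(key) mod 16384 (hashing only the {tag} part when a hash tag is present), so two unrelated keys collide in the same slot with probability roughly 1/16384. A self-contained sketch of that mapping follows (the server uses a table-driven CRC16; this bitwise version is just for illustration):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Bitwise CRC16 (XMODEM polynomial 0x1021), the variant used for
 * cluster key hashing. */
static uint16_t crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i]) << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hash only the {tag} part if a non-empty hash tag is present. */
static int key_hash_slot(const char *key) {
    const char *open = strchr(key, '{');
    if (open) {
        const char *close = strchr(open + 1, '}');
        if (close && close > open + 1)
            return crc16(open + 1, (size_t)(close - open - 1)) % 16384;
    }
    return crc16(key, strlen(key)) % 16384;
}

int main(void) {
    /* Unrelated keys almost never share a slot, so plain MGET is rarely usable... */
    printf("user:1001     -> slot %d\n", key_hash_slot("user:1001"));
    printf("user:1002     -> slot %d\n", key_hash_slot("user:1002"));
    /* ...unless the keys share a hash tag, which forces the same slot. */
    printf("{user:1001}:a -> slot %d\n", key_hash_slot("{user:1001}:a"));
    printf("{user:1001}:b -> slot %d\n", key_hash_slot("{user:1001}:b"));
    return 0;
}
```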
Instead we can relax the condition a bit and permit MGET if all slots reside on this node and none are migrating. This allows us to serve the request in the common case. In the case where a slot was migrated or changed, we still send CROSSSLOT and the client will know to break the batch down into individual GETs.
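A rough sketch of the client behaviour this implies, using hiredis (key names are made up): attempt the multi-slot MGET and degrade to per-key GETs only when the node replies -CROSSSLOT.

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Try a (possibly multi-slot) MGET; if the node replies -CROSSSLOT,
 * e.g. because a slot moved, degrade to one GET per key.
 * Demo assumes nkeys < 64. */
static void mget_with_fallback(redisContext *c, const char **keys, int nkeys) {
    const char *argv[64];
    size_t argvlen[64];
    argv[0] = "MGET"; argvlen[0] = 4;
    for (int i = 0; i < nkeys; i++) {
        argv[i + 1] = keys[i];
        argvlen[i + 1] = strlen(keys[i]);
    }

    redisReply *r = redisCommandArgv(c, nkeys + 1, argv, argvlen);
    if (r && r->type == REDIS_REPLY_ERROR &&
        strncmp(r->str, "CROSSSLOT", 9) == 0) {
        freeReplyObject(r);
        /* Fallback: individual GETs, pipelined to save round trips. */
        for (int i = 0; i < nkeys; i++)
            redisAppendCommand(c, "GET %s", keys[i]);
        for (int i = 0; i < nkeys; i++) {
            redisReply *g = NULL;
            if (redisGetReply(c, (void **)&g) != REDIS_OK) return;
            printf("%s = %s\n", keys[i],
                   g->type == REDIS_REPLY_STRING ? g->str : "(nil)");
            freeReplyObject(g);
        }
        return;
    }
    if (r && r->type == REDIS_REPLY_ARRAY)
        for (size_t i = 0; i < r->elements; i++)
            printf("%s = %s\n", keys[i],
                   r->element[i]->type == REDIS_REPLY_STRING
                       ? r->element[i]->str : "(nil)");
    if (r) freeReplyObject(r);
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;
    const char *keys[] = { "batch:key:A", "batch:key:B", "batch:key:C" };
    mget_with_fallback(c, keys, 3);
    redisFree(c);
    return 0;
}
```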
Because we expect there to be cases where we still send CROSSSLOT, simply doing an MGET test on the client is not sufficient to communicate this new support. To help with this, we introduce a new INFO field called "features". This is intended to work similarly to the features flag in lscpu, where new features get added over time. Now the client can check this and determine whether it still needs to break up batches or whether they can be sent in one go.
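A hedged sketch of how a client might gate the optimization on such a field; the `crossslot-mget` token and the exact "features" line format below are purely hypothetical, as the PR text does not spell out the field contents:

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Return 1 if the server advertises the (hypothetical) feature token on
 * the proposed "features" line of INFO, 0 otherwise. */
static int server_has_feature(redisContext *c, const char *token) {
    int found = 0;
    redisReply *r = redisCommand(c, "INFO");
    if (r && r->type == REDIS_REPLY_STRING) {
        const char *line = strstr(r->str, "features:");
        if (line) {
            /* Copy just that line so the search cannot run past it. */
            char buf[256];
            const char *eol = strchr(line, '\n');
            size_t len = eol ? (size_t)(eol - line) : strlen(line);
            if (len >= sizeof(buf)) len = sizeof(buf) - 1;
            memcpy(buf, line, len);
            buf[len] = '\0';
            found = strstr(buf, token) != NULL;
        }
    }
    if (r) freeReplyObject(r);
    return found;
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;
    /* Decide once at connect time whether batches must be pre-split. */
    if (server_has_feature(c, "crossslot-mget"))
        printf("server supports multi-slot MGET; send whole batches\n");
    else
        printf("no feature flag; split batches per slot up front\n");
    redisFree(c);
    return 0;
}
```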
Note: tests will come in a little bit but we can start the conversation of the feature now.
Old Behavior:
New Behavior: