Allow multi-slot MGET in Cluster Mode #707
base: unstable
Conversation
|
We fail the test:
Since the intent is to allow cross-slot access, I need to investigate whether pubsub has a special restriction we need to continue to hold |
|
For reference, this was discussed in #507. To recap my perspective: I was okay with allowing commands that aren't read-modify-write commands. I was okay with pure writes (DEL, MSET) or pure reads (MGET) operating across slots. |
|
We have use cases that we simply can't run without this, so it's pretty important for very high scale jobs. I will dive into the pubsub issue, but I can't think of a reason it needs to restrict this. |
One question we did want to understand was the actual performance benefit of this. In the other thread we discussed whether or not you could just do a deep pipeline of GET operations. |
|
What was the rationale for not permitting read-modify-write? |
It has been a year since we did this, but it's an order of magnitude difference in performance. I can rerun some benchmarks with memtier to get modern data.
It felt like it was logically breaking the abstraction, since we practically have serializability within a slot. Allowing an atomic read, modify, and then write to a different slot felt like something you would have to carefully architect on the client side to make sure the slots were on the same node. Possible, but not something we wanted to encourage end users to do. |
|
Ah, you are looking at it from the perspective of higher-order atomics. This is not designed to allow that, and it doesn't: since at any time you can get a CROSSSLOT, the client must be prepared to break the operation down into lower-order operations. It does require more intelligence from the client, but the performance wins are worth it. Re performance, here are some examples with valkey-benchmark for MSET vs SET: you can see that MSET and SET do the same QPS, but MSET is doing 10x the number of keys per second. We use MGET more than MSET, but MGET isn't in valkey's benchmark util for some reason. Regardless, we've proven this out in production over the last year and it's super efficient; in fact the biggest issue is that when you add shards the MGET batch sizes shrink, so you lose efficiency, which offsets the scaling. However, this can be worked around by adding more replicas instead of full shards. |
|
With a pipeline of 10 it closes up a bit, but it's still a win: 1,136,363 keys/sec for SET vs 4,405,280 for MSET, so still ~4x faster. EDIT: For good measure, here is pipeline 100: 1,886,792 keys/sec for SET vs 5,025,120 keys/sec for MSET, so 2.66x faster at the extremes. |
Another thing that was discussed was that we could use instruction prefetching or alter the way MGET works to parallelize the fetching of the data out of the dictionary, since a big chunk of the time is spent on TLB and L3 cache misses. We only get that with MGET. |
|
My experiments with prefetch were mostly failures, but it's always something that rolls around in the back of your head. I'm going to clean this one up since it's really a huge win; then, if you do anything to make non-cluster MGET better, the cluster version will improve as well. Also do take a look at the features field of INFO, as that's a fairly decent change for Redis, but I think it will be more necessary now that there are so many variants out there.
Thanks a lot for sharing the performance data, @JohnSully. This answers one of my earlier questions on "why not pipelining". I wonder if part of the performance improvement comes from the lower RESP protocol parsing overhead?
I am looking forward to the new update. The way it is coded now, however, is too specific/narrow to be accepted into the mainline, and it changes the contract (whether it is for better or not is a different question). We discussed a similar situation where externally visible behavior would be changed by a proposal and you can find the core team's thoughts here. As a result of that discussion, there is now a new |
zuiderkwast left a comment
@JohnSully Did you verify that you got the correct result from a cross-slot MGET?
In cluster mode, we store the keys in one hashtable per slot. We use server.current_client->slot to avoid computing the slot again when looking up each key in the right hashtable. See getKeySlot().
I think it should be client opt-in, a CLIENT CAPA as Ping suggested.
If a client receives -CROSSSLOT because one of the slots is just being migrated, then if the client splits up the command into multiple ones, the atomicity guarantee of the command is lost. Therefore, a client should make it clear to the user of the client that this is a non-atomic command or feature.
With the client opt-in, I don't think the INFO field is necessary. (If we do add it, then I think it should be under '# Cluster' rather than under '# Server'.)
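To make the per-slot lookup issue above concrete, here is a minimal, self-contained C sketch (not Valkey code; the names and the toy hash are made up) showing how a slot cached from the first key of an MGET makes the lookup of a second, differently-slotted key miss even though the key exists on the node:

```c
#include <stdio.h>
#include <string.h>

#define NSLOTS  4   /* real clusters use 16384; 4 keeps the demo small */
#define PERSLOT 8

typedef struct { const char *key, *val; } entry;
static entry slots[NSLOTS][PERSLOT];          /* one tiny "dict" per slot */

static int hashslot(const char *key) {        /* stand-in for CRC16 % 16384 */
    unsigned h = 0;
    while (*key) h = h * 31 + (unsigned char)*key++;
    return (int)(h % NSLOTS);
}

static void setkey(const char *key, const char *val) {
    entry *t = slots[hashslot(key)];
    for (int i = 0; i < PERSLOT; i++)
        if (!t[i].key) { t[i].key = key; t[i].val = val; return; }
}

static const char *lookup_in_slot(int slot, const char *key) {
    for (int i = 0; i < PERSLOT; i++)
        if (slots[slot][i].key && !strcmp(slots[slot][i].key, key))
            return slots[slot][i].val;
    return NULL;   /* key is not in this slot's table */
}

int main(void) {
    setkey("user:1", "alice");
    setkey("user:2", "bob");

    /* Slot cached from the first key of the MGET, as described above. */
    int cached = hashslot("user:1");

    const char *a = lookup_in_slot(cached, "user:1");
    const char *b = lookup_in_slot(cached, "user:2");   /* wrong table: NULL */
    const char *c = lookup_in_slot(hashslot("user:2"), "user:2");

    printf("user:1 via cached slot: %s\n", a ? a : "(nil)");
    printf("user:2 via cached slot: %s\n", b ? b : "(nil)");
    printf("user:2 via own slot   : %s\n", c ? c : "(nil)");
    return 0;
}
```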
|
@PingXie re changing the contract: it is only additive, so it should not be a backward-compatibility problem. I.e. cases that used to be errors are now accepted, but no case that did not error before will error now. This is also a major area where Redis is deficient against databases like Aerospike, so it would be really limiting to gate it behind a config. |
|
@zuiderkwast The code is correct in KeyDB but I will check if anything has changed enough to break it. However, we check hash slots for both the first and all later keys, ensuring none are migrating. As mentioned earlier, this is purely for performance; it is not intended to increase atomicity guarantees, and the client must be prepared for slots to migrate away. The reason for the feature flag is that there is a forward-compatibility issue. You cannot naively try the batch on a non-supporting server because of the performance hit of doubling traffic (first the failure, then the GET retries). Customers aren't going to want to update their client and see twice the traffic, so if you don't provide a flag, clients will have a hard time implementing support. In our environment we had exactly this problem, as not all clusters could be upgraded right away, so we ran with a mix for a period. |
|
@zuiderkwast I will modify the change to only permit cross-slot on reads. This should eliminate the issue of needing current_slot, since you only need to update the counts on write. |
|
I pushed the following changes (which should also unblock the tests)
|
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@             Coverage Diff              @@
##            unstable     #707      +/-   ##
============================================
- Coverage      70.99%   70.86%    -0.13%
============================================
  Files            121      121
  Lines          65175    65184        +9
============================================
- Hits           46269    46192       -77
- Misses         18906    18992       +86
|
I'm pretty sure KeyDB doesn't have the recent refactoring that stores the keys in one dict per slot. This is not even released yet. :) This refactoring avoids the need to keep a separate key-to-slot mapping, which takes extra memory per key. If you write a simple test case where you read from two different slots, you will notice that you get NULL for the second slot, because it looks for the key in the wrong dict. (A simple fix for that would be to set
Right, we need some way for clients to find out. The thing is, calling INFO on connect is not something clients are encouraged to do, even though your client does that. Clients can see the version in the HELLO response, or we can add more fields in the HELLO response; alternatively, another command to opt in that clients can call after connecting, just as we have, for example, CLIENT TRACKING to opt in to client-side caching.
I don't think we need to limit MSET actually. The problem is not about counts. |
Ah OK, as mentioned the tests are still pending, so this should not be merged before those. But I wanted to start the conversation earlier. I will write the tests and look for issues integrating with your new feature. |
|
In the meantime we should sort out the following open questions:
|
@JohnSully were your numbers collected on KeyDB or Valkey unstable? Since the key argument here is performance I am also interested in a rerun after we merge the async IO work. |
|
These questions already have a discussion started in issue #507, so we can already see some people's ideas there.
I vote NO to a config, because enabling this per client is better than enabling it globally. The client needs to handle this special logic, so I prefer client opt-in rather than enabled by default. If it's enabled for everyone always, then clients and applications start assuming things and suddenly it's not possible to scale and migrate slots. I'm not sure we want to go down that path.
In the issue #507 we discussed three categories of commands:
@madolson argued that it should be allowed for 1-2 but not for 3. IIUC it's because those commands aren't simply a batch of jobs and can't be used for fan-in fan-out. I tend to agree, but OTOH I'm not sure we need to restrict that. It's cleaner, simpler and more predictable to have no exception to this rule. When cross-slot commands are allowed, then they're all allowed. So I vote for no exceptions. (There may be use cases like finding a lot of sets using SCAN and then aggregating them using SUNIONSTORE, or other yet unknown use cases.) Worth noting: There's already a hack that allows cross-slot commands. It's called Lua scripts without declared keys.
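For illustration only, the Lua-script loophole mentioned above looks roughly like this (a hiredis sketch; the key names are made up, and servers that enforce key declarations may reject such scripts):

```c
#include <stdio.h>
#include <hiredis/hiredis.h>

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* No declared keys (numkeys = 0); the keys travel in ARGV, so the
     * cluster cross-slot check never sees them. This is exactly the kind
     * of undeclared-key script the documentation discourages. */
    redisReply *r = redisCommand(c,
        "EVAL %s 0 %s %s",
        "return {redis.call('GET', ARGV[1]), redis.call('GET', ARGV[2])}",
        "key-in-slot-A", "key-in-slot-B");

    if (r && r->type == REDIS_REPLY_ARRAY)
        printf("got %zu replies from one script call\n", r->elements);
    else if (r)
        printf("error: %s\n", r->str);

    if (r) freeReplyObject(r);
    redisFree(c);
    return 0;
}
```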
Synopsis: The client can announce capabilities it supports. To avoid an error when the client sends an unknown capability that exists in a future valkey version (or a fork of Valkey or something) the server just ignores unknown capabilities and returns OK. Thus, this command doesn't allow the client to check if the server supports a capability. (We can probably change the return value if we want, since it's not released yet.) Another option is a new command, such as
(@PingXie asked) I want to know this too. If they were done on Valkey with the existing code, then maybe most of the keys were looked up in the wrong dict and got NULL instead of a correct reply. That most likely affects performance. |
|
I don't understand why the server cares if the client "supports" this or not. Since non-supporting clients already know they can't send cross-slot commands, nothing changes for them. It's the client that needs to know if the server supports it, not the other way around. What is the bad-case scenario for legacy clients here? Now, it may still make sense to use the CAPA handshake for the client to query support for this, and that is fine. But I don't see why the server would change its behavior if a CAPA isn't sent. |
Not always. Some clients are dumb and do minimal checks. Some clients just look up the first key in the command and route the command based on that.
Unaware application programmers start using Valkey cluster without knowing about slots and the cross-slot rule. It seems to work because their keys happen to be on the same node. |
I can see the argument to make things more explicit upstream, however this will only work 33% of the time or less, since the minimum cluster has 3 primaries. They would really have to do no validation not to notice. Also, clients have to do slot->server mapping anyway, so they are probably already going to throw exceptions before it even hits the server. I also don't think we can assume all clients will implement this as a transparent cross-server MGET, since the implicit assumption of atomicity for MGET is broken when it's sent to different servers. Clients may instead choose to expose this in a different way than a transparent MGET that, under the hood, cracks the operation into smaller ones. At Snap we did choose the weak-consistency option since we have deep control of our use cases and could make that decision. In the context of the server itself this is all irrelevant, since MGETs are atomic at that level, so we are not breaking any ABI. It is purely additive at that level and no existing use case will be broken. |
|
OK, now let's wait for others to catch up and share their opinions. :) Can you tell us more about Aerospike? It seems interesting. I found https://aerospike.com/docs/server/architecture/overview. Does it support cross-node transactions? If it does, there's no unpredictable cross-slot error to worry about, so that's a bit different situation. |
Let's re-establish this on Valkey 8.0 RC first. I fully support performance improvements, but it's important to have a holistic understanding of the change and its impact on the rest of the system. As it stands, the proposed PR does not align with the proposal in #507, which also needs thorough deliberation. Given the scope of Valkey use cases, we cannot afford to implement pointed fixes without comprehensive consideration. Here is the process that I am thinking of:
Additionally, I would like to understand other ways to help onboard existing KeyDB users with Valkey. Please feel free to file issues for any further support.
SGTM. BTW, there is also the correctness issue that @zuiderkwast mentioned in #707 (comment) (copied below).
|
|
I'm strongly against this feature. From the PR description the use case is to improve performance. If the feature has a different purpose and I'm missing something please update the PR description with the motivation. |
|
The issue is that there's a 1/16,000 chance you'll have two keys going to the same slot, so for practical purposes you can't really use MGET in cluster mode. Whereas with a 3-shard cluster that should be a 1/3 probability. We typically have batch sizes of 200 to 500 keys, so we can generally get decent batches out to the servers if they would accept them.
In this case the client is doing the job of making sure that the batches are all served by the server we send the batch to; it's just loosening the arbitrary restriction that they be in the same slot.
…On Thu, Jul 4, 2024 at 8:55 AM asafpamzn ***@***.***> wrote:
I'm strongly against this feature. From the PR description the use case is to improve performance.
IMHO, this should be solved at the client side.
If the user wants an atomic multi-GET they should use MGET and verify that all the keys are in the same slot.
If the user wants enhanced performance they should not use a transaction or MGET; they should use a batch API to send a single batch. The client driver will split the batch between the nodes and send small batches.
If the feature has a different purpose and I'm missing something please update the PR description with the motivation.
The batching can be implemented on the client side; we will be happy to get a contribution or issue for the https://github.com/aws/glide-for-redis repo and we can support it.
|
It helps if you explain why you're against it.
That depends how you use it. If you're accessing a bunch of keys related to the same user, the same product, etc., then they should be designed with tags to put them in the same slot. Then MGET works for these keys. Now, what you're doing is different. You're doing batch processing for performance reasons, which is a special optimization that not every simple web app does. If you fix this PR so it actually works correctly, then we can check the performance and see how much faster than a pipeline it actually is. |
I think that I explained it :) but probably it was not clear. It won't guarantee the atomicity, but it will have better performance than your server-side suggestion. Why? The new client batching API will just stream the commands and the server will handle them as they go. It will allow multi-slot operations. |
|
@asafpamzn yes, many clients allow pipelining or async api, where the client can already send one MGET per slot in a single round trip, but @JohnSully claims that a single cross-slot MGET is several times faster than that, so that solution is not satisfying IIUC. I still haven't seen compelling evidence for this though. |
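For comparison, the "one MGET per slot in a single round trip" approach would look roughly like this with hiredis (a sketch; `toy_slot()` stands in for the client's real CRC16-and-node mapping, and everything is sent on a single connection):

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Toy slot function standing in for CRC16(key) % 16384; a real cluster
 * client would also map each slot to its owning node. */
static int toy_slot(const char *key) {
    unsigned h = 0;
    while (*key) h = h * 31 + (unsigned char)*key++;
    return (int)(h % 16384);
}

int main(void) {
    const char *keys[] = { "k:1", "k:2", "k:3", "k:4" };
    const int nkeys = 4;
    int done[4] = {0};

    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) { fprintf(stderr, "connect failed\n"); return 1; }

    /* Group keys by slot and append one MGET per group (pipelined). */
    int sent = 0;
    for (int i = 0; i < nkeys; i++) {
        if (done[i]) continue;
        const char *argv[1 + 4];
        size_t argvlen[1 + 4];
        int argc = 0;
        argv[argc] = "MGET"; argvlen[argc++] = 4;
        for (int j = i; j < nkeys; j++) {
            if (!done[j] && toy_slot(keys[j]) == toy_slot(keys[i])) {
                argv[argc] = keys[j];
                argvlen[argc++] = strlen(keys[j]);
                done[j] = 1;
            }
        }
        redisAppendCommandArgv(c, argc, argv, argvlen);
        sent++;
    }

    /* Read all replies back; the commands go out in one round trip. */
    for (int i = 0; i < sent; i++) {
        redisReply *r = NULL;
        if (redisGetReply(c, (void **)&r) != REDIS_OK) break;
        printf("MGET #%d returned %zu values\n", i + 1, r->elements);
        freeReplyObject(r);
    }
    redisFree(c);
    return 0;
}
```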
|
Thanks for the clarification. We did some benchmarking back then with redis-py; see https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/. I don't know how to link a specific section, but please search for "The following table summarizes the performance of pipelines with a single connection (in RPS)". It shows that in redis-py the
Thus, I don't think that we should change the server, but rather optimize the clients. In addition, due to clients' implementations of MGET there is confusion about whether MGET is atomic or not in cluster mode, and this change is going to make it more confusing. I saw many users who use MGET for batching. I think that it should be solved by updating the existing client APIs to be clearer about what is atomic and what is not. I would like to see the documentation of MGET following this change and how we are going to explain it to users that are less advanced and don't pin slots to specific nodes. I think that the docs are not going to be easy for the common user. |
…n with MGET. Currently we check if all keys are in the same slot, but we could return if all slots are on this node instead. Signed-off-by: John Sully <[email protected]>
Signed-off-by: John Sully <[email protected]>
69bc27e to 9ec45be
|
Just a quick update: I fixed the code to work with the slot cache on the client. We just disable this optimization for cross-slot MGET but leave it on if the MGET is served by a single slot. It still needs tests, but we're at full functionality now. I rebased it onto unstable since this PR is a few months old. |
|
I strongly support this change. Our custom module uses a read-only command that retrieves keys from multiple hashslots but only once per cluster node. With the current implementation, where commands run per hashslot, performance is significantly reduced. This command works as expected in Redis, and the discrepancy in Valkey is quite frustrating. Enabling read-only multi-slot commands would align Valkey's behavior with Redis and restore optimal performance. |
|
I am open to this best effort "cluster mget" idea but I don't think we should change the semantics of the existing |
What is the benefit of having a dedicated command for it? The behavior of MGET will be the same, regardless of whether it crosses slots or not. We could implement some type of flag on the command, "SUPPORTS_FANOUT" or something, that clients can consume to know if they can fan out the operation (one to each node). In another thread I mentioned that fanout works as long as all keys are independently read or written (DEL, MSET, MSETNX, MGET, EXISTS, TOUCH, UNLINK, that's it, that's the list) but not for commands that cross that boundary (SUNIONSTORE, XREAD, etc.). Of the commands that can be trivially fanned out, only MGET and EXISTS are read commands. It does feel like in this case we are just making an optimization for two commands, of which only one is really useful. |
We would be changing the behavior/semantics/contract of the existing |
MSET is a write command; do we want to allow cross-slot write operations? That will break cross-slot write atomicity. It's complicated to do the write during slot migration when one slot is being migrated and the other isn't. I suppose the line I want to draw is that we should only consider multi-slot read commands, as long as all keys are on the same shard. Having multiple READ commands in a transaction would also meet my criteria, as long as each read query only operates on one key at a time. For example, a multi-exec of HGETALL could be fanned out. In this case though, the value isn't as clear because you're adding overhead with the MULTI, which defeats the ask in this PR to improve performance. We could consider the behavior if we added a new HGETALL command that operates on multiple keys; then it would also be eligible for the fanout optimization. Maybe a new command is the best approach then? It seems like we're really trying to make a single tactical optimization. The capability seems like more overhead. |
Just like MSET will break write atomicity, cross-slot MGET will break read atomicity. It's equally complicated to do the atomic read during slot migration when one is being migrated and the other isn't. I see no difference in this regard between read and write commands. The way I see us handling edge cases like these is to reject the command if one of the slots is being or has already been migrated. We would return an error like -CROSSSLOT or -TRYAGAIN, so the client still needs to handle these errors. Just like @PingXie I don't like to make this optimization for a single command, but I like the idea of a client opt-in, signaling that the client is able to handle these edge cases. |
I'm not sure the two types are exactly the same. If all keys are present in a shard, you can execute the MGET command (by the contract). Even if all the keys are present, you still can't execute the write command, as you are supposed to forward the write command to the target shard, but only for the slot that is migrating. You basically can't execute a cross-slot write command while the slot is being migrated without breaking it up. I think it also poses a greater risk when we get to atomic slot migration, since these cross-slot write commands will be getting injected into two different streams. I guess all I'm saying is that I think read atomicity is something we can soften more easily than write atomicity. There is overlap.
I also agree with this. I think the open question is do we prefer to use |
|
We're about to open up a new dimension of commands. If we add, let's say, MGETXS (mget cross-slot), then what's next? existsxs, delxs, msetxs, msetnxxs, ... I don't like where this is going. I'm not sure the difference between reads and writes is that significant either. There are very similar limitations for read commands, though they're not exactly the same as for write commands.
Yes, but if some keys don't exist, it can mean they have been migrated in an ongoing slot migration. For single-key commands, we reply with an -ASK redirect if it's a single key and -TRYAGAIN for multi-key commands. We'd probably return -CROSSSLOT for cross-slot commands in this case. As long as it can happen, the client needs to be able to handle it, same as for write commands. With atomic slot migration, the transfer of a slot is atomic. That's the contract, so I don't see the problem here. We'll have no more -ASK or -TRYAGAIN for either read or write commands. Only if one slot's ownership has been transferred and another's hasn't will we need to return -CROSSSLOT. It's a race condition, yet it can happen, so clients need to handle it, for read and write commands equally. |
Yeah, I basically think we should just start with MXGET (and MXEXISTS if anyone wants it). (For some reason, I really like MX for multi-slot). With the justification that we are just allowing cross-slot on reads.
Today, |
|
Removing this from the 9.1 backlog. Right now this complicates a lot of things in our roadmap. If we can come up with a clean solution we can consider moving forward on it. |
The current restriction for MGET in cluster mode is that all keys must reside in the same slot. This results in very high inefficiency for high-batch use cases, as the batch has to be cracked into individual GETs. Because there are 16k slots, it's very unlikely we will find a pair of keys we can send together with MGET, making it nearly useless in cluster mode.
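For context, the slot of a key is CRC16(key) mod 16384 (hashing only the {tag} part when a hash tag is present), so two unrelated keys collide in the same slot with probability roughly 1/16384. A self-contained sketch of that mapping follows (the server uses a table-driven CRC16; this bitwise version is just for illustration):

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Bitwise CRC16 (XMODEM polynomial 0x1021), the variant used for
 * cluster key hashing. */
static uint16_t crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i]) << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hash only the {tag} part if a non-empty hash tag is present. */
static int key_hash_slot(const char *key) {
    const char *open = strchr(key, '{');
    if (open) {
        const char *close = strchr(open + 1, '}');
        if (close && close > open + 1)
            return crc16(open + 1, (size_t)(close - open - 1)) % 16384;
    }
    return crc16(key, strlen(key)) % 16384;
}

int main(void) {
    /* Unrelated keys almost never share a slot, so plain MGET is rarely usable... */
    printf("user:1001     -> slot %d\n", key_hash_slot("user:1001"));
    printf("user:1002     -> slot %d\n", key_hash_slot("user:1002"));
    /* ...unless the keys share a hash tag, which forces the same slot. */
    printf("{user:1001}:a -> slot %d\n", key_hash_slot("{user:1001}:a"));
    printf("{user:1001}:b -> slot %d\n", key_hash_slot("{user:1001}:b"));
    return 0;
}
```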
Instead we can relax the condition a bit and permit MGET if all slots reside on this node and none are migrating. This allows us to serve the request in the common case. In the case where a slot was migrated or changed, we still send CROSSSLOT and the client will know to break the batch down into individual GETs.
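A rough sketch of the client behaviour this implies, using hiredis (key names are made up): attempt the multi-slot MGET and degrade to per-key GETs only when the node replies -CROSSSLOT.

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Try a (possibly multi-slot) MGET; if the node replies -CROSSSLOT,
 * e.g. because a slot moved, degrade to one GET per key.
 * Demo assumes nkeys < 64. */
static void mget_with_fallback(redisContext *c, const char **keys, int nkeys) {
    const char *argv[64];
    size_t argvlen[64];
    argv[0] = "MGET"; argvlen[0] = 4;
    for (int i = 0; i < nkeys; i++) {
        argv[i + 1] = keys[i];
        argvlen[i + 1] = strlen(keys[i]);
    }

    redisReply *r = redisCommandArgv(c, nkeys + 1, argv, argvlen);
    if (r && r->type == REDIS_REPLY_ERROR &&
        strncmp(r->str, "CROSSSLOT", 9) == 0) {
        freeReplyObject(r);
        /* Fallback: individual GETs, pipelined to save round trips. */
        for (int i = 0; i < nkeys; i++)
            redisAppendCommand(c, "GET %s", keys[i]);
        for (int i = 0; i < nkeys; i++) {
            redisReply *g = NULL;
            if (redisGetReply(c, (void **)&g) != REDIS_OK) return;
            printf("%s = %s\n", keys[i],
                   g->type == REDIS_REPLY_STRING ? g->str : "(nil)");
            freeReplyObject(g);
        }
        return;
    }
    if (r && r->type == REDIS_REPLY_ARRAY)
        for (size_t i = 0; i < r->elements; i++)
            printf("%s = %s\n", keys[i],
                   r->element[i]->type == REDIS_REPLY_STRING
                       ? r->element[i]->str : "(nil)");
    if (r) freeReplyObject(r);
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;
    const char *keys[] = { "batch:key:A", "batch:key:B", "batch:key:C" };
    mget_with_fallback(c, keys, 3);
    redisFree(c);
    return 0;
}
```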
Because we expect there to be cases where we still send CROSSSLOT, simply doing an MGET test on the client is not sufficient to communicate this new support. To help with this, we introduce a new INFO field called "features". This is intended to work similarly to the features flag in lscpu, where new features get added over time. Now the client can check this and determine whether it still needs to break up batches or whether they can be sent in one go.
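A hedged sketch of how a client might gate the optimization on such a field; the `crossslot-mget` token and the exact "features" line format below are purely hypothetical, as the PR text does not spell out the field contents:

```c
#include <stdio.h>
#include <string.h>
#include <hiredis/hiredis.h>

/* Return 1 if the server advertises the (hypothetical) feature token on
 * the proposed "features" line of INFO, 0 otherwise. */
static int server_has_feature(redisContext *c, const char *token) {
    int found = 0;
    redisReply *r = redisCommand(c, "INFO");
    if (r && r->type == REDIS_REPLY_STRING) {
        const char *line = strstr(r->str, "features:");
        if (line) {
            /* Copy just that line so the search cannot run past it. */
            char buf[256];
            const char *eol = strchr(line, '\n');
            size_t len = eol ? (size_t)(eol - line) : strlen(line);
            if (len >= sizeof(buf)) len = sizeof(buf) - 1;
            memcpy(buf, line, len);
            buf[len] = '\0';
            found = strstr(buf, token) != NULL;
        }
    }
    if (r) freeReplyObject(r);
    return found;
}

int main(void) {
    redisContext *c = redisConnect("127.0.0.1", 6379);
    if (!c || c->err) return 1;
    /* Decide once at connect time whether batches must be pre-split. */
    if (server_has_feature(c, "crossslot-mget"))
        printf("server supports multi-slot MGET; send whole batches\n");
    else
        printf("no feature flag; split batches per slot up front\n");
    redisFree(c);
    return 0;
}
```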
Note: tests will come in a little bit but we can start the conversation of the feature now.
Old Behavior:
New Behavior: