Allow multi-slot MGET in Cluster Mode #707
base: unstable
Conversation
We fail the test:
Since the intent is to allow cross-slot operations, I need to investigate whether pubsub has a special restriction we need to continue to hold |
For reference, this was discussed in #507. To recap my perspective: I was okay with allowing commands that aren't read-modify-write commands. Either pure writes (DEL, MSET) or pure reads (MGET) I was okay with operating across slots. |
We have use cases that we simply can't run without this, so it's pretty important for very high-scale jobs. Will dive into the pubsub issue, but I can't think of a reason it needs to restrict this. |
One question we did want to understand was the actual performance benefit of this. In the other thread we discussed whether or not you could just do a deep pipeline of GET operations. |
What was the rationale to not permit read-modify-write? |
It has been a year since we did this, but it's an order of magnitude difference in performance. I can rerun some benchmarks with memtier to get modern data. |
It felt like it was logically breaking the abstraction, since we practically have serializability within a slot. Allowing an atomic read, modify, and then write to a different slot felt like something you would have to carefully architect on the client side to make sure the slots were on the same node. Possible, but not something we wanted to encourage end users to do. |
Ah, you are looking at it from the perspective of higher-order atomics. This is not designed to allow that, and it doesn't: since at any time you can get a CROSSSLOT error, the client must be prepared to break the operation down into lower-order operations. It does require more intelligence from the client, but the performance wins are worth it. Re performance, here are some examples with valkey-benchmark for MSET vs SET:
You can see that MSET and SET do the same QPS, but MSET is doing 10x the number of keys per second. We use MGET more than MSET, but MGET isn't in valkey's benchmark util for some reason. Regardless, we've proven this out in production over the last year and it's super efficient. In fact, the biggest issue is that when you try to add shards the MGET batch sizes shrink, so you lose efficiency, which offsets the scale-out. However, this can be worked around by adding more replicas instead of full shards. |
With a pipeline of 10 it closes up a bit, but it's still a win:
1,136,363 keys/sec for SET vs 4,405,280 for MSET, so still about 4x faster. EDIT: For good measure, here is pipeline 100:
so 1,886,792 keys/sec for SET vs 5,025,120 keys/sec for MSET. So 2.66x faster at the extremes. |
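For anyone who wants to reproduce this, an invocation along these lines should give a comparable SET-vs-MSET comparison (flags are the standard valkey-benchmark options; the exact commands and key counts behind the numbers above aren't shown here, so treat this as a sketch):
```
# plain SET, then MSET (the built-in mset test sends 10 keys per command)
valkey-benchmark -t set  -n 1000000 -r 1000000 -P 10
valkey-benchmark -t mset -n 1000000 -r 1000000 -P 10
```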
Another thing that was discussed was that we could use instruction prefetching or alter the way MGET works to parallelize the fetching of the data out of the dictionary, since a big chunk of the time is spent on TLB and L3 cache misses. We only get that with MGET. |
My experiments with prefetch were mostly failures, but it's always something that rolls around in the back of your head. I'm going to clean this one up since it's really a huge win; then if you do anything to make non-cluster MGET better, the cluster version will improve as well. Also, do take a look at the features field of INFO, as that's a fairly decent change for Redis, but I think it will be more necessary now that there are so many variants out there. |
Thanks a lot for sharing the performance data, @JohnSully. This answers one of my earlier questions on "why not pipelining". I wonder if part of the performance improvement comes from the lower RESP protocol parsing overhead?
I am looking forward to the new update. The way it is coded now, however, is too specific/narrow to be accepted into the mainline, and it changes the contract (whether it is for the better or not is a different question). We discussed a similar situation where externally visible behavior would be changed by a proposal, and you can find the core team's thoughts here. As a result of that discussion, there is now a new |
@JohnSully Did you verify that you got the correct result from a cross-slot MGET?
In cluster mode, we store the keys in one hashtable per slot. We use server.current_client->slot to avoid computing the slot again when looking up each key in the right hashtable. See getKeySlot().
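To make the hazard concrete, here is a self-contained sketch (simplified and hypothetical, not the actual Valkey code, which hashes with CRC16 modulo 16384 and keeps one hashtable per slot) of what goes wrong when the slot cached for the first key is reused for the rest of the command:
```c
#include <stdio.h>

#define NUM_SLOTS 16384

/* stand-in for the per-command slot cache (server.current_client->slot) */
static int cached_slot = -1;

/* placeholder hash; the real server uses CRC16(key) % 16384 */
static int hash_slot(const char *key) {
    unsigned h = 0;
    while (*key) h = h * 31u + (unsigned char)*key++;
    return (int)(h % NUM_SLOTS);
}

/* mimics a lookup helper that trusts the cached slot once it is set */
static int get_key_slot(const char *key) {
    if (cached_slot != -1) return cached_slot; /* keys 2..N reuse key 1's slot */
    cached_slot = hash_slot(key);
    return cached_slot;
}

int main(void) {
    const char *keys[] = {"foo", "bar"};
    for (int i = 0; i < 2; i++)
        printf("%s -> slot %d\n", keys[i], get_key_slot(keys[i]));
    /* both keys report the first key's slot, so the second key would be
     * searched in the wrong per-slot hashtable and appear to be missing */
    return 0;
}
```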
I think it should be client opt-in, a CLIENT CAPA as Ping suggested.
If a client receives -CROSSSLOT because one of the slots is just being migrated, then if the client splits up the command into multiple ones, the atomicity guarantee of the command is lost. Therefore, a client should make it clear to the user of the client that this is a non-atomic command or feature.
With the client opt-in, I don't think the INFO field is necessary. (If we do add it, then I think it should be under '# Cluster' rather than under '# Server'.)
@PingXie re changing the contract: it is only additive, so it should not be a backward-compatibility problem. I.e., cases that used to be errors are now accepted, but no case that did not error will send an error now. This is also a major place where Redis is deficient against databases like Aerospike, so it would be really limiting to gate it behind a config. |
@zuiderkwast The code is correct in KeyDB, but I will check if anything has changed enough to break it. However, we check hash slots for both the first and all later keys, ensuring none are migrating. As mentioned earlier, this is purely for performance; it is not intended to increase atomicity guarantees, and the client must be prepared for slots to migrate away. The reason for the feature flag is that there is a forward-compatibility issue. You cannot naively try the batch on a non-supporting server because of the performance hit of doubling traffic (first the failure, then the GET retry). Customers aren't going to want to update their client and see twice the traffic. So if you don't provide a flag, clients will have a hard time implementing support. In our environment we had exactly this problem, as not all clusters could be upgraded right away, so we ran with a mix for a period. |
@zuiderkwast I will modify the change to only permit cross-slot on reads. This should eliminate the issue of needing current_slot, since you only need to update the counts on write. |
I pushed the following changes (which should also unblock the tests)
|
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
```
@@            Coverage Diff             @@
##           unstable     #707      +/-   ##
============================================
- Coverage     70.99%   70.86%    -0.13%
============================================
  Files           121      121
  Lines         65175    65184        +9
============================================
- Hits          46269    46192       -77
- Misses        18906    18992       +86
```
|
I'm pretty sure KeyDB doesn't have the recent refactoring that stores the keys in one dict per slot. It is not even released yet. :) This refactoring avoids the need to keep a separate key-to-slot mapping, which takes extra memory per key. If you write a simple test case where you read from two different slots, you will notice that you get NULL for the second slot, because it looks for the key in the wrong dict. (A simple fix for that would be to set
Right, we need some way for clients to find out. The thing is, calling INFO on connect is not something clients are encouraged to do, even though your client does that. Clients can see the version in the HELLO response, or we can add more fields to the HELLO response; alternatively, another command to opt in that clients can call after connecting, just as we have, for example, CLIENT TRACKING to opt in to client-side caching.
I don't think we need to limit MSET actually. The problem is not about counts. |
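A quick way to observe the wrong-dict lookup described above (hypothetical transcript; the key names just need to hash to different slots, and running it against a single primary that owns all slots keeps routing out of the picture):
```
127.0.0.1:7000> SET foo 1
OK
127.0.0.1:7000> SET bar 2
OK
127.0.0.1:7000> MGET foo bar
1) "1"
2) (nil)    <- wrong result if the second lookup reuses the first key's cached slot
```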
Ah ok, as mentioned, tests are still pending, so this should not be merged before those. But I wanted to start the conversation earlier. I will write the tests and look for issues integrating with your new feature. |
In the meantime we should sort out the following open questions:
|
@JohnSully were your numbers collected on KeyDB or Valkey unstable? Since the key argument here is performance I am also interested in a rerun after we merge the async IO work. |
A discussion on these questions has already started in issue #507, so we can already see some people's ideas there.
I vote NO to a config, because enabling this per client is better than enabling it globally. The client needs to handle this special logic, so I prefer client opt-in rather than enabled by default. If it's enabled for everyone always, then clients and applications start assuming things and suddenly it's not possible to scale and migrate slots. I'm not sure we want to go down that path.
In the issue #507 we discussed three categories of commands:
@madolson argued that it should be allowed for 1-2 but not for 3. IIUC it's because those commands aren't simply a batch of jobs and can't be used for fan-in fan-out. I tend to agree, but OTOH I'm not sure we need to restrict that. It's cleaner, simpler and more predictable to have no exception to this rule. When cross-slot commands are allowed, then they're all allowed. So I vote for no exceptions. (There may be use cases like finding a lot of sets using SCAN and then aggregating them using SUNIONSTORE, or other yet unknown use cases.) Worth noting: There's already a hack that allows cross-slot commands. It's called Lua scripts without declared keys.
Synopsis: The client can announce capabilities it supports. To avoid an error when the client sends an unknown capability that exists in a future valkey version (or a fork of Valkey or something) the server just ignores unknown capabilities and returns OK. Thus, this command doesn't allow the client to check if the server supports a capability. (We can probably change the return value if we want, since it's not released yet.) Another option is a new command, such as
(@PingXie asked) I want to know this too. If they were done on Valkey with the existing code, then maybe most of the keys were looked up in the wrong dict and got NULL instead of a correct reply. That most likely affects performance. |
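To illustrate the CLIENT CAPA point above: because unknown capabilities are ignored, the reply alone tells the client nothing (the capability name below is made up for the example):
```
127.0.0.1:6379> CLIENT CAPA cross-slot-batch
OK
```
An older server that has never heard of the capability also replies OK, which is why version info in HELLO, an INFO field, or a dedicated query command keeps coming up as the discovery mechanism.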
I don't understand why the server cares whether the client "supports" this or not. Since non-supporting clients already know they can't send cross-slot commands, nothing changes for them. It's the client that needs to know if the server supports it, not the other way around. What is the bad-case scenario for legacy clients here? Now, it may still make sense to use the CAPA handshake for the client to query support for this, and that is fine. But I don't see why the server would change its behavior if a CAPA isn't sent. |
Not always. Some clients are dumb and do minimal checks. Some clients just look up the first key in the command and route the command based on that.
Unaware application programmers start using Valkey cluster without knowing about slots and the cross slot rule. It seems to work because their keys happen to be in the same node. |
I can see the argument to make things more explicit upstream; however, this will only work 33% of the time or less, since the minimum cluster has 3 primaries. They would really have to do no validation to not notice. Also, clients have to do slot->server mapping anyway, so they are probably already going to throw exceptions before it even hits the server. I also don't think we can assume all clients will implement this as a transparent cross-server MGET, since the implicit assumption of atomicity for MGET is broken when it's sent to different servers. Clients may instead choose to expose this in a different way than a transparent MGET that under the hood cracks the operation into smaller ones. At Snap we did choose the weak-consistency option, since we have deep control of our use cases and could make that decision. In the context of the server itself this is all irrelevant, since MGETs are atomic at that level, so we are not breaking any ABI. It is purely additive at that level, and no existing use case will be broken. |
OK, now let's wait for others to catch up and share their opinions. :) Can you tell us more about Aerospike? It seems interesting. I found https://aerospike.com/docs/server/architecture/overview. Does it support cross-node transactions? If it does, there's no unpredictable cross-slot error to worry about, so that's a bit different situation. |
It has a strict and a non-strict mode that you can set. For us, non-strict is the most common. I think strict mode would be super valuable to some people, but most cache users in my experience can trade strictness for performance, so there's lots of value in faster non-strict. |
@JohnSully, to address your concerns effectively, let's revisit the original problem: you're focused on enhancing GET performance, and KeyDB's cross-slot MGET offers a compelling solution for your use case. As I mentioned earlier, since we've been actively developing comprehensive performance optimizations, it's important to assess the performance of our 8.0 release candidate for this use case before proceeding further. IMO #507 covers a broader scope than just MGET, making a per-operation decision on cross-slot functionality less than ideal. A unified strategy for cross-slot operations is necessary, and an RFC could be an excellent starting point to define your desired cross-slot semantics for Valkey. As for Aerospike, their approach, as I understand it, primarily revolves around a tiered caching mechanism to lower costs rather than a specific focus on cross-slot MGET functionality, but I'm curious to hear your insights. |
I can make an RFC, but Snap cannot use Valkey in production without this, since it would be a major cost regression and we use batches heavily. We were hoping this could be in the September release so that we could put Valkey on our roadmap. |
I believe we still have time. Let's redo the performance test on the same hardware spec once we have a release candidate out, which should contain all the major performance improvements we'd like to include. I assume it is acceptable IF Valkey 8.0 (RC) delivers GET performance comparable to that of your current best setup (with cross-slot MGET). In other words, it's not the performance delta between GET and MGET on Valkey 8.0 that matters most, but how the GET performance on Valkey 8.0 stacks up against the (M)GET performance on KeyDB on the same hardware. We can continue brainstorming cross-slot operations too. |
I don't see how you can get over the 4-10x single thread efficiency of batch MGET not to mention the bandwidth improvements, and if you could why wouldn't you want to add it to the performance improvements you are planning? But it is a long weekend in Canada so we can revisit this more next week. |
Let's re-establish this on Valkey 8.0 RC first. I fully support performance improvements, but it's important to have a holistic understanding of the change and its impact on the rest of the system. As it stands, the proposed PR does not align with the proposal in #507, which also needs thorough deliberation. Given the scope of Valkey use cases, we cannot afford to implement pointed fixes without comprehensive consideration. Here is the process that I am thinking of:
Additionally, I would like to understand other ways to help onboard existing KeyDB users with Valkey. Please feel free to file issues for any further support.
SGTM. BTW, there is also the correctness issue that @zuiderkwast mentioned in #707 (comment) (copied below).
|
I'm strongly against this feature. From the PR description the use case is to improve performance. If the feature has a different purpose and I'm missing something please update the PR description with the motivation. |
The issue is that there's a 1/16,000 chance you'll have two keys going to the same slot, so for practical purposes you can't really use MGET in cluster mode. Whereas if you are a 3-shard cluster, that should be a 1/3 probability. We typically have batch sizes of 200 to 500 keys, so we can generally get decent batches out to the servers if they would accept them. In this case the client is doing the job of making sure that the batches are all served by the server we send the batch to; it's just loosening an arbitrary restriction of making it the same slot.
On Thu, Jul 4, 2024 at 8:55 AM, asafpamzn wrote:
> I'm strongly against this feature. From the PR description the use case is to improve performance.
> IMHO, this should be solved at the client side. If the user wants an atomic multi-GET, they should use MGET and verify that all the keys are in the same slot. If the user wants enhanced performance, they should not use a transaction or MGET, but rather a batch API to send a single batch. The client driver will split the batch between the nodes and have small batches.
> If the feature has a different purpose and I'm missing something, please update the PR description with the motivation.
> The batching can be implemented on the client side; we would be happy to get a contribution or an issue for the https://github.com/aws/glide-for-redis repo, and we can support it.
|
It helps if you explain why you're against it.
That depends how you use it. If you're accessing a bunch of keys related to the same user, the same product, etc., then they should be designed with hash tags to put them in the same slot. Then MGET works for these keys. Now, what you're doing is different. You're doing batch processing for performance reasons, which is a special optimization that not every simple web app does. If you fix this PR so it actually works correctly, then we can check the performance and see how much faster than a pipeline it actually is. |
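For the "keys related to the same user" case, hash tags already make a same-slot MGET work today, since only the text inside {} is hashed (key names here are just an example):
```
127.0.0.1:6379> MSET {user:42}:name Alice {user:42}:email alice@example.com
OK
127.0.0.1:6379> MGET {user:42}:name {user:42}:email
1) "Alice"
2) "alice@example.com"
```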
I think that I explained. :) Probably it was not clear. It won't guarantee atomicity, but it will have better performance than your suggestion from the server side. Why? The new client batching API will just stream the commands, and the server will handle them as they go. It will allow multi-slot operations. |
@asafpamzn yes, many clients allow pipelining or async api, where the client can already send one MGET per slot in a single round trip, but @JohnSully claims that a single cross-slot MGET is several times faster than that, so that solution is not satisfying IIUC. I still haven't seen compelling evidence for this though. |
Thanks for the clarification. We did some benchmarks back then with redis-py; see https://aws.amazon.com/blogs/database/optimize-redis-client-performance-for-amazon-elasticache/. I don't know how to link a specific section, but please search for "The following table summarizes the performance of pipelines with a single connection (in RPS)". It shows that in redis-py the […] Thus, I don't think that we should change the server, but rather optimize the clients. In addition, due to clients' implementations of MGET there is confusion about whether MGET is atomic or not in cluster mode, and this change is going to make it more confusing. I saw many users who use MGET for batching. I think that it should be solved by updating the existing client APIs to be clearer about what is atomic and what is not. I would like to see the documentation of MGET following this change and how we are going to explain it to users who are less advanced and don't pin slots to specific nodes. I think that the docs are not going to be easy for the common user. |
…n with MGET. Currently we check if all keys are in the same slot, but we could return if all slots are on this node instead. Signed-off-by: John Sully <[email protected]>
Signed-off-by: John Sully <[email protected]>
Force-pushed from 69bc27e to 9ec45be.
Just a quick update: I fixed the code to work with the slot cache on the client. We just disable this optimization for cross-slot MGET but leave it on if the MGET is served by a single slot. Still needs tests, but we're at full functionality now. I rebased it to unstable since this PR is a few months old. |
The current restriction for MGET in cluster mode is that all keys must reside in the same slot. This results in very high inefficiency for high-batch use cases, as the batch has to be cracked into individual GETs. Because there are 16k slots, it's very unlikely we will find a pair of keys we can send together, making MGET nearly useless in cluster mode.
Instead, we can relax the condition a bit and permit MGET if all of its slots reside on this node and none are migrating. This allows us to serve the request in the common case. In the case where a slot was migrated or changed, we still send CROSSSLOT and the client will know to break the batch down into individual GETs.
Because we expect there to be cases where we still send CROSSSLOT, simply probing with an MGET on the client is not sufficient to communicate this new support. To help with this, we introduce a new INFO field called "features". This is intended to work similarly to the features flag in lscpu, where new features get added over time. Now the client can check this field and determine whether it still needs to break up batches or whether they can be sent in one go.
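A rough sketch of the intended client check (the INFO section placement and the exact token are illustrative, not final in this PR):
```
127.0.0.1:6379> INFO server
...
features:multi-slot-mget
...
```
If the field or token is absent, the client keeps splitting batches per slot as it does today.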
Note: tests will come in a little bit but we can start the conversation of the feature now.
Old behavior:
New behavior:
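Illustratively, for two keys that hash to different slots but are owned by the same primary (key names and exact error text are assumed, not taken from the original examples):
```
# Old behavior: rejected purely because the keys are in different slots
127.0.0.1:7000> MGET foo bar
(error) CROSSSLOT Keys in request don't hash to the same slot

# New behavior: served, since both slots live on this node and none are migrating
127.0.0.1:7000> MGET foo bar
1) "1"
2) "2"
```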