[NEW] Atomic slot migration HLD #23
@madolson and I discussed another option (in the context of atomic slot migration) which essentially implements slot-level replication. Here is the high level flow
Any failure on the target before completing step 8 would abort the migration process on the source.
I have an implementation. Didn't write it down yet, but I recorded a song which explains it all. Please review. https://suno.com/song/b06c2a5b-3760-4916-9f56-eb3fe66f24e2 |
Are you referring to #298? I will take a look next, but I don't see how it solves this issue?
Ship-it |
This atomic slot migration proposal incorporates valuable input from offline discussions with @enjoy-binbin and @murphyjacob4. It builds on the shared foundation of using RDB format for slot data transfer via a child process and the ownership transfer flow, while also addressing considerations discussed during the process, such as support for targets serving traffic, epoch bumps, and the trade-off between availability and durability. @valkey-io/core-team PTAL |
This sounds perfect! Good job! Questions follow.
Availability is already compromised since we pause writes on the source while changing the ownership, isn't it? I don't fully understand this then. Can we pause writes on the target until the epoch has been bumped? Or can't we, because we don't know when a collision will be detected by other nodes?
This formulation makes it look as if these are already existing configs. Are they proposed or just hypothetical ones? I guess this could just be made clearer.
amazing song |
I have another cluster migrate command syntax proposal. @PingXie's proposed syntax covers a slot range with a start and an end, but there is another case: a client may want to migrate some individual slots. So could we make the syntax as below: CLUSTER MIGRATE <source_shard_id> SLOTS <slot_start> <slot_end> [<slot_start> <slot_end>...] (after SLOTS, there must be an even number of parameters). Another thing I am concerned about is the reply message: some slot migrations could fail for some reason, so the reply could report per-slot results, e.g. slot 1: ok
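A minimal sketch of the even-number-of-parameters check this syntax implies (standalone C; validate_slot_pairs() is a hypothetical helper, not the actual argument parser):

```c
#include <stdio.h>
#include <stdbool.h>

#define CLUSTER_SLOTS 16384

/* argv holds only the tokens after "SLOTS"; the count must be even and every
 * pair must be an in-range, ordered slot range. A single slot is expressed as
 * a pair with start == end. */
static bool validate_slot_pairs(int argc, const long long *argv) {
    if (argc == 0 || argc % 2 != 0) return false;
    for (int i = 0; i < argc; i += 2) {
        long long start = argv[i], end = argv[i + 1];
        if (start < 0 || end >= CLUSTER_SLOTS || start > end) return false;
    }
    return true;
}

int main(void) {
    long long ok[]  = {0, 1000, 2000, 2000}; /* range 0-1000 plus single slot 2000 */
    long long bad[] = {5, 3};                /* start > end */
    printf("%d\n", validate_slot_pairs(4, ok));  /* prints 1 */
    printf("%d\n", validate_slot_pairs(2, bad)); /* prints 0 */
    return 0;
}
```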
I have a couple of reservations about the current proposal.
Many clients have built in timeouts that might unexpectedly error. Are we sure we want to make this blocking by default? Very few options block like this, maybe at least have an option to make it not-block. I see there is some discussion a bit later about this.
Why connect to the target over connecting to the source? Seems counter intuitive to me, and doesn't match how the existing migrate command works. The primary is also the owner of the slot, and would logically know now if it can do the migration. It also feels like the source is driving everything, so it makes more sense to send the command to it.
Are we streaming the updates like dual channel?
How is the replica receiving the data from the slot migration? I assumed that the target replica would be forwarded the data from the target primary, but not sure how that would get interleaved into the replication stream.
Are we transferring the slots as a batch or individually? Doing it as a batch increases the overall time, since you have to accumulate more incremental replication data, but makes it more atomic. |
Synchronous vs. Asynchronous Approach for CLUSTER MIGRATE
I see the following variants on this topic:
I don't think we should go with #1 or #3 because forking is expensive. These options eliminate the possibility of the engine ever forking once for multiple slot migrations. #2 is what’s proposed in this issue, while #4 is described/discussed in #587.
I agree. Error handling and progress reporting are critical parts of this design. If we adopt synchronous semantics, we’ll likely need a pub/sub channel or a
The
This question becomes moot if we remove the
From a command shape perspective, we can easily add a
It’s doable. Supporting both slot ranges and single slots adds some complexity, but this is limited to parsing logic, so it’s not a major concern for me.
That’s an excellent and very convincing point. I’m now leaning toward the asynchronous approach.
The core idea here revolves around "slot-level" replication, so a better analogy would be "cluster replicate." I don’t think we need or want to emulate the existing migrate command.
Major Decisions on This Topic:
Ideally, clients would prefer no blocking. But when batching a range of slots, there has to be some blocking time for the sync.
As you said, let us first decide whether it is an async or sync model.
I have no strong opinion on sync vs async, but I guess an interactive tool may want to display a progress indicator of some sort. As Wen said, we can start with sync and add async later as long as the syntax is extensible.
I have no strong opinions on this either. Only some thoughts.
I think the main idea here was to mimic the replication flow. In replication, we tell a node to be a replica of another; we don't tell a primary to replicate to a certain replica.
I think when you dig into the details, the target is actually the one driving most things:
It's possible to implement it in the opposite direction, but it breaks with how the replication (1) and manual failover (2,3,4) primitives are implemented. |
The entire slot-level replication flow will ideally use the same logic as the node-level replication flow. So all improvements that we have made to node-level replication should be reusable, including dual channel. The way I have previously prototyped it is using a new
Just like with chaining replication, we can proxy the data from the source primary to the target replica through the target primary using |
Oh, you're using REPLCONF and PSYNC for the inter-node communication. That's an important detail that wasn't mentioned until now. :) It makes sense that the target is driving it.
This is new? It isn't done now, because it's assumed to have too much overhead, so the target bumps the epoch without consensus iirc.
We proxy a slot RDB stream interleaved with regular write commands for other slots in one replication stream? This sounds like a new concept to me. |
Yeah, this is OK, but the RDB stream is not based off of commands, so we need a way to properly interleave them. We've talked about a command at AWS like
Yeah, I guess I don't really have much of a strong preference either way since both sides need a state machine for tracking the progress. |
You are not required to get consensus to bump the epoch when quorum isn't changing if you are just trying to prevent divergent clusters (also called a split brain), although the election would effectively mitigate the risk of race against a pending failover election in the target shard (and I guess the source shard depending on how the failover is constructed). I guess my question is what are we voting on. If it's based on the epoch, we would effectively be serializing the slot migrations through the cluster for each epoch, which I don't think is a great idea. |
Replication Scheme
That is a great point. I was thinking of converting the key/value pair back to commands on the target primary before forwarding them to the target replicas but I feel that is complicated.
We could/should, though I consider dual channel a "local" optimization, as opposed to the core design. Also, the "replication chaining" question above would have an impact on this topic too.
Source-driving vs Target-driving
When we (@enjoy-binbin @murphyjacob4) discussed offline, we had the mental model of "slot level replication", which lends naturally to "target-driving". But now I can see "source-driving" is more natural for "replication chaining".
Main-thread Loading vs Background-thread Loading
The discussion here is also tied to the "replication chaining" one. If we go with "source-driving", I think this becomes a moot question because all commands run on the main thread. Nevertheless, the following discussion still assumes "target-driving" and "slot level replication".
The other option is to load the RDB using a background thread. The downside is that it might break some modules. Yet another option would be a hybrid approach, i.e., we implement both modes with background loading as the default; if an incompatible module is loaded, we fall back to main-thread loading. I feel that main-thread loading is easier to implement, could very likely offer better perf than the current protocol, and is more compatible.
Slot Ownership Transfer
I am in favor of batch transfer
I don’t see a need to vote, which is synonymous with a consensus epoch bump in the Valkey context. A consensus epoch bump resolves epoch collisions, and that’s about it. In my view, epoch collisions are less of a concern with ASM. Moreover, a consensus epoch bump doesn’t guarantee atomic slot ownership transfer—there’s still a chance that the source node might cling to the old slot even if the rest of the cluster votes for the target. When this happens, writes could temporarily go to/get accepted by the source but would eventually be dropped by the source once the source learns the updated truth. |
I think there are two logical options:
The main benefit of 1 is efficiency when there are a lot of small keys to avoid the overhead of the per key
We need to participate in AOF presumably. AOF with slot migration works today, although I don't know how well it plays with fsync
Yeah, if a replica disconnects we would want to allow a psync without having to do a full-sync if possible.
I'm currently leaning this way too. |
Large keys would still be a challenge I think, if we chunk the data along the key boundary. What do you think about leveraging AOF instead of RDB? The source would still fork, but it would stream the data as commands. @soloestoy curious to hear your thoughts; this is essentially the same idea as your proposal #59, but at the slot level.
Currently - I see three outstanding high level design questions:
Feel free to chime in if there are any other open questions I am missing that we need addressed.
Semantics for transferring slot data
Just like for replication-related syncs, it breaks down into a snapshot and a backlog of changes. @PingXie and I have discussed offline using the AOF format (i.e. a command stream) for the snapshot. As mentioned by Ping, AOF has the benefit that we can multiplex the snapshot command stream with client commands on the target's replication link. This would also be achievable by using
I think using AOF rewrite to serialize the snapshot will give us a majority of the benefit at a lower cost. The implementation of
Source vs target driven
I see a few drawbacks to being source driven:
Overall, it feels like the majority of the state and state transitions will need to be owned by the target node, and there is an additional syntactical advantage that when communicating with the target you do not need to provide the origin node - just the slot - and it can use topology to determine ownership. So my vote would be for "target-driven".
Election vs bump
I think the main reason would be to prevent a lost write scenario when there is a failover on either source or target (whichever we choose not to own the election). If All that said, I think this is not the only data loss condition in the current protocol. Even with an election, we could have situations where the source unpauses itself after its pause timeout is reached, but the target election eventually succeeds, causing lost writes. My vote would be on bumping the epoch instead of going through election, and we should defer the data loss issues with slot migration as a different project (cluster v2 maybe?). |
We lose a bit more efficiency (at least my understanding is that RDB load is faster than command execution), but we get some compatibility benefits between versions, which is nice. I imagine sending it as a RESTORE command will be strictly faster and achieve the same outcome, so I'm not sure about the benefit of the AOF format.
RESTORE requires that we serialize an entire key. On both source and target, we need additional memory overhead to store the dump of the key in memory (for serialization and deserialization, respectively). In AOF, rewrite will be able to break a big hash into many HMSET calls with AOF_REWRITE_ITEMS_PER_CMD elements each, but RESTORE means the smallest chunk we can send at a time is the entire hash. This is one of the big downsides of current slot migration implementation and oftentimes means that we can't move slots that have large keys if we don't have an equivalent amount of free memory on the node. Some data structures are O(GB) in many workloads. To be able to chunk the RESTORE command on a key level - I think we will need the RDBCHUNK functionality - which I see as an incremental improvement. For atomic slot migration - we should use REPLCONF to negotiate the data format, so that we can later change how the data transfer happens. I also generally see an AOF formatted full sync as being a useful primitive for other things (as you mentioned, AOF could be loaded in a forwards compatible manner). I also do see the benefit of RDBCHUNK, but as I mentioned above, I'm just not sure on the return on investment over doing something like AOF for this feature. |
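To illustrate the chunking argument above, here is a small standalone sketch (not Valkey's actual aof.c) of how a large hash can be emitted as several bounded HMSET commands instead of one monolithic per-key payload; the 64-pair limit mirrors AOF_REWRITE_ITEMS_PER_CMD, and emit_hash_in_chunks() is a made-up name:

```c
#include <stdio.h>
#include <stddef.h>

#define AOF_REWRITE_ITEMS_PER_CMD 64

struct field { const char *name, *value; };

/* Emit "HMSET <key> f1 v1 f2 v2 ..." in bounded chunks. Here we just print
 * the commands; real code would append them to the outgoing command stream. */
static void emit_hash_in_chunks(const char *key, const struct field *fields, size_t n) {
    size_t i = 0;
    while (i < n) {
        size_t batch = n - i;
        if (batch > AOF_REWRITE_ITEMS_PER_CMD) batch = AOF_REWRITE_ITEMS_PER_CMD;
        printf("HMSET %s", key);
        for (size_t j = 0; j < batch; j++, i++)
            printf(" %s %s", fields[i].name, fields[i].value);
        printf("\n");
    }
}

int main(void) {
    static char names[150][16], values[150][16];
    struct field fields[150];
    for (int i = 0; i < 150; i++) {
        snprintf(names[i], sizeof(names[i]), "field%d", i);
        snprintf(values[i], sizeof(values[i]), "value%d", i);
        fields[i].name = names[i];
        fields[i].value = values[i];
    }
    /* 150 fields -> three HMSET commands of 64, 64 and 22 pairs. */
    emit_hash_in_chunks("myhash", fields, 150);
    return 0;
}
```

Neither side ever has to hold the whole key in a single buffer, which is the memory-pressure point being made above.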
We aren't solving this problem right? Whether it's AOF or RESTORE, both cases end up running out of memory on the target.
AOF isn't forward compatible. If you introduce new functionality, it doesn't know how to load it (HFE, for example). I feel like we make this claim periodically, but AOF and RDB are the same in terms of being backwards compatible. AOF just chooses to standardize on a single high-level format for compatibility at the cost of efficiency. I guess I generally would bias towards performance in the majority case instead of some theoretical forward compatibility story.
I'm OK with an incremental improvement. Although in this specific case, I'm not sure it's enough work to split between two minor versions since we need to implement a handshake. I think we either commit to AOF or to RESTORE with a chunk. |
@PingXie Can you update the comment with the updates to the proposal so that it's easier for someone to come in and review? This thread is a bit long, but I think we're close to having consensus so worth to make sure it's easy to send this to some other folks. |
Did you bug my office? 😂 I've been working on updating the proposal today on and off. Hoping to wrap it up by the weekend🤞 |
I guess I wasn't clear in my explanation. You do need the target to be able to fit the new data, yes; however, there is an overhead to non-chunked RESTORE. To give an example, suppose I am migrating a 5 GiB hash on a node currently at 10 GiB of usage. In order to RESTORE that key, the target has to buffer the entire 5 GiB serialized payload in memory before it can apply it, on top of its existing usage. The idea being that with AOF format, the size of that interim overhead shrinks drastically from the size of the key (5 GiB in this example) to the size of a chunk of elements in that key (and adjusting the number of elements to try to maintain a max in-memory size shouldn't be difficult).
I don't disagree with you. Instead of saying "forward compatible", let's say that AOF allows for lossy downgrade, which is something I don't think we can do with RDB since I don't think we have a way to skip over unknown types effectively. Point being - it could be useful outside of atomic slot migration for those who are okay with data loss. If we want perfect forward compatibility - we should talk about a protobuf-based RDB v2 😉
We can talk about how a chunked RESTORE design would look. What I have in my head is similar to what you mentioned with RDBCHUNK:
Really, it is just a mechanism for push-based RDB transfer; it wouldn't necessarily be tied to just slot migration and could be used for import/export use cases as well (i.e. I want to take data from a backup and load it on a live node). Calling
When we get a
If either 2 or 3 fail due to not enough data in So this would effectively be a refactor of Line 1859 in 11cb8ee
Looking at the state machine and protocol - I am seeing a decent amount of complexity, when we can get AOF for "free" today. But maybe I have overcomplicated it? |
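For reference, a rough sketch of the receiver-side buffering a chunked transfer could need, under the assumption of a simple length-prefixed element framing; command names, framing, and helpers here are hypothetical, not an existing Valkey protocol:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>

struct chunk_buf {
    char  *data; /* bytes received but not yet applied */
    size_t len;
};

/* Elements are framed as: 1-byte length + payload. Returns bytes consumed,
 * or 0 if the buffer does not yet hold a complete element. */
static size_t try_apply_one(const char *data, size_t len) {
    if (len < 1) return 0;
    uint8_t elem_len = (uint8_t)data[0];
    if (len < 1u + elem_len) return 0; /* need more data */
    printf("applied element: %.*s\n", (int)elem_len, data + 1);
    return 1u + elem_len;
}

/* Feed one network chunk: append, then drain as many complete elements as
 * possible. Memory held between chunks is only the partial element at the
 * tail, not the whole key. */
static void feed_chunk(struct chunk_buf *b, const char *chunk, size_t len) {
    b->data = realloc(b->data, b->len + len);
    memcpy(b->data + b->len, chunk, len);
    b->len += len;
    size_t off = 0, n;
    while ((n = try_apply_one(b->data + off, b->len - off)) > 0) off += n;
    memmove(b->data, b->data + off, b->len - off);
    b->len -= off;
}

int main(void) {
    struct chunk_buf b = {NULL, 0};
    feed_chunk(&b, "\x05" "hell", 5);  /* element split across chunks */
    feed_chunk(&b, "o" "\x03" "foo", 5); /* completes "hello", then "foo" */
    free(b.data);
    return 0;
}
```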
Updated. I will close #587 next now that this proposal is going with ASYNC behavior too.
I see the fundamental constraint here being on the receiving end. Regardless of chunking or the encoding format, the receiver must ultimately apply changes as complete data types. This means that if we transfer composite data structures as-is, the receiver needs to buffer/track them, which leads to significant memory pressure for large keys. AOF addresses this by allowing us to send the data in its smallest components – the primitive data types. These are generally small in practice, which minimizes the memory burden on the receiver. By breaking down composite structures into their constituent primitives, we effectively mitigate the memory issues associated with large keys. This approach aligns with the observation that large keys often result from users employing composite data types that grow over time. By decomposing these structures into their atomic elements for transfer, we address the root cause of the problem. |
The code we have internally is less than 200 lines, so I guess I'm not really seeing the complexity. There are pretty straightforward ways to avoid the large-object buffering problem by partially serializing the key as you process the chunks of memory. The RDB format is more memory efficient than the in-memory representation.
This is still not forward compatible 😛. As soon as we introduce hash field expiration it breaks. We can't be forward compatible unless we pre-define all future changes we are going to make.
I guess my question boils down to is this a pathway worth complicating for performance. I agree we can get AOF for a lower implementation cost, but we can be more efficient and more dense across the network. |
If we want to interleave a dump in a stream of commands, then AOF seems trivial while anything else is non-trivial, so I'd prefer to start with that. We have REPLCONF VERSION, so it's easy to extend this later. I'd like to cut the scope of this feature because I want to see it implemented and merged in the foreseeable future.
FWIW, I suppose starting with AOF and then evaluating changing it makes sense. There is a lot of code we need to write; I don't think cutting this corner makes sense, but I don't think it materially changes the overall design.
So lets get to writing/reviewing/merging code. Can it be broken down in smaller tasks in some way? Is there a WIP implementation anywhere? |
Thanks to @PingXie for the hard work updating the top comment. However, I have 2 questions about the design; I hope you can clarify my concerns:
@zuiderkwast I have a WIP implementation here: unstable...murphyjacob4:valkey:atomic-slot-migration It is somewhat out of sync with the design. Going to spend the rest of today resolving merge conflicts with unstable and incorporating the design changes, then I can post a draft PR. My optimistic ETA would be Thursday. I can also chunk it into multiple PRs if that would be easier to review |
@murphyjacob4 this is great news! I don't know if it needs to be split into multiple PRs. How many lines of code is it? @PingXie and others, regarding these commands:
They will appear under the
Regarding the internal commands:
When a slot migration is cancelled, should the target send |
I re-read the top comment and discussion, sorry for jumping in so late. I've always had anxiety about long sentences and have been busy with other things lately. I will first talk about the top comment, and then try to explain my previous code, https://github.com/enjoy-binbin/valkey/commits/slot_migrations/. I discussed this solution with Ping offline a long time ago, and I also tried to write the code before; it is not perfect yet, but it is something.
About the top comment
I see
About the discussion
Sorry, I decided to skip it since I lost track of it a while ago; I hope the top comment is the latest one.
About our implementation
I'm going to take a moment to explain our approach. I discussed this with Ping and Zhaozhao offline last year; at that time we agreed on some things, but sadly I now see this topic has moved on a lot. This is the branch https://github.com/enjoy-binbin/valkey/commits/slot_migrations/ & https://github.com/valkey-io/valkey/compare/unstable...enjoy-binbin:valkey:slot_migrations?expand=1 that I posted to Ping last year, and I have a document about it, but sadly it was written in Chinese so I am not able to post it. I know people get sick of reading a lot of code, but I hope you can take a moment to scan it, since I hate to waste the effort; it is not perfect, but it can work. There is a
I will briefly describe the implementation. The main process is slot sync + slot failover. In an A/B/C cluster, let's say we are trying to migrate slots 0-1000 from A to C, and slots 2000-3000 from B to C. The high level:
Other stuff:
Sorry for typing so much text without pictures. Maybe it's because of preconceptions, but I kind of like our current solution. I'm not asking that we must accept Tencent Cloud's solution. I hope you will like it, or have more ideas after reading this. I had only discussed it with Ping offline in Chinese before, so maybe it will interest you.
I was a bit over-optimistic, but I have a draft PR out now: #1591. It is just about 1k LOC, but will be more with tests
I just use a connection close to signal this in my PR
I opted for this style in the PR - since it turns out there was more state transitions needed. But we can talk more there
I think that the solution is quite similar to what I have implemented (although, API naming is different and there are some slight semantic differences). But overall it is quite similar, with the primary difference being AOF vs RDB based snapshotting. Please feel free to take a look at the draft PR and let me know if you'd like to collaborate further! Happy to set up a shared branch |
@enjoy-binbin Thanks for sharing your solution and what was discussed before in Chinese. :) I think the solution is similar, just a few differences. Your solution seems a bit simpler. I like simple.
Three steps? CLUSTER IMPORT SLOTS is just one command. I think one command seems better. The target can complete the switchover immediately when the sync is complete, without another admin command.
Importing one slot range from one node can succeed and importing another range from another node can fail, if the source node crashes or something. Is that right @murphyjacob4? If there is a failover in the source shard, shall the target node retry importing from the new source primary?
If we can avoid these (STATUS and CANCEL) in the first step, it can be simpler, and we can avoid the op-id. I like the approach of doing the minimal solution and later extending it based on what we need. To compare with replication, how do we abort a full sync? We use
Extending SYNC or adding a new command SYNCSLOTS is not a big difference, but if we want to use REPLCONF to exchange the offset and other things, then maybe SYNC is better? I don't know why we need a session-id. Just closing the connection seems enough to me. A related feature that needs replication per slot is #1372. Just keep in mind if we ever want that feature. |
Actually there is not much difference, I think; it can be changed easily. People anyway need a way to check the offset or check the status; it is kind of the same job, unless people plan to check the results the next day. Another reason, I guess, is that we want to split it: we want to do the SLOT FAILOVER only when we require it, since the failover needs a pause. Note that the failover may time out if something bad happens. We have had this issue before with Consensus-Based Bumping and the old cluster failover: the manual failover timed out and the primary stayed paused for a long time. This could be a bug in the code, a blocked node, or a network problem. In our fork, we have an arbiter mode. In this mode, we transfer voting rights to arbiter nodes, which are empty primary nodes; arbiters don't process commands, so they never get blocked by user commands. We use this arbiter mode everywhere, for example in one-shard clusters or clusters that span multiple availability zones. In a large cluster, only a few arbiter nodes are needed; their voting is very stable and they are not bound to a primary node or AZ. I think this arbiter mode is great stuff we want to open-source, but Madelyn doesn't like it, if I recall. So the reason here is that we want to manually trigger the slot failover in a confirmed situation. Of course, this can also be made automatic.
the target node will retry (re-init the slot link) with the new primary in this case. |
From my understanding,
Thank you, @enjoy-binbin, for the detailed write-up and for contributing your insights to this discussion! I appreciate the effort you've put into exploring this area, and your input has been incredibly helpful in shaping our direction. Capturing the offline sync with @enjoy-binbin, @soloestoy, and @hwware:
We agree on the importance of supporting cancellation and a robust reporting mechanism for slot migration. Since migrations, whether atomic or not, can be long-running operations, operators need a clear way to monitor and abort the process.
We've aligned on using the AOF format for synchronization. This approach offers the advantages of relatively low overhead on the main thread, eliminating the background loading work, and automatic replication to replicas. Optimizations like dual-channel replication and filtering irrelevant delta changes can be explored after the core functionality is implemented.
We agree that consensus-based epoch bumps aren't essential for the initial release. As discussed, addressing reliability concerns, such as those related to epoch management, can be better handled within the framework of issue #1355.
We identified a new requirement: separating data migration from slot ownership transfer, giving the operator control over each step. This is important because transferring slot ownership can have a greater impact on clients than migrating data, as not all clients handle topology changes efficiently. Allowing just the slot ownership transfer to be performed separately during a dedicated downtime minimizes potential disruptions.
Next Steps
This is a complex project, so to make it easier to review and manage, I propose breaking it down into smaller PRs. Each PR can deliver a working piece, even if it's not perfect at first, and we can improve things incrementally. I suggest we all take a look at both @enjoy-binbin and @murphyjacob4's PRs to get up to speed. Thanks again to everyone for their contributions and collaboration!
Right. If I am an operator of a 9 shard cluster and I am adding a tenth shard, I will go to the new node and enqueue 9 separate slot migration operations, one for each target node. It is possible for 8 of those to succeed, but for the 9th to encounter issues that result in failure. As an operator, I would want to poll on the status of all 9 operations and look into the failure, and maybe retry
Actually - @PingXie - in the PR I wrote this was required to be implemented due to the way chaining replication works. We reuse the chaining replication code which means that the target primary directly proxies its received contents from the slot migration link to its replica links. However, the replicas aren't aware that the slot migration is ongoing, so don't have the context to do the filtering on their end. Instead of adding a replica-informing layer to the protocol, it was easier to filter on the source primary, which also has the benefit of reducing network bandwidth and overall overhead - at the cost of requiring the primary to maintain a dedicated output buffer for the slot migration instead of the replication backlog. |
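A minimal sketch of the source-side filtering described above, assuming a hypothetical propagate() hook: only commands whose keys hash into a migrating slot range are fed to the dedicated slot-migration output buffer (hash tags and multi-key commands are ignored for brevity):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* CRC16-CCITT (XModem), the checksum Valkey uses for key hash slots. */
static uint16_t crc16(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint8_t)buf[i]) << 8;
        for (int j = 0; j < 8; j++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021) : (uint16_t)(crc << 1);
    }
    return crc;
}

static unsigned key_hash_slot(const char *key) {
    return crc16(key, strlen(key)) & 16383; /* 16384 slots */
}

/* Pretend slots 1000-2000 are being migrated. */
static bool is_migrating_slot(unsigned slot) { return slot >= 1000 && slot <= 2000; }

/* Decide, per command, whether it also goes to the slot-migration buffer. */
static void propagate(const char *cmd, const char *key) {
    unsigned slot = key_hash_slot(key);
    if (is_migrating_slot(slot))
        printf("slot %u: feed '%s %s' to the slot-migration output buffer\n", slot, cmd, key);
    else
        printf("slot %u: '%s %s' goes to the replication backlog only\n", slot, cmd, key);
}

int main(void) {
    propagate("SET", "foo");
    propagate("SET", "bar");
    return 0;
}
```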
@enjoy-binbin, @PingXie, and I discussed offline. Binbin and I are joining efforts and going to iterate on the Tencent solution to meet the requirements we outlined here. Once this is ready, we will send a shared PR for review. Given this, I have closed my pull request. I will be able to repurpose much of this into the Tencent solution to make it AOF based and not require replica full sync on the target replicas. Here is the new shared branch that @enjoy-binbin has set up for us to work on: https://github.com/enjoy-binbin/valkey/tree/slot_migration_new |
It looks like you've been updating the problem statement based on comments. So just reviewing that. Sorry if I repeat responses from others. Firstly, nice to see atomic slot migration. IMO the existing solution with non-atomic is simply broken as it provides no mechanism for multi-key commands to function reliably. Questions:
CLUSTER IMPORT STATUS
CLUSTER IMPORT CANCEL
SYNCSLOTS (internal)
TRANSFERSLOTS (internal)
High-Level Migration Flow
Other questions
Yeah, I think there are still several fit-and-finish questions we need to address before we can consider this done. So far, what we’ve agreed upon is just the core logic. I’ll share my current thoughts below.
CLUSTER IMPORT SLOTS is a user-facing command, whereas SYNCSLOTS is internal. These two can run in parallel, and there’s no strict 1:1 mapping between them at the protocol level. For example, a single IMPORT SLOTS command could result in multiple SYNCSLOTS commands, or two IMPORT SLOTS commands could potentially be combined. As a result, the error handling for these two commands needs to be defined independently.
I can include a diagram to illustrate this. At a high level, I’d imagine the target node drops all active connections to the sources, stops processing the remaining slots associated with the op_id, and deletes incomplete slots. Slots that have been fully migrated should remain intact. I’m also open to using ABORT instead of CANCEL.
We can start with the "one-shot" ASM approach, but I acknowledge that there’s a valid use case for a "two-stage" ASM. In this approach, data migration happens first, and topology updates occur during planned downtime. This avoids the higher impact typically associated with topology changes. That said, I view the "two-stage" ASM as incremental work.
The need for STATUS and CANCEL arises because IMPORT SLOTS is an asynchronous call. However, the requirement for op-id could be eliminated if we enforce a single active migration session at a time or allow the ranges to be amended dynamically. I agree this would simplify the process a lot.
Following the discussion above, if we remove op-id and allow only one active IMPORT SLOTS session at a time or allow the operator to amend the ongoing migration, we could support slot-range-based cancellations. I’m increasingly in favor of this option.
For internal commands, I’d like to prioritize clarity. Reusing existing replication commands might confuse developers later on. I’m open to renaming SYNCSLOTS—perhaps IMPORTSLOTS would be a better alternative?
The session ID was intended to link the slots being migrated with those whose ownership is being transferred, as these represent two separate internal states. However, now that we plan to support "two-stage" ASM in the future, I agree we should just converge on slots. I’ll update the HLD accordingly.
Need to think about that some more.
CLUSTER IMPORT SLOTS
Allowing this would provide more flexibility but also introduce additional complexity. I think eventually we should support it but I am also OK with a more restrictive experience.
As far as the high-level design is concerned, there’s no strict requirement for ordering.
I’m not entirely sure about the "pull" model reference, but my current thinking is to allow operators to amend the slots to be migrated (without relying on op-id). The server would then process each slot individually, choosing the best migration plan dynamically.
My mental model is that IMPORT SLOTS would perform the basic input validation, such as ensuring slots are within range. Operators could then use IMPORT STATUS to check the actual execution progress and status.
CLUSTER IMPORT STATUS
Agreed. We should eliminate op-id entirely and rely solely on slot numbers.
If we remove op-id altogether, IMPORT STATUS would return detailed information about the slots submitted for migration. Since there would only be one ongoing operation, managing the op-id state becomes unnecessary.
CLUSTER IMPORT CANCEL
I now think the cancellation should be slot-based, for example: CANCEL 100-200, 301, etc..
session_id will also be removed. I now believe everything should converge around slot numbers.
For observability, we should log all errors encountered in the server logs. Additionally, we could maintain a collection of migration records with stats such as:
TRANSFERSLOTS (internal)
This operation should be atomic, as it primarily involves bumping the node’s config epoch and updating the slot ownership. I’m open to name suggestions and we can continue the discussion in the PR. :)
High-Level Migration Flow
Right, separate source nodes would require their own "sessions." Either parallel or sequential execution works for the high-level design. We could start with the simpler approach and optimize incrementally.
Correct. The current plan is to filter at the source. This means we won’t reuse the normal repl_backlog. While a dedicated log could be introduced, my understanding from offline discussions is that we can initially use the client output buffer. Resuming interrupted migrations would require a new backlog mechanism, but that isn’t a priority/concern for now IMO.
Other questions
That is a great question. My initial thought is that we should cancel all ongoing migrations first and then let the FLUSHALL command proceed. |
Last Updated: Jan 12, 2025
Problem Statement
Efficient slot migration is critical for maintaining the scalability and resilience of Valkey clusters, but the current client-driven migration approach presents significant challenges.
Operators currently need to perform multiple manual steps to migrate slots, including setting migration states, transferring data, and updating ownership, all of which are prone to errors and inefficiencies.
While existing migration methods ensure availability during the process, they rely on mechanisms like redirection (-MOVED, -ASK) to maintain client access to data. These mechanisms introduce significant client-side complexity, requiring clients to handle redirections, retries, and reconfigurations during migration. Ensuring atomicity and consistency during migrations remains a challenge, especially for operations involving multiple keys or large datasets.
Compounding these challenges, limited observability leaves operators without adequate insight into the migration process, making it difficult to monitor progress or resolve issues efficiently.
Command Definitions
CLUSTER IMPORT SLOTS
CLUSTER IMPORT SLOTS <slot_start> <slot_end> [<slot_start> <slot_end>...]
- <slot_start>: The starting slot number of a range (integer between 0 and 16383).
- <slot_end>: The ending slot number of a range (integer between 0 and 16383 and equal to or greater than <slot_start>), inclusive.
- Returns <op_id> on queuing success, where <op_id> is a UUID for the long running import operation, or -ERR <error message> on failure (e.g., invalid slot range).

CLUSTER IMPORT STATUS
CLUSTER IMPORT STATUS OPID <id>
- <op_id>: The unique identifier of the import operation, as returned by CLUSTER IMPORT SLOTS.
- Returns -ERR <error message> on failure (e.g., invalid operation ID).

CLUSTER IMPORT CANCEL
CLUSTER IMPORT CANCEL OPID <id>
- <op_id>: The unique identifier of the import operation, as returned by CLUSTER IMPORT SLOTS.
- Returns +OK on success, or -ERR <error message> on failure (e.g., invalid operation ID).

SYNCSLOTS (Internal)
SYNCSLOTS SLOTS <slot_start> <slot_end> [<slot_start> <slot_end> ...]
- <slot_start>: The starting slot number of a range (integer between 0 and 16383).
- <slot_end>: The ending slot number of a range (integer between 0 and 16383).
- Returns <session_id> on success, where <session_id> is a UUID for the long running full sync operation, or -ERR <error message> on failure. Possible error messages include:

TRANSFERSLOTS (Internal)
TRANSFERSLOTS SESSIONID <session_id>
- <session_id>: The unique identifier of the import operation, previously provided by the source node in response to a SYNCSLOTS command.
- Returns <tracked_changes_size> on success, where <tracked_changes_size> is the number of bytes of remaining tracked changes, or -ERR <error message> on failure. Possible error messages include:
  - -ERR Invalid session ID.
  - -ERR Slot transfer already in progress.
Operator Workflow
Consider a Valkey cluster comprising three nodes (A, B, and C), each serving a distinct range of hash slots. To optimize resource utilization, an operator decides to migrate slots 1000-2000 from Node A to Node C. This migration is initiated by connecting to Node C via valkey-cli and issuing the command CLUSTER IMPORT SLOTS 1000 2000. Node C responds with a unique identifier for the import operation.

The operator can then periodically monitor the progress of this migration using the command CLUSTER IMPORT STATUS followed by the unique identifier. This command provides a high level slot migration status. This allows the operator to track the migration's progress and identify any potential issues.

Should the need arise to halt the migration, perhaps due to an unforeseen event or a change in operational requirements, the operator can issue the command CLUSTER IMPORT CANCEL along with the corresponding import identifier. This command signals Node C to stop the ongoing migration process.

Upon completion of the migration, the operator can verify the successful transfer of slots 1000-2000 to Node C by inspecting the cluster's slot distribution using commands like CLUSTER NODES. This confirms that the data has been correctly redistributed within the cluster as intended.

High-Level Migration Flow
When a target primary receives a CLUSTER IMPORT command, it groups the slots for import by source primary node. It then connects to the source primaries and starts the migration using a new internal command, SYNCSLOTS. During migration, the target primary acts as a "slot-replica" of the source primary, conceptually. Upon receiving SYNCSLOTS, the source primary forks a child process to create a slot-specific AOF command sequence containing only the keys in the specified slots. This sequence is streamed to the target primary, which executes it on the main thread and replicates it normally to the target replicas. To ensure all changes to the migrating slots are captured, incremental updates (deltas) from the source shard are also streamed to the target primary, as a replica would handle live updates from its primary. Optimizations similar to dual-channel full sync could also be considered.

During atomic slot migration, the target primary only takes ownership of the slots after fully catching up with the source primary. This is done by freezing writes to the source primary for the migrating slots after the target receives the full slot snapshots. Minimizing the delta between the primaries before ownership transfer is important to reduce the unavailable window for writes (due to pausing). The target receives delta changes during the write pause. Upon achieving full synchronization, the target bumps up its config epoch without consensus. Note that the source primary’s pause of all writers is done with a timeout (potentially the same cluster node timeout). The source primary resumes writes upon detecting the ownership transfer (via the target's cluster message with a higher config epoch). If the target doesn't claim the slots before the timeout, the source unpauses writers and retains ownership. While this may cause data loss in some rare cases, it is acceptable for this design, and further solutions are explored in issue #1355.
If the source primary fails over before migration completes, the target primary can retry with exponential backoff or proceed to the next source. This implementation detail can be discussed further in the PR review.
The migration concludes when the source shard relinquishes ownership of the migrated slots, removes the associated data, and redirects client traffic to the target shard.
Neither the source nor target replicas are aware of the ongoing slot migration. Target nodes can either be new or serving client traffic. If a source or target replica fails during this process, it will simply perform a full synchronization from its respective primary as though no migration is underway.
Client traffic, including both read and write commands, to the slots in question continues to be directed to the source shard throughout the migration process, with no awareness of the migration.
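A hypothetical sketch of the per-migration state a target primary could track for the flow above; the state names and helper are illustrative, not the actual implementation:

```c
#include <stdio.h>

typedef enum {
    SLOT_MIGRATION_CONNECTING,   /* connecting to the source primary            */
    SLOT_MIGRATION_SNAPSHOT,     /* receiving the slot-specific AOF snapshot    */
    SLOT_MIGRATION_STREAMING,    /* applying incremental deltas (catch-up)      */
    SLOT_MIGRATION_PAUSED,       /* source paused writes, draining the tail     */
    SLOT_MIGRATION_FINISHED,     /* slots claimed, epoch bumped, broadcast sent  */
    SLOT_MIGRATION_FAILED        /* aborted; surfaced via CLUSTER IMPORT STATUS */
} slot_migration_state;

static const char *state_name(slot_migration_state s) {
    static const char *names[] = {
        "connecting", "snapshot", "streaming", "paused", "finished", "failed"
    };
    return names[s];
}

int main(void) {
    /* Happy path walked in order, mirroring the flow described above. */
    for (slot_migration_state s = SLOT_MIGRATION_CONNECTING; s <= SLOT_MIGRATION_FINISHED; s++)
        printf("state: %s\n", state_name(s));
    return 0;
}
```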
Inter-Node Slot Import Protocol
The process of transferring slots from a source node to a target node involves an orchestrated sequence of steps to ensure data consistency and availability. This sequence begins with the target primary node sending a SYNCSLOTS command to the source primary, specifying the desired slot range. The source node responds with a unique session ID to identify this specific import operation.

Upon receiving the SYNCSLOTS command and acknowledging with the session ID, the source node forks a child process dedicated to streaming an AOF snapshot of the requested slots to the target. Concurrently, the source node begins tracking changes made to these slots only. This granular tracking mechanism allows the source node to maintain a record of all modifications to the migrating slots without having to pause writes to the entire keyspace, thereby preserving write availability for unaffected slots. This represents a trade-off: increased memory consumption in exchange for enhanced write availability.

Once the target primary has finished processing the AOF snapshot and has sufficiently caught up with the ongoing changes, it sends a TRANSFERSLOTS SESSIONID <session id> command to the source. This command signals the source to finalize the slot transfer.

In response to TRANSFERSLOTS, the source node pauses all further writes to the slots being migrated. This effectively freezes the tracked changes, ensuring that no new modifications are made to the data while the final transfer is in progress. The source then replies to the TRANSFERSLOTS command with the size of the tracked changes, indicating the amount of data the target node needs to receive to be fully synchronized.

The target node, upon receiving the size of the tracked changes, starts counting the number of bytes received from the source. As soon as the expected number of bytes has been received, the target node is certain that it has a complete and up-to-date copy of the migrating slots. At this point, the target node proceeds to claim ownership of the slots and bumps up its config epoch.
Immediately after claiming the slots and updating its epoch, the target node broadcasts this information to the entire cluster. This broadcast ensures that all other nodes are aware of the new slot ownership and can redirect client requests accordingly.
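A minimal sketch of the target-side cutover check described above, with illustrative field and function names: once TRANSFERSLOTS has returned the size of the remaining tracked changes, the target counts the bytes it applies and claims the slots when everything has arrived:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct slot_sync {
    uint64_t tracked_changes_size; /* size reported in the TRANSFERSLOTS reply */
    uint64_t bytes_applied;        /* tracked-change bytes applied so far      */
    bool     finalizing;           /* TRANSFERSLOTS has been sent and answered */
};

/* Called for every buffer applied from the slot-migration link. Returns true
 * when the target is fully caught up and may claim the slots and bump its
 * config epoch. */
static bool apply_and_check_cutover(struct slot_sync *s, uint64_t nbytes) {
    s->bytes_applied += nbytes;
    return s->finalizing && s->bytes_applied >= s->tracked_changes_size;
}

int main(void) {
    struct slot_sync s = { .tracked_changes_size = 4096, .bytes_applied = 0, .finalizing = true };
    printf("%d\n", apply_and_check_cutover(&s, 1024)); /* 0: still behind      */
    printf("%d\n", apply_and_check_cutover(&s, 3072)); /* 1: claim, bump epoch */
    return 0;
}
```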
Major Design Considerations
CLUSTER IMPORT Execution Model (ASYNC)
The proposed CLUSTER IMPORT command uses an asynchronous approach for better client compatibility and resilience to network issues. This requires two additional commands for progress monitoring (CLUSTER IMPORT STATUS) and cancellation support (CLUSTER IMPORT CANCEL).

Epoch Bumping Strategies (Consensusless)
In Valkey, a node's configuration epoch determines slot ownership. When a slot is migrated from one node to another, the target node must increase its epoch to a value higher than any other node in the cluster. This ensures a clear and unambiguous handover of ownership, as the node with the highest epoch is considered the rightful owner of the slot.
There are two primary approaches to bumping a node's epoch:
Consensusless Bumping: The target node directly increases its epoch without requiring agreement from other nodes. This method is efficient but carries the risk of epoch collisions if multiple nodes attempt to claim the same slot concurrently.
Consensus-Based Bumping: The target node proposes an epoch increase and requires a majority of nodes in the cluster to approve this change before it can take effect. This approach reduces the risk of collisions but introduces complexity and potential delays.
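A hedged sketch of what a consensusless bump amounts to, loosely modeled on how config epochs are bumped without agreement today; the structures are simplified stand-ins, not the actual cluster code:

```c
#include <stdio.h>
#include <stdint.h>

struct cluster_view {
    uint64_t current_epoch;   /* highest epoch this node has observed */
    uint64_t my_config_epoch; /* this node's own config epoch          */
};

/* The target claims slot ownership by making its config epoch the largest in
 * its view of the cluster; collisions with concurrent bumps are left to the
 * existing epoch-collision handling. */
static void bump_config_epoch_without_consensus(struct cluster_view *c) {
    c->current_epoch += 1;
    c->my_config_epoch = c->current_epoch;
}

int main(void) {
    struct cluster_view c = { .current_epoch = 41, .my_config_epoch = 7 };
    bump_config_epoch_without_consensus(&c);
    printf("config epoch is now %llu\n", (unsigned long long)c.my_config_epoch); /* 42 */
    return 0;
}
```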
For atomic slot migration in Valkey, it is argued that consensus-based epoch bumping is not necessary. This argument rests on the observation that consensus does not inherently eliminate the risk of data loss during slot migration.
Consider a scenario where slots are being migrated from node A to node B. Node B initiates the process, and node A pauses writes to the slots being transferred. After node B fully catches up with the data, it stages a vote to increase its epoch. However, due to network issues or other unforeseen circumstances, node A might not receive or process this vote request.

Despite this lack of acknowledgment from node A, node B might still secure enough votes from other nodes to bump its epoch. Subsequently, node B claims ownership of the slots. However, if node B's cluster messages fail to reach node A before its pause timeout expires, node A will resume accepting writes to the slots it no longer owns. This leads to an inconsistency where both nodes believe they own the slots, and writes directed to node A after the timeout will ultimately be lost when it eventually receives node B's updated epoch.

This scenario highlights the inherent trade-off between data consistency and availability. While a consensus-based approach might seem to offer stronger consistency guarantees, it cannot prevent data loss in situations where network partitions or node failures occur. Ultimately, the decision of whether to unpause writes on node A without explicit confirmation from node B rests on a delicate balance between ensuring data consistency and maintaining availability.

This complex issue of balancing availability and consistency, particularly in the context of write pauses and timeout mechanisms, is best addressed within the broader discussion of data durability. Therefore, further exploration of this topic and potential solutions are deferred to issue #1355, which focuses on enhancing data durability guarantees in Valkey.
Streaming Format (AOF)
When migrating data between nodes in Valkey, a fundamental design decision involves choosing the appropriate format for representing and transferring that data. Two primary candidates emerge: AOF, which logs Valkey commands, and RDB, a snapshot of the in-memory data. This section analyzes these options, considering the constraints of atomic data types and the implications for memory management on the receiving end.
Regardless of whether AOF, RDB, or a chunked variant of RDB is used, the receiver can only apply a change once the corresponding data is fully received. These changes can be primitive data types like strings and integers or composite data types like SETs, HASHes, etc. The primitives are indivisible and cannot be processed partially.

When a composite data structure is transferred, the receiver must buffer the entire structure in memory for processing. This remains the case even with chunked data unless the chunks are aligned with the atomic data type boundaries within the structure (akin to AOF).
Consequently, streaming the atomic data types (strings and integers) emerges as the most memory-efficient and straightforward approach. This strategy minimizes the buffering and tracking requirements on the receiver, as each atomic unit can be processed and discarded immediately upon reception.
This approach aligns with the AOF format, which essentially represents data as a sequence of Valkey commands operating on these atomic data types. Using AOF for data transfer offers several advantages:
In contrast, RDB, even when streamed in chunks, presents challenges. Unless the chunks align perfectly with the atomic data type boundaries, the receiver still needs to buffer potentially large segments of data, increasing memory pressure. This makes RDB or chunked RDB less suitable for efficient data transfer in Valkey.
Therefore, based on the constraints of atomic data types and the need to minimize memory pressure on the receiver, AOF emerges as the preferred solution for data transfer in Valkey. Its inherent alignment with atomic units, combined with existing support and efficient processing capabilities, makes it a more suitable choice compared to RDB or its chunked variants.
Observability for Atomic Slot Migration
To enhance the observability of the atomic slot migration process and provide operators with improved visibility into its progress and potential issues, we can integrate detailed metrics as follows into CLUSTER INFO:

Metrics on the Source Primary
Metrics on the Target Primary