Fix data loss when a replica does a failover with an old history repl offset #885
Conversation
Our current replica can initiate a failover without restriction when it detects that the primary node is offline. This is generally not a problem. However, consider the following scenarios:

1. In slot migration, a primary loses its last slot and then becomes a replica. While it is being fully synchronized with the new primary, the new primary goes down.
2. In the CLUSTER REPLICATE command, a replica becomes a replica of another primary. While it is being fully synchronized with the new primary, the new primary goes down.

In the above scenarios, case 1 may cause the empty primary to be elected as the new primary, resulting in primary data loss. Case 2 may cause the non-empty replica to be elected as the new primary, resulting in data loss and confusion.

The reason is that we have the cached primary logic, which is used for psync. In the above scenarios, when clusterSetPrimary is called, myself will cache server.primary in server.cached_primary for psync. In replicationGetReplicaOffset, we get server.cached_primary->reploff for the offset, gossip it and rank it, which causes the replica to use the old historical offset to initiate a failover; it gets a good rank, initiates the election first, and is then elected as the new primary.

The main problem here is that when the replica has not completed a full sync, it may get the historical offset in replicationGetReplicaOffset.

The fix is to clear cached_primary in the places where a full sync is obviously needed, and let the replica use offset == 0 to participate in the election. In this way, this unhealthy replica gets a worse rank and is less likely to be elected. Of course, it is still possible that it will be elected with offset == 0. In the future, we may need to prohibit replicas with offset == 0 from initiating elections.

Signed-off-by: Binbin <[email protected]>
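To make the mechanism concrete, here is a minimal, self-contained C sketch (not the actual Valkey source; the struct layout and the main() scenario are simplified assumptions, only the names clusterSetPrimary, server.cached_primary and replicationGetReplicaOffset come from the description above) showing how a stale cached offset can leak into the value used for failover ranking:

```c
/* Minimal sketch, assuming simplified stand-ins for the real structures:
 * only the fields needed to show the stale-offset fallback are kept. */
#include <stdio.h>
#include <stddef.h>

typedef struct client {
    long long reploff; /* replication offset reached with that primary */
} client;

static struct {
    client *primary;        /* live link to the current primary, NULL while (re)connecting */
    client *cached_primary; /* old link cached in case a partial resync (psync) is possible */
} server;

/* Paraphrase of the behavior described above: with no live primary link,
 * fall back to the cached primary's offset. */
static long long replicationGetReplicaOffset(void) {
    if (server.primary) return server.primary->reploff;
    if (server.cached_primary) return server.cached_primary->reploff; /* stale history */
    return 0;
}

int main(void) {
    client old_link = {.reploff = 123456}; /* long replication history with the old primary */

    /* clusterSetPrimary(): the node is pointed at a new primary, the live
     * link is dropped and the old one is cached for a possible psync. */
    server.primary = NULL;
    server.cached_primary = &old_link;

    /* Before the full sync with the new primary finishes, failover ranking
     * still gossips the historical offset instead of 0. */
    printf("offset used for ranking: %lld\n", replicationGetReplicaOffset());
    return 0;
}
```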
This is a great bug. The fix LGTM overall.
Thanks, Binbin!
Co-authored-by: Ping Xie <[email protected]> Signed-off-by: Binbin <[email protected]>
Force-pushed from 2a1707e to f9759b4.
force-push for the DCO.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #885      +/-   ##
============================================
+ Coverage     70.39%   70.58%   +0.18%
============================================
  Files           112      112
  Lines         61465    61509      +44
============================================
+ Hits          43271    43417     +146
+ Misses        18194    18092     -102
Signed-off-by: Binbin <[email protected]>
Overall LGTM - minor test comment
Signed-off-by: Binbin <[email protected]>
LGTM. Nits only. Thanks Binbin.
Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
@PingXie thanks for the review! I think I took care of all the comments, please take another look.
…n CLUSTER REPLICATE (#884) If n is already myself's primary, there is no need to re-establish the replication connection. In the past we allowed a replica node to reconnect with its primary via this CLUSTER REPLICATE command, and it would use psync. But since #885 we assume that a full sync is needed in this case, so without this check the replica would always do a full sync. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
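As a rough illustration only (a hypothetical helper, not the actual CLUSTER REPLICATE handler), the early return boils down to a check like this before tearing down and re-establishing replication:

```c
#include <stdbool.h>
#include <stddef.h>

/* Trimmed-down node struct: only the field needed for the check. */
typedef struct clusterNode {
    struct clusterNode *replicaof; /* primary this node replicates from, or NULL */
} clusterNode;

/* If n is already myself's primary there is nothing to do; returning early
 * keeps the existing replication link (and psync continuity) instead of
 * going through the "assume full sync" path introduced by #885. */
static bool clusterReplicateIsNoop(const clusterNode *myself, const clusterNode *n) {
    return myself->replicaof == n;
}
```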
In valkey-io#885, we only added a shutdown path; there is another path where the server might get hung (by a slow operation, as would show up in the slowlog). This PR adds the pause path coverage to cover it. Signed-off-by: Binbin <[email protected]>
…ard_id (#944) When reconfiguring a sub-replica, there may be a case where the sub-replica will use the old offset, win the election, and cause data loss if the old primary went down. In this case, the sender is myself's primary; when executing updateShardId, not only the sender's shard_id is updated, but also the shard_id of myself, causing the subsequent areInSameShard check, that is, the full_sync_required check, to fail. As part of the recent fix in #885, the sub-replica needs to decide whether a full sync is required or not when switching shards. This shard membership check is supposed to be done against the sub-replica's current shard_id, which however was lost in this code path. This then leads to the sub-replica joining the other shard with a completely different and incorrect replication history. This is the only place where the replicaof state can be updated on this path, so the most natural fix is to pull the chain replication reduction logic into this code block, before the updateShardId call. This follows #885 and closes #942. Signed-off-by: Binbin <[email protected]> Co-authored-by: Ping Xie <[email protected]>
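A hedged sketch of the ordering point (hypothetical helpers, not the real cluster.c code): the shard membership check has to run while myself still carries its old shard_id, before updateShardId overwrites it.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SHARD_ID_LEN 40

typedef struct clusterNode {
    char shard_id[SHARD_ID_LEN + 1];
} clusterNode;

static bool areInSameShard(const clusterNode *a, const clusterNode *b) {
    return strncmp(a->shard_id, b->shard_id, SHARD_ID_LEN) == 0;
}

static void updateShardId(clusterNode *node, const char *shard_id) {
    memcpy(node->shard_id, shard_id, SHARD_ID_LEN + 1);
}

/* Buggy order: myself's shard_id is overwritten first, so the check can
 * never detect that we are moving to a different shard. */
static bool handleSenderBuggy(clusterNode *myself, clusterNode *sender) {
    updateShardId(myself, sender->shard_id);
    return !areInSameShard(myself, sender); /* always false now */
}

/* Fixed order: decide whether a full sync is required while myself still
 * carries its old shard_id, then adopt the sender's shard_id. */
static bool handleSenderFixed(clusterNode *myself, clusterNode *sender) {
    bool full_sync_required = !areInSameShard(myself, sender);
    updateShardId(myself, sender->shard_id);
    return full_sync_required;
}

int main(void) {
    clusterNode myself = {.shard_id = "shard-old"};
    clusterNode sender = {.shard_id = "shard-new"};

    clusterNode m1 = myself; /* the buggy order never sees the shard change */
    assert(handleSenderBuggy(&m1, &sender) == false);

    clusterNode m2 = myself; /* the fixed order correctly requires a full sync */
    assert(handleSenderFixed(&m2, &sender) == true);
    return 0;
}
```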
Our current replica can initiate a failover without restriction when
it detects that the primary node is offline. This is generally not a
problem. However, consider the following scenarios:
1. In slot migration, a primary loses its last slot and then becomes
a replica. While it is being fully synchronized with the new primary,
the new primary goes down.
2. In the CLUSTER REPLICATE command, a replica becomes a replica of another
primary. While it is being fully synchronized with the new primary,
the new primary goes down.
In the above scenarios, case 1 may cause the empty primary to be elected
as the new primary, resulting in primary data loss. Case 2 may cause the
non-empty replica to be elected as the new primary, resulting in data
loss and confusion.
The reason is that we have cached primary logic, which is used for psync.
In the above scenario, when clusterSetPrimary is called, myself will cache
server.primary in server.cached_primary for psync. In replicationGetReplicaOffset,
we get server.cached_primary->reploff for the offset, gossip it and rank it,
which causes the replica to use the old historical offset to initiate a
failover; it gets a good rank, initiates the election first, and is then
elected as the new primary.
The main problem here is that when the replica has not completed full
sync, it may get the historical offset in replicationGetReplicaOffset.
The fix is to clear cached_primary in these places where full sync is
obviously needed, and let the replica use offset == 0 to participate
in the election. In this way, this unhealthy replica gets a worse rank
and is less likely to be elected.
Of course, it is possible that it will be elected with offset == 0.
In the future, we may need to prohibit the replica with offset == 0
from having the right to initiate elections.
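Continuing the earlier sketch (again simplified and hypothetical names; the helper replicationDiscardCachedPrimary() only illustrates the idea of the fix, it is not the actual patch), clearing the cached primary at the points where a full sync is clearly unavoidable makes the reported offset fall back to 0:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct client {
    long long reploff;
} client;

static struct {
    client *primary;
    client *cached_primary;
} server;

static long long replicationGetReplicaOffset(void) {
    if (server.primary) return server.primary->reploff;
    if (server.cached_primary) return server.cached_primary->reploff;
    return 0; /* the unhealthy replica now ranks last instead of first */
}

/* Hypothetical helper: called from places like clusterSetPrimary() when we
 * already know a partial resync against the cached primary is impossible. */
static void replicationDiscardCachedPrimary(bool full_sync_required) {
    if (full_sync_required) server.cached_primary = NULL; /* real code would also free it */
}

int main(void) {
    client old_link = {.reploff = 123456};
    server.cached_primary = &old_link;          /* left over from the previous primary */

    replicationDiscardCachedPrimary(true);      /* e.g. we just switched shards */
    assert(replicationGetReplicaOffset() == 0); /* no stale offset gossiped for the election */
    return 0;
}
```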
Another point worth mentioning, in the above cases:
1. In the ROLE command, the replica status will be handshake, and the
offset will be -1.
2. Before this PR, in the CLUSTER SHARDS command, the replica status will
be online, and the offset will be the old cached value (which is wrong).
3. After this PR, in CLUSTER SHARDS, the replica status will be loading,
and the offset will be 0.