[NEW] Migrate non topology update cluster bus messages to light header type #932

hpatro opened this issue Aug 21, 2024 · 13 comments

@hpatro
Collaborator

hpatro commented Aug 21, 2024

With the light message header for the cluster bus in place, we can migrate the following message types to use it:

  1. CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST
  2. CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK
  3. CLUSTERMSG_TYPE_UPDATE
  4. CLUSTERMSG_TYPE_MFSTART
  5. CLUSTERMSG_TYPE_MODULE - Enable module API SendClusterMessage to use light message header type #1572

This should only be used when the cluster is in a homogeneous state and all of the nodes support the light message header.
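
A rough sketch of that gating check (not actual Valkey code; nodeSupportsLightMsgHdr() is a hypothetical per-node capability flag that would have to be learned during the handshake or via a header extension):

    /* Hypothetical: only switch these message types to the light header
     * when every known node advertises support for it. */
    static int clusterAllNodesSupportLightMsg(void) {
        dictIterator *di = dictGetSafeIterator(server.cluster->nodes);
        dictEntry *de;
        while ((de = dictNext(di)) != NULL) {
            clusterNode *node = dictGetVal(de);
            if (!nodeSupportsLightMsgHdr(node)) { /* hypothetical capability check */
                dictReleaseIterator(di);
                return 0;
            }
        }
        dictReleaseIterator(di);
        return 1;
    }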

@hpatro
Collaborator Author

hpatro commented Aug 21, 2024

@roshkhatri / @zuiderkwast WDYT?

@madolson
Member

CLUSTERMSG_TYPE_UPDATE

I kind of think this one should stay. We're indicating state changes, so it seems prudent to send our state of the world as well.

@zuiderkwast
Contributor

zuiderkwast commented Aug 22, 2024

I believe the messages listed here are rare and don't make up a large percentage of the cluster bus traffic, so I don't think we need to prioritize these. The bulk of the cluster bus traffic is PING and PONG. If we can figure out a way to use light ping pong 🏓, then we're saving a lot of cluster bus overhead for large clusters.

A basic idea is that each node remembers what it sent to any other node (for example a hash of the content) and if it hasn't changed since the last ping/pong between the nodes, then the node can send a light PING or PONG, just as a keepalive message and to say that nothing has changed. Do you want to explore this idea?
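
A minimal sketch of that idea (not actual Valkey code: last_sent_digest, clusterBuildPingPayload() and clusterSendLightPing() are hypothetical names, and crc64() just stands in for whatever digest we pick):

    /* Hypothetical: downgrade to a header-only keepalive when the payload
     * we would send to this node is unchanged since the last ping/pong. */
    static void clusterSendPingMaybeLight(clusterLink *link, int type) {
        size_t len;
        unsigned char *payload = clusterBuildPingPayload(link->node, &len); /* hypothetical */
        uint64_t digest = crc64(0, payload, len);

        if (digest == link->node->last_sent_digest) { /* hypothetical field on clusterNode */
            clusterSendLightPing(link, type);         /* hypothetical keepalive-only message */
        } else {
            link->node->last_sent_digest = digest;
            clusterSendPing(link, type);              /* existing full PING/PONG path */
        }
        zfree(payload);
    }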

@madolson
Member

The bulk of the cluster bus traffic is PING and PONG.

The biggest gain might be to send a message that says, "I have no new slot information changes from last time", which I think is the biggest part of the message (that's like 2 KB, right?). One problem will be that it'll still get rounded up to a full TCP packet.

@roshkhatri
Member

roshkhatri commented Aug 22, 2024

At a high level, it does seem like a nice idea!
However, for:

  1. CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST, we do require currentEpoch, configEpoch and myslots.
  2. CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST would also require mflags.

This means we would either be adding new fields to clusterMsgLight, which would make it a bit large again, or carrying these in some other way in the data field.
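
For reference, a rough sketch of that second option (hypothetical layout, not actual Valkey code): carry the extra fields in the light message's data section instead of widening clusterMsgLight itself. Note that the slot bitmap alone is CLUSTER_SLOTS/8 = 2048 bytes, so this particular message would not shrink much.

    /* Hypothetical payload carried in the light message's data section
     * for CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST. */
    typedef struct {
        uint64_t currentEpoch;
        uint64_t configEpoch;
        unsigned char myslots[CLUSTER_SLOTS / 8]; /* 16384 bits = 2048 bytes */
    } clusterMsgDataFailoverAuthRequest;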

@zuiderkwast's idea is what we had discussed some time ago; I do agree with it too and would like to explore it.

The biggest gain might be to send a message that says, "I have no new slot information changes from last time", which I think is the biggest part of the message (that's like 2 KB, right?). One problem will be that it'll still get rounded up to a full TCP packet.

Yeah, this would be the best gain: stating that I am alive and there have been no changes since last time.

@zuiderkwast
Contributor

The bulk of the cluster bus traffic is PING and PONG.

The biggest gain might be to send a message that says, "I have no new slot information changes from last time", which I think is the biggest part of the message (that's like 2 KB, right?). One problem will be that it'll still get rounded up to a full TCP packet.

@madolson Yes, the slot bitmap is 2KB (16K bits). A TCP packet has 20 bytes overhead and an IPv4 header is another 20 bytes overhead. Then, Ethernet has somewhere around 128 bytes overhead I believe (source and target MAC addresses and some CRC-sum). We're at somewhere around 200 bytes in total. On top of this, TLS has some minimum packet size, and VPNs and other stuff probably add more overhead. Did I miss anything?

Anyway, the light packet doesn't solve our scaling problem. The problem is still the all-to-all communication. In a raft cluster, nodes only communicate with the leader, except when the leader isn't reachable anymore. To focus on that might be a better idea.

@roshkhatri
Member

A TCP packet has 20 bytes overhead and an IPv4 header is another 20 bytes overhead. Then, Ethernet has somewhere around 128 bytes overhead I believe (source and target MAC addresses and some CRC-sum). We're at somewhere around 200 bytes in total. On top of this, TLS has some minimum packet size, and VPNs and other stuff probably add more overhead. Did I miss anything?

This will be a constant for every msg, right? The thing we can focus on is how we can reduce our overhead for every msg.

@zuiderkwast
Contributor

A TCP packet has 20 bytes overhead and an IPv4 header is another 20 bytes overhead. Then, Ethernet has somewhere around 128 bytes overhead I believe (source and target MAC addresses and some CRC-sum). We're at somewhere around 200 bytes in total. On top of this, TLS has some minimum packet size, and VPNs and other stuff probably add more overhead. Did I miss anything?

This will be a constant for every msg, right? The thing we can focus on is how we can reduce our overhead for every msg.

Yes, the overhead per message is constant, but it's important to discuss it. It makes little sense to reduce a small message to an even smaller one (say from 20 to 10 bytes), because we would only save very little compared to all the overhead. If we can reduce a message from 2K to 400 including all protocol overhead, we save 80%, so this is good.

We still have very many messages though, and this will still be a bottleneck, because the number of ping-pong messages in a cluster grows exponentially with the size of the cluster (right?). If we save 80% of the size of each message, we can maybe achieve a 600-node cluster instead of 500 nodes, but if we can reduce the number of messages to grow just linearly or quadratically with the size of the cluster, then we can scale to huge clusters for real. We could introduce features like non-voting nodes, and less active nodes that don't ping anyone but just answer pings from other nodes. Or... we take the step to use a raft protocol.
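
For a rough sense of scale, a back-of-the-envelope count (assuming each pair of nodes exchanges at least 4 packets per node_timeout, as the clusterSendPing comment quoted further down states):

    /* packets per node_timeout ≈ 4 * N*(N-1)/2 = 2 * N*(N-1)
     *
     *   N = 100  ->  ~19,800 packets per node_timeout
     *   N = 500  ->  ~499,000 packets per node_timeout
     *
     * Shrinking each packet helps bandwidth, but the packet count itself
     * still grows quadratically with N, and each packet also carries
     * roughly N/10 gossip entries, so total bytes grow faster than that. */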

@enjoy-binbin
Member

enjoy-binbin commented Aug 25, 2024

A basic idea is that each node remembers what it sent to any other node (for example a hash of the content) and if it hasn't changed since the last ping/pong between the nodes, then the node can send a light PING or PONG, just as a keepalive message and to say that nothing has changed. Do you want to explore this idea?

It is a good idea, I think. In the current cluster implementation, that is a simple thing we can do to reduce the traffic without a huge change. I was also exploring this (a simple keepalive ping-pong) a long time ago, maybe at the end of last year, but I haven't finished it yet (busy with other things in the internal fork). Internally, I had the intention to pick it back up after a couple of weeks.

The problem is still the all-to-all communication

Yes, in a large cluster, the all-to-all communication is a pain. And there is also a large contributor here, the wanted one: wanted = floor(dictSize(server.cluster->nodes) / 10)

BTW, what are our current plans for cluster V2? Are we still planning to do it?

@zuiderkwast
Contributor

Yes, in a large cluster, the all-to-all communication is a pain. And there is also a large contributor here, the wanted one: wanted = floor(dictSize(server.cluster->nodes) / 10)

The wanted one! I think it has potential for tuning! Can we gossip 1/10 about primaries and less about replicas? The code comment above that calculation explains the 1/10:

    /* How many gossip sections we want to add? 1/10 of the number of nodes
     * and anyway at least 3. Why 1/10?
     *
     * If we have N masters, with N/10 entries, and we consider that in
     * node_timeout we exchange with each other node at least 4 packets
     * (we ping in the worst case in node_timeout/2 time, and we also
     * receive two pings from the host), we have a total of 8 packets
     * in the node_timeout*2 failure reports validity time. So we have
     * that, for a single PFAIL node, we can expect to receive the following
     * number of failure reports (in the specified window of time):
     *
     * PROB * GOSSIP_ENTRIES_PER_PACKET * TOTAL_PACKETS:
     *
     * PROB = probability of being featured in a single gossip entry,
     *        which is 1 / NUM_OF_NODES.
     * ENTRIES = 10.
     * TOTAL_PACKETS = 2 * 4 * NUM_OF_MASTERS.
     *
     * If we assume we have just masters (so num of nodes and num of masters
     * is the same), with 1/10 we always get over the majority, and specifically
     * 80% of the number of nodes, to account for many masters failing at the
     * same time.
     *
     * Since we have non-voting slaves that lower the probability of an entry
     * to feature our node, we set the number of entries per packet as
     * 10% of the total nodes we have. */
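
For context, the code right after that comment in clusterSendPing computes wanted roughly like this, so any primary/replica weighting would slot in here:

    int freshnodes = dictSize(server.cluster->nodes) - 2; /* everyone except myself and the receiver */
    int wanted = floor(dictSize(server.cluster->nodes) / 10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;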

Apparently, Antirez has a private branch for fast failure detection, which can reduce the gossip. 😁 It's mentioned in redis/redis#3929.

BTW, what are our current plans for cluster V2? Are we still planning to do it?

I want it, but nobody has started implementing anything AFAIK, only discussion in #384 and #457 for preparation. Now we're just trying to do these small optimizations instead. But someone needs to make some proof of concept for V2, with a raft consensus model. The Redis ideas were too complex though.

@hpatro
Collaborator Author

hpatro commented Aug 30, 2024

@zuiderkwast @enjoy-binbin I'm playing around with a POC for Cluster V2 and will put up an RFC with a concrete design soon. Can't promise a date as of now.

@zuiderkwast
Contributor

A light-weight MODULE message would be useful for some modules, such as cluster-wide search. @murphyjacob4 do you want to implement it (or find someone to do it)?

@hpatro
Collaborator Author

hpatro commented Jan 15, 2025

I have made the changes to the existing API to use the light header if possible, @zuiderkwast / @murphyjacob4. I can share it as a PR shortly. https://github.com/hpatro/valkey/tree/cluster_module_support_light_msg
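
For module authors nothing should change at the call site; with a branch like that, the server decides whether to use the classic or the light header. A hedged usage sketch (MY_MSG_TYPE, payload and payload_len are placeholders):

    /* Broadcast a module cluster message; target_id = NULL means all nodes.
     * The header type used on the wire is chosen server-side. */
    ValkeyModule_SendClusterMessage(ctx, NULL, MY_MSG_TYPE, payload, payload_len);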
