The driver does not deal with changing redis IP addresses in cluster mode #183
Comments
Please see #184.
@michaelglass, #168 suggests you're using cluster support quite extensively. Is this something you could look into?
Need to run. PS: the connections currently held in the pool might still be in use, and they are removed only after exceptions are thrown.
For anyone else looking here: we have a fork at Juspay (https://github.com/juspay/hedis) with some of these fixes already done. We would like to upstream them at some point.
@aravindgopall Glad to see you forked. Is the fix/connection branch the one people are supposed to use? Why is it 104 commits behind? Would you be interested in taking over hedis maintainership?
Right now it is the fix/connection branch; we are testing the changes in our setup. Once done, we will merge everything to the cluster branch, followed by master.
@ysangkok Would be very much interested in this.
JFI: the fix/connection branch is up to date with upstream master and is being tested. Once done, we will merge to master and release.
JFI, @ysangkok: how do you want me to continue? Shall I create a PR to upstream, or should we continue on the Juspay fork?
I would suggest creating a PR to this repo and we can get it merged.
Hi @aravindgopall - I mistakenly offered to maintain this library some months ago, and it turns out that I simply don't have the time. Could you email me and we can discuss handing over maintainership? james at flipstone . com
This issue won't be solved by refreshing the shard map from IP addresses already in the map: after Helm updates the Redis cluster, all Redis nodes are assigned new IP addresses, so no IP address stored in the available shard map would connect to a then-running Redis node. The only way to reconnect the client is to resolve a new valid IP address via the Redis cluster's DNS name and then obtain an entirely new shard map. For our use case, we retain the DNS name outside Hedis. This is suboptimal, since the DNS name is already passed to Hedis for establishing the initial connection and could be retained for failovers when all IPs have become invalid - for instance, because Helm is being a villain.
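A sketch of that application-level fallback (under assumptions: the helper name is made up and hedis does not provide anything like it; only connectCluster, runRedis and ConnectInfo are real hedis API). The original ConnectInfo, whose connectHost is the cluster's DNS name, is kept around; when a command fails because every cached node IP is stale, a new cluster connection and therefore a fresh shard map is built from it:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception (SomeException, try)
import qualified Database.Redis as Redis

-- Illustrative helper, not part of hedis: retry a command through a brand
-- new cluster connection built from the original ConnectInfo when the
-- current connection no longer works.
runWithDnsFallback
  :: Redis.ConnectInfo          -- original ConnectInfo, connectHost = DNS name
  -> Redis.Connection           -- current (possibly stale) connection
  -> Redis.Redis a              -- command to run
  -> IO (Redis.Connection, a)   -- connection to keep using, plus the result
runWithDnsFallback info conn action = do
  result <- try (Redis.runRedis conn action)
  case result of
    Right a -> pure (conn, a)
    Left (_ :: SomeException) -> do
      -- connectCluster resolves connectHost again, so this picks up whatever
      -- IP addresses the DNS name currently points at.
      conn' <- Redis.connectCluster info
      a <- Redis.runRedis conn' action
      pure (conn', a)
```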
@stefanwire This is a valid scenario. Right now it resolves the issue as long as at least one of the Redis nodes kept its IP address, which is generally helpful in more frequent scenarios like failovers, scaling, etc.
Feel free to check and adapt the code linked in my previous comment to solve the case where none of the IPs connect anymore. Also, check my open PR, which adds the
I did a quick check on Juspay's branch with the following test scenario:
It doesn't seem to work:
This means the
Hi @omnibs, can you give the Juspay master branch a try and share the results? In our internal test it recovers without
With the Juspay master branch it did recover, though it seemed to take several minutes (I can't re-test because I ran out of test failovers for today). I don't follow in the code what's making it recover, but I'm curious if there's something we can tweak to make it recover faster.
It's a retry mechanism based on MOVED/timeout (this can be configured to recover faster). When that occurs, the client gets the new shard map info by picking a random node (which, again, might itself be down, adding to the time). After that it sends the command to the node that is now responsible. It would be great to see some numbers and graphs to understand this (e.g. resharding time, throughput, how many errors, etc.).
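Schematically, the recovery loop described here looks something like the sketch below. The names (ShardMap as a bare placeholder type, the refresh action, the retry count) are illustrative and not the actual hedis internals; the point is that the retry limit and the command timeout together bound how long recovery takes.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception (SomeException, throwIO, try)

-- Placeholder for the slot-to-node mapping kept by the cluster client.
newtype ShardMap = ShardMap [String]

-- Illustrative recovery loop (not the real hedis code): when a command fails
-- with MOVED or a timeout, refresh the shard map by asking a random known
-- node, then retry against whichever node now owns the slot.
retryWithRefresh
  :: Int                  -- retries left; with the timeout, this bounds recovery time
  -> IO ShardMap          -- refresh the shard map from a random node
  -> (ShardMap -> IO a)   -- run the command against the owning node
  -> ShardMap
  -> IO a
retryWithRefresh retriesLeft refresh runCmd shardMap = do
  result <- try (runCmd shardMap)
  case result of
    Right a -> pure a
    Left (e :: SomeException)
      | retriesLeft <= 0 -> throwIO e
      | otherwise -> do
          -- The randomly chosen node may itself be down, which is one reason
          -- recovery can take longer than a single retry interval.
          shardMap' <- refresh
          retryWithRefresh (retriesLeft - 1) refresh runCmd shardMap'
```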
Thank you for explaining! So I re-ran my test in a simple manner:
It took a pretty long time:
So it took 8min30s to recover. The failover itself is quick: I didn't time it, but restarting another instance of the app brought it back to life immediately, while the instance I was curling kept failing.
Do my results make sense with this mechanism? It sounds like after the failover my

I plan to debug the mechanism further to figure out what's up, but any pointers you might have on what's happening would be appreciated!

PS: just to clarify, there's no resharding time; the failover works over a single shard.
Ok, my test was too simplistic and, for Redis Cluster's intended usage, perhaps unrealistic. Redis outside Elasticache requires at least 3 primaries for cluster mode; Elasticache allows you to set it up with a single primary.

I tested a local redis cluster with 3 primaries and ran a failover with

I tested failing over an Elasticache cluster with 2 primaries, and all went well. We got a single

@aravindgopall one last question then: does Juspay plan to upstream its fixes to cluster mode? I noticed there were 2 PRs for upstreaming them which got closed (#192 and #191).
Hedis loses its connection to a Redis cluster if the cluster nodes change their IP addresses during a restart.
How to reproduce the problem
While the app cannot talk to redis, we get these kinds of error messages:
Impact
This makes using hedis in Kubernetes very difficult.
Speculation
This could be a result of using socketToHandle here. The docs for the function say:

It could also be some other problem; this is just speculation.
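For reference, socketToHandle comes from the network package; a minimal usage sketch (my own illustration, not hedis's actual code) is below. Once created, the Handle stays bound to the one TCP connection it wrapped, which would explain handles going stale when the peer restarts under a new IP address.

```haskell
import Network.Socket (Socket, socketToHandle)
import System.IO (Handle, IOMode (ReadWriteMode), hSetBinaryMode)

-- Minimal illustration of the pattern under suspicion: wrap a connected
-- socket in a Handle. After the peer restarts with a new IP address, every
-- operation on this handle fails; recovering requires resolving the address
-- again and building a new socket and handle.
wrapSocket :: Socket -> IO Handle
wrapSocket sock = do
  h <- socketToHandle sock ReadWriteMode
  hSetBinaryMode h True
  pure h
```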
Additional Info
I create ConnectInfo like this:
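(The original snippet is not reproduced above; the following is a minimal sketch using hedis's standard ConnectInfo fields, where the host name and port are placeholders rather than the author's actual values.)

```haskell
import qualified Database.Redis as Redis

-- Sketch only; host and port are placeholders. connectHost is the cluster's
-- DNS name, e.g. a Kubernetes service.
clusterConnectInfo :: Redis.ConnectInfo
clusterConnectInfo = Redis.defaultConnectInfo
  { Redis.connectHost = "redis-cluster.default.svc.cluster.local"
  , Redis.connectPort = Redis.PortNumber 6379
  }
```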
And connect to redis like this:
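(Again a sketch with placeholder names, not the author's code: in cluster mode the connection is made with connectCluster, which resolves connectHost once, queries the cluster for its topology, and caches the node addresses it returns; per this issue, later IP changes are not picked up.)

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Database.Redis as Redis

-- Sketch only; the host name and key are placeholders.
main :: IO ()
main = do
  conn <- Redis.connectCluster
            Redis.defaultConnectInfo
              { Redis.connectHost = "redis-cluster.default.svc.cluster.local" }
  reply <- Redis.runRedis conn (Redis.get "some-key")
  print reply
```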
This connection does not use TLS.