Sshpiperd restart kills all proxied connections #515
Totally agree. This is the library I was thinking of to make it happen.
Thanks for the feedback and reference. Happy to help where we can. I've started looking over those libraries. They seem well encapsulated, and gracenet seems like the place to start exploring. I'm a newbie with Go but have experience with network programming in C and Python. Could you advise on where in the code base you would expect the grace network listener to replace the traditional network listener?
Check here: sshpiper/cmd/sshpiperd/daemon.go, line 181 at commit b2f7f79.
It is the same as accept(fd), and can be interpreted as the place where the grace listener would go. Would really appreciate it if you can send a PR. The reason the feature was delayed is that, in our case, sshpiper now runs inside containers in a large cluster; as a result, managing state across servers for a graceful restart is much more difficult.
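For reference, here is a minimal sketch of what swapping the plain listener for a gracenet one might look like, assuming the gracenet package from github.com/facebookgo/grace is the library being discussed. The listen address, the choice of SIGUSR2, and the handle function are placeholders, not sshpiperd's actual code, and this only keeps the accepting socket alive across a re-exec:

```go
package main

import (
	"log"
	"net"
	"os"
	"os/signal"
	"syscall"

	"github.com/facebookgo/grace/gracenet"
)

func main() {
	// gracenet.Net tracks listeners and, on StartProcess, re-execs the binary
	// with the listening FDs passed down, so the child accepts on the same port.
	gnet := &gracenet.Net{}

	// Placeholder address; sshpiperd would use its configured listen address.
	ln, err := gnet.Listen("tcp", ":2222")
	if err != nil {
		log.Fatalf("listen: %v", err)
	}

	// On SIGUSR2, launch the upgraded binary, which inherits the listener.
	go func() {
		sig := make(chan os.Signal, 1)
		signal.Notify(sig, syscall.SIGUSR2)
		<-sig
		if _, err := gnet.StartProcess(); err != nil {
			log.Printf("graceful restart failed: %v", err)
		}
	}()

	// This loop plays the role of the accept(fd) loop in daemon.go.
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Printf("accept: %v", err)
			return
		}
		go handle(conn)
	}
}

// handle stands in for handing the connection to the SSH piping logic.
func handle(conn net.Conn) {
	defer conn.Close()
	// ... proxy the SSH session ...
}
```

The inherited-FD trick only preserves the accepting socket across the re-exec; the old process still has to drain or drop its existing sessions, which is exactly the difficulty described later in the thread.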
Thanks for the additional pointers. We will continue to explore this to see how we can contribute. It may take us some time to prepare a pull request.

We are starting our use of sshpiper as a traditional long-lived system service on a front-end proxy into traditional ssh servers that are cluster login nodes shared by all users of a cluster. Our thinking is that we will eventually isolate users into individual, containerized login sessions on a k8s cluster. We've been thinking those containers would run OpenSSH. We would, however, still need an ssh proxy plane (sshpiper) to route connections to the per-user login containers.

I assume you are referring to running sshpiper in a container on the proxy plane. What issues do you see that complicate containerizing the sshpiper workload? Does the difficulty come from having to do session management in the container runtime as well?
More details on why graceful restart is hard: first, unlike HTTP, SSH connections are typically long-lived, which means a graceful restart may take an unbounded amount of time until the timeout kicks in, and even then it will break the connection and cause a non-graceful disconnect. In addition, it is hard to transfer a live connection to another, upgraded server instance, which would be the better (or more correct) approach.
OK, so our traditional use case of a single proxy server and single ssh login node is more like a 1:1 containerized implementation, where one sshpiper container proxies for one ssh login container. If instead we had m sshpiper proxy containers and n ssh login nodes, with a reasonably even connection distribution across the proxies, then any one sshpiper restart would only impact n/m connections. Is this how you are viewing the situation?

A one-update-restart per month cadence isn't all that bad. I'm not familiar enough with container orchestration environments to know how connection state could be maintained in those environments as a way to do graceful restart. If we could adopt this one-update-per-month cadence with our current config, we could likely tolerate breaking any connections that happen to be active during the sshpiper proxy restart. This would make graceful restart not that important in our environment as well.

Our perceived need for graceful restart from more frequent restarts actually stems from our use of the failtoban plugin. It's our understanding that we can't clear individual IPs out of the sshpiper ban cache unless we restart sshpiper. We have traditionally used the system Fail2ban to protect the public interface of our ssh endpoints. We didn't see the expected event logging in sshpiper that would allow Fail2ban to do the failed-login event counting, so we thought we could use the failtoban plugin in sshpiper as a convenient counter implementation. But then we noticed we could not unban select IPs without an sshpiper restart that flushes the whole ban cache. Obviously this restart-based flush kills all other active ssh sessions. That's undesirable. :)

Do you have any suggestions on how we might either a) get a password-failure event log stream from sshpiper that we could monitor with the system Fail2ban, or b) selectively remove entries from the failtoban plugin's built-in ban cache?
OMG, you are right, the only way to reset failtoban is to restart. The failtoban code is very simple; maybe you can create your own plugin to handle your case. You can simply add more logging here.
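This is not the actual failtoban code, just a rough sketch of the kind of state such a custom plugin might keep, with a per-IP unban added so a single address can be cleared without a restart. The names (banCache, RecordFailure, Unban) and the threshold handling are made up for illustration:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// banCache is an illustrative stand-in for a failtoban-style plugin's state:
// failed-attempt counts per source IP, plus a way to remove a single entry
// without restarting the whole daemon.
type banCache struct {
	mu       sync.Mutex
	failures map[string]int
	banned   map[string]time.Time
	maxFails int
}

func newBanCache(maxFails int) *banCache {
	return &banCache{
		failures: make(map[string]int),
		banned:   make(map[string]time.Time),
		maxFails: maxFails,
	}
}

// RecordFailure counts a failed login and bans the IP once it crosses the
// threshold. This is also the natural place to emit a log line that an
// external fail2ban jail could match.
func (b *banCache) RecordFailure(ip string) (bannedNow bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures[ip]++
	if b.failures[ip] >= b.maxFails {
		b.banned[ip] = time.Now()
		return true
	}
	return false
}

// IsBanned reports whether an IP is currently banned.
func (b *banCache) IsBanned(ip string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	_, ok := b.banned[ip]
	return ok
}

// Unban removes a single IP: the selective reset asked about above,
// instead of restarting the process to flush everything.
func (b *banCache) Unban(ip string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	delete(b.failures, ip)
	delete(b.banned, ip)
}

// Reset clears the whole cache, which is what a restart effectively does.
func (b *banCache) Reset() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.failures = make(map[string]int)
	b.banned = make(map[string]time.Time)
}

func main() {
	cache := newBanCache(3)
	cache.RecordFailure("198.51.100.7")
	cache.RecordFailure("198.51.100.7")
	if cache.RecordFailure("198.51.100.7") {
		fmt.Println("banned after 3 failures")
	}
	cache.Unban("198.51.100.7") // selective reset, no restart needed
	fmt.Println("still banned?", cache.IsBanned("198.51.100.7"))
}
```

Whether this lives as a fork of failtoban or a separate plugin, logging each failure from RecordFailure would also give the system Fail2ban something to match, covering option (a) from the previous comment.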
Added #519 to support SIGHUP to reset.
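For anyone wiring something similar up themselves, a SIGHUP-triggered reset in Go generally looks like the sketch below. This is the generic pattern, not the code from #519, and watchSIGHUP is a made-up helper name:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// watchSIGHUP invokes reset every time the process receives SIGHUP,
// e.g. to clear an in-memory ban cache without restarting.
func watchSIGHUP(reset func()) {
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGHUP)
	go func() {
		for range sig {
			log.Println("SIGHUP received, resetting ban state")
			reset()
		}
	}()
}

func main() {
	watchSIGHUP(func() {
		// In sshpiperd this would clear the failtoban ban cache.
		log.Println("ban state cleared")
	})
	select {} // keep the process alive; send SIGHUP to trigger the reset
}
```

With such a handler in place, sending `kill -HUP <pid>` flushes the ban state without restarting the process or dropping established sessions.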
Thanks for the quick fix. We are exploring it but need to update some of our local code modifications for supporting group-based routing. Will follow up when that's complete. Or feel free to close this issue if needed. |
It appears restarting the sshpiperd process on the proxy node will kill all active proxied ssh connections to upstream servers. This makes sense since the sole sshpiperd process terminates the ssh connection with the downstream (client) and establishes the ssh connection to the upstream (server running OpenSSH sshd).
Compare this to the familiar behavior when restarting OpenSSH's sshd: you can restart the sshd server process (e.g. to read an updated configuration file), but established ssh sessions are not killed. Only the one sshd process responsible for accepting new connections gets restarted.
Is there a way of isolating proxied connections to a single instance of sshpiperd so that the lifetime of one server and its proxied connection doesn't impact the lifetime of other proxied connections?
We're trying to understand how to interact with sshpiperd from an operations perspective coming from an OpenSSH background. In our environment we have many user sessions proxied through our sshpiper proxy layer.
How should we be thinking about minimizing the impact of operational events that may require an sshpiper restart due to configuration changes or due to some unexpected failure of the proxy itself?