Sshpiperd restart kills all proxied connections #515

Open
jprorama opened this issue Jan 29, 2025 · 9 comments

Comments

@jprorama

It appears restarting the sshpiperd process on the proxy node will kill all active proxied ssh connections to upstream servers. This makes sense since the sole sshpiperd process terminates the ssh connection with the downstream (client) and establishes the ssh connection to the upstream (server running OpenSSH sshd).

Compare this to the familiar behavior when restarting OpenSSH's sshd: you can restart the sshd server process (e.g. to read an updated configuration file), but established ssh sessions are not killed. Only the one sshd process responsible for accepting new connections gets restarted.

Is there a way of isolating proxied connections to individual instances of sshpiperd so that the lifetime of one server process and its proxied connection doesn't impact the lifetime of other proxied connections?

We're trying to understand how to interact with sshpiperd from an operations perspective coming from an OpenSSH background. In our environment we have many user sessions proxied through our sshpiper proxy layer.

How should we be thinking about minimizing the impact of operational events that may require an sshpiper restart due to configuration changes or due to some unexpected failure of the proxy itself?

@tg123
Owner

tg123 commented Jan 29, 2025

totally agree
graceful restart was planned

this is the library I was thinking of using to make it happen

https://godoc.org/github.com/facebookgo/grace
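
a rough sketch of how the gracenet listener swap could look (not sshpiper's actual code; the port and the SIGUSR2 trigger are just examples):

    // rough sketch, not sshpiper's actual code: gracenet.Net.Listen either
    // creates a fresh socket or inherits one from the parent process, so an
    // upgraded child started via StartProcess() keeps accepting new
    // connections while the old process drains its existing sessions.
    package main

    import (
        "log"
        "os"
        "os/signal"
        "syscall"

        "github.com/facebookgo/grace/gracenet"
    )

    func main() {
        gnet := &gracenet.Net{}

        // replaces the plain net.Listen("tcp", addr) call
        lis, err := gnet.Listen("tcp", ":2222") // port is just an example
        if err != nil {
            log.Fatal(err)
        }
        _ = lis // hand the listener to the existing accept loop here

        // on SIGUSR2 (an example trigger), fork the upgraded binary, which
        // inherits the listening fd via the LISTEN_FDS environment variable
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGUSR2)
        <-sig
        if _, err := gnet.StartProcess(); err != nil {
            log.Fatal(err)
        }
    }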

@jprorama
Author

Thanks for the feedback and reference. Happy to help where we can. I've started looking over those libraries. They seem well encapsulated and gracenet seems like the place to start exploring.

I'm a newbie with Go but have experience with network programming in C and Python.

Could you advise on where in the code base you would expect the grace network listener to replace the traditional network listener?

@tg123
Owner

tg123 commented Jan 31, 2025

check here

conn, err := d.lis.Accept()

it is the same as

accept(fd)

and

go func(conn)

can be interpreted as pthread_create(func) (not a thread, but the analogy helps you understand)
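
the whole pattern as a standalone sketch (generic Go, not sshpiper's actual code):

    package main

    import (
        "log"
        "net"
    )

    // generic sketch of the accept-and-dispatch pattern
    func main() {
        lis, err := net.Listen("tcp", ":2222") // example port
        if err != nil {
            log.Fatal(err)
        }
        for {
            conn, err := lis.Accept() // blocks like accept(fd)
            if err != nil {
                log.Fatal(err)
            }
            // each connection is handled concurrently: roughly the
            // pthread_create(func) of the analogy, but with a goroutine
            go func(c net.Conn) {
                defer c.Close()
                // proxy the ssh session here
            }(conn)
        }
    }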

would really appreciate it if you can send a PR

the reason the feature was delayed is that, in our case, sshpiper now runs inside containers in a large cluster; as a result, managing state across servers to support graceful restart is much more difficult.
the lib from facebook is also archived, and I am not sure if there is any active project doing the same

@jprorama
Author

Thanks for the additional pointers. We will continue to explore this to see how we can contribute. It may take us some time to prepare a pull request.

We are starting our use of sshpiper as a traditional long-lived system service on a front-end proxy into traditional ssh servers that are cluster login nodes shared by all users of a cluster.

Our thinking is that we will eventually isolate users into individual, containerized login sessions on a k8s cluster. We've been thinking those containers would run OpenSSH. We would, however, still need an ssh proxy plane (sshpiper) to route connections to the per-user login containers.

I assume you are referring to running the sshpiper in a container on the proxy plane. What issues do you see that complicate containerizing the sshpiper workload? Does it come from having to do session management in the container runtime as well?

@tg123
Owner

tg123 commented Feb 1, 2025

more details on why graceful restart is not that important

first, unlike http, ssh connections are typically long-lived, which means a graceful restart may take an unbounded amount of time until a timeout kicks in, and that timeout still breaks the connection and causes a non-graceful disconnect.
second, a graceful restart usually happens when sshpiper is being upgraded, which happens less than once a month; thus, a non-graceful restart is somewhat acceptable

in addition, it is hard to transfer a live connection to another, upgraded server instance; that would be the better or more correct approach to graceful restart

@jprorama
Author

jprorama commented Feb 1, 2025

Ok, so our traditional use case of a single proxy server and single ssh login node is more like a 1:1 containerized implementation, where one sshpiper container proxies for one ssh login container. If instead we had m sshpiper proxy containers and n ssh login nodes, with a reasonably even connection distribution across the proxies, then restarting any one sshpiper would only impact roughly 1/m of the active connections.

Is this how you are viewing the situation?

A once-a-month update-restart cadence isn't all that bad. I'm not familiar enough with container orchestration environments to know how connection state could be maintained in those environments as a way to do graceful restart.

If we could adopt this one-update-per-month cadence with our current config, we could likely tolerate breaking any connections that happen to be active during the sshpiper proxy restart. This would make graceful restart not that important in our environment as well.

Our perceived need for graceful restart actually stems from more frequent restarts caused by our use of the failtoban plugin. It's our understanding that we can't clear individual IPs out of the sshpiper ban cache unless we restart sshpiper.

We have traditionally used the system Fail2ban to protect the public interface of our ssh endpoints. We didn't see the expected event logging in sshpiper that would allow Fail2ban to do the failed-login event counting. We thought we could use the failtoban plugin in sshpiper as a convenient counter implementation, but then we noticed we could not unban select IPs without an sshpiper restart that flushes the whole ban cache. Obviously this restart-based flush kills all other active ssh sessions. That's undesirable. :)

Do you have any suggestions on how we might either a) get a password-failure event log stream from sshpiper that we could monitor with the system Fail2ban, or b) selectively remove entries from the failtoban plugin's built-in ban cache?

@tg123
Owner

tg123 commented Feb 1, 2025

omg, you are right, the only way to reset failtoban is to restart.
that is a bad design; a file-based database should be introduced to failtoban

the failtoban code is very simple, so maybe you can create your own plugin to handle your case

you can simply add more logging here
https://github.com/tg123/sshpiper/blob/b2f7f79cc485cea23a2d7db5bd436b6265df9b81/plugin/failtoban/main.go#L67C1-L68C1
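
for example, something like this (a hedged fragment; the logger call and the user/remoteAddr identifiers are placeholders, not necessarily what the plugin has in scope at that line):

    // hypothetical addition near the linked line: emit a parseable failure
    // record that an external Fail2ban filter could match on; `user` and
    // `remoteAddr` are placeholder names, not the plugin's real identifiers
    log.Printf("failtoban: auth failure for user %q from %s", user, remoteAddr)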

@tg123
Owner

tg123 commented Feb 5, 2025

added #519 to support SIGHUP to reset
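
the general pattern looks roughly like this (a standalone sketch, not necessarily how #519 implements it; banCache and its mutex are hypothetical names):

    package main

    import (
        "os"
        "os/signal"
        "sync"
        "syscall"
    )

    // banCache stands in for the plugin's in-memory failure counter;
    // the real field names in failtoban may differ.
    var (
        banCacheMu sync.Mutex
        banCache   = map[string]int{}
    )

    // clear the ban cache whenever the process receives SIGHUP, so operators
    // can unban everyone without restarting the proxy and killing sessions
    func resetOnSIGHUP() {
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGHUP)
        go func() {
            for range sig {
                banCacheMu.Lock()
                banCache = map[string]int{}
                banCacheMu.Unlock()
            }
        }()
    }

    func main() {
        resetOnSIGHUP()
        select {} // keep the sketch process alive
    }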

@jprorama
Author

Thanks for the quick fix. We are exploring it but need to update some of our local code modifications for supporting group-based routing. Will follow up when that's complete. Or feel free to close this issue if needed.
