Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto reconnection after interruption #3795

Open
alekseydemidov opened this issue Feb 10, 2025 · 5 comments · Fixed by #3796
Open

Auto reconnection after interruption #3795

alekseydemidov opened this issue Feb 10, 2025 · 5 comments · Fixed by #3796
Assignees

Comments

@alekseydemidov
Copy link

alekseydemidov commented Feb 10, 2025

Describe the bug
We have multiple users who get issue with stability telepresence connection.
In case any network interruption telepresence just loses the connection and doesn't try to reconnect. We can see status as:
Traffic Manager: Not connected

connector.log:

2025-02-04 19:27:21.8252 info    ---
2025-02-04 19:27:21.8252 info    Telepresence Connector v2.20.1 (api v3) starting...
2025-02-04 19:27:21.8252 info    PID is 67912
2025-02-04 19:27:21.8252 info    
2025-02-04 19:27:21.8371 info    connector/server-grpc : gRPC server started
2025-02-04 19:27:23.6342 info    connector/session : -- Starting new session
2025-02-04 19:27:23.6343 info    connector/session : Connecting to k8s cluster...
2025-02-04 19:27:24.1105 info    connector/session : Server version v1.29.10
2025-02-04 19:27:24.1110 info    connector/session : Context: user-api.tst.example.org
2025-02-04 19:27:24.1111 info    connector/session : Server: https://api.tst.example.org
2025-02-04 19:27:24.2215 info    connector/session : "can-i watch namespaces" is not allowed
2025-02-04 19:27:24.7681 info    connector/session : Will look for traffic manager in namespace ambassador
2025-02-04 19:27:24.7682 info    connector/session : Connected to context username-api.tst.example.org, namespace stand1 (https://api.tst.example.org)
2025-02-04 19:27:24.8738 info    connector/session : Connecting to traffic manager...
2025-02-04 19:27:26.1740 info    connector/session : Connected to OSS Traffic Manager v2.20.1
2025-02-04 19:27:26.5047 info    connector/session : Configuration reloaded
2025-02-04 19:27:26.5118 info    connector/session : Connecting to root daemon...
2025-02-04 19:29:01.9166 error   connector/server-grpc/conn=3 : sshfs not installed: exec: "sshfs": executable file not found in $PATH
2025-02-04 19:29:37.2292 error   connector/server-grpc/conn=4 : sshfs not installed: exec: "sshfs": executable file not found in $PATH
2025-02-04 19:29:47.3191 error   connector/server-grpc/conn=4 : sshfs not installed: exec: "sshfs": executable file not found in $PATH
2025-02-04 19:33:24.9047 error   connector/session/dial-request-watcher : dial request stream recv: rpc error: code = Unavailable desc = error reading from server: EOF
2025-02-04 19:33:24.9048 error   connector/session/intercept-port-forward : manager.WatchIntercepts recv: rpc error: code = Unavailable desc = error reading from server: EOF
2025-02-04 19:36:37.0557 info    connector/session:shutdown_logger : shutting down (gracefully)...
2025-02-04 19:36:37.0557 info    connector/session:shutdown_logger : shutting down (gracefully)...
2025-02-04 19:36:38.2692 error   connector/session : closing spdyStream: traffic-manager-7484b86d75-qbz96.ambassador
2025-02-04 19:36:38.2694 info    connector/session : -- Session ended
2025-02-04 19:36:49.4060 info    connector/session : Configuration reloaded
2025-02-04 19:36:49.4077 info    connector:shutdown_logger : shutting down (gracefully)...
2025-02-04 19:37:13.5400 info    Starting socket listener for /tmp/telepresence-connector.socket
2025-02-04 19:37:13.5415 info    ---

daemon.log

2025-02-04 19:27:28.9770 info    daemon/session/network : Starting Endpoint
2025-02-04 19:27:28.9770 info    daemon/session/network : Setting cluster DNS to 192.168.88.120
2025-02-04 19:27:28.9771 info    daemon/session/network : Setting cluster domain to "cluster.local."
2025-02-04 19:27:28.9772 info    daemon/session/network : Dropping never-proxy "10.12.2.0/32" because it is not routed
2025-02-04 19:27:29.0122 error   daemon/session/network : failed to retrieve route for subnet 10.222.0.0/16: <nil>
2025-02-04 19:27:29.0234 info    daemon/session/dns/Server : Generating /etc/resolver/telepresence.cluster.local
2025-02-04 19:27:29.0238 info    daemon/session/dns/Server : Generating /etc/resolver/telepresence.tel2-search
2025-02-04 19:27:29.5248 info    daemon/session/dns/SearchPaths : Generating /etc/resolver/telepresence.svc
2025-02-04 19:27:29.5258 info    daemon/session/dns/SearchPaths : Regenerating /etc/resolver/telepresence.cluster.local
2025-02-04 19:27:29.5262 info    daemon/session/dns/SearchPaths : Regenerating /etc/resolver/telepresence.tel2-search
2025-02-04 19:27:29.5264 info    daemon/session/dns/SearchPaths : Regenerating /etc/resolver/telepresence.tel2-search
2025-02-04 19:27:29.5267 info    daemon/session/dns/SearchPaths : Regenerating /etc/resolver/telepresence.svc
2025-02-04 19:27:29.5271 info    daemon/session/dns/SearchPaths : Regenerating /etc/resolver/telepresence.cluster.local
2025-02-04 19:29:46.5791 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.1
2025-02-04 19:30:49.7550 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.1
2025-02-04 19:30:59.1859 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.1
2025-02-04 19:33:24.9070 error   daemon/session/network : WatchClusterInfo recv: Unavailable: error reading from server: EOF
2025-02-04 19:33:26.2023 info    daemon/session/network : also-proxy subnets []
2025-02-04 19:33:26.2026 info    daemon/session/network : never-proxy subnets [10.12.2.0/32]
2025-02-04 19:33:26.2027 info    daemon/session/network : allow-conflicting subnets []
2025-02-04 19:33:26.2027 info    daemon/session/network : Adding service subnet 10.222.0.0/16
2025-02-04 19:33:26.2028 info    daemon/session/network : Adding pod subnet 192.168.0.0/22
2025-02-04 19:33:26.2028 info    daemon/session/network : Adding pod subnet 192.168.4.0/22
2025-02-04 19:33:26.2029 info    daemon/session/network : Adding pod subnet 192.168.8.0/22
2025-02-04 19:33:26.2029 info    daemon/session/network : Adding pod subnet 192.168.12.0/22
2025-02-04 19:33:26.2029 info    daemon/session/network : Adding pod subnet 192.168.16.0/22
2025-02-04 19:33:26.2029 info    daemon/session/network : Adding pod subnet 192.168.20.0/22
2025-02-04 19:33:26.2029 info    daemon/session/network : Adding pod subnet 192.168.24.0/22
2025-02-04 19:33:26.2030 info    daemon/session/network : Adding pod subnet 192.168.28.0/22
2025-02-04 19:33:26.2030 info    daemon/session/network : Adding pod subnet 192.168.32.0/22
2025-02-04 19:33:26.2030 info    daemon/session/network : Adding pod subnet 192.168.36.0/22
2025-02-04 19:33:26.2030 info    daemon/session/network : Adding pod subnet 192.168.40.0/22
2025-02-04 19:33:26.2030 info    daemon/session/network : Adding pod subnet 192.168.44.0/22
2025-02-04 19:33:26.2031 info    daemon/session/network : Adding pod subnet 192.168.48.0/22
2025-02-04 19:33:26.2031 info    daemon/session/network : Adding pod subnet 192.168.52.0/22
2025-02-04 19:33:26.2031 info    daemon/session/network : Adding pod subnet 192.168.56.0/22
2025-02-04 19:33:26.2031 info    daemon/session/network : Adding pod subnet 192.168.60.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.64.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.68.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.72.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.76.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.80.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.84.0/22
2025-02-04 19:33:26.2032 info    daemon/session/network : Adding pod subnet 192.168.88.0/22
2025-02-04 19:33:26.2033 info    daemon/session/network : Setting cluster DNS to 192.168.88.120
2025-02-04 19:33:26.2033 info    daemon/session/network : Setting cluster domain to "cluster.local."
2025-02-04 19:33:26.2033 info    daemon/session/network : Dropping never-proxy "10.12.2.0/32" because it is not routed
2025-02-04 19:33:28.8451 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.1
2025-02-04 19:33:28.8490 info    daemon/session/agentPods : Connected to OSS Traffic Agent v2.20.1
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : !! CONN tcp 192.168.46.61:50132 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : !! CONN tcp 192.168.43.107:43270 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : failed to read initial StreamOK message: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : !! CONN tcp 192.168.51.23:58254 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : !! CONN tcp 192.168.52.145:36648 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : !! CONN tcp 192.168.51.23:54256 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : !! CONN tcp 192.168.24.104:60458 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0503 error   daemon/session/agentPods : failed to read initial StreamOK message: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0504 error   daemon/session/agentPods : !! CONN tcp 192.168.22.205:43298 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0504 error   daemon/session/agentPods : !! CONN tcp 192.168.43.107:58430 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0504 error   daemon/session/agentPods : !! CONN tcp 192.168.43.107:48264 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = context canceled
2025-02-04 19:36:37.0504 error   daemon/session/agentPods : !! CONN tcp 192.168.51.23:57230 -> 127.0.0.1:3000, call to manager Tunnel failed: rpc error: code = Canceled desc = grpc: the client connection is closing
2025-02-04 19:36:38.2710 info    daemon/session:shutdown_logger : shutting down (gracefully)...
2025-02-04 19:36:38.2711 info    daemon/session/dns:shutdown_logger : shutting down (gracefully)...
2025-02-04 19:36:38.2711 info    daemon/session/dns/Server:shutdown_logger : shutting down (gracefully)...

To Reproduce
interrupt network connection for short time

Expected behavior
Client should be possible to retry reconnect.
Also good to make logs more informative.

Versions (please complete the following information):

  • Version: v2.21.3
@thallgren
Copy link
Member

@alekseydemidov A couple of questions for you to clarify:

  1. The logs show 2.20.1 for both client and traffic-manager. You write 2.21.3 in your comment. Can I assume that you've tested both?
  2. How long is the "short time" for the interruption?
  3. Are any telepresence commands executed during the interruption to force the error?

@thallgren thallgren self-assigned this Feb 11, 2025
@alekseydemidov
Copy link
Author

Hi, thanks a lot for you questions:

  1. Right, we tried both version to be sure that the issue actual for both.
  2. "short time" - that's difficult question, because no any other services including online calling, like slack or google meet does not interrupt. So, I can just guess about some packets losing, because some users have the issue more often, someone very very rarely.
  3. Not really, just checking telepresence status, nothing more. The connection could be lost either during the started interception either without interception. So, we could not see any regularity, it happens for different customers in different region through with unpredictable time interval.

@thallgren
Copy link
Member

thallgren commented Feb 11, 2025

@alekseydemidov I've been doing some tests, trying to reproduce the behavior that is shown in the logs. I can conclude the following.

  1. At 19:33:24, the traffic-manager disconnects the client. The client receives EOF. This is not a network error.
  2. At 19:33:26, the traffic-manager sends an update to the root daemon, telling it what subnets to use.19:36:37
  3. At 19:36:37, the user daemon shuts down gracefully. The session ends which prints some ugly (but harmless) errors in the root daemon log.

I'm not sure what's causing the first event, but I don't think it's a network interruption. I've tried to force such interruptions and both daemons do recover when it happens.

In order to investigate this further, I suggest that you turn on debug-level logging, both for the client and the traffic-manager. Then, after the problem has resurfaced, please do a telepresence gather-logs and provide the zip.

@alekseydemidov
Copy link
Author

Thank you for feedback, and your assist
We've collected logs as you suggested (attached)
telepresence_logs_2025-02-12T13_34_34+03_00.zip

@thallgren
Copy link
Member

@alekseydemidov , a lot of improvements have been made in this area in the coming 2.22.0 release. If possible, please test it using 2.22.0-test.8.

@thallgren thallgren linked a pull request Mar 9, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants