Replies: 2 comments
-
@IsaacYangSLA, can you help comment on this? Thanks in advance. I will also try this when I have time.
-
Sorry, I did not respond to this earlier and only just noticed your questions.
-
Hello, I am trying to orchestrate everything in a Kubernetes environment on 2 instances in different networks. I generated the provision file with the HA setting and the Helm builder so that I could deploy it using the Helm chart.
To give you a brief overview of my deployment: I used Netmaker to create a private network and joined both instances to it, so the instances can communicate via the Netmaker interface IP. I created the Kubernetes cluster with kubeadm and set node-ip to the private Netmaker IP in the kubelet arguments on both instances. I used the Calico CNI for pod networking, and all pods are running and ready. I added the ingress-nginx controller to expose the pod ports for the FL server by updating the config map and daemon set sections of the YAML file, as described in the NVFlare Helm deployment guide - https://nvflare.readthedocs.io/en/latest/user_guide/helm_chart.html. After this, I used Helm to install the NVFlare server on Kubernetes, which created 3 pods (Server1, Server2, and Overseer), all of which are running and ready.
The deployment of the NVFlare server was successful and I was able to log in to the admin console, but I encountered an issue when trying to start the client sites (site-1 and site-2). The site logs show the following error:
Cell - INFO - site-1: created backbone external connector to grpc://server2:8102
2023-04-25 12:17:22,020 - ConnectorManager - INFO - 1227537: Try start_listener Listener resources: {'secure': False, 'host': 'localhost'}
2023-04-25 12:17:22,020 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00002 PASSIVE tcp://0:26496] is starting
2023-04-25 12:17:22,521 - Cell - INFO - site-1: created backbone internal listener for tcp://localhost:26496
2023-04-25 12:17:22,521 - nvflare.fuel.f3.sfm.conn_manager - INFO - Connector [CH00001 ACTIVE grpc://server2:8102] is starting
2023-04-25 12:17:22,522 - FederatedClient - INFO - Wait for engine to be created.
2023-04-25 12:17:30,328 - nvflare.fuel.f3.sfm.conn_manager - INFO - Retrying [CH00001 ACTIVE grpc://server2:8102] in 8 seconds
2023-04-25 12:17:38,535 - nvflare.fuel.f3.sfm.conn_manager - INFO - Retrying [CH00001 ACTIVE grpc://server2:8102] in 16 seconds
2023-04-25 12:17:53,051 - MPM - ERROR - main_func execute exception: Login failed.
2023-04-25 12:17:53,052 - MPM - ERROR - Traceback (most recent call last):
  File "/home/kubeflare/.local/lib/python3.10/site-packages/nvflare/fuel/f3/mpm.py", line 144, in run
    rc = main_func()
  File "/home/kubeflare/.local/lib/python3.10/site-packages/nvflare/private/fed/app/client/client_train.py", line 120, in main
    raise RuntimeError("Login failed.")
RuntimeError: Login failed.
2023-04-25 12:17:55,254 - MPM - INFO - MPM: Good Bye!
I have reviewed the discussion on GitHub (#1130, reply in thread), which suggests that this error could be related to the TLS settings. I would greatly appreciate your guidance on how to resolve this issue.
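Since the log shows the client repeatedly retrying `grpc://server2:8102` before the login fails, it may be worth ruling out plain name resolution and TCP reachability before looking at TLS. The following is a minimal sketch of such a check using only the Python standard library. The helper name `check_endpoint` is hypothetical and not part of NVFlare. The sketch only tests DNS and a TCP connect, and it says nothing about whether the gRPC or TLS handshake would succeed.

```python
import socket


def check_endpoint(host: str, port: int, timeout: float = 3.0) -> dict:
    """Report whether `host` resolves and whether a TCP connection to `port` opens.

    Hypothetical diagnostic helper, not an NVFlare API. It does not attempt
    any gRPC or TLS handshake.
    """
    result = {"host": host, "port": port, "resolved": None,
              "reachable": False, "error": None}

    # Step 1: name resolution, as the client host would see it.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["resolved"] = infos[0][4][0]
    except socket.gaierror as exc:
        result["error"] = f"DNS resolution failed: {exc}"
        return result

    # Step 2: plain TCP connect to the resolved endpoint.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"

    return result
```

Running `check_endpoint("server2", 8102)` on the machine that hosts site-1 (the host and port are taken from the log above) would show whether `server2` resolves there and whether the port exposed through the ingress accepts connections. If both succeed, the TLS settings become the more likely cause.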