Today at 15:04:24, node a00636 was reported down by Slurm. I checked journalctl and found some sshd-related errors, in particular that something was killed by its own watchdog (see attached screenshot). The node came back up later, then went down again at 15:41.
When I logged in to that node, the login took quite some time. I checked the most recent messages in journalctl and again found the same message that also appears in the screenshot:
"sshd: Bad protocol version identification '0' from 192.38.118.181"
I then checked the slurmctld log on a00552 for that time and found the following:
16:14:21 Node a00636 now responding
16:14:21 Node a00636 returned to service
However, this was only about 10 s after I logged in on a00636.
I opened a ticket at science IT and set the node to DRAIN until the ticket is resolved.
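To see whether the sshd message quoted above comes from a single address or from several, the journal output could be tallied per source IP. The following is only a minimal sketch: the function name `count_bad_version_sources` is hypothetical, the regular expression assumes the message has the same shape as the one line quoted in this report, and the sample input is that line plus a placeholder, not real log data.

```python
import re
from collections import Counter

# Matches the sshd message quoted in this report. Assumption: the message body
# looks like the quoted line; the journal prefix before it is not relied upon.
BAD_VERSION = re.compile(
    r"Bad protocol version identification '(?P<ident>.*)' "
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def count_bad_version_sources(lines):
    """Count 'Bad protocol version identification' messages per source IP."""
    counts = Counter()
    for line in lines:
        match = BAD_VERSION.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__":
    # Sample input: the line quoted in the report plus an unrelated placeholder.
    sample = [
        "sshd: Bad protocol version identification '0' from 192.38.118.181",
        "some unrelated log line",
    ]
    for ip, n in count_bad_version_sources(sample).most_common():
        print(ip, n)
```

In practice the lines would come from saved journal output for the sshd unit on a00636 rather than from a hard-coded list.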