a00636 DRAINED, sshd unstable #44

Ulfgard · 2021-10-18T15:00:34Z

Today at 15:04:24, node a00636 was reported down by Slurm. I checked journalctl and found some sshd-related errors, in particular that something got killed by its own watchdog (see attached screenshot). The node came back up later, and at 15:41 it went down again.

When I logged into that node, the login took quite some time. I checked the most recent messages in journalctl and again found the same message that was also reported in the screenshot:

"sshd: Bad protocol version identification '0' from 192.38.118.181"
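To check whether these messages all come from the same address, a small script along these lines could help. This is only a sketch: the function name `count_bad_ident_sources` is made up for illustration, and the regular expression assumes the message format quoted above (some OpenSSH versions append a port number, which is ignored here).

```python
import re
from collections import Counter

# Matches the sshd message quoted above. Anything after the address
# (for example a trailing " port N") is ignored.
BAD_IDENT = re.compile(
    r"Bad protocol version identification '(?P<ident>.*)' from (?P<ip>\S+)"
)


def count_bad_ident_sources(lines):
    """Return a Counter mapping source address to number of matching messages."""
    counts = Counter()
    for line in lines:
        match = BAD_IDENT.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# Example with the single message from this report:
lines = ["sshd: Bad protocol version identification '0' from 192.38.118.181"]
print(count_bad_ident_sources(lines))
```

The input could be the saved output of journalctl for the sshd unit (the unit name may be `sshd` or `ssh` depending on the distribution), read line by line.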

I then checked the slurmctld log on a00552 for that time and found the following:

16:14:21 Node a00636 now responding
16:14:21 Node a00636 returned to service

But this was around 10 s after I logged in on a00636.

I opened a ticket with science IT and DRAINed the node until that is resolved.
