a00636 DRAINED, sshd unstable #44

Open
Ulfgard opened this issue Oct 18, 2021 · 0 comments

Ulfgard commented Oct 18, 2021

Today at 15:04:24, node a00636 was reported down by Slurm. I checked journalctl and found several sshd-related errors, most notably that something was killed by its own watchdog (see attached screenshot). The node came back up later, then went down again at 15:41.

When I logged in to that node, the login took quite some time. I then checked the most recent messages in journalctl and again found the same message that appears in the screenshot:

"sshd: Bad protocol version identification '0' from 192.38.118.181"
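To see how often this message occurs and whether it always comes from the same address, the journal output could be tallied per source IP. The following is only a rough sketch: the function name `count_bad_idents` and the script name are made up for illustration, and the regular expression assumes the message has the form quoted above, with an IPv4 source address.

```python
import re
import sys
from collections import Counter

# Matches the sshd message quoted above; any journal prefix before it is ignored.
# Assumption: the source address is IPv4.
BAD_IDENT = re.compile(
    r"Bad protocol version identification '(?P<ident>.*)' "
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def count_bad_idents(lines):
    """Count 'Bad protocol version identification' messages per source IP."""
    counts = Counter()
    for line in lines:
        match = BAD_IDENT.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


if __name__ == "__main__":
    # Read log lines from stdin and print the most frequent sources first.
    for ip, n in count_bad_idents(sys.stdin).most_common():
        print(f"{n:6d}  {ip}")
```

It could be fed from the journal, for example with `journalctl -u sshd --since today | python3 count_bad_idents.py`. The unit name `sshd` is an assumption and may be `ssh` on some distributions.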

I then checked the slurmctld log on a00552 for that time and found the following:

16:14:21 Node a00636 now responding
16:14:21 Node a00636 returned to service

However, this was only about 10 s after I logged in to a00636.

I opened a ticket with Science IT and have DRAINed the node until it is resolved.
