Fix issue 48 #71

LennoxLiu · 2024-04-16T09:42:53Z

The issue

Issue #48 reported sporadic model crashes on Helix. When the load balancer starts the servers, a server occasionally fails to start because its randomly selected port is already in use.

Debugging

I found that some ports were free when checked but occupied by the time the server tried to use them. Most likely, another process took the port in the window between the check and the use.

I also noticed that some jobs did select a new port during the test, so the checking logic itself appears to be correct.

Solution

I first tried to reserve the port with nc -l $port & at check time and release it just before starting the server, but releasing it with fuser -k -n tcp $port was not reliable on Helix, so I dropped this approach.

Instead, I added a timeout check in job.sh: if the script waits longer than $timeout seconds for a server to respond, it calls itself again to restart the server.
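For readers who want a picture of the wait-then-retry pattern, here is a minimal sketch. It is not the code in job.sh: the function name wait_for_server, the one-second polling interval, and the health-check command passed to it are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a timeout while waiting for a server.
# wait_for_server TIMEOUT_SECONDS CHECK_CMD [ARGS...]
# Runs CHECK_CMD about once per second until it succeeds.
# Returns 0 once the check succeeds, 1 if TIMEOUT_SECONDS elapse first.
wait_for_server() {
    local timeout="$1"
    shift
    local waited=0
    until "$@" >/dev/null 2>&1; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage idea (commented out; my_health_check is a placeholder, not a real tool):
# if ! wait_for_server "$timeout" my_health_check "$port"; then
#     echo "Server did not respond within ${timeout}s, restarting" >&2
#     exec "$0" "$@"   # rerun this script from the start
# fi
```

Replacing the current process with exec avoids stacking one shell per retry, though a plain recursive call would work as well for a small number of retries.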

Since issue #48 only occurs roughly once in a hundred runs, the total run time should not increase significantly. Also, because job.sh sets a time limit for HQ jobs, the retry cannot repeat indefinitely.
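As a rough estimate of the overhead, assume each start attempt fails independently with probability p ≈ 0.01 (the "once in a hundred" figure above) and that a failed attempt costs about T = $timeout seconds before the restart. Both assumptions are simplifications, not measurements:

```latex
\mathbb{E}[\text{attempts}] = \frac{1}{1-p} \approx 1.0101,
\qquad
\mathbb{E}[\text{extra wait per job}] = \frac{p}{1-p}\,T \approx 0.0101\,T
```

Under these assumptions the expected extra wait per job is about 1% of $timeout.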

Note that if many servers need to restart, the $timeout set in job.sh may be too short for the server to start.

Other changes

Added a PORT check in the Makefile before executing the load balancer, since there was no error message when the load balancer's port was occupied. (Also set HQ_SUBMIT_DELAY to 100ms.)
Added my debugging code using MultiplyBy2 in the test folder.
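A pre-flight port check of this kind could look roughly like the sketch below. It is an illustration, not the check added to the Makefile: the function names are hypothetical, and it relies on Bash's /dev/tcp redirection, which only detects a listener that accepts connections on the given host.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a port check before launching a service.

# port_in_use HOST PORT
# Succeeds (exit 0) if a TCP connection to HOST:PORT can be opened.
port_in_use() {
    local host="$1" port="$2"
    # The subshell keeps file descriptor 3 local, so it closes on exit.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# require_free_port PORT
# Prints an explicit error and returns 1 if something listens on PORT.
require_free_port() {
    local port="$1"
    if port_in_use 127.0.0.1 "$port"; then
        echo "Error: port ${port} is already in use" >&2
        return 1
    fi
    return 0
}

# Usage idea (commented out): abort before starting the load balancer.
# require_free_port "$PORT" || exit 1
```

Like any check-then-use test, this leaves a short window in which another process can still grab the port, so it mainly serves to produce a clear error message.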

LennoxLiu added 5 commits April 16, 2024 10:37

Update MultiplyBy2 test

67166b7

Check PORT in Makefile run

eff2e5a

Adding this because there is no error message when the load balancer fails to start because its port is in use. Also add HQ_SUBMIT_DELAY_MS for more stable performance.

Fix a typo in LoadBalancer.cpp

a4ba4a0

Add job.sh for testing in MultiplyBy2

54b6bab

Apply fix for issue 48

1717c6d

Added a timeout when waiting for the server to respond. If the timeout is reached, job.sh is rerun to restart the server.

LennoxLiu requested a review from linusseelinger April 16, 2024 09:43

LennoxLiu linked an issue Apr 18, 2024 that may be closed by this pull request

HPC: Sporadic model crashes on Helix #48

Closed
