The issue
Issue #48 was about sporadic model crashes on Helix. When `load-balancer` is used to start the servers, a server occasionally fails to start because its randomly selected port is already in use.
Debugging
I found that some ports were free when checked but occupied by the time the server tried to use them. Most likely another process took the port in the window between the check and the use.
I also noticed that some jobs did select a new port during the test, so the check itself appears to be correct.
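The check-then-use gap described above can be illustrated with a minimal sketch. The helper names `port_is_free` and `pick_port` and the port range are hypothetical, and the actual check in `job.sh` may be implemented differently; this version uses Bash's `/dev/tcp` redirection and only detects listeners reachable on 127.0.0.1.

```shell
#!/usr/bin/env bash
# Sketch only: port_is_free and pick_port are hypothetical helpers,
# not the functions used in job.sh.

# Return 0 if nothing accepts connections on the given local TCP port.
port_is_free() {
    local port=$1
    # Bash's /dev/tcp redirection fails when nobody is listening on the port.
    # Limitation: this only sees listeners reachable via 127.0.0.1.
    ! (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null
}

# Pick a random port in [lo, hi] that looks free at the time of the check.
pick_port() {
    local lo=$1 hi=$2 port
    while :; do
        port=$(( lo + RANDOM % (hi - lo + 1) ))
        if port_is_free "$port"; then
            echo "$port"
            return 0
        fi
    done
}

port=$(pick_port 49152 65535)
# Race window: another process can bind $port right here, after the check
# succeeded but before the server (started later) binds the port itself.
echo "selected port $port"
```

A successful check only says the port was free at that moment, which is why a correct check can still be followed by a failed server start.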
Solution
I first tried to occupy the port with `nc -l $port &` during the check and release it just before starting the server, but releasing it with `fuser -k -n tcp $port` was not stable on Helix, so I didn't use this approach.
Instead, I added a timeout check to `job.sh`: if the script has waited more than `$timeout` seconds for a server to respond, it calls itself again to restart the server.
Since issue #48 only happens about once every hundred times, the usage time should not increase significantly. Also, because `job.sh` sets a time limit for HQ jobs, the retry cannot repeat indefinitely.
Note that if too many servers need to restart, the `$timeout` set in `job.sh` may be too small for the server to start.
Other changes
- `load-balancer`. Since there was no error message when the port of `load-balancer` was occupied. (Also set `HQ_SUBMIT_DELAY` to 100ms.)
- `MultiplyBy2` in the `test` folder.
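The timeout check described under Solution could look roughly like the sketch below. It is not the code from `job.sh`: `server_responds`, `wait_for_server`, the default of 60 seconds, and the `$server_pid` variable are assumptions made for illustration, and the real script may probe the server in a different way.

```shell
#!/usr/bin/env bash
# Sketch only. server_responds is a hypothetical probe; the real job.sh
# may test the server differently (for example with an HTTP request).

timeout=${timeout:-60}   # assumed default: seconds to wait for a response

# Hypothetical probe: succeeds once something accepts TCP connections on the port.
server_responds() {
    (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Poll once per second; return 1 if the server is still silent after $2 seconds.
wait_for_server() {
    local port=$1 limit=$2 waited=0
    until server_responds "$port"; do
        [ "$waited" -ge "$limit" ] && return 1
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}

# Possible use in job.sh, after the server was started in the background
# and its PID stored in $server_pid (an assumed variable name):
#
#   if ! wait_for_server "$port" "$timeout"; then
#       kill "$server_pid" 2>/dev/null   # drop the stuck attempt
#       exec "$0" "$@"                   # run this script again from the start
#   fi
```

Using `exec` to re-run the script is one possible way to have it "call itself again"; a plain `"$0" "$@"` call would also work. Either way, the HQ job time limit bounds the number of retries.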