Enh/error on timeout #683

PGijsbers · 2024-12-22T12:46:04Z

Closes #681

If a process "completes" only because it exceeded the activity timeout, an error should be raised instead of simply continuing.

This avoids situations where the benchmark continues to run, thinking that the process completed successfully.

codecov-commenter · 2024-12-22T12:47:31Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.70%. Comparing base (b719142) to head (6bb1085).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #683      +/-   ##
==========================================
+ Coverage   68.15%   68.70%   +0.54%     
==========================================
  Files          54       55       +1     
  Lines        6730     6749      +19     
==========================================
+ Hits         4587     4637      +50     
+ Misses       2143     2112      -31

PGijsbers · 2024-12-22T12:47:59Z

amlb/utils/process.py

        retcode = process.poll()
+        if retcode is None:


As far as I can tell, this is a reliable way to tell the process is still running. And if the process is still running at this point I think the only reason can be that the communicate function returned early, which should only happen with an activity timeout.
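A minimal sketch of the poll() behaviour this relies on, using only the standard library. The sleeping child command is an illustrative assumption and is not taken from amlb:

```python
import subprocess
import sys

# Start a child that sleeps long enough to still be alive at the first poll.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
still_running = child.poll()  # None: the child has no return code yet

child.kill()
child.wait()  # reap the child so that a return code becomes available
finished = child.poll()  # an int once the child has terminated
```

The exact return code after kill() is platform dependent, so the sketch only distinguishes None from an integer.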

Hi @PGijsbers, I am not very familiar with this section of the code base but the logic looks reasonable to me.

Do I understand correctly that after the previous try/except block has been completed, the process should not be running, and this code block ensures that this is the case?

The stdout, stderr = communicate(process, input, timeout=timeout) line blocks until no more output is detected on stdout or stderr. A lack of output means that either there was an activity timeout (nothing was written during the specified interval) or the process stopped. The communication logic can't detect which.
That's why at this level, after communication has stopped, we can do an extra check (this one) to see if the process is still alive. If it is, we can infer that communication stopped because a timeout was detected. In that case the subprocess won't stop by itself, so we send the kill signal and raise an error, since a normal return from this function would indicate that the process finished successfully.
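The flow described above could look roughly like the sketch below. It is not the amlb implementation: read_until_silent, run, the grace parameter, and this StaleProcessError class are hypothetical stand-ins, and only stdout is watched to keep it short.

```python
import queue
import subprocess
import sys
import threading


class StaleProcessError(Exception):
    """Stand-in for the error type named in this PR; the real class may differ."""


def read_until_silent(process, activity_timeout):
    """Hypothetical helper: gather stdout until EOF or until no line arrives
    for `activity_timeout` seconds. It cannot tell the caller which happened."""
    lines = queue.Queue()

    def pump():
        for line in process.stdout:
            lines.put(line)
        lines.put(None)  # sentinel: stdout reached EOF

    threading.Thread(target=pump, daemon=True).start()
    collected = []
    while True:
        try:
            line = lines.get(timeout=activity_timeout)
        except queue.Empty:
            break  # nothing was written in time: activity timeout
        if line is None:
            break  # the pipe closed: the process is ending
        collected.append(line)
    return "".join(collected)


def run(cmd, activity_timeout, grace=1.0):
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    output = read_until_silent(process, activity_timeout)
    try:
        # Assumption of this sketch: give a process that just closed its pipe
        # a brief moment to finish exiting before polling.
        process.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        pass
    retcode = process.poll()
    if retcode is None:
        # Still alive after communication stopped, so the silence must have
        # been an activity timeout. Kill the process and fail loudly.
        process.kill()
        process.wait()
        raise StaleProcessError(f"no output for {activity_timeout} seconds")
    return output, retcode
```

A command that prints and exits returns its output and return code, while a command that goes quiet for longer than activity_timeout is killed and surfaces as StaleProcessError instead of a normal return.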

And also thanks so much for doing the review anyway :) I understand that, generally speaking, people aren't too familiar with the internals (to be frank, I also had to brush up on some of it, as Seb wrote this). I just appreciate a sanity check, both for correctness and to avoid me refactoring/solving things in a way that only I will be able to understand later.

Thanks for the explanation, that makes sense. I would maybe add a small comment indicating the two possible reasons why retcode is None. At least for me, that would be helpful for understanding the purpose of this block.

PGijsbers · 2024-12-22T12:48:55Z

amlb/utils/process.py

+        # if a pipe is not ready it could be timeout or it could be end of process
+        # so at this point we do not know. Only after the communicate function is over do we know.
+        # i.e., if the process is still running it does not have a retcode.


Preferably I would have raised the activity timeout error from the function that uses the timeout. Unfortunately, at this stage we cannot detect whether the error should be raised.

PGijsbers · 2024-12-22T13:04:56Z

runbenchmark.py

-        res = bench.run(args.task, args.fold)
+    try:
+        bench.setup(amlb.SetupMode[args.setup])
+    except StaleProcessError as e:


I'll move to exception notes instead and/or revise this structure. The reason I did it this way is to communicate more clearly to the user with a final message what went wrong and how to solve it. I want to generally make errors easier to parse, as there are some issues opened that are completely solvable from the traceback, but users can't/don't try to parse those.

shchur · 2025-01-06T10:27:26Z

amlb/utils/process.py

        retcode = process.poll()
+        if retcode is None:


Hi @PGijsbers, I am not very familiar with this section of the code base but the logic looks reasonable to me.

Do I understand correctly that after the previous try/except block has been completed, the process should not be running, and this code block ensures that this is the case?

PGijsbers added 3 commits December 22, 2024 14:15

Raise an error when a process stops due to activity timeout

4dc7ad6

This avoids situations where the benchmark continues to run, thinking that the process completed successfully.

Add tests around raising StaleProcessErrors

95e905d

More explicit warning that activity timeout requires live output

0660453

Simplify the exception

6bb1085

PGijsbers requested a review from shchur December 22, 2024 13:05

shchur approved these changes Jan 6, 2025
