Limit number of auto_clone restarts #5397

Merged
merged 1 commit into from
Jan 10, 2024

perlpunk
Contributor

@perlpunk perlpunk commented Dec 14, 2023

It can happen that a job consistently fails with the same error. We want to prevent an endless cloning loop here.

Issue: https://progress.opensuse.org/issues/152569
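To illustrate the idea of capping the cloning loop, here is a minimal sketch in Python (openQA itself is written in Perl). It assumes that each clone keeps a reference to the job it was cloned from, and that a job is only auto-cloned when its failure reason matches a configured pattern. The `Job` class, the function names, and the `limit` parameter are hypothetical and are not taken from the actual change.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    """Hypothetical stand-in for a job record, not openQA's schema."""
    id: int
    reason: Optional[str] = None      # failure reason, None if the job did not fail
    origin: Optional["Job"] = None    # the job this one was cloned from, if any


def count_matching_ancestors(job: Job, pattern: re.Pattern) -> int:
    """Count how many consecutive ancestors failed with a matching reason."""
    count = 0
    current = job.origin
    while current is not None and current.reason and pattern.search(current.reason):
        count += 1
        current = current.origin
    return count


def should_auto_clone(job: Job, pattern: re.Pattern, limit: int) -> bool:
    """Clone only if the reason matches and the chain is still below the limit."""
    if not job.reason or not pattern.search(job.reason):
        return False
    return count_matching_ancestors(job, pattern) < limit


if __name__ == "__main__":
    pattern = re.compile(r"cache failure")  # assumed example pattern
    job = Job(1, "cache failure: download error")
    for next_id in range(2, 7):
        if not should_auto_clone(job, pattern, limit=3):
            print(f"job {job.id}: limit reached, not cloning")
            break
        job = Job(next_id, "cache failure: download error", origin=job)
```

With a limit of 3, the original job and its first two clones are each cloned once more, and the third clone is left alone, so the chain ends after three automatic restarts.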

Member

@okurz okurz left a comment

The problem I see with this is that apparently there is no clear communication to test reviewers that tests are restarted so many times. If auto-cloning stopped, then realistically any test reviewers who stumble across the scenario again would likely just hit the retrigger button anyway. So, how do we communicate the stop of auto-cloning to test reviewers?

etc/openqa/openqa.ini (outdated, resolved)
@perlpunk
Contributor Author

The problem I see with this is that apparently there is no clear communication to test reviewers that tests are restarted so many times

I don't really understand.
Is there currently any clear communication to test reviewers about when a job is auto-cloned at all? And "so many times" - well, it's basically endless if the cloned jobs also fail.

codecov bot commented Dec 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (4d4e5b7) 98.37% compared to head (9ffa730) 98.37%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5397   +/-   ##
=======================================
  Coverage   98.37%   98.37%           
=======================================
  Files         389      389           
  Lines       37643    37708   +65     
=======================================
+ Hits        37031    37096   +65     
  Misses        612      612           


@asdil12
Member

asdil12 commented Dec 14, 2023

Stop the clonging!

s/clonging/cloning/ ;)

@Martchus
Contributor

The problem I see with this is that apparently there is no clear communication to test reviewers that tests are restarted so many times

There is no clear communication with or without this PR. We have accumulated over 50 pages of jobs in the Next & Previous tab in relevant scenarios and apparently no reviewers took notice of it. If we now only have, say, 10 pages, this will not change anything for reviewers (except that they might no longer wonder why the heck openQA is endlessly restarting these jobs, if they cared about this scenario at all, which they apparently don't). So I don't see how this PR makes things worse.

Contributor

@Martchus Martchus left a comment

Looks generally good. Maybe we could add a comment stating that the maximum number of retries is exhausted instead of just doing nothing. (I guess that wouldn't be too much work.)
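As a rough illustration of this suggestion, the sketch below returns both the restart decision and an optional comment text for the job once the limit is reached. It is written in Python for brevity. The function name, the parameters, and the comment wording are hypothetical and do not reflect how openQA implements it.

```python
from typing import Optional, Tuple


def decide_restart(previous_restarts: int, limit: int) -> Tuple[bool, Optional[str]]:
    """Return (restart?, comment). The comment is only set when the limit is hit."""
    if previous_restarts < limit:
        return True, None
    # Hypothetical wording; a real message would be chosen by the maintainers.
    comment = (
        f"Not restarting automatically: the limit of {limit} "
        f"automatic restarts has been reached."
    )
    return False, comment


if __name__ == "__main__":
    for restarts in (0, 2, 3):
        print(restarts, decide_restart(restarts, limit=3))
```

Returning the comment text together with the decision keeps the "do nothing silently" case out of the caller: whenever the restart is refused, there is something to show to reviewers.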

lib/OpenQA/Schema/Result/Jobs.pm (outdated, resolved)
@perlpunk
Contributor Author

Maybe we could add a comment stating that the maximum number of retries is exhausted instead of just doing nothing. (I guess that wouldn't be too much work.)

I'm wondering how much work we should put into this, given that we also have the investigation tools, which can also do retries and add comments.
Maybe this feature should be moved there instead?

Member

@okurz okurz left a comment

We should rethink the overall approach. Obviously it's wasteful to restart jobs over and over again if nobody cares about the results. However, the auto-cloning triggered by the worker was always intended for cases where the need for a retry arises from something that is not the responsibility of any test maintainer, like the bugs we have in the cache service or terminating worker processes.

Your question whether we should move the retries to scripts makes me think that we need to clarify the original goal. Maybe we can talk in person next week?

Instead of stopping the automatic retries, I would rather think about preventing new incompletes from happening at all, maybe by not even starting jobs in continuously failing scenarios?
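One way to picture the alternative of not starting jobs in continuously failing scenarios is a check on the most recent results of a scenario before scheduling. The following Python sketch is purely illustrative. The function name, the `window` parameter, and the newest-first ordering of the results are assumptions, and nothing like this is part of the PR.

```python
from typing import Sequence


def scenario_is_stuck(recent_results: Sequence[str], window: int = 5) -> bool:
    """Return True if the newest `window` results are all incomplete.

    `recent_results` is assumed to be ordered newest first.
    """
    latest = list(recent_results)[:window]
    return len(latest) == window and all(r == "incomplete" for r in latest)


if __name__ == "__main__":
    history = ["incomplete"] * 5 + ["passed"]
    if scenario_is_stuck(history):
        print("skip scheduling: scenario keeps ending up incomplete")
```

A scenario with fewer than `window` recorded results is never treated as stuck here, so new scenarios are always given a chance to run.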

etc/openqa/openqa.ini (outdated, resolved)
lib/OpenQA/Schema/Result/Jobs.pm (outdated, resolved)
@perlpunk perlpunk marked this pull request as draft December 21, 2023 18:40
lib/OpenQA/Schema/Result/Jobs.pm (resolved)
etc/openqa/openqa.ini (resolved)
t/api/04-jobs.t (outdated, resolved)
t/api/04-jobs.t (outdated, resolved)
t/api/04-jobs.t (resolved)
@perlpunk perlpunk force-pushed the limit-auto-clone branch 4 times, most recently from b80e856 to cb9732c Compare January 4, 2024 16:28
@perlpunk perlpunk marked this pull request as ready for review January 4, 2024 16:34
@perlpunk perlpunk force-pushed the limit-auto-clone branch 2 times, most recently from 99045fb to 197c241 Compare January 5, 2024 15:02
lib/OpenQA/Schema/Result/Jobs.pm (outdated, resolved)
lib/OpenQA/Schema/Result/Jobs.pm (outdated, resolved)
lib/OpenQA/Schema/Result/Jobs.pm (outdated, resolved)
lib/OpenQA/Schema/Result/Jobs.pm (outdated, resolved)
@mergify mergify bot merged commit b5e992e into os-autoinst:master Jan 10, 2024
36 checks passed
@perlpunk perlpunk deleted the limit-auto-clone branch January 11, 2024 10:13
5 participants