[16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue dead jobs #716
base: 16.0
Conversation
Hi @guewen,
This looks very promising!
A few thoughts:
- There is a Caveat comment in `runner.py` that needs updating.
- This should handle the `enqueued` state too, which is the state when the runner has decided that a job needs to run but the `/queue_job/runjob` controller has not set it started yet. This state normally exists for a very short time, but I have seen situations where the workers are overloaded, take time to accept `/queue_job/runjob` requests, then die, leaving jobs in `enqueued` state forever.
- Since there are two race conditions, between `enqueued` and `started`, and between `started` and the time the job transaction actually starts with the lock, I wonder if we should not introduce a small elapsed-time condition (10s?) in `reset_dead_jobs`, maybe based on `date_enqueued`.
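The elapsed-time guard suggested above could be sketched as a small predicate. This is a minimal illustration, not code from the PR: the function name `should_reset` and the `GRACE_PERIOD` constant are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical grace period covering the enqueued->started and
# started->transaction-lock race windows described above.
GRACE_PERIOD = timedelta(seconds=10)


def should_reset(state, row_is_locked, date_enqueued, now):
    """Decide whether a job row looks dead and should be re-queued.

    A job in 'enqueued' or 'started' state whose lock row is not held,
    and which was enqueued more than GRACE_PERIOD ago, is presumed dead.
    """
    if state not in ("enqueued", "started"):
        return False
    if row_is_locked:
        return False  # a live worker still holds the lock row
    return now - date_enqueued > GRACE_PERIOD


now = datetime(2024, 1, 1, 12, 0, 30)
# Enqueued 30s ago, lock not held: presumed dead, reset it.
print(should_reset("started", False, datetime(2024, 1, 1, 12, 0, 0), now))  # True
# Enqueued 2s ago: still inside the race window, leave it alone.
print(should_reset("enqueued", False, now - timedelta(seconds=2), now))  # False
```

The grace period avoids re-queuing a job that sits briefly between states while its worker is still setting up the transaction lock.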
Thank you for your suggestions. I have implemented the necessary corrections.
A few more minor comments.
Thanks for tackling this old issue, I like this elegant solution. It should also be quite optimized. Congrats on this work!
@@ -1,17 +1,6 @@
<?xml version="1.0" encoding="utf-8" ?>
<odoo>
<data noupdate="1">
<record id="ir_cron_queue_job_garbage_collector" model="ir.cron">
Should this become `<delete id="ir_cron_queue_job_garbage_collector" model="ir.cron"/>` to clean up upgrades and avoid filling the cron logs with errors, since the `requeue_stuck_jobs` method is gone?
I have archived the cron in `pre-migration.py`; therefore, there won't be any error.
I always prefer to archive (set `active` to `False`) rather than delete.
@AnizR I think we can also remove the
Yes, jobs that have not been started will be re-queued by my new mechanism.
@@ -248,11 +210,8 @@ def urlopen():
# for HTTP Response codes between 400 and 500 or a Server Error
# for codes between 500 and 600
response.raise_for_status()
except requests.Timeout:
set_job_pending()
A timeout here is normal behaviour, so we don't want to log it as an exception.
right 👍
Please add a comment on why there is a `pass` here.
LGTM. Thanks for that.
@guewen Could you merge this one?
In the manifest, the version is 2.8.0; could you update this file accordingly, @AnizR?
@guewen We're finally going to do a few more tests before we merge this. We'll keep you informed.
Review LGTM.
I also did some tests with jobs exceeding `limit_time_cpu`, and it works as expected.
Thanks for addressing this problem.
I'm asking for more comments to make the intent explicit.
And I have a question: why all these joins, instead of using `id` or `uuid` directly?
@@ -238,6 +238,34 @@ def load_many(cls, env, job_uuids):
recordset = cls.db_records_from_uuids(env, job_uuids)
return {cls._load_from_db_record(record) for record in recordset}

def lock(self):
self.env.cr.execute(
Please add the intent of this `def` as a comment.
cr.execute(
"""
CREATE TABLE IF NOT EXISTS queue_job_locks (
id INT PRIMARY KEY,
Why not use `uuid` here, or conversely use only `id`, and get rid of all the joins?
@hparfr We probably could, but I don't think it would significantly change the main query in `requeue_dead_jobs`. And since we want a foreign key with ON DELETE CASCADE between job locks and jobs, it's perhaps more intuitive to have it on the primary key, as usual.
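For context, a lock table keyed on the job's primary key with a cascading foreign key might be declared as below. This is a sketch consistent with the hunk quoted above, not the PR's exact DDL; the constant name is illustrative.

```python
# Sketch: queue_job_locks rows reuse the job's integer primary key, and the
# ON DELETE CASCADE foreign key removes a lock row whenever its job is deleted,
# so the two tables can never drift apart.
CREATE_LOCK_TABLE = """
    CREATE TABLE IF NOT EXISTS queue_job_locks (
        id INT PRIMARY KEY,
        FOREIGN KEY (id) REFERENCES queue_job (id) ON DELETE CASCADE
    )
"""

# In Odoo this would typically run during module init as:
#   cr.execute(CREATE_LOCK_TABLE)
print("ON DELETE CASCADE" in CREATE_LOCK_TABLE)  # True
```

Putting the foreign key on the shared primary key keeps the join trivial and delegates cleanup to the database.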
Thanks for the clarification.
Goal
Automatically re-queue jobs that have been started but whose worker has been killed, and rely on Odoo's `limit_time_cpu` and `limit_time_real` for job execution.
Technical explanation
Everything relies on a new table, `queue_job_locks`, which contains the ids of jobs that have been started. When a job is executed, its row in `queue_job_locks` is locked. If a row is in `queue_job_locks` with `state='started'` but not locked, the job is either in the short race window before its transaction takes the lock, or its worker has been killed. Using this information, we can re-queue these jobs.
Why not lock directly in the `queue_job` table?
This was tried in #423, but it didn't work when a job raised an error: it seems the row was locked, and the handler then tried to write on that same row to set it as `failed` before committing.
before committing.Improve current behavior
Re-queue jobs that have been killed but increment their 'retries' to avoid having a job that is always get killed in infinite re-queuing.
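The retry-increment step can be sketched as follows. The field names mirror `queue_job`'s `retry`/`max_retries`, but the helper itself and its dict-based job representation are hypothetical, not the PR's code.

```python
# Sketch of the re-queue step: a killed job goes back to 'pending' with its
# retry counter incremented, and stops being re-queued once retries run out.


def requeue_dead_job(job):
    """Re-queue a presumed-dead job, giving up after max_retries attempts."""
    job["retry"] += 1
    if job["max_retries"] and job["retry"] > job["max_retries"]:
        job["state"] = "failed"  # killed too many times: stop re-queuing
    else:
        job["state"] = "pending"  # will be picked up again by the jobrunner
    return job


job = {"state": "started", "retry": 0, "max_retries": 2}
print(requeue_dead_job(job)["state"])  # pending (retry is now 1)
print(requeue_dead_job(job)["state"])  # pending (retry is now 2)
print(requeue_dead_job(job)["state"])  # failed  (retry 3 exceeds max_retries)
```

This is what prevents the infinite loop: a job that is killed on every attempt converges to a terminal failed state instead of consuming a worker forever.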