[16.0][IMP] queue_job: remove dead jobs requeuer cron and automatically requeue dead jobs #716
base: 16.0
Conversation
Hi @guewen,
This looks very promising!
A few thoughts:
- There is a Caveat comment in `runner.py` that needs updating.
- This should handle the `enqueued` state too, which is the state when the runner has decided that a job needs to run but the `/queue_job/runjob` controller has not set it started yet. This state normally exists for a very short time, but I have seen situations where the workers are overloaded, take time to accept `/queue_job/runjob` requests, then die, leaving jobs in `enqueued` state forever.
- Since there are two race conditions, between `enqueued` and `started`, and between `started` and the time the job transaction actually starts with the lock, I wonder if we should not introduce a small elapsed-time condition (10s?) in `reset_dead_jobs`, maybe based on `date_enqueued`.
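The elapsed-time guard suggested above could be sketched as a small predicate. This is a minimal illustration, not code from the PR: the function name `should_reset` and the `GRACE_PERIOD` constant are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical grace period covering the enqueued->started and
# started->transaction-lock race windows described above.
GRACE_PERIOD = timedelta(seconds=10)


def should_reset(state, row_is_locked, date_enqueued, now):
    """Decide whether a job row looks dead and should be re-queued.

    A job in 'enqueued' or 'started' state whose lock row is not held,
    and which was enqueued more than GRACE_PERIOD ago, is presumed dead.
    """
    if state not in ("enqueued", "started"):
        return False
    if row_is_locked:
        return False  # a live worker still holds the lock row
    return now - date_enqueued > GRACE_PERIOD


now = datetime(2024, 1, 1, 12, 0, 30)
# Enqueued 30s ago, lock not held: presumed dead, reset it.
print(should_reset("started", False, datetime(2024, 1, 1, 12, 0, 0), now))  # True
# Enqueued 2s ago: still inside the race window, leave it alone.
print(should_reset("enqueued", False, now - timedelta(seconds=2), now))  # False
```

The grace period avoids re-queuing a job that sits briefly between states while its worker is still setting up the transaction lock.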
Thank you for your suggestions. I have implemented the necessary corrections.
A few more minor comments.
Thanks for tackling this old issue, I like this elegant solution. It should also be quite optimized. Congrats on this work!
@@ -1,17 +1,6 @@
<?xml version="1.0" encoding="utf-8" ?>
<odoo>
<data noupdate="1">
<record id="ir_cron_queue_job_garbage_collector" model="ir.cron">
Should this become `<delete id="ir_cron_queue_job_garbage_collector" model="ir.cron"/>` to clean up upgrades and avoid filling the cron logs with errors, since the `requeue_stuck_jobs` method is gone?
I have archived the cron in `pre-migration.py`; therefore, there won't be any error.
I always prefer to archive (set `active` to `False`) rather than delete.
@AnizR I think we can also remove the
Yes, jobs that have not been started will be re-queued by my new mechanism.
@@ -248,11 +210,8 @@ def urlopen():
# for HTTP Response codes between 400 and 500 or a Server Error
# for codes between 500 and 600
response.raise_for_status()
except requests.Timeout:
set_job_pending()
A timeout here is normal behaviour, so we don't want to log it as an exception.
right 👍
Please add a comment on why there is a `pass` here.
LGTM. Thanks for that.
@guewen Could you merge this one?
In the manifest, the version is 2.8.0; could you update this file accordingly, @AnizR?
@guewen We're finally going to do a few more tests before we merge this. We'll keep you informed.
Review LGTM.
I also did some tests with jobs exceeding `limit_time_cpu`, and it works as expected.
Thanks for addressing this problem.
I'm asking for more comments to make the intent explicit.
And I have a question: why all these joins, instead of using `id` or `uuid` directly?
@@ -238,6 +238,34 @@ def load_many(cls, env, job_uuids):
recordset = cls.db_records_from_uuids(env, job_uuids)
return {cls._load_from_db_record(record) for record in recordset}

def lock(self):
self.env.cr.execute(
Please add the intent of this `def` as a comment.
cr.execute(
"""
CREATE TABLE IF NOT EXISTS queue_job_locks (
id INT PRIMARY KEY,
Why not use `uuid` here, or conversely use only `id`, and get rid of all the joins?
@hparfr We probably could, but I don't think it would significantly change the main query in `requeue_dead_jobs`. And since we want a foreign key with ON DELETE CASCADE between job locks and jobs, it's perhaps more intuitive to have it on the primary key, as usual.
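For context, a lock table keyed on the job's primary key with a cascading foreign key might be declared as below. This is a sketch consistent with the hunk quoted above, not the PR's exact DDL; the constant name is illustrative.

```python
# Sketch: queue_job_locks rows reuse the job's integer primary key, and the
# ON DELETE CASCADE foreign key removes a lock row whenever its job is deleted,
# so the two tables can never drift apart.
CREATE_LOCK_TABLE = """
    CREATE TABLE IF NOT EXISTS queue_job_locks (
        id INT PRIMARY KEY,
        FOREIGN KEY (id) REFERENCES queue_job (id) ON DELETE CASCADE
    )
"""

# In Odoo this would typically run during module init as:
#   cr.execute(CREATE_LOCK_TABLE)
print("ON DELETE CASCADE" in CREATE_LOCK_TABLE)  # True
```

Putting the foreign key on the shared primary key keeps the join trivial and delegates cleanup to the database.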
Thanks for the clarification.
Goal
Automatically re-queue jobs that have been started but whose worker has been killed, and rely on Odoo's `limit_time_cpu` and `limit_time_real` for job execution.
Technical explanation
Everything relies on a new table, `queue_job_locks`, which contains the ids of jobs that have been started. When a job is executed, its row in `queue_job_locks` is locked. If a row is in `queue_job_locks` with `state='started'` but not locked, the job is either in the short race window before its transaction takes the lock, or its worker has been killed. Using this information, we can re-queue these jobs.
Why not lock directly in the `queue_job` table?
This was tried in #423, but it didn't work when a job raised an error: it seems the row was locked, and the handler then tried to write on that same row to set it as `failed` before committing.
before committing.Improve current behavior
Re-queue jobs that have been killed but increment their 'retries' to avoid having a job that is always get killed in infinite re-queuing.
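The retry-increment step can be sketched as follows. The field names mirror `queue_job`'s `retry`/`max_retries`, but the helper itself and its dict-based job representation are hypothetical, not the PR's code.

```python
# Sketch of the re-queue step: a killed job goes back to 'pending' with its
# retry counter incremented, and stops being re-queued once retries run out.


def requeue_dead_job(job):
    """Re-queue a presumed-dead job, giving up after max_retries attempts."""
    job["retry"] += 1
    if job["max_retries"] and job["retry"] > job["max_retries"]:
        job["state"] = "failed"  # killed too many times: stop re-queuing
    else:
        job["state"] = "pending"  # will be picked up again by the jobrunner
    return job


job = {"state": "started", "retry": 0, "max_retries": 2}
print(requeue_dead_job(job)["state"])  # pending (retry is now 1)
print(requeue_dead_job(job)["state"])  # pending (retry is now 2)
print(requeue_dead_job(job)["state"])  # failed  (retry 3 exceeds max_retries)
```

This is what prevents the infinite loop: a job that is killed on every attempt converges to a terminal failed state instead of consuming a worker forever.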