Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition with retries and multiple workers #482

Open
eigenein opened this issue Oct 17, 2024 · 1 comment
Open

Race condition with retries and multiple workers #482

eigenein opened this issue Oct 17, 2024 · 1 comment

Comments

@eigenein
Copy link

eigenein commented Oct 17, 2024

Symptoms

I'm observing jobs randomly being run sooner than scheduled. As far as I can tell, this occurs with multiple workers and retried jobs.

Investigation

I believe I see a race condition in the code.

Disclaimer: I haven't verified this particular path, as it's extremely difficult to reproduce it. I did, however, verified that scaling down to 1 worker makes the issue go away.

Consider this scenario:

  1. Worker A fetches the job X:

job_ids = await self.pool.zrangebyscore(

  1. Worker B fetches multiple jobs, including the job X. Worker B goes ahead and iterates through them, making its way to the job X:

for job_id_b in job_ids:

  1. In the meantime, Worker A finishes the job X and catches a Retry. It increments the job score:

tr.zincrby(self.queue_name, incr_score, job_id)

  1. Now, Worker B gets a chance to run X. It reads the score again:

score = await pipe.zscore(self.queue_name, job_id)

And now, it’s in the future yet worker B continues normally 💥 As far as I can tell, steps 3 and 4 are not protected by the sync primitives. Does this sound plausible?

Possible fix

I haven't studied the code well enough. It looks like an additional check, like score > timestamp_ms() could be added around here to prevent the execution of a future retry:

if ongoing_exists or not score:

@RB387
Copy link
Contributor

RB387 commented Nov 26, 2024

Seems about right. We also experience the same thing
@chrisguidry @samuelcolvin could you please take a look?

Here is PR with fix: #487

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants