[CORE-5696] Have a deduplicating job worker #18
The Either is quite confusing. Could you add a comment about the idea behind it so that people don't have to figure out what it's supposed to mean? I wonder if a simple data type wouldn't be better here.
And by "better" I mean more readable...
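To make the "simple data type" suggestion concrete, here is a minimal sketch. The thread does not show what the `Either` actually carries, so this assumes it separates an ordinary batch of jobs from a single job that stands in for a group of duplicates. Every name below (`ReservedEither`, `Reserved`, `StandardBatch`, `DedupGroup`, `fromEither`, `reservedIds`) is hypothetical and not taken from the PR.

```haskell
-- Hypothetical sketch: none of these names come from the PR.
-- Assumption: the Either separates "ordinary batch of jobs" from
-- "one job standing in for a group of duplicates".

-- With Either, the reader has to remember what Left and Right mean.
type ReservedEither idx = Either [idx] (idx, [idx])

-- A dedicated type names both cases.
data Reserved idx
  = StandardBatch [idx]  -- jobs that are processed independently
  | DedupGroup idx [idx] -- the job that runs, plus the duplicates it covers
  deriving (Eq, Show)

-- The conversion shows that both representations carry the same information.
fromEither :: ReservedEither idx -> Reserved idx
fromEither (Left ids)          = StandardBatch ids
fromEither (Right (job, dups)) = DedupGroup job dups

-- Every id touched by a reservation, regardless of mode.
reservedIds :: Reserved idx -> [idx]
reservedIds (StandardBatch ids)   = ids
reservedIds (DedupGroup job dups) = job : dups
```

Pattern matches on `StandardBatch` and `DedupGroup` document themselves at each use site, which is the readability gain being asked for.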
This doesn't seem right. You're picking the first job from the list and later, in `updateJob`, assuming that it was the one with the highest id, but why? The query above doesn't sort on the `id` field. But even if you take the highest one, it's not guaranteed that you want to update all jobs with a lower id later (once looking at `run_at` is fixed in the `reservedJobs` query).

Uhh, this looks to be more complicated than I first thought it would be (even more so considering my other comment below).
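The ordering concern can be shown with a small sketch. It assumes the reserved rows arrive as a non-empty list of ids in whatever order the database returned them. `JobId`, `firstReturned` and `highestId` are hypothetical names used only for illustration.

```haskell
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NE

-- Hypothetical job identifier; only the ordering matters here.
newtype JobId = JobId Int deriving (Eq, Ord, Show)

-- Depends on row order, which SQL does not guarantee without ORDER BY.
firstReturned :: NonEmpty JobId -> JobId
firstReturned = NE.head

-- Independent of row order: always the largest id in the group.
highestId :: NonEmpty JobId -> JobId
highestId = maximum
```

For rows returned as `JobId 2 :| [JobId 7, JobId 5]`, the two functions disagree. This only covers the first objection; whether every lower id should be updated afterwards is a separate question that the sketch does not answer.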
Doesn't this almost entirely ignore the `run_at` column? There's no `run_at <= <?> now`, so this would just process any job, even ones scheduled in the future. But even if the condition was set, the de-duplicating job worker would not be very efficient at de-duplicating if jobs were scheduled into the future or when `ccNotificationChannel` is set.
Good catch. The mode should lock the group of jobs with the same deduplication id (`dedId`) that are scheduled to be processed with `run_at <= now`. If there are other jobs with this `dedId` scheduled for the future, they should be left alone.

The other problem here is not looking at the `reserved_by` column. However, introducing a `reserved_by` check like in the standard case doesn't fully solve the issue, because even if a job is still being processed, another row with the same `dedId` might be inserted after it started. That row would then be started in parallel to the old one and there's going to be a race :/
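The intended selection rule (same `dedId`, `run_at <= now`, future jobs untouched) can be sketched as a pure in-memory model. The `Job` record, its field names and `dueGroup` are assumptions made for illustration; the real worker would express this in its SQL query, and timestamps are plain integers here to keep the example pure.

```haskell
-- Hypothetical in-memory model of a row in the jobs table.
data Job = Job
  { jobId :: Int
  , dedId :: String
  , runAt :: Int -- seconds since some epoch, to keep the sketch pure
  } deriving (Eq, Show)

-- Ids of the jobs that share the given deduplication id and are already due.
-- Jobs scheduled for the future are left alone.
dueGroup :: Int -> String -> [Job] -> [Int]
dueGroup now key jobs =
  [ jobId j | j <- jobs, dedId j == key, runAt j <= now ]
```

With jobs `Job 1 "a" 10`, `Job 2 "a" 99` and `Job 4 "a" 20`, calling `dueGroup 20 "a"` selects ids 1 and 4 and skips the job scheduled at 99. The model says nothing about `reserved_by` or about rows inserted while a group is being processed, so it does not address the race described above.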
Ok, sorry, I feel like I'm arriving after the party…
So from what I see:
`select pg_advisory_lock(hash(dedid))`
or something like that, or we just create a table for this with the dedid as the unique column and PK. When you want to work on a dedid, you insert a record there. When you have finished, you delete it and commit. No one will be able to work on it in the meantime. Advisory locks are probably better here… you can tie them to a transaction or not (better in case you want them to be freed on error, for instance), and you have the "try" function variants. So maybe it would be simpler to just:

BTW, maybe I misunderstood how this works, I didn't look at the Haskell code all around.
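For reference, PostgreSQL offers four advisory lock functions that match the options mentioned above: session-level or transaction-level, blocking or "try". The sketch below only builds the query text. `Scope`, `Wait`, `advisoryLockFn` and `lockDedIdQuery` are hypothetical helpers, and using `hashtext` to turn the dedid into a lock key is an assumption standing in for the comment's `hash(dedid)`; it is an undocumented built-in and distinct dedids may collide on the same key.

```haskell
-- Session-level locks are held until released or the session ends;
-- transaction-level locks are released automatically at commit or rollback.
data Scope = Session | Transaction deriving (Eq, Show)

-- Block waits for the lock; Try returns a boolean immediately.
data Wait = Block | Try deriving (Eq, Show)

-- Name of the matching PostgreSQL advisory lock function.
advisoryLockFn :: Scope -> Wait -> String
advisoryLockFn scope wait = "pg_" ++ tryPart ++ "advisory_" ++ xactPart ++ "lock"
  where
    tryPart = case wait of
      Try   -> "try_"
      Block -> ""
    xactPart = case scope of
      Transaction -> "xact_"
      Session     -> ""

-- Query text taking the dedid as parameter $1.
-- Assumption: hashtext() stands in for the comment's hash(dedid).
lockDedIdQuery :: Scope -> Wait -> String
lockDedIdQuery scope wait =
  "SELECT " ++ advisoryLockFn scope wait ++ "(hashtext($1))"
```

`lockDedIdQuery Transaction Try` yields a query that a worker could run before touching a dedid and skip the group when it returns false; whether that fits the consumer's transaction handling is not something the thread settles.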
It would be better to change the type signature of `updateJobs` so that it takes only a single `(idx, Result)` in the deduplicating case 🤔 The problem now is that if something goes awry and multiple ids are passed here, the condition `id <= ANY (...)` will wreak havoc.
Also, how were tests passing with this bug? :) They should be updated accordingly.
Should we allow deduplicating on the primary key column of the jobs table, like `ccMode = Duplicating "id"`? If you do this right now you get an ambiguity error in the SQL query used for reserving jobs in the consumer, because part of the query used in `reserveJobs` becomes `SELECT id, id ...`
My approach would be not to do that if it's not necessary for proper function now. It can be added later if there's a need for it. And maybe document somewhere that you can't deduplicate based on fields that are called `id`.
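A rough illustration of where the duplicate column comes from, assuming the reserving query selects the primary key first and the deduplication expression second. The real query in `reserveJobs` is not shown in this thread and may be built differently; `selectList` and `clashesWithPrimaryKey` are hypothetical helpers.

```haskell
import Data.List (intercalate)

-- Assumption: the select list is the primary key followed by the
-- deduplication expression.
selectList :: String -> String
selectList dedupExpr = "SELECT " ++ intercalate ", " ["id", dedupExpr]

-- Hypothetical check for the restriction proposed above:
-- the deduplication expression must not be the primary key itself.
clashesWithPrimaryKey :: String -> Bool
clashesWithPrimaryKey = (== "id")
```

`selectList "id"` produces `SELECT id, id`, the fragment quoted in the question, while `selectList "country"` produces two distinct column names.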
`Duplicating` should probably take a non-empty array of expressions, as one may want to be able to de-duplicate on more than one expression. And the SQL expression type should be just `SQL` and not `RawSQL ()`.
`RawSQL ()` is fine, it's for "sql literals", i.e. values that can't hold parameters.
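A sketch of what the non-empty variant might look like. To stay self-contained it uses a plain `SqlLiteral` newtype as a stand-in for the parameterless `RawSQL ()` discussed above. Only the constructor name `Duplicating` comes from the thread; `Mode`, `Standard` and `dedupColumns` are hypothetical.

```haskell
import Data.List (intercalate)
import Data.List.NonEmpty (NonEmpty (..))
import qualified Data.List.NonEmpty as NE

-- Stand-in for a parameterless SQL fragment (the thread's RawSQL ()).
newtype SqlLiteral = SqlLiteral String deriving (Eq, Show)

-- Hypothetical mode type; NonEmpty rules out "deduplicate on nothing".
data Mode
  = Standard
  | Duplicating (NonEmpty SqlLiteral)
  deriving (Eq, Show)

-- Comma-separated deduplication expressions; Nothing in standard mode.
dedupColumns :: Mode -> Maybe String
dedupColumns Standard = Nothing
dedupColumns (Duplicating exprs) =
  Just (intercalate ", " [ e | SqlLiteral e <- NE.toList exprs ])
```

For example, `Duplicating (SqlLiteral "country" :| [SqlLiteral "lower(email)"])` de-duplicates on two expressions, and an empty list of expressions cannot be constructed at all.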