Optimizing WorkflowExecutor.enqueueReadyTasks #1752

Open
wants to merge 3 commits into master
Conversation

@frsyuki (Member) commented Jun 27, 2022

WorkflowExecutor.enqueueReadyTasks runs the following steps:

  1. Take at most executor.enqueue_fetch_size tasks
  2. For each task,
    2-1. Lock the task if the task is still ready
    2-2. Enqueue the task

Step 2-1 checks the task state again ("recheck") because another thread may
enqueue the task during the operation (note that step 1 doesn't lock the
tasks).

"recheck" needs to run a SELECT statement on the database to get state
& lock the task atomicly. If "recheck" doesn't pass, the task will be
ignored (step 2-2 doesn't run).
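
A minimal Java sketch of this flow (hypothetical, simplified names; not digdag's actual code):

```java
import java.util.List;
import java.util.Optional;

// Hypothetical simplification of the flow above; type and method names are
// illustrative, not digdag's actual API.
class EnqueueLoopSketch {
    interface TaskStore {
        List<Long> findReadyTaskIds(int limit);       // step 1: reads candidates, takes no lock
        Optional<Long> lockTaskIfReady(long taskId);  // step 2-1: locking SELECT that rechecks the state
    }

    void enqueueReadyTasks(TaskStore store, int enqueueFetchSize) {
        for (long taskId : store.findReadyTaskIds(enqueueFetchSize)) {
            // "recheck": another server may have enqueued this task after step 1,
            // so the locking SELECT can come back empty and the enqueue is skipped.
            store.lockTaskIfReady(taskId).ifPresent(this::enqueue);
        }
    }

    void enqueue(long lockedTaskId) {
        // step 2-2: push the locked task to the queue
    }
}
```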

"recheck" may not pass very frequently when following conditions are met:

  • executor.enqueue_fetch_size is large (the default of 100 is already large)
  • Many threads run against the same database (e.g. many digdag servers exist)
  • Step 2 takes a relatively long time (e.g. database operation latency is
    high, the database is temporarily overloaded, or the digdag server is
    temporarily overloaded)

Frequent "recheck" failures mean that a lot of SELECT operations waste
database workload. They also waste the digdag server's thread time.

This change optimizes steps 1 and 2 as follows:

  1. Find one ready task and lock it atomically
  2. Enqueue the task

On PostgreSQL, step 1 can be done with a single SELECT statement. This solves
the potential problem described above.
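
For illustration, this kind of single-statement find-and-lock can be expressed with PostgreSQL's FOR UPDATE SKIP LOCKED. A minimal JDBC sketch, assuming a simplified tasks table with a state column (not digdag's actual schema or data-access code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.OptionalLong;

class PostgresFindAndLockSketch {
    // Finds one ready task and locks its row in a single statement.
    // FOR UPDATE SKIP LOCKED makes concurrent servers skip rows that are already
    // locked, so the separate "recheck" SELECT is no longer needed.
    // Must run inside the same transaction that later enqueues the task.
    OptionalLong lockOneReadyTask(Connection connection) throws SQLException {
        String sql = "SELECT id FROM tasks"
                + " WHERE state = 'READY'"
                + " LIMIT 1"
                + " FOR UPDATE SKIP LOCKED";
        try (PreparedStatement stmt = connection.prepareStatement(sql);
             ResultSet rs = stmt.executeQuery()) {
            return rs.next() ? OptionalLong.of(rs.getLong("id")) : OptionalLong.empty();
        }
    }
}
```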

On the H2 database, step 1 still needs two SELECT statements, so this commit
won't improve performance there. But note that the problem above won't happen
on H2 because an H2 database isn't shared by multiple servers.
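
For comparison, a sketch of how a two-statement variant could look on H2, under the same assumed schema (again illustrative, not digdag's actual code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.OptionalLong;

class H2FindAndLockSketch {
    // First SELECT picks a candidate without locking; the second, locking SELECT
    // doubles as the "recheck" before the task is enqueued.
    OptionalLong lockOneReadyTask(Connection connection) throws SQLException {
        long candidateId;
        try (PreparedStatement find = connection.prepareStatement(
                    "SELECT id FROM tasks WHERE state = 'READY' LIMIT 1");
                ResultSet rs = find.executeQuery()) {
            if (!rs.next()) {
                return OptionalLong.empty();  // nothing ready right now
            }
            candidateId = rs.getLong("id");
        }
        try (PreparedStatement lock = connection.prepareStatement(
                "SELECT id FROM tasks WHERE id = ? AND state = 'READY' FOR UPDATE")) {
            lock.setLong(1, candidateId);
            try (ResultSet rs = lock.executeQuery()) {
                // an empty result means another server took the task between the two statements
                return rs.next() ? OptionalLong.of(candidateId) : OptionalLong.empty();
            }
        }
    }
}
```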

frsyuki added 3 commits March 25, 2022 14:39
MultiThreadAgent.run calculates maxAcquire = the number of idle threads, then
acquires at most maxAcquire tasks from the database. It sleeps 500ms before
the next iteration only if maxAcquire was 0.

However, there might be no tasks available in the database even if maxAcquire
is not 0. In such a case, it should sleep before the next iteration to avoid a
busy loop. This may reduce database workload.
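
A minimal sketch of the loop this commit describes, including the added sleep (type and method names are assumptions, not MultiThreadAgent's actual API):

```java
import java.util.List;
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical simplification of the agent's polling loop described above.
class AgentLoopSketch {
    interface TaskSource {
        List<Runnable> acquireTasks(int maxAcquire);  // polls the database; may return an empty list
    }

    private volatile boolean stopped = false;

    void run(TaskSource source, ThreadPoolExecutor executor) throws InterruptedException {
        while (!stopped) {
            int maxAcquire = executor.getMaximumPoolSize() - executor.getActiveCount();
            if (maxAcquire > 0) {
                List<Runnable> tasks = source.acquireTasks(maxAcquire);
                tasks.forEach(executor::execute);
                // The fix: also sleep when the poll came back empty, otherwise an idle
                // queue turns this into a busy loop of database polls.
                if (tasks.isEmpty()) {
                    Thread.sleep(500);
                }
            } else {
                Thread.sleep(500);  // previously the only case that slept
            }
        }
    }
}
```
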
Cleaning up changes by commit 739a185