chore(breadbox): Celery checker #144
base: master
Conversation
Looks fine, but I personally think the repeated try/catches are unnecessary and distracting:

```python
try:
    check_celery()
except CeleryConnectionError as err:
    raise err
```

What's the motivation for those?
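(For reference: since the except clause only re-raises the same exception, the wrapped version behaves identically to the bare call.)

```python
# Behaviorally equivalent to the try/except above: the exception
# still propagates to the caller unchanged.
check_celery()
```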
breadbox/breadbox/api/downloads.py (outdated)

```
@@ -67,7 +69,10 @@ def export_dataset(
    user: str = Depends(get_user),
    settings: Settings = Depends(get_settings),
):
    try:
```
I wonder if we could consolidate these calls.
Instead of calling utils.check_celery() before job submission, what do you think about putting it in get_task_status?
That's when the worker being offline is a problem: the status will be reported as pending forever.
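A rough sketch of that consolidation, assuming the check lives in a shared utils module and a standard AsyncResult lookup (the import paths and route shape here are guesses, not the PR's actual code):

```python
from fastapi import APIRouter

from breadbox import utils                              # assumed home of check_celery
from breadbox.compute.celery import app as celery_app   # assumed Celery app module

router = APIRouter()

@router.get("/tasks/{task_id}")
def get_task_status(task_id: str):
    # A dead worker is what makes a task look "pending" forever, so this
    # is the one place where failing loudly actually helps the client.
    utils.check_celery()  # assumed to raise CeleryConnectionError on failure
    result = celery_app.AsyncResult(task_id)
    return {"id": task_id, "state": result.state}
```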
Interesting! I figured the fail-fast approach would be better. I also considered creating a decorator or using FastAPI's dependency injection with Depends(), but that would still mean modifying each endpoint regardless. Ideally, I would put the check in Celery's delay() function to avoid repeated calls in each function that uses Celery, but patching that function felt too hacky…
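A sketch of the Depends() variant mentioned above (require_celery is a hypothetical wrapper around the existing check; note how every Celery-backed endpoint still has to opt in, which is the drawback being described):

```python
from fastapi import Depends, FastAPI

from breadbox import utils  # assumed home of check_celery

app = FastAPI()

def require_celery() -> None:
    # Hypothetical dependency: runs before the endpoint body and lets
    # CeleryConnectionError propagate if no worker is reachable.
    utils.check_celery()

# The dependency must be declared on each endpoint, so this approach
# "still modifies each endpoint" as noted in the comment above.
@app.post("/downloads", dependencies=[Depends(require_celery)])
def export_dataset():
    ...
```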
```python
    Pings to see if any worker responds and returns true if successful response detected
    """
    inspect = app.control.inspect()
    try:
```
I don't know that we need to explicitly connect to the broker before submitting a task. If the broker is offline, then we already get an exception thrown when it tries to submit the task to the broker.
That being said, I suppose it's harmless to try. Probably just costs a few milliseconds and I can't imagine it having any real downside.
@pgm Usually if Celery isn't running (at least in development), it's because we've forgotten to start the Celery broker. I figured returning an explicit error to the client could be helpful. This does make 3 retries to connect to the broker, which takes a few seconds rather than a few milliseconds, though, so maybe I should decrease the number of retries.
"Failed to connect to celery redis broker!" | ||
) from exc | ||
# Pings workers to see if any of them respond. Returns None if no response | ||
ping = inspect.ping() |
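Piecing the two excerpts together, the checker presumably looks something like this (the broker-connection retry and the exception wiring are inferred from the surrounding comments, not copied from the PR):

```python
from kombu.exceptions import OperationalError

def check_celery() -> bool:
    """
    Pings to see if any worker responds and returns true if successful
    response detected.
    """
    inspect = app.control.inspect()
    try:
        # Establish a broker connection up front; these retries are what
        # add the few seconds of latency discussed above.
        app.broker_connection().ensure_connection(max_retries=3)
    except OperationalError as exc:
        raise CeleryConnectionError(
            "Failed to connect to celery redis broker!"
        ) from exc
    # Pings workers to see if any of them respond. Returns None if no response
    ping = inspect.ping()
    if ping is None:
        raise CeleryConnectionError("Celery workers are not responding")
    return True
```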
That's interesting... The worker is asynchronous, but this call is clearly synchronous, so there must be a timeout baked into it. Regardless, if you've verified that this reliably identifies when the worker is offline, it looks good to me.
I looked up inspect.ping(), and from what I see in this SO post, I think any time there's a backlog of work, an exception will be thrown.

Imagine we have 1 worker and two users. User A submits a "custom analysis" job, and while that's running, user B submits a batch download. I think user B will get an exception saying "Celery workers are not responding" instead of A's job running and then B's job running.

Is that behavior intentional? Perhaps it's not bad, but at the very least I think we'd want a longer timeout than the default of 1 second.

Rejecting requests when there's a backlog has some nice properties ("load shedding"), but it also has the downside that when the server is busy, instead of jobs taking longer, users will have their jobs fail.
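If ping() stays, the timeout is configurable on the inspect call itself (the 10-second value below is just an illustration):

```python
# inspect() accepts a timeout in seconds; the default is 1.0.
inspect = app.control.inspect(timeout=10.0)
ping = inspect.ping()  # still returns None if nothing replies in time
```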
I could be wrong, though. It's something I think would be good to explicitly test, so we're sure we understand what's going to happen before deploying this change.
@pgm Yes, ping() has a default timeout of 1 second. HOWEVER, I just saw a concerning post saying that calling ping() may not return any response if all workers are busy running tasks. I don't think that's the behavior we want, since this check is meant to verify that workers exist. I'm thinking stats(), as the top Stack Overflow post suggested, would be better.
@pgm Oops, I didn't see your subsequent comments; you've basically addressed everything I was pointing out, including the SO post 😅 You're right, throwing an error when workers are occupied wasn't my intention. I think using stats() like the top SO reply suggested might be what I want, since it seems to return None if there are no workers present. I will test this first before making the change official.
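A sketch of the stats()-based variant being considered (the expectation that a busy worker still answers stats() comes from the SO thread, so it's worth verifying before relying on it):

```python
def check_celery() -> bool:
    # stats() broadcasts to all workers and returns None when none exist;
    # per the SO discussion it should still get replies from busy workers.
    stats = app.control.inspect().stats()
    if not stats:
        raise CeleryConnectionError("No running Celery workers were found")
    return True
```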
I can get rid of them! I think this was before I decided to put the try/excepts in health_check/ok (that endpoint is already calling that utils function).

To be done: tests are currently failing with Celery; need to modify behavior.