chore(breadbox): Celery checker #144

Open · wants to merge 5 commits into master

Conversation

@jessica-cheng (Contributor) commented Dec 5, 2024

  • Add a util function that checks whether Celery is running (see the sketch below)
  • In every function that calls a Celery task, call that util function first. The goal is to fail fast if Celery isn't running
  • Added a health check endpoint that checks whether Celery is running (TBD whether to keep it, since the health_check/ok endpoint already calls that utils function)

To be done: tests involving Celery are currently failing; the behavior needs to be modified.
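
For reference, a minimal sketch of what such a utility and health-check endpoint could look like (names such as the Celery app, CeleryConnectionError, and the route path are assumptions based on this conversation, not taken verbatim from the diff):

    from celery import Celery
    from fastapi import APIRouter

    app = Celery("breadbox")  # assumed app name
    router = APIRouter()


    class CeleryConnectionError(Exception):
        """Raised when no Celery worker responds."""


    def check_celery() -> None:
        # inspect().ping() broadcasts a ping to all workers and collects replies;
        # it returns None when no worker answers within the (default 1s) timeout.
        replies = app.control.inspect().ping()
        if not replies:
            raise CeleryConnectionError("Celery workers are not responding")


    @router.get("/health_check/celery")
    def celery_health():
        check_celery()  # raises, and surfaces an error, if Celery is not running
        return {"ok": True}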

@pgm (Contributor) left a comment


Looks fine, but I personally think the repeated try/catches are unnecessary and distracting:

try:
    check_celery()
except CeleryConnectionError as err:
    raise err

What's the motivation for those?

breadbox/breadbox/api/dataset_uploads.py (outdated)
@@ -67,7 +69,10 @@ def export_dataset(
user: str = Depends(get_user),
settings: Settings = Depends(get_settings),
):

try:
@pgm (Contributor)


I wonder if we could consolidate these calls.

Instead of calling utils.check_celery() before job submission, what do you think about putting it in get_task_status?

That's when the worker being offline is a problem: the status will be reported as pending forever.
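
For illustration, a rough sketch of that consolidation (the real get_task_status signature isn't shown in this PR, so the shape below is assumed):

    from celery.result import AsyncResult

    def get_task_status(task_id: str) -> dict:
        result = AsyncResult(task_id, app=app)
        if result.state == "PENDING":
            # A task can sit in PENDING forever if no worker is alive,
            # so only then do we pay for the worker check.
            check_celery()  # raises CeleryConnectionError when no worker responds
        return {"id": task_id, "state": result.state}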

@jessica-cheng (Author) commented Dec 6, 2024


Interesting! I figured the fail-fast approach would be better. I also considered creating a decorator or using FastAPI's dependency injection (Depends()), but that would still modify each endpoint regardless.

Ideally, I would put the check in Celery's delay() function to avoid repeating the call in each function that uses Celery, but patching that function felt too hacky…
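
For context, the Depends() variant being described might look roughly like this (the route path and endpoint name are hypothetical); as noted, every endpoint still has to opt in:

    from fastapi import Depends, HTTPException

    def require_celery() -> None:
        # FastAPI dependency that fails fast when no worker is available.
        try:
            check_celery()
        except CeleryConnectionError as err:
            raise HTTPException(status_code=503, detail=str(err)) from err


    @router.post("/dataset_uploads", dependencies=[Depends(require_celery)])
    def upload_dataset():
        ...  # endpoint body unchanged; the check runs before it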

Pings to see if any worker responds and returns true if successful response detected
"""
inspect = app.control.inspect()
try:
@pgm (Contributor)


I don't know that we need to explicitly connect to the broker before submitting a task. If the broker is offline, then we already get an exception thrown when it tries to submit the task to the broker.

That being said, I suppose it's harmless to try. Probably just costs a few milliseconds and I can't imagine it having any real downside.
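
For reference, the exception mentioned at submit time can also be caught around the .delay() call itself; a hedged sketch, with a hypothetical task name:

    from kombu.exceptions import OperationalError

    @app.task
    def run_custom_analysis(params: dict) -> dict:  # hypothetical task
        return params

    try:
        result = run_custom_analysis.delay({"dataset_id": "abc"})
    except OperationalError as exc:
        # Celery surfaces kombu's OperationalError when the broker is unreachable.
        raise CeleryConnectionError("Failed to connect to celery redis broker!") from exc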

@jessica-cheng (Author)


@pgm Usually when Celery isn't running (at least in development), it's because we've forgotten to start the Celery broker, so I figured returning an explicit error to the client could be helpful. This check does make 3 retries to connect to the broker, though, which takes a few seconds rather than a few milliseconds, so maybe I should decrease the number of retries.
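
For illustration, one way (assumed, not taken from the diff) to bound the broker check to a single retry so the fail-fast path stays well under a few seconds:

    def check_broker() -> None:
        # app.connection() returns a kombu Connection to the broker;
        # ensure_connection() raises if it cannot connect within max_retries.
        with app.connection() as conn:
            conn.ensure_connection(max_retries=1)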

"Failed to connect to celery redis broker!"
) from exc
# Pings workers to see if any of them respond. Returns None if no response
ping = inspect.ping()
@pgm (Contributor)


That's interesting... The worker is asynchronous but this call is clearly synchronous. There must be a timeout baked into this.

Regardless, if you've verified that this reliably identifies when the worker is offline, then it looks good to me.
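
For reference, the timeout is configurable when constructing the inspector; a minimal sketch (5 seconds is an arbitrary example value):

    # inspect() takes a timeout in seconds (default 1); ping() blocks until
    # replies arrive or the timeout elapses, so busy workers get more time to answer.
    replies = app.control.inspect(timeout=5.0).ping()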

@pgm (Contributor)


I looked up inspect.ping(), and from what I see in this SO post, I think any time there's a backlog of work, an exception will be thrown.

Imagine we have 1 worker and two users. User A submits a "custom analysis" job and, while that's running, user B submits a batch download. I think user B will get an exception saying "Celery workers are not responding" instead of A's job running and then B's job running.

Is that behavior intentional? Perhaps it's not bad, but I think at the very least we'd want a longer timeout than the default of 1 second.

Rejecting requests when there's a backlog has some nice properties ("load shedding"), but it also has the downside that, when the server is busy, users will have their jobs fail instead of just taking longer.

@pgm (Contributor)


I could be wrong, though. It's something I think would be good to explicitly try, to make sure we understand what's going to happen before deploying this change.

@jessica-cheng (Author) commented Dec 6, 2024


@pgm Yes, ping() has a default timeout of 1 second. HOWEVER, I just saw a concerning post suggesting that calling ping() may not return any response if all workers are busy running tasks. I don't think that's the behavior we want, since this check is meant to verify that workers exist. I'm thinking stats(), as the top Stack Overflow post suggested, would be better.

@jessica-cheng (Author)


@pgm Oops, I didn't see your subsequent comments; you've basically addressed everything I was pointing out, including the SO post 😅 You're right, throwing an error when workers are occupied wasn't my intention. I think using stats(), as the top SO reply suggests, might be what I want, since it seems to return None when no workers are present. I will test this first before making the change official.
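
A sketch of the stats()-based variant being considered (per the SO thread referenced above; its behavior under load still needs the verification mentioned):

    def check_celery() -> None:
        # stats() asks each worker for runtime statistics and returns a dict
        # keyed by worker hostname, or None when no workers reply at all.
        stats = app.control.inspect().stats()
        if not stats:
            raise CeleryConnectionError("No Celery workers are available")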

@jessica-cheng (Author) commented Dec 6, 2024

Looks fine, but I personally think the repeated try/catches are unnecessary and distracting:

try:
    check_celery()
except CeleryConnectionError as err:
    raise err

What's the motivation for those?

I can get rid of them! I think this was from before I decided to put the try/excepts in check_celery() instead. I have a habit of putting these try/excepts in the API layer before deciding otherwise.

@jessica-cheng changed the title from "Celery pinger" to "chore(breadbox): Celery checker" on Dec 10, 2024