
Add timeout and retry to create_namespaced_pod #433

Merged
merged 26 commits into jupyterhub:master on Sep 16, 2020

Conversation

gravenimage

We've found persistent intermittent failures to spawn on Azure AKS (multiple k8s versions). On a Rancher cluster created on Azure VMs (k8s 1.18) we do not see this. It appears that this particular POST call to the create_namespaced_pod API occasionally just doesn't return (within 5 minutes, at least). Adding a timeout and using the existing retry logic fixes this.
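For reference, a rough sketch of what that looks like (illustrative only, not the exact PR code; names and defaults here are assumptions): pass the kubernetes client's _request_timeout through to the POST and retry when it times out.

```python
# Illustrative sketch: give create_namespaced_pod a per-request timeout and
# retry a few times instead of letting the POST hang indefinitely.
from kubernetes import client
from urllib3.exceptions import ReadTimeoutError  # raised when the request times out

def create_pod_with_retries(api: client.CoreV1Api, namespace, pod, attempts=5, timeout=3):
    for attempt in range(attempts):
        try:
            # _request_timeout is passed through to urllib3 by the generated client
            return api.create_namespaced_pod(namespace, pod, _request_timeout=timeout)
        except ReadTimeoutError:
            # the call never returned within `timeout` seconds - try again
            continue
    raise TimeoutError(f"create_namespaced_pod did not succeed after {attempts} attempts")
```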

Improvements that could be made:

  • explaining why k8s fails sometimes!
  • could this happen on other k8s APIs?
  • should this use a better backoff mechanism
  • how to avoid pulling in urllib3 dependency (used by kubernetes package) - or just accept it.

@welcome

welcome bot commented Sep 14, 2020

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

yuvipanda added a commit to utoronto-2i2c/jupyterhub-deploy that referenced this pull request Sep 14, 2020
Some of the failures we were seeing - of pod spawns getting
'stuck' - might be bugs in AKS versions. See
jupyterhub/kubespawner#433.

Hopefully upgrading fixes it?
@yuvipanda
Collaborator

This is great! I was just running into this issue, I think 👍

We already use exponential backoff with timeouts in various other places - see

yield exponential_backoff(
for example. Can you modify this to use that pattern? That would be awesome <3
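For context, here is a rough sketch of that pattern (the helpers below are hypothetical, not kubespawner's actual code), built on jupyterhub.utils.exponential_backoff, which keeps calling the passed function until it returns something truthy, sleeping with exponentially growing jittered waits in between, and gives up with a timeout error once its overall timeout passes:

```python
import asyncio
from jupyterhub.utils import exponential_backoff

async def make_create_pod_request():
    """Placeholder for the real k8s API call (hypothetical)."""
    await asyncio.sleep(0)

async def try_create_pod():
    """Return True on success, False to ask exponential_backoff to retry."""
    try:
        # bound the single request, so one hung call can't eat the whole budget
        await asyncio.wait_for(make_create_pod_request(), timeout=3)
        return True
    except asyncio.TimeoutError:
        return False

async def start():
    await exponential_backoff(
        try_create_pod,
        'Could not create pod',
        timeout=30,  # overall budget across all attempts
    )
```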

We already use the kubernetes package, so urllib3 isn't a new dependency, right? Should be ok.

Thanks a lot for working on this!

@yuvipanda
Collaborator

I totally misunderstood the backoff comment you made, please ignore what I said 👍

I'm testing this in production now :)

We already have a dependency on urllib3 via the
kubernetes client API. If the kubernetes client changes,
we'll have to change too.
This is probably an error at the intersection of networking
and async, which makes it pretty hard to pin down
Turns out we still have to use dot notation for objects
we create - subscription is only for objects we read
Maybe this is what's causing the issue?
Replaces previous home-grown retry functionality
It retries based on the return value (the right thing to
do), not on exceptions
- Passing _request_timeout doesn't seem to work
- We weren't actually catching the Timeout errors and returning
  False before.
@yuvipanda
Collaborator

Since I'm running into this right now in production, I've been hacking away trying to get it fixed. You can see my work in my branch: https://github.com/yuvipanda/kubespawner/tree/timeout.

Are you running this in production right now, @gravenimage?

Again, thank you very much for finding this issue and contributing :)

gen is actually imported! These are all aliases
of each other
@yuvipanda
Collaborator

Basically, it looks like the AKS master we are using is very flaky - lots of ReadTimeouts everywhere. So we have to do the right thing and have timeouts / retries everywhere we talk to the k8s API. Our start / stop timeouts time out before the requests themselves time out, causing very weird & hard to debug issues.

In my branch I now have it for create and delete, since they both were being quite flaky. Needs a bit more of a systematic approach.

@gravenimage
Author

I've been running my PR in a very low-use production environment for a few days. I see a 5-second delay as expected quite a lot, but haven't had any failures, so by that mark it is a success. In the past I have also seen pod shutdown failures but could never get a reproduction, and haven't seen them recently. I can well believe that AKS is flaky all over the place; at my work we have no real insight into the control plane :-(

@gravenimage
Author

I'm on mobile at the moment so can't look at code very well, but I'd thought of making a more failure-aware exponential backoff, à la Polly in .NET, where I could specify strategies for ReadTimeout without changing the logic of the rest of the code.
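A hypothetical sketch of what that could look like (purely illustrative, none of these names exist in kubespawner): declare per exception type whether to retry, separately from the code making the request.

```python
# Hypothetical Polly-style retry policy: the calling code only asks
# "should I retry this failure?" and otherwise stays unchanged.
from urllib3.exceptions import ReadTimeoutError

RETRY_ON = (ReadTimeoutError,)  # transient failures: the request hung, try again

def should_retry(exc: BaseException) -> bool:
    """Return True if we have a retry strategy for this failure."""
    return isinstance(exc, RETRY_ON)
```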

We now use our own timeout mechanism, since the one
from the kubernetes API seems very flaky / inconsistent
@yuvipanda
Collaborator

I've added timeout and retry logic to pod / pvc creations / deletions. I ended up using tornado's timeout mechanism - the k8s API library's own timeout was very iffy, and I couldn't quite figure out why. This also means we don't have to import urllib3 anymore, so one less implicit dependency!
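Roughly, that looks like the sketch below (hedged: self.api, self.namespace and the executor call are assumed names, not the exact merged code).

```python
# Run the blocking kubernetes client call in a thread, and let tornado's
# gen.with_timeout turn a hung request into gen.TimeoutError, which the
# caller treats as "please retry".
from datetime import timedelta
from functools import partial
from tornado import gen
from tornado.ioloop import IOLoop

@gen.coroutine
def _make_create_pod_request(self, pod, request_timeout):
    try:
        yield gen.with_timeout(
            timedelta(seconds=request_timeout),
            IOLoop.current().run_in_executor(
                None, partial(self.api.create_namespaced_pod, self.namespace, pod)
            ),
        )
        return True   # success: exponential_backoff stops retrying
    except gen.TimeoutError:
        return False  # hung request: exponential_backoff retries
```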

Without this, we had extremely high spawn failure rates. This seems to have mitigated that a little, and I'm continuing to run this in production.

I hope this is useful to your production users too, @gravenimage. I am sorry I sort of took over your PR - I've sent you a private note about that too.

@yuvipanda yuvipanda requested review from consideRatio and minrk and removed request for consideRatio September 16, 2020 07:08
Comment on lines +1788 to +1793
"""
Make an HTTP request to create the given pod

Designed to be used with exponential_backoff, so returns
True / False on success / failure
"""
Member

The function also relies on raising errors to signal an outcome. I'm not confident about how JupyterHub, which is using a KubeSpawner object, will react to it throwing an error. @yuvipanda do you know?

Having this docstring (or similar) describe a bit about that would be relevant for me at least. I've asked myself this many times while inspecting this code base, and failed to spot the associated JupyterHub try/except logic, as I assume this wouldn't bring the hub down.

Member

Letting unhandled errors raise is normal error behavior, i.e. spawn failed with an error, resulting in logged exception and 500 error. Using exponential backoff has three cases:

  1. success (return True/truthy)
  2. handled failure that should be retried (return False)
  3. unhandled error (let it raise, will propagate up to the caller)

I don't think this PR changes the outcomes of different errors other than these short timeouts.
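As a sketch of that contract (illustrative; the helper and the attempt callable are hypothetical):

```python
# The three outcomes as they look inside a request helper used with
# exponential_backoff.
from tornado import gen

async def _make_request(attempt):
    try:
        await attempt()   # the actual k8s API call (hypothetical callable)
        return True       # case 1: success - exponential_backoff stops
    except gen.TimeoutError:
        return False      # case 2: handled failure - exponential_backoff retries
    # case 3: any other exception propagates up to JupyterHub, which logs it
    # and returns a 500 to the user
```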


self.log.info(f'Killed pod {pod_name}, will try starting singleuser pod again')
# We tell exponential_backoff to retry
return False
Member

@consideRatio consideRatio Sep 16, 2020

I found the following list of k8s api-server response status codes to be useful. For a create resource request, the 409 status means the resource already exists.

I think this logic here is invalid or needs clarification. Should we raise an error, return True / False, and/or take actions like deleting the pod in various situations?

Confusion points:

  • We assume we've found an existing pod whenever we get an error, but specifically not for a 409 error, which is the explicit status for a conflict.
  • We fall back to stopping (deleting) pods and retrying as a consequence of any ApiException; given the k8s api-server response status codes, I think we should be a bit more selective about this fallback strategy.
  • What is the ApiException error that makes us need this kind of fallback?

Concern points:

  • We use the fallback strategy to delete and retry also on 429 responses etc.
  • We use the fallback strategy without, to my mind, knowing when it could make sense - at least I don't understand the situation where it would. It would perhaps make sense for an evicted pod, but specifically for an evicted pod (leading to 409), I think we don't apply it!

Member

This logic is unchanged from before - if a pod is already created (i.e. not by us!), delete it and recreate it.

The very-short timeouts in this PR, though, make it a bit more likely that we:

  1. create it successfully, but timeout is hit
  2. next attempt fails with 409, but success was ours!
  3. delete our first creation and try again

Member

Follow-up to clarify the last point: this case, where we created the pod, the API call raised but the creation actually happened, and then we delete & retry, was already technically possible. This PR adds one more situation where it can happen, one that can be triggered by load on either kubernetes or the Hub.

return True
except gen.TimeoutError:
# Just try again
return False
Member

It's a bit of PR-review creep perhaps, but I feel like I want to mention that there are various other situations, surfaced as ApiException errors, that it would make sense to retry.

Here are the ones recommended to retry according to this kubernetes community documentation (a rough sketch of handling them follows the list):

429 StatusTooManyRequests
500 StatusInternalServerError
503 StatusServiceUnavailable
504 StatusServerTimeout
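For illustration, a hedged sketch of what retrying on those codes could look like (not part of this PR; ApiException.status carries the HTTP status returned by the api-server):

```python
# Treat the status codes listed above as retryable inside a _make_*_request
# helper; anything else still raises and fails the spawn.
from kubernetes.client.rest import ApiException

RETRYABLE_STATUSES = {429, 500, 503, 504}

def is_retryable(exc: ApiException) -> bool:
    return exc.status in RETRYABLE_STATUSES

# inside the request helper:
#     except ApiException as e:
#         if is_retryable(e):
#             return False  # let exponential_backoff retry
#         raise
```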

Member

That's a great comment! I think it's okay to address this here if desired by the author, but also fine to improve the retry conditions in a later PR, since this one doesn't change the PVC retry logic, just reorganizes it.

Member

@consideRatio consideRatio Sep 16, 2020

I opened #436 to represent this matter so we can safely focus on this PR's core intention here.

yield exponential_backoff(
partial(self._make_delete_pod_request, self.pod_name, delete_options, grace_seconds, self.k8s_api_request_timeout),
f'Could not delete pod {self.pod_name}',
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
Member

FWIW, I don't agree that this timeout_times logic belongs in exponential_backoff. exponential_backoff is well defined for retrying something up to a time limit, letting unhandled errors raise. Handling a separate, shorter timeout as one of the cases to be retried rightly belongs in the internal function, without the outer exponential_backoff logic needing any knowledge of it.

partial(self._make_delete_pod_request, self.pod_name, delete_options, grace_seconds, self.k8s_api_request_timeout),
f'Could not delete pod {self.pod_name}',
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
timeout=self.k8s_api_request_timeout * self.k8s_api_request_timeout_retries
Member

This timeout math won't allow the full number of retries to be attempted, because the timeout includes both the execution time and the exponentially increasing backoff in between attempts.

It may be appropriate to promote a max_attempts argument into exponential_backoff itself, to be used instead of its timeout. In the absence of that, either implement a different retry scheme here (exponential backoff for a fixed number of attempts), or set an independent timeout that is defined on its own, i.e. "spend up to 30 seconds attempting to do this, considering a single request timeout of 3 seconds a failure", where the number of attempts can vary within a small range.
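For illustration (assuming exponential_backoff's default start_wait of 0.2s and scale factor of 2, and ignoring the jitter it applies): with a 1s per-request timeout and 5 "retries" the outer budget is 5s, but four attempts plus the sleeps between them can already consume roughly 1 + 0.2 + 1 + 0.4 + 1 + 0.8 + 1 = 5.4s, so the fifth attempt may never get to run.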

Comment on lines 225 to 226
k8s_api_request_timeout_retries = Integer(
5,
Member

Given the use of exponential_backoff, I think using a timeout configurable here instead of a retry count will result in more accurate behavior, and more intuitive behavior when changing the per-attempt timeout:

  • k8s_api_request_timeout_retry = Float(3) - timeout before an API request is canceled and retried
  • k8s_api_request_timeout = Float(30) - outer timeout to give up and stop retrying the request

(I'm open to clearer naming for the two timeouts - per-retry and for all retries)
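A hedged sketch of the two configurables as traitlets, using the names suggested above (the enclosing class name here is hypothetical, and final naming and defaults may differ):

```python
from traitlets import Float
from traitlets.config import Configurable

class KubeSpawnerTimeouts(Configurable):
    # per-request timeout: cancel and retry a single k8s API request after this long
    k8s_api_request_timeout_retry = Float(
        3,
        config=True,
        help="Seconds before an individual k8s API request is cancelled and retried.",
    )
    # overall timeout: give up and stop retrying after this long
    k8s_api_request_timeout = Float(
        30,
        config=True,
        help="Total seconds to keep retrying a k8s API request before giving up.",
    )
```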

Collaborator

Yeah, I agree - both on having a timeout here and on the naming. Will amend.

Collaborator

Done!

@@ -207,6 +208,31 @@ def __init__(self, *args, **kwargs):
"""
)

k8s_api_request_timeout = Integer(
1,
Member

Since this includes local thread processing, high load on the Hub can probably trigger false-positive failures here if this number is too low. I'd recommend a safer default of e.g. ~3-5s.


f'Could not create pod {self.pod_name}',
# Each req should be given k8s_api_request_timeout seconds.
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
timeout=self.k8s_api_request_timeout * self.k8s_api_request_timeout_retries
Member

See above suggestion for making this timeout directly configurable instead of trying to calculate it from countable retries


Much better than trying to hack it with retry counts
@yuvipanda
Collaborator

Thanks for the comments, @consideRatio @minrk. I think I've made the suggested changes. I agree the retry and error-handling logic could be improved further, but I hope that can be a separate PR to keep this one small.

@minrk minrk merged commit 0dfccb2 into jupyterhub:master Sep 16, 2020
@welcome

welcome bot commented Sep 16, 2020

Congrats on your first merged pull request in this project! 🎉
Thank you for contributing, we are very proud of you! ❤️

@minrk
Member

minrk commented Sep 16, 2020

Awesome, thank you!

yuvipanda added a commit to utoronto-2i2c/jupyterhub-deploy that referenced this pull request Sep 17, 2020
Users were reporting 'stuck' server creation screens,
which turned out to be just jupyterhub/kubespawner#433.
So I'm trying to test that PR in production here.

- Use chartpress for image building. <3
- Push image into ACR. Use imagePullSecret to pull from it
- Chartpress sets image info, reformats values.yaml
# If there's a timeout, just let it propagate
yield exponential_backoff(
partial(self._make_create_pvc_request, pvc, self.k8s_api_request_timeout),
f'Could not create pod {self.pvc_name}',
Contributor

Should be something like "Could not create PVC" right? Copy/paste error?

Member

Ah! Good catch!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Jul 13, 2021
Some of the failures we were seeing - of pod spawns getting
'stuck' - might be bugs in AKS versions. See
jupyterhub/kubespawner#433.

Hopefully upgrading fixes it?