Add timeout and retry to create_namespaced_pod #433
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
Some of the failures we were seeing - of pod spawns getting 'stuck', might be bugs in AKS versions. See jupyterhub/kubespawner#433. Hopefully upgrading fixes it?
This is great! I was just running into this issue, I think 👍 We already use exponential backoff with timeouts in various other places - see kubespawner/spawner.py line 1851 at b209ce1.
We already use the kubernetes package, so... Thanks a lot for working on this!
I totally misunderstood the backoff comment you made, please ignore what I said 👍 I'm testing this in production now :)
This wasn't defined before
We already have a dependency on urllib3 via the kubernetes client API. If the kubernetes client changes, we'll have to revisit this.
This is probably an error at the intersection of networking and async, which makes it pretty hard to pin down
Turns out we still have to use dot notation for objects we create - subscript access is only for objects we read
Maybe this is what's causing the issue?
Replaces previous home-grown retry functionality
It retries based on the return value (the right thing to do), not on exceptions
- Passing _request_timeout doesn't seem to work
- We weren't actually catching the Timeout errors and returning False before
Since I'm running into this right now in production, I've been hacking away trying to get it fixed. You can see my work in my branch: https://github.com/yuvipanda/kubespawner/tree/timeout. Are you running this in production right now, @gravenimage? Again, thank you very much for finding this issue and contributing :)
gen is actually imported! These are all aliases of each other
Basically, it looks like the AKS master we are using is very flaky - lots of ReadTimeouts everywhere. So we have to do the right thing and have timeouts / retries everywhere we talk to the k8s API. Our start / stop timeouts time out before the requests themselves time out, causing very weird & hard to debug issues. In my branch I now have it for create and delete, since they both were being quite flaky. Needs a bit more of a systematic approach.
I've been running my PR in a very low-use production environment for a few days. I see a 5-second delay as expected quite a lot, but haven't had any failures, so by that mark it is a success. In the past I have also seen pod shutdown failures but could never get a reproduction, and haven't seen them recently. I can well believe that AKS is flaky all over the place; at my work we have no real insight into the master plane :-(
I'm on mobile at the moment so can't look at code very well, but I'd thought of making a more failure-aware exponential backoff à la Polly in .NET, where I could specify strategies for ReadTimeout without changing the logic of the rest of the code.
This should ideally not happen, but hey!
Only implemented for creation & deletion of pods.
Was useful when debugging, not so anymore
We now use our own timeout mechanism, since the one from the kubernetes API seems very flaky / inconsistent
The defaults are decent, I think
I've added timeout and retry logic to pod / pvc creations / deletions. I ended up using tornado's timeout mechanism - the k8s API library's own timeout was very iffy, and I couldn't quite figure out why. This also means we don't have to import urllib3 anymore, so one less implicit dependency! Without this, we had extremely high spawn failure rates. This seems to have placated that a little, and I'm continuing to run this in production. I hope this is useful to your production users too, @gravenimage. I am sorry I sort of took over your PR - I've sent you a private note about that too.
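For readers skimming the thread, the approach described above boils down to something like the sketch below. This is a rough illustration rather than the PR's exact code: the blocking kubernetes client call is pushed onto a thread, bounded with tornado's gen.with_timeout, and a timeout is reported back to exponential_backoff as a plain False.

```python
from datetime import timedelta

from tornado import gen
from tornado.ioloop import IOLoop


@gen.coroutine
def make_create_pod_request(api, namespace, pod, request_timeout):
    """One attempt to create a pod, bounded by tornado's own timeout.

    Returns True on success, False so exponential_backoff retries.
    """
    try:
        # Run the blocking kubernetes client call in a thread pool and stop
        # waiting on it after request_timeout seconds.
        yield gen.with_timeout(
            timedelta(seconds=request_timeout),
            IOLoop.current().run_in_executor(
                None, api.create_namespaced_pod, namespace, pod
            ),
        )
    except gen.TimeoutError:
        # The request hung; tell exponential_backoff to schedule another attempt.
        return False
    return True
```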
""" | ||
Make an HTTP request to create the given pod | ||
|
||
Designed to be used with exponential_backoff, so returns | ||
True / False on success / failure | ||
""" |
The function also relies on raising errors to signal an outcome. I'm not confident about how JupyterHub, which is using a KubeSpawner object, will react to an error being raised. @yuvipanda do you know?
Having this docstring or a similar one describe a bit about that would be relevant, for me at least. I've asked myself this many times while inspecting this code base and failed to spot the associated JupyterHub try/except logic, as I assume this wouldn't bring the hub down.
Letting unhandled errors raise is normal error behavior, i.e. spawn failed with an error, resulting in logged exception and 500 error. Using exponential backoff has three cases:
- success (return True/truthy)
- handled failure that should be retried (return False)
- unhandled error (let it raise, will propagate up to the caller)
I don't think this PR changes the outcomes of different errors other than these short timeouts.
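As a concrete illustration of those three cases, here is a minimal sketch built around jupyterhub.utils.exponential_backoff; make_request is a hypothetical stand-in for a single k8s API call, not anything from this PR.

```python
import asyncio

from jupyterhub.utils import exponential_backoff


async def make_request():
    # Hypothetical stand-in for one k8s API call that may hang or fail.
    await asyncio.sleep(0.1)


async def try_once():
    try:
        await make_request()
    except asyncio.TimeoutError:
        return False  # handled failure: exponential_backoff retries
    # Any other exception simply raises and propagates to the caller.
    return True       # truthy: success, stop retrying


async def main():
    # Retries try_once with growing waits until it returns True or 30s elapse,
    # at which point it raises a TimeoutError with the given message.
    await exponential_backoff(try_once, 'API request never succeeded', timeout=30)


asyncio.run(main())
```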
self.log.info(f'Killed pod {pod_name}, will try starting singleuser pod again')
# We tell exponential_backoff to retry
return False
I found the following list of k8s api-server response status codes to be useful. For a create resource request, the 409 status means the resource already exists.
I think this logic here is invalid or needs clarification. Should we raise an error, return True / False, and/or take actions like deleting the pod in the various situations?
Confusion points:
- We assume we have found an existing pod whenever we get an error, rather than specifically on a 409 error, which is the explicit status for a conflict.
- We fall back to stopping (deleting) pods and retrying as a consequence of any ApiException; given the k8s api-server response status codes, I think we should be a bit more selective about this fallback strategy.
- What is the ApiException error that makes us need this kind of fallback?
Concern points:
- We also use the delete-and-retry fallback strategy on 429 responses etc.
- We use the fallback strategy without, to my mind, knowing when it could make sense - at least I don't understand the situation in which it would. It would perhaps make sense for an evicted pod, but specifically for an evicted pod (leading to 409), I think we don't apply it! (See the sketch just below.)
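The more selective variant hinted at here might look roughly like the sketch below. It is not the PR's code: create_pod_once is a hypothetical helper, and only a 409 conflict triggers the delete-and-retry fallback.

```python
from kubernetes import client
from kubernetes.client.rest import ApiException


def create_pod_once(api: client.CoreV1Api, namespace: str, pod: client.V1Pod) -> bool:
    """One synchronous creation attempt; False means 'please retry'."""
    try:
        api.create_namespaced_pod(namespace, pod)
        return True
    except ApiException as e:
        if e.status != 409:
            raise  # 403, 422, ... are genuine errors and should propagate
        # 409: a pod with this name already exists (possibly our own earlier
        # attempt whose response timed out) - delete it and ask for a retry.
        api.delete_namespaced_pod(pod.metadata.name, namespace)
        return False
```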
This logic is unchanged from before - if a pod is already created (i.e. not by us!), delete it and recreate it.
The very-short timeouts in this PR, though, make it a bit more likely that we:
- create it successfully, but timeout is hit
- next attempt fails with 409, but success was ours!
- delete our first creation and try again
Follow-up to clarify the last point: the case where we created the pod, the API call raised but the creation actually happened, and we then delete & retry was already technically possible. This PR adds one more situation where it can happen, one that can be triggered by load on either kubernetes or the Hub.
    return True
except gen.TimeoutError:
    # Just try again
    return False
It's a bit of PR-review creep perhaps, but I feel like I want to mention that there are various other situations that would make sense to retry, which will show up as ApiException errors.
Here are those recommended to retry according to this kubernetes community documentation:
- 429 StatusTooManyRequests
- 500 StatusInternalServerError
- 503 StatusServiceUnavailable
- 504 StatusServerTimeout
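A rough sketch of acting on that list could be as small as a helper like the one below (should_retry is a hypothetical name); inside the except ApiException block one would return False when it says yes and re-raise otherwise.

```python
from kubernetes.client.rest import ApiException

# Statuses the kubernetes community documentation above recommends retrying.
RETRYABLE_STATUSES = {429, 500, 503, 504}


def should_retry(error: ApiException) -> bool:
    """Decide whether an ApiException looks transient and worth another attempt."""
    return error.status in RETRYABLE_STATUSES
```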
That's a great comment! I think it's okay to address this here if desired by the author, but also fine to improve the retry conditions in a later PR, since this one doesn't change the PVC retry logic, just reorganizes it.
I opened #436 to track this matter so we can safely focus on this PR's core intention here.
kubespawner/spawner.py
Outdated
yield exponential_backoff(
    partial(self._make_delete_pod_request, self.pod_name, delete_options, grace_seconds, self.k8s_api_request_timeout),
    f'Could not delete pod {self.pod_name}',
    # FIXME: We should instead add a timeout_times property to exponential_backoff instead
FWIW, I don't agree that this timeout_times logic belongs in exponential_backoff. exponential_backoff is well defined for retrying something up to a time limit, letting unhandled errors raise. Handling a separate, shorter timeout as one of the cases that should be retried rightly belongs in the internal function, without any knowledge of that in the outer exponential_backoff logic.
kubespawner/spawner.py
Outdated
partial(self._make_delete_pod_request, self.pod_name, delete_options, grace_seconds, self.k8s_api_request_timeout),
f'Could not delete pod {self.pod_name}',
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
timeout=self.k8s_api_request_timeout * self.k8s_api_request_timeout_retries
This timeout math won't allow the full number of retries to be attempted, because the timeout includes both the execution time and the exponentially increasing backoff in between attempts.
It may be appropriate to promote a max_attempts argument in exponential_backoff to be used instead of the timeout, but in the absence of that, either implement a different retry scheme here (exponential backoff with a fixed number of attempts), or set an independent timeout that is defined on its own, i.e. "spend up to 30 seconds attempting to do this, considering a single request timeout of 3 seconds a failure", where the number of attempts can vary within a small range.
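To illustrate the point with a toy calculation (ignoring jitter and only approximating exponential_backoff's default wait growth): with a 3-second request timeout and 5 intended retries, timeout = 3 * 5 = 15 seconds, but the sleeps between attempts eat into that budget, so fewer than 5 attempts actually fit.

```python
# Toy model: how many attempts fit when timeout = per_request * retries,
# once the exponentially growing waits between attempts are counted too.
per_request = 3                    # seconds one attempt may take before being cancelled
retries = 5
budget = per_request * retries     # 15s, as computed in the PR

wait, attempts, elapsed = 0.2, 0, 0.0
while elapsed + per_request <= budget:
    attempts += 1
    elapsed += per_request + wait  # attempt duration plus the sleep before the next one
    wait = min(wait * 2, 5)        # rough stand-in for the default backoff growth

print(attempts)  # 4, not the intended 5 - and the gap widens as the waits grow
```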
kubespawner/spawner.py
Outdated
k8s_api_request_timeout_retries = Integer(
    5,
Given the use of exponential_backoff, I think using a timeout configurable here instead of a retry count will result in more accurate behavior, and more intuitive behavior when changing the per-attempt timeout:
- k8s_api_request_timeout_retry = Float(3) - timeout before an API request is canceled and retried
- k8s_api_request_timeout = Float(30) - outer timeout to give up and stop retrying the request
(I'm open to clearer naming for the two timeouts - per-retry and for all retries)
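A minimal sketch of that suggestion, using the placeholder names and defaults from the comment above and a stand-in Configurable class rather than KubeSpawner itself:

```python
from traitlets import Float
from traitlets.config import Configurable


class ApiTimeouts(Configurable):
    """Stand-in illustrating the two-timeout suggestion; names are placeholders."""

    k8s_api_request_timeout_retry = Float(
        3,
        config=True,
        help="Seconds before a single k8s API request is cancelled and retried.",
    )

    k8s_api_request_timeout = Float(
        30,
        config=True,
        help="Total seconds to keep retrying a k8s API request before giving up.",
    )
```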
Yeah, I agree - both on having a timeout here and on the naming. Will amend.
Done!
kubespawner/spawner.py
Outdated
@@ -207,6 +208,31 @@ def __init__(self, *args, **kwargs):
    """
)

k8s_api_request_timeout = Integer(
    1,
Since this includes local thread processing, high load on the Hub can probably trigger false-positive failures here if this number is too low. I'd recommend a safer default of e.g. ~3-5s.
kubespawner/spawner.py
Outdated
f'Could not create pod {self.pod_name}',
# Each req should be given k8s_api_request_timeout seconds.
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
timeout=self.k8s_api_request_timeout * self.k8s_api_request_timeout_retries
See above suggestion for making this timeout directly configurable instead of trying to calculate it from countable retries
""" | ||
Make an HTTP request to create the given pod | ||
|
||
Designed to be used with exponential_backoff, so returns | ||
True / False on success / failure | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Letting unhandled errors raise is normal error behavior, i.e. spawn failed with an error, resulting in logged exception and 500 error. Using exponential backoff has three cases:
- success (return True/truthy)
- handled failure that should be retried (return False)
- unhandled error (let it raise, will propagate up to the caller)
I don't think this PR changes the outcomes of different errors other than these short timeouts.
|
||
self.log.info(f'Killed pod {pod_name}, will try starting singleuser pod again') | ||
# We tell exponential_backoff to retry | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is unchanged from before - if a pod is already created (i.e. not by us!), delete it and recreate it.
The very-short timeouts in this PR, though, make it a bit more likely that we:
- create it successfully, but timeout is hit
- next attempt fails with 409, but success was ours!
- delete our first creation and try again
Based on @minrk's comment.
Much better than trying to hack it with retry counts
Thanks for the comments, @consideRatio @minrk. I think I've made the suggested changes. I agree the retry and error-handling logic could be improved further, but I hope that can be a separate PR to keep this one small.
Awesome, thank you!
Users were reporting 'stuck' server creation screens, which seems to be just jupyterhub/kubespawner#433. So I'm trying to test that PR in production here.
- Use chartpress for image building. <3
- Push image into ACR. Use imagePullSecret to pull from it
- Chartpress sets image info, reformats values.yaml
# If there's a timeout, just let it propagate
yield exponential_backoff(
    partial(self._make_create_pvc_request, pvc, self.k8s_api_request_timeout),
    f'Could not create pod {self.pvc_name}',
Should be something like "Could not create PVC" right? Copy/paste error?
Ah! Good catch!
We've found persistent intermittent failures to spawn on Azure AKS (multiple k8s versions). On a Rancher cluster created on Azure VMs (k8s 1.18) we do not see this. It appears that this particular POST call to the create_namespaced_pod API occasionally just doesn't return (within 5 minutes at least). Adding a timeout and using the existing retry logic fixes this.
Improvements that could be made: