
Add timeout and retry to create_namespaced_pod #433

Merged
merged 26 commits into jupyterhub:master on Sep 16, 2020

Conversation

gravenimage

We've found persistent intermittent failures to spawn on Azure AKS (multiple k8s versions). On a Rancher cluster created on Azure VMs (k8s 1.18) we do not see this. It appears that this particular POST call to the create_namespaced_pod API occasionally just doesn't return (within 5 minutes, at least). Adding a timeout and using the existing retry logic fixes this.
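For reference, a rough sketch of what that looks like (illustrative only, not the exact PR code; names and defaults here are assumptions): pass the kubernetes client's _request_timeout through to the POST and retry when it times out.

```python
# Illustrative sketch: give create_namespaced_pod a per-request timeout and
# retry a few times instead of letting the POST hang indefinitely.
from kubernetes import client
from urllib3.exceptions import ReadTimeoutError  # raised when the request times out

def create_pod_with_retries(api: client.CoreV1Api, namespace, pod, attempts=5, timeout=3):
    for attempt in range(attempts):
        try:
            # _request_timeout is passed through to urllib3 by the generated client
            return api.create_namespaced_pod(namespace, pod, _request_timeout=timeout)
        except ReadTimeoutError:
            # the call never returned within `timeout` seconds - try again
            continue
    raise TimeoutError(f"create_namespaced_pod did not succeed after {attempts} attempts")
```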

Improvements that could be made:

  • explaining why k8s fails sometimes!
  • could this happen on other k8s APIs?
  • should this use a better backoff mechanism
  • how to avoid pulling in urllib3 dependency (used by kubernetes package) - or just accept it.

@welcome

welcome bot commented Sep 14, 2020

Thanks for submitting your first pull request! You are awesome! 🤗

If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋

Welcome to the Jupyter community! 🎉

yuvipanda added a commit to utoronto-2i2c/jupyterhub-deploy that referenced this pull request Sep 14, 2020
Some of the failures we were seeing - of pod spawns getting
'stuck' - might be bugs in AKS versions. See
jupyterhub/kubespawner#433.

Hopefully upgrading fixes it?
@yuvipanda
Collaborator

This is great! I was just running into this issue, I think 👍

We already use exponential backoff with timeouts in various other places - see

yield exponential_backoff(
for example. Can you modify this to use that pattern? That would be awesome <3
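For context, here is a rough sketch of that pattern (the helpers below are hypothetical, not kubespawner's actual code), built on jupyterhub.utils.exponential_backoff, which keeps calling the passed function until it returns something truthy, sleeping with exponentially growing jittered waits in between, and gives up with a timeout error once its overall timeout passes:

```python
import asyncio
from jupyterhub.utils import exponential_backoff

async def make_create_pod_request():
    """Placeholder for the real k8s API call (hypothetical)."""
    await asyncio.sleep(0)

async def try_create_pod():
    """Return True on success, False to ask exponential_backoff to retry."""
    try:
        # bound the single request, so one hung call can't eat the whole budget
        await asyncio.wait_for(make_create_pod_request(), timeout=3)
        return True
    except asyncio.TimeoutError:
        return False

async def start():
    await exponential_backoff(
        try_create_pod,
        'Could not create pod',
        timeout=30,  # overall budget across all attempts
    )
```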

We already use the kubernetes package, so urllib3 isn't a new dependency, right? Should be ok.

Thanks a lot for working on this!

@yuvipanda
Collaborator

I totally misunderstood the backoff comment you made, please ignore what I said 👍

I'm testing this in production now :)

We already have a dependency on urllib3 via the
kubernetes client API. If the kubernetes client changes,
we'll have to change too.
This is probably an error at the intersection of networking
and async, which makes it pretty hard to pin down
Turns out we still have to use dot notation for objects
we create - subscription is only for objects we read
Maybe this is what's causing the issue?
Replaces previous home-grown retry functionality
It retries based on the return value (the right thing to
do), not on exceptions
- Passing _request_timeout doesn't seem to work
- We weren't actually catching the Timeout errors and returning
  False before.
@yuvipanda
Collaborator

Since I'm running into this right now in production, I've been hacking away trying to get it fixed. You can see my work in my branch: https://github.com/yuvipanda/kubespawner/tree/timeout.

Are you running this in production right now, @gravenimage?

Again, thank you very much for finding this issue and contributing :)

gen is actually imported! These are all aliases
of each other
@yuvipanda
Collaborator

Basically, it looks like the AKS master we are using is very flaky - lots of ReadTimeouts everywhere. So we have to do the right thing and have timeouts / retries everywhere we talk to the k8s API. Our start / stop timeouts time out before the requests themselves time out, causing very weird & hard to debug issues.

In my branch I now have it for create and delete, since they both were being quite flaky. Needs a bit more of a systematic approach.

@gravenimage
Author

I've been running my PR in a very low-use production environment for a few days. I see a 5-second delay as expected quite a lot, but haven't had any failures, so by that mark it is a success. In the past I have also seen pod shutdown failures but could never get a reproduction, and haven't seen them recently. I can well believe that AKS is flaky all over the place; at my work we have no real insight into the control plane :-(

@gravenimage
Author

I'm on mobile at the moment so can't look at code very well, but I'd thought of making a more failure-aware exponential backoff, à la Polly in .NET, where I could specify strategies for ReadTimeout without changing the logic of the rest of the code.
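A hypothetical sketch of what that could look like (purely illustrative, none of these names exist in kubespawner): declare per exception type whether to retry, separately from the code making the request.

```python
# Hypothetical Polly-style retry policy: the calling code only asks
# "should I retry this failure?" and otherwise stays unchanged.
from urllib3.exceptions import ReadTimeoutError

RETRY_ON = (ReadTimeoutError,)  # transient failures: the request hung, try again

def should_retry(exc: BaseException) -> bool:
    """Return True if we have a retry strategy for this failure."""
    return isinstance(exc, RETRY_ON)
```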

We now use our own timeout mechanism, since the one
from the kubernetes API seems very flaky / inconsistent
@yuvipanda
Collaborator

I've added timeout and retry logic to pod / pvc creations / deletions. I ended up using tornado's timeout mechanism - the k8s API library's own timeout was very iffy, and I couldn't quite figure out why. This also means we don't have to import urllib3 anymore, so one less implicit dependency!
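Roughly, that looks like the sketch below (hedged: self.api, self.namespace and the executor call are assumed names, not the exact merged code).

```python
# Run the blocking kubernetes client call in a thread, and let tornado's
# gen.with_timeout turn a hung request into gen.TimeoutError, which the
# caller treats as "please retry".
from datetime import timedelta
from functools import partial
from tornado import gen
from tornado.ioloop import IOLoop

@gen.coroutine
def _make_create_pod_request(self, pod, request_timeout):
    try:
        yield gen.with_timeout(
            timedelta(seconds=request_timeout),
            IOLoop.current().run_in_executor(
                None, partial(self.api.create_namespaced_pod, self.namespace, pod)
            ),
        )
        return True   # success: exponential_backoff stops retrying
    except gen.TimeoutError:
        return False  # hung request: exponential_backoff retries
```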

Without this, we had extremely high spawn failure rates. This seems to have mitigated that a little, and I'm continuing to run this in production.

I hope this is useful to your production users too, @gravenimage. I am sorry I sort of took over your PR - I've sent you a private note about that too.

@yuvipanda yuvipanda requested review from consideRatio and minrk and removed request for consideRatio September 16, 2020 07:08
Comment on lines +1788 to +1793
"""
Make an HTTP request to create the given pod

Designed to be used with exponential_backoff, so returns
True / False on success / failure
"""
Member

The function also relies on raising errors to signal an outcome. I'm not confident about how JupyterHub, which is using a KubeSpawner object, will react to it throwing an error. @yuvipanda do you know?

Having this docstring (or similar) describe a bit about that would be relevant for me at least. I've asked myself this many times while inspecting this code base, and failed to spot the associated JupyterHub try/except logic, as I assume this wouldn't bring the hub down.

Member

Letting unhandled errors raise is normal error behavior, i.e. spawn failed with an error, resulting in logged exception and 500 error. Using exponential backoff has three cases:

  1. success (return True/truthy)
  2. handled failure that should be retried (return False)
  3. unhandled error (let it raise, will propagate up to the caller)

I don't think this PR changes the outcomes of different errors other than these short timeouts.
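As a sketch of that contract (illustrative; the helper and the attempt callable are hypothetical):

```python
# The three outcomes as they look inside a request helper used with
# exponential_backoff.
from tornado import gen

async def _make_request(attempt):
    try:
        await attempt()   # the actual k8s API call (hypothetical callable)
        return True       # case 1: success - exponential_backoff stops
    except gen.TimeoutError:
        return False      # case 2: handled failure - exponential_backoff retries
    # case 3: any other exception propagates up to JupyterHub, which logs it
    # and returns a 500 to the user
```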


self.log.info(f'Killed pod {pod_name}, will try starting singleuser pod again')
# We tell exponential_backoff to retry
return False
Member

@consideRatio consideRatio Sep 16, 2020

I found the following list of k8s api-server response status codes to be useful. For a create resource request, the 409 status means the resource already exists.

I think this logic here is invalid or needs clarification. Should we raise an error, return True / False, and/or take actions like deleting the pod in various situations?

Confusion points:

  • We assume we've found an existing pod whenever we get an error, but specifically not for a 409 error, which is the explicit status for a conflict.
  • We fall back to stopping (deleting) pods and retrying as a consequence of any ApiException; given the k8s api-server response status codes, I think we should be a bit more selective about this fallback strategy.
  • What is the ApiException error that makes us need this kind of fallback?

Concern points:

  • We use the fallback strategy to delete and retry also on 429 responses etc.
  • We use the fallback strategy without, to my mind, knowing when it could make sense - at least I don't understand the situation where it would. It would perhaps make sense for an evicted pod, but specifically for an evicted pod (leading to 409), I think we don't apply it!

Member

This logic is unchanged from before - if a pod is already created (i.e. not by us!), delete it and recreate it.

The very-short timeouts in this PR, though, make it a bit more likely that we:

  1. create it successfully, but timeout is hit
  2. next attempt fails with 409, but success was ours!
  3. delete our first creation and try again

Member

Follow-up to clarify the last point: this case, where we created the pod, the API call raised but the creation actually happened, and then we delete & retry, was already technically possible. This PR adds one more situation where it can happen, one that can be triggered by load on either kubernetes or the Hub.

return True
except gen.TimeoutError:
# Just try again
return False
Member

It's a bit of PR-review creep perhaps, but I feel like I want to mention that there are various other situations, surfaced as ApiException errors, that it would make sense to retry.

Here are the ones recommended to retry according to this kubernetes community documentation (a rough sketch of handling them follows the list):

429 StatusTooManyRequests
500 StatusInternalServerError
503 StatusServiceUnavailable
504 StatusServerTimeout
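For illustration, a hedged sketch of what retrying on those codes could look like (not part of this PR; ApiException.status carries the HTTP status returned by the api-server):

```python
# Treat the status codes listed above as retryable inside a _make_*_request
# helper; anything else still raises and fails the spawn.
from kubernetes.client.rest import ApiException

RETRYABLE_STATUSES = {429, 500, 503, 504}

def is_retryable(exc: ApiException) -> bool:
    return exc.status in RETRYABLE_STATUSES

# inside the request helper:
#     except ApiException as e:
#         if is_retryable(e):
#             return False  # let exponential_backoff retry
#         raise
```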

Member

That's a great comment! I think it's okay to address this here if desired by the author, but also fine to improve the retry conditions in a later PR, since this one doesn't change the PVC retry logic, just reorganizes it.

Member

@consideRatio consideRatio Sep 16, 2020

I opened #436 to represent this matter so we can safely focus on this PR's core intention here.

yield exponential_backoff(
partial(self._make_delete_pod_request, self.pod_name, delete_options, grace_seconds, self.k8s_api_request_timeout),
f'Could not delete pod {self.pod_name}',
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
Member

FWIW, I don't agree that this timeout_times logic belongs in exponential_backoff. exponential_backoff is well defined for retrying something up to a time limit, letting unhandled errors raise. Handling a separate, shorter timeout as one of the cases to be retried rightly belongs in the internal function, without the outer exponential_backoff logic needing any knowledge of it.

partial(self._make_delete_pod_request, self.pod_name, delete_options, grace_seconds, self.k8s_api_request_timeout),
f'Could not delete pod {self.pod_name}',
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
timeout=self.k8s_api_request_timeout * self.k8s_api_request_timeout_retries
Member

This timeout math won't allow the full number of retries to be attempted, because the timeout includes both the execution time and the exponentially increasing backoff in between attempts.

It may be appropriate to promote a max_attempts argument into exponential_backoff itself, to be used instead of its timeout. In the absence of that, either implement a different retry scheme here (exponential backoff for a fixed number of attempts), or set an independent timeout that is defined on its own, i.e. "spend up to 30 seconds attempting to do this, considering a single request timeout of 3 seconds a failure", where the number of attempts can vary within a small range.
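For illustration (assuming exponential_backoff's default start_wait of 0.2s and scale factor of 2, and ignoring the jitter it applies): with a 1s per-request timeout and 5 "retries" the outer budget is 5s, but four attempts plus the sleeps between them can already consume roughly 1 + 0.2 + 1 + 0.4 + 1 + 0.8 + 1 = 5.4s, so the fifth attempt may never get to run.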

Comment on lines 225 to 226
k8s_api_request_timeout_retries = Integer(
5,
Member

Given the use of exponential_backoff, I think using a timeout configurable here instead of a retry count will result in more accurate behavior, and more intuitive behavior when changing the per-attempt timeout:

  • k8s_api_request_timeout_retry = Float(3) - timeout before an API request is canceled and retried
  • k8s_api_request_timeout = Float(30) - outer timeout to give up and stop retrying the request

(I'm open to clearer naming for the two timeouts - per-retry and for all retries)
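A hedged sketch of the two configurables as traitlets, using the names suggested above (the enclosing class name here is hypothetical, and final naming and defaults may differ):

```python
from traitlets import Float
from traitlets.config import Configurable

class KubeSpawnerTimeouts(Configurable):
    # per-request timeout: cancel and retry a single k8s API request after this long
    k8s_api_request_timeout_retry = Float(
        3,
        config=True,
        help="Seconds before an individual k8s API request is cancelled and retried.",
    )
    # overall timeout: give up and stop retrying after this long
    k8s_api_request_timeout = Float(
        30,
        config=True,
        help="Total seconds to keep retrying a k8s API request before giving up.",
    )
```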

Collaborator

Yeah, I agree - both on having a timeout here and on the naming. Will amend.

Collaborator

Done!

@@ -207,6 +208,31 @@ def __init__(self, *args, **kwargs):
"""
)

k8s_api_request_timeout = Integer(
1,
Member

Since this includes local thread processing, high load on the Hub can probably trigger false-positive failures here if this number is too low. I'd recommend a safer default of e.g. ~3-5s.


f'Could not create pod {self.pod_name}',
# Each req should be given k8s_api_request_timeout seconds.
# FIXME: We should instead add a timeout_times property to exponential_backoff instead
timeout=self.k8s_api_request_timeout * self.k8s_api_request_timeout_retries
Member

See above suggestion for making this timeout directly configurable instead of trying to calculate it from countable retries


Much better than trying to hack it with retry counts
@yuvipanda
Collaborator

Thanks for the comments, @consideRatio @minrk. I think I've made the suggested changes. I agree the retry and error-handling logic could be improved further, but I hope that can be a separate PR to keep this one small.

@minrk minrk merged commit 0dfccb2 into jupyterhub:master Sep 16, 2020
@welcome

welcome bot commented Sep 16, 2020

Congrats on your first merged pull request in this project! 🎉
Thank you for contributing, we are very proud of you! ❤️

@minrk
Member

minrk commented Sep 16, 2020

Awesome, thank you!

yuvipanda added a commit to utoronto-2i2c/jupyterhub-deploy that referenced this pull request Sep 17, 2020
Users were reporting 'stuck' server creation screens,
which turned out to be just jupyterhub/kubespawner#433.
So I'm trying to test that PR in production here.

- Use chartpress for image building. <3
- Push image into ACR. Use imagePullSecret to pull from it
- Chartpress sets image info, reformats values.yaml
# If there's a timeout, just let it propagate
yield exponential_backoff(
partial(self._make_create_pvc_request, pvc, self.k8s_api_request_timeout),
f'Could not create pod {self.pvc_name}',
Contributor

Should be something like "Could not create PVC" right? Copy/paste error?

Member

Ah! Good catch!

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this pull request Jul 13, 2021
Some of the failures we were seeing - of pod spawns getting
'stuck' - might be bugs in AKS versions. See
jupyterhub/kubespawner#433.

Hopefully upgrading fixes it?