Investigation - How to react properly to evicted pods #233

Open

consideRatio opened this issue Aug 9, 2018 · 1 comment
consideRatio commented Aug 9, 2018

Intro

In #223 we see that evicted user pods cause a user to have faulty routing and be unable to log in, because the spawner does not realize the user pod is in a bad state; this can only be corrected by a hub restart.

I think I have found a solution to this, but first I want to share what I've learned about a pod's "status".

Theory

A pod's status

[image: output of kubectl get pods, showing the STATUS column]
What you see here under "STATUS", written out by kubectl get pods, is actually a ContainerStatus's reason.

status.phase

The phase is easy to get an overview of, but it is not what you see when you write kubectl get pods, even though you will recognize Pending and Running.
[image: the possible values of status.phase]

status.containerStatuses.[0].state / lastState

This is what you actually see in the STATUS field when you write kubectl get pods. There are three kinds of states: Running, Terminated, Waiting. Both Terminated and Waiting have a reason field along with a message field.
[image: the state / lastState fields of a ContainerStatus]
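To make the relationship concrete, here is a minimal sketch (plain Python, no cluster needed). The field names mirror the Kubernetes API objects described above, but the mimic_kubectl_status helper is hypothetical, an approximation of how kubectl seems to derive the STATUS column from containerStatuses rather than from the bare phase:

```python
from dataclasses import dataclass
from typing import Optional

# Minimal stand-ins for the Kubernetes API objects discussed above.
@dataclass
class StateDetail:
    reason: Optional[str] = None
    message: Optional[str] = None

@dataclass
class ContainerState:
    running: Optional[dict] = None
    terminated: Optional[StateDetail] = None
    waiting: Optional[StateDetail] = None

def mimic_kubectl_status(phase: str, state: ContainerState) -> str:
    """Hypothetical approximation of kubectl's STATUS column:
    prefer a container state reason over the bare pod phase."""
    if state.terminated and state.terminated.reason:
        return state.terminated.reason
    if state.waiting and state.waiting.reason:
        return state.waiting.reason
    return phase  # e.g. Pending or Running

# A pod whose notebook container was OOM-killed shows the reason:
oomed = ContainerState(terminated=StateDetail(reason="OOMKilled"))
print(mimic_kubectl_status("Failed", oomed))   # OOMKilled

# A healthy pod just shows its phase:
print(mimic_kubectl_status("Running", ContainerState(running={})))  # Running
```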

Issue analysis

Inspect this code

if data.status.phase == 'Pending':
    return None
ctr_stat = data.status.container_statuses
if ctr_stat is None:  # No status, no container (we hope)
    # This seems to happen when a pod is idle-culled.
    return 1
for c in ctr_stat:
    # return exit code if notebook container has terminated
    if c.name == 'notebook':
        if c.state.terminated:
            # call self.stop to delete the pod
            if self.delete_stopped_pods:
                yield self.stop(now=True)
            return c.state.terminated.exit_code
        break
# None means pod is running or starting up
return None

The code's execution logic

  1. Is the pod phase Pending? Do nothing.
  2. If not, does the notebook container lack a state? Do something!!!
  3. If not, is the notebook container a terminated state? Do something!!!
  4. Else, do nothing.
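The four steps above can be sketched as a pure function. This is illustrative only, not kubespawner's actual API; poll_decision and the dict-based statuses are hypothetical stand-ins for the real objects:

```python
# A pure-function sketch of the four steps above. Returns an exit code
# when the pod counts as stopped, or None for "do nothing".
def poll_decision(phase, container_statuses, notebook_name="notebook"):
    # 1. Pod still Pending? Do nothing.
    if phase == 'Pending':
        return None
    # 2. No container statuses at all? Treat as stopped (exit code 1).
    if container_statuses is None:
        return 1
    # 3. Notebook container in a terminated state? Report its exit code.
    for c in container_statuses:
        if c['name'] == notebook_name:
            terminated = c.get('state', {}).get('terminated')
            if terminated:
                return terminated.get('exit_code', 0)
            break
    # 4. Otherwise the pod is running or starting up.
    return None

print(poll_decision('Pending', None))   # None
print(poll_decision('Running', None))   # 1
print(poll_decision('Failed', [
    {'name': 'notebook', 'state': {'terminated': {'exit_code': 137}}},
]))                                     # 137
```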

I think we can do something here to fix #223, but I'm not sure what, because I have not been able to figure out how status.phase and status.containerStatuses[<the notebook container>].state will behave if we have an Evicted pod for example.

Suggested change and action plan

Perhaps we should delete pods that are in the Succeeded or Failed status.phase? That would probably cause the routes etc. of users whose pods have a kubectl get pods "STATUS" of Completed or Evicted to be deleted properly, letting them respawn without needing a hub restart.
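The suggested rule is small enough to state as code. A minimal sketch, where should_delete_pod is a hypothetical helper and not existing kubespawner code:

```python
# Treat pods whose phase is Succeeded or Failed as dead, so their routes
# can be cleaned up and the user can respawn. Per the Kubernetes docs,
# Succeeded covers a Completed pod and Failed covers an Evicted pod.
def should_delete_pod(phase: str) -> bool:
    return phase in ('Succeeded', 'Failed')

print(should_delete_pod('Failed'))     # True  (e.g. an Evicted pod)
print(should_delete_pod('Succeeded'))  # True  (e.g. a Completed pod)
print(should_delete_pod('Running'))    # False
```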

Ping @minrk @betatim @choldgraf !

Things to learn / document

  • Figure out what value the pod.phase and container state reason will have for a pod eviction!
    1. Add some logging or similar to kubespawner, install and run that version.
    2. Spawn a user and make the pod run out of memory so it gets evicted, for example with a fork bomb or by setting a very narrow memory limit.
    3. Inspect the hub's logs where kubespawner logs will be shown.
  • Document info about an evicted pod
kubectl get pod --namespace <my-namespace> <name-of-evicted-pod> --output yaml
kubectl describe pod --namespace <my-namespace> <name-of-evicted-pod>

Concrete questions I'd like answered

  • When is the containerStatuses array None and what is the status.phase when it happens?
            ctr_stat = data.status.container_statuses
            if ctr_stat is None:  # No status, no container (we hope)
                # This seems to happen when a pod is idle-culled.
                return 1
  • What values can reason take? What is the status.phase when a container is found in terminated state?
    We should log something about c.state.terminated.reason as well as data.status.phase when c.state.terminated is truthy.
                    if c.state.terminated:
                        # call self.stop to delete the pod
                        if self.delete_stopped_pods:
                            yield self.stop(now=True)
                        return c.state.terminated.exit_code
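The logging suggested above could look something like this. The termination_summary helper is hypothetical, not existing kubespawner code; it just shows the fields worth capturing in one log line:

```python
# Hypothetical helper: capture both the container's terminated reason and
# the pod phase whenever the notebook container is found terminated.
def termination_summary(pod_phase, reason, exit_code):
    return (f"notebook container terminated: "
            f"reason={reason} exit_code={exit_code} phase={pod_phase}")

# For an evicted pod we would expect something like:
print(termination_summary("Failed", "Evicted", None))
# notebook container terminated: reason=Evicted exit_code=None phase=Failed
```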

References

By looking at the PodStatus object, you can inspect nested resources like the phase field, or the containerStatuses array of ContainerStatus objects, etc.

I made a mindmap about pod.state things and events.


minrk commented Nov 30, 2021

I just came across this because I was looking at orphaned, evicted pods on mybinder.org.

Using a pod with this state:

Status:               Failed
Reason:               Evicted
Message:              The node was low on resource: memory. Container notebook was using 1900004Ki, which exceeds its request of 471859200.

The KubeSpawner logs show that the Spawner does notice that the pod has stopped and treat it as a failure:

[W 2021-11-20 20:47:31.401 JupyterHub base:1072] User jupyterlab-jupyterlab-demo-cmqn27qt server stopped, with exit
[I 2021-11-20 20:47:31.401 JupyterHub proxy:309] Removing user jupyterlab-jupyterlab-demo-cmqn27qt from proxy (/user/jupyterlab-jupyterlab-demo-cmqn27qt/)

which means that the more severe problem that prompted this Issue may be resolved (I haven't been able to figure out the time between eviction and noticing that it stopped). But the pod is still not deleted for some reason.
