-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ip-reconciler issue with StatefullSet cleanup #182
Comments
maiqueb
added a commit
to maiqueb/whereabouts
that referenced
this issue
Mar 10, 2022
Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - k8snetworkplumbingwg#182 Signed-off-by: Miguel Duarte Barroso <[email protected]>
Merged
maiqueb
added a commit
to maiqueb/whereabouts
that referenced
this issue
Mar 11, 2022
Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - k8snetworkplumbingwg#182 Signed-off-by: Miguel Duarte Barroso <[email protected]>
maiqueb
added a commit
to maiqueb/whereabouts
that referenced
this issue
Mar 25, 2022
Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - k8snetworkplumbingwg#182 Signed-off-by: Miguel Duarte Barroso <[email protected]>
maiqueb
added a commit
to maiqueb/whereabouts
that referenced
this issue
Apr 4, 2022
Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - k8snetworkplumbingwg#182 Signed-off-by: Miguel Duarte Barroso <[email protected]>
maiqueb
added a commit
to maiqueb/whereabouts
that referenced
this issue
Apr 12, 2022
Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - k8snetworkplumbingwg#182 Signed-off-by: Miguel Duarte Barroso <[email protected]>
dougbtv
pushed a commit
that referenced
this issue
Apr 13, 2022
* build: generate ip pool clientSet/informers/listers Signed-off-by: Miguel Duarte Barroso <[email protected]> * vendor: update vendor stuff Signed-off-by: Miguel Duarte Barroso <[email protected]> * build: vendor net-attach-def-client types Signed-off-by: Miguel Duarte Barroso <[email protected]> * config: look for the whereabouts config file in multiple places The reconciler controller will have access to the whereabouts configuration via a mount point. As such, we need a way to specify its path. Signed-off-by: Miguel Duarte Barroso <[email protected]> * reconcile-loop: requires the IP ranges in normalized format The IP reconcile loop also requires the IP ranges in a normalized format; as such, we export it into a function, which will be used in a follow-up commit. Signed-off-by: Miguel Duarte Barroso <[email protected]> * config: allow IPAM config parsing from a NetConfList Currently whereabouts is only able to parse network configurations in the strict [0] format - i.e. **do not accept** a plugin list - [1]. The `ip-control-loop` must recover the full plugin configuration, which may be in the network configuration format. This commit allows whereabouts to now understand both formats. Furthermore, the current CNI release - v1.0.Z - removed the support for [0], meaning that only the configuration list format is now supported [2]. [0] - https://github.com/containernetworking/cni/blob/v0.8.1/SPEC.md#network-configuration [1] - https://github.com/containernetworking/cni/blob/v0.8.1/SPEC.md#network-configuration-lists [2] - https://github.com/containernetworking/cni/blob/master/SPEC.md#released-versions Signed-off-by: Miguel Duarte Barroso <[email protected]> * reconcile-loop: add a controller Listen to pod deletion, and for every deleted pod, assure their IPs are gone. The rough algorithm goes like this: - for every network-status in the pod's annotations: - read associated net-attach-def from the k8s API - extract the range from the net-attach-def - find the corresponding IP pool - look for allocations belonging to the deleted pod - delete them using `IPManagement(..., types.Deallocate, ...)` All the API reads go through the informer cache, which is kept updated whenever the objects are updated on the API. The dockerfiles are also updated, to ship this new binary. Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: remove manual cluster reconciliation This would leave the `ip-control-loop` as the reconciliation tool. Signed-off-by: Miguel Duarte Barroso <[email protected]> * unit tests: assure stale IPAllocation cleanup This commit adds a unit where it is checked that the pod deletion leads to the cleanup of a stale IP address. This commit features the automatic provisioning of the controller informer cache with the data present on the fake clientset tracker (the "fake" datastore). This way, users can just create the client with provisioned data, and that'll trickle down to the informer cache of the pod controller. Because the `network-attachment-definitions` resources feature dashes, the heuristic function that guesses - yes, guesses. very deterministic ... - the name of the resource can't be used - [0]. As such, it was needed to create an alternate `newFakeNetAttachDefClient` where it is possible to specify the correct resource name. [0] - https://github.com/k8snetworkplumbingwg/network-attachment-definition-client/blob/2fd7267afcc4d48dfe6a8cd756b5a08bd04c2c97/vendor/k8s.io/client-go/testing/fixture.go#L331 Signed-off-by: Miguel Duarte Barroso <[email protected]> * unit tests: move helper funcs to other files The helper files are tagged with the `test` build tag, to prevent them from being shipped on the production code binary. Signed-off-by: Miguel Duarte Barroso <[email protected]> * control loop, queueing: use a rate-limiting queue Using a queue allows us to re-queue errors. Signed-off-by: Miguel Duarte Barroso <[email protected]> * control loop: add IPAllocation cleanup related events Adds two new events related to garbage collection of the whereabouts IP addresses: - when an IP address is garbage collected - when a cleanup operation fails and is not re-queued The former event looks like: ``` 116s Normal IPAddressGarbageCollected pod/macvlan1-worker1 \ successful cleanup of IP address [192.168.2.1] from network \ whereabouts-conf ``` The latter event looks like: ``` 10s Warning IPAddressGarbageCollectionFailed failed to garbage \ collect addresses for pod default/macvlan1-worker1 ``` Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: check out statefulset scenarios Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: test different scale up/down order and instance deltas Signed-off-by: Miguel Duarte Barroso <[email protected]> * ci: test e2e bash scripts last These ugly tests do not cleanup after themselves; this way, the golang based tests (which **do** cleanup after themselves) will not be impacted by these left-overs. Signed-off-by: Miguel Duarte Barroso <[email protected]> * ip control loop, unit tests: test negative scenarios Check the event thrown when a request is dropped from the queue, and assure reconciling an allocation is impossible without having access to the attachment configuration data. Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: test fix for issue #182 Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - #182 Signed-off-by: Miguel Duarte Barroso <[email protected]> * ip-control-loop: strip pod before queueing it The ip reconcile loop only requires the pod metadata and its network status annotatations to garbage collect the stale IP addresses. As such, we remove the status and spec parameters from the pod before queueing it. Signed-off-by: Miguel Duarte Barroso <[email protected]> * reconcile-loop: focus on networks w/ whereabouts IPAM type Signed-off-by: Miguel Duarte Barroso <[email protected]>
nicklesimba
pushed a commit
to nicklesimba/whereabouts
that referenced
this issue
Apr 20, 2022
* build: generate ip pool clientSet/informers/listers Signed-off-by: Miguel Duarte Barroso <[email protected]> * vendor: update vendor stuff Signed-off-by: Miguel Duarte Barroso <[email protected]> * build: vendor net-attach-def-client types Signed-off-by: Miguel Duarte Barroso <[email protected]> * config: look for the whereabouts config file in multiple places The reconciler controller will have access to the whereabouts configuration via a mount point. As such, we need a way to specify its path. Signed-off-by: Miguel Duarte Barroso <[email protected]> * reconcile-loop: requires the IP ranges in normalized format The IP reconcile loop also requires the IP ranges in a normalized format; as such, we export it into a function, which will be used in a follow-up commit. Signed-off-by: Miguel Duarte Barroso <[email protected]> * config: allow IPAM config parsing from a NetConfList Currently whereabouts is only able to parse network configurations in the strict [0] format - i.e. **do not accept** a plugin list - [1]. The `ip-control-loop` must recover the full plugin configuration, which may be in the network configuration format. This commit allows whereabouts to now understand both formats. Furthermore, the current CNI release - v1.0.Z - removed the support for [0], meaning that only the configuration list format is now supported [2]. [0] - https://github.com/containernetworking/cni/blob/v0.8.1/SPEC.md#network-configuration [1] - https://github.com/containernetworking/cni/blob/v0.8.1/SPEC.md#network-configuration-lists [2] - https://github.com/containernetworking/cni/blob/master/SPEC.md#released-versions Signed-off-by: Miguel Duarte Barroso <[email protected]> * reconcile-loop: add a controller Listen to pod deletion, and for every deleted pod, assure their IPs are gone. The rough algorithm goes like this: - for every network-status in the pod's annotations: - read associated net-attach-def from the k8s API - extract the range from the net-attach-def - find the corresponding IP pool - look for allocations belonging to the deleted pod - delete them using `IPManagement(..., types.Deallocate, ...)` All the API reads go through the informer cache, which is kept updated whenever the objects are updated on the API. The dockerfiles are also updated, to ship this new binary. Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: remove manual cluster reconciliation This would leave the `ip-control-loop` as the reconciliation tool. Signed-off-by: Miguel Duarte Barroso <[email protected]> * unit tests: assure stale IPAllocation cleanup This commit adds a unit where it is checked that the pod deletion leads to the cleanup of a stale IP address. This commit features the automatic provisioning of the controller informer cache with the data present on the fake clientset tracker (the "fake" datastore). This way, users can just create the client with provisioned data, and that'll trickle down to the informer cache of the pod controller. Because the `network-attachment-definitions` resources feature dashes, the heuristic function that guesses - yes, guesses. very deterministic ... - the name of the resource can't be used - [0]. As such, it was needed to create an alternate `newFakeNetAttachDefClient` where it is possible to specify the correct resource name. [0] - https://github.com/k8snetworkplumbingwg/network-attachment-definition-client/blob/2fd7267afcc4d48dfe6a8cd756b5a08bd04c2c97/vendor/k8s.io/client-go/testing/fixture.go#L331 Signed-off-by: Miguel Duarte Barroso <[email protected]> * unit tests: move helper funcs to other files The helper files are tagged with the `test` build tag, to prevent them from being shipped on the production code binary. Signed-off-by: Miguel Duarte Barroso <[email protected]> * control loop, queueing: use a rate-limiting queue Using a queue allows us to re-queue errors. Signed-off-by: Miguel Duarte Barroso <[email protected]> * control loop: add IPAllocation cleanup related events Adds two new events related to garbage collection of the whereabouts IP addresses: - when an IP address is garbage collected - when a cleanup operation fails and is not re-queued The former event looks like: ``` 116s Normal IPAddressGarbageCollected pod/macvlan1-worker1 \ successful cleanup of IP address [192.168.2.1] from network \ whereabouts-conf ``` The latter event looks like: ``` 10s Warning IPAddressGarbageCollectionFailed failed to garbage \ collect addresses for pod default/macvlan1-worker1 ``` Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: check out statefulset scenarios Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: test different scale up/down order and instance deltas Signed-off-by: Miguel Duarte Barroso <[email protected]> * ci: test e2e bash scripts last These ugly tests do not cleanup after themselves; this way, the golang based tests (which **do** cleanup after themselves) will not be impacted by these left-overs. Signed-off-by: Miguel Duarte Barroso <[email protected]> * ip control loop, unit tests: test negative scenarios Check the event thrown when a request is dropped from the queue, and assure reconciling an allocation is impossible without having access to the attachment configuration data. Signed-off-by: Miguel Duarte Barroso <[email protected]> * e2e tests: test fix for issue k8snetworkplumbingwg#182 Issue [0] reports an error when a pod associated to a `StatefulSet` whose IPPool is already full is deleted. According to it, the new pod - scheduled by the `StatefulSet` - cannot run because the IPPool is already full, and the old pod's IP cannot be garbage collected because we match by pod reference - and the "new" pod is stuck in `creating` phase. [0] - k8snetworkplumbingwg#182 Signed-off-by: Miguel Duarte Barroso <[email protected]> * ip-control-loop: strip pod before queueing it The ip reconcile loop only requires the pod metadata and its network status annotatations to garbage collect the stale IP addresses. As such, we remove the status and spec parameters from the pod before queueing it. Signed-off-by: Miguel Duarte Barroso <[email protected]> * reconcile-loop: focus on networks w/ whereabouts IPAM type Signed-off-by: Miguel Duarte Barroso <[email protected]> Replaced bash e2e test with Golang e2e test (k8snetworkplumbingwg#181) Fixed code errors causing e2e tests to not compile Signed-off-by: nicklesimba <[email protected]> e2e tests: remove manual ip pool reconciliation Signed-off-by: Miguel Duarte Barroso <[email protected]> e2e tests: use the default namespace Signed-off-by: Miguel Duarte Barroso <[email protected]> e2e tests: read the IPAllocations from the correct namespace Furthermore, harden the `isIPPoolAllocationsEmpty` function - currently, if the IPPool does not exist, it will be created without allocations. When that happens, the pool is indeed empty. Signed-off-by: Miguel Duarte Barroso <[email protected]> Changed variable name for readability (envVars *envVars changed to testConfig *envVars) Signed-off-by: nicklesimba <[email protected]>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
@maiqueb @dougbtv There is a larger problem with StatefulSets and container IDs. My current PR (use the container ID to match when deleting) only fixes one issue #180
A short while back we changed the isPodAlive() to skip pods in Pending phase: 55be906
This was because if the ip-reconciler ran while a pod was being deployed, it cleaned up the reservation because kubelet takes some time before adding the network-status annotations and IPs to the Pod resource. So the Pod resource is missing the IP temporarily, and the reservation would be erroneously marked for cleanup.
Now there is new issue where, if the IP range is full. Suppose 3 IPs in range, 3 Pods exist. Now suppose an STS Pod gets recreated after a node goes down. It hits ContainerCreating state and stays there because No IP in Range.
When ip-reconciler runs, it only checks by podRef. and it will match because it's a StatefulSet. The Pod that is stuck ContainerCreating will also be in Pending Phase, as kubelet is still retrying/awaiting the CNI. Therefore the orhpaned IP won't get cleaned up, and new Pod can never come up.
How can we match by the containerID in the IPPool? Except the containerID stored in the reservation is the pause container ID. And this container ID is nowhere to be found in the k8s Pod resource. And I couldn't find an easy way to get the mapping (without going on actual host, grepping output of docker ps, etc...). So not sure how the ip-reconciler could achieve this. Meaning we are stuck using the podRef... which isn't unique for StatefulSets.
If we remove the PodPending check, we are back to issue #162: #162
But this seems less likely to occur than the issue described in this ticket.
The text was updated successfully, but these errors were encountered: