Application status "Degraded" after down scaling #3936

Open
greeddj opened this issue Nov 7, 2024 · 1 comment
Labels
bug Something isn't working

greeddj commented Nov 7, 2024

Hello there 👋🏻

Checklist:

  • [+] I've included steps to reproduce the bug.
  • [+] I've included the version of argo rollouts.

Describe the bug
From time to time (yes, it is an intermittent issue), I encounter "kind: Rollout" applications going into a Degraded status.

After examining the logs and the application's state, I discovered the following:

  • The status field does not contain up-to-date information, for instance about the replica count and the HPA state (a quick check is shown after this list).
  • The controller attempts to create a new replica when the number of pods is decreased.
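
For illustration, one way to observe the mismatch described above is to compare .spec.replicas with the replica counts still reported in .status (a hypothetical one-liner, using the namespace and Rollout name from this report):

➜ ~ kubectl -n staging get rollout app-api -o jsonpath='spec={.spec.replicas} HPAReplicas={.status.HPAReplicas} phase={.status.phase}{"\n"}'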

Horizontal autoscaling is handled by the KEDA controller via a kind: ScaledObject:

➜ ~ kubectl -n staging get so app-api -ojson | jq '.spec'
{
  "cooldownPeriod": 30,
  "maxReplicaCount": 3,
  "minReplicaCount": 1,
  "pollingInterval": 10,
  "scaleTargetRef": {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Rollout",
    "name": "app-api"
  },
  "triggers": [
    {
      "metadata": {
        "query": "round(workers_utilization{container=\"app-api\", namespace=\"staging\"})",
        "serverAddress": "https://prom.k8s.local/",
        "threshold": "70"
      },
      "metricType": "Value",
      "type": "prometheus"
    }
  ]
}

The current number of replicas is determined by the ScaledObject above:

➜ ~ kubectl -n staging get rollout app-api -ojson | jq '.spec.replicas'
1

➜ ~ kubectl -n staging get rs app-api-54cc7b4b4d
NAME                    DESIRED   CURRENT   READY   AGE
app-api-54cc7b4b4d      1         1         1       46h

Current status (Degraded)

➜ ~ kubectl -n staging get rollout app-api -ojson | jq '.status'
{
  "HPAReplicas": 2,
  "availableReplicas": 2,
  "blueGreen": {
    "activeSelector": "54cc7b4b4d"
  },
  "canary": {},
  "conditions": [
    {
      "lastTransitionTime": "2024-11-05T14:48:24Z",
      "lastUpdateTime": "2024-11-05T14:48:24Z",
      "message": "RolloutCompleted",
      "reason": "RolloutCompleted",
      "status": "True",
      "type": "Completed"
    },
    {
      "lastTransitionTime": "2024-11-07T11:25:24Z",
      "lastUpdateTime": "2024-11-07T11:25:24Z",
      "message": "Rollout is not healthy",
      "reason": "RolloutHealthy",
      "status": "False",
      "type": "Healthy"
    },
    {
      "lastTransitionTime": "2024-11-07T11:25:24Z",
      "lastUpdateTime": "2024-11-07T11:25:24Z",
      "message": "Rollout does not have minimum availability",
      "reason": "AvailableReason",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2024-11-07T11:35:25Z",
      "lastUpdateTime": "2024-11-07T11:35:25Z",
      "message": "ReplicaSet \"app-api-54cc7b4b4d\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "currentPodHash": "54cc7b4b4d",
  "message": "ProgressDeadlineExceeded: ReplicaSet \"app-api-54cc7b4b4d\" has timed out progressing.",
  "observedGeneration": "154",
  "phase": "Degraded",
  "readyReplicas": 2,
  "replicas": 2,
  "restartedAt": "2024-11-06T18:19:49Z",
  "selector": "app=app-api,app.kubernetes.io/instance=app,app.kubernetes.io/name=app,rollouts-pod-template-hash=54cc7b4b4d",
  "stableRS": "54cc7b4b4d",
  "updatedReplicas": 2
}

Expected behavior
I expect scale-down to be handled consistently: the Rollout status should stay in sync with the actual replica count, and the application should remain Healthy rather than being marked Degraded.

Version

# kubectl version
Client Version: v1.31.1
Kustomize Version: v5.4.2
Server Version: v1.31.1

# Argo Rollouts Chart/App version
2.37.7/v1.7.2

# Keda Chart/App version
2.15.1/2.15.1

Logs

Logs indicate the following:

  1. When scaling down from 3 to 2 replicas, there are no issues: the HPA status patch is applied and no new replicas are created.
  2. When scaling down from 2 to 1 replica, there are issues: a new replica is created (unsuccessfully), the HPA status patch is absent, and a patch with negative conditions is applied instead.
➜ ~ kubectl -n argo-rollouts logs deployment/argo-rollouts | grep "namespace=staging" | grep "rollout=app-api"
...

# generation=153 / scale down 3->2

time="2024-11-07T11:20:24Z" level=info msg="Started syncing rollout" generation=153 namespace=staging resourceVersion=585378104 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Syncing replicas only due to scaling event" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Scaled down ReplicaSet app-api-54cc7b4b4d (revision 50) from 3 to 2" event_reason=ScalingReplicaSet namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Conflict when updating replicaset app-api-54cc7b4b4d, falling back to patch" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Patching replicaset with patch: {\"metadata\":{\"annotations\":{\"rollout.argoproj.io/desired-replicas\":\"2\",\"rollout.argoproj.io/revision\":\"50\"},\"labels\":{\"rollouts-pod-template-hash\":\"54cc7b4b4d\"}},\"spec\":{\"replicas\":2,\"selector\":{\"matchLabels\":{\"rollouts-pod-template-hash\":\"54cc7b4b4d\"}},\"template\":{\"metadata\":{\"annotations\":{\"vector-format\":\"pod-app\"},\"labels\":{\"app\":\"app-api\",\"app.kubernetes.io/instance\":\"app\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"app\",\"app.kubernetes.io/version\":\"5ae9aef4fcd097a2059bf31802273518b2e3cde8\",\"helm.sh/chart\":\"app-v1.0.0-5ae9aef4fcd097a2059bf31802273518b2e3cde8\",\"rollouts-pod-template-hash\":\"54cc7b4b4d\"}}}}}" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Scaled down ReplicaSet app-api-54cc7b4b4d (revision 50) from 3 to 2" event_reason=ScalingReplicaSet namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Patched: {\"status\":{\"observedGeneration\":\"153\"}}" generation=153 namespace=staging resourceVersion=585378104 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="persisted to informer" generation=153 namespace=staging resourceVersion=585378118 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciliation completed" generation=153 namespace=staging resourceVersion=585378104 rollout=app-api time_ms=85.632779
time="2024-11-07T11:20:24Z" level=info msg="Started syncing rollout" generation=153 namespace=staging resourceVersion=585378118 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Patched: {\"status\":{\"HPAReplicas\":2,\"availableReplicas\":2,\"readyReplicas\":2,\"replicas\":2,\"updatedReplicas\":2}}" generation=153 namespace=staging resourceVersion=585378118 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="persisted to informer" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciliation completed" generation=153 namespace=staging resourceVersion=585378118 rollout=app-api time_ms=26.638623
time="2024-11-07T11:20:24Z" level=info msg="Started syncing rollout" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="No status changes. Skipping patch" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciliation completed" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api time_ms=3.438866
time="2024-11-07T11:20:24Z" level=info msg="Started syncing rollout" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="No status changes. Skipping patch" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api
time="2024-11-07T11:20:24Z" level=info msg="Reconciliation completed" generation=153 namespace=staging resourceVersion=585378119 rollout=app-api time_ms=3.333507

# generation=154 / scale down 2->1

time="2024-11-07T11:25:24Z" level=info msg="Started syncing rollout" generation=154 namespace=staging resourceVersion=585381067 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Syncing replicas only due to scaling event" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Scaled down ReplicaSet app-api-54cc7b4b4d (revision 50) from 2 to 1" event_reason=ScalingReplicaSet namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Conflict when updating replicaset app-api-54cc7b4b4d, falling back to patch" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Patching replicaset with patch: {\"metadata\":{\"annotations\":{\"rollout.argoproj.io/desired-replicas\":\"1\",\"rollout.argoproj.io/revision\":\"50\"},\"labels\":{\"rollouts-pod-template-hash\":\"54cc7b4b4d\"}},\"spec\":{\"replicas\":1,\"selector\":{\"matchLabels\":{\"rollouts-pod-template-hash\":\"54cc7b4b4d\"}},\"template\":{\"metadata\":{\"annotations\":{\"vector-format\":\"pod-app\"},\"labels\":{\"app\":\"app-api\",\"app.kubernetes.io/instance\":\"app\",\"app.kubernetes.io/managed-by\":\"Helm\",\"app.kubernetes.io/name\":\"app\",\"app.kubernetes.io/version\":\"5ae9aef4fcd097a2059bf31802273518b2e3cde8\",\"helm.sh/chart\":\"app-v1.0.0-5ae9aef4fcd097a2059bf31802273518b2e3cde8\",\"rollouts-pod-template-hash\":\"54cc7b4b4d\"}}}}}" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Scaled down ReplicaSet app-api-54cc7b4b4d (revision 50) from 2 to 1" event_reason=ScalingReplicaSet namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Patched: {\"status\":{\"observedGeneration\":\"154\"}}" generation=154 namespace=staging resourceVersion=585381067 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="persisted to informer" generation=154 namespace=staging resourceVersion=585381080 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Reconciliation completed" generation=154 namespace=staging resourceVersion=585381067 rollout=app-api time_ms=85.818662
time="2024-11-07T11:25:24Z" level=info msg="Started syncing rollout" generation=154 namespace=staging resourceVersion=585381080 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="New RS 'app-api-54cc7b4b4d' is not ready to pause" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="skipping active service switch: New RS 'app-api-54cc7b4b4d' is not fully saturated" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Patched: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-11-05T14:48:24Z\",\"lastUpdateTime\":\"2024-11-05T14:48:24Z\",\"message\":\"RolloutCompleted\",\"reason\":\"RolloutCompleted\",\"status\":\"True\",\"type\":\"Completed\"},{\"lastTransitionTime\":\"2024-11-07T11:25:24Z\",\"lastUpdateTime\":\"2024-11-07T11:25:24Z\",\"message\":\"Rollout is not healthy\",\"reason\":\"RolloutHealthy\",\"status\":\"False\",\"type\":\"Healthy\"},{\"lastTransitionTime\":\"2024-11-06T18:19:49Z\",\"lastUpdateTime\":\"2024-11-07T11:25:24Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"ReplicaSetNotAvailable\",\"status\":\"True\",\"type\":\"Progressing\"},{\"lastTransitionTime\":\"2024-11-07T11:25:24Z\",\"lastUpdateTime\":\"2024-11-07T11:25:24Z\",\"message\":\"Rollout does not have minimum availability\",\"reason\":\"AvailableReason\",\"status\":\"False\",\"type\":\"Available\"}]}}" generation=154 namespace=staging resourceVersion=585381080 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="persisted to informer" generation=154 namespace=staging resourceVersion=585381081 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Reconciliation completed" generation=154 namespace=staging resourceVersion=585381080 rollout=app-api time_ms=28.417491
time="2024-11-07T11:25:24Z" level=info msg="Started syncing rollout" generation=154 namespace=staging resourceVersion=585381081 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Reconciling stable ReplicaSet 'app-api-54cc7b4b4d'" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="New RS 'app-api-54cc7b4b4d' is not ready to pause" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="skipping active service switch: New RS 'app-api-54cc7b4b4d' is not fully saturated" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Timed out (false) [last progress check: 2024-11-07 11:25:24 +0000 UTC - now: 2024-11-07 11:25:24.564589362 +0000 UTC m=+779860.795748433]" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="No status changes. Skipping patch" generation=154 namespace=staging resourceVersion=585381081 rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Queueing up rollout for a progress after 599s" namespace=staging rollout=app-api
time="2024-11-07T11:25:24Z" level=info msg="Reconciliation completed" generation=154 namespace=staging resourceVersion=585381081 rollout=app-api time_ms=3.115118
...

How to reproduce the bug

  1. Create a Rollout object (blue-green) with a single active service.
  2. Add a KEDA ScaledObject with a Prometheus trigger.
  3. Apply load until the HPA is triggered, then stop the load and wait for the replicas to scale from max back down to min (repeat until the "Degraded" phase is caught). A minimal manifest sketch is shown after this list.
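
A minimal sketch of such a setup; the ScaledObject mirrors the spec shown earlier in this issue, while the Rollout, Service, image, and labels are placeholders assumed for illustration:

➜ ~ kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app-api
  namespace: staging
spec:
  selector:
    matchLabels:
      app: app-api
  strategy:
    blueGreen:
      activeService: app-api   # single active service, no preview service
  template:
    metadata:
      labels:
        app: app-api
    spec:
      containers:
        - name: app-api
          image: nginx:1.27     # placeholder image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: app-api
  namespace: staging
spec:
  selector:
    app: app-api
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: app-api
  namespace: staging
spec:
  scaleTargetRef:
    apiVersion: argoproj.io/v1alpha1
    kind: Rollout
    name: app-api
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 10
  cooldownPeriod: 30
  triggers:
    - type: prometheus
      metricType: Value
      metadata:
        serverAddress: https://prom.k8s.local/
        threshold: "70"
        query: round(workers_utilization{container="app-api", namespace="staging"})
EOF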

For convenient monitoring, you can run this snippet in another terminal window:

while sleep 10; do v=$(kubectl -n staging get rollout app-api -o jsonpath='{.status.HPAReplicas}/{.spec.replicas} ({.status.phase})'); echo "${v} at $(date)"; done

If HPAReplicas == spec.replicas, everything is fine, e.g.: 1/1 (Healthy) at Mon Nov 11 09:46:21 AM UTC 2024
Otherwise the Degraded phase will appear soon, e.g.: 2/1 (Degraded) at Mon Nov 11 09:46:21 AM UTC 2024


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.


greeddj commented Jan 9, 2025

Hello

Can anyone help or point me in the direction of where else to look or check?
