AvSet got deleted and Infrastructure reconciliation succeeded without recreating the AvSet #1054

Open
ialidzhikov opened this issue Jan 7, 2025 · 0 comments
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug platform/azure Microsoft Azure platform/infrastructure

Comments

@ialidzhikov
Member

How to categorize this issue?

/area quality
/kind bug
/platform azure

What happened:
The AvSet was deleted, yet Infrastructure reconciliation succeeded without recreating it. Machine creation later failed because the AvSet was missing.

Together with @dnaeon, we saw an Azure Shoot fail to wake up with the following error:

Status
Waking up, Reconcile Failed
Last Message
Flow "Shoot cluster reconciliation" encountered task errors: [task "Waiting until shoot worker nodes have been reconciled" failed: Error while waiting for Worker shoot--foo--testy/testy to become ready: error during reconciliation: Error reconciling Worker: failed while waiting for all machine deployments to be ready: machine(s) failed: 1 error occurred: "shoot--foo--testy-static-dynamic-8577b-5vplb": Cloud provider message - machine codes error: code = [Internal] message = [Failed to create VirtualMachine: [ResourceGroup: shoot--foo--testy, Name: shoot--foo--testy-static-dynamic-8577b-5vplb], Err: PUT https://management.azure.com/subscriptions/<omitted>/resourceGroups/shoot--foo--testy/providers/Microsoft.Compute/virtualMachines/shoot--foo--testy-static-dynamic-8577b-5vplb
--------------------------------------------------------------------------------
RESPONSE 404: 404 Not Found
ERROR CODE: NotFound
--------------------------------------------------------------------------------
{
  "error": {
    "code": "NotFound",
    "message": "The Availability Set '/subscriptions/<omitted>/resourceGroups/shoot--foo--testy/providers/Microsoft.Compute/availabilitySets/shoot--foo--testy-avset-workers' cannot be found."
  }
}

The provider-azure deleted the AvSet:

2025-01-07 05:00:10	{"log":{"controller":"infrastructure","error":"differences between the current and target spec require the object to be deleted.Resource: Microsoft.Compute/availabilitySets, Name: shoot--foo--testy-avset-workers, Field: location, Expected: , Found: westeurope","flow":"Azure infrastructure reconciliation","level":"error","msg":"will attempt to delete availability set due to irreconcilable error","name":"testy","namespace":"shoot--foo--testy","object":{"name":"testy","namespace":"shoot--foo--testy"},"operation":"reconcile","reconcileID":"aedb2566-e623-488c-8859-80345619b915","stacktrace":"github.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow.(*FlowContext).ensureAvailabilitySet\n\t/go/src/github.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow/ensurer.go:201\ngithub.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow.(*FlowContext).EnsureAvailabilitySet\n\t/go/src/github.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow/ensurer.go:175\ngithub.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow/shared.(*BasicFlowContext).AddTask.TaskFn.Timeout.func1\n\t/go/pkg/mod/github.com/gardener/[email protected]/pkg/utils/flow/taskfn.go:35\ngithub.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow/shared.(*BasicFlowContext).AddTask.(*BasicFlowContext).wrapTaskFn.func2\n\t/go/src/github.com/gardener/gardener-extension-provider-azure/pkg/controller/infrastructure/infraflow/shared/basic_context.go:169\ngithub.com/gardener/gardener/pkg/utils/flow.(*execution).runNode.func2\n\t/go/pkg/mod/github.com/gardener/[email protected]/pkg/utils/flow/flow.go:226","task":"ensure availability set","ts":"2025-01-07T05:00:10.982Z"}}

The corresponding handling is:

    if avset != nil {
        if location := ptr.Deref(avset.Location, ""); location != fctx.adapter.Region() {
            log.Error(NewSpecMismatchError(avsetCfg.AzureResourceMetadata, "location", fctx.adapter.Region(), location, nil), "will attempt to delete availability set due to irreconcilable error")
            err = asClient.Delete(ctx, avsetCfg.ResourceGroup, avsetCfg.Name)
            if err != nil {
                return nil, err
            }
        }
        // domain counts are immutable, therefore we need live with whatever is currently present.
        return avset, nil
    }
    avset = &armcompute.AvailabilitySet{
        Location: to.Ptr(fctx.adapter.Region()),
        // the DomainCounts are computed from the current InfrastructureStatus. They cannot be updated after shoot creation.
        Properties: &armcompute.AvailabilitySetProperties{
            PlatformFaultDomainCount:  avsetCfg.CountFaultDomains,
            PlatformUpdateDomainCount: avsetCfg.CountUpdateDomains,
        },
        SKU: &armcompute.SKU{Name: to.Ptr(string(armcompute.AvailabilitySetSKUTypesAligned))}, // equal to managed = True in tf
    }
    log.Info("reconciling availability set", "name", avset.Name)
    log.V(1).Info("reconciling availability set", "spec", *avset)
    return asClient.CreateOrUpdate(ctx, fctx.adapter.ResourceGroupName(), avsetCfg.Name, *avset)

After investigation with @dnaeon we found two potential issues:

  1. The expected region (fctx.adapter.Region()) in

    if location := ptr.Deref(avset.Location, ""); location != fctx.adapter.Region() {
    was wrong: it was empty, which matches the "Expected: , Found: westeurope" part of the log above. We have to understand why.

  2. The early exit in

    // domain counts are immutable, therefore we need live with whatever is currently present.
    return avset, nil
    seems to be wrong: we delete the AvSet and then return the stale object without recreating it, so Infrastructure reconciliation later succeeds even though the AvSet is missing.
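
To make the two points above concrete, here is a minimal sketch of one possible shape for the handling. It is not the provider-azure implementation: `AvailabilitySet`, `AvSetClient`, `fakeClient`, `ErrEmptyRegion`, and this standalone `ensureAvailabilitySet` signature are hypothetical stand-ins for the real types, and it assumes that `Get` returns `nil` when the AvSet does not exist. The sketch refuses to compare against an empty expected region, and after a delete it falls through to recreation instead of returning early.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// AvailabilitySet is a hypothetical, trimmed-down stand-in for armcompute.AvailabilitySet.
type AvailabilitySet struct {
	Name     string
	Location string
}

// AvSetClient is a hypothetical interface covering only the calls used in this sketch.
type AvSetClient interface {
	Get(ctx context.Context, resourceGroup, name string) (*AvailabilitySet, error)
	Delete(ctx context.Context, resourceGroup, name string) error
	CreateOrUpdate(ctx context.Context, resourceGroup, name string, avset AvailabilitySet) (*AvailabilitySet, error)
}

// ErrEmptyRegion signals that the expected region is unknown (hypothetical sentinel).
var ErrEmptyRegion = errors.New("expected region is empty, refusing to compare availability set location")

func ensureAvailabilitySet(ctx context.Context, c AvSetClient, resourceGroup, name, region string) (*AvailabilitySet, error) {
	// Issue 1: an empty expected region would make every existing AvSet look mismatched.
	if region == "" {
		return nil, ErrEmptyRegion
	}
	avset, err := c.Get(ctx, resourceGroup, name)
	if err != nil {
		return nil, err
	}
	if avset != nil {
		if avset.Location == region {
			// Nothing to do: keep whatever is currently present.
			return avset, nil
		}
		if err := c.Delete(ctx, resourceGroup, name); err != nil {
			return nil, fmt.Errorf("deleting availability set %q: %w", name, err)
		}
		// Issue 2: no early return here; fall through so the AvSet is recreated.
	}
	return c.CreateOrUpdate(ctx, resourceGroup, name, AvailabilitySet{Name: name, Location: region})
}

// fakeClient is an in-memory AvSetClient used only to exercise the sketch.
type fakeClient struct {
	sets map[string]AvailabilitySet
}

func (f *fakeClient) Get(_ context.Context, _, name string) (*AvailabilitySet, error) {
	if s, ok := f.sets[name]; ok {
		return &s, nil
	}
	return nil, nil
}

func (f *fakeClient) Delete(_ context.Context, _, name string) error {
	delete(f.sets, name)
	return nil
}

func (f *fakeClient) CreateOrUpdate(_ context.Context, _, name string, avset AvailabilitySet) (*AvailabilitySet, error) {
	f.sets[name] = avset
	return &avset, nil
}

func main() {
	ctx := context.Background()
	c := &fakeClient{sets: map[string]AvailabilitySet{
		"avset-workers": {Name: "avset-workers", Location: "westeurope"},
	}}
	// With an empty expected region the existing AvSet is left untouched.
	_, err := ensureAvailabilitySet(ctx, c, "rg", "avset-workers", "")
	fmt.Println("error:", err, "avsets present:", len(c.sets))
}
```

Whether recreating the AvSet is actually the desired behaviour is a separate question for the maintainers; the sketch only illustrates that a delete should not be followed by a successful return of the stale object.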

What you expected to happen:
Infrastructure reconciliation should not succeed when the AvSet is missing or deleted.

How to reproduce it (as minimally and precisely as possible):
Not clear for now. But we suspect something like:

  1. Create AvSet-based cluster.
  2. Hibernate it.
  3. Migrate to the flow reconciler.
  4. Wake up the cluster.

Anything else we need to know?:
N/A

Environment:

  • Gardener version (if relevant):
  • Extension version: v1.49.1
  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • Others:
@gardener-robot gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug platform/azure Microsoft Azure platform/infrastructure labels Jan 7, 2025