Skip to content

Commit

Permalink
Add retry logic for registermanifests, introduce skip-build override …
Browse files Browse the repository at this point in the history
…for LRT (#8495)

# Description

The pull request addresses findings in the failures related to Long
Running Test:
We see manifests not being registered successfully due to errors "409
Conflict : The target resource is in Accepted state"
This was determined due to the fact that every PUT request for a
resourceprovider/resource type/location/api within a file will
internally make an entry towards a resourceprovidersummary entry which
is kept updated with the latest updates for the resourceprovider and is
optimized for GET calls which receive summarized data for the resource
provider.

The system actually corrects itself eventually and the pods are up with
the manifests registered and ucp is running but the workflow logic will
fail as is designed, the in-built types are not saved in the
skip-delete-resources-list.txt and will be deleted in the next
subsequent runs and they fail.
This error is intermittent (latest fresh build run this evening did not
have this error,
https://github.com/radius-project/radius/actions/runs/13316073926/job/37190564570)
but when it does error, it can lead to 12 subsequent failures.

The PR includes addition of retry logic with exponential backoff for
handling 409 conflict errors.
Introduction of a 'skip-build' override mechism for workflow_dispatch to
be able to run Long Running Tests against latest build on demand.

## Type of change
- This pull request fixes a bug in Radius and has an approved issue
(issue link required).

Fixes: #8449 

## Contributor checklist
Please verify that the PR meets the following requirements, where
applicable:

- An overview of proposed schema changes is included in a linked GitHub
issue.
    - [ ] Yes
    - [ ] Not applicable
- A design document PR is created in the [design-notes
repository](https://github.com/radius-project/design-notes/), if new
APIs are being introduced.
    - [ ] Yes
    - [ ] Not applicable
- The design document has been reviewed and approved by Radius
maintainers/approvers.
    - [ ] Yes
    - [ ] Not applicable
- A PR for the [samples
repository](https://github.com/radius-project/samples) is created, if
existing samples are affected by the changes in this PR.
    - [ ] Yes
    - [ ] Not applicable
- A PR for the [documentation
repository](https://github.com/radius-project/docs) is created, if the
changes in this PR affect the documentation or any user facing updates
are made.
    - [ ] Yes
    - [ ] Not applicable
- A PR for the [recipes
repository](https://github.com/radius-project/recipes) is created, if
existing recipes are affected by the changes in this PR.
    - [ ] Yes
    - [ ] Not applicable

Signed-off-by: lakshmimsft <[email protected]>
  • Loading branch information
lakshmimsft authored Feb 18, 2025
1 parent 7709a79 commit 83d8a72
Show file tree
Hide file tree
Showing 3 changed files with 357 additions and 42 deletions.
29 changes: 22 additions & 7 deletions .github/workflows/long-running-azure.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,12 @@ permissions:
on:
# Enable manual trigger to deploy the latest changes from main.
workflow_dispatch:
inputs:
skip-build:
description: 'Skip build (true/false)'
required: false
default: 'true'
type: string
schedule:
# Run every 2 hours
- cron: "0 */2 * * *"
Expand Down Expand Up @@ -115,20 +121,29 @@ jobs:
path: ./dist/cache
key: radius-test-latest-
- name: Skip build if build is still valid
if: github.event_name != 'pull_request' && github.event_name != 'workflow_dispatch'
if: github.event_name != 'pull_request'
id: skip-build
run: |
# check if the last build time to see if we need to build again
SKIP_BUILD="false"
if [ -f ./dist/cache/.lastbuildtime ]; then
lastbuild=$(cat ./dist/cache/.lastbuildtime)
current_time=$(date +%s)
if [ $((current_time-lastbuild)) -lt ${{ env.VALID_RADIUS_BUILD_WINDOW }} ]; then
echo "Skipping build as the last build is still valid."
echo "SKIP_BUILD=true" >> $GITHUB_OUTPUT
echo "Last build is still within valid window"
SKIP_BUILD="true"
fi
fi
# Check override in workflow_dispatch mode
if [ "${{ github.event_name }}" = "workflow_dispatch" ] && [ "${{ github.event.inputs.skip-build }}" = "false" ]; then
echo "Manual run with skip-build=false, forcing build"
SKIP_BUILD="false"
fi
echo "SKIP_BUILD=${SKIP_BUILD}" >> $GITHUB_OUTPUT
- name: Set up checkout target (scheduled)
if: steps.skip-build.outputs.SKIP_BUILD != 'true' && github.event_name == 'schedule'
if: steps.skip-build.outputs.SKIP_BUILD != 'true' && (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch')
run: |
echo "CHECKOUT_REPO=${{ github.repository }}" >> $GITHUB_ENV
echo "CHECKOUT_REF=refs/heads/main" >> $GITHUB_ENV
Expand Down Expand Up @@ -436,8 +451,8 @@ jobs:
exit 1
fi
# Poll logs for up to iterations, 30 seconds each (upto 3 minutes total)
for i in {1..6}; do
# Poll logs for up to 20 iterations, 30 seconds each (up to 10 minutes total)
for i in {1..20}; do
kubectl logs "$POD_NAME" -n radius-system | tee registermanifest_logs.txt > /dev/null
# Exit on error
Expand All @@ -459,7 +474,7 @@ jobs:
# Final check to ensure success message was found
if ! grep -q "Successfully registered manifests" registermanifest_logs.txt; then
echo "Manifests not registered after 3 minutes."
echo "Manifests not registered after 10 minutes."
exit 1
fi
- name: Create a list of resources not to be deleted
Expand Down
139 changes: 104 additions & 35 deletions pkg/cli/manifest/registermanifest.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,15 +18,23 @@ package manifest

import (
"context"
"errors"
"fmt"
"os"
"path/filepath"
"time"

"github.com/Azure/azure-sdk-for-go/sdk/azcore"
v1 "github.com/radius-project/radius/pkg/armrpc/api/v1"
"github.com/radius-project/radius/pkg/to"
"github.com/radius-project/radius/pkg/ucp/api/v20231001preview"
)

const (
initialBackoff = 2 * time.Second
maxRetries = 5
)

// RegisterFile registers a manifest file
func RegisterFile(ctx context.Context, clientFactory *v20231001preview.ClientFactory, planeName string, filePath string, logger func(format string, args ...any)) error {
if filePath == "" {
Expand All @@ -51,15 +59,22 @@ func RegisterFile(ctx context.Context, clientFactory *v20231001preview.ClientFac
}

logIfEnabled(logger, "Creating resource provider %s at location %s", resourceProvider.Name, locationName)
resourceProviderPoller, err := clientFactory.NewResourceProvidersClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, v20231001preview.ResourceProviderResource{
Location: to.Ptr(locationName),
Properties: &v20231001preview.ResourceProviderProperties{},
}, nil)
if err != nil {
return err
}

_, err = resourceProviderPoller.PollUntilDone(ctx, nil)
err = retryOperation(ctx, func() error {
resourceProviderPoller, err := clientFactory.NewResourceProvidersClient().BeginCreateOrUpdate(
ctx, planeName, resourceProvider.Name,
v20231001preview.ResourceProviderResource{
Location: to.Ptr(locationName),
Properties: &v20231001preview.ResourceProviderProperties{},
}, nil)
if err != nil {
return err
}
_, err = resourceProviderPoller.PollUntilDone(ctx, nil)
if err != nil {
return err // also retried if error indicates a 409 conflict
}
return nil
}, logger)
if err != nil {
return err
}
Expand All @@ -74,17 +89,22 @@ func RegisterFile(ctx context.Context, clientFactory *v20231001preview.ClientFac

for resourceTypeName, resourceType := range resourceProvider.Types {
logIfEnabled(logger, "Creating resource type %s/%s", resourceProvider.Name, resourceTypeName)
resourceTypePoller, err := clientFactory.NewResourceTypesClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, resourceTypeName, v20231001preview.ResourceTypeResource{
Properties: &v20231001preview.ResourceTypeProperties{
Capabilities: to.SliceOfPtrs(resourceType.Capabilities...),
DefaultAPIVersion: resourceType.DefaultAPIVersion,
},
}, nil)
if err != nil {
return err
}

_, err = resourceTypePoller.PollUntilDone(ctx, nil)
err = retryOperation(ctx, func() error {
resourceTypePoller, err := clientFactory.NewResourceTypesClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, resourceTypeName, v20231001preview.ResourceTypeResource{
Properties: &v20231001preview.ResourceTypeProperties{
Capabilities: to.SliceOfPtrs(resourceType.Capabilities...),
DefaultAPIVersion: resourceType.DefaultAPIVersion,
},
}, nil)
if err != nil {
return err
}
_, err = resourceTypePoller.PollUntilDone(ctx, nil)
if err != nil {
return err
}
return nil
}, logger)
if err != nil {
return err
}
Expand All @@ -95,18 +115,22 @@ func RegisterFile(ctx context.Context, clientFactory *v20231001preview.ClientFac

for apiVersionName := range resourceType.APIVersions {
logIfEnabled(logger, "Creating API Version %s/%s@%s", resourceProvider.Name, resourceTypeName, apiVersionName)
apiVersionsPoller, err := clientFactory.NewAPIVersionsClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, resourceTypeName, apiVersionName, v20231001preview.APIVersionResource{
Properties: &v20231001preview.APIVersionProperties{},
}, nil)
err = retryOperation(ctx, func() error {
apiVersionsPoller, err := clientFactory.NewAPIVersionsClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, resourceTypeName, apiVersionName, v20231001preview.APIVersionResource{
Properties: &v20231001preview.APIVersionProperties{},
}, nil)
if err != nil {
return err
}
_, err = apiVersionsPoller.PollUntilDone(ctx, nil)
if err != nil {
return err
}
return nil
}, logger)
if err != nil {
return err
}

_, err = apiVersionsPoller.PollUntilDone(ctx, nil)
if err != nil {
return err
}

locationResourceType.APIVersions[apiVersionName] = map[string]any{}
}

Expand All @@ -118,12 +142,17 @@ func RegisterFile(ctx context.Context, clientFactory *v20231001preview.ClientFac
}

logIfEnabled(logger, "Creating location %s/%s/%s", resourceProvider.Name, locationName, address)
locationPoller, err := clientFactory.NewLocationsClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, locationName, locationResource, nil)
if err != nil {
return err
}

_, err = locationPoller.PollUntilDone(ctx, nil)
err = retryOperation(ctx, func() error {
locationPoller, err := clientFactory.NewLocationsClient().BeginCreateOrUpdate(ctx, planeName, resourceProvider.Name, locationName, locationResource, nil)
if err != nil {
return err
}
_, err = locationPoller.PollUntilDone(ctx, nil)
if err != nil {
return err
}
return nil
}, logger)
if err != nil {
return err
}
Expand Down Expand Up @@ -275,3 +304,43 @@ func logIfEnabled(logger func(format string, args ...any), format string, args .
logger(format, args...)
}
}

// retryOperation retries an operation with exponential backoff upon a 409 conflict.
// It also handles context cancellation or timeouts, returning immediately if ctx is done.
func retryOperation(ctx context.Context, operation func() error, logger func(format string, args ...any)) error {
backoff := initialBackoff

for attempt := 1; attempt <= maxRetries; attempt++ {
err := operation()
if err != nil {
if is409ConflictError(err) {
if logger != nil {
logger("Got 409 conflict on attempt %d/%d. Retrying in %s...", attempt, maxRetries, backoff)
}
// Wait for either the context to be cancelled or the backoff duration to pass
select {
case <-ctx.Done():
// Context cancelled or timed out
return ctx.Err()
case <-time.After(backoff):
// Increase backoff and try again
backoff *= 2
continue
}
}
return err
}
return nil
}
return fmt.Errorf("exceeded %d retries", maxRetries)
}

// is409ConflictError returns true if the error is a 409 Conflict error
func is409ConflictError(err error) bool {
if err == nil {
return false
}

var respErr *azcore.ResponseError
return errors.As(err, &respErr) && respErr.StatusCode == 409
}
Loading

0 comments on commit 83d8a72

Please sign in to comment.