
Review the reconciliation logic to prevent system overload #143

Open
SaschaSchwarze0 opened this issue Apr 21, 2020 · 5 comments

@SaschaSchwarze0
Member

We just had a situation on our development cluster where two build custom resources were already defining the service account as an object. Due to a mistake during deployment, an old build operator was still running that expected a string there.

The result is that reconciliation happens endlessly. And this is just one example; another cause of endless reconciliation is bad references to credentials. To prevent a system overload from these reconciliations, we should do two things:

  1. Apply a delay time when reconciling, see discussion at Refine the reconcile logic #109 (comment)
  2. Investigate whether we can stop the reconciliation process if the user does not fix the root cause within a certain time (maybe one hour) and put the custom resource into some "permanently failed state" (see the sketch below)
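
A minimal sketch of what point 2 could look like, assuming controller-runtime's `reconcile.Result` and a status field that records when the failure was first observed; all names and values here are illustrative, not existing build-operator code:

```go
package build

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// failureGracePeriod is how long we keep retrying before giving up
// (one hour, as suggested above).
const failureGracePeriod = time.Hour

// resultForFailure keeps retrying with a delay while the failure is recent,
// and stops requeuing once the grace period has elapsed so the resource can
// be marked as permanently failed in its status.
func resultForFailure(firstFailure metav1.Time) reconcile.Result {
	if time.Since(firstFailure.Time) > failureGracePeriod {
		// Permanently failed: do not requeue; the status update happens elsewhere.
		return reconcile.Result{}
	}
	// Still within the grace period: back off instead of requeuing immediately.
	return reconcile.Result{RequeueAfter: 30 * time.Second}
}
```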
qu1queee self-assigned this May 8, 2020
@qu1queee
Contributor

qu1queee commented May 8, 2020

This is not a bug; it works as designed. It would be good to have numbers on the potential performance degradation when reconciles never stop. Adding this to #174 for a short discussion.

@sbose78
Member

sbose78 commented May 15, 2020

Agreed, this constant reconciliation paradigm does feel a little chatty, though in general it isn't expensive. Nevertheless, it would be good to see what the resource footprint is.

@qu1queee
Contributor

@SaschaSchwarze0 do you know if we have some internal results around this? Or are these metrics (multiple reconciles causing system overload) something we can request from Emily or someone similar to get for us?

@SaschaSchwarze0
Member Author

@qu1queee no, I do not have results. But I agree, it would be interesting to see the difference between a performance run on a clean system vs. one where 1000 (just a random number) build runs are reconciling because of some failure.

@otaviof
Member

otaviof commented May 20, 2020

> The result is that reconciliation happens endlessly. And this is just one example; another cause of endless reconciliation is bad references to credentials. To prevent a system overload from these reconciliations, we should do two things:

> 1. Apply a delay time when reconciling, see discussion at [#109 (comment)](https://github.com/redhat-developer/build/pull/109#issuecomment-614079616)

Good approach! This will ease the pressure on the API server when it requeues failed attempts.
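
For illustration, a delayed requeue with controller-runtime could look like the sketch below; the helper name and the delay value are assumptions, not existing code:

```go
package build

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueAfter schedules the next reconcile attempt after the given delay.
// Returning a nil error avoids controller-runtime's immediate error-driven
// retry, so failing resources put less pressure on the API server.
func requeueAfter(delay time.Duration) (reconcile.Result, error) {
	return reconcile.Result{RequeueAfter: delay}, nil
}
```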

> 2. Investigate whether we can stop the reconciliation process if the user does not fix the root cause within a certain time (maybe one hour) and put the custom resource into some "permanently failed state"

An example of a permanently failed state can be taken from the service-binding-operator:

```go
// NoRequeue returns error without requeue flag.
func NoRequeue(err error) (reconcile.Result, error) {
	return reconcile.Result{}, err
}
```

Additionally, we should define the different result scenarios as dedicated functions, to inform the Kubernetes API server how to proceed, and reuse this behavior throughout the operator.

As a practical example, please consider the methods defined here.
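
As a rough sketch, such dedicated result functions could look like the following; the names and semantics are illustrative, following the pattern above rather than the operator's actual code:

```go
package build

import "sigs.k8s.io/controller-runtime/pkg/reconcile"

// Done signals a successful reconcile with nothing left to do.
func Done() (reconcile.Result, error) {
	return reconcile.Result{}, nil
}

// Requeue asks for another reconcile without reporting an error.
func Requeue() (reconcile.Result, error) {
	return reconcile.Result{Requeue: true}, nil
}

// RequeueOnError requeues via controller-runtime's error backoff when err is
// non-nil, and otherwise behaves like Done.
func RequeueOnError(err error) (reconcile.Result, error) {
	return reconcile.Result{}, err
}
```

The reconcile loop could then return, for example, `RequeueOnError(err)` for transient problems and `Done()` once the resource has been marked as permanently failed.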
