
Review the reconciliation logic to prevent system overload #143

Open
SaschaSchwarze0 opened this issue Apr 21, 2020 · 5 comments

@SaschaSchwarze0
Member

We just had a situation on our development cluster where two build custom resources were already defining the service account as an object. Due to a mistake during deployment, an old build operator was still running that expected a string there.

The result is that reconciliation happens endlessly. And this is just one example; another cause of endless reconciliation is bad references to credentials. To prevent a system overload from these reconciliations, we should do two things:

  1. Apply a delay time when reconciling, see discussion at Refine the reconcile logic #109 (comment)
  2. Investigate whether we can stop the reconciliation process if the user does not fix the root cause within a certain time (maybe one hour) and put the custom resource into some "permanently failed state" (see the sketch below)
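
A minimal sketch of what point 2 could look like, assuming controller-runtime's `reconcile.Result` and a status field that records when the failure was first observed; all names and values here are illustrative, not existing build-operator code:

```go
package build

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// failureGracePeriod is how long we keep retrying before giving up
// (one hour, as suggested above).
const failureGracePeriod = time.Hour

// resultForFailure keeps retrying with a delay while the failure is recent,
// and stops requeuing once the grace period has elapsed so the resource can
// be marked as permanently failed in its status.
func resultForFailure(firstFailure metav1.Time) reconcile.Result {
	if time.Since(firstFailure.Time) > failureGracePeriod {
		// Permanently failed: do not requeue; the status update happens elsewhere.
		return reconcile.Result{}
	}
	// Still within the grace period: back off instead of requeuing immediately.
	return reconcile.Result{RequeueAfter: 30 * time.Second}
}
```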
qu1queee self-assigned this May 8, 2020
@qu1queee
Contributor

qu1queee commented May 8, 2020

This is not a bug; it works as designed. It would be good to have numbers on the potential performance degradation when reconciles never stop. Adding this to #174 for a short discussion.

@sbose78
Member

sbose78 commented May 15, 2020

Agreed, this constant reconciliation paradigm does feel a little chatty, though in general it isn't expensive. Nevertheless, it would be good to see what the resource footprint is.

@qu1queee
Contributor

@SaschaSchwarze0 do you know if we have some internal results around this? Or are these metrics (multiple reconciles causing system overload) something we can request from Emily or someone similar to get for us?

@SaschaSchwarze0
Member Author

@qu1queee no, I do not have results. But I agree, it would be interesting to see the difference between a performance run on a clean system vs. one where 1000 (just a random number) build runs are reconciling because of some failure.

@otaviof
Member

otaviof commented May 20, 2020

> The result is that reconciliation happens endlessly. And this is just one example; another cause of endless reconciliation is bad references to credentials. To prevent a system overload from these reconciliations, we should do two things:

> 1. Apply a delay time when reconciling, see discussion at [#109 (comment)](https://github.com/redhat-developer/build/pull/109#issuecomment-614079616)

Good approach! This will ease the pressure on the API server when it requeues failed attempts.
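
For illustration, a delayed requeue with controller-runtime could look like the sketch below; the helper name and the delay value are assumptions, not existing code:

```go
package build

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// requeueAfter schedules the next reconcile attempt after the given delay.
// Returning a nil error avoids controller-runtime's immediate error-driven
// retry, so failing resources put less pressure on the API server.
func requeueAfter(delay time.Duration) (reconcile.Result, error) {
	return reconcile.Result{RequeueAfter: delay}, nil
}
```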

> 2. Investigate whether we can stop the reconciliation process if the user does not fix the root cause within a certain time (maybe one hour) and put the custom resource into some "permanently failed state"

An example of a permanently failed state can be taken from the service-binding-operator:

```go
// NoRequeue returns error without requeue flag.
func NoRequeue(err error) (reconcile.Result, error) {
	return reconcile.Result{}, err
}
```

Additionally, we should define the different result scenarios as dedicated functions, to inform the Kubernetes API server how to proceed, and reuse this behavior throughout the operator.

As a practical example, please consider the methods defined here.
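
As a rough sketch, such dedicated result functions could look like the following; the names and semantics are illustrative, following the pattern above rather than the operator's actual code:

```go
package build

import "sigs.k8s.io/controller-runtime/pkg/reconcile"

// Done signals a successful reconcile with nothing left to do.
func Done() (reconcile.Result, error) {
	return reconcile.Result{}, nil
}

// Requeue asks for another reconcile without reporting an error.
func Requeue() (reconcile.Result, error) {
	return reconcile.Result{Requeue: true}, nil
}

// RequeueOnError requeues via controller-runtime's error backoff when err is
// non-nil, and otherwise behaves like Done.
func RequeueOnError(err error) (reconcile.Result, error) {
	return reconcile.Result{}, err
}
```

The reconcile loop could then return, for example, `RequeueOnError(err)` for transient problems and `Done()` once the resource has been marked as permanently failed.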
