
Fixes issue #398: decreasing replicas will make ZooKeeper unrecoverable when ZooKeeper is not running. #399

Closed
wants to merge 9 commits into master from fix_issue_398

Conversation


@stop-coding stop-coding commented Oct 14, 2021

Change log description

Fixes the bug where decreasing replicas makes ZooKeeper unrecoverable when ZooKeeper is not running.

Purpose of the change

Fixes #398

What the code does

Adds protection around updating the StatefulSet while ZooKeeper is not running.
If ZooKeeper is not running, updates to the replica count are blocked until ZooKeeper resumes (a rough sketch of this guard is shown below).
When the user decreases the replica count, the member is first removed from the ensemble via reconfig.
The reconfig-based removal is also performed in the preStop hook, before the pod exits.
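
The following is a minimal sketch of the guard, under the assumption that the reconciler can probe ZooKeeper connectivity before applying a StatefulSet update. The names zkChecker, guardScaleDown, and downZK are illustrative placeholders, not the PR's actual code; in the real controller the retry happens by requeueing the reconcile rather than by an explicit loop.

```go
package main

import (
	"errors"
	"fmt"
)

// zkChecker abstracts "is the ZooKeeper ensemble reachable?". In the operator
// this would be backed by the ZooKeeper client connection; here it is a
// hypothetical interface used only for illustration.
type zkChecker interface {
	Ping() error
}

// guardScaleDown returns the replica count that should actually be applied.
// Scaling up (or no change) is always allowed. Scaling down is only allowed
// while ZooKeeper is running, because the member must first be removed from
// the ensemble via reconfig; otherwise the current count is kept and the
// update is retried later.
func guardScaleDown(current, desired int32, zk zkChecker) (int32, error) {
	if desired >= current {
		return desired, nil
	}
	if err := zk.Ping(); err != nil {
		return current, fmt.Errorf("zookeeper not running, postponing scale down: %w", err)
	}
	return desired, nil
}

// downZK simulates an unreachable ensemble.
type downZK struct{}

func (downZK) Ping() error { return errors.New("connection refused") }

func main() {
	// With ZooKeeper down, a request to shrink from 3 to 1 keeps 3 replicas
	// and reports an error so the reconcile loop can requeue.
	replicas, err := guardScaleDown(3, 1, downZK{})
	fmt.Println(replicas, err)
}
```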

How to verify it

  1. Create a cluster of size 3 (kubectl create -f zk.yaml).
  2. Wait until all pods are running, named zk-0, zk-1, zk-2.
  3. Delete the zk-1 and zk-2 pods so that the ZooKeeper cluster can no longer provide service.
  4. Immediately run "kubectl edit zk" and change replicas to 1.
  5. After some time, the replica count will decrease to 1.
  6. Now check that zk-0 is still healthy.


codecov bot commented Oct 14, 2021

Codecov Report

Merging #399 (15e58cf) into master (e6d84b8) will decrease coverage by 0.07%.
The diff coverage is 85.18%.


@@            Coverage Diff             @@
##           master     #399      +/-   ##
==========================================
- Coverage   84.11%   84.04%   -0.08%     
==========================================
  Files          12       12              
  Lines        1643     1667      +24     
==========================================
+ Hits         1382     1401      +19     
- Misses        177      185       +8     
+ Partials       84       81       -3     
| Impacted Files | Coverage Δ |
| --- | --- |
| pkg/zk/zookeeper_client.go | 82.50% <77.77%> (-1.38%) ⬇️ |
| ...er/zookeepercluster/zookeepercluster_controller.go | 63.50% <88.88%> (+0.47%) ⬆️ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

- if instance.Spec.Replicas == instance.Status.ReadyReplicas && (!instance.Status.MetaRootCreated) {
+ // instance.Spec.Replicas is only the desired value we set; it may not yet have taken effect in Kubernetes.
+ // So we check instance.Status.Replicas against ReadyReplicas, which reflects the true status of the pods.
+ if instance.Status.Replicas == instance.Status.ReadyReplicas && (!instance.Status.MetaRootCreated) {
Contributor


If we don't compare with spec.Replicas, won't the status be shown incorrectly in the case of scaling?

Contributor


I am afraid replacing instance.Spec.Replicas with instance.Status.Replicas won't work for the initial ZK cluster deployment and scale-up scenarios.
The very next line, which logs "Cluster is Ready", would not be true if we only compare against the status.

Remember, when the cluster is initially created, the pods are created one by one.
If we adopt the instance.Status.Replicas == instance.Status.ReadyReplicas condition, we would move on to the next step after the first pod is created, without waiting for the rest of the replicas.

Author


Yes, I get it.
This condition only takes effect on scale up, so it won't affect scaling down.
Would it be more appropriate to judge whether the cluster is ready with the condition
instance.Spec.Replicas == instance.Status.ReadyReplicas && instance.Spec.Replicas == instance.Status.Replicas?
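
A toy comparison of the two conditions, for illustration only: the helpers below are hypothetical and operate on bare numbers rather than the ZookeeperCluster object, but the values mirror Spec.Replicas, Status.Replicas, and Status.ReadyReplicas discussed above.

```go
package main

import "fmt"

// statusOnlyReady is the condition originally proposed in the diff.
func statusOnlyReady(statusReplicas, readyReplicas int32) bool {
	return statusReplicas == readyReplicas
}

// combinedReady is the condition suggested in the comment above: the cluster is
// ready only when both the observed and the ready replica counts match the spec.
func combinedReady(specReplicas, statusReplicas, readyReplicas int32) bool {
	return specReplicas == readyReplicas && specReplicas == statusReplicas
}

func main() {
	// Initial deployment of a 3-node cluster, first pod just became ready:
	// spec = 3, status = 1, ready = 1.
	fmt.Println(statusOnlyReady(1, 1))  // true  -> would wrongly report "Cluster is Ready"
	fmt.Println(combinedReady(3, 1, 1)) // false -> keeps waiting for the remaining pods

	// Fully deployed: spec = 3, status = 3, ready = 3.
	fmt.Println(combinedReady(3, 3, 3)) // true
}
```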

@anishakj
Contributor

@stop-coding, I have one more question: how did you make the ZooKeeper cluster service stop? Would it be possible to add a test for the same?

hongchunhua added 4 commits October 19, 2021 18:15
…make cluster of zookeeper Unrecoverable

Signed-off-by: hongchunhua <[email protected]>
…recoverable when zookeeper not running.

Signed-off-by: hongchunhua <[email protected]>
Signed-off-by: hongchunhua <[email protected]>
@stop-coding
Author

> @stop-coding, I have one more question: how did you make the ZooKeeper cluster service stop? Would it be possible to add a test for the same?

@anishakj
Delete more than half of the ZooKeeper pods immediately after editing the replicas on the zk resource; with only one of three servers left, the ensemble loses its majority quorum and can no longer serve requests (see the sketch below).
The same situation can also arise when a node whose pod is running has a network failure while we edit the replicas.
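
The failure mode here is ZooKeeper's majority quorum: an ensemble of N servers can only serve requests while more than N/2 of them are running, so deleting two of three pods leaves the remaining server unable to form a quorum. A tiny illustration (hasQuorum is a hypothetical helper, not operator code):

```go
package main

import "fmt"

// hasQuorum reports whether a ZooKeeper ensemble of the given size can still
// form a majority with the given number of running servers.
func hasQuorum(running, ensembleSize int) bool {
	return running > ensembleSize/2
}

func main() {
	fmt.Println(hasQuorum(1, 3)) // false: only zk-0 left after deleting zk-1 and zk-2
	fmt.Println(hasQuorum(2, 3)) // true:  a 3-node ensemble tolerates one failure
}
```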

@stop-coding stop-coding force-pushed the fix_issue_398 branch 2 times, most recently from 72ac80d to 10102fe on October 20, 2021 06:38
realAaronWu and others added 2 commits October 20, 2021 17:10
* add disableFinalizer flag and skip appending finalizers if set to true. Update charts

Signed-off-by: Aaron <[email protected]>

* set DisableFinalizer in main

Signed-off-by: Aaron <[email protected]>

* add README

Signed-off-by: Aaron <[email protected]>

* add UTs

Signed-off-by: Aaron <[email protected]>
@stop-coding stop-coding deleted the fix_issue_398 branch October 21, 2021 02:40
Successfully merging this pull request may close these issues.

If zookeeper is not running, then decreasing replicas will make cluster of zookeeper Unrecoverable