Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zookeeper pod keeps crashing when scaling down and up #513

Open
hoyhbx opened this issue Nov 18, 2022 · 4 comments · May be fixed by #526
Open

Zookeeper pod keeps crashing when scaling down and up #513

hoyhbx opened this issue Nov 18, 2022 · 4 comments · May be fixed by #526

Comments

@hoyhbx
Copy link
Contributor

hoyhbx commented Nov 18, 2022

Description

(Describe the feature, bug, question, proposal that you are requesting)
We found that zookeeper pod keeps crashing after a scale-down workload and a scale-up workload.
We found the crash is because the new pods are reusing the undeleted PVC during the scale-down process. However, this problem still persists even when we specify the reclaimPolicy: Delete.

The root cause of this issue is because there is no guard for scale-up before finishing deleting the PVC. We found that zookeeper-operator checks if the upgrade fails before updating the statefulSet, but this check misses checking for the undeleted OrphanPVCs. When reusing the old PVC, the startup script mistakenly thinks the membership list on the old PVC is the up-to-date one and skips updating it. When ZooKeeper starts, it crashes due to inconsistent membership list.

The orphaned PVCs will never get deleted because the operator always waits for the number of ready replicas to equal to desired replicas, which will never happen since one pod keeps crashing.

Importance

(Indicate the importance of this issue to you (blocker, must-have, should-have, nice-to-have))
Importance: must-have

Location

(Where is the piece of code, package, or document affected by this issue?)
The PVC cleanup code is here:

func (r *ZookeeperClusterReconciler) cleanupOrphanPVCs(instance *zookeeperv1beta1.ZookeeperCluster) (err error) {

The logic to upgrade the statefulSet is here:

func (r *ZookeeperClusterReconciler) upgradeStatefulSet(instance *zookeeperv1beta1.ZookeeperCluster, foundSts *appsv1.StatefulSet) (err error) {

Suggestions for an improvement

And we think that the operator should check if all OrphanPVCs are deleted before proceeding to the next upgrade.
We suggest to add a check to check if there is no Orphaned PVCs before updating the statefulSet

(How do you suggest to fix or proceed with this issue?)

@hoyhbx hoyhbx linked a pull request Feb 5, 2023 that will close this issue
@hoyhbx
Copy link
Contributor Author

hoyhbx commented Apr 7, 2023

This PR (#526) can fix this issue.
An alternative (and simpler) fix is to fix the start up script in the zookeeper container, so that the membership list is updated on start. The root cause of the crashing zookeeper pods when reusing the PVC is because the membership list file on the old PVC does not contain the new node.

@hoyhbx
Copy link
Contributor Author

hoyhbx commented Apr 10, 2023

@anishakj Do you mind to take a look at this?
The root cause of this bug is because the restarted zookeeper node has inconsistent membership list.
In the zookeeper teardown script, the ID of the node is removed from the membership list from the dynamic configuration file which exist on all PVCs (

java -Dlog4j.configuration=file:"$LOG4J_CONF" -jar /opt/libs/zu.jar remove $ZKURL $MYID
).
But in the zookeeper start up script, the script does not update the dynamic configuration file (
if [[ "$REGISTER_NODE" == true ]]; then
ROLE=observer
ZKURL=$(zkConnectionString)
ZKCONFIG=$(zkConfig)
set -e
echo Registering node and writing local configuration to disk.
java -Dlog4j.configuration=file:"$LOG4J_CONF" -jar /opt/libs/zu.jar add $ZKURL $MYID $ZKCONFIG $DYNCONFIG
) if the myid and the dynamic config file already exist (this happens if the PVC is reused) (
if [[ "$ONDISK_MYID_CONFIG" == false || "$ONDISK_DYN_CONFIG" == false ]]; then
REGISTER_NODE=true
else
REGISTER_NODE=false
fi
).

This causes problem when the new zookeeper node reuses an old PVC. During bootup, if the myid file and dynamic config file exist, the startup script skips updating the membership list, assuming it is correct. But in reality, this node was removed from the membership list from the dynamic config file during teardown.
This causes the newly started node to keep crashing with the following exception:

│ 2023-04-10 19:34:44,914 [myid:2] - INFO  [main:AbstractConnector@381] - Stopped ServerConnector@5c30a9b0{HTTP/1.1, (http/1.1)}{0.0.0.0:7000}                                                            │
│ 2023-04-10 19:34:44,915 [myid:2] - INFO  [main:ContextHandler@1154] - Stopped o.e.j.s.ServletContextHandler@9d5509a{/,null,STOPPED}                                                                     │
│ 2023-04-10 19:34:44,915 [myid:2] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally                                                                                           │
│ java.lang.RuntimeException: My id 2 not in the peer list                                                                                                                                                │
│     at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1128)                                                                                                                        │
│     at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)                                                                                                         │
│     at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)                                                                                                      │
│     at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)                                                                                                                   │
│ 2023-04-10 19:34:44,916 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.                                                                                                       │
│ 2023-04-10 19:34:44,917 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1

@iampranabroy
Copy link

Hey @hoyhbx @anishakj - Any plans to merge the changes for future releases?

@anishakj
Copy link
Contributor

@hoyhbx Could you please increase the coverage?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants