
[Bug]: unable to upgrade to 3.9.0: Could not upgrade metadata.version to 21 #11039

Open

osklil opened this issue Jan 15, 2025 · 9 comments

@osklil

osklil commented Jan 15, 2025

Bug Description

I upgraded the operator from 0.44.0 to 0.45.0 and then edited the Kafka custom resource to change spec.kafka.version from 3.8.0 to 3.9.0. The pods were recreated with the new image, but the upgrade did not complete. The cluster now has this status:

  status:
    conditions:
    - lastTransitionTime: "2025-01-15T05:37:35.221452943Z"
      message: Failed to update metadata version to 3.9
      reason: MetadataUpdateFailed
      status: "True"
      type: Warning

I tried manual upgrade:

$ kubectl exec -n … --context=… kafka-cluster-kafka-0 -- /opt/kafka/bin/kafka-features.sh --bootstrap-server localhost:9092 upgrade --metadata 3.9
Could not upgrade metadata.version to 21. Invalid update version 21 for feature metadata.version. Controller 5 only supports versions 1-20
1 out of 1 operation(s) failed.
command terminated with exit code 1
$
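
For reference, which metadata.version levels the connected nodes support and which level the cluster has finalized can be listed with kafka-features.sh describe. A minimal sketch, assuming the same pod and bootstrap listener as the upgrade command above:

$ kubectl exec -n … --context=… kafka-cluster-kafka-0 -- /opt/kafka/bin/kafka-features.sh --bootstrap-server localhost:9092 describe

The error above indicates that the 3.9 metadata version maps to feature level 21, while the stale controller 5 only advertises support up to level 20.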

Steps to reproduce

No response

Expected behavior

No response

Strimzi version

0.45.0

Kubernetes version

1.27.11

Installation method

Helm

Infrastructure

Bare-metal

Configuration files and logs

No response

Additional context

No response

@scholzj
Member

scholzj commented Jan 15, 2025

You probably scaled the Kafka cluster down in the past with an older Strimzi version, and Kafka still has this node registered but invisible because of missing APIs. This is not a Strimzi bug but a Kafka KRaft limitation; it is expected to be addressed only in Kafka 4.0.

You have to work around it manually by unregistering the node using the Kafka Admin API.

@osklil
Author

osklil commented Jan 15, 2025

@scholzj Thanks, that could be it. It seems the command-line tools to list and unregister nodes aren't available in 3.9.0. Is that correct?

@scholzj
Member

scholzj commented Jan 15, 2025

There was no command-line tool for it. But I'm not sure I ever checked Kafka 3.9; maybe one was added in that version.

You could also try to scale up (add the node reported in the error message) and scale it down again. New Strimzi versions try to work around this Kafka limitation and should unregister the node. But if it was a controller, scaling is tricky, as that is another unsupported operation :-/. You can also try to add it to the .status.registeredNodeIds list in the Kafka CR with kubectl edit kafka my-cluster --subresource=status to trigger the unregistration.
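
A minimal sketch of the status-edit variant, using the cluster name from this issue and the node ID 5 reported in the error message (any other previously removed node IDs would be added the same way):

$ kubectl edit kafka kafka-cluster -n … --subresource=status

  status:
    registeredNodeIds:
    - 0
    - 1
    - 2
    - 5   # added manually so the operator attempts to unregister node 5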

@ppatierno
Member

Triaged on 23.1.2025: it seems the kafka-cluster.sh tool supports an unregister option to do that. @osklil can you try it and let us know whether it works for you?

@osklil
Author

osklil commented Jan 23, 2025

Not sure I did the right thing, but it complains that the "given broker ID was not registered" (I tried with IDs >= 3).

[kafka@kafka-cluster-kafka-0 kafka]$ /opt/kafka/bin/kafka-cluster.sh unregister --bootstrap-server localhost:9092  --id 5
[2025-01-23 20:31:45,015] ERROR [AdminClient clientId=adminclient-1] Unregister broker request for broker ID 5 failed: The given broker ID was not registered. (org.apache.kafka.clients.admin.KafkaAdminClient)
org.apache.kafka.common.errors.BrokerIdNotRegisteredException: The given broker ID was not registered.
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.BrokerIdNotRegisteredException: The given broker ID was not registered.
        at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
        at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
        at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
        at org.apache.kafka.tools.ClusterTool.unregisterCommand(ClusterTool.java:127)
        at org.apache.kafka.tools.ClusterTool.execute(ClusterTool.java:107)
        at org.apache.kafka.tools.ClusterTool.mainNoExit(ClusterTool.java:48)
        at org.apache.kafka.tools.ClusterTool.main(ClusterTool.java:43)
Caused by: org.apache.kafka.common.errors.BrokerIdNotRegisteredException: The given broker ID was not registered.

[kafka@kafka-cluster-kafka-0 kafka]$
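
For completeness, the KRaft quorum membership (voters and observers with their node IDs) can be inspected from the same pod; a sketch, not verified against this cluster:

[kafka@kafka-cluster-kafka-0 kafka]$ /opt/kafka/bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

If node 5 still shows up among the voters there, it is held in the static controller quorum rather than as a broker registration, which would be consistent with the broker-oriented unregister call failing.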

@ppatierno
Member

How did you determine that it was the right broker ID when you scaled down?

@osklil
Author

osklil commented Jan 24, 2025

0, 1, 2 are the current combined broker/controller nodes, and at some point there were 0, 1, 2 as brokers and 3, 4, 5 as controller-only nodes, IIRC. At least I assumed those were the IDs needed for kafka-cluster.sh unregister.

@ppatierno
Member

ppatierno commented Jan 24, 2025

So you scaled down controllers, which is something not really supported by KRaft right now. The quorum is static; dynamic quorum (with controllers that can be scaled down) will come with Kafka 4.x. I guess that is the reason why the unregister doesn't work: it is meant for brokers.
Did you try the last suggestion from Jakub?

You can also try to add it to the .status.registeredNodeIds list in the Kafka CR with kubectl edit kafka my-cluster --subresource=status to trigger the unregistration.

Also, can you describe the steps you took to go from brokers 0,1,2 (in one node pool) and controllers 3,4,5 (in another node pool) to combined brokers/controllers 0,1,2? I could try to replicate what you had.

@osklil
Author

osklil commented Jan 24, 2025

@ppatierno If you mean "You can also try to add it to the .status.registeredNodeIds list in the Kafka CR with kubectl edit kafka my-cluster --subresource=status to trigger the unregistration" - yes, I did try that (by adding 3, 4, 5 to .status.registeredNodeIds, which contained 0, 1, 2). It had no effect and only 0, 1, 2 remained.

It was a little while ago, but I recall doing this (see the sketch after the list):

  1. Getting to the state with two KafkaNodePools, one with brokers only and one with controllers.
  2. Updating the first KafkaNodePool to have both controller and broker roles.
  3. Deleting the second KafkaNodePool.
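
To help with replication, a rough sketch of what the step 1 layout may have looked like; the pool names, replica counts, and storage settings are assumptions, only the role split across two KafkaNodePools comes from the description above:

  apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaNodePool
  metadata:
    name: brokers              # assumed name; node IDs 0, 1, 2
    labels:
      strimzi.io/cluster: kafka-cluster
  spec:
    replicas: 3
    roles:
      - broker                 # step 2 changes this to [controller, broker]
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          deleteClaim: false
  ---
  apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaNodePool
  metadata:
    name: controllers          # assumed name; node IDs 3, 4, 5; deleted in step 3
    labels:
      strimzi.io/cluster: kafka-cluster
  spec:
    replicas: 3
    roles:
      - controller
    storage:
      type: persistent-claim
      size: 20Gi
      deleteClaim: false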
