Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce GetKeyValuesStretchCluster for Accurate Label Matching in StretchCluster #2965

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

OdedViner
Copy link
Contributor

@OdedViner OdedViner commented Jan 12, 2025

This PR addresses FBUGS-160a and introduces the GetKeyValuesStretchCluster function to resolve an issue with the contains string method. The new function ensures a separation of logic without impacting the existing GetKeyValues function. However, this PR did not resolve the issue: #2841

Copy link
Contributor

openshift-ci bot commented Jan 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: OdedViner
Once this PR has been reviewed and has the lgtm label, please assign umangachapagain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 83 to 82
// GetKeyValues returns a node label matching the topologyKey and all values for StrechCluster
// for that label across all storage nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the comment as per the working of this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

for label, labelValues := range m.Labels {
if label == corev1.LabelZoneFailureDomainStable || label == labelZoneFailureDomainWithoutBeta {
topologyKey = label
values = labelValues
break
}
}
Copy link
Member

@Nikhil-Ladha Nikhil-Ladha Jan 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means for stretch cluster we can't have a different failure domain? Like, rack, host?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update the GetKeysValues function to do an equality check with valid labels that we define in the validTopologyLabelKeys and return the key, values accordingly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malayparida2000 @iamniting what do you guys think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @Nikhil-Ladha

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stretch always uses zones, just to confirm in this thread in addition to comments from others that confirmed the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@malayparida2000 malayparida2000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OdedViner Can you add some details on what's the problem with the existing code? And what's the exact issue. It's difficult to understand the motive for the change without knowing the root cause you are trying to fix.

"strings"
)

const labelZoneFailureDomainWithoutBeta = "failure-domain.kubernetes.io/zone"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't this already defined somewhere in the codebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable is defined in another package within the project, but it is a private variable (starts with a lowercase letter). Since it cannot be accessed outside its package.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw our docs & went as far back as OCS 4.7. And I can confirm 2 things.

  1. Stretch cluster always uses failure domain zone
  2. For zone failure domain always topology.kubernetes.io/zone label is used. I don't see any need to search for any other labels such as failure-domain.kubernetes.io/zone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with Malay on all these points. failure-domain.kubernetes.io/zone is an old deprecated label for topology that was removed from K8s many releases ago, we should not use it anymore.

@OdedViner
Copy link
Contributor Author

@OdedViner Can you add some details on what's the problem with the existing code? And what's the exact issue. It's difficult to understand the motive for the change without knowing the root cause you are trying to fix.

Root Cause:
When incorrect labels are matched, it updates the NodeTopologies in the StorageCluster spec with incorrect zone information. This directly impacts the stretch cluster settings in the CephCluster spec by adding invalid zone information, causing the Rook operator to fail.

Impact:
1.The storage cluster misidentifies node topology information, leading to incorrect zone configuration.
2.In the stretch cluster setup, this results in the wrong zone labels being applied, disrupting the intended fault domain configuration.
3.Ultimately, the Rook operator fails to reconcile the CephCluster, breaking the deployment process.

Proposed Solution: To address this issue, I introduced a new function, GetKeyValuesStrechCluster, which specifically checks for relevant and valid topology keys (corev1.LabelZoneFailureDomainStable and labelZoneFailureDomainWithoutBeta). This ensures only the intended labels are considered, preventing incorrect matches and maintaining the integrity of the stretch cluster configuration.
The function isolates the logic required for stretch cluster scenarios, leaving the existing GetKeyValues behavior unchanged for other use cases. This makes the fix targeted and minimizes the risk of regressions elsewhere.

Why This Fix Is Necessary:
The root cause of the issue lies in the overly broad match logic (strings.Contains), which needs to be refined for stretch cluster use cases.
Without this fix, incorrect labels (e.g., topology.kubernetes.io/zone-principal) will continue to disrupt the stretch cluster settings.
The new function ensures correctness by limiting the keys checked to those explicitly relevant for stretch cluster configurations.
Let me know if more clarification or details are needed!

@malayparida2000 @Nikhil-Ladha @iamniting @sp98
Is the stretch cluster failure domain always set to "zone"?

@sp98
Copy link
Contributor

sp98 commented Jan 13, 2025

Yes. It's always zone.

@malayparida2000
Copy link
Contributor

malayparida2000 commented Jan 13, 2025

In the whole of ODF we have only 3 supported kind of failure domains rack, hostname, and zone.
rack- topology.rook.io/rack
hostname/host- kubernetes.io/hostname
zone- topology.kubernetes.io/zone
So I think we can just refactor the code base & stop caring about any other labels in here. This would help us eliminate the contains logic. And we can just directly match the label key with these 3 values of labels.

@OdedViner OdedViner force-pushed the strech_cluster_zones branch 3 times, most recently from 99f9f7d to 0796ed4 Compare January 14, 2025 17:14
@malayparida2000
Copy link
Contributor

@OdedViner This fix selectively addresses the case of stretch cluster only, while the mis match of labels can still happen to a normal cluster if the customer uses an improperly formatted label. I would suggest fixing the issue properly for all cases.
Ref-#2965 (comment)
We should refactor the parts like how topologymap is built, failure domain is determined, keyvalues are fetched so that we finally fix this issue as a whole.

@OdedViner OdedViner force-pushed the strech_cluster_zones branch from 0796ed4 to 8f85424 Compare January 15, 2025 15:09
@OdedViner OdedViner changed the title Adjust Node Label Conditions Based on Full Label Name on strechcluster Introduce GetKeyValuesStretchCluster for Accurate Label Matching in StretchCluster Jan 16, 2025
@OdedViner OdedViner force-pushed the strech_cluster_zones branch 8 times, most recently from 03d786c to 4d9e1bc Compare January 16, 2025 11:42
@OdedViner OdedViner force-pushed the strech_cluster_zones branch from 4d9e1bc to cb85115 Compare January 16, 2025 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants