-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding system metrics support #97
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: Edwinhr716 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @Edwinhr716. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/ok-to-test |
…es, and functionality of the latter
is this ready for review? asking because it is still marked as WIP |
not yet, currently working on adding a metric for latency for replicas to be scheduled and become ready, and also plan on adding unit tests |
Done with most of it, PTAL. Only missing unit test for |
}, []string{"leadername"}, | ||
) | ||
|
||
replicaReadyStatusDuration = prometheus.NewHistogramVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here about what is the metrics record, like "the latency for a pod group to be ready after creation"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but overall, I start to think about whether it's necessary to have this metrics for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the help attribute of the object describes what it records, but I can add a comment for it as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need such a metric. The time it takes for the pod to become ready is not a characteristic of the lws controller.
|
||
func ReplicaReadyStatus(readyPods []corev1.Pod, startTime metav1.Time) { | ||
for _, pod := range readyPods { | ||
readyTime := podutils.PodReadyConditionLastTransitionTime(pod).Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the leader pod is ready, its worker statefulset may not be ready right? the latency here calculates the leader pods' latency to be ready
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we may need to pass the worker sts and leader pod as a pair and make sure that latency=worker_sts_ready-leader_pod_created, wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The LeaderPod is only added into the list that is passed if the worker sts is ready, that is checked here
I guess I'm a bit confused with what the replica latency measures, I thought it was from when the lws object was created/edited to when the replica is ready. So is it from when the leader statefulset is created to when the worker sts is ready?
@@ -569,3 +573,12 @@ func templateUpdated(sts *appsv1.StatefulSet, lws *leaderworkerset.LeaderWorkerS | |||
func replicasUpdated(sts *appsv1.StatefulSet, lws *leaderworkerset.LeaderWorkerSet) bool { | |||
return *sts.Spec.Replicas != *lws.Spec.Replicas | |||
} | |||
|
|||
func getLastTransitionTime(conditionType string, lws *leaderworkerset.LeaderWorkerSet) metav1.Time { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move this to a util file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was thinking about that as well, can create an lws utils file similar to the one that exists for pods, and move this function, templateUpdated
and replicasUpdated
to it in a different PR.
@@ -389,6 +390,9 @@ func (r *LeaderWorkerSetReconciler) updateConditions(ctx context.Context, lws *l | |||
updateCondition := setCondition(lws, condition) | |||
// if condition changed, record events | |||
if updateCondition { | |||
if updatedCount == int(*lws.Spec.Replicas) { | |||
metrics.ReplicaReadyStatus(readyLeaderPods, getLastTransitionTime(string(leaderworkerset.LeaderWorkerSetProgressing), lws)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the start time here is when the lws just progress, there might be a possibility that the leader pod is not scheduled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asked a similar question in other comment: what is the replica latency actually measuring? I thought it was from lws creation to when the replica is ready
}, []string{"hash"}, | ||
) | ||
|
||
recreateGroupTimes = prometheus.NewCounterVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not called anywhere else right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, adding it to register it, but waiting on #106 to implement it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we really need to add our own metrics at this stage, I would focus on making sure that we are getting the metrics that controller-runtime publishes by default; see https://book.kubebuilder.io/reference/metrics-reference
Those metrics cover a lot of what we need already for monitoring the health of the operator, like how long a reconcile call takes, and I think the first PR should focus on that only and it should be included in the 0.2.0 release.
}, []string{"leadername"}, | ||
) | ||
|
||
replicaReadyStatusDuration = prometheus.NewHistogramVec( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need such a metric. The time it takes for the pod to become ready is not a characteristic of the lws controller.
+1 to add these metrics later since not part of the 0.2.0. |
/hold |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
close for now, @Edwinhr716 |
What type of PR is this?
/kind feature
What this PR does / why we need it
Which issue(s) this PR fixes
Fixes #87
Special notes for your reviewer
Does this PR introduce a user-facing change?