Add heartbeat monitoring for controller #1192


@yujiezhu0 yujiezhu0 commented Jan 31, 2025

Why are these changes needed?

  • Implemented health checking mechanisms for the controller, including heartbeat monitoring, readiness probe, liveness probe, and health probe.
  • These changes are essential for alerts and monitoring.
  • Linear: EGDA-814.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@yujiezhu0 yujiezhu0 marked this pull request as ready for review January 31, 2025 13:38
disperser/cmd/controller/main.go (three resolved review threads, two of them outdated)
@yujiezhu0 yujiezhu0 requested a review from dmanc February 4, 2025 18:32
controllerHealthProbePath string = "/tmp/controller-health"
controllerMaxStallDuration time.Duration = 240 * time.Second
controllerLivenessChan = make(chan time.Time, 1)
controllerHeartbeatChan = make(chan time.Time, 1)

Not sure this is needed.

}

// Signal heartbeat
signalHeartbeat(logger)
@dmanc dmanc Feb 6, 2025


We should add heartbeat signals in the main processing loops of both components:

  1. Encoding manager: HandleBatch()
  2. Dispatcher: HandleBatch()

Right now we're only signaling liveness at startup. What we want is a heartbeat from each control loop, so that our alerting system fires if either of the two loops stalls and tells us which one failed.

2 participants