Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-666] - Migrate daemon monitoring to use health checks #783

Merged
merged 15 commits into from
Dec 8, 2023

Conversation

clemire
Copy link
Contributor

@clemire clemire commented Nov 9, 2023

Changelist

Refactor the existing update monitor to be a health check monitor, and to re-issue panics after a sustained period of daemon unhealthiness meets the maximum allowable threshold.

Test Plan

  • added unit tests to validate checker behavior
  • ran localnet and after 6 minutes, got a panic because the bridge daemon does not work locally.
  • ran localnet with code modified to always report success for bridge daemon, and report consistently unhealthy for liquidations, and for pricefeed - each daemon caused a panic after the grace period + max acceptable delay.

Author/Reviewer Checklist

  • If this PR has changes that result in a different app state given the same prior state and transaction list, manually add the state-breaking label.
  • If this PR isn't state-breaking but has changes that modify behavior in PrepareProposal or ProcessProposal, manually add the label proposal-breaking.
  • If this PR is one of many that implement a specific feature, manually label them all feature:[feature-name].
  • If you wish to for mergify-bot to automatically create a PR to backport your change to a release branch, manually add the label backport/[branch-name].
  • Manually add any of the following labels: refactor, chore, bug.

Copy link
Contributor

coderabbitai bot commented Nov 9, 2023

Walkthrough

The changes introduce a comprehensive health monitoring system for daemons, including new structures, methods, and tests for health checks. Old monitoring functions are removed from the daemon server, and new metrics capture is implemented. Client and integration tests are updated to incorporate the health monitor. The changes also include new flags for daemon failure behavior and unhealthy duration, with corresponding updates in tests and CLI test suites.

Changes

Files Summary
protocol/app/.../app.go, protocol/app/.../app_test.go,
protocol/daemons/server/types/health_monitor.go,
protocol/daemons/server/types/health_monitor_test.go,
protocol/daemons/server/types/health_checker.go,
protocol/daemons/server/types/health_checker_test.go,
protocol/daemons/types/health_checkable.go,
protocol/mocks/HealthCheckable.go,
protocol/mocks/Makefile,
protocol/testutil/app/app.go
Introduced a new health monitoring system for daemons with health check structures, methods, and tests. Added methods for monitoring daemon health and disabling health monitoring for testing. Created new mocks and updated test utilities.
protocol/daemons/server/bridge.go, protocol/daemons/server/liquidation.go,
protocol/daemons/server/pricefeed.go, protocol/daemons/server/server.go,
protocol/daemons/server/server_test.go
Removed old update monitoring functions from daemon server. Modified methods to return response and err variables and capture valid responses in metrics. Renamed reportResponse to reportValidResponse in Server struct.
protocol/daemons/metrics/client/client.go,
protocol/daemons/pricefeed/client/client_integration_test.go
Metrics daemon does not implement health check due to simplicity. Updated pricefeed daemon integration test suite to include health monitor and register the daemon.
protocol/daemons/flags/flags.go, protocol/daemons/flags/flags_test.go,
protocol/x/clob/client/cli/cancel_order_cli_test.go,
protocol/x/clob/client/cli/liquidations_cli_test.go,
protocol/x/clob/client/cli/place_order_cli_test.go
Added new flags for daemon failure behavior and maximum unhealthy duration. Updated tests and CLI test suites to handle new flags.
protocol/app/flags/flags.go Removed import statement for "fmt" package, indicating a possible change in logging or error handling.

Please note that the changes are grouped based on the relatedness of changes across multiple files, and each row provides a high-level summary of the changes made to the files included in that set.


Tips

Chat with CodeRabbit Bot (@coderabbitai)

  • If you reply to a review comment from CodeRabbit, the bot will automatically respond.
  • To engage with CodeRabbit bot directly around the specific lines of code in the PR, mention @coderabbitai in your review comment
  • Note: Review comments are made on code diffs or files, not on the PR overview.
  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger a review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai help to get help.

Note: For conversation with the bot, please use the review comments on code diffs or files.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • The JSON schema for the configuration file is available here.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/coderabbit-overrides.v2.json

@clemire clemire changed the base branch from main to crystal/CORE-666-implement-bridge November 9, 2023 22:10
@clemire clemire force-pushed the crystal/CORE-666-monitor-update branch from 7715685 to 9a48642 Compare November 9, 2023 22:12
Base automatically changed from crystal/CORE-666-implement-bridge to main November 9, 2023 23:03
@@ -290,6 +290,9 @@ type App struct {

PriceFeedClient *pricefeedclient.Client
LiquidationsClient *liquidationclient.Client
BridgeClient *bridgeclient.Client

HealthMonitor *daemonservertypes.HealthMonitor
Copy link
Contributor Author

@clemire clemire Nov 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Health Monitor does not need access to daemon server client request contents anymore, and has been moved to live on the app.

@clemire clemire force-pushed the crystal/CORE-666-monitor-update branch from 5a6a503 to 25e93be Compare November 10, 2023 01:47
@clemire clemire force-pushed the crystal/CORE-666-monitor-update branch 2 times, most recently from def8cbf to 42bc296 Compare November 21, 2023 23:11
@clemire clemire marked this pull request as ready for review November 21, 2023 23:11
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 68119d8 and 42bc296.
Files selected for processing (14)
  • protocol/app/app.go (10 hunks)
  • protocol/app/app_test.go (2 hunks)
  • protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
  • protocol/daemons/server/bridge.go (2 hunks)
  • protocol/daemons/server/liquidation.go (3 hunks)
  • protocol/daemons/server/pricefeed.go (3 hunks)
  • protocol/daemons/server/server.go (3 hunks)
  • protocol/daemons/server/server_test.go (3 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
  • protocol/daemons/types/health_checkable.go (3 hunks)
  • protocol/mocks/HealthCheckable.go (1 hunks)
  • protocol/mocks/Makefile (1 hunks)
  • protocol/testutil/app/app.go (1 hunks)
Files skipped from review due to trivial changes (2)
  • protocol/daemons/server/server_test.go
  • protocol/mocks/Makefile
Additional comments: 24
protocol/daemons/server/pricefeed.go (1)
  • 43-48: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [33-47]

The removal of the ExpectPricefeedDaemon method and the response reporting check in the UpdateMarketPrices method is consistent with the shift to the new health check-based monitoring system. However, ensure that all references to these removed functionalities are also updated or removed throughout the codebase to prevent any broken dependencies or calls.

protocol/mocks/HealthCheckable.go (1)
  • 1-63: The mock implementation for HealthCheckable is correctly following the pattern for mocks using the testify library. It includes methods for HealthCheck, ReportFailure, ReportSuccess, and ServiceName, which are essential for testing the health check functionality. The NewHealthCheckable function is also correctly setting up a new mock with a cleanup function to assert expectations, which is a good practice for ensuring that mock behavior is verified after tests run. This should help in writing effective unit tests for the health monitoring system.
protocol/daemons/server/liquidation.go (2)
  • 2-7: The import of the telemetry package might no longer be necessary if telemetry functionality has been removed from the LiquidationServer. Verify if this import can be removed to clean up the code.

  • 36-41: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [34-41]

The use of telemetry metrics (telemetry.ModuleSetGauge) should be reassessed. Since the pull request indicates that telemetry and metrics-related code has been removed from the Server struct, this code might be obsolete. If telemetry is no longer used, this block should be removed to avoid dead code and potential confusion.

protocol/testutil/app/app.go (1)
  • 490-494: The change from DisableUpdateMonitoringForTesting to DisableHealthMonitorForTesting aligns with the new health monitoring system introduced in the pull request. This update is necessary to ensure that the health monitoring behavior is correctly disabled during testing, which is important for test stability and performance. The change is consistent with the overall shift from the old update monitoring system to the new health check-based approach.
protocol/daemons/pricefeed/client/client_integration_test.go (4)
  • 5-11: The import of the app package is correctly added to support the new HealthMonitor functionality. However, ensure that all other imports are still necessary after the refactoring and that there are no unused imports left.

  • 220-223: The addition of the healthMonitor field to the PriceDaemonIntegrationTestSuite struct is consistent with the summary provided. This field is necessary for the integration of the health monitoring system into the test suite.

  • 280-292: The initialization of the healthMonitor field within the SetupTest function is correctly implemented. The NewHealthMonitor function is called with appropriate parameters, and the healthMonitor is then used to register services. This aligns with the new health check-based monitoring system.

  • 337-342: The registration of the pricefeedDaemon service with the healthMonitor in the startClient function is correctly implemented. This ensures that the health of the pricefeedDaemon service will be monitored as part of the new health check system.

protocol/daemons/types/health_checkable.go (5)
  • 22-26: The addition of the ServiceName method to the HealthCheckable interface is a good practice as it provides a standardized way to reference services during health checks. This change will require all implementers of the HealthCheckable interface to provide a unique service name, which can be useful for logging and error reporting.

  • 67-73: The serviceName field has been added to the timeBoundedHealthCheckable struct to store the name of the service being monitored. This is a good practice as it associates a human-readable identifier with the health checkable service, which can be used for logging and monitoring purposes.

  • 81-88: The NewTimeBoundedHealthCheckable function initializes the timeBoundedHealthCheckable to an unhealthy state by reporting an error during its creation. This is a good practice as it ensures that the health checkable service starts in a known state and requires an explicit successful update to be considered healthy.

  • 91-94: The implementation of the ServiceName method for the timeBoundedHealthCheckable struct is straightforward and follows the interface contract by returning the stored service name. This is a simple and effective way to fulfill the requirements of the HealthCheckable interface.

  • 96-98: The ReportSuccess method is correctly implemented to be thread-safe by using a mutex lock. This is important to prevent data races when multiple goroutines may be updating the health check status concurrently.

protocol/app/app_test.go (2)
  • 3-9: The import of the mocks package is necessary for the new test function TestMonitorDaemon_Panics that uses a mock implementation of the HealthCheckable interface. However, the import of gopkg.in/typ.v4/slices does not seem to be used in the provided code. If it's not used elsewhere in the file, it should be removed to keep the code clean and avoid unnecessary dependencies.

  • 228-237: The new test TestMonitorDaemon_Panics is designed to ensure that the MonitorDaemon method panics when a service is registered twice. This is a good test for the health monitoring system's robustness. However, it's important to ensure that the panic is caused by the expected condition (double registration) and not by some other issue. It would be beneficial to assert the panic message to confirm the cause of the panic.

protocol/app/app.go (8)
  • 298-298: The addition of the HealthMonitor field to the App struct is a critical change for the new health monitoring system. It's important to ensure that the HealthMonitor is properly initialized and that all necessary daemons are registered with it. This change aligns with the pull request's goal of integrating health checks into the daemon services.

  • 595-602: The instantiation of the HealthMonitor within the App struct is a key part of the new health monitoring system. It's crucial to verify that the constants DaemonStartupGracePeriod and HealthCheckPollFrequency are set to sensible defaults and that they can be configured if needed. Additionally, ensure that the Logger is correctly passed to the HealthMonitor for logging purposes.

  • 611-616: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [612-627]

The code for starting the liquidations client and registering it with the health monitor is crucial. It's important to ensure that the MaximumDaemonUnhealthyDuration is being used correctly and that the daemon is properly monitored. If the daemon fails to start or becomes unhealthy, the system should respond appropriately, which may include logging, alerting, or taking corrective action.

  • 623-635: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [631-648]

The price feed client is being started and monitored similarly to the liquidations client. It's important to verify that the StartNewClient function is correctly implemented and that the MaximumDaemonUnhealthyDuration is being used appropriately. The health monitoring system should be robust enough to handle any issues with the price feed client.

  • 645-660: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [652-670]

The bridge client is being started and monitored. As with the other daemons, it's important to verify that the MaximumDaemonUnhealthyDuration is being used correctly and that the health monitoring system can handle any issues with the bridge client. The commented TODO on line 654 suggests that there may be pending work related to the bridge daemon, which should be tracked and resolved.

  • 688-688: The comment on line 688 indicates a TODO related to the metrics daemon and its health tracking. It's important to follow up on this TODO to ensure that the metrics daemon's health is properly monitored and that any necessary changes are made in a timely manner.

  • 1233-1251: The MonitorDaemon method is a new addition that registers a daemon service with the health monitor. It's critical to ensure that this method is called for all daemons that need to be monitored and that the error handling is appropriate. The use of panic on line 1249 should be carefully considered, as it will cause the application to crash if the daemon cannot be registered. It may be more appropriate to handle the error in a way that does not bring down the entire application.

  • 1253-1256: The DisableHealthMonitorForTesting method is a useful addition for testing purposes. It's important to ensure that this method is only used in test environments and that it does not accidentally get called in production, as it would disable an important monitoring feature.

protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 42bc296 and 7ddf9c7.
Files selected for processing (1)
  • protocol/app/app.go (10 hunks)
Additional comments: 5
protocol/app/app.go (5)
  • 5-17: The import block has been modified to remove unused imports. This is a good practice for maintainability and to avoid confusion about dependencies. However, ensure that these imports are indeed not used anywhere in the codebase to prevent breaking changes.

  • 298-298: The HealthMonitor field has been added to the App struct, which is a significant change. This aligns with the previous comment about moving health monitoring to live on the app. Ensure that the HealthMonitor is properly initialized and integrated with the rest of the application's lifecycle.

  • 595-603: The HealthMonitor is being initialized with default values for the grace period and poll frequency. Ensure that these values are appropriate for the application's needs and consider making them configurable for flexibility.

  • 645-660: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [611-670]

The MonitorDaemon method is being used to register daemon services with the health monitor. It is crucial to ensure that all daemon services are registered correctly and that the MaximumDaemonUnhealthyDuration is set appropriately. Additionally, the error handling here is to panic, which is a drastic measure. Consider if there are more graceful ways to handle initialization failures that won't necessarily bring down the entire application.

  • 1227-1253: The MonitorDaemon and DisableHealthMonitorForTesting methods have been added to the App struct. These methods are part of the new health monitoring system and should be reviewed to ensure they are correctly implemented and used throughout the application. The DisableHealthMonitorForTesting method, in particular, should be used with caution to avoid disabling health checks in production environments.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 7ddf9c7 and a26109a.
Files selected for processing (4)
  • protocol/app/app.go (9 hunks)
  • protocol/daemons/bridge/client/client.go (1 hunks)
  • protocol/daemons/liquidation/client/client.go (1 hunks)
  • protocol/daemons/pricefeed/client/client.go (1 hunks)
Files skipped from review due to trivial changes (2)
  • protocol/daemons/bridge/client/client.go
  • protocol/daemons/liquidation/client/client.go
Additional comments: 11
protocol/daemons/pricefeed/client/client.go (1)
  • 81-84: The addition of the ServiceName method to the Client struct is consistent with the implementation of the HealthCheckable interface, which requires a ServiceName method. This change will allow instances of Client to be used in the context of health monitoring where a service name identifier is required.
protocol/app/app.go (10)
  • 5-14: The import block includes several standard library packages. Ensure that all imported packages are used within the file to avoid unnecessary imports that can lead to increased compilation time and binary size.

  • 298-298: The HealthMonitor field has been added to the App struct as expected, following the removal of the old monitoring mechanisms. This change aligns with the pull request's intent to introduce health checks into the daemon services.

  • 595-602: The HealthMonitor is instantiated with default values for the startup grace period and health check poll frequency. Ensure that these values are appropriate for the application's needs and consider making them configurable if necessary.

  • 611-619: The MonitorDaemon method is used to register the LiquidationsClient with the health monitor. It is important to ensure that the MaximumDaemonUnhealthyDuration is an appropriate threshold for the liquidations daemon's health checks.

  • 629-636: The PriceFeedClient is registered with the health monitor using the MonitorDaemon method. Verify that the MaximumDaemonUnhealthyDuration is suitable for the price feed daemon's health checks and that the daemon's operation is critical to the application's functionality.

  • 653-661: The BridgeClient is registered with the health monitor using the MonitorDaemon method. As with other daemons, ensure that the MaximumDaemonUnhealthyDuration is an appropriate threshold for the bridge daemon's health checks.

  • 684-689: The metricsclient.Start function is invoked without being monitored by the health monitor. This is intentional as the metrics daemon is used for observability and should not bring the application down. Ensure that this aligns with the application's reliability requirements.

  • 1230-1246: The MonitorDaemon method is used to register daemon services with the health monitor. It is critical to ensure that all daemon services are registered correctly and that the application's startup sequence allows for the daemons to be initialized before the health checks begin.

  • 1250-1252: The DisableHealthMonitorForTesting method provides a way to disable the health monitor during testing. This is a useful feature for testing scenarios where the health checks may interfere with the test conditions or outcomes.

  • 1227-1228: The New function correctly returns an instance of the App struct. This is the expected behavior for the constructor function of the application.

protocol/app/app.go Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between a26109a and 0c91c09.
Files selected for processing (1)
  • protocol/app/app.go (9 hunks)
Additional comments: 8
protocol/app/app.go (8)
  • 298-298: The HealthMonitor field has been added to the App struct, which is responsible for monitoring the health of daemon services. Ensure that the HealthMonitor is properly initialized and used throughout the application.

  • 595-602: The HealthMonitor is being initialized with default values for the startup grace period and health check poll frequency. It's important to ensure that these values are appropriate for the application's requirements and that they can be configured if necessary.

  • 612-617: The MonitorDaemon method is being used to monitor the health of the LiquidationsClient. Ensure that the MaximumDaemonUnhealthyDuration is appropriate for this daemon and that the monitoring is correctly implemented.

  • 630-635: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [630-648]

The PriceFeedClient is being monitored using the MonitorDaemon method. Verify that the MaximumDaemonUnhealthyDuration is suitable for the price feed daemon's requirements and that the monitoring logic is correctly implemented.

  • 651-658: The BridgeClient is being monitored using the MonitorDaemon method. As with the other daemons, ensure that the MaximumDaemonUnhealthyDuration is appropriate and that the monitoring is correctly implemented.

  • 684-689: The metricsclient.Start method is being called without monitoring. Since the metrics daemon is used for observability and should not bring the app down, ensure that this behavior is intentional and that the lack of monitoring is acceptable.

  • 1230-1246: The MonitorDaemon method registers a daemon service with the health monitor and panics if registration fails. Ensure that this panic behavior is acceptable and that the error handling is consistent with the application's requirements.

  • 1250-1252: The DisableHealthMonitorForTesting method has been added to disable the health monitor during testing. Verify that this method is only used in test environments and that it does not affect production code paths.

server.gsrv.Stop()
}

// DisableUpdateMonitoringForTesting disables the update monitor for testing purposes. This is needed in integration
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed health monitor from daemon server and attached it to the app as a top level field, so this pass-through interface is no longer necessary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This constant is not needed anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file renamed to health_monitor.go

@@ -22,6 +22,8 @@ type HealthCheckable interface {
ReportFailure(err error)
// ReportSuccess records a successful update.
ReportSuccess()
// ServiceName returns the name of the service being monitored. This name is expected to be unique.
ServiceName() string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this field to easily key health checkables by service name.

@clemire clemire force-pushed the crystal/CORE-666-monitor-update branch from aa7e502 to db8d8d0 Compare November 23, 2023 01:39
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 8

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 2c64cf6 and db8d8d0.
Files selected for processing (14)
  • protocol/app/app.go (9 hunks)
  • protocol/app/app_test.go (2 hunks)
  • protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
  • protocol/daemons/server/bridge.go (2 hunks)
  • protocol/daemons/server/liquidation.go (3 hunks)
  • protocol/daemons/server/pricefeed.go (3 hunks)
  • protocol/daemons/server/server.go (3 hunks)
  • protocol/daemons/server/server_test.go (3 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
  • protocol/daemons/types/health_checkable.go (3 hunks)
  • protocol/mocks/HealthCheckable.go (1 hunks)
  • protocol/mocks/Makefile (1 hunks)
  • protocol/testutil/app/app.go (1 hunks)
Files skipped from review due to trivial changes (3)
  • protocol/daemons/server/liquidation.go
  • protocol/daemons/server/server_test.go
  • protocol/mocks/Makefile
Additional comments: 24
protocol/daemons/server/bridge.go (2)
  • 3-7:
    The removal of types and time imports suggests that they are no longer used in this file. Verify that no other parts of the file require these imports to function correctly.

  • 20-25:
    The removal of ExpectBridgeDaemon function should be verified to ensure that its responsibilities are now correctly handled by the new health check system.

protocol/testutil/app/app.go (1)
  • 490-494: The change from DisableUpdateMonitoringForTesting to DisableHealthMonitorForTesting aligns with the pull request's goal of transitioning from update monitoring to health checks. However, ensure that the DisableHealthMonitorForTesting method has been implemented correctly in the App struct and that it effectively disables the health monitor during tests.
protocol/mocks/HealthCheckable.go (1)
  • 1-63:
    The mock implementation for the HealthCheckable interface appears to be correctly generated and includes all necessary methods and a constructor with a cleanup function for testing.
protocol/app/app_test.go (3)
  • 3-9: The import of the time package is added, which is necessary for the new test TestMonitorDaemon_Panics that uses time.Minute. This change is correct and relevant to the new functionality being tested.

  • 4-4: The import of the mocks package is added, which is necessary for the new test TestMonitorDaemon_Panics that uses mocks.HealthCheckable. This change is correct and relevant to the new functionality being tested.

  • 228-237: The new test TestMonitorDaemon_Panics is added to ensure that the MonitorDaemon function panics when a health checkable service is registered twice. This is a good test to ensure that the health check monitor behaves as expected in the case of duplicate registrations.

protocol/daemons/server/server.go (3)
  • 3-12:
    The removal of imports should be verified to ensure that they are not used elsewhere in the file or that their removal does not affect other functionalities.

  • 21-26: > Note: This review was outside of the patch, so it was mapped to the patch with the greatest overlap. Original lines [14-26]

The removal of the update monitoring feature from the Server struct is consistent with the changes described in the summary.

  • 46-49:
    The Stop method has been simplified to only stop the gRPC server, which aligns with the removal of the update monitoring feature.
protocol/daemons/types/health_checkable.go (4)
  • 26-26:
    The addition of ServiceName() method to the HealthCheckable interface is a good practice for identifying services.

  • 72-72:
    The serviceName field is correctly added as a read-only field to store the service name.

  • 81-88:
    The NewTimeBoundedHealthCheckable function correctly initializes the serviceName field and sets the initial state to unhealthy.

  • 91-93:
    The ServiceName method is correctly implemented to return the service name.

protocol/daemons/pricefeed/client/client_integration_test.go (4)
  • 5-11:
    Ensure that the new import "github.com/dydxprotocol/v4-chain/protocol/app" is used within the file, otherwise it should be removed to avoid unnecessary imports.

  • 220-223:
    The declaration of healthMonitor as a global variable within the test suite is appropriate given the context of the change summary. It is used to register services for health checks.

  • 280-292:
    The initialization of healthMonitor and its integration with daemonServer is consistent with the pull request's goal of refactoring the update monitor to use health checks.

  • 337-342:
    The registration of pricefeedDaemon with healthMonitor is correctly implemented with error handling. However, the use of app.MaximumDaemonUnhealthyDuration suggests a dependency on a global variable which should be verified for correctness.

protocol/app/app.go (6)
  • 298-298: The HealthMonitor field has been added to the App struct, which is in line with the summary of changes. Ensure that the HealthMonitor is properly initialized and used throughout the application.

  • 611-617: The MonitorDaemon method is used to register the LiquidationsClient with the health monitor. Ensure that the MaximumDaemonUnhealthyDuration is properly passed and handled in the MonitorDaemon method.

  • 645-648: The MonitorDaemon method is used to register the PriceFeedClient with the health monitor. Ensure that the MaximumDaemonUnhealthyDuration is properly passed and handled in the MonitorDaemon method.

  • 658-658: The MonitorDaemon method is used to register the BridgeClient with the health monitor. Ensure that the MaximumDaemonUnhealthyDuration is properly passed and handled in the MonitorDaemon method.

  • 1230-1248: The MonitorDaemon method registers a daemon service with the health monitor and panics if registration fails. Ensure that this panic is an acceptable behavior and that the health monitor is correctly initialized before this method is called.

  • 1250-1253: The DisableHealthMonitorForTesting method has been added to disable the health monitor during testing. Verify that this method is only used in test environments and does not affect production code.

protocol/daemons/server/pricefeed.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/app/app.go Outdated Show resolved Hide resolved
protocol/app/app.go Outdated Show resolved Hide resolved
protocol/app/app.go Outdated Show resolved Hide resolved
protocol/app/app.go Outdated Show resolved Hide resolved
@@ -47,12 +36,6 @@ func (s *Server) LiquidateSubaccounts(
metrics.Received,
metrics.Count,
)
// If the daemon is unable to report a response, there is either an error in the registration of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own understanding, why are we removing this error logging?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error here was emitted by the daemon monitor (which lived on the daemon server) if it failed to register the response. We don't report daemon responses to the monitor anymore; instead, we query the daemon for it's health directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be useful to still log these errors for debugging purposes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to make sense to log this as we never expect the report response call to fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"...we never expect the response call to fail" - yes, since we've removed health monitoring code from this method, there's nothing here but telemetry and this call will never generate an error.

protocol/daemons/server/server.go Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between db8d8d0 and cbde270.
Files selected for processing (7)
  • protocol/app/app.go (9 hunks)
  • protocol/daemons/server/bridge.go (3 hunks)
  • protocol/daemons/server/liquidation.go (3 hunks)
  • protocol/daemons/server/pricefeed.go (6 hunks)
  • protocol/daemons/server/server.go (4 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
Files skipped from review due to trivial changes (1)
  • protocol/daemons/server/types/health_monitor_test.go
Additional comments: 17
protocol/app/app.go (3)
  • 1230-1248: The MonitorDaemon method correctly registers a daemon service with the health monitor and panics if registration fails, ensuring that the application does not run with unmonitored critical services.

  • 1250-1253: The DisableHealthMonitorForTesting method is a good addition for testing purposes, allowing tests to run without the health monitor interfering.

  • 645-661: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [611-661]

The startDaemons function correctly starts daemon clients and registers them with the health monitor. However, there is a TODO comment regarding the bridge daemon that should be addressed in the future to ensure it is properly integrated.

protocol/daemons/server/bridge.go (3)
  • 2-8: The removal of imports suggests that the associated functionality is no longer needed. Verify that no other parts of the codebase require these imports to function correctly.

  • 21-22: The WithBridgeEventManager method correctly sets the bridgeEventManager field and returns the Server instance for method chaining.

  • 33-36: The AddBridgeEvents method now unconditionally calls s.reportResponse(types.BridgeDaemonServiceName). Ensure that this change aligns with the new health check monitoring system and that the absence of error handling here is intentional and safe.

protocol/daemons/server/liquidation.go (2)
  • 2-10: Verify that the "time" package is not used elsewhere in the file since it has been removed from the imports.

  • 37-42: The call to s.reportResponse has been removed, which is consistent with the transition to the new health check monitoring system. Ensure that the new health monitoring system is correctly capturing the necessary information that was previously handled by reportResponse.

protocol/daemons/server/pricefeed.go (6)
  • 6-6: The import alias types has been removed, ensure that there are no remaining references to types that would cause compilation errors.

  • 39-39: The call to s.reportResponse has been removed. Confirm that this is intentional and that the removal aligns with the new health check monitoring system.

  • 31-42: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [41-46]

The telemetry code for measuring latency has been removed. Ensure that this is consistent with the new health check monitoring system and that no alternative metrics are required.

  • 48-54: The panic now uses the daemontypes.ErrServerNotInitializedCorrectly error constant. Confirm that this change is consistent with the new error handling strategy.

  • 69-71: The error constant daemontypes.ErrPriceFeedMarketPriceUpdateEmpty is used for validation. Confirm that all error constants have been updated to use the new daemontypes package.

  • 97-111: The error constants daemontypes.ErrPriceFeedInvalidPrice and daemontypes.ErrPriceFeedLastUpdateTimeNotSet are used for validation. Confirm that these constants are correctly defined in the daemontypes package and that their usage here is appropriate.

protocol/daemons/server/server.go (3)
  • 8-14: The removal of certain imports suggests that the associated functionality has been deprecated or refactored. Ensure that no other parts of the codebase rely on these imports to prevent any breakage.

  • 24-29: The Server struct still contains embedded fields for BridgeServer, PriceFeedServer, and LiquidationServer. Ensure that the new health monitoring system is compatible with these services and that their integration is properly handled.

  • 43-58: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [54-66]

The reportResponse method has been refactored to only report metrics. Confirm that this change aligns with the new health monitoring system's requirements and that no other functionality is lost due to the removal of the error return type.

protocol/app/app.go Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between cbde270 and 5fe6028.
Files selected for processing (1)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)

protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
@clemire clemire force-pushed the crystal/CORE-666-monitor-update branch from 5fe6028 to bc6fe09 Compare November 28, 2023 00:52
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 3

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 1c32c5f and bc6fe09.
Files selected for processing (14)
  • protocol/app/app.go (9 hunks)
  • protocol/app/app_test.go (2 hunks)
  • protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
  • protocol/daemons/server/bridge.go (3 hunks)
  • protocol/daemons/server/liquidation.go (3 hunks)
  • protocol/daemons/server/pricefeed.go (6 hunks)
  • protocol/daemons/server/server.go (4 hunks)
  • protocol/daemons/server/server_test.go (3 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
  • protocol/daemons/types/health_checkable.go (3 hunks)
  • protocol/mocks/HealthCheckable.go (1 hunks)
  • protocol/mocks/Makefile (1 hunks)
  • protocol/testutil/app/app.go (1 hunks)
Files skipped from review due to trivial changes (2)
  • protocol/daemons/server/server_test.go
  • protocol/mocks/Makefile
Additional comments: 17
protocol/app/app_test.go (1)
  • 3-4: The addition of the github.com/dydxprotocol/v4-chain/protocol/mocks import is appropriate for the new test function TestMonitorDaemon_Panics that uses mocks.
protocol/daemons/pricefeed/client/client_integration_test.go (4)
  • 5-11: The addition of the import statement aligns with the changes described in the summary.

  • 284-288: The initialization of healthMonitor is correctly implemented and follows the new health check monitoring system's design.

  • 340-341: The registration of the s.pricefeedDaemon service with the healthMonitor is correctly implemented and is a critical part of the new health check monitoring system.

  • 280-292: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [5-342]

Overall, the changes in the provided hunks are consistent with the pull request's intent to implement a robust health check monitoring system and replace the previous update monitoring approach.

protocol/daemons/server/bridge.go (1)
  • 33-36: The AddBridgeEvents method now unconditionally calls s.reportResponse(types.BridgeDaemonServiceName). Ensure that the reportResponse method is designed to handle all scenarios appropriately since the conditional panic has been removed.
protocol/daemons/server/liquidation.go (3)
  • 3-10: The import reorganization and addition of liquidationtypes appear to be correct and necessary for the new functionality.

  • 37-42: Ensure that the reportResponse method is correctly implemented and used in the context of the new health monitoring system, as it is not visible in the provided code snippet.

  • 37-43: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [29-39]

Verify whether the telemetry.ModuleSetGauge call is still required, as the summary indicates that telemetry and metrics functionality is being stripped from the package.

protocol/daemons/server/pricefeed.go (1)
  • 48-54: The use of panic here will crash the server if s.marketToExchange is not initialized. Ensure that this is the intended behavior and that there are proper recovery mechanisms in place to handle such crashes gracefully.
protocol/daemons/server/server.go (1)
  • 43-58: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [54-66]

The reportResponse method has been correctly updated to remove references to the updateMonitor and now focuses solely on metrics collection. Ensure that the metrics collection system is still compatible with these changes.

protocol/daemons/types/health_checkable.go (4)
  • 22-26: The addition of the ServiceName method to the HealthCheckable interface is consistent with the summary provided. This change will require all implementing types to provide a unique service name, which can be useful for identifying services during health checks.

  • 71-72: The serviceName field has been correctly added to the timeBoundedHealthCheckable struct to store the name of the service being monitored. This aligns with the summary and the new ServiceName method implementation.

  • 81-88: The initialization of timeBoundedHealthCheckable in the NewTimeBoundedHealthCheckable function correctly sets the serviceName, timeProvider, and logger. Additionally, it initializes the health checkable to an unhealthy state, which is a good practice to ensure that the service is checked for health immediately after creation.

  • 91-93: The implementation of the ServiceName method in the timeBoundedHealthCheckable struct is correct and returns the stored service name as expected.

protocol/mocks/HealthCheckable.go (1)
  • 1-63: The mock implementation for HealthCheckable and the NewHealthCheckable function are correctly set up and follow the standard practices for using the testify mock package.
protocol/testutil/app/app.go (1)
  • 493-493: The change from tApp.App.Server.DisableUpdateMonitoringForTesting() to tApp.App.DisableHealthMonitorForTesting() aligns with the refactoring from an update monitoring system to a health check monitor system. Ensure that the DisableHealthMonitorForTesting method is implemented correctly in the App struct and that it effectively disables health monitoring during tests.

protocol/app/app_test.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/app/app.go Show resolved Hide resolved
@@ -47,12 +36,6 @@ func (s *Server) LiquidateSubaccounts(
metrics.Received,
metrics.Count,
)
// If the daemon is unable to report a response, there is either an error in the registration of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to make sense to log this as we never expect the report response call to fail.

protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
protocol/daemons/server/pricefeed.go Show resolved Hide resolved
protocol/daemons/server/pricefeed.go Outdated Show resolved Hide resolved
protocol/daemons/server/liquidation.go Outdated Show resolved Hide resolved
protocol/daemons/server/bridge.go Outdated Show resolved Hide resolved
protocol/daemons/server/types/health_monitor.go Outdated Show resolved Hide resolved
@clemire
Copy link
Contributor Author

clemire commented Dec 2, 2023

@lcwik reached out to you via slack to discuss the test case error I'm seeing consistently in this PR.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 7b8f83f and cd77cee.
Files selected for processing (7)
  • protocol/app/app.go (10 hunks)
  • protocol/app/app_test.go (2 hunks)
  • protocol/daemons/server/bridge.go (2 hunks)
  • protocol/daemons/server/liquidation.go (2 hunks)
  • protocol/daemons/server/pricefeed.go (5 hunks)
  • protocol/daemons/server/types/health_checker.go (1 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
Additional comments: 17
protocol/app/app_test.go (2)
  • 3-9: The addition of new imports for github.com/dydxprotocol/v4-chain/protocol/mocks and time is consistent with the summary and the PR objective of refactoring the daemon monitoring system.

  • 228-241: The addition of the TestRegisterDaemonWithHealthMonitor_Panics test function is consistent with the summary and aligns with the PR objective to ensure robustness in the health check monitor system by not allowing the same service to be registered more than once.

protocol/daemons/server/bridge.go (1)
  • 26-41: The change in the function signature of AddBridgeEvents to return a pointer to api.AddBridgeEventsResponse and an error separately is consistent with the PR's objective to refactor the daemon monitoring system. Ensure that all calls to this function are updated to handle the new signature correctly.
protocol/daemons/server/liquidation.go (3)
  • 2-10: The import changes are appropriate and align with the restructuring of the codebase.

  • 27-50: The function signature change for LiquidateSubaccounts is correct and reflects the new health check monitoring system's requirements.

  • 46-47: The addition of reportValidResponse is a good practice to capture valid responses in metrics, aligning with the new health check monitoring system.

protocol/daemons/server/pricefeed.go (6)
  • 9-9: The previous comment about removing the gometrics import if it's not used is still valid. Please verify if this import is necessary, and if not, remove it to keep the code clean.

  • 6-6: The import alias servertypes has been correctly updated to types to reflect the changes in the import path.

  • 47-81: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [34-72]

The function signature of UpdateMarketPrices has been updated to use named return values, which is consistent with the PR objective and summary.

  • 75-79: The error types in the validateMarketPricesUpdatesMessage and validateMarketPriceUpdate functions have been updated to use daemontypes.Err*, aligning with the PR objective and summary.

  • 52-58: The comment explaining why the panic in UpdateMarketPrices would be unexpected has been added, providing the necessary context as per the previous discussion.

  • 70-70: The response reporting in UpdateMarketPrices now uses types.PricefeedDaemonServiceName, which is consistent with the PR objective and summary.

protocol/daemons/server/types/health_checker.go (4)
  • 22-22: The previous comment suggested renaming the Update method to UpdateLastErr to clarify that it updates the error and sets the timestamp only if it's the first error in a streak. This change is not reflected in the provided code.
- func (u *timestampWithError) Update(timestamp time.Time, err error) {
+ func (u *timestampWithError) UpdateLastErr(timestamp time.Time, err error) {
  • 38-38: The previous comment suggested renaming the IsZero method to IsUnset to better reflect its purpose. This change is not reflected in the provided code.
- func (u *timestampWithError) IsZero() bool {
+ func (u *timestampWithError) IsUnset() bool {
  • 69-69: The previous comment suggested renaming the field mostRecentFailureStreakError to lastErrWithFirstErrTimestamp for clarity. This change is not reflected in the provided code.
- mostRecentFailureStreakError timestampWithError
+ lastErrWithFirstErrTimestamp timestampWithError
  • 1-218: Overall, the code aligns with the PR objective of refactoring the daemon monitoring system into a health check monitor. The new structures and methods introduced in this file are consistent with the goal of re-issuing panics if a daemon's unhealthiness persists beyond a maximum allowable threshold.
protocol/daemons/server/types/health_monitor.go (1)
  • 1-223: The changes in the provided hunk align with the PR objective and the generated summaries. The new health monitoring system is implemented with proper synchronization, error handling, and logging. The code is well-structured and follows best practices. It appears ready for further testing and integration.

err error
}

func (u *timestampWithError) Update(timestamp time.Time, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any thoughts on this?

u.err = nil
}

func (u *timestampWithError) IsZero() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any thoughts here?

// This field is used to determine how long the daemon has been unhealthy. If this timestamp is nil, then either
// the service has never been unhealthy, or the most recent error streak ended before it could trigger a callback.
// Access to mostRecentFailureStreakError is synchronized.
mostRecentFailureStreakError timestampWithError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any thoughts here?

// maxAcceptableUnhealthyDuration is the maximum acceptable duration for a health-checkable service to
// remain unhealthy. If the service remains unhealthy for this duration, the monitor will execute the
// specified callback function.
maxAcceptableUnhealthyDuration time.Duration
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we name this to maxDaemonUnhealthyDuration so it's consistent w #802?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ it's renamed in that PR, which will be merged into this one before merging to main.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see here

protocol/daemons/server/types/health_checker.go Outdated Show resolved Hide resolved
}

// Schedule next poll.
hc.mutableState.SchedulePoll(hc.pollFrequency)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To confirm, we want to schedule the next poll when the daemon is unhealthy right? Makes sense to me, but would be nice to add some comments why this is intended.

if maximumAcceptableUnhealthyDuration <= 0 {
return fmt.Errorf(
"health check registration failure for service %v: "+
"maximum acceptable unhealthy duration %v must be positive",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: just to confirm, will you be updating this text too to match maxDaemonUnhealthyDuration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's updated here

u.err = nil
}

func (u *timestampWithError) IsZero() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like IsUnset

Crystal Lemire added 8 commits December 7, 2023 10:29
…PC query.

Update error for a more specific test.

PR comments.

Revert liquidations health reporting to occur only at task loop boundary.

Checkpoint.

checkpoint - implementation, moved methods to app. need to migrate / add testing.

Update test app.

Checkpoint for tests.

Finish test cases.

App unit test + cleanup.

Cleanup, rename.

cleanup.

Remove todo for refactoring metrics daemon.

Move monitoring call to before client start.

Update placement of monitor code for pricefeed client.

Remove defunct constants.

Revert ServiceName changes.
Remove superfluous methods.

Comment nit.

Method name fix.
@clemire clemire force-pushed the crystal/CORE-666-monitor-update branch from cd77cee to f996947 Compare December 7, 2023 18:30
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 10

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between a924b75 and f996947.
Files selected for processing (18)
  • protocol/app/app.go (10 hunks)
  • protocol/app/app_test.go (2 hunks)
  • protocol/daemons/metrics/client/client.go (1 hunks)
  • protocol/daemons/pricefeed/client/client_integration_test.go (4 hunks)
  • protocol/daemons/server/bridge.go (2 hunks)
  • protocol/daemons/server/liquidation.go (2 hunks)
  • protocol/daemons/server/pricefeed.go (5 hunks)
  • protocol/daemons/server/server.go (4 hunks)
  • protocol/daemons/server/server_test.go (3 hunks)
  • protocol/daemons/server/types/health_checker.go (1 hunks)
  • protocol/daemons/server/types/health_checker_test.go (1 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
  • protocol/daemons/server/types/stoppable.go (1 hunks)
  • protocol/daemons/types/health_checkable.go (3 hunks)
  • protocol/mocks/HealthCheckable.go (1 hunks)
  • protocol/mocks/Makefile (1 hunks)
  • protocol/testutil/app/app.go (1 hunks)
Additional comments: 46
protocol/app/app.go (5)
  • 189-189: As previously discussed, ensure that the follow-up PR to make MaximumDaemonUnhealthyDuration configurable has been merged or is tracked for merging to accommodate different environments and use cases.

  • 298-298: The addition of DaemonHealthMonitor to the App struct is consistent with the PR's objective to refactor daemon monitoring to use health checks.

  • 1233-1256: The addition of RegisterDaemonWithHealthMonitor and DisableHealthMonitorForTesting methods to the App struct aligns with the PR's objective to enhance daemon health monitoring.

  • 645-661: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [611-661]

The startDaemons function correctly starts various daemon clients and registers them with the health monitor, which is in line with the PR's objective to monitor daemon health.

  • 672-680: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [672-692]

Ensure that a comment is added explaining why the Metrics Daemon is not monitored by the health check system, as discussed in the previous comments.

protocol/app/app_test.go (2)
  • 3-4: The addition of the import statement for "github.com/dydxprotocol/v4-chain/protocol/mocks" is appropriate for the new test function that utilizes mocks.

  • 228-241: The new test TestRegisterDaemonWithHealthMonitor_Panics correctly verifies that the health monitor panics when a service is registered more than once, aligning with the PR's objective to handle daemon unhealthiness.

protocol/daemons/metrics/client/client.go (1)
  • 20-24: The comment added provides a clear explanation for the absence of a health-check implementation for the metrics daemon, which is consistent with the PR's objective to refactor daemon monitoring.
protocol/daemons/pricefeed/client/client_integration_test.go (4)
  • 5-11: The addition of the import statement aligns with the PR objectives and the summary provided.

  • 220-223: The declaration of the healthMonitor variable is consistent with the PR objectives and the summary provided.

  • 280-292: The modifications in the SetupTest function to initialize the healthMonitor variable align with the PR objectives and the summary provided.

  • 337-342: The addition of a call to healthMonitor.RegisterService in the startClient function is consistent with the PR objectives and the summary provided.

protocol/daemons/server/bridge.go (2)
  • 26-40: The changes to the AddBridgeEvents function signature and the direct handling of errors within the function body align with the PR's objectives to refactor daemon monitoring to use health checks. The use of named return values (response and err) is a common Go idiom that can improve readability, especially when dealing with multiple return values.

  • 2-8: The removal of the ExpectBridgeDaemon method is consistent with the PR's objectives to transition from an update monitor to a health check monitor. Ensure that any code that previously relied on this method has been updated to work with the new health check system.

protocol/daemons/server/liquidation.go (3)
  • 2-10: The import modification is non-functional and does not impact the behavior of the code.

  • 27-50: The function signature change for LiquidateSubaccounts is correct and aligns with the PR objectives.

  • 36-42: The addition of telemetry metrics is consistent with the PR objectives and the previous comments.

protocol/daemons/server/pricefeed.go (4)
  • 9-9: The import gometrics "github.com/armon/go-metrics" appears to be unused in the provided code. If it is indeed not used, it should be removed to keep the code clean.

The import gometrics "github.com/armon/go-metrics" is used in the file protocol/daemons/server/pricefeed.go. Therefore, it should not be removed.

  • 6-6: The import alias servertypes is still being used in the code. If the import path has been changed to types as mentioned in the summary, this alias should be updated accordingly.

  • 52-58: The panic condition in the UpdateMarketPrices function has been updated to use daemontypes.ErrServerNotInitializedCorrectly. This change aligns with the PR objectives to use health checks for daemon monitoring.

  • 47-81: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [34-73]

The removal of the ExpectPricefeedDaemon function is consistent with the PR objectives to refactor the daemon monitoring system into a health check monitor.

protocol/daemons/server/server.go (1)
  • 24-29: The summary does not mention the addition of fileHandler and socketAddress fields to the Server struct, which are present in the hunks. This should be included in the summary to accurately reflect the changes made to the Server struct.
protocol/daemons/server/server_test.go (1)
  • 171-176: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [161-175]

The changes from lines 161 to 175 show the createServerWithMocks function without the server.DisableUpdateMonitoringForTesting() call, which aligns with the PR's objective to migrate from update monitoring to health checks. Ensure that the removal of this call is consistent with the new health check system and that no other parts of the test suite rely on this method.

protocol/daemons/server/types/health_monitor.go (12)
  • 15-19: The constants HealthCheckPollFrequency and HealthMonitorLogModuleName are well-named and documented, providing clear context for their usage.

  • 21-34: The healthMonitorMutableState struct is well-encapsulated, with a mutex to protect its mutable state. This should help prevent data races when accessing the serviceToHealthChecker map and the stopped and disabled flags.

  • 43-50: The DisableForTesting method is a good practice for testing purposes, allowing the health monitor to be disabled without affecting the production code path.

  • 52-68: The Stop method correctly checks if the monitor has already been stopped before proceeding, which is a good safeguard against redundant operations.

  • 70-116: The RegisterHealthChecker method is synchronized and handles the case where the monitor is disabled or stopped, which is crucial for avoiding race conditions and ensuring that no new health checkers are registered after the monitor is stopped.

  • 118-128: The HealthMonitor struct is well-defined with clear separation of mutable and immutable state. The immutable fields are set during construction and are not modified afterward, which is a good practice for maintaining thread safety.

  • 130-141: The NewHealthMonitor constructor function initializes the HealthMonitor with a logger that includes the module name, which is good for traceability in logs.

  • 148-178: The RegisterServiceWithCallback method ensures that the maximumAcceptableUnhealthyDuration is positive before proceeding with registration, which is a good validation check to prevent misconfiguration.

  • 181-187: The PanicServiceNotResponding function provides a callback that panics with a service-specific message, which aligns with the PR objective to re-issue panics for unhealthy daemons.

  • 189-200: The LogErrorServiceNotResponding function is a good alternative to panicking, allowing for logging errors without stopping the service. This could be useful in non-critical environments or during development.

  • 203-218: The RegisterService method provides a convenient wrapper around RegisterServiceWithCallback, using the panic callback by default. This aligns with the PR objective and simplifies the registration process for services that should panic when unhealthy.

  • 220-223: The Stop method in the HealthMonitor struct delegates to the mutable state's Stop method, maintaining consistency in the stopping behavior across the system.

protocol/daemons/server/types/stoppable.go (1)
  • 3-7: The Stoppable interface has been correctly defined according to the PR objectives and summary.
protocol/daemons/types/health_checkable.go (7)
  • 22-26: The addition of the ServiceName method to the HealthCheckable interface aligns with the PR objectives and the provided summaries.

  • 69-73: The inclusion of the serviceName field in the timeBoundedHealthCheckable struct and the implementation of the ServiceName method are consistent with the PR objectives and the provided summaries.

  • 81-88: The changes to the NewTimeBoundedHealthCheckable function, including the addition of the serviceName parameter and its assignment to the serviceName field, are in line with the PR objectives and the provided summaries.

  • 91-93: The implementation of the ServiceName method in the timeBoundedHealthCheckable struct is correct and aligns with the PR objectives and the provided summaries.

  • 96-98: The ReportSuccess method in the timeBoundedHealthCheckable struct is correctly implemented and thread-safe, which is consistent with the PR objectives and the provided summaries.

  • 98-98: The ReportFailure method in the timeBoundedHealthCheckable struct is correctly implemented and thread-safe, which is consistent with the PR objectives and the provided summaries.

  • 98-98: The HealthCheck method in the timeBoundedHealthCheckable struct is correctly implemented and thread-safe, which is consistent with the PR objectives and the provided summaries.

protocol/mocks/HealthCheckable.go (1)
  • 1-63: The changes to the HealthCheckable mock type are consistent with the PR objectives and summaries provided. The mock functions for HealthCheck, ReportFailure, ReportSuccess, and ServiceName are correctly implemented.
protocol/mocks/Makefile (1)
  • 24-24: The changes correctly add the mock generation command for the HealthCheckable interface, aligning with the PR objectives and the summary provided.
protocol/testutil/app/app.go (1)
  • 492-493: The change from DisableUpdateMonitoringForTesting to DisableHealthMonitorForTesting correctly reflects the PR's objective to migrate daemon monitoring to health checks.

Comment on lines 22 to 28
func (u *timestampWithError) Update(timestamp time.Time, err error) {
// If the timestamp is zero, this is the first update, so set the timestamp.
if u.timestamp.IsZero() {
u.timestamp = timestamp
}

u.err = err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming the Update method to UpdateLastErr to clarify that it updates the error and the timestamp only if it's the first error in a streak.

protocol/daemons/server/types/health_checker.go Outdated Show resolved Hide resolved
Comment on lines 34 to 42
// createTestMonitor creates a health monitor with a poll frequency of 10ms and a zero duration grace period.
func createTestMonitor() (*types.HealthMonitor, *mocks.Logger) {
logger := &mocks.Logger{}
logger.On("With", "module", "health-monitor").Return(logger).Once()
return types.NewHealthMonitor(
ZeroDuration,
10*time.Millisecond,
logger,
), logger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The createTestMonitor function sets up a health monitor with a very short poll frequency of 10ms. While this is likely intended to make the tests run quickly, it's important to ensure that this does not lead to flaky tests in slower environments or CI systems. Consider parameterizing the poll frequency or using a mock clock to control time in tests.

protocol/daemons/server/types/health_monitor_test.go Outdated Show resolved Hide resolved
protocol/app/app.go Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between f996947 and b5e62a3.
Files selected for processing (1)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
Additional comments: 1
protocol/daemons/server/types/health_monitor_test.go (1)
  • 202-206: The error message in the assertion on line 206 has been corrected to match the nature of the error being tested.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 5

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between b5e62a3 and 80c481b.
Files selected for processing (1)
  • protocol/daemons/server/types/health_checker.go (1 hunks)


// Update updates the timeStampWithError to reflect the current error. If the timestamp is zero, this is the first
// update, so set the timestamp.
func (u *timestampWithError) Update(timestamp time.Time, err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming the Update method to UpdateLastErr to clarify that it updates the error and the timestamp only if it's the first error in a streak.

}

// IsZero returns true if the timestampWithError is zero, indicating that the service is healthy.
func (u *timestampWithError) IsZero() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method IsZero should be renamed to IsUnset as per previous discussions to better reflect its purpose.

// This field is used to determine how long the daemon has been unhealthy. If this timestamp is nil, then either
// the service has never been unhealthy, or the most recent error streak ended before it could trigger a callback.
// Access to mostRecentFailureStreakError is synchronized.
mostRecentFailureStreakError timestampWithError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider renaming the mostRecentFailureStreakError field to lastErrWithFirstErrTimestamp for clarity, as previously suggested.

protocol/daemons/server/types/health_checker.go Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 80c481b and 21585a0.
Files selected for processing (1)
  • protocol/daemons/server/types/health_checker.go (1 hunks)

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 6

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 21585a0 and 4336388.
Files ignored due to filter (1)
  • protocol/docker-compose.yml
Files selected for processing (11)
  • protocol/app/app.go (8 hunks)
  • protocol/app/flags/flags.go (1 hunks)
  • protocol/daemons/flags/flags.go (5 hunks)
  • protocol/daemons/flags/flags_test.go (4 hunks)
  • protocol/daemons/pricefeed/client/client_integration_test.go (3 hunks)
  • protocol/daemons/server/types/health_checker.go (1 hunks)
  • protocol/daemons/server/types/health_monitor.go (1 hunks)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)
  • protocol/x/clob/client/cli/cancel_order_cli_test.go (2 hunks)
  • protocol/x/clob/client/cli/liquidations_cli_test.go (2 hunks)
  • protocol/x/clob/client/cli/place_order_cli_test.go (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • protocol/daemons/pricefeed/client/client_integration_test.go
Additional comments: 24
protocol/app/app.go (1)
  • 8-8: The math/big package import is still present, which contradicts the summary stating it was removed. If this package is no longer used, it should be removed to keep the code clean.
- "math/big"
protocol/app/flags/flags.go (1)
  • 3-7: The summary indicates that the "fmt" package import was removed, but the import statement is still present in the hunk. This discrepancy should be corrected in the summary.
protocol/daemons/flags/flags.go (5)
  • 9-17: The addition of new flags FlagPanicOnDaemonFailureEnabled and FlagMaxDaemonUnhealthySeconds is consistent with the PR objectives and the summary provided. These flags are crucial for the new health check monitoring system.

  • 30-37: The SharedFlags struct has been correctly updated to include the new fields PanicOnDaemonFailureEnabled and MaxDaemonUnhealthySeconds, aligning with the PR's goal to enhance daemon health monitoring.

  • 80-86: The default values for the new flags PanicOnDaemonFailureEnabled and MaxDaemonUnhealthySeconds have been set appropriately in the GetDefaultDaemonFlags function. This ensures that the daemons will panic on failure and have a default unhealthy duration threshold, which is in line with the PR's objectives.

  • 117-132: The command line flag additions for FlagPanicOnDaemonFailureEnabled and FlagMaxDaemonUnhealthySeconds are correctly implemented, allowing users to configure the behavior of the health check monitoring system via command line options.

  • 196-208: The parsing logic for the new flags FlagPanicOnDaemonFailureEnabled and FlagMaxDaemonUnhealthySeconds is correctly implemented, ensuring that the values can be retrieved and set from the command line options.

protocol/daemons/flags/flags_test.go (4)
  • 17-24: The addition of new flags FlagPanicOnDaemonFailureEnabled and FlagMaxDaemonUnhealthySeconds to the command line interface is correctly implemented and tested for presence.

  • 43-50: The test function TestGetDaemonFlagValuesFromOptions_Custom correctly sets custom values for the new flags and verifies their retrieval from the options map.

  • 68-79: The assertions for the new flags FlagPanicOnDaemonFailureEnabled and FlagMaxDaemonUnhealthySeconds in the TestGetDaemonFlagValuesFromOptions_Custom test function are correctly implemented.

  • 91-97: The TestGetDaemonFlagValuesFromOptions_Default test function correctly checks for the default values of the daemon flags, ensuring that the new flags will have their expected default values when not explicitly set.

protocol/daemons/server/types/health_checker.go (1)
  • 182-186: The implementation of the Poll method correctly schedules the next poll regardless of the health check result, which is consistent with the intended behavior as per the PR's objective and previous comments. The comments within the code provide clear explanation for this behavior, which is to ensure continuous monitoring even when the daemon is unhealthy and the callback does not halt the daemon.
protocol/daemons/server/types/health_monitor.go (5)
  • 21-34: The mutable state struct healthMonitorMutableState is well encapsulated and uses a mutex for synchronization, which is good for thread safety.

  • 188-194: The PanicServiceNotResponding function provides a clear panic message including the service name and error, which is helpful for debugging.

  • 196-208: The LogErrorServiceNotResponding function is a good counterpart to PanicServiceNotResponding, providing a non-terminating error logging option. It's important to ensure that the logger is properly configured to capture these error messages.

  • 210-233: The RegisterService method provides a clear API for registering services with the health monitor, with the behavior toggled by the enablePanics flag. This is a good use of a configuration flag to alter behavior without changing the interface.

  • 235-238: The Stop method in the HealthMonitor struct delegates to the mutable state's Stop method, maintaining consistency in the approach to stopping services.

protocol/x/clob/client/cli/cancel_order_cli_test.go (3)
  • 11-11: The addition of the math package import is correct and necessary for the use of math.MaxUint32 later in the code.

  • 77-82: The configuration setting to disable the health monitor panic timeout by setting daemonflags.FlagMaxDaemonUnhealthySeconds to math.MaxUint32 is consistent with the PR objectives and is a valid approach for integration tests where the health check panic is not desirable.

  • 74-86: > Note: This review was outside the patches, so it was mapped to the patch with the greatest overlap. Original lines [8-86]

The changes in the provided hunks are consistent with the PR objectives and the AI-generated summary. There are no alterations to the declarations of exported entities or any other issues that need to be addressed.

protocol/x/clob/client/cli/liquidations_cli_test.go (2)
  • 5-10: The addition of the "math" package import is correctly reflected in the hunk and is used within the test setup.

  • 78-83: The setting for daemonflags.FlagMaxDaemonUnhealthySeconds is correctly added to disable the health monitor panic timeout for the tests.

protocol/x/clob/client/cli/place_order_cli_test.go (2)
  • 5-8: ```shell

Verify that the "math" package is used elsewhere in the file.

ast-grep --lang go --pattern $'import("math")'


* 72-77: ```shell
# Verify that the setting for FlagMaxDaemonUnhealthySeconds is not used elsewhere in a way that conflicts with this test configuration.
ast-grep --lang go --pattern $'Set(daemonflags.FlagMaxDaemonUnhealthySeconds, $_)'

Comment on lines +64 to +65
// The following tests may still intermittently fail or report false positives on an overloaded system as they rely
// on callbacks to execute before the termination of the `time.Sleep` call, which is not guaranteed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests still rely on time.Sleep for synchronization, which can lead to flaky tests. Consider refactoring the tests to use synchronization primitives or mocking time-dependent functionality to make the tests more deterministic and reliable.

protocol/app/app.go Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 4336388 and abdeb97.
Files selected for processing (1)
  • protocol/daemons/server/types/health_checker.go (1 hunks)
Additional comments: 1
protocol/daemons/server/types/health_checker.go (1)
  • 1-217: The summary mentions the introduction of new global variables and flags, such as MaximumDaemonUnhealthyDuration and FlagMaxDaemonUnhealthySeconds, but these are not visible in the provided code. Please ensure that all relevant changes are included in the review or update the summary accordingly.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 1

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between abdeb97 and 9192bae.
Files selected for processing (1)
  • protocol/daemons/server/types/health_checker.go (1 hunks)
Additional comments: 6
protocol/daemons/server/types/health_checker.go (6)
  • 77-87:
    The ReportSuccess method correctly updates the last success timestamp and resets the error streak, ensuring that the health checker's state is accurately maintained after a successful health check.

  • 89-97:
    The ReportFailure method properly updates the error streak with the most recent error and calculates the duration of the current failure streak, which is essential for determining if the maximum unhealthy duration has been exceeded.

  • 100-113:
    The SchedulePoll method correctly checks if the health checker has been stopped before scheduling the next poll, which prevents unnecessary operations after the health checker has been stopped.

  • 115-127:
    The Stop method ensures that the health checker is stopped only once and that subsequent calls to stop do not have any effect, which is a good practice to prevent redundant operations.

  • 156-179:
    The Poll method encapsulates the logic for executing a health check, handling both healthy and unhealthy responses, and scheduling the next poll. The method also correctly invokes the unhealthy callback when the service has been unhealthy for longer than the maximum acceptable duration.

  • 186-209:
    The StartNewHealthChecker function initializes and starts a new health checker with the provided parameters, including a startup grace period before the first poll, which allows the service to initialize properly.

Comment on lines +174 to +178
// Schedule next poll. We schedule another poll whether the callback was invoked or not, as callbacks are not
// guaranteed to panic or otherwise halt the daemon. In such cases, we may end up invoking the callback several
// times once the service exceeds the maximum unhealthy duration. For example, a callback that emits error logs
// will continue to emit error logs every 5s until the service becomes healthy again.
hc.mutableState.SchedulePoll(hc.pollFrequency)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code schedules a new poll after each health check, regardless of the health check's outcome. It would be beneficial to add a comment explaining the rationale behind this decision, as previously suggested by other reviewers.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Status

Actionable comments generated: 2

Configuration used: CodeRabbit UI

Commits Files that changed from the base of the PR and between 9192bae and 72d09dc.
Files selected for processing (1)
  • protocol/daemons/server/types/health_monitor_test.go (1 hunks)

Comment on lines +215 to +239
func TestRegisterService_RegistrationFailsAfterStop(t *testing.T) {
ufm, _ := createTestMonitor()
ufm.Stop()

stoppableHc := &stoppableFakeHealthChecker{}
hc2 := mockFailingHealthCheckerWithError("test-service-2", TestError1)

// Register unhealthy services. These services are confirmed to trigger a panic if registered when the monitor is
// not stopped.
// Register a stoppable unhealthy service.
err := ufm.RegisterService(stoppableHc, 10*time.Millisecond)
require.Nil(t, err)

// Register a non-stoppable unhealthy service.
err = ufm.RegisterService(hc2, 10*time.Millisecond)
require.Nil(t, err)

// Since the max allowable unhealthy duration is 10ms, and the polling period is 10ms, 100ms is long enough to wait
// in order to trigger a panic if a service is polled.
time.Sleep(100 * time.Millisecond)

// Assert that the monitor proactively stops any stoppable service that was registered after the monitor was
// stopped.
require.True(t, stoppableHc.stopped)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test case does not assert the expected behavior for non-stoppable services after the monitor is stopped. Consider adding an assertion to verify that non-stoppable services are not stopped by the monitor.

}()

// Give the monitor time to poll the health checkable service. Polls occur once every 10ms.
time.Sleep(100 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests still rely on time.Sleep for synchronization, which can lead to flaky tests. Consider refactoring the tests to use synchronization primitives or mocking time-dependent functionality to make the tests more deterministic and reliable.

Also applies to: 119-119, 159-159, 185-185, 234-234

@clemire clemire merged commit f57fd40 into main Dec 8, 2023
16 of 17 checks passed
@clemire clemire deleted the crystal/CORE-666-monitor-update branch December 8, 2023 01:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

3 participants