-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No more sending on closed channels during checkscontroller shutdown #245
Conversation
First metric announces registration state.
When teh sparrow check controller would shut down, it would iterate over the actual slice of the checks, shutdown each check and then procceed to delete said check from the slice. Since the shutting down procedure is instant, there was a race condition that would delete a wrong check from the slice and then the same shutting down loop would try and shutdown the same check again. Just returning a copy of the slice resolves this problem, as the iteration is now done on the copy only. A more sophisticated deletion routine for the checks slice could be another way to handle this, but it would probably increase the complexity of the checks and checkscontroller structs.
Signed-off-by: Bruno Bressi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -53,5 +54,5 @@ func (c *Checks) Delete(check checks.Check) { | |||
func (c *Checks) Iter() []checks.Check { | |||
c.mu.RLock() | |||
defer c.mu.RUnlock() | |||
return c.checks | |||
return slices.Clone(c.checks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK will this be a problem. When slices.Clone(c.checks)
is called, it creates a copy of the checks. Any updates applied to these cloned instances will not affect the original running instances. This means that the original checks will never receive the updated configurations, which can prevent them from functioning correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll write some tests to replicate the problem. And then we can talk about how to better fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good news: the tests show that the change works as intended. Or at least I think they do :)
The shallow copy still provides a slice with the references intact. Each check instance will run its UpdateConfig
and the actual check in the checks
field of the controller will get updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that makes sense, but let me give it a try in a manual e2e test just to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, it was the next thing on my list. Let's sit down after you're done with your exams and talk about your e2e branch 🌿
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests were also successful.
Manual E2E Test
First I've started the sparrow with only 2 health check targets and then updated the runtime config after the first interval to include two additional targets:
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
After I added the two additional targets:
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=4
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://gitlab.devops.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://gitlab.devops.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
Complete log
❯ go run . run -c ./.tmp/.sparrow.yaml
Using config file: ./.tmp/.sparrow.yaml
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/cmd/run.go:89 msg="Running sparrow"
time=11:04:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/sparrow/metrics/metrics.go:117 msg="Tracing initialized with new provider" provider=http
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:76 msg="Starting healthcheck" interval=30s
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/pkg/api/api.go:101 msg="Serving Api" addr=:8081
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=2
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:04:39 level=INFO source=/home/installadm/dev/sparrow/pkg/config/file.go:93 msg="Successfully got local runtime configuration"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
time=11:05:09 level=INFO source=/home/installadm/dev/sparrow/pkg/config/file.go:93 msg="Successfully got local runtime configuration"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=4
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://gitlab.devops.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://gitlab.devops.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
^Ctime=11:05:19 level=INFO source=/home/installadm/dev/sparrow/cmd/run.go:96 msg="Signal received, shutting down"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:80 msg="Context canceled" err="context canceled"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=health error="context canceled"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/run.go:185 msg="Shutting down sparrow"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=11:05:19 level=DEBUG source=/home/installadm/dev/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=11:05:19 level=DEBUG source=/home/installadm/dev/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/run.go:134 msg="Sparrow was shut down"
Used startup/runtime config
name: sparrow.localhost
api:
address: :8081
telemetry:
exporter: http
tls:
enabled: true
url: <redacted>
loader:
type: file
interval: 30s
file:
path: ./.tmp/.sparrow.yaml
health:
interval: 30s
timeout: 10s
retry:
count: 3
delay: 1s
targets:
- https://www.example.com
- https://www.google.com
- https://www.telekom.de
- https://gitlab.devops.telekom.de
This test proves that the shallow clone works as intended and returns a clone of the slice, where the original references can still be used and updated.
This should fix the bodyclose linting remarks Signed-off-by: Bruno Bressi <[email protected]>
Motivation
Fixes #244
Changes
The change is just one LoC: return a copy of the
checks
slice when callingIter()
, which TBH is a safer and simpler way to handle this slice (long live functional programming).The rest is just some light housekeeping:
Tests done
The manual test case mentioned in the issue was carried out; started the sparrow and sent a keyboard interrupt. The error doesn't occur anymore:
Additionally, I've added a new unit test case (which does fail in the main branch ATM). The test actually tests the shutdown function of the
ChecksController
, which we had unfortunately neglected.TODO