Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No more sending on closed channels during checkscontroller shutdown #245

Merged
merged 10 commits into from
Jan 27, 2025

Conversation

puffitos
Copy link
Collaborator

Motivation

Fixes #244

Changes

The change is just one LoC: return a copy of the checks slice when calling Iter(), which TBH is a safer and simpler way to handle this slice (long live functional programming).

The rest is just some light housekeeping:

  • Adding context to the logs
  • Adding a "sparrow was shutdown" log message to know that everything went well after shutting down the sparrow; currently nothing is being shown as feedback to the user outside an exit code of 0.

Tests done

The manual test case mentioned in the issue was carried out; started the sparrow and sent a keyboard interrupt. The error doesn't occur anymore:

time=17:07:25 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:185 msg="Shutting down sparrow"
time=17:07:25 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:154 msg="Shut down signal received"
time=17:07:25 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/remote/gitlab/gitlab.go:335 msg="Deleting file from gitlab" file="{AuthorEmail:dev.telekom.de@sparrow AuthorName:dev.telekom.de CommitMessage:Unregistering global target Content:{Url: LastSeen:0001-01-01 00:00:00 +0000 UTC} Name:dev.telekom.de.json}"
time=17:07:25 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/checks/latency/latency.go:82 msg="Context canceled" err="context canceled"
time=17:07:25 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=latency error="context canceled"
time=17:07:26 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:172 msg="Successfully unregistered as global target"
time=17:07:26 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:177 msg="Stopping gitlab reconciliation routine"
time=17:07:26 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=17:07:26 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=17:07:26 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=17:07:26 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=17:07:26 level=ERROR source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:130 msg="Non-recoverable error in sparrow component" error="failed serving API: context canceled"
time=17:07:26 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/run.go:134 msg="Sparrow was shut down"

Process finished with the exit code 0

Additionally, I've added a new unit test case (which does fail in the main branch ATM). The test actually tests the shutdown function of the ChecksController, which we had unfortunately neglected.

  • Unit tests succeeded
  • E2E tests succeeded

TODO

  • I've assigned this PR to myself
  • I've labeled this PR correctly

First metric announces registration state.
When teh sparrow check controller would shut down, it would iterate over
the actual slice of the checks, shutdown each check and then procceed to
delete said check from the slice.

Since the shutting down procedure is instant, there was a race condition
that would delete a wrong check from the slice and then the same
shutting down loop would try and shutdown the same check again.

Just returning a copy of the slice resolves this problem, as the
iteration is now done on the copy only. A more sophisticated deletion
routine for the checks slice could be another way to handle this, but it
would probably increase the complexity of the checks and
checkscontroller structs.
@puffitos puffitos self-assigned this Dec 25, 2024
@puffitos puffitos added bug Something isn't working area/checks-controller Issues/PRs related to the ChecksController labels Dec 25, 2024
Copy link
Member

@y-eight y-eight left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -53,5 +54,5 @@ func (c *Checks) Delete(check checks.Check) {
func (c *Checks) Iter() []checks.Check {
c.mu.RLock()
defer c.mu.RUnlock()
return c.checks
return slices.Clone(c.checks)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK will this be a problem. When slices.Clone(c.checks) is called, it creates a copy of the checks. Any updates applied to these cloned instances will not affect the original running instances. This means that the original checks will never receive the updated configurations, which can prevent them from functioning correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write some tests to replicate the problem. And then we can talk about how to better fix this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news: the tests show that the change works as intended. Or at least I think they do :)

The shallow copy still provides a slice with the references intact. Each check instance will run its UpdateConfig and the actual check in the checks field of the controller will get updated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense, but let me give it a try in a manual e2e test just to make sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it was the next thing on my list. Let's sit down after you're done with your exams and talk about your e2e branch 🌿

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests were also successful.

Manual E2E Test

First I've started the sparrow with only 2 health check targets and then updated the runtime config after the first interval to include two additional targets:

time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"

After I added the two additional targets:

time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=4
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://gitlab.devops.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://gitlab.devops.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
Complete log
❯ go run . run -c ./.tmp/.sparrow.yaml
Using config file: ./.tmp/.sparrow.yaml
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/cmd/run.go:89 msg="Running sparrow"
time=11:04:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/sparrow/metrics/metrics.go:117 msg="Tracing initialized with new provider" provider=http
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:76 msg="Starting healthcheck" interval=30s
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/pkg/api/api.go:101 msg="Serving Api" addr=:8081
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=2
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:04:39 level=INFO source=/home/installadm/dev/sparrow/pkg/config/file.go:93 msg="Successfully got local runtime configuration"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
time=11:05:09 level=INFO source=/home/installadm/dev/sparrow/pkg/config/file.go:93 msg="Successfully got local runtime configuration"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=4
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://gitlab.devops.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://gitlab.devops.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
^Ctime=11:05:19 level=INFO source=/home/installadm/dev/sparrow/cmd/run.go:96 msg="Signal received, shutting down"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:80 msg="Context canceled" err="context canceled"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=health error="context canceled"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/run.go:185 msg="Shutting down sparrow"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=11:05:19 level=DEBUG source=/home/installadm/dev/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=11:05:19 level=DEBUG source=/home/installadm/dev/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/run.go:134 msg="Sparrow was shut down"
Used startup/runtime config
name: sparrow.localhost

api:
  address: :8081

telemetry:
  exporter: http
  tls:
    enabled: true
  url: <redacted>

loader:
  type: file
  interval: 30s
  file:
    path: ./.tmp/.sparrow.yaml

health:
  interval: 30s
  timeout: 10s
  retry:
    count: 3
    delay: 1s
  targets:
    - https://www.example.com
    - https://www.google.com
    - https://www.telekom.de
    - https://gitlab.devops.telekom.de

This test proves that the shallow clone works as intended and returns a clone of the slice, where the original references can still be used and updated.
This should fix the bodyclose linting remarks

Signed-off-by: Bruno Bressi <[email protected]>
@puffitos puffitos requested review from lvlcn-t and y-eight January 23, 2025 22:18
@puffitos puffitos merged commit b5337ce into main Jan 27, 2025
8 checks passed
@puffitos puffitos deleted the fix/send-on-closed branch January 27, 2025 11:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/checks-controller Issues/PRs related to the ChecksController bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: Sparrow's shutdown sends signal on closed channel
3 participants