Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No more sending on closed channels during checkscontroller shutdown #245

Merged
merged 10 commits into from
Jan 27, 2025
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ jobs:

- name: Install go toolchain for pre-commit
run: |
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.60.3
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v1.63.4
go install mvdan.cc/gofumpt@latest
go install github.com/matryer/moq@latest
go install github.com/norwoodj/helm-docs/cmd/helm-docs@latest
Expand Down
7 changes: 4 additions & 3 deletions cmd/run.go
Original file line number Diff line number Diff line change
Expand Up @@ -86,17 +86,18 @@ func run() func(cmd *cobra.Command, args []string) error {

s := sparrow.New(cfg)
cErr := make(chan error, 1)
log.Info("Running sparrow")
log.InfoContext(ctx, "Running sparrow")
go func() {
cErr <- s.Run(ctx)
}()

select {
case <-sigChan:
log.Info("Signal received, shutting down")
log.InfoContext(ctx, "Signal received, shutting down")
cancel()
<-cErr
case err := <-cErr:
case err = <-cErr:
log.InfoContext(ctx, "Sparrow was shut down")
return err
}

Expand Down
2 changes: 1 addition & 1 deletion pkg/api/api_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -77,7 +77,7 @@ func TestAPI_Run(t *testing.T) {
rec := httptest.NewRecorder()
a.router.ServeHTTP(rec, req)

if status := rec.Result().StatusCode; status != tt.want.status { //nolint:bodyclose // closed in defer below
if status := rec.Result().StatusCode; status != tt.want.status {
t.Errorf("Handler for route %s returned wrong status code: got %v want %v", tt.want.path, status, tt.want.status)
}

Expand Down
3 changes: 2 additions & 1 deletion pkg/checks/runtime/checks.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
package runtime

import (
"slices"
"sync"

"github.com/telekom/sparrow/pkg/checks"
Expand Down Expand Up @@ -53,5 +54,5 @@ func (c *Checks) Delete(check checks.Check) {
func (c *Checks) Iter() []checks.Check {
c.mu.RLock()
defer c.mu.RUnlock()
return c.checks
return slices.Clone(c.checks)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK will this be a problem. When slices.Clone(c.checks) is called, it creates a copy of the checks. Any updates applied to these cloned instances will not affect the original running instances. This means that the original checks will never receive the updated configurations, which can prevent them from functioning correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll write some tests to replicate the problem. And then we can talk about how to better fix this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good news: the tests show that the change works as intended. Or at least I think they do :)

The shallow copy still provides a slice with the references intact. Each check instance will run its UpdateConfig and the actual check in the checks field of the controller will get updated.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense, but let me give it a try in a manual e2e test just to make sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it was the next thing on my list. Let's sit down after you're done with your exams and talk about your e2e branch 🌿

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests were also successful.

Manual E2E Test

First I've started the sparrow with only 2 health check targets and then updated the runtime config after the first interval to include two additional targets:

time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"

After I added the two additional targets:

time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=4
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://gitlab.devops.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://gitlab.devops.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
Complete log
❯ go run . run -c ./.tmp/.sparrow.yaml
Using config file: ./.tmp/.sparrow.yaml
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/cmd/run.go:89 msg="Running sparrow"
time=11:04:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/sparrow/metrics/metrics.go:117 msg="Tracing initialized with new provider" provider=http
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:76 msg="Starting healthcheck" interval=30s
time=11:04:09 level=INFO source=/home/installadm/dev/sparrow/pkg/api/api.go:101 msg="Serving Api" addr=:8081
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=2
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:04:39 level=INFO source=/home/installadm/dev/sparrow/pkg/config/file.go:93 msg="Successfully got local runtime configuration"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:04:39 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
time=11:05:09 level=INFO source=/home/installadm/dev/sparrow/pkg/config/file.go:93 msg="Successfully got local runtime configuration"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:166 msg="Checking health"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:171 msg="Getting health status for each target in separate routine" amount=4
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:208 msg="Waiting for all routines to finish"
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.google.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.example.com
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://www.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:193 msg="Starting retry routine to get health status" target=https://gitlab.devops.telekom.de
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.example.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.google.com status=healthy
time=11:05:09 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://www.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:199 msg="Successfully got health status of target" target=https://gitlab.devops.telekom.de status=healthy
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:211 msg="Successfully got health status from all targets"
time=11:05:10 level=DEBUG source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:95 msg="Successfully finished health check run"
^Ctime=11:05:19 level=INFO source=/home/installadm/dev/sparrow/cmd/run.go:96 msg="Signal received, shutting down"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/checks/health/health.go:80 msg="Context canceled" err="context canceled"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/sparrow/controller.go:146 msg="Failed to run check" check=health error="context canceled"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/run.go:185 msg="Shutting down sparrow"
time=11:05:19 level=ERROR source=/home/installadm/dev/sparrow/pkg/api/api.go:110 msg="Failed to serve api" error="http: Server closed" scheme=http
time=11:05:19 level=DEBUG source=/home/installadm/dev/sparrow/pkg/sparrow/metrics/metrics.go:132 msg="Tracing shutdown"
time=11:05:19 level=DEBUG source=/home/installadm/dev/sparrow/pkg/config/file.go:135 msg="Sending signal to shut down file loader"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/controller.go:84 msg="Shutting down checks controller"
time=11:05:19 level=INFO source=/home/installadm/dev/sparrow/pkg/sparrow/run.go:134 msg="Sparrow was shut down"
Used startup/runtime config
name: sparrow.localhost

api:
  address: :8081

telemetry:
  exporter: http
  tls:
    enabled: true
  url: <redacted>

loader:
  type: file
  interval: 30s
  file:
    path: ./.tmp/.sparrow.yaml

health:
  interval: 30s
  timeout: 10s
  retry:
    count: 3
    delay: 1s
  targets:
    - https://www.example.com
    - https://www.google.com
    - https://www.telekom.de
    - https://gitlab.devops.telekom.de

}
147 changes: 147 additions & 0 deletions pkg/sparrow/controller_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -103,6 +103,58 @@ func TestRun_ContextCancellation(t *testing.T) {
}
}

// TestChecksController_Shutdown tests the shutdown of the ChecksController
// when none, one or multiple checks are registered. The test checks that after shutdown no
// checks are registered anymore (the checks slice is empty) and that the done channel is closed.
func TestChecksController_Shutdown(t *testing.T) {
tests := []struct {
name string
checks []checks.Check
}{
{
name: "no checks registered",
checks: nil,
},
{
name: "one check registered",
checks: []checks.Check{newMockCheck(t, "mockCheck")},
},
{
name: "multiple checks registered",
checks: []checks.Check{
newMockCheck(t, "mockCheck1"),
newMockCheck(t, "mockCheck2"),
newMockCheck(t, "mockCheck3"),
newMockCheck(t, "mockCheck4"),
},
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
cc := NewChecksController(db.NewInMemory(), metrics.New(metrics.Config{}))

if tt.checks != nil {
for _, check := range tt.checks {
cc.RegisterCheck(context.Background(), check)
}
}

cc.Shutdown(context.Background())

select {
case <-cc.done:
if len(cc.checks.Iter()) != 0 {
t.Errorf("Expected no checks to be registered")
}
return
case <-time.After(time.Second):
t.Fatal("Expected done channel to be closed")
}
})
}
}

func TestChecksController_Reconcile(t *testing.T) {
ctx, cancel := logger.NewContextWithLogger(context.Background())
defer cancel()
Expand Down Expand Up @@ -235,6 +287,74 @@ func TestChecksController_Reconcile(t *testing.T) {
}
}

// TestChecksController_Reconcile_Update tests the update of the checks
// when the runtime configuration changes.
func TestChecksController_Reconcile_Update(t *testing.T) {
ctx, cancel := logger.NewContextWithLogger(context.Background())
defer cancel()

tests := []struct {
name string
checks []checks.Check
newRuntimeConfig runtime.Config
}{
{
name: "update health check",
checks: []checks.Check{
health.NewCheck(),
},
newRuntimeConfig: runtime.Config{
Health: &health.Config{
Targets: []string{"https://new.com"},
Interval: 200 * time.Millisecond,
Timeout: 1000 * time.Millisecond,
},
},
},
{
name: "update health & latency check",
checks: []checks.Check{
health.NewCheck(),
latency.NewCheck(),
},
newRuntimeConfig: runtime.Config{
Health: &health.Config{
Targets: []string{"https://new.com"},
Interval: 200 * time.Millisecond,
Timeout: 1000 * time.Millisecond,
},
Latency: &latency.Config{
Targets: []string{"https://new.com"},
Interval: 200 * time.Millisecond,
Timeout: 1000 * time.Millisecond,
},
},
},
}

for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
cc := NewChecksController(db.NewInMemory(), metrics.New(metrics.Config{}))
for _, c := range tt.checks {
cc.checks.Add(c)
}

cc.Reconcile(ctx, tt.newRuntimeConfig)

for _, c := range cc.checks.Iter() {
switch c.GetConfig().For() {
case health.CheckName:
hc := c.(*health.Health)
assert.Equal(t, tt.newRuntimeConfig.Health.Targets, hc.GetConfig().(*health.Config).Targets)
case latency.CheckName:
lc := c.(*latency.Latency)
assert.Equal(t, tt.newRuntimeConfig.Latency.Targets, lc.GetConfig().(*latency.Config).Targets)
}
}
})
}
}

func TestChecksController_RegisterCheck(t *testing.T) {
tests := []struct {
name string
Expand Down Expand Up @@ -379,3 +499,30 @@ func TestGenerateCheckSpecs(t *testing.T) {
})
}
}

// newMockCheck creates a new mock check with the given name.
func newMockCheck(t *testing.T, name string) *checks.CheckMock {
t.Helper()
return &checks.CheckMock{
GetMetricCollectorsFunc: func() []prometheus.Collector {
return []prometheus.Collector{
prometheus.NewCounter(prometheus.CounterOpts{
Name: fmt.Sprintf("%s_mock_metric", name),
}),
}
},
NameFunc: func() string {
return name
},
RemoveLabelledMetricsFunc: nil,
RunFunc: func(ctx context.Context, cResult chan checks.ResultDTO) error {
t.Logf("Run called for check %s", name)
return nil
},
SchemaFunc: nil,
ShutdownFunc: func() {
t.Logf("Shutdown called for check %s", name)
},
UpdateConfigFunc: nil,
}
}
2 changes: 1 addition & 1 deletion pkg/sparrow/handlers_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,7 @@ func TestSparrow_handleCheckMetrics(t *testing.T) {
}

s.handleCheckMetrics(w, r)
resp := w.Result() //nolint:bodyclose
resp := w.Result()
body, _ := io.ReadAll(resp.Body)

if tt.wantCode == http.StatusOK {
Expand Down
5 changes: 3 additions & 2 deletions pkg/sparrow/run.go
Original file line number Diff line number Diff line change
Expand Up @@ -131,6 +131,7 @@ func (s *Sparrow) Run(ctx context.Context) error {
s.shutdown(ctx)
}
case <-s.cDone:
log.InfoContext(ctx, "Sparrow was shut down")
return fmt.Errorf("sparrow was shut down")
}
}
Expand Down Expand Up @@ -181,7 +182,7 @@ func (s *Sparrow) shutdown(ctx context.Context) {
defer cancel()

s.shutOnce.Do(func() {
log.Info("Shutting down sparrow gracefully")
log.InfoContext(ctx, "Shutting down sparrow")
var sErrs ErrShutdown
if s.tarMan != nil {
sErrs.errTarMan = s.tarMan.Shutdown(ctx)
Expand All @@ -192,7 +193,7 @@ func (s *Sparrow) shutdown(ctx context.Context) {
s.controller.Shutdown(ctx)

if sErrs.HasError() {
log.Error("Failed to shutdown gracefully", "contextError", errC, "errors", sErrs)
log.ErrorContext(ctx, "Failed to shutdown gracefully", "contextError", errC, "errors", sErrs)
}

// Signal that shutdown is complete
Expand Down
3 changes: 3 additions & 0 deletions pkg/sparrow/run_errors.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,12 +18,15 @@

package sparrow

// ErrShutdown holds any errors that may
// have occurred during shutdown of the Sparrow
type ErrShutdown struct {
errAPI error
errTarMan error
errMetrics error
}

// HasError returns true if any of the errors are set
func (e ErrShutdown) HasError() bool {
return e.errAPI != nil || e.errTarMan != nil || e.errMetrics != nil
}
Loading