Bug: File conflicts are not handled correctly #231

niklastreml · 2024-12-09T15:58:48Z

Is there an existing issue for this?

I have searched the existing issues

Current Behavior

If a registration file remains in a gitlab repo, like after an unsuccessful shutdown, sparrow fails to re-register at the next boot. This happens, because we're POSTing the new file against the API, which gitlab views as a conflict, as it already exists. Gitlab returns a 400 error, and sparrow fails the initial registration.

Workaround

Until this is fixed, one can manually remove the old registration file from the repo, which should allow the sparrow instance to register again.

Expected Behavior

For now we should print a proper log output that "a file with this name already exits".

Gitlab API returns a 400 with an error:

{
  "message": "A file with this name already exists"
}

A metric should show the registration status of the sparrow (registered or not). In case the registration failed the metric should be false/0.

Edit: We decided not to overwrite the file that is already present. This should also prevent from choosing the same name for different sparrows twice. We decided that the failing registration is the default behaviour for now.

Steps To Reproduce

Create a test repo
Commit a registration file for your sparrow

{"url":"http://my-sparrow.com","lastSeen":"2024-12-09T15:52:51.763660555Z"}

Start a sparrow which would try to overwrite this file:

name: sparrow.caas-t02.telekom.de
targetManager:
  checkInterval: 1m
  enabled: true
  gitlab:
    baseUrl: https://gitlab.com
    projectId: 123456
  registrationInterval: 10m
  scheme: https
  type: gitlab
  unhealthyThreshold: 48h
  updateInterval: 5h

Observe logs

Relevant logs and/or screenshots, environment information, etc.

{"time":"2024-12-09T15:04:37.426700882Z","level":"ERROR","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*manager).register","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/manager.go","line":180},"msg":"Failed to register global gitlabTargetManager","error":"request failed, status is 400 Bad Request"}
{"time":"2024-12-09T15:04:37.426726504Z","level":"WARN","source":{"function":"github.com/caas-team/sparrow/pkg/sparrow/targets.(*manager).Reconcile","file":"/home/runner/work/sparrow/sparrow/pkg/sparrow/targets/manager.go","line":100},"msg":"Failed to register self as global target","error":"request failed, status is 400 Bad Request"}

Who can address the issue?

Any Dev

Anything else?

https://docs.gitlab.com/ee/api/repository_files.html#update-existing-file-in-repository

puffitos · 2024-12-09T19:42:51Z

This was intended behavior, to be honest. The sparrow should remain simple and not try to assume anything about the backend repository, if it cannot register itself because of a sparrow with the same name. As you've mentioned, this cpuld indicate another concurrent sparrow running, which is something the user should first handle themselves.

The maintainers of the backend should keep their inventory in check. If we want to make sure that an unregistered sparrow is unhealthy, we should make that part of the healthiness endpoint and not make assumptions about the target repository. Of course, only if the targetManager is active.

I find the idea of a forceUpdate flag good middle ground, as it moves the responsibility to the user. Its implementation may vary heavily depending on the target backend, which may also be another can of worms we don't wanna open.

niklastreml · 2024-12-10T08:31:49Z

if we want to handle the case of multiple sparrows properly,we might want to make it easier for the sparrows to distinguish this case of multiple instances. The same issue could still arise now in the following case:

Sparrow1 registers as sparrow.com
Sparrow2 registers as sparrow.com => fails
User removes duplicate config
Sparrow2 creates new config, before Sparrow1 tries to update the file
Now both Sparrow1 and Sparrow2 are happily writing to the same config, unaware of each other

I think this is highly unlikely to be honest. If we really want to detect this case properly, we might want to generate a random token on startup and include it in the registration. That way sparrow can detect, if the config has a token different from its own and log the case.

I'm gonna try to implement something this week, to allow a forceUpdate flag, so this is easier to fix for users. Although I do kind of worry, that this will be the kind of flag that is always going to be set to true, as it's easier

y-eight · 2024-12-11T13:53:24Z

I have updated the issue with the output of the discussion some minutes ago.

puffitos · 2024-12-25T18:05:12Z

Regarding the 400 from the POST request:

This shouldn't happen if users properly clean up their registration repos - the only case of this happening is when the scheme of the sparrow changes, since we save the file with just the DNS name and not the scheme in its name (the scheme came after the conception of the target manager). The target manager catches the case where the sparrow didn't deregister properly by checking for the presence of a registration file:

https://github.com/caas-team/sparrow/blob/92835cfcbcdded009e925495d1705ab3312a81b9/pkg/sparrow/targets/manager.go#L259-L264

In action:

time=18:51:18 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/remote/gitlab/gitlab.go:234 msg="Successfully fetched file list recursively" files=1
time=18:51:18 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/remote/gitlab/gitlab.go:151 msg="Successfully fetched file" file=dev.telekom.de.json
time=18:51:18 level=INFO source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/remote/gitlab/gitlab.go:105 msg="Successfully fetched all target files" files=1
time=18:51:18 level=DEBUG source=/Users/puffito/dev/repos/sparrow/pkg/sparrow/targets/manager.go:261 msg="Found self as global target" lastSeenMinsAgo=21.492964566666668

So this error can really only happen if someone changed their scheme from http -> https (for example) and now the sparrow can't find itself anymore and tries to reregister. It that case, the user should have either:

explained in an issue why the sparrow didn't deregister properly, so we can address it
clean up the old registration file themselves (the error is a 400 -> something's wrong on our side)

Just marshalling the error message here isn't the most sustainable solution here. It adds code which could still result in errors not being properly decoded into a BadRequest struct (see the various possible structures of a 400 Error in the official docs. ATM we're only getting

{
  "message": "A file with this name already exists"
}

but who knows, what other kind of error message is possible in this case. Just reading the body in case of an error seems a bit simpler and it should provide the information needed for a non-technical user.

IMO, we shouldn't start decoding the Gitlab Errors. We're not a Gitlab client, we're a canary checker that offers some integration with a remote Gitlab repository as a state manager. Conveying the error code and the response body should be more than enough.

niklastreml added bug Something isn't working area/target-manager Issues/PRs related to the TargetManager labels Dec 9, 2024

niklastreml self-assigned this Dec 10, 2024

niklastreml removed their assignment Dec 12, 2024

puffitos self-assigned this Dec 23, 2024

puffitos mentioned this issue Dec 23, 2024

feat/target-manager-registered-metric #243

Merged

4 tasks

puffitos mentioned this issue Dec 25, 2024

Feat/decode gitlab error #246

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: File conflicts are not handled correctly #231

Bug: File conflicts are not handled correctly #231

niklastreml commented Dec 9, 2024 •

edited by y-eight

Loading

puffitos commented Dec 9, 2024

niklastreml commented Dec 10, 2024

y-eight commented Dec 11, 2024 •

edited

Loading

puffitos commented Dec 25, 2024

Bug: File conflicts are not handled correctly #231

Bug: File conflicts are not handled correctly #231

Comments

niklastreml commented Dec 9, 2024 • edited by y-eight Loading

Is there an existing issue for this?

Current Behavior

Workaround

Expected Behavior

Steps To Reproduce

Relevant logs and/or screenshots, environment information, etc.

Who can address the issue?

Anything else?

puffitos commented Dec 9, 2024

niklastreml commented Dec 10, 2024

y-eight commented Dec 11, 2024 • edited Loading

puffitos commented Dec 25, 2024

niklastreml commented Dec 9, 2024 •

edited by y-eight

Loading

y-eight commented Dec 11, 2024 •

edited

Loading