
Flaky postgresql test #1027

Open

ndr-brt opened this issue Feb 5, 2024 · 26 comments · Fixed by #1574
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@ndr-brt
Contributor

ndr-brt commented Feb 5, 2024

WHAT

There's a flaky test in the "postgresql" test cluster; it fails from time to time, e.g.:
https://github.com/eclipse-tractusx/tractusx-edc/actions/runs/7785340855/job/21227835752


@ndr-brt ndr-brt added enhancement New feature or request triage all new issues awaiting classification labels Feb 5, 2024
@github-project-automation github-project-automation bot moved this to Open in EDC Board Feb 5, 2024
Contributor

github-actions bot commented Mar 9, 2024

This issue is stale because it has been open for 4 weeks with no activity.

@github-actions github-actions bot added the stale label Mar 9, 2024
@wolf4ood wolf4ood removed the stale label Mar 10, 2024
@wolf4ood wolf4ood added this to the Backlog milestone Mar 10, 2024
@wolf4ood
Contributor

After the refactoring of test parallelization done here, the test suite has been stable for the past week. I would close this and re-open it if some flaky tests emerge again.

@github-project-automation github-project-automation bot moved this from Open to Done in EDC Board Mar 29, 2024
@wolf4ood wolf4ood removed the triage all new issues awaiting classification label Mar 29, 2024
@wolf4ood
Contributor

Seems that it's still valid, reopening for investigation

https://github.com/eclipse-tractusx/tractusx-edc/actions/runs/9268208526

@wolf4ood wolf4ood reopened this May 28, 2024
@ndr-brt
Contributor Author

ndr-brt commented May 28, 2024

Looks like a new runtime with Jetty is started while another one is still running, and the ports are defined statically, so they are the same for every runtime (also across different tests).
One solution could be to generate new ports for every test; another would be to use the same runtime for all the tests (which would also make them run significantly faster).
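
A minimal sketch of the first option, assuming a JUnit 5 test class; the class and field names below are invented, and the helper is a plain ServerSocket-based port finder, not the actual EDC utility:

```java
import java.io.IOException;
import java.net.ServerSocket;

import org.junit.jupiter.api.BeforeEach;

class PostgresqlRuntimeTest {

    private int providerHttpPort;
    private int consumerHttpPort;

    // Pick fresh ports before every test instead of once per class, so two
    // runtimes started back to back never reuse the same static values.
    @BeforeEach
    void allocatePorts() {
        providerHttpPort = getFreePort();
        consumerHttpPort = getFreePort();
    }

    // Ask the OS for an ephemeral port that is free at this moment.
    private static int getFreePort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            socket.setReuseAddress(true);
            return socket.getLocalPort();
        } catch (IOException e) {
            throw new IllegalStateException("No free port available", e);
        }
    }
}
```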

@wolf4ood
Contributor

Seems strange that a runtime is started while another one is still running; we don't run them in parallel, AFAIK.

Contributor

This issue is stale because it has been open for 2 weeks with no activity.

@github-actions github-actions bot added the stale label Jun 12, 2024
Contributor

This issue was closed because it has been inactive for 7 days since being marked as stale.

1 similar comment

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jun 19, 2024
@wolf4ood wolf4ood reopened this Jun 19, 2024
@wolf4ood wolf4ood removed the stale label Jun 19, 2024
@ndr-brt
Contributor Author

ndr-brt commented Jun 27, 2024

Looks like the last time they broke on main was 3 weeks ago; maybe we fixed them unintentionally (perhaps with the upstream e2e test runtime refactoring?).

@wolf4ood
Contributor

It could be, but I think I saw some failures on Dependabot PRs. I would leave this open for now; it probably needs further investigation.

@wolf4ood
Contributor

wolf4ood commented Jul 1, 2024

Seems that a similar failure also happens upstream, though less frequently:

https://github.com/eclipse-edc/Connector/actions/runs/9743379424/job/26886697525?pr=4312

Contributor

This issue is stale because it has been open for 2 weeks with no activity.

@github-actions github-actions bot added the stale label Jul 16, 2024
@ndr-brt
Contributor Author

ndr-brt commented Jul 17, 2024

One possibility is that the Participant object (which is instantiated statically) gets a random free port, but that port is then used by the postgresql container as its host port. The probability is quite low, to be honest, but it could still happen. I'll refactor it a little, then let's see if that fixes the issue.

@ndr-brt ndr-brt self-assigned this Jul 17, 2024
@ndr-brt ndr-brt removed the stale label Jul 17, 2024
@wolf4ood
Contributor

We can try, but the linked upstream failure uses a global service from Actions, not a containerized postgres.

I also saw failures on e2e tests without postgres.

@wolf4ood
Contributor

@ndr-brt
Contributor Author

ndr-brt commented Jul 17, 2024

The upstream error is more specific because it says:
A binding for port 32762 already exists
which means another binding with the same port is defined in the same runtime (maybe because different calls to getFreePort returned the same value).

While this one says:
Address already in use
which means an external service is using the same port; it could be either postgres or mockserver (some tests use it).

In any case I think it's something related to getFreePort; maybe we could add a memory to it to avoid returning the same value twice in the same execution.
I'll open an issue upstream.
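
For illustration only, a rough sketch of that idea (not the actual upstream implementation; the class name and retry bound are invented): a getFreePort variant that remembers every port it has handed out in the current JVM and never returns it again.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class MemoizingPorts {

    // Ports already handed out during this JVM execution; never returned twice.
    private static final Set<Integer> ALLOCATED = ConcurrentHashMap.newKeySet();

    static int getFreePort() {
        for (int attempt = 0; attempt < 100; attempt++) {
            try (ServerSocket socket = new ServerSocket(0)) {
                socket.setReuseAddress(true);
                int port = socket.getLocalPort();
                // Remember the port so a later call in the same run cannot return it again.
                if (ALLOCATED.add(port)) {
                    return port;
                }
            } catch (IOException e) {
                // the socket could not be opened; fall through and retry
            }
        }
        throw new IllegalStateException("Could not allocate a unique free port after 100 attempts");
    }
}
```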

@ndr-brt
Contributor Author

ndr-brt commented Jul 29, 2024

My previous theory has been debunked; tests are still failing with the same issue 🤷
https://github.com/eclipse-tractusx/tractusx-edc/actions/runs/10141302101/job/28038270072

Contributor

This issue is stale because it has been open for 4 weeks with no activity.

@github-actions github-actions bot added the stale label Aug 31, 2024
Contributor

github-actions bot commented Sep 8, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 8, 2024
@ndr-brt
Contributor Author

ndr-brt commented Sep 23, 2024

Contributor

This issue is stale because it has been open for 4 weeks with no activity.

@github-actions github-actions bot added the stale label Oct 27, 2024
@ndr-brt ndr-brt removed the stale label Oct 28, 2024
Contributor

This issue is stale because it has been open for 4 weeks with no activity.

@github-actions github-actions bot added the stale label Nov 30, 2024
@ndr-brt ndr-brt removed the stale label Dec 2, 2024
Contributor

github-actions bot commented Jan 4, 2025

This issue is stale because it has been open for 4 weeks with no activity.

@ndr-brt
Contributor Author

ndr-brt commented Jan 9, 2025

Ok, I've got another theory:
every test that involves testcontainers could in fact cause issues, because testcontainers itself spins up a container with an exposed port, and this always happens after getFreePort is called. Here's the suspect:

301c3c46fe23   testcontainers/ryuk:0.11.0   "/bin/ryuk"              29 seconds ago   Up 28 seconds      0.0.0.0:49158->8080/tcp, :::49158->8080/tcp                           testcontainers-ryuk-5b664308-1f43-4381-962d-ed6531f5c6ef

The port it uses is generated randomly after getFreePort has been called to define the connector configuration, but before the connectors are actually started, like:

  • the EDC test runtime defines the ports to use by picking currently unused ones
  • testcontainers spins up a container that could take one of the ports already chosen by the EDC test runtime
  • the EDC test runtime then starts the runtimes, which can fail because one of their ports has been taken by testcontainers

At this point I'd see two solutions (see the sketch after this list):

  • start testcontainers before defining the ports to use (i.e. stop generating them statically as it is now)
  • force the ryuk port to be allocated via getFreePort (if that's possible)
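
A minimal sketch of the first option, assuming Testcontainers' PostgreSQLContainer and a plain ServerSocket-based port finder; the image tag, class, and method names here are illustrative, not the actual test setup:

```java
import java.io.IOException;
import java.net.ServerSocket;

import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.utility.DockerImageName;

class ContainerFirstPortsSketch {

    public static void main(String[] args) throws IOException {
        // Start testcontainers (and with it the ryuk sidecar) first, so the host
        // ports they grab are already bound before the runtime ports are chosen.
        try (PostgreSQLContainer<?> postgres =
                     new PostgreSQLContainer<>(DockerImageName.parse("postgres:16-alpine"))) {
            postgres.start();

            // Only now pick the runtime ports: anything ryuk or postgres already
            // bound cannot be handed out by the OS here.
            int providerPort = freePort();
            int consumerPort = freePort();
            System.out.printf("postgres=%d provider=%d consumer=%d%n",
                    postgres.getFirstMappedPort(), providerPort, consumerPort);
        }
    }

    // Ask the OS for a currently unused ephemeral port.
    private static int freePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }
}
```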

@rafaelmag110
Contributor

rafaelmag110 commented Jan 10, 2025

@ndr-brt Looked around and didn't see any way to force a custom port for the ryuk container...

It's weird to me how frequent the port clashes are, given the large range of ports getFreePort can draw from...

One idea to debug this would be to print the port that clashes and also log all bound ports on the host. Maybe we could identify the exact culprit then?
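
Along those lines, a debugging sketch (assuming a Linux runner where the ss tool is available); this is not existing test code, just one way to dump the host's listening sockets when a clash shows up:

```java
import java.io.IOException;

final class BoundPortsDump {

    // Print every listening TCP socket on the host (with the owning process
    // where permitted), so a failing run shows who holds the clashing port.
    static void dumpListeningSockets() throws IOException, InterruptedException {
        new ProcessBuilder("ss", "-tlnp")
                .inheritIO()   // forward the output straight to the test log
                .start()
                .waitFor();
    }
}
```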

@ndr-brt
Contributor Author

ndr-brt commented Jan 10, 2025

@rafaelmag110 a first fix has been proposed here: eclipse-edc/Connector#4712
