[Bug]: Error: Not connected - is it possible to make connection timeout configurable? #519

akramarev · 2024-06-24T21:53:23Z

What happened?

Please compare these two outputs from livecycle/preevy-up-action@<version>:

Build step done in 716.21s
    Error: Not connected
Error: Process completed with exit code 1.

and

Build step done in 31.32s
- Copying files: Calculating...
✔ Copied 4 files to remote machine
Running: docker compose up -d --remove-orphans --no-build
- Connecting to remote docker socket...
✔ Connected to remote docker socket

The first is a cold run of my workflow: the build stage took more than 10 minutes and succeeded, but copying the artifacts failed right after it. The second is the same workflow retried: it reused cached images, so the build phase finished in about 30s and preevy-up-action had no problem copying the artifacts to the remote machine.

I suspect there is a timeout (an SSH connection timeout?) of roughly 10 minutes somewhere. Would it be possible to make it configurable for docker-compose stacks that need a longer build phase?

Add screenshots

Please see the previous section for the error details.

Steps to reproduce the behavior

My setup:

- Public tunnel server
- Deploy runtime: AWS Lightsail
- A docker-compose file with a service whose build (build: context) step takes a long time (about 10m)
- GHA with livecycle/preevy-up-action@<version>, using the GH builder and GHCR

GHA:

...
      - name: Set up Docker Buildx
        id: buildx_setup
        uses: docker/setup-buildx-action@v3
        with:
          buildkitd-config: .github/buildkitd.toml

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ github.token }}

      - uses: livecycle/preevy-up-action@<version>
        id: preevy
        with:
          install: gh-release
          profile-url: "${{ vars.PREEVY_PROFILE_URL }}"
          args: --registry ghcr.io/my-org --builder ${{ steps.buildx_setup.outputs.name }}
          docker-compose-yaml-paths: "./docker-compose.yml"
        env:
          GITHUB_TOKEN: ${{ github.token }}

Expected behavior

Avoid the "Error: Not connected" failure when the build step takes a long time, i.e. either make the timeout configurable or retry the connection a few times.
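
To illustrate the "retry the connection a few times" option, here is a minimal sketch of a retry wrapper with exponential backoff. It is Python for illustration only and is not preevy's code: `connect_with_retry`, `NotConnectedError`, and the default attempt count and delay are hypothetical names and values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NotConnectedError(Exception):
    """Hypothetical stand-in for the 'Not connected' failure."""


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return connect()
        except NotConnectedError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the backoff schedule can be checked without actually waiting.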

What OS are you seeing the problem on?

Linux

Additional context

No response

Record

I agree to follow this project's Code of Conduct


akramarev · 2024-06-25T01:12:10Z

hmm I noticed the same issue with builds that took "only" ~7 minutes:

royra · 2024-06-29T18:56:18Z

Hey @akramarev, can you post some of the logs before the error? Preferably add --debug. I had builds longer than 10m and they did not time out. Usually during the build the ssh connection is active (messages are being sent) and there is no reason for it to time out.

What I did see in the past is builds that overload the machine's resources to the point where the ssh server hangs. You can try running top while the machine is building (connect using preevy ssh). You can also try a larger instance type or offloading the build to the GH action runner. LMK.

akramarev · 2024-06-29T20:06:44Z

Thanks for the reply @royra. I observe this problem only when I offload the build to the GHA runner (see the "My setup" section above for details). If I use the default builder (the build happens on the remote Lightsail instance), I don't have this issue.

So my suspicion is that while the GHA runner is building the image, preevy doesn't actively use the SSH connection it opened earlier, and the connection has timed out by the time GHA finishes the build and is ready to upload artifacts. During the build, I can ssh to the Lightsail machine and see that it's almost idle.

Attaching logs (build details in the middle manually redacted):
preevy-logs.txt

royra · 2024-06-29T20:26:30Z

Sorry I missed the fact that you're already offloading the build.

Can you look at the ssh server logs on the Lightsail instance? If there's nothing interesting, try changing the LogLevel setting in /etc/ssh/sshd_config.

akramarev · 2024-09-07T01:41:21Z

Thank you for your reply @royra and livecycle team for keeping this github issue open.

I noticed in /var/log/auth.log that the SSH server restarted right at the moment when the preevy action reported that it had successfully configured the new Lightsail machine:

At the same time, /var/log/syslog indicates that cloud-init is the process that restarted sshd.

Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Starting Execute cloud user/final scripts...
Sep  7 00:41:01 ip-172-26-8-210 dockerd[1011]: time="2024-09-07T00:41:01.470172700Z" level=info msg="API listen on /run/docker.sock"
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: systemd-update-utmp-runlevel.service: Succeeded.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Finished Update UTMP about System Runlevel Changes.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: Starting Instance Initialization.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: SSH CA Public Key created.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: SSH CA Public Key registered.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Stopping OpenBSD Secure Shell server...
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: ssh.service: Succeeded.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Stopped OpenBSD Secure Shell server.
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Starting OpenBSD Secure Shell server...
Sep  7 00:41:01 ip-172-26-8-210 systemd[1]: Started OpenBSD Secure Shell server.
Sep  7 00:41:01 ip-172-26-8-210 cloud-init[2353]: Lightsail: sshd restarted.
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: #############################################################
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: -----BEGIN SSH HOST KEY FINGERPRINTS-----
-- {redacted}
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: -----END SSH HOST KEY FINGERPRINTS-----
Sep  7 00:41:02 ip-172-26-8-210 cloud-init: #############################################################
Sep  7 00:41:02 ip-172-26-8-210 cloud-init[2353]: Cloud-init v. 23.3.3-0ubuntu0~20.04.1 running 'modules:final' at Sat, 07 Sep 2024 00:41:01 +0000. Up 72.92 seconds.
Sep  7 00:41:02 ip-172-26-8-210 cloud-init[2353]: Cloud-init v. 23.3.3-0ubuntu0~20.04.1 finished at Sat, 07 Sep 2024 00:41:02 +0000. Datasource DataSourceEc2Local.  Up 73.33 seconds
...
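
The pattern in this excerpt (ssh.service being stopped and started again after cloud-init's final stage begins) can be checked mechanically on other runs. Below is a minimal Python sketch, assuming the syslog message wording shown above; `sshd_restarted_during_cloud_init` is a hypothetical helper, not part of preevy or cloud-init.

```python
import re

# Patterns taken from the systemd messages in the syslog excerpt above.
CLOUD_INIT_FINAL = re.compile(r"systemd\[1\]: Starting Execute cloud user/final scripts")
STOPPED = re.compile(r"systemd\[1\]: Stopped OpenBSD Secure Shell server")
STARTED = re.compile(r"systemd\[1\]: Started OpenBSD Secure Shell server")


def sshd_restarted_during_cloud_init(lines):
    """Return True if sshd was stopped and then started after cloud-init's final stage began."""
    in_final = False
    stopped = False
    for line in lines:
        if CLOUD_INIT_FINAL.search(line):
            in_final = True
        elif in_final and STOPPED.search(line):
            stopped = True
        elif stopped and STARTED.search(line):
            return True
    return False
```

For example, it could be run over `open("/var/log/syslog")` on the instance; a `True` result on a cold run and `False` on a retried run would match what is described here.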

As expected, any further attempt to restart the job:

- does not lead to an sshd restart
- succeeds, i.e. it successfully copies the context to the Lightsail instance and runs compose

Is there anything you can suggest in this situation?

akramarev added the bug Something isn't working label Jun 24, 2024