Connections refused (err 111) connecting to unix:///var/run/iotedge/workload.sock after edgeAgent restart #5505
Comments
In case it is helpful, I have some additional info. Most of our devices have been deployed and working without issue for over 6 months, some of them were running iotedge 1.0.x for some time, and all of them have generally been updated on a weekly basis. We have not had stability issues such as I described above until the past 6 weeks or so. Although until recently everything worked well, I figured I'd go ahead and verify that my Linux Kernels are compatible with moby, so I went ahead and ran the script from here:
Also, given that our entire device fleet is now affected by this issue and it increases the likelihood of data loss, I am going to open an Azure support request and include a link to this issue in that request. |
FYI I have tested downgrading and this issue is not reproducible after downgrading |
Looking into this and will get back to you soon. |
Can you enable debug logs for aziot-edged? At the time (17:18:02) edgeHub failed with the 111 error, there are no logs available from edged after 17:17:36. Maybe it crashed and was not running, which would explain why the error shows up. I will try to repro on my end in the meantime. If you can get debug logs, please attach them as well. Thanks! |
@ggjjj since I am on iotedge 1.1.X it seems the system command doesn't exist:
Also, I tried setting |
When you can let me know about the above I will get debug level logs for you ASAP. FYI, after downgrading our fleet of devices to |
Please use this method to enable debug logs : https://docs.microsoft.com/en-us/azure/iot-edge/troubleshoot?view=iotedge-2018-06&preserve-view=true#check-the-status-of-the-iot-edge-security-manager-and-its-logs |
Hi @ggjjj here's another reproduction of the issue (
aziot-edged logs
edge-agent logs
edge-hub logs
|
I think the issue is due to a change introduced in 1.1.6.
|
Are there any updates on this issue? The workaround @huguesBouvier shared works in most but not all cases for our set of devices. Exacerbating the issue is that the workaround doesn't work on a non-changing set of devices, so we now have a few devices with 14+ days of data backlogged which we're risking losing forever as time moves forward and the issue persists. |
Hello, we are working on releasing the fix, though I can't commit on a specific date. |
Understood - thank you for the update! |
@jackt-moran 1.1.7 was released a few hours ago with a fix for this issue. Please test this out and let us know if it helps! |
@veyalla I had the same problem with 1.1.6, looked into it for some days. |
@veyalla that's great news! 1.1.7 has fixed the issue in our staging environment, and I'm phasing the patch out to the rest of our devices. I will update this issue with my findings. |
@veyalla I think I still have the problem. I've pushed a new version of our module to a system, and now the edgeHub is complaining with the same error (Permission denied on workload.sock). |
@wvangeem -
What components did you revert? Also, to help us debug further, can you provide us a little more information -
@huguesBouvier fyi. |
@wvangeem This is an unexpected issue; changing or adding a module should not have an impact on the edgeHub workload.sock permissions. |
@huguesBouvier Here are the screenshots of the requested folders when back on 1.1.7. I'm still on 1.1, so the folders are iotedge instead of aziot. The subfolder edged is not there in this case. But the problems start coming up, if I enable the 4G modem on the device, seems the problem is there. I have to look further into it. |
Thanks! If the issue happens again, please let us know and tell what you see in that "/var/lib/aziot/edged/mnt" folder (not "/var/run/iotedge.") |
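One way to gather what is being asked for here is to list every entry in the mount folder together with its type, since a socket path that turns out to be a directory (or a plain file) is worth reporting. This is a minimal sketch, not an official diagnostic: the helper names `classify_entry` and `report_mnt` are hypothetical, and the default folder is simply the path mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report what kind of filesystem entry a path is.
classify_entry() {
    local path="$1"
    if [ -S "$path" ]; then
        echo "socket"
    elif [ -d "$path" ]; then
        echo "directory"
    elif [ -e "$path" ]; then
        echo "other"
    else
        echo "missing"
    fi
}

# Hypothetical helper: print "<type> <path>" for every entry in a folder.
# The default folder is an assumption taken from the comment above.
report_mnt() {
    local dir="${1:-/var/lib/aziot/edged/mnt}"
    local entry
    for entry in "$dir"/*; do
        # Skip the unexpanded glob when the folder is empty.
        [ -e "$entry" ] || continue
        printf '%s %s\n' "$(classify_entry "$entry")" "$entry"
    done
}
```

Calling `report_mnt` with no argument walks `/var/lib/aziot/edged/mnt`; on installations that use a different folder, pass the path explicitly as the first argument. Reading the folder may require `sudo`.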
@huguesBouvier I've had some issues with the 1.1.7 (although it isn't clear if it is a single issue or multiple). On our staging machine I found a module had failed. Based on the discussion above I inspected
The socket with odd permissions is the one that I found to be failing, and it had a Python stack trace as follows (which I've seen throughout the lifetime of this issue):
In a test, I deleted that socket and ran
Hopefully this gives a clue as to what's going on. I got partially through phasing the update out, after which I noticed issues on patched production devices, so I didn't phase out any further. The production devices are having a different looking issue which I will share below. For all I know it could be 2 different manifestations of the same issue or 2 separate issues. |
Below is a sample of the iotedged logs
|
One additional piece of info, in case it provides any clues: every device that has been having issues such as the above 2 cases shows 2 zombie processes when I log on to the machine; however, when I try to find the zombie processes through |
Another info that comes to mind, like @wvangeem 's environment, our devices use a 4G LTE internet connection via a Cradlepoint (this is our only option). Seems like there is a common thread there. |
@huguesBouvier do you have any ideas why |
Hum, I don't know. This seems like a separate issue. Could you create a different ticket for this? |
This issue can be explained: |
Fix for #5672, #5505, #5693

Restarting iotedge many times leads to the following permission issues:

This doesn't seem to happen in 1.1.6. I tried many times but could not get it to fail; however, some customers experienced the same symptoms, so this is most likely an issue across versions. We could not explain why it seems to fail more often on some setups.

Tests:

1. Tried all of those. A few times manually (3-5 times), and in a cycle of hundreds of times each with a script (Ubuntu + CentOS). For each test we check that all modules are up and running, that permissions and user are correct, and that there is a listener on the socket:
   1.1 sudo iotedge system restart
   1.2 sudo iotedge system stop + delete all containers + sudo iotedge system restart
   1.3 sudo iotedge system stop + delete /var/lib/aziot/edged/mnt/ folder + sudo iotedge system restart
   1.4 sudo iotedge system stop + delete all containers + delete /var/lib/aziot/edged/mnt/ folder + sudo iotedge system restart
   1.5 sudo iotedge system stop + delete workloads inside /var/lib/aziot/edged/mnt + create a sudo dir inside with the name of the sockets + sudo iotedge system restart

   Logs: [ubuntu18.txt](https://github.com/Azure/iotedge/files/7358293/ubuntu18.txt) [centos.txt](https://github.com/Azure/iotedge/files/7358294/centos.txt)

   Script code:
```
while :
do
    echo "Test 1"
    sudo iotedge system restart
    sleep 310s
    sudo iotedge list
    ls -l /var/lib/aziot/edged/mnt/
    curl --unix-socket /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeAgent.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeHub.sock http://127.0.0.1

    echo "Test 2"
    sudo iotedge system stop
    sudo docker rm -f edgeAgent
    sudo docker rm -f edgeHub
    sudo docker rm -f SimulatedTemperatureSensor
    sudo iotedge system restart
    sleep 120s
    sudo iotedge list
    ls -l /var/lib/aziot/edged/mnt/
    curl --unix-socket /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeAgent.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeHub.sock http://127.0.0.1

    echo "Test 3"
    sudo iotedge system stop
    sudo rm -r /var/lib/aziot/edged/mnt
    sudo iotedge system restart
    sleep 310s
    sudo iotedge list
    ls -l /var/lib/aziot/edged/mnt/
    curl --unix-socket /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeAgent.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeHub.sock http://127.0.0.1

    echo "Test 4"
    sudo iotedge system stop
    sudo docker rm -f edgeAgent
    sudo docker rm -f edgeHub
    sudo docker rm -f SimulatedTemperatureSensor
    sudo rm -r /var/lib/aziot/edged/mnt
    sudo iotedge system restart
    sleep 120s
    sudo iotedge list
    ls -l /var/lib/aziot/edged/mnt/
    curl --unix-socket /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeAgent.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeHub.sock http://127.0.0.1

    echo "Test 5"
    sudo iotedge system stop
    sudo rm /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock
    sudo rm /var/lib/aziot/edged/mnt/edgeAgent.sock
    sudo rm /var/lib/aziot/edged/mnt/edgeHub.sock
    # Replace each socket with a directory of the same name.
    sudo mkdir /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock
    sudo mkdir /var/lib/aziot/edged/mnt/edgeAgent.sock
    sudo mkdir /var/lib/aziot/edged/mnt/edgeHub.sock
    sudo iotedge system restart
    sleep 310s
    sudo iotedge list
    ls -l /var/lib/aziot/edged/mnt/
    curl --unix-socket /var/lib/aziot/edged/mnt/SimulatedTemperatureSensor.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeAgent.sock http://127.0.0.1
    curl --unix-socket /var/lib/aziot/edged/mnt/edgeHub.sock http://127.0.0.1
done
```
2. Pipeline run: Package: 47895915 Run https://dev.azure.com/msazure/One/_build/results?buildId=47896816&view=results
5. Test the protection against 2 listeners: Crashed edgeAgent to have edged restart it. EdgeAgent doesn't get stopped before starting like other modules: Listener EdgeAgent already started, removing old listener. Confirmed that 2 listeners is not a problem.

# Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

### General Guidelines and Best Practices
- [x] I have read the [contribution guidelines](https://github.com/azure/iotedge#contributing).
- [x] Title of the pull request is clear and informative.
- [x] Description of the pull request includes a concise summary of the enhancement or bug fix.

### Testing Guidelines
- [x] Pull request includes test coverage for the included changes.
- Description of the pull request includes
  - [ ] concise summary of tests added/modified
  - [x] local testing done.

### Draft PRs
- Open the PR in `Draft` mode if it is:
  - Work in progress or not intended to be merged.
  - Encountering multiple pipeline failures and working on fixes.

_Note: We use the kodiakhq bot to merge PRs once the necessary checks and approvals are in place. When it merges a PR, kodiakhq converts the PR title to the commit title, PR description to the commit description, and squashes all the commits in the PR to a single commit. The net effect is that entire PR becomes a single commit. Please follow the best practices mentioned [here](https://chris.beams.io/posts/git-commit/#:~:text=The%20seven%20rules%20of%20a%20great%20Git%20commit,what%20and%20why%20vs.%20how%20For%20example%3A%20) for the PR title and description_
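The per-test checks described in this PR (permissions, owner, and a listener on each socket) could be scripted roughly as below. This is a sketch under stated assumptions, not the script used in the PR: the helper names are hypothetical, `stat -c` assumes GNU coreutils on Linux, and the listener probe reuses the same `curl --unix-socket` call as the test loop.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<octal mode> <owner>:<group>" for a path (GNU stat).
describe_perms() {
    stat -c '%a %U:%G' "$1"
}

# Hypothetical helper: succeed only if the path is a socket and something
# answers an HTTP request on it within 5 seconds.
has_listener() {
    local sock="$1"
    [ -S "$sock" ] || return 1
    curl --silent --output /dev/null --max-time 5 \
        --unix-socket "$sock" http://127.0.0.1
}

# Hypothetical helper: one summary line per *.sock entry in the folder.
# The default folder is the one used by the test loop.
check_sockets() {
    local dir="${1:-/var/lib/aziot/edged/mnt}"
    local sock state
    for sock in "$dir"/*.sock; do
        # Skip the unexpanded glob when nothing matches.
        [ -e "$sock" ] || continue
        if has_listener "$sock"; then
            state="listening"
        else
            state="no-listener"
        fi
        printf '%s %s %s\n' "$sock" "$(describe_perms "$sock")" "$state"
    done
}
```

A `*.sock` entry that is actually a directory, as created in test 1.5, is reported as `no-listener` because the `-S` check fails before `curl` is ever invoked.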
@huguesBouvier that all makes sense, thanks. I have been digging into the networking issue and if it looks like an issue with IoT Edge I'll be sure to make a ticket. That said, my edgeAgents with the issue continually are getting this output:
I take it that this could mean more than just an undefined config, is that correct? I assume this because I know for a fact the config is defined, despite what the log suggests. |
This is something different, config and workload are not related. get-netadapterrsc > before-disable.txt |
We are not using Windows, do you know of a similar workaround for Ubuntu? I will create a separate ticket for this as soon as I'm sure it doesn't have something to do with our hardware or networking. We are ruling things out on our end, and I will mention that I re-created the VM in question from scratch and it immediately ran into the same issue upon initial runtime install, which to me suggests it is something in the environment, not IoT Edge, that is causing the issue. That said, if it is an issue with our environment, I expect there may be at least some minor tweaks to IoT Edge that would be useful (such as better error messages) so I will make a ticket when I have more information. |
Overall, version 1.1.7 has helped with the issue on our side, but it doesn't seem completely resolved. We have a few devices that had failing modules and they were failing because of mangled permissions in /var/run/iotedge/mnt. This was after over 24 hours of proper operation so it is unclear how these modules and their sockets got into this state, but they were for some time stable.

Separately, is there a way to increase the timeout to download containers (for iotedged/edgeAgent)? Some of our devices have a weak connection and it is hard to tell if issues we are seeing on those vessels are due to this issue or bad connectivity, because edgeAgent continually times out trying to download the containers. FWIW, I actually suspect a weak connection induces this issue somehow, but I have no evidence of that yet apart from anecdotal. If I can increase the timeout to, say, 3-5 minutes, that would be very helpful in understanding what's going on. |
On some devices we are also seeing many errors in edgeAgent such as this:
The sockets look like this:
So it doesn't seem to be the sockets that are the issue. Also, the edgeAgent logs above are the result after a clean install. Any idea what's going on here and what a workaround might be? |
FYI it seems like the above edgeAgent logs were resolved by issuing a |
"Overall, version 1.1.7 has helped with the issue on our side, but it doesn't seem completely resolved. We have a few devices that had failing modules and they were failing because of mangled permissions in /var/run/iotedge/mnt. This was after over 24 hours of proper operation so it is unclear how these modules and their sockets got into this state, but they were for some time stable."

"Overall, version 1.1.7 has helped with the issue on our side" => Yes, that issue suddenly appeared, though it doesn't seem to happen on every setup. We have a fix merged in master: #5698

"Separately, is there a way to increase the timeout to download containers (for iotedged/edgeAgent)?" This is not possible :(.

""/var/lib/iotedge/mnt/mongodb.sock" to rootfs at "/var/run/iotedge/workload.sock" caused: mount through procfd: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? " This is strange; I expected to see a root folder in its place. I don't understand the permissions; owner and group are correct. Maybe you can get more details from the docker logs?

"docker rm " => This is also strange; it means the socket was indeed correct.

If you try to bootstrap edgeAgent with an older edgeAgent, does that help? Try putting edgeAgent 1.1.4 or below in config.yaml; it will only be used to pull your edgeAgent 1.1.7 from your config in the portal. The effect, however, would be that the mnt folder is ignored and only sockets in /var/lib/iotedge would be used. Those are managed by systemd. Maybe it helps? |
Version 1.2.4: I was able to hotfix this issue by reinstalling completely. This is Kali Linux on a RPi4B.

copy config to tmp
    sudo cp /etc/aziot/config.toml /var/tmp

install docker
    sudo apt-get remove aziot-edge --purge
    sudo systemctl start docker

install edge
    sudo curl -L "https://github.com/Azure/azure-iotedge/releases/download/1.2.4/aziot-identity-service_1.2.3-1_debian10_arm64.deb" -o aziot-identity-service.deb
    sudo systemctl enable aziot-identityd.service --now
    sudo apt install docker.io -y
    sudo cp /var/tmp/config.toml /etc/aziot/
    sudo iotedge config apply

This issue almost singlehandedly ruined our product... |
@cgrundem Sorry to hear that, we have a patch on the way. That aside, we have a fix but we are still trying to figure out how that issue is happening. |
Once the issue has occurred, no restart of docker or edge makes it work again. Only a reinstall does, for us. After the reinstall, there is no assurance that it will not happen again; it just makes things work for the time being. If we change networks, the issue happens probably upwards of 50% of the time. For example, my coworker would go from home (works upon reboot etc.) -> office (sometimes works, sometimes fails with socket error) -> home (sometimes fails with socket error, sometimes works). The office has poor Wifi and some strange network configuration. Otherwise, I've seen it happen randomly (but very rarely) while on the same internet connection. It is definitely more likely to happen when the LAN on the RPi overheats. More generally, it also fails more easily when the Wifi signal is poor. Another issue we had to work around is edge failing with a runtime error after reboots with no internet. There is another post on the issues here about that, I think. |
@cgrundem The workaround is to: Also using edgeAgent 1.2.3 or below in config.yaml might help. |
FYI @huguesBouvier my team has determined the root cause of the issue quoted below was from a hardware issue in our network setup. Now that we have resolved the issue at the hardware level there is no issue as far as we're aware of. As such, I won't be creating a separate ticket for the below.
|
The 1.1.8 release went out yesterday, and we believe it should take care of any remaining issues related to the workload socket that we're aware of. Closing this issue; please reopen if your tests with 1.1.8 show any related problems. |
We have observed this problem on devices running iotedge 1.1.8 after being upgraded to 1.4. The aziot edge config still points to the 1.1 agent as bootstrap module |
we also have this problem in aziot-edged 1.4.2, aziot-edged-identity 1.4.1, edgeAgent 1.2.8, edgeHub 1.2.8 |
We have the same error on 1.3 today on a production environment and edgeHub is down so we are missing data! Any news on a solution in these releases?
edgeHub logs
edgeAgent logs
|
We encounter this issue with 1.4.29, with a container stuck in a startup loop.
$ docker start pxeBoot
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/aziot/edged/mnt/pxeBoot.sock" to rootfs at "/var/run/iotedge/workload.sock": mount /var/lib/aziot/edged/mnt/pxeBoot.sock:/var/run/iotedge/workload.sock (via /proc/self/fd/6), flags: 0x5000: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
Error: failed to start containers: pxeBoot
Executing docker rm pxeBoot resolved the issue right away.
edgeAgent logs
<3> 2024-04-19 16:57:57.077 +00:00 [ERR] - Executing command for operation ["start"] failed.
Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleMa>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
<3> 2024-04-19 16:57:57.079 +00:00 [ERR] - Executing command for operation ["Command Group: (\n [Stop module pxeBoot]\n [Start module pxeBoot]\n [Saving pxeBoot to store]\n)"] failed.
Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleMa>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/commands/GroupCommand.cs:line 35
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
<3> 2024-04-19 16:57:57.082 +00:00 [ERR] - Step failed in deployment 3, continuing execution. Failure when running command Command Group: (
[Stop module pxeBoot]
[Start module pxeBoot]
[Saving pxeBoot to store]
). Will retry in 00s.
<3> 2024-04-19 16:57:57.083 +00:00 [ERR] - Edge agent plan execution failed.
System.AggregateException: One or more errors occurred. (Error calling start module pxeBoot: runtime operation error: start module "pxeBoot")
---> Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Mo>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/commands/GroupCommand.cs:line 35
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.c>
--- End of inner exception stack trace ---
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.<>c.b__7_0(List`1 f) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.cs:line 129
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token)
at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/Agent.cs:line 208
<4> 2024-04-19 16:57:57.127 +00:00 [WRN] - Reconcile failed because of the an exception
System.AggregateException: One or more errors occurred. (Error calling start module pxeBoot: runtime operation error: start module "pxeBoot")
---> Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Mo>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/commands/GroupCommand.cs:line 35
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.c>
--- End of inner exception stack trace ---
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.<>c.b__7_0(List`1 f) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.cs:line 129
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token)
at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/Agent.cs:line 208
at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/Agent.cs:line 208 |
Might be related to this issue?
<!--Must NOT UPGRADE this package - see: Azure/azure-iot-sdk-csharp#3413 -->
<!--<PackageReference Include="Microsoft.Azure.Devices.Provisioning.Client" Version="1.19.2" />-->
1.19.2 works, while the latest version will fail the MQTT connection. Only took me something like 5 days of effort to find this, as the error is obscure.
HTH,
Richard
…________________________________
From: Gabríel Arthúr Pétursson ***@***.***>
Sent: April 19, 2024 10:16 AM
To: Azure/iotedge ***@***.***>
Cc: Subscribed ***@***.***>
Subject: Re: [Azure/iotedge] Connections refused (err 111) connecting to unix:///var/run/iotedge/workload.sock after edgeAgent restart (#5505)
We encounter this issue with 1.4.29, with a container stuck in a startup loop.
$ docker start pxeBoot
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error mounting "/var/lib/aziot/edged/mnt/pxeBoot.sock" to rootfs at "/var/run/iotedge/workload.sock": mount /var/lib/aziot/edged/mnt/pxeBoot.sock:/var/run/iotedge/workload.sock (via /proc/self/fd/6), flags: 0x5000: not a directory: unknown: Are you trying to mount a directory onto a file (or vice-versa)? Check if the specified host path exists and is the expected type
Error: failed to start containers: pxeBoot
Executing docker rm pxeBoot resolved the issue right away.
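For anyone triaging several devices, the affected module name can be pulled out of that `docker start` error text before deciding whether to try the same `docker rm`. This is only a sketch: the helper name `module_from_mount_error` is hypothetical, and the pattern assumes the error wording quoted above, which may differ between Docker and runc versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `docker start` error text on stdin and, if it is the
# workload.sock mount failure quoted above, print the module name taken from the
# host-side socket path. Prints nothing for any other input.
module_from_mount_error() {
    sed -n 's|.*error mounting "[^"]*/\([^/"]*\)\.sock" to rootfs at "/var/run/iotedge/workload\.sock".*|\1|p' \
        | head -n 1
}
```

For example, `docker start pxeBoot 2>&1 | module_from_mount_error` would print `pxeBoot` for the error shown above, and nothing if the container fails for a different reason.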
edgeAgent logs
<3> 2024-04-19 16:57:57.077 +00:00 [ERR] - Executing command for operation ["start"] failed.
Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleMa>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
<3> 2024-04-19 16:57:57.079 +00:00 [ERR] - Executing command for operation ["Command Group: (\n [Stop module pxeBoot]\n [Start module pxeBoot]\n [Saving pxeBoot to store]\n)"] failed.
Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleMa>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/commands/GroupCommand.cs:line 35
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
<3> 2024-04-19 16:57:57.082 +00:00 [ERR] - Step failed in deployment 3, continuing execution. Failure when running command Command Group: (
[Stop module pxeBoot]
[Start module pxeBoot]
[Saving pxeBoot to store]
). Will retry in 00s.
<3> 2024-04-19 16:57:57.083 +00:00 [ERR] - Edge agent plan execution failed.
System.AggregateException: One or more errors occurred. (Error calling start module pxeBoot: runtime operation error: start module "pxeBoot")
---> Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Mo>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/commands/GroupCommand.cs:line 35
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.c>
--- End of inner exception stack trace ---
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.<>c.b__7_0(List`1 f) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.cs:line 129
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token)
at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/Agent.cs:line 208
<4> 2024-04-19 16:57:57.127 +00:00 [WRN] - Reconcile failed because of the an exception
System.AggregateException: One or more errors occurred. (Error calling start module pxeBoot: runtime operation error: start module "pxeBoot")
---> Microsoft.Azure.Devices.Edge.Agent.Edgelet.EdgeletCommunicationException- Message:Error calling start module pxeBoot: runtime operation error: start module "pxeBoot", StatusCode:400, at: at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.Mo>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Versioning.ModuleManagementHttpClientVersioned.Execute[T](Func`1 func, String operation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/versioning/ModuleManagementHttpClientVersion>
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.Version_2022_08_03.ModuleManagementHttpClient.StartModuleAsync(String name) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/version_2022_08_03/ModuleManagementHttpClient.cs:line 180
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.<>c__DisplayClass26_0.<b__0>d.MoveNext() in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 145
--- End of stack trace from previous location ---
at Microsoft.Azure.Devices.Edge.Agent.Edgelet.ModuleManagementHttpClient.Throttle[T](Func`1 identityOperation) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Edgelet/ModuleManagementHttpClient.cs:line 164
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.Commands.GroupCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/commands/GroupCommand.cs:line 35
at Microsoft.Azure.Devices.Edge.Agent.Core.LoggingCommandFactory.LoggingCommand.ExecuteAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/LoggingCommandFactory.cs:line 69
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.c>
--- End of inner exception stack trace ---
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.<>c.b__7_0(List`1 f) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/planrunner/OrdererdRetryPlanRunner.cs:line 129
at Microsoft.Azure.Devices.Edge.Agent.Core.PlanRunner.OrderedRetryPlanRunner.ExecuteAsync(Int64 deploymentId, Plan plan, CancellationToken token)
at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/Agent.cs:line 208
at Microsoft.Azure.Devices.Edge.Agent.Core.Agent.ReconcileAsync(CancellationToken token) in /mnt/vss/_work/1/s/edge-agent/src/Microsoft.Azure.Devices.Edge.Agent.Core/Agent.cs:line 208
|
Expected Behavior
After restarting edgeAgent via
iotedge restart edgeAgent
all modules should eventually restart successfully without any further intervention.
Current Behavior
Following an
iotedge restart edgeAgent
command, modules that connect to unix:///var/run/iotedge/workload.sock are unable to do so and fail with error 111 (connection refused). This includes edgeHub, the Azure Blob Storage container, and custom Python modules using the Azure IoT Python SDK. The failing modules never recover without manual intervention. One workaround I found is to modify /etc/iotedge/config.yml and then run sudo systemctl restart iotedge.
Steps to Reproduce
Given that no one else has reported this as an issue, it seems unlikely my repro steps will lead to a repro for others. That said, the repro steps on my systems are simple:
iotedge restart edgeAgent
Context (Environment)
Output of
iotedge check
Device Information
Runtime Versions
iotedge version: iotedge 1.1.6
docker version:
Client:
Version: 20.10.8+azure
API version: 1.41
Go version: go1.16.7
Git commit: 3967b7d28e15a020e4ee344283128ead633b3e0c
Built: Thu Jul 29 13:55:47 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.8+azure
API version: 1.41 (minimum version 1.12)
Go version: go1.16.7
Git commit: 75249d88bc107a122b503f6a50e89c994331867c
Built: Fri Jul 30 01:30:57 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.9+azure
GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
runc:
Version: 1.0.1
GitCommit: 4144b63817ebcc5b358fc2c8ef95f7cddd709aa7
docker-init:
Version: 0.19.0
GitCommit:
Note: when using Windows containers on Windows, run
docker -H npipe:////./pipe/iotedge_moby_engine version
instead
Logs
aziot-edged logs
edge-agent logs
edge-hub logs
Additional Information
Because it seems relevant, here is the output of ls -alh /var/run/iotedge:
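For anyone trying to confirm the same state on their own device, here is a minimal sketch (not part of the original report) of how the workload socket could be probed from the host. It assumes Python 3 on Linux and the socket path quoted above. The helper name probe_unix_socket is hypothetical. It distinguishes a missing socket file from a socket file that exists but has no listener, which is the case that surfaces as error 111 (ECONNREFUSED).

```python
import errno
import socket

# Path taken from the issue text; adjust if your socket lives elsewhere.
WORKLOAD_SOCKET = "/var/run/iotedge/workload.sock"


def probe_unix_socket(path: str, timeout: float = 2.0) -> str:
    """Try to connect to a Unix domain socket and describe the outcome.

    Hypothetical helper for illustration; it is not part of iotedge.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "connected"
    except FileNotFoundError:
        # The socket file itself does not exist (ENOENT).
        return "missing"
    except ConnectionRefusedError:
        # The file exists but nothing is listening (ECONNREFUSED, 111 on Linux).
        return "refused"
    except OSError as exc:
        # Anything else, for example EACCES when run without permission.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
    finally:
        sock.close()


if __name__ == "__main__":
    print(WORKLOAD_SOCKET, "->", probe_unix_socket(WORKLOAD_SOCKET))
```

A result of "refused" immediately after iotedge restart edgeAgent would match the behavior described in this issue. A result of "missing" would point to a different problem, such as the socket file having been removed. Reading the socket may require root, depending on its permissions.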