Flaky test: TestBuildWCOWSandboxSize (insufficient disk space?) #42743
Ahm, right, so looking more closely at the test: the test is a bit weird, because it passes if it either "builds successfully" or "fails" because it ran out of disk space.
For the failure case, the output should contain:
But it sometimes gets:
Perhaps the error changed in Windows or in hcsshim?
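For reference, a minimal sketch (not the actual TestBuildWCOWSandboxSize code) of the pass condition described above, where either a successful build or the specific out-of-disk-space error is accepted; the helper name and the way the output is captured are illustrative assumptions:

```go
package build

import (
	"strings"
	"testing"
)

// assertSandboxSizeBuildResult is a hypothetical helper illustrating the
// test's pass condition: a successful build passes, and so does a failure
// caused purely by running out of disk space; anything else is a real failure.
func assertSandboxSizeBuildResult(t *testing.T, buildErr error, output string) {
	t.Helper()
	if buildErr == nil {
		return // built successfully: pass
	}
	if strings.Contains(output, "There is not enough space on the disk.") {
		t.Log("build failed because the node ran out of disk space; treating as pass")
		return
	}
	t.Fatalf("unexpected build failure: %v\noutput:\n%s", buildErr, output)
}
```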
Gotta love the duplication in that error message.
Opened microsoft/hcsshim#1139 to fix the duplicate.
So, not sure why it now fails with a different error. I somewhat suspect this is a change in Windows?
@TBBle any ideas for this one?
I was looking at this when it showed up in CI earlier, and I'm mystified. And then forgot to come back, sorry. >_< This is coming from Win32, so it's some HCS API call that's failing. The data layout should be fine, because this is importing from a tarball it created with exportLayer elsewhere, so it's also HCS that generated the data being consumed. So I can find it later, I believe we're failing this block:
I don't know why this would have suddenly appeared, since I put this test through its paces pretty heavily at the time I added it. My first guess is that something is cleaning up the temp directory from which the import is running, while the import is running, but that would be very odd; unless running low on (or out of) space is somehow triggering a naive temp-dir clean-up, e.g. Storage Sense self-activating (see the FAQ). Oddly enough, earlier this week or last week we saw a problem from Storage Sense self-activating.

We might get more info (like, which file was missing? That would help a lot) if it reproduces without re-exec, and maybe by hooking up the opencensus tracer to dockerd's log output. Unless I misremember, the equivalent of this doesn't appear in this project, so we haven't had hcsshim logs since hcsshim moved from logrus to opencensus internally. (I assume that was an oversight, but perhaps it was intentional?) A rough sketch of what that wiring could look like follows this comment.

It's also possible that there is no more information to be had from the API call, and we'd need to reproduce it under Process Monitor or something to see the filesystem access that's failing inside vmcompute.dll.

Poking around, there's microsoft/hcsshim#835, which has the same error message (but that doesn't mean a lot; I reckon there are three different issues just in that ticket which lead to the same error text). Possibilities that arise from that ticket are:
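Following up on the opencensus point above: a minimal sketch, assuming go.opencensus.io/trace and logrus, of how hcsshim's internal spans could be forwarded into dockerd-style log output. The exporter type and the fields logged are illustrative assumptions; this is not existing moby code.

```go
package main

import (
	"github.com/sirupsen/logrus"
	"go.opencensus.io/trace"
)

// logrusExporter forwards finished OpenCensus spans (such as the ones hcsshim
// emits internally) to logrus, so they end up in the daemon's log output.
type logrusExporter struct{}

func (e *logrusExporter) ExportSpan(s *trace.SpanData) {
	logrus.WithFields(logrus.Fields{
		"span":     s.Name,
		"duration": s.EndTime.Sub(s.StartTime),
		"status":   s.Status.Message,
	}).Debug("trace span")
}

func main() {
	// Record every span; by default only sampled spans reach exporters.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
	trace.RegisterExporter(&logrusExporter{})

	// ... any hcsshim calls made after this point would have their spans
	// logged via logrus.
}
```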
Good point on the virus scanner; I recall there were issues in CI w.r.t. Defender (could've been because CI uses non-standard paths for some storage; I don't recall the details). And there's some code in place to check if it's enabled (lines 273 to 283 at commit 8207c05).
(Haven't checked the logs yet to see if it's printing that warning.)
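For illustration only (this is not the moby code referenced above at lines 273 to 283): one way such a check could look, shelling out to PowerShell's Get-MpPreference to see whether the daemon root is on Defender's exclusion list. The function name, log wording, and the C:\ProgramData\docker path are assumptions.

```go
package main

import (
	"os/exec"
	"strings"

	"github.com/sirupsen/logrus"
)

// warnIfDefenderMayInterfere is a hypothetical check: if the Defender cmdlets
// are available and the daemon root is not in the exclusion list, emit a warning.
func warnIfDefenderMayInterfere(daemonRoot string) {
	out, err := exec.Command("powershell", "-NoProfile", "-NonInteractive", "-Command",
		"(Get-MpPreference).ExclusionPath").CombinedOutput()
	if err != nil {
		// Defender may not be installed (e.g. some Server Core images); nothing to warn about.
		logrus.WithError(err).Debug("could not query Windows Defender exclusions")
		return
	}
	if !strings.Contains(strings.ToLower(string(out)), strings.ToLower(daemonRoot)) {
		logrus.Warnf("%s does not appear to be excluded from Windows Defender scanning; real-time scanning can slow down or interfere with layer imports", daemonRoot)
	}
}

func main() {
	warnIfDefenderMayInterfere(`C:\ProgramData\docker`)
}
```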
Nope; just checked a failed run, and it doesn't have that message, so shouldn't be the cause 🤔 (https://ci-next.docker.com/public/blue/rest/organizations/jenkins/pipelines/moby/branches/PR-42736/runs/5/nodes/305/log/?start=0)
This test is failing frequently once nodes have less disk space available. Skipping the test for now, but we can continue looking for a good solution. Tracked through moby#42743 Signed-off-by: Sebastiaan van Stijn <[email protected]>
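The skip referenced in the commit message above is straightforward; a minimal sketch of what it might look like in the test (the exact wording used in the commit may differ):

```go
package build

import "testing"

func TestBuildWCOWSandboxSize(t *testing.T) {
	// Skipped until the flakiness is understood; see moby/moby#42743.
	t.Skip("flaky on CI nodes that are low on disk space; see moby/moby#42743")

	// ... original test body ...
}
```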
This test is failing frequently with a "There is not enough space on the disk." failure. Cleaned-up output:
We should
It's possible that the Jenkins agents run out of space over time, but they are purged/rotated frequently, which should still give us coverage for that specific test (I think?)
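If the suspicion about agents running out of space needs confirming, a small sketch (assuming golang.org/x/sys/windows) that logs free disk space on the agent before the test runs; the drive letter and output format are illustrative:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

// freeBytes returns the free bytes available to the caller on the volume
// containing path, via GetDiskFreeSpaceEx.
func freeBytes(path string) (uint64, error) {
	p, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return 0, err
	}
	var free, total, totalFree uint64
	if err := windows.GetDiskFreeSpaceEx(p, &free, &total, &totalFree); err != nil {
		return 0, err
	}
	return free, nil
}

func main() {
	free, err := freeBytes(`C:\`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("free space on C:\\: %.1f GiB\n", float64(free)/(1<<30))
}
```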