Flaky test: TestBuildWCOWSandboxSize (insufficient disk space?) #42743
Ahm, right, so looking more closely at the test: the test is a bit weird, because it passes if it either "builds successfully" or "fails" because it ran out of disk space.
For the failure case, the output should contain:
But it sometimes gets:
Perhaps the error changed in Windows or in hcsshim?
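For reference, a minimal sketch (not the actual TestBuildWCOWSandboxSize code) of the pass condition described above, where either a successful build or the specific out-of-disk-space error is accepted; the helper name and the way the output is captured are illustrative assumptions:

```go
package build

import (
	"strings"
	"testing"
)

// assertSandboxSizeBuildResult is a hypothetical helper illustrating the
// test's pass condition: a successful build passes, and so does a failure
// caused purely by running out of disk space; anything else is a real failure.
func assertSandboxSizeBuildResult(t *testing.T, buildErr error, output string) {
	t.Helper()
	if buildErr == nil {
		return // built successfully: pass
	}
	if strings.Contains(output, "There is not enough space on the disk.") {
		t.Log("build failed because the node ran out of disk space; treating as pass")
		return
	}
	t.Fatalf("unexpected build failure: %v\noutput:\n%s", buildErr, output)
}
```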
Gotta love the duplication in that error message.
Opened microsoft/hcsshim#1139 to fix the duplicate.
So, not sure why it now fails with a different error. I somewhat suspect this is a change in Windows?
@TBBle any ideas for this one?
I was looking at this when it showed up in CI earlier, and I'm mystified. And then forgot to come back, sorry. >_< This is coming from Win32, so it's some HCS API call that's failing. The data layout should be fine, because this is importing from a tarball it created with exportLayer elsewhere, so it's also HCS that generated the data being consumed. So I can find it later, I believe we're failing this block:
I don't know why this would have suddenly appeared, since I put this test through its paces pretty heavily at the time I added it. My first guess is that something is cleaning up the temp directory from which the import is running, while the import is running, but that would be very odd; unless running low on (or out of) space is somehow triggering a naive temp-dir clean-up, e.g. Storage Sense self-activating (see the FAQ). Oddly enough, earlier this week or last week we saw a problem from Storage Sense self-activating.

We might get more info (like, which file was missing? That would help a lot) if it reproduces without re-exec, and maybe by hooking up the opencensus tracer to dockerd's log output. Unless I misremember, the equivalent of this doesn't appear in this project, so we haven't had hcsshim logs since hcsshim moved from logrus to opencensus internally. (I assume that was an oversight, but perhaps it was intentional?) A rough sketch of what that wiring could look like follows this comment.

It's also possible that there is no more information to be had from the API call, and we'd need to reproduce it under Process Monitor or something to see the filesystem access that's failing inside vmcompute.dll.

Poking around, there's microsoft/hcsshim#835, which has the same error message (but that doesn't mean a lot; I reckon there are three different issues just in that ticket which lead to the same error text). Possibilities that arise from that ticket are:
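Following up on the opencensus point above: a minimal sketch, assuming go.opencensus.io/trace and logrus, of how hcsshim's internal spans could be forwarded into dockerd-style log output. The exporter type and the fields logged are illustrative assumptions; this is not existing moby code.

```go
package main

import (
	"github.com/sirupsen/logrus"
	"go.opencensus.io/trace"
)

// logrusExporter forwards finished OpenCensus spans (such as the ones hcsshim
// emits internally) to logrus, so they end up in the daemon's log output.
type logrusExporter struct{}

func (e *logrusExporter) ExportSpan(s *trace.SpanData) {
	logrus.WithFields(logrus.Fields{
		"span":     s.Name,
		"duration": s.EndTime.Sub(s.StartTime),
		"status":   s.Status.Message,
	}).Debug("trace span")
}

func main() {
	// Record every span; by default only sampled spans reach exporters.
	trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
	trace.RegisterExporter(&logrusExporter{})

	// ... any hcsshim calls made after this point would have their spans
	// logged via logrus.
}
```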
Good point on the virus scanner; I recall there were issues in CI w.r.t. Defender (could've been because CI uses non-standard paths for some storage; I don't recall the details). And there's some code in place to check if it's enabled (lines 273 to 283 at commit 8207c05).
(Haven't checked the logs yet to see if it's printing that warning.)
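For illustration only (this is not the moby code referenced above at lines 273 to 283): one way such a check could look, shelling out to PowerShell's Get-MpPreference to see whether the daemon root is on Defender's exclusion list. The function name, log wording, and the C:\ProgramData\docker path are assumptions.

```go
package main

import (
	"os/exec"
	"strings"

	"github.com/sirupsen/logrus"
)

// warnIfDefenderMayInterfere is a hypothetical check: if the Defender cmdlets
// are available and the daemon root is not in the exclusion list, emit a warning.
func warnIfDefenderMayInterfere(daemonRoot string) {
	out, err := exec.Command("powershell", "-NoProfile", "-NonInteractive", "-Command",
		"(Get-MpPreference).ExclusionPath").CombinedOutput()
	if err != nil {
		// Defender may not be installed (e.g. some Server Core images); nothing to warn about.
		logrus.WithError(err).Debug("could not query Windows Defender exclusions")
		return
	}
	if !strings.Contains(strings.ToLower(string(out)), strings.ToLower(daemonRoot)) {
		logrus.Warnf("%s does not appear to be excluded from Windows Defender scanning; real-time scanning can slow down or interfere with layer imports", daemonRoot)
	}
}

func main() {
	warnIfDefenderMayInterfere(`C:\ProgramData\docker`)
}
```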
Nope; just checked a failed run, and it doesn't have that message, so shouldn't be the cause 🤔 (https://ci-next.docker.com/public/blue/rest/organizations/jenkins/pipelines/moby/branches/PR-42736/runs/5/nodes/305/log/?start=0)
This test is failing frequently once nodes have less disk space available. Skipping the test for now, but we can continue looking for a good solution. Tracked through moby#42743 Signed-off-by: Sebastiaan van Stijn <[email protected]>
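The skip referenced in the commit message above is straightforward; a minimal sketch of what it might look like in the test (the exact wording used in the commit may differ):

```go
package build

import "testing"

func TestBuildWCOWSandboxSize(t *testing.T) {
	// Skipped until the flakiness is understood; see moby/moby#42743.
	t.Skip("flaky on CI nodes that are low on disk space; see moby/moby#42743")

	// ... original test body ...
}
```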
This test is failing frequently with a "There is not enough space on the disk." failure. Cleaned-up output:
We should
It's possible that the Jenkins agents run out of space over time, but they are purged/rotated frequently, which should still give us coverage for that specific test (I think?)
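If the suspicion about agents running out of space needs confirming, a small sketch (assuming golang.org/x/sys/windows) that logs free disk space on the agent before the test runs; the drive letter and output format are illustrative:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

// freeBytes returns the free bytes available to the caller on the volume
// containing path, via GetDiskFreeSpaceEx.
func freeBytes(path string) (uint64, error) {
	p, err := windows.UTF16PtrFromString(path)
	if err != nil {
		return 0, err
	}
	var free, total, totalFree uint64
	if err := windows.GetDiskFreeSpaceEx(p, &free, &total, &totalFree); err != nil {
		return 0, err
	}
	return free, nil
}

func main() {
	free, err := freeBytes(`C:\`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("free space on C:\\: %.1f GiB\n", float64(free)/(1<<30))
}
```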