Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rare but reproducible segfault related to clear_image() and thumbnail_image() #251

Open
MaxPower15 opened this issue Jan 18, 2022 · 1 comment

Comments

@MaxPower15
Copy link
Contributor

MaxPower15 commented Jan 18, 2022

I am getting a segfault that I'd guess, based on observed behavior, is due to a race condition with how vips or govips manages its memory internally. I'm also open to it being something in my code, though it hasn't seemed that way in my research on it. Please let me know if I'm wrong!

I've tested with govips 2.7.1, 2.8.0, and 2.9.0. I'm using vips 8.12.1, but have also tested with vips 8.11.3. It's reproducible for me in AWS Lambda using an Alpine Docker image and a fairly large input file or two, both of which I unfortunately cannot share.

For an individual worker, I'm running 12 concurrent vips load/transform/save operations. Overall, this is a ballpark estimate and based on a few different inputs I've been testing with, but I expect to see this error in about 1 in 40,000 of those operations. This sounds pretty rare, but for my use case, it means I'm basically guaranteed a segfault when processing certain input files. (Input files are videos that we're processing frame by frame.) When it fails, it doesn't always fail on the same image inputs (i.e. it is not related purely to the content), though it does happen more often for images with higher quality/resolution.

I cannot reproduce this locally, maybe because I just can't run enough operations, or maybe because my computer is faster than a Lambda worker, or because I'm on OSX. I'm not sure. But it is reliably reproducible for me in AWS Lambda (10 GB memory) with longer and larger inputs.

Here is a repo that shows a simplified version of how I'm using govips. The relevant code is in main.go. The "parmap" helper is short for "parallel map," and is an almost-identical port of what I'm using for real. I'm hoping that I'm just not doing something correctly, so please let me know if anything stands out to you. Happy to give suggestions a try.

My temporary workaround for this issue is to remove the canvas.Close() line. Without that, I have yet to encounter the segfault. Interestingly, if I also remove file1.Close() and file2.Close(), I reliably encounter the segfault again. So I left those alone. To avoid a memory leak from not freeing the canvas, I launch a separate process to wrap the vips compositing code, then let the OS clean it up when the compositing process terminates. This solves my problem in the short term, but it has me worried about dealing with segfaults in the future.

About half the time, the error stack shows the call to thumbnail_image():

[signal SIGSEGV: segmentation violation code=0x1 addr=0x6a7dfc53 pc=0x7f07b815e2d0]
runtime stack:
runtime.throw({0xb9c9f2, 0x7f078bbc0360})
/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:719 +0x396
goroutine 263 [syscall]:
runtime.cgocall(0xa22560, 0xc0003695d0)
/usr/local/go/src/runtime/cgocall.go:156 +0x5c fp=0xc0003695a8 sp=0xc000369570 pc=0x409ffc
github.com/davidbyttow/govips/v2/vips._Cfunc_thumbnail_image(0x7f078bc54960, 0xc0000ec070, 0x500, 0x2d0, 0x1, 0x0)
_cgo_gotypes.go:1843 +0x4c fp=0xc0003695d0 sp=0xc0003695a8 pc=0x9ac90c
github.com/davidbyttow/govips/v2/vips.vipsThumbnail.func1(0x0, 0x0, 0x500, 0x2d0, 0x1, 0x0)
/go/pkg/mod/github.com/davidbyttow/govips/[email protected]/vips/resample.go:57 +0xa7 fp=0xc000369618 sp=0xc0003695d0 pc=0x9bcb47
github.com/davidbyttow/govips/v2/vips.vipsThumbnail(0x2c220, 0x2c221, 0xc0011c8000, 0xa79600, 0xc000304060)
/go/pkg/mod/github.com/davidbyttow/govips/[email protected]/vips/resample.go:57 +0x97 fp=0xc000369698 sp=0xc000369618 pc=0x9bca17
github.com/davidbyttow/govips/v2/vips.(*ImageRef).Thumbnail(0xc00007a3c0, 0x20, 0x1, 0xc0003f60c0)
/go/pkg/mod/github.com/davidbyttow/govips/[email protected]/vips/image.go:1414 +0x27 fp=0xc0003696d0 sp=0xc000369698 pc=0x9baac7

The other half, the error stack shows the call to clear_image():

[signal SIGSEGV: segmentation violation code=0x1 addr=0xfffffffd00000019 pc=0x7f04abf21325]
runtime stack:
runtime.throw({0xb9c9f2, 0x7f04ac1aa9c0})
/usr/local/go/src/runtime/panic.go:1198 +0x71
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:719 +0x396
goroutine 148 [syscall]:
runtime.cgocall(0xa22310, 0xc000345660)
/usr/local/go/src/runtime/cgocall.go:156 +0x5c fp=0xc000345638 sp=0xc000345600 pc=0x409ffc
github.com/davidbyttow/govips/v2/vips._Cfunc_clear_image(0xc0000740a0)
_cgo_gotypes.go:942 +0x45 fp=0xc000345660 sp=0xc000345638 pc=0x9a91a5
github.com/davidbyttow/govips/v2/vips.clearImage.func1(0xc000320570)
/go/pkg/mod/github.com/davidbyttow/govips/[email protected]/vips/image.go:1575 +0x45 fp=0xc000345698 sp=0xc000345660 pc=0x9bb825
github.com/davidbyttow/govips/v2/vips.clearImage(0x7f0476e9be30)
/go/pkg/mod/github.com/davidbyttow/govips/[email protected]/vips/image.go:1575 +0x5b fp=0xc0003456c8 sp=0xc000345698 pc=0x9bb7bb
github.com/davidbyttow/govips/v2/vips.(*ImageRef).Close(0xc00007a410)
/go/pkg/mod/github.com/davidbyttow/govips/[email protected]/vips/image.go:497 +0x65 fp=0xc0003456f0 sp=0xc0003456c8 pc=0x9b7265

Let me know if there's any more information I can get, or anything else I should try!

@mrngm
Copy link
Contributor

mrngm commented Sep 15, 2022

Coincidentally, I'm looking into an (accidental) double-free, on govips v2.11.0, libvips 8.10.5, when calling *vips.ImageRef.Close() twice (which should be okay from the looks of the Go source).

Could it be that a nil overlay here causes the function to still call canvas.Insert() with a nil overlay? And that, internally, this calls clearImage() that somehow cleans up the canvas reference as well?

(My current hunch in out setup is that *vips.ImageRef.Close() neatly uses a lock before calling clearImage on an internal reference, but it's not guaranteed that all code paths using clearImage also set the internal reference in ImageRef to nil)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants