GPU Memory leak on A12Z-based iPad Pro, ATV 4K 1st Gen #2341

Closed
warmenhoven opened this issue Sep 18, 2024 · 4 comments · Fixed by #2345
Labels
Bug Completed Issue has been fixed, or enhancement implemented.

Comments

@warmenhoven
Contributor

Starting with 188a21a, on these older devices, I'm seeing a GPU memory leak of about 22MB per second. I don't see it under the same usage scenarios on my M2 Mac or my iPhone 15 Pro.

ipad.log

@warmenhoven
Contributor Author

I am having difficulty debugging a memory leak, and I'm not sure where to go from here.

The leak was caused by this commit in MoltenVK. If I revert that commit, the leak goes away. It also goes away if I move the _image->release() call from the MVKImageView destructor into MVKImageView::destroy(). I put up a PR that does that, but was told it is unsafe because that memory might still be referenced through a completion block after the destroy.

I've used MVK_CONFIG_PERFORMANCE_TRACKING and MVK_CONFIG_PERFORMANCE_LOGGING_FRAME_COUNT to see where the leak is, and confirmed that it is in allocated GPU memory. It grows by about 350KB per frame; at that rate my app crashes after about three minutes.
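As a rough cross-check of the two figures reported in this thread (about 350KB per frame here, about 22MB per second in the opening comment), the arithmetic can be sketched as below. The 60 frames per second rate is an assumption, not something stated in the thread, and the function names are made up for illustration.

```cpp
// Assumption: the app renders at roughly 60 frames per second.
constexpr double kAssumedFps = 60.0;

// Converts a per-frame leak in KB into a per-second leak in MB (decimal units).
constexpr double leakPerSecondMB(double leakPerFrameKB, double fps) {
    return leakPerFrameKB * fps / 1000.0;
}

// Total leaked memory in GB after the given number of seconds.
constexpr double totalLeakedGB(double leakPerSecondMB, double seconds) {
    return leakPerSecondMB * seconds / 1000.0;
}

// 350 KB/frame at the assumed 60 fps, and the total after three minutes.
constexpr double kPerSecondMB = leakPerSecondMB(350.0, kAssumedFps);
constexpr double kAfterThreeMinutesGB = totalLeakedGB(kPerSecondMB, 180.0);
```

Under that assumption the per-second rate comes out close to the 22MB per second figure, and three minutes of leaking adds up to several gigabytes, which fits a crash on a memory-constrained device.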

I've tried adding static counters to the constructors and destructors of MVKImage and MVKImageView to see whether any instances are destroyed without their destructor running, but that is not the case: every instance is destructed correctly, and the live count does not grow.
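The counter technique described above can be sketched roughly as follows. This is a minimal illustration, not the code actually added to MoltenVK: LiveCounter, FakeImage, FakeImageView and simulateFrames are hypothetical names standing in for the real classes and the real frame loop.

```cpp
#include <atomic>

// Hypothetical CRTP helper: tracks how many instances of T are currently alive.
template <typename T>
class LiveCounter {
public:
    static int live() { return count().load(); }

protected:
    LiveCounter() { ++count(); }
    LiveCounter(const LiveCounter&) { ++count(); }
    ~LiveCounter() { --count(); }

private:
    // One counter per T, created on first use.
    static std::atomic<int>& count() {
        static std::atomic<int> n{0};
        return n;
    }
};

// Stand-ins for the real classes; illustrative only.
struct FakeImage : LiveCounter<FakeImage> {};
struct FakeImageView : LiveCounter<FakeImageView> {};

// Creates and destroys one image and one view per simulated frame, then
// returns how many objects are still alive. A growing result would point
// to objects whose destructors never ran.
int simulateFrames(int frames) {
    for (int i = 0; i < frames; ++i) {
        FakeImage image;
        FakeImageView view;
        (void)image;
        (void)view;
    }
    return FakeImage::live() + FakeImageView::live();
}
```

A count that stays flat, as reported here, rules out leaked C++ objects and suggests that the growth comes from a resource allocated somewhere other than object construction.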

This leak does not happen on my iPhone 15 Pro, but does happen on my (relatively old now) A12Z-based iPad Pro, with the exact same IPA. I've also seen it on my 1st gen ATV4K (and confirmed the same commit causes it), but never on any of my Macs.

@warmenhoven
Contributor Author

I've tried to use the Metal System Trace instrument to compare the Metal Resource Events between builds with/without the image retain/release, and I believe this is the offending stack trace:

124.34 MiB  16.1%	0 Bytes	               vkFreeMemory
124.34 MiB  16.1%	0 Bytes	                MVKDevice::freeMemory(MVKDeviceMemory*, VkAllocationCallbacks const*)
124.34 MiB  16.1%	0 Bytes	                 MVKReferenceCountingMixin<MVKBaseObject>::destroy()
124.34 MiB  16.1%	0 Bytes	                  MVKReferenceCountingMixin<MVKBaseObject>::release()
124.34 MiB  16.1%	0 Bytes	                   MVKBaseObject::destroy()
124.34 MiB  16.1%	0 Bytes	                    MVKDeviceMemory::~MVKDeviceMemory()
124.34 MiB  16.1%	0 Bytes	                     MVKDeviceMemory::~MVKDeviceMemory()
124.34 MiB  16.1%	0 Bytes	                      MVKDeviceMemory::~MVKDeviceMemory()
124.34 MiB  16.1%	0 Bytes	                       MVKImageMemoryBinding::bindDeviceMemory(MVKDeviceMemory*, unsigned long long)
124.34 MiB  16.1%	0 Bytes	                        -[AGXBuffer initWithDevice:length:alignment:options:isSuballocDisabled:pinnedGPULocation:]
124.34 MiB  16.1%	0 Bytes	                         -[AGXBuffer(Internal) initWithDevice:length:alignment:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
124.34 MiB  16.1%	0 Bytes	                          -[AGXBuffer(Internal) initWithDevice:length:alignment:pointerTag:options:isSuballocDisabled:resourceInArgs:pinnedGPULocation:]
124.34 MiB  16.1%	0 Bytes	                           -[IOGPUMetalBuffer initWithDevice:pointer:length:alignment:options:sysMemSize:gpuAddress:gpuTag:args:argsSize:deallocator:]
124.34 MiB  16.1%	124.34 MiB	                            __kdebug_trace64

@warmenhoven
Contributor Author

It's coming from this line; usesTexelBuffer is false (because _deviceMemory is nullptr), but on the devices with the leak, _image->_isLinearForAtomics is true.

I believe bindDeviceMemory is being called with a nullptr from the destructor in order for it to act as if it were UNbindDeviceMemory. I think the right check is (_deviceMemory && _image->_isLinearForAtomics); that does fix the leak for me. I'll update my PR with that.
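The effect of the extra null check can be illustrated with a simplified model. Everything in this sketch is hypothetical (Binding, its fields, leakedBuffers, and the guarded flag); it mirrors only the shape of the condition described in this comment, not the actual MoltenVK source.

```cpp
// Simplified stand-in for device memory; not the real MVKDeviceMemory.
struct DeviceMemory {};

// Hypothetical model of an image memory binding that may create its own buffer.
struct Binding {
    const DeviceMemory* memory = nullptr;
    bool linearForAtomics = false;  // true on the affected devices
    bool ownsBuffer = false;        // stands in for the Metal buffer

    // Passing nullptr models the "unbind" call made during teardown.
    // liveBuffers counts buffers that have been created and not yet released.
    void bind(const DeviceMemory* mem, bool guarded, int& liveBuffers) {
        if (ownsBuffer) {           // release any buffer from a previous bind
            --liveBuffers;
            ownsBuffer = false;
        }
        memory = mem;
        // Unguarded: the condition ignores whether memory is actually bound.
        // Guarded: the check suggested in this thread, which also requires memory.
        bool needsOwnBuffer = guarded ? (memory && linearForAtomics)
                                      : linearForAtomics;
        if (needsOwnBuffer) {
            ++liveBuffers;
            ownsBuffer = true;
        }
    }
};

// Binds and then unbinds one image per simulated frame, and returns the
// number of buffers that were never released.
int leakedBuffers(int frames, bool guarded) {
    int liveBuffers = 0;
    DeviceMemory mem;
    for (int i = 0; i < frames; ++i) {
        Binding b;
        b.linearForAtomics = true;
        b.bind(&mem, guarded, liveBuffers);     // normal bind
        b.bind(nullptr, guarded, liveBuffers);  // unbind at teardown
    }
    return liveBuffers;
}
```

In this model the unguarded path creates a fresh buffer during every unbind and nothing ever releases it, so one buffer leaks per frame; with the guard, the unbind call only releases.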

@billhollings
Contributor

> It's coming from this line; usesTexelBuffer is false (because _deviceMemory is nullptr), but on the devices with the leak, _image->_isLinearForAtomics is true.
>
> I believe bindDeviceMemory is being called with a nullptr from the destructor in order for it to act as if it were UNbindDeviceMemory. I think the right check is (_deviceMemory && _image->_isLinearForAtomics); that does fix the leak for me. I'll update my PR with that.

Thanks for all your incredible detective work! I appreciate you digging into this so thoroughly! And great catch!

And thanks for providing PR #2345

This was a significant per-frame leak. Out of curiosity, under what conditions was vkFreeMemory() being called so frequently on every frame?

@billhollings billhollings added Bug Completed Issue has been fixed, or enhancement implemented. labels Sep 24, 2024
warmenhoven added a commit to warmenhoven/MoltenVK that referenced this issue Sep 25, 2024
billhollings added a commit that referenced this issue Sep 25, 2024
Fix leak where texel buffer is occasionally accidentally created during image-memory unbinding (#2341)