Feature request: Introduce SyncStatus and InvalidSyncPoint for enhanced synchronization feedback #248
Comments
I've checked the codebase and there's no mention of This might be the cause of numerous issues reported on zed about high CPU usage, hang on suspend/resume, or just UI freezes. |
If this is accepted, the next step would be introducing a recovery mechanism to gpui or blade. Something like this:
self.recovery_handler = DeviceRecovery::new();
match sync_point.wait_for_detailed(1000) {
    SyncStatus::InvalidSyncPoint => {
        self.recovery_handler.on_invalid_sync_point("GPU reinitialized.");
        self.recovery_handler.cleanup_gpu_resources();
        self.recovery_handler.reinitialize_gpu();
        self.recovery_handler.restore_application_state();
    }
    _ => {}
}
I think this would solve most of the UI freezes. |
Thank you for this detailed suggestion! Currently, blade doesn't have any way to handle device loss gracefully. You can be waiting on a sync point, or you can be creating a new resource, or even doing a new submission - all of those would fail. Trying to protect
Wouldn't there be some OS event coming for suspend/resume? The application could handle that externally via the OS; it's not clear to me that we necessarily need all Blade APIs to be aware of this process. |
Isn't the sole point of blade to hide cross-platform intricacies? Why do you need blade at all if you need to handle special cases like this? |
The suspend/resume detection is not a solution. There are cases where the device is lost without suspension, and there are cases where suspension does not cause device loss. The only way to know is that wait_for is stuck for seconds, which could indicate either that the GPU is busy or that it is truly stuck. If it's truly stuck, wait_for would return immediately with an error, so calling it in a loop will exacerbate the problem. |
I don't understand this part. If your code doesn't know when Suspend/Resume happened, it might be doing a variety of things. E.g.
Yes, but there is a nuance. Blade isn't necessarily trying to be a completely opaque abstraction. If there is a nice way to handle suspend/resume at the GPU abstraction level - let's consider it for sure! Also, if there is a way to handle it at the OS level - that seems to be even better. Again, Blade isn't abstracting away all aspects of the platforms, it only cares about the GPU side of things. E.g. input is handled by
Let's clarify one question. Do you care about handling device loss in general, or just the suspend/resume scenario? |
The context of Addressing your specific questions with my interpretation of what you're asking. Clarify if I'm misunderstanding:
Changing
Blade is in the unique position here as VK_ERROR_DEVICE_LOST is only visible in this layer.
The focus is on handling GPU device loss in general, not just suspend/resume scenarios. Device loss can occur for various reasons (e.g., driver crashes, overheating), and a robust solution must address all of these cases. The complete solution would combine Blade’s error detection and possibly some recovery utilities with application-level handling (e.g., saving data, restarting). The goal is to create a seamless and user-friendly experience.
I still don't understand what solution you are proposing. What other ways do you propose to signal the
Maybe you're proposing some kind of C-inspired errno/perror API? There's no way around blade if wait_for is used. |
Rereading this, I think I now understand better. You're implying that wait_for is not the only place where Zed might be stuck when an unrecoverable error occurs?
fn wait_for_gpu(&mut self) {
    if let Some(last_sp) = self.last_sync_point.take() {
        if !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {
            log::error!("GPU hung");
            // Spins forever if wait_for keeps returning false.
            while !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {}
        }
    }
}
I've reproduced the issue consistently, and My hypothesis here is that VK_ERROR_DEVICE_LOST occurs in this loop. If we could implement something like a get_last_error() and detect that this is indeed the case, we would have much more freedom to decide what to do next. For example:
This would allow us to handle GPU device loss more gracefully and provide a better user experience. |
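A rough sketch of how the wait loop quoted above could stop spinning once the wait reports why it failed. Everything here is an assumption for illustration: the `Completed` and `Timeout` variants, the `WaitOutcome` type, and the `wait_for_gpu_bounded` helper are hypothetical, and the closure stands in for a detailed wait call that Blade does not currently provide.

```rust
/// Hypothetical detailed wait result; only `InvalidSyncPoint` is named in this thread.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
}

/// Hypothetical summary of what the caller should do next.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WaitOutcome {
    FrameDone,
    StillBusy,
    DeviceLost,
}

/// Calls `wait` at most `max_attempts` times instead of looping forever.
/// `wait` stands in for a hypothetical detailed wait on the last sync point.
pub fn wait_for_gpu_bounded<F>(mut wait: F, max_attempts: u32) -> WaitOutcome
where
    F: FnMut() -> SyncStatus,
{
    for _ in 0..max_attempts {
        match wait() {
            SyncStatus::Completed => return WaitOutcome::FrameDone,
            // Retrying cannot help here, so bail out immediately.
            SyncStatus::InvalidSyncPoint => return WaitOutcome::DeviceLost,
            // The GPU may just be busy; try again.
            SyncStatus::Timeout => continue,
        }
    }
    WaitOutcome::StillBusy
}
```

With a result like this, the application gets to choose between retrying, recovering, or exiting, rather than hanging inside the loop.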
Thank you for the detailed elaboration here and in the Zed issue! I'd like to consider the scenarios first before jumping to a solution. The Vulkan specification lists the following reasons for device loss:
Please let me know what you think! |
Let the application decide please
Huh? You're saying it's acceptable to freeze the application?
The opposite is true. There might be cases where power management doesn't cause the device loss. So power management events should not be treated as device loss events. I really don't understand your take. How do you debug the device loss if you don't even know it happened? The only symptom is UI freezing. This is just bad UX, bad DX, for the sake of what? Intellectual purity? |
So, would you be willing to review a PR that surfaces some kind of error to the application? Step 1: Update the
|
Problem statement
Blade’s current wait_for API returns a boolean to indicate whether a synchronization point was reached within the specified timeout. However, this design has significant limitations:
- A false return value could mean either a timeout or an error.
To address these limitations I propose introducing a new API:
SyncPoint::wait_for(timeout) -> SyncStatus
Proposed solution
Keep the existing API
The existing wait_for API will remain unchanged to ensure backward compatibility. It will continue to return a boolean:
- true: Synchronization completed successfully.
- false: Synchronization failed (timeout or error).
Introduce a new API
A new API, sync_point.wait_for(timeout), will be introduced to provide detailed feedback through a SyncStatus enum. This API will explicitly distinguish between:
New SyncStatus enum
The SyncStatus enum will provide detailed feedback for the new API.
New wait_for (or wait_for_detailed) method
The new wait_for method will be added to SyncPoint and will return SyncStatus.
Example usage
Simple usage
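A minimal sketch of simple usage, under stated assumptions: the `Completed` and `Timeout` variants are guesses (only `InvalidSyncPoint` is named in this proposal), and `MockSyncPoint` is a hypothetical stand-in that replays a canned status instead of talking to a GPU. It also shows how the boolean `wait_for` could stay as a thin wrapper over the detailed call.

```rust
/// Hypothetical shape of the proposed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
}

/// Stand-in for a real sync point; it only replays a fixed status.
pub struct MockSyncPoint {
    pub status: SyncStatus,
}

impl MockSyncPoint {
    /// Hypothetical detailed wait; the timeout is ignored by the mock.
    pub fn wait_for_detailed(&self, _timeout_ms: u32) -> SyncStatus {
        self.status
    }

    /// Backward-compatible boolean form expressed via the detailed one.
    pub fn wait_for(&self, timeout_ms: u32) -> bool {
        self.wait_for_detailed(timeout_ms) == SyncStatus::Completed
    }
}

/// Simple usage: branch on the status instead of guessing from a bool.
pub fn describe(sync_point: &MockSyncPoint) -> &'static str {
    match sync_point.wait_for_detailed(1000) {
        SyncStatus::Completed => "frame finished",
        SyncStatus::Timeout => "GPU still busy",
        SyncStatus::InvalidSyncPoint => "sync point invalid, device likely lost",
    }
}
```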
Enhanced usage
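One possible shape for enhanced usage, sketched under assumptions: the `DeviceRecovery` trait, its method names (borrowed from the recovery snippet earlier in this thread), `handle_status`, and `RecordingRecovery` are all hypothetical and not part of Blade or gpui. The point is only that recovery runs on `InvalidSyncPoint` and nowhere else.

```rust
/// Hypothetical shape of the proposed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
}

/// Hypothetical application-side recovery hooks.
pub trait DeviceRecovery {
    fn cleanup_gpu_resources(&mut self);
    fn reinitialize_gpu(&mut self);
    fn restore_application_state(&mut self);
}

/// Returns true when the frame completed; runs recovery only on device loss.
pub fn handle_status<R: DeviceRecovery>(status: SyncStatus, recovery: &mut R) -> bool {
    match status {
        SyncStatus::Completed => true,
        // A timeout is not an error; the caller may simply wait again.
        SyncStatus::Timeout => false,
        SyncStatus::InvalidSyncPoint => {
            recovery.cleanup_gpu_resources();
            recovery.reinitialize_gpu();
            recovery.restore_application_state();
            false
        }
    }
}

/// Test double that records which recovery steps ran, in order.
#[derive(Default)]
pub struct RecordingRecovery {
    pub steps: Vec<&'static str>,
}

impl DeviceRecovery for RecordingRecovery {
    fn cleanup_gpu_resources(&mut self) {
        self.steps.push("cleanup");
    }
    fn reinitialize_gpu(&mut self) {
        self.steps.push("reinitialize");
    }
    fn restore_application_state(&mut self) {
        self.steps.push("restore");
    }
}
```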
How different APIs handle invalid sync points
1. Vulkan
In Vulkan, synchronization primitives like semaphores and fences are tied to the logical device. If the device is lost (e.g., due to a GPU crash or driver issue), all synchronization primitives become invalid. Vulkan provides explicit mechanisms to detect device loss:
VK_ERROR_DEVICE_LOST. This can be used to detect invalid sync points.
Example:
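A minimal sketch of that mapping, assuming the backend has the raw `VkResult` code from a fence or semaphore wait in hand. The numeric values are the ones defined by the Vulkan headers; the `SyncStatus` variants other than `InvalidSyncPoint` and the `status_from_vk_result` helper are hypothetical. Unrecognized codes are passed back unchanged rather than guessed at.

```rust
// VkResult codes as defined by the Vulkan headers.
pub const VK_SUCCESS: i32 = 0;
pub const VK_TIMEOUT: i32 = 2;
pub const VK_ERROR_DEVICE_LOST: i32 = -4;

/// Hypothetical shape of the proposed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
}

/// Translates a wait result into a status; other codes are returned as-is.
pub fn status_from_vk_result(code: i32) -> Result<SyncStatus, i32> {
    match code {
        VK_SUCCESS => Ok(SyncStatus::Completed),
        VK_TIMEOUT => Ok(SyncStatus::Timeout),
        // The device is gone, so the sync point can never be signaled.
        VK_ERROR_DEVICE_LOST => Ok(SyncStatus::InvalidSyncPoint),
        other => Err(other),
    }
}
```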
2. Metal
In Metal, command buffers and their associated synchronization primitives are tied to the command queue and device. If the GPU is reset or the device is reinitialized, command buffers and their sync points may become invalid. Metal provides status checks for command buffers:
NotEnqueued, Enqueued, Committed, Scheduled, Completed, or Error.
NotEnqueued after GPU reinitialization), it can be treated as an invalid sync point.
Example:
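A sketch of how those states might be folded into a status after a wait has elapsed. `CommandBufferStatus` here is a local mirror of the six states listed above, not a binding to Metal, and both `status_from_metal` and the non-`InvalidSyncPoint` variants of `SyncStatus` are hypothetical. Treating `NotEnqueued` as invalid follows the reasoning in this section and is an interpretation, not documented Metal behavior.

```rust
/// Local mirror of the command buffer states named above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CommandBufferStatus {
    NotEnqueued,
    Enqueued,
    Committed,
    Scheduled,
    Completed,
    Error,
}

/// Hypothetical shape of the proposed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
}

/// Classifies the state observed once the wait timeout has elapsed.
pub fn status_from_metal(status: CommandBufferStatus) -> SyncStatus {
    match status {
        CommandBufferStatus::Completed => SyncStatus::Completed,
        // A submitted buffer should never be seen in these states.
        CommandBufferStatus::NotEnqueued | CommandBufferStatus::Error => {
            SyncStatus::InvalidSyncPoint
        }
        // Still in flight: report a timeout and let the caller decide.
        CommandBufferStatus::Enqueued
        | CommandBufferStatus::Committed
        | CommandBufferStatus::Scheduled => SyncStatus::Timeout,
    }
}
```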
3. GLES
In GLES, synchronization relies on sync objects (e.g., created with glFenceSync), which are tied to the GL context. If the context is lost, such as during suspend/resume or GPU reinitialization, all sync objects become invalid. The glow crate provides abstractions for working with GLES sync operations. The glClientWaitSync function is used to wait for a sync object to be signaled and can return specific statuses: GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED indicates the sync completed successfully, GL_TIMEOUT_EXPIRED means the wait timed out, and GL_WAIT_FAILED signals that the sync object is invalid, often due to context loss. This mechanism allows for explicit handling of synchronization outcomes, including errors and invalid states.
Example:
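A minimal sketch of that classification, working on the raw value returned by glClientWaitSync rather than going through the glow crate. The four constants use the values from the OpenGL ES 3.0 headers; `status_from_gl_wait` and the `SyncStatus` variants other than `InvalidSyncPoint` are hypothetical. Mapping GL_WAIT_FAILED to `InvalidSyncPoint` follows the interpretation given in the paragraph above.

```rust
// Return values of glClientWaitSync, from the OpenGL ES 3.0 headers.
pub const GL_ALREADY_SIGNALED: u32 = 0x911A;
pub const GL_TIMEOUT_EXPIRED: u32 = 0x911B;
pub const GL_CONDITION_SATISFIED: u32 = 0x911C;
pub const GL_WAIT_FAILED: u32 = 0x911D;

/// Hypothetical shape of the proposed enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
}

/// Translates a wait result into a status; unknown values are returned as-is.
pub fn status_from_gl_wait(result: u32) -> Result<SyncStatus, u32> {
    match result {
        GL_ALREADY_SIGNALED | GL_CONDITION_SATISFIED => Ok(SyncStatus::Completed),
        GL_TIMEOUT_EXPIRED => Ok(SyncStatus::Timeout),
        GL_WAIT_FAILED => Ok(SyncStatus::InvalidSyncPoint),
        other => Err(other),
    }
}
```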
Is "invalid sync point" a universal concept?
While the term invalid sync point isn’t explicitly defined in graphics APIs, the concept exists in practice. Each API has its own way of handling scenarios where synchronization primitives become unusable:
- Vulkan (VK_ERROR_DEVICE_LOST)
- Metal (NotEnqueued or Error)
By introducing an InvalidSyncPoint and the SyncStatus enum, we provide a unified way to handle these scenarios across all backends.
Use cases
The InvalidSyncPoint is particularly useful for handling suspend/resume scenarios, where the GPU may be reinitialized, causing sync points to become invalid. Without this variant, developers cannot distinguish between:
By explicitly including InvalidSyncPoint, we enable developers to handle these cases appropriately, improving robustness and debuggability.