Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: Introduce SyncStatus and InvalidSyncPoint for enhanced synchronization feedback #248

Open
Vanuan opened this issue Jan 12, 2025 · 11 comments

Comments

@Vanuan
Copy link

Vanuan commented Jan 12, 2025

Problem statement

Blade’s current wait_for API returns a boolean to indicate whether a synchronization point was reached within the specified timeout. However, this design has significant limitations:

  1. A false return value could mean either a timeout or an error.
  2. In scenarios like suspend/resume, the GPU may be reinitialized, causing sync points to become invalid. The current API cannot distinguish between these cases.
  3. Developers lack detailed feedback to handle synchronization outcomes effectively, making debugging and error recovery challenging.

To address these limitations I propose introducing a new API: SyncPoint::wait_for(timeout) -> SyncStatus

Proposed solution

Keep the existing API

The existing wait_for API will remain unchanged to ensure backward compatibility. It will continue to return a boolean:

  • true: Synchronization completed successfully.
  • false: Synchronization failed (timeout or error).

Introduce a new API

A new API, sync_point.wait_for(timeout), will be introduced to provide detailed feedback through a SyncStatus enum. This API will explicitly distinguish between:

  1. Completed: The synchronization point was reached successfully.
  2. Timeout: The operation timed out before the synchronization point was reached.
  3. InvalidSyncPoint: The sync point became invalid (e.g., due to GPU reinitialization).
  4. Error: Any other errors or unexpected issues during synchronization, with a detailed error message.

New SyncStatus enum

The SyncStatus enum will provide detailed feedback for the new API.

pub enum SyncStatus {
    Completed,
    Timeout,
    InvalidSyncPoint,
    Error { error_string: String },
}

New wait_for(or wait_for_detailed) method

The new wait_for methods to be added to the SyncPoint and will return SyncStatus.

trait SyncPoint {
    fn wait_for(&self, timeout_ms: u32) -> bool;
    fn wait_for_detailed(&self, timeout_ms: u32) -> SyncStatus;
}

Example usage

Simple usage

let sync_point = device.create_sync_point();
// ...
if sync_point.wait_for(1000) {
    println!("GPU work completed!");
} else {
    println!("Timeout or error occurred."); // Cannot distinguish between timeout, error, or invalid sync point
}

Enhanced usage

let sync_point = device.create_sync_point();
// ...
match sync_point.wait_for_detailed(1000) {
    SyncStatus::Completed => println!("GPU work completed!"),
    SyncStatus::Timeout => println!("Timeout while waiting for GPU work."),
    SyncStatus::InvalidSyncPoint => println!("Sync point is invalid (e.g., GPU reinitialized)."),
    SyncStatus::Error { error_string } => println!("An error occurred: {}", error_string),
}

How different APIs handle invalid sync points

1. Vulkan

In Vulkan, synchronization primitives like semaphores and fences are tied to the logical device. If the device is lost (e.g., due to a GPU crash or driver issue), all synchronization primitives become invalid. Vulkan provides explicit mechanisms to detect device loss:

  • Device Lost Error: When a device is lost, Vulkan operations return VK_ERROR_DEVICE_LOST. This can be used to detect invalid sync points.
  • Timeline Semaphores: If a timeline semaphore is used, its value may become invalid if the device is lost.

Example:

unsafe {
    match self.device.timeline_semaphore.wait_semaphores(&wait_info, timeout_ns) {
        Ok(_) => SyncStatus::Completed,
        Err(vk::Result::TIMEOUT) => SyncStatus::Timeout,
        Err(vk::Result::ERROR_DEVICE_LOST) => SyncStatus::InvalidSyncPoint,
        Err(err) => SyncStatus::Error {
            error_string: format!("Vulkan error: {:?}", err),
        },
    }
}

2. Metal

In Metal, command buffers and their associated synchronization primitives are tied to the command queue and device. If the GPU is reset or the device is reinitialized, command buffers and their sync points may become invalid. Metal provides status checks for command buffers:

  • Command Buffer Status: A command buffer can be in states like NotEnqueued, Enqueued, Committed, Scheduled, Completed, or Error.
  • Invalid State: If a command buffer is in an invalid state (e.g., NotEnqueued after GPU reinitialization), it can be treated as an invalid sync point.

Example:

match sync_point.cmd_buf.status() {
    metal::MTLCommandBufferStatus::Completed => SyncStatus::Completed,
    metal::MTLCommandBufferStatus::Error => {
        let error_message = sync_point.cmd_buf.error()
            .map(|e| e.to_string())
            .unwrap_or_else(|| "Unknown Metal error".to_string());
        SyncStatus::Error {
            error_string: error_message,
        }
    }
    metal::MTLCommandBufferStatus::NotEnqueued => SyncStatus::InvalidSyncPoint,
    _ => SyncStatus::Timeout,
}

3. GLES

In GLES, synchronization relies on sync objects (e.g., created with glFenceSync), which are tied to the GL context. If the context is lost—such as during suspend/resume or GPU reinitialization—all sync objects become invalid. The glow crate provides abstractions for working with GLES sync operations. The glClientWaitSync function is used to wait for a sync object to be signaled and can return specific statuses: GL_ALREADY_SIGNALED or GL_CONDITION_SATISFIED indicates the sync completed successfully, GL_TIMEOUT_EXPIRED means the wait timed out, and GL_WAIT_FAILED signals that the sync object is invalid, often due to context loss. This mechanism allows for explicit handling of synchronization outcomes, including errors and invalid states.

Example:

impl SyncPoint for GLESSyncPoint {
    fn wait_for(&self, timeout_ms: u32) -> bool {
        matches!(self.wait_for_detailed(timeout_ms), SyncStatus::Completed)
    }

    fn wait_for_detailed(&self, timeout_ms: u32) -> SyncStatus {
        let gl = self.lock();
        let timeout_ns = if timeout_ms == !0 { !0 } else { timeout_ms as u64 * 1_000_000 };
        let timeout_ns_i32 = timeout_ns.min(i32::MAX as u64) as i32;

        let status = unsafe {
            gl.client_wait_sync(self.fence, glow::SYNC_FLUSH_COMMANDS_BIT, timeout_ns_i32)
        };

        match status {
            glow::ALREADY_SIGNALED | glow::CONDITION_SATISFIED => SyncStatus::Completed,
            glow::TIMEOUT_EXPIRED => SyncStatus::Timeout,
            glow::WAIT_FAILED => SyncStatus::InvalidSyncPoint,
            _ => SyncStatus::Error { error_string: "GLES sync failed".to_string() },
        }
    }
}

Is "invalid sync point" a universal concept?

While the term invalid sync point isn’t explicitly defined in graphics APIs, the concept exists in practice. Each API has its own way of handling scenarios where synchronization primitives become unusable:

  • Vulkan: device loss (VK_ERROR_DEVICE_LOST)
  • Metal: command buffer invalidation (NotEnqueued or Error)
  • GLES: context loss (glow::WAIT_FAILED)

By introducing an InvalidSyncPoint and the SyncStatus enum, we provide a unified way to handle these scenarios across all backends.

Use cases

The InvalidSyncPoint is particularly useful for handling suspend/resume scenarios, where the GPU may be reinitialized, causing sync points to become invalid. Without this variant, developers cannot distinguish between:

  • A legitimate timeout (e.g., the GPU is busy but still operational), where we can present user with a choice to continue waiting or terminate (maybe through some Operating System API if the desktop environment is not experiencing the same busy status).
  • An invalid sync point (e.g., the GPU was reinitialized, and the sync point is no longer valid), where we need to either recover from error state or gracefully shutdown.

By explicitly including InvalidSyncPoint, we enable developers to handle these cases appropriately, improving robustness and debuggability.

@Vanuan
Copy link
Author

Vanuan commented Jan 12, 2025

I've checked the codebase and there's no mention of VK_ERROR_DEVICE_LOST whatsoever. It might mean that blade either doesn't handle and can't recover from such a case or that this is handled at the lower level. gpui doesn't handle it either.

This might be the cause of numerous issues reported on zed about high CPU usage, hang on suspend/resume, or just UI freezes.

@Vanuan
Copy link
Author

Vanuan commented Jan 12, 2025

If this is accepted, the next step would be introducing recovery mechanism to gpui or blade. Something like this:

self.recovery_handler = DeviceRecovery::new();

match sync_point.wait_for_detailed(1000) {
    SyncStatus::InvalidSyncPoint => {
        self.recovery_handler.on_invalid_sync_point("GPU reinitialized.");
        self.recovery_handler.cleanup_gpu_resources();
        self.recovery_handler.reinitialize_gpu();
        self.recovery_handler.restore_application_state();
    }
    _ => {}
}

I think this would solve most of the UI freezes.

@kvark
Copy link
Owner

kvark commented Jan 12, 2025

Thank you for this detailed suggestion!

Currently, blade doesn't have any ways to handle device lost gracefully. You can be waiting on a sync point, or you can be creating a new resource, or even doing a new submission - all of those would fail. Trying to protect wait_for only isn't a solution to the suspend/resume. There would need to be a more intrusive change for this... In that sense, the proposed API isn't good.

Wouldn't there be some OS event coming for suspend/resume? The application could handle that externally via OS, it's not clear to me that we necessarily need all Blade APIs to be aware of this process.

@Vanuan
Copy link
Author

Vanuan commented Jan 12, 2025

Isn't the sole point of blade to hide cross platform intricacies? Why do you need blade at all of you need to handle special cases like this?

@Vanuan
Copy link
Author

Vanuan commented Jan 12, 2025

The suspend/resume detection is not a solution. There are cases where device is lost without suspension. And there are cases where suspension does not cause the device loss. The only way to know is wait_for stuck for seconds, which could either indicate that it's busy or truly stuck.

If it's truly stuck, the wait_for would return immediately with error. So calling it in a loop will exacerbate the problem.

@kvark
Copy link
Owner

kvark commented Jan 13, 2025

The only way to know is wait_for stuck for seconds

I don't understand this part. If your code doesn't know when Suspend/Resume happened, it might be doing a variety of things. E.g. create_texture() call. All the things would supposedly fail. So changing wait_for is not a complete solution here.

Isn't the sole point of blade to hide cross platform intricacies? Why do you need blade at all of you need to handle special cases like this?

Yes, but there is a nuance. Blade isn't necessarily trying to be a complete opaque abstraction. If there is a nice way to handle suspend resume at the GPU abstraction level - let's consider it for sure! Also, if there is a way to handle it at the OS level - that seems to be even better. Again, Blade isn't abstracting away all aspects the platforms, it only cares about the GPU side of things. E.g. input is handled by winit. Arguably, suspend/resume can be seen as "power management" aspect of the platform.

Let's clarify one question. Are you caring about handling device lost in general, or just the suspend/resume scenario?

@Vanuan
Copy link
Author

Vanuan commented Jan 17, 2025

The context of wait_for being stuck for seconds and the broader design around handling GPU device loss in suspend/resume scenario is described in detail in this issue. Please comment if you feel this is feasible at all or doesn't make any sense.

Addressing your specific questions with my interpretation of what you're asking. Clarify if I'm misunderstanding:

  1. The only way to know is wait_for stuck for seconds

Changing wait_for is only a part of error recovery. The goal is to detect GPU device loss reliably (e.g., through VK_ERROR_DEVICE_LOST or similar APIs) rather than relying on timeouts. Yes, when GPU device loss occurs, all other GPU operations (e.g., create_texture()) should terminate and pass the error signal to the top-level or initialization handler.

  1. Why use Blade if special cases like suspend/resume need handling?

Blade is in the unique position here as VK_ERROR_DEVICE_LOST is only visible in this layer.

  1. Device loss vs. suspend/resume:

The focus is on handling GPU device loss in general, not just suspend/resume scenarios. Device loss can occur due to various reasons (e.g., driver crashes, overheating), and a robust solution must address all these cases.


The complete solution would combine Blade’s error detection and possibly some recovery utilities with application-level handling (e.g., saving data, restarting). The goal is to create a seamless and user-friendly experience.

I still don't understand what solution are you proposing. What other ways you propose to signal the VK_ERROR_DEVICE_LOST to the application? Are you saying instead of wait_for() there are other gpui/blade APIs that application-specific wait_for loop can use to detect the VK_ERROR_DEVICE_LOST? From what I'm seeing it's only true or false, no way to get which error has happened.

Maybe you're proposing some kind of C-inspired errno/perror API? There's no way around blade if wait_for is used.

@Vanuan
Copy link
Author

Vanuan commented Jan 17, 2025

Rereading this, I think now I understand better. You're implying that wait_for is not the only place where Zed might be stuck when unrecoverable error occurs?
We're still discussing this loop, right?

https://github.com/zed-industries/zed/blob/5c239be7572fa8dbe38a83160b26b22d5d383174/crates/gpui/src/platform/blade/blade_renderer.rs#L383-L390

    fn wait_for_gpu(&mut self) {
        if let Some(last_sp) = self.last_sync_point.take() {
            if !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {
                log::error!("GPU hung");
                while !self.gpu.wait_for(&last_sp, MAX_FRAME_TIME_MS) {}
            }
        }
    }

I've reproduced the issue consistently, and GPU hung is written to the log consistently, so if any issue occurs, this is a robust place of handling it. Are you saying Zed entering this infinite loop is not the reason for unresponsiveness? I mean, isn't this a sole point of sync points, to assess the health and prevent further corruption?

My hypothesis here is that VK_ERROR_DEVICE_LOST occurs in this loop. If we could implement something like a get_last_error() and detect that this is indeed the case, we would have much more freedom to decide what to do next. For example:

  • Crash the application (like vkcube does).
  • Restart the application (like Chrome does).
  • Display a GNOME-like "wait or kill" modal using a system call to run an external program.

This would allow us to handle GPU device loss more gracefully and provide a better user experience.
The current behaviour of zed just "freezing" is unacceptable.

@kvark
Copy link
Owner

kvark commented Jan 19, 2025

Thank you for details elaboration here and in the Zed issue!

I'd like to consider the scenarios first before jumping to solution. Vulkan specification lists the following reasons for device loss:

A logical device may become lost because of hardware errors, execution timeouts, power management events and/or platform-specific events.

  • Hardware errors. I think it's ok to just crash in this case.
  • Execution timeouts. This is a function of our workload submission. If the GPU is timing out, our code is wrong. It could be a rogue loop, an extreme draw call volume, or even things like accessing data outside of bounds. This is equivalent to an assert!() in code. You do it because it's a sanity check, it's not something you catch and carefully propagate up to the user. TLDR: if we have those events, we should treat them as implementation bugs and focus on fixing, instead of trying to surface them higher to Zed or the user.
  • Power management events and/or platform-specific events. Those should be handled by the platform APIs. Application should listen to suspend/resume and react accordingly. It should not treat device loss as a suspend/resume event.

Please let me know what you think!

@Vanuan
Copy link
Author

Vanuan commented Jan 20, 2025

Hardware errors: It's acceptable to crash the application.

Let the application decide please

Execution timeouts: These should be treated as implementation bugs and fixed.

Huh? You're saying it's acceptable to freeze the application?

Power management and platform-specific events: Application should not treat device loss as a suspend/resume event.

The opposite is true. There might be cases where power management doesn't cause the device loss. So power management events should not be treated as device loss events.

I really don't understand your take. How do you debug the device loss if you don't even know it happened? The only symptom is UI freezing. This is just bad UX, bad DX, for the sake of what? Intellectual purity?

@Vanuan
Copy link
Author

Vanuan commented Jan 22, 2025

So, would you willing to review a PR that surfaces some kind of error to the application?

Step 1: Update the CommandDevice trait

Add the new method wait_for_with_error

Something like
wait_for_with_error(&self, sp: &Self::SyncPoint, timeout_ms: u32) -> Result<bool, BladeError>

Step 2: Implement wait_for_with_error in GLES

Step 3: Implement wait_for_with_error in Metal

Step 4: Implement wait_for_with_error in Vulkan

    fn wait_for_with_error(&self, sp: &SyncPoint, timeout_ms: u32) -> Result<bool, BladeError> {
        // Implement error detection logic
        if /* condition to detect GPU device loss */ {
            return Err(BladeError::GpuDeviceLost);
        }
        if /* condition to detect it's ok */ {
        Ok(true)
        }
        if /* condition to detect it's recoverable timeout */ {
        Ok(false)
        }
   }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants