Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Edge Browser Scheduled Job timeout #1752

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

amartya4256
Copy link
Contributor

No description provided.

@amartya4256 amartya4256 force-pushed the edge_timeout branch 2 times, most recently from 7d16ca0 to 8195777 Compare January 22, 2025 12:01
@fedejeanne
Copy link
Contributor

fedejeanne commented Jan 22, 2025

I would additionally reduce the timeout introduced in #1677 to 10 seconds (+5 seconds from your new timeout) just to make sure that your change here actually covers all failing tests and use cases. If it does then I would expect to see no timeouts hitting 10s, only timeouts after 5s.

WDYT?

Copy link
Contributor

github-actions bot commented Jan 22, 2025

Test Results

   493 files   -  1     493 suites   - 1   9m 34s ⏱️ +10s
 4 296 tests  - 37   4 282 ✅  - 38   10 💤  - 3  0 ❌ ±0  4 🔥 +4 
16 539 runs   - 35  16 429 ✅  - 37  104 💤  - 4  0 ❌ ±0  6 🔥 +6 

For more details on these errors, see this check.

Results for commit 98da9eb. ± Comparison against base commit 081eb9b.

This pull request removes 37 tests.
org.eclipse.swt.graphics.ImageWin32Tests ‑ testImageDataForDifferentFractionalZoomsShouldBeDifferent
org.eclipse.swt.graphics.ImageWin32Tests ‑ testImageShouldHaveDimesionAsPerZoomLevel
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testByteArrayTransfer
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testFileTransfer
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testHtmlTransfer
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testImageTransfer_fromCopiedImage
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testImageTransfer_fromImage
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testImageTransfer_fromImageData
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testImageTransfer_fromImageDataFromImage
org.eclipse.swt.tests.win32.Test_org_eclipse_swt_dnd_DND ‑ testRtfTransfer
…

♻️ This comment has been updated with latest results.

@amartya4256
Copy link
Contributor Author

I would additionally reduce the timeout introduced in #1677 to 10 seconds (+5 seconds from your new timeout) just to make sure that your change here actually covers all failing tests and use cases. If it does then I would expect to see no timeouts hitting 10s, only timeouts after 5s.

WDYT?

Alright, I'll add it here.

@amartya4256 amartya4256 force-pushed the edge_timeout branch 2 times, most recently from 04d2500 to ae4fcac Compare January 24, 2025 15:34
Copy link
Contributor

@HeikoKlare HeikoKlare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to have another thorough look at what may cause Edge initialization to block the UI thread. When trying to test the timeout logic, I came across several places where I can introduce blocking operations that will not be caught by this proposal.

When systematically going through the public methods of Edge, I see that the calls to getWebView*() and isWebView*Availabile() are the ones that block in case initialization or any of the subsequently scheduled futures do not finish timely.
Then, taking a look at which calls may potentially block inside those methods, I see these two:

  • waitForFutureToFinish()
  • future.join()

The latter will be captured for the initialization future by the timeout added in this PR. The former will only partly be resolved, as inside waitForFutureToFinish() there is a call to processNextOSMessage(), which spins a loop that may be blocking as well. Thus, I think we need to break that potentially blocking operation as well. Maybe we can simply spin Display.readAndDisplay() in the loop checking for future completion (inside waitForFutureToFinish()) instead of having the additional loop inside processNextOSMessage()?

Then, what about the other callers of processNextOSMessage()? In particular, createEnvironment() is called synchronously when creating the first Edge instance. May that one block as well? I also see further calls of that method of which I am not sure if they are correct. We need to check them as well.

webViewFuture.orTimeout(1, TimeUnit.MILLISECONDS).exceptionally(exception -> {
// Throw exception on the Display thread directly to prevent CompletableFuture
// to wrap the exception and throw it silently
browser.getDisplay().execute(() -> SWT.error(SWT.ERROR_UNSPECIFIED, exception, "Edge Browser initialization timed out"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case we run into a timeout, the initialization needs to be properly rolled back in terms of the state of Edge being cleaned up and potentially OS resources to be freed, like in the abortion logic of createControllerInitializationCallback().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have extracted the environment release method and added that here as now we have a webview wrapper which makes sure that we dont have multiple futures dealing with multiple instances of webView but one future initializing everything as a whole. And I have verified, the initialization of webview* happens very quickly after obtaining the webview instance which gives the initialization an atomic behaviour. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally sounds good, but I am sure whether it works like it is implemented now. You will run into the timeout handler and release the environment while the webview initialization is still running asynchronously at the OS. Is that the right thing to do? I would rather expect that to eventually be done when the initialization callback is executed (at some point in time after the callback) or, maybe, after a rather high amount of further waiting time.

webViewFuture.orTimeout(1, TimeUnit.MILLISECONDS).exceptionally(exception -> {
// Throw exception on the Display thread directly to prevent CompletableFuture
// to wrap the exception and throw it silently
browser.getDisplay().execute(() -> SWT.error(SWT.ERROR_UNSPECIFIED, exception, "Edge Browser initialization timed out"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of only throwing an error, should we maybe hide the (uninitialized) browser widget and add a label informing the user about the failed initialization instead? Calling something like this inside the execution on display might do the trick:

	private void replaceWithErrorLabel() {
		browser.setVisible(false);
		Label errorLabel = new Label(browser.getParent(), SWT.WRAP);
		errorLabel.setForeground(browser.getDisplay().getSystemColor(SWT.COLOR_RED));
		errorLabel.setText("Edge browser initialization failed");
		errorLabel.setLocation(0, 0);
		errorLabel.setSize(browser.getSize());
	}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like his idea. But how about we combine this with also throwing an error even though it doesn't pop up in a dialog. The error is atleast visible in the Error Log of the IDE.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also if we don't use this execute method here, it will sill throw the error but the error will be wrapped in CompletionException and not SWTException since, the errors in the CompletableFuture are caught implicitly and wrapped with CompletionException. In case of SWT.error as well, I see the exception to be silent and there's no dialog coming up. What do you prefer? Is that okay to not wrap the exception with SWT.error and let it throw the CompletableFuture CompletionException and just replace the browser with a label?
image
image
The image above is for when I dont throw a wrapped SWT exception and let the future handle it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would still be in favor of logging some kind of error. Even though the user will primarily see the information in the UI, someone debugging issues may only have a look at the logs, which is why having such an error documented there is reasonable.

Copy link
Contributor Author

@amartya4256 amartya4256 Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so replaceBrowser with label and log an error? the error is logged automatically by the future of exception. Do you mean that or should we throw an exception explicitly using, browser.getDisplay().execute(() -> SWT.error(SWT.ERROR_UNSPECIFIED, exception, "Edge Browser initialization timed out"));?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no strong opinion this, but I see two things to consider:

  1. If possible, we should avoid throwing a future-specific exception to consumers of SWT/Edge and rather provide proper information inside an SWTException to them.
  2. We should have a look at the where consumers may face that exception and how that conforms to the existing APIs they are using (usually those of Browser).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if one is interested in the exception, get() is better than join()

@amartya4256 amartya4256 linked an issue Jan 26, 2025 that may be closed by this pull request
@amartya4256 amartya4256 force-pushed the edge_timeout branch 2 times, most recently from 8b86b9b to 893b4dd Compare January 27, 2025 13:25
@amartya4256
Copy link
Contributor Author

When systematically going through the public methods of Edge, I see that the calls to getWebView*() and isWebView*Availabile() are the ones that block in case initialization or any of the subsequently scheduled futures do not finish timely. Then, taking a look at which calls may potentially block inside those methods, I see these two:

  • waitForFutureToFinish()
  • future.join()

I evaluated your points and i have implemented everything into a webviewWrapper which is responsible for containing all the webviews and we need just one future (webViewWrapperFuture) now which provides us this wrapper from where all the webview* can be obtained, so now the dependency flow is like this: webViewWrapperFuture -> lastWebViewTask. for every webView*, it will be responsible - in case of a webViewWrapperFuture timeout, everything times out.

@amartya4256
Copy link
Contributor Author

The latter will be captured for the initialization future by the timeout added in this PR. The former will only partly be resolved, as inside waitForFutureToFinish() there is a call to processNextOSMessage(), which spins a loop that may be blocking as well. Thus, I think we need to break that potentially blocking operation as well. Maybe we can simply spin Display.readAndDisplay() in the loop checking for future completion (inside waitForFutureToFinish()) instead of having the additional loop inside processNextOSMessage()?

Then, what about the other callers of processNextOSMessage()? In particular, createEnvironment() is called synchronously when creating the first Edge instance. May that one block as well? I also see further calls of that method of which I am not sure if they are correct. We need to check them as well.

I am wondering if addressing processNextOSMessage is necessary after this, since we are looping until the future is done eventually and it can happen either if the initialization completes or the timeout, so there always exists a path with which future will be done. In terms of temporal logic, this is verified: A F webViewWrapperFuture.isDone()

The question about if processNextOSMessage is a blocking anything: what kind of messages are processed in there and what control do we have? Looks like, we are just calling display.readAndDispatch(), which is just processing the messages on the display thread from the OS, if something unrelated to Edge was blocking it here, it could have been blocked anywhere in the platform. And We have all the handlers well defined to handling the Edge messages in the callbacks properly, my concern is about the block:

while (!OS.PeekMessage (msg, 0, 0, 0, OS.PM_NOREMOVE)) {
		display.sleep();
	}

Other than that, I think we should retest the new implementation and try to see if we still have tests / workspcaes freezing. If no freezes, at least we will have logs for the timeout.

@HeikoKlare
Copy link
Contributor

I evaluated your points and i have implemented everything into a webviewWrapper which is responsible for containing all the webviews and we need just one future (webViewWrapperFuture) now which provides us this wrapper from where all the webview* can be obtained, so now the dependency flow is like this: webViewWrapperFuture -> lastWebViewTask. for every webView*, it will be responsible - in case of a webViewWrapperFuture timeout, everything times out.

That sounds good. There are several good proposals in this PR now. Can we please separate into different smaller PRs to ease reviewing and testing? In particular, putting the web views inside a wrapper captured by one future is something that we might do as a simple "cleanup" and preparatory PR for this.

@HeikoKlare
Copy link
Contributor

The question about if processNextOSMessage is a blocking anything: what kind of messages are processed in there and what control do we have? Looks like, we are just calling display.readAndDispatch(), which is just processing the messages on the display thread from the OS, if something unrelated to Edge was blocking it here, it could have been blocked anywhere in the platform. And We have all the handlers well defined to handling the Edge messages in the callbacks properly, my concern is about the block:

while (!OS.PeekMessage (msg, 0, 0, 0, OS.PM_NOREMOVE)) {
		display.sleep();
	}

This code will put the thread to sleep until some OS message is received. If for some reason no message arrives anymore, this will block the whole application. Not sure how likely it is, but I would be in favor of avoiding the risk. As mentioned above, it might be possible to simplify this code anway.

@fedejeanne
Copy link
Contributor

The question about if processNextOSMessage is a blocking anything: what kind of messages are processed in there and what control do we have? Looks like, we are just calling display.readAndDispatch(), which is just processing the messages on the display thread from the OS, if something unrelated to Edge was blocking it here, it could have been blocked anywhere in the platform. And We have all the handlers well defined to handling the Edge messages in the callbacks properly, my concern is about the block:

while (!OS.PeekMessage (msg, 0, 0, 0, OS.PM_NOREMOVE)) {
		display.sleep();
	}

This code will put the thread to sleep until some OS message is received. If for some reason no message arrives anymore, this will block the whole application. Not sure how likely it is, but I would be in favor of avoiding the risk. As mentioned above, it might be possible to simplify this code anway.

Can it be that this already happened? I'm looking at the failing tests and I see this thread-dump:

"main" prio=5 Id=1 RUNNABLE
	at app//org.eclipse.swt.internal.win32.OS.PeekMessage(Native Method)
	at app//org.eclipse.swt.browser.Edge.processNextOSMessage(Edge.java:481)
	at app//org.eclipse.swt.browser.Edge$WebViewProvider.waitForFutureToFinish(Edge.java:460)
	at app//org.eclipse.swt.browser.Edge$WebViewProvider.getWebView(Edge.java:383)
	at app//org.eclipse.swt.browser.Edge.getUrl(Edge.java:910)
	at app//org.eclipse.swt.browser.Browser.getUrl(Browser.java:771)
	at app//org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.createBrowser(Test_org_eclipse_swt_browser_Browser.java:309)
	at app//org.eclipse.swt.tests.junit.Test_org_eclipse_swt_browser_Browser.setUp(Test_org_eclipse_swt_browser_Browser.java:185)
	...

Maybe because the tests produce less OS messages and therefore the issue with the empty queue is more likely to happen? What I'm thinking is: an environment without a mouse (e.g. the test environment) probably generates way less OS events.

@laeubi
Copy link
Contributor

laeubi commented Jan 29, 2025

@fedejeanne Display.sleep() is very dangerous and you are correct that it can sleep "forever" (many seconds on linux if you don't move the mouse).

@@ -316,10 +316,23 @@ ICoreWebView2 initializeWebView(ICoreWebView2Controller controller) {
return webView;
}

private CompletableFuture<WebViewWrapper> initializeWebViewFutureWithTimeOut() {
CompletableFuture<WebViewWrapper> webViewWrapperFuture = new CompletableFuture<>();
webViewWrapperFuture.orTimeout(3, TimeUnit.SECONDS).exceptionally(exception -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
webViewWrapperFuture.orTimeout(3, TimeUnit.SECONDS).exceptionally(exception -> {
webViewWrapperFuture.orTimeout(3, TimeUnit.SECONDS).exceptionallyAsync(exception -> {...}, browser.getDisplay())

@laeubi
Copy link
Contributor

laeubi commented Jan 29, 2025

Just a general remark here:

The usage of CompleteableFuture here (even though very convenient and powerful) does not seem to be the proper way when looking at the overall design here.

Instead one should probably better use a Task-List design here, so that different task are processed one after the other, this could be for example a (single thread) ExecutorService where one submit(...) tasks, or a Queue that is processed by a special thread (e.g. if SWT thread is needed) unless it is empty.

If ExecutorService is used then one can also combine it with CompleteableFuture, but then it should be used in an event driven way without ever using a join().

Beside that, if a browser control fails to initialize, I think it is fine to just show nothing / something broken / whatever to the user so no need to replace it with something else.

@laeubi
Copy link
Contributor

laeubi commented Jan 29, 2025

When systematically going through the public methods of Edge

I think the key is actually: What public API methods really require blocking operations.

These methods should not use join() at all, they must use a busy loop that drives the event queue while waiting on this similar to what we do here (stage == future!):

stage.handle((nil1, nil2) -> {
if (!display.isDisposed()) {
try {
display.wake();
} catch (SWTException e) {
// ignore then, this can happen due to the async nature between our check for
// disposed and the actual call to wake the display can be disposed
}
}
return null;
});

while (!future.isDone() && !display.isDisposed()) {
if (!display.readAndDispatch()) {
display.sleep();
}
}

the wake is important to break out the sleep in a timely way!

@HeikoKlare
Copy link
Contributor

Thanks for the feedback! The proposals make sense and revising the design again might be a good thing. But we should move that discussion into a separate issue. This PR is part of efforts to mitigate the risk of freezes / UI thread blocks by using Edge for the upcoming release. For that reason, we need to improve the current design instead of revising the design, which we may defer to the next development cycle.

@laeubi
Copy link
Contributor

laeubi commented Jan 29, 2025

@HeikoKlare as a workaround replace all parts that do a join() on the event thread with this code:

public static void waitForFuture(CompletableFuture<?> future, Display display) {
	if (!future.isDone()) {
		future.handle((nil1, nil2) -> {
			if (!display.isDisposed()) {
				try {
					display.wake();
				} catch (SWTException e) {
					// ignore then, this can happen due to the async nature between our check for
					// disposed and the actual call to wake the display can be disposed
				}
			}
			return null;
		});
		while (!future.isDone() && !display.isDisposed()) {
		 	if (!display.readAndDispatch()) {
		 		display.sleep();
		 	}
		 }
	}
}

Then you don't need special timeout handling and won't see UI freeze.

@laeubi
Copy link
Contributor

laeubi commented Jan 29, 2025

it looks odd, but I think

private <T> void waitForFutureToFinish(CompletableFuture<T> future) {
	while(!future.isDone()) {
		Display display = Display.getCurrent();
		future.handle((nil1, nil2) -> {
			if (!display.isDisposed()) {
				try {
					display.wake();
				} catch (SWTException e) {
					// ignore then, this can happen due to the async nature between our check for
					// disposed and the actual call to wake the display can be disposed
				}
			}
			return null;
		});
		MSG msg = new MSG();
		while (!OS.PeekMessage (msg, 0, 0, 0, OS.PM_NOREMOVE)) {
			if (!future.isDone()) {
				display.sleep();
			}
		}
		display.readAndDispatch();
	}
}

would probably prevent blocking (future itself does not use (a)sync exec).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Application freezes due to using Edge
4 participants