Auto-pause capture when user switches captured content #4
(Modulo that setting
One question here is whether to pause at the track level or at the source level.
Thank you for migrating the discussion here. Repeating my last comment on the topic:
I think the comparison to
I applaud user-agent-centered features like this! — I like how they tend to keep the end-user in charge. But before exposing APIs for apps to potentially wrest control over such user-initiated changes, I'd like to understand the concrete problems that require app intervention, since gating output on the app seems directly counter to the user's wishes to inject different input. What needs cannot be met without pausing?
What processing would this be? What app would e.g. blur screen capture?
What constraints would this be? Since a window may be resized by the end-user, and be larger or smaller than a monitor, an app that only reacts to such changes upon source-switching would seem to have the wrong idea.
That application can add buttons for the user to do exactly that. What if the user intends to save different surfaces to the same file, using this user agent feature?
The user's wish will NOT be subverted; this control will just give the app time to adjust to the change that the user sprung on it. For example, if the user has interacted with the app and asked for the captured media to be written to a file, then when the user clicks share-this-tab-instead, auto-pausing will give the app time to (1) close the old file, (2) open a new file, and (3) start piping new frames to the new file.
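As an illustration, the pause-adjust-resume flow described above can be simulated with mocks. `MockRecorder` and `MockCaptureSession` are stand-ins invented for this sketch; the `onswitch` handler and the auto-toggled `enabled` flag mirror the proposal, not any shipped API:

```javascript
// Minimal simulation of the proposed auto-pause flow.
// MockRecorder stands in for real file I/O.
class MockRecorder {
  constructor(name) { this.name = name; this.frames = []; this.closed = false; }
  write(frame) { if (!this.closed) this.frames.push(frame); }
  close() { this.closed = true; }
}

// MockCaptureSession stands in for a captured track + controller.
class MockCaptureSession {
  constructor() { this.enabled = true; this.onswitch = null; }
  userSwitchesSurface() {
    this.enabled = false;            // user agent auto-pauses
    if (this.onswitch) this.onswitch();
  }
}

const session = new MockCaptureSession();
let recorder = new MockRecorder('surface-1.webm');

session.onswitch = () => {
  recorder.close();                               // (1) close the old file
  recorder = new MockRecorder('surface-2.webm');  // (2) open a new file
  session.enabled = true;                         // (3) resume into the new file
};

recorder.write('frame-A');         // goes to the old file
session.userSwitchesSurface();
recorder.write('frame-B');         // goes to the new file
```

With auto-pause, no frames can land in the old file between the user's switch and step (3).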
Please read my earlier messages, as well as the previous paragraph of this message.
Cropping using Region Capture. If the user starts sharing a new tab, you might want to pause the capture until you can communicate with the other tab, obtain a CropTarget and apply it.
Meet uses a different frame-rate when capturing tabs, windows and screens.
Why burden the user?
Then that user will use an app that saves everything to the same file. I gave you examples of how this API could be used. Nowhere did I say that we'll be scouring the Internet and blocklisting any app that does not comport itself to my example.
Jan-Ivar, have you perhaps missed that this new API is opt-in, and that the default behavior will remain to NOT auto-pause?
I've rewatched the WebRTC WG January interim, Jan-Ivar, and I see you debated the shape with me. Now you seem to have forgotten the use-cases and are confused about the proposal. What has changed? Could you try rewatching my presentation there?
Following on today's call, here is some related feedback:
Assuming we want it at the track level, the API below could fit the bill:
Additional note, we could even reduce the API shape with a callback of the form
The use case as presented would seem solved by a simple event fired before a page is navigated in a captured tab. Why can't a user agent do that? Then pausing isn't necessary, and the API is super simple. If we accept pausing is necessary, my preference would be to:
This would accomplish several things:
That makes sense if we go with a source-level API; let's discuss this important design point first.
This issue:
This issue was mentioned in WEBRTCWG-2023-06-27 (Page 16)
As mentioned during the June meeting of the Screen Capture CG, the issue presented in this thread might be tackled along with a different one, using a mechanism other than auto-pause. Other issue:
Reasonable applications deployed today are likely to check the surface type only once, immediately after gDM resolves, and set cascading state based on it (modifying user-facing controls that can later trigger step 4). Such applications would not have been written to check again before step 4, because dynamic switching across surface types did not yet exist. The backwards-compatible solution I presented during the Screen Capture CG meeting is as follows:

```js
const controller = new CaptureController();
controller.enableDynamicSwitching('switch', (event) => {
  const videoElement = document.getElementById('myVideoElement');
  videoElement.srcObject = event.mediaStream;
  // And/or outgoing peer connection; imagine whatever you find relevant.
});
const stream = await navigator.mediaDevices.getDisplayMedia({ controller });
// ...
```

The spec would then say that the user agent MAY constrain dynamic-switching options based on whether

Note how the solution provides auto-pause "for free" by virtue of creating a new stream (and new tracks).
I have some API nits (and agree we shouldn't overload addEventListener). I'd prefer:

```js
const controller = new CaptureController();
controller.addEventListener("switch", ({stream}) => video.srcObject = stream);
controller.manualSwitching();
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});
```

This is modeled on messagePort.start(), which is "only needed when using EventTarget.addEventListener; it is implied when using onmessage", which makes this equivalent:

```js
const controller = new CaptureController();
controller.onswitch = ({stream}) => video.srcObject = stream;
video.srcObject = await navigator.mediaDevices.getDisplayMedia({controller});
```

I.e.

While I'm coming around on the API, I don't like the idea of limiting this existing and useful end-user feature only to sites that opt in. That seems like a loss for users. Hence the name

I'd prefer if this new API came with language encouraging browsers not to gate the feature only on apps that opt in to the event model. I have reservations about abandoning the old injection model, which had some unique advantages:

So I'm not convinced the event model is always better just because it wins solving the most advanced case (the tightly integrated crop-enabled VC+SLIDE app). For VC apps that DON'T have integrated SLIDE apps, it's not clear why the event API is needed (apart from audio, which I propose a solution for below).

When it comes to audio (which I see as the lone flaw in the injection model for most VC sites), the following might work well enough in the injection model:

```js
const stream = await navigator.mediaDevices.getDisplayMedia({audio: true});
video.srcObject = stream;
stream.onaddtrack = ({track}) => console.log(`audio added`);
```

The UA could subsequently mute the audio track as needed rather than remove it. Thoughts?
Note that the 2 other occurrences I've found of doing something more in the setter
For the record, I'm not a huge fan of this as developers may read
@jan-ivar, I think we have general alignment on the solution and are now discussing the specifics of the shape. That works for me. In the interest of clarity, I am splitting this response in two. This comment covers where we agree; the next comment covers an open issue for discussion.
That name change works for me. Almost. I think
Works for me.
We are not currently planning to abandon the old model in Chrome's implementation. Thank you for documenting some of the reasons to retain it.
In the interest of transparency, I'd like to remind you of surfaceSwitching. But please note that it's an orthogonal discussion. We are neither introducing
I think that we have multiple challenges and a single API that solves them all, so I don't think we need an additional API that only solves a subset of the challenges (audio).
This seems less than ideal for me. I think it's a footgun for developers when ... I wonder if we should eschew both
It seems better to shape the API as
Agreed, new tracks require extra work for the web developer; from that point of view, it is appealing to keep using the current track. As for the source mutating, the new-track approach's main benefit is that it is future-proof however the model evolves.
I am not a big fan, given many websites do tend to recreate MediaStreams on the fly or clone them. In the future, we might want to transfer MediaStream as well.
Youenn, could you please clarify which approach you are favoring here? (Note that I am not proposing abandoning the old model. So any application that has a reason to stick to it can still do so.)
(answering multiple people)
👍
Yes, this literally happened to me with web workers! So I sympathize, but I did figure it out, and there IS precedent here, which may help devs... This might come down to one's view of onfoo vs addEventListener("foo", ...)
Where is this advice? I recall it years back, but worry it may be outdated. See § 7.4. Always add event handler attributes. IMHO

```js
const video = document.createElement('video');
video.srcObject = stream;
await new Promise(r => video.onloadedmetadata = r);
```

...there's no problem since

In this issue, I'm even hearing an argument for a single callback which (I don't agree with but) I suspect means
I think it's generally agreed that not allowing multiple listeners is the footgun. E.g. two areas (functions) of an app both set onfoo or setMy(callback) or whatever, causing one to remove the other: a hard-to-debug action at a distance. Plenty of precedent says multiple events are OK. See §7.3. Don't invent your own event listener-like infrastructure.
Sure, but the same is true of the MediaStream(s) managed by RTCRtpReceiver. But I find it interesting that the video.srcObject sink handles it seamlessly in all browsers. But I agree real-world sinks would likely need to take some action, e.g. with the track-based RTCPeerConnection:

```js
stream.onaddtrack = ({track}) => pc.addTrack(track, stream);
```
This API solves the audio problem in the injection case, which doesn't require CaptureController. The question here probably comes down to whether we want to solve the audio injection case or not...
Yes, but that merely "signals whether the application would like the user agent to offer the user an option to dynamically switch". I.e. UAs are free to ignore it.
Topic was discussed yesterday and it seems there is some convergence. AIUI, the proposal is:
@eladalon1983, @jan-ivar, can you clarify whether this is a correct summary? If this is correct, we should consider removing step 4 (closing the old tracks). The web application could decide to stick with the old tracks and stop the new tracks instead if it is thought more adequate. This would allow easy migration in the simple cases (just do nothing for video, check for new audio).
Thanks, Youenn; I believe this summary is mostly accurate. [*]
I don't think this would work, because the old track would not get any new frames, even if it's ended. I'm also not sure what problem it's solving - if an existing application is already updated to call
Minor objection on my part here. I don't think I'd block on it, but I'd want a chance to argue against it, at a time that would not distract us away from the main discussion. [*] With some nits. For example, I don't currently see why it's necessary to mandate that no new frames be emitted on the new track before the old one is ended. But I think that's a minor point that can be discussed later.
The point would be to delay sending frames to the old track until the event listeners are executed. I see two main benefits:
I think we still want to go with a callback rather than an event, so as to avoid the issues with multiple tracks with the same source, and the interaction between multiple listeners. With that, I don't really see how we could avoid
Once a callback is set, using whichever method or attribute, there is no way for the UA to determine if the application wishes to receive a new audio track and retain the old video track. Not without additional API surfaces, at any rate, and those are past MVP in my opinion. We need to fire a new video track, and we need to stop delivering frames to the old one (or else they effectively share a source). I think we have a clear and simple solution in:
This is future-proof:
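To make the new-tracks-via-callback shape concrete, here is a toy model. `MockSource` and `MockTrack` are invented stand-ins (the real API shape is still under discussion): frames stop flowing to the old track at the moment of the switch, and the single registered callback is the one place that adopts the new track.

```javascript
// Toy model of the "new track delivered to a single callback" shape.
class MockTrack {
  constructor(label) { this.label = label; this.frames = []; this.live = true; }
}

class MockSource {
  constructor() { this.current = new MockTrack('surface-1'); this._cb = null; }
  setSwitchCallback(cb) { this._cb = cb; }
  deliver(frame) { if (this.current.live) this.current.frames.push(frame); }
  userSwitchesSurface() {
    const old = this.current;
    old.live = false;                          // old track stops receiving frames
    this.current = new MockTrack('surface-2'); // a new track is created
    if (this._cb) this._cb(this.current, old); // one place decides what to do
  }
}

const source = new MockSource();
let activeTrack = source.current;
// The app could also clone or stop oldTrack here.
source.setSwitchCallback((newTrack, oldTrack) => { activeTrack = newTrack; });

source.deliver('f1');          // lands on surface-1's track
source.userSwitchesSurface();
source.deliver('f2');          // lands on surface-2's track
```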
The callback approach allows this as well. Instead, we stick with the injection model for old tracks. The web page can stop the old tracks anyway. I am ok adding an option so that the web page tells the UA to stop the tracks (hence the various proposals I made on top of the callback). Having a callback to deliver the stream is better since there is one place where you decide what to do with the new tracks (clone it, stop it...). And the spec can be made clear that MediaStreamTracks are not created if the callback is not set. This is more difficult with events.
What enforces that the website can't keep both the old live injected track and the live new track? We need to specify this implicit action at a distance.
If this means there's one place where you decide what happens with the old tracks (enforced by the aforementioned action at a distance), then I agree that might be a good reason for a callback. Can we make it a settable attribute at least?
I do not see any implicit action at a distance; the website can keep both
In my mind, the default behavior (whether setting the callback or not) is that no track is being stopped by UA, the web page can deal with it by itself. We can enrich the callback to make the UA stop the previous tracks, for instance:
It is a bit less straightforward to extend things with an event. And again, it is not really compelling to have several event listeners sharing the responsibility to stop the old tracks (or the new tracks). Also, I could see a UA tailoring its UI based on the callback being registered (not showing the sharing audio check box if no audio was shared before the surface switching for instance).
Ah, good point, I guess this would disallow option 1 above.
(for the record - this was discussed in the joint SCCWG/WebRTC meeting last week)
A singular callback assumes a single downstream consumer. An app may have multiple consumers of an active screen-capture, e.g. a transmitter, a preview, and a recorder, each with distinct downstream needs. Tracks can be cloned, but a CaptureController cannot. So this becomes a choke point. We don't want different parts of an app competing to set the same callback and overwrite each other. The web platform tries hard to avoid single-consumer APIs. See § 7.3. Don’t invent your own event listener-like infrastructure, and requestVideoFrameCallback. I think we need a good reason to deviate from these principles.
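The overwrite hazard can be demonstrated with a plain EventTarget (used here as a stand-in for CaptureController; this sketch assumes nothing beyond standard DOM events):

```javascript
// Demonstrates the single-consumer choke point: a settable callback is
// silently replaced by a second consumer, while addEventListener lets
// every consumer react.
const controller = new EventTarget();
const seen = [];

// Callback-style: the second assignment discards the first consumer.
let onswitch = null;
onswitch = () => seen.push('preview');
onswitch = () => seen.push('recorder');   // 'preview' handler is lost

// Event-style: both consumers observe the switch.
controller.addEventListener('switch', () => seen.push('transmitter'));
controller.addEventListener('switch', () => seen.push('preview'));

if (onswitch) onswitch();                    // only 'recorder' runs
controller.dispatchEvent(new Event('switch'));
```

Only one callback consumer survives, while both event listeners fire: an illustration of the action-at-a-distance concern above.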
Those are per-track and cannot tell you whether the source changed or e.g. was just resized.
This seems like a marginal optimization compared to such a significant user action.
That's a fairly recent API with its own flaws. But it has a good reason: Many of its actions rely on the website to maintain a singular state. What's our reason?
We've gone around a few times on this point. Yes, the absence of a callback might preclude the app handling audio, but the presence of a callback does not guarantee it. But § 7.3. specifically mentions this point: for "an API which allows authors to start ... a process which generates notifications, use the existing event infrastructure"
Just check for
I don't see how this particular flaw applies here. MediaStreamTrack events are where you distribute the info.
Discussed at editor's meeting and we will try to converge via a design document.
Another proposal to consider; maybe it could help converge:
The spec says: "deviceIds are not exposed." It's not listed in § 5.4 Constrainable Properties for Captured Display Surfaces.
Why is that critical? This is the kind of action at a (maybe not so much) distance we should document. This might justify a callback.
Chrome and Safari are exposing deviceIds. Wrt callback vs. event, let's rediscuss this when we know what signals we want to expose.
I've filed crbug 372252497 and webkit 281077. Agree callback vs. event seems secondary. The main question seems to be over allowing late decision vs. limiting injection to tracks returned from getDisplayMedia(). What's the benefit of exposing surface tracks rather than new session tracks in the callback/event?
I am a bit confused about the purpose of "new session tracks". Who would need them? The entire idea of a "session track" is that it follows the session wherever it goes, whatever the captured surface is, across user-initiated changes. If a developer needs multiple such session tracks, can't they just clone the original ones? |
It seems useful information to provide; why not update the spec instead?
Thinking a bit more, I am not sure the separation between
Let's look at the following two scenarios:
Given this, and given UX in that area is relatively new, I am not sure we can design an API that specifies a particular flow. Having an API that exposes new tracks and having a requirement that video frames of the switching surface do not get provided to sinks until some event/callback actually runs might be good enough for now. Plus some guidelines... That said, AFAIK, the only thing UAs are doing right now is scenario 1 above. And this is what the initial message of this issue is describing. |
I have created a design document to provide for a more structured discussion of the different proposals (view-only): https://docs.google.com/document/d/16CUOJeuXimNPi4kZHOS9rF-WhMuVvOqOg9P--Dvqi_w/edit?usp=sharing Edit permissions will be granted to members of the WebRTC working group upon request.
To avoid confusion, I've defined a hybrid track to clarify what I mean. But it's really the surface track I'm questioning. [Edit 2: undid my edit to capture subtle differences] I've written up the model I have in mind as the late decision model. PTAL (edit: links fixed)
Jan-Ivar, I am unsure what your current position is, given the edits. Do you withdraw your question about "new session tracks"? My position is that there is no benefit to including new objects in a fired event, if these objects are identical to objects we had before. (That is - new session tracks are identical to the originals, and so "new" ones are useless.)
Sorry for editing multiple times. Calling mine a hybrid track now, to distinguish it. It wasn't clear from the session track definition that its feature set would be limited, which would be backwards incompatible. Do I understand correctly a driving goal of the session/surface split is to maintain the subclassing of MST? |
That's one vision of it, out of two - either (1) a normal, fully-fledged MediaStreamTrack, or (2) a reduced-feature-set MediaStreamTrack. But which is used is a secondary matter. The core offering of a session track, as I understand @tovepet here, is that it addresses your (Jan-Ivar's) expressed desire, to be able to seamlessly transition between two models (injection, switch-track).
I believe it is objectively true that Tove's model is more flexible.
I am a bit lost in what we are trying to solve here. |
We should also rely a bit more on an exploration of the use cases, which I see only includes a single use case atm. I have taken the liberty to add 4 more.
I have added a Scope section with the following bullets that I believe we want to solve in the first step:
Is this in line with what the rest of you think?
ok
Does this already exist?
The first two points are user-driven; the last one is already phrased in terms of API.
We have a solution for starting to capture, and we want a solution for when the user decides to switch surfaces, whether of the same type or a different type.
After the discussions in the Captured Surface Switching - Working Doc, it looks like the most promising way to reach agreement is to continue with the existing injection model and add the following extensions:
I have created three new issues for these extensions so that we can tackle them individually:
Burn the ships. |
Both Chrome and Safari now allow the user to change what surface is captured.
That's obviously great stuff. Can we make it better still?
So I propose that we add two things:
1. Auto-pausing: when the user switches the captured surface, the user agent pauses capture by setting the track's enabled to false.
2. An event informing the application of the switch. (The application can then adjust and set enabled back to true.)
Possibly the two can be combined, by specifying that setting an event handler signals that the pausing behavior is desired (@alvestrand's idea).
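A sketch of how that combined shape might behave, assuming handler registration itself opts into pausing. `MockTrack`, `MockController`, and `onsurfaceswitch` are hypothetical names invented for this sketch, not spec'd API:

```javascript
// Simulates "setting an event handler signals that pausing is desired":
// with no handler registered, switching never pauses; once a handler
// exists, the UA pauses before invoking it.
class MockTrack { constructor() { this.enabled = true; } }

class MockController {
  constructor(track) { this.track = track; this._handler = null; }
  set onsurfaceswitch(fn) { this._handler = fn; }
  userSwitchesSurface() {
    if (this._handler) {
      this.track.enabled = false;   // pause only because a handler exists
      this._handler();
    }
    // With no handler, capture continues uninterrupted (default behavior).
  }
}

const track = new MockTrack();
const controller = new MockController(track);

controller.userSwitchesSurface();          // no handler: no pause
const pausedWithoutHandler = !track.enabled;

let wasPausedOnEntry;
controller.onsurfaceswitch = () => {
  wasPausedOnEntry = !track.enabled;       // observe the auto-pause
  track.enabled = true;                    // adjust pipelines, then resume
};
controller.userSwitchesSurface();          // handler set: pause, notify, resume
```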
Another natural extension of this idea is to also apply it when a captured tab undergoes cross-origin navigation of the top-level document. When that happens, some applications might wish to stop capture momentarily and (in-content) prompt the user - "do you still want to keep sharing?"
Relevant previous discussion here.