Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHA-RL4b7][Room monitoring] Delayed RoomStatus transition to ATTACHING #436

Open
sacOO7 opened this issue Dec 13, 2024 · 9 comments
Open

Comments

@sacOO7
Copy link

sacOO7 commented Dec 13, 2024

  • As per CHA-RL4b7, If a room lifecycle operation is not in progress and the channel state is ATTACHING and no transient disconnect timeout exists for the contributor, then a transient disconnect timeout with a 5 second limit is created for the contributor. Upon timeout, the room status is transitioned to ATTACHING, using the reason from the initial channel state change as the error for the transition

┆Issue is synchronized with this Jira Task by Unito

@sacOO7 sacOO7 changed the title [CHA-RL4b7][Room monitoring] RoomStatus is set to ATTACHING after 5 seconds [CHA-RL4b7][Room monitoring] Delayed RoomStatus transition to ATTACHING Dec 13, 2024
@sacOO7
Copy link
Author

sacOO7 commented Dec 13, 2024

Associated Problems:

  • We currently need to manage an explicit attach timer that, when it expires after 5 seconds, sets the RoomStatus to Attaching.
  • This introduces additional complexity because the timer must be cleared when the channel transitions to other states during room monitoring.
  • It also needs to be reset when external operations like ATTACH, DETACH, or RELEASE are triggered. This is to conform to room atomicity. i.e. ongoing op isn't affected by RoomStatus change by external attach timer.
  • The current implementation of chat-js assumes that the ATTACHING state of a channel is emitted by the DETACHED event received from the server ( spec RTL13 )

// If we're in detached, we want to set a timeout to consider it transient
// If we don't already have one.
if (change.current === RoomStatus.Attaching && !this._transientDetachTimeouts.has(contributor)) {
this._logger.debug('RoomLifecycleManager(); detected channel detach', {
channel: contributor.channel.name,
});
const timeout = setTimeout(() => {
// If we get here, then we're still in the attaching state, so set the room status to attaching.
// We'll have the status as attaching and be optimistic that the channel will reattach, eventually.
// We'll let ably-js sort out the rest.
this._lifecycle.setStatus({ status: RoomStatus.Attaching, error: change.reason });
this._transientDetachTimeouts.delete(contributor);
clearTimeout(timeout);
}, transientDetachTimeout);
this._transientDetachTimeouts.set(contributor, timeout);

  • However, this isn't always the case. When the connection goes DISCONNECTED, the channel transitions from ATTACHED to ATTACHING if it reconnects within 2 minutes (before entering the SUSPENDED state). This behavior is in line with spec RTL3d
  • According to RTL3d, channels in the ATTACHING, ATTACHED, or SUSPENDED states must be reattached during reconnection.
  • As a result, we are not fully transparent about the RoomStatus.ATTACHING state in this common scenario.
  • This leads to a bug where the Room transitions from ATTACHED to ATTACHED again (as per CHA-RL4b8), without exposing the intermediate ATTACHING state.

Proposed Solution:

  • Instead of maintaining a separate timer, immediately set RoomStatus to ATTACHING (if it isn't already in that state).
  • If the ATTACHING state is triggered by a server-sent DETACH event (a rare case), we will still set the RoomStatus to ATTACHING. If the internal reattach attempt fails, the channel will enter the SUSPENDED state as per spec RTL13b. As part of the Room Monitoring spec CHA-RL4b9, this will cause the RoomStatus to transition to SUSPENDED and enter an internal retry loop.
  • This change eliminates the need to manage explicit attach timers and clear them during other operations. This helps with maintaining Room Atomicity where external timer for RoomStatus change will not interfere with ongoing operation.
  • It also makes the RoomStatus.ATTACHING state more transparent and fixes the bug where the Room incorrectly transitions from ATTACHED to ATTACHED during reconnection within 2 minutes.
  • As for handling feature discontinuities, users should rely on the callbacks registered for the respective room features, rather than tracking changes in RoomStatus.

@AndyTWF
Copy link
Collaborator

AndyTWF commented Dec 16, 2024

If I'm understanding you correctly - the inclusion of this timer was by design.

If the library temporarily goes from ATTACHED to ATTACHING to ATTACHED again in a short period with no errors of continuity, does it actually benefit the user to tell them that this has happened? The purpose of the timer was to make it so that transient state changes are handled internally without the user needing to take action.

Now, if we emit ATTACHED twice then I agree that's not so ideal - but I think not emitting the intermediary state is as intended.

When the connection goes DISCONNECTED, the channel transitions from ATTACHED to ATTACHING

Not quite - RTL3d is when connection goes to CONNECTED, RTL3e says that no changes to channel state are associated with the `DISCONNECTED state.

@sacOO7
Copy link
Author

sacOO7 commented Dec 16, 2024

  • I agree there are no channel state changes as per RTL3e
  • But as per RTL3d channel transitions from ATTACHED to ATTACHING, so RTL3e is a prerequisite to make it happen

@sacOO7
Copy link
Author

sacOO7 commented Dec 16, 2024

Now, if we emit ATTACHED twice then I agree that's not so ideal - but I think not emitting the intermediary state is as intended.

I strongly disagree : (

  • RoomStatus reflects the aggregated state of the underlying channels. If one or more channels transition to the ATTACHING state, this should immediately be reflected as a change in the RoomStatus (assuming no other operations are in progress).
  • I don't see any negative consequences to being transparent about the status change (we already have separate discontinuity events for each feature). Additionally, transitioning the room status from ATTACHED to ATTACHED could be seen as a bug from the user's perspective.
  • Introducing an attach timer complicates the implementation and could jeopardize the atomicity of ongoing room operations (if not managed correctly). Moreover, active timers could create challenges for future implementations as well.

@sacOO7
Copy link
Author

sacOO7 commented Dec 16, 2024

  • It's important to note that while RoomStatus.Attaching doesn't indicate message loss, the user can use it to display room status on the UI.
  • This implies that when we emit RoomStatus.ATTACHING event, messages will not be immediately sent/received across any features,so it won't be seen as a bug.
  • I'm considering the scenario where as per current impl. the Room is in the ATTACHED state, but no messages are being sent or received immediately ( because channels internally being ATTACHED), unfortunately, this will be seen as a bug : (

@sacOO7
Copy link
Author

sacOO7 commented Dec 16, 2024

  • Delayed RoomStatus transition to ATTACHING also seems to introduce a bug where room features will not adhere to the spec CHA-RL9
  • Since we are delaying the transition to RoomStatus.ATTACHING, features will not wait for Room to get ATTACHED to perform the given operation, instead, they will treat the room asATTACHED and messages will either be queued or rejected internally [ Also breaks room atomicity where some features might work/ some won't : | ]
  • As per CHA-RL9a, caller should wait till RoomStatus is ATTACHED, for other RoomStatus, it should return error with RoomInInvalidState code.
  • Currently, operation will not return with error even when Room goes into Suspended state after ATTACHING fails : (
  • So, current implementation doesn't conform to spec CHA-RL9 : (

@AndyTWF
Copy link
Collaborator

AndyTWF commented Dec 16, 2024

Introducing an attach timer complicates the implementation and could jeopardize the atomicity of ongoing room operations (if not managed correctly)

I don't think we should be potentially passing complexity onto users just because it requires more careful handling internally - surely the idea is that we take on the complexity and deal with it once so that users don't have to?

At the time of conception, CHA-RL9 wasn't a thing (which appreciate changes things), and this was the least-work-for-users approach at the time.

I'm considering the scenario where as per current impl. the Room is in the ATTACHED state, but no messages are being sent or received immediately ( because channels internally being ATTACHED), unfortunately, this will be seen as a bug

That's why the timeout is "transient" - its short enough that not receiving a couple of messages in that period won't be seen as a bug.

Delayed RoomStatus transition to ATTACHING also seems to introduce a bug where room features will not adhere to the spec

I agree that CHA-RL9 changes things and that we should look into it. My personal thought is that this shouldn't block the beta releases as this is an edge case, but is something we should address going forwards.

So, current implementation doesn't conform to spec

Yes - we added CHA-RL9 later on and have a ticket to conform to it at a later date (it's just not the highest priority at the moment).

@sacOO7
Copy link
Author

sacOO7 commented Dec 16, 2024

Thanks for the response. Though, I didn't mean what you exactly meant by

I don't think we should be potentially passing complexity onto users just because it requires more careful handling internally - surely the idea is that we take on the complexity and deal with it once so that users don't have to?

It's just a normal ATTACHING event and users can choose to ignore it since we already have feature discontinuity in place. The main point was to be transparent about RoomStatus and conform to room atomicity.
Also, I strongly believe there's unnecessary complexity being introduced into the code ( for emitting ATTACHING event ), without any concrete reason to back it up : (
As I mentioned earlier, chat-js assumes that the ATTACHING event is only triggered by a server detach action ( rare case ). However, this isn't always the case—it can also be triggered by a standard disconnection ( common case ). Therefore, we should be transparent about it.

@sacOO7
Copy link
Author

sacOO7 commented Dec 16, 2024

Also as per spec CHA-RS1g, we made clear that users only need to deal with RoomStatus.Failed events and don't need to bother about other events right 🤔
So, ATTACHING event doesn’t feel suspicious, unlike other events like DISCONNECTED or FAILED

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

2 participants