Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(websocket): release client-lock during WEBSOCKET_EVENT_DATA (IDFGH-14545) #704

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bryghtlabs-richard
Copy link
Contributor

@bryghtlabs-richard bryghtlabs-richard commented Dec 2, 2024

Description

Prevent deadlocks when reserving locks in WEBSOCKET_EVENT_DATA handler, and lock is held by another thread sending a websocket message.

Fix high latency caused by writers serialized with WEBSOCKET_EVENT_DATA while calling esp_websocket_client_send(), even when TCP window has enough space for the entire message being queued to send.

Multiple writers are still serialized at fragment boundaries, but only with other writers and websocket error updates.

Related

Fixes #625.

Testing

I tested with a WebSocket echo server running on the same WiFi access point, using the target-example with the following modifications:

  • AppMain thread tries to send 20 unfragmented frames, waiting 100ms between frames.
  • WebSocket client is patched to record the time taken to send each frame.
  • WEBSOCKET_EVENT_DATA handler calls vTaskDelay(configTICK_RATE_HZ); to simulate application-processing.

Without this patch, the AppMain thread experiences long delays, often either almost 1 second, and sometimes much longer(like 6 seconds) depending on which cores the AppMain and WebSocketClient thread run. The AppMain thread is unable to send at 10Hz, even though the TCP send buffer is never full.

With this patch, the AppMain thread experiences no delays writing websocket frames, and is able to send at 10Hz, as long as the TCP send buffer is not filled.

Examples logs:
test_before_fix.txt
test_before_fix2.txt
test_with_fix.txt

Checklist

Before submitting a Pull Request, please ensure the following:

  • 🚨 This PR does not introduce breaking changes.
  • All CI checks (GH Actions) pass.
  • Documentation is updated as needed.
  • Tests are updated or added as necessary.
  • Code is well-commented, especially in complex areas.
  • Git history is clean — commits are squashed to the minimum necessary.

@erwinschrodinger1
Copy link

erwinschrodinger1 commented Jan 30, 2025

There was similar issue when I tried the esp_websocket_client. This PR commit solves the issue of sending in the socket but I am not being able to get the message response from the server.
image
The message data case is never called.

Thank you @bryghtlabs-richard for solving the send issue.

@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 30, 2025
@github-actions github-actions bot changed the title fix(websocket): release client-lock during WEBSOCKET_EVENT_DATA fix(websocket): release client-lock during WEBSOCKET_EVENT_DATA (IDFGH-14545) Jan 30, 2025
This resolves:

 1) Deadlock when trying to reserve a lock in WEBSOCKET_EVENT_DATA,
    but lock is held by a thread trying to send a websocket message.
 2) High latency caused by writers serialized with WEBSOCKET_EVENT_DATA
    while calling esp_websocket_client_send(), even when TCP window
    has enough space for the entire message being queued to send.

Multiple writers are still serialized at fragment boundaries, but
only with other writers and websocket error updates.

Fixes espressif#625
@erwinschrodinger1
Copy link

Hello @bryghtlabs-richard. I tried using this after you did the another push. Then the same issue of unable to lock resembles which was solved after the first push.
image

Thank you.

@bryghtlabs-richard
Copy link
Contributor Author

bryghtlabs-richard commented Feb 3, 2025

@erwinschrodinger1, if you try to send WS data with timeout=0 while the WS client is locked by the receive thread, xSemaphoreTakeRecursive(client->lock, timeout=0) will immediately fail. You might have better luck with a small, non-zero timeout. It is normal for the WS client receive thread to lock the client->lock for short periods while processing.

@erwinschrodinger1
Copy link

@bryghtlabs-richard how should I solve my error then? Use a small delay after sending message each time?

@bryghtlabs-richard
Copy link
Contributor Author

@erwinschrodinger1 , I wouldn't add a separate delay, but I would try a small timeout, maybe around 100ms.

If a thread tries to send while the client is locked, xSemaphoreTakeRecursive() will delay up to the timeout, and if the client is unlocked sooner then the sending thread can continue sooner.

If a thread tries to send while the client is unlocked, the sender can lock/send/unlock without delay.

@erwinschrodinger1
Copy link

Sorry initially I mistook ticks with ms. Even after adding the timeout of around 10 second. The ws-client is not being able to lock. What can be the issue in it?
image

@bryghtlabs-richard
Copy link
Contributor Author

I would guess the client is either not setup properly, or lock is taken but not released. I can't think of anything that would take 10 seconds. Its possible a thread has locked it and forgotten to unlock it, but I don't see how this change would do it.

The next thing I'd want to know - is client->lock locked? If so, by which line of code?

@erwinschrodinger1
Copy link

Thank you so much @bryghtlabs-richard.

the issue was done by my handler which locked the WEBSOCKET_DATA event resulting in lock of client. It is was just me and my idiotic acts. Sorry.

@bryghtlabs-richard
Copy link
Contributor Author

No worries @erwinschrodinger1 , thanks for letting us know the cause was unrelated to this patchset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Opened Issue is new
Projects
None yet
4 participants