Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impl embedded_io_async::Read for UartRx<'_, Async> is very unreliable. #3144

Closed
mattiasgronlund opened this issue Feb 17, 2025 · 12 comments · Fixed by #3142
Closed

impl embedded_io_async::Read for UartRx<'_, Async> is very unreliable. #3144

mattiasgronlund opened this issue Feb 17, 2025 · 12 comments · Fixed by #3142
Labels
bug Something isn't working

Comments

@mattiasgronlund
Copy link

Bug description

We have been trying to use the UART together with WIFI on esp32s3 using eps-hal and embassy-rs, but our feeling has for quite some time been that the UART implementation is very unreliable under load.

I have been trying to understand the root cause of the problem, and I am currently leaning towards timing issues related to the unusual implementation where most work is done outside the interrupt handler.

An example of problematic code in the driver:

    let events_happened = UartRxFuture::new(self.uart.reborrow(), events).await;
    // always drain the fifo, if an error has occurred the data is lost
    let read = self.flush_buffer(buf);
    // check error events
    rx_event_check_for_error(events_happened)?;
  1. RX events are sampled.
  2. bytes are copied to the buffer.
  3. Errors are reported if there where any RX error events before starting to copy bytes. But what if an error occurred during the copy?

The time it take to copy bytes, may actually be very substantial if the embassy task got interrupted by e.g. WIFI tasks. I have seen that read_async() may return more than 200 bytes, even if the HW buffer is configured to 128 bytes.

But the feeling is also that depending on some timings, neither timeout nor fifo full threshold interrupt will trigger in some cases.

To Reproduce

I have created a demo-project to allow reproduction of the problems at https://github.com/mattiasgronlund/esp32s3-uart-stability.

The demo-project configures all three UARTS and spawns four tasks to test UART stability, and is currently tested against current main branch at acbb983.

There is one task writing increasingly long sequences of bytes, each sequence starts with a byte with the value 0 and increments it by one for each byte.

Then there are three tasks using one UART each, reading bytes written by the first task. Each of them have different size for their RX fifo full threshold 64,32,16 bytes.

There is also an idle-task and two tasks placing some load on the system by starting and stopping the WIFI connection.

Expected behavior

I would expect to see UART errors to happen, but not as frequently.
I would NOT expect breaks in the byte sequence unless an UART error was returned.

Why do we get overflow here?

4.300775 INFO Read[UART_2]: len: 15 [0]=121
14.300797 INFO Read[UART_1]: len: 2 [0]=134
14.300819 INFO Read[UART_0]: len: 8 [0]=128
14.354181 INFO Idle
14.403278 ERROR RX Error[UART_2]: FifoOverflowed
14.403388 ERROR RX Error[UART_1]: FifoOverflowed
14.403452 ERROR RX Error[UART_0]: FifoOverflowed
14.403514 INFO Read[UART_1]: len: 1 [0]=128

Idle was prioritised read_async seems to have missed interrupts, as it failed to get re-scheduled even if the previous call returned very few bytes

Why is it reading so few bytes?

7.455290 INFO Read[UART_0]: len: 1 [0]=64
7.455310 INFO Read[UART_1]: len: 1 [0]=64
7.455331 INFO Read[UART_2]: len: 1 [0]=64
7.462510 INFO Read[UART_2]: len: 4 [0]=65
7.462532 INFO Read[UART_1]: len: 4 [0]=65
7.462554 INFO Read[UART_0]: len: 4 [0]=65

Why is there all of a sudden a long Idle period, and then garbage in the data?

30 INFO Read[UART_1]: len: 21 [0]=206
27.301362 INFO Read[UART_0]: len: 53 [0]=174
27.327795 INFO Idle
27.428930 INFO Idle
27.439505 INFO Read[UART_2]: len: 122 [0]=113
27.439520 ERROR RX[UART_2]: expected 233 got 113 at 0/122 iteration 136
27.439592 INFO Read[UART_1]: len: 123 [0]=107
27.439607 ERROR RX[UART_1]: expected 227 got 107 at 0/123 iteration 136
27.439691 INFO Read[UART_0]: len: 124 [0]=107
27.439706 ERROR RX[UART_0]: expected 227 got 107 at 0/124 iteration 136
27.444040 INFO Read[UART_0]: len: 50 [0]=103
27.444083 INFO Read[UART_1]: len: 51 [0]=102

Environment

  • Target device:
    Chip type: esp32s3 (revision v0.1)
    Crystal frequency: 40 MHz
    Flash size: 16MB
    Features: WiFi, BLE
    MAC address: 34:85:18:9b:55:60
      • Crate name and version: esp-hal acbb983
@bugadani
Copy link
Contributor

Thanks, I'm in the process of revising this code, I'll see if I can improve this or not.

@bugadani
Copy link
Contributor

If you could please try #3142, that would be great. A few observations I have with that PR:

  • You'll often see that a read returns 1 bytes. That is likely because there is enough time between a previous FIFO full interrupt and flushing the data, that a new byte is received.
  • You'll see a long Idle period at 26-27 seconds. This is caused by your firmware: the writer task finished writing 256 bytes, and is sleeping for a second.

@mattiasgronlund
Copy link
Author

Thanks @bugadani for the quick reply.

I have tested with d6e82d5.

I get a feeling that it works better but I think RX error events are lost, i.e. I see that there are lost bytes when reading bytes in comparison to what has been written, but I do not get any FifoOverflowed error.

@MabezDev MabezDev removed the status:needs-attention This should be prioritized label Feb 18, 2025
@mattiasgronlund
Copy link
Author

Tested with 143a941

The test did not indicate any errors in the data, and I found no overflows, but instead it looks like it stopped writing to the UART after 627 seconds...

627.360302 INFO  Read[UART_2]: 16 => 88 of 238 for 10990
627.360336 INFO  Read[UART_1]: 32 => 88 of 238 for 10990
627.361689 INFO  Read[UART_2]: 16 => 104 of 238 for 10990
627.363080 INFO  Read[UART_2]: 16 => 120 of 238 for 10990
627.363114 INFO  Read[UART_1]: 32 => 120 of 238 for 10990
627.367339 INFO  Read[UART_2]: 8 => 128 of 238 for 10990
627.367371 INFO  Read[UART_1]: 8 => 128 of 238 for 10990
627.367400 WARN  Wrote: 238
627.367426 INFO  Read[UART_0]: 64 => 128 of 238 for 10990
627.422214 INFO  Read[UART_2]: 55 of 239 for 10991
627.422259 INFO  Read[UART_1]: 55 of 239 for 10991
627.422282 INFO  Read[UART_1]: 1 => 56 of 239 for 10991
627.422319 INFO  Read[UART_2]: 1 => 56 of 239 for 10991
627.422988 INFO  Read[UART_0]: 64 of 239 for 10991
627.423331 INFO  Idle
627.423677 INFO  Read[UART_2]: 16 => 72 of 239 for 10991
627.425067 INFO  Read[UART_2]: 16 => 88 of 239 for 10991
627.425102 INFO  Read[UART_1]: 32 => 88 of 239 for 10991
627.426455 INFO  Read[UART_2]: 16 => 104 of 239 for 10991
627.427845 INFO  Read[UART_2]: 16 => 120 of 239 for 10991
627.427880 INFO  Read[UART_1]: 32 => 120 of 239 for 10991
627.431746 INFO  Read[UART_2]: 8 => 128 of 239 for 10991
627.431783 INFO  Read[UART_1]: 8 => 128 of 239 for 10991
627.431821 INFO  Read[UART_0]: 64 => 128 of 239 for 10991
627.523495 INFO  Idle
627.625395 INFO  Idle
627.714010 INFO  Starting wifi
627.716934 INFO  Wifi started!

@mattiasgronlund
Copy link
Author

Just a quick update, I restarted the test and now it has run for >900 sec with no RX errors, but with quite a lot of data corruption, probably a consequence of FifoOverflowed.

@bugadani
Copy link
Contributor

bugadani commented Feb 18, 2025

I'm not sure what I can do about that. If wifi blocks the reader tasks for a considerable amount of time, but the writer task is writing more data than can be buffered, data will be lost. Ideally you'd have hardware flow control to throttle the writer in this case.

@mattiasgronlund
Copy link
Author

I might not have been clear enough, I have no problem loosing data, bat reading corrupted data caused by a buffer overflow in the UART without any indication is bad.

That is, it seems like the with the patches for the UART, there is no longer any overflow errors reported.

@bugadani
Copy link
Contributor

bugadani commented Feb 18, 2025

1680422 (edit: and 8ec3d29) was supposed to remedy the missed errors. It's later than your most recent test, have you tried the PR's current state? If not, can I ask you to? Missing errors is definitely not something we want to do.

@mattiasgronlund
Copy link
Author

Thanks, 8ec3d29 works much better!

Looks like there now is a mix, sometimes I get silent corruption and sometime I get FifoOverflowed.

I will try to improve test-application to provide more context to the logs when an error is returned or when the data seems corrupt.

@bugadani
Copy link
Contributor

re. silently missing data, you're using Write::write(&mut tx, &wbuf[0..=n]).await.unwrap(); to send bytes, without checking the actually transmitted amount. To quote the documentation of write:

If it's ready to accept bytes, a non-zero amount of bytes is written from the beginning of buf, and the amount is returned. It is not guaranteed that all available buffer space is filled, i.e. it is possible for the implementation to write an amount of bytes less than buf.len() while the writer continues to be ready to accept more bytes immediately.

The internal TX FIFO is 128 bytes long, so it's unlikely you actually send more data than that. Please use write_all instead if you absolutely need all data to be sent.

@mattiasgronlund
Copy link
Author

Oops, your are right, I just copied the write part from examples/src/bin/embassy_serial.rs.

I have been trying to improve other things in the demo project, just trying to find new ways to torture the UART some more, but I will probably not have any results before the weekend.

But we could close link this issue with #3142, and if/when I find more issues I can create a new issue for that part, as in any case your fixes seems to make a huge difference is stability.

@bugadani
Copy link
Contributor

Sounds like a plan, thanks for the reproducer and the report, it has been very helpful :)

@github-project-automation github-project-automation bot moved this from Todo to Done in esp-rs Feb 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants