Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve/fix handling of read errors (IDFGH-7937) #232

Closed
wants to merge 5 commits into from

Conversation

egnor
Copy link

@egnor egnor commented Jul 28, 2022

(Edited)

This change simplifies and should correct socket error handling in the MQTT implementation for all ESP-IDF and MQTT versions, giving improved performance and correctness.

Explanation:

Previously, an error (return value <= 0) from esp_transport_read() from any locus was followed by a call to esp_mqtt_handle_transport_read_error(), whose semantics are unclear but would return 0 for closed connection OR timeout, and return -1 (and log an error) for any other error type. The calling sites would generally return error for -1, and return success (but still abandon the operation) for 0.

This was often not quite right.

  • When waiting on an idle socket for a message, a timeout is not an error, but a closed connection IS an error. Handling a closed connection as a non-error delays recognition of a closed connection (it has to be discovered when a message/keepalive is sent, or by failing an idleness check).
  • When waiting for the rest of a message body, both timeout (generally the 10sec network timeout) and closed connection are errors. In any case returning ESP_OK when the operation did not actually succeed is not right.

Change:

Revise the error handling to make esp_mqtt_handle_transport_read_error() unconditionally describe the error and call esp_mqtt_client_dispatch_transport_error(). The one place where timeouts should be accepted (the first-byte read) wraps that call accordingly. Comments are added and error messages improved.

@david-cermak
Copy link
Collaborator

Hi @egnor

We have already branched-off of the v4.x support, just to keep the code simple(-er), with less #ifdef's.

There's a branch idf_v4.x for IDF<=v4.4 that would still accept all fixes and important updates.

@github-actions github-actions bot changed the title Improve backward compatibility with ESP-IDF <=v4.4 Improve backward compatibility with ESP-IDF <=v4.4 (IDFGH-7937) Jul 29, 2022
@egnor egnor changed the title Improve backward compatibility with ESP-IDF <=v4.4 (IDFGH-7937) Improve/fix handling of read errors (IDFGH-7937) Jul 29, 2022
@egnor
Copy link
Author

egnor commented Jul 29, 2022

@david-cermak Thank you for the quick reply. On further investigation I now think this is an improvement (bug fix) even for ESP-IDF v5. I've updated the PR description accordingly. Can you take another look and let me know if this is something that makes sense to you?

mqtt_client.c Outdated
@@ -1030,7 +1013,8 @@ static esp_err_t deliver_publish(esp_mqtt_client_handle_t client)
msg_total_len - msg_read_len > buf_len ? buf_len : msg_total_len - msg_read_len,
client->config->network_timeout_ms);
if (ret <= 0) {
return esp_mqtt_handle_transport_read_error(ret, client) == 0 ? ESP_OK : ESP_FAIL;
esp_mqtt_client_dispatch_transport_error(client);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert this change

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted to call esp_mtt_handle_transport_read_error(ret, client) but then return ESP_FAIL unconditionally.

mqtt_client.c Outdated
Comment on lines 1126 to 1176
if (read_len == 0) {
// On ESP-IDF v5+ and v4.4, esp_transport_read()=0 means timeout.
ESP_LOGD(TAG, "mqtt_message_receive(): no message (read timeout)");
return 0;
}
if (read_len < 0) {
// On ESP-IDF v5+ and v4.4, esp_transport_read()<0 on close or error.
esp_mqtt_client_dispatch_transport_error(client);
return -1;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here. Please revert

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a bit of logic around esp_mqtt_handle_transport_read_error() as discussed above.

mqtt_client.c Outdated
Comment on lines 1161 to 1202
esp_mqtt_client_dispatch_transport_error(client);
return -1; // Mid-message timeout or connection-close are errors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this makes sense to treat timeouts/EOFs as errors here, but I'd suggest using the previous notation:

Suggested change
esp_mqtt_client_dispatch_transport_error(client);
return -1; // Mid-message timeout or connection-close are errors.
esp_mqtt_handle_transport_read_error(read_len, client);
return -1; // Mid-message timeout or connection-close are errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

mqtt_client.c Outdated
Comment on lines 1183 to 1224
esp_mqtt_client_dispatch_transport_error(client);
return -1; // Mid-message timeout or connection-close are errors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
esp_mqtt_client_dispatch_transport_error(client);
return -1; // Mid-message timeout or connection-close are errors.
esp_mqtt_handle_transport_read_error(read_len, client);
return -1; // Mid-message timeout or connection-close are errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

mqtt_client.c Outdated
Comment on lines 1215 to 1256
esp_mqtt_client_dispatch_transport_error(client);
return -1; // Mid-message timeout or connection-close are errors.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
esp_mqtt_client_dispatch_transport_error(client);
return -1; // Mid-message timeout or connection-close are errors.
esp_mqtt_handle_transport_read_error(read_len, client);
return -1; // Mid-message timeout or connection-close are errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@david-cermak
Copy link
Collaborator

Sorry for the delay and thanks for updating the PR. I agree that in certain places we can (and should!) treat the timeouts as errors.

Also please update the docs of the internal function, e.g.

/*
 * Returns:
- *     -1 in case of failure
+ *     -1 in case of failure or timeout while one message reading is in progress
 *      0 if no message has been received
 *      1 if a message has been received and placed to client->mqtt_state:
 *           message length:  client->mqtt_state.message_length
 *           message content: client->mqtt_state.in_buffer
 */
static int mqtt_message_receive(esp_mqtt_client_handle_t client, int read_poll_timeout_ms)

@egnor
Copy link
Author

egnor commented Dec 23, 2022

Sorry about the long hiatus! I hope you can remember what this is all about. I've updated the PR to be against current head, and did my best to address your requests; see what you think?

Copy link
Collaborator

@david-cermak david-cermak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates, changes LGTM now. We would proceed with merging this internally (maybe making some minor cosmetic changes), but should merge and publish soon.

{
if (err == ERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN) {
ESP_LOGD(TAG, "%s: transport_read(): EOF", __func__);
return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @egnor , i think regard ERR_TCP_TRANSPORT_CONNECTION_TIMEOUT as error is not suitable. Do you meet any case that need regard time-out as error? I think when err is ERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN, we need return -1 here. Could you please only change code with return -1 here to test the case you reported?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is now void and no longer returns anything at all. (Are you sure you're looking at the changed version of the code? This annotation is on the old version.)

Whether ESP_TCP_TRANSPORT_CONNECTION_TIMEOUT is an error or not depends on context, and that case is handled in mqtt_message_receive() below, where it's treated as an error mid-message but not an error for the first byte.

Can you clarify your request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@egnor Yes, i mean that on the old version, which not include your changes, only modify the one line that when err is ERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN, we return -1 instead of 0.

When we handle the mid-message and catch the ESP_TCP_TRANSPORT_CONNECTION_TIMEOUT , we still can return 0 and not treate it as error. And then we will continue call mqtt_message_receive to read the data for mid-message. As far as I understand, the first byte timeout and the subsequent byte timeout can actually be considered not errors. So I want to know whether you meet the problem, if you don’t handle mid-message time-out as error?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you mean.

It's important that CLOSED_BY_FIN and other such "hard" errors are always treated as errors, whether they happen at the start of the message or mid-message.

As for whether CONNECTION_TIMEOUT is an error, even if midmessage-- well that's something of a policy matter, but if you can ensure that it is handled gracefully, and that keepalive failure is still handled correctly, then I suppose it could work if there's something like a very slow network connection.

david-cermak added a commit that referenced this pull request Jan 24, 2024
Fixes error processing on network reads:

1) Treat EOF as an error, since the connection is closed (FIN) from the
server side. If we didn't we would try to read (in the next iteration)
from the same socket that has been already closed and get an error
ENOTCONN.
Before the fix:
D (13760) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF
E (13800) transport_base: tcp_read error, errno=Socket is not connected
E (13800) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=128
D (13810) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=0
I (13820) esp_mqtt_demo: MQTT_EVENT_ERROR
E (13830) esp_mqtt_demo: Last error reported from esp-tls: 0x8008
E (13830) esp_mqtt_demo: Last error captured as transport's socket errno: 0x80
I (13840) esp_mqtt_demo: Last errno string (Socket is not connected)
E (13850) mqtt_client: mqtt_process_receive: mqtt_message_receive() returned -1
D (13860) mqtt_client: Reconnect after 10000 ms
D (13860) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2
I (13870) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED
After the fix:
E (12420) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF
E (12420) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=128
D (12430) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=0
I (12440) esp_mqtt_demo: MQTT_EVENT_ERROR
E (12450) esp_mqtt_demo: Last error reported from esp-tls: 0x8008
I (12450) esp_mqtt_demo: Last errno string (Success)
E (12460) mqtt_client: mqtt_process_receive: mqtt_message_receive() returned -1
D (12470) mqtt_client: Reconnect after 10000 ms
D (12470) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2
I (12480) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED

2) Treat timeouts in the middle of MQTT message reading as errors (if
timeouted for the second time and didn't read a byte)
Before the fix:
D (9160) mqtt_client: mqtt_message_receive: read "remaining length" byte: 0x2
D (9170) mqtt_client: mqtt_message_receive: total message length: 4 (already read: 2)
D (19190) mqtt_client: mqtt_message_receive: read_len=0
D (19190) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): call timed out before data was ready!
E (19200) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned 0
E (19210) mqtt_client: MQTT connect failed
D (19220) mqtt_client: Reconnect after 10000 ms
D (19220) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2
I (19230) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED
After the fix:
D (19190) mqtt_client: mqtt_message_receive: read_len=0
E (19190) mqtt_client: Network timeout while reading MQTT message
E (19200) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=119
D (19210) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=0
I (19220) esp_mqtt_demo: MQTT_EVENT_ERROR
I (19220) esp_mqtt_demo: Last errno string (Success)
E (19230) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned -1
E (19240) mqtt_client: MQTT connect failed
D (19240) mqtt_client: Reconnect after 10000 ms
D (19240) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2
I (19250) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED
(Note that the above log is from mid-message timeout of CONNECT message,
which was hadled before the fix. If the mid-message timeout ocurs with
for example SUBACK, the current version would repeatably resend
susbscribe message)

Merges #232
@david-cermak
Copy link
Collaborator

@egnor @ESP-YJM Thanks for taking the time and effort in posting the changes and reviewing. There were indeed few issues with the mqtt client, but the changes are very much different to those you proposed, so I fixed it in a separate commit. You can check ddde502 for more details.

There were two issues:

  1. Treat EOF as an error, since the connection is closed (FIN) from the server side. If we didn't we would try to read (in the next iteration) from the same socket that has been already closed and get an error.

  2. Treat timeouts in the middle of MQTT message reading as errors (if timeouted for the second time and didn't read a byte)

The point 2) is a bit complicated, since the we know about the problem only in the second iteration.

espressif-bot pushed a commit to espressif/esp-idf that referenced this pull request Apr 17, 2024
git log --oneline e6afdb4025fe018ae0add44e3c45249ea1974774..aa6f889fb4f6f743b3a550aa587713aabbdca1fc

Detailed description of the changes:
* fix: regard reason codes greater than 0x80 as failures.
  - See merge request espressif/esp-mqtt!205
  - See commit espressif/esp-mqtt@e7b9aa5
* PR: Return on allocation failure
  - See merge request espressif/esp-mqtt!204
  - set last_retransmit to now when first connected (espressif/esp-mqtt@c06f154)
  - add return to faile_message, avoid segment fault (espressif/esp-mqtt@37478a9)
* Minor warning of unused variable
  - See merge request espressif/esp-mqtt!203
  - fix: Minor warning of unused variable (espressif/esp-mqtt@726e5f2)
* Cover the case for SOC without MAC address
  - See merge request espressif/esp-mqtt!202
  - fix: Cover the case for SOC without MAC address (espressif/esp-mqtt@5e3abd4)
* Make state and size atomic
  - See merge request espressif/esp-mqtt!199
  - feat: Make state and size atomic (espressif/esp-mqtt@891380b)
* fix: Adjust the log level on few messages to avoid cluthering the logs
  - See merge request espressif/esp-mqtt!201
  - See commit espressif/esp-mqtt@5c17fc4
* fix: Make automatic client_id soc dependent
  - See merge request espressif/esp-mqtt!200
  - See commit espressif/esp-mqtt@657a2ae
* Clarify data that users need to take care of lifetime.
  - See merge request espressif/esp-mqtt!197
  - docs: Clarify data that users need to take care of lifetime. (espressif/esp-mqtt@371f594)
* Update mqtt_client.h
  - See merge request espressif/esp-mqtt!198
  - add const char * to esp_mqtt_client_subscribe() generic macros (espressif/esp-mqtt@acdb66d)
* client: Report failure on timeout in mid-message timeout (GitHub PR)
  - See merge request espressif/esp-mqtt!165
  - Merges espressif/esp-mqtt#232
  - client: Report failure on timeout in mid-message timeout (espressif/esp-mqtt@ddde502)
* fix: Move buffer initialization to set config
  - See merge request espressif/esp-mqtt!194
  - Closes espressif/esp-mqtt#267
  - See commit espressif/esp-mqtt@ea0df31
* Fix check for message creation when processing publish
  - See merge request espressif/esp-mqtt!195
  - fix: Deliver publish verifies if message was created only for QoS >0
        (espressif/esp-mqtt@6780056)
espressif-bot pushed a commit to espressif/esp-idf that referenced this pull request Apr 17, 2024
git log --oneline e6afdb4025fe018ae0add44e3c45249ea1974774..aa6f889fb4f6f743b3a550aa587713aabbdca1fc

Detailed description of the changes:
* fix: regard reason codes greater than 0x80 as failures.
  - See merge request espressif/esp-mqtt!205
  - See commit espressif/esp-mqtt@e7b9aa5
* PR: Return on allocation failure
  - See merge request espressif/esp-mqtt!204
  - set last_retransmit to now when first connected (espressif/esp-mqtt@c06f154)
  - add return to faile_message, avoid segment fault (espressif/esp-mqtt@37478a9)
* Minor warning of unused variable
  - See merge request espressif/esp-mqtt!203
  - fix: Minor warning of unused variable (espressif/esp-mqtt@726e5f2)
* Cover the case for SOC without MAC address
  - See merge request espressif/esp-mqtt!202
  - fix: Cover the case for SOC without MAC address (espressif/esp-mqtt@5e3abd4)
* Make state and size atomic
  - See merge request espressif/esp-mqtt!199
  - feat: Make state and size atomic (espressif/esp-mqtt@891380b)
* fix: Adjust the log level on few messages to avoid cluthering the logs
  - See merge request espressif/esp-mqtt!201
  - See commit espressif/esp-mqtt@5c17fc4
* fix: Make automatic client_id soc dependent
  - See merge request espressif/esp-mqtt!200
  - See commit espressif/esp-mqtt@657a2ae
* Clarify data that users need to take care of lifetime.
  - See merge request espressif/esp-mqtt!197
  - docs: Clarify data that users need to take care of lifetime. (espressif/esp-mqtt@371f594)
* Update mqtt_client.h
  - See merge request espressif/esp-mqtt!198
  - add const char * to esp_mqtt_client_subscribe() generic macros (espressif/esp-mqtt@acdb66d)
* client: Report failure on timeout in mid-message timeout (GitHub PR)
  - See merge request espressif/esp-mqtt!165
  - Merges espressif/esp-mqtt#232
  - client: Report failure on timeout in mid-message timeout (espressif/esp-mqtt@ddde502)
* fix: Move buffer initialization to set config
  - See merge request espressif/esp-mqtt!194
  - Closes espressif/esp-mqtt#267
  - See commit espressif/esp-mqtt@ea0df31
* Fix check for message creation when processing publish
  - See merge request espressif/esp-mqtt!195
  - fix: Deliver publish verifies if message was created only for QoS >0
        (espressif/esp-mqtt@6780056)
espressif-bot pushed a commit to espressif/esp-idf that referenced this pull request Jun 13, 2024
git log --oneline e6afdb4025fe018ae0add44e3c45249ea1974774..aa6f889fb4f6f743b3a550aa587713aabbdca1fc

Detailed description of the changes:
* fix: regard reason codes greater than 0x80 as failures.
  - See merge request espressif/esp-mqtt!205
  - See commit espressif/esp-mqtt@e7b9aa5
* PR: Return on allocation failure
  - See merge request espressif/esp-mqtt!204
  - set last_retransmit to now when first connected (espressif/esp-mqtt@c06f154)
  - add return to faile_message, avoid segment fault (espressif/esp-mqtt@37478a9)
* Minor warning of unused variable
  - See merge request espressif/esp-mqtt!203
  - fix: Minor warning of unused variable (espressif/esp-mqtt@726e5f2)
* Cover the case for SOC without MAC address
  - See merge request espressif/esp-mqtt!202
  - fix: Cover the case for SOC without MAC address (espressif/esp-mqtt@5e3abd4)
* Make state and size atomic
  - See merge request espressif/esp-mqtt!199
  - feat: Make state and size atomic (espressif/esp-mqtt@891380b)
* fix: Adjust the log level on few messages to avoid cluthering the logs
  - See merge request espressif/esp-mqtt!201
  - See commit espressif/esp-mqtt@5c17fc4
* fix: Make automatic client_id soc dependent
  - See merge request espressif/esp-mqtt!200
  - See commit espressif/esp-mqtt@657a2ae
* Clarify data that users need to take care of lifetime.
  - See merge request espressif/esp-mqtt!197
  - docs: Clarify data that users need to take care of lifetime. (espressif/esp-mqtt@371f594)
* Update mqtt_client.h
  - See merge request espressif/esp-mqtt!198
  - add const char * to esp_mqtt_client_subscribe() generic macros (espressif/esp-mqtt@acdb66d)
* client: Report failure on timeout in mid-message timeout (GitHub PR)
  - See merge request espressif/esp-mqtt!165
  - Merges espressif/esp-mqtt#232
  - client: Report failure on timeout in mid-message timeout (espressif/esp-mqtt@ddde502)
* fix: Move buffer initialization to set config
  - See merge request espressif/esp-mqtt!194
  - Closes espressif/esp-mqtt#267
  - See commit espressif/esp-mqtt@ea0df31
* Fix check for message creation when processing publish
  - See merge request espressif/esp-mqtt!195
  - fix: Deliver publish verifies if message was created only for QoS >0
        (espressif/esp-mqtt@6780056)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants