-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve/fix handling of read errors (IDFGH-7937) #232
Conversation
@david-cermak Thank you for the quick reply. On further investigation I now think this is an improvement (bug fix) even for ESP-IDF v5. I've updated the PR description accordingly. Can you take another look and let me know if this is something that makes sense to you? |
mqtt_client.c
Outdated
@@ -1030,7 +1013,8 @@ static esp_err_t deliver_publish(esp_mqtt_client_handle_t client) | |||
msg_total_len - msg_read_len > buf_len ? buf_len : msg_total_len - msg_read_len, | |||
client->config->network_timeout_ms); | |||
if (ret <= 0) { | |||
return esp_mqtt_handle_transport_read_error(ret, client) == 0 ? ESP_OK : ESP_FAIL; | |||
esp_mqtt_client_dispatch_transport_error(client); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted to call esp_mtt_handle_transport_read_error(ret, client)
but then return ESP_FAIL unconditionally.
mqtt_client.c
Outdated
if (read_len == 0) { | ||
// On ESP-IDF v5+ and v4.4, esp_transport_read()=0 means timeout. | ||
ESP_LOGD(TAG, "mqtt_message_receive(): no message (read timeout)"); | ||
return 0; | ||
} | ||
if (read_len < 0) { | ||
// On ESP-IDF v5+ and v4.4, esp_transport_read()<0 on close or error. | ||
esp_mqtt_client_dispatch_transport_error(client); | ||
return -1; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here. Please revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to a bit of logic around esp_mqtt_handle_transport_read_error()
as discussed above.
mqtt_client.c
Outdated
esp_mqtt_client_dispatch_transport_error(client); | ||
return -1; // Mid-message timeout or connection-close are errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this makes sense to treat timeouts/EOFs as errors here, but I'd suggest using the previous notation:
esp_mqtt_client_dispatch_transport_error(client); | |
return -1; // Mid-message timeout or connection-close are errors. | |
esp_mqtt_handle_transport_read_error(read_len, client); | |
return -1; // Mid-message timeout or connection-close are errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
mqtt_client.c
Outdated
esp_mqtt_client_dispatch_transport_error(client); | ||
return -1; // Mid-message timeout or connection-close are errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
esp_mqtt_client_dispatch_transport_error(client); | |
return -1; // Mid-message timeout or connection-close are errors. | |
esp_mqtt_handle_transport_read_error(read_len, client); | |
return -1; // Mid-message timeout or connection-close are errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
mqtt_client.c
Outdated
esp_mqtt_client_dispatch_transport_error(client); | ||
return -1; // Mid-message timeout or connection-close are errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
esp_mqtt_client_dispatch_transport_error(client); | |
return -1; // Mid-message timeout or connection-close are errors. | |
esp_mqtt_handle_transport_read_error(read_len, client); | |
return -1; // Mid-message timeout or connection-close are errors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Sorry for the delay and thanks for updating the PR. I agree that in certain places we can (and should!) treat the timeouts as errors. Also please update the docs of the internal function, e.g. /*
* Returns:
- * -1 in case of failure
+ * -1 in case of failure or timeout while one message reading is in progress
* 0 if no message has been received
* 1 if a message has been received and placed to client->mqtt_state:
* message length: client->mqtt_state.message_length
* message content: client->mqtt_state.in_buffer
*/
static int mqtt_message_receive(esp_mqtt_client_handle_t client, int read_poll_timeout_ms) |
Sorry about the long hiatus! I hope you can remember what this is all about. I've updated the PR to be against current head, and did my best to address your requests; see what you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, changes LGTM now. We would proceed with merging this internally (maybe making some minor cosmetic changes), but should merge and publish soon.
{ | ||
if (err == ERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN) { | ||
ESP_LOGD(TAG, "%s: transport_read(): EOF", __func__); | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @egnor , i think regard ERR_TCP_TRANSPORT_CONNECTION_TIMEOUT
as error is not suitable. Do you meet any case that need regard time-out as error? I think when err is ERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN
, we need return -1
here. Could you please only change code with return -1
here to test the case you reported?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is now void
and no longer returns anything at all. (Are you sure you're looking at the changed version of the code? This annotation is on the old version.)
Whether ESP_TCP_TRANSPORT_CONNECTION_TIMEOUT
is an error or not depends on context, and that case is handled in mqtt_message_receive()
below, where it's treated as an error mid-message but not an error for the first byte.
Can you clarify your request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@egnor Yes, i mean that on the old version, which not include your changes, only modify the one line that when err
is ERR_TCP_TRANSPORT_CONNECTION_CLOSED_BY_FIN
, we return -1
instead of 0
.
When we handle the mid-message and catch the ESP_TCP_TRANSPORT_CONNECTION_TIMEOUT
, we still can return 0
and not treate it as error. And then we will continue call mqtt_message_receive
to read the data for mid-message. As far as I understand, the first byte timeout and the subsequent byte timeout can actually be considered not errors. So I want to know whether you meet the problem, if you don’t handle mid-message time-out as error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you mean.
It's important that CLOSED_BY_FIN
and other such "hard" errors are always treated as errors, whether they happen at the start of the message or mid-message.
As for whether CONNECTION_TIMEOUT
is an error, even if midmessage-- well that's something of a policy matter, but if you can ensure that it is handled gracefully, and that keepalive failure is still handled correctly, then I suppose it could work if there's something like a very slow network connection.
Fixes error processing on network reads: 1) Treat EOF as an error, since the connection is closed (FIN) from the server side. If we didn't we would try to read (in the next iteration) from the same socket that has been already closed and get an error ENOTCONN. Before the fix: D (13760) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF E (13800) transport_base: tcp_read error, errno=Socket is not connected E (13800) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=128 D (13810) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=0 I (13820) esp_mqtt_demo: MQTT_EVENT_ERROR E (13830) esp_mqtt_demo: Last error reported from esp-tls: 0x8008 E (13830) esp_mqtt_demo: Last error captured as transport's socket errno: 0x80 I (13840) esp_mqtt_demo: Last errno string (Socket is not connected) E (13850) mqtt_client: mqtt_process_receive: mqtt_message_receive() returned -1 D (13860) mqtt_client: Reconnect after 10000 ms D (13860) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2 I (13870) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED After the fix: E (12420) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): EOF E (12420) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=128 D (12430) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=0 I (12440) esp_mqtt_demo: MQTT_EVENT_ERROR E (12450) esp_mqtt_demo: Last error reported from esp-tls: 0x8008 I (12450) esp_mqtt_demo: Last errno string (Success) E (12460) mqtt_client: mqtt_process_receive: mqtt_message_receive() returned -1 D (12470) mqtt_client: Reconnect after 10000 ms D (12470) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2 I (12480) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED 2) Treat timeouts in the middle of MQTT message reading as errors (if timeouted for the second time and didn't read a byte) Before the fix: D (9160) mqtt_client: mqtt_message_receive: read "remaining length" byte: 0x2 D (9170) mqtt_client: mqtt_message_receive: total message length: 4 (already read: 2) D (19190) mqtt_client: mqtt_message_receive: read_len=0 D (19190) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read(): call timed out before data was ready! E (19200) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned 0 E (19210) mqtt_client: MQTT connect failed D (19220) mqtt_client: Reconnect after 10000 ms D (19220) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2 I (19230) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED After the fix: D (19190) mqtt_client: mqtt_message_receive: read_len=0 E (19190) mqtt_client: Network timeout while reading MQTT message E (19200) mqtt_client: esp_mqtt_handle_transport_read_error: transport_read() error: errno=119 D (19210) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=0 I (19220) esp_mqtt_demo: MQTT_EVENT_ERROR I (19220) esp_mqtt_demo: Last errno string (Success) E (19230) mqtt_client: esp_mqtt_connect: mqtt_message_receive() returned -1 E (19240) mqtt_client: MQTT connect failed D (19240) mqtt_client: Reconnect after 10000 ms D (19240) esp_mqtt_demo: Event dispatched from event loop base=MQTT_EVENTS, event_id=2 I (19250) esp_mqtt_demo: MQTT_EVENT_DISCONNECTED (Note that the above log is from mid-message timeout of CONNECT message, which was hadled before the fix. If the mid-message timeout ocurs with for example SUBACK, the current version would repeatably resend susbscribe message) Merges #232
@egnor @ESP-YJM Thanks for taking the time and effort in posting the changes and reviewing. There were indeed few issues with the mqtt client, but the changes are very much different to those you proposed, so I fixed it in a separate commit. You can check ddde502 for more details. There were two issues:
The point |
git log --oneline e6afdb4025fe018ae0add44e3c45249ea1974774..aa6f889fb4f6f743b3a550aa587713aabbdca1fc Detailed description of the changes: * fix: regard reason codes greater than 0x80 as failures. - See merge request espressif/esp-mqtt!205 - See commit espressif/esp-mqtt@e7b9aa5 * PR: Return on allocation failure - See merge request espressif/esp-mqtt!204 - set last_retransmit to now when first connected (espressif/esp-mqtt@c06f154) - add return to faile_message, avoid segment fault (espressif/esp-mqtt@37478a9) * Minor warning of unused variable - See merge request espressif/esp-mqtt!203 - fix: Minor warning of unused variable (espressif/esp-mqtt@726e5f2) * Cover the case for SOC without MAC address - See merge request espressif/esp-mqtt!202 - fix: Cover the case for SOC without MAC address (espressif/esp-mqtt@5e3abd4) * Make state and size atomic - See merge request espressif/esp-mqtt!199 - feat: Make state and size atomic (espressif/esp-mqtt@891380b) * fix: Adjust the log level on few messages to avoid cluthering the logs - See merge request espressif/esp-mqtt!201 - See commit espressif/esp-mqtt@5c17fc4 * fix: Make automatic client_id soc dependent - See merge request espressif/esp-mqtt!200 - See commit espressif/esp-mqtt@657a2ae * Clarify data that users need to take care of lifetime. - See merge request espressif/esp-mqtt!197 - docs: Clarify data that users need to take care of lifetime. (espressif/esp-mqtt@371f594) * Update mqtt_client.h - See merge request espressif/esp-mqtt!198 - add const char * to esp_mqtt_client_subscribe() generic macros (espressif/esp-mqtt@acdb66d) * client: Report failure on timeout in mid-message timeout (GitHub PR) - See merge request espressif/esp-mqtt!165 - Merges espressif/esp-mqtt#232 - client: Report failure on timeout in mid-message timeout (espressif/esp-mqtt@ddde502) * fix: Move buffer initialization to set config - See merge request espressif/esp-mqtt!194 - Closes espressif/esp-mqtt#267 - See commit espressif/esp-mqtt@ea0df31 * Fix check for message creation when processing publish - See merge request espressif/esp-mqtt!195 - fix: Deliver publish verifies if message was created only for QoS >0 (espressif/esp-mqtt@6780056)
git log --oneline e6afdb4025fe018ae0add44e3c45249ea1974774..aa6f889fb4f6f743b3a550aa587713aabbdca1fc Detailed description of the changes: * fix: regard reason codes greater than 0x80 as failures. - See merge request espressif/esp-mqtt!205 - See commit espressif/esp-mqtt@e7b9aa5 * PR: Return on allocation failure - See merge request espressif/esp-mqtt!204 - set last_retransmit to now when first connected (espressif/esp-mqtt@c06f154) - add return to faile_message, avoid segment fault (espressif/esp-mqtt@37478a9) * Minor warning of unused variable - See merge request espressif/esp-mqtt!203 - fix: Minor warning of unused variable (espressif/esp-mqtt@726e5f2) * Cover the case for SOC without MAC address - See merge request espressif/esp-mqtt!202 - fix: Cover the case for SOC without MAC address (espressif/esp-mqtt@5e3abd4) * Make state and size atomic - See merge request espressif/esp-mqtt!199 - feat: Make state and size atomic (espressif/esp-mqtt@891380b) * fix: Adjust the log level on few messages to avoid cluthering the logs - See merge request espressif/esp-mqtt!201 - See commit espressif/esp-mqtt@5c17fc4 * fix: Make automatic client_id soc dependent - See merge request espressif/esp-mqtt!200 - See commit espressif/esp-mqtt@657a2ae * Clarify data that users need to take care of lifetime. - See merge request espressif/esp-mqtt!197 - docs: Clarify data that users need to take care of lifetime. (espressif/esp-mqtt@371f594) * Update mqtt_client.h - See merge request espressif/esp-mqtt!198 - add const char * to esp_mqtt_client_subscribe() generic macros (espressif/esp-mqtt@acdb66d) * client: Report failure on timeout in mid-message timeout (GitHub PR) - See merge request espressif/esp-mqtt!165 - Merges espressif/esp-mqtt#232 - client: Report failure on timeout in mid-message timeout (espressif/esp-mqtt@ddde502) * fix: Move buffer initialization to set config - See merge request espressif/esp-mqtt!194 - Closes espressif/esp-mqtt#267 - See commit espressif/esp-mqtt@ea0df31 * Fix check for message creation when processing publish - See merge request espressif/esp-mqtt!195 - fix: Deliver publish verifies if message was created only for QoS >0 (espressif/esp-mqtt@6780056)
git log --oneline e6afdb4025fe018ae0add44e3c45249ea1974774..aa6f889fb4f6f743b3a550aa587713aabbdca1fc Detailed description of the changes: * fix: regard reason codes greater than 0x80 as failures. - See merge request espressif/esp-mqtt!205 - See commit espressif/esp-mqtt@e7b9aa5 * PR: Return on allocation failure - See merge request espressif/esp-mqtt!204 - set last_retransmit to now when first connected (espressif/esp-mqtt@c06f154) - add return to faile_message, avoid segment fault (espressif/esp-mqtt@37478a9) * Minor warning of unused variable - See merge request espressif/esp-mqtt!203 - fix: Minor warning of unused variable (espressif/esp-mqtt@726e5f2) * Cover the case for SOC without MAC address - See merge request espressif/esp-mqtt!202 - fix: Cover the case for SOC without MAC address (espressif/esp-mqtt@5e3abd4) * Make state and size atomic - See merge request espressif/esp-mqtt!199 - feat: Make state and size atomic (espressif/esp-mqtt@891380b) * fix: Adjust the log level on few messages to avoid cluthering the logs - See merge request espressif/esp-mqtt!201 - See commit espressif/esp-mqtt@5c17fc4 * fix: Make automatic client_id soc dependent - See merge request espressif/esp-mqtt!200 - See commit espressif/esp-mqtt@657a2ae * Clarify data that users need to take care of lifetime. - See merge request espressif/esp-mqtt!197 - docs: Clarify data that users need to take care of lifetime. (espressif/esp-mqtt@371f594) * Update mqtt_client.h - See merge request espressif/esp-mqtt!198 - add const char * to esp_mqtt_client_subscribe() generic macros (espressif/esp-mqtt@acdb66d) * client: Report failure on timeout in mid-message timeout (GitHub PR) - See merge request espressif/esp-mqtt!165 - Merges espressif/esp-mqtt#232 - client: Report failure on timeout in mid-message timeout (espressif/esp-mqtt@ddde502) * fix: Move buffer initialization to set config - See merge request espressif/esp-mqtt!194 - Closes espressif/esp-mqtt#267 - See commit espressif/esp-mqtt@ea0df31 * Fix check for message creation when processing publish - See merge request espressif/esp-mqtt!195 - fix: Deliver publish verifies if message was created only for QoS >0 (espressif/esp-mqtt@6780056)
(Edited)
This change simplifies and should correct socket error handling in the MQTT implementation for all ESP-IDF and MQTT versions, giving improved performance and correctness.
Explanation:
Previously, an error (return value <= 0) from
esp_transport_read()
from any locus was followed by a call toesp_mqtt_handle_transport_read_error()
, whose semantics are unclear but would return 0 for closed connection OR timeout, and return -1 (and log an error) for any other error type. The calling sites would generally return error for -1, and return success (but still abandon the operation) for 0.This was often not quite right.
Change:
Revise the error handling to make
esp_mqtt_handle_transport_read_error()
unconditionally describe the error and callesp_mqtt_client_dispatch_transport_error()
. The one place where timeouts should be accepted (the first-byte read) wraps that call accordingly. Comments are added and error messages improved.