-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ON-16327: Improving RTT estimation for delegated sends #66
base: v8_1
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, I have one comment which I would like to discuss before giving this a tick to help my own understanding of this problem, but also worth noting that this should probably target v8_1 given it's a bugfix.
For future reviewers, the unhelpfully named and undocumented sa
and sv
are estimated RTT and deviation respectively (see also this description of the algorithm used to estimate RTT).
src/lib/zf/zf_ds.c
Outdated
rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb); | ||
if( ZF_UNLIKELY(rc < 0) ) | ||
goto out; | ||
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb); | ||
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did wonder if this be the below instead?
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb); | |
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb - 1); |
Based on my reading of the delegated sends API (from trader_tcpdirect_ds_efvi), zf_delegated_send_complete
is called to tell TCPDirect that we have sent a packet and it should update its internal state accordingly (e.g., note this being called after an ef_vi_transmit in the test app). In your example, sending sequences [B, E) you note that tcp_queue_sent_segments
will update pcb->snd_lbb
to E (i.e., the next sequence number to be sent, or equivalently, one plus the last sequence number sent). In which case I thought it might make sense to wait for D to estimate the RTT.
However, I expect this suggestion could lead to overestimation of the RTT as it really measures the RTT of three packets (sent at the same time) in this example. In which case your code which only estimates the RTT for the first packet seems more reasonable. Does this make sense, or have I made a mistake somewhere along the way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe your understanding is correct. The intention is to estimate the current RTT as simple and consistent as possible.
Regarding your suggestion above, I'd recommend to reference the first byte sent for the following reasons:
-
TCP is a byte oriented protocol and there might be disagreements between the sender and the receiver in the MSS; see Section 4.2 of https://datatracker.ietf.org/doc/html/rfc2581
-
See also Karn and Partridge, "Improving Round-Trip Time Estimates in Reliable Transport Protocols" for further complications - the takeaway of this paper is to keep the current RTT estimation as simple and as consistent as possible. I mean, consistent with other parts of the code both in TCP Direct and in Onload - both products time the first byte of the most recent segment except for the bug in question.
-
I also believe there could be some sort of head of line blocking for estimating the current RTT (if we adopted
pcb->snd_lbb-1
). So, apart from the original (most recent) segment, which initial byte ispcb->snd_lbb
(before it is modified), there may be other segments sent along that one (via iov). By timing the latest byte (pcb->snd_lbb-1
), we are decreasing the chances of obtaining the sample for the most recent segment (pcb->snd_lbb
before it is modified) as we do not know the fate for accompanying segments (e.g., those can be reordered or dropped) thereby impairing the current RTT estimation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the viable/correct minimal change is to move the tcp_output_timers_common
call before sending the sent segments to the queue. I don't see any problem yet, and it does work in my tests.
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb); | |
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcv->snd_lbb); | |
rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb); | |
if( ZF_UNLIKELY(rc < 0) ) | |
goto out; |
A potential problem could be related to rc < 0
, however I believe it happens when the sent queue is too long and send needs to be retried, in which case tcp_output_timers_common
is going to be called again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation and resources, they were quite helpful to consolidate my understanding!
Based on what you've said so far I do think the change you suggest in the above comment might be what we want. Notably, we set pcb->rttest
to the current tick in tcp_ouput_timers_common
, which is later used to calculate the sample RTT. With the current code, an app might fail here, poll the reactor, and try calling zf_delegated_send_complete
again, which would have a different (later) timer tick than your new suggestion. In the newly suggested code, we would call it multiple times, but this would only do work the first time when pcb->rttest
is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also add that with the new suggestion the ZF_TCP_TIMER_RTO
would start at the right time (if need be) if tcp_queue_sent_segments
fails.
193e21f
to
383ded0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks! Should this target v8_1 instead given it's a bugfix?
383ded0
to
90fb9e3
Compare
src/lib/zf/zf_ds.c
Outdated
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb); | ||
rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we fail to queue the segments? You've just kicked off the retransmit timer for a segment we don't have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll return -EAGAIN: no space on send queue and prepare should already have failed, or -EMSGSIZE: attempt to "complete" more bytes than were "prepared".
I'd expect the code to fail on delegated_send_complete (or before) and the customer's code to handle it accordingly, e.g., try again upon -EAGAIN and we eventually will have the segment and the timer set.
However, I see how this could be dangerous if the error is -EMSGSIZE.
Perhaps the previous solution should be the way to go:
diff --git a/src/lib/zf/zf_ds.c b/src/lib/zf/zf_ds.c
index 644b7b9..ccb41e1 100644
--- a/src/lib/zf/zf_ds.c
+++ b/src/lib/zf/zf_ds.c
@@ -166,10 +166,11 @@ int zf_delegated_send_complete(struct zft* ts, const struct iovec* iov,
already_acked = 0;
/* Queue remaining packets on send queue */
- tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
+ uint32_t curr_snd_lbb=pcb->snd_lbb;
rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);
if( ZF_UNLIKELY(rc < 0) )
goto out;
+ tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, curr_snd_lbb);
unsigned snd_buf_consumed = pcb->mss * rc;
if( snd_buf_consumed > pcb->snd_buf )
which I've found inelegant, but I believe addresses the problem you've pointed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment in the code explaining why we need to use the value cached before queuing? It would be helpful to capture some of the explanation on the PR there, so it's obvious to future code maintainers why this is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment - but I think you still need to address the issue of queuing the timer for a non-queued segment, such as with the change you outlined above.
rtseq is not properly initialized for delegated sends.
In zf_delegated_send_complete(...), rtseq is initialized on line 172
[https://github.com/Xilinx-CNS/tcpdirect/blob/14d93228f700fad32ea7fa93f88d79d59d7748ad/src/lib/zf/zf_ds.c#L172]
by calling tcp_output_timers_common() with seq == snd_lbb.
snd_lbb is updated right before this call (on line L169). snd_lbb is the “next sequence buffer to be buffered”.
Assume a delegated send 1 has a seqnum range from B (inclusive) to E (exclusive), the snd_lbb is B before line 169 and E after line 169 so rtseq will be set to E in the call to tcp_output_timers_common().
When the segment 1 is acked, tcp_calculate_rtt_est() is called but
TCP_SEQ_LT(pcb->rtseq, ackno)
is false. Another segment 2 starting with seqnum E will need to be sent for the RTT to be updated and the RTT estimate will include the “quiet” time between sending segment 1 and 2 and therefore can be overstated by a large amountCompare the other calls to tcp_output_timers_common() in tcp_out.c:
tcp_output_timers_common(stack, tcp, ntohl(hdr->seq));
tcp_output_timers_common(stack, tcp, tcp_seg_seq(seg));
Reproducer case:
The issue can be reproduced using our in-house test apps:
Exchange server sending multicast traffic at the lowest rate (-r 1):
andresa@sup-cap1:/usr/local/src/onload-9.0.0.39/build/gnu_x86_64/tests/trade_sim$ ./exchange enp129s0f0 -i 2 -n 2 -r 1 -s
Trader TCP Direct DS efvi app (current bad state):
Note sa and sv which increases with higher number of hearbeats as the customer observes.
Also, observe that with the current (bad) code the regular send works OK - as expected:
Trader Regular Send App
Observe it produces the desired result, ie, constant RTT for each of the spaced out TCP packets.
FIX
Trader TCP Direct DS efvi app (with the fix):
Using Delegated Sends (-d):
Using Regular Sends: