Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ON-16327: Improving RTT estimation for delegated sends #66

Open
wants to merge 1 commit into
base: v8_1
Choose a base branch
from

Conversation

andresa-xilinx
Copy link

rtseq is not properly initialized for delegated sends.

In zf_delegated_send_complete(...), rtseq is initialized on line 172

[https://github.com/Xilinx-CNS/tcpdirect/blob/14d93228f700fad32ea7fa93f88d79d59d7748ad/src/lib/zf/zf_ds.c#L172]
by calling tcp_output_timers_common() with seq == snd_lbb.

snd_lbb is updated right before this call (on line L169). snd_lbb is the “next sequence buffer to be buffered”.

Assume a delegated send 1 has a seqnum range from B (inclusive) to E (exclusive), the snd_lbb is B before line 169 and E after line 169 so rtseq will be set to E in the call to tcp_output_timers_common().

When the segment 1 is acked, tcp_calculate_rtt_est() is called but TCP_SEQ_LT(pcb->rtseq, ackno) is false. Another segment 2 starting with seqnum E will need to be sent for the RTT to be updated and the RTT estimate will include the “quiet” time between sending segment 1 and 2 and therefore can be overstated by a large amount

Compare the other calls to tcp_output_timers_common() in tcp_out.c:

tcp_output_timers_common(stack, tcp, ntohl(hdr->seq));
tcp_output_timers_common(stack, tcp, tcp_seg_seq(seg));

Reproducer case:

The issue can be reproduced using our in-house test apps:

Exchange server sending multicast traffic at the lowest rate (-r 1):

andresa@sup-cap1:/usr/local/src/onload-9.0.0.39/build/gnu_x86_64/tests/trade_sim$ ./exchange enp129s0f0 -i 2 -n 2 -r 1 -s

Trader TCP Direct DS efvi app (current bad state):

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -d -s 100 enp129s0f0 192.168.5.2 &>log_delegated_send_incorrect.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_delegated_send_incorrect.txt| grep  -e sa                      
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=2434393844 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=2434393953 sa=29 sv=23
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=2434394153 sa=49 sv=38
andresa@sup-cap2:~/git/tcpdirect$

Note sa and sv which increases with higher number of hearbeats as the customer observes.

Also, observe that with the current (bad) code the regular send works OK - as expected:

Trader Regular Send App

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -s 100 enp129s0f0 192.168.5.2 &>log_regular_send_correct_old_code.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_regular_send_correct_old_code.txt| grep  -e sa                 
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=3867133740 sa=8 sv=2
enp129s0f0/138  1022 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=3867133749 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1044 seq=3867133849 sa=8 sv=2
enp129s0f0/138  1066 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=3867133949 sa=8 sv=2
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1089 seq=3867134049 sa=8 sv=2

Observe it produces the desired result, ie, constant RTT for each of the spaced out TCP packets.

FIX

Trader TCP Direct DS efvi app (with the fix):

Using Delegated Sends (-d):

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -d -s 100 enp129s0f0 192.168.5.2 &>log_delegated_send_correct.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_delegated_send_correct.txt| grep  -e sa                     
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=2768039698 sa=8 sv=2
enp129s0f0/138  1022 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=2768039707 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1044 seq=2768039807 sa=8 sv=2
enp129s0f0/138  1066 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=2768039907 sa=8 sv=2
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1089 seq=2768040007 sa=8 sv=2

Using Regular Sends:

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -s 100 enp129s0f0 192.168.5.2 &>log_regular_send_correct.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_regular_send_correct.txt| grep  -e sa                       
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=3309402769 sa=8 sv=2
enp129s0f0/138  1022 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=3309402778 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1044 seq=3309402878 sa=8 sv=2
enp129s0f0/138  1066 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=3309402978 sa=8 sv=2
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1089 seq=3309403078 sa=8 sv=2

@andresa-xilinx andresa-xilinx requested a review from a team as a code owner February 13, 2025 17:27
Copy link
Contributor

@jfeather-amd jfeather-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I have one comment which I would like to discuss before giving this a tick to help my own understanding of this problem, but also worth noting that this should probably target v8_1 given it's a bugfix.

For future reviewers, the unhelpfully named and undocumented sa and sv are estimated RTT and deviation respectively (see also this description of the algorithm used to estimate RTT).

rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);
if( ZF_UNLIKELY(rc < 0) )
goto out;
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did wonder if this be the below instead?

Suggested change
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb - 1);

Based on my reading of the delegated sends API (from trader_tcpdirect_ds_efvi), zf_delegated_send_complete is called to tell TCPDirect that we have sent a packet and it should update its internal state accordingly (e.g., note this being called after an ef_vi_transmit in the test app). In your example, sending sequences [B, E) you note that tcp_queue_sent_segments will update pcb->snd_lbb to E (i.e., the next sequence number to be sent, or equivalently, one plus the last sequence number sent). In which case I thought it might make sense to wait for D to estimate the RTT.

However, I expect this suggestion could lead to overestimation of the RTT as it really measures the RTT of three packets (sent at the same time) in this example. In which case your code which only estimates the RTT for the first packet seems more reasonable. Does this make sense, or have I made a mistake somewhere along the way?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe your understanding is correct. The intention is to estimate the current RTT as simple and consistent as possible.

Regarding your suggestion above, I'd recommend to reference the first byte sent for the following reasons:

  • TCP is a byte oriented protocol and there might be disagreements between the sender and the receiver in the MSS; see Section 4.2 of https://datatracker.ietf.org/doc/html/rfc2581

  • See also Karn and Partridge, "Improving Round-Trip Time Estimates in Reliable Transport Protocols" for further complications - the takeaway of this paper is to keep the current RTT estimation as simple and as consistent as possible. I mean, consistent with other parts of the code both in TCP Direct and in Onload - both products time the first byte of the most recent segment except for the bug in question.

  • I also believe there could be some sort of head of line blocking for estimating the current RTT (if we adopted pcb->snd_lbb-1). So, apart from the original (most recent) segment, which initial byte is pcb->snd_lbb (before it is modified), there may be other segments sent along that one (via iov). By timing the latest byte (pcb->snd_lbb-1), we are decreasing the chances of obtaining the sample for the most recent segment (pcb->snd_lbb before it is modified) as we do not know the fate for accompanying segments (e.g., those can be reordered or dropped) thereby impairing the current RTT estimation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the viable/correct minimal change is to move the tcp_output_timers_common call before sending the sent segments to the queue. I don't see any problem yet, and it does work in my tests.

Suggested change
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcv->snd_lbb);
rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);
if( ZF_UNLIKELY(rc < 0) )
goto out;

A potential problem could be related to rc < 0, however I believe it happens when the sent queue is too long and send needs to be retried, in which case tcp_output_timers_common is going to be called again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation and resources, they were quite helpful to consolidate my understanding!

Based on what you've said so far I do think the change you suggest in the above comment might be what we want. Notably, we set pcb->rttest to the current tick in tcp_ouput_timers_common, which is later used to calculate the sample RTT. With the current code, an app might fail here, poll the reactor, and try calling zf_delegated_send_complete again, which would have a different (later) timer tick than your new suggestion. In the newly suggested code, we would call it multiple times, but this would only do work the first time when pcb->rttest is 0.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also add that with the new suggestion the ZF_TCP_TIMER_RTO would start at the right time (if need be) if tcp_queue_sent_segments fails.

Copy link
Contributor

@jfeather-amd jfeather-amd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! Should this target v8_1 instead given it's a bugfix?

@jfeather-amd jfeather-amd requested a review from a team February 24, 2025 09:34
@andresa-xilinx andresa-xilinx changed the base branch from master to v8_1 February 24, 2025 17:22
Comment on lines 169 to 173
tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we fail to queue the segments? You've just kicked off the retransmit timer for a segment we don't have.

Copy link
Author

@andresa-xilinx andresa-xilinx Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll return -EAGAIN: no space on send queue and prepare should already have failed, or -EMSGSIZE: attempt to "complete" more bytes than were "prepared".
I'd expect the code to fail on delegated_send_complete (or before) and the customer's code to handle it accordingly, e.g., try again upon -EAGAIN and we eventually will have the segment and the timer set.

However, I see how this could be dangerous if the error is -EMSGSIZE.

Perhaps the previous solution should be the way to go:

diff --git a/src/lib/zf/zf_ds.c b/src/lib/zf/zf_ds.c
index 644b7b9..ccb41e1 100644
--- a/src/lib/zf/zf_ds.c
+++ b/src/lib/zf/zf_ds.c
@@ -166,10 +166,11 @@ int zf_delegated_send_complete(struct zft* ts, const struct iovec* iov,
     already_acked = 0;

     /* Queue remaining packets on send queue */
-    tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
+    uint32_t curr_snd_lbb=pcb->snd_lbb;
     rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);
     if( ZF_UNLIKELY(rc < 0) )
       goto out;
+    tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, curr_snd_lbb);

     unsigned snd_buf_consumed = pcb->mss * rc;
     if( snd_buf_consumed > pcb->snd_buf )

which I've found inelegant, but I believe addresses the problem you've pointed out.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment in the code explaining why we need to use the value cached before queuing? It would be helpful to capture some of the explanation on the PR there, so it's obvious to future code maintainers why this is done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment - but I think you still need to address the issue of queuing the timer for a non-queued segment, such as with the change you outlined above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants