ON-16327: Improving RTT estimation for delegated sends #66

andresa-xilinx · 2025-02-13T17:27:29Z

rtseq is not properly initialized for delegated sends.

In zf_delegated_send_complete(...), rtseq is initialized on line 172

[https://github.com/Xilinx-CNS/tcpdirect/blob/14d93228f700fad32ea7fa93f88d79d59d7748ad/src/lib/zf/zf_ds.c#L172]
by calling tcp_output_timers_common() with seq == snd_lbb.

snd_lbb is updated right before this call (on line L169). snd_lbb is the “next sequence buffer to be buffered”.

Assume a delegated send 1 has a seqnum range from B (inclusive) to E (exclusive), the snd_lbb is B before line 169 and E after line 169 so rtseq will be set to E in the call to tcp_output_timers_common().

When the segment 1 is acked, tcp_calculate_rtt_est() is called but TCP_SEQ_LT(pcb->rtseq, ackno) is false. Another segment 2 starting with seqnum E will need to be sent for the RTT to be updated and the RTT estimate will include the “quiet” time between sending segment 1 and 2 and therefore can be overstated by a large amount

Compare the other calls to tcp_output_timers_common() in tcp_out.c:

tcp_output_timers_common(stack, tcp, ntohl(hdr->seq));
tcp_output_timers_common(stack, tcp, tcp_seg_seq(seg));

Reproducer case:

The issue can be reproduced using our in-house test apps:

Exchange server sending multicast traffic at the lowest rate (-r 1):

andresa@sup-cap1:/usr/local/src/onload-9.0.0.39/build/gnu_x86_64/tests/trade_sim$ ./exchange enp129s0f0 -i 2 -n 2 -r 1 -s

Trader TCP Direct DS efvi app (current bad state):

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -d -s 100 enp129s0f0 192.168.5.2 &>log_delegated_send_incorrect.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_delegated_send_incorrect.txt| grep  -e sa                      
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=2434393844 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=2434393953 sa=29 sv=23
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=2434394153 sa=49 sv=38
andresa@sup-cap2:~/git/tcpdirect$

Note sa and sv which increases with higher number of hearbeats as the customer observes.

Also, observe that with the current (bad) code the regular send works OK - as expected:

Trader Regular Send App

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -s 100 enp129s0f0 192.168.5.2 &>log_regular_send_correct_old_code.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_regular_send_correct_old_code.txt| grep  -e sa                 
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=3867133740 sa=8 sv=2
enp129s0f0/138  1022 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=3867133749 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1044 seq=3867133849 sa=8 sv=2
enp129s0f0/138  1066 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=3867133949 sa=8 sv=2
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1089 seq=3867134049 sa=8 sv=2

Observe it produces the desired result, ie, constant RTT for each of the spaced out TCP packets.

FIX

Trader TCP Direct DS efvi app (with the fix):

Using Delegated Sends (-d):

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -d -s 100 enp129s0f0 192.168.5.2 &>log_delegated_send_correct.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_delegated_send_correct.txt| grep  -e sa                     
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=2768039698 sa=8 sv=2
enp129s0f0/138  1022 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=2768039707 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1044 seq=2768039807 sa=8 sv=2
enp129s0f0/138  1066 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=2768039907 sa=8 sv=2
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1089 seq=2768040007 sa=8 sv=2

Using Regular Sends:

andresa@sup-cap2:~/git/tcpdirect$ ZF_ATTR="log_level=0x80" ./build/gnu_x86_64-zf-devel/bin/trade_sim/static/trader_tcpdirect_ds_efvi -s 100 enp129s0f0 192.168.5.2 &>log_regular_send_correct.txt
andresa@sup-cap2:~/git/tcpdirect$ cat log_regular_send_correct.txt| grep  -e sa                       
enp129s0f0/138  1000 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1000 seq=3309402769 sa=8 sv=2
enp129s0f0/138  1022 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1022 seq=3309402778 sa=8 sv=2
enp129s0f0/138  1044 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1044 seq=3309402878 sa=8 sv=2
enp129s0f0/138  1066 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1066 seq=3309402978 sa=8 sv=2
enp129s0f0/138  1089 | TCP#00 | tcp_calculate_rtt_est: rtt: est=1089 seq=3309403078 sa=8 sv=2

jfeather-amd

Looks good to me, I have one comment which I would like to discuss before giving this a tick to help my own understanding of this problem, but also worth noting that this should probably target v8_1 given it's a bugfix.

For future reviewers, the unhelpfully named and undocumented sa and sv are estimated RTT and deviation respectively (see also this description of the algorithm used to estimate RTT).

jfeather-amd · 2025-02-17T14:11:24Z

src/lib/zf/zf_ds.c

    rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);
    if( ZF_UNLIKELY(rc < 0) )
      goto out;
-    tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
+    tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);


I did wonder if this be the below instead?

Suggested change

tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);

tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb - 1);

Based on my reading of the delegated sends API (from trader_tcpdirect_ds_efvi), zf_delegated_send_complete is called to tell TCPDirect that we have sent a packet and it should update its internal state accordingly (e.g., note this being called after an ef_vi_transmit in the test app). In your example, sending sequences [B, E) you note that tcp_queue_sent_segments will update pcb->snd_lbb to E (i.e., the next sequence number to be sent, or equivalently, one plus the last sequence number sent). In which case I thought it might make sense to wait for D to estimate the RTT.

However, I expect this suggestion could lead to overestimation of the RTT as it really measures the RTT of three packets (sent at the same time) in this example. In which case your code which only estimates the RTT for the first packet seems more reasonable. Does this make sense, or have I made a mistake somewhere along the way?

Yes, I believe your understanding is correct. The intention is to estimate the current RTT as simple and consistent as possible.

Regarding your suggestion above, I'd recommend to reference the first byte sent for the following reasons:

TCP is a byte oriented protocol and there might be disagreements between the sender and the receiver in the MSS; see Section 4.2 of https://datatracker.ietf.org/doc/html/rfc2581

See also Karn and Partridge, "Improving Round-Trip Time Estimates in Reliable Transport Protocols" for further complications - the takeaway of this paper is to keep the current RTT estimation as simple and as consistent as possible. I mean, consistent with other parts of the code both in TCP Direct and in Onload - both products time the first byte of the most recent segment except for the bug in question.

I also believe there could be some sort of head of line blocking for estimating the current RTT (if we adopted pcb->snd_lbb-1). So, apart from the original (most recent) segment, which initial byte is pcb->snd_lbb (before it is modified), there may be other segments sent along that one (via iov). By timing the latest byte (pcb->snd_lbb-1), we are decreasing the chances of obtaining the sample for the most recent segment (pcb->snd_lbb before it is modified) as we do not know the fate for accompanying segments (e.g., those can be reordered or dropped) thereby impairing the current RTT estimation.

I wonder if the viable/correct minimal change is to move the tcp_output_timers_common call before sending the sent segments to the queue. I don't see any problem yet, and it does work in my tests.

Suggested change

tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);

tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcv->snd_lbb);

rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);

if( ZF_UNLIKELY(rc < 0) )

goto out;

A potential problem could be related to rc < 0, however I believe it happens when the sent queue is too long and send needs to be retried, in which case tcp_output_timers_common is going to be called again.

Thanks for the explanation and resources, they were quite helpful to consolidate my understanding!

Based on what you've said so far I do think the change you suggest in the above comment might be what we want. Notably, we set pcb->rttest to the current tick in tcp_ouput_timers_common, which is later used to calculate the sample RTT. With the current code, an app might fail here, poll the reactor, and try calling zf_delegated_send_complete again, which would have a different (later) timer tick than your new suggestion. In the newly suggested code, we would call it multiple times, but this would only do work the first time when pcb->rttest is 0.

I'd also add that with the new suggestion the ZF_TCP_TIMER_RTO would start at the right time (if need be) if tcp_queue_sent_segments fails.

jfeather-amd

Looks good, thanks! Should this target v8_1 instead given it's a bugfix?

sianj-xilinx · 2025-03-05T18:04:26Z

src/lib/zf/zf_ds.c

+    tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
    rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);


What happens if we fail to queue the segments? You've just kicked off the retransmit timer for a segment we don't have.

It'll return -EAGAIN: no space on send queue and prepare should already have failed, or -EMSGSIZE: attempt to "complete" more bytes than were "prepared".
I'd expect the code to fail on delegated_send_complete (or before) and the customer's code to handle it accordingly, e.g., try again upon -EAGAIN and we eventually will have the segment and the timer set.

However, I see how this could be dangerous if the error is -EMSGSIZE.

Perhaps the previous solution should be the way to go:

diff --git a/src/lib/zf/zf_ds.c b/src/lib/zf/zf_ds.c index 644b7b9..ccb41e1 100644 --- a/src/lib/zf/zf_ds.c +++ b/src/lib/zf/zf_ds.c @@ -166,10 +166,11 @@ int zf_delegated_send_complete(struct zft* ts, const struct iovec* iov, already_acked = 0; /* Queue remaining packets on send queue */ - tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb); + uint32_t curr_snd_lbb=pcb->snd_lbb; rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb); if( ZF_UNLIKELY(rc < 0) ) goto out; + tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, curr_snd_lbb); unsigned snd_buf_consumed = pcb->mss * rc; if( snd_buf_consumed > pcb->snd_buf )

which I've found inelegant, but I believe addresses the problem you've pointed out.

Can you add a comment in the code explaining why we need to use the value cached before queuing? It would be helpful to capture some of the explanation on the PR there, so it's obvious to future code maintainers why this is done.

Thanks for the comment - but I think you still need to address the issue of queuing the timer for a non-queued segment, such as with the change you outlined above.

andresa-xilinx requested a review from a team as a code owner February 13, 2025 17:27

jfeather-amd reviewed Feb 17, 2025

View reviewed changes

jfeather-amd requested a review from a team February 17, 2025 14:13

andresa-xilinx force-pushed the ON-16327 branch from 193e21f to 383ded0 Compare February 21, 2025 18:10

jfeather-amd approved these changes Feb 24, 2025

View reviewed changes

jfeather-amd requested a review from a team February 24, 2025 09:34

andresa-xilinx changed the base branch from master to v8_1 February 24, 2025 17:22

andresa-xilinx force-pushed the ON-16327 branch from 383ded0 to 90fb9e3 Compare February 24, 2025 17:24

sianj-xilinx reviewed Mar 5, 2025

View reviewed changes

andresa-xilinx force-pushed the ON-16327 branch from 90fb9e3 to 6d21734 Compare March 6, 2025 11:30

ON-16327: Improving RTT estimation for delegated sends

178e7d4

andresa-xilinx force-pushed the ON-16327 branch from 6d21734 to 178e7d4 Compare March 7, 2025 10:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ON-16327: Improving RTT estimation for delegated sends #66

ON-16327: Improving RTT estimation for delegated sends #66

andresa-xilinx commented Feb 13, 2025

jfeather-amd left a comment

jfeather-amd Feb 17, 2025

andresa-xilinx Feb 18, 2025

andresa-xilinx Feb 18, 2025

jfeather-amd Feb 18, 2025

andresa-xilinx Feb 21, 2025

jfeather-amd left a comment

sianj-xilinx Mar 5, 2025

andresa-xilinx Mar 5, 2025 •

edited

Loading

sianj-xilinx Mar 6, 2025

sianj-xilinx Mar 7, 2025

	tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, current_lbb);
	tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb - 1);

		tcp_output_timers_common(zf_stack_from_zocket(tcp), tcp, pcb->snd_lbb);
		rc = tcp_queue_sent_segments(tcp, &pcb->sendq, &iov_temp, &pcb->snd_lbb);

ON-16327: Improving RTT estimation for delegated sends #66

Are you sure you want to change the base?

ON-16327: Improving RTT estimation for delegated sends #66

Conversation

andresa-xilinx commented Feb 13, 2025

FIX

jfeather-amd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jfeather-amd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andresa-xilinx Mar 5, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andresa-xilinx Mar 5, 2025 •

edited

Loading