Receiver is not recovering lost Data blk>0 seg>1 #79
-
I have a test case where my receiving application often misses the first message sent by the sending application. Both applications use the NORM API in NORM_OBJECT_STREAM mode. I’m using Linux traffic control to create random packet losses of 5% on the network interface used by the sending application. I start my receiving application and wait for it to indicate that it is blocking, waiting to receive the first message from the sending application. Then I start the sending application. In my test case, a total of 50 messages are sent.

It seems that if any message other than the first message (DATA obj>0 blk>0 seg>1) is affected by the network losses, the NORM NACKing protocol works as expected to deliver all 50 messages to the receiving application. Looking at the trace/debug file for a run where the receiving application does not get the first message, I can see that the receiving application did not receive the first “DATA obj” packet sent by the sending application (DATA obj>0 blk>0 seg>1). I can see the receiving application send a NACK packet, and the sending application responds with the requested “DATA obj”. But on the receiving side, these messages are shown in the trace/debug file, and it seems that the message is discarded rather than being passed up to the application layer:

trace>20:12:53.031043 node>1355305018 src>172.27.1.60/56330 inst>40430 seq>30 DATA obj>0 blk>0 seg>0 offset>0 len>117

I’m attaching the full trace/debug files for the sender and the receiver. I’m also attaching the source code I’m using to set up a socket to send, a socket to receive, and for the event handling. We have configuration for setting some of the NORM control parameters, so I provided the values we are using in this test case as comments.

In the initial observations of this scenario, I had the TxRobustFactor set to 5 to reduce the “chattiness”. We had done quite a bit of testing with that setting and had not observed anything unexpected; it’s only during the last month of testing that we’ve run into this scenario. So I put the TxRobustFactor back to the default of 20, but that has not made any difference in the scenario.

I do have the sync policy set to NORM_SYNC_CURRENT. I have considered that maybe NORM_SYNC_ALL would be required to address the scenario I’m seeing with the lost first message. However, I would not want a receiver that “joined the party late” to use NACKing to receive previous messages. For instance, if a receiver didn’t start until after the 10th message had been sent, so the first message it saw was the 11th message, I would not want it to use NACKing to receive the first 10 messages. I hope there is a way to distinguish between “the receiver was running when a lost message was sent, so recover it” and “the receiver was not running when that message was sent, so don’t recover it”.

Hoping for some advice on how to handle this situation. Thanks.

SenderLog230430.txt
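For reference, here is a minimal sketch of the kind of setup we use (the NORM API calls are real, but the addresses, node ids, and buffer sizes below are illustrative placeholders, not our actual configuration):

```cpp
#include <normApi.h>

// 'session' would come from NormCreateSession() on an instance created
// with NormCreateInstance(); sender and receiver are separate apps in practice.

// Sketch of the receiver-side configuration.
void ReceiverSetupSketch(NormSessionHandle session)
{
    NormSetDefaultSyncPolicy(session, NORM_SYNC_CURRENT);  // the sync policy in question
    NormStartReceiver(session, 1024 * 1024);               // rx buffer space (illustrative)
}

// Sketch of the sender-side configuration.
void SenderSetupSketch(NormSessionHandle session)
{
    NormSetTxRobustFactor(session, 20);  // was 5 initially; now back to the default of 20
    NormStartSender(session, NormGetRandomSessionId(),
                    1024 * 1024,  // tx buffer space (illustrative)
                    1400,         // segment size
                    16,           // numData: FEC block length
                    4);           // numParity
    NormObjectHandle stream = NormStreamOpen(session, 2 * 1024 * 1024);
    (void)stream;  // the 50 messages are written via NormStreamWrite()/NormStreamFlush()
}
```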
-
The “NORM sync policy” does control the behavior you describe here. The basic “SYNC_CURRENT” policy for DATA/FILE objects is for the receiver to wait until it sees a NORM_DATA message from the first FEC block of one of those objects before “syncing” and beginning to send repair requests (NACKs) for any content. Once a receiver syncs, it is then pretty tenacious, but the reason for that policy is to avoid having late-joining receivers penalize the forward progress of the group. For STREAM objects, this policy is augmented by having the first received NORM_DATA packet be used as the “sync index”, and NORM shouldn’t NACK for stream data earlier than that index under the SYNC_CURRENT policy.

However, the behavior you observe actually indicates there may be an inconsistency in the code in that regard. I can’t recall if I intended to set the index according to the index of the NORM_DATA packet received or according to the first segment of the first FEC block received. From the behavior described, it seems that the “repair check” is sending a NACK as if the index were set to segment zero of the first received FEC block id, but the receiver stream index is set according to that received packet’s block/segment id, and so it discards the retransmitted packet since it’s ordinally lower than the stream sync block/segment id. The sync policy option is something specific to my implementation, and the behavior is not defined in the NORM RFC. Probably, the intended behavior was to sync on a block basis, but one of the issues with NORM for very high speed applications is the sort of sloppy behavior that can happen with a bunch of receivers asynchronously joining the group, since the NACKing/retransmission is problematic when something like the ACK-based flow control is not used. (As an aside: the “NormSocket” API extension provides a more connection-oriented paradigm which could be used for more organized behaviors, since the receivers have a backchannel to the sender app, etc., and it also has the ACK-based flow control embedded into it.) In any case, I need to look into this. I will need to do that carefully so I don’t break anything, but possibly doing the stream sync on an FEC block basis might give you the desired behavior and be an appropriate solution for general “SYNC_CURRENT” utility. The potentially problematic aspect of this should probably just be addressed by using flow control properly.

The SYNC_ALL policy enables late-joining receivers (or a receiver that misses the first packet of a stream) to send NACKs requesting retransmission of any content the sender has buffered. This is useful for applications with receivers that don’t want to miss any data, and is a little more like a TCP connection in this regard. If your application generally has receivers starting close to the beginning of the sender transmission, this sync mode is useful. If you have receivers that join late/mid-stream, then this mode will cause the sender to retransmit older data. However, that is limited by the NORM “stream buffer” size your sender application sets when it opens a stream object. So, if your stream buffer is not too large, this may not be a large penalty?

Note there is also a SYNC_STREAM mode. This one is for the case where an application serializes a sequence of different/multiple stream objects or a mix of object types and wants to allow receivers to request repair to the beginning of a current stream, but not for earlier stream objects.
I haven’t personally tried this, but you potentially could have your sender break up its transmission into a series of stream objects using this policy. This could limit the utility of the NORM FEC-based repair strategy for multicast if your stream objects were small in size (limited FEC blocks). It would be a little more complex use of the API to manage the series of stream objects enqueued, and the code has not really been tested with this use pattern.

So, to summarize: I will investigate whether the SYNC_CURRENT policy should do its sync based on FEC block boundaries (note that if your FEC block size is zero, you will have the same behavior you are seeing now, minus the wasted NACK/retransmission). If your application allows late-joining receivers to join mid-stream, I’m not sure exactly what your concern here is with respect to “a little bit late” versus “a lot late”? 😉 Hopefully this is helpful. I sort of understand what you mean by the receiver running versus not running when the loss occurred. I guess that assumes the receiver is started with some knowledge of when the sender is started? If you have that info, you could have those receivers use SYNC_ALL while late-joining receivers use SYNC_CURRENT. Again, in either case, a receiver shouldn’t NACK for something it ends up throwing away because of some inconsistency in the sync policy implementation, and I will look into that.
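For illustration, a receiver could select its sync policy along those lines with something like this (NormSetDefaultSyncPolicy and the NormSyncPolicy values are the actual API; the startedWithSender flag is just a hypothetical application-level signal):

```cpp
#include <normApi.h>

// Hypothetical helper: choose the sync policy per receiver as suggested above.
// "startedWithSender" is an application-level signal (e.g., from out-of-band
// coordination); NORM itself doesn't provide it.
void ConfigureReceiverSync(NormSessionHandle session, bool startedWithSender)
{
    if (startedWithSender)
        NormSetDefaultSyncPolicy(session, NORM_SYNC_ALL);      // recover anything still in the sender's stream buffer
    else
        NormSetDefaultSyncPolicy(session, NORM_SYNC_CURRENT);  // don't NACK for pre-join content
    NormStartReceiver(session, 1024 * 1024);  // illustrative buffer size
}
```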
-
I looked at the code, and for the SYNC_CURRENT policy, the FEC block id of the received NORM_DATA message (with segment id zero) is used to set the starting point for purposes of NACKing, but the stream "read_index" that marks the current index from which the application begins "reading" data uses the received segment id (instead of segment id zero). If I can safely change the code so that read_index is set using the FEC block id and segment id zero, then the retransmitted message in your debug log here would not be discarded. I need to spend some time reacquainting myself with the associated code to make sure I can make a good fix that doesn't cause some other issue.
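Schematically, the inconsistency looks something like this (an illustration of the behavior just described, not the actual NORM source):

```cpp
// Illustration only, not the actual NORM source. Suppose the first NORM_DATA
// packet a receiver sees is (blk>0, seg>1) under SYNC_CURRENT:
struct StreamIndex { unsigned int block; unsigned int segment; };

StreamIndex nackStart = {0, 0};  // NACK starting point: segment zero of the
                                 // received FEC block, so (blk>0, seg>0) is
                                 // requested for repair ...
StreamIndex readIndex = {0, 1};  // ... but read_index keeps the received
                                 // segment id, so the retransmitted
                                 // (blk>0, seg>0) is ordinally below it and
                                 // gets discarded instead of delivered.
// Proposed fix: set readIndex = {block, 0} so it matches nackStart.
```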
-
Unless your application uses the optional positive acknowledgment mechanism (which can be used for flow control as well as for getting an acknowledgment that the receiver(s) got the desired data), the “txRobustFactor” is how you dial in more assurance that receivers will NACK for everything that was sent. The number of NORM_CMD(FLUSH) messages sent at the end of transmission is driven by that. With a txRobustFactor of 5, it is easier to have a burst of loss where the last data message(s) and the 5 flush messages are all missed, and the receiver will not NACK for repair if it doesn’t know there are missing messages at the end of transmission. A higher txRobustFactor reduces the probability of this. Note that if the receiver knows there is a gap in its reception (i.e., it got the last data message sent but missed some prior to that), it will throw an inactivity timeout and NACK even if it misses the sender’s flush messages (the “rxRobustFactor” governs how many times this inactivity cycle is followed). You can set the “robustFactor” values to -1 if you don’t mind your application being chattier, and the sender will send an unbounded series of flush messages at end of transmission, until new data is enqueued.
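For example, something like this (real API calls; the specific values are illustrative):

```cpp
#include <normApi.h>

// Sketch: dialing in end-of-transmission robustness (values illustrative).
void ConfigureRobustness(NormSessionHandle session)
{
    // Sender side: number of NORM_CMD(FLUSH) repetitions at end of
    // transmission; -1 means flush indefinitely until new data is enqueued.
    NormSetTxRobustFactor(session, -1);

    // Receiver side: how many inactivity-timeout/NACK cycles are attempted
    // when known repairs are still outstanding.
    NormSetDefaultRxRobustFactor(session, 20);
}
```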
If the transmitter knows the receiver ids (or uses the feature that lets it cache those from received NACKs, etc.), the ACK mechanism can be used to provide additional assurance, and the application can choose how many attempts, or for how long, it tries to get acknowledgment from the receiver(s). For example, the normStreamer example, when the ACK option is used, will query the group an indefinite number of times.
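A minimal sketch of that watermark/ACK usage, assuming a known receiver node id (the id is a placeholder):

```cpp
#include <normApi.h>

// Sketch: request positive acknowledgment up through the current transmit
// position of an enqueued object/stream. 0x01020304 is a placeholder id.
void RequestAck(NormInstanceHandle instance, NormSessionHandle session,
                NormObjectHandle txObject)
{
    NormAddAckingNode(session, 0x01020304);  // known receiver id (placeholder)
    NormSetWatermark(session, txObject);     // ACK requested up to current tx position

    NormEvent ev;
    while (NormGetNextEvent(instance, &ev))
    {
        if (NORM_TX_WATERMARK_COMPLETED == ev.type)
        {
            // NORM_ACK_SUCCESS means the receiver acknowledged everything
            // up to the watermark point.
            if (NORM_ACK_SUCCESS != NormGetAckingStatus(session, 0x01020304))
                NormResetWatermark(session);  // retry the watermark flush
            else
                break;
        }
    }
}
```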
I will also take a look at your log files to see if anything is amiss.
-
I made the change regarding having the SYNC_CURRENT policy sync to FEC block boundaries. Note that one impact of this is that the receiver does not NACK until it reaches the next FEC block boundary, which means the latency of the first message output to your application is increased. I.e., if SYNC_CURRENT syncs to the first received NORM_DATA message, that content is immediately delivered to the application; but if the sync is done on a block basis, the application won't see any data until after the NACK and repair transmission of earlier packets in the sync block, so there will be some delay before the application gets the content for that first FEC block. Is that the behavior you would like?