Preview Release Request - Increase DefaultWindowSize from 64 to 128 in LiteNetLib/NetConstants.cs to test the reduction of packet queueing #1324
Comments
Oh this is amazing testing dude, though I need to ask, could you by chance upload what the author said about the possible downside? The discord link isn't valid and/or isn't in a discord that most people are in.
Just thought of some FAQ items that I think might come up for those who review this request -
This is very good info, thank you! One thing that makes me a bit wary though - I've changed the value to 128 (or something similar, I don't remember the exact number, but it was roughly a double) in the past with other projects. While this helped a lot of people, we also had a number of people who were fine before start experiencing networking issues, which forced me to change it back down. Since this was years back and the conditions and code are not the same, this might not happen again, but I'm still a bit wary of this popping up. We could do the test branch, but I'm concerned that we won't get enough of the community to test this for long enough to see the effects. Given that this doesn't have too many side effects from your testing though, I think it might be worth just making the change and then seeing if issues crop up - it's easy enough to revert in such a case. Alternatively we could have two versions of LNL at once, so it's possible to switch between them on the fly, but that's additional engineering.
I think as long as we announce it prominently, and encourage users to speak up in a specified location if they notice any regression in connectivity for themselves - this could be something we just try out, @Frooxius. I agree that if any issues do present themselves - they'd be more likely to get caught when rolled out to everyone, regardless of any testing done in advance.
You're welcome. I am just doing my part for the good of the community!
Although I have not found a situation where the increase of the DefaultWindowSize has caused issues, it should be noted that my tests weren't large enough to surface any issues increasing the value may cause. That is why I wanted to shift to a larger A/B test and invited others to perform their own tests.
Yes same, I don't want to be 'the guy' that broke sessions, hence the test!
That can be helped with a bit of marketing - this is a test that could overall benefit many users who suffer from message queuing. I don't mind taking point on leading these tests and collecting the data if needed.
To be clear, I don't believe my tests were thorough enough to rule out all possible issues that could crop up from changing this value. Just because the folks testing with me didn't experience any issues doesn't mean others will not. Regarding the two versions - the Mod allows for this today, as it unpatches and repatches the IL to test any LNL DefaultWindowSize value without restarting the client. I can schedule more tests using the Mod, but as I noticed, Mods are not viewed as an effective means for testing, which is why I'm asking for a Prerelease or a new Testing branch so that additional tests are performed outside of any Mods. Adding in Shifty's comments:
I'm apprehensive about pushing the change directly to the public branch at this point with minimal testing - I think this test is best performed much in the same way that LNL 1.2 was tested, as a Prerelease branch before it was pushed into the main release. I'd rather the community report any issues and give the team time to sort them out, and ultimately let the team determine if the value should or should not be changed before it is pushed into main. Thoughts?
If issues weren't immediately visible on the scale you've tested them on so far, I'm uncertain how likely it'd be that we'd find such issues in the small testing environment we'd likely have with the preview branch - I do not anticipate high engagement with any testing/pre-release branches due to various factors (e.g. build incompatibility/splitting the community) that occur as a result of how things are right now. In the event that it isn't caught during a pre-release branch and is pushed out into main - there could still be issues all the same. This is early access software - so it should be expected there will occasionally be service issues - but as long as it's thoroughly communicated that it's a change we'll be testing, I don't think a testing branch will help us. In the end - this is a change that will affect the entire community - so the only way to know the true effect of the change would be to expose it to the full community.
Because of the potential impact on the community, I'm personally not in favor of making a change like this in production without staging it out in a way that can surface any potential impacts to users. I understand that Resonite is an early-access game, but I feel that Resonite should still take steps to minimize impacts to their users, Patreon supporters, and business partners. I recommend placing a hold on this request for now if a Prerelease branch isn't in the cards. I will investigate other avenues to test this at a larger scale within the community to evaluate any impacts in a more thorough way. Once I have more concrete data and user feedback to share, I'll report back here with those results.
I think Shifty is right in that a wide-scale test is unlikely to manifest from a Prerelease branch. It still may be valuable to give it a day or so on a Prerelease branch so the wider community has a chance to try these changes in practice with more variety in geography and networking conditions. A couple of days isn't much, but it's likely we wouldn't have the attention of the community for much longer than that anyway, given Shifty's points about splitting the user base and folks being busy with MMC. That would allow for a smoke test to spot overt issues that could have been missed in Rucio's small-scale tests. The change already sounds pretty unlikely to cause major issues and seems very easy to revert. Given the potential benefit, I'd personally argue this is worth a shot.
There's really no need to rush with completing a test. There's nothing inherently broken in LNL - it's really more of a quality of life improvement for high latency connections. Even though I prefer the branch testing approach for a number of reasons, we already have the ability to test this change at scale today without a testing branch. The method is unfortunately just a bit more clunky than simply switching your branch in Steam.
I agree it's unlikely, but there's still a chance. As @Frooxius already noted, there may have been issues in the past related to the increase of this value and we should take steps to avoid impacting the community.
While yes, it's not "inherently broken", a lot of people outside the US west coast just can't interact due to the current desync (AU > east coast is real bad) unless the only thing people are doing in the session is just talking, as once desynced everything else becomes pretty much impossible. If this has as big an impact as it reads, this could be a massive improvement for Oceanic/Asia players and could solve stuff like #120
We could do it with the pre-release branch, but we'll need to get a big enough group of people to help test this - it might be a bigger testing group than what you've done with the mod, so that might help. However, given my previous experience, I do feel there's still a risk that we won't find the issues until it's pushed to the whole community, so we should be prepared to roll it back if needed. But I'll push it out to the Prerelease branch and we can rally some users to help test.
This isn't what I had in mind specifically. What I was thinking is that we run both versions simultaneously, the way LNL and Steam Networking Sockets exist.
I am apprehensive about releasing this change to the public branch without additional testing. Although we tested this with three people (Rucio, Stella, and me) and have seen favorable results, we don't have enough data to determine the impact it will have with larger groups and/or many different people connecting from different parts of the world. There was a reason why LNL 1.2 was made available on the Prerelease branch first before being released on the Public branch, and the LNL 1.2 release was an overall success due to this. I am in agreement with @bontebok for the same reasons that he provided. Even though I am optimistic that this change will work, I would not want to introduce this network change without proper testing.
And I will be helping @bontebok with this as well.
Thank you, @Frooxius :)
My main thinking here is these two things:
@Frooxius Perhaps have a radio button group for WindowSize on the New Session Settings similar to how you can specify a Port?
@stiefeljackal That'd be a lot of extra engineering just for testing purposes. We would need the system to communicate the size of the window somehow (which we can't do through the protocol itself) and also have the relays & bridges somehow respond to that as well.
Understood. Apologies for the poor communication on my part; I wasn't recommending this for the test. Plus, that would require the sliding window support in LNL that you requested. Anyway, thank you for considering putting this on the Prerelease branch. I will definitely test this just like I did with LNL 1.2.
Ah no worries! I'll just try to get the Prerelease branch pushed out soon so we can test it a bit and if it's ok, we'll push it out to public!
Hello everyone! I've pushed a build. This build includes a version of LNL with an increased window size (from 64 to 128 - also thanks to @ProbablePrime for setting up CI/CD for the library, which allowed for this!), which can potentially help improve packet queuing, especially with higher latencies (e.g. between EU and US, or US and Australia). There might be potential side effects though, so we'd like to test it with as many people as possible and see how the connections fare! Also a note: I'm currently running just a single bridge & relay on my machine locally for testing purposes, so there is only one, located in the Czech Republic. If you can, hop on this build and test some networking sessions - particularly ones where you'd normally queue. We'll leave the test up for a few days and if there are no major issues, we'll push this to public and watch for issues there. Please report any findings in this thread!
Just to say, I'm not reporting on behalf of Bredo's test earlier with their EU headless, just sharing my own experience from their testing. I'm on the east coast, mind you, and I experienced no change in latency or the like - everything was operating as well as a Resonite session goes. The Australian users had no steadily queued packets; anytime it went up, it went back down nearly as fast. Even with objects that generated either a bunch of corrections (which was my object that had driven and undriven dynvar floats) or Character Controller enabled objects that fell to the ground. I even brought in an Australian friend that would queue at around 5 or so users in a US east headless, and they had no issue staying synced. A few Australian users even started trying out connecting via mobile data, which oddly was able to keep up. Image from Bredo, but this was the panel at the time - 2degrees was Bredo's account that was on mobile data; shortly after this they started recovering. I did some slight testing on @stiefeljackal's headless, located on the US east coast, a bit ago with three users. I did not notice anything negative that changed, then again that low amount of users isn't a good test environment.
As a follow up to @ohzee00's comment, as the person running the test in the Sync Lounge: we were seeing situations where people who would queue packets in a US West based session at half the user count we had were not queuing at all in the European session (in Austria). This included people on poor DSL connections, someone on Starlink, and a couple of computers running tethered off poor phone hotspots. Most users had around 240 - 300ms RTT to the headless, with one of the worst cases being the 2degrees account in the image ohzee posted - this one was seeing anywhere from 300 - 1000ms of latency due to poor network conditions. Even when we had a couple of situations specifically tailored to creating large amounts of queued packets (of which some users reached above 100,000 at stages during these), queued packets were able to recover after the fact.
We haven't had the JP users in the same session in a very long time due to packet queuing being unbearable. We had at least four JP users who joined us and they never saw packet queuing numbers above single digits the entire time they were in there. As Stella reported, they only experienced queuing during a spike, but recovered quickly. @stiefeljackal has been keeping tabs on the data for our event and no user connected had any packet loss. I suspect that in the BoTC event there could be other factors at play, such as the users' Internet connections, or if they were going through the temporary relay hosted in CZ. Only one user connected through the relay, but he only came in for a few minutes and left, so it wasn't enough for us to determine if the relay could be a factor. I am going to force the relay on myself and do more testing while users are still in the session.
This has not been my experience thus far, but this is why we are doing these tests - there are many variables at play and many possible explanations that may or may not be related to the DefaultWindowSize change, which is exactly why more tests are needed before we can reach a conclusion.
I looked over the data I've been able to get from the BoTC session and have some theories that I'd like to investigate further. First, I noticed in this shot that the two users who are queuing and yourself are being impacted together, and based on the low number of control packets, likely were lagged out simultaneously - if those two users also happened to be going over the relay, the issues at that time could be related to bad Internet weather or something localized to your Internet connection. I learned that the BoTC logs were purged after the event, but I would like to determine if these users were connected through the relay, which could explain the commonality. Can you review the CZ LNL Relay and confirm if the two users that were impacted the most were going over the relay when connecting to BoTC? Regarding the next screenshot, it looks like only one user was getting enough packet loss to go above 0% - due to how extreme their packet loss was, I would have to assume this was probably isolated to just their connection. Did anyone else have more than 0% packet loss at any time? Later today for Creator Jam Session B we will run another build session on the prerelease - today was a resounding success for us, but I want to collect as much data as possible, as that's the only way we can be sure of what impacts, if any, the change has on connections. If anyone following this thread is around, please stop by and encourage others to do so as well :)
There are a few things from the BoTC:
I am pretty sure that this behavior is not caused by people's connections or factors other than the changed window size:
In conclusion, from what I've seen so far, I would not be comfortable pushing this as a fixed change to the main community at the moment. However, the good results you've been getting are still promising and indicative that increased window size does indeed help in a lot of cases, and it's something that's worth pursuing, but we will have to tackle the downsides somehow. E.g. having the window change size dynamically would be a potentially good step. If the library is able to negotiate the window size as part of the initial connection at the very least, we could make this into a per-user option they could tweak. Having it be able to be changed dynamically and adapt to the connection would be even better, but that's probably even more work.
Would it perhaps make sense to test a less drastic increase then? Something like 96 (halfway) or so?
Based on my previous experience with this library, I don't think it would be worth it. I remember I did test a number of different values a while back and had to settle back on the default of 64, which is conservative enough to work in most cases and not trigger weird problems and instability. I don't think that we'll find a single value that will work globally for everything. I think we'll need to make it change more dynamically - at the very least configurable per-user, or have it adapt dynamically based on the conditions.
This sounds like a good takeaway from the testing. Dynamically changing the window size sounds like a desirable feature, and with some numbers to back up that every connection has a different optimal window size, it's at least clear that this will have a noticeable performance improvement once implemented. I agree with going the safe route and not changing the default for everyone if it can cause more instabilities.
Bit of a disappointing result considering the success of the other testing that has been going on with this change. Didn't manage to get a Sync Lounge test with a bunch of people connected to an AU server - will aim to do that at some stage within the next couple of days.
I appreciate the enthusiasm, but this is not a time to vote. This is to test the value, and multiple tests should be conducted to verify the validity of the change. When Rucio submitted this issue, it was to test the reduction of message queueing. Numbers will change and that is expected, but numbers should not be the only measure. The experience and the quality of life improvement should be measured as well. For the small number of users that were affected by this, an interview should be conducted to see what went wrong. As Rucio pointed out:
People who have tested this also provided positive feedback in the LNL Preview Build Discussion - BETA 2024.2.15.276 thread. To add to Rucio's comments, here is the data captured during the Jam during peak times: 1:45 AM EST (06:45 UTC), 2:03 AM EST (7:03 UTC), 2:12 AM EST (7:12 UTC), 2:24 AM EST (7:24 UTC), 2:38 AM EST (7:38 UTC), 2:58 AM EST (7:58 UTC), 3:20 AM EST (8:20 UTC)
can be partially explained by
The best thing you can do is... well, play it again with the same players, but actually have the relays hosted in the cloud. The flaw with this methodology is that the relays will be notably less congested, but those who don't need the relay will most likely be unaffected. If your connection still suffers unexpectedly, then that's a useful data point. And this time, please remember to ensure that it doesn't purge the logs! P.S. If you really want to maximise the stress test, get everyone into NeoRoids or NeoFurs. From my observation, the higher the framerate, the more stress on the network connection. But this would make a comparison test on the release branch even more important.
They normally don't have issues in the game and most players were not going through the relay. Multiple players who were not going through the relay had issues as well. I'd also have some doubt the relay was the cause of issues, as it was local to Froox and they had issues as well. I did separately suggest that for future relay/networking related changes, having a process to spin them up in the normal relay locations could be beneficial to remove additional variables. The game servers currently don't keep their logs, or anything for that matter - they are fully deleted each time. We are working to tweak that so the logs are saved separately. If we want logs currently, we'd grab them before shutting it down. We generally don't ask players to change to specific avatars or change things they do normally for these tests, so they remain more 'realistic' in that they meet what a typical session in this world would be. We could have done other types of tests but that wasn't what we were looking to do.
Just to clarify, the prerelease relay and Froox were on the same IP address, so it is possible that that particular route to the headless had trouble at some point during the game while some users were also going through the relay. Not saying that was the cause, but since the server logs aren't available, it would be hard to determine that conclusively.
I think some more testing is needed before we'd consider making this the default. I'd be happy to set up some additional relay nodes in our existing relay locations that could help too. Some of the relay boxes are a little slow as they're only designed to run one, but some can easily run another one for testing.
...come to think of it, the best avatars for creating network stress would be lightweight full-body ones - you get high FPS and
"Realistic" actually means "the worst case happens from time to time". Definitely worth doing at least one round with the extra network stress. |
I agree 100%, and that is what the majority of us are trying to point out. Instead of forming a conclusion or making a determination based on a single test, let's make sure we conduct multiple tests, record the necessary data, and form our conclusions from there. The first five comments in the Discord thread LNL Preview Build Discussion - BETA 2024.2.15.276 provide excellent instructions on how to conduct these tests. I would encourage those that want to test the Prerelease to follow those instructions.
As I discussed in my post above about the observations from BoTC, there is enough reason for concern to not make this a global change for everyone - I would be significantly uncomfortable with making that change, hence my suggestion to focus on a more nuanced solution. I believe that's worth pursuing, given the positive results - but those positive results cannot come at the cost of extra instability. I'd appreciate it if you read my post above in full, because there are a number of points that I feel are being overlooked in order to "explain away" the bad results from BoTC. I understand that it's not good to hear some bad news (and I really hate to be the bringer of bad news here) when a lot of the results are positive, but this approach is making me very uncomfortable. To address/repeat some key points:
As it stands, and given these reasons, I would be scared to make this the default for everyone. I understand there are some very good positive results, but they do not make the negatives go away. The only ways forward I see:
I admit, the promise of users and friends that would benefit from this change has slightly biased me toward wanting the change, but overall, yeah - more testing, or better yet, possible implementations of the value being more dynamic, would be great. I'm trying to see if I can do a test with another group if time and general energy allows - here's hoping it ends up promising.
I did a small scale test with a group of friends. Originally I was going to stop before even going to the prerelease to compare, due to the lack of people being on, however it did lead to #1382 being discovered. Finding that bug allowed us to test between release and prerelease with an Australian user that would desync with just four people in a session due to the above bug, two of whom were in the US close to the headless. On release they would slowly desync; on prerelease, with the same amount of people, world and conditions, they had no problem staying up with everyone else.
We did another BoTC test yesterday. One game on the prerelease and another on the normal build. The group was a little bit smaller, but we had a player from New Zealand (for both games) and some Europeans (including myself).
Observations
Analysis
This is in line with my experience and understanding of what's happening - increasing the window size allows for more packets to be "in flight" - this means that with good conditions, packets get pushed through faster, better saturating the connection and keeping up with the queue. However, the flip side is that sending too many UDP packets increases the likelihood that the network elements will start throwing them out (this can happen for a number of reasons, e.g. when the buffer is full, UDP packets are just discarded), increasing the packet loss. This impacts the real-time data (voice & poses) and can also cause queuing, because the protocol becomes significantly less efficient when too many packets are being lost - blocking the queue and requiring a lot of re-transmissions. Based on my experience, it seems like when that happens, the connection ends up going into a "death spiral" of sorts - it becomes backed up, so it tries to fire more packets, which is too much for the given connection to handle, increasing the packet loss and further degrading the throughput. Even if the reliable queue is still able to push enough stuff through, it can still impact the realtime data, like voice and poses, causing things to be jittery, which we experienced in both games.
Conclusion
Based on this and data collected by everyone else, increasing the window size is a good way to deal with queuing, because it can saturate the connection better. However, this also increases the likelihood that the connection becomes overwhelmed with the rate of UDP packets, increasing packet loss, which can have pretty bad effects. Because of this, I don't believe a single fixed value for the window size is a good solution - the default is set conservatively so this doesn't happen with most connections. In my experiments with the library a few years back, I tried a number of different values and settled back on 64 to avoid this degradation. However, since it does help a lot in most cases, it's definitely part of the solution - we just need it to be more adaptive.
Next steps
In order to proceed on this, we would need to add the ability to the LiteNetLib library to adjust the window size dynamically based on network conditions. The library is currently hardcoded to use a fixed window size. From poking around the code, I think a good approach might be to instead redefine that as the "maximum window size" and then have a "soft" size that can be adjusted dynamically - this one would simply prevent further packets in the bigger window from being transmitted until a slot is "freed up". The other part will be having the mechanism which adjusts this value dynamically. This would probably take the form of gradually increasing the size and then stopping/decreasing it when packet loss is detected. It'll probably take a bit to fine tune this mechanism. But I'd need to poke around the library a bunch more to see if this approach will work well. We could also ask the author of LNL again if they want to implement it officially, so we don't have to do it on our end, but if that's not the case, I'd be okay committing my time to this, since I have quite a bit of experience with the library code.
Extra notes
I did dig through the code of LNL a bit, and one of the things I thought would help - doing late merging of packets before transmission - is already implemented in the library, so it's not something that would help further.
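For illustration, here is a minimal sketch of the kind of "soft" window adjustment described above. None of this exists in LiteNetLib today - the type and member names (SoftWindow, OnCleanRoundTrip, OnLoss) are hypothetical, and the increase/decrease step sizes are arbitrary placeholders:

```csharp
using System;

// Hypothetical sketch: a soft send window capped by a hard maximum.
// Additive increase on clean acknowledgements, multiplicative decrease on loss.
public sealed class SoftWindow
{
    public const int MaxWindowSize = 128; // the hard "maximum window size"
    public const int MinWindowSize = 16;  // never shrink below this

    private double _size = 64;            // start at the current conservative default

    // Current number of reliable packets allowed to be "in flight".
    public int Size => (int)_size;

    // Called when a window's worth of packets was acknowledged without loss.
    public void OnCleanRoundTrip()
    {
        _size = Math.Min(MaxWindowSize, _size + 1); // grow slowly
    }

    // Called when packet loss (a retransmission) is detected.
    public void OnLoss()
    {
        _size = Math.Max(MinWindowSize, _size * 0.5); // back off quickly
    }
}
```

The send loop would then hold back new reliable packets while the number of unacknowledged packets is at or above Size, even though the acknowledgement window itself stays at the larger fixed maximum.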
I just want to share some knowledge that I have here since I've been looking into this kind of stuff for my own engine [1], in case it's useful. I don't know much about LNL's internals specifically, but from what I can gather in this discussion, many of the concepts have similarity to parts of TCP, QUIC, and Lidgren.Network. When looking at protocols such as both Ye Olde TCP and the more modern QUIC, there are actually two separate concepts at play here: the client window (much discussed here) and the congestion window. My understanding of both "4 decade old RFC documents (TCP)" and "5 year old RFC documents (QUIC)" is that these play the following roles:
The client window has been much discussed here already. Indeed, increasing the client window by a hardcoded amount does mean that "the networking layer can push more bandwidth" (thanks to the bandwidth-delay product), but that obviously means you're more likely to overload specific people's networks. In TCP and QUIC the size of the client window is something continuously communicated between two peers, whereas in Lidgren (and from what I can read here, LNL) it's hardcoded: suitable for low-bandwidth transfer (what you'd hope most game traffic is), but not for large downloads. The congestion window works much differently. Effectively, both peers try to keep tabs on the connection quality through some heuristics, and will throttle their outgoing bandwidth based on that. There are a lot of algorithms for this, but they're all calculated solely on the sending peer, and the other peer generally doesn't need to know or care [2]. These algorithms are also designed so that multiple independent programs or devices going through the same link can all "fairly" share the link without needing further cooperation. One problem with common TCP congestion control algorithms however (that I know of) is that they are loss based, i.e. the heuristic is "we have to retransmit due to packet loss so we're probably overloading the connection". This has the issue of running into buffer bloat though, which causes latency to rise significantly before the congestion control actually realizes the network is having issues. Google (relatively) recently made a new algorithm called BBR that's based on delay rather than loss, so it should try to keep buffer bloat under control; it's possibly worth looking into for games too. What I'm trying to point out here is that when going off existing literature, the specific problem that needs solving here [3] is congestion control. The client window should not (and possibly, cannot) itself be scaled based on the network quality. For what it's worth, I took a look at ENet and it does mention something that sounds like congestion control in the API reference, since there was talk about using it in another issue.
Footnotes
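To sketch that distinction in code terms (purely illustrative - these names do not correspond to any real LNL or QUIC API), the sender's effective in-flight limit is the smaller of the two windows, and only the congestion window is meant to react to network quality:

```csharp
using System;

// Illustrative only: how flow control and congestion control combine on the sender.
public sealed class SenderLimits
{
    // Flow control: how much the *receiver* is willing to buffer.
    // Advertised by the peer in TCP/QUIC; a hardcoded constant in Lidgren/LNL.
    public int ReceiveWindow { get; set; } = 64;

    // Congestion control: the sender's own estimate of what the *path* can carry.
    // Adjusted locally from loss (classic TCP) or delay/bandwidth estimates (BBR-style).
    public int CongestionWindow { get; set; } = 64;

    // Packets allowed to be in flight at any moment.
    public int InFlightLimit => Math.Min(ReceiveWindow, CongestionWindow);
}
```

Under this framing, raising DefaultWindowSize only lifts the flow-control ceiling; something still has to play the congestion-window role so a poor link isn't flooded.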
We have announced a coordinated test between Australia and the US, two regions which commonly experience packet queueing with each other. The details can be found on the Resonite Discord, using the link below. We will be visiting four sessions with servers located in both Australia and the US, and will test on the public branch first, followed by the prerelease branch. The test will be recorded and posted for all to see, and the test world we will be using will be made available as well. If you are in the United States or Australia and would like to be part of this test, please read the details at the link below. Thanks everyone in advance for participating! Australia and US Resonite Prerelease Test - 2/24/2024 3:00 AM UTC
From my end of the test, on the pre-release build on both the US and AUS servers I noticed some slight popping from a few users when they spoke. I also had quite a few users comment that they heard quite a bit of popping and stuttering in my movements on the pre-release build on the AUS server. There was also popping on the US pre-release server from me and other users, but it wasn't as bad as the AUS server. For context, I live on the east coast of the US.
If you participated in the AU/US test, feel free to post feedback here or in the thread located in the Resonite Discord. I'm in the process of collecting all of the videos from the test sessions to review carefully. I will compile my findings and have them reviewed by others who participated in the test for confirmation before posting here. I hope to have this completed before the weekend ends. I have uploaded the videos from Test 1a and 1b (Public Branch) and Test 2a and 2b (Prerelease Branch) and you can find them linked below if you would like to review them yourself.
https://www.youtube.com/watch?v=yKMXGzECwE4
Here is the link to the test world we used during all four tests: resrec:///G-Rucios-Crew/R-9c9bb88f-2958-4ca4-9765-ca02f7cd7a8c
Here are the results of the Australia and US test which was held on 2/23/2024 at 10:00 pm EST in Resonite. Thanks to @stiefeljackal for helping to develop the test plan, and to @bredo228 for providing the AU headless server!
Test Parameters
An artificial delta message generator known as the "Desync Generator" was created to intentionally cause a set amount of string length to be written to a variable on every update from the Host. This was used during each test to identify the limit of the number of bytes per update that could be sent before users started to queue. The four tests performed were:
Results
Server logs have confirmed that no users connected through the LNL Relay during the tests. One user had used -ForceRelay when connecting to the first test, but removed the -ForceRelay flag and reconnected before the first test began.
Test 1a - AU on Public
Test 1b - US on Public
Test 2a - AU on Prerelease
Test 2b - US on Prerelease
Summary
Special thanks to these Resonite Users who participated in the test!
avali032, BigRedWolfy, bredo, Canadian_RedZ, Charizmare, darbdarb, FoxBox, GunGryphon, Lexevo, Near By Gamer, Ocelot342, ohzee, Player Kode, Rucio, Stellanora, StiefelJackal, Sveken, TrebleSketch, and zorky.
Adding a note regarding the lower frame rate on the Australian server: this was most likely down to the server itself being quite a bit less powerful than the US one (in this case, it was running off an old Intel Core i7-4790), rather than any other factors.
Thank you for all the detailed info! I do feel like it's very promising and improves throughput a bit. We still need a good solution to avoid the negative side effects of the increased packet loss. I'm not sure if having dynamic window size scaling would be best - it feels like it's the simplest solution, but requires some development work on LNL. We could keep it simple and see if that works well enough. We could try alternate protocols, like ENet, since @PJB3005 mentioned it might have congestion control already, but implementing another protocol is a bit of a task. It might be worth adding some more diagnostics to LNL to better understand what's going on too. I've been thinking about adding the following ones:
As all testing is complete and the prerelease is no longer needed, I'll go ahead and close this request. Thanks @stiefeljackal, @Frooxius, @shiftyscales, @ProbablePrime, and @bredo228 for all of your efforts. The information learned here will one day lead to a solution to eliminate packet queuing for high latency users!
As of update 2024.4.22.1366:
https://discord.com/channels/1040316820650991766/1154514012143362151/1232103701221605377
If the packet loss calculation was incorrect before that release, then the packet loss numbers exhibited in these tests are also incorrect. Therefore, this should be retested, as the crux of the conclusions mentions packet loss as the key factor for not going forward with this.
The packet loss numbers during the tests were correct; the packet loss shown there wasn't affected by this bug. The fix (and the bug) affected specifically the new dedicated packet loss property, which didn't exist at the time of these tests.
Is your feature request related to a problem? Please describe.
Summary
This is a request for a Resonite Prerelease branch release to allow a broader audience to test and evaluate the impact of increasing the DefaultWindowSize in LiteNetLib\NetConstants.cs.
Some History
Queueing is a problem that many users with higher latency connections often struggle with in large sessions or sessions with large delta messages. This can lead to users struggling to collaborate or experience world updates in real-time. We often struggle with this at Creator Jams and other events where users are joining US hosted sessions from other continents; typically the most queueing appears for users connecting from Japan, Korea, New Zealand, and Australia. The common factor among all of them is that they have much higher latencies than users connecting from other locations.
To study the problem, I set up a packet shaper to emulate both a decrease in bandwidth and an increase in latency. I found that I could create scenarios where users connected to a world would begin to queue packets simply by pushing the latency above a threshold. That threshold was found to be a function of the number and size of delta messages, combined with the latency of the connection.
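To illustrate why latency alone can push a connection over such a threshold, here is a back-of-the-envelope sketch; the 1400-byte payload size and 300 ms RTT are assumed example numbers, not measurements from these tests:

```csharp
using System;

// Rough throughput ceiling for a fixed send window of reliable packets:
// at most `windowSize` packets can be unacknowledged per round trip.
const int windowSize = 64;          // LiteNetLib default
const double payloadBytes = 1400;   // assumed typical UDP payload per packet
const double rttSeconds = 0.300;    // assumed AU <-> US round-trip time

double maxBytesPerSecond = windowSize * payloadBytes / rttSeconds; // reliable-traffic ceiling
double withDoubledWindow = 2 * maxBytesPerSecond;                  // ceiling with a 128 window

Console.WriteLine($"64-packet window:  {maxBytesPerSecond / 1024:F0} KiB/s");
Console.WriteLine($"128-packet window: {withDoubledWindow / 1024:F0} KiB/s");
```

Under these assumed numbers, a 64-packet window tops out at roughly 300 KB/s of reliable traffic on a 300 ms link, and doubling the window doubles that ceiling; if a session's delta messages exceed the ceiling, the queue can only grow, which matches the observation that queueing sets in once latency crosses a threshold for a given message volume.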
I raised this problem to the author of LiteNetLib on their Discord server, looking for suggestions of what can be done now to alleviate this problem. Their suggestion was to increase the value of the *DefaultWindowSize in the NetConstants.cs file to a value higher than the default of 64. This would reduce the number of acknowledgement packets and reduce the amount of queueing. Discord Link
*LiteNetLib currently does not have a sliding window feature that would allow the DefaultWindowSize to scale dynamically to reduce queueing in scenarios such as those exhibited in Resonite; however, it is worth mentioning that this feature is on LiteNetLib's roadmap. (Link in Additional Context)
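For reference, the change under test is a one-line edit to this constant. A sketch of the edit is below, assuming the constant is declared as a public const int in NetConstants.cs as in current LiteNetLib releases; the other constants in that file are omitted:

```csharp
// LiteNetLib/NetConstants.cs (sketch of the proposed edit; other members omitted)
public static class NetConstants
{
    // Default is 64. The prerelease build under test doubles this to 128,
    // allowing twice as many reliable packets to be in flight per channel.
    public const int DefaultWindowSize = 128;
}
```

The prerelease build discussed in this thread corresponds to the 128 value shown here.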
Testing
I developed a ResoniteModLoader Mod, with assistance from @stiefeljackal, to patch the DefaultWindowSize constant value in the IL, giving us the ability to dynamically change the constant from the default value of 64 to values such as 128 and as high as 512. We built a test world that contained a fixed number of delta generating messages and performed A/B testing from the default value to higher values. We determined that even a small increase of the DefaultWindowSize from 64 to 128 or higher could reduce or eliminate queuing. ResoniteLNLTweaks Mod
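For those curious how IL patching can change a constant at all: because DefaultWindowSize is a const, its value is inlined into the IL of every method that uses it, so a mod has to rewrite those load instructions rather than the constant itself. A rough Harmony-style sketch of that idea follows - the target constructor chosen here and the helper names are assumptions for illustration, not ResoniteLNLTweaks' actual code:

```csharp
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Reflection.Emit;
using HarmonyLib;

// Hypothetical transpiler: replace inlined occurrences of the old window size (64)
// with a configurable value in one method that uses NetConstants.DefaultWindowSize.
[HarmonyPatch]
public static class WindowSizePatch
{
    public static int ConfiguredWindowSize = 128;

    // The reliable channel type is internal to LiteNetLib, so resolve it by name (assumed target).
    static MethodBase TargetMethod() =>
        AccessTools.FirstConstructor(AccessTools.TypeByName("LiteNetLib.ReliableChannel"), _ => true);

    static IEnumerable<CodeInstruction> Transpiler(IEnumerable<CodeInstruction> instructions)
    {
        foreach (var instruction in instructions)
        {
            // The compiler emits 64 as ldc.i4.s (it fits in a signed byte).
            if (instruction.opcode == OpCodes.Ldc_I4_S && Convert.ToInt32(instruction.operand) == 64)
            {
                // 128+ does not fit in ldc.i4.s, so emit a full ldc.i4 instead.
                yield return new CodeInstruction(OpCodes.Ldc_I4, ConfiguredWindowSize);
            }
            else
            {
                yield return instruction;
            }
        }
    }
}
```

A real implementation would likely need to patch every method that inlines the constant (and unpatch/repatch when the value changes, as described above); the sketch patches only one constructor for brevity.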
Our A/B tests showed that even a small increase of the DefaultWindowSize from 64 to 128 significantly decreased the problem of queueing in both test circumstances: simulated latency (via packet shaping) and real latency experienced by @Stellanora64. Increasing the value to as large as 512 showed that even hypothetical high-latency links (~1000ms) could be prevented from queuing; however, as the LiteNetLib author pointed out, increasing this value too high could lead to other problems. Discord Link
Describe the solution you'd like
I would like to request that a prerelease branch be created to allow users to test the effectiveness of increasing the DefaultWindowSize in NetConstants.cs from 64 to 128, to conduct proper testing and gather feedback. As clients would not be compatible with each other, this branch would be solely for the purpose of coordinated testing, where groups of users could determine the impact of this change.
Describe alternatives you've considered
As I have created a ResoniteModLoader Mod to allow this change to be tested dynamically, an alternative already exists. However, it is understood that Mods are not an effective means of testing and it would be far more beneficial to have a separate test branch created to test this change.
Additional Context
References -
LiteNetLib GitHub Issue #110 - Implement sliding window scale logic.
Requesters
@stiefeljackal - Mod development assistance & participating in A/B testing.
@Stellanora64 - First raising the issue here that needed to be solved and participating in our A/B testing.
@Frooxius - For requesting sliding window support in LiteNetLib and producing the User Delta Messages node which was key to understanding the problem we are trying to solve.
And Creator Jam & Medra for providing a space for us to study the problem more closely.