
Preview Release Request - Increase DefaultWindowSize from 64 to 128 in LiteNetLib/NetConstants.cs to test the reduction of packet queueing #1324

Closed
bontebok opened this issue Feb 9, 2024 · 64 comments
Assignees
Labels
triaged (This issue has been assessed), tweak (Suggested change to an existing feature)

Comments

@bontebok

bontebok commented Feb 9, 2024

Is your feature request related to a problem? Please describe.

Summary

This is a request for a Resonite Prerelease branch release to allow a broader audience to evaluate the impact of increasing the DefaultWindowSize in LiteNetLib\NetConstants.cs.

Some History

Queueing is a problem that many users with higher-latency connections often struggle with in large sessions or sessions with large delta messages. This can lead to users struggling to collaborate or to experience world updates in real time. We often run into this at Creator Jams and other events where users are joining US-hosted sessions from other continents; typically the most queueing appears for users connecting from Japan, Korea, New Zealand, and Australia. The common factor among all of them is that they have much higher latencies than users connecting from other locations.

To study the problem, I set up a packet shaper to emulate both a decrease in bandwidth and an increase in latency. I found that I could create scenarios where users connected to a world would begin to queue packets simply by pushing the latency above a threshold. That threshold was found to be a factor of the number and size of delta messages, combined with a higher latency connection.

I raised this problem to the author of LiteNetLib on their Discord server, looking for suggestions of what can be done now to alleviate it. Their suggestion was to increase the value of the *DefaultWindowSize in the NetConstants.cs file to a value higher than the default of 64. This would reduce the number of acknowledgement packets and reduce the amount of queueing. Discord Link

*LiteNetLib currently does not have a sliding window feature that would allow the DefaultWindowSize to scale dynamically to reduce queueing in scenarios such as those exhibited in Resonite; however, it is worth mentioning that this feature is on LiteNetLib's roadmap. (Link in Additional Context)
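
For reference, the change under test amounts to a single constant edit in LiteNetLib; the declaration below is a minimal sketch of what the relevant line in LiteNetLib/NetConstants.cs looks like (surrounding members omitted, exact formatting may differ):

```csharp
// LiteNetLib/NetConstants.cs (sketch - other constants omitted)
namespace LiteNetLib
{
    public static class NetConstants
    {
        // Upstream default is 64; the proposal is to build Resonite's copy of the
        // library with 128 so twice as many reliable packets can be unacknowledged
        // ("in flight") per channel before the sender has to stop and wait for ACKs.
        public const int DefaultWindowSize = 128;
    }
}
```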

Testing

I developed a ResoniteModLoader Mod along with assistance from @stiefeljackal to patch the DefaultWindowSize constant value in the IL to give us the ability to dynamically change the constant from the default value of 64 to values such as 128 and as high as 512. We built a test world that contained a fixed number of delta generating messages and performed A/B testing from the default value to higher values. We determine that even a small increase of the DefaultWindowSize from 64 to 128 or higher could reduce or eliminate queuing. ResoniteLNLTweaks Mod
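
Since C# const fields are inlined into their call sites at compile time, a mod cannot simply assign a new value to NetConstants.DefaultWindowSize at runtime; the inlined literal has to be rewritten wherever it appears. A rough sketch of that idea using a HarmonyLib transpiler (the method below is illustrative and hypothetical, not the actual ResoniteLNLTweaks code):

```csharp
using System.Collections.Generic;
using System.Reflection.Emit;
using HarmonyLib;

public static class WindowSizePatch
{
    // Hypothetical sketch: replace inlined occurrences of the old window size (64)
    // with the value under test inside a patched LiteNetLib method that consumed
    // NetConstants.DefaultWindowSize.
    public static IEnumerable<CodeInstruction> Transpiler(IEnumerable<CodeInstruction> instructions)
    {
        const sbyte oldSize = 64;
        const int newSize = 128; // value being evaluated (the mod allows up to 512)

        foreach (var instruction in instructions)
        {
            // Literals that fit in a signed byte are emitted as ldc.i4.s.
            if (instruction.opcode == OpCodes.Ldc_I4_S
                && instruction.operand is sbyte value
                && value == oldSize)
            {
                yield return new CodeInstruction(OpCodes.Ldc_I4, newSize);
            }
            else
            {
                yield return instruction;
            }
        }
    }
}
```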

Our A/B tests showed that even a small increase of the DefaultWindowSize from 64 to 128 significantly decreased the problem of queueing in both test circumstances: simulated latency (via packet shaping) and real latency experienced by @Stellanora64. Increasing the value to as large as 512 showed that hypothetically even very high-latency links (~1000ms) could be prevented from queuing; however, as the LiteNetLib author pointed out, increasing this value too high could lead to other problems. Discord Link

Describe the solution you'd like

I would like to request that a prerelease branch be created to allow users to test the effectiveness of increasing the DefaultWindowSize in NetConstants.cs from 64 to 128 and to gather feedback. As clients would not be compatible with each other, the goal of this branch would be only for the purpose of coordinated testing, where groups of users could determine the impact of this change.

Describe alternatives you've considered

As I have created a ResoniteModLoader Mod to allow this change to be tested dynamically, an alternative already exists. However, it is understood that Mods are not an effective means of testing and it would be far more beneficial to have a separate test branch created to test this change.

Additional Context

References -

LiteNetLib GitHub Issue #110 - Implement sliding window scale logic.

Requesters

@stiefeljackal - Mod development assistance & participating in A/B testing.
@Stellanora64 - For first raising the issue here that needed to be solved and for participating in our A/B testing.
@Frooxius - For requesting sliding window support in LiteNetLib and producing the User Delta Messages node which was key to understanding the problem we are trying to solve.
And Creator Jam & Medra for providing a space for us to study the problem more closely.

@bontebok bontebok added the New Feature A new addition, whose complexity hasn't been evaluated yet label Feb 9, 2024
@bontebok bontebok changed the title Preview Release Request - Increase DefaultWindowSize from 64 to 128 in LiteNetLib/NetConstants.cs Preview Release Request - Increase DefaultWindowSize from 64 to 128 in LiteNetLib/NetConstants.cs to test the reduction of packet queueing Feb 9, 2024
@ohzee00

ohzee00 commented Feb 9, 2024

Oh this is amazing testing dude, though I need to ask, could you by chance upload what the author said about the possible downside? The discord link isn't valid and/or isn't in a discord that most people are in.

@bontebok
Author

bontebok commented Feb 9, 2024

Oh this is amazing testing dude, though I need to ask, could you by chance upload what the author said about the possible downside? The discord link isn't valid and/or isn't in a discord that most people are in.

Sure! I'll drop the screenshots to this thread so it keeps the original post clean.

Here's my conversation with the author where I was asking about a potential solution -
image

Here's the response to my question about increasing the DefaultWindowSize value too large -

image

@bontebok
Author

Just thought of some FAQ items that I think might come up for those who review this request -

  • In our tests, we found no downsides to changing the value from 64 to 128. Even at 512, Resonite seemed to behave very normally, although, heeding the warning the author gave about increasing the value too high, we decided that 128 was a conservative enough change that still made a pretty remarkable difference in our A/B tests.
  • Users with different DefaultWindowSize values will not be able to connect to each other, nor will they be able to use the Resonite LNL Relays. This is because all clients expect this constant to be the same, and the UDP packets will no longer be compatible with each other. This change is, however, compatible with the existing LNL NAT servers, so standard UDP punch-through connections will work as normal.
  • This change also reduces queueing on IPv6 connections. Yay :)

@shiftyscales shiftyscales added tweak Suggested change to an existing feature and removed New Feature A new addition, whose complexity hasn't been evaluated yet labels Feb 10, 2024
@shiftyscales shiftyscales removed their assignment Feb 10, 2024
@Frooxius
Member

This is very good info, thank you!

One thing that makes me a bit wary though - I've changed the value to 128 (or something similar, I don't remember the exact number, but it was roughly a double) in the past with other projects. While this helped a lot of people, a number of people who had been fine before started experiencing networking issues, which forced me to change it back down.

Since this was years back and the conditions and code are not the same, this might not happen again, but I'm still a bit wary of it popping up.

We could do the test branch, but I'm concerned that we won't get enough of the community to test this for long enough to see the effects.

Given that this hasn't shown too many side effects in your testing though, I think it might be worth just making the change and then seeing if issues crop up - it's easy enough to revert in that case. Alternatively we could have two versions of LNL at once, so it's possible to switch between them on the fly, but that's additional engineering.

@shiftyscales
Collaborator

I think as long as we announce it prominently, and encourage users to speak up in a specified location if they notice any regression in connectivity for themselves- this could be something we just try out, @Frooxius.

I agree that if any issues do present themselves - they'd be more likely to get caught when rolled out to everyone, regardless of any testing done in advance.

@shiftyscales shiftyscales added the triaged This issue has been assessed label Feb 12, 2024
@bontebok
Author

This is very good info, thank you!

You're welcome. I am just doing my part for the good of the community!

One thing that makes me a bit wary though - I've changed the value to 128 (or something similar, I don't remember the exact number, but it was roughly a double) in the past with other projects. While this helped a lot of people, a number of people who had been fine before started experiencing networking issues, which forced me to change it back down.

Although I have not found a situation where the increase of the DefaultWindowSize has caused issues, it should be noted that my tests weren't large enough to surface any issues increasing the value may cause. With that said, that's why I wanted to shift to a larger A/B test and invited others to perform their own tests.

Since this was years back and the conditions and code are not the same, this might not happen again, but I'm still a bit wary of it popping up.

Yes same, I don't want to be 'the guy' that broke sessions, hence the test!

We could do the test branch, but I'm concerned that we won't get enough of the community to test this for long enough to see the effects.

That can be helped with a bit of marketing; this is a test that could benefit many users who suffer from message queuing. I don't mind taking point on leading these tests and collecting the data if needed.

Given that this hasn't shown too many side effects in your testing though, I think it might be worth just making the change and then seeing if issues crop up - it's easy enough to revert in that case. Alternatively we could have two versions of LNL at once, so it's possible to switch between them on the fly, but that's additional engineering.

To be clear, I don't believe my tests were thorough enough to rule out all possible issues that could crop up from changing this value. Just because the folks testing with me didn't experience any issues, doesn't mean others will not.

Regarding the two versions - the Mod allows for this today, as it unpatches and repatches the IL to test any LNL DefaultWindowSize value without restarting the client. I can schedule more tests using the Mod, but as I noted, Mods are not viewed as an effective means of testing, which is why I'm asking for a Prerelease or a new Testing branch so that additional tests can be performed outside of any Mods.

Adding in Shifty's comments:

I think as long as we announce it prominently, and encourage users to speak up in a specified location if they notice any regression in connectivity for themselves- this could be something we just try out, @Frooxius.

I agree that if any issues do present themselves - they'd be more likely to get caught when rolled out to everyone, regardless of any testing done in advance.

I'm apprehensive about pushing the change directly to the public branch at this point with minimal testing - I think this test is best performed much in the same way that LNL 1.2 was tested, as a Prerelease branch before it was pushed into the main release. I'd rather the community report any issues and give the team time to sort them out, and ultimately let the team determine whether the value should or should not be changed before it is pushed into main.

Thoughts?

@shiftyscales
Collaborator

If issues weren't immediately visible on the scale you've tested them on so far, I'm uncertain how likely it'd be that we'd find such issues in the small testing environment we'd likely have with the preview branch - I do not anticipate high engagement with any testing/pre-release branches due to various factors (e.g. build incompatibility/splitting the community) that occur as a result of how things are right now.

In the event that it isn't caught during a pre-release branch, and is pushed out into main - there could still be issues all the same. This is early access software - so it should be expected there will occasionally be service issues - but as long as it's thoroughly communicated that it's a change we'll be testing, I don't think a testing branch will help us.

In the end - this is a change that will affect the entire community - so the only way to know the true effect of the change would be to expose it to the full community.

@bontebok
Author

In the end - this is a change that will affect the entire community - so the only way to know the true effect of the change would be to expose it to the full community.

Because of the potential impact to the community, I'm personally not in favor of making a change like this in production without staging it out in a way that can surface any potential impacts to users. I understand that Resonite is an early-access game, but I feel that Resonite should still take efforts to minimize impacts to its users, Patreon supporters, and business partners.

I recommend placing a hold on this request for now if a Prerelease branch isn't in the cards. I will investigate other avenues to test this at a larger scale within the community to evaluate any impacts in a more thorough way. Once I have more concrete data and user feedback to share, I'll report back here with those results.

@PointerOffset

I think Shifty is right in that a wide-scale test is unlikely to manifest from a Prerelease branch. It still may be valuable to give a day or so on a Prerelease branch so the wider community has a chance to try these changes in-practice with more variety in geography and networking conditions. A couple days isn't much but it's likely we wouldn't have the attention of the community for much longer than that anyway given Shifty's points about splitting the user base and folks being busy with MMC.

That would allow for a smoke test to spot overt issues that could be missed in Rucio's small-scale tests. The change already sounds pretty unlikely to cause major issues and seems very easy to revert. Given the potential benefit, I'd personally argue this is worth a shot.

@bontebok
Author

I think Shifty is right in that a wide-scale test is unlikely to manifest from a Prerelease branch. It still may be valuable to give a day or so on a Prerelease branch so the wider community has a chance to try these changes in-practice with more variety in geography and networking conditions. A couple days isn't much but it's likely we wouldn't have the attention of the community for much longer than that anyway given Shifty's points about splitting the user base and folks being busy with MMC.

There's really no need to rush with completing a test. There's nothing inherently broken in LNL - it's really more of a quality-of-life improvement for high-latency connections. Even though I prefer the branch testing approach for a number of reasons, we already have the ability to test this change at scale today without a testing branch. The method is unfortunately just a bit clunkier than switching your branch in Steam.

That would allow for a smoke test to spot overt issues that could be missed in Rucio's small-scale tests. The change already sounds pretty unlikely to cause major issues and seems very easy to revert. Given the potential benefit, I'd personally argue this is worth a shot.

I agree it's unlikely, but there's still a chance. As @Frooxius already noted, there may have been issues in the past related to the increase of this value and we should take steps to avoid impacting the community.

@sveken

sveken commented Feb 12, 2024

While yes, it's not "inherently broken", a lot of people outside the US west coast just can't interact due to the current desync (AU > east coast is real bad) unless the only thing people are doing in a session is talking, as once desynced everything else becomes pretty much impossible.

If this has as big an impact as it reads, this could be a massive improvement for Oceanic/Asian players and could solve stuff like #120

@Frooxius
Member

We could do it with the pre-release branch, but we'll need to get a big enough group of people to help test this - it might be a bigger testing group than what you've done with the mod, so that might help.

However, given my previous experience, I do feel there's still a risk that we won't find the issues until it's pushed to the whole community, so we should be prepared to roll it back if needed.

But I'll push it out to Prerelease branch and we can rally some users to help test.

Regarding the two versions - the Mod allows for this today, as it unpatches and repatches the IL to test any LNL DefaultWindowSize value without restarting the client. I can schedule more tests using the Mod, but as I noted, Mods are not viewed as an effective means of testing, which is why I'm asking for a Prerelease or a new Testing branch so that additional tests can be performed outside of any Mods.

This isn't what I had in mind specifically. What I was thinking is that we run both versions simultaneously, the way LNL and Steam Networking Sockets exist.

@stiefeljackal

I am apprehensive about releasing this change to the public branch without additional testing. Although we tested this with three people (Rucio, Stella, and me) and have seen favorable results, we don't have enough data to determine the impact it will have with larger groups and/or many different people connecting from different parts of the world. There was a reason why LNL 1.2 was made available on the Prerelease branch first before being released on the Public branch, and the LNL 1.2 release was an overall success due to this.

I am in agreement with @bontebok due to the same reasons that he provided. Even though I am optimistic that this change will work, I would not want to introduce this network change without proper testing.

I will investigate other avenues to test this at a larger scale within the community to evaluate any impacts in a more thorough way. Once I have more concrete data and user feedback to share, I'll report back here with those results.

And I will be helping @bontebok with this as well.

We could do it with the pre-release branch, but we'll need to get a big enough group of people to help test this - it might be a bigger testing group than what you've done with the mod, so that might help.

However, given my previous experience, I do feel there's still a risk that we won't find the issues until it's pushed to the whole community, so we should be prepared to roll it back if needed.

Thank you, @Frooxius :)

@Frooxius
Member

My main thinking here is these two things:

  1. The change to LNL 1.2 was a pretty drastic change to the code for how sockets are handled. That is much more likely to completely break things
  2. This is changing a constant, without modifying the actual behavior/protocol, which is much safer and is something I've done in the past - the main downside would be that people would start queuing when they previously wouldn't. I think the risk of things breaking less predictably is very low. I think we are also more likely to see this crop up in public use, rather than on pre-release, but we can do that one anyway just to be safe - we should put more focus on watching out for issues once it's rolled out.

@stiefeljackal

This isn't what I had in mind specifically. What I was thinking is that we run both versions simultaneously, the way LNL and Steam Networking Sockets exist.

@Frooxius Perhaps have a radio button group for WindowSize on the New Session Settings similar to how you can specify a Port?

@Frooxius
Member

@stiefeljackal That'd be a lot of extra engineering just for testing purposes. We would need the system to communicate the size of the window somehow (which we can't do through the protocol itself) and also have the relays & bridges somehow respond to that as well.

@stiefeljackal

@stiefeljackal That'd be a lot of extra engineering just for testing purposes.

Understood. Apologies for the poor communication on my part; I wasn't recommending this for the test. Plus, that would require the sliding window support in LNL that you requested.

Anyway, thank you for considering putting this on the Prerelease branch. I will definitely test this just like I did with LNL 1.2.

@Frooxius
Member

Ah no worries! I'll just try to get the Prerelease branch pushed out soon so we can test it a bit and if it's ok, we'll push it out to public!

@Frooxius
Member

Frooxius commented Feb 15, 2024

Hello everyone!

I've pushed a build 2024.2.15.1078 to the prerelease branch on Steam!

This build includes a version of LNL with an increased window size (from 64 to 128 - also thanks to @ProbablePrime for setting up CI/CD for the library, which allowed for this!), which can potentially help reduce packet queuing, especially with higher latencies (e.g. between EU and US or US and Australia).

There might be potential side effects though, so we'd like to test it with as many people as possible and see how the connections fare!

Also a note: I'm currently running just a single bridge & relay on my machine locally for testing purposes, so there is only one, located in the Czech Republic.

If you can, hop on this build and test some networking sessions - particularly ones where you'd normally queue.

We'll leave the test up for a few days and if there are no major issues, we'll push this to public and watch for issues there.

Please report any findings in this thread!

@Frooxius Frooxius self-assigned this Feb 15, 2024
@ohzee00

ohzee00 commented Feb 16, 2024

Just to say, I'm not reporting on behalf of Bredo's test earlier with their EU headless, but just sharing my own experience from their testing.
It hovered around 10 users, hitting around 15 at times, some of whom were AU users who would normally queue at half that player count.
In that session it was just a normal Sync Lounge world with a board that reported network-related things like ping and queued packets.

I'm on the east coast mind you, and I experienced no change in latency or the like; everything was operating as well as a Resonite session goes. The Australian users had no steadily queued packets; any time it went up, it went back down nearly as fast. Even with objects that generated either a bunch of corrections (which was my object that had driven and undriven dynvar floats) or Character Controller enabled objects that fell to the ground. I even brought in an Australian friend who would queue at around 5 or so users in a US east headless, and they had no issue staying synced. A few Australian users even started trying out connecting via mobile data, which oddly was able to keep up.

Image from Bredo, but this was the panel at the time; 2degrees was Bredo's account that was on mobile data, and shortly after this they started recovering.
image

I did some light testing a bit ago on @stiefeljackal's headless, which is located on the US east coast, with three users. I did not notice anything negative change, but then again that low a user count isn't a good test environment.
So far, I've only noticed positive changes, and damn if they don't make me excited for our Australian friends to play with us if all continues to go well. Though I think you're correct about releasing to main being the only real test for these things.

@bredo228

bredo228 commented Feb 16, 2024

As a follow up to @ohzee00's comment, as the person running the test in the Sync Lounge:

We were seeing people who would queue packets in a US West based session at half the user count we had here not queuing at all in the European session (in Austria). This included people on poor DSL connections, someone on Starlink, and a couple of computers running tethered off poor phone hotspots. Most users had around 240 - 300ms RTT to the headless, with one of the worst cases being the 2degrees account in the image ohzee posted - this one was seeing anywhere from 300 - 1000ms of latency due to poor network conditions.

Even when we had a couple of situations specifically tailored to creating large amounts of queued packets (some users reached above 100,000 queued packets at stages during these), the queued packets were able to recover after the fact.

@bontebok
Author

How does it compare to previous creator jams?

We haven't had the JP users in the same session before in a very long time due to packet queuing being unbearable. We had at least four JP users who joined us and they never saw packet queuing numbers above single digits the entire time they were in there. As Stella reported, they only experienced queuing during a spike, but recovered quickly.

@stiefeljackal has been keeping tabs on the data for our event and no user connected had any packet loss. I suspect in the BoTC event that there could be other factors at play, such as the users' Internet connections, or if they were going through the temporary relay hosted in CZ.

Only one user connected through the relay, but he only came in for a few minutes and left so it wasn't enough for us to determine if the relay could be a factor. I am going to force the relay on myself and do more testing while users are still in the session.

@bontebok
Author

We've done a test at BoTC earlier today and unfortunately I don't think this change is safe to make, because the game ran into significant networking issues, worse than a typical game.

A number of users would frequently queue pretty fast. While it would often end up clearing after a bit of time, the bursts of queuing were a lot more unpredictable than in regular sessions, from my observations.

This has not been my experience thus far, but it is why we are doing these tests - there are many variables at play and many possible explanations that may be related or unrelated to the DefaultWindowSize change, which is exactly why more tests are needed before we can reach a conclusion.

We had to remove some people from the game, because at some point they ended up queueing pretty badly - it just kept increasing. Their packet loss was around 17 %, which is extremely high. This is likely due to too many packets being sent on their connection.

I looked over the data I've been able to get from the BoTC session and have some theories that I'd like to investigate further. First, I noticed in this shot that the two users who are queuing and yourself appear to be impacted together and, based on the low number of control packets, were likely lagged out simultaneously - if those two users also happened to be going over the relay, the issues at that time could be related to bad Internet weather or something localized to your Internet connection.

20240217224033_1

I learned that the BoTC logs were purged after the event, but I would like to determine if these users were connected through the relay which could explain the commonality. Can you review the CZ LNL Relay and confirm if the two users that were impacted the most were going over the relay when connecting to BoTC?

Regarding the next screenshot, it looks like only one user was getting enough packet loss to go above 0% - due to how extreme their packet loss was, I would have to assume this was probably isolated to just their connection. Did anyone else have more than 0% packet loss at any time?

20240217224808_1

Later today for Creator Jam Session B we will run another build session on the prerelease - today was a resounding success for us, but I want to collect as much data as possible, as that's the only way we can be sure of what impacts, if any, the change has on connections. If anyone following this thread is around, please stop by and encourage others to do so as well :)

@Frooxius
Member

There's a few things from the BoTC:

  • There were multiple users who had spurts of queuing throughout the game. The screenshot you provided was taken near the end, but there were about 4-5 players (not including myself) who would end up with a large spurt of sudden queuing - all of them were Europeans. The server is in the US (I believe East coast, but would have to double check)
  • Only one of the players at BoTC was going through the relay. They were not the same player as the one with 17 % packet loss at the end - they had already left at that point
  • I've noticed some players popping to 2 % packet loss at one point, but after some time that averaged out
  • I've experienced significant packet loss myself in the middle of the game - I started queuing heavily (~35K), but most noticeably, everyone's voices and movements became very stuttery, to the point they were incomprehensible. I've never really had anything like this happen on my connection before

I am pretty sure that this behavior is not caused by people's connections or by factors other than the changed window size:

  • The players were regulars and we haven't seen this kind of behavior in any of the previous games with them
  • It was too many people (including myself) who experienced these kinds of issues to be a fluke
  • This is expected behavior with UDP when too many packets are being sent - the network elements will start throwing them away
  • Furthermore, like I mentioned in my initial post, I made this kind of change to the library a few years back and rolled it out to the community with a different project. The changes I observed back then match all of our observations - while things improved for a number of people, the protocol became a lot more "volatile" in a number of situations (sudden bursts of packet loss and queuing and even voice/pose stream instability), enough to force a rollback.

In conclusion, from what I've seen so far, I would not be comfortable pushing this as a fixed change to the main community at the moment.

However the good results you've been getting are still promising and indicative that increased window size does indeed help in a lot of cases and it's something that's worth pursuing, but we will have to tackle the downsides somehow.

E.g. having the window change size dynamically would be a potentially good step. If the library is able to negotiate the window size as part of the initial connection at the very least, we could make this into a per-user option they could tweak. Having it be able to change dynamically and adapt to the connection would be even better, but that's probably even more work.

@Banane9

Banane9 commented Feb 18, 2024

In conclusion, from what I've seen so far, I would not be comfortable pushing this as a fixed change to the main community at the moment.

However the good results you've been getting are still promising and indicative that increased window size does indeed help in a lot of cases and it's something that's worth pursuing, but we will have to tackle the downsides somehow.

Would it perhaps make sense to test a less drastic increase then? Something like 96 (half way) or so?

@Frooxius
Member

In conclusion, from what I've seen so far, I would not be comfortable pushing this as a fixed change to the main community at the moment.
However the good results you've been getting are still promising and indicative that increased window size does indeed help in a lot of cases and it's something that's worth pursuing, but we will have to tackle the downsides somehow.

Would it perhaps make sense to test a less drastic increase then? Something like 96 (half way) or so?

Based on my previous experience with this library, I don't think it would be worth it. I remember I did test a number of different values a while back and had to settle back on the default of 64, which is conservative enough to work in most cases and not trigger weird problems and instability.

I don't think that we'll find a single value that will work globally for everything. I think we'll need to make it change more dynamically - at the very least configurable per user, or having it adapt dynamically based on the conditions.

@JackTheFoxOtter

This sounds like a good takeaway from the testing. Dynamically changing the window size sounds like a desirable feature, and with some numbers to back up that every connection has a different optimal window size, it's at least clear that this will have a noticeable performance improvement once implemented. I agree with going the safe route and not changing the default for everyone if it can cause more instabilities.

@bredo228

We've done a test at BoTC earlier today and unfortunately I don't think this change is safe to make, because the game ran into significant networking issues, worse than a typical game.

A number of users would frequently queue pretty fast. While it would often end up clearing after a bit of time, the bursts of queuing were a lot more unpredictable than in regular sessions, from my observations.

We had to remove some people from the game, because at some point they ended up queueing pretty badly - it just kept increasing. Their packet loss was around 17 %, which is extremely high. This is likely due to too many packets being sent on their connection.

This matches the problems I remember happening when I made a similar attempt several years ago with the library, so I don't think any changes to the library itself changed the behavior here.

While the fixed change seems to help some users, it makes the connections a lot more volatile, so we'll need a different solution here. Either the library supporting dynamically changing window size and packet aggregation (which would need to be implemented) or trying different protocols, such as ENet.

Bit of a disappointing result considering the success of the other testing that has been going on with this change.

Didn't manage to get a Sync Lounge test with a bunch of people connected to an AU server - will aim to do that at some stage within the next couple days.

@stiefeljackal

This sounds like a good takeaway from the testing. Dynamically changing the window size sounds like a desirable feature, and with some numbers to back up that every connection has a different optimal window size, it's at least clear that this will have a noticeable performance improvement once implemented. I agree with going the safe route and not changing the default for everyone if it can cause more instabilities.

I appreciate the enthusiasm, but this is not a time to vote. This is to test the value, and multiple tests should be conducted to verify the validity of the change. When Rucio submitted this issue, it was to test the reduction of message queueing. Numbers will change and that is expected, but numbers should not be the only measure; the experience and the quality-of-life improvement should be measured as well. For the small number of users who were affected by this, interviews should be conducted to see what went wrong.

As Rucio pointed out:

We haven't had the JP users in the same session before in a very long time due to packet queuing being unbearable. We had at least four JP users who joined us and they never saw packet queuing numbers above single digits the entire time they were in there. As Stella reported, they only experienced queuing during a spike, but recovered quickly.

@stiefeljackal has been keeping tabs on the data for our event and no user connected had any packet loss. I suspect in the BoTC event that there could be other factors at play, such as the users' Internet connections, or if they were going through the temporary relay hosted in CZ.

Only one user connected through the relay, but he only came in for a few minutes and left so it wasn't enough for us to determine if the relay could be a factor. I am going to force the relay on myself and do more testing while users are still in the session.

Others who have tested this have provided positive feedback in the LNL Preview Build Discussion - BETA 2024.2.15.276 thread.

To add to Rucio's comments, here is the data captured during the Jam at peak times:

1:45 AM EST (06:45 UTC)

Test01

2:03 AM EST (7:03 UTC)

Test02

2:12 AM EST (7:12 UTC)

Test03

2:24 AM EST (7:24 UTC)

Test04

2:38 AM EST (7:38 UTC)

Test05

2:58 AM EST (7:58 UTC)

Test06

3:20 AM EST (8:20 UTC)

Test07

@iamgreaser

  • I've experienced significant packet loss myself in the middle of the game - I started queuing heavily (~35K), but most noticeably, everyone's voices and movements became very stuttery, to the point they were incomprehensible. I've never really had anything like this happen on my connection before

can be partially explained by

Also a note, I'm currently running just a single bridge & relay on my machine locally for the testing purposes, so there is only one located in the Czech Republic

The best thing you can do is... well, play it again with the same players, but actually have the relays hosted in the cloud.

The flaw with this methodology is that the relays will be notably less congested, but those who don't need the relay will most likely be unaffected. If your connection still suffers unexpectedly, then that's a useful data point.

And this time, please remember to ensure that it doesn't purge the logs!

P.S. If you really want to maximise the stress test, get everyone into NeoRoids or NeoFurs. From my observation, the higher the framerate, the more stress on the network connection. But this would make a comparison test on the release branch even more important.

@XDelta
Contributor

XDelta commented Feb 19, 2024

They normally don't have issues in the game, and most players were not going through the relay. Multiple players who were not going through the relay had issues as well. I'd also doubt the relay was the cause of issues, as it was local to Froox and they had issues as well. I did separately suggest that for future relay/networking-related changes, having a process to spin them up in the normal relay locations could be beneficial to remove additional variables.

The game servers currently don't keep their logs, or anything for that matter; they are fully deleted each time. We are working to tweak that so the logs are saved separately. If we want logs currently, we grab them before shutting the server down.

We generally don't ask players to change to specific avatars or change the things they normally do for these tests, so the tests remain more 'realistic' in that they match what a typical session in this world would be. We could have done other types of tests, but that wasn't what we were looking to do.

@bontebok
Author

I'd also doubt the relay was the cause of issues, as it was local to Froox and they had issues as well. I did separately suggest that for future relay/networking-related changes, having a process to spin them up in the normal relay locations could be beneficial to remove additional variables.

Just to clarify, the prerelease relay and Froox were on the same IP address, so it is possible that that particular route to the headless had trouble at some point during the game and some users were also going through the relay. Not saying that was the cause, but since the server logs aren't available, it would be hard to determine that conclusively.

@ProbablePrime
Member

I think some more testing is needed before we'd consider making this the default.

I'd be happy to set up some additional relay nodes in our existing relay locations, which could help too. Some of the relay boxes are a little slow as they're only designed to run one, but some can easily run another one for testing.

@iamgreaser

...come to think of it, the best avatars for creating network stress would be lightweight full-body ones - you get high FPS and 65% more frame per frame - I mean more data sent per frame. Danyy59's shared folder has a lot of these.

I'd also have some doubt the relay was the cause of issues as it was local to froox and they had issues as well.

But other people were using the relay - that is, Froox's connection to the internet. dammit, Rucio ninja'd me there.

We generally don't ask players to change to specific avatars or change things they do normally for these tests so they remain more 'realistic' in that it meets what a typical session in this world would be.

"Realistic" actually means "the worst case happens from time to time". Definitely worth doing at least one round with the extra network stress.

@stiefeljackal

I think some more testing is needed before we'd consider making this the default.

I agree 100%, and that is what the majority of us are trying to point out. Instead of forming a conclusion or making a determination based on a single test, let's make sure we conduct multiple tests, record the necessary data, and form our conclusions from there.

The first five comments in the Discord thread LNL Preview Build Discussion - BETA 2024.2.15.276 provide excellent instructions on how to conduct these tests. I would encourage those who want to test the Prerelease to follow those instructions.

@Frooxius
Member

As I discussed in my post above about the observations from BoTC, there is enough reason for concern not to make this a global change for everyone - I would be significantly uncomfortable with making that change, hence my suggestion to focus on a more nuanced solution. I believe that's worth pursuing, given the positive results - but those positive results cannot come at the cost of extra instability.

I'd appreciate it if you read my post above in full, because there are a number of points that I feel are being overlooked in order to "explain away" the bad results from BoTC. I understand that it's not good to hear some bad news (and I really hate to be the bringer of bad news here) when a lot of the results are positive, but this approach is making me very uncomfortable.

To address/repeat some key points:

  • There was only a single person going through the relay, but 4-5 people were having significant issues. The person experiencing 17 % packet loss was not connected through the relay (the person who was connected through the relay was already gone at that point). Therefore the issues cannot be explained by relays
  • Even if people were connected through a relay, this is still a cause for concern - with this change we need relays to function normally as well
  • I was not connected through a relay either. I've been playing BoTC weekly (sometimes more) for over 2 years now with the same ISP and setup, and I have never experienced issues like I did on Saturday in any of the games. Furthermore, the issues I experienced and observed match what happened the last time I attempted to increase the DefaultWindowSize with this library. I think with these factors combined, it would be too much of a coincidence for the issue to be caused by something else

As it stands, and given these reasons, I would be scared to make this the default for everyone. I understand there are some very good positive results, but they do not make the negatives go away.

The only ways forward I see:

  1. We explore how to make this value more dynamic, so it can be changed on the fly (at the very least at the time the connection is established)
  2. We do a lot more extensive testing and the issues do not reappear in any of the tests - but given that we've done this in the past, I do not expect to get different results, since the same issues have already reappeared

@ohzee00

ohzee00 commented Feb 19, 2024

I admit, the promise of users and friends who would benefit from this change has slightly biased me toward wanting it, but overall yeah, more testing or, better yet, a possible implementation of the value being more dynamic would be great.

I'm trying to see if I can do a test with another group if time and general energy allows, here's hoping it ends up promising.

@ohzee00

ohzee00 commented Feb 20, 2024

I did a small-scale test with a group of friends. Originally I was going to stop before even getting to the prerelease comparison due to the lack of people being on; however, it did lead to #1382 being discovered.

Finding that bug allowed us to test between release and prerelease with an Australian user who would desync with just four people in a session due to the above bug, two of whom were in the US close to the headless. On release they would slowly desync; on prerelease, with the same number of people, the same world, and the same conditions, they had no problem staying in sync with everyone else.

@Frooxius
Member

Frooxius commented Feb 22, 2024

We did another BoTC test yesterday. One game on the prerelease and another on the normal build.

The group was a little bit smaller, but we had a player from New Zealand (for both games) and some Europeans (including myself).

Observations

  • We didn't have instances of people queuing on the prerelease in this game. On the normal build, there was a burst of queuing during setup, but it eventually went down and we could play uninterrupted.
  • Pre-release definitely seems to help keep queuing down better when it works well
  • We ran into an issue on pre-release with real-time data streams, notably voice. Compared to the normal build, storytellers noticed it was crunchier on prerelease. Mine got particularly bad, to the point where people were having significant issues understanding me; I had 9 % packet loss for voice data on prerelease, even after most people left. On the normal build the voice was clean for everyone (including myself), even though we had more people.
  • Only a single player was connected through the relay, same as last game. I was NOT connected through a relay

Analysis

This is in line with my experience and understanding of what's happening - increasing the window size allows for more packets to be "in flight" - this means that with good conditions, packets get pushed through faster, better saturating the connection and keeping up with the queue.

However the flip side is that sending too many UDP packets increases the likelihood that the network elements will start throwing them out (this can happen for a number of reasons, e.g. when the buffer is full, UDP packets are just discarded), increasing the packet loss. This impacts the real-time data (voice & poses) and can also cause queuing, because the protocol becomes significantly less efficient when too many packets are being lost - blocking the queue and requiring a lot of re-transmissions.

Based on my experience, it seems like when that happens, the connection ends up going into a "death spiral" of sorts - it becomes backed up, so it tries to fire more packets, which is too much for the given connection to handle, increasing the packet loss and further degrading the throughput.

Even if the reliable queue is able to still push enough stuff, it can still impact the realtime data, like voice and poses, causing things to be jittery, which we experienced in both games.
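
To put rough numbers on the "in flight" point above: with a fixed window, a reliable channel can move at most one window of packets per round trip, so its throughput ceiling is roughly windowSize × payload / RTT. A quick back-of-the-envelope sketch (the ~1 KB of payload per packet is an assumption for illustration, not LNL's actual MTU):

```csharp
using System;

class WindowCeiling
{
    // Upper bound on reliable-channel throughput: at most one full window per round trip.
    static double CeilingKiBps(int windowSize, int payloadBytes, double rttSeconds)
        => windowSize * payloadBytes / rttSeconds / 1024.0;

    static void Main()
    {
        const int payload = 1000; // assumed useful bytes per packet, for illustration only
        foreach (var rtt in new[] { 0.05, 0.25 }) // ~50 ms (regional) vs ~250 ms (e.g. US <-> AU/JP)
        {
            Console.WriteLine($"RTT {rtt * 1000:F0} ms: window 64 ~ {CeilingKiBps(64, payload, rtt):F0} KiB/s, " +
                              $"window 128 ~ {CeilingKiBps(128, payload, rtt):F0} KiB/s");
        }
    }
}
```

Under these assumptions, a 64-packet window caps out around 250 KiB/s at 250 ms RTT versus roughly 1250 KiB/s at 50 ms, which is consistent with only the high-latency participants queueing in the same session; doubling the window doubles the ceiling, but also doubles how many packets can be dumped onto an already struggling path.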

Conclusion

Based on this and data collected by everyone else, increasing the window size is a good way to deal with queuing, because it can saturate the connection better.

However this also increases the likelihood that the connection becomes overwhelmed with the rate of UDP packets, increasing packet loss, which can have pretty bad effects.

Because of this, I don't believe a single fixed value for the window size is a good solution - the default is set conservatively so this doesn't happen with most connections. In my experiments with the library a few years back, I tried a number of different values and settled back on 64 to avoid this degradation.

However since it does help a lot in most cases, it's definitely part of the solution - we just need it to be more adaptive.

Next steps

In order to proceed on this, we would need to add the ability to the LiteNetLib library to adjust the window size dynamically based on network conditions.

The library is currently hardcoded to use a fixed window size. From poking around the code, I think a good approach might be to instead redefine that as the "maximum window size" and then have a "soft" size that can be adjusted dynamically - this one would simply prevent further packets in the bigger window from being transmitted until a slot is "freed up".

The other part will be having the mechanism which adjusts this value dynamically. This would probably take the form of gradually increasing the size and then stopping/decreasing it when packet loss is detected.

It'll probably take a bit to fine-tune this mechanism. But I'd need to poke around the library a bunch more to see if this approach will work well.
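
A minimal sketch of the adjustment loop described above, assuming a hypothetical hook that reports packet loss per measurement interval (none of these members exist in LiteNetLib today; this only illustrates the grow-gradually, back-off-on-loss idea, not a finished design):

```csharp
using System;

// Hypothetical sketch of the "soft" window described above. The hard window stays
// at a fixed maximum; the soft size is what actually gates transmission.
class AdaptiveWindow
{
    public const int MaxWindowSize = 256; // the redefined "maximum window size"
    public const int MinWindowSize = 64;  // today's conservative default

    public int SoftWindowSize { get; private set; } = MinWindowSize;

    // Called once per measurement interval with the observed loss fraction.
    public void OnLossSample(double lossFraction)
    {
        if (lossFraction < 0.01)
            SoftWindowSize = Math.Min(SoftWindowSize + 4, MaxWindowSize); // grow gradually
        else
            SoftWindowSize = Math.Max(SoftWindowSize / 2, MinWindowSize); // back off quickly
    }

    // Sender gate: packets beyond the soft size stay queued even if the hard window has room.
    public bool CanSend(int packetsInFlight) => packetsInFlight < SoftWindowSize;
}
```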

We could also ask the author of LNL again if they want to implement it officially, so we don't have to do it on our end, but if that's not the case, I'd be okay committing my time to this, since I have quite a bit of experience with the library code.

Extra notes

I did dig through the code of LNL a bit, and one of the things I thought would help - doing late merging of packets before transmission - is already implemented in the library, so it's not something that would help.

@PJB3005

PJB3005 commented Feb 23, 2024

I just want to share some knowledge that I have here since I've been looking into this kind of stuff for my own engine[1], in case it's useful. I don't know much about LNL's internals specifically, but from what I can gather in this discussion, many of the concepts have similarity to parts of TCP, QUIC, and Lidgren.Network.

When looking at protocols such as both Ye Olde TCP and the more modern QUIC, there are actually two separate concepts at play here: the client window (much discussed here) and the congestion window. My understanding of both "4 decade old RFC documents (TCP)" and "5 year old RFC documents (QUIC)" is that they play these roles:

  • The client window is "how much data can the other peer handle".
  • The congestion window is "how much data can the peer's network handle".

The client window has been much discussed here already. Indeed, increasing the client window by a hardcoded amount does mean that "the networking layer can push more bandwidth" (thanks to the bandwidth-delay product) but that obviously means you're more likely to overload specific people's networks. In TCP and QUIC the size of the client window is something continuously communicated between two peers, whereas in Lidgren (and from what I can read here, LNL) it's hardcoded: suitable for low-bandwidth transfer (what you'd hope most game traffic is), but not for large downloads.

The congestion window works much differently. Effectively, both peers try to keep tabs on the connection quality through some heuristics, and will throttle their outgoing bandwidth based on that. There are a lot of algorithms for this, but they're all calculated solely on the sending peer, and the other peer generally doesn't need to know or care[2]. These algorithms are also designed so that multiple independent programs or devices going through the same link can all "fairly" share the link without needing further cooperation.

One problem with common TCP congestion control algorithms, however (that I know of), is that they are loss based, i.e. the heuristic is "we have to retransmit due to packet loss, so we're probably overloading the connection". This has the issue of running into buffer bloat, which causes latency to rise significantly before the congestion control actually realizes the network is having issues. Google (relatively) recently made a new algorithm called BBR that's based on delay rather than loss, so it should keep buffer bloat under control; it's possibly worth looking into for games too.

What I'm trying to point out here is that when going off existing literature, the specific problem that needs solving here[3] is congestion control. The client window should not (and possibly cannot) itself be scaled based on the network quality.

For what it's worth, I took a look at ENet and it does mention something that sounds like congestion control in the API reference, since there was talk about using it in another issue.
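
A compact illustration of the split being described (generic names, not Lidgren or LNL API): the receiver-advertised client window stays fixed, while a sender-side congestion window, driven purely by local heuristics, decides how much of it may actually be used at any moment. The update rule below is the same loss-driven family as the soft-window sketch earlier in the thread, shown here as a second limit layered under a fixed client window:

```csharp
using System;

// Generic illustration of "client window" vs "congestion window" - not a real API.
class ReliableSender
{
    public int ClientWindow { get; } = 64;                     // what the peer can buffer (fixed in LNL today)
    public double CongestionWindow { get; private set; } = 8;  // what the path seems able to carry

    // The sender may only keep min(clientWindow, cwnd) packets unacknowledged at once.
    public int Allowance() => (int)Math.Min(ClientWindow, CongestionWindow);

    // Loss-based heuristic computed entirely on the sending side, once per round trip.
    public void OnRoundTripCompleted(bool lossDetected)
    {
        if (lossDetected)
            CongestionWindow = Math.Max(1, CongestionWindow / 2); // assume we overloaded the path
        else
            CongestionWindow += 1;                                // cautiously probe for more
    }
}
```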

Footnotes

  1. uses Lidgren, not LNL, but I believe the issues are mostly the same.

  2. though there are some changes to the TCP protocol that have been made to facilitate more accurate heuristics, like timing, for these algorithms.

  3. if you can solve dynamic window sizing that'd be nice too, but it might be far more involved at the protocol level.

@bontebok
Author

We have announced a coordinated test between Australia and the US, a pairing that commonly experiences packet queueing. The details can be found on the Resonite Discord, using the link below. We will be visiting four sessions with servers located in both Australia and the US, and will test on the public branch first, followed by the prerelease branch.

The test will be recorded and posted for all to see, and the test world we will be using will be made available as well.

If you are in the United States or Australia and would like to be part of this test, please read the details at the link below. Thanks everyone in advance for participating!

Australia and US Resonite Prerelease Test - 2/24/2024 3:00 AM UTC

@Charizmare

From my end of the test, on the pre-release build on both the US and AUS servers I noticed some slight popping from a few users when they spoke. I also had quite a few users comment that they heard quite a bit of popping and stuttering in my movements on the pre-release build on the AUS server. There was also popping on the US pre-release server from me and other users, but it wasn't as bad as the AUS server. For context, I live on the east coast of the US.

@bontebok
Author

bontebok commented Feb 24, 2024

If you participated in the AU/US test, feel free to post feedback here or in the thread located in the Resonite Discord. I'm in the process of collecting all of the videos from the test sessions to review carefully. I will compile my findings and have them reviewed by others who participated in the test for confirmation before posting here. I hope to have this completed before the weekend ends.

I have uploaded the videos from Test 1a and 1b (Public Branch) and Test 2a and 2b (Prerelease Branch) and you can find them linked below if you would like to review them yourself.

https://www.youtube.com/watch?v=yKMXGzECwE4
https://www.youtube.com/watch?v=1jvRYeplVs4

Here is the link to the test world we used during all four tests: resrec:///G-Rucios-Crew/R-9c9bb88f-2958-4ca4-9765-ca02f7cd7a8c

@bontebok
Author

Here are the results of the Australia and US test which was held on 2/23/2024 at 10:00 pm EST in Resonite. Thanks to @stiefeljackal for helping to develop the test plan, and to @bredo228 for providing the AU headless server!

Test Parameters

  1. The same server config file was used for all tests with four separate headless accounts
  2. No mods were used on the headless server to keep the headless servers vanilla
  3. The same AU and US servers were used for both Test 1 and Test 2, respectively
  4. All users stayed in the same avatars and mode (VR/Desktop/Fullbody) (8 in US, 11 in AU)
  5. The same test world was used for all four tests (resrec:///G-Rucios-Crew/R-9c9bb88f-2958-4ca4-9765-ca02f7cd7a8c)
  6. Each test ran for a minimum of 18 minutes.

An artificial delta message generator known as the "Desync Generator" was created to intentionally cause a string of a set length to be written to a variable on every update from the Host. This was used during each test to identify the limit of the number of bytes per update that could be sent before users started to queue.
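
Conceptually, the generator does something like the following on every host update (shown as a C# sketch of the idea only; the actual in-world item is built from Resonite components, and these names are hypothetical):

```csharp
// Conceptual sketch of the "Desync Generator": each host update, rewrite a synced
// string of a chosen length so that roughly that many bytes must be replicated
// to every connected user as a delta message.
class DesyncGenerator
{
    public int BytesPerUpdate = 2700; // dialed up until high-latency users start queueing
    private long _frame;

    // Hypothetical synced field standing in for the in-world variable being driven.
    public string SyncedPayload { get; private set; } = string.Empty;

    public void OnHostUpdate()
    {
        // Appending the frame counter guarantees the value changes every update,
        // so a delta of roughly BytesPerUpdate bytes is generated each frame.
        SyncedPayload = new string('x', BytesPerUpdate) + _frame++;
    }
}
```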

The four tests performed were:

  1. Test 1a - AU on Public
  2. Test 1b - US on Public
  3. Test 2a - AU on Prerelease
  4. Test 2b - US on Prerelease

Results

Server logs confirmed that no users connected through the LNL Relay during the tests. One user had used -ForceRelay when connecting to the first test, but removed the -ForceRelay flag and reconnected before the first test began.

Test 1a - AU on Public

  1. All users were able to remain in sync in this session
  2. Packet loss settled to 0% for all users by the end of the session
  3. The Desync Generator was able to send between 2700 and 2800 bytes per update before the users with the highest latency started queueing packets
  4. The FPS reported by the headless was between 40-41 fps
  5. No additional observations were reported.

Test 1b - US on Public

  1. About four users had difficulty staying synced; however, only a single user could not remain synced for the entire test
  2. Packet loss reached 1% for only two users by the end of the session
  3. Even though one user could not remain synced, the Desync Generator was still used and was able to reach 1862 bytes per update before the users with the highest latency started queueing packets
  4. The FPS reported by the headless was between 57-58 fps
  5. No additional observations were reported.

Test 2a - AU on Prerelease

  1. All users were able to remain in sync in this session
  2. Three users reached 0% packet loss by the end of the session, twelve users reached 1%, three users reached 2%, and one user reached 3%
  3. The Desync Generator was able to send 12000 bytes per update before the users with the highest latency started queueing packets
  4. The FPS reported by the headless was between 40-41 fps
  5. Audio hitching/pops were reported by the three users with the highest latency; once reported, other users came forward and said they observed the same

Test 2b - US on Prerelease

  1. All users were able to remain in sync in this session
  2. Eleven users reached 0% packet loss by the end of the session, four users reached 1%, three reached 2%, and one user reached 3%. One user who was using Starlink had significant latency during parts of the test, resulting in a maximum of 7% packet loss.
  3. The Desync Generator was able to send 5500 bytes per update before the user with the highest latency started queueing packets. The Desync Generator was then pushed to 7000 bytes per update before most users with high latency started queueing packets.
  4. Audio hitching/pops were confirmed by only one user during this session.

Summary

  • As confirmed in other test results reported here, increasing the DefaultWindowSize value in LiteNetLib's NetConstants.cs has a significant impact on reducing the likelihood that users with high-latency connections will queue messages during Resonite sessions (see the sketch after this list for why the window size bounds throughput at high latency).
  • The difference in the maximum number of bytes that could be used in the Desync Generator on each headless server is likely attributable to the difference in FPS between the two servers, as delta messages are limited to the server FPS. The US server was producing roughly 40% more updates than the AU server (57-58 fps vs 40-41 fps).
  • In the Prerelease branch, higher packet loss was observed using the DebugUsers component, with a maximum of 3% total packet loss by the end of Test 2a and Test 2b. The 7% packet loss for one user in Test 2b was likely attributable to a poor Starlink connection.
  • Audio hitching/pops were reported by some users on the prerelease branch during Test 2a. In reviewing video recordings submitted by three users, some hitching/pops were also observed in the public releases but went unnoticed. It is difficult to determine from the recordings whether the number or percentage of these events was higher during Test 2; however, given the increase in packet loss and what was observed in prior tests, this may be statistically relevant.
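
The reason the window size matters so much at high latency can be reduced to a back-of-the-envelope bound: a fixed-window reliable channel can have at most windowSize packets unacknowledged at once, so it cannot deliver more than roughly windowSize x packetSize bytes per round trip. The C# sketch below illustrates this; the packet size and RTT values are assumptions chosen for illustration, not measurements from these tests.

    // Rough throughput ceiling of a fixed-window reliable channel: at most
    // `windowSize` packets may be unacknowledged ("in flight") at once, so no
    // more than windowSize * packetSize bytes can be delivered per round trip.
    static class WindowMath
    {
        public static double MaxBytesPerSecond(int windowSize, int packetSizeBytes, double rttSeconds)
            => windowSize * (double)packetSizeBytes / rttSeconds;
    }

    // Illustrative numbers only (assumed, not measured during these tests):
    // a 64-packet window of ~1200-byte packets over a ~200 ms US<->AU round trip
    // caps out near 64 * 1200 / 0.2 = 384,000 bytes/s (~384 KB/s); raising the
    // window to 128 doubles that ceiling, which is consistent with high-RTT users
    // tolerating much larger deltas before queueing on the prerelease build.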

Special thanks to these Resonite Users who participated in the test!

avali032, BigRedWolfy, bredo, Canadian_RedZ, Charizmare, darbdarb, FoxBox, GunGryphon, Lexevo, Near By Gamer, Ocelot342, ohzee, Player Kode, Rucio, Stellanora, StiefelJackal, Sveken, TrebleSketch, and zorky.

@bredo228
Copy link

Adding a note regarding the lower frame rate on the Australian server: this was most likely down to the server itself being quite a bit less powerful than the US one (it was running on an old Intel Core i7-4790), rather than any other factors.

@Frooxius
Copy link
Member

Thank you for all the detailed info! I do feel like it's very promising and improves throughput a bit.

We still need a good solution to avoid the negative side effects of the increased packet loss.

I'm not sure if having dynamic window size scaling would be best. It feels like it's the simplest solution, but requires some development work on LNL. We could keep it simple and see if that works well enough.

We could try alternate protocols, like ENet, since @PJB3005 mentioned it might have congestion control already, but implementing another protocol is a bit of a task.

It might be worth adding some more diagnostics to LNL to better understand what's going on, too. I've been thinking about adding the following ones (a rough sketch of how they might be tracked follows):

  • Average size of packet (to see if the merging is working well)
  • Keeping track of how many packets are "in flight" at any given time, to see how often the window is "stalled"
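
As a rough illustration of how those two counters might be tracked alongside the existing send path, here is a hypothetical C# sketch; the type and member names are made up for illustration and are not part of the LiteNetLib API.

    // Hypothetical diagnostics helper (names are illustrative, not an LNL API).
    // Tracks the average size of sent packets and how often the reliable window
    // was already full ("stalled") when a send was attempted.
    public class LnlDiagnosticsSketch
    {
        private long _packetsSent;
        private long _bytesSent;
        private long _sendAttempts;
        private long _stalledAttempts;

        // Call whenever a (possibly merged) UDP datagram actually leaves the socket.
        public void OnPacketSent(int sizeBytes)
        {
            _packetsSent++;
            _bytesSent += sizeBytes;
        }

        // Call on each attempt to send a reliable packet; `inFlight` is the number of
        // unacknowledged packets on the channel, `windowSize` the configured window (64 or 128).
        public void OnSendAttempt(int inFlight, int windowSize)
        {
            _sendAttempts++;
            if (inFlight >= windowSize)
                _stalledAttempts++;   // window is full, so this packet has to queue
        }

        // Average size of a sent packet: a rough indicator of how well merging works.
        public double AveragePacketSize =>
            _packetsSent == 0 ? 0 : (double)_bytesSent / _packetsSent;

        // Fraction of send attempts that found the window full ("stalled").
        public double StalledFraction =>
            _sendAttempts == 0 ? 0 : (double)_stalledAttempts / _sendAttempts;
    }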

@bontebok
Copy link
Author

bontebok commented Mar 3, 2024

As all testing is complete and the prerelease is no longer needed, I'll go ahead and close this request. Thanks @stiefeljackal, @Frooxius, @shiftyscales, @ProbablePrime, and @bredo228 for all of your efforts. The information learned here will one day lead to a solution to eliminate packet queuing for high latency users!

@stiefeljackal
Copy link

As of update 2024.4.22.1366:

Bugfixes:

  • Fix invalid packet loss calculation for LNL resulting in numbers that are too large
    -- The packet loss % is now also more accurate, including full decimal points, rather than being rounded to the nearest percent

https://discord.com/channels/1040316820650991766/1154514012143362151/1232103701221605377

If the packet loss calculation was incorrect before that release, then the packet loss numbers exhibited in these tests are also incorrect. Therefore, this should be retested, as the crux of the conclusions mentions packet loss as the key factor in not going forward with this.
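
For context on why the rounding part of that fix matters when reading small loss figures, here is an illustrative comparison of an integer-rounded percentage versus a full-precision one; the counter names are assumptions for the example, not the exact LiteNetLib statistics API.

    // Illustrative comparison only; counter names are assumptions, not the LNL statistics API.
    long packetsSent = 4321;
    long packetsLost = 13;

    // Integer division truncates small loss rates to a whole percent (here: 0).
    long roundedPercent = packetsSent == 0 ? 0 : packetsLost * 100 / packetsSent;

    // Floating-point division keeps the decimals (here: ~0.30%).
    double precisePercent = packetsSent == 0 ? 0 : packetsLost * 100.0 / packetsSent;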

@Frooxius
Copy link
Member

The packet loss numbers during the tests were correct; the packet loss shown there wasn't affected by this bug.

The fix (and the bug) affected specifically the new dedicated packet loss property, which didn't exist at the time of these tests.
