client not resolving any packets after lots of publish #133

Closed
SeverinAlexB opened this issue Feb 25, 2025 · 11 comments

Comments

@SeverinAlexB
Contributor

We were experimenting with pkarr packet churn rates when we found some weird behaviour.
If we publish 100 public keys without using the cache, the pkarr client always returns None even though the packets actually exist.
I made an example to reproduce it. resolve_most_recent() stops working completely, and resolve() works occasionally but with lots of misses.

Version: 3.5.1

FYI: @SHAcollision

@SeverinAlexB
Contributor Author

SeverinAlexB commented Feb 25, 2025

Actually, you don't seem to need all the publishing. For resolve_most_recent(), publish a key and then resolve it this way, and it fails.

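    use pkarr::{Client, PublicKey};

    // Assumes a tokio runtime to drive the async calls.
    #[tokio::main]
    async fn main() {
        let client = Client::builder()
            .no_relays()
            .cache_size(0) // Disable caching so we actually call the DHT.
            .maximum_ttl(0)
            .build()
            .unwrap();

        let key = PublicKey::try_from("de5ommih83t8u9x6p94yfrar367ewgprobf1x8uwdynixq7zoupy").unwrap();
        // The key was published beforehand, yet this branch is taken.
        if client.resolve_most_recent(&key).await.is_none() {
            println!("None");
        }
    }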

resolve() is a different story, though, as it sometimes works and sometimes doesn't.

@Nuhvi
Collaborator

Nuhvi commented Feb 26, 2025

  1. resolve_most_recent() with a zero-sized cache is a different bug, which was fixed in [email protected].
  2. Every supposedly "churned" key from running this script is:
    • Resolvable from the cargo run --example resolve <key> script a few seconds after you close the script.
    • Immediately resolvable from app.pkarr.org.

This mainly tells me that the DHT is just not happy with the rate of requests this script is making from the same IP.

@Nuhvi
Collaborator

Nuhvi commented Feb 26, 2025

There is also the possibility that Pkarr and/or Mainline itself is where the rate of queries is causing an issue (channels dropping messages or something), and that is what I will try to debug now in a more controlled environment.

@SeverinAlexB
Contributor Author

> This mainly tells me that the DHT is just not happy with the rate of requests this script is making from the same IP.

This is a possibility we are looking into right now. The goal is to determine what is possible with a republisher that keeps republishing user packets on the homeserver. I am currently switching the churn script to use the mainline lib directly so that I can count the actual items returned by the individual nodes. This will give me more insight.

> There is also the possibility that Pkarr and/or Mainline itself is where the rate of queries is causing an issue (channels dropping messages or something), and that is what I will try to debug now in a more controlled environment.

Cool. I'll let you know if I figure out more.

@Nuhvi
Collaborator

Nuhvi commented Feb 26, 2025

> The goal is to determine what is possible with a republisher that keeps republishing user packets on the homeserver.

I think the correct methodology is:

  1. Test how many records you can publish per 2-hour window before the DHT starts rejecting your PUT queries.
  2. Separately, randomly sample records once every 10 minutes or so, and see how often you fail to resolve them.
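
As a rough illustration of step 2, here is a minimal, std-only Rust sketch of such a sampling loop. Everything in it is hypothetical: `sample_resolution`, `SampleReport`, and the `try_resolve` closure are names invented for this example, and the closure merely stands in for a real DHT lookup. The small xorshift generator is only there to keep the sketch free of external crates.

```rust
use std::time::Duration;

/// Tiny deterministic pseudo-random generator (xorshift64). It is used only
/// so that this sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Outcome of one sampling run.
#[derive(Debug, PartialEq)]
struct SampleReport {
    attempts: usize,
    successes: usize,
}

impl SampleReport {
    /// Fraction of lookups that succeeded (0.0 when nothing was sampled).
    fn success_rate(&self) -> f64 {
        if self.attempts == 0 {
            0.0
        } else {
            self.successes as f64 / self.attempts as f64
        }
    }
}

/// Picks one key at random per round, asks `try_resolve` whether it can be
/// found, and pauses between rounds. `try_resolve` is a placeholder for a
/// real lookup against the DHT.
fn sample_resolution<K>(
    keys: &[K],
    rounds: usize,
    pause: Duration,
    seed: u64,
    mut try_resolve: impl FnMut(&K) -> bool,
) -> SampleReport {
    let mut report = SampleReport { attempts: 0, successes: 0 };
    if keys.is_empty() {
        return report;
    }
    let mut rng = XorShift64(seed.max(1)); // xorshift must not start at zero
    for _ in 0..rounds {
        let index = (rng.next() % keys.len() as u64) as usize;
        report.attempts += 1;
        if try_resolve(&keys[index]) {
            report.successes += 1;
        }
        std::thread::sleep(pause);
    }
    report
}
```

Calling it with `pause` set to roughly ten minutes and a closure that wraps the real lookup would approximate the sampling described above; `success_rate()` then gives the fraction of lookups that succeeded.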

The last time I tried this, before I started Pkarr in Rust, I managed to keep ~150k records alive with a >95% resolution success rate.

Anyways, good luck.

@SeverinAlexB
Contributor Author

SeverinAlexB commented Feb 26, 2025

I incorporated mainline into the resolve function and I see the same behaviour. https://github.com/pubky/pkarr-churn/tree/mainline-problem

Resolution starts to drop heavily after 30 public keys. And keep in mind, I sleep for 1 minute (!) after every resolve in this example. So I doubt that it is rate limiting. I will keep experimenting.

2025-02-26T09:53:57.954428Z  INFO pkarr_churn_experiment: - 0/50 Key 97z75z8pm3igwsxzn9f1s1eneizbcrepi6kr8enaet3orcbe1yhy is resolvable on 21 nodes.
2025-02-26T09:55:00.426725Z  INFO pkarr_churn_experiment: - 1/50 Key dqou79zyg31nxu7uwcg9nrtj8sjubtzmky361ha98mp14dw5nhey is resolvable on 26 nodes.
2025-02-26T09:56:02.838772Z  INFO pkarr_churn_experiment: - 2/50 Key sdos6sex3gt7yeqqatdpdafeek1d5xqkxs8ws6cs6mub5re319po is resolvable on 25 nodes.
2025-02-26T09:57:05.108534Z  INFO pkarr_churn_experiment: - 3/50 Key psdnz7fof9tae55iinuewamnamgfw6e1uq36a3gtyrdaomnrwyry is resolvable on 17 nodes.
2025-02-26T09:58:07.572797Z  INFO pkarr_churn_experiment: - 4/50 Key a7ejkn9mf8dfg4twgntarngk8jrhosjta51y6buxhkf3dpeghcsy is resolvable on 19 nodes.
2025-02-26T09:59:09.871704Z  INFO pkarr_churn_experiment: - 5/50 Key 44c6x3wg1a7josbdcy9fisg181zkwy1jhbq7t6urub8syp8kjjdo is resolvable on 21 nodes.
2025-02-26T10:00:12.256919Z  INFO pkarr_churn_experiment: - 6/50 Key cbffgwrouconyxed9ud7gcngzjhx1ebscha959cwy71b5dqarnoy is resolvable on 23 nodes.
2025-02-26T10:01:14.634595Z  INFO pkarr_churn_experiment: - 7/50 Key gbduhf6eyc5h4edgq9bgi7hih7iwz6pjpinbn4jyxoxsce3pe7oo is resolvable on 17 nodes.
2025-02-26T10:02:17.405702Z  INFO pkarr_churn_experiment: - 8/50 Key i91a77e5sbf1zt8ad5k9x5efajpzweg4r9iw8wbi7kp7qjg6rkpy is resolvable on 19 nodes.
2025-02-26T10:03:19.783977Z  INFO pkarr_churn_experiment: - 9/50 Key 8xebyum5nwstjg1tsepanowrkwut79cmi83e4xegj6ggmz53xg5o is resolvable on 20 nodes.
2025-02-26T10:04:22.456331Z  INFO pkarr_churn_experiment: - 10/50 Key gg5qr7g6ypgzz78gznabhqhxuhydrn8rhjoizwnwkbpt7uxp7cjo is resolvable on 16 nodes.
2025-02-26T10:05:25.019211Z  INFO pkarr_churn_experiment: - 11/50 Key r5xm4f1anogd91ebh8uj1uomekbjpmtdefpr9donyrssns9g4nty is resolvable on 17 nodes.
2025-02-26T10:06:27.392518Z  INFO pkarr_churn_experiment: - 12/50 Key bgyyrtqpqkzyeza3xzzdwo3ep4uzfdxfzjrwuoous7zcm8tr1gyy is resolvable on 20 nodes.
2025-02-26T10:07:30.170789Z  INFO pkarr_churn_experiment: - 13/50 Key 4je976hz38mrjwqgpaw5zi3t18ca8up86krmwqsk73zye4sn5mho is resolvable on 15 nodes.
2025-02-26T10:08:32.667768Z  INFO pkarr_churn_experiment: - 14/50 Key ffsr1u584ixwt49119iqczm9ai5its5nb9ekhtcjnnwr77fyt3co is resolvable on 15 nodes.
2025-02-26T10:09:35.109878Z  INFO pkarr_churn_experiment: - 15/50 Key 517hrxd4nmt45hx3p1jzyswpxu94a19fz4g6c1xog1dziyraj56y is resolvable on 17 nodes.
2025-02-26T10:10:37.627732Z  INFO pkarr_churn_experiment: - 16/50 Key xgiqsrph3et5hp53qcpcp8s8bb6t59wssg6q4fd4k3eo1mwd9c1o is resolvable on 14 nodes.
2025-02-26T10:11:40.193772Z  INFO pkarr_churn_experiment: - 17/50 Key ruhjbobp9u9qjgxebf5i9b9tcstits8f5z5t9555g769yfm6os4o is resolvable on 17 nodes.
2025-02-26T10:12:42.704537Z  INFO pkarr_churn_experiment: - 18/50 Key 7fpu11ybpn4u3wdy3qti19k6kjcs71aheap5ximyjpk5h54f8zzo is resolvable on 17 nodes.
2025-02-26T10:13:45.271250Z  INFO pkarr_churn_experiment: - 19/50 Key sxfrcpwdc6e8fah31gzimkqgb48imgqeosqca83rbqanq1ybe53o is resolvable on 20 nodes.
2025-02-26T10:14:47.688836Z  INFO pkarr_churn_experiment: - 20/50 Key 8ni4a6qxjkbwt4pssygpjig8qagcrrroq8r69dohtbbdyd3my9my is resolvable on 16 nodes.
2025-02-26T10:15:50.119908Z  INFO pkarr_churn_experiment: - 21/50 Key jhfbgerad5dis75hxp8zf8e15qs7zcjky6x4g8r1u44xbczqcp9o is resolvable on 15 nodes.
2025-02-26T10:16:53.204056Z  INFO pkarr_churn_experiment: - 22/50 Key zeufsb1ukybec8w494e9whd56aiub7nsymkwb63pmfc1ibtutb3y is resolvable on 18 nodes.
2025-02-26T10:17:55.882563Z  INFO pkarr_churn_experiment: - 23/50 Key t4sjobo45gu8cwhi95edo4ojkha3arijjhwhhs3eugdkm63yz8jo is resolvable on 16 nodes.
2025-02-26T10:18:58.667348Z  INFO pkarr_churn_experiment: - 24/50 Key kzurbahohkyspa76dabzrteg6x6qm11nd9qc7bnmtsk3xan34fty is resolvable on 15 nodes.
2025-02-26T10:20:00.970007Z  INFO pkarr_churn_experiment: - 25/50 Key f4iayna836xnmjz3qpr5bk741atn1wdzgejiew9atw8fjdr5sk8y is resolvable on 15 nodes.
2025-02-26T10:21:03.520087Z  INFO pkarr_churn_experiment: - 26/50 Key 8xyndggfgmwr1obpkn4z5gdwryue8bwjwym8mekoemwcg8kf3rzo is resolvable on 19 nodes.
2025-02-26T10:22:05.933702Z  INFO pkarr_churn_experiment: - 27/50 Key o1dnk9f979asj65j6pbx7q3ohbzim3jz1brfzarirwhtztzewsdo is resolvable on 16 nodes.
2025-02-26T10:23:08.391699Z  INFO pkarr_churn_experiment: - 28/50 Key hq1k3dn87jfrsttdobj5b9fgcrh3qhyc38m5sjr5tg4ya3pnmouy is resolvable on 14 nodes.
2025-02-26T10:24:11.052462Z  INFO pkarr_churn_experiment: - 29/50 Key gzxbnqpc5t4ijnsfwxupob545hyz76aeaeupre7z67n6j45jaojy is resolvable on 10 nodes.
2025-02-26T10:25:13.501293Z  INFO pkarr_churn_experiment: - 30/50 Key 6m3hp9ymf6nwxuaidh3ixhucxe7bewfje9j1pq7kfs43be91eyfo is resolvable on 1 nodes.
2025-02-26T10:26:15.920237Z  INFO pkarr_churn_experiment: - 31/50 Key izozuwbp3j77ekm68ohsite1pekrxbwncp348a66astk386kg1qy unresolved

@Nuhvi
Collaborator

Nuhvi commented Feb 27, 2025

For anyone interested in stress testing pkarr/mainline and getting a picture of how long it takes for a packet to churn, here is an MRE that works as expected and is easy to read:

  • First, you publish as many records as you want.
  • Then you start randomly sampling one of these keys on an interval (defaults to 1 second) and see how long it takes before resolution starts failing.

In my experience, publishing 512 records works fine with no issues, and I got a 99.7% resolution success rate for the first 10 minutes...

Even then, when a record failed to resolve, that was most likely a rate limiting issue, because I can manually resolve that failing key just fine.

I didn't run this random sampling for longer than 10 minutes because I know from experience that churning can take hours, and on occasion days, and I am not invested enough to find the exact number.

@SeverinAlexB
Contributor Author

SeverinAlexB commented Feb 27, 2025

> In my experience, publishing 512 records works fine with no issues, and I got a 99.7% resolution success rate for the first 10 minutes...

Doesn't work for me. It fails all the time.

89/100 Stored mutable data as PublicKey(6r5b5axmfh37jns4n34sy4dnppwht8cik7891psxeduw5xh7byro) in 4323 milliseconds
thread 'main' panicked at src/main_ar_publish.rs:28:14:
Failed to publish: Query(NoClosestNodes)

I guess it's rate limiting. Not sure where, though. Maybe from asking the same nodes again and again for routing info?

@Nuhvi
Collaborator

Nuhvi commented Feb 27, 2025

I suspect that you are either using a VPN (sharing the same IP with others), or you have been making more requests than I do (running pkarr-churn more often), or mainline is just busier now.

Rate limiting is not the only problem, though. UDP in general is flaky: it can be overwhelmed, and packets can be dropped through no fault of your own, or anyone's.

I still don't see enough evidence that the mainline implementation has a bug. Maybe more throttling of requests would help, but I would rather leave it to whoever is dealing with this edge case to retry and back off, at least for now.
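
For readers who want a starting point for that retry-and-back-off approach, the following is a minimal, std-only Rust sketch. It is an illustration and not part of pkarr or mainline: `retry_with_backoff` and its parameters are hypothetical names, and `op` stands in for whatever publish or resolve call is being retried. It blocks the current thread with `std::thread::sleep`; in async code the runtime's own sleep would presumably be used instead.

```rust
use std::time::Duration;

/// Retries `op` up to `max_attempts` times, doubling the pause after every
/// failure (capped at `max_delay`). Returns the first success, or the last
/// error once all attempts are used up. `op` receives the 1-based attempt
/// number.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    assert!(max_attempts > 0, "need at least one attempt");
    let mut delay = initial_delay.min(max_delay);
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                std::thread::sleep(delay);
                // Exponential backoff: double the delay, but never exceed the cap.
                delay = delay.checked_mul(2).unwrap_or(max_delay).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap a single publish or resolve attempt in the closure and map transient failures (for example the `NoClosestNodes` error reported above) to `Err`, so that they are retried after a growing pause.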

If the issue were blocking an actual production need, as opposed to statistics gathering that I quite honestly don't consider critical or actionable, I would be acting with a greater sense of urgency.

@Nuhvi
Collaborator

Nuhvi commented Feb 27, 2025

> In my experience, publishing 512 records works fine with no issues, and I got a 99.7% resolution success rate for the first 10 minutes...

I just replicated this from my laptop behind a VPN and from my VPS.

Resolution right after that is a bit spotty, though, but I couldn't see an issue while publishing. And as usual, resolving manually always succeeds.

I am satisfied that a medium-sized service provider can easily keep the records of thousands of users alive on the DHT, at worst by implementing some retries and backoff, let alone what can be done with relays and republishing from native clients.

I don't see a bug or a mission failure here... so far.

@SeverinAlexB
Contributor Author

Conclusion:

  • UDP reliability is a real problem.
  • It's more reliable to create a new pkarr client for parallel queries than to use the same instance.
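
The second point can be sketched roughly as follows. This is a generic, std-only Rust illustration (it needs Rust 1.63+ for `std::thread::scope`), not pkarr code: `resolve_with_fresh_clients`, `make_client`, and `resolve` are hypothetical names, with the two closures standing in for the real client constructor and lookup call.

```rust
use std::thread;

/// Resolves every key on its own thread, building a fresh client for each
/// query instead of sharing one instance. `make_client` and `resolve` are
/// placeholders for the real constructor and lookup. Results come back in
/// the same order as `keys`.
fn resolve_with_fresh_clients<C, K, R>(
    keys: &[K],
    make_client: impl Fn() -> C + Sync,
    resolve: impl Fn(&C, &K) -> R + Sync,
) -> Vec<R>
where
    K: Sync,
    R: Send,
{
    // Shared references are cheap to copy into each worker closure.
    let make_client = &make_client;
    let resolve = &resolve;
    thread::scope(|scope| {
        let mut handles = Vec::with_capacity(keys.len());
        for key in keys {
            handles.push(scope.spawn(move || {
                // One client per query; it is dropped when the lookup ends.
                let client = make_client();
                resolve(&client, key)
            }));
        }
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

One thread per key is only sensible for small batches; a larger run would presumably want a bounded pool or async tasks, still with one client per query.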

Closing here as it is resolved for us.
