-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: measure bitswap ttfb from after we get candidates back #432
base: main
Are you sure you want to change the base?
Conversation
I have an SP serving retrievals via booster-bitswap. He set it up in proxy mode behind the boostd node. When we try to fetch something using Lassie and specifying the protocol, the retrievals work. |
I'm suspecing that RetrievalBot might use a timeout of 10s: https://github.com/data-preservation-programs/RetrievalBot/blob/main/.env.retrievalworker#L4 @xinaxu am I understanding that correctly? Lassie has an overall timeout by default on the commandline of 20s - start fetching something or it'll time out. We're recording ttfb for Saturn atm, this is all peers, not just Filecoin of course, but ttfb for bitswap is the fastest across protocols, even with the extra delay for discovery this PR is looking at removing: It could be that we're that fast because these are Saturn nodes that maintain a large number of peers already-connected, and there's a lot of repeat content. But if booster-bitswap is taking longer than 10 seconds then that seems like a problem to me. It's also the case that Lassie is stuck on a boxo commit that has now been reverted (long story): ipfs/boxo#435 which has to do with initial peer discovery + connection, without this particular change we get a certain % where we know the peer(s) to connect to and tell bitswap but a race condition makes it fall through the cracks and we get what looks like a timeout. If we don't get a "proper" fix to this in boxo by the time we need to move on our boxo dependency then we may have to fork bitswap to maintain this! But looking at RetrievalBot's bitswap code I suspect this might not be the same problem we're addressing since it seems to be more direct & 1:1 whereas we use the full peer discovery mechanism and just short-circuit it with indexer results. |
btw I'm not sure we know about this, yet; we currently don't have insight into who's serving via Rhea/Saturn on Bitswap. That's what #345 was for but it's only just released and hasn't been deployed to Saturn yet. Once that's turned on we should have insight into Filecoin SPs giving us data via Bitswap. Hopefully by next week we should be able to make some more interesting statements about this and even get stuff onto the dashboard. |
@@ -300,6 +315,7 @@ func (br *bitswapRetrieval) RetrieveFromAsyncCandidates(ayncCandidates types.Inb | |||
TotalPayment: big.Zero(), | |||
NumPayments: 0, | |||
AskPrice: big.Zero(), | |||
TimeToFirstByte: time.Duration(ttfb.Load()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should preserve this change even if this PR is closed, it's the only retriever to not return this value and there's no good reason not to
Fixes: #333
This changes the start-time from when we start asking the indexer for bitswap candidates to when we get our first candidate back. So the "Bitswap TTFB" is measured from when we start attempting to talk to peers. This roughly matches the other retrievers that start their timing from when we start retrieving from each peer (we do them in sequence, so we do a per-peer ttfb from when we start the individual attempts).
BUT I'm having reservations about this now that I consider it further. Bitswap inherently comes with a greater discovery time penalty that we don't measure anywhere else (it's difficult to measure). If we move the timing as per this PR, we lose account of that delay; maybe we want to see that in the ttfb measurement for bitswap?
Either way, it's difficult to compare protocols; we just have to acknowledge that.
Feedback from @hannahhoward & @willscott would be good before proceeding.