Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perhaps longer latencies for Netherite for medium-sized payloads #384

Open
ericleigh007 opened this issue Apr 19, 2024 · 13 comments
Open

Perhaps longer latencies for Netherite for medium-sized payloads #384

ericleigh007 opened this issue Apr 19, 2024 · 13 comments
Labels
P2 Priority 2

Comments

@ericleigh007
Copy link

We have switched to Netherite for our production workloads and we are quite pleased with the results in our testing.

However, it appears that for some size payload, the latency of calling orchestrators and activities is greater than with the same code using the default backend and we'd like to explore if there are ways to get this more in line.

Our main codebase is cored in Change feeds from Cosmos DB and the change feed immediately calls an orchestrator that then calls activity based on the state of the document. I have taken some basic A/B data for the time the change feed function takes and it appears the timing is longer for Netherite.

This is a simple test without being able to properly isolate all the factors, but wanted to know if any of you, especially @sebastianburckhardt might know of a reason. The below graph illustrates the longer latencies. The blue line represents the request duration for our main change feed for the standard backend, while the red is the duration when Netherite is used. The Netherite durations are spread over a wider range and overall at least twice the duration of the standard backend, it would seem. The data is normalized so that the scale is percentage of the total requests.

image

Thanks!

@ericleigh007
Copy link
Author

I have done some more tests on this and my original hypothesis seems correct. Netherite is taking longer to start orchestrators and activiites.

  • My current follow-up idea is that this is due to the more scalable, but longer time it takes Netherite to push a message across Eventhub with its base throughput of only 1MB/sec (Per TU) as opposed to how long the Azure Storage backend takes to transmit queue entries.

Our code has to transmit this information down between orchestrators and activities because it is operating on cosmos data that we're cobbled together from various document types into a materialized object at the higher levels and is commonly used in the lower in let's say 10 places. Ergo, fetching it back from cosmos would be 10x more expensive than getting it once.

And for better or worse, we're not able to change that now, as we're expected to go live in less than a month.

@sebastianburckhardt you added blob storage to avoid transmitting large messages across Event hubs for me. Is there a configuration element we can tweak that would cause that logic to help us in our quest for lower latency here?

As a first step, I'm sure you'll ask me just how "large" our messages are at max, how large the average is, and I would turn that back to you to ask if there is something in you know of that could be used to report that info with little fuss.

🙏

@bachuv bachuv added P2 Priority 2 and removed Needs: Triage 🔍 labels Apr 23, 2024
@ericleigh007
Copy link
Author

I wouldn't mind if this were answered more quickly.
I was wondering about examples of maybe tweaking internal or user-accessible settings, or maybe a compression algorithm to avoid event hubs being as much of a bottleneck for the case of larger payloads.

@davidmrdavid
Copy link
Member

Hey @ericleigh007: sorry for the delay, it's been busy :-) .

My current follow-up idea is that this is due to the more scalable, but longer time it takes Netherite to push a message across Eventhub with its base throughput of only 1MB/sec (Per TU) as opposed to how long the Azure Storage backend takes to transmit queue entries.

This is an interesting theory, at a glance I see the logic, but I can't immediately confirm that this is the culprit. I do know that @sebastianburckhardt implemented a related optimization here (#275) but by scanning through the PR, I don't see any top-level configuration knobs.

As a first step, I'm sure you'll ask me just how "large" our messages are at max, how large the average is, and I would turn that back to you to ask if there is something in you know of that could be used to report that info with little fuss.

Good guess. It may be possible we already have telemetry for this - can you provide me with your app name, an affected instanceIDs, and timerange in UTC that can be used to dig deeper?

@ericleigh007
Copy link
Author

Thank you. i will collect these tomorrow and send them via your gmail if that's fine.

@davidmrdavid
Copy link
Member

Thanks @ericleigh007: Yes, that's fine, go ahead. Just please tag me here once sent so that I can check my email in case it did not land on the inbox.

@ericleigh007
Copy link
Author

Hello @davidmrdavid , I have sent some information which I hope is of help to your gmail.
Please shout back in email if you need anything else.

@ericleigh007
Copy link
Author

So @davidmrdavid and @sebastianburckhardt I know you said there are no knobs externally to turn, but I wonder if you can point me to something "internal" (I can build the source, you know), so that I can maybe find a better setting for what I'm doing.

I have a hypothesis that maybe the tradeoff point is not appropriate for our stuff, and maybe it gives Event Hubs more priority for the payload, thereby letting things take a few seconds to transfer. I'd probably rather use the blobs earlier, IF that won't blow our blob backend throughput limit.

I don't think it should, since I believe the old backend's throughput hit a wall because of lack of operations (counting scans) in storage queues.

Give me some ideas, please, thanks!

@davidmrdavid
Copy link
Member

Thanks @ericleigh007: Just confirming I've seen the email btw. I'll look to carve out time to see the data.

but I wonder if you can point me to something "internal" (I can build the source, you know), so that I can maybe find a better setting for what I'm doing.

Just for transparency, Sebastian is currently out of office so unfortunately he won't be able to assist in the immediate term, and he's indeed the SME here who'd more readily be able to point you to something internal. But I'm nonetheless hoping we can make a dent here in the meantime.

I do think that studying this PR (https://github.com/microsoft/durabletask-netherite/pull/275/files) seems like our best bet. In there, I see the creation of two new utility classes used to deal with large messages:

I think there's performance knobs in each we could tweak.
In the blobSender, I see these:

// these constants influence the max size of batches transmitted in EH and via blobs.
// note that these cannot be made arbitrarily large (EH batches cannot exceed max batch size on EH, and neither can the indexes of blob batches)
public int MaxEventHubsBatchBytes = 30 * 1024;
public int MaxEventHubsBatchEvents = 300;
public int MaxBlobBatchBytes = 500 * 1024;
public int MaxBlobBatchEvents = 5000;

In particular, the variable maxEventHubsBatchEvents seems to control whether or not we offload a large message to a blobBatch instead, as seen here:

{
// unless the total number of events is above the max already, we always check first if we can avoid using a blob
bool usingBlobBatches = toSend.Count > this.blobBatchSender.MaxEventHubsBatchEvents;

So if you're trying to opt into the "blob optimization" earlier (I think that was your suggestion) then perhaps you can make maxEventHubsBatchEvents smaller. Does that make sense?

On the BlobReceiver, I see there's some concurrency control in the semaphore (AsynchronousStorageReadMaxConcurrency), that limits the amount of azure storage concurrency, which perhaps can be naively increased as well:

https://github.com/microsoft/durabletask-netherite/blob/0e342923af76f82a68aebf5c1fff23ee619b281d/src/DurableTask.Netherite/TransportLayer/EventHubs/BlobBatchReceiver.cs#L108C39-L108C76

@ericleigh007
Copy link
Author

David, thanks
This is a balancing act, I think. The way I see it, now I have the choice between good throughput for orchestrator and activity input with the blobs, but higher transactions/second, and the slower throughput for that with event hubs themselves.

I was wondering if these blobs from Blob sender/receiver "COULD" be a different storage account, which might give us a way that could provide faster throughput without affecting the single storage account used for the other operations of the backend.

@davidmrdavid
Copy link
Member

davidmrdavid commented May 15, 2024

@ericleigh007: It may be technically possible to have a storage account just for blobs, but I worry that's a bigger change than it sounds, and even if I could provide it on a private package, I'm unsure I could officialize it in the main branch.

That said, I can definitely try to provide you with a private package that tweaks some of the perf knobs I described above. That said, to avoid mixing too many variables, I'd like to first make a bit more progress here (#383 (comment)) and once that error seems resolved (I think we're close), I'll look to provide some tweaks to these settings

@sebastianburckhardt
Copy link
Member

@ericleigh007: I think it is perhaps too early to fine tune the parameters without some better insight in what is causing the delays. I have taken a look at the telemetry for the two-day period and the orchestration you mentioned:

let start = datetime("2024-04-29T20:00");
let end = datetime("2024-05-01T20:00");
let orchestrationName = "CollectingTransactionSetChangeOrchestrator";

It seems that 99% of all orchestrations finish within 4.26 seconds. The interesting question is what happens to the others? There is a pretty long tail (max execution time is about 15 seconds). Importantly though, these 'slow' orchestration are not a constant occurrence over the two days observed. Rather, they all originate from just three distinctive time periods, of lengths 20 minutes, 7 minutes, and 6 seconds respectively.

I still have to dig deeper to find out what is causing the slowness during those times, but I would be surprised if it is related to the EH throughput. I am suspecting that it is about bugs or other faults that cause partitions to restart.

@ericleigh007
Copy link
Author

ericleigh007 commented May 18, 2024

My estimation is around 16MB for what we call the Collecting path (so orchestrators starting with Collecting....) smaller for the other flows, but I think 16MB IS a decent estimate for "frequent" use. We also have one or two orchestrators that take larger hunks of data, and one that is maybe as large as 50MB sometimes, which is a return of 25,000 "items" in a single Service Bus trigger.

I also did some quick experimentation with "hidden" compression / decompression and I wondered why the current persistence stores don't use this, or at least have it as an option.

@davidmrdavid
Copy link
Member

@sebastianburckhardt: we don't have stress tests with large inputs such as the ones described above, do we? I suppose it would be easy to modify them to generate and use several MB of random data?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
P2 Priority 2
Projects
None yet
Development

No branches or pull requests

4 participants