Entities become unresponsive under load #339
Comments
Hi @greg-zund: Can you tell me a bit more about how much "load" is put on these Entities? Here are a few data points that should help us get a pulse here:
(1) How large are the inputs sent to the Entities? As with Orchestrators, inputs and outputs to DF APIs should be kept small to avoid performance issues in the long run. I've seen plenty of cases where folks use Entities as a kind of "database replacement" where, over time, their Entities may get stuck (i.e., slowed down greatly) due to the cost of de-/serializing large states repeatedly.
(2) How big does your workItem backlog tend to get? It would be good to log this, since the backlog becomes part of your Entity state. If an Entity accumulates a large backlog, its processing will slow down due to the cost of de-/serializing its state.
(3) How many signals are coming to the same Entity instanceID at any point in time?
I think I'll be able to answer this more definitively after getting your thoughts on my questions above. Thanks!
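To illustrate why a large backlog matters, here is a rough sketch of the cost model hinted at above. It assumes, as a simplification, that the whole entity state is re-serialized after every operation and that one work item is processed per operation. The work item shape and the function name are hypothetical, and JSON stands in for whatever serializer the runtime actually uses.

```python
import json


def bytes_serialized_while_draining(backlog_size: int) -> int:
    """Total bytes serialized while draining a backlog one item per operation."""
    # Hypothetical work item shape; the real objects are not shown in the thread.
    backlog = [
        {"id": i, "payload": "x" * 20, "attempts": 0}
        for i in range(backlog_size)
    ]
    total = 0
    while backlog:
        backlog.pop(0)  # one operation processes one work item
        # Assumption: the full remaining state is persisted after each operation.
        total += len(json.dumps(backlog))
    return total


small = bytes_serialized_while_draining(10)
large = bytes_serialized_while_draining(1000)
print(f"backlog of 10:   {small} bytes serialized in total")
print(f"backlog of 1000: {large} bytes serialized in total")
```

Under these assumptions the total grows roughly quadratically: a backlog that is 100 times longer costs far more than 100 times the serialization work to drain.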
Hi @davidmrdavid
(1) The state is a list (max 10000 elements, normally and on average only <10 elements) of custom objects, each with 3
(2) We have several metrics for the backlog and for the addition and removal of work items. However, in the case we are looking at, it seems that some entities have stopped doing work altogether. As mentioned above, 10000 elements in the backlog is the max I have seen, but potentially this could be higher if the delay in processing never stops.
(3) For one entity, there are several signals (the point in time is hard to determine).
I hope we can get this working; any help is appreciated.
Also, I was under the impression that the "keep small" guidance referred to the inputs and outputs and not to the entity state?
Another addition: the |
@davidmrdavid any new ideas?
Have you looked at https://microsoft.github.io/durabletask-netherite/#/ptable?
My apologies here, this thread fell off the radar. I'll respond to the remaining questions I see in case it's still helpful.
The docs were recently updated to reflect that entity state should also be kept small. Sorry it wasn't there before: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-best-practice-reference#keep-entity-data-small
The fact that a restart helps suggests this may not be a stuck partition in the same sense as what you experienced, @alexjebens. In the issue you linked, I wouldn't expect a restart to help, as it was a permanent error de-serializing the partition payload.
This is great to hear. Please keep us posted, @alexjebens, if it re-occurs. We did encounter one more FASTER corruption bug, which we're fixing here (#395), so you should get that fix automatically with the next Netherite release. @greg-zund - apologies for this thread falling off our radar. Did you work around this issue, or is it still present? Please let me know, and I can try to engage more team members here.
We are using Durable Functions in Azure with Netherite on an Elastic Premium plan (EP2).
We are using a setup with only entity functions and no orchestrators. Each entity has a list of work items it needs to process and an operation to trigger the processing of one task in the list. If the list is not empty after the operation has finished, the entity signals itself to run the operation again.
Pseudocode:
The workers are created and initially signaled by another function:
Pseudocode
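For readers following along, the pattern described above can be modeled roughly as follows. This is a plain-Python simulation, not the Durable Functions or Netherite API: `WorkerEntity`, `process_next`, `run`, and the in-memory `mailbox` are all hypothetical names standing in for the entity, its operation, the creating function, and the runtime's signal delivery.

```python
from collections import deque


class WorkerEntity:
    """Hypothetical stand-in for one entity whose state is a list of work items."""

    def __init__(self, entity_id, work_items):
        self.entity_id = entity_id
        self.work_items = list(work_items)
        self.done = []

    def process_next(self, signal):
        """Process one work item, then signal itself again if work remains."""
        if not self.work_items:
            return
        self.done.append(self.work_items.pop(0))
        if self.work_items:
            signal(self.entity_id, "process_next")


def run(workloads):
    """Create the workers, send the initial signals, and deliver until idle."""
    mailbox = deque()  # stand-in for the runtime's signal queue
    entities = {}

    def signal(entity_id, operation):
        mailbox.append((entity_id, operation))

    # The "other function": creates each worker and signals it once.
    for entity_id, items in workloads.items():
        entities[entity_id] = WorkerEntity(entity_id, items)
        signal(entity_id, "process_next")

    delivered = 0
    while mailbox:
        entity_id, operation = mailbox.popleft()
        getattr(entities[entity_id], operation)(signal)
        delivered += 1
    return entities, delivered


entities, delivered = run({"worker-a": [1, 2, 3], "worker-b": [4]})
print(entities["worker-a"].done, entities["worker-b"].done, delivered)
```

In this model each work item costs one delivered signal, so a backlog of N items means N operations on the same entity, each of which touches the entity's state.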
The problem we are facing is that this setup runs OK for some time and then entities start becoming "stuck" somehow (they aren't doing the calculations) and a query to ListEntitiesAsync times out. The only way to revive the durable functions is to restart the function app in Azure. We see some storage exceptions in the logs, but nothing really meaningful (to us). We don't see this problem without Netherite (although it should be noted that we don't have the exact same system deployed with Durable Functions backed by Azure Storage).
Is there a good way to debug this kind of problem when the durable runtime becomes unresponsive, or does someone see an obvious problem with the setup we are using?