
Task Hub Client Timeout Issue when directed to Specific Partition via Orchestration Id #295

Open
ohadtay opened this issue Aug 6, 2023 · 3 comments
Labels
P2 Priority 2

Comments


ohadtay commented Aug 6, 2023

Hey, I recently encountered an issue when attempting to create a new orchestration using the task hub client with the Netherite orchestration client. When the client request is routed to a specific partition (partition 6) by the orchestration id, the operation times out (client traces are included below). With any other partition, the operation completes successfully. I have tried restarting the machine, but the issue persists. Additionally, every request that targets this partition times out, including attempts to retrieve the orchestration's status.
If there is any additional information that would be helpful, please let me know.
The issue is reproducible with specific orchestration ids.
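For context, here is a minimal sketch of how the call is made, assuming DurableTask.Core's TaskHubClient over a Netherite IOrchestrationServiceClient; the orchestration name is a hypothetical placeholder, and the instance id is the one that appears in the traces below:

```csharp
// Minimal sketch of the failing scenario (not the actual app code), assuming
// DurableTask.Core's TaskHubClient over a Netherite IOrchestrationServiceClient.
using System.Threading.Tasks;
using DurableTask.Core;

public static class Repro
{
    public static async Task StartAsync(IOrchestrationServiceClient serviceClient)
    {
        var client = new TaskHubClient(serviceClient);

        // The instance id determines which partition handles the request.
        OrchestrationInstance instance = await client.CreateOrchestrationInstanceAsync(
            "MonitoringOrchestration",       // hypothetical orchestration name
            "",                              // version
            "engine-oro44ffn2507-physical2", // instance id from the traces below
            null);                           // input

        // Status queries for the same instance id also time out.
        OrchestrationState state = await client.GetOrchestrationStateAsync(instance.InstanceId);
    }
}
```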

I am using these versions:

These are the task hub parameters that I am using:
{
  "TaskhubName": "MonitoringHub",
  "TaskhubGuid": "6b5920e4-3b6c-4af0-942c-cae43ed2154e",
  "CreationTimestamp": "2023-07-06T09:22:03.5191171Z",
  "StorageFormat": "{\"FormatVersion\":5,\"UseAlternateObjectStore\":false,\"PageAndSegmentSizes\":[21,26,10,19],\"MemorySizes\":[22,20]}",
  "PartitionCount": 16
}

Client.3d18f47 ClientTimer scheduled (2023-08-03T13:58:36.0000000Z,1): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer scheduled (2023-08-03T13:58:36.0000000Z,1)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 ClientTimer is awakening at -251711229983.959s: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer is awakening at -251711229983.959s"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Part06 completed ClientWorkItem 3d18f47R2 instanceId=engine-oro44ffn2507-physical2 status=Create latencyMs=205.85 producedMessages=1: [{"Key":"partition","Value":6},{"Key":"workItemType","Value":1},{"Key":"workItemId","Value":"3d18f47R2"},{"Key":"instanceId","Value":"engine-oro44ffn2507-physical2"},{"Key":"status","Value":1},{"Key":"latencyMs","Value":205.85300000000279},{"Key":"producedMessages","Value":1},{"Key":"{OriginalFormat}","Value":"Part{partition:D2} completed {workItemType}WorkItem {workItemId} instanceId={instanceId} status={status} latencyMs={latencyMs:F2} producedMessages={producedMessages}"}]
Part06 sent TaskMessage 3d18f47R2M0 eventType=ExecutionStarted taskEventId=-1 instanceId=engine-oro44ffn2507-physical2 executionId=cc9a4aa2213848c2a57e1f9dad6a4301 persistenceDelayMs= sendDelayMs=205.85: [{"Key":"partition","Value":6},{"Key":"prefix","Value":""},{"Key":"messageId","Value":"3d18f47R2M0"},{"Key":"eventType","Value":"ExecutionStarted"},{"Key":"taskEventId","Value":-1},{"Key":"instanceId","Value":"engine-oro44ffn2507-physical2"},{"Key":"executionId","Value":"cc9a4aa2213848c2a57e1f9dad6a4301"},{"Key":"persistenceDelayMs","Value":""},{"Key":"sendDelayMs","Value":"205.85"},{"Key":"{OriginalFormat}","Value":"Part{partition:D2}{prefix} sent TaskMessage {messageId} eventType={eventType} taskEventId={taskEventId} instanceId={instanceId} executionId={executionId} persistenceDelayMs={persistenceDelayMs} sendDelayMs={sendDelayMs}"}]
Client.3d18f47 ClientTimer is awakening at 0.0027905s: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer is awakening at 0.0027905s"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 starting ClientTimer batch size=1 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"starting ClientTimer batch size=1 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 firing (2023-08-03T13:58:36.0000000Z,1): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"firing (2023-08-03T13:58:36.0000000Z,1)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 Request 3d18f47R2 for partition 06 timed out: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"eventId","Value":{"Category":0,"ClientId":"3d18f478-1e5e-44f6-a086-5cd17523d139","PartitionId":0,"Number":2,"SubIndex":0,"WorkItemId":null,"Index":null}},{"Key":"partitionId","Value":6},{"Key":"{OriginalFormat}","Value":"{client} Request {eventId} for partition {partitionId:D2} timed out"}]
Client.3d18f47 completed ClientTimer batch size=0 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"completed ClientTimer batch size=0 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]


saguiitay commented Aug 10, 2023

@sebastianburckhardt this is happening when using a specific instance Id, which appears to be mapped (hashed?) to partition 6.
This looks very similar to an issue you helped me debug in the DurableTask Azure Storage provider, in which the partition manager lost track of one of the partitions (e.g. 3 nodes running over 16 partitions, each taking 3 partitions, and "fighting" over the last one by stealing it from each other). See Azure/durabletask#717.
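To illustrate the mapping, here is a sketch of a hash-based instance-id-to-partition assignment, assuming a 32-bit FNV-1a hash; Netherite's internal hash function may differ, so this shows the mechanism (a fixed id always lands on the same partition) rather than a way to predict the real partition number:

```csharp
// Illustrative only: a hash-of-instance-id -> partition mapping. The exact
// hash Netherite uses internally may differ from this FNV-1a sketch.
using System.Text;

public static class PartitionMapping
{
    // Standard 32-bit FNV-1a over the UTF-8 bytes of the instance id.
    public static uint Fnv1aHash(string instanceId)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(instanceId))
        {
            hash ^= b;
            hash *= prime;
        }
        return hash;
    }

    public static uint GetPartition(string instanceId, uint partitionCount)
        => Fnv1aHash(instanceId) % partitionCount;
}
```

With PartitionCount = 16 (from the task hub parameters above), a given instance id always resolves to the same partition, which is consistent with one unhealthy partition making a specific set of ids time out every time.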

bachuv added P2 Priority 2 and removed Needs: Triage 🔍 labels Aug 31, 2023

stachow commented Sep 28, 2023

Hi. We believe we are hitting this issue periodically in our service's production environment.

We see one of the partitions become "frozen" at a point in time. After that point, roughly 1/12 of our orchestrator instances fail (we have the default 12 partitions). We also have a periodic housekeeping process that scans orchestrator instances using IDurableOrchestrationClient.ListInstancesAsync (sketched after the list below), and this process also stops working from the same point in time.

(screenshot omitted)

  • Apart from the PartitionCount we have the same parameters as the issue creator.
  • The issue has occurred at times of relatively low throughput.
  • The issue occurs every 3-7 days but we have no evidence this effect correlates with data accumulation.
  • Workers are seen starting and stopping around the time of the freeze, but we can't see anything (yet) beyond the normal pattern observed throughout the day.
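For reference, a sketch of the kind of housekeeping scan described above, assuming the in-process IDurableOrchestrationClient API from Microsoft.Azure.WebJobs.Extensions.DurableTask; the scan window and page size are illustrative:

```csharp
// Sketch of a paged instance scan (illustrative, not our actual housekeeping code).
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class Housekeeping
{
    public static async Task ScanInstancesAsync(
        IDurableOrchestrationClient client, CancellationToken cancellationToken)
    {
        var condition = new OrchestrationStatusQueryCondition
        {
            CreatedTimeFrom = DateTime.UtcNow.AddDays(-7), // hypothetical scan window
            PageSize = 100,
        };

        do
        {
            // Fetch one page of instances and process it.
            OrchestrationStatusQueryResult page =
                await client.ListInstancesAsync(condition, cancellationToken);

            foreach (DurableOrchestrationStatus status in page.DurableOrchestrationStatus)
            {
                // ... per-instance housekeeping ...
            }

            condition.ContinuationToken = page.ContinuationToken;
        }
        while (condition.ContinuationToken != null);
    }
}
```

If one partition is unresponsive, a query like this presumably cannot complete the page that needs results from that partition, which would explain the housekeeping process stalling from the same point in time.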

sebastianburckhardt (Member) commented

We have seen situations where one or more partitions become permanently unresponsive. We are still trying to find the root cause. Our latest release, 1.4.1, contains some mitigations and additional instrumentation, so it may be worth trying it out.

Also, I can take a look at our internal telemetry if you provide me with the app name and time window.
