
Task Hub Client Timeout Issue when directed to Specific Partition via Orchestration Id #295

Open
ohadtay opened this issue Aug 6, 2023 · 3 comments
Labels
P2 Priority 2

Comments


ohadtay commented Aug 6, 2023

Hey, I recently encountered an issue when attempting to create a new orchestration using the task hub client with the Netherite orchestration client. When the client request is routed to a specific partition (partition 6) by the orchestration id, the operation times out (client traces are included below). With any other partition, the operation completes successfully. I have tried restarting the machine, but the issue persists. Additionally, every request that targets this partition times out, including attempts to retrieve the orchestration's status.
If there is any additional information that would be helpful, please let me know.
The issue is reproducible with specific orchestration ids.
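For context, here is a minimal sketch of how the call is made, assuming DurableTask.Core's TaskHubClient over a Netherite IOrchestrationServiceClient; the orchestration name is a hypothetical placeholder, and the instance id is the one that appears in the traces below:

```csharp
// Minimal sketch of the failing scenario (not the actual app code), assuming
// DurableTask.Core's TaskHubClient over a Netherite IOrchestrationServiceClient.
using System.Threading.Tasks;
using DurableTask.Core;

public static class Repro
{
    public static async Task StartAsync(IOrchestrationServiceClient serviceClient)
    {
        var client = new TaskHubClient(serviceClient);

        // The instance id determines which partition handles the request.
        OrchestrationInstance instance = await client.CreateOrchestrationInstanceAsync(
            "MonitoringOrchestration",       // hypothetical orchestration name
            "",                              // version
            "engine-oro44ffn2507-physical2", // instance id from the traces below
            null);                           // input

        // Status queries for the same instance id also time out.
        OrchestrationState state = await client.GetOrchestrationStateAsync(instance.InstanceId);
    }
}
```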

I am using these versions:

These are the task hub parameters that I am using:
{
  "TaskhubName": "MonitoringHub",
  "TaskhubGuid": "6b5920e4-3b6c-4af0-942c-cae43ed2154e",
  "CreationTimestamp": "2023-07-06T09:22:03.5191171Z",
  "StorageFormat": "{\"FormatVersion\":5,\"UseAlternateObjectStore\":false,\"PageAndSegmentSizes\":[21,26,10,19],\"MemorySizes\":[22,20]}",
  "PartitionCount": 16
}

Client.3d18f47 ClientTimer scheduled (2023-08-03T13:58:36.0000000Z,1): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer scheduled (2023-08-03T13:58:36.0000000Z,1)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 ClientTimer is awakening at -251711229983.959s: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer is awakening at -251711229983.959s"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Part06 completed ClientWorkItem 3d18f47R2 instanceId=engine-oro44ffn2507-physical2 status=Create latencyMs=205.85 producedMessages=1: [{"Key":"partition","Value":6},{"Key":"workItemType","Value":1},{"Key":"workItemId","Value":"3d18f47R2"},{"Key":"instanceId","Value":"engine-oro44ffn2507-physical2"},{"Key":"status","Value":1},{"Key":"latencyMs","Value":205.85300000000279},{"Key":"producedMessages","Value":1},{"Key":"{OriginalFormat}","Value":"Part{partition:D2} completed {workItemType}WorkItem {workItemId} instanceId={instanceId} status={status} latencyMs={latencyMs:F2} producedMessages={producedMessages}"}]
Part06 sent TaskMessage 3d18f47R2M0 eventType=ExecutionStarted taskEventId=-1 instanceId=engine-oro44ffn2507-physical2 executionId=cc9a4aa2213848c2a57e1f9dad6a4301 persistenceDelayMs= sendDelayMs=205.85: [{"Key":"partition","Value":6},{"Key":"prefix","Value":""},{"Key":"messageId","Value":"3d18f47R2M0"},{"Key":"eventType","Value":"ExecutionStarted"},{"Key":"taskEventId","Value":-1},{"Key":"instanceId","Value":"engine-oro44ffn2507-physical2"},{"Key":"executionId","Value":"cc9a4aa2213848c2a57e1f9dad6a4301"},{"Key":"persistenceDelayMs","Value":""},{"Key":"sendDelayMs","Value":"205.85"},{"Key":"{OriginalFormat}","Value":"Part{partition:D2}{prefix} sent TaskMessage {messageId} eventType={eventType} taskEventId={taskEventId} instanceId={instanceId} executionId={executionId} persistenceDelayMs={persistenceDelayMs} sendDelayMs={sendDelayMs}"}]
Client.3d18f47 ClientTimer is awakening at 0.0027905s: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer is awakening at 0.0027905s"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 starting ClientTimer batch size=1 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"starting ClientTimer batch size=1 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 firing (2023-08-03T13:58:36.0000000Z,1): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"firing (2023-08-03T13:58:36.0000000Z,1)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 Request 3d18f47R2 for partition 06 timed out: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"eventId","Value":{"Category":0,"ClientId":"3d18f478-1e5e-44f6-a086-5cd17523d139","PartitionId":0,"Number":2,"SubIndex":0,"WorkItemId":null,"Index":null}},{"Key":"partitionId","Value":6},{"Key":"{OriginalFormat}","Value":"{client} Request {eventId} for partition {partitionId:D2} timed out"}]
Client.3d18f47 completed ClientTimer batch size=0 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"completed ClientTimer batch size=0 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]


saguiitay commented Aug 10, 2023

@sebastianburckhardt this is happening when using a specific instance Id, which appears to be mapped (hashed?) to partition 6.
This looks very similar to an issue you helped me debug in the DurableTask Azure Storage provider, in which the partition manager lost track of one of the partitions (e.g. 3 nodes running over 16 partitions, each taking 3 partitions, and "fighting" over the last one by stealing it from each other). See Azure/durabletask#717.
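To illustrate the mapping, here is a sketch of a hash-based instance-id-to-partition assignment, assuming a 32-bit FNV-1a hash; Netherite's internal hash function may differ, so this shows the mechanism (a fixed id always lands on the same partition) rather than a way to predict the real partition number:

```csharp
// Illustrative only: a hash-of-instance-id -> partition mapping. The exact
// hash Netherite uses internally may differ from this FNV-1a sketch.
using System.Text;

public static class PartitionMapping
{
    // Standard 32-bit FNV-1a over the UTF-8 bytes of the instance id.
    public static uint Fnv1aHash(string instanceId)
    {
        const uint offsetBasis = 2166136261;
        const uint prime = 16777619;

        uint hash = offsetBasis;
        foreach (byte b in Encoding.UTF8.GetBytes(instanceId))
        {
            hash ^= b;
            hash *= prime;
        }
        return hash;
    }

    public static uint GetPartition(string instanceId, uint partitionCount)
        => Fnv1aHash(instanceId) % partitionCount;
}
```

With PartitionCount = 16 (from the task hub parameters above), a given instance id always resolves to the same partition, which is consistent with one unhealthy partition making a specific set of ids time out every time.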

bachuv added P2 Priority 2 and removed Needs: Triage 🔍 labels Aug 31, 2023

stachow commented Sep 28, 2023

Hi. We believe we are hitting this issue periodically in our service's production environment.

We see one of the partitions become "frozen" at a point in time. After that point, roughly 1/12 of our orchestrator instances fail (we have the default 12 partitions). We also have a periodic housekeeping process that scans orchestrator instances using IDurableOrchestrationClient.ListInstancesAsync (sketched after the list below), and this process also stops working from the same point in time.

(screenshot omitted)

  • Apart from the PartitionCount we have the same parameters as the issue creator.
  • The issue has occurred at times of relatively low throughput.
  • The issue occurs every 3-7 days but we have no evidence this effect correlates with data accumulation.
  • Workers are seen starting and stopping around the time of the freeze, but we can't see anything (yet) beyond the normal pattern observed throughout the day.
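For reference, a sketch of the kind of housekeeping scan described above, assuming the in-process IDurableOrchestrationClient API from Microsoft.Azure.WebJobs.Extensions.DurableTask; the scan window and page size are illustrative:

```csharp
// Sketch of a paged instance scan (illustrative, not our actual housekeeping code).
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class Housekeeping
{
    public static async Task ScanInstancesAsync(
        IDurableOrchestrationClient client, CancellationToken cancellationToken)
    {
        var condition = new OrchestrationStatusQueryCondition
        {
            CreatedTimeFrom = DateTime.UtcNow.AddDays(-7), // hypothetical scan window
            PageSize = 100,
        };

        do
        {
            // Fetch one page of instances and process it.
            OrchestrationStatusQueryResult page =
                await client.ListInstancesAsync(condition, cancellationToken);

            foreach (DurableOrchestrationStatus status in page.DurableOrchestrationStatus)
            {
                // ... per-instance housekeeping ...
            }

            condition.ContinuationToken = page.ContinuationToken;
        }
        while (condition.ContinuationToken != null);
    }
}
```

If one partition is unresponsive, a query like this presumably cannot complete the page that needs results from that partition, which would explain the housekeeping process stalling from the same point in time.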

sebastianburckhardt (Member) commented

We have seen situations where one or more partitions become permanently unresponsive. We are still trying to find the root cause. Our latest release, 1.4.1, contains some mitigations and additional instrumentation, so it may be worth trying it out.

Also, I can take a look at our internal telemetry if you provide me with the app name and time window.
