-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Task Hub Client Timeout Issue when directed to Specific Partition via Orchestration Id #295
Comments
@sebastianburckhardt this is happening when using a specific instance Id, which appears to be mapped (hashed?) to partition 6. |
Hi. We believe we are hitting this issue periodically in our service's production environment. We see one of the partitions become "frozen" at a point in time. After that point we see roughly 1/12 of our orchestrator instances failing (we have the default 12 partitions). We also have a periodic housekeeping process which scans orchestrator instances using
|
We have seen situations where one or more partitions become permanently unresponsive. We are still trying to find the root cause. Our latest release 1.4.1 contains some mitigations and additional instrumentation, it may be worth trying that out. Also, I can take a look at our internal telemetry if you provide me with the app name and time window. |
Hey, I recently encountered an issue when attempting to create a new orchestration using the task hub client with Netherite orchestration client. When the client is directed to a specific partition (6) via the id of the orchestration, the operation times out (I added the traces of the client below). However, when using any other partition, the operation completes successfully. I have tried restarting the machine, but the issue persists. Additionally, every time I use this partition, I experience a timeout even when attempting to retrieve the status of the orchestration.
If there is any additional information that would be helpful, please let me know.
This issue is reproducible when using specific orchestration ids.
I am using these versions:
this is the task hub parameters that I am using:
{
"TaskhubName": "MonitoringHub",
"TaskhubGuid": "6b5920e4-3b6c-4af0-942c-cae43ed2154e",
"CreationTimestamp": "2023-07-06T09:22:03.5191171Z",
"StorageFormat": "{"FormatVersion":5,"UseAlternateObjectStore":false,"PageAndSegmentSizes":[21,26,10,19],"MemorySizes":[22,20]}",
"PartitionCount": 16
}
Client.3d18f47 ClientTimer scheduled (2023-08-03T13:58:36.0000000Z,1): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer scheduled (2023-08-03T13:58:36.0000000Z,1)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 ClientTimer is awakening at -251711229983.959s: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer is awakening at -251711229983.959s"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Part06 completed ClientWorkItem 3d18f47R2 instanceId=engine-oro44ffn2507-physical2 status=Create latencyMs=205.85 producedMessages=1: [{"Key":"partition","Value":6},{"Key":"workItemType","Value":1},{"Key":"workItemId","Value":"3d18f47R2"},{"Key":"instanceId","Value":"engine-oro44ffn2507-physical2"},{"Key":"status","Value":1},{"Key":"latencyMs","Value":205.85300000000279},{"Key":"producedMessages","Value":1},{"Key":"{OriginalFormat}","Value":"Part{partition:D2} completed {workItemType}WorkItem {workItemId} instanceId={instanceId} status={status} latencyMs={latencyMs:F2} producedMessages={producedMessages}"}]
Part06 sent TaskMessage 3d18f47R2M0 eventType=ExecutionStarted taskEventId=-1 instanceId=engine-oro44ffn2507-physical2 executionId=cc9a4aa2213848c2a57e1f9dad6a4301 persistenceDelayMs= sendDelayMs=205.85: [{"Key":"partition","Value":6},{"Key":"prefix","Value"
:""},{"Key":"messageId","Value":"3d18f47R2M0"},{"Key":"eventType","Value":"ExecutionStarted"},{"Key":"taskEventId","Value":-1},{"Key":"instanceId","Value":"engine-oro44ffn2507-physical2"},{"Key":"executionId","Value":"cc9a4aa2213848c2a57e1f9dad6a4301"},{"Key":"persistenceDelayMs","Value":""},{"Key":"sendDelayMs","Value":"205.85"},{"Key":"{OriginalFormat}","Value":"Part{partition:D2}{prefix} sent TaskMessage {messageId} eventType={eventType} taskEventId={taskEventId} instanceId={instanceId} executionId={executionId} persistenceDelayMs={persistenceDelayMs} sendDelayMs={sendDelayMs}"}]
Client.3d18f47 ClientTimer is awakening at 0.0027905s: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"ClientTimer is awakening at 0.0027905s"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 starting ClientTimer batch size=1 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"starting ClientTimer batch size=1 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 firing (2023-08-03T13:58:36.0000000Z,1): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"firing (2023-08-03T13:58:36.0000000Z,1)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
Client.3d18f47 Request 3d18f47R2 for partition 06 timed out: [{"Key":"client","Value":"Client.3d18f47"},{"Key":"eventId","Value":{"Category":0,"ClientId":"3d18f478-1e5e-44f6-a086-5cd17523d139","PartitionId":0,"Number":2,"SubIndex":0,"WorkItemId":null,"Index":null}},{"Key":"partitionId","Value":6},{"Key":"{OriginalFormat}","Value":"{client} Request {eventId} for partition {partitionId:D2} timed out"}]
Client.3d18f47 completed ClientTimer batch size=0 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0): [{"Key":"client","Value":"Client.3d18f47"},{"Key":"details","Value":"completed ClientTimer batch size=0 first=(2023-08-03T13:58:36.0000000Z,1) next=(0001-01-01T00:00:00.0000000,0)"},{"Key":"{OriginalFormat}","Value":"{client} {details}"}]
The text was updated successfully, but these errors were encountered: