Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dt.CreateInstance stored proc is causing deadlocks (proposed ROLLBACK change that fixes the issue) #217

Closed
microrama opened this issue May 6, 2024 · 11 comments · Fixed by #218
Labels
bug Something isn't working P2 Priority 2

Comments

@microrama
Copy link
Contributor

Problem:
Dt.CreateInstance store proc does not ROLLBACK the transaction (line 36) when it fails to create a new orchestration since another process/thread has already created it. This happens when there are multiple processes that are trying to create orchestration instances over the same source data. This was easily recreated when multiple processes are trying to create, say 500 orchestration instances based on the data in a SQL table.

The current stored proc code throws an exception (THROW 50001, @msg, 1; ) when it encounters that another process already created the orchestration instance but it does rollback the "Begin Transaction" which is perhaps causing the deadlocks and SQL timeout exceptions in the code.

The deadlock this caused was identified using the deadlock reports from the database and the report has been provided below along with the .NET exception stack trace.

Solution:
Dt.CreateInstance stored procedure with the ROLLBACK when it detects duplicate orchestration instance before throwing the exception that fixed the deadlock issues.

…….
BEGIN TRANSACTION
DECLARE @existingStatus varchar(30) = (
SELECT TOP 1 existing.[RuntimeStatus]
FROM Instances existing WITH (HOLDLOCK)
WHERE [TaskHub] = @taskhub AND [InstanceID] = @instanceid
)
-- Instance IDs can be overwritten only if the orchestration is in a terminal state
--IF @existingStatus IN ('Pending', 'Running')
IF @existingStatus IN (SELECT value FROM STRING_SPLIT(@DedupeStatuses, ','))
BEGIN
ROLLBACK TRANSACTION;
THROW 50001, @msg, 1;
END
ELSE IF @existingStatus IS NOT NULL
……………..

Exception StackTrace
"message": [
"TaskOrchestrationDispatcher-25904bb427294445a049682c104c412a-0: Failed to fetch a work-item: Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.\r\n ---> System.ComponentModel.Win32Exception (258): The wait operation timed out.\r\n at Microsoft.Data.SqlClient.SqlCommand.<>c.b__209_0(Task1 result)\r\n at System.Threading.Tasks.ContinuationResultTaskFromResultTask2.InnerInvoke()\r\n at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)\r\n--- End of stack trace from previous location ---\r\n at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)\r\n--- End of stack trace from previous location ---\r\n at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries)\r\n at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries) in /_/src/DurableTask.SqlServer/SqlUtils.cs:line 504\r\n at DurableTask.SqlServer.SqlUtils.ExecuteSprocAndTraceAsync[T](DbCommand command, LogHelper traceHelper, String instanceId, Func2 executor) in /_/src/DurableTask.SqlServer/SqlUtils.cs:line 454\r\n at DurableTask.SqlServer.SqlOrchestrationService.LockNextTaskOrchestrationWorkItemAsync(TimeSpan receiveTimeout, CancellationToken cancellationToken) in /_/src/DurableTask.SqlServer/SqlOrchestrationService.cs:line 135\r\n at DurableTask.Core.WorkItemDispatcher1.DispatchAsync(WorkItemDispatcherContext context)\r\nClientConnectionId:22a53d4c-5250-463f-96c0-bb08165b71a7\r\nError Number:-2,State:0,Class:11"
],

Deadlock Report






DELETE FROM Instances WHERE [TaskHub] = @taskhub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs
EXEC dt.PurgeInstanceStateByID @instancesToPurg

Proc [Database Id = 12 Object Id = 1922978077]



DELETE FROM Instances WHERE [TaskHub] = @taskhub AND [InstanceID] IN (SELECT [InstanceID] FROM @InstanceIDs
EXEC dt.PurgeInstanceStateByID @instancesToPurg

Proc [Database Id = 12 Object Id = 1922978077]






















@cgillum
Copy link
Member

cgillum commented May 6, 2024

Hi @microrama, were you able to confirm whether adding ROLLBACK TRANSACTION before the THROW truly resolves the deadlocks that you encountered? I ask because I didn't expect this to be necessary since we hadn't actually changed any data at this point in the stored procedure.

@cgillum cgillum added Needs: Author Feedback Awaiting feedback from the issue author. and removed Needs: Triage 🔍 labels May 6, 2024
@microrama
Copy link
Contributor Author

microrama commented May 6, 2024 via email

@microsoft-github-policy-service microsoft-github-policy-service bot added Needs: Attention 👋 Issue needs attention from maintainers and removed Needs: Author Feedback Awaiting feedback from the issue author. labels May 6, 2024
@cgillum cgillum added bug Something isn't working P2 Priority 2 and removed Needs: Attention 👋 Issue needs attention from maintainers labels May 6, 2024
@cgillum
Copy link
Member

cgillum commented May 6, 2024

@microrama thanks for confirming! By the way, we accept pull requests from users. If you would like to submit this as a small pull request, we would happily accept it and give you appropriate credit in the Git history and release notes.

@microrama
Copy link
Contributor Author

microrama commented May 6, 2024 via email

@microrama
Copy link
Contributor Author

Chris - I submitted a pull request - #218

@cgillum
Copy link
Member

cgillum commented May 7, 2024

Thanks @microrama! The change looks good to me. Please sign the CLA and I'll go ahead and merge your PR.

@bhugot
Copy link
Contributor

bhugot commented May 7, 2024

@cgillum any chance to get a release :)

@cgillum
Copy link
Member

cgillum commented May 7, 2024

@bhugot yes, I can work on that.

@bhugot
Copy link
Contributor

bhugot commented May 7, 2024

@cgillum i found other transaction with throw not rollbacked

@cgillum
Copy link
Member

cgillum commented May 7, 2024

@bhugot feel free to submit a PR for the other case you found. I can include it in the next release.

@bhugot
Copy link
Contributor

bhugot commented May 7, 2024

@cgillum done #219

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P2 Priority 2
Projects
None yet
3 participants