Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SqlUtils retry logic does not handle DB closed connections #216

Open
michaelplavnik opened this issue May 3, 2024 · 9 comments
Open

SqlUtils retry logic does not handle DB closed connections #216

michaelplavnik opened this issue May 3, 2024 · 9 comments
Labels
bug Something isn't working P2 Priority 2

Comments

@michaelplavnik
Copy link

michaelplavnik commented May 3, 2024

SqlUtils retry logic is based on the idea that connection is live during retrying. But that is not always the case and it leads to the failure of retry loop.

Below is exception that triggered retry

Non-orchestration failure: A transient database failure occurred and will be retried. Current retry count: 0. 
Details: Microsoft.Data.SqlClient.SqlException (0x80131904): A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.) 
---> System.ComponentModel.Win32Exception (10060): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. 

at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__211_0(Task`1 result) 
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke() 
at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) 
--- End of stack trace from previous location 

And this is exception that caused retry to fail

TaskActivityDispatcher-20f9dc6540734ed199f29b59ce5ed253-0: Failed to fetch a work-item: System.InvalidOperationException: BeginExecuteReader requires an open and available Connection. The connection's current state is closed. 
    at Microsoft.Data.SqlClient.SqlCommand.<>c.<ExecuteDbDataReaderAsync>b__211_0(Task`1 result) at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) 
--- End of stack trace from previous location 

--- at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread) 
--- End of stack trace from previous location 

--- at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func`1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries)
    at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func`1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries)
    at DurableTask.SqlServer.SqlUtils.ExecuteSprocAndTraceAsync[T](DbCommand command, LogHelper traceHelper, String instanceId, Func`2 executor) 
    at DurableTask.SqlServer.SqlOrchestrationService.LockNextTaskActivityWorkItem(TimeSpan receiveTimeout, CancellationToken shutdownCancellationToken) 
    at DurableTask.Core.WorkItemDispatcher`1.DispatchAsync(WorkItemDispatcherContext context)

Filter for
--- at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread) 
--- End of stack trace from previous location 

--- at DurableTask.SqlServer.SqlUtils.WithRetry[T](Func`1 func, SprocExecutionContext context, LogHelper traceHelper, String instanceId, Int32 maxRetries) 

ClientConnectionId:26deed2f-447e-4cc4-afc9-989c09fe072e Error Number:10060,State:0,Class:20.

@michaelplavnik michaelplavnik changed the title Retry logic allows does not handle DB closed connections SqlUtils retry logic does not handle DB closed connections May 3, 2024
@cgillum cgillum added bug Something isn't working P2 Priority 2 and removed Needs: Triage 🔍 labels May 3, 2024
@microrama
Copy link
Contributor

@cgillum - Would it be OK for me and Misha to attempt to fix this issue and propose a solution?
Btw, Misha is my manager and we work together and are trying to use DTF for workflows.

@cgillum
Copy link
Member

cgillum commented May 8, 2024

@microrama yes, if you’re able to find and submit a solution for this, we would be more than happy to accept it. It would be great to have a fix for this issue.

@microrama
Copy link
Contributor

Thanks @cgillum! This certainly is a bit more complicated but we will work with you for any approaches we have and of course test locally before and after hand as well.

@microrama
Copy link
Contributor

This is what's happening:
DurableTask.SqlServer.SqlUtils.WithRetry() gets called by SqlOrchestrationService.LockNextTaskOrchestrationWorkItemAsync() method. This has the DBCommand passed to it.
The first run when the network is out, we get the SqlException which is checked by IsTransient() method and is logged in the method as traceHelper.TransientDatabaseFailure(). This is the first DurableTask.SqlServer error that we see as a warning.
Now, while the retry loop is being executed, the second time around it appears we get an System.InvalidOperationException that is not evaluated as transient by the IsTransient() method.
This exception is thrown back to the calling DurableTask.Core caller and that is logged as an error.
Subsequent calls work great because the network blip worked out.

POC:
With a simple console application, we were able to reproduce the InvalidOperationException by forcibly closing the SqlConnection before calling the ExecuteReaderAsync.
It was hard to reproduce the first transient SqlException in the VM machine where we develop.

Bottomline:
The first error is a SqlException error and is considered a transient but by the second attempt, it turns into a InvalidOperationException since the SqlConnection is closed at this point.

Fix:
Pass the DBCommand to the WithRetry() method and let the method check the DBCommand connection state and open if needed before attempting executor again. If there is a SqlException opening the connection again, existing logic will retry or return fatal error back.

Next Steps:
We will try out the code in our Sandbox environment and monitor.

@microrama
Copy link
Contributor

The code change was tested in our cert environment and it appears the issue is now fixed. We will still see the Warning but the Error is gone since we check for SqlConnection State and open the connection on the command object as needed.
This was tested in the CERT environment where on an average there is a network blip once or twice a day.
Attaching screenshots:
image

Next steps - I will create a PR for this issue for your review. Thx @cgillum!

@microrama
Copy link
Contributor

@cgillum - FYI. I created the PR few days back. Please review when you have time.
#221

@microrama
Copy link
Contributor

@cgillum - Greatly appreciate any help with reviewing the PR :-)
#221

@cgillum
Copy link
Member

cgillum commented Jun 3, 2024

@microrama taking a look now - thanks!

@microrama
Copy link
Contributor

Thanks @cgillum!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working P2 Priority 2
Projects
None yet
Development

No branches or pull requests

3 participants