-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Repeating Timers on no longer existing durable function and instance #312
Comments
Logs that say "Replayed internal UpdateEvent" are emitted when a partition is recovering. It is normal for those events to be things that happened in the past (the log being replayed is an "event source" that records all partition state updates and is replayed to recover the latest partition state). However, partitions are also supposed to checkpoint relatively often (about once a minute), at which point the log is trimmed. So, if you keep seeing those old events over and over, it suggests that something is amiss with the partition, for example it may be stuck in an infinite recovery cycle. If this is running on a consumption or premium plan I could take a look at our internal telemetry if you provide me with the application name. |
That is probably what it is. It is on a consumption plan. Tha app is called CritterCoinAPI. Please have a look
Mike Beaty
RedCritter
…________________________________
From: Sebastian Burckhardt ***@***.***>
Sent: Tuesday, October 24, 2023 5:48:56 PM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Author ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
You don't often get email from ***@***.*** Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
Logs that say "Replayed internal UpdateEvent" are emitted when a partition is recovering. It is normal for those events to be things that happened in the past (the log being replayed is an "event source" that records all partition state updates and is replayed to recover the latest partition state). However, partitions are also supposed to checkpoint relatively often (about once a minute), at which point the log is trimmed.
So, if you keep seeing those old events over and over, it suggests that something is amiss with the partition, for example it may be stuck in an infinite recovery cycle. If this is running on a consumption or premium plan I could take a look at our internal telemetry if you provide me with the application name.
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YOW5O3T5HXEPHMBNWDYBAZURAVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZYGA4TMMRUGU>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
After investigating the telemetry, I can confirm that the problem (as of right now) is that partition 5 is caught in an infinite recycle. Every time, immediately after recovering, it throws an exception
The cause is unclear, nor can I tell what the "invalid page range is". This is likely some malfunction inside FASTER, my guess is that somehow it did recover to an invalid state. To help us troubleshoot this, it would be helpful if you edit the following lines in host.json, to increase the verbosity of the storage access tracing:
Regardless, I will also try to see what happened to partition 5 right before this symptom starts appearing. |
That sounds good, I am traveling today and tomorrow but if I ca. get to this I will let you know . Anything you can do and suggestions how to fix would be great.
Mike Beaty
RedCritter
…________________________________
From: Sebastian Burckhardt ***@***.***>
Sent: Tuesday, October 24, 2023 6:51:08 PM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Author ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
You don't often get email from ***@***.*** Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
After investigating the telemetry, I can confirm that the problem (as of right now) is that partition 5 is caught in an infinite recycle. Every time, immediately after recovering, it throws an exception
Azure.RequestFailedException: The page range specified is invalid.
at Azure.Storage.Blobs.PageBlobRestClient.UploadPagesAsync(Int64 contentLength, Stream body, Byte[] transactionalContentMD5, Byte[] transactionalContentCrc64, Nullable`1 timeout, String range, String leaseId, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 ifSequenceNumberLessThanOrEqualTo, Nullable`1 ifSequenceNumberLessThan, Nullable`1 ifSequenceNumberEqualTo, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesInternal(Stream content, Int64 offset, UploadTransferValidationOptions transferValidationOverride, PageBlobRequestConditions conditions, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken)
at Azure.Storage.Blobs.Specialized.PageBlobClient.UploadPagesAsync(Stream content, Int64 offset, Byte[] transactionalContentHash, PageBlobRequestConditions conditions, IProgress`1 progressHandler, CancellationToken cancellationToken)
at DurableTask.Netherite.Faster.AzureStorageDevice.<>c__DisplayClass39_1.<<WritePortionToBlobAsync>b__0>d.MoveNext() in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/AzureStorageDevice.cs:line 454
--- End of stack trace from previous location ---
at DurableTask.Netherite.Faster.BlobManager.PerformWithRetriesAsync(SemaphoreSlim semaphore, Boolean requireLease, String name, String intent, String data, String target, Int32 expectedLatencyBound, Boolean isCritical, Func`2 operationAsync, Func`1 readETagAsync) in /_/src/DurableTask.Netherite/StorageLayer/Faster/AzureBlobs/StorageOperations.cs:line 68
The cause is unclear, nor can I tell what the "invalid page range is". This is likely some malfunction inside FASTER, my guess is that somehow it did recover to an invalid state.
To help us troubleshoot this, it would be helpful if you edit the following lines in host.json, to increase the verbosity of the storage access tracing:
{
"extensions": {
"durableTask": {
"storageProvider": {
"StorageLogLevelLimit": "Trace"
}
}
}
}
Regardless, I will also try to see what happened to partition 5 right before this symptom starts appearing.
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YKHSOUXTHUP5BVM6MLYBBA5ZAVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZYGE3TMMZZHA>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
Thanks. Since this seems to be some kind of data corruption problem, starting with a fresh task hub would probably eliminates the issue. At least temporarily, it is possible the problem comes back if the original issue reoccurs. Based on the telemetry it would appear that the original occurrence of this problem is rare... the current corruption of partition 5 happened back in September. |
I would like to avoid that since a new task hub will not have our data in it. Any other option?
Mike Beaty
RedCritter
…________________________________
From: Sebastian Burckhardt ***@***.***>
Sent: Wednesday, October 25, 2023 11:47:03 AM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Author ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
You don't often get email from ***@***.*** Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
Thanks.
Since this seems to be some kind of data corruption problem, starting with a fresh task hub would probably eliminates the issue. At least temporarily, it is possible the problem comes back if the original issue reoccurs. Based on the telemetry it would appear that the original occurrence of this problem is rare... the current corruption of partition 5 happened back in September.
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YPWOWZIEFIILH6KLGDYBEX7PAVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONZZGU3TCOJQGQ>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
I've updated host.json to log more information with "StorageLogLevelLimit": "Trace". Please have a look. I would like to retain this hub, rather that start a new one, because it has prod data in it. |
Thanks for the traces, they were helpful. I have found the cause. What happens is that the total amount of data written when taking a checkpoint is exceeding the size of the page blob (512GB). Page blobs can be resized dynamically without losing the content, so we should be able to get recover the content. I will work on a fix. Note that even after we resolve this, it may be worth considering taking some measures to keep the total amount of data stored in a task hub in check, since Netherite is (unlike Azure Storage) not equipped to handle "unlimited' amounts of data. At some point the performances starts to suffer. Here are some things to consider:
|
Interesting, can you tell if there is a certain type of data taking most of the space. The objects we are storing are probably not more than 255k or so at most. Could it be some type of logging creating this large amount of data.
Mike Beaty
RedCritter
…________________________________
From: Sebastian Burckhardt ***@***.***>
Sent: Monday, October 30, 2023 2:01:19 PM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Author ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
You don't often get email from ***@***.*** Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
Thanks for the traces, they were helpful. I have found the cause.
What happens is that the total amount of data written when taking a checkpoint is exceeding the size of the page blob (512GB).
Page blobs can be resized dynamically without losing the content, so we should be able to get recover the content. I will work on a fix.
Note that even after we resolve this, it may be worth considering taking some measures to keep the total amount of data stored in a task hub in check, since Netherite is (unlike Azure Storage) not equipped to handle "unlimited' amounts of data. At some point the performances starts to suffer.
Here are some things to consider:
1. it is usually a good idea to purge completed orchestrations after a while (e.g. see the discussion in #229<#229>)
2. If you have a significant amount of data you want to store persistently for a long time, I would consider writing such data to external storage suitable for long term storage.
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YLQ4GNHCBVV24IUKCLYB72P7AVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBVHA3DEMJXGE>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
I have implemented a fix (#329) that should unblock your partition once we can get it to you.
There is no quick way to see this type of information right now (maybe something we should consider monitoring).
Yep, that does not explain the 512GB. Based on the load information (you can also see this information in the partition table) there are only about 4.6k instances stored on this partition, so if each is 255k, then that is still less than 1.2GB.
there is no logging being stored inside the task hub. Sometimes entities accumulate a lot of space in their metadata. Also, frequent updates tend to generate more data since multiple versions are being stored. |
We are using only durable entities. The updates are actually very infrequent, because this feature is new in our system and very lightly used. I don't believe our data should be of any significant size.
Can you think of any way that we would see the breakdown per object in the partition? Wondering if the meta data is growing perhaps with every paritition recovery attempt.
Thanks
…________________________________
From: Sebastian Burckhardt ***@***.***>
Sent: Tuesday, October 31, 2023 1:43 PM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Author ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
You don't often get email from ***@***.*** Learn why this is important<https://aka.ms/LearnAboutSenderIdentification>
I have implemented a fix (#329<#329>) that should unblock your partition once we can get it to you.
can you tell if there is a certain type of data taking most of the space
There is no quick way to see this type of information right now (maybe something we should consider monitoring).
Are you using entities and/or orchestrations?
The objects we are storing are probably not more than 255k or so
Yep, that does not explain the 512GB. Based on the load information (you can also see this information in the partition table<https://microsoft.github.io/durabletask-netherite/#/ptable>) there are only about 4.6k instances stored on this partition, so if each is 255k, then that is still less than 1.2GB.
can you tell if there is a certain type of data taking most of the space
there is no logging being stored inside the task hub. Sometimes entities accumulate a lot of space in their metadata. Also, frequent updates tend to generate more data since multiple versions are being stored.
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YNK6V3MCAHPEX5HNWLYCFBFHAVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOBXG44TGMRYHE>.
You are receiving this because you authored the thread.Message ID: ***@***.***>
|
We just published a Netherite private package with @sebastianburckhardt's fix here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.1-blob-growth-fix-1 The version is
Please try it out, let us know once you do so, and whether your application appears to recover after some time. Thanks! |
Unfortunately, Netherite is failing silently during initialization with this version. I saw the same issue with trying to upgrade to 1.4 All calls to it hang indefinitely.
I have been running microsoft.azure.durabletask.netherite.azurefunctions.1.3.4
Can I get access to the project source for 1.4.1-blob-growth-fix-1 so that I can add the project to my solution and narrow down the failure in our dev environment?
Thanks
…________________________________
From: David Justo ***@***.***>
Sent: Thursday, November 2, 2023 12:42 AM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Mention ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
@rmbeaty<https://github.com/rmbeaty>:
We just published a Netherite private package with @sebastianburckhardt<https://github.com/sebastianburckhardt>'s fix here: https://durabletaskframework.visualstudio.com/Durable%20Task%20Framework%20CI/_artifacts/feed/durabletask/NuGet/Microsoft.Azure.DurableTask.Netherite/overview/1.4.1-blob-growth-fix-1
The version is 1.4.1-blob-growth-fix-1. To be able to install it, you need to add the following key to your nuget.config:
<add key="durabletask" value="https://durabletaskframework.pkgs.visualstudio.com/734e7913-2fab-4624-a174-bc57fe96f95d/_packaging/durabletask/nuget/v3/index.json" />
Please try it out, let us know once you do so, and whether your application appears to recover after some time. Thanks!
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YKXWGRMP2N7BLOLT4TYCMXGDAVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJQGEYDINBUGQ>.
You are receiving this because you were mentioned.
|
@rmbeaty: the known upgrade issues to |
Just to make sure, I did a quick test and I did not see any failures to open the task hub. So, I would need to know more details about the failure to debug it. (for the test, I created a task hub with 1.3.5 and then opened it with the version in PR #329) |
Thanks, I am looking into it. It is unclear where it is failing. Hopefully I can narrow it down.
…________________________________
From: Sebastian Burckhardt ***@***.***>
Sent: Thursday, November 2, 2023 1:09 PM
To: microsoft/durabletask-netherite ***@***.***>
Cc: Mike Beaty ***@***.***>; Mention ***@***.***>
Subject: Re: [microsoft/durabletask-netherite] Repeating Timers on no longer existing durable function and instance (Issue #312)
Just to make sure, I did a quick test and I did not see any failures to open the task hub. So, I would need to know more details about the failure to debug it.
(for the test, I created a task hub with 1.3.5 and then opened it with the version in PR #329<#329>)
—
Reply to this email directly, view it on GitHub<#312 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AF3P3YPAQSEAAHIVOH2PFULYCPOT3AVCNFSM6AAAAAA5W6KMM2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTOOJRGI4DENJWGI>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
@rmbeaty: Any updates here? We should be able to help if you can provide us any more details about the kind of failure you're seeing. Since you have an internal support case with us, you can work your support point of contact to schedule a meeting with us to help you debug, for example. I'll ask the support person on our end to see if we can set that up. |
Closing for inactivity. Feel free to reopen if needed. |
Hi,
I've am using Netherite in Azure Functions. At one point I had a durable functions that repeated on a durable timer. However that class was deleted from the source, but now I am still seeing some type of activity for it in the Log Info, subset of these below (these seem to repeat every few seconds)....
2023-10-07T13:39:28Z [Information] Part05.21148384599712 Replayed internal UpdateEvent TimerFired eventId=05T147841722 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384599712,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384601104 Replayed internal UpdateEvent TimerFired eventId=05T147841723 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384601104,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384602496 Replayed internal UpdateEvent TimerFired eventId=05T147841724 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384602496,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384603884 Replayed internal UpdateEvent TimerFired eventId=05T147841725 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384603884,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384605276 Replayed internal UpdateEvent TimerFired eventId=05T147841726 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384605276,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384606668 Replayed internal UpdateEvent TimerFired eventId=05T147841727 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384606668,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384608060 Replayed internal UpdateEvent TimerFired eventId=05T147841728 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384608060,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384609448 Replayed internal UpdateEvent TimerFired eventId=05T147841729 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384609448,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384610840 Replayed internal UpdateEvent TimerFired eventId=05T147841730 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384610840,0) latency=(0, 49094, 0)
2023-10-07T13:39:28Z [Information] Part05.21148384612232 Replayed internal UpdateEvent TimerFired eventId=05T147841731 instanceId=@durableaiimagegenerator@AIImageGenerator_0 pos=(21148384612232,0) latency=(0, 49094, 0)
Any ideas on how I can make these stop. I tried recreating the Durable Function durableaiimagegenerator and instantiated a new entity with the same ID AIImageGenerator_0 thinking it might let the events process, but no luck.
The text was updated successfully, but these errors were encountered: