Blob Storage Continually Growing using Netherite #229

Open
UMCPGrad opened this issue Mar 1, 2023 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments

UMCPGrad commented Mar 1, 2023

Our team is currently planning on switching to Netherite from the Azure Storage backend provider. In testing, we see the improved latency; however, we noticed that Blob Storage appears to grow continually over time. We had a similar issue with the Azure Storage backend provider and were hoping we would not have to deal with clearing out blob storage. Should we be running the Purge API daily to overcome this? Here is a screenshot of about 24 hours' worth of testing in our test environment.

(screenshot: blob storage capacity growing over ~24 hours of testing with Netherite)

sebastianburckhardt (Member) commented Mar 2, 2023

Indeed. By default, an orchestration instance remains in storage indefinitely, until explicitly purged using the purge API. This behavior is the same across all storage providers.

Supporting a more convenient auto-purge functionality is high on our priority list, but I cannot give you an ETA yet.

In the meantime, I would suggest using a periodic timer function that purges completed orchestrations. Below are three examples (using a 10-minute, 1-hour, or 1-day interval) of how to run such a function. Which of the three snippets you pick depends on how long you want to keep completed orchestrations in storage.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using DurableTask.Core;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Azure.WebJobs.Extensions.DurableTask;

    public static class PurgeFunctions
    {
        [FunctionName("PurgeEveryTenMinutes")]
        public static Task PurgeEveryTenMinutes(
            [DurableClient] IDurableOrchestrationClient client,
            [TimerTrigger("0 */10 * * * *")] TimerInfo myTimer)
        {
            // purge all orchestration instances that started at least 10 minutes ago
            // and are now in "Completed" state
            return client.PurgeInstanceHistoryAsync(
                DateTime.MinValue,
                DateTime.UtcNow.AddMinutes(-10),
                new List<OrchestrationStatus>
                {
                    OrchestrationStatus.Completed
                });
        }

        [FunctionName("PurgeEveryHour")]
        public static Task PurgeEveryHour(
            [DurableClient] IDurableOrchestrationClient client,
            [TimerTrigger("0 0 * * * *")] TimerInfo myTimer)
        {
            // purge all orchestration instances that started at least 1 hour ago
            // and are now in "Completed" state
            return client.PurgeInstanceHistoryAsync(
                DateTime.MinValue,
                DateTime.UtcNow.AddHours(-1),
                new List<OrchestrationStatus>
                {
                    OrchestrationStatus.Completed
                });
        }

        [FunctionName("PurgeEveryDay")]
        public static Task PurgeEveryDay(
            [DurableClient] IDurableOrchestrationClient client,
            [TimerTrigger("0 0 0 * * *")] TimerInfo myTimer)
        {
            // purge all orchestration instances that started at least 1 day ago
            // and are now in "Completed" state
            return client.PurgeInstanceHistoryAsync(
                DateTime.MinValue,
                DateTime.UtcNow.AddDays(-1),
                new List<OrchestrationStatus>
                {
                    OrchestrationStatus.Completed
                });
        }
    }

(By the way, in these examples I use the same time period for how often the purge runs and for the minimum age of an orchestration before it is purged, but you can of course use different values; see the sketch below.)
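
For example, a sketch along these lines (the function name here is just illustrative, and it assumes the same usings and class as the snippets above) runs every hour but only purges completed orchestrations that started more than 7 days ago:

        [FunctionName("PurgeWeekOldInstancesHourly")]
        public static Task PurgeWeekOldInstancesHourly(
            [DurableClient] IDurableOrchestrationClient client,
            [TimerTrigger("0 0 * * * *")] TimerInfo myTimer)
        {
            // runs every hour, but only purges completed orchestrations
            // that started more than 7 days ago
            return client.PurgeInstanceHistoryAsync(
                DateTime.MinValue,
                DateTime.UtcNow.AddDays(-7),
                new List<OrchestrationStatus>
                {
                    OrchestrationStatus.Completed
                });
        }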

sebastianburckhardt (Member) commented

The purge timer functions are now working as intended, but reported storage capacity is still much higher than expected.

I have had a detailed look at the collected telemetry (thank you @UMCPGrad for giving me the details) and was able to identify two more issues that cause larger-than-expected storage capacity billing:

  1. Soft delete needs to be disabled for the storage account. Apparently, new storage accounts have this feature turned on by default, and it is very important to turn it off when using Netherite. With soft delete enabled, the capacity of each deleted blob continues to be billed for 7 more days. Because Netherite frequently writes and deletes blobs at high volume (a characteristic of the blob-backed log devices used by FASTER), the implicit retention of blobs that comes with the soft delete feature can inflate the billed capacity tremendously (we saw more than 10x inflated capacity billing with this application). A sketch of how to disable it follows after this list.

  2. Netherite does not collect object logs as aggressively as it should. I am working on a PR to tune the corresponding FASTER parameters to improve this.
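
Regarding item 1, here is a minimal sketch of one way to disable blob soft delete programmatically (assuming the Azure.Storage.Blobs SDK and that the Netherite storage account's connection string is in the AzureWebJobsStorage environment variable; the same setting can also be changed in the Azure portal under the storage account's data protection settings):

    using System;
    using System.Threading.Tasks;
    using Azure.Storage.Blobs;
    using Azure.Storage.Blobs.Models;

    public static class DisableSoftDelete
    {
        public static async Task Main()
        {
            // assumption: the Netherite storage account's connection string
            // is available in the AzureWebJobsStorage environment variable
            string connectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
            var serviceClient = new BlobServiceClient(connectionString);

            // read the current blob service properties and turn off the
            // delete retention (soft delete) policy if it is enabled
            BlobServiceProperties properties = await serviceClient.GetPropertiesAsync();
            if (properties.DeleteRetentionPolicy?.Enabled == true)
            {
                properties.DeleteRetentionPolicy = new BlobRetentionPolicy { Enabled = false };
                await serviceClient.SetPropertiesAsync(properties);
                Console.WriteLine("Blob soft delete disabled.");
            }
            else
            {
                Console.WriteLine("Blob soft delete was already disabled.");
            }
        }
    }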
