Liveness issue when no reports are being uploaded #3427

Open
divergentdave opened this issue Oct 3, 2024 · 1 comment

@divergentdave
Collaborator

Currently, if a time interval task has some number of reports uploaded and report uploads then stop, aggregation and collection of the existing reports can get stuck, at least until clients upload more reports.

When the report uploads stop, if there are fewer unaggregated reports than min_aggregation_job_size, then the aggregation job creator will not create any aggregation jobs. Thus, these reports will remain unaggregated. If a collection job is submitted with an interval that includes any such unaggregated report, the collection job driver will not process the job until all unaggregated reports in the batch interval have been processed (and all outstanding aggregation jobs have been finished or abandoned). Taken together, this means it's possible for a collection job to get stuck, even if we have sufficient valid reports to complete it. Getting into this state depends on race conditions between the clients' uploads and the aggregation job creator. We expect that tasks using the time interval query type will typically be for continuous metrics tasks, so extended periods with zero uploaded reports may be unusual.
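To make the stuck condition concrete, here is a minimal sketch of the two checks described above. It is a toy model, not Janus code: the function name, the `(start, end)` interval tuple, and the use of plain integer timestamps are all assumptions made for illustration.

```python
def stalls_without_new_uploads(collection_interval, unaggregated_timestamps,
                               min_aggregation_job_size):
    """Toy model of the liveness issue (all names here are hypothetical).

    collection_interval: (start, end) in seconds, end exclusive.
    unaggregated_timestamps: timestamps of reports not yet in any aggregation job.
    """
    start, end = collection_interval

    # The aggregation job creator does nothing while too few reports are waiting.
    creator_idle = len(unaggregated_timestamps) < min_aggregation_job_size

    # The collection job driver waits while any unaggregated report falls
    # inside the collection job's batch interval.
    collection_blocked = any(start <= t < end for t in unaggregated_timestamps)

    # If both hold and no further uploads arrive, nothing changes the state.
    return creator_idle and collection_blocked


# Two stragglers inside the interval, and a minimum job size of 10.
print(stalls_without_new_uploads((0, 3600), [100, 200, 4000], 10))
```

The model ignores outstanding aggregation jobs, since those eventually finish or are abandoned and so do not block collection indefinitely.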

We could fix this with new heuristics or conditions that allow creating an under-sized aggregation job, though how we do so may affect both overhead (from a larger number of smaller aggregation jobs) and write contention during the ensuing aggregation. Thus, we'll want to create under-sized aggregation jobs only in limited situations.

@branlwyd
Contributor

branlwyd commented Oct 3, 2024

Implementation idea, based on off-issue discussion:

I think we'd implement it as follows: after "normal" creation of aggregation jobs, we might have a few "straggler" reports left in hand that aren't numerous enough to permit creation of another aggregation job. Check for existing collection jobs covering the time windows associated with these straggler reports, then create an aggregation job using the reports whose time windows have a collection job.
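A rough sketch of that flow, to make the idea easier to discuss. Everything here is hypothetical: the `Report` type, the function names, the `time_precision` parameter, and the assumption that collection job intervals are `(start, end)` tuples aligned to `time_precision`. None of it reflects the actual aggregation job creator code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    report_id: str
    timestamp: int  # seconds since the epoch


def time_window(report, time_precision):
    """Start of the time window (batch) that contains the report."""
    return report.timestamp - report.timestamp % time_precision


def normal_jobs(reports, min_size, max_size):
    """'Normal' creation: cut jobs while at least min_size reports remain."""
    jobs, remaining = [], list(reports)
    while len(remaining) >= min_size:
        jobs.append(remaining[:max_size])
        remaining = remaining[max_size:]
    return jobs, remaining  # remaining holds the stragglers


def straggler_job(stragglers, collection_intervals, time_precision):
    """Keep only stragglers whose time window is covered by a collection job.

    Assumes each interval is (start, end), end exclusive, aligned to
    time_precision.
    """
    return [
        r for r in stragglers
        if any(start <= time_window(r, time_precision) < end
               for start, end in collection_intervals)
    ]


reports = [Report(f"r{i}", ts)
           for i, ts in enumerate([10, 20, 30, 40, 50, 150, 250])]
jobs, stragglers = normal_jobs(reports, min_size=3, max_size=5)
extra = straggler_job(stragglers, collection_intervals=[(100, 200)],
                      time_precision=100)
if extra:
    jobs.append(extra)  # the under-sized job, created only when a collection job waits
```

In this example only the report in the window starting at 100 is picked up; the report in the window starting at 200 stays unaggregated until either more reports arrive or a collection job covers it.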

Things I'd want to think about more deeply before implementing:

  1. If we're going to create a "stragglers" agg job, maybe we want to go ahead and throw in as many reports as possible, including remaining reports for time windows that don't have a collection job, to increase the overall average agg job size. This would increase the number of batches touched by these aggregation jobs, however.
  2. Do we really just want one straggler agg job, or would multiple agg jobs be better for write contention? Creating multiple aggregation jobs, one per batch, would increase the number of aggregation jobs but reduce the number of batches touched by each aggregation job.

(These two points are in tension with one another.)
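The trade-off between the two options can be illustrated with a small comparison. This is only a sketch with made-up names and numbers: reports are reduced to integer timestamps, batches are identified by the start of their time window, and `time_precision` is an assumed parameter.

```python
from collections import defaultdict


def window(ts, time_precision):
    """Start of the time window (batch) containing a timestamp."""
    return ts - ts % time_precision


def single_job(timestamps):
    """Option 1: put every straggler into one aggregation job."""
    return [list(timestamps)] if timestamps else []


def job_per_batch(timestamps, time_precision):
    """Option 2: one aggregation job per batch (time window)."""
    by_window = defaultdict(list)
    for ts in timestamps:
        by_window[window(ts, time_precision)].append(ts)
    return [by_window[w] for w in sorted(by_window)]


def summarize(jobs, time_precision):
    """Job count, worst-case batches touched by one job, and mean job size."""
    batches = [len({window(ts, time_precision) for ts in job}) for job in jobs]
    return {
        "jobs": len(jobs),
        "max_batches_per_job": max(batches, default=0),
        "mean_job_size": sum(len(j) for j in jobs) / len(jobs) if jobs else 0.0,
    }


stragglers = [10, 20, 150, 250, 260]
one = summarize(single_job(stragglers), 100)
per_batch = summarize(job_per_batch(stragglers, 100), 100)
print(one)
print(per_batch)
```

With these inputs the single job is larger but touches three batches, while the per-batch split produces three smaller jobs that each touch one batch.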
