From ba8eb3af1f2d2c38867ac0542e9b63e3c464e729 Mon Sep 17 00:00:00 2001
From: Ameer Ghani
Date: Fri, 23 Feb 2024 11:28:57 -0500
Subject: [PATCH 1/2] Add documentation on upload metrics

---
 .../operational-metrics.md | 120 ++++++++++++++++++
 1 file changed, 120 insertions(+)
 create mode 100644 docs/product-documentation/operational-metrics.md

diff --git a/docs/product-documentation/operational-metrics.md b/docs/product-documentation/operational-metrics.md
new file mode 100644
index 0000000..0f75112
--- /dev/null
+++ b/docs/product-documentation/operational-metrics.md
@@ -0,0 +1,120 @@
+---
+toc_max_heading_level: 4
+---
+
+# Operational Metrics
+
+Observe how Divvi Up is working.
+
+## Upload Metrics
+
+Tasks provide metrics on reports uploaded to them. Upload metrics are reported
+to Divvi Up via the task's leader aggregator, and represent the leader's ability
+to process the reports.
+
+Metrics are monotonic counters that last the lifetime of the task.
+
+:::note
+
+This is a new feature. Tasks created before February 2024 will start counting
+from when the feature was implemented, rather than the lifetime of the task.
+
+:::
+
+### Successful Uploads
+
+Indicates reports successfully ingested by the leader. Reports contributing to
+this count are eligible for aggregation and collection. Use the rate of this
+counter to inform how many reports this task gets over time.
+
+Also referred to as `report_counter_success` in the Divvi Up API.
+
+### Upload Errors
+
+Aggregators count some DAP-level errors that lead to report rejection. Rejected
+reports are not processed any further.
+
+See sections below for a description and basic troubleshooting steps for each
+error type.
+
+#### Interval Collected Failure
+
+Indicates there were reports that had timestamps corresponding to time intervals
+that were already collected. This is only applicable for tasks with a query type
+of time interval.
+
+The rate of this error depends on the accuracy of the client time source, and
+how long the collector waits after a time interval has passed before collecting
+it. Use the rate to inform how long you should wait before collecting a time
+interval.
+
+Depending on how the client derives the report timestamp, it may not be possible
+to fully eliminate this error.
+
+Also referred to as `report_counter_interval_collected` in the Divvi Up API.
+
+#### Decode Failure
+
+Indicates there were reports that failed to decode from their DAP message
+representation.
+
+This is most often caused by task configuration mismatch between the server and
+client. Ensure that all client-side task parameters match those reported by
+Divvi Up.
+
+Also referred to as `report_counter_decode_failure` in the Divvi Up API.
+
+#### Decrypt Failure
+
+Indicates there were reports whose leader share could not be decrypted.
+
+This is most often caused by clients using the incorrect HPKE configuration.
+Ensure that the client is using the correct task HPKE key and HPKE keys are not
+being permanently cached.
+
+Also referred to as `report_counter_decrypt_failure` in the Divvi Up API.
+
+#### Report Expired Failure
+
+Indicates that there were reports whose timestamp was too old. Divvi Up rejects
+reports whose timestamps are more than 2 weeks in the past.
+
+Depending on how the client derives the report timestamp, it may not be possible
+to fully eliminate this error.
+
+Also referred to as `report_counter_expired` in the Divvi Up API.
+
+#### Outdated Key Failure
+
+Indicates that there were reports whose leader share was encrypted with an
+unknown or outdated HPKE key.
+
+Ensure that the client is using the correct task HPKE key and HPKE keys are not
+being permanently cached.
+
+Also referred to as `report_counter_outdated_key` in the Divvi Up API.
+
+#### Report Too Early Failure
+
+Indicates that there were reports whose timestamp was too far in the future.
+Divvi Up rejects reports whose timestamps are more than 60 seconds in the
+future.
+
+Depending on how the client derives the report timestamp, it may not be possible
+to fully eliminate this error.
+
+Also referred to as `report_counter_too_early` in the Divvi Up API.
+
+#### Task Expired Failure
+
+Indicates there were reports sent to this task after it had expired.
+
+Use this metric to monitor clients migrating off of the expired task.
+
+Also referred to as `report_counter_task_expired` in the Divvi Up API.
+
+### API Access
+
+Use `GET /tasks/{task_id}` to retrieve upload metrics from the API.
+
+

From b447b50e7a36e7c8d07565141003d08c44202193 Mon Sep 17 00:00:00 2001
From: Ameer Ghani
Date: Mon, 26 Feb 2024 12:44:52 -0500
Subject: [PATCH 2/2] PR review

---
 docs/product-documentation/operational-metrics.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/docs/product-documentation/operational-metrics.md b/docs/product-documentation/operational-metrics.md
index 0f75112..fd3d8b1 100644
--- a/docs/product-documentation/operational-metrics.md
+++ b/docs/product-documentation/operational-metrics.md
@@ -16,8 +16,8 @@ Metrics are monotonic counters that last the lifetime of the task.
 
 :::note
 
-This is a new feature. Tasks created before February 2024 will start counting
-from when the feature was implemented, rather than the lifetime of the task.
+This feature was implemented on February 24th, 2024. Tasks created before then
+start counting from that date, rather than from the start of the task.
 
 :::
 
@@ -60,7 +60,12 @@ representation.
 
 This is most often caused by task configuration mismatch between the server and
 client. Ensure that all client-side task parameters match those reported by
-Divvi Up.
+Divvi Up. In particular, check for these common configuration mistakes:
+
+- Client's DAP library is using an incorrect DAP version, e.g. library supports
+  DAP-04 when the task supports DAP-07.
+- Using the wrong function, e.g. using `sum` when the task is configured for
+  `count`.
 
 Also referred to as `report_counter_decode_failure` in the Divvi Up API.
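
The documentation added in this series describes the upload metrics as monotonic counters and suggests using their rate, so a rate has to be derived from two readings taken some time apart. The following is a minimal sketch in Python, assuming the counters have already been retrieved (for example via `GET /tasks/{task_id}`) and parsed into plain dictionaries keyed by the `report_counter_*` names. The `CounterSample` type, the `upload_rates` helper, and the sample values are hypothetical illustrations and are not part of the Divvi Up API.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One reading of a task's upload counters (hypothetical helper type)."""

    timestamp: float  # seconds since the epoch when the reading was taken
    counters: dict[str, int]  # e.g. {"report_counter_success": 1000}


def upload_rates(earlier: CounterSample, later: CounterSample) -> dict[str, float]:
    """Return the per-second rate of every counter present in both samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")

    rates: dict[str, float] = {}
    for name, value in later.counters.items():
        if name not in earlier.counters:
            continue  # counter missing from the earlier reading; skip it
        delta = value - earlier.counters[name]
        if delta < 0:
            # Counters are documented as monotonic, so a decrease is unexpected.
            raise ValueError(f"{name} decreased between samples")
        rates[name] = delta / elapsed
    return rates


# Made-up readings taken 60 seconds apart.
first = CounterSample(0.0, {"report_counter_success": 1000,
                            "report_counter_decode_failure": 10})
second = CounterSample(60.0, {"report_counter_success": 1600,
                              "report_counter_decode_failure": 40})
rates = upload_rates(first, second)
```

With these made-up readings, the success counter grows by 600 over 60 seconds and the decode-failure counter by 30, which gives the rates an operator would watch when tuning collection timing or chasing client misconfiguration.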