## Netflix 6/21/17

### Alerting

* Support triggering on one query while sending alert notifications on another:
the first discovers that a problem exists, the second provides actionable
insight quickly.
* Alert queries are often instance bound, e.g. `:group-by` instance to generate
a signal per instance. Healthy instances tend to cluster together in a mass of
color, and the bad instance(s) are visibly divergent.
* Look at `:rolling-count`, e.g. to require that an alert condition hold for
several of the last N intervals before firing (see the sketch after this list).
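
A minimal sketch of both ideas in Atlas stack language, assuming a hypothetical
`errorRate` metric and threshold (`:by` is the stack-language spelling of the
group-by mentioned above):

```
/api/v1/graph?q=
  name,errorRate,:eq,
  (,nf.node,),:by,
  0.01,:gt,
  5,:rolling-count,4,:ge
```

The `:by` yields one boolean signal per instance, and the `:rolling-count`
damping fires only when the condition has been true in at least 4 of the last
5 datapoints, so a single noisy datapoint does not page anyone.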

### ACA

* Most canaries run 1-4 hours. Extreme cases run 96 hours. Shorter windows
are recommended: a bad canary is receiving production traffic and should be
failed as soon as possible.
* ACA at Netflix is approaching limits on how many queries it is allowed to
perform against Atlas in a given window. To improve, the baseline and canary
queries could be combined into one and the results split inside ACA. It may
also be possible to use LWC streaming.
* Good canary metrics include box-level metrics and error-counting metrics
with sufficient volume to present a statistically viable comparison.
* Because a normal distribution cannot be assumed, ACA uses the Mann-Whitney
U test (see the formula after this list).
* Use canary analysis to inform alerts -- if a canary is critically failing,
even if it only lasts 4 hours, shouldn't somebody have been alerted on the
failures before?
* Good ACA criteria = U.S.E. = Utilization, Saturation, Errors (Brendan Gregg).
* Some teams base ACAs on high-dimension data stored in Druid.
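
For reference, the U statistic: rank the combined canary and baseline samples
together, let $n_i$ be the size and $R_i$ the rank sum of sample $i$; then

$$U_i = R_i - \frac{n_i(n_i + 1)}{2}, \qquad U = \min(U_1, U_2)$$

Because $U$ depends only on ranks, no assumption is made about the shape of
the underlying distributions.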

### Atlas

* Look at `:dist-max` when concerned about the absolute max in a given window.
The `max` statistic shown on a graph is the maximum value that the plot
yields, but this may be the max average value if the plot is `:avg`, for
example. `:dist-max` plots the maximum sample at each step, so `max` of
`:dist-max` is the maximum sample seen along the plot's x-axis (see the first
sketch after this list).
* A constant-time lookup function on buckets is important.
* Bucket functions lead to a mergeable quantile approximation.
* There may be a static 276-bucket histogram that yields good error bounds on
quantile approximation for the majority of use cases.
* Standard deviation calculations often exhibit high error bounds because of
cliffs:
  * A left-side cliff on payload size that represents the minimum header size.
  * A right-side cliff on latency that represents the HTTP timeout.
* For a latency timer across all endpoints in an app, the distribution can be
wildly non-normal because of differing levels of computation and I/O across
those endpoints.
* Say no to t-digests.
* Counters are not decrementable.
* Look at `:cq`, `:list`, `:each` for an easy way to tack additional criteria
onto queries from a dashboard-building app without understanding the existing
structure of the query (see the second sketch after this list).
* `:dist-avg` does the `totalTime/count` division math for you.
* An r3.2xlarge with 60GB RAM is capable of managing 2M time series over 6
hours.
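
Two sketches follow. First, the `:dist-max` distinction, assuming a
hypothetical timer named `requestLatency` (`:dist-avg` is the same
totalTime/count convenience noted above):

```
# per-step average; the legend's `max` is then the largest *average*
/api/v1/graph?q=name,requestLatency,:eq,:dist-avg

# per-step maximum recorded sample; the legend's `max` is the worst case seen
/api/v1/graph?q=name,requestLatency,:eq,:dist-max
```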
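
Second, the `:cq`/`:list`/`:each` pattern: a dashboard app can append one
suffix to whatever expressions a user supplies, scoping them all to one
cluster without parsing them (metric and cluster names are hypothetical):

```
name,requestLatency,:eq,:dist-avg,
name,errorRate,:eq,:sum,

# pop all expressions into a list, then AND the common query into each
:list,(,nf.cluster,api-prod,:eq,:cq,),:each
```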