From c38d614ce19bf78563a41288bb847bdb83beedd4 Mon Sep 17 00:00:00 2001
From: Jon Schneider
Date: Thu, 29 Jun 2017 09:12:27 -0500
Subject: [PATCH] Add notes from Netflix meeting

---
 NOTES.md | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 NOTES.md

diff --git a/NOTES.md b/NOTES.md
new file mode 100644
index 0000000000..49c6428dea
--- /dev/null
+++ b/NOTES.md
@@ -0,0 +1,54 @@
+## Netflix 6/21/17
+
+### Alerting
+
+* Support triggering on one query, sending alert notifications on another.
+One is to discover a problem exists, the other to provide actionable insight
+quickly.
+* Alert queries are often instance bound, e.g. `:group-by` instance to generate
+a signal per instance. Healthy instances will tend to cluster together in a
+mass of color, and the bad instance(s) will be visibly divergent.
+* Look at `:rolling-count`.
+
+### ACA
+
+* Most canaries run 1-4 hours. Extreme cases run 96 hours. Shorter windows
+are recommended, as bad canaries are receiving production traffic -- they
+should be failed as soon as possible.
+* ACA at Netflix is approaching limits on how many queries it is allowed to
+perform against Atlas in a given window. To improve, could combine queries
+to baseline and canary in one and split the results in ACA. Also, may be
+able to use LWC streaming?
+* Good canary metrics include box level metrics, error counting metrics with
+sufficient volume to present a statistically viable comparison.
+* Because a normal distribution cannot be assumed, ACA uses the Mann-Whitney U test.
+* Use canary analysis to inform alerts -- if a canary is critically failing,
+even if it only lasts 4 hours, shouldn't somebody have been alerted on
+failures before?
+* Good ACA criteria = U.S.E. = Utilization, Saturation, Errors (Brendan Gregg)
+* Some teams are basing ACAs on high-dimension data stored in Druid.
+
+### Atlas
+
+* Look at `:dist-max` when concerned about the absolute max in a given window.
+The `max` statistic shown on a graph is the maximum value that the plot yields,
+but this may be the max average value if the plot is `:avg`, for example.
+`:dist-max` plots the maximum sample at each step. So, `max` of `:dist-max` is
+the maximum sample seen along the plot's x-axis.
+* Constant-time lookup function on buckets is important.
+* Bucket functions lead to a mergeable quantile approximation.
+* There may be a static 276 bucket histogram that leads to good error bounds
+on quantile approximation for the majority of use cases.
+* Standard deviation calculation often exhibits high error bounds because of cliffs:
+  * Left-side cliff on payload size that represents minimum header size
+  * Right-side cliff on latency that represents HTTP timeout
+  * For a latency timer across all endpoints in an app, distribution can be
+  wildly non-normal because of different levels of computation and I/O across
+  those endpoints.
+* Say no to t-digests.
+* Counters are not decrementable.
+* Look at `:cq`, `:list`, `:each` for an easy way to tack on additional
+criteria from a dashboard-building app without understanding the existing
+structure of the query.
+* `:dist-avg` does the `totalTime/count` division math for you.
+* r3.2xlarge with 60GB RAM is capable of managing 2M time series over 6 hours.
\ No newline at end of file