Add notes from Netflix meeting
jkschneider committed Jun 29, 2017
1 parent cc8c23d commit c38d614
Showing 1 changed file with 54 additions and 0 deletions.
54 changes: 54 additions & 0 deletions NOTES.md
@@ -0,0 +1,54 @@
## Netflix 6/21/17

### Alerting

* Support triggering on one query and sending alert notifications based on
another. The first detects that a problem exists; the second provides
actionable insight quickly.
* Alert queries are often instance bound, e.g. `:group-by` instance to generate
a signal per instance. Healthy instances will tend to cluster together in a
mass of color, and the bad instance(s) will be visibly divergent.
* Look at `:rolling-count`.
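
The "divergent instance" idea above can also be checked numerically rather than by eye. The sketch below is an illustration only and was not discussed in the meeting: it assumes one value per instance has already been fetched (for example from a query grouped by instance) and flags outliers using a modified z-score based on the median absolute deviation. The function name `divergent_instances`, the instance IDs, and the 3.5 threshold are all hypothetical.

```python
from statistics import median


def divergent_instances(values_by_instance, threshold=3.5):
    """Return instance IDs whose value diverges from the healthy cluster.

    Uses a modified z-score built on the median absolute deviation (MAD),
    which is robust to the outliers we are trying to find. This is an
    assumed technique for illustration, not how Atlas alerting works.
    """
    values = list(values_by_instance.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the instances report the same value, so anything
        # that differs from it at all is treated as divergent.
        return sorted(i for i, v in values_by_instance.items() if v != med)
    # 0.6745 scales the MAD so the score is comparable to a z-score
    # when the data happens to be normally distributed.
    return sorted(
        i
        for i, v in values_by_instance.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical per-instance latency values (ms).
sample = {"i-a": 10.0, "i-b": 11.0, "i-c": 9.5, "i-d": 10.5, "i-e": 42.0}
print(divergent_instances(sample))
```

The median-based score is used here because a mean and standard deviation would themselves be pulled toward the bad instance.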

### ACA

* Most canaries run 1-4 hours. Extreme cases run 96 hours. Shorter windows
are recommended: a bad canary is receiving production traffic, so it should
be failed as soon as possible.
* ACA at Netflix is approaching limits on how many queries it is allowed to
perform against Atlas in a given window. One possible improvement: combine the
baseline and canary queries into a single query and split the results in ACA.
LWC streaming may also be an option?
* Good canary metrics include box-level metrics and error-counting metrics with
sufficient volume to support a statistically viable comparison.
* Because a normal distribution cannot be assumed, ACA uses the Mann-Whitney U test.
* Use canary analysis to inform alerts -- if a canary is critically failing,
even if it only lasts 4 hours, shouldn't somebody have been alerted on
failures before?
* Good ACA criteria = U.S.E. = Utilization, Saturation, Errors (Brendan Gregg)
* Some teams base ACAs on high-dimensional data stored in Druid.
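
To make the Mann-Whitney U point concrete, here is a minimal pure-Python sketch of the test applied to a baseline sample and a canary sample. It is not the ACA implementation: the function name `mann_whitney_u` and the sample data are hypothetical, the p-value uses the normal approximation with a tie correction and no continuity correction, and that approximation is only reasonable for samples that are not tiny.

```python
import math


def mann_whitney_u(baseline, canary):
    """Return (U statistic for the canary sample, two-sided p-value).

    Ranks the pooled samples (ties get the average rank), then compares
    the canary's rank sum with what is expected if both samples come
    from the same distribution. No normality assumption is needed.
    """
    n1, n2 = len(baseline), len(canary)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Pool the samples, tagging each value with its origin (1 = canary).
    pooled = sorted([(v, 0) for v in baseline] + [(v, 1) for v in canary])
    n = n1 + n2

    rank_sum_canary = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of equal values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        t = j - i
        tie_term += t**3 - t
        for k in range(i, j):
            if pooled[k][1] == 1:
                rank_sum_canary += avg_rank
        i = j

    u_canary = rank_sum_canary - n2 * (n2 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        # Every value is identical, so there is no evidence of a difference.
        return u_canary, 1.0

    z = (u_canary - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_canary, p_value


# Hypothetical latency samples (ms) for a baseline and a canary cluster.
u, p = mann_whitney_u([12, 14, 11, 13, 15, 12], [19, 22, 18, 21, 25, 20])
print(f"U={u}, p={p:.4f}")
```

A canary judge would compare the p-value against a chosen significance level; that threshold and how results are aggregated across metrics are outside what these notes cover.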

### Atlas

* Look at `:dist-max` when concerned about the absolute max in a given window.
The `max` statistic shown on a graph is the maximum value that the plot yields,
but this may be the max average value if the plot is `:avg`, for example.
`:dist-max` plots the maximum sample at each step. So, `max` of `:dist-max` is
the maximum sample seen along the plot's x-axis.
* Constant-time lookup function on buckets is important.
* Bucket functions lead to a mergeable quantile approximation.
* There may be a static 276-bucket histogram that leads to good error bounds
on quantile approximation for the majority of use cases.
* Standard deviation calculation often exhibits high error bounds because of cliffs:
  * Left-side cliff on payload size that represents minimum header size
  * Right-side cliff on latency that represents HTTP timeout
* For a latency timer across all endpoints in an app, distribution can be
wildly non-normal because of different levels of computation and I/O across
those endpoints.
* Say no to t-digests.
* Counters are not decrementable.
* Look at `:cq`, `:list`, `:each` for an easy way to tack on additional
criteria from a dashboard-building app without understanding the existing
structure of the query.
* `:dist-avg` does the `totalTime/count` division math for you.
* An r3.2xlarge with 60GB RAM is capable of managing 2M time series over 6 hours.
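
The difference between the `max` statistic of an average plot and the `max` of `:dist-max`, along with the `totalTime/count` division behind `:dist-avg`, can be illustrated with a few lines of arithmetic. This is a sketch on made-up data, not Atlas code: the `StepStats` type and its field names are hypothetical stand-ins for the per-step statistics a timer reports.

```python
from dataclasses import dataclass


@dataclass
class StepStats:
    """Hypothetical per-step timer statistics (one entry per plot step)."""
    total_time: float  # sum of all recorded durations in the step, seconds
    count: int         # number of recorded samples in the step
    max_sample: float  # largest single duration seen in the step, seconds


def dist_avg(step):
    """The totalTime/count division that `:dist-avg` performs for you."""
    return step.total_time / step.count


def summarize(steps):
    """Compare the max of the average line with the max of the max line."""
    active = [s for s in steps if s.count > 0]
    return {
        # What the `max` legend statistic shows on an average plot.
        "max_of_avg": max(dist_avg(s) for s in active),
        # What the `max` legend statistic shows on a `:dist-max` plot.
        "max_of_dist_max": max(s.max_sample for s in active),
    }


steps = [
    StepStats(total_time=1.0, count=10, max_sample=0.3),
    StepStats(total_time=2.0, count=10, max_sample=1.5),
    StepStats(total_time=0.5, count=10, max_sample=0.1),
]
print(summarize(steps))
```

With this data the largest per-step average is 0.2 s while the largest single sample is 1.5 s, which is why the average plot can hide the absolute maximum.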
