state label in bird_protocol_up leads to unique time serie for each state #63

Open
sgrade opened this issue Feb 18, 2022 · 1 comment
sgrade commented Feb 18, 2022

Overview

In Prometheus, every time series is uniquely identified by its metric name and its set of labels (source). So when the state label of the bird_protocol_up metric changes, a new time series is created alongside the one with the previous state. This breaks the metric: instead of one bird_protocol_up time series per BIRD protocol, we see several in parallel. And when the state changes regularly (e.g. the session flaps), each series has gaps.
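To make the identity rule concrete, here is a minimal sketch in Python. It is a toy model, not Prometheus or bird_exporter code: the `tsdb` dict, the `ingest` helper, and the label names and values (`name`, `proto`, `peer1`) are assumptions made up for illustration.

```python
def series_key(metric_name, labels):
    """Identity of a series: the metric name plus the full label set."""
    return (metric_name, frozenset(labels.items()))


# Hypothetical labels for one peer; only `state` varies between scrapes.
base = {"name": "peer1", "proto": "BGP"}

tsdb = {}  # series identity -> list of (timestamp, value) samples


def ingest(ts, metric_name, labels, value):
    """Append a sample to the series that matches name + labels exactly."""
    tsdb.setdefault(series_key(metric_name, labels), []).append((ts, value))


# Three scrapes of the same peer; the session drops once and recovers.
ingest(0, "bird_protocol_up", {**base, "state": "Established"}, 1)
ingest(15, "bird_protocol_up", {**base, "state": "Idle"}, 0)
ingest(30, "bird_protocol_up", {**base, "state": "Established"}, 1)

series_count = len(tsdb)
established = tsdb[series_key("bird_protocol_up", {**base, "state": "Established"})]
```

In this toy model one peer ends up with two series, and the "Established" series has no sample at t=15, which is the gap described above.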

How to replicate

If the BGP peer on the other side becomes unavailable, BIRD tries to reconnect and cycles through several states. In the example below, Prometheus shows three different bird_protocol_up time series for one peer. They correspond to the BGP states (state label values):

  • "Idle Socket: No route to host"
  • "Connect Socket: No route to host"
  • "Active Socket: No route to host"

All three exist in the TSDB in parallel.
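The reconnect cycle can be sketched as a small simulation. Everything numeric here is an assumption: the 5-second dwell time per state and the scrape intervals are invented for illustration and do not reflect real BIRD timers. Only the three state strings come from the example above.

```python
STATES = [
    "Idle Socket: No route to host",
    "Connect Socket: No route to host",
    "Active Socket: No route to host",
]


def state_at(t, dwell=5):
    """State of the peer at time t (seconds); `dwell` is a made-up timer."""
    return STATES[(t // dwell) % len(STATES)]


def observed_series(scrape_interval, duration=60):
    """Map each state seen at scrape time to the timestamps of its samples."""
    seen = {}
    for t in range(0, duration, scrape_interval):
        seen.setdefault(state_at(t), []).append(t)
    return seen


every_15s = observed_series(15)  # scrapes always land on the same state
every_10s = observed_series(10)  # scrapes land on all three states
```

With these assumed timers, a 15-second scrape interval only ever records one state, while a 10-second interval records all three, each as a separate series with 30-second gaps between samples.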

Problems this approach creates

  • When the protocol state changes (e.g. flaps), bird_exporter reports only the current state. At scrape time the protocol may be in one state; a second later the state is different, but Prometheus never sees it. Different combinations of scrape intervals and protocol timers therefore produce different (and odd) results in monitoring.
  • In a complex environment with thousands of peers (and thus many label values per peer), an unstable, unpredictable number of time series per protocol is difficult to manage. Idempotence is hard to achieve, and automation breaks.
  • It is difficult to tell which BGP state is current. Prometheus returns all time series (three in the example above) for a single BIRD protocol. The same happens with instant queries, because series with different states are considered unique.
  • It is difficult to count the peers for which bird_protocol_up == 0. Instead of the actual number of down peers, count returns the number of unique time series, which is not what we want. I still managed to do it using count(group by (state) {}), but IMHO this is a workaround rather than a proper solution.
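The counting problem in the last point can be illustrated with a minimal Python sketch. The sample data is made up, and the `name` label used to identify a peer is an assumption; the point is only the difference between counting series and counting peers.

```python
# Hypothetical result set: (labels, value) for every matching series.
samples = [
    ({"name": "peer1", "state": "Idle Socket: No route to host"}, 0),
    ({"name": "peer1", "state": "Connect Socket: No route to host"}, 0),
    ({"name": "peer1", "state": "Active Socket: No route to host"}, 0),
    ({"name": "peer2", "state": "Established"}, 1),
]

# Series whose value is 0, i.e. what `bird_protocol_up == 0` would select.
down = [labels for labels, value in samples if value == 0]

# Naive count: one per series, so a single down peer is counted three times.
naive_down_count = len(down)

# Count distinct peers instead by dropping the `state` label first.
down_peers = {labels["name"] for labels in down}
actual_down_count = len(down_peers)
```

Here the naive count is 3 while only one peer is down. In PromQL the equivalent idea would be to aggregate by the peer-identifying label before counting, for example count(count by (name) (bird_protocol_up == 0)), assuming a label such as name identifies the peer. This is still a workaround for the variable state label rather than a fix.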

Suggestion

Who will do it

I can implement this myself if we reach an agreement.

@sgrade changed the title from "state label in bird_protocol_up leads to unique time series for each state" to "state label in bird_protocol_up leads to unique time serie for each state" on Feb 18, 2022
@dmitry-sinina

I am facing the same problem. A variable state label in the bird_protocol_up metric doesn't look right.
