state label in bird_protocol_up leads to unique time serie for each state #63

sgrade · 2022-02-18T13:00:51Z

Overview

In Prometheus, every time series is uniquely identified by its metric name and its set of labels (source). So when the state label changes in the bird_protocol_up metric, a new time series is created alongside the one carrying the previous state. This ruins the metric: instead of one bird_protocol_up time series per BIRD protocol, we see several in parallel. And when the state changes regularly (e.g. a flapping session), each of those series has gaps.

How to replicate

If the BGP peer on the other side becomes unavailable, BIRD tries to reconnect and goes through several states. In the example below, Prometheus shows three different bird_protocol_up time series for one peer. They correspond to the BGP states (values of the state label):

"Idle Socket: No route to host"
"Connect Socket: No route to host"
"Active Socket: No route to host"

All three exist in the TSDB in parallel.
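
To illustrate why the three states end up as three series, here is a minimal Go sketch. It does not use the Prometheus client library; `seriesKey` and `countSeries` are hypothetical helpers, and the `name` label is an assumption standing in for whatever labels identify the peer. The sketch only models the rule that a series is identified by its metric name plus its full label set.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// seriesKey builds an identity string from a metric name and its labels.
// Hypothetical helper: it only models "name + full label set = one series".
func seriesKey(name string, labels map[string]string) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // label order must not affect identity
	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return name + "{" + strings.Join(pairs, ",") + "}"
}

// countSeries returns how many distinct series one peer produces while it
// passes through the given states, with or without the state label.
func countSeries(peer string, states []string, withState bool) int {
	seen := map[string]struct{}{}
	for _, s := range states {
		labels := map[string]string{"name": peer} // "name" is an assumed label
		if withState {
			labels["state"] = s
		}
		seen[seriesKey("bird_protocol_up", labels)] = struct{}{}
	}
	return len(seen)
}

func main() {
	states := []string{
		"Idle Socket: No route to host",
		"Connect Socket: No route to host",
		"Active Socket: No route to host",
	}
	fmt.Println(countSeries("peer1", states, true))  // prints 3
	fmt.Println(countSeries("peer1", states, false)) // prints 1
}
```

With the state label present, each state value yields a new key; without it, the peer maps to a single series regardless of state.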

Problems this approach creates

When the protocol state changes (e.g. flaps), bird_exporter reports only the current state. At the moment of scraping it may be in one state, and a second later in another, which we never see in Prometheus. Different combinations of scrape intervals and protocol timers produce different (and odd) results in monitoring.
In a complex environment with thousands of peers (and thus many label values per peer), an unstable, unpredictable number of series per protocol is difficult to manage. Idempotence is difficult to achieve. Automation breaks.
It is difficult to tell which BGP state is current. Prometheus returns all time series (three in the example above) for a single BIRD protocol. The same happens with instant queries, because series with different states are considered unique.
It is difficult to count the peers for which bird_protocol_up == 0. Instead of the actual number of down peers, count shows the number of unique time series, which is not what we want to see. I still managed to do it using count(group by (state) {}), but IMHO this is more a workaround than a proper solution.
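
The counting problem can be sketched in Go as follows. This is only an illustration under stated assumptions: the `sample` type, the `peer` field, and both functions are hypothetical, and the data is made up to mirror the three-state example above. It contrasts counting series with value 0 against counting distinct peers that have at least one such series.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one returned time series.
type sample struct {
	peer  string  // assumed peer identifier
	state string  // value of the state label
	value float64 // bird_protocol_up value: 0 = down, 1 = up
}

// downSeries counts series with value 0, which is what a naive count sees.
func downSeries(samples []sample) int {
	n := 0
	for _, s := range samples {
		if s.value == 0 {
			n++
		}
	}
	return n
}

// downPeers counts distinct peers having at least one series with value 0,
// i.e. it ignores the state label when grouping.
func downPeers(samples []sample) int {
	seen := map[string]struct{}{}
	for _, s := range samples {
		if s.value == 0 {
			seen[s.peer] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	samples := []sample{
		{"peer1", "Idle Socket: No route to host", 0},
		{"peer1", "Connect Socket: No route to host", 0},
		{"peer1", "Active Socket: No route to host", 0},
		{"peer2", "Established", 1},
	}
	fmt.Println(downSeries(samples)) // prints 3
	fmt.Println(downPeers(samples))  // prints 1
}
```

Grouping by peer while dropping the state dimension is the same idea as the query workaround mentioned above: it recovers the peer count, but only by undoing the extra label after the fact.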

Suggestion

Remove the state label from the bird_protocol_up metric.
Return to a separate bgp_state metric as in Add bgp state metrics #46, but make it optional (activated with a flag at startup). People who need it will have it. Others, who believe that Prometheus is only for numerical metrics, won't.

Who will do it

I can implement it myself if we agree on the approach.

dmitry-sinina · 2023-02-22T17:27:12Z

I am facing the same problem. A variable state label in the bird_protocol_up metric doesn't look right.

sgrade changed the title ~~state label in bird_protocol_up leads to unique time series for each state~~ state label in bird_protocol_up leads to unique time serie for each state Feb 18, 2022

czerwonk self-assigned this Oct 26, 2023

czerwonk added this to the 1.5 milestone Oct 26, 2023

nervous-inhuman mentioned this issue Dec 16, 2023

Feature Suggestion: Indicate disabled state of peering #99

Open
