Add Pressure Stall Information Metrics #3649

xinau · 2025-01-26T17:24:42Z

issues: #3052, #3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information, which indicate
how long some or all tasks of a cgroup v2 group have waited due to
resource congestion (cpu, memory, io). The change exposes this
information by including the PSIStats of each controller in its stats,
i.e. CPUStats.PSI, MemoryStats.PSI and DiskStats.PSI.
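
For readers unfamiliar with the raw data, here is a minimal sketch of parsing a cgroup v2 pressure file (for example `cpu.pressure`) into a structure with a `some` and a `full` entry. The type names, field names, and the `parsePressure` helper are illustrative assumptions and are not necessarily what this PR or runc uses; only the file format comes from the kernel documentation.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// PSIData holds one line ("some" or "full") of a pressure file.
// Hypothetical type for illustration only.
type PSIData struct {
	Avg10, Avg60, Avg300 float64 // share of time stalled over the window, in percent
	Total                uint64  // cumulative stall time in microseconds
}

// PSIStats groups both lines of one pressure file (hypothetical type).
type PSIStats struct {
	Some, Full PSIData
}

// parsePressure parses content of the form:
//
//	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
//	full avg10=0.00 avg60=0.00 avg300=0.00 total=0
func parsePressure(content string) (PSIStats, error) {
	var stats PSIStats
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 0 {
			continue
		}
		var d *PSIData
		switch fields[0] {
		case "some":
			d = &stats.Some
		case "full":
			d = &stats.Full
		default:
			continue // ignore line kinds this sketch does not know
		}
		for _, kv := range fields[1:] {
			key, val, ok := strings.Cut(kv, "=")
			if !ok {
				return stats, fmt.Errorf("malformed field %q", kv)
			}
			var err error
			switch key {
			case "avg10":
				d.Avg10, err = strconv.ParseFloat(val, 64)
			case "avg60":
				d.Avg60, err = strconv.ParseFloat(val, 64)
			case "avg300":
				d.Avg300, err = strconv.ParseFloat(val, 64)
			case "total":
				d.Total, err = strconv.ParseUint(val, 10, 64)
			}
			if err != nil {
				return stats, fmt.Errorf("parsing %q: %w", kv, err)
			}
		}
	}
	return stats, sc.Err()
}

func main() {
	sample := "some avg10=1.50 avg60=0.75 avg300=0.10 total=123456\n" +
		"full avg10=0.00 avg60=0.00 avg300=0.00 total=789\n"
	stats, err := parsePressure(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("some total: %dus, full total: %dus\n", stats.Some.Total, stats.Full.Total)
}
```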

The information is additionally exposed as Prometheus metrics. The
metrics follow the naming used by prometheus/node-exporter, where
stalled corresponds to the kernel's full state and waiting to some.

container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
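
As a rough sketch of the naming scheme above, the following maps the kernel's `some`/`full` kinds to the `waiting`/`stalled` metric names and converts a cumulative total to seconds. The helper names `metricName` and `microsToSeconds` are hypothetical, and the sketch assumes the total is carried in microseconds, as in the kernel's pressure files; the actual implementation in `metrics/prometheus.go` may differ.

```go
package main

import "fmt"

// metricName builds a metric name following the pattern listed above.
// kind is the kernel's PSI line kind: "some" or "full".
// Hypothetical helper for illustration only.
func metricName(resource, kind string) (string, error) {
	state := map[string]string{"some": "waiting", "full": "stalled"}[kind]
	if state == "" {
		return "", fmt.Errorf("unknown PSI kind %q", kind)
	}
	return fmt.Sprintf("container_pressure_%s_%s_seconds_total", resource, state), nil
}

// microsToSeconds converts a cumulative total in microseconds to seconds,
// the base unit expected for a *_seconds_total counter.
func microsToSeconds(total uint64) float64 {
	return float64(total) / 1e6
}

func main() {
	for _, res := range []string{"cpu", "memory", "io"} {
		for _, kind := range []string{"full", "some"} {
			name, err := metricName(res, kind)
			if err != nil {
				panic(err)
			}
			fmt.Println(name)
		}
	}
	fmt.Println(microsToSeconds(2500000))
}
```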

This change rebases the work done in #3083 and resolves the review comments on it.

Signed-off-by: Daniel Dao <[email protected]>

This adds 2 new sets of metrics:

- `psi_total`: read the total number of seconds a resource is under pressure
- `psi_avg`: read the ratio of time a resource is under pressure over a sliding time window

For more details about these definitions, see:

- https://www.kernel.org/doc/html/latest/accounting/psi.html
- https://facebookmicrosites.github.io/psi/docs/overview

Signed-off-by: Daniel Dao <[email protected]>

This adds support for reading PSI metrics via Prometheus. We expose the following for `psi_total`:

```
container_cpu_psi_total_seconds
container_memory_psi_total_seconds
container_io_psi_total_seconds
```

And for `psi_avg`:

```
container_cpu_psi_avg10_ratio
container_cpu_psi_avg60_ratio
container_cpu_psi_avg300_ratio
container_memory_psi_avg10_ratio
container_memory_psi_avg60_ratio
container_memory_psi_avg300_ratio
container_io_psi_avg10_ratio
container_io_psi_avg60_ratio
container_io_psi_avg300_ratio
```

Signed-off-by: Daniel Dao <[email protected]>

xinau · 2025-01-26T17:27:22Z

@rexagod, @SuperQ Could you please give this a review and advise me on how to get this change merged?

issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information, which indicate how long some or all tasks of a cgroup v2 group have waited due to resource congestion (cpu, memory, io). The change exposes this information by including the _PSIStats_ of each controller in its stats, i.e. _CPUStats.PSI_, _MemoryStats.PSI_ and _DiskStats.PSI_.

The information is additionally exposed as Prometheus metrics. The metrics follow the naming used by prometheus/node-exporter, where stalled corresponds to the kernel's full state and waiting to some.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <[email protected]>

cmd/go.mod

metrics/prometheus.go

SuperQ · 2025-01-26T18:29:46Z

Looking great so far, the metric names and other conventions look fine.

metrics/prometheus.go

Signed-off-by: Felix Ehrenpfort <[email protected]>

xinau · 2025-01-26T19:48:48Z

@SuperQ Thanks for the quick review. I've added the improvements.

metrics/prometheus.go

Signed-off-by: Felix Ehrenpfort <[email protected]>

xinau · 2025-01-27T08:21:16Z

@SuperQ I'm going to take another look at the CPU PSI metrics today. It seems that the CPU PSI full metric can be non-zero. I stumbled upon this while reading kubernetes/enhancements#5062.

xinau · 2025-01-27T08:40:20Z

@SuperQ I'm going to re-add the CPU full metric, as it's actively being reported by the kernel for cgroups.

* Naturally, the FULL state doesn't exist for the CPU resource at the
* system level, but exist at the cgroup level, means all non-idle tasks
* in a cgroup are delayed on the CPU resource which used by others outside
* of the cgroup or throttled by the cgroup cpu.max configuration.

See
https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/

rexagod · 2025-01-27T09:04:45Z

Thank you for your work (and investigation) on this, @xinau!

~~Not sure but after a quick look I can see we dropped container_%s_psi_avg%s_ratio here, was this intentional?~~

Ah, never mind. I believe these can be derived.

SuperQ · 2025-01-27T10:23:12Z

@rexagod Yup. With Prometheus we can derive arbitrary averages as they're just rate(container_..._total[Xm]).
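
To illustrate why the averages are derivable, here is a simplified sketch of the arithmetic behind a rate over a `*_seconds_total` counter: the increase in stalled seconds divided by the elapsed wall-clock seconds gives the fraction of time under pressure. The `sample` type and `stallRatio` function are hypothetical, and this deliberately ignores what Prometheus additionally does in `rate()`, such as extrapolation to the window boundaries and compensating for counter resets.

```go
package main

import "fmt"

// sample is one scrape of a *_seconds_total counter (hypothetical type).
type sample struct {
	unixSeconds float64 // scrape timestamp
	value       float64 // cumulative stalled seconds at that time
}

// stallRatio returns the fraction of wall-clock time spent stalled
// between two samples of the same counter.
func stallRatio(first, last sample) (float64, error) {
	elapsed := last.unixSeconds - first.unixSeconds
	if elapsed <= 0 {
		return 0, fmt.Errorf("samples must be in increasing time order")
	}
	if last.value < first.value {
		// This sketch rejects resets instead of compensating for them.
		return 0, fmt.Errorf("counter reset between samples")
	}
	return (last.value - first.value) / elapsed, nil
}

func main() {
	// 3 seconds of stall accumulated over a 60 second window.
	first := sample{unixSeconds: 0, value: 10}
	last := sample{unixSeconds: 60, value: 13}
	ratio, err := stallRatio(first, last)
	if err != nil {
		panic(err)
	}
	fmt.Printf("stalled %.1f%% of the window\n", ratio*100)
}
```

Choosing the window at query time is the design advantage here: any averaging interval can be computed from the counter, whereas the kernel's precomputed averages are fixed at 10, 60, and 300 seconds.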

Signed-off-by: Felix Ehrenpfort <[email protected]>

xinau · 2025-01-27T20:54:32Z

@rexagod, @SuperQ all good from my side now.

rexagod

@dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.

dqminh added 3 commits January 26, 2025 12:53

Replace runc with dqminh/runc for psi support

b621e78

Signed-off-by: Daniel Dao <[email protected]>

xinau mentioned this pull request Jan 26, 2025

Support for exposing PSI metrics #3083

Open

xinau force-pushed the xinau/add-psi-metrics branch from 8b41ec5 to 103b4be Compare January 26, 2025 17:30

SuperQ suggested changes Jan 26, 2025

SuperQ reviewed Jan 26, 2025

Add minor improvements to PSI metrics

94a027c

Signed-off-by: Felix Ehrenpfort <[email protected]>

SuperQ reviewed Jan 26, 2025

Use 1e6/9 instead of time for conversion

e238b08

Signed-off-by: Felix Ehrenpfort <[email protected]>

xinau requested a review from SuperQ January 26, 2025 20:34

SuperQ approved these changes Jan 26, 2025

Expose PSI metric for CPU full

20e5af2

Signed-off-by: Felix Ehrenpfort <[email protected]>

rexagod approved these changes Jan 28, 2025
