Add Pressure Stall Information Metrics #3649
base: master
Conversation
Signed-off-by: Daniel Dao <[email protected]>
This adds two new sets of metrics:

- `psi_total`: the total number of seconds a resource is under pressure
- `psi_avg`: the ratio of time a resource is under pressure over a sliding time window

For more details about these definitions, see:

- https://www.kernel.org/doc/html/latest/accounting/psi.html
- https://facebookmicrosites.github.io/psi/docs/overview

Signed-off-by: Daniel Dao <[email protected]>
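For context, the cgroup v2 pressure files described in the kernel documentation above contain a `some` line and (for memory, I/O, and, on newer kernels, CPU) a `full` line, each carrying three sliding-window averages and a cumulative `total` in microseconds. A minimal Go sketch of parsing such a file; the helper names are hypothetical and this is not the parser used in this PR:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// psiLine holds one "some" or "full" line of a pressure file, e.g.
//   some avg10=0.12 avg60=0.34 avg300=0.56 total=123456
type psiLine struct {
	Avg10, Avg60, Avg300 float64 // ratios over 10s/60s/300s windows
	TotalUsec            uint64  // cumulative stall time in microseconds
}

// parsePressureFile reads a cgroup v2 pressure file such as
// /sys/fs/cgroup/<path>/cpu.pressure and returns its lines keyed by
// "some" or "full".
func parsePressureFile(path string) (map[string]psiLine, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	out := map[string]psiLine{}
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) != 5 {
			continue // skip anything that doesn't look like a PSI line
		}
		var l psiLine
		fmt.Sscanf(fields[1], "avg10=%f", &l.Avg10)
		fmt.Sscanf(fields[2], "avg60=%f", &l.Avg60)
		fmt.Sscanf(fields[3], "avg300=%f", &l.Avg300)
		fmt.Sscanf(fields[4], "total=%d", &l.TotalUsec)
		out[fields[0]] = l
	}
	return out, s.Err()
}

func main() {
	lines, err := parsePressureFile("/sys/fs/cgroup/cpu.pressure")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("cpu some total: %d us, full total: %d us\n",
		lines["some"].TotalUsec, lines["full"].TotalUsec)
}
```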
This adds support for reading PSI metrics via Prometheus. We expose the following for `psi_total`:

```
container_cpu_psi_total_seconds
container_memory_psi_total_seconds
container_io_psi_total_seconds
```

And for `psi_avg`:

```
container_cpu_psi_avg10_ratio
container_cpu_psi_avg60_ratio
container_cpu_psi_avg300_ratio
container_memory_psi_avg10_ratio
container_memory_psi_avg60_ratio
container_memory_psi_avg300_ratio
container_io_psi_avg10_ratio
container_io_psi_avg60_ratio
container_io_psi_avg300_ratio
```

Signed-off-by: Daniel Dao <[email protected]>
issues: google#3052, google#3083, kubernetes/enhancements#4205

This change adds metrics for pressure stall information, which indicate why some or all tasks of a cgroup v2 have waited due to resource congestion (CPU, memory, I/O). The change exposes this information by including the _PSIStats_ of each controller in its stats, i.e. _CPUStats.PSI_, _MemoryStats.PSI_ and _DiskStats.PSI_.

The information is additionally exposed as Prometheus metrics. The metrics follow the naming outlined by prometheus/node-exporter, where stalled corresponds to full and waiting corresponds to some.

```
container_pressure_cpu_stalled_seconds_total
container_pressure_cpu_waiting_seconds_total
container_pressure_memory_stalled_seconds_total
container_pressure_memory_waiting_seconds_total
container_pressure_io_stalled_seconds_total
container_pressure_io_waiting_seconds_total
```

Signed-off-by: Felix Ehrenpfort <[email protected]>
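As a rough illustration of how these kernel-maintained PSI totals map onto Prometheus counters, here is a sketch using client_golang's const-metric collector API directly; the package layout, the `id` label and the `readPSI` callback are assumptions for illustration and not cAdvisor's actual metrics plumbing:

```go
package psimetrics

import "github.com/prometheus/client_golang/prometheus"

// psiCollector exposes PSI stall totals for one resource (cpu, memory or io)
// using the node-exporter style names from this PR:
// stalled = the kernel's "full" line, waiting = the "some" line.
type psiCollector struct {
	stalled *prometheus.Desc
	waiting *prometheus.Desc
	// readPSI is a stand-in for whatever returns the current totals in
	// seconds per container, e.g. derived from the parsed pressure files.
	readPSI func() (id string, stalledSec, waitingSec float64)
}

func newPSICollector(resource string, read func() (string, float64, float64)) *psiCollector {
	return &psiCollector{
		stalled: prometheus.NewDesc(
			"container_pressure_"+resource+"_stalled_seconds_total",
			"Total time in seconds all tasks were stalled on "+resource+".",
			[]string{"id"}, nil),
		waiting: prometheus.NewDesc(
			"container_pressure_"+resource+"_waiting_seconds_total",
			"Total time in seconds some tasks were stalled on "+resource+".",
			[]string{"id"}, nil),
		readPSI: read,
	}
}

func (c *psiCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.stalled
	ch <- c.waiting
}

func (c *psiCollector) Collect(ch chan<- prometheus.Metric) {
	id, stalled, waiting := c.readPSI()
	// Const metrics report the kernel-maintained counters as-is instead of
	// incrementing a client-side counter.
	ch <- prometheus.MustNewConstMetric(c.stalled, prometheus.CounterValue, stalled, id)
	ch <- prometheus.MustNewConstMetric(c.waiting, prometheus.CounterValue, waiting, id)
}
```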
force-pushed from 8b41ec5 to 103b4be
Looking great so far, the metric names and other conventions look fine.
Signed-off-by: Felix Ehrenpfort <[email protected]>
@SuperQ Thanks for the quick review. I've added the improvements.
Signed-off-by: Felix Ehrenpfort <[email protected]>
@SuperQ I'm going to take a look at the CPU PSI metrics again today. It seems that the CPU PSI full metric can be non-zero. I stumbled upon this while reading kubernetes/enhancements#5062.
@SuperQ I'm going to re-add the CPU full metric, as it's actively being reported by the kernel for cgroups.
Thank you for your work (and investigation) on this, @xinau!
Ah, never mind. I believe these can be derived.
@rexagod Yup. With Prometheus we can derive arbitrary averages as they're just …
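To spell out the derivation (my own illustration, not part of the PR): because the `*_seconds_total` metrics are monotonically increasing counters, the average pressure ratio over any window is simply the increase in stalled seconds divided by the window length, which is what a Prometheus `rate()` over these counters yields. A small Go sketch of the same arithmetic, with hypothetical sample values:

```go
package main

import (
	"fmt"
	"time"
)

// deriveAvg returns the fraction of wall-clock time spent under pressure
// between two samples of a PSI "total" counter, given in microseconds as the
// kernel reports it. Over a 60s window this approximates what avg60 tracks.
func deriveAvg(prevTotalUsec, curTotalUsec uint64, window time.Duration) float64 {
	stalled := time.Duration(curTotalUsec-prevTotalUsec) * time.Microsecond
	return stalled.Seconds() / window.Seconds()
}

func main() {
	// Example: 3,000,000us (3s) of additional stall time over a 60s window.
	fmt.Printf("avg pressure: %.3f\n", deriveAvg(10_000_000, 13_000_000, 60*time.Second))
	// Output: avg pressure: 0.050
}
```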
Signed-off-by: Felix Ehrenpfort <[email protected]>
@dims Could you please approve the pending workflow here, or ping someone who could? The patch builds on top of the original PR while additionally following the community guidelines, and looks good to go in.
issues: #3052, #3083, kubernetes/enhancements#4205
This change adds metrics for pressure stall information, which indicate
why some or all tasks of a cgroup v2 have waited due to resource
congestion (CPU, memory, I/O). The change exposes this information by
including the PSIStats of each controller in its stats, i.e.
CPUStats.PSI, MemoryStats.PSI and DiskStats.PSI.
The information is additionally exposed as Prometheus metrics. The
metrics follow the naming outlined by prometheus/node-exporter,
where stalled corresponds to full and waiting corresponds to some.
This change is a rebase of the work done in #3083, resolving the review comments made there.