hh: understand why system CPU usage is so high #22
cc @topherbullock, the thing I mentioned yesterday after standup
Another interesting fact: despite having provisioned what could be the best combination of …

Which is interesting, given that in those moments we've never gotten close to what should be our quota:
I've been wondering whether this is related to …
I don't know the specifics of overlay, but we do know one situation where overlay struggles: privileged containers. In the case of big resources with Strabo, using privileged causes overlay to perform a real copy-up into the upper dir with the correct permissions instead of using the read-only version in the lower dir.
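A minimal sketch of that copy-up behaviour, in case anyone wants to see it in isolation (paths and sizes are made up, and this assumes root on a kernel without overlay's `metacopy` feature enabled):

```bash
# Throwaway overlay mount with one big file in the read-only lower layer.
mkdir -p /tmp/ovl/{lower,upper,work,merged}
dd if=/dev/zero of=/tmp/ovl/lower/big-blob bs=1M count=512
mount -t overlay overlay \
  -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
  /tmp/ovl/merged

du -sh /tmp/ovl/upper                    # empty: reads come straight from the lower dir
chown nobody /tmp/ovl/merged/big-blob    # metadata-only change, like a privileged chown
du -sh /tmp/ovl/upper                    # the whole 512M file has now been copied up
```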
If I understand correctly, … In the case of …
It turns out that … Running …
Hey, I was taking a look at what goes on with the machine when it's in that state, and it seems that the performance thrashing is coming from the memory subsystem.
I remember reading that per-cgroup memory accounting is not free, but this seems quite expensive. Maybe there's something we're missing here. Here are some other interesting stats: at the moment we're in that state, we have a bunch of small IOPS with quite high IO completion times 🤔
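In case it helps to dig further, here's roughly how I'd inspect the memory side of the worker's cgroup on one of those nodes (cgroup v1 assumed; the cgroup path below is hypothetical):

```bash
# Adjust to wherever the worker's containers actually live in the cgroup tree.
CG=/sys/fs/cgroup/memory/garden/<container-id>

# Page cache vs. anonymous memory and fault counts for that cgroup.
grep -E '^(total_)?(cache|rss|pgfault|pgmajfault)' "$CG/memory.stat"

# System-wide paging/reclaim while the workload runs.
vmstat 1      # watch the 'sy' CPU column and si/so
sar -B 1      # pgscank/s, pgscand/s, pgsteal/s
```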
Supporting the theory that it's indeed something related to paging and not the fact that our IOPS are being throttled: we're pretty much never being throttled for reads 🤔 (the stats for throttling come from Stackdriver, i.e., from the IaaS itself). Reading https://engineering.linkedin.com/blog/2016/08/don_t-let-linux-control-groups-uncontrolled, it feels like we're hitting case …
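For completeness, the throttling picture can also be cross-checked from the node itself rather than from Stackdriver; a sketch, with a hypothetical cgroup path (cgroup v1 blkio controller):

```bash
CG=/sys/fs/cgroup/blkio/garden/<container-id>
cat "$CG/blkio.throttle.io_serviced"        # ops accounted by the throttling layer
cat "$CG/blkio.throttle.read_bps_device"    # empty output means no read bandwidth limit is set

# Per-device latency on the node, to corroborate "small IOPS, high completion time".
iostat -x 1                                 # r/s, w/s and the await columns
```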
It'd be nice to create a reproducible for this. I'll see if I can come up with something using chicken-gun.
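Until there's a proper chicken-gun scenario, a very rough way of inducing the same kind of pressure (this is a generic `stress-ng` sketch under a memory-limited cgroup, not the actual reproducible):

```bash
# Memory + disk stressor inside a 512M cgroup for two minutes.
sudo systemd-run --scope -p MemoryLimit=512M \
  stress-ng --vm 4 --vm-bytes 400M --hdd 2 --hdd-bytes 1G --timeout 120s

# In another terminal, watch system CPU and paging while it runs.
vmstat 1
```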
Maybe some good news! As a way of trying to get rid of some of the …

I'll perform some more runs tomorrow and see how it goes 👍
As reported in "Understand why system CPU usage is so high"[1], we see that the kernel spends a lot of time trying to reclaim memory for itself when a bunch of containers start running on those VMs. By adjusting the minimum amount of reserved memory for the kernel to a higher value, we let it do less work.

With the introduction of the `sysctl` chart, every time a new `worker` node gets in, the pod in the daemonset will get scheduled to it, and the parameters we want will be set on that selected set of nodes.

[1]: #22

Signed-off-by: Ciro S. Costa <[email protected]>
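For reference, the knob the commit is talking about is, as far as I understand, `vm.min_free_kbytes` (the commit doesn't name it explicitly, so treat this as an assumption); setting it by hand on a node looks roughly like this, with an illustrative value:

```bash
sysctl vm.min_free_kbytes                     # current reserved floor
sudo sysctl -w vm.min_free_kbytes=1048576     # e.g. reserve ~1GiB so kswapd wakes up earlier

# Persisting it, conceptually what the daemonset does on each worker node:
echo 'vm.min_free_kbytes = 1048576' | sudo tee /etc/sysctl.d/99-reserved-mem.conf
sudo sysctl --system
```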
Actually ... this didn't really help much 🤷♂️

It's quite unfortunate that when tracing …

Looking at https://lore.kernel.org/lkml/[email protected]/, it seems that 4.19+ ships with an improvement to the whole … That'd still not be a 100%-valid comparison, as we'd not be running on top of a GKE-managed instance type (PKS wouldn't help us here either, as the stemcells are 4.14-based, just like the Ubuntu family from GKE as well as the COS family - see https://cloud.google.com/container-optimized-os/docs/release-notes#current_active_releases).

Also, this is quite interesting:
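If it helps with that tracing, the reclaim path has dedicated tracepoints, so something along these lines should show where the time goes (a sketch; it assumes `perf` is available on the node):

```bash
# How often direct reclaim kicks in, system-wide, over 30 seconds.
sudo perf stat -a -e vmscan:mm_vmscan_direct_reclaim_begin -- sleep 30

# Capture stacks at each direct-reclaim entry to see who is paying for it.
sudo perf record -a -g -e vmscan:mm_vmscan_direct_reclaim_begin -- sleep 30
sudo perf report --stdio | head -n 50
```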
ooooh yeah, …

Knowing those, we can also prepare something with …

Side note: if this really helps, the next step will be seeing if it's possible to get …
(not completely related, but something else to keep an eye on - …)
I spent some time trying to see how a large number of …

which seems to be quite on point with the comments in the …

The reproducible consists of: …
Hey,
Looking at the CPU utilization of a node that runs the worker that got assigned the final step of the `strabo` pipeline, we can see that the CPU utilization gets to a quite interesting place:

From the perspective of the container, we can see similar results, indicating that at least a great part of that usage indeed comes from processes within that worker cgroup:

Doing a `pidstat` on the pid of `concourse worker` inside the container:

To help troubleshoot that, it might be useful to generate flamegraphs from the perspective of the host to see from the traces where it's spending all of that time.
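A sketch of one way to do that host-side capture (it assumes `perf` plus Brendan Gregg's FlameGraph scripts; the sampling frequency and duration are arbitrary):

```bash
# Sample kernel + user stacks across all CPUs for 60 seconds.
sudo perf record -F 99 -a -g -- sleep 60

# Fold the stacks and render the flamegraph.
git clone https://github.com/brendangregg/FlameGraph
sudo perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > worker-cpu.svg
```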