Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build perf: avoid thousands of needless file opens and yaml parses #11190

Conversation

isimluk
Copy link
Member

@isimluk isimluk commented Oct 10, 2023

Avoid hundred thousands of useless yaml parses on make all products

This Speeds up rhel8 build for me by 5 seconds.

Explainer

The load() function is parsing ./components/*.yml every time it is called. Unfortunately, it is being called at least once per Xccdf:Rule or Xccdf:Group in the benchmark. So for instance, with rhel9 the load() function is called 187 times. Each time it is called, it needs to parse those 150 yaml files in ./components/ directory thus it leads to 27900 useless yaml parses just per rhel9.

Before ( ~ 29s)

# time make rhel8-compile-all
[rhel8-content] generating sce/metadata.json
Built target generate-internal-rhel8-sce-metadata.json
[rhel8-content] compiling product yaml
[rhel8-content] compiling everything
Built target rhel8-compile-all

real    0m29.446s
user    0m27.508s
sys     0m1.720s

After ( ~24s)

time make rhel8-compile-all
[rhel8-content] generating sce/metadata.json
Built target generate-internal-rhel8-sce-metadata.json
[rhel8-content] compiling product yaml
[rhel8-content] compiling everything
Built target rhel8-compile-all

real    0m24.772s
user    0m23.141s
sys     0m1.485s

Avoid hundred thousands of useless yaml parses on make all products

This Speeds up rhel8 build for me by 5 seconds.

Explainer

The `load()` function is parsing ./components/*.yml every time it is called.
Unfortunately, it is being called at least once per Xccdf:Rule or Xccdf:Group in
the benchmark. So for instance, with rhel9 the `load()` function is called 187
times. Each time it is called, it needs to parse those 150 yaml files in ./components/
directory thus it leads to 27900 useless yaml parses just per rhel9.

Before ( ~ 29s)
    # time make rhel8-compile-all
    [rhel8-content] generating sce/metadata.json
    Built target generate-internal-rhel8-sce-metadata.json
    [rhel8-content] compiling product yaml
    [rhel8-content] compiling everything
    Built target rhel8-compile-all

    real    0m29.446s
    user    0m27.508s
    sys     0m1.720s

After ( ~24s)

    time make rhel8-compile-all
    [rhel8-content] generating sce/metadata.json
    Built target generate-internal-rhel8-sce-metadata.json
    [rhel8-content] compiling product yaml
    [rhel8-content] compiling everything
    Built target rhel8-compile-all

    real    0m24.772s
    user    0m23.141s
    sys     0m1.485s
@github-actions
Copy link

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment
Open in Gitpod

Oracle Linux 8 Environment
Open in Gitpod

@isimluk
Copy link
Member Author

isimluk commented Oct 10, 2023

Oh, I allowed myself to forgot how it is to support antiquated systems. Will fix the build errors later.

@isimluk isimluk force-pushed the perf-avoid-parsing-same-yaml-hundredtimes branch 3 times, most recently from 323e612 to d8f6d37 Compare October 10, 2023 17:15
@Mab879 Mab879 self-assigned this Oct 10, 2023
@Mab879 Mab879 added the Infrastructure Our content build system label Oct 10, 2023
Copy link
Member

@Mab879 Mab879 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nixing the underscore should fix that error. See https://docs.python.org/3/library/functools.html#functools.lru_cache

ssg/components.py Outdated Show resolved Hide resolved
ssg/components.py Outdated Show resolved Hide resolved
@isimluk isimluk force-pushed the perf-avoid-parsing-same-yaml-hundredtimes branch 4 times, most recently from a2f57ac to d2a9ff2 Compare October 10, 2023 19:17
@isimluk isimluk force-pushed the perf-avoid-parsing-same-yaml-hundredtimes branch from d2a9ff2 to ecdd234 Compare October 10, 2023 19:18
@codeclimate
Copy link

codeclimate bot commented Oct 10, 2023

Code Climate has analyzed commit ecdd234 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 18.7% (50% is the threshold).

This pull request will bring the total coverage in the repository to 56.9%.

View more on Code Climate.

@Mab879 Mab879 added this to the 0.1.71 milestone Oct 11, 2023
Copy link
Collaborator

@jan-cerny jan-cerny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for discovering this problem and opening this PR. I appreciate activities like this because I understand the frustration that comes from extremely long build and test times. We definitely need more of these efforts.

The proposed solution with the LRU cache is nice. But I think it's just a workaround. It only hides a mistake that we did in the implementation of the components module.

When the build systems resolves the content using compile_all.py, the compile_all.py creates an instance of ssg.builld_yaml.BuildLoader class. Moreover, and that might not be obvious, many other instances of ssg.builld_yaml.BuildLoader class are created recursively as the loader recurses into sub-directories of the given benchmark root directory. I don't know the details why it's designed this way but @matejak can shed some light on it. But, see how it works at

content/ssg/build_yaml.py

Lines 1268 to 1275 in 5c06539

def _recurse_into_subdirs(self):
for subdir in self.subdirectories:
loader = self._get_new_loader()
loader.parent_group = self.loaded_group
loader.process_directory_tree(subdir)
self.all_values.update(loader.all_values)
self.all_rules.update(loader.all_rules)
self.all_groups.update(loader.all_groups)

When this method iterates over the directories, for each child directory it creates a new instance of the ssg.builld_yaml.BuildLoader by the ssg.builld_yaml.BuildLoader._get_new_loader() method. This method should ensure that the component data and other data won't be unnecessarily loaded again but will be just referenced from the parent ssg.builld_yaml.BuildLoader that handles the parent directory.

Look here:

content/ssg/build_yaml.py

Lines 1391 to 1392 in 5c06539

# Do it this way so we only have to parse the component metadata once.
loader.rule_to_components = self.rule_to_components

We can see that it doesn't work the way the comment says. I think that the problem is the way how the ssg.builld_yaml.BuildLoader._load_components() method is called in the ctor of ssg.builld_yaml.BuildLoader. Also, the ctor is called before the initialization of the rule_to_components attribute. Also, there is no code in this ctor or elsewhere assuring that it won't load the components data if they were already loaded. And that is a mistake.

I think that in order to fix this problem we need to change the way how the ssg.builld_yaml.BuildLoader is initialized and how component data are loaded.

I tried to do the following change instead of your change:

diff --git a/build-scripts/compile_all.py b/build-scripts/compile_all.py
index 90d4b4d774..76ee26032f 100644
--- a/build-scripts/compile_all.py
+++ b/build-scripts/compile_all.py
@@ -133,6 +133,7 @@ def main():
 
     loader = ssg.build_yaml.BuildLoader(
         None, env_yaml, product_cpes, args.sce_metadata, args.stig_references)
+    loader.load_components()
     load_benchmark_source_data_from_directory_tree(loader, env_yaml, product_yaml)
 
     project_root_abspath = os.path.abspath(args.project_root)
diff --git a/ssg/build_yaml.py b/ssg/build_yaml.py
index c4ef458103..67803752b6 100644
--- a/ssg/build_yaml.py
+++ b/ssg/build_yaml.py
@@ -1331,9 +1331,9 @@ class BuildLoader(DirectoryLoader):
         if stig_reference_path:
             self.stig_references = ssg.build_stig.map_versions_to_rule_ids(stig_reference_path)
         self.components_dir = None
-        self.rule_to_components = self._load_components()
+        self.rule_to_components = None
 
-    def _load_components(self):
+    def load_components(self):
         if "components_root" not in self.env_yaml:
             return None
         product_dir = self.env_yaml["product_dir"]
@@ -1341,9 +1341,8 @@ class BuildLoader(DirectoryLoader):
         self.components_dir = os.path.abspath(
             os.path.join(product_dir, components_root))
         components = ssg.components.load(self.components_dir)
-        rule_to_components = ssg.components.rule_component_mapping(
+        self.rule_to_components = ssg.components.rule_component_mapping(
             components)
-        return rule_to_components
 
     def _process_values(self):
         for value_yaml in self.value_files:

In my patch, we won't call _load_components in the ctor, which will leave the opportunity to _get_new_loader to swap the data as the comment said.

What do you think?

@matejak
Copy link
Member

matejak commented Oct 11, 2023

The design of the build system is organic, it evolved gradually by iterating expansions of the functionality and refactors. Most of the time, there is no strong reason why things are exactly like that. And although it is not optimal, it works, and should be replaced only by something better, i.e. more expressive and maintainable. In this particular case, the top-level loader should be different, to behave slightly differently than those recursively spawning ones.
However, there is no way that we accept PRs that increase the technical debt for performance reasons. Reimplementation of LRU cache for Python2 is a clear example (a wrapper that does nothing would be a better choice), but as Jan already said, sweeping design issues under the rug by introducing caching is also a no-go.

@isimluk isimluk closed this Oct 11, 2023
@isimluk isimluk deleted the perf-avoid-parsing-same-yaml-hundredtimes branch October 11, 2023 10:43
jan-cerny added a commit to jan-cerny/scap-security-guide that referenced this pull request Oct 12, 2023
As discovered in ComplianceAsCode#11190,
the `components.load()` function is called many times during a build
of product content. However, the component files need to be loaded only
once.

When the build systems resolves the content using `compile_all.py`, the
`compile_all.py` creates an instance of `ssg.builld_yaml.BuildLoader`
class.  Moreover, many other instances of `ssg.builld_yaml.BuildLoader`
class are created recursively as the loader recurses into
sub-directories of the given benchmark root directory.  When we iterate
over the directories, for each child directory the script creates a new
instance of the `ssg.build_yaml.BuildLoader` by the
`ssg.builld_yaml.BuildLoader._get_new_loader()` method. This method
should ensure that the component data and other data won't be
unnecessarily loaded again but will be just referenced from the parent
`ssg.builld_yaml.BuildLoader` that handles the parent directory.

The problem is the way how the
`ssg.builld_yaml.BuildLoader._load_components()` method is called in the
ctor of `ssg.builld_yaml.BuildLoader`. The ctor is called before
the initialization of the `rule_to_components` attribute. Also, there is
no code in this ctor or elsewhere assuring that it won't load the
components data if they were already loaded.

In this commit, we won't call `_load_components` in the ctor, which will
leave the opportunity to `_get_new_loader` to swap the data as the comment
said.

This should bring a speed up about 5 seconds.
@Mab879 Mab879 removed this from the 0.1.71 milestone Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Infrastructure Our content build system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants