-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
build perf: avoid thousands of needless file opens and yaml parses #11190
build perf: avoid thousands of needless file opens and yaml parses #11190
Conversation
Avoid hundred thousands of useless yaml parses on make all products This Speeds up rhel8 build for me by 5 seconds. Explainer The `load()` function is parsing ./components/*.yml every time it is called. Unfortunately, it is being called at least once per Xccdf:Rule or Xccdf:Group in the benchmark. So for instance, with rhel9 the `load()` function is called 187 times. Each time it is called, it needs to parse those 150 yaml files in ./components/ directory thus it leads to 27900 useless yaml parses just per rhel9. Before ( ~ 29s) # time make rhel8-compile-all [rhel8-content] generating sce/metadata.json Built target generate-internal-rhel8-sce-metadata.json [rhel8-content] compiling product yaml [rhel8-content] compiling everything Built target rhel8-compile-all real 0m29.446s user 0m27.508s sys 0m1.720s After ( ~24s) time make rhel8-compile-all [rhel8-content] generating sce/metadata.json Built target generate-internal-rhel8-sce-metadata.json [rhel8-content] compiling product yaml [rhel8-content] compiling everything Built target rhel8-compile-all real 0m24.772s user 0m23.141s sys 0m1.485s
Oh, I allowed myself to forgot how it is to support antiquated systems. Will fix the build errors later. |
323e612
to
d8f6d37
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nixing the underscore should fix that error. See https://docs.python.org/3/library/functools.html#functools.lru_cache
a2f57ac
to
d2a9ff2
Compare
d2a9ff2
to
ecdd234
Compare
Code Climate has analyzed commit ecdd234 and detected 0 issues on this pull request. The test coverage on the diff in this pull request is 18.7% (50% is the threshold). This pull request will bring the total coverage in the repository to 56.9%. View more on Code Climate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for discovering this problem and opening this PR. I appreciate activities like this because I understand the frustration that comes from extremely long build and test times. We definitely need more of these efforts.
The proposed solution with the LRU cache is nice. But I think it's just a workaround. It only hides a mistake that we did in the implementation of the components
module.
When the build systems resolves the content using compile_all.py
, the compile_all.py
creates an instance of ssg.builld_yaml.BuildLoader
class. Moreover, and that might not be obvious, many other instances of ssg.builld_yaml.BuildLoader
class are created recursively as the loader recurses into sub-directories of the given benchmark root directory. I don't know the details why it's designed this way but @matejak can shed some light on it. But, see how it works at
Lines 1268 to 1275 in 5c06539
def _recurse_into_subdirs(self): | |
for subdir in self.subdirectories: | |
loader = self._get_new_loader() | |
loader.parent_group = self.loaded_group | |
loader.process_directory_tree(subdir) | |
self.all_values.update(loader.all_values) | |
self.all_rules.update(loader.all_rules) | |
self.all_groups.update(loader.all_groups) |
When this method iterates over the directories, for each child directory it creates a new instance of the ssg.builld_yaml.BuildLoader
by the ssg.builld_yaml.BuildLoader._get_new_loader()
method. This method should ensure that the component data and other data won't be unnecessarily loaded again but will be just referenced from the parent ssg.builld_yaml.BuildLoader
that handles the parent directory.
Look here:
Lines 1391 to 1392 in 5c06539
# Do it this way so we only have to parse the component metadata once. | |
loader.rule_to_components = self.rule_to_components |
We can see that it doesn't work the way the comment says. I think that the problem is the way how the ssg.builld_yaml.BuildLoader._load_components()
method is called in the ctor of ssg.builld_yaml.BuildLoader
. Also, the ctor is called before the initialization of the rule_to_components
attribute. Also, there is no code in this ctor or elsewhere assuring that it won't load the components data if they were already loaded. And that is a mistake.
I think that in order to fix this problem we need to change the way how the ssg.builld_yaml.BuildLoader
is initialized and how component data are loaded.
I tried to do the following change instead of your change:
diff --git a/build-scripts/compile_all.py b/build-scripts/compile_all.py
index 90d4b4d774..76ee26032f 100644
--- a/build-scripts/compile_all.py
+++ b/build-scripts/compile_all.py
@@ -133,6 +133,7 @@ def main():
loader = ssg.build_yaml.BuildLoader(
None, env_yaml, product_cpes, args.sce_metadata, args.stig_references)
+ loader.load_components()
load_benchmark_source_data_from_directory_tree(loader, env_yaml, product_yaml)
project_root_abspath = os.path.abspath(args.project_root)
diff --git a/ssg/build_yaml.py b/ssg/build_yaml.py
index c4ef458103..67803752b6 100644
--- a/ssg/build_yaml.py
+++ b/ssg/build_yaml.py
@@ -1331,9 +1331,9 @@ class BuildLoader(DirectoryLoader):
if stig_reference_path:
self.stig_references = ssg.build_stig.map_versions_to_rule_ids(stig_reference_path)
self.components_dir = None
- self.rule_to_components = self._load_components()
+ self.rule_to_components = None
- def _load_components(self):
+ def load_components(self):
if "components_root" not in self.env_yaml:
return None
product_dir = self.env_yaml["product_dir"]
@@ -1341,9 +1341,8 @@ class BuildLoader(DirectoryLoader):
self.components_dir = os.path.abspath(
os.path.join(product_dir, components_root))
components = ssg.components.load(self.components_dir)
- rule_to_components = ssg.components.rule_component_mapping(
+ self.rule_to_components = ssg.components.rule_component_mapping(
components)
- return rule_to_components
def _process_values(self):
for value_yaml in self.value_files:
In my patch, we won't call _load_components
in the ctor, which will leave the opportunity to _get_new_loader
to swap the data as the comment said.
What do you think?
The design of the build system is organic, it evolved gradually by iterating expansions of the functionality and refactors. Most of the time, there is no strong reason why things are exactly like that. And although it is not optimal, it works, and should be replaced only by something better, i.e. more expressive and maintainable. In this particular case, the top-level loader should be different, to behave slightly differently than those recursively spawning ones. |
As discovered in ComplianceAsCode#11190, the `components.load()` function is called many times during a build of product content. However, the component files need to be loaded only once. When the build systems resolves the content using `compile_all.py`, the `compile_all.py` creates an instance of `ssg.builld_yaml.BuildLoader` class. Moreover, many other instances of `ssg.builld_yaml.BuildLoader` class are created recursively as the loader recurses into sub-directories of the given benchmark root directory. When we iterate over the directories, for each child directory the script creates a new instance of the `ssg.build_yaml.BuildLoader` by the `ssg.builld_yaml.BuildLoader._get_new_loader()` method. This method should ensure that the component data and other data won't be unnecessarily loaded again but will be just referenced from the parent `ssg.builld_yaml.BuildLoader` that handles the parent directory. The problem is the way how the `ssg.builld_yaml.BuildLoader._load_components()` method is called in the ctor of `ssg.builld_yaml.BuildLoader`. The ctor is called before the initialization of the `rule_to_components` attribute. Also, there is no code in this ctor or elsewhere assuring that it won't load the components data if they were already loaded. In this commit, we won't call `_load_components` in the ctor, which will leave the opportunity to `_get_new_loader` to swap the data as the comment said. This should bring a speed up about 5 seconds.
Avoid hundred thousands of useless yaml parses on make all products
This Speeds up rhel8 build for me by 5 seconds.
Explainer
The
load()
function is parsing ./components/*.yml every time it is called. Unfortunately, it is being called at least once per Xccdf:Rule or Xccdf:Group in the benchmark. So for instance, with rhel9 theload()
function is called 187 times. Each time it is called, it needs to parse those 150 yaml files in ./components/ directory thus it leads to 27900 useless yaml parses just per rhel9.Before ( ~ 29s)
After ( ~24s)