Avoid monolithic state file #81
base: develop
Conversation
In my opinion, synpp needs a huge refactoring to make it clearer, in particular splitting pipeline.py into several files. I tried to do so in one of the branches of my fork: https://github.com/ainar/synpp/tree/refactor
In the current PR, I could not help but refactor by renaming "hash" (which is a standard Python function!) to "stage_id" when it represents the unique name of the stage-configuration combination, or to "digest" when it is truly a hash digest.
Ok, thanks! Let's discuss this in our call.
I cleaned up the PR by undoing the renamings to make my proposal more transparent.
Thanks for that. I'll give it a try on a full-scale pipeline before merging.
Can you quickly explain why you keep the source code of the files everywhere in the code? Why wouldn't it work with just hashing the source code in the first place?
I attached to each stage a hash digest of the concatenation of its source and the sources of all its dependencies. I agree I could hash the (concatenation of the) source digests instead of hashing (the concatenation of) the sources directly; then I would only store source digests instead of sources. My reasons for not doing that are not well established and are purely a matter of personal preference:
So I think both approaches are comparable. Do you have a preference?
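For illustration only, a minimal sketch of the two options being discussed (the helper names and the choice of md5 here are hypothetical, not taken from the PR):

import hashlib

def digest(text):
    # Hash digest of a single source string.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Option A: hash the concatenation of the sources themselves
# (requires keeping the sources around).
def stage_digest_from_sources(sources):
    return digest("".join(sources))

# Option B: hash the concatenation of the per-source digests
# (only the digests need to be stored, not the sources).
def stage_digest_from_digests(sources):
    return digest("".join(digest(source) for source in sources))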
I would prefer using hashes; this would also reduce the amount of code changes in this PR.
Done!
The PR is synced again. 🙂
Check the dependency only if it is not stale (it will be devalidated at "Devalidate descendants of devalidated stages").
can we make this test better?
I just fixed 2 bugs in my implementation and added the corresponding tests:
# Get current validation tokens
current_validation_tokens = {
    stage_id: str(
        registry[stage_id]["wrapper"].validate(
            ValidateContext(
                registry[stage_id]["config"],
                get_cache_directory_path(working_directory, stage_id)
            )
        )
    )
    for stage_id in sorted_hashes
}

# Cache mapper between stage id and cache id.
cache_ids = {
    stage_id: get_cache_prefix(stage_id, source_digests[stage_id])
        + "__" + str(current_validation_tokens[stage_id])
    for stage_id in sorted_hashes
}
Why do we need this "__" + validation_tokens? Can't this just be included in the hash, like hashing the get_cache_prefix(...) and the validation tokens together?
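For illustration, a sketch of what that could look like (combined_cache_id and the use of md5 are hypothetical; get_cache_prefix and the validation tokens come from the diff above):

import hashlib

def combined_cache_id(stage_id, source_digest, validation_token):
    # Hash the cache prefix together with the validation token, so the token
    # never appears verbatim (spaces and all) in the cache file name.
    payload = get_cache_prefix(stage_id, source_digest) + str(validation_token)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()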
Hi, sorry for the long delay. I tested it in a few scenarios and it looks good. It's nice to not have the monolithic file anymore. I marked one part in the code: why do we need these explicit validation tokens added to the cache folders / files? Currently this creates some strange file names with spaces (here for IDF):
Could this cause problems on a different OS?
In line with #45, I refactored some things to make the whole framework run without the pipeline.json file. It passes all the tests but needs some review and real-world tests.
There is still no handling of several concurrent pipelines that share the same stages. The idea would be to avoid computing the same stage twice, even when pipelines are launched simultaneously.
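For illustration only (this is not implemented in the PR), one possible approach would be a per-stage lock file next to the stage's cache, using only the standard library; all names below are hypothetical:

import errno
import os
import time

def acquire_stage_lock(cache_path, timeout=3600.0, poll_interval=1.0):
    # Create the lock file atomically; if another pipeline already holds it,
    # wait until that pipeline has finished computing the stage.
    lock_path = cache_path + ".lock"
    start = time.time()
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return lock_path
        except OSError as exception:
            if exception.errno != errno.EEXIST:
                raise
            if time.time() - start > timeout:
                raise TimeoutError("Timed out waiting for stage lock: " + lock_path)
            time.sleep(poll_interval)

def release_stage_lock(lock_path):
    # Remove the lock file so other pipelines can proceed (or reuse the cache).
    os.remove(lock_path)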