
Avoid monolithic state file #81

Open · wants to merge 22 commits into base: develop

Conversation

@ainar (Contributor) commented Feb 24, 2023

In line with #45, I refactored some things to make the whole framework run without the pipeline.json file. It passes all the tests but needs some review and real-world tests.

There is still no management of several concurrent pipelines that share the same stages; the idea would be to avoid computing the same stage twice, even when pipelines are launched simultaneously.

@ainar (Contributor, Author) commented Feb 24, 2023

In my opinion, synpp needs a major refactoring to make it clearer; in particular, pipeline.py should be split into several files. I tried to do so in one of the branches of my fork: https://github.com/ainar/synpp/tree/refactor

In the current PR, I could not help renaming "hash" (which shadows a built-in Python function!) to "stage_id" where it denotes the unique name of a stage-configuration combination, and to "digest" where it is truly a hash digest.
I also created some getters to clean up the file.
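
For illustration, a minimal sketch of the distinction (the helper names are hypothetical, not the actual synpp API):

import hashlib
import json

def make_stage_id(stage_name, config):
    # "stage_id": the unique name of the combination of a stage and its configuration.
    return stage_name + "__" + json.dumps(config, sort_keys=True)

def make_digest(source_code):
    # "digest": an actual hash digest, here of a stage's source code.
    return hashlib.md5(source_code.encode("utf-8")).hexdigest()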

@sebhoerl (Contributor) commented:

Ok, thanks! Let's discuss this in our call.

@ainar (Contributor, Author) commented Mar 22, 2023

I cleaned things up by undoing the renamings to make my proposal more transparent.
I ran into an odd bug where test_devalidate_by_parent failed randomly. It seems that file IO is slower than the sequence of re-runs during the tests, which is an issue for this test because I based parent devalidation on the cache files' timestamps. I slowed the test down by adding some sleep.
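
The underlying timing issue, in a self-contained form (illustrative only, not the actual test code):

import os
import tempfile
import time

# Parent devalidation compares cache file modification times. Two cache files
# written back-to-back can end up with nearly identical timestamps, so the
# comparison can go either way; sleeping between the writes makes the ordering
# unambiguous, which is what the added sleep in the test achieves.
with tempfile.TemporaryDirectory() as directory:
    child = os.path.join(directory, "child_cache.p")
    parent = os.path.join(directory, "parent_cache.p")

    open(child, "w").close()
    time.sleep(1)  # without this, the two timestamps may be indistinguishable
    open(parent, "w").close()

    assert os.path.getmtime(parent) > os.path.getmtime(child)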

@sebhoerl (Contributor) commented:

Thanks for that. I'll give it a try on a full-scale pipeline before merging.

@sebhoerl (Contributor) commented:

Can you quickly explain why you keep the source code of the files everywhere in the code? Why wouldn't it work with just hashing the source code in the first place?

@ainar (Contributor, Author) commented Apr 18, 2023

> Can you quickly explain why you keep the source code of the files everywhere in the code? Why wouldn't it work with just hashing the source code in the first place?

I attach to each stage a hash digest of the concatenation of its source and the sources of all its dependencies. I agree I could hash the (concatenation of the) source digests instead of hashing the (concatenation of the) sources directly; then I would only need to store source digests instead of sources.

My reasons for not doing so are not well established and are mostly personal preference:

  • I wanted to hash only source code and avoid hashing digests (my intuition is that this is more robust against collisions, but it is probably not essential),
  • storing all the sources in memory is not a significant memory burden on a modern computer.

So I think both approaches are comparable. Do you have a preference?
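
As a rough illustration of the two alternatives (hypothetical helper names, md5 chosen arbitrarily):

import hashlib

def digest(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Option A (what the PR currently does): hash the concatenation of the sources
# directly, which requires keeping all sources around.
def stage_digest_from_sources(stage_source, dependency_sources):
    return digest(stage_source + "".join(dependency_sources))

# Option B (suggested): hash each source once, then hash the concatenation of
# the per-source digests; only the digests need to be stored.
def stage_digest_from_digests(stage_source, dependency_sources):
    parts = [digest(stage_source)] + [digest(s) for s in dependency_sources]
    return digest("".join(parts))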

@sebhoerl (Contributor) commented:

I would prefer using hashes; this would also reduce the amount of code changes in this PR.

@ainar (Contributor, Author) commented Apr 18, 2023

Done!
It seems the PR is no longer synchronized with my branch; it is more than one commit behind.

@ainar (Contributor, Author) commented Apr 28, 2023

The PR is synced again. 🙂
I tested it with https://github.com/eqasim-org/ile-de-france, and it seems to run flawlessly!

@ainar (Contributor, Author) commented May 17, 2023

I just fixed 2 bugs in my implementation and added the corresponding tests:

  1. I realized that an ephemeral stage is not systematically devalidated if a parent has a valid cache (I was devalidating it systematically before). Separating the code into a "with working directory" path and a "without working directory" path helped me understand the original behavior and match it in my PR.
  2. I had introduced a bug when creating the hash digest of a stage: since a set is unordered, the digests were not always concatenated in the same order, so stages were randomly devalidated on each run. I fixed it by sorting the stage names lexically before hashing (see the sketch below). In the test, I launch a system command with os.system to check that two consecutive runs are "stable"; there may be a better way to test this.
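
The fix for the second point boils down to something like this (an illustrative sketch, not the literal code from the PR):

import hashlib

def stage_digest(stage_source, dependency_digests):
    # dependency_digests: mapping from stage name to source digest. Sorting the
    # names lexically makes the combined digest deterministic; iterating over an
    # unordered set would make it change from run to run and devalidate stages
    # at random.
    parts = [stage_source] + [dependency_digests[name] for name in sorted(dependency_digests)]
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()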

Comment on lines +680 to +691
# Get current validation tokens
current_validation_tokens = {
    stage_id: str(
        registry[stage_id]["wrapper"].validate(
            ValidateContext(registry[stage_id]["config"], get_cache_directory_path(working_directory, stage_id))
        )
    )
    for stage_id in sorted_hashes
}

# Cache mapper between stage id and cache id.
cache_ids = {stage_id: get_cache_prefix(stage_id, source_digests[stage_id]) + "__" + str(current_validation_tokens[stage_id]) for stage_id in sorted_hashes}

Why do we need this "__" + validation_tokens? Can't it just be included in the hash, e.g. by hashing the get_cache_prefix(...) result and the validation token together?


Hi, sorry for the long delay. I tested it with certain scenarios and it looks good. It's nice not to have the monolithic file anymore. I marked one part in the code: why do we need these explicit validation tokens added to the cache folders/files? Currently this creates some strange file names with spaces (here for IDF):

[screenshot of the IDF cache directory showing file names with embedded validation tokens and spaces]

This could probably cause problems on other operating systems?
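
A rough sketch of the suggested alternative (hypothetical helper, just to illustrate the idea): instead of appending the raw validation token to the cache name, the prefix and the token could be hashed together so that the resulting cache id is a plain hex digest without spaces.

import hashlib

def make_cache_id(cache_prefix, validation_token):
    # Fold the validation token into the hash instead of appending it verbatim, so
    # the cache folder name contains no spaces or other OS-unfriendly characters.
    combined = cache_prefix + "__" + str(validation_token)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()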
