
Avoid monolithic state file #81

Open · wants to merge 22 commits into base: develop

Conversation

@ainar (Contributor) commented Feb 24, 2023

In line with #45, I refactored some things to make the whole framework run without the pipeline.json file. It passes all the tests but needs some review and real-world tests.

There is still no management of several concurrent pipelines that share the same stages; the idea would be to avoid computing the same stage twice, even when pipelines are launched simultaneously.

@ainar (Contributor, Author) commented Feb 24, 2023

In my opinion, synpp needs a major refactoring to make it clearer; in particular, pipeline.py should be split into several files. I tried to do so in one of the branches of my fork: https://github.com/ainar/synpp/tree/refactor

In the current PR, I could not help renaming "hash" (which shadows a built-in Python function!) to "stage_id" where it denotes the unique name of a stage-configuration combination, and to "digest" where it is truly a hash digest.
I also created some getters to clean up the file.
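
For illustration, a minimal sketch of the distinction (the helper names are hypothetical, not the actual synpp API):

import hashlib
import json

def make_stage_id(stage_name, config):
    # "stage_id": the unique name of the combination of a stage and its configuration.
    return stage_name + "__" + json.dumps(config, sort_keys=True)

def make_digest(source_code):
    # "digest": an actual hash digest, here of a stage's source code.
    return hashlib.md5(source_code.encode("utf-8")).hexdigest()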

@sebhoerl (Contributor) commented:

Ok, thanks! Let's discuss this in our call.

@ainar (Contributor, Author) commented Mar 22, 2023

I cleaned things up by undoing the renamings to make my proposal more transparent.
I ran into an odd bug where test_devalidate_by_parent failed randomly. It seems that file IO is slower than the sequence of re-runs during the tests, which is an issue for this test because I based parent devalidation on the cache files' timestamps. I slowed the test down by adding some sleep.
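
The underlying timing issue, in a self-contained form (illustrative only, not the actual test code):

import os
import tempfile
import time

# Parent devalidation compares cache file modification times. Two cache files
# written back-to-back can end up with nearly identical timestamps, so the
# comparison can go either way; sleeping between the writes makes the ordering
# unambiguous, which is what the added sleep in the test achieves.
with tempfile.TemporaryDirectory() as directory:
    child = os.path.join(directory, "child_cache.p")
    parent = os.path.join(directory, "parent_cache.p")

    open(child, "w").close()
    time.sleep(1)  # without this, the two timestamps may be indistinguishable
    open(parent, "w").close()

    assert os.path.getmtime(parent) > os.path.getmtime(child)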

@sebhoerl (Contributor) commented:

Thanks for that. I'll give it a try on a full-scale pipeline before merging.

@sebhoerl (Contributor) commented:

Can you quickly explain why you keep the source code of the files everywhere in the code? Why wouldn't it work with just hashing the source code in the first place?

@ainar (Contributor, Author) commented Apr 18, 2023

> Can you quickly explain why you keep the source code of the files everywhere in the code? Why wouldn't it work with just hashing the source code in the first place?

I attach to each stage a hash digest of the concatenation of its source and the sources of all its dependencies. I agree I could hash the (concatenation of the) source digests instead of hashing the (concatenation of the) sources directly; then I would only need to store source digests instead of sources.

My reasons for not doing so are not well established and are mostly personal preference:

  • I wanted to hash only source code and avoid hashing digests (my intuition is that this is more robust against collisions, but it is probably not essential),
  • storing all the sources in memory is not a significant memory burden on a modern computer.

So I think both approaches are comparable. Do you have a preference?
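
As a rough illustration of the two alternatives (hypothetical helper names, md5 chosen arbitrarily):

import hashlib

def digest(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Option A (what the PR currently does): hash the concatenation of the sources
# directly, which requires keeping all sources around.
def stage_digest_from_sources(stage_source, dependency_sources):
    return digest(stage_source + "".join(dependency_sources))

# Option B (suggested): hash each source once, then hash the concatenation of
# the per-source digests; only the digests need to be stored.
def stage_digest_from_digests(stage_source, dependency_sources):
    parts = [digest(stage_source)] + [digest(s) for s in dependency_sources]
    return digest("".join(parts))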

@sebhoerl (Contributor) commented:

I would prefer using hashes; this would also reduce the amount of code changes in this PR.

@ainar (Contributor, Author) commented Apr 18, 2023

Done!
It seems the PR is no longer synchronized with my branch; it is more than one commit behind.

@ainar (Contributor, Author) commented Apr 28, 2023

The PR is synced again. 🙂
I tested it with https://github.com/eqasim-org/ile-de-france, and it seems to run flawlessly!

@ainar (Contributor, Author) commented May 17, 2023

I just fixed 2 bugs in my implementation and added the corresponding tests:

  1. I realized that an ephemeral stage is not systematically devalidated if a parent has a valid cache (I was devalidating it systematically before). Separating the code into a "with working directory" path and a "without working directory" path helped me understand the original behavior and match it in my PR.
  2. I had introduced a bug when creating the hash digest of a stage: since a set is unordered, the digests were not always concatenated in the same order, so stages were randomly devalidated on each run. I fixed it by sorting the stage names lexically before hashing (see the sketch below). In the test, I launch a system command with os.system to check that two consecutive runs are "stable"; there may be a better way to test this.
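
The fix for the second point boils down to something like this (an illustrative sketch, not the literal code from the PR):

import hashlib

def stage_digest(stage_source, dependency_digests):
    # dependency_digests: mapping from stage name to source digest. Sorting the
    # names lexically makes the combined digest deterministic; iterating over an
    # unordered set would make it change from run to run and devalidate stages
    # at random.
    parts = [stage_source] + [dependency_digests[name] for name in sorted(dependency_digests)]
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()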

Comment on lines +680 to +691
# Get current validation tokens
current_validation_tokens = {
    stage_id: str(
        registry[stage_id]["wrapper"].validate(
            ValidateContext(registry[stage_id]["config"], get_cache_directory_path(working_directory, stage_id))
        )
    )
    for stage_id in sorted_hashes
}

# Cache mapper between stage id and cache id.
cache_ids = {stage_id: get_cache_prefix(stage_id, source_digests[stage_id]) + "__" + str(current_validation_tokens[stage_id]) for stage_id in sorted_hashes}

Why do we need this "__" + validation_tokens? Can't it just be included in the hash, e.g. by hashing the get_cache_prefix(...) result and the validation token together?


Hi, sorry for the long delay. I tested it with certain scenarios and it looks good. It's nice not to have the monolithic file anymore. I marked one part in the code: why do we need these explicit validation tokens added to the cache folders/files? Currently this creates some strange file names with spaces (here for IDF):

[screenshot of the IDF cache directory showing file names with embedded validation tokens and spaces]

This could probably cause problems on other operating systems?
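
A rough sketch of the suggested alternative (hypothetical helper, just to illustrate the idea): instead of appending the raw validation token to the cache name, the prefix and the token could be hashed together so that the resulting cache id is a plain hex digest without spaces.

import hashlib

def make_cache_id(cache_prefix, validation_token):
    # Fold the validation token into the hash instead of appending it verbatim, so
    # the cache folder name contains no spaces or other OS-unfriendly characters.
    combined = cache_prefix + "__" + str(validation_token)
    return hashlib.md5(combined.encode("utf-8")).hexdigest()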
