Add job expiration dates #1983

zmc · 2024-07-31T21:50:56Z

This feature has two parts:

* Specifying expiration dates when scheduling test runs
* A global maximum age

Expiration dates are provided by passing --expire to teuthology-suite with
a relative value like 1d (one day), 1w (one week), or an absolute value like
1999-12-31_23:59:59.
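
As a rough illustration of the two accepted forms, here is a minimal sketch of how such a value could be turned into an absolute expiration time. The function name `parse_expire`, the `TIMESTAMP_FMT` constant, the set of unit letters, and the use of UTC are all assumptions made for this example, not the PR's actual code.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed format string, chosen to match the example 1999-12-31_23:59:59.
TIMESTAMP_FMT = "%Y-%m-%d_%H:%M:%S"

# Unit letters assumed for illustration; the description only names d and w.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_expire(value, now=None):
    """Return an absolute UTC expiration datetime for an --expire value."""
    now = now or datetime.now(timezone.utc)
    m = re.fullmatch(r"(\d+)([smhdw])", value)
    if m:
        # Relative form: a count followed by a single unit letter.
        num, unit = int(m.group(1)), m.group(2)
        return now + timedelta(**{_UNITS[unit]: num})
    # Otherwise treat the value as an absolute timestamp.
    # strptime raises ValueError for anything that does not match.
    return datetime.strptime(value, TIMESTAMP_FMT).replace(tzinfo=timezone.utc)
```

Anything that is neither a single count-plus-unit pair nor a timestamp in the assumed format ends up as a `ValueError`.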

A new configuration item, max_job_age, is specified in seconds. This defaults
to two weeks.

When the dispatcher checks the queue for the next job to run, it first looks
at the job's timestamp value, which reflects the time the job was scheduled.
If more than max_job_age seconds have passed since then, the job is skipped
and marked dead. It next checks for an expire value; if that value is in the
past, the job is skipped and marked dead. Otherwise, the job runs as usual.
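
The order of the two checks can be sketched as follows. This is only an illustration of the logic described above: the function name `expiration_reason`, the timestamp format, and the use of naive datetimes are assumptions, and the real dispatcher code may differ.

```python
from datetime import datetime, timedelta

# Assumed timestamp format for both the timestamp and expire fields.
TIMESTAMP_FMT = "%Y-%m-%d_%H:%M:%S"
# Two weeks, expressed in seconds like max_job_age.
DEFAULT_MAX_JOB_AGE = 14 * 24 * 60 * 60


def expiration_reason(job_config, now, max_job_age=DEFAULT_MAX_JOB_AGE):
    """Return why a job should be skipped, or None if it should run."""
    # First check: the global maximum age, measured from scheduling time.
    scheduled = job_config.get("timestamp")
    if scheduled is not None:
        age = now - datetime.strptime(scheduled, TIMESTAMP_FMT)
        if age > timedelta(seconds=max_job_age):
            return "older than max_job_age"
    # Second check: the per-job expire value, if one was given.
    expire = job_config.get("expire")
    if expire is not None and datetime.strptime(expire, TIMESTAMP_FMT) < now:
        return "past its expire date"
    return None
```

In this sketch a caller would mark the job dead whenever a reason is returned, and run it otherwise.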

zmc · 2024-08-01T20:49:12Z

I scheduled two runs to test the feature:

* --expire 1d ran and passed: https://pulpito-ng.ceph.com/runs/zack-2024-08-01_20:41:26-teuthology:no-ceph-main-distro-default-null
* --expire 1s was skipped as expected: https://pulpito-ng.ceph.com/runs/zack-2024-08-01_20:42:01-teuthology:no-ceph-main-distro-default-null

kshtsk · 2024-08-02T01:49:20Z

teuthology/dispatcher/__init__.py

@@ -306,6 +309,31 @@ def prep_job(job_config, log_file_path, archive_dir):
    return job_config, teuth_bin_path


+def check_job_expiration(job_config):


First of all, I'd like to see inline docs for the method.
Second, since this method is used in both the dispatcher and the supervisor, it may be best to create a separate module for jobs and move all related methods there; for example, some of them could be taken from schedule.

kshtsk · 2024-08-05T15:01:14Z

teuthology/test/test_misc.py

    assert excinfo.value.returncode == 111
    for record in caplog.records:
        if record.levelname == 'ERROR':
            assert ('replay full' in record.message or
                    'ABC\n' == record.message)

 def test_sh_progress(caplog):
-    misc.sh("echo AB ; sleep 5 ; /bin/echo C", 2) == "ABC\n"
+    assert misc.sh("echo AB ; sleep 0.1 ; /bin/echo C", 2) == "AB\nC\n"


Nice, that will probably improve test timing; 5 seconds is overkill.

kshtsk · 2024-08-06T19:10:41Z

teuthology/util/time.py

+        case 'w':
+            return timedelta(weeks=num)
+        case _:
+            raise ValueError(err_msg)


Suggestion for a follow-up PR:

    m = re.match(
        r"^\s*"
        r"(?:(?P<weeks>\d+)\s*w\s*)?"
        r"(?:(?P<days>\d+)\s*d\s*)?"
        r"(?:(?P<hours>\d+)\s*h\s*)?"
        r"(?:(?P<minutes>\d+)\s*m\s*)?"
        r"(?:(?P<seconds>\d+)\s*s\s*)?$",
        offset.lower(),
    )
    if m is None:
        raise ValueError(err_msg)
    args = {k: int(v or "0") for k, v in m.groupdict().items()}
    return datetime.timedelta(**args)
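
For anyone who wants to try the multi-unit idea in isolation, here is a self-contained sketch. The function name `parse_compound_offset` and the error message are made up for this example; only the pattern itself comes from the suggestion.

```python
import re
from datetime import timedelta

# Units must appear in descending order (w, d, h, m, s); each is optional.
_OFFSET_RE = re.compile(
    r"^\s*"
    r"(?:(?P<weeks>\d+)\s*w\s*)?"
    r"(?:(?P<days>\d+)\s*d\s*)?"
    r"(?:(?P<hours>\d+)\s*h\s*)?"
    r"(?:(?P<minutes>\d+)\s*m\s*)?"
    r"(?:(?P<seconds>\d+)\s*s\s*)?$"
)


def parse_compound_offset(offset):
    """Parse values such as '1w2d' or '3h 30m' into a timedelta."""
    m = _OFFSET_RE.match(offset.lower())
    if m is None:
        raise ValueError(f"Unrecognized offset: {offset!r}")
    # Groups that did not participate in the match are None; count them as 0.
    args = {k: int(v or "0") for k, v in m.groupdict().items()}
    return timedelta(**args)
```

Because every group is optional, this pattern also accepts an empty string and returns a zero-length timedelta, so a caller may want to reject that case explicitly. Inputs with trailing text after a valid unit, such as "7dwarfs", fail the end-of-string anchor and raise `ValueError`.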

kshtsk · 2024-08-06T19:23:24Z

teuthology/util/test/test_time.py

+        ["1x", ValueError],
+        ["-1m", ValueError],
+        ["0xde", ValueError],
+        ["frog", ValueError],


How 'bout the case: "7dwarfs"?

kshtsk · 2024-08-07T23:52:32Z

rebase needed

Signed-off-by: Zack Cerza <[email protected]>

And move the format string to the time module.
Signed-off-by: Zack Cerza <[email protected]>

Signed-off-by: Zack Cerza <[email protected]>

One test had a missing assert; another had a comparison that would never fire because of an expected exception being raised during the call.
Signed-off-by: Zack Cerza <[email protected]>

test_init.py was making modifications to the config object that persisted between tests. When I fixed that, initially some tests in test_run_.py started failing because of settings in my local ~/.teuthology.yaml. This change causes all of the tests in suite.test to use default config values.
Signed-off-by: Zack Cerza <[email protected]>

This commit isn't strictly necessary for the feature's implementation, but will allow testing the feature on the production teuthology cluster before merging.
Signed-off-by: Zack Cerza <[email protected]>

zmc force-pushed the expiry branch 3 times, most recently from 2d08bf5 to 46c1903 August 1, 2024 00:46

kshtsk reviewed Aug 1, 2024

zmc force-pushed the expiry branch from 46c1903 to c9db6e3 August 1, 2024 16:22

zmc requested a review from batrick August 1, 2024 17:54

zmc force-pushed the expiry branch from 3dbc960 to fd45032 August 1, 2024 20:03

zmc marked this pull request as ready for review August 1, 2024 20:50