
CI: Add matrix testing for both SQLite and PostgreSQL database backends #6670

Merged (2 commits) on Jan 9, 2025

Conversation

@rabbull (Contributor) commented Dec 17, 2024

This PR updates the GitHub Actions CI workflows (ci-code and test-install) to include testing for both SQLite and PostgreSQL database backends. This also increases test coverage of the SQLite-specific codebase.
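For context, a minimal sketch of what such a backend matrix might look like in a GitHub Actions workflow. The job name, backend values, and the pytest flag below are illustrative assumptions, not the actual ci-code.yml contents:

```yaml
# Illustrative sketch only -- not the actual ci-code.yml contents.
jobs:
  tests:
    strategy:
      fail-fast: false
      matrix:
        database-backend: [sqlite, psql]
    steps:
      - name: Run test suite
        # The flag name is an assumption; the idea is that the test suite
        # selects which storage backend to run against per matrix entry.
        run: pytest --db-backend ${{ matrix.database-backend }} tests/
```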

@rabbull rabbull requested review from unkcpz and GeigerJ2 December 17, 2024 15:32
@rabbull rabbull self-assigned this Dec 17, 2024
@GeigerJ2 (Contributor)

Thanks, @rabbull, for opening the PR. Good to know that the failure comes from the DB matrix, and not your changes in #6659! Also pinging @khsrali, as he also asked on the topic of testing with the different DBs recently.

codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.99%. Comparing base (8ae66fb) to head (d29225c).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6670      +/-   ##
==========================================
+ Coverage   77.94%   77.99%   +0.06%     
==========================================
  Files         563      563              
  Lines       41761    41761              
==========================================
+ Hits        32545    32567      +22     
+ Misses       9216     9194      -22     


@GeigerJ2 (Contributor) commented Dec 17, 2024

Also weird that here they only seem to fail for sqlite, while in #6659 there are mainly failed runs for psql...

@unkcpz (Member) commented Dec 17, 2024

I tested with RMQ version 3.12, which I had on my localhost. As expected, the failed test has nothing to do with the RMQ version.

@unkcpz (Member) commented Dec 18, 2024

I switched off xdist; the tests seem to fail for 3.9 with sqlite but pass for 3.12. I'll try to reproduce it locally with Python 3.9. Just to double check: @GeigerJ2 @rabbull, you don't have Python 3.9 in your local setups, correct?

@khsrali (Contributor) left a comment


Thank you @rabbull
In principle I'm in favour of these changes; however, the whole test suite now becomes a bit too long.
I understand this might seem inevitable, but I feel like there are many tests that are independent of the type of storage backend. Perhaps they may not need to be repeated two times.
I think it is worth having a quick look through the test folders and marking those tests so that they only run once.

Maybe we can also get @sphuber's view on this PR, as there is probably a reason why, historically, tests have not been run for all storage backends.

.github/workflows/ci-code.yml (outdated review thread, resolved)
@rabbull (Contributor, Author) commented Dec 18, 2024

it seems the test fail for 3.9 sqlite but okay for 3.12.

That is also the symptom I observed in #6659, where I removed xdist in the end. In addition, my local attempt with Python 3.9 actually passed, but I believe it would still be worthwhile for you to run it locally.

@rabbull (Contributor, Author) commented Dec 18, 2024

Also weird that here they only seem to fail for sqlite, while in #6659 there are mainly failed runs for psql...

In #6659, the same failure also occurs with sqlite. In the latest run, the psql actions timed out rather than failing due to test errors; I had removed xdist, resulting in a significant increase in execution time.

@rabbull (Contributor, Author) commented Dec 18, 2024

I feel like there are many tests that are independent of the type of storage backend. Perhaps they may not need to be repeated two times.

I totally agree with you. The reason this PR doesn't delve deeper is that it mainly serves the other two pending PRs, which are blocked by the CI checks, and I believe simply adding sqlite could be a quick fix. An issue, if one doesn't exist already, could be opened to follow up on the problem.

Comment on lines 88 to 92
- name: upterm
uses: lhotari/action-upterm@v1
if: ${{ failure() }}
with:
wait-timeout-minutes: 60
Member (review comment):


We want to remove this before merging.

@pytest.mark.usefixtures('started_daemon_client')
@pytest.mark.usefixtures('aiida_profile_clean')
Member (review comment):


This can solve the failing test. I think the problem here was caused by racing on the RMQ resource: the queue might be cleaned by the previous test, maybe?

When I run only this test in CI, it passes. The test only fails when it is run with other tests. I guess that is because the daemon_client is shared and the RMQ queue is getting cleaned by other tests (most likely the previous one; it is easy to check by just running the tests in this file).

But no matter what the cause is, I think the solution is to make the daemon_client independent for this test by using the aiida_profile_clean fixture, which closes the daemon (calling broker_reset, which closes the RMQ connection, before running the test).

@unkcpz (Member) commented Dec 20, 2024

but I feel like there are many tests that are independent of the type of storage backend. Perhaps they may not need to be repeated two times.

Good point. But then it is not so easy to find out which tests should be tagged with db_required. One solution I can think of is to put an explicit bug in the DB-related code path that makes tests requiring the DB fail early (a trick I learned from Jonathan Blow's channel). We then tag them and run those with sqlite.
I also have a gut feeling that we may not need to do this, since tests that do not need the DB usually run very fast. I am not sure, but let's try.

@unkcpz (Member) commented Dec 20, 2024

At some point, I also saw the following exception, which might be the root cause of the failed test. (I applied the timeout when running tests, so maybe the problem in the other PR by @rabbull comes from this. For tests stuck on remote runners, I think the better way is to use pytest-timeout; I'll open an issue for it):

[gw0] linux -- Python 3.12.7 /home/jyu/venv/aiida-core-dev/bin/python

run_cli_command = <function run_cli_command.<locals>.factory at 0x7adda39874c0>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7adda3fe48c0>
aiida_code_installed = <function aiida_code_installed.<locals>.factory at 0x7adda3985940>, submit_and_await = <function submit_and_await.<locals>.factory at 0x7adda3985620>

    @pytest.mark.usefixtures('started_daemon_client')
    def test_revive(run_cli_command, monkeypatch, aiida_code_installed, submit_and_await):
        """Test ``tasks revive``."""
        code = aiida_code_installed(default_calc_job_plugin='core.arithmetic.add', filepath_executable='/bin/bash')
        builder = code.get_builder()
        builder.x = Int(1)
        builder.y = Int(1)
        builder.metadata.options.resources = {'num_machines': 1}
    
        # Temporarily patch the ``RemoteProcessThreadController.continue_process`` method to do nothing and just return a
        # completed future. This ensures that the submission creates a process node in the database but no task is sent to
        # RabbitMQ and so the daemon will not start running it.
        with monkeypatch.context() as context:
            context.setattr(RemoteProcessThreadController, 'continue_process', lambda *args, **kwargs: None)
>           node = submit(builder)

tests/cmdline/commands/test_rabbitmq.py:86: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/aiida/engine/launch.py:128: in submit
    process_inited = instantiate_process(runner, process, **inputs)
src/aiida/engine/utils.py:86: in instantiate_process
    process = process_class(runner=runner, inputs=inputs)
../plumpy/src/plumpy/base/state_machine.py:205: in __call__
    inst.transition_to(inst.create_initial_state())
../plumpy/src/plumpy/base/state_machine.py:357: in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
../plumpy/src/plumpy/processes.py:1098: in transition_failed
    raise exception.with_traceback(trace)
../plumpy/src/plumpy/base/state_machine.py:343: in transition_to
    self._enter_next_state(new_state)
../plumpy/src/plumpy/base/state_machine.py:410: in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERING_STATE, next_state)
../plumpy/src/plumpy/base/state_machine.py:311: in _fire_state_event
    callback(self, hook, state)
../plumpy/src/plumpy/processes.py:353: in <lambda>
    state_machine.StateEventHook.ENTERING_STATE: lambda _s, _h, state: self.on_entering(
../plumpy/src/plumpy/processes.py:720: in on_entering
    call_with_super_check(self.on_create)
../plumpy/src/plumpy/base/utils.py:31: in call_with_super_check
    wrapped(*args, **kwargs)
src/aiida/engine/processes/process.py:412: in on_create
    super().on_create()
../plumpy/src/plumpy/base/utils.py:16: in wrapper
    wrapped(self, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiida.calculations.arithmetic.add.ArithmeticAddCalculation object at 0x7add9bf9ad20>

    @super_check
    def on_create(self) -> None:
        """Entering the CREATED state."""
        self._creation_time = time.time()
    
        def recursively_copy_dictionaries(value: Any) -> Any:
            """Recursively copy the mapping but only create copies of the dictionaries not the values."""
            if isinstance(value, dict):
                return {key: recursively_copy_dictionaries(subvalue) for key, subvalue in value.items()}
            return value
    
        # This will parse the inputs with respect to the input portnamespace of the spec and validate them. The
        # ``pre_process`` method of the inputs port namespace modifies its argument in place, and since the
        # ``_raw_inputs`` should not be modified, we pass a clone of it. Note that we only need a clone of the nested
        # dictionaries, so we don't use ``copy.deepcopy`` (which might seem like the obvious choice) as that will also
        # create a clone of the values, which we don't want.
        raw_inputs = recursively_copy_dictionaries(dict(self._raw_inputs)) if self._raw_inputs else {}
        self._parsed_inputs = self.spec().inputs.pre_process(raw_inputs)
        result = self.spec().inputs.validate(self._parsed_inputs)
    
        if result is not None:
>           raise ValueError(result)
E           ValueError: Error occurred validating port 'inputs': input `metadata.options.resources` is not valid for the `DirectScheduler` scheduler: At least two among `num_machines`, `num_mpiprocs_per_machine` or `tot_num_mpipro
cs` must be specified.

../plumpy/src/plumpy/processes.py:788: ValueError

@danielhollas (Collaborator)

Good point. Then it is not so easy to find out which tests should be tagged with db_required.

You could maybe do this by adding markers automatically based on the fixtures used (e.g. aiida_profile); we already do that in tests/conftest.py for the requires_rmq marker:

if 'daemon_client' in item.fixturenames:
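The suggestion above could be sketched as a conftest.py collection hook. The fixture set and the requires_db marker name below are illustrative assumptions, modeled on the existing requires_rmq pattern:

```python
# Sketch of automatic marker assignment in a conftest.py, following the same
# pattern tests/conftest.py uses for the requires_rmq marker. The fixture and
# marker names below are illustrative, not aiida-core's actual ones.
import pytest

DB_FIXTURES = {'aiida_profile', 'aiida_profile_clean', 'daemon_client'}


def pytest_collection_modifyitems(config, items):
    """Tag every collected test that requests a database-backed fixture."""
    for item in items:
        if DB_FIXTURES.intersection(item.fixturenames):
            item.add_marker(pytest.mark.requires_db)  # hypothetical marker
```

Tests carrying the marker could then be selected or deselected per matrix entry with `pytest -m requires_db` / `-m "not requires_db"`.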

@unkcpz unkcpz force-pushed the backend-matrix branch 3 times, most recently from 9dc539a to acb554d Compare December 31, 2024 23:55
@rabbull (Contributor, Author) commented Jan 7, 2025

CI tests are currently failing due to a known issue. Specifically, with older versions of monty, the deprecation of the pymatgen function is_rare_earth_metal raises an error instead of a warning in CI environments. The monty team has already released a newer version that fixes this issue. However, the latest version of monty requires python>=3.10, which is not fully compatible with the Python version range supported by aiida.

Since this issue only affects the CI testing procedure, I believe an acceptable solution is:

  1. Force the use of the newer monty version for environments with python>=3.10.
  2. Skip tests involving pymatgen for environments with python<3.10.
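The second workaround could be sketched with a pytest skip marker; the marker name, test name, and reason text below are illustrative assumptions:

```python
# Sketch: skip pymatgen-dependent tests on Python < 3.10, where the fixed
# monty release cannot be installed. Names here are hypothetical.
import sys

import pytest

requires_pymatgen = pytest.mark.skipif(
    sys.version_info < (3, 10),
    reason='pymatgen tests need a monty release that requires Python >= 3.10',
)


@requires_pymatgen
def test_pymatgen_roundtrip():
    """Placeholder for a pymatgen-dependent test."""
```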

I will open up a new issue and submit a PR to discuss and address this problem soon.

cc @GeigerJ2 @unkcpz @chrisjsewell

@danielhollas (Collaborator)

@rabbull thank you for a great writeup of the issue, it contained helpful pointers for me. I believe I have a good solution in #6676, can you take a look?

@unkcpz (Member) commented Jan 8, 2025

Thanks @danielhollas. @rabbull, I updated the PR, do you mind giving it a look?

@danielhollas (Collaborator)

Changes look good, although the fixture fixes seem unrelated to the original purpose of this PR and might be better in a separate one.

By the way, just to make things clear, the SQLite backend was already being tested as part of the presto test job.

Having an extra SQLite test job will increase the code coverage for SQLite backend by running tests that are not part of presto tests (specifically those that require RMQ).

@unkcpz (Member) commented Jan 8, 2025

The fixture change fixes the failed test, see #6670 (comment). I'll try to revert it to see if the failure is really random (it always happened when the PR was first created by @rabbull).

@danielhollas (Collaborator) left a comment


Ah, interesting, it looks like the test failure is more reproducible with SQLite backend? Let's just include the fix in this PR then, sorry for the noise.

@unkcpz (Member) commented Jan 9, 2025

@danielhollas after I put a timeout on single tests in #6674, tests/tools/archive/orm/test_links.py::test_high_level_workflow_links seems to hit this timeout easily. The idea of #6674 was first to prevent engine tests from hanging because of RMQ problems, like what was initially encountered in this PR. I think it also helps to notify developers which unit tests are too slow and would better be moved to "integration" tests.

Now, we have two solutions for this,

  • I can increase the timeout to 120s (I can tag these archive tests explicitly) and hope the archive tests will not time out.
  • We tag those slow tests and run them in another CI job. That would make ci-code run faster.

What do you think?
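As a sketch of the first option, a per-test timeout override with the pytest-timeout plugin might look like this (the 120s value comes from the discussion above; the test body is a placeholder):

```python
# Sketch: per-test timeout with the pytest-timeout plugin. The global default
# (e.g. --timeout=60 on the command line) would stay low, while known-slow
# archive tests get an explicit, larger budget via the marker.
import pytest


@pytest.mark.timeout(120)  # seconds; overrides the global timeout for this test
def test_high_level_workflow_links():
    """Placeholder for the slow archive test discussed above."""
```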

@danielhollas (Collaborator)

I've actually created an issue about moving slow tests to nightly a while ago, see #6526.
Indeed, I pointed out tests/tools/archive/orm/test_links.py::test_high_level_workflow_links as the most likely candidate for that.

For now I'd suggest bumping the timeout to 120s and continue the discussion on #6526.

The idea of #6674 was first it prevent engine tests that may hang because of RMQ problem like what initially encountered in this PR. I think also helps to notify developers which unit test is too slow and better be moved as "integration" test.

tbh: I am not a huge fan of the timeout. I understand the motivation, but the main downside is that it makes the test suite flaky (as you have now encountered). The problem is that GHA runners are very unpredictable and sometimes take much longer to run the tests than usual, so it's very hard to pick a timeout value that is both useful (not too large) and not so low that it makes the test suite flaky.

@unkcpz (Member) commented Jan 9, 2025

I didn't see #6526 before, sorry.
I'll increase time for those tests to 120s and we continue the discussion in your issue.

tbh: I am not a huge fan of the timeout. I understand the motivation, but the main downside is that it makes the test suite flaky (as you now encountered).

Makes sense. I'd lean toward moving those to nightly or slow-tagged tests. Then, if a test runs for more than 120s, it is something we really need to check.

@unkcpz unkcpz force-pushed the backend-matrix branch 2 times, most recently from 1f42275 to d29225c Compare January 9, 2025 22:29
@unkcpz unkcpz merged commit 3f5e2c1 into aiidateam:main Jan 9, 2025
8 of 9 checks passed
rabbull and others added 2 commits January 9, 2025 23:30
Updates the GitHub Actions CI workflow ci-code.yml to include testing for both SQLite
and PostgreSQL database backends.
This also increases test coverage of the SQLite-specific codebase.
@unkcpz (Member) commented Jan 9, 2025

Thanks @rabbull for the PR and thanks all for reviewing.
