
CI: Add matrix testing for both SQLite and PostgreSQL database backends #6670

Merged (2 commits) on Jan 9, 2025

Conversation

@rabbull (Contributor) commented Dec 17, 2024

This PR updates the GitHub Actions CI workflows (ci-code and test-install) to include testing for both SQLite and PostgreSQL database backends. This also increases test coverage of the SQLite-specific codebase.
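For context, a minimal sketch of what such a backend matrix might look like in a GitHub Actions workflow. The job name, backend values, and the pytest flag below are illustrative assumptions, not the actual ci-code.yml contents:

```yaml
# Illustrative sketch only -- not the actual ci-code.yml contents.
jobs:
  tests:
    strategy:
      fail-fast: false
      matrix:
        database-backend: [sqlite, psql]
    steps:
      - name: Run test suite
        # The flag name is an assumption; the idea is that the test suite
        # selects which storage backend to run against per matrix entry.
        run: pytest --db-backend ${{ matrix.database-backend }} tests/
```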

@rabbull rabbull requested review from unkcpz and GeigerJ2 December 17, 2024 15:32
@rabbull rabbull self-assigned this Dec 17, 2024
@GeigerJ2 (Contributor)

Thanks, @rabbull, for opening the PR. Good to know that the failure comes from the DB matrix, and not your changes in #6659! Also pinging @khsrali, as he also asked on the topic of testing with the different DBs recently.

codecov bot commented Dec 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.99%. Comparing base (8ae66fb) to head (d29225c).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6670      +/-   ##
==========================================
+ Coverage   77.94%   77.99%   +0.06%     
==========================================
  Files         563      563              
  Lines       41761    41761              
==========================================
+ Hits        32545    32567      +22     
+ Misses       9216     9194      -22     


@GeigerJ2 (Contributor) commented Dec 17, 2024

Also weird that here they only seem to fail for sqlite, while in #6659 there are mainly failed runs for psql...

@unkcpz (Member) commented Dec 17, 2024

I tested with RMQ version 3.12, which I had on my localhost. As expected, the failed test has nothing to do with the RMQ version.

@unkcpz (Member) commented Dec 18, 2024

I switched off xdist; the tests seem to fail for 3.9 with sqlite but pass for 3.12. I'll try to reproduce it locally with Python 3.9. Just to double check: @GeigerJ2 @rabbull, you don't have Python 3.9 in your local setups, correct?

@khsrali (Contributor) left a comment


Thank you @rabbull
In principle I'm in favour of these changes; however, the whole test suite now becomes a bit too long.
I understand this might seem inevitable, but I feel like there are many tests that are independent of the type of storage backend. Perhaps they may not need to be repeated two times.
I think it is worth having a quick look through the test folders and marking those tests so that they only run once.

Maybe we can also get @sphuber's view on this PR, as there is probably a reason why, historically, tests have not been run for all storage backends.

.github/workflows/ci-code.yml (outdated review thread, resolved)
@rabbull (Contributor, Author) commented Dec 18, 2024

it seems the test fail for 3.9 sqlite but okay for 3.12.

That is also the symptom I observed in #6659, where I removed xdist in the end. In addition, my local attempt with Python 3.9 actually passed, but I believe it would still be worthwhile for you to run it locally.

@rabbull (Contributor, Author) commented Dec 18, 2024

Also weird that here they only seem to fail for sqlite, while in #6659 there are mainly failed runs for psql...

In #6659, the same failure also occurs with sqlite. In the latest run, the psql actions timed out rather than failing due to test errors; I had removed xdist, resulting in a significant increase in execution time.

@rabbull (Contributor, Author) commented Dec 18, 2024

I feel like there are many tests that are independent of the type of storage backend. Perhaps they may not need to be repeated two times.

I totally agree with you. The reason this PR doesn't delve deeper is that it mainly serves the other two pending PRs, which are blocked by the CI checks, and I believe simply adding sqlite could be a quick fix. An issue, if one doesn't exist already, could be opened to follow up on the problem.

Comment on lines 88 to 92
- name: upterm
uses: lhotari/action-upterm@v1
if: ${{ failure() }}
with:
wait-timeout-minutes: 60
Member (review comment):


We want to remove this before merging.

@pytest.mark.usefixtures('started_daemon_client')
@pytest.mark.usefixtures('aiida_profile_clean')
Member (review comment):


This can solve the failing test. I think the problem here was caused by racing on the RMQ resource: the queue might be cleaned by the previous test, maybe?

When I run only this test in CI, it passes. The test only fails when it is run with other tests. I guess that is because the daemon_client is shared and the RMQ queue is getting cleaned by other tests (most likely the previous one; it is easy to check by just running the tests in this file).

But no matter what the cause is, I think the solution is to make the daemon_client independent for this test by using the aiida_profile_clean fixture, which closes the daemon (calling broker_reset, which closes the RMQ connection, before running the test).

@unkcpz (Member) commented Dec 20, 2024

but I feel like there are many tests that are independent of the type of storage backend. Perhaps they may not need to be repeated two times.

Good point. But then it is not so easy to find out which tests should be tagged with db_required. One solution I can think of is to put an explicit bug in the DB-related code path that makes tests requiring the DB fail early (a trick I learned from Jonathan Blow's channel). We then tag them and run those with sqlite.
I also have a gut feeling that we may not need to do this, since tests that do not need the DB usually run very fast. I am not sure, but let's try.

@unkcpz (Member) commented Dec 20, 2024

At some point, I also saw the following exception, which might be the root cause of the failed test. (I applied the timeout when running tests, so maybe the problem in the other PR by @rabbull comes from this. For tests stuck on remote runners, I think the better way is to use pytest-timeout; I'll open an issue for it):

[gw0] linux -- Python 3.12.7 /home/jyu/venv/aiida-core-dev/bin/python

run_cli_command = <function run_cli_command.<locals>.factory at 0x7adda39874c0>, monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7adda3fe48c0>
aiida_code_installed = <function aiida_code_installed.<locals>.factory at 0x7adda3985940>, submit_and_await = <function submit_and_await.<locals>.factory at 0x7adda3985620>

    @pytest.mark.usefixtures('started_daemon_client')
    def test_revive(run_cli_command, monkeypatch, aiida_code_installed, submit_and_await):
        """Test ``tasks revive``."""
        code = aiida_code_installed(default_calc_job_plugin='core.arithmetic.add', filepath_executable='/bin/bash')
        builder = code.get_builder()
        builder.x = Int(1)
        builder.y = Int(1)
        builder.metadata.options.resources = {'num_machines': 1}
    
        # Temporarily patch the ``RemoteProcessThreadController.continue_process`` method to do nothing and just return a
        # completed future. This ensures that the submission creates a process node in the database but no task is sent to
        # RabbitMQ and so the daemon will not start running it.
        with monkeypatch.context() as context:
            context.setattr(RemoteProcessThreadController, 'continue_process', lambda *args, **kwargs: None)
>           node = submit(builder)

tests/cmdline/commands/test_rabbitmq.py:86: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/aiida/engine/launch.py:128: in submit
    process_inited = instantiate_process(runner, process, **inputs)
src/aiida/engine/utils.py:86: in instantiate_process
    process = process_class(runner=runner, inputs=inputs)
../plumpy/src/plumpy/base/state_machine.py:205: in __call__
    inst.transition_to(inst.create_initial_state())
../plumpy/src/plumpy/base/state_machine.py:357: in transition_to
    self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
../plumpy/src/plumpy/processes.py:1098: in transition_failed
    raise exception.with_traceback(trace)
../plumpy/src/plumpy/base/state_machine.py:343: in transition_to
    self._enter_next_state(new_state)
../plumpy/src/plumpy/base/state_machine.py:410: in _enter_next_state
    self._fire_state_event(StateEventHook.ENTERING_STATE, next_state)
../plumpy/src/plumpy/base/state_machine.py:311: in _fire_state_event
    callback(self, hook, state)
../plumpy/src/plumpy/processes.py:353: in <lambda>
    state_machine.StateEventHook.ENTERING_STATE: lambda _s, _h, state: self.on_entering(
../plumpy/src/plumpy/processes.py:720: in on_entering
    call_with_super_check(self.on_create)
../plumpy/src/plumpy/base/utils.py:31: in call_with_super_check
    wrapped(*args, **kwargs)
src/aiida/engine/processes/process.py:412: in on_create
    super().on_create()
../plumpy/src/plumpy/base/utils.py:16: in wrapper
    wrapped(self, *args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <aiida.calculations.arithmetic.add.ArithmeticAddCalculation object at 0x7add9bf9ad20>

    @super_check
    def on_create(self) -> None:
        """Entering the CREATED state."""
        self._creation_time = time.time()
    
        def recursively_copy_dictionaries(value: Any) -> Any:
            """Recursively copy the mapping but only create copies of the dictionaries not the values."""
            if isinstance(value, dict):
                return {key: recursively_copy_dictionaries(subvalue) for key, subvalue in value.items()}
            return value
    
        # This will parse the inputs with respect to the input portnamespace of the spec and validate them. The
        # ``pre_process`` method of the inputs port namespace modifies its argument in place, and since the
        # ``_raw_inputs`` should not be modified, we pass a clone of it. Note that we only need a clone of the nested
        # dictionaries, so we don't use ``copy.deepcopy`` (which might seem like the obvious choice) as that will also
        # create a clone of the values, which we don't want.
        raw_inputs = recursively_copy_dictionaries(dict(self._raw_inputs)) if self._raw_inputs else {}
        self._parsed_inputs = self.spec().inputs.pre_process(raw_inputs)
        result = self.spec().inputs.validate(self._parsed_inputs)
    
        if result is not None:
>           raise ValueError(result)
E           ValueError: Error occurred validating port 'inputs': input `metadata.options.resources` is not valid for the `DirectScheduler` scheduler: At least two among `num_machines`, `num_mpiprocs_per_machine` or `tot_num_mpipro
cs` must be specified.

../plumpy/src/plumpy/processes.py:788: ValueError

@danielhollas (Collaborator)

Good point. Then it is not so easy to find out which tests should be tagged with db_required.

You could maybe do this by adding markers automatically based on the fixtures used (e.g. aiida_profile); we already do that in tests/conftest.py for the requires_rmq marker:

if 'daemon_client' in item.fixturenames:
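The suggestion above could be sketched as a conftest.py collection hook. The fixture set and the requires_db marker name below are illustrative assumptions, modeled on the existing requires_rmq pattern:

```python
# Sketch of automatic marker assignment in a conftest.py, following the same
# pattern tests/conftest.py uses for the requires_rmq marker. The fixture and
# marker names below are illustrative, not aiida-core's actual ones.
import pytest

DB_FIXTURES = {'aiida_profile', 'aiida_profile_clean', 'daemon_client'}


def pytest_collection_modifyitems(config, items):
    """Tag every collected test that requests a database-backed fixture."""
    for item in items:
        if DB_FIXTURES.intersection(item.fixturenames):
            item.add_marker(pytest.mark.requires_db)  # hypothetical marker
```

Tests carrying the marker could then be selected or deselected per matrix entry with `pytest -m requires_db` / `-m "not requires_db"`.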

@unkcpz unkcpz force-pushed the backend-matrix branch 3 times, most recently from 9dc539a to acb554d Compare December 31, 2024 23:55
@rabbull (Contributor, Author) commented Jan 7, 2025

CI tests are currently failing due to a known issue. Specifically, with older versions of monty, the deprecation of the pymatgen function is_rare_earth_metal raises an error instead of a warning in CI environments. The monty team has already released a newer version that fixes this issue. However, the latest version of monty requires python>=3.10, which is not fully compatible with the Python version range supported by aiida.

Since this issue only affects the CI testing procedure, I believe an acceptable solution is:

  1. Force the use of the newer monty version for environments with python>=3.10.
  2. Skip tests involving pymatgen for environments with python<3.10.
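The second workaround could be sketched with a pytest skip marker; the marker name, test name, and reason text below are illustrative assumptions:

```python
# Sketch: skip pymatgen-dependent tests on Python < 3.10, where the fixed
# monty release cannot be installed. Names here are hypothetical.
import sys

import pytest

requires_pymatgen = pytest.mark.skipif(
    sys.version_info < (3, 10),
    reason='pymatgen tests need a monty release that requires Python >= 3.10',
)


@requires_pymatgen
def test_pymatgen_roundtrip():
    """Placeholder for a pymatgen-dependent test."""
```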

I will open up a new issue and submit a PR to discuss and address this problem soon.

cc @GeigerJ2 @unkcpz @chrisjsewell

@danielhollas (Collaborator)

@rabbull thank you for a great writeup of the issue, it contained helpful pointers for me. I believe I have a good solution in #6676, can you take a look?

@unkcpz (Member) commented Jan 8, 2025

Thanks @danielhollas. @rabbull, I updated the PR, do you mind giving it a look?

@danielhollas (Collaborator)

Changes look good, although the fixture fixes seem unrelated to the original purpose of this PR and might be better in a separate one.

By the way, just to make things clear, the SQLite backend was already being tested as part of the presto test job.

Having an extra SQLite test job will increase the code coverage for SQLite backend by running tests that are not part of presto tests (specifically those that require RMQ).

@unkcpz (Member) commented Jan 8, 2025

The fixture change fixes the failed test, see #6670 (comment). I'll try to revert it to see if the failure is really random (it always happened when the PR was first created by @rabbull).

@danielhollas (Collaborator) left a comment


Ah, interesting, it looks like the test failure is more reproducible with SQLite backend? Let's just include the fix in this PR then, sorry for the noise.

@unkcpz (Member) commented Jan 9, 2025

@danielhollas after I put a timeout on single tests in #6674, tests/tools/archive/orm/test_links.py::test_high_level_workflow_links seems to hit this timeout easily. The idea of #6674 was first to prevent engine tests from hanging because of RMQ problems, like what was initially encountered in this PR. I think it also helps to notify developers which unit tests are too slow and would better be moved to "integration" tests.

Now, we have two solutions for this,

  • I can increase the timeout to 120s (I can tag these archive tests explicitly) and hope the archive tests will not time out.
  • We tag those slow tests and run them in another CI job. That would make ci-code run faster.

What do you think?
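As a sketch of the first option, a per-test timeout override with the pytest-timeout plugin might look like this (the 120s value comes from the discussion above; the test body is a placeholder):

```python
# Sketch: per-test timeout with the pytest-timeout plugin. The global default
# (e.g. --timeout=60 on the command line) would stay low, while known-slow
# archive tests get an explicit, larger budget via the marker.
import pytest


@pytest.mark.timeout(120)  # seconds; overrides the global timeout for this test
def test_high_level_workflow_links():
    """Placeholder for the slow archive test discussed above."""
```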

@danielhollas (Collaborator)

I've actually created an issue about moving slow tests to nightly a while ago, see #6526.
Indeed, I pointed out tests/tools/archive/orm/test_links.py::test_high_level_workflow_links as the most likely candidate for that.

For now I'd suggest bumping the timeout to 120s and continue the discussion on #6526.

The idea of #6674 was first it prevent engine tests that may hang because of RMQ problem like what initially encountered in this PR. I think also helps to notify developers which unit test is too slow and better be moved as "integration" test.

tbh: I am not a huge fan of the timeout. I understand the motivation, but the main downside is that it makes the test suite flaky (as you have now encountered). The problem is that GHA runners are very unpredictable and sometimes take much longer to run the tests than usual, so it's very hard to pick a timeout value that is both useful (not too large) and not so low that it makes the test suite flaky.

@unkcpz (Member) commented Jan 9, 2025

I didn't see #6526 before, sorry.
I'll increase time for those tests to 120s and we continue the discussion in your issue.

tbh: I am not a huge fan of the timeout. I understand the motivation, but the main downside is that it makes the test suite flaky (as you now encountered).

Makes sense. I'd lean toward moving those to nightly or slow-tagged tests. Then, if a test runs for more than 120s, it is something we really need to check.

@unkcpz unkcpz force-pushed the backend-matrix branch 2 times, most recently from 1f42275 to d29225c Compare January 9, 2025 22:29
@unkcpz unkcpz merged commit 3f5e2c1 into aiidateam:main Jan 9, 2025
8 of 9 checks passed
rabbull and others added 2 commits January 9, 2025 23:30
Updates the GitHub Actions CI workflow ci-code.yml to include testing for both SQLite
and PostgreSQL database backends.
This also increases test coverage of the SQLite-specific codebase.
@unkcpz (Member) commented Jan 9, 2025

Thanks @rabbull for the PR and thanks all for reviewing.
