Serious Performance Regression in tidal_array.py Example Since October 2024 #391
@stephankramer - I believe this is related to Firedrake or its dependencies. I thought it was better to wait for you to have a look, if you have any capacity, before I tag the Firedrake developers again.
So it looks like the time in MatLUFactorNum is the immediate culprit: it has gone from 61.7% of 317 s (≈195 s) to 94% of 1693 s (≈1591 s). This could mean a number of things:
To try and suss this out, could you rerun (just the serial case is probably clearest) using the following PETSc options:
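(The exact option list is not preserved in this thread; a typical diagnostic set for this kind of investigation combines PETSc's profiling summary with the nonlinear/linear residual monitors, for example:)

```bash
# Hypothetical reconstruction - the options actually suggested are not captured above.
# -log_view writes the per-event performance summary (where MatLUFactorNum shows up),
# -snes_monitor / -ksp_monitor print residuals so old and new runs can be compared.
python tidal_array.py -log_view :petsc_log.txt -snes_monitor -ksp_monitor
```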
I'm sure you are right, but try to run the exact same setup before and after if you weren't already (as far as possible given API changes in Firedrake/Thetis), as it makes it easier to interpret the results - e.g. if nothing has changed, the SNES residuals from the monitor should be exactly the same up to some level of rounding.
Hi @stephankramer - as suggested, I have created a test branch and excluded the commits relating to Firedrake/dependency updates (8c6f9d1, 3e67547, 39880f0, aecda67, 3515ec4 - merge commit, d2ae4fe - merge commit, 6ba89f8). I got rid of the additional callback and just ran the example with two turbine types for 200 s.
Results through the Docker image (Oct. '24):
Results through latest:
Please let me know if I have missed anything you needed!
EDIT: added
EDIT 2: The turbines make no difference, so it must be in the solver. The KSP residuals vary between the new and old Firedrake.
Ah, of course, the flamegraph stuff is also using ... What is encouraging is that none of the numerical results seem to have changed significantly, and the number of nonlinear iterations is exactly the same as well (excluding option 3 above).
@stephankramer - here is the second set of logs:
Docker (Oct. '24):
Latest:
Hm, so not quite what I expected. First off, everything about MatLUFactorNum seems to be the same: same number of calls, same number of flops - it just takes a lot longer in latest. However, if you scroll down a bit further in these PETSc logs, compare the configure options - these are the configure options PETSc has been built with, where 2024 is using ...
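A quick way to put the two configure-option sections side by side (a sketch; the file names below are placeholders, not the actual attachment names):

```bash
# Pull the "Configure options" section out of each -log_view output and compare.
# The file names are hypothetical placeholders for the two attached logs.
grep -A 2 "Configure options" petsc_log_2024_10.txt > configure_2024.txt
grep -A 2 "Configure options" petsc_log_latest.txt  > configure_latest.txt
diff configure_2024.txt configure_latest.txt
```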
This pointed towards it being a local issue - I ran the Thetis tests and they took ~55 minutes, pretty much confirming that. Sorry - I should have realised this sooner. If I use the latest Firedrake Docker image, my run time is initially 2 minutes 10 s, then reduces to 1 minute 40 s (relating to the cold/hot cache discussion you were having with Connor?). Using the Oct. '24 Docker image, my run time is initially 2 minutes 30 s, reducing to 1 minute 11 s (as per the logs previously). It seems we have still lost a reasonable amount of performance here (71 s -> 100 s), but it is not as bad as I thought initially. I've installed locally with:
Once I get to my typical installation failure point with libsupermesh (firedrakeproject/firedrake#3374), I then run:
Then I run again:
Sorry, confirming what? So if I understand correctly, you see a slow-down (significant but not as dramatic) between the 2024-10 container and the latest container, but a much larger slow-down between the "latest" container and a local "latest" build? That is interesting... Ah, I see there's a subtlety going wrong here:
That doesn't work, because it will run something like this:
whereas what you want is:
In the first, the ...
Another difference I see in the PETSc log between 10-2024 and both versions of latest is right at the end, in the "Using libraries:" section: "-llapack -lblas" vs. "-lnlapack -lnblas". This is important because you spend most of your time in MUMPS solving matrix systems (as you should), which relies heavily on these two libraries. I am not entirely sure how that difference comes about (I don't think it comes from the missing -march=native). Could you also upload the PETSc log of the run in the latest container, by the way - because basically you want to figure out what difference there is between that one and your local latest build. But yes, it might be worth asking on Firedrake, as they'll be interested to see why you're seemingly not getting optimal performance out of the box with the default build script.
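One way to confirm which BLAS/LAPACK a given PETSc build actually picks up at run time is to inspect the shared library it links against (a sketch, assuming a shared PETSc build with PETSC_DIR and PETSC_ARCH set):

```bash
# List the BLAS/LAPACK shared objects resolved by the PETSc library.
# Assumes PETSC_DIR and PETSC_ARCH point at the build being checked.
ldd "$PETSC_DIR/$PETSC_ARCH/lib/libpetsc.so" | grep -iE 'blas|lapack'
```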
That my local build on the machine is much slower than the containers.
Correct. Ignoring the first attempt at running the example, I have:
Docker (Oct. '24): 71 s
I will re-install with the updated export commands.
Here is the log for the latest container: serial_latest_docker_petsc_log.txt - ignore the timing in this instance, as I am running other things in the background now. As you spotted, it seems like "-llapack -lblas" is used in both containers, whereas the local build has "-lnlapack -lnblas", so that is another thing I'll need to try and change.
So it looks like the (n)lapack and (n)blas difference comes from the fact that locally you have ... It might be, however, that if you fix the ... On the last point, and to answer an earlier comment:
So the thing that slowed down the Thetis CI suite was mostly the time it takes to compile. This compilation time only affects the first time you run a problem (running from a "cold cache"), as the next time it will pick up the compiled code from the cache directory on disk - and also, within a single run solving the same equation over multiple timesteps, it only affects the first timestep. This does, however, have a very big impact on the running of the Thetis suite, as it runs all problems in a clean container, so all equations are new, and the tests are typically configured to run just a few timesteps. This is not the case in any production run, as there you will be running many timesteps, so compilation time should become negligible. Based on your numbers, it seems that after the fixes that have gone in, the compilation is now in fact faster than it was in 10-2024, but the rest of the run time is slightly slower. So it may well be that on average the test suite runs in roughly the same time, but that's not necessarily representative of production runs - so definitely worth querying the Firedrake devs indeed.
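A simple way to separate cold-cache from hot-cache timings is to clear the generated-code cache and time the same run twice (a sketch, assuming the default Firedrake/PyOP2 cache directories under ~/.cache; adjust if a custom cache location such as PYOP2_CACHE_DIR is in use):

```bash
# Cold-cache run: remove previously compiled kernels first (default cache paths assumed).
rm -rf ~/.cache/pyop2 ~/.cache/tsfc
time python tidal_array.py   # includes code generation and compilation

# Hot-cache run: compiled kernels are reused from disk, so this measures solve time only.
time python tidal_array.py
```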
@stephankramer I found that I had to build PETSc myself - just updating the flags to:
was not enough, i.e. my results stayed as slow as the default build. Once I built PETSc myself, I get the same time as the container, suggesting it was the (n)lapack+blas causing the slowdown. I have copied my instructions below for reference later. I asked a colleague to check the containers and a latest default build on a different machine. To summarise the results of running this test:

Dell Precision 5820 Tower X-Series (Ubuntu 20.04.6 LTS)
October ’24 image initial run: 150 s

HP Z6 G5 Workstation Desktop (Ubuntu 22.04.5 LTS)
For my colleague:

I'll create a Firedrake discussion now - thank you as always.

Local install - separate PETSc build (see the sketch after the steps below):
Check the install succeeded with `make PETSC_DIR=/home/s1626662/firedrake2/petsc PETSC_ARCH=default check`.
I still find my install fails at the libsupermesh install.
Check that all Firedrake tests pass.
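(For reference, the exact configure line is not preserved above. A hypothetical PETSc configure for a Firedrake-style stack, with native optimisation flags and an explicit system BLAS/LAPACK - the two points under discussion here - might look roughly like the following; it is illustrative, not the configuration actually used:)

```bash
# Illustrative sketch only - not the exact configure line used in this thread.
cd "$PETSC_DIR"
./configure PETSC_ARCH=default \
    --with-debugging=0 \
    COPTFLAGS='-O3 -march=native' \
    CXXOPTFLAGS='-O3 -march=native' \
    FOPTFLAGS='-O3 -march=native' \
    --with-blaslapack-lib='-llapack -lblas' \
    --download-mumps --download-scalapack --download-metis --download-parmetis
make PETSC_DIR="$PETSC_DIR" PETSC_ARCH=default all
```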
The slowdown is attributed to Firedrake #3983. The discussion thread on poor default installation performance is here. Hopefully the default performance is fixed by Firedrake #4011, but the remaining (~40% in this case) slowdown requires Firedrake #3983 to be fixed. EDIT: clarified wording.
Unfortunately this isn't right. firedrakeproject/firedrake#4011 will remove some of the unneeded PETSc configure options that you thought were causing a slowdown, but I am not modifying the caching code.
This second bit was in reference to local installation issues I had, which were causing a more significant slowdown. I only included this because a lot of the discussion above was also around the default local installation (e.g. see below). My "hopefully fixed" was referring to the default installation performance. Sorry for the confusion!
I’ve encountered a substantial performance regression when running the `tidal_array.py` example from the Thetis repository. This may also impact other examples, but this particular example covers a lot of features. I have run the example for 1000 s of simulation time.

Details:

Tested Environment:
- Using a Firedrake image with an older Thetis version from October 2024:
- Using the latest Firedrake and Thetis versions (same `tidal_array.py` example):

Additional Context:
Attachments:
I’ve attached .txt and .svg flamegraphs from both the October 2024 and the latest runs for reference.
- Firedrake 2024-10 - serial: tidal_array_serial_fd2024_10.txt (flamegraph SVG)
- Latest - serial: tidal_array_serial_latest.txt (flamegraph SVG)
- Firedrake 2024-10 - parallel: tidal_array_parallel16proc_latest.txt (flamegraph SVG)
- Latest - parallel: tidal_array_parallel16proc_fd2024_10.txt (flamegraph SVG)
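(These profiles appear to be PETSc flamegraph output rendered to SVG; a sketch of how such files are typically produced, assuming PETSc's flamegraph-compatible log format and the standard flamegraph.pl script were used:)

```bash
# Sketch only - assumes the profiles were produced with PETSc's ascii_flamegraph
# log format and rendered with Brendan Gregg's flamegraph.pl.
python tidal_array.py -log_view :profile.txt:ascii_flamegraph
flamegraph.pl profile.txt > profile.svg
```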