
Serious Performance Regression in tidal_array.py Example Since October 2024 #391

Open
cpjordan opened this issue Feb 4, 2025 · 14 comments

Comments

@cpjordan
Contributor

cpjordan commented Feb 4, 2025

I’ve encountered a substantial performance regression when running the tidal_array.py example from the Thetis repository. This may also impact other examples, but this particular example covers a lot of features. I ran the example for 1000 s of simulation time.

Details:

Tested Environment:

Using a Firedrake image with older Thetis version from October 2024:

  • Serial: 317s
  • Parallel: 73s

Using the latest Firedrake and Thetis versions (same tidal_array.py example):

  • Serial: 1693s (5.3x slower)
  • Parallel: 282s (3.9x slower)

Additional Context:

  • There have been minor updates to turbines.py related to this example, but these changes seem minimal and unlikely to cause this significant slowdown. As per @stephankramer's comments on Issue Inconsistent (and slowing) CI Build Times #383, I don't think this has anything to do with Thetis (happy to be proven wrong!).
  • This performance regression also impacts my larger-scale implementation of this example, which I use in my research—it’s now substantially slower since October 2024.

Attachments:

I’ve attached .txt and .svg flamegraphs from both the October 2024 and the latest runs for reference.

Firedrake 2024-10 - serial

tidal_array_serial_fd2024_10.txt (flamegraph image)

Latest - serial

tidal_array_serial_latest.txt (flamegraph image)

Firedrake 2024-10 - parallel

tidal_array_parallel16proc_latest.txt (flamegraph image)

Latest - parallel

tidal_array_parallel16proc_fd2024_10.txt (flamegraph image)

@cpjordan
Contributor Author

cpjordan commented Feb 4, 2025

@stephankramer - I believe this is related to Firedrake or its dependencies. I thought it better to wait for you to have a look, if you have any capacity, before I tag the Firedrake developers again.

@stephankramer
Contributor

So it looks like the time in MatLUFactorNum is the immediate culprit, having gone from 61.7% of 317 s (= 195 s) to 94% of 1693 s (= 1591 s). This could mean a number of things:

  1. something inside PETSc/MUMPS has changed to make MatLUFactorNum slower
  2. we are not reusing the LU factorisation when solving with the same matrix multiple times (possibly in the adjoint?)
  3. if this involves nonlinear solves, the number of iterations (SNES iterations) has gone up - oh, I see you are running with snes_monitor, so you should be able to tell...

To try and suss this out, could you rerun (just the serial is probably clearest) using the following PETSc options:

  • add `'ksp_view': None` to the solver parameters
  • add -log_view, either as a command line argument to the script or by setting export PETSC_OPTIONS=-log_view before running, and redirect the terminal output

and upload both versions of this log (a minimal sketch follows below).
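For reference, the two steps can be combined along these lines (the log file name is just a placeholder; the 'ksp_view' entry still has to be added to the solver parameters inside the script):

# PETSc prints its performance summary at the end of the run when -log_view
# is set; redirect everything the run writes to the terminal to a file.
export PETSC_OPTIONS=-log_view
python tidal_array.py &> serial_log.txt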

There have been minor updates to turbines.py related to this example, but these changes seem minimal and unlikely to cause this significant slowdown.

I'm sure you are right, but try to run the exact same setup before and after if you weren't already (as far as possible given API changes in Firedrake/Thetis), as it makes it easier to interpret the results - e.g. if nothing has changed, the SNES residuals from the monitor should be exactly the same up to some level of rounding...
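As a rough sketch of that comparison, assuming the terminal output of both runs has been redirected to files (the file names below are placeholders):

# -snes_monitor prints lines of the form "  N SNES Function norm ...";
# extracting them from both logs makes the residual histories easy to diff.
grep "SNES Function norm" serial_log_fd2024_10.txt > residuals_old.txt
grep "SNES Function norm" serial_log_latest.txt > residuals_new.txt
diff residuals_old.txt residuals_new.txt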

@cpjordan
Contributor Author

cpjordan commented Feb 7, 2025

Hi @stephankramer - as suggested I have created a test branch and excluded the commits relating to Firedrake/dependency updates (8c6f9d1, 3e67547, 39880f0, aecda67, 3515ec4 - merge commit, d2ae4fe - merge commit, 6ba89f8). I got rid of the additional callback and just ran the example with two turbine types for 200 s: python regression_test.py -log_view :serial_fd2024.txt:ascii_flamegraph &> serial_log.txt (script attached: regression_test.py).

Results with the Docker image (Oct. '24):

Flame graph
Log

Results with the latest versions:

Flame graph
Log

Please let me know if I have missed anything you needed!

EDIT: added ksp_monitor and snes_monitor to solver parameters - logs updated.

EDIT 2: The turbines make no difference, so it must be in the solver. KSP residuals vary between new and old Firedrake.

@stephankramer
Contributor

Ah, of course the flamegraph output is also using -log_view. Sorry, what I meant was just the default log view: either plain -log_view without any argument, which will print a summary at the end of the run, or, if you want that saved directly to a file, -log_view :serial_fd2024_petsc_log:. This will give us some info on the number of times each routine is called.
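For clarity, the three forms of the option that have come up so far are:

# Flamegraph output (what the earlier runs used):
python regression_test.py -log_view :serial_fd2024.txt:ascii_flamegraph
# Default summary table, printed to the terminal at the end of the run:
python regression_test.py -log_view
# The same summary written directly to a file:
python regression_test.py -log_view :serial_fd2024_petsc_log: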

What is encouraging is that it seems none of the numerical results have significantly changed and the number of nonlinear iterations is exactly the same as well (ruling out option 3 above).

@cpjordan
Contributor Author

cpjordan commented Feb 7, 2025

@stephankramer - here are the second set of logs:

Docker (Oct. '24):
serial_fd2024_petsc_log.txt

Latest:
serial_latest_petsc_log.txt

@stephankramer
Contributor

Hm, so not quite what I expected. First off, everything about MatLUFactorNum seems to be the same: same number of calls, same number of flops - it just takes a lot longer in latest. However, if you scroll down a bit further in these petsc_logs and compare the configure options (i.e. the options PETSc has been built with), the 2024 build is using --COPTFLAGS=-O3 -march=native -mtune=native --CXXOPTFLAGS=-O3 -march=native -mtune=native --FOPTFLAGS=-O3 -march=native -mtune=native. In terms of the options that produces for the C and Fortran compilers (a bit further down), that seems to just change from '-O3' to '-g -O', which I would have thought isn't so different, but maybe it has an effect on the way MUMPS is built? It might be worth trying to rebuild the PETSc you're using with "latest" with the same options as were used in the 2024 container.

Another thought: make sure there's no funny business going on with OpenMP threads (export OMP_NUM_THREADS=1). I don't see any warnings, so I presume that's been set?
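(For completeness, a one-line sketch of the thread pinning meant here, using the standard OpenMP variable:)

# Restrict threaded BLAS/OpenMP code to a single thread so the serial
# timings are comparable between builds.
export OMP_NUM_THREADS=1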

@cpjordan
Contributor Author

cpjordan commented Feb 8, 2025

This pointed towards it being a local issue - I ran the Thetis tests and it took ~55 minutes, pretty much confirming that. Sorry - I should have realised this sooner.

If I use the latest Firedrake docker image my run time is initially 2 minutes 10s, then reduces to 1 minute 40s (relating to the cold/hot cache discussion you were having with Connor?). Using the Oct. '24 docker image, my run time is initially 2 minutes 30s reducing to 1 minute 11s (as per the logs previously). It seems we have still lost a reasonable amount of performance here (71s -> 100s), but not as bad as I thought initially.

I've installed locally with:

export PETSC_CONFIGURE_OPTIONS="--COPTFLAGS=-O3 -march=native -mtune=native --CXXOPTFLAGS=-O3 -march=native -mtune=native --FOPTFLAGS=-O3 -march=native -mtune=native"
curl -O https://raw.githubusercontent.com/firedrakeproject/firedrake/master/scripts/firedrake-install
python3.11 firedrake-install

Once I get to my typical installation failure point with libsupermesh (firedrakeproject/firedrake#3374), I then run:

source firedrake/bin/activate
export PATH=$VIRTUAL_ENV/src/petsc/default/bin:$PATH
firedrake-update --install thetis

I run again: python regression_test.py -log_view :serial_latest_petsc_log: and still find it takes ~6.5 minutes. Log attached - I guess I should transfer this to a Firedrake installation discussion post?

serial_latest_petsc_log.txt

@stephankramer
Contributor

This pointed towards it being a local issue - I ran the Thetis tests and it took ~55 minutes, pretty much confirming that.

Sorry, confirming what?

So if I understand correctly, you see a slow-down (significant but not as dramatic) between container 2024-10 and container latest, but a much larger slow down between container "latest" and a local "latest" build? That is interesting...

Ah, I see there's a subtlety going wrong here:

export PETSC_CONFIGURE_OPTIONS="--COPTFLAGS=-O3 -march=native -mtune=native --CXXOPTFLAGS=-O3 -march=native -mtune=native --FOPTFLAGS=-O3 -march=native -mtune=native"

That doesn't work because it will run something like this

./configure --COPTFLAGS=-O3 -march=native -mtune=native --CXXOPTFLAGS=-O3 -march=native -mtune=native --FOPTFLAGS=-O3 -march=native -mtune=native

whereas what you want is

./configure --COPTFLAGS="-O3 -march=native -mtune=native" --CXXOPTFLAGS="-O3 -march=native -mtune=native" --FOPTFLAGS="-O3 -march=native -mtune=native"

In the first, --COPTFLAGS=-O3, -march=native and -mtune=native are all passed to petsc's configure as separate options (and likewise for the CXX and Fortran flags), so each OPTFLAGS variable is set to -O3 only and the bare -march/-mtune arguments end up being silently ignored. If you look at your latest serial_latest_petsc_log.txt you can see this happening: it has moved the -mtune=native and -march=native seemingly at random in between the other options. So I think you want to use something like

export PETSC_CONFIGURE_OPTIONS='--COPTFLAGS="-O3 -march=native -mtune=native" --CXXOPTFLAGS="-O3 -march=native -mtune=native" --FOPTFLAGS="-O3 -march=native -mtune=native"'
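A quick way to double-check that the flags actually reached PETSc's configure (rather than being split up as above) is to look at the configure options echoed near the end of the -log_view summary, or, assuming the default PETSC_ARCH layout, at the reconfigure script PETSc leaves in the build tree:

# The -log_view summary repeats the configure options near the end;
# the build tree also keeps an exact record of what configure was called with.
grep "Configure options" serial_latest_petsc_log.txt
cat $PETSC_DIR/$PETSC_ARCH/lib/petsc/conf/reconfigure-$PETSC_ARCH.py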

Another difference I see in the petsc log between 10-2024 and both versions of latest is right at the end, in the "Using libraries:" section: "-llapack -lblas" vs. "-lnlapack -lnblas". This is important because you spend most of your time in MUMPS solving matrix systems (as you should), which relies heavily on these two libraries. I'm not entirely sure how that difference comes about (I don't think it comes from the missing -march=native). Could you also upload the petsc log of the run in the latest container, by the way - because basically you want to figure out what the difference is between that one and your local latest build.

But yes, it might be worth asking on the Firedrake side, as they'll be interested to see why you're seemingly not getting optimal performance out of the box with the default build script.

@cpjordan
Contributor Author

cpjordan commented Feb 10, 2025

This pointed towards it being a local issue - I ran the Thetis tests and it took ~55 minutes, pretty much confirming that.

Sorry, confirming what?

That my local build on the machine is much slower than the containers.

So if I understand correctly, you see a slow-down (significant but not as dramatic) between container 2024-10 and container latest, but a much larger slow down between container "latest" and a local "latest" build? That is interesting...

Correct, ignoring the first attempt at running the example I have:

Docker (Oct. '24): 71s
Docker (Latest): 100s
Local build (Latest): ~390s

I will re-install with the updated export commands.

Could you also upload the petsc log of the run in the latest container, by the way - because basically you want to figure out what the difference is between that one and your local latest build.

Here is the log for the latest container: serial_latest_docker_petsc_log.txt - ignore the timing in this instance as I am running other things in the background now.

As you spotted, it seems like "-llapack -lblas" is used in both containers, whereas the local build has "-lnlapack -lnblas", so another thing I'll need to try and change.
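A quick way to check which BLAS/LAPACK a given PETSc build actually links, assuming a shared-library build and the usual PETSC_DIR/PETSC_ARCH layout:

# List the BLAS/LAPACK libraries the PETSc shared library resolves to.
ldd $PETSC_DIR/$PETSC_ARCH/lib/libpetsc.so | grep -i -E "blas|lapack"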

  • With regards to the slow down with respect to my local build on the machine, I'll create a discussion post on Firedrake as this doesn't seem a Thetis problem.
  • I'll ask some colleagues to run this example on their machines to see if they reproduce the same timings/inconsistencies. It seems odd that the build is just as quick as previously now but I see a discrepancy in the containers, so it seems worth checking. If there is still a discrepancy between builds then perhaps this issue should stay open?

@stephankramer
Contributor

So it looks like the (n)lapack and (n)blas difference comes from the fact that locally you have --download-netlib-lapack --with-netlib-lapack-c-bindings as petsc configure options (these are added by the firedrake install script), whereas the container builds do not, and I think they link to a lapack+blas that comes from a system (Ubuntu) package. If you want to build exactly as in the container, you have to first build petsc separately following https://github.com/firedrakeproject/firedrake/blob/master/docker/Dockerfile.env and then run the firedrake install script, telling it where your pre-built petsc is located by exporting PETSC_DIR and PETSC_ARCH and giving the --honour-petsc-dir option, as is done here: https://github.com/firedrakeproject/firedrake/blob/master/docker/Dockerfile.env
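The last step of that route looks roughly like this (the PETSc path is a placeholder; a full worked-through recipe follows in a later comment):

# Point the Firedrake install script at a pre-built PETSc instead of
# letting it download and build its own.
export PETSC_DIR=/path/to/petsc
export PETSC_ARCH=default
python3 firedrake-install --honour-petsc-dir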

It might be, however, that fixing the -march=native option alone already speeds things up, because that option is presumably also used when PETSc builds the netlib lapack and blas.

On the last point, and to answer an earlier comment

If I use the latest Firedrake docker image my run time is initially 2 minutes 10s, then reduces to 1 minute 40s (relating to the #383 (comment)?). Using the Oct. '24 docker image, my run time is initially 2 minutes 30s reducing to 1 minute 11s (as per the logs previously). It seems we have still lost a reasonable amount of performance here (71s -> 100s), but not as bad as I thought initially.

So the thing that slowed down the Thetis CI suite was mostly the time it takes to compile. This compilation time only affects the first time you run a problem (running from a "cold cache"), as the next time it will pick up the compiled code from the cache directory on disk - and within a single run, solving the same equation over multiple timesteps, it only affects the first timestep. This does, however, have a very big impact on the running of the Thetis suite, as it runs all problems in a clean container, so all equations are new, and the tests are typically configured to run just a few timesteps. This is not the case in production runs, where you will be running many timesteps and so compilation time should become negligible. Based on your numbers, it seems that after the fixes that have gone in, the compilation is now in fact faster than it was in 10-2024, but the rest of the run time is slightly slower. So it may well be that on average the test suite runs in roughly the same time, but that's not necessarily representative of production runs - so definitely worth querying the Firedrake devs indeed.
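A simple way to separate the two effects when timing, assuming the firedrake-clean helper that ships with the Firedrake virtualenv is available:

# Wipe the compiled-kernel caches, then time a cold-cache and a hot-cache run.
firedrake-clean
time python tidal_array.py    # first run: includes compilation time
time python tidal_array.py    # second run: compiled kernels come from the disk cache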

@cpjordan
Contributor Author

cpjordan commented Feb 11, 2025

@stephankramer I found that I had to build PETSc myself - just updating the flags to:

./configure --COPTFLAGS="-O3 -march=native -mtune=native" --CXXOPTFLAGS="-O3 -march=native -mtune=native" --FOPTFLAGS="-O3 -march=native -mtune=native"

was not enough, i.e. my results stayed as slow as the default build. Once I built PETSc myself, I got the same time as the container, suggesting it was the (n)lapack+blas causing the slowdown. I have copied my instructions below for later reference. I also asked a colleague to check the containers and a latest default build on a different machine. To summarise the results of running this test:

Dell Precision 5820 Tower X-Series (Ubuntu 20.04.6 LTS)

October ’24 image initial run: 150s
October ’24 image second run: 71s
Current image initial run: 130s
Current image second run: 100s
Local build - default (Latest): ~390s
Local build - own PETSc build (see below): 100s (log for completeness)

HP Z6 G5 Workstation Desktop (Ubuntu 22.04.5 LTS)

For my colleague:
October ’24 image initial run: 109s
October ’24 image second run: 59s
Current image initial run: 105s
Current image second run: 76s
Local build - default: 351s

I'll create a Firedrake discussion now - thank you as always.

Local install - separate PETSc build:

mkdir firedrake2
cd firedrake2
sudo apt update && sudo apt install -y \
    build-essential cmake gfortran git libopenblas-dev \
    python3-dev python3-pip python3-venv python3-tk \
    libtool zlib1g-dev bison flex ninja-build pkg-config
git clone https://github.com/firedrakeproject/petsc.git petsc
git clone https://github.com/firedrakeproject/slepc.git slepc
cd petsc

./configure \
    --COPTFLAGS="-O3 -march=native -mtune=native" \
    --CXXOPTFLAGS="-O3 -march=native -mtune=native" \
    --FOPTFLAGS="-O3 -march=native -mtune=native" \
    --with-c2html=0 \
    --with-debugging=0 \
    --with-fortran-bindings=0 \
    --with-make-np=12 \
    --with-shared-libraries=1 \
    --with-zlib \
    --download-fftw \
    --download-hdf5 \
    --download-hwloc \
    --download-hypre \
    --download-metis \
    --download-mumps \
    --download-mpich \
    --download-mpich-device=ch3:sock \
    --download-netcdf \
    --download-pastix \
    --download-pnetcdf \
    --download-ptscotch \
    --download-scalapack \
    --download-suitesparse \
    --download-superlu_dist \
    --with-strict-petscerrorcode \
    --download-cmake  \
    PETSC_ARCH=default

make PETSC_DIR=/home/s1626662/firedrake2/petsc PETSC_ARCH=default all

Check successful install with make PETSC_DIR=/home/s1626662/firedrake2/petsc PETSC_ARCH=default check

cd ../slepc
export PETSC_DIR=/home/s1626662/firedrake2/petsc
export PETSC_ARCH=default

./configure
make SLEPC_DIR=/home/s1626662/firedrake2/slepc PETSC_DIR=/home/s1626662/firedrake2/petsc PETSC_ARCH=default
Check successful install with make SLEPC_DIR=/home/s1626662/firedrake2/slepc PETSC_DIR=/home/s1626662/firedrake2/petsc check
cd ..
export PETSC_DIR=/home/s1626662/firedrake2/petsc
export PETSC_ARCH=default
curl -O https://raw.githubusercontent.com/firedrakeproject/firedrake/master/scripts/firedrake-install
python3.11 firedrake-install --honour-petsc-dir

I still find my install fails at the libsupermesh step (as before), after which I run:

source firedrake/bin/activate
export PATH=/home/s1626662/firedrake2/petsc/default/bin:$PATH
firedrake-update --install thetis

Check all Firedrake tests pass.

@cpjordan
Contributor Author

cpjordan commented Feb 13, 2025

The slowdown is attributed to Firedrake #3983.

The discussion thread on poor default installation performance is here. Hopefully the default installation performance is fixed by Firedrake #4011, but the remaining slowdown (~40% in this case) requires Firedrake #3983 to be fixed.

EDIT: clarified wording.

@connorjward
Contributor

and hopefully fixed by firedrakeproject/firedrake#4011.

Unfortunately this isn't right. firedrakeproject/firedrake#4011 will remove some of the unneeded PETSc configure options that you thought were causing a slowdown but I am not modifying the caching code.

@cpjordan
Contributor Author

Unfortunately this isn't right. firedrakeproject/firedrake#4011 will remove some of the unneeded PETSc configure options that you thought were causing a slowdown but I am not modifying the caching code.

This second bit was in reference to local installation issues I had, which were causing a more significant slowdown. I just included this because a lot of the discussion above was also around the default local installation (e.g. see below). My "hopefully fixed" was referring to the default installation performance. Sorry for the confusion!

Local build - default (Latest): ~390s
Local build - own PETSc build (see below): 100s (log for completeness)
