Evaluate more advanced optimizations like LTO, PGO, PLO #141

zamazan4ik · 2024-09-05T17:51:28Z

Hi!

I just read an article about Harper on Reddit - nice work! I have a few ideas that might be worth trying with Harper to improve its performance and binary size.

First, I noticed that Link-Time Optimization (LTO) is not enabled. Have you tried enabling it for the project? It can noticeably reduce the binary size, and it lets the compiler optimize across crate boundaries, which usually helps runtime performance too. If you think that enabling LTO in the default "Release" profile would hurt the developer experience too much (it increases build times), you can create a dedicated build profile such as "advanced_release" or "dist" - many projects enable LTO exactly this way.
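
One way such a dedicated profile could look is sketched below. The profile name `dist` comes from the suggestion above; the specific settings (`lto = "fat"`, `codegen-units = 1`) are assumptions for illustration, not Harper's actual configuration.

```toml
# Cargo.toml at the workspace root (hypothetical fragment)
[profile.dist]
inherits = "release"  # custom profiles must name a base profile
lto = "fat"           # assumption: full LTO; "thin" trades some optimization for faster links
codegen-units = 1     # assumption: fewer units give the optimizer a wider view, at the cost of build time
```

It would be selected with `cargo build --profile dist`, so a plain `cargo build --release` keeps its current build times.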

Secondly, after LTO I highly recommend taking a look at PGO (Profile-Guided Optimization). This optimization gives the compiler more information about how the program actually executes. Based on this, the compiler can make better optimization decisions, which usually improves runtime performance. I collect as many materials about PGO as I can in my repo - https://github.com/zamazan4ik/awesome-pgo . There you can read more about actual PGO benchmarks in various software (parsers, compilers, databases, etc.). I also highly recommend reading the (still unfinished) article/book about PGO - it can answer many of your possible questions.

I also performed some quick PGO benchmarks for the project based on its built-in benchmarks.

Test environment

Fedora 40
Linux kernel 6.10.7
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.80.1
harper version: master branch on commit ccf14d1535c2f1450b42027afac2a8446f98e11d
Disabled Turbo boost

I use taskset -c 0 to pin the benchmarks to a single core and reduce OS scheduler noise (as far as I can guarantee, of course). For PGO, I use the cargo-pgo tool.

I got the following results.
Release (taskset -c 0 cargo bench --workspace --all-features):

     Running benches/parse_demo.rs (target/release/deps/parse_demo-04215e47acae334a)
parse_demo              time:   [31.299 µs 31.409 µs 31.569 µs]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

lint_demo               time:   [397.99 µs 398.06 µs 398.13 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

lint_demo_uncached      time:   [36.969 ms 36.974 ms 36.979 ms]
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

PGO optimized compared to Release (taskset -c 0 cargo pgo optimize bench -- --workspace --all-features):

     Running benches/parse_demo.rs (target/x86_64-unknown-linux-gnu/release/deps/parse_demo-e8f3360d7fa72eaf)
Benchmarking parse_demo
Benchmarking parse_demo: Warming up for 3.0000 s
Benchmarking parse_demo: Collecting 100 samples in estimated 5.0619 s (192k iterations)
Benchmarking parse_demo: Analyzing
parse_demo              time:   [26.031 µs 26.092 µs 26.187 µs]
                        change: [-17.014% -16.795% -16.606%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking lint_demo
Benchmarking lint_demo: Warming up for 3.0000 s
Benchmarking lint_demo: Collecting 100 samples in estimated 5.9111 s (15k iterations)
Benchmarking lint_demo: Analyzing
lint_demo               time:   [400.42 µs 400.76 µs 401.27 µs]
                        change: [+0.6026% +0.6648% +0.7513%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking lint_demo_uncached
Benchmarking lint_demo_uncached: Warming up for 3.0000 s
Benchmarking lint_demo_uncached: Collecting 100 samples in estimated 9.0357 s (200 iterations)
Benchmarking lint_demo_uncached: Analyzing
lint_demo_uncached      time:   [45.079 ms 45.095 ms 45.112 ms]
                        change: [+21.919% +21.966% +22.014%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

(just for reference) PGO instrumented compared to Release (taskset -c 0 cargo pgo bench -- --workspace --all-features):

     Running benches/parse_demo.rs (target/x86_64-unknown-linux-gnu/release/deps/parse_demo-e8f3360d7fa72eaf)
Benchmarking parse_demo
Benchmarking parse_demo: Warming up for 3.0000 s
Benchmarking parse_demo: Collecting 100 samples in estimated 5.2021 s (71k iterations)
Benchmarking parse_demo: Analyzing
parse_demo              time:   [73.345 µs 73.424 µs 73.564 µs]
                        change: [+133.51% +134.08% +134.49%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  8 (8.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

Benchmarking lint_demo
Benchmarking lint_demo: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
Benchmarking lint_demo: Collecting 100 samples in estimated 6.1270 s (5050 iterations)
Benchmarking lint_demo: Analyzing
lint_demo               time:   [1.1247 ms 1.1250 ms 1.1253 ms]
                        change: [+182.46% +182.58% +182.70%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

Benchmarking lint_demo_uncached
Benchmarking lint_demo_uncached: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, or reduce sample count to 90.
Benchmarking lint_demo_uncached: Collecting 100 samples in estimated 5.2008 s (100 iterations)
Benchmarking lint_demo_uncached: Analyzing
lint_demo_uncached      time:   [54.176 ms 54.230 ms 54.298 ms]
                        change: [+46.517% +46.671% +46.859%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

According to the results, PGO can improve the library's performance further: parse_demo becomes about 17% faster, and lint_demo stays within the noise threshold. However, lint_demo_uncached regresses by about 22%. I suspect this is caused by a skew in the training profile between the different workloads, or something like that - more experiments are needed in this area. In the meantime, maybe this PGO-related information will be helpful for other performance-oriented users.

After PGO, I suggest evaluating PLO (Post-Link Optimization) with LLVM BOLT as an additional optimization step. However, I recommend trying it only after PGO, since PGO usually gives better results than PLO in practice for now.

Regarding priorities: I highly suggest enabling LTO now. PGO and PLO, IMHO, can wait (spending that time on actual features is probably the better option, since enabling PGO and PLO, plus the possible CI pipeline tweaks, can take a lot of effort).

Thank you!

zamazan4ik · 2025-01-08T04:20:02Z

zamazan4ik
Jan 8, 2025
Author

Regarding LTO, I ran some quick local tests to see how much LTO reduces the binary size. I didn't measure performance improvements in this test, since that is harder to do properly than a binary size comparison. Env: Fedora 41, Rust 1.83, the latest version of the project at the moment. I added Fat LTO + codegen-units = 1 to the current Release profile.

harper-ls: from 33 MiB to 32 MiB
harper-cli: from 30 MiB to 29 MiB
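
For anyone who wants to repeat the size comparison, the change described above (Fat LTO plus `codegen-units = 1` in the Release profile) would roughly correspond to the fragment below. The exact lines used in the test are not shown in this thread, so treat it as a sketch.

```toml
# Cargo.toml at the workspace root (sketch of the tested settings)
[profile.release]
lto = "fat"        # Fat LTO, as described above
codegen-units = 1  # single codegen unit, as described above
```

Comparing the sizes of the built binaries under `target/release` before and after the change should be enough to check the effect.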

elijah-potter · 2025-01-08T22:07:24Z

elijah-potter
Jan 8, 2025
Maintainer

Oh my goodness! @zamazan4ik, I read your post when you first put it out but I must have forgotten to reply. I'm sorry about that.

If you'd like to open a PR to enable LTO for harper-ls and harper-cli, go ahead, and we'll get it up and running.

1 reply

zamazan4ik Jan 9, 2025
Author

No worries - we are all human and it's fine to forget things ;)

The corresponding PR is here: #363
