Profiler: Add Tracy backend #4300

neobrain · 2025-01-23T21:20:30Z

Overview

The existing GPUVis backend isn't suitable for live-profiling and carries heavy overhead per traced event (a memory allocation and a system call). This makes gathering accurate data for quantitative analysis impossible.

This PR adds a new backend for FEX's profiler interface based on Tracy, a nanosecond resolution profiler with support for live tracing and a rich feature set. Notably, statistics and histograms are generated from profiled zones out-of-the-box.

To use this backend, one of the environment variables FEX_PROFILE_TARGET_NAME
or FEX_PROFILE_TARGET_PATH must be defined to select the application under
profile by name or by path suffix.

Here's an example screenshot of the profiler view while running God of War (a very JIT-time heavy title). I'll post more screenshots as comments below.

Usage

Build FEX with ENABLE_FEXCORE_PROFILER=ON and FEXCORE_PROFILER_BACKEND=tracy
Build the Tracy profiler in External/tracy/profiler using CMake and run using tracy-profiler -a ::
Launch a game using FEX_PROFILE_TARGET_NAME=Celeste.bin.x86_64 (matches app name) or FEX_PROFILE_TARGET_PATH=amd64/SuperMeatBoy (matches a path suffix)
- Using a game's Steam launch options for these is a good way to automatically profile when using a profile-enabled FEX build

Future work

Plotting data

Tracy supports plotting data and creating pretty graphs out of it, which is easy to implement but will need new Profiler interfaces.

Frame markers

Integration with GL/Vulkan library forwarding allows us to detect where frames end, which is useful to recognize stuttering in a recorded profile.

neobrain · 2025-01-23T21:20:45Z

Statistics for events like triggered SMC handling can be computed over any selected time range. This shows how many events happened (102) and plots a histogram of the various durations of each event instance.

neobrain · 2025-01-23T21:20:48Z

Gameplay stuttering can be highlighted by detecting frame ends. Here's a screenshot showing an outlier frame that takes more than 16 ms due to JIT compilation:

(This required integration with libGL/Vulkan forwarding, so it's not included in this PR.)

This differs from the existing GPUVis backend in a number of ways: * Tracy is optimized for minimal overhead and nanosecond-resolution profiling * Tracy supports live tracing (in addition to capture-based operation) * Tracy has a richer feature set and a more polished UI (notably, statistics and histograms are generated out-of-the-box) * GPUVis supports tracing multiple processes, whereas Tracy is single-process only To use this backend, one of the environment variables FEX_PROFILE_TARGET_NAME or FEX_PROFILE_TARGET_PATH must be defined to select the application under profile by name or by path suffix.

…FOR_FORK for games that fork on startup

Sonicadvance1 · 2025-01-27T20:52:38Z

Do you know how long it takes each Tracy event to execute?
For GPUViz instant events, we average ~282ns per instant event. I assume scoped events are a smidge longer due to more string parsing.

neobrain · 2025-01-27T21:10:08Z

Do you know how long it takes each Tracy event to execute?
For GPUViz instant events, we average ~282ns per instant event. I assume scoped events are a smidge longer due to more string parsing.

The author advertises cost in the single-digit nanosecond range (< 3ns per zone) in a reference benchmark. The details are briefly outlined in section 1.7 in the Tracy manual :)

(For FEX it will be marginally larger since each profiling event must be wrapped in a IsProfilerActive() check to ensure profiling only affects the main process.)

Sonicadvance1 · 2025-01-27T21:15:23Z

Do you know how long it takes each Tracy event to execute?
For GPUViz instant events, we average ~282ns per instant event. I assume scoped events are a smidge longer due to more string parsing.

The author advertises cost in the single-digit nanosecond range (< 3ns per zone) in a reference benchmark. The details are briefly outlined in section 1.7 in the Tracy manual :)

(For FEX it will be marginally larger since each profiling event must be wrapped in a IsProfilerActive() check to ensure profiling only affects the main process.)

It would be nice if this could be measured on X13s/X1E or whatever since Snapdragon's cyclecounter doesn't give sub-nanosecond accuracy. At 19.2Mhz you only get ~52ns granularity so we'll need to get averages. So spinning a loop on the markers would be good just to see what it gets down to. I don't see what their measurement methodology is other than just saying it's fast.

pmatos · 2025-01-30T13:02:36Z

Is there a reason why we are adding tracy as an external here instead of using cmake to check the user has the appropriate dependencies installed?

pmatos · 2025-01-30T13:03:49Z

One of the things I noticed is that the Tracy UI is really small on my display for some reason, certainly smaller than the remaining GUI elements. This is just to document the existence of TRACY_DPI_SCALE to adjust the size of the UI.

neobrain · 2025-01-30T15:38:20Z

Is there a reason why we are adding tracy as an external here instead of using cmake to check the user has the appropriate dependencies installed?

Did you check if linking against the system libtracy will produce a functional setup? I think the TRACY_ defines we set up aren't only needed for compiling FEX but for compiling Tracy itself as well, so in that case we really have no choice but to vendor the dependency.

One of the things I noticed is that the Tracy UI is really small on my display for some reason, certainly smaller than the remaining GUI elements. This is just to document the existence of TRACY_DPI_SCALE to adjust the size of the UI.

Seems like lack of automatic DPI detection is a typical Dear ImGui quirk, but good to know!

neobrain · 2025-01-31T10:37:12Z

It would be nice if this could be measured on X13s/X1E or whatever since Snapdragon's cyclecounter doesn't give sub-nanosecond accuracy. At 19.2Mhz you only get ~52ns granularity so we'll need to get averages. So spinning a loop on the markers would be good just to see what it gets down to. I don't see what their measurement methodology is other than just saying it's fast.

I ran this synthetic benchmark with the -DTRACY_ compile flags used by FEX, testing both single events (TracyMessageL) and paired events (ZoneScopedN):

#include <tracy/Tracy.hpp>
#include <chrono>
#include <iostream>

int main() {
  tracy::StartupProfiler();

  auto begin = std::chrono::high_resolution_clock::now();
  auto count = 1000'000'00ull;
  for (decltype(count) i = 0; i < count; ++i) {
    // ZoneScopedN("Hello");
    TracyMessageL("Hello");
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::cout << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / (double)count) << '\n';
}

The results indicate a runtime per-event of 43-44ns (X13s), 31-48ns (M1, seems to vary by power state), ~~31-33ns~~16ns (Ryzen 7 3700X). In real-world workloads this overhead would partly be subsumed by memory accesses performed by the profiled application itself, so this seems consistent with the Tracy manual.

Sonicadvance1 · 2025-01-31T10:40:16Z

It would be nice if this could be measured on X13s/X1E or whatever since Snapdragon's cyclecounter doesn't give sub-nanosecond accuracy. At 19.2Mhz you only get ~52ns granularity so we'll need to get averages. So spinning a loop on the markers would be good just to see what it gets down to. I don't see what their measurement methodology is other than just saying it's fast.

I ran this synthetic benchmark with the -DTRACY_ compile flags used by FEX, testing both single events (TracyMessageL) and paired events (ZoneScopedN):
#include <tracy/Tracy.hpp>
#include <chrono>
#include <iostream>

int main() {
  tracy::StartupProfiler();

  auto begin = std::chrono::high_resolution_clock::now();
  auto count = 1000'000'00ull;
  for (decltype(count) i = 0; i < count; ++i) {
    // ZoneScopedN("Hello");
    TracyMessageL("Hello");
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::cout << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / (double)count) << '\n';
}
The results indicate a runtime per-event of 43-44ns (X13s), 31-48ns (M1, seems to vary by power state), 31-33ns (Ryzen 7 3700X). In real-world workloads this overhead would partly be subsumed by memory accesses performed by the profiled application itself, so this seems consistent with the Tracy manual.

Alright cool, that's roughly indicative of what I'd expect around dumping memory in to a ringbuffer that another thread consumes. So roughly 5-8x the speed of a write in to ftrace if my average was ~260ns on X1E.

External: Add Tracy submodule

5e2c7e9

neobrain force-pushed the feature_profiler_tracy branch from fe94fe0 to 7698229 Compare January 23, 2025 21:24

neobrain force-pushed the feature_profiler_tracy branch from 7698229 to 7650aef Compare January 23, 2025 21:27

neobrain added 2 commits January 24, 2025 17:00

Profiler: Further updates to the Tracy backend; add FEX_PROFILE_WAIT_…

1968e8e

…FOR_FORK for games that fork on startup

Profiler: Fix 32-bit applications

177e9c5

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Profiler: Add Tracy backend #4300

Profiler: Add Tracy backend #4300

neobrain commented Jan 23, 2025 •

edited

Loading

neobrain commented Jan 23, 2025 •

edited

Loading

neobrain commented Jan 23, 2025 •

edited

Loading

Sonicadvance1 commented Jan 27, 2025

neobrain commented Jan 27, 2025 •

edited

Loading

Sonicadvance1 commented Jan 27, 2025

pmatos commented Jan 30, 2025

pmatos commented Jan 30, 2025

neobrain commented Jan 30, 2025

neobrain commented Jan 31, 2025 •

edited

Loading

Sonicadvance1 commented Jan 31, 2025

Profiler: Add Tracy backend #4300

Are you sure you want to change the base?

Profiler: Add Tracy backend #4300

Conversation

neobrain commented Jan 23, 2025 • edited Loading

Overview

Usage

Future work

Plotting data

Frame markers

neobrain commented Jan 23, 2025 • edited Loading

neobrain commented Jan 23, 2025 • edited Loading

Sonicadvance1 commented Jan 27, 2025

neobrain commented Jan 27, 2025 • edited Loading

Sonicadvance1 commented Jan 27, 2025

pmatos commented Jan 30, 2025

pmatos commented Jan 30, 2025

neobrain commented Jan 30, 2025

neobrain commented Jan 31, 2025 • edited Loading

Sonicadvance1 commented Jan 31, 2025

neobrain commented Jan 23, 2025 •

edited

Loading

neobrain commented Jan 23, 2025 •

edited

Loading

neobrain commented Jan 23, 2025 •

edited

Loading

neobrain commented Jan 27, 2025 •

edited

Loading

neobrain commented Jan 31, 2025 •

edited

Loading