Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Profiler: Add Tracy backend #4300

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

neobrain
Copy link
Member

@neobrain neobrain commented Jan 23, 2025

Overview

The existing GPUVis backend isn't suitable for live-profiling and carries heavy overhead per traced event (a memory allocation and a system call). This makes gathering accurate data for quantitative analysis impossible.

This PR adds a new backend for FEX's profiler interface based on Tracy, a nanosecond resolution profiler with support for live tracing and a rich feature set. Notably, statistics and histograms are generated from profiled zones out-of-the-box.

To use this backend, one of the environment variables FEX_PROFILE_TARGET_NAME
or FEX_PROFILE_TARGET_PATH must be defined to select the application under
profile by name or by path suffix.

Here's an example screenshot of the profiler view while running God of War (a very JIT-time heavy title). I'll post more screenshots as comments below.

20250123_fex_tracy_general

Usage

  1. Build FEX with ENABLE_FEXCORE_PROFILER=ON and FEXCORE_PROFILER_BACKEND=tracy
  2. Build the Tracy profiler in External/tracy/profiler using CMake and run using tracy-profiler -a ::
  3. Launch a game using FEX_PROFILE_TARGET_NAME=Celeste.bin.x86_64 (matches app name) or FEX_PROFILE_TARGET_PATH=amd64/SuperMeatBoy (matches a path suffix)
    • Using a game's Steam launch options for these is a good way to automatically profile when using a profile-enabled FEX build

Future work

Plotting data

Tracy supports plotting data and creating pretty graphs out of it, which is easy to implement but will need new Profiler interfaces.

Frame markers

Integration with GL/Vulkan library forwarding allows us to detect where frames end, which is useful to recognize stuttering in a recorded profile.

@neobrain
Copy link
Member Author

neobrain commented Jan 23, 2025

Statistics for events like triggered SMC handling can be computed over any selected time range. This shows how many events happened (102) and plots a histogram of the various durations of each event instance.
20250123_fex_tracy_smc

@neobrain
Copy link
Member Author

neobrain commented Jan 23, 2025

Gameplay stuttering can be highlighted by detecting frame ends. Here's a screenshot showing an outlier frame that takes more than 16 ms due to JIT compilation:

2025_01_23_fex_tracy_frames

(This required integration with libGL/Vulkan forwarding, so it's not included in this PR.)

@neobrain neobrain force-pushed the feature_profiler_tracy branch from fe94fe0 to 7698229 Compare January 23, 2025 21:24
This differs from the existing GPUVis backend in a number of ways:
* Tracy is optimized for minimal overhead and nanosecond-resolution profiling
* Tracy supports live tracing (in addition to capture-based operation)
* Tracy has a richer feature set and a more polished UI (notably, statistics and histograms are generated out-of-the-box)
* GPUVis supports tracing multiple processes, whereas Tracy is single-process only

To use this backend, one of the environment variables FEX_PROFILE_TARGET_NAME
or FEX_PROFILE_TARGET_PATH must be defined to select the application under
profile by name or by path suffix.
@neobrain neobrain force-pushed the feature_profiler_tracy branch from 7698229 to 7650aef Compare January 23, 2025 21:27
@Sonicadvance1
Copy link
Member

Do you know how long it takes each Tracy event to execute?
For GPUViz instant events, we average ~282ns per instant event. I assume scoped events are a smidge longer due to more string parsing.

@neobrain
Copy link
Member Author

neobrain commented Jan 27, 2025

Do you know how long it takes each Tracy event to execute?
For GPUViz instant events, we average ~282ns per instant event. I assume scoped events are a smidge longer due to more string parsing.

The author advertises cost in the single-digit nanosecond range (< 3ns per zone) in a reference benchmark. The details are briefly outlined in section 1.7 in the Tracy manual :)

(For FEX it will be marginally larger since each profiling event must be wrapped in a IsProfilerActive() check to ensure profiling only affects the main process.)

@Sonicadvance1
Copy link
Member

Do you know how long it takes each Tracy event to execute?
For GPUViz instant events, we average ~282ns per instant event. I assume scoped events are a smidge longer due to more string parsing.

The author advertises cost in the single-digit nanosecond range (< 3ns per zone) in a reference benchmark. The details are briefly outlined in section 1.7 in the Tracy manual :)

(For FEX it will be marginally larger since each profiling event must be wrapped in a IsProfilerActive() check to ensure profiling only affects the main process.)

It would be nice if this could be measured on X13s/X1E or whatever since Snapdragon's cyclecounter doesn't give sub-nanosecond accuracy. At 19.2Mhz you only get ~52ns granularity so we'll need to get averages. So spinning a loop on the markers would be good just to see what it gets down to. I don't see what their measurement methodology is other than just saying it's fast.

@pmatos
Copy link
Collaborator

pmatos commented Jan 30, 2025

Is there a reason why we are adding tracy as an external here instead of using cmake to check the user has the appropriate dependencies installed?

@pmatos
Copy link
Collaborator

pmatos commented Jan 30, 2025

One of the things I noticed is that the Tracy UI is really small on my display for some reason, certainly smaller than the remaining GUI elements. This is just to document the existence of TRACY_DPI_SCALE to adjust the size of the UI.

@neobrain
Copy link
Member Author

Is there a reason why we are adding tracy as an external here instead of using cmake to check the user has the appropriate dependencies installed?

Did you check if linking against the system libtracy will produce a functional setup? I think the TRACY_ defines we set up aren't only needed for compiling FEX but for compiling Tracy itself as well, so in that case we really have no choice but to vendor the dependency.

One of the things I noticed is that the Tracy UI is really small on my display for some reason, certainly smaller than the remaining GUI elements. This is just to document the existence of TRACY_DPI_SCALE to adjust the size of the UI.

Seems like lack of automatic DPI detection is a typical Dear ImGui quirk, but good to know!

@neobrain
Copy link
Member Author

neobrain commented Jan 31, 2025

It would be nice if this could be measured on X13s/X1E or whatever since Snapdragon's cyclecounter doesn't give sub-nanosecond accuracy. At 19.2Mhz you only get ~52ns granularity so we'll need to get averages. So spinning a loop on the markers would be good just to see what it gets down to. I don't see what their measurement methodology is other than just saying it's fast.

I ran this synthetic benchmark with the -DTRACY_ compile flags used by FEX, testing both single events (TracyMessageL) and paired events (ZoneScopedN):

#include <tracy/Tracy.hpp>
#include <chrono>
#include <iostream>

int main() {
  tracy::StartupProfiler();

  auto begin = std::chrono::high_resolution_clock::now();
  auto count = 1000'000'00ull;
  for (decltype(count) i = 0; i < count; ++i) {
    // ZoneScopedN("Hello");
    TracyMessageL("Hello");
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::cout << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / (double)count) << '\n';
}

The results indicate a runtime per-event of 43-44ns (X13s), 31-48ns (M1, seems to vary by power state), 31-33ns16ns (Ryzen 7 3700X). In real-world workloads this overhead would partly be subsumed by memory accesses performed by the profiled application itself, so this seems consistent with the Tracy manual.

@Sonicadvance1
Copy link
Member

It would be nice if this could be measured on X13s/X1E or whatever since Snapdragon's cyclecounter doesn't give sub-nanosecond accuracy. At 19.2Mhz you only get ~52ns granularity so we'll need to get averages. So spinning a loop on the markers would be good just to see what it gets down to. I don't see what their measurement methodology is other than just saying it's fast.

I ran this synthetic benchmark with the -DTRACY_ compile flags used by FEX, testing both single events (TracyMessageL) and paired events (ZoneScopedN):

#include <tracy/Tracy.hpp>
#include <chrono>
#include <iostream>

int main() {
  tracy::StartupProfiler();

  auto begin = std::chrono::high_resolution_clock::now();
  auto count = 1000'000'00ull;
  for (decltype(count) i = 0; i < count; ++i) {
    // ZoneScopedN("Hello");
    TracyMessageL("Hello");
  }
  auto end = std::chrono::high_resolution_clock::now();
  std::cout << (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / (double)count) << '\n';
}

The results indicate a runtime per-event of 43-44ns (X13s), 31-48ns (M1, seems to vary by power state), 31-33ns (Ryzen 7 3700X). In real-world workloads this overhead would partly be subsumed by memory accesses performed by the profiled application itself, so this seems consistent with the Tracy manual.

Alright cool, that's roughly indicative of what I'd expect around dumping memory in to a ringbuffer that another thread consumes. So roughly 5-8x the speed of a write in to ftrace if my average was ~260ns on X1E.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants