
#0: Use posted writes for profiler. #17261

Merged: 1 commit merged into main on Jan 30, 2025

jbaumanTT (Contributor)

Problem description

Currently the profiler spends a nontrivial amount of time waiting for writes to DRAM to complete.

What's changed

By posting writes and only waiting for a flush, we save around 0.2 us between ops. The only reason a flush is needed at all is to ensure that later writes to the buffer don't clobber data that is still in the process of being written.
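The distinction above can be illustrated with a toy, single-threaded model (this is not the tt-metal API; `ToyNoc` and its members are invented for illustration). A non-posted write stalls the sender for an acknowledgement round trip from the destination, while a posted write is fire-and-forget: the sender only needs a cheaper flush before reusing the source buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (NOT the tt-metal API) contrasting posted vs. non-posted writes.
struct ToyNoc {
    int in_flight = 0;          // posted writes issued but not yet flushed
    int ack_waits = 0;          // expensive ack round-trip waits incurred
    std::vector<uint32_t> dram; // destination memory

    // Posted write: data goes out, no ack round trip; sender keeps going.
    void write_posted(uint32_t value) {
        dram.push_back(value);
        ++in_flight;
    }

    // Non-posted write: sender stalls until the destination acknowledges.
    void write_nonposted(uint32_t value) {
        dram.push_back(value);
        ++ack_waits;
    }

    // Flush: wait only until posted data has left the local source buffer,
    // so the buffer can be safely reused for the next op's profiler data.
    void posted_writes_flush() {
        in_flight = 0;
    }
};
```

In this model the posted path reaches the same DRAM contents while never paying an ack wait, which is the per-op saving the PR description refers to.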

Checklist

  • Post commit CI passes
  • Blackhole Post commit (if applicable)
  • Model regression CI testing passes (if applicable)
  • Device performance regression CI testing passes (if applicable)
  • (For models and ops writers) Full new models tests pass
  • New/Existing tests provide coverage for changes

@jbaumanTT force-pushed the jbauman/optimizeprofiling branch from 45aed6f to baa26d6 on January 28, 2025 at 22:03
Comment on lines +185 to +195
#if defined(COMPILE_FOR_NCRISC) || defined(COMPILE_FOR_BRISC) || defined(COMPILE_FOR_ERISC) || \
    defined(COMPILE_FOR_IDLE_ERISC)
inline void __attribute__((always_inline)) noc_async_write_posted(
    std::uint32_t src_local_l1_addr, std::uint64_t dst_noc_addr, std::uint32_t size, uint8_t noc = noc_index) {
    WAYPOINT("NAWW");
    DEBUG_SANITIZE_NOC_WRITE_TRANSACTION(noc, dst_noc_addr, src_local_l1_addr, size);
    ncrisc_noc_fast_write_any_len<proc_type, noc_mode>(
        noc, write_cmd_buf, src_local_l1_addr, dst_noc_addr, size, NOC_UNICAST_WRITE_VC, false, false, 1, true, true);
    WAYPOINT("NAWD");
}
#endif
Contributor
Any reason we expose the noc_async_posted_writes_flushed function in dataflow_api, but define this function only in the profiler?

jbaumanTT (Contributor Author)
noc_async_posted_writes_flushed seems more useful, as it's the one type of barrier that happens to be missing, and dataflow_api.h already contains other functions that can issue posted writes. noc_async_write_posted is just one specific write function with a specific set of args that makes sense here; I'm not sure how useful it would be to other code.

Contributor
Another point worth mentioning: in order to make dataflow_api "profilable", the plan is to create a stripped-down, profiler-only NoC API to remove the circular dependency. Keeping this function separate is helpful toward that path.

Contributor
Okay. It would be good to have this in a separate file when that refactor happens. Not a blocker. I also think there are cases where we might want to switch kernels to posted writes (e.g. kernels that mcast data followed by a semaphore can probably post both, since the receiver waits on the semaphore), but we can look at that later.
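The pattern described in this comment can be sketched with another toy, single-threaded model (again, not the tt-metal API; `ToyLink` and its members are invented). The key assumption, implied by the comment, is that the interconnect delivers one sender's writes in order: then a posted data write followed by a posted semaphore increment needs no write barrier on the sender, because the receiver's wait on the semaphore guarantees the data already landed.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Toy model (NOT the tt-metal API): a link with in-order delivery, a
// posting sender, and a receiver that spins on a semaphore.
struct Packet { bool is_sem; uint32_t value; };

struct ToyLink {
    std::deque<Packet> wire;        // packets in flight, delivered in order
    std::vector<uint32_t> dst_mem;  // receiver's data buffer
    uint32_t sem = 0;               // receiver's semaphore

    // Sender side: both operations are posted (fire-and-forget).
    void post_data(uint32_t v) { wire.push_back({false, v}); }
    void post_sem_inc()        { wire.push_back({true, 0}); }

    // Receiver side: drain in-order deliveries until the semaphore fires.
    // Because the semaphore packet was sent after the data packets, all
    // data is in dst_mem by the time this returns -- no barrier needed.
    void wait_sem() {
        while (sem == 0) {
            Packet p = wire.front();
            wire.pop_front();
            if (p.is_sem) { ++sem; } else { dst_mem.push_back(p.value); }
        }
    }
};
```

The in-order-delivery assumption is what makes this safe in the model; whether a real mcast-plus-semaphore kernel can rely on the same ordering is exactly the follow-up question the comment defers.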

@jbaumanTT jbaumanTT merged commit 3992e68 into main Jan 30, 2025
153 of 207 checks passed
@jbaumanTT jbaumanTT deleted the jbauman/optimizeprofiling branch January 30, 2025 17:58