Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#18332: Update BN Kernel #18278

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

#18332: Update BN Kernel #18278

wants to merge 4 commits into from

Conversation

VirdhatchaniKN
Copy link
Contributor

@VirdhatchaniKN VirdhatchaniKN commented Feb 25, 2025

Ticket

#18332

Problem description

The current BN implementation splits tensor reading between NCRISC (Input tensor) and BRISC (input stat tensor) , while BRISC also handles writing. This stalls BRISC from writing until compute is done, and also writes are more expensive. It also creates performance inconsistencies.

What's changed

Moved all the input reads to reader file

Checklist

@VirdhatchaniKN VirdhatchaniKN changed the title #0: Update #18332: Update BN Kernel Feb 26, 2025
@VirdhatchaniKN VirdhatchaniKN force-pushed the virdhatchani/update_bn branch 3 times, most recently from 96ea39e to 2afcad4 Compare February 26, 2025 14:41
Copy link
Contributor

@sjameelTT sjameelTT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the functionality atm, did it solve your problem with bigger dimensions?

uint32_t num_tiles_read = 0;
for (uint32_t n = start_n; n < N && num_tiles_read < num_tiles; ++n, start_c = 0) {
for (uint32_t c = start_c; c < C && num_tiles_read < num_tiles; ++c, start_t = 0) {
// read a tile from batch_mean
cb_reserve_back(cb_id_batch_mean, onetile);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this works, but you could also consider issuing the reads for both the mean tile and the var tile, and adding the barrier for both of them:

cb_reserve_back(cb_id_batch_mean, onetile);
cb_reserve_back(cb_id_batch_var, onetile);
noc_async_read_tile(tile_offset_stat, batch_mean, l1_write_addr);
noc_async_read_tile(tile_offset_stat, batch_var, l1_batch_var_write_addr);
uint32_t l1_write_addr = get_write_ptr(cb_id_batch_mean);
uint32_t l1_batch_var_write_addr = get_write_ptr(cb_id_batch_var);
noc_async_read_barrier();
FILL_TILE_WITH_FIRST_ELEMENT(cb_id_batch_mean);
FILL_TILE_WITH_FIRST_ELEMENT(cb_id_batch_var);
cb_push_back(cb_id_batch_mean, onetile);
cb_push_back(cb_id_batch_var, onetile);

You will be able to amortize both of the reads with each other. The way you have it currently should work too though, so it's fine, and let's get this working completely before you make micro-optimizations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I will update this change. But this is not solving the hang issue that we are working on. Need to work more on that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sjameelTT , I've updated the files

@VirdhatchaniKN VirdhatchaniKN marked this pull request as ready for review February 26, 2025 23:55

tile_regs_wait();
tile_regs_commit();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this is a loop of one? Might as well remove the loop unless you expect this to actually change?

Overall I think it makes things clearer for tile_regs_wait() to come after tile_regs_commit() just for readability, even if your setup works for single loop. This loop and how it is set up with the tile_regs call also doesn't work if the loop is > 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants