-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#18332: Update BN Kernel #18278
base: main
Are you sure you want to change the base?
#18332: Update BN Kernel #18278
Conversation
180cc5c
to
7b7bd5a
Compare
96ea39e
to
2afcad4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the functionality atm, did it solve your problem with bigger dimensions?
uint32_t num_tiles_read = 0; | ||
for (uint32_t n = start_n; n < N && num_tiles_read < num_tiles; ++n, start_c = 0) { | ||
for (uint32_t c = start_c; c < C && num_tiles_read < num_tiles; ++c, start_t = 0) { | ||
// read a tile from batch_mean | ||
cb_reserve_back(cb_id_batch_mean, onetile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this works, but you could also consider issuing the reads for both the mean tile and the var tile, and adding the barrier for both of them:
cb_reserve_back(cb_id_batch_mean, onetile);
cb_reserve_back(cb_id_batch_var, onetile);
noc_async_read_tile(tile_offset_stat, batch_mean, l1_write_addr);
noc_async_read_tile(tile_offset_stat, batch_var, l1_batch_var_write_addr);
uint32_t l1_write_addr = get_write_ptr(cb_id_batch_mean);
uint32_t l1_batch_var_write_addr = get_write_ptr(cb_id_batch_var);
noc_async_read_barrier();
FILL_TILE_WITH_FIRST_ELEMENT(cb_id_batch_mean);
FILL_TILE_WITH_FIRST_ELEMENT(cb_id_batch_var);
cb_push_back(cb_id_batch_mean, onetile);
cb_push_back(cb_id_batch_var, onetile);
You will be able to amortize both of the reads with each other. The way you have it currently should work too though, so it's fine, and let's get this working completely before you make micro-optimizations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. I will update this change. But this is not solving the hang issue that we are working on. Need to work more on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sjameelTT , I've updated the files
2afcad4
to
1f73719
Compare
Continuation of another PR. Will be merged once CI passes Used for testing
e7d7e82
to
1c06eb2
Compare
|
||
tile_regs_wait(); | ||
tile_regs_commit(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this is a loop of one? Might as well remove the loop unless you expect this to actually change?
Overall I think it makes things clearer for tile_regs_wait() to come after tile_regs_commit() just for readability, even if your setup works for single loop. This loop and how it is set up with the tile_regs call also doesn't work if the loop is > 1.
Ticket
#18332
Problem description
The current BN implementation splits tensor reading between NCRISC (Input tensor) and BRISC (input stat tensor) , while BRISC also handles writing. This stalls BRISC from writing until compute is done, and also writes are more expensive. It also creates performance inconsistencies.
What's changed
Moved all the input reads to reader file
Checklist