Agenda for sync meeting 10/16/2020 #369
Comments
If it's possible, it would be nice to get a better understanding of what implementation challenges V8 is experiencing in implementing v128.const. If it worked effectively rather than through workarounds, it would provide one of the most significant performance increases in the WebAssembly SIMD implementation. It would also be nice to discuss feature detection (#356) and, perhaps as a side effect, the frequency of releases and versions. That might reduce the burden on #343, for instance by setting quarterly releases with semantic versioning, and would allow the standard to continuously evolve.
Unfortunately it isn't possible for us to commit to quarterly releases (or any schedule in particular) because we are beholden to the WebAssembly standardization process. We also discussed feature detection at the last meeting, but I'd be happy to discuss it again if we have any new developments there (and hopefully we will).
Usually there's a way that meets the working group requirements for standardization and still allows you to implement new features with "versioning". In the IETF, we regularly release versioned draft RFCs so that someone can say they implement xyz standard at the draft revision dated yyyymmdd. I can check what the standard practice is for the W3C if you'd like. This is generally a really common occurrence.
If possible, let's visit the items we were not able to get to at the previous meeting before revisiting feature detection 😄
I'd like to propose discussing our use case criteria: #203 (comment)
Re: v128.const - I don't see an observable difference in my benchmark, but for slightly subtle reasons. Given C++ code that uses something like wasm_i32x4_splat(127) in a loop, I see the following codegen behavior (latest v8):

- pre-v128.const: LLVM synthesizes i32.const & i32x4.splat, and v8 generates this:
- post-v128.const: LLVM synthesizes v128.const with 4 equivalent lanes, and v8 generates this:
The second sequence takes 3 cycles whereas the first one takes 2, so when running in the context of a tight loop (which my loop happens to be), I'd expect to see a measurable performance delta. However, in both cases v8 actually lifts the computation outside of the loop, so the difference is nil: the code above executes once in the loop prologue. Manually transplanting the computation back into the loop body for the purposes of profiling with llvm-mc shows 4.68 cycles per iteration without v128.const and 4.95 cycles per iteration with v128.const, so some of the cost is hidden by other instructions, but the total impact should have been a 6% degradation in throughput. Because v8 lifts this outside of the loop, the impact isn't observable on my kernels. I'm still concerned about the potential to lose performance here when v8 doesn't figure it out, but just wanted to close the loop (ha!) here. llvm-mc study for my kernel (which runs at circa 12 GB/s): https://gcc.godbolt.org/z/YYWf1d
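For concreteness, here is a minimal sketch of the kind of loop being described (this is not the actual benchmark kernel; the function and buffer names are made up):

```cpp
#include <stdint.h>
#include <wasm_simd128.h>

// Made-up kernel illustrating the pattern above: a splatted constant used in
// a tight SIMD loop. Pre-v128.const toolchains lower the constant to
// i32.const + i32x4.splat; newer toolchains emit a single v128.const.
void add_bias(int32_t* dst, const int32_t* src, int n) {
  for (int i = 0; i + 4 <= n; i += 4) {
    v128_t v = wasm_v128_load(src + i);
    v128_t bias = wasm_i32x4_splat(127);  // loop-invariant; hoisted in the codegen discussed above
    wasm_v128_store(dst + i, wasm_i32x4_add(v, bias));
  }
}
```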
@zeux @tlively Here's a godbolt that shows how to force loading a constant from memory: https://godbolt.org/z/d1rvfv. LLVM doesn't always get the cost modeling right (I've seen it decide to reload the constant a couple of times inside a loop), but it is possible to do if you know you want it to work that way.
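Since the godbolt contents aren't reproduced above, here is a rough sketch of the volatile approach being referred to (it may not match the linked example exactly; the helper name and constant value are made up):

```cpp
#include <wasm_simd128.h>

// Made-up helper: declaring the constant volatile keeps it in memory, so
// LLVM emits a v128.load at the use site rather than synthesizing the value
// with v128.const (or an i32.const + i32x4.splat pair).
static const volatile v128_t k127 = {127, 127, 127, 127};

static inline v128_t load_k127_from_memory() {
  return k127;  // volatile load: not constant-folded by the optimizer
}
```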
Yeah, volatile is often a reasonable workaround for codegen issues, although you have to apply it carefully so as not to cause extra loads. I'm not an expert on C volatile semantics, but I believe the compiler may be required to disable elimination of duplicate loads from a volatile. @ngzhian So I understand that RIP-relative loads require more work on the v8 side that may not be trivial, but would it be possible to, as part of v128.const codegen, identify cases where the constant is a splat (all lanes equal) and lower those to the existing splat sequence?

If v8 did this, it would at least equalize the performance and size of the generated code between the cases where LLVM decides to emit v128.const and the cases where it decides to synthesize the constant. It would still result in suboptimal lowering in some cases compared to a RIP-relative load, but this would be better than the status quo and hopefully easy to implement?
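As a purely illustrative sketch of the lane-matching check being suggested here (hypothetical code, not actual v8/TurboFan internals):

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical check: given the 16 immediate bytes of a v128.const, report
// whether all four 32-bit lanes hold the same value. If they do, a backend
// could emit the cheaper scalar-constant + splat sequence instead of
// materializing the full 128-bit constant.
static bool IsI32x4SplatConstant(const std::array<std::uint8_t, 16>& imm,
                                 std::uint32_t* lane_value) {
  std::uint32_t lanes[4];
  std::memcpy(lanes, imm.data(), sizeof(lanes));
  for (int i = 1; i < 4; ++i) {
    if (lanes[i] != lanes[0]) return false;
  }
  *lane_value = lanes[0];
  return true;
}
```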
Good to know, I'm not that familiar with the actual optimizations happening in the TurboFan engine :) (just checking: is v8 doing the lifting, or is it emscripten/binaryen?)
I agree this will be useful and hope that we can dedicate time to properly work this out. Thank you for understanding 👍
Your suggestions are very reasonable; it sounds a lot like the shuffle matching we already did :) I have https://crbug.com/v8/10980 tracking loading constants from memory, and I have also filed https://crbug.com/v8/11033 to track your suggestion. Thanks!
Thanks everyone! Here are the notes.
We're moving to a biweekly schedule for these syncs, so the next meeting will be Friday, October 16, 9:00AM - 10:00AM PDT (6:00PM - 7:00PM CEST). Please respond with agenda items you would like to discuss.
If this meeting doesn't already appear on your calendar, or you are a new attendee, please fill out this form to attend.
Carryover items from the last meeting include:
- `uN` types for `ImmLaneIdxM` representations (#256)