feat: perf: WASM decode #138
Conversation
Just a superficial look at style and such for now, haven't looked at the details:
There's no point in wrapping the blending buffer inside an `ASS_Image`.

`getBufferSize`, `processImages` and `decodeImage` duplicate functionality already existing in `renderBlend` (sans the superfluous `ASS_Image` wrapper; look at that to see how you can do without the `ASS_Image` wrapping). This also affects `js-blend` iiuc, which should not be the case.
Rather than duplicating functionality, you should split this into two commits: first, only refactor existing code, moving the relevant parts from `renderBlend` into separate function(s). Then, in a second commit, switch `lossy` over to pre-blended buffers using the functions factored out in the previous commit, and leave `js-blend` as is. Or even better, if possible, make `lossy` use `renderBlend` directly.
Also, the commit message needs to describe the what and why, and ideally should follow the style of prior commit messages.
I would intuitively indeed expect this to be faster. But I also remember you writing that blending in JS would be better due to being GPU-accelerated; can I assume you tested both variants on hardware and a browser with GPU acceleration available, and that pre-blending is indeed always faster?
And as this is the main reason for `lossy` to even exist: can this impact the "non-browser-freezing" property of `lossy` by doing more blending synchronously in the worker instead of asynchronously in many `createImageBitmap` calls?
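For context, a minimal sketch of the asynchronous pattern in question (assumed shape and names, not the actual `lossy` code):

```js
// Each decoded ImageData is handed to createImageBitmap, which may rasterize
// off the worker's JS thread, keeping the worker responsive between frames.
// `imageDatas` is a hypothetical array of per-subtitle ImageData objects.
async function toBitmaps(imageDatas) {
  return Promise.all(imageDatas.map((data) => createImageBitmap(data)));
}
```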
Otherwise, what else do you suggest? The blend render result isn't a linked list, and this needs to be a linked list.
- A `getBufferSize` equivalent doesn't exist in `renderBlend`.
- `decodeImage` doesn't exist; I assume you mean `decodeBitmap`.
- A `processImages` equivalent doesn't exist in `renderBlend`.
No functionality is duplicated: `renderBlend` is untouched, and its functionality is completely different from `renderImage`. The reason my code is split into separate functions, unlike `renderBlend`, is that profiling/benchmarking WASM only outputs information about isolated functions (unlike when profiling JS), not about the code inside them. With this split you can see exactly which steps take what amounts of CPU time, which you can't with `renderBlend`.
Yes and no. This is a LOT faster than the old code, but it's not actually much faster than `renderBlend`; if anything it's roughly the same. HOWEVER, with this you can offload all the blending to the GPU via the canvas APIs.
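For illustration, a minimal sketch of what GPU-side blending via the canvas API could look like (assumed names and dimensions, not the actual implementation):

```js
// Draw each decoded subtitle bitmap onto an OffscreenCanvas; the browser
// composites them on the GPU instead of blending pixels on the CPU.
const canvas = new OffscreenCanvas(1920, 1080); // assumed video size
const ctx = canvas.getContext('2d');
ctx.clearRect(0, 0, canvas.width, canvas.height);
for (const { bitmap, x, y } of decodedImages) { // `decodedImages` is hypothetical
  ctx.drawImage(bitmap, x, y); // default source-over compositing
}
```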
Yes; as I described above, it's roughly the same or slightly faster. It would be even faster if Emscripten exposed a HEAPU8Clamped (a `Uint8ClampedArray` view of the heap), but it doesn't, and I do not know how to accomplish that.
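For reference, a hedged sketch of how such a clamped view could be built by hand (pointer and length names are assumptions; note the view is invalidated whenever WASM memory grows):

```js
// Emscripten only exports HEAPU8 (a Uint8Array over WASM memory), but a
// clamped view over the same underlying buffer can be created manually.
function heapU8Clamped(Module, ptr, length) {
  return new Uint8ClampedArray(Module.HEAPU8.buffer, ptr, length);
}

// ImageData requires a Uint8ClampedArray, so this can avoid an extra copy:
// const data = heapU8Clamped(Module, rgbaPtr, w * h * 4); // rgbaPtr assumed
// const image = new ImageData(data, w, h);
```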
Don't misunderstand this PR: it doesn't make any changes to the blending. It ONLY moves the bitmap -> ImageData conversion (alpha -> RGBA conversion/creation) to WASM; all the blending still uses the old functionality. Note: I profiled this on a typeset which output roughly 700 bitmaps per frame, with a data throughput of ~5 GB/s, so we're talking about an extreme case of an extreme case.
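To make the moved step concrete, here is a rough sketch of the kind of per-pixel conversion involved (assumed, simplified shape; not the exact old code, and it ignores the color's own alpha channel for brevity):

```js
// Expand libass's single-channel alpha bitmap into an RGBA buffer using the
// image's fill color (0xRRGGBBAA in libass convention).
function alphaToRGBA(alpha, w, h, color) {
  const out = new Uint8ClampedArray(w * h * 4);
  const r = (color >>> 24) & 0xff;
  const g = (color >>> 16) & 0xff;
  const b = (color >>> 8) & 0xff;
  for (let i = 0; i < w * h; i++) {
    out[i * 4] = r;
    out[i * 4 + 1] = g;
    out[i * 4 + 2] = b;
    out[i * 4 + 3] = alpha[i]; // per-pixel coverage becomes the alpha channel
  }
  return out;
}
```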
TLDR: it's even better than before, because the old bitmap -> image conversion code ran outside of createImageBitmap, meaning it lagged even more than this. The old JS code took roughly 35 ms+ to decode 700 bitmaps; now it takes ~2-3 ms in WASM/C++.
If you want to test this against some existing site which uses JSSO, even the GH pages, here's the artifact for this PR, as the check doesn't output one: https://github.com/ThaUnknown/JavascriptSubtitlesOctopus/actions/runs/2404231441
@TheOneric so what's the next step here, as …
jellyfin#9 This is (original)
This is STILL WASM blending; this PR doesn't do ANYTHING blending related, it's purely bitmap decoding/conversion. This is for the alternative render mode, which uses JS canvas functions to blend the images rather than doing it on the CPU.
My post was about the "linked list struct".
Yes, and you suggested to instead blend the images and still re-assemble them into a linked list? Which is, again, not what this PR is about. All this does is move 10 JS LOC to WASM.
If those changes are ported, you will be able to use that structure. But I agree with you -
Ah yeah, I never even used it, so I never bothered checking. The overhead from those 3 ints is so negligible it really doesn't matter; even though I use `array.unshift` instead of `array.push`, which makes this operation ~100x slower, it's still nothing compared to the overhead of constructing the JS ImageData and Uint8ClampedArray objects. I myself use a unified render result struct, but I didn't want to make this PR too complex, as it's already complex enough.
I suppose that's how you restore the order, which has changed here: src/SubtitleOctopus.cpp, lines 369-370 (at 869d0ee).
You can count the items and "allocate" the array. Yes, two loops, but if it beats `unshift`, then it's worth it. Or rewrite `processImages`, keeping the original order. Tbh, the code that comes to mind doesn't look as clean as the current version.
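For illustration, a minimal sketch of that two-pass suggestion (hypothetical names, not actual project code):

```js
// Walk the linked list once to count items, preallocate the array, then fill
// it from the back; this reproduces unshift's ordering without paying its
// O(n) cost per insertion.
function collectImages(head) {
  let count = 0;
  for (let img = head; img; img = img.next) count++; // pass 1: count
  const out = new Array(count);
  let i = count;
  for (let img = head; img; img = img.next) out[--i] = img; // pass 2: fill backwards
  return out;
}
```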
Sorry, (as noted) I only took a superficial glance and didn't notice that it doesn't blend the images together, but only converts each bitmap to sRGB8. Which, btw, it doesn't do correctly, but neither does the existing code (see the YCbCr Matrix docs in ass_types.h). Still relevant from my previous post:
As a side note to your performance analysis: have you measured, or can you measure, how strict mode affects the performance?

```js
var arr = Array.from({length: 2000}, () => Math.floor(Math.random() * 40));
// Only the below is measured, above is setup
(function (a) {
  "use strict";
  var i;
  for (i = 1; i < arr.length; ++i)
    arr[0] += arr[i];
  return arr[0] - a;
}(20));
```

You should just need to add `"use strict";`.
Because JS is slow? I mean, I think that much is obvious.
Sure, but the extra data from implementing something that's almost exactly the same seems insanely wasteful.
It works quite literally the same way as it did before
I didn't bother; most of the performance lost in this lib is on insanely stupid stuff like what this PR tries to address. It's also out of the scope of this PR, as it doesn't create any new JS; instead, it gets rid of it.
Sounds good, except for the inlining part, which I already explained why I haven't done: profiling a single inlined function, like `renderBlend`, is impossible.
There's no point in profiling …
Tell that to the next person who's going to spend 12 hours trying to figure out why …
This moves the bitmap processing from JS to WASM, reducing the CPU time used by the conversion by ~95% and bringing LossyRender close to the performance of the WASM blend mode.