
feat: perf: WASM decode #138

Closed
wants to merge 1 commit into from

Conversation

ThaUnknown

This moves the bitmap processing from JS to WASM, reducing the CPU time used by the conversion by ~95% and bringing LossyRender close to the performance of the WASM blend mode.
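For context, this is roughly the kind of per-bitmap Alpha -> RGBA conversion loop that gets moved out of JS; a minimal sketch only, with illustrative names rather than the actual JSSO functions:

function bitmapToImageData (heap, bitmapPtr, w, h, stride, color) {
  // libass bitmaps are 8-bit alpha masks; the fill colour comes from the
  // ASS_Image color field (RGBA, where A is 255 - opacity)
  const r = (color >>> 24) & 0xff;
  const g = (color >>> 16) & 0xff;
  const b = (color >>> 8) & 0xff;
  const a = 255 - (color & 0xff);
  const rgba = new Uint8ClampedArray(w * h * 4);
  for (let y = 0; y < h; ++y) {
    for (let x = 0; x < w; ++x) {
      const out = (y * w + x) * 4;
      rgba[out] = r;
      rgba[out + 1] = g;
      rgba[out + 2] = b;
      rgba[out + 3] = heap[bitmapPtr + y * stride + x] * a / 255;
    }
  }
  return new ImageData(rgba, w, h);
}

Running a loop like this in JS for hundreds of bitmaps per frame is what this change replaces with a WASM-side conversion.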

Member

@TheOneric TheOneric left a comment


Just a superficial look at style and such for now, haven't looked at the details:

  • There's no point in wrapping the blending buffer inside an ASS_Image.
  • getBufferSize, processImages and decodeImage duplicate functionality already existing in renderBlend (sans the superfluous ASS_Image wrapper; look at that to see how you can do without the ASS_Image wrapping).
  • This also affects js-blend, if I understand correctly, which should not be the case.

Rather than duplicating functionality, you should split this into two commits: first, only refactor existing code, moving the relevant parts from renderBlend into separate function(s). Then, in a second commit, switch lossy over to pre-blended buffers using the functions factored out in the previous commit and leave js-blend as is.
Or, even better, if possible, make lossy use renderBlend directly.

Also the commit message needs to describe what and why and ideally should follow the style from prior commit messages.

I would intuitively indeed expect this to be faster. But I also remember you writing that blending in JS would be better due to being GPU-accelerated; can I assume you tested both variants on hardware and a browser with GPU acceleration available, and that pre-blending is indeed always faster?

And as this is the main reason for lossy to even exist: Can this impact the “non-browser-freezing” property of lossy by doing more blending synchronously in the worker now, instead of asynchronously in many createImageBitmaps?

Otherwise

@ThaUnknown
Author

ThaUnknown commented May 29, 2022

@TheOneric

There's no point in wrapping the blending buffer inside an ASS_Image.

What else do you suggest? The blend render result isn't a linked list; this needs to be a linked list.

getBufferSize, processImages and decodeImage duplicate functionality already existing in renderBlend

A getBufferSize equivalent doesn't exist in renderBlend.
renderBlend creates a single image; here we create a linked list of N image data objects, not just a single element.

decodeImage doesn't exist; I assume you mean decodeBitmap.
renderBlend blends the previous bitmap's colors with the next bitmap's colors; this writes colors into the image data at the current list position, then moves to the next element and does the same for it separately.

A processImages equivalent doesn't exist in renderBlend.
Again, renderBlend creates a single image; this creates a linked list of image data.
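To illustrate the difference, the worker side ends up walking a linked list of per-bitmap results and building one ImageData per node, roughly like the sketch below (the accessor names are hypothetical, not the actual exported API):

function collectImageData (firstPtr) {
  const images = [];
  let ptr = firstPtr;
  while (ptr !== 0) {
    const w = Module.getResultWidth(ptr);     // hypothetical accessors for the
    const h = Module.getResultHeight(ptr);    // fields of one list node
    const buf = Module.getResultBuffer(ptr);  // pointer to the RGBA buffer in the WASM heap
    const rgba = new Uint8ClampedArray(Module.HEAPU8.buffer.slice(buf, buf + w * h * 4));
    images.push({
      x: Module.getResultX(ptr),
      y: Module.getResultY(ptr),
      image: new ImageData(rgba, w, h)
    });
    ptr = Module.getResultNext(ptr);          // follow the ->next link
  }
  return images;
}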

Rather than duplicating functionality

No functionality is duplicated.

renderBlend is untouched; its functionality is completely different from renderImage.

The reason my code is split into separate functions, unlike renderBlend, is that profiling/benchmarking WASM only outputs information about isolated functions [unlike when profiling JS] and not about the code inside the functions, meaning with this split you can see exactly which steps take what amount of CPU time, unlike with renderBlend.

I would intuitively indeed expect this to be faster.

Yes and no. This is a LOT faster than the old code, but it's not actually much faster than renderBlend; if anything it's roughly the same. HOWEVER, with this you can offload all blending to the GPU via the canvas APIs, with createImageBitmap and drawImage, essentially splitting half the work off to the GPU, making the website that's actually running the code perform better while the subs perform roughly the same. This is however only true from my testing on the V2 version, as your version outputs inaccurate debug samples x).

blending in JS would be better due to being GPU-accelerated

Yes, as I described above it's roughly faster. It would be even faster if Emscripten exposed a HEAPU8Clamped, which would be a Uint8ClampedArray, but it doesn't, and I do not know how to accomplish that goal.

pre-blending is indeed always faster

Don't misunderstand this PR: it doesn't make any changes to the blending, it ONLY moves the bitmap -> imagedata conversion [Alpha -> RGBA conversion/creation] to WASM; all the blending still uses the old functionality.
This code doesn't do the blending; the blending is done with async rendering via ctx.drawImage(imagebitmap).
As described above, it's lighter on the CPU by offloading some of the work to the GPU, keeping the page snappier while keeping the performance roughly the same.
The biggest bottleneck right now is actually garbage collection after constructing the Uint8ClampedArray and ImageData. That's why, if we got Emscripten to somehow expose a HEAPU8Clamped alongside the already existing var HEAP, buffer, HEAP8, HEAPU8, HEAP16, HEAPU16, HEAP32, HEAPU32, HEAPF32, HEAPF64;, we could further reduce the garbage collection, as there would be one massive array fewer to construct. This is not as visible in Chrome, which handles GC quite well, but it's very visible in Firefox, which every 15 seconds froze for ~3 seconds just to do garbage collection.
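A minimal sketch of the lossy path as described above, with hypothetical helper names: the worker turns each RGBA buffer into an ImageBitmap, and the main thread composites them with drawImage, which is where the (typically GPU-accelerated) blending happens.

// Worker side: the Uint8ClampedArray and ImageData allocations here are the GC cost mentioned above
async function bitmapFromHeap (ptr, w, h) {
  const rgba = new Uint8ClampedArray(Module.HEAPU8.buffer.slice(ptr, ptr + w * h * 4));
  return createImageBitmap(new ImageData(rgba, w, h));
}

// Main thread: blending happens implicitly when the bitmaps are composited onto the canvas
function paintFrame (ctx, parts) {
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height);
  for (const { bitmap, x, y } of parts) {
    ctx.drawImage(bitmap, x, y);
    bitmap.close();
  }
}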

Note: I profiled this on a typeset which output roughly 700 bitmaps per frame, and the data throughput was ~5 GB/s, so we're talking about an extreme case of an extreme case.

And as this is the main reason for lossy to even exist: Can this impact the “non-browser-freezing”

TL;DR: it's even better than before, because the old bitmap->image conversion code ran outside of createImageBitmap, meaning it lagged even more than this; the old code took roughly 35 ms+ in JS to decode 700 bitmaps, and now it takes ~2-3 ms in WASM/C++.
I think this is a big misunderstanding? createImageBitmap being async doesn't mean it prevents the browser from freezing. AFAIK the current version of JSSO doesn't handle frame skipping correctly [which is again one of the things I fixed in V2]. Especially considering the image bitmaps are sent to the main thread instead of being rendered offscreen, JSSO will keep pushing more data for the main thread to draw regardless of whether it's lagging [can't keep up] or not, as it doesn't wait for the current frames to finish rendering before requesting more. In my own fork I handle this by completely skipping frame render requests until the last frame finishes painting.
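A hedged sketch of that frame-skipping idea (not the actual fork code; worker, ctx and paintFrame are assumed to exist): new render requests are simply dropped while the previous frame is still being painted.

let painting = false;

function requestFrame (currentTime) {
  if (painting) return; // skip this frame instead of queueing it behind a lagging paint
  painting = true;
  worker.postMessage({ target: 'render', time: currentTime });
}

worker.addEventListener('message', async ({ data }) => {
  if (data.target !== 'render') return;
  await paintFrame(ctx, data.parts); // hypothetical paint routine, see the sketch above
  painting = false;                  // only now accept the next render request
});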

@ThaUnknown
Author

ThaUnknown commented May 29, 2022

The biggest bottleneck right now is actually garbage collection

[screenshot: profiler samples]
These are samples from my branch, but the same principles apply [you won't see the HEAPU8.buffer.slice here, as I handle the heap differently in JS to avoid duplicating data]. You can roughly see that, from JS, createImageBitmap takes the most time [which you can't change], followed by garbage collection, then constructing the ImageData object. Of course renderImageData is also there, but that's the WASM code, which essentially waits for libass to spit out the bitmaps and converts them to RGBA image data [not to be confused with the JS ImageData object], and it takes close to no time compared to the JS code.
[screenshot: profiler samples]

@ThaUnknown
Author

If you want to test this against some existing site which uses JSSO, even the GH pages, here's the artifact for this PR, as the check doesn't output one: https://github.com/ThaUnknown/JavascriptSubtitlesOctopus/actions/runs/2404231441

@ThaUnknown
Author

@TheOneric so what's the next step here? "There's no point in wrapping the blending buffer inside an ASS_Image." isn't really valid, unless you want me to create another custom struct that does everything the same as ASS_Image, just without stride and color, which are just 2 int fields. I'd rather not do that, as it adds a lot of JS code in the worker.

@ThaUnknown ThaUnknown requested a review from TheOneric May 29, 2022 22:37
@dmitrylyzo
Contributor

dmitrylyzo commented May 30, 2022

What else do you suggest? the blend render result isn't a linked list, this needs to be a linked list

jellyfin#9: this is the (original) smart-blend part in our fork.

@ThaUnknown
Author

What else do you suggest? The blend render result isn't a linked list; this needs to be a linked list.

jellyfin#9: this is the (original) smart-blend part in our fork.

This is STILL WASM blending; this PR doesn't do ANYTHING blending-related, it's purely bitmap decoding/conversion. This is for the alternative render mode, which uses JS canvas functions to blend the images rather than doing it on the CPU.

@dmitrylyzo
Contributor

My post was about "linked list struct".

@ThaUnknown
Author

My post was about "linked list struct".

Yes, and you suggested to instead blend the images, and still re-assemble them into a linked list? Which is, again, not what this PR is about. All this is, is simply moving 10 JS LOC to WASM.

@dmitrylyzo
Contributor

yes, and you suggested to instead blend the images, and still re-assemble them into a linked list?

If those changes are ported, you will be able to use that structure. But I agree with you - ASS_Image is sufficient, and even RenderBlendPart can be replaced with it.

ASS_Image also has an enum type field. So (most likely) +3 ints.

@ThaUnknown
Author

ThaUnknown commented May 30, 2022

yes, and you suggested to instead blend the images, and still re-assemble them into a linked list?

If those changes are ported, you will be able to use that structure. But I agree with you - ASS_Image is sufficient, and even RenderBlendPart can be replaced with it.

ASS_Image also has an enum type field. So (most likely) +3 ints.

Ah yeah, I never even used it, so I never bothered checking, but the overhead from those 3 ints is so negligible it really doesn't matter. Even though I use array.unshift instead of array.push, which makes this operation x100 slower, it's still nothing compared to the overhead of constructing the JS ImageData and Uint8Clamped objects.

I myself use a unified render result struct, but I didn't want to make this PR too complex, as it's already too complex [??]

@dmitrylyzo
Contributor

dmitrylyzo commented May 30, 2022

I use array.unshift instead of array.push

I suppose that's how you restore order, which has changed here:

result->next = renderResult;
renderResult = result;

You can count the items and "allocate" the array. Yes, 2 loops, but if it beats unshift, then it's worth it.
Or rewrite processImages, keeping the original order. Tbh, the code that comes to mind doesn't look as clean as the current version.
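A quick sketch of that two-pass idea: count the (reversed) list once, then fill the array back to front so the original order is restored without Array#unshift. nextOf and decode are placeholders for whatever accessors the real code uses:

function toOrderedArray (headPtr, nextOf, decode) {
  let count = 0;
  for (let p = headPtr; p !== 0; p = nextOf(p)) ++count;
  const out = new Array(count);
  let i = count - 1;
  for (let p = headPtr; p !== 0; p = nextOf(p)) out[i--] = decode(p);
  return out;
}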

@TheOneric
Member

Sorry, (as noted) I only took a superficial glance and didn't notice that it doesn't blend the images together, but only converts each bitmap to sRGB8. Which, btw, it doesn't do correctly, but neither does the existing code (see the YCbCr Matrix docs in ass_types.h).
I believe changing the function names to convertImages and convertBitmap would be good to make their purpose clearer. Also, getBufferSize is too general a name for something that doesn't apply to all modes. Inlining it into the single calling function might be even better.

Still relevant from my previous post:

  • The Commit message must be improved to explain what and why. Ideally it should also follow the style of our other commit messages.
  • It must not abuse ASS_Image like this. Types also signal what the data is and how it should be used; this fills the struct with completely different data than what ASS_Image is documented to hold. The struct from the tiled blending patches is a much better fit.
  • This also affects js-blending. Even if it doesn't pre-blend, I'm not sure I like that, but I will leave the decision to a maintainer more knowledgeable about the whole pipeline and JS.

As a side note to your performance analysis: have you measured, or can you measure, how strict mode affects performance?
This stackoverflow post suggests there's a significant difference for some operations at least in V8 and a quick check on https://jsbench.me seems to support this (albeit now "only" ~20% in Chromium; none in Firefox 91ESR):

var arr = Array.from({length: 2000}, () => Math.floor(Math.random() * 40));
// Only the below is measured, above is setup
(function (a) {
	"use strict";
	var i;
	for (i = 1; i < arr.length; ++i)
		arr[0] += arr[i];
	return arr[0] - a;
}(20));

You should just need to add "use strict"; to src/subtitles-octopus.js, and if on top of #139, add -s STRICT_JS=1 to the global LDFLAGS; on current master I believe adding it to EMCC_COMMON_ARGS will also work. (iirc patches to make stuff work in strict mode were applied a couple of months ago)
If there's a gain, or at least no loss, we probably want to enable this even if this instance of array processing moves to WASM.

@TheOneric TheOneric removed their request for review June 2, 2022 23:42
@ThaUnknown
Author

ThaUnknown commented Jun 3, 2022

The Commit message must be improved to explain what and why

because JS is slow? I mean I think that much is obvious

It must not abuse ASS_Image like this.

Sure but the extra data from implementing something that's almost exactly the same seems insanely wasteful

This also affects js-blending.

It works quite literally the same way as it did before

As a side note to your performance analysis: have you measured, or can you measure, how strict mode affects performance?

I didn't bother; most of the performance lost in this lib is on insanely stupid stuff like what this PR tries to address. It's also out of the scope of this PR, as this PR doesn't create any new JS, it instead gets rid of it.

I believe changing the function names to convertImages and convertBitmap would be good to make their purpose clearer. Also, getBufferSize is too general a name for something that doesn't apply to all modes. Inlining it into the single calling function might be even better.

Sounds good, except for the inlining part; I already explained why I haven't done it: because profiling a single inlined function, like renderBlend for example, is impossible.

@TheOneric
Member

Because profiling a single inlined function

There's no point in profiling getBufferSize separately from renderImage, which now barely does anything itself.

@ThaUnknown
Author

Because profiling a single inlined function

There's no point in profiling getBufferSize separately from renderImage, which now barely does anything itself.

Tell that to the next person who's gonna spend 12 hrs trying to figure out why renderImage is so slow, like I did before coding this.

This pull request was closed.