Poorer than expected performance on ffmpeg nvtegra decoder #1
Comments
The performance gap is larger with lower resolutions (maybe bitrate too?):
nvtegra: fps 118
nvv4l2: fps 273
|
If you have links to the video files used in your benchmarking here https://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2024-May/328549.html, that would be interesting for me as well. |
Sorry, I forked this repo from the FFmpeg mirror which doesn't have issues enabled, and didn't notice until now.
Indeed, that would be the problematic code path. FFmpeg needs to "download" the decoded data from the hardware engine, which in this case involves a block linear to pitch linear layout conversion. At the moment two methods are implemented: when some constraints are satisfied, the transfer is accelerated using the VIC engine. Otherwise, it falls back to a CPU copy, which is of course far slower. See here: https://github.com/averne/FFmpeg/blob/nvtegra-upstreaming/libavutil/hwcontext_nvtegra.c#L1009-L1015
To enable the fast path, you can apply the following patch to FFmpeg:
diff --git a/libavutil/frame.c b/libavutil/frame.c
index 0775e2abd9..0b0a34f0e8 100644
--- a/libavutil/frame.c
+++ b/libavutil/frame.c
@@ -212,7 +212,7 @@ static int get_video_buffer(AVFrame *frame, int align)
total_size += sizes[i];
}
- frame->buf[0] = av_buffer_alloc(total_size);
+ frame->buf[0] = av_buffer_aligned_alloc(total_size, 0x100);
if (!frame->buf[0]) {
ret = AVERROR(ENOMEM);
goto fail;
diff --git a/libavutil/hwcontext.c b/libavutil/hwcontext.c
index 8dd05147a4..a650d66c9a 100644
--- a/libavutil/hwcontext.c
+++ b/libavutil/hwcontext.c
@@ -416,7 +416,7 @@ static int transfer_data_alloc(AVFrame *dst, const AVFrame *src, int flags)
frame_tmp->width = ctx->width;
frame_tmp->height = ctx->height;
- ret = av_frame_get_buffer(frame_tmp, 0);
+ ret = av_frame_get_buffer(frame_tmp, 0x100);
if (ret < 0)
goto fail;
And if you're using mpv, you can add this:
diff --git a/video/mp_image.c b/video/mp_image.c
index a89762b..af98703 100644
--- a/video/mp_image.c
+++ b/video/mp_image.c
@@ -175,7 +175,7 @@ static bool mp_image_alloc_planes(struct mp_image *mpi)
return false;
// Note: mp_image_pool assumes this creates only 1 AVBufferRef.
- mpi->bufs[0] = av_buffer_alloc(size + align);
+ mpi->bufs[0] = av_buffer_aligned_alloc(size + align, 0x100);
if (!mpi->bufs[0])
return false;
diff --git a/video/mp_image.h b/video/mp_image.h
index af0d9fd..c047a45 100644
--- a/video/mp_image.h
+++ b/video/mp_image.h
@@ -32,7 +32,7 @@
// libraries except libavcodec don't really know what alignment they want.
// Things will randomly crash or get slower if the alignment is not satisfied.
// Whatever. This value should be pretty safe with current CPU architectures.
-#define MP_IMAGE_BYTE_ALIGN 64
+#define MP_IMAGE_BYTE_ALIGN 256
#define MP_IMGFIELD_TOP_FIRST 0x02
#define MP_IMGFIELD_REPEAT_FIRST 0x04
Another consideration regarding your samples is that 10-bit hwdownloads are split into two VIC transfers due to hardware limitations, which reduces throughput. If you want to check the actual performance of the decode engine, you can add an early exit to the transfer function. With this I find:
Ultimately though, for the best integration within a video player or another application, you really want a zero-copy pipeline, i.e. directly importing decoded frames into your GPU API as a texture. I've written such code for mpv and deko3d on HOS, but not for OpenGL/Vulkan on Linux. It's probably possible to adapt the dmabuf code for VAAPI, but that would need a bit of research. |
Maybe you will find this commit in CTCaer's nvv4l2 decoder relevant for that theofficialgman@1dc58f7 See the original post for link to the branch itself where you can see the current decoder and encoder implementations. |
64B alignments are relevant for block linear layouts (and I use it here). In any case, I can assist with setting up an nvtegra->opengl interop backend for mpv. You can contact me @ avhe on Discord. |
would this not cause issues with other ffmpeg decoders? It seems to me like you should be modifying the default output of |
I don't see how that could break anything. |
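To illustrate the reasoning: a buffer aligned to 256 bytes is automatically aligned to 64 bytes (or any smaller power of two), so consumers that expect the weaker alignment still get it. Here is a minimal sketch in standard C11, not the actual FFmpeg or mpv code; `alloc_256` and `round_up` are hypothetical helpers and `aligned_alloc` stands in for the `av_buffer_aligned_alloc` call used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round size up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}

/* Hypothetical helper: allocate a buffer whose base address is 256-byte aligned.
 * C11 aligned_alloc expects the size to be a multiple of the alignment,
 * so the requested size is padded first. Release the result with free(). */
static void *alloc_256(size_t size)
{
    return aligned_alloc(256, round_up(size, 256));
}
```

The cost of the stricter alignment is at most 255 bytes of padding per buffer, which is the main trade-off to weigh here.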
I'd like something that is upstream-worthy and included in your patchset https://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2024-May/328549.html. The point of my testing is to see whether it's worth including as an option alongside CTCaer's (already very well functioning) nvv4l2. We currently force the use of nvv4l2 (on Switchroot Ubuntu Bionic/Jammy/Noble) for all suitable decoding, which lets every application that uses the system FFmpeg (vlc, mpv, dolphin-emu, obs-studio, firefox (soon)) benefit without any changes to the upstream applications. These aren't zero-copy either, but they work well enough. |
Unfortunately the patchset in its current state will not use accelerated transfers unless by chance your media width is aligned to 256 and you luck out on buffer allocations, so I would not consider it suitable yet. |
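As a rough sketch of the condition described above (and hinted at by the "Frame address/pitch not aligned to 256" log message), the check could look something like the following. This is an assumption about the shape of the test, not the code in hwcontext_nvtegra.c; `can_use_fast_transfer`, `is_aligned` and `HYPOTHETICAL_VIC_ALIGN` are names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the 256-byte requirement quoted in the log message. */
#define HYPOTHETICAL_VIC_ALIGN 256u

/* True when value is a multiple of align (align must be a power of two). */
static bool is_aligned(uintptr_t value, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value & (align - 1)) == 0;
}

/* Hypothetical helper: the accelerated path is only eligible when every
 * plane's base address and pitch (linesize) are 256-byte aligned;
 * otherwise the caller would fall back to the CPU copy. */
static bool can_use_fast_transfer(uint8_t *const data[], const int linesize[],
                                  int nb_planes)
{
    for (int i = 0; i < nb_planes; i++) {
        if (!is_aligned((uintptr_t)data[i], HYPOTHETICAL_VIC_ALIGN) ||
            !is_aligned((uintptr_t)linesize[i], HYPOTHETICAL_VIC_ALIGN))
            return false;
    }
    return true;
}
```

For example, a tightly packed pitch of 3840 bytes passes (3840 = 15 × 256), while a pitch of 1920 bytes does not (1920 = 7.5 × 256), which is why the destination pitch and base address both matter.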
Reposting emailed issue by @theofficialgman:
I have built the nvtegra decoders from https://github.com/averne/FFmpeg/tree/nvtegra-upstreaming on Switchroot Ubuntu Noble 24.04 and am benchmarking them against an FFmpeg wrapper around NVIDIA's nvv4l2 (written primarily by CTCaer) https://github.com/theofficialgman/FFmpeg/tree/6.1.1-nvv4l2
Performance with your decoder is very subpar and results in a lot of CPU utilization (100% on one thread). I believe the issue is indicated by the following repeated log message:
[AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer
On nvv4l2 there are no such logs or high CPU utilization.
For reference, here are the performance numbers I am measuring from https://repo.jellyfin.org/jellyfish/ test videos
https://repo.jellyfin.org/jellyfish/media/jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv
nvtegra: fps 31
nvv4l2: fps 66
https://repo.jellyfin.org/jellyfish/media/jellyfish-120-mbps-4k-uhd-h264.mkv
nvtegra: fps 56
nvv4l2: fps 62
If there is a better place to have this discussion let me know because your github repo does not have issues enabled.
Thanks,
theofficialgman