Poorer than expected performance on ffmpeg nvtegra decoder #1

averne · 2024-06-26T16:05:14Z

Reposting emailed issue by @theofficialgman:

I have built the nvtegra decoders from https://github.com/averne/FFmpeg/tree/nvtegra-upstreaming on Switchroot Ubuntu Noble 24.04 and am benchmarking them against a ffmpeg implementation of a wrapper around nvidia's nvv4l2 (written primarily by CTCaer) https://github.com/theofficialgman/FFmpeg/tree/6.1.1-nvv4l2

Performance with your decoder is well below expectations, and it results in a lot of CPU utilization (100% on one thread). I believe the issue is indicated by the following repeated log message:

[AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer

On nvv4l2 there are no such logs or high CPU utilization.

For reference, here are the performance numbers I am measuring from https://repo.jellyfin.org/jellyfish/ test videos

https://repo.jellyfin.org/jellyfish/media/jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv

nvtegra: fps 31

clocks: NVDEC 488
command: ./ffmpeg -hwaccel nvtegra -i ../jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv -f null -

nvv4l2: fps 66

clocks: NVDEC 716, VIC 192

https://repo.jellyfin.org/jellyfish/media/jellyfish-120-mbps-4k-uhd-h264.mkv

nvtegra: fps 56

clocks: NVDEC 716
command: ./ffmpeg -hwaccel nvtegra -i ../jellyfish-120-mbps-4k-uhd-h264.mkv -f null -

nvv4l2: fps 62

clocks: NVDEC 716, VIC 307

If there is a better place to have this discussion let me know because your github repo does not have issues enabled.

Thanks,

theofficialgman

theofficialgman · 2024-06-26T16:52:51Z

The performance gap is larger at lower resolutions (maybe lower bitrates too?):
https://repo.jellyfin.org/jellyfish/media/jellyfish-10-mbps-hd-hevc-10bit.mkv

nvtegra: fps 118

clocks: NVDEC 448
command: ./ffmpeg -hwaccel nvtegra -i ../jellyfish-10-mbps-hd-hevc-10bit.mkv -f null -

nvv4l2: fps 273

clocks: NVDEC 716 VIC 192

theofficialgman · 2024-06-26T16:54:01Z

If you have links to the video files used in your benchmarking here https://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2024-May/328549.html, that would be interesting for me as well.

averne · 2024-06-26T18:54:00Z

> If there is a better place to have this discussion let me know because your github repo does not have issues enabled.

Sorry, I forked this repo from the FFmpeg mirror which doesn't have issues enabled, and didn't notice until now.

> [AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer

Indeed that would be the problematic code path. FFmpeg needs to "download" the decoded data from the hardware engine, in this case going through a block linear to pitch linear layout conversion. At the moment two methods are implemented: when some constraints are satisfied, the transfer is accelerated using the VIC engine. Otherwise, it falls back to a cpu copy, which of course is far less performant. See here: https://github.com/averne/FFmpeg/blob/nvtegra-upstreaming/libavutil/hwcontext_nvtegra.c#L1009-L1015
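
The fast-path condition described above can be sketched as a simple check. This is a minimal illustration, not the code from hwcontext_nvtegra.c: the helper names `is_aligned` and `can_use_fast_transfer` are hypothetical, and the only detail taken from the thread is the 256-byte requirement on both the frame address and the pitch that the log message names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Alignment named in the log message (256 bytes). */
#define TRANSFER_ALIGN ((uintptr_t)0x100)

/* Hypothetical helper: true when value is a multiple of align.
 * align must be a non-zero power of two. */
static bool is_aligned(uintptr_t value, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value & (align - 1)) == 0;
}

/* Hypothetical check: the accelerated path is only considered when both
 * the destination address and its pitch (bytes per row) are aligned.
 * Otherwise the caller would fall back to the CPU copy. */
static bool can_use_fast_transfer(const void *data, ptrdiff_t pitch)
{
    return is_aligned((uintptr_t)data, TRANSFER_ALIGN) &&
           is_aligned((uintptr_t)pitch, TRANSFER_ALIGN);
}
```

With this check, a destination at an address ending in 0x40 or a pitch of 1920 bytes would be rejected, which matches the kind of situation the warning reports.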

To enable the fast path, you can apply the following patch to FFmpeg:

diff --git a/libavutil/frame.c b/libavutil/frame.c
index 0775e2abd9..0b0a34f0e8 100644
--- a/libavutil/frame.c
+++ b/libavutil/frame.c
@@ -212,7 +212,7 @@ static int get_video_buffer(AVFrame *frame, int align)
         total_size += sizes[i];
     }
 
-    frame->buf[0] = av_buffer_alloc(total_size);
+    frame->buf[0] = av_buffer_aligned_alloc(total_size, 0x100);
     if (!frame->buf[0]) {
         ret = AVERROR(ENOMEM);
         goto fail;
diff --git a/libavutil/hwcontext.c b/libavutil/hwcontext.c
index 8dd05147a4..a650d66c9a 100644
--- a/libavutil/hwcontext.c
+++ b/libavutil/hwcontext.c
@@ -416,7 +416,7 @@ static int transfer_data_alloc(AVFrame *dst, const AVFrame *src, int flags)
     frame_tmp->width  = ctx->width;
     frame_tmp->height = ctx->height;
 
-    ret = av_frame_get_buffer(frame_tmp, 0);
+    ret = av_frame_get_buffer(frame_tmp, 0x100);
     if (ret < 0)
         goto fail;
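
As a rough illustration of what the first hunk is after, the sketch below allocates a buffer whose start address is a multiple of 0x100 using C11 `aligned_alloc`. It is a standard-library analogue only: `alloc_aligned_256` is a hypothetical name, and it makes no claim about how `av_buffer_aligned_alloc` is implemented in the branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Alignment requested by the patch (0x100 = 256 bytes). */
#define TRANSFER_ALIGN ((size_t)0x100)

/* Hypothetical helper: returns a buffer of at least `size` bytes whose
 * address is a multiple of 256, or NULL on failure. Release with free(). */
void *alloc_aligned_256(size_t size)
{
    assert(size > 0);
    /* C11 aligned_alloc expects the size to be a multiple of the
     * alignment, so round the request up first. */
    size_t padded = (size + TRANSFER_ALIGN - 1) & ~(TRANSFER_ALIGN - 1);
    return aligned_alloc(TRANSFER_ALIGN, padded);
}
```

A plain `malloc` typically guarantees only a much smaller alignment, which is why an unpatched allocation satisfies the 256-byte constraint only by chance.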

And if you're using mpv, you can add this:

diff --git a/video/mp_image.c b/video/mp_image.c
index a89762b..af98703 100644
--- a/video/mp_image.c
+++ b/video/mp_image.c
@@ -175,7 +175,7 @@ static bool mp_image_alloc_planes(struct mp_image *mpi)
         return false;
 
     // Note: mp_image_pool assumes this creates only 1 AVBufferRef.
-    mpi->bufs[0] = av_buffer_alloc(size + align);
+    mpi->bufs[0] = av_buffer_aligned_alloc(size + align, 0x100);
     if (!mpi->bufs[0])
         return false;
 
diff --git a/video/mp_image.h b/video/mp_image.h
index af0d9fd..c047a45 100644
--- a/video/mp_image.h
+++ b/video/mp_image.h
@@ -32,7 +32,7 @@
 // libraries except libavcodec don't really know what alignment they want.
 // Things will randomly crash or get slower if the alignment is not satisfied.
 // Whatever. This value should be pretty safe with current CPU architectures.
-#define MP_IMAGE_BYTE_ALIGN 64
+#define MP_IMAGE_BYTE_ALIGN 256
 
 #define MP_IMGFIELD_TOP_FIRST 0x02
 #define MP_IMGFIELD_REPEAT_FIRST 0x04

Another consideration regarding your samples is that 10-bit hwdownloads are split in two VIC transfers due to hardware limitations, which impacts the throughput. If you want to check the actual performance of the decode engine, you can add an early exit to the transfer function. With this I find:

jellyfish-120-mbps-4k-uhd-hevc-10bit.mkv: 75fps
jellyfish-10-mbps-hd-hevc-10bit.mkv: 350fps (added -stream_loop -1 since the sample is too short to stabilize)

Ultimately though, for the best integration within a video player or another application, you really want a zero-copy pipeline, i.e. directly importing decoded frames into your GPU API as a texture. I've written such code for mpv and deko3d on hos, but not for OpenGL/Vulkan on Linux. It's probably possible to adapt the dmabuf code for vaapi, but that would need a bit of research.

theofficialgman · 2024-06-26T19:10:39Z

> [AVHWFramesContext @ 0x7f805a55a0] Frame address/pitch not aligned to 256, falling back to cpu transfer
>
> Indeed that would be the problematic code path. FFmpeg needs to "download" the decoded data from the hardware engine, in this case going through a block linear to pitch linear layout conversion. At the moment two methods are implemented: when some constraints are satisfied, the transfer is accelerated using the VIC engine. Otherwise, it falls back to a cpu copy, which of course is far less performant. See here: https://github.com/averne/FFmpeg/blob/nvtegra-upstreaming/libavutil/hwcontext_nvtegra.c#L1009-L1015

Maybe you will find this commit in CTCaer's nvv4l2 decoder relevant for that: theofficialgman@1dc58f7

See the original post for a link to the branch itself, where you can see the current decoder and encoder implementations.

averne · 2024-06-26T19:18:39Z

64B alignments are relevant for block linear layouts (and I use it here).
256B is required for pitch linear, which is the format FFmpeg wants when "downloading" frames.
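
To make the 256-byte pitch requirement concrete, the sketch below rounds a row size up to the next multiple of 0x100. The helper names are hypothetical, and treating the pitch as width times bytes per sample, rounded up, is an assumption about a simple single-plane case rather than something taken from the nvtegra code.

```c
#include <assert.h>
#include <stddef.h>

/* Round value up to the next multiple of align.
 * align must be a non-zero power of two. */
static size_t align_up(size_t value, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value + align - 1) & ~(align - 1);
}

/* Hypothetical helper: bytes per row of a pitch-linear plane, padded to
 * the 256-byte alignment mentioned in the thread. */
static size_t pitch_linear_stride(size_t width, size_t bytes_per_sample)
{
    return align_up(width * bytes_per_sample, 0x100);
}
```

Under these assumptions, an 8-bit row of 1920 samples occupies 1920 bytes, which is not a multiple of 256 and would be padded to 2048, while a 3840-sample row is already aligned.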

In any case, I can assist with setting up an nvtegra->opengl interop backend for mpv. You can contact me @ avhe on discord.

theofficialgman · 2024-06-26T19:35:45Z

To enable the fast path, you can apply the following patch to FFmpeg:

diff --git a/libavutil/frame.c b/libavutil/frame.c
index 0775e2abd9..0b0a34f0e8 100644
--- a/libavutil/frame.c
+++ b/libavutil/frame.c
@@ -212,7 +212,7 @@ static int get_video_buffer(AVFrame *frame, int align)
         total_size += sizes[i];
     }
 
-    frame->buf[0] = av_buffer_alloc(total_size);
+    frame->buf[0] = av_buffer_aligned_alloc(total_size, 0x100);
     if (!frame->buf[0]) {
         ret = AVERROR(ENOMEM);
         goto fail;
diff --git a/libavutil/hwcontext.c b/libavutil/hwcontext.c
index 8dd05147a4..a650d66c9a 100644
--- a/libavutil/hwcontext.c
+++ b/libavutil/hwcontext.c
@@ -416,7 +416,7 @@ static int transfer_data_alloc(AVFrame *dst, const AVFrame *src, int flags)
     frame_tmp->width  = ctx->width;
     frame_tmp->height = ctx->height;
 
-    ret = av_frame_get_buffer(frame_tmp, 0);
+    ret = av_frame_get_buffer(frame_tmp, 0x100);
     if (ret < 0)
         goto fail;

Would this not cause issues with other ffmpeg decoders? It seems to me like you should instead be modifying the default that av_frame_get_buffer uses when the input is 0 (which currently chooses an optimal value based on the CPU https://github.com/FFmpeg/FFmpeg/blob/e61fed8280ccf2fb9e69c8d4e1849be2dcfebd89/libavutil/frame.h#L858-L881, or at least used to until they hardcoded it https://github.com/FFmpeg/FFmpeg/blame/e61fed8280ccf2fb9e69c8d4e1849be2dcfebd89/libavutil/frame.c#L185).

averne · 2024-06-26T19:40:22Z

I don't see how that could break anything.
The first part of the patch sets the alignment of the buffer which is irrelevant for software decoders.
The second part is only relevant for hardware decoders, of which there is currently only mine available on tegra, since vaapi and cuvid are (shamefully) not supported.

theofficialgman · 2024-06-26T19:46:59Z

> I don't see how that could break anything. The first part of the patch sets the alignment of the buffer which is irrelevant for software decoders. The second part is only relevant for hardware decoders, of which there is currently only mine available on tegra, since vaapi and cuvid are (shamefully) not supported.

I'd like something that is upstream worthy and included in your patchset https://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2024-May/328549.html.

The point (of me testing this) is to see if it's worth including as an option alongside CTCaer's (already very well functioning) nvv4l2. We currently force nvv4l2's use (on Switchroot Ubuntu Bionic/Jammy/Noble) for all suitable decoding, which lets every application that uses the system ffmpeg (vlc, mpv, dolphin-emu, obs-studio, and soon firefox) benefit without any changes to the upstream applications. These aren't zero-copy either, but they work well enough.

averne · 2024-06-26T19:53:52Z

Unfortunately the patchset in its current state will not use accelerated transfers unless by chance your media width is aligned to 256 and you luck out on buffer allocations, so I would not consider it suitable yet.
Another option would be to use the gpu copy engine for frame transfers, which doesn't suffer from the constraints VIC does. But that would add a heavy amount of code.

