
Support persistently mapped buffers #47

Open

haasn opened this issue Jul 5, 2017 · 3 comments

Comments

haasn commented Jul 5, 2017

Right now, the only way to get data into a buffer is to do a round-trip through a [] and copy the contents in one element at a time. For streaming large amounts of data, this can be very inefficient.

It would be beneficial if it were possible to map a persistent buffer binding directly (as a Ptr () or otherwise), so that I could do fancy things like decoding my data straight into the mapped buffer, avoiding the extra round-trip, memory copy, and garbage collection.
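
For illustration, here is roughly what the current list-based path looks like next to the kind of API this issue is asking for. This is only a sketch: writeBuffer is GPipe's existing function, while withMappedBuffer is a purely hypothetical name for the requested capability and does not exist anywhere.

```haskell
import Graphics.GPipe

-- Today: upload goes through a Haskell list, element by element.
-- (writeBuffer is GPipe's existing API; think buf :: Buffer os (B Word32).)
uploadFrame buf pixels = writeBuffer buf 0 pixels

-- The kind of API this issue asks for (NOT part of GPipe, purely a
-- hypothetical signature): hand out the mapped storage as a raw pointer so
-- an external decoder (e.g. libavcodec via FFI) can write straight into
-- DMA-visible memory:
--
--   withMappedBuffer :: Buffer os b -> (Ptr () -> IO a) -> ContextT ctx os m a
```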

ghost commented Jul 8, 2017 via email

haasn commented Jul 8, 2017

@plredmond Not quite. With an OpenGL PBO (or OpenCL/Vulkan/CUDA mapped buffer), the GPU's DMA engine can directly upload the data into device memory without needing to relocate it in host memory first.

Say you implement a video player, and you have an external library like libavcodec which can decode individual frames directly into a Ptr of your choice. There are now two distinct possibilities. Without persistent mapping, the upload goes like this (a raw-GL sketch follows the list):

  1. libavcodec decodes the data into a Ptr Word8; normally this is a buffer that libavcodec mallocs internally.
  2. You either memcpy this into a mapped OpenGL PBO (which will be backed by DMA-visible pinned memory), or call glTexImage2D on it and the OpenGL driver internally copies it into a DMA buffer for you.
  3. The GPU's DMA engine sees the data and can begin streaming the contents into device memory.
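
A rough sketch of that conventional path, assuming the raw gl* bindings from the OpenGLRaw / gl packages (buffer and texture creation, error handling, and pixel-format details elided):

```haskell
import Graphics.GL
import Foreign.Ptr (Ptr, castPtr, nullPtr)
import Foreign.Marshal.Utils (copyBytes)
import Data.Word (Word8)

-- 'decoded' is the Ptr Word8 that libavcodec malloc'd and filled (step 1).
uploadConventional :: GLuint -> Ptr Word8 -> Int -> GLsizei -> GLsizei -> IO ()
uploadConventional pbo decoded len w h = do
  glBindBuffer GL_PIXEL_UNPACK_BUFFER pbo
  -- step 2: map the PBO and memcpy the decoded frame into it
  dst <- glMapBufferRange GL_PIXEL_UNPACK_BUFFER 0 (fromIntegral len) GL_MAP_WRITE_BIT
  copyBytes (castPtr dst) decoded len              -- the extra copy
  _ <- glUnmapBuffer GL_PIXEL_UNPACK_BUFFER
  -- step 3: the texture upload now sources from the bound PBO (offset 0),
  -- so the driver's DMA engine can stream it into device memory
  glTexSubImage2D GL_TEXTURE_2D 0 0 0 w h GL_RGBA GL_UNSIGNED_BYTE nullPtr
```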

But this requires an extra memcpy / indirection. If you map the PBO persistently (MAP_PERSISTENT_BIT), then you can do the following instead (again sketched after the list):

  1. You tell libavcodec to decode into your pre-mapped Ptr Word8 instead of allocating its own internal buffer.
  2. The decoded data is already pinned / DMA-visible; all you need to do is flush the affected memory range, and the GPU can start streaming from it.
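
A corresponding sketch of the persistently mapped path, under the same raw-binding assumptions. decodeFrameInto is a hypothetical FFI stand-in for handing the mapped pointer to libavcodec, and the fence handling is only hinted at:

```haskell
import Graphics.GL
import Foreign.Ptr (Ptr, castPtr, nullPtr)
import Data.Bits ((.|.))
import Data.Word (Word8)

-- Hypothetical stand-in for a libavcodec call that decodes the next frame
-- into a caller-supplied buffer; not a real symbol.
foreign import ccall "decode_next_frame_into"
  decodeFrameInto :: Ptr Word8 -> IO ()

-- One-time setup: immutable storage, mapped persistently for writing.
setupPersistentPBO :: GLuint -> Int -> IO (Ptr Word8)
setupPersistentPBO pbo len = do
  glBindBuffer GL_PIXEL_UNPACK_BUFFER pbo
  let flags = GL_MAP_WRITE_BIT .|. GL_MAP_PERSISTENT_BIT
  glBufferStorage GL_PIXEL_UNPACK_BUFFER (fromIntegral len) nullPtr flags
  castPtr <$> glMapBufferRange GL_PIXEL_UNPACK_BUFFER 0 (fromIntegral len)
                               (flags .|. GL_MAP_FLUSH_EXPLICIT_BIT)

-- Per frame: the decoder writes straight into the mapping, no memcpy.
-- (Assumes the PBO is still bound to GL_PIXEL_UNPACK_BUFFER.)
uploadPersistent :: Ptr Word8 -> Int -> GLsizei -> GLsizei -> IO ()
uploadPersistent mapped len w h = do
  decodeFrameInto mapped                                                -- step 1
  glFlushMappedBufferRange GL_PIXEL_UNPACK_BUFFER 0 (fromIntegral len)  -- step 2
  glTexSubImage2D GL_TEXTURE_2D 0 0 0 w h GL_RGBA GL_UNSIGNED_BYTE nullPtr
  -- before the decoder may overwrite this region again, wait on a fence:
  -- glFenceSync GL_SYNC_GPU_COMMANDS_COMPLETE 0, then glClientWaitSync
```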

Depending on the use case, the extra memcpy can be needlessly wasteful or even problematic (if the memory was swapped out, compressed, on the wrong NUMA node, or otherwise not immediately accessible, you can get nasty pipeline stalls because of it), so it would be nice if we could figure out some way of avoiding it.

Unfortunately, all of this requires pretty much breaking out of Haskell's “safety” and dealing with raw pointers, buffers, flushing, and fences (for synchronization), so I'm not sure whether it can fit into the high-level GPipe API.

I guess what I'm missing in general is the ability to use GPipe at a high level but “bypass” it to insert raw underlying calls when I promise “I Know What I'm Doing(tm)”, such as being able to define my own raw GLSL function calls. Maybe that's the overarching problem here?

Giving up all of GPipe's exceptional ease of use just for the ability to make one raw GL call is a bit daunting.

tobbebex commented Aug 9, 2017

For PBOs to be useful you would need asynchronous upload, which is hard to do in a safe way (see #40). But even without PBOs, Buffers in GPipe are already "persistent" in the sense that they live on the GPU. When writing to a Buffer in GPipe you are actually using the DMA engine, and subsequent calls that are not data-dependent on the buffer you just wrote will be run concurrently by your OpenGL driver. That you are using a [] instead of directly poking a Ptr doesn't change that, just as @plredmond commented.

So, if you can get libavcodec to provide a [] instead of giving it a Ptr (i.e. let GPipe pull the data instead of libavcodec pushing it), you can get rid of the extra memcpy that way. The decoding would still happen in sync with other GPipe calls; to alleviate that we would need #40. Does that work for you?
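
One way to picture this "pull" variant: the [HostFormat b] list handed to writeBuffer is produced lazily from the decoder's output, so each element is consumed as GPipe pokes it into its own (already DMA-visible) storage, without materialising a second full copy in between. lazyPixels below is just illustration glue under that assumption, not part of GPipe or libavcodec:

```haskell
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff)
import System.IO.Unsafe (unsafeInterleaveIO)
import Data.Word (Word32)

-- Stream the decoded frame out of its Ptr as a lazy list, one element at
-- a time, instead of materialising the whole thing up front.
-- Caveat: the Ptr must stay valid until writeBuffer has consumed the list.
lazyPixels :: Ptr Word32 -> Int -> IO [Word32]
lazyPixels decoded n = go 0
  where
    go i | i >= n    = pure []
         | otherwise = unsafeInterleaveIO $
             (:) <$> peekElemOff decoded i <*> go (i + 1)

-- Then, inside the ContextT, with buf :: Buffer os (B Word32):
--   pixels <- liftIO (lazyPixels decodedPtr frameLen)
--   writeBuffer buf 0 pixels
```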
