\section{Host Memory Device}\label{sec:Device Types / Host Memory Device}
Note: This depends on the upcoming shared-mem type of virtio
that allows sharing host memory with the guest.
virtio-hostmem is a device for sharing host memory all the way to guest userspace.
It runs on top of virtio-pci for virtqueue messages and
uses the PCI address space for direct access, like virtio-fs does.
virtio-hostmem's purpose is
to allow high-performance, general-purpose memory access between guest and host,
and to allow the guest to access host memory constructed at runtime,
such as memory mapped by graphics APIs.
Note that vhost-pci/vhost-vsock, virtio-vsock, and virtio-fs
are also general ways to share data between the guest and host,
but they are either specialized to socket APIs in the guest
and rely on host-OS-dependent socket communication mechanisms,
or depend on a FUSE implementation.
virtio-hostmem provides a guest/host communication mechanism over raw host memory
rather than sockets,
which is more portable across hypervisors and guest OSes,
and potentially higher performance because the memory is always physically contiguous
from the guest's point of view.
\subsection{Fixed Host Memory Regions}\label{sec:Device Types / Host Memory Device / Communication over Fixed Host Memory Regions}
Shmids will be set up as a set of fixed ranges on boot,
one for each available sub-device.
This means that the boot-up sequence plus the guest kernel
configure the memory regions used by sub-devices once on startup,
and do not touch them again;
this simplifies the set of possible behaviors,
and bounds the maximum amount of guest physical memory that can be used.
It is always assumed that the memory regions act as RAM
and are backed in some way by the host.
There is no caching; all access is coherent.
When the guest sends notifications to the host,
memory fence instructions are automatically issued
on architectures without store/load coherency.
The host is allowed to lazily back such regions and to change which host pointers
correspond to their physical address ranges,
such as in response to sub-device-specific protocols,
but is never allowed to leave those mappings unmapped.
These guest physical memory regions persist throughout the lifetime of the VMM;
they are not created or destroyed dynamically, even if the virtio-hostmem device
is realized as a PCI device that supports hotplug.
The guest physical regions must not overlap and must not be shared
directly across different sub-device types.
\subsection{Sub-devices and Instances}\label{sec:Device Types / Host Memory Device / Sub-devices and Instances}
The guest and host communicate
over the config, notification (tx), and event (rx) virtqueues.
The config virtqueue is used for instance creation and destruction.
The guest can create "instances", which capture
a particular use case of the device.
Different use cases are distinguished by different sub-device IDs;
virtio-hostmem is like virtio-input in that the guest can query
for sub-devices implemented on the host via device and vendor IDs:
the guest provides the vendor and device ID in a configuration message,
and the host then accepts or rejects the instance creation request.
Each instance can only touch the memory region
associated with its particular sub-device,
and only knows offsets into that region.
It is up to the userspace driver / device implementation
to decide whether and how offsets into the memory region are shared across instances.
This means that it is possible to share the same physical pages across multiple processes,
which is useful for implementing functionality such as gralloc/ashmem/dmabuf;
virtio-hostmem only guarantees the security boundary that
no sub-device instance is allowed to access the memory of an instance of
a different sub-device.
Indeed, it is possible for a malicious guest process to improperly access
the shared memory of a gralloc/ashmem/dmabuf implementation on virtio-hostmem,
but we regard that as a flaw in the security model of the guest,
not the security model of virtio-hostmem.
When a virtio-hostmem instance in the guest is created,
a use-case-specific initialization happens on the host
in response to the creation request.
In operating the device, a notification virtqueue is used by the guest to notify the host
when something interesting has happened in the shared memory, communicating
the offset and size of the transaction, if applicable, plus metadata.
This makes it well suited to many kinds of high-performance / low-latency
devices such as graphics API forwarding, audio/video codecs, sensors, etc.;
no actual memory contents are sent over the virtqueue.
Note that this is asymmetric;
there will be one tx notification virtqueue for each guest instance,
while there is only one rx event virtqueue for host-to-guest notifications.
This is because it can be faster not to share the same virtqueue
when multiple guest instances all use high-bandwidth / low-latency operations
over the virtio-hostmem device to send data to the host;
this is especially common for graphics API forwarding
and media codecs.
Both guest kernel and userspace drivers can be written on top of virtio-hostmem
in a way that mirrors UIO on Linux:
open()/close()/ioctl()/read()/write()/mmap().
Concrete implementations are outside the scope of this spec, but a sketch is given below.
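As a purely illustrative sketch (the device node path, the use of write()/read() as doorbell and completion wait, and the mapping size are all hypothetical assumptions about a possible guest kernel driver, not defined by this spec), a guest userspace client mirroring the UIO style might look like:
\begin{lstlisting}
/* Hypothetical guest userspace client; nothing here is normative. */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int hostmem_client_example(void)
{
    /* open(): attach to (or create) an instance. */
    int fd = open("/dev/hostmem-codec0", O_RDWR);
    if (fd < 0)
        return -1;

    /* mmap(): map a window of the sub-device's fixed memory region. */
    size_t len = 1u << 20;
    uint8_t *shm = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memset(shm, 0, len);      /* produce data directly in shared memory */
    write(fd, shm, 0);        /* write(): doorbell-style notification */

    uint32_t revents;
    read(fd, &revents, sizeof(revents)); /* read(): block for completion */

    munmap(shm, len);
    close(fd);
    return 0;
}
\end{lstlisting}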
\subsection{Example Use Case}\label{sec:Device Types / Host Memory Device / Example Use Case}
Suppose the guest wants to decode a compressed video buffer.
\begin{enumerate}
\item VMM is configured for codec support and a vendor/device/revision id is associated
with the codec device.
\item On startup, a physical address range, say 128 MB, is associated with the codec device.
The range must be usable as RAM, so the host backs it as part of the guest startup process.
\item To save memory, the codec device implementation on the host
can begin servicing this range by mapping all of its pages to the same host page. But this is not required;
the host can instead initialize a codec library buffer on bootup and pre-allocate the entire region there.
The main invariant is that the physical address range is always mapped and usable as RAM.
\item Guest creates an instance for the codec vendor ID / device ID / revision
by sending a message over the config virtqueue.
\item Guest codec driver does an implementation-dependent suballocation operation and communicates via the
notification virtqueue to the host that this instance wants to use that sub-region.
\item The host now does an implementation-dependent operation to back the sub-region with usable memory.
But this is not required; the host could have set up the entire region at startup.
\item Guest downloads compressed video buffers into that region.
\item Guest: After a packet of compressed video stream is downloaded to the
buffer, another message, like a doorbell, is sent on the notification virtqueue to
consume existing compressed data. The notification message's offset field is
set to the proper offset into the shared-mem object.
\item Host: Codec implementation decodes the video and puts the decoded frames
to either a host-side display library (thus with no further guest
communication necessary), or puts the raw decompressed frame to a
further offset in the shared memory region that the guest knows about.
\item Guest: Continue downloading video streams and sending notifications,
or optionally, wait until the host is done first. If scheduling overhead is not a
big concern, this can be done without any further VM exit:
the host writes to an agreed memory location when decoding is done,
and the guest polls with a sleep(N), where N is a timeout tuned
so that only a few poll spins are necessary.
\item Guest: Alternatively, the host can send \field{revents} back on the event virtqueue
and the guest can perform a blocking read() for it.
\end{enumerate}
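In terms of the message structures defined under Device Operation below, the doorbell of step 8 could be populated roughly as follows (a sketch; the handle, opcode, offset, and size values are all illustrative, and CODEC_OP_DECODE is a hypothetical sub-device-specific opcode):
\begin{lstlisting}
/* Illustrative doorbell for the codec example; all values are
 * hypothetical. */
struct virtio_hostmem_notification doorbell = {
    .instance_handle = codec_instance,  /* from instance creation */
    .metadata        = CODEC_OP_DECODE, /* sub-device-specific opcode */
    .shm_offset      = packet_offset,   /* where the packet was written */
    .shm_size        = packet_bytes,    /* size of the compressed packet */
    .events          = 1,               /* nonzero: request a completion
                                           event on the event virtqueue */
};
/* The guest driver then places this message on the instance's
 * notification virtqueue. */
\end{lstlisting}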
The unique / interesting aspects of virtio-hostmem are demonstrated:
\begin{enumerate}
\item During instance creation the host was allowed to reject the request if
the codec device did not exist on host.
\item The host can expose a codec library buffer directly to the guest,
allowing the guest to write into it with zero copy and the host to decompress it without further copying.
\item Large bidirectional transfers are possible with zero copy.
\item Large bidirectional transfers are possible without scatterlists, because
the memory is always physically contiguous.
\item It is not necessary to use socket datagrams or data streams to
communicate the notification messages; they can be raw structs fresh off the
virtqueue.
\item After decoding, the guest has the option but not the requirement to wait
for the host round trip, allowing for async operation of the codec.
\end{enumerate}
\subsection{Device ID}\label{sec:Device Types / Host Memory Device / Device ID}
21
\subsection{Virtqueues}\label{sec:Device Types / Host Memory Device / Virtqueues}
\begin{description}
\item[0] config
\item[1] event
\item[2..n] notification
\end{description}
There is one notification virtqueue for each instance.
The maximum number of virtqueues is kept to an implementation-specific limit
that is stored in the configuration layout.
\subsection{Feature bits}\label{sec:Device Types / Host Memory Device / Feature bits}
No feature bits.
\subsubsection{Feature bit requirements}\label{sec:Device Types / Host Memory Device / Feature bit requirements}
No feature bit requirements.
\subsection{Device configuration layout}\label{sec:Device Types / Host Memory Device / Device configuration layout}
The configuration layout enumerates all sub-devices and a guest physical memory region
for each sub-device.
\begin{lstlisting}
#define MAX_DEVICES 32

struct virtio_hostmem_device_memory_region {
    le64 phys_addr_start;
    le64 phys_addr_end;
};

struct virtio_hostmem_device_info {
    le32 vendor_id;
    le32 device_id;
    le32 revision;
    struct virtio_hostmem_device_memory_region mem_region;
};

struct virtio_hostmem_config {
    le32 num_devices;
    struct virtio_hostmem_device_info available_devices[MAX_DEVICES];
    le32 MAX_INSTANCES;
};
\end{lstlisting}
One shared memory shmid is associated with each sub-device.
\field{virtio_hostmem_device_memory_region} describes a particular memory region
corresponding to a sub-device.
\field{virtio_hostmem_device_info} describes a particular usage of the device
in terms of the vendor / device ID and revision.
\field{num_devices} represents the number of valid entries in \field{available_devices}.
\field{available_devices} represents the set of available usages of virtio-hostmem (up to \field{MAX_DEVICES}).
\field{MAX_DEVICES} is the maximum number of sub-devices possible (here, set to 32).
\field{MAX_INSTANCES} is the maximum number of instances possible (here, set to 4096).
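As an illustration, a guest driver probing for a particular sub-device might scan the configuration like this (a sketch only; the helper name is hypothetical, and conversion of the le32/le64 fields to native endianness is omitted for brevity):
\begin{lstlisting}
#include <stdint.h>

/* Sketch: find a sub-device by vendor/device ID in the config layout.
 * Endianness conversion omitted; the function name is illustrative. */
const struct virtio_hostmem_device_info*
find_hostmem_subdevice(const struct virtio_hostmem_config *cfg,
                       uint32_t vendor_id, uint32_t device_id)
{
    for (uint32_t i = 0; i < cfg->num_devices && i < MAX_DEVICES; i++) {
        const struct virtio_hostmem_device_info *info =
            &cfg->available_devices[i];
        if (info->vendor_id == vendor_id && info->device_id == device_id)
            return info; /* info->mem_region is the fixed guest region */
    }
    return NULL; /* no such sub-device; instance creation would fail */
}
\end{lstlisting}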
\subsection{Device Initialization}\label{sec:Device Types / Host Memory Device / Device Initialization}
When the guest starts up, memory regions for each sub-device are reserved,
regardless of whether the virtio-hostmem device is plugged in.
Once the hostmem device is plugged in via PCI,
instance creation/destruction and message sending are allowed.
Otherwise, all operations fail with a guest-specific error code.
\subsection{Device Operation}\label{sec:Device Types / Host Memory Device / Device Operation}
\subsubsection{Config Virtqueue Messages}\label{sec:Device Types / Host Memory Device / Device Operation / Config Virtqueue Messages}
Operation always begins on the config virtqueue.
Messages transmitted or received on the config virtqueue are of the following structure:
\begin{lstlisting}
struct virtio_hostmem_config_msg {
    le32 msg_type;
    le32 vendor_id;
    le32 device_id;
    le32 revision;
    le64 instance_handle;
    le32 error;
};
\end{lstlisting}
\field{msg_type} can only be one of the following:
\begin{lstlisting}
enum {
    VIRTIO_HOSTMEM_CONFIG_OP_CREATE_INSTANCE,
    VIRTIO_HOSTMEM_CONFIG_OP_DESTROY_INSTANCE,
};
\end{lstlisting}
\field{error} can only be one of the following:
\begin{lstlisting}
enum {
    VIRTIO_HOSTMEM_ERROR_CONFIG_DEVICE_INITIALIZATION_FAILED,
    VIRTIO_HOSTMEM_ERROR_CONFIG_INSTANCE_CREATION_FAILED,
};
\end{lstlisting}
Instances are particular contexts surrounding usages of virtio-hostmem.
They control whether and how shared memory is allocated
and how messages are dispatched to the host.
\field{vendor_id}, \field{device_id}, and \field{revision}
distinguish how the hostmem device is used.
If they are supported on the host, as reported in the device configuration
(that is, if there exists a backend corresponding to those fields),
instance creation will succeed.
The vendor and device IDs must match exactly,
while \field{revision} matching can be more flexible depending on the use case.
Creating instances:
The guest sends a \field{virtio_hostmem_config_msg}
with \field{msg_type} equal to \field{VIRTIO_HOSTMEM_CONFIG_OP_CREATE_INSTANCE}
and with the \field{vendor_id}, \field{device_id}, \field{revision} fields set.
The guest sends this message on the config tx virtqueue.
On the host, a new \field{instance_handle} is generated.
If unsuccessful, \field{error} is set and sent back to the guest
on the config rx virtqueue, and the \field{instance_handle} is discarded.
If successful, a \field{virtio_hostmem_config_msg}
with \field{msg_type} equal to \field{VIRTIO_HOSTMEM_CONFIG_OP_CREATE_INSTANCE}
and \field{instance_handle} equal to the generated handle
is sent on the config rx virtqueue.
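Putting this together, a create request might be assembled as follows (a sketch; endianness conversion and virtqueue plumbing are omitted):
\begin{lstlisting}
/* Sketch of an instance creation request; illustrative only. */
struct virtio_hostmem_config_msg create_req = {
    .msg_type  = VIRTIO_HOSTMEM_CONFIG_OP_CREATE_INSTANCE,
    .vendor_id = vendor_id,  /* must name an available sub-device */
    .device_id = device_id,
    .revision  = revision,   /* matching rules are use-case dependent */
};
/* Send on the config tx virtqueue. On success, the reply on the
 * config rx virtqueue carries the newly generated instance_handle;
 * on failure, the reply's error field is set instead. */
\end{lstlisting}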
Destroying instances:
The guest sends a \field{virtio_hostmem_config_msg}
with \field{msg_type} equal to \field{VIRTIO_HOSTMEM_CONFIG_OP_DESTROY_INSTANCE}
on the config tx virtqueue.
The only field that needs to be populated
is \field{instance_handle}.
Destroying the instance does not affect the shared memory region.
It is up to the driver/device implementation to get reference counts correct.
Flow control: Only one config message is allowed to be in flight
either to or from the host at any time.
That is, the tx/rx handshakes for device enumeration, instance creation, and shared memory operations
are done in a globally visible, single-threaded manner.
This is to make it easier to synchronize operations on shared memory and instance creation.
\subsubsection{Notification Virtqueue Messages}\label{sec:Device Types / Host Memory Device / Device Operation / Notification Virtqueue Messages}
Once instances have been created,
the guest and host can already read and write memory; for some devices
that operate lock-free and wait-free without needing notifications, that may already be enough.
However, in order to avoid burning CPU in most cases,
most devices need some mechanism to trigger activity on the device
from the guest. This is captured via a separate message struct,
distinct from the config struct because it is smaller and
sending these messages is the common case.
These messages are sent from the guest to the host
on the notification virtqueue.
\begin{lstlisting}
struct virtio_hostmem_notification {
    le64 instance_handle;
    le64 metadata;
    le64 shm_offset;
    le64 shm_size;
    le64 phys_addr;
    le64 host_addr;
    le32 events;
};
\end{lstlisting}
\field{instance_handle} must be a valid instance handle.
\field{shm_offset} and \field{shm_size}
are treated as an offset and size into the sub-device's shared memory region.
For security reasons,
\field{host_addr} is only resolved on the host after the message arrives on the host,
and \field{phys_addr} is only resolved by either the host, or the guest kernel
(as it also knows about the regions).
This allows notifications to coherently access device memory
from both the host and guest, given a few extra considerations.
For example, for architectures that do not have store/load coherency (i.e., not x86)
an explicit set of fence or synchronization instructions will also be run by virtio-hostmem
after the message arrives on the host.
An alternative is to leave this up to the implementor of the virtual device,
but synchronizing views of the same memory is going to be such a common case
that it is probably a good idea to include synchronization out of the box.
It may also be common to block a guest thread until whatever is happening on the host completes.
That is the purpose of the \field{events} field; the guest populates it
if it desires to sync on host completion.
If \field{events} is nonzero, then a reply shall be sent
back to the guest via the event virtqueue,
with \field{revents} set to the appropriate value.
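As a sketch of the guest side of this contract (the fence below illustrates what virtio-hostmem or a guest driver needs on a weakly ordered architecture; the virtqueue submission helper is hypothetical):
\begin{lstlisting}
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: place a message on the instance's
 * notification virtqueue. */
void submit_notification(struct virtio_hostmem_notification *msg);

/* Sketch: fill a sub-region of shared memory, then publish it. */
void notify_host(struct virtio_hostmem_notification *msg,
                 uint8_t *shm_base, const void *data, size_t size)
{
    memcpy(shm_base + msg->shm_offset, data, size);

    /* Ensure the stores above are visible before the notification;
     * effectively a no-op on architectures with store/load coherency. */
    atomic_thread_fence(memory_order_release);

    submit_notification(msg);
}
\end{lstlisting}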
Flow control:
Arbitrary levels of traffic can be sent
on the notification virtqueues for guest instances.
Ordering of messages within an instance is trivially preserved
due to there being one virtqueue per instance.
Additional resources outside the virtqueue are used to hold incoming messages
if the virtqueue itself fills up.
This is similar to how virtio-vsock handles high traffic.
However, there will be a limit on the maximum number of messages in flight
to prevent the guest from over-notifying the host.
Once the limit is reached, the guest blocks until the number of messages in flight
is decreased.
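A guest driver might enforce such a limit with a counting semaphore (a sketch; the limit value and helper names are hypothetical, as the spec leaves the exact limit implementation-specific):
\begin{lstlisting}
#include <semaphore.h>

/* Hypothetical in-flight limit; the actual value is
 * implementation-specific. */
#define HOSTMEM_MAX_IN_FLIGHT 64

static sem_t in_flight_slots; /* initialized to HOSTMEM_MAX_IN_FLIGHT */

void send_notification_throttled(struct virtio_hostmem_notification *msg)
{
    sem_wait(&in_flight_slots); /* block while too many are outstanding */
    submit_notification(msg);   /* hypothetical helper from above */
}

void on_notification_consumed(void) /* used buffer notification */
{
    sem_post(&in_flight_slots); /* free one in-flight slot */
}
\end{lstlisting}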
The semantics of notification messages themselves are also not restricted to guest-to-host traffic;
the shared memory region named in the message can also be filled by the host
and treated as receive traffic by the guest.
The notification message is then suitable for DMA operations in both directions,
such as glTexImage2D and glReadPixels,
and audio/video (de)compression (guest populates shared memory with (de)compressed buffers,
sends notification message, host (de)compresses into the same memory region).
Note that we also impose precise semantics for any shared memory region transaction
that is associated with a notification virtqueue message.
On available buffer notification for the notification virtqueue:
this means the guest has already committed the information in the notification,
and that the guest has completed
any memory transaction over the offset/size range in the sub-device's physical memory region.
On used buffer notification for the notification virtqueue:
this means the host has consumed the notification and has completed
any memory transaction over the offset/size range in the sub-device's physical memory region.
\subsubsection{Event Virtqueue Messages}\label{sec:Device Types / Host Memory Device / Device Operation / Event Virtqueue Messages}
Notification virtqueue messages are enough to cover all async device operations;
that is, operations that do not require a round trip from the host.
This is useful for most kinds of graphics API forwarding along
with media codecs.
However, it can still be important to synchronize the guest on the completion
of a device operation.
In the driver, the interface can be similar to Linux UIO interrupts, for example:
a blocking read() of the device is done, and after it unblocks,
the operation has completed.
The exact way of waiting is dependent on the guest OS.
However, it is all implemented on the event virtqueue. The message type:
\begin{lstlisting}
struct virtio_hostmem_event {
    le64 instance_handle;
    le32 revents;
};
\end{lstlisting}
Event messages are sent back to the guest if the \field{events} field is nonzero,
as detailed in the section on notification virtqueue messages.
The guest driver can distinguish which instance receives which notification via
\field{instance_handle}.
The field \field{revents} is written by the device.
There is only one event virtqueue because it tends to be more complex to support
multiple queues at once in one interrupt handler,
and interrupt handlers are likely to play a role in most implementations;
the interrupt handler would watch the event virtqueue
and, on interrupt, empty it, unblocking guest threads that are
waiting for events.
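An interrupt handler along these lines might look like the following sketch (pop_event and wake_instance_waiters are hypothetical helpers; neither is defined by this spec):
\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: dequeue one used buffer from the event
 * virtqueue, and wake guest threads blocked on a given instance. */
bool pop_event(struct virtio_hostmem_event *ev);
void wake_instance_waiters(uint64_t instance_handle, uint32_t revents);

/* Sketch of the single event virtqueue interrupt path. */
void hostmem_event_irq(void)
{
    struct virtio_hostmem_event ev;

    /* Empty the queue; instance_handle routes each event to the
     * right instance, and revents says what completed. */
    while (pop_event(&ev))
        wake_instance_waiters(ev.instance_handle, ev.revents);
}
\end{lstlisting}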