ivshmem: Rewrite specification document

This started as an attempt to update ivshmem_device_spec.txt for
clarity, accuracy and completeness while working on its code, and
quickly became a full rewrite.  Since the diff would be useless
anyway, I'm using the opportunity to rename the file to
ivshmem-spec.txt.

I tried hard to ensure the new text contradicts neither the old text
nor the code.  If the new text contradicts the old text but not the
code, it's probably a bug in the old text.  If the new text
contradicts both, it's probably a bug in the new text.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <1458066895-20632-11-git-send-email-armbru@redhat.com>

parent 41b65e5eda
commit fdee2025dd

@@ -0,0 +1,243 @@
= Device Specification for Inter-VM shared memory device =

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host.  In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.
			

== Configuring the ivshmem PCI device ==

There are two basic configurations:

- Just shared memory: -device ivshmem,shm=NAME,...

  This uses shared memory object NAME.

- Shared memory plus interrupts: -device ivshmem,chardev=CHR,vectors=N,...

  An ivshmem server must already be running on the host.  The device
  connects to the server's UNIX domain socket via character device
  CHR.

  Each peer gets assigned a unique ID by the server.  IDs must be
  between 0 and 65535.

  Interrupts are message-signaled by default (MSI-X).  With msi=off
  the device has no MSI-X capability, and uses legacy INTx instead.
  vectors=N configures the number of vectors to use.

For more details on ivshmem device properties, see The QEMU Emulator
User Documentation (qemu-doc.*).
			

== The ivshmem PCI device's guest interface ==

The device has vendor ID 1af4, device ID 1110, revision 0.

=== PCI BARs ===

The ivshmem PCI device has two or three BARs:

- BAR0 holds device registers (256 Byte MMIO)
- BAR1 holds MSI-X table and PBA (only when using MSI-X)
- BAR2 maps the shared memory object

There are two ways to use this device:

- If you only need the shared memory part, BAR2 suffices.  This way,
  you have access to the shared memory in the guest and can use it as
  you see fit.  Memnic, for example, uses ivshmem this way from guest
  user space (see http://dpdk.org/browse/memnic).

- If you additionally need the capability for peers to interrupt each
  other, you need BAR0 and, if using MSI-X, BAR1.  You will most
  likely want to write a kernel driver to handle interrupts.  This
  requires the device to be configured for interrupts.

If the device is configured for interrupts, BAR2 is initially invalid.
It becomes safely accessible only after the ivshmem server has provided
the shared memory.  Guest software should wait for the IVPosition
register (described below) to become non-negative before accessing
BAR2.

The device cannot tell guest software whether it is configured for
interrupts.
			
=== PCI device registers ===

BAR 0 contains the following registers:

    Offset  Size  Access      On reset  Function
        0     4   read/write        0   Interrupt Mask
                                        bit 0: peer interrupt
                                        bit 1..31: reserved
        4     4   read/write        0   Interrupt Status
                                        bit 0: peer interrupt
                                        bit 1..31: reserved
        8     4   read-only   0 or -1   IVPosition
       12     4   write-only      N/A   Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
       16   240   none            N/A   reserved

Software should only access the registers as specified in column
"Access".  Reserved bits should be ignored on read, and preserved on
write.
			
Interrupt Status and Mask Register together control the legacy INTx
interrupt when the device has no MSI-X capability: INTx is asserted
when the bit-wise AND of Status and Mask is non-zero.  Interrupt
Status Register bit 0 becomes 1 when an interrupt request from a peer
is received.  Reading the register clears it.
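
The INTx rule can be modelled in a few lines of C.  This is only a
sketch under stated assumptions: struct intx_model and the function
names are made up for illustration and are not taken from QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 0 of Interrupt Mask and Interrupt Status: peer interrupt. */
#define IVSHMEM_INTR_PEER 0x1u

/* Hypothetical model of the device state that matters for INTx. */
struct intx_model {
    uint32_t mask;      /* Interrupt Mask register */
    uint32_t status;    /* Interrupt Status register */
    bool has_msix;      /* device has the MSI-X capability */
};

/* INTx is asserted iff there is no MSI-X and (Status & Mask) != 0. */
static bool intx_asserted(const struct intx_model *d)
{
    assert(d != NULL);
    return !d->has_msix && (d->status & d->mask) != 0;
}

/* An interrupt request from a peer sets Status bit 0. */
static void peer_interrupt(struct intx_model *d)
{
    assert(d != NULL);
    d->status |= IVSHMEM_INTR_PEER;
}

/* Reading Interrupt Status returns its value and clears it. */
static uint32_t read_status(struct intx_model *d)
{
    uint32_t val;

    assert(d != NULL);
    val = d->status;
    d->status = 0;
    return val;
}
```

With Mask bit 0 clear, peer_interrupt() sets the status bit without
asserting INTx; setting Mask bit 0 asserts it, and read_status()
deasserts it again.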
			
IVPosition Register: if the device is not configured for interrupts,
this is zero.  Else, it's -1 for a short while after reset, then
changes to the device's ID (between 0 and 65535).

There is no good way for software to find out whether the device is
configured for interrupts.  A positive IVPosition means interrupts,
but zero could be either.  The initial -1 cannot be reliably observed.
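
Guest software that waits for IVPosition to become non-negative
could do so roughly as follows.  This is a sketch, not driver code:
bar0_read32() and ivshmem_wait_ready() are hypothetical names, bar0 is
assumed to point at a mapping of BAR0 on a little-endian guest, and a
real driver would sleep between polls instead of spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IVSHMEM_REG_IVPOSITION 8    /* byte offset in BAR0 */
#define IVSHMEM_BAR0_SIZE      256  /* bytes */

/* Hypothetical accessor: read a 32-bit register from a BAR0 mapping. */
static uint32_t bar0_read32(const volatile uint32_t *bar0, size_t offset)
{
    assert(bar0 != NULL);
    assert(offset % 4 == 0 && offset < IVSHMEM_BAR0_SIZE);
    return bar0[offset / 4];
}

/*
 * Poll IVPosition at most max_polls times.  Once it reads as a
 * non-negative number, store it in *id and return true.  Return false
 * if it still reads as negative (-1) after the last poll.
 */
static bool ivshmem_wait_ready(const volatile uint32_t *bar0,
                               unsigned max_polls, uint16_t *id)
{
    assert(id != NULL);
    for (unsigned i = 0; i < max_polls; i++) {
        uint32_t raw = bar0_read32(bar0, IVSHMEM_REG_IVPOSITION);

        if (raw <= INT32_MAX) {     /* sign bit clear: non-negative */
            *id = (uint16_t)raw;
            return true;
        }
    }
    return false;
}
```

Only after ivshmem_wait_ready() returns true should the caller touch
BAR2.  Note that a result of zero does not tell the caller whether the
device is configured for interrupts.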
			
Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.
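
The Doorbell value layout can be illustrated with a small helper.
This is a minimal sketch: the function names are hypothetical, and
bar0 is assumed to point at a mapping of BAR0 on a little-endian
guest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IVSHMEM_REG_DOORBELL 12     /* byte offset in BAR0 */

/* High 16 bits: peer ID.  Low 16 bits: interrupt vector. */
static uint32_t ivshmem_doorbell_value(uint16_t peer_id, uint16_t vector)
{
    return ((uint32_t)peer_id << 16) | vector;
}

/* Request an interrupt for the given peer and vector. */
static void ivshmem_ring(volatile uint32_t *bar0,
                         uint16_t peer_id, uint16_t vector)
{
    assert(bar0 != NULL);
    bar0[IVSHMEM_REG_DOORBELL / 4] = ivshmem_doorbell_value(peer_id, vector);
}
```

For example, interrupting peer 3 on vector 1 means writing 0x00030001
to the Doorbell register.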
			
If the device is not configured for interrupts, the write is ignored.

If interrupt setup hasn't completed, the write is ignored.  The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected, the write is ignored.  The device cannot
tell guest software what peers are connected, or how many interrupt
vectors are connected.

If the peer doesn't use MSI-X, its Interrupt Status register is set to
1.  This asserts INTx unless masked by the Interrupt Mask register.
The device cannot communicate the interrupt vector to guest software
then.

If the peer uses MSI-X, the interrupt for this vector becomes pending.
There is no way for software to clear the pending bit, and a polling
mode of operation is therefore impossible with MSI-X.

With multiple MSI-X vectors, different vectors can be used to indicate
different events have occurred.  The semantics of interrupt vectors
are left to the application.
			

== Interrupt infrastructure ==

When configured for interrupts, the peers share eventfd objects in
addition to shared memory.  The shared resources are managed by an
ivshmem server.

=== The ivshmem server ===

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue.  They can
communicate with each other normally, but won't receive disconnect
notifications when a peer disconnects, and no new clients can connect.
There is no way for the clients to connect to a restarted server.  The
device cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/.  Not to be used in
production.  It assumes all clients use the same number of interrupt
vectors.

A standalone client is in contrib/ivshmem-client/.  It can be useful
for debugging.
			
=== The ivshmem Client-Server Protocol ===

An ivshmem device configured for interrupts connects to an ivshmem
server.  This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8 byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
client and server close the connection on error.
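
The message encoding can be sketched in portable C as follows.  The
function names are hypothetical, and receiving the accompanying file
descriptor via SCM_RIGHTS is deliberately left out; only the 8-byte
number is handled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode a message as an 8-byte little-endian signed number. */
static void ivshmem_msg_encode(int64_t val, unsigned char buf[8])
{
    uint64_t u = (uint64_t)val;     /* well-defined, modulo 2^64 */

    assert(buf != NULL);
    for (int i = 0; i < 8; i++) {
        buf[i] = (unsigned char)(u >> (8 * i));
    }
}

/* Decode an 8-byte little-endian signed number, on any host. */
static int64_t ivshmem_msg_decode(const unsigned char buf[8])
{
    uint64_t u = 0;
    int64_t val;

    assert(buf != NULL);
    for (int i = 7; i >= 0; i--) {
        u = (u << 8) | buf[i];
    }
    /* Reinterpret the bits; int64_t is two's complement by definition. */
    memcpy(&val, &u, sizeof(val));
    return val;
}
```

Assembling the value byte by byte keeps the code independent of the
host's byte order, so the same code works for a big-endian client.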
			
On connect, the server sends the following messages in order:

1. The protocol version number, currently zero.  The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID.  This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any.  This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times.  Each repetition is accompanied by one file
   descriptor.  These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order.  If the client is configured for fewer
   vectors, it closes the extra file descriptors.  If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup.  This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor.  These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order.  If the client is configured for fewer vectors, it closes
   the extra file descriptors.  If it is configured for more, the
   extra vectors remain unconnected.

From then on, the server sends these kinds of messages:

6. Connection / disconnection notification.  This is a peer ID.

  - If the number comes with a file descriptor, it's a connection
    notification, exactly like in step 4.

  - Else, it's a disconnection notification for the peer with that ID.
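
A client can tell the message kinds apart from the number, whether a
file descriptor came with it, and how far the handshake has got.  The
sketch below illustrates this.  All names are hypothetical.  Treating
an unexpected file descriptor in steps 1 and 2, or the client's own ID
without a file descriptor, as an error is an assumption of this
sketch; the text above does not specify those cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ivshmem_msg_kind {
    MSG_VERSION,            /* step 1 */
    MSG_OWN_ID,             /* step 2 */
    MSG_SHMEM,              /* step 3 */
    MSG_PEER_CONNECT,       /* steps 4 and 6, with file descriptor */
    MSG_INTR_SETUP,         /* step 5 */
    MSG_PEER_DISCONNECT,    /* step 6, without file descriptor */
    MSG_ERROR               /* client should close the connection */
};

/* Hypothetical client state: handshake progress and own ID. */
struct ivshmem_client {
    int step;               /* 0: version, 1: ID, 2: shmem, 3: running */
    int64_t own_id;
};

static enum ivshmem_msg_kind
ivshmem_classify(struct ivshmem_client *c, int64_t val, bool has_fd)
{
    assert(c != NULL);
    switch (c->step) {
    case 0:     /* only protocol version zero is understood */
        if (val != 0 || has_fd) {       /* fd check: assumption */
            return MSG_ERROR;
        }
        c->step = 1;
        return MSG_VERSION;
    case 1:
        if (val < 0 || val > 65535 || has_fd) {
            return MSG_ERROR;
        }
        c->own_id = val;
        c->step = 2;
        return MSG_OWN_ID;
    case 2:
        if (val != -1 || !has_fd) {
            return MSG_ERROR;
        }
        c->step = 3;
        return MSG_SHMEM;
    default:
        if (val < 0 || val > 65535) {
            return MSG_ERROR;
        }
        if (val == c->own_id) {         /* no fd: assumed an error */
            return has_fd ? MSG_INTR_SETUP : MSG_ERROR;
        }
        return has_fd ? MSG_PEER_CONNECT : MSG_PEER_DISCONNECT;
    }
}
```

The caller remains responsible for counting repetitions per ID to map
each file descriptor to its vector, and for closing file descriptors
beyond its configured number of vectors.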
			
Known bugs:

* The protocol changed incompatibly in QEMU 2.5.  Before, messages
  were native endian long, and there was no version number.

* The protocol is poorly designed.
			
=== The ivshmem Client-Client Protocol ===

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.
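
On Linux, the two operations map directly onto write() and read() on
an eventfd.  The following is a minimal sketch with hypothetical
function names.  It assumes the receiving file descriptor is in
non-blocking mode, so that "as many as it can" ends with EAGAIN
instead of blocking.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Interrupt a peer: write the 8-byte integer 1 in native byte order. */
static int ivshmem_notify(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Receive interrupts: read and discard 8-byte integers until the
 * non-blocking descriptor has nothing left.  Returns the number of
 * integers read, or -1 on error.
 */
static int ivshmem_drain(int fd)
{
    uint64_t val;
    int n = 0;

    assert(fd >= 0);
    for (;;) {
        ssize_t ret = read(fd, &val, sizeof(val));

        if (ret == (ssize_t)sizeof(val)) {
            n++;                    /* value itself is discarded */
        } else if (ret < 0 && errno == EINTR) {
            continue;               /* interrupted, try again */
        } else if (ret < 0 && errno == EAGAIN) {
            return n;               /* nothing left to read */
        } else {
            return -1;
        }
    }
}
```

Because a plain eventfd accumulates writes into one counter, several
notifications that arrive before the receiver reads are seen as a
single read.  Interrupts can therefore coalesce, and any event counts
the application needs have to live in the shared memory.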
			
@@ -1,161 +0,0 @@
Device Specification for Inter-VM shared memory device
------------------------------------------------------

The Inter-VM shared memory device is designed to share a memory region (created
on the host via the POSIX shared memory API) between multiple QEMU processes
running different guests. In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing said memory
to the guest as a PCI BAR.
The memory region does not belong to any guest, but is a POSIX memory object on
the host. The host can access this shared memory if needed.

The device also provides an optional communication mechanism between guests
sharing the same memory object. More details about that in the 'Guest to
guest communication' section.
			

The Inter-VM PCI device
-----------------------

From the VM point of view, the ivshmem PCI device supports three BARs.

- BAR0 is a 1 Kbyte MMIO region to support registers and interrupts when MSI is
  not used.
- BAR1 is used for MSI-X when it is enabled in the device.
- BAR2 is used to access the shared memory object.

It is your choice how to use the device but you must choose between two
behaviors:

- basically, if you only need the shared memory part, you will map BAR2.
  This way, you have access to the shared memory in guest and can use it as you
  see fit (memnic, for example, uses it in userland
  http://dpdk.org/browse/memnic).

- BAR0 and BAR1 are used to implement an optional communication mechanism
  through interrupts in the guests. If you need an event mechanism between the
  guests accessing the shared memory, you will most likely want to write a
  kernel driver that will handle interrupts. See details in the 'Guest
  to guest communication' section.

The behavior is chosen when starting your QEMU processes:
- no communication mechanism needed, the first QEMU to start creates the shared
  memory on the host, subsequent QEMU processes will use it.

- communication mechanism needed, an ivshmem server must be started before any
  QEMU processes, then each QEMU process connects to the server unix socket.

For more details on the QEMU ivshmem parameters, see qemu-doc documentation.
			

Guest to guest communication
----------------------------

This section details the communication mechanism between the guests accessing
the ivshmem shared memory.

*ivshmem server*

This server code is available in qemu.git/contrib/ivshmem-server.

The server must be started on the host before any guest.
It creates a shared memory object then waits for clients to connect on a unix
socket. All the messages are little-endian int64_t integers.

For each client (QEMU process) that connects to the server:
- the server sends a protocol version, if client does not support it, the client
  closes the communication,
- the server assigns an ID for this client and sends this ID to it as the first
  message,
- the server sends a fd to the shared memory object to this client,
- the server creates a new set of host eventfds associated to the new client and
  sends this set to all already connected clients,
- finally, the server sends all the eventfds sets for all clients to the new
  client.

The server signals all clients when one of them disconnects.

The client IDs are limited to 16 bits because of the current implementation (see
Doorbell register in 'PCI device registers' subsection). Hence only 65536
clients are supported.

All the file descriptors (fd to the shared memory, eventfds for each client)
are passed to clients using SCM_RIGHTS over the server unix socket.

Apart from the current ivshmem implementation in QEMU, an ivshmem client has
been provided in qemu.git/contrib/ivshmem-client for debug.
			
*QEMU as an ivshmem client*

At initialisation, when creating the ivshmem device, QEMU first receives a
protocol version and closes communication with server if it does not match.
Then, QEMU gets its ID from the server then makes it available through BAR0
IVPosition register for the VM to use (see 'PCI device registers' subsection).
QEMU then uses the fd to the shared memory to map it to BAR2.
eventfds for all other clients received from the server are stored to implement
BAR0 Doorbell register (see 'PCI device registers' subsection).
Finally, eventfds assigned to this QEMU process are used to send interrupts in
this VM.
			
*PCI device registers*

From the VM point of view, the ivshmem PCI device supports 4 registers of
32 bits each.

enum ivshmem_registers {
    IntrMask = 0,
    IntrStatus = 4,
    IVPosition = 8,
    Doorbell = 12
};

The first two registers are the interrupt mask and status registers.  Mask and
status are only used with pin-based interrupts.  They are unused with MSI
interrupts.

Status Register: The status register is set to 1 when an interrupt occurs.

Mask Register: The mask register is bitwise ANDed with the interrupt status
and the result will raise an interrupt if it is non-zero.  However, since 1 is
the only value the status will be set to, it is only the first bit of the mask
that has any effect.  Therefore interrupts can be masked by setting the first
bit to 0 and unmasked by setting the first bit to 1.

IVPosition Register: The IVPosition register is read-only and reports the
guest's ID number.  The guest IDs are non-negative integers.  When using the
server, since the server is a separate process, the VM ID will only be set when
the device is ready (shared memory is received from the server and accessible
via the device).  If the device is not ready, the IVPosition will return -1.
Applications should ensure that they have a valid VM ID before accessing the
shared memory.

Doorbell Register:  To interrupt another guest, a guest must write to the
Doorbell register.  The doorbell register is 32 bits, logically divided into
two 16-bit fields.  The high 16 bits are the guest ID to interrupt and the low
16 bits are the interrupt vector to trigger.  The semantics of the value
written to the doorbell depends on whether the device is using MSI or a regular
pin-based interrupt.  In short, MSI uses vectors while regular interrupts set
the status register.
			
Regular Interrupts

If regular interrupts are used (due to either a guest not supporting MSI or the
user specifying not to use them on startup) then the value written to the lower
16 bits of the Doorbell register is arbitrary and will trigger an
interrupt in the destination guest.

Message Signalled Interrupts

An ivshmem device may support multiple MSI vectors.  If so, the lower 16 bits
written to the Doorbell register must be between 0 and the maximum number of
vectors the guest supports.  The lower 16 bits written to the doorbell is the
MSI vector that will be raised in the destination guest.  The number of MSI
vectors is configurable but it is set when the VM is started.

The important thing to remember with MSI is that it is only a signal, no status
is set (since MSI interrupts are not shared).  All information other than the
interrupt itself should be communicated via the shared memory region.  Devices
supporting multiple MSI vectors can use different vectors to indicate different
events have occurred.  The semantics of interrupt vectors are left to the
user's discretion.