Copyright (c) 2014 Red Hat Inc.

This work is licensed under the terms of the GNU GPL, version 2 or later.  See
the COPYING file in the top-level directory.


This document explains the IOThread feature and how to write code that runs
outside the QEMU global mutex.

The main loop and IOThreads
---------------------------
QEMU is an event-driven program that can do several things at once using an
event loop.  The VNC server and the QMP monitor are both processed from the
same event loop, which monitors their file descriptors until they become
readable and then invokes a callback.

The default event loop is called the main loop (see main-loop.c).  It is
possible to create additional event loop threads using -object
iothread,id=my-iothread.
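
For example, a sketch of a command line that creates an IOThread and, on QEMU
versions where virtio-blk exposes an iothread property, binds a disk to it
(the object id, drive id, and image file name here are illustrative):

    qemu-system-x86_64 -object iothread,id=my-iothread \
                       -drive if=none,id=drive0,file=disk.img,format=raw \
                       -device virtio-blk-pci,drive=drive0,iothread=my-iothread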

Side note: The main loop and IOThread are both event loops but their code is
not shared completely.  Sometimes it is useful to remember that although they
are conceptually similar they are currently not interchangeable.

Why IOThreads are useful
------------------------
IOThreads allow the user to control the placement of work.  The main loop is a
scalability bottleneck on hosts with many CPUs.  Work can be spread across
several IOThreads instead of just one main loop.  When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.

The main loop is also deeply associated with the QEMU global mutex, which is a
scalability bottleneck in itself.  vCPU threads and the main loop use the QEMU
global mutex to serialize execution of QEMU code.  This mutex is necessary
because a lot of QEMU's code historically was not thread-safe.

The fact that all I/O processing is done in a single main loop and that the
QEMU global mutex is contended by all vCPU threads and the main loop explains
why it is desirable to place work into IOThreads.

The experimental virtio-blk data-plane implementation has been benchmarked and
shows these effects:
ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf

How to program for IOThreads
----------------------------
The main difference between legacy code and new code that can run in an
IOThread is dealing explicitly with the event loop object, AioContext
(see include/block/aio.h).  Code that only works in the main loop
implicitly uses the main loop's AioContext.  Code that supports running
in IOThreads must be aware of its AioContext.

AioContext supports the following services:
 * File descriptor monitoring (read/write/error on POSIX hosts)
 * Event notifiers (inter-thread signalling)
 * Timers
 * Bottom Halves (BH) deferred callbacks

There are several old APIs that use the main loop AioContext:
 * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
 * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
 * LEGACY timer_new_ms() - create a timer
 * LEGACY qemu_bh_new() - create a BH
 * LEGACY qemu_aio_wait() - run an event loop iteration

Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread.  They might cause a crash or deadlock if called from an
IOThread since the QEMU global mutex is not held.

Instead, use the AioContext functions directly (see include/block/aio.h):
 * aio_set_fd_handler() - monitor a file descriptor
 * aio_set_event_notifier() - monitor an event notifier
 * aio_timer_new() - create a timer
 * aio_bh_new() - create a BH
 * aio_poll() - run an event loop iteration
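
For example, a minimal sketch of a timer that runs in an arbitrary AioContext,
using aio_timer_new() where legacy code would use timer_new_ms() (the callback
and function names are illustrative):

    #include "block/aio.h"
    #include "qemu/timer.h"

    static void my_timer_cb(void *opaque)
    {
        /* Invoked by the event loop that owns the AioContext */
    }

    static void start_my_timer(AioContext *ctx)
    {
        /* Bound to @ctx instead of implicitly to the main loop */
        QEMUTimer *timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_MS,
                                         my_timer_cb, NULL);

        /* Fire one second from now; remember to timer_del()/timer_free()
         * the timer when it is no longer needed.
         */
        timer_mod(timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 1000);
    }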

The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works in both IOThreads and the main
loop, depending on which AioContext instance the caller passes in.
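
For example, a sketch of code written once and run in either event loop (the
my_subsystem_init() function and callback are illustrative, not an existing
QEMU API):

    #include "block/aio.h"

    static void my_init_done_bh(void *opaque)
    {
        /* Invoked by the event loop that owns the AioContext */
    }

    /* Works in an IOThread or the main loop, depending on @ctx */
    static void my_subsystem_init(AioContext *ctx)
    {
        qemu_bh_schedule(aio_bh_new(ctx, my_init_done_bh, NULL));
    }

Main loop callers would pass qemu_get_aio_context() while IOThread users would
pass iothread_get_aio_context(iothread).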

How to synchronize with an IOThread
-----------------------------------
AioContext is not thread-safe so some rules must be followed when using file
descriptors, event notifiers, timers, or BHs across threads:

1. AioContext functions can be called safely from file descriptor, event
notifier, timer, or BH callbacks invoked by the AioContext.  No locking is
necessary.

2. Other threads wishing to access the AioContext must use
aio_context_acquire()/aio_context_release() for mutual exclusion.  Once the
context is acquired no other thread can access it or run event loop iterations
in this AioContext.

aio_context_acquire()/aio_context_release() calls may be nested.  This
means you can call them if you're not sure whether #1 applies.
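
For example, a sketch of rule #2 in action; because acquire/release calls
nest, it is safe to include the pair even when the caller might already be
inside one of this AioContext's callbacks (the function name is illustrative
and my_timer_cb is the callback from the earlier sketch):

    static void my_add_timer(AioContext *ctx, QEMUTimer **timer)
    {
        aio_context_acquire(ctx);
        *timer = aio_timer_new(ctx, QEMU_CLOCK_REALTIME, SCALE_MS,
                               my_timer_cb, NULL);
        timer_mod(*timer, qemu_clock_get_ms(QEMU_CLOCK_REALTIME) + 100);
        aio_context_release(ctx);
    }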

There is currently no lock ordering rule if a thread needs to acquire multiple
AioContexts simultaneously.  Therefore, it is only safe for code holding the
QEMU global mutex to acquire other AioContexts.

Side note: the best way to schedule a function call across threads is to create
a BH in the target AioContext beforehand and then call qemu_bh_schedule().  No
acquire/release or locking is needed for the qemu_bh_schedule() call.  But be
sure to acquire the AioContext for aio_bh_new() if necessary.
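
A sketch of that pattern (the variable and callback names are illustrative):

    /* Setup: create the BH in the target AioContext.  The acquire/release
     * pair can be skipped if rule #1 applies to the creating thread.
     */
    aio_context_acquire(ctx);
    notify_bh = aio_bh_new(ctx, my_notify_cb, my_state);
    aio_context_release(ctx);

    /* Later, from any thread, without any locking: */
    qemu_bh_schedule(notify_bh);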

The relationship between AioContext and the block layer
-------------------------------------------------------
The AioContext originates from the QEMU block layer because it provides a
scoped way of running event loop iterations until all work is done.  This
feature is used to complete all in-flight block I/O requests (see
bdrv_drain_all()).  Nowadays AioContext is a generic event loop that can be
used by any QEMU subsystem.

The block layer has support for AioContext integrated.  Each BlockDriverState
is associated with an AioContext using bdrv_set_aio_context() and
bdrv_get_aio_context().  This allows block layer code to process I/O inside the
right AioContext.  Other subsystems may wish to follow a similar approach.
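
For example, a sketch of binding a BlockDriverState to an IOThread (bs and
iothread are variables the caller is assumed to own already):

    AioContext *ctx = iothread_get_aio_context(iothread);

    /* From now on bs's I/O is processed in the IOThread's event loop */
    bdrv_set_aio_context(bs, ctx);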

Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop.  See the "How to program for
IOThreads" section above for information on how to do that.

If main loop code such as a QMP function wishes to access a BlockDriverState it
must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure the
IOThread does not run in parallel.
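
For example, a sketch of the pattern in a QMP handler (error handling is
omitted and bdrv_flush() stands in for whatever bdrv_*() calls are needed):

    AioContext *ctx = bdrv_get_aio_context(bs);

    aio_context_acquire(ctx);
    bdrv_flush(bs);
    aio_context_release(ctx);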

Long-running jobs (usually in the form of coroutines) are best scheduled in the
BlockDriverState's AioContext to avoid the need to acquire/release around each
bdrv_*() call.  Be aware that there is currently no mechanism to get notified
when bdrv_set_aio_context() moves this BlockDriverState to a different
AioContext (see bdrv_detach_aio_context()/bdrv_attach_aio_context()), so you
may need to add this if you want to support long-running jobs.