Detected by Coverity (CID 1385702). This fixes the recently added hypercall
to let guests properly apply Spectre and Meltdown workarounds.
Fixes: c59704b254 "target/ppc/spapr: Add H-Call H_GET_CPU_CHARACTERISTICS"
Reported-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit fa86f59234)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Start a VM with:
qemu-kvm -enable-kvm -vnc :66 -smp 1 -m 1024 -hda redhat_5.11.qcow2 -device pcnet -vga cirrus
Then connect to the VM with a VNC client and execute the code below in the
guest OS; this will crash QEMU:
#include <stdlib.h>   /* rand, srand */
#include <time.h>     /* time */
#include <sys/io.h>   /* iopl, outb */

int main(void)
{
    iopl(3);                          /* allow direct I/O port access */
    srand(time(NULL));
    int a, b;
    while (1) {
        a = rand() % 0x100;           /* random byte value to write */
        b = 0x3c0 + (rand() % 0x20);  /* random VGA I/O port in 0x3c0-0x3df */
        outb(a, b);
    }
    return 0;
}
The above code writes to the VGA registers at random. Writing the VGA CRT
controller registers at index 0x0C or 0x0D (the start address register)
modifies the display memory address of the upper-left pixel or character of
the screen. That address may fall outside the range of the VGA RAM, so we
should validate the memory address when reading or writing it to avoid a
segfault.
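The idea of the fix can be sketched as follows (illustrative only, not the
literal patch; the field names follow QEMU's VGACommonState):

    /* the CRTC start address (regs 0x0C/0x0D) is an offset into video RAM
     * and must be validated before display memory is touched */
    uint32_t offset = s->start_addr;
    if (offset >= s->vram_size) {
        return;   /* guest programmed an out-of-range address: skip the access */
    }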
Signed-off-by: linzhecheng <linzhecheng@huawei.com>
Message-id: 20180111132724.13744-1-linzhecheng@huawei.com
Fixes: CVE-2018-5683
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 191f59dc17)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
If kbd_queue is not empty and queue_count >= queue_limit, the event cannot
be queued, so we should free evt to avoid leaking it.
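A minimal sketch of the intended logic (illustrative; helper names follow
ui/input.c but the exact code differs):

    if (queue_count < queue_limit) {
        qemu_input_queue_event(&kbd_queue, src, evt);   /* queue owns evt now */
    } else {
        qapi_free_InputEvent(evt);   /* queue full: drop the event, but free it */
    }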
Change-Id: Ieeacf90d5e7e370a40452ec79031912d8b864d83
Signed-off-by: linzhecheng <linzhecheng@huawei.com>
Message-id: 20171225023730.5512-1-linzhecheng@huawei.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit fca4774a96)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
In this previous commit:
commit 8f61f1c5a6
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Mon Dec 18 19:12:20 2017 +0000
ui: track how much decoded data we consumed when doing SASL encoding
I attempted to fix a flaw with tracking how much data had actually been
processed when encoding with SASL. With that flaw, the VNC server could
mistakenly discard queued data that had not been sent.
The fix was not quite right though, because it merely decremented the
vs->output.offset value. This is effectively discarding data from the
end of the pending output buffer. We actually need to discard data from
the start of the pending output buffer. We also want to free memory that
is no longer required. The correct way to handle this is to use the
buffer_advance() helper method instead of directly manipulating the
offset value.
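In terms of QEMU's Buffer helpers the difference is roughly this (sketch;
'consumed' stands for the number of raw bytes that were SASL-encoded and have
now been sent):

    /* wrong: shrinks the buffer from the end, discarding data never sent */
    vs->output.offset -= consumed;

    /* right: drops the already-sent bytes from the front of the pending
     * buffer and releases memory that is no longer needed */
    buffer_advance(&vs->output, consumed);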
Reported-by: Laszlo Ersek <lersek@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Message-id: 20180201155841.27509-1-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 627ebec208)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Pixman returns a signed int for the image width/height, but the VNC
protocol only permits an unsigned 16-bit integer. The effective framebuffer
size is determined by the guest, limited by the video RAM size, so the
dimensions are unlikely to exceed the range of an unsigned int16,
but this is not currently validated.
With the current use of 'int' for client width/height, the calculation
of offsets in vnc_update_throttle_offset() suffers from integer size
promotion and sign extension, causing Coverity warnings:
*** CID 1385147: Integer handling issues (SIGN_EXTENSION)
/ui/vnc.c: 979 in vnc_update_throttle_offset()
973 * than that the client would already suffering awful audio
974 * glitches, so dropping samples is no worse really).
975 */
976 static void vnc_update_throttle_offset(VncState *vs)
977 {
978 size_t offset =
>>> CID 1385147: Integer handling issues (SIGN_EXTENSION)
>>> Suspicious implicit sign extension:
"vs->client_pf.bytes_per_pixel" with type "unsigned char" (8 bits,
unsigned) is promoted in "vs->client_width * vs->client_height *
vs->client_pf.bytes_per_pixel" to type "int" (32 bits, signed), then
sign-extended to type "unsigned long" (64 bits, unsigned). If
"vs->client_width * vs->client_height * vs->client_pf.bytes_per_pixel"
is greater than 0x7FFFFFFF, the upper bits of the result will all be 1.
979 vs->client_width * vs->client_height * vs->client_pf.bytes_per_pixel;
Change client_width / client_height to be a size_t to avoid sign
extension and integer promotion. Then validate that dimensions are in
range wrt the RFB protocol u16 limits.
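For illustration, the two changes amount to something like this (sketch only;
'width' and 'height' stand for the guest's new framebuffer dimensions, and the
rejection path is simplified):

    /* in VncState: use size_t so the product below is computed in
     * unsigned 64-bit arithmetic rather than (signed) int */
    size_t client_width;
    size_t client_height;

    size_t offset = vs->client_width * vs->client_height *
                    vs->client_pf.bytes_per_pixel;

    /* validate against what the RFB protocol can express (u16) */
    if (width > UINT16_MAX || height > UINT16_MAX) {
        /* reject the resize instead of silently truncating */
    }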
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Message-id: 20180118155254.17053-1-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 4c956bd81e)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
While the QIOChannel APIs for reading/writing data return ssize_t, with a negative
value indicating an error, the VNC code passes this return value through the
vnc_client_io_error() method. This detects the error condition, disconnects the
client and returns 0 to indicate an error. Thus all the VNC helper methods should
return size_t (unsigned), and the misleading comments which refer to the possibility
of negative return values need fixing.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-14-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 30b80fd526)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The VNC client throttling is quite subtle so will benefit from having trace
points available for live debugging.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-13-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 6aa22a2918)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The previous patches fix problems with throttling of forced framebuffer updates
and audio data capture that would cause the QEMU output buffer size to grow
without bound. Those fixes are graceful in that once the client catches up with
reading data from the server, everything continues operating normally.
There is some data which the server sends to the client that is impractical to
throttle. Specifically there are various pseudo framebuffer update encodings to
inform the client of things like desktop resizes, pointer changes, audio
playback start/stop, LED state and so on. These generally only involve sending
a very small amount of data to the client, but a malicious guest might be able
to do things that trigger these changes at a very high rate. Throttling them is
not practical as missed or delayed events would cause broken behaviour for the
client.
This patch thus takes a more forceful approach of setting an absolute upper
bound on the amount of data we permit to be present in the output buffer at
any time. The previous patch set a threshold for throttling the output buffer
by allowing an amount of data equivalent to one complete framebuffer update and
one second's worth of audio data. On top of this it allowed for one further
forced framebuffer update to be queued.
To be conservative, we thus take that throttling threshold and multiply it by
5 to form an absolute upper bound. If this bound is hit during vnc_write() we
forcibly disconnect the client, refusing to queue further data. This limit is
high enough that it should never be hit unless a malicious client is trying to
exploit the server, or the network is completely saturated, preventing any data
from being sent on the socket.
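Conceptually the hard cap in vnc_write() looks like this (sketch; the field
and helper names are illustrative, and '5' mirrors the multiplier described
above):

    if (vs->throttle_output_offset != 0 &&
        vs->output.offset + len > vs->throttle_output_offset * 5) {
        vnc_disconnect_start(vs);   /* refuse to queue any more data */
        return;
    }
    buffer_reserve(&vs->output, len);
    buffer_append(&vs->output, data, len);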
This completes the fix for CVE-2017-15124 started in the previous patches.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-12-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit f887cf165d)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The VNC server must throttle data sent to the client to prevent the 'output'
buffer size growing without bound, if the client stops reading data off the
socket (either maliciously or due to stalled/slow network connection).
The current throttling is very crude because it simply checks whether the
output buffer offset is zero. This check is disabled if the client has requested
a forced update, because we want to send these as soon as possible.
As a result, a VNC client can cause QEMU to allocate arbitrary amounts of RAM.
The client can first start something in the guest that triggers lots of framebuffer
updates, e.g. playing a YouTube video, then repeatedly send full framebuffer update
requests but never read data back from the server. This can easily make QEMU's
VNC server send buffer consume 100MB of RAM per second, until the OOM killer
starts reaping processes (hopefully the rogue QEMU process, but it might pick
others...).
To address this we make the throttling more intelligent, so we can throttle
full updates. When we get a forced update request, we keep track of exactly how
much data we put on the output buffer. We will not process a subsequent forced
update request until this data has been fully sent on the wire. We always allow
one forced update request to be in flight, regardless of what data is queued
for incremental updates or audio data. The slight complication is that we do
not initially know how much data an update will send, as this is done in the
background by the VNC job thread. So we must track the fact that the job thread
has an update pending, and not process any further updates until this job
has been completed and has put data on the output buffer.
This unbounded memory growth affects all VNC server configurations supported by
QEMU, with no workaround possible. The mitigating factor is that it can only be
triggered by a client that has authenticated with the VNC server, and who is
able to trigger a large quantity of framebuffer updates or audio samples from
the guest OS. Mostly they'll just succeed in getting the OOM killer to kill
their own QEMU process, but it's possible other processes can get taken out as
collateral damage.
This is a more general variant of the similar unbounded memory usage flaw in
the websockets server, that was previously assigned CVE-2017-15268, and fixed
in 2.11 by:
commit a7b20a8efa
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Mon Oct 9 14:43:42 2017 +0100
io: monitor encoutput buffer size from websocket GSource
This new general memory usage flaw has been assigned CVE-2017-15124, and is
partially fixed by this patch.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-11-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit ada8d2e436)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The VNC server must throttle data sent to the client to prevent the 'output'
buffer size growing without bound, if the client stops reading data off the
socket (either maliciously or due to stalled/slow network connection).
The current throttling is very crude because it simply checks whether the
output buffer offset is zero. This check must be disabled if audio capture is
enabled, because when streaming audio the output buffer offset will rarely be
zero due to queued audio data, and so this would starve framebuffer updates.
As a result, a VNC client can cause QEMU to allocate arbitrary amounts of RAM.
The client can first start something in the guest that triggers lots of framebuffer
updates, e.g. playing a YouTube video, then enable audio capture and simply never
read data back from the server. This can easily make QEMU's VNC server send
buffer consume 100MB of RAM per second, until the OOM killer starts reaping
processes (hopefully the rogue QEMU process, but it might pick others...).
To address this we make the throttling more intelligent, so we can throttle
when audio capture is active too. To determine how to throttle incremental
updates or audio data, we calculate a size threshold. Normally the threshold is
the approximate number of bytes associated with a single complete framebuffer
update, i.e. width * height * bytes per pixel. We'll send incremental updates
until we hit this threshold, at which point we'll stop sending updates until
data has been written to the wire, causing the output buffer offset to fall
back below the threshold.
If audio capture is enabled, we increase the size of the threshold to also
allow for up to one second's worth of audio data samples, i.e. nchannels * bytes
per sample * frequency. This allows the output buffer to have a mixture of
incremental framebuffer updates and audio data queued, but once the threshold
is exceeded, audio data will be dropped and incremental updates will be
throttled.
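In code the threshold sketched above is roughly the following (illustrative;
field names follow the description, and 'bytes_per_sample' stands for the
sample size implied by the negotiated audio format):

    size_t threshold = vs->client_width * vs->client_height *
                       vs->client_pf.bytes_per_pixel;   /* one full framebuffer update */
    if (vs->audio_cap) {
        /* plus roughly one second's worth of audio samples */
        threshold += vs->as.freq * bytes_per_sample * vs->as.nchannels;
    }
    vs->throttle_output_offset = threshold;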
This unbounded memory growth affects all VNC server configurations supported by
QEMU, with no workaround possible. The mitigating factor is that it can only be
triggered by a client that has authenticated with the VNC server, and who is
able to trigger a large quantity of framebuffer updates or audio samples from
the guest OS. Mostly they'll just succeed in getting the OOM killer to kill
their own QEMU process, but it's possible other processes can get taken out as
collateral damage.
This is a more general variant of the similar unbounded memory usage flaw in
the websockets server, that was previously assigned CVE-2017-15268, and fixed
in 2.11 by:
commit a7b20a8efa
Author: Daniel P. Berrange <berrange@redhat.com>
Date: Mon Oct 9 14:43:42 2017 +0100
io: monitor encoutput buffer size from websocket GSource
This new general memory usage flaw has been assigned CVE-2017-15124, and is
partially fixed by this patch.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-10-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit e2b72cb6e0)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The logic for determining if it is possible to send an update to the client
will become more complicated shortly, so pull it out into a separate method
for easier extension later.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-9-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 0bad834228)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
According to the RFB protocol, a client sends one or more framebuffer update
requests to the server. The server can reply with a single framebuffer update
response, that covers all previously received requests. Once the client has
read this update from the server, it may send further framebuffer update
requests to monitor future changes. The client is free to delay sending the
framebuffer update request if it needs to throttle the amount of data it is
reading from the server.
The QEMU VNC server, however, has never correctly handled the framebuffer
update requests. Once QEMU has received an update request, it will continue to
send client updates forever, even if the client hasn't asked for further
updates. This prevents the client from throttling back data it gets from the
server. This change fixes the flawed logic such that after a set of updates are
sent out, QEMU waits for a further update request before sending more data.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-8-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 728a7ac954)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Currently the VNC server tracks whether a client has requested an incremental
or forced update with two boolean flags. There are only really 3 distinct
states to track, so create an enum to more accurately reflect permitted states.
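The three states map naturally onto a small enum, roughly as follows (sketch;
the identifier names are illustrative rather than quoted from the source):

    typedef enum {
        VNC_STATE_UPDATE_NONE,          /* client has not asked for an update */
        VNC_STATE_UPDATE_INCREMENTAL,   /* client asked for an incremental update */
        VNC_STATE_UPDATE_FORCE,         /* client asked for a forced, full update */
    } VncUpdateState;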
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-7-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit fef1bbadfb)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
When we encode data for writing with SASL, we encode the entire pending output
buffer. The subsequent write, however, may not be able to send the full encoded
data in one go, particularly with a slow network. So we delay setting the
output buffer offset back to zero until all the SASL encoded data is sent.
Between encoding the data and completing sending of the SASL encoded data,
however, more data might have been placed on the pending output buffer. So it
is not valid to set offset back to zero. Instead we must keep track of how much
data we consumed during encoding and subtract only that amount.
With the current bug we would be throwing away some pending data without having
sent it at all. By sheer luck this did not previously cause any serious problem
because appending data to the send buffer is always an atomic action, so we
only ever throw away complete RFB protocol messages. In the case of frame buffer
updates we'd catch up fairly quickly, so no obvious problem was visible.
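Schematically the tracking works like this (sketch):

    /* remember how much raw output went into the SASL encoder */
    size_t consumed = vs->output.offset;
    /* ... sasl_encode() those bytes and queue the encoded result ... */
    /* once the encoded data has been fully written to the socket, drop only
     * what was consumed; new data may have been appended to vs->output in
     * the meantime and must be preserved */
    vs->output.offset -= consumed;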
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-6-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 8f61f1c5a6)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The vnc_update_client() method checks the 'has_dirty' flag to see if there are
dirty regions that are pending to send to the client. Regardless of this flag,
if a forced update is requested, updates must be sent. For unknown reasons
though, the code also tries to send updates if audio capture is enabled. This
makes no sense as audio capture state does not impact framebuffer contents, so
this check is removed.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-5-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 3541b08475)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Now that previous dead / unreachable code has been removed, we can simplify
the indentation in the vnc_client_update method.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-4-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit b939eb89b6)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
A previous commit:
commit 5a8be0f73d
Author: Gerd Hoffmann <kraxel@redhat.com>
Date: Wed Jul 13 12:21:20 2016 +0200
vnc: make sure we finish disconnect
Added a check for vs->disconnecting at the very start of the
vnc_update_client method. This means that the very next "if"
statement check for !vs->disconnecting always evaluates true,
and is thus redundant. This in turn means the vs->disconnecting
check at the very end of the method never evaluates true, and
is thus unreachable code.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-3-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit c53df96161)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
There is only one caller of vnc_update_client and that always passes false
for the 'sync' parameter.
Signed-off-by: Daniel P. Berrange <berrange@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-id: 20171218191228.31018-2-berrange@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 6af998db05)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
If postcopy-ram was set on the source but not on the destination,
migration doesn't occur, the destination prints an error and boots
the guest:
qemu-system-ppc64: Expected vmdescription section, but got 0
We end up with two running instances.
This behaviour was introduced in 2.11 by commit 58110f0acb "migration:
split common postcopy out of ram postcopy" to prepare ground for the
upcoming dirty bitmap postcopy support. It adds a new case where the
source may send an empty postcopy advise because dirty bitmap doesn't
need to check page sizes like RAM postcopy does.
If the source has enabled postcopy-ram, then it sends an advise with
the page size values. If the destination hasn't enabled postcopy-ram,
then loadvm_postcopy_handle_advise() leaves the page size values on
the stream and returns. This confuses qemu_loadvm_state() later on
and causes the destination to start execution.
As discussed several times, postcopy-ram must be enabled on both sides for
postcopy to be functional. This patch changes the destination to perform some
extra checks on the advise length to ensure this is the case. Otherwise
an error is returned and migration is aborted.
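A sketch of the extra check on the destination (illustrative; 16 bytes
corresponds to the two 64-bit page-size values the advise carries when
postcopy-ram is enabled on the source):

    /* in loadvm_postcopy_handle_advise(), 'length' is the advise payload size */
    if (migrate_postcopy_ram()) {
        if (length != 16) {
            error_report("CMD_POSTCOPY_ADVISE: invalid length %d", length);
            return -EINVAL;
        }
    } else if (length != 0) {
        /* source sent RAM page sizes but postcopy-ram is off here: refuse */
        error_report("postcopy-ram not enabled on the destination");
        return -EINVAL;
    }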
Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Message-Id: <151791621042.19120.3103118434734245776.stgit@bahia>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
(cherry picked from commit 875fcd013a)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
MAX_VM_CMD_PACKAGED_SIZE is a constant used in qemu_savevm_send_packaged
and loadvm_handle_cmd_packaged to determine whether a package is too
big to be sent or received. qemu_savevm_send_packaged is called inside
postcopy_start (migration/migration.c) to send the MigrationState
in a single blob to the destination, using the MIG_CMD_PACKAGED subcommand,
which will read it up using loadvm_handle_cmd_packaged. If the blob is
larger than MAX_VM_CMD_PACKAGED_SIZE, an error is thrown and the postcopy
migration is aborted. Both MAX_VM_CMD_PACKAGED_SIZE and MIG_CMD_PACKAGED
were introduced by commit 11cf1d984b ("MIG_CMD_PACKAGED: Send a packaged
chunk ..."). The constant has its original value of 1ul << 24 (16MB).
The current MAX_VM_CMD_PACKAGED_SIZE value is not enough to support postcopy
migration of bigger pseries guests. The blob size for a postcopy migration of
a pseries guest with the following setup:
qemu-system-ppc64 --nographic -vga none -machine pseries,accel=kvm -m 64G \
-smp 1,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
-drive file=f27.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
-netdev user,id=u1 -net nic,netdev=u1
comes to around 12MB. Bumping the RAM to 128G makes the blob size grow to 20MB.
With 256G the blob goes to 37MB - more than twice the current maximum size.
At this moment the pseries machine can handle guests with up to 1TB of RAM,
which would make this postcopy blob approximately 128MB in size.
Following the discussions made in [1], there is a need to understand what
devices are aggressively consuming the blob in that manner and see if that
can be mitigated. Until then, we can set MAX_VM_CMD_PACKAGED_SIZE to the
maximum value allowed. Since the size is a 32 bit int variable, we can set
it as 1ul << 32, giving a maximum blob size of 4G that is enough to support
postcopy migration of 32TB RAM guests given the above constraints.
[1] https://lists.nongnu.org/archive/html/qemu-devel/2018-01/msg06313.html
Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Reported-by: Balamuruhan S <bala24@linux.vnet.ibm.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
(cherry picked from commit ee555cdf4d)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
In e91d895 I added the new pause-before-switchover mechanism
to allow migration completion to be delayed; this changes the
last state prior to completion to MIGRATE_STATUS_DEVICE rather
than MIGRATE_STATUS_ACTIVE.
Fix the failure path in migration_completion to recover the block
devices if it fails in MIGRATE_STATUS_DEVICE, not just the
MIGRATE_STATUS_ACTIVE that it previously had.
This corresponds to rh bz:
https://bugzilla.redhat.com/show_bug.cgi?id=1538494
whose symptom is an occasional source crash on a failed migration.
Fixes: e91d8951d5
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
(cherry picked from commit 6039dd5b1c)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Since qemu_fopen_channel_{in,out}put take references on the underlying
IO channels, make sure to release our references to them.
Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Message-Id: <20171101142526.1006-2-ross.lagerwall@citrix.com>
Reviewed-by: Daniel P. Berrange <berrange@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
(cherry picked from commit 032b79f717)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
We should set ioeventfd_update_pending the same way as memory_region_update_pending.
Signed-off-by: linzhecheng <linzc@zju.edu.cn>
Message-Id: <1515934519-16158-1-git-send-email-linzc@zju.edu.cn>
Cc: qemu-stable@nongnu.org
Fixes: ade9c1aac5
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 0b15209571)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The new H-Call H_GET_CPU_CHARACTERISTICS is used by the guest to query
behaviours and available characteristics of the cpu.
Implement the handler for this new H-Call, which formulates its response
based on the settings of the spapr_caps cap-cfpc, cap-sbbc and cap-ibs.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit c59704b254)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Add new tristate cap cap-ibs to represent the indirect branch
serialisation capability.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 4be8d4e7d9)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Add new tristate cap cap-sbbc to represent the speculation barrier
bounds checking capability.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 09114fd817)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Add new tristate cap cap-cfpc to represent the cache flush on privilege
change capability.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 8f38eaf8f9)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
spapr_caps are used to represent the level of support for various
capabilities related to the spapr machine type. Currently there is
only support for boolean capabilities.
Add support for tristate capabilities by implementing their get/set
functions. These capabilities can have the values 0, 1 or 2
corresponding to broken, workaround and fixed.
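In other words, each tristate cap holds one of three small integer values,
along these lines (sketch):

    #define SPAPR_CAP_BROKEN      0x00  /* feature unavailable / unsafe */
    #define SPAPR_CAP_WORKAROUND  0x01  /* mitigated by a software workaround */
    #define SPAPR_CAP_FIXED       0x02  /* fixed in hardware, no workaround needed */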
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 6898aed77f)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Add three new kvm capabilities used to represent the level of host support
for three corresponding workarounds.
Host support for each of the capabilities is queried through the
new ioctl KVM_PPC_GET_CPU_CHAR which returns four uint64 quantities. The
first two, character and behaviour, represent the available
characteristics of the cpu and the behaviour of the cpu respectively.
The second two, c_mask and b_mask, represent the mask of known bits for
the character and behaviour dwords respectively.
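The payload returned by the ioctl has this general shape (sketch of the uapi
struct; field names follow the description above and may differ slightly from
the final kernel header):

    struct kvm_ppc_cpu_char {
        __u64 character;        /* available characteristics of the cpu */
        __u64 behaviour;        /* recommended software behaviour */
        __u64 character_mask;   /* known bits in character (c_mask) */
        __u64 behaviour_mask;   /* known bits in behaviour (b_mask) */
    };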
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[dwg: Correct some compile errors due to name change in final kernel
patch version]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 8acc2ae5e9)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The vmstate description and the associated 'needed' function for migration
of spapr_caps are the same for each cap, with only the name of the cap
substituted. As such, introduce a macro to allow for easier generation of
these.
Convert the three existing spapr_caps (htm, vsx, and dfp) to use this
macro.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 1f63ebaa91)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
and use them in a couple of obvious places. Other macros will be used
in the model of the XIVE interrupt controller.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 2a83f9976e)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Commit 51f84465dd changed the compatibility mode setting logic:
- machine reset only sets compatibility mode for the boot CPU
- compatibility mode is set for other CPUs when they are put online
by the guest with the "start-cpu" RTAS call
This causes a regression for machines started with max-cpu-compat:
the device tree nodes related to secondary CPU cores contain wrong
"cpu-version" and "ibm,pa-features" values, as shown below.
Guest started on a POWER8 host with:
-smp cores=2 -machine pseries,max-cpu-compat=compat7
ibm,pa-features = [18 00 f6 3f c7 c0 80 f0 80 00
00 00 00 00 00 00 00 00 80 00 80 00 80 00 00 00];
cpu-version = <0x4d0200>;
^^^
second CPU core
ibm,pa-features = <0x600f63f 0xc70080c0>;
cpu-version = <0xf000003>;
^^^
boot CPU core
The second core is advertised in raw POWER8 mode. This happens because
CAS assumes all CPUs to have the same compatibility mode. Since the
boot CPU already has the requested compatibility mode, the CAS code
does not set it for the secondary one, and exposes the bogus device
tree properties in the CAS response to the guest.
A similar situation is observed when hot-plugging a CPU core. The
related device tree properties are generated and exposed to the guest
with the "ibm,configure-connector" RTAS call before "start-cpu" is called.
The CPU core is advertised to the guest in raw mode as well.
In both cases, it boils down to the fact that "start-cpu" happens too
late. This can be fixed globally by propagating the compatibility mode
of the boot CPU to the other CPUs during reset. For this to work, the
compatibility mode of the boot CPU must be set before the machine code
actually resets all CPUs.
It is no longer necessary to set the compatibility mode in "start-cpu",
so that code is dropped.
Fixes: 51f84465dd
Signed-off-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 9012a53f06)
Conflicts:
hw/ppc/spapr_cpu_core.c
hw/ppc/spapr_rtas.c
* drop context dep on d6322252b3
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Increases the max SMT mode to 8 for POWER9. That's because KVM supports SMT
emulation on this platform, so QEMU should allow users to use it as
well.
Today, if we try to pass -smp ...,threads=8, QEMU will silently truncate
it to SMT4 mode and may cause a crash if we try to perform a CPU
hotplug.
Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
[dwg: Added an explanatory comment]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 03ee51d354)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Currently spapr_caps are tied to boolean values (on or off). This patch
reworks the caps so that they can have any uint8 value. This allows more
capabilities with various values to be represented in the same way
internally. Capabilities are numbered in ascending order. The internal
representation of capability values is an array of uint8s in the
sPAPRMachineState, indexed by capability number.
Capabilities can have their own name, description, options, getter and
setter functions, type and allow functions. They also each have their own
section in the migration stream. Capabilities are only migrated if they
were explicitly set on the command line, with the assumption that
otherwise the default will match.
On migration we ensure that the capability value on the destination
is greater than or equal to the capability value from the source. So
long as this remains the case, the migration is considered
compatible and allowed to continue.
This patch implements generic getter and setter functions for boolean
capabilities. It also converts the existing cap-htm, cap-vsx and
cap-dfp capabilities to this new format.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 4e5fe3688e)
Conflicts:
include/hw/ppc/spapr.h
*drop context dep on 60c6823b9b
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Decimal Floating Point has been available on POWER7 and later (server)
cpus. However, it can be disabled on the hypervisor, meaning that it's
not available to guests.
We currently handle this by conditionally advertising DFP support in the
device tree depending on whether the guest CPU model supports it - which
can also depend on what's allowed in the host for -cpu host. That can lead
to confusion on migration, since host properties are silently affecting
guest visible properties.
This patch handles it by treating it as an optional capability for the
pseries machine type.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit 2d1fb9bc8e)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
We currently have some conditionals in the spapr device tree code to decide
whether or not to advertise the availability of the VMX (aka Altivec) and
VSX vector extensions to the guest, based on whether the guest cpu has
those features.
This can lead to confusion and subtle failures on migration, since it makes
a guest visible change based only on host capabilities. We now have a
better mechanism for this, in spapr capabilities flags, which explicitly
depend on user options rather than host capabilities.
Rework the advertisement of VSX and VMX based on a new VSX capability. We
no longer bother with a conditional for VMX support, because every CPU
that's ever been supported by the pseries machine type supports VMX.
NOTE: Some userspace distributions (e.g. RHEL7.4) already rely on
availability of VSX in libc, so using cap-vsx=off may lead to a fatal
SIGILL in init.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit 2938664286)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
When constructing the "host" cpu class we modify whether the VMX and VSX
vector extensions and DFP (Decimal Floating Point) are available
based on whether KVM can support those instructions. This can depend on
policy in the host kernel as well as on the actual host cpu capabilities.
However, the way we probe for this is not very nice: we explicitly check
the host's device tree. That works in practice, but it's not really
correct, since the device tree is a property of the host kernel's platform
which we don't really know about. We get away with it because the only
modern POWER platforms happen to encode VMX, VSX and DFP availability in
the device tree in the same way.
Arguably we should have an explicit KVM capability for this, but we haven't
needed one so far. Barring specific KVM policies which don't yet exist,
each of these instruction classes will be available in the guest if and
only if they're available in the qemu userspace process. We can determine
that from the ELF AUX vector we're supplied with.
Once reworked like this, there are no more callers for kvmppc_get_vmx() and
kvmppc_get_dfp() so remove them.
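For illustration, probing the AUX vector looks roughly like this (sketch; the
HWCAP bits come from the powerpc toolchain headers, and QEMU wraps getauxval()
in its own helper):

    #include <stdbool.h>
    #include <sys/auxv.h>        /* getauxval, AT_HWCAP */
    #include <asm/cputable.h>    /* PPC_FEATURE_HAS_* (powerpc hosts only) */

    static void probe_host_vector_features(bool *vmx, bool *vsx, bool *dfp)
    {
        unsigned long hwcap = getauxval(AT_HWCAP);

        *vmx = hwcap & PPC_FEATURE_HAS_ALTIVEC;   /* VMX / Altivec */
        *vsx = hwcap & PPC_FEATURE_HAS_VSX;       /* VSX */
        *dfp = hwcap & PPC_FEATURE_HAS_DFP;       /* Decimal Floating Point */
    }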
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit 3f2ca480eb)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Now that the "pseries" machine type implements optional capabilities (well,
one so far) there's the possibility of having different capabilities
available at either end of a migration. Although arguably a user error,
it would be nice to catch this situation and fail as gracefully as we can.
This adds code to migrate the capabilities flags. These aren't pulled
directly into the destination's configuration since what the user has
specified on the destination command line should take precedence. However,
they are checked against the destination capabilities.
If the source was using a capability which is absent on the destination,
we fail the migration, since that could easily cause a guest crash or other
bad behaviour. If the source lacked a capability which is present on the
destination we warn, but allow the migration to proceed.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit be85537d65)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
This adds an spapr capability bit for Hardware Transactional Memory. It is
enabled by default for pseries-2.11 and earlier machine types with POWER8
or later CPUs (as it must be, since earlier qemu versions would implicitly
allow it). However, it is disabled by default for the latest pseries-2.12
machine type.
This means that with the latest machine type, HTM will not be available,
regardless of CPU, unless it is explicitly enabled on the command line.
That change is made on the basis that:
* This way running with -M pseries,accel=tcg will start with whatever cpu
and will provide the same guest visible model as with accel=kvm.
- More specifically, this means existing make check tests don't have
to be modified to use cap-htm=off in order to run with TCG
* We hope to add a new "HTM without suspend" feature in the not too
distant future which could work on both POWER8 and POWER9 cpus, and
could be enabled by default.
* Best guesses suggest that future POWER cpus may well only support the
HTM-without-suspend model, not the (frankly, horribly overcomplicated)
POWER8 style HTM with suspend.
* Anecdotal evidence suggests problems with HTM being enabled when it
wasn't wanted are more common than being missing when it was.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit ee76a09fc7)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Because PAPR is a paravirtual environment access to certain CPU (or other)
facilities can be blocked by the hypervisor. PAPR provides ways to
advertise in the device tree whether or not those features are available to
the guest.
In some places we automatically determine whether to make a feature
available based on whether our host can support it, in most cases this is
based on limitations in the available KVM implementation.
Although we correctly advertise this to the guest, it means that host
factors might make changes to the guest visible environment, which is bad:
as well as generally reducing reproducibility, it means that a migration
between different host environments can easily go bad.
We've mostly gotten away with it because the environments considered mature
enough to be well supported (basically, KVM on POWER8) have had consistent
feature availability. But, it's still not right and some limitations on
POWER9 is going to make it more of an issue in future.
This introduces an infrastructure for defining "sPAPR capabilities". These
are set by default based on the machine version, masked by the capabilities
of the chosen cpu, but can be overriden with machine properties.
The intention is at reset time we verify that the requested capabilities
can be supported on the host (considering TCG, KVM and/or host cpu
limitations). If not we simply fail, rather than silently modifying the
advertised featureset to the guest.
This does mean that certain configurations that "worked" may now fail, but
such configurations were already more subtly broken.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Greg Kurz <groug@kaod.org>
(cherry picked from commit 33face6b89)
Conflicts:
include/hw/ppc/spapr.h
*drop context dep on 60c6823b9b
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
While we're at it fix a couple of small errors in the 2.11 and 2.10 models
(they didn't have any real effect, but don't quite match the template).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 2b6154120c)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
If KVM is enabled and the KVM capability for the radix MMU is available,
the partition table entry (patb_entry) for radix mode is
initialized by default in ppc_spapr_reset().
This is a problem if we want to migrate the guest to a POWER8 host
before the guest kernel has started and set the value to the one
expected for a POWER8 CPU.
The "-machine max-cpu-compat=power8" option should allow migrating
a POWER9 KVM host to a POWER8 KVM host, but because patb_entry
is set, the destination QEMU tries to enable radix mode on the
POWER8 host. This fails and cancels the migration:
Process table config unsupported by the host
error while loading state for instance 0x0 of device 'spapr'
load of migration failed: Invalid argument
This patch doesn't set the PATB entry if the user provides
a CPU compatibility mode that doesn't support radix mode.
Signed-off-by: Laurent Vivier <lvivier@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 1481fe5fcf)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
The SPARC code in linux-user/signal.c defines a set of
MC_* constants. On some SPARC hosts these are also defined
by sys/ucontext.h, resulting in build failures:
linux-user/signal.c:2786:0: error: "MC_NGREG" redefined [-Werror]
#define MC_NGREG 19
In file included from /usr/include/signal.h:302:0,
from include/qemu/osdep.h:86,
from linux-user/signal.c:19:
/usr/include/sparc64-linux-gnu/sys/ucontext.h:59:0: note: this is the location of the previous definition
# define MC_NGREG __MC_NGREG
Rename all these constants to SPARC_MC_* to avoid the clash.
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Message-id: 1517318239-15764-1-git-send-email-peter.maydell@linaro.org
(cherry picked from commit 8ebb314b95)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
In various places we don't correctly check if the device supports MSI or
MSI-X. This can cause devices to be advertised with MSI support, even
if they only support MSI-X (like virtio-pci-* devices for example):
    ethernet@0 {
            ibm,req#msi = <0x1>;      <--- wrong!
            .
            ibm,loc-code = "qemu_virtio-net-pci:0000:00:00.0";
            .
            ibm,req#msi-x = <0x3>;
    };
Worse, this can also cause the "ibm,change-msi" RTAS call to corrupt the
PCI status and cause migration to fail:
qemu-system-ppc64: get_pci_config_device: Bad config data: i=0x6
read: 0 device: 10 cmask: 10 wmask: 0 w1cmask:0
^^
PCI_STATUS_CAP_LIST bit which is assumed to be constant
This patch changes spapr_populate_pci_child_dt() to properly check for
MSI support using msi_present(): this ensures that PCIDevice::msi_cap
was set by msi_init() and that msi_nr_vectors_allocated() will look at
the right place in the config space.
Checking PCIDevice::msix_entries_nr is enough for MSI-X but let's add
a call to msix_present() there as well for consistency.
It also changes rtas_ibm_change_msi() to select the appropriate MSI
type in Function 1 instead of always selecting plain MSI. This new
behaviour is compliant with LoPAPR 1.1, as described in "Table 71.
ibm,change-msi Argument Call Buffer":
Function 1: If Number Outputs is equal to 3, request to set to a new
number of MSIs (including set to 0).
If the “ibm,change-msix-capable” property exists and Number
Outputs is equal to 4, request is to set to a new number of
MSI or MSI-X (platform choice) interrupts (including set to
0).
Since MSI is the platform default (LoPAPR 6.2.3 MSI Option), let's
check for MSI support first.
And finally, it checks the input parameters are valid, as described in
LoPAPR 1.1 "R1–7.3.10.5.1–3":
For the MSI option: The platform must return a Status of -3 (Parameter
error) from ibm,change-msi, with no change in interrupt assignments if
the PCI configuration address does not support MSI and Function 3 was
requested (that is, the “ibm,req#msi” property must exist for the PCI
configuration address in order to use Function 3), or does not support
MSI-X and Function 4 is requested (that is, the “ibm,req#msi-x” property
must exist for the PCI configuration address in order to use Function 4),
or if neither MSIs nor MSI-Xs are supported and Function 1 is requested.
This ensures that the ret_intr_type variable contains a valid MSI type
for this device, and that spapr_msi_setmsg() won't corrupt the PCI status.
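For illustration, the device tree side of the fix boils down to checks of this
shape (sketch; msi_present()/msix_present() are the existing PCI helpers
mentioned above):

    /* only emit "ibm,req#msi" for devices with a real MSI capability */
    if (msi_present(dev)) {
        uint32_t max_msi = msi_nr_vectors_allocated(dev);
        /* ... set "ibm,req#msi" = max_msi in the node ... */
    }
    /* and "ibm,req#msi-x" only for devices with an MSI-X capability */
    if (msix_present(dev)) {
        uint32_t max_msix = dev->msix_entries_nr;
        /* ... set "ibm,req#msi-x" = max_msix in the node ... */
    }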
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
(cherry picked from commit 9cbe305b60)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Because usb-storage creates an internal scsi device, we should propagate
options. We already do so for bootindex etc., but failed to take care of
share-rw. Fix it in the obvious way: add a new parameter to
scsi_bus_legacy_add_drive and pass in s->conf.share_rw.
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Message-id: 20180117005222.4781-1-famz@redhat.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
(cherry picked from commit 395b953959)
Conflicts:
hw/usb/dev-storage.c
* dropped context dep on ceff3e1f01
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
We could hit a lock failure if a signal makes fcntl return
-1 with errno set to EINTR. In this case we should retry.
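The retry is the standard EINTR loop around fcntl() (sketch of the pattern,
not the literal QEMU helper):

    #include <errno.h>
    #include <fcntl.h>

    static int lock_fcntl_retry(int fd, int cmd, struct flock *fl)
    {
        int ret;
        do {
            ret = fcntl(fd, cmd, fl);
        } while (ret == -1 && errno == EINTR);   /* interrupted by a signal: retry */
        return ret == -1 ? -errno : 0;
    }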
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
(cherry picked from commit f86428a1f4)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>
stfle.81 (ppa15) is a transparent facility that can be passed to the
guest without the need to implement hypervisor support. As this feature
can be provided by firmware we add it to all full models.
Cc: qemu-stable@nongnu.org
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Message-Id: <20180118085628.40798-4-borntraeger@de.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
(cherry picked from commit 9f0d13f4f1)
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com>