[Oraclevm-errata] OVMBA-2015-0038 Oracle VM 3.2 xen bug fix update
Errata Announcements for Oracle VM
oraclevm-errata at oss.oracle.com
Mon Mar 23 11:12:34 PDT 2015
Oracle VM Bug Fix Advisory OVMBA-2015-0038
The following updated rpms for Oracle VM 3.2 have been uploaded to the
Unbreakable Linux Network:
x86_64:
xen-4.1.3-25.el5.127.33.x86_64.rpm
xen-devel-4.1.3-25.el5.127.33.x86_64.rpm
xen-tools-4.1.3-25.el5.127.33.x86_64.rpm
SRPMS:
http://oss.oracle.com/oraclevm/server/3.2/SRPMS-updates/xen-4.1.3-25.el5.127.33.src.rpm
Description of changes:
[4.1.3-25.el5.127.33]
- switch internal hypercall restart indication from -EAGAIN to -ERESTART
Since -EAGAIN is a return value we want to return to the actual
caller in a couple of cases, it is unsuitable as a restart
indication, and x86 already developed two cases where -EAGAIN could
not be returned as intended due to this (which is being fixed here
at once).
Signed-off-by: Jan Beulich <jbeulich at suse.com>
Acked-by: Ian Campbell <ian.campbell at citrix.com>
Acked-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan at amd.com>
Reviewed-by: Tim Deegan <tim at xen.org>
(cherry picked from commit f5118cae0a7f7748c6f08f557e2cfbbae686434a)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk at oracle.com>
Conflicts:
A LOT
[There are a lot of changes in this commit. We only care about the
one in the domain destruction path. We need the value -EAGAIN to be
passed to the toolstack so that it will retry the destruction. Any
other value (-ERESTART) will make it stop - so unlike some of the
other backports, we convert -ERESTART to -EAGAIN only on that path;
see the retry sketch after this entry.]
Acked-by: Chuck Anderson <chuck.anderson at oracle.com>
Reviewed-by: John Haxby <john.haxby at oracle.com> [bug 20666807]
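For context on why the toolstack-visible value matters: xend keeps
retrying domain destruction only while it sees -EAGAIN. Below is a
minimal sketch of that retry pattern - the names (do_destroy_step,
retry_destroy) are hypothetical and do not appear in the actual xend
source:

    import errno
    import time

    # Hypothetical stand-in for one hypercall-backed teardown step. It
    # returns -EAGAIN while relinquishing the domain's resources is
    # still in progress, and 0 once the domain is gone.
    _steps_left = 3

    def do_destroy_step(domid):
        global _steps_left
        _steps_left -= 1
        return 0 if _steps_left <= 0 else -errno.EAGAIN

    def retry_destroy(domid, delay=0.1):
        # The toolstack retries only on -EAGAIN; any other negative
        # value (such as -ERESTART leaking out of the hypervisor)
        # stops the destruction - hence the backport converts
        # -ERESTART back to -EAGAIN on the domain-destruction path.
        while True:
            rc = do_destroy_step(domid)
            if rc == 0:
                return
            if rc != -errno.EAGAIN:
                raise RuntimeError("destroy failed: rc=%d" % rc)
            time.sleep(delay)

    retry_destroy(1)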
[4.1.3-25.el5.127.32]
- rc/xendomains: 'stop' - also take care of stuck guests.
When we are done shutting down the guests ('xm shutdown --all'),
they are at that point not running at all. They might still have
QEMU or backend drivers set up due to the asynchronous nature
of the 'shutdown' process. As such, doing a 'destroy' on all
the guests will assure us that the backend drivers and QEMU
are indeed stopped.
The mechanism by which 'shutdown' works is quite complex. There
are three actors at play:
a) xm client (which connects to the XML RPC),
b) Xend XenStore watch thread,
c) XML RPC server thread.
The way shutdown starts is:
 xm client             |  XML RPC                   |  watch thread
 shutdown.py           |                            |
  - server....shutdown-|--> XenDomainInfo:shutdown  |
                       |    Sets "control/shutdown" |
                       |    calls xc.domain_shutdown|
                       |    returns                 |
  - loops calling:     |                            |
    domains_with_state-|--> XendDomain:list_names   |
    gets active        |                            |
    and inactive list  |                            |  watchMain
                       |                            |   _on_domains_changed
                       |                            |    - _refresh
                       |                            |      -> _refreshTxn
                       |                            |         -> update [sets to DOM_STATE_SHUTDOWN]
                       |                            |         -> refreshShutdown
                       |                            |            [spawns a new thread calling _maybeRestart]

 [_maybeRestart thread]:
   destroy [sets it to DOM_STATE_HALTED]
    - cleanupDomain
      - _releaseDevices
      - ..

Four threads total.
There is a race between 'watchMain' being executed and
'domains_with_state' calling 'list_names'. Guests that are in
DOM_STATE_UNKNOWN or DOM_STATE_PAUSED might not be updated to
DOM_STATE_SHUTDOWN, as list_names can be called _before_ watchMain
triggers. list_names does take a lock to call 'refresh' - but if
the acquisition fails, it will just use the stale list.
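A minimal sketch of that opportunistic-refresh pattern - the names
here are hypothetical, not the actual XendDomain code:

    import threading

    _refresh_lock = threading.Lock()
    # name -> state, kept current by the watch thread
    _domains = {'guest1': 'DOM_STATE_PAUSED'}

    def _refresh():
        # Placeholder: re-read every domain's state from the hypervisor.
        pass

    def list_names():
        # Non-blocking acquire: if another thread already holds the
        # lock, skip the refresh and hand back the possibly stale list.
        # This is the window in which DOM_STATE_UNKNOWN and
        # DOM_STATE_PAUSED guests miss the DOM_STATE_SHUTDOWN update.
        if _refresh_lock.acquire(False):
            try:
                _refresh()
            finally:
                _refresh_lock.release()
        return list(_domains)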
As such, the process works great for guests that are in STATE_SHUTDOWN,
STATE_HALT, or STATE_RUNNING - which 'domains_with_state' will present
to the shutdown process.
Guests in the other (more troublesome) states might still be lying
around. As such, this patch calls 'xm destroy' on all those remaining
guests to do the cleanup, as sketched after this entry.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk at oracle.com>
Acked-by: Chuck Anderson <chuck.anderson at oracle.com>
Reviewed-by: John Haxby <john.haxby at oracle.com> [bug 20666802]
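The resulting 'stop' logic can be pictured roughly as below. This is
a sketch only - the real fix lives in the rc/xendomains shell script
(not quoted in this advisory), and the xm-output parsing here is an
assumption for illustration:

    import subprocess

    def xm(*args):
        # Run an xm subcommand and return its stdout as text.
        return subprocess.check_output(('xm',) + args).decode()

    def stop_all_guests():
        # First ask every guest to shut down cleanly and wait until
        # none of them is executing any more.
        xm('shutdown', '--all', '--wait')
        # Guests that raced past domains_with_state (paused/unknown
        # state) may still linger with QEMU and backend drivers set
        # up; destroy whatever is left so the backends really go away.
        for line in xm('list').splitlines()[1:]:  # skip the header
            name = line.split()[0]
            if name != 'Domain-0':
                xm('destroy', name)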
[4.1.3-25.el5.127.31]
- xend: Fix race between shutdown and cleanup.
When we invoke 'xm shutdown --wait --all' we will exit the moment
the guest has stopped executing. That is when xcinfo returns
shutdown=1. However, that does not mean that all the infrastructure
around the guest has been torn down - QEMU can still be running,
netback and blkback as well. In the past the time between
the shutdown and QEMU being disposed of was short - however
the race was still present.
With our usage of PCIe passthrough we MUST unbind those devices
from a guest before we can continue on with the reboot of
the system. That is due to the complex interaction SR-IOV
devices have between VFs and PFs - you cannot unload the PF driver
before the VF drivers have been unbound from the guest.
If you try to reboot the machine at this point, the PF driver
will not unload.
The VF devices are bound to Xen pciback - and they are unbound
when QEMU is stopped and the XenStore keys are torn down - which
is done _after_ the 'shutdown' xcinfo is set (in the cleanup
stage). Worse, the Xen blkback is still active - which means
we cannot unmount the storage until said cleanup has finished.
But as mentioned - 'xm shutdown --wait --all' would happily
exit before the cleanup finished, and the shutdown (or reboot)
of the initial domain would continue on. It would eventually
get wedged when trying to unmount the storage, which still
had a refcount from the Xen block driver - which was not cleaned up
as Xend was killed earlier.
This patch solves this by making 'xm shutdown --wait --all'
wait until the guest has transitioned through the RUNNING ->
SHUTDOWN -> HALTED stages. SHUTDOWN means it has ceased
to execute; HALTED means the cleanup is being performed.
We will cycle through all of the guests in those states until
they have moved out of them (been removed completely from
the system), as sketched below.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk at oracle.com>
Acked-by: Chuck Anderson <chuck.anderson at oracle.com>
Reviewed-by: John Haxby <john.haxby at oracle.com> [bug 20661867]
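A rough sketch of the resulting wait loop - assuming a hypothetical
list_domain_states() helper that returns a name -> state mapping; the
real change is inside the xend/xm shutdown path:

    import time

    def list_domain_states():
        # Hypothetical helper: return {name: state} for every guest,
        # with state one of 'RUNNING', 'SHUTDOWN', 'HALTED', ...
        return {}

    def wait_for_full_teardown(poll=1.0):
        # SHUTDOWN means the guest has ceased to execute; HALTED means
        # the cleanup (QEMU, pciback, blkback) is still being performed.
        # Only when a guest vanishes from the list entirely has it been
        # removed from the system, so keep polling while any guest is
        # mid-teardown.
        while True:
            pending = [name for name, state in
                       list_domain_states().items()
                       if state in ('SHUTDOWN', 'HALTED')]
            if not pending:
                return
            time.sleep(poll)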