[Oraclevm-errata] OVMBA-2016-0022 Important: Oracle VM 3.3 xen bug fix update

Errata Announcements for Oracle VM oraclevm-errata at oss.oracle.com
Mon Mar 14 09:12:10 PDT 2016


Oracle VM Bug Fix Advisory OVMBA-2016-0022

The following updated rpms for Oracle VM 3.3 have been uploaded to the 
Unbreakable Linux Network:

x86_64:
xen-4.3.0-55.el6.119.10.x86_64.rpm
xen-tools-4.3.0-55.el6.119.10.x86_64.rpm


SRPMS:
http://oss.oracle.com/oraclevm/server/3.3/SRPMS-updates/xen-4.3.0-55.el6.119.10.src.rpm



Description of changes:

[4.3.0-55.el6.119.10]
- x86: enforce consistent cachability of MMIO mappings
We've been told by Intel that inconsistent cachability between
multiple mappings of the same page can affect system stability only
when the affected page is an MMIO one. Since the stale data issue is
of no relevance to the hypervisor (since all guest memory accesses go
through proper accessors and validation), handling of RAM pages
remains unchanged here. Any MMIO mapped by domains however needs to be
done consistently (all cachable mappings or all uncachable ones), in
order to avoid Machine Check exceptions. Since converting existing
cachable mappings to uncachable (at the time an uncachable mapping
gets established) would in the PV case require tracking all mappings,
allow MMIO to only get mapped uncachable (UC, UC-, or WC).
This also implies that in the PV case we mustn't use the L1 PTE update
fast path when cachability flags get altered.
Since in the HVM case at least for now we want to continue honoring
pinned cachability attributes for pages not mapped by the hypervisor,
special case handling of r/o MMIO pages (forcing UC) gets added there.
Arguably the counterpart change to p2m-pt.c may not be necessary, since
UC- (which already gets enforced there) is probably strict enough.
Note that the shadow code changes include fixing the write protection
of r/o MMIO ranges: shadow_l1e_remove_flags() and its siblings, other
than l1e_remove_flags() and alike, return the new PTE (and hence
ignoring their return values makes them no-ops).
This is CVE-2016-2270 / XSA-154.
Signed-off-by: Jan Beulich Acked-by: Andrew Cooper Acked-by: Chuck Anderson
Reviewed-by: boris.ostrovsky at oracle.com [bug 22752939] {CVE-2016-2270}
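
To make the rule above concrete, the check boils down to accepting only the
three uncached-or-write-combined memory types for MMIO pages. The sketch below
is illustrative only, assuming standard x86 PAT type encodings; the function
name and layout are not Xen's:

/* Minimal sketch, not Xen code: accept only cachability types that are
 * safe for MMIO mappings (UC, UC-, WC); reject WB/WT/WP. */
#include <stdbool.h>

#define X86_MT_UC  0x00  /* uncacheable */
#define X86_MT_WC  0x01  /* write-combining */
#define X86_MT_WT  0x04  /* write-through */
#define X86_MT_WP  0x05  /* write-protect */
#define X86_MT_WB  0x06  /* write-back */
#define X86_MT_UCM 0x07  /* UC- (uncacheable minus) */

static bool mmio_cacheattr_ok(unsigned int type)
{
    return type == X86_MT_UC || type == X86_MT_UCM || type == X86_MT_WC;
}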

[4.3.0-55.el6.119.9]
- x86/VMX: sanitize rIP before re-entering guest
... to prevent guest user mode arranging for a guest crash (due to
failed VM entry). (On the AMD system I checked, hardware is doing
exactly the canonicalization being added here.)
Note that fixing this in an architecturally correct way would be quite
a bit more involved: Making the x86 instruction emulator check all
branch targets for validity, plus dealing with invalid rIP resulting
from update_guest_eip() or incoming directly during a VM exit. The only
way to get the latter right would be by not having hardware do the
injection.
Note further that there are two early returns from
vmx_vmexit_handler(): One (through vmx_failed_vmentry()) leads to
domain_crash() anyway, and the other covers real mode only and can
neither occur with a non-canonical rIP nor result in an altered rIP,
so we don't need to force those paths through the checking logic.
This is XSA-170.
Signed-off-by: Jan Beulich Reviewed-by: Andrew Cooper Tested-by: Andrew Cooper
Acked-by: Chuck Anderson Reviewed-by: boris.ostrovsky at oracle.com
[bug 22706636] {CVE-2016-2271}
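
The sanitization amounts to a canonical-address check on the guest rIP: bits
63:47 must be a sign extension of bit 47 on CPUs with 48 implemented
virtual-address bits. A minimal standalone version of such a check (the helper
name is illustrative, not the one used in Xen):

#include <stdbool.h>
#include <stdint.h>

/* True when bits 63:47 of rip are copies of bit 47 (48-bit virtual
 * addressing assumed). */
static bool is_canonical_rip(uint64_t rip)
{
    return (uint64_t)((int64_t)(rip << 16) >> 16) == rip;
}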

[4.3.0-55.el6.119.8]
- x86/irq: limit the maximum number of domain PIRQs
c/s 7e73a6e 'have architectures specify the number of PIRQs a hardware domain
gets' increased the default number of pirqs for dom0, as 256 was found to be
too low in some cases.
However, it didn't account for the upper bound presented by the domain's EOI
bitmap, registered with the PHYSDEVOP_pirq_eoi_gmfn_v* hypercall.
On a server with 240 cpus, Xen was observed to be attempting to clear the EOI
bit for dom0's pirq 0xb40f, which hit a pagefault.
Signed-off-by: Andrew Cooper (cherry picked from commit
75d28c917094b0e264874e92e8980b00a372b99f)
Signed-off-by: Konrad Rzeszutek Wilk Backported-by: Joe Jin
[bug 22661468] [bug 22668626]
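
The bound in question follows from the size of the registered EOI bitmap: one
4K page can only flag 32768 PIRQs, so pirq 0xb40f (46095) falls outside it. A
hedged sketch of the resulting clamp, with illustrative names and a 4K page
assumed:

/* Illustrative only: cap the PIRQ count at what one EOI bitmap page
 * can represent. */
#define PAGE_SIZE     4096u
#define BITS_PER_BYTE 8u

static unsigned int clamp_nr_pirqs(unsigned int requested)
{
    unsigned int max_eoi_bits = PAGE_SIZE * BITS_PER_BYTE;  /* 32768 */

    return requested < max_eoi_bits ? requested : max_eoi_bits;
}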

[4.3.0-55.el6.119.7]
- have architectures specify the number of PIRQs a hardware domain gets
The current value of nr_static_irqs + 256 is often too small for larger
systems. Make it dependent on CPU count and number of IO-APIC pins on
x86, and (until it obtains PCI support) simply NR_IRQS on ARM.
Signed-off-by: Jan Beulich Acked-by: David Vrabel Acked-by: Ian Campbell
Release-Acked-by: Konrad Rzeszutek Wilk (cherry picked from commit
7e73a6e7f12ae080222c5d339799905de3443a6e)
Signed-off-by: Konrad Rzeszutek Wilk Backported-by: Joe Jin
[bug 22668626] [bug 22661468]
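
The shape of the change is a per-architecture hook that sizes the hardware
domain's PIRQ space from the machine rather than using a fixed
nr_static_irqs + 256. The sketch below only illustrates that idea; the
multiplier and names are assumptions, and the real upstream formula differs:

/* Rough illustration, not the upstream formula: scale the hardware
 * domain's PIRQ budget with IO-APIC pins and CPU count, but never go
 * below the old nr_static_irqs + 256 default. */
static unsigned int hwdom_nr_pirqs(unsigned int nr_static_irqs,
                                   unsigned int nr_ioapic_pins,
                                   unsigned int nr_cpus)
{
    unsigned int n = nr_static_irqs + nr_ioapic_pins + 8 * nr_cpus;
    unsigned int old_default = nr_static_irqs + 256;

    return n > old_default ? n : old_default;
}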

[4.3.0-55.el6.119.3]
- x86/HVM: avoid reading ioreq state more than once
Otherwise, especially when the compiler chooses to translate the
switch() to a jump table, unpredictable behavior (and in the jump table
case arbitrary code execution) can result.
This is XSA-166.
Signed-off-by: Jan Beulich Acked-by: Ian Campbell Acked-by: Chuck Anderson
[bug 22551154]
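
The bug class here is a double fetch from shared memory: if the switch()
re-reads a state field that the other side can change, a jump-table
translation can be sent through an out-of-range slot. The generic safe
pattern, sketched with placeholder type and state names, is to snapshot the
field once and only ever act on the snapshot:

#include <stdint.h>

struct shared_ioreq {
    volatile uint8_t state;   /* writable by the less-privileged side */
};

void handle_ioreq(struct shared_ioreq *p)
{
    uint8_t state = p->state;  /* read exactly once */

    switch (state) {           /* every case sees the same snapshot */
    case 0:                    /* e.g. none pending */
        break;
    case 1:                    /* e.g. request ready: process it */
        break;
    default:                   /* anything else is treated as an error */
        break;
    }
}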

[4.3.0-55.el6.119]
- Fix NUMA node allocation issue for VM
When booting a VM with more vcpus than pcpus per node on a NUMA-enabled
system, the VM may get the same node allocated twice for its pcpus, lowering
guest performance.
On the customer rack we see CPU spikes on the large vServer, which has a CPU
oversubscription problem caused by this issue.
Signed-off-by: Zhenzhong Duan
Acked-by: Chuang Cao [bug 22154251]

[4.3.0-55.el6.118]
- xend/image: Don't throw VMException when using backend domains for disks.
If we are using backend domains the disk image may not be
accessible within the host (domain0). As such it is OK to
continue on.
The 'addStoreEntries' in DevController.py already does the check
to make sure that when the 'backend' configuration is used - that
said domain exists.
As such the only change we need to do is to exclude the disk
image location if the domain is not dom0.
Reviewed-by: Konrad Rzeszutek Wilk Acked-by: Adnan Misherfi 
Signed-off-by: Zhigang Wang Signed-off-by: Joe Jin [bug 22242513]

[4.3.0-55.el6.113]
- xend: fix xm list introducing memory_actual R/O field
Introduces a field to sxp data structures named 'memory_actual'
that xm will use to have up-to-date domain memory values from Xen.
This value is always up-to-date when xend refreshes the list of domains,
though it's not used in domain management functions like memory_dynamic_max
and memory_static_max to determine the memory the domain will have if
rebooted/ballooned/saved/restored.
Signed-off-by: Joao Martins Acked-by: Konrad Rzeszutek Wilk
Acked-by: Adnan Misherfi Acked-by: Chuck Anderson [bug 22145952]

[4.3.0-55.el6.112]
- Revert 'xend: Fix xm list bug reporting incorrect memory size'
This patch reverts 6797b4c5 because it introduces a regression
when a guest reboots (on its own initiative) with a PCI passthrough.
By changing xcinfo domain memory with the most up-to-date values
from libxc we will cause a failed reboot (with PCI) and the creation
of a bigger guest with ballooning enabled.
xm creates a guest with X MBs plus 4 (default) megabytes in an HVM guest.
Plus, Xen/libxc won't allocate exactly that amount of memory (e.g. there is a
128k VGA hole and 8 pages for console/xenstore/etc). This means that from the
view of the toolstack we will always see a gap between mem and maxmem, which
means that PoD gets enabled when the guest is rebooted if these memory values
are up-to-date with libxc. Since PCI devices can't be attached with PoD
enabled, this later causes the bug we observed.
The previous behaviour was to change this domain state only when explicitly
requested, for example on xm mem-set, pretty much as xl does right now. The
next patch will (re)fix the memory size listing issue which the commit we are
reverting intended to fix.
Signed-off-by: Joao Martins Acked-by: Konrad Rzeszutek Wilk
Acked-by: Adnan Misherfi Acked-by: Chuck Anderson [bug 22145952]

[4.3.0-55.el6.108]
- AMD IOMMU: don't free page table prematurely
iommu_merge_pages() still wants to look at the next level page table,
the TLB flush necessary before freeing too happens in that function,
and if it fails no free should happen at all. Hence the freeing must
be done after that function returned successfully, not before it's
being called.
Signed-off-by: Jan Beulich Reviewed-by: Andrew Cooper
Reviewed-by: Suravee Suthikulpanit Tested-by: Suravee Suthikulpanit
Signed-off-by: Zhenzhong Duan Acked-by: Joe Jin [bug 21906833]
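
The ordering bug is the familiar free-before-last-use pattern; the corrected
shape, sketched below with hypothetical names, is to let the merge (which also
performs the needed TLB flush) finish successfully before freeing the
lower-level table, and to keep the table on failure:

struct page_table;
int merge_pages(struct page_table *pt, struct page_table *lower);   /* stand-in */
void free_page_table(struct page_table *lower);                     /* stand-in */

static int merge_then_free(struct page_table *pt, struct page_table *lower)
{
    int rc = merge_pages(pt, lower);   /* still dereferences 'lower' */

    if (rc == 0)
        free_page_table(lower);        /* safe: merge and flush succeeded */

    return rc;                         /* on failure the table stays allocated */
}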

[4.3.0-55.el6.107]
- chkconfig services should be associated with xen-tools
Signed-off-by: Zhigang Wang Reviewed-by: Adnan Misherfi [bug 21889146]

[4.3.0-55.el6.106]
- Backport python/xc memory leak fix
3.4.1 commit 00621bb9a6471695e1ea962553f801f38cbe3884
python/xc: add missing Py_DECREF() to fix a memory leak
Python's PyList_Append() will increase the reference count of the item. We
have to decrease its reference count to let it be garbage collected.
We missed the Py_DECREF() for 'cpuinfo_obj' here, thus we have a memory leak.
The memory leak could be easily confirmed by:
>>> import xen.lowlevel.xc
>>> xc = xen.lowlevel.xc.xc()
>>> for i in range(1000): xc.getcpuinfo(1)
And check the python process memory usage before and after:
Signed-off-by: Zhigang Wang Acked-by: Wei Liu [bug 21784632]
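
The fix follows the standard Python C API reference-counting rule:
PyList_Append() takes its own reference to the item, so the caller must drop
the reference it created. A minimal sketch of the pattern (not the actual xc
binding code):

#include <Python.h>

static int append_and_release(PyObject *list, PyObject *cpuinfo_obj)
{
    /* PyList_Append() increments the item's refcount on success, so the
     * reference we hold must be released either way to avoid a leak. */
    int rc = PyList_Append(list, cpuinfo_obj);

    Py_DECREF(cpuinfo_obj);
    return rc;
}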

[4.3.0-55.el6.105]
- libxl: explicitly allocate BUFIOREQ event channel
There is an OVM-only change for domain creation that is only
implemented for xend. The change is to allocate the buffered IO
request server explicitly instead of doing it by default for VCPU 0.
Without this change qemu cannot initialize and thus fails the domain
creation with the following error in the device model:
bind interdomain ioctl error 22
qemu: hardware error: xen hardware virtual machine initialisation failed
Commit bebfe22 ('Xen: Fix pvhvm migration issue from ovm3.2.8 to
ovm3.4') introduces this change for xend and this patch introduces
it for libxl.
Signed-off-by: Joao Martins Acked-by: Konrad Rzeszutek Wilk
Acked-by: Adnan Misherfi [bug 21694010] [bug 21748894]
(DEVICE MODEL FAILURE LAUNCHING HVM VM WITH 'XL')

[4.3.0-55.el6.104]
- x86/kexec: fix kexec on systems which boot in x2apic mode
Moving straight from fully disabled to x2apic mode is an illegal state
transition, and causes an unconditional #GP fault. Bounce through xapic mode
to avoid the fault.
In addition, avoid bouncing through the various apic modes if the mode is
already correct.
Signed-off-by: Andrew Cooper Reviewed-by: Jan Beulich
Upstream commit 77ffa26374370c1c9805f9596f37a44d412a7fdb
Signed-off-by: Zhenzhong Duan [bug 21197271]
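
The illegal transition comes from the IA32_APIC_BASE MSR: with the global
enable bit (bit 11) clear, turning on the x2APIC extension bit (bit 10) in the
same state change faults. A hedged sketch of the staged sequence; rdmsr/wrmsr
are stand-ins for whatever accessors the code actually uses:

#include <stdint.h>

#define MSR_IA32_APICBASE  0x1b
#define APICBASE_ENABLE    (1ull << 11)   /* xAPIC global enable */
#define APICBASE_EXTD      (1ull << 10)   /* x2APIC mode */

uint64_t rdmsr(uint32_t msr);             /* stand-in accessors */
void wrmsr(uint32_t msr, uint64_t val);

/* Going from fully disabled straight to x2APIC (setting EN and EXTD in
 * one step) is an illegal transition and raises #GP; bounce through
 * xAPIC mode first, and do nothing if x2APIC is already on. */
static void enable_x2apic(void)
{
    uint64_t base = rdmsr(MSR_IA32_APICBASE);

    if ((base & (APICBASE_ENABLE | APICBASE_EXTD)) ==
        (APICBASE_ENABLE | APICBASE_EXTD))
        return;

    if (!(base & APICBASE_ENABLE))
        wrmsr(MSR_IA32_APICBASE, base | APICBASE_ENABLE);

    wrmsr(MSR_IA32_APICBASE, base | APICBASE_ENABLE | APICBASE_EXTD);
}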

[4.3.0-55.el6.103]
- x86/MCE: Fix race condition in mctelem_reserve
These lines (in mctelem_reserve)
newhead = oldhead->mcte_next;
if (cmpxchgptr(freelp, oldhead, newhead) == oldhead) {
are racy. After you read the newhead pointer it can happen that another
flow (thread or recursive invocation) changes the whole list but sets the
head to the same value. So oldhead is the same as *freelp, but you are
setting a new head that could point to any element (even one already in use).
This patch instead uses a bit array and atomic bit operations.
Signed-off-by: Frediano Ziglio Reviewed-by: Liu Jinsong
Upstream commit 60ea3a3ac3d2bcd8e85b250fdbfc46b3b9dc7085
Signed-off-by: Zhenzhong Duan Acked-by: Joe Jin [bug 21544772]
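
Reserving entries through a bit array avoids the ABA problem of swapping a
shared list head: a bit is either still free or it is not, and an atomic clear
can succeed for exactly one caller. A simplified sketch of that idea, using a
GCC __atomic builtin in place of Xen's bit-operation helpers; sizes and names
are illustrative:

#include <stdint.h>

#define MC_NENT 64

static uint64_t mctelem_free_bits = ~0ull;   /* bit set = entry free */

/* Reserve a free telemetry slot; returns its index, or -1 if none left.
 * Only the caller that observes the bit still set in the fetched value
 * wins the slot, so a slot can never be handed out twice. */
static int mctelem_reserve_idx(void)
{
    for (int i = 0; i < MC_NENT; i++) {
        uint64_t mask = 1ull << i;

        if (__atomic_fetch_and(&mctelem_free_bits, ~mask,
                               __ATOMIC_ACQ_REL) & mask)
            return i;
    }
    return -1;
}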

[4.3.0-55.el6.102]
- mce: fix race condition in mctelem_xchg_head
The function (mctelem_xchg_head()) used to exchange mce telemetry
list heads is racy. It may write to the head twice, with the second
write linking to an element in the wrong state.
Consider two threads: T1 inserting on the committed list, and T2 trying to
consume it.
1. T1 starts inserting an element (A) and sets the prev pointer (mcte_prev).
2. T1 is interrupted after the cmpxchg succeeded.
3. T2 gets the list, changes element A and updates the committed list head.
4. T1 resumes, reads the pointer to prev again and compares it with the
result of the cmpxchg, which succeeded, but in the meantime prev changed
in memory.
5. T1 thinks the cmpxchg failed and goes around the loop again, linking the
head to A again.
To solve the race, use a temporary variable for the prev pointer.
*linkp (which points to a field in the element) must be updated before
the cmpxchg(), as after a successful cmpxchg the element might be
immediately removed and reinitialized.
The wmb() prior to the cmpxchgptr() call is not necessary since it is
already a full memory barrier. This wmb() is thus removed.
Signed-off-by: Frediano Ziglio Reviewed-by: Liu Jinsong
Upstream commit e9af61b969906976188609379183cb304935f448
Signed-off-by: Zhenzhong Duan Acked-by: Joe Jin
---
xen/arch/x86/cpu/mcheck/mctelem.c | 9 +++++----
1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/cpu/mcheck/mctelem.c b/xen/arch/x86/cpu/mcheck/mctelem.c
index 37d830f..895ce1a 100644
--- a/xen/arch/x86/cpu/mcheck/mctelem.c
+++ b/xen/arch/x86/cpu/mcheck/mctelem.c
@@ -127,13 +127,14 @@ static DEFINE_PER_CPU(struct mc_telem_cpu_ctl, mctctl);
 static DEFINE_SPINLOCK(processing_lock);
 
 static void mctelem_xchg_head(struct mctelem_ent **headp,
-	struct mctelem_ent **old,
+	struct mctelem_ent **linkp,
 	struct mctelem_ent *new)
 {
 	for (;;) {
-		*old = *headp;
-		wmb();
-		if (cmpxchgptr(headp, *old, new) == *old)
+		struct mctelem_ent *old;
+
+		*linkp = old = *headp;
+		if (cmpxchgptr(headp, old, new) == old)
 			break;
 	}
 }
--
1.7.3 [bug 21544772]

[4.3.0-55.el6.49]
- mm: Make scrubbing a low-priority task
An idle processor will attempt to scrub pages left over by a previously
exited guest. The processor takes global heap_lock in scrub_free_pages(),
manipulates pages on the heap lists and releases the lock before performing
the actual scrubbing in __scrub_free_pages().
It has been observed that on some systems, even though scrubbing itself
is done with the lock not held, other unrelated heap users are unable
to take the (now free) lock. We theorize that massive scrubbing locks out
the bus (or some other HW resources), preventing lock requests from reaching
the scrubbing node.
This patch tries to alleviate this problem by having the scrubber monitor
whether there are other waiters for the heap lock and, if such waiters
exist, stop scrubbing.
To achieve this, we make two changes to existing code:
1. Parallelize the heap lock by breaking it to per-node locks
2. Create an atomic per-node counter array. Before a CPU on a particular
node attempts to acquire the (now per-node) lock it increments the counter.
The scrubbing processor periodically checks this counter and, if it is
non-zero, stops scrubbing.
A few notes:
1. Until now, total_avail_pages and midsize_alloc_zone_pages updates have been
performed under the global heap_lock, which was also used to control access to
the heap. Since those accesses are now guarded by per-node locks, we introduce
heap_lock_global.
Note that this is really only to protect readers of these variables from
reading inconsistent values (such as if another CPU is in the middle of
updating them). The values themselves are somewhat 'unsynchronized' from the
actual heap state. We try to be conservative and decrement them before pages
are taken from the heap and increment them after they are placed there.
2. Similarly, page_broken/offlined_list are no longer under heap_lock.
pglist_lock is added to synchronize access to those lists.
3. d->last_alloc_node used to be updated under heap_lock. It was read,
however, without holding this lock, so it seems that lockless updates will not
make the situation any worse (and since these updates are simple writes, as
opposed to some sort of RMW, we shouldn't need to convert it to an atomic).
Signed-off-by: Boris Ostrovsky Reviewed-by: Konrad Rzeszutek Wilk 
Acked-by: Chuck Anderson [bug 20816684]
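
The essence of the change is a cheap back-off test inside the scrub loop: each
CPU bumps a per-node counter before contending for that node's heap lock, and
the idle scrubber stops as soon as the counter is non-zero. A hedged sketch of
that shape, with C11 atomics standing in for Xen's primitives and all names
illustrative:

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_NUMNODES 8

/* Incremented by a CPU before it tries to take a node's heap lock,
 * decremented after the lock is released. */
static atomic_int node_lock_waiters[MAX_NUMNODES];

struct page_info;
bool next_dirty_page(unsigned int node, struct page_info **pg);  /* stand-in */
void scrub_one_page(struct page_info *pg);                       /* stand-in */

/* Scrub leftover pages on 'node', yielding as soon as real work shows up. */
static void scrub_free_pages(unsigned int node)
{
    struct page_info *pg;

    while (next_dirty_page(node, &pg)) {
        scrub_one_page(pg);

        if (atomic_load(&node_lock_waiters[node]) != 0)
            break;   /* someone wants the per-node heap lock: stop scrubbing */
    }
}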

[4.3.0-55.el6.48]
- IOMMU: make page table deallocation preemptible
Backport of cedfdd43a97.
We are spending lots of time flushing CPU cache, one PTE at a time, to
make sure that IOMMU (which may not be able to watch coherence traffic
on the bus) doesn't load stale PTE from memory.
For guests with lots of memory (say, >512GB) this may take as much as
half a minute or more and, as a result (because this is a non-preemptible
operation), things start to break down.
Below is the original commit message:
This too can take an arbitrary amount of time.
In fact, the bulk of the work is being moved to a tasklet, as handling
the necessary preemption logic in line seems close to impossible given
that the teardown may also be invoked on error paths.
Signed-off-by: Jan Beulich Reviewed-by: Andrew Cooper Acked-by: Xiantao Zhang
Signed-off-by: Boris Ostrovsky Acked-by: Chuck Anderson [bug 19731529]


