[Oraclevm-errata] OVMBA-2015-0139 Oracle VM 3.2 xen bug fix update

Errata Announcements for Oracle VM oraclevm-errata at oss.oracle.com
Fri Nov 6 12:27:37 PST 2015


Oracle VM Bug Fix Advisory OVMBA-2015-0139

The following updated rpms for Oracle VM 3.2 have been uploaded to the 
Unbreakable Linux Network:

x86_64:
xen-4.1.3-25.el5.209.1.x86_64.rpm
xen-devel-4.1.3-25.el5.209.1.x86_64.rpm
xen-tools-4.1.3-25.el5.209.1.x86_64.rpm


SRPMS:
http://oss.oracle.com/oraclevm/server/3.2/SRPMS-updates/xen-4.1.3-25.el5.209.1.src.rpm



Description of changes:

[4.1.3-25.el5.209.1]
- chkconfig services should be associated with xen-tools
Signed-off-by: Zhigang Wang
Reviewed-by: Adnan Misherfi [bug 21889174]

[4.1.3-25.el5.209]
- paging_log_dirty_disable: If not resuming (normal operations)
don't return -ERESTART
The old callsites such as hap_track_dirty_vram are not capable
of dealing with paging_log_dirty_disable returning -ERESTART.
If that happens we end up with the guest stuck in paused
mode and it can never be resumed. This is easily reproduced
on largeish machines with Windows guests.
When the guest enters the caching mode:
(XEN) stdvga.c:147:d1 entering stdvga and caching modes
we set up the dirty log. Unfortunately when we switch the PFNs and
disable the logging to refresh - we may be preempted and
return -ERESTART. That is returned to qemu-traditional, which
ignores it and does not restart the operation.
As such, if resume == 0 (so no continuation of the hypercall)
we retry in paging_log_dirty_disable when -ERESTART is returned.
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Annie Li [bug 21636035]
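The retry logic described above can be sketched as follows; the names `log_dirty_op` and `calls_until_done` are illustrative stand-ins, not Xen's actual internals:

```c
#include <assert.h>

#define ERESTART 85

int calls_until_done; /* how many attempts until the stub succeeds */

/* stand-in for the preemptible disable operation */
static int log_dirty_op(void)
{
    return (--calls_until_done > 0) ? -ERESTART : 0;
}

static int paging_log_dirty_disable(int resume)
{
    int ret;
    do {
        ret = log_dirty_op();
        /* when resuming, the hypercall continuation machinery
         * handles -ERESTART, so propagate it to the caller */
    } while (!resume && ret == -ERESTART);
    return ret;
}
```

With resume == 0 the function retries internally instead of handing -ERESTART to a caller (such as qemu-traditional) that would silently drop it.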

[4.1.3-25.el5.208]
- Use AUTO_PHP_SLOT as virtual devfn for rebooted pvhvm guest
Xend tries to get vdevfn from a dictionary and uses it as the vdevfn
for reboot. If, on first boot, the simulated NIC is unplugged before
the passed-through device is hotplugged, and on reboot the order is
reversed, there will be a conflict of vdevfn.
qemu.log shows 'hot add pci devfn -2 exceed.'
This patch can't be upstreamed as upstream has dropped 'xend' completely.
Signed-off-by: Zhenzhong Duan
Signed-off-by: Chuang Cao
Signed-off-by: Wengang Wang
Acked-by: Konrad Rzeszutek Wilk [bug 20781678]

[4.1.3-25.el5.207]
- Remove double quotes to avoid for-loop break.
Signed-off-by: Joe Jin [bug 21465632]

[4.1.3-25.el5.206]
- VMX: fix PAT value seen by guest
The XSA-60 fixes introduced a window during which the guest PAT gets
forced to all zeros. This shouldn't be visible to the guest. Therefore
we need to intercept PAT MSR accesses during that time period.
Signed-off-by: Jan Beulich
Reviewed-by: Liu Jinsong
From fce79f8ce91dc45f3a4d699ee67c49e6cbeb1197
Signed-off-by: Zhenzhong Duan [bug 18325257]

[4.1.3-25.el5.200]
- x86/MCE: Fix race condition in mctelem_reserve
These lines (in mctelem_reserve)
newhead = oldhead->mcte_next;
if (cmpxchgptr(freelp, oldhead, newhead) == oldhead) {
are racy. After you read the newhead pointer, another
flow (thread or recursive invocation) can change the entire list yet
set the head back to the same value. So oldhead is the same as *freelp
but you are setting a new head that could point to any element (even
one already in use).
This patch instead uses a bit array and atomic bit operations.
Signed-off-by: Frediano Ziglio
Reviewed-by: Liu Jinsong
Upstream commit 60ea3a3ac3d2bcd8e85b250fdbfc46b3b9dc7085
Signed-off-by: Zhenzhong Duan
Acked-by: Joe Jin [bug 19613191]
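The failure described above is the classic ABA problem. A minimal sketch of the bit-array approach (illustrative names, not Xen's actual mctelem code): instead of cmpxchg on a free-list head, reserve slots with an atomic test-and-set on a bitmap, which cannot be fooled by the head returning to an old value:

```c
#include <stdatomic.h>

#define MC_NENT 32

atomic_uint mcte_inuse; /* one bit per telemetry entry */

/* returns a reserved slot index, or -1 if all slots are in use */
int mctelem_reserve_slot(void)
{
    for (int i = 0; i < MC_NENT; i++) {
        unsigned bit = 1u << i;
        /* atomic fetch_or acts as test-and-set: if the bit was
         * previously clear, this CPU now owns the slot */
        if (!(atomic_fetch_or(&mcte_inuse, bit) & bit))
            return i;
    }
    return -1;
}

void mctelem_release_slot(int i)
{
    atomic_fetch_and(&mcte_inuse, ~(1u << i));
}
```

Ownership is decided by a single atomic read-modify-write on one word, so no stale list traversal is involved.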

[4.1.3-25.el5.199]
- mce: fix race condition in mctelem_xchg_head
The function (mctelem_xchg_head()) used to exchange mce telemetry
list heads is racy. It may write to the head twice, with the second
write linking to an element in the wrong state.
If there are two threads, T1 inserting on committed list; and T2
trying to consume it.
1. T1 starts inserting an element (A), sets prev pointer (mcte_prev).
2. T1 is interrupted after the cmpxchg succeeded.
3. T2 gets the list and changes element A and updates the commit list
head.
4. T1 resumes, reads the prev pointer again and compares it with the
result from the cmpxchg, which succeeded, but in the meantime prev
changed in memory.
5. T1 thinks the cmpxchg failed and goes around the loop again,
linking head to A again.
To solve the race, use a temporary variable for the prev pointer.
*linkp (which points to a field in the element) must be updated before
the cmpxchg() as after a successful cmpxchg the element might be
immediately removed and reinitialized.
The wmb() prior to the cmpxchgptr() call is not necessary since it is
already a full memory barrier. This wmb() is thus removed.
Signed-off-by: Frediano Ziglio
Reviewed-by: Liu Jinsong
Upstream commit e9af61b969906976188609379183cb304935f448
Signed-off-by: Zhenzhong Duan
Acked-by: Joe Jin [bug 19613191]
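The pattern of the fix - snapshot the old head into a local, link the element before the cmpxchg, and compare against the snapshot rather than re-reading memory - can be sketched like this (illustrative, not Xen's exact code):

```c
#include <stdatomic.h>
#include <stddef.h>

struct mcte {
    struct mcte *next;
};

void xchg_head(struct mcte *_Atomic *headp, struct mcte *elem)
{
    struct mcte *old;
    do {
        old = atomic_load(headp); /* temporary copy of the prev head */
        elem->next = old;         /* link *before* the cmpxchg */
        /* compare against the local 'old', not a fresh read, so a
         * concurrent change of the field in memory cannot be
         * mistaken for a cmpxchg failure */
    } while (!atomic_compare_exchange_weak(headp, &old, elem));
}
```

atomic_compare_exchange_weak may fail spuriously, which the loop absorbs; the key property is that success and failure are judged against the same snapshot used to link the element.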

[4.1.3-25.el5.186]
- AMD/intremap: Prevent use of per-device vector maps until irq logic
is fixed
XSA-36 changed the default vector map mode from global to per-device.
This is because a global vector map does not prevent one PCI device
from impersonating another and launching a DoS on the system.
However, the per-device vector map logic is broken for devices with
multiple MSI-X vectors, which can either result in a failed ASSERT()
or misprogramming of a guest's interrupt remapping tables. The core
problem is not trivial to fix.
In an effort to get AMD systems back to a non-regressed state,
introduce a new type of vector map called per-device-global. This uses
per-device vector maps in the IOMMU, but uses a single used_vector map
for the core IRQ logic.
This patch is intended to be removed as soon as the per-device logic
is fixed correctly.
Signed-off-by: Andrew Cooper
Acked-by: Suravee Suthikulpanit
Signed-off-by: Zhenzhong Duan
Reviewed-by: Konrad Rzeszutek Wilk
Reviewed-by: Boris Ostrovsky
This patch fixes a bug introduced by xsa36.patch [bug 20347950]

[4.1.3-25.el5.185]
- x86: fix ordering of operations in destroy_irq()
The fix for XSA-36, switching the default of vector map management to
be per-device, exposed more readily a problem with the cleanup of these
vector maps: dynamic_irq_cleanup() clearing desc->arch.used_vectors
keeps the subsequently invoked clear_irq_vector() from clearing the
bits for both the in-use and a possibly still outstanding old vector.
Fix this by folding dynamic_irq_cleanup() into destroy_irq(), which was
its only caller, deferring the clearing of the vector map pointer until
after clear_irq_vector().
While at it, also defer resetting of desc->handler until after the loop
around smp_mb() checking for IRQ_INPROGRESS to be clear, fixing a
(mostly theoretical) issue with the interaction with do_IRQ(): If we
don't defer the pointer reset, do_IRQ() could, for non-guest IRQs, call
->ack() and ->end() with different ->handler pointers, potentially
leading to an IRQ remaining un-acked. The issue is mostly theoretical
because non-guest IRQs are subject to destroy_irq() only on (boot time)
error paths.
As to the changed locking: Invoking clear_irq_vector() with desc->lock
held is okay because vector_lock already nests inside desc->lock (proven
by set_desc_affinity(), which takes vector_lock and gets called from
various desc->handler->ack implementations, getting invoked with
desc->lock held).
Reported-by: Andrew Cooper
Signed-off-by: Jan Beulich
Acked-by: Keir Fraser
Reviewed-by: Andrew Cooper
Acked-by: George Dunlap
Signed-off-by: Zhenzhong Duan
Reviewed-by: Konrad Rzeszutek Wilk
Reviewed-by: Boris Ostrovsky
This patch fixes a bug introduced by xsa36.patch [bug 20347950]

[4.1.3-25.el5.184]
- AMD IOMMU: allow disabling only interrupt remapping when certain
IVRS consistency checks fail
After some more thought on the XSA-36 and specifically the comments we
got regarding disabling the IOMMU in this situation altogether making
things worse instead of better, I came to the conclusion that we can
actually restrict the action in affected cases to just disabling
interrupt remapping. That doesn't make the situation worse than prior
to the XSA-36 fixes (where interrupt remapping didn't really protect
domains from one another), but allows at least DMA isolation to still
be utilized.
To do so, disabling of interrupt remapping must be explicitly requested
on the command line - respective checks will then be skipped.
Signed-off-by: Jan Beulich
Acked-by: Suravee Suthikulpanit
Signed-off-by: Zhenzhong Duan
Reviewed-by: Konrad Rzeszutek Wilk
Reviewed-by: Boris Ostrovsky
This patch fixes a bug introduced by xsa36.patch [bug 20347950]

[4.1.3-25.el5.183]
- AMD IOMMU: also spot missing IO-APIC entries in IVRS table
Apart from dealing with duplicate conflicting entries, we also have to
handle firmware omitting IO-APIC entries in IVRS altogether. Not doing
so has resulted in c/s 26517:601139e2b0db to crash such systems during
boot (whereas with the change here the IOMMU gets disabled just as is
being done in the other cases, i.e. unless global tables are being
used).
Debugging this issue has also pointed out that the debug log output is
pretty ugly to look at - consolidate the output, and add one extra
item for the IVHD special entries, so that future issues are easier
to analyze.
Signed-off-by: Jan Beulich
Tested-by: Sander Eikelenboom
Acked-by: Ian Campbell
xen-unstable changeset: 26531:e68f14b9e739
xen-unstable date: Thu Feb 14 08:40:52 UTC 2013
Signed-off-by: Zhenzhong Duan
Reviewed-by: Konrad Rzeszutek Wilk
Reviewed-by: Boris Ostrovsky
This patch applies on top of xsa36.patch which is version 3 of the
xen.org 4.1 patch already applied to OVM 3. [bug 20347950]

[4.1.3-25.el5.182]
- fix hvm migration 32 vcpus limit
When we migrate an HVM guest, by default our shared_info can
only hold up to 32 CPUs. As such the hypercall
VCPUOP_register_vcpu_info was introduced which allowed us to
setup per-page areas for VCPUs. This means we can boot PVHVM
guest with more than 32 VCPUs. During migration the per-cpu
structure is allocated fresh by the hypervisor (vcpu_info_mfn
is set to INVALID_MFN) so that the newly migrated guest
can make the VCPUOP_register_vcpu_info hypercall.
Unfortunately we end up triggering this condition:
/* Run this command on yourself or on other offline VCPUS. */
if ( (v != current) && !test_bit(_VPF_down, &v->pause_flags) )
which means we are unable to setup the per-cpu VCPU structures
for running vCPUS. The Linux PV code paths make this work by
iterating over every vCPU with:
1) is target CPU up (VCPUOP_is_up hypercall?)
2) if yes, then VCPUOP_down to pause it.
3) VCPUOP_register_vcpu_info
4) if it was down, then VCPUOP_up to bring it back up
But since VCPUOP_down, VCPUOP_is_up, and VCPUOP_up are
not allowed on HVM guests we can't do this. This patch
allows those VCPUOPs for HVM guests. [bug 21289158]

[4.1.3-25.el5.166]
- x86: vcpu_destroy_pagetables() must not return -EINTR
.. otherwise it has the side effect that domain_relinquish_resources
will stop and return to user-space with -EINTR, which it is not
equipped to deal with; or vcpu_reset will ignore it and convert the
error to -ENOMEM.
The preemption mechanism we have for domain destruction is to return
-EAGAIN (and then user-space calls the hypercall again) and as such we need
to catch the case of:
domain_relinquish_resources
->vcpu_destroy_pagetables
-> put_page_and_type_preemptible
-> __put_page_type
returns -EINTR
and convert it to the proper type. For:
XEN_DOMCTL_setvcpucontext
-> vcpu_reset
-> vcpu_destroy_pagetables
we need to return -ERESTART otherwise we end up returning -ENOMEM.
The other callers of vcpu_destroy_pagetables, via arch_vcpu_reset
(vcpu_reset), are:
- hvm_s3_suspend (asserts on any return code),
- vlapic_init_sipi_one (asserts on any return code).
Signed-off-by: Konrad Rzeszutek Wilk
Signed-off-by: Jan Beulich
Acked-by: Chuck Anderson [bug 20854422]

[4.1.3-25.el5.165]
- mm: Make scrubbing a low-priority task
An idle processor will attempt to scrub pages left over by a previously
exited guest. The processor takes global heap_lock in scrub_free_pages(),
manipulates pages on the heap lists and releases the lock before performing
the actual scrubbing in __scrub_free_pages().
It has been observed that on some systems, even though scrubbing itself
is done with the lock not held, other unrelated heap users are unable
to take the (now free) lock. We theorize that massive scrubbing locks out
the bus (or some other HW resources), preventing lock requests from reaching
the scrubbing node.
This patch tries to alleviate this problem by having the scrubber monitor
whether there are other waiters for the heap lock and, if such waiters
exist, stop scrubbing.
To achieve this, we make two changes to existing code:
1. Parallelize the heap lock by breaking it into per-node locks
2. Create an atomic per-node counter array. Before a CPU on a particular
node attempts to acquire the (now per-node) lock it increments the counter.
The scrubbing processor periodically checks this counter and, if it is
non-zero, stops scrubbing.
A few notes:
1. Until now, total_avail_pages and midsize_alloc_zone_pages updates
have been performed under the single heap_lock. Since we no longer
have this global lock, we introduce pgcount_lock. Note that this is
really only to protect readers of these variables from reading
inconsistent values (such as if another CPU is in the middle of
updating them). The values themselves are somewhat 'unsynchronized'
from the actual heap state. We try to be conservative and decrement
them before pages are taken from the heap and increment them after
they are placed there.
2. Similarly, page_broken/offlined_list are no longer under heap_lock.
pglist_lock is added to synchronize access to those lists.
3. d->last_alloc_node used to be updated under heap_lock. It was read,
however, without holding this lock, so it seems that lockless updates
will not make the situation any worse (and since these updates are
simple writes, as opposed to some sort of RMW, we shouldn't need to
convert it to an atomic).
Signed-off-by: Boris Ostrovsky
Reviewed-by: Konrad Rzeszutek Wilk
Acked-by: Chuck Anderson [bug 20816736]
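The waiter-counter idea can be sketched as follows; the counter is per-node in the actual patch, and all names here are illustrative:

```c
#include <stdatomic.h>

/* CPUs bump this before trying to take the (per-node) heap lock */
atomic_int node_lock_waiters;

void heap_lock_acquire_prologue(void) { atomic_fetch_add(&node_lock_waiters, 1); }
void heap_lock_acquired(void)         { atomic_fetch_sub(&node_lock_waiters, 1); }

/* returns the number of pages scrubbed before backing off */
int scrub_free_pages(int pages_pending)
{
    int scrubbed = 0;
    while (pages_pending-- > 0) {
        /* periodically check: if anyone is waiting for the heap
         * lock, stop scrubbing so they are not starved */
        if (atomic_load(&node_lock_waiters) > 0)
            break;
        scrubbed++; /* scrub one page (elided) */
    }
    return scrubbed;
}
```

The scrubber remains a pure background task: a single pending allocator immediately preempts it.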

[4.1.3-25.el5.162]
- xend: fix python fork-and-log 100% CPU consumption issue
It is caused by a python internal bug: http://bugs.python.org/issue6721 .
When xend forks a subprocess and then calls a logging function, a
deadlock occurs. Because python has no fix yet, remove the
logging.debug() call in XendBootloader.py to work around it.
Signed-off-by: Joe Jin
Reviewed-by: Zhigang Wang [bug 20752005]

[4.1.3-25.el5.161]
- Xend: make pvhvm migration work from ovm328 to higher versions
The original xend patch fixes migration failure from ovm3.2.x to
ovm3.3.x, and this issue exists in xen upstream too. However, that
fix is not the best solution, so it is reverted here for now; a new
fix will be committed for ovm3.3.x and ovm3.4 to make migration
work across all OVM versions.
Signed-off-by: Annie Li
Acked-by: Adnan Misherfi [bug 20604979]

[4.1.3-25.el5.160]
- switch internal hypercall restart indication from -EAGAIN to -ERESTART

-EAGAIN being a return value we want to return to the actual caller in
a couple of cases makes this unsuitable for restart indication, and x86
already developed two cases where -EAGAIN could not be returned as
intended due to this (which is being fixed here at once).

Signed-off-by: Jan Beulich
Acked-by: Ian Campbell
Acked-by: Aravind Gopalakrishnan
Reviewed-by: Tim Deegan
(cherry-pick from f5118cae0a7f7748c6f08f557e2cfbbae686434a)
Signed-off-by: Konrad Rzeszutek Wilk
Conflicts:
A LOT
[There are a lot of changes for this commit. We only care about the
one in domain destruction. We need the value -EAGAIN to be passed
to the toolstack so that it will retry the destruction. Any other
value (-ERESTART) would stop it - which is why, unlike some of the
other backports, we convert -ERESTART to -EAGAIN here only].
Acked-by: Chuck Anderson
Reviewed-by: John Haxby [bug 20666804]

[4.1.3-25.el5.159]
- rc/xendomains: 'stop' - also take care of stuck guests.
When we are done shutting down the guests (xm shutdown --all), they
are at that point not running at all. They might still have
QEMU or backend drivers set up due to the asynchronous nature
of the 'shutdown' process. As such, doing a 'destroy' on all
the guests will assure us that the backend drivers and QEMU
are indeed stopped.
The mechanism by which 'shutdown' works is quite complex. There
are three actors at play:
a) xm client (Which connects to the XML RPC),
b) Xend Xenstore watch thread,
c) XML RPC server thread
The way shutdown starts is:
xm client             | XML RPC |  watch thread
shutdown.py           |         |
- server....shutdown -|--> XenDomainInfo:shutdown
                      |    Sets 'control/shutdown'
                      |    calls xc.domain_shutdown
                      |    returns
- loops calling:      |         |
domains_with_state ---|--> XendDomain:list_names
  gets active         |         |
  and inactive        |         |  watchMain
  list                |         |  _on_domains_changed
                      |         |  - _refresh
                      |         |    -> _refreshTxn
                      |         |    -> update [sets to DOM_STATE_SHUTDOWN]
                      |         |    -> refreshShutdown
                      |         |       [spawns a new thread calling _maybeRestart]
                      |         |  [_maybeRestart thread]:
                      |         |    destroy
                      |         |    [sets it to DOM_STATE_HALTED]
                      |         |    - cleanupDomain
                      |         |      - _releaseDevices
                      |         |      - ..
Four threads total.
There is a race between 'watchMain' being executed and
'domains_with_state' calling 'list_names'. Guests that are in
DOM_STATE_UNKNOWN or DOM_STATE_PAUSED might not be updated to
DOM_STATE_SHUTDOWN, as list_names can be called _before_ watchMain
triggers. There is a lock acquisition to call 'refresh' in
list_names - but if it fails, it will just use the stale list.
As such, the process works great for guests that are in
STATE_SHUTDOWN, STATE_HALT, or STATE_RUNNING - which
'domains_with_state' will present to the shutdown process.
For the other states (the more troublesome ones) we might have them
still lying around.
As such this patch calls 'xm destroy' on all those remaining guests
to do cleanup.
Signed-off-by: Konrad Rzeszutek Wilk
Acked-by: Chuck Anderson
Reviewed-by: John Haxby [bug 20666799]

[4.1.3-25.el5.158]
- xend: Fix race between shutdown and cleanup.
When we invoke 'xm shutdown --wait --all' we will exit the moment
the guest has stopped executing. That is when xcinfo returns
shutdown=1. However that does not mean that all the infrastructure
around the guest has been torn down - QEMU can be still running,
Netback and Blkback as well. In the past the time between
the shutdown and qemu being disposed of was quick - however
the race was still present there.
With our usage of PCIe passthrough we MUST unbind those devices
from a guest before we can continue on with the reboot of
the system. That is due to the complex interaction the SR-IOV
devices have with VF and PFs - as you cannot unload the PF driver
before the VFs driver have been unbound from the guest.
If you try to reboot the machine at this point the PF driver
will not unload.
The VF drivers are bound to Xen pciback - and they are unbound
when QEMU is stopped and XenStore keys are torn down - which
is done _after_ the 'shutdown' xcinfo is set (in the cleanup
stage). Worse, the Xen blkback is still active - which means
we cannot unmount the storage until said cleanup has finished.
But as mentioned - 'xm shutdown --wait --all' would happily
exit before the cleanup finished and the shutdown (or reboot)
of the initial domain would continue on. It would eventually
get wedged when trying to unmount the storage which still
had a refcount from Xen block driver - which was not cleaned up
as Xend was killed earlier.
This patch solves this by delaying 'xm shutdown --wait --all'
to wait until the guest has transitioned from RUNNING ->
SHUTDOWN -> HALTED stage. The SHUTDOWN means it has ceased
to execute. The HALTED is that the cleanup is being performed.
We will cycle through all of the guests in that state until
they have moved out of those states (removed completely from
the system).
Signed-off-by: Konrad Rzeszutek Wilk
Acked-by: Chuck Anderson
Reviewed-by: John Haxby [bug 20661826]
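The waiting scheme above - keep polling until every guest has passed through SHUTDOWN and HALTED and is gone - can be sketched with hypothetical states and a stub that advances teardown one step per poll:

```c
/* illustrative domain states; in xend these are DOM_STATE_* values */
enum dom_state { RUNNING, SHUTDOWN, HALTED, GONE };

#define NR_DOMS 3

enum dom_state state[NR_DOMS] = { SHUTDOWN, HALTED, RUNNING };

/* stand-in: each poll advances every domain one step toward GONE */
static void poll_once(void)
{
    for (int d = 0; d < NR_DOMS; d++)
        if (state[d] != GONE)
            state[d]++;
}

/* returns the number of polls until no domain is mid-teardown
 * (shutdown has already been requested for all of them) */
int wait_all_torn_down(void)
{
    int polls = 0;
    for (;;) {
        int busy = 0;
        for (int d = 0; d < NR_DOMS; d++)
            if (state[d] != GONE)
                busy++;
        if (!busy)
            return polls;
        poll_once();
        polls++;
    }
}
```

Exiting only when every domain has left HALTED guarantees that QEMU, the backend drivers, and any block-device refcounts have been released before the host continues its own reboot.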

[4.1.3-25.el5.154]
- Fix num node allocation issue for VM
When booting a VM with more vcpus than pcpus per node on a
numa-enabled system, the VM may get a double allocation of the same
node for pcpus, lowering the performance of the guest.
On the customer rack we see the CPU spikes happen on the large vServer
which has a CPU oversubscription problem caused by this issue.
Signed-off-by: Zhenzhong Duan
Acked-by: Chuang Cao [bug 20246350]

[4.1.3-25.el5.153]
- hvmloader: Define uint64_t
The 'hvmloader: also cover PCI MMIO ranges above 4G with UC MTRR
ranges' patch adds a uint64_t which is not defined for the ROMBIOS.
Upstream-wise I am not entirely sure how the rombios build pulls in
uint64_t, as there does not seem to be any declaration of this type -
yet it still compiles properly.
This fixes the issue of the build failing because of the uint64_t type.
Signed-off-by: Konrad Rzeszutek Wilk
Orabug: 20136142
Tested-by: Michel Riviere
Signed-off-by: Zhenzhong Duan
Committed-by: Chuck Anderson [bug 20136142]

[4.1.3-25.el5.152]
- fix build with certain iasl versions
Orabug: 20136142
While most iasl versions support what we have now, Wheezy's dislikes
the empty range. Put a fake one in place - it's getting overwritten
upon evaluation of _CRS anyway.
The range could be grown (downwards) if necessary; the way it is now
it is
- the highest possible one below the 36-bit boundary (with 36 bits
being the lowest common denominator for all supported systems),
- the smallest possible one that said iasl accepts.
Reported-by: Sander Eikelenboom
Signed-off-by: Jan Beulich
Acked-by: Ian Campbell
Tested-by: Michel Riviere
Signed-off-by: Zhenzhong Duan
Committed-by: Chuck Anderson [bug 20136142]

[4.1.3-25.el5.151]
- don't use AML operations on 64-bit fields
Orabug: 20136142
WinXP and Win2K3, while having no problem with the QWordMemory resource
(there was another one there before), don't like operations on 64-bit
fields. Split the fields d0688669 ('hvmloader: also cover PCI MMIO
ranges above 4G with UC MTRR ranges') added to 32-bit ones, handling
carry over explicitly.
Sadly the constructs needed to create the sub-fields - nominally
CreateDWordField(PRT0, _SB.PCI0._CRS._Y02._MIN, MINL)
CreateDWordField(PRT0, Add(_SB.PCI0._CRS._Y02._MIN, 4), MINH)
- can't be used: The former gets warned upon by newer iasl, i.e. would
need to be replaced by the latter just with the addend changed to 0,
and the latter doesn't translate properly with recent iasl). Hence,
short of having an ASL/iasl expert at hand, we need to work around the
shortcomings of various iasl versions. See the code comment.
Signed-off-by: Jan Beulich
Acked-by: Ian Campbell
Tested-by: Michel Riviere
Signed-off-by: Zhenzhong Duan
Committed-by: Chuck Anderson [bug 20136142]

[4.1.3-25.el5.150]
- x86/mm: update max_mapped_pfn on MMIO mappings too.
Orabug: 20136142
max_mapped_pfn should reflect the highest mapping we've ever seen of
any type, or the tests in the lookup functions will be wrong. As it
happens, the highest mapping has always been a RAM one, but this is no
longer the case when we allow 64-bit BARs.
Reported-by: Xudong Hao
Signed-off-by: Tim Deegan
Committed-by: Tim Deegan
Tested-by: Michel Riviere
Signed-off-by: Zhenzhong Duan
Committed-by: Chuck Anderson [bug 20136142]

[4.1.3-25.el5.149]
- hvmloader: Add 64 bits big bar support
Orabug: 20136142
Currently it is assumed that PCI device BARs are accessed below 4G.
If there is a device whose BAR size is larger than 4G, it must be
accessed above the 4G memory address. This patch enables 64-bit big
BAR support in hvmloader.
Signed-off-by: Xiantao Zhang
Signed-off-by: Xudong Hao
Committed-by: Keir Fraser
Tested-by: Michel Riviere
Signed-off-by: Zhenzhong Duan
Committed-by: Chuck Anderson [bug 20136142]

[4.1.3-25.el5.148]
- Orabug: 20136142
Currently it is assumed that PCI device BARs are accessed below 4G.
If there is a device whose BAR size is larger than 4G, it must be
accessed above the 4G memory address. This patch enables 64-bit big
BAR support in qemu-xen.
Signed-off-by: Xiantao Zhang
Signed-off-by: Xudong Hao
Tested-by: Michel Riviere
Signed-off-by: Zhenzhong Duan
Committed-by: Chuck Anderson [bug 20136142]

[4.1.3-25.el5.146]
- fix XENMEM_add_to_physmap preemption handling
Just like for all other hypercalls we shouldn't be modifying the input
structure - all of the fields are, even if not explicitly documented,
just inputs.
Signed-off-by: Jan Beulich
Reviewed-by: Tim Deegan
Acked-by: Keir Fraser
Acked-by: Ian Campbell
(cherry picked from commit ade868939fe06520bb946dad740e1f3f1c12ea82)
Conflicts:
xen/common/compat/memory.c
[Xen 4.4 had new hypercalls (claim and remove_physmap) which caused
this conflict]
xen/common/memory.c
[We did not have XENMEM_set_memory_map hypercall) which caused
this conflict]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.145]
- move XENMEM_add_to_physmap handling framework to common code
There's really nothing really architecture specific here; the
architecture specific handling is limited to
xenmem_add_to_physmap_one().
Signed-off-by: Jan Beulich
Reviewed-by: Tim Deegan
Acked-by: Keir Fraser
Acked-by: Ian Campbell
(cherry picked from commit 4be86bb194e25e46b6cbee900601bfee76e8090a)
Conflicts:
xen/arch/arm/mm.c
[We don't have ARM in Xen 4.1]
xen/arch/x86/mm.c
xen/arch/x86/x86_64/compat/mm.c
xen/common/compat/memory.c
xen/common/memory.c
xen/include/xen/mm.h
[And new hypercalls - claim and remove_physmap - as well as the
cleanups done (copyback and XSM calls made earlier) contribute
to the massive conflict]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.144]
- IOMMU: make page table population preemptible
Since this can take an arbitrary amount of time, the rooting domctl as
well as all involved code must become aware of this requiring a
continuation.
The subject domain's rel_mem_list is being (ab)used for this, in a way
similar to and compatible with broken page offlining.
Further, operations get slightly re-ordered in assign_device(): IOMMU
page tables now get set up _before_ the first device gets assigned, at
once closing a small timing window in which the guest may already see
the device but wouldn't be able to access it.
Signed-off-by: Jan Beulich
Acked-by: Tim Deegan
Reviewed-by: Andrew Cooper
Acked-by: Xiantao Zhang
(cherry picked from commit 3e06b9890c0a691388ace5a6636728998b237b90)
Conflicts:
xen/arch/x86/domain.c
xen/arch/x86/domain.c
xen/arch/x86/mm/p2m-pod.c
xen/drivers/passthrough/iommu.c
xen/include/xen/sched.h
[As we are putting this patch on top of 
cedfdd43a9798e535a05690bb6f01394490d26bb
'IOMMU: make page table deallocation preemptible' which upstream
is done the other way around]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.142]
- hvmloader: Fix memory relocation loop part 2
The change to tools/firmware/hvmloader/util.h was left out of
hvmloader-Fix-memory-relocation-loop.patch. Change it here.
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson
Patch comments from hvmloader-Fix-memory-relocation-loop.patch:
Signed-off-by: Keir Fraser
(cherry picked from commit 5a9a98a7a680012d7259848f1957ad32cdde4e14)
Conflicts:
tools/firmware/hvmloader/pci.c
[We did not backport 'b39d3fa hvmloader: setup PCI bus in a common 
function again.'
which moves the pci_setup in the 'pci.c' file]. [bug 20116102]

[4.1.3-25.el5.141]
- Fix memory relocation loop.
Signed-off-by: Keir Fraser
(cherry picked from commit 5a9a98a7a680012d7259848f1957ad32cdde4e14)
Conflicts:
tools/firmware/hvmloader/pci.c
[We did not backport 'b39d3fa hvmloader: setup PCI bus in a common 
function again.'
which moves the pci_setup in the 'pci.c' file].
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.140]
- iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid 
unnecessary iotlb flush
Add cpu flag that will be checked by the iommu low level code
to skip iotlb flushes. iommu_iotlb_flush shall be called explicitly.
Signed-off-by: Jean Guyader
Committed-by: Keir Fraser
(cherry picked from commit cf95b2a9fd5aff18408e501c67203c095b1ddc1c)
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]
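The batching pattern this flag enables can be sketched as follows; the flag is per-CPU in Xen, and these stubs are illustrative rather than the real IOMMU code:

```c
#include <stdbool.h>

bool iommu_dont_flush_iotlb; /* per-CPU in Xen; global here for brevity */
int flush_count;

static void iommu_flush(void) { flush_count++; }

static void iommu_map_page(void)
{
    /* ... update IOMMU page tables (elided) ... */
    if (!iommu_dont_flush_iotlb)
        iommu_flush(); /* per-page flush, unless batching is on */
}

void map_range(int npages)
{
    iommu_dont_flush_iotlb = true;
    for (int i = 0; i < npages; i++)
        iommu_map_page();
    iommu_dont_flush_iotlb = false;
    iommu_flush(); /* one explicit flush for the whole range */
}
```

Mapping a range this way costs one IOTLB flush instead of one per page, which is the point of deferring to an explicit iommu_iotlb_flush call.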

[4.1.3-25.el5.139]
- hvmloader: Change memory relocation loop when overlap with PCI hole
Change the way we relocate memory pages if they overlap with the PCI
hole. Use the new map space (XENMAPSPACE_gmfn_range) to move the loop
into xen.
This code usually gets triggered when a device is passed through to a
guest and the PCI hole has to be extended to have enough room to map
the device BARs. The PCI hole will start lower and it might overlap
with some RAM that has been allocated for the guest. That usually
happens if the guest has more than 4G of RAM. We have to relocate
those pages to high memory, otherwise they won't be accessible.
Signed-off-by: Jean Guyader
Committed-by: Keir Fraser
(cherry picked from commit e51e2e0e581b91a61835413c3bfa5b46426825f7)
Conflicts:
[We did not backport 'b39d3fa hvmloader: setup PCI bus in a common 
function again.'
which moves the pci_setup in the 'pci.c' file].
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.138]
- x86: Fix RCU locking in XENMEM_add_to_physmap.
Signed-off-by: Keir Fraser
(cherry picked from commit 28e312a8b710d2208ee9ce2c25e5dfc11bc1c1b0)
Conflicts:
xen/arch/x86/mm.c
[We did not backport 51032ca058e43fbd37ea1f7c7c003496f6451340
'Modify naming of queries into the p2m' which added a 'put_gfn'
and as such the conflict showed up]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.137]
- mm: New XENMEM space, XENMAPSPACE_gmfn_range
XENMAPSPACE_gmfn_range is like XENMAPSPACE_gmfn but it runs on
a range of pages. The size of the range is defined in a new field.
This new field .size is located in the 16 bits padding between .domid
and .space in struct xen_add_to_physmap to stay compatible with older
versions.
Signed-off-by: Jean Guyader
Committed-by: Keir Fraser
(cherry picked from commit a04811a315e059101fa3b3303e75b97eac7c5c95)
Conflicts:
xen/arch/x86/mm.c
[We did not backport 51032ca058e43fbd37ea1f7c7c003496f6451340
'Modify naming of queries into the p2m' which added a 'put_gfn'
and as such the conflict showed up]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.136]
- add_to_physmap: Move the code for XENMEM_add_to_physmap
Move the code for the XENMEM_add_to_physmap case into its own
function (xenmem_add_to_physmap).
Signed-off-by: Jean Guyader
Committed-by: Keir Fraser
(cherry picked from commit 19c617a85bb3c4d4fa9afc4919273e0f9b71cb85)
Conflicts:
xen/arch/x86/mm.c
[We did not backport 51032ca058e43fbd37ea1f7c7c003496f6451340
'Modify naming of queries into the p2m' which added a 'put_gfn'
and as such the conflict showed up]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.135]
- iommu: Introduce iommu_flush and iommu_flush_all.
Signed-off-by: Jean Guyader
Committed-by: Keir Fraser
(cherry picked from commit bf3292ca31ef4eedd6aa070b04321178a60e4b8f)
Conflicts:
xen/drivers/passthrough/iommu.c
[As we are putting this patch on top of 
cedfdd43a9798e535a05690bb6f01394490d26bb
'IOMMU: make page table deallocation preemptible' which upstream
is done the other way around]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.134]
- vtd: Refactor iotlb flush code
Factorize the iotlb flush code from map_page and unmap_page into
its own function.
Signed-off-by: Jean Guyader
Committed-by: Keir Fraser
(cherry picked from commit c312ccdeafa0ed2ec710f48f27d575c7bf88eafa)
Conflicts:
xen/drivers/passthrough/vtd/iommu.c
[As we are putting this patch on top of 
cedfdd43a9798e535a05690bb6f01394490d26bb
'IOMMU: make page table deallocation preemptible' which upstream
is done the other way around]
Signed-off-by: Konrad Rzeszutek Wilk
Committed-by: Chuck Anderson [bug 20116102]

[4.1.3-25.el5.133]
- set max_phys_cpus=320 [bug 19688525]
Signed-off-by: Konrad Rzeszutek Wilk
Signed-off-by: Chuck Anderson

[4.1.3-25.el5.132]
- xend: disable sslv3 due to CVE-2014-3566
Signed-off-by: Zhigang Wang
Signed-off-by: Kurt Hackel
Signed-off-by: Adnan Misherfi
Backported-by: Chuang Cao [bug 19831403]

[4.1.3-25.el5.131]
- IOMMU: make page table deallocation preemptible
Backport of cedfdd43a97.
We are spending lots of time flushing CPU cache, one PTE at a time, to
make sure that IOMMU (which may not be able to watch coherence traffic
on the bus) doesn't load stale PTE from memory.
For guests with lots of memory (say, >512GB) this may take as much as
half a minute or more and as result (because this is a non-preemptable
operation) things start to break down.
Below is the original commit message:
This too can take an arbitrary amount of time.
In fact, the bulk of the work is being moved to a tasklet, as handling
the necessary preemption logic in line seems close to impossible given
that the teardown may also be invoked on error paths.
Signed-off-by: Jan Beulich
Reviewed-by: Andrew Cooper
Acked-by: Xiantao Zhang
Signed-off-by: Boris Ostrovsky
Acked-by: Chuck Anderson [bug 19796835]




