Oracle Unbreakable Enterprise Kernel Release 2 Release Notes

Updated March 2012

--------------------------------------------------------------------------

Contents

1. Oracle Unbreakable Enterprise Kernel Release 2 Release Notes
   1. Introduction
   2. New features
      1. Updates/Improvements added by Oracle
         1. Btrfs
         2. Xen domU improvements
         3. Other improvements
      2. Driver Updates
         1. Storage drivers
         2. Network drivers
         3. Other drivers
      3. Notable improvements in mainline Linux since Linux 2.6.32
   3. Updated or added utilities
      1. Oracle Linux 6
      2. Oracle Linux 5
   4. Technology Preview Features
   5. Compatibility
   6. Availability
   7. Installation
   8. Known Issues

Introduction

The Unbreakable Enterprise Kernel Release 2 is Oracle's second major release of its heavily tested and optimized operating system kernel for Oracle Linux 5 and Oracle Linux 6. It is based on mainline Linux version 3.0.16. It contains a large number of improvements and new features that have been incorporated into mainline Linux since the first version of the Unbreakable Enterprise Kernel, which was based on Linux 2.6.32.

Note: the actual version number displayed by the kernel and on the RPM packages is 2.6.39. This was done to avoid breaking certain low-level utilities of the Oracle Linux distribution (also known as the "plumbing") that might not cope with the new 3.x version scheme. Regular Linux applications are usually not aware of or affected by Linux kernel version numbers.

New features

Updates/Improvements added by Oracle

This release of the Unbreakable Enterprise Kernel has been improved/enhanced by Oracle in several areas, including bug fixes and extended functionality. All of these modifications have been contributed back upstream and are available in mainline Linux.

Btrfs

Btrfs provides a flexible way to manage storage, without needing a separate volume manager. It provides built-in RAID support and ensures data integrity by using redundancy and checksums.
Btrfs also supports lightweight copies/clones of files or directories with snapshots as well as online data compression.

The Btrfs code in the Unbreakable Enterprise Kernel Release 2 includes many new features as well as numerous performance improvements that were merged from a number of long-running projects and cleanup queues.

New Btrfs features/functionality

* An updated version of btrfsck, a tool to check and repair a Btrfs file system, is now included in the btrfs-progs package. This new btrfsck supports a --repair option that allows fixing errors in the extent allocation tree and block group accounting. btrfsck also provides the option --init-csum-tree, which replaces the checksum root with an empty one. This clears out the CRCs but allows the file system to be mounted with the mount option nodatasum.
* Automatic defragmentation: Btrfs now provides an online defragmentation facility that reorganizes data into contiguous chunks wherever possible to create larger sections of available disk space and improve read and write performance.
* Scrubbing: you can initiate a check of the entire file system by triggering a file system scrub job that is performed in the background. The scrub run scans the entire file system for integrity and automatically attempts to report and repair any bad blocks it finds along the way. Instead of going through the entire disk drive, the scrub run only deals with data that is actually allocated. Depending on the allocated disk space, this is much faster than performing an entire surface scan of the disk.
* LZO compression: In addition to the already existing zlib compression algorithm, data can now be alternatively compressed using LZO, which provides higher compression ratios and faster decompression for certain types of data.
* Read-only snapshots
* Different compression and copy-on-write settings for each file/directory (in addition to the per-filesystem controls).
  Btrfs compression can be controlled on a per file/directory basis. It can be enabled any time after a subvolume has been created. In the default mode, Btrfs flags a file as not compressible when its data does not compress well and will not try to compress its blocks again. In compress-force mode, Btrfs will keep trying for new writes, in case the newly added file content becomes compressible.
* List all subvolumes on a file system (btrfs subvolume list)
* List all files recently modified (btrfs subvolume find-new)
* Allow changing the subvolume to be mounted by default with btrfs subvolume set-default (to better support snapshot-assisted distribution upgrades)
* Direct I/O support
* Introduced mount option nospace_cache
* Allow mounting a subvolume by its path relative to the normal fs_tree root (mount -o subvol=path/to/subvol/you/want)
* Btrfs now records a number of previous tree roots as backups, which can be useful in recovering damaged file systems. If a given mount fails because a tree root is bad, you can now use mount -o recovery and Btrfs will walk through the array of backups and try to mount older versions of the file system.

Btrfs bug fixes and performance improvements

* Asynchronous creation of snapshots, which avoids waiting for the snapshot to be committed to disk.
* Significantly improved ls readdir() performance
* Switched the Btrfs tree locks to reader/writer locks
* Improvements to the logging code. Lots of data was logged more than once, greatly increasing the I/O load. Log I/O traffic has been cut to ~25% of the previous level.
* Allow overcommitting ENOSPC reservations (speeds up a test from 45 minutes to 10 seconds)
* Be smarter about committing the transaction: xfstests 83 goes from taking 445 seconds to taking 28 seconds
* Inode Items operation improves file creation and deletion performance significantly
* Improved reserved space accounting and handling of -ENOSPC (out of disk space) situations
* Dump free space cache on disk to speed up block group caching
* Fixed regressions in the mount and general error handling code, which also fixes some problems in the mount -o autodefrag mode
* Tweaked the ENOSPC throttling. The file system tries to start I/O to make sure it can do all the allocations that it has promised to do. The end result is a dramatic improvement in random write workloads, among many others.
* Improved the scrubber and provided utilities to walk Btrfs' many backrefs. The scrubber is much faster thanks to extensive btree readahead, and instead of just informing the user that a specific block is bad, it reports which btree or which file was impacted by that bad block.
* Fixed the Btrfs cache flushing. This fix probably explains many of the corruptions that have been reported, especially on multi-device file systems. Ceph users running with -o notreelog were dramatically more likely to trigger the corruptions. The problem was that Btrfs was triggering cache flushes before the last copy of the super block, instead of doing them before the first copy. Btrfs now takes extra care to complete flushes on all the devices of a multi-device file system before writing any of the super blocks.
* Fix for tree corruptions when running multi-threaded snapshots with mount -o inode_cache enabled

Xen domU improvements

Several bug fixes and improvements have been incorporated to make the Unbreakable Enterprise Kernel scale and cooperate better as a guest (domU) in Oracle VM and Xen.

* Xen block backend from Linux 3.3 kernel.
  This provides the fully featured Xen blkback along with extra features, such as passing through a flush (a lighter version of barrier) and discard (also known as TRIM or SCSI UNMAP), as well as various bug fixes and enhancements.
* Xen PCI backend from Linux 3.3 kernel. This includes the option to specify how the PCI structure shows up in the PV guest (either as in the host or virtualized), fixes to make it work with SR-IOV VF cards, and numerous mutex fixes.
* Memory self-ballooning - allows the guest to automatically balloon depending on the workload.
* Transcendent memory support for HVM and PV guests
* Tracing API support for Xen MMU operations.
* Syncing the wall-clock time from the initial domain
* Numerous code cleanups and bug fixes (e.g. in the following areas: memory ballooning, blkfront, P2M, E820, IRQ, MMU, Gntalloc driver)

Other improvements

* dm-nfs: device-mapper target that allows you to treat an NFS file as a block device. It provides loopback-style emulation of a block device using a regular file as backing storage. The backing file resides on a remote system and is accessed via the NFS protocol.

Driver Updates

The Unbreakable Enterprise Kernel supports a vast range of hardware and devices. In close cooperation with hardware and storage vendors, several device drivers have been updated by Oracle. The list below only indicates the updated drivers that deviate from the versions included in mainline Linux 3.0.16.

Storage drivers

* Broadcom bnx2i 2.7.0.3
* Broadcom bnx2fc 1.0.4
* Brocade bfa 3.0.2.2
* Emulex be2iscsi 4.1.239.0
* Emulex lpfc 8.3.5.58.2p
* LSI mpt2sas 12.100.00.00
* LSI megaraid_sas 5.40-rc1
* QLogic qla2xxx 8.03.07.12.39.0-k
* QLogic qla4xxx 5.02.00.00.06.02-uek2

Network drivers

* Broadcom bnx2 2.1.11
* Broadcom bnx2x 1.70.00-0
* Broadcom cnic 2.5.7
* Brocade bna 3.0.2.2
* Cisco enic 2.1.1.24
* Emulex be2net 4.1.297o
* Intel e1000e 1.4.4-k
* Intel ixgbevf 2.1.0-k
* Intel igbvf 2.0.0-k
* Intel ixgbe 3.4.8-k
* Mellanox mlx4_en 1.5.4.2
* QLogic netxen_nic 4.0.77
* QLogic qlcnic 5.0.25.1

Other drivers

* Hewlett-Packard hpwdt 1.3.0

Notable improvements in mainline Linux since Linux 2.6.32

This section lists some of the most visible/noteworthy improvements that have taken place in mainline Linux since the Unbreakable Enterprise Kernel Release 1 (which was based on mainline Linux 2.6.32). It is by no means exhaustive or complete, as a full list would exceed the scope of these release notes.

* Transparent Huge Pages: Takes advantage of the memory management capabilities of modern CPUs by allowing memory pages larger than 4 kB (2 MB). Frequently accessed virtual addresses for memory-intensive workloads can be better cached, making page-table walks much faster.
* Memory compaction: Tries to reduce external memory fragmentation in a memory zone by trying to move used pages into a new big block of contiguous pages. This makes it easier to allocate bigger chunks of memory. Testing has shown that the amount of I/O required to satisfy a huge page allocation is reduced significantly.
* VFS scalability: directory cache scaling. The dcache (short for "directory cache", which keeps a cache of directories) and path lookup mechanisms have been reworked to be more scalable. This makes the Virtual File System (VFS) layer more scalable in multi-threaded workloads and also makes some single-threaded workloads considerably faster (due to the removal of atomic CPU operations in the code paths).
  In particular, every application that calls stat() a lot will be faster.
* Transmit Packet Steering (XPS) for multiqueue devices: Spreads outgoing network traffic across CPUs on multiqueue devices. XPS selects a transmit queue during packet transmission based on configuration, by mapping the CPU transmitting the packet to a queue. This is the transmit-side analogue to RPS/RFS (which was already included in Unbreakable Enterprise Kernel Release 1). Whereas RPS selects a CPU based on the receive queue, XPS selects a queue based on the CPU.
* Scheduler performance improvements: the process scheduler is more friendly to workloads that use sched_yield(). This includes any userland implementation of locking (e.g. in Java, databases etc.). Improvements for remote wakeups: when a process on CPU N tries to wake up a process on CPU M, it no longer has to take as many locks to get there.
* TCP: Increased the initial congestion and receive window to 10 packets. Increasing the initial congestion window can reduce user-visible latencies by 10% without creating congestion problems on the network.
* Control Groups (Cgroups) improvements: Implemented a block I/O controller - the CFQ IO scheduler uses it to recognize task groups and to control disk bandwidth allocation to such task groups. Added Cgroups I/O throttling support - the administrator can now set upper read/write limits for a group of processes. Automatic session-based process grouping allows better latency and responsiveness for selected applications.
* OCFS2 improvements: OCFS2, the Oracle Cluster File System, received a number of updates and improvements in mainline Linux. Some of the notable changes include:
    * Global heartbeat: Earlier versions of OCFS2 had a separate heartbeat for each mounted volume, which caused a lot of overhead. This has now been changed to what is called "global heartbeat", where there is only one heartbeat/disk/network for all mounted volumes.
    * Implemented allocation reservations, which reduce fragmentation significantly
    * Optimized hole-punching code, which can significantly speed up some operations
    * Implemented discontiguous block groups
    * Added TRIM support for SSD devices
* ext4 file system: performance/scalability improvements: ext4 now uses the Block I/O layer instead of the buffer layer (which had performance and SMP scalability problems). This speeds up concurrent file system access significantly by reducing CPU utilization. mkfs.ext4 is faster because the inode table initialization is delayed until the first mount. ext4 now also supports "punch hole" functionality.

Updated or added utilities

In order to support the newly added functionality provided by the Unbreakable Enterprise Kernel Release 2, the following RPM packages were added or updated from the ones included in the base distribution and are included in the respective channels/repositories:

Oracle Linux 6

x86_64:

* bfa-firmware
* btrfs-progs
* kernel-uek
* kernel-uek-debug
* kernel-uek-debug-devel
* kernel-uek-devel
* kernel-uek-doc
* kernel-uek-firmware
* lxc
* lxc-devel
* lxc-libs
* ocfs2-tools
* ql2400-firmware
* ql2500-firmware

i386:

* bfa-firmware
* btrfs-progs
* kernel-uek
* kernel-uek-debug
* kernel-uek-debug-devel
* kernel-uek-devel
* kernel-uek-doc
* kernel-uek-firmware
* lxc
* lxc-devel
* lxc-libs
* ocfs2-tools
* ql2400-firmware
* ql2500-firmware

Oracle Linux 5

x86_64:

* bfa-firmware
* btrfs-progs
* kernel-uek
* kernel-uek-debug
* kernel-uek-debug-devel
* kernel-uek-devel
* kernel-uek-doc
* kernel-uek-firmware
* kudzu
* kudzu-devel
* ocfs2-tools
* ql2xxx-firmware

i386:

* bfa-firmware
* btrfs-progs
* kernel-uek
* kernel-uek-debug
* kernel-uek-debug-devel
* kernel-uek-devel
* kernel-uek-doc
* kernel-uek-firmware
* kudzu
* kudzu-devel
* ocfs2-tools
* ql2xxx-firmware

Technology Preview Features

In addition to the features listed above, the Unbreakable Enterprise Kernel Release 2 includes the following features which are still under development, but
are already made available for testing/evaluation purposes.

* Kernel module signing facility: Applies cryptographic signature checking to modules on module load, checking the signature against a ring of public keys compiled into the kernel. GPG is used to do the cryptographic work and determines the format of the signature and key data.
* Linux Containers (lxc): Based on the Linux Cgroups and namespaces functionality, containers allow you to safely and securely run multiple applications or instances of an operating system on a single host without risking them interfering with each other. Containers are lightweight and resource-friendly, which saves both rack space and power. To get started with containers, you need to install the "lxc" package, which is included in the package repository of the Unbreakable Enterprise Kernel.
* Transcendent memory: Transcendent Memory (tmem for short) provides a new approach for improving the utilization of physical memory in a virtualized environment by claiming underutilized memory in a system and making it available where it is most needed. From the perspective of an operating system, tmem is fast pseudo-RAM of indeterminate and varying size that is useful primarily when real RAM is in short supply. To learn more about this technology and its use cases, see the Transcendent Memory project page on oss.oracle.com: http://oss.oracle.com/projects/tmem/
* DTrace: DTrace is a comprehensive dynamic tracing framework that was initially developed for the Oracle Solaris operating system; it is being ported to Linux by Oracle. DTrace provides a powerful infrastructure that permits administrators, developers, and service personnel to concisely answer arbitrary questions about the behavior of the operating system and user programs in real time. DTrace feature previews will be published as a separate set of kernel packages; DTrace is not yet included in the regular Unbreakable Enterprise Kernel distribution.
* DRBD (Distributed Replicated Block Device): A shared-nothing, synchronously replicated block device ("RAID1 over network"), designed to serve as a building block for high availability (HA) clusters. It requires a cluster manager (e.g. pacemaker) for automatic failover.

Compatibility

Oracle Linux maintains user-space compatibility with Red Hat Enterprise Linux, which is independent of the kernel version running underneath the operating system. Existing applications will continue to run unmodified on Unbreakable Enterprise Kernel Release 2, and no re-certifications are needed for RHEL certified applications.

As Unbreakable Enterprise Kernel Release 2 is based on mainline Linux 3.0.16, we expect it to have a different kernel ABI from Unbreakable Enterprise Kernel Release 1, which is based on 2.6.32. The Oracle Linux engineering team works closely with ISVs that develop kernel modules to ensure that kernel interoperability is obtained with Unbreakable Enterprise Kernel Release 2. It is possible that kernel modules will have to be recompiled to interoperate with Unbreakable Enterprise Kernel Release 2. The Oracle Linux team will work closely with the affected kernel module developers to mitigate the impact.

Availability

The Unbreakable Enterprise Kernel is available as binary RPM packages that can be installed from Oracle's public yum repository as well as the Unbreakable Linux Network. The kernel's source code is available via a public git source code repository from http://oss.oracle.com/git/?p=linux-uek-2.6.39.git

Installation

The Unbreakable Enterprise Kernel Release 2 can be installed on Oracle Linux 5 Update 8 or newer, as well as Oracle Linux 6 Update 2 or newer. If you're still running an older version of Oracle Linux, make sure to first update your system to the latest available update release. The Unbreakable Enterprise Kernel Release 2 will be provided via dedicated channels on the Oracle Unbreakable Linux Network and the public yum repositories.
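
The minimum update levels stated above can be captured in a short pre-installation check. The following Python sketch is only an illustration: the function name uek2_supported and the (major, update) calling convention are assumptions made for this example, and how the release numbers are read from a running system is deliberately left out.

```python
# Minimum Oracle Linux update level per major release, as stated above.
MIN_UPDATE = {5: 8, 6: 2}


def uek2_supported(major, update):
    """Return True if Oracle Linux <major> Update <update> meets the stated minimum.

    Hypothetical helper for illustration; releases not named in these
    notes are reported as unsupported.
    """
    minimum = MIN_UPDATE.get(major)
    return minimum is not None and update >= minimum
```

For example, uek2_supported(5, 7) is False, which corresponds to the advice to update the system to the latest update release first.
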
See the "Getting Started with the Unbreakable Enterprise Kernel for Oracle Linux" document on the Oracle Technology Network (http://www.oracle.com/technetwork/articles/servers-storage-admin/uek-rel2-getting-started-1555632.html) for detailed instructions on how to download and install the Unbreakable Enterprise Kernel on Oracle Linux.

Known Issues

* Nouveau kernel driver is not compatible with NVIDIA graphics driver: After upgrading to UEK2, the NVIDIA driver upgrade script doesn't properly blacklist the Nouveau kernel driver. To properly blacklist the driver, append rdblacklist=nouveau nouveau.modeset=0 to the kernel boot parameters in /boot/grub/grub.conf.
* ACPI: On some systems you may see ACPI-related error messages in dmesg similar to these:
      ACPI Error: [CDW1] Namespace lookup failure, AE_NOT_FOUND
      ACPI Error: Method parse/execution failed [\_SB_._OSC]
      ACPI Error: Field [CDW3] at 96 exceeds Buffer [NULL] size 64 (bits)
  These are not fatal and are caused by bugs in the BIOS. Try contacting your system vendor for a BIOS update. (Oracle BUG 13100702)
* ASM: calling the oracleasm init script /etc/init.d/oracleasm with the parameter scandisks may lead to error messages about missing devices similar to the following:
      oracleasm-read-label: Unable to open device "/dev/xvdc1": No such file or directory
  However, the device actually exists. This error message can be ignored; it is triggered by a timing issue. The init script should only be used to start and stop the oracleasm service; all other options like scandisks, listdisks or createdisk are deprecated. For these and other administrative tasks, use the regular binary in /usr/sbin/oracleasm instead. (Oracle BUG 13639337)
* Btrfs: When mounting a Btrfs file system on Oracle Linux 5, you need to explicitly specify the file system type using -t btrfs, otherwise the mount call will fail with the error: mount: you must specify the filesystem type.
  Example: mount -t btrfs /dev/sda /mnt (Oracle BUG 13705319)
* Btrfs: Running btrfs filesystem balance converts a non-RAID/concat file system setup to RAID0 after adding a new device. (Oracle BUG 13715389)
* Btrfs: Converting an existing ext2/3/4 root file system to Btrfs does not carry over the associated security contexts that are stored as part of a file's extended attributes. With SELinux enabled and set to enforcing mode, you may experience a lot of "permission denied" errors after reboot, rendering the system unbootable. To avoid this problem, make sure to enforce an automatic file system relabeling run at bootup time. You can trigger this by creating an empty file named .autorelabel (e.g. by using touch) in the file system's root directory before rebooting the system after the initial conversion. This will instruct SELinux to recreate the security attributes for all files on the file system. In case you forgot to do this and rebooting fails, you can either temporarily disable SELinux completely by adding selinux=0 to the kernel boot parameters, or you can just disable the enforcing of the SELinux policy by adding enforcing=0. (Oracle BUG 13806043)
* CPU microcode update failures on PVM/PVHVM guests: When running Oracle Linux with the Unbreakable Enterprise Kernel Release 2, you might see error messages in dmesg or /var/log/messages similar to this one:
      microcode: CPU0 update to revision 0x6b failed.
  This warning can be ignored, as the microcode for virtual CPUs as presented to the guest does not need to be updated. (Oracle BUG 12576264 and 13782843)
* IO scheduler: The Unbreakable Enterprise Kernel uses the 'deadline' scheduler as the default IO scheduler. For the Red Hat Compatible Kernel, the default IO scheduler is the 'cfq' scheduler.
* libfprint: The following message might appear in dmesg or /var/log/messages:
      WARNING! power/level is deprecated; use power/control instead.
  The USB subsystem in UEKR2 deprecated the "power/level" sysfs attribute in favor of the "power/control" attribute. The "libfprint" fingerprint library triggers this warning via udev rules that try to use the old attribute first. However, the setting of the appropriate power level still succeeds - the warning can be safely ignored. (Oracle BUG 13523418)
* NFS: While NFSv4.1 support and some pNFS functionality are enabled in UEKR2, the current implementation is still considered to be incomplete and should not be used on a production system, as it could result in data loss or system instability.
* sched_yield() settings for CFS: For the Unbreakable Enterprise Kernel, kernel.sched_compat_yield=1 is set by default. For the Red Hat Compatible Kernel, kernel.sched_compat_yield=0 is used by default.
* udev: A message similar to the following (the PID may differ) will show up in dmesg or /var/log/messages during boot:
      udevd (70): /proc/70/oom_adj is deprecated, please use /proc/70/oom_score_adj instead.
  The udev process uses the deprecated oom_adj kernel interface to prevent it from being killed when an OOM occurs. Despite the warning, this action still succeeds. (Oracle BUG 13655071 and 13712009)
* Virtualization: When booting Unbreakable Enterprise Kernel Release 2 as a 32-bit PVHVM guest, the following kernel message can be safely ignored:
      register_vcpu_info failed: err=-38
  (Oracle BUG 13713774)
* Virtualization: Booting Unbreakable Enterprise Kernel Release 2 (both 32-bit and 64-bit) as a paravirtualized (PVM) guest on Oracle VM 3.0 with an ext3/4 root file system may trigger error messages like the following:
      blkfront: barrier: empty write xvda op failed
      blkfront: xvda: barrier or flush: disabled
      end_request: I/O error, dev xvda, sector 39045520
      Aborting journal on device xvda3-8.
      EXT4-fs error (device xvda3): ext4_journal_start_sb:296: Detected aborted journal
      EXT4-fs (xvda3): Remounting filesystem read-only
  At this point, the root file system is not writable and the system bootup aborts. This is due to a change in the Linux kernel where a WRITE_FLUSH/BARRIER is sent with a sector size of 0; the backend then computes the sector incorrectly, concludes that the request lies past the end of the disk, and fails the request. This problem will be addressed in future versions of Oracle VM. To work around this issue, disable write barriers in /etc/fstab of the guest system by adding barrier=0 or nobarrier to the root file system's mount options. (Oracle BUG 13324662)
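
The workaround above amounts to appending one mount option to the root entry in /etc/fstab of the guest. The following Python sketch shows one way to make that change on the text of an fstab file. The function name add_mount_option and the sample entry are assumptions made for this example and are not part of any Oracle tool; the helper also rewrites the fields of the changed entry with tab separators, so review the result before replacing a real /etc/fstab.

```python
def add_mount_option(fstab_text, mount_point, option):
    """Return fstab_text with `option` appended to the options of `mount_point`.

    Hypothetical helper for illustration. Comments, blank lines and all
    other entries are returned unchanged.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # An fstab entry has at least: device, mount point, type, options.
        is_entry = len(fields) >= 4 and not fields[0].startswith("#")
        if is_entry and fields[1] == mount_point:
            options = fields[3].split(",")
            if option not in options:
                options.append(option)
            fields[3] = ",".join(options)
            line = "\t".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Sample entry (assumed); xvda3 matches the device in the messages above.
    sample = "/dev/xvda3 / ext4 defaults 1 1\n"
    print(add_mount_option(sample, "/", "barrier=0"), end="")
```

The helper works on a string rather than on the file itself, so the change can be inspected or written to a copy first. Calling it a second time with the same option leaves the entry unchanged.
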