Open Source Projects Brought to You By the Linux and Virtualization Development Team
oss.oracle.com has been the home for numerous open source projects started at Oracle, most of them created to improve Linux. Here we highlight some of those projects in chronological order.
When Oracle started looking at Linux as a mission-critical platform for the Oracle Database in early 2001, we took a close look at all the kernel features in Linux and what gaps there were relative to key functionality on Unix platforms. Asynchronous (non-blocking) I/O quickly emerged as a must-have. An actively used database performs large volumes of filesystem and disk I/O, and to process this work efficiently it is critical to be able to submit a large number of I/Os, continue processing other work, and then check which operations have completed.
Asynchronous I/O offers double-digit percentage performance improvements, and while it was available on Unix platforms, it wasn't available on Linux. So we set out to develop it. We started by making sure asynchronous I/O worked well from userspace by creating a library that handled the asynchronous operations. Then we gradually moved the functionality into the Linux kernel to ensure it had the necessary infrastructure for providing asynchronous interfaces to applications. In 2002, we worked with Linux distribution vendors on add-on patches to deliver aio functionality, for example in Red Hat Advanced Server 2.1.
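The kernel interface that came out of this work is a C API (io_setup(), io_submit(), io_getevents(), wrapped by libaio). Purely as an illustration of the submit-and-reap pattern it enables, here is a Python sketch in which a thread pool stands in for the kernel's queue; the function name and structure are ours, not part of any aio API.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def submit_and_reap(path, chunks, workers=4):
    """Submit many writes at once, keep working, then reap completions.

    A thread pool stands in for the kernel's AIO queue: io_submit()
    corresponds to executor.submit(), io_getevents() to as_completed().
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # "io_submit": queue every write without waiting for any of them.
            futures = {pool.submit(os.pwrite, fd, data, off): off
                       for off, data in chunks}
            # ... the caller is free to do other useful work here ...
            # "io_getevents": collect completions as they arrive.
            return {futures[f]: f.result() for f in as_completed(futures)}
    finally:
        os.close(fd)
```

Calling `submit_and_reap(path, [(0, b"aaaa"), (4, b"bbbb")])` queues both writes at once and returns a mapping of offset to bytes written.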
This was a major effort and very important for many, many applications. It caused some heated debates on the Linux mailing lists, both about whether it was needed in the first place and about the design, but in the end Linux ended up with a very scalable set of asynchronous interfaces.
Status: In mainline kernel and commercial Linux distributions
More information: Download archive
In 2002, Oracle made a big commitment to Linux, and around the same time we released Oracle 9i Real Application Clusters (RAC). RAC was a major update to Oracle's clustered database technology to support any type of application in a clustered database environment. Oracle's database clustering depends on shared storage. At the time, managing storage with Linux was rather tedious: there was no consistent device naming, and managing raw disk volumes was difficult for system and database administrators. To help both groups, we started writing a very simple clustered filesystem focused on being a very good, fast alternative to raw devices for Oracle RAC. The first version of OCFS was not a general purpose cluster filesystem: metadata operations (file create, delete) were slow and fully synchronous, and there was no caching.
But OCFS had been created for a specific purpose: a clustered filesystem as an alternative to managing raw devices for the Oracle database. And, considering what it was designed to do, it did an admirable job: I/O operations on files on OCFS were basically as fast as I/O operations directly to the raw disk itself. The filesystem was open source (GPL) and there was nothing to prevent any application with a requirement for very fast direct I/O and a simple filesystem interface from using it. Ultimately, OCFS was not pushed to mainline Linux because it was deemed too limited in functionality.
OCFS was later replaced with the much more general purpose clustered file system, OCFS2.
Status: No longer supported
Download: Code archive
hanic was another project started to improve running Linux clusters. When the project began around 2003, the Linux network interface bonding functionality was limited: while bonding of interfaces worked, failover didn't work well. Enhancements were implemented in user space to improve high availability (HA) behavior in cases where network failures seen by a node in a cluster weren't caused by a switch failure. hanic reduced false failovers and ping-ponging between networks.
Eventually, in November 2002, a patch was submitted to the bonding module in mainline Linux kernel 2.4.20-rc1 to support more than one arp_ip_target, avoiding the confusion that arises when the single reference node goes down and other nodes conclude that their own interfaces are malfunctioning.
Status: In mainline Linux kernel
More information: Project archive
OCFS2 file system development began in 2003 as a follow-up to OCFS. OCFS was targeted exclusively as a data store for Oracle's Real Application Clusters (RAC) database product. The goals for this new project were to maintain the raw-like I/O throughput for the database, be POSIX compliant, and provide near local file system performance for metadata operations. OCFS2 was intended from the start to be included in the mainline Linux kernel; at the time, there were no clustered file systems in the kernel. OCFS2 v1.0 was released in August 2005. Shortly thereafter, it was merged into Andrew Morton's -mm Linux kernel tree, and OCFS2 was merged into the mainline Linux kernel tree in January 2006. The 2.6.16 kernel, released that March, included the file system (see the Oracle announcement from April 2006). Not long after, Novell added support for OCFS2 to SUSE Linux Enterprise.
OCFS2 contains a number of components: heartbeat, node manager, distributed lock manager (DLM), and the file system itself. To ensure successful inclusion in the mainline Linux kernel, we knew we had to be very open in development from the start and use the appropriate development model for Linux functionality. This meant our source tree was public, and during development our code was pushed to filesystem maintainers for review on a regular basis. This eventually enabled us to submit OCFS2 to the Linux kernel mailing list, where it was accepted into the kernel as the first general purpose cluster filesystem. We made metadata operations (create file, move file, rename file, delete file, ls, ...) go really fast, added caching of data, supported both direct and buffered I/O operations, and of course ensured complete filesystem consistency across the cluster.
OCFS2 is still maintained and is used widely in various Oracle products. One such product is Oracle VM, where it's used for virtual machine repositories. Over time we've added features that help us greatly with this, such as reflink, which is now also in other filesystems such as Btrfs. Reflink allows for very easy copy-on-write snapshots of files: perfect for cloning virtual machine disks.
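Reflink is exposed to applications through the FICLONE ioctl. The sketch below is a hedged illustration of that call from Python: on a reflink-capable filesystem such as Btrfs or OCFS2 the clone shares data blocks copy-on-write, while on e.g. ext4 the ioctl fails and the helper reports that instead. The helper name and the exact set of errnos handled are our own.

```python
import errno
import fcntl

FICLONE = 0x40049409  # _IOW(0x94, 9, int), from <linux/fs.h>

def reflink(src_path, dst_path):
    """Create a copy-on-write clone of src at dst via the FICLONE ioctl.

    Returns True on success; returns False when the filesystem does not
    support reflinks (e.g. ext4) or the paths cross filesystems.  On
    success the two files share data blocks until one of them is written.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError as e:
            if e.errno in (errno.EOPNOTSUPP, errno.ENOTTY,
                           errno.EXDEV, errno.EINVAL):
                return False
            raise
```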
OCFS2 is endian-neutral: servers with different hardware architectures (e.g., x86, PowerPC, and SPARC) can be in the same OCFS2 cluster accessing the same filesystem.
OCFS2 has a very interesting virtual filesystem called dlmfs. It enables you to create distributed locks simply by creating files and holding them open in this virtual filesystem. Technically, you could write a cluster-aware program in bash!
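As a sketch of that convention (the domain name is made up, and the real cluster-lock semantics only apply on a mounted dlmfs): a lock is simply a file under /dlm/&lt;domain&gt;, opening it read-write requests an exclusive lock, opening it read-only requests a shared one, and closing the file releases the lock. The base path is parameterized so the shape can be exercised against an ordinary directory on a machine without dlmfs.

```python
import os
from contextlib import contextmanager

@contextmanager
def dlm_lock(name, exclusive=True, base="/dlm/mydomain"):
    """Hold a dlmfs-style lock for the duration of the with-block.

    On a mounted dlmfs, open(O_RDWR) blocks until the exclusive
    cluster lock is granted and open(O_RDONLY) takes a shared lock;
    closing the file descriptor releases it.  On a plain directory
    (as in a test) the calls succeed but carry no cluster semantics.
    """
    flags = os.O_RDWR if exclusive else os.O_RDONLY
    fd = os.open(os.path.join(base, name), flags | os.O_CREAT, 0o600)
    try:
        yield fd       # lock held while the fd stays open
    finally:
        os.close(fd)   # releasing the lock is just close()
```

Because a lock is just an open file, the bash equivalent is roughly `exec 3<>/dlm/mydomain/resource1` to take a lock and `exec 3>&-` to drop it.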
Status: In mainline Linux kernel
hangcheck-timer is a kernel module that is loaded at boot time to monitor for sustained operating system hangs that could affect the operation of a cluster.
In order to build clusters using standard hardware components (a standard network and shared disks), one has to come up with innovative ways to handle node failure. For instance, if a node cannot remotely power off or disable a network or disk interface of another node whose state is in question, the only way to preserve the integrity of the cluster is for the questionable node to sacrifice itself. STONITH, or "Shoot The Other Node In The Head", doesn't work without remote access capabilities. hangcheck-timer will reset a system if a hang or pause exceeds a certain threshold.
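The detection idea fits in a few lines. The sketch below (all names ours) samples a monotonic clock on a fixed tick and counts an interval as a hang when far more time than one tick has elapsed between samples; the real kernel module does this with kernel timers and resets the machine instead of counting.

```python
import time

def watch_for_hangs(ticks, tick=0.05, margin=0.2, work=time.sleep):
    """Detect pauses the way hangcheck-timer does.

    Sample a monotonic clock every `tick` seconds; if the gap between
    two consecutive samples exceeds tick + margin, the interval counts
    as a hang (the node was descheduled or frozen for too long).  The
    `work` parameter lets a test inject an artificial stall.
    """
    hangs = 0
    last = time.monotonic()
    for _ in range(ticks):
        work(tick)              # normally just sleep until the next tick
        now = time.monotonic()
        if now - last > tick + margin:
            hangs += 1          # this interval took far too long
        last = now
    return hangs
```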
We originally developed hangcheck-timer to make sure Oracle Real Application Clusters could be run on Linux, but the technology is not Oracle-specific. The patch for hangcheck-timer was submitted to the 2.4.20 kernel in January of 2003 and accepted in 2.5.60.
Status: In mainline Linux kernel
More information: Code archive
ASMLib was written as a reference implementation for Linux to use Oracle Automatic Storage Management (ASM). Oracle introduced ASM in Oracle Database 10g. ASM, put simply, handles storage and volume management for the database. It was created to eliminate the need for a clustered filesystem, a volume manager, or a clustered volume manager: instead, with ASM you present raw disk devices to the Oracle ASM instance and it takes care of all volume management and device naming, both in single-instance and clustered databases.
We wrote ASMLib for Linux to be a reference implementation for other vendors, but we also made good use of it to hide a long-standing deficiency in Linux device naming: device names were not persistent across reboots. With ASMLib, an admin can tag a volume with an ID in its header, and at boot the Oracle database will find this device even if the device name changed. For example, say a disk was discovered as /dev/sda and used by ASM, and on a subsequent reboot the operating system renamed the device to /dev/sdb. If you referenced disks by device path, the Oracle database would look for /dev/sda and fail. With ASMLib, the admin tags /dev/sda, which stamps the header of the disk and provides that higher-level name to ASM; if on reboot the name of a given disk has changed, ASMLib will still find it.
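The mechanism can be sketched with ordinary files standing in for block devices. The label layout below is invented for illustration and is not the real oracleasm on-disk format:

```python
MAGIC = b"ORCLDISK"  # made-up stamp for this sketch, not the real format

def stamp(device, label):
    """Write a label into the device header (like 'oracleasm createdisk')."""
    with open(device, "r+b") as f:
        f.write(MAGIC + label.ljust(24).encode())

def find_disk(label, candidates):
    """Scan candidate devices and return whichever carries the label,
    regardless of what name the kernel gave the device on this boot."""
    for dev in candidates:
        with open(dev, "rb") as f:
            header = f.read(32)
        if header[:8] == MAGIC and header[8:].decode().strip() == label:
            return dev
    return None
```

A lookup by label succeeds no matter which /dev name the disk received, which is exactly the property the database needs across reboots.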
This is the main use case of ASMLib to date: to provide and ensure consistent device naming, both for single nodes and clusters. In addition, we implemented the I/O interface in ASMLib such that the Oracle database uses /dev/asm for all of its I/Os. This offered minor benefits for overall database resource usage because it reduced the number of open file descriptors.
Later, in 2007, we began work on Linux Data Integrity, with one of the goals being to provide data integrity from database to disk. Adding data integrity features to the Linux kernel and to ASMLib made this possible.
The ASMLib kernel module is open source, but was not submitted to the mainline kernel because its functionality is specific to the Oracle Database.
Status: Maintained and supported
FSCat is a utility for dumping filesystems offline. No kernel driver is needed, just access to the block device and an FSCat driver for the particular filesystem. FSCat can list, archive, and copy out the contents of the filesystem.
The idea was that if we had a disk but there was a problem mounting the filesystem, or, in the case of OCFS2, we were on a node that could not join the cluster, and we still wanted to copy a file off for recovery, debugging, or diagnostics, we could simply do this with FSCat. All we needed was the actual device itself; there was no need to mount the filesystem. The tool was created mainly for debugging and recovery purposes.
FSCat supports OCFS2, EXT2, EXT3, and OCFS filesystems.
Status: No longer maintained.
More information: Code archive.
This project started as an experiment to set up cheap and cheerful shared storage for development and demonstration environments. Setting up proper shared storage, more so in 2003, was often too expensive to justify for experiments or development projects: at a minimum, you'd need shared SCSI or fibre channel to set up a SAN. Firewire disks were becoming popular around 2003, and as Wim Coekaerts was reading the Firewire specification, he discovered that it allowed for multiple initiators (multiple clients connecting to a target disk). His discovery led to a development project to create a kernel module that enabled setting up a Linux cluster with shared storage (e.g., for Oracle RAC) using only Firewire disks.
Another nifty feature we discovered in Firewire was that a Firewire device had full (32-bit) access to the memory of the machine it was attached to. So we created a device that would expose the memory of a remotely connected (over Firewire) machine locally. This way, a debugger such as kdb or kgdb running on a desktop or laptop connected to a server could debug that server remotely, and could even access its memory after the OS on the server had crashed.
The endpoint project turned a machine connected to a Firewire bus into a storage device, so that other systems on the same bus would see it as a Firewire disk. This allowed us to experiment with techniques for improving shared storage.
Status: No longer maintained
Many common cases of data corruption are caused not by bit rot on the physical disk platter but by bugs in the I/O path between application and drive.
Modern filesystems, including Oracle's own Btrfs, implement checksumming so that corrupted data can be detected. This detection occurs when data is read back, however, which can be months after the corrupted data was written, and chances are the good data is lost forever by then. The Data Integrity Initiative aims to prevent corrupted data buffers from being written to disk in the first place.
The Linux data integrity framework is the result of a multi-year effort between Oracle, standards bodies, the Linux community, and many storage vendors, spanning host adapters, disk drives, and solid state storage.
An addition to the SCSI specification called T10 Protection Information (T10-PI, formerly Data Integrity Field or DIF) standardizes the contents of the protection data and allows the extra information to be sent to and received from the host controller as well as verified along the chain of devices. Together with industry partners, Oracle has developed an infrastructure that takes the T10-PI specification a step further, allowing the protection metadata to be exposed to the operating system as well as the application. Martin K. Petersen from Oracle's Linux kernel team first presented his ideas on this topic at the 2007 Linux Storage and Filesystems Workshop.
The Linux data integrity framework enables applications or kernel subsystems to attach protection metadata to I/O operations, allowing devices that support T10-PI to verify the integrity of the data before passing it further down the stack and physically committing it to disk. The first patch for the framework was submitted upstream in April 2008, and the framework was accepted into the 2.6.27 kernel in October 2008 (git commit). In 2012, together with EMC and Emulex, Oracle demonstrated a fully working stack with data integrity verified from the Oracle database to disk and everything in between, using commercially available components. The implementation used changes in the Linux kernel as well as the ASMLib kernel module.
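The protection data standardized by T10-PI is an 8-byte tuple appended to each 512-byte sector: a 16-bit guard tag (a CRC of the sector data using polynomial 0x8BB7), a 16-bit application tag, and a 32-bit reference tag, typically the low bits of the target LBA. A sketch of building that tuple (function names are ours):

```python
import struct

def crc16_t10dif(data):
    """CRC-16/T10-DIF: polynomial 0x8BB7, init 0, no reflection, no xor-out."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7 if crc & 0x8000 else crc << 1) & 0xFFFF
    return crc

def protection_info(sector, lba, app_tag=0):
    """Build the 8-byte PI tuple for one 512-byte sector:
    guard tag (CRC16 of the data), application tag, and reference tag
    (low 32 bits of the LBA), packed big-endian."""
    assert len(sector) == 512
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)
```

Every hop that supports T10-PI can recompute the guard tag and compare the reference tag against the LBA it is about to write, catching corruption before it reaches the platter.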
Status: In mainline kernel and commercial Linux distributions. Development ongoing.
Btrfs is a good example of a contribution to the kernel that was made purely to make Linux a better operating system. The decision to put effort into Btrfs didn't arise from a particular requirement at Oracle, but rather from the realization that Linux filesystems weren't meeting datacenter requirements. Meanwhile, other operating systems offered compelling features in this area (most notably ZFS on Solaris), and the world was moving forward rapidly.
The goal for Btrfs was to be an advanced filesystem for Linux with modern features such as improved snapshots, copy-on-write, compression, checksumming as well as integrated volume management. Further, Btrfs had to make optimal use of increasingly popular solid state storage technology. In June of 2007, Oracle Linux kernel developer Chris Mason sent this announcement to LKML describing early progress he'd made with this new file system. Btrfs first appeared in the 2.6.29 kernel.
Recently, its filesystem-based snapshots have made Btrfs a great asset to Docker and containers in general. Btrfs also helps with creating alternative boot environments, making it possible to install updates while a system is running and easily revert to a previous snapshot of your filesystem. Btrfs is now in the Linux kernel and is very actively maintained by many people all over the globe. It's a complete, modern, general purpose filesystem for Linux.
Status: In mainline Linux kernel and commercial Linux distributions
More information: Btrfs overview, features and benefits on OTN
Reliable Datagram Sockets (RDS) is an effort to provide a socket API which is uniquely suited to the way Oracle does network Inter Process Communication (IPC). The Oracle Linux kernel development team created an open source implementation of the API for the Linux kernel. The code is now integrated into the OpenFabrics Enterprise Distribution (OFED) stack. OFED aims to deliver a unified, cross-platform, transport-independent software stack for Remote Direct Memory Access (RDMA), including a range of standard protocols.
In beta testing, RDS over InfiniBand (IB) provided up to 60 percent performance improvement over Gigabit Ethernet for interconnect-intensive applications. RDS is already used in several Oracle products as well as in Silverstorm's Quicksilver.
What problem does RDS solve? When nodes in an Oracle cluster communicate, it's important that they can do so as quickly as possible. More nodes in a cluster generate more transactions among nodes, and network latency is a key inhibitor in these configurations. There are two types of communication going over the network in an Oracle cluster:
Data blocks are typically 8 KB or 16 KB in size. RDS uses RDMA (zero-copy) I/O over IB so that there's no redundant copying of data: data is placed directly in the process memory of a remote node, reducing the CPU overhead of copying bytes around. RDS is used by Oracle Real Application Clusters and Exadata engineered systems, making it a very important protocol for Oracle. It works over both Ethernet and InfiniBand; while the primary use today is InfiniBand, we are working on making the RDS over TCP transport current and usable as well. RDS is not specific to Oracle: any application can use the protocol for increased network transport performance in certain use cases.
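From the application's point of view, RDS keeps the familiar datagram socket API; the kernel supplies the reliability. A hedged sketch (the helper function is ours; creating the socket fails cleanly on machines where the rds kernel module isn't loaded):

```python
import socket

# AF_RDS is 21 on Linux; newer Pythons export it as socket.AF_RDS.
AF_RDS = getattr(socket, "AF_RDS", 21)

def rds_socket():
    """Try to create an RDS socket.

    RDS keeps the ordinary datagram sendto()/recvfrom() programming
    model, but the kernel guarantees reliable, in-order delivery, so
    the application needs no retransmission logic of its own.
    Returns None when RDS is unavailable on this machine.
    """
    try:
        return socket.socket(AF_RDS, socket.SOCK_SEQPACKET, 0)
    except OSError:
        return None
```

Once bound to a local IP and port with bind(), each sendto() on such a socket is a reliable datagram to the peer's address.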
Status: In mainline kernel
More information:
Tmem or Transcendent Memory started as a research project and eventually turned into a few interesting real world implementations of a new approach to managing physical memory in a virtualized system. Tmem works by claiming underutilized memory in a system and making it available where it is most needed. From the perspective of an operating system, tmem is fast pseudo-RAM of indeterminate and varying size that is useful primarily when real RAM is in short supply.
The project found its origins in researching ways to provide better memory management, memory overcommit, and efficient memory use for VMs on a hypervisor, particularly in cases where you run large, memory-intensive applications that expect specific behaviors, such as an Oracle database.
When you create a virtual machine, it gets assigned a certain amount of virtual memory, and the application running inside (and the OS) believes that it has full access to that memory at all times. However, some virtualization technologies implement memory overcommit: provisioning more memory to VMs than is physically available in the server. When VMs use all the memory provisioned to them in an overcommit scenario, anything above the physical memory capacity of the server is written to disk, similar to how an OS may swap to disk.
The advantage of overcommitting memory is that it enables a higher density of VMs on a server. The disadvantage is that the guest OS does not know it is competing with other VMs for memory that is not available and can run into significant performance bottlenecks.
We have seen this over and over when running databases virtualized on non-Oracle hypervisors: the hypervisor decides to write database buffer cache memory to disk, and when the database needs that memory it has to wait while, behind the scenes of the VM, the pages are brought back in from disk, causing major issues.
With tmem, we were looking for an alternative approach, call it cooperative memory management: a way to share scarce memory resources among VMs without hiding from individual VMs what's really available.
Tmem can offer VMs extra memory to improve their performance as the workload in the VM demands it, but it's memory that each individual VM could live without. This idea eventually turned into an improved balloon driver for the Xen hypervisor: if the hypervisor has extra free memory, a guest VM may access it, but through an interface whose contract makes clear to the VM that this is a cache, extra memory that could go away.
Another application of tmem that we experimented with is RAMster, which aggregates tmem across nodes connected by a very high bandwidth, low latency network such as InfiniBand. The idea was that an OS running on one physical server could temporarily use free memory available on another node on the same network.
Finally, tmem was used in zcache, an in-kernel compressed cache for file pages. Zcache takes active file pages that are in the process of being reclaimed and attempts to compress them into a dynamically allocated RAM-based memory pool. This avoids an I/O when pages in the compressed cache are needed again, which can result in significant performance gains.
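The tmem contract, with zcache-style compression, can be sketched as a toy pool (all names invented): put() is best-effort and may refuse or evict, and get() may miss, so the caller must always be able to fall back to the real I/O path.

```python
import zlib
from collections import OrderedDict

class TmemPool:
    """Toy tmem-style pool with zcache-like compression.

    The contract: put() may refuse a page or evict older ones, and a
    page stored earlier may be gone by the time you get() it, so the
    caller must always be able to re-read the data from disk.
    """
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.pages = OrderedDict()   # key -> compressed page, oldest first

    def put(self, key, page):
        blob = zlib.compress(page)
        if len(blob) > self.capacity:
            return False             # refuse: pool can never hold it
        while self.used + len(blob) > self.capacity:
            _, old = self.pages.popitem(last=False)   # evict the oldest
            self.used -= len(old)
        self.pages[key] = blob
        self.used += len(blob)
        return True

    def get(self, key):
        blob = self.pages.pop(key, None)   # exclusive get, as in tmem
        if blob is None:
            return None              # miss: caller falls back to disk I/O
        self.used -= len(blob)
        return zlib.decompress(blob)
```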
Status: In mainline kernel
More information: Project archive
Xen is a very widely used virtualization technology; the cloud platform that's by most accounts the most widely used, Amazon AWS, is built on Xen. In 2007, Oracle launched a server virtualization product called Oracle VM, which has Xen at its core. When Oracle VM was launched, Xen support in Linux was still in its infancy. Core features were missing, such as the ability to run a standard Linux kernel as Dom0 (the control domain in Xen virtualization) and proper paravirtualization support, in which Linux kernels running as guest domains are optimized to perform well on a Xen virtualization platform.
In 2010, we began laying the groundwork for proper Xen support in the Linux kernel. In 2011, Linux kernel 2.6.37 was the first version able to run as a Xen dom0 with experimental releases of Xen. A major leap forward came with the blkback patches, accepted in kernel 3.0 in July 2011. This article details more of the history of how developers from Citrix, Oracle, and others worked to bring Xen support to the Linux kernel. To further improve performance and stability of Linux on Xen, Oracle submitted patches in December 2013 to support a paravirtualized kernel running in PVH mode; this support became generally available in kernel 3.14 in March 2014.
Status: In mainline Linux kernel
DTrace is a comprehensive dynamic tracing facility that can be used by administrators and developers on live production systems to examine the behavior of the operating system. DTrace enables you to explore your system to understand how it works, track down performance problems, or locate the cause of aberrant behavior. DTrace lets you create your own custom programs to dynamically instrument the system and provide immediate, concise answers to arbitrary questions you can formulate using the DTrace D programming language.
DTrace was originally developed for the Solaris operating system. After Oracle acquired Sun Microsystems, we decided to port DTrace to Linux. The main reason was that Linux tracing and debugging wasn't being adequately addressed: there are many different tools that each address some aspect of tracing, but they are not well documented or organized, and you have to use a number of utilities to find any one thing.
DTrace on Solaris is widely used and very popular, and it provides a consistent, single interface for debugging and tracing what goes on in an operating system. So, to help Linux admins and developers by providing a good, proven tracing, diagnostics, and debugging tool, we decided to make DTrace available on Oracle Linux. The DTrace kernel module code is open source and available under the original CDDL license.
Status: While the Linux DTrace code still lacks some features, it is close to on par with the implementation in Oracle Solaris.
More information: DTrace for Oracle Linux
When Linux vendors publish source code, the convention is to present the kernel source as it is delivered from upstream, with any patches on top of that shipped separately. This way, a consumer of the source can see what makes up a given vendor kernel, including the history of patches relative to an upstream kernel source or to the source of a previous version of a vendor kernel.
With RHEL 6, Red Hat began shipping its kernel source as a single tarball (a monolithic archive). This change made it very difficult to know what individual changes contribute to the difference between two errata kernels; at best, one could diff two source trees to see the overall difference between two kernel versions.
Since Oracle offers Ksplice updates for both Red Hat and Oracle kernels, we already have a complete understanding of each individual patch that's applied, along with its changelog. Since we do this work anyway, we decided to make this knowledge available to anyone, helping those Red Hat customers who want more transparency. The source for Oracle's kernels has always been in public git repositories with complete changelogs (for example, see the git repo for UEK3-3.8 here).
Status: Active, see git repo.