TMEM: RAM NOT OWNED BY THE KERNEL?!?

Suppose there is some "special" memory available when memory is low and the paging code needs to evict a lot of pages or when the swapper needs to start swapping pages to disk. And suppose this magic memory, which we will call tmem, is a bit quirky:

  1. Tmem is very fast... not quite as fast as RAM, but far faster than a disk access, so it can be used synchronously
  2. Tmem can't be addressed directly... it is object-oriented and addressed by a handle. It can be accessed only through a set of function calls which copy pages of memory. A put call copies one page of memory from a page frame number in real RAM to tmem, and a get call copies it back again (maybe!), into empty RAM specified by a (probably different) page frame number.
  3. There are (at least) two types of tmem pools: One persistent and one non-persistent (ephemeral). The amount of available tmem in either pool is indeterminate and varies across time. As a result, a put to either type of pool may be rejected. And for an ephemeral pool, a get may fail even if immediately subsequent to a successful put... the put is very likely to work, but persistence is not guaranteed.
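
To make the three quirks above concrete, here is a minimal user-space sketch of a single tmem pool. It is a toy model only, not the real tmem interface: every name in it (toy_pool, toy_handle, toy_put, toy_get, toy_owner_reclaim), the fixed four-slot capacity, and the choice to leave a page in place after a get are assumptions made for illustration.

```c
/* Toy user-space model of one tmem pool. Every name below is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE  4096
#define TOY_POOL_SLOTS 4            /* assumed capacity; real tmem varies over time */

struct toy_handle {
    uint64_t object;                /* e.g. which file or swap area */
    uint32_t index;                 /* page offset within that object */
};

struct toy_slot {
    bool used;
    struct toy_handle handle;
    unsigned char data[TOY_PAGE_SIZE];
};

struct toy_pool {
    bool persistent;                /* false means ephemeral */
    struct toy_slot slots[TOY_POOL_SLOTS];
};

void toy_pool_init(struct toy_pool *pool, bool persistent)
{
    assert(pool != NULL);
    memset(pool, 0, sizeof(*pool));
    pool->persistent = persistent;
}

static struct toy_slot *toy_find(struct toy_pool *pool, struct toy_handle h)
{
    for (int i = 0; i < TOY_POOL_SLOTS; i++) {
        struct toy_slot *s = &pool->slots[i];
        if (s->used && s->handle.object == h.object &&
            s->handle.index == h.index)
            return s;
    }
    return NULL;
}

/* Copy one page into the pool. Returns 0 on success, -1 if rejected. */
int toy_put(struct toy_pool *pool, struct toy_handle h,
            const unsigned char *page)
{
    struct toy_slot *s = toy_find(pool, h);   /* overwrite an existing copy */

    for (int i = 0; s == NULL && i < TOY_POOL_SLOTS; i++)
        if (!pool->slots[i].used)
            s = &pool->slots[i];
    if (s == NULL)
        return -1;                  /* no room: the caller must cope */
    s->used = true;
    s->handle = h;
    memcpy(s->data, page, TOY_PAGE_SIZE);
    return 0;
}

/* Copy one page back out. Returns 0 on success, -1 if the page is gone. */
int toy_get(struct toy_pool *pool, struct toy_handle h, unsigned char *page)
{
    struct toy_slot *s = toy_find(pool, h);

    if (s == NULL)
        return -1;
    memcpy(page, s->data, TOY_PAGE_SIZE);
    return 0;
}

/* The owner takes its memory back: ephemeral pages vanish, persistent stay. */
void toy_owner_reclaim(struct toy_pool *pool)
{
    if (pool->persistent)
        return;
    for (int i = 0; i < TOY_POOL_SLOTS; i++)
        pool->slots[i].used = false;
}
```

In this model a caller never sees an address inside the pool, only a handle and a success or failure code, and it has to be prepared for a put to fail on either kind of pool and for a get to fail on an ephemeral one.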

Tmem, this special quirky type of memory, is real RAM, but is owned not by the kernel but by another entity. Often this entity is a hypervisor, the kernel is running in a virtual environment, and the accesses to tmem are via hypercall. But there may be other forms of tmem too. Tmem's longer name is Transcendent Memory and it is described in more detail at http://oss.oracle.com/projects/tmem.

IF THERE'S EXTRA RAM, WHY NOT JUST GIVE IT TO THE KERNEL AND BE DONE WITH IT?

The kernel doesn't give all of the machine's physical memory to one application, nor does it pre-allocate a fixed amount of physical memory to each application. It has sophisticated algorithms to move memory where it is needed and save some for its own needs, and it also uses some to cache disk pages so files on disk can be accessed more quickly.

In a virtualized system, the hypervisor must similarly balance and move memory between virtual machines. But because the operating systems running in each of those virtual machines have always assumed they have a fixed amount of memory that they can "hoard" as they see fit, the hypervisor's problem is much harder. The extra RAM and the quirkiness of how tmem is used both improve the hypervisor's ability to very quickly give memory to the virtual machines that most need it.

Beyond virtualization, there are other potential benefits to "hiding" extra RAM from the kernel and making it available through a tmem interface, but we'll skip those for now.

SO IF THIS SPECIAL QUIRKY MEMORY EXISTS, HOW CAN LINUX USE IT?

We have prototyped two Linux uses for tmem: precache and preswap. Precache uses a private ephemeral pool and preswap uses a private persistent pool. While much of the real value of tmem is apparent only from outside the Linux kernel, let's look at precache and preswap only from a kernel perspective for now. Those interested in the broader picture may look at the webpage mentioned above.

Precache can be thought of as a page-granularity victim cache for pages that the kernel's page frame replacement algorithm would love to keep around, but there's just not enough memory. So when a page is evicted, it is first put into the precache. And any time a filesystem reads a page from the disk, it first attempts to get the page from the precache. If it is there, there's no need to go to the disk. If it's not there, the filesystem goes to the disk just like normal. A very important note: Since there's no persistence guarantee, only clean pages can/should be put to precache. And there are some complications in ensuring that consistency is maintained between the disk, the precache, and Linux's page cache, but those prove to be manageable via a precache flush call.
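
The precache flow just described can be sketched in user space as follows. This is an illustration under stated assumptions, not the prototype's code: the names (precache_put, precache_get, precache_flush, read_page, disk_read_page), the (inode, index) key, the eight-slot store, the fake disk, and the choice to keep the copy after a successful get are all hypothetical.

```c
/* Toy model of precache as a victim cache. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PC_PAGE_SIZE 4096
#define PC_SLOTS     8              /* assumed capacity of the ephemeral pool */

struct pc_slot {
    bool used;
    unsigned long inode, index;     /* assumed handle: file and page offset */
    unsigned char data[PC_PAGE_SIZE];
};

static struct pc_slot precache[PC_SLOTS];
static unsigned long disk_reads;    /* counts trips to the (fake) disk */

static struct pc_slot *pc_find(unsigned long inode, unsigned long index)
{
    for (int i = 0; i < PC_SLOTS; i++)
        if (precache[i].used && precache[i].inode == inode &&
            precache[i].index == index)
            return &precache[i];
    return NULL;
}

/* Called when a page is evicted. Dirty pages are refused: precache may
 * lose them, and the disk copy would then be stale. */
int precache_put(unsigned long inode, unsigned long index,
                 const unsigned char *page, bool dirty)
{
    struct pc_slot *s;

    assert(page != NULL);
    if (dirty)
        return -1;
    s = pc_find(inode, index);
    for (int i = 0; s == NULL && i < PC_SLOTS; i++)
        if (!precache[i].used)
            s = &precache[i];
    if (s == NULL)
        return -1;                  /* pool full: the page is simply dropped */
    s->used = true;
    s->inode = inode;
    s->index = index;
    memcpy(s->data, page, PC_PAGE_SIZE);
    return 0;
}

/* Returns 0 and fills 'page' on a hit, -1 on a miss. */
int precache_get(unsigned long inode, unsigned long index, unsigned char *page)
{
    struct pc_slot *s = pc_find(inode, index);

    if (s == NULL)
        return -1;
    memcpy(page, s->data, PC_PAGE_SIZE);
    return 0;
}

/* Invalidate a page whose on-disk contents are about to change. */
void precache_flush(unsigned long inode, unsigned long index)
{
    struct pc_slot *s = pc_find(inode, index);

    if (s != NULL)
        s->used = false;
}

/* Stand-in for a real disk read: fills the page with a known pattern. */
static void disk_read_page(unsigned long inode, unsigned long index,
                           unsigned char *page)
{
    memset(page, (int)((inode + index) & 0xff), PC_PAGE_SIZE);
    disk_reads++;
}

/* Read path: try precache first, fall back to the disk on a miss. */
void read_page(unsigned long inode, unsigned long index, unsigned char *page)
{
    if (precache_get(inode, index, page) == 0)
        return;
    disk_read_page(inode, index, page);
}
```

The flush call is what keeps the three copies consistent in this sketch: whenever the on-disk contents of a page are about to change (a write or a truncate, say), the stale precache copy is invalidated so a later read_page cannot return it.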

Preswap is persistent, but for various reasons may not always be available for use. (Without getting into too much detail, in a virtualization environment, if this virtual machine is being "good" and has shared its resources nicely, then it will be able to use preswap, else it will not.) Once a page is put into preswap, a get on the page will always succeed. So when the kernel gets into a situation where it needs to swap out a page, it first attempts to use preswap. If the put works, no disk access is necessary. If it doesn't, the page is written to disk as usual. Unlike precache, whether a page is stored in preswap or swap is recorded in kernel data structures, so when a page needs to be fetched, the kernel does a get if it is in preswap and reads the swap disk if it is not in preswap.
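
The preswap decision can be sketched the same way. Again this is a user-space toy under stated assumptions rather than the prototype itself: the names (swap_out, swap_in, swap_free, preswap_put, preswap_get, preswap_flush), the per-slot in_preswap flag standing in for the kernel's bookkeeping, the fixed preswap_quota, and the in-memory "disk" are all hypothetical.

```c
/* Toy model of preswap in front of a swap disk. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SW_PAGE_SIZE 4096
#define SWAP_SLOTS   8

/* State owned by the tmem side (the hypervisor, say). */
static unsigned char preswap_store[SWAP_SLOTS][SW_PAGE_SIZE];
static bool preswap_present[SWAP_SLOTS];
static unsigned preswap_quota = 2;  /* assumed: pages the owner will accept */
static unsigned preswap_used;

/* State owned by the kernel side. */
static unsigned char swap_disk[SWAP_SLOTS][SW_PAGE_SIZE];
static bool in_preswap[SWAP_SLOTS]; /* where each swapped page lives */
static unsigned long disk_writes, disk_reads;

/* May be rejected; returns 0 on success, -1 otherwise. */
static int preswap_put(unsigned slot, const unsigned char *page)
{
    if (!preswap_present[slot]) {
        if (preswap_used >= preswap_quota)
            return -1;
        preswap_present[slot] = true;
        preswap_used++;
    }
    memcpy(preswap_store[slot], page, SW_PAGE_SIZE);
    return 0;
}

/* Persistent pool: succeeds for every page that was successfully put. */
static int preswap_get(unsigned slot, unsigned char *page)
{
    if (!preswap_present[slot])
        return -1;
    memcpy(page, preswap_store[slot], SW_PAGE_SIZE);
    return 0;
}

static void preswap_flush(unsigned slot)
{
    if (preswap_present[slot]) {
        preswap_present[slot] = false;
        preswap_used--;
    }
}

/* Swap a page out: try preswap first, fall back to the disk. */
void swap_out(unsigned slot, const unsigned char *page)
{
    assert(slot < SWAP_SLOTS);
    if (preswap_put(slot, page) == 0) {
        in_preswap[slot] = true;    /* remember where the page went */
        return;
    }
    in_preswap[slot] = false;
    memcpy(swap_disk[slot], page, SW_PAGE_SIZE);
    disk_writes++;
}

/* Swap a page in from wherever the bookkeeping says it lives. */
void swap_in(unsigned slot, unsigned char *page)
{
    assert(slot < SWAP_SLOTS);
    if (in_preswap[slot]) {
        int rc = preswap_get(slot, page);
        assert(rc == 0);            /* persistence guarantee */
        (void)rc;
        return;
    }
    memcpy(page, swap_disk[slot], SW_PAGE_SIZE);
    disk_reads++;
}

/* Release a swap slot and any preswap copy behind it. */
void swap_free(unsigned slot)
{
    assert(slot < SWAP_SLOTS);
    if (in_preswap[slot])
        preswap_flush(slot);
    in_preswap[slot] = false;
}
```

The contrast with precache shows up in swap_in: because the kernel recorded where each page went, it never has to probe preswap speculatively, and a get on a page marked as being in preswap is expected to succeed.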