[Tmem-devel] tmem and KVM

Dan Magenheimer dan.magenheimer at oracle.com
Fri Jan 16 15:03:53 PST 2009


> By evicted, you mean that the guest has evicted it
> from its page cache?

Yes, exactly.

> So precache basically becomes a way of bringing 
> in a page from disk.

More precisely, it is a way to avoid reading (some/most)
evicted pages back from disk.  Its success is ultimately
measured by how much it reduces disk reads.  When a guest is
ballooned down it evicts a lot of pages, and many of these are
actually needed again in the future, which means they must be
re-read from disk... unless they are put in precache.

> I guess the part that hadn't clicked yet was the fact that
> you actually copy from precache into the normal page cache,
> instead of just accessing the precache memory directly.

Yes, so in essence it appears to be a very fast disk read.
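
To make the copy concrete, the read path looks roughly like this
(a sketch with simplified names and signatures, not the literal
patch code):

    /* Sketch of the page cache read path with precache. */
    static int read_page(struct address_space *mapping, pgoff_t index,
                         struct page *empty_page)
    {
            /* Ask the hypervisor to copy a previously evicted page
             * back into empty_page; returns 1 on a hit. */
            if (precache_get(mapping, index, empty_page) == 1)
                    return 0;   /* hit: the "very fast disk read" */

            /* Miss: fall back to a real disk read. */
            return mapping->a_ops->readpage(NULL, empty_page);
    }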
 
> I guess how CMM2 differs is that the guest would effectively
> access the "precache" directly, but if the VMM had to evict
> something from precache, then when the guest tried to access
> precached memory that was evicted, it receives a special page
> fault.  This page fault tells the guest to bring the page from
> disk into memory.  But with CMM2, all page cache memory that is
> not dirty is effectively "precache".

Partially true, but the critical point is that the guest OS, NOT
the hypervisor, has made the decision about which page(s) are lowest
priority.

If the hypervisor has to decide which pages, it is shooting
in the dark without more information provided by the guest.
Some such information can be inferred, but only well enough to
sort pages into broad groups, within which the selection is
still essentially random.  With tmem the guest OS explicitly
prioritizes its pages, and the eviction itself is the signal
that conveys its decision.
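
Concretely, the hook sits in the eviction path itself, so the act
of evicting a page is what carries the guest's decision to the
hypervisor (again a sketch, with names simplified):

    /* Sketch: hook in the guest's page cache eviction path.  The
     * guest has already decided this is its lowest-priority page;
     * precache_put() offers a copy to the hypervisor, which is
     * free to keep it or drop it. */
    static void evict_clean_page(struct address_space *mapping,
                                 pgoff_t index, struct page *page)
    {
            precache_put(mapping, index, page);  /* may be a no-op */
            remove_from_page_cache(page);        /* normal eviction */
    }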

> Oh CMM2 is clearly superior :-)  The problem is the state bitmap
> has to be updated atomically so frequently that you end up needing
> hardware support for it.  s390 has a special instruction for it.

Actually I meant it is not clear which is superior: page copying
or remapping.  Xen made a big deal in the first year or two
about page-flipping (remapping), but afaik it has gone away,
replaced by copying.  YMMV.

But to tweak you on the point you actually made, I would *hope*
that CMM2 is superior.  If a company has control over the
design of the processor, the hypervisor, the operating system
and the I/O, and invests many thousands of man years into making
them work better together, you'd think the parts would work
together well. :-)  But apparently not so well with a commodity
processor, an open source hypervisor, an open source OS, and
a bizarre bazaar of drivers. :-)

> All KVM memory is reclaimable.  If a page in the guest is dirty,
> then it may need to be written to disk first before reclaim, but
> in practice there should be a fair amount of memory that is
> reclaimable without writing to disk.

First, see critical point above about selecting which page to reclaim.

But I still suspect you are wrong, unless KVM keeps a table
mapping every page cached from disk back to the disk location
from which it was read.  (Does it?  I'll assume not...)  E.g. KVM
reclaims a clean page, discards its contents, and somehow marks
the page as not present.  A moment later the guest attempts a
read from that page and KVM gets a trap.  Where does it get the
contents to reconstitute the page?

Even if KVM *does* keep track of every page-cache-to-disk mapping
(which some recent research projects are proposing for Xen), the
page still has to be read from disk... but I guess your point
above was "reclaimable without writing to disk", which would
still be true.
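
To spell out what "keeps a table" would require: for each clean
guest page it silently reclaims, the host would need to remember
something like the following so that a later trap could refill the
page from disk (purely hypothetical; this is not an existing KVM
structure):

    /* Hypothetical record (NOT an existing KVM structure): what
     * the host would need per reclaimed clean guest page in order
     * to reconstitute its contents on the next guest access. */
    struct reclaimed_page_origin {
            u64      gfn;    /* guest frame that was reclaimed */
            dev_t    dev;    /* backing device of the contents */
            sector_t sector; /* where on that device they live */
    };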
 
> The one bit of all of this that is intriguing is being able to
> mark the non-dirty page cache memory as reclaimable and providing
> a mechanism for the guest to be aware of the fact that the memory
> has been reclaimed.  This would be more valuable to KVM than an
> explicit copy interface, I think.

Indeed, that's a critical point.  Let the guest OS decide *which*
pages, and the hypervisor only needs to concern itself with
how many pages.  But tmem makes it all very explicit in a
language all(?) OSes understand.
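
That "language" is nothing more than a handle plus copy-based
put/get/flush operations, roughly as follows (simplified from the
tmem interface):

    /* The tmem operations, roughly (simplified).  A page is named
     * by a (pool, object, index) handle; for precache, think
     * pool = one per mounted filesystem, object = inode number,
     * index = page offset within the file. */
    int tmem_new_pool(u32 flags);       /* returns a pool id */
    int tmem_put_page(u32 pool, u64 obj, u32 index, unsigned long pfn);
    int tmem_get_page(u32 pool, u64 obj, u32 index, unsigned long pfn);
    int tmem_flush_page(u32 pool, u64 obj, u32 index);
    int tmem_flush_object(u32 pool, u64 obj);
    int tmem_destroy_pool(u32 pool);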

> I question the utility of the proposed interface because it requires 
> modifying a very large amount of Linux code to use the optional cache 
> space.

Large?  For precache, I count 27 lines added to 9 existing files
(not counting comments and blank lines).  And those lines compile
away entirely when not configured, and cost only a function call
when configured in but running native (not on a hypervisor).
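
The compile-away property is just the usual stub pattern, e.g.
(a sketch; the option name here is illustrative):

    #ifdef CONFIG_PRECACHE
    extern int precache_get(struct address_space *mapping,
                            pgoff_t index, struct page *page);
    extern int precache_put(struct address_space *mapping,
                            pgoff_t index, struct page *page);
    #else
    /* With the option off, the stubs are empty static inlines and
     * the 27 call sites compile away entirely. */
    static inline int precache_get(struct address_space *mapping,
                                   pgoff_t index, struct page *page)
    {
            return 0;       /* always a miss */
    }
    static inline int precache_put(struct address_space *mapping,
                                   pgoff_t index, struct page *page)
    {
            return 0;
    }
    #endif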

> Why not just mark non-dirty page cache memory as reclaimable
> and, if the guest accesses that memory, deliver a fault to it?

See above.  You can't really reclaim it... or else I am missing
some special magic about KVM (like a page-cache-to-disk mapping
maintained for each guest).
 
> I think you can get away with using a partial section from the
> CMM2 state transition diagram, although I'd have to think more
> closely about it.

Could be.  But I'll bet it won't beat the performance of tmem
or the 27 lines.  And I'll bet it won't be as generic as tmem.

And BTW preswap in my mind is probably more important for system
performance than precache, at least in a densely consolidated
server.  Tmem does both precache and preswap, plus other potential
tricks.
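
Preswap hooks the swap path in the same way: offer the page to the
hypervisor on swap-out, and check there first on swap-in.  A sketch
(names simplified; the *_to_disk/*_from_disk helpers just stand in
for the normal swap I/O path):

    /* On swap-out, try to hand the page to hypervisor memory
     * instead of writing it to the swap device. */
    static int swap_out_page(struct page *page)
    {
            if (preswap_put(page) == 1)
                    return 0;       /* absorbed; no disk write */
            return swap_writepage_to_disk(page);
    }

    /* On swap-in, check hypervisor memory before the device. */
    static int swap_in_page(struct page *page)
    {
            if (preswap_get(page) == 1)
                    return 0;       /* refilled without disk I/O */
            return swap_readpage_from_disk(page);
    }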

Thanks, Anthony, for the excellent feedback and discussion!  I
respect your knowledge and value your input.  Please continue
especially if I am still misunderstanding KVM.

Dan

P.S. Off this weekend and Monday.


