[Tmem-devel] tmem and KVM

Dan Magenheimer dan.magenheimer at oracle.com
Mon Jan 19 12:19:30 PST 2009


> But I still suspect you are wrong, unless KVM keeps a table mapping
> every cached page back to the disk location from which it was
> obtained.  (Does it? I'll assume not...)  E.g. KVM reclaims a

Replying to myself...

After turning my brain sideways to think about this
from a different -- more KVM-ish -- angle, I think I see
where you are coming from now.  Indeed, KVM *does*
keep a tracking table -- or, more precisely, the host Linux
does.  The host is reading and mapping the page as part of
the VHD file, while the guest is reading and mapping the
page as a file within the VHD.  But the page-in-memory
is the same one, so if KVM decides to remove a page
(or the host Linux decides to evict a page from the
page cache), the page *is* recoverable via a disk read.
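
In host-Linux terms, here is a minimal sketch of why that works
(this is generic page-cache behavior, not actual KVM code): a
clean page-cache page always remembers its backing file and
offset, so the host can discard it and re-read it on demand.

    /* Illustrative only (generic Linux, not KVM code): a clean
     * page-cache page records its origin, so it is recoverable. */
    static void show_origin(struct page *pg)
    {
            struct address_space *m = pg->mapping; /* the VHD file */
            pgoff_t idx = pg->index;  /* offset within that file */
            /* after eviction, read_cache_page(m, idx, ...) can
             * reconstitute the same data with a disk read */
    }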

Is this correct?

My argument about page selection is still valid though
(I think... more thought needed).

> > Why not just mark non-dirty page cache memory as reclaimable
> > and if the guest accesses that memory, deliver a fault to it?

If the above is correct, then this may be more workable than I
assumed when I wrote the reply below.

> -----Original Message-----
> From: Dan Magenheimer 
> Sent: Friday, January 16, 2009 4:04 PM
> To: Anthony Liguori
> Cc: tmem-devel at oss.oracle.com
> Subject: Re: [Tmem-devel] tmem and KVM
> 
> 
> > By evicted, you mean that the guest has evicted it
> > from its page cache?
> 
> Yes, exactly.
> 
> > So precache basically becomes a way of bringing 
> > in a page from disk.
> 
> More precisely, it is a way to avoid reading (some/most)
> evicted pages back from disk.  Its success is ultimately
> measured by reducing disk reads.  When a guest is ballooned
> down it evicts a lot of pages and many of these are actually
> needed in the future, which means they must be read multiple
> times from disk... unless they are put in precache.
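> 
> To make that concrete, here is roughly what the read-side hook
> looks like (a sketch with provisional names from my
> work-in-progress patch; issue_disk_read() is a stand-in for the
> normal block I/O path):
> 
>     /* Sketch: on a page-cache miss, ask the hypervisor's
>      * precache for a copy before issuing a real disk read. */
>     static int read_page(struct address_space *mapping,
>                          pgoff_t index, struct page *page)
>     {
>             if (precache_get(mapping, index, page) == 1)
>                     return 0;  /* hit: filled by copy, no I/O */
>             return issue_disk_read(mapping, index, page);
>     }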
> 
> > I guess the part that hadn't clicked yet was the fact that
> > you actually copy from precache into the normal page cache,
> > instead of just accessing the precache memory directly.
> 
> Yes, so it in essence appears to be a very fast disk read.
>  
> > I guess how CMM2 differs is that the guest would effectively
> > access the "precache" directly, but if the VMM had to evict
> > something from precache and the guest then tried to access
> > that evicted memory, it would receive a special page fault.
> > This page fault tells the guest to bring the page from disk
> > into memory.  But with CMM2, all page cache memory that is
> > not dirty is effectively "precache".
> 
> Partially true, but the critical point is that the guest OS, NOT
> the hypervisor, has made the decision about which page(s) are lowest
> priority.
> 
> If the hypervisor has to decide which pages, it is shooting
> in the dark unless the guest provides more information.
> Some such information can be inferred, but only by taxonomizing
> groups of pages, within which the selection is still essentially
> random.  With tmem the guest OS explicitly prioritizes its
> pages, and eviction is the signal that communicates its decision.
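> 
> In code terms, the guest's eviction path is the decision point.
> A sketch of the put-side hook (provisional names again; the
> hypervisor is free to reject or later drop any put):
> 
>     /* Sketch: when the guest evicts a clean page-cache page,
>      * offer a copy to the hypervisor before forgetting it. */
>     static void evict_clean_page(struct address_space *mapping,
>                                  struct page *page)
>     {
>             precache_put(mapping, page->index, page); /* may fail */
>             remove_from_page_cache(page); /* guest proceeds as usual */
>     }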
> 
> > Oh CMM2 is clearly superior :-)  The problem is the state
> > bitmap has to be updated atomically so frequently that you
> > end up needing hardware support for it.  s390 has a special
> > instruction for it.
> 
> Actually I meant it is not clear which is superior: page copying
> or remapping.  Xen made a big deal in the first year or two
> about page-flipping (remapping), but afaik it has gone away,
> replaced by copying.  YMMV.
> 
> But to tweak you on the point you actually made, I would *hope*
> that CMM2 is superior.  If a company has control over the
> design of the processor, the hypervisor, the operating system
> and the I/O, and invests many thousands of man years into making
> them work better together, you'd think the parts would work
> together well. :-)  But apparently not so well with a commodity
> processor, an open source hypervisor, an open source OS, and
> a bizarre bazaar of drivers. :-)
> 
> > All KVM memory is reclaimable.  If a page in the guest is
> > dirty, then it may need to be written to disk first before
> > reclaim, but in practice there should be a fair amount of
> > memory that is reclaimable without writing to disk.
> 
> First, see the critical point above about selecting which page to reclaim.
> 
> But I still suspect you are wrong, unless KVM keeps a table mapping
> every cached page back to the disk location from which it was
> obtained.  (Does it? I'll assume not...)  E.g. KVM reclaims a
> clean page, discards its contents, and somehow marks the page as not
> present.  A moment later the guest attempts a read from that page
> and KVM gets a trap.  Where does it get the contents to reconstitute
> the page?
> 
> Even if KVM *does* keep track of every page-cache-to-disk mapping
> (which some recent research projects are proposing for Xen), the
> page still has to be read back from disk... but I guess your point
> above was "reclaimable without writing to disk", which would still
> be true.
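> 
> For what it's worth, such a tracking table would have to carry
> something like the following per discarded page (purely
> hypothetical, just to illustrate the bookkeeping burden):
> 
>     /* Hypothetical: what KVM would need to remember in order
>      * to reconstitute a discarded clean guest page on fault. */
>     struct guest_page_origin {
>             u64 guest_pfn;        /* which guest page was dropped */
>             struct file *backing; /* host file, e.g. the VHD */
>             loff_t offset;        /* where in that file */
>     };
>     /* on the resulting fault: look up guest_pfn, read the data
>      * back from (backing, offset), map it, resume the guest */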
>  
> > The one bit of all of this that is intriguing is being able
> > to mark the non-dirty page cache memory as reclaimable and
> > providing a mechanism for the guest to be aware of the fact
> > that the memory has been reclaimed.  This would be more
> > valuable to KVM than an explicit copy interface, I think.
> 
> Indeed, that's a critical point.  Let the guest OS decide *which*
> pages, and the hypervisor only needs to concern itself with
> how many pages.  But tmem makes it all very explicit in a
> language all(?) OS's understand.
> 
> > I question the utility of the proposed interface because it
> > requires modifying a very large amount of Linux code to use
> > the optional cache space.
> 
> Large?  For precache, I count 27 lines added to 9 existing files
> (not counting comments and blank lines).  Those lines compile away
> entirely when not configured, and reduce to a single early-returning
> function call when configured in but running native (not on a
> hypervisor).
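> 
> The "compile away" part is just the usual stub pattern (a
> sketch; the config symbol name is provisional):
> 
>     #ifdef CONFIG_PRECACHE
>     extern int precache_get(struct address_space *, pgoff_t,
>                             struct page *);
>     #else
>     static inline int precache_get(struct address_space *mapping,
>                                    pgoff_t index, struct page *page)
>     {
>             return 0;  /* no hypervisor: compiles to nothing */
>     }
>     #endif
> 
> When configured in but running native, the real precache_get
> just tests a flag and returns immediately.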
> 
> > Why not just mark non-dirty page cache memory as reclaimable
> > and if the guest accesses that memory, deliver a fault to it?
> 
> See above.  You can't really reclaim it... or else I am missing some
> special magic about KVM (like page-cache-to-disk mapping maintained
> for each guest).
>  
> > I think you can get away with using a partial section from the CMM2 
> > state transition diagram although I'd have to think more 
> > closely about it.
> 
> Could be.  But I'll bet it won't beat the performance of tmem
> or the 27 lines.  And I'll bet it won't be as generic as tmem.
> 
> And BTW preswap in my mind is probably more important for system
> performance than precache, at least in a densely consolidated
> server.  Tmem does both precache and preswap, plus other potential
> tricks.
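> 
> Preswap hooks the swap path the same way (a sketch, provisional
> names; a successful put means the page went to the hypervisor
> instead of the swap device):
> 
>     /* Sketch: try the hypervisor before writing to swap. */
>     static int swap_out_page(struct page *page, swp_entry_t entry)
>     {
>             if (preswap_put(page) == 1)
>                     return 0;  /* kept by hypervisor, no disk write */
>             return write_to_swap_device(page, entry);
>     }
>     /* swap-in mirrors this: try preswap_get() first, and read
>      * from the swap device only on a miss */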
> 
> Thanks, Anthony, for the excellent feedback and discussion!  I
> respect your knowledge and value your input.  Please continue
> especially if I am still misunderstanding KVM.
> 
> Dan
> 
> P.S. Off this weekend and Monday.
> 
> _______________________________________________
> Tmem-devel mailing list
> Tmem-devel at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/tmem-devel
>


