[Tmem-devel] tmem and KVM

Dan Magenheimer dan.magenheimer at oracle.com
Mon Jan 19 12:24:15 PST 2009


Excuse my fat fingers... continuing on my reply to myself :-)

> > I think you can get away with using a partial section from the CMM2 
> > state transition diagram although I'd have to think more 
> > closely about it.

Some still only partially complete thinking... I'll bet the
tmem interface will still meet the needs of, and provide
benefit to, KVM, though tmem_put and tmem_get will probably
result in remappings rather than copies, and tmem_flush may
be a no-op (as there is never a separate copy of the page).
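For concreteness, here's a toy model of the ephemeral precache
semantics I have in mind (hypothetical Python, obviously not the real
hypercall interface; under a remapping KVM design put/get could be
remaps and flush a no-op, as noted above):

```python
# Toy model of tmem ephemeral-precache semantics (hypothetical; the
# real interface is hypercalls from the guest kernel, not Python).

class TmemPool:
    """Ephemeral pool: the hypervisor may silently drop any page."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}          # (object_id, index) -> page contents

    def put(self, key, data):
        """Guest offers an evicted clean page; the hypervisor MAY keep it."""
        if key not in self.pages and len(self.pages) >= self.capacity:
            # Under pressure the hypervisor drops some page; ephemeral
            # semantics let it do so without telling the guest.
            self.pages.pop(next(iter(self.pages)))
        self.pages[key] = data

    def get(self, key):
        """Return the page if still cached, else None (guest falls back
        to a real disk read).  An ephemeral get also consumes the page."""
        return self.pages.pop(key, None)

    def flush(self, key):
        """Guest invalidates a page (e.g. the disk block changed)."""
        self.pages.pop(key, None)
```

The point of the ephemeral contract is that a failed get is always
safe: the guest simply does the disk read it would have done anyway.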

Dan

> -----Original Message-----
> From: 
> Sent: Monday, January 19, 2009 1:20 PM
> To: Dan Magenheimer; 'Anthony Liguori'
> Cc: 'tmem-devel at oss.oracle.com'
> Subject: RE: [Tmem-devel] tmem and KVM
> 
> 
> > But I still suspect you are wrong, unless KVM keeps a table tracking
> > every cached page from the disk to the disk location from where it
> > was obtained.  (Does it? I'll assume not...)  E.g. KVM reclaims a
> 
> Replying to myself...
> 
> After turning my brain sideways to think about this
> from a different -- more KVM-ish -- angle, I think I see
> where you are coming from now.  Indeed, KVM *does*
> keep a tracking table or, more precisely, the host Linux
> does.  The host is reading and mapping the page as part of
> the VHD file, while the guest is reading and mapping the
> page as a file within the VHD.  But the page-in-memory
> is the same one, so if KVM decides to remove a page
> (or the host Linux decides to evict a page from the
> page cache), the page *is* recoverable via a disk read.
> 
> Is this correct?
> 
> My argument about page selection is still valid though
> (I think... more thought needed).
> 
> > > Why not just mark non-dirty page cache memory as reclaimable
> > > and if the guest accesses that memory, deliver a fault to it?
> 
> 
> 
> > -----Original Message-----
> > From: Dan Magenheimer 
> > Sent: Friday, January 16, 2009 4:04 PM
> > To: Anthony Liguori
> > Cc: tmem-devel at oss.oracle.com
> > Subject: Re: [Tmem-devel] tmem and KVM
> > 
> > 
> > > By evicted, you mean that the guest has evicted it
> > > from its page cache?
> > 
> > Yes, exactly.
> > 
> > > So precache basically becomes a way of bringing 
> > > in a page from disk.
> > 
> > More precisely, it is a way to avoid reading (some/most)
> > evicted pages back from disk.  Its success is ultimately
> > measured by reducing disk reads.  When a guest is ballooned
> > down it evicts a lot of pages and many of these are actually
> > needed in the future, which means they must be read multiple
> > times from disk... unless they are put in precache.
> > 
> > > I guess the part that hadn't clicked yet was the fact that you
> > > actually copy from precache into the normal page cache, instead
> > > of just accessing the precache memory directly.
> > 
> > Yes, so in essence it appears to be a very fast disk read.
> >  
> > > I guess how CMM2 differs, is that the guest would effectively
> > > access the "precache" directly but if the VMM had to evict
> > > something from precache, when the guest tried to access
> > > precached memory that was evicted, it receives a special page
> > > fault.  This page fault tells the guest to bring the page from
> > > disk into memory.  But with CMM2, all page cache memory that is
> > > not dirty is effectively, "precache".
> > 
> > Partially true, but the critical point is that the guest OS, NOT
> > the hypervisor, has made the decision about which page(s) are lowest
> > priority.
> > 
> > If the hypervisor has to decide which pages, it is shooting
> > in the dark without more information provided by the guest.
> > Some such information can be inferred but only by taxonomizing
> > groups of pages, from which the selection is still random.
> > With tmem the guest OS explicitly prioritizes its pages and
> > eviction is the signal that conveys its decision.
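To illustrate the selection point: only the guest knows its own LRU
order, so guest-driven eviction surrenders the truly coldest page,
while the hypervisor could only guess (a hypothetical sketch, not
real kernel code):

```python
# Hypothetical illustration: the guest's LRU order identifies the
# lowest-priority page; a hypervisor without that order is "shooting
# in the dark".

from collections import OrderedDict

class GuestPageCache:
    def __init__(self):
        self.lru = OrderedDict()  # least recently used first

    def touch(self, key, data=b""):
        self.lru.pop(key, None)
        self.lru[key] = data      # most recently used at the end

    def evict_lowest_priority(self):
        """Guest-chosen victim: the coldest page, handed to tmem_put."""
        key, _data = self.lru.popitem(last=False)
        return key

cache = GuestPageCache()
for k in ("A", "B", "C"):
    cache.touch(k)
cache.touch("A")                  # "A" becomes the most recently used
victim = cache.evict_lowest_priority()
# victim is "B": the guest, not the hypervisor, made this call
```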
> > 
> > > Oh CMM2 is clearly superior :-)  The problem is the state
> > > bitmap has to be updated atomically so frequently that you end
> > > up needing hardware support for it.  s390 has a special
> > > instruction for it.
> > 
> > Actually I meant it is not clear which is superior: page copying
> > or remapping.  Xen made a big deal in the first year or two
> > about page-flipping (remapping), but afaik it has gone away,
> > replaced by copying.  YMMV.
> > 
> > But to tweak you on the point you actually made, I would *hope*
> > that CMM2 is superior.  If a company has control over the
> > design of the processor, the hypervisor, the operating system
> > and the I/O, and invests many thousands of man years into making
> > them work better together, you'd think the parts would work
> > together well. :-)  But apparently not so well with a commodity
> > processor, an open source hypervisor, an open source OS, and
> > a bizarre bazaar of drivers. :-)
> > 
> > > All KVM memory is reclaimable.  If a page in the guest is
> > > dirty, then it may need to be written to disk first before
> > > reclaim but in practice, there should be a fair amount of
> > > memory that is reclaimable without writing to disk.
> > 
> > First, see the critical point above about selecting which
> > page to reclaim.
> > 
> > But I still suspect you are wrong, unless KVM keeps a table tracking
> > every cached page from the disk to the disk location from where it
> > was obtained.  (Does it? I'll assume not...)  E.g. KVM reclaims a
> > clean page, discards its contents, and somehow marks the page as not
> > present.  A moment later the guest attempts a read from that page
> > and KVM gets a trap.  Where does it get the contents to reconstitute
> > the page?
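To make the objection concrete, here is a toy model (hypothetical
names) of why the trap handler needs a page-to-disk-location table:

```python
# Hypothetical model: discarding a clean guest page is recoverable
# only if a page->disk mapping survives the reclaim.

backing = {}    # guest_pfn -> disk location (e.g. host page-cache info)
memory = {}     # guest_pfn -> page contents

def disk_read(loc):
    return b"block-" + str(loc).encode()

def reclaim(pfn):
    """Discard a clean page's contents and mark it not-present."""
    memory.pop(pfn, None)

def access(pfn):
    """Guest touches the page; the trap handler must reconstitute it."""
    if pfn in memory:
        return memory[pfn]
    loc = backing.get(pfn)
    if loc is None:
        # Without a tracking table there is nowhere to re-read from.
        raise RuntimeError("contents lost: no disk location recorded")
    memory[pfn] = disk_read(loc)   # recoverable, but costs a disk read
    return memory[pfn]
```

Either way the refault costs a disk read, which is the concession made
at the end of the paragraph above.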
> > 
> > Even if KVM *does* keep track of every page-cache-to-disk mapping
> > (which some recent research projects are proposing for Xen), it
> > still has to be read from disk... but I guess your point above
> > was "reclaimable without writing to disk" which would still
> > be true.
> >  
> > > The one bit of all of this that is intriguing is being able to
> > > mark the non-dirty page cache memory as reclaimable and
> > > providing a mechanism for the guest to be aware of the fact
> > > that the memory has been reclaimed.  This would be more
> > > valuable to KVM than an explicit copy interface, I think.
> > 
> > Indeed, that's a critical point.  Let the guest OS decide *which*
> > pages, and the hypervisor only needs to concern itself with
> > how many pages.  But tmem makes it all very explicit in a
> > language all(?) OS's understand.
> > 
> > > I question the utility of the proposed interface because it
> > > requires modifying a very large amount of Linux code to use
> > > the optional cache space.
> > 
> > Large?  For precache, I count 27 lines added to 9 existing files
> > (not counting comments and blank lines).  And those lines compile
> > away when not configured and result in only a function call if
> > configured in but the code is running native (not on a hypervisor).
> > 
> > > Why not just mark non-dirty page cache memory as reclaimable
> > > and if the guest accesses that memory, deliver a fault to it?
> > 
> > See above.  You can't really reclaim it... or else I am missing some
> > special magic about KVM (like page-cache-to-disk mapping maintained
> > for each guest).
> >  
> > > I think you can get away with using a partial section from the
> > > CMM2 state transition diagram although I'd have to think more
> > > closely about it.
> > 
> > Could be.  But I'll bet it won't beat the performance of tmem
> > or the 27 lines.  And I'll bet it won't be as generic as tmem.
> > 
> > And BTW preswap in my mind is probably more important for system
> > performance than precache, at least in a densely consolidated
> > server.  Tmem does both precache and preswap, plus other potential
> > tricks.
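For the record, preswap's contract differs from precache's in one key
way: a put may be refused (forcing a real swap-device write), but an
accepted page must persist until flushed.  A hypothetical sketch:

```python
# Sketch of preswap semantics (hypothetical): unlike the ephemeral
# precache, an accepted page must stay until the guest flushes it,
# but the hypervisor may refuse the put outright.

class Preswap:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = {}           # swap_offset -> page contents

    def put(self, offset, data):
        """Return True if accepted; False means the guest must write
        the page to its real swap device instead."""
        if offset not in self.pages and len(self.pages) >= self.capacity:
            return False          # full: fall back to disk swap
        self.pages[offset] = data
        return True               # accepted: must persist until flush

    def get(self, offset):
        return self.pages.get(offset)   # persistent: get does not consume

    def flush(self, offset):
        self.pages.pop(offset, None)    # guest is done with this slot
```

The persistence requirement is what makes a preswap hit a guaranteed
swap-in without disk I/O, which is why it may matter even more than
precache on a densely consolidated server.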
> > 
> > Thanks, Anthony, for the excellent feedback and discussion!  I
> > respect your knowledge and value your input.  Please continue
> > especially if I am still misunderstanding KVM.
> > 
> > Dan
> > 
> > P.S. Off this weekend and Monday.
> > 
> > _______________________________________________
> > Tmem-devel mailing list
> > Tmem-devel at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/tmem-devel
> > 
>


