[Tmem-devel] tmem and KVM
Dan Magenheimer
dan.magenheimer at oracle.com
Mon Jan 19 12:24:15 PST 2009
Excuse my fat fingers... continuing on my reply to myself :-)
> > I think you can get away with using a partial section from the CMM2
> > state transition diagram although I'd have to think more
> > closely about it.
Some still only partially complete thinking... I'll bet the
tmem interface will still meet the needs of, and provide
benefit to, KVM, though tmem_put and tmem_get will probably
result in remappings rather than copies, and tmem_flush may
be a no-op (as there is never a separate copy of the page).
Dan
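
A minimal sketch of the copy-based precache semantics discussed in the
quoted thread below, for readers who want something concrete. Only the
names tmem_put, tmem_get and tmem_flush come from this message; the
classes, the FIFO drop policy in the pool, the non-destructive get, and
the write-through guest are all assumptions made for illustration and
are not the actual tmem implementation. It models the copying variant
only, not the remapping variant suggested above for KVM.

```python
from collections import OrderedDict


class EphemeralPool:
    """Hypothetical hypervisor-side pool; it may drop pages at will."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()

    def tmem_put(self, key, data):
        # Copy the evicted page in; drop oldest pages if over capacity
        # (FIFO is an assumption, the real policy is the hypervisor's).
        self.pages.pop(key, None)
        self.pages[key] = bytes(data)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)

    def tmem_get(self, key):
        # Return a copy of the page, or None if it was never put or was dropped.
        return self.pages.get(key)

    def tmem_flush(self, key):
        # Invalidate a page so a stale copy can never be returned.
        self.pages.pop(key, None)


class Guest:
    """Hypothetical guest with a tiny LRU page cache in front of a disk."""

    def __init__(self, disk, pool, cache_size):
        self.disk = disk
        self.pool = pool
        self.cache_size = cache_size
        self.page_cache = OrderedDict()
        self.disk_reads = 0

    def _insert(self, block, data):
        self.page_cache[block] = data
        while len(self.page_cache) > self.cache_size:
            # The guest, not the hypervisor, picks the victim page.
            victim, victim_data = self.page_cache.popitem(last=False)
            self.pool.tmem_put(victim, victim_data)

    def read(self, block):
        if block in self.page_cache:
            self.page_cache.move_to_end(block)
            return self.page_cache[block]
        data = self.pool.tmem_get(block)  # looks like a very fast disk read
        if data is None:
            data = self.disk[block]
            self.disk_reads += 1
        self._insert(block, data)
        return data

    def write(self, block, data):
        self.disk[block] = data  # write-through, for simplicity
        self.pool.tmem_flush(block)
        self._insert(block, data)


def demo():
    disk = {n: bytes([n]) * 4 for n in range(8)}
    with_pool = Guest(disk, EphemeralPool(capacity=8), cache_size=2)
    without_pool = Guest(disk, EphemeralPool(capacity=0), cache_size=2)
    for guest in (with_pool, without_pool):
        for _ in range(3):
            for block in range(4):
                guest.read(block)
    return with_pool.disk_reads, without_pool.disk_reads
```

In this toy model demo() cycles three times over four blocks with a
two-page guest cache: the guest backed by a pool reads each block from
disk once, while the guest with a zero-capacity pool goes to disk on
every access.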
> -----Original Message-----
> From:
> Sent: Monday, January 19, 2009 1:20 PM
> To: Dan Magenheimer; 'Anthony Liguori'
> Cc: 'tmem-devel at oss.oracle.com'
> Subject: RE: [Tmem-devel] tmem and KVM
>
>
> > But I still suspect you are wrong, unless KVM keeps a table tracking
> > every cached page from the disk to the disk location from where it
> > was obtained. (Does it? I'll assume not...) E.g. KVM reclaims a
>
> Replying to myself...
>
> After turning my brain sideways to think about this
> from a different -- more KVM-ish -- angle, I think I see
> where you are coming from now. Indeed, KVM *does*
> keep a tracking table or, more precisely, the host Linux
> does. The host is reading and mapping the page as part of
> the VHD file, while the guest is reading and mapping the
> page as a file within the VHD. But the page-in-memory
> is the same one, so if KVM decides to remove a page
> (or the host Linux decides to evict a page from the
> page cache), the page *is* recoverable via a disk read.
>
> Is this correct?
>
> My argument about page selection is still valid though
> (I think... more thought needed).
>
> > > Why not just mark non-dirty page cache memory as
> > > reclaimable and
> > > if the guest accesses that memory, deliver a fault to it?
>
>
>
> > -----Original Message-----
> > From: Dan Magenheimer
> > Sent: Friday, January 16, 2009 4:04 PM
> > To: Anthony Liguori
> > Cc: tmem-devel at oss.oracle.com
> > Subject: Re: [Tmem-devel] tmem and KVM
> >
> >
> > > By evicted, you mean that the guest has evicted it
> > > from its page cache?
> >
> > Yes, exactly.
> >
> > > So precache basically becomes a way of bringing
> > > in a page from disk.
> >
> > More precisely, it is a way to avoid reading (some/most)
> > evicted pages back from disk. Its success is ultimately
> > measured by reducing disk reads. When a guest is ballooned
> > down it evicts a lot of pages and many of these are actually
> > needed in the future, which means they must be read multiple
> > times from disk... unless they are put in precache.
> >
> > > I guess the part that hadn't clicked yet was the
> > > fact that
> > > you actually copy from precache into the normal page cache,
> > > instead of
> > > just accessing the precache memory directly.
> >
> > Yes, so it in essence appears to be a very fast disk read.
> >
> > > I guess how CMM2 differs, is that the guest would effectively
> > > access the "precache" directly but if the VMM had to evict
> > > something from precache, when the guest tried to access precached
> > > memory that was evicted, it receives a special page fault. This
> > > page fault tells the guest to bring the page from disk into
> > > memory. But with CMM2, all page cache memory that is not dirty
> > > is effectively, "precache".
> >
> > Partially true, but the critical point is that the guest OS, NOT
> > the hypervisor, has made the decision about which page(s) are lowest
> > priority.
> >
> > If the hypervisor has to decide which pages, it is shooting
> > in the dark without more information provided by the guest.
> > Some such information can be inferred but only by taxonomizing
> > groups of pages, from which the selection is still random.
> > With tmem the guest OS explicitly prioritizes its pages, and
> > eviction is the signal of its decision.
> >
> > > Oh CMM2 is clearly superior :-) The problem is the state bitmap
> > > has to be updated atomically so frequently that you end up
> > > needing hardware support for it. s390 has a special instruction
> > > for it.
> >
> > Actually I meant it is not clear which is superior: page copying
> > or remapping. Xen made a big deal in the first year or two
> > about page-flipping (remapping), but afaik it has gone away,
> > replaced by copying. YMMV.
> >
> > But to tweak you on the point you actually made, I would *hope*
> > that CMM2 is superior. If a company has control over the
> > design of the processor, the hypervisor, the operating system
> > and the I/O, and invests many thousands of man-years into making
> > them work better together, you'd think the parts would work
> > together well. :-) But apparently not so well with a commodity
> > processor, an open source hypervisor, an open source OS, and
> > a bizarre bazaar of drivers. :-)
> >
> > > All KVM memory is reclaimable. If a page in the guest is dirty,
> > > then it may need to be written to disk first before reclaim but
> > > in practice, there should be a fair amount of memory that is
> > > reclaimable without writing to disk.
> >
> > First, see critical point above about selecting which page
> > to reclaim.
> >
> > But I still suspect you are wrong, unless KVM keeps a table tracking
> > every cached page from the disk to the disk location from where it
> > was obtained. (Does it? I'll assume not...) E.g. KVM reclaims a
> > clean page, discards its contents, and somehow marks the page as not
> > present. A moment later the guest attempts a read from that page
> > and KVM gets a trap. Where does it get the contents to reconstitute
> > the page?
> >
> > Even if KVM *does* keep track of every page-cache-to-disk mapping
> > (which some recent research projects are proposing for Xen), it
> > still has to be read from disk... but I guess your point above
> > was "reclaimable without writing to disk" which would still
> > be true.
> >
> > > The one bit of all of this that is intriguing is being able
> > > to mark the
> > > non-dirty page cache memory as reclaimable and providing a
> > > mechanism for
> > > the guest to be aware of the fact that the memory has been
> > > reclaimed.
> > > This would be more valuable to KVM than an explicit copy
> > > interface, I think.
> >
> > Indeed, that's a critical point. Let the guest OS decide *which*
> > pages, and the hypervisor only needs to concern itself with
> > how many pages. But tmem makes it all very explicit in a
> > language all(?) OSes understand.
> >
> > > I question the utility of the proposed interface because it
> > > requires modifying a very large amount of Linux code to use the
> > > optional cache space.
> >
> > Large? For precache, I count 27 lines added to 9 existing files
> > (not counting comments and blank lines). And those lines compile
> > away when not configured and result in only a function call if
> > configured in but the code is running native (not on a hypervisor).
> >
> > > Why not just mark non-dirty page cache memory as
> > > reclaimable and
> > > if the guest accesses that memory, deliver a fault to it?
> >
> > See above. You can't really reclaim it... or else I am missing some
> > special magic about KVM (like page-cache-to-disk mapping maintained
> > for each guest).
> >
> > > I think you can get away with using a partial section from the
> > > CMM2 state transition diagram although I'd have to think more
> > > closely about it.
> >
> > Could be. But I'll bet it won't beat the performance of tmem
> > or the 27 lines. And I'll bet it won't be as generic as tmem.
> >
> > And BTW preswap in my mind is probably more important for system
> > performance than precache, at least in a densely consolidated
> > server. Tmem does both precache and preswap, plus other potential
> > tricks.
> >
> > Thanks, Anthony, for the excellent feedback and discussion! I
> > respect your knowledge and value your input. Please continue
> > especially if I am still misunderstanding KVM.
> >
> > Dan
> >
> > P.S. Off this weekend and Monday.
> >
> > _______________________________________________
> > Tmem-devel mailing list
> > Tmem-devel at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/tmem-devel
> >
>