[Ocfs2-devel] The truncate_inode_pages call in ocfs_file_release causes the severe throughput drop of file reading in OCFS2.
Wim Coekaerts
wim.coekaerts at oracle.com
Tue Jun 22 14:24:28 CDT 2004
yeah - unfortunately we don't have a real dlm :(
On Tue, Jun 22, 2004 at 12:55:10PM -0700, Cahill, Ben M wrote:
> I don't know if it will be helpful, but I'll tell you a bit about
> OpenGFS locking and flushing, etc. You may have something like this
> already, so I'll be brief:
>
> OGFS uses the "g-lock" layer to coordinate inter-node and intra-node
> (inter-process) locking. It provides generic hooks to invoke functions
> when:
>
> Acquiring a lock at inter-node level
> Locking a lock at process level
> Unlocking a lock at process level
> Releasing a lock at inter-node level
>
> The sets of functions are like other "ops" in Linux, a vector of
> functions. Each different type of lock (e.g. inode, journal) has its
> own set of functions (some sets are empty). These functions typically
> flush to disk, read from disk, read or write lock value blocks, etc.
>
> The g-lock layer caches an inter-node lock for 5 minutes after its last
> use within the node. When requested by another node, it will release a
> cached lock immediately if it is not being used within the node. Since
> a "glops" function is invoked when releasing the lock, this caching
> mechanism provides some hysteresis for flushing, etc.
>
> If you're interested in more info, see the rather lengthy ogfs-locking
> (a.k.a. "Locking") doc on opengfs.sourceforge.net/docs.php.
>
> I did some work to extract the g-lock layer out of OGFS back in the
> Fall. You can find the "generic" code in OGFS CVS tree at:
>
> opengfs/src/locking/glock
>
> It's actually fairly compact for what it does.
>
> -- Ben --
>
> > -----Original Message-----
> > From: ocfs2-devel-bounces at oss.oracle.com
> > [mailto:ocfs2-devel-bounces at oss.oracle.com] On Behalf Of Mark Fasheh
> > Sent: Tuesday, June 22, 2004 2:30 PM
> > To: Zhang, Sonic
> > Cc: Ocfs2-Devel
> > Subject: Re: [Ocfs2-devel] The truncate_inode_pages call in
> > ocfs_file_release causes the severe throughput drop of file
> > reading in OCFS2.
> >
> > On Tue, Jun 22, 2004 at 04:57:56PM +0800, Zhang, Sonic wrote:
> > > Hi Wim,
> > >
> > > 	I remember that OCFS only makes sure the metadata is
> > > consistent among the different nodes in the cluster, but it doesn't
> > > care about file data consistency.
> > Actually we use journalling and the inode sequence numbers for
> > metadata consistency. The truncate_inode_pages calls *are* used for
> > data consistency, but you're right in that we only really provide a
> > minimal effort for that (relying mostly on direct I/O in the
> > database case for real consistency).
> >
> > > 	So, I think we don't need to notify all active nodes of every
> > > change to a file. What should be done is to notify only the changes
> > > in the inode metadata of a file, which costs little bandwidth. Why
> > > do you care about file data consistency in your example?
> > Well, we already more or less handle this. Again, I think you're
> > thinking metadata when you want to be thinking data.
> >
> > > 	If OCFS has to ensure file data consistency, the current
> > > truncate_inode_pages() solution also doesn't work. See my sample:
> > >
> > > 1. Node 1 writes block 1 to file 1, flush to disk and keep it open.
> > > 2. Node 2 open file 1, reads block 1 and wait.
> > > 3. Node 1 writes block 1 again with new data. Also flush to disk.
> > > 4. Node 2 reads block 1 again.
> > >
> > > Now, the block 1 data that node 2 gets is not the data on the disk.
> > Yeah, that's probably a hole in our scheme :)
> > --Mark
> >
> > >
> > >
> > >
> > > -----Original Message-----
> > > From: wim.coekaerts at oracle.com [mailto:wim.coekaerts at oracle.com]
> > > Sent: Tuesday, June 22, 2004 4:01 PM
> > > To: Zhang, Sonic
> > > Cc: Ocfs2-Devel; Rusty Lynch; Fu, Michael; Yang, Elton
> > > Subject: Re: [Ocfs2-devel] The truncate_inode_pages call in
> > > ocfs_file_release causes the severe throughput drop of file
> > > reading in OCFS2.
> > >
> > > yeah... it's on purpose for the reason you mentioned: multinode
> > > consistency.
> > >
> > > i was actually considering testing by taking out
> > > truncate_inode_pages; this has been discussed internally for quite
> > > a few months, it's a big nightmare i have nightly ;-)
> > >
> > > the problem is, how can we notify? I think we don't want to notify
> > > every node on every change, otherwise we overload the interconnect,
> > > and we don't have a good consistent map, if I remember Kurt's
> > > explanation correctly.
> > >
> > > this has to be fixed for regular performance for sure, the question
> > > is how do we do this in a good way.
> > >
> > > I'd say, feel free to experiment... just remember that the big
> > > problem is multinode consistency. imagine this:
> > >
> > > I open file /ocfs/foo and read it
> > > all cached
> > > close file, no one on this node has it open
> > >
> > > on node2 I write some data, either O_DIRECT or regular
> > > close or keep it open whichever
> > >
> > > on node1 I now do an md5sum
> > >
> > >
> > >
> > > > development machine. But, if we try to bypass the call to
> > > > truncate_inode_pages(), the file reading throughput in one node
> > > > can reach 1300M bytes/sec, which is about 75% of that of ext3.
> > > >
> > > > 	I think it is not a good idea to drop all cached pages of an
> > > > inode when its last reference is closed. This inode may be
> > > > reopened very soon and its cached pages may be accessed again.
> > > >
> > > > 	I guess your intention in calling truncate_inode_pages() is to
> > > > avoid inconsistency of the metadata if a process on another node
> > > > changes the same inode's metadata on disk before it is reopened
> > > > on this node. Am I right? Do you have more concerns?
> > > >
> > > > 	I think in this case we have 2 options. One is to drop all
> > > > pages of this inode when the receiver thread receives a file
> > > > change notification (rename, delete, move, attributes, etc.). The
> > > > other is to invalidate only the pages that contain the metadata
> > > > of this inode.
> > > >
> > > > What's your opinion?
> > > >
> > > > Thank you.
> > > >
> > > >
> > > > _______________________________________________
> > > > Ocfs2-devel mailing list
> > > > Ocfs2-devel at oss.oracle.com
> > > > http://oss.oracle.com/mailman/listinfo/ocfs2-devel
> > >
> > --
> > Mark Fasheh
> > Software Developer, Oracle Corp
> > mark.fasheh at oracle.com
> >
> >