[Ocfs2-devel] A patch to improve the metadata reading throughput (against svn1267)

Wed Jul 21 13:21:59 CDT 2004

Hi Mark,

	Our objective to send notification messages in routine
ocfs_release_lock() for some lock types listed bellow is to keep the
data consistency in the other nodes.

	As you know, after we change to cacheable metadata and turn off
checkpoint in journal transaction, the metadata in cache may not
consistent among different nodes in some cases. The metadata
inconsistency in cache is only allowed when the access to the metadata
is protected by exclusive locks. So, after metadata is changed on one
node and the exclusive lock is released, the metadata in other nodes
should invalidate their cached copy. This is done through sending a
RELEASE notification to all other nodes when the lock of type
FLAG_FILE_CREATE, FLAG_FILE_DELETE, FLAG_FILE_TRUNCATE and
FLAG_FILE_EXTEND is released.

	We think the RELEASE notification and cached buffer reading
should work together to improve the performance while keep the
consistency.


	Regards


-----Original Message-----
From: Mark Fasheh [mailto:mark.fasheh at oracle.com] 
Sent: Saturday, July 17, 2004 1:24 AM
To: Zhang, Sonic
Cc: Ocfs2-Devel; Fu, Michael; Yang, Elton
Subject: Re: [Ocfs2-devel] A patch to improve the metadata reading
throughput (a gainst svn1267)

Hi,

On Fri, Jul 16, 2004 at 05:59:55PM +0800, Zhang, Sonic wrote:
> Hi,
> 
> 	This is a new patch to increase the metadata reading throughput
> for OCFS2. 
Would you mind regenerating the patch against the latest svn release? 
We've made some changes which you've already gone ahead and rolled into
that
one, and I noticed you're missing at least one bug fix which you want to
pick up in ocfs_populate_inode.

> Our approach is to make the metadata reading operation via
> routine ocfs_read_bh() cacheable and keep consistency through routine
> invalidate_bdev() when receiving lock release notifications, such as
> FLAG_FILE_CREATE, FLAG_FILE_DELETE, FLAG_FILE_TRUNCATE,
> FLAG_FILE_EXTEND, etc. These notification is sent in routine
> ocfs_release_lock(), which is invoked in the journal commit thread.
Yeah, for a while now we've been meaning to go through all calls to
ocfs_read_bh[s] and basically make as many of them cached as possible.
At
this point in our development, things should be cached unless you're
*absolutely* certain you want to sync off disk.

Another thing that's on the list which you might be interested in
looking at
is not sending all lock release messages. Some of them do basically
nothing
on the other end in process_vote, so there's really no reason to send
them
to the nodes at all. This should help alot when you've batched up a ton
of
locks to release in commit_cache. I already started some of this by not
sendning the final delete message in ocfs_delete_inode, though that
one's
special as the lockres accounting information is about to be thrown out
so
we don't have to update it. Others require more work as you at least
have to
decrement the lock_holders variable when you're done with it.

> 	We also did some benchmark.
> 	The directory creating rate is about 4000/s with this patch and
> 1400/s without.
> 	The directory removing rate is about 740/s with this patch and
> 10/s without.
> 
> 	Please refer to the attachment. It is against svn 1267.
> 
> 	This patch is not stable yet. Known issues:
> 1. We still need to check all code which read the metadata. Figure out
> if that metadata is suitable to be cached.
> 3. Current OCFS lock mechanism may cause dead lock in some cases after
> we change to cache metadata and release lock in journal
asynchronously.
> We plan to fix it in a new patch.
So are you planning to turn off immediate checkpointing for all the
other
journal transactions? This is also on the list :) The only one that
*may* be
troublesome I believe is truncate. Otherwise, the ones that are left
are:
link, symlink, and rename.

> 4. readdir() may get old data after the data is written back to disk
in
> journal asynchronously. It is not a bug. But which way is better, sync
> the new data to disk when other nodes notify READONLY message or just
> let them get old data?
No, we consider it a bug :)  The other nodes should be getting up to
date
directory contents.

> 
> 	We are working to improve this patch. Any suggestion?
Sure :) I'll paste your patch below:

Index: src/journal.c
===================================================================

--- src/journal.c	(revision 1267)
+++ src/journal.c	(working copy)
@@ -148,6 +148,8 @@
 	}
 	spin_unlock(&journal->cmt_lock);
 
+	if (osb->needs_flush)
+		ocfs_sync_blockdev(osb->sb);

Is this necessary? It seems awfully heavy, and since we journal *all*
metadata (so it should be synced up to disk via the journal_flush just a
couple lines above that), I don't see the point... I was actually
meaning to
take the other call to sync_blockdev out as it's never used :)

Index: src/dlm.c
===================================================================
--- src/dlm.c	(revision 1267)
+++ src/dlm.c	(working copy)
@@ -1038,9 +1038,7 @@
 	ocfs2_dinode *fe = NULL;
 	struct buffer_head *tmpbh = NULL, **b = NULL;
 	__s16 curr_master;
-	int lockflags =
-		(OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE) ?
-		0 : OCFS_BH_CACHED;
+	int lockflags = OCFS_BH_CACHED;
 	int clear_tmp = 0;
 	ocfs_lock_res *lockres = GET_INODE_LOCKRES(inode);

this is good...

<snip>

Index: src/inode.c
===================================================================
--- src/inode.c	(revision 1267)
+++ src/inode.c	(working copy)
@@ -447,8 +447,6 @@
 		    break;
 	    case S_IFDIR:
 		    if (inode->i_nlink < 2) {
-			    LOG_ERROR_ARGS("inlink=%d for %llu\n",
inode->i_nlink, 
-					   fe->i_blkno);
 			    inode->i_nlink = 2;
 		    }

This is wrong. The latest SVN doesn't include this block as nlink on a
directory can very well be 0 if it's been moved to the orphan dir.

<snip>

Index: src/nm.c
===================================================================
--- src/nm.c	(revision 1267)
+++ src/nm.c	(working copy)

<snip>

@@ -813,8 +814,7 @@
 		}
 	}
 
-	lockflags = (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE)
-	       	? 0 : OCFS_BH_CACHED;
+	lockflags = OCFS_BH_CACHED;

This is good...

<snip>

@@ -910,7 +928,7 @@
 				       inode->i_nlink);
 			/* verify_update_inode does a dirty read which
 			 * might set i_nlink to an old value. */
-			if (inode->i_nlink) {
+			if ( !S_ISDIR(inode->i_mode) && inode->i_nlink)
{
 				LOG_ERROR_ARGS("Orphaned inode %lu has
link "
 					       "count = %d!\n",
inode->i_ino, 
 					       inode->i_nlink);

and if you include the fix to ocfs_populate_inode, this is no longer
necessary :)

<snip>

Index: src/buffer_head_io.c
===================================================================
--- src/buffer_head_io.c	(revision 1267)
+++ src/buffer_head_io.c	(working copy)
@@ -217,13 +217,6 @@
 	    !(OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE))
 		flags |= OCFS_BH_CACHED;
 
-	if (inode && (flags & OCFS_BH_CACHED) &&
-	    (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE)) {
-		LOG_TRACE_STR("hey bozo you are trying to read "
-			      "a system thingy cached!");
-		flags &= ~OCFS_BH_CACHED;
-	}
-
 	sb = osb->sb;
 	blocknum = off >> sb->s_blocksize_bits;

excellent. This and the other places where you did something like:
-       int lockflags =
-               (OCFS_I(inode)->ip_flags & OCFS_INODE_SYSTEM_FILE) ?
-               0 : OCFS_BH_CACHED;
+       int lockflags = OCFS_BH_CACHED;

are exactly what we've needed to do for a while now.
Thanks!
	--Mark

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh at oracle.com