[Ocfs2-devel] [RFC] Why is unlink performance low?

Goldwyn Rodrigues rgoldwyn at suse.de
Sun Jan 5 18:11:19 PST 2014


After a delete, the system thread calls evict_inode, which runs the
following sequence:
ocfs2_evict_inode() -> ocfs2_delete_inode() ->
ocfs2_query_inode_wipe() -> ocfs2_try_open_lock() on d1, which fails
with -EAGAIN. The open trylock fails because, on the remote node, a
PR->EX convert takes longer than a simple EX grant.

This starts a checkpoint, because the OCFS2_INODE_DELETED flag is not set.
The checkpoint, in turn, interferes with the journaling of the inodes
deleted by the unlinks that follow. I had earlier concluded that this
happens only for directories; I was wrong. It happens for files as well.
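
The checkpoint comes from the tail of the eviction path; roughly, from
ocfs2_clear_inode() in fs/ocfs2/inode.c:

	/* Paraphrased: if the wipe did not happen, OCFS2_INODE_DELETED
	 * is not set, so the inode's dirty metadata must be forced out
	 * of the journal before its cluster locks can be dropped. */
	if (!(OCFS2_I(inode)->ip_flags & OCFS2_INODE_DELETED))
		ocfs2_checkpoint_inode(inode);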

The attached patch is *not* correct. I am sending it to show where the
problem lies. I worked this out for a "hypothetical" situation in which
the files created by other nodes are not open on any other node at the
time of deletion. I agree that the open lock should not block during
inode eviction.

The root problem is that the open trylock fails with -EAGAIN even if the
file is not open on any other node of the cluster. We get -EAGAIN because
the lock is on the remote end and the whole locking sequence does not
complete with LKF_NOQUEUE set. Here are some numbers:

Without patch (native) times, m:ss.cc:
-------------------------------------------------
| # files  | create    | copy      | remove    |
-------------------------------------------------
|        1 |   0:00.03 |   0:00.25 |   0:00.94 |
|        2 |   0:00.12 |   0:00.20 |   0:01.12 |
|        4 |   0:00.16 |   0:00.31 |   0:03.50 |
|        8 |   0:00.11 |   0:00.38 |   0:08.15 |
|       16 |   0:00.11 |   0:00.60 |   0:14.64 |
|       32 |   0:00.15 |   0:00.89 |   0:28.04 |
|       64 |   0:00.24 |   0:03.49 |   0:59.96 |
|      128 |   0:00.42 |   0:08.73 |   1:52.14 |
|      256 |   0:01.05 |   0:18.03 |   3:54.81 |
|     1024 |   0:02.74 |   0:44.13 |  14:46.36 |

With patch times, m:ss.cc:
-------------------------------------------------
| # files  | create    | copy      | remove    |
-------------------------------------------------
|        1 |   0:00.02 |   0:00.83 |   0:00.33 |
|        2 |   0:00.04 |   0:00.18 |   0:00.27 |
|        4 |   0:00.07 |   0:00.26 |   0:00.27 |
|        8 |   0:00.08 |   0:00.29 |   0:00.44 |
|       16 |   0:00.10 |   0:00.39 |   0:00.69 |
|       32 |   0:00.14 |   0:00.60 |   0:01.26 |
|       64 |   0:00.23 |   0:01.19 |   0:02.33 |
|      128 |   0:00.51 |   0:02.15 |   0:04.60 |
|      256 |   0:00.87 |   0:04.59 |   0:09.74 |
|     1024 |   0:02.78 |   0:17.64 |   0:37.93 |

The numbers show that the improvement is not limited to unlinks; the
other operations speed up as well, because the journal is no longer
overworked.

I am looking for suggestions on how we can overcome this design issue,
so that try-open locks succeed whenever the file is not open on any
node. The semantics of DLM_LKF_NOQUEUE may be interpreted incorrectly,
or perhaps we are not waiting for the lksb status to be updated; I am
not sure, and some insight into this would be helpful.
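
For context, at the fs/dlm API level a try-open request boils down to
something like the sketch below. This is only a minimal illustration of
the NOQUEUE/lksb semantics, not the actual plumbing (ocfs2 really goes
through the cluster-stack glue in stackglue.c, and the helper names here
are made up): dlm_lock() itself merely queues the request, and the real
verdict only shows up in the AST via lksb->sb_status.

	#include <linux/dlm.h>

	/* AST callback: with DLM_LKF_NOQUEUE, a request that cannot be
	 * granted immediately completes with sb_status == -EAGAIN. */
	static void open_lock_ast(void *astarg)
	{
		struct dlm_lksb *lksb = astarg;

		if (lksb->sb_status == -EAGAIN)
			pr_debug("try-open lock was not granted\n");
	}

	/* Hypothetical helper: request an EX open lock without queueing. */
	static int try_open_lock_ex(dlm_lockspace_t *ls, struct dlm_lksb *lksb,
				    void *name, unsigned int namelen)
	{
		return dlm_lock(ls, DLM_LOCK_EX, lksb, DLM_LKF_NOQUEUE,
				name, namelen, 0, open_lock_ast, lksb, NULL);
	}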

diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
index f2d48c8..eb3baac 100644
--- a/fs/ocfs2/dlmglue.c
+++ b/fs/ocfs2/dlmglue.c
@@ -1681,9 +1681,9 @@ void ocfs2_rw_unlock(struct inode *inode, int write)
 /*
  * ocfs2_open_lock always get PR mode lock.
  */
-int ocfs2_open_lock(struct inode *inode)
+int ocfs2_open_lock(struct inode *inode, int ex)
 {
-	int status = 0;
+	int status = 0, level;
 	struct ocfs2_lock_res *lockres;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
 
@@ -1696,9 +1696,10 @@ int ocfs2_open_lock(struct inode *inode)
 		goto out;
 
 	lockres = &OCFS2_I(inode)->ip_open_lockres;
+	level = ex ? DLM_LOCK_EX : DLM_LOCK_PR;
 
 	status = ocfs2_cluster_lock(OCFS2_SB(inode->i_sb), lockres,
-				    DLM_LOCK_PR, 0, 0);
+				    level, 0, 0);
 	if (status < 0)
 		mlog_errno(status);
 
diff --git a/fs/ocfs2/dlmglue.h b/fs/ocfs2/dlmglue.h
index 1d596d8..12766a1 100644
--- a/fs/ocfs2/dlmglue.h
+++ b/fs/ocfs2/dlmglue.h
@@ -110,7 +110,7 @@ int ocfs2_create_new_inode_locks(struct inode *inode);
 int ocfs2_drop_inode_locks(struct inode *inode);
 int ocfs2_rw_lock(struct inode *inode, int write);
 void ocfs2_rw_unlock(struct inode *inode, int write);
-int ocfs2_open_lock(struct inode *inode);
+int ocfs2_open_lock(struct inode *inode, int ex);
 int ocfs2_try_open_lock(struct inode *inode, int write);
 void ocfs2_open_unlock(struct inode *inode);
 int ocfs2_inode_lock_atime(struct inode *inode,
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index f87f9bd..792dba7 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -454,7 +454,7 @@ static int ocfs2_read_locked_inode(struct inode *inode,
 				  0, inode);
 
 	if (can_lock) {
-		status = ocfs2_open_lock(inode);
+		status = ocfs2_open_lock(inode, 0);
 		if (status) {
 			make_bad_inode(inode);
 			mlog_errno(status);
@@ -922,7 +922,7 @@ static int ocfs2_query_inode_wipe(struct inode *inode,
 	 * Though we call this with the meta data lock held, the
 	 * trylock keeps us from ABBA deadlock.
 	 */
-	status = ocfs2_try_open_lock(inode, 1);
+	status = ocfs2_open_lock(inode, 1);
 	if (status == -EAGAIN) {
 		status = 0;
 		reason = 3;
@@ -997,6 +997,7 @@ static void ocfs2_delete_inode(struct inode *inode)
 		ocfs2_cleanup_delete_inode(inode, 0);
 		goto bail_unblock;
 	}
+
 	/* Lock down the inode. This gives us an up to date view of
 	 * it's metadata (for verification), and allows us to
 	 * serialize delete_inode on multiple nodes.
diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index be3f867..ac67f2d 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -2307,7 +2307,7 @@ int ocfs2_create_inode_in_orphan(struct inode *dir,
 	}
 
 	/* get open lock so that only nodes can't remove it from orphan dir. */
-	status = ocfs2_open_lock(inode);
+	status = ocfs2_open_lock(inode, 0);
 	if (status < 0)
 		mlog_errno(status);
 

-- 
Goldwyn


