[Ocfs2-devel] Bug in inode deletion code leading to stale inodes

Wed Jan 14 17:59:22 PST 2009

On Mon, Jan 12, 2009 at 11:06:35PM +0100, Jan Kara wrote:
>   I've hit a bug in OCFS2 delete code which results in inodes being left on
> disk without any links to them. The workload triggering this creates
> directories on one node and deletes them on another node in the cluster.
> The inode is not deleted because both nodes bail out from
> ocfs2_delete_inode() with:
> Skipping delete of 100405 because it is in use on other nodes
> 
>   The scenario which I think is happening is as follows:
> 
>   node1					node2
> 					rmdir("d");
> 					  ocfs2_remote_dentry_delete()
>   ocfs2_dentry_convert_worker()
> 					  finishes ocfs2_unlink()
> 					  eventually enters ocfs2_delete_inode()
> 					    ocfs2_inode_lock()
> 					    ocfs2_query_inode_wipe() -> fail
> 					    ocfs2_inode_unlock()
>   ocfs2_dentry_post_unlock()
>     ocfs2_drop_dentry_lock()
>       iput()
>        ocfs2_delete_inode()
>          ocfs2_inode_lock()
> 	 ocfs2_query_inode_wipe() -> fail
> 	 ocfs2_inode_unlock()
>          clear_inode()
> 					    clear_inode()

FWIW, your analysis looks correct to me.

>   The question is how to avoid this. It seems to me that we have to really
> do open_lock() and not just trylock to avoid the race. Is there any reason
> why we cannot move the open_lock() before inode_lock() in
> ocfs2_delete_inode()?

We can't do that because open_lock() will deadlock with nodes still using
the inode.

The way the open lock works is that each node takes a RO on it when the
inode is instantiated. That PR is dropped in ocfs2_clear_inode(). If a node
wants to delete an inode, it will trylock at EX to see if any other nodes
have an active lock.

I think we have to look at how the downconvert thread deals with these
locks. Either we push some parts to a work queue, or we figure out how to
reorder some of the operations.

The problem with pushing off to a work queue, is we run the risk of slowing
down unlink related ops. Pushing the dentry downconvert away will result in
high unlink() latency, while pushing the final iput() away runs the risk of
bouncing the orphan dir lock (if the iput() comes after the deleting nodes
query in ->delete_inode())

Do you think there's a way we could just reverse the order - iput the inode,
BEFORE downconverting the dentry lock?
	--Mark

--
Mark Fasheh