OCFS2 Orphan Removal Strategy
Mark Fasheh <mark dot fasheh at oracle dot com>
November 27, 2006
Goals / Requirements
To remove the current delete_inode vote mechanism. A replacement should:
- Perform better in the common case
- Give a node a definitive method of ensuring that a struct inode (memory) does not exist anywhere in the cluster
- Not require any direct file system messaging
The last requirement in particular will put OCFS2 one step closer to not needing any node information, which will in turn allow us to switch cluster stacks more easily.
Problem Description
Basically, what we need is a method for tracking inodes within the cluster so that when one becomes orphaned, the last node to stop using it can remove it from the orphan dir. We need not worry about when an inode becomes marked as "orphaned" within the cluster - that task is handled by the dentry lock code as described in Unlink / Rename Strategy (AKA, "Dentry vote removal").
Votes
Currently, the OCFS2 file system employs a vote mechanism to ask other nodes whether an orphaned inode can be deleted from the file system.
A vote is a network message which the file system transmits to all nodes that are concurrently mounted. The delete_inode vote contains two values of importance:
- Inode #
- Orphaned Slot
Delete inode votes are serialized by holding an exclusive lock on the orphaned inode.
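For illustration, the vote payload can be pictured roughly as follows. This is a sketch only - the field names are hypothetical, not the exact on-wire format:

/* Hypothetical sketch of the delete_inode vote payload - the real
 * on-wire struct differs, but these are the two values that matter. */
struct delete_inode_vote {
	__be64	v_blkno;	 /* inode # under consideration */
	__be32	v_orphaned_slot; /* sender's idea of the owning
				  * orphan dir slot */
};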
Upon receiving the vote, each node tests a series of conditions on the inode, and returns a negative response if any of them fail:
- Does anyone on this node still have the inode open? (ip_open_count)
- Is it an unlinked directory with an elevated use count?
If the node sending the vote receives a negative response, it will not delete the inode, trusting that another node will handle it.
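In rough pseudo-C, the per-node checks described above look like this. It is a sketch of the logic only - the helper name and return values are illustrative, not the actual vote code:

/* Sketch of the receiving node's checks; names are illustrative. */
static int process_delete_vote(struct inode *inode)
{
	struct ocfs2_inode_info *oi = OCFS2_I(inode);

	/* Does anyone on this node still have the inode open? */
	if (oi->ip_open_count)
		return VOTE_NO;

	/* Is it an unlinked directory with an elevated use count
	 * (e.g., some process is still sitting in it)? */
	if (S_ISDIR(inode->i_mode) && atomic_read(&inode->i_count) > 1)
		return VOTE_NO;

	return VOTE_YES;
}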
There are a number of issues with this scheme:
- It always has worst-case performance
- It requires a messaging layer within the file system, which has a number of its own problems
- Messaging requires that userspace push node information (node #, IP address, etc.) up into the file system
- The existence of any votes requires detailed mount maps to be stored, which in turn requires an additional set of mount/unmount votes
- Inode lifetiming becomes more complicated because we now have an additional thread which might take and drop inode references without having gone through the normal steps (i.e., ->lookup, which holds parent directory locks)
ip_orphaned_slot
ocfs2_delete_inode() has to somehow figure out which orphan directory an orphaned inode belongs to. This is stored in ip_orphaned_slot, which can be set from several places:
- The node doing the orphaning will simply set it to its own slot
- Other nodes will send back their own idea of it as part of the delete_inode vote response.
- If the node receiving the vote doesn't know it, but a valid slot # is stored in the vote, it will record it at that point.
- During orphan dir recovery, the recovery code will force it to the slot # of the orphan dir being recovered.
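As a rough sketch of the vote-response cases above (helper name illustrative, not the actual vote code):

/* Sketch: a node receiving a delete_inode vote reconciles its local
 * ip_orphaned_slot with the slot # carried in the vote. */
static void maybe_record_orphaned_slot(struct ocfs2_inode_info *oi,
				       u16 vote_slot)
{
	if (oi->ip_orphaned_slot == OCFS2_INVALID_SLOT &&
	    vote_slot != OCFS2_INVALID_SLOT)
		oi->ip_orphaned_slot = vote_slot;
}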
Open Locks
My proposed solution is to add a lock to each inode: the open lock. The lock is taken at protected read level in ocfs2_read_locked_inode(), after the metadata lock, and is never dropped until ocfs2_clear_inode().
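Sketched against ocfs2_read_locked_inode(), the ordering would be roughly as follows; ocfs2_open_lock() is an assumed new dlmglue helper for this design, not existing code:

	/* ... the metadata lock is acquired first, as today ... */

	/* Take the open lock at protected read (PR). It stays held
	 * until ocfs2_clear_inode() tears the struct inode down. */
	status = ocfs2_open_lock(inode);
	if (status < 0)
		goto bail;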
When a process on a node gets to ocfs2_delete_inode(), it can attempt a non-blocking upgrade of the open lock to exclusive mode (a trylock, via LKM_NOQUEUE). If the upgrade succeeds, then the process knows that the struct inode cannot exist anywhere else in the cluster, and the inode can then be deleted.
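A sketch of the delete-time check; ocfs2_try_open_lock() is an assumed helper wrapping a non-blocking (LKM_NOQUEUE) request for the open lock at exclusive:

	/* Sketch: attempt a non-blocking PR -> EX upgrade of the open
	 * lock from ocfs2_delete_inode(). */
	status = ocfs2_try_open_lock(inode, 1 /* exclusive */);
	if (status == -EAGAIN) {
		/* Another node still holds the lock - a struct inode
		 * exists elsewhere in the cluster, so leave the orphan
		 * for the last closer to reap. */
		goto bail;
	}
	/* Upgrade granted: no other node has this inode in memory, so
	 * it is safe to wipe it and remove the orphan dir entry. */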
By removing the vote, we lose the communication of which orphan directory owns the inode. We can rectify this by claiming a 16-bit field in struct ocfs2_dinode, which we can populate with the slot # at orphan time (see ocfs2_orphan_add()). ocfs2_delete_inode() can then read this value off disk when it determines that it will be deleting the inode.
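On disk, this could look something like the following; the field name is illustrative and its exact placement within struct ocfs2_dinode is to be determined:

/* Sketch: claim 16 bits of reserved space in the on-disk inode. */
struct ocfs2_dinode {
	/* ... */
	__le16	i_orphaned_slot; /* slot # of the owning orphan dir */
	/* ... */
};

/* ocfs2_orphan_add() would then stamp the orphaning node's slot: */
	fe->i_orphaned_slot = cpu_to_le16(osb->slot_num);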
Testing
Test case 1
One process deletes a file while another process holds it open, both on the same node.
- openfile.c
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int fd;
	char buf[10] = "testdata";

	/* Open (creating/truncating) the test file; default to "fileforrm" */
	if (argc < 2)
		fd = open("fileforrm", O_RDWR | O_CREAT | O_TRUNC, 0666);
	else
		fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0666);
	if (fd < 0) {
		printf("open file error.\n");
		exit(-1);
	}

	/* Keep the file open, touching it once a minute */
	while (1) {
		if (write(fd, buf, 8) != 8) {
			printf("write file error.\n");
			exit(-1);
		}
		sleep(60);
	}
}
- rmvote_test1.sh
#!/bin/bash

TEST_PATH=/mnt/ocfs2/rmvote
TEST_APP=openfile
TEST_FILE=fileforrm

cd $TEST_PATH

# Process #1: hold the test file open in the background
./$TEST_APP &
APPID="$!"
echo "Process #1 is opening the file $TEST_FILE ..."
echo "Process #1 pid: $APPID"

# Process #2: remove the file while it is still held open
rm -f $TEST_FILE
if [ $? -eq 0 ]; then
	echo "Process #2 remove $TEST_FILE succeeded!"
else
	echo "Process #2 remove $TEST_FILE failed!"
fi

kill -9 $APPID &>/dev/null

# The file should be gone once nothing holds it open
if [ -f $TEST_FILE ]; then
	echo "remove vote test failed!"
else
	echo "remove vote test succeeded!"
fi
Test case 2
One node deletes a file while another node holds it open.
- rmvote_test2.sh
#!/bin/bash

TEST_PATH=/mnt/ocfs2/rmvote
TEST_APP=openfile
TEST_FILE=fileforrm
LOCAL_HOST=`hostname`
REMOTE_HOST=ocfs80

cd $TEST_PATH

# Node 1: hold the test file open in the background
./$TEST_APP &
APPID="$!"
echo "Node $LOCAL_HOST is opening the file $TEST_FILE ..."
echo "Process #1 pid: $APPID"

# Node 2: remove the file over ssh while node 1 still holds it open.
# Note the escaped \$? so the exit status is evaluated on the remote
# node rather than expanded locally before ssh runs.
ERRCODE=$(ssh $REMOTE_HOST "rm -f $TEST_PATH/$TEST_FILE &>/dev/null; echo \$?")
if [ $ERRCODE -eq 0 ]; then
	echo "Node $REMOTE_HOST remove $TEST_FILE succeeded!"
else
	echo "Node $REMOTE_HOST remove $TEST_FILE failed!"
fi

kill -9 $APPID &>/dev/null

# The file should be gone once nothing holds it open
if [ -f $TEST_FILE ]; then
	echo "remove vote test failed!"
else
	echo "remove vote test succeeded!"
fi