OCFS2 Orphan Removal Strategy

Mark Fasheh <mark dot fasheh at oracle dot com>

November 27, 2006

Goals / Requirements

The goal is to remove the current delete_inode vote mechanism. A replacement should:

Perform better in the common case
Give a node a definitive method of ensuring that an in-memory struct inode does not exist anywhere in the cluster
Not require any direct file system messaging

The last requirement in particular will put OCFS2 one step closer to not needing any node information, which will in turn allow us to switch cluster stacks more easily.

Problem Description

We need a method for tracking inodes within the cluster so that when one becomes orphaned, the last node to drop the inode can remove it from the orphan dir. We need not worry about when an inode becomes marked as "orphaned" within the cluster; that task is handled by the dentry lock code as described in Unlink / Rename Strategy (AKA "Dentry vote removal").

Votes

Currently the OCFS2 file system employs a vote mechanism to ask the other nodes whether an orphaned inode can be deleted from the file system.

A vote is a network message which the file system transmits to all nodes that currently have the file system mounted. The delete_inode vote contains two values of importance:

Inode #

Orphaned Slot

Delete inode votes are serialized by holding an exclusive lock on the orphaned inode.

Upon receiving the vote, each node checks a series of conditions on the inode and returns a negative response if any of them is true:

Does anyone on this node still have the inode open? (ip_open_count)
Is it an unlinked directory with an elevated use count?

If the node sending the vote receives a negative response, it does not delete the inode and trusts that another node will handle it.
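The vote check described above can be sketched in plain C. This is only an illustrative userspace model, not the kernel code: the struct, the function names and the BASE_USE_COUNT threshold are assumptions made for the example; only ip_open_count is a name taken from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: a use count above this value counts as "elevated". */
#define BASE_USE_COUNT 1

/* Hypothetical, simplified view of one node's state for the inode. */
struct node_inode_state {
        int  ip_open_count;     /* opens of the inode on this node */
        bool is_unlinked_dir;   /* inode is an unlinked directory */
        int  use_count;         /* in-memory reference count */
};

/* One node's answer to a delete_inode vote: true means "no objection". */
bool node_allows_delete(const struct node_inode_state *st)
{
        assert(st != NULL);
        if (st->ip_open_count > 0)
                return false;   /* someone here still has it open */
        if (st->is_unlinked_dir && st->use_count > BASE_USE_COUNT)
                return false;   /* unlinked directory still in use */
        return true;
}

/* The sender deletes only if no mounted node answers negatively. */
bool vote_passes(const struct node_inode_state *nodes, size_t n)
{
        for (size_t i = 0; i < n; i++)
                if (!node_allows_delete(&nodes[i]))
                        return false;
        return true;
}
```

In this model a real vote would still have to reach every mounted node before vote_passes() could return true, which is the messaging cost the rest of this document tries to remove.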

There are a number of issues with this scheme:

It always has worst-case performance
It requires a messaging layer within the file system, which has a number of its own problems:
1. Messaging requires that userspace push node information (node #, IP address, etc.) up into the file system
2. The existence of any votes requires detailed mount maps to be stored, which in turn requires an additional set of mount/unmount votes.
Inode lifetiming becomes more complicated because we now have an additional thread which might take and drop inode references without having gone through the normal steps (i.e., ->lookup, which holds parent directory locks)

ip_orphaned_slot

ocfs2_delete_inode() has to figure out which orphan directory an orphaned inode belongs to. This is stored in ip_orphaned_slot, which can be set from several places:

The node doing the orphaning will simply set it to its own slot
Other nodes will send back their own idea of it as part of the delete_inode vote response.
1. If the node receiving the vote doesn't know it, but a valid slot # is stored in the vote, it will record that slot then.
During orphan dir recovery, the recovery code will force it to the slot # of the orphan dir being recovered.
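The three ways of setting ip_orphaned_slot listed above can be modelled as follows. This is a sketch under stated assumptions: the SLOT_UNKNOWN sentinel, the toy struct and all function names are invented for the example, and only ip_orphaned_slot comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: sentinel meaning "this node does not know the slot". */
#define SLOT_UNKNOWN (-1)

/* Hypothetical stand-in for the in-memory inode info. */
struct toy_inode_info {
        int ip_orphaned_slot;
};

/* The node doing the orphaning records its own slot. */
void slot_from_local_orphan(struct toy_inode_info *ii, int my_slot)
{
        assert(ii != NULL && my_slot >= 0);
        ii->ip_orphaned_slot = my_slot;
}

/* A vote carrying a valid slot fills in the value only if it is unknown. */
void slot_from_vote(struct toy_inode_info *ii, int vote_slot)
{
        assert(ii != NULL);
        if (ii->ip_orphaned_slot == SLOT_UNKNOWN && vote_slot != SLOT_UNKNOWN)
                ii->ip_orphaned_slot = vote_slot;
}

/* Orphan dir recovery forces the slot of the directory being recovered. */
void slot_from_recovery(struct toy_inode_info *ii, int recovered_slot)
{
        assert(ii != NULL && recovered_slot >= 0);
        ii->ip_orphaned_slot = recovered_slot;
}
```

Note that in this model only the vote path is conditional; the orphaning node and the recovery code both overwrite whatever value was there.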

Open Locks

My proposed solution is to add a lock to each inode, the open lock. The lock is taken at a protected read level in ocfs2_read_locked_inode() after the metadata lock. It is not dropped until ocfs2_clear_inode().

When a process on a node reaches ocfs2_delete_inode(), it can attempt a non-blocking (trylock) upgrade of the open lock to exclusive mode via LKM_NOQUEUE. If the upgrade succeeds, the process knows that the struct inode cannot exist elsewhere in the cluster, and the inode can then be deleted.
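The reasoning behind the trylock can be illustrated with a toy model of a single open lock. It is not the DLM interface: MAX_NODES, the enum and every function name here are assumptions for the example. The point it shows is that an exclusive upgrade can only succeed when no other node still holds the lock at protected read, i.e. when no other node still has the inode in memory.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8     /* assumption: size of the toy cluster */

enum toy_level { TOY_NONE = 0, TOY_PR, TOY_EX };

/* Toy model of one inode's open lock: the level held by each node. */
struct toy_open_lock {
        enum toy_level held[MAX_NODES];
};

/* Models taking the lock at protected read when the inode is read in. */
bool toy_take_pr(struct toy_open_lock *l, int node)
{
        assert(node >= 0 && node < MAX_NODES);
        for (int i = 0; i < MAX_NODES; i++)
                if (i != node && l->held[i] == TOY_EX)
                        return false;   /* a real lock request would wait */
        l->held[node] = TOY_PR;
        return true;
}

/* Models dropping the lock when the in-memory inode is cleared. */
void toy_drop(struct toy_open_lock *l, int node)
{
        assert(node >= 0 && node < MAX_NODES);
        l->held[node] = TOY_NONE;
}

/* Models the non-queued upgrade: fail at once if any other node holds it. */
bool toy_try_upgrade_ex(struct toy_open_lock *l, int node)
{
        assert(node >= 0 && node < MAX_NODES);
        for (int i = 0; i < MAX_NODES; i++)
                if (i != node && l->held[i] != TOY_NONE)
                        return false;
        l->held[node] = TOY_EX;
        return true;
}
```

Unlike the vote, a failed upgrade in this model costs no file system level message; the deleting node simply learns that the inode is still in memory somewhere else.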

By removing the vote, we lose the means of communicating which orphan directory owns the inode. We can rectify this by claiming a 16 bit field in struct ocfs2_dinode which we can populate with slot # at orphan time (see ocfs2_orphan_add()). ocfs2_delete_inode() can read this value off disk in the case that it will be deleting the inode.
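A minimal sketch of the on-disk slot idea, assuming a 16 bit field and an all-ones sentinel for "not orphaned". The struct is a toy stand-in for struct ocfs2_dinode, and the field name, the sentinel and both functions are illustrative assumptions rather than the actual layout or API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: sentinel stored while the inode is not in any orphan dir. */
#define TOY_SLOT_INVALID UINT16_MAX

/* Toy stand-in for the on-disk inode; only the 16 bit width is from the text. */
struct toy_dinode {
        uint64_t blkno;
        uint16_t orphaned_slot;         /* hypothetical field name */
};

/* At orphan time, record which slot's orphan dir now owns the inode. */
void toy_orphan_add(struct toy_dinode *di, uint16_t slot)
{
        assert(di != 0 && slot != TOY_SLOT_INVALID);
        di->orphaned_slot = slot;
}

/* At delete time, read the owning slot back; -1 if none was recorded. */
int toy_orphan_slot(const struct toy_dinode *di)
{
        assert(di != 0);
        if (di->orphaned_slot == TOY_SLOT_INVALID)
                return -1;
        return di->orphaned_slot;
}
```

Because the value lives in the inode itself, any node that ends up deleting the inode can find the right orphan directory without asking the node that did the orphaning.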

Testing

Test case 1

One process deletes a file while another process on the same node has it open.

openfile.c

#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>     /* exit() */

int main(int argc, char * argv[])
{
        int fd;
        char buf[10] = "testdata";      /* 8 bytes of payload, initialized */
        /* Use "fileforrm" unless a file name is given on the command line. */
        const char *path = (argc < 2) ? "fileforrm" : argv[1];

        fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0666);
        if (fd < 0) {
                printf("open file error.\n");
                exit(EXIT_FAILURE);
        }

        /* Hold the file open, writing 8 bytes once a minute, until killed. */
        while (1) {
                if (write(fd, buf, 8) != 8) {
                        printf("write file error.\n");
                        exit(EXIT_FAILURE);
                }
                sleep(60);
        }
}

rmvote_test1.sh

#!/bin/bash
TEST_PATH=/mnt/ocfs2/rmvote
TEST_APP=openfile
TEST_FILE=fileforrm

cd "$TEST_PATH" || exit 1
./"$TEST_APP" &
APPID="$!"
echo "Process #1 is opening the file $TEST_FILE ..."
echo "Process #1 pid: $APPID"

# Give process #1 time to create and open the file before it is removed.
sleep 1

if rm -f "$TEST_FILE"; then
        echo "Process #2 remove $TEST_FILE succeeded!"
else
        echo "Process #2 remove $TEST_FILE failed!"
fi

kill -9 "$APPID" &>/dev/null

if [ -f "$TEST_FILE" ]; then
        echo "remove vote test failed!"
else
        echo "remove vote test succeeded!"
fi

Test case 2

One node deletes a file while another node has it open.

rmvote_test2.sh

#!/bin/bash
TEST_PATH=/mnt/ocfs2/rmvote
TEST_APP=openfile
TEST_FILE=fileforrm
LOCAL_HOST=$(hostname)
REMOTE_HOST=ocfs80

cd "$TEST_PATH" || exit 1
./"$TEST_APP" &
APPID="$!"
echo "Node $LOCAL_HOST is opening the file $TEST_FILE ..."
echo "Process #1 pid: $APPID"

# Give process #1 time to create and open the file before it is removed.
sleep 1

# \$? is escaped so that it expands on the remote host, not locally.
ERRCODE=$(ssh "$REMOTE_HOST" "rm -f $TEST_PATH/$TEST_FILE >/dev/null 2>&1; echo \$?")
if [ "$ERRCODE" = "0" ]; then
        echo "Node $REMOTE_HOST remove $TEST_FILE succeeded!"
else
        echo "Node $REMOTE_HOST remove $TEST_FILE failed!"
fi

kill -9 "$APPID" &>/dev/null

if [ -f "$TEST_FILE" ]; then
        echo "remove vote test failed!"
else
        echo "remove vote test succeeded!"
fi