[Ocfs2-devel] Adding Posix locking to OCFS2
Daniel Phillips
phillips at istop.com
Fri Aug 5 15:39:17 CDT 2005
Hi guys,
I'm interested in pulling on the oars to help get the rest of the way to
Posix compliance. On the "talk is cheap" principle, I'll start with some
cheap talk.
First, you don't have to add range locks to your DLM just to handle Posix
locks. Posix locking can be handled by a completely independent and crude
mechanism. You can take your time merging this with your DLM, or switching
to another DLM if that's what you really want to do.
In the long run, the DLM has to have range locking in order to implement
efficient, sub-file granularity caching, but the VFS still needs work to
support this properly anyway, so there is no immediate urgency. And Posix
locking is even less of a reason to panic.
Now some comments on how the hookable Posix locking VFS interface works.
First off, it is not a thing of beauty. The principle is simple: by
default, the VFS maintains a simple-minded range-lock accounting structure.
It is just a linear list of range locks per inode. When somebody wants a new
lock, the VFS searches the whole list to find collisions. If a lock needs to
be split, merged or whatever, these are very basic list operations. The code
is short. It is obviously prone to efficiency bugs, but I have not heard
people complaining.
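To make that concrete, here is a minimal user-space sketch of that linear
list and the collision scan. The names and shapes (plock, find_collision)
are my own illustration, not the kernel's actual identifiers, but the
logic is the same: walk the whole list, and report a conflict when ranges
overlap, owners differ, and at least one side is a write lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the per-inode linear lock list. */
enum lock_type { RD_LOCK, WR_LOCK };

struct plock {
	long start, end;	/* inclusive byte range */
	enum lock_type type;
	int owner;		/* stand-in for the lock owner */
	struct plock *next;
};

/* Two locks collide when their ranges overlap, the owners differ,
 * and at least one of them is a write lock. */
static bool plock_conflict(const struct plock *a, const struct plock *b)
{
	if (a->owner == b->owner)
		return false;
	if (a->end < b->start || b->end < a->start)
		return false;
	return a->type == WR_LOCK || b->type == WR_LOCK;
}

/* Linear scan of the whole list, as described above. */
static const struct plock *find_collision(const struct plock *list,
					  const struct plock *request)
{
	for (; list; list = list->next)
		if (plock_conflict(list, request))
			return list;
	return NULL;
}
```

Splitting and merging on unlock are just more list surgery over the same
structure, which is why the generic code stays so short.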
The hook works by simply short-circuiting into your filesystem just before the
VFS touches any of its own lock accounting data. Your filesystem gets the
original fcntl arguments and you get to replicate a bunch of work that the
VFS would have done, using its own data structure. There are about five
short-circuits like this. See what I mean by not beautiful? But it is fairly
obvious how to use this interface.
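The dispatch itself is trivial. Here is a toy user-space model of it: if
the filesystem supplies a ->lock method, the VFS jumps into it before
touching its own accounting, otherwise the generic list code runs. The
struct layouts and function names below are simplified stand-ins, not the
kernel's real definitions; only the f_op->lock test mirrors fs/locks.c.

```c
#include <assert.h>
#include <stddef.h>

struct file_lock { long start, end; int cmd; };
struct file;

struct file_operations {
	int (*lock)(struct file *filp, int cmd, struct file_lock *fl);
};

struct file {
	const struct file_operations *f_op;
};

static int generic_calls, cluster_calls;

/* The default path: would walk the per-inode linear list. */
static int generic_posix_lock(struct file *filp, int cmd,
			      struct file_lock *fl)
{
	(void)filp; (void)cmd; (void)fl;
	generic_calls++;
	return 0;
}

/* What a cluster filesystem would hook in: would talk to the
 * Posix lock server instead of the local list. */
static int cluster_posix_lock(struct file *filp, int cmd,
			      struct file_lock *fl)
{
	(void)filp; (void)cmd; (void)fl;
	cluster_calls++;
	return 0;
}

/* The short-circuit: same shape as the test in fs/locks.c. */
static int vfs_do_lock(struct file *filp, int cmd, struct file_lock *fl)
{
	if (filp->f_op && filp->f_op->lock)
		return filp->f_op->lock(filp, cmd, fl);
	return generic_posix_lock(filp, cmd, fl);
}
```

Each of the handful of short-circuits in fs/locks.c looks like this, which
is why hooking them is tedious but not hard.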
On a cluster, every node needs to have the same view of Posix locks for inodes
it is sharing. The easiest way I know of to accomplish this is to have a
Posix lock server in the cluster that keeps the same, bad old linear list of
locks per Posix-locked inode, but exports operations on those locks over the
network. Not much challenge: a user-space server and some TCP messaging.
The server will count the nodes locking a given inode, and let the inode fall
away when all are done. I think inode deletion just works: the filesystem
just has to wait for confirmation from the lock server that the Posix locks
are dropped before it allows the normal delete path to continue.
Inodes that aren't shared don't require consulting the server at all, which
is an easy optimization.
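The server-side bookkeeping for that is just a per-inode node count. A
sketch, with invented names, of the two decisions the text describes:
whether an inode is actually shared, and whether the last locking node is
done so the record can fall away and deletion can proceed.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative per-inode record on the Posix lock server. */
struct inode_entry {
	unsigned long ino;
	int node_count;		/* nodes with Posix locks on this inode */
};

/* Unshared inodes can skip the server round trip entirely. */
static bool inode_is_shared(const struct inode_entry *e)
{
	return e->node_count > 1;
}

/* A node has dropped its locks; returns true when the entry can be
 * freed and the filesystem told it is safe to continue deleting. */
static bool lockserver_node_done(struct inode_entry *e)
{
	if (e->node_count > 0)
		e->node_count--;
	return e->node_count == 0;
}
```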
For failover, each node will keep its own list of Posix locks. Exactly the
same code the VFS already uses will do: just cut and paste it and insert the
server messaging. To fail over a lock server, the new server has to roll-call
all cluster nodes to upload their Posix locks.
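The roll-call itself is just an accumulation: the new server starts empty
and each node uploads its local list in turn. A sketch under assumed
shapes (node_lock, server_upload are my names, and a fixed-size table
stands in for a real allocation):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCKS 64

/* One Posix lock as a node would report it during roll-call. */
struct node_lock { int node; unsigned long ino; long start, end; };

struct server_state {
	struct node_lock table[MAX_LOCKS];
	int count;
};

/* Called once per node with that node's local lock list; after every
 * node has answered, the new server's table is complete. */
static void server_upload(struct server_state *s,
			  const struct node_lock *locks, int n)
{
	for (int i = 0; i < n && s->count < MAX_LOCKS; i++)
		s->table[s->count++] = locks[i];
}
```

In the distributed case the same accumulation runs per lock master rather
than once for the whole cluster, which is the point made below.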
Now, obviously this locking should really be distributed, right? Sure, but
notice: the failover algorithm is the same regardless of single-server vs
distributed; in the DLM case it just has to be executed on every node for the
locks mastered on that node. And oh, you have to deal with a new layer of
bookkeeping to keep track of lock masters, and fail that over too. And of
course you want lock migration...
So my purpose today is to introduce the VFS interface and to point out that
this doesn't have to blow up into a gigantic project just to add correct
Posix locking.
See here:
http://lxr.linux.no/source/fs/locks.c#L1560
if (filp->f_op && filp->f_op->lock) {
Regards,
Daniel