
OCFS2/DesignDocs/UserFileLocks

OCFS2 Support for File Locking

Mark Fasheh

Nov. 13, 2007

Overview of Linux File locking

Two types of file locks

Linux supports two types of userspace file locks: flock() and lockf(). On Linux, the two types do not interact with each other - that is, a flock() lock won't block or wait on an fcntl() lock and vice-versa.

flock() is also known as BSD style file locking. flock() locks cover the entire file and can be obtained in shared or exclusive mode. Trylock operations are supported.
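For reference, a minimal sketch of the flock() interface (the path below is just a placeholder):

#include <sys/file.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int fd = open("/tmp/flock-demo", O_RDWR | O_CREAT, 0644);
        if (fd < 0)
                return 1;

        /* Blocking exclusive lock covering the entire file. */
        if (flock(fd, LOCK_EX) < 0)
                perror("flock(LOCK_EX)");

        /* ... read/write under the lock ... */

        flock(fd, LOCK_UN);

        /* Trylock: fails with EWOULDBLOCK if another process holds a
         * conflicting flock() lock, instead of blocking. */
        if (flock(fd, LOCK_SH | LOCK_NB) < 0)
                perror("flock(LOCK_SH | LOCK_NB)");

        close(fd);
        return 0;
}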

On Linux, lockf() provides a POSIX locking interface on top of fcntl(). In order to reduce confusion, I will use fcntl() to refer to lockf() locks for the duration of this document.

fcntl() locks are range based. While fcntl() allows for shared and exclusive locks, the lockf() interface only exposes the exclusive level of locking. Ranges are arbitrary and are allowed to extend infinitely (to the end of the file, even if the file end is changed via an extending operation). Trylocks are supported.

In addition to standard locking and unlocking operations, fcntl() locks support F_GETLK. With F_GETLK, a process can send a lock request and get back information on whether that request would succeed. The lock is never actually taken - if the request would normally succeed, a special status is placed in the control structure. If an existing lock would block the request, the lock details along with the process id of the blocking lock are returned.
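Likewise, a minimal sketch of fcntl() range locking and F_GETLK (again, the path is just a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        struct flock fl;
        int fd = open("/tmp/fcntl-demo", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
                return 1;

        /* Exclusive lock over bytes 0..99. An l_len of 0 would mean
         * "to the end of the file, even if the file grows later". */
        memset(&fl, 0, sizeof(fl));
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 100;
        if (fcntl(fd, F_SETLKW, &fl) < 0)       /* F_SETLK would be a trylock */
                perror("F_SETLKW");

        /* F_GETLK: ask whether this request would succeed; the lock is
         * never actually taken. If a conflicting lock exists, fl is
         * filled in with its details and the pid of the holder,
         * otherwise l_type is set to F_UNLCK. */
        fl.l_type = F_RDLCK;
        if (fcntl(fd, F_GETLK, &fl) == 0 && fl.l_type != F_UNLCK)
                printf("would block on pid %d\n", (int) fl.l_pid);

        /* Drop the lock on the same range. */
        fl.l_type = F_UNLCK;
        fl.l_start = 0;
        fl.l_len = 100;
        fcntl(fd, F_SETLK, &fl);

        close(fd);
        return 0;
}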

Deadlock Detection

flock() does not do deadlock detection. Typically, delivering a signal is the easiest way to un-wedge a pair of processes which are deadlocked in flock. For reference, ctrl-c works fine to kill two deadlocked processes even on a nointr nfs mount.

fcntl() locking will detect deadlocks and return EDEADLK in the case that locking progress can not be made.

Mandatory Locking

Both lock types are advisory by default. Advisory locking means that a process can still read and write to the file without taking a file lock. It's only processes which participate in the locking (by calling flock() or fcntl()) that will block on each other.

Mandatory locking is supported, but needs to be turned on both at the file system level (via a mount option) and on the file itself (by disabling the group execute permission and enabling the set-group-ID permission bit). Mandatory locks are enforced for all processes reading and writing (including truncate) to the file. For example, if a process takes an exclusive lock on a file and another process tries to read that file, the reading process will block until the lock is released.

It's worth noting that mandatory file locking is almost never used. In fact, the first section of Documentation/filesystems/mandatory-locking.txt is titled Why you should avoid mandatory locking. That text is copied here for reference:

0. Why you should avoid mandatory locking
-----------------------------------------

The Linux implementation is prey to a number of difficult-to-fix race
conditions which in practice make it not dependable:

        - The write system call checks for a mandatory lock only once
          at its start.  It is therefore possible for a lock request to
          be granted after this check but before the data is modified.
          A process may then see file data change even while a mandatory
          lock was held.
        - Similarly, an exclusive lock may be granted on a file after
          the kernel has decided to proceed with a read, but before the
          read has actually completed, and the reading process may see
          the file data in a state which should not have been visible
          to it.
        - Similar races make the claimed mutual exclusion between lock
          and mmap similarly unreliable.

Additionally, POSIX.1 does not specify any scheme for mandatory locking.

Right now, no open source cluster file system that I'm aware of supports mandatory locking. My discussions with other cluster folks (mostly David Teigland, who did posix locks in GFS2) indicate that mandatory locking is a dead end - nobody seems to have plans to use or add them to their fs.

Leases

File leases allow a process to get notification of open() or truncate() calls on a file from other processes. Lease manipulation is done via fcntl().

The process that holds the lease is called the lease holder, whereas the process doing the open() or truncate() is referred to as the lease breaker. When a lease is broken, the lease breaker process is put to sleep until the lease holder releases its lease, or downconverts the lease to a compatible level. If there are existing incompatible opens when a process tries to take a lease, the operation will fail.

F_GETLEASE returns the lease type that is currently held on an fd. F_SETLEASE sets or removes a lease at a particular level.

F_SETLEASE argument     Opens blocked                   Truncates blocked
-------------------     -------------                   -----------------
F_RDLCK                 Opens for write                 All
F_WRLCK                 Opens for reading or writing    All
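For reference, a minimal sketch of the lease interface (the path is a placeholder; note that the caller must own the file or have CAP_LEASE):

#define _GNU_SOURCE             /* F_SETLEASE/F_GETLEASE are Linux-specific */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* A read lease also requires the fd to be opened read-only. */
        int fd = open("/tmp/lease-demo", O_RDONLY);
        if (fd < 0)
                return 1;

        /* Take a read lease: the kernel will notify us (SIGIO by
         * default) when another process opens the file for writing or
         * truncates it; that process then blocks until we release or
         * downconvert the lease. */
        if (fcntl(fd, F_SETLEASE, F_RDLCK) < 0)
                perror("F_SETLEASE");

        /* F_GETLEASE reports the lease currently held on the fd. */
        printf("current lease type: %d\n", fcntl(fd, F_GETLEASE));

        /* Release the lease. */
        fcntl(fd, F_SETLEASE, F_UNLCK);
        close(fd);
        return 0;
}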

Right now, no distributed or cluster file system supports leases. Supporting this in a cluster would likely be very expensive. NFSv4 would like to use a subset of leases for cache management; this wouldn't be a hard requirement, but it would improve performance.


Rough notes about how GFS2 handles file locks

Right now, GFS(2) is the only open source example of how to handle file locks in a cluster fs. It stands to reason, then, that the first place to look for ideas would be there. These rough notes catalogue my read-through of their source where it pertains to file locking in GFS2. GFS(1) is ignored.

fcntl

This is almost entirely done in userspace via gfs_controld. The code is in group/gfs_controld in the cluster repository. Additional headers are pulled in from the kernel (such as linux/lock_dlm_plock.h).

AIS is used for network communication. This ensures that all messages sent are received by all nodes. More importantly, it ensures cluster-wide ordering of messages - each node receives messages in exactly the same order. The plock code depends on this feature, and as a result is actually rather simple. Node down messages are also ordered through cpg, an AIS service, though currently the plock code doesn't require node down ordering.

Startup (mount) has the task of getting lock state from the cluster. This is done via a pair of functions, retrieve_plocks() and store_plocks(). retrieve_plocks() is called on the mounting node, while store_plocks() is called on the node which will be sending plock state via AIS. While this is happening, requests received from other nodes are saved for later processing via process_saved_plocks(). Both ends make use of the "checkpoint" feature in openais to transfer the lock information. Each "section" is a buffer consisting of an array of pack_plock structures, all referring to the same resource. store_plocks() creates a section per resource and fills it with pack_plock structures; on the other end, unpack_section_buf() takes a struct pack_plock as read from AIS and recreates the resource / plock structures from it.

struct pack_plock {
        uint64_t start;         /* first byte of the lock range */
        uint64_t end;           /* last byte of the lock range */
        uint64_t owner;
        uint32_t pid;
        uint32_t nodeid;
        uint8_t ex;             /* nonzero for an exclusive lock */
        uint8_t waiter;         /* nonzero if this entry is a waiter */
        uint16_t pad1;
        uint32_t pad;
};
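For illustration only, a hypothetical sketch of how the granted locks on one resource might be flattened into such a section. This is not the actual store_plocks() code; it borrows the resource / posix_lock structures shown further down, along with gfs_controld-style list helpers.

/* Pack every granted lock on one resource into an array of
 * struct pack_plock - the per-resource "section" described above. */
static int pack_resource_locks(struct resource *r, struct pack_plock *buf,
                               int max)
{
        struct posix_lock *po;
        int n = 0;

        list_for_each_entry(po, &r->locks, list) {
                if (n == max)
                        return -1;              /* section buffer too small */
                buf[n].start  = po->start;
                buf[n].end    = po->end;
                buf[n].owner  = po->owner;
                buf[n].pid    = po->pid;
                buf[n].nodeid = po->nodeid;
                buf[n].ex     = po->ex;
                buf[n].waiter = 0;              /* granted lock, not a waiter */
                n++;
        }
        return n;
}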

Open Question #1 What are the specifics of checkpoints? I looked around but didn't find any good docs on how this is used.

Userspace submits a lock request to gfs2 via the standard system calls. The kernel file system code puts a struct gdlm_plock_info in a misc device, which gfs_controld reads back out. The kernel then waits for a response in the form of a write to the misc device.

struct gdlm_plock_info {
        __u32 version[3];
        __u8 optype;            /* lock, unlock or get operation */
        __u8 ex;                /* exclusive? */
        __u8 wait;              /* blocking request? */
        __u8 pad;
        __u32 pid;
        __s32 nodeid;
        __s32 rv;               /* result, written back to the kernel */
        __u32 fsid;
        __u64 number;           /* lock resource (inode) number */
        __u64 start;
        __u64 end;
        __u64 owner;
};
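For illustration, a hypothetical sketch of the daemon's side of that misc device protocol. In the real gfs_controld the write-back is not inline like this - it happens later, once the node's own message comes back via AIS - but the read/write format is the same. The function name is made up for this sketch.

#include <unistd.h>

static void plock_device_loop(int dev_fd)
{
        struct gdlm_plock_info info;

        for (;;) {
                /* The kernel queues one request per read. */
                if (read(dev_fd, &info, sizeof(info)) != sizeof(info))
                        break;

                /* ... broadcast the request to the cluster via AIS ... */

                /* Eventually, a result is written back to the kernel. */
                info.rv = 0;                    /* or -errno on failure */
                if (write(dev_fd, &info, sizeof(info)) != sizeof(info))
                        break;
        }
}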

Once it has the request, process_plocks() puts its own node number in the nodeid field. It then allocates a buffer, puts a network header on it (struct gdlm_header) and copies the gdlm_plock_info into the buffer. The buffer is sent to the entire mount group via AIS. Any failure during this process triggers a failure write into the misc device so that the kernel can properly handle it. It's important to note that no processing of the plock is done here - it's handled in receive_plock() when the node's own message comes back to it via AIS. This ensures that lock requests are processed in the same order on all nodes.

struct gdlm_header {
        uint16_t                version[3];
        uint16_t                type;                   /* MSG_ */
        uint32_t                nodeid;                 /* sender */
        uint32_t                to_nodeid;              /* 0 if to all */
        char                    name[MAXNAME];
};

receive_plock() takes delivery of all plock requests. If plock state hasn't been synchronized yet (as might be the case during mount), it saves the messages in a separate queue for later processing via process_saved_plocks(). Otherwise, the request is processed. GETLK requests from other nodes are ignored, as there's nothing to do. All other requests are processed by manipulating a lock resource structure. Since all messages are ordered, the lock resource structures should look the same on all nodes, and the result of request processing should be the same. One node never sends a "you got this lock" message to another one - it's always implicitly known via the synchronized state.

There are a couple of cases which will cause receive_plock() to communicate back to the local fs code via the misc device. The most obvious is when the request is successful. Trylock failures are always communicated to the fs at this point. Likewise, local GETLK requests have their results immediately written back. If processing of a request moves the lock queues around so that a local lock request can be fulfilled, the status is written back - for example, an unlock might do this [1].

Deadlock detection is done from various helper functions which receive_plock() calls during request processing. They do this by calling is_conflict() on the resource. is_conflict() just looks for conflicts among locks on a single resource - it doesn't attempt to find ABBA style deadlocks. Indeed, no other form of deadlock detection is currently implemented there. It seems that local node deadlock detection in the generic kernel has a large number of existing problems, and the idea behind leaving it out of plock.c is to see what happens with the kernel first. A short discussion with David Teigland indicated that he wouldn't be opposed to adding some trivial deadlock checks in the meantime though. [TODO: Perhaps I should just add a section above on deadlock detection and chronicle the various discussions and problems people have had around it]

Open Question #2 Though they certainly seem to process lock conflicts, I'm not sure where they ever send EDEADLK out to the FS for communication back to user...

Open Question #3 Can the file system "cancel" a request? Say the user hits ctrl-c... Is it just ignored? I guess it'd be pretty hard to "cancel" things once the request is on the message queues, so perhaps just ignoring those is the best way to handle this.

struct resource {
        struct list_head        list;      /* list of resources */
        uint64_t                number;
        struct list_head        locks;    /* one lock for each range */
        struct list_head        waiters;
};

struct posix_lock {
        struct list_head        list;      /* resource locks or waiters list */
        uint32_t                pid;
        uint64_t                owner;
        uint64_t                start;
        uint64_t                end;
        int                     ex;
        int                     nodeid;
};

struct lock_waiter {
        struct list_head        list;
        struct gdlm_plock_info  info;
};
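Tying these structures back to the conflict checking described above, a hypothetical sketch in the spirit of is_conflict(): a request conflicts with an existing granted lock if the ranges overlap, the owners differ, and at least one side is exclusive. Names and list helpers are illustrative, not the actual plock.c code.

#include <stdint.h>

static int ranges_overlap(uint64_t s1, uint64_t e1, uint64_t s2, uint64_t e2)
{
        return s1 <= e2 && s2 <= e1;
}

static int would_conflict(struct resource *r, struct posix_lock *req)
{
        struct posix_lock *po;

        list_for_each_entry(po, &r->locks, list) {
                if (po->nodeid == req->nodeid && po->owner == req->owner)
                        continue;       /* same lock owner never conflicts */
                if (!ranges_overlap(po->start, po->end, req->start, req->end))
                        continue;
                if (po->ex || req->ex)
                        return 1;       /* at least one side is exclusive */
        }
        return 0;
}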

At unmount time, or when a node dies, purge_plocks() is called. It does a simple walk of the resource list and removes waiters and posix_locks. During unmount, all waiters and posix_locks are removed. For recovery, only those belonging to the dead node are freed. If the resource becomes empty as a result of these actions, it is removed from the global list and freed.
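A hypothetical sketch of that purge walk (again, names and list helpers are illustrative, not the actual plock.c code):

#include <stdlib.h>

static void purge_plocks_sketch(struct list_head *resources, int nodeid,
                                int unmount)
{
        struct resource *r, *r2;
        struct posix_lock *po, *po2;
        struct lock_waiter *w, *w2;

        list_for_each_entry_safe(r, r2, resources, list) {
                list_for_each_entry_safe(po, po2, &r->locks, list) {
                        if (unmount || po->nodeid == nodeid) {
                                list_del(&po->list);
                                free(po);
                        }
                }
                list_for_each_entry_safe(w, w2, &r->waiters, list) {
                        if (unmount || w->info.nodeid == nodeid) {
                                list_del(&w->list);
                                free(w);
                        }
                }
                /* A resource left with no locks and no waiters is freed. */
                if (list_empty(&r->locks) && list_empty(&r->waiters)) {
                        list_del(&r->list);
                        free(r);
                }
        }
}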

The following table lists plock "API" functions in group/gfs_controld/plock.c. I call these "API" type functions because they're exported to the rest of gfs_controld and called at various points outside of the core plock code.

Function Name           Short Description

setup_plocks()          Get a connection to the file system via the misc
                        device, init variables.

retrieve_plocks()       Used during mount to request all plock state from the
                        cluster. This essentially syncs the cache. It looks
                        like some sort of barrier is used to freeze the state
                        while it is sent over. There is a protocol for which
                        node is asked for the lock state.

store_plocks()          This is what's called on a node to send plock state to
                        a mounting node.

process_saved_plocks()  Used to process any plock messages which may have been
                        received during retrieve_plocks(). This is for actual
                        lock requests as sent over from other nodes, not plock
                        state synchronization.

purge_plocks()          Used to purge the state of a dead node, or all state
                        during unmount.

process_plock()         Handle a plock request from the gfs2 file system.

receive_plock()         Handle a remote plock request.

flock

Since flock() type locks cover the entire inode, they are easily supported via use of a dlm lock. This is what GFS2 does.

flock() locks bypass some of the standard optimizations in glock.c. Specifically:

In the context of a userspace file lock, all of these exceptions make sense.
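For illustration, a hypothetical kernel-side sketch of the basic idea - mapping a whole-file flock() request onto a single cluster lock. The helpers cluster_flock_acquire() and cluster_flock_drop() are made up for this sketch; this is not the actual GFS2 (or proposed OCFS2) code.

static int cluster_flock(struct file *file, int cmd, struct file_lock *fl)
{
        int trylock = (cmd == F_SETLK);         /* F_SETLKW means "block" */

        if (fl->fl_type == F_UNLCK)
                return cluster_flock_drop(file);

        /* Shared flock maps to a protected-read DLM mode, exclusive
         * flock to EX. One cluster lock per inode covers the whole
         * file, which matches flock() semantics exactly. */
        return cluster_flock_acquire(file,
                        fl->fl_type == F_WRLCK ? DLM_LOCK_EX : DLM_LOCK_PR,
                        trylock);
}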


Ocfs2 Updates

The following describes a way by which we can get cluster aware user file locks in Ocfs2 using the existing cluster stack.

fcntl

If message ordering can be guaranteed within the existing stack, then we can take the core of gfs2's plock.c and pass messages to the kernel, where the existing messaging code is. In order to achieve this, ocfs2msgd is implemented as part of ocfs2/dlm/. ocfs2msgd uses a per-domain dlm lock to serialize message sending. Communication to/from userspace is done via a misc device. Though its intended purpose is plock messaging, we keep the implementation as generic as possible in order to facilitate other forms of messaging in the future.

Though plock.c doesn't require ordered node down events, we could easily provide those by sending node downs via the dlm lock: internally, o2msgd would mark a node dead as soon as it got notification from the dlm. Userspace would not be notified, however, until an explicit "node down" message was sent to the cluster by whichever daemon gets the send lock first.

Interface Description For `ocfs2msgd`

enum o2msg_command_type {
        O2MSG_JOIN_START        = 0,
        O2MSG_JOIN_CONTINUE     = 1,
        O2MSG_LEAVE             = 2,
        O2MSG_SEND_MSG          = 3,
        O2MSG_BARRIER_START     = 4,
        O2MSG_BARRIER_STOP      = 5
};

#define O2MSG_FLAG_UNICAST              0x0001
#define O2MSG_FLAG_REQUIRE_STATUS       0x0002  /* Unused? */

#define O2MSG_MAX_MSG_LEN       256

struct o2msg_command {
        __u32   cmd_type;       /* enum o2msg_command_type */
        __u32   cmd_flags;      /* O2MSG_FLAG_* */
        __u32   cmd_node;       /* target node for unicast sends */
        __u32   cmd_proto_ver;
        __u64   cmd_pad[2];
        __u8    cmd_uuid[32];   /* domain (file system) uuid */
        __u8    cmd_msg[O2MSG_MAX_MSG_LEN];
};

enum o2msg_status_type {
        O2MSG_DOMAIN_JOINED     = 0,
        O2MSG_DOMAIN_LEFT       = 1, /* It is illegal to get any more statuses
                                        after this message */
        O2MSG_BARRIER_READY     = 2,
        O2MSG_ERROR             = 3,
        O2MSG_NODE_JOINED       = 4,
        O2MSG_NODE_DOWN         = 5, /* Nonzero 'ms_errno' if node crashed instead
                                        of a clean unmount */
};

struct o2msg_status {
        __u32   ms_status_type;
        __u32   ms_node;
        __u32   ms_errno;
        __u32   ms_pad;
        __u8    ms_uuid[32];
        __u8    ms_msg[O2MSG_MAX_MSG_LEN];
};

o2msgd talks to userspace via a misc device. Userspace writes commands (struct o2msg_command) into the device, and reads struct o2msg_status back from the device. poll() can be used to determine when there are messages waiting to be read.

Reads and writes are maintained in their own FIFO queues. All messages, statuses and commands are strictly ordered. This is critical to the operation of ocfs2msgd - we want messages and events to be processed in the same order on all nodes.

A status might include messages from another node or responses to join/leave or barrier commands.

Client commands include domain membership (joining or leaving), and the ability to put up message barriers.

Joins

Joins are done in a three step process in conjunction with a file system mount.

Any error after step 2 will produce an o2msg_status with ms_status_type of O2MSG_DOMAIN_LEFT and the local node number in the ms_node field.

Barriers

Barriers can be set up or taken down by userspace. A barrier allows userspace exclusive access to the domain for a short period of time. During that time, no messages can be sent from other nodes. Statuses (such as membership changes) will still be received. Like all other commands and messages, barrier requests and statuses are strictly ordered. That way userspace will implicitly know which messages were received before the barrier was achieved.

Essentially, this mechanism allows userspace to temporarily pause the group messaging on all nodes so that it can quiesce state. Along with unicast messaging, this allows userspace daemons to send a full state to joining nodes.

A barrier request is sent via write of O2MSG_BARRIER_START command. O2MSG_BARRIER_START commands received during an active barrier are ignored.

An O2MSG_BARRIER_READY status with ms_errno of zero indicates when the barrier has been achieved. At this point, it is guaranteed that userspace on that node has exclusive access to sending messages.

Once userspace no longer needs a barrier, it should send an O2MSG_BARRIER_STOP command. After the command has been written, other nodes will be able to send messages again.
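For illustration, a hypothetical sketch of how a daemon might take and release a barrier through the proposed interface. The device fd handling and helper name are assumptions, not part of the design.

#include <string.h>
#include <unistd.h>

static int o2msg_barrier_sync(int dev_fd, const char *uuid)
{
        struct o2msg_command cmd;
        struct o2msg_status st;

        memset(&cmd, 0, sizeof(cmd));
        cmd.cmd_type = O2MSG_BARRIER_START;
        strncpy((char *) cmd.cmd_uuid, uuid, sizeof(cmd.cmd_uuid));
        if (write(dev_fd, &cmd, sizeof(cmd)) != sizeof(cmd))
                return -1;

        /* Other statuses (messages, membership changes) may still be
         * delivered first; keep reading until the barrier is achieved. */
        do {
                if (read(dev_fd, &st, sizeof(st)) != sizeof(st))
                        return -1;
                /* ... hand non-barrier statuses to the normal path ... */
        } while (st.ms_status_type != O2MSG_BARRIER_READY);

        if (st.ms_errno != 0)
                return -1;

        /* ... exclusive access to send messages: sync state here ... */

        memset(&cmd, 0, sizeof(cmd));
        cmd.cmd_type = O2MSG_BARRIER_STOP;
        strncpy((char *) cmd.cmd_uuid, uuid, sizeof(cmd.cmd_uuid));
        return write(dev_fd, &cmd, sizeof(cmd)) == sizeof(cmd) ? 0 : -1;
}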

Messaging

Messaging is likely to comprise the bulk of requests that userspace sends. Messages can be sent via an O2MSG_SEND_MSG command. The message contents should be stored in the cmd_msg member, and cannot exceed O2MSG_MAX_MSG_LEN bytes. Messages are received on all nodes at the same time, including the sending node - specifically, the sending node can expect to receive its own message back.

Messages can be unicast to a specific node by setting the O2MSG_FLAG_UNICAST flag in cmd_flags and the intended recipient node number in cmd_node.
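Similarly, a hypothetical sketch of sending a broadcast or unicast message with the proposed command structure (the helper name and fd handling are assumptions):

#include <string.h>
#include <unistd.h>

static int o2msg_send_sketch(int dev_fd, const char *uuid,
                             const void *msg, size_t len, int to_node)
{
        struct o2msg_command cmd;

        if (len > O2MSG_MAX_MSG_LEN)
                return -1;

        memset(&cmd, 0, sizeof(cmd));
        cmd.cmd_type = O2MSG_SEND_MSG;
        strncpy((char *) cmd.cmd_uuid, uuid, sizeof(cmd.cmd_uuid));
        memcpy(cmd.cmd_msg, msg, len);

        if (to_node >= 0) {                     /* negative means broadcast */
                cmd.cmd_flags = O2MSG_FLAG_UNICAST;
                cmd.cmd_node = to_node;
        }

        /* The sender will see its own copy of the message come back as
         * an o2msg_status, in the same order as every other node. */
        return write(dev_fd, &cmd, sizeof(cmd)) == sizeof(cmd) ? 0 : -1;
}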

Open question: should userspace expect to receive its own unicast messages back? I think perhaps that makes sense - though they'd typically be sent while holding a barrier, there might be a use where we wouldn't do it that way, so getting it back would help preserve message ordering.

Errors

Errors may be returned from sys_write() if the command is invalid or incomplete. Once sys_write() returns however, most commands don't have a status return. Typically, o2msgd can be trusted to deliver any messages it has accepted. In the case of a severe error however, an O2MSG_ERROR status will be sent. If an O2MSG_ERROR status is received, userspace should do whatever cleanup is necessary and leave the domain.

Internal Description of `ocfs2msgd`

Description of `ocfs2controld`

ocfs2controld receives and processes events from several sources:

Source          Message Types

mount.ocfs2     Initiate join for a domain. Mount status, so that a failed
                mount can be cleaned up.

umount.ocfs2    Leave a domain (done after sys_umount() returns).

ocfs2           Posix lock requests.

o2msgd          Messaging, membership management.

Domain startup

Upon receiving a UUID, ocfs2controld sends an O2MSG_JOIN_START command and allows mount.ocfs2 to continue. At this point, any messages received from other nodes are discarded. If the mount fails (either by receiving a bad error from mount.ocfs2 or from o2msgd), it cleans up with an explicit O2MSG_LEAVE command and frees any context associated with the mount.

If a successful O2MSG_DOMAIN_JOINED is received, ocfs2controld initiates a series of actions designed to sync its state with the cluster.

A barrier is requested. When a O2MSG_BARRIER_READY status is received

flock

  • [1] It seems that a lock request triggers code which also runs the queues looking for things to grant, but I'm not sure if this ever happens in practice.

