
Running OCFS2 With a Userspace Cluster Stack - Filesystem Changes

JoelBecker, December 2007

Introduction

The ocfs2 filesystem currently is wedded to the o2cb cluster implementation. Some changes will need to be made to support userspace cluster stacks. This document describes what we need to do in the filesystem code.

Simplifying the Needs of OCFS2

ocfs2 currently knows a lot about the cluster environment. It needs to know each node sharing a filesystem. It needs to communicate with each node to make some decisions about filesystem behavior. It needs to be notified when nodes come up and go down. Finally, it needs to know what nodes are alive so that it can recover nodes when a live node goes down unexpectedly.

ocfs2 should not need any of this. Distributed locking allows a node to protect portions of the filesystem while acting on them. The cluster stack should handle detecting abnormal node death - when the "death" is normal (unmount), ocfs2 shouldn't know about it.

The first step is to stop notifying ocfs2 when nodes come up and go down in o2hb. ocfs2 doesn't care until the nodes are part of the group sharing the filesystem, and that only happens when the nodes are part of the DLM domain. In fact, ocfs2 is already notified when nodes leave the DLM domain. All notification can therefore happen through the DLM, and we remove the o2hb calls into ocfs2.

Nodes send messages to each other called "votes". These messages are outside of the DLM. To send these messages, ocfs2 must know the names and addresses of each node sharing a filesystem. If there are no votes, ocfs2 does not need to know the addresses and does not need to communicate with other nodes. There is only one remaining vote, the "mount/umount" vote. This message tells nodes when another node has joined or left the filesystem group. It exists so that ocfs2 can determine if a node joins or leaves abnormally. As we already stated, the cluster stack should be making that determination. With node death events only coming from the DLM, we now know a death was abnormal. Thus, we can remove the vote.

These changes have actually been coded, and they live on the "cluster_updates" branch of ocfs2.git.

Without the vote or heartbeat requirements, ocfs2 no longer has a strong dependency on the o2cb stack. It is ready for a more generic interface. The only functions ocfs2 uses from o2cb are:

$ nm fs/ocfs2/ocfs2.ko | grep U | grep -e dlm -e o2 -e ocfs2
                 U .dlm_errmsg
                 U .dlm_errname
                 U .dlm_print_one_lock
                 U .dlm_register_domain
                 U .dlm_register_eviction_cb
                 U .dlm_setup_eviction_cb
                 U .dlm_unregister_domain
                 U .dlm_unregister_eviction_cb
                 U .dlmlock
                 U .dlmunlock
                 U .o2hb_check_local_node_heartbeating
                 U .o2nm_get_hb_ctl_path
                 U .o2nm_this_node

What OCFS2 Actually Needs

The list above is very short. It falls into three basic functions:

Registering for recovery notification.
- When the filesystem comes online, it must register for recovery notification. This process ensures that the cluster stack is available before returning to the filesystem.
  (o2hb_check_local_node_heartbeating, o2nm_this_node, dlm_setup_eviction_cb, dlm_register_eviction_cb, and dlm_unregister_eviction_cb)
Connecting to the lock manager.
- The filesystem must connect to the lock manager to start using cluster locks.
  (dlm_register_domain and dlm_unregister_domain)
Using cluster locks.
- (dlmlock, dlmunlock, dlm_errmsg, dlm_errname)

The remaining two functions, o2nm_get_hb_ctl_path and dlm_print_one_lock, are unimportant and can be ignored.

The o2cb functions listed are specific to o2cb. A generic API can be easily distilled from them.

A Cluster API for OCFS2

Here are the operations OCFS2 needs:

struct ocfs2_cluster_connection {
    char cc_name[GROUP_NAME_MAX];
    int cc_namelen;
    int (*cc_recovery_handler)(int node_num, void *recovery_data);
    void *cc_recovery_data;
    void *cc_lockspace;
    void *cc_private;
};

union ocfs2_dlm_lksb {
    struct dlm_lockstatus lksb_o2dlm;   /* o2dlm status block */
    struct dlm_lksb lksb_fsdlm;         /* fs/dlm status block */
};

int ocfs2_cluster_connect(const char *stack_name,
                          const char *group,
                          int grouplen, 
                          int (*recovery_handler)(int node_num, void *recovery_data),
                          void *recovery_data,
                          struct ocfs2_cluster_connection **conn);
int ocfs2_cluster_disconnect(struct ocfs2_cluster_connection *conn,
                             int hangup_pending);
void ocfs2_cluster_hangup(const char *group, int grouplen);
int ocfs2_cluster_this_node(unsigned int *node);
int ocfs2_dlm_lock(struct ocfs2_cluster_connection *conn,
                   int mode,
                   union ocfs2_dlm_lksb *lksb,
                   u32 flags,
                   void *name,
                   unsigned int namelen,
                   struct ocfs2_lock_res *astarg);
int ocfs2_dlm_unlock(struct ocfs2_cluster_connection *conn,
                     union ocfs2_dlm_lksb *lksb,
                     u32 flags,
                     struct ocfs2_lock_res *astarg);
int ocfs2_dlm_lock_status(union ocfs2_dlm_lksb *lksb);
void *ocfs2_dlm_lock_lvb(union ocfs2_dlm_lksb *lksb);

First, the connect and disconnect functions. o2dlm's dlm_register_domain() takes a key value computed as a crc32_le() of the name. Our glue code can compute that. fs/dlm's dlm_new_lockspace() function takes a namelen that o2dlm does not. We'll keep the namelen argument, because no one should be depending on string termination. The flags and lvblen arguments of dlm_new_lockspace() can hopefully be hardcoded for ocfs2 in the glue as well. On the disconnect side, dlm_release_lockspace() takes a force argument that we can hopefully pick a sane value for.
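As a rough illustration of the key computation the glue would do for o2dlm, here is a userspace sketch. The names sketch_crc32_le() and sketch_domain_key() are hypothetical, and the zero seed is an assumption; this document does not say which seed the glue passes. The routine is a plain bitwise CRC-32 over the reflected polynomial 0xEDB88320 with no implicit inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise little-endian CRC-32 (reflected polynomial 0xEDB88320).
 * No implicit pre- or post-inversion: the caller chooses the seed. */
uint32_t sketch_crc32_le(uint32_t crc, const unsigned char *buf, size_t len)
{
    while (len--) {
        crc ^= *buf++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc;
}

/* Hypothetical glue helper: derive the o2dlm domain key from the group
 * name. Because the length is explicit, the name does not have to be
 * NUL-terminated. The zero seed is an assumption. */
uint32_t sketch_domain_key(const char *group, int grouplen)
{
    assert(group != NULL && grouplen >= 0);
    return sketch_crc32_le(0, (const unsigned char *)group, (size_t)grouplen);
}
```

Only the first grouplen bytes contribute to the key, which is the reason to carry the length through the API rather than rely on a terminator.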

The connect function will not return unless it has successfully detected the underlying cluster stack. If the stack is active, the handler is registered. If the stack is inactive, an error is returned. The cc_private pointer can be used by the stack for anything else it needs.

The node_num argument for the handler is stack-specific. The generic code can print it, but should not trust it for much else.

The hangup function is a hack. We have to have some way to call ocfs2_hb_ctl with older tools. It is called during unmount after ocfs2_cluster_disconnect(). We can't do ocfs2_hb_ctl in ocfs2_cluster_disconnect() because the filesystem may not have started a connection yet. The userspace stack will leave this hook empty. The filesystem uses hangup_pending to tell the stack whether it will call hangup or not. In some mount failures, hangup will not be called, so cleanup has to happen in ocfs2_cluster_disconnect().
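The split between disconnect and hangup can be modelled in a few lines. This is a userspace sketch only: struct sketch_group and both functions are hypothetical stand-ins, and "stopping heartbeat" is reduced to clearing a flag where the real o2cb plugin would run ocfs2_hb_ctl.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group state an o2cb-style plugin might track. */
struct sketch_group {
    bool connected;    /* lock manager domain joined */
    bool hb_running;   /* heartbeat still held for this group */
};

/* Stands in for the hangup hook; a userspace stack leaves it empty. */
void sketch_hangup(struct sketch_group *g)
{
    assert(g != NULL);
    g->hb_running = false;
}

/* Leave the lock manager. If the filesystem promises a later hangup
 * call, heartbeat is left alone; otherwise cleanup happens right here. */
int sketch_disconnect(struct sketch_group *g, int hangup_pending)
{
    assert(g != NULL);
    g->connected = false;
    if (!hangup_pending)
        sketch_hangup(g);
    return 0;
}
```

A normal unmount would call sketch_disconnect(g, 1) and then sketch_hangup(g); a failed mount that never reaches the hangup call would pass 0 so nothing is left behind.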

The dlm_lock() functions are almost identical. The fs/dlm lock function takes one additional argument, the parent_lkid. Since we can't use it, it's left out. The lock id needed for unlock is on the lksb, so we'll just use that.

We do not need [b]ast arguments to the lock and unlock functions. ocfs2 intentionally has one function for each. These are registered with the underlying stack at initialization time. We explicitly pass the struct ocfs2_lock_res as the astarg because the fs/dlm stack needs to dereference it.
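To make the "register once, pass only astarg" idea concrete, here is a toy userspace model. All sketch_ names are hypothetical, the toy stack grants locks synchronously (a real DLM completes them asynchronously), and struct sketch_lock_res is only a stand-in for struct ocfs2_lock_res.

```c
#include <assert.h>
#include <stddef.h>

/* Callback table the filesystem hands to the stack once, at init time. */
struct sketch_locking_protocol {
    void (*lp_lock_ast)(void *astarg);
    void (*lp_blocking_ast)(void *astarg, int level);
};

/* Hypothetical stand-in for struct ocfs2_lock_res. */
struct sketch_lock_res {
    int granted;
    int blocked_level;
};

static const struct sketch_locking_protocol *sketch_proto;

void sketch_stack_init(const struct sketch_locking_protocol *proto)
{
    assert(proto != NULL);
    sketch_proto = proto;
}

/* Lock request: no per-call AST pointers, only the astarg. */
int sketch_dlm_lock(struct sketch_lock_res *astarg)
{
    if (sketch_proto == NULL)
        return -1;                       /* stack not initialized */
    sketch_proto->lp_lock_ast(astarg);   /* toy stack: grant at once */
    return 0;
}

/* Another node wants a conflicting lock at the given level. */
void sketch_stack_conflict(struct sketch_lock_res *astarg, int level)
{
    sketch_proto->lp_blocking_ast(astarg, level);
}

/* The filesystem's single lock AST and single blocking AST. */
void sketch_fs_lock_ast(void *astarg)
{
    ((struct sketch_lock_res *)astarg)->granted = 1;
}

void sketch_fs_blocking_ast(void *astarg, int level)
{
    ((struct sketch_lock_res *)astarg)->blocked_level = level;
}

const struct sketch_locking_protocol sketch_fs_proto = {
    .lp_lock_ast = sketch_fs_lock_ast,
    .lp_blocking_ast = sketch_fs_blocking_ast,
};
```

Since every lock uses the same two callbacks, the lock call shrinks to the arguments that actually vary per request.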

For the ocfs2_dlm_lksb union, we need a couple of accessors. They are pretty trivial.
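A sketch of what those accessors could look like, one pair per stack. The two status-block structs below are simplified stand-ins with illustrative fields, not the real o2dlm or fs/dlm definitions; the only property the sketch relies on is that one DLM embeds the LVB in the status block while the other reaches it through a pointer.

```c
#include <assert.h>

#define SKETCH_LVB_LEN 64   /* illustrative LVB size */

/* Simplified stand-in for the o2dlm status block: LVB embedded. */
struct sketch_o2dlm_lksb {
    int  status;
    char lvb[SKETCH_LVB_LEN];
};

/* Simplified stand-in for the fs/dlm status block: LVB via pointer. */
struct sketch_fsdlm_lksb {
    int   sb_status;
    char *sb_lvbptr;
};

union sketch_dlm_lksb {
    struct sketch_o2dlm_lksb lksb_o2dlm;
    struct sketch_fsdlm_lksb lksb_fsdlm;
};

/* The union is sized for the larger member, fixed at compile time. */
static_assert(sizeof(union sketch_dlm_lksb) >= sizeof(struct sketch_o2dlm_lksb),
              "union must hold the o2dlm status block");

/* Accessors an o2cb-style plugin would provide. */
int sketch_o2cb_lock_status(union sketch_dlm_lksb *lksb)
{
    return lksb->lksb_o2dlm.status;
}

void *sketch_o2cb_lock_lvb(union sketch_dlm_lksb *lksb)
{
    return lksb->lksb_o2dlm.lvb;
}

/* Accessors a fs/dlm-based plugin would provide. */
int sketch_user_lock_status(union sketch_dlm_lksb *lksb)
{
    return lksb->lksb_fsdlm.sb_status;
}

void *sketch_user_lock_lvb(union sketch_dlm_lksb *lksb)
{
    return lksb->lksb_fsdlm.sb_lvbptr;
}
```

The filesystem never looks inside the union itself; it always goes through whichever pair the active stack supplies.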

Note that I've changed the structures. We have wrapper struct ocfs2_dlm_* structures that we will translate to the underlying DLM. The lksb is a union, as we embed it in the inode. This means it's compile-time, so we'll have to be careful about it.

Finally, note that I've removed all the error printing and eviction functions. We will be using proper kernel return codes. We should be getting node down from the grouping interface.

That's it. That's all the functionality needed by ocfs2.

Recovery

The ocfs2 recovery mechanism is still node number based. The node numbers need to be opaque to ocfs2, however. With userspace stacks having large node numbers, the node number values are promoted to UINT32. A new slot map format is required to support this.

The recovery scheme stays pretty much the same:

ocfs2 is notified of a node going down by node number.
ocfs2 stores this number off and starts a recovery thread.
The recovery thread takes the superblock lock and looks the node number up in the slot map.
The slot is recovered.
The recovery thread drops the superblock lock.

There are two key points. First, ocfs2 cannot look up the slot number until it has the superblock lock. It must therefore save off the node number and have a way to translate it into a slot later, so the slot map must stay. Second, ocfs2 must be notified of node death before the DLM is. Thus, when ocfs2 locks the slot, it will wait until the DLM has recovered itself. If the DLM were notified first, it could have completed recovery before ocfs2 gets that lock.
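The recovery steps above can be sketched as a userspace model. Everything here is hypothetical: the slot layout (a valid flag plus a 32-bit node number) is just one way a new slot map format could look, the superblock lock is reduced to a flag, and "recovering" a slot only sets a marker.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_MAX_SLOTS 8

struct sketch_slot {
    int      valid;      /* slot is currently owned by a node */
    uint32_t node_num;   /* opaque, stack-assigned node number */
    int      recovered;  /* set once this sketch "recovers" the slot */
};

struct sketch_fs {
    int sb_locked;       /* stands in for the cluster-wide superblock lock */
    struct sketch_slot slots[SKETCH_MAX_SLOTS];
};

/* Translate a node number into a slot. Only legal under the lock. */
int sketch_node_to_slot(const struct sketch_fs *fs, uint32_t node_num)
{
    assert(fs->sb_locked);
    for (int i = 0; i < SKETCH_MAX_SLOTS; i++)
        if (fs->slots[i].valid && fs->slots[i].node_num == node_num)
            return i;
    return -ENOENT;
}

/* Body of the recovery thread for one saved-off node number.
 * Returns the recovered slot, or -ENOENT if the node held none. */
int sketch_recover_node(struct sketch_fs *fs, uint32_t node_num)
{
    int slot;

    fs->sb_locked = 1;   /* real code may wait here for DLM recovery */
    slot = sketch_node_to_slot(fs, node_num);
    if (slot >= 0) {
        fs->slots[slot].recovered = 1;   /* journal replay and so on */
        fs->slots[slot].valid = 0;       /* slot is free again */
    }
    fs->sb_locked = 0;
    return slot;
}
```

The node number is only ever compared for equality, which keeps it opaque to the filesystem as required.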

Communicating With the Underlying Cluster Stack

The above API describes how ocfs2 will use the underlying stack and DLM in normal operation. However, it doesn't describe what translates the simple API into calls for the underlying stack.

As we will support more than one stack, we need a plugin architecture. Each stack plugin must register when loaded. The plugin structure maps directly onto the API ocfs2 uses.

struct ocfs2_stack_operations {
    int (*connect)(struct ocfs2_cluster_connection *conn);
    int (*disconnect)(struct ocfs2_cluster_connection *conn);
    void (*hangup)(const char *group, int grouplen);
    int (*this_node)(unsigned int *node);
    int (*dlm_lock)(struct ocfs2_cluster_connection *conn,
                    int mode,
                    union ocfs2_dlm_lksb *lksb,
                    u32 flags,
                    void *name,
                    unsigned int namelen,
                    void *astarg);
    int (*dlm_unlock)(struct ocfs2_cluster_connection *conn,
                      union ocfs2_dlm_lksb *lksb,
                      u32 flags,
                      void *astarg);
    int (*lock_status)(union ocfs2_dlm_lksb *lksb);
    void *(*lock_lvb)(union ocfs2_dlm_lksb *lksb);
};

struct ocfs2_locking_protocol {
    u32 lp_version;
    void (*lp_lock_ast)(void *astarg);
    void (*lp_blocking_ast)(void *astarg, int level);
    void (*lp_unlock_ast)(void *astarg, int status);
};

struct ocfs2_stack_plugin {
    char *sp_name;
    struct ocfs2_stack_operations *sp_ops;
    struct module *sp_owner;

    /* Managed by the stackglue code. */
    struct list_head sp_list;
    unsigned int sp_count;
    struct ocfs2_locking_protocol *sp_proto;
};

/* called by module_init/exit() in the pluggable stack */
int ocfs2_register_stack(struct ocfs2_stack_plugin *plugin);
int ocfs2_unregister_stack(struct ocfs2_stack_plugin *plugin);

When a stack module loads, it registers itself. When it unloads, it unregisters itself. The sp_owner field is used to pin the module while the stack is in use.

There are only two stack plugins - ocfs2_stack_o2cb.ko and ocfs2_stack_user.ko. ocfs2_stack_o2cb contains the original o2cb code. It behaves just like ocfs2 always has. Upgrading users should be able to use this stack without any other changes.

The ocfs2_stack_user plugin supports all userspace cluster stacks. It uses the fs/dlm DLM to handle cluster locking. A userspace cluster stack provides a control daemon to interact with the ocfs2_stack_user plugin.

ocfs2 will create four files in sysfs in the /sys/fs/ocfs2 directory. The locking_protocol file will display the version of locking that ocfs2 is using. All nodes must agree on this to interoperate. loaded_cluster_plugins will list all loaded stack plugins. active_cluster_plugin will display the currently selected stack plugin. Userspace can write the name of an available stack into active_cluster_plugin to change the currently selected stack, but only if the present stack is not in use. sp_count is used to track a busy stack plugin. Finally, the cluster_stack file contains the name of the cluster stack currently in use.

All operations to register a stack, unregister a stack, and increment or decrement sp_count will be protected by a spinlock.

If no stack is registered, the cluster functions will return -ENOSYS.
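The registration and selection rules can be modelled in userspace as follows. This is a sketch under several assumptions that the text above does not settle: the first registered plugin becomes the active one, unregistering a busy plugin fails with -EBUSY, and the other error codes are placeholders. The spinlock is omitted; in the kernel each function body would run under it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct sketch_stack_plugin {
    const char *sp_name;
    unsigned int sp_count;                /* connections using this stack */
    struct sketch_stack_plugin *sp_next;  /* stands in for sp_list */
};

static struct sketch_stack_plugin *sketch_stacks;  /* loaded plugins */
static struct sketch_stack_plugin *sketch_active;  /* selected plugin */

static struct sketch_stack_plugin *sketch_find(const char *name)
{
    for (struct sketch_stack_plugin *p = sketch_stacks; p; p = p->sp_next)
        if (strcmp(p->sp_name, name) == 0)
            return p;
    return NULL;
}

int sketch_register_stack(struct sketch_stack_plugin *plugin)
{
    if (sketch_find(plugin->sp_name))
        return -EEXIST;
    plugin->sp_count = 0;
    plugin->sp_next = sketch_stacks;
    sketch_stacks = plugin;
    if (!sketch_active)
        sketch_active = plugin;   /* assumption: first plugin is selected */
    return 0;
}

int sketch_unregister_stack(struct sketch_stack_plugin *plugin)
{
    if (plugin->sp_count)
        return -EBUSY;            /* assumption: busy plugins stay loaded */
    for (struct sketch_stack_plugin **pp = &sketch_stacks; *pp;
         pp = &(*pp)->sp_next) {
        if (*pp == plugin) {
            *pp = plugin->sp_next;
            if (sketch_active == plugin)
                sketch_active = NULL;
            return 0;
        }
    }
    return -ENOENT;
}

/* Models a write to active_cluster_plugin. */
int sketch_select_stack(const char *name)
{
    struct sketch_stack_plugin *p = sketch_find(name);

    if (!p)
        return -ENOENT;
    if (sketch_active && sketch_active != p && sketch_active->sp_count)
        return -EBUSY;            /* present stack is in use */
    sketch_active = p;
    return 0;
}

int sketch_cluster_connect(void)
{
    if (!sketch_active)
        return -ENOSYS;           /* no stack registered */
    sketch_active->sp_count++;
    return 0;
}

void sketch_cluster_disconnect(void)
{
    assert(sketch_active && sketch_active->sp_count > 0);
    sketch_active->sp_count--;
}
```

With two plugins loaded and one connection open, both a switch to the other plugin and an unload of the busy one are refused until the connection is dropped.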

The glue layer becomes a new module, ocfs2_stackglue.ko. The ocfs2.ko driver depends on it.

A working cut of this interface is available in the cluster_abstractions_modules branch of Joel's git tree.

Ensuring the Same Cluster Stack

One essential requirement is that all nodes are running the same cluster stack. If they are not, they will not speak to each other and will corrupt the filesystem. We ensure this via the cluster stack name.

The cluster stack name is a four-character string uniquely identifying a stack. When using ocfs2_stack_o2cb, the stack name is o2cb. All other stacks using ocfs2_stack_user will specify their name.

All stacks except o2cb store their name on the disk in the superblock.

#define OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK  0x0080

struct ocfs2_cluster_info {
        __u8   ci_stack[OCFS2_STACK_LABEL_LEN];
        __le32 ci_reserved;
        __u8   ci_cluster[OCFS2_CLUSTER_NAME_LEN];
};

All filesystems using a userspace stack will set the OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK incompat bit on the superblock and fill in the struct ocfs2_cluster_info structure with the stack and cluster names. o2cb will leave these blank for backwards compatibility - a lack of the incompat bit is the same as specifying o2cb. Thus, older ocfs2 tools can still work against the new filesystem modules.

When mounting with a userspace stack, mount(2) must be passed the cluster_stack=XXXX option, where 'XXXX' is the stack name. ocfs2 does two things with this value. First, it will match it with the disk's cluster_info->ci_stack. It must match verbatim - the filesystem does not interpret the value. Then it will pass this to ocfs2_cluster_connect(); ocfs2_stackglue will match it against the contents of /sys/fs/ocfs2/cluster_stack. It must match there as well. If either comparison fails, the mount will fail.
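The two comparisons can be sketched as one userspace function. The struct and constant names are hypothetical, the -EINVAL return is a placeholder, and two behaviors are assumptions: a missing cluster_stack= option is treated as o2cb, and the cluster_stack file is taken to read o2cb when the classic stack is running.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_STACK_LABEL_LEN          4       /* four-character name */
#define SKETCH_INCOMPAT_USERSPACE_STACK 0x0080u

/* Hypothetical slice of the on-disk superblock. */
struct sketch_super {
    uint32_t s_feature_incompat;
    uint8_t  ci_stack[SKETCH_STACK_LABEL_LEN];  /* not NUL-terminated */
};

/* mount_stack:   value of cluster_stack=, or NULL if absent.
 * running_stack: contents of /sys/fs/ocfs2/cluster_stack.
 * Returns 0 if the mount may proceed, -EINVAL on any mismatch. */
int sketch_check_stack(const struct sketch_super *sb,
                       const char *mount_stack,
                       const char *running_stack)
{
    const char *disk_stack;

    assert(sb != NULL && running_stack != NULL);

    /* No incompat bit is the same as o2cb on disk. */
    if (sb->s_feature_incompat & SKETCH_INCOMPAT_USERSPACE_STACK)
        disk_stack = (const char *)sb->ci_stack;
    else
        disk_stack = "o2cb";

    if (mount_stack == NULL)
        mount_stack = "o2cb";   /* assumption: option omitted means o2cb */

    if (strlen(mount_stack) != SKETCH_STACK_LABEL_LEN)
        return -EINVAL;

    /* First check: verbatim match against the disk. */
    if (memcmp(disk_stack, mount_stack, SKETCH_STACK_LABEL_LEN) != 0)
        return -EINVAL;

    /* Second check: match against the running stack. */
    if (strlen(running_stack) != SKETCH_STACK_LABEL_LEN ||
        memcmp(running_stack, mount_stack, SKETCH_STACK_LABEL_LEN) != 0)
        return -EINVAL;

    return 0;
}
```

The comparison is bytewise on purpose: the filesystem treats the name as an opaque label and never interprets it.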

Connecting to O2CB

The o2cb portion of this is pretty straightforward. The o2cb code will just provide the stack hooks, and it should go pretty easily from there. A lot of setup that is done in dlmglue.c and heartbeat.c becomes part of the glue. It now becomes the ocfs2_stack_o2cb.ko module.

Connecting to Userspace Clustering

This is the real bit of new code.

A userspace cluster stack has to notify ocfs2 when recovery is needed. There have been multiple approaches to this, most recently via user heartbeat. That approach, however, was very heavy. o2cb did all the in-kernel lifting. With the votes and o2dlm removed from the picture, ocfs2 only needs to know when to start recovery. That is, the userspace cluster stack needs to trigger the recovery handler.

Here are the requirements from the userspace stack connection.

The userspace stack triggers the recovery handler when a node goes down unexpectedly. If a node goes down in an expected way (e.g., unmount), there is no trigger. If the node is not on this filesystem (it belongs to a different group), this filesystem is not notified.
During ocfs2_cluster_connect(), the glue needs to ensure that the stack is running and available. If not, it needs to fail.
If the stack disappears or has some error while running, ocfs2 must be notified.

This interaction is handled by a misc device, /dev/ocfs2_control. This device is exposed by the ocfs2_stack_user module. When the control daemon starts up, it opens the device and negotiates a protocol (allowing for compatibility). The filesystem will refuse mounts until the device is open and connected. Once a control daemon has registered, mounts will be allowed to proceed.

When a node goes down, the control daemon sends a message through the device. The filesystem will match the message to a particular filesystem and initiate recovery.

If the misc device is closed while there are active mounts, the stack glue will self-fence. You cannot run without a connection to the cluster stack. If there are no active mounts when the device is closed, it will just prevent new mounts from starting.
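The life cycle of the control device can be written down as a small state machine. This is a userspace model, not the device code: the state names, the integer protocol values, and the error codes are all assumptions, and self-fencing is represented by a terminal state where a real node would reset itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum sketch_ctl_state {
    SKETCH_CTL_CLOSED,      /* no daemon has the device open */
    SKETCH_CTL_HANDSHAKE,   /* open, protocol not yet negotiated */
    SKETCH_CTL_CONNECTED,   /* protocol agreed, mounts allowed */
    SKETCH_CTL_FENCED       /* connection lost with active mounts */
};

struct sketch_control {
    enum sketch_ctl_state state;
    unsigned int mounts;
};

/* The control daemon opens /dev/ocfs2_control. */
void sketch_ctl_open(struct sketch_control *c)
{
    if (c->state == SKETCH_CTL_CLOSED)
        c->state = SKETCH_CTL_HANDSHAKE;
}

/* The daemon and the kernel side agree on a protocol. */
int sketch_ctl_negotiate(struct sketch_control *c, int daemon_proto,
                         int kernel_proto)
{
    if (c->state != SKETCH_CTL_HANDSHAKE)
        return -EINVAL;
    if (daemon_proto != kernel_proto)
        return -EINVAL;     /* stay in handshake; mounts remain refused */
    c->state = SKETCH_CTL_CONNECTED;
    return 0;
}

/* Mounts are refused until a daemon is connected. */
int sketch_mount(struct sketch_control *c)
{
    if (c->state != SKETCH_CTL_CONNECTED)
        return -ENOTCONN;
    c->mounts++;
    return 0;
}

void sketch_umount(struct sketch_control *c)
{
    assert(c->mounts > 0);
    c->mounts--;
}

/* The daemon closes the device or dies. */
void sketch_ctl_close(struct sketch_control *c)
{
    if (c->state == SKETCH_CTL_FENCED)
        return;
    c->state = c->mounts ? SKETCH_CTL_FENCED : SKETCH_CTL_CLOSED;
}
```

Closing with no mounts simply returns to the initial state, so a restarted daemon can reconnect and mounts can proceed again.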

All locking, as stated above, is handled via fs/dlm.

Protocol Negotiation

When the classic o2cb stack opens a connection between nodes, it negotiates a handshake packet. This allows both sides of the connection to ensure they are speaking the same protocol. If they aren't speaking the same protocol, Bad Things can happen. Thus, any mismatch on the handshake will cause a mount to fail.

The problem is that the handshake is internal to o2cb. It communicates o2cb timeout values and so on. There is no differentiation between changes to the network protocol, the o2dlm protocol, or how ocfs2 does locking itself. In previous releases, the network protocol version was changed when ocfs2 added, removed, or modified a lock type, even though the actual on-the-wire code had no changes.

The way ocfs2 does locking needs to be separated from the protocol spoken by the underlying cluster stack. This is cleaner for the classic o2cb stack, as the network and o2dlm protocols may not change even if ocfs2 does locking differently. It is imperative for a userspace stack and fs/dlm. The userspace stack will have its own protocol negotiation, as will fs/dlm. Those negotiations will not take ocfs2 into consideration.

To solve this, we add a protocol version field to struct ocfs2_locking_protocol. The underlying stack->connect() function can use the value in lp_version to negotiate with other nodes. The classic o2cb stack can use the value in creating a handshake packet. The userspace stack will create a file /sys/fs/ocfs2/protocol-version. This way, userspace software (e.g., ocfs2_controld) can read the version and populate it.

As the protocol version is tied to a particular ocfs2 driver, it can be static from the time the module is loaded until it is unloaded. This is why we can just populate it on the stack plugin structure.

As an example, after joining the group for a particular filesystem, the ocfs2_controld process will negotiate the protocol version before returning success to mount.ocfs2(8). If the filesystem locking protocols do not match, it will leave the group and fail the mount.
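A minimal model of that join-then-negotiate sequence, as a control daemon might implement it. The group structure and function name are hypothetical, and two details are assumptions: the first member of a group defines its version, and versions must match exactly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one filesystem's group, shared by all nodes. */
struct sketch_fs_group {
    unsigned int members;
    uint32_t lp_version;    /* locking protocol the group agreed on */
};

/* Join the group, then negotiate. Returns 0 if the mount may proceed. */
int sketch_join_for_mount(struct sketch_fs_group *g, uint32_t local_version)
{
    assert(g != NULL);

    g->members++;                        /* join the group first */
    if (g->members == 1) {
        g->lp_version = local_version;   /* assumption: first node decides */
        return 0;
    }
    if (g->lp_version != local_version) {
        g->members--;                    /* leave the group */
        return -EINVAL;                  /* mount fails */
    }
    return 0;
}
```

Because the version is fixed for the lifetime of the loaded driver, the daemon only needs to read it once before running this check.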