Running OCFS2 With a Userspace Cluster Stack - Tools Changes
JoelBecker, December 2007
Introduction
The ocfs2 filesystem currently is wedded to the o2cb cluster implementation. Some changes will need to be made to support userspace cluster stacks. This document describes what we need to do in the ocfs2 tools code.
Like the filesystem, the tools treat the environment as either "classic o2cb" or "userspace cluster stack".
New Slot Map
The new slot map format is required to support the larger node numbers in userspace stacks.
Stack Plugin Startup
The o2cb.init script needs to understand the stack plugins. It has to handle the old modules (without stack plugins) and the new modules as well. Instead of loading each and every ocfs2 support module explicitly, it uses module dependencies to pull in the underlying modules. It determines what is needed based on which filesystems are available. That is, it checks whether "configfs", "ocfs2_dlmfs", and "ocfs2" are available filesystems. If any of them is not, it loads the appropriate module. This has the added benefit of working when the drivers are compiled into the kernel, which didn't work before.
If the init script sees the /sys/fs/ocfs2/ files, it knows to load the appropriate stack plugin. Otherwise, it assumes the classic o2cb modules.
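The availability check described above can be sketched in C. This is only an illustration: the init script is a shell script, and the assumption here is that "available filesystems" means the types listed in /proc/filesystems. The names fs_listed() and fs_available() are hypothetical and not part of the tools.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if fs_name appears as a filesystem type in text formatted like
 * /proc/filesystems, where each line is "<flag>\t<name>" and <flag> is
 * either empty or "nodev". Return 0 otherwise.
 */
int fs_listed(const char *listing, const char *fs_name)
{
    size_t want = strlen(fs_name);
    const char *line = listing;

    assert(want > 0);
    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        const char *tab = memchr(line, '\t', len);
        /* The type name is whatever follows the tab on this line. */
        const char *name = tab ? tab + 1 : line;
        size_t name_len = len - (size_t)(name - line);

        /* Compare whole names so "ocfs2" does not match "ocfs2_dlmfs". */
        if (name_len == want && memcmp(name, fs_name, want) == 0)
            return 1;
        line = end ? end + 1 : line + len;
    }
    return 0;
}

/*
 * Check the running kernel (Linux only). Returns 1 or 0 as above, or -1 if
 * /proc/filesystems cannot be read. Assumes the listing fits in the buffer.
 */
int fs_available(const char *fs_name)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/filesystems", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return fs_listed(buf, fs_name);
}
```

A caller would load the matching module only when fs_available() returns 0 for "configfs", "ocfs2_dlmfs", or "ocfs2". Because the check looks at filesystem types rather than loaded modules, it also gives the right answer when the drivers are built into the kernel.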
Once the drivers are loaded, the code splits to handle either o2cb or a userspace cluster stack. The init script is responsible for filling in /sys/fs/ocfs2/cluster_stack. All other tools and operations require this to be filled in.
In the o2cb case, startup proceeds with o2cb_ctl as always. With a userspace stack, the appropriate control daemon is started.
umount.ocfs2(8)
ocfs2 needs to detach from the cluster at unmount time. The existing ocfs2 filesystem calls a program ocfs2_hb_ctl(8) from kernelspace during the unmount process. The actual program is specified via a sysctl, /proc/sys/fs/ocfs2/nm/hb_ctl_path. Recent distributions now support a umount.ocfs2(8) program matching the mount.ocfs2(8) program. When present, umount(8) will call umount.ocfs2(8) instead of doing the work itself. The new tools will require a umount(8) new enough to support this, as userspace clustering requires the use of umount.ocfs2(8).
The nice thing is that this is cleaner than the old kernel-exec method. umount.ocfs2(8) can do the work of ocfs2_hb_ctl(8) as well in the o2cb stack case.
Backwards compatibility must be maintained. Thus, the new filesystem drivers will still call ocfs2_hb_ctl(8) in case older tools, which don't have umount.ocfs2(8), are in use. When new tools are in use, the init script will fill /proc/sys/fs/ocfs2/nm/hb_ctl_path with /bin/true. The kernel, regardless of old or new drivers, will still execute the program, but it will do nothing. Then umount.ocfs2(8) will actually do the work.
Group Join and Leave
The libo2cb library provided functions to start and stop o2hb regions. This is a low-level interface to the o2cb stack only. It is not a good interface to expose to programs, which should have a more generic API.
The API is recast in terms of joining and leaving a "group". In all our cluster stacks, all nodes mounting a particular filesystem - identified by UUID - form a group that can interact. With o2cb, the group is encapsulated via the DLM domain. With userspace stacks, the group is defined by an API specific to that stack.
From the perspective of an ocfs2 tool, however, the API is common:
    errcode_t o2cb_begin_group_join(struct o2cb_cluster_desc *cluster,
                                    struct o2cb_region_desc *desc);
    errcode_t o2cb_complete_group_join(struct o2cb_cluster_desc *cluster,
                                       struct o2cb_region_desc *desc,
                                       int error);
    errcode_t o2cb_group_leave(struct o2cb_cluster_desc *cluster,
                               struct o2cb_region_desc *desc);
The API is based around mounting the filesystem. Before mounting, the tools call o2cb_begin_group_join() to join the group. For o2cb, this starts the heartbeat. For userspace clustering, the userspace stack is asked to join the group (more on this in a later section). When o2cb_begin_group_join() returns successfully, mount.ocfs2(8) may continue and call mount(2). Once the mount(2) call returns, o2cb_complete_group_join() is called with the result. If mount(2) succeeded, the filesystem will be marked happy by the cluster infrastructure. If mount(2) failed, o2cb_complete_group_join() will clean up anything started by o2cb_begin_group_join(). Finally, umount.ocfs2(8) calls o2cb_group_leave() to exit the group after calling umount(2).
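The call ordering can be sketched as follows. This is not the libo2cb implementation: the three API functions are replaced by stand-in stubs that only track state, the descriptor layouts are hypothetical, errcode_t is assumed to be the com_err typedef, and mount(2)/umount(2) are replaced by callbacks so the flow can be exercised without a cluster. The helpers mount_in_group() and umount_from_group() are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef long errcode_t;   /* assumption: matches the com_err typedef */

/* Hypothetical, simplified descriptors; the real layouts are not shown here. */
struct o2cb_cluster_desc { const char *c_stack; const char *c_cluster; };
struct o2cb_region_desc  { const char *r_name; };

/* Stand-in state showing what a cluster stack would track. */
int in_group;     /* 1 while this node is a member of the group */
int mount_live;   /* 1 once the stack has been told the mount succeeded */

/* Stubs with the signatures quoted above; the bodies are illustrative only. */
errcode_t o2cb_begin_group_join(struct o2cb_cluster_desc *cluster,
                                struct o2cb_region_desc *desc)
{
    (void)cluster; (void)desc;
    in_group = 1;
    return 0;
}

errcode_t o2cb_complete_group_join(struct o2cb_cluster_desc *cluster,
                                   struct o2cb_region_desc *desc, int error)
{
    (void)cluster; (void)desc;
    if (error)
        in_group = 0;     /* undo whatever the begin call started */
    else
        mount_live = 1;
    return 0;
}

errcode_t o2cb_group_leave(struct o2cb_cluster_desc *cluster,
                           struct o2cb_region_desc *desc)
{
    (void)cluster; (void)desc;
    in_group = 0;
    mount_live = 0;
    return 0;
}

/* Fake mount(2)/umount(2) outcomes used to drive the sketch. */
int fake_syscall_ok(void)   { return 0; }
int fake_syscall_fail(void) { return -1; }

/* mount.ocfs2-style ordering: join, mount, then report the mount result. */
errcode_t mount_in_group(struct o2cb_cluster_desc *cluster,
                         struct o2cb_region_desc *desc,
                         int (*do_mount)(void))
{
    errcode_t err;
    int rc;

    assert(do_mount != NULL);
    err = o2cb_begin_group_join(cluster, desc);
    if (err)
        return err;       /* never attempt the mount outside the group */
    rc = do_mount();
    err = o2cb_complete_group_join(cluster, desc, rc);
    return rc ? (errcode_t)rc : err;
}

/* umount.ocfs2-style ordering: unmount first, then leave the group. */
errcode_t umount_from_group(struct o2cb_cluster_desc *cluster,
                            struct o2cb_region_desc *desc,
                            int (*do_umount)(void))
{
    int rc;

    assert(do_umount != NULL);
    rc = do_umount();
    if (rc)
        return (errcode_t)rc;   /* still mounted, so stay in the group */
    return o2cb_group_leave(cluster, desc);
}
```

The point of the two-phase join is that the mount result is always reported back, so a failed mount(2) never leaves the node in the group.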
The cluster and region descriptors are filled in by libocfs2. The cluster descriptor describes the current cluster environment, and the region descriptor describes the physical filesystem.
It turns out that only mount.ocfs2(8) and umount.ocfs2(8) use this API directly. Every other tool calls ocfs2_initialize_dlm() and ocfs2_shutdown_dlm() in libocfs2. This is nice, as those functions now use this API, and every tool has a consistent method regardless of the underlying cluster stack.
Inside libo2cb, there are two sets of begin/complete/leave functions; one for o2cb and one for userspace stacks. It is intentional that userspace stacks are treated identically inside the library. Only the stack name differs, and that is validated against /sys/fs/ocfs2/cluster_stack.
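One way to picture the two sets of functions is an operations table chosen by stack name. The sketch below is an assumption about structure, not the libo2cb source: struct group_ops, pick_group_ops(), and the stub functions are hypothetical, and the region is reduced to a string. It assumes the classic stack is named "o2cb" in /sys/fs/ocfs2/cluster_stack and that the file's contents end in a newline.

```c
#include <assert.h>
#include <string.h>

typedef long errcode_t;   /* assumption: matches the com_err typedef */

/* Hypothetical table holding one set of begin/complete/leave functions. */
struct group_ops {
    const char *go_kind;
    errcode_t (*go_begin_join)(const char *region);
    errcode_t (*go_complete_join)(const char *region, int error);
    errcode_t (*go_leave)(const char *region);
};

/* Stubs for the classic o2cb path (would start and stop the heartbeat). */
static errcode_t classic_begin(const char *region)
{ (void)region; return 0; }
static errcode_t classic_complete(const char *region, int error)
{ (void)region; (void)error; return 0; }
static errcode_t classic_leave(const char *region)
{ (void)region; return 0; }

/* Stubs for the userspace path (would talk to the control daemon). */
static errcode_t user_begin(const char *region)
{ (void)region; return 0; }
static errcode_t user_complete(const char *region, int error)
{ (void)region; (void)error; return 0; }
static errcode_t user_leave(const char *region)
{ (void)region; return 0; }

static const struct group_ops classic_ops = {
    "classic", classic_begin, classic_complete, classic_leave
};
static const struct group_ops user_ops = {
    "userspace", user_begin, user_complete, user_leave
};

/*
 * wanted:  stack name the caller expects (from the cluster descriptor).
 * running: contents of /sys/fs/ocfs2/cluster_stack, possibly with a
 *          trailing newline.
 * Returns NULL when the two disagree; otherwise the classic table for
 * "o2cb" and the single shared userspace table for any other name.
 */
const struct group_ops *pick_group_ops(const char *wanted, const char *running)
{
    size_t len;

    assert(wanted != NULL && running != NULL);
    len = strcspn(running, "\n");
    if (strlen(wanted) != len || strncmp(wanted, running, len) != 0)
        return NULL;
    return strcmp(wanted, "o2cb") == 0 ? &classic_ops : &user_ops;
}
```

Every userspace stack maps to the same table here, which mirrors the statement that only the stack name differs between them. The label "pcmk" used when trying this out is just an example name.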
Communicating with a Userspace Stack
When a userspace cluster stack is in use, there will be a control daemon running. When libo2cb needs to query or set cluster information, it will communicate with the daemon. A simple client protocol supports the group join/leave operations as well as listing the clusters, filesystems, and mounters involved with ocfs2. Thus, libo2cb users have a generic API to work with and do not have to know what stack is behind them.
Ensuring the Same Cluster Stack
Two nodes running different cluster stacks cannot share the same filesystem. Every mounter of a filesystem must be running the same cluster stack and be configured for the same cluster name. The ocfs2_cluster_info structure was introduced to cover this case.
    #define OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK 0x0080

    struct ocfs2_cluster_info {
        __u8   ci_stack[OCFS2_STACK_LABEL_LEN];
        __le32 ci_reserved;
        __u8   ci_cluster[OCFS2_CLUSTER_NAME_LEN];
    };
This structure appears on disk in the superblock when the OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK bit is set. If the o2cb stack is in use, the bit is not set and the struct ocfs2_cluster_info is invalid. The ocfs2_fill_cluster_desc() function fills in a struct o2cb_cluster_desc based on the disk's cluster info. Tools pass the cluster descriptor to all cluster operations. The underlying API, in turn, can validate the cluster descriptor against the currently running cluster to ensure they match.
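A rough sketch of how a descriptor might be derived from the on-disk info follows. It is not ocfs2_fill_cluster_desc() itself. The assumptions are: the two length constants are 4 and 16, the superblock is reduced to a hypothetical struct sb_sketch holding an already byte-swapped incompat field, fixed-width stdint types stand in for __u8 and __le32, and the descriptor is a hypothetical struct desc_sketch with NUL-terminated copies. In the classic o2cb case the sketch leaves both strings empty, since the on-disk struct is invalid there.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK 0x0080
#define OCFS2_STACK_LABEL_LEN   4    /* assumed value */
#define OCFS2_CLUSTER_NAME_LEN  16   /* assumed value */

/* Same field order as the on-disk struct, using portable userspace types. */
struct ocfs2_cluster_info {
    uint8_t  ci_stack[OCFS2_STACK_LABEL_LEN];
    uint32_t ci_reserved;
    uint8_t  ci_cluster[OCFS2_CLUSTER_NAME_LEN];
};

/* Hypothetical stand-in for the parts of the superblock that matter here. */
struct sb_sketch {
    uint32_t incompat;               /* feature bits, host byte order */
    struct ocfs2_cluster_info ci;
};

/* Hypothetical descriptor: one extra byte each for the terminator. */
struct desc_sketch {
    char stack[OCFS2_STACK_LABEL_LEN + 1];
    char cluster[OCFS2_CLUSTER_NAME_LEN + 1];
};

/*
 * Fill the descriptor from the superblock. The on-disk arrays are not
 * guaranteed to be NUL-terminated, so copy by length into zeroed buffers.
 */
void fill_desc_sketch(const struct sb_sketch *sb, struct desc_sketch *d)
{
    assert(sb != NULL && d != NULL);
    memset(d, 0, sizeof(*d));

    /* Bit clear: classic o2cb, and the cluster info must not be trusted. */
    if (!(sb->incompat & OCFS2_FEATURE_INCOMPAT_USERSPACE_STACK))
        return;

    memcpy(d->stack, sb->ci.ci_stack, OCFS2_STACK_LABEL_LEN);
    memcpy(d->cluster, sb->ci.ci_cluster, OCFS2_CLUSTER_NAME_LEN);
}
```

Checking the feature bit before touching ci_stack or ci_cluster matters because stale bytes may sit in that area of a classic filesystem's superblock.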
Finally, mount.ocfs2(8) passes the stack name via the cluster_stack option to mount(2). This tells the kernel filesystem what cluster stack is expected - the kernel will validate it as well.
Setting the Stack Name
Obviously, the struct ocfs2_cluster_info must be filled out somewhere. mkfs.ocfs2(8) will fill it in when making a new filesystem. By default, mkfs.ocfs2(8) will look at the currently running cluster and configure a new filesystem to join that current configuration. There are options to specify a different cluster configuration, though they often require skipping cluster safety checks. tunefs.ocfs2(8) has the ability to change this information on a filesystem that is not mounted anywhere.