OCFS2/CManUnderO2CB/CManAndOCFS2ControlD

The OCFS2 Control Daemon

The cman group daemon allows applications and nodes to specify their interest in a resource. All group members can learn the list of members, and they are notified when other members join or leave the group. This is precisely what a cluster filesystem needs to know. The ocfs2_controld daemon will handle interaction between the group daemon and the configfs interface.

Starting With the GFS2 Control Daemon

GFS2 uses a special daemon called gfs_controld to interact with the group daemon. Each filesystem has its own group. When mounting, gfs_controld requests membership in that group. Once it is a member, GFS2 and its DLM can determine which nodes they need to coordinate with for that filesystem. We will be basing ocfs2_controld on gfs_controld.

gfs_controld does a lot more than ocfs2_controld will have to. ocfs2 does journal negotiation and recovery processing entirely in the filesystem driver. GFS2 lets gfs_controld make those decisions. ocfs2_controld won't need that code.

gfs_controld also manages POSIX locks (plocks), such as the ones from fcntl(2) and lockf(2). We don't handle POSIX locks in ocfs2 yet, so we won't incorporate this code either. We may at a later date, or we may implement POSIX locks a different way.

A libgroup Primer

The libgroup library handles application communication with the group daemon. The group.c file implements this interaction. Again, we're copying some boilerplate.

The core libgroup interface is like libcman. The application has a file descriptor to place in a poll(2) loop, and a group_dispatch(3) function to call when the fd has data. However, libgroup has multiple callbacks and a more complex protocol than libcman. Each callback matches a message, so it makes more sense to describe them in terms of the message.
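The loop shape described above can be sketched without libgroup itself. In this minimal sketch, any readable descriptor (a pipe, say) stands in for the group daemon's fd, and fake_dispatch() is a hypothetical stand-in for group_dispatch(3). Neither the callback type nor the helper names come from the real library.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Callback type for this sketch only; libgroup has one callback per message. */
typedef void (*message_cb)(const char *msg);

static char last_message[64];

/* Example callback: remember the most recent message text. */
static void remember_message(const char *msg)
{
    snprintf(last_message, sizeof(last_message), "%s", msg);
}

/* Hypothetical stand-in for group_dispatch(3): read one chunk from fd and
 * hand it to the callback.  Returns bytes read, 0 on EOF, -1 on error. */
static int fake_dispatch(int fd, message_cb cb)
{
    char buf[64];
    ssize_t n = read(fd, buf, sizeof(buf) - 1);

    if (n <= 0)
        return (int)n;
    buf[n] = '\0';
    cb(buf);
    return (int)n;
}

/* Poll fd and dispatch until EOF, hangup, or a one-second idle timeout.
 * Returns the number of dispatch calls that delivered data, or -1 on error. */
static int run_loop(int fd, message_cb cb)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int handled = 0;

    assert(cb != NULL);
    for (;;) {
        int rc = poll(&pfd, 1, 1000);

        if (rc < 0)
            return -1;
        if (rc == 0)
            break;                  /* idle timeout */
        if (!(pfd.revents & POLLIN))
            break;                  /* hangup or error with no data left */
        rc = fake_dispatch(fd, cb);
        if (rc < 0)
            return -1;
        if (rc == 0)
            break;                  /* EOF */
        handled++;
    }
    return handled;
}
```

A real daemon would poll several descriptors at once (the group daemon, cman, client sockets) and would not exit on an idle timeout. The single-fd loop is only meant to show where the dispatch call sits.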

The first message is "stop". When an application receives a stop message, it must pause anything it is doing with the shared resource. For example, GFS2 blocks the filesystem. The group daemon sends the stop message before any group configuration changes. This prevents the application from doing anything with the resource while the configuration is being changed. The stop message has only one argument, the name of the group to stop. When an application has completed its stop processing, it calls the group_stop_done(3) function.

The "start" message is the most important message. At its most basic, it tells the application to prepare to restart access to the resource. However, it also carries with it the new group configuration and the reason the group changed. The configuration is a list of node IDs. The reason for the change is one of GROUP_NODE_JOIN, GROUP_NODE_LEAVE, or GROUP_NODE_FAILED. JOIN and LEAVE are pretty obvious. It is especially nice to be told that a node has failed, because this allows application logic to take recovery steps. Note that this message only notifies the application that the group is starting. The start is not considered complete until a "finish" message has been received as well. The message has five or more arguments:

The group name that is starting
A unique event number (to match with a "finish" message)
The reason for the change (JOIN/LEAVE/FAILED)
The number of members
The list of member node IDs (the fifth and any further arguments)

The application calls the group_start_done(3) function to tell the group daemon it has completed start processing.
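To make the start arguments concrete, here is a minimal sketch that gathers them into a struct and compares the new member list with a previously cached one. The struct, the REASON_* codes, and the helper names are all hypothetical. libgroup passes these values as separate callback arguments and defines its own GROUP_NODE_* constants.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder reason codes for this sketch; not libgroup's values. */
enum change_reason { REASON_JOIN, REASON_LEAVE, REASON_FAILED };

/* The pieces of a start message, gathered in one hypothetical struct. */
struct start_msg {
    const char *group;          /* group name that is starting */
    int event_nr;               /* matched later by the finish message */
    enum change_reason reason;  /* why the configuration changed */
    int member_count;           /* number of entries in members[] */
    const int *members;         /* node IDs in the new configuration */
};

static int is_member(const int *nodes, int count, int node)
{
    int i;

    for (i = 0; i < count; i++)
        if (nodes[i] == node)
            return 1;
    return 0;
}

/* Copy into out[] each node of a[] that is absent from b[]; return how many.
 * out[] must have room for a_count entries. */
static int nodes_missing_from(const int *a, int a_count,
                              const int *b, int b_count, int *out)
{
    int i, n = 0;

    assert(out != NULL);
    for (i = 0; i < a_count; i++)
        if (!is_member(b, b_count, a[i]))
            out[n++] = a[i];
    return n;
}

/* Nodes in the cached list that the new start message no longer lists. */
static int departed_nodes(const int *cached, int cached_count,
                          const struct start_msg *msg, int *out)
{
    return nodes_missing_from(cached, cached_count,
                              msg->members, msg->member_count, out);
}

/* Nodes in the new start message that were not in the cached list. */
static int arrived_nodes(const int *cached, int cached_count,
                         const struct start_msg *msg, int *out)
{
    return nodes_missing_from(msg->members, msg->member_count,
                              cached, cached_count, out);
}
```

With a cached list of {1, 2, 3} and a start message listing {1, 3, 4}, departed_nodes() reports node 2 and arrived_nodes() reports node 4. The reason code then says whether a departed node left cleanly or failed.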

The "finish" message is sent when all members have started. Until the finish message is received, the group is in transition and the application should not be doing any work against the resource. Once finish is received, the application can access the resource freely. The finish message has two arguments: the name of the group and the event number to match the start message.
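The stop, start, and finish rules can be read as a small state machine. The sketch below is one way to track whether the resource may be touched. The state names and tracker functions are hypothetical, and the strictness (a start is only accepted while stopped) is an assumption of this sketch rather than something libgroup enforces.

```c
#include <assert.h>
#include <stddef.h>

enum group_state { GS_STOPPED, GS_STARTING, GS_RUNNING };

struct group_tracker {
    enum group_state state;
    int pending_event;      /* event number of the start awaiting finish */
};

/* A freshly joined group has not started yet, so begin in the stopped state. */
static void tracker_init(struct group_tracker *t)
{
    assert(t != NULL);
    t->state = GS_STOPPED;
    t->pending_event = -1;
}

/* stop: pause all work on the resource, whatever state we were in. */
static void tracker_stop(struct group_tracker *t)
{
    t->state = GS_STOPPED;
    t->pending_event = -1;
}

/* start: remember the event number; access is still not allowed.
 * Returns 0 on success, -1 if no stop preceded this start. */
static int tracker_start(struct group_tracker *t, int event_nr)
{
    if (t->state != GS_STOPPED)
        return -1;
    t->state = GS_STARTING;
    t->pending_event = event_nr;
    return 0;
}

/* finish: only a finish matching the pending start reopens the resource.
 * Returns 0 on success, -1 on a mismatch. */
static int tracker_finish(struct group_tracker *t, int event_nr)
{
    if (t->state != GS_STARTING || t->pending_event != event_nr)
        return -1;
    t->state = GS_RUNNING;
    return 0;
}

static int tracker_may_access(const struct group_tracker *t)
{
    return t->state == GS_RUNNING;
}
```

The point of the event number is visible in tracker_finish(): a finish that does not match the most recent start is ignored, so the group stays in transition.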

The "terminate" message is sent when an application has left the group. All other members have stopped, and once

The "setid" message sends the group's ID number to the application. The message requires no response. The message has two arguments, the name of the group and the id.

The "deliver" message delivers a data string from another node. Applications can use the group_send(3) function to send a data string to all group members. They receive the message via the deliver message. The message has four arguments:

The group name
The node ID of the sender
The length of the string
The string

We won't be using the "deliver" message in ocfs2_controld.

An Application Lifecycle

How does this work for an application? Once libgroup has initialized, the application calls the group_join(3) function to request membership in a group. The application will receive its first start message for the group with itself as a member. It will also receive a setid message with the group ID number.

The application will continue to receive (stop, start, finish) message sets at every group configuration change. These can happen often or rarely, depending on what the group is up to.

Finally, when the application is done, it calls group_leave(3). It will receive a final terminate message.

Event Coding Style

gfs_controld and dlm_controld both abstract this interface away to an event coding style. Their callbacks merely set up static global variables with the most recent message as an event (DO_STOP, DO_START, DO_FINISH, etc). This allows the callback to return immediately. The real work is not done underneath group_dispatch(3), but after it has returned. As each event is processed sequentially, it is safe to store the event data in static global variables.

After group_dispatch(3) has returned, the daemon runs the action associated with the event received. ocfs2_controld will copy this pretty much verbatim.
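A minimal sketch of this style follows. The DO_STOP, DO_START, and DO_FINISH names come from the description above; DO_NONE, the variable names, and the simplified callback signatures are assumptions made for illustration and do not match the real libgroup callback prototypes.

```c
#include <assert.h>
#include <stdio.h>

enum event_action { DO_NONE, DO_STOP, DO_START, DO_FINISH };

/* Static globals holding the most recent message.  This is safe only
 * because events are processed one at a time. */
static enum event_action cb_action = DO_NONE;
static char cb_name[64];
static int cb_event_nr;

static void record(enum event_action action, const char *name, int event_nr)
{
    cb_action = action;
    snprintf(cb_name, sizeof(cb_name), "%s", name);
    cb_event_nr = event_nr;
}

/* The callbacks do no real work; they save the event and return at once. */
static void stop_cb(const char *name)
{
    record(DO_STOP, name, 0);
}

static void start_cb(const char *name, int event_nr)
{
    record(DO_START, name, event_nr);
}

static void finish_cb(const char *name, int event_nr)
{
    record(DO_FINISH, name, event_nr);
}

/* Called after the dispatch function has returned.  Describes the work for
 * the saved event into out[], clears the event, and returns its action. */
static enum event_action process_saved_event(char *out, size_t len)
{
    enum event_action action = cb_action;

    assert(out != NULL && len > 0);
    switch (action) {
    case DO_STOP:
        snprintf(out, len, "stop %s", cb_name);
        break;
    case DO_START:
        snprintf(out, len, "start %s event %d", cb_name, cb_event_nr);
        break;
    case DO_FINISH:
        snprintf(out, len, "finish %s event %d", cb_name, cb_event_nr);
        break;
    case DO_NONE:
        out[0] = '\0';
        break;
    }
    cb_action = DO_NONE;
    return action;
}
```

In a daemon, the main loop would call the dispatch function and then process_saved_event() (or its equivalent) once per wakeup, doing the stop, start, or finish work where this sketch only formats a string.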

Mount Groups

There is one gfs_controld for an entire system. It handles all GFS2 mounts. It does this via the "mountgroup" structure. A mountgroup encompasses all mounts of a particular filesystem (device). When a filesystem is mounted for the first time, the mountgroup is created. The daemon calls group_join(3) for that filesystem and handles all messages for that group. If a filesystem is mounted again at a second location, the daemon increments a count on the mountgroup and stores the new mountpoint location. Only when the last mountpoint is removed does the daemon leave the group. The mountgroup also caches the list of member nodes, so state changes can be determined by comparing a new configuration against the cached list.
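The counting behaviour can be sketched as follows. This is not the gfs_controld structure; the field names, the fixed-size mountpoint table, and the helper functions are hypothetical, and the in_group flag stands in for actually calling group_join(3) and group_leave(3).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MG_MAX_MOUNTS 4
#define MG_PATH_LEN   128

struct mountgroup {
    char name[64];          /* one mountgroup per filesystem */
    int in_group;           /* stand-in for group membership */
    int mount_count;        /* number of active mountpoints */
    char mountpoints[MG_MAX_MOUNTS][MG_PATH_LEN];
};

static void mg_init(struct mountgroup *mg, const char *name)
{
    assert(mg != NULL && name != NULL);
    memset(mg, 0, sizeof(*mg));
    snprintf(mg->name, sizeof(mg->name), "%s", name);
}

/* Record a mountpoint.  Returns 1 if this was the first mount (the point
 * where a daemon would join the group), 0 for an additional mount, and
 * -1 if the table is full. */
static int mg_add_mount(struct mountgroup *mg, const char *path)
{
    int first = (mg->mount_count == 0);

    if (mg->mount_count == MG_MAX_MOUNTS)
        return -1;
    snprintf(mg->mountpoints[mg->mount_count], MG_PATH_LEN, "%s", path);
    mg->mount_count++;
    if (first)
        mg->in_group = 1;
    return first;
}

/* Forget a mountpoint.  Returns 1 if it was the last one (the point where
 * a daemon would leave the group), 0 if others remain, -1 if unknown. */
static int mg_remove_mount(struct mountgroup *mg, const char *path)
{
    int i;

    for (i = 0; i < mg->mount_count; i++) {
        if (strcmp(mg->mountpoints[i], path) != 0)
            continue;
        mg->mount_count--;
        /* Move the last entry into the freed slot. */
        if (i != mg->mount_count)
            memcpy(mg->mountpoints[i],
                   mg->mountpoints[mg->mount_count], MG_PATH_LEN);
        if (mg->mount_count == 0) {
            mg->in_group = 0;
            return 1;
        }
        return 0;
    }
    return -1;
}
```

Mounting the same filesystem at /mnt/a and then /mnt/b joins once; only removing the second of the two mountpoints makes mg_remove_mount() return 1 and clears in_group.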

The mountgroup contains a bunch of other information ocfs2 won't need. It contains recovery information for negotiating journals and recovery, which ocfs2 does in the kernel. It also contains the POSIX lock state, which can be added later if we go that route.

Mount Protocols

Cluster mounting is complex, especially when userspace daemons do some of the work. Let's look at the current state.

The O2HB Protocol

The mount.ocfs2(8) program starts o2hb via ocfs2_hb_ctl(8). ocfs2_hb_ctl uses libo2cb to fill in the heartbeat region information. Once the region has started heartbeating with disk heartbeat, ocfs2_hb_ctl returns success. mount.ocfs2 is then safe to call mount(2).

The ocfs2 filesystem driver assumes that heartbeat is up and running when the mount process starts. It checks for a running heartbeat, then completes the mount process. It is coded to handle other nodes coming up or going down.

The GFS2 Protocol

The mount.gfs2(8) program opens a unix socket to gfs_controld. It asks gfs_controld to configure the mount via the "join" message, causing gfs_controld to join the group and create the mountgroup structure. Once the group is joined, gfs_controld negotiates any recovery action with the other members, then determines what journal the local node will use. The mount.gfs2 program is notified of the journal and other information. It then calls mount(2).

When mount(2) returns, mount.gfs2 sends gfs_controld the "mount_result" message. gfs_controld tells all other members the result of the mount. In the case of a failed mount, the other nodes know to consider recovery.

For unmount, the umount.gfs2(8) program sends the "leave" message after calling umount(2). gfs_controld can then safely leave the group.

There is a "remount" message when a node does "mount -o remount". It exists to have gfs_controld pass the new rw/ro state to other nodes.
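The four message names above ("join", "mount_result", "leave", "remount") can be modelled as a tiny text protocol. The sketch below assumes a wire format of the message name, one space, and a single argument string. That format is purely hypothetical and is not the actual gfs_controld wire format; only the message names come from the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum mount_msg { MSG_JOIN, MSG_MOUNT_RESULT, MSG_LEAVE, MSG_REMOUNT, MSG_COUNT };

static const char *msg_names[MSG_COUNT] = {
    "join", "mount_result", "leave", "remount"
};

/* Build "<name> <arg>" into buf.  Returns the length written, or -1 if
 * the message does not fit. */
static int format_msg(char *buf, size_t len, enum mount_msg type,
                      const char *arg)
{
    int n;

    assert(type < MSG_COUNT);
    n = snprintf(buf, len, "%s %s", msg_names[type], arg);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/* Split "<name> <arg>" back into a type and an argument string.
 * Returns 0 on success, -1 if the name is not recognized. */
static int parse_msg(const char *buf, enum mount_msg *type,
                     char *arg, size_t arglen)
{
    int i;

    assert(arglen > 0);
    for (i = 0; i < MSG_COUNT; i++) {
        size_t n = strlen(msg_names[i]);

        if (strncmp(buf, msg_names[i], n) != 0)
            continue;
        if (buf[n] == '\0') {           /* name with no argument */
            arg[0] = '\0';
        } else if (buf[n] == ' ') {     /* name followed by an argument */
            snprintf(arg, arglen, "%s", buf + n + 1);
        } else {
            continue;                   /* only a prefix matched */
        }
        *type = (enum mount_msg)i;
        return 0;
    }
    return -1;
}
```

In the flow described above, the mount helper would send a join message, wait for the daemon's reply, call mount(2), and then send mount_result. The unmount helper would send leave after umount(2).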

A New OCFS2 Protocol

For o2hb-based mounts, there is no reason to change the current mount.ocfs2 protocol. However, to use cman and ocfs2_controld, mount.ocfs2 will have to communicate with ocfs2_controld. The gfs2 communication scheme was pretty haphazard, so I reworked it somewhat into a new control message protocol. ocfs2_controld will return control to mount.ocfs2 as soon as the group is joined. There is no journal or recovery negotiation. When the mount result is sent to ocfs2_controld, it won't need to tell other nodes of our mount state. They will handle it in the kernel as they do today.

It's Working

The ocfs2_controld daemon is now checked in. It handles mount and unmount requests properly. The next step is to integrate with mount.ocfs2. We also need to modify o2cb.init to start the appropriate services.

Also, we need to synchronize with o2cb_controld for node configuration changes.