OCFS2/DesignDocs/UserspaceClustering/ControlDaemon

Running OCFS2 With a Userspace Cluster Stack - The Control Daemon

JoelBecker, December 2007

Introduction

The ocfs2 filesystem is currently wedded to the o2cb cluster implementation. Some changes will need to be made to support userspace cluster stacks. This document describes the control daemon, which interacts with the userspace cluster stack and acts as the go-between for libo2cb. The control daemon is started out of the o2cb init script.

Each userspace cluster stack will necessarily have its own control daemon. Each stack has its own methods of determining node information and group membership, and the daemon's job is to interact with the rest of ocfs2 in a generic fashion, hiding the stack-specific work.

The interaction with ocfs2 is (hopefully) encapsulated in a simple set of libo2cb APIs.

Daemon Name

Each userspace cluster stack will have its own control daemon. The daemon is named ocfs2_controld.<stackname>, where <stackname> is the four-character name written to the ocfs2 superblock by mkfs.ocfs2(8) and to /sys/fs/ocfs2/cluster_stack by o2cb.init. This allows o2cb.init to start the daemon given the stack name. For the rest of this document, I will just say "ocfs2_controld" or "the control daemon".

Basic Daemon Operation

Just about every daemon will follow the same basic steps:

  1. At startup, use libo2cb to verify that ocfs2's idea of the cluster stack matches the daemon's. A daemon for the "fooo" cluster stack (ocfs2_controld.fooo) is useless if ocfs2 wants the "baar" stack.

  2. Join the cluster and gather cluster information such as node configurations.
  3. Communicate with ocfs2_controld processes running on other cluster nodes to ensure compatibility. This includes the daemon's communication protocol and the ocfs2 filesystem protocols read from /sys/fs/ocfs2/locking_protocol. The daemon should endeavor to provide as much backwards and forwards compatibility as possible.

  4. Connect to the filesystem control device, /dev/misc/ocfs2_control. This tells the filesystem that the daemon is ready to process cluster events.

  5. Start the main loop, listening for requests from ocfs2 tools. This is the main guts of the daemon, where groups are joined and left as mounts come and go.

  6. If a node dies, the daemon notifies the filesystem via the control device.
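
To make the flow concrete, here is a rough skeleton that strings these steps together. It is only a sketch: the o2cb_* calls are the libo2cb APIs described in the sections below, while stack_connect(), negotiate_protocols(), run_main_loop(), stack_disconnect(), and local_nodeid are hypothetical placeholders for the stack-specific pieces.

/* Hedged sketch of the startup sequence above, not a real implementation. */
#include <string.h>
#include <o2cb/o2cb.h>                          /* libo2cb header; the name is an assumption */

int main(int argc, char *argv[])
{
        errcode_t err;
        const char *stack;
        unsigned int local_nodeid = 0;          /* would come from the cluster stack */

        err = o2cb_init();                      /* step 1 */
        if (err)
                return 1;
        err = o2cb_get_stack_name(&stack);
        if (err || strcmp(stack, "fooo"))       /* this daemon speaks "fooo" */
                return 1;

        if (stack_connect())                    /* step 2: join the cluster */
                return 1;
        if (negotiate_protocols())              /* step 3: daemon and fs protocols */
                return 1;

        err = o2cb_control_open(local_nodeid);  /* step 4: /dev/misc/ocfs2_control */
        if (err)
                return 1;

        run_main_loop();                        /* steps 5 and 6 */

        o2cb_control_close();
        stack_disconnect();
        return 0;
}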

Connecting to libo2cb and Checking the Stack

errcode_t o2cb_init(void);
errcode_t o2cb_get_stack_name(const char **name);

When a control daemon starts up, it first connects to libo2cb with o2cb_init(). This will succeed if the ocfs2 drivers and configuration have been loaded. That should be the common case, as the o2cb init script is responsible for first loading the drivers, then starting the daemon.

Next, the daemon asks for the stack name with o2cb_get_stack_name(). Obviously, this should match the daemon. If the daemon is for a different stack, it should exit.
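
As a minimal sketch of this check, assuming the libo2cb header name and treating "fooo" as the stack this daemon was built for:

#include <stdio.h>
#include <string.h>
#include <o2cb/o2cb.h>          /* header name is an assumption */

#define OUR_STACK "fooo"        /* the stack this daemon was built for */

static int check_stack(void)
{
        errcode_t err;
        const char *stack;

        err = o2cb_init();
        if (err) {
                fprintf(stderr, "Unable to initialize libo2cb\n");
                return -1;
        }

        err = o2cb_get_stack_name(&stack);
        if (err) {
                fprintf(stderr, "Unable to determine the active cluster stack\n");
                return -1;
        }

        if (strcmp(stack, OUR_STACK)) {
                fprintf(stderr, "Active stack is \"%s\", not \"%s\"; exiting\n",
                        stack, OUR_STACK);
                return -1;
        }

        return 0;
}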

Connecting to the Cluster

This portion is stack-specific. Here, the daemon connects to the stack using whatever API the stack exposes. It gathers the list of nodes and sets up communication to other nodes for intra-daemon messaging.

Once the cluster connection is happy, the daemon may proceed.

Negotiating the Daemon Protocol

There will be a control daemon on each cluster node mounting ocfs2 filesystems. They must communicate with each other, and they must ensure compatibility between the filesystem drivers on each node. Thus, before doing any filesystem work, a daemon must negotiate compatibility with other daemons.

If a daemon is the first to start in a cluster (no other nodes are running ocfs2_controld), it chooses the highest level of features supported by itself and the local filesystem driver. It can then go on to the next stage of startup.

Otherwise, it sends a message to existing daemons asking what protocol they are speaking. This is an intra-daemon protocol, and is up to the daemon implementor. The reason we suggest a protocol negotiation is so that rolling upgrades can happen. If this daemon does not understand the protocol being used by the existing daemons, it must exit.

Negotiating the Filesystem Capabilities

Next, it sends the filesystem compatibility information found in /sys/fs/ocfs2/locking_protocol. Here, the filesystem driver specifies its capabilities. Once again, if the filesystem driver is not compatible with what the other nodes are doing, the daemon must exit.

/*
 * TODO
 * describe o2cb functions for reading locking_protocol
 */
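
Until those helpers are described, a hedged sketch that reads the version straight from sysfs might look like the following; the "major.minor" text format of the file is an assumption here.

#include <stdio.h>

/* Hedged sketch: parse /sys/fs/ocfs2/locking_protocol, assuming it
 * contains the version as "major.minor" text. */
static int read_locking_protocol(unsigned int *major, unsigned int *minor)
{
        FILE *f = fopen("/sys/fs/ocfs2/locking_protocol", "r");
        int rc = -1;

        if (!f)
                return -1;

        if (fscanf(f, "%u.%u", major, minor) == 2)
                rc = 0;

        fclose(f);
        return rc;
}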

The locking_protocol specifies a major and minor number for the filesystem. The major and minor numbers are values from 0-255. This daemon sends these values to the existing daemons as follows:

  1. This daemon sends the [major,minor] over to other daemons
    1. At an existing daemon, if the major does not match, it will return a rejection. The major number is a "break the world" number. Filesystems with different locking major numbers will never be allowed to interact.
    2. If the existing daemon is using a minor number greater than the request, it will return a rejection. This is because the existing daemon is running at a level not understood by the new daemon. Running daemons will never change their feature level once they have negotiated it.
    3. If the existing daemon is running a minor number less than or equal to the request, it will return an OK status as well as the currently running minor.
  2. If the response is a rejection, the new daemon exits.
  3. If the response is an OK, the new daemon sets itself to use the minor number in the response. It communicates at this negotiated [major,minor] level.

The daemon then tells the filesystem what minor to use by writing to the locking_protocol file. Now the filesystem is compatible with all the other nodes.
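
Putting the rules in the list above into code, the check performed by an existing daemon might look roughly like this. The message structures are hypothetical; only the comparison logic comes from the description above.

/* Hedged sketch of the checks in the numbered list above, as run by an
 * existing daemon when a new daemon asks to join. */
struct proto_request {
        unsigned char major;
        unsigned char minor;
};

struct proto_response {
        int ok;                 /* 0 = rejection, 1 = OK */
        unsigned char minor;    /* currently running minor, valid when ok */
};

static void negotiate(const struct proto_request *req,
                      unsigned char running_major, unsigned char running_minor,
                      struct proto_response *resp)
{
        /* Different majors never interact. */
        if (req->major != running_major) {
                resp->ok = 0;
                return;
        }

        /* The cluster is running a level the new daemon can't understand. */
        if (running_minor > req->minor) {
                resp->ok = 0;
                return;
        }

        /* Compatible: tell the new daemon which minor is in use. */
        resp->ok = 1;
        resp->minor = running_minor;
}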

Connecting to the Filesystem Control Device

errcode_t o2cb_control_open(unsigned int this_node);
void o2cb_control_close(void);
errcode_t o2cb_control_node_down(const char *uuid, unsigned int nodeid);

ocfs2 cannot allow any mounts unless it is sure that it will get node down events from the cluster stack. Thus, the control daemon opens the filesystem control device, /dev/misc/ocfs2_control. This device is where the daemon sends node down events as well.

The device has a protocol to tell the filesystem information, but the daemon doesn't need to know about it. The daemon calls o2cb_control_open() with the local node number. If it fails, the daemon exits. If it succeeds, the filesystem now knows that the daemon is alive and connected to the cluster.

At exit, the daemon closes the device with o2cb_control_close(). If the device is closed while any ocfs2 filesystem is mounted, the filesystem driver will reboot the system - mounts are unsafe without a connection to the cluster.

The daemon learns about nodes crashing via its cluster connection. When it finds that a node has crashed, it notifies the filesystem by calling o2cb_control_node_down().
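
A hedged sketch tying these three calls together; the local node number and the (uuid, nodeid) pair for a dead node are assumed to come from the cluster connection, and the uuid argument is taken to identify the mounted filesystem.

static int have_control_device;

/* Open /dev/misc/ocfs2_control via libo2cb; no mounts are safe until
 * this succeeds. */
static int setup_control_device(unsigned int local_node)
{
        if (o2cb_control_open(local_node))
                return -1;

        have_control_device = 1;
        return 0;
}

/* Called when the cluster stack reports a node death. */
static void report_node_death(const char *uuid, unsigned int nodeid)
{
        o2cb_control_node_down(uuid, nodeid);
}

/* Called at daemon exit, after all mounts are gone. */
static void teardown_control_device(void)
{
        if (have_control_device)
                o2cb_control_close();
}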

Handling Client Connections

typedef enum {
        CM_MOUNT,
        CM_MRESULT,
        CM_UNMOUNT,
        CM_STATUS,
        CM_LISTFS,
        CM_LISTMOUNTS,
        CM_LISTCLUSTERS,
        CM_ITEMCOUNT,
        CM_ITEM,
} client_message;

int ocfs2_client_listen(void);
int send_message(int fd, client_message message, ...);
int receive_message(int fd, char *buf, client_message *message, char **argv);

The daemon is now ready to serve requests. It starts listening on a local socket via the ocfs2_client_listen() function. When there is data on the socket, it uses the receive_message() function to pull off a message. receive_message() will return an error if the client sent an invalid message; the daemon should not exit, though it can clean up after the client and log an error. The daemon then performs an action based on the message (CM_XXX). The argv array is a list of arguments to the message. See the client protocol page for more details on the messages.

When the daemon needs to send a response, it uses the send_message() function. This function takes arguments in their natural form (numbers as numbers, not strings) and converts them as appropriate.
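
Here is a hedged sketch of one pass through the client handling code. The buffer size, the CM_STATUS reply convention, and the do_mount()/do_unmount() helpers are assumptions; the real message formats are on the client protocol page.

#include <errno.h>
#include <unistd.h>

#define CLIENT_MAX_ARGS 16      /* assumption; see the client protocol page */

static void handle_client(int client_fd)
{
        char buf[1024];
        char *argv[CLIENT_MAX_ARGS];
        client_message message;
        int rc;

        rc = receive_message(client_fd, buf, &message, argv);
        if (rc < 0) {
                /* Invalid message: clean up after the client and log it,
                 * but do not exit the daemon. */
                close(client_fd);
                return;
        }

        switch (message) {
        case CM_MOUNT:
                /* Join the group for the requested filesystem, then reply. */
                rc = do_mount(argv);
                send_message(client_fd, CM_STATUS, rc);
                break;
        case CM_UNMOUNT:
                /* Leave the group, then reply. */
                rc = do_unmount(argv);
                send_message(client_fd, CM_STATUS, rc);
                break;
        default:
                send_message(client_fd, CM_STATUS, EINVAL);
                break;
        }
}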

Exiting

The daemon cannot exit while mounts exist. The exceptions are losing the connection to the cluster stack, losing the client listening socket, and receiving an unrecoverable signal.

These basically fall into two categories. The first category is "lost connection to something important". If the cluster software has exited, this node is no longer part of the cluster. It SHOULD die. The same goes if the listening socket has a problem - clients can't be handled, the node SHOULD die. The other category is "unrecoverable errors". SIGSEGV and SIGQUIT should try to exit gracefully - the filesystem will crash when the control device is closed. SIGKILL obviously isn't handled, and the filesystem will crash when the process releases the control device as well.

Otherwise, SIGHUP/PIPE/TERM/INT/etc should exit if there are no mounts (normal shutdown) and refuse to exit if there are mounts. When refusing, just log the signal and do nothing more.
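
A hedged sketch of that policy, deferring the signal to a flag checked from the main loop; mount_count(), log_error(), and begin_shutdown() are hypothetical helpers.

#include <signal.h>

static volatile sig_atomic_t exit_requested;

/* Installed (with sigaction() or similar) for HUP/PIPE/TERM/INT. */
static void handle_exit_signal(int sig)
{
        exit_requested = 1;
}

/* Checked from the main loop after each pass. */
static void maybe_exit(void)
{
        if (!exit_requested)
                return;

        if (mount_count() > 0) {
                /* Mounts exist: log the request and keep running. */
                log_error("exit requested while filesystems are mounted; ignoring");
                exit_requested = 0;
                return;
        }

        begin_shutdown();       /* normal shutdown path described below */
}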

When the daemon is shutting down, it should send node down events for all mounted filesystems. This gives the filesystem a better chance to recover as best it can. In the future, it may be able to handle the node down events and then respond safely to the control device closing. Next, the daemon closes the control device. Finally, it closes any cluster connections.

Implementing

There is an implementation for CMan, ocfs2_controld.cman, in the ocfs2-tools tree. Currently, it lives on the stack-glue branch of the ocfs2-tools git repository. This implementation uses the underlying OpenAIS Closed Process Group (CPG) service to manage filesystem groups. It supports everything we have in the kernel, and is capable of mounting filesystems and recovering from node death. It is not heavily tested, however.

