OCFS2/DesignDocs/UserspaceClustering

Running OCFS2 With a Userspace Cluster Stack

JoelBecker, October 2007

Introduction

[Jeff, on the OCFS2/LargeTasksList page]

Cluster membership is a solved problem in the high availability arena, and inputting node membership from a user space high availability system enables OCFS2 to take advantage of those algorithms. It also allows true external device fencing, such as terminating a node's access to a SAN device by removing permission in the SAN fabric, giving OCFS2 better options than panic()ing when it loses connectivity.

This is but one advantage. OCFS2 also gains interoperability with userspace applications that use the cluster stack. The cluster stack is debuggable with userspace tools and without kernel crashes.

We have looked at a number of avenues for interoperability. We tried to use o2cb as the stack and drive fs/dlm from it. This didn't work well. We eventually went the opposite direction, creating a project to feed o2cb from userspace clustering stacks. The experience is documented on OCFS2/CManUnderO2CB. With this sort of approach, we were able to support multiple userspace cluster stacks, including CMan and Linux-HA. However, it is a complex approach. The o2cb kernel drivers are changed immensely to handle userspace interaction. Multiple stack combinations (o2hb + o2dlm, CMan + o2dlm, Linux-HA + o2dlm, and CMan + fs/dlm just to start) create a big headache. Each combination must be tested, and the filesystem needs to understand the capabilities that one combination has.

Rather than fight complexity, we should provide two clean solutions: all userspace, or all kernelspace. The "classic" o2cb stack remains largely unchanged. This is very good for stability, as that code is well tested. Userspace clustering will be described in this design document.

Goals and Requirements

The ocfs2 filesystem should know as little about the underlying cluster infrastructure as possible. The filesystem should not care which stack is in use. It should just be able to register its needs and query the functionality available.

The filesystem should not need to know node names or addresses. It should not care when a node joins or leaves the filesystem. The cluster stack and lock manager handle those details when things go well. The filesystem only needs to know when a node needs to be recovered.

The filesystem should not need to do different things based upon the stack. ocfs2 should make the same function calls regardless of stack. A glue layer will translate the calls as needed.
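The glue-layer idea can be pictured with a small toy model. This is only a sketch in Python, not kernel code and not the actual ocfs2 interface: the names StackPlugin, Glue, make_toy_plugin, and mount, and the stack labels "o2cb" and "user", are all hypothetical and exist only for illustration. It shows filesystem-side code making the same call whichever stack was selected.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class StackPlugin:
    """Hypothetical per-stack operations table (illustrative names only)."""
    name: str
    connect: Callable[[str], str]      # group name -> opaque handle
    disconnect: Callable[[str], None]  # opaque handle -> None


class Glue:
    """Toy glue layer: forwards generic calls to the selected stack."""

    def __init__(self) -> None:
        self._plugins: Dict[str, StackPlugin] = {}
        self._active: Optional[StackPlugin] = None

    def register(self, plugin: StackPlugin) -> None:
        self._plugins[plugin.name] = plugin

    def select(self, name: str) -> None:
        # Chosen by configuration, never by the filesystem itself.
        self._active = self._plugins[name]

    def _stack(self) -> StackPlugin:
        if self._active is None:
            raise RuntimeError("no cluster stack selected")
        return self._active

    def connect(self, group: str) -> str:
        return self._stack().connect(group)

    def disconnect(self, handle: str) -> None:
        self._stack().disconnect(handle)


def make_toy_plugin(name: str) -> StackPlugin:
    """Build a stand-in stack that only tags handles with its own name."""
    def connect(group: str) -> str:
        return f"{name}:{group}"

    def disconnect(handle: str) -> None:
        return None

    return StackPlugin(name=name, connect=connect, disconnect=disconnect)


def mount(glue: Glue, group: str) -> str:
    """Filesystem-side code: identical no matter which stack is selected."""
    return glue.connect(group)
```

In this model, swapping stacks means calling select() with a different name; mount() itself never changes and never names a stack.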

The "classic" o2cb stack should remain as-is. It should still be a simple method by which anyone can configure a cluster and a cluster filesystem. Small changes made to genericize the ocfs2<->stack interface are OK. When using the o2cb stack, only the o2dlm lock manager is used.

The fs/dlm lock manager is used when a userspace cluster stack is chosen. There is only one viable DLM these days, and we don't need to try to grow any more. Any userspace cluster stack must be able to drive fs/dlm (both Linux-HA and CMan can).

A userspace cluster stack is responsible for safe group management, fencing, and recovery notification. The filesystem will expect this to be done safely and correctly. Outside of cluster locking, the filesystem will not take part in any inter-node communication.

The ocfs2 tools should be able to operate on the filesystem regardless of stack. This includes joining a group when mounting, leaving a group when unmounting, and locking out filesystems when running tools like fsck and tunefs. The initialization scripts should be able to connect to the correct stack at system start. This requirement will be met for "officially" supported stacks, and we will work with other stacks to try to make it work for everyone.

The ocfs2 tools should be able to configure and modify cluster information for all "officially" supported stacks. Again, we'll work with other stacks to try and make them happy too. In this fashion, users accustomed to the "classic" stack will be able to switch to any other supported stack and continue to use the same tools.

The ocfs2 tools should be backwards and forwards compatible as long as filesystem features are constant. That is, a filesystem with a 1.2-compatible feature set should work with all of these combinations:

1.2 tools, 1.2 modules
1.2 tools, new modules
new tools, 1.2 modules
new tools, new modules

Nodes running different stacks should fail to mount. Specifically, if one node is mounted and is using stack A, another node with stack B should fail to mount. Otherwise, corruption will occur.
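The mount rule above amounts to a simple comparison. The sketch below is a toy model in Python and rests on one loud assumption: that a mounting node can somehow learn which stack the already-mounted nodes are using. This document does not say where such a record would live, and the names check_stack and StackMismatchError are hypothetical.

```python
from typing import Optional


class StackMismatchError(Exception):
    """Raised when a node tries to mount with a different stack."""


def check_stack(recorded: Optional[str], local: str) -> str:
    """Return the stack name in effect for the filesystem, or refuse the mount.

    recorded: stack used by nodes that already have the filesystem mounted,
              or None if no node has it mounted (assumed to be discoverable).
    local:    stack configured on the node that is trying to mount.
    """
    if recorded is None:
        # First mounter: its stack becomes the one in effect.
        return local
    if recorded != local:
        # Mixed stacks would not coordinate locking, so refuse the mount.
        raise StackMismatchError(
            f"filesystem is mounted with stack {recorded!r}, "
            f"this node is configured for {local!r}"
        )
    return recorded
```

A first mounter passes None and establishes the stack; a later node with a different stack gets an error instead of a mount.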

Switching stacks should be as simple as unmounting the filesystem everywhere, configuring the new stack everywhere, and remounting the filesystem everywhere.

Filesystem Changes

The filesystem currently knows a lot about the underlying o2cb stack. /FilesystemChanges describes the refactoring required to isolate o2cb and support userspace stacks.

Tools Changes

The tools currently expect the specific module layout of current ocfs2 and act directly on configfs paths for o2cb. /ToolsChanges describes the refactoring to support the new stack plugins, the generic way that mount.ocfs2(8) and other tools will talk to the cluster, and the measures used to ensure backwards and forwards compatibility.

The Control Daemon

Finally, each userspace cluster stack must provide a control daemon to interface between libo2cb, ocfs2, and the cluster stack. /ControlDaemon describes the job of the control daemon and the libo2cb functions it can use to talk to ocfs2.

Code

The userspace changes are available from the stack-glue branch of the ocfs2-tools Git repository. The kernel changes are available in the cluster_abstractions_modules branch of Joel's kernel repository. Combined, they work with the CMan cluster stack (the only stack for which there is a control daemon).

TODO

The new slot map format changes have yet to be integrated. Those changes can be seen on the new-slot-map branches of each repository.

The new mount options need to be added to ocfs2.txt.