
Using the CMan Stack Under O2CB

This page documents my attempts to feed the o2cb stack with cman. This means configuring the cluster via cman and managing mounts via the stack's groupd. o2cb handles the interaction with ocfs2 and the o2dlm.

Up until September 17, 2007, we used the cman software distributed with RHEL5/EL5/FC7. I'm now building cman from source control instead. Here are some notes on getting it going.

Repositories

git pull git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2.git userspace_heartbeat
git pull git://oss.oracle.com/git/ocfs2-tools.git cman-based

Using CMan for Node Configuration

The first step is to replace o2cb_ctl with cman. The cluster is configured via cman, and cman feeds the kernel side of o2cb via configfs. The piece that does the feeding ends up being a control daemon.
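As a rough illustration of what such a control daemon does, the sketch below writes one node's attributes into an o2nm-style configfs tree. The layout (`cluster/<cluster>/node/<node>/` with `num`, `ipv4_address`, `ipv4_port`, and `local` attributes) follows the o2nm configfs interface as I understand it. The function name `write_node`, the `root` parameter, and the example values are made up for illustration, and a real daemon would take its node list from cman rather than from arguments.

```python
import os


def write_node(root, cluster, name, num, address, port, local=False):
    """Create an o2nm-style node entry under root and fill in its attributes.

    Hypothetical helper: root would be the configfs mount point on a real
    system, but any directory works for experimenting.
    """
    node_dir = os.path.join(root, "cluster", cluster, "node", name)
    # On real configfs, the mkdir is what makes the kernel create the node
    # object and its attribute files.
    os.makedirs(node_dir, exist_ok=True)
    attrs = [
        ("num", str(num)),
        ("ipv4_address", address),
        ("ipv4_port", str(port)),
    ]
    if local:
        # Assumption: a node is marked local only after its address is set.
        attrs.append(("local", "1"))
    for attr, value in attrs:
        with open(os.path.join(node_dir, attr), "w") as f:
            f.write(value + "\n")
    return node_dir
```

Pointing `root` at a scratch directory lets you inspect what would be written without touching a live cluster.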

Running Heartbeat from Userspace

With cman doing the node management work, we now want to get node up/down information from cman. This means communicating that information to the o2hb code. In addition, we have to disable the disk heartbeat.

SuSE has already done this work in SLES10 with their user_heartbeat driver. We need to integrate it. The upshot should be an o2cb/o2hb/ocfs2 driver set that can use any of the o2cb, cman, or HASF stacks.

Communicating Region Information

Without the disk heartbeat, ocfs2 has no idea which node is interested in which region. The userspace heartbeat code handles this for HASF as described in the previous section. We can leverage this underneath cman as well. We need another control daemon to send group membership information to the userspace heartbeat.
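To make the job of that second control daemon concrete, here is a minimal sketch of the bookkeeping involved: given two snapshots of which nodes are in which region's group, it works out the up/down events that would have to be passed on to the userspace heartbeat. The snapshot format, the function name `membership_events`, and the event tuples are assumptions for illustration; this is not the interface of cman, groupd, or the user_heartbeat driver.

```python
def membership_events(old, new):
    """Compare two {region: set(node numbers)} snapshots.

    Returns a sorted list of (region, node, "up" | "down") tuples describing
    what changed between the snapshots.
    """
    events = []
    for region in set(old) | set(new):
        before = old.get(region, set())
        after = new.get(region, set())
        # Nodes that joined the region's group since the last snapshot.
        for node in after - before:
            events.append((region, node, "up"))
        # Nodes that left (or died) since the last snapshot.
        for node in before - after:
            events.append((region, node, "down"))
    return sorted(events)


old = {"regionA": {1, 2}}
new = {"regionA": {2, 3}, "regionB": {1}}
print(membership_events(old, new))
```

Diffing snapshots keeps the daemon stateless apart from the last snapshot it saw, which is one simple way to structure it; an event-driven design fed directly by group callbacks would work too.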

Reworking the Tools

The tools need to be able to handle starting and stopping both the original o2cb stack as well as the cman-based one. First, we need to rework startup. Then we need to create a more flexible group API.

It Works!

With all the above complete, at least in a first pass form, we can now bring up an ocfs2 filesystem on top of cman. Woo!

But Wait...

There is a ton of stuff still to do.

And there are, of course, things that break.

Test Test Test

The current stack is very stable, and the new work, however it ends up, needs to meet or exceed that standard. Thus, testing must be done.

And Yet!

Discussions with David Teigland at Red Hat have led us to the following question:

What part of o2cb is necessary in a world of cman and fs/dlm?

The answer, we've decided, is almost nothing.

If we are using fs/dlm, we don't need o2dlm. Without o2dlm, we don't need o2net. If we don't need o2net, the kernel doesn't need to know node addresses. Thus o2nm is not necessary. In addition, fs/dlm is fed its node and group information outside of the filesystem's sphere. So we don't need o2hb to maintain a group list.

The one thing we do need is notification of node death. We need to know that node N went down, so we can map it to a slot and initiate recovery. But we don't need to know anything else about the group - we assume userspace is handling that.

Thus, in a world of cman+fs/dlm, we need a tiny shim by which userspace can send "Node N died" messages to ocfs2. Nothing else.
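A toy model of how little that shim has to do, assuming the filesystem keeps a mapping from node numbers to slots for each mounted volume. The class `SlotMap` and its methods are hypothetical names for illustration, not ocfs2 code; the point is only that "Node N died" is the single input, and the slot needing recovery is the single output.

```python
class SlotMap:
    """Toy model of the node-number -> slot mapping kept for one volume."""

    def __init__(self):
        self.node_to_slot = {}

    def claim(self, node, slot):
        """Record that a node mounted the volume and took a slot."""
        self.node_to_slot[node] = slot

    def node_died(self, node):
        """Handle a "Node N died" message.

        Returns the slot that needs recovery, or None if the node held no
        slot on this volume (nothing to recover).
        """
        return self.node_to_slot.pop(node, None)


slots = SlotMap()
slots.claim(node=7, slot=0)
print(slots.node_died(7))  # slot to recover
print(slots.node_died(3))  # node 3 never mounted this volume
```

Everything else about the group (who is a member, who may join) stays in userspace, which is what keeps the kernel side this small.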

This gets us a few big advantages over everything we've already done.

We have a tiny amount of kernel code in the cman case.
We never have to support a "cman+o2dlm" combination. The current work creates the choice between "o2cb+o2dlm" and "cman+o2dlm". It assumes that we will add fs/dlm at a later date, creating a "cman+fs/dlm" combination. That's three different stack combinations we'd have to support indefinitely. It's much easier to have only two.
We can leave the classic stack almost unchanged, which is great for stability. There would be no interdependence between the stacks; each would stand almost completely on its own. Compare that to the three-stack solution, where we have a shared middle and a lot of code supporting multiple stacks.
A lot of the userspace work we've done translates, so it isn't wasted.

These are all good things, but they assume cman and fs/dlm work pretty well. Not perfect - we can help fix them - but good enough that we know they are a good basis for our future.

The best way to approach this is to leave the stacks alone for now. fs/dlm can be introduced on top of the user-heartbeat work. Thus, we can determine how well it works before we rip out all of these changes and start the kernel bits from scratch.