Userspace Heartbeat for OCFS2
SuSE runs ocfs2 underneath their High Availability Storage Framework (HASF) stack. This stack handles node up/down notification in userspace, and it needs to communicate that information to ocfs2.
The o2hb code actually does two things. First, it functions as a heartbeat: it determines whether nodes are alive or dead and sends notifications when node status changes. Second, it functions as a group manager: by heartbeating in a specific region, a node declares itself interested in that region. This is partly how ocfs2 knows which nodes are mounting a given filesystem (handwaving the slot map here).
A more robust system needs to decouple these two functions. ocfs2 should be able to know what nodes are interested in a particular filesystem regardless of how the up/down notification is happening.
Overview of Changes
The patch series in SLES10 kernels does a few things. First, it isolates quorum decisions better. Next, it genericizes the heartbeat resource so that multiple heartbeat methods can be supported. Then disk heartbeating is split off from the interest code and taught how to register itself. At the end of this first stage, a heartbeat method can register itself with the generic heartbeat code, and quorum responds to generic heartbeat events.
Next, the behavior of generic heartbeat events is changed. The patches add the ability to key a callback on a specific region or on all regions, and they leave the callback triggering up to the specific heartbeat methods. The disk heartbeat, for example, only triggers callbacks when all regions show a node is down.
Finally, the user heartbeat is added. There is a sysfs file that switches between disk and user heartbeats (should be in configfs, I think). The user heartbeat uses configfs symlinks to tell ocfs2 when a node is interested in a region. User heartbeat triggers node up and down events per-region. This is the group management functionality described above.
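As a concrete sketch, the symlink interface could be driven by hand roughly like this. The cluster name, region UUID placeholder, and node name below are made up for illustration; the exact paths depend on the patched kernel's configfs layout.

```shell
# Hypothetical paths -- cluster name, region UUID, and node name are
# examples, not the actual layout from the patches.
CLUSTER=/sys/kernel/config/cluster/mycluster

# A heartbeat region is named for the filesystem it covers.
mkdir $CLUSTER/heartbeat/<region_uuid>

# Userspace declares that node0 is alive in this region by linking the
# region to node0's configfs node object.
ln -s $CLUSTER/node/node0 $CLUSTER/heartbeat/<region_uuid>/node0
```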
Using the full user heartbeat changes and SuSE's HASF stack, quorum is now handled via the HASF stack. ocfs2 and o2hb merely respond to node up/down events as group membership changes. They trust that userspace will handle any fencing required.
Applying to CLVM Work
It's nice that SuSE has this working, but we're talking about cman here. Well, this provides an interface that we can use to communicate cman group membership to ocfs2. Like HASF, we expect cman to handle the quorum decisions. All ocfs2 wants to know is group membership.
The end result should be an ocfs2 kernel stack that can use o2cb + o2hb disk heartbeating, HASF + user heartbeating, or cman + user heartbeating. That's quite flexible.
The original user heartbeat patches from the SLES10 kernel are attached here. The last attachment is commentary by Joel while reviewing the patches.
I have now applied these changes to mainline. The git tree is at:
The SLES10 patches were for ocfs2 1.2. As such, they didn't apply at all to mainline. While this meant a lot more work integrating the changes, it also allowed me to clean up a lot of the problems I saw in commentary.txt.
I tested each change individually, as they should mostly be behavior neutral. My test set was pretty simple:
- Load the latest modules on two systems.
- Mount a filesystem.
- Run parallel compiles on each node.
- Kill the higher node's ethernet.
The expectation was that the concurrent compiles would work just fine; they did. When I killed the higher node's network, o2net timed out as expected. Then, after quorum determined that heartbeat was still working, the higher node self-fenced, and the lower node recovered and continued. This test set is valid for all patches up to the user heartbeat code, and all of the changes passed it correctly.
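The steps above can be sketched as shell commands. Device, mount point, source tree, and interface names here are examples, not the actual test configuration:

```shell
# Run on both nodes; device and path names are examples.
modprobe ocfs2
mount -t ocfs2 /dev/sdb1 /mnt/ocfs2
(cd /mnt/ocfs2/linux-src && make -j4) &   # parallel compile load

# On the higher-numbered node only: sever the cluster interconnect
# so o2net times out while disk heartbeat keeps going.
ip link set eth0 down
```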
Once I got to the patch where the heartbeat mode becomes selectable, I added tests for the heartbeat-mode configfs file, basically checking that invalid input is handled correctly. This works as well.
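Those checks amount to something like the following. The file name and location are hypothetical; only the behavior (valid modes accepted, garbage rejected) is what was tested:

```shell
# Hypothetical path for the mode switch; the reworked patches put it
# in configfs rather than sysfs.
MODE=/sys/kernel/config/cluster/mycluster/heartbeat_mode

cat $MODE             # read back the current mode, e.g. "disk"
echo user > $MODE     # select user heartbeat
echo bogus > $MODE    # invalid input should be rejected with an error
```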
Finally, the user heartbeat code was tested by hand-populating user heartbeat regions. I was able to mount and start the compiles. I then killed the network again. With user heartbeat, o2cb makes no quorum decisions, so it waits for userspace to do something. Finally, I killed one node and removed it from the region on the other node. The living node correctly recovered and continued.
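The node-down half of that test looks roughly like this, again with hypothetical configfs paths:

```shell
# Hypothetical paths. After node1 has been killed, userspace reports it
# down for this region by removing its symlink; the surviving node then
# sees the node-down event and recovers on node1's behalf.
rm /sys/kernel/config/cluster/mycluster/heartbeat/<region_uuid>/node1
```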
Changes From the Original
I fixed up almost all of the complaints in commentary.txt. As such, the to_*() functions handle NULL, the o2hb_get_resource_by_name() function is correctly locking its lookups, and the heartbeat_mode file is in configfs.
I also made a few cleanup changes. The self-fencing mechanism was genericized so that o2quorum isn't the only user. The beginnings of a generic handshake are there. All regions are added to o2hb_all_regions, not just disk ones.
With all of that done, the code is ready for userspace to drive it. Modulo any problems exposed by testing, we can get on with the userspace portion.