

A New Group API

The classic o2cb stack had a simple start_heartbeat()/stop_heartbeat() function set. It wasn't very fancy, and it doesn't fit the brave new world. We need an API that fits the cman stack as well, and cleanly integrates our tools.

Existing Discoveries

We had an API, o2cb_[start|stop]_heartbeat_region(). It was very low-level, and it was predicated on the exact layout of our current disk heartbeat regions. We then had a program, ocfs2_hb_ctl(8), that mount.ocfs2(8) used to determine the appropriate information about a region and call this API. When a volume was unmounting, the kernel would call out to userspace to run ocfs2_hb_ctl and stop the disk heartbeat.

I naively figured that all the other tools (fsck.ocfs2(8), mkfs.ocfs2(8), and tunefs.ocfs2(8)) used the same method to start and stop heartbeat. Nope, they don't call ocfs2_hb_ctl. In fact, they don't even call o2cb_start_heartbeat_region(). They actually call a nice wrapper, ocfs2_initialize_dlm(). This function starts the heartbeat and initializes dlmfs for use by the programs. Very nice. But why isn't ocfs2_hb_ctl using this? Why isn't mount.ocfs2? This should be cleaned up no matter what.

CMan Needs

Not only were the tools in disagreement about how to start heartbeat, the new cman stack needs more than just a "start". It needs to know the service starting the heartbeat - usually the mountpoint. It also needs to know whether the mount(2) call succeeded via a "result" message.

To solve this, I modified the o2cb heartbeat API. Gone is o2cb_[start|stop]_heartbeat_region(). In its place are o2cb_begin_group_join(), o2cb_complete_group_join(), and o2cb_group_leave(). These map to the ocfs2_controld messages, but they also fit the old code quite well.

ocfs2_controld also keeps references based on the mountpoint. It keeps track of the mountpoints that have a device active. We need the API to pass the mountpoint to ocfs2_controld. We can use fake mountpoints for tools that aren't actually mounting. Basically, fsck.ocfs2(8) can pass its name as the mountpoint, etc. For this reason, we call it the "service". The service is mostly a mountpoint, but can also be a program using the device.
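To make the "service" idea concrete, here is a minimal sketch of per-region reference tracking keyed by service name. It is not the ocfs2_controld implementation; the struct, the function names, and the table sizes are all hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_SERVICES 8   /* hypothetical limit, for the sketch only */
#define NAME_LEN 64      /* hypothetical limit, for the sketch only */

/* Hypothetical model of one region and the services currently using it. */
struct region_refs {
    char services[MAX_SERVICES][NAME_LEN];
    int count;
};

/* Record that `service` uses the region. Returns 0 on success, or -1 if
 * the table is full, the name is too long, or the service is already listed. */
int region_add_service(struct region_refs *r, const char *service)
{
    assert(r != NULL && service != NULL);
    if (r->count >= MAX_SERVICES || strlen(service) >= NAME_LEN)
        return -1;
    for (int i = 0; i < r->count; i++)
        if (strcmp(r->services[i], service) == 0)
            return -1;
    strcpy(r->services[r->count++], service);
    return 0;
}

/* Drop `service`. Returns the number of services still using the region,
 * or -1 if the service was not listed. */
int region_drop_service(struct region_refs *r, const char *service)
{
    assert(r != NULL && service != NULL);
    for (int i = 0; i < r->count; i++) {
        if (strcmp(r->services[i], service) == 0) {
            /* Move the last entry into the freed slot. */
            r->count--;
            if (i != r->count)
                strcpy(r->services[i], r->services[r->count]);
            return r->count;
        }
    }
    return -1;
}
```

In this sketch a real mountpoint such as "/mnt/ocfs2" and a tool name such as "fsck.ocfs2" are just two different service strings on the same region. The region would only be a candidate for shutdown once region_drop_service() returns 0.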

Smarter Fail Paths

In the old code, someone would call ocfs2_start_heartbeat(), then try whatever they were doing (e.g., mount). If their operation failed, they'd have to call ocfs2_stop_heartbeat() to kill the region. With the new group_join() code, the result of the startup operation is passed to o2cb_complete_group_join(). If the result is 0 (success), then o2cb_complete_group_join() will mark the region as happy as appropriate for the backing stack. If the result is an error, o2cb_complete_group_join() will handle deactivating the region.
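The fail path can be pictured as a tiny state machine. The sketch below is only a model of the behavior described here, not the libo2cb code: the sketch_* functions, the region struct, and the state names are hypothetical, and the real o2cb_*() signatures are not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states for a region, for illustration only. */
enum region_state { REGION_INACTIVE, REGION_JOINING, REGION_ACTIVE };

struct region {
    enum region_state state;
};

/* Start a join. Only valid when the region is inactive. */
int sketch_begin_join(struct region *r)
{
    assert(r != NULL);
    if (r->state != REGION_INACTIVE)
        return -1;
    r->state = REGION_JOINING;
    return 0;
}

/* Finish a join with the result of the caller's operation.
 * result == 0 activates the region; any error deactivates it,
 * so the caller has no separate "stop" call to make on failure. */
int sketch_complete_join(struct region *r, int result)
{
    assert(r != NULL);
    if (r->state != REGION_JOINING)
        return -1;
    r->state = (result == 0) ? REGION_ACTIVE : REGION_INACTIVE;
    return 0;
}

/* Leave a region that was successfully joined. */
int sketch_leave(struct region *r)
{
    assert(r != NULL);
    if (r->state != REGION_ACTIVE)
        return -1;
    r->state = REGION_INACTIVE;
    return 0;
}
```

The point of the design is visible in sketch_complete_join(): cleanup on error lives in one place instead of in every caller.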


Every tool except mount.ocfs2 uses the ocfs2_initialize_dlm() function. It can be easily modified to call o2cb_begin_group_join() before calling o2dlm_initialize(). It then calls o2cb_complete_group_join() with the result of initializing the DLM. Very clean. Likewise, ocfs2_shutdown_dlm() calls o2cb_group_leave().


mount.ocfs2 no longer calls ocfs2_hb_ctl. It uses the libo2cb functions directly. It already opens the filesystem and uses the libraries, so there's no reason not to. Plus, it now easily wraps the mount(2) call with the begin/complete_group_join() calls. This is a nice cleanup for the classic o2cb stack, but it is necessary for the cman stack, as libo2cb has to hold a connection to ocfs2_controld across the calls.
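The wrapping described above looks roughly like the following. This is a sketch under assumptions, not the mount.ocfs2 source: join_handle stands in for whatever connection state libo2cb holds, fake_mount_ok() and fake_mount_fail() stand in for the mount(2) call, and every name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical handle standing in for the connection held across the calls. */
struct join_handle {
    int connected;
    char trace[64];   /* records the order of steps, for illustration */
};

typedef int (*mount_fn)(void *arg);

static void trace_add(struct join_handle *h, const char *step)
{
    strncat(h->trace, step, sizeof(h->trace) - strlen(h->trace) - 1);
}

/* Open the (pretend) connection and announce the join. */
int join_begin(struct join_handle *h)
{
    h->connected = 1;
    trace_add(h, "begin;");
    return 0;
}

/* Report the result over the still-open connection, then close it. */
void join_complete(struct join_handle *h, int result)
{
    trace_add(h, result == 0 ? "complete(ok);" : "complete(err);");
    h->connected = 0;
}

/* Wrap an operation between begin and complete, passing its result along. */
int mount_with_join(struct join_handle *h, mount_fn do_mount, void *arg)
{
    int ret;

    assert(h != NULL && do_mount != NULL);
    ret = join_begin(h);
    if (ret != 0)
        return ret;
    ret = do_mount(arg);
    trace_add(h, "mount;");
    join_complete(h, ret);
    return ret;
}

/* Stand-ins for mount(2): one that succeeds and one that fails. */
int fake_mount_ok(void *arg)   { (void)arg; return 0; }
int fake_mount_fail(void *arg) { (void)arg; return -1; }
```

Because the handle lives in one process from begin to complete, a separate helper program run before and after mount(2) could not do this job, which is why the calls move into mount.ocfs2 itself.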


With the need to do messaging at umount time, there is now a umount.ocfs2(8) program. All recent util-linux packages support umount.<type>, and we will take advantage of it. The kernel will no longer call ocfs2_hb_ctl at umount time. We enforce this by having o2cb.init write /bin/true to /proc/sys/fs/ocfs2/nm/hb_ctl_path. The kernel will still try to call out, but it will do nothing.
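o2cb.init is a shell script, but the write itself is simple enough to sketch in C. The helper below is hypothetical (set_hb_ctl_path() is not an existing function) and takes the file to write as a parameter, so it can be tried against an ordinary file rather than the real /proc entry.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the path of the userspace helper program
 * into the given control file. Returns 0 on success, -1 on any error. */
int set_hb_ctl_path(const char *sysctl_file, const char *helper)
{
    FILE *f;
    int ret = 0;

    assert(sysctl_file != NULL && helper != NULL);
    if (strlen(helper) == 0)
        return -1;              /* refuse to write an empty helper path */

    f = fopen(sysctl_file, "w");
    if (f == NULL)
        return -1;
    if (fputs(helper, f) == EOF)
        ret = -1;
    if (fclose(f) == EOF)       /* a failed close can also mean a failed write */
        ret = -1;
    return ret;
}
```

On a live system the call would be something like set_hb_ctl_path("/proc/sys/fs/ocfs2/nm/hb_ctl_path", "/bin/true"), run as root. After that, the kernel's callout at umount time runs a program that exits successfully without touching the heartbeat.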

The umount.ocfs2 program knows how to call the o2cb_group_leave() function. This function will work for all backing stacks.


What about ocfs2_hb_ctl? Mount is no longer using it, and the kernel will call /bin/true. So why does it still exist? It has been modified to use the new API, and now requires the <service> argument. It exists for fixing problems. It's not likely, but it is possible to kill a tool or mount in such a way that the heartbeat isn't cleaned up. In that case, ocfs2_hb_ctl can be used to clean up the remnants.

It can also be used by o2cb.init to check on heartbeat state. There needs to be an update to the -I option to make this even more flexible.


Even with all these changes, the tools should work with 1.2 drivers. The 1.2 drivers won't work with anything but the classic o2cb stack, but these tools should configure that stack correctly.

In addition, the 1.2 tools should work with the modern kernel. They will configure the classic stack as before, including calling ocfs2_hb_ctl from the kernel. It should "just work".

2012-11-08 13:01