OCFS2/CManUnderO2CB/ToDO

Pending Work

In no particular order.

Pinning o2cb_controld

ocfs2_controld is smart enough that it won't exit when there is a live mount. All normal exit signals are ignored until ocfs2 has unmounted every volume. You can still kill it with SIGKILL, of course, but you get what you paid for there.

o2cb_controld needs something similar. Right now, you can shoot it down with any normal signal. It will disappear out from under ocfs2_controld, taking the unpinned bits of the cluster configuration with it.
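One possible shape for this, as a rough sketch only: the signal handler just records that an exit was requested, and the main loop honors it once nothing is pinning the daemon. Every name here (pin_count, o2cb_pin(), o2cb_unpin(), should_exit()) is hypothetical; none of them exists in o2cb_controld today, and how ocfs2_controld would take and drop a pin is left open.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state; not part of the current o2cb_controld. */
static volatile sig_atomic_t exit_requested;
static int pin_count;   /* e.g. one pin per connected ocfs2_controld */

static void handle_exit_signal(int sig)
{
    (void)sig;
    exit_requested = 1;   /* Only record the request; the main loop decides */
}

/* Route the normal exit signals to the deferred-exit handler. */
int install_exit_handlers(void)
{
    static const int sigs[] = { SIGTERM, SIGINT, SIGHUP };
    struct sigaction sa;
    size_t i;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handle_exit_signal;
    sigemptyset(&sa.sa_mask);

    for (i = 0; i < sizeof(sigs) / sizeof(sigs[0]); i++) {
        if (sigaction(sigs[i], &sa, NULL) < 0)
            return -1;
    }
    return 0;
}

void o2cb_pin(void)
{
    pin_count++;
}

void o2cb_unpin(void)
{
    if (pin_count > 0)
        pin_count--;
}

/* The main loop would check this after every wakeup. */
int should_exit(void)
{
    return exit_requested && pin_count == 0;
}
```

SIGKILL still works, of course, exactly as it does for ocfs2_controld.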

Ordering the Control Daemons

Here's how it works today:

Another node goes down
ocfs2_controld gets STOP
o2cb_controld is told to remove the downed node
The rmdir(2) fails, because the user heartbeat symlink is pinning the node object. ocfs2_controld hasn't gotten START yet, so it can't remove the symlink yet.

We need to do one of the following:

Have STOP prevent o2cb_controld from doing anything
Have o2cb_controld notice the failed rmdir(2) and retry it periodically.
Have o2cb_controld lock with ocfs2_controld, and only do the rmdir(2) when ocfs2_controld says it is ok.
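To make the second option concrete, here is a minimal sketch of a retry queue. It is an illustration under assumptions, not the daemon's code: the helper names, the fixed-size queue, and the set of errno values treated as "still pinned" (EBUSY, ENOTEMPTY, EEXIST) are all guesses, and the real error returned for a pinned node object would need to be checked.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define MAX_PENDING   16    /* Hypothetical queue size */
#define NODE_PATH_MAX 256   /* Hypothetical path limit */

/* Paths whose rmdir(2) failed and should be retried; "" marks a free slot. */
static char pending[MAX_PENDING][NODE_PATH_MAX];

/* 0: directory is gone, 1: retry later, -1: hard error (errno is set). */
static int try_rmdir(const char *path)
{
    if (rmdir(path) == 0 || errno == ENOENT)
        return 0;
    /* Assumption: a pinned node object fails with one of these. */
    if (errno == EBUSY || errno == ENOTEMPTY || errno == EEXIST)
        return 1;
    return -1;
}

/* Try the removal now; queue the path if it is still pinned. */
int remove_node_dir(const char *path)
{
    int i, ret = try_rmdir(path);

    if (ret != 1)
        return ret;
    if (strlen(path) >= NODE_PATH_MAX) {
        errno = ENAMETOOLONG;
        return -1;
    }
    for (i = 0; i < MAX_PENDING; i++) {
        if (strcmp(pending[i], path) == 0)
            return 1;               /* Already queued */
    }
    for (i = 0; i < MAX_PENDING; i++) {
        if (pending[i][0] == '\0') {
            strcpy(pending[i], path);
            return 1;
        }
    }
    errno = ENOSPC;
    return -1;
}

/* Call from a periodic timer; returns how many removals are still pending. */
int retry_pending_rmdirs(void)
{
    int i, left = 0;

    for (i = 0; i < MAX_PENDING; i++) {
        if (pending[i][0] == '\0')
            continue;
        if (try_rmdir(pending[i]) == 1)
            left++;
        else
            pending[i][0] = '\0';   /* Gone or hard error; a daemon would log the latter */
    }
    return left;
}
```

The retry would succeed on the first pass after ocfs2_controld gets START and removes the symlink. The cost is a window in which the downed node's object still exists.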

Exploring CPG in ocfs2_controld

ocfs2_controld uses groupd to handle group membership. I got this from gfs_controld. The cman folks tell me that CPG (openais Closed Process Group) is the future, so perhaps we should change ocfs2_controld to use libcpg directly.

I'm going to table this for now, but it's worth checking out.

libo2cb Verbosity

In the network protocol code added to libo2cb, I left some commented-out fprintf(3) calls. These provide much better debugging information for network communication problems. However, as a library, libo2cb shouldn't be printing.

We should add an o2cb_set_verbose(FILE *) function that will log to that FILE* only if it is set. I'd like to make it generic so we can have ocfs2_set_verbose(), o2dlm_set_verbose(), etc. We can then provide a generic verbosef() function like fsck does. I'm not sure how to make this cleanly generic.
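One way this could be made generic, offered as a sketch rather than a settled design: each library owns a small context holding its FILE * and a prefix, and a shared verbosef() takes that context. The struct verbose_ctx type, its field names, and this verbosef() signature are assumptions for illustration; only the o2cb_set_verbose(FILE *) name comes from the proposal above.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical per-library context (one each for libo2cb, libocfs2, libo2dlm). */
struct verbose_ctx {
    FILE *vc_file;           /* NULL means stay silent */
    const char *vc_prefix;   /* e.g. "o2cb" */
};

/* Generic logger: prints only when the library's FILE * has been set. */
void verbosef(const struct verbose_ctx *ctx, const char *fmt, ...)
{
    va_list args;

    if (!ctx->vc_file)
        return;

    fprintf(ctx->vc_file, "%s: ", ctx->vc_prefix);
    va_start(args, fmt);
    vfprintf(ctx->vc_file, fmt, args);
    va_end(args);
}

/* libo2cb's own context; silent until the caller opts in. */
static struct verbose_ctx o2cb_verbose = { NULL, "o2cb" };

void o2cb_set_verbose(FILE *file)
{
    o2cb_verbose.vc_file = file;
}
```

Inside libo2cb, the commented-out fprintf(3) calls would become verbosef(&o2cb_verbose, ...). A caller that wants the output would call o2cb_set_verbose(stderr); passing NULL turns it off again.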

Handling ocfs2_controld Client Death

Right now, ocfs2_controld does nothing if a client dies between MOUNT and MRESULT, or if there is an error sending STATUS messages. We should at least decide what to do. Allowing the mount to continue might be OK (as the kernel may have been told to mount before the client broke).

Add Safety to Stack Switching

Already There is An Abundance of Alliteration. Seriously, the new o2cb.init will happily change the stack in the configuration file while the old stack is still running. Then if "offline" is called, it will try to offline the wrong stack.

This can be fixed either by using the contents of STACKCONF during offline operation or by refusing the configuration change against a running cluster.

Query ocfs2_controld for Status

There needs to be an extension to 'ocfs2_hb_ctl -I' to query the heartbeat state in ocfs2_controld. This would allow listing of current mountgroups and mountpoints, as well as just a count of live regions.

This probably requires some additions to the ocfs2_controld client protocol.

Comments from Others

The heartbeat2 and cman guys probably want to look at what we've done. We could especially use their comments on how we integrate with them.

KConfig to Compile In ocfs2_disk_heartbeat

While it's nice to have the heartbeat methods as separate modules, that won't work with the 1.2 tools. For them, loading ocfs2_nodemanager must load the disk heartbeat. The easiest way to do that is a CONFIG_BUILT_IN_DISK_HEARTBEAT. If set, disk_heartbeat.o is made part of ocfs2_nodemanager.ko. If not set, ocfs2_disk_heartbeat.ko is created. The new tools can handle either.

Handle Broken Groups

While it shouldn't happen, you can get into a situation where ocfs2_controld has crashed after adding a group. When you restart it, it has no idea the group exists. Either someone needs to tell groupd to drop the group, or ocfs2_controld needs to add it again on restart.

Clean Up The Network Handshake

NOTE: This item has been pushed out until we see a need. The current solution works just fine.

The current network handshake hardcodes some values between the driver files. As we split out the heartbeat code into separate modules, we have to have a way to communicate the handshake bits to tcp.c. I have an idea on how to create a generic handshake setup.

Basically, anyone that needs to handshake registers a handshake callback:

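/* Wire format of one block: a small header followed by hs_len data bytes */
struct handshake_block {
    __le16 hs_magic;
    __le16 hs_len;
    u8 hs_data[];
};

struct handshake_callback {
    struct list_head hs_list;
    u16 hs_magic;
    size_t hs_len;      /* Bytes written by generate_handshake_data() */
    int (*generate_handshake_data)(void *buf);
};

static LIST_HEAD(handshakes);

int register_handshake_block(struct handshake_callback *hc)
{
    /* Locking of the handshakes list is left out of this sketch */
    list_add(&hc->hs_list, &handshakes);
    return 0;
}

/* Returns the total handshake length, or a negative errno */
int generate_handshake(char **buf_ret)
{
    u8 *buf, *pos;
    size_t len = 0;
    int ret;
    struct handshake_block *block;
    struct handshake_callback *hc;

    list_for_each_entry(hc, &handshakes, hs_list) {
        /* Header (the magic and length fields) plus the data */
        len += sizeof(struct handshake_block) + hc->hs_len;
    }

    buf = kmalloc(len, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;

    pos = buf;
    list_for_each_entry(hc, &handshakes, hs_list) {
        /* Assumes every hs_len is even, so each header stays aligned */
        block = (struct handshake_block *)pos;
        block->hs_magic = cpu_to_le16(hc->hs_magic);
        block->hs_len = cpu_to_le16(hc->hs_len);
        ret = hc->generate_handshake_data(block->hs_data);
        if (ret) {
            kfree(buf);
            return ret;
        }
        pos += sizeof(struct handshake_block) + hc->hs_len;
    }

    *buf_ret = (char *)buf;
    return len;
}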

When a handshake needs to be sent, each callback generates a block. The collection of handshake blocks is sent over the wire.

On the remote end, each magic is used to determine the callback. The other end generates its own handshake, and does a memcmp() to ensure it matches.
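The receive side could look roughly like the following. This is a userspace model for illustration, not kernel code: it uses a fixed array instead of a list_head, reads the little-endian header fields byte by byte to avoid alignment assumptions, and the names hs_register(), hs_verify(), and the demo callback with its magic value are all made up.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HS_HDR_LEN       4    /* le16 magic + le16 length */
#define HS_MAX_DATA      64   /* Hypothetical cap on one block's data */
#define HS_MAX_CALLBACKS 8

struct hs_callback {
    uint16_t hs_magic;
    uint16_t hs_len;
    int (*generate_handshake_data)(void *buf);
};

static const struct hs_callback *hs_callbacks[HS_MAX_CALLBACKS];
static size_t hs_count;

int hs_register(const struct hs_callback *hc)
{
    if (hs_count == HS_MAX_CALLBACKS || hc->hs_len > HS_MAX_DATA)
        return -1;
    hs_callbacks[hs_count++] = hc;
    return 0;
}

static uint16_t get_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static const struct hs_callback *hs_find(uint16_t magic)
{
    size_t i;

    for (i = 0; i < hs_count; i++) {
        if (hs_callbacks[i]->hs_magic == magic)
            return hs_callbacks[i];
    }
    return NULL;
}

/* Returns 0 if every received block matches what we would have sent. */
int hs_verify(const uint8_t *buf, size_t len)
{
    uint8_t local[HS_MAX_DATA];
    size_t seen = 0;

    while (len > 0) {
        const struct hs_callback *hc;
        uint16_t magic, dlen;

        if (len < HS_HDR_LEN)
            return -1;                  /* Truncated header */
        magic = get_le16(buf);
        dlen = get_le16(buf + 2);
        buf += HS_HDR_LEN;
        len -= HS_HDR_LEN;

        hc = hs_find(magic);            /* The magic selects the callback */
        if (!hc || dlen != hc->hs_len || dlen > len)
            return -1;
        if (hc->generate_handshake_data(local) != 0)
            return -1;
        if (memcmp(local, buf, dlen) != 0)
            return -1;                  /* Peer disagrees with us */
        buf += dlen;
        len -= dlen;
        seen++;
    }
    /* Only a count check: a duplicated block could hide a missing one */
    return seen == hs_count ? 0 : -1;
}

/* Made-up example callback with a 4-byte payload */
int demo_generate(void *buf)
{
    static const uint8_t data[4] = { 1, 0, 0, 0 };

    memcpy(buf, data, sizeof(data));
    return 0;
}

const struct hs_callback demo_callback = { 0xfa56, 4, demo_generate };
```

Verifying block by block means the two ends do not have to register their callbacks in the same order, which a single memcmp() over the whole buffer would require.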