

Trying to use CLVM on Top of O2CB

First, Sunil asked me to look at using RHEL5's clvmd. The configuration had a library liblvm2locking.so that could be used to handle cluster LVM configuration. Sunil wanted to know if it was trivial to write such a library for o2cb and o2dlm.

The Library

I looked at the library, and it was confusing at first. Instead of calling out to the cman cluster manager directly, it called out to more generic "talk to the cluster" code. That code, in turn, only knew how to talk to cman or gulm. Whatever the intent, it was obvious that you couldn't use it to connect o2cb to clvmd.

I couldn't figure out what the point of the library was if clvmd was tied to cman anyway. I finally figured it out after a bunch of code reading (much of which is described in the following sections).

The library isn't for clvmd to talk to the cluster. The library is for the LVM commands (pvcreate, vgcreate, lvcreate, etc) to talk to clvmd. That is:

Not Clustered (locking_type 1):
    pvcreate -> /usr/sbin/lvm -> fcntl

Clustered (locking_type 2):
    pvcreate -> /usr/sbin/lvm -> liblvm2locking.so -> clvmd -> other nodes

If you want to implement a liblvm2locking of your own, you have to implement everything it does: locking out other nodes and sending them configuration changes. You can't reuse the existing clvmd; you'd have to write your own.

CLVMD Clustering

The library doesn't talk to the cluster; the library talks to clvmd. clvmd talks to the cluster. The methods it uses are hardcoded in the program.

It's actually partially modular. clvmd as it exists in RHEL5 has support for gulm and cman. Each is supported by a file, clvmd_cman.c or clvmd_gulm.c. These files are just compiled in, not loaded at runtime. When starting up, the software asks a function in each file to load an operations structure. That ops structure contains everything clvmd needs to talk to the cluster. It's about 10 functions, including "sync_lock", "send_cluster_message", and the like. The rest of clvmd appears to be pretty generic.
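The "load an operations structure" pattern can be sketched as follows. This is a hypothetical clvmd_o2cb.c skeleton, not real code from LVM2 or o2cb: the struct here is an abbreviated stand-in for clvmd's real struct cluster_ops (shown in full below), and the function names are illustrative.

```c
/* Hypothetical clvmd_o2cb.c skeleton, mirroring the pattern of
 * clvmd_cman.c and clvmd_gulm.c: define the callbacks, collect them
 * in an ops structure, and hand that structure back to clvmd at
 * startup.  The struct is an abbreviated stand-in for the real one. */

struct cluster_ops {
        int (*get_num_nodes)(void);
        int (*is_quorate)(void);
        /* ... the remaining callbacks ... */
};

/* Stubbed o2cb-backed implementations; real ones would query o2cb. */
static int o2cb_get_num_nodes(void) { return 1; }
static int o2cb_is_quorate(void)    { return 1; }

static struct cluster_ops o2cb_ops = {
        .get_num_nodes = o2cb_get_num_nodes,
        .is_quorate    = o2cb_is_quorate,
};

/* clvmd's startup would call this just as it calls the cman or gulm
 * backend's init function, then use only the ops pointer afterward. */
struct cluster_ops *init_o2cb_cluster(void)
{
        return &o2cb_ops;
}
```

Because the files are compiled in rather than loaded at runtime, adding a backend means touching clvmd's startup code as well, not just dropping in a new file.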

To use o2cb and o2dlm underneath, you'd have to implement this ops structure in a clvmd_o2cb.c file.

Putting O2CB Underneath CLVMD

What does this entail? Here's the ops structure:

struct cluster_ops {
        void (*cluster_init_completed) (void);

        int (*cluster_send_message) (void *buf, int msglen, char *csid,
                                const char *errtext);
        int (*name_from_csid) (char *csid, char *name);
        int (*csid_from_name) (char *csid, char *name);
        int (*get_num_nodes) (void);
        int (*cluster_fd_callback) (struct local_client *fd, char *buf, int len,
                               char *csid, struct local_client **new_client);
        int (*get_main_cluster_fd) (void);      /* gets accept FD or cman cluster socket */
        int (*cluster_do_node_callback) (struct local_client *client,
                                    void (*callback) (struct local_client *,
                                                      char *csid, int node_up));
        int (*is_quorate) (void);

        void (*get_our_csid) (char *csid);
        void (*add_up_node) (char *csid);
        void (*reread_config) (void);
        void (*cluster_closedown) (void);

        int (*get_cluster_name)(char *buf, int buflen);
        int (*sync_lock) (const char *resource, int mode, int flags, int *lockid);
        int (*sync_unlock) (const char *resource, int lockid);
};

What does o2cb look like here?

->sync_lock and ->sync_unlock seemed trivial: just use the o2dlm library. However, it would appear they use PWMODE (protected write) locks for some things. This allows one node to make changes while other nodes stay cached. o2dlm can't do that.
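The mode mismatch could be sketched like this. The LCK_* constants below are stand-ins for clvmd's real DLM mode values, and the level names mirror libo2dlm's but are declared locally so the sketch is self-contained; the actual o2dlm call is left as a comment.

```c
/* Sketch of an o2cb ->sync_lock mapping clvmd's DLM lock modes onto
 * the levels o2dlm exposes.  LCK_* values are stand-ins, and the
 * o2dlm_lock() call is commented out to keep this self-contained. */
#include <errno.h>

enum { LCK_READ = 1, LCK_PW = 2, LCK_EX = 3 };  /* stand-in mode values */
enum o2dlm_level { O2DLM_LEVEL_PRMODE, O2DLM_LEVEL_EXMODE };

/* o2dlm has no PW mode, so PW gets promoted to EX; this loses the
 * "one writer, other nodes stay cached" behavior described above. */
static int mode_to_o2dlm(int mode, enum o2dlm_level *level)
{
        switch (mode) {
        case LCK_READ:
                *level = O2DLM_LEVEL_PRMODE;
                return 0;
        case LCK_PW:    /* no o2dlm equivalent: degrade to exclusive */
        case LCK_EX:
                *level = O2DLM_LEVEL_EXMODE;
                return 0;
        default:
                return -EINVAL;
        }
}

static int o2cb_sync_lock(const char *resource, int mode, int flags,
                          int *lockid)
{
        enum o2dlm_level level;

        (void)resource; (void)flags; (void)lockid;
        if (mode_to_o2dlm(mode, &level))
                return -1;
        /* real call would be something like:
         *   o2dlm_lock(ctxt, resource, flags, level); */
        (void)level;
        return 0;
}
```

The PW-to-EX promotion is correct but pessimistic: every PW acquire would force other nodes to drop their cached state.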

The csid functions implement an opaque cluster identifier. We can just make up a conversion from our node number to the csid. cman just does memcpy.
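A node-number-to-csid conversion in the cman style really is just a memcpy. The 4-byte csid length here is an assumption for the sketch; clvmd defines its own fixed csid size.

```c
/* Sketch of csid conversion for an o2cb backend: treat the csid as
 * the node number copied byte-for-byte, as cman does.  CSID_LEN is
 * assumed to be 4 here; clvmd fixes its own length. */
#include <stdint.h>
#include <string.h>

#define CSID_LEN 4

static void csid_from_nodenum(char *csid, uint32_t nodenum)
{
        memcpy(csid, &nodenum, CSID_LEN);
}

static uint32_t nodenum_from_csid(const char *csid)
{
        uint32_t nodenum;

        memcpy(&nodenum, csid, CSID_LEN);
        return nodenum;
}
```

Since the csid is opaque to the rest of clvmd, any scheme works as long as both conversions agree.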

The fd functions are for communication with the cluster service. cman, for example, sends all notifications via a file descriptor. clvmd uses ->get_main_cluster_fd() to put a descriptor in its select(2) for the cluster stack. When that descriptor is ready for reading, it will call ->cluster_fd_callback().
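That main-loop interaction can be sketched with a simple readiness check; this is a simplification down to a single fd, not clvmd's actual loop.

```c
/* Sketch of how clvmd consumes the backend's descriptor: it puts
 * ->get_main_cluster_fd() into its select(2) set and, when the fd
 * becomes readable, runs ->cluster_fd_callback() on it.  Simplified
 * here to one fd and a yes/no readiness check. */
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait up to timeout_s seconds for the cluster fd to become readable;
 * returns 1 when ->cluster_fd_callback() should be invoked. */
static int cluster_fd_ready(int cluster_fd, int timeout_s)
{
        fd_set in;
        struct timeval tv = { .tv_sec = timeout_s, .tv_usec = 0 };

        FD_ZERO(&in);
        FD_SET(cluster_fd, &in);
        return select(cluster_fd + 1, &in, NULL, NULL, &tv) > 0 &&
               FD_ISSET(cluster_fd, &in);
}
```

An o2cb backend would need to produce such a descriptor in the first place, e.g. a socket or pipe fed by whatever delivers o2cb membership events to userspace.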

->add_up_node() is called when clvmd sees a node come up.

->cluster_init_completed() and ->cluster_closedown() are pretty self-explanatory.

The big complexity is ->cluster_send_message().

Sending Messages

clvmd relies on the cluster stack for two things. First, node up/down information: clvmd wants to know which nodes are up and when they go down. Second, it relies on the cluster stack to send messages to other nodes. It assumes the messages arrive in the absence of node-down events (it has a timeout, but expects that any node-down notification will have arrived by then).

Message sending is effectively "for each node, send to that node". Most of what clvmd does is via messages. Locking often only locks out lvm activity, not the device. dm suspend handles freezing the device. In other words, the locking is to prevent competing lvm commands, not to prevent access to the logical volumes.

clvmd tells other nodes what to do in order for the local node's action to be safe. For example, if you are changing a volume's size, you have to tell all the other nodes to suspend, then change the size, then tell the other nodes to refresh, then to resume.
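The resize example can be sketched as ordered "for each node, send" rounds. The message names and the log-recording send_to_all() are hypothetical stand-ins, not clvmd's real protocol; a real backend would route each send through ->cluster_send_message().

```c
/* Sketch of the suspend -> resize -> refresh -> resume sequence as
 * per-node message rounds.  Messages are recorded in a log here
 * instead of being sent; names are hypothetical. */
enum msg { MSG_SUSPEND, MSG_REFRESH, MSG_RESUME };

#define LOG_MAX 64
static enum msg msg_log[LOG_MAX];
static int msg_count;

/* Would be one ->cluster_send_message() per node. */
static void send_to_all(int nnodes, enum msg m)
{
        for (int node = 0; node < nnodes && msg_count < LOG_MAX; node++)
                msg_log[msg_count++] = m;
}

static void clustered_resize(int nnodes)
{
        send_to_all(nnodes, MSG_SUSPEND);  /* other nodes dm-suspend the LV */
        /* ... perform the local size change ... */
        send_to_all(nnodes, MSG_REFRESH);  /* other nodes reload the mapping */
        send_to_all(nnodes, MSG_RESUME);
}
```

The safety of the local size change depends entirely on every suspend message landing before it happens, which is why the ordering guarantee discussed next matters.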

Thus, the cluster stack must be able to provide this communication. o2cb has no such facility in userspace and would need one written. cman uses openais, which guarantees that messages arrive in order: all nodes in the cluster see message A before message B before message C, regardless of who sent them. clvmd appears to rely on this ordering to keep operations from overlapping, so o2cb would have to provide the same property.

It may be possible to use openais underneath an o2cb communication facility, but I haven't explored what it takes to feed openais the configuration from o2cb.

2012-11-08 13:01