GLOBAL HEARTBEAT IN O2CB CLUSTER STACK

Sunil Mushran, Jun 2010

BACKGROUND

The o2cb cluster stack does disk heartbeat on a per-mount basis on an area on disk (heartbeat file) that is reserved during format. This is referred to as local heartbeat. In this scheme, the heartbeat thread is started and stopped automatically during mount and umount.

While this scheme has the advantage of being easy to set up, it has a flaw in that it requires as many heartbeat threads as there are mounts. This becomes a problem on setups having 50+ mounts. While each heartbeat io is tiny (one sector write and max 255 sectors read every 2 secs), the iops add up. Also, as the heartbeat is started on every mount, the mount is slow as it needs to wait for the heartbeat thread to stabilize. Currently it takes over 5 secs to mount a clustered volume. But the killer is that, as the number of mounts on each node in a cluster can vary, a node has to self-fence if the heartbeat io times out to even *one* device.

GOALS

The goal of this task is to provide a new heartbeat scheme that decouples the mount from the heartbeat. This will allow the user to mount 50+ volumes without the additional heartbeat io overhead. Mounts will also be faster, as there will be no need to wait for the heartbeat thread to stabilize. Lastly, and most importantly, a loss of one heartbeat device need not force the node to self-fence.

An additional goal would be to allow on-line addition and removal of heartbeat devices.

And this needs to be added while maintaining full backward compatibility with the existing local heartbeat and user-space clustering schemes.

GLOBAL HEARTBEAT

In this scheme, the user configures the heartbeat devices on all nodes. The heartbeat is started when the cluster is brought online. All nodes in the cluster ensure that the devices are the same on all nodes. A node self-fences if the heartbeat io times out on 50% or more of the devices.

One implication of this is that users should be advised to set up at least three heartbeat devices. Any fewer and the node will have to self-fence on losing just one device.
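The arithmetic behind the three-device recommendation can be sketched as follows. This is only an illustration of the 50%-or-more rule stated above; the function names are hypothetical and are not taken from o2hb.

```c
/* Smallest number of timed-out devices that forces a self-fence.
 * "50% or more of total" is the same as ceil(total / 2). */
static unsigned int hb_fence_threshold(unsigned int total)
{
    return (total + 1) / 2;
}

/* Largest number of devices a node can lose and still stay up. */
static unsigned int hb_max_tolerated_losses(unsigned int total)
{
    return total ? hb_fence_threshold(total) - 1 : 0;
}
```

With one or two devices the tolerated loss is zero, so losing a single device fences the node. Three devices is the smallest configuration that survives the loss of one.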

One thing to remember is that the o2cb cluster can only be in either global or local heartbeat mode. That is, one cannot mount one device with global heartbeat and another with local.

Qs: Do we force a minimum of 3 devices? No.

CLUSTER CONFIGURATION

The list of heartbeat devices is stored in /etc/ocfs2/cluster.conf.

The notation includes a new heartbeat stanza that has the heartbeat region and the cluster name. We use the region and not the device name so as to not force stable and consistent device names across the cluster. The heartbeat device is either an existing ocfs2 volume that the user may later mount, or an ocfs2 volume that is specifically formatted as a heartbeat device using mkfs.ocfs2 -H.

The cluster stanza has a heartbeat mode that can be set to local or global.

A cluster can have up to 32 heartbeat regions.

heartbeat:
        region = 908A022988C34A0DB6BC38C43C6B1461
        cluster = mycluster

heartbeat:
        region = 5678675678ABCDFE309888C34A0DB6B2
        cluster = mycluster

cluster:
        node_count = 10
        heartbeat_mode = global
        name = mycluster

Qs: Is the 32 region limit enough? Yes.

Qs: Should we keep a region_count in the cluster stanza? One advantage would be a quick summary. No.
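The constraints in this section can be sketched as a small consistency check. The structs below are a hypothetical in-memory form of the stanzas, not the tools' actual data structures. The checks that every heartbeat stanza names the cluster and that a region is listed only once are assumptions about what a sensible tool would enforce.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_HB_REGIONS 32   /* "up to 32 heartbeat regions" */

/* Hypothetical in-memory form of one heartbeat stanza. */
struct hb_stanza {
    const char *region;
    const char *cluster;
};

/* Hypothetical in-memory form of the cluster stanza. */
struct cluster_stanza {
    const char *name;
    const char *heartbeat_mode;   /* "local" or "global" */
};

static bool cluster_conf_is_valid(const struct cluster_stanza *cl,
                                  const struct hb_stanza *hb, size_t nr_hb)
{
    bool global = strcmp(cl->heartbeat_mode, "global") == 0;
    bool local = strcmp(cl->heartbeat_mode, "local") == 0;
    size_t i, j;

    if (!global && !local)
        return false;
    if (nr_hb > MAX_HB_REGIONS)
        return false;
    for (i = 0; i < nr_hb; i++) {
        /* assumption: every heartbeat stanza must name this cluster */
        if (strcmp(hb[i].cluster, cl->name) != 0)
            return false;
        /* assumption: a region may be listed only once */
        for (j = 0; j < i; j++)
            if (strcmp(hb[i].region, hb[j].region) == 0)
                return false;
    }
    return true;
}
```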

CONFIGFS ABI

A new file entry, mode, under the heartbeat directory specifies the active heartbeat mode: global for global heartbeat, local for local heartbeat. If this file entry is missing, local heartbeat is assumed to be in effect. The kernel disallows changes to this entry after the first heartbeat region is created.

# cat /sys/kernel/config/cluster/<cluster_name>/heartbeat/mode 
global
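A user-space tool could read this entry roughly as follows. This is a minimal sketch: the enum and function names are hypothetical, and the caller is expected to build the path from the cluster name. A missing file is treated as local heartbeat, as described above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

enum hb_mode { HB_MODE_LOCAL, HB_MODE_GLOBAL, HB_MODE_UNKNOWN };

/* Map the text read from the mode file to a heartbeat mode. */
static enum hb_mode hb_mode_parse(const char *text)
{
    char buf[16];
    size_t len = strcspn(text, "\r\n");   /* ignore the trailing newline */

    if (len >= sizeof(buf))
        return HB_MODE_UNKNOWN;
    memcpy(buf, text, len);
    buf[len] = '\0';
    if (strcmp(buf, "global") == 0)
        return HB_MODE_GLOBAL;
    if (strcmp(buf, "local") == 0)
        return HB_MODE_LOCAL;
    return HB_MODE_UNKNOWN;
}

/* Read the active mode from the given configfs path. */
static enum hb_mode hb_mode_read(const char *path)
{
    char line[32];
    enum hb_mode mode = HB_MODE_UNKNOWN;
    FILE *fp = fopen(path, "r");

    if (!fp)   /* entry missing: local heartbeat is assumed */
        return errno == ENOENT ? HB_MODE_LOCAL : HB_MODE_UNKNOWN;
    if (fgets(line, sizeof(line), fp))
        mode = hb_mode_parse(line);
    fclose(fp);
    return mode;
}
```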

SUPERBLOCK AND FEATURE FLAGS

We have to stamp the (global) heartbeat mode in the superblock. One proposal is to add this mode as part of the cluster info in the superblock. Cluster info has 4 bytes reserved. We can consume 1 byte for stack flags.

A compat feature flag COMPAT_CLUSTERINFO_V2 can be used to add this change. It will be update-able like other feature flags, using tunefs.ocfs2 --fs-features=[no]clusterinfo-v2.

An incompat feature flag INCOMPAT_CLUSTERINFO will be used to make cluster info (with stackflags) usable for both userspace and o2cb cluster stacks. This flag will be update-able like other feature flags, using tunefs.ocfs2 --fs-features=clusterinfo. However, setting this incompat bit will not update the clusterinfo itself. That will only be update-able using tunefs.ocfs2 --update-cluster-stack. This includes the global heartbeat flag, which will be stored as part of the stack flags.

As this new flag is a superset of the existing INCOMPAT_USERSPACE, we will start the process of deprecating the latter. Our scheme will be fully backward compatible.

--fs-features will set/clear INCOMPAT_CLUSTERINFO with all locks taken. When being set, it will clear INCOMPAT_USERSPACE. When being cleared, it will set INCOMPAT_USERSPACE if the clusterinfo shows a non-o2cb stack.
--update-cluster-stack will not set or clear INCOMPAT_CLUSTERINFO. However if the new clusterstack is userspace, it will set INCOMPAT_USERSPACE only if INCOMPAT_CLUSTERINFO has not been set.
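The two rules above can be written out as flag manipulations. This is a sketch only: the bit values are placeholders chosen for illustration and are not the on-disk values, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values for illustration, NOT the on-disk values. */
#define INCOMPAT_USERSPACE    (1u << 0)
#define INCOMPAT_CLUSTERINFO  (1u << 1)

/* --fs-features setting or clearing clusterinfo. */
static uint32_t fs_features_set_clusterinfo(uint32_t incompat, bool enable,
                                            bool stack_is_o2cb)
{
    if (enable) {
        incompat |= INCOMPAT_CLUSTERINFO;
        incompat &= ~INCOMPAT_USERSPACE;
    } else {
        incompat &= ~INCOMPAT_CLUSTERINFO;
        /* fall back to the older flag for a non-o2cb stack */
        if (!stack_is_o2cb)
            incompat |= INCOMPAT_USERSPACE;
    }
    return incompat;
}

/* --update-cluster-stack: never touches INCOMPAT_CLUSTERINFO. */
static uint32_t update_cluster_stack(uint32_t incompat,
                                     bool new_stack_is_userspace)
{
    if (new_stack_is_userspace && !(incompat & INCOMPAT_CLUSTERINFO))
        incompat |= INCOMPAT_USERSPACE;
    /* Other cases are not spelled out in the text; this sketch
     * leaves the flags unchanged for them. */
    return incompat;
}
```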

/* stackflags for o2cb */
#define OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT    (0x01)

struct ocfs2_cluster_info {
/*00*/ __u8   ci_stack[OCFS2_STACK_LABEL_LEN];
        union {
                __le32 ci_reserved;
                struct {
                        __u8 ci_stackflags;
                        __u8 ci_reserved1;
                        __u8 ci_reserved2;
                        __u8 ci_reserved3;
                };
        };
/*08*/ __u8   ci_cluster[OCFS2_CLUSTER_NAME_LEN];
/*18*/
};

But for this to work, we will need the o2cb stack to also honor cluster info. Currently it does not. For that we'll need an incompat flag, INCOMPAT_O2CB_STACK. This feature will be updated using tunefs.ocfs2 --update-cluster-stack.

Global heartbeat requires cluster info because it needs to ensure that the cluster name on disk is the same as the one configured. This will help prevent cross-cluster mounts, as long as the sysadmin is careful to have unique names for each cluster.
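The cluster info layout and the two checks it enables can be sketched in user space as follows. The array lengths are inferred from the offset comments in the struct (00, 08 and 18 hex). The "o2cb" stack label, the NUL-padded cluster name and the helper names are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Lengths inferred from the offset comments in the struct. */
#define OCFS2_STACK_LABEL_LEN   4
#define OCFS2_CLUSTER_NAME_LEN  16
#define OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT  (0x01)

/* User-space mirror; uint8_t/uint32_t stand in for __u8/__le32. */
struct cluster_info_sketch {
    uint8_t ci_stack[OCFS2_STACK_LABEL_LEN];
    union {
        uint32_t ci_reserved;
        struct {
            uint8_t ci_stackflags;
            uint8_t ci_reserved1;
            uint8_t ci_reserved2;
            uint8_t ci_reserved3;
        };
    };
    uint8_t ci_cluster[OCFS2_CLUSTER_NAME_LEN];
};

_Static_assert(offsetof(struct cluster_info_sketch, ci_cluster) == 0x08,
               "cluster name must start at offset 0x08");
_Static_assert(sizeof(struct cluster_info_sketch) == 0x18,
               "cluster info must be 0x18 bytes");

/* The stack flag is only meaningful for the o2cb stack
 * (label "o2cb" is an assumption of this sketch). */
static bool cluster_info_global_heartbeat(const struct cluster_info_sketch *ci)
{
    if (memcmp(ci->ci_stack, "o2cb", OCFS2_STACK_LABEL_LEN) != 0)
        return false;
    return (ci->ci_stackflags & OCFS2_CLUSTER_O2CB_GLOBAL_HEARTBEAT) != 0;
}

/* Compare the on-disk cluster name (assumed NUL-padded) with the
 * configured one to catch cross-cluster mounts. */
static bool cluster_info_name_matches(const struct cluster_info_sketch *ci,
                                      const char *configured)
{
    if (strlen(configured) > OCFS2_CLUSTER_NAME_LEN)
        return false;
    return strncmp((const char *)ci->ci_cluster, configured,
                   OCFS2_CLUSTER_NAME_LEN) == 0;
}
```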

PROBLEM: This scheme has a flaw in that it is cumbersome to change a volume from local heartbeat to global heartbeat. To do it, one needs to first enable compat clusterinfo-v2. Then, start global heartbeat and do --update-cluster-stack. The second part is ok. My problem is with the first part. Currently, adding that compat will require the volume to be mounted with local heartbeat. Imagine a user that forgets to add this compat on one volume after making the transition. The user will have to: umount all the volumes, shut down o2cb, change heartbeat mode to local, start o2cb, run tunefs, shut down o2cb, change heartbeat mode to global....

POSSIBLE SOLUTION: One solution is to have just one incompat flag. INCOMPAT_GLOBAL_HEARTBEAT. Settable using --update-cluster-stack. This incompat means the cluster info is valid.

SUGGESTED SOLUTION: Add one incompat flag, INCOMPAT_CLUSTERINFO. This will indicate the clusterinfo with stackflags on superblock is valid. This incompat flag will be update-able using --fs-features. However, the cluster info itself will only be updated using --update-cluster-stack. As this flag is a superset of INCOMPAT_USERSPACE, we intend to slowly deprecate the latter.

MOUNT OPTION

mount.ocfs2 reads the superblock to detect the cluster stack. Currently it appends certain strings to the mount options to relay the information to the kernel: cluster_stack=xxxx for userspace clustering, heartbeat=local for local heartbeat and heartbeat=none for local mounts. For global heartbeat mounts, it will add heartbeat=global. As usual, the kernel/fs will validate the flag.
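The mapping from detected stack to appended option can be sketched as a simple lookup. The enum and function below are hypothetical and are not the mount.ocfs2 source. Only the option strings themselves come from the text above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical classification of what the superblock describes. */
enum cluster_kind {
    MOUNT_LOCAL_ONLY,   /* local (non-clustered) mount */
    O2CB_LOCAL_HB,      /* o2cb with local heartbeat */
    O2CB_GLOBAL_HB,     /* o2cb with global heartbeat */
    USERSPACE_STACK     /* userspace clustering */
};

/* Write the option to append into out; returns 0 on success and
 * -1 on bad input or truncation. stack is only used for
 * USERSPACE_STACK. */
static int hb_mount_option(enum cluster_kind kind, const char *stack,
                           char *out, size_t outlen)
{
    int n;

    switch (kind) {
    case USERSPACE_STACK:
        if (!stack)
            return -1;
        n = snprintf(out, outlen, "cluster_stack=%s", stack);
        break;
    case O2CB_LOCAL_HB:
        n = snprintf(out, outlen, "heartbeat=local");
        break;
    case O2CB_GLOBAL_HB:
        n = snprintf(out, outlen, "heartbeat=global");
        break;
    case MOUNT_LOCAL_ONLY:
        n = snprintf(out, outlen, "heartbeat=none");
        break;
    default:
        return -1;
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```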

JOIN DOMAIN

The cluster stack has to ensure all nodes are using the same set of heartbeat devices. The best place for this check is during dlm join domain. Currently join domain in O2DLM comprises two messages: DLM_QUERY_JOIN_MSG, to query and learn the dlm and fs protocol versions, followed by DLM_ASSERT_JOINED_MSG.

We introduce two new messages, DLM_QUERY_NODEINFO and DLM_QUERY_HBREGION, that will be sent in-between the two. The change will require bumping up the minor dlm protocol version.

DLM_QUERY_HBREGION

This message is used to compare the heartbeat regions. The message includes the crc32_le() of the region name and not the region name itself. This reduces the packet size from 2084 bytes to 164 bytes. The join domain process fails if the joining node does not have the same regions as the existing nodes.

#define O2NM_MAX_REGIONS       32

struct dlm_query_region {
       u8 qr_node;
       u8 qr_numregions;
       u8 qr_namelen;
       u8 pad1;
       u8 qr_domain[O2NM_MAX_NAME_LEN];
       u8 qr_regions[O2HB_MAX_REGION_NAME_LEN * O2NM_MAX_REGIONS];
};
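The checksum comparison described for this message could look roughly like the user-space sketch below. The bitwise CRC routine is written to behave like the kernel's crc32_le(seed, buf, len) and is not the kernel code. The seed of 0, the sorting step and all function names are assumptions made for illustration. Note that a 32-bit checksum can in principle collide for two different region names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-32, reflected polynomial 0xEDB88320, with an explicit
 * seed and no final inversion. */
static uint32_t crc32_le_sketch(uint32_t crc, const unsigned char *p,
                                size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ ((crc & 1) ? 0xEDB88320u : 0);
    }
    return crc;
}

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Fill out[] with the sorted checksums of n region names.
 * Seed 0 is an assumption of this sketch. */
static void region_crcs(const char *const *names, size_t n, uint32_t *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = crc32_le_sketch(0, (const unsigned char *)names[i],
                                 strlen(names[i]));
    qsort(out, n, sizeof(*out), cmp_u32);
}

/* The join is allowed only if both nodes hold the same region set. */
static bool regions_match(const uint32_t *a, size_t na,
                          const uint32_t *b, size_t nb)
{
    return na == nb && memcmp(a, b, na * sizeof(*a)) == 0;
}
```

Sorting the checksums before comparing makes the check independent of the order in which each node lists its regions.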

DLM_QUERY_NODEINFO

This message is used to compare the nodes: node numbers, ip, port, etc. While this message is not a must for global heartbeat, we are adding it as it is long overdue. The join domain process fails if the joining node does not have the same nodes as the existing nodes. This packet is 2076 bytes.

struct dlm_node_info {
       u8 ni_nodenum;
       u8 pad1;
       u16 ni_ipv4_port;
       u32 ni_ipv4_address;
};

struct dlm_query_nodeinfo {
       u8 qn_nodenum;
       u8 qn_numnodes;
       u8 qn_namelen;
       u8 pad1;
       u8 qn_domain[O2NM_MAX_NAME_LEN];
       struct dlm_node_info qn_nodes[O2NM_MAX_NODES];
};
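The node comparison could be sketched as follows. The struct is a user-space mirror of dlm_node_info with fixed-width types, and the helper names are hypothetical. The sketch assumes node numbers are unique within each list.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of struct dlm_node_info. */
struct node_info_sketch {
    uint8_t  ni_nodenum;
    uint8_t  pad1;
    uint16_t ni_ipv4_port;
    uint32_t ni_ipv4_address;
};

static const struct node_info_sketch *
find_node(const struct node_info_sketch *nodes, size_t n, uint8_t nodenum)
{
    for (size_t i = 0; i < n; i++)
        if (nodes[i].ni_nodenum == nodenum)
            return &nodes[i];
    return NULL;
}

/* True if the joining node lists exactly the nodes we know, each
 * with the same address and port. */
static bool nodeinfo_matches(const struct node_info_sketch *local,
                             size_t nlocal,
                             const struct node_info_sketch *remote,
                             size_t nremote)
{
    if (nlocal != nremote)
        return false;
    for (size_t i = 0; i < nremote; i++) {
        const struct node_info_sketch *l =
            find_node(local, nlocal, remote[i].ni_nodenum);

        if (!l || l->ni_ipv4_port != remote[i].ni_ipv4_port ||
            l->ni_ipv4_address != remote[i].ni_ipv4_address)
            return false;
    }
    return true;
}
```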

QUORUM

For disk heartbeat, o2hb maintains a global view of live nodes. A node is considered alive if it is heartbeating on any one device. That part remains unchanged.

However, the self-fencing code needs to be enhanced to force a fence only if the node loses io to 50% or more of the devices.

As the devices can be added/removed on-line, the quorum calculation should ignore newly added devices until it detects all the nodes heartbeating on it. Removal, however, should work as is. The quorum calculation will work with one less device and thus have a lower threshold to self-fence.
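A quorum check that skips newly added devices could be sketched like this. The per-region state struct and the function name are hypothetical and do not reflect how o2hb tracks regions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-region state. */
struct hb_region_state {
    bool settled;     /* all nodes have been seen heartbeating on it */
    bool io_failed;   /* our heartbeat io to this region timed out */
};

/* True if this node must self-fence: io has failed on 50% or more
 * of the regions that count towards quorum. */
static bool quorum_must_fence(const struct hb_region_state *r, size_t n)
{
    size_t counted = 0, failed = 0;

    for (size_t i = 0; i < n; i++) {
        if (!r[i].settled)
            continue;   /* newly added region: ignored for now */
        counted++;
        if (r[i].io_failed)
            failed++;
    }
    return counted > 0 && 2 * failed >= counted;
}
```

Removing a region simply drops its entry from the array, so the same calculation then runs over one device fewer.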

STATUS

The patches are now in Joel's merge_window for 2.6.37. The corresponding tools patches are in the ocfs2-tools newo2cb branch.

However, there are a few issues that still need to be resolved.

In tools, scandisk filtering/sorting only returns sd and dm devices. We need to add SCANORDER and SCANEXCLUDE logic.
o2net needs to pin the node item, using config_depend_item(), before initiating the connection. It also needs to remove the pin after the connection is dropped.
o2net should be able to handle first connect failure gracefully. The first connect can fail because the ip address is wrong or iptables is running, etc. Currently, if such a thing were to happen, o2quo will fence the box. This does not happen in local heartbeat mode because dlm_join_domain()'s timeout is less than the quorum timeout.
In global heartbeat mode, we have to pin the heartbeat regions when the cluster is active. This is tricky because we need to determine when it is safe to shut down the heartbeat. Meaning o2hb needs to be able to differentiate between an o2dlm callback and an o2net callback.

CHANGES

Jun 18 2010, Sunil Mushran

Initial version

Jun 27 2010, Sunil Mushran

Replace COMPAT_CLUSTERINFO_V2 and INCOMPAT_O2CB_STACK with INCOMPAT_CLUSTERINFO

Oct 18 2010, Sunil Mushran

Status added with 4 action items
