[Ocfs2-tools-commits] jlbec commits r1007 - in
branches/global-heartbeat: debugfs.ocfs2 documentation
documentation/o2cb ocfs2_hb_ctl
svn-commits at oss.oracle.com
Tue Aug 2 16:30:38 CDT 2005
Author: jlbec
Date: 2005-08-02 16:30:36 -0500 (Tue, 02 Aug 2005)
New Revision: 1007
Added:
branches/global-heartbeat/documentation/o2cb/
branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt
branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt
Modified:
branches/global-heartbeat/debugfs.ocfs2/commands.c
branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c
Log:
o Teach ocfs2_hb_ctl to resolve device->uuid mapping.
o Document the rationale behind heartbeat's scope and the use of a
global heartbeat
o Document how a global heartbeat works in conjunction with O2CB's
  heartbeat configuration.
Modified: branches/global-heartbeat/debugfs.ocfs2/commands.c
===================================================================
--- branches/global-heartbeat/debugfs.ocfs2/commands.c 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/debugfs.ocfs2/commands.c 2005-08-02 21:30:36 UTC (rev 1007)
@@ -396,7 +396,7 @@
}
flags = gbls.allow_write ? OCFS2_FLAG_RW : OCFS2_FLAG_RO;
- flags |= OCFS2_FLAG_HEARTBEAT_DEV_OK;
+ flags |= OCFS2_FLAG_HEARTBEAT_DEV_OK;
ret = ocfs2_open(dev, flags, 0, 0, &gbls.fs);
if (ret) {
gbls.fs = NULL;
Added: branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt
===================================================================
--- branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt 2005-08-02 21:30:36 UTC (rev 1007)
@@ -0,0 +1,175 @@
+
+[ Heartbeat Configuration ]
+
+
+[ Introduction ]
+
+This document describes the way O2CB handles configuration of
+heartbeating from the perspective of a running cluster. O2CB tries to
+make the most of simple invocations, so the administrator doesn't
+have all that much to do. This document tries to be high-level,
+describing what is done but not the exact command syntaxes, etc. At the
+end is an appendix that actually maps concepts to syntaxes (because it
+needs to be documented somewhere).
+
+
+[ Terms ]
+
+"heartbeat"
+-----------
+A method by which O2CB nodes can notify others they are alive. O2CB
+considers heartbeat a liveness test *only*. If one node is seen via
+one heartbeat method and not another, other nodes can still determine
+that the first node is running. Any decisions on fencing, quorum,
+etc. can be made by other layers.
+
+"method"
+--------
+The method by which nodes notify each other. Currently "disk" is the
+only supported method, though others have been proposed. A cluster
+may have more than one method in use at a time.
+
+"heartbeat region" or "region"
+------------------------------
+A specific place that heartbeat is, well, heartbeating. A region uses
+one particular method (again, only "disk" is currently supported). A
+cluster may have multiple regions at once, offering redundancy.
+
+"UUID"
+------
+A UUID, or Universally Unique Identifier, is how heartbeat names
+regions. How a UUID maps to the actual resource is up to the method and
+layout of the resource.
+
+"local heartbeat" or "local"
+----------------------------
+Consumers of heartbeat expect to start their own heartbeat region that
+is "local" to the resources they are using or managing. No heartbeat
+is started at O2CB startup. The consumers are wholly responsible for
+the state of heartbeat when they require it.
+
+"global heartbeat" or "global"
+------------------------------
+All consumers of heartbeat are expecting the heartbeat to be started
+at O2CB startup. They will make no effort to start heartbeat of their
+own accord. This requires the administrator to configure specific
+regions to be started when the cluster is brought online.
+
+"layout" and "layout driver"
+----------------------------
+O2CB has no idea what the resource backing a heartbeat region looks
+like, nor should it. There can be multiple ways of doing it, and
+codifying it in O2CB is folly. For local heartbeat, the consumer knows
+their own resource, and can act accordingly. However, global heartbeat
+must know who understands the resource. The way a region maps to the
+resource is called the layout, and the program that understands the
+mapping is the layout driver. O2CB can ask the layout driver to
+start and stop the region, and need know nothing else about the
+backing resource. O2CB refers to the region exclusively by UUID, and it
+is up to the layout driver to determine the resource referred to.
+Currently, the only known layout is that provided by OCFS2.
+
+
+[ Rationale ]
+
+Administrators and users want configuration to be easy or nonexistent.
+They would often rather avoid something than learn complex steps. O2CB
+takes this to heart, and administrators should choose local heartbeat
+for simple setups. In fact, administrators using the ocfs2console
+utility to configure their cluster will have local heartbeat enabled
+unless they choose otherwise. Local heartbeat requires no manual steps
+for each heartbeat region. The heartbeat consumers create, start, stop,
+and otherwise manage their own heartbeat needs. Administrators can just
+let it work.
+
+Take OCFS2 for example. An administrator wants to do the normal
+filesystem tasks: mkfs, mount, and umount. OCFS2 knows how to create a
+heartbeat region inside the filesystem (in other words, an ocfs2 layout
+of this particular heartbeat disk resource) during mkfs. At mount, OCFS2
+knows how to start heartbeat on that region. At umount, OCFS2 knows how
+to stop heartbeat. The heartbeat is completely hidden behind the normal
+operation of the filesystem. O2CB does nothing other than make sure the
+heartbeat subsystem is loaded.
+
+The problem is that local heartbeat doesn't always scale. One OCFS2
+filesystem has one heartbeat region. But 30 OCFS2 filesystems have 30
+heartbeat regions, one per filesystem. Each filesystem (that is, each
+disk resource) has no idea the other regions exist. Every heartbeat
+interval, 30 heartbeats happen, and 30 regions must collect the
+heartbeats from the other nodes. This takes a toll on the I/O
+subsystem, and the cost becomes noticeable in system resource usage.
+
+Enter the global heartbeat. O2CB is now responsible for starting and
+stopping heartbeat regions. Consumers need not (and must not) manage
+heartbeat regions. They can assume (though they should verify it) that
+heartbeat is working. Now, one region on one resource can provide the
+node liveness information to all consumers. Multiple resources can be
+used for redundancy, but the number is tied to the requirements of the
+installation, not to the total number of resources.
+
+However, this creates some required configuration. A local heartbeat
+can be automatically managed by the consumer, because the consumer knows
+intrinsically what to do. But O2CB cannot know which regions are to be
+started. O2CB cannot know which two resources to use out of 30. The
+administrator must tell O2CB what regions to use.
+
+The administrator has two tasks to perform. First, they must select
+global heartbeating. Second, they must add at least one region to the
+global heartbeat. They do this by specifying the affected cluster, the
+UUID of the region, and the layout of the region. In some cases, a
+device may be specified at configuration time, allowing the layout
+driver to resolve the device->UUID mapping for O2CB.
+
+In turn, O2CB must be able to start a region based on its UUID and
+layout. The layout driver is responsible for taking the UUID,
+determining the resource it describes, and starting heartbeat on that
+region. The requirements section of this document mostly concerns
+itself with this interaction.
+
+
+[ Requirements ]
+
+A cluster MUST have a heartbeat configuration. If not, O2CB WILL error
+when asked about the cluster's heartbeat.
+
+If the cluster is in local mode, O2CB WILL ensure the heartbeat
+subsystem driver is loaded, but WILL NOT make any attempt to
+start, stop, or manage heartbeat regions. Consumer software MUST
+manage, start, and stop regions associated with a particular resource.
+
+If the cluster is in global mode, O2CB WILL start configured heartbeat
+regions when the cluster is brought online, and WILL stop configured
+heartbeat regions when the cluster is brought offline. Consumer
+software MUST NOT attempt to start, stop, or manage regions that are
+configured for use by O2CB.
+
+Consumer software MUST provide a layout driver. The driver program
+MUST be named <layout>_hb_ctl, where <layout> is replaced by the layout
+named in the heartbeat configuration. That driver MUST be able to start
+and stop a region given only the UUID.
+
+The layout driver MAY provide the ability to map a device name to a
+UUID. If this is provided, O2CB WILL add a region by device name.
+
+
+[ Determining the Heartbeat Mode From Consumer Programs ]
+
+Consumers that require heartbeat must check the mode before starting a
+region. They can query the heartbeat configuration via the o2cb_hb_config
+program. See the manual page for o2cb_hb_config for the appropriate
+syntax.
+
+[ Layout Driver Interaction ]
+
+A layout driver is an executable program. It must support the following
+syntax:
+
+o To start a region: <layout>_hb_ctl -S -u <uuid>
+o To stop a region: <layout>_hb_ctl -K -u <uuid>
+
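+As a purely illustrative sketch (and not part of O2CB itself), a
+caller following this convention might build the driver name and
+invoke it roughly as follows.  The helper name run_layout_driver is
+hypothetical:
+
+    /* Sketch only: assumes the layout driver is in the PATH and
+     * follows the <layout>_hb_ctl naming rule described above. */
+    #include <limits.h>
+    #include <stdio.h>
+    #include <sys/types.h>
+    #include <sys/wait.h>
+    #include <unistd.h>
+
+    static int run_layout_driver(const char *layout, const char *action,
+                                 const char *uuid)
+    {
+        char driver[PATH_MAX];
+        pid_t pid;
+        int status;
+
+        /* Build "<layout>_hb_ctl" from the configured layout name */
+        snprintf(driver, sizeof(driver), "%s_hb_ctl", layout);
+
+        pid = fork();
+        if (pid < 0)
+            return -1;
+        if (!pid) {
+            /* action is "-S" to start or "-K" to stop the region */
+            execlp(driver, driver, action, "-u", uuid, (char *)NULL);
+            _exit(127);
+        }
+        if (waitpid(pid, &status, 0) < 0)
+            return -1;
+        return (WIFEXITED(status) && !WEXITSTATUS(status)) ? 0 : -1;
+    }
+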
+If the layout driver provides the ability to map a device name to a
+UUID, it must support the following syntax:
+
+o To resolve <device> to a UUID: <layout>_hb_ctl -L -d <device>
+
+This command must output the UUID on a single line.
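+
+Similarly, here is a purely illustrative sketch of resolving a device
+to its UUID through the -L syntax.  The helper name
+resolve_device_uuid is hypothetical; it assumes only the single-line
+output described above.
+
+    /* Sketch only: run "<layout>_hb_ctl -L -d <device>" and read
+     * the UUID from its single line of output. */
+    #include <stdio.h>
+    #include <string.h>
+
+    static int resolve_device_uuid(const char *layout, const char *device,
+                                   char *uuid, int len)
+    {
+        char cmd[1024];
+        FILE *fp;
+        int rc = -1;
+
+        snprintf(cmd, sizeof(cmd), "%s_hb_ctl -L -d %s", layout, device);
+
+        fp = popen(cmd, "r");
+        if (!fp)
+            return -1;
+
+        if (fgets(uuid, len, fp)) {
+            /* Strip the trailing newline from the single-line output */
+            uuid[strcspn(uuid, "\n")] = '\0';
+            rc = 0;
+        }
+
+        pclose(fp);
+        return rc;
+    }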
Added: branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt
===================================================================
--- branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt 2005-08-02 21:30:36 UTC (rev 1007)
@@ -0,0 +1,199 @@
+
+[ The Scope of Heartbeat ]
+
+
+[ Introduction ]
+
+This document describes how a cluster-wide heartbeat interacts with the
+quorum and fencing requirements of quorate shared resources, such as
+shared-disk filesystems. This document does not describe the
+implementation details of quorum decisions, but rather tries to clarify
+heartbeat's role in the process.
+
+
+[ Terms ]
+
+"heartbeat"
+-----------
+A method by which cluster nodes can notify others they are alive.
+Heartbeat is a liveness test *only*. If one node is seen via one
+heartbeat method and not another, a second node can still determine
+that the first node is running. This definition, and the
+implications thereof, are the reason for this document.
+
+"method"
+--------
+The method by which nodes notify each other. Common methods include
+writing to a shared disk, communication over a network, or a direct
+serial cable link.
+
+"heartbeat region" or "region"
+------------------------------
+A specific place that heartbeat is, well, heartbeating. A region uses
+one particular method on one particular resource. A cluster may have
+multiple regions at once, offering redundancy.
+
+"fencing"
+---------
+Disabling a node's ability to access a shared resource. When problems
+occur, some nodes must not be allowed to access and possibly damage a
+shared resource. Fencing prevents this from happening. The most common
+form of fencing in Linux is STONITH, short for "Shoot The Other Node In
+The Head." This is the most extreme method of fencing, as it forces a
+complete reset of the fenced node.
+
+
+[ Rationale ]
+
+What happens when a node stops heartbeating on a region? How do
+consumers of the cluster's resources decide what to do next? This is a
+complex problem, and the true source of complexity in designing clusters
+and clustered applications.
+
+The real question is, what do consumer applications want and need? In
+the end, they don't care whether the other node is up or down. It
+absolutely does not matter to them. A consumer application only cares
+whether it can share access to a resource in a safe and consistent
+fashion.
+
+For emphasis: A consumer application ONLY CARES whether it can share
+access to a resource in a safe and consistent fashion.
+
+From the perspective of a consumer application, quorum and fencing are
+useful for enforcing that safety. How they accomplish that goal is
+relatively unimportant. Whether the other nodes are alive or dead is
+unimportant.
+
+Shared Resource Scenarios
+-------------------------
+Look at a shared-disk filesystem. The shared resource is the disk. The
+filesystem driver is the consumer application on each node. Under
+normal operation, the filesystem driver arbitrates access to the shared
+disk. Each node takes its turn, respecting what the other nodes are
+doing.
+
+Assume, for the moment, that the heartbeat method and region are
+separate and independent of the shared disk (network heartbeat, a
+different heartbeat disk, etc). Let's look at a few scenarios. In the
+scenarios, "happynode" is a node that is running normally. "sadnode" is
+the node that has just had a problem.
+
+o Losing Access to the Shared Resource
+--------------------------------------
+What if sadnode loses access to the shared disk? That is, a normal
+write operation to the disk receives an I/O error. On happynode,
+access to this shared disk is proceeding just fine. Note that all forms
+of communication are functioning properly. The heartbeat shows all
+nodes are alive everywhere. Network communication between the
+filesystem drivers is also working. happynode has no way of knowing
+that sadnode has a problem.
+
+In other words, heartbeat cannot do anything here. Nor should it.
+There is nothing useful for it to do. happynode and sadnode can still
+successfully arbitrate access to the resource, and that access is still
+safe.
+
+What about sadnode's inability to write to the shared disk? That has to
+be handled, of course. sadnode's immediate response must be the same as
+any other filesystem's: take the filesystem to read-only mode. Any pending
+writes and updates must be discarded. This is how all filesystems
+everywhere work, because without the ability to write to the disk, the
+updates cannot be committed to storage.
+
+But now, sadnode holds some locks and other state that it cannot write
+out. happynode needs to know this, so that happynode can recover
+sadnode's state. The naive approach is to fence sadnode. This is
+especially disastrous in the case of STONITH, as all of sadnode's
+responsibilities are aborted. Fencing is not needed, as sadnode has
+prevented itself from writing. If sadnode notifies happynode that it
+has gone read-only, happynode has all the information needed to start
+recovery.
+
+sadnode and happynode can now continue. Access to the shared disk is
+still safe. While sadnode is no longer taking part in arbitration,
+sadnode is also not making any changes. This means that happynode's
+accesses are safe. sadnode's other processes can continue as if nothing
+has happened, and online reintroduction of sadnode to the shared
+resource could even commence after some intervention to repair the
+fault.
+
+o Losing Access to one Heartbeat Region of Many
+-----------------------------------------------
+What if sadnode can no longer successfully access one of many heartbeat
+regions? That is, a write operation to the region either fails silently
+or returns an I/O error. The end result is that happynode sees no more
+heartbeats from sadnode on the one region. happynode and sadnode still
+see each other on a different heartbeat region. The filesystem drivers
+can still communicate. The nodes can still arbitrate safe access to the
+shared disk.
+
+There is nothing to do here except log the error. All operation can
+continue normally. There is no danger to any operation. Fencing of any
+sort would be detrimental.
+
+o Losing Access to all Heartbeat Regions
+----------------------------------------
+What if sadnode only had one heartbeat region and could no longer
+successfully access it? Or what if sadnode had many regions and could
+access none of them? That is, write operations fail silently or return
+an error. The end
+result is that happynode sees no heartbeats from sadnode.
+
+This is virtually indistinguishable from sadnode crashing completely.
+sadnode may well have gotten I/O errors and done everything it can to
+clean itself up. However, there is no way for happynode to know if
+sadnode is alive and unable to heartbeat or dead and crashed.
+
+Here, fencing is the only appropriate thing. Because happynode does not
+know sadnode's state, happynode cannot consider access to the shared
+disk to be safe. Arbitration cannot happen. As such, sadnode must be
+prevented from accessing the shared disk.
+
+What form fencing takes is unimportant from the perspective of the
+filesystem driver. All the driver cares about is that sadnode cannot
+access the shared disk. Once that is assured, happynode can recover
+sadnode's state and continue with normal operation.
+
+Things are very different from the system perspective. The form of
+fencing is very important. A STONITH approach is strongly discouraged.
+sadnode may have many responsibilities, only a few of which are affected
+by this cluster problem. Some I/O subsystems support fencing at that
+level. That is, sadnode would be prevented from sending I/O requests to
+the I/O subsystem. So, sadnode would be unable to access the shared
+disks, but would be able to continue all processes that do not use the
+shared disks. This prevents sadnode from unsafely accessing the shared
+resource and allows online repair of the problem.
+
+o Losing Communication to Peers
+-------------------------------
+In this scenario, heartbeat is working fine, but happynode's consumer
+application is unable to talk to sadnode. Without the ability to
+communicate, happynode and sadnode cannot arbitrate access to the shared
+disk.
+
+Again, fencing is required (though the STONITH vs I/O fencing argument
+still applies). However, we now have an important question to ask:
+which node is having the problem? In our example, it is sadnode.
+Perhaps the ethernet cable was pulled. Perhaps sadnode's switch has
+failed. The problem is that the software doesn't know. Each node
+thinks it is OK and that the other guy is missing.
+
+This is where quorum comes in. There must be some way to decide which
+node is happynode and which is sadnode. It becomes more important (and
+more complex) when there are more than two nodes.
+
+Somehow, a decision is reached, and the losing node or nodes are fenced.
+The remaining happynodes recover the sadnode state, and continue on with
+life.
+
+Conclusion
+----------
+Notice that, in all of the scenarios above, the question of specific
+heartbeat regions was completely unimportant. From the perspective of
+the consumer application, all heartbeat is good for is node up/down
+information. As long as the node appears in one heartbeat region, the
+higher-level logic knows that the machine is running. The rest of the
+decisions can be made without heartbeat's interaction.
+
+Thus, it is unimportant whether a heartbeat region is on the shared
+resource itself or not. It is also unimportant when heartbeating to one
+of many regions fails.
Modified: branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c
===================================================================
--- branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c 2005-08-02 21:30:36 UTC (rev 1007)
@@ -53,6 +53,7 @@
HB_ACTION_START,
HB_ACTION_STOP,
HB_ACTION_REFINFO,
+ HB_ACTION_LIST,
};
struct hb_ctl_options {
@@ -310,6 +311,66 @@
return err;
}
+static errcode_t list_dev(const char *dev,
+ struct hb_ctl_options *hbo)
+{
+ int len;
+ char *device;
+
+ if (region_desc) {
+ fprintf(stderr, "We have a descriptor already!\n");
+ free_desc();
+ }
+
+ len = strlen(DEV_PREFIX) + strlen(dev) + 1;
+ device = malloc(sizeof(char) * len);
+ if (!device)
+ return OCFS2_ET_NO_MEMORY;
+ snprintf(device, len, DEV_PREFIX "%s", dev);
+
+ /* Any problem with getting the descriptor is NOT FOUND */
+ if (get_desc(device))
+ goto out;
+
+ fprintf(stdout, "%s:%s\n", region_desc->r_name, device);
+
+ free_desc();
+
+out:
+ free(device);
+
+ /* Always return NOT_FOUND, which means continue */
+ return OCFS2_ET_FILE_NOT_FOUND;
+}
+
+static int run_list(struct hb_ctl_options *hbo)
+{
+ int ret = 0;
+ errcode_t err;
+ char hbuuid[33];
+
+ if (hbo->dev_str) {
+ err = get_uuid(hbo->dev_str, hbuuid);
+ if (err) {
+ com_err(progname, err,
+ "while reading uuid from device \"%s\"",
+ hbo->dev_str);
+ ret = -EINVAL;
+ } else {
+ fprintf(stdout, "%s\n", hbuuid);
+ }
+ } else {
+ err = scan_devices(list_dev, hbo);
+ if (err && (err != OCFS2_ET_FILE_NOT_FOUND)) {
+ com_err(progname, err,
+ "while listing devices");
+ ret = -EIO;
+ }
+ }
+
+ return ret;
+}
+
static int read_options(int argc, char **argv, struct hb_ctl_options *hbo)
{
int c, ret;
@@ -317,7 +378,7 @@
ret = 0;
while(1) {
- c = getopt(argc, argv, "ISKd:u:h");
+ c = getopt(argc, argv, "ISKLd:u:h");
if (c == -1)
break;
@@ -334,6 +395,10 @@
hbo->action = HB_ACTION_START;
break;
+ case 'L':
+ hbo->action = HB_ACTION_LIST;
+ break;
+
case 'd':
if (optarg)
hbo->dev_str = strdup(optarg);
@@ -385,6 +450,11 @@
ret = -EINVAL;
break;
+ case HB_ACTION_LIST:
+ if (hbo->uuid_str)
+ ret = -EINVAL;
+ break;
+
case HB_ACTION_UNKNOWN:
ret = -EINVAL;
break;
@@ -407,6 +477,7 @@
fprintf(output, " %s -K -u <uuid>\n", progname);
fprintf(output, " %s -I -d <device>\n", progname);
fprintf(output, " %s -I -u <uuid>\n", progname);
+ fprintf(output, " %s -L [-d <device>]\n", progname);
fprintf(output, " %s -h\n", progname);
}
@@ -441,6 +512,11 @@
goto bail;
}
+ if (hbo.action == HB_ACTION_LIST) {
+ ret = run_list(&hbo);
+ goto bail;
+ }
+
err = o2cb_init();
if (err) {
com_err(progname, err, "Cannot initialize cluster\n");