[Ocfs2-tools-commits] jlbec commits r1007 - in
branches/global-heartbeat: debugfs.ocfs2 documentation
documentation/o2cb ocfs2_hb_ctl
svn-commits at oss.oracle.com
Tue Aug 2 16:30:38 CDT 2005
Author: jlbec
Date: 2005-08-02 16:30:36 -0500 (Tue, 02 Aug 2005)
New Revision: 1007
Added:
branches/global-heartbeat/documentation/o2cb/
branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt
branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt
Modified:
branches/global-heartbeat/debugfs.ocfs2/commands.c
branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c
Log:
o Teach ocfs2_hb_ctl to resolve device->uuid mapping.
o Document the rationale behind heartbeat's scope and the use of a
global heartbeat
o Document how a global heartbeat works in conjunction with O2CB's
  heartbeat configuration.
Modified: branches/global-heartbeat/debugfs.ocfs2/commands.c
===================================================================
--- branches/global-heartbeat/debugfs.ocfs2/commands.c 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/debugfs.ocfs2/commands.c 2005-08-02 21:30:36 UTC (rev 1007)
@@ -396,7 +396,7 @@
}
flags = gbls.allow_write ? OCFS2_FLAG_RW : OCFS2_FLAG_RO;
- flags |= OCFS2_FLAG_HEARTBEAT_DEV_OK;
+ flags |= OCFS2_FLAG_HEARTBEAT_DEV_OK;
ret = ocfs2_open(dev, flags, 0, 0, &gbls.fs);
if (ret) {
gbls.fs = NULL;
Added: branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt
===================================================================
--- branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/documentation/o2cb/heartbeat-configuration.txt 2005-08-02 21:30:36 UTC (rev 1007)
@@ -0,0 +1,175 @@
+
+[ Heartbeat Configuration ]
+
+
+[ Introduction ]
+
+This document describes the way O2CB handles configuration of
+heartbeating from the perspective of a running cluster. O2CB tries to
+make the most of simple invocations, so the administrator doesn't
+have all that much to do. This document tries to be high-level,
+describing what is done but not the exact command syntaxes, etc. At the
+end is an appendix that actually maps concepts to syntaxes (because it
+needs to be documented somewhere).
+
+
+[ Terms ]
+
+"heartbeat"
+-----------
+A method by which O2CB nodes can notify others they are alive. O2CB
+considers heartbeat a liveness test *only*. If one node is seen via
+one heartbeat method and not another, other nodes can still determine
+that the first node is running. Any decisions on fencing, quorum,
+etc. can be made by other layers.
+
+"method"
+--------
+The method by which nodes notify each other. Currently "disk" is the
+only supported method, though others have been proposed. A cluster
+may have more than one method in use at a time.
+
+"heartbeat region" or "region"
+------------------------------
+A specific place that heartbeat is, well, heartbeating. A region uses
+one particular method (again, only "disk" is currently supported). A
+cluster may have multiple regions at once, offering redundancy.
+
+"UUID"
+------
+A UUID, or Universally Unique Identifier, is how heartbeat names
+regions. How a UUID maps to the actual resource is up to the method and
+layout of the resource.
+
+"local heartbeat" or "local"
+----------------------------
+Consumers of heartbeat expect to start their own heartbeat region that
+is "local" to the resources they are using or managing. No heartbeat
+is started at O2CB startup. The consumers are wholly responsible for
+the state of heartbeat when they require it.
+
+"global heartbeat" or "global"
+------------------------------
+All consumers of heartbeat are expecting the heartbeat to be started
+at O2CB startup. They will make no effort to start heartbeat of their
+own accord. This requires the administrator to configure specific
+regions to be started when the cluster is brought online.
+
+"layout" and "layout driver"
+----------------------------
+O2CB has no idea what the resource backing a heartbeat region looks
+like, nor should it. There can be multiple ways of doing it, and
+codifying it in O2CB is folly. For local heartbeat, the consumer knows
+their own resource, and can act accordingly. However, global heartbeat
+must know who understands the resource. The way a region maps to the
+resource is called the layout, and the program that understands the
+mapping is the layout driver. O2CB can ask the layout driver to
+start and stop the region, and need know nothing else about the
+backing resource. O2CB refers to the region exclusively by UUID, and it
+is up to the layout driver to determine the resource referred to.
+Currently, the only known layout is that provided by OCFS2.
+
+
+[ Rationale ]
+
+Administrators and users want configuration to be easy or nonexistent.
+They would often rather avoid something than learn complex steps. O2CB
+takes this to heart, and administrators should choose local heartbeat
+for simple setups. In fact, administrators using the ocfs2console
+utility to configure their cluster will have local heartbeat enabled
+unless they choose otherwise. Local heartbeat requires no manual steps
+for each heartbeat region. The heartbeat consumers create, start, stop,
+and otherwise manage their own heartbeat needs. Administrators can just
+let it work.
+
+Take OCFS2 for example. An administrator wants to do the normal
+filesystem tasks: mkfs, mount, and umount. OCFS2 knows how to create a
+heartbeat region inside the filesystem (in other words, an ocfs2 layout
+of this particular heartbeat disk resource) during mkfs. At mount, OCFS2
+knows how to start heartbeat on that region. At umount, OCFS2 knows how
+to stop heartbeat. The heartbeat is completely hidden behind the normal
+operation of the filesystem. O2CB does nothing other than make sure the
+heartbeat subsystem is loaded.
+
+The problem is that local heartbeat doesn't always scale. One OCFS2
+filesystem has one heartbeat region. But 30 OCFS2 filesystems have 30
+heartbeat regions, one per filesystem. Each filesystem (that is, each
+disk resource) has no idea the other regions exist. Every heartbeat
+interval, 30 heartbeats happen, and 30 regions must collect the
+heartbeats from the other nodes. This takes a toll on the I/O
+subsystem, and the cost becomes noticeable in system resource usage.
+
+Enter the global heartbeat. O2CB is now responsible for starting and
+stopping heartbeat regions. Consumers need not (and must not) manage
+heartbeat regions. They can assume (though they should verify it) that
+heartbeat is working. Now, one region on one resource can provide the
+node liveness information to all consumers. Multiple resources can be
+used for redundancy, but the number is tied to the requirements of the
+installation, not to the total number of resources.
+
+However, this creates some required configuration. A local heartbeat
+can be automatically managed by the consumer, because the consumer knows
+intrinsically what to do. But O2CB cannot know which regions are to be
+started. O2CB cannot know which two resources to use out of 30. The
+administrator must tell O2CB what regions to use.
+
+The administrator has two tasks to perform. First, they must select
+global heartbeating. Second, they must add at least one region to the
+global heartbeat. They do this by specifying the affected cluster, the
+UUID of the region, and the layout of the region. In some cases, a
+device may be specified at configuration time, allowing the layout
+driver to resolve the device->UUID mapping for O2CB.
+
+In turn, O2CB must be able to start a region based on its UUID and
+layout. The layout driver is responsible for taking the UUID,
+determining the resource it describes, and starting heartbeat on that
+region. The requirements section of this document mostly concerns
+itself with this interaction.
+
+
+[ Requirements ]
+
+A cluster MUST have a heartbeat configuration. If not, O2CB WILL error
+when asked about the cluster's heartbeat.
+
+If the cluster is in local mode, O2CB WILL ensure the heartbeat
+subsystem driver is loaded, but WILL NOT make any attempt to
+start, stop, or manage heartbeat regions. Consumer software MUST
+manage, start, and stop regions associated with a particular resource.
+
+If the cluster is in global mode, O2CB WILL start configured heartbeat
+regions when the cluster is brought online, and WILL stop configured
+heartbeat regions when the cluster is brought offline. Consumer
+software MUST NOT attempt to start, stop, or manage regions that are
+configured for use by O2CB.
+
+Consumer software MUST provide a layout driver. The driver program
+MUST be named <layout>_hb_ctl, where <layout> is replaced by the layout
+named in the heartbeat configuration. That driver MUST be able to start
+and stop a region given only the UUID.
+
+The layout driver MAY provide the ability to map a device name to a
+UUID. If this is provided, O2CB WILL add a region by device name.
+
+
+[ Determining the Heartbeat Mode From Consumer Programs ]
+
+Consumers that require heartbeat must check the mode before starting a
+region. They can query the heartbeat configuration via the o2cb_hb_config
+program. See the manual page for o2cb_hb_config for the appropriate
+syntax.
+
+[ Layout Driver Interaction ]
+
+A layout driver is an executable program. It must support the following
+syntax:
+
+o To start a region: <layout>_hb_ctl -S -u <uuid>
+o To stop a region: <layout>_hb_ctl -K -u <uuid>
+
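+As a purely illustrative sketch (and not part of O2CB itself), a
+caller following this convention might build the driver name and
+invoke it roughly as follows.  The helper name run_layout_driver is
+hypothetical:
+
+    /* Sketch only: assumes the layout driver is in the PATH and
+     * follows the <layout>_hb_ctl naming rule described above. */
+    #include <limits.h>
+    #include <stdio.h>
+    #include <sys/types.h>
+    #include <sys/wait.h>
+    #include <unistd.h>
+
+    static int run_layout_driver(const char *layout, const char *action,
+                                 const char *uuid)
+    {
+        char driver[PATH_MAX];
+        pid_t pid;
+        int status;
+
+        /* Build "<layout>_hb_ctl" from the configured layout name */
+        snprintf(driver, sizeof(driver), "%s_hb_ctl", layout);
+
+        pid = fork();
+        if (pid < 0)
+            return -1;
+        if (!pid) {
+            /* action is "-S" to start or "-K" to stop the region */
+            execlp(driver, driver, action, "-u", uuid, (char *)NULL);
+            _exit(127);
+        }
+        if (waitpid(pid, &status, 0) < 0)
+            return -1;
+        return (WIFEXITED(status) && !WEXITSTATUS(status)) ? 0 : -1;
+    }
+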
+If the layout driver provides the ability to map a device name to a
+UUID, it must support the following syntax:
+
+o To resolve <device> to a UUID: <layout>_hb_ctl -L -d <device>
+
+This command must output the UUID on a single line.
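+
+Similarly, here is a purely illustrative sketch of resolving a device
+to its UUID through the -L syntax.  The helper name
+resolve_device_uuid is hypothetical; it assumes only the single-line
+output described above.
+
+    /* Sketch only: run "<layout>_hb_ctl -L -d <device>" and read
+     * the UUID from its single line of output. */
+    #include <stdio.h>
+    #include <string.h>
+
+    static int resolve_device_uuid(const char *layout, const char *device,
+                                   char *uuid, int len)
+    {
+        char cmd[1024];
+        FILE *fp;
+        int rc = -1;
+
+        snprintf(cmd, sizeof(cmd), "%s_hb_ctl -L -d %s", layout, device);
+
+        fp = popen(cmd, "r");
+        if (!fp)
+            return -1;
+
+        if (fgets(uuid, len, fp)) {
+            /* Strip the trailing newline from the single-line output */
+            uuid[strcspn(uuid, "\n")] = '\0';
+            rc = 0;
+        }
+
+        pclose(fp);
+        return rc;
+    }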
Added: branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt
===================================================================
--- branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/documentation/o2cb/heartbeat-scope.txt 2005-08-02 21:30:36 UTC (rev 1007)
@@ -0,0 +1,199 @@
+
+[ The Scope of Heartbeat ]
+
+
+[ Introduction ]
+
+This document describes how a cluster-wide heartbeat interacts with the
+quorum and fencing requirements of quorate shared resources, such as
+shared-disk filesystems. This document does not describe the
+implementation details of quorum decisions, but rather tries to clarify
+heartbeat's role in the process.
+
+
+[ Terms ]
+
+"heartbeat"
+-----------
+A method by which cluster nodes can notify others they are alive.
+Heartbeat is a liveness test *only*. If one node is seen via one
+heartbeat method and not another, a second node can still determine
+that the first node is running. This definition, and the
+implications thereof, are the reason for this document.
+
+"method"
+--------
+The method by which nodes notify each other. Common methods include
+writing to a shared disk, communication over a network, or a direct
+serial cable link.
+
+"heartbeat region" or "region"
+------------------------------
+A specific place that heartbeat is, well, heartbeating. A region uses
+one particular method on one particular resource. A cluster may have
+multiple regions at once, offering redundancy.
+
+"fencing"
+---------
+Disabling a node's ability to access a shared resource. When problems
+occur, some nodes must not be allowed to access and possibly damage a
+shared resource. Fencing prevents this from happening. The most common
+form of fencing in Linux is STONITH, short for "Shoot The Other Node In
+The Head." This is the most extreme method of fencing, as it forces a
+complete reset of the fenced node.
+
+
+[ Rationale ]
+
+What happens when a node stops heartbeating on a region? How do
+consumers of the cluster's resources decide what to do next? This is a
+complex problem, and the true source of complexity in designing clusters
+and clustered applications.
+
+The real question is, what do consumer applications want and need? In
+the end, they don't care whether the other node is up or down. It
+absolutely does not matter to them. A consumer application only cares
+whether it can share access to a resource in a safe and consistent
+fashion.
+
+For emphasis: A consumer application ONLY CARES whether it can share
+access to a resource in a safe and consistent fashion.
+
+From the perspective of a consumer application, quorum and fencing are
+useful for enforcing that safety. How they accomplish that goal is
+relatively unimportant. Whether the other nodes are alive or dead is
+unimportant.
+
+Shared Resource Scenarios
+-------------------------
+Look at a shared-disk filesystem. The shared resource is the disk. The
+filesystem driver is the consumer application on each node. Under
+normal operation, the filesystem driver arbitrates access to the shared
+disk. Each node takes its turn, respecting what the other nodes are
+doing.
+
+Assume, for the moment, that the heartbeat method and region are
+separate and independent of the shared disk (network heartbeat, a
+different heartbeat disk, etc). Let's look at a few scenarios. In the
+scenarios, "happynode" is a node that is running normally. "sadnode" is
+the node that has just had a problem.
+
+o Losing Access to the Shared Resource
+--------------------------------------
+What if sadnode loses access to the shared disk? That is, a normal
+write operation to the disk receives an I/O error. On happynode,
+access to this shared disk is proceeding just fine. Note that all forms
+of communication are functioning properly. The heartbeat shows all
+nodes are alive everywhere. Network communication between the
+filesystem drivers is also working. happynode has no way of knowing
+that sadnode has a problem.
+
+In other words, heartbeat cannot do anything here. Nor should it.
+There is nothing useful for it to do. happynode and sadnode can still
+successfully arbitrate access to the resource, and that access is still
+safe.
+
+What about sadnode's inability to write to the shared disk? That has to
+be handled, of course. sadnode's immediate response must be the same as
+any other filesystem's: take the filesystem to read-only mode. Any pending
+writes and updates must be discarded. This is how all filesystems
+everywhere work, because without the ability to write to the disk, the
+updates cannot be committed to storage.
+
+But now, sadnode holds some locks and other state that it cannot write
+out. happynode needs to know this, so that happynode can recover
+sadnode's state. The naive approach is to fence sadnode. This is
+especially disastrous in the case of STONITH, as all of sadnode's
+responsibilities are aborted. Fencing is not needed, as sadnode has
+prevented itself from writing. If sadnode notifies happynode that it
+has gone read-only, happynode has all the information needed to start
+recovery.
+
+sadnode and happynode can now continue. Access to the shared disk is
+still safe. While sadnode is no longer taking part in arbitration,
+sadnode is also not making any changes. This means that happynode's
+accesses are safe. sadnode's other processes can continue as if nothing
+has happened, and online reintroduction of sadnode to the shared
+resource could even commence after some intervention to repair the
+fault.
+
+o Losing Access to one Heartbeat Region of Many
+-----------------------------------------------
+What if sadnode can no longer successfully access one of many heartbeat
+regions? That is, a write operation to the region either fails silently
+or returns an I/O error. The end result is that happynode sees no more
+heartbeats from sadnode on the one region. happynode and sadnode still
+see each other on a different heartbeat region. The filesystem drivers
+can still communicate. The nodes can still arbitrate safe access to the
+shared disk.
+
+There is nothing to do here except log the error. All operation can
+continue normally. There is no danger to any operation. Fencing of any
+sort would be detrimental.
+
+o Losing Access to all Heartbeat Regions
+----------------------------------------
+What if sadnode only had one heartbeat region and could no longer
+successfully access it? Or what if sadnode had many regions and could
+access none of them? That is, write operations fail silently or return
+an error. The end
+result is that happynode sees no heartbeats from sadnode.
+
+This is virtually indistinguishable from sadnode crashing completely.
+sadnode may well have gotten I/O errors and done everything it can to
+clean itself up. However, there is no way for happynode to know if
+sadnode is alive and unable to heartbeat or dead and crashed.
+
+Here, fencing is the only appropriate thing. Because happynode does not
+know sadnode's state, happynode cannot consider access to the shared
+disk to be safe. Arbitration cannot happen. As such, sadnode must be
+prevented from accessing the shared disk.
+
+What form fencing takes is unimportant from the perspective of the
+filesystem driver. All the driver cares about is that sadnode cannot
+access the shared disk. Once that is assured, happynode can recover
+sadnode's state and continue with normal operation.
+
+Things are very different from the system perspective. The form of
+fencing is very important. A STONITH approach is strongly discouraged.
+sadnode may have many responsibilities, only a few of which are affected
+by this cluster problem. Some I/O subsystems support fencing at that
+level. That is, sadnode would be prevented from sending I/O requests to
+the I/O subsystem. So, sadnode would be unable to access the shared
+disks, but would be able to continue all processes that do not use the
+shared disks. This prevents sadnode from unsafely accessing the shared
+resource and allows online repair of the problem.
+
+o Losing Communication to Peers
+-------------------------------
+In this scenario, heartbeat is working fine, but happynode's consumer
+application is unable to talk to sadnode. Without the ability to
+communicate, happynode and sadnode cannot arbitrate access to the shared
+disk.
+
+Again, fencing is required (though the STONITH vs I/O fencing argument
+still applies). However, we now have an important question to ask:
+which node is having the problem? In our example, it is sadnode.
+Perhaps the ethernet cable was pulled. Perhaps sadnode's switch has
+failed. The problem is that the software doesn't know. Each node
+thinks it is OK and that the other guy is missing.
+
+This is where quorum comes in. There must be some way to decide which
+node is happynode and which is sadnode. It becomes more important (and
+more complex) when there are more than two nodes.
+
+Somehow, a decision is reached, and the losing node or nodes are fenced.
+The remaining happynodes recover the sadnode state, and continue on with
+life.
+
+Conclusion
+----------
+Notice that, in all of the scenarios above, the question of specific
+heartbeat regions was completely unimportant. From the perspective of
+the consumer application, all heartbeat is good for is node up/down
+information. As long as the node appears in one heartbeat region, the
+higher-level logic knows that the machine is running. The rest of the
+decisions can be made without heartbeat's interaction.
+
+Thus, it is unimportant whether a heartbeat region is on the shared
+resource itself or not. It is also unimportant when heartbeating to one
+of many regions fails.
Modified: branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c
===================================================================
--- branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c 2005-08-02 21:27:31 UTC (rev 1006)
+++ branches/global-heartbeat/ocfs2_hb_ctl/ocfs2_hb_ctl.c 2005-08-02 21:30:36 UTC (rev 1007)
@@ -53,6 +53,7 @@
HB_ACTION_START,
HB_ACTION_STOP,
HB_ACTION_REFINFO,
+ HB_ACTION_LIST,
};
struct hb_ctl_options {
@@ -310,6 +311,66 @@
return err;
}
+static errcode_t list_dev(const char *dev,
+ struct hb_ctl_options *hbo)
+{
+ int len;
+ char *device;
+
+ if (region_desc) {
+ fprintf(stderr, "We have a descriptor already!\n");
+ free_desc();
+ }
+
+ len = strlen(DEV_PREFIX) + strlen(dev) + 1;
+ device = malloc(sizeof(char) * len);
+ if (!device)
+ return OCFS2_ET_NO_MEMORY;
+ snprintf(device, len, DEV_PREFIX "%s", dev);
+
+ /* Any problem with getting the descriptor is NOT FOUND */
+ if (get_desc(device))
+ goto out;
+
+ fprintf(stdout, "%s:%s\n", region_desc->r_name, device);
+
+ free_desc();
+
+out:
+ free(device);
+
+ /* Always return NOT_FOUND, which means continue */
+ return OCFS2_ET_FILE_NOT_FOUND;
+}
+
+static int run_list(struct hb_ctl_options *hbo)
+{
+ int ret = 0;
+ errcode_t err;
+ char hbuuid[33];
+
+ if (hbo->dev_str) {
+ err = get_uuid(hbo->dev_str, hbuuid);
+ if (err) {
+ com_err(progname, err,
+ "while reading uuid from device \"%s\"",
+ hbo->dev_str);
+ ret = -EINVAL;
+ } else {
+ fprintf(stdout, "%s\n", hbuuid);
+ }
+ } else {
+ err = scan_devices(list_dev, hbo);
+ if (err && (err != OCFS2_ET_FILE_NOT_FOUND)) {
+ com_err(progname, err,
+ "while listing devices");
+ ret = -EIO;
+ }
+ }
+
+ return ret;
+}
+
static int read_options(int argc, char **argv, struct hb_ctl_options *hbo)
{
int c, ret;
@@ -317,7 +378,7 @@
ret = 0;
while(1) {
- c = getopt(argc, argv, "ISKd:u:h");
+ c = getopt(argc, argv, "ISKLd:u:h");
if (c == -1)
break;
@@ -334,6 +395,10 @@
hbo->action = HB_ACTION_START;
break;
+ case 'L':
+ hbo->action = HB_ACTION_LIST;
+ break;
+
case 'd':
if (optarg)
hbo->dev_str = strdup(optarg);
@@ -385,6 +450,11 @@
ret = -EINVAL;
break;
+ case HB_ACTION_LIST:
+ if (hbo->uuid_str)
+ ret = -EINVAL;
+ break;
+
case HB_ACTION_UNKNOWN:
ret = -EINVAL;
break;
@@ -407,6 +477,7 @@
fprintf(output, " %s -K -u <uuid>\n", progname);
fprintf(output, " %s -I -d <device>\n", progname);
fprintf(output, " %s -I -u <uuid>\n", progname);
+ fprintf(output, " %s -L [-d <device>]\n", progname);
fprintf(output, " %s -h\n", progname);
}
@@ -441,6 +512,11 @@
goto bail;
}
+ if (hbo.action == HB_ACTION_LIST) {
+ ret = run_list(&hbo);
+ goto bail;
+ }
+
err = o2cb_init();
if (err) {
com_err(progname, err, "Cannot initialize cluster\n");