[Ocfs2-tools-commits] zab commits r1098 - trunk/documentation
svn-commits at oss.oracle.com
Mon Oct 3 17:03:31 CDT 2005
Author: zab
Date: 2005-10-03 17:03:30 -0500 (Mon, 03 Oct 2005)
New Revision: 1098
Modified:
trunk/documentation/ocfs2_faq.txt
Log:
o add some entries to the faq that describe ocfs2's net quorum and fencing
Modified: trunk/documentation/ocfs2_faq.txt
===================================================================
--- trunk/documentation/ocfs2_faq.txt 2005-10-01 01:06:31 UTC (rev 1097)
+++ trunk/documentation/ocfs2_faq.txt 2005-10-03 22:03:30 UTC (rev 1098)
@@ -417,3 +417,54 @@
As the journal is shutdown before this broadcast, any node crash
after this point is ignored as there is no need for recovery.
==============================================================================
+
+Quorum and Fencing
+------------------
+
+Q01 What is a quorum?
+A01 A quorum is a designation given to a group of nodes in a cluster which
+ are still allowed to operate on shared storage. It comes up when
+ there is a failure in the cluster which breaks the nodes up
+ into groups which can communicate in their groups and with the
+ shared storage but not between groups.
+
+
+Q02 How do OCFS2's cluster services define a quorum?
+A02 Each node makes the quorum decision for itself, based on the number
+ of other nodes that are considered alive by heartbeating and the number
+ of other nodes that are reachable via the network.
+
+ A node has quorum when:
+ * it sees an odd number of heartbeating nodes and has network
+ connectivity to more than half of them.
+ or
+ * it sees an even number of heartbeating nodes and has network
+ connectivity to at least half of them *and* has connectivity to
+ the heartbeating node with the lowest node number.
+
+Q03 What is fencing?
+A03 Fencing is the act of forcefully removing a node from a cluster.
+ A node with OCFS2 mounted will fence itself when it realizes that it
+ doesn't have quorum in a degraded cluster. It does this so that other
+ nodes won't get stuck trying to access its resources. Currently OCFS2
+ will panic the machine when it realizes it has to fence itself
+ off from the cluster. As described in Q02, it will do this when it
+ sees more nodes heartbeating than it has connectivity to and fails
+ the quorum test.
+
+Q04 How does a node decide that it has connectivity with another?
+A04 When a node sees another come to life via heartbeating it will try
+ to establish a TCP connection to that newly live node. It considers
+ that other node connected as long as the TCP connection persists and
+ the connection is not idle for 10 seconds. Once that TCP connection
+ is closed or idle it will not be reestablished until heartbeat thinks
+ the other node has died and come back alive.
+
+Q05 How long does the quorum process take?
+A05 First a node will realize that it doesn't have connectivity with
+ another node. This can happen immediately if the connection is closed
+ but can take a maximum of 10 seconds of idle time. Then the node
+ must wait long enough to give heartbeating a chance to declare the
+ node dead. This is tunable, but defaults to 18 seconds. By default,
+ then, a maximum of 28 seconds can pass from the time a network fault
+ occurs until a node fences itself.
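
The quorum test in A02 and the timing in A05 can be sketched in a few
lines of Python. This is only an illustration of the FAQ text, not the
OCFS2 implementation: the function names, the use of sets of node numbers,
and the assumption that both sets include the local node (which counts as
connected to itself) are all hypothetical. The even case follows the A02
wording literally.

```python
# Illustrative sketch of the FAQ's quorum rule; not OCFS2's actual code.
# Assumption: both sets contain node numbers and include the local node.

IDLE_TIMEOUT_S = 10      # connection idle limit quoted in A04/A05
HEARTBEAT_DEAD_S = 18    # default heartbeat death timeout quoted in A05


def has_quorum(heartbeating, connected):
    """Return True if the local node passes the quorum test from A02.

    heartbeating: node numbers seen alive via heartbeat.
    connected:    node numbers reachable over the network.
    """
    alive = set(heartbeating)
    # Only connections to nodes that are still heartbeating count.
    reachable = set(connected) & alive
    if not alive:
        return False

    count = len(alive)
    if count % 2 == 1:
        # Odd number of heartbeating nodes: need more than half.
        return 2 * len(reachable) > count

    # Even number: at least half, and the lowest-numbered heartbeating
    # node must be among the reachable ones.
    return 2 * len(reachable) >= count and min(alive) in reachable


def worst_case_fence_delay(idle_timeout_s=IDLE_TIMEOUT_S,
                           heartbeat_dead_s=HEARTBEAT_DEAD_S):
    """Upper bound, in seconds, from network fault to self-fence (A05)."""
    return idle_timeout_s + heartbeat_dead_s


if __name__ == "__main__":
    # Four nodes split two and two: only the half holding node 0 survives.
    print(has_quorum({0, 1, 2, 3}, {0, 1}))
    print(has_quorum({0, 1, 2, 3}, {2, 3}))
    print(worst_case_fence_delay())
```

In the even split, the tie is broken by the lowest node number, so exactly
one half of the cluster keeps quorum and the other half fences itself.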