[Ocfs2-devel] [RFC] Senority based Quorum Formation

Thu Jun 8 22:06:40 CDT 2006

Goals:
   * Fast algorithm for cluster formation and seniority takeover
   * Minimize nondeterminism of cluster management algorithms
   * Stick to tried and true network protocols: tcp and udp
   * Recovery time from node failure as short as possible
   * Simple and suitable for kernel implementation
   * Support pluggable heartbeating and quorum calculation
   * Also support user space plugin methods

Cluster quorum formation is a notoriously difficult process, prone to races
and deadlock.  I present a simple algorithm to do the job with a pleasant
balance of simplicity, generality, reliability and speed.  To accomplish
this, I generalize the notion of heartbeating and integrate heartbeating
with the quorum formation and recovery algorithm.  I define the notion of
a senior node which is the final arbiter of cluster decisions, while
taking care to ensure that this senior node does not become a bottleneck
or a single point of failure.  I define a line of succession of senority so
that if a senior node fails its duties can be taken over deterministically
by the next node in line of senority, with minimal or no loss of cluster
availability.

The plan is to confine nondeterministic elements of cluster management
entirely to the algorithm described here, so that higher lever cluster
algorithms such as lock recovery and service failover are able to operate
deterministically.  This should allow the removal of a considerable amount
of code now in OCFS2 that is dedicated to coping with nondeterminism.

The algorithm described here is surprisingly simple considering what it
accomplishes and can be described in terms of one type of event source and
three event handlers.  Senior node failover requires less than two
heartbeat periods and typically should not cause any loss of cluster
availability.

Senior node
-----------

The senior node of a cluster is the final arbiter of global synchronization
decisions.  Most cluster decisions can in fact be delegated, leaving the
only truly global decision as, which node to delegate to.  The availability
of a senior node allows alows such a decision to be computed quickly
by a single authority, the senior node.  The process of appointing a senior
node is nondeterministic but runs in bounded time.

Each node in the cluster has a tcp connection to the senior node used to
communicate membership and other cluster events, and is also a factor in
determining relative node seniority.

It is possible for a senior node to lose communication with its members,
in which case the cluster may be unable to carry out certain actions
such as fencing or membership changes until a new senior is appointed.
So it is important to appoint a new senior quickly to avoid service
interruptions.

In a large cluster the senior node might delegate some cluster functions to
other nodes while retaining the authority to reassign such functions.  In
a simplified cluster the senior node will perform a number of
administrative functions itself:

   * Heartbest
   * Fencing
   * DLM recovery master
   * Membership

Each of these functions is performed efficiently so that in moderately
sized clusters these duties do not create a significant extra load on the
senior node in addition to the the workload of a normal cluster member.

Line of Succession
------------------

If the senior node of a cluster fails, the next node in line of succession
takes over its duties.  Line of succession is determined by the order of
members in the membership list, where the senior node is always at the head
of the list and new cluster members join at the tail of the list.

Note: other criteria for determining seniority are certainly possible, but
it is not clear that any alternative provides a better combination of
simplicity and stability.  In particular, node number might be used to
determine senority, but then:

   * If a new, lower numbered node joins the cluster, senority would
     need to be handed over to it, an unnecessary disruption

   * The configuration file order dependency would introduce unnecessary
     restrictions on online updating of the configuration file

So for the time being, line of succession of seniority is the order in
which nodes joined the cluster.

One can imagine situations in which an administrator might wish to modify
the line of succession of a running cluster.  Such a feature is easily
implemented but is not discussed here.

Relative seniority
------------------

Relative seniority is used to accelerate the process of forming a new
quorate cluster and in recovering from failure of a senior node.

Relative seniority of two nodes is affected by whether the two nodes are
now or were members of a quorum, their relative positions in line of
succession of seniority, and their relative positions in a global
configuration file.  (Note: the notion of global configuration will be
pluggable too, at some point.)

Node A is more senior than node B if:

   - A is a member of a quorum and B is not

   - A was a member of a quorum and B was not

   - Both A and B are members of a quorum and A is ahead of B in
     line of succession

   - Both A and B were members of a quorum and A is ahead of B in
     line of succession

   - Neither A nor B are or were members of a quorum and A is
     listed before B in the global cluster configuration file

Quorum
------

The senior node determines whether it has a quorum on the basis of nodes
connected to itself to which it has successfully downloaded a membership
list, and which are live according to heartbeat responses.  The details of
the quorum calculation should be configurable, however for now they are
not.  The cluster formation algorithm below does not rely on any such
details.

The senior node may only fence a nonresponsive node if it has quorum.  This
rule prevents nonquorate groups from fencing each other.

When a senior node achieves quorum it sends a quorum event to each of its
members, which is material to the algorithm below.  Future quorum events
such as losing or regaining quorum are not material to the algorithm
below, though some other cluster services may be able to make use of this
information.  Note that the notion of quorum is inherently racy and
algorithms that rely on it are likely to inherit this raciness.  The
notion of senior node as defined here however is not racy: there can never
be two senior nodes with quorum at the same time, or if there can be then I
made a mistake in the algorithm.

Member Modes
------------

If a node is not a senior node and does not currently have a tcp connection
to a senior node then it is in one of two modes:

    1) Cluster Formation mode: has never been a member of a quorum
    2) Senior Takeover mode: senior node of its quorum has failed

On initial bringup each node node is in formation mode.  After forming a
quorate cluster, if the senior node of the cluster fails then the node
enters takeover mode.  A third mode, the one we hope the node will be in
most of the time, is normal operation.  In normal operation, only the
senior node heartbeats other nodes.  In the other two modes the node's
strategy is to initiate its own heartbeat in order to advertise the fact
that it is available to form a cluster.

Heartbeating
------------

Here, we want the heartbeat process to do more than just help determine
node liveness: we also want to broadcast some state information to be
used by the cluster membership algorithms.  A heartbeating node
therefore sends a record containing at least the following information:

   - Address:port at which the heartbeating node will accept a tcp
     connection from some other node.

   - Node state:
       1) cluster formation
       2) senior of quorum
       3) Senior takeover

OCFS2 currently implements a disk-based heartbeat that seems to be modeled
on a similar scheme implemented in Veritas clusters.  Compared to a
network-based heartbeat this is a bad idea:

   - Average latency of a disk is typically much higher than a network

   - If our shared storage is exported over the network then the
     latency of the disk is added to the latency of the network

   - Typically disk is the bottleneck so a disk based heartbeat must
     use a very slow period in order to avoid generating too much
     extra seeking.

   - Our synchronization logic runs over the network, so how does
     heartbeating the disk tell us the network is live?

   - If we want to know if the disk is unresponsive, our cluster
     nodes can easily report that over the network

There is a case where heartbeating a shared disk is useful, and that is
where the disk is to be treated as a quorum device.  An algorithm similar
to heartbeating must be used to do this accurately.  Note that this
technique is not useful for a distributed storage cluster because the
storage itself may suffer a network split.

What we probably want to do with OCFS2 heartbeating is recast the
algorithm as a quorum method (after inventing a quorum plugin harness) and
switch to simple, udp-based heartbeating.  This will allow the heartbeat
to run much faster and reduce recovery latency.  We should be aiming at
recovery latency in the 500 millisecond range, far less than is practical
today.

All that said, OCFS2's current heartbeat implementation will suffice to
implement the algorithm described here, even though I wrote the rfc as if
the heartbeat were udp.

Cluster formation and Senior Takeover Algorithm
-----------------------------------------------

These two algorithms were initially designed separately.  However they
turned out to be similar enough to become a single algorithm.  The only
difference is the means by which seniority is calculated.  In senior
takeover mode the node knows the line of succession of seniority and
wishes to connect to the next live node in that line of succession.
Otherwise the node is less particular because it does not have any
valuable cluster state to preserve.

When a node first attempts to join a cluster it is in cluster formation
mode.  If a node that is a member of a quorate cluster fails to receive a
heartbeat from its senior node (with network latency and a safety factor
taken into account) then it enters senior takeover mode.

The algorithm can now be described in terms of events:

   - On entering formation or takeover mode a node begins to broadcast
     its own heartbeat, which includes its tcp contact address and port
     and whether or not it is in takeover (former quorum member) mode.

   - When a node receives a heartbeat from a node more senior than
     itself it connects to that node and stops heartbeating.

   - When a node receives a heartbeat from an even more senior node, it
     drops its connection to the former most senior and connects to the
     new most senior node.

   - When a senior node receives a new connection it downloads its
     member list to the connecting node.  This new membership list
     places the former senior node at the end of the list (the old
     senior node is still a member of the cluster because it has not yet
     been fenced).  The connecting node treats the new membership list
     as tentative until quorum is achieved, then it discards the old
     list, otherwise it retains it in order to determine seniority.

During senior takeover all cluster operations can still proceed except
those specific to the senior node, including fencing.  Since fencing is not
possible then lock recovery is not possible.  So it is important that
senior takeover be accomplished as rapidly as possible.  This algorithm
accomplishes takeover in a little more than one heartbeat period, the one
heartbeat being the minimum required to determine that takeover should
begin.

Notes
-----

* In its simplest form this algorithm may be prone to a thundering herd of
heartbeats effect, which in turn might generate excessive connections and
disconnections.  The thundering herd effect is easily avoided at the expense
of introducing a little more latency in some cases: a node in takeover mode
just delays its start of heartbeating by an amount that depends on its
position in line of succession.  This way, the node next in line of
succession will try to take over the senior role first and, except in the
case of multiple node failures, each node will only reconnect once.

* We should accomodate voluntary handover of the senior node role even
faster than takeover, so that a senior node can leave the cluster without
disrupting availability of the services it was providing.  This is detail
is left for later.

* It is important to preserve the invariant that a new cluster cannot be
formed faster than a node can be fenced, otherwise a senior node with
quorum cannot be sure that it is the only senior node with quorum.  This
issue is not addressed here.

Q: Under what conditions does a node respond to a heartbeat?
A: If the heartbeat comes from a senior to which it is connected

Q: What stops an old senior from fencing everybody because it isn't
receiving heartbeat responses?
A: It has lost quorum because some members have disconnected, otherwise
it has every right to fence them.

Q: What if a new quorum is formed?
A: we aren't getting heartbeats from any of the nodes in that new quorum
except the senior, which has quorum state set so all members disconnect
from the nonquorate senior and connect to the quorate senior.

Q: Should a node that has lost senior be able to calculate loss of quorum?
A: What would it use this information for?  It only needs to calculate
quorum if it is senior, and it will tell its member nodes.

Q: In takeover mode does a node need to check for nodes not responding
to pings?
A: Same as above, why should it care.  If it becomes senior then it cares.