[Ocfs2-users] OCFS2 v1.4 hangs
John Murphy
john.murphy at mandac.eu
Fri Jun 5 00:54:31 PDT 2009
Hi Karim,
Excellent information, many thanks indeed.
Best Regards
John
On Fri, 2009-06-05 at 00:24 +0300, Karim Alkhayer wrote:
> Hi John,
>
>
>
> When multiple systems/nodes have access to data via shared storage,
> the integrity of the data depends on inter-node communication ensuring
> that each node is aware when other nodes are writing data. When the
> coordination between the nodes fails, it results in a “split brain”
> condition: a situation in which two servers try to control the
> storage independently, potentially resulting in application failure or
> even corruption of critical data.
>
>
>
> I/O fencing is the method of choice (used by vendors' cluster
> frameworks, including OCFS2) for protecting critical data against
> corruption. It allows a set of systems to hold temporary registrations
> with the disk and to coordinate a write-exclusive reservation with the
> disk containing the data. With I/O fencing, the cluster ensures that
> errant nodes are “fenced” and do not have access to the shared storage,
> while the eligible node(s) continue to have access to the data,
> virtually eliminating the risk of data corruption.
>
>
>
> The quorum is the group of nodes in a cluster that is allowed to
> operate on the shared storage. When there is a failure in the cluster,
> nodes may be split into groups that can communicate within their own
> group and with the shared storage, but not with the other groups.
>
>
>
> O2QUO determines which group is allowed to continue and initiates
> fencing of the other group(s).
>
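A minimal sketch of how a quorum decision of this kind can be made. It assumes the rule usually described for O2QUO: a group keeps quorum if it can reach more than half of the nodes still heartbeating to the shared disk, and in an exact half/half split the group holding the lowest-numbered node wins. The function name and arguments are hypothetical and exist only for illustration; this is not the O2QUO code itself.

```python
def has_quorum(visible_nodes, heartbeating_nodes):
    """Decide whether a group of nodes may keep using the shared storage.

    visible_nodes:      node numbers this node can reach over the network,
                        including itself (hypothetical input)
    heartbeating_nodes: node numbers still heartbeating to the shared disk
                        (hypothetical input)
    """
    heartbeating = set(heartbeating_nodes)
    if not heartbeating:
        return False  # nothing is alive, so nobody holds quorum

    # Only nodes that are both reachable and still heartbeating count.
    visible = set(visible_nodes) & heartbeating

    if 2 * len(visible) > len(heartbeating):
        return True  # clear majority
    if 2 * len(visible) == len(heartbeating):
        # Assumed tie-break: the half containing the lowest node number wins.
        return min(heartbeating) in visible
    return False  # minority group: this node should fence itself


if __name__ == "__main__":
    all_nodes = {0, 1, 2, 3}
    # One node cut off from the other three in a four-node cluster.
    print(has_quorum({3}, all_nodes))
    # The three nodes that can still see each other.
    print(has_quorum({0, 1, 2}, all_nodes))
```

Under these assumptions, a single node that loses contact with the other three in a four-node cluster sees only itself, falls below the majority, and is the one expected to fence.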
> Fencing is the act of forcefully removing a node from a cluster. A
> node with OCFS2 mounted will fence itself when it realizes that it
> does not have quorum in a degraded cluster. It does this so that other
> nodes won’t be stuck trying to access its resources. However, the
> resources do NOT get released.
>
>
>
> O2CB uses a node reset mechanism to fence; this, however, is causing
> the machine(s) to hang instead of handing over seamlessly. In OCFS2 1.4,
> Oracle has introduced a new fencing mechanism which no longer uses
> “panic” for fencing. Instead, by default, it uses "machine restart".
>
>
>
> In your case, taking the network down the way you’ve done is causing
> the servers to hang, including the mounted file system, which stays
> locked until the OCFS2 cluster services are restarted.
>
>
>
> RAC handover fails due to exactly this problem: the file system is
> locked by another node which was kicked out of the cluster but is still
> occupying the file system.
>
> The healthy node will try to continue to work, but the databases
> hosted on the occupied file system will hang, and possibly the
> machine as well. At this time there is no solution but to:
>
> - Force a shutdown of the troublesome node(s)
>
> - Shut down the database processes
>
> - Restart the OCFS2 services
>
>
>
> Network failures can be handled gracefully if you have set up
> network bonding for the interconnects, which is highly recommended.
>
>
>
> Best regards,
>
> Karim Alkhayer
>
>
>
>
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of John Murphy
> Sent: Thursday, June 04, 2009 10:15 PM
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] OCFS2 v1.4 hangs
>
>
>
> I have four database servers in a high-availability, load-balancing
> configuration. Each machine has a mount to a common data source which
> is an OCFS2 v1.4 file system. While working on three of the servers, I
> restarted the IP network and found afterwards that the fourth machine
> had hung. I could not reboot it and could not unmount the ocfs2
> partitions. I am pretty sure this was all caused by my taking down the
> network on all three of the remaining machines. Can anyone shed some
> light on this for me?
>
> Ironically, I have four machines in order to ensure reliability.
>
>
>
> TIA
>
> John
>
> --
> John Murphy
> Technical And Managing Director
> MANDAC Ltd
> Kandoy House
> 2 Fairview Strand
> Dublin 3
> p: +353 1 5143001
> m: +353 85 711 6844
> e: john.murphy at mandac.eu
> w: www.mandac.eu