[Ocfs2-users] OCFS2 v1.4 hangs

John Murphy john.murphy at mandac.eu
Fri Jun 5 00:54:31 PDT 2009


Hi Karim,

Excellent information, many thanks indeed. 

Best Regards

John


On Fri, 2009-06-05 at 00:24 +0300, Karim Alkhayer wrote:
> Hi John,
> 
>  
> 
> When multiple systems/nodes have access to data via shared storage,
> the integrity of the data depends on inter-node communication ensuring
> that each node is aware when other nodes are writing data. When the
> coordination between the nodes fails, the result is a “split brain”
> condition: a situation in which two servers try to control the storage
> independently, potentially resulting in application failure or even
> corruption of critical data.
> 
>  
> 
> I/O fencing is a method of choice (used by vendors' cluster
> frameworks, including OCFS2) for ensuring the integrity of critical
> information. It prevents data corruption by allowing a set of systems
> to hold temporary registrations with the disk and to coordinate a
> write-exclusive reservation with the disk containing the data. With
> I/O fencing, the cluster system ensures that errant nodes are “fenced”
> and do not have access to the shared storage, while the eligible
> node(s) continue to have access to the data, virtually eliminating the
> risk of data corruption.
> 
>  
> 
> The quorum is the group of nodes in a cluster that is allowed to
> operate on the shared storage. When there is a failure in the cluster,
> nodes may be split into groups that can communicate in their groups
> and with the shared storage but not between groups.
> 
>  
> 
> O2QUO determines which group is allowed to continue and initiates
> fencing of the other group(s).
> 
> Fencing is the act of forcefully removing a node from a cluster. A
> node with OCFS2 mounted will fence itself when it realizes that it
> does not have quorum in a degraded cluster. It does this so that other
> nodes won’t be stuck trying to access its resources. However, the
> resources do NOT get released.
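
The quorum rule described above can be sketched in a few lines of Python. This is only a simplified model of the rule as the OCFS2 documentation describes it (a majority of heartbeating nodes, with ties in an even-sized cluster going to the group that holds the lowest node number); the function name and the set-based inputs are made up for illustration and are not part of O2QUO.

```python
def has_quorum(heartbeating, connected):
    """Return True if the local node may keep running (simplified model).

    heartbeating: set of node numbers still seen heartbeating on the
                  shared disk, including the local node.
    connected:    set of node numbers the local node can reach over the
                  network, including itself (an assumption of this sketch).
    """
    if not heartbeating:
        return False
    reachable = heartbeating & connected
    total = len(heartbeating)
    if total % 2 == 1:
        # Odd number of live nodes: a strict majority is required.
        return len(reachable) > total // 2
    # Even number of live nodes: more than half always wins; exactly
    # half wins only if it includes the lowest-numbered live node.
    if len(reachable) > total // 2:
        return True
    return len(reachable) == total // 2 and min(heartbeating) in reachable


# Four nodes, network taken down on nodes 0, 1 and 2:
# every node can only reach itself.
nodes = {0, 1, 2, 3}
survivors = [n for n in sorted(nodes) if has_quorum(nodes, {n})]
```

Under this model `survivors` is empty: with the network down on three of four nodes, no node can reach at least half of the heartbeating nodes, so every node would decide to fence itself.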
> 
>  
> 
> O2CB uses a node reset mechanism to fence; this, however, is causing
> the machine(s) to hang instead of handing over seamlessly. In OCFS2
> 1.4, Oracle has introduced a new fencing mechanism which no longer
> uses “panic” for fencing. Instead, by default, it uses "machine
> restart".
> 
>  
> 
> In your case, taking the network down the way you’ve done is causing
> the servers to hang, including the mounted file system, which remains
> locked until the OCFS2 cluster services are restarted.
> 
>  
> 
> RAC handover fails due to exactly this problem: the file system is
> locked by another node which was kicked out of the cluster but is
> still occupying the file system.
> 
> The healthy node will try to continue to work, but the databases
> hosted on the occupied file system will hang, and possibly the
> machine too. At this time there is no solution but to:
> 
> -        Force a shutdown of the troublesome node(s)
> 
> -        Shut down the database processes
> 
> -        Restart the OCFS2 services
> 
>  
> 
> Network failure resolution can be applied in a situation where you
> have set up network bonding for the interconnects, which is highly
> recommended.
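
For anyone who does bond the interconnect, here is a rough Python sketch for checking that every slave link in a bond is up. It assumes the Linux bonding driver's status file (normally /proc/net/bonding/bond0) in its usual "Key: value" layout; the interface names and the sample text below are made up for illustration, not taken from a real system.

```python
# Hypothetical sample in the layout used by /proc/net/bonding/<bond>.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up

Slave Interface: eth1
MII Status: down
"""


def slave_link_status(bonding_text):
    """Map each slave interface name to its reported MII status."""
    status = {}
    current = None
    for line in bonding_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "Key: value"
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before the first slave
            # and is skipped because no slave is current yet.
            status[current] = value
    return status


links = slave_link_status(SAMPLE)
down = [name for name, state in links.items() if state != "up"]
```

On a real node the text would come from reading the bond's status file instead of `SAMPLE`; a non-empty `down` list means the interconnect has lost redundancy and should be looked at before the remaining link fails.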
> 
>  
> 
> Best regards,
> 
> Karim Alkhayer
> 
>  
> 
>  
> 
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of John Murphy
> Sent: Thursday, June 04, 2009 10:15 PM
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] OCFS2 v1.4 hangs
> 
>  
> 
> I have four database servers in a high-availability, load-balancing
> configuration. Each machine has a mount to a common data source which
> is an OCFS2 v1.4 file-system. While working on three of the servers, I
> restarted the IP network and found afterwards that the fourth machine
> had hung. I could not reboot and could not unmount the ocfs2
> partitions. I am pretty sure this was all caused by my taking down the
> network on all three of the remaining machines. Can anyone shed some
> light on this for me? Ironically, I have four machines in order to
> ensure reliability.
> 
>  
> 
> TIA
> 
>  
> 
> John
> 
> -- 
> John Murphy
> Technical And Managing Director
> MANDAC Ltd
> Kandoy House
> 2 Fairview Strand
> Dublin 3
> p: +353 1 5143001
> m: +353 85 711 6844
> e: john.murphy at mandac.eu
> w: www.mandac.eu
> 
-- 
John Murphy
Technical And Managing Director
MANDAC Ltd
Kandoy House
2 Fairview Strand
Dublin 3
p: +353 1 5143001
m: +353 85 711 6844
e: john.murphy at mandac.eu
w: www.mandac.eu




