[Ocfs2-users] 2 Node cluster crashing

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Mon Jul 10 11:38:37 CDT 2006


On the upgrade: you must upgrade both the kernel and the utilities to SLES9
SP3, then apply the online updates to reach at least kernel 257 and the
current ocfs2 tools.

Oracle supports OCFSv2 starting with SLES9 kernel 255, which is 'SP3 + a few
online updates'.

One more piece of advice: always run a third node, even if you don't use it,
just for quorum. Also update the heartbeat parameters (see
/etc/sysconfig/o2cb): the default results in a 12-second timeout, which is not
practical because network convergence on Ethernet takes 40 seconds by the STP
standard.
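
As a rough sketch of the arithmetic, assuming the commonly documented relation
timeout = (threshold - 1) * 2 seconds and the O2CB_HEARTBEAT_THRESHOLD setting
in /etc/sysconfig/o2cb (check both against your installed tools; the helper
names below are made up for illustration):

```python
import math

# Assumption: the disk heartbeat runs every 2 seconds and a node is declared
# dead after (threshold - 1) missed intervals.
HEARTBEAT_INTERVAL_SECONDS = 2


def timeout_for_threshold(threshold):
    """Seconds before a node is considered dead for a given threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def threshold_for_timeout(timeout_seconds):
    """Smallest threshold that gives at least the requested timeout."""
    return math.ceil(timeout_seconds / HEARTBEAT_INTERVAL_SECONDS) + 1


print(timeout_for_threshold(7))    # 12 (the 12-second default mentioned above)
print(threshold_for_timeout(40))   # 21 (covers a 40-second convergence)
```

In practice you would probably want some margin above 40 seconds, and the
same value must be set on every node.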



----- Original Message ----- 
From: "Mark Maiden" <markm at globoforce.com>
To: "ocfs2-users" <ocfs2-users at oss.oracle.com>
Sent: Monday, July 10, 2006 4:02 AM
Subject: [Ocfs2-users] 2 Node cluster crashing


> Hi,
>
> We have a two node cluster running SLES 9 SP2 connecting directly to an
> EMC CX300 for storage.
>
> We are using OCFS (OCFS2 DLM 0.99.15-SLES) for the voting disk etc., and
> ASM for data files.
>
> The system has been running until last Friday when the whole cluster
> went down with the following error messages in the /var/log/messages
> files :
>
> rac1:
>
> Jul 7 14:56:23 rac1 kernel: (0,3):o2net_state_change:512 connection to node rac2.globoforce.com num 1 at 198.87.235.246:7777 has been idle for 10 seconds, shutting it down.
> Jul 7 14:56:23 rac1 kernel: (10042,0):o2net_set_nn_state:414 no longer connected to node rac2.globoforce.com at 198.87.235.246:7777
> Jul 7 14:56:56 rac1 kernel: (14410,3):ocfs2_replay_journal:1123 Recovering node 1 from slot 1 on device (8,65)
>
> rac2:
>
> Jul 7 14:56:24 rac2 kernel: (0,0):o2net_state_change:512 connection to node rac1.globoforce.com num 0 at 198.87.235.244:7777 has been idle for 10 seconds, shutting it down.
> Jul 7 14:56:24 rac2 kernel: (10201,0):o2net_set_nn_state:414 no longer connected to node rac1.globoforce.com at 198.87.235.244:7777
> Jul 7 14:56:42 rac2 kernel: (10201,0):o2net_check_quorum:1468 ERROR: fencing this node because it is connected to a half-quorum of 1 out of 2 nodes which doesn't include the lowest active node 0
> Jul 7 14:56:42 rac2 kernel: (10201,0):o2hb_stop_all_regions:1589 ERROR: stopping heartbeat on all active regions.
> Jul 7 14:56:42 rac2 kernel: Kernel panic: ocfs2 is very sorry to be fencing this system by panicing
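
The fencing message in the rac2 log can be read as a tie-break rule: a node
keeps running if it can reach more than half of the heartbeating nodes, or
exactly half when that half includes the lowest-numbered node. The sketch
below is only my reading of the log message, not the kernel code, and the
function name is made up for illustration:

```python
def should_fence(connected_nodes, heartbeating_nodes):
    """Return True if this node should fence itself.

    connected_nodes:    node numbers this node can reach, including itself
    heartbeating_nodes: node numbers still heartbeating on the shared disk
    """
    total = len(heartbeating_nodes)
    connected = len(connected_nodes)
    if connected * 2 > total:
        return False  # clear majority
    if connected * 2 == total and min(heartbeating_nodes) in connected_nodes:
        return False  # exact half, but it holds the lowest node: wins the tie
    return True


# Two nodes, interconnect lost: node 0 (rac1) survives, node 1 (rac2) fences.
print(should_fence({0}, {0, 1}))        # False
print(should_fence({1}, {0, 1}))        # True

# Three nodes: node 1 loses node 0 but still reaches node 2, so it stays up.
print(should_fence({1, 2}, {0, 1, 2}))  # False
```

With only two nodes, any loss of the interconnect takes one node down; a
third node lets a majority form, which is the reason for the extra-node
advice.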
>
>
> I opened up an SR with Oracle and they recommended that we upgrade to
> SLES 9 SP3 because they don't support the OCFS version that we are
> running. I inquired as to whether this will sort out the problem, but
> they replied with a very vague answer.
>
> Can somebody please shed some light on this: is the version of OCFS that
> we are running very buggy, and does it cause lots of problems like this?
> And if we upgrade, is it going to sort out the problem, or are we just
> bringing ourselves into "Supported-land" so that we can get it fixed from
> there?
>
> Also (sorry for all the questions :), when we upgrade, is it just a case
> of upgrading the kernel and the OCFS RPMs?
>
> Thank you for your help in advance...much appreciated!!
> -- 
>
> Mark Maiden
> Systems Administrator
> Globoforce, Ltd
>   6 Beckett Way Parkwest
>   Dublin 12
>   Ireland
>   t: +353 1 625 8812
>   f: +353 1 625 8880
>   e: markm at globoforce.com
>    www.globoforce.com
>
>    http://guidance.gospelcom.net/answer.htm
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>



