[Ocfs2-users] Troubles with two nodes

inode inode at mediaservice.net
Thu Nov 29 11:24:30 PST 2007


It seems openSUSE 10.2 uses version 1.2.2; anyone interested can get the
sources here:

ftp://ftp.suse.com/pub/suse/update/10.2/rpm/src/kernel-source-2.6.18.8-0.7.src.rpm
http://download.opensuse.org/distribution/10.2/repo/src-oss/suse/src/ocfs2-tools-1.2.2-11.src.rpm
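
For anyone who wants to verify which versions are actually running,
something like this should work on openSUSE (I'm assuming the usual
ocfs2-tools package name, and modinfo may not print a version line on
every kernel, so take it as a rough check only):

web-ha1:~ # uname -r
web-ha1:~ # rpm -q ocfs2-tools
web-ha1:~ # modinfo ocfs2 | grep -i version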

So your advice is to update to version 1.2.7 and hope that this will
never happen again?

Are there any known problems with the update from version 1.2.2 to 1.2.7?
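
In case it is useful, the procedure I would try on each node, one node at
a time, is roughly the following (the mount point is just a placeholder,
and I am not sure whether the 1.2.7 kernel module comes from a kernel
update or a separate package on openSUSE, so please correct me if this is
wrong):

web-ha1:~ # umount /data                    (every OCFS2 mount on this node)
web-ha1:~ # /etc/init.d/ocfs2 stop
web-ha1:~ # /etc/init.d/o2cb stop
web-ha1:~ # rpm -Uvh ocfs2-tools-1.2.7-*.rpm
web-ha1:~ # /etc/init.d/o2cb start
web-ha1:~ # /etc/init.d/ocfs2 start

While I am at it, I would also set the cluster timeouts in
/etc/sysconfig/o2cb to what I understand are the newer defaults; these
values (and the fact that the extra variables only exist with the 1.2.5+
tools) are my assumption, so please double-check them:

O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000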

Thank you in advance

Maurizio

Sunil Mushran wrote:
> To elaborate on this: prior to 1.2.5, we used to hear complaints
> about a frozen node causing processes on other, functioning nodes
> to go into D state, presumably while they were accessing the fs.
> 
> There were two reasons for this. The first was the fencing method.
> We used to call panic(), which at times would not reset the box. In those
> cases the node would freeze, but the disk heartbeat thread would keep
> chugging along. The D-state processes on the other nodes would
> be waiting for that node to stop heartbeating. A power off/on would
> resolve the issue. This was fixed in 1.2.5, when we changed the
> fencing call from panic() to machine_restart().
> 
> The second reason was the insanely low default cluster timeouts,
> which led to unnecessary fencing. This was partially addressed in 1.2.5,
> when we allowed custom values for all cluster timeouts. In 1.2.6/1.2.7,
> we raised the default timeouts to saner values.
> 
> So, let's start with the kernel version, as that will at least narrow
> down the known issues.
> 
> Sunil Mushran wrote:
>> What's the kernel version#?
>>
>> inode wrote:
>>> Hi all,
>>>
>>> I'm running OCFS2 on two systems with openSUSE 10.2, connected over fibre
>>> channel to shared storage (HP MSA1500 + HP ProLiant MSA20).
>>>
>>> The cluster has two nodes (web-ha1 and web-ha2). Once or twice a month
>>> OCFS2 stops working on both systems. On the first node I get no errors
>>> in the log files, and after a forced shutdown of the first node I see,
>>> on the second node, the log messages at the bottom of this message.
>>>
>>> I saw that some other people have run into a similar problem
>>> (http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg01135.html)
>>> but that thread didn't help me...
>>>
>>> Does anyone have any idea?
>>>
>>> Thank you in advance.
>>>
>>> Maurizio
>>>
>>>
>>> web-ha1:~ # cat /etc/sysconfig/o2cb
>>>
>>> O2CB_ENABLED=true
>>> O2CB_BOOTCLUSTER=ocfs2
>>> O2CB_HEARTBEAT_THRESHOLD=451
>>>
>>> web-ha1:~ #
>>> web-ha1:~ # cat /etc/ocfs2/cluster.conf
>>> node:
>>>         ip_port = 7777
>>>         ip_address = 192.168.255.1
>>>         number = 0
>>>         name = web-ha1
>>>         cluster = ocfs2
>>>
>>> node:
>>>         ip_port = 7777
>>>         ip_address = 192.168.255.2
>>>         number = 1
>>>         name = web-ha2
>>>         cluster = ocfs2
>>>
>>> cluster:
>>>         node_count = 2
>>>         name = ocfs2
>>>
>>> web-ha1:~ #
>>>
>>>
>>>
>>> Nov 28 15:28:59 web-ha2 kernel: o2net: connection to node web-ha1 (num
>>> 0) at 192.168.255.1:7777 has been idle for 10 seconds, shutting it down.
>>> Nov 28 15:28:59 web-ha2 kernel: (23432,0):o2net_idle_timer:1297 here are
>>> some times that might help debug the situation: (tmr 1196260129.36511
>>> now 1196260139
>>> .34907 dr 1196260129.36503 adv 1196260129.36514:1196260129.36515 func
>>> (95bc84eb:504) 1196260129.36329:1196260129.36337)
>>> Nov 28 15:28:59 web-ha2 kernel: o2net: no longer connected to node
>>> web-ha1 (num 0) at 192.168.255.1:7777
>>> Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_do_master_request:1331
>>> ERROR: link to 0 went down!
>>> Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_get_lock_resource:915
>>> ERROR: status = -112
>>> Nov 28 15:29:18 web-ha2 sshd[23503]: pam_unix2(sshd:auth): conversation
>>> failed
>>> Nov 28 15:29:18 web-ha2 sshd[23503]: error: ssh_msg_send: write
>>> Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_do_master_request:1331
>>> ERROR: link to 0 went down!
>>> Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_get_lock_resource:915
>>> ERROR: status = -107
>>> Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_do_master_request:1331
>>> ERROR: link to 0 went down!
>>> Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_get_lock_resource:915
>>> ERROR: status = -107
>>> Nov 28 15:29:46 web-ha2 kernel: (23443,0):dlm_do_master_request:1331
>>> ERROR: link to 0 went down!
>>> ERROR: status = -107
>>>
>>> [...]
>>>
>>> Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_restart_lock_mastery:1215
>>> ERROR: node down! 0
>>> Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_wait_for_lock_mastery:1036
>>> ERROR: status = -11
>>> Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_restart_lock_mastery:1215
>>> ERROR: node down! 0
>>> Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_wait_for_lock_mastery:1036
>>> ERROR: status = -11
>>> Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_restart_lock_mastery:1215
>>> ERROR: node down! 0
>>> Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_wait_for_lock_mastery:1036
>>> ERROR: status = -11
>>> Nov 22 18:14:51 web-ha2 kernel: (17804,1):dlm_get_lock_resource:896
>>> 86472C5C33A54FF88030591B1210C560:M0000000000000009e7e54516dd16ec: at
>>> least one node (0) torecover before lock mastery can begin
>>> Nov 22 18:14:51 web-ha2 kernel: (17730,1):dlm_get_lock_resource:896
>>> 86472C5C33A54FF88030591B1210C560:M0000000000000009e76bf516dd144d: at
>>> least one node (0) torecover before lock mastery can begin
>>> Nov 22 18:14:51 web-ha2 kernel: (17634,0):dlm_get_lock_resource:896
>>> 86472C5C33A54FF88030591B1210C560:M000000000000000ac0d22b1f78e53c: at
>>> least one node (0) torecover before lock mastery can begin
>>> Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_restart_lock_mastery:1215
>>> ERROR: node down! 0
>>> Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_wait_for_lock_mastery:1036
>>> ERROR: status = -11
>>>
>>> [...]
>>>
>>> Nov 22 18:14:54 web-ha2 kernel: (17702,1):dlm_get_lock_resource:896
>>> 86472C5C33A54FF88030591B1210C560:M0000000000000007a6dab9ef6eacbd: at
>>> least one node (0) torecover before lock mastery can begin
>>> Nov 22 18:14:54 web-ha2 kernel: (17701,1):dlm_get_lock_resource:896
>>> 86472C5C33A54FF88030591B1210C560:M000000000000000a06a13716de553e: at
>>> least one node (0) torecover before lock mastery can begin
>>> Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:849
>>> 86472C5C33A54FF88030591B1210C560:$RECOVERY: at least one node (0)
>>> torecover before lock mastery can begin
>>> Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:876
>>> 86472C5C33A54FF88030591B1210C560: recovery map is not empty, but must
>>> master $RECOVERY lock now
>>> Nov 22 18:14:54 web-ha2 kernel: (17893,0):ocfs2_replay_journal:1184
>>> Recovering node 0 from slot 0 on device (8,17)
>>> Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_restart_lock_mastery:1215
>>> ERROR: node down! 0
>>> Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_wait_for_lock_mastery:1036
>>> ERROR: status = -11
>>> Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_restart_lock_mastery:1215
>>> ERROR: node down! 0
>>> Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_wait_for_lock_mastery:1036
>>> ERROR: status = -11
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users


More information about the Ocfs2-users mailing list