[Ocfs2-users] Troubles with two node

Sunil Mushran Sunil.Mushran at oracle.com
Thu Nov 29 10:42:20 PST 2007


To elaborate on this: prior to 1.2.5, we used to hear complaints
about a frozen node causing processes on the other, functioning nodes
to go into D state, presumably while they were accessing the fs.

There were two reasons for this. The first was the fencing method.
We used to call panic(), which at times would not reset the box. In
those cases the node would freeze, but its disk heartbeat thread would
keep chugging along, so the D state processes on the other nodes would
be left waiting for that node to stop heartbeating. A power off/on
would clear the issue. This was resolved in 1.2.5, when we changed the
fencing call from panic() to machine_restart().

The second reason was the insanely low default cluster timeouts,
which led to unnecessary fencing. This was partially resolved in 1.2.5,
when we allowed custom values for all the cluster timeouts. In
1.2.6/1.2.7, we raised the default timeouts to saner values.
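
For reference, on 1.2.5 and later the cluster timeouts can be set in
/etc/sysconfig/o2cb (the same file quoted below); newer ocfs2-tools will
also prompt for them in "o2cb configure", if I remember right. The snippet
below is only a sketch, with the values I recall becoming the defaults in
1.2.6/1.2.7, so double-check the FAQ for your release:

O2CB_HEARTBEAT_THRESHOLD=31     # disk heartbeat; timeout is roughly (N-1)*2 secs
O2CB_IDLE_TIMEOUT_MS=30000      # network idle timeout
O2CB_KEEPALIVE_DELAY_MS=2000    # network keepalive delay
O2CB_RECONNECT_DELAY_MS=2000    # network reconnect delay

With a threshold of 31, that works out to about (31 - 1) * 2 = 60 seconds of
disk heartbeat timeout, versus roughly 12 seconds with the old default of 7.
If the same rule holds, the 451 in the config quoted below is roughly 15 minutes.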

So, let's start with the kernel version, as that will at least narrow
down the known issues.
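
If you are not sure offhand, something along these lines, run on both nodes,
will give us what we need (exact package names vary by distro, so treat this
as a sketch):

web-ha1:~ # uname -r                          # running kernel
web-ha1:~ # modinfo ocfs2 | grep -i version   # ocfs2 module version, if set
web-ha1:~ # rpm -qa | grep -i ocfs2           # installed ocfs2-tools packages

That will at least tell us whether you are on a pre-1.2.5 module and hence
likely hitting one of the two issues above.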

Sunil Mushran wrote:
> What's the kernel version number?
>
> inode wrote:
>> Hi all,
>>
>> I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over fibre
>> channel to shared storage (HP MSA1500 + HP PROLIANT MSA20).
>>
>> The cluster has two nodes (web-ha1 and web-ha2). Sometimes (once or twice
>> a month) OCFS2 stops working on both systems. I get no errors in the log
>> files on the first node, and after a forced shutdown of the first node I
>> can see on the second node the logs at the bottom of this message.
>>
>> I saw that some other people are running into a similar problem
>> (http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg01135.html)
>> but that thread didn't help me...
>>
>> Does anyone have any idea?
>>
>> Thank you in advance.
>>
>> Maurizio
>>
>>
>> web-ha1:~ # cat /etc/sysconfig/o2cb
>>
>> O2CB_ENABLED=true
>> O2CB_BOOTCLUSTER=ocfs2
>> O2CB_HEARTBEAT_THRESHOLD=451
>>
>> web-ha1:~ #
>> web-ha1:~ # cat /etc/ocfs2/cluster.conf
>> node:
>>         ip_port = 7777
>>         ip_address = 192.168.255.1
>>         number = 0
>>         name = web-ha1
>>         cluster = ocfs2
>>
>> node:
>>         ip_port = 7777
>>         ip_address = 192.168.255.2
>>         number = 1
>>         name = web-ha2
>>         cluster = ocfs2
>>
>> cluster:
>>         node_count = 2
>>         name = ocfs2
>>
>> web-ha1:~ #
>>
>>
>>
>> Nov 28 15:28:59 web-ha2 kernel: o2net: connection to node web-ha1 (num
>> 0) at 192.168.255.1:7777 has been idle for 10 seconds, shutting it down.
>> Nov 28 15:28:59 web-ha2 kernel: (23432,0):o2net_idle_timer:1297 here are
>> some times that might help debug the situation: (tmr 1196260129.36511
>> now 1196260139
>> .34907 dr 1196260129.36503 adv 1196260129.36514:1196260129.36515 func
>> (95bc84eb:504) 1196260129.36329:1196260129.36337)
>> Nov 28 15:28:59 web-ha2 kernel: o2net: no longer connected to node
>> web-ha1 (num 0) at 192.168.255.1:7777
>> Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_do_master_request:1331
>> ERROR: link to 0 went down!
>> Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_get_lock_resource:915
>> ERROR: status = -112
>> Nov 28 15:29:18 web-ha2 sshd[23503]: pam_unix2(sshd:auth): conversation
>> failed
>> Nov 28 15:29:18 web-ha2 sshd[23503]: error: ssh_msg_send: write
>> Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_do_master_request:1331
>> ERROR: link to 0 went down!
>> Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_get_lock_resource:915
>> ERROR: status = -107
>> Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_do_master_request:1331
>> ERROR: link to 0 went down!
>> Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_get_lock_resource:915
>> ERROR: status = -107
>> Nov 28 15:29:46 web-ha2 kernel: (23443,0):dlm_do_master_request:1331
>> ERROR: link to 0 went down!
>> ERROR: status = -107
>>
>> [...]
>>
>> Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_restart_lock_mastery:1215
>> ERROR: node down! 0
>> Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_wait_for_lock_mastery:1036
>> ERROR: status = -11
>> Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_restart_lock_mastery:1215
>> ERROR: node down! 0
>> Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_wait_for_lock_mastery:1036
>> ERROR: status = -11
>> Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_restart_lock_mastery:1215
>> ERROR: node down! 0
>> Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_wait_for_lock_mastery:1036
>> ERROR: status = -11
>> Nov 22 18:14:51 web-ha2 kernel: (17804,1):dlm_get_lock_resource:896
>> 86472C5C33A54FF88030591B1210C560:M0000000000000009e7e54516dd16ec: at
>> least one node (0) torecover before lock mastery can begin
>> Nov 22 18:14:51 web-ha2 kernel: (17730,1):dlm_get_lock_resource:896
>> 86472C5C33A54FF88030591B1210C560:M0000000000000009e76bf516dd144d: at
>> least one node (0) torecover before lock mastery can begin
>> Nov 22 18:14:51 web-ha2 kernel: (17634,0):dlm_get_lock_resource:896
>> 86472C5C33A54FF88030591B1210C560:M000000000000000ac0d22b1f78e53c: at
>> least one node (0) torecover before lock mastery can begin
>> Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_restart_lock_mastery:1215
>> ERROR: node down! 0
>> Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_wait_for_lock_mastery:1036
>> ERROR: status = -11
>>
>> [...]
>>
>> Nov 22 18:14:54 web-ha2 kernel: (17702,1):dlm_get_lock_resource:896
>> 86472C5C33A54FF88030591B1210C560:M0000000000000007a6dab9ef6eacbd: at
>> least one node (0) torecover before lock mastery can begin
>> Nov 22 18:14:54 web-ha2 kernel: (17701,1):dlm_get_lock_resource:896
>> 86472C5C33A54FF88030591B1210C560:M000000000000000a06a13716de553e: at
>> least one node (0) torecover before lock mastery can begin
>> Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:849
>> 86472C5C33A54FF88030591B1210C560:$RECOVERY: at least one node (0)
>> torecover before lock mastery can begin
>> Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:876
>> 86472C5C33A54FF88030591B1210C560: recovery map is not empty, but must
>> master $RECOVERY lock now
>> Nov 22 18:14:54 web-ha2 kernel: (17893,0):ocfs2_replay_journal:1184
>> Recovering node 0 from slot 0 on device (8,17)
>> Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_restart_lock_mastery:1215
>> ERROR: node down! 0
>> Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_wait_for_lock_mastery:1036
>> ERROR: status = -11
>> Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_restart_lock_mastery:1215
>> ERROR: node down! 0
>> Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_wait_for_lock_mastery:1036
>> ERROR: status = -11
>>
>>
>>
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>   
>
>



