[Ocfs2-users] Troubles with two nodes

inode inode at mediaservice.net
Thu Nov 29 09:26:26 PST 2007


Hi all,

I'm running OCFS2 on two systems with OpenSUSE 10.2, connected over Fibre
Channel to shared storage (HP MSA1500 + HP ProLiant MSA20).

The cluster has two nodes (web-ha1 and web-ha2). Sometimes (once or twice
a month) OCFS2 stops working on both systems. On the first node I get no
errors in the log files; after a forced shutdown of the first node, the
second node logs the messages shown at the bottom of this message.

I saw that some other people are running into a similar problem
(http://www.mail-archive.com/ocfs2-users@oss.oracle.com/msg01135.html)
but that thread didn't help me...

Does anyone have any ideas?

Thank you in advance.

Maurizio


web-ha1:~ # cat /etc/sysconfig/o2cb

O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=451

web-ha1:~ #
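
If I understand the formula correctly, O2CB_HEARTBEAT_THRESHOLD=451 should
mean a disk heartbeat timeout of about (451 - 1) * 2 = 900 seconds, since
the heartbeat thread writes every 2 seconds. I think the value the kernel
is actually using can be read from configfs (the path may differ between
versions):

web-ha1:~ # cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold
451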
web-ha1:~ # cat /etc/ocfs2/cluster.conf
node:
        ip_port = 7777
        ip_address = 192.168.255.1
        number = 0
        name = web-ha1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.255.2
        number = 1
        name = web-ha2
        cluster = ocfs2

cluster:
        node_count = 2
        name = ocfs2

web-ha1:~ #
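
One thing I noticed: the "idle for 10 seconds" message below looks like the
o2net network idle timeout, which as far as I know defaults to 10000 ms.
Newer ocfs2-tools apparently let it be raised in /etc/sysconfig/o2cb with
something like (untested on my setup, so treat this as a guess):

O2CB_IDLE_TIMEOUT_MS=30000

I don't know whether raising it would be a fix or would just hide the real
problem.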



Nov 28 15:28:59 web-ha2 kernel: o2net: connection to node web-ha1 (num
0) at 192.168.255.1:7777 has been idle for 10 seconds, shutting it down.
Nov 28 15:28:59 web-ha2 kernel: (23432,0):o2net_idle_timer:1297 here are
some times that might help debug the situation: (tmr 1196260129.36511
now 1196260139.34907 dr 1196260129.36503 adv
1196260129.36514:1196260129.36515 func (95bc84eb:504)
1196260129.36329:1196260129.36337)
Nov 28 15:28:59 web-ha2 kernel: o2net: no longer connected to node
web-ha1 (num 0) at 192.168.255.1:7777
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
Nov 28 15:28:59 web-ha2 kernel: (23315,0):dlm_get_lock_resource:915
ERROR: status = -112
Nov 28 15:29:18 web-ha2 sshd[23503]: pam_unix2(sshd:auth): conversation
failed
Nov 28 15:29:18 web-ha2 sshd[23503]: error: ssh_msg_send: write
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
Nov 28 15:29:22 web-ha2 kernel: (23396,0):dlm_get_lock_resource:915
ERROR: status = -107
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
Nov 28 15:29:29 web-ha2 kernel: (23450,0):dlm_get_lock_resource:915
ERROR: status = -107
Nov 28 15:29:46 web-ha2 kernel: (23443,0):dlm_do_master_request:1331
ERROR: link to 0 went down!
ERROR: status = -107
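
(For reference, I decoded the status values with perl's $!, which prints
errno as a string:

web-ha2:~ # perl -e '$! = 107; print $!, "\n"'
Transport endpoint is not connected
web-ha2:~ # perl -e '$! = 112; print $!, "\n"'
Host is down

so -107 is -ENOTCONN and -112 is -EHOSTDOWN: the DLM lost its link to
node 0.)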

[...]

Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:50 web-ha2 kernel: (17634,0):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17619,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17798,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:51 web-ha2 kernel: (17804,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M0000000000000009e7e54516dd16ec: at
least one node (0) torecover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17730,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M0000000000000009e76bf516dd144d: at
least one node (0) torecover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17634,0):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M000000000000000ac0d22b1f78e53c: at
least one node (0) torecover before lock mastery can begin
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:51 web-ha2 kernel: (17644,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11

[...]

Nov 22 18:14:54 web-ha2 kernel: (17702,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M0000000000000007a6dab9ef6eacbd: at
least one node (0) torecover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (17701,1):dlm_get_lock_resource:896
86472C5C33A54FF88030591B1210C560:M000000000000000a06a13716de553e: at
least one node (0) torecover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:849
86472C5C33A54FF88030591B1210C560:$RECOVERY: at least one node (0)
torecover before lock mastery can begin
Nov 22 18:14:54 web-ha2 kernel: (3550,0):dlm_get_lock_resource:876
86472C5C33A54FF88030591B1210C560: recovery map is not empty, but must
master $RECOVERY lock now
Nov 22 18:14:54 web-ha2 kernel: (17893,0):ocfs2_replay_journal:1184
Recovering node 0 from slot 0 on device (8,17)
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17803,1):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_restart_lock_mastery:1215
ERROR: node down! 0
Nov 22 18:14:55 web-ha2 kernel: (17602,0):dlm_wait_for_lock_mastery:1036
ERROR: status = -11
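
(The status = -11 lines should be -EAGAIN, which I read as lock mastery
being retried while node 0 is still in the recovery map; the
ocfs2_replay_journal line is web-ha2 replaying web-ha1's journal from
slot 0.)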
