[Ocfs2-users] Access to OCFS2 volume paused when a node crashes

paul fretter (TOC) paul.fretter at bbsrc.ac.uk
Tue Oct 9 03:41:24 PDT 2007


One node (node1) in our cluster hangs every now and again for some
reason, and when that happens it seems to pause access to the OCFS2
volume for the other nodes as well.

We are running the latest version of OCFS2 and the tools, on RHEL4
(x86_64) with kernel 2.6.9-42.  All nodes are connected by Fibre Channel
to a common LUN for data sharing.

I guess there may be something I can do with configuring timeouts
etc(?), but I thought I'd check with this list first.  Here is the
relevant info from /var/log/messages:


Oct  9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num 0)
at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it down.
Oct  9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
some times that might help debug the situation: (tmr 1191925471.993435
now 1191925481.994292 dr 1191925471.993425 adv
1191925471.993436:1191925471.993437 func (98e2d068:507)
1191924562.14841:1191924562.14844)
Oct  9 11:24:41 jic55124 kernel: o2net: no longer connected to node
node1 (num 0) at 10.10.10.1:7777
Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
ERROR: link to 0 went down!
Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
ERROR: status = -112
[root@jic55124 ~]# tail /var/log/messages
Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
ERROR: status = -107
Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
ERROR: link to 0 went down!
Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
ERROR: status = -107
Oct  9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
least one node (0) to recover before lock mastery can begin
Oct  9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
device (8,80): dlm has evicted node 0
Oct  9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
least one node (0) to recover before lock mastery can begin
Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
ERROR: node down! 0
Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
ERROR: status = -11
Oct  9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
Recovering node 0 from slot 5 on device (8,80)
Oct  9 11:33:50 jic55124 kernel: kjournald starting.  Commit interval 5
seconds
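
As a reading aid for the log above, here is a minimal Python sketch (not
part of the original output) that decodes the negative status values as
Linux errno codes and measures the gap between the o2net idle timeout
and the DLM eviction.  The helper names and the hard-coded errno table
are assumptions made for illustration; the two sample lines are copied
from the log with the line wrapping removed.

```python
import re
from datetime import datetime

# Linux errno numbers for the status values seen in the log.  Hard-coded
# (an assumption of Linux numbering) so the result does not depend on
# the platform the sketch is run on.
LINUX_ERRNO = {
    11: "EAGAIN",
    107: "ENOTCONN",
    112: "EHOSTDOWN",
}


def decode_status(status):
    """Map a kernel-style negative status (e.g. -112) to an errno name."""
    return LINUX_ERRNO.get(-status, "unknown")


def parse_time(line):
    """Return the first HH:MM:SS timestamp found in a syslog line."""
    match = re.search(r"\b(\d{2}:\d{2}:\d{2})\b", line)
    return datetime.strptime(match.group(1), "%H:%M:%S")


def seconds_between(first_line, second_line):
    """Whole seconds elapsed between two syslog lines from the same day."""
    delta = parse_time(second_line) - parse_time(first_line)
    return int(delta.total_seconds())


# Two entries from the log above, unwrapped onto single lines.
idle = ("Oct  9 11:24:41 jic55124 kernel: o2net: connection to node node1 "
        "(num 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, "
        "shutting it down.")
evicted = ("Oct  9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119 "
           "device (8,80): dlm has evicted node 0")

for status in (-112, -107, -11):
    print(status, decode_status(status))

print("seconds from idle timeout to eviction:", seconds_between(idle, evicted))
```

On these two entries the gap works out to 541 seconds, i.e. about nine
minutes between the connection being dropped and node 0 being evicted.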


Many thanks
Paul Fretter


