[Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes

paul fretter (TOC) paul.fretter at bbsrc.ac.uk
Tue Oct 9 03:50:07 PDT 2007


To clarify,

The host "node1" is OCFS2 node 0 in the config file.

The log entries are from another system in the cluster (jic55124).

Kind regards
Paul



> -----Original Message-----
> From: paul fretter (TOC)
> Sent: 09 October 2007 11:41
> To: ocfs2-users at oss.oracle.com
> Subject: Access to OCFS2 volume paused when a node crashes
> 
> There is a node (node1) on our cluster that for some reason hangs every
> now and again, and when that happens it also seems to pause access to
> the OCFS2 volume for the other nodes.
> 
> We are running the latest version of OCFS2 and the tools, on RHEL4
> (x86_64) with kernel 2.6.9-42.  All nodes are connected by Fibre
> Channel to a common LUN for data sharing.
> 
> I guess there may be something I can do with configuring timeouts
> etc(?), but I thought I'd check with this list first.  Here is the
> relevant info from /var/log/messages:
> 
> 
> Oct  9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num
> 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it
> down.
> Oct  9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1191925471.993435
> now 1191925481.994292 dr 1191925471.993425 adv
> 1191925471.993436:1191925471.993437 func (98e2d068:507)
> 1191924562.14841:1191924562.14844)
> Oct  9 11:24:41 jic55124 kernel: o2net: no longer connected to node
> node1 (num 0) at 10.10.10.1:7777
> Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
> ERROR: link to 0 went down!
> Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
> ERROR: status = -112
> [root@jic55124 ~]# tail /var/log/messages
> Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
> ERROR: status = -107
> Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
> ERROR: link to 0 went down!
> Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
> ERROR: status = -107
> Oct  9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
> least one node (0) torecover before lock mastery can begin
> Oct  9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
> device (8,80): dlm has evicted node 0
> Oct  9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
> least one node (0) torecover before lock mastery can begin
> Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
> ERROR: node down! 0
> Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
> ERROR: status = -11
> Oct  9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
> Recovering node 0 from slot 5 on device (8,80)
> Oct  9 11:33:50 jic55124 kernel: kjournald starting.  Commit interval 5
> seconds
> 
> 
> Many thanks
> Paul Fretter
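
For anyone reading the timings in the log above, here is a rough Python sketch of the arithmetic. It assumes the "tmr" and "now" fields of the o2net_idle_timer line are seconds.microseconds stamps for when the idle timer was armed and when it fired (my reading, not something the log states), and it rejoins the "now" value that was split by line wrapping. The function names are made up for illustration.

```python
from datetime import datetime

# Values copied from the o2net_idle_timer line in the quoted log
# (the wrapped "now" field is rejoined as 1191925481.994292).
TMR = 1191925471.993435
NOW = 1191925481.994292


def idle_gap_seconds(tmr: float, now: float) -> float:
    """Seconds between the 'tmr' and 'now' stamps (assumed: armed vs fired)."""
    return now - tmr


def seconds_between(start: str, end: str) -> int:
    """Whole seconds between two same-day syslog HH:MM:SS times."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    # Gap reported by the idle timer itself.
    print(f"idle gap: {idle_gap_seconds(TMR, NOW):.3f} s")
    # Connection dropped at 11:24:41; "dlm has evicted node 0" at 11:33:42.
    print("drop -> eviction:", seconds_between("11:24:41", "11:33:42"), "s")
    # "Recovering node 0 from slot 5" was logged at 11:33:48.
    print("drop -> journal replay:", seconds_between("11:24:41", "11:33:48"), "s")
```

Run as a script, this gives an idle gap of about 10.001 s (consistent with the "idle for 10.0 seconds" message), 541 s from the connection drop to the eviction of node 0, and 547 s to the start of journal replay.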

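The "status = -N" values in the dlm lines look like negated Linux errno numbers, which is the usual kernel convention; treating them that way is an assumption on my part. A small sketch that maps the three values seen in the log to their Linux errno names (the table is hard-coded so it does not depend on the platform running the script, and decode_status is a hypothetical helper):

```python
import re

# Linux errno numbers for the three status values that appear in the log.
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}

# Matches "status = -112" and tolerates uneven spacing around "=".
STATUS_RE = re.compile(r"status\s*=\s*-(\d+)")


def decode_status(line):
    """Return the errno name for a 'status = -N' log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    number = int(match.group(1))
    # Unknown numbers are reported as-is rather than guessed at.
    name, _description = LINUX_ERRNO.get(number, (f"errno {number}", ""))
    return name


if __name__ == "__main__":
    for sample in (
        "(727,3):dlm_get_lock_resource:995 ERROR: status = -112",
        "(856,2):dlm_get_lock_resource:995 ERROR: status = -107",
        "(727,3):dlm_wait_for_lock_mastery:1118 ERROR: status = -11",
    ):
        print(decode_status(sample), "<-", sample)
```

Read this way, the sequence would be "host is down" when the link to node 0 dropped, "not connected" on later lock requests, and "try again" while lock mastery waited for node 0 to be recovered.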

