[Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes
paul fretter (TOC)
paul.fretter at bbsrc.ac.uk
Tue Oct 9 03:50:07 PDT 2007
To clarify,
The host "node1" is the OCFS node 0 in the config file.
The log entries are from another system in the cluster.
Kind regards
Paul
> -----Original Message-----
> From: paul fretter (TOC)
> Sent: 09 October 2007 11:41
> To: ocfs2-users at oss.oracle.com
> Subject: Access to OCFS2 volume paused when a node crashes
>
> There is a node (node1) on our cluster that for some reason hangs every
> now and again, but it seems that when it happens it also pauses access
> to the OCFS2 volume for the other nodes.
>
> We are running the latest version of OCFS2 and the tools, on RHEL4
> (x86_64) with kernel 2.6.9-42. All nodes are connected by
> fibrechannel to a common LUN for data sharing.
>
> I guess there may be something I can do with configuring timeouts
> etc(?), but I thought I'd check with this list first. Here is the
> relevant info from /var/log/messages:
>
>
> Oct 9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num
> 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it
> down.
> Oct 9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1191925471.993435
> now 1191925481.994292 dr 1191925471.993425 adv
> 1191925471.993436:1191925471.993437 func (98e2d068:507)
> 1191924562.14841:1191924562.14844)
> Oct 9 11:24:41 jic55124 kernel: o2net: no longer connected to node
> node1 (num 0) at 10.10.10.1:7777
> Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
> ERROR: link to 0 went down!
> Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
> ERROR: status = -112
> [root@jic55124 ~]# tail /var/log/messages
> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
> ERROR: status = -107
> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
> ERROR: link to 0 went down!
> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
> ERROR: status = -107
> Oct 9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
> least one node (0) torecover before lock mastery can begin
> Oct 9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
> device (8,80): dlm has evicted node 0
> Oct 9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
> least one node (0) torecover before lock mastery can begin
> Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
> ERROR: node down! 0
> Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
> ERROR: status = -11
> Oct 9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
> Recovering node 0 from slot 5 on device (8,80)
> Oct 9 11:33:50 jic55124 kernel: kjournald starting. Commit interval 5
> seconds
>
>
> Many thanks
> Paul Fretter
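
The "status = -N" values in the log excerpt above look like negated Linux errno codes, which is the usual kernel convention. As a reading aid, here is a minimal Python sketch that maps the three values seen in the log to their symbolic names. The table is hardcoded from the standard Linux errno numbering and is assumed to apply to these RHEL4 x86_64 nodes; the helper name decode_status is hypothetical and is not part of OCFS2 or its tools.

```python
# Linux errno numbers for the three status values seen in the log.
# Assumption: the DLM "status = -N" values are negated errno codes.
LINUX_ERRNO = {
    11: ("EAGAIN", "Try again"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def decode_status(status):
    """Return a readable description for a negative kernel status value."""
    code = abs(status)
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return f"status = {status}: {name} ({text})"


if __name__ == "__main__":
    # The three status values reported by the dlm_* functions in the log.
    for status in (-112, -107, -11):
        print(decode_status(status))
```

Read this way, the sequence in the log would be: the host is reported down (-112), later requests find the connection no longer established (-107), and lock mastery is retried (-11) until node 0 has been evicted and its journal replayed.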