[Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes

Tue Oct 9 05:30:55 PDT 2007

You may want to try increasing the network timeout. The change has to be
made on all nodes.

See the FAQ
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT 
with special attention to #104 and #105.
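
As a rough cross-check before changing anything, the idle-timer line in the
log quoted below can be read mechanically. This is only a sketch, not an
OCFS2 tool. It assumes the stray spaces in the pasted log are line-wrap
artifacts, that the "tmr" and "now" fields are seconds.microseconds, and
that the negative "status" values are ordinary Linux errno numbers. The
function names are made up for illustration. If I recall the FAQ correctly,
the setting involved is O2CB_IDLE_TIMEOUT_MS in /etc/sysconfig/o2cb, but
please confirm that against the TIMEOUT section rather than taking it from
here.

```python
import re

# Idle-timer line from the quoted log, with the wrapped pieces rejoined.
# Assumption: the space in "9942 92" in the original paste is a wrap artifact.
LINE = (
    "(0,1):o2net_idle_timer:1418 here are some times that might help "
    "debug the situation: (tmr 1191925471.993435 now 1191925481.994292 "
    "dr 1191925471.993425 adv 1191925471.993436:1191925471.993437 "
    "func (98e2d068:507) 1191924562.14841:1191924562.14844)"
)

# Assumption: each timestamp is "<seconds>.<microseconds>".
TIMER_RE = re.compile(r"\(tmr (\d+)\.(\d+) now (\d+)\.(\d+)")

# Linux errno names for the status codes that appear in the dlm messages.
LINUX_ERRNO = {11: "EAGAIN", 107: "ENOTCONN", 112: "EHOSTDOWN"}


def idle_microseconds(line):
    """Return (now - tmr) in microseconds, or None if the fields are absent."""
    match = TIMER_RE.search(line)
    if match is None:
        return None
    tmr_s, tmr_us, now_s, now_us = (int(g) for g in match.groups())
    return (now_s - tmr_s) * 1_000_000 + (now_us - tmr_us)


def status_name(status):
    """Map a negative kernel status such as -112 to its errno name."""
    return LINUX_ERRNO.get(-status, "unknown")


if __name__ == "__main__":
    print("idle for", idle_microseconds(LINE), "microseconds")
    for status in (-112, -107, -11):
        print(status, status_name(status))
```

Run against that line, the gap comes out at 10000857 microseconds, which
agrees with the "has been idle for 10.0 seconds" message, so the connection
was dropped as soon as the idle timeout in effect on that node expired.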

Regards,

Marcos Eduardo Matsunaga

Oracle USA
Linux Engineering

paul fretter (TOC) wrote:
> To clarify,
>
> The host "node1" is the OCFS node 0 in the config file.
>
> The log entries are from another system in the cluster.
>
> Kind regards
> Paul
>
>
>
>   
>> -----Original Message-----
>> From: paul fretter (TOC)
>> Sent: 09 October 2007 11:41
>> To: ocfs2-users at oss.oracle.com
>> Subject: Access to OCFS2 volume paused when a node crashes
>>
>> There is a node (node1) on our cluster that for some reason hangs every
>> now and again, but it seems that when it happens it also pauses access
>> to the OCFS2 volume for the other nodes.
>>
>> We are running the latest version of OCFS2 and the tools, on RHEL4
>> (x86_64) with kernel 2.6.9-42.  All nodes are connected by
>> fibrechannel to a common LUN for data sharing.
>>
>> I guess there may be something I can do with configuring timeouts
>> etc(?), but I thought I'd check with this list first.  Here is the
>> relevant info from /var/log/messages:
>>
>>
>> Oct  9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num
>> 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it
>> down.
>> Oct  9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
>> some times that might help debug the situation: (tmr 1191925471.993435
>> now 1191925481.994292 dr 1191925471.993425 adv
>> 1191925471.993436:1191925471.993437 func (98e2d068:507)
>> 1191924562.14841:1191924562.14844)
>> Oct  9 11:24:41 jic55124 kernel: o2net: no longer connected to node
>> node1 (num 0) at 10.10.10.1:7777
>> Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
>> ERROR: link to 0 went down!
>> Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
>> ERROR: status = -112
>> [root at jic55124 ~]# tail /var/log/messages
>> Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
>> ERROR: status = -107
>> Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
>> ERROR: link to 0 went down!
>> Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
>> ERROR: status = -107
>> Oct  9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
>> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
>> least one node (0) torecover before lock mastery can begin
>> Oct  9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
>> device (8,80): dlm has evicted node 0
>> Oct  9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
>> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
>> least one node (0) torecover before lock mastery can begin
>> Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
>> ERROR: node down! 0
>> Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
>> ERROR: status = -11
>> Oct  9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
>> Recovering node 0 from slot 5 on device (8,80)
>> Oct  9 11:33:50 jic55124 kernel: kjournald starting.  Commit interval 5
>> seconds
>>
>>
>> Many thanks
>> Paul Fretter