[Ocfs2-users] RE: Access to OCFS2 volume paused when a node crashes
Marcos E. Matsunaga
Marcos.Matsunaga at oracle.com
Tue Oct 9 05:30:55 PDT 2007
You may want to try increasing the network timeout. The change has to be
made on all nodes.
See the FAQ
http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
with special attention to items #104 and #105.
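
For reference, a minimal Python sketch (not part of any OCFS2 tool) that
reads the numbers in the quoted log. It assumes the o2net timer fields are
printed as seconds.microseconds, and that the status values are negated
Linux errno codes. The function names are made up for illustration. The
10-second gap it computes matches the "idle for 10.0 seconds" message,
which is the network idle timeout; assuming that is the value set through
O2CB_IDLE_TIMEOUT_MS in /etc/sysconfig/o2cb, the FAQ remains the authority
on the exact names and recommended values.

```python
import re

# Linux errno values for the status codes seen in the quoted log
# (assumption: the dlm "status = -N" values are negated Linux errnos).
LINUX_ERRNO = {11: "EAGAIN", 107: "ENOTCONN", 112: "EHOSTDOWN"}


def field(line: str, name: str) -> tuple[int, int]:
    """Return (seconds, microseconds) for a named timer field such as 'tmr'."""
    m = re.search(rf"\b{name}\s+(\d+)\.(\d+)", line)
    if m is None:
        raise ValueError(f"field {name!r} not found")
    return int(m.group(1)), int(m.group(2))


def idle_seconds(line: str) -> float:
    """Time between 'tmr' (when the idle timer was armed) and 'now'."""
    t_sec, t_usec = field(line, "tmr")
    n_sec, n_usec = field(line, "now")
    return (n_sec - t_sec) + (n_usec - t_usec) / 1_000_000


def status_name(line: str) -> str:
    """Map 'status = -N' to a symbolic errno name where known."""
    m = re.search(r"status = -(\d+)", line)
    if m is None:
        raise ValueError("no status code found")
    code = int(m.group(1))
    return LINUX_ERRNO.get(code, f"errno {code}")


# 'now' is taken as 1191925481.994292, assuming any stray whitespace
# inside that value in the pasted log is line-wrap damage.
timer_line = (
    "(0,1):o2net_idle_timer:1418 here are some times that might help debug "
    "the situation: (tmr 1191925471.993435 now 1191925481.994292 "
    "dr 1191925471.993425"
)

print(f"{idle_seconds(timer_line):.6f}")
print(status_name("dlm_get_lock_resource:995 ERROR: status = -112"))
print(status_name("dlm_get_lock_resource:995 ERROR: status = -107"))
```

Read this way, the surviving node gave up on node 0 after its connection
had been silent for the full idle timeout, and the lock requests that
followed failed with "host is down" and "not connected" until node 0 was
evicted and its journal replayed.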
Regards,
Marcos Eduardo Matsunaga
Oracle USA
Linux Engineering
paul fretter (TOC) wrote:
> To clarify,
>
> The host "node1" is the OCFS node 0 in the config file.
>
> The log entries are from another system in the cluster.
>
> Kind regards
> Paul
>
>
>
>
>> -----Original Message-----
>> From: paul fretter (TOC)
>> Sent: 09 October 2007 11:41
>> To: ocfs2-users at oss.oracle.com
>> Subject: Access to OCFS2 volume paused when a node crashes
>>
>> There is a node (node1) on our cluster that for some reason hangs every
>> now and again, but it seems that when it happens it also pauses access
>> to the OCFS2 volume for the other nodes.
>>
>> We are running the latest version of OCFS2 and the tools, on RHEL4
>> (x86_64) with kernel 2.6.9-42. All nodes are connected by
>> Fibre Channel to a common LUN for data sharing.
>>
>> I guess there may be something I can do with configuring timeouts
>> etc(?), but I thought I'd check with this list first. Here is the
>> relevant info from /var/log/messages:
>>
>>
>> Oct 9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num
>> 0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it
>> down.
>> Oct 9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are
>> some times that might help debug the situation: (tmr 1191925471.993435
>> now 1191925481.994292 dr 1191925471.993425 adv
>> 1191925471.993436:1191925471.993437 func (98e2d068:507)
>> 1191924562.14841:1191924562.14844)
>> Oct 9 11:24:41 jic55124 kernel: o2net: no longer connected to node
>> node1 (num 0) at 10.10.10.1:7777
>> Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418
>> ERROR: link to 0 went down!
>> Oct 9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995
>> ERROR: status = -112
>> [root@jic55124 ~]# tail /var/log/messages
>> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
>> ERROR: status = -107
>> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418
>> ERROR: link to 0 went down!
>> Oct 9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995
>> ERROR: status = -107
>> Oct 9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921
>> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
>> least one node (0) torecover before lock mastery can begin
>> Oct 9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119
>> device (8,80): dlm has evicted node 0
>> Oct 9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976
>> 6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at
>> least one node (0) torecover before lock mastery can begin
>> Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301
>> ERROR: node down! 0
>> Oct 9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118
>> ERROR: status = -11
>> Oct 9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167
>> Recovering node 0 from slot 5 on device (8,80)
>> Oct 9 11:33:50 jic55124 kernel: kjournald starting. Commit interval 5
>> seconds
>>
>>
>> Many thanks
>> Paul Fretter
>>