[Ocfs2-users] Cluster lockup when one node fails

Sunil Mushran sunil.mushran at oracle.com
Thu May 28 13:00:54 PDT 2009


So what's happening is that this unstable node has locked up but is still
heartbeating on the disk. The other nodes are simply waiting for it to die.
For fencing we do a PCI bus reset, so it is locking up in an interesting
way. See if you can take some kernel traces (alt-sysrq-t) of the hung node,
then file a bugzilla and attach them. Maybe we'll get lucky. But if it is
always the same node that hangs, it could also point to a hardware issue.
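
A rough sketch of capturing those traces, assuming you still have console
access and that the magic sysrq facility is enabled (the output file name
below is just an example; adjust paths for your distro):

  # echo 1 > /proc/sys/kernel/sysrq   # enable sysrq if it is not already on
  # echo t > /proc/sysrq-trigger      # dump a stack trace of every task to the kernel log
  # dmesg > sysrq-t.txt               # save the traces to attach to the bugzilla

If the box is too frozen for a shell, Alt-SysRq-T on the console keyboard
does the same thing and the traces land in the kernel log / serial console.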

Kees Hoekzema wrote:
> Sorry for the missing info, I should have known better :)
>
> All nodes run Debian, with the following software installed:
> Kernel: 2.6.26-1-amd64 x86_64 
>
> modinfo ocfs2:
> version:        1.5.0
> description:    OCFS2 1.5.0
> srcversion:     B19D847BA86E871E41B7A64
> vermagic:       2.6.26-1-amd64 SMP mod_unload modversions
>
> ocfs2-tools:
> Version: 1.4.1-1
>
> Tia,
> Kees Hoekzema
>
>
>   
>> -----Original Message-----
>> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
>> Sent: Wednesday, May 27, 2009 20:03
>> To: Kees Hoekzema
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] Cluster lockup when one node fails
>>
>> kernel version, ocfs2 version?
>>
>> $ uname -a
>> $ modinfo ocfs2
>> $ rpm -qa | grep ocfs2
>>
>>
>> Kees Hoekzema wrote:
>>     
>>> Hello List,
>>>
>>> At the moment I'm running a 7-node ocfs2 cluster on a Dell MD3000i (iscsi)
>>> NAS. This cluster has run fine for well over a year now, but recently one of
>>> the older and more unstable servers in the cluster has started to fail
>>> sometimes.
>>>
>>> While it is not a big problem that this particular server reboots, it is
>>> however a problem that when it does, the whole cluster becomes unusable
>>> until that node reboots and returns.
>>>
>>> Today we had another crash on the server. The other nodes displayed it like
>>> this in the dmesg output:
>>>
>>> May 27 16:45:03 aphaea kernel:
>>> o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been idle
>>> for 10.0 seconds, shutting it down.
>>> (0,3):o2net_idle_timer:1468 here are some times that might help debug the
>>> situation: (tmr 1243435493.522086 now 1243435503.520354 dr 1243435493.522080
>>> adv 1243435493.522090:1243435493.522091 func (6169a8d1:502)
>>> 1243435148.2972:1243435148.2999)
>>> o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
>>> (3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
>>> (3762,1):dlm_get_lock_resource:912 ERROR: status = -112
>>> (5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
>>> (5196,3):dlm_get_lock_resource:912 ERROR: status = -107
>>> (735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
>>> (735,3):dlm_get_lock_resource:912 ERROR: status = -107
>>> (21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
>>> (21573,3):dlm_get_lock_resource:912 ERROR: status = -107
>>> (2825,3):o2net_connect_expired:1629 ERROR: no connection established with
>>> node 5 after 10.0 seconds, giving up and returning errors.
>>> (1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
>>> (1916,3):dlm_get_lock_resource:912 ERROR: status = -107
>>> ..
>>> [and a lot more similar errors]
>>> ..
>>> May 27 17:14:45 aphaea kernel:  (2825,3):o2dlm_eviction_cb:258 o2dlm has
>>> evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
>>>
>>> The node at fault was totally frozen, so it most likely never even got to
>>> the kernel panic from ocfs2 that would have made it reboot.
>>>
>>> After we rebooted the node, the cluster became available again. However, the
>>> failure still prevented the other 6 servers from accessing the shared
>>> storage for almost 30 minutes.
>>>
>>> Is there a way to 'evict' a node faster, and continue normal read/write
>>> operations without the node?
>>> Or is it possible to have at least read operations continue without being
>>> locked out as well?
>>>
>>> Tia,
>>> Kees Hoekzema
>>>
>>>
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>
>>>       
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   



