[Ocfs2-users] Cluster lockup when one node fails

Sunil Mushran sunil.mushran at oracle.com
Wed May 27 11:02:46 PDT 2009


kernel version, ocfs2 version?

$ uname -a
$ modinfo ocfs2
$ rpm -qa | grep ocfs2
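
Also, if you have them handy, include the cluster timeouts the nodes are
actually running with. A quick way to dump them (a sketch: the configfs paths
assume the o2cb stack is online, the cluster name is globbed with *, and the
config file is /etc/default/o2cb rather than /etc/sysconfig/o2cb on
Debian-based systems):

$ cat /etc/sysconfig/o2cb
$ cat /sys/kernel/config/cluster/*/idle_timeout_ms
$ cat /sys/kernel/config/cluster/*/keepalive_delay_ms
$ cat /sys/kernel/config/cluster/*/reconnect_delay_ms
$ cat /sys/kernel/config/cluster/*/heartbeat/dead_threshold

The "idle for 10.0 seconds" in the o2net messages below is the network idle
timeout; dead_threshold is the number of missed disk heartbeats before a node
is declared dead.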


Kees Hoekzema wrote:
> Hello List,
>
> At the moment I'm running a 7-node OCFS2 cluster on a Dell MD3000i (iSCSI)
> NAS. This cluster has run fine for well over a year now, but recently one of
> the older, less stable servers in the cluster has started to fail
> occasionally.
>
> While it is not a big problem that this particular server reboots, it is a
> problem that when it does, the whole cluster becomes unusable until that
> node has rebooted and rejoined.
>
> Today we had another crash on the server. The other nodes displayed it like
> this in the dmesg output: 
>
> May 27 16:45:03 aphaea kernel: 
> o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been idle
> for 10.0 seconds, shutting it down.
> (0,3):o2net_idle_timer:1468 here are some times that might help debug the
> situation: (tmr 1243435493.522086 now 1243435503.520354 dr 1243435493.522080
> adv 1243435493.522090:1243435493.522091 func (6169a8d1:502)
> 1243435148.2972:1243435148.2999)
> o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
> (3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (3762,1):dlm_get_lock_resource:912 ERROR: status = -112
> (5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (5196,3):dlm_get_lock_resource:912 ERROR: status = -107
> (735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (735,3):dlm_get_lock_resource:912 ERROR: status = -107
> (21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (21573,3):dlm_get_lock_resource:912 ERROR: status = -107
> (2825,3):o2net_connect_expired:1629 ERROR: no connection established with
> node 5 after 10.0 seconds, giving up and returning errors.
> (1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> (1916,3):dlm_get_lock_resource:912 ERROR: status = -107
> ..
> [and a lot more similar errors]
> ..
> May 27 17:14:45 aphaea kernel:  (2825,3):o2dlm_eviction_cb:258 o2dlm has
> evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
>
> The failed node was completely frozen, so it most likely never even got the
> kernel panic from OCFS2 that would have made it reboot.
>
> After we rebooted the node, the cluster became available again. Until then,
> however, the failure kept the other six servers from accessing the shared
> storage for almost 30 minutes.
>
> Is there a way to 'evict' a node faster and continue normal read/write
> operations without it? Or is it at least possible to let read operations
> continue without being locked out?
>
> Tia,
> Kees Hoekzema
>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   
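
On evicting faster: how quickly the surviving nodes give up on a dead node is
governed by the o2cb timeouts mentioned above (disk heartbeat dead threshold,
network idle timeout, etc.). They are set cluster-wide and must be identical
on every node; something along these lines, assuming the stock o2cb init
script (OCFS2 volumes have to be unmounted and the stack restarted for new
values to take effect):

$ /etc/init.d/o2cb configure   # prompts for dead threshold, idle/keepalive/reconnect timeouts
$ /etc/init.d/o2cb restart

That only shortens the detection window, though. Whether it explains the
30-minute hang here depends on whether the frozen node was still writing its
disk heartbeat, which is one reason the version info matters.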



