[Ocfs2-users] Cluster lockup when one node fails

Kees Hoekzema kees at tweakers.net
Thu May 28 03:50:15 PDT 2009


Sorry for the missing info; I should have known better :)

All nodes run Debian, with the following software installed:
Kernel: 2.6.26-1-amd64 x86_64 

modinfo ocfs2:
version:        1.5.0
description:    OCFS2 1.5.0
srcversion:     B19D847BA86E871E41B7A64
vermagic:       2.6.26-1-amd64 SMP mod_unload modversions

ocfs2-tools:
Version: 1.4.1-1
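
One more data point: the 10.0 second figure in the log below appears to be
the o2net idle timeout we are currently running with. In case it is relevant,
this is how I read back the active cluster timeouts on the nodes; I am
assuming the usual o2cb configfs layout here, with <clustername> being the
cluster name from /etc/ocfs2/cluster.conf:

$ cat /sys/kernel/config/cluster/<clustername>/idle_timeout_ms
$ cat /sys/kernel/config/cluster/<clustername>/keepalive_delay_ms
$ cat /sys/kernel/config/cluster/<clustername>/reconnect_delay_ms
$ cat /sys/kernel/config/cluster/<clustername>/heartbeat/dead_threshold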

Tia,
Kees Hoekzema


> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> Sent: Wednesday, May 27, 2009 20:03
> To: Kees Hoekzema
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] Cluster lockup when one node fails
> 
> kernel version, ocfs2 version?
> 
> $ uname -a
> $ modinfo ocfs2
> $ rpm -qa | grep ocfs2
> 
> 
> Kees Hoekzema wrote:
> > Hello List,
> >
> > At the moment I'm running a 7-node ocfs2 cluster on a Dell MD3000i
> > iSCSI array. This cluster has run fine for well over a year, but
> > recently one of the older, less stable servers in the cluster has
> > started failing occasionally.
> >
> > While it is not a big problem that this particular server reboots,
> > it is a problem that when it does, the whole cluster becomes
> > unusable until that node has rebooted and rejoined.
> >
> > Today we had another crash on that server. The other nodes logged it
> > like this in dmesg:
> >
> > May 27 16:45:03 aphaea kernel:
> > o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been
> > idle for 10.0 seconds, shutting it down.
> > (0,3):o2net_idle_timer:1468 here are some times that might help debug
> > the situation: (tmr 1243435493.522086 now 1243435503.520354 dr
> > 1243435493.522080 adv 1243435493.522090:1243435493.522091 func
> > (6169a8d1:502) 1243435148.2972:1243435148.2999)
> > o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
> > (3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (3762,1):dlm_get_lock_resource:912 ERROR: status = -112
> > (5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (5196,3):dlm_get_lock_resource:912 ERROR: status = -107
> > (735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (735,3):dlm_get_lock_resource:912 ERROR: status = -107
> > (21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (21573,3):dlm_get_lock_resource:912 ERROR: status = -107
> > (2825,3):o2net_connect_expired:1629 ERROR: no connection established
> > with node 5 after 10.0 seconds, giving up and returning errors.
> > (1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (1916,3):dlm_get_lock_resource:912 ERROR: status = -107
> > ..
> > [and a lot more similar errors]
> > ..
> > May 27 17:14:45 aphaea kernel:  (2825,3):o2dlm_eviction_cb:258 o2dlm
> > has evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
> >
> > The failing node was completely frozen, so it most likely never even
> > got the kernel panic from ocfs2 that would have made it reboot.
> >
> > After we rebooted the node, the cluster became available again.
> > Before that, however, the failure kept the other six servers from
> > accessing the shared storage for almost 30 minutes.
> >
> > Is there a way to 'evict' a node faster and continue normal
> > read/write operations without it?
> > Or is it at least possible to let read operations continue instead of
> > being locked out as well?
> >
> > Tia,
> > Kees Hoekzema
> >
> >
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >
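
P.S. Regarding my question above about evicting a node faster: if this is
mainly a matter of tuning, I assume the relevant knobs are the ones the o2cb
init script reads from /etc/default/o2cb on Debian. Just as a sketch of what
I mean (the variable names are from our init script; the values are, as far
as I can tell, the shipped defaults, not what we currently run):

# /etc/default/o2cb (excerpt)
# disk heartbeat: a node is declared dead after roughly
# (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds
O2CB_HEARTBEAT_THRESHOLD=31
# network timeouts, in milliseconds
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000

If that is the right direction: do these have to be identical on all seven
nodes, and would lowering them actually shorten the ~30 minute stall we saw,
or was that caused by something else (for example, the hung node still
writing its disk heartbeat)?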