[Ocfs2-users] Cluster lockup when one node fails

Kees Hoekzema kees at tweakers.net
Wed May 27 09:20:21 PDT 2009


Hello List,

At the moment I'm running a 7-node OCFS2 cluster on a Dell MD3000i (iSCSI)
storage array. This cluster has run fine for well over a year now, but
recently one of the older and less stable servers in the cluster has started
to fail intermittently.

While it is not a big problem that this particular server reboots, it is a
problem that when it does, the whole cluster becomes unusable until that node
comes back up and rejoins.

Today we had another crash on that server. The other nodes logged it like
this in dmesg:

May 27 16:45:03 aphaea kernel: 
o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been idle
for 10.0 seconds, shutting it down.
(0,3):o2net_idle_timer:1468 here are some times that might help debug the
situation: (tmr 1243435493.522086 now 1243435503.520354 dr 1243435493.522080
adv 1243435493.522090:1243435493.522091 func (6169a8d1:502)
1243435148.2972:1243435148.2999)
o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
(3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
(3762,1):dlm_get_lock_resource:912 ERROR: status = -112
(5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(5196,3):dlm_get_lock_resource:912 ERROR: status = -107
(735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(735,3):dlm_get_lock_resource:912 ERROR: status = -107
(21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(21573,3):dlm_get_lock_resource:912 ERROR: status = -107
(2825,3):o2net_connect_expired:1629 ERROR: no connection established with
node 5 after 10.0 seconds, giving up and returning errors.
(1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
(1916,3):dlm_get_lock_resource:912 ERROR: status = -107
..
[and a lot more similar errors]
..
May 27 17:14:45 aphaea kernel:  (2825,3):o2dlm_eviction_cb:258 o2dlm has
evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
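
In case it's useful: as far as I understand it, the 10.0 seconds in the o2net
message above is the o2net idle timeout, while how long it takes before a dead
node is actually fenced is governed by the disk heartbeat dead threshold.
Assuming the stock o2cb configfs layout (with <clustername> as a placeholder
for the actual cluster name), the values currently in effect should be
readable with:

    cat /sys/kernel/config/cluster/<clustername>/idle_timeout_ms
    cat /sys/kernel/config/cluster/<clustername>/keepalive_delay_ms
    cat /sys/kernel/config/cluster/<clustername>/reconnect_delay_ms
    cat /sys/kernel/config/cluster/<clustername>/heartbeat/dead_threshold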

The failing node was completely frozen, so it most likely never even got to
execute the kernel panic that ocfs2 would normally trigger to reboot it.

After we rebooted the node, the cluster became available again. Even so, the
failure blocked the other six servers from accessing the shared storage for
almost 30 minutes (16:45 to 17:14 in the log above).

Is there a way to 'evict' a failed node faster and continue normal read/write
operations without it?
Or, failing that, is it possible to at least let read operations continue
instead of locking those out as well?
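
What I was considering myself, assuming the usual o2cb init script setup
(/etc/sysconfig/o2cb, or /etc/default/o2cb depending on the distro), is
lowering the heartbeat threshold, along these lines:

    # hypothetical values; with the 2s heartbeat interval a threshold of N
    # means a node is declared dead after (N-1)*2 seconds
    O2CB_HEARTBEAT_THRESHOLD=31    # default, i.e. 60 seconds
    O2CB_IDLE_TIMEOUT_MS=10000     # matches the 10.0s in the o2net message

Would dropping O2CB_HEARTBEAT_THRESHOLD be the right knob here, or does that
mainly increase the risk of spurious fencing under load?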

Tia,
Kees Hoekzema