[Ocfs2-users] ocfs2 freeze

Wed Jun 24 06:14:19 PDT 2009

Hi,

I have an ocfs2 cluster with 8 nodes and 2 devices mapped from disk storage
to these nodes (the disks are formatted and the ocfs2 file systems are created).

I can start the cluster on each node and mount the device; this works fine.

Let's say my first node is named host1, with node number 0 and IP address
172.28.4.1, and my second node is named host2, with node number 1 and IP
address 172.28.4.2. I do nothing on the other nodes (but the device is
mounted on every node).

When I run find /mount_point -type f on host1, it searches and displays
files.
Before the find finishes, I remove the IP address from the interface on
host2 (so the network connection is broken), and the find on host1 freezes.
This is the log on host1:

Jun 24 12:36:33 host1 kernel: [ 1816.861233] o2net: connection to node
host2 (num 1) at 172.28.4.2:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 12:36:33 host1 kernel: [ 1816.861242] (0,5):o2net_idle_timer:1468
here are some times that might help debug the situation: (tmr
1245839763.115691 now 1245839793.115494 dr 1245839763.115676 adv
1245839763.115691:1245839763.115691 func (cd6c8a07:500)
1245839758.695001:1245839758.695003)
Jun 24 12:36:33 host1 kernel: [ 1816.861260] o2net: no longer connected
to node host2 (num 1) at 172.28.4.2:7777

A few minutes later the find resumes searching (I do not kill the process),
and I have this in my logs:
Jun 24 12:38:41 host1 kernel: [ 2011.612478]
(5935,0):o2dlm_eviction_cb:258 o2dlm has evicted node 1 from group
C9113043842642AD9694FDF0E9BE6E29
Jun 24 12:38:42 host1 kernel: [ 2013.370655]
(5950,5):dlm_get_lock_resource:839 C9113043842642AD9694FDF0E9BE6E29:
$RECOVERY: at least one node (1) to recover before lock mastery can
begin
Jun 24 12:38:42 host1 kernel: [ 2013.370661]
(5950,5):dlm_get_lock_resource:873 C9113043842642AD9694FDF0E9BE6E29:
recovery map is not empty, but must master $RECOVERY lock now
Jun 24 12:38:42 host1 kernel: [ 2013.378061]
(5950,5):dlm_do_recovery:524 (5950) Node 0 is the Recovery Master for
the Dead Node 1 for Domain C9113043842642AD9694FDF0E9BE6E29
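
The length of the stall can be read off the syslog timestamps above. A
minimal Python sketch, using the o2net disconnect time and the o2dlm
eviction time copied from the logs; the helper name seconds_between is only
for illustration, and the year 2009 is assumed from the date of this post
because syslog lines do not carry a year:

```python
from datetime import datetime

# Syslog timestamps have no year; 2009 is assumed from the date of the post.
ASSUMED_YEAR = "2009"
SYSLOG_FORMAT = "%Y %b %d %H:%M:%S"


def seconds_between(start: str, end: str) -> float:
    """Return the number of seconds between two syslog-style timestamps."""
    t_start = datetime.strptime(f"{ASSUMED_YEAR} {start}", SYSLOG_FORMAT)
    t_end = datetime.strptime(f"{ASSUMED_YEAR} {end}", SYSLOG_FORMAT)
    return (t_end - t_start).total_seconds()


# o2net reports the idle connection to host2 being shut down.
disconnect = "Jun 24 12:36:33"
# o2dlm reports that node 1 has been evicted.
eviction = "Jun 24 12:38:41"

stall = seconds_between(disconnect, eviction)
print(f"time from disconnect to eviction: {stall:.0f} s")
```

So roughly two minutes pass between the connection being dropped and node 1
being evicted, which matches how long the find stays frozen.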

Is it normal that I can't access the ocfs2 file system from any of the
healthy nodes for these few minutes? I could live without writes for 2
minutes, but this kind of interruption for reads is unacceptable.

I have the default settings for HB:
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=31
O2CB_IDLE_TIMEOUT_MS=30000
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000
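
For reference, a rough Python sketch of what these values mean in seconds.
It assumes the rule given in the OCFS2 documentation that the disk heartbeat
timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, with one heartbeat
every 2 seconds; the function and variable names are only illustrative:

```python
# Values copied from the O2CB settings listed above.
settings = {
    "O2CB_HEARTBEAT_THRESHOLD": 31,
    "O2CB_IDLE_TIMEOUT_MS": 30000,
    "O2CB_KEEPALIVE_DELAY_MS": 2000,
    "O2CB_RECONNECT_DELAY_MS": 2000,
}

# Assumption: one disk heartbeat every 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def heartbeat_timeout_seconds(threshold: int) -> int:
    """Disk heartbeat timeout, assuming (threshold - 1) * interval."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


disk_timeout_s = heartbeat_timeout_seconds(settings["O2CB_HEARTBEAT_THRESHOLD"])
idle_timeout_s = settings["O2CB_IDLE_TIMEOUT_MS"] / 1000
keepalive_s = settings["O2CB_KEEPALIVE_DELAY_MS"] / 1000
reconnect_s = settings["O2CB_RECONNECT_DELAY_MS"] / 1000

print(f"disk heartbeat timeout: {disk_timeout_s} s")
print(f"network idle timeout:   {idle_timeout_s} s")
print(f"keepalive delay:        {keepalive_s} s")
print(f"reconnect delay:        {reconnect_s} s")
```

Under that assumption the disk heartbeat timeout is 60 seconds, and the 30
second network idle timeout matches the "idle for 30.0 seconds" message in
the host1 log.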

ocfs2-tools 1.4.1 (debian lenny)
kernel 2.6.26-2-amd64
+multipath
+bonding

modinfo ocfs2
filename:       /lib/modules/2.6.26-2-amd64/kernel/fs/ocfs2/ocfs2.ko
license:        GPL
author:         Oracle
version:        1.5.0
description:    OCFS2 1.5.0
srcversion:     B19D847BA86E871E41B7A64
depends:        jbd,ocfs2_stackglue,ocfs2_nodemanager
vermagic:       2.6.26-2-amd64 SMP mod_unload modversions

Any advice?

Peter
