[Ocfs2-users] ocfs2 freeze

Wed Jun 24 13:25:14 PDT 2009

The nodes are not frozen. The processes that are attempting to talk
to the "disconnected" node are waiting for that node either to reply
or, failing that, to be declared dead. The default timeout for the
disk heartbeat is 60 secs.

If that node had simply died, the other nodes would have deemed it
dead after 60 secs, recovered it and carried on with life.

But in this case, you disconnected that node from the cluster, which
means that node first has to decide on its course of action. It kills
itself, but only after the 60 sec timeout. After that, the other nodes
have to wait another 60 secs before deeming it dead. That's your 120 secs.
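
The arithmetic above can be sketched as follows. This is only an
illustration, not o2cb code: it assumes the relation given in the OCFS2
documentation, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 secs, and the
function names are made up for the example.

```python
# Sketch of the timeout arithmetic; names are illustrative, not part of o2cb.
HEARTBEAT_INTERVAL_SECS = 2  # assumed: one disk heartbeat every 2 secs


def disk_heartbeat_timeout_secs(threshold: int) -> int:
    """Seconds without a disk heartbeat before a node is deemed dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def worst_case_stall_secs(threshold: int) -> int:
    """Stall seen by the surviving nodes when a peer loses its network link."""
    self_fence_wait = disk_heartbeat_timeout_secs(threshold)  # isolated node kills itself
    peer_wait = disk_heartbeat_timeout_secs(threshold)        # others then deem it dead
    return self_fence_wait + peer_wait


print(disk_heartbeat_timeout_secs(31))  # 60
print(worst_case_stall_secs(31))        # 120
```

With the O2CB_HEARTBEAT_THRESHOLD=31 from the quoted configuration this
gives the 60 sec timeout and the 120 sec stall described above.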

You can always cut the disk heartbeat timeout in half. But whether that
is safe depends a lot on your shared disk / io path. Some multipath
setups require the timeouts to be as high as 2 mins.
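
To pick a value for O2CB_HEARTBEAT_THRESHOLD, a small helper along these
lines may be useful. It is a sketch under the same assumption as the
OCFS2 documentation's formula, threshold = (timeout in secs / 2) + 1; the
helper name is hypothetical, and the right timeout for a given io path
still has to be established by testing.

```python
# Sketch: derive O2CB_HEARTBEAT_THRESHOLD from a desired timeout.
# threshold_for_timeout is a hypothetical helper, not an o2cb tool.
HEARTBEAT_INTERVAL_SECS = 2  # assumed: one disk heartbeat every 2 secs


def threshold_for_timeout(timeout_secs: int) -> int:
    """Return the threshold that yields the requested disk heartbeat timeout."""
    if timeout_secs <= 0 or timeout_secs % HEARTBEAT_INTERVAL_SECS != 0:
        raise ValueError("timeout must be a positive multiple of 2 secs")
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


for secs in (30, 60, 120):
    print(f"{secs} secs -> O2CB_HEARTBEAT_THRESHOLD={threshold_for_timeout(secs)}")
```

Halving the default 60 secs corresponds to a threshold of 16, while a
2 min timeout corresponds to 61. The same value has to be set on every
node in the cluster.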

Many users multipath the network setup too by setting up net bonding.

Sunil

Piotr Teodorowski wrote:
> Hi,
>
> I have an ocfs2 cluster with 8 nodes and 2 devices mapped from disk
> storage to these nodes (the disks are formatted and the ocfs2 file
> systems created).
>
> I can start the cluster on each node and mount the devices; this works fine.
>
> Let's say my first node is named host1, with node number 0 and IP
> address 172.28.4.1, and my second node is named host2, with node
> number 1 and IP address 172.28.4.2. I do nothing on the other nodes
> (but the device is mounted on every node).
>
> When I run find /mount_point -type f on host1, it searches and displays
> files.
> Before the find ends, I remove the IP address from the interface on
> host2 (so the network connection is broken) and the find on host1 freezes.
> This is the log on host1:
>
> Jun 24 12:36:33 host1 kernel: [ 1816.861233] o2net: connection to node 
> host2 (num 1) at 172.28.4.2:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 12:36:33 host1 kernel: [ 1816.861242] 
> (0,5):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245839763.115691 now 1245839793.115494 dr 
> 1245839763.115676 adv 1245839763.115691:1245839763.115691 func 
> (cd6c8a07:500) 1245839758.695001:1245839758.695003)
> Jun 24 12:36:33 host1 kernel: [ 1816.861260] o2net: no longer 
> connected to node host2 (num 1) at 172.28.4.2:7777
>
> A few minutes later the find resumes searching (I do not kill the
> process), and I have this in my logs:
> Jun 24 12:38:41 host1 kernel: [ 2011.612478] 
> (5935,0):o2dlm_eviction_cb:258 o2dlm has evicted node 1 from group 
> C9113043842642AD9694FDF0E9BE6E29
> Jun 24 12:38:42 host1 kernel: [ 2013.370655] 
> (5950,5):dlm_get_lock_resource:839 
> C9113043842642AD9694FDF0E9BE6E29:$RECOVERY: at least one node (1) to 
> recover before lock mastery can begin
> Jun 24 12:38:42 host1 kernel: [ 2013.370661] 
> (5950,5):dlm_get_lock_resource:873 C9113043842642AD9694FDF0E9BE6E29: 
> recovery map is not empty, but must master $RECOVERY lock now
> Jun 24 12:38:42 host1 kernel: [ 2013.378061] 
> (5950,5):dlm_do_recovery:524 (5950) Node 0 is the Recovery Master for 
> the Dead Node 1 for Domain C9113043842642AD9694FDF0E9BE6E29
>
> Is it normal that I can't access the ocfs2 file system (from any of
> the healthy nodes) for these few minutes? I can do without writes for
> 2 minutes, but this kind of interruption for reads is unacceptable.
>
> I have the default settings for HB:
> O2CB_ENABLED=true
> O2CB_BOOTCLUSTER=ocfs2
> O2CB_HEARTBEAT_THRESHOLD=31
> O2CB_IDLE_TIMEOUT_MS=30000
> O2CB_KEEPALIVE_DELAY_MS=2000
> O2CB_RECONNECT_DELAY_MS=2000
>
> ocfs2-tools 1.4.1 (debian lenny)
> kernel 2.6.26-2-amd64
> +multipath
> +bonding
>
> modinfo ocfs2
> filename:       /lib/modules/2.6.26-2-amd64/kernel/fs/ocfs2/ocfs2.ko
> license:        GPL
> author:         Oracle
> version:        1.5.0
> description:    OCFS2 1.5.0
> srcversion:     B19D847BA86E871E41B7A64
> depends:        jbd,ocfs2_stackglue,ocfs2_nodemanager
> vermagic:       2.6.26-2-amd64 SMP mod_unload modversions
>
> Any advice?
>
> Peter