[Ocfs2-users] ocfs2 freeze

Piotr Teodorowski piotr.teodorowski at contium.pl
Thu Jun 25 00:33:36 PDT 2009


OK, so it is normal behaviour that processes lose access to the files
on OCFS2, at least for a few seconds, when there is a problem with some
node in the cluster...
Before I simulated the situation with network problems, I had a "real"
problem:
one node (let's say it was node5) lost its connection to all 7 other
nodes. About 7 seconds later that node reconnected to 5 of them but
couldn't connect to the two remaining nodes.
All 7 other nodes noticed a problem with only one node - node5.
And I had this in my logs:

Jun 24 10:26:45 node5 kernel: [845664.767625] o2net: connection to node
node0 (num 0) at 172.28.4.48:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:45 node5 kernel: [845664.767634]
(0,5):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831975.367358 now 1245832005.363626 dr
1245831975.367342 adv 1245831975.367364:1245831975.367364 func
(cd6c8a07:502) 1245765277.343131:1245765277.343132)
Jun 24 10:26:45 node5 kernel: [845664.767625] o2net: no longer connected
to node node0 (num 0) at 172.28.4.48:7777
Jun 24 10:26:45 node5 kernel: [845665.013442] o2net: connection to node
node2 (num 2) at 172.28.4.64:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:45 node5 kernel: [845665.013449]
(0,1):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831975.603754 now 1245832005.603626 dr
1245831975.603749 adv 1245831975.603755:1245831975.603755 func
(b0b3255d:500) 1245831909.670314:1245831909.670317)
Jun 24 10:26:45 node5 kernel: [845665.013463] o2net: no longer connected
to node node2 (num 2) at 172.28.4.64:7777
Jun 24 10:26:45 node5 kernel: [845665.403172] o2net: connection to node
node3 (num 3) at 172.28.4.65:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:45 node5 kernel: [845665.403181]
(0,7):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831975.787772 now 1245832005.787627 dr
1245831975.787759 adv 1245831975.787773:1245831975.787773 func
(b0b3255d:500) 1245831925.837397:1245831925.837399)
Jun 24 10:26:45 node5 kernel: [845665.415165] o2net: no longer connected
to node node3 (num 3) at 172.28.4.65:7777
Jun 24 10:26:45 node5 kernel: [845666.094914] o2net: connection to node
node7 (num 7) at 172.28.4.78:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:45 node5 kernel: [845666.094922]
(0,6):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831975.915755 now 1245832005.915626 dr
1245831975.915743 adv 1245831975.915756:1245831975.915756 func
(b0b3255d:502) 1245831943.949686:1245831943.949687)
Jun 24 10:26:45 node5 kernel: [845666.094930] o2net: connection to node
node4 (num 4) at 172.28.4.66:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:45 node5 kernel: [845666.094935]
(0,6):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831975.915758 now 1245832005.915642 dr
1245831975.915748 adv 1245831975.915759:1245831975.915759 func
(b0b3255d:500) 1245831943.948920:1245831943.948923)
Jun 24 10:26:45 node5 kernel: [845666.108161] o2net: no longer connected
to node node7 (num 7) at 172.28.4.78:7777
Jun 24 10:26:45 node5 kernel: [845666.108161] o2net: no longer connected
to node node4 (num 4) at 172.28.4.66:7777
Jun 24 10:26:46 node5 kernel: [845666.348213] o2net: connection to node
node6 (num 6) at 172.28.4.81:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:46 node5 kernel: [845666.348220]
(0,7):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831976.123740 now 1245832006.123625 dr
1245831976.123732 adv 1245831976.123740:1245831976.123741 func
(b0b3255d:505) 1245822335.389460:1245822335.389465)
Jun 24 10:26:46 node5 kernel: [845666.368210] o2net: no longer connected
to node node6 (num 6) at 172.28.4.81:7777
Jun 24 10:26:47 node5 kernel: [845668.151772] o2net: connection to node
node1 (num 1) at 172.28.4.56:7777 has been idle for 30.0 seconds,
shutting it down.
Jun 24 10:26:47 node5 kernel: [845668.151780]
(0,1):o2net_idle_timer:1468 here are some times that might help debug
the situation: (tmr 1245831977.195745 now 1245832007.195625 dr
1245831977.195734 adv 1245831977.195746:1245831977.195746 func
(cd6c8a07:502) 1245829979.42551:1245829979.42555)
Jun 24 10:26:47 node5 kernel: [845668.151795] o2net: no longer connected
to node node1 (num 1) at 172.28.4.56:7777
Jun 24 10:26:54 node5 kernel: [845682.375877] o2net: connected to node
node0 (num 0) at 172.28.4.48:7777
Jun 24 10:26:54 node5 kernel: [845683.287292] o2net: connected to node
node4 (num 4) at 172.28.4.66:7777
Jun 24 10:26:56 node5 kernel: [845685.463900] o2net: connected to node
node1 (num 1) at 172.28.4.56:7777
Jun 24 10:27:05 node5 kernel: [845701.592536] o2net: connected to node
node2 (num 2) at 172.28.4.64:7777
Jun 24 10:27:05 node5 kernel: [845701.987283] o2net: connected to node
node3 (num 3) at 172.28.4.65:7777
Jun 24 10:27:15 node5 kernel: [845722.566864]
(5697,1):o2net_connect_expired:1629 ERROR: no connection established
with node 7 after 30.0 seconds, giving up and returning errors.
Jun 24 10:27:16 node5 kernel: [845722.787106]
(5697,1):o2net_connect_expired:1629 ERROR: no connection established
with node 6 after 30.0 seconds, giving up and returning errors.

These messages, saying that node5 couldn't connect to nodes 6 and 7,
repeated in an endless loop until I tried to umount the devices and
restart o2cb.
One device I umounted successfully, but the second one I couldn't
umount, and my logs showed:
Jun 24 11:43:01 node5 kernel: [855031.567140]
(20663,7):dlm_send_remote_unlock_request:359 ERROR: status = -107
Jun 24 11:43:01 node5 kernel: [855031.567140]
(20663,7):dlm_send_remote_unlock_request:359 ERROR: status = -107

The problem is that I couldn't access the files on OCFS2 from any of
the nodes until I rebooted them.
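
If I understand the explanation below correctly, the delay I'm seeing
lines up with the defaults from my first mail (quoted below), assuming
the usual o2cb rule of one disk heartbeat write every 2 seconds (rough
numbers, not measured values):

  disk heartbeat window = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 s
                        = (31 - 1) * 2 s = 60 s
  network idle timeout  = O2CB_IDLE_TIMEOUT_MS = 30000 ms = 30 s

  worst case ~ 60 s (the cut-off node decides to kill itself)
             + 60 s (the others wait out its heartbeat) ~ 120 s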



On Wed, 2009-06-24 at 13:25 -0700, Sunil Mushran wrote:

> The nodes are not frozen. The processes that are attempting to talk
> to the "disconnected" node are waiting for that node to reply or,
> failing that, for it to die. The default timeout for the disk
> heartbeat is 60 secs.
> 
> If that node simply died, the other nodes would have deemed the node
> dead after 60 secs, recovered it and carried on with life.
> 
> But in this case, you disconnected that node from the cluster. Meaning
> that node has to first decide on its course of action. It kills itself
> but after the 60 sec timeout. After that, the other nodes have to wait
> another 60 secs before deeming it dead. That's your 120 secs.
> 
> You can always cut the disk heartbeat by a half. But it depends a lot
> on your shared disk / io path. Some multipaths require the timeouts to
> be as high as 2 mins.
> 
> Many users multipath the network setup too by setting up net bonding.
> 
> Sunil
> 
> Piotr Teodorowski wrote:
> > Hi,
> >
> > I've an OCFS2 cluster with 8 nodes and 2 devices mapped from the disk 
> > storage to these nodes (the disks are formatted and OCFS2 file systems 
> > created).
> >
> > I can start the cluster on each node and mount the devices - this works fine.
> >
> > Let's say my first node's name is host1, its node number is 0 and its 
> > IP address is 172.28.4.1;
> > my second node's name is host2, node number 1, IP address 172.28.4.2; 
> > and I do nothing on the other nodes (but the device is mounted on every node).
> >
> > When I run find /mount_point -type f on host1, it searches and displays 
> > files.
> > Before the find ends, I remove the IP address from the interface on host2 
> > (the network connection is broken) and the find on host1 freezes.
> > This is the log on host1:
> >
> > Jun 24 12:36:33 host1 kernel: [ 1816.861233] o2net: connection to node 
> > host2 (num 1) at 172.28.4.2:7777 has been idle for 30.0 seconds, 
> > shutting it down.
> > Jun 24 12:36:33 host1 kernel: [ 1816.861242] 
> > (0,5):o2net_idle_timer:1468 here are some times that might help debug 
> > the situation: (tmr 1245839763.115691 now 1245839793.115494 dr 
> > 1245839763.115676 adv 1245839763.115691:1245839763.115691 func 
> > (cd6c8a07:500) 1245839758.695001:1245839758.695003)
> > Jun 24 12:36:33 host1 kernel: [ 1816.861260] o2net: no longer 
> > connected to node host2 (num 1) at 172.28.4.2:7777
> >
> > A few minutes later the find can search again (I did not kill the process)
> > and I have this in my logs:
> > Jun 24 12:38:41 host1 kernel: [ 2011.612478] 
> > (5935,0):o2dlm_eviction_cb:258 o2dlm has evicted node 1 from group 
> > C9113043842642AD9694FDF0E9BE6E29
> > Jun 24 12:38:42 host1 kernel: [ 2013.370655] 
> > (5950,5):dlm_get_lock_resource:839 
> > C9113043842642AD9694FDF0E9BE6E29:$RECOVERY: at least one node (1) to 
> > recover before lock mastery can begin
> > Jun 24 12:38:42 host1 kernel: [ 2013.370661] 
> > (5950,5):dlm_get_lock_resource:873 C9113043842642AD9694FDF0E9BE6E29: 
> > recovery map is not empty, but must master $RECOVERY lock now
> > Jun 24 12:38:42 host1 kernel: [ 2013.378061] 
> > (5950,5):dlm_do_recovery:524 (5950) Node 0 is the Recovery Master for 
> > the Dead Node 1 for Domain C9113043842642AD9694FDF0E9BE6E29
> >
> > Is it normal that I can't access the OCFS2 volume (from any healthy 
> > node) during these few minutes? I don't need to write for 2 minutes, but 
> > this kind of break for reads is unacceptable.
> >
> > I have the default settings for HB:
> > O2CB_ENABLED=true
> > O2CB_BOOTCLUSTER=ocfs2
> > O2CB_HEARTBEAT_THRESHOLD=31
> > O2CB_IDLE_TIMEOUT_MS=30000
> > O2CB_KEEPALIVE_DELAY_MS=2000
> > O2CB_RECONNECT_DELAY_MS=2000
> >
> > ocfs2-tools 1.4.1 (debian lenny)
> > kernel 2.6.26-2-amd64
> > +multipath
> > +bonding
> >
> > modinfo ocfs2
> > filename:       /lib/modules/2.6.26-2-amd64/kernel/fs/ocfs2/ocfs2.ko
> > license:        GPL
> > author:         Oracle
> > version:        1.5.0
> > description:    OCFS2 1.5.0
> > srcversion:     B19D847BA86E871E41B7A64
> > depends:        jbd,ocfs2_stackglue,ocfs2_nodemanager
> > vermagic:       2.6.26-2-amd64 SMP mod_unload modversions
> >
> > Any advice?
> >
> > Peter
> 
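
Following up on Sunil's suggestion above to cut the disk heartbeat in
half: if I understand it correctly, that would mean setting the same,
lower O2CB_HEARTBEAT_THRESHOLD on all 8 nodes and restarting the
cluster stack with the volumes unmounted. A rough sketch, assuming the
settings live in /etc/default/o2cb as on my Debian boxes and the same
(threshold - 1) * 2 s formula as above:

  # /etc/default/o2cb - sketch only, the same value is needed on every node
  # (16 - 1) * 2 s = 30 s disk heartbeat instead of the default 60 s
  O2CB_HEARTBEAT_THRESHOLD=16

  # then, with the ocfs2 volumes unmounted:
  /etc/init.d/o2cb restart

As Sunil notes, whether 30 seconds is safe depends on the multipath /
io path, so it would have to be tested against the storage first.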

