[Ocfs2-users] ocfs2 freeze

Sunil Mushran sunil.mushran at oracle.com
Thu Jun 25 10:38:38 PDT 2009


Log a bugzilla and _attach_ the logs: the full messages files from all the
nodes. It is not easy to read these in email readers.
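
If it helps, something along these lines, run on each node, should capture
what we need (assuming syslog writes kernel messages to /var/log/messages,
which may differ on your distribution; the file names are just a suggestion):

  # run on every node in the cluster
  cp /var/log/messages /tmp/messages.$(hostname)
  # compress before attaching to the bugzilla entry
  gzip /tmp/messages.$(hostname)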

Piotr Teodorowski wrote:
> Ok, so it is normal behaviour that a process has no access to the
> files on OCFS2, at least for a few seconds, when there is a problem
> with some node in the cluster...
> Before I simulated the situation with network problems, I had a "real"
> problem:
> one node (let's say it was node5) lost its connection to all 7 other nodes.
> About 7 seconds later that node reconnected to 5 of them but
> couldn't connect to the two remaining nodes.
> All 7 other nodes noticed a problem with only one node: node5.
> And I had this in my logs:
>
> Jun 24 10:26:45 node5 kernel: [845664.767625] o2net: connection to 
> node node0 (num 0) at 172.28.4.48:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:45 node5 kernel: [845664.767634] 
> (0,5):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831975.367358 now 1245832005.363626 dr 
> 1245831975.367342 adv 1245831975.367364:1245831975.367364 func 
> (cd6c8a07:502) 1245765277.343131:1245765277.343132)
> Jun 24 10:26:45 node5 kernel: [845664.767625] o2net: no longer 
> connected to node node0 (num 0) at 172.28.4.48:7777
> Jun 24 10:26:45 node5 kernel: [845665.013442] o2net: connection to 
> node node2 (num 2) at 172.28.4.64:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:45 node5 kernel: [845665.013449] 
> (0,1):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831975.603754 now 1245832005.603626 dr 
> 1245831975.603749 adv 1245831975.603755:1245831975.603755 func 
> (b0b3255d:500) 1245831909.670314:1245831909.670317)
> Jun 24 10:26:45 node5 kernel: [845665.013463] o2net: no longer 
> connected to node node2 (num 2) at 172.28.4.64:7777
> Jun 24 10:26:45 node5 kernel: [845665.403172] o2net: connection to 
> node node3 (num 3) at 172.28.4.65:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:45 node5 kernel: [845665.403181] 
> (0,7):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831975.787772 now 1245832005.787627 dr 
> 1245831975.787759 adv 1245831975.787773:1245831975.787773 func 
> (b0b3255d:500) 1245831925.837397:1245831925.837399)
> Jun 24 10:26:45 node5 kernel: [845665.415165] o2net: no longer 
> connected to node node3 (num 3) at 172.28.4.65:7777
> Jun 24 10:26:45 node5 kernel: [845666.094914] o2net: connection to 
> node node7 (num 7) at 172.28.4.78:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:45 node5 kernel: [845666.094922] 
> (0,6):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831975.915755 now 1245832005.915626 dr 
> 1245831975.915743 adv 1245831975.915756:1245831975.915756 func 
> (b0b3255d:502) 1245831943.949686:1245831943.949687)
> Jun 24 10:26:45 node5 kernel: [845666.094930] o2net: connection to 
> node node4 (num 4) at 172.28.4.66:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:45 node5 kernel: [845666.094935] 
> (0,6):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831975.915758 now 1245832005.915642 dr 
> 1245831975.915748 adv 1245831975.915759:1245831975.915759 func 
> (b0b3255d:500) 1245831943.948920:1245831943.948923)
> Jun 24 10:26:45 node5 kernel: [845666.108161] o2net: no longer 
> connected to node node7 (num 7) at 172.28.4.78:7777
> Jun 24 10:26:45 node5 kernel: [845666.108161] o2net: no longer 
> connected to node node4 (num 4) at 172.28.4.66:7777
> Jun 24 10:26:46 node5 kernel: [845666.348213] o2net: connection to 
> node node6 (num 6) at 172.28.4.81:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:46 node5 kernel: [845666.348220] 
> (0,7):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831976.123740 now 1245832006.123625 dr 
> 1245831976.123732 adv 1245831976.123740:1245831976.123741 func 
> (b0b3255d:505) 1245822335.389460:1245822335.389465)
> Jun 24 10:26:46 node5 kernel: [845666.368210] o2net: no longer 
> connected to node node6 (num 6) at 172.28.4.81:7777
> Jun 24 10:26:47 node5 kernel: [845668.151772] o2net: connection to 
> node node1 (num 1) at 172.28.4.56:7777 has been idle for 30.0 seconds, 
> shutting it down.
> Jun 24 10:26:47 node5 kernel: [845668.151780] 
> (0,1):o2net_idle_timer:1468 here are some times that might help debug 
> the situation: (tmr 1245831977.195745 now 1245832007.195625 dr 
> 1245831977.195734 adv 1245831977.195746:1245831977.195746 func 
> (cd6c8a07:502) 1245829979.42551:1245829979.42555)
> Jun 24 10:26:47 node5 kernel: [845668.151795] o2net: no longer 
> connected to node node1 (num 1) at 172.28.4.56:7777
> Jun 24 10:26:54 node5 kernel: [845682.375877] o2net: connected to node 
> node0 (num 0) at 172.28.4.48:7777
> Jun 24 10:26:54 node5 kernel: [845683.287292] o2net: connected to node 
> node4 (num 4) at 172.28.4.66:7777
> Jun 24 10:26:56 node5 kernel: [845685.463900] o2net: connected to node 
> node1 (num 1) at 172.28.4.56:7777
> Jun 24 10:27:05 node5 kernel: [845701.592536] o2net: connected to node 
> node2 (num 2) at 172.28.4.64:7777
> Jun 24 10:27:05 node5 kernel: [845701.987283] o2net: connected to node 
> node3 (num 3) at 172.28.4.65:7777
> Jun 24 10:27:15 node5 kernel: [845722.566864] 
> (5697,1):o2net_connect_expired:1629 ERROR: no connection established 
> with node 7 after 30.0 seconds, giving up and returning errors.
> Jun 24 10:27:16 node5 kernel: [845722.787106] 
> (5697,1):o2net_connect_expired:1629 ERROR: no connection established 
> with node 6 after 30.0 seconds, giving up and returning errors.
>
> These log messages, saying that node5 couldn't connect to nodes 6 and 7,
> repeated in an endless loop until I tried to umount the device and restart o2cb.
> One device I unmounted successfully, but the second one I couldn't unmount,
> and my logs show:
> Jun 24 11:43:01 node5 kernel: [855031.567140] 
> (20663,7):dlm_send_remote_unlock_request:359 ERROR: status = -107
> Jun 24 11:43:01 node5 kernel: [855031.567140] 
> (20663,7):dlm_send_remote_unlock_request:359 ERROR: status = -107
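
For reference, status -107 is -ENOTCONN ("Transport endpoint is not connected"),
i.e. the DLM could not reach the other node over o2net. On most Linux systems
you can confirm the number with something like the following (the header
location may vary by distribution):

  grep -w ENOTCONN /usr/include/asm-generic/errno.h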
>
> The problem is that I couldn't access the files on OCFS2 from any of
> the nodes until I rebooted them.
>
>
>
> On Wed, 2009-06-24 at 13:25 -0700, Sunil Mushran wrote:
>> The nodes are not frozen. The processes that are attempting to talk
>> to the "disconnected" node are waiting either for that node to reply
>> or, failing that, for it to die. The default timeout for the disk
>> heartbeat is 60 secs.
>>
>> If that node had simply died, the other nodes would have deemed it
>> dead after 60 secs, recovered it and carried on with life.
>>
>> But in this case, you disconnected that node from the cluster, meaning
>> that node first has to decide on its own course of action. It kills itself,
>> but only after the 60 sec timeout. After that, the other nodes have to wait
>> another 60 secs before deeming it dead. That's your 120 secs.
>>
>> You can always cut the disk heartbeat timeout in half, but the right value
>> depends a lot on your shared disk / IO path. Some multipath setups require
>> timeouts as high as 2 mins.
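
As a rough sketch (the numbers are illustrative; check the documentation for
your version): the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
seconds, so the default of 31 gives 60 secs. Halving it would look roughly like
this in /etc/default/o2cb (or /etc/sysconfig/o2cb), with the same value on every
node and an o2cb restart afterwards:

  # (31 - 1) * 2 = 60 secs  ->  (16 - 1) * 2 = 30 secs
  O2CB_HEARTBEAT_THRESHOLD=16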
>>
>> Many users also "multipath" the network by setting up network bonding.
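
For example, on Debian a minimal active-backup bond for the interconnect might
look something like the snippet below in /etc/network/interfaces. The interface
names and address are made up, and the exact bonding option names depend on your
ifenslave version, so treat this only as a sketch:

  auto bond0
  iface bond0 inet static
      address 172.28.4.1
      netmask 255.255.255.0
      bond-slaves eth0 eth1
      bond-mode active-backup
      bond-miimon 100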
>>
>> Sunil
>>
>> Piotr Teodorowski wrote:
>> > Hi,
>> >
>> > I have an OCFS2 cluster with 8 nodes and 2 devices mapped from disk
>> > storage to these nodes (the disks are formatted and OCFS2 file systems created).
>> >
>> > I can start the cluster on each node and mount the devices; this works fine.
>> >
>> > Let's say my first node is named host1, with node number 0 and IP address
>> > 172.28.4.1, and my second node is named host2, with node number 1 and IP
>> > address 172.28.4.2. I do nothing on the other nodes (but the device is
>> > mounted on every node).
>> >
>> > When I run find /mount_point -type f on host1, it searches and displays
>> > files. Before the find ends, I remove the IP address from the interface
>> > on host2 (so the network connection is broken) and the find on host1 freezes.
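
In commands, that test looks roughly like this (the interface name and prefix
length on host2 are just examples):

  # on host1
  find /mount_point -type f
  # on host2, while the find is still running
  ip addr del 172.28.4.2/24 dev eth0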
>> > This is the log on host1:
>> >
>> > Jun 24 12:36:33 host1 kernel: [ 1816.861233] o2net: connection to node 
>> > host2 (num 1) at 172.28.4.2:7777 has been idle for 30.0 seconds, 
>> > shutting it down.
>> > Jun 24 12:36:33 host1 kernel: [ 1816.861242] 
>> > (0,5):o2net_idle_timer:1468 here are some times that might help debug 
>> > the situation: (tmr 1245839763.115691 now 1245839793.115494 dr 
>> > 1245839763.115676 adv 1245839763.115691:1245839763.115691 func 
>> > (cd6c8a07:500) 1245839758.695001:1245839758.695003)
>> > Jun 24 12:36:33 host1 kernel: [ 1816.861260] o2net: no longer 
>> > connected to node host2 (num 1) at 172.28.4.2:7777
>> >
>> > A few minutes later the find can search again (I did not kill the process),
>> > and I have this in my logs:
>> > Jun 24 12:38:41 host1 kernel: [ 2011.612478] 
>> > (5935,0):o2dlm_eviction_cb:258 o2dlm has evicted node 1 from group 
>> > C9113043842642AD9694FDF0E9BE6E29
>> > Jun 24 12:38:42 host1 kernel: [ 2013.370655] 
>> > (5950,5):dlm_get_lock_resource:839 
>> > C9113043842642AD9694FDF0E9BE6E29:$RECOVERY: at least one node (1) to 
>> > recover before lock mastery can begin
>> > Jun 24 12:38:42 host1 kernel: [ 2013.370661] 
>> > (5950,5):dlm_get_lock_resource:873 C9113043842642AD9694FDF0E9BE6E29: 
>> > recovery map is not empty, but must master $RECOVERY lock now
>> > Jun 24 12:38:42 host1 kernel: [ 2013.378061] 
>> > (5950,5):dlm_do_recovery:524 (5950) Node 0 is the Recovery Master for 
>> > the Dead Node 1 for Domain C9113043842642AD9694FDF0E9BE6E29
>> >
>> > Is it normal that I can't access the OCFS2 volume (from any of the healthy
>> > nodes) during those few minutes? I do not need to write for 2 minutes, but
>> > this kind of break for reads is unacceptable.
>> >
>> > I have the default settings for HB:
>> > O2CB_ENABLED=true
>> > O2CB_BOOTCLUSTER=ocfs2
>> > O2CB_HEARTBEAT_THRESHOLD=31
>> > O2CB_IDLE_TIMEOUT_MS=30000
>> > O2CB_KEEPALIVE_DELAY_MS=2000
>> > O2CB_RECONNECT_DELAY_MS=2000
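
If the o2cb stack is running, you can also double-check the values the cluster
is actually using via configfs, for example (paths can vary between versions):

  cat /sys/kernel/config/cluster/ocfs2/heartbeat/dead_threshold
  cat /sys/kernel/config/cluster/ocfs2/idle_timeout_ms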
>> >
>> > ocfs2-tools 1.4.1 (debian lenny)
>> > kernel 2.6.26-2-amd64
>> > +multipath
>> > +bonding
>> >
>> > modinfo ocfs2
>> > filename:       /lib/modules/2.6.26-2-amd64/kernel/fs/ocfs2/ocfs2.ko
>> > license:        GPL
>> > author:         Oracle
>> > version:        1.5.0
>> > description:    OCFS2 1.5.0
>> > srcversion:     B19D847BA86E871E41B7A64
>> > depends:        jbd,ocfs2_stackglue,ocfs2_nodemanager
>> > vermagic:       2.6.26-2-amd64 SMP mod_unload modversions
>> >
>> > Any advice?
>> >
>> > Peter
>>
>>     



