[Ocfs2-users] ocfs2 hosts reboot under load

Alexander Mestiashvili alexander.mestiashvili at biotec.tu-dresden.de
Sun Feb 1 09:31:14 PST 2009


Hello, I am having trouble with my 4-node ocfs2 cluster: hosts reboot under load.

The hardware is four Dell 1850 servers connected via a 100 Mbit/s network.
The storage is a RAID 5 array attached via Fibre Channel.
I ran bonnie++ simultaneously on two hosts for testing.
On the second host (host8) I got the messages below in kern.log.

The first host (host7) rebooted at the time of this message on host8:
Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.


mount | grep ocfs
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/sda on /shared type ocfs2 (rw,_netdev,heartbeat=local)

The command I used: bonnie++ -d /shared/ocfs2_nutch8/ -u root -s 0 -n 100:100m:10k:100

Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.
Jan 30 16:23:48 host8 kernel: (0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1233328998.315538 now 1233329028.313246 dr 1233328998.315530 adv 1233328998.315541:1233328998.315541 func (fa7e1976:502) 1233328900.631572:1233328900.631582)
Jan 30 16:23:48 host8 kernel: o2net: no longer connected to node host7 (num 0) at 192.168.0.27:7777
Jan 30 16:23:48 host8 kernel: (16132,0):dlm_do_master_request:1335 ERROR: link to 0 went down!
Jan 30 16:23:48 host8 kernel: (16132,0):dlm_get_lock_resource:912 ERROR: status = -112
Jan 30 16:23:55 host8 kernel: (2616,1):o2dlm_eviction_cb:258 o2dlm has evicted node 0 from group DE9BC917EFB247458EF221C2167F6CC1
Jan 30 16:23:58 host8 kernel: (16132,0):dlm_restart_lock_mastery:1218 ERROR: node down! 0
Jan 30 16:23:58 host8 kernel: (16132,0):dlm_wait_for_lock_mastery:1035 ERROR: status = -11
Jan 30 16:24:00 host8 kernel: (16132,0):dlm_get_lock_resource:893 DE9BC917EFB247458EF221C2167F6CC1:N0000000009f618da: at least one node (0) to recover before lock mastery can begin
Jan 30 16:24:22 host8 last message repeated 2 times
Jan 30 16:25:18 host8 kernel: o2net: connected to node host7 (num 0) at 192.168.0.27:7777
Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Node 0 joins domain DE9BC917EFB247458EF221C2167F6CC1
Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Nodes in domain ("DE9BC917EFB247458EF221C2167F6CC1"): 0 1 2 3 
Jan 30 16:42:11 host8 kernel: INFO: task kswapd0:207 blocked for more than 120 seconds.
Jan 30 16:42:11 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 16:42:11 host8 kernel: kswapd0       D 0000000000000100     0   207      2
Jan 30 16:42:11 host8 kernel:  ffff88012dd09cf0 0000000000000046 ffff88012e2dc148 ffffffff8021e03f
Jan 30 16:42:11 host8 kernel:  ffff88012fbd7340 ffff88012faf46a0 ffff88012fbd7600 0000000000000001
Jan 30 16:42:11 host8 kernel:  0000000000000286 0000000000000003 ffff88012dd09cf0 ffffffff8021ec30
Jan 30 16:42:11 host8 kernel: Call Trace:
Jan 30 16:42:11 host8 kernel:  [<ffffffff8021e03f>] 0xffffffff8021e03f
Jan 30 16:42:11 host8 kernel:  [<ffffffff8021ec30>] 0xffffffff8021ec30
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b
Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
Jan 30 16:42:11 host8 kernel:  [<ffffffff8028bbe0>] 0xffffffff8028bbe0
Jan 30 16:42:11 host8 kernel:  [<ffffffff8028c201>] 0xffffffff8028c201
Jan 30 16:42:11 host8 kernel:  [<ffffffff8028c469>] 0xffffffff8028c469
Jan 30 16:42:11 host8 kernel:  [<ffffffff8025d7d8>] 0xffffffff8025d7d8
Jan 30 16:42:11 host8 kernel:  [<ffffffff8025df2b>] 0xffffffff8025df2b
Jan 30 16:42:11 host8 kernel:  [<ffffffff8025cb00>] 0xffffffff8025cb00
Jan 30 16:42:11 host8 kernel:  [<ffffffff80414d37>] 0xffffffff80414d37
Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
Jan 30 16:42:11 host8 kernel:  [<ffffffff8025dbea>] 0xffffffff8025dbea
Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b4de>] 0xffffffff8023b4de
Jan 30 16:42:11 host8 kernel:  [<ffffffff80225a29>] 0xffffffff80225a29
Jan 30 16:42:11 host8 kernel:  [<ffffffff80203c79>] 0xffffffff80203c79
Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b497>] 0xffffffff8023b497
Jan 30 16:42:11 host8 kernel:  [<ffffffff80203c6f>] 0xffffffff80203c6f
Jan 30 16:42:11 host8 kernel: 
Jan 30 16:42:11 host8 kernel: INFO: task bonnie++:16132 blocked for more than 120 seconds.
Jan 30 16:42:11 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 30 16:42:11 host8 kernel: bonnie++      D 0000000102fe0588     0 16132   2991
Jan 30 16:42:11 host8 kernel:  ffff88010022f888 0000000000000086 0000000000000000 ffff880123438748
Jan 30 16:42:11 host8 kernel:  ffff88012fbd6cf0 ffff88012fa7a6a0 ffff88012fbd6fb0 000000012f402380
Jan 30 16:42:11 host8 kernel:  0000000000000003 0000000000000001 0000000000000000 0000000000000000
Jan 30 16:42:11 host8 kernel: Call Trace:
Jan 30 16:42:11 host8 kernel:  [<ffffffff80415f99>] 0xffffffff80415f99
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01d4040>] 0xffffffffa01d4040
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01c78a5>] 0xffffffffa01c78a5
Jan 30 16:42:11 host8 kernel:  [<ffffffff80299508>] 0xffffffff80299508
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01c9524>] 0xffffffffa01c9524
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01b60fb>] 0xffffffffa01b60fb
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01be69d>] 0xffffffffa01be69d
Jan 30 16:42:11 host8 kernel:  [<ffffffff80415e90>] 0xffffffff80415e90
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01b8215>] 0xffffffffa01b8215
Jan 30 16:42:11 host8 kernel:  [<ffffffff80254305>] 0xffffffff80254305
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01eada0>] 0xffffffffa01eada0
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01eada0>] 0xffffffffa01eada0
Jan 30 16:42:11 host8 kernel:  [<ffffffff8022dd99>] 0xffffffff8022dd99
Jan 30 16:42:11 host8 kernel:  [<ffffffff80253441>] 0xffffffff80253441
Jan 30 16:42:11 host8 kernel:  [<ffffffff8028c750>] 0xffffffff8028c750
Jan 30 16:42:11 host8 kernel:  [<ffffffff80254cfa>] 0xffffffff80254cfa
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01bd79f>] 0xffffffffa01bd79f
Jan 30 16:42:11 host8 kernel:  [<ffffffff80254e17>] 0xffffffff80254e17
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01cc468>] 0xffffffffa01cc468
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01c6724>] 0xffffffffa01c6724
Jan 30 16:42:11 host8 kernel:  [<ffffffff80279227>] 0xffffffff80279227
Jan 30 16:42:11 host8 kernel:  [<ffffffff80277c41>] 0xffffffff80277c41
Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
Jan 30 16:42:11 host8 kernel:  [<ffffffff8028d122>] 0xffffffff8028d122
Jan 30 16:42:11 host8 kernel:  [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6
Jan 30 16:42:11 host8 kernel:  [<ffffffff80279984>] 0xffffffff80279984
Jan 30 16:42:11 host8 kernel:  [<ffffffff80279e0c>] 0xffffffff80279e0c
Jan 30 16:42:11 host8 kernel:  [<ffffffff80202d9b>] 0xffffffff80202d9b
Jan 30 16:42:11 host8 kernel: 
Jan 31 09:50:10 host8 kernel: INFO: task kswapd0:207 blocked for more than 120 seconds.
Jan 31 09:50:10 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 31 09:50:10 host8 kernel: kswapd0       D 0000000000000080     0   207      2
Jan 31 09:50:10 host8 kernel:  ffff88012dd09cf0 0000000000000046 ffff88012e2dc148 ffffffff8021e03f
Jan 31 09:50:10 host8 kernel:  ffff88012fbd7340 ffff88012faf46a0 ffff88012fbd7600 0000000000000001
Jan 31 09:50:10 host8 kernel:  0000000000000286 0000000000000003 ffff88012dd09cf0 ffffffff8021ec30
Jan 31 09:50:10 host8 kernel: Call Trace:
Jan 31 09:50:10 host8 kernel:  [<ffffffff8021e03f>] 0xffffffff8021e03f
Jan 31 09:50:10 host8 kernel:  [<ffffffff8021ec30>] 0xffffffff8021ec30
Jan 31 09:50:10 host8 kernel:  [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b
Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
Jan 31 09:50:10 host8 kernel:  [<ffffffff8028bbe0>] 0xffffffff8028bbe0
Jan 31 09:50:10 host8 kernel:  [<ffffffff8028c201>] 0xffffffff8028c201
Jan 31 09:50:10 host8 kernel:  [<ffffffff8028c469>] 0xffffffff8028c469
Jan 31 09:50:10 host8 kernel:  [<ffffffff8025d7d8>] 0xffffffff8025d7d8
Jan 31 09:50:10 host8 kernel:  [<ffffffff8025df2b>] 0xffffffff8025df2b
Jan 31 09:50:10 host8 kernel:  [<ffffffff8025cb00>] 0xffffffff8025cb00
Jan 31 09:50:10 host8 kernel:  [<ffffffff80414d37>] 0xffffffff80414d37
Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
Jan 31 09:50:10 host8 kernel:  [<ffffffff8025dbea>] 0xffffffff8025dbea
Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b4de>] 0xffffffff8023b4de
Jan 31 09:50:10 host8 kernel:  [<ffffffff80225a29>] 0xffffffff80225a29
Jan 31 09:50:10 host8 kernel:  [<ffffffff80203c79>] 0xffffffff80203c79
Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b497>] 0xffffffff8023b497
Jan 31 09:50:10 host8 kernel:  [<ffffffff80203c6f>] 0xffffffff80203c6f
Jan 31 09:50:10 host8 kernel: 
Jan 31 09:50:10 host8 kernel: INFO: task bonnie++:20292 blocked for more than 120 seconds.
Jan 31 09:50:10 host8 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 31 09:50:10 host8 kernel: bonnie++      D ffff88005a533000     0 20292   2991
Jan 31 09:50:10 host8 kernel:  ffff880075a83cb8 0000000000000086 0000000000000000 ffff880030cd7c10
Jan 31 09:50:10 host8 kernel:  ffff880001410cf0 ffff88012f0946a0 ffff880001410fb0 00000001a01cf3cb
Jan 31 09:50:10 host8 kernel:  000000000ba734ca ffff8800a57510c8 000000000025a000 ffff88003e8243a0
Jan 31 09:50:10 host8 kernel: Call Trace:
Jan 31 09:50:10 host8 kernel:  [<ffffffff80415f99>] 0xffffffff80415f99
Jan 31 09:50:10 host8 kernel:  [<ffffffffa01d4040>] 0xffffffffa01d4040
Jan 31 09:50:10 host8 kernel:  [<ffffffffa01d9258>] 0xffffffffa01d9258
Jan 31 09:50:10 host8 kernel:  [<ffffffffa01d9975>] 0xffffffffa01d9975
Jan 31 09:50:10 host8 kernel:  [<ffffffff802805dd>] 0xffffffff802805dd
Jan 31 09:50:10 host8 kernel:  [<ffffffff80281d90>] 0xffffffff80281d90
Jan 31 09:50:10 host8 kernel:  [<ffffffff8028405d>] 0xffffffff8028405d
Jan 31 09:50:10 host8 kernel:  [<ffffffff8028d122>] 0xffffffff8028d122
Jan 31 09:50:10 host8 kernel:  [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6
Jan 31 09:50:10 host8 kernel:  [<ffffffff80277a2b>] 0xffffffff80277a2b
Jan 31 09:50:10 host8 kernel:  [<ffffffff80202d9b>] 0xffffffff80202d9b
Jan 31 09:50:10 host8 kernel: 
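
A quick sanity check on the o2net_idle_timer line in the log above: the sketch below subtracts its tmr stamp from its now stamp. Reading tmr as the moment the idle timer was last armed is an assumption based on the field name, and idle_gap_seconds is only an illustrative helper, not part of any OCFS2 tool.

```python
import re
from decimal import Decimal

# The o2net_idle_timer message copied from the kern.log excerpt above.
LINE = (
    "(0,0):o2net_idle_timer:1498 here are some times that might help debug "
    "the situation: (tmr 1233328998.315538 now 1233329028.313246 "
    "dr 1233328998.315530 adv 1233328998.315541:1233328998.315541 "
    "func (fa7e1976:502) 1233328900.631572:1233328900.631582)"
)


def idle_gap_seconds(line):
    """Return the seconds between the 'tmr' and 'now' stamps in the line.

    Assumption: both stamps are Unix epoch seconds with microseconds.
    Decimal is used so the subtraction is exact.
    """
    tmr = re.search(r"\btmr (\d+\.\d+)", line)
    now = re.search(r"\bnow (\d+\.\d+)", line)
    if tmr is None or now is None:
        raise ValueError("line has no tmr/now stamps")
    return Decimal(now.group(1)) - Decimal(tmr.group(1))


if __name__ == "__main__":
    print("seconds between tmr and now:", idle_gap_seconds(LINE))
```

The gap comes out just under 30 seconds, which lines up with the "idle for 30.0 seconds" message and with the 30000 network idle timeout shown in the o2cb status output.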

The kernel is vanilla 2.6.27.13 plus the atop and grsecurity patches.
The ocfs2-tools version is 1.4.1-1.

Here are the timeouts:
# /etc/init.d/o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster nutch: Online
Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
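
For readability, the values above can be converted to seconds. This sketch assumes the three network values are in milliseconds and that the disk heartbeat timeout follows the (threshold - 1) * 2 seconds rule given in the OCFS2 documentation; parse_status and HEARTBEAT_INTERVAL_S are illustrative names only.

```python
import re

# Timeout lines copied from the o2cb status output above.
STATUS = """\
Heartbeat dead threshold = 31
  Network idle timeout: 30000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
"""

# Assumption: the disk heartbeat fires every 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def parse_status(text):
    """Collect 'name = number' and 'name: number' lines into a dict."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*[=:]\s*(\d+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def disk_heartbeat_timeout_s(threshold):
    """Seconds without a disk heartbeat before a node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


values = parse_status(STATUS)
disk_timeout_s = disk_heartbeat_timeout_s(values["Heartbeat dead threshold"])
# Assumption: the network values are reported in milliseconds.
network_s = {
    name: ms / 1000
    for name, ms in values.items()
    if name.startswith("Network")
}

if __name__ == "__main__":
    print("disk heartbeat timeout (s):", disk_timeout_s)
    for name, seconds in network_s.items():
        print(f"{name} (s):", seconds)
```

Under those assumptions the status output corresponds to a 60 second disk heartbeat timeout, a 30 second network idle timeout, and 2 second keepalive and reconnect delays.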

What can I adjust? Or should I perhaps use an older kernel?
Thanks in advance.



