[Ocfs2-users] ocfs2 hosts reboot under load

Karim Alkhayer kkhayer at gmail.com
Tue Feb 3 00:55:58 PST 2009


> Hi Alex,
> Performance largely depends on:
> What sort of operations you are performing: copy, create, parse, ...
> What sort of storage you have: disk RPM, RAID level, fiber connection (if
> you have a SAN), LUN allocation per system/app, ...
> How the files are organized: do you place them all under one directory or
> spread them across several?
> It would also be worth evaluating the process/application in terms of
> parallelism and flow.
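To illustrate the "spread them across several" option, here is a minimal Python sketch that places each file in a subdirectory derived from a hash of its name, so that no single directory holds every file. Whether this helps on a given OCFS2 setup is an assumption that has to be measured. The names `sharded_path` and `store` and the `/shared/example` path are hypothetical and are not part of ocfs2 or its tools.

```python
import hashlib
from pathlib import Path


def sharded_path(root: Path, filename: str, levels: int = 2, width: int = 2) -> Path:
    """Return root/<aa>/<bb>/filename, where aa and bb come from a hash of the name."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # Take `levels` consecutive chunks of `width` hex characters as directory names.
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return root.joinpath(*parts, filename)


def store(root: Path, filename: str, data: bytes) -> Path:
    """Write data under the sharded location, creating directories as needed."""
    target = sharded_path(root, filename)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


if __name__ == "__main__":
    # Only computes the path; nothing is written here.
    print(sharded_path(Path("/shared/example"), "page-000123.html"))
```

With two levels of two hex characters each, files are spread over up to 65536 leaf directories, and the same name always maps to the same location, so a reader can recompute the path without a lookup table.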

> Best regards,
> Karim 

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com
[mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Alex Mestiashvili
Sent: Tuesday, February 03, 2009 10:40 AM
To: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ocfs2 hosts reboot under load

Karim Alkhayer wrote:
> Hi Alexander,
> Try increasing the value of "Heartbeat dead threshold".
> 5 minutes could be acceptable for preliminary testing under load.
> This will allow you to assess the problem before the node(s) die.
>
> Best regards,
> Karim
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Alexander
> Mestiashvili
> Sent: Sunday, February 01, 2009 7:31 PM
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] ocfs2 hosts reboot under load
>
> Hello, I am having trouble with my 4-node ocfs2 cluster. Hosts reboot under
> load.
>
> The hardware is 4 Dell 1850 servers connected via a 100 Mbit network.
> The storage is RAID 5 connected via Fibre Channel.
> I ran bonnie++ simultaneously on two hosts for testing.
> On the second host (host8) I got the following messages in kern.log.
>
> The first one (host7) rebooted at:
> Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at
> 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.
>
>
> mount | grep ocfs
> ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
> /dev/sda on /shared type ocfs2 (rw,_netdev,heartbeat=local)
>
> Command I used:
> bonnie++ -d /shared/ocfs2_nutch8/ -u root -s 0 -n 100:100m:10k:100
>
> Jan 30 16:23:48 host8 kernel: o2net: connection to node host7 (num 0) at
> 192.168.0.27:7777 has been idle for 30.0 seconds, shutting it down.
> Jan 30 16:23:48 host8 kernel: (0,0):o2net_idle_timer:1498 here are some
> times that might help debug the situation: (tmr 1233328998.315538 now
> 1233329028.313246 dr 1233328998.315530 adv
> 1233328998.315541:1233328998.315541 func (fa7e1976:502)
> 1233328900.631572:1233328900.631582)
> Jan 30 16:23:48 host8 kernel: o2net: no longer connected to node host7
> (num 0) at 192.168.0.27:7777
> Jan 30 16:23:48 host8 kernel: (16132,0):dlm_do_master_request:1335 ERROR:
> link to 0 went down!
> Jan 30 16:23:48 host8 kernel: (16132,0):dlm_get_lock_resource:912 ERROR:
> status = -112
> Jan 30 16:23:55 host8 kernel: (2616,1):o2dlm_eviction_cb:258 o2dlm has
> evicted node 0 from group DE9BC917EFB247458EF221C2167F6CC1
> Jan 30 16:23:58 host8 kernel: (16132,0):dlm_restart_lock_mastery:1218
> ERROR: node down! 0
> Jan 30 16:23:58 host8 kernel: (16132,0):dlm_wait_for_lock_mastery:1035
> ERROR: status = -11
> Jan 30 16:24:00 host8 kernel: (16132,0):dlm_get_lock_resource:893
> DE9BC917EFB247458EF221C2167F6CC1:N0000000009f618da: at least one node (0)
> to recover before lock mastery can begin
> Jan 30 16:24:22 host8 last message repeated 2 times
> Jan 30 16:25:18 host8 kernel: o2net: connected to node host7 (num 0) at
> 192.168.0.27:7777
> Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Node 0 joins domain
> DE9BC917EFB247458EF221C2167F6CC1
> Jan 30 16:25:18 host8 kernel: ocfs2_dlm: Nodes in domain
> ("DE9BC917EFB247458EF221C2167F6CC1"): 0 1 2 3 
> Jan 30 16:42:11 host8 kernel: INFO: task kswapd0:207 blocked for more than
> 120 seconds.
> Jan 30 16:42:11 host8 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 30 16:42:11 host8 kernel: kswapd0       D 0000000000000100     0   207
> 2
> Jan 30 16:42:11 host8 kernel:  ffff88012dd09cf0 0000000000000046
> ffff88012e2dc148 ffffffff8021e03f
> Jan 30 16:42:11 host8 kernel:  ffff88012fbd7340 ffff88012faf46a0
> ffff88012fbd7600 0000000000000001
> Jan 30 16:42:11 host8 kernel:  0000000000000286 0000000000000003
> ffff88012dd09cf0 ffffffff8021ec30
> Jan 30 16:42:11 host8 kernel: Call Trace:
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8021e03f>] 0xffffffff8021e03f
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8021ec30>] 0xffffffff8021ec30
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8028bbe0>] 0xffffffff8028bbe0
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8028c201>] 0xffffffff8028c201
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8028c469>] 0xffffffff8028c469
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8025d7d8>] 0xffffffff8025d7d8
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8025df2b>] 0xffffffff8025df2b
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8025cb00>] 0xffffffff8025cb00
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80414d37>] 0xffffffff80414d37
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8025dbea>] 0xffffffff8025dbea
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b4de>] 0xffffffff8023b4de
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80225a29>] 0xffffffff80225a29
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80203c79>] 0xffffffff80203c79
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b497>] 0xffffffff8023b497
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80203c6f>] 0xffffffff80203c6f
> Jan 30 16:42:11 host8 kernel: 
> Jan 30 16:42:11 host8 kernel: INFO: task bonnie++:16132 blocked for more
> than 120 seconds.
> Jan 30 16:42:11 host8 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 30 16:42:11 host8 kernel: bonnie++      D 0000000102fe0588     0 16132
> 2991
> Jan 30 16:42:11 host8 kernel:  ffff88010022f888 0000000000000086
> 0000000000000000 ffff880123438748
> Jan 30 16:42:11 host8 kernel:  ffff88012fbd6cf0 ffff88012fa7a6a0
> ffff88012fbd6fb0 000000012f402380
> Jan 30 16:42:11 host8 kernel:  0000000000000003 0000000000000001
> 0000000000000000 0000000000000000
> Jan 30 16:42:11 host8 kernel: Call Trace:
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80415f99>] 0xffffffff80415f99
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01d4040>] 0xffffffffa01d4040
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01c78a5>] 0xffffffffa01c78a5
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80299508>] 0xffffffff80299508
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01c9524>] 0xffffffffa01c9524
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01b60fb>] 0xffffffffa01b60fb
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01be69d>] 0xffffffffa01be69d
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80415e90>] 0xffffffff80415e90
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01b8215>] 0xffffffffa01b8215
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80254305>] 0xffffffff80254305
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01eada0>] 0xffffffffa01eada0
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01eada0>] 0xffffffffa01eada0
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8022dd99>] 0xffffffff8022dd99
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80253441>] 0xffffffff80253441
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8028c750>] 0xffffffff8028c750
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80254cfa>] 0xffffffff80254cfa
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01bd79f>] 0xffffffffa01bd79f
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80254e17>] 0xffffffff80254e17
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01cc468>] 0xffffffffa01cc468
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01c6724>] 0xffffffffa01c6724
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80279227>] 0xffffffff80279227
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80277c41>] 0xffffffff80277c41
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
> Jan 30 16:42:11 host8 kernel:  [<ffffffff8028d122>] 0xffffffff8028d122
> Jan 30 16:42:11 host8 kernel:  [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80279984>] 0xffffffff80279984
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80279e0c>] 0xffffffff80279e0c
> Jan 30 16:42:11 host8 kernel:  [<ffffffff80202d9b>] 0xffffffff80202d9b
> Jan 30 16:42:11 host8 kernel: 
> Jan 31 09:50:10 host8 kernel: INFO: task kswapd0:207 blocked for more than
> 120 seconds.
> Jan 31 09:50:10 host8 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 31 09:50:10 host8 kernel: kswapd0       D 0000000000000080     0   207
> 2
> Jan 31 09:50:10 host8 kernel:  ffff88012dd09cf0 0000000000000046
> ffff88012e2dc148 ffffffff8021e03f
> Jan 31 09:50:10 host8 kernel:  ffff88012fbd7340 ffff88012faf46a0
> ffff88012fbd7600 0000000000000001
> Jan 31 09:50:10 host8 kernel:  0000000000000286 0000000000000003
> ffff88012dd09cf0 ffffffff8021ec30
> Jan 31 09:50:10 host8 kernel: Call Trace:
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8021e03f>] 0xffffffff8021e03f
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8021ec30>] 0xffffffff8021ec30
> Jan 31 09:50:10 host8 kernel:  [<ffffffffa01ceb1b>] 0xffffffffa01ceb1b
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8028bbe0>] 0xffffffff8028bbe0
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8028c201>] 0xffffffff8028c201
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8028c469>] 0xffffffff8028c469
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8025d7d8>] 0xffffffff8025d7d8
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8025df2b>] 0xffffffff8025df2b
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8025cb00>] 0xffffffff8025cb00
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80414d37>] 0xffffffff80414d37
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b605>] 0xffffffff8023b605
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8025dbea>] 0xffffffff8025dbea
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b4de>] 0xffffffff8023b4de
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80225a29>] 0xffffffff80225a29
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80203c79>] 0xffffffff80203c79
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8023b497>] 0xffffffff8023b497
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80203c6f>] 0xffffffff80203c6f
> Jan 31 09:50:10 host8 kernel: 
> Jan 31 09:50:10 host8 kernel: INFO: task bonnie++:20292 blocked for more
> than 120 seconds.
> Jan 31 09:50:10 host8 kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 31 09:50:10 host8 kernel: bonnie++      D ffff88005a533000     0 20292
> 2991
> Jan 31 09:50:10 host8 kernel:  ffff880075a83cb8 0000000000000086
> 0000000000000000 ffff880030cd7c10
> Jan 31 09:50:10 host8 kernel:  ffff880001410cf0 ffff88012f0946a0
> ffff880001410fb0 00000001a01cf3cb
> Jan 31 09:50:10 host8 kernel:  000000000ba734ca ffff8800a57510c8
> 000000000025a000 ffff88003e8243a0
> Jan 31 09:50:10 host8 kernel: Call Trace:
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80415f99>] 0xffffffff80415f99
> Jan 31 09:50:10 host8 kernel:  [<ffffffffa01d4040>] 0xffffffffa01d4040
> Jan 31 09:50:10 host8 kernel:  [<ffffffffa01d9258>] 0xffffffffa01d9258
> Jan 31 09:50:10 host8 kernel:  [<ffffffffa01d9975>] 0xffffffffa01d9975
> Jan 31 09:50:10 host8 kernel:  [<ffffffff802805dd>] 0xffffffff802805dd
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80281d90>] 0xffffffff80281d90
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8028405d>] 0xffffffff8028405d
> Jan 31 09:50:10 host8 kernel:  [<ffffffff8028d122>] 0xffffffff8028d122
> Jan 31 09:50:10 host8 kernel:  [<ffffffffa01cbfe6>] 0xffffffffa01cbfe6
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80277a2b>] 0xffffffff80277a2b
> Jan 31 09:50:10 host8 kernel:  [<ffffffff80202d9b>] 0xffffffff80202d9b
> Jan 31 09:50:10 host8 kernel: 
>
> The kernel version is vanilla 2.6.27.13 + atop + grsecurity patches.
> The ocfs2-tools version is 1.4.1-1.
>
> Here are the timeouts:
> # /etc/init.d/o2cb status
> Driver for "configfs": Loaded
> Filesystem "configfs": Mounted
> Stack glue driver: Loaded
> Stack plugin "o2cb": Loaded
> Driver for "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster nutch: Online
> Heartbeat dead threshold = 31
>   Network idle timeout: 30000
>   Network keepalive delay: 2000
>   Network reconnect delay: 2000
> Checking O2CB heartbeat: Active
>
> What can I adjust? Or maybe I should use an older kernel?
> Thanks in advance.
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
>   
Thanks Karim,
I have changed the timeouts, and the hosts don't reboot any more.

Heartbeat dead threshold = 136
  Network idle timeout: 160000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
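
For reference, the arithmetic behind these values can be sketched as follows. This assumes the usual O2CB convention that the disk heartbeat fires every 2 seconds and that a node is declared dead after (threshold - 1) missed intervals. The function names are made up for illustration and are not part of ocfs2-tools.

```python
HEARTBEAT_INTERVAL_S = 2  # assumed O2CB disk heartbeat interval, in seconds


def threshold_to_seconds(threshold: int) -> int:
    """Seconds of missed disk heartbeats before a node is declared dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def seconds_to_threshold(seconds: int) -> int:
    """Heartbeat dead threshold needed for a given timeout in seconds."""
    return seconds // HEARTBEAT_INTERVAL_S + 1


if __name__ == "__main__":
    print(threshold_to_seconds(31))   # old setting: 60 seconds
    print(threshold_to_seconds(136))  # new setting: 270 seconds
    print(seconds_to_threshold(300))  # a 5 minute timeout: 151
    print(160000 // 1000)             # network idle timeout (ms): 160 seconds
```

Under that assumption, the new threshold of 136 corresponds to 4.5 minutes, slightly below the 5 minutes suggested earlier in the thread.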

But I still have very high iowait, and I see kernel call traces caused by
blocked tasks.
Also, ocfs2 is very slow with large numbers of small files. Is there any
way to increase ocfs2 performance for small files?

Alex
