[Ocfs2-users] AoE+ocfs2 = Heartbeat write timeout to device

Sunil Mushran Sunil.Mushran at oracle.com
Mon Mar 10 10:08:11 PDT 2008


$ ocfs2_hb_ctl -P -d <device> [-n <io_priority>]

Yes, ocfs2_hb_ctl has had an option to raise the heartbeat's io
priority for a long time. However, io priorities are only available
in newer kernels; I am not sure of the exact version. They are
definitely not available with EL4. I would think they are available
with EL5, but I can't say for sure.
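
For readers unfamiliar with what an io priority is at the kernel level: Linux packs a scheduling class and a level into one integer, which is what the ioprio_set(2) system call takes. The sketch below shows only that encoding, using the constants from the kernel header linux/ioprio.h. The helper name is made up for illustration, and how ocfs2_hb_ctl interprets its -n argument is not stated in this thread and is not assumed here.

```python
# Constants as defined in the kernel header linux/ioprio.h.
IOPRIO_CLASS_SHIFT = 13
IOPRIO_CLASS_RT = 1    # realtime
IOPRIO_CLASS_BE = 2    # best effort
IOPRIO_CLASS_IDLE = 3  # only when the disk is otherwise idle


def ioprio_value(io_class: int, level: int) -> int:
    """Hypothetical helper mirroring the kernel's IOPRIO_PRIO_VALUE macro."""
    if io_class not in (IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE):
        raise ValueError("unknown io class")
    if not 0 <= level <= 7:
        raise ValueError("level must be 0 (highest) to 7 (lowest)")
    # The class sits in the upper bits, the level in the lower 13 bits.
    return (io_class << IOPRIO_CLASS_SHIFT) | level
```

A realtime request at the highest level would be ioprio_value(IOPRIO_CLASS_RT, 0). Whether the io scheduler in use honors it depends on the kernel and scheduler.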

BTW, io priorities can only go so far. If the disk array is
overloaded, then your ios will take more time. So setting the hb
timeout to the array's io timeout is not a workaround but part
of the solution.
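
The relation between the heartbeat timeout and the O2CB_HEARTBEAT_THRESHOLD setting can be sketched as below. This assumes the formula given in the OCFS2 FAQ, where the timeout in seconds is (threshold - 1) * 2 because one heartbeat is written every 2 seconds. The function names are illustrative only, and the FAQ for the installed version should be checked before relying on the numbers.

```python
HB_INTERVAL_SECS = 2  # assumed: o2hb writes one heartbeat every 2 seconds


def threshold_for_timeout(timeout_secs: int) -> int:
    """O2CB_HEARTBEAT_THRESHOLD value that yields the given timeout."""
    return timeout_secs // HB_INTERVAL_SECS + 1


def timeout_for_threshold(threshold: int) -> int:
    """Heartbeat timeout in seconds implied by a threshold value."""
    return (threshold - 1) * HB_INTERVAL_SECS
```

Under that assumption, the 12000 milliseconds in the error message corresponds to a threshold of 7, and a 60 second timeout corresponds to a threshold of 31. To match an array's io timeout, pass that timeout to threshold_for_timeout().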

Luis Freitas wrote:
> Sunil,
>
>     Can I configure this heartbeat to use a high priority (realtime)
> scheduling?
>
>      If I simply increase the timeout, it could still time out under heavy
> I/O, such as several different threads queuing large amounts
> of writes. The kernel should know this is a high priority write so
> that it is put at the head of the queue.
>
> Regards,
> Luis
>
> */Sunil Mushran <Sunil.Mushran at oracle.com>/* wrote:
>
>     The older 12 sec default timeout was too low. It has been bumped
>     up to 60 secs. The FAQ has details on this.
>
>     b52 at entrap.de wrote:
>     > Hi,
>     >
>     > I have a problem with 100Mbit Ethernet, AoE and ocfs2. I set up 2 boxes
>     > connected via 100Mbit Ethernet to their ATA-over-Ethernet storage. The
>     > ocfs2 filesystem resides on such an AoE partition. If I produce high
>     > throughput to that ocfs2 partition on one node, it reboots after some
>     > seconds.
>     >
>     > I use dd for testing, like dd if=/dev/zero of=test bs=1M count=1000
>     > If I write 100MB of data to the disk, everything is fine. If I write 1GB of
>     > data to the disk, the node reboots after some seconds and prints the
>     > following error:
>     >
>     > (9,0):o2hb_write_timeout:167 ERROR: Heartbeat write timeout to device etherd/e402.0 after 12000 milliseconds
>     > (9,0):o2hb_stop_all_regions:1865 ERROR: stopping heartbeat on all active regions.
>     >
>     > This couldn't be caused by lost heartbeat packets. I set up a separate
>     > network for heartbeat to track this problem.
>     >
>     > Actually I know that 100Mbit Ethernet is a bottleneck, but this should not
>     > cause the system to reboot, right? Even if I could switch to Gigabit
>     > Ethernet, it may become the bottleneck in future.
>     >
>     > Has someone experienced this already? Do you know how to solve this issue?
>     > Please help, I need to do some tests.
>     > Your help is really appreciated.
>     >
>     > Cheers,
>     > Holger
>     >
>     >
>     > _______________________________________________
>     > Ocfs2-users mailing list
>     > Ocfs2-users at oss.oracle.com
>     > http://oss.oracle.com/mailman/listinfo/ocfs2-users
>     >

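A rough back-of-the-envelope check on the dd test quoted above, assuming the 100Mbit link is the only limit and ignoring protocol overhead and page caching: the sketch estimates the best-case time to push each test size over the wire and compares it with the 12 second timeout from the error message. The function name and the byte sizes are illustrative assumptions, not measurements from the thread.

```python
def best_case_transfer_secs(num_bytes: int, link_mbit_per_s: float) -> float:
    """Lower bound on transfer time, ignoring protocol overhead and caching."""
    bytes_per_sec = link_mbit_per_s * 1_000_000 / 8
    return num_bytes / bytes_per_sec


HB_TIMEOUT_SECS = 12           # from the error message: 12000 milliseconds
small = 100 * 1024 * 1024      # roughly the "100MB" test
large = 1000 * 1024 * 1024     # dd bs=1M count=1000

for label, size in (("100 MiB", small), ("1000 MiB", large)):
    secs = best_case_transfer_secs(size, 100)
    print(f"{label}: at least {secs:.1f} s, exceeds timeout: {secs > HB_TIMEOUT_SECS}")
```

On these assumptions the smaller write can drain in under the timeout while the larger one cannot. That is consistent with the reported behaviour if a heartbeat write has to wait behind the queued data, but it is only an estimate.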


