[Ocfs2-users] heartbeat write timeout

Thu Mar 30 18:36:05 CST 2006

Are you seeing timeouts with elevator=deadline?

We test only with the default value and have not seen any
disk heartbeat timeouts on either 2Gb FC or GigE iSCSI, and
those tests run heavy database loads.

When the heartbeat thread panics, it dumps messages indicating
how long each of its tasks took to perform. Could you share
those messages?

SCOTT, Gavin wrote:
> After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. Having encountered this myself and confirmed with a couple of other people on the list that it has caused them problems, it seems that the default threshold of 7 is possibly too short, even for reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster.
>
> Does the OCFS2 development team also consider this to be too short, or is altering the parameter just a workaround that shouldn't be used? If it is just a workaround, how should we approach the problem of self-fencing nodes?
>
> Also, can we expect this behaviour on some platforms but not others, or is the default too short for all platforms? If it is a blanket problem, should the default threshold be raised?
>
> Finally, if altering the threshold is a valid solution, could it please be added to the FAQs and the user guide, so that people know to adjust it as a first step on encountering the problem rather than having to post to the list and wait for replies?
>
> Regards,
> Gavin
>  
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Stephan A. Rickauer
> Sent: Thursday, 30 March 2006 00:47
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] heartbeat write timeout
>
> Dear list,
>
> I am evaluating ocfs2 in a test environment that currently runs a "cluster" in one-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3), 2.6.9-34.EL) connected to an iSCSI storage device. While doing load tests with 'bonnie++' to test the performance of the storage device together with the file system, I experience regular kernel panics related to ocfs2 (1.2.0 RPMs).
>
> Here is the message I get (I did not want to file a bug yet, maybe it's just me missing something). sdb1 is the iscsi device:
>
> ---snip---
> (3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device sdb1 after 12000 milliseconds
> (3,0):o2hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all active regions
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
> ---snip---
>
> I am tempted to rule out problems with the iSCSI storage device itself, though I cannot be 100% sure; tests with GFS and ext3 did not reveal comparable problems.
>
> On the bug page I spotted ID565, which seems to fit my scenario, but the status of the bug is unclear to me (references to version 0.99 are given): http://oss.oracle.com/bugzilla/show_bug.cgi?id=565
>
> Any help / comments etc. are appreciated.
> Thanks.
>
>
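
For reference, the two numbers quoted above (a threshold of 7 and a write timeout after 12000 milliseconds) are consistent with a timeout of (threshold - 1) * 2 seconds, i.e. one heartbeat iteration every 2 seconds. The sketch below assumes that relation; the function names and the 2000 ms interval constant are illustrative only and are not part of any OCFS2 tool, and the result should be checked against the documentation for the installed version before changing /etc/sysconfig/o2cb.

```python
# Sketch only. Assumption: timeout = (threshold - 1) * 2 seconds, which
# matches the quoted log (threshold 7 -> 12000 ms). Names are hypothetical.

HEARTBEAT_INTERVAL_MS = 2000  # assumed interval between heartbeat iterations


def threshold_to_timeout_ms(threshold: int) -> int:
    """Return the write timeout in milliseconds implied by a threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def timeout_ms_to_threshold(timeout_ms: int) -> int:
    """Return the smallest threshold that tolerates timeout_ms of stall."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    # Ceiling division, so a partial interval still counts as a full one.
    iterations = -(-timeout_ms // HEARTBEAT_INTERVAL_MS)
    return iterations + 1


if __name__ == "__main__":
    print(threshold_to_timeout_ms(7))      # the default discussed above
    print(timeout_ms_to_threshold(60000))  # threshold for a 60 s tolerance
```

Under that assumption, tolerating a 60-second I/O stall would mean setting the threshold to 31 rather than 7.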