[Ocfs2-users] heartbeat write timeout

Thu Mar 30 18:15:19 CST 2006

I have confirmed with Stephan that this problem appears to relate to the HEARTBEAT_THRESHOLD parameter set in /etc/sysconfig/o2cb. Having hit it myself, and having heard from a couple of other people on the list that it has caused them problems too, I suspect the default threshold of 7 is too short, even on reasonably fast server-storage setups such as an HP DL380 Packaged Cluster.
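For anyone wanting to see what a given threshold means in wall-clock terms, here is a rough sketch. It assumes the heartbeat thread writes once every 2 seconds and that a node fences after (threshold - 1) missed intervals; that assumption is at least consistent with the "12000 milliseconds" in the quoted panic message below for the default of 7, but it should be checked against the OCFS2 documentation. The function and constant names are made up for illustration.

```python
import math

# Assumption: o2hb issues one heartbeat write every 2 seconds.
HEARTBEAT_INTERVAL_MS = 2000


def timeout_ms(threshold: int) -> int:
    """Write timeout implied by a threshold, assuming (threshold - 1) intervals."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for(timeout_seconds: float) -> int:
    """Smallest threshold whose implied timeout is at least timeout_seconds."""
    intervals = math.ceil(timeout_seconds * 1000 / HEARTBEAT_INTERVAL_MS)
    return intervals + 1


if __name__ == "__main__":
    print(timeout_ms(7))      # default threshold -> 12000
    print(threshold_for(60))  # tolerate 60 s of stalled I/O -> 31
```

Under these assumptions, a storage path that can stall for up to a minute under load would need a threshold of about 31 rather than 7.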

Does the OCFS2 development team also consider this to be too short, or is altering the parameter just a workaround that shouldn't be used? If it is only a workaround, how should we approach the problem of self-fencing nodes?

Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?

Finally, if altering the threshold is a valid solution, could it please be added to the FAQ and the user guide, so that people know to adjust it as a first step on encountering the problem rather than having to post to the list and wait for replies?

Regards,
Gavin

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Stephan A. Rickauer
Sent: Thursday, 30 March 2006 00:47
To: ocfs2-users at oss.oracle.com
Subject: [Ocfs2-users] heartbeat write timeout

Dear list,

I am evaluating ocfs2 in a test environment that currently runs a "cluster" in one-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3), 2.6.9-34.EL) connected to an iSCSI storage device. While running load tests with 'bonnie++' to measure the performance of the storage device together with the file system, I experience regular kernel panics related to ocfs2 (1.2.0 RPMs).

Here is the message I get (I did not want to file a bug yet; maybe it's just me missing something). sdb1 is the iSCSI device:

---snip---
(3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device sdb1 after 12000 milliseconds
(3,0):o2hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
---snip---

I am tempted to rule out problems with the iSCSI storage device itself, since tests with GFS and ext3 did not reveal comparable problems, but I cannot be 100% sure.

On the bug page I spotted ID 565, which seems to fit my scenario, but the status of the bug is unclear to me (references to version 0.99 are given): http://oss.oracle.com/bugzilla/show_bug.cgi?id=565

Any help / comments etc. are appreciated.
Thanks.

-- 

 Stephan A. Rickauer

 -----------------------------------------------------------
 Institut für Neuroinformatik          Tel: +41 44 635 30 50
 Universität / ETH Zürich              Sek: +41 44 635 30 52
 Winterthurerstrasse 190               Fax: +41 44 635 30 53
 CH-8057 Zürich                        Web:  www.ini.ethz.ch

 RSA public key: https://www.ini.ethz.ch/~stephan/pubkey.asc
 -----------------------------------------------------------