[Ocfs2-users] heartbeat write timeout
SCOTT, Gavin
gavin.l.scott at baesystems.com
Thu Mar 30 18:15:19 CST 2006
After confirming with Stephan, this problem appears to relate to the HEARTBEAT_THRESHOLD parameter set in /etc/sysconfig/o2cb. I have encountered this myself, and a couple of other people on the list have confirmed that it has caused them problems too, so it seems that the default threshold of 7 may be too short, even for reasonably fast server-storage solutions such as an HP DL380 Packaged Cluster.
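For reference, the relation between the threshold and the timeout reported in Stephan's log can be sketched as follows. This is only a minimal sketch, assuming the disk heartbeat is written every 2 seconds and that a node fences itself after (threshold - 1) intervals without a completed write, which is consistent with the 12000 milliseconds reported for the default of 7. The function names are hypothetical and are not part of any o2cb tool.

```python
# Sketch only: relation between the o2cb heartbeat threshold and the
# heartbeat write timeout, ASSUMING a fixed 2-second heartbeat interval.

HEARTBEAT_INTERVAL_MS = 2000  # assumed time between heartbeat writes


def timeout_ms(threshold: int) -> int:
    """Write timeout implied by a threshold: (threshold - 1) intervals."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for(timeout_seconds: int) -> int:
    """Smallest threshold whose timeout is at least timeout_seconds."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    interval_s = HEARTBEAT_INTERVAL_MS // 1000
    # Round up to a whole number of intervals, then add one.
    return -(-timeout_seconds // interval_s) + 1


if __name__ == "__main__":
    print(timeout_ms(7))      # timeout for the default threshold
    print(threshold_for(60))  # threshold needed for a 60-second timeout
```

Under these assumptions, tolerating a 60-second stall on the storage path would mean raising the threshold from 7 to 31.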
Does the OCFS2 development team also consider this to be too short, or is altering the parameter just a workaround that shouldn't be used? If it is only a workaround, how should we approach the problem of self-fencing nodes?
Also, can we expect this behaviour with some platforms but not others, or is it too short for all platforms? If it is a blanket problem, then should the default threshold be raised?
Finally, if altering the threshold is a valid solution, could it please be added to the FAQs and the user guide, so that people know to adjust it as a first step on encountering the problem rather than having to post to the list and wait for replies?
Regards,
Gavin
-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Stephan A. Rickauer
Sent: Thursday, 30 March 2006 00:47
To: ocfs2-users at oss.oracle.com
Subject: [Ocfs2-users] heartbeat write timeout
Dear list,
I am evaluating ocfs2 in a test environment that currently runs a "cluster" in single-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3), kernel 2.6.9-34.EL) connected to an iSCSI storage device. While running load tests with 'bonnie++' to measure the performance of the storage device together with the file system, I experience regular kernel panics related to ocfs2 (1.2.0 RPMs).
Here is the message I get (I did not want to file a bug yet; maybe it's just me missing something). sdb1 is the iSCSI device:
---snip---
(3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device sdb1 after 12000 milliseconds
(3,0):o2hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all active regions
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
---snip---
I am tempted to rule out problems related to the iSCSI storage device, since tests with GFS and ext3 did not reveal comparable problems, but I am not 100% sure.
On the bug page I spotted ID 565, which seems to fit my scenario, but the status of the bug is unclear to me (references to version 0.99 are given): http://oss.oracle.com/bugzilla/show_bug.cgi?id=565
Any help / comments etc. are appreciated.
Thanks.
--
Stephan A. Rickauer
-----------------------------------------------------------
Institut für Neuroinformatik Tel: +41 44 635 30 50
Universität / ETH Zürich Sek: +41 44 635 30 52
Winterthurerstrasse 190 Fax: +41 44 635 30 53
CH-8057 Zürich Web: www.ini.ethz.ch
RSA public key: https://www.ini.ethz.ch/~stephan/pubkey.asc
-----------------------------------------------------------