[Ocfs2-users] heartbeat write timeout

Thu Mar 30 22:50:12 CST 2006

I implemented the change to the threshold to get around the self-fencing
before the scheduler bug was reported, as you suggested in a post to
someone with a similar problem last November
(http://oss.oracle.com/pipermail/ocfs2-users/2005-November/000269.html).
Perhaps it should be made clear to anyone who read that post and changed
the threshold that it should be changed back to the default once the
elevator=deadline fix is implemented.
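
For anyone weighing a non-default threshold, here is a rough sketch of how
the threshold appears to map to the write timeout. It assumes a 2-second
heartbeat interval and a timeout of (threshold - 1) intervals; neither figure
is stated in this thread, though the formula does reproduce the 12000 ms in
the panic message for the default threshold of 7. The function names are
made up for illustration.

```python
# Assumption (not stated in this thread): the disk heartbeat fires every
# 2 seconds and the write timeout is (threshold - 1) heartbeat intervals.
HEARTBEAT_INTERVAL_MS = 2000  # assumed interval


def write_timeout_ms(threshold: int) -> int:
    """Return the assumed heartbeat write timeout for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for_timeout_ms(timeout_ms: int) -> int:
    """Return the smallest threshold whose timeout is at least timeout_ms."""
    intervals = -(-timeout_ms // HEARTBEAT_INTERVAL_MS)  # ceiling division
    return max(intervals, 1) + 1


if __name__ == "__main__":
    # Default threshold quoted in this thread.
    print(write_timeout_ms(7))
    # Threshold needed to tolerate a 30-second stall under the same assumption.
    print(threshold_for_timeout_ms(30000))
```

If the assumption holds, going back to the default means going back to a
12-second window for each heartbeat write to complete.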

I've already implemented the elevator=deadline fix but haven't changed
the threshold back to the default. I'll do that and hopefully won't see
the self-fences; if I do, I'll send the message dump.
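
To double-check that the deadline scheduler is really the one in use, a small
sketch like the following may help. It assumes the kernel exposes
/sys/block/<device>/queue/scheduler with the active scheduler shown in square
brackets; older kernels may not provide that file, and "sdb" is simply the
disk behind the sdb1 device from the original report. The helper names are
hypothetical.

```python
from pathlib import Path


def active_scheduler(text: str) -> str:
    """Return the bracketed entry from a line such as 'noop [deadline] cfq'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active scheduler marked in {text!r}")


def scheduler_for(device: str) -> str:
    """Read the active I/O scheduler for a whole-disk device such as 'sdb'."""
    # Assumed sysfs location; it may be absent on older kernels.
    path = Path(f"/sys/block/{device}/queue/scheduler")
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    try:
        print(scheduler_for("sdb"))
    except (OSError, ValueError) as exc:
        print(f"could not determine scheduler: {exc}")
```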

Gavin  

-----Original Message-----
From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com] 
Sent: Friday, 31 March 2006 11:06
To: SCOTT, Gavin
Cc: 
Subject: Re: [Ocfs2-users] heartbeat write timeout

Are you seeing timeouts with elevator=deadline?

We only test with the default value and have not seen any disk hb
timeouts on either 2G fc or gige iscsi. And these are heavy db loads.

When the hb thread panics, it dumps messages indicating the times it
took to perform the tasks. Could you share those messages?

SCOTT, Gavin wrote:
> After confirming with Stephan, this problem appears to relate to the
> HEARTBEAT_THRESHOLD parameter as set in /etc/sysconfig/o2cb. After
> encountering this myself and having confirmed with a couple of other
> people in the list that it has caused problems, it seems that the
> default threshold of 7 is possibly too short, even in reasonably fast
> server-storage solutions such as an HP DL380 Packaged Cluster.
>
> Does the OCFS2 development team also consider this to be too short, or
> is altering the parameter just a workaround that shouldn't be used? If
> this is the case then how should we approach the problem of self-fencing
> nodes?
>
> Also, can we expect this behaviour with some platforms but not others,
> or is it too short for all platforms? If it is a blanket problem, then
> should the default threshold be raised?
>
> Finally, if altering the threshold is a valid solution, could it
> please be added to the FAQs and the user guide so that people know to
> adjust it as a first step on encountering the problem, rather than
> having to post to the list and wait for replies?
>
> Regards,
> Gavin
>  
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com 
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Stephan A. 
> Rickauer
> Sent: Thursday, 30 March 2006 00:47
> To: ocfs2-users at oss.oracle.com
> Subject: [Ocfs2-users] heartbeat write timeout
>
> Dear list,
>
> I am evaluating ocfs2 in a test environment that currently runs a
> "cluster" in one-node mode (AMD Opteron, 2GB RAM, RH AS4 (CentOS 4.3),
> 2.6.9-34.EL) connected to an iSCSI storage device. While doing load
> tests with 'bonnie++' to test the performance of the storage device
> together with the file system, I experience regular kernel panics
> related to ocfs2 (1.2.0 RPMs).
>
> Here is the message I get (I did not want to file a bug yet, maybe
> it's just me missing something). sdb1 is the iscsi device:
>
> ---snip---
> (3,0):o2hb_write_timeout: 164 ERROR: Heartbeat write timeout to device sdb1 after 12000 milliseconds
> (3,0):o2hb_stop_all_regions: 1727 ERROR: stopping heartbeat on all active regions
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
> ---snip---
>
> I am tempted to rule out problems with the iscsi storage device itself,
> though I cannot be 100% sure; tests with GFS and ext3 did not reveal
> comparable problems.
>
> On the bug page I spotted ID565 which seems to fit my scenario, but
> the status of the bug is unclear to me (references to version 0.99 are
> given): http://oss.oracle.com/bugzilla/show_bug.cgi?id=565
>
> Any help / comments etc. are appreciated.
> Thanks.
>
>