[Ocfs2-users] 6 node cluster with unexplained reboots
Alexei_Roudnev
Alexei_Roudnev at exigengroup.com
Wed Aug 15 17:52:49 PDT 2007
ANY SCSI controller can quietly delay I/O for 10 - 20 seconds, without errors
or explanations. A 10-second threshold in OCFSv2 will never work properly.
For example, the controller may have been busy getting statistics from the
disk, or one of the disks required a reset (which takes more than 10 seconds)
and the cache was full, and so on.
Even a 1-minute timeout is too short for many SAN systems (and sometimes we
need 2 minutes); 10 seconds is out of the question.
(Of course there is a tradeoff, such as _longer recovery time after a
single failure_, but 1 - 2 minutes doesn't look unreasonable here.)
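
For reference, o2cb takes the disk heartbeat setting as a threshold count
(O2CB_HEARTBEAT_THRESHOLD), not as seconds. Below is a rough sketch of the
conversion, assuming the relation given in the OCFS2 FAQ: timeout in
seconds = (threshold - 1) * 2, i.e. one heartbeat every 2 seconds. The
function names are made up for illustration and are not part of any OCFS2
tool.

```python
import math

# Assumption: o2cb writes one disk heartbeat every 2 seconds, so that
# timeout_seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 (per the OCFS2 FAQ).
HEARTBEAT_INTERVAL_S = 2


def threshold_for_timeout(timeout_s):
    """Smallest threshold whose resulting timeout is at least timeout_s."""
    if timeout_s <= 0:
        raise ValueError("timeout must be positive")
    return math.ceil(timeout_s / HEARTBEAT_INTERVAL_S) + 1


def timeout_for_threshold(threshold):
    """Disk heartbeat timeout in seconds for a given threshold value."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


if __name__ == "__main__":
    # The 1-minute and 2-minute timeouts discussed above.
    for seconds in (60, 120):
        print(seconds, "s -> threshold", threshold_for_timeout(seconds))
```

Under that assumption, a 1-minute timeout corresponds to a threshold of 31
and a 2-minute timeout to 61. The value has to be the same on every node.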
----- Original Message -----
From: "Ulf Zimmermann" <ulf at atc-onlane.com>
To: "Mark Fasheh" <mark.fasheh at oracle.com>
Cc: "Sunil Mushran" <Sunil.Mushran at oracle.com>; <ocfs2-users at oss.oracle.com>
Sent: Wednesday, August 15, 2007 5:43 PM
Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
> -----Original Message-----
> From: Mark Fasheh [mailto:mark.fasheh at oracle.com]
> Sent: Wednesday, August 15, 2007 16:49
> To: Ulf Zimmermann
> Cc: Sunil Mushran; ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>
> On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote:
> > Index 22: took 10003 ms to do waiting for write completion
> > *** ocfs2 is very sorry to be fencing this system by restarting ***
> >
> > There were no SCSI errors on the console or logs around the time of this
> > reboot.
>
> It looks like the write took too long - as a first step, you might want to
> up the disk heartbeat timeouts on those systems. Run:
>
> $ /etc/init.d/o2cb configure
>
> on each node to do that. That won't hide any hardware problems, but if the
> problem is just a latency to get the write to disk, it'd help tune it
> away.
> --Mark
The SAN is a 3Par E200, which writes into cache on its two
controllers, acknowledges the write, and only then actually writes it to
disk. I have not found any reason for this delay yet, so for now I am
stumped as to why the write took so long.
Ulf.
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users