[Ocfs2-users] 6 node cluster with unexplained reboots

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Wed Aug 15 17:52:49 PDT 2007


ANY SCSI controller can quietly delay I/O for 10 - 20 seconds, without errors 
or explanations. A 10-second threshold in OCFSv2 will never work properly.

For example, the controller was busy getting statistics from the disk, or one 
of the disks required a reset (which takes more than 10 seconds) and the cache 
was full, and so on.

Even a 1-minute timeout is too short for many SAN systems (and sometimes we 
need 2 minutes); 10 seconds is simply out of the question.

(Of course there is a tradeoff, such as _longer recovery time after a 
single failure_, but 1 - 2 minutes doesn't look unreasonable here.)
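
To put numbers on this: the o2cb disk heartbeat timeout is set through 
O2CB_HEARTBEAT_THRESHOLD, which counts heartbeat iterations, not seconds. 
The sketch below assumes the commonly documented relation 
timeout = (threshold - 1) * 2 seconds (a 2-second heartbeat interval); 
check it against the FAQ for your OCFS2 version. The function names are 
made up for illustration.

```python
import math

# Assumption: o2cb writes a disk heartbeat every 2 seconds and a node is
# fenced after (threshold - 1) intervals without a completed write.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_for_timeout(timeout_secs):
    """Smallest O2CB_HEARTBEAT_THRESHOLD giving at least timeout_secs."""
    return math.ceil(timeout_secs / HEARTBEAT_INTERVAL_SECS) + 1


def timeout_for_threshold(threshold):
    """Disk heartbeat timeout in seconds implied by a threshold value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


if __name__ == "__main__":
    # The 1 and 2 minute timeouts discussed above.
    for secs in (60, 120):
        print(f"{secs} s timeout -> threshold {threshold_for_timeout(secs)}")
```

Under that assumption, a 1-minute timeout corresponds to a threshold of 31 
and a 2-minute timeout to 61; the value has to be the same on every node.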

----- Original Message ----- 
From: "Ulf Zimmermann" <ulf at atc-onlane.com>
To: "Mark Fasheh" <mark.fasheh at oracle.com>
Cc: "Sunil Mushran" <Sunil.Mushran at oracle.com>; <ocfs2-users at oss.oracle.com>
Sent: Wednesday, August 15, 2007 5:43 PM
Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots


> -----Original Message-----
> From: Mark Fasheh [mailto:mark.fasheh at oracle.com]
> Sent: Wednesday, August 15, 2007 16:49
> To: Ulf Zimmermann
> Cc: Sunil Mushran; ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>
> On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote:
> > Index 22: took 10003 ms to do waiting for write completion
> > *** ocfs2 is very sorry to be fencing this system by restarting ***
> >
> > There were no SCSI errors on the console or logs around the time of
> > this reboot.
>
> It looks like the write took too long - as a first step, you might
> want to up the disk heartbeat timeouts on those systems. Run:
>
> $ /etc/init.d/o2cb configure
>
> on each node to do that. That won't hide any hardware problems, but if
> the problem is just a latency to get the write to disk, it'd help tune
> it away.
> --Mark

The SAN is a 3Par E200, which writes into cache on its two controllers,
acknowledges the write, and only then actually writes it to disk. I have
not found any reason for this delay yet, so I am still stumped as to why
the write took so long.

Ulf.


_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users



