[Ocfs2-users] 6 node cluster with unexplained reboots

Wed Aug 15 17:54:54 PDT 2007

> -----Original Message-----
> From: Mark Fasheh [mailto:mark.fasheh at oracle.com]
> Sent: Wednesday, August 15, 2007 17:50
> To: Ulf Zimmermann
> Cc: Sunil Mushran; ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> On Wed, Aug 15, 2007 at 05:43:14PM -0700, Ulf Zimmermann wrote:
> > The SAN is a 3Par E200, which does write into cache on its two
> > controllers, then acknowledges a write and then actually writes it to
> > disk. I have not found any reason for this delay yet, and so far I am
> > stumped as to why it had such a long delay writing.
> 
> Are you saying that the controllers are doing write-back caching? If they're
> in that sort of mode, you need to change it to write-through for a clustered
> environment.
> 	--Mark

The controller that receives a write mirrors it to the second controller
(in this case there is only one other; there can be up to seven others).
It then acknowledges the write and afterwards writes it to disk. Each
controller has dual batteries so that it can finish any pending writes. If
a controller fails, a write is only acknowledged after it is physically on
disk. This is part of normal 3Par operation. I have submitted a request to
3Par to check the extensive logs they generate to see if there is anything
that can explain this write delay.
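
To make the ordering concrete, here is a minimal toy model of the
acknowledgment behaviour as I understand it. It is only a sketch: the
function name, the step labels and the list-based "cache" and "disk" are
hypothetical illustrations, not anything taken from 3Par's firmware or
documentation.

```python
def handle_write(block, peer_alive, peer_cache, disk):
    """Toy model: return the ordered steps taken for one write request.

    peer_alive, peer_cache and disk are illustrative stand-ins only.
    """
    steps = []
    if peer_alive:
        # Normal mode: mirror to the partner controller's cache first,
        # acknowledge to the host, and only then destage to disk.
        peer_cache.append(block)
        steps.append("mirror_to_peer")
        steps.append("ack")
        disk.append(block)
        steps.append("write_to_disk")
    else:
        # Degraded mode: no partner to mirror to, so the write must be
        # on disk before it is acknowledged (write-through behaviour).
        disk.append(block)
        steps.append("write_to_disk")
        steps.append("ack")
    return steps


if __name__ == "__main__":
    print(handle_write("blk0", True, [], []))
    print(handle_write("blk1", False, [], []))
```

The point is only that in normal operation the host sees the
acknowledgment before the data reaches disk, while with a failed
controller the acknowledgment comes after the disk write.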

The previous reboots, for which we have no console logs, may have been
OCFS2 fencing or something else. All of them happened while the cluster
was pretty much idle, whereas this time there was activity (an import).
Monday's reboot was the first since the initial 4 reboots. I wish OCFS2
logged to more than just the console, so that we had evidence for the
other reboots.

Ulf.