[Ocfs2-users] 6 node cluster with unexplained reboots

Wed Aug 15 18:11:58 PDT 2007

On Wed, Aug 15, 2007 at 05:54:54PM -0700, Ulf Zimmermann wrote:
> The controller getting the request mirrors the request to the second
> controller (in this case there is only 1, there can be up to 7 others).
> Then it acknowledges the request and writes it to disk. Each controller
> has double batteries to be able to finish any pending writes. If a
> controller fails, it will only acknowledge the write after it is
> physically on the disk. This is part of the 3Par operation. I have
> submitted a request to 3Par to check the extensive logs they generate to
> see if there is anything which can explain this write delay. 

Ahh, ok - I was mistaken. When you said "controller" I was thinking "HBA"!

So, ignore the bit about write-back versus write-through caching - I was at
the wrong part of the storage stack :)

So yeah, I'd try increasing the heartbeat timeout to compensate. It'd
definitely be interesting to find out why the I/Os are taking so long.
They're technically allowed to, but it doesn't seem like there's a good
reason for it based on your description thus far.
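
For picking a new value, here is a rough sketch of the arithmetic. It assumes
the o2cb disk heartbeat runs every 2 seconds and that a node is fenced after
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds without a successful heartbeat
write, which is how I understand the OCFS2 FAQ; check that against the
documentation for your version. The function names are made up for the example.

```python
import math

# Assumption: the o2cb disk heartbeat interval is 2 seconds.
HEARTBEAT_INTERVAL_SECS = 2


def timeout_for_threshold(threshold):
    """Seconds of missed heartbeats tolerated for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(timeout_secs):
    """Smallest threshold that tolerates at least timeout_secs."""
    if timeout_secs <= 0:
        raise ValueError("timeout must be positive")
    return math.ceil(timeout_secs / HEARTBEAT_INTERVAL_SECS) + 1


if __name__ == "__main__":
    # Example: tolerate I/O stalls of up to a minute.
    print(threshold_for_timeout(60))
```

Under those assumptions, the result would be the value to set for
O2CB_HEARTBEAT_THRESHOLD, identically on every node in the cluster.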
	--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
mark.fasheh at oracle.com