[Ocfs2-users] 6 node cluster with unexplained reboots

Ulf Zimmermann ulf at atc-onlane.com
Wed Aug 15 18:18:32 PDT 2007


> -----Original Message-----
> From: Mark Fasheh [mailto:mark.fasheh at oracle.com]
> Sent: Wednesday, August 15, 2007 18:12
> To: Ulf Zimmermann
> Cc: Sunil Mushran; ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> On Wed, Aug 15, 2007 at 05:54:54PM -0700, Ulf Zimmermann wrote:
> > The controller getting the request mirrors the request to the second
> > controller (in this case there is only 1, there can be up to 7 other).
> > Then it acknowledges the request and writes it to disk. Each controller
> > has double batteries to be able to finish any pending writes. If a
> > controller fails, it will only acknowledge the write after it is
> > physical on the disk. This is part of the 3Par operation. I have
> > submitted a request to 3Par to check the extensive logs they generate to
> > see if there is anything which can explain this write delay.
> 
> Ahh, ok - I was mistaken. When you said "controller" I was thinking "HBA"!
> 
> So, ignore the bit about write-back versus write-through caching - I was at
> the wrong part of the storage stack :)
> 
> So yeah, I'd try increasing the hb timeout to compensate. It'd definitely be
> interesting to find out why the I/O's are taking so long. They're
> technically allowed to, but it doesn't seem like there's a good reason for
> it based on your description thus far.
> 	--Mark

Yup, there were no other messages during that time, such as SCSI or
multipath errors. The Qlogic cards are set to a 30-second timeout for
link down and are hard-coded for point-to-point. The one thing I do know
about is that some of the 10 nodes (6+4 cluster) show too many SCSI
errors, for which we currently suspect either the FC cable or the SFP on
the fibre channel switch.
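
For reference, the "hb timeout" mentioned above is normally controlled
through O2CB_HEARTBEAT_THRESHOLD in the o2cb configuration. The sketch
below assumes the relationship given in the OCFS2 FAQ, timeout in
seconds = (threshold - 1) * 2, and the function names are made up for
illustration only. Please check the formula against the documentation
for the installed OCFS2 version before relying on it.

```python
import math

# Assumption: the o2cb disk heartbeat runs once every 2 seconds,
# as described in the OCFS2 FAQ.
HB_INTERVAL_SECS = 2


def threshold_for_timeout(timeout_secs):
    """Return the smallest threshold whose timeout is >= timeout_secs.

    Hypothetical helper; follows threshold = (timeout / 2) + 1.
    """
    if timeout_secs <= 0:
        raise ValueError("timeout must be positive")
    return math.ceil(timeout_secs / HB_INTERVAL_SECS) + 1


def timeout_for_threshold(threshold):
    """Return the heartbeat timeout in seconds for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HB_INTERVAL_SECS


if __name__ == "__main__":
    # Example: a timeout that comfortably outlasts a 30-second HBA
    # link-down timeout, here 60 seconds.
    print(threshold_for_timeout(60))
    print(timeout_for_threshold(31))
```

Under that assumption, a 60-second heartbeat timeout corresponds to a
threshold of 31, which would leave room for the 30-second link-down
timeout on the HBAs. The same value has to be set on every node in the
cluster.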

It is all kind of a waiting game right now. As I said previously, I just
wanted to follow up so that people who find my original posts through a
Yahoo or Google search also get to see what we found out.

Ulf.



