[Ocfs2-users] 6 node cluster with unexplained reboots

Ulf Zimmermann ulf at atc-onlane.com
Mon Jul 30 11:11:02 PDT 2007


Too early to call. Management made the call "This hardware seems to have
been stable, let's use it".

> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> Sent: Monday, July 30, 2007 11:07
> To: Ulf Zimmermann
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> So are you suggesting the reason was bad hardware?
> Or, is it too early to call?
> 
> Ulf Zimmermann wrote:
> > I have a serial console set up with logging via conserver, but so far
> > no further crash. We also swapped hardware around a bit (another
> > 4-node cluster with DL360g5s had been working without a crash for
> > several weeks; we swapped those 4 nodes in for the first 4 in the
> > 6-node cluster).
> >
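> > (For a panic to actually land in that log, the kernel also has to be
> > pointed at the serial port; an example boot parameter, with port and
> > speed being guesses for this hardware:
> >
> >     console=tty0 console=ttyS0,115200n8
> >
> > without it, kernel messages never reach the serial line.)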
> >
> >> -----Original Message-----
> >> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> >> Sent: Monday, July 30, 2007 10:21
> >> To: Ulf Zimmermann
> >> Cc: ocfs2-users at oss.oracle.com
> >> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> >>
> >> Do you have a netconsole setup? If not, set it up. That will capture
> >> the real reason for the reset. Well, it typically does.
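> >>
> >> A minimal setup, with the IPs, interface and MAC below being
> >> placeholders for your network: on the node being watched,
> >>
> >>     modprobe netconsole \
> >>         netconsole=6666@192.168.1.10/eth0,514@192.168.1.20/00:11:22:33:44:55
> >>
> >> and on the receiving box something as simple as
> >>
> >>     nc -u -l -p 514
> >>
> >> will capture the oops/panic output as the node goes down.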
> >>
> >> Ulf Zimmermann wrote:
> >>
> >>> We just installed a new cluster with 6 HP DL380g5, dual single-port
> >>> Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to
> >>> a 3Par S400. We are using the 3Par recommended config for the Qlogic
> >>> driver and device-mapper-multipath, giving us 4 paths to the SAN. We
> >>> do see some SCSI errors where DM-MP is failing a path after getting
> >>> a 0x2000 error from the SAN controller, but the path gets put back
> >>> in service in less than 10 seconds.
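> >>>
> >>> (If it helps, the path flapping can be watched as it happens; these
> >>> are stock multipath-tools commands, nothing 3Par-specific:
> >>>
> >>>     multipath -ll    # path topology and current state of each path
> >>>     multipathd -k    # interactive console; then type "show paths"
> >>>
> >>> to see which of the 4 paths is being failed and restored.)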
> >>>
> >>> This needs to be fixed, but I don't think it is what is causing our
> >>> reboots. 2 of the nodes rebooted once while being idle (ocfs2 and
> >>> clusterware were running, no db), and one node rebooted twice: once
> >>> while idle (another node was copying our 9i db with fscat from ocfs1
> >>> to the ocfs2 data volume) and once while some load was put on it via
> >>> the upgraded 10g database. In all cases it is as if someone hit a
> >>> hardware reset button. No kernel panic (at least not one leading to
> >>> a stop with a visible message), but we do get a dirty write cache
> >>> for the internal cciss controller.
> >>>
> >>> The only messages we get on the nodes are when the crashed node is
> >>> already in reset and has missed its ocfs2 heartbeat (set to the
> >>> default of 7), followed later by crs moving the vip.
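> >>>
> >>> (In case the path failovers and the heartbeat ever interact: the
> >>> fence timeout works out to (threshold - 1) * 2 seconds, so the
> >>> default of 7 is only 12 seconds. A sketch of raising it, assuming
> >>> the stock config file location:
> >>>
> >>>     # /etc/sysconfig/o2cb
> >>>     O2CB_HEARTBEAT_THRESHOLD=31    # (31 - 1) * 2 = 60 seconds
> >>>
> >>> with the o2cb service restarted on every node afterwards.)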
> >>>
> >>> Any hints on troubleshooting this would be appreciated.
> >>>
> >>> Regards, Ulf.
> >>>
> >>>
> >>> --------------------------
> >>> Sent from my BlackBerry Wireless Handheld
> >>>
> >>>
> >>>
> >>>
> > ------------------------------------------------------------------------
> >>> _______________________________________________
> >>> Ocfs2-users mailing list
> >>> Ocfs2-users at oss.oracle.com
> >>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>


