[Ocfs2-users] 6 node cluster with unexplained reboots

Ulf Zimmermann ulf at atc-onlane.com
Mon Jul 30 10:22:29 PDT 2007


I have a serial console set up with logging via conserver, but so far there
has been no further crash. We also swapped some hardware around: another
4-node cluster with DL360g5 servers had been running for several weeks
without a crash, so we swapped those 4 nodes in for the first 4 nodes of the
6-node cluster.
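Alongside the serial console, the netconsole suggestion in the quoted reply needs a machine outside the cluster to receive the kernel's UDP log messages. Below is a minimal receiver sketch in Python. It assumes the nodes load the netconsole module with a parameter of the documented form `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`, pointing at this host. The port 6666, the helper names `open_listener` and `capture`, and the log format are illustrative choices, not part of any OCFS2 tooling.

```python
import socket
import time
from typing import Optional, TextIO


def open_listener(port: int = 6666, host: str = "0.0.0.0") -> socket.socket:
    """Bind a UDP socket on which netconsole datagrams from the nodes arrive.

    Port 6666 is an assumption; it must match the tgt-port given to netconsole.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def capture(sock: socket.socket, logfile: TextIO,
            max_messages: Optional[int] = None) -> int:
    """Append each received datagram to logfile, tagged with time and sender IP.

    Loops forever unless max_messages is given; returns the number of messages logged.
    """
    count = 0
    while max_messages is None or count < max_messages:
        data, (src_ip, _src_port) = sock.recvfrom(4096)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        text = data.decode("utf-8", errors="replace").rstrip("\n")
        # Tag each line with the sender so all six nodes can share one log file.
        logfile.write(f"{stamp} {src_ip} {text}\n")
        # Flush right away: the last lines before a reset are the ones that matter.
        logfile.flush()
        count += 1
    return count
```

Usage on the logging host: `with open("netconsole.log", "a") as f: capture(open_listener(), f)`. Netconsole can only forward what the kernel actually prints, so a reset that happens with no kernel output at all will still leave this log empty.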

> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> Sent: Monday, July 30, 2007 10:21
> To: Ulf Zimmermann
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> Do you have netconsole set up? If not, set it up. That will capture the
> real reason for the reset. Well, it typically does.
> 
> Ulf Zimmermann wrote:
> > We just installed a new cluster with 6 HP DL380g5 servers, each with dual
> > single-port Qlogic 24xx HBAs connected via two HP 4/16 Storageworks
> > switches to a 3Par S400. We are using the 3Par-recommended config for the
> > Qlogic driver and device-mapper-multipath, giving us 4 paths to the SAN.
> > We do see some SCSI errors where DM-MP fails a path after getting a 0x2000
> > error from the SAN controller, but the path is put back in service in
> > less than 10 seconds.
> >
> > This needs to be fixed, but I don't think it is what is causing our
> > reboots. Two of the nodes rebooted once while idle (ocfs2 and clusterware
> > were running, no db). One node rebooted once while idle (another node was
> > using fscat to copy our 9i db from ocfs1 to the ocfs2 data volume) and
> > once while some load was put on it via the upgraded 10g database. In all
> > cases it is as if someone pressed a hardware reset button. There is no
> > kernel panic (at least not one that stops with a visible message); we can
> > see a dirty write cache on the internal cciss controller.
> >
> > The only messages we get on the surviving nodes appear once the crashed
> > node is already in reset and has missed its ocfs2 heartbeat (threshold
> > set to the default of 7), followed later by CRS moving the VIP.
> >
> > Any hints on troubleshooting this would be appreciated.
> >
> > Regards, Ulf.
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
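The quoted report leaves the ocfs2 heartbeat threshold at its default of 7 while DM-MP path restores take just under 10 seconds. The sketch below compares the two. It assumes the relationship given in the OCFS2 1.2 FAQ, where a node is declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) heartbeat intervals of 2 seconds each; check the documentation for the OCFS2 version actually in use. The helper names `dead_time_seconds` and `min_threshold` are hypothetical.

```python
import math

# Assumed o2hb heartbeat period in seconds (per the OCFS2 1.2 FAQ formula).
HEARTBEAT_INTERVAL_S = 2.0


def dead_time_seconds(threshold: int, interval_s: float = HEARTBEAT_INTERVAL_S) -> float:
    """Approximate time before a silent node is declared dead: (threshold - 1) * interval."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_s


def min_threshold(stall_s: float, interval_s: float = HEARTBEAT_INTERVAL_S) -> int:
    """Smallest threshold whose dead time strictly exceeds an I/O stall of stall_s seconds."""
    # Need (t - 1) * interval > stall, i.e. t > stall / interval + 1.
    return math.floor(stall_s / interval_s) + 2
```

With the default threshold, `dead_time_seconds(7)` gives 12.0 seconds, so a path restore of under 10 seconds fits inside the window with only about 2 seconds to spare; any stall of 12 seconds or more would need `min_threshold(12)`, which is 8, or higher.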


