[Ocfs2-users] 6 node cluster with unexplained reboots

Ulf Zimmermann ulf at atc-onlane.com
Mon Aug 13 08:46:51 PDT 2007


One node of our 4-node cluster rebooted last night:

(11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
dm-1 after 12000 milliseconds
Heartbeat thread (11) printing last 24 blocking operations (cur = 22):
Heartbeat thread stuck at waiting for write completion, stuffing current
time into that blocker (index 22)
Index 23: took 0 ms to do checking slots
Index 0: took 1 ms to do waiting for write completion
Index 1: took 1997 ms to do msleep
Index 2: took 0 ms to do allocating bios for read
Index 3: took 0 ms to do bio alloc read
Index 4: took 0 ms to do bio add page read
Index 5: took 0 ms to do submit_bio for read
Index 6: took 8 ms to do waiting for read completion
Index 7: took 0 ms to do bio alloc write
Index 8: took 0 ms to do bio add page write
Index 9: took 0 ms to do submit_bio for write
Index 10: took 0 ms to do checking slots
Index 11: took 0 ms to do waiting for write completion
Index 12: took 1992 ms to do msleep
Index 13: took 0 ms to do allocating bios for read
Index 14: took 0 ms to do bio alloc read
Index 15: took 0 ms to do bio add page read
Index 16: took 0 ms to do submit_bio for read
Index 17: took 7 ms to do waiting for read completion
Index 18: took 0 ms to do bio alloc write
Index 19: took 0 ms to do bio add page write
Index 20: took 0 ms to do submit_bio for write
Index 21: took 0 ms to do checking slots
Index 22: took 10003 ms to do waiting for write completion
*** ocfs2 is very sorry to be fencing this system by restarting ***

There were no SCSI errors on the console or in the logs around the time
of this reboot.
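For reference, the 12000 ms above lines up with the o2cb heartbeat dead
threshold of 7 mentioned further down in this thread: o2hb writes its
heartbeat slot every 2 seconds (the ~1997 ms msleeps in the trace), and
the timeout works out to (threshold - 1) * 2000 ms. Since multipath
failover here can take up to 10 seconds, one common workaround is to
raise the threshold in /etc/sysconfig/o2cb; the value below is only an
example, not a tested recommendation for this setup:

    # (7 - 1) * 2000 ms = 12000 ms, the timeout in the log above.
    # A threshold of 31 would allow (31 - 1) * 2000 ms = 60 seconds.
    O2CB_HEARTBEAT_THRESHOLD=31

The new threshold only takes effect after the cluster stack is
restarted on each node (/etc/init.d/o2cb offline, then online).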

> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Ulf Zimmermann
> Sent: Monday, July 30, 2007 11:11
> To: Sunil Mushran
> Cc: ocfs2-users at oss.oracle.com
> Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> Too early to call. Management made the call: "This hardware seems to
> have been stable, let's use it."
> 
> > -----Original Message-----
> > From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> > Sent: Monday, July 30, 2007 11:07
> > To: Ulf Zimmermann
> > Cc: ocfs2-users at oss.oracle.com
> > Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> >
> > So are you suggesting the reason was bad hardware?
> > Or, is it too early to call?
> >
> > Ulf Zimmermann wrote:
> > > I have a serial console set up with logging via conserver, but so
> > > far no further crash. We also swapped hardware around a bit
> > > (another 4-node cluster with DL360g5s had been running without a
> > > crash for several weeks; we swapped those 4 nodes in for the first
> > > 4 in the 6-node cluster).
> > >
> > >
> > >> -----Original Message-----
> > >> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> > >> Sent: Monday, July 30, 2007 10:21
> > >> To: Ulf Zimmermann
> > >> Cc: ocfs2-users at oss.oracle.com
> > >> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained
reboots
> > >>
> > >> Do you have a netconsole setup? If not, set it up. That will
> > >> capture the real reason for the reset. Well, it typically does.
> > >>
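[A minimal netconsole setup on a 2.6 kernel looks roughly like the
below; the addresses, interface, and MAC here are placeholders for your
own sender/receiver pair.]

    # On the node that resets: log kernel messages over UDP.
    # Format: netconsole=srcport@srcip/dev,dstport@dstip/dstmac
    modprobe netconsole \
        netconsole=6665@10.0.0.5/eth0,514@10.0.0.9/00:11:22:33:44:55

    # On the receiving box: capture the UDP stream to a file.
    nc -u -l -p 514 | tee netconsole.log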
> > >> Ulf Zimmermann wrote:
> > >>
> > >>> We just installed a new cluster with 6 HP DL380g5s, with dual
> > >>> single-port Qlogic 24xx HBAs connected via two HP 4/16
> > >>> Storageworks switches to a 3Par S400. We are using the 3Par
> > >>> recommended config for the Qlogic driver and
> > >>> device-mapper-multipath, giving us 4 paths to the SAN. We do see
> > >>> some SCSI errors where DM-MP fails a path after getting a 0x2000
> > >>> error from the SAN controller, but the path gets put back in
> > >>> service in less than 10 seconds.
> > >>
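[Illustrative only: a dm-multipath stanza for a 3Par array goes in
/etc/multipath.conf and usually looks something like the below. The
exact settings should come from 3Par's recommended config; the values
here are placeholders.]

    devices {
        device {
            vendor                "3PARdata"
            product               "VV"
            path_grouping_policy  multibus
            path_checker          tur
            failback              immediate
            # queue I/O for ~12 polling intervals before failing
            no_path_retry         12
        }
    }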
> > >>> This needs to be fixed, but I don't think it is what is causing
> > >>> our reboots. Two of the nodes rebooted once while idle (ocfs2
> > >>> and clusterware were running, no db), and one node rebooted once
> > >>> while idle (another node was copying our 9i db with fscat from
> > >>> ocfs1 to the ocfs2 data volume) and once while some load was put
> > >>> on it via the upgraded 10g database. In all cases it is as if
> > >>> someone hit a hardware reset button. No kernel panic (at least
> > >>> not one leading to a stop with a visible message), though we do
> > >>> get a dirty write cache on the internal cciss controller.
> > >>
> > >>> The only messages we get on the other nodes are when the crashed
> > >>> node is already in reset and has missed its ocfs2 heartbeat (set
> > >>> to the default of 7), followed later by crs moving the vip.
> > >>
> > >>> Any hints on troubleshooting this would be appreciated.
> > >>>
> > >>> Regards, Ulf.
> > >>>
> > >>>
> > >>> --------------------------
> > >>> Sent from my BlackBerry Wireless Handheld
> > >>>
> > >>>
> > >>>
> > >>>
> 
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users


