[Ocfs2-users] 6 node cluster with unexplained reboots

Ulf Zimmermann ulf at atc-onlane.com
Fri Aug 31 08:19:43 PDT 2007


Our reboots turned out to be a timing issue. We have now set the o2cb
timeouts to 31, 30000, 2000 and 2000, and since then there have been no
more reboots. We also fixed our FC wiring (replaced 3 cables, and it looks
like we have 1 bad port on a switch) to get rid of the SCSI errors, and
with that multipath no longer disables paths.
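For reference, those four values are what "service o2cb configure" asks
for, in order: the heartbeat dead threshold, the network idle timeout, the
network keepalive delay and the network reconnect delay. In
/etc/sysconfig/o2cb that comes out roughly as below (variable names as
used by the o2cb init script in recent 1.2.x tools; treat this as a sketch
and check your own version):

    O2CB_HEARTBEAT_THRESHOLD=31
    O2CB_IDLE_TIMEOUT_MS=30000
    O2CB_KEEPALIVE_DELAY_MS=2000
    O2CB_RECONNECT_DELAY_MS=2000

All nodes have to use the same values, and o2cb has to be restarted (with
the ocfs2 volumes unmounted) for them to take effect. As far as I
understand it, the heartbeat write timeout printed in the logs below is
(O2CB_HEARTBEAT_THRESHOLD - 1) * 2000 ms, which is why the old default of
7 shows up as 12000 milliseconds further down, while 31 gives you 60000.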

> -----Original Message-----
> From: Hagmann, Michael [mailto:Michael.Hagmann at hilti.com]
> Sent: Thursday, August 30, 2007 23:47
> To: ocfs2-users at oss.oracle.com
> Cc: Ulf Zimmermann; Sunil Mushran
> Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> Hi
> 
> I have the same situation here: one node out of 4 rebooted with this
> error:
> 
> Aug 30 17:27:47 lilr206c (12,2):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-26 after 120000 milliseconds
> Aug 30 17:27:47 lilr206c Heartbeat thread (12) printing last 24 blocking operations (cur = 13):
> Aug 30 17:27:47 lilr206c Heartbeat thread stuck at waiting for write completion, stuffing current time into that blocker (index 13)
> Aug 30 17:27:47 lilr206c Index 14: took 0 ms to do checking slots
> Aug 30 17:27:47 lilr206c Index 15: took 1105 ms to do waiting for write completion
> Aug 30 17:27:47 lilr206c Index 16: took 18 ms to do msleep
> Aug 30 17:27:47 lilr206c Index 17: took 0 ms to do allocating bios for read
> Aug 30 17:27:47 lilr206c Index 18: took 0 ms to do bio alloc read
> Aug 30 17:27:47 lilr206c Index 19: took 0 ms to do bio add page read
> Aug 30 17:27:47 lilr206c Index 20: took 0 ms to do submit_bio for read
> Aug 30 17:27:47 lilr206c Index 21: took 401 ms to do waiting for read completion
> Aug 30 17:27:47 lilr206c Index 22: took 0 ms to do bio alloc write
> Aug 30 17:27:47 lilr206c Index 23: took 0 ms to do bio add page write
> Aug 30 17:27:47 lilr206c Index 0: took 0 ms to do submit_bio for write
> Aug 30 17:27:47 lilr206c Index 1: took 0 ms to do checking slots
> Aug 30 17:27:47 lilr206c Index 2: took 276 ms to do waiting for write completion
> Aug 30 17:27:47 lilr206c Index 3: took 1322 ms to do msleep
> Aug 30 17:27:47 lilr206c Index 4: took 0 ms to do allocating bios for read
> Aug 30 17:27:47 lilr206c Index 5: took 0 ms to do bio alloc read
> Aug 30 17:27:47 lilr206c Index 6: took 0 ms to do bio add page read
> Aug 30 17:27:47 lilr206c Index 7: took 0 ms to do submit_bio for read
> Aug 30 17:27:47 lilr206c Index 8: took 85285 ms to do waiting for read completion
> Aug 30 17:27:47 lilr206c Index 9: took 0 ms to do bio alloc write
> Aug 30 17:27:47 lilr206c Index 10: took 0 ms to do bio add page write
> Aug 30 17:27:47 lilr206c Index 11: took 0 ms to do submit_bio for write
> Aug 30 17:27:47 lilr206c Index 12: took 0 ms to do checking slots
> Aug 30 17:27:47 lilr206c Index 13: took 33389 ms to do waiting for write completion
> Aug 30 17:27:47 lilr206c *** ocfs2 is very sorry to be fencing this system by restarting ***
> 
> Did you find any reason for your reboots? Please let me know.
> 
> We have here a 4-node HP DL585 G2 cluster (with device-mapper
> multipath, 2 SAN cards per server, EMC CX3-20 storage) running RHEL4
> U5 and ocfs2-2.6.9-55.0.2.ELsmp-1.2.5-2.
> 
> thx mike
> 
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Ulf Zimmermann
> Sent: Monday, 13 August 2007 17:47
> To: Ulf Zimmermann; Sunil Mushran
> Cc: ocfs2-users at oss.oracle.com
> Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
> 
> One node of our 4-node cluster rebooted last night:
> 
> (11,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-1 after 12000 milliseconds
> Heartbeat thread (11) printing last 24 blocking operations (cur = 22):
> Heartbeat thread stuck at waiting for write completion, stuffing current time into that blocker (index 22)
> Index 23: took 0 ms to do checking slots
> Index 0: took 1 ms to do waiting for write completion
> Index 1: took 1997 ms to do msleep
> Index 2: took 0 ms to do allocating bios for read
> Index 3: took 0 ms to do bio alloc read
> Index 4: took 0 ms to do bio add page read
> Index 5: took 0 ms to do submit_bio for read
> Index 6: took 8 ms to do waiting for read completion
> Index 7: took 0 ms to do bio alloc write
> Index 8: took 0 ms to do bio add page write
> Index 9: took 0 ms to do submit_bio for write
> Index 10: took 0 ms to do checking slots
> Index 11: took 0 ms to do waiting for write completion
> Index 12: took 1992 ms to do msleep
> Index 13: took 0 ms to do allocating bios for read
> Index 14: took 0 ms to do bio alloc read
> Index 15: took 0 ms to do bio add page read
> Index 16: took 0 ms to do submit_bio for read
> Index 17: took 7 ms to do waiting for read completion
> Index 18: took 0 ms to do bio alloc write
> Index 19: took 0 ms to do bio add page write
> Index 20: took 0 ms to do submit_bio for write
> Index 21: took 0 ms to do checking slots
> Index 22: took 10003 ms to do waiting for write completion
> *** ocfs2 is very sorry to be fencing this system by restarting ***
> 
> There were no SCSI errors on the console or logs around the time of
> this reboot.
> 
> > -----Original Message-----
> > From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> > bounces at oss.oracle.com] On Behalf Of Ulf Zimmermann
> > Sent: Monday, July 30, 2007 11:11
> > To: Sunil Mushran
> > Cc: ocfs2-users at oss.oracle.com
> > Subject: RE: [Ocfs2-users] 6 node cluster with unexplained reboots
> >
> > Too early to call. Management made the call: "This hardware seems to
> > have been stable, let's use it".
> >
> > > -----Original Message-----
> > > From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> > > Sent: Monday, July 30, 2007 11:07
> > > To: Ulf Zimmermann
> > > Cc: ocfs2-users at oss.oracle.com
> > > Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> > >
> > > So are you suggesting the reason was bad hardware?
> > > Or, is it too early to call?
> > >
> > > Ulf Zimmermann wrote:
> > > > I have serial console setup with logging via conserver but so far
> > > > no further crash. We also swapped hardware a bit around (another 4
> > > > node cluster with DL360g5 was working without crash for several
> > > > weeks, we swapped those 4 nodes in for the first 4 in the 6 node
> > > > cluster).
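(For the archives: getting the kernel to talk to the serial port in the
first place is just a boot parameter; the device and speed below are
placeholders for whatever your conserver setup expects:

    # appended to the kernel line in /boot/grub/grub.conf
    console=tty0 console=ttyS0,115200n8

With that, panics and oopses go out ttyS0 as well as the VGA console, so
conserver can log them.)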
> > > >
> > > >
> > > >> -----Original Message-----
> > > >> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> > > >> Sent: Monday, July 30, 2007 10:21
> > > >> To: Ulf Zimmermann
> > > >> Cc: ocfs2-users at oss.oracle.com
> > > >> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
> > > >>
> > > >> Do you have a netconsole setup? If not, set it up. That will
> > > >> capture the real reason for the reset. Well, it typically does.
> > > >>
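(Again for the archives, since this comes up a lot: a minimal netconsole
setup along the lines Sunil describes looks something like the following.
All addresses, the interface and the MAC are placeholders for your own
network:

    # on the node that keeps resetting (placeholder IPs/interface/MAC)
    modprobe netconsole netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
    dmesg -n 8    # let everything up to KERN_DEBUG reach the consoles

    # on the receiving box 10.0.0.2, anything listening on UDP will do
    nc -u -l -p 9353 | tee netconsole.log

The kernel then copies its console output over UDP, so whatever it prints
in the last moment before the reset usually makes it to the other
machine.)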
> > > >> Ulf Zimmermann wrote:
> > > >>
> > > >>> We just installed a new cluster with 6 HP DL380g5, dual single
> > > >>> port Qlogic 24xx HBAs connected via two HP 4/16 Storageworks
> > > >>> switches to a 3Par S400. We are using the 3Par recommended config
> > > >>> for the Qlogic driver and device-mapper-multipath, giving us 4
> > > >>> paths to the SAN. We do see some SCSI errors where DM-MP is
> > > >>> failing a path after getting a 0x2000 error from the SAN
> > > >>> controller, but the path gets put back in service in less than
> > > >>> 10 seconds.
> > > >>>
> > > >>> This needs to be fixed, but I don't think it is what is causing
> > > >>> our reboots. 2 of the nodes rebooted once while being idle (ocfs2
> > > >>> and clusterware were running, no db), and one node rebooted once
> > > >>> while idle (another node was copying our 9i db using fscat from
> > > >>> ocfs1 to the ocfs2 data volume) and once more while some load was
> > > >>> put on it via the upgraded 10g database. In all cases it is as if
> > > >>> someone hit a hardware reset button. No kernel panic (at least
> > > >>> not one leading to a stop with a visible message), and we can get
> > > >>> a dirty write cache on the internal cciss controller.
> > > >>>
> > > >>> The only messages we get on the other nodes are when the crashed
> > > >>> node is already in reset and has missed its ocfs2 heartbeat (set
> > > >>> to the default of 7), followed later by crs moving the vip.
> > > >>>
> > > >>> Any hints on troubleshooting this would be appreciated.
> > > >>>
> > > >>> Regards, Ulf.


