[Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?

Fri Jun 9 14:30:05 CDT 2006

Understood, but how do I determine why I'm failing the 12-second
heartbeat once a week?  Before I bump the HB threshold, shouldn't I
figure out why dm-6 is gone for 12 seconds?  The last 24 ops are as follows:

(7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
dm-6 after 12000 milliseconds
Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
Heartbeat thread stuck at waiting for read completion, stuffing current
time into that blocker (index 3)
Index 4: took 0 ms to do submit_bio for read
Index 5: took 0 ms to do waiting for read completion
Index 6: took 0 ms to do bio alloc write
Index 7: took 0 ms to do bio add page write
Index 8: took 0 ms to do submit_bio for write
Index 9: took 0 ms to do checking slots
Index 10: took 0 ms to do waiting for write completion
Index 11: took 1998 ms to do msleep
Index 12: took 0 ms to do allocating bios for read
Index 13: took 0 ms to do bio alloc read
Index 14: took 0 ms to do bio add page read
Index 15: took 0 ms to do submit_bio for read
Index 16: took 0 ms to do waiting for read completion
Index 17: took 0 ms to do bio alloc write
Index 18: took 0 ms to do bio add page write
Index 19: took 0 ms to do submit_bio for write
Index 20: took 0 ms to do checking slots
Index 21: took 0 ms to do waiting for write completion
Index 22: took 1999 ms to do msleep
Index 23: took 0 ms to do allocating bios for read
Index 0: took 0 ms to do bio alloc read
Index 1: took 0 ms to do bio add page read
Index 2: took 0 ms to do submit_bio for read
Index 3: took 9998 ms to do waiting for read completion
(7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active
regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
system by panicing
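
In the trace above, the only nonzero entries are the two msleep calls
(about 2 seconds each) and the final wait for read completion at index 3.
A rough Python sketch for pulling those timings out of a saved netdump
log is below. The regular expression is an assumption based only on the
line format shown in this message, and parse_ops/summarize are
illustrative names, not part of any OCFS2 tool.

```python
import re

# Matches entries such as "Index 11: took 1998 ms to do msleep".
# The pattern is assumed from the log excerpt in this message only.
OP_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")


def parse_ops(log_text):
    """Return (index, milliseconds, operation) tuples in the order logged."""
    return [(int(i), int(ms), op.strip()) for i, ms, op in OP_RE.findall(log_text)]


def summarize(ops):
    """Separate the deliberate msleep time from time spent in other steps."""
    sleep_ms = sum(ms for _, ms, op in ops if op == "msleep")
    io_ms = sum(ms for _, ms, op in ops if op != "msleep")
    slowest = max(ops, key=lambda entry: entry[1])
    return {"sleep_ms": sleep_ms, "io_ms": io_ms, "slowest": slowest}


if __name__ == "__main__":
    sample = """\
Index 2: took 0 ms to do submit_bio for read
Index 3: took 9998 ms to do waiting for read completion
Index 11: took 1998 ms to do msleep
"""
    print(summarize(parse_ops(sample)))
```

Run against the full trace, this would point at the read on dm-6 as the
one operation that blocked, which is the part that still needs explaining.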

/Brian/

On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
> The hb failure is just the effect of the ios not completing within 12 secs.
> The full oops trace gives the last 24 ops and their timings.
> 
> One solution is to double up the hb timeout. Set,
> O2CB_HEARTBEAT_THRESHOLD = 14
> 
> Brian Long wrote:
> > Hello,
> >
> > I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> > 1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
> > fencing) and I get a full vmcore on my netdump server.  The netdump log
> > file shows the shared filesystem LUN (/dev/dm-6) did not respond within
> > 12000ms.  I have not changed the default heartbeat values
> > in /etc/sysconfig/o2cb.  There was no other IO ongoing when this
> > happens, but they are HP Proliant servers running the Insight Manager
> > agents.
> >
> > Why would the heartbeat fail roughly once a week?  Should I open a
> > bugzilla and upload my netdump log file?
> >
> > Thanks.
> >
> > /Brian/
> >   
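
For reference, the arithmetic behind the O2CB_HEARTBEAT_THRESHOLD
suggestion quoted above can be sketched as follows. This assumes the
relation timeout = (threshold - 1) * 2 seconds, which matches the
12000 ms default seen in the trace (threshold 7) and the roughly
2-second msleep between heartbeat iterations. The function names are
illustrative only.

```python
def hb_timeout_seconds(threshold, interval_s=2):
    """Seconds of missed heartbeats tolerated before the node self-fences.

    Assumes timeout = (threshold - 1) * interval, with a 2-second interval.
    """
    return (threshold - 1) * interval_s


def threshold_for(timeout_s, interval_s=2):
    """Inverse of the above: threshold needed for a desired timeout."""
    return timeout_s // interval_s + 1


if __name__ == "__main__":
    for value in (7, 14):
        print(value, "->", hb_timeout_seconds(value), "seconds")
```

Under that assumption a threshold of 14 gives 26 seconds, slightly more
than double the default, and an exact doubling to 24 seconds would be a
threshold of 13.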
-- 
       Brian Long                      |         |           |
       IT Data Center Systems          |       .|||.       .|||.
       Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
       Phone: (919) 392-7363           |   C i s c o   S y s t e m s