[Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
Brian Long
brilong at cisco.com
Fri Jun 9 14:30:05 CDT 2006
Understood, but how do I determine why I'm failing the 12-second
heartbeat once a week? Before I bump the HB threshold, shouldn't I
figure out why dm-6 is unresponsive for 12 seconds? The last 24 ops
are as follows:
(7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-6 after 12000 milliseconds
Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 3)
Index 4: took 0 ms to do submit_bio for read
Index 5: took 0 ms to do waiting for read completion
Index 6: took 0 ms to do bio alloc write
Index 7: took 0 ms to do bio add page write
Index 8: took 0 ms to do submit_bio for write
Index 9: took 0 ms to do checking slots
Index 10: took 0 ms to do waiting for write completion
Index 11: took 1998 ms to do msleep
Index 12: took 0 ms to do allocating bios for read
Index 13: took 0 ms to do bio alloc read
Index 14: took 0 ms to do bio add page read
Index 15: took 0 ms to do submit_bio for read
Index 16: took 0 ms to do waiting for read completion
Index 17: took 0 ms to do bio alloc write
Index 18: took 0 ms to do bio add page write
Index 19: took 0 ms to do submit_bio for write
Index 20: took 0 ms to do checking slots
Index 21: took 0 ms to do waiting for write completion
Index 22: took 1999 ms to do msleep
Index 23: took 0 ms to do allocating bios for read
Index 0: took 0 ms to do bio alloc read
Index 1: took 0 ms to do bio add page read
Index 2: took 0 ms to do submit_bio for read
Index 3: took 9998 ms to do waiting for read completion
(7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
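
To pick out the slow entry in a trace like this without reading all 24
lines by eye, a small script can help. This is a minimal sketch in
Python, assuming the "Index N: took X ms to do ..." line format shown
above. The function names (parse_ops, slowest, io_wait_ms) are made up
for illustration and are not part of any OCFS2 tool, and SAMPLE holds
only a few of the entries from the trace.

```python
import re

# A few entries copied from the trace above; paste the full log here in practice.
SAMPLE = """\
Index 11: took 1998 ms to do msleep
Index 22: took 1999 ms to do msleep
Index 2: took 0 ms to do submit_bio for read
Index 3: took 9998 ms to do waiting for read completion
"""

# Matches lines such as "Index 3: took 9998 ms to do waiting for read completion".
LINE_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")


def parse_ops(text):
    """Return a list of (index, milliseconds, description) tuples."""
    ops = []
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            ops.append((int(match.group(1)), int(match.group(2)), match.group(3).strip()))
    return ops


def slowest(ops):
    """Return the single operation that took the longest."""
    return max(ops, key=lambda op: op[1])


def io_wait_ms(ops):
    """Sum the time spent in everything except the regular msleep between beats."""
    return sum(ms for _, ms, what in ops if what != "msleep")


if __name__ == "__main__":
    ops = parse_ops(SAMPLE)
    print("slowest op:", slowest(ops))
    print("non-sleep time (ms):", io_wait_ms(ops))
```

On the entries above, the only non-zero item other than the regular
msleep is the read completion at index 3, so the time is going into a
single read on dm-6 rather than being spread across the loop.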
/Brian/
On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
> The hb failure is just the effect of the ios not completing within 12 secs.
> The full oops trace gives the last 24 ops and their timings.
>
> One solution is to double up the hb timeout. Set,
> O2CB_HEARTBEAT_THRESHOLD = 14
>
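
For reference, my understanding from the OCFS2 FAQ is that the threshold
counts 2-second heartbeat iterations, with
timeout = (threshold - 1) * 2 seconds. Under that assumption the default
of 7 gives the 12000 ms seen in the trace, and 14 would give 26 seconds,
slightly more than double. A small Python sketch of the arithmetic
follows. The function names are hypothetical, and the formula is taken
from the FAQ, not read out of the o2hb source.

```python
# Assumed heartbeat interval; consistent with the ~2000 ms msleep entries in the trace.
HEARTBEAT_INTERVAL_MS = 2000


def timeout_ms(threshold):
    """Disk heartbeat timeout implied by a given O2CB_HEARTBEAT_THRESHOLD."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for(timeout_seconds):
    """Smallest threshold whose timeout is at least timeout_seconds."""
    wanted_ms = timeout_seconds * 1000
    iterations = -(-wanted_ms // HEARTBEAT_INTERVAL_MS)  # ceiling division
    return iterations + 1


if __name__ == "__main__":
    for value in (7, 14):
        print(value, "->", timeout_ms(value), "ms")
    print("threshold for 24 s:", threshold_for(24))
```

If the FAQ formula holds, an exact doubling to 24 seconds would be a
threshold of 13. In /etc/sysconfig/o2cb the assignment is written
shell-style, without spaces around the equals sign.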
> Brian Long wrote:
> > Hello,
> >
> > I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> > 1.2.1 RPMs. About once a week, one of the nodes crashes itself (self-
> > fencing) and I get a full vmcore on my netdump server. The netdump log
> > file shows the shared filesystem LUN (/dev/dm-6) did not respond within
> > 12000ms. I have not changed the default heartbeat values
> > in /etc/sysconfig/o2cb. No other I/O is in progress when this
> > happens, but these are HP ProLiant servers running the Insight Manager
> > agents.
> >
> > Why would the heartbeat fail roughly once a week? Should I open a
> > bugzilla and upload my netdump log file?
> >
> > Thanks.
> >
> > /Brian/
> >
--
Brian Long
IT Data Center Systems
Cisco Linux Developer
Phone: (919) 392-7363
Cisco Systems