[Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?

Sunil Mushran Sunil.Mushran at oracle.com
Fri Jun 9 15:00:48 CDT 2006


This dump looks very much like the ones we used to see with the
cfq io scheduler, where the very last io op would consume all the time.
I am assuming that you are running with the deadline io scheduler.
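
If you want to confirm, the scheduler in use shows up in sysfs. Note
that dm-6 is a device-mapper node, so the scheduler that matters is the
one on the underlying physical disks, not on the dm device itself. A
quick sketch for checking it (the sd* glob is an assumption; on your
ProLiants the disks may show up as cciss devices instead):

    import glob

    # The active scheduler is the bracketed entry, e.g.
    # "noop anticipatory deadline [cfq]".  dm devices have no
    # scheduler of their own, so look at the underlying disks.
    for path in glob.glob('/sys/block/sd*/queue/scheduler'):
        with open(path) as f:
            print(path, '->', f.read().strip())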

Are there any other common factors across the crashes?  For example,
does it always happen on the same node, or around the same time?  How
do you know there is no other io happening at that time?  What about
cron jobs?

Also, is the shared disk connected to any other nodes that could be
the source of the io spike?
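
One more note on the threshold suggestion quoted below: o2cb fences
after (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds of missed heartbeats,
which is where the 12000 ms in your log comes from with the default
threshold of 7. A minimal sketch of the arithmetic (the function name
is mine, just for illustration):

    # o2cb waits (threshold - 1) heartbeat iterations of 2 seconds
    # each before fencing; function name is illustrative only.
    def hb_timeout_ms(threshold):
        return (threshold - 1) * 2 * 1000

    print(hb_timeout_ms(7))   # 12000 ms -- the default, as in your log
    print(hb_timeout_ms(14))  # 26000 ms -- the bumped value I suggested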

Brian Long wrote:
> Understood, but how do I determine why, once a week, I'm failing the
> 12-second heartbeat?  Before I bump the heartbeat threshold, shouldn't
> I figure out why dm-6 is unresponsive for 12 seconds?  The last 24 ops
> are as follows:
>
> (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
> dm-6 after 12000 milliseconds
> Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
> Heartbeat thread stuck at waiting for read completion, stuffing current
> time into that blocker (index 3)
> Index 4: took 0 ms to do submit_bio for read
> Index 5: took 0 ms to do waiting for read completion
> Index 6: took 0 ms to do bio alloc write
> Index 7: took 0 ms to do bio add page write
> Index 8: took 0 ms to do submit_bio for write
> Index 9: took 0 ms to do checking slots
> Index 10: took 0 ms to do waiting for write completion
> Index 11: took 1998 ms to do msleep
> Index 12: took 0 ms to do allocating bios for read
> Index 13: took 0 ms to do bio alloc read
> Index 14: took 0 ms to do bio add page read
> Index 15: took 0 ms to do submit_bio for read
> Index 16: took 0 ms to do waiting for read completion
> Index 17: took 0 ms to do bio alloc write
> Index 18: took 0 ms to do bio add page write
> Index 19: took 0 ms to do submit_bio for write
> Index 20: took 0 ms to do checking slots
> Index 21: took 0 ms to do waiting for write completion
> Index 22: took 1999 ms to do msleep
> Index 23: took 0 ms to do allocating bios for read
> Index 0: took 0 ms to do bio alloc read
> Index 1: took 0 ms to do bio add page read
> Index 2: took 0 ms to do submit_bio for read
> Index 3: took 9998 ms to do waiting for read completion
> (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active
> regions.
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> system by panicing
>
> /Brian/
>
> On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
>   
>> The hb failure is just the effect of the ios not completing within 12 secs.
>> The full oops trace gives the last 24 ops and their timings.
>>
>> One solution is to roughly double the hb timeout. Set
>> O2CB_HEARTBEAT_THRESHOLD = 14
>>
>> Brian Long wrote:
>>     
>>> Hello,
>>>
>>> I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
>>> 1.2.1 RPMs.  About once a week, one of the nodes crashes itself
>>> (self-fencing) and I get a full vmcore on my netdump server.  The
>>> netdump log file shows the shared filesystem LUN (/dev/dm-6) did not
>>> respond within 12000 ms.  I have not changed the default heartbeat
>>> values in /etc/sysconfig/o2cb.  There is no other IO ongoing when
>>> this happens, but they are HP ProLiant servers running the Insight
>>> Manager agents.
>>>
>>> Why would the heartbeat fail roughly once a week?  Should I open a
>>> bugzilla and upload my netdump log file?
>>>
>>> Thanks.
>>>
>>> /Brian/


