Not sure if it would help, but try kernel 2.4.21<br><br>
<div><span class="gmail_quote">On 6/10/06, <b class="gmail_sendername"><a href="mailto:ocfs2-users-request@oss.oracle.com">ocfs2-users-request@oss.oracle.com</a></b> <<a href="mailto:ocfs2-users-request@oss.oracle.com">
ocfs2-users-request@oss.oracle.com</a>> wrote:</span>
<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">Send Ocfs2-users mailing list submissions to<br> <a href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com
</a><br><br>To subscribe or unsubscribe via the World Wide Web, visit<br> <a href="http://oss.oracle.com/mailman/listinfo/ocfs2-users">http://oss.oracle.com/mailman/listinfo/ocfs2-users</a><br>or, via email, send a message with subject or body 'help' to
<br> <a href="mailto:ocfs2-users-request@oss.oracle.com">ocfs2-users-request@oss.oracle.com</a><br><br>You can reach the person managing the list at<br> <a href="mailto:ocfs2-users-owner@oss.oracle.com">ocfs2-users-owner@oss.oracle.com
</a><br><br>When replying, please edit your Subject line so it is more specific<br>than "Re: Contents of Ocfs2-users digest..."<br><br><br>Today's Topics:<br><br> 1. RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
<br> 2. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)<br> 3. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)<br> 4. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)<br><br><br>----------------------------------------------------------------------
<br><br>Message: 1<br>Date: Fri, 09 Jun 2006 13:38:58 -0400<br>From: Brian Long <<a href="mailto:brilong@cisco.com">brilong@cisco.com</a>><br>Subject: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: <a href="mailto:ocfs2-users@oss.oracle.com">
ocfs2-users@oss.oracle.com</a><br>Message-ID: <1149874738.4142.17.camel@brilong-lnx><br>Content-Type: text/plain<br><br>Hello,<br><br>I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2<br>1.2.1 RPMs. About once a week, one of the nodes crashes itself (self-
<br>fencing) and I get a full vmcore on my netdump server. The netdump log<br>file shows the shared filesystem LUN (/dev/dm-6) did not respond within<br>12000ms. I have not changed the default heartbeat values<br>in /etc/sysconfig/o2cb. There was no other IO ongoing when this
<br>happens, but they are HP Proliant servers running the Insight Manager<br>agents.<br><br>Why would the heartbeat fail roughly once a week? Should I open a<br>bugzilla and upload my netdump log file?<br><br>Thanks.<br>
<br>/Brian/<br>--<br> Brian Long | | |<br> IT Data Center Systems | .|||. .|||.<br> Cisco Linux Developer | ..:|||||||:...:|||||||:..<br>
Phone: (919) 392-7363 | C i s c o S y s t e m s<br><br><br><br><br>------------------------------<br><br>Message: 2<br>Date: Fri, 09 Jun 2006 10:49:48 -0700<br>From: Sunil Mushran <<a href="mailto:Sunil.Mushran@oracle.com">
Sunil.Mushran@oracle.com</a>><br>Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: Brian Long <<a href="mailto:brilong@cisco.com">brilong@cisco.com</a>><br>Cc: <a href="mailto:ocfs2-users@oss.oracle.com">
ocfs2-users@oss.oracle.com</a><br>Message-ID: <<a href="mailto:4489B4BC.50309@oracle.com">4489B4BC.50309@oracle.com</a>><br>Content-Type: text/plain; charset=ISO-8859-1; format=flowed<br><br>The heartbeat failure is just the effect of the ios not completing within 12 secs.
<br>The full oops trace gives the last 24 ops and their timings.<br><br>One solution is to double the heartbeat timeout. Set:<br>O2CB_HEARTBEAT_THRESHOLD = 14<br><br>Brian Long wrote:<br>> Hello,<br>><br>> I have two nodes running the
2.6.9-22.0.2.ELsmp kernel and the OCFS2<br>> 1.2.1 RPMs. About once a week, one of the nodes crashes itself (self-<br>> fencing) and I get a full vmcore on my netdump server. The netdump log<br>> file shows the shared filesystem LUN (/dev/dm-6) did not respond within
<br>> 12000ms. I have not changed the default heartbeat values<br>> in /etc/sysconfig/o2cb. There was no other IO ongoing when this<br>> happens, but they are HP Proliant servers running the Insight Manager<br>
> agents.<br>><br>> Why would the heartbeat fail roughly once a week? Should I open a<br>> bugzilla and upload my netdump log file?<br>><br>> Thanks.<br>><br>> /Brian/<br>><br><br><br><br>------------------------------
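A note on the suggestion above: in OCFS2 1.2 the disk-heartbeat timeout works out to roughly (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds, since the o2hb thread sleeps about 2 seconds per iteration (visible as the ~2000 ms msleep entries in the trace below). A minimal sketch of that arithmetic, assuming the (threshold - 1) * 2 s relationship:

```shell
# Sketch of the o2cb heartbeat arithmetic, assuming the o2hb thread
# iterates every ~2 s: timeout_ms = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2000.
hb_timeout_ms() {
  threshold=$1
  echo $(( (threshold - 1) * 2000 ))
}

hb_timeout_ms 7    # prints 12000 -- the 1.2 default, matching the 12000ms in the log
hb_timeout_ms 14   # prints 26000 -- the suggested value, roughly double
```

The value is set in /etc/sysconfig/o2cb and, as far as I recall, should match on every node in the cluster after restarting the o2cb service.<br><br>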
<br><br>Message: 3<br>Date: Fri, 09 Jun 2006 15:30:05 -0400<br>From: Brian Long <<a href="mailto:brilong@cisco.com">brilong@cisco.com</a>><br>Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: Sunil Mushran <
<a href="mailto:Sunil.Mushran@oracle.com">Sunil.Mushran@oracle.com</a>><br>Cc: <a href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a><br>Message-ID: <1149881406.4142.27.camel@brilong-lnx><br>Content-Type: text/plain
<br><br>Understood, but how do I determine why, once a week, I'm failing the<br>12-second heartbeat? Before I bump the heartbeat threshold, shouldn't I figure out why dm-6<br>is gone for 12 seconds? The last 24 ops are as follows:<br><br>
(7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device<br>dm-6 after 12000 milliseconds<br>Heartbeat thread (7) printing last 24 blocking operations (cur = 3):<br>Heartbeat thread stuck at waiting for read completion, stuffing current
<br>time into that blocker (index 3)<br>Index 4: took 0 ms to do submit_bio for read<br>Index 5: took 0 ms to do waiting for read completion<br>Index 6: took 0 ms to do bio alloc write<br>Index 7: took 0 ms to do bio add page write
<br>Index 8: took 0 ms to do submit_bio for write<br>Index 9: took 0 ms to do checking slots<br>Index 10: took 0 ms to do waiting for write completion<br>Index 11: took 1998 ms to do msleep<br>Index 12: took 0 ms to do allocating bios for read
<br>Index 13: took 0 ms to do bio alloc read<br>Index 14: took 0 ms to do bio add page read<br>Index 15: took 0 ms to do submit_bio for read<br>Index 16: took 0 ms to do waiting for read completion<br>Index 17: took 0 ms to do bio alloc write
<br>Index 18: took 0 ms to do bio add page write<br>Index 19: took 0 ms to do submit_bio for write<br>Index 20: took 0 ms to do checking slots<br>Index 21: took 0 ms to do waiting for write completion<br>Index 22: took 1999 ms to do msleep
<br>Index 23: took 0 ms to do allocating bios for read<br>Index 0: took 0 ms to do bio alloc read<br>Index 1: took 0 ms to do bio add page read<br>Index 2: took 0 ms to do submit_bio for read<br>Index 3: took 9998 ms to do waiting for read completion
<br>(7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active<br>regions.<br>Kernel panic - not syncing: ocfs2 is very sorry to be fencing this<br>system by panicing<br><br>/Brian/<br><br>On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:
<br>> The hb failure is just the effect of the ios not completing within 12 secs.<br>> The full oops trace gives the last 24 ops and their timings.<br>><br>> One solution is to double up the hb timeout. Set,<br>
> O2CB_HEARTBEAT_THRESHOLD = 14<br>><br>> Brian Long wrote:<br>> > [original message snipped]<br>--<br> Brian Long | | |<br> IT Data Center Systems | .|||. .|||.
<br> Cisco Linux Developer | ..:|||||||:...:|||||||:..<br> Phone: (919) 392-7363 | C i s c o S y s t e m s<br><br><br><br><br>------------------------------<br><br>Message: 4<br>Date: Fri, 09 Jun 2006 13:00:48 -0700
<br>From: Sunil Mushran <<a href="mailto:Sunil.Mushran@oracle.com">Sunil.Mushran@oracle.com</a>><br>Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: Brian Long <<a href="mailto:brilong@cisco.com">
brilong@cisco.com</a>><br>Cc: <a href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a><br>Message-ID: <<a href="mailto:4489D370.4050103@oracle.com">4489D370.4050103@oracle.com</a>><br>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
<br><br>This dump is very much like the ones we used to see with the<br>cfq io scheduler. The very last io op would consume all the time.<br>I am assuming that you are running with the deadline io scheduler.<br><br>Are there any other common factors in all the crashes? For example, does it happen<br>on only one node, or around the same time? How do you know there is<br>no other io happening at that time? What about cron jobs?<br><br>Also, is the shared disk connected to any other nodes that<br>could be the cause of the io spike?
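To check the io-scheduler question, the active scheduler for each block device shows up as the bracketed entry in sysfs on 2.6 kernels. A small sketch (the active_sched helper is just illustrative, and sda below is an example device name):

```shell
# Extract the active (bracketed) io scheduler from the contents of a
# /sys/block/<dev>/queue/scheduler file, e.g.
# "noop anticipatory deadline [cfq]" -> "cfq".
active_sched() {
  echo "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

active_sched "noop anticipatory deadline [cfq]"   # prints: cfq

# On a live system (as root), something like:
#   cat /sys/block/sda/queue/scheduler
#   echo deadline > /sys/block/sda/queue/scheduler
```
<br><br>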
<br><br>Brian Long wrote:<br>> Understood, but how do I determine why once a week I'm failing the 12<br>> second heartbeat? Before I bump the HB, shouldn't I figure out why dm-6<br>> is gone for 12 seconds? The last 24 ops are as follows:
<br>><br>> (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device<br>> dm-6 after 12000 milliseconds<br>> Heartbeat thread (7) printing last 24 blocking operations (cur = 3):<br>> Heartbeat thread stuck at waiting for read completion, stuffing current
<br>> time into that blocker (index 3)<br>> Index 4: took 0 ms to do submit_bio for read<br>> Index 5: took 0 ms to do waiting for read completion<br>> Index 6: took 0 ms to do bio alloc write<br>> Index 7: took 0 ms to do bio add page write
<br>> Index 8: took 0 ms to do submit_bio for write<br>> Index 9: took 0 ms to do checking slots<br>> Index 10: took 0 ms to do waiting for write completion<br>> Index 11: took 1998 ms to do msleep<br>> Index 12: took 0 ms to do allocating bios for read
<br>> Index 13: took 0 ms to do bio alloc read<br>> Index 14: took 0 ms to do bio add page read<br>> Index 15: took 0 ms to do submit_bio for read<br>> Index 16: took 0 ms to do waiting for read completion<br>
> Index 17: took 0 ms to do bio alloc write<br>> Index 18: took 0 ms to do bio add page write<br>> Index 19: took 0 ms to do submit_bio for write<br>> Index 20: took 0 ms to do checking slots<br>> Index 21: took 0 ms to do waiting for write completion
<br>> Index 22: took 1999 ms to do msleep<br>> Index 23: took 0 ms to do allocating bios for read<br>> Index 0: took 0 ms to do bio alloc read<br>> Index 1: took 0 ms to do bio add page read<br>> Index 2: took 0 ms to do submit_bio for read
<br>> Index 3: took 9998 ms to do waiting for read completion<br>> (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active<br>> regions.<br>> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
<br>> system by panicing<br>><br>> /Brian/<br>><br>> On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:<br>>> [earlier messages snipped]<br><br><br><br>------------------------------<br><br>
_______________________________________________<br>Ocfs2-users mailing list<br><a href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</a><br><a href="http://oss.oracle.com/mailman/listinfo/ocfs2-users">http://oss.oracle.com/mailman/listinfo/ocfs2-users
</a><br><br><br>End of Ocfs2-users Digest, Vol 30, Issue 7<br>******************************************<br></blockquote></div><br>