Not sure if it would help, but try kernel 2.4.21<br><br>
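For reference, here is a rough sketch of how the O2CB_HEARTBEAT_THRESHOLD value discussed in the quoted thread maps to a timeout. It assumes the relationship given in the OCFS2 FAQ, timeout in seconds = (threshold - 1) * 2, which is consistent with the 12000 ms in the log for what appears to be a threshold of 7. The function names are made up for illustration and are not part of any OCFS2 tool.<br>

```python
# Sketch only: assumes timeout_seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2,
# i.e. one heartbeat iteration every 2 seconds (the trace shows ~2000 ms msleep).
# Function and constant names here are hypothetical.
HEARTBEAT_INTERVAL_SECONDS = 2


def heartbeat_timeout_seconds(threshold: int) -> int:
    """Return the fencing timeout implied by a given threshold value."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def threshold_for_timeout(timeout_seconds: int) -> int:
    """Return the smallest threshold that gives at least timeout_seconds."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    # Ceiling division: round up to a whole number of heartbeat intervals.
    intervals = -(-timeout_seconds // HEARTBEAT_INTERVAL_SECONDS)
    return intervals + 1


if __name__ == "__main__":
    for value in (7, 14):
        print(f"O2CB_HEARTBEAT_THRESHOLD = {value} -> "
              f"{heartbeat_timeout_seconds(value)} s")
```

Under that assumption, the suggested value of 14 would give 26 seconds rather than exactly double the 12 seconds.<br><br>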

<div><span class="gmail_quote">On 6/10/06, <b class="gmail_sendername"><a href="mailto:ocfs2-users-request@oss.oracle.com">ocfs2-users-request@oss.oracle.com</a></b> &lt;<a href="mailto:ocfs2-users-request@oss.oracle.com">

ocfs2-users-request@oss.oracle.com</a>&gt; wrote:</span>

<blockquote class="gmail_quote" style="PADDING-LEFT: 1ex; MARGIN: 0px 0px 0px 0.8ex; BORDER-LEFT: #ccc 1px solid">Send Ocfs2-users mailing list submissions to<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com

</a><br><br>To subscribe or unsubscribe via the World Wide Web, visit<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="http://oss.oracle.com/mailman/listinfo/ocfs2-users">http://oss.oracle.com/mailman/listinfo/ocfs2-users</a><br>or, via email, send a message with subject or body 'help' to

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="mailto:ocfs2-users-request@oss.oracle.com">ocfs2-users-request@oss.oracle.com</a><br><br>You can reach the person managing the list at<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <a href="mailto:ocfs2-users-owner@oss.oracle.com">ocfs2-users-owner@oss.oracle.com

</a><br><br>When replying, please edit your Subject line so it is more specific<br>than &quot;Re: Contents of Ocfs2-users digest...&quot;<br><br><br>Today's Topics:<br><br>&nbsp;&nbsp;1. RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)

<br>&nbsp;&nbsp;2. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)<br>&nbsp;&nbsp;3. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)<br>&nbsp;&nbsp;4. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)<br><br><br>----------------------------------------------------------------------

<br><br>Message: 1<br>Date: Fri, 09 Jun 2006 13:38:58 -0400<br>From: Brian Long &lt;<a href="mailto:brilong@cisco.com">brilong@cisco.com</a>&gt;<br>Subject: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: <a href="mailto:ocfs2-users@oss.oracle.com">

ocfs2-users@oss.oracle.com</a><br>Message-ID: &lt;1149874738.4142.17.camel@brilong-lnx&gt;<br>Content-Type: text/plain<br><br>Hello,<br><br>I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2<br>1.2.1 RPMs.&nbsp;&nbsp;About once a week, one of the nodes crashes itself (self-

<br>fencing) and I get a full vmcore on my netdump server.&nbsp;&nbsp;The netdump log<br>file shows the shared filesystem LUN (/dev/dm-6) did not respond within<br>12000ms.&nbsp;&nbsp;I have not changed the default heartbeat values<br>in /etc/sysconfig/o2cb.&nbsp;&nbsp;There was no other IO ongoing when this

<br>happens, but they are HP Proliant servers running the Insight Manager<br>agents.<br><br>Why would the heartbeat fail roughly once a week?&nbsp;&nbsp;Should I open a<br>bugzilla and upload my netdump log file?<br><br>Thanks.<br>

<br>/Brian/<br>--<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Brian Long&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IT Data Center Systems&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .|||.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .|||.<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Cisco Linux Developer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp; ..:|||||||:...:|||||||:..<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Phone: (919) 392-7363&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp; C i s c o&nbsp;&nbsp; S y s t e m s<br><br><br><br><br>------------------------------<br><br>Message: 2<br>Date: Fri, 09 Jun 2006 10:49:48 -0700<br>From: Sunil Mushran &lt;<a href="mailto:Sunil.Mushran@oracle.com">

Sunil.Mushran@oracle.com</a>&gt;<br>Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: Brian Long &lt;<a href="mailto:brilong@cisco.com">brilong@cisco.com</a>&gt;<br>Cc: <a href="mailto:ocfs2-users@oss.oracle.com">

ocfs2-users@oss.oracle.com</a><br>Message-ID: &lt;<a href="mailto:4489B4BC.50309@oracle.com">4489B4BC.50309@oracle.com</a>&gt;<br>Content-Type: text/plain; charset=ISO-8859-1; format=flowed<br><br>The hb failure is just the effect of the ios not completing within 12 secs.

<br>The full oops trace gives the last 24 ops and their timings.<br><br>One solution is to double up the hb timeout. Set,<br>O2CB_HEARTBEAT_THRESHOLD = 14<br><br>Brian Long wrote:<br>&gt; Hello,<br>&gt;<br>&gt; I have two nodes running the 

2.6.9-22.0.2.ELsmp kernel and the OCFS2<br>&gt; 1.2.1 RPMs.&nbsp;&nbsp;About once a week, one of the nodes crashes itself (self-<br>&gt; fencing) and I get a full vmcore on my netdump server.&nbsp;&nbsp;The netdump log<br>&gt; file shows the shared filesystem LUN (/dev/dm-6) did not respond within

<br>&gt; 12000ms.&nbsp;&nbsp;I have not changed the default heartbeat values<br>&gt; in /etc/sysconfig/o2cb.&nbsp;&nbsp;There was no other IO ongoing when this<br>&gt; happens, but they are HP Proliant servers running the Insight Manager<br>

&gt; agents.<br>&gt;<br>&gt; Why would the heartbeat fail roughly once a week?&nbsp;&nbsp;Should I open a<br>&gt; bugzilla and upload my netdump log file?<br>&gt;<br>&gt; Thanks.<br>&gt;<br>&gt; /Brian/<br>&gt;<br><br><br><br>------------------------------

<br><br>Message: 3<br>Date: Fri, 09 Jun 2006 15:30:05 -0400<br>From: Brian Long &lt;<a href="mailto:brilong@cisco.com">brilong@cisco.com</a>&gt;<br>Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: Sunil Mushran &lt;

<a href="mailto:Sunil.Mushran@oracle.com">Sunil.Mushran@oracle.com</a>&gt;<br>Cc: <a href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a><br>Message-ID: &lt;1149881406.4142.27.camel@brilong-lnx&gt;<br>Content-Type: text/plain

<br><br>Understood, but how do I determine why once a week I'm failing the 12<br>second heartbeat?&nbsp;&nbsp;Before I bump the HB, shouldn't I figure out why dm-6<br>is gone for 12 seconds?&nbsp;&nbsp;The last 24 ops are as follows:<br><br>

(7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device<br>dm-6 after 12000 milliseconds<br>Heartbeat thread (7) printing last 24 blocking operations (cur = 3):<br>Heartbeat thread stuck at waiting for read completion, stuffing current

<br>time into that blocker (index 3)<br>Index 4: took 0 ms to do submit_bio for read<br>Index 5: took 0 ms to do waiting for read completion<br>Index 6: took 0 ms to do bio alloc write<br>Index 7: took 0 ms to do bio add page write

<br>Index 8: took 0 ms to do submit_bio for write<br>Index 9: took 0 ms to do checking slots<br>Index 10: took 0 ms to do waiting for write completion<br>Index 11: took 1998 ms to do msleep<br>Index 12: took 0 ms to do allocating bios for read

<br>Index 13: took 0 ms to do bio alloc read<br>Index 14: took 0 ms to do bio add page read<br>Index 15: took 0 ms to do submit_bio for read<br>Index 16: took 0 ms to do waiting for read completion<br>Index 17: took 0 ms to do bio alloc write

<br>Index 18: took 0 ms to do bio add page write<br>Index 19: took 0 ms to do submit_bio for write<br>Index 20: took 0 ms to do checking slots<br>Index 21: took 0 ms to do waiting for write completion<br>Index 22: took 1999 ms to do msleep

<br>Index 23: took 0 ms to do allocating bios for read<br>Index 0: took 0 ms to do bio alloc read<br>Index 1: took 0 ms to do bio add page read<br>Index 2: took 0 ms to do submit_bio for read<br>Index 3: took 9998 ms to do waiting for read completion

<br>(7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active<br>regions.<br>Kernel panic - not syncing: ocfs2 is very sorry to be fencing this<br>system by panicing<br><br>/Brian/<br><br>On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:

<br>&gt; The hb failure is just the effect of the ios not completing within 12 secs.<br>&gt; The full oops trace gives the last 24 ops and their timings.<br>&gt;<br>&gt; One solution is to double up the hb timeout. Set,<br>

&gt; O2CB_HEARTBEAT_THRESHOLD = 14<br>&gt;<br>&gt; Brian Long wrote:<br>&gt; &gt; Hello,<br>&gt; &gt;<br>&gt; &gt; I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2<br>&gt; &gt; 1.2.1 RPMs.&nbsp;&nbsp;About once a week, one of the nodes crashes itself (self-

<br>&gt; &gt; fencing) and I get a full vmcore on my netdump server.&nbsp;&nbsp;The netdump log<br>&gt; &gt; file shows the shared filesystem LUN (/dev/dm-6) did not respond within<br>&gt; &gt; 12000ms.&nbsp;&nbsp;I have not changed the default heartbeat values

<br>&gt; &gt; in /etc/sysconfig/o2cb.&nbsp;&nbsp;There was no other IO ongoing when this<br>&gt; &gt; happens, but they are HP Proliant servers running the Insight Manager<br>&gt; &gt; agents.<br>&gt; &gt;<br>&gt; &gt; Why would the heartbeat fail roughly once a week?&nbsp;&nbsp;Should I open a

<br>&gt; &gt; bugzilla and upload my netdump log file?<br>&gt; &gt;<br>&gt; &gt; Thanks.<br>&gt; &gt;<br>&gt; &gt; /Brian/<br>&gt; &gt;<br>--<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Brian Long&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IT Data Center Systems&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .|||.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; .|||.

<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Cisco Linux Developer&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp; ..:|||||||:...:|||||||:..<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Phone: (919) 392-7363&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; |&nbsp;&nbsp; C i s c o&nbsp;&nbsp; S y s t e m s<br><br><br><br><br>------------------------------<br><br>Message: 4<br>Date: Fri, 09 Jun 2006 13:00:48 -0700

<br>From: Sunil Mushran &lt;<a href="mailto:Sunil.Mushran@oracle.com">Sunil.Mushran@oracle.com</a>&gt;<br>Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?<br>To: Brian Long &lt;<a href="mailto:brilong@cisco.com">

brilong@cisco.com</a>&gt;<br>Cc: <a href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a><br>Message-ID: &lt;<a href="mailto:4489D370.4050103@oracle.com">4489D370.4050103@oracle.com</a>&gt;<br>Content-Type: text/plain; charset=ISO-8859-1; format=flowed

<br><br>This dump is very much like the one we used to see with the<br>cfq io scheduler. The very last io op would consume all the time.<br>I am assuming that you are running with the DEADLINE io sched.<br><br>Are there any other common factors in all the crashes? For example, does it<br>happen on one node? Or around the same time? How do you know there is no<br>other io happening at that time? What about cron jobs? Also, is the shared<br>disk connected to some other nodes which could be the cause of the io spike?

<br><br>Brian Long wrote:<br>&gt; Understood, but how do I determine why once a week I'm failing the 12<br>&gt; second heartbeat?&nbsp;&nbsp;Before I bump the HB, shouldn't I figure out why dm-6<br>&gt; is gone for 12 seconds?&nbsp;&nbsp;The last 24 ops are as follows:

<br>&gt;<br>&gt; (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device<br>&gt; dm-6 after 12000 milliseconds<br>&gt; Heartbeat thread (7) printing last 24 blocking operations (cur = 3):<br>&gt; Heartbeat thread stuck at waiting for read completion, stuffing current

<br>&gt; time into that blocker (index 3)<br>&gt; Index 4: took 0 ms to do submit_bio for read<br>&gt; Index 5: took 0 ms to do waiting for read completion<br>&gt; Index 6: took 0 ms to do bio alloc write<br>&gt; Index 7: took 0 ms to do bio add page write

<br>&gt; Index 8: took 0 ms to do submit_bio for write<br>&gt; Index 9: took 0 ms to do checking slots<br>&gt; Index 10: took 0 ms to do waiting for write completion<br>&gt; Index 11: took 1998 ms to do msleep<br>&gt; Index 12: took 0 ms to do allocating bios for read

<br>&gt; Index 13: took 0 ms to do bio alloc read<br>&gt; Index 14: took 0 ms to do bio add page read<br>&gt; Index 15: took 0 ms to do submit_bio for read<br>&gt; Index 16: took 0 ms to do waiting for read completion<br>

&gt; Index 17: took 0 ms to do bio alloc write<br>&gt; Index 18: took 0 ms to do bio add page write<br>&gt; Index 19: took 0 ms to do submit_bio for write<br>&gt; Index 20: took 0 ms to do checking slots<br>&gt; Index 21: took 0 ms to do waiting for write completion

<br>&gt; Index 22: took 1999 ms to do msleep<br>&gt; Index 23: took 0 ms to do allocating bios for read<br>&gt; Index 0: took 0 ms to do bio alloc read<br>&gt; Index 1: took 0 ms to do bio add page read<br>&gt; Index 2: took 0 ms to do submit_bio for read

<br>&gt; Index 3: took 9998 ms to do waiting for read completion<br>&gt; (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active<br>&gt; regions.<br>&gt; Kernel panic - not syncing: ocfs2 is very sorry to be fencing this

<br>&gt; system by panicing<br>&gt;<br>&gt; /Brian/<br>&gt;<br>&gt; On Fri, 2006-06-09 at 10:49 -0700, Sunil Mushran wrote:<br>&gt;<br>&gt;&gt; The hb failure is just the effect of the ios not completing within 12 secs.<br>

&gt;&gt; The full oops trace gives the last 24 ops and their timings.<br>&gt;&gt;<br>&gt;&gt; One solution is to double up the hb timeout. Set,<br>&gt;&gt; O2CB_HEARTBEAT_THRESHOLD = 14<br>&gt;&gt;<br>&gt;&gt; Brian Long wrote:

<br>&gt;&gt;<br>&gt;&gt;&gt; Hello,<br>&gt;&gt;&gt;<br>&gt;&gt;&gt; I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2<br>&gt;&gt;&gt; 1.2.1 RPMs.&nbsp;&nbsp;About once a week, one of the nodes crashes itself (self-

<br>&gt;&gt;&gt; fencing) and I get a full vmcore on my netdump server.&nbsp;&nbsp;The netdump log<br>&gt;&gt;&gt; file shows the shared filesystem LUN (/dev/dm-6) did not respond within<br>&gt;&gt;&gt; 12000ms.&nbsp;&nbsp;I have not changed the default heartbeat values

<br>&gt;&gt;&gt; in /etc/sysconfig/o2cb.&nbsp;&nbsp;There was no other IO ongoing when this<br>&gt;&gt;&gt; happens, but they are HP Proliant servers running the Insight Manager<br>&gt;&gt;&gt; agents.<br>&gt;&gt;&gt;<br>&gt;&gt;&gt; Why would the heartbeat fail roughly once a week?&nbsp;&nbsp;Should I open a

<br>&gt;&gt;&gt; bugzilla and upload my netdump log file?<br>&gt;&gt;&gt;<br>&gt;&gt;&gt; Thanks.<br>&gt;&gt;&gt;<br>&gt;&gt;&gt; /Brian/<br>&gt;&gt;&gt;<br>&gt;&gt;&gt;<br><br><br><br>------------------------------<br><br>

End of Ocfs2-users Digest, Vol 30, Issue 7<br>******************************************<br></blockquote></div><br>
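The "last 24 blocking operations" trace quoted above can be scanned by hand, but a small script makes it easier to compare several netdump logs. The following is a minimal sketch, assuming the lines keep the "Index N: took X ms to do Y" shape seen in the quoted log. The function names are illustrative only, and the sample lines are copied from the trace in the thread.<br>

```python
import re

# Matches lines such as "Index 3: took 9998 ms to do waiting for read completion".
# Assumes the format seen in the quoted netdump log; quoted ("> ") prefixes are fine
# because the pattern is searched for anywhere in the line.
LINE_RE = re.compile(r"Index (\d+): took (\d+) ms to do (.+)")


def parse_blockers(log_text):
    """Return a list of (index, milliseconds, operation) tuples."""
    results = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            results.append(
                (int(match.group(1)), int(match.group(2)), match.group(3).strip())
            )
    return results


def slowest_non_sleep(ops):
    """Return the slowest operation, ignoring the deliberate msleep steps."""
    candidates = [op for op in ops if op[2] != "msleep"]
    return max(candidates, key=lambda op: op[1]) if candidates else None


SAMPLE = """\
Index 11: took 1998 ms to do msleep
Index 2: took 0 ms to do submit_bio for read
Index 3: took 9998 ms to do waiting for read completion
"""

if __name__ == "__main__":
    index, millis, operation = slowest_non_sleep(parse_blockers(SAMPLE))
    print(f"slowest: index {index}, {millis} ms, {operation}")
```

Running the same check over each week's log would show whether the stall is always in "waiting for read completion", which is the pattern Sunil describes.<br>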