[Ocfs2-users] Ocfs2-users Digest, Vol 30, Issue 7

Sun Jun 11 11:10:38 CDT 2006

Not sure if it would help, but try kernel 2.4.21.

On 6/10/06, ocfs2-users-request at oss.oracle.com <
ocfs2-users-request at oss.oracle.com> wrote:
>
> Today's Topics:
>
>   1. RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
>   2. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)
>   3. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Brian Long)
>   4. Re: RHEL 4 U2 / OCFS 1.2.1 weekly crash? (Sunil Mushran)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 09 Jun 2006 13:38:58 -0400
> From: Brian Long <brilong at cisco.com>
> Subject: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: ocfs2-users at oss.oracle.com
> Message-ID: <1149874738.4142.17.camel at brilong-lnx>
> Content-Type: text/plain
>
> Hello,
>
> I have two nodes running the 2.6.9-22.0.2.ELsmp kernel and the OCFS2
> 1.2.1 RPMs.  About once a week, one of the nodes crashes itself (self-
> fencing) and I get a full vmcore on my netdump server.  The netdump log
> file shows the shared filesystem LUN (/dev/dm-6) did not respond within
> 12000ms.  I have not changed the default heartbeat values
> in /etc/sysconfig/o2cb.  There is no other IO ongoing when this
> happens, but they are HP Proliant servers running the Insight Manager
> agents.
>
> Why would the heartbeat fail roughly once a week?  Should I open a
> bugzilla and upload my netdump log file?
>
> Thanks.
>
> /Brian/
> --
>       Brian Long                      |         |           |
>       IT Data Center Systems          |       .|||.       .|||.
>       Cisco Linux Developer           |   ..:|||||||:...:|||||||:..
>       Phone: (919) 392-7363           |   C i s c o   S y s t e m s
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 09 Jun 2006 10:49:48 -0700
> From: Sunil Mushran <Sunil.Mushran at oracle.com>
> Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: Brian Long <brilong at cisco.com>
> Cc: ocfs2-users at oss.oracle.com
> Message-ID: <4489B4BC.50309 at oracle.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> The hb failure is just the effect of the ios not completing within 12
> secs.
> The full oops trace gives the last 24 ops and their timings.
>
> One solution is to double the hb timeout. In /etc/sysconfig/o2cb, set:
> O2CB_HEARTBEAT_THRESHOLD=14
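
The relationship between the threshold and the timeout can be sketched as below. This is a minimal illustration that assumes the commonly documented O2CB rule, timeout = (threshold - 1) * 2 seconds; the function names are hypothetical and not part of any OCFS2 tool. Under that rule a threshold of 7 reproduces the 12000 ms reported in this thread, and 14 gives 26 seconds, slightly more than double.

```python
def heartbeat_timeout_ms(threshold: int, interval_ms: int = 2000) -> int:
    """Return the disk heartbeat timeout implied by O2CB_HEARTBEAT_THRESHOLD.

    Assumption: the node fences after (threshold - 1) heartbeat
    intervals of 2 seconds each.
    """
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * interval_ms


def threshold_for_timeout(timeout_s: int, interval_s: int = 2) -> int:
    """Inverse of the rule above: threshold needed for a timeout in seconds."""
    return timeout_s // interval_s + 1


print(heartbeat_timeout_ms(7))    # 12000
print(heartbeat_timeout_ms(14))   # 26000
print(threshold_for_timeout(12))  # 7
```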
>
>
>
>
> ------------------------------
>
> Message: 3
> Date: Fri, 09 Jun 2006 15:30:05 -0400
> From: Brian Long <brilong at cisco.com>
> Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: Sunil Mushran <Sunil.Mushran at oracle.com>
> Cc: ocfs2-users at oss.oracle.com
> Message-ID: <1149881406.4142.27.camel at brilong-lnx>
> Content-Type: text/plain
>
> Understood, but how do I determine why once a week I'm failing the 12
> second heartbeat?  Before I bump the HB, shouldn't I figure out why dm-6
> is gone for 12 seconds?  The last 24 ops are as follows:
>
> (7,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device
> dm-6 after 12000 milliseconds
> Heartbeat thread (7) printing last 24 blocking operations (cur = 3):
> Heartbeat thread stuck at waiting for read completion, stuffing current
> time into that blocker (index 3)
> Index 4: took 0 ms to do submit_bio for read
> Index 5: took 0 ms to do waiting for read completion
> Index 6: took 0 ms to do bio alloc write
> Index 7: took 0 ms to do bio add page write
> Index 8: took 0 ms to do submit_bio for write
> Index 9: took 0 ms to do checking slots
> Index 10: took 0 ms to do waiting for write completion
> Index 11: took 1998 ms to do msleep
> Index 12: took 0 ms to do allocating bios for read
> Index 13: took 0 ms to do bio alloc read
> Index 14: took 0 ms to do bio add page read
> Index 15: took 0 ms to do submit_bio for read
> Index 16: took 0 ms to do waiting for read completion
> Index 17: took 0 ms to do bio alloc write
> Index 18: took 0 ms to do bio add page write
> Index 19: took 0 ms to do submit_bio for write
> Index 20: took 0 ms to do checking slots
> Index 21: took 0 ms to do waiting for write completion
> Index 22: took 1999 ms to do msleep
> Index 23: took 0 ms to do allocating bios for read
> Index 0: took 0 ms to do bio alloc read
> Index 1: took 0 ms to do bio add page read
> Index 2: took 0 ms to do submit_bio for read
> Index 3: took 9998 ms to do waiting for read completion
> (7,1):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active
> regions.
> Kernel panic - not syncing: ocfs2 is very sorry to be fencing this
> system by panicing
>
> /Brian/
>
>
>
>
>
> ------------------------------
>
> Message: 4
> Date: Fri, 09 Jun 2006 13:00:48 -0700
> From: Sunil Mushran <Sunil.Mushran at oracle.com>
> Subject: Re: [Ocfs2-users] RHEL 4 U2 / OCFS 1.2.1 weekly crash?
> To: Brian Long <brilong at cisco.com>
> Cc: ocfs2-users at oss.oracle.com
> Message-ID: <4489D370.4050103 at oracle.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> This dump is very much like the one we used to see with the
> cfq io scheduler. The very last io op would consume all the time.
> I am assuming that you are running with the DEADLINE io sched.
>
> Are there any other common factors in all the crashes? For example, does
> it happen on one node only, or around the same time? How do you know
> there is no other io happening at that time? What about cron jobs?
>
> Also, is the shared disk connected to some other nodes which
> could be the cause of the io spike?
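
One way to confirm which io scheduler is in use is sketched below. It rests on assumptions: that the scheduler was chosen with the `elevator=` boot parameter (visible in /proc/cmdline), or that the kernel exposes /sys/block/<dev>/queue/scheduler with the active scheduler in square brackets. The sysfs file may not exist on the 2.6.9 kernel discussed here. The helper names and the sample strings are illustrative, not taken from the thread.

```python
import re


def elevator_from_cmdline(cmdline: str):
    """Return the value of elevator= from a kernel command line, or None."""
    m = re.search(r"(?:^|\s)elevator=(\S+)", cmdline)
    return m.group(1) if m else None


def active_scheduler(sysfs_text: str):
    """Return the bracketed (active) scheduler from a sysfs scheduler line."""
    m = re.search(r"\[([\w-]+)\]", sysfs_text)
    return m.group(1) if m else None


# Illustrative inputs; on a real system read /proc/cmdline or the sysfs file.
print(elevator_from_cmdline("ro root=LABEL=/ elevator=deadline"))  # deadline
print(active_scheduler("noop anticipatory deadline [cfq]"))        # cfq
```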
>
>
>
>