[Ocfs2-users] OCFS2 DLM problems

Ulf Zimmermann ulf at atc-onlane.com
Wed Jan 23 13:35:50 PST 2008


It looks like around 3:20am we had about 800 to 1,200 packets per second
coming in per node. But the packets were not large; the traffic looks like
less than 1 Mbit/sec. Four of the nodes are connected to our front-end
application servers and would be pretty much idle at 3am. Our first
customers usually do not log in until just about then (East Coast people
starting to get to the dealerships), and only in small numbers. We did
not have much batch processing happening on the 5th and 6th nodes.
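
As a rough check on these figures, the sketch below works out the average
packet size they imply. It assumes 1 Mbit/sec means 1,000,000 bits/sec and
treats it as an upper bound (the traffic was only described as "less than"
that); the function name is made up for illustration.

```python
def max_avg_packet_bytes(bits_per_second, packets_per_second):
    """Upper bound on the average packet size, in bytes."""
    return bits_per_second / 8 / packets_per_second

# Assumption: 1 Mbit/sec = 1,000,000 bits/sec, used as an upper bound.
LINK_BPS = 1_000_000

for pps in (800, 1200):
    size = max_avg_packet_bytes(LINK_BPS, pps)
    print(pps, "packets/sec -> at most", round(size, 1), "bytes per packet")
```

So the packets would average somewhere under roughly 100 to 160 bytes each,
which fits small lock messages rather than bulk data.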

We are planning on upgrading to 1.2.5-6 tonight, but people here want to
know more about why this is suddenly happening now.
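
The fix named in the quoted reply below (r3033, "Retry sendpage() if it
returns EAGAIN") is kernel C code that is not shown in this thread. Purely
as a user-space illustration of the general idea, retrying a send that
fails with EAGAIN instead of treating it as a lost connection, here is a
minimal sketch. The helper name, retry count, and delay are hypothetical
and are not taken from OCFS2.

```python
import errno
import time

def send_with_retry(send_fn, payload, max_retries=5, delay=0.01):
    """Call send_fn(payload), retrying when it fails with EAGAIN.

    send_fn is any callable that raises OSError on failure, for example
    socket.send on a non-blocking socket. Errors other than EAGAIN, and
    EAGAIN on the final attempt, are re-raised to the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return send_fn(payload)
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == max_retries:
                raise
            time.sleep(delay)  # brief pause before trying again


class FlakySender:
    """Hypothetical stand-in for a socket: fails with EAGAIN a few times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self, payload):
        self.calls += 1
        if self.calls <= self.failures:
            raise BlockingIOError(errno.EAGAIN, "Resource temporarily unavailable")
        return len(payload)


sender = FlakySender(failures=2)
sent = send_with_retry(sender, b"x" * 24)
print("sent", sent, "bytes after", sender.calls, "calls")
```

Without the retry, the first EAGAIN would surface as a hard failure, which
matches the "failed with -11" message followed by the disconnect.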

> -----Original Message-----
> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> Sent: Wednesday, January 23, 2008 1:07 PM
> To: Ulf Zimmermann
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] OCFS2 DLM problems
> 
> Depends on the net traffic I guess. The error returned asks the user
> to retry and the older code wasn't retrying. AFAIR, we have never
> encountered this in our main test cluster.
> 
> Ulf Zimmermann wrote:
> > Currently running 1.2.5-1 so we should upgrade. Is there any explanation
> > how this bug gets triggered? We are trying to understand why we are
> > suddenly hitting this bug, as this has been running for several months
> > without being triggered.
> >
> > -----Original Message-----
> > From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
> > Sent: Wednesday, January 23, 2008 9:58 AM
> > To: Ulf Zimmermann
> > Cc: ocfs2-users at oss.oracle.com
> > Subject: Re: [Ocfs2-users] OCFS2 DLM problems
> >
> > 1.2.5-what?
> >
> > If you are not on 1.2.5-6, upgrade to that. It could be you are hitting
> > the following issue addressed in that release.
> >
> > r3033 tcp - Retry sendpage() if it returns EAGAIN (bugzilla#896)
> >
> > No, don't upgrade to 1.2.7. We just discovered an issue in it and will
> > be releasing 1.2.8 shortly.
> >
> > Ulf Zimmermann wrote:
> >
> >> Hello everyone, once again.
> >>
> >> We are running into a problem which has now shown up 2 times, possibly 3
> >> (once the systems looked different).
> >>
> >> The environment is 6 HP DL360/380 g5 servers with eth0 being the public
> >> interface, eth1 and bond0 (eth2 and eth3) used for clusterware, and bond0
> >> also used for OCFS2. The bond0 interface is in active/passive mode.
> >> There are no network error counters showing, and even during the problem
> >> we can communicate via the bond0 interface. This setup has been running
> >> for more than 2 months, but last Wednesday morning and again today we
> >> had 2 nodes causing locking problems. The problem starts with messages
> >> like this:
> >>
> >> Jan 23 03:20:44 dbprd01 kernel: o2net: no longer connected to node dbprd02 (num 1) at 192.168.202.2:7777
> >> Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107
> >> Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107
> >> Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 ERROR: status = -107
> >> Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_flush_asts:600 ERROR: status = -107
> >>
> >> Jan 23 03:20:44 dbprd02 kernel: (5096,0):o2net_sendpage:868 ERROR: sendpage of size 24 to node dbprd01 (num 0) at 192.168.202.1:7777 failed with -11
> >> Jan 23 03:20:44 dbprd02 kernel: o2net: no longer connected to node dbprd01 (num 0) at 192.168.202.1:7777
> >>
> >> After these there are plenty more messages, such as
> >> "dlm_wait_for_node_death" and "dlm_send_remote_convert_request" on dbprd02
> >> and "dlm_send_proxy_ast_msg" and "dlm_flush_asts" on dbprd01.
> >>
> >> We are currently running OCFS2 1.2.5, the kernel is EL4 Update 5 x86_64
> >> (2.6.9-55.ELsmp).
> >>
> >> I see there is one bug fixed in 1.2.6/1.2.7 related to DLM and I was
> >> wondering if the above problem could be related to it or if this is
> >> something different.
> >>
> >>
> >> Ulf Zimmermann | Senior System Architect
> >>
> >> ATC-Onlane, Inc.
> >> 4600 Bohannon Drive, Suite 100
> >> Menlo Park, CA 94025
> >>
> >> O: 650-532-6382  M: (510) 396-1764  F: (510) 580-0929
> >>
> >> Email: ulf at atc-onlane.com | Web: www.atc-onlane.com
> >>
> >>
> >> _______________________________________________
> >> Ocfs2-users mailing list
> >> Ocfs2-users at oss.oracle.com
> >> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>
> >>
> >
> >
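
The numeric codes in the kernel messages quoted above are negated Linux
errno values. The sketch below pulls them out of such log lines and names
them. The table is hard-coded with the Linux values for the two codes seen
in this thread (11 is EAGAIN, 107 is ENOTCONN) rather than read from the
local platform, since errno numbers differ between operating systems. The
function name and the regular expression are assumptions made for this
example.

```python
import re

# Linux errno values for the two codes that appear in this thread.
LINUX_ERRNO = {
    11: ("EAGAIN", "Resource temporarily unavailable"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches "status = -107" and "failed with -11" as seen in the quoted logs.
STATUS_RE = re.compile(r"(?:status = |failed with )-(\d+)")

def decode_status(line):
    """Return (code, name, description) for a log line, or None if no code."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
    return code, name, text

lines = [
    "Jan 23 03:20:46 dbprd01 kernel: (5172,0):dlm_send_proxy_ast_msg:459 "
    "ERROR: status = -107",
    "Jan 23 03:20:44 dbprd02 kernel: (5096,0):o2net_sendpage:868 ERROR: "
    "sendpage of size 24 to node dbprd01 (num 0) at 192.168.202.1:7777 "
    "failed with -11",
]
for line in lines:
    print(decode_status(line))
```

Read this way, dbprd02 hit EAGAIN on a 24-byte send and dropped the
connection, and dbprd01 then saw ENOTCONN on its DLM messages.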


