[Ocfs2-users] problem with 2 host cluster

Andy Phillips Andrew.Phillips at betfair.com
Mon Sep 18 10:04:46 PDT 2006


Alexei

But given that the problem is o2net_idle_timer, that sort of takes the
disk heartbeat out of the equation.
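
A rough way to check this from the numbers already in the thread: the
sketch below compares the disk heartbeat timeout implied by the
O2CB_HEARTBEAT_THRESHOLD=61 setting quoted below with the idle time shown in
the o2net_idle_timer lines. The relation timeout = (threshold - 1) * 2 seconds
is an assumption taken from my recollection of the OCFS2 FAQ, and the helper
name is made up for illustration. I have not checked in the source what the
tmr/dr/func fields mean, so the code only subtracts them from "now".

```python
from decimal import Decimal

# Timestamps copied verbatim from the o2net_idle_timer lines quoted below.
# Decimal avoids float rounding on microsecond-resolution epoch values.
TMR = Decimal("1158527154.993223")        # "tmr" field
NOW = Decimal("1158527164.993090")        # "now" field
DR = Decimal("1158527154.993213")         # "dr" field
FUNC_STOP = Decimal("1158527153.796200")  # second timestamp of the "func" pair


def disk_heartbeat_timeout_seconds(threshold: int, interval_seconds: int = 2) -> int:
    """Hypothetical helper: disk heartbeat timeout for a given threshold.

    ASSUMPTION (not confirmed in this thread): the node is declared dead
    after (threshold - 1) heartbeat iterations of 2 seconds each.
    """
    return (threshold - 1) * interval_seconds


# Elapsed time between each logged timestamp and "now".
since_tmr = NOW - TMR
since_dr = NOW - DR
since_func = NOW - FUNC_STOP

print("disk heartbeat timeout at threshold 61:", disk_heartbeat_timeout_seconds(61), "s")
print("now - tmr :", since_tmr, "s")
print("now - dr  :", since_dr, "s")
print("now - func:", since_func, "s")
```

If that assumption holds, the disk heartbeat would only give up after about
two minutes, while the log shows the network connection being dropped after
roughly 10 seconds of apparent idleness, which is a separate timer.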

Andy


On Mon, 2006-09-18 at 09:57 -0700, Alexei_Roudnev wrote:
> OCFS has two heartbeat thresholds, and only one is configured by this option.
> (I don't remember which of the two: the network or the disk heartbeat.)
> 
> 
> ----- Original Message ----- 
> From: "Mark Maiden" <markm at globoforce.com>
> To: "Andy Phillips" <Andrew.Phillips at betfair.com>
> Cc: <ocfs2-users at oss.oracle.com>
> Sent: Monday, September 18, 2006 4:17 AM
> Subject: Re: [Ocfs2-users] problem with 2 host cluster
> 
> 
> We had a similar issue using SLES 9 and a CX300.
> 
> We upgraded to the latest ocfs version and changed our
> O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file (on both nodes)
> to the following:
> 
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=61
> 
> It seemed to sort the issue out for us, but could be a totally different
> issue! ;-)
> 
> Mark Maiden
> Systems Administrator
> Globoforce, Ltd
>   6 Beckett Way Parkwest
>   Dublin 12
>   Ireland
>   t: +353 1 625 8812
>   f: +353 1 625 8880
>   e: markm at globoforce.com
>    www.globoforce.com
> 
>    http://guidance.gospelcom.net/answer.htm
> 
> 
> Andy Phillips wrote:
> > Hi,
> >
> >    I've got _exactly_ the same problem. I've not had the time to dive
> > through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
> >
> >    For us the problem (same trace as below) was not that repeatable, and
> > was possibly related to the i/o pattern.
> >
> >    What seems to happen is that the underlying "network services" of
> > ocfs2 (o2net) believes that no packets are being sent. The tcp socket is
> > surrounded by wrapper functions, one of which times when the last packet
> > is received. It's this that decides the socket is dead, then closes the
> > socket. Meanwhile, the upper layers (which are actually sending data
> > regularly) find the carpet yanked out from underneath them, and decide
> > to halt the cluster to protect the data.
> >
> >    Highly annoying. I expect it will be some signed 32bit integer
> > wrapping somewhere....
> >
> >    Andy
> >
> >
> > On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
> >> Hi,
> >>
> >> We have 2 Dell 1850s in a cluster, both machines are running Red Hat
> >> Enterprise Linux 4 AS, update 2.
> >>
> >> The boxes are connected to a Dell EMC CX300 using Emulex HBAs.
> >>
> >> The cluster is running an Oracle 10gR2 std edition RAC.
> >>
> >> We are using ocfs2 to store files generated by our application and not
> >> to store anything to do with the database.
> >>
> >> We’ve been having a few problems where the servers appear to hang, and
> >> have to be shut down (using the power button) and then started up again.
> >> This seems to be happening every weekend and I don’t really understand
> >> what’s happening, or how to fix it.
> >>
> >> I’ve included an extract from messages in the hope someone can shed
> >> some light on the matter.
> >>
> >> Kind regards
> >>
> >> Andrew
> >>
> >> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
> >> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
> >> idle for 10 seconds, shutting it down.
> >>
> >> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
> >> some times that might help debug the situation: (tmr 1158527154.993223
> >> now 1158527164.993090 dr 1158527154.993213 adv
> >> 1158527154.993227:1158527154.993228 func (101e0528:505)
> >> 1158527153.796194:1158527153.796200)
> >>
> >> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
> >> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
> >> 10.1.1.110:7777
> >>
> >> Sep 17 22:06:04 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
> >>
> >> Sep 17 22:06:04 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 185 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 154 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 123 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 472 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:08 argon2 last message repeated 3239 times
> >>
> >> Sep 17 22:06:08 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:08 argon2 last message repeated 118 times
> >>
> >> Sep 17 22:06:08 argon2 kernel:
> >> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
> >>
> >> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
> >>
> >> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
> >> started.
> >>
> >> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
> >> root=LABEL=/ apic rhgb quiet)
> >>
> >> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
> >> (bhcompile at hs20-bc1-2.build.redhat.com) (gcc version 3.4.4 20050721
> >> (Red Hat 3.4.4-2)) #1 SMP
> >>
> >>
> >>
> >> Andrew Brunton
> >>
> >> Senior Application Developer
> >>
> >> UK Fuels Limited
> >>
> >>
> >>
> >> Tel +44 (0)1270 655636
> >>
> >> Fax +44 (0)1270 655700
> >>
> >>
> >>
> >> andrew.brunton at ukfuels.co.uk
> >>
> >>
> >>
> >>
> >>
> >> ________________________________________________________________________
> >> In order to protect our email recipients, Betfair use SkyScan from
> >> MessageLabs to scan all Incoming and Outgoing mail for viruses.
> >>
> >> ________________________________________________________________________
> >> _______________________________________________
> >> Ocfs2-users mailing list
> >> Ocfs2-users at oss.oracle.com
> >> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> 
-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Limited | Winslow Road | Hammersmith Embankment | London | W6 9HP

Company No. 5140986


The information in this e-mail and any attachment is confidential and is
intended only for the named recipient(s). The e-mail may not be
disclosed or used by any person other than the addressee, nor may it be
copied in any way. If you are not a named recipient please notify the
sender immediately and delete any copies of this message. Any
unauthorized copying, disclosure or distribution of the material in this
e-mail is strictly forbidden. Any view or opinions presented are solely
those of the author and do not necessarily represent those of the
company.



