[Ocfs2-users] problem with 2 host cluster

Andy Phillips Andrew.Phillips at betfair.com
Mon Sep 18 06:13:22 PDT 2006


Hi,

   The timeout you're interested in is:

ocfs2-1.2.3/fs/ocfs2/cluster/tcp_internal.h:
#define O2NET_IDLE_TIMEOUT_SECS          10

   The function o2net_idle_timer, which is referenced in your error
message, is in ocfs2-1.2.3/fs/ocfs2/cluster/tcp.c
    
The code that is probably giving you that error message is:

        printk(KERN_INFO "o2net: connection to " SC_NODEF_FMT " has been idle for 10 "
             "seconds, shutting it down.\n", SC_NODEF_ARGS(sc));
        mlog(ML_NOTICE, "here are some times that might help debug the "
             "situation: (tmr %ld.%ld now %ld.%ld dr %ld.%ld adv "
             "%ld.%ld:%ld.%ld func (%08x:%u) %ld.%ld:%ld.%ld)\n",
             sc->sc_tv_timer.tv_sec, sc->sc_tv_timer.tv_usec,
             now.tv_sec, now.tv_usec,
             sc->sc_tv_data_ready.tv_sec, sc->sc_tv_data_ready.tv_usec,
             sc->sc_tv_advance_start.tv_sec, sc->sc_tv_advance_start.tv_usec,
             sc->sc_tv_advance_stop.tv_sec, sc->sc_tv_advance_stop.tv_usec,
             sc->sc_msg_key, sc->sc_msg_type,
             sc->sc_tv_func_start.tv_sec, sc->sc_tv_func_start.tv_usec,
             sc->sc_tv_func_stop.tv_sec, sc->sc_tv_func_stop.tv_usec);


 (excuse formatting). You'll notice that the 10 seconds is hardwired
into the error message. But the code around it uses the #define above.
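
  If you do change the #define, the message will still say 10 unless the
string is changed too. One way to keep the two in step is to format the
value in rather than typing it into the text. A minimal user-space sketch,
assuming snprintf in place of printk; format_idle_message and its node
argument are made up for illustration and are not in the ocfs2 source:

```c
#include <stdio.h>

/* Same name and value as in tcp_internal.h (ocfs2-1.2.3). */
#define O2NET_IDLE_TIMEOUT_SECS 10

/*
 * Hypothetical helper, not part of ocfs2: builds the idle message with
 * the timeout taken from the #define, so the text cannot drift from the
 * value the timer actually uses. Returns whatever snprintf returns.
 */
static int format_idle_message(char *buf, size_t len, const char *node)
{
        return snprintf(buf, len,
                        "o2net: connection to %s has been idle for %d "
                        "seconds, shutting it down.\n",
                        node, O2NET_IDLE_TIMEOUT_SECS);
}
```

  With the define left at 10 this produces the same wording as the message
in your log; raise the define and the text follows it.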

  Also note that this is the "I have not got any packets on this socket"
timeout, not the "I have not got a heartbeat packet" timeout, which is the
o2cb_heartbeat_timer.
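
  To make the idle check concrete, here is a rough sketch of the
arithmetic, assuming the timer simply compares "now" against the last
time data arrived. struct stamp, elapsed_usec and socket_is_idle are
names invented for this example; this is not the o2net code itself.

```c
#include <stdbool.h>

#define O2NET_IDLE_TIMEOUT_SECS 10

/* Hypothetical stand-in for struct timeval: seconds plus microseconds. */
struct stamp {
        long sec;
        long usec;
};

/* Microseconds elapsed from 'then' to 'now'. */
static long long elapsed_usec(struct stamp then, struct stamp now)
{
        return (long long)(now.sec - then.sec) * 1000000LL
                + (now.usec - then.usec);
}

/*
 * Illustrative idle test (an assumption, not taken from tcp.c): the
 * socket counts as idle once nothing has arrived for the full timeout.
 */
static bool socket_is_idle(struct stamp last_rx, struct stamp now)
{
        return elapsed_usec(last_rx, now) >=
                (long long)O2NET_IDLE_TIMEOUT_SECS * 1000000LL;
}
```

  Feeding elapsed_usec the "dr" (sc_tv_data_ready) and "now" values from
the log quoted below, 1158527154.993213 and 1158527164.993090, gives
9999877 microseconds, i.e. a shade under 10 seconds since data last
arrived on that socket.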

  Restart does not work, because OCFS2 (higher layers) halts the node,
rather than restarting, as a data safety precaution. It's deliberate.

   Andy


On Mon, 2006-09-18 at 14:04 +0100, Andrew Brunton wrote:
> Hi,
> 
> I was wondering about the timeout, but wasn't sure where to set it
> 
> Why doesn't the restart work ? ( I assume it's trying to restart )
> 
> Andrew
> 
> 
> -----Original Message-----
> From: Mark Maiden [mailto:markm at globoforce.com] 
> Sent: 18 September 2006 12:18
> To: Andy Phillips
> Cc: Andrew Brunton; ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] problem with 2 host cluster
> 
> We had a similar issue using SLES 9 and a CX300.
> 
> We upgraded to the latest ocfs version and changed our 
> O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file(on both nodes) 
> to the following :
> 
> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
> O2CB_HEARTBEAT_THRESHOLD=61
> 
> It seemed to sort the issue out for us, but could be a totally different 
> issue! ;-)
> 
> Mark Maiden
> Systems Administrator
> Globoforce, Ltd
>   6 Beckett Way Parkwest
>   Dublin 12
>   Ireland
>   t: +353 1 625 8812
>   f: +353 1 625 8880
>   e: markm at globoforce.com
>    www.globoforce.com
> 
>    http://guidance.gospelcom.net/answer.htm
> 
> 
> Andy Phillips wrote:
> > Hi,
> > 
> >    I've got _exactly_ the same problem. I've not had the time to dive
> > through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
> > 
> >    For us the problem (same trace as below) was not that repeatable, and
> > was possibly related to the i/o pattern. 
> > 
> >    What seems to happen is that the underlying "network services" of
> > ocfs2 (o2net) believes that no packets are being sent. The tcp socket is
> > surrounded by wrapper functions, one of which times when the last packet
> > is received. It's this that decides the socket is dead, then closes the 
> > socket. Meanwhile, the upper layers (which are actually sending data
> > regularly) find the carpet yanked out from underneath them, and decide
> > to halt the cluster to protect the data. 
> > 
> >    Highly annoying. I expect it will be some signed 32bit integer
> > wrapping somewhere....
> > 
> >    Andy
> >  
> > 
> > On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
> >> Hi,
> >>
> >>  
> >>
> >> We have 2 Dell 1850's in a cluster, both machines are running Redhat
> >> Enterprise Linux 4 AS, update 2.
> >>
> >>  
> >>
> >> The boxes are connected to a Dell EMC CX300 using emulex HBA's
> >>
> >>  
> >>
> >> The cluster is running an Oracle 10gR2 std edition RAC. 
> >>
> >>  
> >>
> >> We are using ocfs2 to store files generated by our application and not
> >> to store anything to do with the database.
> >>
> >>  
> >>
> >> We've been having a few problems where the servers appear to hang, and
> >> have to be shutdown (using the powerbutton) and then started up again.
> >> This seems to be happening every weekend and I don't really understand
> >> what's happening, or how to fix it.
> >>
> >>  
> >>
> >> I've included an extract from messages in the hope someone can shed
> >> some light on the matter.
> >>
> >>  
> >>
> >> Kind regards
> >>
> >>  
> >>
> >> Andrew
> >>
> >>  
> >>
> >> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
> >> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
> >> idle for 10 seconds, shutting it down.
> >>
> >> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
> >> some times that might help debug the situation: (tmr 1158527154.993223
> >> now 1158527164.993090 dr 1158527154.993213 adv
> >> 1158527154.993227:1158527154.993228 func (101e0528:505)
> >> 1158527153.796194:1158527153.796200)
> >>
> >> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
> >> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
> >> 10.1.1.110:7777
> >>
> >> Sep 17 22:06:04 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
> >>
> >> Sep 17 22:06:04 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 185 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 154 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 123 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:05 argon2 last message repeated 472 times
> >>
> >> Sep 17 22:06:05 argon2 kernel:
> >> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:08 argon2 last message repeated 3239 times
> >>
> >> Sep 17 22:06:08 argon2 kernel:
> >> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 17 22:06:08 argon2 last message repeated 118 times
> >>
> >> Sep 17 22:06:08 argon2 kernel:
> >> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
> >>
> >> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
> >>
> >> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
> >>
> >> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
> >> started.
> >>
> >> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
> >> root=LABEL=/ apic rhgb quiet)
> >>
> >> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
> >> (bhcompile at hs20-bc1-2.build.redhat.com) (gcc version 3.4.4 20050721
> >> (Red Hat 3.4.4-2)) #1 SMP
> >>
> >>  
> >>
> >> Andrew Brunton
> >>
> >> Senior Application Developer
> >>
> >> UK Fuels Limited
> >>
> >>  
> >>
> >> Tel +44 (0)1270 655636
> >>
> >> Fax +44 (0)1270 655700
> >>
> >>  
> >>
> >> andrew.brunton at ukfuels.co.uk
> >>
> >>  
> >>
> >>
> >>
> >> ________________________________________________________________________
> >> In order to protect our email recipients, Betfair use SkyScan from 
> >> MessageLabs to scan all Incoming and Outgoing mail for viruses.
> >>
> >> ________________________________________________________________________
> >> _______________________________________________
> >> Ocfs2-users mailing list
> >> Ocfs2-users at oss.oracle.com
> >> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> 
> 
-- 
Andy Phillips
Systems Architecture Manager, Betfair.com

Office: 0208 8348436

Betfair Limited | Winslow Road | Hammersmith Embankment | London | W6 9HP

Company No. 5140986





