[Ocfs2-users] problem with 2 host cluster

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Mon Sep 18 09:57:51 PDT 2006


OCFS2 has two heartbeat thresholds, and this option configures only one of them.
(I don't remember which of the two it is: the network heartbeat or the disk heartbeat.)


----- Original Message ----- 
From: "Mark Maiden" <markm at globoforce.com>
To: "Andy Phillips" <Andrew.Phillips at betfair.com>
Cc: <ocfs2-users at oss.oracle.com>
Sent: Monday, September 18, 2006 4:17 AM
Subject: Re: [Ocfs2-users] problem with 2 host cluster


We had a similar issue using SLES 9 and a CX300.

We upgraded to the latest OCFS2 version and changed
O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file (on both nodes)
to the following:

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

It seemed to sort the issue out for us, but could be a totally different
issue! ;-)
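
For anyone picking a value for this setting, here is a rough sketch of the
arithmetic behind it. It assumes the relationship described in the OCFS2 FAQ
as I understand it: O2CB_HEARTBEAT_THRESHOLD applies to the disk heartbeat,
the heartbeat runs every 2 seconds, and the tolerated timeout is
(threshold - 1) * 2 seconds. The function names are made up for illustration;
check the FAQ for your OCFS2 release before relying on the numbers.

```python
# Assumption: the disk heartbeat fires every 2 seconds and a node is
# considered dead after (threshold - 1) missed iterations.
HEARTBEAT_INTERVAL_S = 2


def threshold_to_timeout(threshold: int) -> int:
    """Seconds of disk-heartbeat silence tolerated for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def timeout_to_threshold(timeout_s: int) -> int:
    """Smallest threshold that tolerates at least timeout_s seconds."""
    if timeout_s < 1:
        raise ValueError("timeout must be positive")
    # Ceiling division, so an odd timeout is rounded up rather than shortened.
    return -(-timeout_s // HEARTBEAT_INTERVAL_S) + 1


if __name__ == "__main__":
    # 61 is the value set in /etc/sysconfig/o2cb above.
    print(threshold_to_timeout(61))
    print(timeout_to_threshold(120))
```

Under that assumption, the value 61 corresponds to a disk heartbeat timeout
of 120 seconds.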

Mark Maiden
Systems Administrator
Globoforce, Ltd
  6 Beckett Way Parkwest
  Dublin 12
  Ireland
  t: +353 1 625 8812
  f: +353 1 625 8880
  e: markm at globoforce.com
   www.globoforce.com

   http://guidance.gospelcom.net/answer.htm


Andy Phillips wrote:
> Hi,
>
>    I've got _exactly_ the same problem. I've not had the time to dive
> through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
>
>    For us the problem (same trace as below) was not that repeatable, and
> was possibly related to the i/o pattern.
>
>    What seems to happen is that the underlying "network services" of
> ocfs2 (o2net) believes that no packets are being sent. The TCP socket is
> surrounded by wrapper functions, one of which records when the last
> packet was received. It's this that decides the socket is dead, then
> closes the socket. Meanwhile, the upper layers (which are actually
> sending data regularly) find the carpet yanked out from underneath them,
> and decide to halt the cluster to protect the data.
>
>    Highly annoying. I expect it will be some signed 32-bit integer
> wrapping somewhere....
>
>    Andy
>
>
> On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
>> Hi,
>>
>> We have 2 Dell 1850s in a cluster; both machines are running Red Hat
>> Enterprise Linux 4 AS, update 2.
>>
>> The boxes are connected to a Dell EMC CX300 using Emulex HBAs.
>>
>> The cluster is running an Oracle 10gR2 std edition RAC.
>>
>> We are using ocfs2 to store files generated by our application and not
>> to store anything to do with the database.
>>
>> We’ve been having a few problems where the servers appear to hang and
>> have to be shut down (using the power button) and then started up again.
>> This seems to be happening every weekend and I don’t really understand
>> what’s happening, or how to fix it.
>>
>> I’ve included an extract from messages in the hope someone can shed
>> some light on the matter.
>>
>> Kind regards
>>
>> Andrew
>>
>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310 connection
>> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777 has been
>> idle for 10 seconds, shutting it down.
>>
>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
>> some times that might help debug the situation: (tmr 1158527154.993223
>> now 1158527164.993090 dr 1158527154.993213 adv
>> 1158527154.993227:1158527154.993228 func (101e0528:505)
>> 1158527153.796194:1158527153.796200)
>>
>> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
>> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
>> 10.1.1.110:7777
>>
>> Sep 17 22:06:04 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
>>
>> Sep 17 22:06:04 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 185 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 154 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 123 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:05 argon2 last message repeated 472 times
>>
>> Sep 17 22:06:05 argon2 kernel:
>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:08 argon2 last message repeated 3239 times
>>
>> Sep 17 22:06:08 argon2 kernel:
>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 17 22:06:08 argon2 last message repeated 118 times
>>
>> Sep 17 22:06:08 argon2 kernel:
>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>
>> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
>>
>> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
>>
>> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
>> started.
>>
>> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
>> root=LABEL=/ apic rhgb quiet)
>>
>> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
>> (bhcompile at hs20-bc1-2.build.redhat.com) (gcc version 3.4.4 20050721
>> (Red Hat 3.4.4-2)) #1 SMP
>>
>> Andrew Brunton
>> Senior Application Developer
>> UK Fuels Limited
>>
>> Tel +44 (0)1270 655636
>> Fax +44 (0)1270 655700
>>
>> andrew.brunton at ukfuels.co.uk
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
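
The o2net_idle_timer lines in the quoted trace carry enough information to
check the "idle for 10 seconds" claim by hand. The sketch below parses the
timestamps from the "here are some times" line and names the two dlm status
codes. Some of this is my own assumption rather than anything stated in the
thread: I read "tmr" as the time the idle timer was armed, "dr" as the last
data-ready event, and the "func" pair as the start and stop of the last
message handler. The errno names are the standard Linux ones for 107 and
112. The helper names are invented for illustration.

```python
import re
from decimal import Decimal

# Text copied from the o2net_idle_timer:1321 message in this thread.
LOG = (
    "(tmr 1158527154.993223 now 1158527164.993090 dr 1158527154.993213 "
    "adv 1158527154.993227:1158527154.993228 func (101e0528:505) "
    "1158527153.796194:1158527153.796200)"
)

# Linux errno names for the two dlm status codes seen in the trace.
ERRNO_NAMES = {107: "ENOTCONN", 112: "EHOSTDOWN"}

PATTERN = re.compile(
    r"tmr (?P<tmr>[\d.]+) now (?P<now>[\d.]+) dr (?P<dr>[\d.]+) "
    r"adv (?P<adv_start>[\d.]+):(?P<adv_stop>[\d.]+) "
    r"func \((?P<key>[0-9a-f]+):(?P<type>\d+)\) "
    r"(?P<func_start>[\d.]+):(?P<func_stop>[\d.]+)"
)


def parse_times(log: str) -> dict:
    """Return the labelled timestamps as exact Decimal seconds."""
    match = PATTERN.search(log)
    if match is None:
        raise ValueError("unrecognised o2net timing line")
    # The message key and type are identifiers, not timestamps.
    return {
        name: Decimal(value)
        for name, value in match.groupdict().items()
        if name not in ("key", "type")
    }


def describe_status(status: int) -> str:
    """Map a negative kernel status code to its Linux errno name."""
    return ERRNO_NAMES.get(-status, "unknown")


if __name__ == "__main__":
    times = parse_times(LOG)
    print("now - tmr      :", times["now"] - times["tmr"])
    print("now - dr       :", times["now"] - times["dr"])
    print("now - func_stop:", times["now"] - times["func_stop"])
    for status in (-112, -107):
        print(status, describe_status(status))
```

With the values from the trace, the gap between "tmr" and "now" is 9.999867
seconds, which matches the 10 second idle message, and the "func" stop time
is about 11.2 seconds before "now". The status codes -112 and -107 correspond
to EHOSTDOWN and ENOTCONN, which fits the dlm requests failing once o2net has
closed the connection.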
