[Ocfs2-users] problem with 2 host cluster

Tue Sep 19 07:38:19 PDT 2006

We had a lot of issues in that same configuration: 4 e1000 NICs
bonded against 2 1g switches.  I don't know whether that's a
coincidence, and we have since taken the whole thing down until we can
make it work in a more stable fashion.  Based on that pattern alone, I
would suggest that you break up the NIC bonding and just let it sit
on one interface, then see what happens.  It would be interesting to
know whether the bonding is introducing some kind of latency at the
network layer that causes the OCFS cluster to oops and panic.

Adam

On Sep 19, 2006, at 9:46 AM, Andrew Brunton wrote:

> We've had the system go down under various load conditions (one
> machine has gone down before now during the day, which is not good).
>
> In each box we have 4 e1000 network cards bonded into 2 bonded
> connections, bond0 (public) and bond1 (private). We have 2 1g switches,
> with a connection from each of the bonds going into each switch.
>
> http://www.puschitz.com/TuningLinuxForOracle.shtml#ChangingNetworkKernelSettings
> mentions flow control for the e1000 (options e1000 FlowControl=1).
> Could this have something to do with the problem?
>
> Something else I have noticed is that I'm using the public bonded
> connection for the heartbeat link rather than the private one, which is
> used by the RAC cluster. I assume I can change it? If so, can I just
> down the ocfs cluster, change the IP address to the private one, and
> then start them back up again? What's the recommended way to do it?
>
> The ocfs mount is shared to our Windows clients using Samba. I've
> noticed a large number of errors concerning Samba in messages, and I'm
> wondering if that's what's causing my problem.
>
> How do you work out the O2CB_HEARTBEAT_THRESHOLD ?
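
On working out the threshold: as far as I recall from the OCFS2 FAQ, the
disk heartbeat runs every 2 seconds and the value is
(timeout in seconds / 2) + 1. A minimal Python sketch under that
assumption (the function names are made up for illustration, and this
only covers the disk heartbeat, not the o2net idle timer discussed
further down the thread):

```python
# Assumption: one disk heartbeat iteration every 2 seconds.
HEARTBEAT_INTERVAL_SECS = 2


def threshold_for_timeout(timeout_secs: int) -> int:
    """Threshold value to put in /etc/sysconfig/o2cb for a desired timeout."""
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


def timeout_for_threshold(threshold: int) -> int:
    """Seconds of missed disk heartbeats tolerated for a given threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


if __name__ == "__main__":
    for secs in (12, 60, 120):
        print(f"{secs:>4} s -> O2CB_HEARTBEAT_THRESHOLD={threshold_for_timeout(secs)}")
```

Under that formula, the 61 quoted further down works out to 120 seconds,
and the 12 seconds mentioned in this thread would correspond to a
threshold of 7.
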
>
> This is a bit OT, but how do I stop Samba from binding to the
> private bonded connection?
>
> Andrew
>
>
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com
> [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of  
> Alexei_Roudnev
> Sent: 19 September 2006 09:23
> To: Andrew.Phillips at betfair.com
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] problem with 2 host cluster
>
> I do not remember which timeouts are configurable and which are not,
> but, if I am not mistaken, the network timeout is hardcoded.
> So, if the disks take more than 12 seconds to reconnect (normally, they
> reconnect within 60 seconds), you can reconfigure OCFSv2, but if the
> network reconnection time is > 12 seconds (and it is ALWAYS > 12
> seconds! no exceptions) then you have no choice.
>
> It is a design flaw, in general. The only idea I have: if it is a NETWORK
> glitch, you can try a direct cross-connection (but that is unlikely; more
> likely it is a glitch on the server, which loops in the kernel and so is
> delayed in servicing incoming TCP/IP in time, and you had better find
> the root cause of it).
>
>
>
> ----- Original Message -----
> From: "Andy Phillips" <Andrew.Phillips at betfair.com>
> To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> Cc: <markm at globoforce.com>; <ocfs2-users at oss.oracle.com>
> Sent: Monday, September 18, 2006 10:04 AM
> Subject: Re: [Ocfs2-users] problem with 2 host cluster
>
>
> Alexei
>
> But given that the problem is o2net_idle_timer, that sort of takes the
> disk heartbeat out of the equation.
>
> Andy
>
>
> On Mon, 2006-09-18 at 09:57 -0700, Alexei_Roudnev wrote:
>> OCFS has 2 heartbeat thresholds, and only one is configured by this
>> option.
>> (I don't remember which of the 2, network or disk heartbeat.)
>>
>>
>> ----- Original Message -----
>> From: "Mark Maiden" <markm at globoforce.com>
>> To: "Andy Phillips" <Andrew.Phillips at betfair.com>
>> Cc: <ocfs2-users at oss.oracle.com>
>> Sent: Monday, September 18, 2006 4:17 AM
>> Subject: Re: [Ocfs2-users] problem with 2 host cluster
>>
>>
>> We had a similar issue using SLES 9 and a CX300.
>>
>> We upgraded to the latest ocfs version and changed our
>> O2CB_HEARTBEAT_THRESHOLD in the /etc/sysconfig/o2cb file (on both
>> nodes) to the following:
>>
>> # O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
>> O2CB_HEARTBEAT_THRESHOLD=61
>>
>> It seemed to sort the issue out for us, but could be a totally  
>> different
>> issue! ;-)
>>
>> Mark Maiden
>> Systems Administrator
>> Globoforce, Ltd
>>   6 Beckett Way Parkwest
>>   Dublin 12
>>   Ireland
>>   t: +353 1 625 8812
>>   f: +353 1 625 8880
>>   e: markm at globoforce.com
>>    www.globoforce.com
>>
>>    http://guidance.gospelcom.net/answer.htm
>>
>>
>> Andy Phillips wrote:
>>> Hi,
>>>
>>>    I've got _exactly_ the same problem. I've not had the time to  
>>> dive
>>> through the source code and check it. We're on ES4.3 and ocfs-1.2.3.
>>>
>>>    For us the problem (same trace as below) was not that  
>>> repeatable, and
>>> was possibly related to the i/o pattern.
>>>
>>>    What seems to happen is that the underlying "network services" of
>>> ocfs2 (o2net) believes that no packets are being sent. The tcp  
>>> socket is
>>> surrounded by wrapper functions, one of which times when the last
>>> packet is received. It's this that decides the socket is dead, then
>>> closes the
>>> socket. Meanwhile, the upper layers (which are actually sending data
>>> regularly) find the carpet yanked out from underneath them, and  
>>> decide
>>> to halt the cluster to protect the data.
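
A toy model of that kind of receive-side idle timer, as a sketch only:
the class and method names are hypothetical and this is not the o2net
code, just the pattern described above. Note that only received packets
reset the clock, so a node that keeps sending but sees nothing come back
is still declared dead.

```python
class IdleTimer:
    """Declares a connection dead if nothing is received for timeout_secs."""

    def __init__(self, timeout_secs: float, now: float) -> None:
        self.timeout_secs = timeout_secs
        self.last_rx = now  # time of the most recent received packet

    def packet_received(self, now: float) -> None:
        # Only incoming traffic resets the idle clock.
        self.last_rx = now

    def packet_sent(self, now: float) -> None:
        # Outgoing traffic deliberately does not touch last_rx.
        pass

    def is_dead(self, now: float) -> bool:
        return now - self.last_rx >= self.timeout_secs


if __name__ == "__main__":
    timer = IdleTimer(timeout_secs=10.0, now=0.0)
    for t in range(1, 11):
        timer.packet_sent(float(t))  # upper layers keep sending every second
    print("dead after 10 s of silence:", timer.is_dead(10.0))
```
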
>>>
>>>    Highly annoying. I expect it will be some signed 32bit integer
>>> wrapping somewhere....
>>>
>>>    Andy
>>>
>>>
>>> On Mon, 2006-09-18 at 11:14 +0100, Andrew Brunton wrote:
>>>> Hi,
>>>>
>>>>
>>>>
>>>> We have 2 Dell 1850's in a cluster, both machines are running  
>>>> Redhat
>>>> Enterprise Linux 4 AS, update 2.
>>>>
>>>>
>>>>
>>>> The boxes are connected to a Dell EMC CX300 using emulex HBA's
>>>>
>>>>
>>>>
>>>> The cluster is running an Oracle 10gR2 std edition RAC.
>>>>
>>>>
>>>>
>>>> We are using ocfs2 to store files generated by our application  
>>>> and not
>>>> to store anything to do with the database.
>>>>
>>>>
>>>>
>>>> We've been having a few problems where the servers appear to hang and
>>>> have to be shut down (using the power button) and then started up again.
>>>> This seems to be happening every weekend, and I don't really understand
>>>> what's happening or how to fix it.
>>>>
>>>>
>>>>
>>>> I've included an extract from messages in the hope someone can shed
>>>> some light on the matter.
>>>>
>>>>
>>>>
>>>> Kind regards
>>>>
>>>>
>>>>
>>>> Andrew
>>>>
>>>>
>>>>
>>>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1310  
>>>> connection
>>>> to node argon1.crewe.ukfuels.co.uk (num 0) at 10.1.1.110:7777  
>>>> has been
>>>> idle for 10 seconds, shutting it down.
>>>>
>>>> Sep 17 22:06:04 argon2 kernel: (0,0):o2net_idle_timer:1321 here are
>>>> some times that might help debug the situation: (tmr  
>>>> 1158527154.993223
>>>> now 1158527164.993090 dr 1158527154.993213 adv
>>>> 1158527154.993227:1158527154.993228 func (101e0528:505)
>>>> 1158527153.796194:1158527153.796200)
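
The timestamps in that debug line can be subtracted to see the 10 second
gap. A small Python sketch; reading "tmr" as the time the idle timer was
last armed and "dr" as the last data-ready event is an assumption based
on the field names, not something confirmed in this thread, and the part
after the dot is treated as a microsecond count:

```python
import re

# The debug line from the log above, rejoined onto one line.
DEBUG_LINE = (
    "(tmr 1158527154.993223 now 1158527164.993090 dr 1158527154.993213 adv "
    "1158527154.993227:1158527154.993228 func (101e0528:505) "
    "1158527153.796194:1158527153.796200)"
)


def stamp_us(text: str, name: str) -> int:
    """Return the named 'secs.usecs' field as integer microseconds."""
    match = re.search(rf"\b{name} (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"field {name!r} not found")
    secs, usecs = int(match.group(1)), int(match.group(2))
    return secs * 1_000_000 + usecs


def idle_us(text: str, since: str) -> int:
    """Microseconds between the named field and the 'now' field."""
    return stamp_us(text, "now") - stamp_us(text, since)


if __name__ == "__main__":
    for name in ("tmr", "dr"):
        print(f"now - {name} = {idle_us(DEBUG_LINE, name) / 1_000_000:.6f} s")
```

Both differences come out just under 10 seconds, which matches the "idle
for 10 seconds" message.
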
>>>>
>>>> Sep 17 22:06:04 argon2 kernel: (3854,0):o2net_set_nn_state:411 no
>>>> longer connected to node argon1.crewe.ukfuels.co.uk (num 0) at
>>>> 10.1.1.110:7777
>>>>
>>>> Sep 17 22:06:04 argon2 kernel:
>>>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112
>>>>
>>>> Sep 17 22:06:04 argon2 kernel:
>>>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>>>
>>>> Sep 17 22:06:05 argon2 last message repeated 185 times
>>>>
>>>> Sep 17 22:06:05 argon2 kernel:
>>>> (26144,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>>>
>>>> Sep 17 22:06:05 argon2 last message repeated 154 times
>>>>
>>>> Sep 17 22:06:05 argon2 kernel:
>>>> (25274,2):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>>>
>>>> Sep 17 22:06:05 argon2 last message repeated 123 times
>>>>
>>>> Sep 17 22:06:05 argon2 kernel:
>>>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>>>
>>>> Sep 17 22:06:05 argon2 last message repeated 472 times
>>>>
>>>> Sep 17 22:06:05 argon2 kernel:
>>>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>>>
>>>> Sep 17 22:06:08 argon2 last message repeated 3239 times
>>>>
>>>> Sep 17 22:06:08 argon2 kernel:
>>>> (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107
>>>>
>>>> Sep 17 22:06:08 argon2 last message repeated 118 times
>>>>
>>>> Sep 17 22:06:08 argon2 kernel:
>>>> (73,1):dlm_send_remote_unlock_request:350 ERROR: status = -107
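
For reading the status values: on Linux, errno 107 is ENOTCONN (transport
endpoint is not connected) and 112 is EHOSTDOWN (host is down), which fits
the connection having just been shut down. A short sketch that tallies
them from log lines; the table is hard-coded for Linux and the helper
name is made up for illustration:

```python
import re
from collections import Counter

# Linux errno values seen in the log above (hard-coded, Linux only).
LINUX_ERRNO_NAMES = {107: "ENOTCONN", 112: "EHOSTDOWN"}
STATUS_RE = re.compile(r"ERROR: status = -(\d+)")

SAMPLE = [
    "Sep 17 22:06:04 argon2 kernel: (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -112",
    "Sep 17 22:06:04 argon2 kernel: (73,3):dlm_send_remote_unlock_request:350 ERROR: status = -107",
    "Sep 17 22:06:05 argon2 last message repeated 185 times",
]


def summarize(lines):
    """Count status codes by errno name; lines without a status are ignored."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            code = int(match.group(1))
            counts[LINUX_ERRNO_NAMES.get(code, f"errno {code}")] += 1
    return counts


if __name__ == "__main__":
    for name, count in summarize(SAMPLE).items():
        print(name, count)
```
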
>>>>
>>>> Sep 18 08:40:32 argon2 syslogd 1.4.1: restart.
>>>>
>>>> Sep 18 08:40:32 argon2 syslog: syslogd startup succeeded
>>>>
>>>> Sep 18 08:40:32 argon2 kernel: klogd 1.4.1, log source = /proc/kmsg
>>>> started.
>>>>
>>>> Sep 18 08:40:32 argon2 kernel: Bootdata ok (command line is ro
>>>> root=LABEL=/ apic rhgb quiet)
>>>>
>>>> Sep 18 08:40:32 argon2 kernel: Linux version 2.6.9-22.0.1.ELsmp
>>>> (bhcompile at hs20-bc1-2.build.redhat.com) (gcc version 3.4.4 20050721
>>>> (Red Hat 3.4.4-2)) #1 SMP
>>>>
>>>>
>>>>
>>>> Andrew Brunton
>>>>
>>>> Senior Application Developer
>>>>
>>>> UK Fuels Limited
>>>>
>>>>
>>>>
>>>> Tel +44 (0)1270 655636
>>>>
>>>> Fax +44 (0)1270 655700
>>>>
>>>>
>>>>
>>>> andrew.brunton at ukfuels.co.uk
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________________________________________________
>>>> In order to protect our email recipients, Betfair use SkyScan from
>>>> MessageLabs to scan all Incoming and Outgoing mail for viruses.
>>>> ________________________________________________________________________
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users at oss.oracle.com
>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
> -- 
> Andy Phillips
> Systems Architecture Manager, Betfair.com
>
> Office: 0208 8348436
>
> Betfair Limited | Winslow Road | Hammersmith Embankment | London | W6 9HP
>
> Company No. 5140986
>
>