[Ocfs2-users] Node reboot during network outage

Brendan Beveridge brendan at sitesuite.com.au
Tue Apr 22 16:06:12 PDT 2008


STP shouldn't have anything to do with the nodes still seeing each other
when the switch fails.

Have you checked that your bonding config is correct? i.e. run

cat /proc/net/bonding/bond0

and check that the bond fails over to the other NIC when the switch goes down.
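
For example, in active-backup mode the output should show something like
this, with the active slave flipping to the surviving NIC (interface names
and values illustrative):

    Bonding Mode: fault-tolerance (active-backup)
    Currently Active Slave: eth1
    MII Status: up

    Slave Interface: eth0
    MII Status: down

    Slave Interface: eth1
    MII Status: up

You can also test the failover without touching the switch, e.g.
"ip link set eth0 down" (or "ifdown eth0"), and watch /var/log/messages
for the bonding driver switching the active slave.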

Cheers
Brendan


Sunil Mushran wrote:
> The issue is not the time the switch takes to reboot. The issue is the
> amount of time the secondary switch takes to find a unique path.
>
> http://en.wikipedia.org/wiki/Spanning_tree_protocol
>
> Mick Waters wrote:
>   
>> Thanks Sunil,
>>
>> The network switch is brand new but has a fairly complex configuration because of the number of VLANs we run - however, we have found that it has always taken quite a while to reboot.
>>
>> I'll try increasing the idle timeout as suggested and let you know what happens.  However, surely this is only treating the symptoms of what is, after all, a contrived scenario.  Rebooting the switch is supposed to test what would happen if we had a real network outage.  What if the switch were to stay down?
>>
>> My issue is that we have an alternative route via the other NIC in the bond and the other switch.  The affected nodes in the cluster shouldn't fence because they should still be able to see all of the other nodes in the cluster via this other route.
>>
>> Does this make sense?
>>
>> Regards,
>>
>> Mick.
>>
>> -----Original Message-----
>> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
>> Sent: 22 April 2008 17:40
>> To: Mick Waters
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] Node reboot during network outage
>>
>> The interface died at 14:25:44 and recovered at 14:27:43.
>> That's two minutes.
>>
>> One solution is to increase o2cb_idle_timeout to > 2mins.
>>
>> A better solution would be to look into your router setup to determine why it is taking two minutes for the router to reconfigure.
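>>
>> For reference, on OCFS2 1.2.5 and later the idle timeout is the
>> O2CB_IDLE_TIMEOUT_MS value in /etc/sysconfig/o2cb (also prompted for by
>> "service o2cb configure").  A minimal sketch, value illustrative, to push
>> it past two minutes:
>>
>>     O2CB_IDLE_TIMEOUT_MS=150000
>>
>> The value has to be identical on every node, and the cluster stack needs
>> to be restarted ("service o2cb restart") for it to take effect.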
>>
>> Mick Waters wrote:
>>   
>>     
>>> Hi, my company is in the process of moving our web and database
>>> servers to new hardware.  We have an HP EVA 4100 SAN which is being
>>> used by two database servers running in an Oracle 10g cluster, and that
>>> works fine.  We have gone to extreme lengths to ensure high
>>> availability.  The SAN has twin disk arrays, twin controllers, and all
>>> servers have dual fibre interfaces.  Networking is (or should be)
>>> similarly redundant, with bonded NICs connected in a two-switch
>>> configuration, two firewalls, and so on.
>>>
>>> We also want to share regular Linux filesystems between our servers -
>>> HP DL580 G5s running RedHat AS 5 (kernel 2.6.18-53.1.14.el5) - and we
>>> chose OCFS2 (1.2.8) to manage the cluster.
>>>
>>> As stated, each server in the 4 node cluster has a bonded interface
>>> set up as bond0 in a two-switch configuration (each NIC in the bond is
>>> connected to a different switch).  Because this is a two-switch
>>> configuration, we are running the bond in active-standby mode and this
>>> works just fine.
>>>
>>> Our problem occurred when we were doing failover testing, where we
>>> simulated the loss of one of the network switches by powering it off.
>>> The result was that the servers rebooted, which made a mockery of our
>>> attempts at an HA solution.
>>>
>>> Here is a short section from /var/log/messages following a reboot of
>>> one of the switches to simulate an outage:
>>>
>>> --------------------------------------------------------------------------
>>> Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
>>> Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
>>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
>>> Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
>>> Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
>>> Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
>>> Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
>>> Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
>>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
>>> Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
>>> Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
>>> Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
>>> Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
>>> --------------------------------------------------------------------------
>>>
>>> Things that I have tried...
>>>
>>> I've tried setting up the bond with both miimon and ARP monitoring (at
>>> different times, of course), because when the switch comes back up the
>>> link detection goes up and down several times while the switch
>>> initialises, and I hoped that ARP monitoring might be more reliable -
>>> it made no difference at all.
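>>>
>>> (For illustration only - on RHEL 5 these two monitoring styles are
>>> normally selected via bonding module options along these lines in
>>> /etc/modprobe.conf:
>>>
>>>     alias bond0 bonding
>>>     options bond0 mode=active-backup miimon=100
>>>
>>> or, with ARP monitoring in place of link monitoring:
>>>
>>>     options bond0 mode=active-backup arp_interval=1000 arp_ip_target=10.1.3.1
>>>
>>> where the arp_ip_target address is just a placeholder for something
>>> reachable through both switches.)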
>>>
>>> I've increased the heartbeat timeout from 31 to 61 but, as yet, I
>>> haven't played with any of the other cluster configuration variables.
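>>>
>>> (For reference, the heartbeat threshold is the O2CB_HEARTBEAT_THRESHOLD
>>> value in /etc/sysconfig/o2cb - e.g. O2CB_HEARTBEAT_THRESHOLD=61 - and,
>>> like the other cluster timeouts, it has to match on every node.  A
>>> threshold of 61 corresponds to roughly 120 seconds of missed disk
>>> heartbeats before a node fences.)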
>>>
>>> Has anyone with a similar configuration experienced problems like this
>>> and found a solution?
>>>
>>> Regards,
>>>
>>> Mick.
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Mick Waters
>>> Senior Systems Developer
>>>
>>> w: +44 (0)208 335 2011
>>> m: +44 (0)7849 887 277
>>> e: mick.waters at motortrak.com <mailto:mick.waters at motortrak.com>
>>>
>>> www.motortrak.com <http://www.motortrak.com/> digital media solutions
>>>
>>> Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR,
>>> United Kingdom
>>>
>>> ------------------------------------------------------------------------
>>>
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>   



