[Ocfs2-users] Node reboot during network outage

Mick Waters Mick.Waters at motortrak.com
Tue Apr 22 09:28:27 PDT 2008


Hi, my company is in the process of moving our web and database servers to new hardware.  We have an HP EVA 4100 SAN which is used by two database servers running in an Oracle 10g cluster, and that works fine.  We have gone to great lengths to ensure high availability: the SAN has twin disk arrays and twin controllers, and all servers have dual fibre interfaces.  Networking is (or should be) similarly redundant, with bonded NICs connected in a two-switch configuration, two firewalls, and so on.

We also want to share regular Linux filesystems between our servers - HP DL580 G5s running Red Hat AS 5 (kernel 2.6.18-53.1.14.el5) - and we chose OCFS2 (1.2.8) to manage the cluster.
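
For context, our /etc/ocfs2/cluster.conf follows the usual layout.  The sketch below shows only two of the four nodes, with the names and addresses taken from the log further down; the remaining values are illustrative rather than copied from the live file:

--------------------------------------------------------------------------
# /etc/ocfs2/cluster.conf (abridged sketch - two of the four nodes shown)
node:
        ip_port = 7777
        ip_address = 10.1.3.50
        number = 1
        name = mtkdb01p2
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 10.1.3.51
        number = 2
        name = mtkdb02p2
        cluster = ocfs2

cluster:
        node_count = 4
        name = ocfs2
--------------------------------------------------------------------------

The o2net interconnect therefore runs over the same bonded 10.1.3.x network described below.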

As stated, each server in the 4-node cluster has a bonded interface, bond0, in a two-switch configuration (each NIC in the bond is connected to a different switch).  Because of the two-switch setup we run the bond in active-backup (active-standby) mode, and this works just fine.
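
The bond itself is configured along these lines on each node (a rough sketch - the device names, addresses and miimon value here are illustrative rather than copied from the production files):

--------------------------------------------------------------------------
# /etc/modprobe.conf - load the bonding driver for bond0 in active-backup mode
alias bond0 bonding
options bond0 mode=active-backup miimon=100 primary=eth0

# /etc/sysconfig/network-scripts/ifcfg-bond0 - the bonded interface itself
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=10.1.3.52        # illustrative address on the interconnect subnet
NETMASK=255.255.255.0

# /etc/sysconfig/network-scripts/ifcfg-eth0 - slave NIC
# (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes
--------------------------------------------------------------------------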

Our problem occurred during failover testing, where we simulated the loss of one of the network switches by powering it off.  The result was that the servers rebooted, which makes a mockery of our attempts at an HA solution.

Here is a short section from /var/log/messages following a reboot of one of the switches to simulate an outage:

--------------------------------------------------------------------------
Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now down
Apr 22 14:25:44 mtkws01p1 kernel: bnx2: eth0 NIC Link is Down
Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds, shutting it down.
Apr 22 14:26:13 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv 1208870743.673433:1208870743.673434 func (97690d75:2) 1208870697.670758:1208870697.670760)
Apr 22 14:26:13 mtkws01p1 kernel: o2net: no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777
Apr 22 14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface eth0 is now up
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status = -107
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731 ERROR: status = -107
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_do_request_vote:804 ERROR: status = -107
Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107
Apr 22 14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at 10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.
Apr 22 14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr 1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2) 1208870697.670916:1208870697.670918)
Apr 22 14:29:29 mtkws01p1 kernel: o2net: no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777
Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status = -107
Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804 ERROR: status = -107
Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_unlink:843 ERROR: status = -107
Apr 22 14:34:23 mtkws01p1 syslogd 1.4.1: restart.
--------------------------------------------------------------------------

Things that I have tried...

I've tried setting up the bond with both miimon and ARP monitoring (at different times, of course).  When the switch comes back up, link detection flaps several times while the switch initialises, so I hoped that ARP monitoring might prove more reliable - it made no difference at all.  (The ARP-monitoring variant is sketched below.)
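
For the ARP-monitoring attempt the bonding options were changed to something like the following (the monitored address is illustrative - in practice it would be the switch-side gateway or some other always-reachable host on the interconnect):

--------------------------------------------------------------------------
# /etc/modprobe.conf - ARP monitoring variant: miimon disabled, ARP probes instead
options bond0 mode=active-backup arp_interval=1000 arp_ip_target=10.1.3.1
--------------------------------------------------------------------------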

I've increased the O2CB heartbeat dead threshold to 61 from 31 but, as yet, I haven't played with any of the other cluster configuration variables.
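
That change was made in /etc/sysconfig/o2cb (via "service o2cb configure").  The threshold is a count of 2-second heartbeat iterations, so 61 corresponds to roughly (61 - 1) * 2 = 120 seconds of missed disk heartbeats before a node is declared dead.  A sketch of the file follows; the commented-out idle-timeout line is shown only as an example of a knob I believe later 1.2 tools expose - it is not something we have actually set:

--------------------------------------------------------------------------
# /etc/sysconfig/o2cb (sketch - written by "service o2cb configure")
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
O2CB_HEARTBEAT_THRESHOLD=61        # was 31; (61 - 1) * 2 = ~120s before fencing
# If the tools version supports it, the 30s o2net idle timeout seen in the log
# should also be tunable - example only, we have not changed it:
#O2CB_IDLE_TIMEOUT_MS=60000
--------------------------------------------------------------------------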

Has anyone with a similar configuration experienced problems like this and found a solution?

Regards,

Mick.

________________________________

Mick Waters
Senior Systems Developer

w: +44 (0)208 335 2011
m: +44 (0)7849 887 277
e: mick.waters at motortrak.com<mailto:mick.waters at motortrak.com>

www.motortrak.com<http://www.motortrak.com/>
digital media solutions

Motortrak Ltd, AC Court, High St, Thames Ditton, Surrey, KT7 0SR, United Kingdom


