<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.6000.16640" name=GENERATOR></HEAD>
<BODY>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>Hi, my company is in the process of moving our web
and database servers to new hardware. We have an HP EVA 4100 SAN, which
is used by two database servers running in an Oracle 10g cluster, and that
works fine. We have gone to extreme lengths to ensure high
availability. The SAN has twin disk arrays and twin controllers, and all
servers have dual fibre interfaces. Networking is (or should be) similarly
redundant, with bonded NICs connected in a two-switch configuration, two firewalls
and so on.</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008></SPAN></FONT></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>We also want to share regular Linux filesystems between
our servers (HP DL580 G5s running Red Hat AS 5, kernel
2.6.18-53.1.14.el5), and we chose OCFS2 (1.2.8) to manage the
cluster.</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008></SPAN></FONT></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>As stated, each server in the four-node cluster has a
bonded interface set up as bond0 in a two-switch configuration (each NIC in the
bond is connected to a different switch). Because this is a
two-switch configuration, we are running the bond in active-backup mode, and
this works just fine.</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008></SPAN></FONT></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>Our problem occurred during failover
testing, when we simulated the loss of one of the network switches by powering
it off. The servers rebooted, which makes a
mockery of our attempts at an HA solution.</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008></SPAN></FONT></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>Here is a short section from /var/log/messages
following a reboot of one of the switches to simulate an
outage:</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008></SPAN></FONT></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>--------------------------------------------------------------------------</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>Apr 22 14:25:44 mtkws01p1 kernel: bonding: bond0:
backup interface eth0 is now down<BR>Apr 22 14:25:44 mtkws01p1 kernel: bnx2:
eth0 NIC Link is Down<BR>Apr 22 14:26:13 mtkws01p1 kernel: o2net: connection to
node mtkdb01p2 (num 1) at 10.1.3.50:7777 has been idle for 30.0 seconds,
shutting it down.<BR>Apr 22 14:26:13 mtkws01p1 kernel:
(0,12):o2net_idle_timer:1426 here are some times that might help debug the
situation: (tmr 1208870743.673433 now 1208870773.673192 dr 1208870743.673427 adv
1208870743.673433:1208870743.673434 func (97690d75:2)
1208870697.670758:1208870697.670760)<BR>Apr 22 14:26:13 mtkws01p1 kernel: o2net:
no longer connected to node mtkdb01p2 (num 1) at 10.1.3.50:7777<BR>Apr 22
14:27:38 mtkws01p1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full
duplex<BR>Apr 22 14:27:43 mtkws01p1 kernel: bonding: bond0: backup interface
eth0 is now up<BR>Apr 22 14:28:35 mtkws01p1 kernel:
(5234,9):dlm_do_master_request:1418 ERROR: link to 1 went down!<BR>Apr 22
14:28:35 mtkws01p1 kernel: (5234,9):dlm_get_lock_resource:995 ERROR: status =
-107<BR>Apr 22 14:28:35 mtkws01p1 kernel: (5234,9):ocfs2_broadcast_vote:731
ERROR: status = -107<BR>Apr 22 14:28:35 mtkws01p1 kernel:
(5234,9):ocfs2_do_request_vote:804 ERROR: status = -107<BR>Apr 22 14:28:35
mtkws01p1 kernel: (5234,9):ocfs2_unlink:843 ERROR: status = -107<BR>Apr 22
14:29:29 mtkws01p1 kernel: o2net: connection to node mtkdb02p2 (num 2) at
10.1.3.51:7777 has been idle for 30.0 seconds, shutting it down.<BR>Apr 22
14:29:29 mtkws01p1 kernel: (0,12):o2net_idle_timer:1426 here are some times that
might help debug the situation: (tmr 1208870939.955991 now 1208870969.956343 dr
1208870939.955984 adv 1208870939.955992:1208870939.955993 func (97690d75:2)
1208870697.670916:1208870697.670918)<BR>Apr 22 14:29:29 mtkws01p1 kernel: o2net:
no longer connected to node mtkdb02p2 (num 2) at 10.1.3.51:7777<BR>Apr 22
14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_broadcast_vote:731 ERROR: status =
-107<BR>Apr 22 14:30:18 mtkws01p1 kernel: (5268,6):ocfs2_do_request_vote:804
ERROR: status = -107<BR>Apr 22 14:30:18 mtkws01p1 kernel:
(5268,6):ocfs2_unlink:843 ERROR: status = -107<BR>Apr 22 14:34:23 mtkws01p1
syslogd 1.4.1: restart.<BR>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>--------------------------------------------------------------------------</SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT color=#0000ff
size=2><SPAN class=450553815-22042008></SPAN></FONT></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial><FONT size=2><SPAN
class=450553815-22042008>Things that I have
tried...</SPAN></FONT></FONT></SPAN></DIV></SPAN></FONT></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2><SPAN
class=450553815-22042008>I've tried setting up the bond with both miimon and ARP
monitoring (at different times, of course). When the switch comes back up,
link detection goes up and down several times while the switch initialises, so
I hoped that ARP monitoring might be more reliable. It made no difference at
all.</SPAN></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2><SPAN
class=450553815-22042008></SPAN></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2><SPAN
class=450553815-22042008>I've increased the heartbeat timeout from 31 to 61 but,
as yet, I haven't changed any of the other cluster configuration
variables.</SPAN></FONT></SPAN></DIV>
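<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2>For reference, a
minimal sketch of how the two heartbeat values mentioned above (31 and 61) would
translate into seconds. This assumes that "heartbeat timeout" refers to the O2CB
disk heartbeat threshold and that the heartbeat interval is 2 seconds; neither is
stated in the post, and the function and constant names are made up for
illustration. Note that the 30.0-second figure in the o2net log messages appears
to be a separate network idle timeout, which this setting may not
affect.</FONT></SPAN></DIV>
<PRE>
```python
# Sketch only: convert a disk heartbeat threshold into seconds.
# Assumption: the "heartbeat timeout" in the post is the O2CB heartbeat
# threshold, and one heartbeat iteration takes 2 seconds.

HEARTBEAT_INTERVAL_SECS = 2  # assumed interval between disk heartbeats


def heartbeat_timeout_secs(threshold):
    """Return the seconds of missed heartbeats implied by a threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


if __name__ == "__main__":
    # The two values mentioned in the post: the original and the raised one.
    for threshold in (31, 61):
        print(threshold, "gives", heartbeat_timeout_secs(threshold), "seconds")
```
</PRE>
<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2>Under those
assumptions, raising the value from 31 to 61 doubles the disk heartbeat window,
but it would not by itself lengthen the 30-second network idle period reported
by o2net.</FONT></SPAN></DIV>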
<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2><SPAN
class=450553815-22042008></SPAN></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial size=2><SPAN
class=450553815-22042008>Has anyone with a similar configuration experienced
problems like this and found a solution?</SPAN></FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial
size=2>Regards,</FONT></SPAN></DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial
size=2></FONT></SPAN> </DIV>
<DIV><SPAN class=708142315-22042008><FONT face=Arial
size=2>Mick.</FONT></SPAN></DIV>
<DIV><FONT face=Arial size=2></FONT> </DIV>
<DIV>
<DIV align=left>
<HR>
</DIV>
<P><SPAN style="FONT-SIZE: 11pt; FONT-FAMILY: 'Trebuchet MS'">Mick Waters
</SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: 'Trebuchet MS'"><BR>Senior
Systems Developer </SPAN></P>
<P><SPAN style="FONT-SIZE: 9pt; COLOR: black; FONT-FAMILY: 'Trebuchet MS'">w:
+44 (0)208 335 2011<BR>m: +44 (0)7849 887 277<BR>e: <A
href="mailto:mick.waters@motortrak.com">mick.waters@motortrak.com</A>
</SPAN></P>
<P><SPAN style="FONT-SIZE: 9pt; FONT-FAMILY: 'Trebuchet MS'"><A
title=http://www.motortrak.com/
href="http://www.motortrak.com/">www.motortrak.com</A><BR>digital media
solutions </SPAN></P>
<P><SPAN style="FONT-SIZE: 9pt; FONT-FAMILY: 'Trebuchet MS'">Motortrak Ltd, AC
Court, High St, Thames Ditton, Surrey, KT7 0SR, United Kingdom </SPAN></P>
<P><SPAN
style="FONT-SIZE: 9pt; COLOR: gray; FONT-FAMILY: 'Trebuchet MS'">Registered in
England 3098391 V.A.T. Registered No. 667463890 </SPAN></P></DIV>
<DIV> </DIV></BODY></HTML>