[Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Colin.Farley at ecarecenters.com
Tue Nov 14 22:35:10 PST 2006


I decided to rebuild this from scratch today and got the same result.

Two-node cluster; both boxes remain connected to the shared storage
throughout the tests.

I unplug the network connection from node 0 and get e1000 driver "Tx Unit
Hang" messages on the node 0 console.
Node 1's console displays "o2net_idle_timer:1309 here are some times that
might help debug the situation", followed by additional output.
Node 1 sits for a while and eventually displays "o2quo_make_decision:143
error: fencing this node because it is connected to a half-quorum of one of
two nodes which doesn't include the lowest active node 0".
Node 0 then replays node 1's journal; too bad node 0 still isn't on the
network.

This is what appears in node 1's /var/log/messages after the reboot:

Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1163570146.656474 now 1163570156.655334 dr 1163570146.656446 adv 1163570146.656476:1163570146.656478 func (3a33f0f8:505) 1163570057.403947:1163570057.403950)
Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
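A quick sanity check on the o2net_idle_timer line: assuming the "tmr" and "now" fields are seconds.microseconds since the Unix epoch (an assumption based on their magnitude, not something the log states), their difference should match the "idle for 10 seconds" message. The variable names below are illustrative only.

```python
from datetime import datetime, timezone

# Values copied from node 1's o2net_idle_timer log line.
# ASSUMPTION: both are seconds.microseconds since the Unix epoch.
tmr = 1163570146.656474  # "tmr" field
now = 1163570156.655334  # "now" field (when the idle timer fired)

idle = now - tmr  # roughly 10 s, consistent with "idle for 10 seconds"
stamp = datetime.fromtimestamp(int(now), tz=timezone.utc)

print(f"idle for {idle:.3f} s")
print(stamp)
```

If the epoch assumption holds, the UTC time it prints (05:55:56 on Nov 15) lines up with the 23:55:56 syslog stamp when the host's local clock is six hours behind UTC, which is itself an assumption about the box's time zone.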

I'm confused by this.  Shouldn't node 0 eventually have rebooted, since it
lost network connectivity, while node 1 replayed node 0's journal and kept
going?  As it stands, we are left with no IP-reachable box.

If I do the same test but unplug node 1 instead of node 0, it works as it
should: node 1 fences, and node 0 replays the journal and stays online.

Any input is greatly appreciated.

Thanks,

Colin Farley
Network Administrator
E-Care Contact Center Services
Phone:(204) 940-6244
Fax:(204) 940-7394


                                                                           
From:    Sunil Mushran <Sunil.Mushran at oracle.com>
Date:    11/13/2006 08:23 PM
To:      Colin.Farley at ecarecenters.com
cc:      ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem



Considering that o2net only cares whether it is connected to the other
node or not, it should not make a difference whether one unplugs node 0
or node 1. The result should be the same: node 1 should fence in both
cases.
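The fencing message in node 1's log ("connected to a half-quorum of one of two nodes which doesn't include the lowest active node 0") hints at the tie-break behind this. Below is a simplified, hypothetical Python model of that rule, not the actual o2quo kernel code; the function name `survives` and its parameters are illustrative. A node that can reach more than half of the heartbeating nodes keeps running. At exactly half, it survives only if its side includes the lowest-numbered active node. Because both boxes still heartbeat on the shared disk, a pulled cable in a two-node cluster always produces a one-of-two split, so node 0 wins no matter whose cable was unplugged.

```python
from __future__ import annotations


def survives(node: int, heartbeating: set[int], connected: set[int]) -> bool:
    """Simplified, hypothetical model of the o2quo tie-break (not kernel code).

    heartbeating: nodes still seen heartbeating on the shared disk.
    connected:    peers this node can reach over the network (itself excluded).
    """
    # The node always counts itself as part of its own partition.
    reachable = (connected | {node}) & heartbeating
    total = len(heartbeating)
    if 2 * len(reachable) > total:
        return True  # strict majority: keep running
    if 2 * len(reachable) == total:
        # Exact half (even split): only the side holding the lowest node survives.
        return min(heartbeating) in reachable
    return False  # minority: fence


# Two-node cluster with a cable pulled: both nodes still heartbeat on the
# shared disk, but neither can reach its peer over the network.
cluster = {0, 1}
print(survives(0, cluster, set()))  # node 0 keeps running
print(survives(1, cluster, set()))  # node 1 fences
```

The model makes the design trade-off visible: the quorum logic only sees who is connected to whom, not which side still has a working uplink, so the tie always goes to the lowest node.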

Do you see messages indicating that the node(s) have lost connectivity?
If so, could you share them?

It would be easiest if you could file a bug on oss.oracle.com/bugzilla with
the messages file and a list of the course of events, as in: unplugged
cable on node 0 at time x, etc.

Colin.Farley at ecarecenters.com wrote:
> I'm testing a 2-node cluster in a VMware ESX environment for use as a high
> availability FTP server to support a CRM application.  Both nodes run
> Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM LUN on
> an HP EVA.  All disk connectivity is fine and I haven't seen any problems
> there.  The problem comes when doing some IP failover testing.  The IP
> failover is done using UCARP, so to test failover I tried unplugging one
> node's virtual network cable to see what happens.
>
> If I unplug node 1 everything is fine: node 1 eventually panics and reboots
> while node 0 chugs along fine.  The problem comes when unplugging node 0.
> When node 0 loses network connectivity it does not panic, and eventually
> node 1 panics and reboots.  Is there a reason why the lower node does not
> panic if it loses network connectivity?
>
> Heartbeat thresholds are the same on each node at 31 and both nodes are set
> to reboot on panic; node 0 just never panics.  All installed software is the
> version that comes with Unbreakable 2.0.
>
> I didn't do the config on these boxes, so the first thing I'm going to do on
> Tuesday when I work on this is rebuild both nodes from scratch, but I
> figured I would ask first to see if it was an easy question for someone on
> the list to answer.
>
> Thanks,
>
> Colin Farley
> Network Administrator
> E-Care Contact Center Services
> Phone:(204) 940-6244
> Fax:(204) 940-7394
>
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
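For reference, the heartbeat threshold of 31 mentioned in the original post counts disk heartbeat iterations rather than seconds. Assuming the commonly documented O2CB relation, timeout = (O2CB_HEARTBEAT_THRESHOLD - 1) x 2 seconds, the conversion can be sketched as below; the helper names are hypothetical and the 2-second interval is an assumption, not something stated in this thread.

```python
# ASSUMPTION: O2CB disk heartbeat runs on a 2-second interval and a node is
# declared dead after (threshold - 1) missed intervals.
HB_INTERVAL_S = 2


def hb_timeout_seconds(threshold: int) -> int:
    """Hypothetical helper: heartbeat threshold -> dead-node timeout in seconds."""
    return (threshold - 1) * HB_INTERVAL_S


def threshold_for_timeout(timeout_s: int) -> int:
    """Hypothetical helper: desired timeout in seconds -> heartbeat threshold."""
    return timeout_s // HB_INTERVAL_S + 1


print(hb_timeout_seconds(31))
print(threshold_for_timeout(60))
```

Under that assumption, a threshold of 31 gives a 60-second disk heartbeat timeout, which is separate from the 10-second o2net idle timeout seen in node 1's log.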




