[Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Alexei_Roudnev Alexei_Roudnev at exigengroup.com
Tue Nov 14 23:01:07 PST 2006


It is not a bug; it is all by design.

The problem is that OCFSv2:
- can't support more than one interconnect link, so you always risk losing
the interconnect. To make things worse, it doesn't support a serial
interconnect either;
- can't find a quorum in a 2-node configuration (this is not an OCFSv2
problem but a general concern with any 2-node cluster), so all nodes lose
quorum if the network connection is lost;
- doesn't analyze FS activity, and reboots all nodes without quorum, except
node0, when the network connection is lost.
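
A minimal sketch of the tie-break rule described above, matching the
"half-quorum ... which doesn't include the lowest active node" message quoted
below. The function name and arguments are made up for illustration; this is
not the o2quo source, only the decision as I understand it:

```python
def keeps_quorum(reachable, heartbeating):
    """Return True if this node may keep running, False if it should fence.

    reachable:    node numbers this node can still talk to over the network,
                  including itself (hypothetical input, for illustration).
    heartbeating: node numbers still heartbeating on the shared disk.
    """
    connected = len(set(reachable) & set(heartbeating))
    total = len(heartbeating)
    if 2 * connected > total:
        # Connected to a strict majority: keep running.
        return True
    if 2 * connected == total:
        # Exactly half: only the half holding the lowest node number survives.
        return min(heartbeating) in reachable
    # Minority partition: fence.
    return False


# Two nodes, network link lost, both still heartbeating on disk.
print(keeps_quorum({0}, {0, 1}))  # node 0's view
print(keeps_quorum({1}, {0, 1}))  # node 1's view
```

With two nodes, node 0 always wins the tie, so node 1 fences no matter which
cable was pulled. With three nodes, an isolated node 0 ends up in the
minority and fences itself, which is why a third node helps.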

It can't be improved without supporting multiple interconnects plus better
decisions about fencing (there is no point in fencing a node if it has no
outstanding IO on the cluster file system).

This is a well-known problem with OCFSv2. One solution is to add a third
node and use interface bonding (be sure that the bonding convergence time is
less than the o2cb timeout).
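
A rough sketch of that timing check. It assumes the 10 second idle timeout
shown in the o2net log quoted below, and it estimates bonding failover from
the Linux bonding driver's miimon and downdelay settings (both in
milliseconds). The helper names and the 2 second safety margin are my own,
and switch convergence time is not included, so treat the result as
optimistic:

```python
def estimated_failover_seconds(miimon_ms, downdelay_ms):
    """Worst-case time for bonding to notice a dead link and switch slaves.

    Rough estimate only: one miimon polling interval plus downdelay.
    """
    return (miimon_ms + downdelay_ms) / 1000.0


def bonding_failover_is_safe(failover_seconds,
                             idle_timeout_seconds=10.0,
                             margin_seconds=2.0):
    """True if failover finishes, with some margin, before o2net gives up.

    idle_timeout_seconds defaults to the 10 seconds seen in the log;
    margin_seconds is an arbitrary safety margin (an assumption).
    """
    return failover_seconds + margin_seconds < idle_timeout_seconds


failover = estimated_failover_seconds(miimon_ms=100, downdelay_ms=200)
print(failover, bonding_failover_is_safe(failover))
```

Measure the real failover time by pulling a cable before trusting any
estimate like this.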


----- Original Message ----- 
From: <Colin.Farley at ecarecenters.com>
To: "Sunil Mushran" <Sunil.Mushran at oracle.com>
Cc: <ocfs2-users at oss.oracle.com>
Sent: Tuesday, November 14, 2006 10:35 PM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem


> I decided to rebuild this from scratch today and got the same result.
>
> two cluster node, both boxes remain connected to the shared storage
> throughout tests.
>
> I unplug network connection from node0 and get e1000 driver "Tx Unit Hang"
> messages on node0 console
> node1 console displays "o2net_idle_timer:1309 here are some times to help
> debug the situation" followed by additional output
> node1 sits for a while and eventually displays "o2quo_make_decision:143
> error: fencing this node because it is connected to a half-quorum of one
> of two nodes which doesn't include the lowest active node 0"
> node 0 replays node 1's journal, too bad it still isn't on the network
>
> this is in node 1 /var/log/messages after reboot
>
> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net
> (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some
> times that might help debug the situation: (tmr 1163570146.656474 now
> 1163570156.655334 dr 1163570146.656446 adv
> 1163570146.656476:1163570146.656478 func (3a33f0f8:505)
> 1163570057.403947:1163570057.403950)
> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node
> FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
>
> I'm confused by this.  Shouldn't node 0 have eventually rebooted since it
> lost network connectivity and node 1 replayed node 0's journal and kept
> going?  As it is right now we are left with no IP reachable box.
>
> If I do this same test but unplug node 1 instead of node 0, it works as it
> should. node 1 will fence and node 0 will replay the journal and stay
> online.
>
> Any input is greatly appreciated.
>
> Thanks,
>
> Colin Farley
> Network Administrator
> E-Care Contact Center Services
> Phone:(204) 940-6244
> Fax:(204) 940-7394
>
>
>
> From: Sunil Mushran <Sunil.Mushran at oracle.com>
> To: Colin.Farley at ecarecenters.com
> Cc: ocfs2-users at oss.oracle.com
> Sent: 11/13/2006 08:23 PM
> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>
> Considering o2net only cares whether it is connected to the other node
> or not, it should not make a difference whether one unplugs node 0 or
> node 1.
> The result should be the same. Node 1 should fence in both cases.
>
> Do you see messages indicating that the node(s) have lost connectivity?
> If so, could you share them.
>
> It would be easiest if you could file a bug on oss.oracle.com/bugzilla
> with the messages file and listing the course of events... as in,
> unplugged cable on node 0 at time x, etc.
>
> Colin.Farley at ecarecenters.com wrote:
> > I'm testing a 2 node cluster in a VMWare ESX environment for use as a
> > high availability FTP server to support a CRM application.  Both nodes
> > run Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM
> > LUN on an HP EVA.  All disk connectivity is fine and I haven't seen any
> > problems there.  The problem comes when doing some IP failover testing.
> > The IP failover is done using UCARP, so to test failover I tried
> > unplugging one node's virtual network cable to see what happens.
> >
> > If I unplug node 1 everything is fine, node 1 eventually panics and
> > reboots while node 0 chugs along fine.  The problem comes when
> > unplugging node 0.  When node 0 loses network connectivity it does not
> > panic and eventually node 1 panics and reboots.  Is there a reason why
> > the lower node does not panic if it loses network connectivity?
> >
> > Heartbeat thresholds are the same on each node at 31 and both nodes are
> > set to reboot on panic, node0 just never panics.  All software installed
> > are versions that come with Unbreakable 2.0.
> >
> > I didn't do the config on these boxes so the first thing I'm going to do
> > on Tuesday when I work on this is rebuild both nodes from scratch but I
> > figured I would ask first to see if it was an easy question for someone
> > on the list to answer.
> >
> > Thanks,
> >
> > Colin Farley
> > Network Administrator
> > E-Care Contact Center Services
> > Phone:(204) 940-6244
> > Fax:(204) 940-7394
> >
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >
>



