[Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Wed Nov 15 11:41:03 PST 2006

The behavior is not different - if you break the node1-node0 connection, node1
will self-reboot in the current design.
It doesn't matter what exactly you unplug - the socket on node0, the socket on
node1, or the inter-switch connection if one is used.

Add a third node and everything will change.
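
To illustrate why a third node changes things, here is a minimal Python sketch
of the decision rule as described by the o2quo_make_decision message quoted
below. It is an illustration, not the o2cb code: the function name
should_fence and its arguments are hypothetical, and it assumes every node
keeps heartbeating to the shared disk while only the network link is cut.

```python
def should_fence(this_node, heartbeating, connected):
    """Return True if this_node should fence (reboot) itself.

    heartbeating: set of node numbers still heartbeating on the shared disk
    connected:    set of node numbers this node can reach over the network
    (Both parameter names are assumptions made for this sketch.)
    """
    # A node always counts itself; ignore peers that stopped heartbeating.
    group = (set(connected) | {this_node}) & set(heartbeating)
    total = len(heartbeating)

    if 2 * len(group) > total:
        return False  # connected to a majority: keep running
    if 2 * len(group) == total:
        # Exactly half: only the half holding the lowest node number survives.
        return min(heartbeating) not in group
    return True  # connected to a minority: fence


# Two nodes, link cut: the same outcome whichever cable was pulled.
print(should_fence(0, {0, 1}, {0}), should_fence(1, {0, 1}, {1}))

# Three nodes, node 0 isolated: now node 0 is the one that fences.
print(should_fence(0, {0, 1, 2}, {0}), should_fence(1, {0, 1, 2}, {1, 2}))
```

With two nodes, cutting the link leaves each node with a half-quorum, so only
node 1 fences no matter where the break is. With three nodes, the isolated
node is in a minority and fences itself, whatever its node number.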

----- Original Message ----- 
From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
Cc: <Colin.Farley at ecarecenters.com>; <ocfs2-users at oss.oracle.com>
Sent: Wednesday, November 15, 2006 11:03 AM
Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

> You are missing his point. He is not saying that fencing is the problem.
> He is asking as to why the behavior differs between unplugging node 0
> and node 1.
>
> Alexei_Roudnev wrote:
> > It is not a bug; it is all by design.
> >
> > The problem is that OCFSv2:
> > - can't support more than one interconnect link, so you always risk
> > losing the interconnect;
> > in addition, to make things worse, it doesn't support a serial
> > interconnect;
> > - can't find a quorum in a 2-node configuration (it's not an OCFSv2
> > problem but a general concern with any 2-node cluster) -
> >  so all nodes lose quorum if the network connection is lost;
> > - doesn't analyze FS activity, and reboots all nodes without quorum,
> > except node0, when the network connection is lost.
> >
> > It can't be improved without supporting multiple interconnects plus
> > better decisions about fencing (there is no point in fencing a node
> > if it has no outstanding IO on the cluster file system).
> >
> > This is a well-known problem with OCFSv2. One solution is to add a third
> > node and use interface bonding (be sure that the interface convergence
> > time is less than the o2cb timeout).
> >
> >
> > ----- Original Message ----- 
> > From: <Colin.Farley at ecarecenters.com>
> > To: "Sunil Mushran" <Sunil.Mushran at oracle.com>
> > Cc: <ocfs2-users at oss.oracle.com>
> > Sent: Tuesday, November 14, 2006 10:35 PM
> > Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
> >
> >
> >
> >> I decided to rebuild this from scratch today and got the same result.
> >>
> >> Two cluster nodes; both boxes remain connected to the shared storage
> >> throughout the tests.
> >>
> >> I unplug the network connection from node0 and get e1000 driver
> >> "Tx Unit Hang" messages on the node0 console.
> >> The node1 console displays "o2net_idle_timer:1309 here are some times to
> >> help debug the situation" followed by additional output.
> >> node1 sits for a while and eventually displays "o2quo_make_decision:143
> >> error: fencing this node because it is connected to a half-quorum of one
> >> of two nodes which doesn't include the lowest active node 0".
> >> node 0 replays node 1's journal; too bad it still isn't on the network.
> >>
> >> This is in node 1's /var/log/messages after the reboot:
> >>
> >> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node
> >> FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10
> >> seconds, shutting it down.
> >> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some
> >> times that might help debug the situation: (tmr 1163570146.656474 now
> >> 1163570156.655334 dr 1163570146.656446 adv
> >> 1163570146.656476:1163570146.656478 func (3a33f0f8:505)
> >> 1163570057.403947:1163570057.403950)
> >> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node
> >> FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
> >>
> >> I'm confused by this.  Shouldn't node 0 have eventually rebooted, since
> >> it lost network connectivity, while node 1 replayed node 0's journal and
> >> kept going?  As it is right now, we are left with no IP-reachable box.
> >>
> >> If I do this same test but unplug node 1 instead of node 0, it works as
> >> it should: node 1 will fence and node 0 will replay the journal and stay
> >> online.
> >>
> >> Any input is greatly appreciated.
> >>
> >> Thanks,
> >>
> >> Colin Farley
> >> Network Administrator
> >> E-Care Contact Center Services
> >> Phone:(204) 940-6244
> >> Fax:(204) 940-7394
> >>
> >>
> >>
> >> Sunil Mushran <Sunil.Mushran at oracle.com>
> >> 11/13/2006 08:23 PM
> >> To: Colin.Farley at ecarecenters.com
> >> cc: ocfs2-users at oss.oracle.com
> >> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
> >>
> >> Considering o2net only cares whether it is connected to the other node
> >> or not, it should not make a difference whether one unplugs node 0 or
> >> node 1.
> >> The result should be the same. Node 1 should fence in both cases.
> >>
> >> Do you see messages indicating that the node(s) have lost connectivity?
> >> If so, could you share them?
> >>
> >> It would be easiest if you could file a bug on oss.oracle.com/bugzilla
> >> with the messages file and a list of the course of events... as in,
> >> unplugged cable on node 0 at time x, etc.
> >>
> >> Colin.Farley at ecarecenters.com wrote:
> >>
> >>> I'm testing a 2-node cluster in a VMware ESX environment for use as a
> >>> high-availability FTP server to support a CRM application.  Both nodes
> >>> run Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM
> >>> LUN on an HP EVA.  All disk connectivity is fine and I haven't seen any
> >>> problems there.  The problem comes when doing some IP failover testing.
> >>> The IP failover is done using UCARP, so to test failover I tried
> >>> unplugging one node's virtual network cable to see what happens.
> >>>
> >>> If I unplug node 1 everything is fine: node 1 eventually panics and
> >>> reboots while node 0 chugs along fine.  The problem comes when
> >>> unplugging node 0.  When node 0 loses network connectivity it does not
> >>> panic, and eventually node 1 panics and reboots.  Is there a reason why
> >>> the lower node does not panic if it loses network connectivity?
> >>>
> >>> Heartbeat thresholds are the same on each node at 31 and both nodes are
> >>> set to reboot on panic; node0 just never panics.  All software installed
> >>> is the version that comes with Unbreakable 2.0.
> >>>
> >>> I didn't do the config on these boxes, so the first thing I'm going to
> >>> do on Tuesday when I work on this is rebuild both nodes from scratch,
> >>> but I figured I would ask first to see if it was an easy question for
> >>> someone on the list to answer.
> >>>
> >>> Thanks,
> >>>
> >>> Colin Farley
> >>> Network Administrator
> >>> E-Care Contact Center Services
> >>> Phone:(204) 940-6244
> >>> Fax:(204) 940-7394
> >>>
> >>>
> >>> _______________________________________________
> >>> Ocfs2-users mailing list
> >>> Ocfs2-users at oss.oracle.com
> >>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >>>
> >>>
> >>
> >
> >
>