[Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Wed Nov 15 11:03:40 PST 2006

You are missing his point. He is not saying that fencing is the problem.
He is asking why the behavior differs between unplugging node 0 and
unplugging node 1.
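
For reference, here is a rough sketch of the tie-break that the
"o2quo_make_decision" message quoted further down seems to describe. It is
a simplified Python model, not the kernel code; the function name and
arguments are made up for illustration, and it assumes every node can
still see the other's disk heartbeat while the network link is down.

```python
def should_fence(heartbeating, connected):
    """Decide whether the local node should fence itself.

    heartbeating: set of node numbers seen heartbeating on the shared disk
                  (hypothetical input, includes the local node)
    connected:    set of those nodes the local node can still reach over
                  the network (includes the local node itself)
    """
    hb_count = len(heartbeating)
    conn_count = len(connected & heartbeating)
    lowest = min(heartbeating)

    if hb_count % 2 == 1:
        # Odd number of nodes: a strict majority is required.
        return conn_count < (hb_count + 1) // 2

    # Even number of nodes: half is enough, but only for the half that
    # can still reach the lowest-numbered heartbeating node.
    half = hb_count // 2
    if conn_count < half:
        return True
    return conn_count == half and lowest not in connected


# Two-node cluster, network link down, both nodes still heartbeating on disk.
print("node 0 fences:", should_fence({0, 1}, {0}))
print("node 1 fences:", should_fence({0, 1}, {1}))
```

Under this model the outcome depends only on the node numbers, not on
whose cable was pulled, which would match what Colin is seeing.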

Alexei_Roudnev wrote:
> It is not a bug; it is all by design.
>
> The problem is that OCFSv2:
> - can't support more than 1 interconnect link, so you always risk losing
> the interconnect;
> In addition, to make things worse, it doesn't support a serial interconnect;
> - can't find a quorum in a 2-node configuration (it's not an OCFSv2 problem
> but a general concern with any 2-node cluster),
>  so all nodes lose quorum if the network connection is lost;
> - doesn't analyze FS activity, and reboots all nodes without quorum, except
> node 0, when the network connection is lost.
>
> It can't be improved without supporting multiple interconnects + better
> decisions about fencing (there is no use in fencing a node if it has
> no outstanding IO on the cluster file system).
>
> This is a well-known problem with OCFSv2. One solution is to add a 3rd node
> and use interface bonding (be sure that the interface convergence time is
> less than the o2cb timeout).
>
>
> ----- Original Message ----- 
> From: <Colin.Farley at ecarecenters.com>
> To: "Sunil Mushran" <Sunil.Mushran at oracle.com>
> Cc: <ocfs2-users at oss.oracle.com>
> Sent: Tuesday, November 14, 2006 10:35 PM
> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>
>
>   
>> I decided to rebuild this from scratch today and got the same result.
>>
>> Two-node cluster; both boxes remain connected to the shared storage
>> throughout the tests.
>>
>> I unplug the network connection from node 0 and get e1000 driver "Tx Unit
>> Hang" messages on the node 0 console.
>> The node 1 console displays "o2net_idle_timer:1309 here are some times to
>> help debug the situation" followed by additional output.
>> Node 1 sits for a while and eventually displays "o2quo_make_decision:143
>> error: fencing this node because it is connected to a half-quorum of one of
>> two nodes which doesn't include the lowest active node 0".
>> Node 0 replays node 1's journal; too bad it still isn't on the network.
>>
>> this is in node 1 /var/log/messages after reboot
>>
>> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net
>> (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
>> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some
>> times that might help debug the situation: (tmr 1163570146.656474 now
>> 1163570156.655334 dr 1163570146.656446 adv
>> 1163570146.656476:1163570146.656478 func (3a33f0f8:505)
>> 1163570057.403947:1163570057.403950)
>> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node
>> FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
>>
>> I'm confused by this.  Shouldn't node 0 have eventually rebooted since it
>> lost network connectivity, and node 1 replayed node 0's journal and kept
>> going?  As it is right now we are left with no IP-reachable box.
>>
>> If I do this same test but unplug node 1 instead of node 0, it works as it
>> should. Node 1 will fence and node 0 will replay the journal and stay
>> online.
>>
>> Any input is greatly appreciated.
>>
>> Thanks,
>>
>> Colin Farley
>> Network Administrator
>> E-Care Contact Center Services
>> Phone:(204) 940-6244
>> Fax:(204) 940-7394
>>
>>
>>
>> From: Sunil Mushran <Sunil.Mushran at oracle.com>
>> Sent: 11/13/2006 08:23 PM
>> To: Colin.Farley at ecarecenters.com
>> cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>>
>> Considering o2net only cares whether it is connected to the other node
>> or not, it should not make a difference whether one unplugs node 0 or
>> node 1.
>> The result should be the same. Node 1 should fence in both cases.
>>
>> Do you see messages indicating that the node(s) have lost connectivity?
>> If so, could you share them.
>>
>> It would be easiest if you could file a bug on oss.oracle.com/bugzilla with
>> the messages file and listing the course of events... as in, unplugged
>> cable on node 0 at time x, etc.
>>
>> Colin.Farley at ecarecenters.com wrote:
>>> I'm testing a 2 node cluster in a VMware ESX environment for use as a high
>>> availability FTP server to support a CRM application.  Both nodes run
>>> Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM LUN on
>>> an HP EVA.  All disk connectivity is fine and I haven't seen any problems
>>> there.  The problem comes when doing some IP failover testing.  The IP
>>> failover is done using UCARP, so to test failover I tried unplugging one
>>> node's virtual network cable to see what happens.
>>>
>>> If I unplug node 1 everything is fine, node 1 eventually panics and reboots
>>> while node 0 chugs along fine.  The problem comes when unplugging node 0.
>>> When node 0 loses network connectivity it does not panic and eventually
>>> node 1 panics and reboots.  Is there a reason why the lower node does not
>>> panic if it loses network connectivity?
>>>
>>> Heartbeat thresholds are the same on each node at 31 and both nodes are set
>>> to reboot on panic; node 0 just never panics.  All installed software is
>>> the version that comes with Unbreakable 2.0.
>>>
>>> I didn't do the config on these boxes so the first thing I'm going to do on
>>> Tuesday when I work on this is rebuild both nodes from scratch but I
>>> figured I would ask first to see if it was an easy question for someone on
>>> the list to answer.
>>>
>>> Thanks,
>>>
>>> Colin Farley
>>> Network Administrator
>>> E-Care Contact Center Services
>>> Phone:(204) 940-6244
>>> Fax:(204) 940-7394
>>>
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>
>>>       
>>
>>
>>     
>
>
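
As a quick sanity check on the o2net_idle_timer line quoted above, the raw
timestamps can be subtracted to see how long the link had been quiet. This
is only a sketch: it assumes "tmr" is when the idle timer was armed and "dr"
is the last time data arrived on the socket, which the log itself does not
state, and the variable names are made up.

```python
from decimal import Decimal

# Values copied from the quoted log line (epoch seconds with microseconds).
# Decimal avoids float rounding at this magnitude.
tmr = Decimal("1163570146.656474")         # assumed: idle timer armed
now = Decimal("1163570156.655334")         # time the timer fired
data_ready = Decimal("1163570146.656446")  # assumed: last data seen

idle = now - tmr
since_data = now - data_ready

print(f"timer armed {idle} s before it fired")
print(f"last data seen {since_data} s before it fired")
```

Both gaps come out just under 10 seconds, consistent with the "has been
idle for 10 seconds" message on the preceding log line.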