[Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Sunil Mushran Sunil.Mushran at oracle.com
Wed Nov 15 11:39:58 PST 2006


Again, read his email.

Alexei_Roudnev wrote:
> The behavior is no different: if you break the node1-node0 connection, node1
> will self-reboot in the current design.
> It doesn't matter what exactly you unplug: the socket on node1, the socket on
> node2, or the inter-switch connection if one is used.
>
> Add node-3 and everything will change.
>
> ----- Original Message ----- 
> From: "Sunil Mushran" <Sunil.Mushran at oracle.com>
> To: "Alexei_Roudnev" <Alexei_Roudnev at exigengroup.com>
> Cc: <Colin.Farley at ecarecenters.com>; <ocfs2-users at oss.oracle.com>
> Sent: Wednesday, November 15, 2006 11:03 AM
> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>
>
>   
>> You are missing his point. He is not saying that fencing is the problem.
>> He is asking as to why the behavior differs between unplugging node 0
>> and node 1.
>>
>> Alexei_Roudnev wrote:
>>
>>> It is not a bug; it is all by design.
>>>
>>> The problem is that OCFSv2:
>>> - can't support more than one interconnect link, so you always risk losing
>>>   the interconnect. In addition, to make things worse, it doesn't support a
>>>   serial interconnect;
>>> - can't find a quorum in a 2-node configuration (this is not an OCFSv2
>>>   problem but a general concern with any 2-node cluster), so all nodes lose
>>>   quorum if the network connection is lost;
>>> - doesn't analyze FS activity, and reboots all nodes without quorum, except
>>>   node0, when the network connection is lost.
>>>
>>> It can't be improved without supporting multiple interconnects plus better
>>> decisions about fencing (there is no point in fencing a node if it has no
>>> outstanding I/O on the cluster file system).
>>>
>>> This is a well-known problem with OCFSv2. One solution is to add a third
>>> node and use interface bonding (be sure that the interface convergence time
>>> is less than the o2cb timeout).
>>>
>>>
>>> ----- Original Message ----- 
>>> From: <Colin.Farley at ecarecenters.com>
>>> To: "Sunil Mushran" <Sunil.Mushran at oracle.com>
>>> Cc: <ocfs2-users at oss.oracle.com>
>>> Sent: Tuesday, November 14, 2006 10:35 PM
>>> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>>>
>>>
>>>
>>>       
>>>> I decided to rebuild this from scratch today and got the same result.
>>>>
>>>> Two cluster nodes; both boxes remain connected to the shared storage
>>>> throughout the tests.
>>>>
>>>> I unplug the network connection from node0 and get e1000 driver "Tx Unit
>>>> Hang" messages on the node0 console.
>>>> The node1 console displays "o2net_idle_timer:1309 here are some times to
>>>> help debug the situation" followed by additional output.
>>>> node1 sits for a while and eventually displays "o2quo_make_decision:143
>>>> error: fencing this node because it is connected to a half-quorum of one
>>>> of two nodes which doesn't include the lowest active node 0".
>>>> Node 0 replays node 1's journal; too bad it still isn't on the network.
>>>>
>>>> This is in node 1's /var/log/messages after the reboot:
>>>>
>>>> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
>>>> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1163570146.656474 now 1163570156.655334 dr 1163570146.656446 adv 1163570146.656476:1163570146.656478 func (3a33f0f8:505) 1163570057.403947:1163570057.403950)
>>>> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
>>>>
>>>> I'm confused by this.  Shouldn't node 0 have eventually rebooted since it
>>>> lost network connectivity, and node 1 replayed node 0's journal and kept
>>>> going?  As it is right now we are left with no IP reachable box.
>>>>
>>>> If I do this same test but unplug node 1 instead of node 0, it works as it
>>>> should: node 1 will fence and node 0 will replay the journal and stay
>>>> online.
>>>>
>>>> Any input is greatly appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Colin Farley
>>>> Network Administrator
>>>> E-Care Contact Center Services
>>>> Phone:(204) 940-6244
>>>> Fax:(204) 940-7394
>>>>
>>>>
>>>> From: Sunil Mushran <Sunil.Mushran at oracle.com>
>>>> To: Colin.Farley at ecarecenters.com
>>>> Cc: ocfs2-users at oss.oracle.com
>>>> Sent: 11/13/2006 08:23 PM
>>>> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>>>>
>>>> Considering o2net only cares whether it is connected to the other node
>>>> or not, it should not make a difference whether one unplugs node 0 or
>>>> node 1.
>>>> The result should be the same. Node 1 should fence in both cases.
>>>>
>>>> Do you see messages indicating that the node(s) have lost connectivity?
>>>> If so, could you share them.
>>>>
>>>> It would be easiest if you could file a bug on oss.oracle.com/bugzilla
>>>> with the messages file and listing the course of events... as in,
>>>> unplugged cable on node 0 at time x, etc.
>>>>
>>>> Colin.Farley at ecarecenters.com wrote:
>>>>
>>>>         
>>>>> I'm testing a 2-node cluster in a VMware ESX environment for use as a
>>>>> high availability FTP server to support a CRM application.  Both nodes
>>>>> run Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM
>>>>> LUN on an HP EVA.  All disk connectivity is fine and I haven't seen any
>>>>> problems there.  The problem comes when doing some IP failover testing.
>>>>> The IP failover is done using UCARP, so to test failover I tried
>>>>> unplugging one node's virtual network cable to see what happens.
>>>>>
>>>>> If I unplug node 1 everything is fine: node 1 eventually panics and
>>>>> reboots while node 0 chugs along fine.  The problem comes when unplugging
>>>>> node 0.  When node 0 loses network connectivity it does not panic, and
>>>>> eventually node 1 panics and reboots.  Is there a reason why the lower
>>>>> node does not panic if it loses network connectivity?
>>>>>
>>>>> Heartbeat thresholds are the same on each node at 31 and both nodes are
>>>>> set to reboot on panic; node0 just never panics.  All software installed
>>>>> are versions that come with Unbreakable 2.0.
>>>>>
>>>>> I didn't do the config on these boxes, so the first thing I'm going to do
>>>>> on Tuesday when I work on this is rebuild both nodes from scratch, but I
>>>>> figured I would ask first to see if it was an easy question for someone
>>>>> on the list to answer.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Colin Farley
>>>>> Network Administrator
>>>>> E-Care Contact Center Services
>>>>> Phone:(204) 940-6244
>>>>> Fax:(204) 940-7394
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Ocfs2-users mailing list
>>>>> Ocfs2-users at oss.oracle.com
>>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>>
>>>>>
>
>   
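
For readers following the thread, the tie-break named in the
o2quo_make_decision message quoted above can be sketched roughly as below.
This is only an illustration, not the OCFS2 source: the function name
should_fence and its arguments are hypothetical, the odd-sized-cluster branch
(strict majority) is an assumption about the general rule rather than
something stated in this thread, and it assumes every node keeps heartbeating
on the shared disk while the interconnect is down.

```python
def should_fence(self_node, heartbeating, connected):
    """Return True if `self_node` should fence (reboot) itself.

    heartbeating: node numbers still seen heartbeating on the shared disk.
    connected:    node numbers this node can still reach over the network.
    Both names are hypothetical; the local node always counts as reachable.
    """
    alive = set(heartbeating) | {self_node}
    reachable = (set(connected) | {self_node}) & alive
    if len(alive) % 2 == 1:
        # Odd number of live nodes (assumed rule): need a strict majority.
        return len(reachable) < (len(alive) + 1) // 2
    half = len(alive) // 2
    if len(reachable) < half:
        return True
    if len(reachable) == half:
        # Even split: only the half holding the lowest node number survives.
        return min(alive) not in reachable
    return False


# Two-node cluster with the interconnect broken. It does not matter which
# cable was pulled: each node still heartbeats on disk but reaches only itself.
nodes = {0, 1}
print(should_fence(0, nodes, {0}))  # False: node 0 stays up
print(should_fence(1, nodes, {1}))  # True: node 1 fences
```

Under these assumptions the outcome is the same whichever node is unplugged:
node 1 fences and node 0 stays up, even when node 0 is the one without
network connectivity.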



More information about the Ocfs2-users mailing list