[Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem

Sunil Mushran Sunil.Mushran at oracle.com
Wed Nov 15 11:02:01 PST 2006


Again, please file a bug on oss.oracle.com/bugzilla and upload
the messages files from both nodes. It is hard to diagnose anything
with incomplete information.
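
For reference, the o2quo message quoted below describes the tie-break used when a cluster splits evenly. The following is only a rough sketch of that rule as I understand it, not the actual o2quo code; the function name and parameters are hypothetical and exist purely for illustration.

```python
def should_fence(local_node, heartbeating, connected):
    """Return True if local_node should fence itself (illustrative only).

    heartbeating: set of node numbers currently heartbeating on disk
    connected:    set of node numbers local_node can reach over the network
    """
    # A node always counts itself; only heartbeating nodes matter.
    reachable = (set(connected) | {local_node}) & set(heartbeating)
    total = len(heartbeating)

    if total % 2 == 1:
        # Odd number of nodes: a strict majority is required to survive.
        return len(reachable) < total // 2 + 1

    half = total // 2
    if len(reachable) < half:
        # Fewer than half the nodes are reachable: fence.
        return True
    if len(reachable) == half:
        # Even split: the half containing the lowest node number survives.
        return min(heartbeating) not in reachable
    return False


# Two-node cluster with the network link between the nodes broken.
print(should_fence(0, {0, 1}, {0}))  # node 0's view
print(should_fence(1, {0, 1}, {1}))  # node 1's view
```

Under this sketch the outcome of a two-node split does not depend on which cable was pulled: node 0 holds the lowest node number and keeps running, while node 1 fences.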

Colin.Farley at ecarecenters.com wrote:
> I decided to rebuild this from scratch today and got the same result.
>
> Two-node cluster; both boxes remain connected to the shared storage
> throughout the tests.
>
> I unplug the network connection from node0 and get e1000 driver "Tx Unit Hang"
> messages on the node0 console.
> The node1 console displays "o2net_idle_timer:1309 here are some times to help
> debug the situation" followed by additional output.
> Node1 sits for a while and eventually displays "o2quo_make_decision:143
> error: fencing this node because it is connected to a half-quorum of one of
> two nodes which doesn't include the lowest active node 0".
> Node 0 replays node 1's journal; too bad it still isn't on the network.
>
> This is in node 1's /var/log/messages after the reboot:
>
> Nov 14 23:55:56 FTP02 kernel: o2net: connection to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777 has been idle for 10 seconds, shutting it down.
> Nov 14 23:55:56 FTP02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1163570146.656474 now 1163570156.655334 dr 1163570146.656446 adv 1163570146.656476:1163570146.656478 func (3a33f0f8:505) 1163570057.403947:1163570057.403950)
> Nov 14 23:55:56 FTP02 kernel: o2net: no longer connected to node FTP01.mydomain.net (num 0) at 10.xxx.0.45:7777
>
> I'm confused by this.  Shouldn't node 0 have eventually rebooted since it
> lost network connectivity and node 1 replayed node 0's journal and kept
> going?  As it is right now we are left with no IP-reachable box.
>
> If I do this same test but unplug node 1 instead of node 0, it works as it
> should. Node 1 will fence and node 0 will replay the journal and stay
> online.
>
> Any input is greatly appreciated.
>
> Thanks,
>
> Colin Farley
> Network Administrator
> E-Care Contact Center Services
> Phone:(204) 940-6244
> Fax:(204) 940-7394
>
>
> Sunil Mushran <Sunil.Mushran at oracle.com> wrote on 11/13/2006 08:23 PM:
> To: Colin.Farley at ecarecenters.com
> cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] ESX and Unbreakable 2.0 OCFS2 problem
>
> Considering o2net only cares whether it is connected to the other node
> or not, it should not make a difference whether one unplugs node 0 or
> node 1.
> The result should be the same. Node 1 should fence in both cases.
>
> Do you see messages indicating that the node(s) have lost connectivity?
> If so, could you share them.
>
> It would be easiest if you could file a bug on oss.oracle.com/bugzilla with
> the messages file and a listing of the course of events, as in, unplugged
> the cable on node 0 at time x, etc.
>
> Colin.Farley at ecarecenters.com wrote:
>> I'm testing a 2 node cluster in a VMWare ESX environment for use as a high
>> availability FTP server to support a CRM application.  Both nodes run
>> Unbreakable 2.0 x86_64.  They access a 300GB OCFS2 volume on an RDM LUN on
>> an HP EVA.  All disk connectivity is fine and I haven't seen any problems
>> there.  The problem comes when doing some IP failover testing.  The IP
>> failover is done using UCARP, so to test failover I tried unplugging one
>> node's virtual network cable to see what happens.
>>
>> If I unplug node 1 everything is fine: node 1 eventually panics and reboots
>> while node 0 chugs along fine.  The problem comes when unplugging node 0.
>> When node 0 loses network connectivity it does not panic and eventually
>> node 1 panics and reboots.  Is there a reason why the lower node does not
>> panic if it loses network connectivity?
>>
>> Heartbeat thresholds are the same on each node at 31 and both nodes are set
>> to reboot on panic; node0 just never panics.  All installed software is at
>> the versions that come with Unbreakable 2.0.
>>
>> I didn't do the config on these boxes so the first thing I'm going to do on
>> Tuesday when I work on this is rebuild both nodes from scratch, but I
>> figured I would ask first to see if it was an easy question for someone on
>> the list to answer.
>>
>> Thanks,
>>
>> Colin Farley
>> Network Administrator
>> E-Care Contact Center Services
>> Phone:(204) 940-6244
>> Fax:(204) 940-7394


