[Ocfs2-users] How to force node [a] to consider node [b] dead?

Sunil Mushran sunil.mushran at oracle.com
Mon Jan 26 10:22:55 PST 2009


Your analysis of the problem is correct. Because you have set
the timeout to 20 minutes, the cluster waits 20 minutes before
declaring the node dead, and only then can the rebooted node be
re-admitted into the dlm domain. There is no solution other than
reducing the timeout. That you have to set it as high as 20 minutes
suggests that the SAN/io setup needs to be looked into.
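
For reference, here is the arithmetic behind those numbers (a quick
sketch in Python, assuming the documented o2hb behavior: one heartbeat
write every 2 seconds, with a node declared dead after threshold - 1
missed intervals):

    # Sketch: how O2CB_HEARTBEAT_THRESHOLD maps to the fencing timeout.
    # Assumes the default 2-second o2hb heartbeat interval.
    def heartbeat_timeout_seconds(threshold, hb_interval_s=2):
        # A node is declared dead after (threshold - 1) heartbeat intervals.
        return (threshold - 1) * hb_interval_s

    print(heartbeat_timeout_seconds(31))   # 60 seconds
    print(heartbeat_timeout_seconds(601))  # 1200 seconds = 20 minutes

A roughly 15-minute reboot therefore completes well inside the
20-minute window, which is exactly why the surviving node still sees
the old incarnation as alive.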

Karim Alkhayer wrote:
> Hi Sunil,
>
> I've already suggested SP4, but the platform supplier is hesitant to support
> the upgrade because of the potential impact on the servers and the SAN.
>
> The upgrade is a dead end for now. Are there any alternatives?
>
> Regards,
> Karim
>
> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com] 
> Sent: Monday, January 26, 2009 7:52 PM
> To: Karim Alkhayer
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] How to force node [a] to consider node [b] dead?
>
> You are running a three-year-old version of the fs. Please upgrade
> to something more current, like sles9 sp4 or sles10 sp1, which
> bundle ocfs2 1.2.9, or sles10 sp2, which ships ocfs2 1.4.1.
>
> Karim Alkhayer wrote:
>   
>> Hi All,
>>
>> We have O2CB_HEARTBEAT_THRESHOLD set to 601 because the SAN sometimes
>> gets overloaded, which was causing the nodes to panic.
>>
>> This value has proven to be more stable than 31. However, there are
>> times when one of the nodes, for instance node [b], crashes for
>> whatever reason. When the troublesome node is brought back up, the
>> auto mount is attempted but doesn't succeed; "Transport endpoint is
>> not connected" is usually displayed.
>>
>> My opinion is this: the mount doesn't succeed because node [a] still
>> thinks that node [b] is alive.
>>
>> We're talking about a restart that can take around 15 minutes, so
>> basically, the node is back up before the threshold has elapsed.
>>
>> I was wondering if there is a workaround to kick node [b] out of the
>> cluster so that it can join again. What I've done so far (the
>> incident has happened once, a month ago) is to restart the cluster
>> services on both machines. This was a very expensive solution, as all
>> database instances had to go down.
>>
>> OCFS2 1.2.1, SLES9 SP3 2.6.5-7.257-default, RAC 10.1.0.5, 5 DBs
>>
>> Thanks
>>
>> Karim
>>



