[Ocfs2-users] Fencing options

Wed Jan 13 10:47:48 PST 2010

The problem was likely storage related and not network related.

Do you have netconsole set up? If so, look at the logs. They will tell
you why that node was fenced.
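
As a rough sketch of what "look at the logs" could mean in practice,
assuming the netconsole output has already been captured to a plain-text
file on the receiving host, the Python snippet below keeps only the lines
that mention the OCFS2 cluster-stack components. The keyword list (o2net,
o2hb, o2quo, "fenc") and the helper names are assumptions chosen for
illustration, not something taken from this thread; adjust them to
whatever your logs actually contain.

```python
import re

# Assumed keywords: o2net, o2hb and o2quo are the OCFS2 network, heartbeat
# and quorum components; "fenc" catches "fence", "fenced" and "fencing".
KEYWORDS = ("o2net", "o2hb", "o2quo", "fenc")
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def fencing_related(lines):
    """Return the lines that mention any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_file(path):
    """Print matching lines from a captured netconsole log (hypothetical path)."""
    with open(path, errors="replace") as log:
        for hit in fencing_related(log):
            print(hit)
```

Calling scan_file() on the capture from the node that rebooted should
narrow the output down to the last messages logged before the restart,
which is usually where the reason for the fence shows up.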

Angelo McComis wrote:
> After learning more about what fencing means when you see it in
> action (the default of emergency_restart();), I'm now researching
> how to determine what causes a fencing to occur.
>
> This is sles10.2 on the 2.6.16-42.5 kernel which means 1.4.1-sles is  
> the version of ocfs2.
>
> I know the default reply from Sunil will be to ask Novell....  :-).  
> But we actually have a support partnership with HP, and since they're
> not Novell, we have to wait for their backline contacts to make a
> connection. That is why I'm asking the users community in parallel
> with the support call. The call has been open for 8 hrs now with no
> callback yet.
>
> We have a set of 6 servers in a cluster, and they're only in a cluster
> for the sake of ocfs2 for a shared volume. Today, within a one-minute
> time span, node 1 reported that it lost connectivity to nodes 2 and 3,
> and about a minute later that it lost connectivity to nodes 0 and 5.
> Nodes 1 and 4 stayed up, but 2, 3, 0, and 5 were all evicted and
> rebooted.
>
> This happened on the prod cluster and simultaneously on our nonprod
> cluster. The only difference between nonprod and prod is that nonprod
> has 7 nodes rather than 6. On the nonprod cluster, 4 out of 7 servers
> rebooted due to node eviction.
>
> This set of servers is set up across two blade chassis, and the NIC
> config is a private VLAN, non-routed. It's eth1 using a 192.168.x.y
> scheme. The blade servers were running a load average of about 1.1 or
> so but are 8-ways (dual quad core), so that isn't exactly taxing the
> boxes. The LAN environment is 10gbit fiber from the connect modules
> on the chassis to the switches, with gig uplinks on the blades
> themselves. Ifconfig shows no evidence of packet loss.
>
> Questions:
> Can we set up redundant heartbeat IP connections? Can we also add a
> disk heartbeat? If it truly is network connectivity, can we set the
> timeout to be more lenient? And can we change the fencing to something
> other than a machine reset, e.g. unmount the volume, change it to
> read-only, etc.?
>
> Thanks...
>
> Angelo