[Ocfs2-users] Fencing options

Angelo McComis angelo at mccomis.com
Tue Jan 12 18:38:01 PST 2010


After seeing fencing in action and learning what it actually means
(the default is emergency_restart()), I'm now researching how to
determine what causes a fence to occur.

This is SLES 10.2 on the 2.6.16-42.5 kernel, which means the ocfs2
version is 1.4.1-sles.

I know the default reply from Sunil will be to ask Novell....  :-).
But we actually have a support partnership with HP, and since they're
not Novell, we have to wait for their backline contacts to get in
touch. That is why I'm asking the users community in parallel with
the support call, which has been open for 8 hours now with no
callback yet.

We have a set of 6 servers in a cluster, and they're clustered only
for the sake of an ocfs2 shared volume. Today, within a one-minute
time span, node 1 reported that it lost connectivity to nodes 2 and 3,
followed about a minute later by lost connectivity to nodes 0 and 5.
Nodes 1 and 4 stayed up, but 2, 3, 0, and 5 were all evicted and
rebooted.

This happened on the prod cluster and on our nonprod cluster
simultaneously. The only difference between nonprod and prod is that
nonprod has 7 nodes rather than 6. On the nonprod cluster, 4 out of 7
servers rebooted due to node eviction.

These servers are set up across two blade chassis, and the NIC config
is a private, non-routed VLAN: eth1 using a 192.168.x.y scheme. The
blade servers were running a load average of about 1.1 or so, but
they are 8-way (dual quad-core), so that isn't exactly taxing the
boxes. The LAN environment is 10 Gbit fiber from the connect modules
on the chassis to the switches, with gigabit uplinks on the blades
themselves. ifconfig shows no evidence of packet loss.

Questions:
Can we set up redundant heartbeat IP connections? Can we also add a
disk heartbeat? If it truly is network connectivity, can we set the
timeout to be more lenient? And can we change the fencing action to
something other than a machine reset, e.g. unmounting the volume or
changing it to read-only?

Thanks...

Angelo


