[Ocfs2-users] 2-node configuration ?

Fri Feb 29 12:44:40 PST 2008

Laurent,

    What you need to be able to decide is which node still has network connectivity. If both have network connectivity, you could fence either of them. If both lost connectivity (someone turned the switch off), then you are in trouble.
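
    As a rough sketch of that decision (not tied to any particular cluster manager; the Action names and the decide() function are made up for illustration, and how each node learns about the other's connectivity is left open), the cases could be written as:

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of the connectivity check."""
    FENCE_PEER = "local node keeps running and fences the peer"
    STAND_DOWN = "local node is the isolated one and should stop"
    TIE_BREAK = "both nodes have connectivity: fence either one by a fixed rule"
    NO_SAFE_CHOICE = "neither node has connectivity: no safe automatic choice"


def decide(local_has_network: bool, peer_has_network: bool) -> Action:
    """Map the two connectivity observations to an action."""
    if local_has_network and peer_has_network:
        return Action.TIE_BREAK
    if local_has_network:
        return Action.FENCE_PEER
    if peer_has_network:
        return Action.STAND_DOWN
    return Action.NO_SAFE_CHOICE


if __name__ == "__main__":
    for local in (True, False):
        for peer in (True, False):
            print(local, peer, "->", decide(local, peer).value)
```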

   You will need to plug the backend network into a switch and monitor the interface status, so that when one machine is shut down or you disconnect its network cable, you still get the up status on the other machine (with a direct crossover cable, both ends would see the link go down). If you don't want to use two switches, plug them into the same switch and use different VLANs.
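
   One way to watch the interface status on Linux is to read the link state the kernel exposes under /sys/class/net. A minimal sketch, assuming the backend interface is called eth1 (a placeholder) and that your kernel provides the operstate file:

```python
from pathlib import Path


def link_is_up(iface: str, sysfs_root: str = "/sys/class/net") -> bool:
    """Return True if the kernel reports the interface's operstate as 'up'.

    A missing interface or an unreadable file is treated as "not up".
    """
    state_file = Path(sysfs_root) / iface / "operstate"
    try:
        return state_file.read_text().strip() == "up"
    except OSError:
        return False


if __name__ == "__main__":
    # "eth1" is an assumed name for the backend interface.
    print("backend link up:", link_is_up("eth1"))
```

   The sysfs_root parameter is only there so the function can be pointed at a test directory.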

   To deal with OCFS2, I think the easiest approach is to increase its timeouts so that your cluster manager can decide which node will survive before the OCFS2 heartbeat fences the node. I wouldn't mess with its inner workings. YMMV...
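
   For the disk heartbeat, as far as I recall from the OCFS2 FAQ, the timeout is set through O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb and works out to (threshold - 1) * 2 seconds; please check this against the documentation for your OCFS2 version. Under that assumption, the threshold for a given timeout can be worked out like this (the function name is just for illustration):

```python
def heartbeat_threshold(timeout_seconds: int, interval_seconds: int = 2) -> int:
    """Threshold value for a desired disk heartbeat timeout.

    Assumes timeout = (threshold - 1) * interval, with a 2-second
    heartbeat interval. Verify the formula for your OCFS2 release.
    """
    if timeout_seconds <= 0 or timeout_seconds % interval_seconds != 0:
        raise ValueError("timeout must be a positive multiple of the interval")
    return timeout_seconds // interval_seconds + 1


if __name__ == "__main__":
    for timeout in (12, 30, 60):
        print(f"{timeout}s timeout -> O2CB_HEARTBEAT_THRESHOLD={heartbeat_threshold(timeout)}")
```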

Regards,
Luis

Sunil Mushran <Sunil.Mushran at oracle.com> wrote:
Laurent Neiger wrote:
> We could check at regular intervals (<10s of ocfs2 timeout, let's say
> every 5 seconds for example) if the network comm between the 2 nodes
> is up. If not, on maq2, if network comm is still OK (checking ifconfig
> status, or pinging a third party such as a router), then maq2 is OK,
> and comm is lost between the 2 nodes because of maq1.
> So on maq2, stop the ocfs2 heartbeat for avoiding self-fence, by using
> ocfs2_hb_ctl -K -d /dev/drbd0 (please tell me if I misunderstood this
> command) and remote fence maq1 (if not a power supply failure, but a
> network card one for example, we power off the bad node).
>
> So our cluster will still continue to work in degraded mode, until we
> repair and power up maq1, and restart o2cb and ocfs2 on both nodes.
>
> So do you think doing that could be efficient for having a strong
> cluster or do you have a better idea ?
>
Each of those pings will require a timeout, and a short one at that. So
short that you may not even be able to distinguish between errors and an
overloaded run-queue, transmit queue, router, etc. You will need external
hardware probes to distinguish between slowdowns and errors.
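
To illustrate the ambiguity, here is a small sketch of a TCP probe with a
short timeout (the host, port and one-second timeout are arbitrary
placeholders, and probe() is a made-up name). A timeout tells you only that
no answer arrived in time, not why:

```python
import socket


def probe(host: str, port: int, timeout_s: float = 1.0) -> str:
    """Try a TCP connect and classify the result.

    'timed out' is deliberately ambiguous: the peer may be dead, or the
    peer, the network, or this node may simply be overloaded.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "reachable"
    except socket.timeout:
        return "timed out (dead or just slow?)"
    except ConnectionRefusedError:
        return "host answered, port closed"
    except OSError:
        return "error"


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address, used here only as a placeholder.
    print(probe("192.0.2.1", 7777, timeout_s=1.0))
```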

An easy solution to your problem is to use net-bonding.

But then I guess you can rephrase the issue with some other precise hardware
error that allows the node to run as a single node but not in a cluster.
And what if that node is the lower-numbered one?

In the end, you have to have shutdown windows: windows in which you can
recycle the cluster. There is a reason people talk about 99.999% uptime
and not 100%. ;)

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
