[Ocfs2-users] Suggestion about Heartbeat

Wed Aug 13 12:18:38 PDT 2008

I have a suggestion about the heartbeat and the way that "downed node" detection works.

Sometimes a node is up but I need to power cycle it for some reason (a frozen process, for instance). When that happens, my other nodes cannot perform file system operations until the heartbeat period expires, which takes roughly 30-60 seconds (this is the setting that works best for me and does not cause self-fencing). It would be useful to be able to tell the remaining nodes that the node was taken down on purpose, so that they can move on with their lives.

A real-world example:

OCFS2 hosts the files for a website served by Apache. If a node goes down, the load average on all remaining nodes skyrockets to 500 or more, as the Apache processes all enter uninterruptible sleep. This triggers alerts, pages, and on occasion application-specific triggers in the web app that show a "Too Busy" page when the load average is too high (vBulletin does this, for instance).

It would be magnificent to be able to instruct the remaining nodes that the node in question was taken down purposefully, so that they go on about their lives immediately (beginning, of course, with the journal replay and so on).

It's very simple in concept, and probably also in execution. Could something like this be added? It would allow me to really do wonderful things from a STONITH perspective.

Thanks,
Michael