Cman and OCFS2 Testing Matrix
In lieu of a more detailed plan, I'm just going to put together a table of things I've tried and the resulting behavior for both stacks. I want to keep track of it. In the end, the cman stuff should behave as well as the o2cb stuff, if not better.
Things to watch out for
CMan does some things that are surprising to a user of o2cb. It requires quorum just to do anything. So, if you have a four-node cluster and two nodes are down, nothing works until you bring a node back up! That's because you need three nodes for quorum. You can't even stop the cluster stack on the two remaining nodes; generally, I have to kill the processes by hand.
I'm sure there is a better way to do it, but I don't know it. There's something in the docs about reducing "expected votes", but I'm afraid that will persist even after all nodes are back up.
If I find out more, I'll put it here.
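For reference, here's a sketch of the cman_tool commands that I believe apply to this situation. I haven't verified the persistence behavior, so treat it as notes rather than procedure:

```
# Sketch, untested here: runtime quorum fiddling with cman_tool.

# Lower the expected-votes count so the two surviving nodes regain quorum.
# My understanding is that this is a runtime change, not a cluster.conf
# edit, and that cman recalculates expected votes as nodes rejoin -- but
# that is exactly the persistence question above, so verify it.
cman_tool expected -e 2

# See whether the cluster now thinks it is quorate.
cman_tool status

# Force the stack down on an inquorate node, instead of killing processes.
cman_tool leave force
```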
Configuration
The tests use four nodes: two PPC64 LPARs running RHEL5 and two amd64 HVMs running Fedora 7. All run the user_heartbeat branch of ocfs2.git. The shared disk is iSCSI served from a NetApp filer.
Legend
| Background | Meaning |
| --- | --- |
|  | Erroneous behavior |
|  | Regression compared to the other stack |
The Tests
We have a number of stack tests. There are a couple of variables. The first is the state of the stack: How many nodes are configured? How many nodes are actually live in the cluster? How many nodes have ocfs2 mounts? The second is how the test change affects the stack. For example, the cman stack really doesn't like it when the number of live nodes drops below quorum.
As such, we run the tests in a few circumstances. We test two-node clusters as the minimum configuration; cman has special configuration to make this behave well (see the sketch below). We test changes that will stay within quorum, and we test changes that will drop below quorum.
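The two-node special case lives in cluster.conf. This is the standard cman idiom as I understand it, not something re-verified against these exact nodes:

```
# Sketch: the usual two-node fragment in /etc/cluster/cluster.conf.
# With two_node="1" (which requires expected_votes="1"), a single
# surviving node keeps quorum, so fencing still works when its peer dies.
#
#   <cman two_node="1" expected_votes="1"/>

# Quick check of what a node actually has configured:
grep '<cman' /etc/cluster/cluster.conf
```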
Dead Nodes
This test covers stack behavior when one or more nodes die: full death, like a panic() or a hardware crash.
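For "kill a node" I want something as close to a real panic as possible. One way to get that, assuming the magic-sysrq interface is available, is the sysrq crash trigger:

```
# Sketch: force an instant kernel panic to simulate full node death.
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
echo c > /proc/sysrq-trigger       # 'c' crashes the kernel immediately
```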
| # Nodes | # Live Nodes | # Mounts | Action | O2CB Expected | O2CB Actual | CMan Expected | CMan Actual |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 4 | 0 | Kill a node. | Nothing (the nodes don't talk without mounts). | Nothing. | The dead node is fenced. | The dead node is fenced. |
| 4 | 4 | 4 | Kill a node. | The dead node is noticed missing and recovery occurs. | After 30 seconds, the node is noticed missing from the network. After another 30 seconds, it is recovered. | The dead node is fenced and recovery occurs. | The dead node is fenced and recovery occurs. |
| 4 | 4 | 0 | Kill two nodes. | Nothing. | Nothing. | The cluster notices the two nodes missing and hangs without quorum. | The dead nodes are noticed, and the cluster hangs without quorum. Note that only one of the two surviving nodes moves the fence group into FAIL_STOP_WAIT; the other leaves the fence group in the "none" state. In addition, I tried setting expected_votes to 3 to get quorum back. Still no fencing. |
| 4 | 4 | 4 | Kill two nodes. | The dead nodes are noticed missing and recovery occurs. | The remaining nodes notice the network timeouts and then the heartbeat timeouts. The lowest node masters $RECOVERY and recovers both. | The dead nodes are noticed missing and the cluster hangs without quorum. | Same as above. |
| 2 | 2 | 0 | Kill a node. | Nothing. | Nothing. | The dead node is fenced. | The dead node is fenced. |
| 2 | 2 | 2 | Kill a node. | The remaining node notices and recovers the dead node. | The remaining node successfully notices and recovers. | The remaining node notices, fences, and recovers the dead node. | The remaining node successfully fences and recovers. |
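The FAIL_STOP_WAIT and "none" states in the table come from groupd. For the record, this is roughly how I look at the state on a surviving node (standard cman/fence tools; output formats may vary by version):

```
# Sketch: inspecting cluster and fence-group state on a surviving node.
cman_tool status    # quorum state, votes, expected votes
cman_tool nodes     # which nodes this node believes are members
group_tool ls       # groupd's fence/dlm groups and their states,
                    # e.g. "none" vs. FAIL_STOP_WAIT
```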
Dead Network
This test considers what happens when nodes lose network connectivity.
| # Nodes | # Live Nodes | # Mounts | Action | O2CB Expected | O2CB Actual | CMan Expected | CMan Actual |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | 4 | 0 | Bring down the network on one node. | Nothing (the nodes don't talk without mounts). | Nothing. | The disconnected node is fenced and the cluster continues. | The disconnected node is fenced and the cluster continues. |
| 4 | 4 | 4 | Bring down the network on one node. | The disconnected node is noticed missing but heartbeating. The disconnected node itself notices it has lost all three other nodes and fences itself. Afterwards, the cluster notices it is dead and recovers. | The disconnected node self-fences, and the rest of the cluster then recovers and continues. | The disconnected node is fenced and recovery occurs. | The same behavior as with Dead Nodes above. |
| 4 | 4 | 0 | Bring down the network on two nodes. | Nothing. | Nothing. | The cluster notices the two nodes missing and hangs without quorum. | The cluster notices the two nodes missing and hangs without quorum. |
| 4 | 4 | 4 | Bring down the network on two nodes. | The disconnected nodes are noticed missing but heartbeating. The disconnected nodes themselves have lost connectivity to all three other nodes and thus will fence themselves. The remaining two nodes will recover and continue. | If the lowest-numbered node still has network connectivity, the two disconnected nodes self-fence, while the two remaining nodes correctly recover and continue. If the lowest-numbered node is disconnected, the two disconnected nodes self-fence, but the two remaining nodes also self-fence: they are exactly half the cluster but do not contain the lowest heartbeating node. | The disconnected nodes are noticed missing and the cluster hangs without quorum. | The cluster indeed hangs. However, three of the four nodes still claim to see 4 live members and the fence group in a good state. Only one node takes its groups to FAIL_ALL_STOPPED, and it even claims it can't see the other good node! |
| 2 | 2 | 0 | Bring down the network on one node. | Nothing. | Nothing. | Each node notices the other has disappeared and attempts to fence it. Whichever fences first wins. | The node without connectivity takes its own cluster down. The node with connectivity fences and recovers. This is probably because I changed the IP to take the network down, and aisexec didn't like it. The cman folks say that the expected behavior happens with a cable pull (see the note after this table). |
| 2 | 2 | 2 | Bring down the network on one node. | The high-numbered node self-fences, regardless of whether it has network connectivity. The low node recovers the high node's journal. | The high-numbered node self-fences, regardless of whether it has network connectivity. The low node recovers the high node's journal. | Each node notices the other has disappeared and attempts to fence it. Whichever fences first wins. | Same as above (see the note after this table). |
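A note on the last two rows: since changing the IP address confuses aisexec, a closer software approximation of a cable pull is probably to drop the traffic with iptables. A sketch, untried in this matrix; eth0 stands in for whatever interface actually carries the cluster traffic:

```
# Sketch: approximate a cable pull without touching the interface config.
# eth0 is a stand-in for the actual cluster interface.
iptables -A INPUT  -i eth0 -j DROP
iptables -A OUTPUT -o eth0 -j DROP

# Undo it afterwards:
iptables -D INPUT  -i eth0 -j DROP
iptables -D OUTPUT -o eth0 -j DROP
```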