[Ocfs2-users] 2 node OCFS2 clusters

Karim Alkhayer kkhayer at gmail.com
Tue Nov 17 04:38:14 PST 2009


How about adding a third [FAKE] node? If this is a feasible workaround, then
only the troublesome node should fence itself.
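For illustration, the extra node would just be one more stanza in
/etc/ocfs2/cluster.conf on both machines, roughly like the sketch below.
The host names, IP addresses and cluster name are made up, and the third
entry would point at whichever machine is meant to supply the extra vote;
treat it as a sketch of the idea rather than a tested configuration.

node:
        ip_port = 7777
        ip_address = 192.168.0.10
        number = 0
        name = my_node0
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.0.11
        number = 1
        name = my_node1
        cluster = ocfs2

node:
        ip_port = 7777
        ip_address = 192.168.0.12
        number = 2
        name = my_node2
        cluster = ocfs2

cluster:
        node_count = 3
        name = ocfs2

Whether o2cb behaves sensibly with a node entry that never actually joins
or heartbeats is the open question, so this would need testing first.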

 

Best regards,

Karim Alkhayer

 

From: ocfs2-users-bounces at oss.oracle.com
[mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Thompson, Mark
Sent: Tuesday, November 17, 2009 2:27 PM
To: Srinivas Eeda
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] 2 node OCFS2 clusters

 

Hi,

 

I have done some more tests today, and I observed the following:

 

Test 1:

node 0 - ifdown eth2

node 0 - OCFS2 filesystem stalls on both nodes

node 1 - Decides to reboot

node 0 - Resumes OCFS2 service (while still off the network); OCFS2
filesystem back online

node 1 - Cannot re-join cluster as node 0 is off the network and has the fs
lock (Transport endpoint error)

node 0 - ifup eth2

node 1 - Re-joins the cluster and re-mounts OCFS2 filesystem.

 

Test 2:

node 1 - ifdown eth2

node 0 - OCFS2 filesystem stalls on both nodes

node 1  - Decides to reboot

node 0 - Resumes OCFS2 service, OCFS2 filesystem back online

node 1 - Boots up, re-joins cluster and re-mounts OCFS2 filesystem.

 

Is this the expected behaviour? And if it is, is there anything we can do to
avoid the loss of the OCFS2 filesystems?

 

 

Here's the messages file outputs.

 

Test 1 - Node 0

Nov 17 11:00:26 my_node0 kernel: ocfs2: Unmounting device (253,9) on (node
0)

Nov 17 11:02:21 my_node0 modprobe: FATAL: Module ocfs2_stackglue not found.

Nov 17 11:02:21 my_node0 kernel: OCFS2 Node Manager 1.4.4 Tue Sep  8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)

Nov 17 11:02:21 my_node0 kernel: OCFS2 DLM 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:02:21 my_node0 kernel: OCFS2 DLMFS 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:02:21 my_node0 kernel: OCFS2 User DLM kernel interface loaded

Nov 17 11:02:46 my_node0 kernel: OCFS2 1.4.4 Tue Sep  8 11:56:43 PDT 2009
(build 3a5bffa75b910d5bcdd5c607c4394b1e)

Nov 17 11:02:46 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0

Nov 17 11:02:46 my_node0 kernel: ocfs2: Mounting device (253,9) on (node 0,
slot 0) with ordered data mode.

Nov 17 11:02:59 my_node0 kernel: ocfs2_dlm: Node 1 joins domain
21751145F96E45649324C9EEF5485248

Nov 17 11:02:59 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1

Nov 17 11:07:51 my_node0 kernel: (15,1):dlm_do_master_request:1334 ERROR:
link to 1 went down!

Nov 17 11:07:51 my_node0 kernel: (15,1):dlm_get_lock_resource:917 ERROR:
status = -107

Nov 17 11:09:34 my_node0 kernel: (22108,1):ocfs2_dlm_eviction_cb:98 device
(253,9): dlm has evicted node 1

Nov 17 11:09:34 my_node0 kernel: (29443,1):dlm_get_lock_resource:844
21751145F96E45649324C9EEF5485248:M000000000000000000001f96e7b609: at least
one node (1) to recover before lock mastery can begin

Nov 17 11:09:35 my_node0 kernel: (29443,1):dlm_get_lock_resource:898
21751145F96E45649324C9EEF5485248:M000000000000000000001f96e7b609: at least
one node (1) to recover before lock mastery can begin

Nov 17 11:09:36 my_node0 kernel: (15,1):dlm_restart_lock_mastery:1223 ERROR:
node down! 1

Nov 17 11:09:36 my_node0 kernel: (15,1):dlm_wait_for_lock_mastery:1040
ERROR: status = -11

Nov 17 11:09:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:844
21751145F96E45649324C9EEF5485248:$RECOVERY: at least one node (1) to recover
before lock mastery can begin

Nov 17 11:09:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:878
21751145F96E45649324C9EEF5485248: recovery map is not empty, but must master
$RECOVERY lock now

Nov 17 11:09:36 my_node0 kernel: (22167,0):dlm_do_recovery:524 (22167) Node
0 is the Recovery Master for the Dead Node 1 for Domain
21751145F96E45649324C9EEF5485248

Nov 17 11:09:46 my_node0 kernel: (29443,1):ocfs2_replay_journal:1183
Recovering node 1 from slot 1 on device (253,9)

Nov 17 11:12:27 my_node0 kernel: ocfs2_dlm: Node 1 joins domain
21751145F96E45649324C9EEF5485248

Nov 17 11:12:27 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1

 

Test 1 - Node 1

Nov 17 11:00:26 my_node1 kernel: ocfs2_dlm: Node 0 leaves domain
21751145F96E45649324C9EEF5485248

Nov 17 11:00:26 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 1

Nov 17 11:00:46 my_node1 kernel: ocfs2: Unmounting device (253,9) on (node
1)

Nov 17 11:02:30 my_node1 modprobe: FATAL: Module ocfs2_stackglue not found.

Nov 17 11:02:30 my_node1 kernel: OCFS2 Node Manager 1.4.4 Tue Sep  8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)

Nov 17 11:02:30 my_node1 kernel: OCFS2 DLM 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:02:30 my_node1 kernel: OCFS2 DLMFS 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:02:30 my_node1 kernel: OCFS2 User DLM kernel interface loaded

Nov 17 11:02:59 my_node1 kernel: OCFS2 1.4.4 Tue Sep  8 11:56:43 PDT 2009
(build 3a5bffa75b910d5bcdd5c607c4394b1e)

Nov 17 11:02:59 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1

Nov 17 11:02:59 my_node1 kernel: ocfs2: Mounting device (253,9) on (node 1,
slot 1) with ordered data mode.

Nov 17 11:07:27 my_node1 kernel:
(7351,3):dlm_send_remote_convert_request:395 ERROR: status = -112

Nov 17 11:07:27 my_node1 kernel: (7351,3):dlm_wait_for_node_death:370
21751145F96E45649324C9EEF5485248: waiting 5000ms for notification of death
of node 0

Nov 17 11:07:57 my_node1 kernel:
(7351,3):dlm_send_remote_convert_request:395 ERROR: status = -107

Nov 17 11:07:57 my_node1 kernel: (7351,3):dlm_wait_for_node_death:370
21751145F96E45649324C9EEF5485248: waiting 5000ms for notification of death
of node 0

Nov 17 11:08:27 my_node1 kernel: (15,1):dlm_do_master_request:1334 ERROR:
link to 0 went down!

Nov 17 11:08:27 my_node1 kernel:
(7351,3):dlm_send_remote_convert_request:395 ERROR: status = -107

Nov 17 11:08:27 my_node1 kernel: (7351,3):dlm_wait_for_node_death:370
21751145F96E45649324C9EEF5485248: waiting 5000ms for notification of death
of node 0

Nov 17 11:08:27 my_node1 kernel: (15,1):dlm_get_lock_resource:917 ERROR:
status = -107

Nov 17 11:11:31 my_node1 modprobe: FATAL: Module ocfs2_stackglue not found.

Nov 17 11:11:32 my_node1 kernel: OCFS2 Node Manager 1.4.4 Tue Sep  8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)

Nov 17 11:11:32 my_node1 kernel: OCFS2 DLM 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:11:32 my_node1 kernel: OCFS2 DLMFS 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:11:32 my_node1 kernel: OCFS2 User DLM kernel interface loaded

Nov 17 11:11:40 my_node1 kernel: OCFS2 1.4.4 Tue Sep  8 11:56:43 PDT 2009
(build 3a5bffa75b910d5bcdd5c607c4394b1e)

Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_request_join:1036 ERROR:
status = -107

Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_try_to_join_domain:1210 ERROR:
status = -107

Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_join_domain:1488 ERROR: status
= -107

Nov 17 11:12:06 my_node1 kernel: (6282,0):dlm_register_domain:1754 ERROR:
status = -107

Nov 17 11:12:06 my_node1 kernel: (6282,0):ocfs2_dlm_init:2723 ERROR: status
= -107

Nov 17 11:12:06 my_node1 kernel: (6282,0):ocfs2_mount_volume:1437 ERROR:
status = -107

Nov 17 11:12:06 my_node1 kernel: ocfs2: Unmounting device (253,9) on (node
1)

Nov 17 11:12:27 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1

Nov 17 11:12:27 my_node1 kernel: ocfs2: Mounting device (253,9) on (node 1,
slot 1) with ordered data mode.

 

Test 2 - Node 0

Nov 17 11:16:37 my_node0 kernel: (22166,3):dlm_send_proxy_ast_msg:458 ERROR:
status = -107

Nov 17 11:16:37 my_node0 kernel: (22166,3):dlm_flush_asts:600 ERROR: status
= -107

Nov 17 11:17:35 my_node0 kernel: (22108,1):ocfs2_dlm_eviction_cb:98 device
(253,9): dlm has evicted node 1

Nov 17 11:17:35 my_node0 kernel: (6515,1):ocfs2_replay_journal:1183
Recovering node 1 from slot 1 on device (253,9)

Nov 17 11:17:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:844
21751145F96E45649324C9EEF5485248:$RECOVERY: at least one node (1) to recover
before lock mastery can begin

Nov 17 11:17:36 my_node0 kernel: (22167,0):dlm_get_lock_resource:878
21751145F96E45649324C9EEF5485248: recovery map is not empty, but must master
$RECOVERY lock now

Nov 17 11:17:36 my_node0 kernel: (22167,0):dlm_do_recovery:524 (22167) Node
0 is the Recovery Master for the Dead Node 1 for Domain
21751145F96E45649324C9EEF5485248

Nov 17 11:19:31 my_node0 kernel: ocfs2_dlm: Node 1 joins domain
21751145F96E45649324C9EEF5485248

Nov 17 11:19:31 my_node0 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1

 

Test 2 - Node 1

Nov 17 11:19:22 my_node1 modprobe: FATAL: Module ocfs2_stackglue not found.

Nov 17 11:19:23 my_node1 kernel: OCFS2 Node Manager 1.4.4 Tue Sep  8
11:56:46 PDT 2009 (build 18a3a72794aaca6c0334f456bca873cd)

Nov 17 11:19:23 my_node1 kernel: OCFS2 DLM 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:19:23 my_node1 kernel: OCFS2 DLMFS 1.4.4 Tue Sep  8 11:56:46 PDT
2009 (build e6e41b84c785deeea891e5873dbf19ab)

Nov 17 11:19:23 my_node1 kernel: OCFS2 User DLM kernel interface loaded

Nov 17 11:19:31 my_node1 kernel: OCFS2 1.4.4 Tue Sep  8 11:56:43 PDT 2009
(build 3a5bffa75b910d5bcdd5c607c4394b1e)

Nov 17 11:19:31 my_node1 kernel: ocfs2_dlm: Nodes in domain
("21751145F96E45649324C9EEF5485248"): 0 1

Nov 17 11:19:31 my_node1 kernel: ocfs2: Mounting device (253,8) on (node 1,
slot 1) with ordered data mode.

 

 

Regards,

 

Mark

 

From: Srinivas Eeda [mailto:srinivas.eeda at oracle.com] 
Sent: 16 November 2009 16:05
To: Thompson, Mark
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] 2 node OCFS2 clusters

 

Thompson, Mark wrote: 

Hi Srini,

 

Thanks for the response.

 

So are the following statements correct:

 

If I stop the networking on node 1, node 0 will continue to allow OCFS2
filesystems to work and not reboot itself. 

 

If I stop the networking on node 0, node 1 (now being the lowest node?) will
continue to allow OCFS2 filesystems to work and not reboot itself.

In both cases node 0 will survive, because it is the node with the lowest
node number (defined in cluster.conf). This applies to the scenario where
the interconnect goes down but both nodes are healthy and are still
heartbeating to the disk.

 

I guess I just need to know if it's possible to have a 2 node OCFS2 cluster
that will cope with either one of the nodes dying, and have the remaining
node still provide service.

If node 0 itself panics or reboots, then node 1 will survive.

 

Regards,

 

Mark 

 

From: Srinivas Eeda [mailto:srinivas.eeda at oracle.com] 
Sent: 16 November 2009 14:57
To: Thompson, Mark
Cc: ocfs2-users at oss.oracle.com
Subject: Re: [Ocfs2-users] 2 node OCFS2 clusters

 

In a cluster with more than 2 nodes, if the network on one node goes down,
that node will evict itself but the other nodes will survive. But in a two
node cluster, the node with the lowest node number will survive, no matter
which node's network went down.
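As a rough sketch of where the relevant knobs live: how long the surviving
node waits before declaring its peer dead and starting recovery is governed
by the o2cb timeouts, normally set with 'service o2cb configure' and stored
in /etc/sysconfig/o2cb. The values shown below are only the commonly seen
defaults for this vintage of o2cb and vary by version, so check your own
file rather than taking these numbers as given; they affect how quickly
fencing happens, not which node survives.

# Missed disk heartbeats (roughly 2 seconds apart) before a node is
# considered dead
O2CB_HEARTBEAT_THRESHOLD=31
# How long the o2net connection may be idle before it is declared dead
O2CB_IDLE_TIMEOUT_MS=30000
# Keepalive packet interval and reconnect delay on the interconnect
O2CB_KEEPALIVE_DELAY_MS=2000
O2CB_RECONNECT_DELAY_MS=2000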

thanks,
--Srini

Thompson, Mark wrote: 

Hi,

This is my first post here so please be gentle with me.

My question is: can you have a 2 node OCFS2 cluster, disconnect one node
from the network, and have the remaining node continue to function normally?
Currently we have a 2 node cluster, and if we stop the NIC that carries the
OCFS2 o2cb network connection, the other node reboots itself. I have
researched running a 2 node OCFS2 cluster but so far I have been unable to
find a clear solution. I have looked at the FAQ regarding quorum, and my
OCFS2 init scripts are enabled, etc.

Is this possible, or should we look at alternative solutions?

Regards,

Mark

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users