[Ocfs2-users] Is "node down!" related to SVN rev 3004?

Sunil Mushran Sunil.Mushran at oracle.com
Thu May 24 08:52:58 PDT 2007


Such issues are best handled via bugzilla. File one
at oss.oracle.com/bugzilla with all the details.

The most important detail would be node3's netdump
or netconsole output. The real reason for the outage
will be in that dump.
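
If netconsole is not already set up on node3, a setup roughly like the
one below will forward the console output to another machine. The port
numbers, IP addresses, interface name, and MAC address here are only
placeholders; substitute the ones for your own network:

  # on node3: send kernel console messages over UDP to a log host
  modprobe netconsole netconsole=6666@192.168.0.3/eth0,514@192.168.0.2/00:0c:29:aa:bb:cc

  # on the log host: capture whatever node3 emits before it goes down
  nc -u -l -p 514 | tee node3-console.log

Raising the console log level on node3 (e.g. dmesg -n 8) helps ensure
the full oops text is sent out as well.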

Marcus Alves Grando wrote:
> Hi list,
>
> Today I had a problem with ocfs2: one server stopped accessing the ocfs2 
> disks, and the only messages in /var/log/messages are:
>
> May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301 
> ERROR: node down! 1
> May 23 16:24:26 node3 kernel: (6956,3):dlm_wait_for_lock_mastery:1118 
> ERROR: status = -11
>
> I don't know what happened. Could this be related to the rev 3004 fix? 
> Has anyone seen this before?
>
> Another strange fact: all nodes mount 13 SAN disks, but the "leaves" 
> messages occur only nine times.
>
> Also, node1 has been down for maintenance since 08:30.
>
> The other servers show these messages:
>
> **** node2
>
> May 23 16:48:31 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> 84407FC4A92E451DADEF260A2FE0E366
> May 23 16:48:31 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
> May 23 16:48:37 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> 1FB62EB34D1F495A9F11F396E707588C
> May 23 16:48:37 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
> May 23 16:48:42 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> 793ACD36E8CA4067AB99F9F4F2229634
> May 23 16:48:42 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
> May 23 16:48:48 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> ECECF9980CBD44EFA7E8A950EDE40573
> May 23 16:48:48 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
> May 23 16:48:53 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> D8AFCBD0CF59404991FAB19916CEE08B
> May 23 16:48:53 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
> May 23 16:48:58 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> E8B0A018151943A28674662818529F0F
> May 23 16:48:58 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("E8B0A018151943A28674662818529F0F"): 2 4
> May 23 16:49:03 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> 3D227E224D0D4D9F97B84B0BB7DE7E22
> May 23 16:49:03 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
> May 23 16:49:09 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> C40090C8D14D48C9AC0D1024A228EC59
> May 23 16:49:09 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
> May 23 16:49:14 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
> 9CB224941DC64A39872A5012FBD12354
> May 23 16:49:14 node2 kernel: ocfs2_dlm: Nodes in domain 
> ("9CB224941DC64A39872A5012FBD12354"): 2 4
> May 23 16:50:04 node2 kernel: o2net: connection to node node3.hst.host 
> (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting 
> it down.
> May 23 16:50:04 node2 kernel: (0,3):o2net_idle_timer:1418 here are 
> some times that might help debug the situation: (tmr 1179949774.944117 
> now 1179949804.944585 dr 1179949774.944109 adv 
> 1179949774.944119:1179949774.944121 func (d21ddb4d:513) 
> 1179949754.944260:1179949754.944271)
> May 23 16:50:04 node2 kernel: o2net: no longer connected to node 
> node3.hst.host (num 3) at 192.168.0.3:7777
> May 23 16:52:31 node2 kernel: 
> (23351,2):dlm_send_remote_convert_request:398 ERROR: status = -107
> May 23 16:52:31 node2 kernel: (23351,2):dlm_wait_for_node_death:365 
> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of 
> death of node 3
> May 23 16:52:32 node2 kernel: (4379,3):ocfs2_dlm_eviction_cb:119 
> device (8,49): dlm has evicted node 3
> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:921 
> BD2D6C1943FB4771B018EA2A7D056E8A:$RECOVERY: at least one node (3) 
> torecover before lock mastery can begin
> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:955 
> BD2D6C1943FB4771B018EA2A7D056E8A: recovery map is not empty, but must 
> master $RECOVERY lock now
> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:921 
> 36D7DEC36FC44C53A6107B6A9CE863A2:$RECOVERY: at least one node (3) 
> torecover before lock mastery can begin
> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:955 
> 36D7DEC36FC44C53A6107B6A9CE863A2: recovery map is not empty, but must 
> master $RECOVERY lock now
> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:921 
> 9D0941F9B5B843E0B8F8C9FD7D514C35:$RECOVERY: at least one node (3) 
> torecover before lock mastery can begin
> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:955 
> 9D0941F9B5B843E0B8F8C9FD7D514C35: recovery map is not empty, but must 
> master $RECOVERY lock now
> May 23 16:52:37 node2 kernel: (23351,2):ocfs2_replay_journal:1167 
> Recovering node 3 from slot 1 on device (8,97)
> May 23 16:52:42 node2 kernel: kjournald starting.  Commit interval 5 
> seconds
>
> **** node4
>
> May 23 16:48:31 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> 84407FC4A92E451DADEF260A2FE0E366
> May 23 16:48:31 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
> May 23 16:48:37 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> 1FB62EB34D1F495A9F11F396E707588C
> May 23 16:48:37 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
> May 23 16:48:42 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> 793ACD36E8CA4067AB99F9F4F2229634
> May 23 16:48:42 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
> May 23 16:48:48 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> ECECF9980CBD44EFA7E8A950EDE40573
> May 23 16:48:48 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
> May 23 16:48:53 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> D8AFCBD0CF59404991FAB19916CEE08B
> May 23 16:48:53 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
> May 23 16:48:58 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> E8B0A018151943A28674662818529F0F
> May 23 16:48:58 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("E8B0A018151943A28674662818529F0F"): 2 4
> May 23 16:49:03 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> 3D227E224D0D4D9F97B84B0BB7DE7E22
> May 23 16:49:03 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
> May 23 16:49:09 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> C40090C8D14D48C9AC0D1024A228EC59
> May 23 16:49:09 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
> May 23 16:49:14 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
> 9CB224941DC64A39872A5012FBD12354
> May 23 16:49:14 node4 kernel: ocfs2_dlm: Nodes in domain 
> ("9CB224941DC64A39872A5012FBD12354"): 2 4
> May 23 16:50:04 node4 kernel: o2net: connection to node node3.hst.host 
> (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting 
> it down.
> May 23 16:50:04 node4 kernel: (19355,0):o2net_idle_timer:1418 here are 
> some times that might help debug the situation: (tmr 1179949774.943813 
> now 1179949804.944242 dr 1179949774.943805 adv 
> 1179949774.943815:1179949774.943817 func (d21ddb4d:513) 
> 1179949754.944088:1179949754.944097)
> May 23 16:50:04 node4 kernel: o2net: no longer connected to node 
> node3.hst.host (num 3) at 192.168.0.3:7777
> May 23 16:50:04 node4 kernel: (18902,0):dlm_do_master_request:1418 
> ERROR: link to 3 went down!
> May 23 16:50:04 node4 kernel: (18902,0):dlm_get_lock_resource:995 
> ERROR: status = -112
> May 23 16:52:31 node4 kernel: 
> (22785,3):dlm_send_remote_convert_request:398 ERROR: status = -107
> May 23 16:52:31 node4 kernel: (22785,3):dlm_wait_for_node_death:365 
> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of 
> death of node 3
> May 23 16:52:31 node4 kernel: (22786,3):dlm_get_lock_resource:921 
> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at 
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (22784,3):dlm_get_lock_resource:921 
> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at 
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (22783,3):dlm_get_lock_resource:921 
> A85D18C01AE747AC905343D919B60525:M000000000000000000021535d8e891: at 
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:921 
> A85D18C01AE747AC905343D919B60525:$RECOVERY: at least one node (3) 
> torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:955 
> A85D18C01AE747AC905343D919B60525: recovery map is not empty, but must 
> master $RECOVERY lock now
> May 23 16:52:32 node4 kernel: (22786,3):dlm_get_lock_resource:976 
> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at 
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:32 node4 kernel: (22784,3):dlm_get_lock_resource:976 
> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at 
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:32 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
> device (8,97): dlm has evicted node 3
> May 23 16:52:33 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
> device (8,81): dlm has evicted node 3
> May 23 16:52:34 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
> device (8,161): dlm has evicted node 3
> May 23 16:52:35 node4 kernel: (18902,0):dlm_restart_lock_mastery:1301 
> ERROR: node down! 3
> May 23 16:52:35 node4 kernel: (18902,0):dlm_wait_for_lock_mastery:1118 
> ERROR: status = -11
> May 23 16:52:36 node4 kernel: (18902,0):dlm_get_lock_resource:976 
> 9D0941F9B5B843E0B8F8C9FD7D514C35:D0000000000000000030b2be3ea1a0c: at 
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:37 node4 kernel: (22783,3):ocfs2_replay_journal:1167 
> Recovering node 3 from slot 1 on device (8,49)
> May 23 16:52:39 node4 kernel: (22784,0):ocfs2_replay_journal:1167 
> Recovering node 3 from slot 1 on device (8,81)
> May 23 16:52:40 node4 kernel: (22786,0):ocfs2_replay_journal:1167 
> Recovering node 3 from slot 1 on device (8,161)
> May 23 16:52:44 node4 kernel: kjournald starting.  Commit interval 5 
> seconds
>



