[Ocfs2-users] "node down!" are related with SVN rev 3004?
Marcus Alves Grando
marcus.grando at terra.com.br
Wed May 23 15:00:38 PDT 2007
I forgot to mention: all nodes run Red Hat AS 4 update 5, and
# rpm -qa | grep ocfs2
ocfs2-2.6.9-55.ELhugemem-1.2.5-1
ocfs2-tools-1.2.4-1
# uname -r
2.6.9-55.ELhugemem
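
By the way, the (tmr/now) numbers in the o2net_idle_timer lines quoted below become readable once you treat them as seconds.microseconds timestamps: "now - tmr" is how long the socket sat idle before o2net shut it down. A quick sketch (values copied from the node2 log; the interpretation of the fields is my assumption):

```shell
# Decode the idle interval from the o2net_idle_timer debug line.
# tmr = last activity on the socket, now = when the timer fired.
tmr=1179949774.944117
now=1179949804.944585
idle=$(awk -v a="$now" -v b="$tmr" 'BEGIN { printf "%.3f", a - b }')
echo "idle for ${idle}s"
```

That comes out to the full 30-second idle timeout with no traffic at all in between, which would point at node3 hanging or losing the interconnect rather than shutting down cleanly.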
Regards
Marcus Alves Grando wrote:
> Hi list,
>
> Today I had a problem with ocfs2: one server stopped accessing the ocfs2
> disks, and the only messages in /var/log/messages were:
>
> May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301
> ERROR: node down! 1
> May 23 16:24:26 node3 kernel: (6956,3):dlm_wait_for_lock_mastery:1118
> ERROR: status = -11
>
> I don't know what happened. Could this be related to the SVN rev 3004 fix?
> Has anyone seen this before?
>
> Another strange fact: all nodes mount 13 SAN disks, but the "leaves
> domain" messages occur only nine times.
>
> Also, node1 has been down for maintenance since 08:30.
>
> The other servers show these messages:
>
> **** node2
>
> May 23 16:48:31 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> 84407FC4A92E451DADEF260A2FE0E366
> May 23 16:48:31 node2 kernel: ocfs2_dlm: Nodes in domain
> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
> May 23 16:48:37 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> 1FB62EB34D1F495A9F11F396E707588C
> May 23 16:48:37 node2 kernel: ocfs2_dlm: Nodes in domain
> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
> May 23 16:48:42 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> 793ACD36E8CA4067AB99F9F4F2229634
> May 23 16:48:42 node2 kernel: ocfs2_dlm: Nodes in domain
> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
> May 23 16:48:48 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> ECECF9980CBD44EFA7E8A950EDE40573
> May 23 16:48:48 node2 kernel: ocfs2_dlm: Nodes in domain
> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
> May 23 16:48:53 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> D8AFCBD0CF59404991FAB19916CEE08B
> May 23 16:48:53 node2 kernel: ocfs2_dlm: Nodes in domain
> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
> May 23 16:48:58 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> E8B0A018151943A28674662818529F0F
> May 23 16:48:58 node2 kernel: ocfs2_dlm: Nodes in domain
> ("E8B0A018151943A28674662818529F0F"): 2 4
> May 23 16:49:03 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> 3D227E224D0D4D9F97B84B0BB7DE7E22
> May 23 16:49:03 node2 kernel: ocfs2_dlm: Nodes in domain
> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
> May 23 16:49:09 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> C40090C8D14D48C9AC0D1024A228EC59
> May 23 16:49:09 node2 kernel: ocfs2_dlm: Nodes in domain
> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
> May 23 16:49:14 node2 kernel: ocfs2_dlm: Node 3 leaves domain
> 9CB224941DC64A39872A5012FBD12354
> May 23 16:49:14 node2 kernel: ocfs2_dlm: Nodes in domain
> ("9CB224941DC64A39872A5012FBD12354"): 2 4
> May 23 16:50:04 node2 kernel: o2net: connection to node node3.hst.host
> (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting it
> down.
> May 23 16:50:04 node2 kernel: (0,3):o2net_idle_timer:1418 here are some
> times that might help debug the situation: (tmr 1179949774.944117 now
> 1179949804.944585 dr 1179949774.944109 adv
> 1179949774.944119:1179949774.944121 func (d21ddb4d:513)
> 1179949754.944260:1179949754.944271)
> May 23 16:50:04 node2 kernel: o2net: no longer connected to node
> node3.hst.host (num 3) at 192.168.0.3:7777
> May 23 16:52:31 node2 kernel:
> (23351,2):dlm_send_remote_convert_request:398 ERROR: status = -107
> May 23 16:52:31 node2 kernel: (23351,2):dlm_wait_for_node_death:365
> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of
> death of node 3
> May 23 16:52:32 node2 kernel: (4379,3):ocfs2_dlm_eviction_cb:119 device
> (8,49): dlm has evicted node 3
> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:921
> BD2D6C1943FB4771B018EA2A7D056E8A:$RECOVERY: at least one node (3)
> torecover before lock mastery can begin
> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:955
> BD2D6C1943FB4771B018EA2A7D056E8A: recovery map is not empty, but must
> master $RECOVERY lock now
> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:921
> 36D7DEC36FC44C53A6107B6A9CE863A2:$RECOVERY: at least one node (3)
> torecover before lock mastery can begin
> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:955
> 36D7DEC36FC44C53A6107B6A9CE863A2: recovery map is not empty, but must
> master $RECOVERY lock now
> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:921
> 9D0941F9B5B843E0B8F8C9FD7D514C35:$RECOVERY: at least one node (3)
> torecover before lock mastery can begin
> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:955
> 9D0941F9B5B843E0B8F8C9FD7D514C35: recovery map is not empty, but must
> master $RECOVERY lock now
> May 23 16:52:37 node2 kernel: (23351,2):ocfs2_replay_journal:1167
> Recovering node 3 from slot 1 on device (8,97)
> May 23 16:52:42 node2 kernel: kjournald starting. Commit interval 5
> seconds
>
> **** node4
>
> May 23 16:48:31 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> 84407FC4A92E451DADEF260A2FE0E366
> May 23 16:48:31 node4 kernel: ocfs2_dlm: Nodes in domain
> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
> May 23 16:48:37 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> 1FB62EB34D1F495A9F11F396E707588C
> May 23 16:48:37 node4 kernel: ocfs2_dlm: Nodes in domain
> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
> May 23 16:48:42 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> 793ACD36E8CA4067AB99F9F4F2229634
> May 23 16:48:42 node4 kernel: ocfs2_dlm: Nodes in domain
> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
> May 23 16:48:48 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> ECECF9980CBD44EFA7E8A950EDE40573
> May 23 16:48:48 node4 kernel: ocfs2_dlm: Nodes in domain
> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
> May 23 16:48:53 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> D8AFCBD0CF59404991FAB19916CEE08B
> May 23 16:48:53 node4 kernel: ocfs2_dlm: Nodes in domain
> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
> May 23 16:48:58 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> E8B0A018151943A28674662818529F0F
> May 23 16:48:58 node4 kernel: ocfs2_dlm: Nodes in domain
> ("E8B0A018151943A28674662818529F0F"): 2 4
> May 23 16:49:03 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> 3D227E224D0D4D9F97B84B0BB7DE7E22
> May 23 16:49:03 node4 kernel: ocfs2_dlm: Nodes in domain
> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
> May 23 16:49:09 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> C40090C8D14D48C9AC0D1024A228EC59
> May 23 16:49:09 node4 kernel: ocfs2_dlm: Nodes in domain
> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
> May 23 16:49:14 node4 kernel: ocfs2_dlm: Node 3 leaves domain
> 9CB224941DC64A39872A5012FBD12354
> May 23 16:49:14 node4 kernel: ocfs2_dlm: Nodes in domain
> ("9CB224941DC64A39872A5012FBD12354"): 2 4
> May 23 16:50:04 node4 kernel: o2net: connection to node node3.hst.host
> (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting it
> down.
> May 23 16:50:04 node4 kernel: (19355,0):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1179949774.943813
> now 1179949804.944242 dr 1179949774.943805 adv
> 1179949774.943815:1179949774.943817 func (d21ddb4d:513)
> 1179949754.944088:1179949754.944097)
> May 23 16:50:04 node4 kernel: o2net: no longer connected to node
> node3.hst.host (num 3) at 192.168.0.3:7777
> May 23 16:50:04 node4 kernel: (18902,0):dlm_do_master_request:1418
> ERROR: link to 3 went down!
> May 23 16:50:04 node4 kernel: (18902,0):dlm_get_lock_resource:995 ERROR:
> status = -112
> May 23 16:52:31 node4 kernel:
> (22785,3):dlm_send_remote_convert_request:398 ERROR: status = -107
> May 23 16:52:31 node4 kernel: (22785,3):dlm_wait_for_node_death:365
> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of
> death of node 3
> May 23 16:52:31 node4 kernel: (22786,3):dlm_get_lock_resource:921
> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (22784,3):dlm_get_lock_resource:921
> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (22783,3):dlm_get_lock_resource:921
> A85D18C01AE747AC905343D919B60525:M000000000000000000021535d8e891: at
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:921
> A85D18C01AE747AC905343D919B60525:$RECOVERY: at least one node (3)
> torecover before lock mastery can begin
> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:955
> A85D18C01AE747AC905343D919B60525: recovery map is not empty, but must
> master $RECOVERY lock now
> May 23 16:52:32 node4 kernel: (22786,3):dlm_get_lock_resource:976
> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:32 node4 kernel: (22784,3):dlm_get_lock_resource:976
> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:32 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 device
> (8,97): dlm has evicted node 3
> May 23 16:52:33 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 device
> (8,81): dlm has evicted node 3
> May 23 16:52:34 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 device
> (8,161): dlm has evicted node 3
> May 23 16:52:35 node4 kernel: (18902,0):dlm_restart_lock_mastery:1301
> ERROR: node down! 3
> May 23 16:52:35 node4 kernel: (18902,0):dlm_wait_for_lock_mastery:1118
> ERROR: status = -11
> May 23 16:52:36 node4 kernel: (18902,0):dlm_get_lock_resource:976
> 9D0941F9B5B843E0B8F8C9FD7D514C35:D0000000000000000030b2be3ea1a0c: at
> least one node (3) torecover before lock mastery can begin
> May 23 16:52:37 node4 kernel: (22783,3):ocfs2_replay_journal:1167
> Recovering node 3 from slot 1 on device (8,49)
> May 23 16:52:39 node4 kernel: (22784,0):ocfs2_replay_journal:1167
> Recovering node 3 from slot 1 on device (8,81)
> May 23 16:52:40 node4 kernel: (22786,0):ocfs2_replay_journal:1167
> Recovering node 3 from slot 1 on device (8,161)
> May 23 16:52:44 node4 kernel: kjournald starting. Commit interval 5
> seconds
>
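
Since each mounted OCFS2 volume has its own DLM domain (the UUID in the messages), one way to chase the "only nine leaves messages" oddity is to count the distinct UUIDs that logged a leave and compare against the volumes node3 actually had mounted. A rough sketch; the sample lines below stand in for the real /var/log/messages (where each message is a single unwrapped line):

```shell
# Count distinct DLM domains that logged a "leaves domain" event.
# Hypothetical excerpt; in practice: grep 'leaves domain' /var/log/messages
sample='May 23 16:48:31 node2 kernel: ocfs2_dlm: Node 3 leaves domain 84407FC4A92E451DADEF260A2FE0E366
May 23 16:48:37 node2 kernel: ocfs2_dlm: Node 3 leaves domain 1FB62EB34D1F495A9F11F396E707588C
May 23 16:48:31 node4 kernel: ocfs2_dlm: Node 3 leaves domain 84407FC4A92E451DADEF260A2FE0E366'

# The UUID is the last field of each matching line; count unique ones.
count=$(printf '%s\n' "$sample" | awk '/leaves domain/ {print $NF}' | sort -u | wc -l | tr -d ' ')
echo "$count distinct domains"
```

Against the real log, if only nine of the thirteen domains report a clean leave, the remaining four were presumably handled by eviction/recovery instead; comparing the UUIDs with `mounted.ocfs2 -d` output would confirm which volumes are which.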
--
Marcus Alves Grando <marcus.grando [] terra.com.br>
Engineering Support 1
Terra Networks Brasil S/A
Tel: 55 (51) 3284-4238
What is your Terra?