[Ocfs2-users] Is "node down!" related to SVN rev 3004?

Marcus Alves Grando marcus.grando at terra.com.br
Thu May 24 09:12:45 PDT 2007


I can't reproduce it; it occurred only once.

Why would I open a bug if I can't reproduce it? First I need more 
information on how to reproduce it.

Regards

Sunil Mushran wrote:
> Such issues are handled best via bugzilla. File one
> on oss.oracle.com/bugzilla with all the details.
> 
> The most important detail would be node3's netdump
> or netconsole output. The real reason for the outage
> will be in that dump.
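> 
> If netconsole is not already set up, something like the following 
> minimal sketch should work (adjust the interface, UDP port, and the 
> logging host's IP for your environment; the values below are only 
> placeholders):
> 
>     # on node3: stream kernel messages over UDP to a logging host
>     modprobe netconsole netconsole=@/eth0,6666@192.168.0.10/
> 
>     # on the logging host: capture the stream to a file
>     # (traditional netcat syntax; flags vary between netcat flavors)
>     nc -u -l -p 6666 > node3-netconsole.log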
> 
> Marcus Alves Grando wrote:
>> Hi list,
>>
>> Today I had a problem with OCFS2: one server stopped accessing its 
>> OCFS2 disks, and the only messages in /var/log/messages were:
>>
>> May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301 
>> ERROR: node down! 1
>> May 23 16:24:26 node3 kernel: (6956,3):dlm_wait_for_lock_mastery:1118 
>> ERROR: status = -11
>>
>> I don't know what happened. Could this be related to the SVN rev 3004 
>> fix? Has anyone else seen this?
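>>
>> To double-check which ocfs2 build each node is actually running, a 
>> quick look at the module info should work:
>>
>>     modinfo ocfs2 | grep -i version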
>>
>> Another strange fact: all nodes mount 13 SAN disks, yet the "leaves" 
>> messages appear only nine times.
>>
>> Also, node1 has been down for maintenance since 08:30.
>>
>> The other servers logged these messages:
>>
>> **** node2
>>
>> May 23 16:48:31 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 84407FC4A92E451DADEF260A2FE0E366
>> May 23 16:48:31 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
>> May 23 16:48:37 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 1FB62EB34D1F495A9F11F396E707588C
>> May 23 16:48:37 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
>> May 23 16:48:42 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 793ACD36E8CA4067AB99F9F4F2229634
>> May 23 16:48:42 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
>> May 23 16:48:48 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> ECECF9980CBD44EFA7E8A950EDE40573
>> May 23 16:48:48 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
>> May 23 16:48:53 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> D8AFCBD0CF59404991FAB19916CEE08B
>> May 23 16:48:53 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
>> May 23 16:48:58 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> E8B0A018151943A28674662818529F0F
>> May 23 16:48:58 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("E8B0A018151943A28674662818529F0F"): 2 4
>> May 23 16:49:03 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 3D227E224D0D4D9F97B84B0BB7DE7E22
>> May 23 16:49:03 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
>> May 23 16:49:09 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> C40090C8D14D48C9AC0D1024A228EC59
>> May 23 16:49:09 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
>> May 23 16:49:14 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 9CB224941DC64A39872A5012FBD12354
>> May 23 16:49:14 node2 kernel: ocfs2_dlm: Nodes in domain 
>> ("9CB224941DC64A39872A5012FBD12354"): 2 4
>> May 23 16:50:04 node2 kernel: o2net: connection to node node3.hst.host 
>> (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting 
>> it down.
>> May 23 16:50:04 node2 kernel: (0,3):o2net_idle_timer:1418 here are 
>> some times that might help debug the situation: (tmr 1179949774.944117 
>> now 1179949804.944585 dr 1179949774.944109 adv 
>> 1179949774.944119:1179949774.944121 func (d21ddb4d:513) 
>> 1179949754.944260:1179949754.944271)
>> May 23 16:50:04 node2 kernel: o2net: no longer connected to node 
>> node3.hst.host (num 3) at 192.168.0.3:7777
>> May 23 16:52:31 node2 kernel: 
>> (23351,2):dlm_send_remote_convert_request:398 ERROR: status = -107
>> May 23 16:52:31 node2 kernel: (23351,2):dlm_wait_for_node_death:365 
>> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of 
>> death of node 3
>> May 23 16:52:32 node2 kernel: (4379,3):ocfs2_dlm_eviction_cb:119 
>> device (8,49): dlm has evicted node 3
>> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:921 
>> BD2D6C1943FB4771B018EA2A7D056E8A:$RECOVERY: at least one node (3) 
>> torecover before lock mastery can begin
>> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:955 
>> BD2D6C1943FB4771B018EA2A7D056E8A: recovery map is not empty, but must 
>> master $RECOVERY lock now
>> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:921 
>> 36D7DEC36FC44C53A6107B6A9CE863A2:$RECOVERY: at least one node (3) 
>> torecover before lock mastery can begin
>> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:955 
>> 36D7DEC36FC44C53A6107B6A9CE863A2: recovery map is not empty, but must 
>> master $RECOVERY lock now
>> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:921 
>> 9D0941F9B5B843E0B8F8C9FD7D514C35:$RECOVERY: at least one node (3) 
>> torecover before lock mastery can begin
>> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:955 
>> 9D0941F9B5B843E0B8F8C9FD7D514C35: recovery map is not empty, but must 
>> master $RECOVERY lock now
>> May 23 16:52:37 node2 kernel: (23351,2):ocfs2_replay_journal:1167 
>> Recovering node 3 from slot 1 on device (8,97)
>> May 23 16:52:42 node2 kernel: kjournald starting.  Commit interval 5 
>> seconds
>>
>> **** node4
>>
>> May 23 16:48:31 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 84407FC4A92E451DADEF260A2FE0E366
>> May 23 16:48:31 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
>> May 23 16:48:37 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 1FB62EB34D1F495A9F11F396E707588C
>> May 23 16:48:37 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
>> May 23 16:48:42 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 793ACD36E8CA4067AB99F9F4F2229634
>> May 23 16:48:42 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
>> May 23 16:48:48 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> ECECF9980CBD44EFA7E8A950EDE40573
>> May 23 16:48:48 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
>> May 23 16:48:53 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> D8AFCBD0CF59404991FAB19916CEE08B
>> May 23 16:48:53 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
>> May 23 16:48:58 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> E8B0A018151943A28674662818529F0F
>> May 23 16:48:58 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("E8B0A018151943A28674662818529F0F"): 2 4
>> May 23 16:49:03 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 3D227E224D0D4D9F97B84B0BB7DE7E22
>> May 23 16:49:03 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
>> May 23 16:49:09 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> C40090C8D14D48C9AC0D1024A228EC59
>> May 23 16:49:09 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
>> May 23 16:49:14 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>> 9CB224941DC64A39872A5012FBD12354
>> May 23 16:49:14 node4 kernel: ocfs2_dlm: Nodes in domain 
>> ("9CB224941DC64A39872A5012FBD12354"): 2 4
>> May 23 16:50:04 node4 kernel: o2net: connection to node node3.hst.host 
>> (num 3) at 192.168.0.3:7777 has been idle for 30.0 seconds, shutting 
>> it down.
>> May 23 16:50:04 node4 kernel: (19355,0):o2net_idle_timer:1418 here are 
>> some times that might help debug the situation: (tmr 1179949774.943813 
>> now 1179949804.944242 dr 1179949774.943805 adv 
>> 1179949774.943815:1179949774.943817 func (d21ddb4d:513) 
>> 1179949754.944088:1179949754.944097)
>> May 23 16:50:04 node4 kernel: o2net: no longer connected to node 
>> node3.hst.host (num 3) at 192.168.0.3:7777
>> May 23 16:50:04 node4 kernel: (18902,0):dlm_do_master_request:1418 
>> ERROR: link to 3 went down!
>> May 23 16:50:04 node4 kernel: (18902,0):dlm_get_lock_resource:995 
>> ERROR: status = -112
>> May 23 16:52:31 node4 kernel: 
>> (22785,3):dlm_send_remote_convert_request:398 ERROR: status = -107
>> May 23 16:52:31 node4 kernel: (22785,3):dlm_wait_for_node_death:365 
>> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of 
>> death of node 3
>> May 23 16:52:31 node4 kernel: (22786,3):dlm_get_lock_resource:921 
>> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at 
>> least one node (3) torecover before lock mastery can begin
>> May 23 16:52:31 node4 kernel: (22784,3):dlm_get_lock_resource:921 
>> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at 
>> least one node (3) torecover before lock mastery can begin
>> May 23 16:52:31 node4 kernel: (22783,3):dlm_get_lock_resource:921 
>> A85D18C01AE747AC905343D919B60525:M000000000000000000021535d8e891: at 
>> least one node (3) torecover before lock mastery can begin
>> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:921 
>> A85D18C01AE747AC905343D919B60525:$RECOVERY: at least one node (3) 
>> torecover before lock mastery can begin
>> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:955 
>> A85D18C01AE747AC905343D919B60525: recovery map is not empty, but must 
>> master $RECOVERY lock now
>> May 23 16:52:32 node4 kernel: (22786,3):dlm_get_lock_resource:976 
>> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at 
>> least one node (3) torecover before lock mastery can begin
>> May 23 16:52:32 node4 kernel: (22784,3):dlm_get_lock_resource:976 
>> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at 
>> least one node (3) torecover before lock mastery can begin
>> May 23 16:52:32 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
>> device (8,97): dlm has evicted node 3
>> May 23 16:52:33 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
>> device (8,81): dlm has evicted node 3
>> May 23 16:52:34 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
>> device (8,161): dlm has evicted node 3
>> May 23 16:52:35 node4 kernel: (18902,0):dlm_restart_lock_mastery:1301 
>> ERROR: node down! 3
>> May 23 16:52:35 node4 kernel: (18902,0):dlm_wait_for_lock_mastery:1118 
>> ERROR: status = -11
>> May 23 16:52:36 node4 kernel: (18902,0):dlm_get_lock_resource:976 
>> 9D0941F9B5B843E0B8F8C9FD7D514C35:D0000000000000000030b2be3ea1a0c: at 
>> least one node (3) torecover before lock mastery can begin
>> May 23 16:52:37 node4 kernel: (22783,3):ocfs2_replay_journal:1167 
>> Recovering node 3 from slot 1 on device (8,49)
>> May 23 16:52:39 node4 kernel: (22784,0):ocfs2_replay_journal:1167 
>> Recovering node 3 from slot 1 on device (8,81)
>> May 23 16:52:40 node4 kernel: (22786,0):ocfs2_replay_journal:1167 
>> Recovering node 3 from slot 1 on device (8,161)
>> May 23 16:52:44 node4 kernel: kjournald starting.  Commit interval 5 
>> seconds
>>
> 

-- 
Marcus Alves Grando <marcus.grando [] terra.com.br>
Engineering Support 1
Terra Networks Brasil S/A
Tel: 55 (51) 3284-4238
