[Ocfs2-users] Is "node down!" related to SVN rev 3004?

Sunil Mushran Sunil.Mushran at oracle.com
Thu May 24 09:31:27 PDT 2007


While a reproducible testcase is ideal, it is not a must for filing
a bugzilla.

Please file one with node3's netdump output. Output from the other
nodes could be useful as well.
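
If netdump is not set up, netconsole is often the quickest way to capture
the crash output: the kernel sends its printk messages as plain UDP
datagrams to a remote host, so any listener on that host can record them.
Below is a minimal, hypothetical sketch of such a receiver; the port
number (6666) and the log file name are assumptions, not part of any
standard tool, and the sending side still has to be configured separately
on node3.

    # netconsole-listen.py -- minimal sketch of a netconsole receiver.
    # Assumptions (not from this thread): node3's netconsole module is
    # configured to send to this host on UDP port 6666, and we simply
    # append whatever arrives to node3-netconsole.log so the oops/panic
    # text survives node3's reboot.
    import socket

    LISTEN_PORT = 6666
    LOG_FILE = "node3-netconsole.log"

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", LISTEN_PORT))

    with open(LOG_FILE, "a") as log:
        while True:
            # Each datagram is a chunk of kernel printk output from node3.
            data, _addr = sock.recvfrom(4096)
            log.write(data.decode("utf-8", errors="replace"))
            log.flush()

The kernel side can then be enabled on node3 with something along the
lines of "modprobe netconsole netconsole=@/,6666@<log-host-ip>/<log-host-mac>"
(check the netconsole documentation for the exact syntax on your kernel).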

Marcus Alves Grando wrote:
> I can't reproduce it; it happened only once.
>
> Why would I open a bug if I can't reproduce it? First I need more 
> information on how to reproduce it.
>
> Regards
>
> Sunil Mushran wrote:
>> Such issues are handled best via bugzilla. File one
>> on oss.oracle.com/bugzilla with all the details.
>>
>> The most important detail would be node3's netdump
>> or netconsole output. The real reason for the outage
>> will be in that dump.
>>
>> Marcus Alves Grando wrote:
>>> Hi list,
>>>
>>> Today I had a problem with ocfs2: one server stopped accessing the ocfs2 
>>> disks, and the only messages in /var/log/messages are:
>>>
>>> May 23 16:24:26 node3 kernel: (6956,3):dlm_restart_lock_mastery:1301 
>>> ERROR: node down! 1
>>> May 23 16:24:26 node3 kernel: 
>>> (6956,3):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
>>>
>>> I don't know what happened. Could this be related to the rev 3004 fix? 
>>> Has anyone seen this before?
>>>
>>> Another strange fact: all nodes mount 13 SAN disks, but the "leaves" 
>>> messages occur only nine times.
>>>
>>> Also, node1 has been down for maintenance since 08:30.
>>>
>>> The other servers show these messages:
>>>
>>> **** node2
>>>
>>> May 23 16:48:31 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 84407FC4A92E451DADEF260A2FE0E366
>>> May 23 16:48:31 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
>>> May 23 16:48:37 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 1FB62EB34D1F495A9F11F396E707588C
>>> May 23 16:48:37 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
>>> May 23 16:48:42 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 793ACD36E8CA4067AB99F9F4F2229634
>>> May 23 16:48:42 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
>>> May 23 16:48:48 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> ECECF9980CBD44EFA7E8A950EDE40573
>>> May 23 16:48:48 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
>>> May 23 16:48:53 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> D8AFCBD0CF59404991FAB19916CEE08B
>>> May 23 16:48:53 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
>>> May 23 16:48:58 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> E8B0A018151943A28674662818529F0F
>>> May 23 16:48:58 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("E8B0A018151943A28674662818529F0F"): 2 4
>>> May 23 16:49:03 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 3D227E224D0D4D9F97B84B0BB7DE7E22
>>> May 23 16:49:03 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
>>> May 23 16:49:09 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> C40090C8D14D48C9AC0D1024A228EC59
>>> May 23 16:49:09 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
>>> May 23 16:49:14 node2 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 9CB224941DC64A39872A5012FBD12354
>>> May 23 16:49:14 node2 kernel: ocfs2_dlm: Nodes in domain 
>>> ("9CB224941DC64A39872A5012FBD12354"): 2 4
>>> May 23 16:50:04 node2 kernel: o2net: connection to node 
>>> node3.hst.host (num 3) at 192.168.0.3:7777 has been idle for 30.0 
>>> seconds, shutting it down.
>>> May 23 16:50:04 node2 kernel: (0,3):o2net_idle_timer:1418 here are 
>>> some times that might help debug the situation: (tmr 
>>> 1179949774.944117 now 1179949804.944585 dr 1179949774.944109 adv 
>>> 1179949774.944119:1179949774.944121 func (d21ddb4d:513) 
>>> 1179949754.944260:1179949754.944271)
>>> May 23 16:50:04 node2 kernel: o2net: no longer connected to node 
>>> node3.hst.host (num 3) at 192.168.0.3:7777
>>> May 23 16:52:31 node2 kernel: 
>>> (23351,2):dlm_send_remote_convert_request:398 ERROR: status = -107
>>> May 23 16:52:31 node2 kernel: (23351,2):dlm_wait_for_node_death:365 
>>> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of 
>>> death of node 3
>>> May 23 16:52:32 node2 kernel: (4379,3):ocfs2_dlm_eviction_cb:119 
>>> device (8,49): dlm has evicted node 3
>>> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:921 
>>> BD2D6C1943FB4771B018EA2A7D056E8A:$RECOVERY: at least one node (3) 
>>> torecover before lock mastery can begin
>>> May 23 16:52:32 node2 kernel: (4451,1):dlm_get_lock_resource:955 
>>> BD2D6C1943FB4771B018EA2A7D056E8A: recovery map is not empty, but 
>>> must master $RECOVERY lock now
>>> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:921 
>>> 36D7DEC36FC44C53A6107B6A9CE863A2:$RECOVERY: at least one node (3) 
>>> torecover before lock mastery can begin
>>> May 23 16:52:33 node2 kernel: (4441,2):dlm_get_lock_resource:955 
>>> 36D7DEC36FC44C53A6107B6A9CE863A2: recovery map is not empty, but 
>>> must master $RECOVERY lock now
>>> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:921 
>>> 9D0941F9B5B843E0B8F8C9FD7D514C35:$RECOVERY: at least one node (3) 
>>> torecover before lock mastery can begin
>>> May 23 16:52:34 node2 kernel: (4491,2):dlm_get_lock_resource:955 
>>> 9D0941F9B5B843E0B8F8C9FD7D514C35: recovery map is not empty, but 
>>> must master $RECOVERY lock now
>>> May 23 16:52:37 node2 kernel: (23351,2):ocfs2_replay_journal:1167 
>>> Recovering node 3 from slot 1 on device (8,97)
>>> May 23 16:52:42 node2 kernel: kjournald starting.  Commit interval 5 
>>> seconds
>>>
>>> **** node4
>>>
>>> May 23 16:48:31 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 84407FC4A92E451DADEF260A2FE0E366
>>> May 23 16:48:31 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("84407FC4A92E451DADEF260A2FE0E366"): 2 4
>>> May 23 16:48:37 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 1FB62EB34D1F495A9F11F396E707588C
>>> May 23 16:48:37 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("1FB62EB34D1F495A9F11F396E707588C"): 2 4
>>> May 23 16:48:42 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 793ACD36E8CA4067AB99F9F4F2229634
>>> May 23 16:48:42 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("793ACD36E8CA4067AB99F9F4F2229634"): 2 4
>>> May 23 16:48:48 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> ECECF9980CBD44EFA7E8A950EDE40573
>>> May 23 16:48:48 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("ECECF9980CBD44EFA7E8A950EDE40573"): 2 4
>>> May 23 16:48:53 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> D8AFCBD0CF59404991FAB19916CEE08B
>>> May 23 16:48:53 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("D8AFCBD0CF59404991FAB19916CEE08B"): 2 4
>>> May 23 16:48:58 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> E8B0A018151943A28674662818529F0F
>>> May 23 16:48:58 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("E8B0A018151943A28674662818529F0F"): 2 4
>>> May 23 16:49:03 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 3D227E224D0D4D9F97B84B0BB7DE7E22
>>> May 23 16:49:03 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("3D227E224D0D4D9F97B84B0BB7DE7E22"): 2 4
>>> May 23 16:49:09 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> C40090C8D14D48C9AC0D1024A228EC59
>>> May 23 16:49:09 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("C40090C8D14D48C9AC0D1024A228EC59"): 2 4
>>> May 23 16:49:14 node4 kernel: ocfs2_dlm: Node 3 leaves domain 
>>> 9CB224941DC64A39872A5012FBD12354
>>> May 23 16:49:14 node4 kernel: ocfs2_dlm: Nodes in domain 
>>> ("9CB224941DC64A39872A5012FBD12354"): 2 4
>>> May 23 16:50:04 node4 kernel: o2net: connection to node 
>>> node3.hst.host (num 3) at 192.168.0.3:7777 has been idle for 30.0 
>>> seconds, shutting it down.
>>> May 23 16:50:04 node4 kernel: (19355,0):o2net_idle_timer:1418 here 
>>> are some times that might help debug the situation: (tmr 
>>> 1179949774.943813 now 1179949804.944242 dr 1179949774.943805 adv 
>>> 1179949774.943815:1179949774.943817 func (d21ddb4d:513) 
>>> 1179949754.944088:1179949754.944097)
>>> May 23 16:50:04 node4 kernel: o2net: no longer connected to node 
>>> node3.hst.host (num 3) at 192.168.0.3:7777
>>> May 23 16:50:04 node4 kernel: (18902,0):dlm_do_master_request:1418 
>>> ERROR: link to 3 went down!
>>> May 23 16:50:04 node4 kernel: (18902,0):dlm_get_lock_resource:995 
>>> ERROR: status = -112
>>> May 23 16:52:31 node4 kernel: 
>>> (22785,3):dlm_send_remote_convert_request:398 ERROR: status = -107
>>> May 23 16:52:31 node4 kernel: (22785,3):dlm_wait_for_node_death:365 
>>> BD2D6C1943FB4771B018EA2A7D056E8A: waiting 5000ms for notification of 
>>> death of node 3
>>> May 23 16:52:31 node4 kernel: (22786,3):dlm_get_lock_resource:921 
>>> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at 
>>> least one node (3) torecover before lock mastery can begin
>>> May 23 16:52:31 node4 kernel: (22784,3):dlm_get_lock_resource:921 
>>> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at 
>>> least one node (3) torecover before lock mastery can begin
>>> May 23 16:52:31 node4 kernel: (22783,3):dlm_get_lock_resource:921 
>>> A85D18C01AE747AC905343D919B60525:M000000000000000000021535d8e891: at 
>>> least one node (3) torecover before lock mastery can begin
>>> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:921 
>>> A85D18C01AE747AC905343D919B60525:$RECOVERY: at least one node (3) 
>>> torecover before lock mastery can begin
>>> May 23 16:52:31 node4 kernel: (4525,3):dlm_get_lock_resource:955 
>>> A85D18C01AE747AC905343D919B60525: recovery map is not empty, but 
>>> must master $RECOVERY lock now
>>> May 23 16:52:32 node4 kernel: (22786,3):dlm_get_lock_resource:976 
>>> 9D0941F9B5B843E0B8F8C9FD7D514C35:M0000000000000000000215b9fa93cd: at 
>>> least one node (3) torecover before lock mastery can begin
>>> May 23 16:52:32 node4 kernel: (22784,3):dlm_get_lock_resource:976 
>>> 36D7DEC36FC44C53A6107B6A9CE863A2:M0000000000000000000215b39ab40a: at 
>>> least one node (3) torecover before lock mastery can begin
>>> May 23 16:52:32 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
>>> device (8,97): dlm has evicted node 3
>>> May 23 16:52:33 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
>>> device (8,81): dlm has evicted node 3
>>> May 23 16:52:34 node4 kernel: (4483,0):ocfs2_dlm_eviction_cb:119 
>>> device (8,161): dlm has evicted node 3
>>> May 23 16:52:35 node4 kernel: 
>>> (18902,0):dlm_restart_lock_mastery:1301 ERROR: node down! 3
>>> May 23 16:52:35 node4 kernel: 
>>> (18902,0):dlm_wait_for_lock_mastery:1118 ERROR: status = -11
>>> May 23 16:52:36 node4 kernel: (18902,0):dlm_get_lock_resource:976 
>>> 9D0941F9B5B843E0B8F8C9FD7D514C35:D0000000000000000030b2be3ea1a0c: at 
>>> least one node (3) torecover before lock mastery can begin
>>> May 23 16:52:37 node4 kernel: (22783,3):ocfs2_replay_journal:1167 
>>> Recovering node 3 from slot 1 on device (8,49)
>>> May 23 16:52:39 node4 kernel: (22784,0):ocfs2_replay_journal:1167 
>>> Recovering node 3 from slot 1 on device (8,81)
>>> May 23 16:52:40 node4 kernel: (22786,0):ocfs2_replay_journal:1167 
>>> Recovering node 3 from slot 1 on device (8,161)
>>> May 23 16:52:44 node4 kernel: kjournald starting.  Commit interval 5 
>>> seconds
>>>
>>
>>
>



