[Ocfs2-users] iscsi connection drop, comes back in seconds, then deadlock in cluster

James Harrell james at emediasales.com
Fri Nov 6 13:45:59 PST 2009


Greetings ocfs2 folks,

A client is experiencing some random deadlock issues within a cluster, 
wondering if anyone can point us in the right direction. The iSCSI 
connection seemed to have dropped on one node briefly, ultimately 
several hours later landing us in a complete deadlock scenario where 
multiple nodes (Node 7 and Node 8) had to be panic'd (by hand - they 
didn't ever panic on their own) in order to get the cluster back. The 
would not unmount the file system, so we hard-panic'd (insmod panic.ko).

Vitals:
 - Linux mgr01 2.6.18-164.2.1.el5 #1 SMP Wed Sep 30 12:52:46 EDT 2009 
x86_64 x86_64 x86_64 GNU/Linux
 - ocfs2-2.6.18-164.2.1.el5-1.4.4-1.el5
 - MD3000i iSCSI
 - O2CB_HEARTBEAT_THRESHOLD=61
 - O2CB_IDLE_TIMEOUT_MS=30000
 - O2CB_KEEPALIVE_DELAY_MS=2000
 - O2CB_RECONNECT_DELAY_MS=2000
 - iSCSI node.session.timeo.replacement_timeout = 120
 - iSCSI node.conn[0].timeo.noop_out_interval = 5
 - node.conn[0].timeo.noop_out_timeout = 5


Logs at the time of failure on mgr01:
Nov  6 01:00:12 mgr01 kernel:  connection1:0: ping timeout of 5 secs 
expired, recv timeout 5, last rx 4325764337, last ping 4325769337, now 
4325774337
Nov  6 01:00:12 mgr01 kernel:  connection1:0: detected conn error (1011)
Nov  6 01:00:13 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error 
(1011) state (3)
Nov  6 01:00:15 mgr01 iscsid: connection1:0 is operational after 
recovery (1 attempts)
Nov  6 01:00:38 mgr01 kernel: o2net: no longer connected to node rack105 
(num 7) at 10.244.1.105:7777
Nov  6 01:00:38 mgr01 kernel: 
(3270,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Nov  6 01:00:38 mgr01 kernel: (3270,0):dlm_wait_for_node_death:370 
4FF4E858AF6E4AEEB2650A543A320C2F: waiting 5000ms for notification of 
death of node 7
Nov  6 01:00:38 mgr01 kernel: o2net: accepted connection from node 
rack105 (num 7) at 10.244.1.105:7777
Nov  6 01:00:46 mgr01 kernel:  connection1:0: ping timeout of 5 secs 
expired, recv timeout 5, last rx 4325798155, last ping 4325803155, now 
4325808155
Nov  6 01:00:46 mgr01 kernel:  connection1:0: detected conn error (1011)
Nov  6 01:00:46 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error 
(1011) state (3)
Nov  6 01:00:52 mgr01 iscsid: connection1:0 is operational after 
recovery (1 attempts)
Nov  6 01:01:22 mgr01 kernel:  connection1:0: ping timeout of 5 secs 
expired, recv timeout 5, last rx 4325834382, last ping 4325839382, now 
4325844382
Nov  6 01:01:22 mgr01 kernel:  connection1:0: detected conn error (1011)
Nov  6 01:01:23 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error 
(1011) state (3)
Nov  6 01:01:28 mgr01 kernel:  connection2:0: ping timeout of 5 secs 
expired, recv timeout 5, last rx 4325840761, last ping 4325845761, now 
4325850761
Nov  6 01:01:28 mgr01 kernel:  connection2:0: detected conn error (1011)
Nov  6 01:01:29 mgr01 iscsid: Kernel reported iSCSI connection 2:0 error 
(1011) state (3)
Nov  6 01:01:36 mgr01 kernel: o2net: no longer connected to node rack107 
(num 9) at 10.244.1.107:7777
Nov  6 01:01:36 mgr01 kernel: o2net: accepted connection from node 
rack107 (num 9) at 10.244.1.107:7777
Nov  6 01:01:40 mgr01 kernel: o2net: no longer connected to node rack105 
(num 7) at 10.244.1.105:7777
Nov  6 01:01:40 mgr01 kernel: o2net: accepted connection from node 
rack105 (num 7) at 10.244.1.105:7777
Nov  6 01:01:41 mgr01 iscsid: connection2:0 is operational after 
recovery (1 attempts)
Nov  6 01:01:42 mgr01 kernel: o2net: no longer connected to node db01 
(num 1) at 10.244.1.120:7777
Nov  6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack104 
(num 6) at 10.244.1.104:7777
Nov  6 01:01:42 mgr01 kernel: o2net: accepted connection from node 
rack104 (num 6) at 10.244.1.104:7777
Nov  6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack103 
(num 5) at 10.244.1.103:7777
Nov  6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack101 
(num 3) at 10.244.1.101:7777
Nov  6 01:01:42 mgr01 kernel: o2net: accepted connection from node 
rack101 (num 3) at 10.244.1.101:7777
Nov  6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack106 
(num 8) at 10.244.1.106:7777
Nov  6 01:01:42 mgr01 kernel: 
(3216,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Nov  6 01:01:42 mgr01 kernel: (3216,0):dlm_wait_for_node_death:370 
5B4A18B8C0B64B05BAB5F5AA1C93D7E0: waiting 5000ms for notification of 
death of node 8


Logs at time of failure on Node 7 (rack105):
Nov  6 01:00:38 rack105 kernel: o2net: connection to node mgr01 (num 0) 
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov  6 01:00:38 rack105 kernel: (0,1):o2net_idle_timer:1503 here are 
some times that might help debug the situation: (tmr 1257498008.773099 
now 1257498038.772814 dr 1257498008.773068 adv 
1257498008.773099:1257498008.773100 func (6055cf71:502) 
1257498002.348395:1257498002.348398)
Nov  6 01:00:38 rack105 kernel: o2net: no longer connected to node mgr01 
(num 0) at 10.244.1.100:7777
Nov  6 01:00:38 rack105 kernel: (4209,6):dlm_send_proxy_ast_msg:458 
ERROR: status = -112
Nov  6 01:00:38 rack105 kernel: (4209,6):dlm_flush_asts:600 ERROR: 
status = -112
Nov  6 01:00:38 rack105 kernel: o2net: connected to node mgr01 (num 0) 
at 10.244.1.100:7777
Nov  6 01:01:40 rack105 kernel: o2net: connection to node mgr01 (num 0) 
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov  6 01:01:40 rack105 kernel: (0,2):o2net_idle_timer:1503 here are 
some times that might help debug the situation: (tmr 1257498070.407069 
now 1257498100.406902 dr 1257498070.407060 adv 
1257498070.407078:1257498070.407079 func (098ec5c8:504) 
1257498043.875896:1257498043.875904)



Logs at time of failure on Node 8 (rack106):
Nov  6 01:01:42 rack106 kernel: o2net: connection to node mgr01 (num 0) 
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov  6 01:01:42 rack106 kernel: (0,0):o2net_idle_timer:1503 here are 
some times that might help debug the situation: (tmr 1257498072.440067 
now 1257498102.440021 dr 1257498072.440044 adv 
1257498072.440067:1257498072.440068 func (6055cf71:502) 
1257498001.698233:1257498001.698236)
Nov  6 01:01:42 rack106 kernel: o2net: no longer connected to node mgr01 
(num 0) at 10.244.1.100:7777
Nov  6 01:01:42 rack106 kernel: (4162,2):dlm_send_proxy_ast_msg:458 
ERROR: status = -112
Nov  6 01:01:42 rack106 kernel: (4162,2):dlm_flush_asts:600 ERROR: 
status = -112
Nov  6 01:01:42 rack106 kernel: (4162,2):dlm_send_proxy_ast_msg:458 
ERROR: status = -107
Nov  6 01:01:42 rack106 kernel: (4162,2):dlm_flush_asts:600 ERROR: 
status = -107
Nov  6 01:01:42 rack106 kernel: o2net: connected to node mgr01 (num 0) 
at 10.244.1.100:7777
Nov  6 01:02:20 rack106 kernel: o2net: connection to node mgr01 (num 0) 
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov  6 01:02:20 rack106 kernel: (0,2):o2net_idle_timer:1503 here are 
some times that might help debug the situation: (tmr 1257498110.160230 
now 1257498140.160050 dr 1257498110.160211 adv 
1257498110.160239:1257498110.160240 func (6055cf71:504) 
1257498108.160393:1257498108.160401)
Nov  6 01:02:20 rack106 kernel: o2net: no longer connected to node mgr01 
(num 0) at 10.244.1.100:7777
Nov  6 01:02:20 rack106 kernel: o2net: connected to node mgr01 (num 0) 
at 10.244.1.100:7777
Nov  6 01:02:55 rack106 kernel: o2net: connection to node mgr01 (num 0) 
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov  6 01:02:55 rack106 kernel: (0,2):o2net_idle_timer:1503 here are 
some times that might help debug the situation: (tmr 1257498145.231769 
now 1257498175.231038 dr 1257498145.231761 adv 
1257498145.231781:1257498145.231781 func (00000000:0) 0.0:0.0)
Nov  6 01:02:55 rack106 kernel: o2net: no longer connected to node mgr01 
(num 0) at 10.244.1.100:7777
Nov  6 01:02:58 rack106 kernel: o2net: connected to node mgr01 (num 0) 
at 10.244.1.100:7777

Thanks,
james



More information about the Ocfs2-users mailing list