[Ocfs2-users] iscsi connection drop, comes back in seconds, then deadlock in cluster
James Harrell
james at emediasales.com
Fri Nov 6 13:45:59 PST 2009
Greetings ocfs2 folks,
A client is experiencing some random deadlock issues within a cluster,
wondering if anyone can point us in the right direction. The iSCSI
connection seemed to have dropped on one node briefly, ultimately
several hours later landing us in a complete deadlock scenario where
multiple nodes (Node 7 and Node 8) had to be panic'd (by hand - they
didn't ever panic on their own) in order to get the cluster back. The
would not unmount the file system, so we hard-panic'd (insmod panic.ko).
Vitals:
- Linux mgr01 2.6.18-164.2.1.el5 #1 SMP Wed Sep 30 12:52:46 EDT 2009
x86_64 x86_64 x86_64 GNU/Linux
- ocfs2-2.6.18-164.2.1.el5-1.4.4-1.el5
- MD3000i iSCSI
- O2CB_HEARTBEAT_THRESHOLD=61
- O2CB_IDLE_TIMEOUT_MS=30000
- O2CB_KEEPALIVE_DELAY_MS=2000
- O2CB_RECONNECT_DELAY_MS=2000
- iSCSI node.session.timeo.replacement_timeout = 120
- iSCSI node.conn[0].timeo.noop_out_interval = 5
- node.conn[0].timeo.noop_out_timeout = 5
Logs at the time of failure on mgr01:
Nov 6 01:00:12 mgr01 kernel: connection1:0: ping timeout of 5 secs
expired, recv timeout 5, last rx 4325764337, last ping 4325769337, now
4325774337
Nov 6 01:00:12 mgr01 kernel: connection1:0: detected conn error (1011)
Nov 6 01:00:13 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error
(1011) state (3)
Nov 6 01:00:15 mgr01 iscsid: connection1:0 is operational after
recovery (1 attempts)
Nov 6 01:00:38 mgr01 kernel: o2net: no longer connected to node rack105
(num 7) at 10.244.1.105:7777
Nov 6 01:00:38 mgr01 kernel:
(3270,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Nov 6 01:00:38 mgr01 kernel: (3270,0):dlm_wait_for_node_death:370
4FF4E858AF6E4AEEB2650A543A320C2F: waiting 5000ms for notification of
death of node 7
Nov 6 01:00:38 mgr01 kernel: o2net: accepted connection from node
rack105 (num 7) at 10.244.1.105:7777
Nov 6 01:00:46 mgr01 kernel: connection1:0: ping timeout of 5 secs
expired, recv timeout 5, last rx 4325798155, last ping 4325803155, now
4325808155
Nov 6 01:00:46 mgr01 kernel: connection1:0: detected conn error (1011)
Nov 6 01:00:46 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error
(1011) state (3)
Nov 6 01:00:52 mgr01 iscsid: connection1:0 is operational after
recovery (1 attempts)
Nov 6 01:01:22 mgr01 kernel: connection1:0: ping timeout of 5 secs
expired, recv timeout 5, last rx 4325834382, last ping 4325839382, now
4325844382
Nov 6 01:01:22 mgr01 kernel: connection1:0: detected conn error (1011)
Nov 6 01:01:23 mgr01 iscsid: Kernel reported iSCSI connection 1:0 error
(1011) state (3)
Nov 6 01:01:28 mgr01 kernel: connection2:0: ping timeout of 5 secs
expired, recv timeout 5, last rx 4325840761, last ping 4325845761, now
4325850761
Nov 6 01:01:28 mgr01 kernel: connection2:0: detected conn error (1011)
Nov 6 01:01:29 mgr01 iscsid: Kernel reported iSCSI connection 2:0 error
(1011) state (3)
Nov 6 01:01:36 mgr01 kernel: o2net: no longer connected to node rack107
(num 9) at 10.244.1.107:7777
Nov 6 01:01:36 mgr01 kernel: o2net: accepted connection from node
rack107 (num 9) at 10.244.1.107:7777
Nov 6 01:01:40 mgr01 kernel: o2net: no longer connected to node rack105
(num 7) at 10.244.1.105:7777
Nov 6 01:01:40 mgr01 kernel: o2net: accepted connection from node
rack105 (num 7) at 10.244.1.105:7777
Nov 6 01:01:41 mgr01 iscsid: connection2:0 is operational after
recovery (1 attempts)
Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node db01
(num 1) at 10.244.1.120:7777
Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack104
(num 6) at 10.244.1.104:7777
Nov 6 01:01:42 mgr01 kernel: o2net: accepted connection from node
rack104 (num 6) at 10.244.1.104:7777
Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack103
(num 5) at 10.244.1.103:7777
Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack101
(num 3) at 10.244.1.101:7777
Nov 6 01:01:42 mgr01 kernel: o2net: accepted connection from node
rack101 (num 3) at 10.244.1.101:7777
Nov 6 01:01:42 mgr01 kernel: o2net: no longer connected to node rack106
(num 8) at 10.244.1.106:7777
Nov 6 01:01:42 mgr01 kernel:
(3216,0):dlm_send_remote_convert_request:395 ERROR: status = -112
Nov 6 01:01:42 mgr01 kernel: (3216,0):dlm_wait_for_node_death:370
5B4A18B8C0B64B05BAB5F5AA1C93D7E0: waiting 5000ms for notification of
death of node 8
Logs at time of failure on Node 7 (rack105):
Nov 6 01:00:38 rack105 kernel: o2net: connection to node mgr01 (num 0)
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov 6 01:00:38 rack105 kernel: (0,1):o2net_idle_timer:1503 here are
some times that might help debug the situation: (tmr 1257498008.773099
now 1257498038.772814 dr 1257498008.773068 adv
1257498008.773099:1257498008.773100 func (6055cf71:502)
1257498002.348395:1257498002.348398)
Nov 6 01:00:38 rack105 kernel: o2net: no longer connected to node mgr01
(num 0) at 10.244.1.100:7777
Nov 6 01:00:38 rack105 kernel: (4209,6):dlm_send_proxy_ast_msg:458
ERROR: status = -112
Nov 6 01:00:38 rack105 kernel: (4209,6):dlm_flush_asts:600 ERROR:
status = -112
Nov 6 01:00:38 rack105 kernel: o2net: connected to node mgr01 (num 0)
at 10.244.1.100:7777
Nov 6 01:01:40 rack105 kernel: o2net: connection to node mgr01 (num 0)
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov 6 01:01:40 rack105 kernel: (0,2):o2net_idle_timer:1503 here are
some times that might help debug the situation: (tmr 1257498070.407069
now 1257498100.406902 dr 1257498070.407060 adv
1257498070.407078:1257498070.407079 func (098ec5c8:504)
1257498043.875896:1257498043.875904)
Logs at time of failure on Node 8 (rack106):
Nov 6 01:01:42 rack106 kernel: o2net: connection to node mgr01 (num 0)
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov 6 01:01:42 rack106 kernel: (0,0):o2net_idle_timer:1503 here are
some times that might help debug the situation: (tmr 1257498072.440067
now 1257498102.440021 dr 1257498072.440044 adv
1257498072.440067:1257498072.440068 func (6055cf71:502)
1257498001.698233:1257498001.698236)
Nov 6 01:01:42 rack106 kernel: o2net: no longer connected to node mgr01
(num 0) at 10.244.1.100:7777
Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_send_proxy_ast_msg:458
ERROR: status = -112
Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_flush_asts:600 ERROR:
status = -112
Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_send_proxy_ast_msg:458
ERROR: status = -107
Nov 6 01:01:42 rack106 kernel: (4162,2):dlm_flush_asts:600 ERROR:
status = -107
Nov 6 01:01:42 rack106 kernel: o2net: connected to node mgr01 (num 0)
at 10.244.1.100:7777
Nov 6 01:02:20 rack106 kernel: o2net: connection to node mgr01 (num 0)
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov 6 01:02:20 rack106 kernel: (0,2):o2net_idle_timer:1503 here are
some times that might help debug the situation: (tmr 1257498110.160230
now 1257498140.160050 dr 1257498110.160211 adv
1257498110.160239:1257498110.160240 func (6055cf71:504)
1257498108.160393:1257498108.160401)
Nov 6 01:02:20 rack106 kernel: o2net: no longer connected to node mgr01
(num 0) at 10.244.1.100:7777
Nov 6 01:02:20 rack106 kernel: o2net: connected to node mgr01 (num 0)
at 10.244.1.100:7777
Nov 6 01:02:55 rack106 kernel: o2net: connection to node mgr01 (num 0)
at 10.244.1.100:7777 has been idle for 30.0 seconds, shutting it down.
Nov 6 01:02:55 rack106 kernel: (0,2):o2net_idle_timer:1503 here are
some times that might help debug the situation: (tmr 1257498145.231769
now 1257498175.231038 dr 1257498145.231761 adv
1257498145.231781:1257498145.231781 func (00000000:0) 0.0:0.0)
Nov 6 01:02:55 rack106 kernel: o2net: no longer connected to node mgr01
(num 0) at 10.244.1.100:7777
Nov 6 01:02:58 rack106 kernel: o2net: connected to node mgr01 (num 0)
at 10.244.1.100:7777
Thanks,
james
More information about the Ocfs2-users
mailing list