[Ocfs2-users] nodes do not reconnect after network failure

info+ocfs at polnik.de info+ocfs at polnik.de
Thu Oct 27 04:31:45 PDT 2011


Hi,

I use ocfs2 with an iSCSI device on 4 servers (vmhost1 - vmhost4) and am
trying to simulate a network problem with iptables.

uname -a
Linux vmhost3 2.6.39-gentoo-r3 #1 SMP Tue Sep 27 12:07:18 CEST 2011 i686
Intel(R) Xeon(R) CPU X5650 @ 2.67GHz GenuineIntel GNU/Linux

ocfs2-tools 1.6.4

#  grep -v '#\|^$'  /etc/conf.d/ocfs2
echo 1 > /proc/sys/kernel/panic_on_oops
echo 30 > /proc/sys/kernel/panic
OCFS2_CLUSTER="vmhostfiles"
OCFS2_IDLE_TIMEOUT_MS="30000"
OCFS2_KEEPALIVE_DELAY_MS="2000"
OCFS2_RECONNECT_DELAY_MS="2000"
OCFS2_DEAD_THRESHOLD="61"
OCFS2_FSCK="-fy"
OCFS2_FSCK_SWAPOFF="yes"
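
To make the timing values above easier to compare with the log messages, here is a small sketch that converts them to seconds. It assumes that OCFS2_DEAD_THRESHOLD is passed on as the o2cb disk heartbeat threshold, counted in 2-second heartbeat iterations, so that the effective disk timeout is (threshold - 1) * 2 seconds. The helper names ms_to_s and disk_heartbeat_timeout_s are made up for illustration.

```shell
#!/bin/sh
# Values copied from /etc/conf.d/ocfs2 above.
OCFS2_IDLE_TIMEOUT_MS=30000
OCFS2_KEEPALIVE_DELAY_MS=2000
OCFS2_RECONNECT_DELAY_MS=2000
OCFS2_DEAD_THRESHOLD=61

# ms_to_s (hypothetical helper): print milliseconds as seconds
# with one decimal place, e.g. 30000 -> 30.0
ms_to_s() {
    printf '%d.%d\n' $(( $1 / 1000 )) $(( ($1 % 1000) / 100 ))
}

# disk_heartbeat_timeout_s (hypothetical helper): assumes the threshold
# counts 2-second heartbeat iterations, so timeout = (threshold - 1) * 2.
disk_heartbeat_timeout_s() {
    echo $(( ($1 - 1) * 2 ))
}

echo "network idle timeout:   $(ms_to_s "$OCFS2_IDLE_TIMEOUT_MS") s"
echo "keepalive delay:        $(ms_to_s "$OCFS2_KEEPALIVE_DELAY_MS") s"
echo "reconnect delay:        $(ms_to_s "$OCFS2_RECONNECT_DELAY_MS") s"
echo "disk heartbeat timeout: $(disk_heartbeat_timeout_s "$OCFS2_DEAD_THRESHOLD") s"
```

The 30.0 s idle timeout appears to line up with the "after 30.0 seconds" in the o2net_connect_expired messages quoted further down.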

Test: What happens if one node can't communicate with one other node?

1. Step (simulate a network failure)
vmhost1:
iptables -A INPUT -p tcp -s vmhost2 -j DROP

=> The mounted ocfs2 device is no longer accessible on any of the 4 nodes.
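
The rule above can be paired with its exact inverse, so that the simulated failure is later removed without touching any other firewall rules (iptables -F without a chain argument flushes every chain in the filter table). The following is only a sketch: the wrapper run and the functions block_peer and unblock_peer are hypothetical names, and the script prints the commands by default instead of executing them.

```shell
#!/bin/sh
# Dry run by default: print the commands instead of executing them.
# Set DRY_RUN=0 (as root, on vmhost1) to really apply them.
DRY_RUN=${DRY_RUN:-1}
PEER=${PEER:-vmhost2}   # node to cut off, as in the test above

# run (hypothetical helper): echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# block_peer: append the same DROP rule that the test uses.
block_peer() {
    run iptables -A INPUT -p tcp -s "$PEER" -j DROP
}

# unblock_peer: delete exactly that rule again. -D takes the same rule
# specification as -A and leaves all other rules in place.
unblock_peer() {
    run iptables -D INPUT -p tcp -s "$PEER" -j DROP
}

block_peer
unblock_peer
```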

syslog messages from vmhost1/2:

Oct 27 12:41:18 vmhost2 kernel:
(kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:41:18 vmhost1 kernel:
(kworker/u:6,1168,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:41:48 vmhost2 kernel:
(kworker/u:6,1168,7):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:41:48 vmhost1 kernel:
(kworker/u:5,1150,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:18 vmhost2 kernel:
(kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:18 vmhost1 kernel:
(kworker/u:6,1168,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:33 vmhost1 kernel:
(kworker/u:6,1168,0):dlm_do_assert_master:1661 ERROR: Error -107 when
sending message 502 (key 0x6aa537f1) to node 1
Oct 27 12:42:33 vmhost1 kernel:
(dlm_thread,3143,6):dlm_send_proxy_ast_msg:484 ERROR:
3EF4047BABBC4DAD9E52FFEAECC8DED8: res P000000000000000000000000000000,
error -107 send AST to node 1
Oct 27 12:42:33 vmhost1 kernel: (dlm_thread,3143,6):dlm_flush_asts:605
ERROR: status = -107
Oct 27 12:42:48 vmhost2 kernel:
(kworker/u:6,1168,7):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:42:48 vmhost1 kernel:
(kworker/u:6,1168,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 12:43:18 vmhost2 kernel:
(kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 12:43:18 vmhost1 kernel:
(kworker/u:5,1150,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.
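
The messages above repeat every 30 seconds on both hosts. A rough way to summarize them is sketched below. It assumes that each message is a single line in the real syslog file (the lines are wrapped only in this mail) and that the host name is the fourth field; count_connect_expired is a hypothetical helper name. As a side note, error -107 in the dlm messages is -ENOTCONN on Linux ("Transport endpoint is not connected").

```shell
#!/bin/sh
# count_connect_expired (hypothetical helper): read syslog lines on stdin
# and print, per reporting host and peer node, how often
# o2net_connect_expired was logged.
count_connect_expired() {
    awk '/o2net_connect_expired/ {
             peer = "?"
             # The peer number is the word that follows "node".
             for (i = 1; i < NF; i++)
                 if ($i == "node") peer = $(i + 1)
             # Field 4 is the host name in the syslog format shown above.
             count[$4 " -> node " peer]++
         }
         END { for (k in count) print k ": " count[k] }' | sort
}

# Example with one (unwrapped) message from above:
printf '%s\n' \
  'Oct 27 12:41:18 vmhost2 kernel: (kworker/u:4,1149,7):o2net_connect_expired:1724 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.' \
  | count_connect_expired
# prints: vmhost2 -> node 0: 1
```

On a real system the function would be fed from the log file, for example count_connect_expired < /var/log/messages, where the path depends on the syslog configuration.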



2. Step (network failure is resolved)
vmhost1:
iptables -F

... but vmhost1 and vmhost2 (nodes 0 and 1 in the logs) still do not communicate.

Oct 27 13:17:54 vmhost2 kernel:
(kworker/u:0,3354,0):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 13:17:54 vmhost1 kernel:
(kworker/u:0,3365,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.
Oct 27 13:18:24 vmhost2 kernel:
(kworker/u:6,1168,0):o2net_connect_expired:1724 ERROR: no connection
established with node 0 after 30.0 seconds, giving up and returning errors.
Oct 27 13:18:24 vmhost1 kernel:
(kworker/u:1,4053,1):o2net_connect_expired:1724 ERROR: no connection
established with node 1 after 30.0 seconds, giving up and returning errors.


I checked with tcpdump: ping works fine, but ocfs2 on vmhost1 does not
send any packets to vmhost2 and vice versa, even though the syslog
messages suggest that both nodes try to establish a connection and fail.
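
The checks described above could be collected in a small script like the following sketch. Several values are assumptions and not taken from this setup: port 7777 is the usual o2cb default (the real value is the ip_port entry in cluster.conf), eth0 is a placeholder interface name, and the function names are hypothetical. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry run by default: print the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
PEER=${PEER:-vmhost2}   # the other node, as seen from vmhost1
PORT=${PORT:-7777}      # assumption: default o2cb port, see ip_port in cluster.conf
IFACE=${IFACE:-eth0}    # assumption: cluster interconnect interface

# run (hypothetical helper): echo the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# 1. Is any DROP rule still present after the flush?
check_rules() {
    run iptables -L INPUT -n -v
}

# 2. Does any o2net traffic flow between the two nodes?
#    -c 20 stops the capture after 20 packets.
watch_o2net() {
    run tcpdump -n -i "$IFACE" -c 20 tcp port "$PORT" and host "$PEER"
}

# 3. Which TCP connections exist at all (look for the o2net port)?
list_sockets() {
    run netstat -tn
}

check_rules
watch_o2net
list_sockets
```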

What must I do so that all ocfs2 nodes communicate again after a network
failure?


Best regards,
thomas polnik.








