[Ocfs2-devel] Re: Is it an issue and whether the code changed correct? Thanks a lot

Sat Jul 27 10:23:21 PDT 2013

I would like to understand what is causing the ocfs2 network heartbeat
to time out. If you can reproduce the issue, can you please run the
following tcpdump command on all nodes and send me the output?

tcpdump -Z root -i $DEVICE -C 50 -W 10 -s 2500 -Sw /tmp/`hostname -s`_tcpdump.log -ttt 'port 7777' &

Please run and capture "top -2" output as well.

On 07/27/2013 02:27 AM, Guozhonghua wrote:
> Hi Liu,
> Sorry for the delayed response; I am very glad to receive your email.
>
> We use OCFS2 to make full use of IP-SAN or FC-SAN storage and to simplify iSCSI/FC storage management.
>
> The test scenario is an ocfs2 cluster with several nodes.
> Each node has two network interfaces: one for the management network, such as 192.168.0.12, and one for the network connected to the iSCSI storage, such as 192.168.10.12.
> OCFS2 uses the management IP 192.168.0.12, configured in the /etc/ocfs2/cluster.conf file, to set up its TCP connections.
> The scenario should be the same with an FC SAN: OCFS2 uses the management network to set up the TCP connections that carry messages between nodes.
> When we set up bonding for the node's management network on the switch directly connected to the node, the node's network interface goes down and comes back up. The OCFS2 kernel code detects this, and the TCP connection may be dropped without being re-established. The storage network, connected to the iSCSI or FC SAN, is still fine, so the OCFS2 disk heartbeat written to the iSCSI/FC SAN keeps working.
> Messages such as dlm messages can then no longer be exchanged between the nodes, so the OCFS2 cluster may block at times.
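
For reference, a node stanza in /etc/ocfs2/cluster.conf that puts the cluster traffic on the management address might look like the sketch below. The address is the one from the description above and port 7777 is the one used in the tcpdump filter; the node name, node number, cluster name and node_count are placeholders, not values from the reported setup.

```
cluster:
	node_count = 2
	name = ocfs2

node:
	ip_port = 7777
	ip_address = 192.168.0.12
	number = 1
	name = node1
	cluster = ocfs2
```
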
>
> I reviewed the code and set up a cluster to test, trying to find out why the whole cluster, or several nodes of it, blocked on the storage disk.
>
> There is code to handle reconnection between nodes. When the connection to a node with a smaller cluster node number is shut down, the node with the larger node number should reconnect to it, but that reconnect is not triggered, as I described in my earlier email.
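
The reconnect rule described above can be sketched as a small shell helper. This only illustrates the rule as stated in this thread (the node with the larger node number initiates the connection to the node with the smaller one); should_initiate is a hypothetical name and is not part of OCFS2 or its tools.

```shell
#!/bin/sh
# Hypothetical helper, not part of OCFS2: returns success (0) when the local
# node is expected to initiate the TCP connection to the peer.
# Assumption (from the description in this thread): the node with the larger
# node number connects to the node with the smaller one.
should_initiate() {
    local_num=$1
    peer_num=$2
    [ "$local_num" -gt "$peer_num" ]
}

# Example: node 2 lost its link to node 1 and is expected to reconnect.
if should_initiate 2 1; then
    echo "node 2 should reconnect to node 1"
fi
```

Under this rule, if node 2 never retries after its interface comes back up, node 1 has no way to restore the link on its own, which would match the behaviour reported here.
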
>
> There is another issue that blocks use of the cluster: the cluster blocks and several nodes cannot access the OCFS2 storage.
> The affected node does not respond to packets such as ping; this issue may be in the DLM.
>
> Thanks a lot
>
> Guozhonghua