<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#ffffff" text="#000000">
    ocfs2 uses a disk heartbeat to detect node liveness and a network
    heartbeat<br>
    to detect link liveness. Both must be working for the cluster to
    function.<br>
    If the network link between two nodes goes down, one of the two
    nodes<br>
    is fenced.<br>
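    <br>
    As a rough illustration of that fencing decision, here is a minimal
    Python sketch. It is not the o2quorum code. It assumes a node keeps
    running when it can reach a strict majority of the nodes that are
    heartbeating on disk, and that an even split is won by the side holding
    the lowest node number. The function name and arguments are hypothetical.<br>

```python
def has_quorum(self_node, heartbeating, connected):
    """Return True if self_node may keep running, False if it should fence itself.

    heartbeating: set of node numbers seen alive via the disk heartbeat
    connected:    subset of those nodes reachable over the network
    Both sets are expected to include self_node.
    """
    if self_node not in connected:
        raise ValueError("a node is always connected to itself")
    total = len(heartbeating)
    reachable = len(connected)
    if 2 * reachable == total:
        # Even split (assumed tie-break): the side with the lowest node number wins.
        return min(heartbeating) in connected
    # Otherwise a strict majority of the heartbeating nodes is required.
    return 2 * reachable > total


# Two nodes, both still writing disk heartbeats, network link down.
print(has_quorum(0, {0, 1}, {0}))  # node 0 survives
print(has_quorum(1, {0, 1}, {1}))  # node 1 fences itself
```

    In a two-node cluster every network failure is an even split, so under
    these assumptions the node with the higher number is the one that resets.<br>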
    <br>
    The stack trace below indicates that the two nodes are unable to
    communicate.<br>
    Both nodes are waiting for the quorum decision to fence one of them.<br>
    It appears you have raised the disk heartbeat timeout to more than
    2 minutes. I would expect<br>
    one of the nodes to have reset after that timeout.<br>
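    <br>
    The network side of this can be cross-checked from the quoted log. The
    sketch below pulls the "tmr" and "now" values out of the
    o2net_idle_timer line and subtracts them. It assumes "tmr" is the time
    the idle timer was last armed. The helper name is hypothetical.<br>

```python
import re

# The o2net_idle_timer message from the quoted log, joined onto one line.
line = (
    "(swapper,0,3):o2net_idle_timer:1503 here are some times that might help "
    "debug the situation: (tmr 1315417519.185025 now 1315417549.183798 "
    "dr 1315417519.185016 adv 1315417519.185032:1315417519.185032 "
    "func (b9bb7168:504) 1315417518.872227:1315417518.872268)"
)


def idle_seconds(log_line):
    """Seconds between the 'tmr' and 'now' timestamps of an idle-timer line."""
    tmr = float(re.search(r"tmr (\d+\.\d+)", log_line).group(1))
    now = float(re.search(r"now (\d+\.\d+)", log_line).group(1))
    return now - tmr


print(round(idle_seconds(line), 1))  # 30.0
```

    The result matches the "idle for 30.0 seconds" message, so the
    connection was dropped by the network idle timeout rather than by the
    disk heartbeat.<br>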
    <br>
    On 09/10/2011 08:54 PM, Hai Tao wrote:
    <blockquote cite="mid:BAY156-W639393F983CE859CD6974CEB030@phx.gbl"
      type="cite">
      <div dir="ltr">
        Is the ocfs2 heartbeat sent over the network, or does it just
        update a file&nbsp;on the shared disk?<br>
        &nbsp;<br>
        If the heartbeat is lost, what should happen? What if only one node
        is writing and the other is idle? Will it still cause any file
        system issues?<br>
        <br>
        <br>
        <div>Thanks.</div>
        <div>&nbsp;</div>
        <div>Hai Tao</div>
        <br>
        &nbsp;
        <br>
        <div>
          <hr id="stopSpelling">
          From: <a class="moz-txt-link-abbreviated" href="mailto:taoh666@hotmail.com">taoh666@hotmail.com</a><br>
          To: <a class="moz-txt-link-abbreviated" href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a><br>
          Date: Sat, 10 Sep 2011 00:50:23 -0700<br>
          Subject: [Ocfs2-users] disable heartbeat nic caused ocfs2
          errors<br>
          <br>
          <div dir="ltr">I have a two-node ocfs2 cluster, and I
            disabled the heartbeat NIC with "ifdown eth1". I got the
            following weird logs on both nodes:<br>
            &nbsp;<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel: o2net: connection to node
            dbtest-02 (num 1) at 10.194.59.65:7777 has been idle for
            30.0 seconds, shutting it down.<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (swapper,0,3):o2net_idle_timer:1503 here are some times that
            might help debug the situation: (tmr 1315417519.185025 now
            1315417549.183798 dr 1315417519.185016 adv
            1315417519.185032:1315417519.185032 func (b9bb7168:504)
            1315417518.872227:1315417518.872268)<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel: o2net: no longer connected
            to node dbtest-02 (num 1) at 10.194.59.65:7777<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (dlm_thread,3781,2):dlm_send_proxy_ast_msg:457 ERROR: status
            = -112<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (oracle,26129,1):dlm_do_master_request:1334 ERROR: link to 1
            went down!<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (oracle,26129,1):dlm_get_lock_resource:917 ERROR: status =
            -112<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (dlm_thread,4256,1):dlm_send_proxy_ast_msg:457 ERROR: status
            = -112<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (dlm_thread,4256,1):dlm_flush_asts:604 ERROR: status = -112<br>
            Sep&nbsp; 7 10:45:49 dbtest-01 kernel:
            (dlm_thread,3781,2):dlm_flush_asts:604 ERROR: status = -112<br>
            Sep&nbsp; 7 10:46:19 dbtest-01 kernel:
            (o2net,3736,3):o2net_connect_expired:1664 ERROR: no
            connection established with node 1 after 30.0 seconds,
            giving up and returning errors.<br>
            Sep&nbsp; 7 10:46:19 dbtest-01 kernel: o2net: accepted connection
            from node dbtest-02 (num 1) at 10.194.59.65:7777<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel: INFO: task events/0:10
            blocked for more than 120 seconds.<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel: "echo 0 &gt;
            /proc/sys/kernel/hung_task_timeout_secs" disables this
            message.<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel: events/0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; D
            ffff810001004420&nbsp;&nbsp;&nbsp;&nbsp; 0&nbsp;&nbsp;&nbsp; 10&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 11&nbsp;&nbsp;&nbsp;&nbsp; 9
            (L-TLB)<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp; ffff81083ffedc80
            0000000000000046 ffffffff80333680 0000000000000001<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp; 0000000000000400
            000000000000000a ffff81083ffe1820 ffffffff80309b60<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp; 0030b62498ce7b3f
            000000000000416b ffff81083ffe1a08 0000000000000000<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel: Call Trace:<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel: Call Trace:<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff80064167&gt;] wait_for_completion+0x79/0xa2<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8008e16d&gt;] default_wake_function+0x0/0xe<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff884e64b7&gt;]
            :ocfs2:ocfs2_wait_for_mask+0xd/0x19<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff884e78d8&gt;]
            :ocfs2:ocfs2_cluster_lock+0x9ae/0x9d3<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff885013e5&gt;]
            :ocfs2:ocfs2_orphan_scan_work+0x0/0x83<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff884ed1e4&gt;]
            :ocfs2:ocfs2_orphan_scan_lock+0x55/0x84<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff884fc59b&gt;]
            :ocfs2:ocfs2_queue_orphan_scan+0x32/0x147<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff885013ff&gt;]
            :ocfs2:ocfs2_orphan_scan_work+0x1a/0x83<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8004dc37&gt;] run_workqueue+0x94/0xe4<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8004a472&gt;] worker_thread+0x0/0x122<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8004a562&gt;] worker_thread+0xf0/0x122<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8008e16d&gt;] default_wake_function+0x0/0xe<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff80032bdc&gt;] kthread+0xfe/0x132<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8005efb1&gt;] child_rip+0xa/0x11<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff80032ade&gt;] kthread+0x0/0x132<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:&nbsp;
            [&lt;ffffffff8005efa7&gt;] child_rip+0x0/0x11<br>
            Sep&nbsp; 7 10:48:37 dbtest-01 kernel:<br>
            <br>
            Does anyone know why this happened?<br>
            &nbsp;<br>
            Thanks.<br>
          </div>
          <br>
          _______________________________________________ Ocfs2-users
          mailing list <a class="moz-txt-link-abbreviated" href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</a>
          <a class="moz-txt-link-freetext" href="http://oss.oracle.com/mailman/listinfo/ocfs2-users">http://oss.oracle.com/mailman/listinfo/ocfs2-users</a></div>
      </div>
    </blockquote>
    <br>
  </body>
</html>