<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffcc" text="#000066">

<tt>You may want to try increasing the network timeout. You will need

to do this on all nodes.<br>

<br>

See the FAQ

<a class="moz-txt-link-freetext" href="http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT">http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT</a> 

with special attention to #104 and #105.<br>

<br>

<br>

</tt>
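<tt>As a rough sketch of what that change might look like (an assumption, not taken from the setup described below): on OCFS2 releases with configurable timeouts, the o2cb settings normally live in /etc/sysconfig/o2cb, and the "idle for 10.0 seconds" in the quoted log corresponds to the network idle timeout. The values shown are illustrative only; check the FAQ entries for the figures recommended for your release.<br>
</tt>
<pre>
# /etc/sysconfig/o2cb (illustrative values; keep them identical on every node)
O2CB_IDLE_TIMEOUT_MS=30000      # network idle timeout; the quoted log shows 10.0 s in effect
O2CB_KEEPALIVE_DELAY_MS=2000    # delay before a keepalive packet is sent
O2CB_RECONNECT_DELAY_MS=2000    # delay between reconnect attempts
</pre>
<tt>The same values can usually be entered interactively with "/etc/init.d/o2cb configure". As far as I know, the volumes have to be unmounted and the cluster stack restarted on each node before the new values take effect.<br>
</tt>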

<pre class="moz-signature" cols="72">Regards,

Marcos Eduardo Matsunaga

Oracle USA

Linux Engineering

</pre>

<br>

<br>

paul fretter (TOC) wrote:

<blockquote

 cite="mid:D2AD15EF58CEB8448EA3626C1A102209FE44B2@NBIE2KSRV1.nbi.bbsrc.ac.uk"

 type="cite">

  <pre wrap="">To clarify,

The host "node1" is the OCFS node 0 in the config file.

The log entries are from another system in the cluster.

Kind regards

Paul

  </pre>

  <blockquote type="cite">

    <pre wrap="">-----Original Message-----

From: paul fretter (TOC)

Sent: 09 October 2007 11:41

To: <a class="moz-txt-link-abbreviated" href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a>

Subject: Access to OCFS2 volume paused when a node crashes

There is a node (node1) on our cluster that for some reason hangs every

now and again, but it seems that when it happens it also pauses access

to the OCFS2 volume for the other nodes.

We are running the latest version of OCFS2 and the tools, on RHEL4

(x86_64) with kernel 2.6.9-42.  All nodes are connected by

fibrechannel to a common LUN for data sharing.

I guess there may be something I can do with configuring timeouts

etc(?), but I thought I'd check with this list first.  Here is the

relevant info from /var/log/messages

Oct  9 11:24:41 jic55124 kernel: o2net: connection to node node1 (num

0) at 10.10.10.1:7777 has been idle for 10.0 seconds, shutting it

down.

Oct  9 11:24:41 jic55124 kernel: (0,1):o2net_idle_timer:1418 here are

some times that might help debug the situation: (tmr 1191925471.993435

now 1191925481.994292 dr 1191925471.993425 adv

1191925471.993436:1191925471.993437 func (98e2d068:507)

1191924562.14841:1191924562.14844)

Oct  9 11:24:41 jic55124 kernel: o2net: no longer connected to node

node1 (num 0) at 10.10.10.1:7777

Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_do_master_request:1418

ERROR: link to 0 went down!

Oct  9 11:24:41 jic55124 kernel: (727,3):dlm_get_lock_resource:995

ERROR: status = -112

[root@jic55124 ~]# tail /var/log/messages

Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995

ERROR: status = -107

Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_do_master_request:1418

ERROR: link to 0 went down!

Oct  9 11:28:48 jic55124 kernel: (856,2):dlm_get_lock_resource:995

ERROR: status = -107

Oct  9 11:33:42 jic55124 kernel: (865,0):dlm_get_lock_resource:921

6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at

least one node (0) torecover before lock mastery can begin

Oct  9 11:33:42 jic55124 kernel: (3765,1):ocfs2_dlm_eviction_cb:119

device (8,80): dlm has evicted node 0

Oct  9 11:33:43 jic55124 kernel: (865,0):dlm_get_lock_resource:976

6B13C23CB44C4D888150894FE4D35D4E:M000000000000000000007571339968: at

least one node (0) torecover before lock mastery can begin

Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_restart_lock_mastery:1301

ERROR: node down! 0

Oct  9 11:33:46 jic55124 kernel: (727,3):dlm_wait_for_lock_mastery:1118

ERROR: status = -11

Oct  9 11:33:48 jic55124 kernel: (865,1):ocfs2_replay_journal:1167

Recovering node 0 from slot 5 on device (8,80)

Oct  9 11:33:50 jic55124 kernel: kjournald starting.  Commit interval 5

seconds

Many thanks

Paul Fretter

    </pre>

  </blockquote>


</blockquote>

</body>

</html>