<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <meta content="text/html;charset=UTF-8" http-equiv="Content-Type">

</head>

<body bgcolor="#ffffcc" text="#000066">

<tt>You may start with what is suggested in the FAQ.<br>

<br>

<a class="moz-txt-link-freetext" href="http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT">http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT</a><br>

</tt>
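
<tt>As a rough illustration of the rule of thumb in that FAQ section: the disk heartbeat timeout in seconds works out to (O2CB_HEARTBEAT_THRESHOLD - 1) * 2, assuming one heartbeat write every 2 seconds. The helper names below are hypothetical and not part of o2cb; this is only a sketch for picking a value to enter at the "heartbeat dead threshold" prompt.<br>

</tt>

<pre wrap="">
```python
import math

# Assumption taken from the FAQ formula: one disk heartbeat every 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def timeout_for_threshold(threshold):
    """Disk heartbeat timeout in seconds implied by a dead threshold."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def threshold_for_timeout(timeout_s):
    """Smallest dead threshold that tolerates timeout_s seconds of stalled I/O."""
    return math.ceil(timeout_s / HEARTBEAT_INTERVAL_S) + 1


if __name__ == "__main__":
    # Default threshold shown by "o2cb configure" in the quoted message.
    print(timeout_for_threshold(7))
    # Threshold needed to ride out a 60 second stall.
    print(threshold_for_timeout(60))
```
</pre>

<tt>Under that assumption the default threshold of 7 corresponds to 12 seconds, and a 60 second tolerance needs a threshold of 31. This value is separate from the 10000 ms network idle timeout.<br>

</tt>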

<pre class="moz-signature" cols="72">Regards,

Marcos Eduardo Matsunaga

Oracle USA

Linux Engineering

</pre>

<br>

<br>

Ulf Zimmermann wrote:

<blockquote

 cite="mid:5DE4B7D3E79067418154C49A739C1251022E3DA6@msmpk01.corp.autc.com"

 type="cite">

  <blockquote type="cite">

    <pre wrap="">-----Original Message-----

From: Mark Fasheh [<a class="moz-txt-link-freetext" href="mailto:mark.fasheh@oracle.com">mailto:mark.fasheh@oracle.com</a>]

Sent: Wednesday, August 15, 2007 16:49

To: Ulf Zimmermann

Cc: Sunil Mushran; <a class="moz-txt-link-abbreviated" href="mailto:ocfs2-users@oss.oracle.com">ocfs2-users@oss.oracle.com</a>

Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots

On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote:

    </pre>

    <blockquote type="cite">

      <pre wrap="">Index 22: took 10003 ms to do waiting for write completion

*** ocfs2 is very sorry to be fencing this system by restarting ***

There were no SCSI errors on the console or logs around the time of this

reboot.

      </pre>

    </blockquote>

    <pre wrap="">It looks like the write took too long - as a first step, you might want to

up the disk heartbeat timeouts on those systems. Run:

$ /etc/init.d/o2cb configure

on each node to do that. That won't hide any hardware problems, but if the

problem is just a latency to get the write to disk, it'd help tune it

away.

        --Mark

    </pre>

  </blockquote>

  <pre wrap=""><!---->

OK, we have now had 4 reboots, plus 2 more caused by my own actions, all

due to OCFS2 fencing. As mentioned in previous emails, we were seeing some

SCSI errors, and although device-mapper-multipath seems to take care of

them, sometimes the 10 seconds configured in multipath.conf and the default

timings of o2cb collide.

On the two clusters where we have run into this, I have now replaced several

fibre cables, and it seems we also have 1 bad port on one of the fibre

channel switches. Swapped the cable first, still problems. Swapped the SFP,

still problems. Moved the node to the port the SFP had been swapped

from, 0 errors.

Now I am still concerned about the timing of device-mapper-multipath and

o2cb. O2cb is currently set to the default of:

Specify heartbeat dead threshold (&gt;=7) [7]: 

Specify network idle timeout in ms (&gt;=5000) [10000]: 

Specify network keepalive delay in ms (&gt;=1000) [5000]: 

Specify network reconnect delay in ms (&gt;=2000) [2000]:

So the timeout I seem to hit is the 10,000 ms network idle timeout, even

though this timeout occurs on the disk? What values would you recommend I

set these to?

Another question, in case someone can answer it. If I get syslog

entries like:

Aug 16 00:44:33 dbprd01 kernel: SCSI error : &lt;1 0 0 1&gt; return code = 0x20000

Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452448

Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:144.

Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452456

Aug 16 00:44:33 dbprd01 kernel: SCSI error : &lt;1 0 1 1&gt; return code = 0x20000

Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242384

Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:208.

Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242392

Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed

Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3

Aug 16 00:44:33 dbprd01 multipathd: 8:208: mark as failed

Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2

Does this actually error out all the way, or does the request still go

to one of the remaining paths? If the request doesn't error out because

it could still be fulfilled via the 2 remaining paths, then it is really

just a matter of timing between device-mapper-multipath recovering the

request through the remaining paths and our o2cb settings. If not, we

might still have another problem. We have seen many such errors but have

had only about 8 reboots, all of which I think can now be attributed to fencing.

Regards, Ulf.


  </pre>

</blockquote>

</body>

</html>