[Ocfs2-users] 6 node cluster with unexplained reboots

Joel Becker Joel.Becker at oracle.com
Thu Aug 16 14:41:10 PDT 2007


On Thu, Aug 16, 2007 at 02:29:43AM -0700, Ulf Zimmermann wrote:
> Ok, we have now had 4 reboots caused by OCFS2 fencing, plus 2 more by
> my own action. As said in previous emails, we were seeing some SCSI
> errors, and although device-mapper-multipath seems to take care of
> them, sometimes the 10 seconds configured in multipath.conf and the
> default timings of o2cb collide.

	I can certainly see that happening.
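	For reference, a minimal multipath.conf sketch of the settings
behind that 10-second window (hypothetical values, not your actual
config; polling_interval is the usual source of the delay):

    defaults {
        # seconds between path health checks; a dead path can go
        # unnoticed for up to this long
        polling_interval    10
        # queue I/O instead of erroring it out while all paths are down
        no_path_retry       queue
        # switch back to the preferred path as soon as it recovers
        failback            immediate
    }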

> Now I am still concerned about the timing of device-mapper-multipath
> and o2cb, which is currently set to the defaults:
> 
> Specify heartbeat dead threshold (>=7) [7]: 
> Specify network idle timeout in ms (>=5000) [10000]: 
> Specify network keepalive delay in ms (>=1000) [5000]: 
> Specify network reconnect delay in ms (>=2000) [2000]:

	I would certainly bump up the timeouts, as in the FAQ (see
Marcos' reply).
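	Concretely, those settings live in /etc/sysconfig/o2cb.  A
sketch with the FAQ's suggested values (variable names assume
ocfs2-tools 1.2.5 or later; when in doubt, re-run "service o2cb
configure" instead of editing the file by hand):

    # Example values per the FAQ; verify against your version.
    O2CB_HEARTBEAT_THRESHOLD=31     # disk heartbeat dead threshold
    O2CB_IDLE_TIMEOUT_MS=30000      # network idle timeout
    O2CB_KEEPALIVE_DELAY_MS=2000    # network keepalive delay
    O2CB_RECONNECT_DELAY_MS=2000    # network reconnect delay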

> So the timeout I seem to be hitting is the 10,000 ms network idle
> timeout? Does this timeout apply even when the problem occurs on the
> disk? What values would you recommend I set these to?

	I think you're hitting the heartbeat dead threshold.  Bump it
to something larger (e.g., 31), as in the FAQ.  The threshold is a
multiplier of the region check interval.  A better way to put it:
every N seconds the system checks for a heartbeat on disk.  If it
checks <threshold> times without seeing the other node (that is,
N * <threshold> seconds), it will consider that node dead.
	What it looks like you are seeing is that the one node reaches
(N * <threshold> seconds) before multipath has a chance to fix the
I/O.  That's why we suggest bumping the timeout.  Mark/Marcos, correct
me if I'm wrong here.
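	To make the arithmetic concrete: assuming the usual 2-second
disk heartbeat interval (N = 2), the default threshold of 7 gives
roughly 2 * 7 = 14 seconds before a node is declared dead (the FAQ
counts it as (threshold - 1) * 2, i.e. 12 seconds), while a threshold
of 31 allows about 60 seconds, comfortably more than a 10-second
multipath failover.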

> Another question, in case someone can answer it. If I get syslog
> entries like:
> 
> Aug 16 00:44:33 dbprd01 kernel: SCSI error : <1 0 0 1> return code =
> 0x20000
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector
> 346452448
> Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing
> path 8:144.
...<snip>
> Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed
> Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3
> 
> Does this actually error out all the way, or does the request still
> go to one of the remaining paths? If the request doesn't error out,
> because multipath was still able to fulfill it via the remaining
> paths, then it really is just the timing between
> device-mapper-multipath's recovery of the request and our o2cb
> settings. If not, we might still have another problem. We have seen
> many such errors but only had about 8 reboots, all of which I now
> attribute to fencing.

	That error log looks like multipath correcting the I/O.  The
error isn't coming all the way up; it's being handled by multipath.  If
the error came all the way back up, you would see errors printed by the
heartbeat process.  If you don't see them, the heartbeat never got an
error, which is exactly how multipath should behave.
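	If you want to double-check, something like the following
(assuming kernel messages end up in /var/log/messages, as on most
RHEL/SLES installs) will show whether the o2hb heartbeat code ever
logged an I/O error:

    # No hits means the errors were absorbed by multipath before
    # they could reach the heartbeat.
    grep -i o2hb /var/log/messages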

Joel

-- 

Life's Little Instruction Book #314

	"Never underestimate the power of forgiveness."

Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127


