[Ocfs2-users] 6 node cluster with unexplained reboots

Thu Aug 16 07:59:04 PDT 2007

You may start with what is suggested in the FAQ.

http://oss.oracle.com/projects/ocfs2/dist/documentation/ocfs2_faq.html#TIMEOUT
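
As I read that FAQ entry, the disk heartbeat fence time is derived from the heartbeat dead threshold, not from the network idle timeout: a node fences itself after (threshold - 1) * 2 seconds without a completed heartbeat write. Here is a rough sketch of that arithmetic, so you can compare it against your multipath timings. The function names are made up for illustration, and the 2 second interval is taken from the FAQ, so please check it against your o2cb version.

```python
import math

# Assumption from the FAQ: o2cb writes one disk heartbeat every 2 seconds.
HEARTBEAT_INTERVAL_S = 2


def threshold_to_timeout_s(threshold: int) -> int:
    """Seconds without a completed disk heartbeat before the node fences."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def timeout_to_threshold(timeout_s: float) -> int:
    """Smallest threshold whose fence time is at least timeout_s seconds."""
    return math.ceil(timeout_s / HEARTBEAT_INTERVAL_S) + 1


if __name__ == "__main__":
    print(threshold_to_timeout_s(7))   # 12, for the default threshold of 7
    print(timeout_to_threshold(60))    # 31
```

With the default of 7 shown in the configure prompt quoted below, this works out to 12 seconds, which leaves little margin over a 10 second multipath timeout.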

Regards,

Marcos Eduardo Matsunaga

Oracle USA
Linux Engineering

Ulf Zimmermann wrote:
>> -----Original Message-----
>> From: Mark Fasheh [mailto:mark.fasheh at oracle.com]
>> Sent: Wednesday, August 15, 2007 16:49
>> To: Ulf Zimmermann
>> Cc: Sunil Mushran; ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>>
>> On Mon, Aug 13, 2007 at 08:46:51AM -0700, Ulf Zimmermann wrote:
>>     
>>> Index 22: took 10003 ms to do waiting for write completion
>>> *** ocfs2 is very sorry to be fencing this system by restarting ***
>>>
>>> There were no SCSI errors on the console or logs around the time of
>>> this reboot.
>> It looks like the write took too long - as a first step, you might want to
>> up the disk heartbeat timeouts on those systems. Run:
>>
>> $ /etc/init.d/o2cb configure
>>
>> on each node to do that. That won't hide any hardware problems, but if the
>> problem is just a latency to get the write to disk, it'd help tune it
>> away.
>> 	--Mark
>>     
>
> OK, we have now had 4 reboots, plus 2 more caused by my own actions,
> which were due to OCFS2 fencing. As said in previous emails, we were
> seeing some SCSI errors, and although device-mapper-multipath seems to
> take care of them, sometimes the 10 seconds configured in multipath.conf
> and the default timings of o2cb collide.
>
> On the two clusters where we have run into this, I have now replaced
> several fibre cables, and it seems we also have 1 bad port on one of the
> fibre channel switches. Swapped the cable first: still problems. Swapped
> the SFP: still problems. Moved the node to another port, away from the
> one where the SFP was swapped: 0 errors.
>
> Now I am still concerned about the timing of device-mapper-multipath and
> o2cb. O2cb is currently set to the default of:
>
> Specify heartbeat dead threshold (>=7) [7]: 
> Specify network idle timeout in ms (>=5000) [10000]: 
> Specify network keepalive delay in ms (>=1000) [5000]: 
> Specify network reconnect delay in ms (>=2000) [2000]:
>
> So the timeout I seem to hit is the 10,000 ms network idle timeout, even
> though this timeout occurs on the disk? What values would you recommend
> I set this to?
>
> Another question, in case someone can answer this. If I get syslog
> entries like:
>
> Aug 16 00:44:33 dbprd01 kernel: SCSI error : <1 0 0 1> return code = 0x20000
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452448
> Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:144.
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdj, sector 346452456
> Aug 16 00:44:33 dbprd01 kernel: SCSI error : <1 0 1 1> return code = 0x20000
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242384
> Aug 16 00:44:33 dbprd01 kernel: device-mapper: dm-multipath: Failing path 8:208.
> Aug 16 00:44:33 dbprd01 kernel: end_request: I/O error, dev sdn, sector 1469242392
> Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed
> Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3
> Aug 16 00:44:33 dbprd01 multipathd: 8:208: mark as failed
> Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2
>
> Does this actually error out all the way, or does the request still go
> to one of the remaining paths? If this request doesn't error out,
> because it could still be fulfilled via the 2 remaining paths, then
> it is really just the timing between device-mapper-multipath recovering
> this request through the remaining paths and our o2cb settings. If not,
> we might still have another problem. We have seen many such errors but
> only had about 8 reboots, all of which I think are now attributed to
> fencing.
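
One way to check this against the logs is to track how far the "remaining active paths" count drops for each multipath map. As far as I understand it, a request is retried on another path as long as one remains, so a map reaching zero would be the case to worry about. This is only a sketch: the function name is hypothetical and the regular expression is written against the multipathd lines in the excerpt above, so other multipathd versions may log differently.

```python
import re

# Matches lines such as "multipathd: u01: remaining active paths: 3".
REMAINING = re.compile(r"multipathd: (\S+): remaining active paths: (\d+)")


def lowest_remaining_paths(lines):
    """Return {map_name: lowest 'remaining active paths' count seen}."""
    lowest = {}
    for line in lines:
        match = REMAINING.search(line)
        if match:
            name, count = match.group(1), int(match.group(2))
            lowest[name] = min(count, lowest.get(name, count))
    return lowest


# The four multipathd lines from the excerpt above.
SAMPLE = [
    "Aug 16 00:44:33 dbprd01 multipathd: 8:144: mark as failed",
    "Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 3",
    "Aug 16 00:44:33 dbprd01 multipathd: 8:208: mark as failed",
    "Aug 16 00:44:33 dbprd01 multipathd: u01: remaining active paths: 2",
]

if __name__ == "__main__":
    print(lowest_remaining_paths(SAMPLE))  # {'u01': 2}
```

Run over a full syslog, any map whose lowest count is 0 around the time of a reboot would point at something other than pure timing between multipath and o2cb.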
>
> Regards, Ulf.