[Ocfs2-users] 6 node cluster with unexplained reboots

Sunil Mushran Sunil.Mushran at oracle.com
Mon Jul 30 11:07:26 PDT 2007


So are you suggesting the reason was bad hardware?
Or is it too early to call?

Ulf Zimmermann wrote:
> I have a serial console set up with logging via conserver, but so far
> no further crash. We also swapped some hardware around (another 4-node
> cluster with DL360g5s had been running without a crash for several
> weeks; we swapped those 4 nodes in for the first 4 in the 6-node
> cluster).
>
>> -----Original Message-----
>> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
>> Sent: Monday, July 30, 2007 10:21
>> To: Ulf Zimmermann
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>>
>> Do you have a netconsole setup? If not, set it up. That will capture
>> the real reason for the reset. Well, it typically does.
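>>
>> A minimal sketch, assuming eth0, a sender at 192.168.1.10, and a log
>> host at 192.168.1.20 with MAC 00:11:22:33:44:55 (all placeholders):
>>
>>     # on each cluster node: send kernel messages over UDP
>>     modprobe netconsole netconsole=6666@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55
>>     # raise the console loglevel so panic output is not filtered
>>     dmesg -n 8
>>
>>     # on the log host: capture the stream (some netcats want -p 6666)
>>     nc -u -l 6666 | tee netconsole.log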
>>
>> Ulf Zimmermann wrote:
>>> We just installed a new cluster with 6 HP DL380g5s, dual single-port
>>> Qlogic 24xx HBAs connected via two HP 4/16 StorageWorks switches to a
>>> 3Par S400. We are using the 3Par-recommended config for the Qlogic
>>> driver and device-mapper-multipath, giving us 4 paths to the SAN. We
>>> do see some SCSI errors where DM-MP fails a path after getting a
>>> 0x2000 error from the SAN controller, but the path is put back in
>>> service in less than 10 seconds.
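>>>
>>> (The device-mapper-multipath stanza for a 3Par array goes in
>>> /etc/multipath.conf and is roughly of this shape; the values below
>>> are illustrative, not the exact 3Par-recommended ones:
>>>
>>>     device {
>>>         vendor                "3PARdata"
>>>         product               "VV"
>>>         path_grouping_policy  multibus
>>>         path_checker          tur
>>>         failback              immediate
>>>         no_path_retry         12
>>>     }
>>>
>>> "multipath -ll" should show all 4 paths active between failures.)
>>>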
>>> This needs to be fixed, but I don't think it is what is causing our
>>> reboots. Two of the nodes rebooted once while idle (ocfs2 and
>>> clusterware were running, no db), and one node rebooted once while
>>> idle (another node was copying our 9i db from ocfs1 to the ocfs2
>>> data volume using fscat) and once while some load was put on it via
>>> the upgraded 10g database. In all cases it is as if someone pressed
>>> a hardware reset button. No kernel panic (at least not one leading
>>> to a stop with a visible message), though we do get a dirty write
>>> cache on the internal cciss controller.
>>> The only messages we get on the other nodes are from when the
>>> crashed node is already in reset and has missed its ocfs2 heartbeat
>>> (threshold set to the default of 7), followed later by CRS moving
>>> the VIP.
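>>>
>>> (That threshold lives in /etc/sysconfig/o2cb; with the default of 7,
>>> a node fences itself after (7 - 1) * 2 = 12 seconds without a disk
>>> heartbeat, which a slow multipath failover could plausibly trip:
>>>
>>>     O2CB_HEARTBEAT_THRESHOLD=7
>>>
>>> Raising it, e.g. to 31 for a 60-second window, requires the same
>>> value on every node.)
>>>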
>>> Any hints on troubleshooting this would be appreciated.
>>>
>>> Regards, Ulf.
>>>
>>>
>>> --------------------------
>>> Sent from my BlackBerry Wireless Handheld
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users



