[Ocfs2-users] 6 node cluster with unexplained reboots
Sunil Mushran
Sunil.Mushran at oracle.com
Mon Jul 30 11:07:26 PDT 2007
So are you suggesting the reason was bad hardware?
Or is it too early to call?
Ulf Zimmermann wrote:
> I have a serial console setup with logging via conserver, but so far no
> further crash. We also swapped some hardware around (another 4-node
> cluster with DL360g5s had been working without a crash for several
> weeks; we swapped those 4 nodes in for the first 4 in the 6-node
> cluster).
>
>
>> -----Original Message-----
>> From: Sunil Mushran [mailto:Sunil.Mushran at oracle.com]
>> Sent: Monday, July 30, 2007 10:21
>> To: Ulf Zimmermann
>> Cc: ocfs2-users at oss.oracle.com
>> Subject: Re: [Ocfs2-users] 6 node cluster with unexplained reboots
>>
>> Do you have a netconsole setup? If not, set it up. That will capture
>> the real reason for the reset. Well, it typically does.
>>
>> Ulf Zimmermann wrote:
>>
>>> We just installed a new cluster with 6 HP DL380g5, dual single-port
>>> Qlogic 24xx HBAs connected via two HP 4/16 Storageworks switches to a
>>> 3Par S400. We are using the 3Par-recommended config for the Qlogic
>>> driver and device-mapper-multipath, giving us 4 paths to the SAN. We
>>> do see some SCSI errors where DM-MP is failing a path after getting a
>>> 0x2000 error from the SAN controller, but the path gets put back in
>>> service in less than 10 seconds.
>>>
>>
>>> This needs to be fixed, but I don't think it is what is causing our
>>> reboots. 2 of the nodes rebooted once while idle (ocfs2 and
>>> clusterware were running, no db), and one node rebooted once while
>>> idle (another node was copying our 9i db from ocfs1 to the ocfs2 data
>>> volume using fscat) and once while some load was put on it via the
>>> upgraded 10g database. In all cases it is as if someone pressed a
>>> hardware reset button. No kernel panic (at least not one leading to a
>>> stop with a visible message), and we can get a dirty write cache on
>>> the internal cciss controller.
>>>
>>
>>> The only messages we get on the other nodes come when the crashed
>>> node is already in reset and has missed its ocfs2 heartbeat (set to
>>> the default of 7), followed later by crs moving the vip.
>>>
>>> Any hints on troubleshooting this would be appreciated.
>>>
>>> Regards, Ulf.
>>>
>>>
>>> --------------------------
>>> Sent from my BlackBerry Wireless Handheld
>>>
>>> ------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>
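For readers of the archive, a note on the netconsole suggestion above: netconsole is a kernel module that streams console messages over UDP to another host, so the final oops or panic often survives a reset that never makes it to local disk. A minimal sketch of enabling it, where all interface names, IP addresses, ports, and the MAC address are placeholders for your own network:

```shell
# On the node that keeps resetting: load netconsole pointed at a log host.
# Parameter format: netconsole=[src-port]@[src-ip]/[dev],[dst-port]@<dst-ip>/[dst-mac]
modprobe netconsole netconsole=6665@10.0.0.11/eth0,6666@10.0.0.1/00:11:22:33:44:55

# Raise the console log level so kernel messages actually get sent.
dmesg -n 8

# On the log host: capture the UDP stream to a file.
# (BSD-style netcat shown; traditional netcat wants "nc -u -l -p 6666".)
nc -u -l 6666 | tee netconsole.log
```

Unlike serial consoles, this needs no extra cabling, but note it sends over the specified NIC only, so the log host must be reachable on that segment.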
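Also worth noting for the heartbeat angle: the "default of 7" mentioned above is the O2CB heartbeat threshold, and the effective fencing timeout is roughly (threshold - 1) * 2 seconds, i.e. about 12 seconds at the default. With multipath failovers taking close to 10 seconds, a node can miss enough disk heartbeat writes to fence itself. A sketch of checking and raising it on a 1.2-era install (the value 31 is the commonly suggested multipath setting, shown here as an example, not a prescription):

```shell
# Show the current threshold (default 7 => (7 - 1) * 2 = 12 seconds).
cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold

# To raise it, edit /etc/sysconfig/o2cb on every node:
#   O2CB_HEARTBEAT_THRESHOLD=31        # => (31 - 1) * 2 = 60 seconds
# then restart the cluster stack (with the filesystems unmounted):
/etc/init.d/o2cb restart
```

The threshold must match on all nodes, or the cluster will refuse to let mismatched nodes join the heartbeat region.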