[Ocfs2-users] OCFS2 Crash

Herbert van den Bergh herbert.van.den.bergh at oracle.com
Thu Jun 30 10:24:18 PDT 2011


Try setting /proc/sys/kernel/panic_on_oops to 1.  It appears you are
getting oopses but the box keeps running.  The oopses could be due to
memory corruption, and there is no way to know what damage has already
been done.  You'll need to catch the very first one and get it fixed.
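
For example (a minimal sketch; the exact sysctl mechanics may vary by
distro, but this is the usual approach):

    # panic on the first oops instead of limping along
    echo 1 > /proc/sys/kernel/panic_on_oops
    # or, equivalently:
    sysctl -w kernel.panic_on_oops=1

    # to persist across reboots, add to /etc/sysctl.conf:
    #   kernel.panic_on_oops = 1

If you also set kernel.panic to a nonzero number of seconds, the box
will reboot itself that long after the panic.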

Thanks,
Herbert.


On 6/29/11 7:27 PM, B Leggett wrote:
> That's a great idea, but I already replaced the RAM and then the whole box. Right now the only hardware carried over is the hard drives for /.
>
>
> ----- Original Message -----
> From: "Jürgen Herrmann"<Juergen.Herrmann at XLhost.de>
> To: ocfs2-users at oss.oracle.com
> Sent: Wednesday, June 29, 2011 5:57:19 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> On Wed, 29 Jun 2011 16:43:09 -0500 (GMT-05:00), B Leggett wrote:
>> That's troubling; these are really static systems. I know anything
>> can happen, but to inherit a kernel issue two years later seems nuts.
>> Not that your analysis is wrong, it just blows me away is all. Is
>> there a chance I would be better off removing this node and replacing
>> it with a fresh build?
>
> As your oopses all look different, I'd first replace all the RAM on
> the node in question. I have seen machines behave this strangely with
> faulty RAM several times.
>
> just my 2c
>
> best regards,
> jürgen
>
>>
>> ----- Original Message -----
>> From: "Sunil Mushran"<sunil.mushran at oracle.com>
>> To: "B Leggett"<bleggett at ngent.com>
>> Cc: ocfs2-users at oss.oracle.com
>> Sent: Wednesday, June 29, 2011 5:23:40 PM GMT -05:00 US/Canada
>> Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> You should ping your kernel vendor. While this does not look ocfs2
>> related, even if it did, you would first be asked to upgrade to a
>> more recent kernel, etc. And all those bits will come from the vendor.
>>
>> On 06/29/2011 02:20 PM, B Leggett wrote:
>>> Sunil,
>>> After that first attempt I tried several more times and got actual
>>> oopses. I think try #3 has the most detail.
>>>
>>> Try #2:
>>>
>>> Oops: 0000 [#1]
>>> SMP
>>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm
>>> ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi
>>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop
>>> dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd sworks_agp ide_cd
>>> cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd fan
>>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>>> CPU:    0
>>> EIP:    0060:[<c029723e>]    Tainted: P     X VLI
>>> EFLAGS: 00210086   (2.6.16.21-0.8-bigsmp #1)
>>> EIP is at do_page_fault+0x8e/0x5f6
>>> eax: f3f64000   ebx: c02fbc00   ecx: 00000000   edx: 00000000
>>> esi: f3f6605c   edi: c02971b0   ebp: 00000098   esp: f3f64088
>>> ds: 007b   es: 007b   ss: 0068
>>>
>>>
>>> Try#3
>>>
>>> Oops: 0000 [#1]
>>> SMP
>>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm
>>> ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi
>>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop
>>> dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp cdrom ohci_hcd
>>> i2c_core usbcore sworks_agp pci_hotplug agpgart tg3 reiserfs edd fan
>>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>>> CPU:    2
>>> EIP:    0060:[<c029723e>]    Tainted: P     X VLI
>>> EFLAGS: 00210006   (2.6.16.21-0.8-bigsmp #1)
>>> EIP is at do_page_fault+0x8e/0x5f6
>>> eax: f3f2c000   ebx: 880f0133   ecx: 64656e77   edx: 64656e77
>>> esi: f3f30058   edi: c02971b0   ebp: 64656f0f   esp: f3f2c084
>>> ds: 007b   es: 007b   ss: 0068
>>> Unable to handle kernel paging request at virtual address 01110954
>>>    printing eip:
>>> c029723e
>>> *pde = 33dda001
>>> Unable to handle kernel NULL pointer dereference at virtual address
>>> 00000030
>>>    printing eip:
>>> c015c752
>>> *pde = 3629c001
>>> o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has
>>> been idle for 10 seconds, shutting it down.
>>> (10,0):o2net_idle_timer:1309 here are some times that might help
>>> debug the situation: (tmr 1309364991.767445 now 1309365001.767502 dr
>>> 1309364996.769068 adv 1309364991.767450:1309364991.767451 func
>>> (9987e679:2) 1309364870.220076:1309364870.220078)
>>> o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has
>>> been idle for 10 seconds, shutting it down.
>>> (10,0):o2net_idle_timer:1309 here are some times that might help
>>> debug the situation: (tmr 1309364991.769291 now 1309365001.767537 dr
>>> 1309364996.770248 adv 1309364991.769302:1309364991.769303 func
>>> (3768d12f:505) 1309364991.769291:1309364991.769296)
>>> Unable to handle kernel paging request at virtual address 4e0b5293
>>>    printing eip:
>>> c024c829
>>> *pde = 36b61001
>>>
>>> Try #4
>>>
>>> Unable to handle kernel paging request at virtual address fffffffc
>>>    printing eip:
>>> c016e54e
>>> *pde = 00000000
>>> Oops: 0000 [#1]
>>> SMP
>>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm
>>> ocfs2_nodemanager ipv6 configfs iscsi_tcp libiscsi
>>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop
>>> dm_mod netconsole usbhid ide_cd cpqphp cdrom i2c_piix4 ohci_hcd
>>> sworks_agp i2c_core usbcore agpgart pci_hotplug tg3 reiserfs edd fan
>>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>>> CPU:    3
>>> EIP:    0060:[<c016e54e>]    Tainted: P     X VLI
>>> EFLAGS: 00010297   (2.6.16.21-0.8-bigsmp #1)
>>> EIP is at poll_freewait+0xd/0x3a
>>> eax: f5ab5f90   ebx: ffffffe4   ecx: dffff040   edx: c1000000
>>> esi: f31c4000   edi: bffa3bf4   ebp: f34b8310   esp: f5ab5f60
>>> ds: 007b   es: 007b   ss: 0068
>>> Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
>>> Stack:<0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4
>>> 00000000 f34b8310
>>>          00000002 00000002 00000000 f34b8300 c016f12a f31c4000
>>> 00000000 bffa3be4
>>>          00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000
>>> c0103cab bffa3be4
>>> Call Trace:
>>>    [<c016e85a>] do_sys_poll+0x2df/0x2e9
>>>    [<c016f12a>] __pollwait+0x0/0x95
>>>    [<c016e8a8>] sys_poll+0x44/0x47
>>>    [<c0103cab>] sysenter_past_esp+0x54/0x79
>>> Code: c4 10 89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00
>>> 00 c7 40 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c<8b>
>>> 43 18 8d 53 04 e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08
>>>
>>> ----- Original Message -----
>>> From: "B Leggett"<bleggett at ngent.com>
>>> To: ocfs2-users at oss.oracle.com
>>> Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada
>>> Eastern
>>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>>
>>> For the list, I accidentally sent it directly to Sunil. My apologies
>>> for that.
>>>
>>> Bruce
>>> ----- Original Message -----
>>> From: "B Leggett"<bleggett at ngent.com>
>>> To: "Sunil Mushran"<sunil.mushran at oracle.com>
>>> Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada
>>> Eastern
>>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>>
>>> Sunil,
>>> I did as you requested and got one line of output.
>>>
>>> o2net: accepted connection from node node-05 (num 4) at
>>> 192.168.1.62:7777
>>>
>>> Bruce
>>> ----- Original Message -----
>>> From: "Sunil Mushran"<sunil.mushran at oracle.com>
>>> To: "B Leggett"<bleggett at ngent.com>
>>> Cc: ocfs2-users at oss.oracle.com
>>> Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada
>>> Eastern
>>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>>
>>> 1.2.1? That's 5 years old. We've had a few fixes since then. ;)
>>>
>>> You have to catch the oops trace to figure out the reason. One way
>>> to get it is by using netconsole. Check the SLES 10 docs to see how
>>> to configure netconsole, or whatever is recommended for capturing
>>> the oops log in that release.
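>>>
>>> Roughly, something like this (the IPs, MAC address, and interface
>>> name below are just placeholders for your environment; check the
>>> docs for the exact module parameter syntax on your kernel):
>>>
>>>      # on the crashing node: stream kernel messages over UDP to a log host
>>>      modprobe netconsole netconsole=6665@10.0.0.1/eth0,6666@10.0.0.2/12:34:56:78:9a:bc
>>>
>>>      # on the log host: capture whatever arrives on that UDP port
>>>      nc -u -l -p 6666 | tee netconsole.log
>>>      # (some netcat variants want: nc -u -l 6666)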
>>>
>>> On 06/29/2011 11:28 AM, B Leggett wrote:
>>>> Hi,
>>>> I am running OCFS2 1.2.1 on SLES 10, just the stuff right out of
>>>> the box. This is a 3-node cluster that's been running for 2 years
>>>> with just about zero modification. The storage is a high-end SAN and
>>>> the transport is iSCSI. We went two years without an issue, and all
>>>> of a sudden node 1 in the cluster keeps crashing. I have never had
>>>> to troubleshoot OCFS2, so I started with what I could control.
>>>>
>>>> I checked /var/log/messages and nothing there suggests a problem. I
>>>> replaced hardware, going as far as popping the SCSI drives out,
>>>> putting them in another server, and trying it with all new hardware.
>>>> The problem still persists.
>>>>
>>>> I had the network team check the iSCSI port on the private iSCSI
>>>> network, and they are not seeing errors.
>>>>
>>>> I've checked the few OCFS2 settings in play, and they all look good.
>>>>
>>>> My question to the group is: how do I continue troubleshooting this
>>>> issue? I'm not aware of any native logs, etc., to reference. I would
>>>> appreciate any help that gets this diagnosis moving toward a solution.
>>>>
>>>> Thanks,
>>>> Bruce
>>>
>>> _______________________________________________
>>> Ocfs2-users mailing list
>>> Ocfs2-users at oss.oracle.com
>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>
>>
>


