[Ocfs2-users] OCFS2 Crash

B Leggett bleggett at ngent.com
Wed Jun 29 19:27:56 PDT 2011


That's a great idea, but I already replaced the RAM, and then the whole box. Right now the only hardware I kept is the hard drives for /.


----- Original Message -----
From: "Jürgen Herrmann" <Juergen.Herrmann at XLhost.de>
To: ocfs2-users at oss.oracle.com
Sent: Wednesday, June 29, 2011 5:57:19 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

On Wed, 29 Jun 2011 16:43:09 -0500 (GMT-05:00), B Leggett wrote:
> That's troubling, these are really static systems. I know anything
> can happen, but to inherit a kernel issue two years later seems nuts.
> Not that your analysis is wrong, just blows me away is all. Is there a
> chance I would be better off removing this node and replacing it with
> a fresh build?

as your oopses all look different, i'd first replace all ram on the node
in question. i have had machines behave this strangely with faulty ram
several times.
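
One way to check how much the oopses really differ is to pull the faulting function and the faulting addresses out of each dump and compare them side by side. The sketch below is only an illustration: the helper names (oops_signature, distinct_fault_sites) are made up for this example, and the sample strings are the relevant lines from tries #2 to #4 further down in this thread, with the mail line wrapping removed.

```python
import re

# Matches lines such as "EIP is at do_page_fault+0x8e/0x5f6".
EIP_RE = re.compile(r"EIP is at (\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Matches the address in "... at virtual address 01110954".
ADDR_RE = re.compile(r"at virtual address\s+([0-9a-f]{8})")

# Lines copied from the dumps in this thread (unwrapped by hand).
TRY2 = "EIP is at do_page_fault+0x8e/0x5f6\n"
TRY3 = (
    "EIP is at do_page_fault+0x8e/0x5f6\n"
    "Unable to handle kernel paging request at virtual address 01110954\n"
    "Unable to handle kernel NULL pointer dereference at virtual address 00000030\n"
    "Unable to handle kernel paging request at virtual address 4e0b5293\n"
)
TRY4 = (
    "Unable to handle kernel paging request at virtual address fffffffc\n"
    "EIP is at poll_freewait+0xd/0x3a\n"
)


def oops_signature(text):
    """Return (faulting functions, faulting addresses) found in one dump."""
    return EIP_RE.findall(text), ADDR_RE.findall(text)


def distinct_fault_sites(dumps):
    """Return the set of functions the EIP pointed into across all dumps."""
    sites = set()
    for text in dumps:
        functions, _addresses = oops_signature(text)
        sites.update(functions)
    return sites
```

Calling distinct_fault_sites([TRY2, TRY3, TRY4]) lists every function the dumps faulted in, and oops_signature(TRY3) shows the addresses reported in a single dump. The regular expressions assume the 2.6.16-era i386 oops layout shown in this thread and would need adjusting for other kernels.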

just my 2c

best regards,
jürgen

>
> ----- Original Message -----
> From: "Sunil Mushran" <sunil.mushran at oracle.com>
> To: "B Leggett" <bleggett at ngent.com>
> Cc: ocfs2-users at oss.oracle.com
> Sent: Wednesday, June 29, 2011 5:23:40 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> You should ping your kernel vendor. While this does not look ocfs2
> related, even if it did, you will be first asked to upgrade to a more
> recent kernel, etc. And all those bits will come from the vendor.
>
> On 06/29/2011 02:20 PM, B Leggett wrote:
>> Sunil,
>> After that first attempt I tried several more times and got actual
>> oopses. I think try #3 has the most details.
>>
>> Try #2:
>>
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm 
>> ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi 
>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop 
>> dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd sworks_agp ide_cd 
>> cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd fan 
>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>> CPU:    0
>> EIP:    0060:[<c029723e>]    Tainted: P     X VLI
>> EFLAGS: 00210086   (2.6.16.21-0.8-bigsmp #1)
>> EIP is at do_page_fault+0x8e/0x5f6
>> eax: f3f64000   ebx: c02fbc00   ecx: 00000000   edx: 00000000
>> esi: f3f6605c   edi: c02971b0   ebp: 00000098   esp: f3f64088
>> ds: 007b   es: 007b   ss: 0068
>>
>>
>> Try#3
>>
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm 
>> ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi 
>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop 
>> dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp cdrom ohci_hcd 
>> i2c_core usbcore sworks_agp pci_hotplug agpgart tg3 reiserfs edd fan 
>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>> CPU:    2
>> EIP:    0060:[<c029723e>]    Tainted: P     X VLI
>> EFLAGS: 00210006   (2.6.16.21-0.8-bigsmp #1)
>> EIP is at do_page_fault+0x8e/0x5f6
>> eax: f3f2c000   ebx: 880f0133   ecx: 64656e77   edx: 64656e77
>> esi: f3f30058   edi: c02971b0   ebp: 64656f0f   esp: f3f2c084
>> ds: 007b   es: 007b   ss: 0068
>> Unable to handle kernel paging request at virtual address 01110954
>>   printing eip:
>> c029723e
>> *pde = 33dda001
>> Unable to handle kernel NULL pointer dereference at virtual address 00000030
>>   printing eip:
>> c015c752
>> *pde = 3629c001
>> o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has been idle for 10 seconds, shutting it down.
>> (10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068 adv 1309364991.767450:1309364991.767451 func (9987e679:2) 1309364870.220076:1309364870.220078)
>> o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has been idle for 10 seconds, shutting it down.
>> (10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.769291 now 1309365001.767537 dr 1309364996.770248 adv 1309364991.769302:1309364991.769303 func (3768d12f:505) 1309364991.769291:1309364991.769296)
>> Unable to handle kernel paging request at virtual address 4e0b5293
>>   printing eip:
>> c024c829
>> *pde = 36b61001
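
The o2net_idle_timer lines in try #3 carry raw epoch timestamps, and the "idle for 10 seconds" claim can be checked by subtracting them. The sketch below assumes that "tmr" is when the idle timer was last armed and "now" is when it fired; that reading fits the message text but is not confirmed anywhere in this thread. The helper name idle_seconds is made up for this example, and the two sample lines are the "tmr ... now ..." parts of the log above with the mail wrapping removed. Decimal is used because floats lose microsecond precision at this magnitude.

```python
import re
from decimal import Decimal

# Picks the "tmr <sec.usec> now <sec.usec>" pair out of an o2net_idle_timer line.
TIMES_RE = re.compile(r"tmr (\d+\.\d+) now (\d+\.\d+)")

# The timestamp pairs from the two o2net_idle_timer lines in try #3.
NODE_02_LINE = "(tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068"
NODE_05_LINE = "(tmr 1309364991.769291 now 1309365001.767537 dr 1309364996.770248"


def idle_seconds(line):
    """Return now - tmr, in seconds, for one o2net_idle_timer log line."""
    match = TIMES_RE.search(line)
    if match is None:
        raise ValueError("no 'tmr ... now ...' pair found")
    tmr, now = (Decimal(group) for group in match.groups())
    return now - tmr


print("node-02 idle:", idle_seconds(NODE_02_LINE))  # 10.000057
print("node-05 idle:", idle_seconds(NODE_05_LINE))  # 9.998246
```

Both connections come out at roughly ten seconds, matching the timeout the o2net messages report. The same subtraction can be applied to the "dr" and "adv" fields, but their meaning is not stated in this thread, so they are left alone here.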
>>
>> Try #4
>>
>> Unable to handle kernel paging request at virtual address fffffffc
>>   printing eip:
>> c016e54e
>> *pde = 00000000
>> Oops: 0000 [#1]
>> SMP
>> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
>> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm 
>> ocfs2_nodemanager ipv6 configfs iscsi_tcp libiscsi 
>> scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop 
>> dm_mod netconsole usbhid ide_cd cpqphp cdrom i2c_piix4 ohci_hcd 
>> sworks_agp i2c_core usbcore agpgart pci_hotplug tg3 reiserfs edd fan 
>> thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
>> CPU:    3
>> EIP:    0060:[<c016e54e>]    Tainted: P     X VLI
>> EFLAGS: 00010297   (2.6.16.21-0.8-bigsmp #1)
>> EIP is at poll_freewait+0xd/0x3a
>> eax: f5ab5f90   ebx: ffffffe4   ecx: dffff040   edx: c1000000
>> esi: f31c4000   edi: bffa3bf4   ebp: f34b8310   esp: f5ab5f60
>> ds: 007b   es: 007b   ss: 0068
>> Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
>> Stack:<0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4 
>> 00000000 f34b8310
>>         00000002 00000002 00000000 f34b8300 c016f12a f31c4000 
>> 00000000 bffa3be4
>>         00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000 
>> c0103cab bffa3be4
>> Call Trace:
>>   [<c016e85a>] do_sys_poll+0x2df/0x2e9
>>   [<c016f12a>] __pollwait+0x0/0x95
>>   [<c016e8a8>] sys_poll+0x44/0x47
>>   [<c0103cab>] sysenter_past_esp+0x54/0x79
>> Code: c4 10 89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00 
>> 00 c7 40 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c<8b>  
>> 43 18 8d 53 04 e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08
>>
>> ----- Original Message -----
>> From: "B Leggett"<bleggett at ngent.com>
>> To: ocfs2-users at oss.oracle.com
>> Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> For the list, I accidentally sent it direct to Sunil. My apologies 
>> for that.
>>
>> Bruce
>> ----- Original Message -----
>> From: "B Leggett"<bleggett at ngent.com>
>> To: "Sunil Mushran"<sunil.mushran at oracle.com>
>> Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> Sunil,
>> I did as you requested and got one line of output.
>>
>> o2net: accepted connection from node node-05 (num 4) at 192.168.1.62:7777
>>
>> Bruce
>> ----- Original Message -----
>> From: "Sunil Mushran"<sunil.mushran at oracle.com>
>> To: "B Leggett"<bleggett at ngent.com>
>> Cc: ocfs2-users at oss.oracle.com
>> Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [Ocfs2-users] OCFS2 Crash
>>
>> 1.2.1? That's 5 years old. We've had a few fixes since then. ;)
>>
>> You have to catch the oops trace to figure out the reason. One
>> way to get it is by using netconsole. Check the SLES 10 docs to see
>> how to configure netconsole. Or use whatever is recommended for
>> capturing the oops log in that release.
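
netconsole sends kernel messages as plain UDP datagrams to another machine, so the receiving side only needs something that listens on a UDP port and writes what arrives to a file. The sketch below is a minimal receiver under stated assumptions: port 6666 is assumed to be the target port (it is netconsole's usual default, but whatever the crashing node is configured with is what counts), and the function names open_listener and capture are made up for this example. It does not cover configuring netconsole on the crashing node, which is what the SLES 10 docs are for.

```python
import socket

NETCONSOLE_PORT = 6666  # assumption: must match the target port set on the sender


def open_listener(bind_addr="0.0.0.0", port=NETCONSOLE_PORT):
    """Return a UDP socket bound to the address netconsole will send to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def capture(sock, out, max_packets=None):
    """Append each received datagram to `out` and return the packet count.

    `out` is any binary file-like object. With max_packets=None the loop
    runs until the process is interrupted.
    """
    received = 0
    while max_packets is None or received < max_packets:
        data, _sender = sock.recvfrom(65535)
        out.write(data)
        out.flush()  # flush per packet so the log survives the listener being killed
        received += 1
    return received
```

On the logging machine this would be run as capture(open_listener(), open("netconsole.log", "ab")) and left running until the node crashes. The file name is arbitrary. Any existing syslog daemon or UDP listener would do the same job; the point is only that the oops text leaves the machine before it locks up.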
>>
>> On 06/29/2011 11:28 AM, B Leggett wrote:
>>> Hi,
>>> I am running OCFS2 1.2.1 on SLES 10, just the stuff right out
>>> of the box. This is a 3 node cluster that's been running for 2 years
>>> with just about zero modification. The storage is a high-end SAN and
>>> the transport is iSCSI. We went two years without an issue and all of
>>> a sudden node 1 in the cluster keeps crashing. I have never had to
>>> troubleshoot OCFS2, so I started with what I could control.
>>>
>>> I checked /var/log/messages and nothing there suggests a problem. I
>>> replaced hardware, going as far as popping the SCSI drives out
>>> and putting them in another server to try it with all new
>>> hardware. The problem still persists.
>>>
>>> I had the network team check the iscsi port on the private iscsi 
>>> network and they are not seeing errors.
>>>
>>> I've checked the few OCFS2 settings in play and they all look good.
>>>
>>> My question to the group is how do I continue troubleshooting this
>>> issue? I'm not aware of any native logs etc. to reference. I would
>>> appreciate any help that gets this diagnosis moving to a solution.
>>>
>>> Thanks,
>>> Bruce
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>
>

-- 
>> XLhost.de ® - Web hosting from supersmall to eXtra Large <<

XLhost.de GmbH
Jürgen Herrmann, Managing Director
Boelckestrasse 21, 93051 Regensburg, Germany

Managing Director: Jürgen Herrmann
Registered under: HRB9918
VAT identification number: DE245931218

Phone: +49 (0)800 XLHOSTDE [0800 95467833]
Fax:  +49 (0)800 95467830
Web:  http://www.XLhost.de
