[Ocfs2-users] OCFS2 Crash

Wed Jun 29 14:43:09 PDT 2011

That's troubling; these are really static systems. I know anything can happen, but to inherit a kernel issue two years later seems nuts. Not that your analysis is wrong, it just blows me away is all. Is there a chance I would be better off removing this node and replacing it with a fresh build?

----- Original Message -----
From: "Sunil Mushran" <sunil.mushran at oracle.com>
To: "B Leggett" <bleggett at ngent.com>
Cc: ocfs2-users at oss.oracle.com
Sent: Wednesday, June 29, 2011 5:23:40 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

You should ping your kernel vendor. This does not look ocfs2
related, and even if it did, you would first be asked to upgrade to a
more recent kernel, etc. And all those bits will come from the vendor.
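
For readers who want to check which function each oops faulted in, here is a minimal Python sketch that pulls the "EIP is at" lines out of a captured log. The function name and the SAMPLE text are illustrative (the sample lines are copied from the traces quoted below), and the optional trailing "[module]" tag is an assumption about how the kernel marks symbols that live in a loadable module; it is not something shown in this thread.

```python
import re

# Matches lines such as "EIP is at do_page_fault+0x8e/0x5f6".
# The optional "[module]" suffix is an assumed format for module symbols.
EIP_RE = re.compile(
    r"EIP is at (\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?"
)


def faulting_symbols(log_text):
    """Return (symbol, offset, size, module) for every 'EIP is at' line."""
    results = []
    for line in log_text.splitlines():
        match = EIP_RE.search(line)
        if match:
            symbol, offset, size, module = match.groups()
            # Offsets and sizes are printed in hex; module is None if absent.
            results.append((symbol, int(offset, 16), int(size, 16), module))
    return results


# Lines copied from the three oops traces in this thread.
SAMPLE = """\
> EIP is at do_page_fault+0x8e/0x5f6
> EIP is at do_page_fault+0x8e/0x5f6
> EIP is at poll_freewait+0xd/0x3a
"""

if __name__ == "__main__":
    for symbol, offset, size, module in faulting_symbols(SAMPLE):
        where = module if module else "no module tag"
        print(f"{symbol} +{offset:#x} of {size:#x} ({where})")
```

Running it over a full netconsole capture gives a quick list of the faulting functions to pass along to the kernel vendor.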

On 06/29/2011 02:20 PM, B Leggett wrote:
> Sunil,
> After that first attempt I tried several more times and got actual oopses. I think try #3 has the most details.
>
> Try #2:
>
> Oops: 0000 [#1]
> SMP
> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd sworks_agp ide_cd cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
> CPU:    0
> EIP:    0060:[<c029723e>]    Tainted: P     X VLI
> EFLAGS: 00210086   (2.6.16.21-0.8-bigsmp #1)
> EIP is at do_page_fault+0x8e/0x5f6
> eax: f3f64000   ebx: c02fbc00   ecx: 00000000   edx: 00000000
> esi: f3f6605c   edi: c02971b0   ebp: 00000098   esp: f3f64088
> ds: 007b   es: 007b   ss: 0068
>
>
> Try #3:
>
> Oops: 0000 [#1]
> SMP
> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp cdrom ohci_hcd i2c_core usbcore sworks_agp pci_hotplug agpgart tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
> CPU:    2
> EIP:    0060:[<c029723e>]    Tainted: P     X VLI
> EFLAGS: 00210006   (2.6.16.21-0.8-bigsmp #1)
> EIP is at do_page_fault+0x8e/0x5f6
> eax: f3f2c000   ebx: 880f0133   ecx: 64656e77   edx: 64656e77
> esi: f3f30058   edi: c02971b0   ebp: 64656f0f   esp: f3f2c084
> ds: 007b   es: 007b   ss: 0068
> Unable to handle kernel paging request at virtual address 01110954
>   printing eip:
> c029723e
> *pde = 33dda001
> Unable to handle kernel NULL pointer dereference at virtual address 00000030
>   printing eip:
> c015c752
> *pde = 3629c001
> o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has been idle for 10 seconds, shutting it down.
> (10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068 adv 1309364991.767450:1309364991.767451 func (9987e679:2) 1309364870.220076:1309364870.220078)
> o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has been idle for 10 seconds, shutting it down.
> (10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.769291 now 1309365001.767537 dr 1309364996.770248 adv 1309364991.769302:1309364991.769303 func (3768d12f:505) 1309364991.769291:1309364991.769296)
> Unable to handle kernel paging request at virtual address 4e0b5293
>   printing eip:
> c024c829
> *pde = 36b61001
>
> Try #4:
>
> Unable to handle kernel paging request at virtual address fffffffc
>   printing eip:
> c016e54e
> *pde = 00000000
> Oops: 0000 [#1]
> SMP
> last sysfs file: /firmware/edd/int13_dev80/mbr_signature
> Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager ipv6 configfs iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid ide_cd cpqphp cdrom i2c_piix4 ohci_hcd sworks_agp i2c_core usbcore agpgart pci_hotplug tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
> CPU:    3
> EIP:    0060:[<c016e54e>]    Tainted: P     X VLI
> EFLAGS: 00010297   (2.6.16.21-0.8-bigsmp #1)
> EIP is at poll_freewait+0xd/0x3a
> eax: f5ab5f90   ebx: ffffffe4   ecx: dffff040   edx: c1000000
> esi: f31c4000   edi: bffa3bf4   ebp: f34b8310   esp: f5ab5f60
> ds: 007b   es: 007b   ss: 0068
> Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
> Stack:<0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4 00000000 f34b8310
>         00000002 00000002 00000000 f34b8300 c016f12a f31c4000 00000000 bffa3be4
>         00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000 c0103cab bffa3be4
> Call Trace:
>   [<c016e85a>] do_sys_poll+0x2df/0x2e9
>   [<c016f12a>] __pollwait+0x0/0x95
>   [<c016e8a8>] sys_poll+0x44/0x47
>   [<c0103cab>] sysenter_past_esp+0x54/0x79
> Code: c4 10 89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00 00 c7 40 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c<8b>  43 18 8d 53 04 e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08
>
> ----- Original Message -----
> From: "B Leggett"<bleggett at ngent.com>
> To: ocfs2-users at oss.oracle.com
> Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> For the list, I accidentally sent it direct to Sunil. My apologies for that.
>
> Bruce
> ----- Original Message -----
> From: "B Leggett"<bleggett at ngent.com>
> To: "Sunil Mushran"<sunil.mushran at oracle.com>
> Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> Sunil,
> I did as you requested and got one line of output.
>
> o2net: accepted connection from node node-05 (num 4) at 192.168.1.62:7777
>
> Bruce
> ----- Original Message -----
> From: "Sunil Mushran"<sunil.mushran at oracle.com>
> To: "B Leggett"<bleggett at ngent.com>
> Cc: ocfs2-users at oss.oracle.com
> Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada Eastern
> Subject: Re: [Ocfs2-users] OCFS2 Crash
>
> 1.2.1? That's 5 years old. We've had a few fixes since then. ;)
>
> You have to catch the oops trace to figure out the reason. One way
> to get it is by using netconsole. Check the SLES 10 docs to see how to
> configure netconsole, or use whatever is recommended for capturing the
> oops log in that release.
>
> On 06/29/2011 11:28 AM, B Leggett wrote:
>> Hi,
>> I am running OCFS2 1.2.1 on SLES 10, just the stuff right out of the box. This is a 3-node cluster that's been running for 2 years with just about zero modification. The storage is a high-end SAN and the transport is iSCSI. We went two years without an issue and all of a sudden node 1 in the cluster keeps crashing. I have never had to troubleshoot OCFS2, so I started with what I could control.
>>
>> I checked /var/log/messages and nothing there suggests a problem. I replaced hardware, going as far as popping the SCSI drives out, putting them in another server, and trying it with all new hardware. The problem still persists.
>>
>> I had the network team check the iSCSI port on the private iSCSI network and they are not seeing errors.
>>
>> I've checked the few OCFS2 settings in play and they all look good.
>>
>> My question to the group is: how do I continue troubleshooting this issue? I'm not aware of any native logs, etc., to reference. I would appreciate any help that gets this diagnosis moving toward a solution.
>>
>> Thanks,
>> Bruce
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users