[Ocfs2-users] OCFS2 Crash

Wed Jun 29 14:20:35 PDT 2011

Sunil,
After that first attempt I tried several more times and got actual oops output. I think try #3 has the most details.

Try #2:

Oops: 0000 [#1]
SMP 
last sysfs file: /firmware/edd/int13_dev80/mbr_signature
Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid cpqphp i2c_piix4 ohci_hcd sworks_agp ide_cd cdrom pci_hotplug i2c_core agpgart usbcore tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
CPU:    0
EIP:    0060:[<c029723e>]    Tainted: P     X VLI
EFLAGS: 00210086   (2.6.16.21-0.8-bigsmp #1) 
EIP is at do_page_fault+0x8e/0x5f6
eax: f3f64000   ebx: c02fbc00   ecx: 00000000   edx: 00000000
esi: f3f6605c   edi: c02971b0   ebp: 00000098   esp: f3f64088
ds: 007b   es: 007b   ss: 0068

Try #3:

Oops: 0000 [#1]
SMP 
last sysfs file: /firmware/edd/int13_dev80/mbr_signature
Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs ipv6 iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid i2c_piix4 ide_cd cpqphp cdrom ohci_hcd i2c_core usbcore sworks_agp pci_hotplug agpgart tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
CPU:    2
EIP:    0060:[<c029723e>]    Tainted: P     X VLI
EFLAGS: 00210006   (2.6.16.21-0.8-bigsmp #1) 
EIP is at do_page_fault+0x8e/0x5f6
eax: f3f2c000   ebx: 880f0133   ecx: 64656e77   edx: 64656e77
esi: f3f30058   edi: c02971b0   ebp: 64656f0f   esp: f3f2c084
ds: 007b   es: 007b   ss: 0068
Unable to handle kernel paging request at virtual address 01110954
 printing eip:
c029723e
*pde = 33dda001
Unable to handle kernel NULL pointer dereference at virtual address 00000030
 printing eip:
c015c752
*pde = 3629c001
o2net: connection to node node-02 (num 2) at 192.168.1.173:7777 has been idle for 10 seconds, shutting it down.
(10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.767445 now 1309365001.767502 dr 1309364996.769068 adv 1309364991.767450:1309364991.767451 func (9987e679:2) 1309364870.220076:1309364870.220078)
o2net: connection to node node-05 (num 4) at 192.168.1.62:7777 has been idle for 10 seconds, shutting it down.
(10,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1309364991.769291 now 1309365001.767537 dr 1309364996.770248 adv 1309364991.769302:1309364991.769303 func (3768d12f:505) 1309364991.769291:1309364991.769296)
Unable to handle kernel paging request at virtual address 4e0b5293
 printing eip:
c024c829
*pde = 36b61001
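
The o2net idle-timer lines in try #3 carry raw epoch timestamps, so the gaps can be worked out directly. Below is a minimal sketch that pulls the `tmr`, `now` and `dr` values out of the node-02 line and subtracts them. The function names are hypothetical, and reading `tmr` as "idle timer armed" and `dr` as "last data ready on the socket" is an assumption about the o2net fields, not something stated in the log itself.

```python
import re
from decimal import Decimal

# The node-02 idle-timer line from try #3, copied from the console output.
NODE_02_LINE = (
    "(10,0):o2net_idle_timer:1309 here are some times that might help debug "
    "the situation: (tmr 1309364991.767445 now 1309365001.767502 "
    "dr 1309364996.769068 adv 1309364991.767450:1309364991.767451 "
    "func (9987e679:2) 1309364870.220076:1309364870.220078)"
)


def parse_idle_times(line):
    """Return the tmr/now/dr timestamps found in an o2net_idle_timer line."""
    fields = {}
    for name in ("tmr", "now", "dr"):
        match = re.search(rf"\b{name} (\d+\.\d+)", line)
        if match:
            # Decimal keeps the microsecond digits exact; a float would not.
            fields[name] = Decimal(match.group(1))
    return fields


def idle_gaps(line):
    """Seconds between 'now' and the tmr and dr timestamps."""
    times = parse_idle_times(line)
    return {
        "now_minus_tmr": times["now"] - times["tmr"],
        "now_minus_dr": times["now"] - times["dr"],
    }


gaps = idle_gaps(NODE_02_LINE)
print(gaps["now_minus_tmr"])  # 10.000057
print(gaps["now_minus_dr"])   # 4.998434
```

The first gap matches the "idle for 10 seconds" in the o2net message. The second, under the assumption above, would put the last data-ready event on that connection about 5 seconds before the timer fired.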

Try #4:

Unable to handle kernel paging request at virtual address fffffffc
 printing eip:
c016e54e
*pde = 00000000
Oops: 0000 [#1]
SMP 
last sysfs file: /firmware/edd/int13_dev80/mbr_signature
Modules linked in: ocfs2 jbd sg ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager ipv6 configfs iscsi_tcp libiscsi scsi_transport_iscsi xofs button battery ac apparmor aamatch_pcre loop dm_mod netconsole usbhid ide_cd cpqphp cdrom i2c_piix4 ohci_hcd sworks_agp i2c_core usbcore agpgart pci_hotplug tg3 reiserfs edd fan thermal processor cciss serverworks sd_mod scsi_mod ide_disk ide_core
CPU:    3
EIP:    0060:[<c016e54e>]    Tainted: P     X VLI
EFLAGS: 00010297   (2.6.16.21-0.8-bigsmp #1) 
EIP is at poll_freewait+0xd/0x3a
eax: f5ab5f90   ebx: ffffffe4   ecx: dffff040   edx: c1000000
esi: f31c4000   edi: bffa3bf4   ebp: f34b8310   esp: f5ab5f60
ds: 007b   es: 007b   ss: 0068
Process iscsid (pid: 3206, threadinfo=f5ab4000 task=f54521b0)
Stack: <0>00000000 00000000 c016e85a f5ab5fb0 bffa3bf4 bffa3bf4 00000000 f34b8310 
       00000002 00000002 00000000 f34b8300 c016f12a f31c4000 00000000 bffa3be4 
       00000000 b7f08ff4 f5ab4000 c016e8a8 00000000 00000000 c0103cab bffa3be4 
Call Trace:
 [<c016e85a>] do_sys_poll+0x2df/0x2e9
 [<c016f12a>] __pollwait+0x0/0x95
 [<c016e8a8>] sys_poll+0x44/0x47
 [<c0103cab>] sysenter_past_esp+0x54/0x79
Code: c4 10 89 d8 5b 5e 5f 5d c3 c7 00 2a f1 16 c0 c7 40 08 00 00 00 00 c7 40 04 00 00 00 00 c3 56 53 8b 70 04 eb 2c 8b 5e 04 83 eb 1c <8b> 43 18 8d 53 04 e8 6d 3d fc ff 8b 03 e8 a8 12 ff ff 8d 46 08 

----- Original Message -----
From: "B Leggett" <bleggett at ngent.com>
To: ocfs2-users at oss.oracle.com
Sent: Wednesday, June 29, 2011 3:42:42 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

For the list, I accidentally sent it direct to Sunil. My apologies for that.

Bruce
----- Original Message -----
From: "B Leggett" <bleggett at ngent.com>
To: "Sunil Mushran" <sunil.mushran at oracle.com>
Sent: Wednesday, June 29, 2011 3:40:52 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

Sunil,
I did as you requested and got one line of output.

o2net: accepted connection from node node-05 (num 4) at 192.168.1.62:7777

Bruce
----- Original Message -----
From: "Sunil Mushran" <sunil.mushran at oracle.com>
To: "B Leggett" <bleggett at ngent.com>
Cc: ocfs2-users at oss.oracle.com
Sent: Wednesday, June 29, 2011 2:42:08 PM GMT -05:00 US/Canada Eastern
Subject: Re: [Ocfs2-users] OCFS2 Crash

1.2.1? That's 5 years old. We've had a few fixes since then. ;)

You have to catch the oops trace to figure out the reason. One way
to get it is by using netconsole. Check the SLES 10 docs to see how to
configure netconsole, or use whatever is recommended for capturing the
oops log in that release.
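
For the receiving side, netconsole sends kernel messages as plain UDP datagrams, so any UDP listener on another machine can capture them. The following is only a rough sketch of such a listener, not the SLES-recommended setup. The function names and the `oops.log` file name are hypothetical, and port 6666 is assumed because it is netconsole's usual default target port; use whatever port the crashing node is configured to send to.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for incoming netconsole datagrams (port is an assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def collect_messages(sock, count, timeout=5.0):
    """Read up to `count` datagrams; stop early if `timeout` seconds pass."""
    sock.settimeout(timeout)
    messages = []
    for _ in range(count):
        try:
            data, _addr = sock.recvfrom(65535)
        except socket.timeout:
            break
        # Oops output can contain stray bytes, so never fail on decoding.
        messages.append(data.decode("utf-8", errors="replace"))
    return messages


def log_forever(sock, path="oops.log"):
    """Append every received datagram to `path` (hypothetical file name)."""
    with open(path, "a") as log:
        while True:
            for message in collect_messages(sock, count=1, timeout=None):
                log.write(message)
                log.flush()  # flush each message in case the sender dies mid-oops
```

Running `log_forever(open_listener())` on a second machine would keep appending whatever the crashing node sends until interrupted. Tools such as netcat or a syslog daemon listening on UDP can do the same job.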

On 06/29/2011 11:28 AM, B Leggett wrote:
> Hi,
> I am running OCFS2 1.2.1 on SLES 10, just the stuff right out of the box. This is a 3 node cluster that's been running for 2 years with just about zero modification. The storage is a high end SAN and the transport is iscsi. We went two years without an issue and all of a sudden node 1 in the cluster keeps crashing. I have never had to troubleshoot OCFS2, so I started with what I could control.
>
> I checked /var/log/messages and nothing there suggests a problem. I replaced hardware, going as far as popping the scsi drives out, putting them in another server, and trying it with all new hardware. The problem still persists.
>
> I had the network team check the iscsi port on the private iscsi network and they are not seeing errors.
>
> I've checked the few OCFS2 settings in play and they all look good.
>
> My question to the group is how do I continue troubleshooting this issue? I'm not aware of any native logs etc. to reference. I would appreciate any help that gets this diagnosis moving to a solution.
>
> Thanks,
> Bruce

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users