[Ocfs2-users] Weird crash

Tue Jan 19 13:16:34 PST 2010

So novell does not update the version number. But they do update
the code. The version number update is hard because they have
a different kernel than EL. Meaning the patches we have for EL
may not apply on their tree. And vice-versa.

Report this problem to them and let them handle it from there.

The fsck issue rings a bell. But I'll need more info to jog my memory.

I am aware of the -2 issue. It is harmless. The ocfs2 people in novell
are already aware of it. In short, the fs is unable to remove the
unlinked file from the orphan dir.

The last problem indicates incorrect journal credit calculation. The
last fix we have for this issue for that stack (mknod) is dated 2007.
So it could be sles specific.

Sunil

Sérgio Surkamp wrote:
> Hello Mr. Sunil and list,
>
> Last weekend we have upgraded our servers to SLES10SP3 and with it the
> OCFS2 version 1.4.1-1.
>
> During the upgrade we have run the fsck in the volume (using the new
> OCFS2 version) and it just died in "Pass 1c" with the error:
> fsck.ocfs2: Unknown code while performing pass 1
>
> Checking the dmesg there was this messages repeated many times:
> (5713,1):ocfs2_orphan_del:1869 ERROR: status = -2
> (5713,1):ocfs2_remove_inode:614 ERROR: status = -2
> (5713,1):ocfs2_wipe_inode:740 ERROR: status = -2
> (5713,1):ocfs2_delete_inode:974 ERROR: status = -2
>
> Could this mean that our filesystem is corrupted? Could it be the cause
> of this bug to happen?
>
> Current versions:
> OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
> f922955d99ef972235bd0c1fc236c5ddbb368611)
> OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
> f922955d99ef972235bd0c1fc236c5ddbb368611)
> OCFS2 User DLM kernel interface loaded bnx2: eth0 NIC Copper Link is
> Up, 1000 Mbps full duplex ADDRCONF(NETDEV_CHANGE): eth0: link becomes
> ready bnx2: eth1 NIC Copper Link is Up, 100 Mbps full duplex, receive &
> transmit flow control ON ADDRCONF(NETDEV_CHANGE): eth1: link becomes
> ready o2net: connected to node soap02 (num 0) at [--cut--]:7777
> OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
> f922955d99ef972235bd0c1fc236c5ddbb368611)
>
> Latest crash dmesg message:
> Assertion failure in journal_dirty_metadata() at
> fs/jbd/transaction.c:1114: "handle->h_buffer_credits > 0" -----------
> [cut here ] --------- [please bite here ] --------- Kernel BUG at
> fs/jbd/transaction.c:1114 invalid opcode: 0000 [1] SMP 
> last sysfs file: /block/sdb/sdb1/stat
> CPU 2 
> Modules linked in: nls_iso8859_1 nls_cp437 vfat fat nls_utf8 nfsd
> exportfs lockd nfs_acl sunrpc iptable_nat ip_nat iptable_mangle
> xt_comment xt_tcpudp xt_state ip_conntrack nfnetlink iptable_filter
> ip_tables x_tables joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
> ocfs2_nodemanager configfs ipv6 dock button battery ac apparmor sr_mod
> loop usbhid usb_storage uhci_hcd ide_cd shpchp cdrom ehci_hcd hw_random
> usbcore pci_hotplug bnx2 reiserfs ata_piix ahci libata dm_snapshot edd
> dm_mod fan thermal processor qla2xxx firmware_class scsi_transport_fc
> st aic7xxx scsi_transport_spi sg megaraid_sas piix sd_mod scsi_mod
> ide_disk ide_core Pid: 29322, comm: nfsd Not tainted
> 2.6.16.60-0.58.1-smp #1 RIP: 0010:[<ffffffff883f11e0>]
> <ffffffff883f11e0>{:jbd:journal_dirty_metadata+200} RSP:
> 0018:ffff8101ca763c18  EFLAGS: 00010292 RAX: 000000000000006e RBX:
> ffff8100cc920a80 RCX: ffffffff80361c68 RDX: ffffffff80361c68 RSI:
> 0000000000000296 RDI: ffffffff80361c60 RBP: ffff81012e3978c8 R08:
> ffffffff80361c68 R09: ffff8102244cd540 R10: 0000000000000292 R11:
> 0000000000000070 R12: ffff8101f00274f8 R13: ffff810074c031f0 R14:
> ffff810226624a00 R15: ffff81012352653c FS:  00002ae6d8b156d0(0000)
> GS:ffff81022856a2c0(0000) knlGS:0000000000000000 CS:  0010 DS: 0000 ES:
> 0000 CR0: 000000008005003b CR2: 00002aaaafb2d000 CR3: 000000009e42a000
> CR4: 00000000000006e0 Process nfsd (pid: 29322, threadinfo
> ffff8101ca762000, task ffff8102284de810) Stack: ffff81012e3978c8
> ffff8101f00274f8 0000000000000000 0000000000000005 ffff81019ea88000
> ffffffff8843264a 0000000000000000 ffff81019ea88050 ffff81000b55cc20
> ffffffff8841b445 Call Trace:
> <ffffffff8843264a>{:ocfs2:ocfs2_journal_dirty+106}
> <ffffffff8841b445>{:ocfs2:__ocfs2_add_entry+745}
> <ffffffff884396a9>{:ocfs2:ocfs2_mknod+1710}
> <ffffffff88439988>{:ocfs2:ocfs2_mkdir+127}
> <ffffffff80194075>{vfs_mkdir+346}
> <ffffffff88526ef5>{:nfsd:nfsd_create+753}
> <ffffffff8852dc63>{:nfsd:nfsd3_proc_mkdir+217}
> <ffffffff885220ea>{:nfsd:nfsd_dispatch+216}
> <ffffffff884d975b>{:sunrpc:svc_process+982}
> <ffffffff802f0c59>{__down_read+21} <ffffffff88522454>{:nfsd:nfsd+0}
> <ffffffff88522623>{:nfsd:nfsd+463} <ffffffff8010bea6>{child_rip+8}
> <ffffffff88522454>{:nfsd:nfsd+0} <ffffffff88522454>{:nfsd:nfsd+0}
> <ffffffff8010be9e>{child_rip+0}
>
> Code: 0f 0b 68 d9 7a 3f 88 c2 5a 04 41 ff 4c 24 08 49 39 5d 28 75 
> RIP <ffffffff883f11e0>{:jbd:journal_dirty_metadata+200} RSP
> <ffff8101ca763c18>
>
> Bottom line: We are contacting Novell by phone support to check why
> OCFS2 version isn't the latest in SLES10SP3.
>
> Regards,
> Sérgio
>
> Em Tue, 01 Sep 2009 11:14:37 -0700
> Sunil Mushran <sunil.mushran at oracle.com> escreveu:
>
>   
>> For issues on sles, please file a bug/sr with novell.
>>
>> The issue here is insufficient journal credits. It _could_ be that
>> this version
>> is missing mainline git commit
>> e051fda4fd14fe878e6d2183b3a4640febe9e9a8. But I don't know. Novell
>> Support will be better placed to track down the issue.
>>
>> Sérgio Surkamp wrote:
>>     
>>> Hi list,
>>>
>>> One of our OCFS2 servers crashed with this message:
>>>
>>> Aug 26 11:33:11 soap01 kernel: Assertion failure in
>>> journal_dirty_metadata() at fs/jbd/transaction.c:1114:
>>> "handle->h_buffer_credits > 0"
>>> Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] ---------
>>> [please bite here ] ---------
>>> Aug 26 11:33:11 soap01 kernel: Kernel BUG at
>>> fs/jbd/transaction.c:1114 Aug 26 11:33:11 soap01 kernel: invalid
>>> opcode: 0000 [1] SMP Aug 26 11:33:11 soap01 kernel: last sysfs
>>> file: /devices/pci0000:00/0000:00:00.0/irq
>>> Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel:
>>> Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
>>> ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6
>>> button battery ac netconsole xt_comment xt_tcpudp xt_state
>>> iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack
>>> nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid
>>> usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom
>>> pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs
>>> ata_piix ahci libata dm_snapshot qla2xxx firmware_class
>>> qla2xxx_conf intermodule edd d m_mod fan thermal processor sg
>>> megaraid_sas piix sd_mod scsi_mod ide_disk ide_core Aug 26 11:33:11
>>> soap01 kernel: Pid: 4874, comm: nfsd Tainted: G     U
>>> 2.6.16.60-0.21-smp #1 Aug 26 11:33:11 soap01 kernel: RIP:
>>> 0010:[<ffffffff885e21e0>]
>>> <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} Aug 26 11:33:11
>>> soap01 kernel: RSP: 0018:ffff81021e9f1c18  EFLAGS: 00010292
>>> Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX:
>>> ffff8101decf30c0 RCX: 0000000000000292
>>> Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI:
>>> 0000000000000296 RDI: ffffffff80359960
>>> Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08:
>>> ffffffff80359968 R09: ffff810221d3ad80
>>> Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11:
>>> 0000000000000070 R12: ffff8101dda21588
>>> Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14:
>>> ffff8102277ab400 R15: ffff8100a4dd394c
>>> Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000)
>>> GS:ffffffff803d3000(0000) knlGS:0000000000000000
>>> Aug 26 11:33:11 soap01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
>>> 000000008005003b
>>> Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3:
>>> 000000015e180000 CR4: 00000000000006e0
>>> Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo
>>> ffff81021e9f0000, task ffff81021f92e860)
>>> Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870
>>> ffff8101dda21588 0000000000000000 0000000000000003
>>> Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f
>>> 0000000000000000 ffff81018ba52040
>>> Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a
>>> Aug 26 11:33:11 soap01 kernel: Call Trace:
>>> <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106}
>>> Aug 26 11:33:11 soap01 kernel:
>>> <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745}
>>> <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710}
>>> Aug 26 11:33:11 soap01 kernel:
>>> <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127}
>>> <ffffffff80192b48>{vfs_mkdir+346}
>>> Aug 26 11:33:11 soap01 kernel:
>>> <ffffffff88522f05>{:nfsd:nfsd_create+753}
>>> <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217} Aug 26 11:33:11
>>> soap01 kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216}
>>> <ffffffff884d549a>{:sunrpc:svc_process+982}
>>> Aug 26 11:33:11 soap01 kernel:
>>> <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0}
>>> Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463}
>>> <ffffffff8010bed2>{child_rip+8}
>>> Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0}
>>> <ffffffff8851e46e>{:nfsd:nfsd+0}
>>> Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0}
>>> Aug 26 11:33:11 soap01 kernel:
>>> Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04
>>> 41 ff 4c 24 08 49 39 5d 28 75
>>> Aug 26 11:33:11 soap01 kernel: RIP
>>> <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP
>>> <ffff81021e9f1c18>
>>>
>>> Operating system: SuSE SLES 10SP1
>>> Kernel: 2.6.16.60-0.21-smp
>>> OCFS2: 1.4.0-SLES
>>>
>>> Environment:
>>>
>>> * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients
>>> * 2 SLES 10SP1 exporting the filesystem
>>>
>>> The FreeBSD clients are our email servers, so the main traffic is
>>> many small email files.
>>>
>>> NFS mounted with protocol version 3, readdirplus disabled, read and
>>> write buffer of 32k.
>>>
>>> Pre-crash symptoms:
>>> * The ocfs filesystem hung for a while or gets very slow;
>>> * Low or null device traffic on both nodes (checked with `iostat`);
>>> * The server load get 5 to 6 points higher;
>>> * It seems that something in kernel deadlock, as other processes
>>> (doing IO, but in other mount points with raiserfs) hug a CPU with
>>> 100% usage;
>>>   Eg: There is a mysql database in raiserfs mount point and the
>>> mysqld hug the CPU when I call `rcmysql stop`;
>>> * Calling `reboot` or `shutdown -r now` blocks the console (didn't
>>> tried to run it with strace to get the locking point, but will try
>>> it happen again);
>>> * imapd on clients blocked in nfs requests;
>>>   One of the processes was blocked in (FreeBSD kernel) state bo_wwa.
>>>   Looking in some discussion group's over the net, this state means
>>>   blocked by stale NFS server. Attaching to the process with `gdb`,
>>> its always blocked in close() libc call;
>>>
>>> imapd process backtrace:
>>> #0  0x282a5da3 in close () from /lib/libc.so.7
>>> #1  0x282a5711 in memcpy () from /lib/libc.so.7
>>> #2  0xbfbf9378 in ?? ()
>>> #3  0x2828d58d in fclose () from /lib/libc.so.7
>>>
>>> Could it be related to o2cb configuration? Current configuration:
>>>
>>> O2CB_HEARTBEAT_THRESHOLD=61
>>> O2CB_IDLE_TIMEOUT_MS=60000
>>>
>>> The heartbeat network is a GBit ethernet.
>>>
>>> Regards,
>>>   
>>>       
>
>
>
>