[Ocfs2-users] Weird crash

Tue Jan 19 05:43:02 PST 2010

Hello Mr. Sunil and list,

Last weekend we have upgraded our servers to SLES10SP3 and with it the
OCFS2 version 1.4.1-1.

During the upgrade we have run the fsck in the volume (using the new
OCFS2 version) and it just died in "Pass 1c" with the error:
fsck.ocfs2: Unknown code while performing pass 1

Checking the dmesg there was this messages repeated many times:
(5713,1):ocfs2_orphan_del:1869 ERROR: status = -2
(5713,1):ocfs2_remove_inode:614 ERROR: status = -2
(5713,1):ocfs2_wipe_inode:740 ERROR: status = -2
(5713,1):ocfs2_delete_inode:974 ERROR: status = -2

Could this mean that our filesystem is corrupted? Could it be the cause
of this bug to happen?

Current versions:
OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
f922955d99ef972235bd0c1fc236c5ddbb368611)
OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
f922955d99ef972235bd0c1fc236c5ddbb368611)
OCFS2 User DLM kernel interface loaded bnx2: eth0 NIC Copper Link is
Up, 1000 Mbps full duplex ADDRCONF(NETDEV_CHANGE): eth0: link becomes
ready bnx2: eth1 NIC Copper Link is Up, 100 Mbps full duplex, receive &
transmit flow control ON ADDRCONF(NETDEV_CHANGE): eth1: link becomes
ready o2net: connected to node soap02 (num 0) at [--cut--]:7777
OCFS2 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build
f922955d99ef972235bd0c1fc236c5ddbb368611)

Latest crash dmesg message:
Assertion failure in journal_dirty_metadata() at
fs/jbd/transaction.c:1114: "handle->h_buffer_credits > 0" -----------
[cut here ] --------- [please bite here ] --------- Kernel BUG at
fs/jbd/transaction.c:1114 invalid opcode: 0000 [1] SMP 
last sysfs file: /block/sdb/sdb1/stat
CPU 2 
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat nls_utf8 nfsd
exportfs lockd nfs_acl sunrpc iptable_nat ip_nat iptable_mangle
xt_comment xt_tcpudp xt_state ip_conntrack nfnetlink iptable_filter
ip_tables x_tables joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
ocfs2_nodemanager configfs ipv6 dock button battery ac apparmor sr_mod
loop usbhid usb_storage uhci_hcd ide_cd shpchp cdrom ehci_hcd hw_random
usbcore pci_hotplug bnx2 reiserfs ata_piix ahci libata dm_snapshot edd
dm_mod fan thermal processor qla2xxx firmware_class scsi_transport_fc
st aic7xxx scsi_transport_spi sg megaraid_sas piix sd_mod scsi_mod
ide_disk ide_core Pid: 29322, comm: nfsd Not tainted
2.6.16.60-0.58.1-smp #1 RIP: 0010:[<ffffffff883f11e0>]
<ffffffff883f11e0>{:jbd:journal_dirty_metadata+200} RSP:
0018:ffff8101ca763c18  EFLAGS: 00010292 RAX: 000000000000006e RBX:
ffff8100cc920a80 RCX: ffffffff80361c68 RDX: ffffffff80361c68 RSI:
0000000000000296 RDI: ffffffff80361c60 RBP: ffff81012e3978c8 R08:
ffffffff80361c68 R09: ffff8102244cd540 R10: 0000000000000292 R11:
0000000000000070 R12: ffff8101f00274f8 R13: ffff810074c031f0 R14:
ffff810226624a00 R15: ffff81012352653c FS:  00002ae6d8b156d0(0000)
GS:ffff81022856a2c0(0000) knlGS:0000000000000000 CS:  0010 DS: 0000 ES:
0000 CR0: 000000008005003b CR2: 00002aaaafb2d000 CR3: 000000009e42a000
CR4: 00000000000006e0 Process nfsd (pid: 29322, threadinfo
ffff8101ca762000, task ffff8102284de810) Stack: ffff81012e3978c8
ffff8101f00274f8 0000000000000000 0000000000000005 ffff81019ea88000
ffffffff8843264a 0000000000000000 ffff81019ea88050 ffff81000b55cc20
ffffffff8841b445 Call Trace:
<ffffffff8843264a>{:ocfs2:ocfs2_journal_dirty+106}
<ffffffff8841b445>{:ocfs2:__ocfs2_add_entry+745}
<ffffffff884396a9>{:ocfs2:ocfs2_mknod+1710}
<ffffffff88439988>{:ocfs2:ocfs2_mkdir+127}
<ffffffff80194075>{vfs_mkdir+346}
<ffffffff88526ef5>{:nfsd:nfsd_create+753}
<ffffffff8852dc63>{:nfsd:nfsd3_proc_mkdir+217}
<ffffffff885220ea>{:nfsd:nfsd_dispatch+216}
<ffffffff884d975b>{:sunrpc:svc_process+982}
<ffffffff802f0c59>{__down_read+21} <ffffffff88522454>{:nfsd:nfsd+0}
<ffffffff88522623>{:nfsd:nfsd+463} <ffffffff8010bea6>{child_rip+8}
<ffffffff88522454>{:nfsd:nfsd+0} <ffffffff88522454>{:nfsd:nfsd+0}
<ffffffff8010be9e>{child_rip+0}

Code: 0f 0b 68 d9 7a 3f 88 c2 5a 04 41 ff 4c 24 08 49 39 5d 28 75 
RIP <ffffffff883f11e0>{:jbd:journal_dirty_metadata+200} RSP
<ffff8101ca763c18>

Bottom line: We are contacting Novell by phone support to check why
OCFS2 version isn't the latest in SLES10SP3.

Regards,
Sérgio

Em Tue, 01 Sep 2009 11:14:37 -0700
Sunil Mushran <sunil.mushran at oracle.com> escreveu:

> For issues on sles, please file a bug/sr with novell.
> 
> The issue here is insufficient journal credits. It _could_ be that
> this version
> is missing mainline git commit
> e051fda4fd14fe878e6d2183b3a4640febe9e9a8. But I don't know. Novell
> Support will be better placed to track down the issue.
> 
> Sérgio Surkamp wrote:
> > Hi list,
> >
> > One of our OCFS2 servers crashed with this message:
> >
> > Aug 26 11:33:11 soap01 kernel: Assertion failure in
> > journal_dirty_metadata() at fs/jbd/transaction.c:1114:
> > "handle->h_buffer_credits > 0"
> > Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] ---------
> > [please bite here ] ---------
> > Aug 26 11:33:11 soap01 kernel: Kernel BUG at
> > fs/jbd/transaction.c:1114 Aug 26 11:33:11 soap01 kernel: invalid
> > opcode: 0000 [1] SMP Aug 26 11:33:11 soap01 kernel: last sysfs
> > file: /devices/pci0000:00/0000:00:00.0/irq
> > Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel:
> > Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
> > ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6
> > button battery ac netconsole xt_comment xt_tcpudp xt_state
> > iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack
> > nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid
> > usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom
> > pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs
> > ata_piix ahci libata dm_snapshot qla2xxx firmware_class
> > qla2xxx_conf intermodule edd d m_mod fan thermal processor sg
> > megaraid_sas piix sd_mod scsi_mod ide_disk ide_core Aug 26 11:33:11
> > soap01 kernel: Pid: 4874, comm: nfsd Tainted: G     U
> > 2.6.16.60-0.21-smp #1 Aug 26 11:33:11 soap01 kernel: RIP:
> > 0010:[<ffffffff885e21e0>]
> > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} Aug 26 11:33:11
> > soap01 kernel: RSP: 0018:ffff81021e9f1c18  EFLAGS: 00010292
> > Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX:
> > ffff8101decf30c0 RCX: 0000000000000292
> > Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI:
> > 0000000000000296 RDI: ffffffff80359960
> > Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08:
> > ffffffff80359968 R09: ffff810221d3ad80
> > Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11:
> > 0000000000000070 R12: ffff8101dda21588
> > Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14:
> > ffff8102277ab400 R15: ffff8100a4dd394c
> > Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000)
> > GS:ffffffff803d3000(0000) knlGS:0000000000000000
> > Aug 26 11:33:11 soap01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> > 000000008005003b
> > Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3:
> > 000000015e180000 CR4: 00000000000006e0
> > Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo
> > ffff81021e9f0000, task ffff81021f92e860)
> > Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870
> > ffff8101dda21588 0000000000000000 0000000000000003
> > Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f
> > 0000000000000000 ffff81018ba52040
> > Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a
> > Aug 26 11:33:11 soap01 kernel: Call Trace:
> > <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745}
> > <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127}
> > <ffffffff80192b48>{vfs_mkdir+346}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff88522f05>{:nfsd:nfsd_create+753}
> > <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217} Aug 26 11:33:11
> > soap01 kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216}
> > <ffffffff884d549a>{:sunrpc:svc_process+982}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463}
> > <ffffffff8010bed2>{child_rip+8}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0}
> > <ffffffff8851e46e>{:nfsd:nfsd+0}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0}
> > Aug 26 11:33:11 soap01 kernel:
> > Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04
> > 41 ff 4c 24 08 49 39 5d 28 75
> > Aug 26 11:33:11 soap01 kernel: RIP
> > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP
> > <ffff81021e9f1c18>
> >
> > Operating system: SuSE SLES 10SP1
> > Kernel: 2.6.16.60-0.21-smp
> > OCFS2: 1.4.0-SLES
> >
> > Environment:
> >
> > * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients
> > * 2 SLES 10SP1 exporting the filesystem
> >
> > The FreeBSD clients are our email servers, so the main traffic is
> > many small email files.
> >
> > NFS mounted with protocol version 3, readdirplus disabled, read and
> > write buffer of 32k.
> >
> > Pre-crash symptoms:
> > * The ocfs filesystem hung for a while or gets very slow;
> > * Low or null device traffic on both nodes (checked with `iostat`);
> > * The server load get 5 to 6 points higher;
> > * It seems that something in kernel deadlock, as other processes
> > (doing IO, but in other mount points with raiserfs) hug a CPU with
> > 100% usage;
> >   Eg: There is a mysql database in raiserfs mount point and the
> > mysqld hug the CPU when I call `rcmysql stop`;
> > * Calling `reboot` or `shutdown -r now` blocks the console (didn't
> > tried to run it with strace to get the locking point, but will try
> > it happen again);
> > * imapd on clients blocked in nfs requests;
> >   One of the processes was blocked in (FreeBSD kernel) state bo_wwa.
> >   Looking in some discussion group's over the net, this state means
> >   blocked by stale NFS server. Attaching to the process with `gdb`,
> > its always blocked in close() libc call;
> >
> > imapd process backtrace:
> > #0  0x282a5da3 in close () from /lib/libc.so.7
> > #1  0x282a5711 in memcpy () from /lib/libc.so.7
> > #2  0xbfbf9378 in ?? ()
> > #3  0x2828d58d in fclose () from /lib/libc.so.7
> >
> > Could it be related to o2cb configuration? Current configuration:
> >
> > O2CB_HEARTBEAT_THRESHOLD=61
> > O2CB_IDLE_TIMEOUT_MS=60000
> >
> > The heartbeat network is a GBit ethernet.
> >
> > Regards,
> >   

-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Gerente de Rede
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br