[Ocfs2-users] Weird crash

Sunil Mushran sunil.mushran at oracle.com
Tue Sep 1 11:14:37 PDT 2009


For issues on SLES, please file a bug/SR with Novell.

The issue here is insufficient journal credits. It _could_ be that this
version is missing mainline git commit
e051fda4fd14fe878e6d2183b3a4640febe9e9a8. But I don't know. Novell
Support will be better placed to track down the issue.
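A quick way to check whether a given kernel source tree already contains that fix is to test whether the commit is an ancestor of the checked-out revision. This is a sketch assuming you have a git clone of the mainline kernel; for a vendor kernel such as the SLES one you would check the kernel source RPM's patch list instead. `check_fix` is a helper name introduced here for illustration.

```shell
# check_fix: report whether a fix commit is an ancestor of HEAD in a tree.
# $1 = path to the git clone, $2 = commit id to look for, e.g.
# e051fda4fd14fe878e6d2183b3a4640febe9e9a8 from the message above.
check_fix() {
    local repo="$1" commit="$2"
    # merge-base --is-ancestor exits 0 iff $commit is reachable from HEAD
    if git -C "$repo" merge-base --is-ancestor "$commit" HEAD 2>/dev/null; then
        echo "fix present"
    else
        echo "fix missing"
    fi
}
```

Usage: `check_fix /path/to/linux e051fda4fd14fe878e6d2183b3a4640febe9e9a8`.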

Sérgio Surkamp wrote:
> Hi list,
>
> One of our OCFS2 servers crashed with this message:
>
> Aug 26 11:33:11 soap01 kernel: Assertion failure in
> journal_dirty_metadata() at fs/jbd/transaction.c:1114:
> "handle->h_buffer_credits > 0"
> Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] ---------
> [please bite here ] ---------
> Aug 26 11:33:11 soap01 kernel: Kernel BUG at fs/jbd/transaction.c:1114
> Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP
> Aug 26 11:33:11 soap01 kernel: last sysfs
> file: /devices/pci0000:00/0000:00:00.0/irq
> Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel:
> Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
> ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6
> button battery ac netconsole xt_comment xt_tcpudp xt_state
> iptable_filter iptable_mangle iptable_nat ip_nat ip_conntrack
> nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid usb_storage
> hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom pci_hotplug ehci_hcd
> scsi_transport_spi usbcore bnx2 reiserfs ata_piix ahci libata
> dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd dm_mod
> fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk
> ide_core Aug 26 11:33:11 soap01 kernel: Pid: 4874, comm: nfsd Tainted:
> G     U 2.6.16.60-0.21-smp #1
> Aug 26 11:33:11 soap01 kernel: RIP: 0010:[<ffffffff885e21e0>]
> <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200}
> Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18  EFLAGS:
> 00010292
> Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX:
> ffff8101decf30c0 RCX: 0000000000000292
> Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI:
> 0000000000000296 RDI: ffffffff80359960
> Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08:
> ffffffff80359968 R09: ffff810221d3ad80
> Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11:
> 0000000000000070 R12: ffff8101dda21588
> Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14:
> ffff8102277ab400 R15: ffff8100a4dd394c
> Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000)
> GS:ffffffff803d3000(0000) knlGS:0000000000000000
> Aug 26 11:33:11 soap01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> 000000008005003b
> Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3:
> 000000015e180000 CR4: 00000000000006e0
> Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo
> ffff81021e9f0000, task ffff81021f92e860)
> Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 ffff8101dda21588
> 0000000000000000 0000000000000003
> Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f
> 0000000000000000 ffff81018ba52040
> Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a
> Aug 26 11:33:11 soap01 kernel: Call Trace:
> <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106}
> Aug 26 11:33:11 soap01 kernel:
> <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745}
> <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710}
> Aug 26 11:33:11 soap01 kernel:
> <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127}
> <ffffffff80192b48>{vfs_mkdir+346}
> Aug 26 11:33:11 soap01 kernel: <ffffffff88522f05>{:nfsd:nfsd_create+753}
> <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217}
> Aug 26 11:33:11 soap01 kernel:
> <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216}
> <ffffffff884d549a>{:sunrpc:svc_process+982}
> Aug 26 11:33:11 soap01 kernel:        <ffffffff802ea247>{__down_read+21}
> <ffffffff8851e46e>{:nfsd:nfsd+0}
> Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463}
> <ffffffff8010bed2>{child_rip+8}
> Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0}
> <ffffffff8851e46e>{:nfsd:nfsd+0}
> Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0}
> Aug 26 11:33:11 soap01 kernel:
> Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04 41
> ff 4c 24 08 49 39 5d 28 75
> Aug 26 11:33:11 soap01 kernel: RIP
> <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP
> <ffff81021e9f1c18>
>
> Operating system: SuSE SLES 10SP1
> Kernel: 2.6.16.60-0.21-smp
> OCFS2: 1.4.0-SLES
>
> Environment:
>
> * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients
> * 2 SLES 10SP1 exporting the filesystem
>
> The FreeBSD clients are our email servers, so the main traffic is many
> small email files.
>
> NFS mounted with protocol version 3, readdirplus disabled, read and
> write buffer of 32k.
>
> Pre-crash symptoms:
> * The OCFS2 filesystem hung for a while or got very slow;
> * Low or no device traffic on both nodes (checked with `iostat`);
> * The server load rose 5 to 6 points;
> * Something in the kernel appeared to deadlock, as other processes
>   (doing IO, but on other mount points with reiserfs) hogged a CPU at
>   100% usage;
>   E.g.: there is a MySQL database on a reiserfs mount point, and mysqld
>   hogged the CPU when I ran `rcmysql stop`;
> * Calling `reboot` or `shutdown -r now` blocked the console (I didn't
>   try running it under strace to find the locking point, but will if it
>   happens again);
> * imapd on the clients blocked in NFS requests;
>   One of the processes was blocked in the (FreeBSD kernel) state
>   bo_wwa. According to some discussion groups on the net, this state
>   means blocked on a stale NFS server. Attaching to the process with
>   `gdb`, it was always blocked in the close() libc call;
>
> imapd process backtrace:
> #0  0x282a5da3 in close () from /lib/libc.so.7
> #1  0x282a5711 in memcpy () from /lib/libc.so.7
> #2  0xbfbf9378 in ?? ()
> #3  0x2828d58d in fclose () from /lib/libc.so.7
>
> Could it be related to the o2cb configuration? The current
> configuration is:
>
> O2CB_HEARTBEAT_THRESHOLD=61
> O2CB_IDLE_TIMEOUT_MS=60000
>
> The heartbeat network is a GBit ethernet.
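For reference, those settings can be translated into effective timeouts. This is a sketch assuming the o2cb behaviour documented in the OCFS2 FAQ: the disk heartbeat fires every 2 seconds, and a node is declared dead after (threshold - 1) missed intervals. The function names are illustrative, not part of any o2cb tooling.

```python
# Sketch of the effective timeouts implied by the o2cb settings quoted
# above. Assumption: o2hb writes its disk heartbeat every 2 seconds and a
# node is fenced after (O2CB_HEARTBEAT_THRESHOLD - 1) missed intervals.

HEARTBEAT_INTERVAL_SECS = 2  # assumed fixed o2hb heartbeat interval

def disk_heartbeat_timeout_secs(threshold: int) -> int:
    """Seconds of missed disk heartbeats before a node is deemed dead."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS

def network_idle_timeout_secs(idle_timeout_ms: int) -> float:
    """Seconds of network silence before o2net drops the connection."""
    return idle_timeout_ms / 1000.0

print(disk_heartbeat_timeout_secs(61))   # O2CB_HEARTBEAT_THRESHOLD=61 -> 120
print(network_idle_timeout_secs(60000))  # O2CB_IDLE_TIMEOUT_MS=60000  -> 60.0
```

So with the settings above, a node survives roughly two minutes of failed disk heartbeats and one minute of network silence before being evicted.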
>
> Regards,
>   
