[Ocfs2-users] Weird crash

Tue Sep 1 13:17:29 PDT 2009

Thanks, I'll try to contact them.

Regards,
Sérgio

On Tue, 01 Sep 2009 11:14:37 -0700
Sunil Mushran <sunil.mushran at oracle.com> wrote:

> For issues on SLES, please file a bug/SR with Novell.
> 
> The issue here is insufficient journal credits. It _could_ be that
> this version is missing mainline git commit
> e051fda4fd14fe878e6d2183b3a4640febe9e9a8, but I don't know. Novell
> Support will be better placed to track down the issue.
> 
> Sérgio Surkamp wrote:
> > Hi list,
> >
> > One of our OCFS2 servers crashed with this message:
> >
> > Aug 26 11:33:11 soap01 kernel: Assertion failure in
> > journal_dirty_metadata() at fs/jbd/transaction.c:1114:
> > "handle->h_buffer_credits > 0"
> > Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] ---------
> > [please bite here ] ---------
> > Aug 26 11:33:11 soap01 kernel: Kernel BUG at
> > fs/jbd/transaction.c:1114
> > Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP
> > Aug 26 11:33:11 soap01 kernel: last sysfs
> > file: /devices/pci0000:00/0000:00:00.0/irq
> > Aug 26 11:33:11 soap01 kernel: CPU 0
> > Aug 26 11:33:11 soap01 kernel:
> > Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
> > ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6
> > button battery ac netconsole xt_comment xt_tcpudp xt_state
> > iptable_filter iptable_mangle iptable_nat ip_nat ip_conntrack
> > nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid
> > usb_storage hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom
> > pci_hotplug ehci_hcd scsi_transport_spi usbcore bnx2 reiserfs
> > ata_piix ahci libata dm_snapshot qla2xxx firmware_class
> > qla2xxx_conf intermodule edd dm_mod fan thermal processor sg
> > megaraid_sas piix sd_mod scsi_mod ide_disk ide_core
> > Aug 26 11:33:11 soap01 kernel: Pid: 4874, comm: nfsd Tainted: G     U
> > 2.6.16.60-0.21-smp #1
> > Aug 26 11:33:11 soap01 kernel: RIP:
> > 0010:[<ffffffff885e21e0>]
> > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200}
> > Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18  EFLAGS: 00010292
> > Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX:
> > ffff8101decf30c0 RCX: 0000000000000292
> > Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI:
> > 0000000000000296 RDI: ffffffff80359960
> > Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08:
> > ffffffff80359968 R09: ffff810221d3ad80
> > Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11:
> > 0000000000000070 R12: ffff8101dda21588
> > Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14:
> > ffff8102277ab400 R15: ffff8100a4dd394c
> > Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000)
> > GS:ffffffff803d3000(0000) knlGS:0000000000000000
> > Aug 26 11:33:11 soap01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
> > 000000008005003b
> > Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3:
> > 000000015e180000 CR4: 00000000000006e0
> > Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo
> > ffff81021e9f0000, task ffff81021f92e860)
> > Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870
> > ffff8101dda21588 0000000000000000 0000000000000003
> > Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f
> > 0000000000000000 ffff81018ba52040
> > Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a
> > Aug 26 11:33:11 soap01 kernel: Call Trace:
> > <ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745}
> > <ffffffff88628766>{:ocfs2:ocfs2_mknod+1710}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127}
> > <ffffffff80192b48>{vfs_mkdir+346}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff88522f05>{:nfsd:nfsd_create+753}
> > <ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216}
> > <ffffffff884d549a>{:sunrpc:svc_process+982}
> > Aug 26 11:33:11 soap01 kernel:
> > <ffffffff802ea247>{__down_read+21} <ffffffff8851e46e>{:nfsd:nfsd+0}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463}
> > <ffffffff8010bed2>{child_rip+8}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0}
> > <ffffffff8851e46e>{:nfsd:nfsd+0}
> > Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0}
> > Aug 26 11:33:11 soap01 kernel:
> > Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04
> > 41 ff 4c 24 08 49 39 5d 28 75
> > Aug 26 11:33:11 soap01 kernel: RIP
> > <ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP
> > <ffff81021e9f1c18>
> >
> > Operating system: SuSE SLES 10SP1
> > Kernel: 2.6.16.60-0.21-smp
> > OCFS2: 1.4.0-SLES
> >
> > Environment:
> >
> > * 2 FreeBSD 7.1-RELEASE-p2 NFS Clients
> > * 2 SLES 10SP1 exporting the filesystem
> >
> > The FreeBSD clients are our email servers, so the main traffic is
> > many small email files.
> >
> > NFS mounted with protocol version 3, readdirplus disabled, read and
> > write buffer of 32k.
> >
> > Pre-crash symptoms:
> > * The OCFS2 filesystem hangs for a while or gets very slow;
> > * Low or no device traffic on both nodes (checked with `iostat`);
> > * The server load gets 5 to 6 points higher;
> > * It seems that something in the kernel deadlocks, as other processes
> >   (doing I/O, but on other mount points with reiserfs) hog a CPU at
> >   100% usage;
> >   E.g.: there is a MySQL database on a reiserfs mount point, and
> >   mysqld hogs the CPU when I call `rcmysql stop`;
> > * Calling `reboot` or `shutdown -r now` blocks the console (I didn't
> >   try running it under strace to find where it blocks, but I will if
> >   it happens again);
> > * imapd on the clients blocks in NFS requests;
> >   One of the processes was blocked in the (FreeBSD kernel) state bo_wwa.
> >   According to some discussion groups on the net, this state means
> >   blocked by a stale NFS server. Attaching to the process with `gdb`,
> >   it is always blocked in the close() libc call;
> >
> > imapd process backtrace:
> > #0  0x282a5da3 in close () from /lib/libc.so.7
> > #1  0x282a5711 in memcpy () from /lib/libc.so.7
> > #2  0xbfbf9378 in ?? ()
> > #3  0x2828d58d in fclose () from /lib/libc.so.7
> >
> > Could it be related to o2cb configuration? Current configuration:
> >
> > O2CB_HEARTBEAT_THRESHOLD=61
> > O2CB_IDLE_TIMEOUT_MS=60000
> >
> > The heartbeat network is a GBit ethernet.
> >
> > Regards,
> >   
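
For reference, a rough reading of the two o2cb values quoted above. This
is only a sketch: it assumes the conversion given in the OCFS2
documentation (disk heartbeat every 2 seconds, node declared dead after
threshold - 1 missed iterations), and the function names are made up for
illustration, not part of the o2cb tools.

```python
# Sketch only: helper names are hypothetical, not part of the o2cb tools.
# Assumption: the disk heartbeat runs every 2 seconds and a node is
# declared dead after (O2CB_HEARTBEAT_THRESHOLD - 1) missed iterations.

HEARTBEAT_INTERVAL_S = 2  # assumed interval between disk heartbeats


def disk_heartbeat_timeout_s(threshold: int) -> int:
    """Seconds without a disk heartbeat before a node is considered dead."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S


def network_idle_timeout_s(idle_timeout_ms: int) -> float:
    """Convert O2CB_IDLE_TIMEOUT_MS from milliseconds to seconds."""
    return idle_timeout_ms / 1000


if __name__ == "__main__":
    # Values taken from the configuration quoted in this thread.
    print("disk heartbeat timeout (s):", disk_heartbeat_timeout_s(61))
    print("network idle timeout (s):", network_idle_timeout_s(60000))
```

Under that assumption, both settings are time limits in the range of one
to two minutes; whether they play any role in the crash is not
established in this thread.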

-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Network Manager
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br