[Ocfs2-users] Weird crash

Sérgio Surkamp sergio at gruposinternet.com.br
Wed Aug 26 11:10:56 PDT 2009


Hi list,

One of our OCFS2 servers crashed with this message:

Aug 26 11:33:11 soap01 kernel: Assertion failure in
journal_dirty_metadata() at fs/jbd/transaction.c:1114:
"handle->h_buffer_credits > 0"
Aug 26 11:33:11 soap01 kernel: ----------- [cut here ] ---------
[please bite here ] ---------
Aug 26 11:33:11 soap01 kernel: Kernel BUG at fs/jbd/transaction.c:1114
Aug 26 11:33:11 soap01 kernel: invalid opcode: 0000 [1] SMP
Aug 26 11:33:11 soap01 kernel: last sysfs
file: /devices/pci0000:00/0000:00:00.0/irq
Aug 26 11:33:11 soap01 kernel: CPU 0 Aug 26 11:33:11 soap01 kernel:
Modules linked in: af_packet joydev ocfs2 jbd ocfs2_dlmfs ocfs2_dlm
ocfs2_nodemanager configfs nfsd exportfs lockd nfs_acl sunrpc ipv6
button battery ac netconsole xt_comment xt_tcpudp xt_state
iptable_filter iptable_mangle iptable_nat ip_nat ip_ conntrack
nfnetlink ip_tables x_tables apparmor loop st sr_mod usbhid usb_storage
hw_random shpchp ide_cd aic7xxx uhci_hcd cdrom pci_hotplug ehci_hcd
scsi_transport_spi usbcore bnx2 reiserfs ata_piix ahci libata
dm_snapshot qla2xxx firmware_class qla2xxx_conf intermodule edd d m_mod
fan thermal processor sg megaraid_sas piix sd_mod scsi_mod ide_disk
ide_core Aug 26 11:33:11 soap01 kernel: Pid: 4874, comm: nfsd Tainted:
G     U 2.6.16.60-0.21-smp #1
Aug 26 11:33:11 soap01 kernel: RIP: 0010:[<ffffffff885e21e0>]
<ffffffff885e21e0>{:jbd:journal_dirty_metadata+200}
Aug 26 11:33:11 soap01 kernel: RSP: 0018:ffff81021e9f1c18  EFLAGS:
00010292
Aug 26 11:33:11 soap01 kernel: RAX: 000000000000006e RBX:
ffff8101decf30c0 RCX: 0000000000000292
Aug 26 11:33:11 soap01 kernel: RDX: ffffffff80359968 RSI:
0000000000000296 RDI: ffffffff80359960
Aug 26 11:33:11 soap01 kernel: RBP: ffff81002f753870 R08:
ffffffff80359968 R09: ffff810221d3ad80
Aug 26 11:33:11 soap01 kernel: R10: ffff810001035680 R11:
0000000000000070 R12: ffff8101dda21588
Aug 26 11:33:11 soap01 kernel: R13: ffff810207e2fa90 R14:
ffff8102277ab400 R15: ffff8100a4dd394c
Aug 26 11:33:11 soap01 kernel: FS: 00002b7055e986d0(0000)
GS:ffffffff803d3000(0000) knlGS:0000000000000000
Aug 26 11:33:11 soap01 kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
000000008005003b
Aug 26 11:33:11 soap01 kernel: CR2: 00002aaaaabdb000 CR3:
000000015e180000 CR4: 00000000000006e0
Aug 26 11:33:11 soap01 kernel: Process nfsd (pid: 4874, threadinfo
ffff81021e9f0000, task ffff81021f92e860)
Aug 26 11:33:11 soap01 kernel: Stack: ffff81002f753870 ffff8101dda21588
0000000000000000 0000000000000003
Aug 26 11:33:11 soap01 kernel: ffff81018ba52000 ffffffff8862187f
0000000000000000 ffff81018ba52040
Aug 26 11:33:11 soap01 kernel: ffff81007f5163f8 ffffffff8860b38a
Aug 26 11:33:11 soap01 kernel: Call Trace:
<ffffffff8862187f>{:ocfs2:ocfs2_journal_dirty+106}
Aug 26 11:33:11 soap01 kernel:
<ffffffff8860b38a>{:ocfs2:__ocfs2_add_entry+745}
<ffffffff88628766>{:ocfs2:ocfs2_mknod+1710}
Aug 26 11:33:11 soap01 kernel:
<ffffffff88628a45>{:ocfs2:ocfs2_mkdir+127}
<ffffffff80192b48>{vfs_mkdir+346}
Aug 26 11:33:11 soap01 kernel: <ffffffff88522f05>{:nfsd:nfsd_create+753}
<ffffffff88529bb2>{:nfsd:nfsd3_proc_mkdir+217}
Aug 26 11:33:11 soap01 kernel:
<ffffffff8851e0ea>{:nfsd:nfsd_dispatch+216}
<ffffffff884d549a>{:sunrpc:svc_process+982}
Aug 26 11:33:11 soap01 kernel:        <ffffffff802ea247>{__down_read+21}
<ffffffff8851e46e>{:nfsd:nfsd+0}
Aug 26 11:33:11 soap01 kernel: <ffffffff8851e63d>{:nfsd:nfsd+463}
<ffffffff8010bed2>{child_rip+8}
Aug 26 11:33:11 soap01 kernel: <ffffffff8851e46e>{:nfsd:nfsd+0}
<ffffffff8851e46e>{:nfsd:nfsd+0}
Aug 26 11:33:11 soap01 kernel: <ffffffff8010beca>{child_rip+0}
Aug 26 11:33:11 soap01 kernel:
Aug 26 11:33:11 soap01 kernel: Code: 0f 0b 68 b9 8a 5e 88 c2 5a 04 41
ff 4c 24 08 49 39 5d 28 75
Aug 26 11:33:11 soap01 kernel: RIP
<ffffffff885e21e0>{:jbd:journal_dirty_metadata+200} RSP
<ffff81021e9f1c18>

Operating system: SuSE SLES 10SP1
Kernel: 2.6.16.60-0.21-smp
OCFS2: 1.4.0-SLES

Environment:

* 2 FreeBSD 7.1-RELEASE-p2 NFS Clients
* 2 SLES 10SP1 exporting the filesystem

The FreeBSD clients are our email servers, so the main traffic is many
small email files.

NFS mounted with protocol version 3, readdirplus disabled, read and
write buffer of 32k.

Pre-crash symptoms:
* The ocfs filesystem hung for a while or gets very slow;
* Low or null device traffic on both nodes (checked with `iostat`);
* The server load get 5 to 6 points higher;
* It seems that something in kernel deadlock, as other processes (doing
  IO, but in other mount points with raiserfs) hug a CPU with 100%
  usage;
  Eg: There is a mysql database in raiserfs mount point and the mysqld
  hug the CPU when I call `rcmysql stop`;
* Calling `reboot` or `shutdown -r now` blocks the console (didn't tried
  to run it with strace to get the locking point, but will try it happen
  again);
* imapd on clients blocked in nfs requests;
  One of the processes was blocked in (FreeBSD kernel) state bo_wwa.
  Looking in some discussion group's over the net, this state means
  blocked by stale NFS server. Attaching to the process with `gdb`, its
  always blocked in close() libc call;

imapd process backtrace:
#0  0x282a5da3 in close () from /lib/libc.so.7
#1  0x282a5711 in memcpy () from /lib/libc.so.7
#2  0xbfbf9378 in ?? ()
#3  0x2828d58d in fclose () from /lib/libc.so.7

Could it be related to o2cb configuration? Current configuration:

O2CB_HEARTBEAT_THRESHOLD=61
O2CB_IDLE_TIMEOUT_MS=60000

The heartbeat network is a GBit ethernet.

Regards,
-- 
  .:''''':.
.:'        `     Sérgio Surkamp | Gerente de Rede
::    ........   sergio at gruposinternet.com.br
`:.        .:'
  `:,   ,.:'     *Grupos Internet S.A.*
    `: :'        R. Lauro Linhares, 2123 Torre B - Sala 201
     : :         Trindade - Florianópolis - SC
     :.'
     ::          +55 48 3234-4109
     :
     '           http://www.gruposinternet.com.br



More information about the Ocfs2-users mailing list