[Ocfs2-users] kernel BUG at dlmglue.c:742

Jakob Rößler roessler at medienpark.net
Tue Sep 4 06:32:39 PDT 2012


Hello list,

My company administers an Apache2 cluster of three servers. The
DocumentRoot directory is on an iSCSI device formatted with OCFS2.
From time to time the following kernel BUG occurs on one server (_not_
always the same one). It drives the load above 1k on _each_ node and
eventually makes the whole website unreachable.

Sep  1 13:01:59 www01 kernel: [438503.058163] ------------[ cut here ]------------
Sep  1 13:01:59 www01 kernel: [438503.058211] kernel BUG at /build/buildd/linux-2.6.32/fs/ocfs2/dlmglue.c:742!
Sep  1 13:01:59 www01 kernel: [438503.058267] invalid opcode: 0000 [#1] SMP
Sep  1 13:01:59 www01 kernel: [438503.058320] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Sep  1 13:01:59 www01 kernel: [438503.058405] CPU 7
Sep  1 13:01:59 www01 kernel: [438503.058446] Modules linked in: ocfs2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs crc32c mptctl ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi power_meter serio_raw ixgbe mdio ioatdma lp parport usbhid hid mptsas mptscsih mptbase scsi_transport_sas igb dca
Sep  1 13:01:59 www01 kernel: [438503.058831] Pid: 3296, comm: ocfs2dc Not tainted 2.6.32-42-server #95-Ubuntu PRIMERGY BX920 S2
Sep  1 13:01:59 www01 kernel: [438503.058918] RIP: 0010:[<ffffffffa023f4b4>]  [<ffffffffa023f4b4>] ocfs2_lock_res_free+0x1d4/0x4c0 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.059036] RSP: 0018:ffff880627f2fd20  EFLAGS: 00010286
Sep  1 13:01:59 www01 kernel: [438503.059086] RAX: 0000000000000062 RBX: ffff8805eea9e618 RCX: 0000000000000000
Sep  1 13:01:59 www01 kernel: [438503.059168] RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000246
Sep  1 13:01:59 www01 kernel: [438503.059251] RBP: ffff880627f2fd50 R08: 00000000ffffffff R09: ffffffff815b0480
Sep  1 13:01:59 www01 kernel: [438503.059333] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000100080000
Sep  1 13:01:59 www01 kernel: [438503.059415] R13: ffff88062d188000 R14: ffff88022cc5c000 R15: ffff88022cc5c720
Sep  1 13:01:59 www01 kernel: [438503.059498] FS:  0000000000000000(0000) GS:ffff88024e460000(0000) knlGS:0000000000000000
Sep  1 13:01:59 www01 kernel: [438503.059584] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Sep  1 13:01:59 www01 kernel: [438503.059635] CR2: 00007ff8a7d2a008 CR3: 000000062d98d000 CR4: 00000000000006e0
Sep  1 13:01:59 www01 kernel: [438503.059718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep  1 13:01:59 www01 kernel: [438503.059800] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Sep  1 13:01:59 www01 kernel: [438503.059883] Process ocfs2dc (pid: 3296, threadinfo ffff880627f2e000, task ffff88062d188000)
Sep  1 13:01:59 www01 kernel: [438503.059968] Stack:
Sep  1 13:01:59 www01 kernel: [438503.060006]  ffff88022cc5c000 ffff88022cc5c720 ffff880627f2fd50 ffff8805eea9e600
Sep  1 13:01:59 www01 kernel: [438503.060070] <0> ffff8805eea9e618 ffff88022cc5c000 ffff880627f2fd80 ffffffffa0231197
Sep  1 13:01:59 www01 kernel: [438503.060167] <0> ffff8805eea9e618 ffff8805eea9e628 0000000000000282 ffff88022cc5c000
Sep  1 13:01:59 www01 kernel: [438503.060294] Call Trace:
Sep  1 13:01:59 www01 kernel: [438503.060352]  [<ffffffffa0231197>] ocfs2_dentry_lock_put+0x87/0x110 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.060424]  [<ffffffffa0240217>] ocfs2_dentry_post_unlock+0x17/0x20 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.060497]  [<ffffffffa02453e5>] ocfs2_process_blocked_lock+0x115/0x310 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.060599]  [<ffffffffa02456aa>] ocfs2_downconvert_thread_do_work+0xca/0x190 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.060702]  [<ffffffffa02457ee>] ocfs2_downconvert_thread+0x7e/0x1c0 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.060793]  [<ffffffff81086470>] ? autoremove_wake_function+0x0/0x40
Sep  1 13:01:59 www01 kernel: [438503.060865]  [<ffffffffa0245770>] ? ocfs2_downconvert_thread+0x0/0x1c0 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.060950]  [<ffffffff810860f6>] kthread+0x96/0xa0
Sep  1 13:01:59 www01 kernel: [438503.061003]  [<ffffffff810141aa>] child_rip+0xa/0x20
Sep  1 13:01:59 www01 kernel: [438503.061054]  [<ffffffff81086060>] ? kthread+0x0/0xa0
Sep  1 13:01:59 www01 kernel: [438503.061103]  [<ffffffff810141a0>] ? child_rip+0x0/0x20
Sep  1 13:01:59 www01 kernel: [438503.061152] Code: e8 b2 a3 df e0 66 90 85 c0 74 24 49 bc 00 00 08 00 01 00 00 00 4c 85 25 fb 22 f6 ff 74 0d 4c 85 25 fa 22 f6 ff 0f 84 d8 01 00 00 <0f> 0b eb fe 83 7b 6c 00 74 24 49 bc 00 00 08 00 01 00 00 00 4c
Sep  1 13:01:59 www01 kernel: [438503.061547] RIP  [<ffffffffa023f4b4>] ocfs2_lock_res_free+0x1d4/0x4c0 [ocfs2]
Sep  1 13:01:59 www01 kernel: [438503.061648]  RSP <ffff880627f2fd20>
Sep  1 13:01:59 www01 kernel: [438503.062063] ---[ end trace b5022849011f56ab ]---
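Since the BUG hits a different node each time, it may help to confirm that every occurrence shows the same BUG location and call trace. Below is a minimal Python sketch, assuming syslog entries with one entry per line as in the log above; the name summarize_oops is hypothetical and not part of any OCFS2 tool. It extracts the "kernel BUG at" location and the Call Trace function names of the first oops it finds, skipping frames marked "?", which the kernel prints for stack addresses it cannot confirm belong to the actual call chain.

```python
import re

# Matches the "kernel BUG at <file>:<line>!" entry of an oops.
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
# Matches one Call Trace frame: address, optional "?" marker, then name+offset/size.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([A-Za-z0-9_.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_oops(lines):
    """Return (bug_location, frames) for the first oops found in syslog lines.

    bug_location is "file:line" or None; frames lists the Call Trace function
    names in order, skipping frames the kernel marked with "?" (unreliable).
    """
    bug_location = None
    frames = []
    in_trace = False
    for line in lines:
        match = BUG_RE.search(line)
        if match and bug_location is None:
            bug_location = f"{match.group(1)}:{match.group(2)}"
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame is None:
                # The first non-frame line (e.g. "Code:") ends the trace.
                break
            if frame.group(1) is None:
                frames.append(frame.group(2))
    return bug_location, frames
```

Fed the entries above (for example, by passing an open syslog file object), it should return "/build/buildd/linux-2.6.32/fs/ocfs2/dlmglue.c:742" and the seven unmarked frames from ocfs2_dentry_lock_put down to child_rip. Identical output from each node's logs would suggest that all crashes take the same downconvert-thread path.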


Since upgrading the kernel last week we have hit this error twice
(before that, the bug occurred only twice a year).
Here is some detailed information about the systems:

root@www01:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.4 LTS"

root@www01:~# uname -a
Linux www01 2.6.32-42-server #95-Ubuntu SMP Wed Jul 25 16:10:49 UTC 2012
x86_64 GNU/Linux

root@www01:~# cat /etc/ocfs2/cluster.conf
node:
        name = www01
        cluster = ocfs2
        number = 0
        ip_address = 192.168.1.1
        ip_port = 7777

node:
        name = www02
        cluster = ocfs2
        number = 1
        ip_address = 192.168.1.2
        ip_port = 7777

node:
        name = www03
        cluster = ocfs2
        number = 2
        ip_address = 192.168.1.3
        ip_port = 7777

cluster:
        name = ocfs2
        node_count = 3

Does anybody know this bug, and how to fix it?

Thanks in advance,

Jakob
