[Ocfs2-users] Experiencing occasional system crashes with RHEL5 and ocfs2 1.2.9

Tue Mar 10 07:44:11 PDT 2009

Sunil,

It has been happening every week or two. One of our production environments
hit the issue again today; its logs are below. The kdump handler captured
only an incomplete vmcore for that crash: the file took longer to write than
we had anticipated, and the fence agent cut power before it finished.

(10826,0):dlm_deref_lockres_handler:2363 ERROR: 15DE52931133472797A07E1747BC9364:M0000000000000010f0e3e6277279c9: node 1 trying to drop ref but it is already dropped!
(10826,0):dlm_print_one_lock_resource:461 lockres: M0000000000000010f0e3e6277279c9, owner=2, state=0
(10826,0):__dlm_print_one_lock_resource:476 lockres: M0000000000000010f0e3e6277279c9, owner=2, state=0
(10826,0):__dlm_print_one_lock_resource:478   last used: 0, on purge list: no
(10826,0):dlm_print_lockres_refmap:444   refmap nodes: [ ], inflight=0
(10826,0):__dlm_print_one_lock_resource:480   granted queue:
(10826,0):__dlm_print_one_lock_resource:492     type=3, conv=-1, node=2, cookie=2:4061845, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
(10826,0):__dlm_print_one_lock_resource:495   converting queue:
(10826,0):__dlm_print_one_lock_resource:510   blocked queue:
(11672,7):dlm_drop_lockres_ref:2298 ERROR: while dropping ref on 15DE52931133472797A07E1747BC9364:M0000000000000011a3a0d2277333a8 (master=1) got -22.
(11672,7):dlm_print_one_lock_resource:461 lockres: M0000000000000011a3a0d2277333a8, owner=1, state=64
(11672,7):__dlm_print_one_lock_resource:476 lockres: M0000000000000011a3a0d2277333a8, owner=1, state=64
(11672,7):__dlm_print_one_lock_resource:478   last used: 4683571329, on purge list: yes
(11672,7):dlm_print_lockres_refmap:444   refmap nodes: [ ], inflight=0
(11672,7):__dlm_print_one_lock_resource:480   granted queue:
(11672,7):__dlm_print_one_lock_resource:495   converting queue:
(11672,7):__dlm_print_one_lock_resource:510   blocked queue:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at .../redhat/BUILD/ocfs2-1.2.9/fs/ocfs2/dlm/dlmmaster.c:2300
invalid opcode: 0000 [1]
SMP

last sysfs file:
/devices/pci0000:00/0000:00:04.0/0000:17:00.0/0000:18:02.0/0000:22:00.2/0000:24:05.0/irq
CPU 7

Modules linked in:
iptable_filter ip_tables x_tables nfsd exportfs auth_rpcgss netconsole
autofs4 hidp ocfs2(U) nls_utf8 nls_iso8859_1 cifs nfs lockd fscache nfs_acl
rfcomm l2cap bluetooth ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U)
lock_dlm gfs2(U) dlm configfs sunrpc bonding ipv6 xfrm_nalgo crypto_api
dm_round_robin dm_emc dm_multipath video sbs backlight i2c_ec i2c_core
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg
ata_piix libata ide_cd shpchp bnx2 cdrom pcspkr serio_raw dm_snapshot
dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 11672, comm: dlm_thread Tainted: G      2.6.18-92.1.22.el5 #1
RIP: 0010:[<ffffffff885b76a0>]
[<ffffffff885b76a0>] :ocfs2_dlm:dlm_drop_lockres_ref+0x1dc/0x1f5
RSP: 0018:ffff81065a3c1de0  EFLAGS: 00010246
RAX: ffff8102ba2f2a38 RBX: 0000000000000000 RCX: ffffffff802ee9a8
RDX: ffffffff802ee9a8 RSI: 0000000000000000 RDI: ffffffff802ee9a0
RBP: 000000000000001f R08: ffffffff802ee9a8 R09: 0000000000000046
R10: 0000000000000000 R11: 0000000000000280 R12: ffff8102ba2f2a00
R13: ffff81102c7cec00 R14: ffff8103ebbaf520 R15: ffffffff8009dc54
FS:  0000000000000000(0000) GS:ffff81102fea7340(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b05068eb0a0 CR3: 0000000642cb0000 CR4: 00000000000006e0
Process dlm_thread (pid: 11672, threadinfo ffff81065a3c0000, task
ffff81065db52040)
Stack:
 1f02000000000000
 303030303030304d
 3130303030303030
 3232643061336131

 0038613333333737
 0000000000000000
last message repeated 2 times

 0000000000000000
 ffffffea00100100
 ffff8102ba2f2a48
 ffff8102ba2f2a00

Call Trace:
 [<ffffffff885ca733>] :ocfs2_dlm:dlm_purge_lockres+0x175/0x34f
 [<ffffffff885caba0>] :ocfs2_dlm:dlm_thread+0xd7/0x579
 [<ffffffff8009de6c>] autoremove_wake_function+0x0/0x2e
 [<ffffffff885caac9>] :ocfs2_dlm:dlm_thread+0x0/0x579
 [<ffffffff80032569>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009dc54>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8003246b>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Code:
0f 0b 68 53 e0 5c 88
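
The words in the Stack: section above can be cross-checked against the
log messages. The following is a minimal sketch, assuming the words are
little-endian 64-bit values as on x86_64. Reading the first word as a
(node index, name length) pair is a guess, not something the log states;
the variable names are illustrative only.

```python
import errno
import struct

# 64-bit words copied from the "Stack:" section of the oops, in order.
STACK_WORDS = [
    "1f02000000000000",
    "303030303030304d",
    "3130303030303030",
    "3232643061336131",
    "0038613333333737",
]


def words_to_bytes(words):
    """Return the in-memory bytes of little-endian 64-bit words."""
    return b"".join(struct.pack("<Q", int(w, 16)) for w in words)


raw = words_to_bytes(STACK_WORDS)

# First word: its two highest bytes are 0x02 and 0x1f. Treating them as
# (node index, name length) is an assumption made for illustration.
node_idx, name_len = raw[6], raw[7]

# Remaining words: NUL-terminated ASCII text.
lock_name = raw[8:].split(b"\x00", 1)[0].decode("ascii")

# Upper 32 bits of the stack word ffffffea00100100, read as a signed int.
status = struct.unpack("<i", struct.pack("<I", 0xFFFFFFEA))[0]

print(node_idx, name_len)
print(lock_name, len(lock_name))
print(status, errno.errorcode[-status])
```

On Linux this prints `2 31`, then `M0000000000000011a3a0d2277333a8 31`,
then `-22 EINVAL`. The decoded name is the lock resource named in the
dlm_drop_lockres_ref error, 0x1f equals the length of that name, and -22
matches the "got -22" in the same message.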

Thanks,
Ward

On Mon, Mar 9, 2009 at 9:29 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:

> Known issue. We have a potential fix for it. It is in testing.
>
> How often do you hit this?
>
> On Mon, Mar 09, 2009 at 06:52:57PM -0400, Ward Fenton wrote:
> >    We have been experiencing unplanned outages on a subset of the
> >    clustered systems we have deployed to support SAP. The following
> >    captured information came from one of three ocfs2 clusters which
> >    handle SAP SEM/BW functionality. Each of those has experienced
> >    multiple kernel panics which get reported as dlm_drop_lockres_ref
> >    and dlm_deref_lockres_handler errors.
>