[Ocfs2-users] Experiencing occasional system crashes with RHEL5 and ocfs2 1.2.9
Ward Fenton
ward.fenton at gmail.com
Tue Mar 10 07:44:11 PDT 2009
Sunil,
It has been happening every week or two. One of our production environments
hit the issue again today; its logs are below. The kdump handler captured
only an incomplete vmcore for that crash: the file took longer to write than
we had anticipated, and the fence agent cut power before the write could
finish.
(10826,0):dlm_deref_lockres_handler:2363 ERROR: 15DE52931133472797A07E1747BC9364:M0000000000000010f0e3e6277279c9: node 1 trying to drop ref but it is already dropped!
(10826,0):dlm_print_one_lock_resource:461 lockres: M0000000000000010f0e3e6277279c9, owner=2, state=0
(10826,0):__dlm_print_one_lock_resource:476 lockres: M0000000000000010f0e3e6277279c9, owner=2, state=0
(10826,0):__dlm_print_one_lock_resource:478 last used: 0, on purge list: no
(10826,0):dlm_print_lockres_refmap:444 refmap nodes: [ ], inflight=0
(10826,0):__dlm_print_one_lock_resource:480 granted queue:
(10826,0):__dlm_print_one_lock_resource:492 type=3, conv=-1, node=2, cookie=2:4061845, ast=(empty=y,pend=n), bast=(empty=y,pend=n)
(10826,0):__dlm_print_one_lock_resource:495 converting queue:
(10826,0):__dlm_print_one_lock_resource:510 blocked queue:
(11672,7):dlm_drop_lockres_ref:2298 ERROR: while dropping ref on 15DE52931133472797A07E1747BC9364:M0000000000000011a3a0d2277333a8 (master=1) got -22.
(11672,7):dlm_print_one_lock_resource:461 lockres: M0000000000000011a3a0d2277333a8, owner=1, state=64
(11672,7):__dlm_print_one_lock_resource:476 lockres: M0000000000000011a3a0d2277333a8, owner=1, state=64
(11672,7):__dlm_print_one_lock_resource:478 last used: 4683571329, on purge list: yes
(11672,7):dlm_print_lockres_refmap:444 refmap nodes: [ ], inflight=0
(11672,7):__dlm_print_one_lock_resource:480 granted queue:
(11672,7):__dlm_print_one_lock_resource:495 converting queue:
(11672,7):__dlm_print_one_lock_resource:510 blocked queue:
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at .../redhat/BUILD/ocfs2-1.2.9/fs/ocfs2/dlm/dlmmaster.c:2300
invalid opcode: 0000 [1]
SMP
last sysfs file:
/devices/pci0000:00/0000:00:04.0/0000:17:00.0/0000:18:02.0/0000:22:00.2/0000:24:05.0/irq
CPU 7
Modules linked in:
iptable_filter ip_tables x_tables nfsd exportfs auth_rpcgss netconsole
autofs4 hidp ocfs2(U) nls_utf8 nls_iso8859_1 cifs nfs lockd fscache nfs_acl
rfcomm l2cap bluetooth ocfs2_dlmfs(U) ocfs2_dlm(U) ocfs2_nodemanager(U)
lock_dlm gfs2(U) dlm configfs sunrpc bonding ipv6 xfrm_nalgo crypto_api
dm_round_robin dm_emc dm_multipath video sbs backlight i2c_ec i2c_core
button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sg
ata_piix libata ide_cd shpchp bnx2 cdrom pcspkr serio_raw dm_snapshot
dm_zero dm_mirror dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod
ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 11672, comm: dlm_thread Tainted: G 2.6.18-92.1.22.el5 #1
RIP: 0010:[<ffffffff885b76a0>] [<ffffffff885b76a0>] :ocfs2_dlm:dlm_drop_lockres_ref+0x1dc/0x1f5
RSP: 0018:ffff81065a3c1de0 EFLAGS: 00010246
RAX: ffff8102ba2f2a38 RBX: 0000000000000000 RCX: ffffffff802ee9a8
RDX: ffffffff802ee9a8 RSI: 0000000000000000 RDI: ffffffff802ee9a0
RBP: 000000000000001f R08: ffffffff802ee9a8 R09: 0000000000000046
R10: 0000000000000000 R11: 0000000000000280 R12: ffff8102ba2f2a00
R13: ffff81102c7cec00 R14: ffff8103ebbaf520 R15: ffffffff8009dc54
FS: 0000000000000000(0000) GS:ffff81102fea7340(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002b05068eb0a0 CR3: 0000000642cb0000 CR4: 00000000000006e0
Process dlm_thread (pid: 11672, threadinfo ffff81065a3c0000, task ffff81065db52040)
Stack:
1f02000000000000
303030303030304d
3130303030303030
3232643061336131
0038613333333737
0000000000000000
last message repeated 2 times
0000000000000000
ffffffea00100100
ffff8102ba2f2a48
ffff8102ba2f2a00
Call Trace:
[<ffffffff885ca733>] :ocfs2_dlm:dlm_purge_lockres+0x175/0x34f
[<ffffffff885caba0>] :ocfs2_dlm:dlm_thread+0xd7/0x579
[<ffffffff8009de6c>] autoremove_wake_function+0x0/0x2e
[<ffffffff885caac9>] :ocfs2_dlm:dlm_thread+0x0/0x579
[<ffffffff80032569>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff8009dc54>] keventd_create_kthread+0x0/0xc4
[<ffffffff8003246b>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11
Code:
0f 0b 68 53 e0 5c 88
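
As a side note on reading the dump: the second through fifth stack words look like ASCII text rather than pointers. The following is only a minimal sketch, assuming the words are stored little-endian (as on x86_64) and that the bytes form a NUL-terminated string. The function name decode_stack_words is made up for illustration and is not part of any ocfs2 tooling.

```python
import struct

# Stack words copied verbatim from the oops above (second through fifth entries).
STACK_WORDS = [
    0x303030303030304D,
    0x3130303030303030,
    0x3232643061336131,
    0x0038613333333737,
]


def decode_stack_words(words):
    """Pack each 64-bit word little-endian and read the bytes as an ASCII C string.

    Assumption: the words are in x86_64 (little-endian) byte order and the
    text ends at the first NUL byte.
    """
    raw = b"".join(struct.pack("<Q", word) for word in words)
    return raw.split(b"\x00", 1)[0].decode("ascii")


if __name__ == "__main__":
    print(decode_stack_words(STACK_WORDS))
```

Under those assumptions the decoded string is the lock resource name from the dlm_drop_lockres_ref error (M0000000000000011a3a0d2277333a8), which suggests the deref message for that lockres was sitting on the stack when the BUG fired.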
Thanks,
Ward
On Mon, Mar 9, 2009 at 9:29 PM, Sunil Mushran <sunil.mushran at oracle.com> wrote:
> Known issue. We have a potential fix for it. It is in testing.
>
> How often do you hit this?
>
> On Mon, Mar 09, 2009 at 06:52:57PM -0400, Ward Fenton wrote:
> > We have been experiencing unplanned outages on a subset of the
> > clustered systems we have deployed to support SAP. The following
> > captured information came from one of three ocfs2 clusters which
> > handle SAP SEM/BW functionality. Each of those has experienced
> > multiple kernel panics which get reported as dlm_drop_lockres_ref
> > and dlm_deref_lockres_handler errors.
>