[Ocfs2-users] ocfs2 kernel BUG

Christian van Barneveld c.van.barneveld at zx.nl
Tue Oct 7 06:48:24 PDT 2008


Hi Sunil,

It has happened several times over the last few weeks. We moved 3 TB of new data onto one of the OCFS2 filesystems, and the crashes occurred after the move, always at the same time of day. Since I disabled the updatedb cron job (on both nodes) for the OCFS2 mounts, it hasn't happened anymore. Updatedb (from Debian) has trouble with the amount of data to index, and the lock manager generates a lot of traffic while it runs.
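For anyone hitting the same problem: one way to stop updatedb from walking cluster filesystems, instead of disabling its cron job outright, is to prune them in /etc/updatedb.conf. This is only a sketch; the exact variable names and defaults depend on which locate implementation your Debian release ships (findutils locate vs. mlocate), and the file below is a hypothetical example, not my actual config:

```shell
# Hypothetical /etc/updatedb.conf excerpt: adding "ocfs2" to PRUNEFS
# tells updatedb to skip OCFS2 mounts entirely, so the nightly index
# run no longer generates cluster-wide DLM traffic.
cat > /tmp/updatedb.conf.example <<'EOF'
PRUNEFS="nfs nfs4 afs proc smbfs autofs iso9660 tmpfs usbfs ocfs2"
PRUNEPATHS="/tmp /var/spool /media"
EOF

# Sanity-check that the filesystem type is actually listed.
grep -q 'ocfs2' /tmp/updatedb.conf.example && echo "ocfs2 pruned"
```

Alternatively, adding the mount points themselves to PRUNEPATHS achieves the same effect per-directory rather than per-filesystem-type.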

Hopefully you can reproduce it with the above info.

I'm looking forward to the fix.

Thanks.

Christian van Barneveld


> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mushran at oracle.com]
> Sent: Saturday, October 4, 2008 2:45
> To: Christian van Barneveld
> Cc: 'ocfs2-users at oss.oracle.com'
> Subject: Re: [Ocfs2-users] ocfs2 kernel BUG
>
> This is the same issue as:
> http://oss.oracle.com/bugzilla/show_bug.cgi?id=1012
>
> Is this happening frequently? We have failed to reproduce it in
> our test cluster.
>
> If you can reproduce it, I could give you a potential fix for testing.
>
> Let me know.
>
> Sunil
>
> Christian van Barneveld wrote:
> > Hi,
> >
> > Over the last few weeks we have had a kernel stack trace several
> times; after that, the OCFS2 filesystems no longer respond (no output
> from ls) on any of the nodes.
> >
> > Kern.log at node-2
> > ------------------------------------------------------------------------
> ----
> >  Oct  3 06:57:18 XXX kernel: (7178,0):dlm_drop_lockres_ref:2291 ERROR:
> while dropping ref on
> 6EDBC1B22BBB4E28AD9453CD5B2F60C3:M000000000000000007f06600000000
> (master=0) got -22.
> > Oct  3 06:57:18 XXX kernel: (7178,0):dlm_print_one_lock_resource:50
> lockres: M000000000000000007f06600000000, owner=0, state=64
> > Oct  3 06:57:18 XXX kernel: (7178,0):__dlm_print_one_lock_resource:82
> lockres: M000000000000000007f06600000000, owner=0, state=64
> > Oct  3 06:57:18 XXX kernel: (7178,0):__dlm_print_one_lock_resource:84
> last used: 49827182, on purge list: yes
> > Oct  3 06:57:18 XXX kernel: (7178,0):dlm_print_lockres_refmap:61
> refmap nodes: [ ], inflight=0
> > Oct  3 06:57:18 XXX kernel: (7178,0):__dlm_print_one_lock_resource:86
> granted queue:
> > Oct  3 06:57:18 XXX kernel: (7178,0):__dlm_print_one_lock_resource:101
> converting queue:
> > Oct  3 06:57:18 XXX kernel: (7178,0):__dlm_print_one_lock_resource:116
> blocked queue:
> > Oct  3 06:57:20 XXX kernel: ------------[ cut here ]------------
> > Oct  3 06:57:20 XXX kernel: kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2293!
> > Oct  3 06:57:20 XXX kernel: invalid opcode: 0000 [#1] SMP
> > Oct  3 06:57:20 XXX kernel: Modules linked in: ocfs2 xt_multiport
> nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter dm_round_robin
> dm_rdac ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs dm_multipath
> dm_mod qla2xxx
> > Oct  3 06:57:20 XXX kernel:
> > Oct  3 06:57:20 XXX kernel: Pid: 7178, comm: dlm_thread Not tainted
> (2.6.25.5-qla2xxx-mpath-fw-cluster-hm64 #1)
> > Oct  3 06:57:20 XXX kernel: EIP: 0060:[<f8eebd11>] EFLAGS: 00010286 CPU:
> 0
> > Oct  3 06:57:20 XXX kernel: EIP is at dlm_drop_lockres_ref+0x1c1/0x280
> [ocfs2_dlm]
> > Oct  3 06:57:20 XXX kernel: EAX: e79268a8 EBX: f7118600 ECX: c06a6ca4
> EDX: 00000092
> > Oct  3 06:57:20 XXX kernel: ESI: ffffffea EDI: f5b21eff EBP: 0000001f
> ESP: f5b21ea4
> > Oct  3 06:57:20 XXX kernel:  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS:
> 0068
> > Oct  3 06:57:20 XXX kernel: Process dlm_thread (pid: 7178, ti=f5b20000
> task=f72ec430 task.ti=f5b20000)
> > Oct  3 06:57:20 XXX kernel: Stack: f8efebec 00001c0a 00000000 f8ef9cd2
> 000008f3 f599b940 0000001f ede9c460
> > Oct  3 06:57:20 XXX kernel:        00000000 ffffffea e7926880 f7118600
> ede9c460 00000000 1f010000 3030304d
> > Oct  3 06:57:20 XXX kernel:        30303030 30303030 30303030 66373030
> 30363630 30303030 00303030 00000000
> > Oct  3 06:57:20 XXX kernel: Call Trace:
> > Oct  3 06:57:20 XXX kernel:  [<f8edf347>] dlm_thread+0x327/0x1420
> [ocfs2_dlm]
> > Oct  3 06:57:20 XXX kernel:  [<c011beb9>] hrtick_set+0x69/0x140
> > Oct  3 06:57:20 XXX kernel:  [<c0133180>]
> autoremove_wake_function+0x0/0x50
> > Oct  3 06:57:20 XXX kernel:  [<f8edf020>] dlm_thread+0x0/0x1420
> [ocfs2_dlm]
> > Oct  3 06:57:20 XXX kernel:  [<c0132e92>] kthread+0x42/0x70
> > Oct  3 06:57:20 XXX kernel:  [<c0132e50>] kthread+0x0/0x70
> > Oct  3 06:57:20 XXX kernel:  [<c0103a17>] kernel_thread_helper+0x7/0x10
> > Oct  3 06:57:20 XXX kernel:  =======================
> > Oct  3 06:57:20 XXX kernel: Code: d2 9c ef f8 89 54 24 08 89 44 24 14 8b
> 81 d8 00 00 00 c7 04 24 ec eb ef f8 89 44 24 04 e8 98 55 23 c7 8b 44 24 28
> e8 3f 2c ff ff <0f> 0b eb fe 3d 00 fe ff ff 0f 95 c2 83 f8 fc 0f 95 c0 84
> d0 0f
> > Oct  3 06:57:20 XXX kernel: EIP: [<f8eebd11>]
> dlm_drop_lockres_ref+0x1c1/0x280 [ocfs2_dlm] SS:ESP 0068:f5b21ea4
> > Oct  3 06:57:20 XXX kernel: ---[ end trace 52ed3dea72cac956 ]---
> >
> > ------------------------------------------------------------------------
> ----
> >
> > kern.log at node-1:
> >
> > Oct  3 06:57:18 XXX kernel: (5799,1):dlm_deref_lockres_handler:2336
> ERROR: 6EDBC1B22BBB4E28AD9453CD5B2F60C3:M000000000000000007f06600000000:
> bad lockres name
> >
> > # uname -r:
> > 2.6.25.5
> >
> > # debugfs.ocfs2 -V
> > debugfs.ocfs2 1.4.1
> >
> > # dmesg
> > OCFS2 Node Manager 1.5.0
> > OCFS2 DLM 1.5.0
> > OCFS2 DLMFS 1.5.0
> >
> > We have 2 nodes in the cluster and the freeze was observed on both
> nodes.
> > Only a reboot solves the problem.
> >
> > Any help appreciated.
> >
> > Christian van Barneveld
> >
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users at oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
> >



