[Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference

Fri Oct 27 09:32:53 PDT 2006

Please file a bugzilla with the details provided. It is easier to manage
bugs that way.

Thanks

Christian Schlittchen wrote:
> Thanks to synchronous writes on the log-files I finally managed to get
> a log of the regular panics we experience.
>
> The setup is as follows: Three blades (IBM HS20) accessing shared storage
> on a fibre channel connected storage server (IBM DS4300). The storage is
> used as a central mail storage for about 35000 users, so it is pretty heavy
> duty storage-wise.
>
> blade01 crashes every few days with a kernel panic. Unfortunately all
> watchdogs we tried fail to reboot the machine, and setting
> /proc/sys/kernel/panic and /proc/sys/kernel/panic_on_oops to non-zero
> values doesn't help either. The machine still responds to pings, but
> to nothing else. Even more unfortunately, the file system on the other
> blades starts to hang some time after blade01 crashes.
>
> Logging /proc/slabinfo showed a steady increase in the number of size-256
> and size-32 objects, and we thought the crashes might have something to do
> with it. We then did a nightly umount/mount, which reduced the values a
> bit and which does seem to reduce the frequency of crashes slightly.
>
> Nevertheless today we had a crash with rather low values of size-256 and
> size-32:
>
> From /proc/slabinfo, timestamped, a few seconds before the crash:
>
> 2006-10-27-06:20:01 size-256           92187 169605    256   15    1 : tunables  120   60    8 : slabdata  11307  11307      0
> 2006-10-27-06:20:01 size-32            94037 534942     32  113    1 : tunables  120   60    8 : slabdata   4734   4734      0
>
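For anyone following the slab growth described above, here is a minimal sketch of how such timestamped lines could be parsed and compared over time. It assumes the timestamp prefix used in the quoted log and the slabinfo 2.x column order (name, active_objs, num_objs, objsize, objperslab, pagesperslab). The names SlabSample and parse_slab_line are hypothetical, not part of any OCFS2 or kernel tooling, and everything after the first seven fields (tunables, slabdata) is ignored.

```python
from typing import NamedTuple


class SlabSample(NamedTuple):
    """One timestamped /proc/slabinfo line (hypothetical helper type)."""
    timestamp: str
    name: str
    active_objs: int
    num_objs: int
    objsize: int
    objperslab: int
    pagesperslab: int

    @property
    def allocated_bytes(self) -> int:
        # Space taken by all allocated objects, whether in use or not.
        return self.num_objs * self.objsize

    @property
    def utilisation(self) -> float:
        # Fraction of allocated objects that are actually in use.
        return self.active_objs / self.num_objs if self.num_objs else 0.0


def parse_slab_line(line: str) -> SlabSample:
    """Parse the timestamp, cache name and the five leading counters."""
    fields = line.split()
    timestamp, name = fields[0], fields[1]
    counts = [int(f) for f in fields[2:7]]
    return SlabSample(timestamp, name, *counts)


sample = parse_slab_line(
    "2006-10-27-06:20:01 size-256 92187 169605 256 15 1 : tunables 120 60 8"
)
print(sample.name, sample.allocated_bytes, round(sample.utilisation, 2))
```

For the size-256 line above this comes to roughly 41 MiB of allocated objects, of which a little over half are active. Comparing samples taken before and after the nightly umount/mount would show whether the growth is really tied to the mounted volume.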
> The kern.log shows:
>
> Oct 27 06:20:11 blade01 kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
> Oct 27 06:20:11 blade01 kernel:  printing eip:
> Oct 27 06:20:11 blade01 kernel: f92b9431
> Oct 27 06:20:11 blade01 kernel: *pde = 00000000
> Oct 27 06:20:11 blade01 kernel: Oops: 0002 [#1]
> Oct 27 06:20:11 blade01 kernel: SMP 
> Oct 27 06:20:11 blade01 kernel: Modules linked in: i6300esb ocfs2 xt_state ip_conntrack xt_limit ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager md_mod dm_snapshot dm_mirror dm_mod mptctl qla2xxx i2c_i801 firmware_class i2c_core scsi_transport_fc rtc
> Oct 27 06:20:11 blade01 kernel: CPU:    1
> Oct 27 06:20:11 blade01 kernel: EIP:    0060:[<f92b9431>]    Not tainted VLI
> Oct 27 06:20:11 blade01 kernel: EFLAGS: 00010286   (2.6.18 #1) 
> Oct 27 06:20:11 blade01 kernel: EIP is at dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm]
> Oct 27 06:20:11 blade01 kernel: eax: 00000000   ebx: d61e4c00   ecx: c4ce5988   edx: 00000000
> Oct 27 06:20:11 blade01 kernel: esi: f7531de4   edi: c4ce5980   ebp: e1873080   esp: f7531d6c
> Oct 27 06:20:11 blade01 kernel: ds: 007b   es: 007b   ss: 0068
> Oct 27 06:20:11 blade01 kernel: Process o2net (pid: 1698, ti=f7530000 task=c215b560 task.ti=f7530000)
> Oct 27 06:20:11 blade01 kernel: Stack: 00000000 c0327a2c f7531d88 e6805a80 f7531e6c 00000048 00000040 d61e4c00 
> Oct 27 06:20:11 blade01 kernel:        d899a020 00000000 00000001 00000000 01020000 00000000 d899a021 0000004d 
> Oct 27 06:20:11 blade01 kernel:        c4ce5980 00000000 d61e4c00 fffffff4 f92bb927 f7531de4 d899a020 0000001f 
> Oct 27 06:20:11 blade01 kernel: Call Trace:
> Oct 27 06:20:11 blade01 kernel:  [<c0327a2c>] sock_recvmsg+0xe9/0x10b
> Oct 27 06:20:11 blade01 kernel:  [<f92bb927>] dlm_migrate_request_handler+0x17b/0x231 [ocfs2_dlm]
> Oct 27 06:20:11 blade01 kernel:  [<f9256762>] o2net_process_message+0x46e/0x626 [ocfs2_nodemanager]
> Oct 27 06:20:11 blade01 kernel:  [<c0120312>] __do_softirq+0x73/0xdf
> Oct 27 06:20:11 blade01 kernel:  [<f9256057>] o2net_recv_tcp_msg+0x6b/0x7e [ocfs2_nodemanager]
> Oct 27 06:20:11 blade01 kernel:  [<c0114142>] find_busiest_group+0x129/0x4f9
> Oct 27 06:20:11 blade01 kernel:  [<f925819e>] o2net_rx_until_empty+0x1e6/0x6b9 [ocfs2_nodemanager]
> Oct 27 06:20:11 blade01 kernel:  [<c011619f>] __wake_up+0x32/0x43
> Oct 27 06:20:11 blade01 kernel:  [<c012af5b>] run_workqueue+0x73/0xe1
> Oct 27 06:20:11 blade01 kernel:  [<f9257fb8>] o2net_rx_until_empty+0x0/0x6b9 [ocfs2_nodemanager]
> Oct 27 06:20:11 blade01 kernel:  [<c012b710>] worker_thread+0x143/0x15f
> Oct 27 06:20:11 blade01 kernel:  [<c011563d>] default_wake_function+0x0/0x15
> Oct 27 06:20:11 blade01 kernel:  [<c012b5cd>] worker_thread+0x0/0x15f
> Oct 27 06:20:11 blade01 kernel:  [<c012e151>] kthread+0xfc/0x100
> Oct 27 06:20:11 blade01 kernel:  [<c012e055>] kthread+0x0/0x100
> Oct 27 06:20:11 blade01 kernel:  [<c0100d95>] kernel_thread_helper+0x5/0xb
> Oct 27 06:20:11 blade01 kernel: Code: 98 0a 00 00 c7 44 24 0c 62 81 2c f9 89 54 24 08 89 44 24 04 c7 04 24 80 06 2d f9 e8 85 29 e6 c6 e9 57 fe ff ff 8b 57 08 8b 41 04 <89> 42 04 89 10 89 4f 08 89 49 04 eb 9c f7 05 a0 2b 26 f9 00 09 
> Oct 27 06:20:11 blade01 kernel: EIP: [<f92b9431>] dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm] SS:ESP 0068:f7531d6c
>
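When attaching this to a bug report it may help to spell out what the "Oops: 0002" line says. The sketch below decodes the low three bits of the x86 page-fault error code (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode). The function name decode_oops_error_code is hypothetical and only these three bits are handled.

```python
def decode_oops_error_code(code: int) -> dict:
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }


# The value after "Oops:" in the log is printed in hexadecimal.
info = decode_oops_error_code(int("0002", 16))
print(info)
```

For 0002 this gives a kernel-mode write to a page that is not present, which is consistent with the reported fault address 00000004 and edx being 00000000 in the register dump: a store at offset 4 through a NULL pointer.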
> This is with a vanilla 2.6.18 kernel from kernel.org. There were no
> suspicious messages in the logs before the crash.