[Ocfs2-users] BUG: unable to handle kernel NULL pointer dereference

Fri Oct 27 00:28:58 PDT 2006

Thanks to synchronous writes on the log files, I finally managed to
capture a log of the regular panics we experience.

The setup is as follows: three blades (IBM HS20) access shared storage
on a Fibre Channel attached storage server (IBM DS4300). The storage
serves as the central mail store for about 35000 users, so the I/O load
on it is fairly heavy.

blade01 crashes every few days with a kernel panic. Unfortunately, all
watchdogs we tried fail to reboot the machine, and setting
/proc/sys/kernel/panic and /proc/sys/kernel/panic_on_oops to non-zero
values doesn't help either. The machine still responds to pings, but
to nothing else. Even more unfortunately, the file system on the other
blades starts to hang some time after blade01 crashes.
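
For anyone wanting to confirm that the two sysctls really are set on a
node, here is a minimal Python sketch that reads them back. The function
name and the `base` parameter are made up for illustration; only the two
/proc paths come from the setup described above. A non-zero `panic` value
is the number of seconds the kernel waits before rebooting after a panic,
and a non-zero `panic_on_oops` turns an oops into a panic.

```python
import os

def read_panic_settings(base="/proc/sys/kernel"):
    """Return the current values of kernel.panic and kernel.panic_on_oops.

    `base` is a hypothetical parameter so the function can be pointed at
    a test directory instead of the real /proc tree.
    """
    settings = {}
    for name in ("panic", "panic_on_oops"):
        with open(os.path.join(base, name)) as f:
            settings[name] = int(f.read().strip())
    return settings

if __name__ == "__main__":
    for key, value in read_panic_settings().items():
        state = "enabled" if value != 0 else "disabled"
        print(f"{key} = {value} ({state})")
```

This only verifies the configuration; it does not explain why the machine
fails to reboot after the oops.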

Logging /proc/slabinfo showed a steady increase in the number of size-256
and size-32 objects, and we thought the crashes might have something to do
with it. We then added a nightly umount/mount, which reduced the values a
bit and seems to reduce the frequency of crashes slightly.
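
The kind of timestamped slabinfo logging described above can be sketched
roughly as follows. This is not the script we actually run; the function
names and the log file name `slabinfo.log` are placeholders. It picks the
size-256 and size-32 rows out of /proc/slabinfo, prefixes them with a
timestamp, and fsyncs the log so the last sample survives a crash.

```python
import os
import time

def slab_lines(slabinfo_text, names, stamp):
    """Return the slabinfo rows whose cache name is in `names`,
    each prefixed with the timestamp string `stamp`."""
    rows = []
    for line in slabinfo_text.splitlines():
        fields = line.split()
        if fields and fields[0] in names:
            rows.append(f"{stamp} {line}")
    return rows

def log_once(names=("size-256", "size-32"),
             src="/proc/slabinfo", dst="slabinfo.log"):
    """Append one timestamped sample to the log and force it to disk."""
    stamp = time.strftime("%Y-%m-%d-%H:%M:%S")
    with open(src) as f:
        rows = slab_lines(f.read(), names, stamp)
    if not rows:
        return
    with open(dst, "a") as log:
        log.write("\n".join(rows) + "\n")
        log.flush()
        os.fsync(log.fileno())  # synchronous write, so a crash loses nothing

if __name__ == "__main__":
    log_once()
```

Run from cron (or in a loop with a sleep), this produces lines in the same
shape as the samples quoted in this mail.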

Nevertheless today we had a crash with rather low values of size-256 and
size-32:

From /proc/slabinfo, timestamped, a few seconds before the crash:

2006-10-27-06:20:01 size-256           92187 169605    256   15    1 : tunables  120   60    8 : slabdata  11307  11307      0
2006-10-27-06:20:01 size-32            94037 534942     32  113    1 : tunables  120   60    8 : slabdata   4734   4734      0

The kern.log shows:

Oct 27 06:20:11 blade01 kernel: BUG: unable to handle kernel NULL pointer dereference at virtual address 00000004
Oct 27 06:20:11 blade01 kernel:  printing eip:
Oct 27 06:20:11 blade01 kernel: f92b9431
Oct 27 06:20:11 blade01 kernel: *pde = 00000000
Oct 27 06:20:11 blade01 kernel: Oops: 0002 [#1]
Oct 27 06:20:11 blade01 kernel: SMP 
Oct 27 06:20:11 blade01 kernel: Modules linked in: i6300esb ocfs2 xt_state ip_conntrack xt_limit ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager md_mod dm_snapshot dm_mirror dm_mod mptctl qla2xxx i2c_i801 firmware_class i2c_core scsi_transport_fc rtc
Oct 27 06:20:11 blade01 kernel: CPU:    1
Oct 27 06:20:11 blade01 kernel: EIP:    0060:[<f92b9431>]    Not tainted VLI
Oct 27 06:20:11 blade01 kernel: EFLAGS: 00010286   (2.6.18 #1) 
Oct 27 06:20:11 blade01 kernel: EIP is at dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel: eax: 00000000   ebx: d61e4c00   ecx: c4ce5988   edx: 00000000
Oct 27 06:20:11 blade01 kernel: esi: f7531de4   edi: c4ce5980   ebp: e1873080   esp: f7531d6c
Oct 27 06:20:11 blade01 kernel: ds: 007b   es: 007b   ss: 0068
Oct 27 06:20:11 blade01 kernel: Process o2net (pid: 1698, ti=f7530000 task=c215b560 task.ti=f7530000)
Oct 27 06:20:11 blade01 kernel: Stack: 00000000 c0327a2c f7531d88 e6805a80 f7531e6c 00000048 00000040 d61e4c00 
Oct 27 06:20:11 blade01 kernel:        d899a020 00000000 00000001 00000000 01020000 00000000 d899a021 0000004d 
Oct 27 06:20:11 blade01 kernel:        c4ce5980 00000000 d61e4c00 fffffff4 f92bb927 f7531de4 d899a020 0000001f 
Oct 27 06:20:11 blade01 kernel: Call Trace:
Oct 27 06:20:11 blade01 kernel:  [<c0327a2c>] sock_recvmsg+0xe9/0x10b
Oct 27 06:20:11 blade01 kernel:  [<f92bb927>] dlm_migrate_request_handler+0x17b/0x231 [ocfs2_dlm]
Oct 27 06:20:11 blade01 kernel:  [<f9256762>] o2net_process_message+0x46e/0x626 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [<c0120312>] __do_softirq+0x73/0xdf
Oct 27 06:20:11 blade01 kernel:  [<f9256057>] o2net_recv_tcp_msg+0x6b/0x7e [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [<c0114142>] find_busiest_group+0x129/0x4f9
Oct 27 06:20:11 blade01 kernel:  [<f925819e>] o2net_rx_until_empty+0x1e6/0x6b9 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [<c011619f>] __wake_up+0x32/0x43
Oct 27 06:20:11 blade01 kernel:  [<c012af5b>] run_workqueue+0x73/0xe1
Oct 27 06:20:11 blade01 kernel:  [<f9257fb8>] o2net_rx_until_empty+0x0/0x6b9 [ocfs2_nodemanager]
Oct 27 06:20:11 blade01 kernel:  [<c012b710>] worker_thread+0x143/0x15f
Oct 27 06:20:11 blade01 kernel:  [<c011563d>] default_wake_function+0x0/0x15
Oct 27 06:20:11 blade01 kernel:  [<c012b5cd>] worker_thread+0x0/0x15f
Oct 27 06:20:11 blade01 kernel:  [<c012e151>] kthread+0xfc/0x100
Oct 27 06:20:11 blade01 kernel:  [<c012e055>] kthread+0x0/0x100
Oct 27 06:20:11 blade01 kernel:  [<c0100d95>] kernel_thread_helper+0x5/0xb
Oct 27 06:20:11 blade01 kernel: Code: 98 0a 00 00 c7 44 24 0c 62 81 2c f9 89 54 24 08 89 44 24 04 c7 04 24 80 06 2d f9 e8 85 29 e6 c6 e9 57 fe ff ff 8b 57 08 8b 41 04 <89> 42 04 89 10 89 4f 08 89 49 04 eb 9c f7 05 a0 2b 26 f9 00 09 
Oct 27 06:20:11 blade01 kernel: EIP: [<f92b9431>] dlm_add_migration_mle+0x1f6/0x30a [ocfs2_dlm] SS:ESP 0068:f7531d6c

This is with a vanilla 2.6.18 kernel from kernel.org. There were no
suspicious messages in the logs before the crash.
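
As a reading aid for the trace above, here is a small Python sketch that
decodes the "Oops: 0002" error code using the i386 page-fault bit layout
(bit 0: protection fault vs. page not present, bit 1: write vs. read,
bit 2: user vs. kernel mode). The function name is made up for
illustration. The faulting bytes marked `<89> 42 04` in the Code line
appear to be a store to 4(%edx); treat that decoding as an assumption,
but with edx = 00000000 it would match the reported fault address
00000004.

```python
def decode_i386_oops(code):
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "protection_fault": bool(code & 0x1),  # False: page not present
        "write": bool(code & 0x2),             # False: read access
        "user_mode": bool(code & 0x4),         # False: kernel mode
    }

if __name__ == "__main__":
    info = decode_i386_oops(0x0002)  # value from the "Oops: 0002" line
    print(info)

    # Assumed decoding of the faulting instruction: store to 4(%edx).
    edx = 0x00000000          # edx from the register dump
    fault_address = edx + 4
    print(f"computed fault address: {fault_address:08x}")
```

Read this way, the oops is a kernel-mode write to a not-present page at
offset 4 from a NULL pointer inside dlm_add_migration_mle.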