[Ocfs2-users] Sles10 Sp2 kernel crash

Sunil Mushran sunil.mushran at oracle.com
Wed Sep 29 09:56:46 PDT 2010


This has been fixed for some time now; see the two commits below.

============================================
commit 14741472a05245ed5778aa0aec055e1f920b6ef8
Author: Srinivas Eeda <srinivas.eeda at oracle.com>
Date:   Mon Mar 22 16:50:47 2010 -0700

     ocfs2: Fix a race in o2dlm lockres mastery

     In o2dlm, the master of a lock resource keeps a map of all interested
     nodes.  This prevents the master from purging the resource before an
     interested node can create a lock.

     A race between the mastery thread and the mastery handler allowed an
     interested node to discover who the master is without informing the
     master directly.  This is easily fixed by holding the dlm spinlock a
     little longer in the mastery handler.

     Signed-off-by: Srinivas Eeda <srinivas.eeda at oracle.com>
     Signed-off-by: Joel Becker <joel.becker at oracle.com>
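
As a rough illustration of the race described in the commit above, here is a
minimal user-space sketch. It is not the o2dlm code: the struct, the function
names and the pthread mutex (standing in for the dlm spinlock) are all
hypothetical. The only point it tries to show is that answering "who is the
master?" and recording the asking node in the refmap happen inside one
critical section, so a node cannot learn the owner without the master knowing
about its interest.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical model of a lock resource on the master; not an o2dlm struct. */
struct lockres_model {
    pthread_mutex_t lock;   /* stands in for the dlm spinlock */
    int owner;              /* node number of the master */
    uint32_t refmap;        /* bit n set => node n is interested (n in 0..31) */
};

void lockres_model_init(struct lockres_model *res, int owner)
{
    pthread_mutex_init(&res->lock, NULL);
    res->owner = owner;
    res->refmap = 0;
}

/*
 * Answer a mastery query from `node`.  The owner lookup and the refmap
 * update share one critical section, so no purge decision can be made
 * between the two steps.
 */
int master_query_model(struct lockres_model *res, int node)
{
    int owner;

    pthread_mutex_lock(&res->lock);
    owner = res->owner;
    res->refmap |= (uint32_t)1 << node;   /* record the interested node */
    pthread_mutex_unlock(&res->lock);
    return owner;
}

/* The master may purge only when no node is recorded as interested. */
int can_purge_model(struct lockres_model *res)
{
    int ok;

    pthread_mutex_lock(&res->lock);
    ok = (res->refmap == 0);
    pthread_mutex_unlock(&res->lock);
    return ok;
}
```

In this model, once master_query_model() has returned for node 0,
can_purge_model() reports that the resource must be kept.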


commit a524812b7eaa7783d7811198921100f079034e61
Author: Wengang Wang <wen.gang.wang at oracle.com>
Date:   Fri Jul 30 16:14:44 2010 +0800

     ocfs2/dlm: avoid incorrect bit set in refmap on recovery master

     In the following situation, an incorrect bit remains in the refmap on
     the recovery master. The recovery master will eventually fail to purge
     the lockres because of that incorrect bit.

     1) node A no longer has any interest in lockres A, so it is purging it.
     2) the owner of lockres A is node B, so node A is sending a de-ref
        message to node B.
     3) at this time, node B crashes. node C becomes the recovery master
        and recovers lockres A (because its master is the dead node B).
     4) node A migrates lockres A to node C with a refbit there.
     5) node A fails to send the de-ref message to node B because B has
        crashed. The failure is ignored, and no further action is taken
        for lockres A.

     Normally, re-sending the deref message to the recovery master would
     fix this. However, ignoring the failure of the deref to the original
     master and not recovering the lockres to the recovery master has the
     same effect, and the latter is simpler.

     Signed-off-by: Wengang Wang <wen.gang.wang at oracle.com>
     Acked-by: Srinivas Eeda <srinivas.eeda at oracle.com>
     Cc: stable at kernel.org
     Signed-off-by: Joel Becker <joel.becker at oracle.com>
============================================
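
To make the second commit's scenario concrete, here is a small user-space
sketch under stated assumptions. None of these names exist in o2dlm; the
structs, the functions and the skip_when_dropping flag are hypothetical and
only model the idea from the commit text: if a node that is already purging a
lockres still recovers it to the new master, a refmap bit is left behind that
nobody will clear, whereas skipping that recovery leaves the new master free
to purge.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the recovery master's view of one lockres. */
struct master_model {
    uint32_t refmap;        /* bit n set => node n holds a reference */
};

/* Hypothetical model of the same lockres on the surviving node. */
struct local_lockres_model {
    bool dropping_ref;      /* true while the node is purging the lockres */
};

void master_add_ref(struct master_model *m, int node)
{
    m->refmap |= (uint32_t)1 << node;     /* node must be in 0..31 */
}

bool master_can_purge(const struct master_model *m)
{
    return m->refmap == 0;
}

/*
 * Recovery step on the surviving node `self`.  With skip_when_dropping
 * set, a lockres that is already being purged is not recovered to the
 * new master, so no refbit is created there.
 */
void recover_to_new_master(const struct local_lockres_model *res, int self,
                           struct master_model *new_master,
                           bool skip_when_dropping)
{
    if (skip_when_dropping && res->dropping_ref)
        return;
    master_add_ref(new_master, self);
}
```

With skip_when_dropping false, the model leaves node A's bit set on node C and
master_can_purge() stays false; with it true, the refmap stays empty.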


On 09/29/2010 06:47 AM, Charlie Sharkey wrote:
>
> I got the following crash on a Sles10 SP2 system, info below.
>
> Is this a known problem? It looks similar to bug #912:
>
>        http://oss.oracle.com/bugzilla/show_bug.cgi?id=912
>
> version info
>
> -----------------
>
> OCFS2 Node Manager 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build 
> f922955d99ef972235bd0c1fc236c5ddbb368611)
>
> OCFS2 DLM 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build 
> f922955d99ef972235bd0c1fc236c5ddbb368611)
>
> OCFS2 DLMFS 1.4.1-1-SLES Wed Jul 23 18:33:42 UTC 2008 (build 
> f922955d99ef972235bd0c1fc236c5ddbb368611)
>
> crash info
>
> -------------
>
>       KERNEL: ./vmlinux-2.6.16.60-0.42.10
>     DUMPFILE: ../n2_vmcore_20100925
>         CPUS: 8
>         DATE: Sat Sep 25 12:48:00 2010
>       UPTIME: 10 days, 04:08:44
> LOAD AVERAGE: 9.39, 9.11, 8.67
>        TASKS: 484
>     NODENAME: n2
>      RELEASE: 2.6.16.60-0.42.10-smp
>      VERSION: #1 SMP Tue Apr 27 05:11:27 UTC 2010
>      MACHINE: x86_64  (2926 Mhz)
>       MEMORY: 2.9 GB
>        PANIC: ""
>          PID: 6557
>      COMMAND: "dlm_thread"
>         TASK: ffff81012ac89860  [THREAD_INFO: ffff81010532e000]
>          CPU: 4
>        STATE: TASK_RUNNING (PANIC)
>
> crash> bt
> PID: 6557   TASK: ffff81012ac89860  CPU: 4   COMMAND: "dlm_thread"
>  #0 [ffff81010532fa50] machine_kexec at ffffffff8011c0b6
>  #1 [ffff81010532fb20] crash_kexec at ffffffff80154022
>  #2 [ffff81010532fbe0] __die at ffffffff802ec658
>  #3 [ffff81010532fc20] die at ffffffff8010c7e6
>  #4 [ffff81010532fc50] do_invalid_op at ffffffff8010cd97
>  #5 [ffff81010532fd10] error_exit at ffffffff8010bced
>     [exception RIP: dlm_drop_lockres_ref+480]
>     RIP: ffffffff88511d2a  RSP: ffff81010532fdc8  RFLAGS: 00010286
>     RAX: ffff81006181cc08  RBX: 0000000000000000  RCX: 000000000001109c
>     RDX: 000000000000001f  RSI: 0000000000000296  RDI: ffffffff8035ba1c
>     RBP: ffff81006181cbc0   R8: ffffffff8045a260   R9: 000000000000001f
>     R10: 0000000000000000  R11: 0000000000000000  R12: ffff810129b05c00
>     R13: 000000000000001f  R14: ffff81004ada2320  R15: 000000000000026d
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #6 [ffff81010532fdc0] dlm_drop_lockres_ref at ffffffff88511d2a
>  #7 [ffff81010532fe40] dlm_run_purge_list at ffffffff8852035c
>  #8 [ffff81010532fe90] dlm_thread at ffffffff88520718
>  #9 [ffff81010532ff10] kthread at ffffffff801480cd
> #10 [ffff81010532ff50] kernel_thread at ffffffff8010bea6
> crash>
>
> text extracted from the core file:
>
> -----------------------------------------
>
> <3>(6345,7):dlm_deref_lockres_handler:2302 ERROR: 27870DB34A7241CC8EBDD43647ABE1FB:M0000000000000078b4305e00000000: node 0 trying to drop ref but it is already dropped!
> <3>(6557,4):dlm_drop_lockres_ref:2234 ERROR: while dropping ref on 130ADCC7DE934141AF05DA025CCD14A4:O0000000000000079a3bfbc00000000 (master=0) got -22.
> <1>Kernel BUG at fs/ocfs2/dlm/dlmmaster.c:2236
> <4>Modules linked in: af_packet ocfs2 ocfs2_dlmfs ocfs2_dlm ocfs2_nodemanager configfs btipbsa4 ipmi_devintf ipmi_si ipmi_msghandler bonding ipv6 bticomp_aha363 dock smi button battery btismc ac st loop dm_round_robin dm_multipath dm_mod usbhid usb_storage ide_core i2c_i801 igb e1000 hw_random i2c_core uhci_hcd ehci_hcd usbcore ext3 jbd qla2xxx firmware_class qla2xxx_conf intermodule edd fan thermal processor sg megaraid_sas ata_piix libata sd_mod scsi_mod
> <4>Pid: 6557, comm: dlm_thread Tainted: P     U 2.6.16.60-0.42.10-smp #1
> <4>RIP: 0010:[<ffffffff88511d2a>] <ffffffff88511d2a>{:ocfs2_dlm:dlm_drop_lockres_ref+480}
> <4>Process dlm_thread (pid: 6557, threadinfo ffff81010532e000, task ffff81012ac89860)
> <4>Call Trace: <ffffffff8852035c>{:ocfs2_dlm:dlm_run_purge_list+771}
> <4> <ffffffff88520718>{:ocfs2_dlm:dlm_thread+131} <ffffffff8014820e>{autoremove_wake_function+0}
> <4> <ffffffff88520695>{:ocfs2_dlm:dlm_thread+0} <ffffffff80147e05>{keventd_create_kthread+0}
> <1>RIP <ffffffff88511d2a>{:ocfs2_dlm:dlm_drop_lockres_ref+480} RSP <ffff81010532fdc8>
