[Ocfs2-users] Journal replay after crash, kernel BUG at fs/ocfs2/journal.c:1700!, 2.6.36

Tao Ma tao.ma at oracle.com
Fri Oct 29 02:23:47 PDT 2010


Hi Ronald,

On 10/29/2010 05:12 PM, Ronald Moesbergen wrote:
> Hello,
>
> I was testing kernel 2.6.36 (vanilla mainline) and encountered the
> following BUG():
>
> [157756.266000] o2net: no longer connected to node app01 (num 0) at
> 10.2.25.13:7777
> [157756.266077] (o2hb-5FA56B1D0A,2908,0):o2dlm_eviction_cb:267 o2dlm
> has evicted node 0 from group 5FA56B1D0A9249099CE58C82CFEC873A
> [157756.274443] (ocfs2rec,14060,0):dlm_get_lock_resource:836
> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
> least one node (0) to recover before lock mastery can begin
> [157757.275776] (ocfs2rec,14060,0):dlm_get_lock_resource:890
> 5FA56B1D0A9249099CE58C82CFEC873A:M00000000000000000000186ba2b09b: at
> least one node (0) to recover before lock mastery can begin
> [157760.774045] (dlm_reco_thread,2920,2):dlm_get_lock_resource:836
> 5FA56B1D0A9249099CE58C82CFEC873A:$RECOVERY: at least one node (0) to
> recover before lock mastery can begin
> [157760.774124] (dlm_reco_thread,2920,2):dlm_get_lock_resource:870
> 5FA56B1D0A9249099CE58C82CFEC873A: recovery map is not empty, but must
> master $RECOVERY lock now
> [157760.774205] (dlm_reco_thread,2920,2):dlm_do_recovery:523 (2920)
> Node 1 is the Recovery Master for the Dead Node 0 for Domain
> 5FA56B1D0A9249099CE58C82CFEC873A
> [157768.261818] (ocfs2rec,14060,0):ocfs2_replay_journal:1605
> Recovering node 0 from slot 0 on device (8,32)
> [157772.850182] ------------[ cut here ]------------
> [157772.850211] kernel BUG at fs/ocfs2/journal.c:1700!
Strange. The line that triggers the BUG is
BUG_ON(osb->node_num == node_num);
which means the recovery thread was asked to recover a node that has the
same node number as the local node, i.e. two nodes in the cluster appear
to be using the same node number.
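
For reference, the surrounding code (paraphrased from memory from
ocfs2_recover_node() in fs/ocfs2/journal.c of 2.6.36, so the exact lines
may differ slightly) looks roughly like this:

    /* __ocfs2_recovery_thread() calls this for every dead node it
     * finds in the recovery map. */
    static int ocfs2_recover_node(struct ocfs2_super *osb,
                                  int node_num, int slot_num)
    {
            /* We should never be asked to recover our own node
             * number; hitting this means the dead node's slot maps
             * back to the local node number. */
            BUG_ON(osb->node_num == node_num);

            /* journal replay for the dead node's slot follows */
    }

So the only way to hit it is if the node being recovered ended up with
the same node number as the node doing the recovery.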

So could you please grab the mount info from the system logs of the two
nodes? The message looks like:

Oct 27 16:24:21 ocfs2-test2 kernel: ocfs2: Mounting device (8,8) on 
(node 2, slot 0) with ordered data mode.

It tells us which node number and slot each node used for this volume.
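
That message comes from the mount path, roughly the printk below
(paraphrased from ocfs2_fill_super() in fs/ocfs2/super.c; the exact
format string can differ between kernel versions), so it records exactly
the node_num/slot_num pair that the recovery code later compares:

    printk(KERN_INFO "ocfs2: Mounting device (%s) on (node %u, slot %d) "
           "with %s data mode.\n",
           osb->dev_str, osb->node_num, osb->slot_num,
           (osb->s_mount_opt & OCFS2_MOUNT_DATA_WRITEBACK) ?
           "writeback" : "ordered");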

Regards,
Tao
> [157772.850238] invalid opcode: 0000 [#1] SMP
> [157772.850270] last sysfs file:
> /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
> [157772.850314] CPU 0
> [157772.850320] Modules linked in: ip_vs_wrr ip_vs nf_conntrack ocfs2
> jbd2 quota_tree ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm
> ocfs2_nodemanager ocfs2_stackglue configfs sd_mod crc32c ib_iser
> rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp
> libiscsi_tcp libiscsi scsi_transport_iscsi bonding ipv6 ipmi_devintf
> cpufreq_ondemand acpi_cpufreq freq_table mperf loop ipmi_si
> ipmi_msghandler hpilo hpwdt container snd_pcm serio_raw psmouse
> snd_timer snd soundcore tpm_tis tpm tpm_bios pcspkr iTCO_wdt
> snd_page_alloc button processor evdev ext3 jbd mbcache dm_mirror
> dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod cdrom usbhid hid
> ata_piix ata_generic cciss libata scsi_mod ide_pci_generic ide_core
> ehci_hcd bnx2 e1000e uhci_hcd thermal fan thermal_sys
> [157772.850758]
> [157772.850779] Pid: 14060, comm: ocfs2rec Not tainted 2.6.36 #2
> /ProLiant DL360 G6
> [157772.850823] RIP: 0010:[<ffffffffa03da8c3>]  [<ffffffffa03da8c3>]
> __ocfs2_recovery_thread+0x474/0x137f [ocfs2]
> [157772.850916] RSP: 0018:ffff880084f49e00  EFLAGS: 00010246
> [157772.850943] RAX: 0000000000000001 RBX: ffff88011dd07108 RCX:
> ffff88011d3fe344
> [157772.850986] RDX: ffff88011d3fe340 RSI: 0000000000000001 RDI:
> ffff88011dd07108
> [157772.851029] RBP: ffff880118479000 R08: 0000000000000000 R09:
> 0000000000000000
> [157772.851073] R10: 0000000000000000 R11: 0000000000000400 R12:
> ffff88011faff800
> [157772.851116] R13: 0000000000000001 R14: ffff88011dd07000 R15:
> 0000000000000000
> [157772.851159] FS:  0000000000000000(0000) GS:ffff880001600000(0000)
> knlGS:0000000000000000
> [157772.851205] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [157772.851232] CR2: 0000000001e88b58 CR3: 000000011dd26000 CR4:
> 00000000000006f0
> [157772.851275] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [157772.851318] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> 0000000000000400
> [157772.851362] Process ocfs2rec (pid: 14060, threadinfo
> ffff880084f48000, task ffff88009bd9e9c0)
> [157772.851407] Stack:
> [157772.851427]  ffff880000000000 0000000000000000 ffff880100000008
> ffffffff00000020
> [157772.851462]<0>  ffff88009bd9ece8 ffff88009bd9e9c0 ffff88009bd9ece8
> ffff88009bd9e9c0
> [157772.851515]<0>  ffff88009bd9ece8 ffff88009bd9e9c0 ffff88009bd9ece8
> ffff88009bd9e9c0
> [157772.851584] Call Trace:
> [157772.851611]  [<ffffffffa03da44f>] ?
> __ocfs2_recovery_thread+0x0/0x137f [ocfs2]
> [157772.851657]  [<ffffffff81044aed>] ? kthread+0x7e/0x86
> [157772.851684]  [<ffffffff81002b94>] ? kernel_thread_helper+0x4/0x10
> [157772.851713]  [<ffffffff81044a6f>] ? kthread+0x0/0x86
> [157772.851739]  [<ffffffff81002b90>] ? kernel_thread_helper+0x0/0x10
> [157772.851766] Code: 89 1c 24 41 b9 a0 06 00 00 49 c7 c0 50 01 42 a0
> 48 c7 c7 a9 9f 42 a0 31 c0 e8 1d 0c e7 e0 8b 74 24 74 41 39 b6 38 01
> 00 00 75 04<0f>  0b eb fe 48 c7 84 24 a0 00 00 00 00 00 00 00 48 c7 84
> 24 98
> [157772.851973] RIP  [<ffffffffa03da8c3>]
> __ocfs2_recovery_thread+0x474/0x137f [ocfs2]
> [157772.852024]  RSP<ffff880084f49e00>
> [157772.852284] ---[ end trace 5a9c0517280b55ba ]---
>
> The setup is fairly simple: two (real, not virtual) nodes that mount an
> iSCSI-exported disk with OCFS2 on it. What happened is that node 0
> lost its connection to the SAN and died because of this (so far so good).
> But then, node 2 started recovery and crashed while replaying the
> journal of node 0. Two nodes down: not good. My guess is that the
> journal contained some garbage and the replay process doesn't deal
> well with that. Is this a known issue?
>
> Regards,
> Ronald.
>
> _______________________________________________
> Ocfs2-users mailing list
> Ocfs2-users at oss.oracle.com
> http://oss.oracle.com/mailman/listinfo/ocfs2-users
