[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!
Welterlen Benoit
Benoit.Welterlen at bull.net
Mon Nov 8 08:06:45 PST 2010
Hi,
Thanks for your answer.
I tried to force TCP instead of SCTP.
There is no crash anymore, but OCFS2 becomes unavailable when I stop one node:
the OCFS2 directory is not accessible:
ls /BCM/conf
[root at chili0 ~]# cat /proc/23921/stack
[<ffffffffa02c5d53>] ocfs2_wait_for_recovery+0x77/0x8f [ocfs2]
[<ffffffffa02b08a8>] ocfs2_inode_lock_full_nested+0x160/0xb8d [ocfs2]
[<ffffffffa02c35e2>] ocfs2_inode_revalidate+0x163/0x25c [ocfs2]
[<ffffffffa02bd9f4>] ocfs2_getattr+0x8b/0x19c [ocfs2]
[<ffffffff8111c30f>] vfs_getattr+0x4c/0x69
[<ffffffff8111c37c>] vfs_fstatat+0x50/0x67
[<ffffffff8111c479>] vfs_stat+0x1b/0x1d
[<ffffffff8111c49a>] sys_newstat+0x1f/0x39
[<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
Any access to the filesystem hangs.
Regards,
Benoit
On 21/10/2010 at 19:13, David Teigland wrote:
> On Thu, Oct 21, 2010 at 06:11:36PM +0200, Welterlen Benoit wrote:
>
>> Hi,
>>
>> Thanks for your answer.
>> SCTP is automatically selected ("dlm: Using SCTP for
>> communications"), and I see no way to change that:
>> /sys/kernel/config/dlm/cluster/protocol is set to 1, and it only
>> exists once the service is started. Can I still change it at that
>> point?
>>
> Yes, there's a dlm_controld command line option (or cluster.conf which I
> don't suppose you're using with pacemaker.) You can set that to get TCP,
> but that obviates corosync redundant ring also (which is what is used to
> auto-select SCTP).
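
For reference, the override David describes might look like the sketch below. The cluster.conf attribute name and the configfs value meanings (0 = TCP, 1 = SCTP, inferred from the value observed above) are assumptions about the dlm_controld of that era; check dlm_controld(8) for your version before relying on them.

```shell
# Sketch, unverified on this setup: force the dlm to TCP instead of SCTP.
#
# Option 1: in cluster.conf (attribute name is an assumption; see
# dlm_controld(8) for the exact spelling your version accepts):
#   <dlm protocol="tcp"/>
#
# Option 2: via configfs, before any lockspace is created
# (0 = TCP, 1 = SCTP, per the value reported above):
echo 0 > /sys/kernel/config/dlm/cluster/protocol
cat /sys/kernel/config/dlm/cluster/protocol
```

Note that the configfs value must be set before the first lockspace is created; changing it afterwards has no effect on established connections.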
>
>
>> dlm: closing connection to node 33554699
>>
>>
>>> Here, OCFS2 is down on M2, then restarted:
>>>
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
>> cm->sockaddr_storage ffff8804796268a0
>> dlm_nodeid_to_addr -EEXIST;
>> dlm: no address for nodeid 33554699
>>
> I suspect the pacemaker version of dlm_controld is doing an unusual
> sequence of node addition/removal. The stuff above eventually leads to
> the oops below:
>
>
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> process_sctp_notification SCTP_RESTART
>> get_comm nodeid : 0, sockaddr_storage : ffff88047a357c54
>> get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
>> addr : ffff88047a357c54
>> dlm: reject connect from unknown addr
>> 02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00
>> in __nodeid2con , nodeid : 0
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> ------------[ cut here ]------------
>> kernel BUG at fs/dlm/lowcomms.c:661!
>> invalid opcode: 0000 [#1] SMP
>> last sysfs file: /sys/kernel/dlm/77410678764B4782BDAE3E888E0C8C4D/control
>> CPU 35
>> Modules linked in: sctp(U) libcrc32c(U) ocfs2(U)
>> ocfs2_nodemanager(U) ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U)
>> configfs(U) acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U)
>> ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U)
>> auth_rpcgss(U) ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U)
>> dm_region_hash(U) dm_log(U) scsi_dh_emc(U) dm_round_robin(U)
>> dm_multipath(U) ioatdma(U) i7core_edac(U) igb(U) edac_core(U) sg(U)
>> i2c_i801(U) i2c_core(U) iTCO_wdt(U) dca(U) iTCO_vendor_support(U)
>> ext4(U) mbcache(U) jbd2(U) usbhid(U) hid(U) sd_mod(U) crc_t10dif(U)
>> lpfc(U) ahci(U) scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U)
>> scsi_tgt(U) dm_mod(U) [last unloaded: ipmi_msghandler]
>>
>> Pid: 4306, comm: dlm_recv/35 Not tainted
>> 2.6.32-30.el6.Bull.14.x86_64 #2 bullx super-node
>> RIP: 0010:[<ffffffffa039edeb>] [<ffffffffa039edeb>]
>> receive_from_sock+0x38b/0x430 [dlm]
>> RSP: 0018:ffff88047a357d10 EFLAGS: 00010246
>> RAX: 0000000000000095 RBX: ffff88087c5aad20 RCX: 0000000000001b62
>> RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
>> RBP: ffff88047a357e10 R08: 00000000ffffffff R09: 0000000000000000
>> R10: ffff881079423480 R11: 0000000000000000 R12: ffff88047a357da0
>> R13: 0000000000000030 R14: ffff88087c5aad30 R15: ffff88047a357d40
>> FS: 0000000000000000(0000) GS:ffff880c8e600000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 000000359bea6df0 CR3: 0000000001001000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process dlm_recv/35 (pid: 4306, threadinfo ffff88047a356000, task
>> ffff88047aeaf2e0)
>> Stack:
>> ffff88047a357db0 0000000000000246 ffff880c7da288c0 0000000000000246
>> <0> ffff88047a357d70 0000100081049472 0000000000000000 0000000000000000
>> <0> ffff88047a357d80 0000000000000002 ffff88047a357dd0 0000000000000000
>> Call Trace:
>> [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
>> [<ffffffffa039dda6>] process_recv_sockets+0x36/0x50 [dlm]
>> [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
>> [<ffffffff8107ce9d>] worker_thread+0x16d/0x290
>> [<ffffffff81082430>] ? autoremove_wake_function+0x0/0x40
>> [<ffffffff8107cd30>] ? worker_thread+0x0/0x290
>> [<ffffffff810820d6>] kthread+0x96/0xa0
>> [<ffffffff8100d1aa>] child_rip+0xa/0x20
>> [<ffffffff81082040>] ? kthread+0x0/0xa0
>> [<ffffffff8100d1a0>] ? child_rip+0x0/0x20
>> Code: f1 4c 8b 8d 70 ff ff ff 48 c1 e6 0c 4c 01 d6 e8 f1 33 0b e1 4c
>> 89 f7 e8 14 4b 0b e1 31 f6 48 89 df e8 6a f4 ff ff e9 cd fd ff ff
>> <0f> 0b 0f 1f 00 eb fb 48 c7 c7 e8 9a 3a a0 31 c0 e8 c5 33 0b e1
>> RIP [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]
>> RSP<ffff88047a357d10>
>> crash>
>>
>>
>>
>> On 21/10/2010 at 17:13, David Teigland wrote:
>>
>>>> kernel BUG at fs/dlm/lowcomms.c:647!
>>>>
>>> That looks like an interesting one; I haven't seen it before.
>>> First ensure dlm is not configured to use sctp (that code is
>>> not widely tested.) Other than that, if you'd like to start
>>> debugging this before I get around to it, replace the BUG_ON
>>> with some printk's and return error. The conn with nodeid 0 is
>>> the listening socket, for which tcp_accept_from_sock() should
>>> be called rather than receive_from_sock().
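
David's suggested debugging change might look roughly like the sketch below, placed near the top of receive_from_sock() in fs/dlm/lowcomms.c. This is not the actual patch: the field names (con->nodeid, con->sock_mutex) are assumptions based on the 2.6.32-era source, and the check merely surfaces the misrouted listening socket he describes rather than fixing the root cause.

```c
/* Sketch of the suggested debugging change in receive_from_sock()
 * in fs/dlm/lowcomms.c (assumed field names; untested).
 *
 * The connection with nodeid 0 is the listening socket, which should be
 * handled by tcp_accept_from_sock() instead of receive_from_sock().
 * Rather than BUG_ON(), log the bad connection and return an error so
 * the machine stays up while the sequence leading here is diagnosed.
 */
if (con->nodeid == 0) {
	printk(KERN_ERR "dlm: receive_from_sock called on listening "
	       "connection %p (nodeid 0)\n", con);
	mutex_unlock(&con->sock_mutex);
	return -EINVAL;
}
```

With this in place the node should log the error instead of oopsing, which makes it possible to capture the preceding node add/remove sequence from the dlm_controld logs.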
>>>
>>>
>>>
>>>> KERNEL: /usr/lib/debug/lib/modules/2.6.32-100.0.19.el5/vmlinux
>>>> DUMPFILE: /var/var/crash/127.0.0.1-2010-10-18-16:42:07/vmcore
>>>> [PARTIAL DUMP]
>>>> CPUS: 64
>>>> DATE: Mon Oct 18 16:41:48 2010
>>>> UPTIME: 00:15:00
>>>> LOAD AVERAGE: 1.06, 1.22, 1.65
>>>> TASKS: 1594
>>>> NODENAME: chili0
>>>> RELEASE: 2.6.32-100.0.19.el5
>>>> VERSION: #1 SMP Fri Sep 17 17:51:41 EDT 2010
>>>> MACHINE: x86_64 (1999 Mhz)
>>>> MEMORY: 64 GB
>>>> PANIC: "kernel BUG at fs/dlm/lowcomms.c:647!"
>>>> PID: 27062
>>>> COMMAND: "dlm_recv/34"
>>>> TASK: ffff880c7caa00c0 [THREAD_INFO: ffff880c77c6a000]
>>>> CPU: 34
>>>> STATE: TASK_RUNNING (PANIC)
>>>>
>>>> crash> bt
>>>> PID: 27062 TASK: ffff880c7caa00c0 CPU: 34 COMMAND: "dlm_recv/34"
>>>> #0 [ffff880c77c6b910] machine_kexec at ffffffff8102cc9b
>>>> #1 [ffff880c77c6b990] crash_kexec at ffffffff810964d4
>>>> #2 [ffff880c77c6ba60] oops_end at ffffffff81439bd9
>>>> #3 [ffff880c77c6ba90] die at ffffffff81015639
>>>> #4 [ffff880c77c6bac0] do_trap at ffffffff8143952c
>>>> #5 [ffff880c77c6bb10] do_invalid_op at ffffffff81013902
>>>> #6 [ffff880c77c6bbb0] invalid_op at ffffffff81012b7b
>>>> [exception RIP: receive_from_sock+1364]
>>>> RIP: ffffffffa02406c3 RSP: ffff880c77c6bc60 RFLAGS: 00010246
>>>> RAX: 0000000000000030 RBX: ffff8810774b8d30 RCX: ffff88087c4548f8
>>>> RDX: 0000000000000030 RSI: ffff880876dce000 RDI: ffffffff81398045
>>>> RBP: ffff880c77c6be50 R8: ffff000000000000 R9: ffff880c77c6b900
>>>> R10: ffff880c77c6b8f0 R11: 0000000000000030 R12: 0000000000000030
>>>> R13: ffff8810774b8d20 R14: ffff880c7caa00c0 R15: ffffffffa023ecca
>>>> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
>>>> #7 [ffff880c77c6be58] process_recv_sockets at ffffffffa023ecea
>>>> #8 [ffff880c77c6be78] worker_thread at ffffffff81071802
>>>> #9 [ffff880c77c6bee8] kthread at ffffffff810756d3
>>>> #10 [ffff880c77c6bf48] kernel_thread at ffffffff81012dea
>>>>
>>>>
>>>> _______________________________________________
>>>> Ocfs2-users mailing list
>>>> Ocfs2-users at oss.oracle.com
>>>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>>>
>>>
>>
>