[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!

Mon Nov 8 08:06:45 PST 2010

Hi,

Thanks for your answer.
I tried to force TCP instead of SCTP.
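
For anyone wanting to confirm which transport dlm ended up with, here is a minimal sketch. It reads the configfs file mentioned in the quoted message below. It assumes that 1 means SCTP (as reported in this thread) and 0 means TCP; the helper names are hypothetical, and the file only exists once dlm has been started.

```python
from pathlib import Path

# Path quoted later in this thread; it only exists while dlm is running.
PROTOCOL_FILE = Path("/sys/kernel/config/dlm/cluster/protocol")

# Assumed mapping: 1 is SCTP (as reported in this thread), 0 is TCP.
PROTOCOL_NAMES = {0: "TCP", 1: "SCTP"}


def describe_protocol(raw: str) -> str:
    """Turn the raw file content into a readable protocol name."""
    value = int(raw.strip())
    return PROTOCOL_NAMES.get(value, f"unknown ({value})")


def current_protocol(path: Path = PROTOCOL_FILE) -> str:
    """Read the configfs file, or explain why it cannot be read."""
    try:
        return describe_protocol(path.read_text())
    except FileNotFoundError:
        return "not available (dlm not started?)"


if __name__ == "__main__":
    print("dlm lowcomms protocol:", current_protocol())
```

Running it on each node after the cluster stack is up should show whether the TCP setting actually took effect everywhere.
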
The crash no longer occurs, but the OCFS2 filesystem becomes unavailable
when I stop one node. The OCFS2 directory is no longer accessible:

ls /BCM/conf
[root@chili0 ~]# cat /proc/23921/stack
[<ffffffffa02c5d53>] ocfs2_wait_for_recovery+0x77/0x8f [ocfs2]
[<ffffffffa02b08a8>] ocfs2_inode_lock_full_nested+0x160/0xb8d [ocfs2]
[<ffffffffa02c35e2>] ocfs2_inode_revalidate+0x163/0x25c [ocfs2]
[<ffffffffa02bd9f4>] ocfs2_getattr+0x8b/0x19c [ocfs2]
[<ffffffff8111c30f>] vfs_getattr+0x4c/0x69
[<ffffffff8111c37c>] vfs_fstatat+0x50/0x67
[<ffffffff8111c479>] vfs_stat+0x1b/0x1d
[<ffffffff8111c49a>] sys_newstat+0x1f/0x39
[<ffffffff81011db2>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Any access to the filesystem hangs.
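
To collect the same kind of trace for every stuck process at once, here is a rough sketch. It assumes the hung tasks sit in uninterruptible sleep (state D), that /proc/<pid>/stack is available on the running kernel, and that the script is run as root. The function names are only illustrative.

```python
import os


def task_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split after the last ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def blocked_pids(proc_root: str = "/proc"):
    """Yield PIDs of tasks in uninterruptible sleep (state 'D')."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                if task_state(f.read()) == "D":
                    yield int(entry)
        except OSError:
            continue  # the task exited while we were scanning


def kernel_stack(pid: int, proc_root: str = "/proc") -> str:
    """Read /proc/<pid>/stack (needs root and a kernel that provides it)."""
    try:
        with open(os.path.join(proc_root, str(pid), "stack")) as f:
            return f.read()
    except OSError as exc:
        return f"<unreadable: {exc}>"


if __name__ == "__main__":
    for pid in blocked_pids():
        print(f"--- pid {pid} ---")
        print(kernel_stack(pid), end="")
```

If every blocked task shows ocfs2_wait_for_recovery near the top of its stack, that would suggest they are all waiting on the same recovery that never completes.
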

Regards,

Benoit

On 21/10/2010 19:13, David Teigland wrote:
> On Thu, Oct 21, 2010 at 06:11:36PM +0200, Welterlen Benoit wrote:
>    
>> Hi,
>>
>> Thanks for your answer.
>> sctp is automatically selected ("dlm: Using SCTP for
>> communications") and I have no option to change that.
>> /sys/kernel/config/dlm/cluster/protocol is set to 1 and is only
>> available once the service is started. Can I modify it at that
>> point?
>>      
> Yes, there's a dlm_controld command line option (or cluster.conf which I
> don't suppose you're using with pacemaker.)  You can set that to get TCP,
> but that obviates corosync redundant ring also (which is what is used to
> auto-select SCTP).
>
>    
>> dlm: closing connection to node 33554699
>>
>>      
>>> Here, OCFS is down on M2, then restarted:
>>>        
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
>> dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
>> cm->sockaddr_storage  ffff8804796268a0
>> dlm_nodeid_to_addr -EEXIST;
>> dlm: no address for nodeid 33554699
>>      
> I suspect the pacemaker version of dlm_controld is doing an unusual
> sequence of node addition/removal.  The stuff above eventually leads to
> the oops below:
>
>    
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> process_sctp_notification SCTP_RESTART
>> get_comm  nodeid : 0, sockaddr_storage : ffff88047a357c54
>> get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
>> addr : ffff88047a357c54
>> dlm: reject connect from unknown addr
>> 02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> 00 00 00 00 00 00 00 00 00 00 00 00 00
>> in __nodeid2con , nodeid : 0
>> sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
>> ------------[ cut here ]------------
>> kernel BUG at fs/dlm/lowcomms.c:661!
>> invalid opcode: 0000 [#1] SMP
>> last sysfs file: /sys/kernel/dlm/77410678764B4782BDAE3E888E0C8C4D/control
>> CPU 35
>> Modules linked in: sctp(U) libcrc32c(U) ocfs2(U)
>> ocfs2_nodemanager(U) ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U)
>> configfs(U) acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U)
>> ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U)
>> auth_rpcgss(U) ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U)
>> dm_region_hash(U) dm_log(U) scsi_dh_emc(U) dm_round_robin(U)
>> dm_multipath(U) ioatdma(U) i7core_edac(U) igb(U) edac_core(U) sg(U)
>> i2c_i801(U) i2c_core(U) iTCO_wdt(U) dca(U) iTCO_vendor_support(U)
>> ext4(U) mbcache(U) jbd2(U) usbhid(U) hid(U) sd_mod(U) crc_t10dif(U)
>> lpfc(U) ahci(U) scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U)
>> scsi_tgt(U) dm_mod(U) [last unloaded: ipmi_msghandler]
>>
>> Modules linked in: sctp(U) libcrc32c(U) ocfs2(U)
>> ocfs2_nodemanager(U) ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U)
>> configfs(U) acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U)
>> ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U)
>> auth_rpcgss(U) ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U)
>> dm_region_hash(U) dm_log(U) scsi_dh_emc(U) dm_round_robin(U)
>> dm_multipath(U) ioatdma(U) i7core_edac(U) igb(U) edac_core(U) sg(U)
>> i2c_i801(U) i2c_core(U) iTCO_wdt(U) dca(U) iTCO_vendor_support(U)
>> ext4(U) mbcache(U) jbd2(U) usbhid(U) hid(U) sd_mod(U) crc_t10dif(U)
>> lpfc(U) ahci(U) scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U)
>> scsi_tgt(U) dm_mod(U) [last unloaded: ipmi_msghandler]
>> Pid: 4306, comm: dlm_recv/35 Not tainted
>> 2.6.32-30.el6.Bull.14.x86_64 #2 bullx super-node
>> RIP: 0010:[<ffffffffa039edeb>]  [<ffffffffa039edeb>]
>> receive_from_sock+0x38b/0x430 [dlm]
>> RSP: 0018:ffff88047a357d10  EFLAGS: 00010246
>> RAX: 0000000000000095 RBX: ffff88087c5aad20 RCX: 0000000000001b62
>> RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
>> RBP: ffff88047a357e10 R08: 00000000ffffffff R09: 0000000000000000
>> R10: ffff881079423480 R11: 0000000000000000 R12: ffff88047a357da0
>> R13: 0000000000000030 R14: ffff88087c5aad30 R15: ffff88047a357d40
>> FS:  0000000000000000(0000) GS:ffff880c8e600000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> CR2: 000000359bea6df0 CR3: 0000000001001000 CR4: 00000000000006e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Process dlm_recv/35 (pid: 4306, threadinfo ffff88047a356000, task
>> ffff88047aeaf2e0)
>> Stack:
>>   ffff88047a357db0 0000000000000246 ffff880c7da288c0 0000000000000246
>> <0>  ffff88047a357d70 0000100081049472 0000000000000000 0000000000000000
>> <0>  ffff88047a357d80 0000000000000002 ffff88047a357dd0 0000000000000000
>> Call Trace:
>>   [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
>>   [<ffffffffa039dda6>] process_recv_sockets+0x36/0x50 [dlm]
>>   [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
>>   [<ffffffff8107ce9d>] worker_thread+0x16d/0x290
>>   [<ffffffff81082430>] ? autoremove_wake_function+0x0/0x40
>>   [<ffffffff8107cd30>] ? worker_thread+0x0/0x290
>>   [<ffffffff810820d6>] kthread+0x96/0xa0
>>   [<ffffffff8100d1aa>] child_rip+0xa/0x20
>>   [<ffffffff81082040>] ? kthread+0x0/0xa0
>>   [<ffffffff8100d1a0>] ? child_rip+0x0/0x20
>> Code: f1 4c 8b 8d 70 ff ff ff 48 c1 e6 0c 4c 01 d6 e8 f1 33 0b e1 4c
>> 89 f7 e8 14 4b 0b e1 31 f6 48 89 df e8 6a f4 ff ff e9 cd fd ff ff
>> <0f>  0b 0f 1f 00 eb fb 48 c7 c7 e8 9a 3a a0 31 c0 e8 c5 33 0b e1
>> RIP  [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]
>>   RSP<ffff88047a357d10>
>> crash>
>>
>>
>>
>> On 21/10/2010 17:13, David Teigland wrote:
>>      
>>>> kernel BUG at fs/dlm/lowcomms.c:647!
>>>>          
>>> That looks like an interesting one, I haven't seen it before.
>>> First ensure dlm is not configured to use sctp (that code is
>>> not widely tested.)  Other than that, if you'd like to start
>>> debugging this before I get around to it, replace the BUG_ON
>>> with some printk's and return error.  The conn with nodeid 0 is
>>> the listening socket, for which tcp_accept_from_sock() should
>>> be called rather than receive_from_sock().
>>>
>>>
>>>        
>>>>         KERNEL: /usr/lib/debug/lib/modules/2.6.32-100.0.19.el5/vmlinux
>>>>       DUMPFILE: /var/var/crash/127.0.0.1-2010-10-18-16:42:07/vmcore
>>>> [PARTIAL DUMP]
>>>>           CPUS: 64
>>>>           DATE: Mon Oct 18 16:41:48 2010
>>>>         UPTIME: 00:15:00
>>>> LOAD AVERAGE: 1.06, 1.22, 1.65
>>>>          TASKS: 1594
>>>>       NODENAME: chili0
>>>>        RELEASE: 2.6.32-100.0.19.el5
>>>>        VERSION: #1 SMP Fri Sep 17 17:51:41 EDT 2010
>>>>        MACHINE: x86_64  (1999 Mhz)
>>>>         MEMORY: 64 GB
>>>>          PANIC: "kernel BUG at fs/dlm/lowcomms.c:647!"
>>>>            PID: 27062
>>>>        COMMAND: "dlm_recv/34"
>>>>           TASK: ffff880c7caa00c0  [THREAD_INFO: ffff880c77c6a000]
>>>>            CPU: 34
>>>>          STATE: TASK_RUNNING (PANIC)
>>>>
>>>> crash>   bt
>>>> PID: 27062  TASK: ffff880c7caa00c0  CPU: 34  COMMAND: "dlm_recv/34"
>>>>    #0 [ffff880c77c6b910] machine_kexec at ffffffff8102cc9b
>>>>    #1 [ffff880c77c6b990] crash_kexec at ffffffff810964d4
>>>>    #2 [ffff880c77c6ba60] oops_end at ffffffff81439bd9
>>>>    #3 [ffff880c77c6ba90] die at ffffffff81015639
>>>>    #4 [ffff880c77c6bac0] do_trap at ffffffff8143952c
>>>>    #5 [ffff880c77c6bb10] do_invalid_op at ffffffff81013902
>>>>    #6 [ffff880c77c6bbb0] invalid_op at ffffffff81012b7b
>>>>       [exception RIP: receive_from_sock+1364]
>>>>       RIP: ffffffffa02406c3  RSP: ffff880c77c6bc60  RFLAGS: 00010246
>>>>       RAX: 0000000000000030  RBX: ffff8810774b8d30  RCX: ffff88087c4548f8
>>>>       RDX: 0000000000000030  RSI: ffff880876dce000  RDI: ffffffff81398045
>>>>       RBP: ffff880c77c6be50   R8: ffff000000000000   R9: ffff880c77c6b900
>>>>       R10: ffff880c77c6b8f0  R11: 0000000000000030  R12: 0000000000000030
>>>>       R13: ffff8810774b8d20  R14: ffff880c7caa00c0  R15: ffffffffa023ecca
>>>>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>>>    #7 [ffff880c77c6be58] process_recv_sockets at ffffffffa023ecea
>>>>    #8 [ffff880c77c6be78] worker_thread at ffffffff81071802
>>>>    #9 [ffff880c77c6bee8] kthread at ffffffff810756d3
>>>> #10 [ffff880c77c6bf48] kernel_thread at ffffffff81012dea