[Ocfs2-users] kernel BUG at fs/dlm/lowcomms.c:647!

Welterlen Benoit Benoit.Welterlen at bull.net
Thu Oct 21 09:11:36 PDT 2010


Hi,

Thanks for your answer.
SCTP is selected automatically ("dlm: Using SCTP for communications") and 
I see no option to change that: /sys/kernel/config/dlm/cluster/protocol is 
set to 1, and the file only appears once the service is started. Can I 
still modify it at that point?
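
If it can in fact be changed before the lockspace is created, I suppose it 
comes down to writing 0 into that configfs attribute early enough. Below is 
a rough, untested sketch of what I have in mind, assuming the usual 
0 = TCP / 1 = SCTP convention for that attribute and that it is still 
writable at that point:

/* force_dlm_tcp.c - rough, untested sketch: write 0 (TCP) to the dlm
 * configfs protocol attribute before any lockspace is joined.  Assumes
 * 0 means TCP and 1 means SCTP, and that the attribute is writable. */
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
        const char *path = "/sys/kernel/config/dlm/cluster/protocol";
        FILE *f = fopen(path, "w");

        if (!f) {
                fprintf(stderr, "open %s: %s\n", path, strerror(errno));
                return 1;
        }
        if (fputs("0\n", f) == EOF || fclose(f) == EOF) {
                fprintf(stderr, "write %s: %s\n", path, strerror(errno));
                return 1;
        }
        printf("dlm protocol set to 0 (TCP)\n");
        return 0;
}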

In the meantime, I can give you some details about the problem.
I am running HA tests between two nodes, and the problem occurs when I 
stop the service on one node and then restart it: the other node hits 
the BUG.
M1 : ocfs-pcmk starts
M2 : ocfs-pcmk starts
M2 : ocfs-pcmk stops
Everything is fine up to this point, but:
M2 : ocfs-pcmk restarts : M1 hits the BUG!

I added some traces in the code. From what I understand:
The first connection initializes a dlm connection with a node id and an 
address.
The second connection tries to reuse that first structure: it has the 
nodeid but cannot find the matching address, and the reverse lookup from 
the address back to a nodeid fails as well (see the simplified sketch 
after the dump below).
Yet the nodeid is present in the data that comes with the address:
02 00 00 00 *0b 01 00 02* 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00
0x0200010b = node id 33554699 (the bytes 0b 01 00 02 highlighted above, 
read as a little-endian 32-bit value)
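
To make the logic I described above concrete, here is a simplified 
illustration of the two lookups as I understand them. The structure and 
function names are made up for the example; the real code is in 
fs/dlm/config.c and fs/dlm/lowcomms.c, and this is not a copy of it:

/* Illustration only: a toy model of the nodeid <-> address lookup
 * described above.  All names here are invented for the example. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

struct node_entry {
        int nodeid;
        struct sockaddr_storage addr;
        int addr_set;                   /* address filled in? */
};

static struct node_entry table[2];      /* one slot per cluster node */

/* nodeid -> address: fails when the entry was never (re)populated */
static int nodeid_to_addr(int nodeid, struct sockaddr_storage *out)
{
        for (int i = 0; i < 2; i++) {
                if (table[i].nodeid == nodeid && table[i].addr_set) {
                        *out = table[i].addr;
                        return 0;
                }
        }
        return -1;              /* "no address for nodeid ..." */
}

/* address -> nodeid: the fallback lookup, which also fails in my trace */
static int addr_to_nodeid(const struct sockaddr_storage *addr, int *nodeid)
{
        for (int i = 0; i < 2; i++) {
                if (table[i].addr_set &&
                    memcmp(&table[i].addr, addr, sizeof(*addr)) == 0) {
                        *nodeid = table[i].nodeid;
                        return 0;
                }
        }
        return -1;              /* "reject connect from unknown addr" */
}

int main(void)
{
        struct sockaddr_storage peer = { 0 };
        int nodeid;

        /* After the remote node restarts, neither lookup succeeds, even
         * though the incoming data itself carries 0x0200010b. */
        if (nodeid_to_addr(33554699, &peer))
                printf("no address for nodeid 33554699\n");
        if (addr_to_nodeid(&peer, &nodeid))
                printf("reject connect from unknown addr\n");
        return 0;
}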

I'm new to this code, so it is hard for me to understand the role of 
each component.

Regards,

Benoit

Details:

Starting DLM/OCFS:

DLM (built Oct 21 2010 11:14:24) installed
ocfs2: Registered cluster interface user
OCFS2 Node Manager 1.6.3
OCFS2 1.6.3
dlm: Using SCTP for communications
SCTP: Hash tables configured (established 65536 bind 65536)
dlm: 77410678764B4782BDAE3E888E0C8C4D: joining the lockspace group...
dlm: 77410678764B4782BDAE3E888E0C8C4D: group event done 0 0
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 16777483
dlm: 77410678764B4782BDAE3E888E0C8C4D: total members 1 error 0
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory 0 entries
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1 done: 0 ms
dlm: 77410678764B4782BDAE3E888E0C8C4D: join complete
ocfs2: Mounting device (253,6) on (node 1677748, slot 0) with ordered 
data mode.

 > Here, OCFS is available on both nodes

dlm: closing connection to node 33554699

 > Here, OCFS is down on M2; then it restarts:

dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699
cm->sockaddr_storage  ffff8804796268a0
dlm_nodeid_to_addr -EEXIST;
dlm: no address for nodeid 33554699
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
process_sctp_notification SCTP_RESTART
get_comm  nodeid : 0, sockaddr_storage : ffff88047a357c54
get_comm cm->addr_count : 0000000000000001, cm->addr[0] : ffff880479626740
addr : ffff88047a357c54
dlm: reject connect from unknown addr
02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00
in __nodeid2con , nodeid : 0
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924
------------[ cut here ]------------
kernel BUG at fs/dlm/lowcomms.c:661!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/kernel/dlm/77410678764B4782BDAE3E888E0C8C4D/control
CPU 35
Modules linked in: sctp(U) libcrc32c(U) ocfs2(U) ocfs2_nodemanager(U) 
ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U) 
acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) 
ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U) dm_region_hash(U) 
dm_log(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) ioatdma(U) 
i7core_edac(U) igb(U) edac_core(U) sg(U) i2c_i801(U) i2c_core(U) 
iTCO_wdt(U) dca(U) iTCO_vendor_support(U) ext4(U) mbcache(U) jbd2(U) 
usbhid(U) hid(U) sd_mod(U) crc_t10dif(U) lpfc(U) ahci(U) 
scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U) scsi_tgt(U) dm_mod(U) [last 
unloaded: ipmi_msghandler]

Modules linked in: sctp(U) libcrc32c(U) ocfs2(U) ocfs2_nodemanager(U) 
ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U) 
acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U) 
ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U) 
ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U) dm_region_hash(U) 
dm_log(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) ioatdma(U) 
i7core_edac(U) igb(U) edac_core(U) sg(U) i2c_i801(U) i2c_core(U) 
iTCO_wdt(U) dca(U) iTCO_vendor_support(U) ext4(U) mbcache(U) jbd2(U) 
usbhid(U) hid(U) sd_mod(U) crc_t10dif(U) lpfc(U) ahci(U) 
scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U) scsi_tgt(U) dm_mod(U) [last 
unloaded: ipmi_msghandler]
Pid: 4306, comm: dlm_recv/35 Not tainted 2.6.32-30.el6.Bull.14.x86_64 #2 
bullx super-node
RIP: 0010:[<ffffffffa039edeb>]  [<ffffffffa039edeb>] 
receive_from_sock+0x38b/0x430 [dlm]
RSP: 0018:ffff88047a357d10  EFLAGS: 00010246
RAX: 0000000000000095 RBX: ffff88087c5aad20 RCX: 0000000000001b62
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246
RBP: ffff88047a357e10 R08: 00000000ffffffff R09: 0000000000000000
R10: ffff881079423480 R11: 0000000000000000 R12: ffff88047a357da0
R13: 0000000000000030 R14: ffff88087c5aad30 R15: ffff88047a357d40
FS:  0000000000000000(0000) GS:ffff880c8e600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000359bea6df0 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dlm_recv/35 (pid: 4306, threadinfo ffff88047a356000, task 
ffff88047aeaf2e0)
Stack:
  ffff88047a357db0 0000000000000246 ffff880c7da288c0 0000000000000246
<0> ffff88047a357d70 0000100081049472 0000000000000000 0000000000000000
<0> ffff88047a357d80 0000000000000002 ffff88047a357dd0 0000000000000000
Call Trace:
  [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
  [<ffffffffa039dda6>] process_recv_sockets+0x36/0x50 [dlm]
  [<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]
  [<ffffffff8107ce9d>] worker_thread+0x16d/0x290
  [<ffffffff81082430>] ? autoremove_wake_function+0x0/0x40
  [<ffffffff8107cd30>] ? worker_thread+0x0/0x290
  [<ffffffff810820d6>] kthread+0x96/0xa0
  [<ffffffff8100d1aa>] child_rip+0xa/0x20
  [<ffffffff81082040>] ? kthread+0x0/0xa0
  [<ffffffff8100d1a0>] ? child_rip+0x0/0x20
Code: f1 4c 8b 8d 70 ff ff ff 48 c1 e6 0c 4c 01 d6 e8 f1 33 0b e1 4c 89 
f7 e8 14 4b 0b e1 31 f6 48 89 df e8 6a f4 ff ff e9 cd fd ff ff <0f> 0b 
0f 1f 00 eb fb 48 c7 c7 e8 9a 3a a0 31 c0 e8 c5 33 0b e1
RIP  [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]
  RSP <ffff88047a357d10>
crash>



On 21/10/2010 17:13, David Teigland wrote:
>> kernel BUG at fs/dlm/lowcomms.c:647!
>>      
> That looks like an interesting one, I haven't seen it before.
> First ensure dlm is not configured to use sctp (that code is
> not widely tested.)  Other than that, if you'd like to start
> debugging this before I get around to it, replace the BUG_ON
> with some printk's and return error.  The conn with nodeid 0 is
> the listening socket, for which tcp_accept_from_sock() should
> be called rather than receive_from_sock().
>
>
>    
>>         KERNEL: /usr/lib/debug/lib/modules/2.6.32-100.0.19.el5/vmlinux
>>       DUMPFILE: /var/var/crash/127.0.0.1-2010-10-18-16:42:07/vmcore
>> [PARTIAL DUMP]
>>           CPUS: 64
>>           DATE: Mon Oct 18 16:41:48 2010
>>         UPTIME: 00:15:00
>> LOAD AVERAGE: 1.06, 1.22, 1.65
>>          TASKS: 1594
>>       NODENAME: chili0
>>        RELEASE: 2.6.32-100.0.19.el5
>>        VERSION: #1 SMP Fri Sep 17 17:51:41 EDT 2010
>>        MACHINE: x86_64  (1999 Mhz)
>>         MEMORY: 64 GB
>>          PANIC: "kernel BUG at fs/dlm/lowcomms.c:647!"
>>            PID: 27062
>>        COMMAND: "dlm_recv/34"
>>           TASK: ffff880c7caa00c0  [THREAD_INFO: ffff880c77c6a000]
>>            CPU: 34
>>          STATE: TASK_RUNNING (PANIC)
>>
>> crash>  bt
>> PID: 27062  TASK: ffff880c7caa00c0  CPU: 34  COMMAND: "dlm_recv/34"
>>    #0 [ffff880c77c6b910] machine_kexec at ffffffff8102cc9b
>>    #1 [ffff880c77c6b990] crash_kexec at ffffffff810964d4
>>    #2 [ffff880c77c6ba60] oops_end at ffffffff81439bd9
>>    #3 [ffff880c77c6ba90] die at ffffffff81015639
>>    #4 [ffff880c77c6bac0] do_trap at ffffffff8143952c
>>    #5 [ffff880c77c6bb10] do_invalid_op at ffffffff81013902
>>    #6 [ffff880c77c6bbb0] invalid_op at ffffffff81012b7b
>>       [exception RIP: receive_from_sock+1364]
>>       RIP: ffffffffa02406c3  RSP: ffff880c77c6bc60  RFLAGS: 00010246
>>       RAX: 0000000000000030  RBX: ffff8810774b8d30  RCX: ffff88087c4548f8
>>       RDX: 0000000000000030  RSI: ffff880876dce000  RDI: ffffffff81398045
>>       RBP: ffff880c77c6be50   R8: ffff000000000000   R9: ffff880c77c6b900
>>       R10: ffff880c77c6b8f0  R11: 0000000000000030  R12: 0000000000000030
>>       R13: ffff8810774b8d20  R14: ffff880c7caa00c0  R15: ffffffffa023ecca
>>       ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>>    #7 [ffff880c77c6be58] process_recv_sockets at ffffffffa023ecea
>>    #8 [ffff880c77c6be78] worker_thread at ffffffff81071802
>>    #9 [ffff880c77c6bee8] kthread at ffffffff810756d3
>> #10 [ffff880c77c6bf48] kernel_thread at ffffffff81012dea
>>
>>
>> _______________________________________________
>> Ocfs2-users mailing list
>> Ocfs2-users at oss.oracle.com
>> http://oss.oracle.com/mailman/listinfo/ocfs2-users
>>      
>
>    
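
PS: if I read David's suggestion correctly, the change around the 
assertion would be roughly the following. This is an untested sketch 
against my reading of receive_from_sock() in the 2.6.32 lowcomms.c, 
assuming the BUG_ON fires on a connection whose nodeid is 0 (as the 
"nodeid : 0" trace above suggests); the message text is my own:

        /* inside receive_from_sock(), in place of the BUG_ON: log the
         * connection state and fail the receive so the dlm_recv worker
         * survives instead of taking the machine down */
        if (con->nodeid == 0) {
                printk(KERN_ERR "dlm: receive_from_sock called for the "
                       "listening connection (nodeid 0), sock %p\n",
                       con->sock);
                ret = -EINVAL;
                goto out_close;
        }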
