<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#ffffff">
Hi,<br>
<br>
Thanks for your answer. <br>
SCTP is selected automatically ("dlm: Using SCTP for
communications") and I see no way to change that:
/sys/kernel/config/dlm/cluster/protocol is set to 1, and the file is only
available once the service is started. Can I still modify it at that point?<br>
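For reference, here is how I check and try to force the transport from the shell. This is only a sketch based on my understanding that 0 selects TCP and 1 selects SCTP in that configfs file; I believe it only takes effect if written before the first lockspace is joined, and dlm_controld may rewrite it (it seems to pick SCTP on its own when corosync runs with redundant rings):<br>

```shell
# Show the transport DLM will use: 0 = TCP, 1 = SCTP (my understanding
# of the configfs mapping; please correct me if that's wrong).
cat /sys/kernel/config/dlm/cluster/protocol

# Writing 0 here should force TCP, but only if done before the first
# lockspace is created; once lowcomms is up the value is no longer
# consulted, and dlm_controld may overwrite it anyway.
echo 0 > /sys/kernel/config/dlm/cluster/protocol
```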
<br>
In the meantime, I can give you some details about the problem.<br>
I'm running HA tests between two nodes, and the problem occurs when I
stop the service on one node and then restart it, at which point the
other node hits the BUG:<br>
M1: ocfs-pcmk starts<br>
M2: ocfs-pcmk starts<br>
M2: ocfs-pcmk stops<br>
All fine so far, but:<br>
M2: ocfs-pcmk restarts: M1 hits the BUG!<br>
<br>
I added some traces to the code. From what I understand:<br>
The first connection initializes a DLM connection with a node id and an
address.<br>
The second connection tries to reuse the first structure. It has the
node id but fails to find the matching address, then tries to look up
the node id from the address, with no more success.<br>
Yet the node id can be found in the data from the address:<br>
<tt>02 00 00 00 <b>0b 01 00 02</b> 00 00 00 00
00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00</tt><br>
<tt>0b 01 00 02 read little-endian = 0x0200010b = node id 33554699</tt><br>
<br>
I'm new to this stack, and it's hard to understand the role of each
component.<br>
<br>
Regards, <br>
<br>
Benoit<br>
<br>
Details:<br>
<br>
<tt>Starting DLM/OCFS :<br>
<br>
DLM (built Oct 21 2010 11:14:24) installed<br>
ocfs2: Registered cluster interface user<br>
OCFS2 Node Manager 1.6.3<br>
OCFS2 1.6.3<br>
dlm: Using SCTP for communications<br>
SCTP: Hash tables configured (established 65536 bind 65536)<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: joining the lockspace group...<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: group event done 0 0<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 16777483<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: total members 1 error 0<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: dlm_recover_directory 0 entries<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 1 done: 0 ms<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: join complete<br>
ocfs2: Mounting device (253,6) on (node 1677748, slot 0) with ordered
data mode.<br>
<br>
> Here, OCFS is available on both nodes<br>
<br>
dlm: closing connection to node 33554699<br>
<br>
> Here, OCFS is down on M2, then it restarts:</tt><br>
<br>
<tt>dlm: 77410678764B4782BDAE3E888E0C8C4D: recover 3<br>
dlm: 77410678764B4782BDAE3E888E0C8C4D: add member 33554699<br>
cm->sockaddr_storage ffff8804796268a0<br>
dlm_nodeid_to_addr -EEXIST; <br>
dlm: no address for nodeid 33554699<br>
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924<br>
sctp_packet_config: packet:ffff880c8e6037d0 vtag:0x2b186924<br>
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924<br>
process_sctp_notification SCTP_RESTART <br>
get_comm nodeid : 0, sockaddr_storage : ffff88047a357c54 <br>
get_comm cm->addr_count : 0000000000000001, cm->addr[0] :
ffff880479626740 <br>
addr : ffff88047a357c54 <br>
dlm: reject connect from unknown addr<br>
02 00 00 00 0b 01 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00<br>
in __nodeid2con , nodeid : 0 <br>
sctp_packet_config: packet:ffff88107cf36598 vtag:0x2b186924<br>
------------[ cut here ]------------<br>
kernel BUG at fs/dlm/lowcomms.c:661!<br>
invalid opcode: 0000 [#1] SMP<br>
last sysfs file:
/sys/kernel/dlm/77410678764B4782BDAE3E888E0C8C4D/control<br>
CPU 35<br>
Modules linked in: sctp(U) libcrc32c(U) ocfs2(U) ocfs2_nodemanager(U)
ocfs2_stack_user(U) ocfs2_stackglue(U) dlm(U) configfs(U)
acpi_cpufreq(U) freq_table(U) ipmi_devintf(U) ipmi_si(U)
ipmi_msghandler(U) nfs(U) lockd(U) fscache(U) nfs_acl(U) auth_rpcgss(U)
ext3(U) jbd(U) sunrpc(U) ipv6(U) dm_mirror(U) dm_region_hash(U)
dm_log(U) scsi_dh_emc(U) dm_round_robin(U) dm_multipath(U) ioatdma(U)
i7core_edac(U) igb(U) edac_core(U) sg(U) i2c_i801(U) i2c_core(U)
iTCO_wdt(U) dca(U) iTCO_vendor_support(U) ext4(U) mbcache(U) jbd2(U)
usbhid(U) hid(U) sd_mod(U) crc_t10dif(U) lpfc(U) ahci(U)
scsi_transport_fc(U) ehci_hcd(U) uhci_hcd(U) scsi_tgt(U) dm_mod(U)
[last unloaded: ipmi_msghandler]<br>
Pid: 4306, comm: dlm_recv/35 Not tainted 2.6.32-30.el6.Bull.14.x86_64
#2 bullx super-node<br>
RIP: 0010:[<ffffffffa039edeb>] [<ffffffffa039edeb>]
receive_from_sock+0x38b/0x430 [dlm]<br>
RSP: 0018:ffff88047a357d10 EFLAGS: 00010246<br>
RAX: 0000000000000095 RBX: ffff88087c5aad20 RCX: 0000000000001b62<br>
RDX: 0000000000000000 RSI: 0000000000000046 RDI: 0000000000000246<br>
RBP: ffff88047a357e10 R08: 00000000ffffffff R09: 0000000000000000<br>
R10: ffff881079423480 R11: 0000000000000000 R12: ffff88047a357da0<br>
R13: 0000000000000030 R14: ffff88087c5aad30 R15: ffff88047a357d40<br>
FS: 0000000000000000(0000) GS:ffff880c8e600000(0000)
knlGS:0000000000000000<br>
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b<br>
CR2: 000000359bea6df0 CR3: 0000000001001000 CR4: 00000000000006e0<br>
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000<br>
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400<br>
Process dlm_recv/35 (pid: 4306, threadinfo ffff88047a356000, task
ffff88047aeaf2e0)<br>
Stack:<br>
ffff88047a357db0 0000000000000246 ffff880c7da288c0 0000000000000246<br>
<0> ffff88047a357d70 0000100081049472 0000000000000000
0000000000000000<br>
<0> ffff88047a357d80 0000000000000002 ffff88047a357dd0
0000000000000000<br>
Call Trace:<br>
[<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]<br>
[<ffffffffa039dda6>] process_recv_sockets+0x36/0x50 [dlm]<br>
[<ffffffffa039dd70>] ? process_recv_sockets+0x0/0x50 [dlm]<br>
[<ffffffff8107ce9d>] worker_thread+0x16d/0x290<br>
[<ffffffff81082430>] ? autoremove_wake_function+0x0/0x40<br>
[<ffffffff8107cd30>] ? worker_thread+0x0/0x290<br>
[<ffffffff810820d6>] kthread+0x96/0xa0<br>
[<ffffffff8100d1aa>] child_rip+0xa/0x20<br>
[<ffffffff81082040>] ? kthread+0x0/0xa0<br>
[<ffffffff8100d1a0>] ? child_rip+0x0/0x20<br>
Code: f1 4c 8b 8d 70 ff ff ff 48 c1 e6 0c 4c 01 d6 e8 f1 33 0b e1 4c 89
f7 e8 14 4b 0b e1 31 f6 48 89 df e8 6a f4 ff ff e9 cd fd ff ff
<0f> 0b 0f 1f 00 eb fb 48 c7 c7 e8 9a 3a a0 31 c0 e8 c5 33 0b e1<br>
RIP [<ffffffffa039edeb>] receive_from_sock+0x38b/0x430 [dlm]<br>
RSP <ffff88047a357d10><br>
crash></tt><br>
<br>
<br>
<br>
On 21/10/2010 at 17:13, David Teigland wrote:
<blockquote cite="mid:20101021151333.GA1427@redhat.com" type="cite">
<blockquote type="cite">
<pre wrap="">kernel BUG at fs/dlm/lowcomms.c:647!
</pre>
</blockquote>
<pre wrap="">
That looks like an interesting one, I haven't seen it before.
First ensure dlm is not configured to use sctp (that code is
not widely tested.) Other than that, if you'd like to start
debugging this before I get around to it, replace the BUG_ON
with some printk's and return error. The conn with nodeid 0 is
the listening socket, for which tcp_accept_from_sock() should
be called rather than receive_from_sock().
</pre>
<blockquote type="cite">
<pre wrap=""> KERNEL: /usr/lib/debug/lib/modules/2.6.32-100.0.19.el5/vmlinux
DUMPFILE: /var/var/crash/127.0.0.1-2010-10-18-16:42:07/vmcore
[PARTIAL DUMP]
CPUS: 64
DATE: Mon Oct 18 16:41:48 2010
UPTIME: 00:15:00
LOAD AVERAGE: 1.06, 1.22, 1.65
TASKS: 1594
NODENAME: chili0
RELEASE: 2.6.32-100.0.19.el5
VERSION: #1 SMP Fri Sep 17 17:51:41 EDT 2010
MACHINE: x86_64 (1999 Mhz)
MEMORY: 64 GB
PANIC: "kernel BUG at fs/dlm/lowcomms.c:647!"
PID: 27062
COMMAND: "dlm_recv/34"
TASK: ffff880c7caa00c0 [THREAD_INFO: ffff880c77c6a000]
CPU: 34
STATE: TASK_RUNNING (PANIC)
crash> bt
PID: 27062 TASK: ffff880c7caa00c0 CPU: 34 COMMAND: "dlm_recv/34"
#0 [ffff880c77c6b910] machine_kexec at ffffffff8102cc9b
#1 [ffff880c77c6b990] crash_kexec at ffffffff810964d4
#2 [ffff880c77c6ba60] oops_end at ffffffff81439bd9
#3 [ffff880c77c6ba90] die at ffffffff81015639
#4 [ffff880c77c6bac0] do_trap at ffffffff8143952c
#5 [ffff880c77c6bb10] do_invalid_op at ffffffff81013902
#6 [ffff880c77c6bbb0] invalid_op at ffffffff81012b7b
[exception RIP: receive_from_sock+1364]
RIP: ffffffffa02406c3 RSP: ffff880c77c6bc60 RFLAGS: 00010246
RAX: 0000000000000030 RBX: ffff8810774b8d30 RCX: ffff88087c4548f8
RDX: 0000000000000030 RSI: ffff880876dce000 RDI: ffffffff81398045
RBP: ffff880c77c6be50 R8: ffff000000000000 R9: ffff880c77c6b900
R10: ffff880c77c6b8f0 R11: 0000000000000030 R12: 0000000000000030
R13: ffff8810774b8d20 R14: ffff880c7caa00c0 R15: ffffffffa023ecca
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff880c77c6be58] process_recv_sockets at ffffffffa023ecea
#8 [ffff880c77c6be78] worker_thread at ffffffff81071802
#9 [ffff880c77c6bee8] kthread at ffffffff810756d3
#10 [ffff880c77c6bf48] kernel_thread at ffffffff81012dea
_______________________________________________
Ocfs2-users mailing list
<a class="moz-txt-link-abbreviated" href="mailto:Ocfs2-users@oss.oracle.com">Ocfs2-users@oss.oracle.com</a>
<a class="moz-txt-link-freetext" href="http://oss.oracle.com/mailman/listinfo/ocfs2-users">http://oss.oracle.com/mailman/listinfo/ocfs2-users</a>
</pre>
</blockquote>
</blockquote>
<br>
</body>
</html>