<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<font size="+1">I have replaced a dead node in a DRBD setup that runs in
dual-primary mode with OCFS2. All the steps worked:<br>
<br>
`/proc/drbd`<br>
<br>
version: 8.3.13 (api:88/proto:86-96)<br>
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by
mockbuild@builder10.centos.org, 2012-05-07 11:56:36<br>
<br>
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C
r-----<br>
ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0
pe:0 ua:0 ap:0 ep:1 wo:b oos:0<br>
<br>
until I try to mount the volume:<br>
<br>
mount -t ocfs2 /dev/drbd1 /data/webroot/<br>
mount.ocfs2: Transport endpoint is not connected while
mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more
information on this error.<br>
<br>
`/var/log/kern.log`<br>
<br>
kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no
connection established with node 0 after 30.0 seconds, giving up
and returning errors.<br>
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR:
status = -107<br>
kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210
ERROR: status = -107<br>
kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR:
status = -107<br>
kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR:
status = -107<br>
kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR:
status = -107<br>
kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR:
status = -107<br>
kernel: ocfs2: Unmounting device (147,1) on (node 1)<br>
<br>
I'm sure that `/etc/ocfs2/cluster.conf` is identical on both nodes:<br>
<br>
`/etc/ocfs2/cluster.conf`<br>
<br>
node:<br>
ip_port = 7777<br>
ip_address = 192.168.3.145<br>
number = 0<br>
name = SVR233NTC-3145.localdomain<br>
cluster = cpc<br>
<br>
node:<br>
ip_port = 7777<br>
ip_address = 192.168.2.93<br>
number = 1<br>
name = SVR022-293.localdomain<br>
cluster = cpc<br>
<br>
cluster:<br>
node_count = 2<br>
name = cpc<br>
<br>
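One way to double-check that the two copies really match is a byte-for-byte comparison. This is only a minimal sketch: the helper name `same_config` and the path `/tmp/cluster.conf.node0` are hypothetical, and it assumes the other node's file has already been copied to the local machine.<br>
<br>
```shell
#!/bin/sh
# same_config: hypothetical helper, not part of ocfs2-tools.
# Returns 0 only if the two files are byte-identical (whitespace included).
same_config() {
    # cmp -s prints nothing and reports the result through its exit status
    cmp -s "$1" "$2"
}

# Example use (the second path is an assumption):
# if same_config /etc/ocfs2/cluster.conf /tmp/cluster.conf.node0; then
#     echo "cluster.conf is identical"
# fi
```
<br>
A byte-level comparison also catches differences in tabs and spaces that are easy to miss by eye.<br>
<br>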
The network connection to node 0 on port 7777 also works:<br>
<br>
# nc -z 192.168.3.145 7777<br>
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!<br>
<br>
However, the O2CB heartbeat is not active on the new node
(192.168.2.93):<br>
<br>
`/etc/init.d/o2cb status`<br>
<br>
Driver for "configfs": Loaded<br>
Filesystem "configfs": Mounted<br>
Driver for "ocfs2_dlmfs": Loaded<br>
Filesystem "ocfs2_dlmfs": Mounted<br>
Checking O2CB cluster cpc: Online<br>
Heartbeat dead threshold = 31<br>
Network idle timeout: 30000<br>
Network keepalive delay: 2000<br>
Network reconnect delay: 2000<br>
Checking O2CB heartbeat: Not active<br>
<br>
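To watch for this state while retrying, the status output can be checked from a script. A rough sketch, assuming the exact line `Checking O2CB heartbeat: Not active` shown above; the function name `heartbeat_inactive` is made up for illustration.<br>
<br>
```shell
#!/bin/sh
# heartbeat_inactive: hypothetical helper that reads o2cb status output
# on stdin and returns 0 if it reports the heartbeat as "Not active".
heartbeat_inactive() {
    grep -q '^Checking O2CB heartbeat: Not active'
}

# Example use on the node being checked:
# if /etc/init.d/o2cb status | heartbeat_inactive; then
#     echo "O2CB heartbeat is not active on this node"
# fi
```
<br>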
Here are the results of running `tcpdump` on node 0 while
starting `ocfs2` on node 1:<br>
<br>
1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274
> cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180
TSecr=0<br>
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt
> 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460
TSval=707657223 TSecr=690432180<br>
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274
> cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181
TSecr=707657223<br>
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274
> cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181
TSecr=707657223<br>
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt
> 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223
TSecr=690432181<br>
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt
> 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223
TSecr=690432181<br>
<br>
The same sequence repeats: an `RST` is sent after every 6 packets.<br>
<br>
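To see how often this happens, the resets in a saved text capture can be counted. A small sketch, assuming one packet per line in the text format shown above (the lines are wrapped here only by the mail client); `count_rst` and the file name `capture.txt` are hypothetical.<br>
<br>
```shell
#!/bin/sh
# count_rst: hypothetical helper that reads capture text on stdin and
# prints how many packet lines carry the RST flag.
count_rst() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at 0.
    grep -c '\[RST' || true
}

# Example use (capture.txt is an assumed file name):
# cat capture.txt | count_rst
```
<br>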
What else can I do to debug this?<br>
<br>
PS:<br>
<br>
OCFS2 versions on node 0:<br>
<br>
- ocfs2-tools-1.4.4-1.el5<br>
- ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5<br>
<br>
OCFS2 versions on node 1:<br>
<br>
- ocfs2-tools-1.4.4-1.el5<br>
- ocfs2-2.6.18-308.el5-1.4.7-1.el5<br>
<br>
</font>
</body>
</html>