[Ocfs2-users] Node fence on RHEL4 machine running 1.2.8-2

Guy Waugh gwaugh at scu.edu.au
Sun Jul 13 22:19:44 PDT 2008


Hello,

We have a four-node RHEL4 RAC cluster running OCFS2 version 1.2.8-2 and 
the 2.6.9-67.0.4hugemem kernel. The cluster has been really stable since 
we upgraded to 1.2.8-2 early this year, but this morning one of the 
nodes fenced and rebooted itself. I wonder if anyone could glance at the 
remote syslogs below and offer an opinion as to why.

First, here's the output of 'service o2cb status' on node1:

Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
   Heartbeat dead threshold: 61
   Network idle timeout: 30000
   Network keepalive delay: 2000
   Network reconnect delay: 2000
Checking O2CB heartbeat: Active
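
If I've read the OCFS2 tuning docs correctly, the heartbeat dead 
threshold above translates to a disk heartbeat window of roughly 
(61 - 1) * 2000 ms = 120000 ms, which is the same 120000 ms that shows 
up in the fence messages further down. A quick back-of-the-envelope 
check (this is just my reading of the docs, so please correct me if 
the formula is wrong):

# Timeout windows implied by the o2cb settings above. The
# (dead threshold - 1) * 2000 ms rule is my understanding of the
# OCFS2 docs, not something I've checked in the source.
dead_threshold = 61
network_idle_timeout_ms = 30000
heartbeat_window_ms = (dead_threshold - 1) * 2000
print(heartbeat_window_ms)        # 120000 ms, the figure in the o2hb_write_timeout message below
print(network_idle_timeout_ms)    # 30000 ms, the gap seen in the o2net_idle_timer lines below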

And now, the issue. It actually looks like Issue 960 on the OCFS2 Issues 
register page, but there's not a lot of detail in that issue. Anyway...

Please note that node4 didn't have remote syslogging set up when this 
happened, but it was logging the same messages as nodes 2 and 3 
('dlm_send_remote_convert_request' and 'dlm_wait_for_node_death').

At 5:35:53 this morning, all local disk logging activity on node1 
stopped. Our Nagios server was still able to ping node1 during this 
time. Meanwhile, in the remote syslogs, this started happening:

Jul 14 05:36:22 node1  (27575,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977752.923279 now 1215977782.914919 dr 1215977754.923101 adv 1215977752.923285:1215977752.923288 func (88eefa53:504) 1215977375.25138:1215977375.25147)
Jul 14 05:36:23 node1  (27575,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977753.136223 now 1215977783.127854 dr 1215977755.136316 adv 1215977753.136228:1215977753.136232 func (88eefa53:504) 1215977371.240456:1215977371.240464)
Jul 14 05:36:23 node3  (0,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977753.135942 now 1215977783.128448 dr 1215977753.135905 adv 1215977753.135958:1215977753.135961 func (88eefa53:505) 1215977371.240486:1215977371.240495)
Jul 14 05:36:24 node1  (0,3):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977754.698020 now 1215977784.690378 dr 1215977784.690207 adv 1215977754.698038:1215977754.698042 func (88eefa53:504) 1215977362.803162:1215977362.803177)
Jul 14 05:36:24 node2  (0,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977754.698019 now 1215977784.689688 dr 1215977754.698002 adv 1215977754.698025:1215977754.698028 func (88eefa53:505) 1215977362.803220:1215977362.803229)
Jul 14 05:36:54 node3  (9283,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:54 node3  (9283,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:36:54 node2  (11618,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:54 node2  (11618,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:36:59 node3  (9283,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:59 node3  (9283,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:36:59 node2  (11618,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:59 node2  (11618,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:37:01 node2  (6049,3):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:37:01 node2  (6049,3):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:37:04 node3  (9283,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:37:04 node3  (9283,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
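
For what it's worth, 'status = -107' appears to be -ENOTCONN 
("Transport endpoint is not connected"), and the tmr/now pairs in the 
o2net_idle_timer lines above are about 30 seconds apart, i.e. the 
30000 ms network idle timeout. My back-of-the-envelope check, in case 
I've misread the numbers:

# Quick check of the numbers in the log excerpt above (Linux errno values).
import errno, os
print(errno.ENOTCONN)                 # 107, so status = -107 looks like -ENOTCONN
print(os.strerror(errno.ENOTCONN))    # "Transport endpoint is not connected"
tmr = 1215977752.923279               # from the first node1 o2net_idle_timer line
now = 1215977782.914919
print(round((now - tmr) * 1000))      # ~29992 ms, right on the 30000 ms idle timeout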

Things went along like this until:

Jul 14 05:55:59 node1  Index 9: took 0 ms to do bio add page write
Jul 14 05:55:59 node1  (13,3):o2hb_write_timeout:269 ERROR: Heartbeat
write timeout to device sdh1 after 120000 milliseconds
Jul 14 05:55:59 node1  Index 3: took 0 ms to do allocating bios for read
Jul 14 05:55:59 node1  Index 4: took 0 ms to do bio alloc read
Jul 14 05:55:59 node1  Heartbeat thread (13) printing last 24 blocking
operations (cur = 0):
Jul 14 05:55:59 node1  Index 5: took 0 ms to do bio add page read
Jul 14 05:55:59 node1  Index 10: took 0 ms to do submit_bio for write
Jul 14 05:55:59 node1  Index 11: took 0 ms to do checking slots
Jul 14 05:55:59 node1  Index 6: took 0 ms to do submit_bio for read
Jul 14 05:55:59 node1  Heartbeat thread stuck at msleep, stuffing
current time into that blocker (index 0)
Jul 14 05:55:59 node1  Index 12: took 0 ms to do waiting for write
completion
Jul 14 05:55:59 node1  Index 1: took 0 ms to do waiting for write
completion
Jul 14 05:55:59 node1  Index 13: took 1999 ms to do msleep
Jul 14 05:55:59 node1  Index 7: took 0 ms to do waiting for read completion
Jul 14 05:55:59 node1  Index 8: took 0 ms to do bio alloc write
Jul 14 05:55:59 node1  Index 2: took 2000 ms to do msleep
Jul 14 05:55:59 node1  Index 14: took 0 ms to do allocating bios for read
Jul 14 05:55:59 node1  Index 15: took 0 ms to do bio alloc read
Jul 14 05:55:59 node1  Index 16: took 0 ms to do bio add page read
Jul 14 05:55:59 node1  Index 17: took 0 ms to do submit_bio for read
Jul 14 05:55:59 node1  Index 18: took 0 ms to do waiting for read
completion
Jul 14 05:55:59 node1  Index 19: took 0 ms to do bio alloc write
Jul 14 05:55:59 node1  Index 20: took 0 ms to do bio add page write
Jul 14 05:55:59 node1  Index 21: took 0 ms to do submit_bio for write
Jul 14 05:55:59 node1  Index 22: took 0 ms to do checking slots
Jul 14 05:55:59 node1  Index 23: took 0 ms to do waiting for write
completion
Jul 14 05:55:59 node1  Index 0: took 119968 ms to do msleep
Jul 14 05:55:59 node1  *** ocfs2 is very sorry to be fencing this
system by restarting ***
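
Reading that blocker dump, the only entry with any significant time 
against it is index 0, the 119968 ms msleep (the other msleep entries 
are the normal ~2000 ms), and the write timeout fired at 120000 ms, 
which, if my arithmetic further up is right, is exactly the 
(61 - 1) * 2000 ms window from our dead threshold. So the fence itself 
looks like the expected reaction once heartbeat writes stopped 
completing; it's the roughly 20 minutes of silence on node1 beforehand 
that I can't explain. For what it's worth:

# Comparing the blocker dump above with the configured window, assuming
# the (dead threshold - 1) * 2000 ms rule from my earlier sketch holds.
write_timeout_ms = (61 - 1) * 2000    # 120000, as in the o2hb_write_timeout line
longest_blocker_ms = 119968           # index 0 (msleep) in the dump above
print(write_timeout_ms - longest_blocker_ms)   # 32 ms left of the window when it fired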

The 'dlm_send_remote_convert_request' and 'dlm_wait_for_node_death' 
messages on nodes 2 and 3 (and 4) then continued until:

Jul 14 05:58:02 node3  (3542,2):dlm_get_lock_resource:921 98F84EF9EC254C499F79F8C13C57CF2E:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jul 14 05:58:02 node3  (3542,2):dlm_get_lock_resource:955 98F84EF9EC254C499F79F8C13C57CF2E: recovery map is not empty, but must master $RECOVERY lock now
Jul 14 05:58:02 node2  (3479,2):ocfs2_dlm_eviction_cb:119 device (8,49): dlm has evicted node 0
Jul 14 05:58:04 node3  (3522,2):dlm_get_lock_resource:921 0E8DC7044BA147F68D1407509F9AF3F3:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jul 14 05:58:04 node3  (3522,2):dlm_get_lock_resource:955 0E8DC7044BA147F68D1407509F9AF3F3: recovery map is not empty, but must master $RECOVERY lock now
Jul 14 05:58:04 node2  (3479,2):ocfs2_dlm_eviction_cb:119 device (8,1): dlm has evicted node 0
Jul 14 05:58:05 node3  (14474,2):dlm_get_lock_resource:921 43BD66C2E3B64F30AF926B7FCBBB0046:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Jul 14 05:58:05 node3  (14474,2):dlm_get_lock_resource:955 43BD66C2E3B64F30AF926B7FCBBB0046: recovery map is not empty, but must master $RECOVERY lock now

node1 then started logging its boot messages in /var/log/messages at 
5:59:58. It came back up OK and all was well.

Thanks,
Guy.


