[Ocfs2-users] Node fence on RHEL4 machine running 1.2.8-2
Guy Waugh
gwaugh at scu.edu.au
Sun Jul 13 22:19:44 PDT 2008
Hello,
We have a four-node RHEL4 RAC cluster running OCFS2 version 1.2.8-2 and
the 2.6.9-67.0.4hugemem kernel. The cluster has been really stable since
we upgraded to 1.2.8-2 early this year, but this morning one of the
nodes fenced and rebooted itself, and I wonder if anyone could glance at
the remote syslogs below and offer an opinion as to why.
First, here's the output of 'service o2cb status' on node1:
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold: 61
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
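For what it's worth, those o2cb settings seem consistent with the timeouts
that show up in the logs below. Here's a quick back-of-the-envelope check;
it assumes the stock 2-second disk heartbeat interval and the usual
(threshold - 1) formula, so treat it as a sketch rather than gospel:

```python
# Rough arithmetic relating the o2cb settings to the logged timeouts.
# Assumes o2hb writes its heartbeat slot every ~2 s (the default) and
# that a node is declared dead after (threshold - 1) intervals.
HB_INTERVAL_MS = 2000
hb_dead_threshold = 61          # "Heartbeat dead threshold" above
net_idle_timeout_ms = 30000     # "Network idle timeout" above

hb_dead_window_ms = (hb_dead_threshold - 1) * HB_INTERVAL_MS
print(hb_dead_window_ms)        # 120000 -- matches the o2hb_write_timeout below
print(net_idle_timeout_ms)      # 30000  -- matches the ~30 s o2net idle gap
```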
And now, the issue. It actually looks like Issue 960 on the OCFS2 Issues
register page, but there's not a lot of detail in that issue. Anyway...
Please note that node4 didn't have remote syslogging set up when this
happened, but it was logging the same things as nodes 2 and 3
('dlm_send_remote_convert_request' and 'dlm_wait_for_node_death').
At 5:35:53 this morning, all local disk logging activity on node1
stopped. Our Nagios server could continue to ping node1 during this
time. Meanwhile, in the remote syslogs, this started happening:
Jul 14 05:36:22 node1 (27575,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977752.923279 now 1215977782.914919 dr 1215977754.923101 adv 1215977752.923285:1215977752.923288 func (88eefa53:504) 1215977375.25138:1215977375.25147)
Jul 14 05:36:23 node1 (27575,0):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977753.136223 now 1215977783.127854 dr 1215977755.136316 adv 1215977753.136228:1215977753.136232 func (88eefa53:504) 1215977371.240456:1215977371.240464)
Jul 14 05:36:23 node3 (0,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977753.135942 now 1215977783.128448 dr 1215977753.135905 adv 1215977753.135958:1215977753.135961 func (88eefa53:505) 1215977371.240486:1215977371.240495)
Jul 14 05:36:24 node1 (0,3):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977754.698020 now 1215977784.690378 dr 1215977784.690207 adv 1215977754.698038:1215977754.698042 func (88eefa53:504) 1215977362.803162:1215977362.803177)
Jul 14 05:36:24 node2 (0,2):o2net_idle_timer:1426 here are some times that might help debug the situation: (tmr 1215977754.698019 now 1215977784.689688 dr 1215977754.698002 adv 1215977754.698025:1215977754.698028 func (88eefa53:505) 1215977362.803220:1215977362.803229)
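If I'm reading the o2net_idle_timer output right (the times look like Unix
seconds.microseconds), the socket on node1 had been idle for just under 30
seconds when the timer fired, which lines up with the 30000 ms network idle
timeout:

```python
# Idle time implied by the first node1 o2net_idle_timer entry.
# Assumes tmr/now are Unix timestamps in seconds.microseconds.
tmr = 1215977752.923279   # last activity on the socket, per the log
now = 1215977782.914919   # when the idle timer fired

idle_ms = (now - tmr) * 1000
print(round(idle_ms))     # 29992 -- right at the 30000 ms idle timeout
```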
Jul 14 05:36:54 node3 (9283,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:54 node3 (9283,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:36:54 node2 (11618,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:54 node2 (11618,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:36:59 node3 (9283,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:59 node3 (9283,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:36:59 node2 (11618,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:36:59 node2 (11618,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:37:01 node2 (6049,3):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:37:01 node2 (6049,3):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
Jul 14 05:37:04 node3 (9283,2):dlm_send_remote_convert_request:398
ERROR: status = -107
Jul 14 05:37:04 node3 (9283,2):dlm_wait_for_node_death:365
0E8DC7044BA147F68D1407509F9AF3F3: waiting 5000ms for notification of
death of node 0
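The 'status = -107' looks like a negated errno. A quick check (assuming
Linux errno numbering) suggests it's ENOTCONN, i.e. nodes 2 and 3 had lost
their o2net connection to node 0 (node1):

```python
# Sanity-check that -107 is -ENOTCONN (Linux errno values assumed).
import errno
import os

print(errno.ENOTCONN)               # 107 on Linux
print(os.strerror(errno.ENOTCONN))  # 'Transport endpoint is not connected'
```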
Things went along like this until:
Jul 14 05:55:59 node1 Index 9: took 0 ms to do bio add page write
Jul 14 05:55:59 node1 (13,3):o2hb_write_timeout:269 ERROR: Heartbeat
write timeout to device sdh1 after 120000 milliseconds
Jul 14 05:55:59 node1 Index 3: took 0 ms to do allocating bios for read
Jul 14 05:55:59 node1 Index 4: took 0 ms to do bio alloc read
Jul 14 05:55:59 node1 Heartbeat thread (13) printing last 24 blocking
operations (cur = 0):
Jul 14 05:55:59 node1 Index 5: took 0 ms to do bio add page read
Jul 14 05:55:59 node1 Index 10: took 0 ms to do submit_bio for write
Jul 14 05:55:59 node1 Index 11: took 0 ms to do checking slots
Jul 14 05:55:59 node1 Index 6: took 0 ms to do submit_bio for read
Jul 14 05:55:59 node1 Heartbeat thread stuck at msleep, stuffing
current time into that blocker (index 0)
Jul 14 05:55:59 node1 Index 12: took 0 ms to do waiting for write
completion
Jul 14 05:55:59 node1 Index 1: took 0 ms to do waiting for write
completion
Jul 14 05:55:59 node1 Index 13: took 1999 ms to do msleep
Jul 14 05:55:59 node1 Index 7: took 0 ms to do waiting for read completion
Jul 14 05:55:59 node1 Index 8: took 0 ms to do bio alloc write
Jul 14 05:55:59 node1 Index 2: took 2000 ms to do msleep
Jul 14 05:55:59 node1 Index 14: took 0 ms to do allocating bios for read
Jul 14 05:55:59 node1 Index 15: took 0 ms to do bio alloc read
Jul 14 05:55:59 node1 Index 16: took 0 ms to do bio add page read
Jul 14 05:55:59 node1 Index 17: took 0 ms to do submit_bio for read
Jul 14 05:55:59 node1 Index 18: took 0 ms to do waiting for read
completion
Jul 14 05:55:59 node1 Index 19: took 0 ms to do bio alloc write
Jul 14 05:55:59 node1 Index 20: took 0 ms to do bio add page write
Jul 14 05:55:59 node1 Index 21: took 0 ms to do submit_bio for write
Jul 14 05:55:59 node1 Index 22: took 0 ms to do checking slots
Jul 14 05:55:59 node1 Index 23: took 0 ms to do waiting for write
completion
Jul 14 05:55:59 node1 Index 0: took 119968 ms to do msleep
Jul 14 05:55:59 node1 *** ocfs2 is very sorry to be fencing this
system by restarting ***
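Reading that thread dump, index 0 shows the heartbeat thread stuck in
msleep for almost exactly the 120000 ms write timeout, so the self-fence
looks like the expected response to node1 losing its path to sdh1 rather
than anything more exotic. A rough sanity check:

```python
# Compare the stuck-msleep time from the dump against the o2hb
# write timeout from the log line above it.
stuck_ms = 119968          # "Index 0: took 119968 ms to do msleep"
write_timeout_ms = 120000  # "Heartbeat write timeout ... after 120000 milliseconds"

print(write_timeout_ms - stuck_ms)  # 32 -- the thread was ~32 ms shy of
                                    # the full window when the fence fired
```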
The 'dlm_send_remote_convert_request' and 'dlm_wait_for_node_death'
messages on nodes 2 and 3 (and 4) then continued until:
Jul 14 05:58:02 node3 (3542,2):dlm_get_lock_resource:921
98F84EF9EC254C499F79F8C13C57CF2E:$RECOVERY: at least one node (0)
to recover before lock mastery can begin
Jul 14 05:58:02 node3 (3542,2):dlm_get_lock_resource:955
98F84EF9EC254C499F79F8C13C57CF2E: recovery map is not empty, but must
master $RECOVERY lock now
Jul 14 05:58:02 node2 (3479,2):ocfs2_dlm_eviction_cb:119 device
(8,49): dlm has evicted node 0
Jul 14 05:58:04 node3 (3522,2):dlm_get_lock_resource:921
0E8DC7044BA147F68D1407509F9AF3F3:$RECOVERY: at least one node (0)
to recover before lock mastery can begin
Jul 14 05:58:04 node3 (3522,2):dlm_get_lock_resource:955
0E8DC7044BA147F68D1407509F9AF3F3: recovery map is not empty, but must
master $RECOVERY lock now
Jul 14 05:58:04 node2 (3479,2):ocfs2_dlm_eviction_cb:119 device (8,1):
dlm has evicted node 0
Jul 14 05:58:05 node3 (14474,2):dlm_get_lock_resource:921
43BD66C2E3B64F30AF926B7FCBBB0046:$RECOVERY: at least one node (0)
to recover before lock mastery can begin
Jul 14 05:58:05 node3 (14474,2):dlm_get_lock_resource:955
43BD66C2E3B64F30AF926B7FCBBB0046: recovery map is not empty, but must
master $RECOVERY lock now
node1 then started logging its boot messages in /var/log/messages at
5:59:58. node1 came up OK and all was well.
Thanks,
Guy.