<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=Content-Type content="text/html; charset=us-ascii">
<META content="MSHTML 6.00.2900.3395" name=GENERATOR></HEAD>
<BODY>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008>Greetings,</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>I have 3 Oracle
RAC clusters running on OCFS2 attached to an HP EVA8000 SAN with
dm-multipath as my multipath provider. On Saturday, one of the EVA8000
controllers (active/active) rebooted. In each of my 3
clusters, at least one node rebooted with the following
messages:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>Sep 20 01:19:38
ausracdb02 kernel: rport-0:0-5: blocked FC remote port time out: saving
binding<BR>Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.0: 0:0203 Devloss
timeout on WWPN 50:0:1f:e1:50:b:32:88 NPort xb0c00 Data: x2000008 x7 x7<BR>Sep
20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code =
0x00010000<BR>Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev
sds, sector 58618847<BR>Sep 20 01:19:38 ausracdb02 kernel: device-mapper:
multipath: Failing path 65:32.<BR>Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9:
SCSI error: return code = 0x00010000<BR>Sep 20 01:19:38 ausracdb02 kernel:
end_request: I/O error, dev sds, sector 58618847<BR>Sep 20 01:19:38 ausracdb02
kernel: sd 0:0:3:9: SCSI error: return code = 0x00010000<BR>Sep 20 01:19:38
ausracdb02 kernel: end_request: I/O error, dev sds, sector 58618847<BR>Sep 20
01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code =
0x00020008<BR>Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev
sds, sector 22227855<BR>Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI
error: return code = 0x00020008<BR>Sep 20 01:19:38 ausracdb02 kernel:
end_request: I/O error, dev sds, sector 1735<BR>Sep 20 01:19:38 ausracdb02
kernel: sd 0:0:3:9: SCSI error: return code = 0x00020008<BR>Sep 20 01:19:38
ausracdb02 kernel: end_request: I/O error, dev sds, sector 58618943<BR>Sep 20
01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code =
0x00020008<BR>Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev
sds, sector 58619935<BR>Sep 20 01:19:38 ausracdb02 kernel: rport-1:0-2:
blocked FC remote port time out: saving binding<BR>Sep 20 01:19:38 ausracdb02
multipathd: sdaw: tur checker reports path is down<BR>Sep 20 01:19:38 ausracdb02
multipathd: checker failed path 67:0 in map limsp<BR>Sep 20 01:19:38 ausracdb02
kernel: lpfc 0000:0e:00.1: 1:0203 Devloss timeout on WWPN 50:0:1f:e1:50:b:32:89
NPort x150d00 Data: x2000008 x7 x6<BR>Sep 20 01:19:38 ausracdb02 kernel: sd
1:0:0:2: SCSI error: return code = 0x00010000<BR>Sep 20 01:19:38 ausracdb02
kernel: end_request: I/O error, dev sdap, sector
32776295<BR>&lt;snip&gt;</SPAN></FONT></DIV>
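<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>For reference, a
minimal sketch of how I read the two SCSI return codes in the messages above. It
assumes the usual Linux layout of the result word (driver byte, host byte,
message byte, status byte, from high to low) and only names the few values that
matter here; the function name decode_scsi_result is my own and not part of any
tool.</SPAN></FONT></DIV>
<PRE>
```python
# Split a Linux SCSI result word into its four bytes.
# Assumed layout, high to low: driver byte, host byte, message byte, status byte.
# Only the values relevant to the log above are named; others print as hex.
HOST_BYTES = {0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY"}
STATUS_BYTES = {0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY"}


def decode_scsi_result(code):
    """Return the four fields of a SCSI return code as a dict."""
    status = code % 0x100
    message = (code // 0x100) % 0x100
    host = (code // 0x10000) % 0x100
    driver = (code // 0x1000000) % 0x100
    return {
        "driver": driver,
        "host": HOST_BYTES.get(host, hex(host)),
        "message": message,
        "status": STATUS_BYTES.get(status, hex(status)),
    }


# The two return codes that appear in the log excerpt.
for code in (0x00010000, 0x00020008):
    print(hex(code), decode_scsi_result(code))
```
</PRE>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>Read this way,
0x00010000 is a host-level "no connect" and 0x00020008 is a host-level "bus
busy" with a BUSY device status, which would fit a controller going away and
then coming back.</SPAN></FONT></DIV>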
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>Sep 20 01:21:03
ausracdb02 kernel: (27,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout
to device dm-11 after 120000 milliseconds<BR>Sep 20 01:21:03 ausracdb02 kernel:
Heartbeat thread (27) printing last 24 blocking operations (cur = 19):<BR>Sep 20
01:21:03 ausracdb02 kernel: Heartbeat thread stuck at waiting for read
completion, stuffing current time into that blocker (index 19)<BR>Sep 20
01:21:03 ausracdb02 kernel: Index 20: took 0 ms to do submit_bio for read<BR>Sep
20 01:21:03 ausracdb02 kernel: Index 21: took 0 ms to do waiting for read
completion<BR>Sep 20 01:21:03 ausracdb02 kernel: Index 22: took 0 ms to do bio
alloc write<BR>Sep 20 01:21:03 ausracdb02 kernel: Index 23: took 0 ms to do bio
add page write<BR></SPAN></FONT></DIV>
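<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>As a rough check on
the timing, a small sketch that works backward from the timestamps in the two
log excerpts. It assumes the 120000 milliseconds is counted from the last
heartbeat write that completed, which is my reading of the message and not
something the log states; the helper names are mine.</SPAN></FONT></DIV>
<PRE>
```python
# Work backward from the log timestamps to estimate when heartbeat I/O stalled.
# Assumption: the 120000 ms window starts at the last completed heartbeat write.
TIMEOUT_MS = 120000


def seconds_of_day(hms):
    """Convert an HH:MM:SS string to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def format_hms(total):
    """Convert seconds since midnight back to HH:MM:SS."""
    return "%02d:%02d:%02d" % (total // 3600, total % 3600 // 60, total % 60)


devloss = seconds_of_day("01:19:38")     # first Devloss timeout message
hb_timeout = seconds_of_day("01:21:03")  # o2hb_write_timeout message

last_good_write = hb_timeout - TIMEOUT_MS // 1000
print("estimated last good heartbeat write:", format_hms(last_good_write))
print("seconds from devloss message to heartbeat timeout:", hb_timeout - devloss)
print("seconds heartbeat was already stalled at devloss:", devloss - last_good_write)
```
</PRE>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>Under that
assumption the heartbeat I/O on this node had already been stalled for about 35
seconds when the Devloss messages were logged, and it timed out 85 seconds
after them.</SPAN></FONT></DIV>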
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008></SPAN></FONT><FONT
face=Arial size=2><SPAN class=812463315-22092008></SPAN></FONT><FONT face=Arial
size=2><SPAN class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>It appears that the
heartbeat thread just barely timed out, as the controller was in the process of
coming back up. My questions are:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>1) Why did only some
nodes in each cluster reboot?</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>2) Why was there a
timeout when the multipathing should have kept the filesystems
up?</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>3) Is there a way to
increase the heartbeat timeout above 120000 milliseconds?</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>My
config:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>Kernel:
2.6.18-53.el5 x86_64 on RHEL 5.1</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>OCFS2:
ocfs2-2.6.18-53.el5-1.2.8-2.el5</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008>O2CB:</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>[root@ausracdb02 ~]#
/etc/init.d/o2cb status<BR>Module "configfs": Loaded<BR>Filesystem "configfs":
Mounted<BR>Module "ocfs2_nodemanager": Loaded<BR>Module "ocfs2_dlm":
Loaded<BR>Module "ocfs2_dlmfs": Loaded<BR>Filesystem "ocfs2_dlmfs":
Mounted<BR>Checking O2CB cluster racdb: Online<BR> Heartbeat dead
threshold: 61<BR> Network idle timeout: 60000<BR> Network keepalive
delay: 2000<BR> Network reconnect delay: 2000<BR>Checking O2CB heartbeat:
Active<BR></SPAN></FONT></DIV>
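<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>The 120000
milliseconds in the timeout message seems to line up with the heartbeat dead
threshold of 61 shown above. A minimal sketch of that arithmetic, assuming one
disk heartbeat every 2 seconds and a timeout of (threshold - 1) intervals, which
is my understanding of how O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb is
interpreted; the 180 second target is only a hypothetical
example.</SPAN></FONT></DIV>
<PRE>
```python
# Relate the O2CB heartbeat dead threshold to the disk heartbeat timeout.
# Assumption: one heartbeat every 2000 ms, timeout = (threshold - 1) intervals.
HEARTBEAT_INTERVAL_MS = 2000


def timeout_ms(dead_threshold):
    """Disk heartbeat timeout in milliseconds for a given dead threshold."""
    return (dead_threshold - 1) * HEARTBEAT_INTERVAL_MS


def threshold_for(timeout_seconds):
    """Dead threshold needed to reach a given timeout in seconds."""
    return timeout_seconds * 1000 // HEARTBEAT_INTERVAL_MS + 1


print(timeout_ms(61))      # the threshold reported by o2cb status
print(threshold_for(180))  # hypothetical target of 180 seconds
```
</PRE>
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>If that reading is
right, a threshold of 61 gives exactly the 120000 milliseconds in the log, and a
180 second timeout would need a threshold of 91 set identically on every node
in the cluster.</SPAN></FONT></DIV>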
<DIV><FONT face=Arial size=2><SPAN class=812463315-22092008>Cluster1 has
two nodes, Cluster2 has two nodes, and Cluster3 has four nodes.
Cluster1 and Cluster2 had one node reboot while Cluster3 had two nodes
reboot.</SPAN></FONT></DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008></SPAN></FONT> </DIV>
<DIV><FONT face=Arial size=2><SPAN
class=812463315-22092008>TIA,<BR><BR>Daniel</SPAN></FONT></DIV></BODY></HTML>