[Ocfs2-users] (no subject)

Daniel Keisling daniel.keisling at austin.ppdi.com
Mon Sep 22 08:48:32 PDT 2008


Greetings,
 
I have 3 Oracle RAC clusters running on OCFS2 attached to an HP EVA8000
SAN with dm-multipath as my multipath provider.  On Saturday, one of the
EVA8000 controllers (active/active) rebooted.  Out of my 3 different
clusters, at least one node in each cluster rebooted with the following
messages:
 
Sep 20 01:19:38 ausracdb02 kernel:  rport-0:0-5: blocked FC remote port time out: saving binding
Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.0: 0:0203 Devloss timeout on WWPN 50:0:1f:e1:50:b:32:88 NPort xb0c00 Data: x2000008 x7 x7
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00010000
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 58618847
Sep 20 01:19:38 ausracdb02 kernel: device-mapper: multipath: Failing path 65:32.
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00010000
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 58618847
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00010000
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 58618847
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00020008
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 22227855
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00020008
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 1735
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00020008
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 58618943
Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code = 0x00020008
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, sector 58619935
Sep 20 01:19:38 ausracdb02 kernel:  rport-1:0-2: blocked FC remote port time out: saving binding
Sep 20 01:19:38 ausracdb02 multipathd: sdaw: tur checker reports path is down
Sep 20 01:19:38 ausracdb02 multipathd: checker failed path 67:0 in map limsp
Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.1: 1:0203 Devloss timeout on WWPN 50:0:1f:e1:50:b:32:89 NPort x150d00 Data: x2000008 x7 x6
Sep 20 01:19:38 ausracdb02 kernel: sd 1:0:0:2: SCSI error: return code = 0x00010000
Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sdap, sector 32776295
<snip>
Sep 20 01:21:03 ausracdb02 kernel: (27,1):o2hb_write_timeout:269 ERROR: Heartbeat write timeout to device dm-11 after 120000 milliseconds
Sep 20 01:21:03 ausracdb02 kernel: Heartbeat thread (27) printing last 24 blocking operations (cur = 19):
Sep 20 01:21:03 ausracdb02 kernel: Heartbeat thread stuck at waiting for read completion, stuffing current time into that blocker (index 19)
Sep 20 01:21:03 ausracdb02 kernel: Index 20: took 0 ms to do submit_bio for read
Sep 20 01:21:03 ausracdb02 kernel: Index 21: took 0 ms to do waiting for read completion
Sep 20 01:21:03 ausracdb02 kernel: Index 22: took 0 ms to do bio alloc write
Sep 20 01:21:03 ausracdb02 kernel: Index 23: took 0 ms to do bio add page write

 
It appears that the heartbeat thread just barely timed out, as the
controller was in the process of coming back up.  My questions are:
 
1) Why did only some nodes in each cluster reboot?
2) Why was there a timeout when the multipathing should have kept the
filesystems available? (A sketch of the multipath.conf stanza I think is
relevant follows this list.)
3) Is there a way to increase the heartbeat timeout above 120000
milliseconds? (See the o2cb sketch after my config below.)
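 
For question 2, here is a minimal sketch of the device stanza I believe is
in play for the EVA8000 under RHEL 5 dm-multipath. This is illustrative
only (the vendor/product strings and values are my assumptions, not a copy
of my running multipath.conf), but it shows the no_path_retry knob that, as
I understand it, decides whether I/O is queued or failed back up the stack
while paths are down:
 
devices {
    device {
        # Illustrative entry for an HP EVA-class array; strings and
        # values here are assumptions, not my production settings
        vendor                  "HP"
        product                 "HSV210.*"
        path_grouping_policy    group_by_prio
        path_checker            tur
        failback                immediate
        # "queue" holds I/O while no paths are usable; a number means
        # "retry for that many polling intervals, then fail the I/O"
        no_path_retry           12
    }
}
 
If no_path_retry is effectively "fail", or the retry window is shorter
than the controller reboot, my understanding is that the errors get passed
up to the OCFS2 heartbeat instead of being queued, which would match the
end_request errors above.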
 
My config:
 
Kernel: 2.6.18-53.el5 x86_64 on RHEL 5.1
 
OCFS2: ocfs2-2.6.18-53.el5-1.2.8-2.el5
 
O2CB:
[root at ausracdb02 ~]# /etc/init.d/o2cb status
Module "configfs": Loaded
Filesystem "configfs": Mounted
Module "ocfs2_nodemanager": Loaded
Module "ocfs2_dlm": Loaded
Module "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster racdb: Online
  Heartbeat dead threshold: 61
  Network idle timeout: 60000
  Network keepalive delay: 2000
  Network reconnect delay: 2000
Checking O2CB heartbeat: Active
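 
On question 3: if I am reading the docs right, the 120000 ms in the
o2hb_write_timeout message comes straight from the dead threshold above,
i.e. timeout = (threshold - 1) * 2 seconds, so 61 gives 120 s. Is the
right way to raise it simply to bump O2CB_HEARTBEAT_THRESHOLD on every
node and restart the stack? A sketch of what I would try (the variable
names are what I expect to find in /etc/sysconfig/o2cb on RHEL; please
correct me if that is wrong), aiming for roughly 240 s:
 
# /etc/sysconfig/o2cb (sketch, not my live file)
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=racdb
# (240 s / 2) + 1 = 121 heartbeat iterations before a node fences itself
O2CB_HEARTBEAT_THRESHOLD=121
 
followed by unmounting the OCFS2 volumes and running
"/etc/init.d/o2cb restart" on all nodes, since I assume every node has to
agree on the same threshold.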

Cluster1 has two nodes, Cluster2 has two nodes, and Cluster3 has four
nodes. Cluster1 and Cluster2 each had one node reboot, while Cluster3 had
two nodes reboot.
 
 
TIA,

Daniel
