[Ocfs2-users] (no subject)

Mon Sep 22 11:41:17 PDT 2008

The node was fenced because the heartbeat io write took more than 2 mins.
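
For reference, the 2 mins matches the "Heartbeat dead threshold: 61" shown in the o2cb status output quoted below: with the usual 2 second heartbeat interval, the timeout is (threshold - 1) * 2 seconds. A rough Python sketch of that arithmetic follows. The function names are made up for illustration, and the 2 second interval is an assumption that agrees with the 120000 milliseconds in the log. On this release the threshold is normally set through O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb, but verify that against your own install before changing anything.

```python
# Sketch only: relates the O2CB heartbeat dead threshold to the disk
# heartbeat timeout. Helper names are hypothetical, not part of any tool.

HEARTBEAT_INTERVAL_S = 2  # assumed o2hb interval between heartbeat writes


def o2hb_timeout_ms(dead_threshold: int) -> int:
    """Timeout in milliseconds implied by a given dead threshold."""
    return (dead_threshold - 1) * HEARTBEAT_INTERVAL_S * 1000


def threshold_for_timeout(timeout_s: int) -> int:
    """Dead threshold needed to get a timeout of timeout_s seconds."""
    return timeout_s // HEARTBEAT_INTERVAL_S + 1


print(o2hb_timeout_ms(61))         # 120000, the value in the log
print(threshold_for_timeout(180))  # 91, i.e. a 3 minute timeout
```

The threshold has to be the same on every node in the cluster, and a change only takes effect after the cluster stack is restarted on each node.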

Since you have provided only a snippet of the logs, all I can
say is that multipathd detecting the path failure (01:19:38) and
o2hb fencing (01:21:03) are about 85 secs apart. I don't see
anything that shows the heartbeat only barely timed out.
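
As a quick check, the gap can be computed from two timestamps in the quoted log below: the first multipathd path failure and the o2hb write timeout. A minimal Python sketch. The helper name gap_seconds is made up for illustration, and the year is assumed to be 2008 because syslog lines do not carry one.

```python
from datetime import datetime

# Sketch only: measure the time between two syslog-style timestamps.
# gap_seconds is a hypothetical helper; the year is an assumption.


def gap_seconds(start: str, end: str, year: int = 2008) -> float:
    """Seconds between two 'Mon DD HH:MM:SS' syslog timestamps."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return (t1 - t0).total_seconds()


path_failed = "Sep 20 01:19:38"  # multipathd: checker failed path 67:0
fenced = "Sep 20 01:21:03"       # o2hb_write_timeout after 120000 ms

print(gap_seconds(path_failed, fenced))  # 85.0
```

With a 120 second heartbeat timeout, a gap of 85 seconds would suggest the heartbeat writes had already been stalled for a while before multipathd marked the path as failed, though the full logs are needed to confirm that.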

Check your multipath settings and configuration.

Daniel Keisling wrote:
> Greetings,
>  
> I have 3 Oracle RAC clusters running on OCFS2 attached to an HP 
> EVA8000 SAN with dm-multipath as my multipath provider.  On Saturday, 
> one of the EVA8000 controllers (active/active) rebooted.  Out of my 3 
> different clusters, at least one node in each cluster rebooted with the 
> following messages:
>  
> Sep 20 01:19:38 ausracdb02 kernel:  rport-0:0-5: blocked FC remote 
> port time out: saving binding
> Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.0: 0:0203 Devloss 
> timeout on WWPN 50:0:1f:e1:50:b:32:88 NPort xb0c00 Data: x2000008 x7 x7
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 58618847
> Sep 20 01:19:38 ausracdb02 kernel: device-mapper: multipath: Failing 
> path 65:32.
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 58618847
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 58618847
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 22227855
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 1735
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 58618943
> Sep 20 01:19:38 ausracdb02 kernel: sd 0:0:3:9: SCSI error: return code 
> = 0x00020008
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sds, 
> sector 58619935
> Sep 20 01:19:38 ausracdb02 kernel:  rport-1:0-2: blocked FC remote 
> port time out: saving binding
> Sep 20 01:19:38 ausracdb02 multipathd: sdaw: tur checker reports path 
> is down
> Sep 20 01:19:38 ausracdb02 multipathd: checker failed path 67:0 in map 
> limsp
> Sep 20 01:19:38 ausracdb02 kernel: lpfc 0000:0e:00.1: 1:0203 Devloss 
> timeout on WWPN 50:0:1f:e1:50:b:32:89 NPort x150d00 Data: x2000008 x7 x6
> Sep 20 01:19:38 ausracdb02 kernel: sd 1:0:0:2: SCSI error: return code 
> = 0x00010000
> Sep 20 01:19:38 ausracdb02 kernel: end_request: I/O error, dev sdap, 
> sector 32776295
> <snip>
> Sep 20 01:21:03 ausracdb02 kernel: (27,1):o2hb_write_timeout:269 
> ERROR: Heartbeat write timeout to device dm-11 after 120000 milliseconds
> Sep 20 01:21:03 ausracdb02 kernel: Heartbeat thread (27) printing last 
> 24 blocking operations (cur = 19):
> Sep 20 01:21:03 ausracdb02 kernel: Heartbeat thread stuck at waiting 
> for read completion, stuffing current time into that blocker (index 19)
> Sep 20 01:21:03 ausracdb02 kernel: Index 20: took 0 ms to do 
> submit_bio for read
> Sep 20 01:21:03 ausracdb02 kernel: Index 21: took 0 ms to do waiting 
> for read completion
> Sep 20 01:21:03 ausracdb02 kernel: Index 22: took 0 ms to do bio alloc 
> write
> Sep 20 01:21:03 ausracdb02 kernel: Index 23: took 0 ms to do bio add 
> page write
>  
> It appears that the heartbeat thread just barely timed out, as the 
> controller was in the process of coming back up.  My questions are:
>  
> 1) Why did only some nodes in each cluster reboot?
> 2) Why was there a timeout when the multipathing should have kept the 
> filesystems up?
> 3) Is there a way to increase the heartbeat timeout above 120000 
> milliseconds?
>  
> My config:
>  
> Kernel: 2.6.18-53.el5 x86_64 on RHEL 5.1
>  
> OCFS2: ocfs2-2.6.18-53.el5-1.2.8-2.el5
>  
> O2CB:
> [root@ausracdb02 ~]# /etc/init.d/o2cb status
> Module "configfs": Loaded
> Filesystem "configfs": Mounted
> Module "ocfs2_nodemanager": Loaded
> Module "ocfs2_dlm": Loaded
> Module "ocfs2_dlmfs": Loaded
> Filesystem "ocfs2_dlmfs": Mounted
> Checking O2CB cluster racdb: Online
>   Heartbeat dead threshold: 61
>   Network idle timeout: 60000
>   Network keepalive delay: 2000
>   Network reconnect delay: 2000
> Checking O2CB heartbeat: Active
> Cluster1 has two nodes, Cluster2 has two nodes, and Cluster3 has four 
> nodes.  Cluster1 and Cluster2 had one node reboot while Cluster3 had 
> two nodes reboot.
>  
>  
> TIA,
>
> Daniel