[Ocfs2-users] OCFS2 Fencing and Locking MSA500 Array: Help

Deaderick, David (EDS) David.Deaderick at va.gov
Wed Oct 25 14:39:54 PDT 2006


I have a two-node Red Hat Enterprise Linux 4.0 cluster on HP ProLiant
ML350 servers connected to an HP MSA500 through HP 532 SCSI adapters
(cciss driver).
The following list shows the versions of the critical components:
ocfs2console-1.2.1-1                          Mon 28 Aug 2006 05:39:20 PM EDT
ocfs2-2.6.9-42.0.2.ELsmp-1.2.3-1              Mon 28 Aug 2006 05:39:19 PM EDT
ocfs2-2.6.9-42.0.2.ELhugemem-1.2.3-1          Mon 28 Aug 2006 05:39:18 PM EDT
ocfs2-2.6.9-42.0.2.EL-1.2.3-1                 Mon 28 Aug 2006 05:39:17 PM EDT
ocfs2-tools-1.2.1-1                           Mon 28 Aug 2006 05:39:15 PM EDT
oracleasmlib-2.0.2-1                          Mon 28 Aug 2006 05:37:51 PM EDT
oracleasm-2.6.9-42.0.2.ELhugemem-2.0.3-1      Mon 28 Aug 2006 05:37:49 PM EDT
oracleasm-2.6.9-42.0.2.EL-2.0.3-1             Mon 28 Aug 2006 05:37:47 PM EDT
oracleasm-2.6.9-42.0.2.ELsmp-2.0.3-1          Mon 28 Aug 2006 05:37:45 PM EDT
oracleasm-support-2.0.3-1                     Mon 28 Aug 2006 05:37:44 PM EDT
kernel-hugemem-2.6.9-42.0.2.EL                Mon 28 Aug 2006 05:25:32 PM EDT
kernel-doc-2.6.9-42.0.2.EL                    Mon 28 Aug 2006 05:25:29 PM EDT
kernel-hugemem-devel-2.6.9-42.0.2.EL          Mon 28 Aug 2006 05:25:07 PM EDT
kernel-smp-devel-2.6.9-42.0.2.EL              Mon 28 Aug 2006 05:21:45 PM EDT
kernel-smp-2.6.9-42.0.2.EL                    Mon 28 Aug 2006 05:20:51 PM EDT
kernel-utils-2.4-13.1.83                      Mon 28 Aug 2006 05:20:48 PM EDT
kernel-devel-2.6.9-42.0.2.EL                  Mon 28 Aug 2006 04:42:48 PM EDT
kernel-2.6.9-42.0.2.EL                        Mon 28 Aug 2006 04:42:37 PM EDT

Whenever the I/O system is under heavy load (e.g. full database backups
using RMAN), the servers fence and reboot, and then cannot reconnect to
the MSA500.
We must power off both the servers and the MSA500 and restart them.

Where can I start troubleshooting this?

/var/log/messages (Node 2):

Oct 11 05:16:56 vhaispora02 kernel: o2net: connection to node vhaispora01 (num 0) at 192.168.1.1:7777 has been idle for 10 seconds, shutting it down.
Oct 11 05:16:56 vhaispora02 kernel: (0,0):o2net_idle_timer:1309 here are some times that might help debug the situation: (tmr 1160558206.560358 now 1160558216.558300 dr 1160558206.560323 adv 1160558206.560375:1160558206.560379 func (0d6da305:504) 1160552001.561116:1160552001.561125)
Oct 11 05:16:56 vhaispora02 kernel: o2net: no longer connected to node vhaispora01 (num 0) at 192.168.1.1:7777
Oct 11 05:16:59 vhaispora02 kernel: cciss0: unsolicited abort f7010e90
Oct 11 05:16:59 vhaispora02 kernel: cciss0: retrying f7010e90
.
.
.
Oct 11 05:17:18 vhaispora02 kernel: cciss0: f7010550 retried too many times
Oct 11 05:17:18 vhaispora02 kernel: cciss0: unsolicited abort f70107a0
Oct 11 05:17:18 vhaispora02 kernel: cciss0: f70107a0 retried too many times
Oct 11 05:17:18 vhaispora02 kernel: cciss0: unsolicited abort f70109f0
Oct 11 10:35:57 vhaispora02 syslogd 1.4.1: restart.
Oct 11 10:35:57 vhaispora02 syslog: syslogd startup succeeded
Oct 11 10:35:57 vhaispora02 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 11 10:35:57 vhaispora02 kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile at ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 18:00:32 EDT 2006
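
As a sanity check on the o2net message in the Node 2 log, the gap between the `tmr` and `now` values can be computed directly. This is only a small sketch: reading `tmr` as the moment the idle timer was armed is an assumption, not something the log states.

```python
from decimal import Decimal

# Values copied from the o2net_idle_timer line in the Node 2 log.
# Decimal avoids float rounding on microsecond-resolution epoch times.
tmr = Decimal("1160558206.560358")  # assumed: when the idle timer was armed
now = Decimal("1160558216.558300")  # when the idle timer fired

idle_seconds = now - tmr
print(idle_seconds)
```

The result is just under 10 seconds, which matches the "idle for 10 seconds" wording in the o2net message.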

/var/log/messages (Node 1):

Oct 11 05:10:01 vhaispora01 crond(pam_unix)[14577]: session closed for user root
Oct 11 05:14:25 vhaispora01 ntpd[3243]: synchronized to 10.4.31.254, stratum 2
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000250
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70004a0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70004a0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70006f0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70006f0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000940
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000940
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000b90
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000b90
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000de0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000de0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7001030
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7001030
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7001280
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7001280
Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f70014d0
Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f70014d0
.
.
.
Oct 11 05:16:46 vhaispora01 kernel: cciss0: unsolicited abort f7012ca0
Oct 11 05:16:46 vhaispora01 kernel: cciss0: f7012ca0 retried too many times
Oct 11 05:16:47 vhaispora01 kernel: cciss0: unsolicited abort f7012ef0
Oct 11 05:16:47 vhaispora01 kernel: cciss0: f7012ef0 retried too many times
Oct 11 10:35:50 vhaispora01 syslogd 1.4.1: restart.
Oct 11 10:35:50 vhaispora01 syslog: syslogd startup succeeded
Oct 11 10:35:50 vhaispora01 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 11 10:35:50 vhaispora01 kernel: Linux version 2.6.9-42.0.2.ELsmp (bhcompile at ls20-bc1-13.build.redhat.com) (gcc version 3.4.6 20060404 (Red Hat 3.4.6-3)) #1 SMP Thu Aug 17 18:00:32 EDT 2006
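
One possible first step in troubleshooting is to line up the cciss and o2net events from both nodes by time. The following is a minimal sketch, not an established tool: the function name `summarize` and the event labels are hypothetical, and the sample lines are copied from the excerpts above. It assumes the unwrapped syslog format found in /var/log/messages.

```python
import re
from collections import Counter

# Syslog prefix as seen in the excerpts: "Oct 11 05:15:28 host kernel: message"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+kernel:\s+(.*)$")

def summarize(lines):
    """Count cciss/o2net events per host and record when each first appeared."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a kernel message in the expected format
        stamp, host, msg = m.groups()
        if "cciss" in msg and "unsolicited abort" in msg:
            kind = "cciss_abort"
        elif "cciss" in msg and "retried too many times" in msg:
            kind = "cciss_gave_up"
        elif msg.startswith("o2net:") and "has been idle" in msg:
            kind = "o2net_idle"
        else:
            continue
        counts[(host, kind)] += 1
        first_seen.setdefault((host, kind), stamp)
    return counts, first_seen

# Sample lines taken from the two log excerpts.
sample = [
    "Oct 11 05:15:28 vhaispora01 kernel: cciss0: unsolicited abort f7000250",
    "Oct 11 05:15:28 vhaispora01 kernel: cciss0: retrying f7000250",
    "Oct 11 05:16:46 vhaispora01 kernel: cciss0: f7012ca0 retried too many times",
    "Oct 11 05:16:56 vhaispora02 kernel: o2net: connection to node vhaispora01 "
    "(num 0) at 192.168.1.1:7777 has been idle for 10 seconds, shutting it down.",
]

counts, first_seen = summarize(sample)
# String sort of the timestamps is adequate only because all lines share one day.
for key in sorted(first_seen, key=first_seen.get):
    print(first_seen[key], key[0], key[1], counts[key])
```

In the excerpts, the first cciss abort on Node 1 (05:15:28) precedes the o2net idle timeout on Node 2 (05:16:56), so the storage path may be worth examining before the cluster interconnect.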