[Ocfs2-users] CRS/CSS and OCFS2

Tue May 27 06:41:37 PDT 2008

Hello,

I am writing in the hope that you can help me with a problem. We have an 
issue here and opened an SR at Metalink, but so far we have received no 
useful information towards solving it. The SR number is 6855815.994...

We want to protect 9i single-instance databases with 10g Clusterware, 
following the third-party-tool approach. There are no RAC databases 
involved, but we want to achieve high availability because the databases 
are business critical systems. To keep downtime low, the systems should be 
able to relocate to another machine in case of failure. To achieve this we 
want to use OCFS2 for the filesystem. Relocation is done by a script with 
the help of CRS.

So we took two systems (byaz05 and byaz10) and installed the following 
software: 10g CRS (10.2.0.3), Oracle software 9.2.0.8, and OCFS2 1.2.8.

We found the following Metalink notes and adjusted the heartbeat and 
timeouts for OCFS2 accordingly:

- Metalink Note 395878.1: Heartbeat/Voting/Quorum Related Timeout 
  Configuration for Linux, OCFS2, RAC Stack to avoid unnecessary node 
  fencing, panic and reboot
- Metalink Note 391771.1: OCFS2 - FREQUENTLY ASKED QUESTIONS (here in 
  particular the section on fencing and quorum)
- Metalink Note 434255.1: Common reasons for OCFS2 Kernel Panic or Reboot 
  Issues
- Metalink Note 457423.1: OCFS2 Fencing, Network, and Disk Heartbeat 
  Timeout Configuration
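
As a side note on the OCFS2 timeout adjustment mentioned above: the OCFS2 
FAQ relates the disk heartbeat timeout to O2CB_HEARTBEAT_THRESHOLD (set in 
/etc/sysconfig/o2cb) as timeout = (threshold - 1) * 2 seconds. A minimal 
Python sketch of that arithmetic follows. The function names are invented 
for illustration, and the sample values are illustrative only; they are not 
taken from our configuration.

```python
# Sketch of the OCFS2 disk heartbeat arithmetic described in the OCFS2 FAQ.
# Function names are hypothetical helpers, not part of any OCFS2 tool.

HEARTBEAT_INTERVAL_SECONDS = 2  # o2hb writes its heartbeat every 2 seconds


def disk_heartbeat_timeout(threshold: int) -> int:
    """Seconds before a node is considered dead for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def heartbeat_threshold_for(timeout_seconds: int) -> int:
    """O2CB_HEARTBEAT_THRESHOLD needed for a desired timeout in seconds."""
    if timeout_seconds <= 0 or timeout_seconds % HEARTBEAT_INTERVAL_SECONDS:
        raise ValueError("timeout must be a positive multiple of 2 seconds")
    return timeout_seconds // HEARTBEAT_INTERVAL_SECONDS + 1


if __name__ == "__main__":
    # Illustrative values only.
    for threshold in (7, 31, 61):
        print(threshold, "->", disk_heartbeat_timeout(threshold), "seconds")
```

When comparing layers, the point is to line up the resulting OCFS2 value in 
seconds against the network and CSS timeouts, so that it is clear which 
layer is expected to react first.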

We have made no changes to the CRS/CSS default settings so far.

During HA testing we observed unexpected behaviour of the system. We 
deactivated the bond for the private interconnect and expected only one 
node to go down, but both nodes went down. It seems to me that one node 
was rebooted by OCFS2 and the other one by CRS/CSS.

Timestamp 
--------------------------------------------------------------------------------------------------------------
10:21:06                bond1 disabled (eth1) 
/var/log/messages byaz05
Apr 25 10:21:06 byaz05 kernel: bonding: bond1: link status definitely down 
for interface eth1, disabling it
Apr 25 10:21:06 byaz05 kernel: bonding: bond1: making interface eth5 the 
new active one.

10:21:09                bond1 disabled (eth5) 
/var/log/messages byaz05
Apr 25 10:21:09 byaz05 kernel: bonding: bond1: link status definitely down 
for interface eth5, disabling it
Apr 25 10:21:09 byaz05 kernel: bonding: bond1: now running without any 
active interface !

10:21:23                o2net - no longer connected 
/var/log/messages byaz05
Apr 25 10:21:23 byaz05 kernel: o2net: no longer connected to node 
byaz10.bayer-ag.com (num 1) at 10.190.59.6:7777
/var/log/messages byaz10
Apr 25 10:21:23 byaz10 kernel: o2net: no longer connected to node 
byaz05.bayer-ag.com (num 0) at 10.190.59.5:7777

10:21:27                CSSD failure 134
10:21:29                Reboot initiated by CRS
/var/log/messages byaz05
Apr 25 10:21:27 byaz05 logger: Oracle clsomon failed with fatal status 12.
Apr 25 10:21:27 byaz05 logger: Oracle CSSD failure 134.
Apr 25 10:21:27 byaz05 su(pam_unix)[25839]: session closed for user oracle
Apr 25 10:21:27 byaz05 logger: Oracle CRS failure.  Rebooting for cluster 
integrity.
Apr 25 10:21:27 byaz05 kernel: md: stopping all md devices.
Apr 25 10:21:27 byaz05 kernel: md: md0 switched to read-only mode.
Apr 25 10:21:29 byaz05 logger: Oracle CRS failure.  Rebooting for cluster 
integrity.
Apr 25 10:21:29 byaz05 kernel: e1000: eth2: e1000_watchdog_task: NIC Link 
is Up 1000 Mbps Full Duplex
Apr 25 10:21:29 byaz05 logger: Oracle init script ceding reboot to sibling 
27383.

10:21:58                Reboot initiated by OCFS2(?)
/var/log/messages byaz10
Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session opened for user oracle 
by (uid=0)
Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session closed for user oracle
Apr 25 10:25:58 byaz10 syslogd 1.4.1: restart.
Apr 25 10:25:58 byaz10 syslog: syslogd startup succeeded
Apr 25 10:25:58 byaz10 kernel: klogd 1.4.1, log source = /proc/kmsg 
started.
Apr 25 10:25:58 byaz10 kernel: Bootdata ok (command line is ro 
root=/dev/vgroot/_)
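
To make the timing easier to read, the intervals between the events above 
can be computed directly from the syslog timestamps. The following Python 
sketch does this relative to the moment bond1 lost its last interface 
(10:21:09). The event labels are my own summaries, the year 2008 is an 
assumption (syslog does not record it), and the function name is 
hypothetical.

```python
from datetime import datetime

# Timestamps copied from the /var/log/messages excerpts above.
# Labels are informal summaries; the year is assumed to be 2008.
EVENTS = [
    ("byaz05: bond1 eth1 link down", "Apr 25 10:21:06"),
    ("byaz05: bond1 eth5 link down, no active interface", "Apr 25 10:21:09"),
    ("both nodes: o2net no longer connected", "Apr 25 10:21:23"),
    ("byaz05: CSSD failure 134, CRS reboot", "Apr 25 10:21:27"),
    ("byaz10: last entry before restart", "Apr 25 10:21:58"),
    ("byaz10: syslogd restart", "Apr 25 10:25:58"),
]


def parse(stamp: str) -> datetime:
    # %b expects English month abbreviations (C/English locale).
    return datetime.strptime(f"2008 {stamp}", "%Y %b %d %H:%M:%S")


def seconds_since(reference: str, events=EVENTS):
    """Return (label, offset in seconds) for each event vs. the reference."""
    ref = parse(reference)
    return [(label, int((parse(ts) - ref).total_seconds()))
            for label, ts in events]


if __name__ == "__main__":
    for label, offset in seconds_since("Apr 25 10:21:09"):
        print(f"{offset:+5d} s  {label}")
```

Read this way, o2net reports the lost connection 14 seconds after the 
interconnect went down, CSSD fails on byaz05 after 18 seconds, and the last 
log entry on byaz10 before its restart is at 49 seconds.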

We have assumed all along that this is a timing problem, but we do not 
know which settings cause it or which steps to take to correct them. 
Otherwise we will have to rework the complete concept for the business 
critical systems. 
Can anyone help me?

Regards,
Alexandra

Best Regards

Alexandra Strauss
_________________________________________

Fa. Opitz Consulting
Web: http://www.BayerBBS.com

Management: Chairman Andreas Resch   |   Labor Director Norbert Fieseler
Chairman of the Supervisory Board: Klaus Kühn
Registered office: Leverkusen   |   Amtsgericht Köln, HRB 49895