[Ocfs2-users] CRS/CSS and OCFS2

Thu Jun 5 08:46:49 PDT 2008

Hi Sunil,

My Lotus Notes choked on the table from Excel. The two nodes have the
following node numbers:
Node    ocfs2   crs/css
byaz05  0       2
byaz10  1       1

Greets,
Alex

>In such a situation, ocfs2 fences the higher node number. afaik,
>css does the same. What are the css node numbers for the two nodes?
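
The mismatch can be illustrated with a minimal sketch. It assumes the simplified rule quoted above (in a two-node split, each stack fences the node with the higher node number) and uses the node numbers reported in this thread; the function names are hypothetical, and neither OCFS2 nor CSS exposes such an API.

```python
# Node numbers as reported in this thread for each cluster stack.
NODE_NUMBERS = {
    "byaz05": {"ocfs2": 0, "css": 2},
    "byaz10": {"ocfs2": 1, "css": 1},
}


def fenced_node(stack, nodes=NODE_NUMBERS):
    """Return the node a stack would fence, ASSUMING the simplified rule
    'the higher node number loses' described in the thread."""
    return max(nodes, key=lambda name: nodes[name][stack])


def victims(nodes=NODE_NUMBERS):
    """Map each stack to the node it would fence under that assumption."""
    return {stack: fenced_node(stack, nodes) for stack in ("ocfs2", "css")}


if __name__ == "__main__":
    for stack, node in victims().items():
        print(f"{stack} would fence {node}")
```

Under this assumption the two stacks pick different victims (ocfs2 picks byaz10, css picks byaz05), so both nodes go down. That is consistent with the log excerpts in the original report, though it does not prove that node numbering is the only cause.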

>alexandra.strauss at bayerbbs.com wrote:
>>
>> Hello,
>>
>> I am writing in the hope that you can help me with my problem. We have
>> an issue here and opened an SR at Metalink, but so far we have received
>> no useful information towards solving it. The SR number is 6855815.994...
>>
>> We want to protect 9i single-instance databases with 10g Clusterware,
>> following the third-party-tool approach. No RAC databases are
>> involved, but the databases are business-critical systems and we need
>> high availability. In case of a failure, the databases should be able
>> to relocate to another machine to keep downtime low. To achieve this
>> we want to use OCFS2 for the filesystem. Relocation is done by a
>> script with the help of CRS.
>>
>> So we took two systems (byaz05 and byaz10) and installed the following
>> software: 10g CRS (10.2.0.3), Oracle Software 9.2.0.8, and OCFS2 1.2.8.
>>
>> We found the following Metalink notes and adjusted the heartbeat and
>> timeouts for OCFS2 accordingly:
>> Metalink Note 395878.1: Heartbeat/Voting/Quorum Related Timeout
>> Configuration for Linux, OCFS2, RAC Stack to avoid unnecessary node
>> fencing, panic and reboot
>> Metalink Note 391771.1: OCFS2 - FREQUENTLY ASKED QUESTIONS (here
>> especially the section on fencing and quorum)
>> Metalink Note 434255.1: Common reasons for OCFS2 Kernel Panic or 
>> Reboot Issues
>> Metalink Note 457423.1: OCFS2 Fencing, Network, and Disk Heartbeat 
>> Timeout Configuration
>>
>> We have not changed the CRS/CSS default settings so far.
>>
>> During HA testing we observed unexpected behaviour of the system. We
>> deactivated the bond for the private interconnect and expected only
>> one node to go down. Instead, both nodes went down. It seems to me
>> that one node was rebooted by OCFS2 and the other by CRS/CSS.
>>
>> Timestamp
>> ------------------------------------------------------------------
>>
>> 10:21:06 bond1 disabled (eth1)
>> */var/log/messages byaz05*
>> Apr 25 10:21:06 byaz05 kernel: bonding: bond1: link status definitely 
>> down for interface eth1, disabling it
>> Apr 25 10:21:06 byaz05 kernel: bonding: bond1: making interface eth5 
>> the new active one.
>>
>> 10:21:09 bond1 disabled (eth5)
>> */var/log/messages byaz05*
>> Apr 25 10:21:09 byaz05 kernel: bonding: bond1: link status definitely 
>> down for interface eth5, disabling it
>> Apr 25 10:21:09 byaz05 kernel: bonding: bond1: now running without any 
>> active interface !
>>
>> 10:21:23 o2net - no longer connected
>> */var/log/messages byaz05*
>> Apr 25 10:21:23 byaz05 kernel: o2net: no longer connected to node 
>> byaz10.bayer-ag.com (num 1) at 10.190.59.6:7777
>> */var/log/messages byaz10*
>> Apr 25 10:21:23 byaz10 kernel: o2net: no longer connected to node 
>> byaz05.bayer-ag.com (num 0) at 10.190.59.5:7777
>>
>> 10:21:27 CSSD failure 134
>> 10:21:29 Reboot initiated by CRS
>> */var/log/messages byaz05*
>> Apr 25 10:21:27 byaz05 logger: Oracle clsomon failed with fatal status 
>> 12.
>> Apr 25 10:21:27 byaz05 logger: Oracle CSSD failure 134.
>> Apr 25 10:21:27 byaz05 su(pam_unix)[25839]: session closed for user 
>> oracle
>> Apr 25 10:21:27 byaz05 logger: Oracle CRS failure. Rebooting for 
>> cluster integrity.
>> Apr 25 10:21:27 byaz05 kernel: md: stopping all md devices.
>> Apr 25 10:21:27 byaz05 kernel: md: md0 switched to read-only mode.
>> Apr 25 10:21:29 byaz05 logger: Oracle CRS failure. Rebooting for 
>> cluster integrity.
>> Apr 25 10:21:29 byaz05 kernel: e1000: eth2: e1000_watchdog_task: NIC 
>> Link is Up 1000 Mbps Full Duplex
>> Apr 25 10:21:29 byaz05 logger: Oracle init script ceding reboot to 
>> sibling 27383.
>>
>> 10:21:58 Reboot initiated by OCFS2(?)
>> */var/log/messages byaz10*
>> Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session opened for user 
>> oracle by (uid=0)
>> Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session closed for user
>> oracle
>> Apr 25 10:25:58 byaz10 syslogd 1.4.1: restart.
>> Apr 25 10:25:58 byaz10 syslog: syslogd startup succeeded
>> Apr 25 10:25:58 byaz10 kernel: klogd 1.4.1, log source = /proc/kmsg 
>> started.
>> Apr 25 10:25:58 byaz10 kernel: Bootdata ok (command line is ro 
>> root=/dev/vgroot/_)
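
To reason about the suspected timing problem, the log excerpts above can be put on one timeline. The following is a minimal sketch: the timestamps are copied from the messages quoted in this thread, the event descriptions are paraphrased, and the helper names are hypothetical. Note that 10:21:58 is only the last syslog entry on byaz10 before its restart, not necessarily the moment it was fenced.

```python
from datetime import datetime

# (time, host, event) taken from the /var/log/messages excerpts in this thread.
EVENTS = [
    ("10:21:06", "byaz05", "bond1: eth1 link down, eth5 becomes active"),
    ("10:21:09", "byaz05", "bond1: eth5 link down, no active interface"),
    ("10:21:23", "both", "o2net: no longer connected to peer"),
    ("10:21:27", "byaz05", "Oracle CSSD failure 134, CRS reboot"),
    ("10:21:58", "byaz10", "last syslog entry before restart"),
    ("10:25:58", "byaz10", "syslogd restart"),
]

TIME_FORMAT = "%H:%M:%S"


def seconds_since(start, stamp):
    """Seconds elapsed from `start` to `stamp` (both HH:MM:SS, same day)."""
    delta = datetime.strptime(stamp, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return int(delta.total_seconds())


def offsets(events, start="10:21:09"):
    """Offsets relative to the moment the interconnect lost its last link."""
    return [(seconds_since(start, t), host, what) for t, host, what in events]


if __name__ == "__main__":
    for offset, host, what in offsets(EVENTS):
        print(f"{offset:+5d}s  {host:7s} {what}")
```

Comparing these offsets with the configured OCFS2 and CSS timeouts may help show which stack reacts first; the timeout values themselves are not given in this thread, so they are not included here.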
>>
>>
>> We have assumed all along that this is a timing problem, but we don't
>> know which settings cause it or which steps to take to correct
>> them. Otherwise we will have to rework the complete concept for the
>> business-critical systems.
>> Can anyone help me?
>>

>> Regards,
>> Alexandra