[Ocfs2-users] CRS/CSS and OCFS2

Thu Jun 5 09:41:23 PDT 2008

Stop o2cb on both nodes and swap the two node numbers in
/etc/ocfs2/cluster.conf. After making the change on both nodes,
restart o2cb on both.
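
To illustrate why swapping the numbers helps, here is a minimal Python
sketch. It assumes the rule stated further down in this thread (on a
two-node split, the node with the higher node number is fenced) and
assumes CSS really does apply the same rule to its own numbers, which is
only an "afaik" here. The sample text mimics the stanza layout of an o2cb
cluster configuration file. The cluster name and the short node names in
it are assumptions, and the helper functions are hypothetical, not part
of any OCFS2 tool.

```python
import re

# Hypothetical configuration text. Node numbers, IP addresses and port are
# taken from the thread; the cluster name "ocfs2" is an assumption.
SAMPLE_CONF = """\
node:
\tip_port = 7777
\tip_address = 10.190.59.5
\tnumber = 0
\tname = byaz05
\tcluster = ocfs2

node:
\tip_port = 7777
\tip_address = 10.190.59.6
\tnumber = 1
\tname = byaz10
\tcluster = ocfs2

cluster:
\tnode_count = 2
\tname = ocfs2
"""

STANZA_SEPARATOR = re.compile(r"\n[ \t]*\n")


def node_numbers(conf_text):
    """Return {node name: node number} for every 'node:' stanza."""
    numbers = {}
    for stanza in STANZA_SEPARATOR.split(conf_text.strip()):
        lines = stanza.splitlines()
        if lines[0].strip() != "node:":
            continue
        fields = {}
        for line in lines[1:]:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
        numbers[fields["name"]] = int(fields["number"])
    return numbers


def swap_node_numbers(conf_text, name_a, name_b):
    """Return the configuration text with the two nodes' numbers exchanged."""
    numbers = node_numbers(conf_text)
    new_number = {name_a: numbers[name_b], name_b: numbers[name_a]}
    stanzas = []
    for stanza in STANZA_SEPARATOR.split(conf_text.strip()):
        name = re.search(r"^[ \t]*name[ \t]*=[ \t]*(\S+)", stanza, re.M)
        if stanza.startswith("node:") and name and name.group(1) in new_number:
            replacement = r"\g<1>" + str(new_number[name.group(1)])
            stanza = re.sub(r"^([ \t]*number[ \t]*=[ \t]*)\d+", replacement,
                            stanza, flags=re.M)
        stanzas.append(stanza)
    return "\n\n".join(stanzas) + "\n"


def fence_victim(numbers):
    """Assumed rule from the thread: the higher node number gets fenced."""
    return max(numbers, key=numbers.get)


css = {"byaz05": 2, "byaz10": 1}  # CSS node numbers reported in the thread
before = node_numbers(SAMPLE_CONF)
after = node_numbers(swap_node_numbers(SAMPLE_CONF, "byaz05", "byaz10"))

print("before swap:", fence_victim(before), "vs CSS:", fence_victim(css))
print("after swap: ", fence_victim(after), "vs CSS:", fence_victim(css))
```

With the numbers as reported, the two stacks pick different victims, which
would match both nodes rebooting. After the swap they pick the same node.
The script only transforms text in memory. The real file still has to be
edited on both nodes with o2cb stopped.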

alexandra.strauss at bayerbbs.com wrote:
>
> Hi Sunil,
>
> My Lotus Notes choked on the table from Excel, so here are the node
> numbers of the two nodes as plain text:
>
> Node     ocfs2   crs/css
> byaz05   0       2
> byaz10   1       1
>
> Greets,
> Alex
>
>
> >In such a situation, ocfs2 fences the higher node number. afaik,
> >css does the same. What are the css node numbers for the two nodes?
>
> > alexandra.strauss at bayerbbs.com wrote:
> >>
> >> Hello,
> >>
> >> I am writing in the hope that you can help me with a problem. We
> >> have an issue here and opened an SR at Metalink, but so far we have
> >> received no useful information toward solving it. The SR number is
> >> 6855815.994.
> >>
> >> We want to protect 9i single-instance databases with 10g Clusterware,
> >> following the third-party-tool approach. No RAC databases are
> >> involved, but we need high availability because the databases are
> >> business-critical systems. To keep downtime low, the databases should
> >> be able to relocate to another machine in case of failure. To achieve
> >> this we want to use OCFS2 for the filesystem. The relocation is done
> >> by a script with the help of CRS.
> >>
> >> We took two systems (byaz05 and byaz10) and installed the following
> >> software: 10g CRS (10.2.0.3), Oracle software 9.2.0.8, and
> >> OCFS2 1.2.8.
> >>
> >> We found the following Metalink notes and adjusted the heartbeat and
> >> timeouts for OCFS2 accordingly:
> >>
> >> - Metalink Note 395878.1: Heartbeat/Voting/Quorum Related Timeout
> >>   Configuration for Linux, OCFS2, RAC Stack to avoid unnecessary node
> >>   fencing, panic and reboot
> >> - Metalink Note 391771.1: OCFS2 - FREQUENTLY ASKED QUESTIONS (in
> >>   particular the section on fencing and quorum)
> >> - Metalink Note 434255.1: Common reasons for OCFS2 Kernel Panic or
> >>   Reboot Issues
> >> - Metalink Note 457423.1: OCFS2 Fencing, Network, and Disk Heartbeat
> >>   Timeout Configuration
> >>
> >> We have not changed the CRS/CSS default settings so far.
> >>
> >> During HA testing we observed unexpected behaviour. We deactivated
> >> the bond for the private interconnect and expected only one node to
> >> go down, but both nodes went down. It appears that one node was
> >> rebooted by OCFS2 and the other by CRS/CSS.
> >>
> >> Timestamp
> >> ------------------------------------------------------------------
> >>
> >> 10:21:06 bond1 disabled (eth1)
> >> /var/log/messages byaz05:
> >> Apr 25 10:21:06 byaz05 kernel: bonding: bond1: link status definitely down for interface eth1, disabling it
> >> Apr 25 10:21:06 byaz05 kernel: bonding: bond1: making interface eth5 the new active one.
> >>
> >> 10:21:09 bond1 disabled (eth5)
> >> /var/log/messages byaz05:
> >> Apr 25 10:21:09 byaz05 kernel: bonding: bond1: link status definitely down for interface eth5, disabling it
> >> Apr 25 10:21:09 byaz05 kernel: bonding: bond1: now running without any active interface !
> >>
> >> 10:21:23 o2net: no longer connected
> >> /var/log/messages byaz05:
> >> Apr 25 10:21:23 byaz05 kernel: o2net: no longer connected to node byaz10.bayer-ag.com (num 1) at 10.190.59.6:7777
> >> /var/log/messages byaz10:
> >> Apr 25 10:21:23 byaz10 kernel: o2net: no longer connected to node byaz05.bayer-ag.com (num 0) at 10.190.59.5:7777
> >>
> >> 10:21:27 CSSD failure 134
> >> 10:21:29 Reboot initiated by CRS
> >> /var/log/messages byaz05:
> >> Apr 25 10:21:27 byaz05 logger: Oracle clsomon failed with fatal status 12.
> >> Apr 25 10:21:27 byaz05 logger: Oracle CSSD failure 134.
> >> Apr 25 10:21:27 byaz05 su(pam_unix)[25839]: session closed for user oracle
> >> Apr 25 10:21:27 byaz05 logger: Oracle CRS failure. Rebooting for cluster integrity.
> >> Apr 25 10:21:27 byaz05 kernel: md: stopping all md devices.
> >> Apr 25 10:21:27 byaz05 kernel: md: md0 switched to read-only mode.
> >> Apr 25 10:21:29 byaz05 logger: Oracle CRS failure. Rebooting for cluster integrity.
> >> Apr 25 10:21:29 byaz05 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
> >> Apr 25 10:21:29 byaz05 logger: Oracle init script ceding reboot to sibling 27383.
> >>
> >> 10:21:58 Reboot initiated by OCFS2(?)
> >> /var/log/messages byaz10:
> >> Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session opened for user oracle by (uid=0)
> >> Apr 25 10:21:58 byaz10 su(pam_unix)[4595]: session closed for user oracle
> >> Apr 25 10:25:58 byaz10 syslogd 1.4.1: restart.
> >> Apr 25 10:25:58 byaz10 syslog: syslogd startup succeeded
> >> Apr 25 10:25:58 byaz10 kernel: klogd 1.4.1, log source = /proc/kmsg started.
> >> Apr 25 10:25:58 byaz10 kernel: Bootdata ok (command line is ro root=/dev/vgroot/_)
> >>
> >>
> >> We have assumed all along that this is a timing problem, but we do
> >> not know which settings cause it or which steps to take to correct
> >> them. If we cannot resolve it, we will have to rework the entire
> >> concept for the business-critical systems.
> >> Can anyone help me?
> >>
> >> Regards,
> >> Alexandra