[Ocfs2-users] strange node reboot in RAC environment

Kristiansen Morten Morten.Kristiansen at hn-ikt.no
Wed Feb 4 03:15:11 PST 2009


Are there any documents that describe this OCFS2 voting/OCR problem in more detail?

mk

-----Original Message-----
From: Schmitter, Martin [mailto:Martin.Schmitter at opitz-consulting.de]
Sent: 4 February 2009 11:35
To: Kristiansen Morten; Ulf Zimmermann; Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: RE: [Ocfs2-users] strange node reboot in RAC environment


If you do not use CRS to manage third-party applications, ASM is a good choice, since then you won't be running two cluster daemons. This will make your life easier, though not trivial.

BR

Martin Schmitter


--


OPITZ CONSULTING Gummersbach GmbH
Martin Schmitter - IT Specialist
Kirchstr. 6 - 51647 Gummersbach
http://www.opitz-consulting.de
Managing Directors: Bernhard Opitz, Martin Bertelsmeier
Commercial register: HRB 39163, Amtsgericht Köln
________________________________________
From: Kristiansen Morten [Morten.Kristiansen at hn-ikt.no]
Sent: Wednesday, 4 February 2009 11:26
To: Schmitter, Martin; Ulf Zimmermann; Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: RE: [Ocfs2-users] strange node reboot in RAC environment

Would this be better with ASM?

mk

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Schmitter, Martin
Sent: 4 February 2009 11:18
To: Kristiansen Morten; Ulf Zimmermann; Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: Re: [Ocfs2-users] strange node reboot in RAC environment


Hi Morten,

The main problem is that you have two cluster stacks, and both have a fencing option. In a split-brain situation, OCFS2 will block all writes, so the CRS voting-disk heartbeat, which writes its block every 2 seconds, can no longer reach the voting disk. When that happens, CRS starts its own eviction as well. Now you have two cluster daemons each shooting a node, and if the worst comes to the worst, they do not shoot the same one. In a two-node cluster, that would leave your whole cluster dead!

I highly recommend putting the voting disk and OCR on raw devices. Maybe Sunil can explain this in more detail.
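
For reference, on Red Hat EL4-style systems the raw bindings can be kept in /etc/sysconfig/rawdevices; the partitions below are placeholders, and the ownership follows the usual Oracle guidance, so verify both against your own installation:

  # /etc/sysconfig/rawdevices -- bind raw devices to the shared partitions
  /dev/raw/raw1 /dev/sdm1   # OCR         (placeholder partition)
  /dev/raw/raw2 /dev/sdn1   # voting disk (placeholder partition)

  # activate the bindings and set the expected ownership
  service rawdevices restart
  chown root:oinstall /dev/raw/raw1     # OCR belongs to root
  chown oracle:oinstall /dev/raw/raw2   # voting disk belongs to oracle

On EL5 the same bindings are usually expressed as udev rules instead of the rawdevices service.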

BR

Martin Schmitter


--


OPITZ CONSULTING Gummersbach GmbH
Martin Schmitter - IT Specialist
Kirchstr. 6 - 51647 Gummersbach
http://www.opitz-consulting.de
Managing Directors: Bernhard Opitz, Martin Bertelsmeier
Commercial register: HRB 39163, Amtsgericht Köln


________________________________________
From: Kristiansen Morten [Morten.Kristiansen at hn-ikt.no]
Sent: Wednesday, 4 February 2009 11:00
To: Schmitter, Martin; Ulf Zimmermann; Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: RE: [Ocfs2-users] strange node reboot in RAC environment

We have four different clusters running on Red Hat EL5, Oracle 10gR2 and OCFS2. On all four RACs the voting disk and OCR file are on OCFS2 partitions. Why is that bad? Where should we put them instead? We do not run any other clusterware on the servers, and the voting disk and OCR have to be on shared disks. Or did I misunderstand something when you said "do not put the VOTING and OCR on OCFS"?

Regards
Morten K

-----Original Message-----
From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-bounces at oss.oracle.com] On Behalf Of Schmitter, Martin
Sent: 4 February 2009 10:07
To: Ulf Zimmermann; Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: Re: [Ocfs2-users] strange node reboot in RAC environment

Hi All,

Based on my experience with OCFS2 and RAC 10g (10.2.0.3 and 10.2.0.4), I would suggest the following:

* Update to OCFS2 1.2.9-1.

* Raise the timeouts. Your current values:
Heartbeat dead threshold: 31
Network idle timeout: 10000

These are definitely too low. Keep in mind that the OCFS2 timeouts should be higher than the CRS timeouts. I am sorry, but setting the timeouts to the right values is a bit like reading tea leaves: you have to do a lot of testing.
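
For what it's worth: with a dead threshold of 31, the disk-heartbeat fence window is about (31 - 1) * 2 = 60 seconds, and 10000 ms is a 10-second network idle timeout. On ocfs2-tools 1.2.x these knobs live in /etc/sysconfig/o2cb; the values below are only an illustration of where they are set, not a recommendation:

  # /etc/sysconfig/o2cb -- example values; test before relying on them
  O2CB_HEARTBEAT_THRESHOLD=61   # fence after ~(61-1)*2 = 120 s without disk heartbeat
  O2CB_IDLE_TIMEOUT_MS=30000    # drop an o2net link after 30 s of silence
  O2CB_KEEPALIVE_DELAY_MS=2000  # send a keepalive probe after 2 s idle
  O2CB_RECONNECT_DELAY_MS=2000  # wait 2 s between reconnect attempts

  # apply on every node; the cluster stack must be restarted
  service o2cb restart

All nodes must use identical timeout values, or they will refuse to talk to each other.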

Don't put a shared CRS_HOME or ORACLE_HOME on OCFS2, and do not put the VOTING disk and OCR on OCFS2.

Everything without warranty, of course... ;-)


Good Luck


Martin Schmitter




________________________________________
From: ocfs2-users-bounces at oss.oracle.com [ocfs2-users-bounces at oss.oracle.com] on behalf of Ulf Zimmermann [ulf at openlane.com]
Sent: Tuesday, 3 February 2009 20:48
To: Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: Re: [Ocfs2-users] strange node reboot in RAC environment

> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Pedro Figueira
> Sent: 02/03/2009 09:07
> To: ocfs2-users at oss.oracle.com
> Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
> Subject: [Ocfs2-users] strange node reboot in RAC environment
>
> Hi all
>
> We have a 4-node Oracle RAC with the following software versions:
>
> Oracle and clusterware version 10.2.0.4
> Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-55.ELlargesmp
> ocfs2-tools-1.2.4-1
> ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
> ocfs2console-1.2.4-1
> timeout parameters:
>   Heartbeat dead threshold: 31
>   Network idle timeout: 10000
>   Network keepalive delay: 5000
>   Network reconnect delay: 2000
>
> Until late last year the cluster was rock solid (hundreds). From
> January onward all the servers started to reboot in sync, but the
> strange thing is that there are no log messages in /var/log/messages,
> so we don't know if this is an OCFS2-related problem. The reboots seem
> to be related to the backup process (maybe extra load?). Other reboots
> only affect 2 out of 4 nodes.

As OCFS2 prints its messages to the console, where they might not get captured by anything,
I recommend setting up the iLO virtual serial port and using something like conserver to attach
a console to it. I do this for all our OCFS2 hosts and have a log of everything
going on, including the BIOS screen. If OCFS2 is fencing because of I/O issues, it will show up there.
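
To flesh that out: the kernel must be told to send its messages to the serial device that the iLO exposes, and conserver then records that stream to a file. A minimal sketch, assuming the virtual serial port shows up as ttyS1 and is reachable over the network; the iLO host name, port, and root device are placeholders:

  # grub.conf: append console parameters to the kernel line
  kernel /vmlinuz-2.6.9-55.ELlargesmp ro root=/dev/sda2 \
      console=ttyS1,115200n8 console=tty0

  # conserver.cf: capture and log the virtual serial of one node
  console grid2db1 {
      master localhost;
      type host;
      host grid2db1-ilo;                # placeholder: network name of the iLO
      port 3002;                        # placeholder: virtual-serial TCP port
      logfile /var/consoles/grid2db1;   # panics and fence messages land here
  }

With console= on the kernel line, everything OCFS2 prints before a reboot ends up in the conserver log even when /var/log/messages stays empty.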

>
> Last night we updated the HP firmware and drivers on the DL580G4
> servers, and today we had another reboot (this time with the following
> messages in /var/log/messages):
>
> NODE 1:
> ------------------------------------------------------
> Feb  3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4
> (num 3) at 10.0.2.52:7777 has been idle for 10.0 seconds, shutting it
> down.
> Feb  3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1233670362.97595
> now 1233670372.96280 dr 1233670362.97580 adv
> 1233670362.97604:1233670362.97604 func (c77ed98a:504)
> 1233670067.138220:1233670067.138233)
> Feb  3 14:12:52 grid2db1 kernel: o2net: no longer connected to node
> grid2db4 (num 3) at 10.0.2.52:7777
> Feb  3 14:16:26 grid2db1 syslogd 1.4.1: restart.
> Feb  3 14:16:26 grid2db1 syslog: syslogd startup succeeded
>
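A quick sanity check on those node 1 timestamps: "now" minus "tmr" is 1233670372.96280 - 1233670362.97595, just a hair under 10 seconds, which is exactly the configured 10000 ms network idle timeout. Node 1 dropped the link because node 4 had gone silent.
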
> NODE 4:
> ------------------------------------------------------
> Feb  3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR:
> Heartbeat write timeout to device sdl after 60000 milliseconds
> Feb  3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24
> blocking operations (cur = 18):
> Feb  3 14:16:27 grid2db4 syslogd 1.4.1: restart.
> Feb  3 14:16:27 grid2db4 syslog: syslogd startup succeeded
>
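And node 4's side explains the silence: a 60000 ms heartbeat write timeout matches the fence window implied by a dead threshold of 31, roughly (31 - 1) * 2 s = 60 s. The node could not write its heartbeat block to sdl for a full minute and fenced itself, which points at the storage path (or extreme I/O load) rather than the network.
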
> Other reboots simply don't log any error message.
>
> So my question is whether it's possible these reboots are triggered by
> OCFS2, and how do I debug this problem? Should I change the timeout
> parameters?
>
> We are also planning to upgrade to OCFS2 1.2.9-1 and OCFS2 Tools
> 1.2.7-1 and the latest distro kernel; any catch?
>
> Best regards and thanks for any answer.
>
> Pedro Figueira
> Serviço de Estrangeiros e Fronteiras
> Direcção Central de Informática
> Departamento de Produção
> Telephone: +351 217 115 153
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Sunil Mushran
> Sent: Saturday, 31 January 2009 15:59
> To: Carl Benson
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] one node rejects connection from new node
>
> Nodes can be added to an online cluster. The instructions are listed
> in the user's guide.
>
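For the record, the user's guide adds a node to a running cluster with o2cb_ctl, executed on every node that is online. A sketch using the values from the cluster.conf quoted below (check the exact flags against your ocfs2-tools version):

  # run on each online node: registers gladstone with the live cluster
  # and appends it to /etc/ocfs2/cluster.conf
  o2cb_ctl -C -i -n gladstone -t node -a number=3 \
      -a ip_address=140.107.170.108 -a ip_port=7778 -a cluster=ocfs2
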
> On Jan 31, 2009, at 7:53 AM, Carl Benson <cbenson at fhcrc.org> wrote:
>
> > Sunil,
> >
> > Thank you for responding. I will try o2cb_ctl on Monday, when I have
> > physical access to hit Reset in case one or more nodes lock up.
> >
> > If there really is a requirement to restart the cluster on wilson1
> > every time I add a new node (and I have five or six more nodes to
> > add), that is too bad. Wilson1 is a 24x7 production system.
> >
> > --Carl Benson
> >
> > Sunil Mushran wrote:
> >> Could be that the cluster was already online on wilson1 when you
> >> propagated the cluster.conf to all nodes. If so, restart the cluster
> >> on that node.
> >>
> >> To add a node to an online cluster, you need to use the o2cb_ctl
> >> command. Details are in the 1.4 user's guide.
> >>
> >>
> >> Carl J. Benson wrote:
> >>
> >>> Hello.
> >>>
> >>> I have three systems that share an ocfs2 filesystem, and I'm
> >>> trying to add a fourth system.
> >>>
> >>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
> >>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
> >>>
> >>> cluster.conf looks like this:
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.170.116
> >>>        number = 0
> >>>        name = merlot1
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.158.54
> >>>        number = 1
> >>>        name = merlot2
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.158.82
> >>>        number = 2
> >>>        name = wilson1
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7778
> >>>        ip_address = 140.107.170.108
> >>>        number = 3
> >>>        name = gladstone
> >>>        cluster = ocfs2
> >>>
> >>> cluster:
> >>>        node_count = 4
> >>>        name = ocfs2
> >>>
> >>> gladstone is the new node.
> >>>
> >>> I edited the cluster.conf on wilson1 using ocfs2console, and
> >>> propagated it to the other systems from there.
> >>>
> >>> When I try to bring my ocfs2 online with /etc/init.d/o2cb online
> >>> ocfs2,
> >>> merlot1 accepts the connection from gladstone, as does merlot2.
> >>> However, wilson1 rejects it as an unknown node! For example:
> >>>
> >>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795
> >>> attempt
> >>> to connect from unknown node at 140.107.170.108:37795
> >>>
> >>> Why would this happen?
> >>>
> >>>
> >>
> >>

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users at oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users


