[Ocfs2-users] strange node reboot in RAC environment - FIXED

Pedro Figueira Pedro.Figueira at sef.pt
Fri Feb 13 02:36:06 PST 2009


Hi all

We got lucky and were able to watch one of the daily reboots as it happened. The cause was a kernel panic triggered by OCFS2, but we don't have the exact error messages.

After updating the firmware on all the servers and changing the timeouts as Martin suggested, the reboots stopped.

We will, however, upgrade to the latest 1.2.X release at the next available maintenance window.

Thanks to all for the help.

Best regards

Pedro Figueira 
Serviço de Estrangeiros e Fronteiras 
Direcção Central de Informática
Departamento de Produção 
Telefone: + 351 217 115 153


-----Original Message-----
From: Schmitter, Martin [mailto:Martin.Schmitter at opitz-consulting.de] 
Sent: Wednesday, 4 February 2009 09:07
To: Ulf Zimmermann; Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: Re: [Ocfs2-users] strange node reboot in RAC environment

Hi All,

Based on my experience with OCFS2 under RAC 10g 10.2.0.3 and 10.2.0.4, I would suggest the following:

* Update to 1.2.9-1

* Raise the timeouts. Your current values are:
Heartbeat dead threshold: 31
Network idle timeout: 10000

These values are definitely too low. Keep in mind that the OCFS2 timeouts should be higher than the CRS timeouts. Sorry, but picking the right timeout values is a bit like reading tea leaves; you have to do a lot of testing. See the sketch below.
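For reference, a minimal sketch of how these timeouts are changed on OCFS2 1.2.5 and later (variable names as shipped in /etc/sysconfig/o2cb; the values below are purely illustrative, not a recommendation for any particular cluster):

    # Re-run the interactive O2CB configuration; it prompts for the
    # heartbeat dead threshold and the network timeouts. Restart the
    # cluster stack on every node afterwards for the values to apply.
    /etc/init.d/o2cb configure

    # The answers end up in /etc/sysconfig/o2cb, e.g.:
    O2CB_HEARTBEAT_THRESHOLD=61   # fences after (61-1)*2 = 120s of
                                  # failed heartbeat writes to disk
    O2CB_IDLE_TIMEOUT_MS=30000    # network idle timeout
    O2CB_KEEPALIVE_DELAY_MS=2000  # network keepalive delay
    O2CB_RECONNECT_DELAY_MS=2000  # network reconnect delay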

Don't put CRS_HOME and ORACLE_HOME on OCFS2, and don't put the voting disk and OCR on OCFS2 either.

Everything without warranty, of course. ;-)


Good Luck


Martin Schmitter




________________________________________
From: ocfs2-users-bounces at oss.oracle.com [ocfs2-users-bounces at oss.oracle.com] on behalf of Ulf Zimmermann [ulf at openlane.com]
Sent: Tuesday, 3 February 2009 20:48
To: Pedro Figueira; ocfs2-users at oss.oracle.com
Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
Subject: Re: [Ocfs2-users] strange node reboot in RAC environment

> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Pedro Figueira
> Sent: 02/03/2009 09:07
> To: ocfs2-users at oss.oracle.com
> Cc: Ricardo Rocha Ramalho; Pedro Lopes Almeida
> Subject: [Ocfs2-users] strange node reboot in RAC environment
>
> Hi all
>
> We have a 4-node Oracle RAC cluster with the following software
> versions:
>
> Oracle and clusterware version 10.2.0.4
> Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9-
> 55.ELlargesmp
> ocfs2-tools-1.2.4-1
> ocfs2-2.6.9-55.ELlargesmp-1.2.5-2
> ocfs2console-1.2.4-1
> timeout parameters:
>   Heartbeat dead threshold: 31
>   Network idle timeout: 10000
>   Network keepalive delay: 5000
>   Network reconnect delay: 2000
>
> Until late last year the cluster was rock solid (hundreds of days of
> uptime). From January onward, all the servers started to reboot at the
> same time, but the strange thing is that there are no log messages in
> /var/log/messages, so we don't know if this is an OCFS2-related
> problem. These reboots seem to be related to the backup process (maybe
> extra load?). Other reboots only affect 2 out of 4 nodes.

As ocfs2 prints messages to the console, and those might not get captured
anywhere, I recommend setting up the iLO virtual serial port and using
something like conserver to attach a console to it. I do this for all our
OCFS2 hosts and get a log of everything that happens, including the BIOS
screens. If ocfs2 is fencing because of I/O issues, it will show up there.
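A rough sketch of the pieces involved (the hostnames, console name, and serial device are assumptions for illustration; "type host" assumes the iLO virtual serial port is reachable over plain TCP, and depending on the iLO generation you may need "type exec" with an ssh wrapper instead):

    # 1) grub.conf: have the kernel log to the virtual serial port as
    #    well as the VGA console, so panic output reaches the iLO:
    kernel /vmlinuz-2.6.9-55.ELlargesmp ro root=LABEL=/ console=tty0 console=ttyS0,115200

    # 2) conserver console.cf entry attaching a logged console to the
    #    iLO virtual serial port of a hypothetical host grid2db1:
    console grid2db1 {
        master localhost;
        type host;
        host grid2db1-ilo;
        port 23;
    }

After that, "console grid2db1" gives an interactive view while conserver keeps a permanent log of everything the node prints, fence messages included.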

>
> Last night we updated the firmware and drivers from HP of the DL580G4
> server and today we had another reboot (now with the following messages
> in /var/log/messages):
>
> NODE 1:
> ------------------------------------------------------
> Feb  3 14:12:52 grid2db1 kernel: o2net: connection to node grid2db4
> (num 3) at 10.0.2.52:7777 has been idle for 10.0 seconds, shutting it
> down.
> Feb  3 14:12:52 grid2db1 kernel: (0,0):o2net_idle_timer:1418 here are
> some times that might help debug the situation: (tmr 1233670362.97595
> now 1233670372.96280 dr 1233670362.97580 adv
> 1233670362.97604:1233670362.97604 func (c77ed98a:504)
> 1233670067.138220:1233670067.138233)
> Feb  3 14:12:52 grid2db1 kernel: o2net: no longer connected to node
> grid2db4 (num 3) at 10.0.2.52:7777
> Feb  3 14:16:26 grid2db1 syslogd 1.4.1: restart.
> Feb  3 14:16:26 grid2db1 syslog: syslogd startup succeeded
>
> NODE 4:
> ------------------------------------------------------
> Feb  3 14:12:46 grid2db4 kernel: (20,2):o2hb_write_timeout:269 ERROR:
> Heartbeat write timeout to device sdl after 60000 milliseconds
> Feb  3 14:12:46 grid2db4 kernel: Heartbeat thread (20) printing last 24
> blocking operations (cur = 18):
> Feb  3 14:16:27 grid2db4 syslogd 1.4.1: restart.
> Feb  3 14:16:27 grid2db4 syslog: syslogd startup succeeded
>
> Other reboots simply don't log any error messages.
>
> So my question is whether these reboots could be triggered by OCFS2,
> and how can we debug this problem? Should I change the timeout
> parameters?
>
> We are also planning to upgrade to OCFS2 1.2.9-1, OCFS2 Tools
> 1.2.7-1, and the latest distro kernel. Any catch?
>
> Best regards and thanks for any answer.
>
> Pedro Figueira
> Serviço de Estrangeiros e Fronteiras
> Direcção Central de Informática
> Departamento de Produção
> Telefone: + 351 217 115 153
>
> -----Original Message-----
> From: ocfs2-users-bounces at oss.oracle.com [mailto:ocfs2-users-
> bounces at oss.oracle.com] On Behalf Of Sunil Mushran
> Sent: Saturday, 31 January 2009 15:59
> To: Carl Benson
> Cc: ocfs2-users at oss.oracle.com
> Subject: Re: [Ocfs2-users] one node rejects connection from new node
>
> Nodes can be added to an online cluster. The instructions are listed
> in the user's guide.
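(For reference, the procedure in the guide boils down to running o2cb_ctl on every node that already has the cluster online, and once on the new node itself. A sketch, using the gladstone entry from the cluster.conf quoted below:

    # -C creates an object, -i also updates /etc/ocfs2/cluster.conf,
    # -t node says the object being added is a node:
    o2cb_ctl -C -i -n gladstone -t node -a number=3 \
        -a ip_address=140.107.170.108 -a ip_port=7778 -a cluster=ocfs2

The same o2cb_ctl run updates both the live cluster's view and the config file, which is why no restart is needed.)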
>
> On Jan 31, 2009, at 7:53 AM, Carl Benson <cbenson at fhcrc.org> wrote:
>
> > Sunil,
> >
> > Thank you for responding. I will try o2cb_ctl on Monday, when I have
> > physical access to hit Reset in case one or more nodes lock up.
> >
> > If there really is a requirement to restart the cluster on wilson1
> > every time I add a new node (and I have five or six more nodes to
> > add), that is too bad. Wilson1 is a 24x7 production system.
> >
> > --Carl Benson
> >
> > Sunil Mushran wrote:
> >> Could be that the cluster was already online on wilson1 when you
> >> propagated the cluster.conf to all nodes. If so, restart the cluster
> >> on that node.
> >>
> >> To add a node to an online cluster, you need to use the o2cb_ctl
> >> command. Details are in the 1.4 user's guide.
> >>
> >>
> >> Carl J. Benson wrote:
> >>
> >>> Hello.
> >>>
> >>> I have three systems that share an ocfs2 filesystem, and I'm
> >>> trying to add a fourth system.
> >>>
> >>> These are all openSUSE 11.1, x86_64, kernel 2.6.27.7-9-default.
> >>> All have RPMs ocfs2-tools-1.4.1-6.9 and ocfs2console-1.4.1-6.9
> >>>
> >>> cluster.conf looks like this:
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.170.116
> >>>        number = 0
> >>>        name = merlot1
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.158.54
> >>>        number = 1
> >>>        name = merlot2
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7777
> >>>        ip_address = 140.107.158.82
> >>>        number = 2
> >>>        name = wilson1
> >>>        cluster = ocfs2
> >>>
> >>> node:
> >>>        ip_port = 7778
> >>>        ip_address = 140.107.170.108
> >>>        number = 3
> >>>        name = gladstone
> >>>        cluster = ocfs2
> >>>
> >>> cluster:
> >>>        node_count = 4
> >>>        name = ocfs2
> >>>
> >>> gladstone is the new node.
> >>>
> >>> I edited the cluster.conf on wilson1 using ocfs2console, and
> >>> propagated it to the other systems from there.
> >>>
> >>> When I try to bring the cluster online with "/etc/init.d/o2cb
> >>> online ocfs2", merlot1 accepts the connection from gladstone, as
> >>> does merlot2. However, wilson1 rejects it as an unknown node! For
> >>> example:
> >>>
> >>> Jan 30 14:11:46 wilson1 kernel: (4447,3):o2net_accept_one:1795
> >>> attempt
> >>> to connect from unknown node at 140.107.170.108:37795
> >>>
> >>> Why would this happen?
> >>>
> >>>
> >>
> >>
> >>
> >
>
>
>




